Skip to content

Why pfff will replace md5 for big data

Probablistic fast file fingerprinting (pfff) is a tool by Konstantin Tretyakov meant for file comparison of large files. Currently MD5 is used, which requires reading the full origin and target files.  Reading full files can become very slow when dealing with large files and large repositories, or when the transport layer is very slow, such as with a network stack, or USB.

Pfff makes use of the variation within files to quickly asses whether files are the same, or whether they are different. Pfff does this by using samples of the file, the finger prints, and calculating a unique identifier based on these samples.  This works when large files are highly variable. Typical use cases are video, audio, photo, geograpical, meteorological, and biological data. With biology, so called ‘big data’ is gathered by biologists, through genomics sequencing, transcriptomics, proteomics etc.

We are using pfff to for copying files across networks, and for Cloud and GRID computing. We are using it to find duplicate files in repositories. We are using it to see when files get truncated during copy.  I even use it to copy files to my GPS over USB, a tediously slow process, to check whether the copy has succeeded, and to find duplicates in my music collection.

The chance of a collision (finding the same pfff identifier with two different files) is in the same order as with MD5. But where MD5 is slow, pfff is fast, particularly with large files, and/or a slow transport layer, but also on flash SSD drives(!), because pfff gets faster when seek times are low.

The only time you may not want to use pfff is when validating the full correctness of a file (as in the case of a backup, or when avoiding malware). In that case there is no solution but to read the full file.

In practise, with modern hardware and software techniques, it is very unlikely for a single byte (a mutation) to change in a file, without other side-effects (hardware errors, parser errors). What is likely, however, is truncation, incomplete transfers, and/or partial deletion of files. Pfff will always find these, because finger prints are guaranteed to change when data shifts.

pfff is free software, and written in C++ for Linux, OS X and Windows. It can be downloaded and compiled from [github](  Soon we expect pfff to be available in the major software distributions.


JRuby for bioinformatics

Today I attended an interesting talk by Charles Nutter of JRuby fame at FOSDEM 2012. The short of it is that much of the work on JRuby is on performance and low level access to C libraries (through FFI). This is excellent news for the BioRuby community. The current version of JRuby is already the only Ruby with full multi-threading support – so using those many cores is feasible (e.g. using Akka) And the execution time is on par with the other Rubies. The next version of JRuby (1.7) will be 2-10x faster than the current version(!). The speed achievement is by virtue of better feeding of information to the JVM’s JIT compiler. The performance increase not only has major implications for large commercial users of Ruby, but also to us (big) data crunchers.

The other good news is almost transparent access to C libraries using the FFI. Charles called it a paradox (i.e. many of their goals were considered impossible on the JVM), I call it subverting the JVM. JRuby is taking the JVM places it was never meant to go. Pragmatically it means low level access to disk and network IO.

In all, for us it means we get way better performance, and we can use the tools and bindings we have now. JRuby is really a drop-in replacement for Ruby 1.9 (though still some late language additions may be missing). The main downside to the JRuby is JVM startup time. But with big data that is hardly an issue.

Way to go JRuby team! I expect to see a lot more JRuby for bioinformatics

D is a dragon, or why D matters for Bioinformatics

Ruby is a pony. Everyone loves a pony. Ruby is nice.

Scala is a thoroughbred. You know I like Scala – it is beautiful, and runs circles around the pony

D is a dragon. Very powerful, and somewhat unpredictable

The programming languages Ruby, Python, R and Perl have proven to be very popular in bioinformatics. These languages are interpreted and dynamically typed computer languages. They are all great at parsing and handling genomic information. Results are quick to get, and the development cycle may be gratifying. However, as the language shootout shows, they are also rather slow, and hard to parallelize. It is not easy to get them to use those multi-cores everyone has.

Some newer languages, such as Scala and D, are not only strongly typed, which has a real impact on performance, but are also very good at automatically handling types. This means that coding Scala or D, feels similar to coding dynamically typed languages. Also, Scala and D are OOP languages that marry the functional programming paradigm. In practise, that means that we get OOP goodness (and badness), with constructs that make it safer and easier to parallelize code.

At this point bioinformaticians should sit up and prick up their ears: it is not much harder to program in Scala and D than in Ruby, Python, R and Perl. A little harder, yes, but the rewards can be great. Bioinformatics has entered the era of Big Data (Trelles and Prins, 2010), and we need parallelization and cores to get the work done. There is a lot of hype about the cloud, but with the current affordable large multi-core systems, a lot of programming and analysis can be handled on single largish memory multi-core machines. Scala and D allow fine-grained parallelization using high-level abstractions, these languages have immutable data and built-in high performance message passing it the form of Actors, e.g. Akka, which makes it possible to use those cores.

I wrote the BioScala project. I love the Scala language. Beautifully designed, it is a programmers dream, were it not for the JVM.

With big data performance matters. And whatever the Java love boys claim, the JVM is often slower than directly compiled code. With C and D, carefully crafted code can run 4x, and sometimes 10x, faster than compiled JVM code, even with JIT compilation. That is a significant difference that is due to memory handling, object size and handling (so called ‘death by object creation’), and pointer arithmetic (which D makes safe using slicing), i.e. low-level control that the JVM does not allow, and which prevents clever optimizations. It means buying a 32 core machine instead of a 128 core machine. It means running your code on a 250 machine cluster (or Cloud) instead of on a 1000 machine cluster. It means waiting a month, instead of 4 months for a calculation on the same setup. In short, run time performance matters with big data.

D matters for bioinformatics, as it is a next generation computer language with the performances of C or C++. It comes with garbage collection – and you can still use malloc for fine tuning. Anyone using C or C++, and I know who you are, should reconsider. While programming in JAVA may feel to you like programming with the hands behind the back, C programming is a matter of repeatedly shooting yourself in the foot.

D is by far the nicer language, because it is much safer than C and C++, protects you from making mistakes, while it giving you almost the productivity of Ruby or Python.

Why use D over Ruby or Python? Well, I say, don’t stop using Ruby and Python any time soon. D complements them. Ruby, Python, R and Perl are great languages and can easily be bound against D code. Just like with C code, the functions are connected through a foreign function interface (FFI). Use each language to its full potential. D is for tight memory control, raw speed and parallelization. The others are there to churn out code in the simplest and quickest way. In time, however, you may find you’ll use D for more than time critical code. For true software engineering a strong type system can be very beneficial. For more on bridging computer languages, check out my soon to be published Springer book chapter on Sharing programming resources between Bio* projects through remote procedure call and native call stack strategies.

Why use D over Scala? Simply because D’s performance is higher. Not only is it faster running code, it is much easier to get low level tweaked code. For a simple GFF3 file parser I, admittedly somewhat unexpectedly, managed to increase speed significantly by allocating often accessed data on the stack, rather than on the heap. Modern CPUs take advantage of that (stack memory is closer to the processor). It is something the JVM does not allow you to do. Another power feature of D is slicing, or safe pointer arithmetic. The fastest XML parser in the world is written in D. Look it up.

Why is D a dragon? D’s language is amazingly powerful, but not as carefully designed as Scala’s. Scala’s design is as powerful, and simply beautiful. D feels more clunky and can get in the way sometimes. I find its functional language implementation less intuitive than that of Scala. Still, it works rather well, and it even has tail end recursion optimization (unlike the JVM). What clinched it for me is that, next to raw runtime speed, there are three areas D beats Scala. First, the D compiler itself is blazingly fast. Second, the D template system (generics) is simpler and easier to understand. Even in my earlier blog examples have trouble with Scala’s advanced templating, which is not a good sign. Third, D code generation, or compile time evaluation, rocks. Another thing to look into, there are many examples in the D standard library. In the Beginning Scala book by David Pollack he gives an example of a computer game that featured in _Why the lucky stiff‘s world (for Ruby insiders). What was enlightening to me was the code repetition in David’s book, necessary to build the players. That would not be necessary in D’s compile time evaluation. There are a few things I miss in D. For example pattern recognition on unpacking data, which is great in Haskell, Erlang, and Scala (see example). D has something for actors, so it may come to the main language. The second thing I miss is that language elements do not always return values. I use that in Ruby all over the place, because it makes for shorter code.

Finally some things that keep cropping up when I bring up D. First, the licensing issues. D, for historical reasons was closed source. That is changing now, with D2 compilers getting part of Fedora and Debian to follow. Second the schism and negativism of D1 users caused by an the move to D2. That you’ll find on the Internet. D2 is not compatible with D1, and that has caused grief. D2 was reinvented as the language designers progressed their ideas. If you want to read more about the excellent D2 language I strongly recommend Andrei’s book. It is a classic in its own right, describing a next generation programming language. Even if you never get to appreciate the power of the D language itself.

Migrating to Ruby 1.9, the good, the bad and the ugly

Ruby had a major overhaul with the introduction of the 1.9 series. Not least because the full interpreter was replaced. In addition, the language was changed with some incompatibilities between 1.8 and 1.9.

Lately I have been moving bioinformatics Ruby source code from 1.8 to 1.9. A minor version bump involving, rather too many, major changes. Small step for man etc. etc. Here I wish to share my strategy in getting the migration job done.

The good

There are two major gotchas, which have bitten me all over the place. The first has to do with lambda and Proc. The second with Array.to_s. But first, run Ruby 1.9 with the -w switch(!) That will catch a number of incompatibilities, such as ‘when’ without ‘then’, and variable scope issues. It works rather well, and comes with other useful warnings.

The bad

Array.to_s behaves differently in 1.8 and 1.9. In 1.8 is behaves like Array.join. In 1.9 it behaves like the output of ‘p’.

  1. a = [ "1", "2", "3" ]
  2. print a
  3. 123 # Ruby 1.8
  4. [ "1", "2", "3" ] # Ruby 1.9

this looks like a minor thing, but it is amazing what it has broken in my code. I used Array.to_s a lot which is not surprising for bioinformatics processing. Worse, it is rather hard to find these bugs. Even with unit testing in place (you do have unit testing, I hope).

The strategy that works is to replace all ‘to_s’ calls with ‘to_string’. And write to_string methods for the objects the interpreter complains about (hopefully you have full coverage),usually returning Array.join.

The ugly

The ‘proc’ method returns a in 1.8, and a lambda in 1.9. Now this is really bad. The problem is that ‘return’ behaves differently, between the two. A returns from the surrounding method, a lambda returns from its own. And there is no indication from the interpreter what goes wrong. I think this is a rather questionable decision by the Ruby authors. It breaks stuff badly, for an all too purist reason – proc now returns a similar named object. I think they should have obsoleted the Kernel.proc method, rather that this failing silently.

Anyway, throughout your sources, replace ‘proc’ with ‘lambda’.

The upside

The upside is that when Ruby libraries/plugin support both 1.8 and 1.9, at least now you know they are supported by their authors. It acts a s filter for weeding out unsupported code. But that could not have been the intention…

Big data in biology – dreaming about Cloud

Big data analysis in biology requires thinking beyond the PC, and bioinformaticians are looking for ways to scale up their calculations. Some biologists bet their future on Cloud computing, fed by Googles example, and (perhaps) industry interests. Unfortunately, Cloud computing does not escape the physical realities of any other type of computing. With Cloud there are major bottle necks, particularly, in both networking and disk IO, i.e. the physical constraints are no different from any ‘normal’ computing cluster. In fact, the nature of Cloud computing is closer to Grid computing, because of distances, with a additional handicap of cheap hardware, and latencies introduced by the virtualization layer. Moving data from central storage (S3) to the cluster (EC2) is slow (15 Mbs for a single node, parallel access of multiple nodes can increase this figure, but it quickly flattens out at 50-100 MBs because of S3). Moving data from the node’s hard disk to RAM is slow (disk hardware, VM transport, VM contention with other users on the same hardware), moving data between nodes is slow (network hardware, VM transport, VM contention).

In this figure we show what it takes to push either half a tera-byte, or  a peta-byte through an EC2 compute node. One peta-byte takes in the order of two years. I think that figure is optimistic.

A peta-byte may look a lot to push through one node, but for complex heterogeneous big data in biology, this is not that far off. One human genome takes 3.5 Gb. 1,000 humans takes 3.5 TB. mRNA data is many times the genome size, we will be sequencing the mRNA of many tissues of millions of individuals. It may be possible to be smart about analysis, and distribute calculations Hadoop style. But, even so, the jury is still out. Hint: try genome assembly of one individual’s NGS DNA dataset through Cloud computing.

Cloud is cool. But it is not the answer to all our dreams. I don’t see all labs converting to Cloud computing any time soon. The sheer size of the data, and the physical reality of computing when random access is required, means we still end up with stacks of hardware.

The BioScala project

BioScala is coming into its own on github, including a tutorial.

Bioinformatics example in Scala (part 6)

The sequence splitter function takes a gapped sequence and splits it into a list of gapped and non-gapped sections. So “agc—tcg-t” becomes (“agc”,”—”,”tcg”,”-”,”t”). We use this to discuss various ways of approaching the problem in Scala. See also the first post.

The following version of the sequence splitter function adds compile type checking properly, using generics.

The following succeeds:

new Splitter7[Nucleotide].splitSimplePass7(List(A,G,GapN,GapN,C,T,GapN,T)) should equal (List(List(A,G),List(GapN,GapN),List(C,T),List(GapN),List(T)))

because GapN is a Nucleotide and we wrote def isGapType[Q <: T](c : Q), which states that the type Q can only be passed in, if it is derived from T – which is Nucleotide in Splitter7[Nucleotide]

The following fails to compile:

new Splitter7[Nucleotide].splitSimplePass7(List(A,G,GapN,Gap,C,T,GapN,T))

with the error

type mismatch;
[error] found : GapSplitting7Spec.this.Gap.type (with underlying type object GapSplitting7Spec.this.Gap)
[error] required: GapSplitting7Spec.this.Nucleotide

The compile time checker basically does not allow Gap because it is not a Nucleotide. Note that the case comparison inside the function allows for different types. This is the key to runtime checking (and transformation) of types.

Here we start hitting the real difference with interpreted weakly typed languages, like R, Perl, Python and Ruby. Scala allows you to strictly define the types that go in, and come out, of functions. This is done at compile time, so mismatches get caught early.

This is really benefitial when refactoring code – dynamic languages tend to ignore type problems which ought to be caught. You don’t have to add code for type checking and, in many cases, you can save on unit tests for type checking.

Apart from catching errors, strict type checking allows for easy compiler optimizations resulting in great runtime performance. This is why Scala (and Java) perform so well in benchmarks (even without taking advantage of parallellization).

We will have a performance comparison in a later Blog.

Scala command line parsing

I added a comment on Stackoverflow on using Scala’s powerful pattern matching for parsing of command line options. It is trivial to write such a parser. See the code snippet.

Bioinformatics example in Scala (part 5)

The sequence splitter function takes a gapped sequence and splits it into a list of gapped and non-gapped sections. So “agc—tcg-t” becomes (“agc”,”—”,”tcg”,”-”,”t”). We use this to discuss various ways of approaching the problem in Scala. See also the first post.

The fifth version of the sequence splitter function adds compile type checking. Scala has generics, which are similar to C++ templates and Java generics. When you design a class, like we did, for Char or Symbol we can make it more generic by making the type a ‘variable’. So a class defined as:

class MyClass[T](seq : List[T])

works for multiple types. We can use it with Char

val x = MyClass[Char]

or Symbol

val x = MyClass[Symbol]

Scala’s type system is very powerful. Here we add a lower bound type definition

def splitSimplePass6[T >: Symbol](seq: List[T]): List[List[T]]

where T is defined as the lower bound for Symbol. To me that reads that T is a subtype of Symbol (T is greater than Symbol). However, this version does not work as expected – because it is the other way round. T is a super type of Symbol!




But we expect


to fail on AnyGap. It does not. In fact it returns


First lesson on type constraints, T >: Symbol means that T has to be a super type of Symbol. AnyGap, apparently passes, as AnyGap is an Object, which is s super type of Symbol.

Note also

case Gap => true

so called 'pattern matching'. Gap is a 'case object', so we can pattern match on it. Pattern matching is found in functional langauges and is very powerful. Pattern matching allows for writing code that is easy to understand.

Bioinformatics example in Scala (part 4)

The sequence splitter function takes a gapped sequence and splits it into a list of gapped and non-gapped sections. So “agc—tcg-t” becomes (“agc”,”—”,”tcg”,”-”,”t”). We use this to discuss various ways of approaching the problem in Scala. See also the first post.

The fourth version of the sequence splitter function shows why a global gap value may fail. It fails when we define multiple gap types. Calling


results in


where Gap is not split out properly, with the code

Typically with OOP-type languages you would create a solution that checks for the object type, or create a ‘status’ method. In the next example we add a query method isGap(), which is part of every Symbol and Nucleotide. This is a typical OOP solution:

It does the job, the sequence is split properly into


still, you ought to start wondering why every object should ‘know’ it is a gap. This is especially for non-gap objects – why should a Nucleotide contain gap information?

There are other ways to find out whether an object is a gap – you can check on the object name, or use reflection, or create a list of matching objects. None of these are particularly good when it comes to compile time checking – these are all runtime solutions! In the next install of this series we will see if we can use Scala’s type system for this purpose.