Ruby is a pony. Everyone loves a pony. Ruby is nice.
Scala is a thoroughbred. You know I like Scala – it is beautiful, and runs circles around the pony.
D is a dragon. Very powerful, and somewhat unpredictable.
The programming languages Ruby, Python, R and Perl have proven to be very popular in bioinformatics. These languages are interpreted and dynamically typed computer languages. They are all great at parsing and handling genomic information. Results are quick to get, and the development cycle may be gratifying. However, as the language shootout shows, they are also rather slow, and hard to parallelize. It is not easy to get them to use those multi-cores everyone has.
Some newer languages, such as Scala and D, are not only statically typed, which has a real impact on performance, but are also very good at inferring types automatically. This means that coding in Scala or D feels similar to coding in a dynamically typed language. Also, Scala and D are OOP languages that marry the functional programming paradigm. In practice, that means we get OOP goodness (and badness), with constructs that make it safer and easier to parallelize code.
At this point bioinformaticians should sit up and prick up their ears: it is not much harder to program in Scala and D than in Ruby, Python, R and Perl. A little harder, yes, but the rewards can be great. Bioinformatics has entered the era of Big Data (Trelles and Prins, 2010), and we need parallelization and cores to get the work done. There is a lot of hype about the cloud, but with the current affordable large multi-core systems, a lot of programming and analysis can be handled on a single multi-core machine with largish memory. Scala and D allow fine-grained parallelization using high-level abstractions. These languages have immutable data and built-in high-performance message passing in the form of Actors, e.g. Akka, which makes it possible to use those cores.
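As a rough illustration of the actor idea, here is a minimal Python sketch. It is not Akka or D code, and the names `gc_actor`, `gc_content` and `STOP` are invented for this example. The worker owns its own state and only talks to the outside world through queues, and the messages are immutable strings. CPython threads share one interpreter lock, so this shows the programming model, not a multi-core speed-up.

```python
import queue
import threading

STOP = None  # hypothetical sentinel message that tells the actor to finish


def gc_actor(inbox, outbox):
    """Receive immutable sequence strings and reply with their GC fraction."""
    while True:
        seq = inbox.get()
        if seq is STOP:
            break
        gc = sum(1 for base in seq if base in "GC")
        # Assumes non-empty sequences; an empty read would divide by zero.
        outbox.put((seq, gc / len(seq)))


def gc_content(reads):
    """Send every read to the actor and collect one reply per read."""
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=gc_actor, args=(inbox, outbox))
    worker.start()
    for read in reads:
        inbox.put(read)
    inbox.put(STOP)
    worker.join()
    return dict(outbox.get() for _ in reads)
```

Because no data is shared or mutated between sender and worker, adding more workers needs no locks, which is the property that makes this style attractive on many cores.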
I wrote the BioScala project. I love the Scala language. Beautifully designed, it is a programmer’s dream, were it not for the JVM.
With big data, performance matters. And whatever the Java love boys claim, the JVM is often slower than directly compiled code. With C and D, carefully crafted code can run 4x, and sometimes 10x, faster than compiled JVM code, even with JIT compilation. That is a significant difference, and it comes from memory handling, object size and handling (so-called ‘death by object creation’), and pointer arithmetic (which D makes safe using slicing). In other words, it comes from low-level control that the JVM does not allow, and without which clever optimizations are out of reach. It means buying a 32-core machine instead of a 128-core machine. It means running your code on a 250-machine cluster (or Cloud) instead of on a 1000-machine cluster. It means waiting a month, instead of 4 months, for a calculation on the same setup. In short, run time performance matters with big data.
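The numbers above follow from simple division, sketched here in Python. The helper name `resources_needed` is made up for this example, and the calculation assumes perfectly linear scaling, which real workloads rarely achieve.

```python
def resources_needed(baseline, speedup):
    """Scale a baseline resource count (cores, machines, months) by a speed-up factor."""
    return baseline / speedup


SPEEDUP = 4  # the 4x figure quoted above; an idealized assumption

cores = resources_needed(128, SPEEDUP)      # 128-core machine with the slower code
machines = resources_needed(1000, SPEEDUP)  # 1000-machine cluster with the slower code
months = resources_needed(4, SPEEDUP)       # 4 months of run time with the slower code

print(cores, machines, months)  # 32.0 250.0 1.0
```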
D matters for bioinformatics, as it is a next generation computer language with the performance of C or C++. It comes with garbage collection – and you can still use malloc for fine tuning. Anyone using C or C++, and I know who you are, should reconsider. While programming in Java may feel to you like programming with your hands tied behind your back, C programming is a matter of repeatedly shooting yourself in the foot.
D is by far the nicer language, because it is much safer than C and C++ and protects you from making mistakes, while giving you almost the productivity of Ruby or Python.
Why use D over Ruby or Python? Well, I say, don’t stop using Ruby and Python any time soon. D complements them. Ruby, Python, R and Perl are great languages and can easily be bound against D code. Just like with C code, the functions are connected through a foreign function interface (FFI). Use each language to its full potential. D is for tight memory control, raw speed and parallelization. The others are there to churn out code in the simplest and quickest way. In time, however, you may find you’ll use D for more than time-critical code. For true software engineering a strong type system can be very beneficial. For more on bridging computer languages, check out my soon-to-be-published Springer book chapter on Sharing programming resources between Bio* projects through remote procedure call and native call stack strategies.
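To give a feel for what binding through an FFI looks like from the scripting side, here is a minimal Python sketch using the standard `ctypes` module. It calls `strlen` from the C standard library rather than a D function, because that library is already present on most systems. The wrapper name `sequence_length` is invented for this example, and the library lookup is assumed to work as it does on typical Linux and macOS installs.

```python
import ctypes
import ctypes.util

# Locate the C standard library; the lookup is platform dependent and
# is assumed to succeed here (it does on typical Linux and macOS systems).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def sequence_length(seq: bytes) -> int:
    """Call the native strlen on a byte string and return its length."""
    return libc.strlen(seq)
```

A D function exported with `extern(C)` linkage from a shared library could in principle be loaded the same way, although D code that relies on the D runtime would also need that runtime initialized first.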
Why use D over Scala? Simply because D’s performance is higher. Not only does D code run faster, it is also much easier to tweak at a low level. For a simple GFF3 file parser I, admittedly somewhat unexpectedly, managed to increase speed significantly by allocating often accessed data on the stack, rather than on the heap. Modern CPUs take advantage of that (stack allocation is nearly free, and stack data tends to stay in the CPU cache). It is something the JVM does not allow you to do. Another power feature of D is slicing, or safe pointer arithmetic. The fastest XML parser in the world is written in D. Look it up.
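The idea behind slicing can be approximated in Python with `memoryview`, shown in the sketch below. This is an analogy and not D code: a slice is a bounds-checked window onto existing memory, not a copy. The sample record is made up for this example.

```python
record = bytearray(b"chr1\tgene\t100\t200")
view = memoryview(record)

# Slicing a memoryview yields another view on the same buffer, not a copy.
seqid = view[0:4]
assert bytes(seqid) == b"chr1"

# Writing through the slice changes the original buffer.
seqid[3] = ord("2")
assert record.startswith(b"chr2")

# Indexing outside the slice raises an error instead of touching stray memory.
try:
    seqid[4]
    out_of_bounds_caught = False
except IndexError:
    out_of_bounds_caught = True
```

A parser built this way can hand out fields of a line without copying any bytes, which is where much of the speed comes from.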
Why is D a dragon? D’s language is amazingly powerful, but not as carefully designed as Scala’s. Scala’s design is as powerful, and simply beautiful. D feels more clunky and can get in the way sometimes. I find its functional language implementation less intuitive than that of Scala. Still, it works rather well, and it even has tail call optimization (unlike the JVM).
What clinched it for me is that, next to raw runtime speed, there are three areas where D beats Scala. First, the D compiler itself is blazingly fast. Second, the D template system (generics) is simpler and easier to understand. Even in my earlier blog examples I had trouble with Scala’s advanced templating, which is not a good sign. Third, D code generation, or compile-time evaluation, rocks. It is another thing to look into, and there are many examples in the D standard library. In the Beginning Scala book, David Pollak gives an example of a computer game that featured in _why the lucky stiff’s world (for Ruby insiders). What was enlightening to me was the code repetition in David’s book, necessary to build the players. That would not be necessary with D’s compile-time evaluation.
There are a few things I miss in D. The first is pattern matching on unpacking data, which is great in Haskell, Erlang, and Scala (see example). D has something like it for actors, so it may yet come to the main language. The second thing I miss is that language elements do not always return values. I use that in Ruby all over the place, because it makes for shorter code.
Finally, some things that keep cropping up when I bring up D. First, the licensing issues. D, for historical reasons, was closed source. That is changing now, with D2 compilers becoming part of Fedora, and Debian to follow. Second, the schism and negativity among D1 users caused by the move to D2. You’ll find that on the Internet. D2 is not compatible with D1, and that has caused grief. D2 was reinvented as the language designers developed their ideas. If you want to read more about the excellent D2 language I strongly recommend Andrei Alexandrescu’s book, The D Programming Language. It is a classic in its own right, describing a next generation programming language, and worth reading even if you never get to appreciate the power of the D language itself.