Wednesday, January 15, 2014

The $1000 Genome - A Computational Perspective

I haven't written a lot of domain-specific posts, but some recent news in genomics got me thinking about how high-performance and data-intensive computing is having to become an integral part of bioinformatics.  Getting meaningful insight out of a person's genome doesn't stop when the sequencer finishes generating data for a sample; the process of transforming that raw read data into an aligned genome invariably requires a significant amount of computation too.  In the past, this has generally been a task that could be satisfied on a beefy workstation and it was easy to overlook the computational implications when new sequencing technologies took the spotlight.  However, it now seems like we're rapidly approaching a point where high-end computational infrastructure needs to share the stage with emerging sequencing technology.

What follows are some thoughts and back-of-the-envelope cost considerations that arise from the computational requirements that high-throughput sequencers will be demanding.

Edit on Jan 17: For anyone who is really interested in hearing more about this topic, I will be presenting a 45-minute talk on the computational aspects of large-scale genomic analysis on Thursday, April 17 at 11:00 am PDT that will be broadcast for free on the web.  I will post more information on how to tune in as the date draws closer.

The $1000 Genome

Illumina made the big announcement yesterday that they will begin shipping a new high-throughput sequencer, dubbed the HiSeq X, towards the end of this quarter that can sequence human genomes at a cost of $1000 each.  This price point has been a long-sought milestone in gene sequencing, and these HiSeq X systems will be the first to market that can realistically reach this--the cost breakdown is $800 per genome for the necessary reagents, $137 for "hardware," and $60 for sample preparation.

What exactly falls under this $137 "hardware" cost is a little unclear since I'm not too familiar with sequencing in practice, but the precision of this dollar amount suggests that it is some sort of figure that resembles the cost of actually purchasing a HiSeq X and dividing it by the genomes it can produce during its lifetime.  This up-front capital cost is a million dollars per HiSeq X, and apparently Illumina is only selling them as "HiSeq X 10" systems (i.e., you must buy at least ten HiSeq X units at once) which puts the price tag at around $10 million.
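As a rough sanity check on that guess, here is a minimal Python sketch that works backwards from the $137 figure.  It assumes (this is not Illumina's stated method) that the hardware cost is simply the purchase price divided by lifetime output, and it uses the 18,000-genomes-per-year throughput quoted for a HiSeq X 10.  The function names are made up for illustration.

```python
# Back-of-the-envelope check on the $137 "hardware" cost per genome.
# Assumption: hardware cost per genome = purchase price / lifetime genome count.
SYSTEM_PRICE_USD = 10_000_000         # HiSeq X 10: ten units at about $1M each
HARDWARE_USD_PER_GENOME = 137         # from Illumina's cost breakdown
GENOMES_PER_YEAR = 18_000             # claimed HiSeq X 10 throughput


def implied_lifetime_genomes(price_usd, usd_per_genome):
    """Number of genomes needed to amortize the purchase price."""
    return price_usd / usd_per_genome


def implied_lifetime_years(price_usd, usd_per_genome, genomes_per_year):
    """Years of full-rate operation needed to amortize the purchase price."""
    return implied_lifetime_genomes(price_usd, usd_per_genome) / genomes_per_year


genomes = implied_lifetime_genomes(SYSTEM_PRICE_USD, HARDWARE_USD_PER_GENOME)
years = implied_lifetime_years(SYSTEM_PRICE_USD, HARDWARE_USD_PER_GENOME,
                               GENOMES_PER_YEAR)
print(f"implied lifetime output: {genomes:,.0f} genomes over {years:.1f} years")
```

Under those assumptions, the $137 figure implies roughly 73,000 genomes, or about four years of running flat out.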

While this cost breakdown sounds plausible, most of the discussion surrounding this new $1000-per-genome sequencer, including Illumina's own marketing, also references this popular figure from the National Human Genome Research Institute:


That is, the cost of sequencing has been falling faster than Moore's law would predict (and, by extension, the global capacity to produce sequenced genomes has been growing faster).  On the one hand, this sounds fantastic in the sense that the field of bioinformatics is developing faster than the famously fast-paced semiconductor industry, and the associated benefits to related fields like translational medicine are rapidly approaching.  On the other hand, though, sequencing is tied to Moore's law by the computations needed to actually turn sequencer output into useful information.

The Other Shoe Drops

As a computationalist, I find this announcement from Illumina and the prospect of a $1000 genome quite horrifying, because the end of that $1000 sequencing process is only the very beginning of the computational demands associated with genome sequencing.  Everyone in the trenches of human genomics knows that the field of bioinformatics is being flooded by the data coming out of next-generation sequencing machines, but until now, cyberinfrastructure has generally been able to satisfy the demands imposed by the throughput rates of state-of-the-art sequencers.

However, one of these Illumina HiSeq X 10 systems is allegedly capable of producing the raw data for up to 18,000 human genomes per year, which is about 340 genomes per week.  This throughput corresponds to about thirty to fifty terabytes of data being generated per week, or on the order of 200 terabytes* per month.  Obviously, providing storage for this quantity of data is a non-trivial challenge and requires some serious forward thinking at the design stages of a large-scale genomics study.
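The throughput arithmetic above can be sketched in a few lines of Python.  The 52 weeks/year and 4.33 weeks/month conversion factors and the decimal terabyte (1 TB = 1,000 GB) are assumptions of this sketch, and the per-genome data size is only what the quoted figures imply, not a measured value.

```python
# Convert the quoted HiSeq X 10 throughput into weekly and monthly figures.
GENOMES_PER_YEAR = 18_000
RAW_TB_PER_WEEK = (30, 50)    # low and high estimates of raw data volume
WEEKS_PER_YEAR = 52           # assumed conversion factor
WEEKS_PER_MONTH = 4.33        # assumed conversion factor
GB_PER_TB = 1_000             # assumed decimal units

genomes_per_week = GENOMES_PER_YEAR / WEEKS_PER_YEAR
tb_per_month = tuple(tb * WEEKS_PER_MONTH for tb in RAW_TB_PER_WEEK)
# Implied raw data per genome, derived only from the two figures above.
gb_per_genome = tuple(tb * GB_PER_TB / genomes_per_week for tb in RAW_TB_PER_WEEK)

print(f"genomes per week: {genomes_per_week:.0f}")
print(f"raw data per month: {tb_per_month[0]:.0f} to {tb_per_month[1]:.0f} TB")
print(f"implied raw data per genome: {gb_per_genome[0]:.0f} to {gb_per_genome[1]:.0f} GB")
```

The high end works out to a bit over 200 TB per month, which is where the "on the order of 200 terabytes" figure comes from.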

Storage is certainly not the only computational issue though; the read data generated by these sequencers isn't immediately useful in its raw form.  Rather, read mapping needs to be performed on each of these 340 genomes to assemble all of the reads physically measured by the sequencer into the sequence of base pairs that comprise the genome being analyzed.  After that, variant calling is typically performed to extract the variants in each genome, which can then be used to develop or test a model that relates genetic variants to observable health issues.

This is where the other hidden cost of the $1000 genome comes into play.

* Edit on Jan 30, 2014: A reader over at The Genetic Literacy Project pointed out that my initial statement that 50 TB/week × 4.33 weeks/month = 1 PB/month is not true. Indeed, I was mixing up my numbers when I initially posted this.

Processor Time

I recently helped out with an ambitious large-scale genomics study where 438 human genomes had to go through this very process, and as a frame of reference, each genome's read mapping consumed an average of 518 core hours for a highly optimized pipeline on optimized hardware.  Doing a little math, it follows that a HiSeq X 10 system's output will require about 175,000 CPU core hours per week to transform the raw sequencer outputs into actual aligned sequences.

At a standard rate of between $0.05 (academic) and $0.12 (industrial) per core hour, this is between $25 and $65 per genome, or an additional $8,800 to $21,000 per week to transform the raw output into aligned sequences.  For reference, going to a cloud provider like Amazon would cost a bit more per core-hour ($0.15/core/hr for C3.8xlarge) in addition to the costs of EBS, S3, and data movement.

The variant calling is a bit harder to ballpark since there are many different approaches to this; for this project in which I was involved, the variant calling cost an additional 200 core hours per genome; at the HiSeq X 10 rate of 340 genomes per week, this corresponds to an additional cost of $10 to $24 per genome or $3,400 to $8,160 per week.
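For anyone who wants to plug in their own rates, the processor-time estimates above reduce to the following Python sketch.  All of the inputs are the figures quoted in this section; the function names are just illustrative.

```python
# Compute-cost estimate for processing a HiSeq X 10's weekly output.
GENOMES_PER_WEEK = 340
CORE_HOURS_PER_GENOME = {"read mapping": 518, "variant calling": 200}
USD_PER_CORE_HOUR = {"academic": 0.05, "industrial": 0.12}


def weekly_core_hours(stage):
    """Core hours per week needed for one pipeline stage."""
    return CORE_HOURS_PER_GENOME[stage] * GENOMES_PER_WEEK


def cost_per_genome(stage, rate_name):
    """Dollar cost of one pipeline stage for a single genome."""
    return CORE_HOURS_PER_GENOME[stage] * USD_PER_CORE_HOUR[rate_name]


def cost_per_week(stage, rate_name):
    """Dollar cost of one pipeline stage for a week of sequencer output."""
    return cost_per_genome(stage, rate_name) * GENOMES_PER_WEEK


for stage in CORE_HOURS_PER_GENOME:
    print(f"{stage}: {weekly_core_hours(stage):,} core hours/week")
    for rate_name in USD_PER_CORE_HOUR:
        print(f"  {rate_name}: ${cost_per_genome(stage, rate_name):.2f}/genome, "
              f"${cost_per_week(stage, rate_name):,.0f}/week")

total_core_hours = sum(weekly_core_hours(s) for s in CORE_HOURS_PER_GENOME)
print(f"total: {total_core_hours:,} core hours/week")
```

Read mapping and variant calling together come to roughly 244,000 core hours per week before rounding.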

Storage Capacity

I also mentioned the volume of raw, compressed data being generated as being on the order of 30-50 TB per week.  It's similarly difficult to ballpark the costs associated with supporting this volume of data influx, as many of the costs are tied to policy decisions (how long do we keep data?  and what data do we keep?), but a good rate for purchasing high-performance parallel storage hardware is around $0.30 per gigabyte.

At this price, the cost of being able to store four weeks' worth of HiSeq X 10 output will run about $60,000 in capital.  When considering the cost of storing the mapped reads as well as the capability to store intermediate files during analysis, the actual storage capacity needed will probably increase by a factor of at least 3x.
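A minimal Python sketch of that storage estimate follows.  It assumes the high-end 50 TB/week figure, decimal units (1 TB = 1,000 GB), and treats the "at least 3x" overhead as a flat multiplier, which is a simplification.

```python
# Capital cost of parallel storage for four weeks of HiSeq X 10 output.
USD_PER_GB = 0.30          # quoted price for high-performance parallel storage
RAW_TB_PER_WEEK = 50       # high-end estimate of raw data volume
WEEKS_RETAINED = 4
OVERHEAD_FACTOR = 3        # mapped reads plus intermediate files (lower bound)
GB_PER_TB = 1_000          # assumed decimal units


def storage_capital_usd(capacity_tb, usd_per_gb=USD_PER_GB):
    """Up-front hardware cost for a given capacity in terabytes."""
    return capacity_tb * GB_PER_TB * usd_per_gb


raw_tb = RAW_TB_PER_WEEK * WEEKS_RETAINED
raw_cost = storage_capital_usd(raw_tb)
total_cost = storage_capital_usd(raw_tb * OVERHEAD_FACTOR)

print(f"raw data only: {raw_tb} TB, ${raw_cost:,.0f}")
print(f"with {OVERHEAD_FACTOR}x overhead: {raw_tb * OVERHEAD_FACTOR} TB, ${total_cost:,.0f}")
```

With the 3x factor applied, the $60,000 for raw data alone grows to about $180,000.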

Again, Amazon might sound like a nice option since you aren't footing this up-front capital and only pay for what you use, but at these scales, S3 pricing is wholly intractable--just storing these volumes of data in the cloud would cost over ten thousand dollars a month.  The actual act of moving that data around only adds to that.

Keeping Up and Delivering Capacity

Compared to the $800 cost per sample in reagents, this additional $35-$90 in combined compute time for actually transforming the raw reads into something vaguely meaningful isn't too bad.  However, there is one additional factor that is a bit more difficult to quantify: available capacity.

Compute Capacity

An operational fact is that delivering 240,000 core hours per week (read mapping plus variant calling) is not trivial--you would need about 1,450 cores burning away 24/7 to meet this.  So for your average 16-core nodes, you would need a dedicated 90-node cluster to keep up with the computational demands made by a HiSeq X 10.  Again, as a very general ballpark, the last cluster I personally procured ran about $40,000 for a subrack of eight general-purpose nodes, and this is about the same cost as purchasing a suitable dedicated condo node on a campus cluster.  Thus, keeping up with a $10 million HiSeq X 10 system may require a compute system costing on the order of $450,000 in capital up front if paying for hotel-style or cloud compute is not palatable.

Of course, this assumes ideal conditions including

  • hardware never fails (ha)
  • software never fails (ha ha)
  • hardware never needs preventative maintenance
  • data transfer from the sequencers to the compute nodes is instantaneous
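The cluster sizing above, and how quickly it grows once those ideal conditions are relaxed, can be sketched as follows.  The `utilization` parameter is a hypothetical knob introduced here for illustration; the 80% value is an arbitrary example, not a measured figure.

```python
import math

# Size a dedicated cluster to keep up with a HiSeq X 10.
CORE_HOURS_PER_WEEK = 240_000
HOURS_PER_WEEK = 24 * 7
CORES_PER_NODE = 16
USD_PER_NODE = 40_000 / 8    # $40,000 for a subrack of eight nodes


def cluster_size(core_hours_per_week, utilization=1.0):
    """Return (cores, nodes) needed to sustain the weekly demand.

    utilization is the fraction of wall-clock time the cores spend on
    useful work; 1.0 corresponds to the ideal conditions listed above.
    """
    cores = core_hours_per_week / (HOURS_PER_WEEK * utilization)
    nodes = math.ceil(cores / CORES_PER_NODE)
    return cores, nodes


cores, nodes = cluster_size(CORE_HOURS_PER_WEEK)
capital = nodes * USD_PER_NODE
print(f"ideal: {cores:,.0f} cores, {nodes} nodes, ${capital:,.0f}")

# Hypothetical example: only 80% of core time is productive.
cores_80, nodes_80 = cluster_size(CORE_HOURS_PER_WEEK, utilization=0.8)
print(f"80% utilization: {cores_80:,.0f} cores, {nodes_80} nodes, "
      f"${nodes_80 * USD_PER_NODE:,.0f}")
```

Even a modest drop in effective utilization adds a couple dozen nodes to the bill.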

Storage Capacity

In addition to computing capacity, there is the problem of storage capacity--as I indicated earlier, when generating 30-50TB of raw data per week, the sky is the limit in terms of how much capacity a given project will need.  In addition, there is often significant overhead involved during the read mapping and variant calling stages, resulting in the need for a non-trivial safety margin of storage capacity on top of the capacity required to simply store input and output data.

A Real Case Study of Capacity

To actually back up these ballpark estimates, I think it's meaningful to look at the computational demands of a real sequencing study that involved volumes of genomes and data on the same order as a HiSeq X 10 is capable of producing in a week.  As I alluded to above, I recently helped out with a project in which 438 human genomes were delivered in raw format straight from a sequencer, and they had to undergo read mapping and variant calling before the actual science could take place.

As one may expect, real projects tend to have bursty (rather than sustained) demands, and doing read mapping and variant calling for this 438-genome project had an overall activity profile that looked like this:


At its peak, the read mapping stage was using over 5,000 cores concurrently (green blocks on September 30).  Similarly with storage, the peak capacity required for intermediate files was 250 TB (blue line on September 14) and the overall peak storage was 330 TB (230 TB intermediate, 50 TB raw input, 50 TB aligned output; red line on September 30).  Going back to the cost of purchasing parallel storage I mentioned earlier, this project used about $100,000 worth of storage hardware capacity at this peak.
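The peak storage figure and its hardware cost can be checked with a short Python sketch.  It reuses the $0.30/GB rate quoted earlier and assumes decimal units; the per-genome average is simply the peak divided by 438 and is only a rough indicator.

```python
# Peak storage footprint of the 438-genome project (September 30 peak).
PEAK_TB = {"intermediate": 230, "raw input": 50, "aligned output": 50}
USD_PER_GB = 0.30    # parallel storage price quoted earlier
GB_PER_TB = 1_000    # assumed decimal units
GENOMES = 438

total_tb = sum(PEAK_TB.values())
peak_cost = total_tb * GB_PER_TB * USD_PER_GB
tb_per_genome = total_tb / GENOMES

print(f"peak storage: {total_tb} TB, about ${peak_cost:,.0f} of hardware")
print(f"average peak footprint: {tb_per_genome:.2f} TB per genome")
```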

I would anticipate these peaks only growing as the burstiness decreases and the timeline contracts, so being able to efficiently handle uneven loading and deliver burst capacity is a realistically important capability.

One last unquantifiable aspect of this project was the software pipelines' very diverse computational demands--all of the following resources were required for this study in at least one stage of the analysis:

  • very fast serial performance
  • high memory/core ratio (> 4.0 GB/core)
  • high IOPS with large capacity (> 3,500 IOPS sustained)
  • large, online, high-performance storage capacity (> 1GB/s sustained read)
  • capability for wide (16-way) thread-level parallelism

These data-intensive architectural features are not present on vanilla compute clusters, and although EC2 does provide instances that allow you to support most of these requirements, the costs for the actual data-intensive processing are significantly higher due to the need for EBS.  Thus, an optimal compute system to match the volume of data being produced by high-throughput sequencers does require some nontrivial system architecture design, which adds, either directly or indirectly, to the overall cost of deploying such a system.  As a very general ballpark, both SDSC's Gordon and Trestles systems are equipped with SSDs and large memory on each node, and the cost per node was very roughly between $8,000 and $10,000.

The Scary Truth

Illumina has been pitching the HiSeq X 10 as being at the cusp of a new era of sequencing, and this is undeniably true.  However, rather than enabling transformational new studies as one might envision based on the sequencers alone, I'm nervous that the "new era" will really be one in which the net rate of discovery is finally not limited by sequencing technology, but by available computing capacity.  Anyone who wants to generate data at the full capability of the HiSeq X 10 system and truly get $1000 genomes out of their $10 million investment will be hitting walls in computations that are equally unparalleled.

This 438-genome study to which I keep referring was ambitious for a two-month project, so I have a difficult time imagining what it would take to keep up with what a HiSeq X 10 can deliver in a week.  Gordon, which was a $10 million data-intensive capital investment, had the muscle to handle a 438-genome project alongside its regular workload.  However, should we expect this "new era" of genomics to now require a miniature Gordon (complete with technical consultants like me!) for every array of HiSeq X units?  And if so, are these really still $1000 genomes?

These questions remain truly unanswered, but I'm far from the only one asking them.  I'd be interested in hearing what others, especially those who have been supporting bioinformatics workloads for a while, think.