Wednesday, January 15, 2014

The $1000 Genome - A Computational Perspective

I haven't written a lot of domain-specific posts, but some recent news in genomics got me thinking about how high-performance and data-intensive computing is having to become an integral part of bioinformatics.  Getting meaningful insight out of a person's genome doesn't stop when the sequencer finishes generating data for a sample; the process of transforming that raw read data into an aligned genome invariably requires a significant amount of computation too.  In the past, this has generally been a task that could be satisfied on a beefy workstation and it was easy to overlook the computational implications when new sequencing technologies took the spotlight.  However, it now seems like we're rapidly approaching a point where high-end computational infrastructure needs to share the stage with emerging sequencing technology.

What follows are some thoughts and back-of-the-envelope cost considerations that arise from the computational requirements that high-throughput sequencers will be demanding.

Edit on Jan 17: For anyone who is really interested in hearing more about this topic, I will be presenting a 45-minute talk on the computational aspects of large-scale genomic analysis on Thursday, April 17 at 11:00 am PDT that will be broadcast for free on the web.  I will post more information on how to tune in as the date draws closer.

The $1000 Genome

Illumina made the big announcement yesterday that, towards the end of this quarter, they will begin shipping a new high-throughput sequencer, dubbed the HiSeq X, that can sequence human genomes at a cost of $1000 each.  This price point has been a long-sought milestone in gene sequencing, and these HiSeq X systems will be the first to market that can realistically reach it--the cost breakdown is $800 per genome for the necessary reagents, $137 for "hardware," and $60 for sample preparation.

What exactly falls under this $137 "hardware" cost is a little unclear since I'm not too familiar with sequencing in practice, but the precision of the dollar amount suggests it is something like the capital cost of a HiSeq X divided by the number of genomes it can produce over its lifetime.  That up-front capital cost is a million dollars per HiSeq X, and apparently Illumina is only selling them as "HiSeq X 10" systems (i.e., you must buy at least ten HiSeq X units at once), which puts the price tag at around $10 million.
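
As a sanity check, a little amortization arithmetic lands very close to that figure.  Here is a minimal sketch; the four-year depreciation window is my assumption for illustration, not a number Illumina has published:

    # Back-of-the-envelope amortization of the HiSeq X 10 capital cost.
    # The four-year depreciation window is assumed, not published by Illumina.
    capital_cost = 10_000_000      # dollars for a full HiSeq X 10 (ten units)
    genomes_per_year = 18_000      # claimed throughput of a full HiSeq X 10
    depreciation_years = 4         # assumption

    lifetime_genomes = genomes_per_year * depreciation_years    # 72,000 genomes
    hardware_per_genome = capital_cost / lifetime_genomes       # ~$139/genome

    print(f"{lifetime_genomes:,} genomes over {depreciation_years} years "
          f"-> ${hardware_per_genome:.0f} of hardware per genome")

That works out to roughly $139 per genome, which is suspiciously close to the quoted $137.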

While this cost breakdown sounds plausible, most of the discussion surrounding this new $1000-per-genome sequencer, including Illumina's own marketing, also references this popular figure from the National Human Genome Research Institute:

[Figure: NHGRI cost-per-genome chart, plotted against Moore's law]

That is, the cost of sequencing (and, by extension, the global capacity to produce sequenced genomes) has been outpacing Moore's law.  On the one hand, this sounds fantastic in the sense that the field of bioinformatics is developing faster than the famously fast-paced semiconductor industry, and the associated benefits to related fields like translational medicine are rapidly approaching.  On the other hand, though, sequencing is tied to Moore's law by the computations needed to actually turn sequencer output into useful information.

The Other Shoe Drops

As a computationalist, I find this announcement from Illumina and the prospect of a $1000 genome actually quite horrifying, because the end of that $1000 sequencing process is only the very beginning of the computational demands associated with genome sequencing.  Everyone in the trenches of human genomics knows that the field of bioinformatics is being flooded by the data coming out of next-generation sequencing machines, but until now, cyberinfrastructure has generally been able to satisfy the demands imposed by the throughput rates of state-of-the-art sequencers.

However, one of these Illumina HiSeq X 10 systems is allegedly capable of producing the raw data for up to 18,000 human genomes per year, which is about 340 genomes per week.  This throughput corresponds to about thirty to fifty terabytes of data being generated per week, or on the order of 200 terabytes* per month.  Obviously, providing storage for this quantity of data is a non-trivial challenge and requires some serious forward thinking at the design stages of a large-scale genomics study.
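
The arithmetic behind those figures, as a quick sketch (using the 30-50 TB per week figure above and a 4.33-week month):

    # Weekly and monthly volumes implied by Illumina's claimed throughput.
    genomes_per_year = 18_000
    genomes_per_week = genomes_per_year / 52        # ~346, call it ~340

    raw_tb_per_week = (30, 50)                      # observed range of raw data
    weeks_per_month = 4.33

    print(f"~{genomes_per_week:.0f} genomes per week")
    print(f"~{raw_tb_per_week[0] * weeks_per_month:.0f} to "
          f"{raw_tb_per_week[1] * weeks_per_month:.0f} TB of raw data per month")

That is how the "on the order of 200 terabytes per month" figure falls out.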

Storage is certainly not the only computational issue though; the raw read data being generated by these sequencers isn't even immediately useful in this raw form.  Rather, read mapping needs to be performed on each of these 340 genomes to actually assemble all of the reads physically measured by the sequencer into the sequence of base pairs that actually comprise the genome being analyzed.  After that, variant calling is typically performed to extract the variants in each genome that can then be used to develop or test a model that relates genetic variants with observable health issues.

This is where the other hidden cost of the $1000 genome comes into play.

* Edit on Jan 30, 2014: A reader over at The Genetic Literacy Project pointed out that my initial claim that 50 TB/week × 4.33 weeks/month works out to 1 PB/month was incorrect.  Indeed, I mixed up my numbers when I first posted this.

Processor Time

I recently helped out with an ambitious large-scale genomics study where 438 human genomes had to go through this very process, and as a frame of reference, each genome's read mapping consumed an average of 518 core hours for a highly optimized pipeline on optimized hardware.  Doing a little math, it follows that a HiSeq X 10 system's output will require about 175,000 CPU core hours per week to transform the raw sequencer outputs into actual aligned sequences.

At a standard rate of between $0.05 (academic) and $0.12 (industrial) per core-hour, this is between $25 and $65 per genome, or an additional $8,800 to $21,000 per week to transform the raw output into aligned sequences.  For reference, going to a cloud provider like Amazon would cost a bit more per core-hour ($0.15 per core-hour for c3.8xlarge) in addition to the costs of EBS, S3, and data movement.

The variant calling is a bit harder to ballpark since there are many different approaches; for the project in which I was involved, it cost an additional 200 core hours per genome.  At the HiSeq X 10 rate of 340 genomes per week, that corresponds to an additional $10 to $24 per genome, or $3,400 to $8,160 per week.
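
Putting the read mapping and variant calling together, the per-genome and per-week compute bill works out as follows.  This is just the arithmetic on the figures above, expressed as a sketch:

    # Compute cost of read mapping + variant calling at HiSeq X 10 throughput.
    genomes_per_week = 340
    mapping_core_hours = 518       # measured average per genome (read mapping)
    calling_core_hours = 200       # measured average per genome (variant calling)
    rates = {"academic": 0.05, "industrial": 0.12}   # dollars per core-hour

    core_hours_per_genome = mapping_core_hours + calling_core_hours    # 718
    core_hours_per_week = genomes_per_week * core_hours_per_genome     # ~244,000

    print(f"{core_hours_per_week:,} core hours per week")
    for label, rate in rates.items():
        print(f"{label}: ${core_hours_per_genome * rate:.0f}/genome, "
              f"${core_hours_per_week * rate:,.0f}/week")

That is, roughly 240,000 core hours per week, costing somewhere between $36 and $86 per genome depending on who is paying for the cycles.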

Storage Capacity

I also mentioned that the volume of raw, compressed data being generated is on the order of 30-50 TB per week.  It's similarly difficult to ballpark the costs associated with supporting this rate of data influx, as much of the cost is tied to policy decisions (how long do we keep data, and which data do we keep?), but a good rate for purchasing high-performance parallel storage hardware is around $0.30 per gigabyte.

At this price, being able to store four weeks' worth of HiSeq X 10 output will run about $60,000 in capital.  When factoring in the cost of storing the mapped reads as well as the capacity to hold intermediate files during analysis, the actual storage needed will probably increase by a factor of at least 3x.

Again, Amazon might sound like a nice option since you aren't footing this up-front capital and only pay for what you use, but at these scales, S3 pricing quickly becomes untenable--just storing these volumes of data in the cloud would cost over ten thousand dollars a month, and actually moving that data around only adds to the bill.
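
Here are the two options side by side as a sketch.  The $0.30/GB purchase price is the figure quoted above; the S3 rate is my assumption of roughly $0.09 per GB per month, which is in the neighborhood of Amazon's published standard-tier pricing, so take it as illustrative only:

    # On-premise purchase vs. S3 rental for ~4 weeks of raw HiSeq X 10 output.
    raw_tb_per_week = 50           # upper end of the observed range
    weeks_retained = 4
    purchase_per_gb = 0.30         # parallel storage hardware, quoted above
    s3_per_gb_month = 0.09         # assumed S3 standard rate, circa 2014

    raw_gb = raw_tb_per_week * weeks_retained * 1000      # ~200 TB
    print(f"purchase: ${raw_gb * purchase_per_gb:,.0f} one-time")
    print(f"S3:       ${raw_gb * s3_per_gb_month:,.0f} per month, before data movement")

Under these assumptions the one-time purchase pays for itself within a few months of S3 charges, and that is before any of the transfer or EBS costs the analysis itself would incur.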

Keeping Up and Delivering Capacity

Compared to the $800 cost per sample in reagents, this additional $35-$90 in compute time for actually transforming the raw reads into something vaguely meaningful isn't too bad.  However, there is one additional factor that is a bit more difficult to quantify: available capacity.

Compute Capacity

The operational reality is that delivering 240,000 core hours per week is not trivial--you would need about 1,450 cores burning away 24/7 to keep up.  So for your average 16-core nodes, you would need a dedicated 90-node cluster to keep pace with the computational demands made by a HiSeq X 10.  As a very general ballpark, the last cluster I personally procured ran about $40,000 for a subrack of eight general-purpose nodes, or roughly $5,000 per node--about the same as purchasing a suitable dedicated condo node on a campus cluster.  Thus, keeping up with a $10 million HiSeq X 10 system may require a compute system costing on the order of $450,000 in up-front capital if paying for hotel-style or cloud compute is not palatable.
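
The sizing arithmetic behind those numbers, as a sketch:

    import math

    # How many cores and nodes it takes to sustain 240,000 core hours per week.
    core_hours_per_week = 240_000  # rounded; the unrounded 244,120 gives ~1,450 cores
    hours_per_week = 24 * 7
    cores_per_node = 16
    cost_per_node = 40_000 / 8     # $40k subrack of eight general-purpose nodes

    cores_needed = core_hours_per_week / hours_per_week        # ~1,430 cores
    nodes_needed = math.ceil(cores_needed / cores_per_node)    # round up: ~90 nodes

    print(f"{cores_needed:,.0f} cores running 24/7 -> {nodes_needed} nodes "
          f"-> ~${nodes_needed * cost_per_node:,.0f} in capital")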

Of course, this assumes ideal conditions including

  • hardware never fails (ha)
  • software never fails (ha ha)
  • hardware never needs preventative maintenance
  • data transfer from the sequencers to the compute nodes is instantaneous

Storage Capacity

In addition to computing capacity, there is the problem of storage capacity--as I indicated earlier, when generating 30-50 TB of raw data per week, the sky is the limit in terms of how much capacity a given project will need.  In addition, there is often significant overhead during the read mapping and variant calling stages, so a non-trivial safety margin of storage is needed on top of the capacity required to simply hold the input and output data.

A Real Case Study of Capacity

To actually back up these ballpark estimates, I think it's meaningful to look at the computational demands of a real sequencing study that involved volumes of genomes and data on the same order as a HiSeq X 10 is capable of producing in a week.  As I alluded above, I recently helped out with a project in which 438 human genomes were delivered in raw format straight from a sequencer, and they had to undergo read mapping and variant calling before the actual science could take place.

As one may expect, real projects tend to have bursty (rather than sustained) demands, and doing read mapping and variant calling for this 438-genome project had an overall activity profile that looked like this:

[Figure: compute cores in use (green blocks) and storage capacity consumed (blue and red lines) over the course of the 438-genome project]

At its peak, the read mapping stage was using over 5,000 cores concurrently (green blocks on September 30).  Similarly with storage, the peak capacity required for intermediate files was 250 TB (blue line on September 14) and the overall peak storage was 330 TB (230 TB intermediate, 50 TB raw input, 50 TB aligned output; red line on September 30).  Going back to the cost of purchasing parallel storage I mentioned earlier, this project used about $100,000 worth of storage hardware capacity at this peak.
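
Translating that peak back into the $0.30/GB hardware figure from earlier, as a sketch:

    # Capital tied up in storage at the project's peak usage.
    peak_tb = {"intermediate": 230, "raw input": 50, "aligned output": 50}
    purchase_per_gb = 0.30

    total_tb = sum(peak_tb.values())        # 330 TB at the September 30 peak
    io_tb = peak_tb["raw input"] + peak_tb["aligned output"]

    print(f"{total_tb} TB peak -> ~${total_tb * 1000 * purchase_per_gb:,.0f} of hardware")
    print(f"overhead factor: ~{total_tb / io_tb:.1f}x over the input+output footprint")

The 3.3x overhead over the bare input and output data is consistent with the factor-of-3 safety margin I suggested above.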

I would anticipate these peaks only growing as sequencer output becomes more continuous and project timelines contract, so being able to efficiently handle uneven loading and deliver burst capacity is a realistically important capability.

One last unquantifiable aspect of this project was the software pipelines' very diverse computational demands--all of the following resources were required for this study in at least one stage of the analysis:

  • very fast serial performance
  • high memory/core ratio (> 4.0 GB/core)
  • high IOPS with large capacity (> 3,500 IOPS sustained)
  • large, online, high-performance storage capacity (> 1 GB/s sustained read)
  • capability for wide (16-way) thread-level parallelism

These data-intensive architectural features are not present on vanilla compute clusters, and although EC2 does provide instances that can meet most of these requirements, the costs for the actual data-intensive processing are significantly higher due to the need for EBS.  Thus, an optimal compute system to match the volume of data being produced by high-throughput sequencers does require some nontrivial system architecture design, which adds, either directly or indirectly, to the overall cost of deploying such a system.  As a very general ballpark, SDSC's Gordon and Trestles systems are both equipped with SSDs and large memory on each node, and the cost per node was very roughly between $8,000 and $10,000.

The Scary Truth

Illumina has been pitching the HiSeq X 10 as being at the cusp of a new era of sequencing, and this is undeniably true.  However, rather than enabling transformational new studies as one might envision based on the sequencers alone, I'm nervous that the "new era" will really be one in which the net rate of discovery is finally not limited by sequencing technology, but by available computing capacity.  Anyone who wants to generate data at the full capability of the HiSeq X 10 system and truly get $1000 genomes out of their $10 million investment will be hitting walls in computations that are equally unparalleled.

This 438-genome study to which I keep referring was ambitious for a two-month project, so I have a difficult time imagining what it would take to keep up with what a HiSeq X 10 can deliver in a week.  Gordon, which was a $10 million data-intensive capital investment, had the muscle to handle a 438-genome project alongside its regular workload.  However, should we expect this "new era" of genomics to now require a miniature Gordon (complete with technical consultants like me!) for every array of HiSeq X units?  And if so, are these really still $1000 genomes?

These questions remain truly unanswered, but I'm far from the only one asking them.  I'd be interested in hearing what others, especially those who have been supporting bioinformatics workloads for a while, think.

12 comments:

  1. It appears the instrument has alignment built in, which is how it is being launched as human-only. Not clear whether it puts out BAM, VCF, or something else

    ReplyDelete
    Replies
    1. Thanks for the info, Keith. The followup question I have, though, is if already-aligned data is what researchers will really want, or if some/many will revert to fastq and re-align. Would it be possible to skip the aligning part within the HiSeq X, just get the fastq, and get even better throughput?

      In my experience, people have their own favorite read mapping pipeline, and it often gets updated or tweaked as algorithms improve. The matter of what to do with these pre-aligned BAMs coming from Illumina is an open question within the projects on which I am currently working. I'm really curious to hear what others are doing.

      I wouldn't expect the sequencers to deliver final VCFs though, as the choice of variant calling method is subject to the goals of the underlying science.

      Delete
    2. Existing Illumina machines have alignment capability built in and output .bam and .bai files. The default option is to use Isaac (https://github.com/sequencing/isaac_aligner). If 8-level binning (http://res.illumina.com/documents/products/whitepapers/whitepaper_datacompression.pdf) and cram formatting are applied to the bam file, the final disk space required is considerably reduced.

      Delete
  2. It dawned on me that at the 4-year depreciation and output rates, Illumina expects 72 thousand genomes of output at a cost (just for the raw data, of course) of $72 million!

    ReplyDelete
  3. ..and what about the comparative genomics, I mean there's little chance of any breakthroughs without the genomes being compared. After all, it's a statistical science where the better answers live in the bigger datasets.

    ReplyDelete
  4. None of these figures includes the cost of the coffee needed to keep the bioinformaticians working...but seriously, we are still far away from automated medical genome analysis, and there will be major additional costs following the alignment and variant calling steps that will include interpretative bioinformatics, visualization, Q/C, as well as medical and research evaluation. Even now, this is becoming more expensive than the mere sequencing, and it is likely to become more so in the future.
    -Peter

    ReplyDelete
    Replies
    1. Absolutely--and there are non-trivial personnel costs associated with keeping the compute hardware working, which are analogous to the $137/genome technician contribution to the $1000/genome price.

      If we were to include that cost, this post would reflect the total costs associated with just getting to the starting line. The actual race itself is a completely different story, and I agree that it's apt to get costlier as more and more scientists are needed to extract science from the data.

      Delete
    2. $137 goes toward depreciation, only $63 is allocated for prep/technician time. No money is allocated for analysis personnel or infrastructure.

      Delete
  5. I find it interesting that once ~7000 HiSeq X machines have been sold, and are running at the 18k/year capacity, they can sequence everyone born on Earth (based on 130e6 births per year). I wonder how many they are planning on selling?

    ReplyDelete
    Replies
    1. There's no limit to genomes that can be sequenced because in a tumour each cell can have a different genome. Once we start directly sequencing epigenomes which is inevitable then each organ or facet of the body has a different profile. The possibilities are truly limitless.

      Delete
  6. While I can claim no real experience with genomic science, I do find that it parallels the medical imaging space, where companies are producing CT studies with 256 slices compared to a few years back when 16-slice CT was the norm. The shift to higher-quality, better images has outpaced the technology needed to store, transmit, and display these same images.

    Seems genomics is now encountering the same issues.

    ReplyDelete
  7. This doesn't take into account the eloquent description of analysis and storage costs, but today, the sequencing costs for a 30X whole human genome on a HiSeq X Ten is $1,400 ($1,550 with library prep): https://genohub.com/shop-by-next-gen-sequencing-technology/#query=e304abac02105b87079fd1a19e70b9ed

    ReplyDelete