Chances are that what you would consider big, Martin Leach would consider very, very small.
Delivering his keynote at the Bio-IT World Conference this week here in Boston, Leach challenged attendees to reconsider their concepts of "BIG!" -- the word emblazoned in an ultra-large, all-caps font, complete with exclamation point, on his first slide -- in this, our technology-driven, buzzword-infused world of "big data" and "big analytics."
Leach told the audience that the Broad Institute (where he serves as CIO) has a mind-boggling 10 petabytes of data on spinning disks (so many disks that one of them is expected to fail every 24 to 36 hours). This makes the Broad Institute's datacenter the biggest genomics datacenter in the world.
Yet, as recently as 20 years ago, Leach relates, a CIO might have had his mind similarly boggled by a datacenter housing no more than 16 gigabytes of data. Today, you can store 16 gigabytes of data on a reasonably priced thumb drive -- but in the early 1990s, 16GB was "BIG!"
To offer a comparison, Leach raises the example of the 1,000 Genomes Project -- a self-explanatory genome sequencing project that might seem ambitious ("BIG!" even), considering that humanity's first genome map was completed only nine years ago this month.
"If you can do 1,000 genomes, why can't you do a million?" Leach asks (a fair question, especially given the dramatically falling cost of genome sequencing). "If you can do a million genomes, why can't you do a billion?"
As with cost, data storage is no longer much of an obstacle -- or, at least, it will not remain one for long. If anything, storage is an analytics red herring.
"Looking into the future... I don't think it's a big data problem," says Leach. He predicts that in about 23 years, 16,777,216,000 gigabytes (16,000 petabytes, or just shy of 16 exabytes) will be able to fit onto a single $50 hard drive unit on a typical home PC.
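For readers who want to check the arithmetic behind that prediction, here is a minimal Python sketch. It assumes binary (1,024-based) unit steps, since that is the convention under which the figures above line up; the function and variable names are illustrative and do not come from Leach's talk.

```python
# Binary unit steps (1 PB = 1024**2 GB, 1 EB = 1024**3 GB).
# This is an assumption: it is the convention that reproduces the article's figures.
GB_PER_PB = 1024 ** 2
GB_PER_EB = 1024 ** 3


def gb_to_pb(gigabytes):
    """Convert gigabytes to petabytes using binary steps."""
    return gigabytes / GB_PER_PB


def gb_to_eb(gigabytes):
    """Convert gigabytes to exabytes using binary steps."""
    return gigabytes / GB_PER_EB


# The capacity Leach predicts for a single $50 drive.
predicted_gb = 16_777_216_000
predicted_pb = gb_to_pb(predicted_gb)
predicted_eb = gb_to_eb(predicted_gb)

# How many times the Broad Institute's current 10 petabytes would fit on it.
broad_pb = 10
broad_multiple = predicted_pb / broad_pb

print(f"{predicted_pb:,.0f} PB, {predicted_eb} EB, {broad_multiple:,.0f}x the Broad's holdings")
```

Under these assumptions the drive works out to 16,000 petabytes, or 15.625 exabytes, which is 1,600 times the Broad Institute's present holdings. With decimal (1,000-based) steps, the same gigabyte figure would instead come to roughly 16,777 petabytes, or about 16.8 exabytes.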
"A few years from now, it won't really be all that big -- ten petabytes," says Leach nonchalantly.
"Really, the big question here," he continues, failing to clarify whether the pun is intended, "is 'How do we make sense of it?' "
Leach identifies the truly serious data management problems facing the biomedical and healthcare fields as ones of data movement, data indexing, and data accessibility. "Why don't we have a 'Google search' for data?" Leach laments. "How can you look inside that data? How can you integrate that data? How can you do it in a frictionless way?"
There is some hope on the horizon. The presenter who preceded Leach, Jill Mesirov (also of the Broad Institute), introduced GenomeSpace -- an integrated, open-source, cloud-based data management infrastructure for genomics researchers -- to Bio-IT World attendees. Still, with GenomeSpace only in beta, Leach believes the data management problems are far from solved and will require a great deal more investment -- a particular burden for already cash-strapped research organizations.
Leach also identifies the need for more data scientists, predicting tremendous job growth in the area. Massachusetts -- site of the conference and home to some of the world's finest hospitals, biotech companies, and research facilities -- will gain an additional 50,000 such jobs by 2018, Leach points out.
Whether that will be enough, however, remains to be seen. Fifty thousand may seem like a big number of jobs now -- but will it still be big in six years?