Storage Doesn't Matter for Bioinformatics? Not So Fast

Joe Stanganelli, Founder and Principal, Beacon Hill Law | 4/30/2012

Last week, I wrote an article about how keynote speaker Martin Leach presented a convincing argument to Bio-IT World Conference 2012 attendees here in Boston as to why the biggest obstacle facing the health and life sciences industry in the age of "big-data" is not one of storage, but of computing.

Accessibility, analysis, and integration are the only true bugaboos, says Leach, making storage issues but a petty distraction for genomics researchers and others who work with data-intensive bioinformatics.

Turns out, not everyone here agrees.

Robert Bjornson is director of IT at the Yale Center for Genome Analysis (YCGA). "We spend almost no time thinking about computing. We spend all of our time thinking about storage," Bjornson told a room of a few dozen conference attendees.

In a presentation about IT infrastructure and hardware, Bjornson talked about the technological challenges YCGA and similar organizations face.

"Drives," he aptly observed, "fail."

Even Leach does not dispute this fact of IT life. The Broad Institute of MIT and Harvard, where Leach is CIO, boasts the largest genomic datacenter in the world, with over 10 petabytes of data on spinning disks -- and every day to day-and-a-half, one of those disks fails.

"When you have 1,000 drives, expect failure," confirms Bjornson, by way of advice. What's more, backing up all of a genomics organization's data -- which can number in the petabytes -- just isn't practical.
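To put rough numbers on Bjornson's advice, here is a minimal Python sketch of the arithmetic. It assumes drives fail independently at a constant rate, and the 3 percent annualized failure rate (AFR) is a hypothetical figure chosen for illustration, not one quoted by Bjornson or Leach. The function names are likewise made up for this example.

```python
import math


def expected_failures_per_year(num_drives: int, afr: float) -> float:
    """Expected number of drive failures per year for a pool of drives.

    afr is the annualized failure rate as a fraction (0.03 means 3%).
    """
    return num_drives * afr


def prob_at_least_one_failure(num_drives: int, afr: float, days: float) -> float:
    """Probability that at least one drive fails within `days`.

    Assumption: drives fail independently with a constant hazard rate
    (an exponential model), calibrated so that a single drive fails
    within one year with probability `afr`.
    """
    rate_per_day = -math.log(1.0 - afr) / 365.0
    return 1.0 - math.exp(-num_drives * rate_per_day * days)


if __name__ == "__main__":
    drives = 1000          # the pool size Bjornson mentions
    assumed_afr = 0.03     # hypothetical AFR, for illustration only
    print(expected_failures_per_year(drives, assumed_afr))
    print(prob_at_least_one_failure(drives, assumed_afr, days=30))
```

Under those assumptions, a 1,000-drive pool loses about 30 drives a year, and the chance of getting through a 30-day month without a single failure is under 10 percent. Real failure rates vary by drive model and age, so the inputs matter more than the formula.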

Cost is also a factor (Moore's Law notwithstanding) for some customers, says Bjornson -- at least psychologically. Even as the price of storage falls, many enterprise and high-level organizational customers maintain a consumer market perspective. "I can't tell you how many times people have said, 'Why does this cost $1,000 a terabyte?' " says Bjornson, recounting with a laugh the customers who protest that hard drives at Best Buy can go for about $65 per terabyte.

Big-data customers may be their own worst enemy in more ways than one. YCGA's customers use YCGA's storage and YCGA's cluster. Cautions Bjornson, however, "It's risky to let customers into the factory." They can crash the login node. They can overload the storage. They can "do any one of a number of things that people do when they get the chance," Bjornson says, and any of those things can interfere with their data management and data analysis.

To be fair, this is an example of a risk that falls under both the "storage" and the "accessibility" umbrellas -- and there are others.

For instance, Bjornson himself concedes that search is a huge problem in big data genomics, as he presents a slide that reads, " 'Find' does not work on 2PB on Storage." Genome sequencing, of course, is a data-intensive field -- yet the field of genomics lacks a truly effective data identification solution ("a Google search for data," as Leach called it on Tuesday), says Bjornson. "We don't have it. We really need it."
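One common way around a slow filesystem walk is to keep a separate catalog of file metadata and query that instead. The Python sketch below illustrates the idea with the standard library's sqlite3 module. It is not something Bjornson or Leach described, and the table layout and the function names build_catalog and find_by_name are hypothetical choices for this example.

```python
import os
import sqlite3


def build_catalog(root: str, db_path: str = ":memory:") -> sqlite3.Connection:
    """Walk `root` once and record each file's metadata in a SQLite table.

    The walk is still slow on a large filesystem, but it runs once (or on
    a schedule); later lookups hit the database instead of the disks.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, name TEXT, size_bytes INTEGER, mtime REAL)"
    )
    # Index on the bare file name so exact-name lookups avoid a table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_files_name ON files(name)")
    with conn:  # commit all inserts as a single transaction
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                try:
                    info = os.stat(full_path)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                conn.execute(
                    "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                    (full_path, name, info.st_size, info.st_mtime),
                )
    return conn


def find_by_name(conn: sqlite3.Connection, name: str) -> list:
    """Return the paths of all catalogued files whose name matches exactly."""
    rows = conn.execute(
        "SELECT path FROM files WHERE name = ? ORDER BY path", (name,)
    )
    return [row[0] for row in rows]
```

A catalog like this only answers questions about metadata it has recorded, and it goes stale between scans. A production system at petabyte scale would need much more than this, which is part of why Bjornson calls it an unsolved need.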

Nonetheless, "storage, for us, is by far the hard part," maintains Bjornson.

Both sides of the accessibility vs. storage discussion raise valid points -- and have very real concerns. Alas, the hardware presentation series here at Bio-IT has been somewhat sparsely attended compared to other sessions. By contrast, so many Bio-IT attendees clamored to see the opening keynote speeches on Tuesday that dozens were relegated to an overflow room with a live video feed.

With so many of the attendees here having heard only Leach's advocacy for accessibility, arguments like Bjornson's about the importance of storage seem to have become lost in the din of the conference -- and therefore, ironically, much less accessible.
