Storage Doesn't Matter for Bioinformatics? Not So Fast

Joe Stanganelli, Founder and Principal, Beacon Hill Law | 4/30/2012 | 28 comments

Last week, I wrote an article about how keynote speaker Martin Leach presented a convincing argument to Bio-IT World Conference 2012 attendees here in Boston as to why the biggest obstacle facing the health and life sciences industry in the age of "big-data" is not one of storage, but of computing.

Accessibility, analysis, and integration are the sole true bugaboos, says Leach, making storage issues but a petty distraction for genomics researchers and others who do data-intensive bioinformatics work.

Turns out, not everyone here agrees.

Robert Bjornson is director of IT at the Yale Center for Genome Analysis (YCGA). "We spend almost no time thinking about computing. We spend all of our time thinking about storage," Bjornson told a room of a few dozen conference attendees.

In a presentation about IT infrastructure and hardware, Bjornson talked about the technological challenges YCGA and similar organizations face.

"Drives," he aptly observed, "fail."

Even Leach does not dispute this fact of IT life. The Broad Institute of MIT and Harvard, where Leach is CIO, boasts the largest genomic datacenter in the world, with over 10 petabytes of data on spinning disks -- and every day to day-and-a-half, one of those disks fails.

"When you have 1,000 drives, expect failure," confirms Bjornson, by way of advice. What's more, backing up all of a genomics organization's data -- which can number in the petabytes -- just isn't practical.
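As a rough illustration of why failure becomes routine at that scale, the arithmetic can be sketched in a few lines of Python. The 3 percent annual failure rate below is an assumed, illustrative figure, not one Bjornson quoted, and the function names are hypothetical.

```python
def expected_failures_per_year(drive_count, annual_failure_rate):
    """Expected number of drive failures in one year.

    annual_failure_rate is a fraction (0.03 means 3 percent) and is an
    assumption for illustration, not a measured value.
    """
    return drive_count * annual_failure_rate


def days_between_failures(drive_count, annual_failure_rate):
    """Average number of days between failures across the whole fleet."""
    failures = expected_failures_per_year(drive_count, annual_failure_rate)
    return 365.0 / failures


if __name__ == "__main__":
    drives = 1000          # the fleet size Bjornson mentions
    assumed_rate = 0.03    # assumed annual failure rate (illustrative only)
    print(f"{expected_failures_per_year(drives, assumed_rate):.0f} failures per year")
    print(f"one failure every {days_between_failures(drives, assumed_rate):.1f} days")
```

Under those assumptions, a 1,000-drive installation would see about 30 failures a year, or one roughly every 12 days. The Broad's reported pace of one failure every day to day and a half works out to somewhere between about 240 and 365 failures a year.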

Cost is also a factor (Moore's Law notwithstanding) for some customers, says Bjornson -- at least psychologically. Even as the price of storage falls, many enterprise and institutional customers still think like consumers. "I can't tell you how many times people have said, 'Why does this cost $1,000 a terabyte?' " says Bjornson, recounting with some amusement the customers who protest that hard drives at Best Buy go for about $65 per terabyte.

Big-data customers may be their own worst enemy in more ways than one. YCGA's customers use YCGA's storage and YCGA's cluster. Cautions Bjornson, however, "It's risky to let customers into the factory." They can crash the login node. They can overload the storage. They can "do any one of a number of things that people do when they get the chance," Bjornson says, and any of those things can interfere with their data management and data analysis.

To be fair, this is an example of a risk that falls under both the "storage" and the "accessibility" umbrellas -- and there are others.

For instance, Bjornson himself concedes that search is a huge problem in big data genomics, as he presents a slide that reads, " 'Find' does not work on 2PB on Storage." Genome sequencing, of course, is a data-intensive field -- yet the field of genomics lacks a truly effective data identification solution ("a Google search for data," as Leach called it on Tuesday), says Bjornson. "We don't have it. We really need it."
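One common way to approach this kind of problem, offered here only as a sketch and not as anything YCGA or the Broad is known to use, is to crawl the file system once, record file metadata in a small database, and then search the database instead of re-walking the disks for every query. The table layout and the names `build_catalog` and `search_by_name` are hypothetical; the example uses only Python's standard `os` and `sqlite3` modules.

```python
import os
import sqlite3


def build_catalog(root, db_path=":memory:"):
    """Walk `root` once and record each file's path, name, size, and mtime."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, name TEXT, size INTEGER, mtime REAL)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                info = os.stat(full_path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (full_path, name, info.st_size, info.st_mtime),
            )
    conn.commit()
    return conn


def search_by_name(conn, pattern):
    """Return paths whose file name matches a SQL LIKE pattern, e.g. '%.fastq'."""
    rows = conn.execute(
        "SELECT path FROM files WHERE name LIKE ? ORDER BY path", (pattern,)
    )
    return [row[0] for row in rows]
```

A call such as `search_by_name(build_catalog("/some/data/dir"), "%.fastq")` would list the matching files from the catalog. The initial crawl is still expensive at petabyte scale, so a real deployment would need incremental updates and a sturdier database; this only shows the shape of the idea.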

Nonetheless, "storage, for us, is by far the hard part," maintains Bjornson.

Both sides of the accessibility vs. storage discussion raise very valid points -- and have very real concerns. Alas, the hardware presentation series here at Bio-IT has been somewhat sparsely attended compared to other sessions. Conversely, so many Bio-IT attendees clamored to see the opening keynote speeches on Tuesday that dozens were relegated to an overflow room with a live video feed.

With so many of the attendees here having heard only Leach's advocacy for accessibility, arguments like Bjornson's about the importance of storage seem to have become lost in the din of the conference -- and therefore, ironically, much less accessible.

nasimson   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 10:02:34 AM
What about Cloud?
We're no longer in the 70s when Bill Gates announced that a couple of kilobytes of storage space is more than enough for everyone.

Storage demands increase, and it goes for every kind of organization given today's storage needs.

With the inception of Cloud we sure have another pretty darn good alternative for storing. Storage providers aren't charging much at this point in order to commercialize Cloud. Perhaps this is the next wise step?

It's at least better than backing up your data in a hundred Hard Drives when all the data can perish if a fire breaks out.
David Wagner   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 11:46:08 AM
Re: What about Cloud?
It is tempting to think that perhaps the Broad Institute is just better at running its data center (or better funded) than Yale. Sure, drives fail. Sure, backing up petabytes of data is hard and expensive.

But it isn't like this isn't being done.

On the other hand, what I really think Dr. Bjornson is probably feeling is the tightness of the standard academic budget. Hopefully, Dr. Leach is correct that prices continue to drop and academic datacenters can afford to do more.

In the meantime, I doubt there's a datacenter out there that doesn't feel the budget pinch.
Taimoor Zubair   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 8:06:26 PM
Re: What about Cloud?
"But it isn't like this isn't being done."

@David: I'd say there are quite a few companies that are dealing with big data and may have volumes much larger than YCGA's. YCGA needs to study the models used by other companies and see how they have resolved their storage issues. Yes, funds can be a problem, though, if there's a limited supply of them.

zerox203   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 8:23:34 PM
Re: What about Cloud?
Well, we have to keep in mind that we're talking about a specific field here rather than general storage. Bioinformatics definitely brings some special concerns to the table, although it is a broad field itself. At organizations like the ones Joe talks about in the article, and especially with some types of data, security may be of more importance than it is at other organizations - if only because bioinformatics data could prove to be a more enticing theft target. Even if not enough to prohibit cloud storage, these concerns could be enough to delay or limit it, which brings us back to square one - storage is still something that needs to be planned for, and can't be taken for granted yet.
Taimoor Zubair   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 10:17:09 PM
Re: What about Cloud?
@Zerox203: I agree that bioinformatics data might be very different from the data that organizations normally store. However, at the end of the day, it's essentially bits and bytes that get stored. The volume of bioinformatics data may be considerably large, but compared with the volume of data at companies like Facebook, eBay, or Amazon, it may be equal, if not less. I certainly feel YCGA can look to the solutions other companies are using to manage Big Data.
Gigi   Storage Doesn't Matter for Bioinformatics? Not So Fast   5/1/2012 6:12:58 AM
Re: What about Cloud?
Taimoor, we had similar storage constraints for our bioinformatics projects, which are collaborative. What we did was form a virtual group and create a common repository where we kept all the data. Those who are interested can access the data at any time over the net, irrespective of location. This helps to avoid storing the same data at multiple locations.
Joe Stanganelli   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 11:58:24 PM
Re: What about Cloud?
Hi, Dave.  Thanks for weighing in.

This reminds me of a discussion from a Dell-sponsored Webinar E2 hosted quite some time back in which the speakers discussed how many IT Departments are most concerned with just keeping the lights on -- more than anything else.

Perhaps that is another part of YCGA's struggle.  Sure, better integration and accessibility would be nice, but it's a huge effort just to deal with what they have.
syedzunair   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 12:21:44 PM
Re: What about Cloud?
"It's at least better than backing up your data in a hundred Hard Drives when all the data can perish if a fire breaks out."

It seems much better than storing data on tapes or on hard drives. With storage on the cloud you get the option of storing your data in a geographically separate location as compared to your current business location. It will most certainly help in data recovery if the primary location goes down. 
Joe Stanganelli   Storage Doesn't Matter for Bioinformatics? Not So Fast   5/1/2012 12:03:08 AM
Re: What about Cloud?
Well, hold on there, kiddies, with all the talk about the cloud.

A recurring theme at the Conference was the unsuitability of the public cloud for a lot of bioinformatics work -- especially in the field of genomics -- because of the TREMENDOUS amounts of data.  Far too much to be sending across public cloud data lines.

One speaker related a tale of how a NY hospital ran a query that lasted four months.

And, of course, there are issues with proprietary data, HIPAA, and other confidentiality (this one's for you, Sara!) bugaboos.

Private clouds can be suitable in many instances (and, indeed, often the best option), however.

tekedge   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 1:49:44 PM
Storage Doesn't Matter for Bioinformatics? Not So Fast
"What's more, backing up all of a genomics organization's data -- which can number in the petabytes -- just isn't practical. "

I am a bit confused by that statement. Does the poster mean backing up petabytes of data daily? Yes, that would be a difficult task, but surely the entire database can be backed up on a periodic basis, and daily incremental backups, which should probably be in the terabytes, should be possible.

 
Taimoor Zubair   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 8:00:35 PM
Re: Storage Doesn't Matter for Bioinformatics? Not So Fast
@tekedge: I agree that incremental backups can be a useful way of taking daily backups rather than backing the whole DB up every day. As long as the incremental backups are limited to terabytes, that should not be a problem.
LuFu   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 2:05:09 PM
Storage hubris
I worked in data storage for many years. What I've learned is it's the one leg in computer processing that is taken for granted until it fails. One of the first things I learned was the MTBF of a disk drive, which is its Mean Time Between Failures. It's a calibration in hours of a hard drive's expected lifetime. So, a manufacturer reports 1.5 million hours MTBF and you think that this will last a lifetime if not longer. Of course the drive spec should include the asterisk - *Your drive mileage may vary - since Murphy's Law is usually excluded from the MTBF equation. Ergo, I vote for Bjornson and his concern about storage. And when it's big storage, then it becomes a larger issue. As we all know, size does matter.
Taimoor Zubair   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 2:51:05 PM
Re: Storage hubris
"What I've learned is it's the one leg in computer processing that is taken for granted until it fails."

@Lufu: I agree with you on this. Recently I had a meeting with finance over a project's budget. It took quite a while and considerable effort to convince those non-technical folks why we needed redundant data disks inside a server and why normal disks can't replace a RAID controller.

Skr2011   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 9:40:46 PM
Re: Storage hubris
@LuFu: The number one tenet of storage at scale is "things fail". When you scale up, you will find that 2-5% of your disks are going to fail (as I am sure you are aware), and when you have a lot of spindles that's a pretty large number. You have to manage against such failures, so it isn't just about buying TBs of disk. The systems have to be designed to fail gracefully. Depending on your applications/goals you might need to make any storage solution highly available, which means you need redundancy, and to scale reads you will almost certainly need to partition your data.

I recommend checking out some of the presentations by Chris Dagdigian, e.g. http://blog.bioteam.net/wp-content/uploads/2010/03/cdag-xgen-storageForNGS_v3.pdf
David Wagner   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 11:27:46 PM
Re: Storage hubris
@Skr2011- I think we do need to find a way to reduce the number of spindles as storage grows. But you are right, there are graceful ways to lower the problem of disk failure. I think eventually, we're going to ditch the spindle for some very clever new storage device. When we do, we'll see yet another explosion of data.
Skr2011   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 11:32:35 PM
Re: Storage hubris
@ David yep! It always works that way doesn't it. We think we have enough space and then BAM something new comes along that requires just that much more....
David Wagner   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 11:36:01 PM
Re: Storage hubris
@Skr2011- And the opposite seems to be true, too. If we find we have some extra space, someone is going to come up with something to fill that space.
white.space   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 11:57:23 PM
Re: Storage hubris
"If we find we have some extra space, someone is going to come up with something to fill that space."

Absolutely! You could have posters and bumper stickers with that adage! I bet you have a Dropbox and a Google Drive account (at the very least!), and several external hard drives filled with stuff.. :) And still wish there was more. The Library of Congress has roughly 20TB worth of stuff, and sometimes I wish I had as much space!
CurtisFranklin   Storage Doesn't Matter for Bioinformatics? Not So Fast   5/1/2012 10:50:32 AM
Re: Storage hubris
@white.space, I remember the old statement, "Nature abhors an empty horizontal surface." I think that we could also say that, "Nature abhors an un-filled storage bit," and we'd be just as correct. I've watched storage requirements grow exponentially during the last 30 years -- I can hardly wait for the day when the desktop petabyte is commonplace!
Joe Stanganelli   Storage Doesn't Matter for Bioinformatics? Not So Fast   5/1/2012 12:13:59 AM
Re: Storage hubris
Yes, Skr2011, scalability was a big part of the talks at the Conference.

Incidentally, this is why Netflix was pretty unaffected by the huge AWS cloud outage a year ago even though they're a major AWS customer -- because, as an organization that handles enormous amounts of data, they were smart enough to have so many redundancies that they could handle an outage.  It was the little guys who relied completely on the cloud, without backups, who were screwed for the following days.
David Wagner   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 11:21:33 PM
Re: Storage hubris
@Lufu- You're right. Storage is the field goal kicker of IT. It is a shame, too, because data gains more value when it is stored well and made more easily accessible.
Joe Stanganelli   Storage Doesn't Matter for Bioinformatics? Not So Fast   5/1/2012 12:04:49 AM
Re: Storage hubris
As the saying goes, LuFu, the bigger they are, the harder they fall.
Gigi   Storage Doesn't Matter for Bioinformatics? Not So Fast   5/1/2012 6:15:56 AM
Re: Storage hubris
Joe, are there any potential outcomes for collaborative bioinformatics projects in drug discovery? I heard that there are some proposals for TB, HIV, and cancer drug discovery.
Sara Peters   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 4:22:03 PM
gold star
Joe you get extra credit simply for using the word "bugaboo."
Sara Peters   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 4:31:51 PM
other genomic research institute
If you haven't already, check out the On the Case video documentary series we're doing about the Translational Genomics Research Institute. Storage is DEFINITELY a challenge for them -- but processing speed and data sharing are just as important. Check it out: http://www.enterpriseefficiency.com/video.asp?section_id=1467&doc_id=242705

 
zerox203   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 8:06:10 PM
Re: Storage for Bioinformatics
The give and take here seems pretty reasonable on both sides. Those saying that storage is not a primary concern are aware (hopefully) that they're talking about an ideal theoretical situation where things run as best as they possibly can in the present or the near future. They're also probably aware that it doesn't work this way at many organizations yet. Those that are talking about how important storage still is seem to agree that their target goal is to make storage a given so that other things can be focused on and that that's an attainable goal - they just think we'll get there later rather than sooner, and want to make sure the ''optimists'' don't get ahead of themselves.

In other words, it's a classic balancing act - the trick is to keep pushing the envelope and moving forward without overextending yourself. It's to be expected that there's a little push and pull to keep it in the right spot. As for me personally, I'm inclined to agree with the storage folks - it's more likely that things look great at an organization because nothing has gone wrong yet than for the opposite to be true.
David Wagner   Storage Doesn't Matter for Bioinformatics? Not So Fast   4/30/2012 11:25:15 PM
Re: Storage for Bioinformatics
@Zerox203- Very nice point. Just because someone isn't focusing on an aspect of the problem doesn't mean it didn't require work. It just means they're talking about something else.
Joe Stanganelli   Storage Doesn't Matter for Bioinformatics? Not So Fast   5/1/2012 12:07:52 AM
Re: Storage for Bioinformatics
Very nice breakdown, zerox, and rather one of the points I was going for.  Both concerns ought be heard and addressed for real progress.

