The Cube - hBase 2012 - Norbert Burger, TBS, with John Furrier
When John Meyers, Ph.D., came to the Boston University School of Medicine as a medical researcher, he stepped into a chaotic data environment. "Servers were under lab benches. Irreplaceable repositories of years' worth of data resided on NAS boxes on lab benches with no backup. I wouldn't use the term 'infrastructure' to describe the situation."
The problem, Dr. Meyers said, was that the central IT systems had not kept up with the basic change in the nature of medical research. It had become a big data game -- a single genome sequence is 3TB of data -- yet the IT architecture at the time was running on Microsoft Windows shares.
"I kept bugging the head of research about this until finally I guess I annoyed him and he asked me if I would like to take over running IT as well," he said at a Wikibon Peer Incite discussion Tuesday, May 15, 2012, recorded here. "I said yes on one condition: I could throw out everything and start over."
And that is exactly what he did. Specifically, he replaced the outmoded system with a shared resource across the research departments based on an EMC Isilon cluster. He chose Isilon specifically for its ability to handle large volumes of streaming data in what is now called a big data environment and for its automated data tiering, which moves inactive data onto less expensive archive media while keeping it available.
Today that cluster has 300TB of capacity, and BU will add 200TB more this summer. But that expansion will all be archive media. One of the ways in which research differs from commercial environments is that researchers are always focused on "the next shiny thing" -- the next piece of research. Once they have won the grant or published the research paper, the teams move on to the next project and start generating new data, while the older data languishes. However, preserving that data is important because sometimes it can be reused or the original research revisited. As a result, Dr. Meyers said, the Isilon cluster has plenty of tier 1 capacity but is running low on archive space.
Providing remote-site backup for this quantity of data presents its own challenges. One of those is adequate filtering: only the fraction of original data in the system needs to be backed up. A great deal of the database is derived data from analysis, which can be recreated if necessary, while another portion that Dr. Meyers calls "scratch data," documenting intermediary steps, questions, and comments, does not need to be preserved at all.
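The three-way split Dr. Meyers describes can be sketched as a simple classification policy. This is a hypothetical illustration only -- the directory names, file extensions, and rules below are assumptions, not Actifio's or BU's actual scheme:

```python
from pathlib import Path

# Hypothetical labels for the three data classes described above.
ORIGINAL, DERIVED, SCRATCH = "original", "derived", "scratch"

def classify(path: str) -> str:
    """Classify a file for backup purposes (illustrative rules only)."""
    p = Path(path)
    # Scratch data: intermediary steps, questions, comments -- not preserved.
    if any(part in {"scratch", "tmp", "notes"} for part in p.parts):
        return SCRATCH
    # Derived data: analysis outputs that can be recreated if lost.
    if p.suffix in {".bam", ".vcf", ".counts"} or "analysis" in p.parts:
        return DERIVED
    # Everything else is treated as irreplaceable original data.
    return ORIGINAL

def needs_backup(path: str) -> bool:
    # Only the original fraction of the data goes to the remote site.
    return classify(path) == ORIGINAL
```

The point is that the backup system needs metadata-level rules, not just capacity, to avoid replicating hundreds of terabytes of recreatable data.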
The problem at the time was that none of the systems from the leading vendors, including EMC, really met the need. "One day a consultant suggested that we talk to a startup called Actifio in Waltham that had novel technologies," Dr. Meyers said. He had never heard of Actifio, which at the time had very few users and none in academia. But the CEO came to BU and discussed the technology in detail, and what he said made an impression. For instance, Actifio has a copy-management tool that minimizes the number of copies of a database, copies that tend to proliferate in any organization and become costly when the database is measured in hundreds of TBs. It could also handle the highly heterogeneous architecture in the research labs, including both bare-metal and VM environments, which none of the big-vendor systems could manage. And it had the advanced data filtering that Dr. Meyers needed.
Actifio is based on IBM technology and works in a clustered virtual environment. It has a simple interface that makes it easy to set up and is block-based, said Dr. Meyers. It sits in-band and captures deltas for very fast snapshots. At user-defined intervals it compiles these into a full, application-aware backup on its internal storage, which can then be replicated to a remote site. And if the main system fails, it can cut over to the internal Actifio backup while Actifio rebuilds the database on the replacement hardware.
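The capture-and-compile cycle can be illustrated with a toy block store. This is a conceptual sketch of in-band delta capture followed by periodic consolidation into a full image, under stated assumptions -- it is not Actifio's implementation:

```python
class DeltaBackup:
    """Toy model: record changed blocks, periodically fold them into a full image."""

    def __init__(self, base: dict[int, bytes]):
        self.full_image = dict(base)        # last compiled full backup
        self.deltas: dict[int, bytes] = {}  # blocks changed since then

    def capture(self, block: int, data: bytes) -> None:
        # In-band capture: only the delta is recorded, so snapshots are fast.
        self.deltas[block] = data

    def compile_full(self) -> dict[int, bytes]:
        # At a user-defined interval, merge deltas into a new full image.
        self.full_image.update(self.deltas)
        self.deltas.clear()
        return dict(self.full_image)

# Usage: one block changes between snapshots, then a consolidation runs.
b = DeltaBackup({0: b"aaaa", 1: b"bbbb"})
b.capture(1, b"BBBB")
image = b.compile_full()
```

The design point is that writes are cheap (a delta entry) while the expensive work of building a restorable full image is deferred to a schedule the administrator controls.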
"We actually had a disk failure and lost a LUN," Dr. Meyers said. "The backup didn't run as fast as the main system, but it was a lot better than shutting things down. Nobody in the institution realized that anything had happened."