3 ways Yahoo employed Hadoop to optimize utilization | #HS16SJ
by Brittany Greaner | Jun 30, 2016
Over the last three years, customer demands have grown exponentially. Like many companies, Yahoo, Inc. is adapting to better serve its customers and provide a better user experience. In his keynote today at Hadoop Summit 2016 in San Jose, CA, Mark Holderbaugh, senior director of Hadoop engineering at Yahoo, discussed three major things Yahoo has done to meet those growing demands.
1. YARN
The first thing Yahoo did was look at YARN, a cluster management technology, as a tool to increase utilization. Under the YARN scheduler, clusters were running at only 40 percent utilization, so Yahoo needed to find a way to increase that figure. Using feedback from the nodes, the team was able to adjust the scheduler and push that percentage up to a more favorable number.
YARN turned out to be a worthy investment that returned better utilization.
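As a rough illustration of the kind of figure Holderbaugh described, cluster utilization is simply allocated resources over total capacity across the nodes. The numbers below are invented for illustration, not Yahoo's actual reports:

```python
# Hypothetical node reports: (allocated memory in GB, total memory in GB).
# Values are illustrative only.
node_reports = [
    (48, 128),
    (64, 128),
    (40, 128),
]

allocated = sum(used for used, _ in node_reports)
capacity = sum(total for _, total in node_reports)
utilization = allocated / capacity

print(f"cluster utilization: {utilization:.0%}")  # → cluster utilization: 40%
```

A scheduler change that packs work onto underused nodes raises the numerator without touching the denominator, which is the effect Yahoo was after.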
2. Migration to Tez
The second action Yahoo took was to migrate to Apache Tez, which is aimed at building an application framework that allows for a complex directed-acyclic-graph of tasks for processing data.
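The idea of a directed acyclic graph of tasks can be sketched in a few lines of Python: each task lists the tasks it depends on, and execution follows a topological order. The task names here are invented for illustration; they are not Tez API calls:

```python
from graphlib import TopologicalSorter

# Hypothetical job: each task maps to the set of tasks it depends on.
dag = {
    "read_logs": set(),
    "filter": {"read_logs"},
    "aggregate": {"filter"},
    "join_users": {"read_logs"},
    "report": {"aggregate", "join_users"},
}

# static_order() yields tasks so that every task appears after its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Expressing a job this way is what lets an engine like Tez run independent branches in parallel and skip the intermediate materialization that chained MapReduce stages would require.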
“Tez is the key that gives us up to zero minimal changes to jobs,” said Holderbaugh. Switching to Tez alone allowed the company to run millions of jobs and raise utilization 50 percent. However, it was not a matter of throwing a knife switch on each cluster, Holderbaugh emphasized. Each job has different specifications and changes every day, so migrations had to be handled individually, on a case-by-case basis. The switch reduced both runtime hours and memory use; Yahoo saw a 30 percent gain just from converting one pipeline.
Yahoo also started migrating Apache Hive jobs to Tez. “Hive gave us better utilization and improved latencies, allowing us to do more demand-type latent jobs,” said Holderbaugh. Hive shows which nodes the latencies are occurring on, making it easier to drive those latencies out of jobs. That avoids the need for extra clusters and ultimately saves money.
3. Apache Storm
The third area Yahoo focused on was Apache Storm utilization. According to Holderbaugh, Yahoo has embraced Storm, an open-source distributed real-time computation system, since its creation in 2012. Storm is now used in every part of Yahoo, for data analysis as well as for monitoring clusters, and its utilization is even lower than that of the Hadoop clusters. Yahoo's goal was to keep that fine-grained processing while improving utilization.
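Storm's spout-and-bolt model itself is a Java API, but the pattern can be sketched in plain Python: a spout emits a stream of tuples, and bolts transform or aggregate them as they flow through. This is a conceptual sketch, not Storm code:

```python
from collections import Counter

def sentence_spout():
    """Emits a stream of tuples; a real spout would read from a queue."""
    yield from ["storm monitors clusters", "storm analyzes data"]

def split_bolt(stream):
    """Splits each sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Maintains running counts, as a stateful bolt would."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["storm"])  # → 2
```

The fine-grained, per-tuple nature of this model is exactly what Yahoo wanted to preserve while improving how densely the work packs onto the cluster.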
Holderbaugh also noted that these topics have been the subject of many panels already during the Hadoop Summit, with even more planned for today. The schedule is rife with opportunities to learn more about Yahoo's optimization efforts and much more. He encouraged the audience to watch the recorded talks, or attend what they could, for inspiration.
Hadoop in the cloud
After Holderbaugh finished his talk, Sanjay Radia, founder and architect at Hortonworks, Inc., took the stage. Radia’s talk focused on why you would want to put Hadoop in the cloud. First of all, it’s not actually a new idea, he said; companies have been doing it for years. One major reason is the time and money you can save: there are no hardware costs involved, you don’t need an expert on staff, and the cloud offers more elasticity, with a cluster created in minutes. In addition, it takes away some of the complexity by offering pre-tuned clusters.
Radia also emphasized that having shared data “fundamentally means we need shared management.” He stressed the importance of shared metadata, so that data isn’t replicated and taking up needless space, which translates directly into wasted money. This can be accomplished with a shared database server, Radia said.
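The point about shared metadata can be made concrete with a toy catalog: once a dataset is registered, every later consumer is handed the existing location instead of creating its own copy. The names and locations below are invented for illustration:

```python
# Hypothetical shared metadata catalog: dataset name -> storage location.
catalog = {}

def register(name, location):
    """Registers a dataset once; later callers get the existing entry back."""
    return catalog.setdefault(name, location)

# Two clusters try to register the same dataset.
loc_a = register("clickstream", "s3://shared-bucket/clickstream")
loc_b = register("clickstream", "s3://cluster-b-copy/clickstream")  # already registered

print(loc_a == loc_b)   # → True: both clusters point at one copy
print(len(catalog))     # → 1
```

One catalog entry means one physical copy of the data, which is the storage (and money) saving Radia was describing.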
Collaboration and the cloud are certainly themes at this year’s Hadoop Summit, as is using emerging technology to help companies run more smoothly and cost-effectively than ever before.
#HS16SJ
#theCUBE