3 ways Yahoo employed Hadoop to optimize utilization | #HS16SJ
by Brittany Greaner | Jun 30, 2016
Over the last three years, customer demands have grown exponentially. Like many companies, Yahoo, Inc. is adapting to better serve its customers and provide a better user experience. In his keynote today at Hadoop Summit 2016 in San Jose, CA, Mark Holderbaugh, senior director of Hadoop engineering at Yahoo, discussed three major things Yahoo has done to meet those growing demands.
1. YARN
The first thing Yahoo did was look at YARN, a cluster management technology, as a tool to increase utilization. Under the YARN scheduler, clusters were running at only 40 percent utilization, so Yahoo needed to find a way to increase that figure. Using feedback from the nodes, the team was able to adjust the scheduler and push that percentage up to a more favorable number.
YARN turned out to be a worthy investment that returned better utilization.
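As a rough illustration of the kind of figure Holderbaugh described, cluster utilization is simply allocated resources over total capacity across the nodes. The numbers below are invented for illustration, not Yahoo's actual reports:

```python
# Hypothetical node reports: (allocated memory in GB, total memory in GB).
# Values are illustrative only.
node_reports = [
    (48, 128),
    (64, 128),
    (40, 128),
]

allocated = sum(used for used, _ in node_reports)
capacity = sum(total for _, total in node_reports)
utilization = allocated / capacity

print(f"cluster utilization: {utilization:.0%}")  # → cluster utilization: 40%
```

A scheduler change that packs work onto underused nodes raises the numerator without touching the denominator, which is the effect Yahoo was after.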
2. Migration to Tez
The second action Yahoo took was to migrate to Apache Tez, which is aimed at building an application framework that allows for a complex directed-acyclic-graph of tasks for processing data.
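The idea of a directed acyclic graph of tasks can be sketched in a few lines of Python: each task lists the tasks it depends on, and execution follows a topological order. The task names here are invented for illustration; they are not Tez API calls:

```python
from graphlib import TopologicalSorter

# Hypothetical job: each task maps to the set of tasks it depends on.
dag = {
    "read_logs": set(),
    "filter": {"read_logs"},
    "aggregate": {"filter"},
    "join_users": {"read_logs"},
    "report": {"aggregate", "join_users"},
}

# static_order() yields tasks so that every task appears after its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Expressing a job this way is what lets an engine like Tez run independent branches in parallel and skip the intermediate materialization that chained MapReduce stages would require.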
“Tez is the key that gives us up to zero minimal changes to jobs,” said Holderbaugh. Switching to Tez alone allowed the company to run millions of jobs and raise utilization 50 percent. However, it was not a matter of throwing a knife switch on each cluster, Holderbaugh emphasized. Each job has different specifications and changes every day, so migrations had to be handled individually, on a case-by-case basis. The switch reduced both runtime hours and memory use; Yahoo saw a 30 percent gain just from converting one pipeline.
Yahoo also started migrating Apache Hive jobs to Tez. “Hive gave us better utilization and improved latencies, allowing us to do more demand-type latent jobs,” said Holderbaugh. Hive shows which nodes the latencies are occurring on, making it easier to drive those latencies out of jobs. That avoids the need for extra clusters and ultimately saves money.
3. Apache Storm
The third area Yahoo focused on was Apache Storm utilization. According to Holderbaugh, Yahoo has embraced Storm, an open-source distributed real-time computation system, since its creation in 2012. Storm is now used in every part of Yahoo, for data analysis as well as for monitoring clusters, and its utilization is even lower than that of the Hadoop clusters. Yahoo's goal was to keep that fine-grained processing while improving utilization.
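Storm's spout-and-bolt model itself is a Java API, but the pattern can be sketched in plain Python: a spout emits a stream of tuples, and bolts transform or aggregate them as they flow through. This is a conceptual sketch, not Storm code:

```python
from collections import Counter

def sentence_spout():
    """Emits a stream of tuples; a real spout would read from a queue."""
    yield from ["storm monitors clusters", "storm analyzes data"]

def split_bolt(stream):
    """Splits each sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Maintains running counts, as a stateful bolt would."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["storm"])  # → 2
```

The fine-grained, per-tuple nature of this model is exactly what Yahoo wanted to preserve while improving how densely the work packs onto the cluster.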
Holderbaugh also noted that these topics have been the subject of many panels already during the Hadoop Summit, with even more planned for today. The schedule is rife with opportunities to learn more about Yahoo's optimization efforts and much more. He encouraged the audience to watch the recorded talks, or attend what they could, for inspiration.
Hadoop in the cloud
After Holderbaugh finished his talk, Sanjay Radia, founder and architect at Hortonworks, Inc., took the stage. Radia’s talk focused on why you would want to put Hadoop in the cloud. First of all, it’s not actually a new idea, he said; companies have been doing it for years. One major reason is the time and money you can save: there are no hardware costs involved, you don’t need an expert on staff, and the cloud offers more elasticity, with a cluster created in minutes. In addition, it takes away some of the complexity by offering pre-tuned clusters.
Radia also emphasized that having shared data “fundamentally means we need shared management.” He stressed the importance of shared metadata, so that data isn’t replicated and taking up needless space, which translates directly into wasted money. This can be accomplished with a shared database server, Radia said.
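The point about shared metadata can be made concrete with a toy catalog: once a dataset is registered, every later consumer is handed the existing location instead of creating its own copy. The names and locations below are invented for illustration:

```python
# Hypothetical shared metadata catalog: dataset name -> storage location.
catalog = {}

def register(name, location):
    """Registers a dataset once; later callers get the existing entry back."""
    return catalog.setdefault(name, location)

# Two clusters try to register the same dataset.
loc_a = register("clickstream", "s3://shared-bucket/clickstream")
loc_b = register("clickstream", "s3://cluster-b-copy/clickstream")  # already registered

print(loc_a == loc_b)   # → True: both clusters point at one copy
print(len(catalog))     # → 1
```

One catalog entry means one physical copy of the data, which is the storage (and money) saving Radia was describing.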
Collaboration and the cloud are certainly themes at this year’s Hadoop Summit, as is using emerging technology to help companies run more smoothly and cost-effectively than ever before.
#HS16SJ
#theCUBE