The Next Generation of Big Data: MapReduce2 with YARN

Introduction

I am sure that you have already heard about the next generation of MapReduce in Hadoop. It's popularly called MapReduce2 or MR2.

MR2 introduces many enhancements, the main one being a new component called YARN (Yet Another Resource Negotiator).
In the current MapReduce implementation, a single Job Tracker takes care of two critical functions:

Managing resources across the cluster and scheduling jobs using that information.
Keeping track of job execution. This includes rerunning failed tasks, job checkpointing, etc.

In YARN, the Job Tracker goes away, and each of these two functions is given to a separate component.

Resource Manager

There is one Resource Manager per Hadoop cluster, and it is responsible for scheduling jobs. It holds state information about all the nodes and can therefore make smarter scheduling decisions.
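
To make the idea of scheduling from cluster-wide state concrete, here is a minimal sketch in Python. It is a toy model, not the YARN API: `ToyResourceManager`, `NodeState`, and the "most free memory" placement rule are all assumptions made for illustration, and real YARN schedulers use more elaborate policies.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Free capacity the toy Resource Manager tracks for one node (hypothetical)."""
    name: str
    free_memory_mb: int


@dataclass
class ToyResourceManager:
    """Illustrative stand-in for a cluster-wide scheduler; not the YARN API."""
    nodes: dict = field(default_factory=dict)

    def register_node(self, name, free_memory_mb):
        self.nodes[name] = NodeState(name, free_memory_mb)

    def allocate(self, memory_mb):
        """Place a request on the node with the most free memory, or return None."""
        candidates = [n for n in self.nodes.values() if n.free_memory_mb >= memory_mb]
        if not candidates:
            return None
        best = max(candidates, key=lambda n: n.free_memory_mb)
        best.free_memory_mb -= memory_mb
        return best.name


rm = ToyResourceManager()
rm.register_node("node-a", 4096)
rm.register_node("node-b", 8192)
first = rm.allocate(6144)   # only node-b has room
second = rm.allocate(3072)  # node-a now has the most free memory
third = rm.allocate(4096)   # nothing large enough is left
print(first, second, third)
```

Because the toy manager sees every node's free memory, it can decline a request that no node can satisfy instead of blindly dispatching it.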

Application Master

There is one Application Master (AM) per job. The Resource Manager schedules one AM per job, and once that is done, it is the AM's responsibility to complete the execution of the job successfully. This takes a lot of responsibility away from the Resource Manager, so it can scale to many more nodes in the cluster and many more jobs.
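
The per-job responsibility described above can be sketched as a small Python class. This is only an illustration under stated assumptions: `ToyApplicationMaster`, `make_flaky_task`, and the `max_attempts` parameter are hypothetical names, and the real Application Master negotiates containers with the Resource Manager rather than calling functions locally.

```python
class ToyApplicationMaster:
    """Illustrative per-job coordinator; names here are hypothetical, not YARN classes."""

    def __init__(self, job_id, tasks, max_attempts=3):
        self.job_id = job_id
        self.tasks = tasks                # task name -> callable that runs the task
        self.max_attempts = max_attempts
        self.attempts = {}                # task name -> attempts used so far

    def run(self):
        """Run every task, retrying failures; return True if the whole job succeeded."""
        for name, task in self.tasks.items():
            for attempt in range(1, self.max_attempts + 1):
                self.attempts[name] = attempt
                try:
                    task()
                    break                 # task succeeded, move to the next one
                except RuntimeError:
                    continue              # task failed, try again
            else:
                return False              # attempts exhausted, the job fails
        return True


def make_flaky_task(failures_before_success):
    """Build a task that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def task():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("simulated task failure")

    return task


am = ToyApplicationMaster(
    "job_0001",
    {"map-0": make_flaky_task(0), "map-1": make_flaky_task(2)},
)
succeeded = am.run()
print(succeeded, am.attempts)
```

All retry bookkeeping lives in the per-job object, so a central scheduler in this model never has to track individual task attempts.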

A few points to know about this new architecture:

The Application Master aggressively writes checkpoint state to HDFS, so the load on HDFS increases. This application state is used for job recovery: if the Application Master fails, the job can be restarted from the last checkpoint.
The Task Tracker is replaced by the Node Manager (more about this in a future blog post). There is an option to write Node Manager logs to HDFS, so logs can go to a central place and debugging becomes easier. This will further stress the HDFS cluster.
YARN no longer works with MapReduce alone. It can also run other distributed computing platforms.
For example, Yahoo has open-sourced Storm-YARN, which lets Storm, a real-time distributed computing platform, run on a YARN cluster.
This version also introduces web services for Hadoop cluster status. You no longer need to scrape web pages to automate things.
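
As a rough illustration of the cluster-status web services, the sketch below reads the Resource Manager's cluster metrics endpoint (`/ws/v1/cluster/metrics`) with Python's standard library. The host name, the default port of 8088, and the helper names (`cluster_metrics_url`, `summarize_metrics`, `fetch_metrics`) are assumptions for this example; check your own cluster's configuration before relying on them. The sample response is made up so that the parsing step can run offline.

```python
import json
import urllib.request


def cluster_metrics_url(rm_host, port=8088):
    """Build the Resource Manager cluster-metrics URL (host and port are assumptions)."""
    return f"http://{rm_host}:{port}/ws/v1/cluster/metrics"


def summarize_metrics(payload):
    """Pick a few fields out of a decoded cluster-metrics response."""
    metrics = payload["clusterMetrics"]
    return {
        "active_nodes": metrics["activeNodes"],
        "apps_running": metrics["appsRunning"],
        "available_mb": metrics["availableMB"],
    }


def fetch_metrics(rm_host, port=8088, timeout=5):
    """Fetch and summarize live metrics; needs a reachable Resource Manager."""
    request = urllib.request.Request(
        cluster_metrics_url(rm_host, port),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return summarize_metrics(json.load(response))


# Offline demonstration with a made-up response body (values are illustrative only).
sample = json.loads(
    '{"clusterMetrics": {"activeNodes": 4, "appsRunning": 2, "availableMB": 16384}}'
)
summary = summarize_metrics(sample)
print(summary)
```

Against a live cluster, `fetch_metrics("rm.example.com")` would return the same kind of dictionary, with `rm.example.com` standing in for your Resource Manager host.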

Please share the article if you liked it, and let me know your thoughts in the comments below.

WRITTEN BY Bhavesh Goswami

Bhavesh Goswami is the Founder & CEO of CloudThat Technologies. He is a leading expert in the Cloud Computing space with over a decade of experience. He was in the initial development team of Amazon Simple Storage Service (S3) at Amazon Web Services (AWS) in Seattle and has been working in the Cloud Computing and Big Data fields for over 12 years now. He is a public speaker and has been the Keynote Speaker at the ‘International Conference on Computer Communication and Informatics’. He has also authored numerous research papers and patents in various fields.