HADOOP Kichidi

Kamakshaiah Musunuru
kamakshaiah.m@gmail.com

Contents

1 Introduction
 1.1 Definition
 1.2 Characteristics
 1.3 Big data in Action
  1.3.1 Marketing
  1.3.2 Manufacturing
  1.3.3 Big data and Sustainability
  1.3.4 Consumer Finance
2 About Hadoop
 2.1 History
 2.2 Architecture
 2.3 File systems
  2.3.1 Hadoop distributed file system
  2.3.2 JobTracker and TaskTracker: the MapReduce engine
 2.4 Hadoop Versions
  2.4.1 Hadoop 1 and Hadoop 2
  2.4.2 Hadoop 2 and Hadoop 3
 2.5 Hadoop Ecosystem
3 Fundamentals of Linux
 3.1 FOSS
  3.1.1 Free software
  3.1.2 Open source
  3.1.3 Benefits
  3.1.4 Drawbacks
  3.1.5 History
 3.2 Linux
  3.2.1 History
  3.2.2 Commercial and popular uptake
  3.2.3 Community
4 Hadoop Installation
 4.1 Hadoop - Installation
 4.2 History
 4.3 Installation of JAVA
  4.3.1 Update JAVA_HOME variable
 4.4 Creating Hadoop User Group
 4.5 Install and configure open-ssh-server
 4.6 Download & Setup Hadoop
  4.6.1 Temporary working directories for NameNode and DataNode
 4.7 Configuring Hadoop
  4.7.1 .bashrc
  4.7.2 hadoop-env.sh
  4.7.3 core-site.xml
  4.7.4 hdfs-site.xml
  4.7.5 yarn-site.xml
  4.7.6 mapred-site.xml
 4.8 Start the Hadoop
 4.9 Compiling Native Binaries
 4.10 Hadoop common
  4.10.1 Core Commands
  4.10.2 Hadoop daemons
5 Core JAVA
 5.1 What is Java?
  5.1.1 Features of Java
  5.1.2 Requirements for writing Java programs
 5.2 Installation
 5.3 Editors
  5.3.1 Common mistakes
 5.4 Programming
  5.4.1 Java is object-oriented
  5.4.2 Objects and Classes
  5.4.3 Your First Java Program
  5.4.4 Dos
  5.4.5 Variables and Data Types
 5.5 Exercises
  5.5.1 Ex-1: System Input-Output
  5.5.2 Ex-2: If statement
  5.5.3 Ex-3: Nested-If
  5.5.4 Ex-4: If-Else-If example
  5.5.5 Ex-5: Switch Statement
  5.5.6 Ex-3: for loop
  5.5.7 Ex-4: While Loop
  5.5.8 Ex-5: Random Number Generation (RNG)
 5.6 Maps
  5.6.1 Collections
  5.6.2 Basic Methods
  5.6.3 Map Classes
  5.6.4 Ex-6: Map example 1
  5.6.5 Vectors
 5.7 Arrays
  5.7.1 Ex-1: One Dimensional Array
  5.7.2 Ex-2: Two Dimensional Array
  5.7.3 Vectors
6 YARN
7 HDFS
 7.1 Web interface
 7.2 Shell Commands
8 MapReduce
 8.1 How MapReduce Works?
  8.1.1 MR Example
  8.1.2 Sorting
  8.1.3 Searching
  8.1.4 Indexing
  8.1.5 TF-IDF
 8.2 Exercise
  8.2.1 Job Class
  8.2.2 Constructors
  8.2.3 Methods
  8.2.4 Mapper Class
  8.2.5 Reducer Class
 8.3 MapReduce Algorithm
  8.3.1 Input Output - Java perspective
 8.4 MapReduce Implementation
  8.4.1 Example - 1: Processing maximum units of electricity consumption
  8.4.2 Example - 2: Processing Partitioning
 8.5 Simple MapReduce Examples
 8.6 MapReduce Algorithms
  8.6.1 A-1: Count
9 Pig
 9.1 History
 9.2 Installation
  9.2.1 Procedure
  9.2.2 Shell Scripts

About the Author




Dr. M. Kamakshaiah is an open source software evangelist, full-stack developer and academic in data science and analytics. His teaching interests are IT practices in business, MIS, data science and business analytics, including functional analytics related to marketing, finance, HRM, quality and operations. He also teaches theoretical concepts such as statistical diagnosis, multivariate analytics, numerical simulation and optimization, machine learning, AI and IoT, using programming languages like R, Python, Java and more.

His research interests relate to a few domains such as healthcare, education, food and agriculture, and open source software. He has taught abroad for two years and has two country visits to his credit. He has developed a few free and open source software solutions meant for corporate practitioners, academics, scholars and students engaged in data science and analytics. All his software applications are available from his GitHub portal https://github.com/Kamakshaiah.

Foreword

I never seriously learned programming at any point in my formal education, and I never went to college in search of such knowledge. I did a master’s in business management and started teaching soon after my postgraduation. I did not even know how to use a simple spreadsheet application until I encountered the need while preparing internal marks for PG students in one of the colleges where I taught business management; I used to take help from my colleagues for such tasks. One of the reasons is that the teaching industry in India was not well equipped to provide such resources to faculty members. My first encounter with a computer was when I was studying for my MBA. We used to have a lab, a namesake lab, that was hardly useful for learning about computers, let alone programming. We received very brief instruction on certain outdated tools like dBase, FORTRAN, COBOL, C, and a couple of other tools which I no longer remember exactly. This was part of a certain paper (subject) on computer applications, which was not regarded as an important course for MBA students. My first program was a prime number detector in Turbo C. I just copied it from the textbook and executed it. I don’t remember now how I did it, nor did I know at the time what I was writing. We had a very poor teaching atmosphere, with very few resources and little instruction. But I used to make a special point of reading books related to computer science whenever I visited the library, even if it meant neglecting other papers (courses).

I did not have the opportunity to work on computers until after ten years of work experience. I bought my first laptop in 2007, a dual-core Acer Aspire 5920. It was touted as one of the latest and best machines that year for mid-range users. I did not know much about using it, so I used to play games and use office productivity suites. The industry had changed a great deal since the days when I started reading about computers. Dual core was the latest thing, and Intel was talking about launching the Core 2 Duo at that time. Before that, processors were never described by their cores; everybody was simply crazy about the Pentium. I hardly knew anything about these aspects of computing other than the names in the news.

My attention then shifted to operating systems, mainly because I was such a bad user: I broke the OS several times, partly because of viruses and partly through abuse of applications. I came across Linux when I was working in North East Africa as an instructor teaching business management. The computers in the university were very badly infected by different types of viruses, so I installed Ubuntu to keep much of that infection away from my laptop. And that changed my understanding of computers and programming. My second programming language at that time was Python (the first was C, which I learned by myself, regardless of the fact that it was part of the course). I only knew of Java but hardly had any idea how to use it for my needs. I was curious to learn Python without getting too deeply involved in programming. At the same time I also came across the R language, which was regarded as one of the most widely used programming languages for statistical computing and quantitative analysis. I started using the R language in full swing when I was working at Symbiosis, Pune, India. In the year 2013 I introduced all these tools, as a matter of minor specialization and while teaching business analytics, in one of the colleges in Hyderabad, India.

Coming to the present book, i.e. Hadoop Kichidi, I call it Kichidi only because it lacks formal structure and doesn’t involve hard-core programming. The word Kichidi refers to a special dish in the southern part of India, cooked by mixing certain ingredients that are heterogeneous and otherwise consumed separately. I thought of using this word for this manual, or book, whatever you may call it, because it is a mixture of certain required aspects of open source tools like Linux, Java (OpenJDK, OpenJRE) and Hadoop. Be that as it may, the goal of this resource is to gain an understanding of big data analytics through Hadoop. To be frank, I started learning Java only after I installed Hadoop, though, honestly speaking, I did know that I might require Java to work with Hadoop. I learned the basics of Java hurriedly, in exactly two days, for I was preparing to give training on Hadoop. This sounds ridiculous, but it is true. The code I provide in this resource is in fact adapted from certain online resources that are available as open content.

This book is only useful for gaining knowledge of Hadoop administration, not programming. I hope I will be able to write another book in which I can explain Hadoop programming. In fact, it is not extraneous to say that I could learn and teach Hadoop because of my passion for open source software, and specifically for Linux. Linux is a great operating system. I use Ubuntu mostly for my needs. I often write shell scripts and am still enthralled by the way it works. This book has a few chapters. The first deals with the installation and configuration of Hadoop on Linux systems. The second deals with the Hadoop Distributed File System (HDFS). The third deals with YARN and its usefulness in managing resources. The fourth deals with basic programming in Java, and the final chapter deals with MapReduce, a programming model for data processing. HDFS, YARN and MapReduce are part of the Hadoop core project. To learn MapReduce one needs to know core Java. I narrated all this only to let readers know that learning requires passion, not formal education. I believe that to do research one needs a brain, not a PhD. In India, especially in academics, there is a misconception that to teach one needs a PhD. This is a totally unreasonable idea which an academic must not hold, at least in my opinion, and pardon me if I am wrong. In India there is a lot of knowledge and wisdom in the villages, where people try to cure and heal deadly diseases and solve their problems more creatively than urban people. Unfortunately, most people believe that civilization happened because of urbanization; very few worry about uncivilized life in very civil places. After all, as I read somewhere, in the 21st century the devil walks the streets not as a beast but as a beauty. At last, I wish you all the best and happy reading.

Author
Kamakshaiah Musunuru

Chapter 1
Introduction

Apache Hadoop is a collection of open source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:

  1. Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
  2. Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  3. Hadoop YARN – (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications;
  4. Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

The term Hadoop is often used for both base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm. Apache Hadoop’s MapReduce and HDFS components were inspired by Google papers on MapReduce and Google File System.

The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of the user’s program. Other projects in the Hadoop ecosystem expose richer user interfaces.

Big data is a term used to refer to the study and applications of data sets that are too complex for traditional data-processing application software to adequately deal with. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. Big data was originally associated with three key concepts: volume, variety, and velocity. Other concepts later attributed with big data are veracity (i.e., how much noise is in the data) and value. 1

Modern usage of the term “big data” tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. “There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem.” Analysis of data sets can find new correlations to “spot business trends, prevent diseases, combat crime and so on.” Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas including Internet search, fin-tech, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology and environmental research.

Data sets grow rapidly in part because they are increasingly gathered by cheap and numerous information sensing Internet of Things (IoT) devices such as mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data are generated. Based on an IDC report prediction, the global data volume will grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020. By 2025, IDC predicts there will be 163 zettabytes of data. One question for large enterprises is determining who should own big-data initiatives that affect the entire organization. See Table 1.1 for more details on data measures in bytes.





Value     Symbol   SI prefix

1000^1    k        kilo
1000^2    M        mega
1000^3    G        giga
1000^4    T        tera
1000^5    P        peta
1000^6    E        exa
1000^7    Z        zetta
1000^8    Y        yotta

Table 1.1: SI prefixes for multiples of bits (bit) or bytes (B)
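
To put these prefixes in perspective, the short Java sketch below converts the 2.5-exabytes-per-day figure quoted above into one-terabyte disk drives. It is only an illustration: the 1 TB drive size is an assumption, and the decimal (SI) multiples of Table 1.1 are used throughout.

public class DataScale {
    public static void main(String[] args) {
        // Decimal (SI) multiples from Table 1.1.
        long TB = 1_000_000_000_000L;   // 10^12 bytes (tera)
        long EB = TB * 1_000_000L;      // 10^18 bytes (exa)

        long dailyData = 5 * EB / 2;    // roughly 2.5 exabytes generated per day (2012 estimate)
        long driveSize = TB;            // a hypothetical 1-terabyte disk drive

        // Prints 2500000: about 2.5 million one-terabyte drives every day.
        System.out.println("1 TB drives filled per day: " + dailyData / driveSize);
    }
}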

Relational database management systems, desktop statistics and software packages used to visualize data often have difficulty handling big data. The work may require “massively parallel software running on tens, hundreds, or even thousands of servers”. What qualifies as being “big data” varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.”

1.1 Definition

The term has been in use since the 1990s, with some giving credit to John Mashey for popularizing it. Perhaps, in simple terms, we might define big data as

“Anything that includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time”.

Big data philosophy encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data. Big data “size” is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many exabytes of data. Big data requires a set of techniques and technologies with new forms of integration to reveal insights from data sets that are diverse, complex, and of a massive scale.

In 2016, De Mauro et al. defined it as follows: 2

“It represents the information assets characterized by such a high volume, velocity and variety to require specific technology and analytical methods for its transformation into value”.

Additionally, a new V, veracity, is added by some organizations to describe it, a revisionism challenged by some industry authorities. The three Vs (volume, variety and velocity) have been further expanded to other complementary characteristics of big data:

  1. Machine learning: big data often doesn’t ask why and simply detects patterns.
  2. Digital footprint: big data is often a cost-free byproduct of digital interaction.

A 2018 definition states

“Big data is where parallel computing tools are needed to handle data”, and notes, “This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of some of the guarantees and capabilities made by Codd’s relational model”. 3

1.2 Characteristics

Hilbert, Martin, and others mention the following characteristics: 4

  1. Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.
  2. Variety: The type and nature of the data. This helps people who analyze it to effectively use the resulting insight. Big data draws from text, images, audio, video; plus it completes missing pieces through data fusion.
  3. Velocity: In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real-time.
  4. Veracity: The data quality of captured data can vary greatly, affecting the accuracy of analysis.

1.3 Big data in Action

1.3.1 Marketing

Pandora, Spotify, Apple Music, Tidal... These are a few of the biggest names in streaming music. Livestreaming has forever changed how we listen to music, and the disruption these platforms are spreading through the music industry is only just starting. The next area that’s ripe for sweeping change is the use of big data and music personalization for live-stream listeners.

How are streaming services using data to understand what their customers want to hear? How are they using a wide array of data sources to make better listener recommendations? What insights do their personalization efforts provide other consumer-focused businesses?

Livestreaming “On a Scale That We’ve Never Seen”

Live-streaming music is the platform consumers demand, and it’s rapidly displacing physical sales (CDs) and downloads (via services such as iTunes). The Nielsen Music U.S. Mid-Year Report noted only halfway through 2017 that on-demand audio streams reached over 184 billion streams, a 62.4-percent increase over the same time period in 2016. David Bakula, the senior vice president of music industry insights for Nielsen, noted, “The rapid adoption of streaming platforms by consumers has generated engagement with music on a scale that we’ve never seen before.”

The key phrase to note is “engagement with music.” Subscription-based streaming has made discovery of new music easier. For lesser-known artists, it’s easier to find an audience and less costly to produce and distribute music. For consumers, it allows them to listen to as much music as they like for as long as they want. Their options aren’t limited to a download. They can listen to anything, anywhere.

Distributors have stopped selling ownership, and instead sell access. The barriers to engagement have dropped between musicians and listeners. How are streaming services using big data to improve customer engagement with their services?

Spotify’s Approach to the Personalization of Streaming

There are different approaches to combining big data and music personalization. For streaming leader Spotify, it was necessary to grow the personalization team. Within a few short years, the company went from having one tiny team to multiple teams located in New York, Boston, and Stockholm.

Spotify notes that platform features are “but the tip of a huge personalization iceberg.” Discover Weekly is one personalization feature that delivers a two-hour compilation of custom-made music to Spotify users, which is “based both on your own listening as well as what others are playlisting and listening to around the songs you love … It’s like having your best friend make you a personalized mix tape every single week.”

Release Radar is a feature that Spotify added in 2016. It provides users with a weekly playlist with songs from new albums. The company admits that for new albums, there isn’t data about streaming or playlists to draw from. Instead, it relies on “audio research” that uses deep learning to curate similarities between music. By comparing this research to a user’s listening history, new tracks that might pique a listener’s interest can be determined.

Pandora’s Big Data Music Project

Spotify’s audio research approach is similar to Pandora’s Music Genome Project, which began in 1999. In this project, teams of trained musicians listen to and categorize music. From that work, Pandora has built a collection of 450 musical attributes that describe a song’s melody, harmony, rhythm, form, instrumentation, lyrics, and voice.

That work informs the platform’s machine learning and algorithms. Pandora combines the data compiled in the project with data about what users listened to, as well as the listener’s thumbs-up or thumbs-down choices when they are presented with a new song. With the Music Genome Project, Pandora hopes to deliver the “most comprehensive music analysis ever undertaken” and provide listeners with “personalized radio that plays what you love and continually evolves with your tastes.”

What Lessons Can Be Learned?

The sheer volume of music and other audio being streamed by listeners reveals that streaming services have tapped into a strong consumer need. This popularity is not something they’re taking for granted. The lessons being applied to big data and music offer insight that other consumer-focused retailers should heed.

Live-streaming platforms continue to explore more ways for big data to drive customer engagement and loyalty. The growth and customer satisfaction being driven by these efforts should be music to the ears of retailers of all kinds.

1.3.2 Manufacturing

If you work in manufacturing, you know how quickly competitors can gain an advantage: a plant improvement here, a supply chain tweak there—it all adds up. Additionally, the challenges facing manufacturers are driving them to transform the way they manufacture, distribute, sell, and support their offerings across the entire product life cycle. Using big data in manufacturing can give you the upper hand by making you more competitive.

Gaining an Innovative Edge in Productivity

Competition in the manufacturing industry is becoming highly unpredictable as innovation comes from an increasingly wide array of regions and markets. This can leave companies struggling to keep up, making efficiency in supply chains and manufacturing plants more important than ever. Exacerbating this competition is the accelerating innovation cycle. Improved technical capabilities enable competitors to design, engineer, manufacture and distribute products more quickly than ever before, reducing their time to market.

So, can using big data in manufacturing help businesses meet these competitive challenges? Absolutely. Companies with a more holistic view of their enterprise can spot potential issues and optimize their processes to retain a competitive edge.

For example, big data can also drive efficiency in product manufacturing processes through machine learning. This branch of artificial intelligence, which analyzes historical data to understand “normal” process performance, enables companies to monitor plant processes to detect anomalies, and optimize product quality and production yield. This, along with the ability to leverage machine learning-enabled predictive maintenance models, can dramatically minimize plant equipment downtime, driving down overall manufacturing costs even further.

Upgrading Your Supply Chains

Another problem manufacturers face is that their competitors are taking advantage of more complex, rapidly evolving supply chains. Digital supply chains respond faster to changes in demand due to a real-time flow of information, complete visibility, and the ability to respond quickly.

As manufacturers come under pressure to navigate these supply chains for better pricing and delivery terms, they must also ensure that they meet regulatory compliance challenges around environmental and corporate quality controls. Sourcing non-optimal components that don’t meet environmental standards could create compliance problems further down the line.

Using Data to Understand Your Needs

To prevent these problems while making the best use of the supply chain, manufacturers need a holistic understanding of conditions throughout the delivery pipeline. They must extend their visibility from the factory to the supply chain, drawing data from a range of stakeholders, including suppliers and logistics firms. This can create a complete view of the supply chain that will help manufacturers predict and adapt for emerging issues.

The more data sources that you can consume, the more you can understand and refine your manufacturing process. Pulling in supply chain data from logistics partners and suppliers can help you further refine your products by looking for patterns that affect quality and production times. Seeing patterns in supply chain behavior can help you avoid disruptions in materials supply, further enhancing your manufacturing efficiency.

Addressing Challenges With Customers

At the other end of the supply chain lies the consumer. There, too, companies face new challenges. Consumers are increasingly empowered, enjoying an unprecedented range of product and service options. Manufacturers that don’t anticipate their customers’ needs risk losing them. It’s all about growth in this hyper-competitive marketplace, and growth demands excellence in customer experience, customer-centric interactions, and innovative products and services throughout the product life cycle.

The smartest vendors turn these customer challenges into opportunities. They use big data to gather and analyze customer requirements and feed those insights back into the product design cycle in a process of continuous improvement. They can use a variety of sources for this, including customer relationship management and marketing systems, sales databases, and technical support feeds.

Gaining Insights to Achieve Long-Term Success

Companies that master big data can meet these industry challenges and set themselves up for longer-term success. Instead of aiming only at short-term goals, such as increasing production volume and lowering costs, they can use data from this unified supply chain and manufacturing ecosystem to gain insights into new market opportunities.

Rather than blindly turning the crank to hit the next quarter’s sales target, smart companies can use this data to identify new geographical and product markets, and plan targeted business expansions, paving the way for structured, healthy growth.

This ultimate use of big data will come after manufacturers use data sources to create a more cohesive view of their product life cycles. Along the way, they can improve their competitive advantage by realizing tangible improvements in productivity and quality.

The data is already latent in the manufacturing plant and in your logistical supply chains. By using big data in manufacturing to power your decision-making, you can unlock its value.

1.3.3 Big data and Sustainability

Global spending on big data and business analytics will grow to more than $187 billion in the next few years. At the same time, fears over the use of big data continue to grow—according to research from the University of Cambridge, 71 percent of consumers believe brands that access their personal data use it unethically.

But that’s not always true. Smart organizations, in both the private and public sectors, are instead utilizing big data for social good. From addressing complex social issues to fighting hunger, big data is making it easier to create lasting, positive change in society.

Understanding the Need for Change

Various academics have voiced their concerns. Cathy O’Neil, former director of The Lede Program at Columbia University, says algorithms and mathematical models should be used to help reduce bias and inequality. However, the reverse is often true—big data models can sometimes help reinforce discrimination, and automated data–led decisions can leave unfortunate individuals further exposed to poverty.

Yet while the potential exists for big data to create inequality, there’s also the opportunity to improve society. Using big data for social good, therefore, should be the aim for all organizations that utilize big data—and examples of that positive use of information already exist.

Sponsoring Joined-Up Thinking

Some experts believe public and private partnerships are the best way to help use big data for social good. Large, blue-chip organizations have the skills and technologies that are developing game-changing big data models. Lending this capability to nonprofit organizations could help change the world for the better.

There are examples of this link-up taking place around the globe. The nonprofit DataKind brings together leading data scientists with high-impact social organizations: take its recent partnership with Pollinate Energy to address the detection of urban poor communities in Bangalore via satellite images and other data points.

Academic institutions are forging partnerships, too. The University of Chicago runs a summer program called The Data Science for Social Good Fellowship, which helps train aspiring data scientists to work on projects with a positive impact. Working closely with governments and nonprofits, research fellows in the program take on real-world problems in key areas, including education, health, and public safety.

Creating Multidisciplinary Connections

Mobile operator representation organization GSMA recently launched an initiative to use vendors’ big data capabilities to address humanitarian crises. The program, known as Big Data for Social Good, aims to meet significant crises—such as epidemics and natural disasters—head on, and is being launched alongside 19 of the world’s leading mobile operators.

Key executives in these organizations are helping to drive change. Nuria Oliver, director of data science research at Vodafone, has spent the past decade exploring how big data can be used to improve society. She says multidisciplinary approaches work best: bringing together data scientists and epidemiologists creates a confluence of talent to help solve social problems.

Another example from the mobile space comes in the form of LUCA, the big data unit within Telefonica. The organization—which is running a Big Data for Social Good initiative—is using an Apache Hadoop–based data platform to collect and store information in a data repository to undertake data transformation and analysis.

Making More Informed Decisions

What all parties recognize is that big data is about more than simply improving customer experiences and business profits. There is growing recognition of the game-changing power of big data from key organizations, such as the United Nations, the World Economic Forum, and the World Bank. In short, the right insight at the right time can improve society.

Take the United Network for Organ Sharing (UNOS), which is the private, nonprofit organization that runs the U.S. organ transplant system. UNOS manages the requirements of a network of hundreds of hospitals, transplant centers, organ procurement professionals, and thousands of volunteers. The decisions UNOS makes change people’s lives.

UNOS uses a connected data platform solution and 100 percent open-source tools to create a self-service reporting system for centers and organ procurement organizations. The organization’s data warehouse gives doctors a past and present view across patients and transplanted organs. The result is informed decisions that have the patients’ best interests in mind.

The amount of data being collected—and the money spent on technologies to analyze this information—continues to rise. While experts are concerned by potential risks to individuals, some organizations are already using their investment to help sponsor an improvement in societal conditions. At a time of huge change, we all have a responsibility to find ways to use big data for social good.

1.3.4 Consumer Finance

Anyone who’s bought a house without paying the full price upfront knows that applying for a mortgage is anything but simple and straightforward. Even the customers with the most solid financial footings go through lengthy and inconvenient processes to borrow for a home. However, financial institutions and other money services businesses are starting to investigate using big data analytics to cut down on the red tape in mortgage applications and make it easier for their customers to apply for loans.

How Big Data Changes Mortgage Applications

Mortgage banking has long been about modeling loan performance appropriately, based on a financial institution’s business goals, appetite for and ability to take risks, and the overall economy. Risk departments have used models to drive decisions about how many loans should be issued, what types of pricing and mitigation should be applied to individual loans, and what demographics and segments of the economy represent good loans for a bank or credit union.

More Data, Greater Assessments

The key difference with the emergence of big data analytics is the amount of data at companies’ disposal. Models with more data have gotten much better at assessing risk and providing an accurate idea of the behavior of a portfolio of mortgages. These models succeed because they use a much broader base of data types—including factors such as a prospective customer’s geolocation and transaction history—and are able to correlate things like comparable people who behave in a particular manner.

Data is also more available. The amount of data an organization can collect on one person is vaster than what used to be possible and, once assembled, the data can build a dramatically different picture. Data sets consisting of anything from geolocation to credit card transactions and store sales are starting to be available for sale to lending institutions and other outfits, and that data can be correlated with existing models to generate better decisions.

The types of mathematical models being used today have gotten better in recent years because more data input equals better tweaking and adjusting, and exponentially better training over much larger amounts of data. Simply put, there are more types of input, and that leads to better decision-making.

Data Leads to Faster Decision-Making

The speed of mortgage decisions is also improving, thanks to several factors. Today’s big data–oriented infrastructure is more robust and generally runs in near-real time. Most big data implementations have infrastructure that allows a lot of data—of different types or different functions—to be stored very effectively and very cheaply. Additionally, this infrastructure allows easy access to historical data in order to tweak models and validate them, providing analysts the ability to run models effectively and very quickly.

Key to these new capabilities is a computational engine that allows models to run in parallel and very fast, along with a streaming capability to be able to run a model on the fly as new data is coming in.

Potential Pitfalls to Be Aware Of

Ultimately, the idea behind drawing conclusions from big data is to build models and automation that drive better business decisions. But there are pitfalls to watch out for.

Losing the Human Touch

From a lending standpoint, as these models get better, the decision on whether someone should be approved can be driven more by the computer and less by a human decision. The removal of the human element in decision-making has advantages and disadvantages.

If a model is making a choice, it’s typically only programmed to examine risk factors, thereby leaving it blind to other factors a human might consider. On the other hand, a human’s decision can be influenced by biases extraneous to a purely risk-based decision. Removing biases can be an excellent attribute, but you also remove the ability to make judgment calls, which can mean making a borderline good loan go bad or vice versa.

Thus, a good model will always have two quality assurance steps: additional safety checks to ensure you don’t come up with the wrong answer because of a faulty model, and a review of decisions so that a human can assess the overall picture and either sustain or override the decision.

Getting Overwhelmed by Increased Loan Volumes

Another risk of utilizing data analytics as part of your mortgage and lending efforts is volume. With better models and decision-making, an organization can acquire an unwieldy appetite for loans. When you increase volume, however, risk can expand quickly, and the negative consequences can occur much more quickly over a larger base of people.

Being Outsmarted by Fraud

Finally, there’s a risk in how intelligent and tolerant models can be in driving lending decisions. In particular, fraud prevention needs to be factored in early on. Models need to be smart enough to prevent gaming. If nefarious actors know a model works with, say, three factors, they may then attempt to force those factors to drive a desired result. Modelers and analysts must understand this important issue and ensure that their models are smart enough to combat it.

Ultimately, the use of big data analytics for mortgages drives better institutional profitability—more loans that are closed in less time, and that perform better overall and have a reduced risk of loss. As long as you’re mindful of the common trouble areas, you can capitalize on data’s value and see the results you’re looking for.

Notes

1. Laney, Doug (2001). “3D data management: Controlling data volume, velocity and variety”. META Group Research Note, 6 (70).

2. Read more about this definition at https://www.emeraldinsight.com/doi/full/10.1108/LR-06-2015-0061

3. Fox, Charles (2018-03-25). Data Science for Transport. Springer.

4. Hilbert, Martin. “Big Data for Development: A Review of Promises and Challenges”. Development Policy Review. martinhilbert.net. Retrieved 7 October 2015.

Chapter 2
About Hadoop

Apache Hadoop is a collection of open source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the data set to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:

  1. Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
  2. Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  3. Hadoop YARN – (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications;
  4. Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

The term Hadoop is often used for both base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm. Apache Hadoop’s MapReduce and HDFS components were inspired by Google papers on MapReduce and Google File System.

The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of the user’s program. Other projects in the Hadoop ecosystem expose richer user interfaces.

2.1 History

According to its co-founders, Doug Cutting and Mike Cafarella, the genesis of Hadoop was the Google File System paper that was published in October 2003. This paper spawned another one from Google – “MapReduce: Simplified Data Processing on Large Clusters”. Development started on the Apache Nutch project, but was moved to the new Hadoop sub-project in January 2006. Doug Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant. The initial code that was factored out of Nutch consisted of about 5,000 lines of code for HDFS and about 6,000 lines of code for MapReduce.

In March 2006, Owen O’Malley was the first committer to add to the Hadoop project; Hadoop 0.1.0 was released in April 2006. It continues to evolve through contributions that are being made to the project.

2.2 Architecture

Hadoop consists of the Hadoop Common package, which provides file system and operating system level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the Java Archive (JAR) files and scripts needed to start Hadoop.

For effective scheduling of work, every Hadoop-compatible file system should provide location awareness, which is the name of the rack, specifically the network switch where a worker node is. Hadoop applications can use this information to execute code on the node where the data is, and, failing that, on the same rack/switch to reduce backbone traffic. HDFS uses this method when replicating data for data redundancy across multiple racks. This approach reduces the impact of a rack power outage or switch failure; if any of these hardware failures occurs, the data will remain available.

A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a Job Tracker, Task Tracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only and compute-only worker nodes. These are normally used only in nonstandard applications. Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard startup and shutdown scripts require that Secure Shell (SSH) be set up between nodes in the cluster.

In a larger cluster, HDFS nodes are managed through a dedicated NameNode server to host the file system index, and a secondary NameNode that can generate snapshots of the namenode’s memory structures, thereby preventing file-system corruption and loss of data. Similarly, a standalone JobTracker server can manage job scheduling across nodes. When Hadoop MapReduce is used with an alternate file system, the NameNode, secondary NameNode, and DataNode architecture of HDFS are replaced by the file-system-specific equivalents.
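
As a concrete illustration of how a client uses the Hadoop Common libraries to reach a cluster, the sketch below opens a connection to HDFS and lists the root directory. It is a minimal sketch, not a full program from this book: the NameNode address hdfs://localhost:9000 is an assumed single-node setting, and on a configured machine fs.defaultFS is normally picked up from core-site.xml instead of being set in code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsRoot {
    public static void main(String[] args) throws Exception {
        // Hadoop Common supplies the Configuration and FileSystem abstractions.
        Configuration conf = new Configuration();
        // Assumed NameNode address; usually read from core-site.xml on the classpath.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);   // the client talks to the NameNode
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}

To compile and run it, the Hadoop client libraries must be on the classpath; the hadoop classpath command prints a suitable value.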


Figure 2.1: Hadoop work-flow


2.3 File systems

2.3.1 Hadoop distributed file system

The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Some consider it to instead be a data store due to its lack of POSIX compliance, but it does provide shell commands and Java application programming interface (API) methods that are similar to other file systems. Hadoop is divided into HDFS and MapReduce. HDFS is used for storing the data and MapReduce is used for processing data. HDFS has five services as follows:

  1. Name Node
  2. Secondary Name Node
  3. Job tracker
  4. Data Node
  5. Task Tracker

The first three are Master Services/Daemons/Nodes and the last two are Slave Services. Master Services can communicate with each other, and in the same way Slave Services can communicate with each other. The Name Node is a master node and the Data Node is its corresponding slave node, and the two can talk with each other.

  1. Name Node: HDFS consists of only one Name Node that is called the Master Node. The master node can track files, manage the file system and has the meta-data of all of the stored data within it. In particular, the name node contains the details of the number of blocks, locations of the data node that the data is stored in, where the replications are stored, and other details. The name node has direct contact with the client.
  2. Data Node: A Data Node stores data as blocks. Also known as the slave node, it stores the actual data in HDFS and serves the client’s read and write requests. These are slave daemons. Every Data Node sends a heartbeat message to the Name Node every 3 seconds to convey that it is alive. If the Name Node does not receive a heartbeat from a Data Node for 2 minutes, it considers that Data Node dead and starts the process of block replication on some other Data Node.
  3. Secondary Name Node: This is only to take care of the checkpoints of the file system metadata which is in the Name Node. This is also known as the checkpoint Node. It is the helper Node for the Name Node.
  4. Job Tracker: Job Tracker receives the requests for Map Reduce execution from the client. Job tracker talks to the Name Node to know about the location of the data that will be used in processing. The Name Node responds with the metadata of the required processing data.
  5. Task Tracker: It is the slave node for the Job Tracker; it takes tasks and code from the Job Tracker and applies that code to the file. The process of applying the code to the file is known as the Mapper.

A Hadoop cluster nominally has a single namenode plus a cluster of datanodes, although redundancy options are available for the namenode due to its criticality. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure calls (RPC) to communicate with each other.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require redundant array of independent disks (RAID) storage on hosts (but to increase input-output (I/O) performance some RAID configurations are still useful). With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to re-balance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file-system differ from the target goals of a Hadoop application. The trade-off of not having a fully POSIX-compliant file-system is increased performance for data throughput and support for non-POSIX operations such as Append.
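
The block placement described above can be inspected from a client program. The sketch below takes an existing HDFS file path as its only command-line argument (an assumption; any file already stored in HDFS will do) and prints the replication factor and, for each block, the datanodes that hold a replica.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml if present
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path(args[0]);              // e.g. /user/hduser/data.txt (hypothetical)
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());

        // One BlockLocation per HDFS block, listing the hosts that store a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + ", length " + block.getLength()
                + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

Roughly the same information is available from the shell with hdfs fsck <path> -files -blocks -locations.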

In May 2012, high-availability capabilities were added to HDFS, letting the main meta-data server called the NameNode manually fail-over onto a backup. The project has also started developing automatic fail-overs.

The HDFS file system includes a so-called secondary namenode, a misleading term that some might incorrectly interpret as a backup namenode when the primary namenode goes offline. In fact, the secondary namenode regularly connects with the primary namenode and builds snapshots of the primary namenode’s directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions, then to edit the log to create an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate namenodes. Moreover, there are some issues in HDFS such as small file issues, scalability problems, Single Point of Failure (SPoF), and bottlenecks in huge metadata requests.

One advantage of using HDFS is data awareness between the job tracker and task tracker. The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location. For example: if node A contains data (a, b, c) and node X contains data (x, y, z), the job tracker schedules node A to perform map or reduce tasks on (a, b, c) and node X would be scheduled to perform map or reduce tasks on (x, y, z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other file systems, this advantage is not always available. This can have a significant impact on job-completion times as demonstrated with data-intensive jobs.

HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write operations.

HDFS can be mounted directly with a File system in User-space (FUSE) virtual file system on Linux and some other Unix systems.

File access can be achieved through the native Java API, the Thrift API (generates a client in a number of languages e.g. C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, the HDFS-UI web application over HTTP, or via 3rd-party network client libraries.
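
For completeness, here is a minimal sketch of the native Java API mentioned above: it writes a small file into HDFS and reads it back to standard output. The path /tmp/hello.txt is purely illustrative, and the cluster settings are assumed to come from the configuration files on the classpath.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/hello.txt");   // hypothetical test path

        // Write a small file; the second argument allows overwriting an existing file.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello from the HDFS Java API\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back and copy the bytes to standard output.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}

The same round trip can be performed from the shell with hdfs dfs -put and hdfs dfs -cat.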

HDFS is designed for portability across various hardware platforms and for compatibility with a variety of underlying operating systems. The HDFS design introduces portability limitations that result in some performance bottlenecks, since the Java implementation cannot use features that are exclusive to the platform on which HDFS is running. Due to its widespread integration into enterprise-level infrastructure, monitoring HDFS performance at scale has become an increasingly important issue. Monitoring end-to-end performance requires tracking metrics from datanodes, namenodes, and the underlying operating system. There are currently several monitoring platforms to track HDFS performance, including Hortonworks, Cloudera, and Datadog.

2.3.2 JobTracker and TaskTracker: the MapReduce engine

Atop the file systems comes the MapReduce Engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java virtual machine (JVM) process to prevent the TaskTracker itself from failing if the running job crashes its JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.
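
To see what such a submitted job looks like from the client side, below is a sketch modeled on the canonical word-count example from the Hadoop documentation, written against the org.apache.hadoop.mapreduce API. The input and output directories are supplied on the command line; the MapReduce engine (the JobTracker here, or the ResourceManager and application master under YARN) then schedules the map and reduce tasks as close to the data as it can.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in a line of input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts for each word; also usable as a combiner.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A typical invocation would be along the lines of hadoop jar wordcount.jar WordCount /input /output, with both paths living in HDFS.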

This approach has some known limitations, particularly around job scheduling.

Scheduling

By default Hadoop uses FIFO scheduling, and optionally 5 scheduling priorities to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, while adding the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler, described next).

Fair scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and Quality of service (QoS) for production jobs. The fair scheduler has three basic concepts.

  1. Jobs are grouped into pools.
  2. Each pool is assigned a guaranteed minimum share.
  3. Excess capacity is split between jobs.

By default, jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, as well as a limit on the number of running jobs.

Capacity scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features that are similar to those of the fair scheduler.

  1. Queues are allocated a fraction of the total resource capacity.
  2. Free resources are allocated to queues beyond their total capacity.
  3. Within a queue, a job with a high level of priority has access to the queue’s resources.

There is no preemption once a job is running.

2.4 Hadoop Versions

2.4.1 Hadoop 1 and Hadoop 2

The biggest difference between Hadoop 1 and Hadoop 2 is the addition of YARN (Yet Another Resource Negotiator), which replaced the MapReduce engine in the first version of Hadoop. YARN strives to allocate resources to various applications effectively. It runs two daemons, which take care of two different tasks: the resource manager, which does job tracking and resource allocation to applications, and the application master, which monitors progress of the execution.
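
As a small illustration of the resource manager’s role, the sketch below uses the YarnClient API to ask the ResourceManager for cluster metrics and for the applications it is currently tracking. It assumes a running YARN cluster whose settings are found in yarn-site.xml on the classpath.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // The YarnClient talks to the ResourceManager named in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        System.out.println("NodeManagers in the cluster: "
            + yarn.getYarnClusterMetrics().getNumNodeManagers());

        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  "
                + app.getName() + "  " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}

Roughly the same listing is available from the command line via yarn application -list.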

2.4.2 Hadoop 2 and Hadoop 3

There are important features provided by Hadoop 3. For example, while there is one single namenode in Hadoop 2, Hadoop 3 enables having multiple name nodes, which solves the single point of failure problem.

In Hadoop 3, there are containers that work on the principle of Docker, which reduces the time spent on application development. One of the biggest changes is that Hadoop 3 decreases storage overhead with erasure coding.

Also, Hadoop 3 permits usage of GPU hardware within the cluster, which is a very substantial benefit to execute deep learning algorithms on a Hadoop cluster.

2.5 Hadoop Ecosystem

(This material is copied from http://hadoop.apache.org/)

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

  1. Hadoop Common: The common utilities that support the other Hadoop modules.
  2. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  3. Hadoop YARN: A framework for job scheduling and cluster resource management.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

  1. AmbariTM: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  2. AvroTM: A data serialization system.
  3. CassandraTM: A scalable multi-master database with no single points of failure.
  4. ChukwaTM: A data collection system for managing large distributed systems.
  5. HBaseTM: A scalable, distributed database that supports structured data storage for large tables.
  6. HiveTM: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  7. MahoutTM: A Scalable machine learning and data mining library.
  8. PigTM: A high-level data-flow language and execution framework for parallel computation.
  9. SparkTM: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  10. TezTM: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by HiveTM, PigTM and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace HadoopTM MapReduce as the underlying execution engine.
  11. ZooKeeperTM: A high-performance coordination service for distributed applications.

Notes

1Laney, Doug (2001). “3D data management: Controlling data volume, velocity and variety”. META Group Research Note. 6 (70).

2Read more about this definition at https://www.emeraldinsight.com/doi/full/10.1108/LR-06-2015-0061

3Fox, Charles (2018-03-25). Data Science for Transport. Springer.

4Hilbert, Martin. “Big Data for Development: A Review of Promises and Challenges. Development Policy Review”. martinhilbert.net. Retrieved 7 October 2015.

Chapter 3
Fundamentals of Linux

3.1 FOSS

Free and open-source software (FOSS) is software that can be classified as both free software and open-source software. That is, anyone is freely licensed to use, copy, study, and change the software in any way, and the source code is openly shared so that people are encouraged to voluntarily improve the design of the software. This is in contrast to proprietary software, where the software is under restrictive copyright licensing and the source code is usually hidden from the users.

FOSS maintains the software user’s civil liberty rights (see the Four Essential Freedoms, below). Other benefits of using FOSS can include decreased software costs, increased security and stability (especially in regard to malware), protecting privacy, education, and giving users more control over their own hardware. Free and open-source operating systems such as Linux and descendants of BSD are widely utilized today, powering millions of servers, desktops, smart-phones (e.g., Android), and other devices. Free-software licenses and open-source licenses are used by many software packages. The free-software movement and the open-source software movement are online social movements behind widespread production and adoption of FOSS.

”Free and open-source software” (FOSS) is an umbrella term for software that is simultaneously considered both Free software and open-source software. FOSS (free and open-source software) allows the user to inspect the source code and provides a high level of control of the software’s functions compared to proprietary software. The term ”free software” does not refer to the monetary cost of the software at all, but rather whether the license maintains the software user’s civil liberties (”free” as in “free speech,” not as in “free beer”). There are a number of related terms and abbreviations for free and open-source software (FOSS or F/OSS), or free/libre and open-source software (FLOSS or F/LOSS—FLOSS is the FSF-preferred term).

Although there is almost a complete overlap between free-software licenses and open-source-software licenses, there is a strong philosophical disagreement between the advocates of these two positions. The terminology of FOSS or "Free and Open-source software" was created to be neutral on these philosophical disagreements between the FSF and OSI and to provide a single unified term that could refer to both concepts.

3.1.1 Free software

Richard Stallman’s Free Software Definition, adopted by the Free Software Foundation (FSF), defines free software as a matter of liberty not price, and it upholds the Four Essential Freedoms. The earliest-known publication of the definition of his free-software idea was in the February 1986 edition of the FSF’s now-discontinued GNU’s Bulletin publication. The canonical source for the document is in the philosophy section of the GNU Project website. As of August 2017, it is published in 40 languages.

Four essential freedoms of Free Software

To meet the definition of ”free software”, the FSF requires the software’s licensing respect the civil liberties/human rights of what the FSF calls the software user’s ”Four Essential Freedoms”.

  1. The freedom to run the program as you wish, for any purpose (freedom 0).
  2. The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
  3. The freedom to redistribute copies so you can help others (freedom 2).
  4. The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

3.1.2 Open source

The open-source-software definition is used by the Open Source Initiative (OSI) to determine whether a software license qualifies for the organization's insignia for Open-source software. The definition was based on the Debian Free Software Guidelines, written and adapted primarily by Bruce Perens. Perens did not base his writing on the Four Essential Freedoms of free software from the Free Software Foundation, which were only later available on the web. Perens subsequently stated that he felt Eric Raymond's promotion of Open-source unfairly overshadowed the Free Software Foundation's efforts and reaffirmed his support for Free software, and he spoke about open source again in the 2000s.

3.1.3 Benefits

  1. Personal control, customizable and freedom

    Users of FOSS benefit from the Four Essential Freedoms to make unrestricted use of, and to study, copy, modify, and redistribute such software with or without modification. If they would like to change the functionality of the software, they can make changes to the code and, if they wish, distribute such modified versions, or, often depending on the software's decision-making model and its other users, even push or request such changes to be made via updates to the original software.

  2. Privacy and security

    Manufacturers of proprietary, closed-source software are sometimes pressured into building backdoors or other covert, undesired features into their software. Instead of having to trust software vendors, users of FOSS can inspect and verify the source code themselves and can place trust in a community of volunteers and users. As proprietary code is typically hidden from public view, only the vendors themselves and hackers may be aware of any vulnerabilities in it, while FOSS involves as many people as possible in exposing bugs quickly.

  3. Low costs or no costs

    FOSS is often free of charge although donations are often encouraged. This also allows users to better test and compare software.

  4. Quality, collaboration and efficiency

    FOSS allows for better collaboration among various parties and individuals with the goal of developing the most efficient software for its users or use-cases while proprietary software is typically meant to generate profits. Furthermore, in many cases more organizations and individuals contribute to such projects than to proprietary software. It has been shown that technical superiority is typically the primary reason why companies choose open source software.

3.1.4 Drawbacks

  1. Security and user-support

    According to Linus’s law the more people who can see and test a set of code, the more likely any flaws will be caught and fixed quickly. However, this does not guarantee a high level of participation. Having a grouping of full-time professionals behind a commercial product can in some cases be superior to FOSS.

    Furthermore, publicized source code might make it easier for hackers to find vulnerabilities in it and write exploits. This, however, assumes that such malicious hackers are more effective than white hat hackers who responsibly disclose or help fix the vulnerabilities, that no code leaks or exfiltrations occur, and that reverse engineering of proprietary code is a significant hindrance for malicious hackers.

  2. Hardware and software compatibility

    Sometimes, FOSS is not compatible with proprietary hardware or specific software. This is often due to manufacturers obstructing FOSS, such as by not disclosing the interfaces or other specifications needed for members of the FOSS movement to write drivers for their hardware, for instance because they wish customers to run only their own proprietary software or because they might benefit from partnerships.

  3. Bugs and missing features

    While FOSS can be superior to proprietary equivalents in terms of software features and stability, in many cases FOSS has more unfixed bugs and missing features when compared to similar commercial software. This varies per case and usually depends on the level of interest and participation in a FOSS project. Furthermore, unlike with typical commercial software, missing features and bugfixes can be implemented by any party that has the relevant motivation, time and skill to do so.

  4. Less guarantees of development

    There is often less certainty of FOSS projects gaining the required resources and participation for continued development than of commercial software backed by companies. However, companies also often abandon projects for being unprofitable, and large companies often rely on, and hence co-develop, open-source software.

3.1.5 History

From the 1950s and on through the 1980s, it was common for computer users to have the source code for all programs they used, and the permission and ability to modify it for their own use. Software, including source code, was commonly shared by individuals who used computers, often as public domain software. Most companies had a business model based on hardware sales, and provided or bundled software with hardware, free of charge.

By the late 1960s, the prevailing business model around software was changing. A growing and evolving software industry was competing with the hardware manufacturers' bundled software products; rather than funding software development from hardware revenue, these new companies were selling software directly. Leased machines required software support while providing no revenue for software, and some customers who were able to better meet their own needs did not want the costs of software bundled with hardware product costs. In United States vs. IBM, filed January 17, 1969, the government charged that bundled software was anti-competitive. While some software was still being provided without monetary cost and license restriction, there was a growing amount of software available only at a monetary cost and with restricted licensing. In the 1970s and early 1980s, some parts of the software industry began using technical measures (such as distributing only binary copies of computer programs) to prevent computer users from being able to use reverse engineering techniques to study and customize software they had paid for. In 1980, copyright law was extended to computer programs in the United States; previously, computer programs could be considered ideas, procedures, methods, systems, and processes, which are not copyrightable. Closed-source software remained uncommon until the mid-1970s to the 1980s, when IBM implemented an "object code only" policy in 1983, no longer distributing source code.

In 1983, Richard Stallman, longtime member of the hacker community at the MIT Artificial Intelligence Laboratory, announced the GNU project, saying that he had become frustrated with the effects of the change in culture of the computer industry and its users. Software development for the GNU operating system began in January 1984, and the Free Software Foundation (FSF) was founded in October 1985. An article outlining the project and its goals was published in March 1985 titled the GNU Manifesto. The manifesto included significant explanation of the GNU philosophy, Free Software Definition and ”copyleft” ideas. The FSF takes the position that the fundamental issue Free software addresses is an ethical one—to ensure software users can exercise what it calls ”The Four Essential Freedoms”.

The Linux kernel, created by Linus Torvalds, was released as freely modifiable source code in 1991. Initially, Linux was not released under either a Free software or an Open-source software license. However, with version 0.12 in February 1992, he re-licensed the project under the GNU General Public License.

FreeBSD and NetBSD (both derived from 386BSD) were released as Free software when the USL v. BSD lawsuit was settled out of court in 1993. OpenBSD forked from NetBSD in 1995. Also in 1995, The Apache HTTP Server, commonly referred to as Apache, was released under the Apache License 1.0.

In 1997, Eric Raymond published The Cathedral and the Bazaar, a reflective analysis of the hacker community and Free software principles. The paper received significant attention in early 1998, and was one factor in motivating Netscape Communications Corporation to release their popular Netscape Communicator Internet suite as Free software. This code is today better known as Mozilla Firefox and Thunderbird.

Netscape’s act prompted Raymond and others to look into how to bring the FSF’s Free software ideas and perceived benefits to the commercial software industry. They concluded that FSF’s social activism was not appealing to companies like Netscape, and looked for a way to rebrand the Free software movement to emphasize the business potential of sharing and collaborating on software source code. The new name they chose was ”Open-source”, and quickly Bruce Perens, publisher Tim O’Reilly, Linus Torvalds, and others signed on to the re-branding. The Open Source Initiative was founded in February 1998 to encourage the use of the new term and evangelize open-source principles.

While the Open Source Initiative sought to encourage the use of the new term and evangelize the principles it adhered to, commercial software vendors found themselves increasingly threatened by the concept of freely distributed software and universal access to an application’s source code. A Microsoft executive publicly stated in 2001 that ”Open-source is an intellectual property destroyer. I can’t imagine something that could be worse than this for the software business and the intellectual-property business.” This view perfectly summarizes the initial response to FOSS by some software corporations. For many years FOSS played a niche role outside of the mainstream of private software development. However the success of FOSS Operating Systems such as Linux, BSD and the companies based on FOSS such as Red Hat, has changed the software industry’s attitude and there has been a dramatic shift in the corporate philosophy concerning the development of Free and Open-source software (FOSS).

3.2 Linux

Linux is a family of open source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged in a Linux distribution.

Distributions include the Linux kernel and supporting system software and libraries, many of which are provided by the GNU Project. Many Linux distributions use the word ”Linux” in their name, but the Free Software Foundation uses the name GNU/Linux to emphasize the importance of GNU software, causing some controversy.

Popular Linux distributions include Debian, Fedora, and Ubuntu. Commercial distributions include Red Hat Enterprise Linux and SUSE Linux Enterprise Server. Desktop Linux distributions include a windowing system such as X11 or Wayland, and a desktop environment such as GNOME or KDE Plasma. Distributions intended for servers may omit graphics altogether, or include a solution stack such as LAMP. Because Linux is freely redistributable, anyone may create a distribution for any purpose.

Linux was originally developed for personal computers based on the Intel x86 architecture, but has since been ported to more platforms than any other operating system. Because of the dominance of Android on smart-phones, Linux also has the largest installed base of all general-purpose operating systems. Although it is used by only around 2.3 percent of desktop computers, the Chromebook, which runs the Linux kernel-based Chrome OS, dominates the US K–12 education market and represents nearly 20 percent of sub-$300 notebook sales in the US. Linux is the leading operating system on servers (over 96.4% of the top 1 million web servers’ operating systems are Linux), leads other big iron systems such as mainframe computers, and is the only OS used on TOP500 supercomputers (since November 2017, having gradually eliminated all competitors).

Linux also runs on embedded systems, i.e. devices whose operating system is typically built into the firmware and is highly tailored to the system. This includes routers, automation controls, smart home technology (like Google Nest), televisions (Samsung and LG Smart TVs use Tizen and WebOS, respectively), automobiles (for example, Tesla, Audi, Mercedes-Benz, Hyundai, and Toyota all rely on Linux), digital video recorders, video game consoles, and smartwatches. The Falcon 9's and the Dragon 2's avionics use a customized version of Linux.

Linux is one of the most prominent examples of free and open-source software collaboration. The source code may be used, modified and distributed commercially or non-commercially by anyone under the terms of its respective licenses, such as the GNU General Public License.

3.2.1 History

Precursors

The Unix operating system was conceived and implemented in 1969, at AT&T’s Bell Laboratories in the United States by Ken Thompson, Dennis Ritchie, Douglas McIlroy, and Joe Ossanna. First released in 1971, Unix was written entirely in assembly language, as was common practice at the time. In 1973 in a key, pioneering approach, it was rewritten in the C programming language by Dennis Ritchie (with the exception of some hardware and I/O routines). The availability of a high-level language implementation of Unix made its porting to different computer platforms easier.


Figure 3.1: Linus Torvalds


Due to an earlier antitrust case forbidding it from entering the computer business, AT&T was required to license the operating system's source code to anyone who asked. As a result, Unix grew quickly and became widely adopted by academic institutions and businesses. In 1984, AT&T divested itself of Bell Labs; freed of the legal obligation requiring free licensing, Bell Labs began selling Unix as a proprietary product, where users were not legally allowed to modify Unix. The GNU Project, started in 1983 by Richard Stallman, had the goal of creating a "complete Unix-compatible software system" composed entirely of free software. Work began in 1984. Later, in 1985, Stallman started the Free Software Foundation and wrote the GNU General Public License (GNU GPL) in 1989. By the early 1990s, many of the programs required in an operating system (such as libraries, compilers, text editors, a Unix shell, and a windowing system) were completed, although low-level elements such as device drivers, daemons, and the kernel, called GNU Hurd, were stalled and incomplete.

Linus Torvalds has stated that if the GNU kernel had been available at the time (1991), he would not have decided to write his own. Although not released until 1992, due to legal complications, development of 386BSD, from which NetBSD, OpenBSD and FreeBSD descended, predated that of Linux. Torvalds has also stated that if 386BSD had been available at the time, he probably would not have created Linux.

MINIX was created by Andrew S. Tanenbaum, a computer science professor, and released in 1987 as a minimal Unix-like operating system targeted at students and others who wanted to learn the operating system principles. Although the complete source code of MINIX was freely available, the licensing terms prevented it from being free software until the licensing changed in April 2000.

Creation

In 1991, while attending the University of Helsinki, Torvalds became curious about operating systems. Frustrated by the licensing of MINIX, which at the time limited it to educational use only, he began to work on his own operating system kernel, which eventually became the Linux kernel.

Torvalds began the development of the Linux kernel on MINIX and applications written for MINIX were also used on Linux. Later, Linux matured and further Linux kernel development took place on Linux systems. GNU applications also replaced all MINIX components, because it was advantageous to use the freely available code from the GNU Project with the fledgling operating system; code licensed under the GNU GPL can be reused in other computer programs as long as they also are released under the same or a compatible license. Torvalds initiated a switch from his original license, which prohibited commercial redistribution, to the GNU GPL. Developers worked to integrate GNU components with the Linux kernel, making a fully functional and free operating system.

Naming

Linus Torvalds had wanted to call his invention ”Freax”, a portmanteau of ”free”, ”freak”, and ”x” (as an allusion to Unix). During the start of his work on the system, some of the project’s makefiles included the name ”Freax” for about half a year. Torvalds had already considered the name ”Linux”, but initially dismissed it as too egotistical.

In order to facilitate development, the files were uploaded to the FTP server (ftp.funet.fi) of FUNET in September 1991. Ari Lemmke, Torvalds’ coworker at the Helsinki University of Technology (HUT), who was one of the volunteer administrators for the FTP server at the time, did not think that ”Freax” was a good name. So, he named the project ”Linux” on the server without consulting Torvalds. Later, however, Torvalds consented to ”Linux”.

According to a newsgroup post by Torvalds, the word "Linux" should be pronounced LIN-uuks with a short 'i' as in 'print' and 'u' as in 'put'. To further demonstrate how the word "Linux" should be pronounced, he included an audio guide with the kernel source code. Contradictorily, in this recording, he pronounces 'Linux' LEEN-uuks with a short but close unrounded front vowel.




3.2.2 Commercial and popular uptake

Adoption of Linux in production environments, rather than being used only by hobbyists, started to take off first in the mid-1990s in the supercomputing community, where organizations such as NASA started to replace their increasingly expensive machines with clusters of inexpensive commodity computers running Linux. Commercial use began when Dell and IBM, followed by Hewlett-Packard, started offering Linux support to escape Microsoft’s monopoly in the desktop operating system market.


Figure 3.2: Ubuntu OS


Today, Linux systems are used throughout computing, from embedded systems to virtually all supercomputers, and have secured a place in server installations such as the popular LAMP application stack. Use of Linux distributions in home and enterprise desktops has been growing. Linux distributions have also become popular in the netbook market, with many devices shipping with customized Linux distributions installed, and Google releasing their own Chrome OS designed for netbooks.


Figure 3.3: Nexus smart-phone with Android OS


Linux's greatest success in the consumer market is perhaps the mobile device market, with Android being one of the most dominant operating systems on smart-phones and very popular on tablets and, more recently, on wearables. Linux gaming is also on the rise with Valve showing its support for Linux and rolling out SteamOS, its own gaming-oriented Linux distribution. Linux distributions have also gained popularity with various local and national governments, such as the federal government of Brazil.

3.2.3 Community

A distribution is largely driven by its developer and user communities. Some vendors develop and fund their distributions on a volunteer basis, Debian being a well-known example. Others maintain a community version of their commercial distributions, as Red Hat does with Fedora, and SUSE does with openSUSE.

In many cities and regions, local associations known as Linux User Groups (LUGs) seek to promote their preferred distribution and by extension free software. They hold meetings and provide free demonstrations, training, technical support, and operating system installation to new users. Many Internet communities also provide support to Linux users and developers. Most distributions and free software/open-source projects have IRC chat rooms or newsgroups. Online forums are another means for support, with notable examples being LinuxQuestions.org and the various distribution specific support and community forums, such as ones for Ubuntu, Fedora, and Gentoo. Linux distributions host mailing lists; commonly there will be a specific topic such as usage or development for a given list.

There are several technology websites with a Linux focus. Print magazines on Linux often bundle cover disks that carry software or even complete Linux distributions. Although Linux distributions are generally available without charge, several large corporations sell, support, and contribute to the development of the components of the system and of free software. An analysis of the Linux kernel showed 75 percent of the code from December 2008 to January 2010 was developed by programmers working for corporations, leaving about 18 percent to volunteers and 7% unclassified. Major corporations that provide contributions include Dell, IBM, HP, Oracle, Sun Microsystems (now part of Oracle) and Nokia. A number of corporations, notably Red Hat, Canonical and SUSE, have built a significant business around Linux distributions.

The free software licenses, on which the various software packages of a distribution built on the Linux kernel are based, explicitly accommodate and encourage commercialization; the relationship between a Linux distribution as a whole and individual vendors may be seen as symbiotic. One common business model of commercial suppliers is charging for support, especially for business users. A number of companies also offer a specialized business version of their distribution, which adds proprietary support packages and tools to administer higher numbers of installations or to simplify administrative tasks.

Another business model is to give away the software in order to sell hardware. This used to be the norm in the computer industry, with operating systems such as CP/M, Apple DOS and versions of Mac OS prior to 7.6 freely copyable (but not modifiable). As computer hardware standardized throughout the 1980s, it became more difficult for hardware manufacturers to profit from this tactic, as the OS would run on any manufacturer’s computer that shared the same architecture.


Chapter 4
Hadoop Installation

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

  1. Hadoop Common: The common utilities that support the other Hadoop modules.
  2. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  3. Hadoop YARN: A framework for job scheduling and cluster resource management.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

  1. AmbariTM: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  2. AvroTM: A data serialization system.
  3. CassandraTM: A scalable multi-master database with no single points of failure.
  4. ChukwaTM: A data collection system for managing large distributed systems.
  5. HBaseTM: A scalable, distributed database that supports structured data storage for large tables.
  6. HiveTM: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  7. MahoutTM: A Scalable machine learning and data mining library.
  8. PigTM: A high-level data-flow language and execution framework for parallel computation.
  9. SparkTM: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  10. TezTM: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by HiveTM, PigTM and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace HadoopTM MapReduce as the underlying execution engine.
  11. ZooKeeperTM: A high-performance coordination service for distributed applications.

4.1 Hadoop - Installation

The release of Hadoop current at the time of writing is 2.7.2, which follows the previous stable release 2.7.1. There is an excellent source of documentation on how to install Hadoop on Linux machines at https://hadoop.apache.org/docs/r2.7.2/index.html. There are mainly three components in Hadoop: Hadoop Common, HDFS, and YARN.

  1. Hadoop-Common: Used to access WebHDFS via a proxy server. Also useful along with Hadoop compatible File System (HCFS).
  2. HDFS: Supports POSIX-style file system. 5
  3. YARN: Supports administration like write, modify, kill operations. Stores generic and application specific information. Manages queues.

Hadoop can be installed in two different ways, namely Single Node Cluster (SNC) and Multinode Cluster (MC). You may visit https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html for SNC and https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/ClusterSetup.html for MC.

This manual deals with the SNC setup. The software was tested on a 2-core, 64-bit machine running Ubuntu 16.04 on a genuine Intel processor at 2.9 GHz. The following is the procedure to install Hadoop.

4.2 History

According to its co-founders, Doug Cutting and Mike Cafarella, the genesis of Hadoop was the Google File System paper that was published in October 2003. This paper spawned another one from Google – “MapReduce: Simplified Data Processing on Large Clusters”. Development started on the Apache Nutch project, but was moved to the new Hadoop sub-project in January 2006. Doug Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant. The initial code that was factored out of Nutch consisted of about 5,000 lines of code for HDFS and about 6,000 lines of code for MapReduce.

In March 2006, Owen O’Malley was the first committer to add to the Hadoop project; Hadoop 0.1.0 was released in April 2006. It continues to evolve through contributions that are being made to the project.

4.3 Installation of JAVA

Java is one of the important prerequisites for setting up Hadoop. There are a number of online resources that describe how to set up Java in Ubuntu. The following is the procedure to install Java from the Ubuntu Terminal. 6

sudo apt-get install default-jre default-jdk
java -version

If a previous version causes conflicts, uninstall it or use update-alternatives --config java to select the correct Java installation. You may also check the Java installation with which java and whereis java.

4.3.1 Update JAVA HOME variable

Test whether Java is working with the commands java and javac; these commands should return usage/help information, otherwise your installation is wrong. One reason might be that the system is not able to locate the Java installation directory. In that case, do as below to update the JAVA_HOME variable.

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
export PATH=$PATH:/usr/lib/jvm/java-7-openjdk/bin

The above code is applicable only to default-jdk. There is also another way to install default-jdk: through the Ubuntu Software Center. Since 14.04, Ubuntu has had the Unity Desktop Environment; press Alt+F1 to bring up the left-side Unity task bar, where you can find the Ubuntu Software Center. The Software Center has a good search facility; search for OpenJDK and install it.

4.4 Creating Hadoop User Group

It is always better to have a separate user group for Hadoop work, because many of the user commands used through Hadoop are more or less similar to certain native Bash commands. So it is always advisable to install Hadoop under a separate working user name. The following is the procedure to create the hadoop group and the hduser user.

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

While executing the second statement, the system will prompt for a password; you may give a simple, easy-to-remember word for it. This password will anyway be suppressed through the SSH setup explained in the subsequent section. Now it is time to give root privileges to the newly created user (hduser).

Open a separate terminal (Ctrl+Alt+T) and do as below:

sudo su root
sudo gedit /etc/sudoers

The above commands open a text file known as sudoers; in this file, find the line

root ALL=(ALL:ALL) ALL

under “#User privilege specification” and insert the following line

hduser ALL=(ALL:ALL) ALL

4.5 Install and configure open-ssh-server

OpenSSH is a freely available version of the Secure Shell (SSH) protocol family of tools for remotely controlling, or transferring files between, computers. OpenSSH can use many authentication methods, including plain password, public key, and Kerberos tickets. 7 There is a good resource on managing authentication mechanisms through OpenSSH at https://help.ubuntu.com/lts/serverguide/openssh-server.html; also visit http://www.openssh.com/. For the time being you may do as follows:

sudo apt-get install openssh-server
sudo su hduser
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The first statement installs openssh-server on the computer. The second statement lets you change the user (from your default user to hduser). The third statement creates an RSA key pair with an empty passphrase, 8 and the last statement appends the public key from $HOME/.ssh/id_rsa.pub to the authorized_keys file in the same directory.

At this point you may reboot the system by simply executing the command reboot at the Terminal. However, this is optional; you may proceed without rebooting your machine as well. There is another task that some technicians do at this point, i.e. disabling IPv6. This is also optional; Hadoop works with both IPv4 and IPv6, and the performance differences are minimal.
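
Before moving on, it is worth a quick check that hduser can now reach the local machine over SSH without a password. This is not part of the official procedure, just a sanity check assuming the key setup above worked:

sudo su hduser    # switch to the Hadoop user if you are not already hduser
ssh localhost     # should log in without asking for a password (answer 'yes' to the host-key prompt the first time)
exit              # leave the SSH session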

4.6 Download & Setup Hadoop

Go to https://hadoop.apache.org, click on Releases, and download the release of your interest by clicking on one of the mirrors, e.g. the first mirror http://redrockdigimark.com/apachemirror/hadoop/common. You may also try downloading Hadoop directly by copying and pasting http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz. This URL was active at the time of writing this manual. After downloading, do as below:

  1. Copy the .tar.gz file to the Desktop
  2. Extract the file
  3. Copy the extracted folder to /usr/local/hadoop

Assume that your extracted files are at Desktop folder i.e. /home/Desktop/ in your user account. Then you might execute the below statement in the Terminal.

sudo mv /home/Desktop/your..hadoop..folder/ /usr/local/hadoop

In the above code, your..hadoop..folder/ is the name of your extracted Hadoop folder. We have now created a folder named hadoop under the directory /usr/local in which our Hadoop files exist. Since we created this folder, it is better to give ownership of it to the user hduser, because whatever we do later uses the files living under this folder.

sudo chown hduser:hadoop -R /usr/local/hadoop

4.6.1 Temporary working directories for NameNode and DataNode

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. A DataNode stores the actual data in the Hadoop Distributed File System (HDFS). HDFS is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. These file system nodes use certain temporary folders to track, change and execute user requests. Do as below to create temporary directories for the NameNode and DataNode.

sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode
sudo chown hduser:hadoop -R /usr/local/hadoop_tmp/

The last (third) statement above gives both user and group ownership of the newly created folder hadoop_tmp to hduser.

4.7 Configuring Hadoop

4.7.1 ./bashrc

.bashrc is the file that is executed when you open a terminal window (like gnome-terminal) for text-based interaction with the system from the command line. Open the file .bashrc in your home directory in a text editor program like vi or gedit. Once you open .bashrc, go to the end of the file and put the following code.

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$PATH:/usr/local/hadoop/bin
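
These variables take effect only in newly opened terminals. To apply them to the current session and confirm they are set, you may reload .bashrc as sketched below (a small verification step, not part of the original listing):

source ~/.bashrc
echo $HADOOP_HOME    # should print /usr/local/hadoop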

4.7.2 hadoop-env.sh

Hadoop needs to know where Java is installed on the computer. Go to the directory /usr/local/hadoop/etc/hadoop/, open hadoop-env.sh, and find the line that sets JAVA_HOME; update the Java installation directory on that line. At the beginning of this chapter we installed default-jdk, so update the following line:

export JAVA_HOME=${JAVA_HOME}

You may comment out the above line and write your own statement, such as export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64, or something like that.

4.7.3 core-site.xml

Open the file with sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml and put the following code between the <configuration> and </configuration> tags.

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

4.7.4 hdfs-site.xml

Open hdfs-site.xml in the same way and put the code below between the configuration tags.

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_tmp/hdfs/datanode</value>
</property>

4.7.5 yarn-site.xml

Open yarn-site.xml in the same way and put the following code between the configuration tags.

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

4.7.6 mapred-site.xml

mapred-site.xml is shipped only as a template. Execute the code below to create it from the template (keeping the old template file as it is) and open it.

cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml

And put the following code in it, again between the configuration tags.

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

4.8 Start the Hadoop

Now it is time to start Hadoop, first by formatting the NameNode:

hdfs namenode -format

You should see the message "namenode has been successfully formatted" near the bottom of the output (around the fifth line from the end). If that output appears, the NameNode is formatted correctly. Now you may start the daemons as below:

start-dfs.sh
start-yarn.sh

You must start these daemons one by one to keep Hadoop running. Now execute hadoop version at the console and you should get the relevant version information.

4.9 Compiling Native Binaries

After installing Hadoop, check the native binaries, for example with hadoop checknative at the console.

If everything is correct, your compiled libraries are compatible with your architecture and each library is reported as available. Otherwise you may get the following message:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable

If the response to your hadoop checknative command is as above, then something is wrong on your machine. The reason could be that the compiled Hadoop libraries are not correct for your machine. In that case you have to compile the native libraries yourself.

If the native libraries are not properly installed, the user might experience problems at every phase of using Hadoop, such as formatting the NameNode, starting daemons and using DFS resources. One potential reason for such problems is installing compiled libraries from the archives: Hadoop might have been built using resources different from those on the user's machine, which can lead to problems while executing commands. To avoid such problems the user may have to compile the libraries on his/her own system.

Please read the instructions at https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/NativeLibraries.html before compiling Hadoop on your machine. The following steps show the procedure to compile the Hadoop native libraries in Ubuntu.

There are certain prerequisites that need to be installed on your system before compiling. They are as follows (a sketch of the corresponding Ubuntu commands is given at the end of this section):

  1. Install Maven
  2. Install Protocol Buffers (also known as protobuf)
  3. Install the GNU toolchain (gcc, g++, make and related build tools)
  4. Download the Hadoop source from the Apache releases page at http://hadoop.apache.org/releases.html
  5. Change to the downloaded (extracted) source directory
  6. Compile with the help of the mvn package -Pdist,native -DskipTests -Dtar command

Then go to the hadoop-dist/target/hadoop-2.4.1/lib/native directory and copy libhadoop.so to your native library directory (the lib directory under your Hadoop home).
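
As a rough sketch, the prerequisites listed above can be installed on Ubuntu along the following lines. The package names, the protobuf version and the download URL are assumptions that depend on your Ubuntu and Hadoop releases (check BUILDING.txt inside the Hadoop source tree), so treat this only as a starting point.

sudo apt-get install maven                                        # item 1: Maven build tool
sudo apt-get install protobuf-compiler                            # item 2: protobuf (verify the version your release expects)
sudo apt-get install build-essential cmake zlib1g-dev libssl-dev  # item 3: GNU toolchain and native build dependencies
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz   # item 4: source release (example version and URL)
tar -xzf hadoop-2.4.1-src.tar.gz && cd hadoop-2.4.1-src                                   # item 5: change to the source directory
mvn package -Pdist,native -DskipTests -Dtar                                               # item 6: build the native distribution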

4.10 Hadoop common

The Hadoop distribution has a handful of base modules that are available to users immediately after installation; they were listed at the beginning of this chapter. These modules are different from the Hadoop-related projects, also known as the Hadoop ecosystem. 9

This section deals with Hadoop Common; there are other sections, or perhaps chapters, for the rest of the modules. Hadoop Common refers to the collection of common utilities and libraries that support other Hadoop modules. It is an essential part or module of the Apache Hadoop Framework, along with the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce. Like all other modules, Hadoop Common assumes that hardware failures are common and that these should be automatically handled in software by the Hadoop Framework. Hadoop Common is also known as Hadoop Core.

The Hadoop Common package is considered as the base/core of the framework as it provides essential services and basic processes such as abstraction of the underlying operating system and its file system. Hadoop Common also contains the necessary Java Archive (JAR) files and scripts required to start Hadoop. The Hadoop Common package also provides source code and documentation, as well as a contribution section that includes different projects from the Hadoop Community.

Every file is a JAR file and every command related to Hadoop is backed by a Java class. Where do we find the Hadoop Common JARs and their classes? There is a directory $HADOOP_HOME/share/hadoop/common in which you may find the JAR files related to Hadoop Common. The tree structure looks like below on Ubuntu.

##@##-##:/usr/local/hadoop/share/hadoop/common$ tree
.
    hadoop-common-3.2.1.jar
    hadoop-common-3.2.1-tests.jar
    hadoop-kms-3.2.1.jar
    hadoop-nfs-3.2.1.jar

To see the classes, it is possible to list the JAR contents with jar tf hadoop-common-3.2.1.jar, but this command returns a lot of information to the Terminal; don't panic. Alternatively, you can do as below to find out how many classes are available in a base JAR file. Each class has a unique purpose and activity to perform.

jar tvf hadoop-common-3.2.1.jar | grep .class | wc -l

The above statement returned 2303, which means there are that many class files inside the hadoop-common-3.2.1.jar file.

4.10.1 Core Commands

Core commands are those commands that are available immediately after installing Hadoop; there is no need to start the daemons first. There are roughly 20 such commands. The table below shows the purpose of these commands. All of the Hadoop commands and subprojects follow the same basic structure:

Usage: shellcommand [SHELL_OPTIONS] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]

Below are the options available with the core shell command hadoop.




archive: Creates a Hadoop archive.
checknative: Checks the availability of the Hadoop native code.
classpath: Prints the class path needed to get the Hadoop jar and the required libraries. If called without arguments, prints the classpath set up by the command scripts, which is likely to contain wildcards in the classpath entries. Additional options print the classpath after wildcard expansion or write the classpath into the manifest of a jar file. The latter is useful in environments where wildcards cannot be used and the expanded classpath exceeds the maximum supported command line length.
conftest: Validates configuration XML files. If the -conffile option is not specified, the files in $HADOOP_CONF_DIR whose names end with .xml will be verified.
credential: Command to manage credentials, passwords and secrets within credential providers.
distcp: Copies files or directories recursively.
dtutil: Utility to fetch and manage Hadoop delegation tokens inside credentials files.
fs: The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports. There are roughly 50 sub-commands that are useful to manage HDFS.
gridmix: Gridmix is a benchmark tool for Hadoop clusters.
jar: Runs a jar file. Use yarn jar to launch YARN applications instead.
jnipath: Prints the computed java.library.path.
kerbname: Converts the named principal via the auth_to_local rules to the Hadoop user name.
kdiag: Diagnoses Kerberos problems.
key: Manages keys via the KeyProvider.
kms: Runs KMS, the Key Management Server.
trace: Views and modifies Hadoop tracing settings.
version: Prints the version.
CLASSNAME: Runs the class named CLASSNAME. The class must be part of a package.
envvars: Displays computed Hadoop environment variables.

Table 4.1: Hadoop Common Commands

Apart from the above, there are a few administrative and file system commands for various user operations related to the Hadoop cluster. For instance, it is possible to find out the Hadoop version by executing

mk@mk-hp:~$ hadoop version

Hadoop 3.2.1

It is possible to check whether all the native requirements are satisfied for Hadoop by using

mk@mk-hp:~$ hadoop checknative

Native library checking:
hadoop:  true /usr/local/hadoop/lib/native/libhadoop.so.1.0.0
zlib:    true /lib/x86_64-linux-gnu/libz.so.1
zstd:    true /lib/x86_64-linux-gnu/libzstd.so.1
snappy:  true /lib/x86_64-linux-gnu/libsnappy.so.1
lz4:     true revision:10301
bzip2:   true /lib/x86_64-linux-gnu/libbz2.so.1
openssl: false EVP_CIPHER_CTX_cleanup
ISA-L:   false libhadoop was built without ISA-L support

Recall that Hadoop has a few environment variables which need to be defined in the .bashrc file in the home directory. These settings can be inspected using envvars.

mk@mk-hp:~$ hadoop envvars
JAVA_HOME='/usr/lib/jvm/java-11-openjdk-amd64'
HADOOP_COMMON_HOME='/usr/local/hadoop'
HADOOP_COMMON_DIR='share/hadoop/common'
HADOOP_COMMON_LIB_JARS_DIR='share/hadoop/common/lib'
HADOOP_COMMON_LIB_NATIVE_DIR='/usr/local/hadoop/lib/native'
HADOOP_CONF_DIR='/usr/local/hadoop/etc/hadoop'
HADOOP_TOOLS_HOME='/usr/local/hadoop'
HADOOP_TOOLS_DIR='share/hadoop/tools'
HADOOP_TOOLS_LIB_JARS_DIR='share/hadoop/tools/lib'

The other very frequently used commands are fs, classpath and jar. The command fs is used to manage HDFS, while classpath and jar are frequently used while working with MapReduce programs. These commands are discussed in the forthcoming chapters.
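
As a brief taste of these commands before that discussion, the sketch below assumes the daemons from Section 4.8 are running; the directory name is arbitrary.

hadoop fs -mkdir -p /user/hduser/demo     # create a directory in HDFS
hadoop fs -ls /user/hduser                # list it
hadoop classpath | tr ':' '\n' | head     # show the first few entries of the Hadoop classpath, one per line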

4.10.2 Hadoop daemons

It is better to know how to start and stop the Hadoop daemons before going further with Hadoop Common. Hadoop does not start immediately after installation; the Hadoop services have to be started and stopped manually. Every Hadoop core service, viz. HDFS, MapReduce and YARN, has a script to start and stop it. These scripts are known as startup scripts, and they take care of starting and stopping various components such as the NameNode, DataNode, SecondaryNameNode and ResourceManager. For instance, to start HDFS it is required to execute start-dfs.sh.

The $HADOOP_INSTALL/hadoop/bin directory contains the scripts used to launch the Hadoop DFS and Hadoop MapReduce daemons, such as start-dfs.sh, stop-dfs.sh, start-yarn.sh, stop-yarn.sh, start-all.sh and stop-all.sh.

It is always easy to start all the daemons at once using start-all.sh. However, it is not the recommended method for starting Hadoop daemons.

mk@mk-hp:~$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as mk in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [0.0.0.0]
Starting datanodes
Starting secondary namenodes [mk-hp]
Starting resourcemanager
Starting nodemanagers
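
The corresponding stop scripts shut the daemons down again; stop-all.sh also exists but, like start-all.sh, is not recommended for production use. A minimal sketch:

stop-yarn.sh    # stops the ResourceManager and NodeManagers
stop-dfs.sh     # stops the NameNode, DataNodes and the SecondaryNameNode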

It is possible to check the status of Hadoop daemons or processes using JPS which stands for Java (Virtual Machine) Process Status.

mk@mk-hp:~$ jps | wc -l
4

mk@mk-hp:~$ jps
23025 NameNode
23154 DataNode
23357 SecondaryNameNode
23951 Jps

Hadoop comes with wonderful web user interfaces. It is possible to check and observe each daemon using its respective URI. 10 At this point, go to the browser and use the URI http://localhost:9870 to explore the NameNode, http://localhost:9864 to explore the DataNode, and finally http://localhost:9868 for the SecondaryNameNode. These ports tend to change: the default port for DFS used to be 9000, which changed in Hadoop 3.2.* to different ports, and Hadoop versions keep changing the defaults. I use different versions of Hadoop side by side for testing and my other research needs. There is one way to figure out problems related to ports: just use the following command.

netstat -lpten | grep java

The command netstat is part of a package known as net-tools. In Ubuntu, it can be installed using sudo apt install net-tools. I figured out the different ports using netstat; the information regarding Hadoop (3.2.*) ports is as follows.




Port         Service
8042         ResourceManager
9864         DataNode
9865/66/67   DataNodes (Slaves)
9868         Secondary NameNode
9870         HDFS (NameNode)
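
On a machine without a browser (for example a headless server), the same web interfaces can be probed from the shell. The curl call below is only a convenience check, assuming the NameNode UI is on port 9870 as in the table above.

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870/   # prints 200 if the NameNode web UI is answering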



Notes

5A set of formal descriptions that provide a standard for the design of operating systems, especially ones which are compatible with Unix. For more information read https://en.wikipedia.org/wiki/POSIX

6Use Ctl+Alt+T to open Terminal or Console in Ubuntu

7Kerberos is a computer network authentication protocol that works on the basis of ’tickets’ to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.

8Visit http://man.openbsd.org/OpenBSD-current/man1/ssh-keygen.1 for more information.

9Visit https://hadoop.apache.org/ for more information on Hadoop ecosystem.

10There is a very detailed demonstration of these URIs at https://kontext.tech/column/hadoop/265/default-ports-used-by-hadoop-services-hdfs-mapreduce-yarn. By the way, don't rely only on the official documentation; none of those pages are perfect. This demonstration relates to Hadoop 3.2.*, which is the latest Hadoop distribution at the time of writing.

Chapter 5
Core JAVA

5.1 What is Java?

Java is a programming language and a platform. 11 Java's history is interesting to know. It starts with the Green Team: the Java team members (also known as the Green Team) initiated a revolutionary project to develop a language for digital devices such as set-top boxes and televisions. For the Green Team members, it was an advanced concept at that time, but it turned out to be well suited for internet programming. Later, Java technology was incorporated by Netscape. Currently, Java is used in internet programming, mobile devices, games, e-business solutions and so on. The following are the major points that describe the history of Java. James Gosling, Mike Sheridan, and Patrick Naughton initiated the Java language project in June 1991; this small team of Sun engineers was called the Green Team. The language was originally designed for small, embedded systems in electronic appliances like set-top boxes. Firstly, it was called "Greentalk" by James Gosling and the file extension was .gt. After that, it was called Oak and was developed as part of the Green project; the oak is a symbol of strength and is chosen as the national tree of many countries such as the U.S.A., France, Germany and Romania. In 1995, Oak was renamed "Java" because Oak was already a trademark of Oak Technologies. In 1995, Time magazine called Java one of the Ten Best Products of 1995. The official JDK 1.0 was released on January 23, 1996. On 13 November 2006, Sun released much of Java as free and open source software under the terms of the GNU General Public License (GPL). On 8 May 2007, Sun finished the process, making all of Java's core code free and open-source, aside from a small portion of code to which Sun did not hold the copyright.

The latest release of the Java Standard Edition at the time of writing is Java SE 8. With the advancement of Java and its widespread popularity, multiple configurations were built to suit various types of platforms, e.g. J2EE for enterprise applications and J2ME for mobile applications. The newer J2 versions were renamed Java SE, Java EE and Java ME respectively. Java is guaranteed to be Write Once, Run Anywhere.

5.1.1 Features of Java

Java has many features, which are also known as the Java buzzwords. The features listed below are simple and easy to understand.

  1. Simple
  2. Object-Oriented
  3. Platform independent
  4. Secured
  5. Robust
  6. Architecture neutral
  7. Portable
  8. Dynamic
  9. Interpreted
  10. High Performance
  11. Multithreaded
  12. Distributed

5.1.2 Requirements for writing Java programs

For performing the examples discussed in this tutorial, you will need at least a Pentium 200-MHz computer with a minimum of 64 MB of RAM (128 MB of RAM recommended).

You also will need the following software:

  1. Linux 7.1 or Windows XP/7/8 operating system.
  2. Java JDK 8
  3. Microsoft Notepad or any other text editor

5.2 Installation

This manual uses Ubuntu 16.04 and the gedit text editor for the examples. Ubuntu, or any other Linux machine, can accommodate Java very easily. For instance, in Ubuntu go to the Software Center, choose OpenJDK and just press install. Otherwise, if you prefer the Terminal, use the following statement in the Console/Terminal (note: in Ubuntu you may open the Terminal or Console with the shortcut keys Ctrl+Alt+T).

1sudo apt-get install default-jdk default-jre

The above statement takes a couple of minutes depending upon your internet speed. The advantage of Linux (Ubuntu) is that you don't need to worry about setting up the PATH and so on. Just go to the Terminal and verify the installation by executing java -version. The Console should return a message like the following.

1openjdk version "1.8.0_91" 
2OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14) 
3OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

openjdk version "1.8.0_91" shows that the Java version is 8. There are lots of online resources with abundant information on how to install and set up Java on other platforms such as Windows and Mac. In case you have multiple distributions (installations) of Java, you may try update-alternatives --config java to switch between versions or types.

5.3 Editors

To write your Java programs, you will need a text editor. There are even more sophisticated IDEs available in the market. But for now, you can consider one of the following:

  1. Text Editor like gedit, vi, nano in Linux, or Notepad in Windows.
  2. Netbeans
  3. Eclipse

We will be using gedit, a free GNOME text editor available by default in Ubuntu.

5.3.1 Common mistakes

The following are some of the common mistakes we might make while writing and executing code.

  1. Check the slash symbol before a comment line; it must be a forward slash (as in //), not a backslash (\).
  2. If you are using arrays check the index both at declaration and in output script.
  3. In the output statement System.out.println(), there must not be a comma separating the text and the variable. For instance, System.out.println("The out is: ", + a) is wrong; System.out.println("The out is: " + a) is correct.

5.4 Programming

5.4.1 Java is object-oriented

Object-oriented means we organize our software as a combination of different types of objects that incorporate both data and behavior. Object-oriented programming (OOP) is a methodology that simplifies software development and maintenance by providing some rules. The basic concepts of OOP are:

  1. Object
  2. Class
  3. Inheritance
  4. Polymorphism
  5. Abstraction
  6. Encapsulation

5.4.2 Objects and Classes

Object

Objects have states and behaviors. Example: a dog has states - color, name, breed - as well as behaviors - wagging, barking, eating. An object is an instance of a class. Let us now look more deeply into what objects are. If we consider the real world we can find many objects around us: cars, dogs, humans, etc. All these objects have a state and behavior. If we consider a dog, then its state is its name, breed and color, and its behavior is barking, wagging and running. If you compare a software object with a real-world object, they have very similar characteristics. Software objects also have a state and behavior. A software object's state is stored in fields and its behavior is shown via methods. So in software development, methods operate on the internal state of an object and object-to-object communication is done via methods.

Class

A class is a blue print from which individual objects are created.

1public class Dog{ 
2 
3  String breed; // state 
4  int age; // state 
5  String color;// state 
6 
7  void barking(){// behavior 
8  } 
9 
10  void hungry(){ 
11  } 
12 
13  void sleeping(){ 
14  } 
15}

A class can contain any of the following variable types; a short sketch after the list shows all three.

  1. Local variable: Variables defined inside methods, constructors or blocks are called local variables.
  2. Instance variable: Instance variables are variables within a class but outside any method.
  3. Class variable: Class variables are variables declared with in a class, outside any method, with the static keyword.
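
As a quick illustration, the following sketch (the class name VarDemo and its fields are hypothetical, not taken from the text) shows all three kinds of variables in one class.

class VarDemo {
    static int classCounter = 0;   // class variable: declared with static, shared by all objects
    int instanceValue = 10;        // instance variable: inside the class, outside any method

    void show() {
        int localValue = 5;        // local variable: defined inside a method
        System.out.println(classCounter + " " + instanceValue + " " + localValue);
    }

    public static void main(String[] args) {
        new VarDemo().show();      // prints: 0 10 5
    }
}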

Constructor

When discussing classes, one of the most important sub-topics is constructors. Every class has a constructor: if we do not explicitly write a constructor for a class, the Java compiler builds a default constructor for that class.

Each time a new object is created, at least one constructor will be invoked. The main rule of constructors is that they should have the same name as the class. A class can have more than one constructor.

Example of a constructor is given below:

1public class Puppy{ 
2  public Puppy(){ 
3  } 
4 
5  public Puppy(String name){ 
6    // This constructor has one parameter, name. 
7  } 
8}

The below code shows instance variables and methods in a class.

1public class Puppy{ 
2 
3  int puppyAge; 
4 
5  public Puppy(String name){ 
6    // This constructor has one parameter, name. 
7    System.out.println("Name chosen is :" + name ); 
8  } 
9 
10  public void setAge( int age ){ 
11     puppyAge = age; 
12  } 
13 
14  public int getAge( ){ 
15     System.out.println("Puppy's age is :" + puppyAge ); 
16     return puppyAge; 
17  } 
18 
19  public static void main(String []args){ 
20    /* Object creation */ 
21    Puppy myPuppy = new Puppy( "tommy" ); 
22 
23    /* Call class method to set puppy's age */ 
24    myPuppy.setAge( 2 ); 
25 
26    /* Call another class method to get puppy's age */ 
27    myPuppy.getAge( ); 
28 
29    /* You can access instance variable as follows as well */ 
30    System.out.println("Variable Value :" + myPuppy.puppyAge ); 
31  } 
32}

5.4.3 Your First Java Program

To create a simple Java program, you need to create a class that contains a main method. Let's understand the requirements first.

  1. Install Java (We did this in the previous chapter)
  2. Set the path (We did this in the previous chapter)
  3. Write the program
  4. Compile the program

Now let us write a program in a simple text editor in Ubuntu. I will create a text document from the terminal and start writing the program.

1sudo mkdir tmp 
2sudo gedit mfjp.txt

The second statement opens a simple text application. Write the following code in the file.

1class MyClass 
2{ 
3public static void main(String args[]) 
4{ 
5System.out.println("Hello Java"); 
6} 
7}

Close the application (text file), and execute the following statements.

1sudo cp mfjp.txt mfjp.java 
2javac mfjp.java

After these statements you will have two new files in your working directory (check with pwd and ls): mfjp.java and, produced by the compiler, the class file MyClass.class. Then you can try executing the program this way.

1java MyClass

You should get the output "Hello Java" in the terminal. Now let us learn a few basics of Java.

5.4.4 Dos

About Java programs, it is very important to keep in mind the following points.

  1. Case Sensitivity - Java is case sensitive, which means identifier Hello and hello would have different meaning in Java.
  2. Class Names - For all class names the first letter should be in Upper Case. If several words are used to form a name of the class, each inner word’s first letter should be in Upper Case. Example, MyJavaExample.
  3. Method Names - All method names should start with a Lower Case letter. If several words are used to form the name of the method, then each inner word’s first letter should be in Upper Case. Example public void myMethodName().
  4. Program File Name - The name of the program file should exactly match the class name. When saving the file, you should save it using the class name (remember Java is case sensitive) and append '.java' to the end of the name (if the file name and the class name do not match, your program will not compile). Example: assume MyFirstJavaProgram is the class name; then the file should be saved as MyFirstJavaProgram.java.
  5. public static void main(String args[]) - Java program processing starts from the main() method, which is a mandatory part of every Java program.

5.4.5 Variables and Data Types

In computer science, a variable is a name for a memory location. There are three types of variables: local, instance and static. There are two types of data types in Java: primitive and non-primitive. The following table shows the details of the primitive data types; a short program after the table prints their default values.


Table 5.1: Data Types in Java



Data Type   Default Value   Default Size
boolean     false           1 bit
char        '\u0000'        2 bytes
byte        0               1 byte
short       0               2 bytes
int         0               4 bytes
long        0L              8 bytes
float       0.0f            4 bytes
double      0.0d            8 bytes
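
The defaults in Table 5.1 apply to fields (instance and class variables), not to local variables. The following small sketch (the class name Defaults is hypothetical) prints them.

class Defaults {
    boolean b;   // defaults to false
    char c;      // defaults to '\u0000'
    byte by;     // defaults to 0
    short s;     // defaults to 0
    int i;       // defaults to 0
    long l;      // defaults to 0L
    float f;     // defaults to 0.0f
    double d;    // defaults to 0.0d

    public static void main(String[] args) {
        Defaults x = new Defaults();
        System.out.println(x.b);
        System.out.println((int) x.c); // cast so the NUL character prints as 0
        System.out.println(x.by);
        System.out.println(x.s);
        System.out.println(x.i);
        System.out.println(x.l);
        System.out.println(x.f);
        System.out.println(x.d);
    }
}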




Another important subject that needs attention is operators. The following table shows the operators used in Java programming.

Operator Type        Operators
Arithmetic           +  -  *  /  %
Relational           ==  !=  >  <  >=  <=
Logical              &&  ||  !
Assignment           =  +=  -=  *=  /=  %=
Increment/Decrement  ++  --
Bitwise              &  |  ^  ~  <<  >>  >>>
Ternary              ? :

5.5 Exercises

The following are a few exercises.

5.5.1 Ex-1: System Input-Output

In this exercise the console asks for a number and then prints the same number back.

1 
2import java.util.Scanner; 
3 
4class inOut{ 
5 
6public static void main(String[] args){ 
7 
8Scanner scan = new Scanner(System.in); 
9int num = 0; 
10System.out.println("Enter any number: "); 
11num = scan.nextInt(); 
12 
13System.out.println("Input number is: " + num);  
14} 
15}

5.5.2 Ex-2: If statement

1public class ifExample 
2{ 
3public static void main(String[] args) 
4{ 
5 int age = 40; 
6 if(age <= 50) 
7 { 
8  System.out.println("The age is less than 50"); 
9 } 
10} 
11}

The above code evaluates the age variable and prints the result. The following is the code for the nested-if exercise.

5.5.3 Ex-3: Nested-If

1import java.util.Scanner; 
2 
3class nestIf { 
4public static void main(String[] args){ 
5 
6Scanner scan = new Scanner(System.in); 
7 
8int s = 0; 
9 
10System.out.println("Enter some number: "); 
11 
12s = scan.nextInt(); 
13 
14if(s % 2 == 0){ 
15 
16 System.out.println("The number is even"); 
17 
18 } else{ 
19 
20 System.out.println("The numbe is not even"); 
21 
22 } 
23 
24} 
25 
26}

The expected output is as follows:

1Enter some number: 
2100 
3The number is even 
4 
5Enter some number: 
6125 
7The number is not even

5.5.4 Ex-4: If-Else-If example

Let us write code that assigns a student grade for different ranges of marks. For instance, look at the following code.

1import java.util.Scanner; 
2 
3class studMarks{ 
4 
5public static void main(String[] args){ 
6 
7Scanner scan = new Scanner(System.in); 
8 
9int m = 0; 
10 
11System.out.println("Input marks: "); 
12m = scan.nextInt(); 
13 
14if(m < 50){ 
15System.out.println("Fail"); 
16} else if(m >= 50 && m <= 60){ 
17System.out.println("C Grade"); 
18}else if(m >= 60 && m <= 75){ 
19System.out.println("B Grade"); 
20}else if(m >= 75 && m <= 100){ 
21System.out.println("A Grade"); 
22}else{ 
23System.out.println("Invalid"); 
24} 
25} 
26}

The output for the above code is as follows:

1Input marks: 
258 
3C Grade 
4 
5Input marks: 
659 
7C Grade 
8 
9Input marks: 
1075 
11B Grade 
12 
13Input marks: 
1490 
15A Grade

5.5.5 Ex-5: Switch Statement

The Java switch statement executes one statement out of multiple conditions. It is like an if-else-if ladder. Sometimes it is easier to use a switch statement instead of if-else-if statements. See the exercise below.

1import java.util.Scanner; 
2 
3class switchEx{ 
4public static void main(String[] args){ 
5 Scanner scan = new Scanner(System.in); 
6 
7 int s = 0; 
8 
9 System.out.println("Enter your choice: "); 
10 s = scan.nextInt(); 
11 
12 switch(s){ 
13  case 1: System.out.println("you want to try AM"); break; 
14  case 2: System.out.println("you want to try GM"); break; 
15  case 3: System.out.println("you want to try HM"); break; 
16  default: System.out.println("your input is invalid"); break; 
17  } 
18 } 
19 
20}

The above code reads the user input and prints the action intended by the user. Ideally, the code should compute the measure of interest rather than just printing a statement. For instance, if the input is 1 the code should return the arithmetic mean, but that needs a data distribution (a vector). So we need an additional input, i.e. an array, to compute statistical measures. A first attempt is sketched right after this paragraph, and we will continue this example in the section on arrays.
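
Here is a minimal sketch of that idea; the hard-coded data vector and the class name meanSwitch are assumptions for illustration. The switch now actually computes the arithmetic, geometric or harmonic mean.

import java.util.Scanner;

class meanSwitch {
    public static void main(String[] args) {
        double[] x = {2.0, 4.0, 8.0, 16.0};   // sample data distribution
        Scanner scan = new Scanner(System.in);

        System.out.println("Enter your choice (1=AM, 2=GM, 3=HM): ");
        int s = scan.nextInt();

        double am = 0, gm = 1, hm = 0;
        for (double v : x) {
            am += v;          // running sum for the arithmetic mean
            gm *= v;          // running product for the geometric mean
            hm += 1.0 / v;    // running sum of reciprocals for the harmonic mean
        }

        switch (s) {
            case 1: System.out.println("AM = " + am / x.length); break;
            case 2: System.out.println("GM = " + Math.pow(gm, 1.0 / x.length)); break;
            case 3: System.out.println("HM = " + x.length / hm); break;
            default: System.out.println("your input is invalid"); break;
        }
    }
}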

5.5.6 Ex-3: for loop

The for statement is highly useful for executing loops.

1class forEx{ 
2 
3public static void main(String args[]){ 
4 
5for(int i = 1; i < 10; i++){ 
6System.out.print(i + "\n"); 
7 } 
8} 
9}

The above code prints a linear series of numbers. Note the "\n" in the print statement; it makes the numbers print vertically in the console, otherwise the code prints the numbers one after the other on the same line (horizontally).

5.5.7 Ex-4: While Loop

The above exercise also can be done in the following way:

1class doEx{ 
2 
3public static void main(String[] args){ 
4 
5int i = 1; 
6 
7while(i <= 10){ 
8 System.out.println(i + "\n"); 
9 i++;  
10 
11 } 
12 
13} 
14 
15}

The other way of performing the same action:

1class doWhileEx{ 
2 
3public static void main(String[] args){ 
4 
5int i = 1; 
6 
7do{ 
8System.out.println(i); 
9i++; 
10}while(i <= 10); 
11 
12} 
13}

5.5.8 Ex-5: Random Number Generation (RNG)

Now let us do a small but interesting exercise. In data analytics, as mentioned before, analysts often try out code on simulated data, for instance data produced through Random Number Generation (RNG). There is some material on the math functions of Java at https://docs.oracle.com/javase/tutorial/java/data/beyondmath.html.

Let us take a step-by-step approach. This example is very important because data analytics has a lot to do with random values from certain distributions. Data gets generated by certain processes, and when data analysts get data, it is essentially information about some process. These processes might be random or non-random. Most of the time data analysts work with random data distributions to test their models (code), so an analyst needs to know how to work with random processes.

1import java.util.Random; 
2 
3class randEx{ 
4 
5public static void main(String[] args){ 
6 Random rand = new Random(); 
7 double n = rand.nextInt(); 
8 System.out.println(n); 
9} 
10 
11}

The above code creates only one data point, generated randomly with the help of the random function. But we need a vector of values to call it a data distribution (a univariate variable). So we have to use a loop to generate a set of values by repeating the random number generation.

1import java.util.Random; 
2 
3class randEx{ 
4 
5public static void main(String[] args){ 
6 Random rand = new Random(); 
7 
8for(int i = 1; i <= 10; i++){ 
9 double n = rand.nextInt(10)+1; 
10 System.out.println(n); 
11  } 
12} 
13}

The above code gives the following output.

18.0 
21.0 
39.0 
48.0 
55.0 
68.0 
78.0 
88.0 
95.0 
107.0

If we change the statement double n = rand.nextInt(10)+1; to double n = rand.nextInt()*10;, the output will be different.

1-9.64653906E8 
2-1.386765098E9 
31.44151986E8 
4-6.869232E7 
5-2.17323714E8 
6-1.277274944E9 
7-1.300752846E9 
8-2.39470446E8 
9-1.75980253E9 
101.120968E9

If we keep the parentheses empty and do not multiply by any value, the output is as follows.

11.11883961E9 
2-2.16984983E8 
36.50234678E8 
42.046700708E9 
5-1.676107933E9 
61.885349562E9 
72.097020595E9 
82.027852388E9 
91.04450492E8 
10-1.624830023E9

Perhaps the right method is double n = rand.nextInt(10)+1;. You can design the process as you wish, depending on the requirements; try changing the value in the parentheses, run the same code and observe the changes. Moreover, there is a reference for all the functions available for mathematical operations at https://docs.oracle.com/javase/7/docs/api/java/lang/Math.html. We might discuss some of these methods in the next chapter, Java for Statistical Analysis and Packaging. The following example extends RNG to a two-dimensional array (a 5 x 5 matrix).

1import java.util.Random; 
2 
3class twoDarr{ 
4public static void main(String[] args){ 
5 
6double[][] b = new double[5][5]; 
7 
8Random rand = new Random(); 
9double n = rand.nextDouble()*10+1; 
10 
11for(int r = 0; r < b.length; r++){ 
12 for(int c = 0; c < b[r].length; c++){ 
13 b[r][c] = n; 
14  } 
15 } 
16 
17for(int r = 0; r < b.length; r++){ 
18 for(int c = 0; c < b[r].length; c++){ 
19  System.out.print(b[r][c] + "\t"); 
20 
21  } 
22 System.out.println(); 
23 } 
24} 
25}

Note that System.out.print() is used in the inner loop; otherwise you do not get a matrix, instead all the elements are listed vertically one below the other. This code still has a drawback: it produces the same number in every cell, because the random value is generated only once, outside the loops. To have different random values we can write the code as below.

1import java.util.Random; 
2import java.lang.Math; 
3 
4class twoDarr{ 
5public static void main(String[] args){ 
6 
7double[][] b = new double[5][5]; 
8 
9Random rand = new Random(); 
10 
11 
12for(int r = 0; r < b.length; r++){ 
13 for(int c = 0; c < b[r].length; c++){ 
14 double n = Math.round(Math.random()*100);  
15 b[r][c] = n; 
16  } 
17 } 
18 
19for(int r = 0; r < b.length; r++){ 
20 for(int c = 0; c < b[r].length; c++){ 
21  System.out.print("\t" + b[r][c]); 
22 
23  } 
24 System.out.println(); 
25 } 
26 
27} 
28}

The difference is that we used two additional methods, round() and random(), from the Math class. Now the statement double n = Math.round(Math.random()*100); produces a new value for every iteration of the loop. The expected output might look like the matrix below.

29.0    43.0    8.0     69.0    11.0
77.0    39.0    2.0     5.0     39.0
78.0    15.0    18.0    78.0    15.0
55.0    42.0    28.0    68.0    23.0
61.0    19.0    49.0    65.0    56.0

5.6 Maps

The Map interface maps unique keys to values. A key is an object that you use to retrieve a value at a later date. Given a key and a value, you can store the value in a Map object, and after the value is stored, you can retrieve it by using its key. But you have to be a bit careful while using maps. Several methods throw a NoSuchElementException when no items exist in the invoking map. A ClassCastException is thrown when an object is incompatible with the elements in a map. A NullPointerException is thrown if an attempt is made to use a null object and null is not allowed in the map. An UnsupportedOperationException is thrown when an attempt is made to change an unmodifiable map.

A Map is an interface that maps keys to values. The keys are unique and thus, no duplicate keys are allowed. A map can provide three views, which allow the contents of the map to be viewed as a set of keys, collection of values, or set of key-value mappings. In addition, the order of the map is defined as the order in which, the elements of a map are returned during iteration.

The Map interface is implemented by different Java classes, such as HashMap, Hashtable and TreeMap. Each class provides different functionality and can be either synchronized or not. Also, some implementations prohibit null keys and values, and some have restrictions on the types of their keys.

A map contains values on the basis of key i.e. key and value pair. Each key and value pair is known as an entry. Map contains only unique keys. Map is useful if you have to search, update or delete elements on the basis of key.

5.6.1 Collections

The concept of maps belongs to a parent concept called collections. Data can be organized through these collections in a number of ways. Earlier, i.e. before Java 2, there were ad hoc classes like Dictionary, Vector, Stack and Properties. However, these classes differed in their themes; for instance, the purpose of Vector is different from that of Properties. The collections framework serves to fulfill the following objectives.

  1. The framework had to be high-performance. The implementations for the fundamental collections (dynamic arrays, linked lists, trees, and hashtables) are highly efficient.
  2. The framework had to allow different types of collections to work in a similar manner and with a high degree of interoperability.
  3. Extending and/or adapting a collection had to be easy.

To achieve these goals Java implemented the collections framework. The collections framework operates through certain standard interfaces, together with standard implementations such as LinkedList, HashSet and TreeSet. The following list describes the core interfaces.




  1. Collection: Enables you to work with groups of objects; it is at the top of the collections hierarchy.
  2. List: Extends Collection; an instance of List stores an ordered collection of elements.
  3. Set: Extends Collection to handle sets, which must contain unique elements.
  4. SortedSet: Extends Set to handle sorted sets.
  5. Map: Maps unique keys to values.
  6. Map.Entry: Describes an element (a key/value pair) in a map; this is an inner class of Map.
  7. SortedMap: Extends Map so that the keys are maintained in ascending order.
  8. Enumeration: A legacy interface that defines the methods by which you can enumerate (obtain one at a time) the elements in a collection of objects; it has been superseded by Iterator.



The collection classes

Java provides a set of standard collection classes that implement the Collection interfaces. Some of the classes provide full implementations that can be used as-is; others are abstract classes providing skeletal implementations that are used as starting points for creating concrete collections.

The standard collection classes are summarized in the following list:

  1. AbstractCollection : Implements most of the Collection interface.
  2. AbstractList: Extends AbstractCollection and implements most of the List interface.
  3. AbstractSequentialList : Extends AbstractList for use by a collection that uses sequential rather than random access of its elements.
  4. LinkedList: Implements a linked list by extending AbstractSequentialList.
  5. ArrayList : Implements a dynamic array by extending AbstractList.
  6. AbstractSet : Extends AbstractCollection and implements most of the Set interface.
  7. HashSet : Extends AbstractSet for use with a hash table.
  8. LinkedHashSet : Extends HashSet to allow insertion-order iterations.
  9. TreeSet: Implements a set stored in a tree. Extends AbstractSet.
  10. AbstractMap: Implements most of the Map interface.
  11. HashMap: Extends AbstractMap to use a hash table.
  12. TreeMap: Extends AbstractMap to use a tree.
  13. WeakHashMap : Extends AbstractMap to use a hash table with weak keys.
  14. LinkedHashMap: Extends HashMap to allow insertion-order iterations.
  15. IdentityHashMap : Extends AbstractMap and uses reference equality when comparing keys.

Legacy classes

The following are the legacy classes for collections.

  1. Vector: This implements a dynamic array. It is similar to ArrayList, but with some differences.
  2. Stack: Stack is a subclass of Vector that implements a standard last-in, first-out stack.
  3. Dictionary: Dictionary is an abstract class that represents a key/value storage repository and operates much like Map.
  4. Hashtable: Hashtable was part of the original java.util and is a concrete implementation of a Dictionary.
  5. Properties: Properties is a subclass of Hashtable. It is used to maintain lists of values in which the key is a String and the value is also a String.
  6. BitSet: A BitSet class creates a special type of array that holds bit values. This array can increase in size as needed.

5.6.2 Basic Methods

A map has the form Map<K, V>, where K specifies the type of keys maintained in the map and V defines the type of mapped values. Furthermore, the Map interface provides a set of methods that must be implemented. Some of these methods are listed below; a short example follows the list.

  1. clear: Removes all the elements from the map.
  2. containsKey: Returns true if the map contains the requested key.
  3. containsValue: Returns true if the map contains the requested value.
  4. equals: Compares an Object with the map for equality.
  5. get: Retrieve the value of the requested key.
  6. keySet: Returns a Set that contains all keys of the map.
  7. put: Adds the requested key-value pair in the map.
  8. remove: Removes the requested key and its value from the map, if the key exists.
  9. size: Returns the number of key-value pairs currently in the map.
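
Here is a short sketch (the class name mapMethods and the sample entries are assumptions) that exercises several of the methods listed above.

import java.util.HashMap;
import java.util.Map;

class mapMethods {
    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<String, Integer>();

        m.put("Kama", 40);                           // put: add key-value pairs
        m.put("Ammu", 11);

        System.out.println(m.containsKey("Kama"));   // true
        System.out.println(m.get("Ammu"));           // 11
        System.out.println(m.keySet());              // set of all keys
        System.out.println(m.size());                // 2

        m.remove("Kama");                            // remove a key and its value
        m.clear();                                   // remove everything
        System.out.println(m.size());                // 0
    }
}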

5.6.3 Map Classes

Hash Map

The most common class that implements the Map interface is the Java HashMap. A HashMap is a hash-table-based implementation of the Map interface. It permits null keys and values. Also, this class does not maintain any order among its elements and, in particular, it does not guarantee that the order will remain constant over time. Finally, a HashMap has two fundamental parameters: initial capacity and load factor. The capacity is the number of buckets in the hash table, while the load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.
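
For instance, both parameters can be passed to the HashMap constructor; the values below (capacity 32, load factor 0.9) and the class name capacityDemo are arbitrary choices for illustration.

import java.util.HashMap;
import java.util.Map;

class capacityDemo {
    public static void main(String[] args) {
        // initial capacity of 32 buckets, load factor of 0.9
        Map<String, Integer> m = new HashMap<String, Integer>(32, 0.9f);
        m.put("example", 1);
        System.out.println(m);
    }
}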

A simple example that uses a HashMap is shown below:

5.6.4 Ex-6: Map example 1

1import java.util.*; 
2 
3public class hashEx{ 
4 
5public static void main(String[] args){ 
6 
7Map m = new HashMap(); 
8m.put("Kama", "40"); 
9m.put("Sukanya", "38"); 
10m.put("Ammu", "11"); 
11m.put("Chikku", "8"); 
12 
13System.out.println(); 
14System.out.println("Map Elements"); 
15System.out.println("\t" + m); 
16} 
17}

The above example creates a HashMap object named m and then prints the keys and values together as key=value pairs. The output will be as below:

1Map Elements 
2{Sukanya=38, Kama=40, Ammu=11, Chikku=8}

5.6.5 Vectors

5.7 Arrays

Arrays are important for data analysts. Data analysts deal with data sets that are essentially matrices. A matrix is anything arranged in rows and columns, so a data set is basically a data matrix set out in rows and columns. The order of a matrix is denoted as rows by columns (m x n), and these are referred to as its dimensions. The following section illustrates how to deal with data matrices, i.e. arrays.

An array is a container object that holds a fixed number of values of a single type. The length of an array is established when the array is created; after creation, its length is fixed. Each item in an array is called an element, and each element is accessed by its numerical index. Numbering begins with 0. The following program, arrEx, creates an array of integers, puts some values in the array, and prints each value to standard output.

5.7.1 Ex-1: One Dimensional Array

1class arrEx { 
2 
3public static void main(String[] args){ 
4 
5// declares array 
6 int[] AnArray; 
7 
8// initialize the array 
9 AnArray = new int[10]; 
10 
11// initialize elements 
12 AnArray[0] = 100; 
13 AnArray[1] = 200; 
14 AnArray[2] = 300; 
15 AnArray[3] = 400; 
16 AnArray[4] = 500; 
17 AnArray[5] = 600; 
18 AnArray[6] = 700; 
19 AnArray[7] = 800; 
20 AnArray[8] = 900; 
21 AnArray[9] = 1000; 
22 
23// output 
24 System.out.println("The first element in array: " + AnArray[0]); 
25 System.out.println("The first element in array: " + AnArray[1]); 
26 System.out.println("The first element in array: " + AnArray[2]); 
27 System.out.println("The first element in array: " + AnArray[3]); 
28 System.out.println("The first element in array: " + AnArray[4]); 
29 System.out.println("The first element in array: " + AnArray[5]); 
30 System.out.println("The first element in array: " + AnArray[6]); 
31 System.out.println("The first element in array: " + AnArray[7]); 
32 System.out.println("The first element in array: " + AnArray[8]); 
33 System.out.println("The first element in array: " + AnArray[9]); 
34} 
35}

The expected output will be

1Element at index 0: 100 
2Element at index 1: 200 
3Element at index 2: 300 
4Element at index 3: 400 
5Element at index 4: 500 
6Element at index 5: 600 
7Element at index 6: 700 
8Element at index 7: 800 
9Element at index 8: 900 
10Element at index 9: 1000

The above code defines an array called AnArray, stores some values in it, and finally prints the same values by index. The example shows how to declare and define an array. The next example reads the values from the user instead; after that we will see how to make a two-dimensional array.

Ex-2: I/O operations on Arrays

1import java.util.Scanner; 
2 
3class switchEx { 
4 
5public static void main(String[] args){ 
6 
7double[] a; 
8a = new double[10]; 
9//int s = 0; 
10 
11Scanner scan = new Scanner(System.in); 
12 
13//int sum = 0; 
14 
15for(int i = 0; i< 10; i++){ 
16 
17System.out.println("Enter input: "); 
18a[i] = scan.nextInt(); 
19} 
20 
21for(int i = 0; i < 10; i++){ 
22 
23System.out.println(a[i]); 
24} 
25} 
26}

5.7.2 Ex-2: Two Dimensional Array

1class twoDarray{ 
2 
3public static void main(String[] args){ 
4 String[][] names = {{"Mr.", "Mrs.", "Ms.", "Mast."}, {"MK", "MS", "Ammu", "Chikku"}}; 
5 
6 System.out.println(names[0][0]+names[1][0]); 
7 System.out.println(names[0][1]+names[1][1]); 
8 System.out.println(names[0][2]+names[1][2]); 
9 System.out.println(names[0][3]+names[1][3]); 
10} 
11}

The above code concatenates the corresponding elements of the two rows.

Ex-3: Copying Array

The System class has an arraycopy method that you can use to efficiently copy data from one array into another:

1class arrCopyDemo { 
2 
3public static void main(String[] args){ 
4 char[] copyFrom = {'a', 'b', 'c', 'd', 'e'}; 
5 char[] copyTo = new char[3]; 
6 
7System.arraycopy(copyFrom, 2, copyTo, 0, 3); 
8System.out.println(new String(copyTo)); 
9} 
10}

The above code creates two arrays of char data type, namely copyFrom and copyTo. The array copyTo is filled by copying three elements of copyFrom starting from the third element (index 2). The expected output is as below:

1cde

Ex-4: Array Manipulation

Arrays are a powerful and useful concept in programming. Java SE provides methods to perform some of the most common manipulations related to arrays. For instance, the arrCopyDemo example above uses the arraycopy method of the System class instead of manually iterating through the elements of the source array and placing each one into the destination array; that work is performed behind the scenes, enabling the developer to use just one line of code to call the method.

For your convenience, Java SE provides several methods for performing array manipulations (common tasks such as copying, sorting and searching arrays) in the java.util.Arrays class. For instance, the previous example can be modified to use the copyOfRange method of the java.util.Arrays class, as you can see in the arrManip example below. The difference is that the copyOfRange method does not require you to create the destination array before calling the method, because the destination array is returned by the method:

1import java.util.*; 
2import java.io.*; 
3 
4class arrManip { 
5 
6public static void main(String[] args){ 
7 
8char[] copyFrom = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'}; 
9 
10int size = copyFrom.length; 
11 
12char[] copyTo = java.util.Arrays.copyOfRange(copyFrom, 5, size); 
13 
14 System.out.println(new String(copyTo)); 
15 
16} 
17}

The above code shows how to find the length of an array and how to print the second array (a part of the first array). Things are slightly different if we are working with a String. For instance:

1class strLen { 
2 
3public static void main(String[] args){ 
4 String mystr = new String("This is only a test string"); 
5 System.out.println(mystr.length()); 
6} 
7}

You might notice that length() is used with parentheses here, whereas for the char array in Ex-4 there are no parentheses. That is because length is a field of every array, while length() is a method of the String class. The same array field works for int, as the small sketch below shows.
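
A small sketch (the class name intLen is hypothetical) showing that an int array exposes length as a field, without parentheses, while a String needs the method call.

class intLen {
    public static void main(String[] args) {
        int[] nums = {10, 20, 30, 40, 50};
        System.out.println(nums.length);          // 5 (array field, no parentheses)
        System.out.println("text".length());      // 4 (String method, with parentheses)
    }
}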

5.7.3 Vectors

Vectors are a very important data type in the domain of data analytics. In univariate analysis, every set of numbers is treated as a unique vector. A data distribution is a set of contiguously arranged values arising out of a unique process, so every data distribution is a vector. The analogous structure for characters and strings is known as a factor.

There are a couple of other classes and methods for practising with arrays. For instance, if we need a growable array we can use java.util.Vector, which offers methods for adding, removing and searching elements, as the sketch below illustrates.
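
A minimal sketch (the class name vecEx and the sample values are assumptions) of java.util.Vector, a growable array with add/remove/search methods.

import java.util.Vector;

class vecEx {
    public static void main(String[] args) {
        Vector<Double> v = new Vector<Double>();

        v.add(2.5);                            // append elements
        v.add(7.5);
        v.add(10.0);

        System.out.println(v);                 // [2.5, 7.5, 10.0]
        System.out.println(v.size());          // 3
        System.out.println(v.contains(7.5));   // true (search)
        v.remove(Double.valueOf(7.5));         // remove an element by value
        System.out.println(v);                 // [2.5, 10.0]
    }
}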

Notes

11Platform: Any hardware or software environment in which a program runs is known as a platform. Since Java has its own runtime environment (JRE) and API, it is called a platform.

Chapter 6
YARN

Hadoop is one of the most popular open-source distributed computation frameworks, popularized by the widely used MapReduce computation paradigm. Recently, however, Hadoop has been trying to distance itself from a single computation model and to provide an abstraction layer over a cluster of machines with which you can easily develop a great variety of fault-tolerant and scalable distributed computation paradigms. In a sense, Hadoop is now trying to become a "kernel" of the distributed cluster, handling the low-level details so you don't have to - a kind of Linux for clusters.

Due to this paradigm change in Hadoop, we can identify two different Hadoop generations: pre-YARN and post-YARN (2.x). YARN (Yet Another Resource Negotiator) constitutes a significant improvement over the previous method of handling resources and applications inside Hadoop. To get a better idea of what exactly changed from the previous generation, you can have a look at the presentations by Spotify and Hortonworks on the topic. In a nutshell, the main features are:

  1. Better scaling due to reduced centralization of responsibilities. Should easily support over 10K nodes, 10K jobs and 100K tasks (previous generation started getting into trouble at 4K nodes and 40K tasks).
  2. More flexibility in resource allocations. You can now easily specify the requirements of each individual container directly in your YARN application, specifying needed RAM, CPU cores and even specific hosts on which the container should be allocated. Previously this was limited to global memory and CPU limits for all containers specified in configuration files.
  3. More flexibility in computation. Previous Hadoop generations only ran MapReduce jobs. With YARN, Hadoop can run virtually any kind of computation. MapReduce is still possible but is implemented as a backwards-compatible module called MapReducev2 (not enabled by default). Other modules exist such as Giraph for graph processing, Spark for general computations with intermediate results stored in memory, etc.
  4. Better failure handling. Previous generations had a single point of failure in the JobTracker losing the entire job queue in such an event. New generation has (or will soon have) recovery of both ApplicationMaster (through a restart by the ResourceManager) and the ResourceManager (through ZooKeeper, YARN-128, YARN-149, YARN-556).
  5. Wire-compatible protocol. This should guarantee protocol compatibility even between different versions of Hadoop so you no longer have to worry about having to simultaneously update the entire cluster and can do rolling upgrades.

YARN is the component responsible for allocating containers to run tasks, coordinating the execution of those tasks, restarting them in case of failure, and other housekeeping. Just like HDFS, it has two main components: a ResourceManager, which keeps track of the cluster resources, and NodeManagers on each of the nodes, which communicate with the ResourceManager and set up containers for the execution of tasks. We have to configure yarn-site.xml to use YARN; please see the instructions in the installation section. You may go through the default configuration at http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

The general concept is that an application submission client submits an application to the YARN ResourceManager (RM). This can be done by setting up a YarnClient object. After the YarnClient is started, the client can then set up the application context, prepare the very first container of the application that contains the ApplicationMaster (AM), and then submit the application. You need to provide information such as the details about the local files/jars that need to be available for your application to run, the actual command that needs to be executed (with the necessary command line arguments), any OS environment settings (optional), etc. Effectively, you need to describe the Unix process(es) that needs to be launched for your ApplicationMaster. The YARN ResourceManager will then launch the ApplicationMaster (as specified) on an allocated container. The ApplicationMaster communicates with the YARN cluster and handles application execution; it performs its operations in an asynchronous fashion. During application launch time, the main tasks of the ApplicationMaster are to register itself with the ResourceManager, to negotiate the containers needed for its tasks, and to launch and monitor those containers through the NodeManagers.

The three main protocols for YARN applications (ApplicationClientProtocol, ApplicationMasterProtocol and ContainerManagementProtocol) are still preserved. Three client libraries (YarnClient, AMRMClient/AMRMClientAsync and NMClient/NMClientAsync) wrap these three protocols to provide a simpler programming model for YARN applications.

The process on the client side goes as shown below; a minimal client sketch in Java follows the list. 12

  1. The first step that a client needs to do is to initialize and start a YarnClient.
  2. Once a client is set up, the client needs to create an application, and get its application id.
  3. The response from the YarnClientApplication for a new application also contains information about the cluster such as the minimum/maximum resource capabilities of the cluster. This is required so that to ensure that you can correctly set the specifications of the container in which the ApplicationMaster would be launched.
  4. The main crux of a client is to set up the ApplicationSubmissionContext, which defines all the information needed by the RM to launch the AM. A client needs to set the following into the context: the application id and name, the queue and priority, and the ContainerLaunchContext for the AM container (its local resources, environment settings and launch command), along with the resources (memory, cores) required by that container.
  5. In certain situations, if the application is taking too long or due to other factors, the client may wish to kill the application. YarnClient supports the killApplication call that allows a client to send a kill signal to the AM via the ResourceManager. An ApplicationMaster if so designed may also support an abort call via its rpc layer that a client may be able to leverage.
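
The following is an illustrative sketch of the client side (not the complete distributed-shell client); the application name, queue and resource sizes are assumptions, and in a real client the AM jar, environment and launch command must also be filled in.

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SimpleYarnClient {
    public static void main(String[] args) throws Exception {
        // 1. Initialize and start a YarnClient
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // 2-3. Create an application; the response carries cluster limits
        YarnClientApplication app = yarnClient.createApplication();
        System.out.println("Max resources: "
                + app.getNewApplicationResponse().getMaximumResourceCapability());

        // 4. Set up the ApplicationSubmissionContext for the RM
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("simple-yarn-app");

        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        // In a real client the AM jar, environment and launch command go here.
        appContext.setAMContainerSpec(amContainer);

        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(1024);     // MB for the AM container
        capability.setVirtualCores(1);
        appContext.setResource(capability);
        appContext.setQueue("default");

        // Submit the application; 5. a client may later kill it if needed
        yarnClient.submitApplication(appContext);
        // yarnClient.killApplication(appContext.getApplicationId());
    }
}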

The process on the master (AM) side goes as shown below; a minimal registration sketch follows the list.

  1. The AM is the actual owner of the job. It will be launched by the RM and via the client will be provided all the necessary information and resources about the job that it has been tasked with to oversee and complete.
  2. As the AM is launched within a container that may (likely will) be sharing a physical host with other containers, given the multi-tenancy nature, amongst other issues, it cannot make any assumptions of things like pre-configured ports that it can listen on.
  3. When the AM starts up, several parameters are made available to it via the environment. These include the ContainerId for the AM container, the application submission time and details about the NM (NodeManager) host running the ApplicationMaster. Ref ApplicationConstants for parameter names.
  4. All interactions with the RM require an ApplicationAttemptId (there can be multiple attempts per application in case of failures). The ApplicationAttemptId can be obtained from the AM’s container id. There are helper APIs to convert the value obtained from the environment into objects.
  5. After an AM has initialized itself completely, we can start the two clients: one to ResourceManager, and one to NodeManagers. We set them up with our customized event handler, and we will talk about those event handlers in detail later in this article.
  6. The AM has to emit heartbeats to the RM to keep it informed that the AM is alive and still running. The timeout expiry interval at the RM is defined by a config setting accessible via YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS, with the default defined by YarnConfiguration.DEFAULT_RM_AM_EXPIRY_INTERVAL_MS. The ApplicationMaster needs to register itself with the ResourceManager to start heartbeating.
  7. The registration response includes the maximum resource capability of the cluster. You may want to use this to validate the application's requests.
  8. Based on the task requirements, the AM can ask for a set of containers to run its tasks on. We can now calculate how many containers we need, and request those many containers.
  9. In setupContainerAskForRM(), two things need to be set up: the priority of the request and the resource capability (memory and, if needed, cores) of the requested containers.
  10. After container allocation requests have been sent by the application master, containers will be launched asynchronously by the event handler of the AMRMClientAsync client. The handler should implement the AMRMClientAsync.CallbackHandler interface.
  11. When there are containers allocated, the handler sets up a thread that runs the code to launch containers. Here we use the name LaunchContainerRunnable to demonstrate. We will talk about the LaunchContainerRunnable class in the following part of this article.
  12. On heart beat, the event handler reports the progress of the application.
  13. The container launch thread actually launches the containers on NMs. After a container has been allocated to the AM, it needs to follow a similar process that the client followed in setting up the ContainerLaunchContext for the eventual task that is going to be running on the allocated Container. Once the ContainerLaunchContext is defined, the AM can start it through the NMClientAsync.
  14. The NMClientAsync object, together with its event handler, handles container events, including container start, stop, status updates and errors.
  15. After the ApplicationMaster determines the work is done, it needs to unregister itself through the AM-RM client, and then stops the client.
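
Below is an illustrative AM-side sketch of registration and container requests. For brevity it uses the synchronous AMRMClient rather than the AMRMClientAsync described above; the resource sizes and the empty host/tracking-URL values are assumptions, and the code only works when it is actually launched by the RM inside an AM container.

import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SimpleAppMaster {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register with the ResourceManager to start heartbeating
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for two 256 MB / 1 vcore containers
        Priority priority = Records.newRecord(Priority.class);
        priority.setPriority(0);
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(256);
        capability.setVirtualCores(1);
        for (int i = 0; i < 2; i++) {
            rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
        }

        // The allocate() call doubles as the heartbeat; the allocated containers
        // would then be launched through an NMClient (omitted here).
        rmClient.allocate(0.5f);   // report 50% progress to the RM

        // When the work is done, unregister from the RM and stop the client
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", null);
        rmClient.stop();
    }
}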

Testing

The most important ingredients in YARN are memory and cores. For instance, suppose your computer has 2 GB RAM and 2 cores (use free -g or free -m to check the RAM in your computer). Use cat /proc/meminfo to see the entire memory status of your computer, and cat /proc/cpuinfo | grep processor | wc -l to see the number of processors (i.e. cores). YARN is all about managing resources, so we can restrict or permit tasks by earmarking or allocating memory and cores. For instance, if you look at the default configuration settings of YARN, you will find the values below for memory and cores.





Variable                                   Value   Description
yarn.scheduler.minimum-allocation-mb       1024    The minimum allocation for every container request at the RM, in MBs.
yarn.scheduler.maximum-allocation-mb       8192    The maximum allocation for every container request at the RM, in MBs.
yarn.scheduler.minimum-allocation-vcores   1       The minimum allocation for every container request at the RM, in virtual CPU cores.
yarn.scheduler.maximum-allocation-vcores   4       The maximum allocation for every container request at the RM, in virtual CPU cores.




Memory requests lower or greater than the specified limits will throw an InvalidResourceRequestException, and requests outside the specified limits for cores will also throw an InvalidResourceRequestException. That is how YARN controls resources: it provides containers to tasks such that they can be executed error-free, and users need to adapt these resources to their case.

Now let us see how to test the cluster with custom values specified by the user. We need the hadoop-yarn-applications-distributedshell-*.*.*.jar to execute this test.

The following is the entire statement that need to be executed at the console:

1hadoop jar hadoop-yarn-applications-distributedshell-2.7.2.jar org.apache.hadoop.yarn.applications.distributedshell.Client --jar hadoop-yarn-applications-distributedshell-2.7.2.jar --shell_command date --num_containers 2 --master_memory 1024

With this command we are telling Hadoop to run the Client class in hadoop-yarn-applications-distributedshell-2.7.2.jar, passing it the jar containing the definition of the ApplicationMaster (the same jar), the shell command to run on each of the hosts (date), the number of containers to spawn (2) and the memory used by the ApplicationMaster (1024 MB). The value of 1024 was set empirically by running the program several times until it stopped failing due to the ApplicationMaster using more memory than had been allocated to it. This produces a little output and finally ends with statements like:

116/09/07 16:55:50 INFO distributedshell.Client: Application has completed successfully. Breaking monitoring loop 
216/09/07 16:55:50 INFO distributedshell.Client: Application completed successfully

If you see these lines, you are successful. Then find the application id in the output (a statement like application_1473245156605_0001) and run grep "" $HADOOP_PREFIX/logs/userlogs/<APPLICATION ID>/**/stdout in the console, substituting the application id. In my case it is:

1grep "" \$HADOOP_PREFIX/logs/userlogs/<APPLICATION ID>/**/stdout

Notes

12Read at https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html for full set of instructions.

Chapter 7
HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/. Read more about HDFS at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. The HDFS architecture deals with the basic interactions among the NameNode, the DataNodes and the clients: clients contact the NameNode for file metadata or file modifications and perform the actual file I/O directly with the DataNodes.
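
These interactions can also be driven programmatically from Java through the FileSystem API. The following minimal sketch is an illustration only; the URI hdfs://localhost:9000 and the path /dir1/demo.txt are assumptions matching the single-node setup used elsewhere in this manual.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path file = new Path("/dir1/demo.txt");

        // Writing: the client asks the NameNode for metadata, then streams blocks to DataNodes
        FSDataOutputStream out = fs.create(file, true);
        out.write("this is only a test file\n".getBytes("UTF-8"));
        out.close();

        // Reading works the same way: metadata from the NameNode, data from the DataNodes
        InputStream in = fs.open(file);
        IOUtils.copyBytes(in, System.out, 4096, true);

        fs.close();
    }
}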

The following are some of the salient features of HDFS:

  1. Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability to a large set of distributed applications, is an integral part of Hadoop.
  2. HDFS is highly configurable with a default configuration well suited for many installations. Most of the time, configuration needs to be tuned only for very large clusters.
  3. Hadoop is written in Java and is supported on all major platforms.
  4. Hadoop supports shell-like commands to interact with HDFS directly.
  5. The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.

New features and improvements are regularly implemented in HDFS. Two of the most useful facilities, the web interface and the shell commands, are described in the following sections.

7.1 Webinterface

The NameNode and each DataNode run an internal web server in order to display basic information about the current status of the cluster. With the default configuration, the NameNode front page is at http://namenode-name:50070/ (Hadoop 2.x) or http://namenode-name:9870/ (Hadoop 3.x, as noted earlier). It lists the DataNodes in the cluster and basic statistics of the cluster. The web interface can also be used to browse the file system (using the "Browse the file system" link on the NameNode front page).

7.2 Shell Commands

Hadoop includes various shell-like commands that directly interact with HDFS and the other file systems that Hadoop supports. The command bin/hdfs dfs -help lists the commands supported by the Hadoop shell, and bin/hdfs dfs -help command-name displays more detailed help for a command. These commands support most of the normal file system operations such as copying files, changing file permissions, and so on, as well as a few HDFS-specific operations like changing the replication of files. For more information see the File System Shell Guide at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html. To see this on your computer, just run hadoop fs -help; you will get a list of functions that are useful for performing operations on HDFS. For instance, -appendToFile (the very first function, also called an option) is useful for appending the contents of a local file (.txt etc.) to a destination file (typically located on HDFS). The description of this option is given below.

1-appendToFile <localsrc> ... <dst> : 
2 Appends the contents of all the given local files to the given dst file. The dst 
3 file will be created if it does not exist. If <localSrc> is -, then the input is 
4 read from stdin.

We may practice the code as shown below

1sudo gedit text.txt # copy or write some data into it, e.g. "this is only a test file" 
2hadoop fs -mkdir /dir1 
3hadoop fs -copyFromLocal text.txt /dir1/ 
4sudo gedit text.txt # this time add a few more lines of text, e.g. "to test few commands of hadoop" 
5hadoop fs -appendToFile text.txt /dir1/text.txt # now you must be able to see both lines together in the file.

In the above code, -mkdir and -copyFromLocal are two other options (functions) you can read about in the help output, i.e. hadoop fs -help. Following is a list of a few other commands.

  1. cat: Copies source paths to stdout.
  2. checksum: Returns the checksum information of a file.
  3. chgrp: Change group association of files. The user must be the owner of files, or else a super-user.
  4. chmod: Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user.
  5. chown: Change the owner of files. The user must be a super-user.
  6. copyFromLocal: Similar to put command, except that the source is restricted to a local file reference.
  7. copyToLocal: Similar to get command, except that the destination is restricted to a local file reference.
  8. count: Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME.
  9. cp: Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.
  10. createSnapshot & deleteSnapshot: HDFS snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a sub-tree of the file system or on the entire file system. Some common use cases of snapshots are data backup, protection against user errors and disaster recovery. Please read more at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
  11. df: Displays free space. This command is highly useful: HDFS resources are precious for the execution of jobs, and the available space gradually reduces each time a job is executed, so this command helps in knowing or verifying the storage status. For instance, hadoop fs -df / will show the space status. 13
  12. du: Displays sizes of files and directories contained in the given directory or the length of a file in case its just a file.
  13. dus: Displays a summary of file lengths.
  14. expunge: Empty the trash. Perhaps the next most important command after du and df; it helps keep space free so that MapReduce jobs run better.
  15. find: Finds all files that match the specified expression and applies selected actions to them. If no path is specified then defaults to the current working directory. If no expression is specified then defaults to -print.
    1hduser@hduser:~$ hadoop fs -find / -name text.txt 
    2/dir1/text.txt
  16. get: Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.
    1hadoop fs -get /dir1/text.txt

    After executing the code we can use the native Linux ls command to check whether the file was downloaded from HDFS.

  17. getfacl: Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.
    1hadoop fs -getfacl /dir1/

    Will display the following information

    1# file: /dir1 
    2# owner: hduser 
    3# group: supergroup 
    4getfacl: The ACL operation has been rejected. Support for ACLs has been disabled by setting dfs.namenode.acls.enabled to false.
  18. getfattr: Displays the extended attribute names and values (if any) for a file or directory.
  19. getmerge: Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.
  20. help: Perhaps the foremost command, one that every user must know. It helps in understanding the options associated with any given command; precisely, it returns usage output.
  21. ls: This is the most commonly used command in HDFS. It has three options associated with it, namely -d, -h and -R. The -R option returns the entire directory tree recursively; alternatively we can use -lsr instead of -ls -R.
  22. moveFromLocal: Similar to the put command, except that the source localsrc is deleted after it is copied; that is, the difference between put and moveFromLocal is that moveFromLocal deletes the source file once the copy is done.
  23. moveToLocal: This command performs just as above but copies files from HDFS to local FS.
  24. mv: Moves files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory. Moving files across file systems is not permitted. For instance, let us see how mv executes the task within HDFS.
    1hduser@hduser:~$ hadoop fs -mv /dir1/text.txt /dir2/ 
    2hduser@hduser:~$ hadoop fs -ls -R / 
    3drwxr-xr-x - hduser supergroup      0 2016-09-07 10:43 /dir1 
    4drwxr-xr-x - hduser supergroup      0 2016-09-07 10:43 /dir2 
    5-rw-r--r-- 1 hduser supergroup    272 2016-09-07 10:02 /dir2/text.txt 
    6hduser@hduser:~$ hadoop fs -ls / 
    7Found 2 items 
    8drwxr-xr-x - hduser supergroup      0 2016-09-07 10:43 /dir1 
    9drwxr-xr-x - hduser supergroup      0 2016-09-07 10:43 /dir2 
    10hduser@hduser:~$ hadoop fs -ls /dir1
  25. put: Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.
  26. renameSnapshot: Rename a snapshot. This operation requires owner privilege of the snapshottable directory.
  27. rm: Delete files specified as args. The args: -f, -r, -R, -skipTrash
  28. rmdir: Delete a directory.
  29. rmr: Recursive version of delete.
  30. setfacl: Sets Access Control Lists (ACLs) of files and directories.
  31. setfattr: Sets an extended attribute name and value for a file or directory.
  32. setrep: Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at path.
  33. stat: Print statistics about the file/directory at <path> in the specified format. The format accepts file size in blocks (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner (%u), and modification date (%y, %Y). %y shows the UTC date as "yyyy-MM-dd HH:mm:ss" and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default. Eg. hadoop fs -stat "%F %u:%g %b %y %n" /file
  34. tail: Displays last kilobyte of the file to stdout. Eg. hadoop fs -tail pathname.
  35. test: Tests whether a file or directory exists at the given path; the result is reported through the command's exit status.
  36. text: Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
  37. touchz: Create a file of zero length. Eg. hadoop fs -touchz pathname
  38. truncate: Truncate all files that match the specified file pattern to the specified length. The -w flag requests that the command waits for block recovery to complete, if necessary. Without the -w flag the file may remain unclosed for some time while the recovery is in progress. During this time the file cannot be reopened for append. The syntax is hadoop fs -truncate [-w] <length> <paths>. Eg. hadoop fs -truncate 55 /user/hadoop/file1 /user/hadoop/file2 or hadoop fs -truncate -w 127 hdfs://nn1.example.com/user/hadoop/file1
  39. usage: Return the help for an individual command. Eg. hadoop fs -usage command

Notes

13There are certain problems that are persistent and also nagging while executing jobs. One of the problem areas is MapReduce jobs: at times it is common that MapReduce code exits without executing, besides throwing certain complaints. One of the reasons for these problems is shortage of disk space or memory. Commands like df and du are greatly helpful for verifying the available space before execution.

Chapter 8
Mapreduce

MapReduce is a programming paradigm that runs in the background of Hadoop to provide scalability and easy data-processing solutions. MapReduce is a programming model specifically implemented for processing large data sets. The model was developed by Jeffrey Dean and Sanjay Ghemawat at Google. At its core, MapReduce is a combination of two functions - map() and reduce(), as its name would suggest.

Traditional Enterprise Systems normally have a centralized server to store and process data. The following illustration depicts a schematic view of a traditional enterprise system. The traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.


PIC


Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.


PIC


8.1 How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce. The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).

The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples. The reduce task is always performed after the map job. Let us now take a close look at each of the phases and try to understand their significance.


PIC


Different phases in MapReduce:

  1. Input Phase Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
  2. Map Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
  3. Intermediate Keys The key-value pairs generated by the mapper are known as intermediate keys.
  4. Combiner A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to aggregate the values in a small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.
  5. Shuffle and Sort The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
  6. Reducer The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.
  7. Output Phase In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.


PIC


8.1.1 MR Example

Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 6000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.


PIC


As shown in the illustration, the MapReduce algorithm performs the following actions:

  1. Tokenize - Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
  2. Filter - Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
  3. Count - Generates a token counter per word.
  4. Aggregate Counters - Prepares an aggregate of similar counter values into small manageable units.

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

The map task is done by means of Mapper Class, the reduce task is done by means of Reducer Class. Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is used as input by Reducer class, which in turn searches matching pairs and reduces them.


PIC


MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the Map Reduce tasks to appropriate servers in a cluster.

These mathematical algorithms may include the following:

  1. Sorting
  2. Searching
  3. Indexing
  4. TF-IDF

8.1.2 Sorting

Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys. Sorting methods are implemented in the mapper class itself. In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class (a user-defined class) collects the matching valued keys as a collection. To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the RawComparator class to sort the key-value pairs. The set of intermediate key-value pairs for a given Reducer is automatically sorted by Hadoop to form key-value lists (K2, {V2, V2, ...}) before they are presented to the Reducer.
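If the default ascending order by key is not what a job needs, the sort can be customized by registering a RawComparator with the job. The following is a minimal, hedged sketch of that idea (the class name DescendingIntComparator and the driver lines are illustrative, not part of any example in this chapter); it reverses the natural order of IntWritable keys so that the Reducer receives them in descending order.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Reverses the default ordering of IntWritable keys so that the
// shuffle/sort phase presents them to the Reducer in descending order.
public class DescendingIntComparator extends WritableComparator {

    public DescendingIntComparator() {
        super(IntWritable.class, true);   // true => create key instances for comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -1 * ((IntWritable) a).compareTo((IntWritable) b);
    }
}

// In the driver (illustrative):
//   job.setSortComparatorClass(DescendingIntComparator.class);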

8.1.3 Searching

Searching plays an important role in MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase. Let us try to understand how Searching works with the help of an example.

Example

The following example shows how MapReduce employs Searching algorithm to find out the details of the employee who draws the highest salary in a given employee dataset. Let us assume we have employee data in four different files A, B, C, and D. Let us also assume there are duplicate employee records in all four files because of importing the employee data from all database tables repeatedly. See the following illustration.


PIC


The Map phase:

The Map phase processes each input file and provides the employee data in key-value pairs (<k, v>: <emp name, salary>). See the following illustration.


PIC


The combiner phase (searching technique) accepts the input from the Map phase as key-value pairs of employee name and salary. Using the searching technique, the combiner checks all the employee salaries to find the highest salaried employee in each file. See the following snippet.

<k: employee name, v: salary> 
Max = the salary of the first employee   // treated as the maximum salary so far 

if (v(next employee).salary > Max){ 
   Max = v(salary); 
} 
else{ 
   continue checking; 
}

Reducer phase:

From each file, you will find the highest salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is then applied among the four <k, v> pairs coming from the four input files; the employee with the overall highest salary is the final output.
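One possible way to express this combiner/reducer logic in Java is sketched below. Purely for illustration, it assumes that every Mapper emits the pair <"max", "employee name<TAB>salary">, so that all records meet at a single key; under that assumption the same class can be registered both as the Combiner (per-file maximum) and as the Reducer (overall maximum).

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Keeps only the record with the highest salary among the values of a key.
public class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String bestRecord = null;
        long bestSalary = Long.MIN_VALUE;

        for (Text value : values) {
            // assumed value layout: "<employee name>\t<salary>"
            String[] parts = value.toString().split("\t");
            long salary = Long.parseLong(parts[1]);
            if (salary > bestSalary) {            // keep the highest salary seen so far
                bestSalary = salary;
                bestRecord = value.toString();
            }
        }
        // The key is passed through unchanged, so the same class works
        // both as a Combiner and as the final Reducer.
        context.write(key, new Text(bestRecord));
    }
}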

8.1.4 Indexing

Normally, indexing is used to point to particular data and its address. MapReduce performs batch indexing on the input files for a particular Mapper.

The indexing technique that is normally used in MapReduce is known as inverted index. Search engines like Google and Bing use inverted indexing technique. Let us try to understand how Indexing works with the help of a simple example.

Example

The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their contents are in double quotes.

T[0] = "it is what it is" 
T[1] = "what is it" 
T[2] = "it is a banana"

After applying the Indexing algorithm, we get the following output:

"a": {2} 
"banana": {2} 
"is": {0, 1, 2} 
"it": {0, 1, 2} 
"what": {0, 1}

Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the term "is" appears in the files T[0], T[1], and T[2].
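A minimal Java sketch of such an inverted-index job is given below. The class names are illustrative; the sketch assumes the documents are read through a file-based input format (such as TextInputFormat), so that the name of the current file can be recovered from the input split.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    // Emits <term, fileName> for every term in the current split.
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                context.write(new Text(tokens.nextToken().toLowerCase()), new Text(fileName));
            }
        }
    }

    // Collects the set of distinct files in which each term occurs.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> files = new HashSet<>();
            for (Text file : values) {
                files.add(file.toString());
            }
            context.write(key, new Text(files.toString()));
        }
    }
}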

8.1.5 TF-IDF

TF-IDF is a text processing algorithm which is short for Term Frequency Inverse Document Frequency. It is one of the common web analysis algorithms. Here, the term ’frequency’ refers to the number of times a term appears in a document.

TF

It measures how frequently a particular term occurs in a document. It is calculated by the number of times a word appears in a document divided by the total number of words in that document.

TF('the') = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)    (8.1)

Inverse Document Frequency (IDF)

It measures the importance of a term. It is calculated by the number of documents in the text database divided by the number of documents where a specific term appears.

While computing TF, all the terms are considered equally important. That means, TF counts the term frequency even for common words like "is", "a", "what", etc. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF('the') = log_e(Total number of documents / Number of documents with the term 'the' in it)    (8.2)

Example

Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF for hive is then (50 / 1000) = 0.05. Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then, the IDF is calculated as log(10,000,000 / 1,000) = 4. The TF-IDF weight is the product of these quantities: 0.05 x 4 = 0.20.
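As a quick sanity check of this arithmetic, the same computation can be written in a few lines of plain Java (the class name is illustrative; note that the worked example uses a base-10 logarithm, while equation (8.2) is written with the natural logarithm):

// A quick check of the TF-IDF arithmetic above (plain Java, not a MapReduce job).
public class TfIdfExample {
    public static void main(String[] args) {
        double tf  = 50.0 / 1000.0;                       // term frequency of "hive"
        double idf = Math.log10(10_000_000.0 / 1_000.0);  // base-10 log, as in the example
        System.out.println("TF = " + tf + ", IDF = " + idf + ", TF-IDF = " + tf * idf);
        // Prints: TF = 0.05, IDF = 4.0, TF-IDF = 0.2
    }
}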

MapReduce works only on Linux flavored operating systems and it comes inbuilt with the Hadoop Framework. Please refer to the chapter on Hadoop Installation (Chapter 4) to know about installing Hadoop on a Linux operating system.

8.2 Exercise

We will take a close look at the classes and their methods that are involved in the operations of MapReduce programming. We will primarily keep our focus on the following:

  1. JobContext Interface
  2. Job Class
  3. Mapper Class
  4. Reducer Class

The JobContext interface is the super interface for all the classes, which defines different jobs in MapReduce. It gives you a read-only view of the job that is provided to the tasks while they are running. The following are the sub-interfaces of JobContext interface.


Table 8.1: JobContext Interface


S.No. Subinterface Description


1 MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> Defines the context that is given to the Mapper.


2 ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> Defines the context that is passed to the Reducer.



Job class is the main class that implements the JobContext interface.

8.2.1 Job Class

The Job class is the most important class in the MapReduce API. It allows the user to configure the job, submit it, control its execution, and query the state. The set methods only work until the job is submitted, afterwards they will throw an IllegalStateException. Normally, the user creates the application, describes the various facets of the job, and then submits the job and monitors its progress.

Here is an example of how to submit a job.

// Create a new Job 
Job job = new Job(new Configuration()); 
job.setJarByClass(MyJob.class); 

// Specify various job-specific parameters 
job.setJobName("myjob"); 

// Input and output paths are configured through FileInputFormat/FileOutputFormat 
FileInputFormat.addInputPath(job, new Path("in")); 
FileOutputFormat.setOutputPath(job, new Path("out")); 

job.setMapperClass(MyJob.MyMapper.class); 
job.setReducerClass(MyJob.MyReducer.class); 

// Submit the job, then poll for progress until the job is complete 
job.waitForCompletion(true);

8.2.2 Constructors

The following is the constructor summary of the Job class.




S. No. Constructor summary


1 Job()
2 Job(Configuration conf)
3 Job(Configuration conf, String JobName)



8.2.3 Methods

Some of the important methods of Job class are as follows:





S. No. Method Description



1 getJobName() User-specified job name.
2 getJobState() Returns the current state of the Job.
3 isComplete() Checks if the job is finished or not.
4 setInputFormatClass() Sets the InputFormat for the job.
5 setJobName(String name) Sets the user-specified job name.
6 setOutputFormatClass() Sets the Output Format for the job.
7 setMapperClass(Class) Sets the Mapper for the job.
8 setReducerClass(Class) Sets the Reducer for the job.
9 setPartitionerClass(Class)Sets the Partitioner for the job.
10 setCombinerClass(Class) Sets the Combiner for the job.




8.2.4 Mapper Class

The Mapper class defines the Map job. Maps input key-value pairs to a set of intermediate key-value pairs. Maps are the individual tasks that transform the input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

Method

map is the most prominent method of the Mapper class. This method is called once for each key-value pair in the input split. The syntax is defined below:

map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)
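For illustration, a word-count style Mapper built on this method might look as follows (a hedged sketch; the class name WordCountMapper is an assumption, not part of the Hadoop API):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every word in each input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);     // one intermediate pair per word
        }
    }
}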

8.2.5 Reducer Class

The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values that share a key to a smaller set of values. Reducer implementations can access the Configuration for a job via the JobContext.getConfiguration() method. A Reducer has three primary phases Shuffle, Sort, and Reduce.

  1. Shuffle The Reducer copies the sorted output from each Mapper using HTTP across the network.
  2. Sort The framework merge-sorts the Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched, they are merged.
  3. Reduce In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.

Method

reduce is the most prominent method of the Reducer class. This method is called once for each key on the collection of key-value pairs. The syntax is defined below

reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)
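A matching word-count style Reducer (again, only a sketch with an illustrative class name) might look as follows:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted for each word by the Mapper (and, optionally, a Combiner).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // <word, total count>
    }
}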

8.3 MapReduce Algorithm

Generally MapReduce paradigm is based on sending map-reduce programs to computers where the actual data resides.

  1. During a MapReduce job, Hadoop sends Map and Reduce tasks to appropriate servers in the cluster.
  2. The framework manages all the details of data-passing like issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
  3. Most of the computing takes place on the nodes with data on local disks that reduces the network traffic.
  4. After completing a given task, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.


PIC


8.3.1 Input Output - Java perspective

The MapReduce framework operates on key-value pairs, that is, the framework views the input to the job as a set of key-value pairs and produces a set of key-value pairs as the output of the job, conceivably of different types. The key and value classes have to be serializable by the framework and hence are required to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Both the input and output format of a MapReduce job are in the form of key-value pairs.

(Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output)


Table 8.2: MapReduce Execution



Input Output



Map    <k1, v1>        list(<k2, v2>)
Reduce <k2, list(v2)>  list(<k3, v3>)
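As noted above, custom key types must implement WritableComparable so that the framework can serialize, shuffle, and sort them. A minimal sketch of such a key (the class name YearMonthKey is purely illustrative) is given below.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A custom key type: it must be serializable (write/readFields) and
// comparable (compareTo) so that the framework can shuffle and sort it.
public class YearMonthKey implements WritableComparable<YearMonthKey> {

    private int year;
    private int month;

    public YearMonthKey() { }                     // no-arg constructor required by Hadoop

    public YearMonthKey(int year, int month) {
        this.year = year;
        this.month = month;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(month);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        month = in.readInt();
    }

    @Override
    public int compareTo(YearMonthKey other) {
        int byYear = Integer.compare(year, other.year);
        return (byYear != 0) ? byYear : Integer.compare(month, other.month);
    }

    @Override
    public int hashCode() {                        // used by the default HashPartitioner
        return 31 * year + month;
    }

    @Override
    public String toString() {
        return year + "-" + month;
    }
}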




8.4 MapReduce Implementation

8.4.1 Example - 1: Processing maximum units of electricity consumption

The following table shows the data regarding the electrical consumption of an organization. The table includes the monthly electrical consumption and the annual average for five consecutive years.


Table 8.3: MapReduce Example - Data of Electrical Consumption














       Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg 

1979    23  23   2  43  24  25  26  26  26  26  25  26  25 
1980    26  27  28  28  28  30  31  31  31  30  30  30  29 
1981    31  32  32  32  33  34  35  36  36  34  34  34  34 
1984    39  38  39  39  39  41  42  43  40  39  38  38  40 
1985    38  39  39  39  39  41  41  41   0  40  39  39  45 

We need to write applications to process the input data in the given table to find the year of maximum usage, the year of minimum usage, and so on. This task is easy for programmers with a finite number of records: they simply write the logic to produce the required output and pass the data to the application. Let us now raise the scale of the input data. Assume we have to analyze the electrical consumption of all the large-scale industries of a particular state. When we write applications to process such bulk data,

  1. They will take a lot of time to execute.
  2. There will be heavy network traffic when we move data from the source to the network server.

To solve these problems, we have the MapReduce framework. The above data can be saved in a simple text file (.txt) and used as input. With the header removed and the fields separated by tabs, the data looks as shown below.


Table 8.4: Electric consumption data modified (header removed)














1979    23 23 2  43 24 25 26 26 26 26 25 26 25 
1980    26 27 28 28 28 30 31 31 31 30 30 30 29 
1981    31 32 32 32 33 34 35 36 36 34 34 34 34 
1984    39 38 39 39 39 41 42 43 40 39 38 38 40 
1985    38 39 39 39 39 41 41 41 0  40 39 39 45















The MapReduce code goes as below:

package hadoop; 

import java.util.*; 
import java.io.IOException; 

import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.conf.*; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapred.*; 
import org.apache.hadoop.util.*; 

public class ProcessUnits 
{ 
  //Mapper class 
  public static class E_EMapper extends MapReduceBase implements 
  Mapper<LongWritable, /*Input key Type */ 
  Text,             /*Input value Type*/ 
  Text,             /*Output key Type*/ 
  IntWritable>       /*Output value Type*/ 
  { 
    //Map function: emits <year, yearly average> for every input record 
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException 
    { 
      String line = value.toString(); 
      String lasttoken = null; 
      StringTokenizer s = new StringTokenizer(line, "\t"); 
      String year = s.nextToken(); 

      while(s.hasMoreTokens()){ 
         lasttoken = s.nextToken();   // the last token in the line is the yearly average 
      } 

      int avgprice = Integer.parseInt(lasttoken); 
      output.collect(new Text(year), new IntWritable(avgprice)); 
    } 
  } 

  //Reducer class 

  public static class E_EReduce extends MapReduceBase implements 
  Reducer< Text, IntWritable, Text, IntWritable > 
  { 
    //Reduce function: keeps only the years whose average exceeds 30 units 
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException 
    { 
      int maxavg = 30; 
      int val = Integer.MIN_VALUE; 
      while (values.hasNext()) 
      { 
         if((val = values.next().get()) > maxavg) 
         { 
           output.collect(key, new IntWritable(val)); 
         } 
      } 
    } 
  } 

  //Main function 

  public static void main(String args[]) throws Exception 
  { 
    JobConf conf = new JobConf(ProcessUnits.class); 

    conf.setJobName("max_electricityunits"); 

    conf.setOutputKeyClass(Text.class); 
    conf.setOutputValueClass(IntWritable.class); 

    conf.setMapperClass(E_EMapper.class); 
    conf.setCombinerClass(E_EReduce.class); 
    conf.setReducerClass(E_EReduce.class); 

    conf.setInputFormat(TextInputFormat.class); 
    conf.setOutputFormat(TextOutputFormat.class); 

    FileInputFormat.setInputPaths(conf, new Path(args[0])); 
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); 

    JobClient.runJob(conf); 
  } 
}

Save the above program into ProcessUnits.java. The compilation and execution of the program is explained below.

Compilation and Execution of the ProcessUnits Program

Let us assume we are in the home directory of the Hadoop user (e.g. /home/hadoop). Follow the steps below to compile and execute the program.

  1. Use the following command to create a directory to store the compiled java classes.
    mkdir units
  2. Download hadoop-core-1.2.1.jar (or, for Hadoop 2.x, hadoop-mapreduce-client-core-2.7.2.jar), which is used to compile and execute the MapReduce program. Download the jar from https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1 for 1.2.1 or http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core/2.7.2 for 2.7.2. Let us assume the download folder is /home/hadoop/.
  3. The following commands are used to compile the ProcessUnits.java program and to create a jar for the program.
    $ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java 
    $ jar -cvf units.jar -C units/ .
  4. The following command is used to create an input directory in HDFS.
    $HADOOP_HOME/bin/hadoop fs -mkdir input_dir
  5. The following command is used to copy the input file named sample.txt into the input directory of HDFS.
    $HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
  6. The following command is used to verify the files in the input directory.
    $HADOOP_HOME/bin/hadoop fs -ls input_dir/
  7. The following command is used to run the ProcessUnits application by taking input files from the input directory.
    $HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
  8. The following command is used to verify the resultant files in the output folder.
    $HADOOP_HOME/bin/hadoop fs -ls output_dir/
  9. The following command is used to see the output in the part-00000 file. This file is generated by HDFS.
    $HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000
  10. The following command is used to copy the output folder from HDFS to the local file system.
    $HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop

8.4.2 Example - 2: Processing Partitioning

8.5 Simple MapReduce Examples

Hadoop ships with built-in MapReduce examples. MapReduce programs are usually executed through jar files, and these jar files are created from the Java class files produced by the Java compiler. For instance, imagine that we have a Java program called WordCount; it must be processed through the commands below.

javac -classpath `hadoop classpath` WordCount.java 
jar cf wordcount.jar WordCount*.class

These steps are explained in Section 8.4 on MapReduce implementation, and they suit those who are comfortable writing programs in core Java. For beginners, however, this might not be possible and can be cumbersome at first; naive users may not be able to understand the algorithms and write such programs at the very beginning of their practice. For them there is a convenient mechanism to practice and understand MapReduce routines, namely the file hadoop-mapreduce-examples-[version].jar located under the share folder in HADOOP_HOME. You can find a separate hadoop/mapreduce directory in the share folder where this file can be found.

hduser@hadoop:/usr/local/hadoop/share/hadoop/mapreduce$ ls 
hadoop-mapreduce-client-app-2.7.1.jar hadoop-mapreduce-client-hs-plugins-2.7.1.jar hadoop-mapreduce-examples-2.7.1.jar sources 
hadoop-mapreduce-client-common-2.7.1.jar hadoop-mapreduce-client-jobclient-2.7.1.jar lib 
hadoop-mapreduce-client-core-2.7.1.jar hadoop-mapreduce-client-jobclient-2.7.1-tests.jar lib-examples 
hadoop-mapreduce-client-hs-2.7.1.jar hadoop-mapreduce-client-shuffle-2.7.1.jar    part-r-00000 
hduser@hadoop:/usr/local/hadoop/share/hadoop/mapreduce$

It is easy to find out what programs are supported by hadoop-mapreduce-examples-*.jar: simply execute hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar without any arguments and it prints the list of available example programs. The following are the examples that you might find most useful for practice.

  1. aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  2. aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  3. bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  4. dbcount: An example job that counts the pageview counts from a database.
  5. distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  6. grep: A map/reduce program that counts the matches of a regex in the input.
  7. join: A job that effects a join over sorted, equally partitioned datasets.
  8. multifilewc: A job that counts words from several files.
  9. pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  10. pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  11. randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  12. randomwriter: A map/reduce program that writes 10GB of random data per node.
  13. secondarysort: An example defining a secondary sort to the reduce.
  14. sort: A map/reduce program that sorts the data written by the random writer.
  15. sudoku: A sudoku solver.
  16. teragen: Generate data for the terasort.
  17. terasort: Runs the terasort. Terasort is a computing benchmark used to evaluate the time taken to sort randomly distributed data on a given computer. It is used to assess the performance of Hadoop MapReduce on a cluster.
  18. teravalidate: Checks the results of terasort.
  19. wordcount: A map/reduce program that counts the words in the input files.
  20. wordmean: A map/reduce program that counts the average length of the words in the input files.
  21. wordmedian: A map/reduce program that counts the median length of the words in the input files.
  22. wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

One interesting question is: what is this hadoop-core-*.*.*.jar and how is it different from hadoop-common-*.*.*.jar?

8.6 MapReduce Algorithms

8.6.1 A-1: Count

Let us start with something really simple. The code snippet below shows a Mapper that simply emits "1" for each term it processes and a Reducer that goes through the lists of ones and sums them up:

class Mapper 
  method Map(docid id, doc d) 
    for all term t in doc d do 
      Emit(term t, count 1) 

class Reducer 
  method Reduce(term t, counts [c1, c2,...]) 
    sum = 0 
    for all count c in [c1, c2,...] do 
       sum = sum + c 
    Emit(term t, count sum)

The obvious disadvantage of this approach is the high number of dummy counters emitted by the Mapper. The Mapper can decrease the number of counters by summing the counters for each document:

class Mapper 
  method Map(docid id, doc d) 
    H = new AssociativeArray 
    for all term t in doc d do 
       H{t} = H{t} + 1 
    for all term t in H do 
      Emit(term t, count H{t})

In order to accumulate counters not only for one document, but for all documents processed by one Mapper node, it is possible to leverage Combiners:

class Mapper 
  method Map(docid id, doc d) 
    for all term t in doc d do 
      Emit(term t, count 1) 

class Combiner 
  method Combine(term t, [c1, c2,...]) 
    sum = 0 
    for all count c in [c1, c2,...] do 
       sum = sum + c 
    Emit(term t, count sum) 

class Reducer 
  method Reduce(term t, counts [c1, c2,...]) 
    sum = 0 
    for all count c in [c1, c2,...] do 
       sum = sum + c 
    Emit(term t, count sum)
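In Hadoop's Java API, the Combiner variant above amounts to registering a Reducer-style class with setCombinerClass. A minimal driver sketch under that assumption is shown below; it reuses the WordCountMapper and WordCountReducer classes sketched in Sections 8.2.4 and 8.2.5 (illustrative names), and reusing the Reducer as the Combiner is safe here because summation is associative and commutative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // local, per-mapper aggregation
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}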


Chapter 9
Pig

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Pig. To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.

To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.

9.1 History

In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute MapReduce jobs on every dataset. In 2007, Apache Pig was open sourced via the Apache incubator. In 2008, the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project. The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.

To perform a particular task using Pig, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, Embedded). After execution, these scripts go through a series of transformations applied by the Pig Framework to produce the desired output. Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus it makes the programmer's job easy. The architecture of Apache Pig is shown below.


PIC


Why do we need Pig?

Programmers who are not so good at Java normally used to struggle working with Hadoop, especially while performing any MapReduce tasks. Apache Pig is a boon for all such programmers.

Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex codes in Java.

Apache Pig uses a multi-query approach, thereby reducing the length of codes. For example, an operation that would require you to type 200 lines of code (LoC) in Java can be easily done by typing as few as 10 LoC in Apache Pig. Ultimately Apache Pig reduces the development time by almost 16 times. Pig Latin is a SQL-like language and it is easy to learn Apache Pig when you are familiar with SQL.

Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.

Features of Pig

Apache Pig comes with the following features.

  1. Rich set of operators It provides many operators to perform operations like join, sort, filter, etc.
  2. Ease of programming Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.
  3. Optimization opportunities The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on semantics of the language.
  4. Extensibility Using the existing operators, users can develop their own functions to read, process, and write data.
  5. UDF’s Pig provides the facility to create User-defined Functions in other programming languages such as Java and invoke or embed them in Pig Scripts.
  6. Handles all kinds of data Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.

Comparison between Pig and MapReduce

Pig

  1. Apache Pig is a data flow language. It is a high-level language.
  2. Performing a Join operation in Apache Pig is pretty simple.
  3. Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig.
  4. Apache Pig uses a multi-query approach, thereby reducing the length of the codes to a great extent.
  5. There is no need for compilation. On execution, every Apache Pig operator is converted internally into a MapReduce job.

MapReduce

Comparison between Pig and SQL

Listed below are the major differences between Apache Pig and SQL.

Pig

  1. Pig Latin is a procedural language.
  2. In Apache Pig, schema is optional; we can store data without designing a schema (values are addressed as $0, $1, and so on).
  3. The data model in Apache Pig is nested relational.
  4. Apache Pig provides limited opportunity for query optimization.

SQL

Comparison between Apache Pig Vs Hive

Both Apache Pig and Hive are used to create MapReduce jobs. And in some cases, Hive operates on HDFS in a similar way Apache Pig does. Listed below are a few significant points that set Apache Pig apart from Hive.

Pig

  1. Apache Pig uses a language called Pig Latin. It was originally created at Yahoo.
  2. Pig Latin is a data flow language.
  3. Pig Latin is a procedural language and it fits in the pipeline paradigm.
  4. Apache Pig can handle structured, unstructured, and semi-structured data.

Hive

Applications of Apache Pig

Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and quick prototyping. Apache Pig is used

  1. To process huge data sources such as web logs.
  2. To perform data processing for search platforms.
  3. To process time sensitive data loads.

9.2 Installation

Installation of Pig is straightforward: just extract the compressed binaries to a certain folder, update the path, and configure.

Go to http://mirror.fibergrid.in/apache/pig/latest/ and download both pig-0.16.0-src.tar.gz and pig-0.16.0.tar.gz. But it is always better to visit the main portal https://pig.apache.org/ for updates and links. The release pages at this portal redirect you to a couple of other places where the aforementioned two compressed packages (in .tar.gz) can be found.

9.2.1 Procedure

Follow the below steps to install Pig in your Linux machine.

  1. Step 1: Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig directory under the user named Hadoop): mkdir Pig
  2. Step 2: Extract the source file into the directory created in the above step: tar zxvf pig-0.16.0.tar.gz (you may extract it anywhere you like).
  3. Step 3: Move the extracted source files to the Pig directory: mv pig-0.16.0/* /usr/local/Pig. The path /usr/local/Pig is my path; you may create such a path anywhere, but preferably in the directory where Hadoop is installed.

Configure

After installing Apache Pig, we have to configure it. To configure, we need to edit two files: .bashrc and pig.properties.

Open the .bashrc file and add the following lines:

export PIG_HOME=/home/Hadoop/Pig  
export PATH=$PATH:$PIG_HOME/bin  
export PIG_CLASSPATH=$HADOOP_HOME/conf

Now run source ~/.bashrc to apply the changes. Then go back to the parent directory and try pig -version. You should see the following response in the terminal.

Apache Pig version 0.16.0 (r1746530)  
compiled Jun 01 2016, 23:10:49

Check the properties file (located in the $PIG_HOME/conf directory) with the help of sudo gedit $PIG_HOME/conf/pig.properties.

Apache Pig Components

As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.

Parser

Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

A directed acyclic graph (DAG) is a finite directed graph with no directed cycles. That is, it consists of finitely many vertices and edges, with each edge directed from one vertex to another, such that there is no way to start at any vertex v and follow a consistently-directed sequence of edges that eventually loops back to v again. Equivalently, a DAG is a directed graph that has a topological ordering, a sequence of the vertices such that every edge is directed from earlier to later in the sequence. Though it may seem that such a graph is traversed in random steps, there always exists a topological ordering of those steps. These methods are covered under topological sorting algorithms; one of the oldest applications that uses such sorting is PERT. A. B. Kahn (1962) was perhaps the very first person to write about such techniques; his study on non-empty acyclic graphs paved the way for topological ordering. The other algorithm, of course, is depth-first search, explained by Tarjan (1976) and later by Cormen et al. in 2001. 14
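For the curious reader, the idea behind Kahn's algorithm can be sketched in a few lines of Java. This is a generic illustration of topological ordering on a DAG, not code taken from Pig; the adjacency-list representation and method names are assumptions for the sketch.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// A compact sketch of Kahn's (1962) topological sort on a DAG given as an
// adjacency list; graph.get(u) lists the vertices that u points to.
public class KahnTopologicalSort {

    public static List<Integer> sort(List<List<Integer>> graph) {
        int n = graph.size();
        int[] indegree = new int[n];
        for (List<Integer> edges : graph) {
            for (int v : edges) indegree[v]++;
        }

        Deque<Integer> ready = new ArrayDeque<>();   // vertices with no incoming edges
        for (int v = 0; v < n; v++) {
            if (indegree[v] == 0) ready.add(v);
        }

        List<Integer> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            int u = ready.poll();
            order.add(u);
            for (int v : graph.get(u)) {             // "remove" u's outgoing edges
                if (--indegree[v] == 0) ready.add(v);
            }
        }
        return order;                                // size < n would indicate a cycle
    }
}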

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these MapReduce jobs are executed on Hadoop to produce the desired results.

Pig Latin Data Model

The data model of Pig Latin is fully nested and it allows complex non-atomic data types such as map and tuple. Given below is the diagrammatical representation of Pig Latin’s data model.


PIC


  1. Atom: Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field. Ex. 'Kamakshaiah' or '40' etc.
  2. Tuple: A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any type. A tuple is similar to a row in a table of RDBMS. Ex. (Sukanya, 38)
  3. Bag: A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by '{}'. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type. A bag can be a field in a relation; in that context, it is known as an inner bag. Ex. {(Amrutha, 11), (Bhargav, 6)}
  4. Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by ‘[]’. Ex. [name#Kamakshaiah, age#40]
  5. Relation: A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).

Apache Pig Execution Mode

You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode

In this mode, all the files are installed and run from your local host and local file system. There is no need of Hadoop or HDFS. This mode is generally used for testing purpose.

MapReduce Mode

MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

  1. Interactive Mode (Grunt shell) You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
  2. Batch Mode (Script) You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with .pig extension.
  3. Embedded Mode (UDF) Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.

Invoking the Grunt Shell

Pig can be invoked in two different execution modes, viz. (1) local mode and (2) MapReduce mode; in either mode it can be used interactively through the Grunt shell. You can invoke the Grunt shell in the desired mode (local/MapReduce) using the -x option as shown below.

pig -x local  
pig -x mapreduce

In local mode, all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local). Note that local mode does not support parallel mapper execution with Hadoop 0.20.x and 1.0.0.

9.2.2 Shell Scripts

The Pig shell (grunt>) has a provision to execute shell commands, so the user need not switch back and forth between BASH and Grunt. However, such commands are few in number.

For instance, to know the contents of the present working directory you can run sh ls directly at the grunt> prompt.

Notes

14http://dl.acm.org/citation.cfm?doid=368996.369025