Hadoop Unit-4
UNIT-4
Hadoop:
1. Meet Hadoop
2. Comparison with other systems
3. A brief history of Hadoop and the Hadoop ecosystem
4. Analysing the Data with Hadoop
5. Hadoop Distributed File System
6. HDFS concepts
7. Design of HDFS
8. Data Flow in HDFS
9. Developing a Map Reduce Application
10. How Map Reduce Works
TEXTBOOK
URL: https://www.isical.ac.in/~acmsc/WBDA2015/slides/hg/Oreilly.Hadoop.The.Definitive.Guide.3rd.Edition.Jan.2012.pdf
TEXTBOOK- INDEX
S.NO NAME OF THE TOPIC CHAPTER PAGE NO
1 Meet Hadoop 1 1-4
2 Comparison with other systems 1 4-8
3 A brief history of Hadoop and the Hadoop ecosystem 1 9-13
4 Analysing the Data with Hadoop 2 20-30
5 Hadoop Distributed File System 3 45-45
6 Design of HDFS 3 45-46
7 HDFS concepts 3 47-51
8 Data Flow in HDFS 3 69-75
9 Developing a Map Reduce Application 5 145-182
10 How Map Reduce Works 6 187-217
MEET HADOOP
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across
clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the
application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to
failures.
Hadoop provides a reliable, shared storage and analysis system: the storage is
provided by HDFS and the analysis by MapReduce.
Comparison with other systems
(RDBMS vs. Hadoop)
9. RDBMS: high data integrity. Hadoop: lower data integrity than an RDBMS.
10. RDBMS: cost applies, since it is licensed software. Hadoop: free of cost, as it is open-source software.
A brief history of Hadoop and the Hadoop ecosystem
The Origin of the Name “Hadoop”
The name Hadoop is not an acronym; it’s a made-up name. The project’s
creator, Doug Cutting, explains how the name came about:
“The name my kid gave a stuffed yellow elephant. Short, relatively easy
to spell and pronounce, meaningless, and not used elsewhere: those are
my naming criteria. Kids are good at generating such names.”
Although Hadoop is best known for MapReduce and its distributed
filesystem (HDFS), the term is also used for a family of related projects that
fall under the umbrella of infrastructure for distributed computing and
large-scale data processing.
The Hadoop ecosystem consists of:
1. Common
A set of components and interfaces for distributed filesystems and general
I/O (serialization, Java RPC, persistent data structures).
2. Avro
A serialization system for efficient, cross-language RPC, and persistent data
storage.
3. MapReduce
A distributed data processing model and execution environment that runs on
large clusters of commodity machines.
4. HDFS
A distributed filesystem that runs on large clusters of commodity machines.
5. Pig
A data flow language and execution environment for exploring very large
datasets. Pig runs on HDFS and MapReduce clusters.
6. Hive
A distributed data warehouse. Hive manages data stored in HDFS and
provides a query language based on SQL (and which is translated by the
runtime engine to MapReduce jobs) for querying the data.
7. HBase
A distributed, column-oriented database. HBase uses HDFS for its
underlying storage and supports both batch-style computations using
MapReduce and point queries (random reads).
8. ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides
primitives such as distributed locks that can be used for building distributed
applications.
9. Sqoop
A tool for efficiently moving data between relational databases and HDFS.
Analysing the Data with Hadoop
A test run
After writing a MapReduce job, it's normal to try it out on a small
dataset to flush out any immediate problems with the code.
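A minimal sketch of what such a job can look like, using the standard org.apache.hadoop.mapreduce API. This is a generic word-count example (class and path names are illustrative), not the textbook's weather-data program:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count job: the mapper emits (word, 1) for every word,
// and the reducer sums the counts for each word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // small test input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A small test run could then look something like hadoop jar wordcount.jar WordCount input/ output/, with input/ containing only a few sample files.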
Hadoop Distributed File System
When a dataset outgrows the storage capacity of a single physical
machine, it becomes necessary to partition it across a number of
separate machines.
Filesystems that manage the storage across a network of machines are
called distributed filesystems.
Since they are network-based, all the complications of network
programming kick in, thus making distributed filesystems more
complex than regular disk filesystems.
For example, one of the biggest challenges is making the filesystem
tolerate node failure without suffering data loss.
Hadoop comes with a distributed filesystem called HDFS, which
stands for Hadoop Distributed Filesystem.
DESIGN OF HDFS
HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware.
1. Very large files
“Very large” in this context means files that are hundreds of
megabytes, gigabytes, or terabytes in size. There are Hadoop clusters
running today that store petabytes of data.
2. Streaming data access
HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern. A dataset is typically
generated or copied from source, then various analyses are performed
on that dataset over time. Each analysis will involve a large
proportion, if not all, of the dataset, so the time to read the whole
dataset is more important than the latency in reading the first record.
3. Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware to run
on. It’s designed to run on clusters of commodity hardware
(commonly available hardware from multiple vendors) for which the
chance of node failure across the cluster is high, at least for large
clusters. HDFS is designed to carry on working without a noticeable
interruption to the user in the case of such failure.
It is also worth examining the applications for which using HDFS
does not work so well. While this may change in the future, there are
areas where HDFS is not a good fit today:
Low-latency data access.
Lots of small files.
Multiple writers, arbitrary file modifications.
1. Low-latency data access
Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS. Remember, HDFS is
optimized for delivering a high throughput of data, and this may be at the
expense of latency. HBase is currently a better choice for low-latency
access. (Latency: the period of delay when one component of a hardware
system is waiting for an action to be executed by another component.)
2. Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on
the namenode. As a rule of thumb, each file, directory, and block takes
about 150 bytes of namenode memory. So, for example, one million files, each
taking one block, amount to roughly two million objects (one file entry plus
one block entry per file), or 2,000,000 × 150 bytes ≈ 300 MB of memory at a
minimum. While storing millions of files is feasible, billions is beyond the
capability of current hardware.
3. Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always
made at the end of the file. There is no support for multiple writers, or
for modifications at arbitrary offsets in the file. (These might be
supported in the future, but they are likely to be relatively inefficient.)
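A minimal sketch of this single-writer, append-only model using the public FileSystem API (the path is hypothetical, and append must be enabled on the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: a single writer creates a file and later appends to its end.
// There is no API for writing at an arbitrary offset of an existing file.
public class SingleWriterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/hadoop/log.txt");  // hypothetical path

    // The first (and only) writer creates the file and writes sequentially.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("first record\n");
    }

    // Later writes can only be added at the end of the file
    // (and only if append is supported and enabled on the cluster).
    try (FSDataOutputStream out = fs.append(file)) {
      out.writeBytes("second record\n");
    }

    fs.close();
  }
}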
HDFS CONCEPTS

Block Replication
Replication Strategy:
The first replica is placed on the local node, the second replica on a node
on a remote rack, and the third replica on another node on the same remote
rack. Additional replicas are placed randomly.
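As an illustrative sketch (the path is hypothetical), the FileSystem API can be used to see which datanodes ended up holding the replicas of each block of a file, and to request a different replication factor for that file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: inspect where the replicas of each block of a file are stored,
// and request a different per-file replication factor.
public class ReplicationInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/hadoop/data/part-00000");  // hypothetical path

    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication factor: " + status.getReplication());

    // One BlockLocation per block; getHosts() lists the datanodes holding its replicas.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset() + " -> "
          + String.join(", ", block.getHosts()));
    }

    // Ask the namenode to keep 3 replicas of this file.
    fs.setReplication(file, (short) 3);

    fs.close();
  }
}

The same information is also available from the command line via hdfs fsck <path> -files -blocks -locations.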
HDFS Federation
The namenode keeps a reference to every file and block in the filesystem in
memory, which means that on very large clusters with many files, memory
becomes the limiting factor for scaling.
HDFS Federation, introduced in the 0.23 release series, allows a cluster to
scale by adding namenodes, each of which manages a portion of the
filesystem namespace.
For example, one namenode might manage all the files rooted under /user,
say, and a second namenode might handle files under /share.
Under federation, each namenode manages a namespace volume, which is
made up of the metadata for the namespace, and a block pool containing all
the blocks for the files in the namespace.
Namespace volumes are independent of each other, which means namenodes
do not communicate with one another, and the failure of one namenode does
not affect the availability of the namespaces managed by other namenodes.
Block pool storage is not partitioned, however, so datanodes register with
each namenode in the cluster and store blocks from multiple block pools.
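A minimal client-side sketch, assuming two federated namenodes reachable at the hypothetical addresses namenode1:8020 and namenode2:8020:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: under federation each namenode serves its own part of the namespace,
// so a client can address each one through its own URI. Hostnames are hypothetical.
public class FederationClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Namenode responsible for everything under /user
    FileSystem userFs = FileSystem.get(new URI("hdfs://namenode1:8020"), conf);
    for (FileStatus s : userFs.listStatus(new Path("/user"))) {
      System.out.println("user volume: " + s.getPath());
    }

    // A different namenode responsible for everything under /share
    FileSystem shareFs = FileSystem.get(new URI("hdfs://namenode2:8020"), conf);
    for (FileStatus s : shareFs.listStatus(new Path("/share"))) {
      System.out.println("share volume: " + s.getPath());
    }

    userFs.close();
    shareFs.close();
  }
}

In practice a client-side mount table (ViewFs) is usually configured so that applications see a single unified namespace rather than addressing each namenode by URI.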
HDFS High-Availability
The namenode is still a single point of failure (SPOF), since if it did
fail, all clients—including MapReduce jobs—would be unable to
read, write, or list files, because the namenode is the sole repository
of the metadata and the file-to-block mapping.
In such an event the whole Hadoop system would effectively be out
of service until a new namenode could be brought online.
To recover from a failed namenode in this situation, an administrator
starts a new primary namenode with one of the filesystem metadata
replicas and configures datanodes and clients to use this new
namenode.
On large clusters with many files and blocks, the time it takes for a
namenode to start from cold can be 30 minutes or more.
Failover
The transition from the active namenode to the standby
is managed by a new entity in the system called the
failover controller. Failover controllers are pluggable,
but the first implementation uses ZooKeeper to ensure
that only one namenode is active.
Failover may also be initiated manually by an
administrator, for example in the case of routine
maintenance. This is known as a graceful failover, since
the failover controller arranges an orderly transition for
both namenodes to switch roles.
Fencing
In the case of an ungraceful failover, however, it is impossible to be sure
that the failed namenode has stopped running. For example, a slow
network or a network partition can trigger a failover transition, even
though the previously active namenode is still running, and thinks it is
still the active namenode.
The HA implementation goes to great lengths to ensure that the
previously active namenode is prevented from doing any damage and
causing corruption—a method known as fencing.
The system employs a range of fencing mechanisms, including killing the
namenode’s process, revoking its access to the shared storage directory
(typically by using a vendor-specific NFS command), and disabling its
network port via a remote management command.
As a last resort, the previously active namenode can be fenced with a
technique rather graphically known as STONITH, or “shoot the other
node in the head”, which uses a specialized power distribution unit to
forcibly power down the host machine.
Data Flow in HDFS