Hadoop Unit-4
UNIT-4
Hadoop:
1. Meet Hadoop
2. Comparison with other systems
3. A brief history of Hadoop and the Hadoop ecosystem
4. Analysing the Data with Hadoop
5. Hadoop Distributed File System
6. HDFS concepts
7. Design of HDFS
8. Data Flow in HDFS
9. Developing a Map Reduce Application
10. How Map Reduce Works
TEXTBOOK
URL: https://www.isical.ac.in/~acmsc/WBDA2015/slides/hg/Oreilly.Hadoop.The.Definitive.Guide.3rd.Edition.Jan.2012.pdf
TEXTBOOK- INDEX
S.NO NAME OF THE TOPIC CHAPTER PAGE NO
1 Meet Hadoop 1 1-4
2 Comparison with other systems 1 4-8
3 A brief history of Hadoop and the Hadoop ecosystem 1 9-13
4 Analysing the Data with Hadoop 2 20-30
5 Hadoop Distributed File System 3 45-45
6 Design of HDFS 3 45-46
7 HDFS concepts 3 47-51
8 Data Flow in HDFS 3 69-75
9 Developing a Map Reduce Application 5 145-182
10 How Map Reduce Works 6 187-217
MEET HADOOP
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across
clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the
application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to
failures.
Hadoop provides a reliable, shared storage and analysis system: the storage is
provided by HDFS and the analysis by MapReduce.
Comparison with other systems
(RDBMS vs. Hadoop)
9. RDBMS: high data integrity. Hadoop: lower data integrity than an RDBMS.
10. RDBMS: cost applies, since it is licensed software. Hadoop: free of cost, as it is open-source software.
A brief history of Hadoop and the Hadoop ecosystem
The Origin of the Name “Hadoop”
The name Hadoop is not an acronym; it’s a made-up name. The project’s
creator, Doug Cutting, explains how the name came about:
“The name my kid gave a stuffed yellow elephant. Short, relatively easy
to spell and pronounce, meaningless, and not used elsewhere: those are
my naming criteria. Kids are good at generating such names.”
Although Hadoop is best known for MapReduce and its distributed
filesystem (HDFS), the term is also used for a family of related projects that
fall under the umbrella of infrastructure for distributed computing and
large-scale data processing.
The Hadoop ecosystem consists of:
1. Common
A set of components and interfaces for distributed filesystems and general
I/O (serialization, Java RPC, persistent data structures).
2. Avro
A serialization system for efficient, cross-language RPC, and persistent data
storage.
3. MapReduce
A distributed data processing model and execution environment that runs on
large clusters of commodity machines.
4. HDFS
A distributed filesystem that runs on large clusters of commodity machines.
5. Pig
A data flow language and execution environment for exploring very large
datasets. Pig runs on HDFS and MapReduce clusters.
6. Hive
A distributed data warehouse. Hive manages data stored in HDFS and
provides a query language based on SQL (and which is translated by the
runtime engine to MapReduce jobs) for querying the data.
7. HBase
A distributed, column-oriented database. HBase uses HDFS for its
underlying storage and supports both batch-style computations using
MapReduce and point queries (random reads).
8. ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides
primitives such as distributed locks that can be used for building distributed
applications.
9. Sqoop
A tool for efficiently moving data between relational databases and HDFS.
Analysing the Data with Hadoop
A test run
After writing a MapReduce job, it's normal to try it out on a small
dataset to flush out any immediate problems with the code.
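A minimal sketch of what such a job can look like, using the standard org.apache.hadoop.mapreduce API. This is a generic word-count example (class and path names are illustrative), not the textbook's weather-data program:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count job: the mapper emits (word, 1) for every word,
// and the reducer sums the counts for each word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // small test input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A small test run could then look something like hadoop jar wordcount.jar WordCount input/ output/, with input/ containing only a few sample files.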
Hadoop Distributed File System
When a dataset outgrows the storage capacity of a single physical
machine, it becomes necessary to partition it across a number of
separate machines.
Filesystems that manage the storage across a network of machines are
called distributed filesystems.
Since they are network-based, all the complications of network
programming kick in, thus making distributed filesystems more
complex than regular disk filesystems.
For example, one of the biggest challenges is making the filesystem
tolerate node failure without suffering data loss.
Hadoop comes with a distributed filesystem called HDFS, which
stands for Hadoop Distributed Filesystem.
DESIGN OF HDFS
HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware.
1. Very large files
“Very large” in this context means files that are hundreds of
megabytes, gigabytes, or terabytes in size. There are Hadoop clusters
running today that store petabytes of data.
2. Streaming data access
HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern. A dataset is typically
generated or copied from source, then various analyses are performed
on that dataset over time. Each analysis will involve a large
proportion, if not all, of the dataset, so the time to read the whole
dataset is more important than the latency in reading the first record.
3. Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware to run
on. It’s designed to run on clusters of commodity hardware
(commonly available hardware from multiple vendors) for which the
chance of node failure across the cluster is high, at least for large
clusters. HDFS is designed to carry on working without a noticeable
interruption to the user in the case of such failure.
It is also worth examining the applications for which using HDFS
does not work so well. While this may change in the future, there are
areas where HDFS is not a good fit today:
Low-latency data access.
Lots of small files.
Multiple writers, arbitrary file modifications.
1. Low-latency data access
Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS. Remember, HDFS is
optimized for delivering a high throughput of data, and this may be at the
expense of latency. HBase is currently a better choice for low-latency
access. (Latency: the period of delay when one component of a hardware
system is waiting for an action to be executed by another component.)
2. Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on
the namenode. As a rule of thumb, each file, directory, and block takes
about 150 bytes of namenode memory. So, for example, one million files, each
taking one block, amount to roughly two million objects (one file entry plus
one block entry per file), or 2,000,000 × 150 bytes ≈ 300 MB of memory at a
minimum. While storing millions of files is feasible, billions is beyond the
capability of current hardware.
3. Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always
made at the end of the file. There is no support for multiple writers, or
for modifications at arbitrary offsets in the file. (These might be
supported in the future, but they are likely to be relatively inefficient.)
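A minimal sketch of this single-writer, append-only model using the public FileSystem API (the path is hypothetical, and append must be enabled on the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: a single writer creates a file and later appends to its end.
// There is no API for writing at an arbitrary offset of an existing file.
public class SingleWriterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/hadoop/log.txt");  // hypothetical path

    // The first (and only) writer creates the file and writes sequentially.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("first record\n");
    }

    // Later writes can only be added at the end of the file
    // (and only if append is supported and enabled on the cluster).
    try (FSDataOutputStream out = fs.append(file)) {
      out.writeBytes("second record\n");
    }

    fs.close();
  }
}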
HDFS CONCEPTS

Block Replication
Replication Strategy:
The first replica is placed on the local node, the second replica on a node
on a remote rack, and the third replica on another node on the same remote
rack. Additional replicas are placed randomly.
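As an illustrative sketch (the path is hypothetical), the FileSystem API can be used to see which datanodes ended up holding the replicas of each block of a file, and to request a different replication factor for that file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: inspect where the replicas of each block of a file are stored,
// and request a different per-file replication factor.
public class ReplicationInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/hadoop/data/part-00000");  // hypothetical path

    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication factor: " + status.getReplication());

    // One BlockLocation per block; getHosts() lists the datanodes holding its replicas.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset() + " -> "
          + String.join(", ", block.getHosts()));
    }

    // Ask the namenode to keep 3 replicas of this file.
    fs.setReplication(file, (short) 3);

    fs.close();
  }
}

The same information is also available from the command line via hdfs fsck <path> -files -blocks -locations.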
HDFS Federation
The namenode keeps a reference to every file and block in the filesystem in
memory, which means that on very large clusters with many files, memory
becomes the limiting factor for scaling.
HDFS Federation, introduced in the 0.23 release series, allows a cluster to
scale by adding namenodes, each of which manages a portion of the
filesystem namespace.
For example, one namenode might manage all the files rooted under /user,
say, and a second namenode might handle files under /share.
Under federation, each namenode manages a namespace volume, which is
made up of the metadata for the namespace, and a block pool containing all
the blocks for the files in the namespace.
Namespace volumes are independent of each other, which means namenodes
do not communicate with one another, and the failure of one namenode does
not affect the availability of the namespaces managed by other namenodes.
Block pool storage is not partitioned, however, so datanodes register with
each namenode in the cluster and store blocks from multiple block pools.
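A minimal client-side sketch, assuming two federated namenodes reachable at the hypothetical addresses namenode1:8020 and namenode2:8020:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: under federation each namenode serves its own part of the namespace,
// so a client can address each one through its own URI. Hostnames are hypothetical.
public class FederationClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Namenode responsible for everything under /user
    FileSystem userFs = FileSystem.get(new URI("hdfs://namenode1:8020"), conf);
    for (FileStatus s : userFs.listStatus(new Path("/user"))) {
      System.out.println("user volume: " + s.getPath());
    }

    // A different namenode responsible for everything under /share
    FileSystem shareFs = FileSystem.get(new URI("hdfs://namenode2:8020"), conf);
    for (FileStatus s : shareFs.listStatus(new Path("/share"))) {
      System.out.println("share volume: " + s.getPath());
    }

    userFs.close();
    shareFs.close();
  }
}

In practice a client-side mount table (ViewFs) is usually configured so that applications see a single unified namespace rather than addressing each namenode by URI.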
HDFS High-Availability
The namenode is still a single point of failure (SPOF), since if it did
fail, all clients—including MapReduce jobs—would be unable to
read, write, or list files, because the namenode is the sole repository
of the metadata and the file-to-block mapping.
In such an event the whole Hadoop system would effectively be out
of service until a new namenode could be brought online.
To recover from a failed namenode in this situation, an administrator
starts a new primary namenode with one of the filesystem metadata
replicas and configures datanodes and clients to use this new
namenode.
On large clusters with many files and blocks, the time it takes for a
namenode to start from cold can be 30 minutes or more.
Failover
The transition from the active namenode to the standby
is managed by a new entity in the system called the
failover controller. Failover controllers are pluggable,
but the first implementation uses ZooKeeper to ensure
that only one namenode is active.
Failover may also be initiated manually by an
administrator, for example in the case of routine
maintenance. This is known as a graceful failover, since
the failover controller arranges an orderly transition for
both namenodes to switch roles.
Fencing
In the case of an ungraceful failover, however, it is impossible to be sure
that the failed namenode has stopped running. For example, a slow
network or a network partition can trigger a failover transition, even
though the previously active namenode is still running, and thinks it is
still the active namenode.
The HA implementation goes to great lengths to ensure that the
previously active namenode is prevented from doing any damage and
causing corruption—a method known as fencing.
The system employs a range of fencing mechanisms, including killing the
namenode’s process, revoking its access to the shared storage directory
(typically by using a vendor-specific NFS command), and disabling its
network port via a remote management command.
As a last resort, the previously active namenode can be fenced with a
technique rather graphically known as STONITH, or “shoot the other
node in the head”, which uses a specialized power distribution unit to
forcibly power down the host machine.
Data Flow in HDFS