Slide 3 Hadoop MapReduce Tutorial

Hadoop tutorial
Trong-Hop Do
S3Lab
Smart Software System Laboratory
“Without big data, you are blind and deaf and in the middle of a freeway.”
– Geoffrey Moore

Big Data
Hadoop Shell Commands
Hadoop Shell Commands to Manage HDFS
Open terminal
● In this exercise, we will practise generic HDFS commands and get ourselves familiar with the Hadoop command line interface.
● Open Terminal on your VM: navigate to Application > System Tools > Terminal, or use the shortcut on the desktop.
Hadoop Shell Commands to Manage HDFS
hdfs dfs
● View a description of all the commands associated with the FsShell subsystem.
● Command: hdfs dfs
Hadoop Shell Commands to Manage HDFS
ls
● HDFS command to display the list of files and directories in HDFS.
● Command: hdfs dfs -ls
Hadoop Shell Commands to Manage HDFS
ls
● View the content of the root directory.
● Command: hdfs dfs -ls /
● Note: the directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are separate namespaces.
Hadoop Shell Commands to Manage HDFS
ls
● The home directory is /user/cloudera/
● Command: hdfs dfs -ls or hdfs dfs -ls /user/cloudera
Hadoop Shell Commands to Manage HDFS
mkdir
● HDFS command to create a directory in HDFS.
● Usage: hdfs dfs -mkdir directory_name
● Command: hdfs dfs -mkdir dataset
Hadoop Shell Commands to Manage HDFS
touchz
● HDFS command to create a file in HDFS with a file size of 0 bytes.
● Usage: hdfs dfs -touchz directory/filename
● Command: hdfs dfs -touchz dataset/sample
Hadoop Shell Commands to Manage HDFS
du
● HDFS command to check the file size.
● Usage: hdfs dfs -du -s directory/filename
● Command: hdfs dfs -du -s dataset/sample
Hadoop Shell Commands to Manage HDFS
put
● HDFS command to copy a single source or multiple sources from the local file system to the destination file system.
● Usage: hdfs dfs -put <localsrc> <destination>
● Command: hdfs dfs -put data.txt dataset
Hadoop Shell Commands to Manage HDFS
cat
● HDFS command that reads a file on HDFS and prints the content of that file to standard output.
● Usage: hdfs dfs -cat /path/to/file_in_hdfs
● Command: hdfs dfs -cat dataset/data.txt
Hadoop Shell Commands to Manage HDFS
get
● HDFS command to copy files from HDFS to the local file system.
● Usage: hdfs dfs -get <src> <localdst>
● Command: hdfs dfs -get dataset/sample /home/cloudera
Hadoop Shell Commands to Manage HDFS
count
● HDFS command to count the number of directories, files, and bytes under the paths that match the specified file pattern.
● Usage: hdfs dfs -count <path>
● Command: hdfs dfs -count dataset
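● Note (illustrative, not from the original slides): -count prints four columns for each path: DIR_COUNT, FILE_COUNT, CONTENT_SIZE (in bytes), and PATHNAME. A dataset directory containing one 1 KB file would therefore print something like: 1 1 1024 dataset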
Hadoop Shell Commands to Manage HDFS
rm
● HDFS command to remove a file from HDFS.
● Usage: hdfs dfs -rm <path>
● Command: hdfs dfs -rm dataset/sample
Hadoop Shell Commands to Manage HDFS
rm -r
● HDFS command to remove an entire directory and all of its content from HDFS.
● Usage: hdfs dfs -rm -r <path>
● Command: hdfs dfs -rm -r dataset
Hadoop Shell Commands to Manage HDFS
cp
● HDFS command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
● Usage: hdfs dfs -cp <src> <dest>
● Command: hdfs dfs -cp /user/cloudera/dataset/data.txt /user/cloudera/dataset/datacopy.txt
Hadoop Shell Commands to Manage HDFS
rmdir
● HDFS command to remove an empty directory.
● Usage: hdfs dfs -rmdir <path>
● Command: hdfs dfs -rmdir /user/cloudera/dataset
Hadoop Shell Commands to Manage HDFS
usage
● HDFS command that returns the help for an individual command.
● Usage: hdfs dfs -usage <command>
● Command: hdfs dfs -usage mkdir
● By using the usage command you can get information about any command.
Hadoop Shell Commands to Manage HDFS
help
● HDFS command that displays help for a given command, or for all commands if none is specified.
● Command: hdfs dfs -help
Word Count MapReduce
Hadoop MapReduce
Hadoop MapReduce
How does Hadoop MapReduce work?
Hadoop MapReduce
Input Files
● The data for a MapReduce task is stored in input files, and input files typically live in HDFS. The format of these files is arbitrary; line-based log files and binary formats can also be used.
Hadoop MapReduce
InputFormat
● InputFormat defines how these input files are split and read. It selects the files or other objects that are used for input. InputFormat creates the InputSplits.
Hadoop MapReduce
InputSplits
● An InputSplit is created by the InputFormat and logically represents the data that will be processed by an individual Mapper (we will look at the mapper below). One map task is created for each split; thus the number of map tasks equals the number of InputSplits. Each split is divided into records, and each record is processed by the mapper.
Hadoop MapReduce
RecordReader
● The RecordReader communicates with the InputSplit and converts the data into key-value pairs suitable for reading by the mapper. By default, TextInputFormat is used to convert the data into key-value pairs. The RecordReader keeps communicating with the InputSplit until the file has been read completely. It assigns a byte offset (a unique number) to each line in the file. These key-value pairs are then sent to the mapper for further processing.
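● Example (illustrative, not from the original slides): with the default TextInputFormat, a split containing the two lines “Hello Hadoop” and “Bye Hadoop” reaches the mapper as the key-value pairs (0, “Hello Hadoop”) and (13, “Bye Hadoop”), where each key is the byte offset at which the line starts.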
Hadoop MapReduce
Mapper
● The Mapper processes each input record (from the RecordReader) and generates a new key-value pair, which can be completely different from the input pair. The output of the Mapper is also known as intermediate output and is written to the local disk. It is not stored on HDFS, as this is temporary data and writing it to HDFS would create unnecessary copies (HDFS is also a high-latency system). The mapper's output is passed to the combiner for further processing.
Hadoop MapReduce
Combiner
● The combiner is also known as a ‘mini-reducer’. The Hadoop MapReduce combiner performs local aggregation on the mappers’ output, which helps to minimize the data transfer between mapper and reducer (we will see the reducer below). Once the combiner has run, its output is passed to the partitioner for further work.
Hadoop MapReduce
Partitioner
● In Hadoop MapReduce, the Partitioner comes into the picture if we are working with more than one reducer (with a single reducer the partitioner is not used).
● The Partitioner takes the output from the combiners and performs partitioning. The output is partitioned on the basis of the key and then sorted. A hash function over the key (or a subset of the key) is used to derive the partition, as sketched below.
● Each combiner output record is partitioned according to its key, records with the same key go into the same partition, and each partition is then sent to a reducer. Partitioning allows an even distribution of the map output over the reducers.
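As an illustration of how keys are mapped to partitions, here is a minimal custom Partitioner written against the org.apache.hadoop.mapreduce API. It mirrors the behaviour of Hadoop's default HashPartitioner; the class name and the word-count key/value types are choices made for this example, not something given in the original slides.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each (word, count) record to a reducer based on a hash of the key.
// This is essentially what Hadoop's default HashPartitioner does.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then take the remainder
    // by the number of reducers to pick a partition in [0, numReduceTasks).
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

In the driver this would be registered with job.setPartitionerClass(WordPartitioner.class); with a single reducer the step is skipped, exactly as the first bullet notes.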
Hadoop MapReduce
Shuffling and Sorting
● The output is then shuffled to the reduce node (a normal slave node, but since the reduce phase runs there it is called the reducer node). Shuffling is the physical movement of the data, which is done over the network. Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.
Hadoop MapReduce
Reducer
● The Reducer takes the set of intermediate key-value pairs produced by the mappers as its input and runs a reducer function on each of them to generate the output. The output of the reducer is the final output, which is stored in HDFS.
Hadoop MapReduce
RecordWriter
● The RecordWriter writes these output key-value pairs from the Reducer phase to the output files.
Hadoop MapReduce
OutputFormat
● The way these output key-value pairs are written to output files by the RecordWriter is determined by the OutputFormat. OutputFormat instances provided by Hadoop are used to write files to HDFS or to the local disk. Thus the final output of the reducer is written to HDFS by an OutputFormat instance.
● This is, in outline, how a Hadoop MapReduce job works over the cluster.
Hadoop MapReduce
Hadoop MapReduce
The entire MapReduce program can be fundamentally divided into three parts:
● Mapper Phase Code
● Reducer Phase Code
● Driver Code
Hadoop MapReduce
Mapper code
public static class TokenizerMapper
     extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emit (word, 1) for every token in the input line.
  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Hadoop MapReduce
Reducer code
public static class IntSumReducer
     extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  // Sum all the counts for a given word and emit (word, total).
  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Hadoop MapReduce
Driver code
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);   // the reducer doubles as the combiner
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Word Count MapReduce
The Word Count MapReduce tutorial starts from here.
Word Count MapReduce
● Open Eclipse on the Cloudera Quickstart VM and create a new Java Project
Word Count MapReduce
● Add Hadoop libraries to the project through Add External JARs
Word Count MapReduce
Word Count MapReduce
● Add all .jar files in the folder /usr/lib/hadoop/client
Word Count MapReduce
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Word Count MapReduce
Word Count MapReduce
Name the file WordCount.jar and place it inside /home/cloudera/workspace/WordCount
Word Count MapReduce
● Check that the WordCount.jar file has been created
Word Count MapReduce
● Create two text files inside the /home/cloudera/workspace/WordCount folder
Word Count MapReduce
Word Count MapReduce
● Move the two text files to the input directory in HDFS
Word Count MapReduce
Word Count MapReduce
Run the job with the hadoop jar command. The arguments are:
○ WordCount.jar: the jar file to run
○ WordCount: the name of the main class
○ inputWC: the input directory
Word Count MapReduce
● Check the output directory and the output files inside it
● Note: the job ran with only one Reducer, so there should be only one part-file
Word Count MapReduce
Word Count MapReduce
● If ssh localhost is not working and it reports that ssh is not installed, enter the following: sudo apt-get install ssh (enter the password if prompted).
● If there is an error while executing the command, check the variables JAVA_HOME and HADOOP_CLASSPATH, reset their values, and proceed.
● If there is a ClassNotFoundException, please find the details in the link here: LINK. Alternatively, compile and save the .jar file within the same source directory.
● If an error appears when trying to make directories in HDFS, run start-dfs.sh.
Join in MapReduce
What is Join in MapReduce?
● … processing speed of the data.
● Disadvantage: time-consuming for the programmer, development is not easy due to hundreds of lines of code, and higher-level frameworks like HIVE/PIG are available instead.
Type of Join in MapReduce
Map-Side Join
Read the data streams into the mappers and use logic within the mapper function to perform the join.
● Where to use: when you have one large dataset and you need to join it with a small dataset, and when you want to optimize performance.
● Why: the smaller table is loaded into memory and the join operation happens during mapper execution over the large dataset (see the sketch below).
● Advantage: better performance.
● Disadvantage: not flexible, i.e. it cannot be used if both datasets are large.
● Note: a reduce-side join can also be used here, but performance will decrease.
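The following is a minimal sketch of how such a map-side join could be written, assuming the small dataset (here the customer details) has been shipped to every mapper, e.g. with job.addCacheFile(...). The file name cust_details.txt, the comma-separated field layout, and the class name are illustrative assumptions, not part of the original slides.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join sketch: the small table is loaded into memory in setup(),
// then every record of the large table is joined inside map().
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> custIdToName = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumes the small file was distributed to the task's working directory
    // (for example via job.addCacheFile) under this name.
    BufferedReader reader = new BufferedReader(new FileReader("cust_details.txt"));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] fields = line.split(",");          // custID,custName,...
      custIdToName.put(fields[0], fields[1]);
    }
    reader.close();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");  // custID,amount,...
    String name = custIdToName.get(fields[0]);      // in-memory lookup = the join
    if (name != null) {
      context.write(new Text(name), new Text(fields[1]));
    }
  }
}

Because the lookup table lives entirely in memory, no reduce phase is needed for the join itself; this is also exactly why the approach breaks down when both datasets are large.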
Type of Join in MapReduce
Reduce-Side Join
Process the multiple data streams through multiple map stages and perform the join at the reducer stage.
● Where to use: when both datasets are large.
● Why: neither dataset can be loaded into memory completely; both tables have to be processed separately and joined on the reducer side.
● Advantage: flexible and can be applied anywhere.
● Disadvantage: poor performance in comparison with map-side joins.
● Note: a map-side join can't be used here.
Reduce side Join
• cust_details: contains the details of the customer.
• transaction_details: contains the transaction records of the customer.
Using these two datasets, I want to know the lifetime value of each customer. In doing so, I will need the following things:
• The person’s name along with the frequency of the visits by that person.
• The total amount spent by that person.
Reduce side Join
Map phase: Mapper for customer
• Read the input, taking one tuple at a time.
• Tokenize each word in that tuple and fetch the cust ID along with the name of the person.
• The cust ID will be the key of the key-value pair that the mapper will eventually generate.
• Add a tag “cust” to indicate that this input tuple is of cust_details type.
• Therefore, the mapper for cust_details will produce the following intermediate key-value pairs:
Key – Value pair: [cust ID, cust name]
Example: [4000001, cust Kristina], [4000002, cust Paige], etc.
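A minimal sketch of this customer mapper in Java, in the same fragment style as the word count code (imports omitted, as in the slides). It assumes cust_details is a comma-separated file whose first field is the cust ID and whose second field is the name; the class name and field positions are illustrative.

public static class CustsMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // e.g. "4000001,Kristina,..."  ->  key = cust ID, value tagged with "cust"
    String[] parts = value.toString().split(",");
    context.write(new Text(parts[0]), new Text("cust " + parts[1]));
  }
}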
Reduce side Join
Map phase: Mapper for transaction
• Fetch the amount value instead of the name of the person.
• In this case, “tnxn” is used as the tag.
• Therefore, the cust ID will again be the key of the key-value pair that the mapper generates.
• Finally, the output of the mapper for transaction_details will be of the following format:
Key – Value pair: [cust ID, tnxn amount]
Example: [4000001, tnxn 40.33], [4000002, tnxn 198.44], etc.
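The corresponding transaction mapper, again as an illustrative sketch; here it is assumed that the cust ID and the transaction amount sit in the third and fourth comma-separated fields of transaction_details.

public static class TxnsMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // e.g. "00000551,Gymnastics,4000001,40.33,..."  ->  key = cust ID, value tagged with "tnxn"
    String[] parts = value.toString().split(",");
    context.write(new Text(parts[2]), new Text("tnxn " + parts[3]));
  }
}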
Reduce side Join
Sorting and Shuffling Phase
• The sorting and shuffling phase will generate a list of values corresponding to each key. In other words, it puts together all the values corresponding to each unique key among the intermediate key-value pairs. The output of the sorting and shuffling phase will be of the following format:
Key – list of Values:
{cust ID1 – [(cust name1), (tnxn amount1), (tnxn amount2), (tnxn amount3), …]}
{cust ID2 – [(cust name2), (tnxn amount1), (tnxn amount2), (tnxn amount3), …]}
……
• Example:
The primary goal of performing this reduce-side join operation was to find out how many times a particular customer visited the sports complex and the total amount spent by that customer on different sports. Therefore, the final output should be of the following format:
Key – Value pair: [Name of the customer] (Key) – [total amount, frequency of the visit] (Value)
Hence, the final output that my reducer will generate is given below:
Kristina, 651.05 8
Paige, 706.97 6
…..
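A sketch of the join reducer matching this description, in the same fragment style; the parsing relies on the "cust"/"tnxn" tags produced by the two mappers above, and the exact string handling is an illustrative assumption.

public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int visits = 0;
    for (Text t : values) {
      String[] parts = t.toString().split(" ");
      if (parts[0].equals("cust")) {
        name = parts[1];                              // value tagged by the customer mapper
      } else if (parts[0].equals("tnxn")) {
        visits++;                                     // one tagged value per transaction
        total += Double.parseDouble(parts[1]);
      }
    }
    // e.g. Kristina -> "651.05 8"
    context.write(new Text(name), new Text(String.format("%.2f %d", total, visits)));
  }
}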
Reduce side Join
Reduce side Join tutorial starts from here
Reduce side Join
● Add all .jar files in the folder /usr/lib/hadoop/client
Reduce side Join
● Name the file ReduceJoin.jar and place it inside /home/cloudera/workspace/ReduceJoin
Reduce side Join
● Create input file for customer inside /home/cloudera/workspace/ReduceJoin
Reduce side Join
● Create input file for transaction inside /home/cloudera/workspace/ReduceJoin
Reduce side Join
● Run the job:
hadoop jar ReduceJoin.jar ReduceJoin ReduceJoin/input/cust ReduceJoin/input/trans ReduceJoin/output
Reduce side Join
● Check the output directory in HDFS and the result inside it
Assignment 2
● Write a program to output:
○ Min, Max, and Average age of users for each Game Type
○ The game type(s) with the lowest average age
● Hint: you might need two MapReduce jobs for this task
First MapReduce job

Map output (Key → Value):
[4000002 → age 74]
…
[4000001 → type Exercise & Fitness]
[4000002 → type Exercise & Fitness]
…
[4000002 → type Team Sports]
…
[4000001 → type Combat Sports]

Sort and Shuffle:
{4000001 → [(age 55), (type Exercise & Fitness), (type Combat Sport), (type Water Sport), …, (type Water Sport), …]}
{4000002 → [(age 74), (type Exercise & Fitness), (type Team Sports), …]}
Sort and Shuffle will be done by the framework. Each of the lines above will be the input for the Reduce() function in the Reducer class (i.e. the reducer will run the Reduce() function once for each key).

Reduce output (Key → Value):
[55 → Exercise & Fitness, Combat Sport, Water Sport, …]
[74 → Exercise & Fitness, Team Sports, …]
Note that there might be duplicate game types for each player, so remember to remove any duplicates.
Output of the first MapReduce job
…
Second MapReduce job

Map output (Key → Value):
[Winter Sport → 55]
[Gymnastics → 55]
…
[Water Sport → 55]
…
[Water Sport → 74]
[Team Sports → 74]
…
[Gymnastics → 60]
[Team Sports → 60]

Sort and Shuffle:
{Winter Sport → [55, …]}
{Water Sport → [55, 74, …]}
{Gymnastics → [55, …, 60]}
…
Run the second MapReduce job which takes the output of the first job as input
Output types with lowest average age (2nd job)
Set the number of Reducers to 1, keep track of the Game Type with the lowest average age, and then write it to the output in the cleanup() function, which runs after all the Reduce() calls have finished. A sketch of this pattern is shown below.
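A minimal sketch of this single-reducer pattern; the key/value types, the class name, and the way the ages arrive at the reducer are illustrative assumptions rather than a prescribed solution to the assignment.

public static class MinAvgAgeReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  private String minType = null;
  private double minAvg = Double.MAX_VALUE;

  public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    // Compute the average age for this game type from the incoming ages.
    double sum = 0.0;
    int count = 0;
    for (DoubleWritable age : values) {
      sum += age.get();
      count++;
    }
    double avg = sum / count;
    // With a single reducer it is safe to track the global minimum in a field.
    if (avg < minAvg) {
      minAvg = avg;
      minType = key.toString();
    }
  }

  // cleanup() runs once after all reduce() calls, so the final minimum is emitted here.
  protected void cleanup(Context context) throws IOException, InterruptedException {
    if (minType != null) {
      context.write(new Text(minType), new DoubleWritable(minAvg));
    }
  }
}

In the driver, job.setNumReduceTasks(1) ensures there is exactly one reducer, as the slide requires.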
Job chain
Instead of running two separate jobs, we can also run a job chain.
Job chain
Instead of running two separate jobs, we can also run a job chain, as sketched below.
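A sketch of such a driver, chaining two jobs so that the second reads the first job's output directory. The class names (AgeJobChain, AgeStatsMapper, AgeStatsReducer, MinAvgAgeMapper, MinAvgAgeReducer), the paths, and the key/value types are illustrative assumptions; the imports mirror those of the word count driver (plus DoubleWritable).

public class AgeJobChain {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
    Path output = new Path(args[2]);

    // Job 1: min/max/average age per game type.
    Job job1 = Job.getInstance(conf, "age stats per game type");
    job1.setJarByClass(AgeJobChain.class);
    job1.setMapperClass(AgeStatsMapper.class);
    job1.setReducerClass(AgeStatsReducer.class);
    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job1, input);
    FileOutputFormat.setOutputPath(job1, intermediate);
    if (!job1.waitForCompletion(true)) {
      System.exit(1);                        // stop the chain if job 1 fails
    }

    // Job 2: game type(s) with the lowest average age, reading job 1's output.
    Job job2 = Job.getInstance(conf, "lowest average age");
    job2.setJarByClass(AgeJobChain.class);
    job2.setMapperClass(MinAvgAgeMapper.class);
    job2.setReducerClass(MinAvgAgeReducer.class);
    job2.setNumReduceTasks(1);               // single reducer, see the previous slide
    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job2, intermediate);
    FileOutputFormat.setOutputPath(job2, output);
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
  }
}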
Job chain
…
Hadoop Streaming: Writing a Hadoop MapReduce Program in Python
Hadoop Streaming:
What is Hadoop Streaming?
● Hadoop Streaming is a utility that comes with the Hadoop distribution. It can be used to execute programs for big data analysis. Hadoop Streaming can be used with languages like Python, Java, PHP, Scala, Perl, UNIX shell scripts, and many more. The utility allows us to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /bin/wc
Hadoop Streaming:
What is Hadoop Streaming?
Hadoop Streaming:
The Hadoop Streaming tutorial starts from here.
Hadoop Streaming:
Hadoop Streaming:
mapper.py
import sys

# Word Count Example
# input comes from standard input STDIN
for line in sys.stdin:
    line = line.strip()        # remove leading and trailing whitespace
    words = line.split()       # split the line into words and return them as a list
    for word in words:
        # write the results to standard output STDOUT
        print '%s %s' % (word, 1)   # emit the word with a count of 1
Hadoop Streaming:
Hadoop Streaming:
reducer.py
import sys
from operator import itemgetter

# keep track of the current word and its running count
current_word = None
current_count = 0
word = None

# input comes from STDIN (the sorted output of mapper.py)
for line in sys.stdin:
    line = line.strip()
    word, count = line.split(' ', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # emit the previous word with its total count
            print '%s %s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the last word, if any
if current_word == word:
    print '%s %s' % (current_word, current_count)
Hadoop Streaming:
● Locate the HadoopStreaming directory, which contains mapper.py and reducer.py
● Create a word.txt file
Hadoop Streaming:
We can run the mapper and reducer on local files (e.g. word.txt). In order to run the Map and Reduce on the Hadoop Distributed File System (HDFS), we need the Hadoop Streaming jar. So before we run the scripts on HDFS, let’s run them locally to ensure that they are working fine.
● Command: cat word.txt | python mapper.py
Hadoop Streaming:
● Run reducer.py
● Command: cat word.txt | python mapper.py | sort -k1,1 | python reducer.py
Hadoop Streaming:
Running the Python Code on Hadoop
● Move word.txt to HDFS
Hadoop Streaming:
Running the Python Code on Hadoop
● Locate the Hadoop Streaming jar from your terminal and copy the path.
● Command: ls /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
Hadoop Streaming:
● Run the MapReduce job
Command: hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file /home/cloudera/workspace/HadoopStreaming/mapper.py -mapper 'python mapper.py' -file /home/cloudera/workspace/HadoopStreaming/reducer.py -reducer 'python reducer.py' -input HadoopStreaming/word.txt -output HSOutput
Note: if we navigate to the HadoopStreaming directory, we don’t need to specify the full path /home/cloudera/workspace/HadoopStreaming/mapper.py:
[cloudera@quickstart HadoopStreaming]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file mapper.py -mapper 'python mapper.py' -file reducer.py -reducer 'python reducer.py' -input HadoopStreaming/word.txt -output HSOutput
Hadoop Streaming:
● Hadoop provides a basic web interface for statistics and information
Hadoop Streaming:
● Check the output directory and the content of the result file
Share Folders between VM and Host
● Just in case you’re looking for a way to transfer files from the host to the VM
Share Folders between VM and Host
● Devices -> Shared Folders -> Shared Folders Settings
Share Folders between VM and Host
• Folder Path: the folder on the host (Windows 10)
• Folder Name: repeat the folder name from above
• Auto-mount: check this option
• Mount point: the guest OS folder where the shared folder will be mounted (it will be created if it does not exist)
• Make Permanent: check this option
Share Folders between VM and Host
● The ShareVM folder will be created (if it does not exist) after you reboot the VM.
● You can also create the ShareVM folder in the guest OS (Cloudera) yourself (so there is no need to reboot).
● Command: mkdir /home/cloudera/workspace/ShareVM
Share Folders between VM and Host
● Mount the shared folder:
Command: mount -t vboxsf ShareFromWindow /home/cloudera/workspace/ShareVM
Share Folders between VM and Host
● Click the newly created shortcut in the Cloudera VM and check that the files appear