Slide 3 Hadoop MapReduce Tutorial

Hadoop tutorial
Trong-Hop Do
S3Lab
Smart Software System Laboratory
“Without big data, you are blind and deaf and in the middle of a freeway.”
– Geoffrey Moore

Big Data
Hadoop Shell Commands
Hadoop Shell Commands to Manage HDFS
Open terminal
● In this exercise, we will practise generic HDFS commands and get ourselves familiar with the Hadoop command line interface.
● Open Terminal on your VM: navigate to Application > System Tools > Terminal, or use the shortcut on the desktop.
Hadoop Shell Commands to Manage HDFS
hdfs dfs
● View a description of all the commands associated with the FsShell subsystem.
● Command: hdfs dfs
Hadoop Shell Commands to Manage HDFS
ls
● HDFS command to display the list of files and directories in HDFS.
● Command: hdfs dfs -ls
Hadoop Shell Commands to Manage HDFS
ls
● View the content of the root directory.
● Command: hdfs dfs -ls /
● Note: the directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are separate namespaces.
Hadoop Shell Commands to Manage HDFS
ls
● The home directory is /user/cloudera/
● Command: hdfs dfs -ls or hdfs dfs -ls /user/cloudera
Hadoop Shell Commands to Manage HDFS
mkdir
● HDFS command to create a directory in HDFS.
● Usage: hdfs dfs -mkdir directory_name
● Command: hdfs dfs -mkdir dataset
Hadoop Shell Commands to Manage HDFS
touchz
● HDFS command to create a file in HDFS with a file size of 0 bytes.
● Usage: hdfs dfs -touchz directory/filename
● Command: hdfs dfs -touchz dataset/sample
Hadoop Shell Commands to Manage HDFS
du
● HDFS command to check the file size.
● Usage: hdfs dfs -du -s directory/filename
● Command: hdfs dfs -du -s dataset/sample
Hadoop Shell Commands to Manage HDFS
put
● HDFS command to copy a single source or multiple sources from the local file system to the destination file system.
● Usage: hdfs dfs -put <localsrc> <destination>
● Command: hdfs dfs -put data.txt dataset
Hadoop Shell Commands to Manage HDFS
cat
● HDFS command that reads a file on HDFS and prints the content of that file to standard output.
● Usage: hdfs dfs -cat /path/to/file_in_hdfs
● Command: hdfs dfs -cat dataset/data.txt
Hadoop Shell Commands to Manage HDFS
get
● HDFS command to copy files from HDFS to the local file system.
● Usage: hdfs dfs -get <src> <localdst>
● Command: hdfs dfs -get dataset/sample /home/cloudera
Hadoop Shell Commands to Manage HDFS
count
● HDFS command to count the number of directories, files, and bytes under the paths that match the specified file pattern.
● Usage: hdfs dfs -count <path>
● Command: hdfs dfs -count dataset
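● Note (illustrative, not from the original slides): -count prints four columns for each path: DIR_COUNT, FILE_COUNT, CONTENT_SIZE (in bytes), and PATHNAME. A dataset directory containing one 1 KB file would therefore print something like: 1 1 1024 dataset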
Hadoop Shell Commands to Manage HDFS
rm
● HDFS command to remove a file from HDFS.
● Usage: hdfs dfs -rm <path>
● Command: hdfs dfs -rm dataset/sample
Hadoop Shell Commands to Manage HDFS
rm -r
● HDFS command to remove an entire directory and all of its content from HDFS.
● Usage: hdfs dfs -rm -r <path>
● Command: hdfs dfs -rm -r dataset
Hadoop Shell Commands to Manage HDFS
cp
● HDFS command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
● Usage: hdfs dfs -cp <src> <dest>
● Command: hdfs dfs -cp /user/cloudera/dataset/data.txt /user/cloudera/dataset/datacopy.txt
Hadoop Shell Commands to Manage HDFS
rmdir
● HDFS command to remove an empty directory.
● Usage: hdfs dfs -rmdir <path>
● Command: hdfs dfs -rmdir /user/cloudera/dataset
Hadoop Shell Commands to Manage HDFS
usage
● HDFS command that returns the help for an individual command.
● Usage: hdfs dfs -usage <command>
● Command: hdfs dfs -usage mkdir
● By using the usage command you can get information about any command.
Hadoop Shell Commands to Manage HDFS
help
● HDFS command that displays help for a given command, or for all commands if none is specified.
● Command: hdfs dfs -help
Word Count MapReduce
Hadoop MapReduce
Hadoop MapReduce
How does Hadoop MapReduce work?
Hadoop MapReduce
Input Files
● The data for a MapReduce task is stored in input files, and input files typically live in HDFS. The format of these files is arbitrary; line-based log files and binary formats can also be used.
Hadoop MapReduce
InputFormat
● InputFormat defines how these input files are split and read. It selects the files or other objects that are used for input. InputFormat creates the InputSplits.
Hadoop MapReduce
InputSplits
● An InputSplit is created by the InputFormat and logically represents the data that will be processed by an individual Mapper (we will look at the mapper below). One map task is created for each split; thus the number of map tasks equals the number of InputSplits. Each split is divided into records, and each record is processed by the mapper.
Hadoop MapReduce
RecordReader
● The RecordReader communicates with the InputSplit and converts the data into key-value pairs suitable for reading by the mapper. By default, TextInputFormat is used to convert the data into key-value pairs. The RecordReader keeps communicating with the InputSplit until the file has been read completely. It assigns a byte offset (a unique number) to each line in the file. These key-value pairs are then sent to the mapper for further processing.
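● Example (illustrative, not from the original slides): with the default TextInputFormat, a split containing the two lines “Hello Hadoop” and “Bye Hadoop” reaches the mapper as the key-value pairs (0, “Hello Hadoop”) and (13, “Bye Hadoop”), where each key is the byte offset at which the line starts.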
Hadoop MapReduce
Mapper
● The Mapper processes each input record (from the RecordReader) and generates a new key-value pair, which can be completely different from the input pair. The output of the Mapper is also known as intermediate output and is written to the local disk. It is not stored on HDFS, as this is temporary data and writing it to HDFS would create unnecessary copies (HDFS is also a high-latency system). The mapper's output is passed to the combiner for further processing.
Hadoop MapReduce
Combiner
● The combiner is also known as a ‘mini-reducer’. The Hadoop MapReduce combiner performs local aggregation on the mappers’ output, which helps to minimize the data transfer between mapper and reducer (we will see the reducer below). Once the combiner has run, its output is passed to the partitioner for further work.
Hadoop MapReduce
Partitioner
● In Hadoop MapReduce, the Partitioner comes into the picture if we are working with more than one reducer (with a single reducer the partitioner is not used).
● The Partitioner takes the output from the combiners and performs partitioning. The output is partitioned on the basis of the key and then sorted. A hash function over the key (or a subset of the key) is used to derive the partition, as sketched below.
● Each combiner output record is partitioned according to its key, records with the same key go into the same partition, and each partition is then sent to a reducer. Partitioning allows an even distribution of the map output over the reducers.
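As an illustration of how keys are mapped to partitions, here is a minimal custom Partitioner written against the org.apache.hadoop.mapreduce API. It mirrors the behaviour of Hadoop's default HashPartitioner; the class name and the word-count key/value types are choices made for this example, not something given in the original slides.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each (word, count) record to a reducer based on a hash of the key.
// This is essentially what Hadoop's default HashPartitioner does.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then take the remainder
    // by the number of reducers to pick a partition in [0, numReduceTasks).
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

In the driver this would be registered with job.setPartitionerClass(WordPartitioner.class); with a single reducer the step is skipped, exactly as the first bullet notes.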
Hadoop MapReduce
Shuffling and Sorting
● The output is then shuffled to the reduce node (a normal slave node, but since the reduce phase runs there it is called the reducer node). Shuffling is the physical movement of the data, which is done over the network. Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.
Hadoop MapReduce
Reducer
● The Reducer takes the set of intermediate key-value pairs produced by the mappers as its input and runs a reducer function on each of them to generate the output. The output of the reducer is the final output, which is stored in HDFS.
Hadoop MapReduce
RecordWriter
● The RecordWriter writes these output key-value pairs from the Reducer phase to the output files.
Hadoop MapReduce
OutputFormat
● The way these output key-value pairs are written to output files by the RecordWriter is determined by the OutputFormat. OutputFormat instances provided by Hadoop are used to write files to HDFS or to the local disk. Thus the final output of the reducer is written to HDFS by an OutputFormat instance.
● This is, in outline, how a Hadoop MapReduce job works over the cluster.
Hadoop MapReduce
Hadoop MapReduce
The entire MapReduce program can be fundamentally divided into three parts:
● Mapper Phase Code
● Reducer Phase Code
● Driver Code
Hadoop MapReduce
Mapper code
public static class TokenizerMapper
     extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emit (word, 1) for every token in the input line.
  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Hadoop MapReduce
Reducer code
public static class IntSumReducer
     extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  // Sum all the counts for a given word and emit (word, total).
  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context
                     ) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Hadoop MapReduce
Driver code
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);   // the reducer doubles as the combiner
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Word Count MapReduce
The Word Count MapReduce tutorial starts from here.
Word Count MapReduce
● Open Eclipse on the Cloudera Quickstart VM and create a new Java Project
Word Count MapReduce
● Add Hadoop libraries to the project through Add External JARs
Word Count MapReduce
Word Count MapReduce
● Add all .jar files in the folder /usr/lib/hadoop/client
Word Count MapReduce
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Word Count MapReduce
Word Count MapReduce
Name the file WordCount.jar and place it inside /home/cloudera/workspace/WordCount
Word Count MapReduce
● Check that the WordCount.jar file has been created
Word Count MapReduce
● Create two text files inside the /home/cloudera/workspace/WordCount folder
Word Count MapReduce
Word Count MapReduce
● Move the two text files to the input directory in HDFS
Word Count MapReduce
Word Count MapReduce
Run the job with the hadoop jar command. The arguments are:
○ WordCount.jar: the jar file to run
○ WordCount: the name of the main class
○ inputWC: the input directory
Word Count MapReduce
● Check the output directory and the output files inside it
● Note: the job ran with only one Reducer, so there should be only one part-file
Word Count MapReduce
Word Count MapReduce
● If ssh localhost is not working and it reports that ssh is not installed, enter the following: sudo apt-get install ssh (enter the password if prompted).
● If there is an error while executing the command, check the variables JAVA_HOME and HADOOP_CLASSPATH, reset their values, and proceed.
● If there is a ClassNotFoundException, please find the details in the link here: LINK. Alternatively, compile and save the .jar file within the same source directory.
● If an error appears when trying to make directories in HDFS, run start-dfs.sh.
Join in MapReduce
What is Join in MapReduce?
● … processing speed of the data.
● Disadvantage: time-consuming for the programmer, development is not easy due to hundreds of lines of code, and higher-level frameworks like HIVE/PIG are available instead.
Type of Join in MapReduce
Map-Side Join
Read the data streams into the mappers and use logic within the mapper function to perform the join.
● Where to use: when you have one large dataset and you need to join it with a small dataset, and when you want to optimize performance.
● Why: the smaller table is loaded into memory and the join operation happens during mapper execution over the large dataset (see the sketch below).
● Advantage: better performance.
● Disadvantage: not flexible, i.e. it cannot be used if both datasets are large.
● Note: a reduce-side join can also be used here, but performance will decrease.
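The following is a minimal sketch of how such a map-side join could be written, assuming the small dataset (here the customer details) has been shipped to every mapper, e.g. with job.addCacheFile(...). The file name cust_details.txt, the comma-separated field layout, and the class name are illustrative assumptions, not part of the original slides.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join sketch: the small table is loaded into memory in setup(),
// then every record of the large table is joined inside map().
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> custIdToName = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumes the small file was distributed to the task's working directory
    // (for example via job.addCacheFile) under this name.
    BufferedReader reader = new BufferedReader(new FileReader("cust_details.txt"));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] fields = line.split(",");          // custID,custName,...
      custIdToName.put(fields[0], fields[1]);
    }
    reader.close();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");  // custID,amount,...
    String name = custIdToName.get(fields[0]);      // in-memory lookup = the join
    if (name != null) {
      context.write(new Text(name), new Text(fields[1]));
    }
  }
}

Because the lookup table lives entirely in memory, no reduce phase is needed for the join itself; this is also exactly why the approach breaks down when both datasets are large.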
Type of Join in MapReduce
Reduce-Side Join
Process the multiple data streams through multiple map stages and perform the join at the reducer stage.
● Where to use: when both datasets are large.
● Why: neither dataset can be loaded into memory completely; both tables have to be processed separately and joined on the reducer side.
● Advantage: flexible and can be applied anywhere.
● Disadvantage: poor performance in comparison with map-side joins.
● Note: a map-side join can't be used here.
Reduce side Join
• cust_details: contains the details of the customer.
• transaction_details: contains the transaction records of the customer.
Using these two datasets, I want to know the lifetime value of each customer. In doing so, I will need the following things:
• The person’s name along with the frequency of the visits by that person.
• The total amount spent by that person.
Reduce side Join
Map phase: Mapper for customer
• Read the input, taking one tuple at a time.
• Tokenize each word in that tuple and fetch the cust ID along with the name of the person.
• The cust ID will be the key of the key-value pair that the mapper will eventually generate.
• Add a tag “cust” to indicate that this input tuple is of cust_details type.
• Therefore, the mapper for cust_details will produce the following intermediate key-value pairs:
Key – Value pair: [cust ID, cust name]
Example: [4000001, cust Kristina], [4000002, cust Paige], etc.
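A minimal sketch of this customer mapper in Java, in the same fragment style as the word count code (imports omitted, as in the slides). It assumes cust_details is a comma-separated file whose first field is the cust ID and whose second field is the name; the class name and field positions are illustrative.

public static class CustsMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // e.g. "4000001,Kristina,..."  ->  key = cust ID, value tagged with "cust"
    String[] parts = value.toString().split(",");
    context.write(new Text(parts[0]), new Text("cust " + parts[1]));
  }
}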
Reduce side Join
Map phase: Mapper for transaction
• Fetch the amount value instead of the name of the person.
• In this case, “tnxn” is used as the tag.
• Therefore, the cust ID will again be the key of the key-value pair that the mapper generates.
• Finally, the output of the mapper for transaction_details will be of the following format:
Key – Value pair: [cust ID, tnxn amount]
Example: [4000001, tnxn 40.33], [4000002, tnxn 198.44], etc.
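The corresponding transaction mapper, again as an illustrative sketch; here it is assumed that the cust ID and the transaction amount sit in the third and fourth comma-separated fields of transaction_details.

public static class TxnsMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // e.g. "00000551,Gymnastics,4000001,40.33,..."  ->  key = cust ID, value tagged with "tnxn"
    String[] parts = value.toString().split(",");
    context.write(new Text(parts[2]), new Text("tnxn " + parts[3]));
  }
}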
Reduce side Join
Sorting and Shuffling Phase
• The sorting and shuffling phase will generate a list of values corresponding to each key. In other words, it puts together all the values corresponding to each unique key among the intermediate key-value pairs. The output of the sorting and shuffling phase will be of the following format:
Key – list of Values:
{cust ID1 – [(cust name1), (tnxn amount1), (tnxn amount2), (tnxn amount3), …]}
{cust ID2 – [(cust name2), (tnxn amount1), (tnxn amount2), (tnxn amount3), …]}
……
• Example:
The primary goal of performing this reduce-side join operation was to find out how many times a particular customer visited the sports complex and the total amount spent by that customer on different sports. Therefore, the final output should be of the following format:
Key – Value pair: [Name of the customer] (Key) – [total amount, frequency of the visit] (Value)
Hence, the final output that my reducer will generate is given below:
Kristina, 651.05 8
Paige, 706.97 6
…..
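A sketch of the join reducer matching this description, in the same fragment style; the parsing relies on the "cust"/"tnxn" tags produced by the two mappers above, and the exact string handling is an illustrative assumption.

public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int visits = 0;
    for (Text t : values) {
      String[] parts = t.toString().split(" ");
      if (parts[0].equals("cust")) {
        name = parts[1];                              // value tagged by the customer mapper
      } else if (parts[0].equals("tnxn")) {
        visits++;                                     // one tagged value per transaction
        total += Double.parseDouble(parts[1]);
      }
    }
    // e.g. Kristina -> "651.05 8"
    context.write(new Text(name), new Text(String.format("%.2f %d", total, visits)));
  }
}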
Reduce side Join
Reduce side Join tutorial starts from here
Reduce side Join
● Add all .jar files in the folder /usr/lib/hadoop/client
Reduce side Join
● Name the file ReduceJoin.jar and place it inside /home/cloudera/workspace/ReduceJoin
Reduce side Join
● Create input file for customer inside /home/cloudera/workspace/ReduceJoin
Reduce side Join
● Create input file for transaction inside /home/cloudera/workspace/ReduceJoin
Reduce side Join
● Run the job:
hadoop jar ReduceJoin.jar ReduceJoin ReduceJoin/input/cust ReduceJoin/input/trans ReduceJoin/output
Reduce side Join
● Check the output directory in HDFS and the result inside it
Assignment 2
● Write a program to output:
○ Min, Max, and Average age of users for each Game Type
○ The game type(s) with the lowest average age
● Hint: you might need two MapReduce jobs for this task
First MapReduce job

Map output (Key → Value):
[4000002 → age 74]
…
[4000001 → type Exercise & Fitness]
[4000002 → type Exercise & Fitness]
…
[4000002 → type Team Sports]
…
[4000001 → type Combat Sports]

Sort and Shuffle:
{4000001 → [(age 55), (type Exercise & Fitness), (type Combat Sport), (type Water Sport), …, (type Water Sport), …]}
{4000002 → [(age 74), (type Exercise & Fitness), (type Team Sports), …]}
Sort and Shuffle will be done by the framework. Each of the lines above will be the input for the Reduce() function in the Reducer class (i.e. the reducer will run the Reduce() function once for each key).

Reduce output (Key → Value):
[55 → Exercise & Fitness, Combat Sport, Water Sport, …]
[74 → Exercise & Fitness, Team Sports, …]
Note that there might be duplicate game types for each player, so remember to remove any duplicates.
Output of the first MapReduce job
…
Second MapReduce job

Map output (Key → Value):
[Winter Sport → 55]
[Gymnastics → 55]
…
[Water Sport → 55]
…
[Water Sport → 74]
[Team Sports → 74]
…
[Gymnastics → 60]
[Team Sports → 60]

Sort and Shuffle:
{Winter Sport → [55, …]}
{Water Sport → [55, 74, …]}
{Gymnastics → [55, …, 60]}
…
Run the second MapReduce job which takes the output of the first job as input
Output types with lowest average age (2nd job)
Set the number of Reducers to 1, keep track of the Game Type with the lowest average age, and then write it to the output in the cleanup() function, which runs after all the Reduce() calls have finished. A sketch of this pattern is shown below.
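A minimal sketch of this single-reducer pattern; the key/value types, the class name, and the way the ages arrive at the reducer are illustrative assumptions rather than a prescribed solution to the assignment.

public static class MinAvgAgeReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  private String minType = null;
  private double minAvg = Double.MAX_VALUE;

  public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    // Compute the average age for this game type from the incoming ages.
    double sum = 0.0;
    int count = 0;
    for (DoubleWritable age : values) {
      sum += age.get();
      count++;
    }
    double avg = sum / count;
    // With a single reducer it is safe to track the global minimum in a field.
    if (avg < minAvg) {
      minAvg = avg;
      minType = key.toString();
    }
  }

  // cleanup() runs once after all reduce() calls, so the final minimum is emitted here.
  protected void cleanup(Context context) throws IOException, InterruptedException {
    if (minType != null) {
      context.write(new Text(minType), new DoubleWritable(minAvg));
    }
  }
}

In the driver, job.setNumReduceTasks(1) ensures there is exactly one reducer, as the slide requires.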
Job chain
Instead of running two separate jobs, we can also run a job chain.
Job chain
Instead of running two separate jobs, we can also run a job chain, as sketched below.
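A sketch of such a driver, chaining two jobs so that the second reads the first job's output directory. The class names (AgeJobChain, AgeStatsMapper, AgeStatsReducer, MinAvgAgeMapper, MinAvgAgeReducer), the paths, and the key/value types are illustrative assumptions; the imports mirror those of the word count driver (plus DoubleWritable).

public class AgeJobChain {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
    Path output = new Path(args[2]);

    // Job 1: min/max/average age per game type.
    Job job1 = Job.getInstance(conf, "age stats per game type");
    job1.setJarByClass(AgeJobChain.class);
    job1.setMapperClass(AgeStatsMapper.class);
    job1.setReducerClass(AgeStatsReducer.class);
    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job1, input);
    FileOutputFormat.setOutputPath(job1, intermediate);
    if (!job1.waitForCompletion(true)) {
      System.exit(1);                        // stop the chain if job 1 fails
    }

    // Job 2: game type(s) with the lowest average age, reading job 1's output.
    Job job2 = Job.getInstance(conf, "lowest average age");
    job2.setJarByClass(AgeJobChain.class);
    job2.setMapperClass(MinAvgAgeMapper.class);
    job2.setReducerClass(MinAvgAgeReducer.class);
    job2.setNumReduceTasks(1);               // single reducer, see the previous slide
    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job2, intermediate);
    FileOutputFormat.setOutputPath(job2, output);
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
  }
}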
Job chain
…
Hadoop Streaming: Writing a Hadoop MapReduce Program in Python
Hadoop Streaming:
What is Hadoop Streaming?
● Hadoop Streaming is a utility that comes with the Hadoop distribution. It can be used to execute programs for big data analysis. Hadoop Streaming can be used with languages like Python, Java, PHP, Scala, Perl, UNIX shell scripts, and many more. The utility allows us to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /bin/wc
Hadoop Streaming:
What is Hadoop Streaming?
Hadoop Streaming:
The Hadoop Streaming tutorial starts from here.
Hadoop Streaming:
Hadoop Streaming:
mapper.py
import sys

# Word Count Example
# input comes from standard input STDIN
for line in sys.stdin:
    line = line.strip()        # remove leading and trailing whitespace
    words = line.split()       # split the line into words and return them as a list
    for word in words:
        # write the results to standard output STDOUT
        print '%s %s' % (word, 1)   # emit the word with a count of 1
Hadoop Streaming:
Hadoop Streaming:
reducer.py
import sys
from operator import itemgetter

# keep track of the current word and its running count
current_word = None
current_count = 0
word = None

# input comes from STDIN (the sorted output of mapper.py)
for line in sys.stdin:
    line = line.strip()
    word, count = line.split(' ', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore this line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # emit the previous word with its total count
            print '%s %s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the last word, if any
if current_word == word:
    print '%s %s' % (current_word, current_count)
Hadoop Streaming:
● Locate the HadoopStreaming directory, which contains mapper.py and reducer.py
● Create a word.txt file
Hadoop Streaming:
We can run the mapper and reducer on local files (e.g. word.txt). In order to run the Map and Reduce on the Hadoop Distributed File System (HDFS), we need the Hadoop Streaming jar. So before we run the scripts on HDFS, let’s run them locally to ensure that they are working fine.
● Command: cat word.txt | python mapper.py
Hadoop Streaming:
● Run reducer.py
● Command: cat word.txt | python mapper.py | sort -k1,1 | python reducer.py
Hadoop Streaming:
Running the Python Code on Hadoop
● Move word.txt to HDFS
Hadoop Streaming:
Running the Python Code on Hadoop
● Locate the Hadoop Streaming jar from your terminal and copy the path.
● Command: ls /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
Hadoop Streaming:
● Run the MapReduce job
Command: hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file /home/cloudera/workspace/HadoopStreaming/mapper.py -mapper 'python mapper.py' -file /home/cloudera/workspace/HadoopStreaming/reducer.py -reducer 'python reducer.py' -input HadoopStreaming/word.txt -output HSOutput
Note: if we navigate to the HadoopStreaming directory, we don’t need to specify the full path /home/cloudera/workspace/HadoopStreaming/mapper.py:
[cloudera@quickstart HadoopStreaming]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file mapper.py -mapper 'python mapper.py' -file reducer.py -reducer 'python reducer.py' -input HadoopStreaming/word.txt -output HSOutput
Hadoop Streaming:
● Hadoop provides a basic web interface for statistics and information
Hadoop Streaming:
● Check the output directory and the content of the result file
Share Folders between VM and Host
● Just in case you’re looking for a way to transfer files from the host to the VM
Share Folders between VM and Host
● Devices -> Shared Folders -> Shared Folders Settings
Share Folders between VM and Host
• Folder Path: the folder on the host (Windows 10)
• Folder Name: repeat the folder name from above
• Auto-mount: check this option
• Mount point: the guest OS folder where the shared folder will be mounted (it will be created if it does not exist)
• Make Permanent: check this option
Share Folders between VM and Host
● The ShareVM folder will be created (if it does not exist) after you reboot the VM.
● You can also create the ShareVM folder in the guest OS (Cloudera) yourself (so there is no need to reboot).
● Command: mkdir /home/cloudera/workspace/ShareVM
Share Folders between VM and Host
● Mount the shared folder:
Command: mount -t vboxsf ShareFromWindow /home/cloudera/workspace/ShareVM
Share Folders between VM and Host
● Click the newly created shortcut in the Cloudera VM and check that the files appear