Lec 4
Preface: in this lecture, we will cover the goals of HDFS, the read/write process in HDFS, and the configuration tuning parameters that control HDFS performance and robustness.
Now, let us see the Hadoop Distributed File System: some of its design concepts, and then we will go into more detail of the HDFS design. The first important thing is that it is a scalable, distributed file system: that means we can add more disks and get scalable performance. That is one of the major design concepts of HDFS in realizing a scalable distributed file system. As you add more and more disks, the performance automatically scales out, as far as this design goal of HDFS is concerned. Why is this required? Because if the dataset is very large, it cannot fit into one computer system. So, hundreds and thousands of computer systems are used to store that file. Hence, the data of a file is divided into blocks and distributed onto this large-scale infrastructure. That means the data is distributed on the local disks of several nodes. This particular method ensures that low-cost commodity hardware can be used to store the information, by distributing it across all these multiple nodes, which comprise low-cost commodity machines. The drawback is that some of the nodes may fail; we will see that handling such failures is also included in the design goals of HDFS. With low-cost commodity hardware used in this manner, a lot of performance is achieved, because we are aggregating the performance of hundreds and thousands of such low-cost commodity machines.
So, in this particular diagram, we assume a number of nodes, say node 1, node 2, and so on up to node n; these nodes are in the range of hundreds and thousands. If a file is given, the file is broken into blocks, and the data in these blocks is distributed on this particular kind of setup. In this example, we can see that there is a file, that file is divided into data blocks, and each block is stored on a different node. We can see here that the blue-colored blocks of these nodes are storing the file's data. Hence, the file data is now distributed onto the local disks in HDFS.
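To make this concrete, here is a minimal sketch (not from the lecture; the file path is hypothetical) of how a client can ask the name node where the blocks of a file actually live, using the standard Hadoop FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");  // hypothetical file

        // Ask the name node which data nodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

Each printed line corresponds to one block of the file, showing which data nodes hold its replicas.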
So, hundreds and thousands of nodes are available, and their disks are being used for storage. Now, these comprise commodity hardware, so they are prone to hardware failure; hence the design needs to handle node failures. Handling node failures is therefore also an HDFS design goal. Another aspect is portability across heterogeneous hardware. Why? Because there are hundreds and thousands of commodity hardware machines, and they may be running different operating systems and software; hence this heterogeneity also requires portability support. That is also one of the HDFS design goals. Another important design goal of HDFS is to handle large datasets. File sizes range from terabytes to petabytes, and such huge files or datasets can also be stored in the HDFS file system; so it provides support for handling large datasets. It also enables processing with high throughput. How processing with high throughput is ensured, we will see; it has been kept as one of the important design goals of HDFS.
Now let us see what is new in Hadoop version 2.0. HDFS in Hadoop 2.0, or HDFS 2.0, uses HDFS Federation: that means it is not a single namespace but a federation of namespaces, called HDFS name node federation. This federation has multiple data nodes and multiple name nodes, and this increases the reliability of the name node service. So, it is not one name node, but n name nodes, and this particular method is called HDFS Federation. The benefit is increased namespace scalability: earlier there was one namespace, now there is a federation of namespaces, so obviously the scalability is increased. The performance is also increased, and so is the isolation. Why? Because now the nearest namespace is used to serve the clients' requests, and isolation means that if, let us say, a particular application has a very large resource requirement, it is not going to affect the entire system, because it is a federation of namespaces: other applications are not going to be affected by the very high requirement of one particular application. That is called isolation.
Now, how is this all done in HDFS version 2, or Hadoop version 2? Let us go and discuss. Here, as we have mentioned, instead of one name node we now have multiple name node servers, and they are managing the namespaces; hence there are multiple namespaces, and the data is now stored in the form of block pools. Each block pool is managed across the data nodes on the nodes of the cluster machines. So, it is not only one node, but several nodes are involved, and they will be storing the block pools. There is a pool associated with each name node, or namespace, and these pools are essentially spread out over all the data nodes.
So, here you can see in this particular diagram that there are multiple namespaces: namespace 1, namespace 2, and so on up to namespace n. Each namespace has a block pool. These are called block pools, and the block pools are stored on the nodes, just like in a cluster machine; each block pool is spread over different nodes. Different nodes are there, managing the multiple namespaces, and this is called the federation of block pools. Hence, there is no longer a single point of failure: even if one or more name nodes, that is, namespaces, fail, it is not going to affect anything. It also increases the performance, reliability, and throughput, and provides isolation. If you remember, in the original design you had only one namespace and a bunch of data nodes. The structure looks similar, but internally it is managed as a federation. You have a bunch of name nodes now, instead of one name node, and each of these name nodes essentially writes to its own pool. But the pools are spread out over the data nodes just like before; this is where the data is spread, across the different data nodes. So, the block pool is essentially the main thing that is different in Hadoop version 2, or HDFS version 2.
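As an illustration (this is a sketch, not from the lecture; the name service names and hostnames are hypothetical), a federated deployment is typically described to clients by listing the name services and the address of each name node:

```java
import org.apache.hadoop.conf.Configuration;

public class FederationConfigSketch {
    public static Configuration federatedConf() {
        Configuration conf = new Configuration();
        // Two independent namespaces (name services), each served by its own name node.
        conf.set("dfs.nameservices", "ns1,ns2");
        conf.set("dfs.namenode.rpc-address.ns1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "namenode2.example.com:8020");
        return conf;
    }
}
```

Each name service manages its own namespace and block pool, while the data nodes underneath store blocks for all of the pools.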
So, let us look at the HDFS performance measures: determining the number of blocks for a given file size, how the key HDFS system components are affected by the block size, and the impact of using a lot of small files on HDFS. These are some of the performance measures for which we are going to tune the parameters and measure the performance. Let us summarize these different tuning parameters for performance. First, how many blocks are there for a given file size? This is required to be known; we will see that there is a trade-off in the number of blocks to be replicated. Another key parameter is the size of the block. Here, the block size varies from 64 MB to 128 MB. If the block size is 64 MB, what is the performance, and if we increase the block size, what will be the performance? Similarly, there is the replication factor: if the replication factor is 3, that means every block is replicated on three different nodes. If the replication factor is 1, then obviously we are saving a lot of space, but we are going to sacrifice performance and robustness; so there is a trade-off between these. Another important parameter for HDFS performance is the number of small files stored in HDFS. If there are a lot of small files in HDFS, the performance goes down; we will see how, and how this particular problem is overcome, in the further slides.
So, let us recall again the HDFS architecture, where the data is distributed on the local disks of several nodes. Here, in this particular picture, we have shown several nodes where the data is divided; this is called distributed data stored on different local disks. For example, consider a 10 GB file stored with a 64 MB block size; the number of blocks is:
(10 × 1024 MB) / 64 MB = 160 blocks
So, we have to store 160 blocks in a distributed manner on several nodes, and therefore this particular block size matters a lot. If we increase the size of the block, then obviously there will be fewer than 160 blocks. And if the number of blocks is more, then more parallel operations are possible. We are going to see the effect of keeping a small block size of 64 MB versus more than 64 MB.
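Continuing the same example, doubling the block size to 128 MB halves the block count for the same 10 GB file:

(10 × 1024 MB) / 128 MB = 80 blocks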
So, consider the importance of the number of blocks in a file. If the number of blocks is more, then the amount of memory used in the name node will be more. Every file and every block that you create is an object in the name node; a single file can consist of a lot of blocks, as we saw in the previous case with 160 blocks. If you have millions of files, then millions of objects are required, and the amount of space needed in the name node to manage them becomes several times bigger when the number of blocks and files is more. So, the number of blocks affects the memory footprint of the name node; it determines how much memory is used in the name node to manage that many blocks of a file. Now, the number of map tasks also matters a lot. For example, if the file is divided into 160 blocks, then at least 160 different map functions are required to be executed to cover the entire dataset's operations or computations. Hence, if the number of blocks is more, not only is it going to take more space in the name node, but a larger number of map functions is also required to be executed.
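To put a rough number on this (the per-object figure is a commonly cited rule of thumb, not stated in the lecture): each file and each block object in the name node is often estimated at about 150 bytes of heap. Under that assumption, one million single-block files cost roughly:

2,000,000 objects × 150 bytes ≈ 300 MB of name node memory

and that is just for the metadata, before any data is read.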
Hence, there has to be a trade-off. Similarly, a large number of small files will impact the name node, because a lot of memory is required to store the metadata of that many small files. The network load is also going to increase in this particular case.
So, HDFS is therefore optimized for large file sizes, and a lot of small files is bad. One solution to this particular problem is to merge or concatenate the files: there is a facility called sequence files, where several files are merged together in a sequence, and that is called a sequence file, which is treated as one big file instead of keeping many small files; a sketch is shown below. Another solution to the problem of lots of small files is to use HBase and Hive configurations for this large number of small files; they can be used to optimize this particular issue. There is also another solution: combining the input files using a combined file input format (CombineFileInputFormat).
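Here is a minimal sketch of packing many small files into one Hadoop SequenceFile using the standard SequenceFile.Writer API. The record layout (key = file name, value = file contents) and the output path are assumptions for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/packed.seq");   // hypothetical output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (String name : args) {              // small files passed on the command line
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (InputStream in = fs.open(new Path(name))) {
                    IOUtils.copyBytes(in, buf, conf, false);  // read the whole small file
                }
                // One record per small file: key = file name, value = file contents.
                writer.append(new Text(name), new BytesWritable(buf.toByteArray()));
            }
        }
    }
}
```

The result is one large file, so the name node tracks a handful of blocks instead of one object per small file.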
Now, let us see in more detail the read and write processes in HDFS, and how they are supported.
Now, the read process in HDFS. First of all, we have to identify that there is a name node, there is the client, and there are several data nodes; in this example, we have one name node, three data nodes, and a client which will perform the read operation. The HDFS client will request to read a particular file; this is the read operation, and this particular request will go to the name node to learn the blocks on which the read operation is to be executed, so that the data can be given back to the client. The client sends the request to the name node, and the name node gives this block information back to the HDFS client side. From there, the client has two options in this example: whether to read from block number four or from block number five. It will try to read from the one which is the closest, and this particular data is given back to the client.
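A minimal sketch of this read path from the client's point of view (the file path is hypothetical); the round trip to the name node and the replica selection happen inside the standard API's open() and read() calls:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the name node for block locations; the client then
        // streams each block from the nearest data node holding a replica.
        try (FSDataInputStream in = fs.open(new Path("/data/example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```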
Then, let us see the write operation, which is again initiated by the client. Whenever a client wants to do a write operation, this write operation will first request the name node to find the data nodes which can be used to store the client's data. After getting this information back, the write operation is performed on the closest such data node, and that data node has to carry out the replication. If, let us say, the replication factor is 3, then this is done in the form of a pipeline: the client will write the data on a particular data node, and that data node in turn will carry out the pipeline for the replication; this is called a replication pipeline. Once the replication is over, an acknowledgment is sent back, and the write operation is completed.
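And a matching sketch of the write path (again, the path name is hypothetical); the replication pipeline described above is driven transparently by the output stream returned from create():

```java
import java.io.PrintWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() asks the name node for target data nodes; as the client
        // writes, each packet flows through the replication pipeline
        // (data node 1 -> data node 2 -> data node 3) before being acknowledged.
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"));
             PrintWriter writer = new PrintWriter(out)) {
            writer.println("hello, HDFS");
        }
    }
}
```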
So, now we are going to see the HDFS tuning parameters, especially the DFS block size, and also the name node, data node, and other related tuning parameters.
Let us see which parameters are most important and need to be decided from a performance perspective. Recall that the HDFS block size impacts how much name node memory is used, the number of map tasks that show up, and thus the overall performance. By default, the block size is 64 megabytes; typically, it can go up to 128 megabytes, and it can be changed based on the workloads. If, let us say, we want better performance and the file size is very large, then more than 64 megabytes is appropriate. The parameter that makes this particular change is known as the DFS block size, dfs.block.size, where we have to mention the block size; by default it is 64 MB, but we can increase it up to 128 MB. If the block size is larger, then obviously the number of blocks will be fewer; if the number of blocks is fewer, then the amount of space required in the name node's memory will be less, and the number of map tasks required to execute will also be fewer. So, basically there is a trade-off, and where performance is required, we have to set this block size according to the application.
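A minimal sketch of setting this for a job, using the dfs.block.size property the lecture names (in newer Hadoop releases the same setting also appears under the name dfs.blocksize); the 128 MB value is just the example figure from the lecture:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeTuning {
    public static Configuration withLargeBlocks() {
        Configuration conf = new Configuration();
        // Raise the block size from the 64 MB default to 128 MB for large
        // files: fewer blocks, less name node memory, fewer map tasks.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        return conf;
    }
}
```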
Another parameter is the HDFS replication factor: by default the replication is 3, and this parameter is set via dfs.replication in the configuration file. There is a trade-off here: if we lower it to reduce the replication cost, that is, if the replication factor is less than 3, then the replication cost will be less, but the trade-off is that it will be less robust. Less robust in the sense that if some of the nodes fail and there is only one copy, then no replicas of that data are available, and that particular data will not be accessible; hence it will be less robust. It will also lose performance: for example, if the data is replicated, then a read can be served from the data block closest to the client. So, higher replication can make data local to more workers, while lower replication means more free space but less robustness and locality.
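A sketch of adjusting the replication factor (the path is hypothetical): the default for new files can be changed via dfs.replication, or overridden per file with the setReplication() call of the standard API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);   // default for newly created files

        FileSystem fs = FileSystem.get(conf);
        // Per-file override: trade robustness and locality for space on cold data.
        fs.setReplication(new Path("/data/cold/archive.dat"), (short) 1);
    }
}
```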
So far, we have discussed HDFS robustness. Replication on the data nodes is done such that it is rack fault tolerant: the replicas are placed across racks, so that if one rack is down, the data can be served from the other rack. The name node receives heartbeats and block reports from the data nodes; all of this is monitored, and whenever a data node is down, this information is captured by the name node, and that particular node is then not used to serve client requests.
Multiple copies of the central metadata structures are maintained to handle these common failures. Failover to a standby name node is also available, though normally it is done manually, by default.
Now, there is a trade-off between replication and robustness. The idea is that if we reduce the replication factor, it is going to affect the robustness. For example, if a block is not replicated to other data nodes, and the data node containing that block fails, then the data is not available at any other end; hence it affects the robustness. That is why replication is so important. One performance benefit is that, when you go out to run some Map Reduce jobs, having replicas gives additional locality possibilities. But the big trade-off is robustness: with no replica, we might lose a node, or a local disk might not recover, and because there is no replica, the data is lost. Hence, if no replica is available, it leads to a failure, and the system is not robust. Similarly with data corruption: if we get a checksum that is bad, we cannot recover, because we do not have any replicas. Changes to the other parameters have similar effects. So, basically, there is a trade-off between replication and robustness.
So, in conclusion, in this lecture we have discussed HDFS and HDFS version 2, and the read and write operations supported in HDFS. We have also seen the main configuration and performance tuning parameters, with respect to the block size and the replication factor, to understand the HDFS performance and robustness trade-off. Thank you.