Unit 1
Data science
Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze large
amounts of data. This analysis helps data scientists to ask and answer questions like what
happened, why it happened, what will happen, and what can be done with the results.
Importance of Data Science
Data science is important because it combines tools, methods, and technology to generate
meaning from data. Modern organizations are inundated with data; there is a proliferation of
devices that can automatically collect and store information. Online systems and payment
portals capture more data in the fields of e-commerce, medicine, finance, and every other
aspect of human life. We have text, audio, video, and image data available in vast quantities.
Math Skills:
Linear Algebra, Multivariable Calculus & Optimization Techniques: These three areas are very important, as they help us understand the various machine learning algorithms that play an important role in Data Science.
Statistics & Probability: A solid understanding of statistics is very significant, as it forms the core of data analysis. Probability underpins statistics and is considered a prerequisite for mastering machine learning.
Programming Knowledge: One needs to have a good grasp of programming concepts such
as Data structures and Algorithms. The programming languages used are Python, R, Java,
Scala. C++ is also useful in some places where performance is very important.
Relational Databases:
One needs to know SQL and relational databases such as Oracle or MySQL so that the necessary data can be retrieved from them whenever required.
Non-Relational Databases:
There are many types of non-relational databases but mostly used types are Cassandra,
HBase, MongoDB, CouchDB, Redis, Dynamo.
Machine Learning:
It is one of the most vital parts of data science and one of the hottest subjects of research, so new advancements are made every year. One needs to understand at least the basic algorithms of supervised and unsupervised learning. There are multiple libraries available in Python and R for implementing these algorithms.
Distributed Computing: This is also one of the most important skills for handling large amounts of data, because such volumes cannot be processed on a single system. The most commonly used tools are Apache Hadoop and Spark. The two major parts of these tools are HDFS (Hadoop Distributed File System), which is used for storing data over a distributed file system, and MapReduce, with which we manipulate the data. One can write MapReduce programs in Java or Python. There are various other tools such as Pig, Hive, etc.
Communication Skill:
It includes both written and verbal communication. In a data science project, after conclusions are drawn from the analysis, the results have to be communicated to others. Sometimes this may be a report you send to your boss or team at work. Other times it may be a blog post. Often it may be a presentation to a group of colleagues. Regardless, a data science project always involves some form of communication of the project's findings, so communication skills are necessary for becoming a data scientist.
Data Scientist —
Data science
Data science is the professional field that deals with turning data into value such as new
insights or predictive models. It brings together expertise from fields including statistics,
mathematics, computer science, communication as well as domain expertise such as business
knowledge. Data scientist has recently been voted the No 1 job in the U.S., based on current
demand and salary and career opportunities.
Data mining
Data mining is the process of discovering insights from data. In terms of Big Data, because it
is so large, this is generally done by computational methods in an automated way using
methods such as decision trees, clustering analysis and, most recently, machine learning. This
can be thought of as using the brute mathematical power of computers to spot patterns in data
which would not be visible to the human eye due to the complexity of the dataset.
Hadoop
Hadoop is a framework for Big Data computing which has been released into the public
domain as open source software, and so can freely be used by anyone. It consists of a number
of modules all tailored for a different vital step of the Big Data process – from file storage
(Hadoop File System – HDFS) to database (HBase) to carrying out data operations (Hadoop
MapReduce – see below). It has become so popular due to its power and flexibility that it has
developed its own industry of retailers (selling tailored versions), support service providers
and consultants.
Predictive modelling
At its simplest, this is predicting what will happen next based on data about what has
happened previously. In the Big Data age, because there is more data around than ever before,
predictions are becoming more and more accurate. Predictive modelling is a core component
of most Big Data initiatives, which are formulated to help us choose the course of action
which will lead to the most desirable outcome. The speed of modern computers and the volume of data available mean that predictions can be made based on a huge number of variables, each of which can be assessed for the probability that it will lead to success.
MapReduce
MapReduce is a computing procedure for working with large datasets, which was devised
due to difficulty of reading and analysing really Big Data using conventional computing
methodologies. As its name suggests, it consists of two procedures – mapping (sorting information into the format needed for analysis – e.g. sorting a list of people according to their age) and reducing (performing an operation, such as checking the age of everyone in the dataset to see who is over 21).
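To make the two procedures concrete, here is a minimal, pure-Python sketch of the map and reduce steps described above, using a small, made-up list of people; a real MapReduce job would run these steps in parallel across many machines.
    # Hypothetical input data for illustration only.
    people = [{"name": "Asha", "age": 34}, {"name": "Ravi", "age": 19}, {"name": "Meena", "age": 25}]
    # Map step: reshape each record into (key, value) pairs, here (age, name),
    # which also lets the records be sorted by age.
    mapped = sorted((p["age"], p["name"]) for p in people)
    # Reduce step: aggregate over the mapped output, here counting who is over 21.
    over_21 = sum(1 for age, _ in mapped if age > 21)
    print(over_21)  # prints 2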
NoSQL
NoSQL refers to a database format designed to hold more than data which is simply arranged
into tables, rows, and columns, as is the case in a conventional relational database. This
database format has proven very popular in Big Data applications because Big Data is often
messy, unstructured and does not easily fit into traditional database frameworks.
Python
Python is a programming language which has become very popular in the Big Data space due
to its ability to work very well with large, unstructured datasets (see Part II for the difference
between structured and unstructured data). It is considered to be easier to learn for a data
science beginner than other languages such as R (see also Part II) and more flexible.
R
R is another programming language commonly used in Big Data, and can be thought of as
more specialised than Python, being geared towards statistics. Its strength lies in its powerful
handling of structured data. Like Python, it has an active community of users who are
constantly expanding and adding to its capabilities by creating new libraries and extensions.
Recommendation engine
A recommendation engine is basically an algorithm, or collection of algorithms, designed to
match an entity (for example, a customer) with something they are looking for.
Recommendation engines used by the likes of Netflix or Amazon heavily rely on Big Data
technology to gain an overview of their customers and, using predictive modelling, match
them with products to buy or content to consume. The economic incentives offered by recommendation engines have been a driving force behind a lot of commercial Big Data initiatives and developments over the last decade.
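As a rough illustration of the idea (not how Netflix or Amazon actually implement it), the sketch below uses a tiny, invented user-item rating matrix and cosine similarity to find the item most similar to a given one; the ratings and item indices are purely hypothetical.
    import numpy as np

    # Hypothetical ratings: rows are users, columns are items.
    ratings = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
    ])

    def cosine(a, b):
        # Cosine similarity between two rating vectors.
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    target = ratings[:, 0]  # the item we want recommendations similar to
    sims = [cosine(target, ratings[:, j]) for j in range(1, ratings.shape[1])]
    best = 1 + int(np.argmax(sims))  # most similar other item
    print("Item most similar to item 0:", best)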
Real-time
Real-time means “as it happens” and in Big Data refers to a system or process which is able
to give data-driven insights based on what is happening at the present moment. Recent years
have seen a large push for the development of systems capable of processing and offering
insights in real-time (or near-real-time), and advances in computing power as well as
development of techniques such as machine learning have made it a reality in many
applications today.
Reporting
The crucial “last step” of many Big Data initiatives involves getting the right information to the people who need it to make decisions, at the right time. When this step is automated,
analytics is applied to the insights themselves to ensure that they are communicated in a way
that they will be understood and easy to act on. This will usually involve creating multiple
reports based on the same data or insights but each intended for a different audience (for
example, in-depth technical analysis for engineers, and an overview of the impact on the
bottom line for c-level executives).
Spark
Spark is another open source framework like Hadoop but more recently developed and more
suited to handling cutting-edge Big Data tasks involving real time analytics and machine
learning. Unlike Hadoop it does not include its own filesystem, though it is designed to work
with Hadoop’s HDFS or a number of other options. However, for certain data related
processes it is able to calculate at over 100 times the speed of Hadoop, thanks to its in-
memory processing capability. This means it is becoming an increasingly popular choice for
projects involving deep learning, neural networks and other compute-intensive tasks.
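As a brief, hedged illustration of Spark's programming model (assuming the pyspark package is installed and a hypothetical local file input.txt exists), the following Python sketch counts words using Spark's in-memory RDD operations:
    from pyspark.sql import SparkSession

    # Start a local Spark session.
    spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()

    # Read lines, split into words, and count each word in memory across the cluster.
    lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, count in counts.take(10):
        print(word, count)

    spark.stop()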
Visualisation
Humans find it very hard to understand and draw insights from large amounts of text or
numerical data – we can do it, but it takes time, and our concentration and attention is limited.
For this reason effort has been made to develop computer applications capable of rendering
information in a visual form – charts and graphics which highlight the most important
insights which have resulted from our Big Data projects. A subfield of reporting (see above), visualisation is now often an automated process, with visualisations customised by algorithm to be understandable to the people who need to act or take decisions based on them.
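A very small example of this idea, using Python's matplotlib library and made-up monthly sales figures, might look like the following; in practice, dashboards and BI tools automate charts like this at scale.
    import matplotlib.pyplot as plt

    # Hypothetical monthly sales figures used only for illustration.
    months = ["Jan", "Feb", "Mar", "Apr"]
    sales = [120, 135, 160, 148]

    plt.bar(months, sales)          # a simple bar chart highlights the trend at a glance
    plt.title("Monthly sales")
    plt.xlabel("Month")
    plt.ylabel("Units sold")
    plt.show()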
Structured Data
Able to be processed, sorted, analyzed, and stored in a predetermined format, then
retrieved in a fixed format
Accessed by a computer with the help of search algorithms
First type of big data to be gathered
Easiest of the three types of big data to analyze
Examples of structured data include:
Application-generated data
Dates
Names
Numbers (e.g., telephone, credit card, US ZIP Codes, social security)
Semi-Structured Data
Contains both structured as well as unstructured information
Data may be formatted in segments
Appears to be fully-structured, but may not be
Not in the standardized database format as structured data
Has some properties that make it easier to process than unstructured data
Examples
CSV
Electronic data interchange (EDI)
HTML
JSON documents
NoSQL databases
Portable Document Files (PDF)
RDF
XML
Unstructured Data
Not in any predetermined format (i.e., no apparent format)
Accounts for the majority of the digital data that makes up big data
Examples of the different types of unstructured data include:
Human-generated data
Email
Text messages
Invoices
Text files
Social media data
Machine-generated data
Geospatial data
Weather data
Data from IoT and smart devices
Radar data
Videos
Satellite images
Scientific data
There are five V's of Big Data that explain its characteristics.
Structured data: Structured data follows a well-defined schema with all the required columns and is in tabular form. Structured data is stored in a relational database management system.
Semi-structured: In semi-structured data, the schema is not rigidly defined, e.g., JSON, XML, CSV, TSV, and email. It can still be stored in relations, i.e., tables, and some OLTP (Online Transaction Processing) systems are built to work with it.
Unstructured Data: All unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a great deal of such data available but do not know how to derive value from it, since the data is raw.
Quasi-structured Data: Textual data with inconsistent formats that can be structured only with time, effort, and some tools.
Example: web server logs, i.e., log files created and maintained by a server that contain a list of activities. A minimal parsing sketch follows below.
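As a minimal sketch of how such a log line can be given structure with a little effort (the log line below is a made-up example in the common web server log format), Python's re module can extract the fields:
    import re

    # Hypothetical log line in the common log format; real server logs vary.
    line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

    pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)'
    match = re.match(pattern, line)
    if match:
        ip, timestamp, method, path, status, size = match.groups()
        print(ip, method, path, status)   # 127.0.0.1 GET /index.html 200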
Veracity
Veracity refers to how reliable the data is. There are many ways to filter or translate the data, and veracity is about being able to handle and manage data efficiently. Big Data veracity is also essential in business development.
For example, Facebook posts with hashtags.
Value
Value is an essential characteristic of big data. It is not just any data that we process or store; it is valuable and reliable data that we store, process, and also analyze.
Big data velocity deals with the speed at which data flows from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
1. Volume
Volume refers to the huge amount of data that is collected and generated every second in large organizations. This data is generated from different sources such as IoT devices, social media, videos, financial transactions, and customer logs.
Storing and processing this huge amount of data was a problem earlier. But now distributed
systems such as Hadoop are used for organizing data collected from all these sources. The
size of the data is crucial for understanding its value. Also, the volume is useful in
determining whether a collection of data is Big Data or not.
Data volume can vary. For example, a text file is a few kilobytes, whereas a video file is a few megabytes. In fact, Facebook (from Meta) by itself produces an enormous volume of data in a single day: billions of messages, likes, and posts each day contribute to generating such huge data.
Global mobile traffic was estimated to be around 6.2 exabytes (6.2 billion GB) per month in the year 2016.
2. Variety
Another one of the most important Big Data characteristics is its variety. It refers to the
different sources of data and their nature. The sources of data have changed over the years.
Earlier, it was only available in spreadsheets and databases. Nowadays, data is present in
photos, audio files, videos, text files, and PDFs.
The variety of data is crucial for its storage and analysis.
A variety of data can be classified into three distinct parts:
Structured data
Semi-Structured data
Unstructured data
3. Velocity
This term refers to the speed at which data is created or generated. The speed at which data is produced is also related to how fast it needs to be processed, because only after analysis and processing can the data meet the demands of the clients/users.
Massive amounts of data are produced from sensors, social media sites, and application logs
– and all of it is continuous. If the data flow is not continuous, there is no point in investing
time or effort on it.
As an example, per day, people generate more than 3.5 billion searches on Google.
4. Value
Among the characteristics of Big Data, value is perhaps the most important. No matter how fast the data is produced or how large it is, it has to be reliable and useful; otherwise, the data is not good enough for processing or analysis. Research suggests that poor-quality data can lead to almost a 20% loss in a company’s revenue.
Data scientists first convert raw data into information. Then this data set is cleaned to retrieve
the most useful data. Analysis and pattern identification is done on this data set. If the process
is a success, the data can be considered to be valuable.
5. Veracity
This feature of Big Data is connected to the previous one. It defines the degree of
trustworthiness of the data. As most of the data you encounter is unstructured, it is important
to filter out the unnecessary information and use the rest for processing.
Veracity is one of the characteristics of big data analytics that denotes data inconsistency as
well as data uncertainty.
For example, a huge amount of data can create confusion, whereas too little data can convey inadequate or incomplete information.
Other than these five traits of big data in data science, there are a few more characteristics of big data analytics, discussed below:
1. Volatility
One of the big data characteristics is volatility, which means rapid change. Big data is continuously changing; for example, data collected from a particular source may change within a span of a few days. This characteristic of Big Data hampers data homogenization and is also known as the variability of data.
2. Visualization
Visualization is one more characteristic of big data analytics. Visualization is the method of representing the big data that has been generated in the form of graphs and charts. Big data professionals have to share their big data insights with non-technical audiences on a daily basis.
Evolution of Big Data —
Over the last few decades, Big Data technology has grown substantially. There have been many milestones in the evolution of Big Data, described below:
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
4. Cloud Computing:
Cloud Computing technology helps companies to store their important data in data
centers that are remote, and it saves their infrastructure cost and maintenance costs.
5. Machine Learning:
Machine Learning algorithms work on large datasets, analyzing huge amounts of data to extract meaningful insights. This has led to the development of artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of
data in real time.
7. Edge Computing:
Edge Computing is a distributed computing paradigm that allows data processing to be done at the edge of the network, closer to the source of the data.
Overall, big data technology has come a long way since the early days of data warehousing.
The introduction of Hadoop, NoSQL databases, cloud computing, machine learning, data
streaming, and edge computing has revolutionized how we store, process, and analyze large
volumes of data. As technology evolves, we can expect Big Data to play a very important
role in various industries.
Big Data Analytics —
Big data analytics is the often complex process of examining big data to uncover information
-- such as hidden patterns, correlations, market trends and customer preferences -- that can
help organizations make informed business decisions.
On a broad scale, data analytics technologies and techniques give organizations a way to
analyze data sets and gather new information. Business intelligence (BI) queries answer basic
questions about business operations and performance.
Big data analytics is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms and what-if analysis powered by analytics systems.
Classification of Analytics —
There are four types of analytics: Descriptive, Diagnostic, Predictive, and Prescriptive. These four categories can be compared in terms of the value they add to an organization versus the complexity required to implement them. The idea is that you should start with the easiest to implement, Descriptive Analytics. Below, we review the four analytics types, examples of their use cases, and how they all work together.
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques that are used for predictive analytics are:
Linear Regression
Time Series Analysis and Forecasting
Data Mining
Basic Cornerstones of Predictive Analytics
Predictive modeling
Decision Analysis and optimization
Transaction profiling
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It examines past performance by mining historical data to understand the causes of success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups. Unlike a predictive model that focuses on predicting the
behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
Common examples of Descriptive analytics are company reports that provide historic reviews
like:
Data Queries
Reports
Descriptive Statistics
Data dashboard
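As a small, hypothetical example of descriptive analytics in Python (using pandas and made-up sales records), descriptive statistics and a simple grouped report can be produced as follows:
    import pandas as pd

    # Hypothetical historical sales records for illustration.
    df = pd.DataFrame({
        "region": ["North", "South", "North", "East"],
        "sales":  [1200, 950, 1430, 1100],
    })

    # Descriptive statistics: count, mean, spread, minimum and maximum of past sales.
    print(df["sales"].describe())

    # A simple historical report: total sales per region.
    print(df.groupby("region")["sales"].sum())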
Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests decision options to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen but also why it will happen. Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implication of each decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data of external factors such
as economic data, population demography, etc.
Diagnostic Analytics
In this analysis, we generally use historical data in preference to other data to answer a question or solve a problem. We try to find dependencies and patterns in the historical data of the particular problem.
Companies go for this analysis because it gives great insight into a problem, and they also keep detailed information at their disposal; otherwise, data would have to be collected separately for every problem, which would be very time-consuming. Common techniques used for Diagnostic Analytics are listed below, followed by a short correlation example:
Data discovery
Data mining
Correlations
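A minimal sketch of the correlation technique, assuming a made-up table of historical advertising spend and sales, could use pandas as follows:
    import pandas as pd

    # Hypothetical historical data: advertising spend vs. units sold.
    df = pd.DataFrame({
        "ad_spend":   [100, 150, 200, 250, 300],
        "units_sold": [20, 27, 41, 50, 63],
    })

    # A correlation matrix is a quick first check for dependencies between variables.
    print(df.corr())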
Data analysis software makes it easier for users to process and manipulate information. It provides tools to assist with qualitative analysis, such as methodology support, and offers statistical and analytical capabilities for decision-making. Commonly used tools include:
R and Python
Microsoft Excel
Tableau
RapidMiner
KNIME
Power BI
Apache Spark
QlikView
Talend
Splunk
Linear Regression—
Linear regression analysis is used to predict the value of a variable based on the value of
another variable. The variable you want to predict is called the dependent variable. The
variable you are using to predict the other variable's value is called the independent variable.
This form of analysis estimates the coefficients of the linear equation, involving one or more
independent variables that best predict the value of the dependent variable. Linear regression
fits a straight line or surface that minimizes the discrepancies between predicted and actual
output values. There are simple linear regression calculators that use a “least squares” method
to discover the best-fit line for a set of paired data. You then estimate the value of the dependent variable (Y) from the independent variable (X).
You can perform the linear regression method in a variety of programs and environments,
including:
R linear regression
MATLAB linear regression
Sklearn linear regression
Linear regression Python
Excel linear regression
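As a short sketch in one of the environments listed above (Python with scikit-learn), and using made-up paired data for years of experience and salary, simple linear regression can be fitted as follows:
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical paired data: X = years of experience, y = salary (in thousands).
    X = np.array([[1], [2], [3], [4], [5]])
    y = np.array([30, 35, 42, 48, 55])

    model = LinearRegression()               # fits the best-fit line by least squares
    model.fit(X, y)
    print(model.coef_, model.intercept_)     # estimated slope and intercept
    print(model.predict([[6]]))              # predicted salary for 6 years of experience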
Polynomial Regression —
o Polynomial Regression is a regression algorithm that models the relationship between
a dependent(y) and independent variable(x) as nth degree polynomial. The
Polynomial Regression equation is given below:
y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n
o It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial regression for training is of non-linear nature.
o It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
o Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a linear
model."
Multivariate Regression
Multivariate regression is a technique used to measure the degree to which various independent variables and various dependent variables are linearly related to each other. The relation is said to be linear due to the correlation between the variables. Once multivariate regression is applied to the dataset, this method is then used to predict the behaviour of the response variables based on their corresponding predictor variables.
Multivariate regression is commonly used as a supervised algorithm in machine learning: a model that predicts the behaviour of dependent variables from multiple independent variables.
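A brief, hedged sketch with scikit-learn and an invented housing dataset (two predictor variables and two response variables) shows the idea; scikit-learn's LinearRegression fits one linear relationship per response variable.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: predictors = [area in sq. ft, number of rooms],
    # responses = [price, monthly rent] (arbitrary units).
    X = np.array([[800, 2], [1000, 3], [1200, 3], [1500, 4]])
    Y = np.array([[40, 12], [55, 15], [65, 18], [85, 22]])

    model = LinearRegression()
    model.fit(X, Y)                     # fits a linear model for each response variable
    print(model.predict([[1100, 3]]))   # predicted [price, rent] for a new observation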
UNIT 2
Introducing Hadoop
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Overview —
Hadoop makes it easier to use all the storage and processing capacity in cluster servers, and
to execute distributed processes against huge amounts of data. Hadoop provides the building
blocks on which other services and applications can be built.
Applications that collect data in various formats can place data into the Hadoop cluster by
using an API operation to connect to the NameNode. The NameNode tracks the file directory
structure and placement of “chunks” for each file, replicated across DataNodes. To run a job
to query the data, provide a MapReduce job made up of many map and reduce tasks that run
against the data in HDFS spread across the DataNodes. Map tasks run on each node against
the input files supplied, and reducers run to aggregate and organize the final output.
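As a hedged sketch of what such a job can look like (using Hadoop Streaming so that the map and reduce tasks are plain Python scripts; the HDFS paths and the location of the streaming jar below are hypothetical and vary by installation), a word-count job could be written as:
    # mapper.py - reads raw text lines from standard input and emits "word<TAB>1" pairs.
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py - receives the pairs sorted by word and sums the counts for each word.
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, int(value)
    if current is not None:
        print(current + "\t" + str(count))
The job would then be submitted along the lines of: hadoop jar <path-to-hadoop-streaming.jar> -files mapper.py,reducer.py -input /user/demo/input -output /user/demo/output -mapper "python3 mapper.py" -reducer "python3 reducer.py", where the jar location and HDFS paths depend on the installation.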
The Hadoop ecosystem has grown significantly over the years due to its extensibility. Today,
the Hadoop ecosystem includes many tools and applications to help collect, store, process,
analyze, and manage big data. Some of the most popular applications are:
Spark – An open source, distributed processing system commonly used for big data workloads. Apache Spark uses in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph processing, and ad hoc queries.
Presto – An open source, distributed SQL query engine optimized for low-
latency, ad-hoc analysis of data. It supports the ANSI SQL standard, including
complex queries, aggregations, joins, and window functions. Presto can process
data from multiple data sources including the Hadoop Distributed File System
(HDFS) and Amazon S3.
Hive – Allows users to leverage Hadoop MapReduce using a SQL interface,
enabling analytics at a massive scale, in addition to distributed and fault-tolerant
data warehousing.
HBase – An open source, non-relational, versioned database that runs on top of
Amazon S3 (using EMRFS) or the Hadoop Distributed File System (HDFS).
HBase is a massively scalable, distributed big data store built for random, strictly
consistent, real-time access for tables with billions of rows and millions of
columns.
Zeppelin – An interactive notebook that enables interactive data exploration.
RDBMS versus Hadoop —
RDBMS is best suited for an OLTP environment, whereas Hadoop is best suited for Big Data.
The data schema of RDBMS is static, whereas the data schema of Hadoop is dynamic.
RDBMS offers high data integrity, whereas Hadoop offers lower data integrity than RDBMS.
RDBMS incurs cost for licensed software, whereas Hadoop is free of cost, as it is open-source software.
HDFS (Hadoop Distributed File System):
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop
applications. This open source framework works by rapidly transferring data between nodes.
It's often used by companies who need to handle and store big data. HDFS is a key
component of many Hadoop systems, as it provides a means for managing big data, as well as
supporting big data analytics.
There are many companies across the globe that use HDFS, so what exactly is it and why is it
needed? Let's take a deep dive into what HDFS is and why it may be useful for businesses.
What is HDFS?
HDFS stands for Hadoop Distributed File System. HDFS operates as a distributed file system
designed to run on commodity hardware.
HDFS is fault-tolerant and designed to be deployed on low-cost, commodity hardware. HDFS
provides high throughput data access to application data and is suitable for applications that
have large data sets and enables streaming access to file system data in Apache Hadoop.
So, what is Hadoop? And how does it vary from HDFS? A core difference between Hadoop
and HDFS is that Hadoop is the open source framework that can store, process and analyze
data, while HDFS is the file system of Hadoop that provides access to data. This essentially
means that HDFS is a module of Hadoop.
Let's take a look at HDFS architecture:
The architecture focuses on NameNodes and DataNodes. The NameNode is the hardware that contains the GNU/Linux operating system and the NameNode software. It acts as the master server and can manage the files, control a client's access to files, and oversee file operations such as renaming, opening, and closing files.
A DataNode is hardware having the GNU/Linux operating system and DataNode software. For every node in an HDFS cluster, you will find a DataNode. These nodes manage the data storage of their system: they perform read-write operations on the file systems at the client's request, and also perform block creation, deletion, and replication when the NameNode instructs.
The HDFS meaning and purpose is to achieve the following goals:
Manage large datasets - Organizing and storing datasets can be a hard task to handle. HDFS is used to manage the applications that have to deal with huge datasets. To do this, HDFS can have hundreds of nodes per cluster.
Detecting faults - HDFS should have technology in place to scan and detect faults quickly and effectively, as it includes a large amount of commodity hardware. Failure of components is a common issue.
Hardware efficiency - When large datasets are involved it can reduce the network
traffic and increase the processing speed.
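For a feel of how applications and users interact with HDFS in practice, the usual command-line interface looks roughly like the following (the paths and file names here are hypothetical):
    hdfs dfs -mkdir -p /user/demo              # create a directory in HDFS
    hdfs dfs -put localfile.csv /user/demo/    # copy a local file into HDFS
    hdfs dfs -ls /user/demo                    # list the directory contents
    hdfs dfs -cat /user/demo/localfile.csv     # read the file stored in HDFS
    hdfs dfs -get /user/demo/localfile.csv .   # copy it back to the local filesystem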
Components and Block Replication —
Replication of blocks
HDFS is a reliable storage component of Hadoop. This is because every block stored in the filesystem is replicated on different DataNodes in the cluster. This makes HDFS fault-tolerant.
The default replication factor in HDFS is 3. This means that every block will have two more
copies of it, each stored on separate DataNodes in the cluster. However, this number is
configurable.
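As a hedged illustration, the cluster-wide default is typically set through the dfs.replication property in hdfs-site.xml, and the replication factor of an individual file can be changed from the command line (the path below is hypothetical):
    <property>
      <name>dfs.replication</name>
      <value>3</value>   <!-- default number of copies kept for each block -->
    </property>

    hdfs dfs -setrep -w 2 /user/demo/localfile.csv   # change replication for one file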
But you must be wondering: doesn’t that mean that we are taking up too much storage? For instance, if we have 5 blocks of 128 MB each, that amounts to 5*128*3 = 1920 MB. True.
But then these nodes are commodity hardware. We can easily scale the cluster to add more of
these machines. The cost of buying machines is much lower than the cost of losing the data!
Now, you must be wondering: how does the NameNode decide which DataNodes to store the replicas on? Well, before answering that question, we need to have a look at what a Rack is in Hadoop.
Hadoop —
Hadoop stores and processes the data in a distributed manner across the cluster of commodity
hardware. To store and process any data, the client submits the data and program to the
Hadoop cluster.
Hadoop HDFS stores the data, MapReduce processes the data stored in HDFS, and YARN
divides the tasks and assigns resources.
Introduction to MapReduce —
Traditional enterprise systems normally have a centralized server to store and process data. The following illustration depicts a schematic view of a traditional enterprise system.
The traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce
divides a task into small parts and assigns them to many computers. Later, the results are
collected at one place and integrated to form the result dataset.
Features of MapReduce
1. Highly scalable
Apache Hadoop MapReduce is a framework with excellent scalability, because of its capacity for distributing and storing large amounts of data across numerous servers. These servers can all run simultaneously and are all reasonably priced.
By adding servers to the cluster, we can simply grow the amount of storage and computing
power. We may improve the capacity of nodes or add any number of nodes (horizontal
scalability) to attain high computing power. Organizations may execute applications from
massive sets of nodes, potentially using thousands of terabytes of data, thanks to Hadoop
MapReduce programming.
2. Versatile
Businesses can use MapReduce programming to access new data sources. It makes it possible
for companies to work with many forms of data. Enterprises can access both organized and
unstructured data with this method and acquire valuable insights from the various data
sources.
Since Hadoop is an open-source project, its source code is freely accessible for review,
alterations, and analyses. This enables businesses to alter the code to meet their specific
needs. The MapReduce framework supports data from sources including email, social media,
and clickstreams in different languages.
3. Secure
The MapReduce programming model uses the HBase and HDFS security approaches, and
only authenticated users are permitted to view and manipulate the data. HDFS uses a
replication technique in Hadoop 2 to provide fault tolerance. Depending on the replication
factor, it makes a clone of each block on the various machines. One can therefore access data
from the other devices that house a replica of the same data if any machine in a cluster goes
down. Erasure coding has taken the place of this replication technique in Hadoop 3. Erasure coding delivers the same level of fault tolerance with less storage; the storage overhead with erasure coding is less than 50%.
4. Affordability
With the help of the MapReduce programming framework and Hadoop’s scalable design, big
data volumes may be stored and processed very affordably. Such a system is particularly cost-
effective and highly scalable, making it ideal for business models that must store data that is
constantly expanding to meet the demands of the present.
5. Fast-paced
The Hadoop Distributed File System, the distributed storage layer used by MapReduce, provides a mapping system for locating data in a cluster. Data processing technologies, such as MapReduce programming, are typically placed on the same servers as the data, which enables quicker data processing.
Thanks to Hadoop’s distributed data storage, users may process data in a distributed manner
across a cluster of nodes. As a result, it gives the Hadoop architecture the capacity to process
data exceptionally quickly. Hadoop MapReduce can process unstructured or semi-structured
data in high numbers in a shorter time.
6. Parallel processing-compatible
The parallel processing involved in MapReduce programming is one of its key components.
The tasks are divided in the programming paradigm to enable the simultaneous execution of
independent activities. As a result, the program runs faster because of the parallel processing,
which makes it simpler for the processes to handle each job. Multiple processors can carry
out these broken-down tasks thanks to parallel processing. Consequently, the entire software
runs faster.
7. Reliable
The same set of data is transferred to some other nodes in a cluster each time a collection of
information is sent to a single node. Therefore, even if one node fails, backup copies are
always available on other nodes that may still be retrieved whenever necessary. This ensures
high data availability.
The framework offers a way to guarantee data trustworthiness through the use of Block
Scanner, Volume Scanner, Disk Checker, and Directory Scanner modules. Your data is safely
saved in the cluster and is accessible from another machine that has a copy of the data if your
device fails or the data becomes corrupt.
8. Highly available
Hadoop’s fault tolerance feature ensures that even if one of the DataNodes fails, the user may
still access the data from other DataNodes that have copies of it. Moreover, a high-availability Hadoop cluster comprises two or more NameNodes (active and passive) running in hot standby. The active NameNode is the working node, while a passive NameNode is a backup node that applies the changes made in the active NameNode’s edit logs to its own namespace.