Big-ETL: Extracting-Transforming-Loading
Abstract— The ETL process (Extracting-Transforming-Loading) is responsible for (E)xtracting data from heterogeneous sources, (T)ransforming them, and finally (L)oading them into a data warehouse (DW). Nowadays, the Internet and Web 2.0 generate data at an increasing rate and therefore confront information systems (IS) with the challenge of big data. Data integration systems, and ETL in particular, should be revisited and adapted; the well-known solution is based on data distribution and parallel/distributed processing. Among all the dimensions defining the complexity of big data, we focus in this paper on its excessive "volume" in order to ensure good performance for ETL processes. In this context, we propose an original approach called Big-ETL (ETL Approach for Big Data) in which we define ETL functionalities that can be run easily on a cluster of computers with the MapReduce (MR) paradigm. Big-ETL thereby allows parallelizing/distributing the ETL at two levels: (i) the ETL process level (coarse granularity level), and (ii) the functionality level (fine granularity level); this allows improving the ETL performance further.

Keywords: Data Warehousing, Extracting-Transforming-Loading, Parallel/distributed processing, Big Data, MapReduce.
1. Introduction

The widespread use of the internet, Web 2.0, social networks, and digital sensors produces non-traditional data volumes. Indeed, MapReduce (MR) jobs run continuously on Google clusters and deal with over twenty petabytes of data per day [1]. This data explosion is an opportunity for the emergence of new business applications such as Big Data Analytics (BDA); but it is, at the same time, a problem given the limited capabilities of machines and traditional applications. These large data are now called "big data" and are characterized by the four "V"s [2]: Volume implies an amount of data going beyond the usual units, Velocity means the speed with which these data are generated and should be processed, Variety is defined as the diversity of formats and structures, and Veracity relates to data accuracy and reliability. Furthermore, new paradigms have emerged, such as Cloud Computing [3] and MapReduce (MR) [4]. In addition, novel data models have been proposed for very large data storage, such as NoSQL (Not Only SQL) [5]. This paper aims to provide solutions to the problems caused by big data in a decision-support environment. We are particularly interested in very large data integration in a data warehouse. We propose a parallel/distributed ETL approach, called Big-ETL (ETL Approach for Big Data), consisting of a set of MR-based ETL functionalities. The solution offered by the research community in this context is to distribute the ETL process over a cluster of computers: each ETL process instance handles a partition of the data source in parallel to improve the performance of the ETL. This solution is defined only at the process level (coarse granularity level) and does not consider the ETL functionalities (fine granularity level), which would allow understanding the ETL complexity in depth and therefore improving the ETL process significantly. To the best of our knowledge, Big-ETL is a different and original approach in the data integration field. We first define an ETL process at a very fine level by parallelizing/distributing its core functionalities according to the MR paradigm. Big-ETL thereby allows parallelization/distribution of the ETL at two levels: (i) the ETL functionality level, and (ii) the ETL process level; this will further improve the ETL performance in the face of big data. To validate our Big-ETL approach, we developed a prototype and conducted some experiments.

The rest of this paper is structured as follows. Section 2 presents a state of the art in the ETL field, followed by a classification of the ETL approaches proposed in the literature according to the parallelization criteria. Section 3 is devoted to our Big-ETL approach. We present in Section 4 our prototypical implementation and the conducted experiments. We conclude and present our future work in Section 5.

2. Related work

One of the first contributions in the ETL field is [6]. It is a modeling approach based on a non-standard graphical formalism, with ARKTOS II as the implemented framework. It is the first contribution that allows modeling an ETL process with all its details at a very fine level, i.e. the attribute. In [7], the authors proposed a more holistic modeling approach based on UML (Unified Modeling Language) but with fewer details on the ETL process compared to [6]. The authors in [8] adopted the BPMN notation (Business Process Model and Notation), a standard notation dedicated to business process modeling. This work was followed by [9], a modeling framework based on a metamodel in an MDD (Model Driven Development) architecture. [7] and [8] are top-down approaches and therefore allow modeling
sub-processes in their collapsed/expanded form for better readability. The authors in [10] proposed a modeling approach that provides a summary view of the ETL process and adopts the Reo model [11]. We consider that this contribution could be interesting, but it is not mature enough and requires a customization of the Reo model to support the ETL specifics.
These approaches parallelize/distribute the ETL only at the process level. Big-ETL applies MR at two levels: (i) the process level (coarse granularity level), and (ii) the functionality level (fine granularity level). We believe that the ETL, in the context of the technological changes having affected both data and processes, still presents some scientific problems, such as big data modeling considering its different characteristics (volume, variety, velocity, veracity, ...), data partitioning, parallel processing in its various forms (process parallelization, process-component parallelization, pipeline parallelization, ...), etc. Functionalities, as the core ETL functions, deserve a more in-depth study to ensure, at a very fine level, the robustness, reliability, and optimization of the ETL process. Our Big-ETL is a parallel/distributed ETL approach based on two distribution levels (process and functionality) and two distribution directions (vertical and horizontal).
3. ETL Approach for Big Data

We present in this section our Big-ETL approach. We deployed it on many ETL functionalities, such as Changing Data Capture (CDC), Data Quality Validation (DVQ), Surrogate Key (SK), Slowly Changing Dimension (SCD), and Surrogate Key Pipeline (SKP). Among all these ETL functionalities, we chose to present CDC in this paper to illustrate our Big-ETL approach.

An elementary process is an atomic unit of processing which is synchronized with other elementary processes to ensure the ETL functionality. Each elementary process is implemented as an elementary function. Thus, we consider that the aggregate functionality is a set of synchronized elementary functions. For example, the CDC functionality, which is responsible for identifying the changes (INSERT, UPDATE, DELETE) having affected the data in a particular source, can be decomposed into three elementary functions, each one in charge of identifying INSERTs, UPDATEs, and DELETEs respectively.

3.1.2 Vertical Distribution of Functionalities (VDF)

As shown in Fig. 3, the ETL process runs in one instance, while each of its functionalities runs in multiple instances. For example, the functionality F4 (oval), which runs in three instances (fragments separated by dashes), receives its input data from F2 and F3. These inputs are partitioned, and each of the three partitions is subject to an instance of F4 (mapper). Partial results produced by the three mappers are merged by reducers to provide the final F4 outputs. This is a novelty among the parallel/distributed ETL approaches based on the MR paradigm, as the other approaches do not consider parallelization/distribution at the ETL functionality level.
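To make the vertical distribution concrete, the following Python sketch illustrates one functionality processed MR-style: the input of a functionality (here called F4, as in Fig. 3) is split into partitions, one mapper instance processes each partition, and a reducer merges the partial results. This is only an illustration under assumed names (partition, f4_map, f4_reduce) and an assumed toy transformation, not the Big-ETL implementation.

from multiprocessing import Pool

# Illustrative sketch of one ETL functionality distributed over its input
# partitions in the spirit of MapReduce; names and the filtering rule are
# assumptions, not the paper's code.

def partition(rows, n):
    # Split the functionality's input into n partitions.
    return [rows[i::n] for i in range(n)]

def f4_map(rows):
    # Mapper instance: apply the functionality to one partition
    # (toy example: keep rows with a positive amount).
    return [r for r in rows if r["amount"] > 0]

def f4_reduce(partial_results):
    # Reducer: merge the partial outputs produced by the mappers.
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged

if __name__ == "__main__":
    source = [{"id": i, "amount": i - 5} for i in range(20)]   # toy input from F2/F3
    with Pool(processes=3) as pool:                            # three F4 instances
        partials = pool.map(f4_map, partition(source, 3))
    print(len(f4_reduce(partials)))                            # size of the final F4 output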
Tables may contain hundreds of attributes in data warehousing systems. In order to improve the CDC performance and lower its cost, we adapted the well-known hash function CRC (Cyclic Redundancy Check), which is widely used in the digital data transmission field [19] and in internet applications [20]. We adapted the CRC function to the CDC context as follows. Let tuple1 and tuple2 be two tuples stored in ST and STpv respectively. If tuple1 and tuple2 satisfy both equations 1 and 2, they are similar; in this case, tuple1 is rejected by the CDC process since no changes have occurred. However, if only equation 1 is satisfied, tuple1 has been affected by changes and is extracted by the CDC process as an UPDATE.
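Consistent with the key-based lookup and the CRC comparison used in Algorithm 1 below, the two conditions presumably correspond to a key match and a CRC match, respectively:

tuple1.KEY = tuple2.KEY    (1)
CRC(tuple1) = CRC(tuple2)    (2)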
3.2.3 INSERT-UPDATE data capture (IUDCP) and DELETE data capture (DDCP) processes

We propose two parallel processes in the new CDC scheme that support (i) INSERT and UPDATE data capture (IUDCP), and (ii) DELETE data capture (DDCP).

A difference between the two CRC values reveals a change in at least one attribute between the two tuples; this is a modification (UPDATE).
Algorithm 1 IU_MAP(Pst)
Input: Pst, LookupSTpv, tuple1: ST record, tuple2: STpv record
Output: CHANGES

1: while not eof(Pst) do
2:   read(Pst, tuple1)
3:   Pstpv ← lookup(LookupSTpv, tuple1.KEY);
4:   if found() then
5:     tuple2 ← lookup(Pstpv, tuple1.KEY);
6:     if found() then
7:       if CRC(tuple1) ≠ CRC(tuple2) then
8:         extract tuple1 as UPDATE;
9:       end if
10:    else
11:      extract tuple1 as INSERT;
12:    end if
13:  else
14:    extract tuple1 as INSERT;
15:  end if
16: end while
17: return (CHANGES);
As shown in Fig. 8, DDCP operates on the same principle as IUDCP but in the opposite direction. In DDCP, we focus exclusively on the case where the tuple no longer exists (DELETE). In order to organize the processing of these combined operations, we propose a main CDC program called CDC_BigData. At this level, the ST and STpv tables are sorted and then partitioned, the LookupST and LookupSTpv tables are generated from ST and STpv respectively, and finally the parallel IUDCP and DDCP processes are invoked.
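The driver logic just described can be sketched in Python as follows. This is an assumption-laden illustration of the CDC_BigData idea (sort and partition ST and STpv, build the lookup structures, launch IUDCP and DDCP in parallel), not the actual implementation, and the helper names are hypothetical.

from multiprocessing import Process

# Hypothetical sketch of the CDC_BigData driver described above; the lookup
# representation (partitions indexed by their smallest key) is an assumption.

def partition_sorted(table, n):
    # Sort the table on its key and split it into about n partitions.
    rows = sorted(table, key=lambda t: t["KEY"])
    size = max(1, len(rows) // n)
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def build_lookup(partitions):
    # Index each partition by the smallest key it contains.
    return {min(t["KEY"] for t in p): p for p in partitions if p}

def iudcp(pst, lookup_stpv):
    pass  # capture INSERTs and UPDATEs in one ST partition (see Algorithm 1)

def ddcp(pstpv, lookup_st):
    pass  # capture DELETEs by scanning one STpv partition against LookupST

def cdc_bigdata(st, stpv, n):
    pst, pstpv = partition_sorted(st, n), partition_sorted(stpv, n)
    lookup_st, lookup_stpv = build_lookup(pst), build_lookup(pstpv)
    procs = [Process(target=iudcp, args=(p, lookup_stpv)) for p in pst]
    procs += [Process(target=ddcp, args=(p, lookup_st)) for p in pstpv]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()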
Algorithm 1 is responsible for capturing insertions and updates in the ST table. A Psti partition is processed by an instance of the iu_map() function. Line 3 looks up in LookupSTpv for a Pstpv partition which may contain the tuple read in line 2. Lines 4-12 describe the case where the partition Pstpvk is located. Line 5 looks up the tuple in Pstpv. Lines 6-9 treat the case of a tuple affected by changes (UPDATE) by invoking the CRC hash function. Lines 10-12 treat the case where the tuple does not exist in the partition Pstpvk and is thereby captured as an INSERT. Lines 13-15 treat the case where the tuple does not match any partition in the lookup table LookupSTpv and is thereby captured as an INSERT.
4. Implementation and experiment

We developed an ETL platform called P-ETL (Parallel-ETL) which provides (i) data distribution and (ii) parallel and distributed ETL processing. P-ETL is implemented in the Apache Hadoop environment and mainly uses two modules: (1) HDFS for distributed storage and high-throughput access to application data, and (2) MapReduce for parallel processing. We defined two levels for our experiments: (i) the ETL process level (coarse granularity level) and (ii) the ETL functionality level (fine granularity level). We present in this section the results for the first scenario. To evaluate our P-ETL platform, we proposed an example ETL process applied to students' data gathered at the Education Ministry. The data source contains the student identifier (St_id), the enrollment date (Enr_Date), the cycle (Bachelor, Master or Ph.D.), the specialty (medicine, biology, computer science, ...), and finally information about scholarship (whether the student receives a scholarship or not) and sport (whether he practices sport or not). We developed a program to generate the csv source data. In this experiment, we generated 7 samples of source data varying between 244 × 10^6 and 2.44 × 10^9 tuples, where each tuple has a size of 44 bytes. The ETL process configured to process the data is as follows. The first task is a projection which restricts the source tuples to a subset of attributes by excluding Scholarship and Sport. The second task is a restriction which filters tuples and rejects those having a Null value in Enr_Date, Cycle, or Specialty. The third task is GetDate(), which retrieves the year from Enr_Date. The last task is the aggregation function COUNT(), which computes the number of students grouped by enrolment year, Cycle, and Specialty.
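The configured process can be summarized by the following local Python sketch (a single-process illustration of the four tasks, not the P-ETL/Hadoop code); the csv column names and the date format are assumptions based on the description above.

import csv
from collections import Counter

# Local illustration of the configured ETL process: projection, restriction,
# GetDate (year extraction), then COUNT grouped by year, Cycle and Specialty.

def etl(path):
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Task 1: projection, dropping Scholarship and Sport.
            row = {k: v for k, v in row.items() if k not in ("Scholarship", "Sport")}
            # Task 2: restriction, rejecting tuples with Null values.
            if not row["Enr_Date"] or not row["Cycle"] or not row["Specialty"]:
                continue
            # Task 3: GetDate, extracting the year (assumes dates like 2014-09-01).
            year = row["Enr_Date"][:4]
            # Task 4: COUNT grouped by (year, Cycle, Specialty).
            counts[(year, row["Cycle"], row["Specialty"])] += 1
    return counts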
We studied the P-ETL scalability by varying the data source size and the number of tasks. The test environment is a cluster made up of 10 machines (nodes). Each machine has an Intel Core i5-2500 CPU @ 3.30 GHz x 4 processor, 4 GB of RAM, and 20 GB of free HDD space. These machines run Ubuntu 12.10 and are interconnected by a switched 100 Mbps Ethernet LAN. The Apache Hadoop 1.2.0 framework is installed on all the machines. One of these 10 machines is configured to perform the role of NameNode in the HDFS system and JobTracker in the MapReduce system, while the other machines are configured as HDFS DataNodes and TaskTrackers. Overall, we can see in Fig. 9 that increasing the number of tasks improves the processing time. We further analyzed the results and discovered some interesting aspects; given the paper length constraint, we cannot present all the experiment results. Fig. 10 shows the time saving obtained by increasing the number of tasks. The time saving is calculated as the difference between the processing times corresponding to different numbers of tasks. We can see that the time saving for handling 2.2 × 10^9 tuples (Fig. 10 (a)) decreases when we configure more than 5 tasks. Also, for handling 2.44 × 10^9 tuples (Fig. 10 (b)), the time saving beyond 8 tasks becomes insignificant. To sum up our experiment, we note that the number of tasks is not the only parameter to be set in order to speed up the process; our cluster must also be extended in terms of nodes and memory.
References
[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.