
Big-ETL: Extracting-Transforming-Loading Approach for Big Data


M. Bala (1), O. Boussaid (2), and Z. Alimazighi (3)
(1) Department of Informatics, Saad Dahleb University, Blida 1, Blida, Algeria
(2) Department of Informatics and Statistics, University of Lyon 2, Lyon, France
(3) Department of Informatics, USTHB, Algiers, Algeria

Abstract— The ETL process (Extracting-Transforming-Loading) is responsible for (E)xtracting data from heterogeneous sources, (T)ransforming them, and finally (L)oading them into a data warehouse (DW). Nowadays, the Internet and Web 2.0 generate data at an increasing rate and therefore confront information systems (IS) with the challenge of big data. Data integration systems, and ETL in particular, should be revisited and adapted; the well-known solution is based on data distribution and parallel/distributed processing. Among all the dimensions defining the complexity of big data, we focus in this paper on its excessive "volume" in order to ensure good performance for ETL processes. In this context, we propose an original approach called Big-ETL (ETL Approach for Big Data) in which we define ETL functionalities that can be run easily on a cluster of computers with the MapReduce (MR) paradigm. Big-ETL thereby allows parallelizing/distributing the ETL at two levels: (i) the ETL process level (coarse granularity level), and (ii) the functionality level (fine granularity level); this further improves ETL performance.

Keywords: Data Warehousing, Extracting-Transforming-Loading, Parallel/distributed processing, Big Data, MapReduce.

1. Introduction

The widespread use of the Internet, Web 2.0, social networks, and digital sensors produces non-traditional data volumes. Indeed, MapReduce (MR) jobs run continuously on Google clusters and process over twenty petabytes of data per day [1]. This data explosion is an opportunity for the emergence of new business applications such as Big Data Analytics (BDA); but it is, at the same time, a problem given the limited capabilities of machines and traditional applications. These large data are now called "big data" and are characterized by the four "V"s [2]: Volume implies an amount of data going beyond the usual units, Velocity means the speed at which these data are generated and should be processed, Variety is defined as the diversity of formats and structures, and Veracity relates to data accuracy and reliability. Furthermore, new paradigms have emerged, such as Cloud Computing [3] and MapReduce (MR) [4]. In addition, novel data models have been proposed for very large data storage, such as NoSQL (Not Only SQL) [5]. This paper aims to provide solutions to the problems caused by big data in a decision-support environment. We are particularly interested in the integration of very large data in a data warehouse. We propose a parallel/distributed ETL approach, called Big-ETL (ETL Approach for Big Data), consisting of a set of MR-based ETL functionalities. The solution offered by the research community in this context is to distribute the ETL process over a cluster of computers: each ETL process instance handles a partition of the data source in parallel to improve ETL performance. However, this solution is defined only at the process level (coarse granularity level) and does not consider the ETL functionalities (fine granularity level), which allow a deeper understanding of the ETL complexity and, therefore, a significant improvement of the ETL process. To the best of our knowledge, Big-ETL is a different and original approach in the data integration field. We first define an ETL process at a very fine level by parallelizing/distributing its core functionalities according to the MR paradigm. Big-ETL thereby allows parallelization/distribution of the ETL at two levels: (i) the ETL functionality level, and (ii) the ETL process level; this further improves ETL performance in the face of big data. To validate our Big-ETL approach, we developed a prototype and conducted some experiments.

The rest of this paper is structured as follows. Section 2 presents a state of the art of the ETL field, followed by a classification of the ETL approaches proposed in the literature according to the parallelization criterion. Section 3 is devoted to our Big-ETL approach. We present in Section 4 our prototype implementation and the conducted experiments. We conclude and present our future work in Section 5.

2. Related work

One of the first contributions in the ETL field is [6]. It is a modeling approach based on a non-standard graphical formalism, with ARKTOS II as the implemented framework. It is the first contribution that allows modeling an ETL process with all its details at a very fine level, i.e. the attribute. In [7], the authors proposed a more holistic modeling approach based on UML (Unified Modeling Language), but with fewer details on the ETL process compared to [6]. The authors in [8] adopted the BPMN notation (Business Process Model and Notation), a standard notation dedicated to business process modeling. This work was followed by [9], a modeling framework based on a metamodel in an MDD (Model Driven Development) architecture. [7] and [8] are top-down approaches and therefore allow modeling
sub-processes in their collapsed/expanded form for better readability. The authors in [10] proposed a modeling approach which consists of a summary view of the ETL process and adopts the Reo model [11]. We consider that this contribution could be interesting, but it is not mature enough and requires a customization of the Reo model to support ETL specifics.

Following the emergence of big data, some works have tackled interesting issues. [12] is an approach which focuses on the performance of ETL processes dealing with large data and adopts the MapReduce paradigm. This approach is implemented in a prototype called ETLMR, which is a MapReduce version of the pygrametl prototype [13]. The ETLMR platform is demonstrated in [14]. [15] shows that ETL solutions based on MapReduce frameworks, such as Apache Hadoop, are very efficient and less costly compared to the ETL tools on the market. Recently, the authors in [16] proposed the CloudETL framework. CloudETL uses Apache Hadoop to parallelize ETL processes and Apache Hive to process data. Overall, the experiments in [16] show that CloudETL is faster than ETLMR and Hive for processing large data sets. [17] demonstrates the P-ETL platform. P-ETL (Parallel-ETL) is implemented under the Apache Hadoop framework and provides a simple GUI to set up an ETL process and the parallel/distributed environment. In the batch version, P-ETL runs thanks to an XML file (config.xml) in which the same parameters should be set. In the P-ETL approach, the mappers (Map step) are in charge of standardizing the data (cleansing, filtering, converting, ...) and the reducers (Reduce step) are dedicated to merging and aggregating them. To the best of our knowledge, no work has tackled the ETL modeling issue intended for the big data environment, and more precisely for parallel/distributed ETL processing. We focus in this paper on the parallelization/distribution issue to improve the performance of the ETL. The classification proposed in Tab. 1 is based on the parallelization criterion.

Table 1: Classification of ETL works

Approach     Purpose        Classification
[6]          Modeling       Centralized approach
[7]          Modeling       Centralized approach
[8]          Modeling       Centralized approach
[13]         Performance    Centralized approach
[12]         Performance    Distributed approach
[10]         Modeling       Centralized approach
[15]         Performance    Distributed approach
[16]         Performance    Distributed approach
[17]         Performance    Distributed approach
Big-ETL      Performance    Distributed approach

a) Centralized ETL process approach: In this paper, the ETL process approach is defined as centralized (or classical) when (i) the ETL process runs on an ETL server (one machine), (ii) in one instance (one execution at a time), and (iii) the data are of moderate size.

Fig. 1: Centralized ETL Process approach.

In this context, only the independent functionalities can be run in parallel (both the ETL functions and the machine on which they run should be multithreaded). An ETL functionality, such as Changing Data Capture (CDC), Surrogate Key (SK), Slowly Changing Dimension (SCD), or Surrogate Key Pipeline (SKP), is a basic function that supports a particular aspect of an ETL process. In Fig. 1, we can see that (F1 and F3) or (F2 and F3) can be run in parallel.

b) Distributed ETL process approach: The well-known solution to cope with big data is the "parallelization/distribution" of the data and the ETL process on a cluster of computers. The MR paradigm, for instance, allows splitting large data sets so that each partition is handled by an instance of the ETL process.

Fig. 2: Distributed ETL Process approach.

As depicted in Fig. 2, multiple ETL process instances run in parallel, where each one deals with its data partition in the Map step. The partial results produced by the mappers are merged/aggregated in the Reduce step and then loaded into the DW. All approaches proposed with the MR paradigm, [12] and [15] for instance, apply the distribution
only at the process level. Big-ETL applies MR at two levels: (i) the process level (coarse granularity level), and (ii) the functionality level (fine granularity level). We believe that the ETL, in the context of the technological change that has affected both data and processes, still presents some scientific problems, such as big data modeling considering its different characteristics (volume, variety, velocity, veracity, ...), data partitioning, and parallel processing in its various forms (process parallelization, process component parallelization, pipeline parallelization, ...). The functionalities, as the core ETL functions, deserve a more in-depth study to ensure, at a very fine level, the robustness, reliability and optimization of the ETL process. Our Big-ETL is a parallel/distributed ETL approach based on two distribution levels (process and functionality) and two distribution directions (vertical and horizontal).
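To make the process-level distribution of Fig. 2 concrete, the following minimal Python sketch (not part of the original paper; the toy data, the transform step and the multiprocessing backend are illustrative assumptions standing in for a MapReduce cluster) runs one complete ETL instance per data partition in the Map step and merges the partial results in the Reduce step before loading them into the DW.

# Minimal sketch of the "distributed ETL process" pattern of Fig. 2,
# using multiprocessing as a stand-in for a MapReduce cluster.
from multiprocessing import Pool
from collections import Counter

def etl_instance(partition):
    """One ETL process instance: extract/transform a single data partition (Map step)."""
    cleaned = (row.strip().lower() for row in partition if row.strip())   # transform
    return Counter(cleaned)                                               # partial aggregate

def merge(partial_results):
    """Reduce step: merge the partial results before loading them into the DW."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total

if __name__ == "__main__":
    source = ["A", "b", " a", "c", "B", "c", "c"]          # toy data source
    partitions = [source[i::3] for i in range(3)]          # split into 3 partitions
    with Pool(processes=3) as pool:                        # one ETL instance per partition
        partials = pool.map(etl_instance, partitions)
    print(merge(partials))                                 # Counter({'c': 3, 'a': 2, 'b': 2})

The same skeleton is reused below at the functionality level: the unit of work mapped over the partitions is then a single ETL functionality instead of the whole process.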
3. ETL Approach for Big Data

We present in this section our Big-ETL approach. We deployed it on many ETL functionalities such as Changing Data Capture (CDC), Data Quality Validation (DQV), Surrogate Key (SK), Slowly Changing Dimension (SCD), and Surrogate Key Pipeline (SKP). Among all these ETL functionalities, we chose to present, in this paper, CDC to illustrate our Big-ETL approach.

3.1 Big-ETL principle

Our Big-ETL process is a functionality-based approach which exploits the MR paradigm. For each of these functionalities, we apply the same principle adopted at the process level in the "distributed ETL process approach" as depicted in Fig. 2.

3.1.1 Key Concepts

a) ETL functionality: In order to control the complexity of the ETL process, we define it through a set of core functionalities. An ETL functionality is a basic function that supports a particular ETL aspect such as Changing Data Capture (CDC), Data Quality Validation (DQV), Surrogate Key (SK), Slowly Changing Dimension (SCD), Surrogate Key Pipeline (SKP), etc. An ETL task, however, is an instance of an ETL functionality. Let SK1 and SK2 be two ETL tasks that generate a surrogate key for inserting tuples into the PRODUCT and CUSTOMER dimensions respectively. SK1 and SK2 are two different tasks, but both are based on SK. Thus, SK is the ETL functionality and SK1 and SK2 are its instances. In what follows, we describe an ETL process in terms of its functionalities.

b) Elementary process/function: When an ETL functionality is not atomic (aggregate functionality) in terms of processing, we consider that it is in charge of several separate elementary processes, where each one is assigned to an elementary function. An elementary process is an atomic unit of processing which is synchronized with other elementary processes to ensure the ETL functionality. Each of the elementary processes is implemented as an elementary function. Thus, we consider that an aggregate functionality is a set of synchronized elementary functions. For example, the functionality CDC, which is responsible for identifying the changes (INSERT, UPDATE, DELETE) that have affected the data in a particular source, can be decomposed into three elementary functions, each one being in charge of identifying INSERTs, UPDATEs, and DELETEs respectively.

3.1.2 Vertical Distribution of Functionalities (VDF)

As shown in Fig. 3, the ETL process runs in one instance, while each of its functionalities runs in multiple instances. For example, the functionality F4 (oval), which runs in three instances (fragments separated by dashes), receives its input data from F2 and F3. These inputs are partitioned and each of the three partitions is subject to an instance of F4 (mapper). The partial results produced by the three mappers are merged by reducers to provide the final F4 outputs. This is a novelty in the parallel/distributed ETL approaches based on the MR paradigm, as the other approaches do not consider parallelization/distribution at the ETL functionality level.

Fig. 3: VDF Approach.

3.1.3 Vertical Distribution of Functionalities and Process (VDFP)

In case VDF presents low performance (particularly if the ETL process contains many sequential functionalities), the designer should set the ETL process to run in several instances. This is a hybrid approach that combines the principles of the "distributed ETL process" and VDF approaches, as shown in Fig. 4. It should be noted that the VDFP approach requires more resources (cluster nodes, HDD space, RAM, cache, LAN bandwidth, ...). When the ETL process runs in the VDF approach and reaches F4, it requires three tasks, as F4 runs in three instances. The same process executed in the VDFP approach will require thirty parallel tasks if it runs in ten instances, each with its three instances of F4.

Fig. 4: VDFP Approach.

3.1.4 Horizontal Distribution of Functionalities (HDF)

Some ETL functionalities operate several elementary processes on the source data. In this case, these functionalities are not atomic and can thereby be decomposed into elementary functions, where each one is in charge of a particular processing unit. Let F be a functionality in an ETL process which operates some elementary processing units T1, T2, ..., Tn on the source data.

Fig. 5: Elementary processes (a) and functionalities (b).

We can decompose F into elementary functions noted f1, f2, ..., fn, where each fi is in charge of Ti. FIG. 5 (a) shows an ETL functionality F which operates six elementary processes T1, T2, ..., T6. We note that T1, T2, T3, T4 can be run in parallel since no dependencies exist between them. In the same way, T5 and T6 can also be run in parallel. Thus, we can decompose F into six elementary functions noted f1, f2, ..., f6, which are in charge of T1, T2, ..., T6 respectively (FIG. 5 (b)). Unlike the VDF approach, which distributes the ETL functionality by instantiation, the HDF approach distributes the ETL functionality by fragmentation. In a distributed environment, the schema depicted in FIG. 5 (b) allows, in a first phase, distributing F into four fragments f1, f2, f3 and f4 which run in parallel. In the second phase, F is distributed into f5 and f6, which can be run in parallel and provide the final output of F.
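As an illustration of HDF, the short Python sketch below (an assumption-laden example, not code from the paper; the six elementary functions and the thread-pool backend are invented for the illustration) runs four independent fragments in a first phase and two consolidation fragments in a second phase, mirroring the two-phase structure of FIG. 5 (b).

# Sketch of Horizontal Distribution of a Functionality (HDF), FIG. 5 (b):
# phase 1 runs f1..f4 in parallel, phase 2 runs f5 and f6 in parallel.
from concurrent.futures import ThreadPoolExecutor

def f1(t): return ("id", t["id"])                      # illustrative elementary functions
def f2(t): return ("name", t["name"].upper())
def f3(t): return ("price", round(t["price"] * 1.2, 2))
def f4(t): return ("qty_ok", t["qty"] >= 0)
def f5(phase1): return dict(phase1)                    # consolidate the four fragments
def f6(phase1): return len(phase1)                     # audit/statistics fragment

def F(tuple_):
    """Aggregate functionality F decomposed into six elementary functions."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        phase1 = list(pool.map(lambda f: f(tuple_), (f1, f2, f3, f4)))   # first phase
    with ThreadPoolExecutor(max_workers=2) as pool:
        r5 = pool.submit(f5, phase1)                                     # second phase
        r6 = pool.submit(f6, phase1)
        return r5.result(), r6.result()

print(F({"id": 7, "name": "tv", "price": 100.0, "qty": 3}))
# ({'id': 7, 'name': 'TV', 'price': 120.0, 'qty_ok': True}, 4)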
3.1.5 Pipeline Processing Distribution (PPD)

Some ETL functionalities process the source data tuple-by-tuple in a sequential way called pipeline processing. Since all the tuples pass through the pipeline, we propose a synchronization schema in order to process a subset of tuples in parallel. The number of tuples present in the pipeline should be equal to the number of functionalities. Indeed, when a particular tuple is being processed by the last functionality in the pipe, its successors should also be processed, according to the order of the functionalities defined in the pipe. Let P be a pipe in which a set of functionalities F1, F2, ..., Fn is defined.

Fig. 6: Sequential (a) and parallel (b) pipeline.

When the tuples t1, t2, ..., tm should pass through a sequential pipe P, the tuple ti is moved into the pipe P only when the tuple ti-1 has been completely processed by all the functionalities F1, F2, ..., Fn (FIG. 6 (a)). Thus, only one tuple can be present in the pipe at a time. In order to further improve the performance of the ETL process, we propose to parallelize the pipe. In this way, several tuples can be processed simultaneously in the pipe, where each one is handled by a functionality according to the order defined in the pipe. Thus, when the tuple ti is being processed by Fn, the tuple ti+1 is processed at the same time by Fn-1, the tuple ti+2 is processed by Fn-2, and so on. In this way, the number of tuples being processed in the pipe is equal to the number of functionalities defined in the pipe (equal to n). Indeed, when a tuple is moved out of the pipe after complete processing, another tuple (the first in the queue) is moved in, and so on until the data partition is completely processed. FIG. 6 (b) depicts the pipe P in the parallel approach.
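One possible realization of this parallel pipe is sketched below (an illustrative Python arrangement under our own assumptions, not the paper's implementation): each functionality is a pipeline stage with its own worker thread and input queue, so that up to n tuples, one per stage, are in flight at the same time, as in FIG. 6 (b).

# Sketch of the parallel pipeline of FIG. 6 (b): one thread per functionality,
# so up to n tuples (n = number of functionalities) are in the pipe at once.
import queue, threading

def make_stage(func, q_in, q_out):
    def run():
        while True:
            t = q_in.get()
            if t is None:                 # end-of-stream marker
                q_out.put(None)
                break
            q_out.put(func(t))
    return threading.Thread(target=run)

functionalities = [lambda t: t.strip(), str.lower, lambda t: t + ";ok"]   # F1, F2, F3
queues = [queue.Queue() for _ in range(len(functionalities) + 1)]
threads = [make_stage(f, queues[i], queues[i + 1]) for i, f in enumerate(functionalities)]
for th in threads:
    th.start()

for t in [" Alpha ", " BETA", "Gamma "]:  # tuples t1..tm enter the pipe one by one
    queues[0].put(t)
queues[0].put(None)

while (out := queues[-1].get()) is not None:
    print(out)                            # alpha;ok  beta;ok  gamma;ok
for th in threads:
    th.join()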
3.2 Changing Data Capture (CDC) in the Big-ETL approach

Our Big-ETL approach is applied to many ETL functionalities such as CDC, SCD, SKP, etc. Given the paper space constraint, we illustrate Big-ETL with the CDC functionality. The ETL functionality CDC is considered the main functionality of the E step of ETL. It identifies the data affected by changes (INSERT, UPDATE, DELETE) in the source systems. The latter are then extracted and processed for the DW refresh [18]. The rest of the data (unaffected by changes) is rejected since it is already loaded in the DW. The most common technique used in this field is based on snapshots [18]. In the classical CDC algorithms, the changes between two corresponding tuples are detected by comparing them attribute-by-attribute. Furthermore, tuples
contain hundreds of attributes in data warehousing systems. In order to improve the CDC performance and lower its cost, we adapted the well-known hash function CRC (Cyclic Redundancy Check), which is widely used in the digital data transmission field [19] and in internet applications [20]. We adapted the CRC function to the CDC context as follows. Let tuple1 and tuple2 be two tuples stored in ST and STpv respectively. If tuple1 and tuple2 satisfy both equations (1) and (2), it means that they are similar. In this case, tuple1 will be rejected by the CDC process as no changes have occurred. However, if only equation (1) is satisfied, it means that tuple1 has been affected by changes and will be extracted by the CDC process as an UPDATE.

tuple1.KEY = tuple2.KEY    (1)

CRC(tuple1) = CRC(tuple2)    (2)
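The two tests can be sketched in a few lines of Python (an illustrative adaptation using zlib.crc32 as the hash function; the concrete CRC variant used in our implementation is not detailed here): a tuple is rejected when both the key and the checksum match, and reported as an UPDATE when only the key matches.

# Sketch of the CRC-based comparison of equations (1) and (2):
# equal keys + equal CRC  -> unchanged (rejected);
# equal keys + different CRC -> UPDATE.
import zlib

def crc(tuple_):
    """Checksum over all non-key attributes, instead of attribute-by-attribute comparison."""
    payload = "|".join(str(v) for k, v in sorted(tuple_.items()) if k != "KEY")
    return zlib.crc32(payload.encode("utf-8"))

def compare(tuple1, tuple2):
    if tuple1["KEY"] != tuple2["KEY"]:       # equation (1) not satisfied
        return "DIFFERENT KEYS"
    if crc(tuple1) == crc(tuple2):           # equations (1) and (2) satisfied
        return "REJECTED (no change)"
    return "UPDATE"                          # only equation (1) satisfied

t_st   = {"KEY": 42, "name": "Smith", "city": "Lyon"}
t_stpv = {"KEY": 42, "name": "Smith", "city": "Blida"}
print(compare(t_st, t_stpv))                 # UPDATE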
To propose a CDC schema in the Big-ETL environment, we consider that both the ST and STpv tables contain large data, that the CDC functionality will run on a cluster of computers, and that we adopt the MR paradigm. The classical CDC scheme will be supplemented by new aspects, namely (i) data partitioning, (ii) lookup tables, (iii) an insert and update data capture process, and (iv) a delete data capture process.

3.2.1 Data partitioning

To deal with large data, we adopt the rule of "divide and conquer". In the context of CDC, the system should first sort ST and STpv on the column KEY, and then split them to obtain usual volumes of data. The partitioning of ST allows processing the generated partitions in parallel. STpv is partitioned to avoid searching for ST tuples in a large volume of data.

3.2.2 Lookup tables

To avoid searching for a tuple in all ST and STpv partitions, we use lookup tables, denoted LookupST and LookupSTpv respectively. They identify the partition that will contain a given tuple. Here are some details on the use of the lookup tables:
• LookupST and LookupSTpv contain the min and max values of the keys (#KEY) for each ST and STpv partition respectively;
• For a tuple Ti in ST, the lookup consists of searching for the Pstpvk partition of STpv that satisfies expression (3) in LookupSTpv;
• For a tuple Tj in STpv, the lookup consists of searching for the Pstk partition of ST that satisfies expression (4) in LookupST.

LookupSTpv.KEYmin ≤ Ti.KEY ≤ LookupSTpv.KEYmax    (3)

LookupST.KEYmin ≤ Tj.KEY ≤ LookupST.KEYmax    (4)
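A lookup table of this kind can be kept as a sorted list of (KEYmin, KEYmax, partition id) entries, as in the hedged sketch below (the in-memory layout and the bisect-based search are our illustrative choices, not prescribed by the approach); finding the candidate partition for a key is then a binary search over expression (3) or (4).

# Sketch of a lookup table: (key_min, key_max, partition_id) per partition,
# searched with expression (3)/(4) via binary search on key_min.
import bisect

lookup_stpv = [(1, 1000, "Pstpv1"), (1001, 2000, "Pstpv2"), (2001, 3000, "Pstpv3")]
key_mins = [entry[0] for entry in lookup_stpv]     # sorted, since ST/STpv are sorted on KEY

def find_partition(lookup, key_mins, key):
    """Return the partition whose [key_min, key_max] range contains key, or None."""
    i = bisect.bisect_right(key_mins, key) - 1     # rightmost entry with key_min <= key
    if i >= 0 and lookup[i][0] <= key <= lookup[i][1]:
        return lookup[i][2]
    return None                                    # no partition: candidate INSERT/DELETE

print(find_partition(lookup_stpv, key_mins, 1500))  # Pstpv2
print(find_partition(lookup_stpv, key_mins, 4200))  # None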

3.2.3 INSERT-UPDATE data capture (IUDCP) and DELETE data capture (DDCP) processes

We propose two parallel processes in the new CDC scheme that support (i) INSERT and UPDATE data capture (IUDCP), and (ii) DELETE data capture (DDCP).

Fig. 7: IUDC process architecture.

Each process runs in multiple parallel instances. Each instance of IUDCP and DDCP handles an ST and an STpv partition respectively. Fig. 7 depicts IUDCP. Each partition Psti is assigned to a Mapi task which is responsible for checking the existence of each of its partition's tuples in STpv. To this end, the mapper looks up in LookupSTpv the partition Pstpvk that may contain the tuple. Once the Pstpvk partition is identified, three cases can arise: (1) the #KEY value does not exist in Pstpvk; this means an insertion (INSERT); (2) the #KEY value exists and identifies a similar copy of the tuple in Pstpvk; the tuple is rejected; (3) the #KEY value exists in Pstpvk with a change in at least one attribute between the two tuples; this is a modification (UPDATE).

Fig. 8: DDC process architecture.

Algorithm 1 IU_MAP(Pst)
Input: Pst, LookupSTpv, tuple1: ST record, tuple2: STpv record
Output: CHANGES
1: while not eof(Pst) do
2:   read(Pst, tuple1)
3:   Pstpv ← lookup(LookupSTpv, tuple1.KEY);
4:   if found() then
5:     tuple2 ← lookup(Pstpv, tuple1.KEY);
6:     if found() then
7:       if CRC(tuple1) ≠ CRC(tuple2) then
8:         extract tuple1 as UPDATE;
9:       end if
10:    else
11:      extract tuple1 as INSERT;
12:    end if
13:  else
14:    extract tuple1 as INSERT;
15:  end if
16: end while
17: return (CHANGES);

As shown in Fig. 8, DDCP operates on the same principle as IUDCP but in the opposite direction. In DDCP, we focus exclusively on the case where the tuple does not exist (DELETE). In order to specify how the mix of these multiple operations is processed, we propose a main CDC program called CDC_BigData. At this level, the ST and STpv tables are sorted and then partitioned, the LookupST and LookupSTpv tables are generated from ST and STpv respectively, and finally the parallel IUDCP and DDCP processes are invoked. Algorithm 1 is responsible for capturing insertions and updates in the ST table. A Psti partition will be processed by an instance of the iu_map() function. Line 3 looks up in LookupSTpv a Pstpv partition which may contain the tuple read in line 2. Lines 4-12 describe the case where the Pstpvk partition is located. Line 5 looks up the tuple in Pstpv. Lines 6-9 treat the case of a tuple affected by changes (UPDATE) by invoking the CRC hash function. Lines 10-12 treat the case where the tuple does not exist in the partition Pstpvk and is thereby captured as an insert. Lines 13-15 treat the case where the tuple does not match any partition in the lookup table LookupSTpv and is thereby also captured as an insert.
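For readers who prefer executable code to pseudocode, the following Python sketch reproduces the control flow of Algorithm 1 (a hedged re-expression, not the actual implementation: the partition and lookup structures are simplified in-memory dictionaries and lists, and zlib.crc32 stands in for the CRC function).

# Sketch mirroring Algorithm 1 (IU_MAP): scan one ST partition and emit
# INSERT/UPDATE changes by probing LookupSTpv and the matching STpv partition.
import zlib

def crc(t):
    return zlib.crc32("|".join(str(v) for k, v in sorted(t.items()) if k != "KEY").encode())

def iu_map(pst, lookup_stpv, stpv_partitions):
    changes = []
    for tuple1 in pst:                                            # lines 1-2: read each ST tuple
        pstpv_id = next((pid for lo, hi, pid in lookup_stpv
                         if lo <= tuple1["KEY"] <= hi), None)     # line 3: probe LookupSTpv
        if pstpv_id is not None:                                  # lines 4-12: partition located
            tuple2 = stpv_partitions[pstpv_id].get(tuple1["KEY"]) # line 5: probe the partition
            if tuple2 is not None:
                if crc(tuple1) != crc(tuple2):                    # lines 6-9: changed attributes
                    changes.append(("UPDATE", tuple1))
            else:
                changes.append(("INSERT", tuple1))                # lines 10-12: key absent
        else:
            changes.append(("INSERT", tuple1))                    # lines 13-15: no candidate partition
    return changes                                                # line 17

pst = [{"KEY": 5, "city": "Blida"}, {"KEY": 1200, "city": "Lyon"}, {"KEY": 9999, "city": "Algiers"}]
lookup_stpv = [(1, 1000, "P1"), (1001, 2000, "P2")]
stpv_partitions = {"P1": {5: {"KEY": 5, "city": "Oran"}}, "P2": {}}
print(iu_map(pst, lookup_stpv, stpv_partitions))
# [('UPDATE', {'KEY': 5, ...}), ('INSERT', {'KEY': 1200, ...}), ('INSERT', {'KEY': 9999, ...})]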
4. Implementation and experiment

We developed an ETL platform called P-ETL (Parallel-ETL) which provides: (i) data distribution, and (ii) parallel and distributed ETL processing. P-ETL is implemented in the Apache Hadoop environment and mainly uses two modules: (1) HDFS for distributed storage and high-throughput access to application data, and (2) MapReduce for parallel processing. We defined two levels for our experiments: (i) the ETL process level (coarse granularity level) and (ii) the ETL functionality level (fine granularity level). We present in this section the results for the first scenario. To evaluate our P-ETL platform, we proposed an example ETL process applied to students' data gathered at the Education Ministry. The data source contains the student identifier (St_id), the enrollment date (Enr_Date), the cycle (Bachelor, Master or Ph.D.), the specialty (medicine, biology, computer science, ...) and, finally, information about scholarship (whether the student receives a scholarship or not) and about sport (whether the student practices sport or not). We developed a program to generate CSV source data. In this experiment, we generated 7 samples of source data varying between 244 × 10^6 and 2.44 × 10^9 tuples, where each tuple has a size of 44 bytes. The ETL process configured to process the data is as follows. The first task is a projection which restricts the source tuples to an attribute subset by excluding Scholarship and Sport. The second task is a restriction which filters the tuples and rejects those having a Null value in Enr_Date, Cycle, or Specialty. The third task is GetDate(), which retrieves the year from Enr_Date. The last task is the aggregation function COUNT(), which computes the number of students grouped by enrollment year, Cycle, and Specialty.
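These four tasks can be expressed as one map function and one reduce function, as in the sketch below (an illustrative re-coding of the described process: the field layout of the generated CSV is an assumption, and plain Python is used here rather than the Hadoop API on which P-ETL actually runs).

# Sketch of the experimental ETL process: projection, restriction,
# GetDate() on Enr_Date, then COUNT() grouped by (year, Cycle, Specialty).
import csv, io
from collections import Counter

def map_student(row):
    """Projection (drop Scholarship/Sport), restriction (reject Nulls), GetDate()."""
    if not row["Enr_Date"] or not row["Cycle"] or not row["Specialty"]:
        return None                                   # restriction: reject Null values
    year = row["Enr_Date"][:4]                        # GetDate(): keep the year only
    return (year, row["Cycle"], row["Specialty"])     # projection: keep the grouping key only

def reduce_count(keys):
    """Aggregation: number of students per (year, Cycle, Specialty)."""
    return Counter(k for k in keys if k is not None)

sample = io.StringIO(
    "St_id,Enr_Date,Cycle,Specialty,Scholarship,Sport\n"
    "1,2013-09-15,Master,computer,yes,no\n"
    "2,2013-10-02,Master,computer,no,yes\n"
    "3,,Bachelor,biology,no,no\n"            # rejected: Null Enr_Date
)
rows = csv.DictReader(sample)
print(reduce_count(map(map_student, rows)))
# Counter({('2013', 'Master', 'computer'): 2})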

We considered the P-ETL scalability by varying the "data source size" and the "number of tasks". The test environment is a cluster made up of 10 machines (nodes). Each machine has an Intel Core i5-2500 CPU @ 3.30 GHz × 4 processor with 4 GB of RAM and 20 GB of free HDD space. These machines run Ubuntu 12.10 and are interconnected by a switched 100 Mbps Ethernet LAN. The Apache Hadoop 1.2.0 framework is installed on all the machines. One of these 10 machines is configured to perform the role of NameNode in the HDFS system and JobTracker in the MapReduce system, while the other machines are configured to be HDFS DataNodes and TaskTrackers. Overall, we can see in FIG. 9 that increasing the number of tasks improves the processing time. Indeed, we further analyzed the results and discovered some interesting aspects. Given the paper length constraint, we cannot present all the experiment results.

Fig. 9: Proc. time (min.) by scaling up data (tuples) and increasing tasks.

FIG. 10 shows the "time saving" obtained by increasing the number of tasks. The "time saving" is calculated as the difference between the "processing times" corresponding to different "numbers of tasks". We can see that the time saving to handle 2.2 × 10^9 tuples (FIG. 10 (a)) decreases when we configure more than 5 tasks. Also, to handle 2.44 × 10^9 tuples (FIG. 10 (b)), the time saving beyond 8 tasks becomes insignificant. To sum up our experiment, we note that the "number of tasks" is not the only parameter to be set in order to speed up the process. Our cluster must be extended in terms of nodes, memory space (RAM, cache), LAN bandwidth, etc. The cluster used for this experiment is a small-sized infrastructure. The HDD space is very low (20 GB per node). Thus, trying to increase the number of tasks, for example to more than eight, while keeping the same resources in terms of HDD, RAM, ..., will not enable Hadoop to improve the performance of the process if the HDD space or memory is already completely consumed by the eight tasks.

Fig. 10: Time saving (min.) by increasing tasks.

5. Conclusion

The ETL is the core component of a decision-support system since all the data dedicated to analysis pass through this process. It should be adapted, following the new approaches and paradigms, to cope with big data. In this context, we proposed a parallel/distributed approach for the ETL process where its functionalities run in parallel with the MR paradigm. In the near future, we plan to complete our experiments on a larger scale, both at the ETL process level and at the ETL functionality level. A complete benchmark in which we compare the four approaches (centralized ETL process approach, distributed ETL process approach, Big-ETL approach, and hybrid approach) is an interesting perspective.

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[2] S. Mohanty, M. Jagadeesh, and H. Srivatsa, Big Data Imperatives: Enterprise Big Data Warehouse, BI Implementations and Analytics. Apress, 2013.
[3] B. Sosinsky, Cloud Computing Bible. John Wiley & Sons, 2010, vol. 762.
[4] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
[5] J. Han, E. Haihong, G. Le, and J. Du, "Survey on NoSQL database," in 6th International Conference on Pervasive Computing and Applications (ICPCA), 2011. IEEE, 2011, pp. 363–366.
[6] P. Vassiliadis, A. Simitsis, and S. Skiadopoulos, "Conceptual modeling for ETL processes," in Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP. ACM, 2002, pp. 14–21.
[7] J. Trujillo and S. Luján-Mora, "A UML based approach for modeling ETL processes in data warehouses," in Conceptual Modeling - ER 2003. Springer, 2003, pp. 307–320.
[8] Z. El Akkaoui and E. Zimányi, "Defining ETL workflows using BPMN and BPEL," in Proceedings of the ACM Twelfth International Workshop on Data Warehousing and OLAP. ACM, 2009, pp. 41–48.
[9] Z. El Akkaoui, E. Zimányi, J.-N. Mazón, and J. Trujillo, "A model-driven framework for ETL process development," in Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP. ACM, 2011, pp. 45–52.
[10] B. Oliveira and O. Belo, "Using Reo on ETL conceptual modelling: a first approach," in Proceedings of the Sixteenth International Workshop on Data Warehousing and OLAP. ACM, 2013, pp. 55–60.
[11] F. Arbab, "Reo: a channel-based coordination model for component composition," Mathematical Structures in Computer Science, vol. 14, no. 3, pp. 329–366, 2004.
[12] X. Liu, C. Thomsen, and T. B. Pedersen, "ETLMR: a highly scalable dimensional ETL framework based on MapReduce," in Data Warehousing and Knowledge Discovery. Springer, 2011, pp. 96–111.
[13] C. Thomsen and T. B. Pedersen, "pygrametl: a powerful programming framework for extract-transform-load programmers," in Proceedings of the ACM Twelfth International Workshop on Data Warehousing and OLAP. ACM, 2009, pp. 49–56.
[14] X. Liu, C. Thomsen, and T. B. Pedersen, "MapReduce-based dimensional ETL made easy," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 1882–1885, 2012.
[15] S. Misra, S. K. Saha, and C. Mazumdar, "Performance comparison of Hadoop based tools with commercial ETL tools - a case study," in Big Data Analytics. Springer, 2013, pp. 176–184.
[16] X. Liu, C. Thomsen, and T. B. Pedersen, "CloudETL: scalable dimensional ETL for Hive," in Proceedings of the 18th International Database Engineering & Applications Symposium. ACM, 2014, pp. 195–206.
[17] M. Bala, O. Mokeddem, O. Boussaid, and Z. Alimazighi, "Une plateforme ETL parallèle et distribuée pour l'intégration de données massives," Revue des Nouvelles Technologies de l'Information, vol. Extraction et Gestion des Connaissances, RNTI-E-28, pp. 455–460, 2015.
[18] R. Kimball and J. Caserta, The Data Warehouse ETL Toolkit. John Wiley & Sons, 2004.
[19] D. V. Sarwate, "Computation of cyclic redundancy checks via table look-up," Communications of the ACM, vol. 31, no. 8, pp. 1008–1013, 1988.
[20] M. P. Freivald, A. C. Noble, and M. S. Richards, "Change-detection tool indicating degree and location of change of internet documents by comparison of cyclic-redundancy-check (CRC) signatures," US Patent 5,898,836, Apr. 27, 1999.
