ETL is the most common method used when transferring data from a
source system to a data warehouse. But
there are cases where you might want to use ELT.
What is the difference between the
two? Read on:
Extract, Transform, and Load (ETL) is a process that involves extracting data from outside sources,
transforming it to fit operational needs (sometimes using staging tables), then
loading it into the end target database or data warehouse. This approach is reasonable when many different databases are involved in your data warehouse landscape: you have to transport the data from one place to another anyway, so it is legitimate to do the transformation work in a separate, specialized engine.
Traditionally, ETL refers to the process of moving data from source systems into a data warehouse. The data is:
- Extracted – copied from the source system to a staging area
- Transformed – reformatted for the warehouse, with business calculations applied
- Loaded – copied from the staging area into the warehouse
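To make the sequence concrete, here is a minimal ETL sketch in Python. The connection strings, table names, and the transformation rules are hypothetical placeholders; the point is only that the transformation runs in the ETL layer, before anything is written to the warehouse.

```python
# Minimal ETL sketch (hypothetical tables and connection strings).
# The transformation happens in the ETL engine, *before* loading.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql://user:pass@source-host/sales_db")    # assumed source
warehouse = create_engine("postgresql://user:pass@dwh-host/warehouse")   # assumed target

# Extract: copy raw rows out of the source system
orders = pd.read_sql("SELECT order_id, amount, currency, order_date FROM orders", source)

# Transform: apply business rules in the ETL layer (not in either database)
orders["currency"] = orders["currency"].str.upper()
orders["amount_eur"] = orders["amount"] * 0.92          # illustrative fixed rate
orders["order_month"] = pd.to_datetime(orders["order_date"]).dt.to_period("M").astype(str)

# Load: write the finished result into the warehouse
orders.to_sql("fact_orders", warehouse, if_exists="append", index=False)
```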
Extract, Load, Transform (ELT) is a process where data is extracted and loaded, as-is, into a staging table inside the target database or data warehouse; the transformation then happens where the data sits, in the database itself, before it is inserted into the final target tables.
ELT should be used instead of ETL in cases such as these:
- There are large volumes of data
- The source database and the target database are the same
- The database engine is well adapted to that kind of processing, such as PDW, which is great at loading massive amounts of data very quickly
- Big data projects
ELT has the benefit of minimizing the processing load on the source, since no transformation is done there. This can be extremely important if the source is a production system, where you could be impacting the user experience, as opposed to a copy of the source (via replication, a database snapshot, etc.). The downside of this approach is that it may take longer to get the data into the data warehouse: the staging tables add an extra step to the process, and you will need more disk space to hold them.
So, in short: when using ETL the transformations are processed by the ETL tool, while in ELT the transformations are processed by the target data store (typically an RDBMS), as in the sketch below.
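As a contrast to the ETL sketch above, here is a minimal ELT sketch, again with hypothetical table names and connection strings: the raw rows are loaded untouched into a staging table, and the transformation is expressed as SQL that the target database executes itself.

```python
# Minimal ELT sketch (hypothetical tables and connection strings).
# Raw data is loaded first; the transformation runs inside the target database.
import pandas as pd
from sqlalchemy import create_engine, text

source = create_engine("postgresql://user:pass@source-host/sales_db")    # assumed source
warehouse = create_engine("postgresql://user:pass@dwh-host/warehouse")   # assumed target

# Extract + Load: copy the raw rows as-is into a staging table in the warehouse
raw = pd.read_sql("SELECT order_id, amount, currency, order_date FROM orders", source)
raw.to_sql("stg_orders", warehouse, if_exists="replace", index=False)

# Transform: the warehouse engine does the work, in place
with warehouse.begin() as conn:
    conn.execute(text("""
        INSERT INTO fact_orders (order_id, amount, currency, order_month)
        SELECT order_id,
               amount,
               upper(currency),
               to_char(order_date, 'YYYY-MM')
        FROM stg_orders
    """))
```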
ELT is a different way of looking at the tool approach to data
movement. Instead of transforming the data before it’s written, ELT leverages
the target system to do the transformation. The data is copied to the target
and then transformed in place.
ELT makes sense when the target is a high-end data engine,
such as a data appliance, Hadoop cluster, or cloud installation to name
three examples. If this power is there, why not use it?
ETL, on the other hand, is designed using
a pipeline approach. While data is flowing from the source to the target, a
transformation engine (something unique to the tool) takes care of any data
changes.
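One way to picture this pipeline approach is as a chain of generators. The following is a simplified, illustrative sketch in plain Python (the rows, the cleaning rule, and the "load" step are all invented): records stream from the extractor, through the transformation step, and into the loader while the data is in flight, without ever being staged as a whole.

```python
# Sketch of ETL as a streaming pipeline: rows flow source -> transform -> target.
# Source rows, the cleaning rule, and the load step are purely illustrative.
from typing import Dict, Iterator

def extract() -> Iterator[Dict]:
    # Stand-in for reading from a source system, one row at a time
    yield {"id": 1, "name": " Alice ", "amount": "10.50"}
    yield {"id": 2, "name": "Bob", "amount": "7.00"}

def transform(rows: Iterator[Dict]) -> Iterator[Dict]:
    # The "transformation engine": changes happen while data is in flight
    for row in rows:
        yield {"id": row["id"],
               "name": row["name"].strip().upper(),
               "amount": float(row["amount"])}

def load(rows: Iterator[Dict]) -> None:
    # Stand-in for writing into the target warehouse
    for row in rows:
        print("loading", row)

load(transform(extract()))
```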
Which is better depends on priorities.
All things being equal, it’s better to have fewer moving parts. ELT has no
transformation engine – the work is done by the target system, which is already
there and probably being used for other development work.
Why are big data projects different from traditional data warehouse projects?
Big data projects are mostly characterized by one, or a combination, of these 4 (or 5) data requirements:
- Volume: the volume of (raw) data
- Variety: the variety of data (e.g. structured, unstructured, semi-structured)
- Velocity: the speed of data processing, consumption, or analytics of the data
- Veracity: the level of trust in the data
- (Value): the value behind the data
For big data, each of these “V”s is larger by orders of magnitude. For example, a traditional data warehouse usually holds several hundred gigabytes or a low number of terabytes of data, while big data projects typically handle data volumes of hundreds or even thousands of terabytes. Another example: traditional data warehouse systems manage and process only structured data, whereas typical big data projects need to manage and process both structured and unstructured data.
With this in mind, it is obvious that traditional technologies and methodologies for data warehousing may not be sufficient to handle these big data requirements.
Mastering the data and information supply chain using traditional ETL
The fundamental concept behind most ETL implementations is the restriction of the data in the supply chain. Only data that is presumably important is identified, extracted, and loaded into a staging area inside a database, and later into the data warehouse. “Presumably” is the weakness of this concept. Who really knows which data is required for which analytic insight and requirement, today or tomorrow? Who knows which legal or regulatory requirements must be followed in the months and years to come?
Each change in the definition and scope of the information and
data supply chain requires a considerable amount of effort, time and budget and
is a risk for any production system. There must be a resolution for this
dilemma – and here it comes.
A new “must follow” paradigm for big data: ELT
Just a little change in the sequence of two letters will mean everything to the success of your big data project: ELT (Extract, Load, and Transform). The change seems small, but the difference lies in the overall concept of data management. Instead of restricting the data sources to only “presumably” important data (and all the steps this entails), what if we take all available data and put it into a flexible, powerful big data platform such as a Hadoop-based system?
Data storage in Hadoop is flexible, powerful, almost unlimited, and cost efficient, since it can use commodity hardware and scales across many compute nodes and their local storage.
Hadoop is a schema-on-read system. It allows the storage of all
kinds of data without knowing its format or definition (e.g. JSON, images,
movies, text files, spreadsheets, log files, and many more). Without ETL’s previously discussed limitation on the amount of data that is extracted, we can be sure that we have all the data we need today and may need in the future. This also reduces the effort required to identify the “important” data up front.
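To illustrate schema-on-read, here is a small PySpark sketch; the HDFS paths and field names are hypothetical. The raw JSON files were landed as-is, with no upfront schema definition, and a structure is only imposed at read time, when the data is queried or transformed on the cluster.

```python
# Schema-on-read sketch with PySpark (hypothetical HDFS paths and field names).
# The files were stored untouched; structure is applied only when we read them.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-schema-on-read").getOrCreate()

# Read raw JSON straight from the data lake; Spark infers the schema at read time
events = spark.read.json("hdfs:///landing/clickstream/2024/*.json")

# Transform in place on the cluster (the "T" of ELT), then write a curated table
daily_clicks = (events
                .withColumn("day", F.to_date("event_time"))
                .groupBy("day", "page")
                .count())

daily_clicks.write.mode("overwrite").parquet("hdfs:///curated/daily_clicks")
```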