
Saturday, February 27, 2016

Difference between ETL and ELT

ETL is the most common method used when transferring data from a source system to a data warehouse.  But there are cases where you might want to use ELT. 

What is the difference between the two?  Read on:

Extract, Transform, and Load (ETL) is a process that involves extracting data from outside sources, transforming it to fit operational needs (sometimes using staging tables), and then loading it into the end target database or data warehouse.  This approach is reasonable when many different databases are involved in your data warehouse landscape.  In that scenario you have to transport data from one place to another anyway, so it is legitimate to do the transformation work in a separate, specialized engine.

Traditionally, ETL refers to the process of moving data from source systems into a data warehouse. The data is:
  • Extracted – copied from the source system to a staging area
  • Transformed – reformatted for the warehouse with business calculations applied
  • Loaded – copied from the staging area into the warehouse
Extract, Load, Transform (ELT) is a process where data is extracted, then loaded unchanged into staging tables inside the target database, transformed where it sits, and finally moved into the final tables of the target database or data warehouse.
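To make the ETL steps above concrete, here is a minimal sketch in Python. It is only an illustration: sqlite3 stands in for both the source system and the warehouse, and the table and column names (orders, sales_fact, amount_cents) are hypothetical.

    # Minimal ETL sketch: extract from the source, transform in the tool, load into the warehouse
    import sqlite3

    source = sqlite3.connect(":memory:")      # stand-in for the operational source system
    warehouse = sqlite3.connect(":memory:")   # stand-in for the data warehouse

    # Seed a tiny source table so the sketch runs end to end
    source.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount_cents INTEGER)")
    source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                       [(1, "2016-02-27", 1999), (2, "2016-02-27", 4550)])

    warehouse.execute("CREATE TABLE sales_fact (order_id INTEGER, order_date TEXT, amount_usd REAL)")

    # Extract: copy the raw rows out of the source system
    rows = source.execute("SELECT order_id, order_date, amount_cents FROM orders").fetchall()

    # Transform: the business calculation happens outside the warehouse, in the ETL step itself
    transformed = [(oid, odate, cents / 100.0) for (oid, odate, cents) in rows]

    # Load: only the finished, reformatted rows reach the warehouse
    warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", transformed)
    warehouse.commit()

Note that the warehouse only ever sees rows that have already been reshaped; all of the work happens in the middle step.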

ELT should be used instead of ETL in cases such as these:
  • There are large volumes of data
  • The source database and the target database are the same
  • The database engine is well adapted to that kind of processing, such as PDW, which is great at loading massive amounts of data very quickly
  • Big data projects
ELT has the benefit of minimizing the processing on the source, since no transformation is done there.  This can be extremely important if the source is a production system, where you could be degrading the user experience, as opposed to a copy of the source (via replication, a database snapshot, etc.).  The downside of this approach is that it may take longer to get the data into the data warehouse, because the staging tables add an extra step to the process, and you will need more disk space to hold them.

So in short, when using ETL the transformations are processed by the ETL tools, while in ELT the transformations are processed by the target data source (RDBMS).

ELT is a different way of looking at the tool approach to data movement. Instead of transforming the data before it’s written, ELT leverages the target system to do the transformation. The data is copied to the target and then transformed in place.
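Here is the same example reworked in the ELT style, again only as a sketch: sqlite3 stands in for the target engine, the raw extract is loaded into a hypothetical staging table (stg_orders), and the transformation runs as SQL inside the target itself.

    # Minimal ELT sketch: load raw rows into staging, then transform in place with SQL
    import sqlite3

    warehouse = sqlite3.connect(":memory:")
    warehouse.executescript("""
        CREATE TABLE stg_orders (order_id INTEGER, order_date TEXT, amount_cents INTEGER);
        CREATE TABLE sales_fact (order_id INTEGER, order_date TEXT, amount_usd REAL);
    """)

    # Extract + Load: the raw rows land in the staging table with no reshaping at all
    raw_rows = [(1, "2016-02-27", 1999), (2, "2016-02-27", 4550)]   # stand-in for the extract
    warehouse.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_rows)

    # Transform: the target database does the work, set-based and in place
    warehouse.execute("""
        INSERT INTO sales_fact (order_id, order_date, amount_usd)
        SELECT order_id, order_date, amount_cents / 100.0 FROM stg_orders
    """)
    warehouse.commit()

The tool's only job here is to move bytes; the heavy lifting is a query the target engine runs against its own staging table.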
ELT makes sense when the target is a high-end data engine, such as a data appliance, a Hadoop cluster, or a cloud installation, to name three examples.  If that power is already there, why not use it?

ETL, on the other hand, is designed using a pipeline approach. While data is flowing from the source to the target, a transformation engine (something unique to the tool) takes care of any data changes.
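A tiny sketch of the pipeline idea, under the assumption that the tool exposes something like the hypothetical extract_rows() and load_row() connectors below: rows are transformed "in flight" by the tool while they stream from source to target, with no staging table in between.

    # Rows flow source -> transform -> target; the generator is the "transformation engine"
    def extract_rows():
        yield {"order_id": 1, "amount_cents": 1999}    # stand-in for the source connector
        yield {"order_id": 2, "amount_cents": 4550}

    def transform(rows):
        for row in rows:                               # data is changed while it flows
            yield {"order_id": row["order_id"],
                   "amount_usd": row["amount_cents"] / 100.0}

    def load_row(row):
        print("loading", row)                          # stand-in for the warehouse write

    for row in transform(extract_rows()):
        load_row(row)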

Which is better depends on priorities. All things being equal, it’s better to have fewer moving parts. ELT has no transformation engine – the work is done by the target system, which is already there and probably being used for other development work.

On the other hand, the ETL approach can provide drastically better performance in certain scenarios. The training and development costs of ETL need to be weighed against the need for better performance. (Additionally, if you don’t have a target system powerful enough for ELT, ETL may be more economical.)



Why are big data projects different from traditional data warehouse projects?
Big data projects are mostly characterized as one or a combination of these 4 (or 5) data requirements:
·   Volume: the volume of (raw) data
·   Variety: the variety (e.g. structured, unstructured, semi-structured) of data
·   Velocity: the speed at which data is processed, consumed, or analyzed
·   Veracity: the level of trust in the data
·   (Value): the value behind the data

For big data, each of the “V”s is larger by orders of magnitude. For example, a traditional data warehouse usually holds several hundred gigabytes or a low number of terabytes, while big data projects typically handle data volumes of hundreds or even thousands of terabytes. Another example: traditional data warehouse systems only manage and process structured data, whereas typical big data projects need to manage and process both structured and unstructured data.

With this in mind, it is clear that traditional data warehousing technologies and methodologies may not be sufficient to handle these big data requirements.

Mastering the data and information supply chain using traditional ETL
The fundamental concept behind most ETL implementations is the restriction of the data in the supply chain. Only data that is presumably important is identified, extracted, and loaded into a staging area inside a database, and later into the data warehouse. “Presumably” is the weakness in this concept. Who really knows which data is required for which analytic insight and requirement, today or tomorrow? Who knows which legal or regulatory requirements must be followed in the months and years to come?

Each change in the definition and scope of the information and data supply chain requires a considerable amount of effort, time, and budget, and poses a risk to any production system. There must be a resolution for this dilemma – and here it comes.

A new “must follow” paradigm for big data: ELT
Just a little change in the sequence of two letters can mean everything to the success of your big data project: ELT (Extract, Load, and Transform). This change seems small, but the difference lies in the overall concept of data management.  Instead of restricting the data sources to only “presumably” important data (and all the steps this entails), what if we take all available data and put it into a flexible, powerful big data platform such as a Hadoop-based system?

Data storage in Hadoop is flexible, powerful, almost unlimited, and cost-efficient, since it can use commodity hardware and scale across many compute nodes and their local storage.

Hadoop is a schema-on-read system. It allows the storage of all kinds of data without knowing its format or definition in advance (e.g. JSON, images, movies, text files, spreadsheets, log files and many more). Without the limitation, discussed above, on how much data is extracted in the ETL methodology, we can be sure that we have all the data we need today and may need in the future. This also reduces the effort required to identify the “important” data up front.
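A small schema-on-read sketch, assuming a PySpark installation: the raw JSON files were landed as-is (the “EL” part), and structure is only applied here, at read time, when we transform. The HDFS paths and field names (event_type, product_id) are hypothetical.

    # Schema-on-read: no table definition was needed when the raw data was loaded
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("elt-schema-on-read").getOrCreate()

    # Spark applies structure to the raw JSON only now, as it reads it
    events = spark.read.json("hdfs:///raw/clickstream/2016/02/27/")

    # Transform on read: keep only the fields this analysis needs, aggregate in place
    daily_counts = (events
        .where(F.col("event_type") == "purchase")
        .groupBy("product_id")
        .count())

    daily_counts.write.mode("overwrite").parquet("hdfs:///curated/daily_purchase_counts/")

If a new question comes up next year, the full raw data is still sitting in HDFS, and only a new read-time transformation is needed.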

I hope you have all enjoyed reading this article. Comments are welcome.
