
EXPERIMENT NO. 04

AIM: Implementation of ETL for Business Intelligence.

THEORY:
➢ What is ETL?
ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the
source systems and placing it into a data warehouse. ETL involves the following tasks:
Extracting the data from source systems (SAP, ERP, and other operational systems): data from the different source
systems is converted into one consolidated data warehouse format that is ready for transformation processing.
Transforming the data may involve the following tasks:
• Applying business rules (so-called derivations, e.g., calculating new measures and dimensions),
• Cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F" etc.),
• Filtering (e.g., selecting only certain columns to load),
• Splitting a column into multiple columns and vice versa,
• Joining together data from multiple sources (e.g., lookup, merge),
• Transposing rows and columns,
• Applying any kind of simple or complex data validation.
Loading the data into a data warehouse, data repository, or other reporting applications. A minimal Python sketch of these extract, transform, and load steps is shown below.
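
As a simple illustration of these steps, the following Python sketch (an illustrative example only, using in-memory lists of dictionaries instead of a real source system and warehouse) applies the cleaning and derivation rules listed above:

# Minimal ETL sketch on in-memory data; a real pipeline would read from
# SAP/ERP/operational databases and write to warehouse tables.

def extract():
    # Pretend these rows were pulled from two different source systems.
    return [
        {"id": 1, "gender": "Male", "sales": None},
        {"id": 2, "gender": "Female", "sales": 150.0},
    ]

def transform(rows):
    cleaned = []
    for row in rows:
        cleaned.append({
            "id": row["id"],
            # Cleaning: map "Male"/"Female" to "M"/"F".
            "gender": {"Male": "M", "Female": "F"}.get(row["gender"], "U"),
            # Cleaning: map NULL (None) to 0.
            "sales": row["sales"] if row["sales"] is not None else 0,
            # Derivation: calculate a new measure from an existing one.
            "sales_with_tax": (row["sales"] or 0) * 1.18,
        })
    return cleaned

def load(rows, target):
    # Loading: append the consolidated rows to the target "table".
    target.extend(rows)

warehouse_table = []
load(transform(extract()), warehouse_table)
print(warehouse_table)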

➢ Real-life ETL cycle


The typical real-life ETL cycle consists of the following execution steps (a simple orchestration sketch follows the list):
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules; also helps to diagnose and repair in case of failure)
8. Publish (to target tables)
9. Archive
10. Clean up
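
For illustration only, the cycle can be sketched in Python as an ordered list of named steps; the step bodies here are placeholders, and a real implementation would call the actual extraction, validation, transformation, staging, publishing, and archiving routines:

# Hypothetical orchestration of the ETL cycle as an ordered list of named steps.

def run_cycle(steps):
    for name, step in steps:
        try:
            print(f"running: {name}")
            step()
        except Exception as exc:
            # Audit trail: record which step failed so it can be diagnosed,
            # repaired, and the cycle re-run from that point.
            print(f"step '{name}' failed: {exc}")
            raise

steps = [
    ("cycle initiation", lambda: None),
    ("build reference data", lambda: None),
    ("extract", lambda: None),
    ("validate", lambda: None),
    ("transform", lambda: None),
    ("stage", lambda: None),
    ("audit reports", lambda: None),
    ("publish", lambda: None),
    ("archive", lambda: None),
    ("clean up", lambda: None),
]

run_cycle(steps)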

➢ What is ETL Testing?

ETL testing is done to ensure that the data loaded from a source to the destination after business
transformation is accurate. It also involves verifying the data at the various intermediate stages between the
source and the destination. ETL stands for Extract-Transform-Load.
➢ ETL Testing Process

Similar to other testing processes, ETL testing also goes through different phases. The phases of the ETL testing
process are as follows:

ETL testing is performed in five stages (a small source-to-target reconciliation check is sketched after the list):
1. Identifying data sources and requirements
2. Data acquisition
3. Implementing business logic and dimensional modelling
4. Building and populating data
5. Building reports
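
One common ETL test is a source-to-target reconciliation check. The sketch below (illustrative Python with hard-coded sample rows; a real test would query the source system and the warehouse) compares row counts and a column total:

# Minimal source-to-target reconciliation check: compare row counts and a
# per-column total between the source extract and the loaded target.

source_rows = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
target_rows = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]

def count_check(source, target):
    return len(source) == len(target)

def sum_check(source, target, column):
    return sum(r[column] for r in source) == sum(r[column] for r in target)

assert count_check(source_rows, target_rows), "row counts differ"
assert sum_check(source_rows, target_rows, "amount"), "amount totals differ"
print("source-to-target checks passed")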

➢ ETL TOOLS

EPC chose to standardize on an ETL tool (Informatica) as the backbone of the data integration architecture.
While many practitioners relate ETL tools with batch data warehouse architectures, ETL tools increasingly
are used for operational systems integration efforts. The base transformation functionality in the leading ETL
tools has become mature; therefore, major changes in transformation functionality from release to release are
not common. This has allowed vendors to grow in other areas, such as SOA enablement and stronger
integration with data quality technologies.

These changes have made the ETL tools more attractive for broad use in handling the majority of data
integration tasks across the enterprise. EPC is utilizing the base Informatica ETL tool PowerCenter with the
real-time and Web services options, Informatica Data Quality, and PowerExchange for Oracle E-Business
Suite.

The ETL backbone envisioned by EPC for the first release of the ERP project satisfies the required architecture
functions. It was quickly evident that the functions in an operational environment are not significantly different
from those in a data warehouse environment.

Here are some examples of the functions (one of them, code translation with error flagging, is sketched in code after the list):

• Flag, track, and fix errors
• Cleanse name and address data
• Schedule jobs
• Read and write from heterogeneous sources and targets
• Translate source systems codes into standardized codes
• Handle change data capture
• Transform data to restructure it
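
As a rough illustration of two of these functions, translating source system codes and flagging errors, the following Python sketch uses a hypothetical lookup table; it is not tied to Informatica PowerCenter or any other tool:

# Generic code-translation lookup with error flagging.

CODE_MAP = {
    ("ORDERS_SYS", "SHIPPED"): "SHP",
    ("ORDERS_SYS", "CANCELLED"): "CAN",
    ("LEGACY_ERP", "S"): "SHP",
    ("LEGACY_ERP", "X"): "CAN",
}

def standardize(system, code, error_log):
    std = CODE_MAP.get((system, code))
    if std is None:
        # Flag and track the error instead of silently dropping the record.
        error_log.append({"system": system, "code": code, "issue": "unmapped code"})
    return std

errors = []
print(standardize("LEGACY_ERP", "S", errors))   # -> SHP
print(standardize("LEGACY_ERP", "Z", errors))   # -> None, logged to errors
print(errors)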

While the necessary functions are the same, the implementations are different from those in a data warehouse.
Here are a few examples of these differences:

Error records in a data warehouse are flagged and tracked but are seldom handled or fixed because they do not
have a significant impact: since analysis is done on aggregates, a single missing record is often statistically
inconsequential. In an operational system this is not true. For example, if an incoming order record fails
because a product cannot be found, there is a customer waiting for the product to be shipped, and the error must
be corrected in a timely fashion.
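
A minimal sketch of this difference, assuming a hypothetical order feed and product catalog: instead of merely flagging the bad record, the operational flow routes it to a correction queue so it can be fixed and reloaded promptly.

# Operational error handling: a failed order record is routed to a correction
# queue for timely repair, rather than only being flagged as in a warehouse load.

product_catalog = {"P-100": "Widget"}

def load_order(order, correction_queue):
    if order["product_id"] not in product_catalog:
        # A customer is waiting, so queue the record for correction and reload.
        correction_queue.append(order)
        return False
    return True

queue = []
load_order({"order_id": 7, "product_id": "P-999"}, queue)
print(queue)  # record awaiting correction and reload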

Addresses in a data warehouse often are cleansed in batch. At EPC, the operational system needs them
cleansed in real time. The SOA enablement of the ETL technology has allowed EPC to utilize the ETL
backbone for this function.
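
As a toy illustration of real-time cleansing (the abbreviation table and rules are assumptions; a production system would call a data quality service through the SOA-enabled ETL layer):

import re

# Normalize whitespace, casing, and common abbreviations as each record arrives.
ABBREVIATIONS = {"street": "St", "road": "Rd", "avenue": "Ave"}

def cleanse_address(raw):
    address = re.sub(r"\s+", " ", raw).strip().title()
    for word, abbr in ABBREVIATIONS.items():
        address = re.sub(rf"\b{word}\b", abbr, address, flags=re.IGNORECASE)
    return address

print(cleanse_address("  12  baker   STREET "))  # -> "12 Baker St"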

In data warehousing environments, the ETL tool or enterprise scheduling packages are used to execute large
batches with many jobs having extensive dependencies. These batches are often run overnight. For EPC, we
are using the ETL tool to integrate data from multiple applications using smaller batches that run frequently
throughout the day.
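
A simplified Python sketch of this style of frequent micro-batching, using a plain timed loop in place of an enterprise scheduler (the fetch function is a placeholder):

import time

def fetch_new_records():
    # Placeholder: a real job would query each source for rows changed since
    # the last run (change data capture or a timestamp watermark).
    return []

def run_micro_batches(interval_seconds, max_runs):
    # Process whatever has arrived, then wait a short interval, instead of
    # running one large overnight batch.
    for _ in range(max_runs):
        batch = fetch_new_records()
        print(f"processing {len(batch)} records")
        time.sleep(interval_seconds)

run_micro_batches(interval_seconds=1, max_runs=2)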

➢ Obstacles:

Managing Diverse and Fast-Changing Data

There are numerous challenges to implementing efficient and reliable ETL processes.
• Data volumes are growing exponentially.
With the rise of big data, ETL processes have to handle large amounts of structured and unstructured data,
such as call detail records, banking transactions, weblog files, social media files, etc. Some business
intelligence (BI) systems are only updated incrementally, whereas others require a complete reload at each
iteration; an incremental-load sketch follows.
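
The sketch below contrasts the two refresh styles, using an assumed updated_at timestamp column as the incremental watermark (illustrative Python, not a specific BI product):

from datetime import datetime

def incremental_load(source_rows, target, watermark):
    # Incremental (delta) load: only rows newer than the watermark are appended.
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    target.extend(new_rows)
    # Advance the watermark to the newest timestamp that was loaded.
    return max((r["updated_at"] for r in new_rows), default=watermark)

def full_reload(source_rows, target):
    # Complete reload: the target is truncated and repopulated from scratch.
    target.clear()
    target.extend(source_rows)

source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
]
target, watermark = [], datetime(2024, 1, 2)
watermark = incremental_load(source, target, watermark)
print(target, watermark)
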
• Data velocity is moving faster.
Processing is shifting from batch to real time. Information needs to be distributed to all connected systems to enable
real-time business insight and avoid multiple versions of the truth. As business intelligence analysis tends
toward real-time, data warehouses and data marts need to be refreshed more often and the load time windows
have shrunk. ETL processes need comprehensive connectivity to a wide range of systems, including packaged
applications (ERP, CRM, etc.), databases, mainframes, files, Web services, big data platforms and SaaS
applications.
• Transformations involved in ETL processes can be highly complex.
Data needs to be aggregated, parsed, computed, statistically processed, etc. BI-specific transformations are
also required, such as handling slowly changing dimensions (a minimal sketch follows). Primary keys are among
the most important attributes in relational databases, as they tie everything together.
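
As one example of such a BI-specific transformation, here is a minimal Type 2 slowly changing dimension (SCD2) sketch in Python, with an assumed customer dimension keyed on customer_id and tracking the city attribute:

from datetime import date

def apply_scd2(dimension, incoming, today=None):
    # When a tracked attribute changes, close the current row and insert a new one.
    today = today or date.today()
    for row in dimension:
        if row["customer_id"] == incoming["customer_id"] and row["is_current"]:
            if row["city"] == incoming["city"]:
                return  # no change, nothing to do
            row["is_current"] = False
            row["end_date"] = today
            break
    dimension.append({
        "customer_id": incoming["customer_id"],
        "city": incoming["city"],
        "start_date": today,
        "end_date": None,
        "is_current": True,
    })

dim = [{"customer_id": 42, "city": "Pune", "start_date": date(2023, 1, 1),
        "end_date": None, "is_current": True}]
apply_scd2(dim, {"customer_id": 42, "city": "Mumbai"})
print(dim)
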
➢ Solution
Talend's Big Data and Data Management solutions are optimized for enterprise-grade ETL, for big data and
small. The following features are especially critical to the design, development, execution and maintenance
of data integration and ETL processes:
• A highly scalable and fast open source execution platform: leverages a grid of commodity hardware, and is the only solution to support a dual ETL and ELT architecture.
• Broad data integration connectivity: supports all systems so that all production data can be accessed.
• Built-in advanced components: support for big data (Hadoop, NoSQL, big data platforms) and for ETL, including string manipulation, slowly changing dimensions, automatic lookup handling, bulk-load support, and data mapping tools that can handle complex data mappings.
• Business-oriented process modeling: involves business stakeholders and ensures proper communication between IT and the lines of business.
• Fully graphical development environment: greatly improves productivity and facilitates maintenance.
➢ MDD-Based Framework:
MDD is an approach to software development where extensive models are created before source code is
written. As shown in Fig.the MDD approach defines four main layers (see Meta-Object Facilities (MOF)20):
the Model Instance layer (M0), the Model layer (M1), the Meta-Modellayer (M2), and the Meta-Meta-Model
layer (M3). The Model Instance layer (M0) is a representation of thereal-world system where the ETL process
design and implementation are intended to perform. This may be represented, respectively, by a vendor-
independent graphical interface and by a vendor-specifc ETL engine. At the Model layer (M1), both the ETL
Process Model is designed and the ETL process code is derived by applying a set of transformations ions, thus
moving from the design to the implementation. The Meta-Model layer (M2) consists of the BPMN4ETL
metamodel that defines ETL patterns at the design phase, and a 4GL grammar at the implementation phase.
Finally, the MetaMeta-Model level (M3), corresponds to the MOF metametamodel at the design phase, while
it corresponds to the Backus Naur Form (BNF) at the implementation phase

CONCLUSION: Thus, ETL for Business Intelligence has been studied and implemented.
