Está en la página 1de 13

Data Warehouse Book.............................................................................................................................

2
The Key Acronyms And Definitions ......................................................................................................... 2
What's the need for a Data Warehouse? ................................................................................................ 3
Difference Between DWH and OLTP ....................................................................................................... 4
Data Acquisition ...................................................................................................................................... 4
Data Loading ............................................................................................................................................ 4
Good Link................................................................................................................................................. 5
Difference between Data mart and DWH ............................................................................................... 5
Top-down and Bottom-up Approach ...................................................................................................... 6
ETL Application Development Example................................................................................................... 6
Staging ..................................................................................................................................................... 7
Informatica (9.5) ...................................................................................................................................... 7
Informatica Corporation Products .......................................................................................................... 8
PowerCenter Integration Service (PCIS) .................................................................................................. 8
Transformations ...................................................................................................................................... 9
Mapplets................................................................................................................................................ 10
Tasks and types of Tasks........................................................................................................................ 10
Worklet and types of Worklets ............................................................................................................. 10
Dimensions ............................................................................................................................................ 11
Installation and Practice ........................................................................................................................ 12
What is the purpose of this article ?

A quick briefing to ETL, Business Intelligence and Informatica

With this article it is possible to understand what needs to be known to have a solid
understanding with ETL, BI and Informatica

Data Warehouse Book


The Data Warehouse Toolkit - The Complete Guide to Dimensional Modeling 2nd Edition
by Ralph Kimball and Margy Ross

The Key Acronyms And Definitions


OLTP stands for Online Transaction Processing
ODS stands for Operational Data Source
OLAP (Online Analytical Processing) :
This system is an application that collects, manages, processes and presents multidimensional data
for Analysis and management purposes.
Data Mining :
Data Mining is the process of analyzing data from different perspectives and summarizing into useful
information.
BI (Business Intelligence) :
BI is the leveraging of Data Warehouse to help and make business decisions and business
recommendations. Information and data rules engines are leveraged here to help make these
decisions along with statistical analysis tools and data mining tools.
From an information systems standpoint, BI provides users with online analytical processing or data
analysis capabilities to predict trends, evaluate business questions, and so on.
From an business analyst viewpoint, it is the process of gathering high-quality, meaningful
information about a subject, which enables the analyst to draw conclusions.
ISP in relation to Informatica stands for Informatica Service Platform
OLCP (Online Complex Processing) :
OLCP is typified by a moderate number of rows processed against multiple databases with a
response in minutes, but not seconds.
Examples:
1. Filling out an application for instant credit in a store is an OLCP Transaction.
2. Getting an instant approval on an interest rate for a loan.
DSS (Decision Support Systems) :
Decision Support Systems include batch Reports, which Roll-up numbers to give business the big
picture, and overtime, have evolved. DSS is an area where Teradata has traditionally excelled with its
parallel architecture.

Homogenous Joins :
Homogenous means same. A Join that is made on similar kind of database is known as Homogenous
Joins.
Heterogeneous Joins :
Heterogeneous means different. A Join that is made on different kinds of database is known as
Homogenous Joins.
Flat Files :
A File which is saved with an extension , some examples, of " .txt, .csv, .dat etc ". There are two kinds
of Flat Files (1) Delimited Flat Files (2) Fixed with Flat Files. Fixed with Flat Files can improve the
performance over Delimited Flat Files when extracting the data from Flat Files.
File List :
A File list is the list of all data files that can be grouped together. The files should contain similar meta
data definitions.
XML Source Qualifier Transformation :
It reads the data from XML files, which are saved with an extension of .xml

What's the need for a Data Warehouse?


1. All the past attempts by IT to provide strategic information have been failures. This was mainly
because IT has been trying to provided strategic information from operational systems. (Operational
System here refers to OLTP) Operational systems are not designed for strategic information.
2. Informational systems are different from the traditional operational systems.
3. We need a new type of computing environment to provide strategic information. The Data
Warehouse promises to be this new computing environment.
For example we can provide business the below :
1. Show the top-selling products
2. Tell me why this is going on (drill down)?
3. Let me see other data (drill across)
4. Show the highest margins
5. Alert when a region sells below the target
"A Data Warehouse is a subject oriented, integrated, nonvolatile and time variant collection of data
in support of management's decisions."
In the Data Warehouse, data is not stored by operational applications, but by business subjects.

Difference Between DWH and OLTP


Below is the difference between DWH (Data Warehouse) and OLTP (Online Transaction Processing)
They are both databases and hold data. But, they have been designed for different scopes. Running
the business (OLTP) and Managing the business (DWH)
OLTP (Online Transaction Processing)

DWH (Data Warehouse)

It is designed to support Business Transaction It is designed to support Analysis and decision


Processing
making
Maintaining current data

Maintaining historical data

Normalized data (No duplicate rows)

De-normalized data (duplicate rows)

Query response time is low

Query response time is high

Application oriented data

Subject oriented data

Detailed data

Summarized data

Limited period of Transaction history (current, 1 - Transaction history per years


3 Months)

Data Acquisition

It means an ETL
It is the process of extracting data from the multiple OLTP source systems
Transforming the data into a desired Business format
Loading the data into a destination system
There are two types of ETL to perform Data Acquisition, that are :

(1) Code Based ETL and (2) GUI Based ETL


Some GUI based ETL are :
Informatica, Data Stage, BODS (Business Object Data Services) and SSIS (SQL Server Integration
Service) etc.
I personally would like to focus on Informatica and SSIS. Later on, we will be introduced to Staging.
That means we are not going to do ETL directly. We perform Extract and do Staging. Staging is a
temporary data parking.

Data Loading
It is the process of inserting the data into a destination system. There are two types of data loading.
That are:
1. Initial Load or Full Load and
2. Incremental Load or Delta Load
1. Initial Load: It is the process of inserting the source data rows into an empty target tables. In the
Initial Load all the source data gets loaded into target.

2. Incremental Load:
It is the process of inserting only the new records after Initial Load.
This topic places a question in our mind "How do we do it? " and introduces us "Top-Down Approach
and Bottom-Up Approach".

Good Link
http://www.nagesh.com/publications/technology/173-inmon-vs-kimball-an-analysis.html
The DWH Gurus INMON and KIMBALL explained us about Top-Down Approach and Bottom-Up
Approach.
1. Top-down Approach (INMON)
According to the Inmon's approach we develop an Enterprise Data Warehouse first and then derive
department specific subject oriented Database called Data marts.
2. Bottom-up Approach (KIMBALL)
According to the Kimball's approach we develop department specific subject oriented Database
called Data marts first and then consolidate or conglomerate the Data marts into an Enterprise Data
Warehouse.

Difference between Data mart and DWH


What is the difference between Data mart and DWH?
That's the question pops up and we heard about it. So, it is worth full to see the difference in a
tabular form.
DWH
Designed to store Enterprise Information

Data mart
Designed to
information

store

department

specific

Designed for Top Management who take the Designed for Middle Level Management who take
decisions at Enterprise Level (CEO etc)
the decisions for department level
Recommended to build high range Database. For Recommended to build mid range Database. For
example Teradata Database
example Oracle Database

Top-down and Bottom-up Approach


Note:
Data mart development is an independent of Enterprise Data Warehouse development.
Top-down Approach:
Data mart
HR

Enterprise
DWH

Data mart
Finance
Bottom-up Approach:

Data mart
Enterprise
DWH

Finance

Data mart
HR

ETL Application Development Example


Let's say the source system (OLTP Database) has a table called "customer" and we load that table
into the target system (DWH Database)with a table called "dim_cust". The below shapes explains it
further.

OLTP Database

DWH Database

customer
+cust_id
+cust_first_name
+ gender

dim_cust
+cust_id
+cust_name
+ gender

metadata

metadata

OLTP Database
Customer
+cust_id
+cust_first_name
+ gender

Metadata of source[E]
customer

DWH Database

ODBC

Business Rule [T]


concat ()
decode() or any etc.
ETL Application

Metadata of Target [L]


dim_cust

ODBC

dim_cust
+cust_id
+cust_name
+ gender

Staging

A Staging is a "Temporary Database" and an intermediate layer between OLTP System and
DWH.
A Staging is a replica of OLTP System.
A Staging can be called as a data parking area where the data parks temporarily. It can be a
collection of data from one more OLTP Databases.

Informatica (9.5)

This took me to the main topics from Informatica. An Informatica PowerCenter is a Data
Integration platform which allows to access the data from multiple OLTP source systems.
Transforming the data and deliver the data throughout the Enterprise at any speed.
Founded by Mr. Gaurav Dhillon.
Current CEO is Mr. Sohaib Abbasi (It is very interesting and inspiring to read his professional
achievements. Do not miss it ! )

Informatica Corporation Products


1. Informatica PowerCenter
2. Informatica PowerMart
3. Informatica PowerExchange
4. Informatica Data Quality (IDQ)
5. Informatica PowerAnalyzer
6. Master Data Manager (MDM)
7. Informatica B2B Integration
8. Informatica Cloud
1. Informatica PowerCenter
The following PowerCenter components get installed with Informatica PowerCenter :
1. Informatica PowerCenter Clients
2. Informatica PowerCenter Repository
3. PowerCenter Repository Service (PCRS)
4. PowerCenter Integration Service (PCIS)
5. PowerCenter Domain
6. Informatica Administrator (Webclient)

1. Informatica PowerCenter Clients :


Using PowerCenter Clients we can develop an ETL Process. The steps involved are :
1 - Creation of Source definition
2 - Creation of Target definition
3 - Design a Mapping with / without Business rule
4 - Create Session for each Mapping
5 - Create Workflow (a Sequential or Concurrent)
6 - Start Workflow (Running a Workflow)
7 - View Status and Monitor

PowerCenter Integration Service (PCIS)


It is an ETL Engine or an ETL Server that performs Extraction, Transformation and Loading.
An Integration Service has following components:
1. Reader
2. DTM (Data Transformation Manager)
3. Writer
1. Reader : It extracts the data from source systems.
2. DTM (Data Transformation Manager) : It process the data according to the business rule that you
configure in the Mapping.
3. Writer : It loads the data into the target.

Transformations
A Transformation is a PowerCenter object which allows to create business rules for processing data.
There are two types of Transformations, they are:
1) Active Transformation (New feature from Informatica PowerCenter 9.1)
2) Passive Transformation
1) Active Transformation :
A Transformation which can affect the rows that is change the number of rows while processing the
data is known as an Active Transformation. Below is the list of Active Transformations used for
processing the data :
Note : t/r means Transformation
1. Source Qualifier t/r
2. Filter t/r
3. Router t/r
4. Sorter t/r
5. Aggregator t/r
6. Rank t/r

7. Joiner t/r
8. Union t/r
9. Update Strategy t/r
10. Transaction Control t/r
11. Normalizer t/r

2) Passive Transformation :
A Transformation which does not affect the rows or does not the change the number of rows while
processing the data is known as a Passive Transformation. Below is the list of Passive
Transformations used for processing the data :
Note : t/r means Transformation
1. Expression t/r
2. Lookup t/r (Very important Topic)
3. Sequence Generator t/r
4. Stored Procedure t/r

5. XML Source Qualifier t/r


6. Input t/r (Mapplet)
7. Output t/r (Mapplet)
8. External Procedure t/r

Port and types of Ports :


In the language or terminology of Transformation, a Port is nothing but a Column of a Table or
simply a file.
There are two types of Ports, they are :
(1) Input Port and (2) Output Port
(1) Input Port :
A Port that can receive the data is called an Input Port.
(2) Output Port :
A Port that can provide the data is called an Output Port.

Mapplets

A Mapplet is a reusable object created with the business rules using a set of transformations
A maplet is created using Mapplet designer tool in the designer client components.
There are types of Mapplets :

1. Active Mapplet :
A Mapplet which is created with at least one Active Transformation. Such Mapplet is known as Active
Mapplet.
2. Passive Mapplet :
A Mapplet which is created with only Passive Transformation is known as Passive Mapplet.

Tasks and types of Tasks


A Task is a set of executable actions, commands and functions.
There are two types of Tasks, that are :
1. Reusable Task
2. Non-Reusable Task
1. Reusable Task :

A task that can integrated to multiple Workflows is knows as Reusable task.


Reusable tasks are created using "Task Developer" tool and "Worklet Designer" tool

2. Non- Reusable Task :

A Task which is created specific to a Workflow is knows as Non-Reusable Task.


A Non-Reusable Task is created using "Workflow Designer" tool.

Worklet and types of Worklets


A Worklet is defined as group of related tasks. There are two types of Worklets, that are :
1. Reusable Worklet
2. Non-Reusable Worklet
Reusable Worklet :

Created using "Worklet Designer tool"


Can be assigned to multple Workflows
A Worflow which contains the Worklet is known as Parent Workflow
A Worklet expands and executes the inner Workflow, if any child Workflows are there.

Non-Reusable Worklet :
A Worklet which is created specific to the Workflow is known as Non-Reusable Worklet
Created using "Workflow Desinger" tool and can be converted to Reusable Worklet.

Dimensions
Star Schema :
A Star Schema is a Database Design which contains a centrally located Fact Table which is radially
surrounded by multiple Dimension Tables.
Since the database design looks like a Star hence it is known as Star Schema. There two variants of
Star Schema, that are :
1. Simple Star Schema
2. Complex Star Schema
Simple Star Schema :
It contains only one Fact Table.
Complex Star Schema :
It contains more than one Fact Table
Fact Table :

A Fact Table contains the Facts


Facts are numeric but not every numeric is fact
Facts are Business measures which are used to evaluate the performance of enterprise.

Dimension table :

A Dimension table is a descriptive data and text which describes the Key Performance
indicators, knows as Facts.
The related dimensions are organized in dimension tables
A dimension table can organize the data in hierarchical format and dimension tables are in
de-normalized
A dimension can proved or answer the following business questions, "who, what , when,
where"

Snowflake Schema :

A very large de-normalized dimensional tables are split into normalized dimensional tables
A dimensional table may have parents
Advantage : query performance increases
Disadvantage : more joins

Star Schema and Snowflake Schema Differences :


Star Schema

Snowflake Schema

1. No dimension table have parents


2. De-normalized dimensions
3. Less Joins

1. The dimension table may have parents


2. Normalized dimensions
3. More Joins

Galaxy Schema :
It is also known as Hybrid Schema (or) Complex Star Schema (or) Multi Star Schema (or) Integrated
Schema (or) Constellation Schema
Fact Constellation :
It is the process of joining two Fact Tables using constraints (Primary and Foreign Keys)
Conformed Dimensions :
A dimension table which is shared or used by multiple Fact Tables is known as Conformed
Dimensions.
Slowly Changing Dimensions (SCD) :
A dimension that can capture the changes which takes place over the period of time is known as
Slowly Changing Dimensions.
There are three types of Slowly Changing Dimensions, that are :
1 .TYPE 1 Dimension : It captures only current data. It does not maintain history.
2. TYPE 2 Dimension : It captures complete historical data. For each update in OLTP it will insert a
new record in the target.
3. TYPE 3 Dimension : It captures partial historical information in the target that is current and
previous.

Installation and Practice


PowerCenter Express - ETL Tool
If your aim is to practice or learn ETL then PowerCenter Express is initially sufficient.
Link to Download :
https://community.informatica.com/solutions/pcexpress
Registration is required to download it and Personal Edition is the free one.
Click on "Resources" and open PowerCenter Express Installation and Upgrade Guide for the installation
instructions. Tip : If your installation is on Windows then focus the instructions related only to
Windows.
Informatica PowerCenter
If your aim is to install PowerCenter Server, Clients etc to get a real time experience then a lot of
installations have to made and much memory and disk space is needed -- BI Software, BI
Applications, DAC Server, Oracle Clients etc.
Well, It is not a bad idea to check the Software once !

Link to Download :
Registration of a normal Oracle Account is sufficient to download the software
https://edelivery.oracle.com/
Below is about 64 bit Installation. Navigate to the below and click "Go" and then "Continue"

Scroll down or use Ctl+F to see the below and to start downloading

Tip :

In my case I extracted everything to my external hard disk first


Create a directory, for example INFA, and the location or destination of all the extracting files
should be only this Folder, that is INFA
Extract one after the other
Use better WinRar to extract the files

Some useful Links :


http://www.disoln.org/2012/09/Informatica-PowerCenter-9-Installation-and-ConfigurationComplete-Guide.html
http://bloggingaboutoracle.com/installation-informatica-powercenter-9-0-1-on-oracleenterprise-linux-5-6/
https://community.oracle.com/thread/2474504?tstart=0

También podría gustarte