Documentos de Académico
Documentos de Profesional
Documentos de Cultura
Design⋆
Sergio Luján-Mora and Juan Trujillo
1 Introdu
tion
In the early nineties, Inmon [1℄
oined the term data warehouse (DW): A data
warehouse is a subje
t-oriented, integrated, time-variant, nonvolatile
olle
tion
of data in support of management's de
isions . A DW is integrated be
ause
data are gathered into the DW from a variety of sour
es (lega
y systems, re-
lational databases, COBOL les, et
.) and merged into a
oherent whole. ETL
(Extra
tion-Transformation-Loading) pro
esses are responsible for the extra
-
tion of data from heterogeneous operational data sour
es, their transformation
(
onversion,
leaning, normalization, et
.) and their loading into DWs.
⋆
This paper has been supported by a grant from the State Se
retary of Edu
ation
and Universities (Spanish Ministery of Edu
ation, Culture and Sport).
1-2 Sergio Luján-Mora and Juan Trujillo
2 Related work
In this se
tion, we present a brief dis
ussion about some of the most well-known
DW design methods. Some other methods that we
annot dis
uss due to the
la
k of spa
e are [11,12,13℄.
1
We use the term method instead of methodology be
ause a methodology is the
study of methods.
2
We distinguish between nal users (end users or business information
onsumers)
and DW developers (DW designers, administrators, programmers, or, in general,
information systems professionals).
A Comprehensive Method for Data Warehouse Design 1-3
In [14℄, dierent
ase studies of data marts (DM) are presented. The MD
modeling is based in the use of the star s
hema and its dierent variations
(snowake and fa
t
onstellation). Moreover, the BUS matrix ar
hite
ture is
proposed to build a
orporate DW from integrating the design of several DMs.
Although we
onsider this work as a fundamental referen
e in the MD eld (R.
Kimball provides a very sound dis
ussion of star s
hema design), we miss a formal
method for the design of DWs. Furthermore, the
on
eptual and logi
al models
oin
ide in this proposal, and the
on
epts on the BUS matrix ar
hite
ture are
a
ompilation of the personal experien
es of the authors and the problems they
have fa
ed building enterprise DWs from DMs.
In [3℄, the authors propose the Dimensional-Fa
t Model (DFM), a parti
ular
notation for the DW
on
eptual design. Moreover, they also propose how to
derive a DW s
hema from the data sour
es des
ribed by Entity-Relationship
(ER) s
hemas. From our point of view, this proposal is only oriented to the
on
eptual and logi
al design of DWs, be
ause it does not
onsider important
aspe
ts su
h as the design of ETL pro
esses. Furthermore, the authors assume
a relational implementation of the DWs and the existen
e of all the ER s
hemas
of the data sour
es, whi
h unfortunately o
urs in few o
asions. Finally, we
onsider that the use of a parti
ular notation makes di
ult the appli
ation of
this proposal.
In [2℄, the authors present the Multidimensional Model (MD), a logi
al model
for OLAP systems, and show how it
an be used in the design of MD databases.
The authors also propose a general design method, aimed at building an MD
s
hema starting from an operational database des
ribed by an ER s
hema. Al-
though the design steps are des
ribed in a logi
and
oherent way, the DW design
is only based on the operational data sour
es, what we
onsider insu
ient be-
ause the nal users' requirements are very important in the DW design.
In [15℄, the building of star s
hemas (and its dierent variations) from the
on
eptual s
hemas of the operational data sour
es is proposed. On
e again, it
is highly supposed that the data sour
es are dened by means of ER s
hemas.
This approa
h diers from the above-presented ones in that it does not propose
a parti
ular graphi
al notation for the
on
eptual design of the DWs, instead it
uses the ER graphi
al notation.
Most re
ently, in [16℄ another method for the DW design is proposed. This
method is based on a MD model
alled IDEA and it proposes a set of steps to
address the
on
eptual, logi
al, and physi
al design of a DW. One of the most
important advantages, with respe
t to the previous proposals, is that the
on-
eptual s
hema of the DW is built taking into
onsideration both the operational
data sour
es and the nal users' requirements. Nevertheless, this method only
onsiders the data modeling and does not address other relevant aspe
ts, su
h
as the ETL pro
esses.
Regarding the UML, some proposals to extend the UML for database de-
sign have been presented [18,19,20℄, sin
e the UML does not expli
itly in
lude
a data model. However, these proposals do not ree
t the pe
uliarities of MD
modeling. There have been some approa
hes that uses UML for the MD mod-
eling up to date (again, and due to spa
e
onstrains, we refer the reader to [9℄
for a
omprehensive
omparison). However, none of these approa
hes propose a
omprehensive method to
over all main DW design phases.
Therefore, and based on the previous
onsiderations, we believe that
urrently
there is not a standard formal method that
omprises the main steps of a DW
design.
Operational Data S
hema (ODS): it denes the stru
ture of the operational
and external data sour
es.
Data Warehouse Con
eptual S
hema (DWCS): it denes the
on
eptual s
hema
(fa
ts, dimensiones, hierar
hies, et
.) of the DW.
Data Warehouse Storage S
hema (DWSS): it denes the physi
al storage of
the DW depending on the target platform (relational, MD, OO, et
.).
Business Model (BM): it denes the dierent ways or views of a
essing the
DW from the nal users' point of view. It is
omposed of dierent subsets
of the DWCS.
From our point of view, two s
hema mappings are also needed in order to
obtain a global and integrated DW design approa
h that
overs the ne
essary
s
hemas:
ETL Pro
ess: it denes the mapping between the ODS and the DWCS.
A Comprehensive Method for Data Warehouse Design 1-5
Exportation Pro
ess: it denes the mapping between the DWCS and the
DWSS.
Bidire
tional dependen
ies are not allowed at this level of the model as we
prevent a tight
oupling between pa
kages. The designer
an independently de-
ne ODS, DWCS, and DWSS if desired (for example, our proposal
an be used to
only do
ument the
on
eptual model of the DW by means of the DWCS).
In the following se
tions, we present with further detail ea
h one of the
s
hemas and mappings we use in our method.
sour
es (e.g.
ensus data, e
onomi
data) that DWs often in
orporate to support
analysis.
Nowadays, there does not exist an a
epted UML extension for modeling dif-
ferent types of data sour
es. Therefore, we have to use dierent UML extensions
to model the ODS a
ording to the sour
e. For example, if the data sour
e is
a relational database, we use Rational's UML Prole for Database Design [20℄,
a UML extension for the design of relational databases, but if the data sour
e
is an XML do
ument, we use [23℄, a UML extension for the modeling of XML
do
uments by means of DTDs and XML S
hemas.
On the other hand, the denition of the ODS
an require dierent a
tivities,
su
h as reverse-engineering of several data sour
es, the dis
overy of relationships
in the data sour
es, et
., be
ause data models are not
urrent, oer insu
ient
detail, or more likely, do not exist at all; however, these issues are beyond the
s
ope of this paper.
Table 1 shows the main stereotypes of our UML prole for the
on
eptual
design of DWs.
(Exportation Pro
ess) between the DWCS and the DWSS; in the latter, a spe
i
transformation algorithm for ea
h type of data storage generates the
orrespond-
ing DWSS.
In the
ase of a manual denition of the DWSS, the DW designer
an also
spe
ify some physi
al representations, su
h as summaries (also
alled aggregates
or materialized views) and derived data.
4.1 Analysis
Determine initial requirements: dierent users have dierent kinds of infor-
mational needs, and therefore, the s
ope of the DW (the business problems to
be solved) is established from interviews with nal users. Dierent substeps
an be a
hieved during requirements gathering; spe
i
ally, the designer has
to:
• Determine the desired data format users wish, the required detail level,
and the parti
ular data elements.
• Classify dierent summaries. Some summaries represent business results
that are quite signi
ant and are widely used (e.g., annual operating
results, quarterly sales gures by region, et
.), whereas others are less
ommon.
• Help the nal users to understand what they do not know they need
(anti
ipate data needs). The designer
an trigger ideas of new form of
analysis from nal users.
• Dene a
ess
ontrol and se
urity rules.
• Dis
over what are the needs of exploratory analysis (knowledge dis
overy,
data mining, and so on).
• Catalogue existing analyti
and reporting pro
esses.
• Seek potential new users of the DW.
3
A
tually, some of the steps are not stri
tly sequential, but in many
ases
an be
a
omplished in parallel.
A Comprehensive Method for Data Warehouse Design 1-11
Fig. 3. UML a tivity diagram that represents the main steps of our method
Dene business rules: from the information
olle
ted in the previous a
tivity,
the business rules that will be applied during the
onstru
tion of the DW
are dened (e.g., the denition of the derived measures su
h as net prot
or produ
t return rate). The
orre
t denition of the business rules avoids
onfusions and ambiguous
al
ulations, and ensures standard interpretation
of data.
Identify operational & external data sour
es (ODS): the nal users' require-
ments are taken into a
ount in order to spe
ify the operational and external
data sour
es. Dierent questions have to be answered: Do we have the data
we need?, Where is that data?, Can we a
ess data?, What is the
urrent
data quality?, et
. A
ording to [27℄, the designer has to s
an the available
data, dividing them into those that are:
1. Operational and transitory: they are not valuable for the DW.
2. Of analyti
and histori
al value: they are
andidates for the DW.
1-12 Sergio Luján-Mora and Juan Trujillo
4.2 Design
Design
on
eptual s
hema (DWCS): two extreme strategies (they are rep-
resented as dashed lines in Fig. 3)
an be adopted in this a
tivity: top-
down (the denition of the DWCS is based on the nal users' requirements) or
bottom-up (the denition of the DWCS is based on the available data sour
es).
We suggest to adopt a
ombined solution: the DW is designed from the nal
users' requirements, but bearing in mind the available data sour
es. More-
over, the designer has to understand the nal users' requirements and prop-
erly mat
h the DW to the intended usage. The designer has to divide data
into fa
ts (
omposed of measures) and dimensions (with the
orresponding
hierar
hies). Finally, the detail level of fa
ts (also
alled level of granularity
or grain of the fa
t) have to be dened; it is highly better to store data at
the most grained level be
ause less detail data
an always be obtained from
them.
Dene ETL pro
esses: the ETL pro
esses are dened as a mapping between
the data sour
es (ODS) and the DW (DWCS); the business rules are applied
in the
al
ulation of the derived attributes, the attribute transformations,
et
. This a
tivity and the previous one dene a
y
le, be
ause during the
denition of the ETL pro
esses some errors in the DWCS
an be dete
ted
(e.g., a dimension attribute does not exist in the ODS) and, therefore, the
DWCS may be modied. The design of an ETL pro
ess is usually
omposed
of seven steps: (1) Sele
t the sour
es for extra
tion, (2) Clean the sour
es,
(3)Transform the sour
es, (4) Join the sour
es, (5) Sele
t the target to load,
(6) Map sour
e attributes to target attributes, and (7) Load the data.
Dene DMs (BM): dierent models in the BM are dened from the nal users'
initial requirements and the DWCS; the BM
an be implemented as real or
virtual DMs.
Dene reports & queries (BM): the queries dened by the nal users are
dened by means of
ube
lasses.
4.3 Implementation
Dene storage (DWSS): the target platform is sele
ted (relational, MD, OO,
et
.) and the
orresponding logi
al s
hema (DWSS) is dened; the query per-
forman
e
an be improved by simplifying the data s
hema (so that it only
ontains the essential data) or by the denition of summaries (aggregates)
based on the nal users' requirements.
Dene exportation pro
esses: the mappings between the DWCS and the DWSS
are dened; these mappings
an be manually or automati
ally dened.
Implement reports & queries: the reports and queries requested by the nal
users are implemented in the query tool used (generally, an OLAP appli
a-
tion).
A Comprehensive Method for Data Warehouse Design 1-13
4.4 Test
Validate DW: the solution obtained (the DW built) is
he
ked against the
existing problem (nal users' requirements). If any dis
repan
y exists, some
orre
tive a
tions
an be taken and the pro
ess
an return to one of the
previous a
tivities.
Referen
es
1. Inmon, W.: Building the Data Warehouse. QED Press/John Wiley (1992) (Last
edition: 3rd edition, John Wiley & Sons, 2002).
2. Cabibbo, L., Torlone, R.: A Logi
al Approa
h to Multidimensional Databases.
In: Pro
. of the 6th Intl. Conf. on Extending Database Te
hnology (EDBT'98).
Volume 1377 of LNCS., Valen
ia, Spain (1998) 183197
3. Golfarelli, M., Rizzi, S.: A methodologi
al Framework for Data Warehouse De-
sign. In: Pro
. of the ACM 1st Intl. Workshop on Data Warehousing and OLAP
(DOLAP'98), Bethesda, USA (1998) 39
4. Sapia, C., Blas
hka, M., Höing, G., Dinter, B.: Extending the E/R Model for the
Multidimensional Paradigm. In: Pro
. of the 1st Intl. Workshop on Data Warehouse
and Data Mining (DWDM'98). Volume 1552 of LNCS., Singapore (1998) 105116
5. Tryfona, N., Busborg, F., Christiansen, J.: starER: A Con
eptual Model for Data
Warehouse Design. In: Pro
. of the ACM 2nd Intl. Workshop on Data Warehousing
and OLAP (DOLAP'99), Kansas City, USA (1999)
6. Husemann, B., Le
htenborger, J., Vossen, G.: Con
eptual Data Warehouse Design.
In: Pro
. of the 2nd Intl. Workshop on Design and Management of Data Warehouses
(DMDW'00), Sto
kholm, Sweden (2000) 39
7. Trujillo, J., Palomar, M., Gómez, J., Song, I.: Designing Data Warehouses with
OO Con
eptual Models. IEEE Computer, spe
ial issue on Data Warehouses 34
(2001) 6675
8. Abelló, A., Samos, J., Saltor, F.: YAM2 (Yet Another Multidimensional Model):
An Extension of UML. In: Intl. Database Engineering & Appli
ations Symposium
(IDEAS'02), Edmonton, Canada (2002) 172181
1-14 Sergio Luján-Mora and Juan Trujillo
9. Abelló, A., Samos, J., Saltor, F.: A Framework for the Classi
ation and De-
s
ription of Multidimensional Data Models. In: Pro
. of the 12th Intl. Conf. on
Database and Expert Systems Appli
ations (DEXA'01), Muni
h, Germany (2001)
668677
10. Obje
t Management Group (OMG): Unied Modeling Language Spe
i
ation 1.4.
Internet: http://www.omg.org/
gi-bin/do
?formal/01-09-67 (2001)
11. Gardner, S.: Building the Data Warehouse. Communi
ations of the ACM 41
(1998) 5260
12. Debevoise, N.: The Data Warehouse Method. Prenti
e-Hall, New Jersey, USA
(1999)
13. Corey, M., Abbey, M., Abramson, I., Taub, B.: Ora
le8i Data Warehousing. Ora
le
Press. Osborne/M
Graw-Hill (2001)
14. Kimball, R.: The Data Warehouse Toolkit. John Wiley & Sons (1996) (Last
edition: 2nd edition, John Wiley & Sons, 2002).
15. Moody, D., Kortink, M.: From Enterprise Models to Dimensional Models: A
Methodology for Data Warehouse and Data Mart Design. In: Pro
. of the 3rd
Intl. Workshop on Design and Management of Data Warehouses (DMDW'01), In-
terlaken, Switzerland (2001) 110
16. Cavero, J., Piattini, M., Mar
os, E.: MIDEA: A Multidimensional Data Warehouse
Methodology. In: Pro
. of the 3rd Intl. Conf. on Enterprise Information Systems
(ICEIS'01), Setubal, Portugal (2001) 138144
17. Carneiro, L., Brayner, A.: X-META: A Methodology for Data Warehouse Design
with Metadata Management. In: Pro
. of the 4th Intl. Workshop on Design and
Management of Data Warehouses (DMDW'02), Toronto, Canada (2002) 1322
18. Rational Software Corporation: The UML and Data Modeling. Internet:
http://www.rational.
om/media/whitepapers/Tp180.PDF (2000)
19. Mar
os, E., Vela, B., Cavero, J.: Extending UML for Obje
t-Relational Database
Design. In: Pro
. of the 4th Intl. Conf. on the Unied Modeling Language
(UML'01). Volume 2185 of LNCS., Toronto, Canada (2001) 225239
20. Naiburg, E., Maksim
huk, R.: UML for Database Design. Obje
t Te
hnology
Series. Addison-Wesley (2001)
21. Abelló, A., Samos, J., Saltor, F.: Benets of an Obje
t-Oriented Multidimen-
sional Data Model. In: Pro
. of the Symposium on Obje
ts and Databases in 14th
European Conf. on Obje
t-Oriented Programming (ECOOP'00). Volume 1944 of
LNCS., Sophia Antipolis and Cannes, Fran
e (2000) 141152
22. Jarke, M., Lenzerini, M., Vassiliou, Y., Vassiliadis, P.: Fundamentals of Data
Warehouses. 2 edn. Springer-Verlag (2003)
23. Rational Software Corporation: Migrating from XML DTD to XML-S
hema us-
ing UML. Internet: http://www.rational.
om/media/whitepapers/TP189draft.pdf
(2000)
24. Luján-Mora, S., Trujillo, J., Song, I.: Extending UML for Multidimensional Mod-
eling. In: Pro
. of the 5th Intl. Conf. on the Unied Modeling Language (UML'02).
Volume 2460 of LNCS., Dresden, Germany (2002) 290304
25. Luján-Mora, S., Trujillo, J., Song, I.: Multidimensional Modeling with UML Pa
k-
age Diagrams. In: Pro
. of the 21st Intl. Conf. on Con
eptual Modeling (ER'02).
Volume 2503 of LNCS., Tampere, Finland (2002) 199213
26. Trujillo, J., Luján-Mora, S.: A UML Based Approa
h for Modeling ETL Pro
esses
in Data Warehouses. In: Pro
. of the 22nd Intl. Conf. on Con
eptual Modeling
(ER'03). LNCS, Chi
ago, USA (2003) (To be presented).
27. Haisten, M.: Designing a Data Warehouse. InfoDB 9 (1995) 29