Framework For Building ML Systems: Crisp-Dm
CRISP-DM
The six phases are briefly described in the sections below. Appendix A
shows a detailed view of the steps involved in each phase.
Business Understanding
Data Understanding
This phase starts with an initial data collection and then proceeds
with the goal of understanding the data [5]. To gain this understanding,
different activities are performed, such as identifying data quality
problems, discovering first insights into the data and detecting
interesting subsets.
Data preparation
Modeling
Evaluation
Before the model can be deployed, the conducted work needs to be
evaluated to make sure that the result meets the business requirements.
This is done in this phase. The steps that were executed to create
the result are reviewed and evaluated thoroughly, and at the end of
this phase a decision on the data mining result should have been reached.
Deployment
SEMMA
SEMMA consists of five steps, which are described in the sections below
and in Figure A. It is not mandatory to include all the steps in a
project.
Sample
The first step is called Sample. Here the data that will later be used
for modeling is sampled. The sample should be big enough to contain the
necessary information but small enough to be easy to process. This step
also includes partitioning the data to create training, validation and
test samples.
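The partitioning described above could look roughly like the following sketch. The function name `partition`, the 60/20/20 split and the fixed seed are assumptions made for illustration; they are not prescribed by SEMMA.

```python
import random


def partition(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of the records and split it into three samples.

    The fractions and the seed are illustrative assumptions.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test sample")
    shuffled = list(records)
    # A seeded generator makes the split repeatable.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, validation, test


train, validation, test = partition(range(100))
print(len(train), len(validation), len(test))
```

Every record ends up in exactly one of the three samples, so the test sample is never seen during training or validation.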
Explore
In this step, the data will be explored and searched for any interesting
patterns and relationships. This is done to gain an understanding of the
data and from that draw conclusions and get ideas. This can be done
with the use of visualization, but if the visualizations do not show any
clear trends, a statistical analysis can be used instead.
Modify
This step builds on the previous Explore step. In this step, the data
begins to be modified and prepared to be used in a specific model [20].
It may include additional segmentation of the sample and the creation
of new variables.
Model
Assess
Pre-KDD
Selection
The next task is to create a target dataset. This includes finding out
what data is available or needs to be obtained and integrating it into
one dataset. The work can be focused on a subset of variables or data
samples. This target dataset is where the knowledge discovery is to be
performed.
Pre-processing
In this stage, the data is cleaned and pre-processed [31]. Common tasks
in this stage include removing noise or accounting for it, collecting the
information necessary for modeling, deciding how to handle missing data
fields uniformly and accounting for time-sequence information and known
changes.
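As a minimal sketch of handling missing data fields uniformly, the example below replaces every missing numeric value with the mean of the observed values in the same field. Mean imputation is only one possible strategy, and the function name `impute_missing` and the field names are hypothetical.

```python
from statistics import mean


def impute_missing(rows, fields):
    """Return copies of the rows where None in the given numeric fields
    is replaced by the mean of the observed values of that field."""
    fills = {}
    for field in fields:
        observed = [row[field] for row in rows if row.get(field) is not None]
        if not observed:
            raise ValueError(f"no observed values for field {field!r}")
        fills[field] = mean(observed)
    cleaned = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for field in fields:
            if new_row.get(field) is None:
                new_row[field] = fills[field]
        cleaned.append(new_row)
    return cleaned


# Hypothetical records with gaps in both fields.
rows = [
    {"a": 1.0, "b": None},
    {"a": None, "b": 4.0},
    {"a": 3.0, "b": 2.0},
]
print(impute_missing(rows, ["a", "b"]))
```

Applying the same rule to every field is what makes the handling uniform. Whether the mean is a sensible fill value depends on the data.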
Transformation
The data is prepared for the Data Mining step. Here, useful features
that can represent the data are searched for [31]. Methods that help
with this are dimensionality reduction, such as feature selection and
extraction and record sampling, or transformation methods.
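One simple instance of feature selection is to drop features that barely vary, since they carry little information for representing the data. The sketch below assumes numeric features stored as dictionaries. The function name `select_features`, the threshold and the column names are illustrative assumptions, not part of KDD.

```python
from statistics import pvariance


def select_features(rows, min_variance=0.01):
    """Keep only the features whose population variance exceeds the threshold.

    Returns the kept feature names and the reduced rows.
    """
    names = list(rows[0])
    kept = [
        name for name in names
        if pvariance([row[name] for row in rows]) > min_variance
    ]
    reduced = [{name: row[name] for name in kept} for row in rows]
    return kept, reduced


# Hypothetical data: "const" never changes, so it is removed.
rows = [
    {"x": 1.0, "const": 5.0},
    {"x": 2.0, "const": 5.0},
    {"x": 3.0, "const": 5.0},
]
kept, reduced = select_features(rows)
print(kept, reduced)
```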
Machine Learning
This stage consists of three different steps, which are described
below:
1. In the first step, the data is prepared and a data mining method is
chosen. The selected method is based on the goal of the KDD process
defined in the first step. The data mining method can, for example, be
classification, regression or clustering.
2. The next step is to choose a specific data mining algorithm and select
methods to find patterns in the selected data. A model is chosen and
parameters are set to match a specific Machine Learning method and the
overall criteria of the KDD process.
Interpretation/Evaluation
In this step, the patterns that have been mined in the previous step are
interpreted and evaluated with respect to the goals that were determined
in the first step of the process. It could also be necessary to return to
one of the previous steps at this stage to do some changes.
Post-KDD
When the desired result has been obtained, the next step is to act on
the discovered knowledge. The knowledge can be used directly,
incorporated into another system for further action, or documented and
reported.
2.4 Scrum
2.5 Signal processing
A signal describes how some physical quantity varies over time and/or
space. A signal could, for example, be sound pressure, a radio or
television broadcast or a movie. Signal processing means manipulating a
signal to change its characteristics or extract information. It is
performed by a computer, special-purpose integrated circuits or analog
electrical circuits [32]. Technologies that use signal processing
include HD-TV, GPS and target tracking for surveillance [32]. Models
play a fundamental role, and their foundation is derived from prior
knowledge in physics and biology. They characterize the signal and
noise, describe distortion and relate the desired quantity to measured
data. To create models and assessments, mathematics such as calculus and
linear algebra is used together with probability and statistics. With
these tools, models can be developed that minimize the noise in a signal
and characterize the confidence and uncertainty [32].
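To make noise reduction concrete, the sketch below smooths a sampled signal with a moving average, one of the simplest ways to suppress rapid fluctuations. It is an illustrative assumption and not the method used in [32] or at Saab. The function name `moving_average` and the window size are hypothetical.

```python
def moving_average(samples, window=3):
    """Smooth a sequence of samples by averaging each run of `window` values.

    The output is shorter than the input by window - 1 samples.
    """
    if window < 1 or window > len(samples):
        raise ValueError("window must be between 1 and the number of samples")
    return [
        sum(samples[i:i + window]) / window
        for i in range(len(samples) - window + 1)
    ]


# Hypothetical noisy measurements around a slowly rising value.
noisy = [1.0, 1.4, 0.9, 1.6, 1.2, 1.9, 1.5]
print(moving_average(noisy, window=3))
```

A larger window removes more noise but also blurs fast changes in the signal.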
A radar warning system collects the radio waves that another radar
system sends out by collecting the pulses and sending them through a
signal processing chain. Knowledge about the object sending out the
radar signals can then be retrieved.
The signal chain can be defined as how the signal travels from the moment
the antenna captures it until the radar warning system can detect if
it is a threat. In Figure 2.5 the steps in the signal chain are shown.
For example, if we were looking for a specific card in a deck of cards,
the antenna would collect several signals. Digital processing would find
the signals that represent a deck of cards. Pulse processing would
look through the cards and sort them in order. Track processing would
identify which cards are hearts, spades, diamonds and clubs. In the same
way, we can sort out the signals and find out whether there is a threat.
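The card analogy can be written as a small pipeline in which each stage consumes the output of the previous one. This is only a sketch of the analogy and not of a real radar signal chain. The card encoding (rank followed by a suit letter, for example "10H") and all function names are assumptions made for illustration.

```python
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
SUITS = {"H": "hearts", "S": "spades", "D": "diamonds", "C": "clubs"}


def is_card(signal):
    """A signal counts as a card if it is a known rank plus a suit letter."""
    return isinstance(signal, str) and signal[:-1] in RANKS and signal[-1:] in SUITS


def digital_processing(raw_signals):
    """Keep only the signals that represent cards; drop everything else."""
    return [s for s in raw_signals if is_card(s)]


def pulse_processing(cards):
    """Sort the cards in rank order."""
    return sorted(cards, key=lambda card: RANKS.index(card[:-1]))


def track_processing(cards):
    """Group the sorted cards by suit."""
    tracks = {name: [] for name in SUITS.values()}
    for card in cards:
        tracks[SUITS[card[-1]]].append(card)
    return tracks


def signal_chain(raw_signals):
    """Run the stages in the order described in the text."""
    return track_processing(pulse_processing(digital_processing(raw_signals)))


# What the "antenna" collected: cards mixed with unrelated signals.
collected = ["KH", "noise", "2S", "10H", "??", "3H"]
print(signal_chain(collected))
```

Each stage reduces or structures the data a little more, which mirrors how the radar warning system narrows many captured signals down to a decision about a possible threat.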
Method
This chapter contains information about how this study was conducted,
which methods were used and how the results were achieved. First, the
general approach of our work is presented, followed by alternative
studies, the conducted literature study and the semi-structured
interviews. Lastly, the evaluation method is presented.
The article by A. Azevedo and M.F. Santos [1] was used as inspiration
for our comparison between the frameworks. Articles by S. Aishah et
al. [25] and Lukasz A. Kurgan and Petr Musilek [19] were used as
inspiration for our in-depth analysis of the frameworks.
Given our knowledge of the chosen field, we decided to hold
semi-structured interviews, which consisted of open questions that
allowed the interviewee to answer broadly and opened up new areas for us
to explore. If we had had a more profound knowledge of the field, we
could have created structured interviews following strict questions. In
our thesis, it was more beneficial to use semi-structured interviews to
learn about new areas, get a deeper understanding of the subject and
fulfill our purpose.
To understand the corporate culture and deepen our knowledge about the
area, we conducted semi-structured interviews where we prepared
questions based on our literature study. The interviews were held with
senior Saab employees and master's students writing their master's
theses within the area of machine learning at Saab. The senior employees
at Saab had relevant experience in software development, machine
learning, signal processing and radar warning systems. The interviews
deepened our knowledge and were of great value when exploring the chosen
frameworks.
The first criterion was created from the semi-structured interviews with
Saab. They explained their work, and from this we got an understanding
of how important their data management is. Therefore, we evaluated the
methods on their data management and on whether they can handle the
data in the way that Saab wishes for the specific case study of a radar
warning system.
The third criterion was created from the literature study and our previous
knowledge about working in teams. It is vital for everyone involved to
understand the purpose and process of the work. Therefore, we evaluated
the frameworks on how distinct the different steps are. This makes it
easier for everyone involved to understand the framework and work
towards a common goal.
• Can the framework manage data in the way that is required by the
specific case study suggested by Saab?
• Does the framework take into account the business perspective of
the problem?
• Is the framework distinct in how to use it?
Results
This chapter will present the result of the study. The different frame-
works are compared with each other. A more in-depth analysis of the
frameworks is then made based on different case studies about each
framework.
When the neural network is trained well enough to produce a good and
reliable result, Saab needs a way to safely record the working neural
network to be able to recreate it. This means that the whole training
sequence also needs to be recorded.
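One way to think about recording a training sequence is to store everything needed to repeat it: a fingerprint of the training data, the hyperparameters, the random seed and the ordered training steps. The sketch below is an assumption about what such a record could contain and does not describe Saab's actual procedure. All names and values are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Hash the training data so a later run can verify it uses the same data."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_training_record(rows, hyperparameters, seed, steps):
    """Collect what is needed to repeat a training run in one JSON-friendly dict."""
    return {
        "dataset_sha256": dataset_fingerprint(rows),
        "hyperparameters": dict(hyperparameters),
        "seed": seed,
        "training_steps": list(steps),
    }


# Hypothetical data and settings, for illustration only.
rows = [{"pulse_width": 1.2, "label": 0}, {"pulse_width": 3.4, "label": 1}]
record = make_training_record(
    rows,
    hyperparameters={"learning_rate": 0.01, "epochs": 20},
    seed=42,
    steps=[{"epoch": 1, "loss": 0.9}, {"epoch": 2, "loss": 0.7}],
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing a hash rather than the data itself keeps the record small while still detecting whether the data has changed between runs.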
From this, we understood that for the machine learning implementation
to work, it is essential that Saab has proper data management. They need
to find a suitable framework surrounding the machine learning algorithm
to be able to handle the data in an efficient way.
In the sections below we will present our findings from the literature
study and make a comparison between the three chosen frameworks and
analyze them with regard to their strengths and limitations.
Looking at the Data Mining phase in KDD, the data mining method is
chosen and applied to the final dataset. This is also what happens in
the Modeling phase of CRISP-DM, so these two phases can also be
translated into each other.
In the final steps of CRISP-DM, the result from the Modeling phase is
evaluated in the Evaluation phase, which corresponds to the
Interpretation/Evaluation phase in KDD. Lastly, the final model is
deployed in the Deployment phase of CRISP-DM, which corresponds to the
final stage of KDD. Table 4.1 displays the result of the comparison so
far.
SEMMA does not contain any stage where the goal of the project is
determined from a business perspective, or a phase where the whole work
in the project is evaluated. This is the most significant difference
between the three models. Apart from that, SEMMA consists of five phases
that focus on the data management part of a Machine Learning project.
The phases in SEMMA can be directly translated to the data handling
phases in KDD, and therefore also to the phases in CRISP-DM. See
Table 4.2 below for the final comparison of the models.
Table 4.2: Final comparison between CRISP-DM, KDD and SEMMA [1]
In Table 4.4 below the relevant strengths and limitations found in the
articles are presented.
due to that they use different techniques. One limitation we found with
CRISP-DM that is relevant to Saab is that the data preparation and
modeling phases for streaming data differ from those of traditional
static Machine Learning because of its time-series nature [23]. This is
a type of data that could be used in signal processing and in the
specific case study suggested by Saab. This kind of data may not be
covered in the CRISP-DM documentation, as it is made for a more general
approach to data mining [23].
When analyzing KDD, we could see that this framework also supports
different data mining techniques, for example neural networks. We found
this by studying articles with case studies that used KDD [4] [6]. One
limitation of KDD is that it has no website or manual with clear
instructions about how to use the framework [19] [25]. This makes it
harder to get a clear view of how to use the framework without prior
knowledge of data mining. SEMMA, however, has full documentation in the
SAS Enterprise Miner™ tool, where the SEMMA framework is implemented.
This could also be a limitation, as mentioned in the article by
Herman Jair Gomez Palacios et al. [8]. The framework is designed to
work with the SAS Enterprise Miner™ tool, but if a non-typical Machine
Learning case shows up, problems will undoubtedly arise [8]. Another
limitation is the lack of steps that take into account the business
perspective of the problem, which both CRISP-DM and KDD have.
However, SEMMA does support different kinds of Machine Learning
techniques, including neural networks, which is shown in the document
with case studies from SAS Institute Inc. [2].
An interesting fact found about the three frameworks was how much they
were used. Polls by KDNuggets [13], a leading site on business analytics,
big data, data mining, data science and machine learning, were found.
The polls showed that CRISP-DM was the most used framework, followed
by SEMMA and KDD. It is worth mentioning that the second most used
framework was one of the respondents' own making. The result of the
poll is shown below in Table 4.5.
Table 4.5: Polls from KDNuggets about the usage of the frameworks
KDD does not offer a website with instructions. Instead, the guidelines
are based on a scientific article, which makes KDD harder to follow and
understand than both CRISP-DM and SEMMA.
Conclusion
5.1 Conclusion
5.2 Discussion
An interesting finding in the study was the usage of the different
frameworks. The most used one was CRISP-DM, but the second most
common choice was a framework of the users' own making. This may
indicate that a perfect framework for all machine learning areas does
not exist and that some modification of the existing frameworks is
needed to fit specific problems.
In our search for case studies in our area, many articles showed up that
extended existing frameworks with steps to make them better fit
their exact areas. Our speculation is that a company needs a good
standard framework that fits the standard tasks in the company, but that
is also easy to modify for non-ordinary tasks. The high usage of
CRISP-DM is perhaps explained by this: it is applicable to most
problems, but it can also be modified to fit non-ordinary tasks.
In our study, there are limitations which can affect the result and our
conclusion. Since this subject is very new and no standardized tools
have been adopted by the public, it is hard to find a common source of
information, which possibly could have affected our work.
CRISP-DM was the most used method, and there exist many sources and
case studies that used CRISP-DM. The number of studies about KDD
and SEMMA was much smaller. This can affect the study, since it is
much easier to find information about CRISP-DM and a lot harder to
find information about the other frameworks.
Another limitation of the thesis is that only three frameworks were
evaluated. There exist more frameworks than just CRISP-DM, SEMMA and
KDD in the area of software development. It is possible that
evaluating more or other frameworks would have led to another result.
Also, no implementation of machine learning has been done to test the
framework and verify our result. If this were done, it is also
possible that another result would have been reached.
www.kth.se