P. 1
Customer Segmentation and Profiling Thesis

Customer Segmentation and Profiling Thesis

|Views: 459|Likes:
Publicado porAnkesh Srivastava

More info:

Published by: Ankesh Srivastava on Jan 13, 2011
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less

05/29/2013

pdf

text

original

Sections

  • Introduction
  • 1.1 Customer segmentation and customer pro-
  • 1.1.1 Customer segmentation
  • 1.1.2 Customer profiling
  • 1.2 Data mining
  • 1.3 Structure of the report
  • 2.1 Data warehouse
  • 2.1.1 Selecting the customers
  • 2.1.2 Call detail data
  • 2.1.3 Customer data
  • 2.2 Data preparation
  • Clustering
  • 3.1 Cluster analysis
  • 3.1.1 The data
  • 3.1.2 The clusters
  • 3.1.3 Cluster partition
  • 3.2 Cluster algorithms
  • 3.2.1 K-means
  • 3.2.2 K-medoid
  • 3.2.3 Fuzzy C-means
  • 3.2.4 The Gustafson-Kessel algorithm
  • 3.2.5 The Gath Geva algorithm
  • 3.3 Validation
  • 3.4 Visualization
  • 3.4.1 Principal Component Analysis
  • 3.4.2 Sammon mapping
  • 3.4.3 Fuzzy Sammon mapping
  • 4.1 Determining the optimal number of clusters
  • 4.2 Comparing the clustering algorithms
  • 4.3 Designing the segments
  • Support Vector Machines
  • 5.1 The separating hyperplane
  • 5.2 The maximum-margin hyperplane
  • 5.3 The soft margin
  • 5.4 The kernel functions
  • 5.5 Multi class classification
  • 6.1 K-fold cross validation
  • 6.2 Parameter setting
  • 6.3 Feature Validation
  • Conclusions and discussion
  • 7.1 Conclusions
  • 7.2 Recommendations for future work
  • Model of data warehouse

Customer Segmentation and Customer Profiling

for a Mobile Telecommunications Company
Based on Usage Behavior
A Vodafone Case Study
S.M.H. Jansen
July 17, 2007
Acknowledgments
This Master thesis was written to complete the study Operations Research at
the University of Maastricht (UM). The research took place at the Department
of Mathematics of UM and at the Department of Information Management of
Vodafone Maastricht. During this research, I had the privilege to work together
with several people. I would like to express my gratitude to all those people for
giving me the support to complete this thesis. I want to thank the Department
of Information Management for giving me permission to commence this thesis
in the first instance, to do the necessary research work and to use departmental
data.
I am deeply indebted to my supervisor Dr. Ronald Westra, whose help, stimu-
lating suggestions and encouragement helped me in all the time of research for
and writing of this thesis. Furthermore, I would like to give my special thanks
to my second supervisor Dr. Ralf Peeters, whose patience and enthusiasm en-
abled me to complete this work. I have also to thank my thesis instructor, Drs.
Annette Schade, for her stimulating support and encouraging me to go ahead
with my thesis.
My former colleagues from the Department of Information Management sup-
ported me in my research work. I want to thank them for all their help, support,
interest and valuable hints. Especially I am obliged to Drs. Philippe Theunen
and Laurens Alberts, MSc.
Finally, I would like to thank the people, who looked closely at the final ver-
sion of the thesis for English style and grammar, correcting both and offering
suggestions for improvement.
1
Contents
1 Introduction 8
1.1 Customer segmentation and customer profiling . . . . . . . . . . 9
1.1.1 Customer segmentation . . . . . . . . . . . . . . . . . . . 9
1.1.2 Customer profiling . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Data mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Structure of the report . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Data collection and preparation 14
2.1 Data warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 Selecting the customers . . . . . . . . . . . . . . . . . . . 14
2.1.2 Call detail data . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Customer data . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Clustering 22
3.1 Cluster analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.1 The data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 The clusters . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.3 Cluster partition . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Cluster algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 K-medoid . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Fuzzy C-means . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.4 The Gustafson-Kessel algorithm . . . . . . . . . . . . . . 29
3.2.5 The Gath Geva algorithm . . . . . . . . . . . . . . . . . . 30
3.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.1 Principal Component Analysis . . . . . . . . . . . . . . . 33
3.4.2 Sammon mapping . . . . . . . . . . . . . . . . . . . . . . 34
3.4.3 Fuzzy Sammon mapping . . . . . . . . . . . . . . . . . . . 35
4 Experiments and results of customer segmentation 37
4.1 Determining the optimal number of clusters . . . . . . . . . . . . 37
4.2 Comparing the clustering algorithms . . . . . . . . . . . . . . . . 42
2
4.3 Designing the segments . . . . . . . . . . . . . . . . . . . . . . . 45
5 Support Vector Machines 53
5.1 The separating hyperplane . . . . . . . . . . . . . . . . . . . . . . 53
5.2 The maximum-margin hyperplane . . . . . . . . . . . . . . . . . 55
5.3 The soft margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4 The kernel functions . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.5 Multi class classification . . . . . . . . . . . . . . . . . . . . . . . 59
6 Experiments and results of classifying the customer segments 60
6.1 K-fold cross validation . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2 Parameter setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3 Feature Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7 Conclusions and discussion 66
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.2 Recommendations for future work . . . . . . . . . . . . . . . . . 68
Bibliography 68
A Model of data warehouse 71
B Extra results for optimal number of clusters 73
3
List of Figures
1.1 A taxonomy of data mining tasks . . . . . . . . . . . . . . . . . . 12
2.1 Structure of customers by Vodafone . . . . . . . . . . . . . . . . 15
2.2 Visualization of phone calls per hour . . . . . . . . . . . . . . . . 17
2.3 Histograms of feature values . . . . . . . . . . . . . . . . . . . . . 18
2.4 Relation between originated and received calls . . . . . . . . . . . 18
2.5 Relation between daytime and weekday calls . . . . . . . . . . . 19
3.1 Example of clustering data . . . . . . . . . . . . . . . . . . . . . 22
3.2 Different cluster shapes in R
2
. . . . . . . . . . . . . . . . . . . . 24
3.3 Hard and fuzzy clustering . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Values of Partition Index, Separation Index and the Xie Beni Index 38
4.2 Values of Dunn’s Index and the Alternative Dunn Index . . . . . 39
4.3 Values of Partition coefficient and Classification Entropy with
Gustafson-Kessel clustering . . . . . . . . . . . . . . . . . . . . . 40
4.4 Values of Partition Index, Separation Index and the Xie Beni
Index with Gustafson-Kessel clustering . . . . . . . . . . . . . . . 41
4.5 Values of Dunn’s Index and Alternative Dunn Index with Gustafson-
Kessel clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6 Result of K-means algorithm . . . . . . . . . . . . . . . . . . . . 43
4.7 Result of K-medoid algorithm . . . . . . . . . . . . . . . . . . . . 44
4.8 Result of Fuzzy C-means algorithm . . . . . . . . . . . . . . . . . 44
4.9 Result of Gustafson-Kessel algorithm . . . . . . . . . . . . . . . . 44
4.10 Result of Gath-Geva algorithm . . . . . . . . . . . . . . . . . . . 45
4.11 Distribution of distances from cluster centers within clusters for
the Gath-Geva algorithm with c = 4 . . . . . . . . . . . . . . . . 46
4.12 Distribution of distances from cluster centers within clusters for
the Gustafson-Kessel algorithm with c = 6 . . . . . . . . . . . . . 46
4.13 Cluster profiles for c = 4 . . . . . . . . . . . . . . . . . . . . . . . 47
4.14 Cluster profiles for c = 6 . . . . . . . . . . . . . . . . . . . . . . . 48
4.15 Cluster profiles of centers for c = 4 . . . . . . . . . . . . . . . . . 49
4.16 Cluster profiles of centers for c = 6 . . . . . . . . . . . . . . . . . 50
5.1 Two-dimensional customer data of segment 1 and segment 2 . . . 54
4
5.2 Separating hyperplanes in different dimensions . . . . . . . . . . 54
5.3 Demonstration of the maximum-margin hyperplane . . . . . . . . 55
5.4 Demonstration of the soft margin . . . . . . . . . . . . . . . . . . 56
5.5 Demonstration of kernels . . . . . . . . . . . . . . . . . . . . . . . 57
5.6 Examples of separation with kernels . . . . . . . . . . . . . . . . 58
5.7 A separation of classes with complex boundaries . . . . . . . . . 59
6.1 Under fitting and over fitting . . . . . . . . . . . . . . . . . . . . 60
6.2 Determining the stopping point of training the SVM . . . . . . . 61
6.3 A K-fold partition of the dataset . . . . . . . . . . . . . . . . . . 61
6.4 Results while leaving out one of the features with 4 segments . . 65
6.5 Results while leaving out one of the features with 6 segments . . 65
A.1 Model of the Vodafone data warehouse . . . . . . . . . . . . . . . 72
B.1 Partition index and Separation index of K-medoid . . . . . . . . 73
B.2 Dunn’s index and Alternative Dunn’s index of K-medoid . . . . . 74
B.3 Partition coefficient and Classification Entropy of Fuzzy C-means 74
B.4 Partition index, Separation index and Xie Beni index of Fuzzy
C-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
B.5 Dunn’s index and Alternative Dunn’s index of Fuzzy C-means . . 75
5
List of Tables
2.1 Proportions within the different classification groups . . . . . . . 20
4.1 The values of all the validation measures with K-means clustering 39
4.2 The values of all the validation measures with Gustafson-Kessel
clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 The numerical values of validation measures for c = 4 . . . . . . 42
4.4 The numerical values of validation measures for c = 6 . . . . . . 43
4.5 Segmentation results . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.1 Linear Kernel, 4 segments . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Linear Kernel, 6 segments . . . . . . . . . . . . . . . . . . . . . . 62
6.3 Average C-value for polynomial kernel, 4 segments . . . . . . . . 62
6.4 Average C-value for polynomial kernel, 6 segments . . . . . . . . 62
6.5 Polynomial kernel, 4 segments . . . . . . . . . . . . . . . . . . . . 62
6.6 Polynomial kernel, 6 segments . . . . . . . . . . . . . . . . . . . . 62
6.7 Radial basis function, 4 segments . . . . . . . . . . . . . . . . . . 63
6.8 Radial basis function, 6 segments . . . . . . . . . . . . . . . . . . 63
6.9 Sigmoid function, 4 segments . . . . . . . . . . . . . . . . . . . . 63
6.10 Sigmoid function, 6 segments . . . . . . . . . . . . . . . . . . . . 64
6.11 Confusion matrix, 4 segments . . . . . . . . . . . . . . . . . . . . 64
6.12 Confusion matrix, 6 segments . . . . . . . . . . . . . . . . . . . . 64
6
Abstract
Vodafone, an International mobile telecommunications company, has accumu-
lated vast amounts of data on consumer mobile phone behavior in a data ware-
house. The magnitude of this data is so huge that manual analysis of data is
not feasible. However, this data holds valuable information that can be applied
for operational and strategical purposes. Therefore, in order to extract such in-
formation from this data, automatic analysis is essential, by means of advanced
data mining techniques. These data mining techniques search and analyze the
data in order to find implicit and useful information, without direct knowledge
of human experts. This research will address the question how to perform cus-
tomer segmentation and customer profiling with data mining techniques. In
our context, ’customer segmentation’ is a term used to describe the process of
dividing customers into homogeneous groups on the basis of shared or common
attributes (habits, tastes, etc). ’Customer profiling’ is describing customers
by their attributes, such as age, gender, income and lifestyles. Having these
two components, managers can decide which marketing actions to take for each
segment. In this research, the customer segmentation is based on usage call
behavior, i.e. the behavior of a customer measured in the amounts of incoming
or outgoing communication of whichever form. This thesis describes the process
of selecting and preparing the accurate data from the data warehouse, in order
to perform customer segmentation and to profile the customer. A number of
advanced and state-of-the-art clustering algorithms are modified and applied
for creating customer segments. An optimality criterion is constructed in order
to measure their performance. The best i.e. most optimal in the sense of the
optimality criterion clustering technique will be used to perform customer seg-
mentation. Each segment will be described and analyzed. Customer profiling
can be accomplished with information from the data warehouse, such as age,
gender and residential area information. Finally, with a recent data mining
technique, called Support Vector Machines, the segment of a customer will be
estimated based on the customers profile. Different kernel functions with dif-
ferent parameters will be examined and analyzed. The customer segmentation
will lead to two solutions. One solution with four segments and one solution
with six segments. With the Support Vector Machine approach it is possible in
80.3% of the cases to classify the segment of a customer based on its profile for
the situation with four segments. With six segments, a correct classification of
78.5% is obtained.
7
Chapter 1
Introduction
Vodafone is world’s leading mobile telecommunications company, with approx-
imately 4.1 million customers in The Netherlands. From all these customers a
tremendous amount of data is stored. These data include, among others, call de-
tail data, network data and customer data. Call detail data gives a description
of the calls that traverse the telecommunication networks, while the network
data gives a description of the state of the hardware and software components
in the network. The customer data contains information of the telecommunica-
tion customers. The amount of data is so great that manual analysis of data is
difficult, if not impossible [22]. The need to handle such large volumes of data
led to the development of knowledge-based expert systems [17, 22]. These auto-
mated systems perform important functions such as identifying network faults
and detecting fraudulent phone calls. A disadvantage of this approach is that
it is based on knowledge from human experts.
Obtaining knowledge from human experts is a time consuming process, and
in many cases, the experts do not have the requisite knowledge [2]. Solutions
to these problems were promised by data mining techniques. Data mining is
the process of searching and analyzing data in order to find implicit, but po-
tentially useful, information [12]. Within the telecommunication branch, many
data mining tasks can be distinguished. Examples of main problems for market-
ing and sales departments of telecommunication operators are churn prediction,
fraud detection, identifying trends in customer behavior and cross selling and
up-selling.
Vodafone is interested in a complete different issue, namely customer segmenta-
tion and customer profiling and the relation between them. Customer segmen-
tation is a term used to describe the process of dividing customers into homoge-
neous groups on the basis of shared or common attributes (habits, tastes, etc)
[10]. Customer profiling is describing customers by their attributes, such as age,
gender, income and lifestyles [1, 10]. Having these two components, marketers
can decide which marketing actions to take for each segment and then allocate
scarce resources to segments in order to meet specific business objectives.
A basic way to perform customer segmentation is to define segmentations in
8
advance with knowledge of an expert, and dividing the customers over these
segmentations by their best fits. This research will deal with the problem of
making customer segmentations without knowledge of an expert and without
defining the segmentations in advance. The segmentations will be determined
based on (call) usage behavior. To realize this, different data mining techniques,
called clustering techniques, will be developed, tested, validated and compared
to each other. In this report, the principals of the clustering techniques will be
described and the process of determining the best technique will be discussed.
Once the segmentations are obtained, for each customer a profile will be de-
termined with the customer data. To find a relation between the profile and
the segments, a data mining technique called Support Vector Machines (SVM)
will be used. A Support Vector machine is able to estimate the segment of a
customer by personal information, such as age, gender and lifestyle. Based on
the combination of the personal information (the customer profile), the segment
can be estimated and the usage behavior of the customer profile can be deter-
mined. In this research, different settings of the Support Vector Machines will
be examined and the best working estimation model will be used.
1.1 Customer segmentation and customer pro-
filing
To compete with other providers of mobile telecommunications it is important
to know enough about your customers and to know the wants and needs of your
customers [15]. To realize this, it is needed to divide customers in segments
and to profile the customers. Another key benefit of utilizing the customer
profile is making effective marketing strategies. Customer profiling is done by
building a customer’s behavior model and estimating its parameters. Customer
profiling is a way of applying external data to a population of possible customers.
Depending on data available, it can be used to prospect new customers or to
recognize existing bad customers. The goal is to predict behavior based on
the information we have on each customer [18]. Profiling is performed after
customer segmentation.
1.1.1 Customer segmentation
Segmentation is a way to have more targeted communication with the customers.
The process of segmentation describes the characteristics of the customer groups
(called segments or clusters) within the data. Segmenting means putting the
population in to segments according to their affinity or similar characteristics.
Customer segmentation is a preparation step for classifying each customer ac-
cording to the customer groups that have been defined.
Segmentation is essential to cope with today’s dynamically fragmenting con-
sumer marketplace. By using segmentation, marketers are more effective in
channeling resources and discovering opportunities. The construction of user
9
segmentations is not an easy task. Difficulties in making good segmentation are
[18]:
• Relevance and quality of data are essential to develop meaningful seg-
ments. If the company has insufficient customer data, the meaning of a
customer segmentation in unreliable and almost worthless. Alternatively,
too much data can lead to complex and time-consuming analysis. Poorly
organize data (different formats, different source systems) makes it also
difficult to extract interesting information. Furthermore, the resulting
segmentation can be too complicated for the organization to implement
effectively. In particular, the use of too many segmentation variables can
be confusing and result in segments which are unfit for management deci-
sion making. On the other hand, apparently effective variables may not be
identifiable. Many of these problems are due to an inadequate customer
database.
• Intuition: Although data can be highly informative, data analysts need
to be continuously developing segmentation hypotheses in order to identify
the ’right’ data for analysis.
• Continuous process: Segmentation demands continuous development
and updating as new customer data is acquired. In addition, effective seg-
mentation strategies will influence the behavior of the customers affected
by them; thereby necessitating revision and reclassification of customers.
Moreover, in an e-commerce environment where feedback is almost imme-
diate, segmentation would require almost a daily update.
• Over-segmentation: A segment can become too small and/or insuffi-
ciently distinct to justify treatment as separate segments.
One solution to construct segments can be provided by data mining methods
that belong to the category of clustering algorithms. In this report, several
clustering algorithms will be discussed and compared to each other.
1.1.2 Customer profiling
Customer profiling provides a basis for marketers to ’communicate’ with existing
customers in order to offer them better services and retaining them. This is done
by assembling collected information on the customer such as demographic and
personal data. Customer profiling is also used to prospect new customers using
external sources, such as demographic data purchased from various sources.
This data is used to find a relation with the customer segmentations that were
constructed before. This makes it possible to estimate for each profile (the
combination of demographic and personal information) the related segment and
visa versa. More directly, for each profile, an estimation of the usage behavior
can be obtained.
Depending on the goal, one has to select what is the profile that will be relevant
to the project. A simple customer profile is a file that contains at least age and
10
gender. If one needs profiles for specific products, the file would contain product
information and/or volume of money spent. Customer features one can use for
profiling, are described in [2, 10, 19]:
• Geographic. Are they grouped regionally, nationally or globally
• Cultural and ethnic. What languages do they speak? Does ethnicity affect
their tastes or buying behaviors?
• Economic conditions, income and/or purchasing power. What is the av-
erage household incom or power of the customers? Do they have any
payment difficulty? How much or how often does a customer spend on
each product?
• Age and gender. What is the predominant age group of your target buyers?
How many children and what age are in the family? Are more female or
males using a certain service or product?
• Values, attitudes and beliefs. What is the customers’ attitude toward your
kind of product or service?
• Life cycle. How long has the customer been regularly purchasing products?
• Knowledge and awareness. How much knowledge do customers have about
a product,service, or industry? How much education is needed? How much
brand building advertising is needed to make a pool of customers aware
of offer?
• Lifestyle. How many lifestyle characteristics about purchasers are useful?
• Recruitment method. How was the customer recruited?
The choice of the features depends also on the availability of the data. With
these features, an estimation model can be made. This can be realized by a
data mining method called Support Vector Machines (SVM). This report gives
an description of SVM’s and it will be researched under which circumstances
and parameters a SVM works best in this case.
1.2 Data mining
In section 1.1, the term data mining was used. Data mining is the process of
searching and analyzing data in order to find implicit, but potentially useful,
information [12]. It involves selecting, exploring and modeling large amounts of
data to uncover previously unknown patterns, and ultimately comprehensible
information, from large databases. Data mining uses a broad family of computa-
tional methods that include statistical analysis, decision trees, neural networks,
rule induction and refinement, and graphic visualization. Although, data min-
ing tools have been available for a long time, the advances in computer hardware
and software, particularly exploratory tools like data visualization and neural
11
networks, have made data mining more attractive and practical. The typical
data mining process consist of the following steps [4]:
• problem formulation
• data preparation
• model building
• interpretation and evaluation of the results
Pattern extraction is an important component of any data mining activity and
it deals with relationships between subsets of data. Formally, a pattern is de-
fined as [4]:
A statement S in L that describes relationships among a subsets of facts F
s
of a given set of facts F, with some certainty C, such that S is simpler than the
enumeration of all facts in F
s
.
Data mining tasks are used to extract patterns from large data sets. The vari-
ous data mining tasks can be broadly divided into six categories as summarized
in Figure 1.1. The taxonomy reflects the emerging role of data visualization as
Figure 1.1: A taxonomy of data mining tasks
a separate data mining task, even as it is used to support other data mining
tasks. Validation of the results is also a data mining task. By the fact that the
validation supports the other data mining tasks and is always necessary within
a research, this task was not mentioned as a separate one. Different data mining
tasks are grouped into categories depending on the type of knowledge extracted
by the tasks. The identification of patterns in a large data set is the first step to
gaining useful marketing insights and marking critical marketing decisions. The
data mining tasks generate an assortment of customer and market knowledge
which form the core of knowledge management process. The specific tasks to
be used in this research are Clustering (for the customer segmentation), Classi-
fication (for estimating the segment) and Data visualization.
Clustering algorithms produce classes that maximize similarity within clusters
but minimize similarity between classes. A drawback of this method is that the
number of clusters has to be given in advance. The advantage of clustering is
that expert knowledge is not required. For example, based on user behavior
data, clustering algorithms can classify the Vodafone customers into ”call only”
users, ”international callers”, ”SMS only” users etc.
Classification algorithms groups customers in predefined classes. For example,
12
Vodafone can classify its customers based on their age, gender and type of sub-
scription and then target its user behavior.
Data visualization allow data miners to view complex patterns in their cus-
tomer data as visual objects complete in three or two dimensions and colors.
In some cases it is needed to reduce high dimensional data into three or two
dimensions. To realize this, algorithms as Principal Component Analysis and
Sammon’s Mapping (discussed in Section 3.4) can be used. To provide varying
levels of details of observed patterns, data miners use applications that provide
advanced manipulation capabilities to slice, rotate or zoom the objects.
1.3 Structure of the report
The report comprises 6 chapters and several appendices. In addition to to this
introductory chapter, Chapter 2 describes the process of selecting the right data
from the data ware house. It provides information about the structure of the
data and the data ware house. Furthermore, it gives an overview of the data
that is used to perform customer segmentation and customer profiling. It ends
with an explanation of the preprocessing techniques that were used to prepare
the data for further usage.
In Chapter 3 the process of clustering is discussed. Clustering is a data mining
technique, that in this research is used to determine the customer segmenta-
tions. The chapter starts with explaining the general process of clustering.
Different cluster algorithms will be studied. It also focuses on validation meth-
ods, which can be used to determine the optimal number of clusters and to
measure the performance of the different cluster algorithms. The chapter ends
with a description of visualization methods. These methods are used to analyze
the results of the clustering.
Chapter 4 analyzes the different cluster algorithms of Chapter 3. This will be
tested with the prepared call detail data as described in Chapter 2 For each
algorithm, the optimal numbers of cluster will be determined. Then, the cluster
algorithms will be compared to each other and the best algorithm will be chosen
to determine the segments. Multiple plots and figures will show the working of
the different cluster methods and the meaning of each segment will be described.
Once the segments are determined, with the customer data of Chapter 2, a pro-
file can be made. Chapter 5 delves into a data mining technique called Support
Vector Machines. This technique will be used to classify the right segment for
each customer profile. Different parameter settings of the Support Vector Ma-
chines will be researched and examined in Chapter 6 to find the best working
model. Finally, in Chapter 7, the research will be discussed. Conclusions and
recommendations are given and future work is proposed.
13
Chapter 2
Data collection and
preparation
The first step (after the problem formulation) in the data mining process is
to understand the data. Without such an understanding, useful applications
cannot be developed. All data of Vodafone is stored in a data warehouse. In
this chapter, the process of collecting the right data from this data ware house,
will be described. Furthermore, the process of preparing the data for customer
segmentation and customer profiling will be explained.
2.1 Data warehouse
Vodafone has stored vast amounts of data in a Teradata data warehouse. This
data warehouse exists off more than 200 tables. A simplified model of the data
warehouse can be found in Appendix A.
2.1.1 Selecting the customers
Vodafone Maastricht is interested in customer segmentation and customer pro-
filing for (postpaid) business customers. In general, business customers can be
seen as employees of a business that have a subscription by Vodafone in re-
lation with that business. A more precisely view can be found in Figure 2.1.
It is clear to see, that prepaid users are always consumers. In the postpaid
group, there are captive and non captive users. A non-captive customer is using
the Vodafone network but has not a Vodafone subscription or prepaid (called
roaming). Vodafone has made an accomplishment with two other telecommuni-
cations companies, Debitel and InterCity Mobile Communications (ICMC), that
their customers can use the Vodafone network. Debitel customers are always
consumers and ICMC customers are always business customers. The ICMC cus-
tomers will also be involved in this research. A captive customer has a business
account if his telephone or subscription is bought in relation with the business
14
Figure 2.1: Structure of customers by Vodafone
he works. These customers are called business users. In some cases, customers
with a consumer account, can have a subscription that is under normal circum-
stances only available for business users. These customers also count as business
users. The total number of (postpaid) business users at Vodafone is more than
800,000. The next sections describe which data of these customers is needed for
customer segmentation and profiling.
2.1.2 Call detail data
Every time a call is placed on the telecommunications network of Vodafone,
descriptive information about the call is saved as a call detail record. The num-
ber of call detail records that are generated and stored is huge. For example,
Vodafone customers generate over 20 million call detail records per day. Given
that 12 months of call detail data is typically kept on line, this means that
hundreds of millions of call detail data will need to be stored at any time. Call
detail records include sufficient information to describe the important charac-
teristics of each call. At a minimum, each call detail record will include the
originating and terminating phone numbers, the date and the time of the call
and the duration of the call. Call detail records are generated in two or three
days after the day the calls were made, and will be available almost immediately
for data mining. This is in contrast with billing data, which is typically made
available only once per month. Call detail records can not be used directly
for data mining, since the goal of data applications is to extract knowledge at
the customer level, not at the level of individual phone calls [7, 8]. Thus, the
call detail records associated with a customer must be summarized into a single
record that describes the customer’s calling behavior. The choice of summary
variables (features) is critical in order to obtain a useful description of the cus-
tomer []. To define the features, one can think of the smallest set of variables
that describe the complete behavior of a customer. Keywords like what, when,
where, how often, who, etc. can help with this process:
15
• How?: How can a customer cause a call detail record? By making a voice
call, or sending an SMS (there are more possibilities, but their appearances
are so rare that they were not used during this research). The customer
can also receive an SMS or voice call.
• Who?: Who is the customer calling? Does he call to fixed lines? Does
he call to Vodafone mobiles?
• What?: What is the location of the customer and the recipient? They
can make international phone calls.
• When?: When does a customer call? A business customer can call during
office daytime, or in private time in the evening or at night and during
the weekend.
• Where?: Where is the customer calling? Is he calling abroad?
• How long?: How long is the customer calling?
• How often?: How often does a customer call or receive a call?
Based on these keywords and based on proposed features in the literature [1,
15, 19, 20] , a list of features that can be used as a summary description of a
customer based on the calls they originate and receive over some time period P
is obtained:
• 1. average call duration
• 2. average # calls received per day
• 3. average # calls originated per day
• 4. % daytime calls (9am - 6pm)
• 5. % of weekday calls (Monday - Friday)
• 6. % of calls to mobile phones
• 7. average # sms received per day
• 8. average # sms originated per day
• 9. % international calls
• 10. % of outgoing calls within the same operator
• 11. # unique area codes called during P
• 12. # different numbers called during P
16
These twelve features can be used to build customer segments. Such a segment
describes a certain behavior of group of customers. For examples, customers
who use their telephone only at their office could be in a different segment then
users that use their telephone also for private purposes. In that case, the seg-
mentation was based on the percentage weekday and daytime calls. Most of the
twelve features listed above can be generated in a straightforward manner from
the underlying data of the data ware house, but some features require a little
more creativity and operations on the data.
It may be clear that generating useful features, including summary features, is a
critical step within the data mining process. Should poor features be generated,
data mining will not be successful. Although the construction of these features
may be guided by common sense, it should include exploratory data analysis.
For example, the use of the time period 9am-6pm in the fourth feature is not
based on the commonsense knowledge that the typical workday on a office is
from 9am to 5pm. More detailed exploratory data analysis, shown in Figure
2.2 indicates that the period from 9am to 6pm is actually more appropriate for
this purpose. Furthermore, for each summary feature, there should be sufficient
Figure 2.2: Visualization of phone calls per hour
variance within the data, otherwise distinguish between customers is not possi-
ble and the feature is not useful. On the other hand, to much variance hampers
the process of segmentation. For some features values, the variance is visible in
the following histograms. Figure 2.3 shows that the average call duration, the
number of weekday and daytime calls and the originated calls have sufficient
variance. Note that the histograms resemble well known distributions. This
also indicates that the chosen features are suited for the customer segmenta-
tion. Interesting to see is the relation between the number of calls originated
and received. First of all, in general, customers originating more calls than
receiving. Figure 2.4 demonstrates this, values above the blue line represent
customers with more originating calls than receiving calls. In Figure 2.4 is also
visible that the customers that originated more calls, receive also more calls in
proportion. Another aspect that is simple to figure out is the fact that customer
17
(a) Call duration (b) Weekday calls
(c) Daytime calls (d) Originated calls
Figure 2.3: Histograms of feature values
Figure 2.4: Relation between originated and received calls
18
that make more weekday calls also call more at daytime (in proportion). This is
plotted in Figure 2.5. It is clear to see that the chosen features contain sufficient
variance and that certain relations and different customer behavior are already
visible. The chosen features appear to be well chosen and useful for customer
segmentation.
Figure 2.5: Relation between daytime and weekday calls
2.1.3 Customer data
To profile the customer, customer data is needed. The proposed data in Section
1.1.2 is not completely available. Information about lifestyles and income is
missing. However, with some creativity, some information can be subtracted
from the data ware house. The information that Vodafone stored in the data
ware house include name and address information and also include other in-
formation such as service plan, contract information and telephone equipment
information. With this information, the following variables can be used to define
a customers profile:
• Age group: <25, 25-40 40-55 >55
• Gender: male, female
• Type telephone: simple, basic, advanced
• Type subscription: basic, advance, expanded
• Company size: small, intermediate, big
• Living area: (big) city, small city /town
19
Because a relative small difference in age between customers should show close
relationships, the age of the customers has to be grouped. Otherwise, the result
of the classification algorithm is too specific to the trainings data [14]. In general,
the goal of grouping variables is to reduce the number of variables to a more
manageable size and to remove the correlations between each variable. The
composition of the groups should be chosen with care. It is of high importance
that the sizes of the groups are almost equal (if this is possible) [22]. If there is
one group with a sufficient higher amount of customers than other groups, this
feature will not increase the performance of the classification. This is caused
by the fact that from each segment a relative high number of customers is
represented in this group. Based on this feature, the segment of a customer can
not be determined. Table 2.1 shows the percentages of customers within the
chosen groups. It is clear to see that sizes of the groups were chosen with care
Age: <25 25-40 40-55 >55
21.2% 29.5% 27.9% 21.4%
Gender: Male Female
60.2% 39.8%
Telephone type: simple basic advanced
33.5% 38.7% 27.8%
Type of subscription: simple advanced expanded
34.9% 36.0% 29.1%
Company size: small intermediate big
31.5% 34.3% 34.2%
Living area: (big) city small city/town
42.0% 58.0%
Table 2.1: Proportions within the different classification groups
and the values can be used for defining the customers profile.With this profile, a
Support Vector Machine will be used to estimate the segment of the customer.
Chapter 5 and Chapter 6 contain information and results of this method.
2.2 Data preparation
Before the data can be used for the actual data mining process, it need to
cleaned and prepared in a required format. These tasks are [7]:
• Discovering and repairing inconsistent data formats and inconsistent data
encoding, spelling errors, abbreviations and punctuation.
• Deleting unwanted data fields. Data may contain many meaningless fields
from an analysis point of view, such as production keys and version num-
bers.
• Interpreting codes into text or replacing text into meaningful numbers.
20
Data may contain cryptic codes. These codes has to be augmented and
replaced by recognizable and equivalent text.
• Combining data, for instance the customer data, from multiple tables into
one common variable.
• Finding multiple used fields. A possible way to determine is to count or
list all the distinct variables of a field.
The following data preparations were needed during this research:
• Checking abnormal, out of bounds or ambiguous values. Some of these
outliers may be correct but this is highly unusual, thus almost impossible
to explain.
• Checking missing data fields or fields that have been replaced by a default
value.
• Adding computed fields as inputs or targets.
• Mapping continuous values into ranges, e.g. decision trees.
• Normalization of the variables. There are two types of normalization. The
first type is to normalize the values between [0,1]. The second type is to
normalize the variance to one.
• Converting nominal data (for example yes/no answers) to metric scales.
• Converting from textual to numeral or numeric data.
New fields can be generated through combinations of e.g. frequencies, aver-
ages and minimum/maximum values. The goal of this approach is to reduce
the number of variables to a more manageable size while also the correlations
between each variable will be removed. Techniques used for this purpose are of-
ten referred to as factor analysis, correspondence analysis and conjoint analysis
[14]. When there is a large amount of data, it is also useful to apply data reduc-
tion techniques (data cube aggregation, dimension and numerosity reduction,
discretization and concept hierarchy generation). Dimension reduction means
that one has to select relevant feature to a minimum set of attributes such that
the resulting probability distribution of data classes is a close as possible to the
original distribution given the values of all features. For this additional tools
may be needed, e.g. exhaustive, random or heuristic search, clustering, decision
trees or associations rules.
21
Chapter 3
Clustering
In this chapter, the used techniques for the cluster segmentation will be ex-
plained.
3.1 Cluster analysis
The objective of cluster analysis is the organization of objects into groups, ac-
cording to similarities among them [13]. Clustering can be considered the most
important unsupervised learning method. As every other unsupervised method,
it does not use prior class identifiers to detect the underlying structure in a
collection of data. A cluster can be defined as a collection of objects which are
”similar” between them and ”dissimilar” to the objects belonging to other clus-
ters. Figure 3.1 shows this with a simple graphical example. In this case the 3
Figure 3.1: Example of clustering data
clusters into which the data can be divided were easily identified. The similarity
criterion that was used in this case is distance: two or more objects belong to
the same cluster if they are ”close” according to a given distance (in this case
geometrical distance). This is called distance-based clustering. Another way of
clustering is conceptual clustering. Within this method, two or more objects
22
belong to the same cluster if this one defines a concept common to all that
objects. In other words, objects are grouped according to their fit to descriptive
concepts, not according to simple similarity measures. In this research, only
distance-based clustering algorithms were used.
3.1.1 The data
One can apply clustering techniques to quantitative (numerical) data, qualita-
tive (categoric) data, or a mixture of both. In this research, the clustering of
quantitative data is considered. The data, as described in Section 2.1.2, are
typically summarized observations of a physical process (call behavior of a cus-
tomer). Each observation of the customers calling behavior consists of n mea-
sured values, grouped into an n-dimensional row vector x
k
= [x
k1
, x
k2
, ..., x
kn
]
T
,
where x
k
∈ R
n
. A set of N observations is denoted by X = {x
k
|k = 1, 2, ..., N},
and is represented as an N x n matrix:
X =

¸
¸
¸
¸
x
11
x
12
· · · x
1n
x
21
x
22
· · · x
2n
.
.
.
.
.
.
.
.
.
.
.
.
x
N1
x
N2
· · · x
Nn
¸

. (3.1)
In pattern recognition terminology, the rows of X are called patterns or objects,
the columns are called the features or attributes, and X is called the pattern
matrix. In this research, X will be referred to the data matrix. The rows of
X represent the customers, and the columns are the feature variables of their
behavior as described in Section 2.1.2. As mentioned before, the purpose of
clustering is to find relationships between independent system variables, called
the regressors, and future values of dependent variables, called the regressands.
However, one should realize, that the relations revealed by clustering are not
more than associations among the data vectors. And therefore, they will not
automatically constitute a prediction model of the given system. To obtain such
a model, additional steps are needed.
3.1.2 The clusters
The definition of a cluster can be formulated in various ways, depending on
the objective of the clustering. In general, one can accept the definition that a
cluster is a group of objects that are more similar to another than to members
of other clusters. The term ”similarity” can be interpreted as mathematical
similarity, measured in some well-defined sense. In metric spaces, similarity is
often defined by means of a distance norm, or distance measure. Distance can
be measured in different ways. The first possibility is to measure among the
data vectors themselves. A second way is to measure the distance form the
data vector to some prototypical object of the cluster. The cluster centers are
usually (and also in this research) not known a priori, and will be calculated
by the clustering algorithms simultaneously with the partitioning of the data.
23
The cluster centers may be vectors of the same dimensions as the data objects,
but can also be defined as ”higher-level” geometrical objects, such as linear or
nonlinear subspaces or functions.
Data can reveal clusters of different geometrical shapes, sizes and densities as
demonstrated in Figure 3.2 Clusters can be spherical, elongated and also be
(a) Elongated (b) Spherical
(c) Hollow (d) Hollow
Figure 3.2: Different cluster shapes in R
2
hollow. Cluster can be found in any n-dimensional space. Clusters a,c and d
can be characterized as linear and non linear subspaces of the data space (R
2
in
this case). Clustering algorithms are able to detect subspaces of the data space,
and therefore reliable for identification. The performance of most clustering
algorithms is influenced not only by the geometrical shapes and densities of the
individual clusters, but also by the spatial relations and distances among the
clusters. Clusters can be well-separated, continuously connected to each other,
or overlapping each other.
3.1.3 Cluster partition
Clusters can formally be seen as subsets of the data set. One can distinguish
two possible outcomes of the classification of clustering methods. Subsets can
24
either be fuzzy or crisp (hard). Hard clustering methods are based on the clas-
sical set theory, which requires that an object either does or does not belong
to a cluster. Hard clustering in a data set X means partitioning the data into
a specified number of exclusive subsets of X. The number of subsets (clusters)
is denoted by c. Fuzzy clustering methods allow objects to belong to several
clusters simultaneously, with different degrees of membership. The data set X
is thus partitioned into c fuzzy subsets. In many real situations, fuzzy cluster-
ing is more natural than hard clustering, as objects on the boundaries between
several classes are not forced to fully belong to one of the classes, but rather
are assigned membership degrees between 0 and 1 indicating their partial mem-
berships (illustrated by Figure 3.3 The discrete nature of hard partitioning also
Figure 3.3: Hard and fuzzy clustering
causes analytical and algorithmic intractability of algorithms based on analytic
functionals, since these functionals are not differentiable. The structure of the
partition matrix U = [µ
ik
]:
U =

¸
¸
¸
¸
µ
1,1
µ
1,2
· · · µ
1,c
µ
2,1
µ
2,2
· · · µ
2,c
.
.
.
.
.
.
.
.
.
.
.
.
µ
N,1
µ
N,2
· · · µ
N,c
¸

. (3.2)
Hard partition
The objective of clustering is to partition the data set X into c clusters. Assume
that c is known, e.g. based on prior knowledge, or it is a trial value, of witch
partition results must be validated. Using classical sets, a hard partition can be
seen as a family of subsets {A
i
|1 ≤ i ≤ c ⊂ P(X)}, its properties can be defined
as follows:
c
¸
i=1
A
i
= X, (3.3)
A
i
∩ A
j
, 1 ≤ i = j ≤ c, (3.4)
Ø ⊂ A
i
⊂ X, 1 ≤ i ≤ c. (3.5)
25
These conditions imply that the subsets A
i
contain all the data in X, they must
be disjoint and none of them is empty nor contains all the data in X. Expressed
in the terms of membership functions:
c
¸
i=1
µ
Ai
= 1, (3.6)
µ
Ai
∨ µ
Ai
, 1 ≤ i = j ≤ c, (3.7)
0 ≤ µ
Ai
< 1, 1 ≤ i ≤ c. (3.8)
Where µ
Ai
represents the characteristic function of the subset A
i
which value
is zero or one. To simplify these notations, µ
i
will be used instead of µ
Ai
,
and denoting µ
i
(x
k
) by µ
ik
, partitions can be represented in a matrix notation.
U = [µ
ik
], a Nxc matrix, is a representation of the hard partition if and only if
its elements satisfy:
µ
ij
∈ {0, 1}, 1 ≤ i ≤ N, 1 ≤ k ≤ c, (3.9)
c
¸
k=1
µ
ik
= 1, 1 ≤ i ≤ N, (3.10)
0 <
N
¸
i=1
µ
ik
< N, 1 ≤ k ≤ c. (3.11)
A definition of a hard partitioning space can be defined as follows:
Let X be a finite data set and the number of clusters 2 ≤ c < N ∈ N. Then,
the hard partitioning space for X can be seen as the set:
M
hc
= {U ∈ R
Nxc

ik
∈ {0, 1}, ∀i, k;
c
¸
k=1
µ
ik
= 1, ∀i; 0 <
N
¸
i=1
µ
ik
< N, ∀k}.
(3.12)
Fuzzy partition
Fuzzy partition can be defined as a generalization of hard partitioning, in this
case µ
ik
is allowed to acquire all real values between zero and 1. Consider the
matrix U = [µ
ik
], containing the fuzzy partitions, its conditions are given by:
µ
ij
∈ [0, 1], 1 ≤ i ≤ N, 1 ≤ k ≤ c, (3.13)
c
¸
k=1
µ
ik
= 1, 1 ≤ i ≤ N, (3.14)
0 <
N
¸
i=1
µ
ik
< N, 1 ≤ k ≤ c. (3.15)
Note that there is only one difference with the conditions of the hard partition-
ing. Also the definition of the fuzzy partitioning space will not much differ with
26
the definition of the hard partitioning space. It can be defined as follows: Let
X be a finite data set and the number of clusters 2 ≤ c < N ∈ N. Then, the
fuzzy partitioning space for X can be seen as the set:
M
fc
= {U ∈ R
Nxc

ik
∈ [0, 1], ∀i, k;
c
¸
k=1
µ
ik
= 1, ∀i; 0 <
N
¸
i=1
µ
ik
< N, ∀k}.
(3.16)
The i-th column of U contains values of the membership functions of the i-th
fuzzy subset of X. Equation (1.14) implies that the sum of each column should
be 1, which means that the total membership of each x
k
in X equals one. There
are no constraints on the distribution of memberships among the fuzzy clusters.
This research will focus on hard partitioning. However, fuzzy cluster algorithms
will be applied as well. To deal with the problem of fuzzy memberships, the
cluster with the highest degree of membership will be chosen as the cluster were
the object belongs to. This method will result into hard partitioned clusters.
The possibilistic partition will not be used in this researched and will not be
discussed here.
3.2 Cluster algorithms
This section gives an overview of the clustering algorithms that were used during
the research.
3.2.1 K-means
K-means is one of the simplest unsupervised learning algorithms that solves
the clustering problem. However, the results of this hard partitioning method
are not always reliable and this algorithm has numerical problems as well. The
procedure follows an easy way to classify a given N x n data set through a certain
numbers of c clusters defined in advance. The K-means algorithm allocates each
data point to one of the c clusters to minimize the within sum of squares:
c
¸
i=1
sum
k∈Ai
||x
k
−v
i
||
2
. (3.17)
A
i
represents a set of data points in the i-th cluster and v
i
is the average of the
data points in cluster i. Note that ||x
k
−v
i
||
2
is actually a chosen distance norm.
Within the cluster algorithms, v
i
is the cluster center (also called prototype) of
cluster i:
v
i
=
¸
Ni
k=1
x
k
N
i
, x
k
∈ A
i
, (3.18)
where N
i
is the number of data points in A
i
.
27
3.2.2 K-medoid
K-medoid clustering, also a hard partitioning algorithm, uses the same equations
as the K-means algorithm. The only difference is that in K-medoid the cluster
centers are the nearest data points to the mean of the data in one cluster V =
{v
i
∈ X|1 ≤ i ≤ c}. This can be useful when, for example, there is no continuity
in the data space. This implies that a mean of the points in one cluster does
actually not exist.
3.2.3 Fuzzy C-means
The Fuzzy C-means algorithm (FCM) minimizes an objective function, called
C-means functional, to define the clusters. The C-means functional, invented
by Dunn, is defined as follows:
J(X; U, V ) =
c
¸
i=1
N
¸
k=1

ik
)
m
||x
k
−v
i
||
2
A
, (3.19)
with
V = [v
1
, v
2
, ..., v
c
], v
i
∈ R
n
. (3.20)
V denotes the vector with the cluster centers that has to be determined. The
distance norm ||x
k
−v
i
||
2
A
is called a squared inner-product distance norm and
is defined by:
D
2
ikA
= ||k
k
−v
i
||
2
A
= (x
k
−v
i
)
T
A(x
k
−v
i
). (3.21)
On a statistical point of view, equation 3.19 measures the total number of vari-
ance of x
k
from v
i
. The minimization of the C-means functional can be seen as a
non linear optimization problem, that can be solved by a variety of methods. Ex-
amples of methods that can solve non linear optimization problems are grouped
coordinate minimization and genetic algorithms. The simplest method to solve
this problem is utilizing the Picard iteration through the first-order conditions
for the stationary points of equation 3.19. This method is called the fuzzy c-
means algorithm. To find the stationary points of the c-means functional, one
can adjoint the constrained in 3.14 to J by means of Lagrange multipliers:
¯
J(X; U, V, λ) =
c
¸
i=1
N
¸
k=1

ik
)
m
D
2
ikA
+
N
¸
k=1
λ
k

c
¸
i=1
µ
ik
−1

, (3.22)
and by setting the gradients of (
ˆ
J), with respect to U, V and λ, to zero. When
D
2
ikA
> 0, ∀i, k and m > 1, then the C-means functional may only be minimized
by (U, V ) ∈ M
fc
xR
nxc
if
µ
ik
=
1
¸
c
j=1
(D
ikA
/D
jkA
)
2/(m−1)
, 1 ≤ i ≤ c, 1 ≤ k ≤ N, (3.23)
28
and
v
i
=
¸
N
k=1
µ
m
ik
x
k
¸
N
k=1
µ
m
i,k
, 1 ≤ i ≤ c. (3.24)
The solution of these equations are satisfying the constraints that were given in
equation (3.13) and (3.15). Remark that the v
i
of equation (3.24) is the weighted
average of the data points that belong to a cluster and the weights represents the
membership degrees. This explains why the name of the algorithm is c-means.
The Fuzzy C-means algorithm is actually an iteration between the equations
(3.23) and (3.24). The FCM algorithm uses the standard Euclidean distance for
its computations. Therefor, it is able to define hyper spherical clusters. Note
that it can only detect clusters with the same shape, caused by the common
choice of the norm inducing matrix A = I. The norm inducing matrix can also
be chosen as an nxn diagonal matrix of the form:
A
D
=

¸
¸
¸
¸
(1/σ
1
)
2
0 · · · 0
0 (1/σ
2
)
2
· · · 0
.
.
.
.
.
.
.
.
.
.
.
.
0 0 · · · (1/σ
n
)
2
¸

. (3.25)
This matrix accounts for different variances in the directions of the coordinate
axes of X. Another possibility is to choose A as the inverse of the nxn covariance
matrix A = F
−1
, where
F =
1
N
N
¸
k=1
(x
k
− ˆ x)(x
k
− ˆ x)
T
(3.26)
and ˆ x denotes the mean of the data. Hence that, in this case, matrix A is based
on the Mahalanobis distance norm.
3.2.4 The Gustafson-Kessel algorithm
The Gustafson and Kessel (GK) algorithm is a variation on the Fuzzy c-means
algorithm [11]. It employs a different and adaptive distance norm to recognize
geometrical shapes in the data. Each cluster will have its own norm-inducing
matrix A
i
, satisfying the following inner-product norm:
D
2
ikA
= (x
k
−v
i
)
T
· A
i
(x
k
−v
i
), where 1 ≤ i ≤ c and 1 ≤ k ≤ N. (3.27)
The matrices A
i
are used as optimization variables in the c-means functional.
This implies that each cluster is allowed to adapt the distance norm to the local
topological structure of the data. A c-tuple of the norm-inducing matrices is
defined by A, where A = (A
1
, A
2
, ..., A
c
). The objective functional of the GK
algorithm can be calculated by:
J(X; U, V, A) =
c
¸
i=1
N
¸
k=1
(u
ik
)
m
D
2
ikAi
. (3.28)
29
If A is fixed, the conditions under (3.13), (3.14) and (3.15) can be applied
without any problems. Unfortunately, equation (3.28) can not be minimized
in a straight forward manner, since it is linear in A
i
. This implies that J can
be made as small as desired by making A
i
less positive definite. To avoid this,
A
i
has to be constrained to obtain a feasible solution. A general way to this
is by constraining the determinant of the matrix. A varying A
i
with a fixed
determinant relates to the optimization of the cluster whit a fixed volume:
||A
i
|| = ρ
i
, ρ > 0. (3.29)
Here ρ is a remaining constant for each cluster. In combination with the La-
grange multiplier, A
i
can be expressed in the following way:
A
i
= [ρ
i
det(F
i
)]
1/n
F
−1
i
, (3.30)
with
F
i
=
¸
N
k=1

ik
)
m
(x
k
−v
i
)(x
k
−v
i
)
T
¸
N
k=1

ik
)
m
. (3.31)
F
i
is also called the fuzzy covariance matrix. Hence that this equation in com-
bination with equation (3.30) can be substituted into equation (3.27). The
outcome of the inner-product norm of (3.27) is a generalized squared Maha-
lanobis norm between the data points and the cluster center. The covariance is
weighted by the membership degrees of U.
3.2.5 The Gath Geva algorithm
Bezdek and Dunn [5] proposed a fuzzy maximum likelihood estimation (FMLE)
algorithm with a corresponding distance norm:
D
ik
(x
k
, v
i
) =

det(F
wi
)
α
i

1
2
(x
k
−v
(l)
i
)
T
F
−1
wi
(x
k
−v
(l)
i
)

, (3.32)
Comparing this with the Gustafson-Kessel algorithm, the distance norm includes
an exponentional term. This implies that this distance norm will decrease faster
than the inner-product norm. In this case, the fuzzy covariance matrix Fi is
defined by:
F
wi
=
¸
N
k=1

ik
)
w
(x
k
−v
i
)(x
k
−v
i
)
T
¸
N
k=1

ik
)
w
, 1 ≤ i ≤ c. (3.33)
The reason for using the w variable is to generalize this expression. In the origi-
nal FMLE algorithm, w = 1. In this research, w will be set to 2, to compensate
the exponential term and obtain clusters that are more fuzzy. Because of the
generalization, two weighted covariance matrices arise. The variable α
i
in equa-
tion (3.32) is the prior probability of selecting cluster i. α
i
can be defines as
follows:
α
i
=
1
N
N
¸
k=1
µ
ik
. (3.34)
30
Gath and Geva [9] discovered that the FMLE algorithm is able to detect clusters
of different shapes, sizes and densities and that the clusters are not constrained
in volume. The main drawback of this algorithm is the robustness, since the
exponential distance norm can converge to a local optimum. Furthermore, it is
not know how reliable the results of this algorithm are.
3.3 Validation
Cluster validation refers to the problem whether a found partition is correct and
how to measure the correctness of a partition. A clustering algorithm is designed
to parameterize clusters in a way that it gives the best fit. However, this does
not apply that the best fit is meaningful at all. The number of clusters might
not be correct or the cluster shapes do not correspond to the actual groups in
the data. In the worst case, the data can not be grouped in a meaningful way at
all. One can distinguish two main approaches to determine the correct number
of clusters in the data:
• Start with a sufficiently large number of clusters, and successively reducing
this number by combining clusters that have the same properties.
• Cluster the data for different values of c and validate the correctness of
the obtained clusters with validation measures.
To be able to perform the second approach, validation measures has to be de-
signed. Different validation methods have been proposed in the literature, how-
ever, none of them is perfect by oneself. Therefore, in this research are used
several indexes, which are described below:
• Partition Coefficient (PC): measures the amount of ”overlapping” be-
tween clusters. It is defined by Bezdek [5] as follows:
PC(c) =
1
N
c
¸
i=1
N
¸
j=1
(u
ij
)
2
(3.35)
where u
ij
is the membership of data point j in cluster i. The main draw-
back of this validity measure is the lack of direct connection to the data
itself. The optimal number of clusters can be found by the maximum
value.
• Classification Entropy (CE): measures only the fuzziness of the cluster,
which is a slightly variation on the Partition Coefficient.
CE(c) = −
1
N
c
¸
i=1
N
¸
j=1
u
ij
log(u
ij
) (3.36)
31
• Partition Index (PI): expresses the ratio of the sum of compactness and
separation of the clusters. Each individual cluster is measured with the
cluster validation method. This value is normalized by dividing it by the
fuzzy cardinality of the cluster. To receive the Partition index, the sum
of the value for each individual cluster is used.
PI(c) =
c
¸
i=1
¸
N
j=1
(u
ij
)
m
||x
j
−v
i
||
2
N
i
¸
c
k=1
||v
k
−v
i
||
2
(3.37)
PI is mainly used for the comparing of different partitions with the same
number of clusters. A minor value of a SC means a better partitioning.
• Separation Index (SI): in contrast with the partition index (PI), the
separation index uses a minimum-distance separation to validate the par-
titioning.
SI(c) =
¸
c
i=1
¸
N
j=1
(u
ij
)
2
||x
j
−v
i
||
2
N min
i,k
||v
k
−v
i
||
2
(3.38)
• Xie and Beni’s Index (XB): is a method to quantify the ratio of the
total variation within the clusters and the separations of the clusters [3].
XB(c) =
¸
c
i=1
¸
N
j=1
(u
ij
)
m
||x
j
−v
i
||
2
N min
i,j
||x
j
−v
i
||
2
(3.39)
The lowest value of the XB index should indicate the optimal number of
clusters.
• Dunn’s Index (DI): this index was originally designed for the identifica-
tion of hard partitioning clustering. Therefor, the result of the clustering
has to be recalculated.
DI(c) = min
i∈c
{ min
j∈c,i=j
{
min
x∈Ci,y∈Cj
d(x, y)
max
k∈c
{max
x,y∈C
d(x, y)}
}} (3.40)
The main disadvantage of the Dunn’s index is the very expansive compu-
tational complexity as c and N increase.
• Alternative Dunn Index (ADI):To simplify the calculation of the
Dunn index, the Alternative Dunn Index was designed. This will be the
case when the dissimilarity between two clusters, measured with min
x∈Ci,y∈Cj
d(x, y),
is rated in under bound by the triangle-inequality:
d(x, y) ≥ |d(y, v
j
) −d(x, v
j
)| (3.41)
were v
j
represents the cluster center of the j-th cluster.
ADI(c) = min
i∈c
{ min
j∈c,i=j
{
min
xi∈Ci,xj∈Cj
|d(y, v
j
) −d(x
i
, v
j
)|
max
k∈c
{max
x,y∈C
d(x, y)}
}} (3.42)
32
Note, that the Partition Coefficient and the Classification Entropy are only
useful for fuzzy partitioned clustering. In case of fuzzy clusters the values of the
Dunn’s Index and the Alternative Dunn Index are not reliable. This is caused
by the repartitioning of the results with the hard partition method.
3.4 Visualization
To understand the data and the results of the clustering methods, it is useful
to visualize the data and the results. However, the used data set is a high-
dimensional data set, which can not be plotted and visualized directly. This
section describes three methods that can map the data points into a lower
dimensional space.
In this research, the three mapping methods will be used for the visualization
of the clustering results. The first method is the Principal Component Analysis
(PCA), a standard and a most widely method to map high-dimensional data
into a lower dimensional space. Then, this report will focus on the Sammon
mapping method. The advantage of the Sammon mapping is the ability to
preserve inter pattern distances. This kind of mapping of distances is much
closer related to the proposition of clustering than saving the variances (which
will be done by PCA). However, the Sammon mapping application has two main
drawbacks:
• Sammon mapping is a projection method, which is based on the preser-
vation of the Euclidean inter point distance norm. This implies that the
Sammon mapping only can be applied on clustering algorithms that use
the Euclidean distance norm during the calculations of the clusters.
• The Sammon mapping method aims to find in a high n-dimensional space
N points in a lower q-dimensional subspace, such in a way the inter
point distances correspond to the distances measured in the n-dimensional
space. To achieve this, a computational expensive algorithm is needed, be-
cause in every iteration step a computation of N(N − 1)/2 distances is
required.
To avoid these problems of the Sammon mapping method, a modified algorithm,
called the Fuzzy Sammon mapping, is used during this research. A draw back
of this Fuzzy Sammon mapping is the loose of precision in distance, since only
the distance between the data points and the cluster centers considered to be
important.
The three visualisation methods will be explained in more detail in the following
subsections.
3.4.1 Principal Component Analysis
Principal component analysis (PCA) include a mathematical procedure that
maps a number of correlated variables into a smaller set of uncorrelated vari-
ables, called the principal components. The first principal component represents
33
as much of the variability in the data as possible. The succeeding components
describe the remaining variability. The main goals of the PCA method are:
• Identifying new meaningful underlying variables.
• Discovering and/or reducing the dimensionality of a data set.
In a mathematical way, the principal components will be achieved by analyzing
the eigenvectors and eigenvalues. The direction of the first principal component
is diverted from the eigenvector with the largest eigenvalue. The eigenvalue
associated with the second largest eigenvalue correspond to the second principal
component, etc. In this research, the second objective is used. In this case, the
covariance matrix of the data set can be described by:
F =
1
N
(x
k
−v)(x
k
−v)
T
, (3.43)
where v = ¯ x
k
. Principal Component Analysis is based on the projection of
correlated high-dimensional data onto a hyperplane [3]. This methods uses
only the first q nonzero eigenvalues and the corresponding eigenvectors of the
covariance matrix:
F
i
= U
i
Λ
i
U
T
i
. (3.44)
With Λ
i
as a matrix that contains the eigenvalues λ
i,j
of F
i
in its diagonal in
decreasing order and U
i
is a matrix containing the eigenvectors corresponding
to the eigenvalues in its columns. Furthermore, there is a q-dimensional reduced
vector that represents the vector x
k
of X, which can be defined as follows:
y
i,k
= W
−1
i
(x
k
) = W
T
i
(x
k
). (3.45)
The weight matrix W
i
contains the q principal orthonormal axes in its column:
W
i
= U
i,q
Λ
1
2
i,q
. (3.46)
3.4.2 Sammon mapping
As mentioned before,the Sammon mapping uses inter point distance measures to
find N points in a q-dimensional data space, which are representative for a higher
n-dimensional data set. The inter point distance measure of the n-dimensional
space, defined by d
ij
= d(x
i
, x
j
) correspond to the inter point distances in the
q-dimensional space, given by d

ij
= d

(y
i
, y
j
). This is achieved by Sammon’s
stress, a minimization criterion of the error:
E =
1
λ
N−1
¸
i=1
N
¸
j=i+1
(d
ij
−d

ij
)
2
d
ij
, (3.47)
where λ is a constant:
λ =
¸
i<j
d
ij
=
N−1
¸
i=1
N
¸
j=i+1
d
ij
. (3.48)
34
Note that there is no need to maintain λ, since a constant does not change
the result of the optimization process. The minimization of the error E is
an optimization problem in the Nxq variables y
il
, with i ∈ {1, 2, ..., N} and
l ∈ {1, 2, ..., q} which implies that y
i
= [y
i1
, ..., y
iq
]
T
. The rating of y
il
at the
t-th iteration can defined by:
y
il
(t + 1) = y
il
(t) −α

∂E(t)
∂y
il
(t)

2
E(t)
∂y
2
il
(t)
¸
¸
, (3.49)
where α is a nonnegative scalar constant, with a recommended value α 0.3 −
0.4. This scalar constant represents the step size for gradient search in the
direction of
∂E(t)
∂y
il
(t)
= −
2
λ
N
¸
k=1,k=i
¸
d
ki
−d

ki
d
ki
d

ki

(y
il
−y
kl
) (3.50)

2
E(t)
∂y
2
il
(t)
= −
2
λ
N
¸
k=1,k=i
1
d
ki
d

ki
¸
(d
ki
−d

ki
) −

(y
il
−y
kl
)
2
d

ki

1 +
d
ki
−d

ki
d
ki

(3.51)
With this gradient-descent method, it is not possible to reach a local minimum in
the error surface, while searching for the minimum of E. This is a disadvantage,
because multiple experiments with different random initializations are necessary
to find the minimum. However, it is possible to estimate the correct initialization
based on the information which is obtained from the data.
3.4.3 Fuzzy Sammon mapping
As mentioned in the introduction of this section, Sammon’s mapping has several
drawbacks. To avoid this drawbacks, a modified mapping method is designed
which takes into account the basic properties of fuzzy clustering algorithms
where only the distance between the data points and the clustering centers are
considered to be important [3]. The modified algorithm, called Fuzzy Sammon
mapping, uses only N∗c distances, weighted by the membership values similarly
to equation (3.19):
E
fuzz
=
c
¸
i=1
N
¸
k=1

ki
)
m
(d(x
k
, v
i
) −d

ki
)
2
, (3.52)
with d(x
k
, v
i
) representing the distance between data point x
k
and the cluster
center v
i
in the original n-dimensional space. The Euclidean distance between
the cluster center z
i
and the data y
k
of the projected q-dimensional space is
represented by d

(y
k
, z
i
). According to this information, in a projected two
dimensional space every cluster is represented by a single point, independently
to the shape of the original cluster. The Fuzzy Sammon mapping algorithm is
similar to the original Sammon mapping, but in this case the projected cluster
35
center will be recalculated in every iteration after the adaption of the projected
data points. The recalculation will be based on the weighted mean formula of
the fuzzy clustering algorithms, described in Section 3.2.3 (equation 3.19).
The membership values of the projected data can be plotted based on the stan-
dard equation for the calculation of the membership values:
µ

ki
=
1
¸
c
j=1

d

(x
k
,ηi)
d

(x
k
,vj)
2
m−1
, (3.53)
where U

= [µ

ki
] is the partition matrix with the recalculated memberships.
The plot will only give an approximation of the high dimensional clustering in
a two dimensional space. To measure the quality of this rating, an evaluation
function that determines the mean square error between the original and the
recalculated error can be defined as follows:
P = ||U −U

||. (3.54)
In the next chapter, the cluster algorithms will be tested and evaluated. The
PCA and the (Fuzzy) Sammon mapping methods will be used to visualize the
data and the clusters.
36
Chapter 4
Experiments and results of
customer segmentation
In this chapter, the cluster algorithms will be tested and their performance will
be measured with the proposed validation methods of the previous chapter. The
best working cluster method will be used to determine the segments. The chap-
ter ends with an evaluation of the segments.
4.1 Determining the optimal number of clusters
The disadvantage of the proposed cluster algorithms is the number of clusters
that has to be given in advance. In this research the number of clusters is not
known. Therefor, the optimal number of clusters has to be searched with the
given validation methods of Section 3.3. For each algorithm, calculations for
each cluster, c ∈ [215], were executed. To find the optimal number of clusters,
a process called Elbow Criterion is used. The elbow criterion is a common rule
of thumb to determine what number of clusters should be chosen. The elbow
criterion says that one should choose a number of clusters so that adding an-
other cluster does not add sufficient information. More precisely, by graphing
a validation measure explained by the clusters against the number of clusters,
the first clusters will add much information (explain a lot of variance), but at
some point the marginal gain will drop, giving an angle in the graph (the el-
bow). Unfortunately, this elbow can not always be unambiguously identified.
To demonstrate the working of the elbow criterion, the feature values that rep-
resent the call behavior of the customers, as described in Section 2.1.2, are used
as input for the cluster algorithms. From the 800,000 business customers of
Vodafone, 25,000 customers were randomly selected for the experiments. More
customers would lead to computational problems. First, the K-means algorithm
will be evaluated. The values of the validation methods depending on the num-
ber of clusters will be plotted. The value of the Partition Coefficient is for all
37
clusters 1, and the classification entropy is always ’NaN’. This is caused by the
fact that these 2 measures were designed for fuzzy partitioning methods, and
in this case the hard partitioning algorithm K-means is used. In Figure 4.1,
the values of the Partition Index, Separation Index and Xie and Beni’s Index
are shown. Mention again, that no validation index is reliable only by itself.
Figure 4.1: Values of Partition Index, Separation Index and the Xie Beni Index
Therefor, all the validation indexes are shown. The optimum could differ by
using different validation methods. This means that the optimum only could
be detected by the comparison of all the results. To find the optimal number of
cluster, partitions with less clusters are considered better, when the difference
between the values of the validation measure are small. Figure 4.1 shows that for
the PI and SI, the number of clusters easily could be rated to 4. For the Xie and
Beni index, this is much harder. The elbow can be found at c = 3, c = 6, c = 9
or c = 13, depending on the definition and parameters of an elbow. In Figure
4.2 there are more informative plots shown. The Dunn’s index and the Alterna-
tive Dunn’s index confirm that the optimal number of clusters for the K-means
algorithm should be chosen to 4. The values of all the validation measures for
the K-means algorithm, are embraced in table 4.1
38
Figure 4.2: Values of Dunn’s Index and the Alternative Dunn Index
c 2 3 4 5 6 7 8
PC 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
CE NaN NaN NaN NaN NaN NaN NaN
PI 3.8318 1.9109 1.1571 1.0443 1.2907 0.9386 0.8828
SI 0.0005 0.0003 0.0002 0.0002 0.0002 0.0002 0.0002
XBI 5.4626 4.9519 5.0034 4.3353 3.9253 4.2214 3.9079
DI 0.0082 0.0041 0.0034 0.0065 0.0063 0.0072 0.0071
ADI 0.0018 0.0013 0.0002 0.0001 0.0001 0.0001 0.0000
c 9 10 11 12 13 14 15
PC 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
CE NaN NaN NaN NaN NaN NaN NaN
PI 0.8362 0.8261 0.8384 0.7783 0.7696 0.7557 0.7489
SI 0.0002 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001
XBI 3.7225 3.8620 3.8080 3.8758 3.4379 3.3998 3.5737
DI 0.0071 0.0052 0.0061 0.0070 0.0061 0.0061 0.0061
ADI 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Table 4.1: The values of all the validation measures with K-means clustering
39
It is also possible to define the optimal numbers of clusters for fuzzy cluster-
ing algorithms with this method. To illustrate this, the results of the Gustafson-
Kessel algorithm will be shown. In Figure 4.3 the results of the Partition Index
and the Classification Entropy are plotted. Compared to the hard clustering
methods, the validation methods can be used now for the fuzzy clustering. How-
ever, the main drawback of PC is the monotonic decreasing with c, which makes
it hardly to detect the optimal number of cluster. The same problem holds for
CE: monotonic increasing, caused by the lack of direct connection to the data.
The optimal number of cluster can not be rated based on those two validation
methods. Figure 4.4 gives more information about the optimal number of clus-
Figure 4.3: Values of Partition coefficient and Classification Entropy with
Gustafson-Kessel clustering
ters. For the PI and the SI, the local minimum is reached at c = 6. Again,
for the XBI, it is difficult to find the optimal number of clusters. The points
at c = 3, c = 6 and c = 11, can be seen as an elbow. In Figure 4.5, the Dunn
index also indicates that the optimal number of clusters should be at c = 6. On
the other hand, the Alternative Dunn index, has an elbow at the point c = 3.
However, for the Alternative Dunn Index is not known how reliable its results
are, so the optimal number of clusters for the Gustafson-Kessel algorithm will be
six. The results of the validation measures for the Gustafson-Kessel algorithm
are written in table 4.2. This process can be repeated for all other cluster algo-
rithms. The results can be found in Appendix B. For the K-means, K-medoid
and the Gath-Geva,the optimal number of clusters is chosen at c = 4. For the
other algorithms, the optimal number of clusters is located at c = 6.
40
Figure 4.4: Values of Partition Index, Separation Index and the Xie Beni Index
with Gustafson-Kessel clustering
Figure 4.5: Values of Dunn’s Index and Alternative Dunn Index with Gustafson-
Kessel clustering
41
c 2 3 4 5 6 7 8
PC 0.6462 0.5085 0.3983 0.3209 0.3044 0.2741 0.2024
CE 0.5303 0.8218 1.0009 1.2489 1.4293 1.5512 1.7575
PI 0.9305 1.2057 1.5930 1.9205 0.8903 0.7797 0.8536
SI 0.0002 0.0003 0.0007 0.0004 0.0001 0.0001 0.0002
XBI 2.3550 1.6882 1.4183 1.1573 0.9203 0.9019 0.7233
DI 0.0092 0.0082 0.0083 0.0062 0.0029 0.0041 0.0046
ADI 0.0263 0.0063 0.0039 0.0018 0.0007 0.0001 0.0009
c 9 10 11 12 13 14 15
PC 0.2066 0.1611 0.1479 0.1702 0.1410 0.1149 0.1469
CE 1.8128 2.0012 2.0852 2.0853 2.2189 2.3500 2.3046
PI 0.9364 0.7293 0.7447 0.7813 0.7149 0.6620 0.7688
SI 0.0002 0.0001 0.0002 0.0002 0.0001 0.0001 0.0001
XBI 0.5978 0.5131 0.4684 0.5819 0.5603 0.5675 0.5547
DI 0.0039 0.0030 0.0028 0.0027 0.0017 0.0015 0.0006
ADI 0.0003 0.0002 0.0004 0.0002 0.0000 0.0001 0.0000
Table 4.2: The values of all the validation measures with Gustafson-Kessel
clustering
4.2 Comparing the clustering algorithms
The optimal number of cluster can be determined with the validation methods,
as mentioned in the previous section. The validation measures can also be
used to compare the different cluster methods. As examined in the previous
section, the optimal number of clusters was found at c = 4 or c = 6, depending
on the clustering algorithm. The validation measures for c = 4 and c = 6 of
all the clustering methods are collected in the tables 4.3 and 4.4. Table 4.3
PC CE PI SI XBI DI ADI
K-means 1 NaN 1.1571 0.0002 5.0034 0.0034 0.0002
K-medoid 1 NaN 0.2366 0.0001 Inf 0.0084 0.0002
FCM 0.2800 1.3863 0.0002 42.2737 1.0867 0.0102 0.0063
GK 0.3983 1.0009 1.5930 0.0007 1.4183 0.0083 0.0039
GG 0.4982 1.5034 0.0001 0.0001 1.0644 0.0029 00030
Table 4.3: The numerical values of validation measures for c = 4
and 4.4 show that the PC and CE are useless for the hard clustering methods
K-means and K-medoid. On the score of the values of the three most used
indexes, Separation index, Xie and Beni’s index and Dunn’s index, one can
conclude that for c = 4 the Gath-Geva algorithm has the best results and
for c = 6 the Gustafson-Kessel algorithm. To visualize the clustering results,
the validation methods that are described in Section 3.4 can be used. With
these visualization methods, the dataset can be reduced to a 2-dimensional
space. To avoid visibility problems (plotting too much values will cause one
42
PC CE PI SI XBI DI ADI
K-means 1 NaN 1.2907 0.0002 3.9253 0.0063 0.0001
K-medoid 1 NaN 0.1238 0.0001 Inf 0.0070 0.0008
FCM 0.1667 1.7918 0.0001 19.4613 0.9245 0.0102 0.0008
GK 0.3044 1.4293 0.8903 0.0001 0.9203 0.0029 0.0007
GG 0.3773 1.6490 0.1043 0.0008 1.0457 0.0099 0.0009
Table 4.4: The numerical values of validation measures for c = 6
big cloud of data points), only 500 values (representing 500 customers) from
this 2-dimensional dataset will be randomly picked. For the K-means and the
K-medoid algorithm, the Sammon’s mapping gives the best visualization of
the results. For the other cluster algorithms, the Fuzzy Sammon’s mapping
visualization gives the best projection with respect to the partitions of the data
set. These visualization methods are used for the following plots. Figures 4.x-
4.x show the different clustering results for c = 4 and c = 6 on the data set.
Figure 4.6 and 4.7 show that hard clustering methods can find a solution
Figure 4.6: Result of K-means algorithm
for the clustering problem. None of the clusters contain sufficient more or less
customers than other clusters. The plot of the Fuzzy C-means algorithm, in
Figure 4.8, shows unexpected results. For the situation with 4 clusters, there
are only 2 clusters clearly visible. By a detailed look at the plot, one can see that
there are actually 4 cluster centers, but the cluster centers are almost situated
on the same location. In the situation with 6 clusters, one can see three big
cluster, with one small cluster in one of the big clusters. The other two cluster
centers are nearly invisible. This implies that the Fuzzy C-means algorithm is
not able to find good clusters for this data set. In Figure 4.9, the results of the
Gustafon-Kessel algorithm are plotted. For both situations, the clusters are well
separated. Note that the cluster in the left bottom corner and the cluster in the
43
Figure 4.7: Result of K-medoid algorithm
Figure 4.8: Result of Fuzzy C-means algorithm
Figure 4.9: Result of Gustafson-Kessel algorithm
44
top right corner in Figure 4.9 are also maintained in the situation with 6 clusters.
This may indicate that the data points in these clusters represents customers
that differ on multiple fields with the other customers of Vodafone. The results
Figure 4.10: Result of Gath-Geva algorithm
of the Gath-Geva algorithm, visualized in Figure 4.10, for the situation c = 4
look similar to the result of the Gustafson-Kessel algorithm. The result for the
c = 6 situation is remarkable. Here are also appearing clusters in other clusters.
In the real high-dimensional situation, the clusters are not a subset of each
other, but are separated. The fact that this is the case in the two-dimensional
plot, indicates that a clustering with six clusters with the Gustafson-Kessel
algorithms not a good solution. With the results of the validation methods and
the visualization of the clustering, one can conclude that there are two possible
best solutions: The Gath-Geva algorithm for c = 4 and the Gustafson-Kessel
algorithm for c = 6. To determine which partitioning will be used to define
the segments, a closer look to the meaning of the clusters will be needed. In
the next section, the two different partitions will be closely compared with each
other.
4.3 Designing the segments
To define which clustering method will be used for the segmentation, one can
look at the distances from the points to each cluster. In Figure 4.11 and 4.12, two
box plots of the distances from the data points to the cluster are plotted. The
box indicates the upper and lower quartiles. In both situations, the results show
that the clusters are homogeneous. This indicates that, based on the distances
to the cluster, one can not distinguish between the two cluster algorithms.
Another way to view the differences between the cluster methods is to profile
the clusters. For each cluster, a profile can be made by drawing a line between
all normalized feature values (each feature value is represented at the x-as)
of the customers within this cluster. The result is visible for the Gath-Geva
algoithm for c = 4 and for the Gustafson-Kessel algorithm with six clusters.
45
Figure 4.11: Distribution of distances from cluster centers within clusters for
the Gath-Geva algorithm with c = 4
Figure 4.12: Distribution of distances from cluster centers within clusters for
the Gustafson-Kessel algorithm with c = 6
46
The profiles of the different clusters do not differ much in shape. However, in
each cluster, at least one value differs sufficient from the values of the other
cluster. This confirms the assumption that customers of different clusters have
indeed a different usage behavior. Most of the lines in one profile are drawn
closely together. This means that the customers in one profile contain similar
values of the feature values.
Figure 4.13: Cluster profiles for c = 4
47
Figure 4.14: Cluster profiles for c = 6
48
More relevant plots are shown in Figure 4.15 and ??. The mean of all the
lines (equivalent to the cluster center) was calculated and a line between all the
(normalized) feature vales was drawn. The difference between the clusters are
visible by some feature values. For instance, in the situation with four clusters,
Cluster 1 has customers, compared with other cluster, have a high feature value
at feature 8. Cluster 2 has high values at position 6 and 9, while Cluster 3
contains peaks at features 2 and 12. The 4th and final cluster has high values
at feature 8 and 9.
Figure 4.15: Cluster profiles of centers for c = 4
49
Figure 4.16: Cluster profiles of centers for c = 6
50
With the previous clustering results, validation measures and plots, it is not
possible to decide which of the two clustering methods gives a better result.
Therefor, both results will be used as a solution for the customer segmentation.
For the Gath-Geva algorithm with c = 4 and the Gustafson-Kessel algorithm
with c = 6, table 4.5 shows the result of the customer segmentation. The feature
Feature 1 2 3 4 5 6
Average 119.5 1.7 3.9 65.8 87.0 75.7
Segment 1 (27.2%) 91.3 0.9 2.9 54.8 86.6 58.2
c = 4 Segment 2 (28.7%) 120.1 1.8 3.6 73.6 87.1 93.7
Segment 3 (23.9%) 132.8 2.4 4.4 60.1 86.7 72.1
Segment 4 (20.2%) 133.8 1.7 4.7 74.7 87.6 78.8
Segment 1 (18.1%) 94.7 1.2 2.8 66.3 88.0 72.6
Segment 2 (14.4%) 121.8 1.7 4.1 65.9 86.4 73.0
c = 6 Segment 3 (18.3%) 121.6 2.5 4.9 66.0 84.3 71.5
Segment 4 (17.6%) 126.8 1.6 4.0 65.7 87.3 71.2
Segment 5 (14.8%) 96.8 1.1 3.5 65.2 88.6 92.9
Segment 6 (16.8%) 155.3 2.1 4.1 65.7 87.4 73.0
Feature 7 8 9 10 11 12
Average 1.6 3.7 2.2 14.4 6.9 25.1
Segment 1 (27.2%) 1.7 4.0 1.6 12.3 6.2 12.2
c = 4 Segment 2 (28.7%) 1.2 3.1 2.1 12.8 6.6 30.6
Segment 3 (23.9%) 1.4 3.4 2.1 22.4 9.4 39.7
Segment 4 (20.2%) 2.1 4.3 3.0 10.1 5.4 17.9
Segment 1 (18.1%) 2.3 4.5 1.8 11.3 6.1 13.5
Segment 2 (14.4%) 1.6 3.7 1.9 17.8 9.5 40.4
c = 6 Segment 3 (18.3%) 1.0 2.9 2.9 15.1 6.6 26.9
Segment 4 (17.6%) 1.5 3.6 1.9 15.0 6.2 24.0
Segment 5 (14.8%) 0.8 2.9 1.8 12.4 6.1 23.1
Segment 6 (16.8%) 2.4 4.6 2.9 14.8 6.9 22.7
Table 4.5: Segmentation results
numbers correspond to the feature numbers of Section 2.1.2. (Feature 1 is the
call duration, feature 2 the received voices calls and feature 3 the originated
calls, feature 4 the daytime calls, feature 5 the weekday calls, 6 are calls to
mobile phones, 7 received sms, 8 originated sms, feature 9 the international
calls, feature 10 the calls to Vodafone mobiles, 11 the unique are codes and
feature 12 the number of different numbers called). In words, the segments can
be described as follows: For the situation with 4 segments:
• Segment 1: In this segment are customers with a relative low number of
voice calls. This customers call more in the evening (in proportion) and to
fixed lines then other customers. Their sms usage is higher then normal.
The number of international calls is low.
• Segment 2: This segment contains customers with an average voice call
51
usage. They call often to mobile phones during day time. They do not
send and receive many sms messages.
• Segment 3: The customers in this segment make relative many voice
calls. These customers call to many different numbers and have a lot of
contacts which are Vodafone customers.
• Segment 4: These customers originate many voice calls. They also send
and receive many sms messages. They call often during daytime and call
more then average to international numbers. Their call duration is high.
Remarkable is the fact that they don not have so many contacts as the
number of calls do suspect. They have a relative small number of contacts.
For the situation with 6 segments, the customers in this segments can be de-
scribed as follows:
• Segment 1: In this segment are customers with a relative low number
of voice calls. Their average call duration is also lower than average.
However, their sms usage is relative high. These customers do not call to
many different numbers.
• Segment 2: This segment contains customers with a relative high number
of contacts. They also call to many different areas. They have also more
contacts with a Vodafone mobile.
• Segment 3: The customers in this segment make relative many voice
calls. Their sms usage is low. In proportion, they make more international
phone calls than other customers.
• Segment 4: These customers are the average customers. None of the
feature values is high or low.
• Segment 5: These customers do not receive many voice calls. The aver-
age call duration is low. They also receive and originate a low number of
sms messages.
• Segment 6: These customers originate and receive many voice calls.
They also send and receive many sms messages. The duration of their
voice calls is longer than average. The percentage of international calls is
high.
In the next session the classification method Support Vector Machine will be
explained. This technique will be used to classify/estimate the segment of a
customer by personal information as age, gender and lifestyle (the customer
data of Section 2.1.3).
52
Chapter 5
Support Vector Machines
A Support Vector Machine is a algorithm that learns by example to assign
labels to objects [16]. In this research a Support Vector machine will be used
to recognize the segment of a customer by examining thousands of customers
(e.g. the customer data features of Section 2.1.3) of each segment. In general, a
Support Vector Machine is a mathematical entity; an algorithm for maximizing
a particular mathematical function with respect to a given collection of data.
However, the basic ideas of Support Vector Machines can be explained without
any equations. The next few sections will describe the four basic concepts:
• The separating hyper plane
• The maximum-margin hyperplane
• The soft margin
• The kernel function
For now, to allow an easy, geometric interpretation of the data, imagine that
there exists only two segments. In this case the customer data consist of 2
feature values, age and income, which can be easily plotted. The green dots
represent the customers that are in segment 1 and the red dots are customers
that are in segment 2. The goal of the SVM is learn to tell the difference between
the groups and, given an unlabeled customer, such as the one labeled ’Unknown’
in Figure 5.1, predict whether it corresponds to segment 1 or segment 2.
5.1 The separating hyperplane
A human being is very good at pattern recognition. Even a quick glance at Fig-
ure 5.1a shows that the green dots form a group and the reds dots form another
group that can easily be separated by drawing a line between the two groups
(Figure 5.1b). Subsequently, predicting the label of an unknown customer is
simple: one simply needs to ask whether the new customer falls on the segment
53
(a) Two-dimensional representation of the
customers
(b) A separating hyperplane
Figure 5.1: Two-dimensional customer data of segment 1 and segment 2
1 or the segment 2 side of the separating line. Now, to define the notion of
a separating hyperplane, consider the situation where there are not just two
feature values to describe the customer. For example, if there was just 1 feature
value to describe the customer, then the space in which the corresponding one-
dimensional feature resides is a one-dimensional line. This line can be divided
in half by using a single point (see Figure 5.2a). In two dimensions, a straight
line divides the space in half (remember Figure 5.1b) In a three-dimensional
space, a plane is needed to divide the space, illustrated in Figure 5.2b. This
procedure can be extrapolated mathematically in higher dimensions. The term
for a straight line in a high-dimensional space is a hyperplane. So the term
separating hyperplane is, essentially, the line that separates the segments.
(a) One dimension (b) Three dimensions
Figure 5.2: Separating hyperplanes in different dimensions
54
5.2 The maximum-margin hyperplane
The concept of treating objects as points in a high-dimensional space and finding
a line that separates them, is a common way of classification, and therefore not
unique to the SVM. However, the SVM differs from all other classifier methods
by virtue of how the hyperplane should be selected. Consider again the classifi-
cation problem of Figure 5.1a The goal of SVM is to find a line that separates
the segment 1 customers from the segment 2 customers. However, there are
an infinite number of possible lines, as portrayed in Figure 5.2 The question is
which line should be chosen as the optimal classifier and how should the optimal
line be defined. A logical way of selecting the optimal line, is selecting the line
that is, roughly speaking, ’in the middle’. In other words, the line that sepa-
rates the two segments and adopts the maximal distance from any of the given
customers (see Figure 5.2). It is not surprising that a theorem of the statistical
learning theory is supporting this choice [6]. By defining the distance from the
hyperplane to the nearest customer (in general an expression vector) as the mar-
gin of the hyperplane, the SVM selects the maximum separating hyperplane.
By selecting this hyper plane, the SVM is able to predict the unknown segment
of the customer in Figure 5.1a. The vectors (points) that constrain the width
of the margin are the support vectors. This theorem, is in many ways, the key
(a) Many possibilities (b) The maximum-margin hyperplane
Figure 5.3: Demonstration of the maximum-margin hyperplane
to the success of Support Vector Machines. However, there are a some remarks
and caveats to deal with. First at all, the theorem is based on the fact that the
data on which the SVM is trained are drawn from the same distribution as the
data it has to classify. This is of course logical, since it is not reasonable that
a Support Vector machine trained on customer data is able to classify different
car types. More relevantly, it is not reasonable to expect that the SVM can
classify well if the training data set is prepared with a different protocol then
the test data set. On the other hand, the theorem of a SVM indicates that the
two data sets has to be drawn from the same distribution. For example, a SVM
55
does not assume that the data is drawn from a normal distribution.
5.3 The soft margin
So far, the theory assumed that the data can be separated by a straight line.
However, many real data sets are not cleanly separable by a straight line, for
example the data of Figure 5.4a. In this figure, the data contains an error
object. A intuitively way to deal with the problems of errors is designing the
SVM in such a way that it allows a few anomalous customers to fall on the
’wrong side’ of separation line. This can be achieved by adding a ’soft margin’
to the SVM. The soft margin allows a small percentage of the data points
to push their way through the margin of the separating hyperplane without
affecting the final result. With the soft margin, the data set of Figure 5.4a will
be separated in the way it is illustrated in Figure 5.3 The customer can be seen
as an outlier and resides on the same side of the line with customers of segment
1. Of course, a SVM should not allow too many misclassification. Note, that
(a) Data set containing one error (b) Separating with soft margin
Figure 5.4: Demonstration of the soft margin
with the introduction of the soft margin, a user-specified parameter is involved
that controls the soft margin and, roughly, controls the number of customers
that is allowed to violate the separation line and determines how far across the
line they are allowed. Setting this parameter is a complicated process, by the
fact that a large margin will be achieved with respect to the number of correct
classifications. In other words, the soft margin specifies a trade-off between
hyper plane violations and the size of the margin.
5.4 The kernel functions
To understand the notion of a kernel function, the example data will be sim-
plified even further. Assume that, instead of a two-dimensional data set, there
56
is a one-dimensional data set, as seen before in Figure 5.1. In that case, the
separating hyperplane was a single point. Now, consider the situation of Figure
5.4, which illustrates an non separable data set. No single point can separate
the two segments and introducing a soft margin would not help. A kernel func-
tion provides a solution to this problem. The kernel function adds an extra
dimension to the data, in this case by squaring the one dimensional data set.
The result is plotted in Figure 5.4. Within the new higher dimensional space,
as shown in the figure, the SVM can separate the data in two segments by one
straight line. In general, the kernel function can be seen as a mathematical
trick for the SVM to project data from a low-dimensional space to a space of
higher dimensions. If one chooses a good kernel function, the data will become
separable in the corresponding higher dimension. To understand kernels better,
(a) None separable dataset (b) Separating previously non separable
dataset
Figure 5.5: Demonstration of kernels
some extra examples will be given. In Figure 5.4 is plotted a two-dimensional
data set. With a relative simple kernel function, this data can be projected to a
four-dimensional space. It is not possible to draw the data in the 4 dimensional
space, but with a projection of the SVM hyperplane in the four-dimensional
space back down to the original two-dimensional space, the result is shown as
the curved line in Figure 5.4. it is possible to prove that for any data set exists
a kernel function that allows the SVM to separate the data linearly in a higher
dimension. Of course, the data set must contain consistent labels, which means
that two identical data points may not have different labels. So, in theory,
the SVM should be a perfect classifier. However, there are some drawbacks of
projecting data in a very high-dimensional space to find the separating hyper-
plane. the first problem is the so called curse of dimensionality: as the numbers
of variables under consideration increases, the number of possible solutions also
increases, but exponentially. Consequently, it becomes harder for any algorithm
to find a correct solution. In Figure 5.4 the situation is drawn when the data is
project into a space with too many dimensions. The figure contains the same
data as Figure 5.4, but the projected hyperplane is found by a very high dimen-
57
sional kernel. This results in boundaries which are to specific to the examples
of the data set. This phenomenon is called over fitting. The SVM will not
function well on new unseen unlabeled data. There exists another large practi-
(a) Linearly separable in four dimensions (b) A SVM that has over fit the data
Figure 5.6: Examples of separation with kernels
cal difficulty when applying new unseen data to the SVM. This problems relies
on the question how to choose a kernel function that separates the data, but
without introducing too many irrelevant dimensions. Unfortunately, the answer
too this question is, in most cases, trial and error. In this research a SVM will
be experimented with a variety of ’standard’ kernel functions. By using the
cross-validation method, the optimal kernel will be selected on a statistical way.
However, this is a time-consuming process and it is not guaranteed that the
best kernel function that was found during cross-validation, is actually the best
kernel function that exists. It is more likely that there exists a kernel function
that was not tested and performs better than the selected kernel function. Prac-
tically, the method described above, mainly gives sufficient results. In general
the kernel function is defined by:
K(x
i
, x
j
) = Φ(x
i
)
T
Φ(x
j
), (5.1)
where x
i
are the training vectors. The vectors are mapped into a higher dimen-
sional space by the function Φ. Many kernel mapping functions can be used,
probably an infinite number, but a few kernel functions have been found to work
well in for a wide variety of applications [16]. The default and recommended
kernel functions were used during this research and will be discussed now.
• Linear: which function is defined by:
K(x
i
, x
j
) = x
T
i
x
j
. (5.2)
• Polynomial: the polynomial kernel of degree d is of the form
K(x
i
, x
j
) = (γx
T
i
x
j
+c
0
)
d
. (5.3)
58
• Radial basis function: also known as the Gaussian kernel is of the form
K(x
i
, x
j
) = exp(−γ||x
i
−x
j
||
2
). (5.4)
• Sigmoid: the sigmoid function, which is also used in neural networks, is
defined by
K(x
i
, x
j
) = tanh(γx
T
i
x
j
+c
0
). (5.5)
When the sigmoid function is used, one can regard it with a as a two-layer
neural network.
In this research the constant c
0
is set to 1. The concept of a kernel mapping
function is very powerful. It allows a SVM to perform separations even with
very complex boundaries as shown in Figure 5.7
Figure 5.7: A separation of classes with complex boundaries
5.5 Multi class classification
So far, the idea of using a hyperplane to separate the feature vectors into two
groups was described, but only for two target categories. How does a SVM
discriminate between a large variety of classes, as in our case 4 or 6 segments?
There are several approaches proposed, but two methods are the most popu-
lar and most used [16]. The first approach is to train multiple, one-versus-all
classifiers. For example, if the SVM has to recognize three classes, A, B and C,
one can simply train three separate SVM to answer the binary questions, ”Is it
A?”, ”Is it B?” and ”Is it C?”. Another simple approach is the one-versus-one
where k(k −1)/2 models are constructed, where k is the number of classes. In
this research the one-verses-one technique will be used.
59
Chapter 6
Experiments and results of
classifying the customer
segments
6.1 K-fold cross validation
To avoid over fitting, cross-validation is used to evaluate the fitting provided by
each parameter value set tried during the experiments. Figure 6.1 demonstrates
how important the training process is. Different parameter values may cause
under or over fitting. By K-fold cross validation the training dataset will be
Figure 6.1: Under fitting and over fitting
divided into two groups, the training set, the test set and the validation set.
The training set will be used to train the SVM. The test set will be used to
estimate the error during the training of the SVM. With the validation set,
the actual performance of the SVM will be measured after the SVM is trained.
The training of the SVM will be stopped when the test error reached a local
60
minimum, see Figure 6.2. By K-fold cross validation, a k-fold partition of the
Figure 6.2: Determining the stopping point of training the SVM
data set is created. For each of K experiments, K-1 folds will be used for training
and the remaining one for testing. Figure 6.3 illustrates this process. In this
Figure 6.3: A K-fold partition of the dataset
research, K is set to 10. The advantage of K-fold cross validation is that all the
examples in the dataset are eventually used for both training and testing. The
error is calculated by taking the average off all K experiments.
6.2 Parameter setting
In this section, the optimal parameters for the Support Vector Machine will
be researched and examined. Each kernel function with its parameters will be
tested on their performance. The linear Kernel function itself has no parameters.
The only parameter that can be researched is the soft margin value of the
Support Vector Machin, denoted by C. In table 6.1 and table 6.2 the results for
the different C-values are summarized. For the situation with 4 clusters, the
C 1 2 5 10 20 50 100 200 500
42.1% 42.6% 43.0% 43.2% 43.0% 42.4% 41.7% 40.8% 36.1%
Table 6.1: Linear Kernel, 4 segments
61
C 1 2 5 10 20 50 100 200 500
28.9% 29.4% 30.9% 31.3% 31.4% 32.0% 27.6% 27.6% 21.8%
Table 6.2: Linear Kernel, 6 segments
optimal value for the soft margin is C = 10 and by using the 6 segments C = 50.
The correct number of classifications are respectively, 43.2% and 32.0%. For the
polynomial kernel function, there are two parameters. The number of degrees,
denoted by d and the width γ. Therefor, the optimal number for the maximal
margin will be determined. This is done by multiple test runs with random
values for d and γ. The average value for each soft margin C can be found in the
tables 6.3 and 6.4. These C-values are used to find out which d and γ give the
C 1 2 5 10 20 50 100 200 500
73.8% 77.4% 76.6% 74.6% 73.5% 72.8% 70.6% 63.2% 53.7%
Table 6.3: Average C-value for polynomial kernel, 4 segments
C 1 2 5 10 20 50 100 200 500
70.1% 74.4% 75.3% 75.1% 75.0% 75.1% 50.6% 42.7% 26.0%
Table 6.4: Average C-value for polynomial kernel, 6 segments
best results. The results are shown in tables 6.5 and 6.6. For the situation with
d 1 2 3 4 5 6 7
γ = 0.4 76.1% 76.3% 78.1% 73.2% 74.8% 76.0% 75.0%
γ = 0.6 76.0% 76.3% 77.6% 74.1% 74.5% 75.4% 75.8%
γ = 0.8 75.8% 76.3% 77.2% 74.0% 74.4% 77.1% 75.2%
γ = 1.0 76.2% 76.4% 78.0% 75.0% 75.2% 75.6% 75.8%
γ = 1.2 76.0% 76.2% 78.1% 74.6% 75.1% 76.0% 75.8%
γ = 1.4 75.2% 76.2% 78.1% 74.9% 75.5% 76.3% 74.9%
Table 6.5: Polynomial kernel, 4 segments
d 1 2 3 4 5 6 7
γ = 0.4 75.0% 74.6% 75.9% 76.0% 75.8% 74.3% 73.9%
γ = 0.6 74.2% 75.1% 74.9% 76.2% 75.0% 74.5% 74.0%
γ = 0.8 73.8% 74.7% 74.3% 76.2% 75.9% 74.8% 73.1%
γ = 1.0 74.1% 75.0% 73.6% 76.1% 75.3% 74.2% 72.8%
γ = 1.2 72.1% 74.1% 75.5% 75.4% 75.4% 74.1% 73.0%
γ = 1.4 73.6% 74.3% 72.2% 76.0% 74.4% 74.3% 72.9%
Table 6.6: Polynomial kernel, 6 segments
62
4 segments, the optimal score is 78.1% and for 6 segments 76.2%. The following
kernel function, the radial basis function has only one variable, namely γ. The
results of the Radial Basis function are given in table 6.7 and table 6.8. The
C 1 2 5 10 20 50 100 200 500
γ = 0.4 80.0 79.0 76.6 78.3 76.4 73.3 60.2 52.4 37.5
γ = 0.6 80.1 80.3 77.7 79.0 79.9 72.8 63.6 59.6 27.5
γ = 0.8 79.3 79.5 78.2 80.2 78.4 69.3 59.3 51.4 29.7
γ = 1.0 78.4 78.2 80.3 78.5 76.9 66.2 61.7 47.9 30.6
γ = 1.2 79.6 79.9 79.8 80.2 80.1 69.0 61.3 45.5 26.3
γ = 1.4 77.4 76.9 76.5 79.4 77.7 71.4 61.3 41.2 26.0
Table 6.7: Radial basis function, 4 segments
C 1 2 5 10 20 50 100 200 500
γ = 0.4 73.6 77.4 72.6 70.9 68.0 65.1 52.7 51.8 40.0
γ = 0.6 72.5 74.8 74.8 72.7 73.0 70.4 54.0 49.3 39.1
γ = 0.8 74.1 76.6 80.3 80.0 68.4 60.5 55.5 54.1 40.9
γ = 1.0 70.7 72.9 73.8 70.9 66.1 64.7 52.2 48.5 34.2
γ = 1.2 72.6 73.5 73.4 73.1 71.9 74.6 64.8 60.0 38.3
γ = 1.4 69.4 68.5 70.7 69.1 68.0 68.5 54.4 52.4 31.0
Table 6.8: Radial basis function, 6 segments
best result with 4 segments is 80.3%, with 6 segments the best score is 78.5%.
The sigmoid function has also only 1 variable. The results are given in table
6.9 and 6.10 The results show that 66.1% and 44.6% of the data is classified
C 1 2 5 10 20 50 100 200 500
γ = 0.4 58.2 53.0 57.7 58.2 56.1 57.9 30.3 47.5 38.9
γ = 0.6 47.6 56.1 55.5 46.0 58.3 44.1 30.6 30.7 34.5
γ = 0.8 52.1 60.5 54.6 57.9 58.6 44.7 43.2 44.3 38.7
γ = 1.0 51.4 57.3 52.0 50.7 50.2 48.6 44.7 42.2 40.0
γ = 1.2 66.1 64.8 61.3 62.8 59.6 57.1 46.5 44.0 42.0
γ = 1.4 63.2 61.4 59.7 65.0 53.8 51.1 52.2 47.6 41.4
Table 6.9: Sigmoid function, 4 segments
correct, with respectively 4 and 6 segemtents, by the Sigmoid function. This
means that the Radial basis function has the best score for both situations, with
80.3% and 78.5%. Remarkable is the fact that the difference is small between
the two situations, while there are two extra clusters. The confusion matrix for
both situations, table 6.11 and 6.12, show that there are two clusters which can
easily be classified with the customer profile. This corresponds to the cluster in
the top right corner and the cluster in the bottom of Figures 4.9 and 4.10.
63
C 1 2 5 10 20 50 100 200 500
γ = 0.4 33.8 34.0 34.7 33.1 34.6 30.0 32.6 28.8 28.8
γ = 0.6 29.6 27.4 28.5 29.7 21.4 20.8 20.0 18.8 18.1
γ = 0.8 39.1 36.4 33.6 35.7 38.9 32.0 26.4 24.6 22.9
γ = 1.0 40.0 42.5 39.8 40.7 39.9 39.8 30.4 31.1 28.0
γ = 1.2 41.9 40.6 43.6 43.2 44.1 43.2 44.6 40.6 41.7
γ = 1.4 38.6 34.5 32.1 30.6 30.2 27.5 24.3 26.3 27.9
Table 6.10: Sigmoid function, 6 segments
Predicted → Segment 1 Segment 2 Segment 3 Segment 4
Actual ↓
Segment 1 97.1% 0.5% 1.9% 0.5%
Segment 2 3.6% 76.6% 7.8% 12.0%
Segment 3 2.2% 0.8% 96.3% 0.7%
Segment 4 7.1% 13.0% 6.9% 73.0%
Table 6.11: Confusion matrix, 4 segments
Predicted → Segm. 1 Segm. 2 Segm. 3 Segm. 4 Segm. 5 Segm. 6
Actual ↓
Segment 1 74.1% 1.1% 10.1% 8.4% 0.6% 5.7%
Segment 2 0.2% 94.5% 0.6% 1.4% 1.2% 2.1%
Segment 3 5.6% 4.7% 71.2% 9.1% 2.1% 7.3%
Segment 4 12.3% 4.1% 3.9% 68.9% 6.8% 4.0%
Segment 5 2.0% 0.6% 0.7% 1.3% 92.6% 2.8%
Segment 6 12.5% 2.4% 3.7% 10.4% 1.3% 69.7%
Table 6.12: Confusion matrix, 6 segments
64
6.3 Feature Validation
In this section, the features will be validated. The importance of each feature
will be measured. This will be done, by leaving one feature out of the feature
vector and train the SVM without this feature. The results of both situations,
are shown in Figure 6.4 and 6.5. The result show that Age is an important
Figure 6.4: Results while leaving out one of the features with 4 segments
Figure 6.5: Results while leaving out one of the features with 6 segments
feature for classifying the right segment. This is in contrast with the type of
telephone, which increase the result with only tenths of percents. Each feature
increases the result and therefore each feature is useful for the classification.
65
Chapter 7
Conclusions and discussion
This section concludes the research and the corresponding results and will give
some recommendations for future work.
7.1 Conclusions
The first objective of our research was to perform automatic customer segmen-
tation based on usage behavior, without the direct intervention of a human
specialist. The second part of the research was focused on profiling customers
and finding a relation between the profile and the segments. The customer
segments were constructed by applying several clustering algorithms. The clus-
tering algorithms used selected and preprocessed data from the Vodafone data
warehouse. This led to solutions for the customer segmentation with respec-
tively four segments and six segments. The customer’s profile was based on
personal information of the customers. A novel data mining technique, called
Support Vector Machines was used to estimate the segment of a customer based
on his profile.
There are various ways for selecting suitable feature values for the clustering
algorithms. This selection is vital for the resulting quality of the clustering.
One different feature value will result in different segments. The result of the
clustering can therefore not be regarded as universally valid, but merely as one
possible outcome. In this research, the feature values were selected in such a
way that it would describe the customer’s behavior as complete as possible.
However, it is not possible to include all possible combinations of usage behav-
ior characteristics within the scope of this research. To find the optimal number
of clusters, the so-called elbow criterion was applied. Unfortunately, this crite-
rion could not always be unambiguously identified. An other problem was that
the location of the elbow could differ between the validation measures for the
same algorithm. For some algorithms, the elbow was located at c = 4 and for
other algorithms, the location was c = 6. To identify the best algorithm, several
validation measures were used. Not every validation method marked the same
66
algorithm as the best algorithm. Therefore, some widely established validation
measures were employed to determine the most optimal algorithm. It was how-
ever not possible to determine one algorithm that was optimal for c = 4 and
c = 6. For the situation with four clusters, the Gath-Geva algorithm appears to
be the best algorithm and the Gustafson-Kessel algorithm gives the best results
by six clusters. To determine which customer segmentation algorithm is best
suited for a particular data set and a specific parameter setting, the clustering
results were interpreted in a profiling format. The results show, that in both
situations the clusters were well separated and clearly distinguished from each
other. It is hard to compare the two clustering results, because of the different
number of clusters. Therefore, both clustering results were used as a starting
point for the segmentation algorithm. The corresponding segments differ on
features as number of voice calls, sms usage, call duration, international calls,
different numbers called and percentage of weekday and daytime calls. A short
characterization of each cluster was made.
A Support Vector Machine algorithm was used to classify the segment of a
customer, based on the customer’s profile. The profile exists of the age, gen-
der, telephone type, subscription type, company size, and residential area of
the customer. As a comparison, four different kernel functions with different
parameters were tested on their performance. It was found that the radial basis
function gives the best result with a classification of 80.3% for the situation
with four segments and 78.5% for the situation with six segments. It appeared
that the resulting percentage of correctly classified segments was not as high
as expected. A possible explanation could be that the features of the customer
are not adequate for making a customer’s profile. This is caused by the fre-
quently missing data in the Vodafone data warehouse about lifestyle, habits
and income of the customers. A second reason for the low number of correct
classification is the fact that the usage behavior in the database corresponds
to a telephone number and this telephone number corresponds to a person. In
real life, however, this telephone is maybe not used exclusively by the person
(and the corresponding customer’s profile) as stored in the database. Customers
may lend their telephone to relatives, and companies may exchange telephones
among their employees. In such cases, the usage behavior does not correspond
to a single customer’s profile and this impairs the classification process.
The last part of the research involves the relative importance of each individ-
ual feature of the customer’s profile. By leaving out one feature value during
classification, the effect of each feature value became visible. It was found that
without the concept of ’customer age’, the resulting quality of the classifica-
tion was significantly decreased. On the other hand, leaving out a feature such
as the ’telephone type’ barely decreased the classification result. However, this
and some other features did well increase the performance of classification. This
implies, that this feature bears some importance for the customer profiling and
the classification of the customer’s segment.
67
7.2 Recommendations for future work
Based on our research and experiments, it is possible to formulate some recom-
mendations for obtaining more suitable customer profiling and segmentation.
The first recommendation is to use different feature values for the customer
segmentation. This can lead to different clusters and thus different segments.
To know the influence of the feature values on the outcome of the clustering, a
complete data analysis research is required. Also, a detailed data analysis of the
meaning of the cluster is recommended. In this research, the results are given
by a short description of each segment. Extrapolating this approach, a more
detailed view of the clusters and their boundaries can be obtained. Another
way to validate the resulting clusters is to offer them to a human expert, and
use his feed-back for improving the clustering criteria.
To improve on determining the actual number of clusters present in the data
set, the application of more specialized methods than the elbow criterion could
be applied. An interesting alternative is, for instance, the application of evolu-
tionary algorithms, as proposed by Wei Lu [21]. Another way of improving this
research is to extent the number of cluster algorithms like main shift cluster-
ing, hierarchical clustering or mixture of Gaussians. To estimate the segment
of the customer, also, other classification methods can be used. For instance,
neural networks, genetic algorithms or Bayesian algorithms. Of specific interest
is, within the framework of Support Vector Machines, cluster analysis of the
application of miscellaneous (non-linear) kernel functions.
Furthermore, it should be noted that the most obvious and best way to improve
the classification is to come to a more accurate and precise definition of the
customer profiles. The customer profile used in this research is not sufficient
detailed enough to describe the wide spectrum of customers. One reason for this
is the missing data in the Vodafone data warehouse. Consequently, an enhanced
and more precise analysis of the data ware house will lead to improved features
and, thus, to an improved classification.
Finally, we note that the study would improve noticeably by involving multiple
criteria to evaluate the user behavior, rather than mere phone usage as em-
ployed here. Similarly, it is challenging to classify the profile of the customer
based on the corresponding segment alone. However, this is a complex course
and it essentially requires the availability of high-quality features.
68
Bibliography
[1] Ahola, J. and Rinta-Runsala E., Data mining case studies in customer profiling.
Research report TTE1-2001-29, VTT Information Technology (2001).
[2] Amat, J.L., Using reporting and data mining techniques to improve knowledge of
subscribers; applications to customer profiling and fraud management. J. Telecom-
mun. Inform. Technol., no. 3 (2002), pp. 11-16.
[3] Balasko, B., Abonyi, J. and Balazs, F., Fuzzy Clustering and Data Analysis Tool-
box For Use with Matlab. (2006).
[4] Bounsaythip, C. and Rinta-Runsala, E., Overview of Data Mining for Customer
Behavior Modeling. Research report TTE1-2001-18, VTT Information Technol-
ogy (2001).
[5] Bezdek, J.C. and Dunn, J.C., Optimal fuzzy partition: A heuristic for estimating
the parameters in a mixture of normal distributions. IEEE Trans. Comput., vol.
C-24 (1975), pp. 835-838.
[6] Dibike, Y.B., Velickov, S., Solomatine D. and Abbott, M.B., Model Induction
with Support Vector Machines: Introduction and Applications. J. Comp. in Civ.
Engrg., vol. 15 iss. 3 (2001), pp. 208-216.
[7] Feldman, R. and Dagan, I., Knowledge discovery in textual databases (KDT). In
Proc. 1st Int. Conf. Knowledge Discovery and Data Mining, (2005), pp. 112-117.
[8] Frawley, W.J., Piatetsky-Shapiro, G. and Matheus, C.J., Knowledge discovery in
databases, AAAI/MIT Press (1991), pp. 1-27.
[9] Gath, I. and Geva, A.B., Unsupervised optimal fuzzy clustering. IEEE Trans
Pattern and Machine Intell, vol. 11 no. 7 (1989), pp. 773-781.
[10] Giha, F.E., Singh, Y.P. and Ewe, H.T., Customer Profiling and Segmentation
based on Association Rule Mining Technique. Proc. Softw. Engin. and Appl., no.
397 (2003).
[11] Gustafson, D.E. and Kessel, W.E., Fuzzy clustering with a fuzzy covariance ma-
trix. In Proc. IEEE CDC, (1979), pp. 761766.
[12] Janusz, G., Data mining and complex telecommunications problems modeling. J.
Telecommun. Inform. Technol., no. 3 (2003), pp. 115-120.
69
[13] Mali, K., Clustering and its validation in a symbolic framework. Patt. Recogn.
Lett., vol. 24 (2003), pp. 2367-2376.
[14] Mattison, R., Data Warehousing and Data Mining for Telecommunications.
Boston, London: Artech House, (1997).
[15] McDonald, M. and Dunbar, I., Market segmentation. How to do it, how to profit
from it. Palgrave Publ., (1998).
[16] Noble, W.S., What is a support vector machine? Nature Biotechnology, vol. 24
no. 12 (2006), pp. 1565-1567.
[17] Shaw, M.J., Subramaniam, C., Tan, G.W. and Welge, M.E., Knowledge manage-
ment and data mining for marketing. Decision Support Systems, vol. 31 (2001),
pp. 127137.
[18] Verhoef, P., Spring, P., Hoekstra, J. and Lee, P., The commercial use of segmenta-
tion and predictive modeling techniques for database marketing in the Netherlands.
Decis. Supp. Syst., vol. 34 (2002), pp. 471-481.
[19] Virvou, M., Savvopoulos, A. Tsihrintzis, G.A. and Sotiropoulos, D.N., Construct-
ing Stereotypes for an Adaptive e-Shop Using AIN-Based Clustering. ICANNGA
(2007), pp. 837-845.
[20] Wei, C.P. and Chiu, I.T., Turning telecommunications call detail to churn pre-
diction: a data mining approach. Expert Syst. Appl., vol. 23 (2002), pp. 103112.
[21] Wei Lu, I.T., A New Evolutionary Algorithm for Determining the Optimal Num-
ber of Clusters. CIMCA/IAWTIC (2005), pp. 648-653.
[22] Weiss, G.M., Data Mining in Telecommunications. The Data Mining and Knowl-
edge Discovery Handbook (2005), pp. 1189-1201.
70
Appendix A
Model of data warehouse
In this Appendix a simplified model of the data ware house can be found. The
white rectangles correspond to the tables that were used for this research. The
most important data fields of these tables are written in the table. The colored
boxes group the tables in a category. To connected the tables with each other,
the relation tables (the red tables in the middle) are needed.
71
Figure A.1: Model of the Vodafone data warehouse
72
Appendix B
Extra results for optimal
number of clusters
In this Appendix, the plots of the validation measures, for the algorithms that
not were discussed in Section 4.1, are given.
The K-medoid algorithm:
Figure B.1: Partition index and Separation index of K-medoid
73
Figure B.2: Dunn’s index and Alternative Dunn’s index of K-medoid
The Fuzzy-C-means algorithm:
Figure B.3: Partition coefficient and Classification Entropy of Fuzzy C-means
74
Figure B.4: Partition index, Separation index and Xie Beni index of Fuzzy
C-means
Figure B.5: Dunn’s index and Alternative Dunn’s index of Fuzzy C-means
75

Acknowledgments
This Master thesis was written to complete the study Operations Research at the University of Maastricht (UM). The research took place at the Department of Mathematics of UM and at the Department of Information Management of Vodafone Maastricht. During this research, I had the privilege to work together with several people. I would like to express my gratitude to all those people for giving me the support to complete this thesis. I want to thank the Department of Information Management for giving me permission to commence this thesis in the first instance, to do the necessary research work and to use departmental data. I am deeply indebted to my supervisor Dr. Ronald Westra, whose help, stimulating suggestions and encouragement helped me in all the time of research for and writing of this thesis. Furthermore, I would like to give my special thanks to my second supervisor Dr. Ralf Peeters, whose patience and enthusiasm enabled me to complete this work. I have also to thank my thesis instructor, Drs. Annette Schade, for her stimulating support and encouraging me to go ahead with my thesis. My former colleagues from the Department of Information Management supported me in my research work. I want to thank them for all their help, support, interest and valuable hints. Especially I am obliged to Drs. Philippe Theunen and Laurens Alberts, MSc. Finally, I would like to thank the people, who looked closely at the final version of the thesis for English style and grammar, correcting both and offering suggestions for improvement.

1

Contents
1 Introduction 1.1 Customer segmentation and customer profiling 1.1.1 Customer segmentation . . . . . . . . . 1.1.2 Customer profiling . . . . . . . . . . . . 1.2 Data mining . . . . . . . . . . . . . . . . . . . . 1.3 Structure of the report . . . . . . . . . . . . . . 2 Data collection and preparation 2.1 Data warehouse . . . . . . . . . 2.1.1 Selecting the customers 2.1.2 Call detail data . . . . . 2.1.3 Customer data . . . . . 2.2 Data preparation . . . . . . . . 8 9 9 10 11 13 14 14 14 15 19 20 22 22 23 23 24 27 27 28 28 29 30 31 33 33 34 35

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

3 Clustering 3.1 Cluster analysis . . . . . . . . . . . . . . 3.1.1 The data . . . . . . . . . . . . . 3.1.2 The clusters . . . . . . . . . . . . 3.1.3 Cluster partition . . . . . . . . . 3.2 Cluster algorithms . . . . . . . . . . . . 3.2.1 K-means . . . . . . . . . . . . . . 3.2.2 K-medoid . . . . . . . . . . . . . 3.2.3 Fuzzy C-means . . . . . . . . . . 3.2.4 The Gustafson-Kessel algorithm 3.2.5 The Gath Geva algorithm . . . . 3.3 Validation . . . . . . . . . . . . . . . . . 3.4 Visualization . . . . . . . . . . . . . . . 3.4.1 Principal Component Analysis . 3.4.2 Sammon mapping . . . . . . . . 3.4.3 Fuzzy Sammon mapping . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

4 Experiments and results of customer segmentation 37 4.1 Determining the optimal number of clusters . . . . . . . . . . . . 37 4.2 Comparing the clustering algorithms . . . . . . . . . . . . . . . . 42

2

. . . 5. . . . . . .4. . . . . . . . . . . . . . . . . . .3 Designing the segments . . . . 5. . . . .3 Feature Validation . . . . 45 53 53 55 56 56 59 5 Support Vector Machines 5. . 66 7. . . . . . . . . . . . . . . . . . . . . . . . .2 The maximum-margin hyperplane 5. . . . . . . . . . 5. 61 . . . .4 The kernel functions . . . . . 6. . . . . . . . . 60 . . . . . . . . . . . . . . . . . .1 Conclusions . . . 6 Experiments and results 6. . . . .3 The soft margin . . . . . . . . . . . . . . . . . . .5 Multi class classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 Recommendations for future work . .1 The separating hyperplane . . . . . . .1 K-fold cross validation 6. . . . . . . . . . . . . . . . . . . .2 Parameter setting . . . 68 Bibliography A Model of data warehouse B Extra results for optimal number of clusters 68 71 73 3 . . . . . . . . . . . . . . . . . . . . . of classifying the customer segments 60 . . . . . . . . . . 65 7 Conclusions and discussion 66 7. . . . . . . . . . . . . .

. . . . . . . Values of Partition coefficient and Classification Entropy with Gustafson-Kessel clustering . . . . . . . . . .4 2. . . . . . . . . . . Values of Partition Index. . . Distribution of distances from cluster centers within clusters for the Gath-Geva algorithm with c = 4 . . . . . . .13 4. . . . .List of Figures 1. . . . . . . . Cluster profiles for c = 6 . . . . . . . . . . . .5 3. . . . . . Separation Index and the Xie Beni Index Values of Dunn’s Index and the Alternative Dunn Index . . . . . . . . . . . . . .1 A taxonomy of data mining tasks . . . . . . . . . . . . . . .2 2. . . . . . . Result of K-means algorithm . .8 4. .9 4. . . .15 4. . . . . . . .12 4. . . . . Visualization of phone calls per hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cluster profiles of centers for c = 4 . . . . . . . 4 . Two-dimensional customer data of segment 1 and segment 2 . . Values of Dunn’s Index and Alternative Dunn Index with GustafsonKessel clustering . Distribution of distances from cluster centers within clusters for the Gustafson-Kessel algorithm with c = 6 . . . .1 2. . . . . . . . 12 15 17 18 18 19 22 24 25 38 39 40 41 41 43 44 44 44 45 46 46 47 48 49 50 54 Example of clustering data . . . . . . .14 4. . . . . . . . . . . . . .6 4. Different cluster shapes in R2 . . . . . Result of Gustafson-Kessel algorithm .5 4. . . . . . . . . . .3 4. . .3 2. . . . . . . . . . . . . . . . . . . . . . . Relation between daytime and weekday calls . . . . . . . . . . . . . . . . . .7 4. . . . . . . . . . . . . . . Cluster profiles of centers for c = 6 . .1 3.11 4. . . Cluster profiles for c = 4 . . . . Histograms of feature values . . . . .2 4. . .10 4. . . . . . . . . Separation Index and the Xie Beni Index with Gustafson-Kessel clustering . . . . . . . . . . . .3 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . Result of K-medoid algorithm . . . . . . . . . . Structure of customers by Vodafone . . . . . . Result of Gath-Geva algorithm .2 3. . . . . . . . . . . . . . . . . . . . . . . Hard and fuzzy clustering . . . . . . . . . Relation between originated and received calls . . . . .16 5. . . . . . . . . . . . .1 4. . . . . . .4 4. . . .1 2. . Result of Fuzzy C-means algorithm . . . . . . . . . . . . . . . . . Values of Partition Index.

.7 6. . . B. . . . . . . .3 5.4 5. . . .5 Separating hyperplanes in different dimensions . . . . . . . . . . . . . . . Examples of separation with kernels . . Dunn’s index and Alternative Dunn’s index of K-medoid . A K-fold partition of the dataset . . . . . . . . . . . . . . . . .4 Partition index and Separation index of K-medoid . . .2 6. . . . .4 6. . . . . . . . Separation index and Xie Beni index of Fuzzy C-means . . . .1 6. . .5 Dunn’s index and Alternative Dunn’s index of Fuzzy C-means . . . . . B. Demonstration of kernels . . . . . . . .2 B. . . . . . . . . . . 54 55 56 57 58 59 60 61 61 65 65 72 73 74 74 75 75 Under fitting and over fitting . . . Demonstration of the maximum-margin hyperplane Demonstration of the soft margin . . . A separation of classes with complex boundaries . .3 B. . . . .6 5. . .3 6. . Results while leaving out one of the features with 4 segments Results while leaving out one of the features with 6 segments A. . . . . . . . . . . . . . . . . . .5 5. . . . . Determining the stopping point of training the SVM . . . . . .5. . . . . . . . . 5 .1 Model of the Vodafone data warehouse . . . . . . . . . . . . . . Partition coefficient and Classification Entropy of Fuzzy C-means Partition index. . . . . . . . . . . . . . . . . .1 B. . . . . .2 5. . . . . . . . . . . . . . . . . .

. 4 segments . . . . . .1 4. . . . . . . . . . . . . . .4 4. . . . . . . . . . . . . . . . . . segments segments . . . . . . . . . . . . Radial basis function. . The values of all the validation measures with K-means clustering The values of all the validation measures with Gustafson-Kessel clustering . . . . . . . . . Average C-value for polynomial kernel. 6 segments . . . . . . . . . . . . . . . . . . . . . Sigmoid function. .1 6. . . . . . . . . . . . . . . . Confusion matrix. . . . . . . . . . 4 segments . . . . . The numerical values of validation measures for c = 6 . . . . .11 6. . . . . . . . Linear Kernel. 4 6 . . . The numerical values of validation measures for c = 4 . . . . . . . .3 6. . . . . . . . . . . Segmentation results . . . . . . . . 20 39 42 42 43 51 61 62 62 62 62 62 63 63 63 64 64 64 6 .List of Tables 2. . . . . . . . . . 4 segments . Sigmoid function. .7 6. . . . Polynomial kernel. . . .9 6. . .4 6. . .5 6. . . . . . . . . . . . . 4 segments . . . . . . . . . .5 6. . . . . . . Radial basis function. .8 6. . . . . . Linear Kernel. 4 segments . . . . . . . . . . . . . . Average C-value for polynomial kernel. . . . . . 6 segments . . .10 6. . . . . . . . . . . . . . .2 6. . .1 4. . Confusion matrix. 6 segments . . . . .3 4. . . . . . . 6 segments . . .12 Proportions within the different classification groups . . . . . . . . . . Polynomial kernel. . . . . . . . . . . . . 6 segments . .6 6. . . . . . . .2 4.

With six segments. with a recent data mining technique.3% of the cases to classify the segment of a customer based on its profile for the situation with four segments. the customer segmentation is based on usage call behavior. such as age. the segment of a customer will be estimated based on the customers profile. In this research. Finally. However. Each segment will be described and analyzed. Different kernel functions with different parameters will be examined and analyzed. This research will address the question how to perform customer segmentation and customer profiling with data mining techniques. i. The customer segmentation will lead to two solutions. has accumulated vast amounts of data on consumer mobile phone behavior in a data warehouse. such as age. without direct knowledge of human experts. The magnitude of this data is so huge that manual analysis of data is not feasible.e. the behavior of a customer measured in the amounts of incoming or outgoing communication of whichever form. tastes. ’Customer profiling’ is describing customers by their attributes. by means of advanced data mining techniques. With the Support Vector Machine approach it is possible in 80. One solution with four segments and one solution with six segments. in order to perform customer segmentation and to profile the customer. called Support Vector Machines.Abstract Vodafone. Therefore. a correct classification of 78. The best i. automatic analysis is essential. gender. This thesis describes the process of selecting and preparing the accurate data from the data warehouse. gender and residential area information. etc). In our context. An optimality criterion is constructed in order to measure their performance. These data mining techniques search and analyze the data in order to find implicit and useful information. income and lifestyles. 7 . Customer profiling can be accomplished with information from the data warehouse. ’customer segmentation’ is a term used to describe the process of dividing customers into homogeneous groups on the basis of shared or common attributes (habits. Having these two components. managers can decide which marketing actions to take for each segment. this data holds valuable information that can be applied for operational and strategical purposes. most optimal in the sense of the optimality criterion clustering technique will be used to perform customer segmentation.5% is obtained. A number of advanced and state-of-the-art clustering algorithms are modified and applied for creating customer segments.e. in order to extract such information from this data. an International mobile telecommunications company.

These automated systems perform important functions such as identifying network faults and detecting fraudulent phone calls. while the network data gives a description of the state of the hardware and software components in the network. many data mining tasks can be distinguished. Customer profiling is describing customers by their attributes. tastes.Chapter 1 Introduction Vodafone is world’s leading mobile telecommunications company. etc) [10]. marketers can decide which marketing actions to take for each segment and then allocate scarce resources to segments in order to meet specific business objectives. and in many cases. Customer segmentation is a term used to describe the process of dividing customers into homogeneous groups on the basis of shared or common attributes (habits. Obtaining knowledge from human experts is a time consuming process. Vodafone is interested in a complete different issue. Solutions to these problems were promised by data mining techniques. call detail data. Call detail data gives a description of the calls that traverse the telecommunication networks. Data mining is the process of searching and analyzing data in order to find implicit. identifying trends in customer behavior and cross selling and up-selling. These data include. namely customer segmentation and customer profiling and the relation between them. such as age. The customer data contains information of the telecommunication customers. information [12]. among others. The need to handle such large volumes of data led to the development of knowledge-based expert systems [17. but potentially useful. if not impossible [22].1 million customers in The Netherlands. network data and customer data. 10]. Within the telecommunication branch. the experts do not have the requisite knowledge [2]. A disadvantage of this approach is that it is based on knowledge from human experts. 22]. Examples of main problems for marketing and sales departments of telecommunication operators are churn prediction. From all these customers a tremendous amount of data is stored. gender. income and lifestyles [1. Having these two components. fraud detection. The amount of data is so great that manual analysis of data is difficult. A basic way to perform customer segmentation is to define segmentations in 8 . with approximately 4.

1. and dividing the customers over these segmentations by their best fits. for each customer a profile will be determined with the customer data. Customer segmentation is a preparation step for classifying each customer according to the customer groups that have been defined. such as age. will be developed. The segmentations will be determined based on (call) usage behavior. This research will deal with the problem of making customer segmentations without knowledge of an expert and without defining the segmentations in advance. Another key benefit of utilizing the customer profile is making effective marketing strategies.1. The goal is to predict behavior based on the information we have on each customer [18]. Once the segmentations are obtained. tested. By using segmentation. In this report. a data mining technique called Support Vector Machines (SVM) will be used. Segmentation is essential to cope with today’s dynamically fragmenting consumer marketplace. marketers are more effective in channeling resources and discovering opportunities. Profiling is performed after customer segmentation. To realize this. different data mining techniques. it is needed to divide customers in segments and to profile the customers. Customer profiling is a way of applying external data to a population of possible customers. gender and lifestyle. Based on the combination of the personal information (the customer profile). In this research. The construction of user 9 .1 Customer segmentation Segmentation is a way to have more targeted communication with the customers.1 Customer segmentation and customer profiling To compete with other providers of mobile telecommunications it is important to know enough about your customers and to know the wants and needs of your customers [15]. A Support Vector machine is able to estimate the segment of a customer by personal information. called clustering techniques. different settings of the Support Vector Machines will be examined and the best working estimation model will be used.advance with knowledge of an expert. Depending on data available. To realize this. the principals of the clustering techniques will be described and the process of determining the best technique will be discussed. the segment can be estimated and the usage behavior of the customer profile can be determined. To find a relation between the profile and the segments. The process of segmentation describes the characteristics of the customer groups (called segments or clusters) within the data. validated and compared to each other. Customer profiling is done by building a customer’s behavior model and estimating its parameters. Segmenting means putting the population in to segments according to their affinity or similar characteristics. it can be used to prospect new customers or to recognize existing bad customers. 1.

thereby necessitating revision and reclassification of customers.segmentations is not an easy task. A simple customer profile is a file that contains at least age and 10 . too much data can lead to complex and time-consuming analysis. Many of these problems are due to an inadequate customer database.1. the use of too many segmentation variables can be confusing and result in segments which are unfit for management decision making. effective segmentation strategies will influence the behavior of the customers affected by them. apparently effective variables may not be identifiable. If the company has insufficient customer data. Customer profiling is also used to prospect new customers using external sources. the resulting segmentation can be too complicated for the organization to implement effectively. an estimation of the usage behavior can be obtained. More directly. the meaning of a customer segmentation in unreliable and almost worthless. Furthermore. One solution to construct segments can be provided by data mining methods that belong to the category of clustering algorithms. different source systems) makes it also difficult to extract interesting information. On the other hand. This data is used to find a relation with the customer segmentations that were constructed before.2 Customer profiling Customer profiling provides a basis for marketers to ’communicate’ with existing customers in order to offer them better services and retaining them. Depending on the goal. This is done by assembling collected information on the customer such as demographic and personal data. Difficulties in making good segmentation are [18]: • Relevance and quality of data are essential to develop meaningful segments. • Continuous process: Segmentation demands continuous development and updating as new customer data is acquired. In this report. 1. in an e-commerce environment where feedback is almost immediate. Moreover. Poorly organize data (different formats. several clustering algorithms will be discussed and compared to each other. one has to select what is the profile that will be relevant to the project. In particular. for each profile. data analysts need to be continuously developing segmentation hypotheses in order to identify the ’right’ data for analysis. • Intuition: Although data can be highly informative. This makes it possible to estimate for each profile (the combination of demographic and personal information) the related segment and visa versa. such as demographic data purchased from various sources. Alternatively. In addition. • Over-segmentation: A segment can become too small and/or insufficiently distinct to justify treatment as separate segments. segmentation would require almost a daily update.

but potentially useful. 19]: • Geographic. data mining tools have been available for a long time. How many lifestyle characteristics about purchasers are useful? • Recruitment method. or industry? How much education is needed? How much brand building advertising is needed to make a pool of customers aware of offer? • Lifestyle. What is the customers’ attitude toward your kind of product or service? • Life cycle. It involves selecting. How long has the customer been regularly purchasing products? • Knowledge and awareness. neural networks. are described in [2. This can be realized by a data mining method called Support Vector Machines (SVM). exploring and modeling large amounts of data to uncover previously unknown patterns. the file would contain product information and/or volume of money spent. How was the customer recruited? The choice of the features depends also on the availability of the data. How much knowledge do customers have about a product. Are they grouped regionally. Although. the term data mining was used. This report gives an description of SVM’s and it will be researched under which circumstances and parameters a SVM works best in this case. What is the average household incom or power of the customers? Do they have any payment difficulty? How much or how often does a customer spend on each product? • Age and gender.gender. nationally or globally • Cultural and ethnic. What languages do they speak? Does ethnicity affect their tastes or buying behaviors? • Economic conditions. 10. an estimation model can be made.2 Data mining In section 1. and ultimately comprehensible information. What is the predominant age group of your target buyers? How many children and what age are in the family? Are more female or males using a certain service or product? • Values. income and/or purchasing power. Customer features one can use for profiling. rule induction and refinement. and graphic visualization. attitudes and beliefs.service. particularly exploratory tools like data visualization and neural 11 . Data mining is the process of searching and analyzing data in order to find implicit. from large databases. decision trees. Data mining uses a broad family of computational methods that include statistical analysis.1. information [12]. the advances in computer hardware and software. If one needs profiles for specific products. With these features. 1.

have made data mining more attractive and practical. The typical data mining process consist of the following steps [4]: • problem formulation • data preparation • model building • interpretation and evaluation of the results Pattern extraction is an important component of any data mining activity and it deals with relationships between subsets of data. For example. Classification algorithms groups customers in predefined classes. Clustering algorithms produce classes that maximize similarity within clusters but minimize similarity between classes. a pattern is defined as [4]: A statement S in L that describes relationships among a subsets of facts Fs of a given set of facts F. The identification of patterns in a large data set is the first step to gaining useful marketing insights and marking critical marketing decisions. with some certainty C. ”international callers”. clustering algorithms can classify the Vodafone customers into ”call only” users. such that S is simpler than the enumeration of all facts in Fs . The specific tasks to be used in this research are Clustering (for the customer segmentation). even as it is used to support other data mining tasks. Data mining tasks are used to extract patterns from large data sets. Classification (for estimating the segment) and Data visualization. 12 .1. The taxonomy reflects the emerging role of data visualization as Figure 1. this task was not mentioned as a separate one. Formally. Different data mining tasks are grouped into categories depending on the type of knowledge extracted by the tasks. The various data mining tasks can be broadly divided into six categories as summarized in Figure 1. based on user behavior data. Validation of the results is also a data mining task.networks. A drawback of this method is that the number of clusters has to be given in advance. By the fact that the validation supports the other data mining tasks and is always necessary within a research. ”SMS only” users etc.1: A taxonomy of data mining tasks a separate data mining task. For example. The data mining tasks generate an assortment of customer and market knowledge which form the core of knowledge management process. The advantage of clustering is that expert knowledge is not required.

Chapter 2 describes the process of selecting the right data from the data ware house. Multiple plots and figures will show the working of the different cluster methods and the meaning of each segment will be described. that in this research is used to determine the customer segmentations. Different cluster algorithms will be studied. Different parameter settings of the Support Vector Machines will be researched and examined in Chapter 6 to find the best working model. It also focuses on validation methods.4) can be used. the research will be discussed. the optimal numbers of cluster will be determined. It provides information about the structure of the data and the data ware house. Clustering is a data mining technique. it gives an overview of the data that is used to perform customer segmentation and customer profiling. algorithms as Principal Component Analysis and Sammon’s Mapping (discussed in Section 3. To realize this. rotate or zoom the objects. The chapter starts with explaining the general process of clustering. Chapter 4 analyzes the different cluster algorithms of Chapter 3. 13 . in Chapter 7. The chapter ends with a description of visualization methods. a profile can be made. In Chapter 3 the process of clustering is discussed. Then. To provide varying levels of details of observed patterns. Furthermore. Once the segments are determined. Finally. Conclusions and recommendations are given and future work is proposed. These methods are used to analyze the results of the clustering.3 Structure of the report The report comprises 6 chapters and several appendices. In some cases it is needed to reduce high dimensional data into three or two dimensions. which can be used to determine the optimal number of clusters and to measure the performance of the different cluster algorithms. with the customer data of Chapter 2. Chapter 5 delves into a data mining technique called Support Vector Machines. This technique will be used to classify the right segment for each customer profile. In addition to to this introductory chapter.Vodafone can classify its customers based on their age. It ends with an explanation of the preprocessing techniques that were used to prepare the data for further usage. gender and type of subscription and then target its user behavior. Data visualization allow data miners to view complex patterns in their customer data as visual objects complete in three or two dimensions and colors. This will be tested with the prepared call detail data as described in Chapter 2 For each algorithm. the cluster algorithms will be compared to each other and the best algorithm will be chosen to determine the segments. 1. data miners use applications that provide advanced manipulation capabilities to slice.

In the postpaid group. that their customers can use the Vodafone network.Chapter 2 Data collection and preparation The first step (after the problem formulation) in the data mining process is to understand the data. In this chapter. there are captive and non captive users. All data of Vodafone is stored in a data warehouse. This data warehouse exists off more than 200 tables.1. business customers can be seen as employees of a business that have a subscription by Vodafone in relation with that business. 2.1 Data warehouse Vodafone has stored vast amounts of data in a Teradata data warehouse.1. Debitel and InterCity Mobile Communications (ICMC). The ICMC customers will also be involved in this research. Debitel customers are always consumers and ICMC customers are always business customers. the process of preparing the data for customer segmentation and customer profiling will be explained. Vodafone has made an accomplishment with two other telecommunications companies. In general.1 Selecting the customers Vodafone Maastricht is interested in customer segmentation and customer profiling for (postpaid) business customers. A more precisely view can be found in Figure 2. It is clear to see. A non-captive customer is using the Vodafone network but has not a Vodafone subscription or prepaid (called roaming). that prepaid users are always consumers. 2. Furthermore. the process of collecting the right data from this data ware house. useful applications cannot be developed. will be described. A captive customer has a business account if his telephone or subscription is bought in relation with the business 14 . A simplified model of the data warehouse can be found in Appendix A. Without such an understanding.

where. The choice of summary variables (features) is critical in order to obtain a useful description of the customer []. since the goal of data applications is to extract knowledge at the customer level. can have a subscription that is under normal circumstances only available for business users. etc. who. Call detail records can not be used directly for data mining. This is in contrast with billing data. 2. 8].1: Structure of customers by Vodafone he works. The total number of (postpaid) business users at Vodafone is more than 800. can help with this process: 15 . In some cases.2 Call detail data Every time a call is placed on the telecommunications network of Vodafone. Call detail records include sufficient information to describe the important characteristics of each call. and will be available almost immediately for data mining. These customers also count as business users. not at the level of individual phone calls [7. customers with a consumer account. Given that 12 months of call detail data is typically kept on line. Thus. one can think of the smallest set of variables that describe the complete behavior of a customer. when. The next sections describe which data of these customers is needed for customer segmentation and profiling. the date and the time of the call and the duration of the call. At a minimum. These customers are called business users. To define the features.000. The number of call detail records that are generated and stored is huge. how often. Keywords like what. the call detail records associated with a customer must be summarized into a single record that describes the customer’s calling behavior.1. descriptive information about the call is saved as a call detail record. Call detail records are generated in two or three days after the day the calls were made. each call detail record will include the originating and terminating phone numbers. which is typically made available only once per month. Vodafone customers generate over 20 million call detail records per day. this means that hundreds of millions of call detail data will need to be stored at any time. For example.Figure 2.

% of outgoing calls within the same operator • 11. but their appearances are so rare that they were not used during this research). • When? : When does a customer call? A business customer can call during office daytime. • Who? : Who is the customer calling? Does he call to fixed lines? Does he call to Vodafone mobiles? • What? : What is the location of the customer and the recipient? They can make international phone calls. or sending an SMS (there are more possibilities.6pm) • 5. average # sms originated per day • 9. % daytime calls (9am . average # calls originated per day • 4. average # calls received per day • 3. a list of features that can be used as a summary description of a customer based on the calls they originate and receive over some time period P is obtained: • 1. average # sms received per day • 8. 19. or in private time in the evening or at night and during the weekend. % of weekday calls (Monday . # different numbers called during P 16 . • Where? : Where is the customer calling? Is he calling abroad? • How long? : How long is the customer calling? • How often? : How often does a customer call or receive a call? Based on these keywords and based on proposed features in the literature [1.Friday) • 6. % of calls to mobile phones • 7. The customer can also receive an SMS or voice call. % international calls • 10. average call duration • 2.• How? : How can a customer cause a call detail record? By making a voice call. 15. # unique area codes called during P • 12. 20] .

Although the construction of these features may be guided by common sense. Figure 2. shown in Figure 2. for each summary feature. Should poor features be generated. in general. First of all. the segmentation was based on the percentage weekday and daytime calls. values above the blue line represent customers with more originating calls than receiving calls. Most of the twelve features listed above can be generated in a straightforward manner from the underlying data of the data ware house. is a critical step within the data mining process. For some features values.2: Visualization of phone calls per hour variance within the data. For example. It may be clear that generating useful features. On the other hand. Note that the histograms resemble well known distributions. customers who use their telephone only at their office could be in a different segment then users that use their telephone also for private purposes. but some features require a little more creativity and operations on the data. receive also more calls in proportion. otherwise distinguish between customers is not possible and the feature is not useful. Such a segment describes a certain behavior of group of customers. the use of the time period 9am-6pm in the fourth feature is not based on the commonsense knowledge that the typical workday on a office is from 9am to 5pm. For examples.2 indicates that the period from 9am to 6pm is actually more appropriate for this purpose. Interesting to see is the relation between the number of calls originated and received. In that case. Figure 2. the number of weekday and daytime calls and the originated calls have sufficient variance.4 demonstrates this. Another aspect that is simple to figure out is the fact that customer 17 . Furthermore. there should be sufficient Figure 2. customers originating more calls than receiving.3 shows that the average call duration. In Figure 2. including summary features. data mining will not be successful. to much variance hampers the process of segmentation.These twelve features can be used to build customer segments. the variance is visible in the following histograms. it should include exploratory data analysis. This also indicates that the chosen features are suited for the customer segmentation. More detailed exploratory data analysis.4 is also visible that the customers that originated more calls.

4: Relation between originated and received calls 18 .3: Histograms of feature values Figure 2.(a) Call duration (b) Weekday calls (c) Daytime calls (d) Originated calls Figure 2.

expanded • Company size: small. intermediate. Figure 2.5. With this information. The information that Vodafone stored in the data ware house include name and address information and also include other information such as service plan.1.2 is not completely available. Information about lifestyles and income is missing. the following variables can be used to define a customers profile: • Age group: <25. small city /town 19 . with some creativity. contract information and telephone equipment information. some information can be subtracted from the data ware house. big • Living area: (big) city. 25-40 40-55 >55 • Gender: male.1. It is clear to see that the chosen features contain sufficient variance and that certain relations and different customer behavior are already visible.3 Customer data To profile the customer. The proposed data in Section 1.that make more weekday calls also call more at daytime (in proportion). customer data is needed. advanced • Type subscription: basic. This is plotted in Figure 2. The chosen features appear to be well chosen and useful for customer segmentation. female • Type telephone: simple.5: Relation between daytime and weekday calls 2. However. basic. advance.

• Interpreting codes into text or replacing text into meaningful numbers.5% simple 34. the segment of a customer can not be determined.5% Female 39. this feature will not increase the performance of the classification. Data may contain many meaningless fields from an analysis point of view.9% >55 21. the result of the classification algorithm is too specific to the trainings data [14].1 shows the percentages of customers within the chosen groups. This is caused by the fact that from each segment a relative high number of customers is represented in this group. it need to cleaned and prepared in a required format.2% Table 2. Otherwise.9% small 31. It is clear to see that sizes of the groups were chosen with care Age: Gender: Telephone type: Type of subscription: Company size: Living area: <25 21. If there is one group with a sufficient higher amount of customers than other groups.0% intermediate 34.0% 25-40 29.8% basic 38. 2. The composition of the groups should be chosen with care. abbreviations and punctuation.2% Male 60. Table 2. These tasks are [7]: • Discovering and repairing inconsistent data formats and inconsistent data encoding.Because a relative small difference in age between customers should show close relationships.4% advanced 27.5% (big) city 42. spelling errors.7% advanced 36. 20 . the age of the customers has to be grouped.8% expanded 29. In general.2% simple 33.0% 40-55 27.1% big 34. such as production keys and version numbers. Based on this feature. • Deleting unwanted data fields. the goal of grouping variables is to reduce the number of variables to a more manageable size and to remove the correlations between each variable.1: Proportions within the different classification groups and the values can be used for defining the customers profile. Chapter 5 and Chapter 6 contain information and results of this method.With this profile.3% small city/town 58. It is of high importance that the sizes of the groups are almost equal (if this is possible) [22].2 Data preparation Before the data can be used for the actual data mining process. a Support Vector Machine will be used to estimate the segment of the customer.

correspondence analysis and conjoint analysis [14]. • Normalization of the variables. • Converting from textual to numeral or numeric data. • Combining data.g. The goal of this approach is to reduce the number of variables to a more manageable size while also the correlations between each variable will be removed. decision trees. decision trees or associations rules. • Mapping continuous values into ranges. • Adding computed fields as inputs or targets. exhaustive. discretization and concept hierarchy generation). averages and minimum/maximum values. e.g. New fields can be generated through combinations of e. Dimension reduction means that one has to select relevant feature to a minimum set of attributes such that the resulting probability distribution of data classes is a close as possible to the original distribution given the values of all features. from multiple tables into one common variable. frequencies. • Checking missing data fields or fields that have been replaced by a default value. When there is a large amount of data. • Converting nominal data (for example yes/no answers) to metric scales. Techniques used for this purpose are often referred to as factor analysis. thus almost impossible to explain. The following data preparations were needed during this research: • Checking abnormal. it is also useful to apply data reduction techniques (data cube aggregation. clustering. A possible way to determine is to count or list all the distinct variables of a field.g. 21 .Data may contain cryptic codes. dimension and numerosity reduction. For this additional tools may be needed. out of bounds or ambiguous values. random or heuristic search. There are two types of normalization. The first type is to normalize the values between [0. Some of these outliers may be correct but this is highly unusual. for instance the customer data. • Finding multiple used fields.1]. These codes has to be augmented and replaced by recognizable and equivalent text. e. The second type is to normalize the variance to one.

1: Example of clustering data clusters into which the data can be divided were easily identified. it does not use prior class identifiers to detect the underlying structure in a collection of data.Chapter 3 Clustering In this chapter. Within this method. the used techniques for the cluster segmentation will be explained.1 shows this with a simple graphical example. The similarity criterion that was used in this case is distance: two or more objects belong to the same cluster if they are ”close” according to a given distance (in this case geometrical distance). Another way of clustering is conceptual clustering. A cluster can be defined as a collection of objects which are ”similar” between them and ”dissimilar” to the objects belonging to other clusters. Clustering can be considered the most important unsupervised learning method. two or more objects 22 .1 Cluster analysis The objective of cluster analysis is the organization of objects into groups. This is called distance-based clustering. In this case the 3 Figure 3. As every other unsupervised method. 3. according to similarities among them [13]. Figure 3.

2 The clusters The definition of a cluster can be formulated in various ways. objects are grouped according to their fit to descriptive concepts. are typically summarized observations of a physical process (call behavior of a customer). Distance can be measured in different ways. measured in some well-defined sense. To obtain such a model. (3. called the regressors. 3.2. additional steps are needed. the purpose of clustering is to find relationships between independent system variables.1. and X is called the pattern matrix. and will be calculated by the clustering algorithms simultaneously with the partitioning of the data.1 The data One can apply clustering techniques to quantitative (numerical) data. and future values of dependent variables.. depending on the objective of the clustering.2. N }. As mentioned before.. . . The data. A second way is to measure the distance form the data vector to some prototypical object of the cluster.. or distance measure. And therefore. the clustering of quantitative data is considered.1. and is represented as an N x n matrix:   x11 x12 · · · x1n  x21 x22 · · · x2n    X= . . Each observation of the customers calling behavior consists of n measured values. xkn ]T .belong to the same cluster if this one defines a concept common to all that objects. as described in Section 2. the columns are called the features or attributes. where xk ∈ Rn . not according to simple similarity measures. they will not automatically constitute a prediction model of the given system. X will be referred to the data matrix. 23 .   . . The term ”similarity” can be interpreted as mathematical similarity. The rows of X represent the customers.1) . and the columns are the feature variables of their behavior as described in Section 2. 2. similarity is often defined by means of a distance norm. called the regressands.. that the relations revealed by clustering are not more than associations among the data vectors. The first possibility is to measure among the data vectors themselves.. . . The cluster centers are usually (and also in this research) not known a priori. In this research. . . In general. . A set of N observations is denoted by X = {xk |k = 1. grouped into an n-dimensional row vector xk = [xk1 . xk2 . However.1. . one should realize. qualitative (categoric) data. one can accept the definition that a cluster is a group of objects that are more similar to another than to members of other clusters. the rows of X are called patterns or objects. xN 1 xN 2 · · · xN n In pattern recognition terminology.. or a mixture of both. In this research. In this research. 3. only distance-based clustering algorithms were used. In metric spaces. In other words.. .1.

c and d can be characterized as linear and non linear subspaces of the data space (R2 in this case).2 Clusters can be spherical. continuously connected to each other. The performance of most clustering algorithms is influenced not only by the geometrical shapes and densities of the individual clusters. but also by the spatial relations and distances among the clusters.3 Cluster partition Clusters can formally be seen as subsets of the data set. elongated and also be (a) Elongated (b) Spherical (c) Hollow (d) Hollow Figure 3.1. Data can reveal clusters of different geometrical shapes. Clusters a. such as linear or nonlinear subspaces or functions.The cluster centers may be vectors of the same dimensions as the data objects. but can also be defined as ”higher-level” geometrical objects.2: Different cluster shapes in R2 hollow. Subsets can 24 . Cluster can be found in any n-dimensional space. or overlapping each other. Clustering algorithms are able to detect subspaces of the data space. 3. and therefore reliable for identification. One can distinguish two possible outcomes of the classification of clustering methods. sizes and densities as demonstrated in Figure 3. Clusters can be well-separated.

. . Ø ⊂ Ai ⊂ X. .g.3) (3. 1 ≤ i = j ≤ c.2 · · · µ1. since these functionals are not differentiable.1 µ1.4) (3.c    (3.c Hard partition The objective of clustering is to partition the data set X into c clusters. . .2) U= . Fuzzy clustering methods allow objects to belong to several clusters simultaneously. fuzzy clustering is more natural than hard clustering.2 · · · µ2. .c  µ2. but rather are assigned membership degrees between 0 and 1 indicating their partial memberships (illustrated by Figure 3. . Assume that c is known. based on prior knowledge. . The structure of the partition matrix U = [µik ]:   µ1. which requires that an object either does or does not belong to a cluster. The number of subsets (clusters) is denoted by c.1 µN. i=1 (3. Using classical sets. In many real situations. 25 . with different degrees of membership. . as objects on the boundaries between several classes are not forced to fully belong to one of the classes. Hard clustering methods are based on the classical set theory.5) Ai ∩ Aj .1 µ2.either be fuzzy or crisp (hard). 1 ≤ i ≤ c. its properties can be defined as follows: c Ai = X. µN. a hard partition can be seen as a family of subsets {Ai |1 ≤ i ≤ c ⊂ P (X)}. Hard clustering in a data set X means partitioning the data into a specified number of exclusive subsets of X. of witch partition results must be validated. The data set X is thus partitioned into c fuzzy subsets.3 The discrete nature of hard partitioning also Figure 3.   . . or it is a trial value.2 · · · µN. . e.3: Hard and fuzzy clustering causes analytical and algorithmic intractability of algorithms based on analytic functionals.

the hard partitioning space for X can be seen as the set: c N Mhc = {U ∈ RN xc |µik ∈ {0. they must be disjoint and none of them is empty nor contains all the data in X.These conditions imply that the subsets Ai contain all the data in X.6) (3. (3. 1 ≤ k ≤ c.8) µAi ∨ µAi . Then. i=1 (3. 1}.10) µik = 1. k=1 N 0< i=1 µik < N. 0 ≤ µAi < 1.13) (3. ∀k}. is a representation of the hard partition if and only if its elements satisfy: µij ∈ {0. 1].11) A definition of a hard partitioning space can be defined as follows: Let X be a finite data set and the number of clusters 2 ≤ c < N ∈ N. To simplify these notations. U = [µik ]. partitions can be represented in a matrix notation.15) Note that there is only one difference with the conditions of the hard partitioning. (3. (3. Expressed in the terms of membership functions: c µAi = 1. 1 ≤ k ≤ c. 0 < i=1 µik < N .14) µik = 1. in this case µik is allowed to acquire all real values between zero and 1. c (3. k=1 N 0< i=1 µik < N. ∀i. ∀i. its conditions are given by: µij ∈ [0. 1 ≤ i = j ≤ c. 1 ≤ i ≤ N. 1 ≤ k ≤ c. 1 ≤ i ≤ c. 1 ≤ k ≤ c. µi will be used instead of µAi . 1 ≤ i ≤ N. 1 ≤ i ≤ N.12) Fuzzy partition Fuzzy partition can be defined as a generalization of hard partitioning. Where µAi represents the characteristic function of the subset Ai which value is zero or one. k=1 µik = 1. 1}. a N xc matrix. k. containing the fuzzy partitions.9) (3.7) (3. Consider the matrix U = [µik ]. c (3. Also the definition of the fuzzy partitioning space will not much differ with 26 . 1 ≤ i ≤ N. and denoting µi (xk ) by µik .

xk ∈ Ai . It can be defined as follows: Let X be a finite data set and the number of clusters 2 ≤ c < N ∈ N. However. fuzzy cluster algorithms will be applied as well. This method will result into hard partitioned clusters. Equation (1. 3. (3. i=1 (3. ∀i. the fuzzy partitioning space for X can be seen as the set: c N Mf c = {U ∈ RN xc |µik ∈ [0. ∀k}. ∀i. To deal with the problem of fuzzy memberships.17) Ai represents a set of data points in the i-th cluster and vi is the average of the data points in cluster i. The procedure follows an easy way to classify a given N x n data set through a certain numbers of c clusters defined in advance. vi is the cluster center (also called prototype) of cluster i: Ni xk (3.2.14) implies that the sum of each column should be 1.2 Cluster algorithms This section gives an overview of the clustering algorithms that were used during the research. the cluster with the highest degree of membership will be chosen as the cluster were the object belongs to. which means that the total membership of each xk in X equals one. 3. The K-means algorithm allocates each data point to one of the c clusters to minimize the within sum of squares: c sumk∈Ai ||xk − vi ||2 . However. Ni where Ni is the number of data points in Ai . the results of this hard partitioning method are not always reliable and this algorithm has numerical problems as well. 0 < i=1 µik < N . Within the cluster algorithms. 1]. k.16) The i-th column of U contains values of the membership functions of the i-th fuzzy subset of X. k=1 µik = 1.18) vi = k=1 . This research will focus on hard partitioning. The possibilistic partition will not be used in this researched and will not be discussed here. There are no constraints on the distribution of memberships among the fuzzy clusters. Note that ||xk −vi ||2 is actually a chosen distance norm.1 K-means K-means is one of the simplest unsupervised learning algorithms that solves the clustering problem. 27 .the definition of the hard partitioning space. Then.

3.19 measures the total number of variance of xk from vi .. for example. A (3. A (3. that can be solved by a variety of methods.23) 28 . there is no continuity in the data space. The only difference is that in K-medoid the cluster centers are the nearest data points to the mean of the data in one cluster V = {vi ∈ X|1 ≤ i ≤ c}. called C-means functional. with respect to U. one can adjoint the constrained in 3.2. 3.3 Fuzzy C-means The Fuzzy C-means algorithm (FCM) minimizes an objective function. (3. To find the stationary points of the c-means functional. The minimization of the C-means functional can be seen as a non linear optimization problem.. V and λ. The C-means functional. (DikA /DjkA )2/(m−1) (3. V.22) ˆ and by setting the gradients of (J). λ) = λk i=1 µik − 1 . This implies that a mean of the points in one cluster does actually not exist.2.19) with V = [v1 .2 K-medoid K-medoid clustering. invented by Dunn. to zero.20) V denotes the vector with the cluster centers that has to be determined. also a hard partitioning algorithm. .14 to J by means of Lagrange multipliers: c N 2 (µik )m DikA + i=1 k=1 k=1 N c ¯ J(X.19. (3. v2 . then the C-means functional may only be minimized by (U. ∀i. vc ].. vi ∈ Rn . When 2 DikA > 0. k and m > 1. to define the clusters. V ) = i=1 k=1 (µik )m ||xk − vi ||2 . is defined as follows: c N J(X. This method is called the fuzzy cmeans algorithm. 1 ≤ k ≤ N. equation 3. 1 ≤ i ≤ c. Examples of methods that can solve non linear optimization problems are grouped coordinate minimization and genetic algorithms. uses the same equations as the K-means algorithm. V ) ∈ Mf c xRnxc if µik = c j=1 1 . U. The simplest method to solve this problem is utilizing the Picard iteration through the first-order conditions for the stationary points of equation 3. This can be useful when.21) On a statistical point of view. The distance norm ||xk − vi ||2 is called a squared inner-product distance norm and A is defined by: 2 DikA = ||kk − vi ||2 = (xk − vi )T A(xk − vi ). U.

Remark that the vi of equation (3.27) The matrices Ai are used as optimization variables in the c-means functional. U.4 The Gustafson-Kessel algorithm The Gustafson and Kessel (GK) algorithm is a variation on the Fuzzy c-means algorithm [11].25) AD =  . V. in this case. A2 .2. . The FCM algorithm uses the standard Euclidean distance for its computations. It employs a different and adaptive distance norm to recognize geometrical shapes in the data. .. it is able to define hyper spherical clusters. .13) and (3. . matrix A is based ˆ on the Mahalanobis distance norm. . where F = 1 N N (xk − x)(xk − x)T ˆ ˆ k=1 (3. Hence that.24) The solution of these equations are satisfying the constraints that were given in equation (3.1 N m k=1 µi. . Another possibility is to choose A as the inverse of the nxn covariance matrix A = F −1 . caused by the common choice of the norm inducing matrix A = I. Ac ). (3. A) = (3. .23) and (3.k ≤ i ≤ c. This implies that each cluster is allowed to adapt the distance norm to the local topological structure of the data. .24) is the weighted average of the data points that belong to a cluster and the weights represents the membership degrees.and vi = N m k=1 µik xk . (3. The objective functional of the GK algorithm can be calculated by: c N 2 (uik )m DikAi . 3. Each cluster will have its own norm-inducing matrix Ai . satisfying the following inner-product norm: 2 DikA = (xk − vi )T · Ai (xk − vi ). where A = (A1 .   . The norm inducing matrix can also be chosen as an nxn diagonal matrix of the form:   (1/σ1 )2 0 ··· 0   0 (1/σ2 )2 · · · 0   (3.24).26) and x denotes the mean of the data. Note that it can only detect clusters with the same shape. i=1 k=1 J(X.28) 29 . ... .15). The Fuzzy C-means algorithm is actually an iteration between the equations (3. where 1 ≤ i ≤ c and 1 ≤ k ≤ N. This explains why the name of the algorithm is c-means. 0 0 · · · (1/σn )2 This matrix accounts for different variances in the directions of the coordinate axes of X. Therefor. A c-tuple of the norm-inducing matrices is defined by A.. .

2. This implies that J can be made as small as desired by making Ai less positive definite. The outcome of the inner-product norm of (3.28) can not be minimized in a straight forward manner. In the original FMLE algorithm.34) N k=1 30 . The variable αi in equation (3. (3.27) is a generalized squared Mahalanobis norm between the data points and the cluster center. (3. w = 1.33) The reason for using the w variable is to generalize this expression.30) can be substituted into equation (3.13).29) Here ρ is a remaining constant for each cluster.If A is fixed. To avoid this.31) (µik )m Fi is also called the fuzzy covariance matrix. This implies that this distance norm will decrease faster than the inner-product norm. (3. Ai can be expressed in the following way: Ai = [ρi det(Fi )]1/n Fi−1 . the distance norm includes an exponentional term. to compensate the exponential term and obtain clusters that are more fuzzy. A general way to this is by constraining the determinant of the matrix.32) Comparing this with the Gustafson-Kessel algorithm. w will be set to 2. The covariance is weighted by the membership degrees of U . A varying Ai with a fixed determinant relates to the optimization of the cluster whit a fixed volume: ||Ai || = ρi . Hence that this equation in combination with equation (3. In this research. ρ > 0. 1 ≤ i ≤ c.27). Ai has to be constrained to obtain a feasible solution.32) is the prior probability of selecting cluster i. (3. since it is linear in Ai . αi can be defines as follows: N 1 αi = µik .5 The Gath Geva algorithm Bezdek and Dunn [5] proposed a fuzzy maximum likelihood estimation (FMLE) algorithm with a corresponding distance norm: Dik (xk . Because of the generalization. Unfortunately. two weighted covariance matrices arise.30) .15) can be applied without any problems. vi ) = det(Fwi ) αi (l) (l) T −1 1 2 (xk −vi ) Fwi (xk −vi ) . the fuzzy covariance matrix F i is defined by: Fwi = N k=1 (µik )w (xk − vi )(xk − vi )T N k=1 (µik )w .14) and (3. with Fi = N k=1 (3. In combination with the Lagrange multiplier. N k=1 (µik )m (xk − vi )(xk − vi )T 3. (3. In this case. (3. the conditions under (3. equation (3.

3.Gath and Geva [9] discovered that the FMLE algorithm is able to detect clusters of different shapes. Therefore. which are described below: • Partition Coefficient (PC): measures the amount of ”overlapping” between clusters. since the exponential distance norm can converge to a local optimum.36) 31 . The main drawback of this validity measure is the lack of direct connection to the data itself. However. Different validation methods have been proposed in the literature. sizes and densities and that the clusters are not constrained in volume. CE(c) = − 1 N c N uij log(uij ) i=1 j=1 (3. • Classification Entropy (CE): measures only the fuzziness of the cluster. It is defined by Bezdek [5] as follows: 1 P C(c) = N c N (uij )2 i=1 j=1 (3.35) where uij is the membership of data point j in cluster i. the data can not be grouped in a meaningful way at all.3 Validation Cluster validation refers to the problem whether a found partition is correct and how to measure the correctness of a partition. this does not apply that the best fit is meaningful at all. validation measures has to be designed. To be able to perform the second approach. The number of clusters might not be correct or the cluster shapes do not correspond to the actual groups in the data. it is not know how reliable the results of this algorithm are. in this research are used several indexes. and successively reducing this number by combining clusters that have the same properties. which is a slightly variation on the Partition Coefficient. A clustering algorithm is designed to parameterize clusters in a way that it gives the best fit. however. In the worst case. The optimal number of clusters can be found by the maximum value. • Cluster the data for different values of c and validate the correctness of the obtained clusters with validation measures. The main drawback of this algorithm is the robustness. One can distinguish two main approaches to determine the correct number of clusters in the data: • Start with a sufficiently large number of clusters. Furthermore. none of them is perfect by oneself.

vj ) − d(x. is rated in under bound by the triangle-inequality: d(x.37) P I is mainly used for the comparing of different partitions with the same number of clusters.40) The main disadvantage of the Dunn’s index is the very expansive computational complexity as c and N increase.y∈Cj d(x. y)} (3. the result of the clustering has to be recalculated. y) }} maxk∈c {maxx.i=j minx∈Ci . This will be the case when the dissimilarity between two clusters. y) ≥ |d(y. DI(c) = min{ min { i∈c j∈c. • Alternative Dunn Index (ADI):To simplify the calculation of the Dunn index. c N 2 2 i=1 j=1 (uij ) ||xj − vi || SI(c) = (3.42) . the separation index uses a minimum-distance separation to validate the partitioning. c P I(c) = i=1 N j=1 (uij )m ||xj − vi ||2 c k=1 Ni ||vk − vi ||2 (3. • Separation Index (SI): in contrast with the partition index (PI). vj ) − d(xi .41) minxi ∈Ci .y∈C d(x.xj ∈Cj |d(y. the Alternative Dunn Index was designed. Each individual cluster is measured with the cluster validation method. vj )| were vj represents the cluster center of the j-th cluster.38) N mini.y∈C d(x. measured with minx∈Ci . • Dunn’s Index (DI): this index was originally designed for the identification of hard partitioning clustering. A minor value of a SC means a better partitioning. XB(c) = c i=1 N j=1 (uij )m ||xj − vi ||2 N mini.• Partition Index (PI): expresses the ratio of the sum of compactness and separation of the clusters. ADI(c) = min{ min { i∈c j∈c.i=j (3.39) The lowest value of the XB index should indicate the optimal number of clusters. This value is normalized by dividing it by the fuzzy cardinality of the cluster.j ||xj − vi ||2 (3. Therefor. y).k ||vk − vi ||2 • Xie and Beni’s Index (XB): is a method to quantify the ratio of the total variation within the clusters and the separations of the clusters [3]. y)} 32 (3. the sum of the value for each individual cluster is used.y∈Cj d(x. vj )| }} maxk∈c {maxx. To receive the Partition index.

a standard and a most widely method to map high-dimensional data into a lower dimensional space. called the principal components. This kind of mapping of distances is much closer related to the proposition of clustering than saving the variances (which will be done by PCA). that the Partition Coefficient and the Classification Entropy are only useful for fuzzy partitioned clustering. which is based on the preservation of the Euclidean inter point distance norm. The first principal component represents 33 .4 Visualization To understand the data and the results of the clustering methods. However. This section describes three methods that can map the data points into a lower dimensional space. A draw back of this Fuzzy Sammon mapping is the loose of precision in distance.1 Principal Component Analysis Principal component analysis (PCA) include a mathematical procedure that maps a number of correlated variables into a smaller set of uncorrelated variables.Note. the Sammon mapping application has two main drawbacks: • Sammon mapping is a projection method. In this research.4. • The Sammon mapping method aims to find in a high n-dimensional space N points in a lower q-dimensional subspace. In case of fuzzy clusters the values of the Dunn’s Index and the Alternative Dunn Index are not reliable. To avoid these problems of the Sammon mapping method. the three mapping methods will be used for the visualization of the clustering results. this report will focus on the Sammon mapping method. Then. The three visualisation methods will be explained in more detail in the following subsections. 3. a computational expensive algorithm is needed. The first method is the Principal Component Analysis (PCA). it is useful to visualize the data and the results. However. 3. a modified algorithm. is used during this research. This implies that the Sammon mapping only can be applied on clustering algorithms that use the Euclidean distance norm during the calculations of the clusters. called the Fuzzy Sammon mapping. This is caused by the repartitioning of the results with the hard partition method. which can not be plotted and visualized directly. the used data set is a highdimensional data set. since only the distance between the data points and the cluster centers considered to be important. such in a way the inter point distances correspond to the distances measured in the n-dimensional space. because in every iteration step a computation of N (N − 1)/2 distances is required. The advantage of the Sammon mapping is the ability to preserve inter pattern distances. To achieve this.

The inter point distance measure of the n-dimensional space. Furthermore. The direction of the first principal component is diverted from the eigenvector with the largest eigenvalue. xj ) correspond to the inter point distances in the q-dimensional space.47) λ= i<j dij = i=1 j=i+1 dij .46) 3.the Sammon mapping uses inter point distance measures to find N points in a q-dimensional data space.4. a minimization criterion of the error: E= where λ is a constant: N −1 N 1 λ N −1 i=1 (dij − d∗ )2 ij . (3. In this case. N (3. dij j=i+1 N (3.as much of the variability in the data as possible. This methods uses only the first q nonzero eigenvalues and the corresponding eigenvectors of the covariance matrix: Fi = Ui Λi UiT . which are representative for a higher n-dimensional data set. The succeeding components describe the remaining variability. (3. defined by dij = d(xi .j of Fi in its diagonal in decreasing order and Ui is a matrix containing the eigenvectors corresponding to the eigenvalues in its columns.q .k = Wi−1 (xk ) = WiT (xk ).q Λi. the principal components will be achieved by analyzing the eigenvectors and eigenvalues. etc. yj ).2 Sammon mapping As mentioned before.48) 34 . the second objective is used. given by d∗ = d∗ (yi . the covariance matrix of the data set can be described by: F = 1 (xk − v)(xk − v)T .44) With Λi as a matrix that contains the eigenvalues λi. This is achieved by Sammon’s ij stress. • Discovering and/or reducing the dimensionality of a data set. which can be defined as follows: yi. The main goals of the PCA method are: • Identifying new meaningful underlying variables. (3. The eigenvalue associated with the second largest eigenvalue correspond to the second principal component. (3.45) The weight matrix Wi contains the q principal orthonormal axes in its column: 1 2 Wi = Ui. Principal Component Analysis is based on the projection of ¯ correlated high-dimensional data onto a hyperplane [3]. In this research. there is a q-dimensional reduced vector that represents the vector xk of X. In a mathematical way.43) where v = xk .

The minimization of the error E is an optimization problem in the N xq variables yil .19): c N Ef uzz = i=1 k=1 (µki )m (d(xk .3 Fuzzy Sammon mapping As mentioned in the introduction of this section.50) ∂yil (t) λ dki d∗ ki k=1. 2.. 2 ∂yil (t) ∂E(t) ∂y (t) (3. According to this information. The modified algorithm. uses only N ∗c distances. The Fuzzy Sammon mapping algorithm is similar to the original Sammon mapping.k=i ∂ 2 E(t) 2 =− 2 ∂yil (t) λ N k=1. but in this case the projected cluster 35 .k=i 1 (dki − d∗ ) − ki dki d∗ ki (yil − ykl )2 d∗ ki 1+ dki − d∗ ki dki (3. This scalar constant represents the step size for gradient search in the direction of N 2 dki − d∗ ∂E(t) ki =− (yil − ykl ) (3... N } and l ∈ {1. The Euclidean distance between the cluster center zi and the data yk of the projected q-dimensional space is represented by d∗ (yk . ki (3.Note that there is no need to maintain λ.52) with d(xk .51) With this gradient-descent method. since a constant does not change the result of the optimization process. independently to the shape of the original cluster. However. vi ) representing the distance between data point xk and the cluster center vi in the original n-dimensional space. in a projected two dimensional space every cluster is represented by a single point.. . it is possible to estimate the correct initialization based on the information which is obtained from the data. zi ).. with i ∈ {1.. while searching for the minimum of E. 2.. called Fuzzy Sammon mapping. . weighted by the membership values similarly to equation (3. 3.49) where α is a nonnegative scalar constant. a modified mapping method is designed which takes into account the basic properties of fuzzy clustering algorithms where only the distance between the data points and the clustering centers are considered to be important [3].3 − 0. because multiple experiments with different random initializations are necessary to find the minimum. it is not possible to reach a local minimum in the error surface. To avoid this drawbacks. The rating of yil at the t-th iteration can defined by:   il yil (t + 1) = yil (t) − α  ∂ 2 E(t)  . This is a disadvantage. vi ) − d∗ )2 .4.. yiq ]T .. q} which implies that yi = [yi1 . Sammon’s mapping has several drawbacks.4. . with a recommended value α 0.

center will be recalculated in every iteration after the adaption of the projected data points. The recalculation will be based on the weighted mean formula of the fuzzy clustering algorithms, described in Section 3.2.3 (equation 3.19). The membership values of the projected data can be plotted based on the standard equation for the calculation of the membership values: µ∗ = ki
c j=1

1
d∗ (xk ,ηi) d∗ (xk ,vj )
2 m−1

,

(3.53)

where U ∗ = [µ∗ ] is the partition matrix with the recalculated memberships. ki The plot will only give an approximation of the high dimensional clustering in a two dimensional space. To measure the quality of this rating, an evaluation function that determines the mean square error between the original and the recalculated error can be defined as follows: P = ||U − U ∗ ||. (3.54)

In the next chapter, the cluster algorithms will be tested and evaluated. The PCA and the (Fuzzy) Sammon mapping methods will be used to visualize the data and the clusters.

36

Chapter 4

Experiments and results of customer segmentation
In this chapter, the cluster algorithms will be tested and their performance will be measured with the proposed validation methods of the previous chapter. The best working cluster method will be used to determine the segments. The chapter ends with an evaluation of the segments.

4.1

Determining the optimal number of clusters

The disadvantage of the proposed cluster algorithms is the number of clusters that has to be given in advance. In this research the number of clusters is not known. Therefor, the optimal number of clusters has to be searched with the given validation methods of Section 3.3. For each algorithm, calculations for each cluster, c ∈ [215], were executed. To find the optimal number of clusters, a process called Elbow Criterion is used. The elbow criterion is a common rule of thumb to determine what number of clusters should be chosen. The elbow criterion says that one should choose a number of clusters so that adding another cluster does not add sufficient information. More precisely, by graphing a validation measure explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph (the elbow). Unfortunately, this elbow can not always be unambiguously identified. To demonstrate the working of the elbow criterion, the feature values that represent the call behavior of the customers, as described in Section 2.1.2, are used as input for the cluster algorithms. From the 800,000 business customers of Vodafone, 25,000 customers were randomly selected for the experiments. More customers would lead to computational problems. First, the K-means algorithm will be evaluated. The values of the validation methods depending on the number of clusters will be plotted. The value of the Partition Coefficient is for all 37

clusters 1, and the classification entropy is always ’NaN’. This is caused by the fact that these 2 measures were designed for fuzzy partitioning methods, and in this case the hard partitioning algorithm K-means is used. In Figure 4.1, the values of the Partition Index, Separation Index and Xie and Beni’s Index are shown. Mention again, that no validation index is reliable only by itself.

Figure 4.1: Values of Partition Index, Separation Index and the Xie Beni Index Therefor, all the validation indexes are shown. The optimum could differ by using different validation methods. This means that the optimum only could be detected by the comparison of all the results. To find the optimal number of cluster, partitions with less clusters are considered better, when the difference between the values of the validation measure are small. Figure 4.1 shows that for the PI and SI, the number of clusters easily could be rated to 4. For the Xie and Beni index, this is much harder. The elbow can be found at c = 3, c = 6, c = 9 or c = 13, depending on the definition and parameters of an elbow. In Figure 4.2 there are more informative plots shown. The Dunn’s index and the Alternative Dunn’s index confirm that the optimal number of clusters for the K-means algorithm should be chosen to 4. The values of all the validation measures for the K-means algorithm, are embraced in table 4.1

38

7783 0.0041 0.4626 0.0000 N aN 0.9386 0.0000 N aN 0.0002 4.7489 0.9253 0.8384 0.0000 3 1.8362 0.0061 0.0000 N aN 3.0443 0.0000 N aN 0.0000 N aN 0.0000 4 1.0071 0.8080 0.0072 0.0001 3.0002 3.0034 0.0003 4.0000 N aN 0.0000 N aN 0.0000 N aN 1.8318 0.7225 0.9109 0.2214 0.0001 14 1.8620 0.3998 0.0005 5.0000 N aN 1.0001 3.0002 11 1.0002 3.0061 0.9079 0.0000 N aN 0.0034 0.0002 5.0002 4.0052 0.2: Values of Dunn’s Index and the Alternative Dunn Index c PC CE PI SI XBI DI ADI c PC CE PI SI XBI DI ADI 2 1.0002 3.0000 6 1.0070 0.0082 0.3353 0.0002 3.8261 0.8828 0.0000 5 1.0061 0.0000 N aN 0.0065 0.1: The values of all the validation measures with K-means clustering 39 .0063 0.9519 0.0000 N aN 0.0000 N aN 1.0071 0.0001 3.0013 10 1.7696 0.5737 0.4379 0.0002 3.0000 15 1.0061 0.1571 0.0001 12 1.Figure 4.0001 13 1.7557 0.8758 0.0000 8 1.0018 9 1.0000 N aN 1.0001 3.2907 0.0000 Table 4.0000 7 1.

the optimal number of clusters is located at c = 6. the main drawback of PC is the monotonic decreasing with c. which makes it hardly to detect the optimal number of cluster. the validation methods can be used now for the fuzzy clustering. In Figure 4.3 the results of the Partition Index and the Classification Entropy are plotted. the Alternative Dunn index. has an elbow at the point c = 3. For the PI and the SI. The results can be found in Appendix B.the optimal number of clusters is chosen at c = 4. For the other algorithms. Figure 4. For the K-means. it is difficult to find the optimal number of clusters. can be seen as an elbow. caused by the lack of direct connection to the data. the local minimum is reached at c = 6.5.3: Values of Partition coefficient and Classification Entropy with Gustafson-Kessel clustering ters. However. The results of the validation measures for the Gustafson-Kessel algorithm are written in table 4. 40 . The optimal number of cluster can not be rated based on those two validation methods. the Dunn index also indicates that the optimal number of clusters should be at c = 6. Compared to the hard clustering methods. c = 6 and c = 11. This process can be repeated for all other cluster algorithms.2. To illustrate this.It is also possible to define the optimal numbers of clusters for fuzzy clustering algorithms with this method. Again. for the Alternative Dunn Index is not known how reliable its results are. On the other hand. for the XBI. In Figure 4. K-medoid and the Gath-Geva. The points at c = 3.4 gives more information about the optimal number of clus- Figure 4. However. the results of the GustafsonKessel algorithm will be shown. so the optimal number of clusters for the Gustafson-Kessel algorithm will be six. The same problem holds for CE: monotonic increasing.

Separation Index and the Xie Beni Index with Gustafson-Kessel clustering Figure 4.4: Values of Partition Index.5: Values of Dunn’s Index and Alternative Dunn Index with GustafsonKessel clustering 41 .Figure 4.

0002 0.0102 0.3 PC 1 1 0.0007 13 0. To visualize the clustering results.8903 0.0009 1. On the score of the values of the three most used indexes.5303 0.9305 0.3 and 4.1611 2.0001 0.4183 0.1469 2.1149 2.1573 0.0003 3 0. the validation methods that are described in Section 3.9203 0.0034 Inf 1.0001 0.2189 0.2741 1.0009 15 0.3: The numerical values of validation measures for c = 4 and 4.8128 0.0003 1.0002 0.0867 1.0002 0.0041 0.0009 1.7447 0.3983 1.5547 0.1410 2.2 Comparing the clustering algorithms The optimal number of cluster can be determined with the validation methods.0063 10 0.0002 4 0.0092 0.8218 1.0001 SI 0.7149 0.0002 6 0.0001 0. Table 4.4293 0.1571 0.0007 0.3983 0.0002 0.0063 0.2366 0.0083 0.7293 0.2: The values of all the validation measures with Gustafson-Kessel clustering 4.0034 0.5603 0.0002 0.0039 0. The validation measures for c = 4 and c = 6 of all the clustering methods are collected in the tables 4.0015 0. one can conclude that for c = 4 the Gath-Geva algorithm has the best results and for c = 6 the Gustafson-Kessel algorithm.1479 2.6620 0.4982 CE NaN NaN 1.2024 1.6882 0.0082 0.c PC CE PI SI XBI DI ADI c PC CE PI SI XBI DI ADI 2 0.3550 0.9019 0.0852 0.0001 14 0. the dataset can be reduced to a 2-dimensional space.7688 0.2800 0.5930 0.0007 1.0039 00030 K-means K-medoid FCM GK GG Table 4.7813 0.0084 0.3044 1.6462 0.0018 12 0. Separation index.0001 0. Xie and Beni’s index and Dunn’s index.0028 0.9364 0.0083 0.3863 1.0004 1.9205 0.5930 0.0000 Table 4.0027 0. the optimal number of clusters was found at c = 4 or c = 6.0002 1.0039 11 0.5131 0.5675 0.0644 DI 0.0000 7 0.0001 42.0017 0.2737 0.4183 1.3046 0.5034 PI 1. To avoid visibility problems (plotting too much values will cause one 42 .1702 2.0029 ADI 0.8536 0.0006 0.3500 0. The validation measures can also be used to compare the different cluster methods. As examined in the previous section.5819 0.0012 0.2057 0.5085 0.0046 0.7233 0.0029 0.5978 0.0062 0.2489 1.0002 2.4 can be used.0263 9 0.0002 0.0001 XBI 5.2066 1.7575 0.0030 0.0853 0. as mentioned in the previous section.5512 0.3209 1.0002 0.0001 0.4. With these visualization methods.4 show that the PC and CE are useless for the hard clustering methods K-means and K-medoid.4684 0.0004 5 0.0001 0. depending on the clustering algorithm.7797 0.0001 8 0.

This implies that the Fuzzy C-means algorithm is not able to find good clusters for this data set. Figures 4.0029 0.0008 0.3773 CE NaN NaN 1.x show the different clustering results for c = 4 and c = 6 on the data set.9. only 500 values (representing 500 customers) from this 2-dimensional dataset will be randomly picked. In the situation with 6 clusters. the clusters are well separated. but the cluster centers are almost situated on the same location. the Sammon’s mapping gives the best visualization of the results.0008 0.1238 0.9203 1.0099 ADI 0.0001 0. shows unexpected results.x4. one can see three big cluster.0063 0. in Figure 4.0002 0.8.0001 19. In Figure 4. with one small cluster in one of the big clusters. one can see that there are actually 4 cluster centers.0070 0. Note that the cluster in the left bottom corner and the cluster in the 43 .7918 1.0009 Table 4. For the other cluster algorithms. the results of the Gustafon-Kessel algorithm are plotted.0008 XBI 3.0457 DI 0.6490 PI 1. These visualization methods are used for the following plots. None of the clusters contain sufficient more or less customers than other clusters. Figure 4. there are only 2 clusters clearly visible.1043 SI 0. The other two cluster centers are nearly invisible.6 and 4.0102 0.0001 0. By a detailed look at the plot.4: The numerical values of validation measures for c = 6 big cloud of data points).K-means K-medoid FCM GK GG PC 1 1 0.7 show that hard clustering methods can find a solution Figure 4.4293 1. For the K-means and the K-medoid algorithm.4613 0.0001 0.1667 0. the Fuzzy Sammon’s mapping visualization gives the best projection with respect to the partitions of the data set. The plot of the Fuzzy C-means algorithm.0007 0.2907 0.3044 0. For the situation with 4 clusters.9253 Inf 0.9245 0.8903 0. For both situations.6: Result of K-means algorithm for the clustering problem.

9: Result of Gustafson-Kessel algorithm 44 .Figure 4.8: Result of Fuzzy C-means algorithm Figure 4.7: Result of K-medoid algorithm Figure 4.

In the real high-dimensional situation. This indicates that. one can not distinguish between the two cluster algorithms. a profile can be made by drawing a line between all normalized feature values (each feature value is represented at the x-as) of the customers within this cluster.10. visualized in Figure 4. Another way to view the differences between the cluster methods is to profile the clusters. In Figure 4. In the next section. The fact that this is the case in the two-dimensional plot. The result is visible for the Gath-Geva algoithm for c = 4 and for the Gustafson-Kessel algorithm with six clusters. The box indicates the upper and lower quartiles.9 are also maintained in the situation with 6 clusters. To determine which partitioning will be used to define the segments. the results show that the clusters are homogeneous. Here are also appearing clusters in other clusters. With the results of the validation methods and the visualization of the clustering. 45 . based on the distances to the cluster.11 and 4. 4. the two different partitions will be closely compared with each other. for the situation c = 4 look similar to the result of the Gustafson-Kessel algorithm.10: Result of Gath-Geva algorithm of the Gath-Geva algorithm. This may indicate that the data points in these clusters represents customers that differ on multiple fields with the other customers of Vodafone. The results Figure 4. In both situations. a closer look to the meaning of the clusters will be needed. one can look at the distances from the points to each cluster. two box plots of the distances from the data points to the cluster are plotted. the clusters are not a subset of each other. one can conclude that there are two possible best solutions: The Gath-Geva algorithm for c = 4 and the Gustafson-Kessel algorithm for c = 6.top right corner in Figure 4.3 Designing the segments To define which clustering method will be used for the segmentation. For each cluster. indicates that a clustering with six clusters with the Gustafson-Kessel algorithms not a good solution. The result for the c = 6 situation is remarkable. but are separated.12.

12: Distribution of distances from cluster centers within clusters for the Gustafson-Kessel algorithm with c = 6 46 .11: Distribution of distances from cluster centers within clusters for the Gath-Geva algorithm with c = 4 Figure 4.Figure 4.

The profiles of the different clusters do not differ much in shape. This confirms the assumption that customers of different clusters have indeed a different usage behavior. Figure 4. This means that the customers in one profile contain similar values of the feature values.13: Cluster profiles for c = 4 47 . Most of the lines in one profile are drawn closely together. However. at least one value differs sufficient from the values of the other cluster. in each cluster.

Figure 4.14: Cluster profiles for c = 6 48 .

The mean of all the lines (equivalent to the cluster center) was calculated and a line between all the (normalized) feature vales was drawn. have a high feature value at feature 8.More relevant plots are shown in Figure 4. compared with other cluster. while Cluster 3 contains peaks at features 2 and 12.15 and ??. The difference between the clusters are visible by some feature values. Cluster 2 has high values at position 6 and 9. The 4th and final cluster has high values at feature 8 and 9. Figure 4. in the situation with four clusters.15: Cluster profiles of centers for c = 4 49 . For instance. Cluster 1 has customers.

Figure 4.16: Cluster profiles of centers for c = 6 50 .

9%) Segment 4 (20.2%) Segment 1 (18.4%) Segment 3 (18. In words.9 4 65. validation measures and plots.0 23.7%) Segment 3 (23.2 1.7 2. For the Gath-Geva algorithm with c = 4 and the Gustafson-Kessel algorithm with c = 6.0 71.1 12.3 120.0 65.4 2.4 2 1.8 2.9 3.6 9.9%) Segment 4 (20. feature 4 the daytime calls.5 1. The number of international calls is low.2 6.6 39.1 15.3 1.7 2.0 1.9 13.1 3.4 84.6 1.6 6.9 6 75.8%) Segment 6 (16.2 6.1 9.8 22. • Segment 2: This segment contains customers with an average voice call 51 .2 93.9 1.9 24.4 10.7 4.3 87.0 12 25.6 88.7 2.5 71.4 26.1 8 3.7 72.7 c=4 c=6 c=4 c=6 Table 4.6%) Segment 5 (14.2 1.1 86.1 11.1 4.5 shows the result of the customer segmentation. This customers call more in the evening (in proportion) and to fixed lines then other customers.1 2.6 3 3.2%) Segment 1 (18.1.2 30.4%) Segment 3 (18.8 73.1 74.2.8 2.9 66.1%) Segment 2 (14.8 1.4 1.5 4.2%) Segment 2 (28.7 65.7 0.7 1.8 72.2 92. Their sms usage is higher then normal.8%) Segment 6 (16.8 15.3 17.9 1.9 1.6 87.7 58. feature 9 the international calls. The feature Feature Average Segment 1 (27.4 5.8 2.9 3.8 4.7 17.1 2.1 2.3 12.9 2.1 78.0 1.6 126. feature 5 the weekday calls.3 7 1.4 12.8%) Feature Average Segment 1 (27. 8 originated sms.0 3.7%) Segment 3 (23.5 40.0 3.4 4.9 6.2 65.9 73.3 88.9 4.4 14.5 6. feature 10 the calls to Vodafone mobiles.6 2.6 73. (Feature 1 is the call duration.8 121.6 4. feature 2 the received voices calls and feature 3 the originated calls. 7 received sms.8 5 87.6 60.7 121.7 1.1%) Segment 2 (14.1 6.9 2.0 86.6%) Segment 5 (14.With the previous clustering results. table 4.2 1. 11 the unique are codes and feature 12 the number of different numbers called).4 6.8%) 1 119.6 87.7 87. it is not possible to decide which of the two clustering methods gives a better result.8 96.0 12.6 1.8 155.3%) Segment 4 (17.9 4.8 54.5: Segmentation results numbers correspond to the feature numbers of Section 2.6 1.8 94.7 66.3 65.1 9 2. the segments can be described as follows: For the situation with 4 segments: • Segment 1: In this segment are customers with a relative low number of voice calls. Therefor. 6 are calls to mobile phones.0 86. both results will be used as a solution for the customer segmentation.1 3.1 132.1 22.5 91.5 3.8 133.6 2.4 11 6.5 0.2%) Segment 2 (28.3 4.4 4.7 10 14.3%) Segment 4 (17.

• Segment 4: These customers are the average customers. This technique will be used to classify/estimate the segment of a customer by personal information as age. However. The average call duration is low. They also send and receive many sms messages. The duration of their voice calls is longer than average. the customers in this segments can be described as follows: • Segment 1: In this segment are customers with a relative low number of voice calls. They also receive and originate a low number of sms messages. • Segment 2: This segment contains customers with a relative high number of contacts. • Segment 3: The customers in this segment make relative many voice calls. The percentage of international calls is high. In the next session the classification method Support Vector Machine will be explained. They do not send and receive many sms messages. They call often to mobile phones during day time. For the situation with 6 segments. their sms usage is relative high. Their sms usage is low. 52 . • Segment 4: These customers originate many voice calls. In proportion. None of the feature values is high or low. gender and lifestyle (the customer data of Section 2. They also call to many different areas. These customers do not call to many different numbers.usage.1. • Segment 6: These customers originate and receive many voice calls. They have a relative small number of contacts. They call often during daytime and call more then average to international numbers. Remarkable is the fact that they don not have so many contacts as the number of calls do suspect.3). • Segment 3: The customers in this segment make relative many voice calls. Their average call duration is also lower than average. They also send and receive many sms messages. These customers call to many different numbers and have a lot of contacts which are Vodafone customers. • Segment 5: These customers do not receive many voice calls. They have also more contacts with a Vodafone mobile. Their call duration is high. they make more international phone calls than other customers.

In general. imagine that there exists only two segments. Even a quick glance at Figure 5. However. such as the one labeled ’Unknown’ in Figure 5. The goal of the SVM is learn to tell the difference between the groups and.3) of each segment. an algorithm for maximizing a particular mathematical function with respect to a given collection of data. 5. The green dots represent the customers that are in segment 1 and the red dots are customers that are in segment 2. to allow an easy.1 The separating hyperplane A human being is very good at pattern recognition. the customer data features of Section 2. The next few sections will describe the four basic concepts: • The separating hyper plane • The maximum-margin hyperplane • The soft margin • The kernel function For now.1. In this research a Support Vector machine will be used to recognize the segment of a customer by examining thousands of customers (e. which can be easily plotted. Subsequently.1b).1.1a shows that the green dots form a group and the reds dots form another group that can easily be separated by drawing a line between the two groups (Figure 5. the basic ideas of Support Vector Machines can be explained without any equations. a Support Vector Machine is a mathematical entity. In this case the customer data consist of 2 feature values. age and income. geometric interpretation of the data. predict whether it corresponds to segment 1 or segment 2. predicting the label of an unknown customer is simple: one simply needs to ask whether the new customer falls on the segment 53 .Chapter 5 Support Vector Machines A Support Vector Machine is a algorithm that learns by example to assign labels to objects [16]. given an unlabeled customer.g.

Now. a plane is needed to divide the space. to define the notion of a separating hyperplane.(a) Two-dimensional representation of the customers (b) A separating hyperplane Figure 5.2a). then the space in which the corresponding onedimensional feature resides is a one-dimensional line. the line that separates the segments.2: Separating hyperplanes in different dimensions 54 .1: Two-dimensional customer data of segment 1 and segment 2 1 or the segment 2 side of the separating line. So the term separating hyperplane is. illustrated in Figure 5. consider the situation where there are not just two feature values to describe the customer. a straight line divides the space in half (remember Figure 5. (a) One dimension (b) Three dimensions Figure 5. essentially.2b. For example. This line can be divided in half by using a single point (see Figure 5. This procedure can be extrapolated mathematically in higher dimensions. if there was just 1 feature value to describe the customer.1b) In a three-dimensional space. The term for a straight line in a high-dimensional space is a hyperplane. In two dimensions.

it is not reasonable to expect that the SVM can classify well if the training data set is prepared with a different protocol then the test data set. In other words.1a. The vectors (points) that constrain the width of the margin are the support vectors. For example. since it is not reasonable that a Support Vector machine trained on customer data is able to classify different car types. a SVM 55 . there are an infinite number of possible lines. the key (a) Many possibilities (b) The maximum-margin hyperplane Figure 5. the theorem of a SVM indicates that the two data sets has to be drawn from the same distribution. the SVM selects the maximum separating hyperplane. the line that separates the two segments and adopts the maximal distance from any of the given customers (see Figure 5. the SVM is able to predict the unknown segment of the customer in Figure 5. This is of course logical. By defining the distance from the hyperplane to the nearest customer (in general an expression vector) as the margin of the hyperplane. This theorem. More relevantly. First at all.5.2 The maximum-margin hyperplane The concept of treating objects as points in a high-dimensional space and finding a line that separates them. there are a some remarks and caveats to deal with.2 The question is which line should be chosen as the optimal classifier and how should the optimal line be defined. ’in the middle’. roughly speaking.3: Demonstration of the maximum-margin hyperplane to the success of Support Vector Machines. the theorem is based on the fact that the data on which the SVM is trained are drawn from the same distribution as the data it has to classify. On the other hand.1a The goal of SVM is to find a line that separates the segment 1 customers from the segment 2 customers. is a common way of classification. and therefore not unique to the SVM. as portrayed in Figure 5. is in many ways. the SVM differs from all other classifier methods by virtue of how the hyperplane should be selected. However. A logical way of selecting the optimal line. However. Consider again the classification problem of Figure 5. It is not surprising that a theorem of the statistical learning theory is supporting this choice [6]. By selecting this hyper plane.2). However. is selecting the line that is.

In other words. Note. A intuitively way to deal with the problems of errors is designing the SVM in such a way that it allows a few anomalous customers to fall on the ’wrong side’ of separation line.4a.4: Demonstration of the soft margin with the introduction of the soft margin.3 The soft margin So far.4a will be separated in the way it is illustrated in Figure 5. controls the number of customers that is allowed to violate the separation line and determines how far across the line they are allowed.does not assume that the data is drawn from a normal distribution. by the fact that a large margin will be achieved with respect to the number of correct classifications. However. This can be achieved by adding a ’soft margin’ to the SVM. the theory assumed that the data can be separated by a straight line. the data contains an error object. for example the data of Figure 5. instead of a two-dimensional data set. 5. With the soft margin. roughly. 5. Assume that. there 56 . the data set of Figure 5. Setting this parameter is a complicated process. a user-specified parameter is involved that controls the soft margin and. a SVM should not allow too many misclassification. The soft margin allows a small percentage of the data points to push their way through the margin of the separating hyperplane without affecting the final result. Of course.3 The customer can be seen as an outlier and resides on the same side of the line with customers of segment 1.4 The kernel functions To understand the notion of a kernel function. the example data will be simplified even further. that (a) Data set containing one error (b) Separating with soft margin Figure 5. many real data sets are not cleanly separable by a straight line. the soft margin specifies a trade-off between hyper plane violations and the size of the margin. In this figure.

In that case. (a) None separable dataset (b) Separating previously non separable dataset Figure 5. So. Within the new higher dimensional space. In Figure 5. In general. No single point can separate the two segments and introducing a soft margin would not help. as shown in the figure. the first problem is the so called curse of dimensionality: as the numbers of variables under consideration increases. With a relative simple kernel function.5: Demonstration of kernels some extra examples will be given. The kernel function adds an extra dimension to the data. the data will become separable in the corresponding higher dimension. consider the situation of Figure 5. the SVM should be a perfect classifier. the result is shown as the curved line in Figure 5. the kernel function can be seen as a mathematical trick for the SVM to project data from a low-dimensional space to a space of higher dimensions. However.4. the data set must contain consistent labels. the separating hyperplane was a single point. it is possible to prove that for any data set exists a kernel function that allows the SVM to separate the data linearly in a higher dimension. which means that two identical data points may not have different labels. the number of possible solutions also increases. To understand kernels better. as seen before in Figure 5. In Figure 5. but with a projection of the SVM hyperplane in the four-dimensional space back down to the original two-dimensional space. Now. It is not possible to draw the data in the 4 dimensional space.4 is plotted a two-dimensional data set.4 the situation is drawn when the data is project into a space with too many dimensions. but the projected hyperplane is found by a very high dimen57 .4. Of course.is a one-dimensional data set. If one chooses a good kernel function. it becomes harder for any algorithm to find a correct solution. the SVM can separate the data in two segments by one straight line. The result is plotted in Figure 5. The figure contains the same data as Figure 5.4. in theory. which illustrates an non separable data set. there are some drawbacks of projecting data in a very high-dimensional space to find the separating hyperplane. this data can be projected to a four-dimensional space.4. Consequently.1. in this case by squaring the one dimensional data set. A kernel function provides a solution to this problem. but exponentially.

Practically. the answer too this question is. trial and error. this is a time-consuming process and it is not guaranteed that the best kernel function that was found during cross-validation. There exists another large practi- (a) Linearly separable in four dimensions (b) A SVM that has over fit the data Figure 5. The SVM will not function well on new unseen unlabeled data.2) . This phenomenon is called over fitting. probably an infinite number. The vectors are mapped into a higher dimensional space by the function Φ. i • Polynomial: the polynomial kernel of degree d is of the form K(xi . xj ) = xT xj . In this research a SVM will be experimented with a variety of ’standard’ kernel functions.6: Examples of separation with kernels cal difficulty when applying new unseen data to the SVM.3) (5. It is more likely that there exists a kernel function that was not tested and performs better than the selected kernel function. Unfortunately. This results in boundaries which are to specific to the examples of the data set. The default and recommended kernel functions were used during this research and will be discussed now. By using the cross-validation method. xj ) = (γxT xj + c0 )d . xj ) = Φ(xi )T Φ(xj ).1) where xi are the training vectors. However. Many kernel mapping functions can be used. the method described above. mainly gives sufficient results. is actually the best kernel function that exists. the optimal kernel will be selected on a statistical way. in most cases. i 58 (5. but without introducing too many irrelevant dimensions.sional kernel. (5. This problems relies on the question how to choose a kernel function that separates the data. • Linear: which function is defined by: K(xi . but a few kernel functions have been found to work well in for a wide variety of applications [16]. In general the kernel function is defined by: K(xi .

one-versus-all classifiers. A. xj ) = exp(−γ||xi − xj ||2 ). xj ) = tanh(γxT xj + c0 ). B and C. In this research the constant c0 is set to 1.5 Multi class classification So far. one can simply train three separate SVM to answer the binary questions. (5.• Radial basis function: also known as the Gaussian kernel is of the form K(xi . The concept of a kernel mapping function is very powerful. the idea of using a hyperplane to separate the feature vectors into two groups was described. It allows a SVM to perform separations even with very complex boundaries as shown in Figure 5. In this research the one-verses-one technique will be used. How does a SVM discriminate between a large variety of classes.7: A separation of classes with complex boundaries 5. ”Is it A?”. but only for two target categories. ”Is it B?” and ”Is it C?”.4) • Sigmoid: the sigmoid function. For example. which is also used in neural networks. Another simple approach is the one-versus-one where k(k − 1)/2 models are constructed.5) i When the sigmoid function is used. as in our case 4 or 6 segments? There are several approaches proposed. is defined by K(xi .7 Figure 5. 59 . The first approach is to train multiple. but two methods are the most popular and most used [16]. if the SVM has to recognize three classes. where k is the number of classes. (5. one can regard it with a as a two-layer neural network.

the actual performance of the SVM will be measured after the SVM is trained.1 demonstrates how important the training process is. By K-fold cross validation the training dataset will be Figure 6. The training of the SVM will be stopped when the test error reached a local 60 . the test set and the validation set.1 K-fold cross validation To avoid over fitting. the training set. Different parameter values may cause under or over fitting.1: Under fitting and over fitting divided into two groups. Figure 6. The training set will be used to train the SVM. cross-validation is used to evaluate the fitting provided by each parameter value set tried during the experiments. With the validation set.Chapter 6 Experiments and results of classifying the customer segments 6. The test set will be used to estimate the error during the training of the SVM.

2. K-1 folds will be used for training and the remaining one for testing.4% 100 41. Each kernel function with its parameters will be tested on their performance.1 and table 6.7% 200 40. The error is calculated by taking the average off all K experiments. 6. the C 1 42. By K-fold cross validation. a k-fold partition of the Figure 6.6% 5 43.3: A K-fold partition of the dataset research. 4 segments 61 . For the situation with 4 clusters. Figure 6.0% 50 42. K is set to 10.1: Linear Kernel. see Figure 6. In this Figure 6. denoted by C. In table 6.1% 2 42. For each of K experiments. the optimal parameters for the Support Vector Machine will be researched and examined.minimum. The advantage of K-fold cross validation is that all the examples in the dataset are eventually used for both training and testing.2% 20 43.3 illustrates this process.8% 500 36.2 Parameter setting In this section.0% 10 43. The linear Kernel function itself has no parameters.2: Determining the stopping point of training the SVM data set is created. The only parameter that can be researched is the soft margin value of the Support Vector Machin.2 the results for the different C-values are summarized.1% Table 6.

9% 5 74.9% 2 29.0% 74.3% 7 75. This is done by multiple test runs with random values for d and γ.3% 3 75.0% 76.0 = 1.1% 75.5 and 6.0% 75. For the situation with γ γ γ γ γ γ d = 0.5% 74.9% 75.1% 2 74. For the polynomial kernel function.2% and 32.0%.0% 76.1% 76.1% 75.1% 4 73.6% 75.2% 2 76.2% 78. The correct number of classifications are respectively. there are two parameters.6% 76.2% 500 53.4% 6 74.2% 75. 6 segments optimal value for the soft margin is C = 10 and by using the 6 segments C = 50.3: Average C-value for polynomial kernel.6% 75.3% 75.0 = 1.2% 4 76.8% Table 6.3% 7 73.2% 76.0% 74.2 = 1.1% 74.1% 78.8% 74.4% 77.3% 76.8% 74.7% 500 26.2: Linear Kernel.1% 74.4% 5 76.4% 76.6 = 0. The average value for each soft margin C can be found in the tables 6.3% 76. 6 segments best results.8% 74.0% 100 27.0% 73.8% 75.2% 76.9% Table 6.4 1 76.8% 75.1% 72.2% 74. the optimal number for the maximal margin will be determined.8% 73.5% 50 72.9% 10 31.2 = 1.1% 72.2% 3 78.7% 75.6% 200 27.6 = 0.9% 74.6% 20 73.4 = 0. Therefor.4% 74.4% 76.1% 77.2% 75.4% 5 75.1% 74.4% 75.4% 5 30.3% 20 31.4 = 0.0% 72.6% 74.0% 75.6.0% 75.2% 74.0% 75.0% 75.8 = 1.0% 75.5% 6 76. 6 segments 62 .4 1 75.1% 73.7% Table 6.8% 74.4: Average C-value for polynomial kernel.5% 74. These C-values are used to find out which d and γ give the C 1 73.2% 76.3% 73.0% 78.4% 50 32.6% 200 63.8% 2 77.8 = 1.6% 10 74.C 1 28.8% 100 70.0% 74. 4 segments γ γ γ γ γ γ d = 0.9% Table 6.1% 74.1% 75.0% Table 6.2% 76. The number of degrees.1% 20 75.3% 76. 43.6% 77.9% 74.6% 500 21.0% 50 75.8% 75.1% 100 50. The results are shown in tables 6.3% 10 75.6% 2 74. denoted by d and the width γ.6: Polynomial kernel.9% 74.2% 73.0% 5 75.6% 200 42.3 and 6.4.3% 74. 4 segments C 1 70.5% 72.8% 76.5: Polynomial kernel.

2 = 1.2 500 37.3 72.2 64.5 38.6 = 0.3 61.0 50 65.6 72.5%.9 44.2 44.4 66.6 59. The following kernel function. The confusion matrix for both situations.7 and table 6.0 71.6 77.6% of the data is classified γ γ γ γ γ γ C = 0.4 59.9: Sigmoid function.2 63.9 5 76.6 500 38.7 80.2 78.4 200 51.3 30.2 69.8 54.5 60.1 70.1 60.7 44.9 73.4 74.1 77.0 52.9 45.5 78. the optimal score is 78.6 51.4 1 80.2 200 47.5%.3 66. This corresponds to the cluster in the top right corner and the cluster in the bottom of Figures 4.8 61.7 30. namely γ.4 5 57.4 20 76.5 29.7 48.5 27.9 72.3 79.9 50.0 57.5 5 72.10. 6 segments best result with 4 segments is 80. while there are two extra clusters.9 78.1 70.0 80. Remarkable is the fact that the difference is small between the two situations.2%. The C = 0.1 79.5 52.4 79.6 72. The sigmoid function has also only 1 variable.7 74.3 73.5 68.9 80.0 80.4 66.6 43.4 500 40.8 76.5 54.7 78.10 The results show that 66.8 = 1.4 47.2 79.2 47.4 2 77.7 61.9 34.8 = 1.3% and 78.0 68.0 41.0 39.4 = 0.1 100 30.4 Table 6. the radial basis function has only one variable.5 41.4 100 60.4 2 79.8 65.6 68.9 73.8 73.5 30.12.7 50 73.2 59.5 74.9 34. The results are given in table 6.0 20 56.2 = 1.0 61.4 = 0.6 50.5 64.6 69.2 = 1.1 58.0 42.3 26. 4 segments correct.6 52.6 = 0.4 segments.1 51.7 10 58.7 72.4 70.7 54.3 64.3 79.4 79.6 52.2 79.0 56.5 52.4 1 73.8 80.2 2 53.7 46.11 and 6.9 and 6.5 100 52. show that there are two clusters which can easily be classified with the customer profile.7 10 70.1 51.3 54.9 68.3 79.1% and 44.0 80.3 59.4 1 58. with respectively 4 and 6 segemtents.6 77.4 = 0.2 44.0 47.3 200 52. 4 segments γ γ γ γ γ γ C = 0.5 57.9 76.5 10 78.8: Radial basis function.3 31.6 74.7 62.6 57.1 40.8 49.7 55.4 76.1 20 68.8 50 57.1% and for 6 segments 76.1 69.0 Table 6. This means that the Radial basis function has the best score for both situations.0 γ γ γ γ γ γ Table 6. The results of the Radial Basis function are given in table 6. 63 .3%.8 69.1 71.8.1 48.2 80.7 40.9 and 4. table 6.0 55.4 60. with 80.0 73. with 6 segments the best score is 78.8 76.3 61.6 = 0.0 = 1.0 = 1.0 70.1 44.3 78.2 38.7: Radial basis function.3 58.8 = 1.6 26.2 46.0 = 1.1 63.6 53.5 80. by the Sigmoid function.3 42.

3% 4. 6 5.0% 0.9 39.3% 6.4 42.4% Segm.6 39.0 41.6 20 34.4% 9.6 = 0.6 32.8 = 1.γ γ γ γ γ γ C = 0.2 27.7 27.1% 6.8 32.1 40.6% 2.0 41.1% Segment 2 0.8% 69.1% 7.2 50 30.8 43.6% 2. 6 segments 64 .8% 92.1% 68.8 29. 2 1.3% Segm.1% 94.1% 0.7 35.3 200 28.9 38.3% 10.2% 2.7 40.4 = 0.11: Confusion matrix.7% Segm.6 39.6 21.3% 2.5 33.5% 12.9% 1.1 40.8% 13.9% 0.6% 1.6% 1.6% 0.5 5 34.0% Table 6.0% 2. 5 0.7 43.4 30.4 44.9% Segment 4 0.6 24.5 100 32.0% Segment 3 1.1% 0.7% 4.6% 12.1 30.6 31.1 22.1 29.2 30.7% 2.5 40.3 500 28.9 44.9 28.8 24. 1 74.4 36.2 = 1.0 39.0% 12.1% 3.1 10 33.6 26.7% Table 6.6 2 34.9 Table 6.0 27.8 18.8 18.0 = 1.6% 71.12: Confusion matrix. 6 segments Predicted → Actual ↓ Segment 1 Segment 2 Segment 3 Segment 4 Segment 1 97.4 38. 4 8.4% Segm. 4 segments Predicted → Actual ↓ Segment 1 Segment 2 Segment 3 Segment 4 Segment 5 Segment 6 Segm.6 34.6 20.8 43.4% 1.2% 3.1% 0.7% 73.10: Sigmoid function.2% 7.7 28.5% 76.5% Segm.2% 5.0 26.7% 3.9% 7.0 20.5% 4.4 1 33. 3 10.8% 96.

by leaving one feature out of the feature vector and train the SVM without this feature. This will be done.3 Feature Validation In this section.4 and 6.5. The results of both situations. The importance of each feature will be measured.5: Results while leaving out one of the features with 6 segments feature for classifying the right segment. This is in contrast with the type of telephone.4: Results while leaving out one of the features with 4 segments Figure 6. 65 . The result show that Age is an important Figure 6. which increase the result with only tenths of percents. are shown in Figure 6.6. Each feature increases the result and therefore each feature is useful for the classification. the features will be validated.

the elbow was located at c = 4 and for other algorithms. the location was c = 6. the so-called elbow criterion was applied. this criterion could not always be unambiguously identified. However. Not every validation method marked the same 66 . A novel data mining technique. The customer segments were constructed by applying several clustering algorithms. the feature values were selected in such a way that it would describe the customer’s behavior as complete as possible. For some algorithms. but merely as one possible outcome. The second part of the research was focused on profiling customers and finding a relation between the profile and the segments. without the direct intervention of a human specialist. several validation measures were used. One different feature value will result in different segments. it is not possible to include all possible combinations of usage behavior characteristics within the scope of this research.Chapter 7 Conclusions and discussion This section concludes the research and the corresponding results and will give some recommendations for future work. The result of the clustering can therefore not be regarded as universally valid. In this research. To identify the best algorithm. There are various ways for selecting suitable feature values for the clustering algorithms.1 Conclusions The first objective of our research was to perform automatic customer segmentation based on usage behavior. This selection is vital for the resulting quality of the clustering. An other problem was that the location of the elbow could differ between the validation measures for the same algorithm. To find the optimal number of clusters. The customer’s profile was based on personal information of the customers. called Support Vector Machines was used to estimate the segment of a customer based on his profile. The clustering algorithms used selected and preprocessed data from the Vodafone data warehouse. This led to solutions for the customer segmentation with respectively four segments and six segments. Unfortunately. 7.

In real life. It is hard to compare the two clustering results. because of the different number of clusters. leaving out a feature such as the ’telephone type’ barely decreased the classification result. some widely established validation measures were employed to determine the most optimal algorithm. based on the customer’s profile. and residential area of the customer. However. It was found that the radial basis function gives the best result with a classification of 80. A Support Vector Machine algorithm was used to classify the segment of a customer.3% for the situation with four segments and 78. telephone type. It was however not possible to determine one algorithm that was optimal for c = 4 and c = 6. For the situation with four clusters. that in both situations the clusters were well separated and clearly distinguished from each other. and companies may exchange telephones among their employees. this and some other features did well increase the performance of classification. The profile exists of the age. A possible explanation could be that the features of the customer are not adequate for making a customer’s profile. different numbers called and percentage of weekday and daytime calls. To determine which customer segmentation algorithm is best suited for a particular data set and a specific parameter setting. 67 . that this feature bears some importance for the customer profiling and the classification of the customer’s segment. the resulting quality of the classification was significantly decreased. This is caused by the frequently missing data in the Vodafone data warehouse about lifestyle. habits and income of the customers. the effect of each feature value became visible. subscription type. the usage behavior does not correspond to a single customer’s profile and this impairs the classification process. call duration. this telephone is maybe not used exclusively by the person (and the corresponding customer’s profile) as stored in the database. As a comparison. In such cases. The results show. A second reason for the low number of correct classification is the fact that the usage behavior in the database corresponds to a telephone number and this telephone number corresponds to a person. both clustering results were used as a starting point for the segmentation algorithm. however.5% for the situation with six segments.algorithm as the best algorithm. On the other hand. A short characterization of each cluster was made. It was found that without the concept of ’customer age’. gender. four different kernel functions with different parameters were tested on their performance. The last part of the research involves the relative importance of each individual feature of the customer’s profile. company size. Customers may lend their telephone to relatives. The corresponding segments differ on features as number of voice calls. This implies. international calls. the Gath-Geva algorithm appears to be the best algorithm and the Gustafson-Kessel algorithm gives the best results by six clusters. Therefore. the clustering results were interpreted in a profiling format. By leaving out one feature value during classification. Therefore. sms usage. It appeared that the resulting percentage of correctly classified segments was not as high as expected.

For instance. To improve on determining the actual number of clusters present in the data set. thus. a complete data analysis research is required. rather than mere phone usage as employed here. an enhanced and more precise analysis of the data ware house will lead to improved features and. the application of more specialized methods than the elbow criterion could be applied. This can lead to different clusters and thus different segments. In this research. it is possible to formulate some recommendations for obtaining more suitable customer profiling and segmentation.2 Recommendations for future work Based on our research and experiments. it is challenging to classify the profile of the customer based on the corresponding segment alone. hierarchical clustering or mixture of Gaussians. Extrapolating this approach. Another way to validate the resulting clusters is to offer them to a human expert. to an improved classification. genetic algorithms or Bayesian algorithms. the application of evolutionary algorithms. The customer profile used in this research is not sufficient detailed enough to describe the wide spectrum of customers. One reason for this is the missing data in the Vodafone data warehouse. An interesting alternative is. To estimate the segment of the customer. the results are given by a short description of each segment. for instance. Furthermore. it should be noted that the most obvious and best way to improve the classification is to come to a more accurate and precise definition of the customer profiles. we note that the study would improve noticeably by involving multiple criteria to evaluate the user behavior. Finally. The first recommendation is to use different feature values for the customer segmentation. other classification methods can be used. a detailed data analysis of the meaning of the cluster is recommended. Another way of improving this research is to extent the number of cluster algorithms like main shift clustering. as proposed by Wei Lu [21]. Also. 68 . neural networks. However. a more detailed view of the clusters and their boundaries can be obtained. and use his feed-back for improving the clustering criteria.7. cluster analysis of the application of miscellaneous (non-linear) kernel functions. also. Consequently. Of specific interest is. To know the influence of the feature values on the outcome of the clustering. Similarly. within the framework of Support Vector Machines. this is a complex course and it essentially requires the availability of high-quality features.

. S. W. F. Comp. VTT Information Technology (2001).J. [12] Janusz. R. Research report TTE1-2001-18. in Civ. [3] Balasko. Data mining and complex telecommunications problems modeling... J. pp.. (1979). I. pp. J. Singh. In Proc.. [2] Amat.B.. In Proc. [10] Giha. [7] Feldman. J. Y. 3 (2003). [4] Bounsaythip. no. [9] Gath. Solomatine D. [11] Gustafson. Inform.. AAAI/MIT Press (1991). Fuzzy clustering with a fuzzy covariance matrix. Overview of Data Mining for Customer Behavior Modeling. Engrg. Research report TTE1-2001-29. 15 iss.B. vol.E. J.P.... [5] Bezdek.. Conf.E.. Inform... J. and Dunn.L.. and Abbott. Telecommun. Model Induction with Support Vector Machines: Introduction and Applications.C. Using reporting and data mining techniques to improve knowledge of subscribers.. D. Comput.T.. 11-16. F. 3 (2002). no. Knowledge discovery in databases. 835-838. and Matheus. 208-216. 397 (2003). C. and Rinta-Runsala. M.E.J. 11 no. 3 (2001).C. 7 (1989). 1-27. IEEE Trans. Engin. IEEE CDC. 773-781. vol. and Balazs. C. and Geva. J. 115-120.. Technol. pp.. G. 761766. W. Y. and Rinta-Runsala E. and Appl. Softw. IEEE Trans Pattern and Machine Intell. Piatetsky-Shapiro. Velickov... VTT Information Technology (2001). (2006). J. and Ewe.B. Proc. A. 1st Int. applications to customer profiling and fraud management. Optimal fuzzy partition: A heuristic for estimating the parameters in a mixture of normal distributions. and Kessel.. Knowledge discovery in textual databases (KDT). Telecommun. 112-117. Knowledge Discovery and Data Mining. Technol. H. Abonyi. [6] Dibike. pp. Unsupervised optimal fuzzy clustering. and Dagan. B. pp. E. pp. Fuzzy Clustering and Data Analysis Toolbox For Use with Matlab. [8] Frawley. J. no. Customer Profiling and Segmentation based on Association Rule Mining Technique. (2005). pp. G. 69 . vol. pp.Bibliography [1] Ahola. I. Data mining case studies in customer profiling. C-24 (1975).

A. Savvopoulos. and Chiu. Constructing Stereotypes for an Adaptive e-Shop Using AIN-Based Clustering. P. Decision Support Systems. vol. [20] Wei. [21] Wei Lu.. Tan. vol.. Boston. Decis.. The Data Mining and Knowledge Discovery Handbook (2005). 648-653. [16] Noble. vol. pp. 34 (2002).W. M. How to do it.. Appl.S.. Palgrave Publ.. Supp.. What is a support vector machine? Nature Biotechnology. 12 (2006). [17] Shaw.. 127137. P. W. Data Warehousing and Data Mining for Telecommunications. 471-481. M. and Dunbar. Syst. Knowledge management and data mining for marketing. I. vol. and Sotiropoulos.N. [18] Verhoef. R. 24 no.M. 23 (2002). C. I. 1189-1201.J.. G. Clustering and its validation in a symbolic framework.T. pp. [14] Mattison. pp. C. Tsihrintzis. M. ICANNGA (2007). 24 (2003). Market segmentation.E. CIMCA/IAWTIC (2005). pp.P... London: Artech House. Hoekstra. 31 (2001). Turning telecommunications call detail to churn prediction: a data mining approach. 70 . Spring. Lett.. The commercial use of segmentation and predictive modeling techniques for database marketing in the Netherlands. [22] Weiss. I. D.[13] Mali. [19] Virvou. Subramaniam. pp. and Welge.. pp... A New Evolutionary Algorithm for Determining the Optimal Number of Clusters. P. A. (1997). Patt.T. Recogn. G. vol. 2367-2376. 837-845.. K. (1998). J. how to profit from it.. pp. M. 103112. Data Mining in Telecommunications. Expert Syst.. 1565-1567. G. pp.. and Lee. [15] McDonald.

To connected the tables with each other. The white rectangles correspond to the tables that were used for this research. the relation tables (the red tables in the middle) are needed. 71 .Appendix A Model of data warehouse In this Appendix a simplified model of the data ware house can be found. The colored boxes group the tables in a category. The most important data fields of these tables are written in the table.

1: Model of the Vodafone data warehouse 72 .Figure A.

1. for the algorithms that not were discussed in Section 4. the plots of the validation measures.Appendix B Extra results for optimal number of clusters In this Appendix. are given.1: Partition index and Separation index of K-medoid 73 . The K-medoid algorithm: Figure B.

3: Partition coefficient and Classification Entropy of Fuzzy C-means 74 .Figure B.2: Dunn’s index and Alternative Dunn’s index of K-medoid The Fuzzy-C-means algorithm: Figure B.

Separation index and Xie Beni index of Fuzzy C-means Figure B.5: Dunn’s index and Alternative Dunn’s index of Fuzzy C-means 75 .Figure B.4: Partition index.

You're Reading a Free Preview

Descarga
scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->