
Customer Segmentation and Customer Profiling
for a Mobile Telecommunications Company
Based on Usage Behavior

A Vodafone Case Study

S.M.H. Jansen

July 17, 2007

Acknowledgments

This Master thesis was written to complete the study Operations Research at the University of Maastricht (UM). The research took place at the Department of Mathematics of UM and at the Department of Information Management of Vodafone Maastricht. During this research, I had the privilege to work together with several people. I would like to express my gratitude to all those people for giving me the support to complete this thesis. I want to thank the Department of Information Management for giving me permission to commence this thesis in the first instance, to do the necessary research work and to use departmental data.

I am deeply indebted to my supervisor Dr. Ronald Westra, whose help, stimulating suggestions and encouragement helped me throughout the research for and writing of this thesis. Furthermore, I would like to give my special thanks to my second supervisor Dr. Ralf Peeters, whose patience and enthusiasm enabled me to complete this work. I also want to thank my thesis instructor, Drs. Annette Schade, for her stimulating support and for encouraging me to go ahead with my thesis.

My former colleagues from the Department of Information Management supported me in my research work. I want to thank them for all their help, support, interest and valuable hints. I am especially obliged to Drs. Philippe Theunen and Laurens Alberts, MSc.

Finally, I would like to thank the people who looked closely at the final version of the thesis for English style and grammar, correcting both and offering suggestions for improvement.


Contents

1 Introduction 8

1.1 Customer segmentation and customer proﬁling . . . . . . . . . . 9

1.1.1 Customer segmentation . . . . . . . . . . . . . . . . . . . 9

1.1.2 Customer proﬁling . . . . . . . . . . . . . . . . . . . . . . 10

1.2 Data mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.3 Structure of the report . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Data collection and preparation 14

2.1 Data warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1.1 Selecting the customers . . . . . . . . . . . . . . . . . . . 14

2.1.2 Call detail data . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1.3 Customer data . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3 Clustering 22

3.1 Cluster analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.1.1 The data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1.2 The clusters . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1.3 Cluster partition . . . . . . . . . . . . . . . . . . . . . . . 24

3.2 Cluster algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.2.1 K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.2.2 K-medoid . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.3 Fuzzy C-means . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.4 The Gustafson-Kessel algorithm . . . . . . . . . . . . . . 29

3.2.5 The Gath Geva algorithm . . . . . . . . . . . . . . . . . . 30

3.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.4.1 Principal Component Analysis . . . . . . . . . . . . . . . 33

3.4.2 Sammon mapping . . . . . . . . . . . . . . . . . . . . . . 34

3.4.3 Fuzzy Sammon mapping . . . . . . . . . . . . . . . . . . . 35

4 Experiments and results of customer segmentation 37

4.1 Determining the optimal number of clusters . . . . . . . . . . . . 37

4.2 Comparing the clustering algorithms . . . . . . . . . . . . . . . . 42


4.3 Designing the segments . . . . . . . . . . . . . . . . . . . . . . . 45

5 Support Vector Machines 53

5.1 The separating hyperplane . . . . . . . . . . . . . . . . . . . . . . 53

5.2 The maximum-margin hyperplane . . . . . . . . . . . . . . . . . 55

5.3 The soft margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.4 The kernel functions . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.5 Multi class classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . 59

6 Experiments and results of classifying the customer segments 60

6.1 K-fold cross validation . . . . . . . . . . . . . . . . . . . . . . . . 60

6.2 Parameter setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.3 Feature Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 65

7 Conclusions and discussion 66

7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

7.2 Recommendations for future work . . . . . . . . . . . . . . . . . 68

Bibliography 68

A Model of data warehouse 71

B Extra results for optimal number of clusters 73


List of Figures

1.1 A taxonomy of data mining tasks . . . . . . . . . . . . . . . . . . 12

2.1 Structure of customers by Vodafone . . . . . . . . . . . . . . . . 15

2.2 Visualization of phone calls per hour . . . . . . . . . . . . . . . . 17

2.3 Histograms of feature values . . . . . . . . . . . . . . . . . . . . . 18

2.4 Relation between originated and received calls . . . . . . . . . . . 18

2.5 Relation between daytime and weekday calls . . . . . . . . . . . 19

3.1 Example of clustering data . . . . . . . . . . . . . . . . . . . . . 22

3.2 Different cluster shapes in R² . . . . . . . . . . . . . . . . . . . . 24

3.3 Hard and fuzzy clustering . . . . . . . . . . . . . . . . . . . . . . 25

4.1 Values of Partition Index, Separation Index and the Xie Beni Index 38

4.2 Values of Dunn’s Index and the Alternative Dunn Index . . . . . 39

4.3 Values of Partition coeﬃcient and Classiﬁcation Entropy with

Gustafson-Kessel clustering . . . . . . . . . . . . . . . . . . . . . 40

4.4 Values of Partition Index, Separation Index and the Xie Beni

Index with Gustafson-Kessel clustering . . . . . . . . . . . . . . . 41

4.5 Values of Dunn’s Index and Alternative Dunn Index with Gustafson-

Kessel clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.6 Result of K-means algorithm . . . . . . . . . . . . . . . . . . . . 43

4.7 Result of K-medoid algorithm . . . . . . . . . . . . . . . . . . . . 44

4.8 Result of Fuzzy C-means algorithm . . . . . . . . . . . . . . . . . 44

4.9 Result of Gustafson-Kessel algorithm . . . . . . . . . . . . . . . . 44

4.10 Result of Gath-Geva algorithm . . . . . . . . . . . . . . . . . . . 45

4.11 Distribution of distances from cluster centers within clusters for

the Gath-Geva algorithm with c = 4 . . . . . . . . . . . . . . . . 46

4.12 Distribution of distances from cluster centers within clusters for

the Gustafson-Kessel algorithm with c = 6 . . . . . . . . . . . . . 46

4.13 Cluster proﬁles for c = 4 . . . . . . . . . . . . . . . . . . . . . . . 47

4.14 Cluster proﬁles for c = 6 . . . . . . . . . . . . . . . . . . . . . . . 48

4.15 Cluster proﬁles of centers for c = 4 . . . . . . . . . . . . . . . . . 49

4.16 Cluster proﬁles of centers for c = 6 . . . . . . . . . . . . . . . . . 50

5.1 Two-dimensional customer data of segment 1 and segment 2 . . . 54


5.2 Separating hyperplanes in diﬀerent dimensions . . . . . . . . . . 54

5.3 Demonstration of the maximum-margin hyperplane . . . . . . . . 55

5.4 Demonstration of the soft margin . . . . . . . . . . . . . . . . . . 56

5.5 Demonstration of kernels . . . . . . . . . . . . . . . . . . . . . . . 57

5.6 Examples of separation with kernels . . . . . . . . . . . . . . . . 58

5.7 A separation of classes with complex boundaries . . . . . . . . . 59

6.1 Under ﬁtting and over ﬁtting . . . . . . . . . . . . . . . . . . . . 60

6.2 Determining the stopping point of training the SVM . . . . . . . 61

6.3 A K-fold partition of the dataset . . . . . . . . . . . . . . . . . . 61

6.4 Results while leaving out one of the features with 4 segments . . 65

6.5 Results while leaving out one of the features with 6 segments . . 65

A.1 Model of the Vodafone data warehouse . . . . . . . . . . . . . . . 72

B.1 Partition index and Separation index of K-medoid . . . . . . . . 73

B.2 Dunn’s index and Alternative Dunn’s index of K-medoid . . . . . 74

B.3 Partition coeﬃcient and Classiﬁcation Entropy of Fuzzy C-means 74

B.4 Partition index, Separation index and Xie Beni index of Fuzzy

C-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

B.5 Dunn’s index and Alternative Dunn’s index of Fuzzy C-means . . 75


List of Tables

2.1 Proportions within the diﬀerent classiﬁcation groups . . . . . . . 20

4.1 The values of all the validation measures with K-means clustering 39

4.2 The values of all the validation measures with Gustafson-Kessel

clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.3 The numerical values of validation measures for c = 4 . . . . . . 42

4.4 The numerical values of validation measures for c = 6 . . . . . . 43

4.5 Segmentation results . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.1 Linear Kernel, 4 segments . . . . . . . . . . . . . . . . . . . . . . 61

6.2 Linear Kernel, 6 segments . . . . . . . . . . . . . . . . . . . . . . 62

6.3 Average C-value for polynomial kernel, 4 segments . . . . . . . . 62

6.4 Average C-value for polynomial kernel, 6 segments . . . . . . . . 62

6.5 Polynomial kernel, 4 segments . . . . . . . . . . . . . . . . . . . . 62

6.6 Polynomial kernel, 6 segments . . . . . . . . . . . . . . . . . . . . 62

6.7 Radial basis function, 4 segments . . . . . . . . . . . . . . . . . . 63

6.8 Radial basis function, 6 segments . . . . . . . . . . . . . . . . . . 63

6.9 Sigmoid function, 4 segments . . . . . . . . . . . . . . . . . . . . 63

6.10 Sigmoid function, 6 segments . . . . . . . . . . . . . . . . . . . . 64

6.11 Confusion matrix, 4 segments . . . . . . . . . . . . . . . . . . . . 64

6.12 Confusion matrix, 6 segments . . . . . . . . . . . . . . . . . . . . 64


Abstract

Vodafone, an international mobile telecommunications company, has accumulated vast amounts of data on consumer mobile phone behavior in a data warehouse. The magnitude of this data is so huge that manual analysis is not feasible. However, this data holds valuable information that can be applied for operational and strategic purposes. Therefore, in order to extract such information from this data, automatic analysis by means of advanced data mining techniques is essential. These data mining techniques search and analyze the data in order to find implicit and useful information, without direct knowledge of human experts. This research addresses the question of how to perform customer segmentation and customer profiling with data mining techniques. In our context, 'customer segmentation' is a term used to describe the process of dividing customers into homogeneous groups on the basis of shared or common attributes (habits, tastes, etc.). 'Customer profiling' is describing customers by their attributes, such as age, gender, income and lifestyle. Having these two components, managers can decide which marketing actions to take for each segment. In this research, the customer segmentation is based on call usage behavior, i.e. the behavior of a customer measured in the amounts of incoming or outgoing communication of whichever form. This thesis describes the process of selecting and preparing the accurate data from the data warehouse in order to perform customer segmentation and to profile the customer. A number of advanced and state-of-the-art clustering algorithms are modified and applied for creating customer segments. An optimality criterion is constructed in order to measure their performance. The clustering technique that is best, i.e. most optimal in the sense of the optimality criterion, will be used to perform customer segmentation. Each segment will be described and analyzed. Customer profiling can be accomplished with information from the data warehouse, such as age, gender and residential area. Finally, with a recent data mining technique called Support Vector Machines, the segment of a customer will be estimated based on the customer's profile. Different kernel functions with different parameters will be examined and analyzed. The customer segmentation leads to two solutions: one with four segments and one with six segments. With the Support Vector Machine approach it is possible in 80.3% of the cases to classify the segment of a customer based on its profile for the situation with four segments. With six segments, a correct classification of 78.5% is obtained.


Chapter 1

Introduction

Vodafone is world’s leading mobile telecommunications company, with approx-

imately 4.1 million customers in The Netherlands. From all these customers a

tremendous amount of data is stored. These data include, among others, call de-

tail data, network data and customer data. Call detail data gives a description

of the calls that traverse the telecommunication networks, while the network

data gives a description of the state of the hardware and software components

in the network. The customer data contains information of the telecommunica-

tion customers. The amount of data is so great that manual analysis of data is

diﬃcult, if not impossible [22]. The need to handle such large volumes of data

led to the development of knowledge-based expert systems [17, 22]. These auto-

mated systems perform important functions such as identifying network faults

and detecting fraudulent phone calls. A disadvantage of this approach is that

it is based on knowledge from human experts.

Obtaining knowledge from human experts is a time consuming process, and

in many cases, the experts do not have the requisite knowledge [2]. Solutions

to these problems were promised by data mining techniques. Data mining is

the process of searching and analyzing data in order to ﬁnd implicit, but po-

tentially useful, information [12]. Within the telecommunication branch, many

data mining tasks can be distinguished. Examples of main problems for market-

ing and sales departments of telecommunication operators are churn prediction,

fraud detection, identifying trends in customer behavior and cross selling and

up-selling.

Vodafone is interested in a completely different issue, namely customer segmentation and customer profiling and the relation between them. Customer segmentation is a term used to describe the process of dividing customers into homogeneous groups on the basis of shared or common attributes (habits, tastes, etc.) [10]. Customer profiling is describing customers by their attributes, such as age, gender, income and lifestyle [1, 10]. Having these two components, marketers can decide which marketing actions to take for each segment and then allocate scarce resources to segments in order to meet specific business objectives.

A basic way to perform customer segmentation is to define segmentations in advance with knowledge of an expert, and to divide the customers over these segmentations by their best fit. This research deals with the problem of making customer segmentations without knowledge of an expert and without defining the segmentations in advance. The segmentations will be determined based on (call) usage behavior. To realize this, different data mining techniques, called clustering techniques, will be developed, tested, validated and compared to each other. In this report, the principles of the clustering techniques will be described and the process of determining the best technique will be discussed. Once the segmentations are obtained, for each customer a profile will be determined with the customer data. To find a relation between the profile and the segments, a data mining technique called Support Vector Machines (SVM) will be used. A Support Vector Machine is able to estimate the segment of a customer from personal information, such as age, gender and lifestyle. Based on the combination of the personal information (the customer profile), the segment can be estimated and the usage behavior of the customer profile can be determined. In this research, different settings of the Support Vector Machines will be examined and the best working estimation model will be used.

1.1 Customer segmentation and customer profiling

To compete with other providers of mobile telecommunications it is important to know enough about your customers and to know their wants and needs [15]. To realize this, it is necessary to divide customers into segments and to profile the customers. Another key benefit of utilizing the customer profile is making effective marketing strategies. Customer profiling is done by building a customer's behavior model and estimating its parameters. Customer profiling is a way of applying external data to a population of possible customers. Depending on the data available, it can be used to prospect new customers or to recognize existing bad customers. The goal is to predict behavior based on the information we have on each customer [18]. Profiling is performed after customer segmentation.

1.1.1 Customer segmentation

Segmentation is a way to have more targeted communication with the customers. The process of segmentation describes the characteristics of the customer groups (called segments or clusters) within the data. Segmenting means dividing the population into segments according to their affinity or similar characteristics. Customer segmentation is a preparation step for classifying each customer according to the customer groups that have been defined.

Segmentation is essential to cope with today's dynamically fragmenting consumer marketplace. By using segmentation, marketers are more effective in channeling resources and discovering opportunities. The construction of user segmentations is not an easy task. Difficulties in making a good segmentation are [18]:

• Relevance and quality of data are essential to develop meaningful segments. If the company has insufficient customer data, the resulting customer segmentation is unreliable and almost worthless. Alternatively, too much data can lead to complex and time-consuming analysis. Poorly organized data (different formats, different source systems) also makes it difficult to extract interesting information. Furthermore, the resulting segmentation can be too complicated for the organization to implement effectively. In particular, the use of too many segmentation variables can be confusing and result in segments which are unfit for management decision making. On the other hand, apparently effective variables may not be identifiable. Many of these problems are due to an inadequate customer database.

• Intuition: Although data can be highly informative, data analysts need to be continuously developing segmentation hypotheses in order to identify the 'right' data for analysis.

• Continuous process: Segmentation demands continuous development and updating as new customer data is acquired. In addition, effective segmentation strategies will influence the behavior of the customers affected by them, thereby necessitating revision and reclassification of customers. Moreover, in an e-commerce environment where feedback is almost immediate, segmentation would require an almost daily update.

• Over-segmentation: A segment can become too small and/or insufficiently distinct to justify treatment as a separate segment.

One solution to construct segments can be provided by data mining methods that belong to the category of clustering algorithms. In this report, several clustering algorithms will be discussed and compared to each other.

1.1.2 Customer proﬁling

Customer profiling provides a basis for marketers to 'communicate' with existing customers in order to offer them better services and retain them. This is done by assembling collected information on the customer, such as demographic and personal data. Customer profiling is also used to prospect new customers using external sources, such as demographic data purchased from various sources. This data is used to find a relation with the customer segmentations that were constructed before. This makes it possible to estimate for each profile (the combination of demographic and personal information) the related segment and vice versa. More directly, for each profile, an estimation of the usage behavior can be obtained.

Depending on the goal, one has to select the profile that will be relevant to the project. A simple customer profile is a file that contains at least age and gender. If one needs profiles for specific products, the file would contain product information and/or the volume of money spent. Customer features one can use for profiling are described in [2, 10, 19]:

• Geographic. Are they grouped regionally, nationally or globally?

• Cultural and ethnic. What languages do they speak? Does ethnicity affect their tastes or buying behaviors?

• Economic conditions, income and/or purchasing power. What is the average household income or purchasing power of the customers? Do they have any payment difficulty? How much or how often does a customer spend on each product?

• Age and gender. What is the predominant age group of your target buyers? How many children, and of what age, are in the family? Are more females or males using a certain service or product?

• Values, attitudes and beliefs. What is the customers' attitude toward your kind of product or service?

• Life cycle. How long has the customer been regularly purchasing products?

• Knowledge and awareness. How much knowledge do customers have about a product, service, or industry? How much education is needed? How much brand building advertising is needed to make a pool of customers aware of the offer?

• Lifestyle. How many lifestyle characteristics about purchasers are useful?

• Recruitment method. How was the customer recruited?

The choice of the features also depends on the availability of the data. With these features, an estimation model can be made. This can be realized by a data mining method called Support Vector Machines (SVM). This report gives a description of SVMs, and it will be researched under which circumstances and parameters an SVM works best in this case.

1.2 Data mining

In Section 1.1, the term data mining was used. Data mining is the process of searching and analyzing data in order to find implicit, but potentially useful, information [12]. It involves selecting, exploring and modeling large amounts of data to uncover previously unknown patterns, and ultimately comprehensible information, from large databases. Data mining uses a broad family of computational methods that include statistical analysis, decision trees, neural networks, rule induction and refinement, and graphic visualization. Although data mining tools have been available for a long time, the advances in computer hardware and software, particularly exploratory tools like data visualization and neural networks, have made data mining more attractive and practical. The typical data mining process consists of the following steps [4]:

• problem formulation

• data preparation

• model building

• interpretation and evaluation of the results

Pattern extraction is an important component of any data mining activity and it deals with relationships between subsets of data. Formally, a pattern is defined as [4]:

A statement S in L that describes relationships among a subset of facts $F_s$ of a given set of facts F, with some certainty C, such that S is simpler than the enumeration of all facts in $F_s$.

Data mining tasks are used to extract patterns from large data sets. The various data mining tasks can be broadly divided into six categories, as summarized in Figure 1.1. The taxonomy reflects the emerging role of data visualization as a separate data mining task, even as it is used to support other data mining tasks. Validation of the results is also a data mining task. Because validation supports the other data mining tasks and is always necessary within a research project, it is not listed as a separate task. Different data mining tasks are grouped into categories depending on the type of knowledge extracted by the tasks. The identification of patterns in a large data set is the first step to gaining useful marketing insights and making critical marketing decisions. The data mining tasks generate an assortment of customer and market knowledge which forms the core of the knowledge management process. The specific tasks to be used in this research are Clustering (for the customer segmentation), Classification (for estimating the segment) and Data visualization.

Figure 1.1: A taxonomy of data mining tasks

Clustering algorithms produce classes that maximize similarity within clusters but minimize similarity between classes. A drawback of this method is that the number of clusters has to be given in advance. The advantage of clustering is that expert knowledge is not required. For example, based on user behavior data, clustering algorithms can classify the Vodafone customers into "call only" users, "international callers", "SMS only" users, etc.

Classification algorithms group customers into predefined classes. For example, Vodafone can classify its customers based on their age, gender and type of subscription, and then target them according to their user behavior.

Data visualization allows data miners to view complex patterns in their customer data as visual objects rendered in two or three dimensions and colors. In some cases it is necessary to reduce high-dimensional data to two or three dimensions. To realize this, algorithms such as Principal Component Analysis and Sammon's Mapping (discussed in Section 3.4) can be used. To provide varying levels of detail of observed patterns, data miners use applications that provide advanced manipulation capabilities to slice, rotate or zoom the objects.

1.3 Structure of the report

The report comprises 6 chapters and several appendices. In addition to this introductory chapter, Chapter 2 describes the process of selecting the right data from the data warehouse. It provides information about the structure of the data and the data warehouse. Furthermore, it gives an overview of the data that is used to perform customer segmentation and customer profiling. It ends with an explanation of the preprocessing techniques that were used to prepare the data for further usage.

In Chapter 3 the process of clustering is discussed. Clustering is a data mining technique that in this research is used to determine the customer segmentations. The chapter starts with explaining the general process of clustering. Different cluster algorithms will be studied. It also focuses on validation methods, which can be used to determine the optimal number of clusters and to measure the performance of the different cluster algorithms. The chapter ends with a description of visualization methods. These methods are used to analyze the results of the clustering.

Chapter 4 analyzes the different cluster algorithms of Chapter 3. They are tested with the prepared call detail data as described in Chapter 2. For each algorithm, the optimal number of clusters will be determined. Then, the cluster algorithms will be compared to each other and the best algorithm will be chosen to determine the segments. Multiple plots and figures will show the working of the different cluster methods, and the meaning of each segment will be described.

Once the segments are determined, a profile can be made with the customer data of Chapter 2. Chapter 5 delves into a data mining technique called Support Vector Machines. This technique will be used to classify the right segment for each customer profile. Different parameter settings of the Support Vector Machines will be researched and examined in Chapter 6 to find the best working model. Finally, in Chapter 7, the research will be discussed. Conclusions and recommendations are given and future work is proposed.


Chapter 2

Data collection and preparation

The first step (after the problem formulation) in the data mining process is to understand the data. Without such an understanding, useful applications cannot be developed. All data of Vodafone is stored in a data warehouse. In this chapter, the process of collecting the right data from this data warehouse will be described. Furthermore, the process of preparing the data for customer segmentation and customer profiling will be explained.

2.1 Data warehouse

Vodafone has stored vast amounts of data in a Teradata data warehouse. This data warehouse consists of more than 200 tables. A simplified model of the data warehouse can be found in Appendix A.

2.1.1 Selecting the customers

Vodafone Maastricht is interested in customer segmentation and customer profiling for (postpaid) business customers. In general, business customers can be seen as employees of a business that have a subscription with Vodafone in relation to that business. A more precise view can be found in Figure 2.1. It is clear that prepaid users are always consumers. In the postpaid group, there are captive and non-captive users. A non-captive customer is using the Vodafone network but does not have a Vodafone subscription or prepaid account (called roaming). Vodafone has made an agreement with two other telecommunications companies, Debitel and InterCity Mobile Communications (ICMC), so that their customers can use the Vodafone network. Debitel customers are always consumers and ICMC customers are always business customers. The ICMC customers will also be involved in this research. A captive customer has a business account if his telephone or subscription is bought in relation with the business he works for. These customers are called business users. In some cases, customers with a consumer account can have a subscription that is under normal circumstances only available for business users. These customers also count as business users. The total number of (postpaid) business users at Vodafone is more than 800,000. The next sections describe which data of these customers is needed for customer segmentation and profiling.

Figure 2.1: Structure of customers by Vodafone

2.1.2 Call detail data

Every time a call is placed on the telecommunications network of Vodafone, descriptive information about the call is saved as a call detail record. The number of call detail records that are generated and stored is huge. For example, Vodafone customers generate over 20 million call detail records per day. Given that 12 months of call detail data is typically kept online, this means that hundreds of millions of call detail records need to be stored at any time. Call detail records include sufficient information to describe the important characteristics of each call. At a minimum, each call detail record will include the originating and terminating phone numbers, the date and time of the call, and the duration of the call. Call detail records are generated two or three days after the day the calls were made, and are available almost immediately for data mining. This is in contrast with billing data, which is typically made available only once per month. Call detail records cannot be used directly for data mining, since the goal of data mining applications is to extract knowledge at the customer level, not at the level of individual phone calls [7, 8]. Thus, the call detail records associated with a customer must be summarized into a single record that describes the customer's calling behavior. The choice of summary variables (features) is critical in order to obtain a useful description of the customer. To define the features, one can think of the smallest set of variables that describe the complete behavior of a customer. Keywords like what, when, where, how often, who, etc. can help with this process:

• How?: How can a customer cause a call detail record? By making a voice call or sending an SMS (there are more possibilities, but their appearances are so rare that they were not used during this research). The customer can also receive an SMS or voice call.

• Who?: Who is the customer calling? Does he call fixed lines? Does he call Vodafone mobiles?

• What?: What is the location of the customer and the recipient? They can make international phone calls.

• When?: When does a customer call? A business customer can call during office daytime, or in private time in the evening or at night and during the weekend.

• Where?: Where is the customer calling? Is he calling abroad?

• How long?: How long is the customer calling?

• How often?: How often does a customer call or receive a call?

Based on these keywords and on features proposed in the literature [1, 15, 19, 20], a list of features that can be used as a summary description of a customer, based on the calls they originate and receive over some time period P, is obtained (a construction sketch follows the list):

1. average call duration
2. average # calls received per day
3. average # calls originated per day
4. % daytime calls (9am - 6pm)
5. % of weekday calls (Monday - Friday)
6. % of calls to mobile phones
7. average # sms received per day
8. average # sms originated per day
9. % international calls
10. % of outgoing calls within the same operator
11. # unique area codes called during P
12. # different numbers called during P
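To make this summarization step concrete, the sketch below shows one way such per-customer aggregates could be derived from raw call detail records. It is only an illustrative Python example under assumed column names (customer_id, direction, is_sms, start_time, duration); it is not the code or the schema used in this research, and it computes only a subset of the twelve features.

```python
import pandas as pd

def summarize_cdrs(cdr: pd.DataFrame, period_days: int) -> pd.DataFrame:
    """Aggregate raw call detail records into per-customer summary features.

    Assumed columns: customer_id, direction ('orig'/'recv'), is_sms (bool),
    start_time (datetime), duration (seconds).
    """
    cdr = cdr.copy()
    cdr["daytime"] = cdr["start_time"].dt.hour.between(9, 17)   # hours 9..17, i.e. 9am-6pm
    cdr["weekday"] = cdr["start_time"].dt.dayofweek < 5         # Monday-Friday

    calls = cdr[~cdr["is_sms"]]
    orig = calls[calls["direction"] == "orig"]
    recv = calls[calls["direction"] == "recv"]

    features = pd.DataFrame({
        "avg_call_duration": calls.groupby("customer_id")["duration"].mean(),
        "avg_calls_received_per_day": recv.groupby("customer_id").size() / period_days,
        "avg_calls_originated_per_day": orig.groupby("customer_id").size() / period_days,
        "pct_daytime_calls": calls.groupby("customer_id")["daytime"].mean() * 100,
        "pct_weekday_calls": calls.groupby("customer_id")["weekday"].mean() * 100,
    })
    return features.fillna(0.0)
```

The remaining features (SMS counts, percentages of international and on-net calls, distinct numbers and area codes called during P) could be derived in the same group-by style.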


These twelve features can be used to build customer segments. Such a segment describes a certain behavior of a group of customers. For example, customers who use their telephone only at their office could be in a different segment than users who also use their telephone for private purposes. In that case, the segmentation would be based on the percentage of weekday and daytime calls. Most of the twelve features listed above can be generated in a straightforward manner from the underlying data of the data warehouse, but some features require a little more creativity and operations on the data.

It may be clear that generating useful features, including summary features, is a critical step within the data mining process. Should poor features be generated, data mining will not be successful. Although the construction of these features may be guided by common sense, it should include exploratory data analysis. For example, the use of the time period 9am-6pm in the fourth feature is not based on the commonsense knowledge that the typical workday at an office is from 9am to 5pm. More detailed exploratory data analysis, shown in Figure 2.2, indicates that the period from 9am to 6pm is actually more appropriate for this purpose.

Figure 2.2: Visualization of phone calls per hour

Furthermore, for each summary feature there should be sufficient variance within the data; otherwise distinguishing between customers is not possible and the feature is not useful. On the other hand, too much variance hampers the process of segmentation. For some feature values, the variance is visible in the following histograms. Figure 2.3 shows that the average call duration, the number of weekday and daytime calls and the number of originated calls have sufficient variance. Note that the histograms resemble well-known distributions. This also indicates that the chosen features are suited for the customer segmentation.

Figure 2.3: Histograms of feature values ((a) call duration, (b) weekday calls, (c) daytime calls, (d) originated calls)

Interesting to see is the relation between the number of calls originated and received. First of all, in general, customers originate more calls than they receive. Figure 2.4 demonstrates this: values above the blue line represent customers with more originated calls than received calls. In Figure 2.4 it is also visible that customers who originate more calls also receive more calls in proportion. Another aspect that is simple to figure out is the fact that customers that make more weekday calls also call more at daytime (in proportion). This is plotted in Figure 2.5. It is clear to see that the chosen features contain sufficient variance and that certain relations and different customer behaviors are already visible. The chosen features appear to be well chosen and useful for customer segmentation.

Figure 2.4: Relation between originated and received calls

Figure 2.5: Relation between daytime and weekday calls

2.1.3 Customer data

To profile the customer, customer data is needed. The data proposed in Section 1.1.2 is not completely available: information about lifestyle and income is missing. However, with some creativity, some information can be extracted from the data warehouse. The information that Vodafone stores in the data warehouse includes name and address information, and also other information such as service plan, contract information and telephone equipment information. With this information, the following variables can be used to define a customer's profile:

• Age group: <25, 25-40, 40-55, >55

• Gender: male, female

• Telephone type: simple, basic, advanced

• Type of subscription: basic, advanced, expanded

• Company size: small, intermediate, big

• Living area: (big) city, small city/town


Because a relatively small difference in age between customers should indicate a close relationship, the age of the customers has to be grouped. Otherwise, the result of the classification algorithm is too specific to the training data [14]. In general, the goal of grouping variables is to reduce the number of variables to a more manageable size and to remove the correlations between the variables. The composition of the groups should be chosen with care. It is of high importance that the sizes of the groups are almost equal (if this is possible) [22]. If there is one group with a substantially higher number of customers than the other groups, this feature will not increase the performance of the classification. This is caused by the fact that from each segment a relatively high number of customers is represented in this group. Based on such a feature, the segment of a customer cannot be determined. Table 2.1 shows the percentages of customers within the chosen groups. It is clear that the sizes of the groups were chosen with care and that the values can be used for defining the customer's profile. With this profile, a Support Vector Machine will be used to estimate the segment of the customer. Chapter 5 and Chapter 6 contain information and results of this method.

Age:                  <25 21.2%       25-40 29.5%          40-55 27.9%      >55 21.4%
Gender:               male 60.2%      female 39.8%
Telephone type:       simple 33.5%    basic 38.7%          advanced 27.8%
Type of subscription: simple 34.9%    advanced 36.0%       expanded 29.1%
Company size:         small 31.5%     intermediate 34.3%   big 34.2%
Living area:          (big) city 42.0%   small city/town 58.0%

Table 2.1: Proportions within the different classification groups
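As an illustration of how such grouped profile variables might be derived from raw attributes, the following sketch bins a continuous age column into the four groups of Table 2.1. The column names and the company-size thresholds are assumptions for the example, not values taken from the Vodafone data warehouse.

```python
import pandas as pd

def build_profile(customers: pd.DataFrame) -> pd.DataFrame:
    """Map raw customer attributes onto the categorical profile variables
    of Section 2.1.3 (illustrative sketch, not the original code)."""
    profile = pd.DataFrame(index=customers.index)
    # Age is grouped so that the four bins roughly match Table 2.1;
    # exact boundary handling is illustrative.
    profile["age_group"] = pd.cut(
        customers["age"],
        bins=[0, 25, 40, 55, 120],
        labels=["<25", "25-40", "40-55", ">55"],
    )
    profile["gender"] = customers["gender"]            # 'male' / 'female'
    profile["company_size"] = pd.cut(                  # thresholds are assumptions
        customers["employees"],
        bins=[0, 50, 500, 10**6],
        labels=["small", "intermediate", "big"],
    )
    return profile
```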

2.2 Data preparation

Before the data can be used for the actual data mining process, it needs to be cleaned and prepared in the required format. These tasks are [7]:

• Discovering and repairing inconsistent data formats and inconsistent data encoding, spelling errors, abbreviations and punctuation.

• Deleting unwanted data fields. Data may contain many fields that are meaningless from an analysis point of view, such as production keys and version numbers.

• Interpreting codes into text or replacing text with meaningful numbers. Data may contain cryptic codes. These codes have to be augmented and replaced by recognizable and equivalent text.

• Combining data, for instance the customer data, from multiple tables into one common variable.

• Finding fields that are used for multiple purposes. A possible way to determine this is to count or list all the distinct values of a field.

The following data preparations were needed during this research:

• Checking abnormal, out-of-bounds or ambiguous values. Some of these outliers may be correct, but this is highly unusual and thus almost impossible to explain.

• Checking missing data fields or fields that have been replaced by a default value.

• Adding computed fields as inputs or targets.

• Mapping continuous values into ranges, e.g. for decision trees.

• Normalization of the variables. There are two types of normalization: the first type is to normalize the values to the range [0,1]; the second type is to normalize the variance to one (see the sketch after this list).

• Converting nominal data (for example yes/no answers) to metric scales.

• Converting from textual to numeral or numeric data.
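The two normalization variants mentioned in the list above could look as follows; this is a minimal NumPy sketch operating column-wise on the data matrix, not the preprocessing code used in this research.

```python
import numpy as np

def normalize_minmax(X: np.ndarray) -> np.ndarray:
    """Scale every feature (column) of X into the range [0, 1]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / np.where(xmax > xmin, xmax - xmin, 1.0)

def normalize_variance(X: np.ndarray) -> np.ndarray:
    """Scale every feature of X to unit variance (the mean is left untouched here)."""
    std = X.std(axis=0)
    return X / np.where(std > 0, std, 1.0)
```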

New fields can be generated through combinations of e.g. frequencies, averages and minimum/maximum values. The goal of this approach is to reduce the number of variables to a more manageable size while also removing the correlations between the variables. Techniques used for this purpose are often referred to as factor analysis, correspondence analysis and conjoint analysis [14]. When there is a large amount of data, it is also useful to apply data reduction techniques (data cube aggregation, dimension and numerosity reduction, discretization and concept hierarchy generation). Dimension reduction means that one has to select the relevant features into a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution given the values of all features. For this, additional tools may be needed, e.g. exhaustive, random or heuristic search, clustering, decision trees or association rules.


Chapter 3

Clustering

In this chapter, the techniques used for the cluster segmentation will be explained.

3.1 Cluster analysis

The objective of cluster analysis is the organization of objects into groups, according to similarities among them [13]. Clustering can be considered the most important unsupervised learning method. Like every other unsupervised method, it does not use prior class identifiers to detect the underlying structure in a collection of data. A cluster can be defined as a collection of objects which are "similar" to each other and "dissimilar" to the objects belonging to other clusters. Figure 3.1 shows this with a simple graphical example. In this case the 3 clusters into which the data can be divided are easily identified. The similarity criterion used in this case is distance: two or more objects belong to the same cluster if they are "close" according to a given distance (in this case geometrical distance). This is called distance-based clustering. Another way of clustering is conceptual clustering. Within this method, two or more objects belong to the same cluster if this cluster defines a concept common to all those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures. In this research, only distance-based clustering algorithms were used.

Figure 3.1: Example of clustering data

3.1.1 The data

One can apply clustering techniques to quantitative (numerical) data, qualitative (categorical) data, or a mixture of both. In this research, the clustering of quantitative data is considered. The data, as described in Section 2.1.2, are typically summarized observations of a physical process (the call behavior of a customer). Each observation of a customer's calling behavior consists of n measured values, grouped into an n-dimensional row vector $x_k = [x_{k1}, x_{k2}, \ldots, x_{kn}]^T$, where $x_k \in \mathbb{R}^n$. A set of N observations is denoted by $X = \{x_k \mid k = 1, 2, \ldots, N\}$, and is represented as an $N \times n$ matrix:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nn} \end{bmatrix}. \qquad (3.1)$$

In pattern recognition terminology, the rows of X are called patterns or objects, the columns are called the features or attributes, and X is called the pattern matrix. In this research, X will be referred to as the data matrix. The rows of X represent the customers, and the columns are the feature variables of their behavior as described in Section 2.1.2. As mentioned before, the purpose of clustering is to find relationships between independent system variables, called the regressors, and future values of dependent variables, called the regressands. However, one should realize that the relations revealed by clustering are no more than associations among the data vectors, and therefore they will not automatically constitute a prediction model of the given system. To obtain such a model, additional steps are needed.

3.1.2 The clusters

The definition of a cluster can be formulated in various ways, depending on the objective of the clustering. In general, one can accept the definition that a cluster is a group of objects that are more similar to one another than to members of other clusters. The term "similarity" can be interpreted as mathematical similarity, measured in some well-defined sense. In metric spaces, similarity is often defined by means of a distance norm, or distance measure. Distance can be measured in different ways. The first possibility is to measure it among the data vectors themselves. A second way is to measure the distance from a data vector to some prototypical object of the cluster. The cluster centers are usually (and also in this research) not known a priori, and are calculated by the clustering algorithms simultaneously with the partitioning of the data. The cluster centers may be vectors of the same dimension as the data objects, but they can also be defined as "higher-level" geometrical objects, such as linear or nonlinear subspaces or functions.

Data can reveal clusters of different geometrical shapes, sizes and densities, as demonstrated in Figure 3.2. Clusters can be spherical, elongated and also hollow, and can be found in any n-dimensional space. Clusters a, c and d can be characterized as linear and nonlinear subspaces of the data space ($\mathbb{R}^2$ in this case). Clustering algorithms are able to detect subspaces of the data space, and are therefore reliable for identification. The performance of most clustering algorithms is influenced not only by the geometrical shapes and densities of the individual clusters, but also by the spatial relations and distances among the clusters. Clusters can be well-separated, continuously connected to each other, or overlapping each other.

Figure 3.2: Different cluster shapes in $\mathbb{R}^2$ ((a) elongated, (b) spherical, (c) and (d) hollow)

3.1.3 Cluster partition

Clusters can formally be seen as subsets of the data set. One can distinguish two possible outcomes of the classification of clustering methods: subsets can either be fuzzy or crisp (hard). Hard clustering methods are based on classical set theory, which requires that an object either does or does not belong to a cluster. Hard clustering of a data set X means partitioning the data into a specified number of mutually exclusive subsets of X. The number of subsets (clusters) is denoted by c. Fuzzy clustering methods allow objects to belong to several clusters simultaneously, with different degrees of membership. The data set X is thus partitioned into c fuzzy subsets. In many real situations, fuzzy clustering is more natural than hard clustering, as objects on the boundaries between several classes are not forced to fully belong to one of the classes, but rather are assigned membership degrees between 0 and 1 indicating their partial memberships (illustrated by Figure 3.3). The discrete nature of hard partitioning also causes analytical and algorithmic intractability of algorithms based on analytic functionals, since these functionals are not differentiable. The structure of the partition matrix $U = [\mu_{ik}]$ is:

$$U = \begin{bmatrix} \mu_{1,1} & \mu_{1,2} & \cdots & \mu_{1,c} \\ \mu_{2,1} & \mu_{2,2} & \cdots & \mu_{2,c} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{N,1} & \mu_{N,2} & \cdots & \mu_{N,c} \end{bmatrix}. \qquad (3.2)$$

Figure 3.3: Hard and fuzzy clustering

Hard partition

The objective of clustering is to partition the data set X into c clusters. Assume that c is known, e.g. based on prior knowledge, or that it is a trial value for which the partition results must be validated. Using classical sets, a hard partition can be seen as a family of subsets $\{A_i \mid 1 \le i \le c\} \subset P(X)$, whose properties can be defined as follows:

$$\bigcup_{i=1}^{c} A_i = X, \qquad (3.3)$$

$$A_i \cap A_j = \emptyset, \quad 1 \le i \ne j \le c, \qquad (3.4)$$

$$\emptyset \subset A_i \subset X, \quad 1 \le i \le c. \qquad (3.5)$$

These conditions imply that the subsets $A_i$ contain all the data in X, that they must be disjoint, and that none of them is empty nor contains all the data in X. Expressed in terms of membership functions:

$$\bigvee_{i=1}^{c} \mu_{A_i} = 1, \qquad (3.6)$$

$$\mu_{A_i} \wedge \mu_{A_j} = 0, \quad 1 \le i \ne j \le c, \qquad (3.7)$$

$$0 \le \mu_{A_i} < 1, \quad 1 \le i \le c, \qquad (3.8)$$

where $\mu_{A_i}$ represents the characteristic function of the subset $A_i$, whose value is zero or one. To simplify the notation, $\mu_i$ will be used instead of $\mu_{A_i}$, and, denoting $\mu_i(x_k)$ by $\mu_{ik}$, partitions can be represented in matrix notation. $U = [\mu_{ik}]$, an $N \times c$ matrix, is a representation of the hard partition if and only if its elements satisfy:

$$\mu_{ik} \in \{0, 1\}, \quad 1 \le i \le N, \ 1 \le k \le c, \qquad (3.9)$$

$$\sum_{k=1}^{c} \mu_{ik} = 1, \quad 1 \le i \le N, \qquad (3.10)$$

$$0 < \sum_{i=1}^{N} \mu_{ik} < N, \quad 1 \le k \le c. \qquad (3.11)$$

A hard partitioning space can be defined as follows. Let X be a finite data set and let the number of clusters satisfy $2 \le c < N$, $c \in \mathbb{N}$. Then the hard partitioning space for X is the set

$$M_{hc} = \Big\{ U \in \mathbb{R}^{N \times c} \ \Big|\ \mu_{ik} \in \{0,1\},\ \forall i,k;\ \sum_{k=1}^{c} \mu_{ik} = 1,\ \forall i;\ 0 < \sum_{i=1}^{N} \mu_{ik} < N,\ \forall k \Big\}. \qquad (3.12)$$

Fuzzy partition

Fuzzy partitioning can be seen as a generalization of hard partitioning: in this case $\mu_{ik}$ is allowed to take any real value between zero and one. Consider the matrix $U = [\mu_{ik}]$ containing the fuzzy partitions; its conditions are given by:

$$\mu_{ik} \in [0, 1], \quad 1 \le i \le N, \ 1 \le k \le c, \qquad (3.13)$$

$$\sum_{k=1}^{c} \mu_{ik} = 1, \quad 1 \le i \le N, \qquad (3.14)$$

$$0 < \sum_{i=1}^{N} \mu_{ik} < N, \quad 1 \le k \le c. \qquad (3.15)$$

Note that there is only one difference with the conditions of hard partitioning. The definition of the fuzzy partitioning space also differs little from the definition of the hard partitioning space. It can be defined as follows. Let X be a finite data set and let the number of clusters satisfy $2 \le c < N$, $c \in \mathbb{N}$. Then the fuzzy partitioning space for X is the set

$$M_{fc} = \Big\{ U \in \mathbb{R}^{N \times c} \ \Big|\ \mu_{ik} \in [0,1],\ \forall i,k;\ \sum_{k=1}^{c} \mu_{ik} = 1,\ \forall i;\ 0 < \sum_{i=1}^{N} \mu_{ik} < N,\ \forall k \Big\}. \qquad (3.16)$$

Each column of U contains the values of the membership function of one fuzzy subset of X. Equation (3.14) implies that each row of U sums to 1, which means that the total membership of each $x_k$ in X equals one. There are no constraints on the distribution of memberships among the fuzzy clusters. This research will focus on hard partitioning. However, fuzzy cluster algorithms will be applied as well. To deal with the problem of fuzzy memberships, the cluster with the highest degree of membership will be chosen as the cluster to which the object belongs. This method results in hard partitioned clusters. The possibilistic partition will not be used in this research and will not be discussed here.
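Hardening a fuzzy partition in this way amounts to an arg-max over the membership degrees of each object. A minimal sketch, assuming U is stored as an N x c array as in equation (3.2):

```python
import numpy as np

def harden(U: np.ndarray) -> np.ndarray:
    """Assign each object (row of the N x c partition matrix) to the cluster
    with its highest membership degree, yielding a hard partition."""
    return U.argmax(axis=1)
```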

3.2 Cluster algorithms

This section gives an overview of the clustering algorithms that were used during

the research.

3.2.1 K-means

K-means is one of the simplest unsupervised learning algorithms that solve the clustering problem. However, the results of this hard partitioning method are not always reliable, and the algorithm has numerical problems as well. The procedure follows an easy way to classify a given $N \times n$ data set through a certain number of c clusters defined in advance. The K-means algorithm allocates each data point to one of the c clusters so as to minimize the within-cluster sum of squares:

$$\sum_{i=1}^{c} \sum_{k \in A_i} \|x_k - v_i\|^2. \qquad (3.17)$$

$A_i$ represents the set of data points in the i-th cluster and $v_i$ is the average of the data points in cluster i. Note that $\|x_k - v_i\|^2$ is actually a chosen distance norm. Within the cluster algorithms, $v_i$ is the cluster center (also called prototype) of cluster i:

$$v_i = \frac{\sum_{k=1}^{N_i} x_k}{N_i}, \quad x_k \in A_i, \qquad (3.18)$$

where $N_i$ is the number of data points in $A_i$.
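A minimal NumPy sketch of this alternating scheme is given below; it is an illustration of the procedure described above (random initialization, nearest-center assignment, center update), not the implementation used in this research.

```python
import numpy as np

def kmeans(X: np.ndarray, c: int, n_iter: int = 100, seed: int = 0):
    """Minimal K-means sketch: alternate assignment and center update to
    (locally) minimize the within-cluster sum of squares of eq. (3.17)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]   # random initial prototypes
    for _ in range(n_iter):
        # squared Euclidean distance of every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                            # nearest-center assignment
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(c)                                 # keep old center if cluster is empty
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```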


3.2.2 K-medoid

K-medoid clustering, also a hard partitioning algorithm, uses the same equations as the K-means algorithm. The only difference is that in K-medoid the cluster centers are the data points nearest to the mean of the data in one cluster, $V = \{v_i \in X \mid 1 \le i \le c\}$. This can be useful when, for example, there is no continuity in the data space, which implies that a mean of the points in one cluster does not actually exist.
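The only change with respect to K-means is thus the prototype computation. A small sketch of this "nearest point to the mean" prototype, as described above (note that other K-medoid formulations minimize total dissimilarity instead):

```python
import numpy as np

def medoid_of(cluster_points: np.ndarray) -> np.ndarray:
    """Prototype as described above: the data point of the cluster that lies
    closest to the cluster mean, so the center is always an existing object."""
    mean = cluster_points.mean(axis=0)
    idx = ((cluster_points - mean) ** 2).sum(axis=1).argmin()
    return cluster_points[idx]
```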

3.2.3 Fuzzy C-means

The Fuzzy C-means algorithm (FCM) minimizes an objective function, called the C-means functional, to define the clusters. The C-means functional, introduced by Dunn, is defined as follows:

$$J(X; U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m \|x_k - v_i\|_A^2, \qquad (3.19)$$

with

$$V = [v_1, v_2, \ldots, v_c], \quad v_i \in \mathbb{R}^n. \qquad (3.20)$$

V denotes the vector of cluster centers that has to be determined. The distance norm $\|x_k - v_i\|_A^2$ is called a squared inner-product distance norm and is defined by:

$$D_{ikA}^2 = \|x_k - v_i\|_A^2 = (x_k - v_i)^T A (x_k - v_i). \qquad (3.21)$$

From a statistical point of view, equation (3.19) measures the total variance of $x_k$ from $v_i$. The minimization of the C-means functional can be seen as a nonlinear optimization problem, which can be solved by a variety of methods; examples are grouped coordinate minimization and genetic algorithms. The simplest method to solve this problem is the Picard iteration through the first-order conditions for the stationary points of equation (3.19). This method is called the fuzzy c-means algorithm. To find the stationary points of the C-means functional, one can adjoin the constraint in (3.14) to J by means of Lagrange multipliers:

$$\bar{J}(X; U, V, \lambda) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m D_{ikA}^2 + \sum_{k=1}^{N} \lambda_k \Big( \sum_{i=1}^{c} \mu_{ik} - 1 \Big), \qquad (3.22)$$

and by setting the gradients of $\bar{J}$ with respect to U, V and $\lambda$ to zero. If $D_{ikA}^2 > 0$ for all i, k and $m > 1$, then the C-means functional can only be minimized by $(U, V) \in M_{fc} \times \mathbb{R}^{n \times c}$ if

$$\mu_{ik} = \frac{1}{\sum_{j=1}^{c} \left( D_{ikA} / D_{jkA} \right)^{2/(m-1)}}, \quad 1 \le i \le c, \ 1 \le k \le N, \qquad (3.23)$$

and

$$v_i = \frac{\sum_{k=1}^{N} \mu_{ik}^m x_k}{\sum_{k=1}^{N} \mu_{ik}^m}, \quad 1 \le i \le c. \qquad (3.24)$$

The solutions of these equations satisfy the constraints given in equations (3.13) and (3.15). Remark that $v_i$ in equation (3.24) is the weighted average of the data points that belong to a cluster, where the weights are the membership degrees. This explains why the algorithm is called c-means. The Fuzzy C-means algorithm is in fact an iteration between equations (3.23) and (3.24). The FCM algorithm uses the standard Euclidean distance for its computations. Therefore, it is able to define hyperspherical clusters. Note that it can only detect clusters with the same shape, caused by the common choice of the norm-inducing matrix A = I. The norm-inducing matrix can also be chosen as an $n \times n$ diagonal matrix of the form:

$$A_D = \begin{bmatrix} (1/\sigma_1)^2 & 0 & \cdots & 0 \\ 0 & (1/\sigma_2)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (1/\sigma_n)^2 \end{bmatrix}. \qquad (3.25)$$

This matrix accounts for different variances in the directions of the coordinate axes of X. Another possibility is to choose A as the inverse of the $n \times n$ covariance matrix, $A = F^{-1}$, where

$$F = \frac{1}{N} \sum_{k=1}^{N} (x_k - \bar{x})(x_k - \bar{x})^T \qquad (3.26)$$

and $\bar{x}$ denotes the mean of the data. Note that, in this case, the matrix A is based on the Mahalanobis distance norm.
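The alternation between equations (3.23) and (3.24) can be sketched as follows in NumPy. Here the membership matrix is stored as a c x N array, A = I is assumed (Euclidean norm), and the code is an illustrative sketch rather than the toolbox implementation used in this research.

```python
import numpy as np

def fuzzy_cmeans(X: np.ndarray, c: int, m: float = 2.0,
                 n_iter: int = 100, tol: float = 1e-5, seed: int = 0):
    """Minimal Fuzzy C-means sketch: Picard iteration between the membership
    update (3.23) and the center update (3.24), with A = I."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                     # memberships of each point sum to one
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)       # eq. (3.24): weighted cluster centers
        # squared distances D_ik^2 between centers (rows) and points (columns)
        D2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        D2 = np.fmax(D2, 1e-12)                            # avoid division by zero
        U_new = 1.0 / (D2 ** (1.0 / (m - 1)))
        U_new /= U_new.sum(axis=0)                         # eq. (3.23), normalized over clusters
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```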

3.2.4 The Gustafson-Kessel algorithm

The Gustafson and Kessel (GK) algorithm is a variation on the Fuzzy c-means

algorithm [11]. It employs a diﬀerent and adaptive distance norm to recognize

geometrical shapes in the data. Each cluster will have its own norm-inducing

matrix A

i

, satisfying the following inner-product norm:

D

2

ikA

= (x

k

−v

i

)

T

· A

i

(x

k

−v

i

), where 1 ≤ i ≤ c and 1 ≤ k ≤ N. (3.27)

The matrices A_i are used as optimization variables in the c-means functional. This
implies that each cluster is allowed to adapt the distance norm to the local
topological structure of the data. The c-tuple of norm-inducing matrices is denoted
by A = (A_1, A_2, \ldots, A_c). The objective functional of the GK algorithm is:

J(X; U, V, A) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m D_{ikA_i}^2.    (3.28)


If A is fixed, the conditions under (3.13), (3.14) and (3.15) can be applied without
any problems. Unfortunately, equation (3.28) cannot be minimized in a straightforward
manner, since it is linear in A_i. This implies that J can be made as small as
desired by making A_i less positive definite. To obtain a feasible solution, A_i has
to be constrained. A common way to do this is to constrain the determinant of the
matrix: a varying A_i with a fixed determinant corresponds to optimizing the cluster
with a fixed volume:

|A_i| = \rho_i, \quad \rho_i > 0.    (3.29)

Here \rho_i is a constant that remains fixed for each cluster. Using the Lagrange
multiplier method, A_i can be expressed as:

A_i = \left[ \rho_i \det(F_i) \right]^{1/n} F_i^{-1},    (3.30)

with

F_i = \frac{\sum_{k=1}^{N} (\mu_{ik})^m (x_k - v_i)(x_k - v_i)^T}
           {\sum_{k=1}^{N} (\mu_{ik})^m}.    (3.31)

F_i is called the fuzzy covariance matrix. Note that this equation, in combination
with equation (3.30), can be substituted into equation (3.27). The resulting
inner-product norm of (3.27) is a generalized squared Mahalanobis norm between the
data points and the cluster center, in which the covariance is weighted by the
membership degrees of U.
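As an illustration of the GK update, the sketch below computes the fuzzy covariance
matrices (3.31), the norm-inducing matrices (3.30) and the corresponding squared
distances (3.27) for a given partition. The function, its name and its defaults
(e.g. \rho_i = 1) are assumptions made for this example, not the exact implementation
used in this research.

import numpy as np

def gk_norm_matrices(X, V, U, m=2.0, rho=None):
    """Gustafson-Kessel norm update for data X (N, n), centers V (c, n), memberships U (c, N)."""
    c, N = U.shape
    n = X.shape[1]
    rho = np.ones(c) if rho is None else rho
    W = U ** m                                        # membership weights
    F = np.empty((c, n, n))
    A = np.empty((c, n, n))
    D2 = np.empty((c, N))
    for i in range(c):
        diff = X - V[i]                               # (N, n)
        # Fuzzy covariance matrix, equation (3.31).
        F[i] = (W[i][:, None] * diff).T @ diff / W[i].sum()
        # Norm-inducing matrix with fixed volume, equation (3.30).
        A[i] = (rho[i] * np.linalg.det(F[i])) ** (1.0 / n) * np.linalg.inv(F[i])
        # Squared inner-product distance, equation (3.27).
        D2[i] = np.einsum('kj,jl,kl->k', diff, A[i], diff)
    return F, A, D2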

3.2.5 The Gath-Geva algorithm

Bezdek and Dunn [5] proposed a fuzzy maximum likelihood estimation (FMLE)

algorithm with a corresponding distance norm:

D_{ik}(x_k, v_i) = \frac{\sqrt{\det(F_{wi})}}{\alpha_i}
  \exp\!\left( \frac{1}{2} (x_k - v_i^{(l)})^T F_{wi}^{-1} (x_k - v_i^{(l)}) \right),    (3.32)

Compared with the Gustafson-Kessel algorithm, this distance norm includes an
exponential term, which implies that it decreases faster than the inner-product
norm. In this case, the fuzzy covariance matrix F_{wi} is defined by:

F_{wi} = \frac{\sum_{k=1}^{N} (\mu_{ik})^w (x_k - v_i)(x_k - v_i)^T}
              {\sum_{k=1}^{N} (\mu_{ik})^w}, \quad 1 \le i \le c.    (3.33)

The reason for using the variable w is to generalize this expression. In the
original FMLE algorithm, w = 1. In this research, w is set to 2 to compensate for
the exponential term and to obtain clusters that are more fuzzy. Because of this
generalization, two weighted covariance matrices arise. The variable \alpha_i in
equation (3.32) is the prior probability of selecting cluster i, and can be defined
as follows:

\alpha_i = \frac{1}{N} \sum_{k=1}^{N} \mu_{ik}.    (3.34)


Gath and Geva [9] showed that the FMLE algorithm is able to detect clusters of
different shapes, sizes and densities, and that the clusters are not constrained in
volume. The main drawback of this algorithm is its lack of robustness, since the
exponential distance norm can make it converge to a local optimum. Furthermore, it
is not known how reliable the results of this algorithm are.
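For comparison with the GK sketch above, the following fragment evaluates the prior
probabilities (3.34), the weighted covariance matrices (3.33) and the FMLE distance
norm (3.32) for a given partition. It is an illustrative sketch only, under the
assumption that no further normalization constants are used; it is not the toolbox
implementation used in this research.

import numpy as np

def gg_distances(X, V, U, w=2.0):
    """Gath-Geva priors and FMLE distances for X (N, n), V (c, n), U (c, N)."""
    c, N = U.shape
    W = U ** w
    alpha = U.sum(axis=1) / N                                # equation (3.34)
    D = np.empty((c, N))
    for i in range(c):
        diff = X - V[i]
        Fw = (W[i][:, None] * diff).T @ diff / W[i].sum()    # equation (3.33)
        quad = np.einsum('kj,jl,kl->k', diff, np.linalg.inv(Fw), diff)
        D[i] = np.sqrt(np.linalg.det(Fw)) / alpha[i] * np.exp(0.5 * quad)  # (3.32)
    return alpha, D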

3.3 Validation

Cluster validation refers to the problem of whether a found partition is correct and
how the correctness of a partition can be measured. A clustering algorithm is
designed to parameterize clusters in such a way that it gives the best fit. However,
this does not imply that the best fit is meaningful at all. The number of clusters
might not be correct, or the cluster shapes might not correspond to the actual
groups in the data. In the worst case, the data cannot be grouped in a meaningful
way at all. One can distinguish two main approaches to determine the correct number
of clusters in the data:

• Start with a sufficiently large number of clusters, and successively reduce this
number by combining clusters that have the same properties.

• Cluster the data for different values of c and validate the correctness of the
obtained clusters with validation measures.

To be able to perform the second approach, validation measures have to be designed.
Different validation methods have been proposed in the literature; however, none of
them is perfect on its own. Therefore, several indexes are used in this research,
which are described below:

• Partition Coefficient (PC): measures the amount of "overlapping" between
clusters. It is defined by Bezdek [5] as follows:

PC(c) = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N} (u_{ij})^2,    (3.35)

where u_{ij} is the membership of data point j in cluster i. The main drawback of
this validity measure is the lack of a direct connection to the data itself. The
optimal number of clusters corresponds to the maximum value of the index.

• Classification Entropy (CE): measures only the fuzziness of the clusters, and is
a slight variation on the Partition Coefficient:

CE(c) = -\frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij} \log(u_{ij})    (3.36)


• Partition Index (PI): expresses the ratio of the compactness to the separation of
the clusters. Each individual cluster is measured with this ratio, normalized by
dividing it by the fuzzy cardinality of the cluster. The Partition Index is the sum
of these values over all clusters:

PI(c) = \sum_{i=1}^{c} \frac{\sum_{j=1}^{N} (u_{ij})^m \|x_j - v_i\|^2}
                            {N_i \sum_{k=1}^{c} \|v_k - v_i\|^2}    (3.37)

PI is mainly used for comparing different partitions with the same number of
clusters. A lower value of PI means a better partitioning.

• Separation Index (SI): in contrast with the Partition Index (PI), the Separation
Index uses a minimum-distance separation to validate the partitioning:

SI(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (u_{ij})^2 \|x_j - v_i\|^2}
             {N \min_{i,k} \|v_k - v_i\|^2}    (3.38)

• Xie and Beni's Index (XB): quantifies the ratio of the total variation within the
clusters to the separation of the clusters [3]:

XB(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (u_{ij})^m \|x_j - v_i\|^2}
             {N \min_{i,j} \|x_j - v_i\|^2}    (3.39)

The lowest value of the XB index indicates the optimal number of clusters.

• Dunn's Index (DI): this index was originally designed for the identification of
hard partitions, so the result of the clustering has to be recalculated (hardened)
first:

DI(c) = \min_{i \in c} \left\{ \min_{j \in c, j \neq i} \left\{
        \frac{\min_{x \in C_i, y \in C_j} d(x, y)}
             {\max_{k \in c} \{ \max_{x, y \in C_k} d(x, y) \}} \right\} \right\}    (3.40)

The main disadvantage of Dunn's index is its very expensive computational complexity
as c and N increase.

• Alternative Dunn Index (ADI): to simplify the calculation of Dunn's index, the
Alternative Dunn Index was designed. The calculation becomes simpler when the
dissimilarity between two clusters, measured with \min_{x \in C_i, y \in C_j} d(x, y),
is bounded from below using the triangle inequality:

d(x, y) \ge |d(y, v_j) - d(x, v_j)|,    (3.41)

where v_j represents the cluster center of the j-th cluster. This gives:

ADI(c) = \min_{i \in c} \left\{ \min_{j \in c, j \neq i} \left\{
         \frac{\min_{x_i \in C_i, x_j \in C_j} |d(y, v_j) - d(x_i, v_j)|}
              {\max_{k \in c} \{ \max_{x, y \in C_k} d(x, y) \}} \right\} \right\}    (3.42)


Note that the Partition Coefficient and the Classification Entropy are only useful
for fuzzy partitioning methods. In the case of fuzzy clusters, the values of Dunn's
Index and the Alternative Dunn Index are not reliable; this is caused by the
repartitioning of the results with the hard partition method.
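As an illustration, the sketch below evaluates three of the indexes above (PC, CE
and XB) directly from a partition matrix; the remaining indexes can be computed
along the same lines. The helper function is hypothetical and written only for this
example.

import numpy as np

def validation_indexes(X, V, U, m=2.0):
    """Partition Coefficient (3.35), Classification Entropy (3.36) and Xie-Beni (3.39)."""
    N = X.shape[0]
    D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # squared distances (c, N)
    pc = (U ** 2).sum() / N                                   # (3.35), maximum is best
    ce = -(U * np.log(np.clip(U, 1e-12, None))).sum() / N     # (3.36)
    xb = ((U ** m) * D2).sum() / (N * D2.min())               # (3.39), minimum is best
    return {"PC": pc, "CE": ce, "XB": xb}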

3.4 Visualization

To understand the data and the results of the clustering methods, it is useful to
visualize them. However, the data set used here is high-dimensional and cannot be
plotted and visualized directly. This section describes three methods that can map
the data points into a lower-dimensional space.

In this research, the three mapping methods will be used for the visualization of
the clustering results. The first method is Principal Component Analysis (PCA), a
standard and widely used method to map high-dimensional data into a
lower-dimensional space. Then, this report will focus on the Sammon mapping method.
The advantage of the Sammon mapping is its ability to preserve inter-point
distances. Preserving distances is much more closely related to the purpose of
clustering than preserving variances (which is what PCA does). However, the Sammon
mapping has two main drawbacks:

• Sammon mapping is a projection method based on the preservation of the Euclidean
inter-point distance norm. This implies that the Sammon mapping can only be applied
to clustering algorithms that use the Euclidean distance norm during the calculation
of the clusters.

• The Sammon mapping method aims to find, for N points in a high n-dimensional
space, N points in a lower q-dimensional subspace such that the inter-point
distances correspond to the distances measured in the n-dimensional space. This
requires a computationally expensive algorithm, because every iteration step
requires the computation of N(N − 1)/2 distances.

To avoid these problems of the Sammon mapping method, a modified algorithm, called
the Fuzzy Sammon mapping, is used in this research. A drawback of this Fuzzy Sammon
mapping is a loss of precision in the distances, since only the distances between
the data points and the cluster centers are considered to be important.

The three visualization methods are explained in more detail in the following
subsections.

3.4.1 Principal Component Analysis

Principal Component Analysis (PCA) is a mathematical procedure that maps a number of
correlated variables onto a smaller set of uncorrelated variables, called the
principal components. The first principal component represents as much of the
variability in the data as possible; the succeeding components describe the
remaining variability. The main goals of the PCA method are:

• Identifying new meaningful underlying variables.

• Discovering and/or reducing the dimensionality of a data set.

In mathematical terms, the principal components are obtained by analyzing the
eigenvectors and eigenvalues of the covariance matrix. The direction of the first
principal component is given by the eigenvector with the largest eigenvalue; the
eigenvector associated with the second largest eigenvalue corresponds to the second
principal component, and so on. In this research, the second objective is pursued.
In this case, the covariance matrix of the data set can be described by:

F = \frac{1}{N} \sum_{k=1}^{N} (x_k - v)(x_k - v)^T,    (3.43)

where v = \bar{x}, the mean of the data. Principal Component Analysis is based on
the projection of correlated high-dimensional data onto a hyperplane [3]. This
method uses only the first q nonzero eigenvalues and the corresponding eigenvectors
of the covariance matrix:

F_i = U_i \Lambda_i U_i^T.    (3.44)

Here \Lambda_i is a matrix that contains the eigenvalues \lambda_{i,j} of F_i on its
diagonal in decreasing order, and U_i is a matrix containing the corresponding
eigenvectors in its columns. Furthermore, there is a q-dimensional reduced vector
that represents the vector x_k of X, which can be defined as follows:

y_{i,k} = W_i^{-1}(x_k) = W_i^T (x_k).    (3.45)

The weight matrix W_i contains the q principal orthonormal axes in its columns:

W_i = U_{i,q} \Lambda_{i,q}^{1/2}.    (3.46)
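The projection described by equations (3.43)-(3.45) can be sketched in a few lines;
the helper below is illustrative only, and in practice a library routine would
normally be used instead.

import numpy as np

def pca_project(X, q=2):
    """Project the (N, n) data matrix X onto its first q principal components."""
    Xc = X - X.mean(axis=0)                      # center the data (v = mean)
    F = Xc.T @ Xc / X.shape[0]                   # covariance matrix, (3.43)
    eigvals, eigvecs = np.linalg.eigh(F)         # eigh: F is symmetric
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues decreasingly
    W = eigvecs[:, order[:q]]                    # first q principal axes
    return Xc @ W                                # (N, q) projected data, (3.45)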

3.4.2 Sammon mapping

As mentioned before, the Sammon mapping uses inter-point distances to find N points
in a q-dimensional space that are representative of a higher n-dimensional data set.
The inter-point distances in the n-dimensional space, defined by d_{ij} = d(x_i, x_j),
should correspond to the inter-point distances in the q-dimensional space, given by
d^*_{ij} = d^*(y_i, y_j). This is achieved by minimizing Sammon's stress, an error
criterion:

E = \frac{1}{\lambda} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N}
    \frac{(d_{ij} - d^*_{ij})^2}{d_{ij}},    (3.47)

where \lambda is a constant:

\lambda = \sum_{i<j} d_{ij} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} d_{ij}.    (3.48)


Note that there is no need to maintain \lambda, since a constant does not change the
result of the optimization process. The minimization of the error E is an
optimization problem in the N \cdot q variables y_{il}, with i \in \{1, 2, \ldots, N\}
and l \in \{1, 2, \ldots, q\}, which implies that y_i = [y_{i1}, \ldots, y_{iq}]^T.
The update of y_{il} at the t-th iteration is given by:

y_{il}(t+1) = y_{il}(t) - \alpha \,
  \frac{\partial E(t) / \partial y_{il}(t)}{\partial^2 E(t) / \partial y_{il}^2(t)},    (3.49)

where \alpha is a nonnegative scalar constant, with a recommended value of
\alpha \approx 0.3 - 0.4. This scalar constant represents the step size for the
gradient search in the direction of:

\frac{\partial E(t)}{\partial y_{il}(t)} = -\frac{2}{\lambda}
  \sum_{k=1, k \neq i}^{N} \frac{d_{ki} - d^*_{ki}}{d_{ki} d^*_{ki}} (y_{il} - y_{kl})    (3.50)

\frac{\partial^2 E(t)}{\partial y_{il}^2(t)} = -\frac{2}{\lambda}
  \sum_{k=1, k \neq i}^{N} \frac{1}{d_{ki} d^*_{ki}}
  \left[ (d_{ki} - d^*_{ki}) - \frac{(y_{il} - y_{kl})^2}{d^*_{ki}}
  \left( 1 + \frac{d_{ki} - d^*_{ki}}{d_{ki}} \right) \right]    (3.51)

With this gradient-descent method, it is possible to get stuck in a local minimum of
the error surface while searching for the minimum of E. This is a disadvantage,
because multiple experiments with different random initializations are necessary to
find the minimum. However, it is possible to estimate a good initialization based on
information obtained from the data.
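A small, illustrative sketch of the Sammon iteration is given below. The common
factor -2/\lambda of (3.50) and (3.51) cancels in the ratio of (3.49) and is
therefore omitted. The function is an assumption written for this example, not the
implementation used in this research, and it contains no safeguards against local
minima (multiple random restarts would be needed in practice).

import numpy as np
from scipy.spatial.distance import pdist, squareform

def sammon(X, q=2, alpha=0.35, iters=100, seed=0):
    """Sammon mapping of the (N, n) data X to an (N, q) configuration Y."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    D = squareform(pdist(X)) + np.eye(N)        # d_ij (eye avoids division by zero)
    Y = rng.normal(scale=1e-2, size=(N, q))     # random initial configuration

    for _ in range(iters):
        Dy = squareform(pdist(Y)) + np.eye(N)   # d*_ij in the projected space
        for i in range(N):
            grad = np.zeros(q)                  # sum of equation (3.50) terms
            hess = np.zeros(q)                  # sum of equation (3.51) terms
            for k in range(N):
                if k == i:
                    continue
                d, ds = D[k, i], Dy[k, i]
                diff = Y[i] - Y[k]
                grad += (d - ds) / (d * ds) * diff
                hess += ((d - ds) - diff**2 / ds * (1 + (d - ds) / d)) / (d * ds)
            hess = np.where(np.abs(hess) < 1e-12, 1e-12, hess)
            Y[i] = Y[i] - alpha * grad / hess   # pseudo-Newton step, equation (3.49)
    return Y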

3.4.3 Fuzzy Sammon mapping

As mentioned in the introduction of this section, Sammon's mapping has several
drawbacks. To avoid these drawbacks, a modified mapping method was designed that
takes into account the basic property of fuzzy clustering algorithms that only the
distances between the data points and the cluster centers are considered to be
important [3]. The modified algorithm, called Fuzzy Sammon mapping, uses only
N \cdot c distances, weighted by the membership values, similarly to equation (3.19):

E_{fuzz} = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ki})^m
           \left( d(x_k, v_i) - d^*_{ki} \right)^2,    (3.52)

with d(x_k, v_i) representing the distance between data point x_k and the cluster
center v_i in the original n-dimensional space. The Euclidean distance between the
cluster center z_i and the data point y_k in the projected q-dimensional space is
denoted by d^*(y_k, z_i) = d^*_{ki}. With this information, every cluster is
represented in the projected two-dimensional space by a single point, independently
of the shape of the original cluster. The Fuzzy Sammon mapping algorithm is similar
to the original Sammon mapping, but in this case the projected cluster centers are
recalculated in every iteration after the adaption of the projected data points.
The recalculation is based on the weighted mean formula of the fuzzy clustering
algorithms, described in Section 3.2.3 (equation 3.24).

The membership values of the projected data can be plotted based on the standard
equation for the calculation of the membership values:

\mu^*_{ki} = \frac{1}{\sum_{j=1}^{c}
  \left( d^*(y_k, z_i) / d^*(y_k, z_j) \right)^{2/(m-1)}},    (3.53)

where U^* = [\mu^*_{ki}] is the partition matrix with the recalculated memberships.
The plot only gives an approximation of the high-dimensional clustering in a
two-dimensional space. To measure the quality of this approximation, an evaluation
measure can be defined as the difference between the original and the recalculated
partition matrix:

P = \|U - U^*\|.    (3.54)

In the next chapter, the cluster algorithms will be tested and evaluated. The

PCA and the (Fuzzy) Sammon mapping methods will be used to visualize the

data and the clusters.


Chapter 4

Experiments and results of

customer segmentation

In this chapter, the cluster algorithms will be tested and their performance will

be measured with the proposed validation methods of the previous chapter. The

best working cluster method will be used to determine the segments. The chap-

ter ends with an evaluation of the segments.

4.1 Determining the optimal number of clusters

The disadvantage of the proposed cluster algorithms is that the number of clusters
has to be given in advance. In this research, the number of clusters is not known.
Therefore, the optimal number of clusters has to be found with the validation
methods of Section 3.3. For each algorithm, calculations were executed for each
number of clusters c ∈ [2, 15]. To find the optimal number of clusters, a process
called the elbow criterion is used. The elbow criterion is a common rule of thumb to
determine what number of clusters should be chosen. It states that one should choose
a number of clusters such that adding another cluster does not add sufficient
information. More precisely, when graphing a validation measure against the number
of clusters, the first clusters will add much information (explain a lot of
variance), but at some point the marginal gain drops, giving an angle in the graph
(the elbow). Unfortunately, this elbow cannot always be unambiguously identified.
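The elbow procedure can be sketched as a simple loop over candidate values of c. The
fragment below uses K-means and the within-cluster sum of squares as the plotted
measure, purely as an illustration: any of the validation indexes of Section 3.3
could be recorded in the same loop instead, and the use of scikit-learn and the
placeholder data are assumptions made for this example.

import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X, c_range=range(2, 16), seed=0):
    """Total within-cluster sum of squares for each candidate number of clusters."""
    scores = {}
    for c in c_range:
        km = KMeans(n_clusters=c, n_init=10, random_state=seed).fit(X)
        scores[c] = km.inertia_          # compactness; drops as c increases
    return scores

# Example usage on random data standing in for the selected customers:
X = np.random.default_rng(0).normal(size=(1000, 12))
print(elbow_curve(X))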

To demonstrate the working of the elbow criterion, the feature values that represent
the call behavior of the customers, as described in Section 2.1.2, are used as input
for the cluster algorithms. From the 800,000 business customers of Vodafone, 25,000
customers were randomly selected for the experiments; more customers would lead to
computational problems. First, the K-means algorithm is evaluated, and the values of
the validation methods are plotted against the number of clusters. The value of the
Partition Coefficient is 1 for all numbers of clusters, and the Classification
Entropy is always 'NaN'. This is caused by the fact that these two measures were
designed for fuzzy partitioning methods, whereas the hard partitioning algorithm
K-means is used here. In Figure 4.1, the values of the Partition Index, Separation
Index and Xie and Beni's Index are shown. Note again that no validation index is
reliable by itself.

Figure 4.1: Values of Partition Index, Separation Index and the Xie Beni Index

Therefore, all the validation indexes are shown. The optimum may differ between
validation methods, which means that it can only be detected by comparing all the
results. To find the optimal number of clusters, partitions with fewer clusters are
considered better when the differences between the values of the validation measure
are small. Figure 4.1 shows that for the PI and SI, the number of clusters can
readily be set to 4. For the Xie and Beni index, this is much harder: the elbow can
be found at c = 3, c = 6, c = 9 or c = 13, depending on the definition and
parameters of an elbow. Figure 4.2 shows more informative plots. Dunn's index and
the Alternative Dunn index confirm that the optimal number of clusters for the
K-means algorithm should be chosen as 4. The values of all the validation measures
for the K-means algorithm are collected in Table 4.1.


Figure 4.2: Values of Dunn’s Index and the Alternative Dunn Index

c 2 3 4 5 6 7 8

PC 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

CE NaN NaN NaN NaN NaN NaN NaN

PI 3.8318 1.9109 1.1571 1.0443 1.2907 0.9386 0.8828

SI 0.0005 0.0003 0.0002 0.0002 0.0002 0.0002 0.0002

XBI 5.4626 4.9519 5.0034 4.3353 3.9253 4.2214 3.9079

DI 0.0082 0.0041 0.0034 0.0065 0.0063 0.0072 0.0071

ADI 0.0018 0.0013 0.0002 0.0001 0.0001 0.0001 0.0000

c 9 10 11 12 13 14 15

PC 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

CE NaN NaN NaN NaN NaN NaN NaN

PI 0.8362 0.8261 0.8384 0.7783 0.7696 0.7557 0.7489

SI 0.0002 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001

XBI 3.7225 3.8620 3.8080 3.8758 3.4379 3.3998 3.5737

DI 0.0071 0.0052 0.0061 0.0070 0.0061 0.0061 0.0061

ADI 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

Table 4.1: The values of all the validation measures with K-means clustering


It is also possible to determine the optimal number of clusters for fuzzy clustering
algorithms with this method. To illustrate this, the results of the Gustafson-Kessel
algorithm are shown. In Figure 4.3 the results of the Partition Coefficient and the
Classification Entropy are plotted. In contrast to the hard clustering methods,
these validation methods can now be used for the fuzzy clustering. However, the main
drawback of the PC is its monotonic decrease with c, which makes it hard to detect
the optimal number of clusters. The same problem holds for the CE, which increases
monotonically, caused by the lack of a direct connection to the data. The optimal
number of clusters cannot be determined based on those two validation methods.

Figure 4.3: Values of Partition Coefficient and Classification Entropy with
Gustafson-Kessel clustering

Figure 4.4 gives more information about the optimal number of clusters. For the PI
and the SI, the local minimum is reached at c = 6. Again, for the XBI, it is
difficult to find the optimal number of clusters; the points at c = 3, c = 6 and
c = 11 can each be seen as an elbow. In Figure 4.5, the Dunn index also indicates
that the optimal number of clusters should be c = 6. On the other hand, the
Alternative Dunn index has an elbow at c = 3. However, it is not known how reliable
the results of the Alternative Dunn Index are, so the optimal number of clusters for
the Gustafson-Kessel algorithm is set to six. The results of the validation measures
for the Gustafson-Kessel algorithm are given in Table 4.2. This process can be
repeated for all other cluster algorithms; the results can be found in Appendix B.
For the K-means, K-medoid and Gath-Geva algorithms, the optimal number of clusters
is chosen as c = 4. For the other algorithms, the optimal number of clusters is
located at c = 6.


Figure 4.4: Values of Partition Index, Separation Index and the Xie Beni Index

with Gustafson-Kessel clustering

Figure 4.5: Values of Dunn’s Index and Alternative Dunn Index with Gustafson-

Kessel clustering


c 2 3 4 5 6 7 8

PC 0.6462 0.5085 0.3983 0.3209 0.3044 0.2741 0.2024

CE 0.5303 0.8218 1.0009 1.2489 1.4293 1.5512 1.7575

PI 0.9305 1.2057 1.5930 1.9205 0.8903 0.7797 0.8536

SI 0.0002 0.0003 0.0007 0.0004 0.0001 0.0001 0.0002

XBI 2.3550 1.6882 1.4183 1.1573 0.9203 0.9019 0.7233

DI 0.0092 0.0082 0.0083 0.0062 0.0029 0.0041 0.0046

ADI 0.0263 0.0063 0.0039 0.0018 0.0007 0.0001 0.0009

c 9 10 11 12 13 14 15

PC 0.2066 0.1611 0.1479 0.1702 0.1410 0.1149 0.1469

CE 1.8128 2.0012 2.0852 2.0853 2.2189 2.3500 2.3046

PI 0.9364 0.7293 0.7447 0.7813 0.7149 0.6620 0.7688

SI 0.0002 0.0001 0.0002 0.0002 0.0001 0.0001 0.0001

XBI 0.5978 0.5131 0.4684 0.5819 0.5603 0.5675 0.5547

DI 0.0039 0.0030 0.0028 0.0027 0.0017 0.0015 0.0006

ADI 0.0003 0.0002 0.0004 0.0002 0.0000 0.0001 0.0000

Table 4.2: The values of all the validation measures with Gustafson-Kessel

clustering

4.2 Comparing the clustering algorithms

The optimal number of clusters can be determined with the validation methods, as
described in the previous section. The validation measures can also be used to
compare the different cluster methods. As examined in the previous section, the
optimal number of clusters was found at c = 4 or c = 6, depending on the clustering
algorithm. The validation measures for c = 4 and c = 6 of all the clustering methods
are collected in Tables 4.3 and 4.4.

PC CE PI SI XBI DI ADI
K-means 1 NaN 1.1571 0.0002 5.0034 0.0034 0.0002
K-medoid 1 NaN 0.2366 0.0001 Inf 0.0084 0.0002
FCM 0.2800 1.3863 0.0002 42.2737 1.0867 0.0102 0.0063
GK 0.3983 1.0009 1.5930 0.0007 1.4183 0.0083 0.0039
GG 0.4982 1.5034 0.0001 0.0001 1.0644 0.0029 0.0030
Table 4.3: The numerical values of validation measures for c = 4

PC CE PI SI XBI DI ADI
K-means 1 NaN 1.2907 0.0002 3.9253 0.0063 0.0001
K-medoid 1 NaN 0.1238 0.0001 Inf 0.0070 0.0008
FCM 0.1667 1.7918 0.0001 19.4613 0.9245 0.0102 0.0008
GK 0.3044 1.4293 0.8903 0.0001 0.9203 0.0029 0.0007
GG 0.3773 1.6490 0.1043 0.0008 1.0457 0.0099 0.0009
Table 4.4: The numerical values of validation measures for c = 6

Tables 4.3 and 4.4 show that the PC and CE are useless for the hard clustering
methods K-means and K-medoid. Based on the values of the three most used indexes,
the Separation index, Xie and Beni's index and Dunn's index, one can conclude that
the Gath-Geva algorithm gives the best results for c = 4 and the Gustafson-Kessel
algorithm for c = 6. To visualize the clustering results, the visualization methods
described in Section 3.4 can be used. With these methods, the dataset can be reduced
to a 2-dimensional space. To avoid visibility problems (plotting too many values
would produce one big cloud of data points), only 500 values (representing 500
customers) from this 2-dimensional dataset are randomly picked. For the K-means and
K-medoid algorithms, Sammon's mapping gives the best visualization of the results.
For the other cluster algorithms, the Fuzzy Sammon mapping gives the best projection
with respect to the partitions of the data set. These visualization methods are used
for the following plots. Figures 4.6-4.10 show the different clustering results for
c = 4 and c = 6 on the data set.

Figures 4.6 and 4.7 show that the hard clustering methods can find a solution for
the clustering problem. None of the clusters contains substantially more or fewer
customers than the other clusters.

Figure 4.6: Result of K-means algorithm

Figure 4.7: Result of K-medoid algorithm

The plot of the Fuzzy C-means algorithm, in Figure 4.8, shows unexpected results.
For the situation with 4 clusters, only 2 clusters are clearly visible. A detailed
look at the plot shows that there are actually 4 cluster centers, but they are
situated at almost the same location. In the situation with 6 clusters, one can see
three big clusters, with one small cluster inside one of the big clusters; the other
two cluster centers are nearly invisible. This implies that the Fuzzy C-means
algorithm is not able to find good clusters for this data set.

Figure 4.8: Result of Fuzzy C-means algorithm

In Figure 4.9, the results of the Gustafson-Kessel algorithm are plotted. For both
situations, the clusters are well separated. Note that the cluster in the bottom
left corner and the cluster in the top right corner in Figure 4.9 are also
maintained in the situation with 6 clusters. This may indicate that the data points
in these clusters represent customers that differ on multiple fields from the other
customers of Vodafone.

Figure 4.9: Result of Gustafson-Kessel algorithm

The results of the Gath-Geva algorithm, visualized in Figure 4.10, look similar to
the result of the Gustafson-Kessel algorithm for the situation c = 4. The result for
the c = 6 situation is remarkable: here, clusters appear inside other clusters. In
the real high-dimensional situation, the clusters are not subsets of each other, but
are separated. The fact that this happens in the two-dimensional plot indicates that
a clustering with six clusters with the Gath-Geva algorithm is not a good solution.

Figure 4.10: Result of Gath-Geva algorithm

With the results of the validation methods and the visualization of the clustering,
one can conclude that there are two possible best solutions: the Gath-Geva algorithm
for c = 4 and the Gustafson-Kessel algorithm for c = 6. To determine which
partitioning will be used to define the segments, a closer look at the meaning of
the clusters is needed. In the next section, the two different partitions will be
closely compared with each other.

4.3 Designing the segments

To decide which clustering method will be used for the segmentation, one can look at
the distances from the points to each cluster. In Figures 4.11 and 4.12, two box
plots of the distances from the data points to the clusters are shown; the boxes
indicate the upper and lower quartiles. In both situations, the results show that
the clusters are homogeneous. This indicates that, based on the distances to the
clusters, one cannot distinguish between the two cluster algorithms. Another way to
view the differences between the cluster methods is to profile the clusters. For
each cluster, a profile can be made by drawing a line between all normalized feature
values (each feature value is represented on the x-axis) of the customers within
this cluster. The results are visible in Figures 4.13 and 4.14, for the Gath-Geva
algorithm with c = 4 and for the Gustafson-Kessel algorithm with six clusters,
respectively.


Figure 4.11: Distribution of distances from cluster centers within clusters for

the Gath-Geva algorithm with c = 4

Figure 4.12: Distribution of distances from cluster centers within clusters for

the Gustafson-Kessel algorithm with c = 6


The profiles of the different clusters do not differ much in shape. However, in each
cluster, at least one value differs sufficiently from the values of the other
clusters. This confirms the assumption that customers of different clusters indeed
have different usage behavior. Most of the lines in one profile are drawn closely
together, which means that the customers in one profile have similar feature values.

Figure 4.13: Cluster proﬁles for c = 4


Figure 4.14: Cluster proﬁles for c = 6


More relevant plots are shown in Figures 4.15 and 4.16. The mean of all the lines
(equivalent to the cluster center) was calculated and a line between all the
(normalized) feature values was drawn. The differences between the clusters are
visible in some of the feature values. For instance, in the situation with four
clusters, the customers of Cluster 1 have, compared with the other clusters, a high
value at feature 8. Cluster 2 has high values at features 6 and 9, while Cluster 3
contains peaks at features 2 and 12. The fourth and final cluster has high values at
features 8 and 9.

Figure 4.15: Cluster proﬁles of centers for c = 4


Figure 4.16: Cluster proﬁles of centers for c = 6


With the previous clustering results, validation measures and plots, it is not
possible to decide which of the two clustering methods gives the better result.
Therefore, both results will be used as a solution for the customer segmentation.
For the Gath-Geva algorithm with c = 4 and the Gustafson-Kessel algorithm with
c = 6, Table 4.5 shows the result of the customer segmentation.

Feature 1 2 3 4 5 6

Average 119.5 1.7 3.9 65.8 87.0 75.7

Segment 1 (27.2%) 91.3 0.9 2.9 54.8 86.6 58.2

c = 4 Segment 2 (28.7%) 120.1 1.8 3.6 73.6 87.1 93.7

Segment 3 (23.9%) 132.8 2.4 4.4 60.1 86.7 72.1

Segment 4 (20.2%) 133.8 1.7 4.7 74.7 87.6 78.8

Segment 1 (18.1%) 94.7 1.2 2.8 66.3 88.0 72.6

Segment 2 (14.4%) 121.8 1.7 4.1 65.9 86.4 73.0

c = 6 Segment 3 (18.3%) 121.6 2.5 4.9 66.0 84.3 71.5

Segment 4 (17.6%) 126.8 1.6 4.0 65.7 87.3 71.2

Segment 5 (14.8%) 96.8 1.1 3.5 65.2 88.6 92.9

Segment 6 (16.8%) 155.3 2.1 4.1 65.7 87.4 73.0

Feature 7 8 9 10 11 12

Average 1.6 3.7 2.2 14.4 6.9 25.1

Segment 1 (27.2%) 1.7 4.0 1.6 12.3 6.2 12.2

c = 4 Segment 2 (28.7%) 1.2 3.1 2.1 12.8 6.6 30.6

Segment 3 (23.9%) 1.4 3.4 2.1 22.4 9.4 39.7

Segment 4 (20.2%) 2.1 4.3 3.0 10.1 5.4 17.9

Segment 1 (18.1%) 2.3 4.5 1.8 11.3 6.1 13.5

Segment 2 (14.4%) 1.6 3.7 1.9 17.8 9.5 40.4

c = 6 Segment 3 (18.3%) 1.0 2.9 2.9 15.1 6.6 26.9

Segment 4 (17.6%) 1.5 3.6 1.9 15.0 6.2 24.0

Segment 5 (14.8%) 0.8 2.9 1.8 12.4 6.1 23.1

Segment 6 (16.8%) 2.4 4.6 2.9 14.8 6.9 22.7

Table 4.5: Segmentation results

The feature numbers correspond to the feature numbers of Section 2.1.2 (feature 1 is
the call duration, feature 2 the received voice calls, feature 3 the originated
calls, feature 4 the daytime calls, feature 5 the weekday calls, feature 6 the calls
to mobile phones, feature 7 the received sms messages, feature 8 the originated sms
messages, feature 9 the international calls, feature 10 the calls to Vodafone
mobiles, feature 11 the unique area codes and feature 12 the number of different
numbers called). In words, the segments can be described as follows. For the
situation with 4 segments:

• Segment 1: This segment contains customers with a relatively low number of voice
calls. These customers call more in the evening (in proportion) and to fixed lines
than other customers. Their sms usage is higher than normal. The number of
international calls is low.

• Segment 2: This segment contains customers with an average voice call usage. They
often call to mobile phones during daytime. They do not send and receive many sms
messages.

• Segment 3: The customers in this segment make relatively many voice calls. These
customers call many different numbers and have a lot of contacts who are Vodafone
customers.

• Segment 4: These customers originate many voice calls. They also send and receive
many sms messages. They often call during daytime and call more than average to
international numbers. Their call duration is high. Remarkably, they do not have as
many contacts as the number of calls would suggest; they have a relatively small
number of contacts.

For the situation with 6 segments, the customers in these segments can be described
as follows:

• Segment 1: This segment contains customers with a relatively low number of voice
calls. Their average call duration is also lower than average. However, their sms
usage is relatively high. These customers do not call many different numbers.

• Segment 2: This segment contains customers with a relatively high number of
contacts. They also call to many different areas and have more contacts with a
Vodafone mobile.

• Segment 3: The customers in this segment make relatively many voice calls. Their
sms usage is low. In proportion, they make more international phone calls than other
customers.

• Segment 4: These customers are the average customers. None of the feature values
is particularly high or low.

• Segment 5: These customers do not receive many voice calls. Their average call
duration is low. They also receive and originate a low number of sms messages.

• Segment 6: These customers originate and receive many voice calls. They also send
and receive many sms messages. The duration of their voice calls is longer than
average. The percentage of international calls is high.

high.

In the next chapter, the classification method Support Vector Machines will be
explained. This technique will be used to classify/estimate the segment of a
customer from personal information such as age, gender and lifestyle (the customer
data of Section 2.1.3).


Chapter 5

Support Vector Machines

A Support Vector Machine (SVM) is an algorithm that learns by example to assign
labels to objects [16]. In this research, a Support Vector Machine will be used to
recognize the segment of a customer by examining thousands of customers (i.e. the
customer data features of Section 2.1.3) of each segment. In general, a Support
Vector Machine is a mathematical entity: an algorithm for maximizing a particular
mathematical function with respect to a given collection of data. However, the basic
ideas of Support Vector Machines can be explained without any equations. The next
few sections describe the four basic concepts:

• The separating hyperplane

• The maximum-margin hyperplane

• The soft margin

• The kernel function

For now, to allow an easy, geometric interpretation of the data, imagine that there
exist only two segments. In this case the customer data consist of two feature
values, age and income, which can easily be plotted. The green dots represent the
customers that are in segment 1 and the red dots are customers in segment 2. The
goal of the SVM is to learn to tell the difference between the two groups and, given
an unlabeled customer, such as the one labeled 'Unknown' in Figure 5.1, predict
whether it corresponds to segment 1 or segment 2.

5.1 The separating hyperplane

A human being is very good at pattern recognition. Even a quick glance at Figure
5.1a shows that the green dots form a group and the red dots form another group, and
that the two groups can easily be separated by drawing a line between them
(Figure 5.1b). Subsequently, predicting the label of an unknown customer is simple:
one only needs to ask whether the new customer falls on the segment 1 or the
segment 2 side of the separating line.

(a) Two-dimensional representation of the customers
(b) A separating hyperplane
Figure 5.1: Two-dimensional customer data of segment 1 and segment 2

Now, to define the notion of a separating hyperplane, consider the situation where
there are not just two feature values to describe the customer. For example, if
there were just one feature value to describe the customer, then the space in which
the corresponding one-dimensional feature resides is a one-dimensional line. This
line can be divided in half by a single point (see Figure 5.2a). In two dimensions,
a straight line divides the space in half (recall Figure 5.1b). In a
three-dimensional space, a plane is needed to divide the space, as illustrated in
Figure 5.2b. This procedure can be extrapolated mathematically to higher dimensions.
The term for a straight line in a high-dimensional space is a hyperplane, so the
term separating hyperplane is, essentially, the line that separates the segments.

(a) One dimension (b) Three dimensions

Figure 5.2: Separating hyperplanes in diﬀerent dimensions


5.2 The maximum-margin hyperplane

The concept of treating objects as points in a high-dimensional space and finding a
line that separates them is a common approach to classification, and therefore not
unique to the SVM. However, the SVM differs from all other classification methods by
virtue of how the hyperplane is selected. Consider again the classification problem
of Figure 5.1a. The goal of the SVM is to find a line that separates the segment 1
customers from the segment 2 customers. However, there are an infinite number of
possible lines, as portrayed in Figure 5.3a. The question is which line should be
chosen as the optimal classifier and how the optimal line should be defined. A
logical way of selecting the optimal line is to select the line that is, roughly
speaking, 'in the middle': in other words, the line that separates the two segments
and keeps the maximal distance from any of the given customers (see Figure 5.3b).
It is not surprising that a theorem of statistical learning theory supports this
choice [6]. Defining the distance from the hyperplane to the nearest customer (in
general, an expression vector) as the margin of the hyperplane, the SVM selects the
maximum-margin separating hyperplane. By selecting this hyperplane, the SVM is able
to predict the unknown segment of the customer in Figure 5.1a. The vectors (points)
that constrain the width of the margin are the support vectors.

(a) Many possibilities
(b) The maximum-margin hyperplane
Figure 5.3: Demonstration of the maximum-margin hyperplane

This theorem is, in many ways, the key to the success of Support Vector Machines.
However, there are some remarks and caveats to deal with. First of all, the theorem
is based on the fact that the data on which the SVM is trained are drawn from the
same distribution as the data it has to classify. This is of course logical, since
it is not reasonable to expect that a Support Vector Machine trained on customer
data is able to classify different car types. More relevantly, it is not reasonable
to expect that the SVM can classify well if the training data set is prepared with a
different protocol than the test data set. Note that the theorem only requires that
the two data sets be drawn from the same distribution; for example, an SVM does not
assume that the data are drawn from a normal distribution.

5.3 The soft margin

So far, the theory assumed that the data can be separated by a straight line.
However, many real data sets are not cleanly separable by a straight line, for
example the data of Figure 5.4a. In this figure, the data contain an error object.
An intuitive way to deal with such errors is to design the SVM so that it allows a
few anomalous customers to fall on the 'wrong side' of the separating line. This can
be achieved by adding a 'soft margin' to the SVM. The soft margin allows a small
percentage of the data points to push their way through the margin of the separating
hyperplane without affecting the final result. With the soft margin, the data set of
Figure 5.4a is separated in the way illustrated in Figure 5.4b: the anomalous
customer can be seen as an outlier and resides on the same side of the line as the
customers of segment 1. Of course, an SVM should not allow too many
misclassifications.

(a) Data set containing one error
(b) Separating with soft margin
Figure 5.4: Demonstration of the soft margin

Note that with the introduction of the soft margin, a user-specified parameter is
involved. This parameter controls the soft margin: roughly, it controls the number
of customers that are allowed to violate the separating line and determines how far
across the line they are allowed to be. Setting this parameter is a complicated
process, because a larger margin comes at the expense of the number of correct
classifications. In other words, the soft margin specifies a trade-off between
hyperplane violations and the size of the margin.

5.4 The kernel functions

To understand the notion of a kernel function, the example data will be simplified
even further. Assume that, instead of a two-dimensional data set, there is a
one-dimensional data set, as seen before in Figure 5.2a. In that case, the
separating hyperplane was a single point. Now, consider the situation of Figure
5.5a, which illustrates a non-separable data set: no single point can separate the
two segments, and introducing a soft margin would not help. A kernel function
provides a solution to this problem. The kernel function adds an extra dimension to
the data, in this case by squaring the one-dimensional data set. The result is
plotted in Figure 5.5b. Within the new higher-dimensional space, as shown in the
figure, the SVM can separate the data into two segments with one straight line. In
general, the kernel function can be seen as a mathematical trick that allows the SVM
to project data from a low-dimensional space to a space of higher dimension. If one
chooses a good kernel function, the data will become separable in the corresponding
higher dimension.

(a) Non-separable dataset
(b) Separating the previously non-separable dataset
Figure 5.5: Demonstration of kernels

To understand kernels better, some extra examples will be given. Figure 5.6a shows a
two-dimensional data set. With a relatively simple kernel function, this data can be
projected into a four-dimensional space. It is not possible to draw the data in the
four-dimensional space, but when the SVM hyperplane found in the four-dimensional
space is projected back down to the original two-dimensional space, the result is
the curved line shown in Figure 5.6a. It is possible to prove that for any data set
there exists a kernel function that allows the SVM to separate the data linearly in
a higher dimension. Of course, the data set must contain consistent labels, which
means that two identical data points may not have different labels. So, in theory,
the SVM should be a perfect classifier. However, there are some drawbacks to
projecting data into a very high-dimensional space to find the separating
hyperplane. The first problem is the so-called curse of dimensionality: as the
number of variables under consideration increases, the number of possible solutions
also increases, but exponentially. Consequently, it becomes harder for any algorithm
to find a correct solution. Figure 5.6b shows the situation when the data is
projected into a space with too many dimensions. The figure contains the same data
as Figure 5.6a, but the projected hyperplane is found by a very high-dimensional
kernel. This results in boundaries that are too specific to the examples of the data
set. This phenomenon is called overfitting: the SVM will not perform well on new,
unseen, unlabeled data.

(a) Linearly separable in four dimensions (b) An SVM that has overfit the data

Figure 5.6: Examples of separation with kernels

There exists another large practical difficulty when applying the SVM to new unseen
data. This problem concerns the question of how to choose a kernel function that
separates the data without introducing too many irrelevant dimensions.
Unfortunately, the answer to this question is, in most cases, trial and error. In
this research, the SVM is experimented with using a variety of 'standard' kernel
functions. By using the cross-validation method, the optimal kernel is selected in a
statistical way. However, this is a time-consuming process and it is not guaranteed
that the best kernel function found during cross-validation is actually the best
kernel function that exists: it is likely that there exists a kernel function that
was not tested and performs better than the selected one. In practice, the method
described above generally gives sufficient results. In general, the kernel function
is defined by:

K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j),    (5.1)

where x_i are the training vectors. The vectors are mapped into a higher-dimensional
space by the function \Phi. Many kernel mapping functions can be used, probably an
infinite number, but a few kernel functions have been found to work well for a wide
variety of applications [16]. The default and recommended kernel functions were used
during this research and are discussed now.

• Linear: this kernel function is defined by:

K(x_i, x_j) = x_i^T x_j.    (5.2)

• Polynomial: the polynomial kernel of degree d is of the form:

K(x_i, x_j) = (\gamma x_i^T x_j + c_0)^d.    (5.3)


• Radial basis function: also known as the Gaussian kernel, is of the form:

K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2).    (5.4)

• Sigmoid: the sigmoid function, which is also used in neural networks, is defined
by:

K(x_i, x_j) = \tanh(\gamma x_i^T x_j + c_0).    (5.5)

When the sigmoid kernel is used, the SVM can be regarded as a two-layer neural
network.

In this research, the constant c_0 is set to 1. The concept of a kernel mapping
function is very powerful: it allows an SVM to perform separations even with very
complex boundaries, as shown in Figure 5.7.

Figure 5.7: A separation of classes with complex boundaries
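For illustration, the four kernels above can be configured as follows with a
standard SVM library. The use of scikit-learn is an assumption made for this example
(its parameter names C, gamma, degree and coef0 correspond to the soft margin, the
width \gamma, the degree d and the constant c_0), and the specific parameter values
are placeholders rather than the settings selected in this research.

from sklearn.svm import SVC

# The four kernel functions of this section expressed with scikit-learn's SVC.
kernels = {
    "linear":     SVC(kernel="linear", C=10),
    "polynomial": SVC(kernel="poly", degree=3, gamma=0.4, coef0=1, C=2),
    "rbf":        SVC(kernel="rbf", gamma=1.0, C=5),
    "sigmoid":    SVC(kernel="sigmoid", gamma=1.2, coef0=1, C=1),
}

# Each model would be fitted on the customer-profile features and the segment
# labels, e.g. kernels["rbf"].fit(X_train, y_train).score(X_test, y_test).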

5.5 Multi class classiﬁcation

So far, the idea of using a hyperplane to separate the feature vectors into two
groups was described, but only for two target categories. How does an SVM
discriminate between a larger number of classes, in our case 4 or 6 segments?
Several approaches have been proposed, but two methods are the most popular and most
used [16]. The first approach is to train multiple one-versus-all classifiers. For
example, if the SVM has to recognize three classes, A, B and C, one can simply train
three separate SVMs to answer the binary questions "Is it A?", "Is it B?" and "Is it
C?". Another simple approach is one-versus-one, where k(k − 1)/2 models are
constructed, with k the number of classes. In this research, the one-versus-one
technique is used.


Chapter 6

Experiments and results of

classifying the customer

segments

6.1 K-fold cross validation

To avoid overfitting, cross-validation is used to evaluate the fit provided by each
parameter value set tried during the experiments. Figure 6.1 demonstrates how
important the training process is: different parameter values may cause under- or
overfitting.

Figure 6.1: Under fitting and over fitting

For K-fold cross validation, the dataset is divided into three groups: the training
set, the test set and the validation set. The training set is used to train the SVM.
The test set is used to estimate the error during the training of the SVM. With the
validation set, the actual performance of the SVM is measured after the SVM has been
trained. The training of the SVM is stopped when the test error reaches a local
minimum, see Figure 6.2.

Figure 6.2: Determining the stopping point of training the SVM

With K-fold cross validation, a K-fold partition of the data set is created. For
each of K experiments, K − 1 folds are used for training and the remaining one for
testing. Figure 6.3 illustrates this process.

Figure 6.3: A K-fold partition of the dataset

In this research, K is set to 10. The advantage of K-fold cross validation is that
all the examples in the dataset are eventually used for both training and testing.
The error is calculated by taking the average over all K experiments.
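A minimal sketch of the 10-fold cross-validation setup is given below. Scikit-learn
and the placeholder data are assumptions made for this example; the real experiments
use the customer-profile features and the segment labels obtained in Chapter 4.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))          # placeholder for the 6 profile features
y = rng.integers(0, 4, size=500)       # placeholder for the 4 segment labels

svm = SVC(kernel="rbf", gamma=1.0, C=5)
scores = cross_val_score(svm, X, y, cv=10)       # K = 10 folds
print("mean accuracy over the folds:", scores.mean())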

6.2 Parameter setting

In this section, the optimal parameters for the Support Vector Machine are
investigated. Each kernel function is tested for its performance with different
parameter values. The linear kernel function itself has no parameters; the only
parameter that can be varied is the soft-margin value of the Support Vector Machine,
denoted by C. Tables 6.1 and 6.2 summarize the results for the different C-values.

C 1 2 5 10 20 50 100 200 500

42.1% 42.6% 43.0% 43.2% 43.0% 42.4% 41.7% 40.8% 36.1%

Table 6.1: Linear Kernel, 4 segments


C 1 2 5 10 20 50 100 200 500

28.9% 29.4% 30.9% 31.3% 31.4% 32.0% 27.6% 27.6% 21.8%

Table 6.2: Linear Kernel, 6 segments

For the situation with 4 clusters, the optimal value for the soft margin is C = 10,
and with 6 segments it is C = 50. The percentages of correct classifications are,
respectively, 43.2% and 32.0%. The polynomial kernel function has two parameters:
the degree, denoted by d, and the width γ. Therefore, the optimal value of the soft
margin is determined first; this is done by multiple test runs with random values
for d and γ. The average score for each soft-margin value C can be found in Tables
6.3 and 6.4. These C-values are then used to find out which d and γ give the best
results, which are shown in Tables 6.5 and 6.6.

C 1 2 5 10 20 50 100 200 500

73.8% 77.4% 76.6% 74.6% 73.5% 72.8% 70.6% 63.2% 53.7%

Table 6.3: Average C-value for polynomial kernel, 4 segments

C 1 2 5 10 20 50 100 200 500

70.1% 74.4% 75.3% 75.1% 75.0% 75.1% 50.6% 42.7% 26.0%

Table 6.4: Average C-value for polynomial kernel, 6 segments


d 1 2 3 4 5 6 7

γ = 0.4 76.1% 76.3% 78.1% 73.2% 74.8% 76.0% 75.0%

γ = 0.6 76.0% 76.3% 77.6% 74.1% 74.5% 75.4% 75.8%

γ = 0.8 75.8% 76.3% 77.2% 74.0% 74.4% 77.1% 75.2%

γ = 1.0 76.2% 76.4% 78.0% 75.0% 75.2% 75.6% 75.8%

γ = 1.2 76.0% 76.2% 78.1% 74.6% 75.1% 76.0% 75.8%

γ = 1.4 75.2% 76.2% 78.1% 74.9% 75.5% 76.3% 74.9%

Table 6.5: Polynomial kernel, 4 segments

d 1 2 3 4 5 6 7

γ = 0.4 75.0% 74.6% 75.9% 76.0% 75.8% 74.3% 73.9%

γ = 0.6 74.2% 75.1% 74.9% 76.2% 75.0% 74.5% 74.0%

γ = 0.8 73.8% 74.7% 74.3% 76.2% 75.9% 74.8% 73.1%

γ = 1.0 74.1% 75.0% 73.6% 76.1% 75.3% 74.2% 72.8%

γ = 1.2 72.1% 74.1% 75.5% 75.4% 75.4% 74.1% 73.0%

γ = 1.4 73.6% 74.3% 72.2% 76.0% 74.4% 74.3% 72.9%

Table 6.6: Polynomial kernel, 6 segments

For the situation with 4 segments, the optimal score is 78.1% and for 6 segments it
is 76.2%. The next kernel function, the radial basis function, has only one
parameter, namely γ. The results of the radial basis function are given in Tables
6.7 and 6.8.

C 1 2 5 10 20 50 100 200 500

γ = 0.4 80.0 79.0 76.6 78.3 76.4 73.3 60.2 52.4 37.5

γ = 0.6 80.1 80.3 77.7 79.0 79.9 72.8 63.6 59.6 27.5

γ = 0.8 79.3 79.5 78.2 80.2 78.4 69.3 59.3 51.4 29.7

γ = 1.0 78.4 78.2 80.3 78.5 76.9 66.2 61.7 47.9 30.6

γ = 1.2 79.6 79.9 79.8 80.2 80.1 69.0 61.3 45.5 26.3

γ = 1.4 77.4 76.9 76.5 79.4 77.7 71.4 61.3 41.2 26.0

Table 6.7: Radial basis function, 4 segments

C 1 2 5 10 20 50 100 200 500

γ = 0.4 73.6 77.4 72.6 70.9 68.0 65.1 52.7 51.8 40.0

γ = 0.6 72.5 74.8 74.8 72.7 73.0 70.4 54.0 49.3 39.1

γ = 0.8 74.1 76.6 80.3 80.0 68.4 60.5 55.5 54.1 40.9

γ = 1.0 70.7 72.9 73.8 70.9 66.1 64.7 52.2 48.5 34.2

γ = 1.2 72.6 73.5 73.4 73.1 71.9 74.6 64.8 60.0 38.3

γ = 1.4 69.4 68.5 70.7 69.1 68.0 68.5 54.4 52.4 31.0

Table 6.8: Radial basis function, 6 segments

The best result with 4 segments is 80.3%; with 6 segments the best score is 78.5%.
The sigmoid function also has only one parameter. The results are given in Tables
6.9 and 6.10.

C 1 2 5 10 20 50 100 200 500

γ = 0.4 58.2 53.0 57.7 58.2 56.1 57.9 30.3 47.5 38.9

γ = 0.6 47.6 56.1 55.5 46.0 58.3 44.1 30.6 30.7 34.5

γ = 0.8 52.1 60.5 54.6 57.9 58.6 44.7 43.2 44.3 38.7

γ = 1.0 51.4 57.3 52.0 50.7 50.2 48.6 44.7 42.2 40.0

γ = 1.2 66.1 64.8 61.3 62.8 59.6 57.1 46.5 44.0 42.0

γ = 1.4 63.2 61.4 59.7 65.0 53.8 51.1 52.2 47.6 41.4

Table 6.9: Sigmoid function, 4 segments

The results show that 66.1% and 44.6% of the data are classified correctly by the
sigmoid function, for 4 and 6 segments respectively. This means that the radial
basis function has the best score for both situations, with 80.3% and 78.5%.
Remarkably, the difference between the two situations is small, even though there
are two extra clusters. The confusion matrices for both situations, Tables 6.11 and
6.12, show that there are two clusters which can easily be classified with the
customer profile. These correspond to the cluster in the top right corner and the
cluster at the bottom of Figures 4.9 and 4.10.


C 1 2 5 10 20 50 100 200 500

γ = 0.4 33.8 34.0 34.7 33.1 34.6 30.0 32.6 28.8 28.8

γ = 0.6 29.6 27.4 28.5 29.7 21.4 20.8 20.0 18.8 18.1

γ = 0.8 39.1 36.4 33.6 35.7 38.9 32.0 26.4 24.6 22.9

γ = 1.0 40.0 42.5 39.8 40.7 39.9 39.8 30.4 31.1 28.0

γ = 1.2 41.9 40.6 43.6 43.2 44.1 43.2 44.6 40.6 41.7

γ = 1.4 38.6 34.5 32.1 30.6 30.2 27.5 24.3 26.3 27.9

Table 6.10: Sigmoid function, 6 segments

Predicted → Segment 1 Segment 2 Segment 3 Segment 4

Actual ↓

Segment 1 97.1% 0.5% 1.9% 0.5%

Segment 2 3.6% 76.6% 7.8% 12.0%

Segment 3 2.2% 0.8% 96.3% 0.7%

Segment 4 7.1% 13.0% 6.9% 73.0%

Table 6.11: Confusion matrix, 4 segments

Predicted → Segm. 1 Segm. 2 Segm. 3 Segm. 4 Segm. 5 Segm. 6

Actual ↓

Segment 1 74.1% 1.1% 10.1% 8.4% 0.6% 5.7%

Segment 2 0.2% 94.5% 0.6% 1.4% 1.2% 2.1%

Segment 3 5.6% 4.7% 71.2% 9.1% 2.1% 7.3%

Segment 4 12.3% 4.1% 3.9% 68.9% 6.8% 4.0%

Segment 5 2.0% 0.6% 0.7% 1.3% 92.6% 2.8%

Segment 6 12.5% 2.4% 3.7% 10.4% 1.3% 69.7%

Table 6.12: Confusion matrix, 6 segments
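The parameter search of this chapter can be expressed as a cross-validated grid
search. The sketch below is illustrative only (scikit-learn is an assumption), with
parameter grids that mirror the values reported in the tables above rather than the
exact search procedure used in this research.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate kernels and parameter grids, following Tables 6.1-6.10.
param_grid = [
    {"kernel": ["linear"], "C": [1, 2, 5, 10, 20, 50, 100, 200, 500]},
    {"kernel": ["poly"], "C": [2], "degree": list(range(1, 8)),
     "gamma": [0.4, 0.6, 0.8, 1.0, 1.2, 1.4], "coef0": [1]},
    {"kernel": ["rbf"], "C": [1, 2, 5, 10, 20, 50, 100, 200, 500],
     "gamma": [0.4, 0.6, 0.8, 1.0, 1.2, 1.4]},
    {"kernel": ["sigmoid"], "C": [1, 2, 5, 10, 20, 50, 100, 200, 500],
     "gamma": [0.4, 0.6, 0.8, 1.0, 1.2, 1.4], "coef0": [1]},
]
search = GridSearchCV(SVC(), param_grid, cv=10)
# search.fit(X, y) would then report search.best_params_ and search.best_score_.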


6.3 Feature Validation

In this section, the features are validated and the importance of each feature is
measured. This is done by leaving one feature out of the feature vector and training
the SVM without this feature. The results for both situations are shown in Figures
6.4 and 6.5.

Figure 6.4: Results while leaving out one of the features with 4 segments

Figure 6.5: Results while leaving out one of the features with 6 segments

The results show that age is an important feature for classifying the right segment.
This is in contrast with the type of telephone, which increases the result by only
tenths of a percent. Each feature increases the result, and therefore each feature
is useful for the classification.
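The feature-validation procedure can be sketched as follows, again under the
assumption that scikit-learn is used; the helper function and the feature names in
the example call are hypothetical and serve only to illustrate the leave-one-out
idea.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def leave_one_feature_out(X, y, feature_names, gamma=1.0, C=5):
    """Cross-validated accuracy with each feature left out in turn."""
    scores = {}
    for j, name in enumerate(feature_names):
        X_reduced = np.delete(X, j, axis=1)            # drop feature j
        svm = SVC(kernel="rbf", gamma=gamma, C=C)
        scores[name] = cross_val_score(svm, X_reduced, y, cv=10).mean()
    return scores

# Example call with the profile features used in this research:
# leave_one_feature_out(X, y, ["age", "gender", "telephone type",
#                              "subscription type", "company size", "area"])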


Chapter 7

Conclusions and discussion

This chapter concludes the research and the corresponding results, and gives some
recommendations for future work.

7.1 Conclusions

The ﬁrst objective of our research was to perform automatic customer segmen-

tation based on usage behavior, without the direct intervention of a human

specialist. The second part of the research was focused on proﬁling customers

and ﬁnding a relation between the proﬁle and the segments. The customer

segments were constructed by applying several clustering algorithms. The clus-

tering algorithms used selected and preprocessed data from the Vodafone data

warehouse. This led to solutions for the customer segmentation with respec-

tively four segments and six segments. The customer’s proﬁle was based on

personal information of the customers. A novel data mining technique, called Support
Vector Machines, was used to estimate the segment of a customer based on his
profile.

There are various ways of selecting suitable feature values for the clustering algorithms. This selection is vital for the resulting quality of the clustering: a different choice of feature values will result in different segments. The result of the clustering can therefore not be regarded as universally valid, but merely as one possible outcome. In this research, the feature values were selected in such a way that they describe the customer's behavior as completely as possible. However, it is not possible to include all possible combinations of usage behavior characteristics within the scope of this research. To find the optimal number of clusters, the so-called elbow criterion was applied. Unfortunately, this criterion could not always be unambiguously identified. Another problem was that the location of the elbow could differ between the validation measures for the same algorithm. For some algorithms, the elbow was located at c = 4 and for other algorithms at c = 6. To identify the best algorithm, several widely established validation measures were employed, but not every validation measure marked the same algorithm as the best one. It was therefore not possible to determine a single algorithm that was optimal for both c = 4 and c = 6. For the situation with four clusters, the Gath-Geva algorithm appears to be the best algorithm, while the Gustafson-Kessel algorithm gives the best results for six clusters. To determine which customer segmentation algorithm is best suited for a particular data set and a specific parameter setting, the clustering results were interpreted in a profiling format. The results show that in both situations the clusters were well separated and clearly distinguished from each other. It is hard to compare the two clustering results directly, because of the different numbers of clusters. Therefore, both clustering results were used as a starting point for the segmentation algorithm. The corresponding segments differ on features such as the number of voice calls, SMS usage, call duration, international calls, the number of different numbers called and the percentage of weekday and daytime calls. A short characterization of each cluster was made.
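As an illustration of the elbow criterion mentioned above, the within-cluster sum of squares can be computed for an increasing number of clusters; the "elbow" is the point after which adding clusters no longer gives a clear improvement. The sketch below uses K-means on synthetic data; Python with scikit-learn is an assumption made only for illustration, since in this research the criterion was applied to several validation measures and algorithms.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the usage-behavior feature matrix, with 4 underlying groups.
X, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=1)

# Within-cluster sum of squares (inertia) for c = 2..8 clusters.
for c in range(2, 9):
    km = KMeans(n_clusters=c, n_init=10, random_state=1).fit(X)
    print(f"c = {c}: within-cluster sum of squares = {km.inertia_:.1f}")

# The value drops sharply up to the true number of clusters and then flattens;
# the bend ("elbow") suggests the number of clusters to use.
```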

A Support Vector Machine algorithm was used to classify the segment of a customer, based on the customer's profile. The profile consists of the age, gender, telephone type, subscription type, company size, and residential area of the customer. As a comparison, four different kernel functions with different parameters were tested on their performance. It was found that the radial basis function gives the best result, with a correct classification of 80.3% for the situation with four segments and 78.5% for the situation with six segments. The resulting percentage of correctly classified segments was not as high as expected. A possible explanation could be that the available features are not adequate for making a customer's profile, caused by the frequently missing data in the Vodafone data warehouse about lifestyle, habits and income of the customers. A second reason for the relatively low number of correct classifications is the fact that the usage behavior in the database corresponds to a telephone number and this telephone number corresponds to a person. In real life, however, this telephone may not be used exclusively by the person (and the corresponding customer's profile) stored in the database. Customers may lend their telephone to relatives, and companies may exchange telephones among their employees. In such cases, the usage behavior does not correspond to a single customer's profile, which impairs the classification process.
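The kernel comparison summarized above can be sketched as a grid search over kernel functions and their parameters, scoring each combination by cross-validated accuracy. The data set and library (scikit-learn) below are assumptions made for illustration, and the parameter grids only roughly follow the ones examined in Chapter 6.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the profile features and the four-segment labels.
X, y = make_classification(n_samples=800, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)

# One parameter grid per kernel family.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1.0, 10.0], "C": [0.1, 1, 10]},
    {"kernel": ["sigmoid"], "gamma": [1.0, 1.2, 1.4], "C": [0.1, 1, 10]},
]

# Five-fold cross-validation; the best kernel/parameter combination is reported.
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
```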

The last part of the research involved the relative importance of each individual feature of the customer's profile. By leaving out one feature value during classification, the effect of each feature value became visible. It was found that without the feature 'customer age', the resulting quality of the classification decreased significantly. On the other hand, leaving out a feature such as the 'telephone type' barely decreased the classification result. However, this and the other features still increase the performance of the classification. This implies that even these features bear some importance for the customer profiling and the classification of the customer's segment.


7.2 Recommendations for future work

Based on our research and experiments, it is possible to formulate some recommendations for obtaining more suitable customer profiling and segmentation. The first recommendation is to use different feature values for the customer segmentation. This can lead to different clusters and thus different segments. To know the influence of the feature values on the outcome of the clustering, a complete data analysis study is required. Also, a detailed analysis of the meaning of each cluster is recommended. In this research, the results are given by a short description of each segment. By extending this approach, a more detailed view of the clusters and their boundaries can be obtained. Another way to validate the resulting clusters is to offer them to a human expert, and use his feedback for improving the clustering criteria.

To improve on determining the actual number of clusters present in the data set, more specialized methods than the elbow criterion could be applied. An interesting alternative is, for instance, the application of evolutionary algorithms, as proposed by Wei Lu [21]. Another way of improving this research is to extend the set of clustering algorithms with, for example, mean shift clustering, hierarchical clustering or mixtures of Gaussians. To estimate the segment of the customer, other classification methods can also be used, for instance neural networks, genetic algorithms or Bayesian methods. Of specific interest, within the framework of Support Vector Machines, is the application of miscellaneous (non-linear) kernel functions.
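As an indication of how one of these alternatives could be tried, the sketch below fits a mixture of Gaussians to a feature matrix of the same shape and reads off hard segment assignments; mean shift or hierarchical clustering could be substituted in the same way. scikit-learn and synthetic data are assumptions made for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the preprocessed usage-behavior features.
X, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=2)

# Fit a mixture of Gaussians with 4 components and assign every customer
# to the component with the highest posterior probability.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=2).fit(X)
segments = gmm.predict(X)
print("customers per segment:", [int((segments == s).sum()) for s in range(4)])
```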

Furthermore, it should be noted that the most obvious and best way to improve the classification is to come to a more accurate and precise definition of the customer profiles. The customer profile used in this research is not sufficiently detailed to describe the wide spectrum of customers. One reason for this is the missing data in the Vodafone data warehouse. Consequently, an enhanced and more precise analysis of the data warehouse will lead to improved features and, thus, to an improved classification.

Finally, we note that the study would improve noticeably by involving multiple criteria to evaluate the user behavior, rather than the mere phone usage employed here. Similarly, it is challenging to classify the profile of the customer based on the corresponding segment alone. However, this is a complex task and it essentially requires the availability of high-quality features.


Bibliography

[1] Ahola, J. and Rinta-Runsala E., Data mining case studies in customer proﬁling.

Research report TTE1-2001-29, VTT Information Technology (2001).

[2] Amat, J.L., Using reporting and data mining techniques to improve knowledge of

subscribers; applications to customer proﬁling and fraud management. J. Telecom-

mun. Inform. Technol., no. 3 (2002), pp. 11-16.

[3] Balasko, B., Abonyi, J. and Balazs, F., Fuzzy Clustering and Data Analysis Tool-

box For Use with Matlab. (2006).

[4] Bounsaythip, C. and Rinta-Runsala, E., Overview of Data Mining for Customer

Behavior Modeling. Research report TTE1-2001-18, VTT Information Technol-

ogy (2001).

[5] Bezdek, J.C. and Dunn, J.C., Optimal fuzzy partition: A heuristic for estimating

the parameters in a mixture of normal distributions. IEEE Trans. Comput., vol.

C-24 (1975), pp. 835-838.

[6] Dibike, Y.B., Velickov, S., Solomatine D. and Abbott, M.B., Model Induction

with Support Vector Machines: Introduction and Applications. J. Comp. in Civ.

Engrg., vol. 15 iss. 3 (2001), pp. 208-216.

[7] Feldman, R. and Dagan, I., Knowledge discovery in textual databases (KDT). In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining, (1995), pp. 112-117.

[8] Frawley, W.J., Piatetsky-Shapiro, G. and Matheus, C.J., Knowledge discovery in

databases, AAAI/MIT Press (1991), pp. 1-27.

[9] Gath, I. and Geva, A.B., Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Machine Intell., vol. 11 no. 7 (1989), pp. 773-781.

[10] Giha, F.E., Singh, Y.P. and Ewe, H.T., Customer Proﬁling and Segmentation

based on Association Rule Mining Technique. Proc. Softw. Engin. and Appl., no.

397 (2003).

[11] Gustafson, D.E. and Kessel, W.E., Fuzzy clustering with a fuzzy covariance ma-

trix. In Proc. IEEE CDC, (1979), pp. 761-766.

[12] Janusz, G., Data mining and complex telecommunications problems modeling. J.

Telecommun. Inform. Technol., no. 3 (2003), pp. 115-120.


[13] Mali, K., Clustering and its validation in a symbolic framework. Patt. Recogn.

Lett., vol. 24 (2003), pp. 2367-2376.

[14] Mattison, R., Data Warehousing and Data Mining for Telecommunications.

Boston, London: Artech House, (1997).

[15] McDonald, M. and Dunbar, I., Market segmentation. How to do it, how to proﬁt

from it. Palgrave Publ., (1998).

[16] Noble, W.S., What is a support vector machine? Nature Biotechnology, vol. 24

no. 12 (2006), pp. 1565-1567.

[17] Shaw, M.J., Subramaniam, C., Tan, G.W. and Welge, M.E., Knowledge manage-

ment and data mining for marketing. Decision Support Systems, vol. 31 (2001),

pp. 127-137.

[18] Verhoef, P., Spring, P., Hoekstra, J. and Lee, P., The commercial use of segmenta-

tion and predictive modeling techniques for database marketing in the Netherlands.

Decis. Supp. Syst., vol. 34 (2002), pp. 471-481.

[19] Virvou, M., Savvopoulos, A. Tsihrintzis, G.A. and Sotiropoulos, D.N., Construct-

ing Stereotypes for an Adaptive e-Shop Using AIN-Based Clustering. ICANNGA

(2007), pp. 837-845.

[20] Wei, C.P. and Chiu, I.T., Turning telecommunications call detail to churn pre-

diction: a data mining approach. Expert Syst. Appl., vol. 23 (2002), pp. 103-112.

[21] Lu, W. and Traore, I., A New Evolutionary Algorithm for Determining the Optimal Num-

ber of Clusters. CIMCA/IAWTIC (2005), pp. 648-653.

[22] Weiss, G.M., Data Mining in Telecommunications. The Data Mining and Knowl-

edge Discovery Handbook (2005), pp. 1189-1201.


Appendix A

Model of data warehouse

In this Appendix, a simplified model of the data warehouse can be found. The white rectangles correspond to the tables that were used for this research; the most important data fields of these tables are listed inside them. The colored boxes group the tables into categories. To connect the tables with each other, the relation tables (the red tables in the middle) are needed.


Figure A.1: Model of the Vodafone data warehouse


Appendix B

Extra results for optimal

number of clusters

In this Appendix, the plots of the validation measures are given for the algorithms that were not discussed in Section 4.1.

The K-medoid algorithm:

Figure B.1: Partition index and Separation index of K-medoid


Figure B.2: Dunn’s index and Alternative Dunn’s index of K-medoid

The Fuzzy-C-means algorithm:

Figure B.3: Partition coeﬃcient and Classiﬁcation Entropy of Fuzzy C-means


Figure B.4: Partition index, Separation index and Xie Beni index of Fuzzy

C-means

Figure B.5: Dunn’s index and Alternative Dunn’s index of Fuzzy C-means


Acknowledgments

This Master thesis was written to complete the study Operations Research at the University of Maastricht (UM). The research took place at the Department of Mathematics of UM and at the Department of Information Management of Vodafone Maastricht. During this research, I had the privilege to work together with several people. I would like to express my gratitude to all those people for giving me the support to complete this thesis. I want to thank the Department of Information Management for giving me permission to commence this thesis in the ﬁrst instance, to do the necessary research work and to use departmental data. I am deeply indebted to my supervisor Dr. Ronald Westra, whose help, stimulating suggestions and encouragement helped me in all the time of research for and writing of this thesis. Furthermore, I would like to give my special thanks to my second supervisor Dr. Ralf Peeters, whose patience and enthusiasm enabled me to complete this work. I have also to thank my thesis instructor, Drs. Annette Schade, for her stimulating support and encouraging me to go ahead with my thesis. My former colleagues from the Department of Information Management supported me in my research work. I want to thank them for all their help, support, interest and valuable hints. Especially I am obliged to Drs. Philippe Theunen and Laurens Alberts, MSc. Finally, I would like to thank the people, who looked closely at the ﬁnal version of the thesis for English style and grammar, correcting both and oﬀering suggestions for improvement.

1

Contents

1 Introduction 1.1 Customer segmentation and customer proﬁling 1.1.1 Customer segmentation . . . . . . . . . 1.1.2 Customer proﬁling . . . . . . . . . . . . 1.2 Data mining . . . . . . . . . . . . . . . . . . . . 1.3 Structure of the report . . . . . . . . . . . . . . 2 Data collection and preparation 2.1 Data warehouse . . . . . . . . . 2.1.1 Selecting the customers 2.1.2 Call detail data . . . . . 2.1.3 Customer data . . . . . 2.2 Data preparation . . . . . . . . 8 9 9 10 11 13 14 14 14 15 19 20 22 22 23 23 24 27 27 28 28 29 30 31 33 33 34 35

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

3 Clustering 3.1 Cluster analysis . . . . . . . . . . . . . . 3.1.1 The data . . . . . . . . . . . . . 3.1.2 The clusters . . . . . . . . . . . . 3.1.3 Cluster partition . . . . . . . . . 3.2 Cluster algorithms . . . . . . . . . . . . 3.2.1 K-means . . . . . . . . . . . . . . 3.2.2 K-medoid . . . . . . . . . . . . . 3.2.3 Fuzzy C-means . . . . . . . . . . 3.2.4 The Gustafson-Kessel algorithm 3.2.5 The Gath Geva algorithm . . . . 3.3 Validation . . . . . . . . . . . . . . . . . 3.4 Visualization . . . . . . . . . . . . . . . 3.4.1 Principal Component Analysis . 3.4.2 Sammon mapping . . . . . . . . 3.4.3 Fuzzy Sammon mapping . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

4 Experiments and results of customer segmentation 37 4.1 Determining the optimal number of clusters . . . . . . . . . . . . 37 4.2 Comparing the clustering algorithms . . . . . . . . . . . . . . . . 42

2

61 . . 5. . . . . . . . .2 Recommendations for future work . . . . . . . . . . . . . . . . . .1 The separating hyperplane . . . . . . . 6. . . 60 . . . . . . . . . . . .3 Feature Validation . . . 66 7. . . . . . . . . . . . . . . . . . . 65 7 Conclusions and discussion 66 7. . . . . . . . . . 68 Bibliography A Model of data warehouse B Extra results for optimal number of clusters 68 71 73 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4. . . . . 45 53 53 55 56 56 59 5 Support Vector Machines 5. . . . . . . . . . . . . .2 Parameter setting . . . . . . . . . . . . . of classifying the customer segments 60 . . . . . . . .5 Multi class classiﬁcation . . . .1 Conclusions . . . . . . . . . . . . . . . . . . . 6 Experiments and results 6. . . . 5. . .1 K-fold cross validation 6. . . . . . . . . . . . .2 The maximum-margin hyperplane 5. . . . . . . . . . . . .3 Designing the segments .4 The kernel functions . . . . . . . . . . . . . . 5. .3 The soft margin . . . . . . . . .

. . .16 5. . . . . . . Relation between daytime and weekday calls . Result of Gustafson-Kessel algorithm . . . . . . . . . . . . . . . .1 A taxonomy of data mining tasks . . . . . . . . . . . . . Diﬀerent cluster shapes in R2 . . . . Values of Dunn’s Index and Alternative Dunn Index with GustafsonKessel clustering . Structure of customers by Vodafone .4 4. . . . Distribution of distances from cluster centers within clusters for the Gath-Geva algorithm with c = 4 . . . . . . . . .6 4.1 2. . . Cluster proﬁles for c = 6 . . . . . . . . . . . . . . . . . . . . . . Result of K-means algorithm .5 3.14 4. . . . . . . .15 4. . . . . . Separation Index and the Xie Beni Index Values of Dunn’s Index and the Alternative Dunn Index .7 4. Result of Fuzzy C-means algorithm . . . . . . Result of Gath-Geva algorithm . . . . . . . .4 2. . . . . . . . . . . . . . . . . . . . . . . . . .List of Figures 1. . . . . . . . .3 4. . . . .1 4. . . . . . .1 3. . . . . . . . . . . . . . Two-dimensional customer data of segment 1 and segment 2 . 4 . . . . . . Histograms of feature values . . . . . . . . . . . . . . . .5 4. . . . . . . . . . . . . . . . . . . .1 2. . . . . . . . .11 4.13 4. . . . . . . . . . . . . . . . Relation between originated and received calls . . . . . . Cluster proﬁles of centers for c = 6 . . . .9 4. . .2 3. . . . . . . . . . . Hard and fuzzy clustering . Cluster proﬁles for c = 4 . . . . . . . . . . . . . . Values of Partition Index. . .10 4. . . . . Values of Partition Index.3 2. . . . . Distribution of distances from cluster centers within clusters for the Gustafson-Kessel algorithm with c = 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Values of Partition coeﬃcient and Classiﬁcation Entropy with Gustafson-Kessel clustering . . . . . . . .2 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Visualization of phone calls per hour . . . Cluster proﬁles of centers for c = 4 . . . . . . . 12 15 17 18 18 19 22 24 25 38 39 40 41 41 43 44 44 44 45 46 46 47 48 49 50 54 Example of clustering data . Result of K-medoid algorithm . . . . . . . . . . . . . . . . . . . . . . . .8 4. . . . . . . . . . . . . . . . . . .12 4. . . . . . . . Separation Index and the Xie Beni Index with Gustafson-Kessel clustering . . . . . . .3 4. .2 2. . . . . . . .

. . . . .1 Model of the Vodafone data warehouse . . . . . . . . . . . .6 5. . . .2 B. . . .4 5. . . . . . . . . . .7 6. . . . . . . . . . . . . . . . Results while leaving out one of the features with 4 segments Results while leaving out one of the features with 6 segments A. . . . . A K-fold partition of the dataset . . . . . . Partition coeﬃcient and Classiﬁcation Entropy of Fuzzy C-means Partition index.1 6. . .5 Separating hyperplanes in diﬀerent dimensions . . . . . . . . . . .3 B. . . . . . . .5 5. . . . . . . . . . . . . . . . . . . . B. Demonstration of the maximum-margin hyperplane Demonstration of the soft margin .5 Dunn’s index and Alternative Dunn’s index of Fuzzy C-means . . . . . . . . . . . . B. . . . . . . . . 5 .2 5. . Demonstration of kernels .3 5. . . . . . .5. . . . . . . Dunn’s index and Alternative Dunn’s index of K-medoid . . . . . . . . Determining the stopping point of training the SVM . . . . . . . . . . . . . . . . A separation of classes with complex boundaries . . . . . . Separation index and Xie Beni index of Fuzzy C-means . . . 54 55 56 57 58 59 60 61 61 65 65 72 73 74 74 75 75 Under ﬁtting and over ﬁtting .4 Partition index and Separation index of K-medoid . . .3 6. . . . . . . . Examples of separation with kernels . . . . . . . .1 B. . . . .2 6. . . . . .4 6. . .

. . Confusion matrix. . . . . . . . . .5 6. . . Average C-value for polynomial kernel.4 6. . .10 6. . . .12 Proportions within the diﬀerent classiﬁcation groups . . . . . . . . . .1 4. . . . . .9 6. . . . Polynomial kernel. . . . . . . Average C-value for polynomial kernel. . . . . . . . . . Radial basis function. . . . . Sigmoid function. . . . . . . 4 segments . . . . . . 20 39 42 42 43 51 61 62 62 62 62 62 63 63 63 64 64 64 6 . Sigmoid function. . . . . . . . . . 6 segments . . . . . . . . . . The values of all the validation measures with K-means clustering The values of all the validation measures with Gustafson-Kessel clustering . . 4 segments . . . . . . . . . . . . . . . . .11 6. .5 6. . 4 segments . . . . . . . . . . . . 6 segments . . . . . . . . . . . Confusion matrix. . . . . . . The numerical values of validation measures for c = 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . segments segments .2 4. . . . . 4 segments .6 6. . . . . 6 segments . . . . . . Radial basis function.1 6. Linear Kernel. The numerical values of validation measures for c = 6 .1 4. .2 6. . . Polynomial kernel. .8 6. . . . . . . 4 6 . . . . . . . .3 4. . . . . . . . . . . . . . . . . . . . . .7 6. . . . . . . . . . . . . Linear Kernel. . . .4 4. . Segmentation results . . . . . . . . . . . . . . . . . . . . . . .List of Tables 2. . . . . . . . . . . . . 4 segments . . . . 6 segments . . . . . .3 6. . . . . . . . . 6 segments . . .

In this research. In our context.3% of the cases to classify the segment of a customer based on its proﬁle for the situation with four segments. by means of advanced data mining techniques. tastes. ’Customer proﬁling’ is describing customers by their attributes. Finally. without direct knowledge of human experts. Customer proﬁling can be accomplished with information from the data warehouse. gender and residential area information. i. automatic analysis is essential. With six segments. Having these two components. ’customer segmentation’ is a term used to describe the process of dividing customers into homogeneous groups on the basis of shared or common attributes (habits. the segment of a customer will be estimated based on the customers proﬁle. gender. with a recent data mining technique. The magnitude of this data is so huge that manual analysis of data is not feasible. The customer segmentation will lead to two solutions. Each segment will be described and analyzed. An optimality criterion is constructed in order to measure their performance. etc). Diﬀerent kernel functions with different parameters will be examined and analyzed. has accumulated vast amounts of data on consumer mobile phone behavior in a data warehouse. the customer segmentation is based on usage call behavior. A number of advanced and state-of-the-art clustering algorithms are modiﬁed and applied for creating customer segments. this data holds valuable information that can be applied for operational and strategical purposes.e. One solution with four segments and one solution with six segments. managers can decide which marketing actions to take for each segment. These data mining techniques search and analyze the data in order to ﬁnd implicit and useful information. a correct classiﬁcation of 78.5% is obtained. in order to extract such information from this data. such as age.e. the behavior of a customer measured in the amounts of incoming or outgoing communication of whichever form. called Support Vector Machines.Abstract Vodafone. Therefore. This research will address the question how to perform customer segmentation and customer proﬁling with data mining techniques. However. This thesis describes the process of selecting and preparing the accurate data from the data warehouse. income and lifestyles. most optimal in the sense of the optimality criterion clustering technique will be used to perform customer segmentation. 7 . With the Support Vector Machine approach it is possible in 80. such as age. in order to perform customer segmentation and to proﬁle the customer. The best i. an International mobile telecommunications company.

These automated systems perform important functions such as identifying network faults and detecting fraudulent phone calls. Call detail data gives a description of the calls that traverse the telecommunication networks. such as age. and in many cases. call detail data. marketers can decide which marketing actions to take for each segment and then allocate scarce resources to segments in order to meet speciﬁc business objectives. but potentially useful.1 million customers in The Netherlands. with approximately 4. Solutions to these problems were promised by data mining techniques. The need to handle such large volumes of data led to the development of knowledge-based expert systems [17. From all these customers a tremendous amount of data is stored.Chapter 1 Introduction Vodafone is world’s leading mobile telecommunications company. while the network data gives a description of the state of the hardware and software components in the network. tastes. fraud detection. 22]. etc) [10]. Examples of main problems for marketing and sales departments of telecommunication operators are churn prediction. income and lifestyles [1. A basic way to perform customer segmentation is to deﬁne segmentations in 8 . 10]. among others. Having these two components. namely customer segmentation and customer proﬁling and the relation between them. The customer data contains information of the telecommunication customers. Vodafone is interested in a complete diﬀerent issue. many data mining tasks can be distinguished. the experts do not have the requisite knowledge [2]. network data and customer data. Customer segmentation is a term used to describe the process of dividing customers into homogeneous groups on the basis of shared or common attributes (habits. identifying trends in customer behavior and cross selling and up-selling. Within the telecommunication branch. Customer proﬁling is describing customers by their attributes. if not impossible [22]. gender. information [12]. These data include. Obtaining knowledge from human experts is a time consuming process. The amount of data is so great that manual analysis of data is diﬃcult. Data mining is the process of searching and analyzing data in order to ﬁnd implicit. A disadvantage of this approach is that it is based on knowledge from human experts.

diﬀerent settings of the Support Vector Machines will be examined and the best working estimation model will be used. validated and compared to each other.1 Customer segmentation and customer proﬁling To compete with other providers of mobile telecommunications it is important to know enough about your customers and to know the wants and needs of your customers [15]. a data mining technique called Support Vector Machines (SVM) will be used. A Support Vector machine is able to estimate the segment of a customer by personal information. Once the segmentations are obtained. 1. The construction of user 9 . To realize this. To realize this. marketers are more eﬀective in channeling resources and discovering opportunities. To ﬁnd a relation between the proﬁle and the segments. tested. and dividing the customers over these segmentations by their best ﬁts. Depending on data available. This research will deal with the problem of making customer segmentations without knowledge of an expert and without deﬁning the segmentations in advance. gender and lifestyle. for each customer a proﬁle will be determined with the customer data. Customer segmentation is a preparation step for classifying each customer according to the customer groups that have been deﬁned. diﬀerent data mining techniques. In this research.advance with knowledge of an expert. such as age. Segmenting means putting the population in to segments according to their aﬃnity or similar characteristics. it can be used to prospect new customers or to recognize existing bad customers.1. Proﬁling is performed after customer segmentation. 1. Another key beneﬁt of utilizing the customer proﬁle is making eﬀective marketing strategies. the principals of the clustering techniques will be described and the process of determining the best technique will be discussed.1 Customer segmentation Segmentation is a way to have more targeted communication with the customers. Customer proﬁling is a way of applying external data to a population of possible customers. Customer proﬁling is done by building a customer’s behavior model and estimating its parameters. called clustering techniques. The goal is to predict behavior based on the information we have on each customer [18]. Segmentation is essential to cope with today’s dynamically fragmenting consumer marketplace. The segmentations will be determined based on (call) usage behavior. The process of segmentation describes the characteristics of the customer groups (called segments or clusters) within the data. will be developed. In this report. the segment can be estimated and the usage behavior of the customer proﬁle can be determined. By using segmentation. Based on the combination of the personal information (the customer proﬁle). it is needed to divide customers in segments and to proﬁle the customers.

thereby necessitating revision and reclassiﬁcation of customers. In addition. eﬀective segmentation strategies will inﬂuence the behavior of the customers aﬀected by them. In this report.1. for each proﬁle. an estimation of the usage behavior can be obtained. One solution to construct segments can be provided by data mining methods that belong to the category of clustering algorithms. More directly. This data is used to ﬁnd a relation with the customer segmentations that were constructed before. On the other hand. Moreover. In particular. segmentation would require almost a daily update. Alternatively. diﬀerent source systems) makes it also diﬃcult to extract interesting information. such as demographic data purchased from various sources. several clustering algorithms will be discussed and compared to each other. the meaning of a customer segmentation in unreliable and almost worthless. 1. Customer proﬁling is also used to prospect new customers using external sources. Poorly organize data (diﬀerent formats. in an e-commerce environment where feedback is almost immediate. apparently eﬀective variables may not be identiﬁable. This is done by assembling collected information on the customer such as demographic and personal data. • Intuition: Although data can be highly informative. Furthermore.segmentations is not an easy task. This makes it possible to estimate for each proﬁle (the combination of demographic and personal information) the related segment and visa versa. Diﬃculties in making good segmentation are [18]: • Relevance and quality of data are essential to develop meaningful segments. one has to select what is the proﬁle that will be relevant to the project.2 Customer proﬁling Customer proﬁling provides a basis for marketers to ’communicate’ with existing customers in order to oﬀer them better services and retaining them. data analysts need to be continuously developing segmentation hypotheses in order to identify the ’right’ data for analysis. Many of these problems are due to an inadequate customer database. Depending on the goal. If the company has insuﬃcient customer data. the resulting segmentation can be too complicated for the organization to implement eﬀectively. • Over-segmentation: A segment can become too small and/or insuﬃciently distinct to justify treatment as separate segments. • Continuous process: Segmentation demands continuous development and updating as new customer data is acquired. the use of too many segmentation variables can be confusing and result in segments which are unﬁt for management decision making. A simple customer proﬁle is a ﬁle that contains at least age and 10 . too much data can lead to complex and time-consuming analysis.

This report gives an description of SVM’s and it will be researched under which circumstances and parameters a SVM works best in this case. 19]: • Geographic. or industry? How much education is needed? How much brand building advertising is needed to make a pool of customers aware of oﬀer? • Lifestyle. What is the predominant age group of your target buyers? How many children and what age are in the family? Are more female or males using a certain service or product? • Values. Customer features one can use for proﬁling. neural networks. It involves selecting.service. are described in [2. Data mining uses a broad family of computational methods that include statistical analysis. 1. income and/or purchasing power. the ﬁle would contain product information and/or volume of money spent. the term data mining was used. What languages do they speak? Does ethnicity aﬀect their tastes or buying behaviors? • Economic conditions.2 Data mining In section 1. but potentially useful. an estimation model can be made. This can be realized by a data mining method called Support Vector Machines (SVM). Data mining is the process of searching and analyzing data in order to ﬁnd implicit. from large databases. exploring and modeling large amounts of data to uncover previously unknown patterns. How was the customer recruited? The choice of the features depends also on the availability of the data. If one needs proﬁles for speciﬁc products. rule induction and reﬁnement. 10. With these features. Are they grouped regionally. and graphic visualization. particularly exploratory tools like data visualization and neural 11 . information [12]. What is the average household incom or power of the customers? Do they have any payment diﬃculty? How much or how often does a customer spend on each product? • Age and gender.1. data mining tools have been available for a long time. Although. and ultimately comprehensible information. attitudes and beliefs. How many lifestyle characteristics about purchasers are useful? • Recruitment method. decision trees. How long has the customer been regularly purchasing products? • Knowledge and awareness.gender. How much knowledge do customers have about a product. the advances in computer hardware and software. nationally or globally • Cultural and ethnic. What is the customers’ attitude toward your kind of product or service? • Life cycle.

The taxonomy reﬂects the emerging role of data visualization as Figure 1. For example. The speciﬁc tasks to be used in this research are Clustering (for the customer segmentation). The data mining tasks generate an assortment of customer and market knowledge which form the core of knowledge management process. 12 . have made data mining more attractive and practical. By the fact that the validation supports the other data mining tasks and is always necessary within a research. Formally. A drawback of this method is that the number of clusters has to be given in advance. Diﬀerent data mining tasks are grouped into categories depending on the type of knowledge extracted by the tasks. this task was not mentioned as a separate one. ”international callers”. based on user behavior data. Data mining tasks are used to extract patterns from large data sets. The various data mining tasks can be broadly divided into six categories as summarized in Figure 1. The advantage of clustering is that expert knowledge is not required. clustering algorithms can classify the Vodafone customers into ”call only” users. even as it is used to support other data mining tasks. For example.1. Validation of the results is also a data mining task. ”SMS only” users etc. Clustering algorithms produce classes that maximize similarity within clusters but minimize similarity between classes. Classiﬁcation algorithms groups customers in predeﬁned classes.1: A taxonomy of data mining tasks a separate data mining task. The typical data mining process consist of the following steps [4]: • problem formulation • data preparation • model building • interpretation and evaluation of the results Pattern extraction is an important component of any data mining activity and it deals with relationships between subsets of data. a pattern is deﬁned as [4]: A statement S in L that describes relationships among a subsets of facts Fs of a given set of facts F. with some certainty C. Classiﬁcation (for estimating the segment) and Data visualization.networks. The identiﬁcation of patterns in a large data set is the ﬁrst step to gaining useful marketing insights and marking critical marketing decisions. such that S is simpler than the enumeration of all facts in Fs .

Diﬀerent parameter settings of the Support Vector Machines will be researched and examined in Chapter 6 to ﬁnd the best working model. in Chapter 7. To realize this. Multiple plots and ﬁgures will show the working of the diﬀerent cluster methods and the meaning of each segment will be described. To provide varying levels of details of observed patterns.3 Structure of the report The report comprises 6 chapters and several appendices.Vodafone can classify its customers based on their age. the research will be discussed. Chapter 5 delves into a data mining technique called Support Vector Machines. It provides information about the structure of the data and the data ware house. In addition to to this introductory chapter. algorithms as Principal Component Analysis and Sammon’s Mapping (discussed in Section 3. This will be tested with the prepared call detail data as described in Chapter 2 For each algorithm. It also focuses on validation methods. Diﬀerent cluster algorithms will be studied. This technique will be used to classify the right segment for each customer proﬁle. it gives an overview of the data that is used to perform customer segmentation and customer proﬁling. rotate or zoom the objects. Finally. Clustering is a data mining technique. The chapter ends with a description of visualization methods. In some cases it is needed to reduce high dimensional data into three or two dimensions. that in this research is used to determine the customer segmentations. with the customer data of Chapter 2. It ends with an explanation of the preprocessing techniques that were used to prepare the data for further usage. 1. Data visualization allow data miners to view complex patterns in their customer data as visual objects complete in three or two dimensions and colors. The chapter starts with explaining the general process of clustering. data miners use applications that provide advanced manipulation capabilities to slice. gender and type of subscription and then target its user behavior. the optimal numbers of cluster will be determined.4) can be used. the cluster algorithms will be compared to each other and the best algorithm will be chosen to determine the segments. Chapter 2 describes the process of selecting the right data from the data ware house. Chapter 4 analyzes the diﬀerent cluster algorithms of Chapter 3. In Chapter 3 the process of clustering is discussed. which can be used to determine the optimal number of clusters and to measure the performance of the diﬀerent cluster algorithms. a proﬁle can be made. 13 . Once the segments are determined. These methods are used to analyze the results of the clustering. Then. Conclusions and recommendations are given and future work is proposed. Furthermore.

A simpliﬁed model of the data warehouse can be found in Appendix A.1. In the postpaid group. In general. A non-captive customer is using the Vodafone network but has not a Vodafone subscription or prepaid (called roaming). In this chapter. 2. that prepaid users are always consumers. A more precisely view can be found in Figure 2.1.Chapter 2 Data collection and preparation The ﬁrst step (after the problem formulation) in the data mining process is to understand the data. It is clear to see. Debitel customers are always consumers and ICMC customers are always business customers. the process of collecting the right data from this data ware house. there are captive and non captive users.1 Data warehouse Vodafone has stored vast amounts of data in a Teradata data warehouse. Without such an understanding.1 Selecting the customers Vodafone Maastricht is interested in customer segmentation and customer proﬁling for (postpaid) business customers. This data warehouse exists oﬀ more than 200 tables. Furthermore. that their customers can use the Vodafone network. Vodafone has made an accomplishment with two other telecommunications companies. the process of preparing the data for customer segmentation and customer proﬁling will be explained. All data of Vodafone is stored in a data warehouse. 2. A captive customer has a business account if his telephone or subscription is bought in relation with the business 14 . The ICMC customers will also be involved in this research. Debitel and InterCity Mobile Communications (ICMC). business customers can be seen as employees of a business that have a subscription by Vodafone in relation with that business. will be described. useful applications cannot be developed.

how often. where. The total number of (postpaid) business users at Vodafone is more than 800. can have a subscription that is under normal circumstances only available for business users. and will be available almost immediately for data mining. the call detail records associated with a customer must be summarized into a single record that describes the customer’s calling behavior. At a minimum. which is typically made available only once per month. Thus. one can think of the smallest set of variables that describe the complete behavior of a customer. not at the level of individual phone calls [7. when.000. The choice of summary variables (features) is critical in order to obtain a useful description of the customer []. this means that hundreds of millions of call detail data will need to be stored at any time. These customers also count as business users. Call detail records include suﬃcient information to describe the important characteristics of each call.1. 2. etc.2 Call detail data Every time a call is placed on the telecommunications network of Vodafone.Figure 2. who. each call detail record will include the originating and terminating phone numbers. This is in contrast with billing data. Given that 12 months of call detail data is typically kept on line. To deﬁne the features. In some cases. descriptive information about the call is saved as a call detail record. customers with a consumer account. Keywords like what. These customers are called business users. The next sections describe which data of these customers is needed for customer segmentation and proﬁling. the date and the time of the call and the duration of the call. The number of call detail records that are generated and stored is huge. can help with this process: 15 .1: Structure of customers by Vodafone he works. For example. Vodafone customers generate over 20 million call detail records per day. Call detail records can not be used directly for data mining. since the goal of data applications is to extract knowledge at the customer level. 8]. Call detail records are generated in two or three days after the day the calls were made.

% of weekday calls (Monday . # unique area codes called during P • 12. 15. % international calls • 10. average # sms originated per day • 9. # diﬀerent numbers called during P 16 . • When? : When does a customer call? A business customer can call during oﬃce daytime. or in private time in the evening or at night and during the weekend. • Where? : Where is the customer calling? Is he calling abroad? • How long? : How long is the customer calling? • How often? : How often does a customer call or receive a call? Based on these keywords and based on proposed features in the literature [1. but their appearances are so rare that they were not used during this research). % of outgoing calls within the same operator • 11. average # calls originated per day • 4. average # sms received per day • 8. • Who? : Who is the customer calling? Does he call to ﬁxed lines? Does he call to Vodafone mobiles? • What? : What is the location of the customer and the recipient? They can make international phone calls. 19.• How? : How can a customer cause a call detail record? By making a voice call. average # calls received per day • 3.Friday) • 6. % of calls to mobile phones • 7. average call duration • 2. % daytime calls (9am . 20] .6pm) • 5. The customer can also receive an SMS or voice call. or sending an SMS (there are more possibilities. a list of features that can be used as a summary description of a customer based on the calls they originate and receive over some time period P is obtained: • 1.

In that case. it should include exploratory data analysis.2: Visualization of phone calls per hour variance within the data. This also indicates that the chosen features are suited for the customer segmentation. customers originating more calls than receiving. the number of weekday and daytime calls and the originated calls have suﬃcient variance. For example. customers who use their telephone only at their oﬃce could be in a diﬀerent segment then users that use their telephone also for private purposes. It may be clear that generating useful features. More detailed exploratory data analysis. is a critical step within the data mining process. otherwise distinguish between customers is not possible and the feature is not useful. the variance is visible in the following histograms. Note that the histograms resemble well known distributions. On the other hand. First of all.3 shows that the average call duration.4 is also visible that the customers that originated more calls. Although the construction of these features may be guided by common sense. Should poor features be generated. For some features values. Interesting to see is the relation between the number of calls originated and received. including summary features. For examples. Another aspect that is simple to ﬁgure out is the fact that customer 17 .4 demonstrates this. the use of the time period 9am-6pm in the fourth feature is not based on the commonsense knowledge that the typical workday on a oﬃce is from 9am to 5pm. shown in Figure 2. Most of the twelve features listed above can be generated in a straightforward manner from the underlying data of the data ware house. in general.2 indicates that the period from 9am to 6pm is actually more appropriate for this purpose. for each summary feature. Figure 2. values above the blue line represent customers with more originating calls than receiving calls. to much variance hampers the process of segmentation. there should be suﬃcient Figure 2. Furthermore.These twelve features can be used to build customer segments. In Figure 2. data mining will not be successful. the segmentation was based on the percentage weekday and daytime calls. Such a segment describes a certain behavior of group of customers. but some features require a little more creativity and operations on the data. Figure 2. receive also more calls in proportion.

(a) Call duration (b) Weekday calls (c) Daytime calls (d) Originated calls Figure 2.3: Histograms of feature values Figure 2.4: Relation between originated and received calls 18 .

It is clear to see that the chosen features contain suﬃcient variance and that certain relations and diﬀerent customer behavior are already visible. advance. female • Type telephone: simple. contract information and telephone equipment information.1. 25-40 40-55 >55 • Gender: male. Information about lifestyles and income is missing. However.that make more weekday calls also call more at daytime (in proportion).5. This is plotted in Figure 2. small city /town 19 .1. big • Living area: (big) city. customer data is needed. with some creativity. intermediate.3 Customer data To proﬁle the customer. advanced • Type subscription: basic. some information can be subtracted from the data ware house. The chosen features appear to be well chosen and useful for customer segmentation. With this information. basic. the following variables can be used to deﬁne a customers proﬁle: • Age group: <25. Figure 2.2 is not completely available. expanded • Company size: small.5: Relation between daytime and weekday calls 2. The proposed data in Section 1. The information that Vodafone stored in the data ware house include name and address information and also include other information such as service plan.

0% 40-55 27. These tasks are [7]: • Discovering and repairing inconsistent data formats and inconsistent data encoding. 2. In general. the goal of grouping variables is to reduce the number of variables to a more manageable size and to remove the correlations between each variable.9% small 31. this feature will not increase the performance of the classiﬁcation. the age of the customers has to be grouped.Because a relative small diﬀerence in age between customers should show close relationships.8% basic 38. a Support Vector Machine will be used to estimate the segment of the customer. • Interpreting codes into text or replacing text into meaningful numbers.5% simple 34. The composition of the groups should be chosen with care.2 Data preparation Before the data can be used for the actual data mining process. • Deleting unwanted data ﬁelds.2% simple 33.1 shows the percentages of customers within the chosen groups. it need to cleaned and prepared in a required format.5% Female 39. Based on this feature.1% big 34. It is clear to see that sizes of the groups were chosen with care Age: Gender: Telephone type: Type of subscription: Company size: Living area: <25 21. such as production keys and version numbers. abbreviations and punctuation.0% intermediate 34.8% expanded 29. 20 .0% 25-40 29. It is of high importance that the sizes of the groups are almost equal (if this is possible) [22]. Data may contain many meaningless ﬁelds from an analysis point of view.3% small city/town 58.9% >55 21. the result of the classiﬁcation algorithm is too speciﬁc to the trainings data [14].2% Male 60. If there is one group with a suﬃcient higher amount of customers than other groups. Table 2.2% Table 2. spelling errors. Chapter 5 and Chapter 6 contain information and results of this method. This is caused by the fact that from each segment a relative high number of customers is represented in this group.4% advanced 27.With this proﬁle.1: Proportions within the diﬀerent classiﬁcation groups and the values can be used for deﬁning the customers proﬁle.5% (big) city 42.7% advanced 36. Otherwise. the segment of a customer can not be determined.

The following data preparations were needed during this research: • Checking abnormal.g.g. • Finding multiple used ﬁelds. e. for instance the customer data.g. correspondence analysis and conjoint analysis [14]. These codes has to be augmented and replaced by recognizable and equivalent text.Data may contain cryptic codes. • Converting from textual to numeral or numeric data. decision trees or associations rules. it is also useful to apply data reduction techniques (data cube aggregation. New ﬁelds can be generated through combinations of e. • Mapping continuous values into ranges. dimension and numerosity reduction. Some of these outliers may be correct but this is highly unusual. The goal of this approach is to reduce the number of variables to a more manageable size while also the correlations between each variable will be removed.1]. • Adding computed ﬁelds as inputs or targets. The second type is to normalize the variance to one. averages and minimum/maximum values. decision trees. Dimension reduction means that one has to select relevant feature to a minimum set of attributes such that the resulting probability distribution of data classes is a close as possible to the original distribution given the values of all features. 21 . out of bounds or ambiguous values. The ﬁrst type is to normalize the values between [0. There are two types of normalization. For this additional tools may be needed. Techniques used for this purpose are often referred to as factor analysis. random or heuristic search. e. clustering. from multiple tables into one common variable. When there is a large amount of data. • Checking missing data ﬁelds or ﬁelds that have been replaced by a default value. • Combining data. discretization and concept hierarchy generation). thus almost impossible to explain. • Converting nominal data (for example yes/no answers) to metric scales. frequencies. A possible way to determine is to count or list all the distinct variables of a ﬁeld. exhaustive. • Normalization of the variables.

As every other unsupervised method. according to similarities among them [13]. In this case the 3 Figure 3.Chapter 3 Clustering In this chapter.1 Cluster analysis The objective of cluster analysis is the organization of objects into groups. The similarity criterion that was used in this case is distance: two or more objects belong to the same cluster if they are ”close” according to a given distance (in this case geometrical distance).1: Example of clustering data clusters into which the data can be divided were easily identiﬁed. it does not use prior class identiﬁers to detect the underlying structure in a collection of data. the used techniques for the cluster segmentation will be explained.1 shows this with a simple graphical example. two or more objects 22 . Within this method. A cluster can be deﬁned as a collection of objects which are ”similar” between them and ”dissimilar” to the objects belonging to other clusters. Figure 3. Another way of clustering is conceptual clustering. 3. This is called distance-based clustering. Clustering can be considered the most important unsupervised learning method.

. called the regressands. objects are grouped according to their ﬁt to descriptive concepts. Each observation of the customers calling behavior consists of n measured values. In metric spaces. and X is called the pattern matrix. .1.. additional steps are needed. . measured in some well-deﬁned sense. or distance measure. one can accept the deﬁnition that a cluster is a group of objects that are more similar to another than to members of other clusters. In this research.2. xkn ]T . A second way is to measure the distance form the data vector to some prototypical object of the cluster. the columns are called the features or attributes. . and the columns are the feature variables of their behavior as described in Section 2. one should realize.1. In other words. (3.1) .2 The clusters The deﬁnition of a cluster can be formulated in various ways. that the relations revealed by clustering are not more than associations among the data vectors.. . The cluster centers are usually (and also in this research) not known a priori. A set of N observations is denoted by X = {xk |k = 1. . . and will be calculated by the clustering algorithms simultaneously with the partitioning of the data. grouped into an n-dimensional row vector xk = [xk1 .1 The data One can apply clustering techniques to quantitative (numerical) data. 2. In this research. 3.2. In this research. they will not automatically constitute a prediction model of the given system. qualitative (categoric) data. As mentioned before. the rows of X are called patterns or objects. .. . and is represented as an N x n matrix: x11 x12 · · · x1n x21 x22 · · · x2n X= . The data. the clustering of quantitative data is considered. 23 . are typically summarized observations of a physical process (call behavior of a customer). X will be referred to the data matrix.. And therefore. Distance can be measured in diﬀerent ways. the purpose of clustering is to ﬁnd relationships between independent system variables. To obtain such a model. as described in Section 2. . similarity is often deﬁned by means of a distance norm. In general. .. 3. . not according to simple similarity measures. The term ”similarity” can be interpreted as mathematical similarity. called the regressors. The ﬁrst possibility is to measure among the data vectors themselves. only distance-based clustering algorithms were used.1. and future values of dependent variables. However.. xk2 . where xk ∈ Rn . or a mixture of both.belong to the same cluster if this one deﬁnes a concept common to all that objects. xN 1 xN 2 · · · xN n In pattern recognition terminology. N }. The rows of X represent the customers. . depending on the objective of the clustering.1.

The performance of most clustering algorithms is inﬂuenced not only by the geometrical shapes and densities of the individual clusters. Clusters a. Clustering algorithms are able to detect subspaces of the data space.3 Cluster partition Clusters can formally be seen as subsets of the data set. but can also be deﬁned as ”higher-level” geometrical objects. elongated and also be (a) Elongated (b) Spherical (c) Hollow (d) Hollow Figure 3. and therefore reliable for identiﬁcation. such as linear or nonlinear subspaces or functions. or overlapping each other.The cluster centers may be vectors of the same dimensions as the data objects. but also by the spatial relations and distances among the clusters. Data can reveal clusters of diﬀerent geometrical shapes.c and d can be characterized as linear and non linear subspaces of the data space (R2 in this case). Clusters can be well-separated.2: Diﬀerent cluster shapes in R2 hollow.2 Clusters can be spherical. sizes and densities as demonstrated in Figure 3. continuously connected to each other. One can distinguish two possible outcomes of the classiﬁcation of clustering methods. Cluster can be found in any n-dimensional space.1. 3. Subsets can 24 .

3: Hard and fuzzy clustering causes analytical and algorithmic intractability of algorithms based on analytic functionals.either be fuzzy or crisp (hard).1 µ1. with diﬀerent degrees of membership.5) Ai ∩ Aj . since these functionals are not diﬀerentiable. which requires that an object either does or does not belong to a cluster.2) U= .c (3.2 · · · µN.2 · · · µ2.1 µN. its properties can be deﬁned as follows: c Ai = X. based on prior knowledge. . Hard clustering methods are based on the classical set theory. but rather are assigned membership degrees between 0 and 1 indicating their partial memberships (illustrated by Figure 3. 1 ≤ i ≤ c. . . .. . as objects on the boundaries between several classes are not forced to fully belong to one of the classes.3 The discrete nature of hard partitioning also Figure 3.1 µ2. 25 . .2 · · · µ1. .c Hard partition The objective of clustering is to partition the data set X into c clusters.c µ2. The data set X is thus partitioned into c fuzzy subsets. Assume that c is known.3) (3. e. a hard partition can be seen as a family of subsets {Ai |1 ≤ i ≤ c ⊂ P (X)}. Hard clustering in a data set X means partitioning the data into a speciﬁed number of exclusive subsets of X. Ø ⊂ Ai ⊂ X. .4) (3. . or it is a trial value. fuzzy clustering is more natural than hard clustering. 1 ≤ i = j ≤ c. i=1 (3. Using classical sets.g. of witch partition results must be validated. . Fuzzy clustering methods allow objects to belong to several clusters simultaneously. µN. . The number of subsets (clusters) is denoted by c. The structure of the partition matrix U = [µik ]: µ1. In many real situations.

These conditions imply that the subsets A_i together contain all the data in X, that they are mutually disjoint, and that none of them is empty nor contains all the data in X. Expressed in terms of membership functions:

$$\bigvee_{i=1}^{c} \mu_{A_i}(x_k) = 1, \qquad (3.6)$$

$$\mu_{A_i}(x_k) \wedge \mu_{A_j}(x_k) = 0, \quad 1 \le i \ne j \le c, \qquad (3.7)$$

$$\mu_{A_i}(x_k) \in \{0, 1\}, \quad 1 \le i \le c, \qquad (3.8)$$

where µ_{A_i} represents the characteristic function of the subset A_i, whose value is zero or one. To simplify the notation, µ_i will be used instead of µ_{A_i}, and µ_i(x_k) is denoted by µ_ik. Partitions can then be represented in matrix notation. The matrix U = [µ_ik] is a representation of the hard partition if and only if its elements satisfy:

$$\mu_{ik} \in \{0, 1\}, \quad 1 \le i \le c, \ 1 \le k \le N, \qquad (3.9)$$

$$\sum_{i=1}^{c} \mu_{ik} = 1, \quad 1 \le k \le N, \qquad (3.10)$$

$$0 < \sum_{k=1}^{N} \mu_{ik} < N, \quad 1 \le i \le c. \qquad (3.11)$$

The hard partitioning space can then be defined as follows. Let X be a finite data set and let the number of clusters satisfy 2 ≤ c < N. The hard partitioning space for X is the set

$$M_{hc} = \Big\{ U \in \mathbb{R}^{N \times c} \ \Big|\ \mu_{ik} \in \{0,1\},\ \forall i,k;\ \sum_{i=1}^{c} \mu_{ik} = 1,\ \forall k;\ 0 < \sum_{k=1}^{N} \mu_{ik} < N,\ \forall i \Big\}. \qquad (3.12)$$

Fuzzy partition

A fuzzy partition can be seen as a generalization of the hard partition: in this case µ_ik is allowed to take all real values between zero and one. Its conditions are given by:

$$\mu_{ik} \in [0, 1], \quad 1 \le i \le c, \ 1 \le k \le N, \qquad (3.13)$$

$$\sum_{i=1}^{c} \mu_{ik} = 1, \quad 1 \le k \le N, \qquad (3.14)$$

$$0 < \sum_{k=1}^{N} \mu_{ik} < N, \quad 1 \le i \le c. \qquad (3.15)$$

Note that there is only one difference with the conditions of the hard partitioning. The definition of the fuzzy partitioning space likewise differs only slightly from that of the hard partitioning space.

Let X be a finite data set and let the number of clusters satisfy 2 ≤ c < N. The fuzzy partitioning space for X is the set

$$M_{fc} = \Big\{ U \in \mathbb{R}^{N \times c} \ \Big|\ \mu_{ik} \in [0,1],\ \forall i,k;\ \sum_{i=1}^{c} \mu_{ik} = 1,\ \forall k;\ 0 < \sum_{k=1}^{N} \mu_{ik} < N,\ \forall i \Big\}. \qquad (3.16)$$

The i-th column of U contains the values of the membership function of the i-th fuzzy subset of X. Equation (3.14) implies that the total membership of each x_k in X equals one. In a possibilistic partition there are no such constraints on the distribution of memberships among the fuzzy clusters; the possibilistic partition will not be used in this research and is not discussed here. This research will focus on hard partitions; however, fuzzy cluster algorithms will be applied as well. To deal with fuzzy memberships, the cluster with the highest degree of membership is chosen as the cluster to which the object belongs.

3.2 Cluster algorithms

This section gives an overview of the clustering algorithms that were used during the research.

3.2.1 K-means

K-means is one of the simplest unsupervised learning algorithms that solves the clustering problem. The procedure follows an easy way to classify a given N × n data set through a number of c clusters defined in advance. The K-means algorithm allocates each data point to one of the c clusters so as to minimize the within-cluster sum of squares:

$$\sum_{i=1}^{c} \sum_{k \in A_i} \| x_k - v_i \|^2, \qquad (3.17)$$

where A_i is the set of data points in the i-th cluster and v_i is the average of the data points in cluster i. Note that ‖x_k − v_i‖² is a chosen distance norm. The cluster center (also called prototype) v_i of cluster i is

$$v_i = \frac{1}{N_i} \sum_{x_k \in A_i} x_k, \qquad (3.18)$$

where N_i is the number of data points in A_i. This method results in hard partitioned clusters. However, the results of this hard partitioning method are not always reliable, and the algorithm has numerical problems as well.
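As an illustration of the alternating assignment and update steps behind equations (3.17)–(3.18), the following Python sketch gives a minimal K-means implementation. It is a sketch under the assumption that numpy is available; it is not the implementation used in this research.

```python
import numpy as np

def kmeans(X, c, n_iter=100, seed=0):
    """Minimal K-means sketch: alternate assignment and centre update."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)]      # initial cluster centres v_i
    for _ in range(n_iter):
        # assign every x_k to the nearest centre (squared Euclidean distance)
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each centre as the mean of the points assigned to it
        V_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else V[i]
                          for i in range(c)])
        if np.allclose(V_new, V):
            break
        V = V_new
    return labels, V
```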

3.2.2 K-medoid

K-medoid clustering, also a hard partitioning algorithm, uses the same equations as the K-means algorithm. The only difference is that in K-medoid the cluster centers are the data points nearest to the mean of the data in a cluster, V = {v_i ∈ X | 1 ≤ i ≤ c}. This can be useful when, for example, there is no continuity in the data space, which implies that a mean of the points in one cluster does not actually exist.

3.2.3 Fuzzy C-means

The Fuzzy C-means algorithm (FCM) minimizes an objective function, called the C-means functional, invented by Dunn. It is defined as follows:

$$J(X; U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m \, \| x_k - v_i \|_A^2, \qquad (3.19)$$

with

$$V = [v_1, v_2, \ldots, v_c], \quad v_i \in \mathbb{R}^n, \qquad (3.20)$$

the vector of cluster centers (prototypes) that have to be determined, and

$$D_{ikA}^2 = \| x_k - v_i \|_A^2 = (x_k - v_i)^T A (x_k - v_i) \qquad (3.21)$$

a squared inner-product distance norm. From a statistical point of view, equation (3.19) measures the total variance of the x_k from the v_i. The minimization of the C-means functional is a nonlinear optimization problem, which can be solved by a variety of methods; examples are grouped coordinate minimization and genetic algorithms. The simplest method is a Picard iteration through the first-order conditions for the stationary points of equation (3.19); this method is called the fuzzy c-means algorithm. To find the stationary points, the constraint (3.14) is adjoined to J by means of Lagrange multipliers,

$$\bar{J}(X; U, V, \lambda) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m D_{ikA}^2 + \sum_{k=1}^{N} \lambda_k \left( \sum_{i=1}^{c} \mu_{ik} - 1 \right), \qquad (3.22)$$

and the gradients of J̄ with respect to U, V and λ are set to zero. If D²_{ikA} > 0 for all i, k and m > 1, then (U, V) ∈ M_fc × R^{n×c} can minimize the C-means functional only if

$$\mu_{ik} = \frac{1}{\sum_{j=1}^{c} \left( D_{ikA} / D_{jkA} \right)^{2/(m-1)}}, \quad 1 \le i \le c, \ 1 \le k \le N, \qquad (3.23)$$

and

$$v_i = \frac{\sum_{k=1}^{N} (\mu_{ik})^m x_k}{\sum_{k=1}^{N} (\mu_{ik})^m}, \quad 1 \le i \le c. \qquad (3.24)$$

These solutions satisfy the constraints given in equations (3.13)–(3.15). Note that equation (3.24) gives v_i as the weighted average of the data points belonging to a cluster, where the weights are the membership degrees; this explains why the algorithm is called c-means. The Fuzzy C-means algorithm is in fact an iteration between equations (3.23) and (3.24).

The FCM algorithm uses the standard Euclidean distance for its computations, caused by the common choice of the norm-inducing matrix A = I. Hence it can only detect clusters with the same shape: hyper-spherical clusters. The norm-inducing matrix can also be chosen as an n × n diagonal matrix of the form

$$A_D = \mathrm{diag}\big( (1/\sigma_1)^2, (1/\sigma_2)^2, \ldots, (1/\sigma_n)^2 \big), \qquad (3.25)$$

which accounts for different variances in the directions of the coordinate axes of X. Another possibility is to choose A as the inverse of the n × n covariance matrix, A = F^{-1}, where

$$F = \frac{1}{N} \sum_{k=1}^{N} (x_k - \bar{x})(x_k - \bar{x})^T \qquad (3.26)$$

and x̄ denotes the mean of the data. In this case, A is based on the Mahalanobis distance norm.

3.2.4 The Gustafson-Kessel algorithm

The Gustafson and Kessel (GK) algorithm is a variation on the Fuzzy c-means algorithm [11]. It employs a different, adaptive distance norm in order to recognize geometrical shapes in the data: each cluster has its own norm-inducing matrix A_i, yielding the inner-product norm

$$D_{ikA_i}^2 = (x_k - v_i)^T A_i (x_k - v_i), \quad 1 \le i \le c, \ 1 \le k \le N. \qquad (3.27)$$

This implies that each cluster is allowed to adapt the distance norm to the local topological structure of the data. The c-tuple of norm-inducing matrices is denoted by A = (A_1, A_2, ..., A_c) and is used as an additional optimization variable in the C-means functional. The objective functional of the GK algorithm is

$$J(X; U, V, A) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m D_{ikA_i}^2. \qquad (3.28)$$
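The alternating updates (3.23)–(3.24) of the Fuzzy C-means algorithm can be sketched as follows. This is a minimal illustration, assuming numpy and the Euclidean norm (A = I); the function name and defaults are hypothetical and not the toolbox implementation used in this research.

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, eps=1e-8, seed=0):
    """Minimal fuzzy C-means sketch: iterate the two update equations."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=c, replace=False)]          # initial prototypes
    for _ in range(n_iter):
        # squared Euclidean distances D_ik^2 between every point and every prototype
        D2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + eps
        # membership update (3.23); with squared distances the exponent is 1/(m-1)
        ratio = (D2[:, :, None] / D2[:, None, :]) ** (1.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)                            # shape (N, c), rows sum to 1
        # prototype update (3.24): weighted mean with weights u_ik^m
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, V
```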

If A_i is left free, the objective functional (3.28) cannot be minimized in a straightforward manner, since it is linear in A_i: J can be made as small as desired by making A_i less positive definite. To avoid this, A_i has to be constrained in order to obtain a feasible solution. A general way to do this is by constraining the determinant of the matrix,

$$|A_i| = \rho_i, \quad \rho_i > 0, \qquad (3.29)$$

which corresponds to optimizing the shape of a cluster while its volume remains fixed. Using the Lagrange multiplier method, A_i can then be expressed as

$$A_i = \big[ \rho_i \det(F_i) \big]^{1/n} F_i^{-1}, \qquad (3.30)$$

where F_i is the fuzzy covariance matrix of the i-th cluster,

$$F_i = \frac{\sum_{k=1}^{N} (\mu_{ik})^m (x_k - v_i)(x_k - v_i)^T}{\sum_{k=1}^{N} (\mu_{ik})^m}. \qquad (3.31)$$

The covariance is weighted by the membership degrees in U. Equation (3.30) can be substituted into equation (3.27), so that this distance norm, in combination with equations (3.13)–(3.15), can be applied without any problems.

3.2.5 The Gath-Geva algorithm

Bezdek and Dunn [5] proposed a fuzzy maximum likelihood estimation (FMLE) algorithm with a corresponding distance norm:

$$D_{ik}(x_k, v_i) = \frac{\sqrt{\det(F_{wi})}}{\alpha_i} \exp\!\Big( \tfrac{1}{2} (x_k - v_i)^T F_{wi}^{-1} (x_k - v_i) \Big). \qquad (3.32)$$

Comparing this with the Gustafson-Kessel algorithm, the distance norm now includes an exponential term, which implies that this distance norm will decrease faster than the inner-product norm. The fuzzy covariance matrix F_{wi} of the i-th cluster is defined by

$$F_{wi} = \frac{\sum_{k=1}^{N} (\mu_{ik})^w (x_k - v_i)(x_k - v_i)^T}{\sum_{k=1}^{N} (\mu_{ik})^w}. \qquad (3.33)$$

The variable w is used to generalize this expression; because of the generalization, two weighted covariance matrices arise. In the original FMLE algorithm w = 1; in this research w is set to 2, to compensate the exponential term and obtain clusters that are more fuzzy. The variable α_i in equation (3.32) is the prior probability of selecting cluster i and can be defined as follows:

$$\alpha_i = \frac{1}{N} \sum_{k=1}^{N} \mu_{ik}. \qquad (3.34)$$
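The cluster-specific distance computation of the Gustafson-Kessel step, i.e. equations (3.27), (3.30) and (3.31), can be illustrated with the following sketch. It assumes numpy, a membership matrix U of size N × c (column i holding the memberships of cluster i), a small regularization term to keep the covariance matrices invertible, and the same exponent m for the covariance weights; it is illustrative only.

```python
import numpy as np

def gk_distances(X, V, U, m=2.0, rho=1.0, reg=1e-10):
    """Per-cluster Mahalanobis-type distances used by a Gustafson-Kessel step."""
    N, n = X.shape
    c = V.shape[0]
    D2 = np.empty((N, c))
    for i in range(c):
        diff = X - V[i]                                             # (N, n)
        w = (U[:, i] ** m)[:, None]                                 # weights u_ik^m
        F = (diff * w).T @ diff / w.sum() + reg * np.eye(n)         # fuzzy covariance F_i
        A = (rho * np.linalg.det(F)) ** (1.0 / n) * np.linalg.inv(F)  # norm-inducing A_i
        D2[:, i] = np.einsum('kj,jl,kl->k', diff, A, diff)          # (x_k - v_i)^T A_i (x_k - v_i)
    return D2
```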

Gath and Geva [9] discovered that the FMLE algorithm is able to detect clusters of different shapes, sizes and densities, and that the clusters are not constrained in volume. The main drawback of this algorithm is its lack of robustness, since the exponential distance norm can make it converge to a local optimum. Furthermore, it is not known how reliable the results of this algorithm are.

3.3 Validation

Cluster validation refers to the problem whether a found partition is correct and how to measure the correctness of a partition. A clustering algorithm is designed to parameterize clusters in such a way that it gives the best fit; however, this does not imply that the best fit is meaningful at all. The number of clusters might not be correct, or the cluster shapes might not correspond to the actual groups in the data. In the worst case, the data cannot be grouped in a meaningful way at all. One can distinguish two main approaches to determine the correct number of clusters in the data:

• Start with a sufficiently large number of clusters, and successively reduce this number by combining clusters that have similar properties.

• Cluster the data for different values of c and validate the correctness of the obtained clusters with validation measures.

To be able to perform the second approach, validation measures have to be designed. Different validation methods have been proposed in the literature; however, none of them is perfect by itself. Therefore, several indexes are used in this research, which are described below.

• Partition Coefficient (PC): measures the amount of "overlapping" between clusters. It is defined by Bezdek [5] as follows:

$$PC(c) = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N} (u_{ij})^2, \qquad (3.35)$$

where u_ij is the membership of data point j in cluster i. The main drawback of this validity measure is its lack of direct connection to the data itself. The optimal number of clusters is indicated by the maximum value.

• Classification Entropy (CE): measures only the fuzziness of the clusters, and is a slight variation on the Partition Coefficient:

$$CE(c) = -\frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij} \log(u_{ij}). \qquad (3.36)$$

• Partition Index (PI): expresses the ratio of the sum of compactness and separation of the clusters. Each individual cluster is measured with the cluster validation method, and this value is normalized by dividing it by the fuzzy cardinality of the cluster; the Partition Index is the sum of these values over the individual clusters:

$$PI(c) = \sum_{i=1}^{c} \frac{\sum_{j=1}^{N} (u_{ij})^m \|x_j - v_i\|^2}{N_i \sum_{k=1}^{c} \|v_k - v_i\|^2}. \qquad (3.37)$$

PI is mainly used for comparing different partitions with the same number of clusters; a lower value indicates a better partitioning.

• Separation Index (SI): in contrast with the Partition Index, the Separation Index uses a minimum-distance separation to validate the partitioning:

$$SI(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (u_{ij})^2 \|x_j - v_i\|^2}{N \, \min_{i,k} \|v_k - v_i\|^2}. \qquad (3.38)$$

• Xie and Beni's Index (XB): quantifies the ratio of the total variation within the clusters and the separation of the clusters [3]:

$$XB(c) = \frac{\sum_{i=1}^{c} \sum_{j=1}^{N} (u_{ij})^m \|x_j - v_i\|^2}{N \, \min_{i,j} \|x_j - v_i\|^2}. \qquad (3.39)$$

The lowest value of the XB index indicates the optimal number of clusters.

• Dunn's Index (DI): this index was originally designed for the identification of hard partitioning clusterings, so the result of a fuzzy clustering has to be repartitioned before it can be computed:

$$DI(c) = \min_{i \in c} \Big\{ \min_{j \in c,\, j \ne i} \Big\{ \frac{\min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_{k \in c} \{ \max_{x, y \in C_k} d(x, y) \}} \Big\} \Big\}. \qquad (3.40)$$

The main disadvantage of Dunn's index is its very expensive computational complexity as c and N increase.

• Alternative Dunn Index (ADI): designed to simplify the calculation of the Dunn index. The dissimilarity between two clusters, measured with min_{x∈C_i, y∈C_j} d(x, y), is bounded from below using the triangle inequality, d(x, y) ≥ |d(y, v_j) − d(x, v_j)|, where v_j represents the cluster center of the j-th cluster:

$$ADI(c) = \min_{i \in c}\, \min_{j \in c,\, j \ne i} \Big\{ \frac{\min_{x_i \in C_i,\, x_j \in C_j} |d(x_j, v_j) - d(x_i, v_j)|}{\max_{k \in c} \max_{x, y \in C_k} d(x, y)} \Big\}. \qquad (3.41)$$
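Three of the indexes above (PC, CE and XB) are straightforward to compute from the partition matrix. The following Python sketch shows one possible implementation, assuming numpy, a membership matrix U of size N × c and cluster centres V; it is illustrative only and not the toolbox code used in this research.

```python
import numpy as np

def partition_coefficient(U):
    """PC: mean of squared memberships; equals 1 for a hard partition."""
    return (U ** 2).sum() / U.shape[0]

def classification_entropy(U, eps=1e-12):
    """CE: fuzziness of the partition; equals 0 for a hard partition."""
    return -(U * np.log(U + eps)).sum() / U.shape[0]

def xie_beni(X, V, U, m=2.0, eps=1e-12):
    """XB as in equation (3.39): within-cluster variation divided by
    N times the smallest point-to-centre distance."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # (N, c)
    num = ((U ** m) * d2).sum()
    return num / (X.shape[0] * (d2.min() + eps))
```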

Note that the Partition Coefficient and the Classification Entropy are only useful for fuzzy partitioned clusterings. Conversely, in the case of fuzzy clusters the values of Dunn's Index and the Alternative Dunn Index are not reliable, which is caused by the repartitioning of the results with the hard partition method.

3.4 Visualization

To understand the data and the results of the clustering methods, it is useful to visualize them. However, the data set used is high-dimensional and cannot be plotted and visualized directly. This section describes three methods that map the data points into a lower-dimensional space; all three mapping methods are used for the visualization of the clustering results, and they are explained in more detail in the following subsections.

The first method is Principal Component Analysis (PCA), a standard and widely used method to map high-dimensional data into a lower-dimensional space. The second method is the Sammon mapping, which is based on the preservation of the Euclidean inter-point distances. The advantage of the Sammon mapping is its ability to preserve inter-pattern distances; this kind of mapping is much more closely related to the purpose of clustering than preserving the variances (which is what PCA does). The Sammon mapping has, however, two main drawbacks:

• The Sammon mapping aims to find N points in a lower q-dimensional subspace of the high n-dimensional space, in such a way that the inter-point distances correspond to the distances measured in the n-dimensional space. To achieve this, a computationally expensive algorithm is needed, because in every iteration step a computation of N(N − 1)/2 distances is required.

• Sammon mapping is a projection method based on inter-point distances. This implies that it can only be applied to clustering algorithms that use the Euclidean distance norm during the calculation of the clusters.

To avoid these problems, a modified algorithm, called the Fuzzy Sammon mapping, is used during this research, since only the distances between the data points and the cluster centers are considered to be important. A drawback of this Fuzzy Sammon mapping is a loss of precision in the inter-point distances.

3.4.1 Principal Component Analysis

Principal component analysis (PCA) is a mathematical procedure that maps a number of correlated variables into a smaller set of uncorrelated variables, called the principal components.

The first principal component represents as much of the variability in the data as possible; the succeeding components describe the remaining variability. The main goals of the PCA method are:

• Identifying new meaningful underlying variables.

• Discovering and/or reducing the dimensionality of a data set.

In this research, the second objective is used. Mathematically, Principal Component Analysis is based on the projection of correlated high-dimensional data onto a hyperplane [3]. The principal components are obtained by analysing the eigenvectors and eigenvalues of the covariance matrix of the data,

$$F = \frac{1}{N} \sum_{k=1}^{N} (x_k - v)(x_k - v)^T, \qquad (3.43)$$

where v = x̄ is the mean of the data. The direction of the first principal component is given by the eigenvector with the largest eigenvalue; the eigenvector associated with the second largest eigenvalue corresponds to the second principal component, and so on. The method uses only the first q nonzero eigenvalues and the corresponding eigenvectors of the covariance matrix,

$$F_i = U_i \Lambda_i U_i^T, \qquad (3.44)$$

where Λ_i is a matrix containing the eigenvalues λ_{i,j} of F_i on its diagonal in decreasing order, and U_i is a matrix containing the corresponding eigenvectors in its columns. The weight matrix W_i contains the q principal orthonormal axes in its columns,

$$W_i = U_{i,q} \Lambda_{i,q}^{1/2}, \qquad (3.45)$$

so that each vector x_k of X is represented by a q-dimensional reduced vector

$$y_{i,k} = W_i^{-1}(x_k) = W_i^T x_k. \qquad (3.46)$$

3.4.2 Sammon mapping

As mentioned before, the Sammon mapping uses inter-point distance measures to find N points in a q-dimensional data space that are representative for a higher n-dimensional data set. The inter-point distances of the n-dimensional space, defined by d_ij = d(x_i, x_j), should correspond to the inter-point distances in the q-dimensional space, given by d*_ij = d*(y_i, y_j). This is achieved by minimizing Sammon's stress, a criterion of the error:

$$E = \frac{1}{\lambda} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{(d_{ij} - d^{*}_{ij})^2}{d_{ij}}, \qquad (3.47)$$

where λ is a constant,

$$\lambda = \sum_{i<j} d_{ij} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} d_{ij}. \qquad (3.48)$$
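The PCA projection of equations (3.43)–(3.46) amounts to an eigen-decomposition of the covariance matrix followed by a projection onto the leading eigenvectors. A minimal numpy sketch, with hypothetical function and variable names, is given below.

```python
import numpy as np

def pca_project(X, q=2):
    """Project the data onto the first q principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    F = np.cov(Xc, rowvar=False)            # covariance matrix of the features
    eigval, eigvec = np.linalg.eigh(F)      # symmetric eigen-decomposition (ascending)
    order = np.argsort(eigval)[::-1]        # sort eigenvalues in decreasing order
    W = eigvec[:, order[:q]]                # first q principal axes as columns
    return Xc @ W, W, mean                  # q-dimensional scores, axes, data mean
```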

Note that there is no need to maintain λ, since a constant does not change the result of the optimization process. The minimization of the error E is an optimization problem in the N·q variables y_il, with i ∈ {1, 2, ..., N} and l ∈ {1, 2, ..., q}, where y_i = [y_{i1}, ..., y_{iq}]^T. The update of y_il at the t-th iteration is defined by

$$y_{il}(t+1) = y_{il}(t) - \alpha \, \frac{\partial E(t) / \partial y_{il}(t)}{\partial^2 E(t) / \partial y_{il}^2(t)}, \qquad (3.49)$$

where α is a nonnegative scalar constant (recommended value α ≈ 0.3–0.4), representing the step size of the gradient search in the direction of

$$\frac{\partial E(t)}{\partial y_{il}(t)} = -\frac{2}{\lambda} \sum_{k=1, k \ne i}^{N} \frac{d_{ki} - d^{*}_{ki}}{d_{ki}\, d^{*}_{ki}} \,(y_{il} - y_{kl}), \qquad (3.50)$$

$$\frac{\partial^2 E(t)}{\partial y_{il}^2(t)} = -\frac{2}{\lambda} \sum_{k=1, k \ne i}^{N} \frac{1}{d_{ki}\, d^{*}_{ki}} \left[ (d_{ki} - d^{*}_{ki}) - \frac{(y_{il} - y_{kl})^2}{d^{*}_{ki}} \left( 1 + \frac{d_{ki} - d^{*}_{ki}}{d_{ki}} \right) \right]. \qquad (3.51)$$

With this gradient-descent method it is possible to get stuck in a local minimum of the error surface while searching for the minimum of E. This is a disadvantage, because multiple experiments with different random initializations are necessary to find the minimum. However, it is also possible to estimate a good initialization based on information obtained from the data.

3.4.3 Fuzzy Sammon mapping

As mentioned in the introduction of this section, Sammon's mapping has several drawbacks. To avoid them, a modified mapping method was designed which takes into account the basic property of fuzzy clustering algorithms that only the distances between the data points and the cluster centers are considered to be important [3]. The modified algorithm, called Fuzzy Sammon mapping, uses only N × c distances, weighted by the membership values, similarly to equation (3.19):

$$E_{fuzz} = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ki})^m \big( d(x_k, v_i) - d^{*}_{ki} \big)^2, \qquad (3.52)$$

with d(x_k, v_i) the distance between data point x_k and cluster center v_i in the original n-dimensional space, and d*_ki = d*(y_k, z_i) the Euclidean distance between the projected data point y_k and the projected cluster center z_i in the q-dimensional space. In this way, every cluster is represented by a single point in the projected two-dimensional space, independently of the shape of the original cluster. The Fuzzy Sammon mapping algorithm is similar to the original Sammon mapping, but in this case the projected cluster centers are recalculated in every iteration, after the adaptation of the projected data points.
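Sammon's stress of equations (3.47)–(3.48) can be evaluated directly from the two sets of pairwise distances. The sketch below assumes numpy and scipy are available; it only computes the stress value, not the iterative mapping itself.

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_high, Y_low, eps=1e-12):
    """Sammon's stress E between original and projected inter-point distances."""
    d = pdist(X_high)         # distances d_ij in the original n-dimensional space
    d_star = pdist(Y_low)     # distances d*_ij in the projected q-dimensional space
    lam = d.sum()             # normalizing constant lambda
    return ((d - d_star) ** 2 / (d + eps)).sum() / lam
```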

The recalculation is based on the weighted mean formula of the fuzzy clustering algorithms, described in Section 3.2.3 (equation 3.19). The membership values of the projected data can then be recalculated with the standard equation for the membership values,

$$\mu^{*}_{ki} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{d^{*}(y_k, z_i)}{d^{*}(y_k, z_j)} \right)^{2/(m-1)}}, \qquad (3.53)$$

where U* = [µ*_ki] is the partition matrix containing the recalculated memberships. The plot gives only an approximation of the high-dimensional clustering in a two-dimensional space. To measure the quality of this mapping, an evaluation function that determines the mean square error between the original and the recalculated memberships can be defined as follows:

$$P = \| U - U^{*} \|. \qquad (3.54)$$

In the next chapter, the cluster algorithms will be tested and evaluated. The PCA and the (Fuzzy) Sammon mapping methods will be used to visualize the data and the clusters.


Chapter 4

**Experiments and results of customer segmentation**

In this chapter, the cluster algorithms will be tested and their performance will be measured with the validation methods proposed in the previous chapter. The best performing clustering method will be used to determine the segments. The chapter ends with an evaluation of the segments.

4.1 Determining the optimal number of clusters

The disadvantage of the proposed cluster algorithms is that the number of clusters has to be given in advance. In this research the number of clusters is not known, so the optimal number of clusters has to be found with the validation methods of Section 3.3. For each algorithm, calculations were executed for every number of clusters c ∈ [2, 15]. To find the optimal number of clusters, a process called the elbow criterion is used. The elbow criterion is a common rule of thumb to determine what number of clusters should be chosen: one should choose a number of clusters so that adding another cluster does not add sufficient information. More precisely, when graphing a validation measure against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain drops, giving an angle in the graph (the elbow). Unfortunately, this elbow cannot always be unambiguously identified. To demonstrate the working of the elbow criterion, the feature values that represent the call behavior of the customers, as described in Section 2.1.2, are used as input for the cluster algorithms. From the 800,000 business customers of Vodafone, 25,000 customers were randomly selected for the experiments; more customers would lead to computational problems. First, the K-means algorithm is evaluated, and the values of the validation methods are plotted as a function of the number of clusters.
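The elbow search described above can be sketched as a simple loop over the candidate numbers of clusters. The example below is hypothetical: it uses scikit-learn's KMeans and the within-cluster sum of squares as the validation measure, and random stand-in data instead of the real Vodafone features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in feature matrix (1000 customers, 12 features); not real Vodafone data.
X = np.random.default_rng(0).random((1000, 12))

inertia = {}
for c in range(2, 16):                                   # c in [2, 15]
    km = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X)
    inertia[c] = km.inertia_                             # within-cluster sum of squares

# Plotting `inertia` against c and looking for the angle in the curve
# mirrors the elbow criterion described in the text.
```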

The value of the Partition Coefficient is 1 for every number of clusters, and the Classification Entropy is always 'NaN'. This is caused by the fact that these two measures were designed for fuzzy partitioning methods, while here the hard partitioning algorithm K-means is used. In Figure 4.1, the values of the Partition Index, the Separation Index and Xie and Beni's Index are shown. Note again that no validation index is reliable by itself.

Figure 4.1: Values of the Partition Index, the Separation Index and the Xie-Beni Index

Therefore, all the validation indexes are shown. The optimum can differ between validation methods, which means that it can only be detected by comparing all the results. To find the optimal number of clusters, partitions with fewer clusters are considered better when the differences between the values of the validation measure are small. Figure 4.1 shows that for the PI and SI the number of clusters can easily be set to 4. For the Xie and Beni index this is much harder: the elbow can be found at c = 3, c = 6, c = 9 or c = 13, depending on the definition and parameters of an elbow. Figure 4.2 shows more informative plots: Dunn's index and the Alternative Dunn index confirm that the optimal number of clusters for the K-means algorithm should be chosen as 4. The values of all the validation measures for the K-means algorithm are summarized in Table 4.1.


Figure 4.2: Values of Dunn's Index and the Alternative Dunn Index

Table 4.1: The values of all the validation measures with K-means clustering

This process can be repeated for all the other cluster algorithms. For the K-means algorithm, the optimal number of clusters is chosen as c = 4; for the other hard algorithm, K-medoid, and for the Gath-Geva algorithm, the results can be found in Appendix B. It is also possible to determine the optimal number of clusters for the fuzzy clustering algorithms with this method, and the validation measures designed for fuzzy partitions can now be used as well. To illustrate this, the results of the Gustafson-Kessel algorithm are shown. In Figure 4.3 the results of the Partition Coefficient and the Classification Entropy are plotted. The main drawback of the PC is its monotonic decrease with c, which makes it hard to detect the optimal number of clusters; the same problem holds for the CE, which increases monotonically, caused by the lack of a direct connection to the data. The optimal number of clusters can therefore not be determined from those two validation measures.

Figure 4.3: Values of the Partition Coefficient and the Classification Entropy with Gustafson-Kessel clustering

Figure 4.4 gives more information about the optimal number of clusters. For the PI and the SI, the points at c = 3, c = 6 and c = 11 can be seen as elbows, and the XBI reaches a local minimum at c = 6. In Figure 4.5, the Alternative Dunn index has an elbow at the point c = 3; however, it is not known how reliable its results are. On the other hand, the Dunn index also indicates that the optimal number of clusters should be c = 6, so the optimal number of clusters for the Gustafson-Kessel algorithm is chosen as six. Compared to the hard clustering methods, it remains difficult to find the optimal number of clusters. The results of the validation measures for the Gustafson-Kessel algorithm are written in Table 4.2.

Figure 4.4: Values of the Partition Index, the Separation Index and the Xie-Beni Index with Gustafson-Kessel clustering

Figure 4.5: Values of Dunn's Index and the Alternative Dunn Index with Gustafson-Kessel clustering

Table 4.2: The values of all the validation measures with Gustafson-Kessel clustering

4.2 Comparing the clustering algorithms

The optimal number of clusters can be determined with the validation methods. As examined in the previous section, the optimal number of clusters was found at c = 4 or c = 6, depending on the clustering algorithm. The validation measures can also be used to compare the different cluster methods; the validation measures for c = 4 and c = 6 of all the clustering methods are collected in Tables 4.3 and 4.4. These tables show that the PC and CE are useless for the hard clustering methods K-means and K-medoid. On the score of the three most used indexes, the Separation index, Xie and Beni's index and Dunn's index, one can conclude that for c = 4 the Gath-Geva algorithm has the best results and for c = 6 the Gustafson-Kessel algorithm. To visualize the clustering results, the visualization methods described in Section 3.4 can be used; with these methods, the data set can be reduced to a two-dimensional space. To avoid visibility problems (plotting too many values would cause one big cloud of data points), only 500 values, representing 500 customers, from this two-dimensional data set are randomly picked for the plots.

Table 4.3: The numerical values of the validation measures for c = 4

Table 4.4: The numerical values of the validation measures for c = 6

These visualization methods are used for the following plots. Figures 4.6–4.10 show the different clustering results for c = 4 and c = 6 on the data set. For the K-means and the K-medoid algorithms, Sammon's mapping gives the best visualization of the results; for the other cluster algorithms, the Fuzzy Sammon mapping gives the best projection with respect to the partitions of the data set. Figures 4.6 and 4.7 show that the hard clustering methods can find a solution: for both situations the clusters are well separated, and none of the clusters contains substantially more or fewer customers than the other clusters. The plot of the Fuzzy C-means algorithm, in Figure 4.8, shows unexpected results. For the situation with 4 clusters, only 2 clusters are clearly visible; a detailed look at the plot shows that there are actually 4 cluster centers, but two of them are situated at almost the same location and are nearly invisible. In the situation with 6 clusters, one can see three big clusters, with one small cluster inside one of the big clusters. This implies that the Fuzzy C-means algorithm is not able to find good clusters for this data set. In Figure 4.9, the results of the Gustafson-Kessel algorithm are plotted. Note that the cluster in the bottom left corner and the cluster in the top right corner of Figure 4.9 are also maintained in the situation with 6 clusters, but are separated there.

Figure 4.6: Result of the K-means algorithm for the clustering problem

Figure 4.7: Result of the K-medoid algorithm

Figure 4.8: Result of the Fuzzy C-means algorithm

Figure 4.9: Result of the Gustafson-Kessel algorithm

This may indicate that the data points in these clusters represent customers that differ on multiple fields from the other customers of Vodafone. The results of the Gath-Geva algorithm, visualized in Figure 4.10, look for the situation c = 4 similar to the result of the Gustafson-Kessel algorithm. The result for the c = 6 situation is remarkable: here clusters also appear inside other clusters. The fact that this is the case in the two-dimensional plot indicates that a clustering with six clusters with the Gustafson-Kessel algorithm is not a good solution; in the real high-dimensional situation, however, the clusters are not subsets of each other. With the results of the validation methods and the visualization of the clustering, one can conclude that there are two possible best solutions: the Gath-Geva algorithm for c = 4 and the Gustafson-Kessel algorithm for c = 6. In the next section, these two different partitions will be compared more closely with each other.

Figure 4.10: Result of the Gath-Geva algorithm

4.3 Designing the segments

To define which clustering method will be used for the segmentation, a closer look at the meaning of the clusters is needed. To determine which partitioning will be used to define the segments, one can look at the distances from the points to each cluster. In Figures 4.11 and 4.12, two box plots of the distances from the data points to the cluster centers are plotted for each cluster; the box indicates the upper and lower quartiles. In both situations, the results show that the clusters are homogeneous with respect to the distances to the cluster, so one cannot distinguish between the two cluster algorithms on this basis. Another way to view the differences between the cluster methods is to profile the clusters. For each cluster, a profile can be made by drawing a line between all normalized feature values (each feature value is represented on the x-axis) of the customers within this cluster. The resulting profiles are shown for the Gath-Geva algorithm with c = 4 and for the Gustafson-Kessel algorithm with six clusters in Figures 4.13 and 4.14.

Figure 4.11: Distribution of distances from cluster centers within clusters for the Gath-Geva algorithm with c = 4

Figure 4.12: Distribution of distances from cluster centers within clusters for the Gustafson-Kessel algorithm with c = 6

The profiles of the different clusters do not differ much in shape. Most of the lines in one profile are drawn closely together, which means that the customers in one profile have similar feature values. However, in each cluster at least one value differs sufficiently from the values of the other clusters. This confirms the assumption that customers of different clusters indeed have a different usage behavior.

Figure 4.13: Cluster profiles for c = 4

Figure 4.14: Cluster profiles for c = 6

More relevant plots are shown in Figures 4.15 and 4.16. Here the mean of all the lines (equivalent to the cluster center) was calculated and a line between all the (normalized) feature values was drawn. The differences between the clusters are visible in some of the feature values. For instance, in the situation with four clusters, Cluster 1 contains customers with a high value at feature 8 compared with the other clusters. Cluster 2 has high values at features 6 and 9, while Cluster 3 shows peaks at features 2 and 12. The fourth and final cluster has high values at features 8 and 9.

Figure 4.15: Cluster profiles of centers for c = 4

Figure 4.16: Cluster profiles of centers for c = 6

With the previous clustering results, validation measures and plots, it is not possible to decide which of the two clustering methods gives the better result. Therefore, both results will be used as a solution for the customer segmentation. Table 4.5 shows the result of the customer segmentation for the Gath-Geva algorithm with c = 4 and the Gustafson-Kessel algorithm with c = 6. The feature numbers correspond to the feature numbers of Section 2.1.2: feature 1 is the call duration, feature 2 the received voice calls, feature 3 the originated calls, feature 4 the daytime calls, feature 5 the weekday calls, feature 6 the calls to mobile phones, feature 7 the received sms messages, feature 8 the originated sms messages, feature 9 the international calls, feature 10 the calls to Vodafone mobiles, feature 11 the unique area codes and feature 12 the number of different numbers called.

Table 4.5: Segmentation results

In words, the segments can be described as follows. For the situation with 4 segments:

• Segment 1: This segment contains customers with a relatively low number of voice calls. These customers call more in the evening (in proportion) and to fixed lines than other customers. Their sms usage is higher than normal. The number of international calls is low.

• Segment 2: This segment contains customers with an average voice call usage.

They call often during daytime and call more than average to international numbers. They also send and receive many sms messages. Remarkable is the fact that they do not have as many contacts as the number of calls would suggest.

• Segment 3: The customers in this segment make relatively many voice calls. The duration of their voice calls is longer than average. They also call to many different areas. They do not send and receive many sms messages.

• Segment 4: These customers are the average customers. None of the feature values is high or low.

For the situation with 6 segments, the customers in the segments can be described as follows:

• Segment 1: In this segment are customers with a relatively low number of voice calls. They also receive and originate a low number of sms messages. The average call duration is low.

• Segment 2: This segment contains customers with a relatively high number of contacts. In proportion, they make more international phone calls than other customers. Their average call duration is also lower than average.

• Segment 3: The customers in this segment make relatively many voice calls. They also send and receive many sms messages. They also have more contacts with a Vodafone mobile.

• Segment 4: These customers originate many voice calls. They call often to mobile phones during daytime. These customers do not call to many different numbers. In proportion, their sms usage is relatively high.

• Segment 5: These customers do not receive many voice calls. They have a relatively small number of contacts. Their sms usage is low.

• Segment 6: These customers originate and receive many voice calls. These customers call to many different numbers and have a lot of contacts who are Vodafone customers. Their call duration is high. The percentage of international calls is high.

In the next chapter, the classification method Support Vector Machine will be explained. This technique will be used to classify/estimate the segment of a customer from personal information such as age, gender and lifestyle (the customer data of Section 2.1.3).

Chapter 5

**Support Vector Machines**

A Support Vector Machine (SVM) is an algorithm that learns by example to assign labels to objects [16]. More formally, a Support Vector Machine is a mathematical entity: an algorithm for maximizing a particular mathematical function with respect to a given collection of data. In this research, a Support Vector Machine is used to recognize the segment of a customer by examining thousands of customers of each segment, described by the customer data features of Section 2.1.3. The basic ideas of Support Vector Machines can be explained without any equations. The next few sections describe the four basic concepts:

• The separating hyperplane
• The maximum-margin hyperplane
• The soft margin
• The kernel function

For now, to allow an easy, geometric interpretation of the data, imagine that there exist only two segments and that the customer data consist of two feature values, age and income, which can easily be plotted.

5.1 The separating hyperplane

A human being is very good at pattern recognition. In Figure 5.1, the green dots represent the customers that are in segment 1 and the red dots are the customers that are in segment 2. The goal of the SVM is to learn to tell the difference between the two groups and, given an unlabeled customer such as the one labeled 'Unknown' in Figure 5.1, to predict whether it corresponds to segment 1 or segment 2. Even a quick glance at Figure 5.1a shows that the green dots form a group and the red dots form another group, and that the two groups can easily be separated by drawing a line between them (Figure 5.1b). Once this line is drawn, predicting the label of an unknown customer is simple: one only needs to ask whether the new customer falls on the segment 1 or the segment 2 side of the separating line.

Figure 5.1: Two-dimensional customer data of segment 1 and segment 2 — (a) two-dimensional representation of the customers, (b) a separating hyperplane

Figure 5.2: Separating hyperplanes in different dimensions — (a) one dimension, (b) three dimensions

the line that separates the two segments and adopts the maximal distance from any of the given customers (see Figure 5. First at all. a SVM 55 . A logical way of selecting the optimal line.2). the SVM is able to predict the unknown segment of the customer in Figure 5. is in many ways. However. This theorem. it is not reasonable to expect that the SVM can classify well if the training data set is prepared with a diﬀerent protocol then the test data set. In other words. It is not surprising that a theorem of the statistical learning theory is supporting this choice [6]. is a common way of classiﬁcation. However. For example. since it is not reasonable that a Support Vector machine trained on customer data is able to classify diﬀerent car types. is selecting the line that is.2 The question is which line should be chosen as the optimal classiﬁer and how should the optimal line be deﬁned. the theorem is based on the fact that the data on which the SVM is trained are drawn from the same distribution as the data it has to classify. Consider again the classiﬁcation problem of Figure 5. On the other hand. the key (a) Many possibilities (b) The maximum-margin hyperplane Figure 5.5. ’in the middle’. and therefore not unique to the SVM. However. This is of course logical. By selecting this hyper plane. More relevantly. The vectors (points) that constrain the width of the margin are the support vectors. roughly speaking. there are a some remarks and caveats to deal with. there are an inﬁnite number of possible lines.2 The maximum-margin hyperplane The concept of treating objects as points in a high-dimensional space and ﬁnding a line that separates them. the SVM diﬀers from all other classiﬁer methods by virtue of how the hyperplane should be selected. By deﬁning the distance from the hyperplane to the nearest customer (in general an expression vector) as the margin of the hyperplane.3: Demonstration of the maximum-margin hyperplane to the success of Support Vector Machines. as portrayed in Figure 5. the SVM selects the maximum separating hyperplane.1a The goal of SVM is to ﬁnd a line that separates the segment 1 customers from the segment 2 customers.1a. the theorem of a SVM indicates that the two data sets has to be drawn from the same distribution.

the example data will be simpliﬁed even further. In other words. many real data sets are not cleanly separable by a straight line. the data contains an error object.4a will be separated in the way it is illustrated in Figure 5.3 The customer can be seen as an outlier and resides on the same side of the line with customers of segment 1. instead of a two-dimensional data set. the soft margin speciﬁes a trade-oﬀ between hyper plane violations and the size of the margin. In this ﬁgure. 5. A intuitively way to deal with the problems of errors is designing the SVM in such a way that it allows a few anomalous customers to fall on the ’wrong side’ of separation line. 5. a SVM should not allow too many misclassiﬁcation. by the fact that a large margin will be achieved with respect to the number of correct classiﬁcations. that (a) Data set containing one error (b) Separating with soft margin Figure 5. controls the number of customers that is allowed to violate the separation line and determines how far across the line they are allowed. With the soft margin. for example the data of Figure 5. Setting this parameter is a complicated process. roughly.4a. Of course. However. the theory assumed that the data can be separated by a straight line.3 The soft margin So far. The soft margin allows a small percentage of the data points to push their way through the margin of the separating hyperplane without aﬀecting the ﬁnal result. the data set of Figure 5. This can be achieved by adding a ’soft margin’ to the SVM. Assume that.4: Demonstration of the soft margin with the introduction of the soft margin.does not assume that the data is drawn from a normal distribution. a user-speciﬁed parameter is involved that controls the soft margin and. Note.4 The kernel functions To understand the notion of a kernel function. there 56 .

In general. No single point can separate the two segments and introducing a soft margin would not help. but exponentially. the data set must contain consistent labels. which illustrates an non separable data set. the SVM should be a perfect classiﬁer. the SVM can separate the data in two segments by one straight line. the kernel function can be seen as a mathematical trick for the SVM to project data from a low-dimensional space to a space of higher dimensions.5: Demonstration of kernels some extra examples will be given. The kernel function adds an extra dimension to the data.4. in this case by squaring the one dimensional data set. So. but the projected hyperplane is found by a very high dimen57 . If one chooses a good kernel function.4.1. there are some drawbacks of projecting data in a very high-dimensional space to ﬁnd the separating hyperplane. The ﬁgure contains the same data as Figure 5. consider the situation of Figure 5. The result is plotted in Figure 5. In Figure 5. In that case.4 the situation is drawn when the data is project into a space with too many dimensions. in theory. Of course. To understand kernels better. which means that two identical data points may not have diﬀerent labels. (a) None separable dataset (b) Separating previously non separable dataset Figure 5. as shown in the ﬁgure. the data will become separable in the corresponding higher dimension. Now. A kernel function provides a solution to this problem. as seen before in Figure 5. the number of possible solutions also increases. In Figure 5. However. Within the new higher dimensional space. but with a projection of the SVM hyperplane in the four-dimensional space back down to the original two-dimensional space. It is not possible to draw the data in the 4 dimensional space.4.4. this data can be projected to a four-dimensional space. the ﬁrst problem is the so called curse of dimensionality: as the numbers of variables under consideration increases. With a relative simple kernel function. it becomes harder for any algorithm to ﬁnd a correct solution. the separating hyperplane was a single point. the result is shown as the curved line in Figure 5. Consequently.4 is plotted a two-dimensional data set. it is possible to prove that for any data set exists a kernel function that allows the SVM to separate the data linearly in a higher dimension.is a one-dimensional data set.

is actually the best kernel function that exists. the method described above. This phenomenon is called over ﬁtting. probably an inﬁnite number. but without introducing too many irrelevant dimensions. trial and error. in most cases. this is a time-consuming process and it is not guaranteed that the best kernel function that was found during cross-validation. By using the cross-validation method. There exists another large practi- (a) Linearly separable in four dimensions (b) A SVM that has over ﬁt the data Figure 5. The SVM will not function well on new unseen unlabeled data. xj ) = Φ(xi )T Φ(xj ). mainly gives suﬃcient results. In this research a SVM will be experimented with a variety of ’standard’ kernel functions. The default and recommended kernel functions were used during this research and will be discussed now. It is more likely that there exists a kernel function that was not tested and performs better than the selected kernel function. Practically. The vectors are mapped into a higher dimensional space by the function Φ. but a few kernel functions have been found to work well in for a wide variety of applications [16]. This results in boundaries which are to speciﬁc to the examples of the data set. Unfortunately. However.sional kernel. • Linear: which function is deﬁned by: K(xi .1) where xi are the training vectors.2) . xj ) = xT xj . i 58 (5. xj ) = (γxT xj + c0 )d .6: Examples of separation with kernels cal diﬃculty when applying new unseen data to the SVM. the answer too this question is. This problems relies on the question how to choose a kernel function that separates the data.3) (5. (5. i • Polynomial: the polynomial kernel of degree d is of the form K(xi . the optimal kernel will be selected on a statistical way. Many kernel mapping functions can be used. In general the kernel function is deﬁned by: K(xi .

• Radial basis function: also known as the Gaussian kernel is of the form K(xi . as in our case 4 or 6 segments? There are several approaches proposed.5) i When the sigmoid function is used. How does a SVM discriminate between a large variety of classes. but two methods are the most popular and most used [16]. xj ) = tanh(γxT xj + c0 ). The ﬁrst approach is to train multiple.7: A separation of classes with complex boundaries 5. one-versus-all classiﬁers. (5. 59 .5 Multi class classiﬁcation So far. which is also used in neural networks. The concept of a kernel mapping function is very powerful. xj ) = exp(−γ||xi − xj ||2 ).4) • Sigmoid: the sigmoid function. B and C. For example. where k is the number of classes. It allows a SVM to perform separations even with very complex boundaries as shown in Figure 5. In this research the constant c0 is set to 1. Another simple approach is the one-versus-one where k(k − 1)/2 models are constructed. (5. is deﬁned by K(xi . one can regard it with a as a two-layer neural network. ”Is it A?”. the idea of using a hyperplane to separate the feature vectors into two groups was described. if the SVM has to recognize three classes. one can simply train three separate SVM to answer the binary questions.7 Figure 5. ”Is it B?” and ”Is it C?”. A. but only for two target categories. In this research the one-verses-one technique will be used.

cross-validation is used to evaluate the ﬁtting provided by each parameter value set tried during the experiments. the training set. With the validation set. the actual performance of the SVM will be measured after the SVM is trained. Figure 6. Diﬀerent parameter values may cause under or over ﬁtting. The training of the SVM will be stopped when the test error reached a local 60 . By K-fold cross validation the training dataset will be Figure 6.1 demonstrates how important the training process is. the test set and the validation set.1 K-fold cross validation To avoid over ﬁtting. The test set will be used to estimate the error during the training of the SVM.1: Under ﬁtting and over ﬁtting divided into two groups. The training set will be used to train the SVM.Chapter 6 Experiments and results of classifying the customer segments 6.

2% 20 43. For each of K experiments. 6.2 Parameter setting In this section. The error is calculated by taking the average oﬀ all K experiments.7% 200 40. 4 segments 61 .1% Table 6. By K-fold cross validation.8% 500 36.2: Determining the stopping point of training the SVM data set is created.0% 50 42. denoted by C. For the situation with 4 clusters. The linear Kernel function itself has no parameters.2. K-1 folds will be used for training and the remaining one for testing.2 the results for the diﬀerent C-values are summarized. In this Figure 6.3: A K-fold partition of the dataset research. In table 6. Figure 6. The advantage of K-fold cross validation is that all the examples in the dataset are eventually used for both training and testing.1 and table 6.1: Linear Kernel.minimum. The only parameter that can be researched is the soft margin value of the Support Vector Machin.0% 10 43. see Figure 6. K is set to 10. the C 1 42.4% 100 41. the optimal parameters for the Support Vector Machine will be researched and examined.1% 2 42.3 illustrates this process. Each kernel function with its parameters will be tested on their performance. a k-fold partition of the Figure 6.6% 5 43.

6% 76.1% 73. 43.3% 7 73.1% 72.0% 75.4% 75.1% 100 50.4 1 76.C 1 28.4 = 0.0 = 1.8% 2 77.5% 74.0% 100 27.7% 500 26.8 = 1. For the polynomial kernel function.1% 4 73.2% 78.6% 2 74.4% 77. 6 segments best results.6 = 0.4.3% 73.0% 75.8% 100 70.5 and 6.3% 75. 6 segments optimal value for the soft margin is C = 10 and by using the 6 segments C = 50.6% 500 21.5% 74.1% 74. The correct number of classiﬁcations are respectively. the optimal number for the maximal margin will be determined.0% 73.0% 76.9% 75.1% 2 74. there are two parameters.0% 74.0% 75. Therefor.5% 50 72.2% 500 53.9% 5 74.3% 76. 6 segments 62 .2% 76.4% 5 30.1% 74.8% 74.4% 5 76.0% 75.3 and 6.4: Average C-value for polynomial kernel.2% 3 78.3% 74.0% 75.4% 74.4% 5 75.6% 200 63.6% 74.6% 20 73.4 1 75. The results are shown in tables 6.7% 75. denoted by d and the width γ.1% 75.8% 74.3% 76.6% 75.1% 75.0% 50 75.2: Linear Kernel. 4 segments γ γ γ γ γ γ d = 0.6: Polynomial kernel.2% 76.3% 3 75.6% 77.7% Table 6.2 = 1.1% 20 75.8% 75.2% 73.6% 200 27.0% 75.8% 73.0% 74.8 = 1.0% 72.5% 6 76.4% 76.3: Average C-value for polynomial kernel.2% 2 76.0% 76.2 = 1.3% 76.4% 6 74.1% 74.1% 77.9% 74.6% 200 42.2% 74.1% 74.5% 72.8% 75.8% 74.2% 4 76.2% 76.3% 20 31.6 = 0.0% Table 6.0%.1% 75.4% 50 32.5: Polynomial kernel.9% 10 31.3% 7 75.6.2% and 32. For the situation with γ γ γ γ γ γ d = 0.4 = 0.9% Table 6. The average value for each soft margin C can be found in the tables 6.4% 76.2% 75.2% 74.0% 74.6% 75.1% 72.3% 10 75.0% 5 75.8% 75.0 = 1. This is done by multiple test runs with random values for d and γ.9% Table 6.8% Table 6.6% 10 74. 4 segments C 1 70.9% 74.8% 76.1% 78.0% 78. The number of degrees.1% 76.8% 74. These C-values are used to ﬁnd out which d and γ give the C 1 73.9% 2 29.2% 75.2% 76.9% 74.

3 42.1 40.6 77. with 80.3 59.6 26.5 29.2 38.4 = 0.1 51.6 57.9 45.6 50. The results of the Radial Basis function are given in table 6. The results are given in table 6.0 80.4 76.9 76.6 43.6 500 38.0 Table 6.2 79.6 68.3 73.3 66.5 80.3 58.4 = 0.7: Radial basis function.7 30.7 10 58.1 51.9 78. the radial basis function has only one variable.7 48.1 77.2%.2 = 1. The C = 0.6 74.4 100 60. The confusion matrix for both situations.0 56.3 54.3 200 52.5 27.8 80. This corresponds to the cluster in the top right corner and the cluster in the bottom of Figures 4. table 6. 4 segments γ γ γ γ γ γ C = 0.5 52. 6 segments best result with 4 segments is 80.4 79.10.3%.8 61.7 46.7 and table 6.4 70.0 20 56.2 2 53. The following kernel function.4 5 57.4 79.3 78. while there are two extra clusters.3 79. show that there are two clusters which can easily be classiﬁed with the customer proﬁle.6 = 0.4 Table 6.9 and 4. 63 .9 44.8 69. The sigmoid function has also only 1 variable.6 72.8 76.6 51.7 62.4 1 80.8 = 1.5 41.1 79.4 59.2 = 1.6 52.3 64.0 50 65.9 72.0 = 1. Remarkable is the fact that the diﬀerence is small between the two situations.5 54.2 59.2 44.6 53.4 1 58.7 44.4 20 76.4 2 77.0 γ γ γ γ γ γ Table 6.0 47.0 39.5 100 52.3 61.9 and 6.2 44.2 500 37.7 55.1 71.3 79.0 73.1 69.8 = 1.3 30.7 72.7 40.0 42.5%.6 59.5 38.9 34.0 52.5 30.1% and 44.4 200 51.2 79.0 61.8: Radial basis function.4 1 73.0 71.9 34. This means that the Radial basis function has the best score for both situations.9 80.7 78.3 79. the optimal score is 78.8 73.6 52.1 48.7 80.3 31.4 47.11 and 6.9 50.4 66.5 74. namely γ.8 50 57.8 49.0 68.1 70.8 65.2 = 1.0 80.6 72.0 41.4 500 40.0 55.8 = 1.3 26.3 72.5 64.1 44.1 20 68.4 segments.2 47.9: Sigmoid function.9 5 76. 4 segments correct.0 = 1.6 77.3 61.4 66.0 80.12.4 = 0.7 50 73.5 57.8.9 73.3% and 78.1 58.6 = 0.8 54.5 5 72.1 100 30.4 60.5%.1% and for 6 segments 76.6 69.7 74.5 52.1 60.0 57.5 68.2 46.1 70.1 63.4 2 79.2 78.7 61.9 68.0 = 1.6 = 0.2 63.0 70.6% of the data is classiﬁed γ γ γ γ γ γ C = 0.2 200 47. by the Sigmoid function.9 73. with 6 segments the best score is 78.2 80.5 10 78. with respectively 4 and 6 segemtents.10 The results show that 66.2 69.5 78.7 10 70.2 64.5 60.8 76.7 54.4 74.

Table 6.10: Sigmoid function, 6 segments (percentage of correct classifications for different soft-margin values C and widths γ)
Table 6.11: Confusion matrix, 4 segments (actual segments in rows, predicted segments in columns, in percent)
Table 6.12: Confusion matrix, 6 segments (actual segments in rows, predicted segments in columns, in percent)
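Confusion matrices such as Tables 6.11 and 6.12 follow directly from the predictions of a fitted classifier. A minimal sketch, assuming placeholder arrays y_true and y_pred for the actual and predicted segments:

    # Sketch: confusion matrix as row percentages (one row per actual segment).
    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 0, 1])   # placeholder actual segments
    y_pred = np.array([0, 0, 1, 2, 2, 2, 3, 0, 0, 1])   # placeholder predicted segments

    cm = confusion_matrix(y_true, y_pred)
    row_pct = 100 * cm / cm.sum(axis=1, keepdims=True)  # percentage of each actual segment
    for i, row in enumerate(row_pct):
        print("Segment %d:" % (i + 1), " ".join("%5.1f%%" % v for v in row))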

6.3 Feature Validation

In this section, the features are validated. The importance of each feature is measured by leaving one feature out of the feature vector and training the SVM without this feature. The results of both situations, with 4 and 6 segments, are shown in Figures 6.4 and 6.5.

Figure 6.4: Results while leaving out one of the features with 4 segments
Figure 6.5: Results while leaving out one of the features with 6 segments

The results show that Age is an important feature for classifying the right segment. This is in contrast with the type of telephone, which increases the result by only tenths of a percent. Nevertheless, each feature increases the result, and therefore each feature is useful for the classification.
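The leave-one-feature-out validation described in this section can be sketched as follows. The feature names, kernel parameters and data are placeholders and do not reproduce the thesis results.

    # Sketch: retrain the SVM with each feature removed and compare accuracies.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    feature_names = ["age", "gender", "subscription", "phone_type",
                     "company_size", "region"]            # assumed profile features
    X = rng.normal(size=(600, len(feature_names)))        # placeholder data
    y = rng.integers(0, 4, size=600)                      # placeholder segments

    def accuracy(features):
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma=0.5))
        return cross_val_score(model, features, y, cv=5).mean()

    baseline = accuracy(X)
    print("all features: %.1f%%" % (100 * baseline))
    for j, name in enumerate(feature_names):
        acc = accuracy(np.delete(X, j, axis=1))           # drop one feature column
        print("without %-13s %.1f%% (change %+.1f points)"
              % (name + ":", 100 * acc, 100 * (acc - baseline)))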

Chapter 7
Conclusions and discussion

This chapter concludes the research and the corresponding results, and gives some recommendations for future work.

7.1 Conclusions

The first objective of our research was to perform automatic customer segmentation based on usage behavior, without the direct intervention of a human specialist. The customer segments were constructed by applying several clustering algorithms to selected and preprocessed data from the Vodafone data warehouse. There are various ways of selecting suitable feature values for the clustering algorithms. In this research, the feature values were selected in such a way that they describe the customer's behavior as completely as possible; it is, however, not possible to include all possible combinations of usage behavior characteristics within the scope of this research. This selection is vital for the resulting quality of the clustering: one different feature value will result in different segments. The result of the clustering can therefore not be regarded as universally valid, but merely as one possible outcome.

To find the optimal number of clusters, the so-called elbow criterion was applied. Unfortunately, this criterion could not always be unambiguously identified. For some algorithms the elbow was located at c = 4, and for other algorithms at c = 6. Another problem was that the location of the elbow could differ between the validation measures for the same algorithm. This led to two solutions for the customer segmentation, with respectively four segments and six segments.
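The elbow criterion mentioned above can be sketched with any clustering algorithm that reports a within-cluster error. In the sketch below, K-means and random placeholder data stand in for the algorithms and usage features used in the thesis.

    # Sketch of the elbow criterion: within-cluster sum of squares versus c.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 8))       # placeholder usage-behavior features

    for c in range(2, 11):
        km = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X)
        print("c = %2d  within-cluster sum of squares = %.1f" % (c, km.inertia_))

    # The elbow is the point where adding another cluster no longer yields a
    # large decrease; on the thesis data it was found near c = 4 or c = 6.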

To determine which customer segmentation algorithm is best suited for a particular data set and a specific parameter setting, several widely established validation measures were employed. Not every validation method marked the same algorithm as the best algorithm, and it was not possible to determine one algorithm that was optimal for both c = 4 and c = 6. For the situation with four clusters the Gath-Geva algorithm appears to be the best algorithm, while the Gustafson-Kessel algorithm gives the best results with six clusters.

It is hard to compare the two clustering results because of the different number of clusters. The results show, however, that in both situations the clusters were well separated and clearly distinguished from each other. The corresponding segments differ on features such as the number of voice calls, call duration, sms usage, international calls, the number of different numbers called, and the percentage of weekday and daytime calls. Therefore, the clustering results were interpreted in a profiling format and a short characterization of each cluster was made.

The second part of the research was focused on profiling customers and finding a relation between the profile and the segments. The customer's profile was based on personal information of the customers and consists of the age, gender, subscription type, telephone type, company size, and residential area of the customer. A novel data mining technique, called Support Vector Machines, was used to estimate the segment of a customer based on this profile. As a comparison, both clustering results were used as a starting point for the segmentation algorithm. Four different kernel functions with different parameters were tested on their performance. It was found that the radial basis function gives the best result, with a classification of 80.3% for the situation with four segments and 78.5% for the situation with six segments.

It appeared that the resulting percentage of correctly classified segments was not as high as expected. A possible explanation could be that the features of the customer are not adequate for making a customer's profile. This is caused by the frequently missing data in the Vodafone data warehouse about the lifestyle, habits and income of the customers. A second reason for the low number of correct classifications is the fact that the usage behavior in the database corresponds to a telephone number, and this telephone number corresponds to a person. In real life, this telephone is maybe not used exclusively by the person (and the corresponding customer's profile) as stored in the database. Customers may lend their telephone to relatives, and companies may exchange telephones among their employees. In such cases, the usage behavior does not correspond to a single customer's profile and this impairs the classification process.

The last part of the research involves the relative importance of each individual feature of the customer's profile. By leaving out one feature value during classification, the effect of each feature value became visible. It was found that without the feature 'customer age' the resulting quality of the classification was significantly decreased. This implies that this feature bears some importance for the customer profiling and the classification of the customer's segment. On the other hand, leaving out a feature such as the 'telephone type' barely decreased the classification result; this and some other features increase the performance of the classification only slightly.
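The two-stage approach summarized above, clustering on usage behavior followed by predicting the segment from the customer profile, can be sketched as follows. K-means stands in for the fuzzy clustering algorithms used in the thesis, and all arrays are random placeholders for the Vodafone data.

    # Sketch of the overall pipeline: segment on usage, classify from profile.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)
    usage = rng.normal(size=(1000, 8))    # placeholder usage-behavior features
    profile = rng.normal(size=(1000, 6))  # placeholder profile features (age, ...)

    # Stage 1: segment customers on usage behavior.
    segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(usage)

    # Stage 2: predict the segment from the customer profile.
    X_tr, X_te, y_tr, y_te = train_test_split(profile, segments,
                                              test_size=0.3, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma=0.5))
    clf.fit(X_tr, y_tr)
    print("test accuracy: %.1f%%" % (100 * clf.score(X_te, y_te)))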

7.2 Recommendations for future work

Based on our research and experiments, it is possible to formulate some recommendations for obtaining more suitable customer profiling and segmentation.

The first recommendation is to use different feature values for the customer segmentation. This can lead to different clusters and thus different segments. To know the influence of the feature values on the outcome of the clustering, a complete data analysis research is required. Similarly, an enhanced and more precise analysis of the data warehouse will lead to improved features and, thus, to an improved classification. However, this is a complex course and it essentially requires the availability of high-quality features.

Another way of improving this research is to extend the number of cluster algorithms, for instance with mean shift clustering, hierarchical clustering or a mixture of Gaussians. To improve on determining the actual number of clusters present in the data set, more specialized methods than the elbow criterion could be applied. An interesting alternative is, for instance, the application of evolutionary algorithms, as proposed by Wei Lu [21]. Another way to validate the resulting clusters is to offer them to a human expert, and use his feedback for improving the clustering criteria. In this research, the results are given by a short description of each segment; a detailed data analysis of the meaning of the clusters is recommended.

To estimate the segment of the customer, other classification methods can also be used, such as neural networks, genetic algorithms or Bayesian algorithms. Of specific interest, within the framework of Support Vector Machines, is the application of miscellaneous (non-linear) kernel functions. In this way, a more detailed view of the clusters and their boundaries can be obtained. Extrapolating this approach, it is challenging to classify the profile of the customer based on the corresponding segment alone. One reason for this is the missing data in the Vodafone data warehouse: the customer profile used in this research is not sufficiently detailed to describe the wide spectrum of customers. Finally, we note that the study would improve noticeably by involving multiple criteria to evaluate the user behavior, rather than mere phone usage as employed here. However, it should be noted that the most obvious and best way to improve the classification is to come to a more accurate and precise definition of the customer profiles.
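As an illustration of the suggested alternative clustering algorithms, the sketch below runs mean shift, hierarchical (agglomerative) clustering and a mixture of Gaussians on placeholder data and compares the partitions with a silhouette score. The parameter choices are assumptions for the sake of the example, not results from the thesis.

    # Sketch: alternative clustering algorithms mentioned in the recommendations.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, MeanShift
    from sklearn.metrics import silhouette_score
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(5)
    X = rng.normal(size=(500, 8))         # placeholder usage-behavior features

    labelings = {
        "mean shift": MeanShift().fit_predict(X),
        "hierarchical": AgglomerativeClustering(n_clusters=6).fit_predict(X),
        "Gaussian mixture": GaussianMixture(n_components=6,
                                            random_state=0).fit_predict(X),
    }
    for name, labels in labelings.items():
        k = len(set(labels))
        if k > 1:
            print("%-16s %d clusters, silhouette %.2f"
                  % (name, k, silhouette_score(X, labels)))
        else:
            print("%-16s found a single cluster" % name)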

Bibliography

[1] Ahola, J. and Rinta-Runsala, E., Data mining case studies in customer profiling. Research report TTE1-2001-29, VTT Information Technology (2001).
[2] Amat, J.L., Using reporting and data mining techniques to improve knowledge of subscribers: applications to customer profiling and fraud management. J. Telecommun. Inform. Technol., no. 3 (2002), pp. 11-16.
[3] Balasko, B., Abonyi, J. and Balazs, F., Fuzzy Clustering and Data Analysis Toolbox for Use with Matlab.
[4] Bounsaythip, C. and Rinta-Runsala, E., Overview of Data Mining for Customer Behavior Modeling. Research report TTE1-2001-18, VTT Information Technology (2001).
[5] Bezdek, J.C. and Dunn, J.C., Optimal fuzzy partition: a heuristic for estimating the parameters in a mixture of normal distributions. IEEE Trans. Comput., vol. C-24 (1975), pp. 835-838.
[6] Dibike, Y.B., Velickov, S., Solomatine, D. and Abbott, M.B., Model induction with support vector machines: introduction and applications. J. Comput. in Civ. Engrg., vol. 15, iss. 3 (2001), pp. 208-216.
[7] Feldman, R. and Dagan, I., Knowledge discovery in textual databases (KDT). In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (1995), pp. 112-117.
[8] Frawley, W.J., Piatetsky-Shapiro, G. and Matheus, C.J., Knowledge discovery in databases. AAAI/MIT Press (1991), pp. 1-27.
[9] Gath, I. and Geva, A.B., Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Machine Intell., vol. 11, no. 7 (1989), pp. 773-781.
[10] Giha, F.E., Singh, Y.P. and Ewe, H.T., Customer profiling and segmentation based on association rule mining technique. In Proc. Softw. Engin. and Appl., 397 (2003).
[11] Gustafson, D.E. and Kessel, W.C., Fuzzy clustering with a fuzzy covariance matrix. In Proc. IEEE CDC (1979), pp. 761-766.
[12] Janusz, Data mining and complex telecommunications problems modeling. J. Telecommun. Inform. Technol., no. 3 (2003), pp. 115-120.
[13] Mali, K., Clustering and its validation in a symbolic framework. Pattern Recogn. Lett., vol. 24 (2003), pp. 2367-2376.
[14] Mattison, R., Data Warehousing and Data Mining for Telecommunications. Boston, London: Artech House (1997).
[15] McDonald, M. and Dunbar, I., Market Segmentation: How to Do It, How to Profit from It. Palgrave Publ. (1998).
[16] Noble, W.S., What is a support vector machine? Nature Biotechnology, vol. 24, no. 12 (2006), pp. 1565-1567.
[17] Shaw, M.J., Subramaniam, C., Tan, G.W. and Welge, M.E., Knowledge management and data mining for marketing. Decision Support Systems, vol. 31 (2001), pp. 127-137.
[18] Verhoef, P.C., Spring, P.N. and Hoekstra, J.C., The commercial use of segmentation and predictive modeling techniques for database marketing in the Netherlands. Decis. Supp. Syst., vol. 34 (2002), pp. 471-481.
[19] Virvou, M., Tsihrintzis, G., Savvopoulos, A. and Sotiropoulos, D., Constructing stereotypes for an adaptive e-shop using AIN-based clustering. In Proc. ICANNGA (2007).
[20] Wei, C.P. and Chiu, I.T., Turning telecommunications call detail to churn prediction: a data mining approach. Expert Syst. Appl., vol. 23 (2002), pp. 103-112.
[21] Wei Lu, A new evolutionary algorithm for determining the optimal number of clusters. In Proc. CIMCA/IAWTIC (2005).
[22] Weiss, G.M., Data mining in telecommunications. The Data Mining and Knowledge Discovery Handbook (2005), pp. 1189-1201.

Appendix A
Model of data warehouse

In this Appendix a simplified model of the data warehouse can be found. The white rectangles correspond to the tables that were used for this research. The most important data fields of these tables are written in each table. The colored boxes group the tables into categories. To connect the tables with each other, the relation tables (the red tables in the middle) are needed.

Figure A.1: Model of the Vodafone data warehouse

Appendix B
Extra results for optimal number of clusters

In this Appendix, the plots of the validation measures are given for the algorithms that were not discussed in Section 4.1.

The K-medoid algorithm:

Figure B.1: Partition index and Separation index of K-medoid

Figure B.2: Dunn's index and Alternative Dunn's index of K-medoid

The Fuzzy C-means algorithm:

Figure B.3: Partition coefficient and Classification Entropy of Fuzzy C-means

Figure B.4: Partition index, Separation index and Xie Beni index of Fuzzy C-means

Figure B.5: Dunn's index and Alternative Dunn's index of Fuzzy C-means
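The partition coefficient and classification entropy plotted in Figure B.3 are computed directly from the fuzzy membership matrix. A minimal sketch, assuming a placeholder membership matrix U with one row per customer and one column per cluster, each row summing to one:

    # Sketch: partition coefficient (PC) and classification entropy (CE).
    import numpy as np

    rng = np.random.default_rng(6)
    U = rng.random((1000, 4))
    U = U / U.sum(axis=1, keepdims=True)     # normalize rows to valid memberships

    n = U.shape[0]
    pc = np.sum(U ** 2) / n                  # near 1 for a crisp partition
    ce = -np.sum(U * np.log(U + 1e-12)) / n  # near 0 for a crisp partition
    print("partition coefficient: %.3f" % pc)
    print("classification entropy: %.3f" % ce)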
