
Problem statement: Identify which customers the telecom operator will retain and
predict which customers will churn.

Predictor variables:

Account Length
VMail Message
Day Mins
Eve Mins
Night Mins
Intl Mins
CustServ Calls
Day Calls
Day Charge
Eve Calls
Eve Charge
Night Calls
Night Charge
Intl Calls
Intl Charge
State
Area Code

Dependent variable: CHURN

Non-predictor: Phone Number

2) Before running the multicollinearity test, the data should be binned.

The unbinned variables are:

Descriptive Statistics

Variable             N      Minimum   Maximum   Mean       Std. Deviation
Account Length       3333             243       101.06     39.822
VMail Message        3333             51        8.10       13.688
Day Mins             3333   .0        350.8     179.775    54.4674
Eve Mins             3333   .0        363.7     200.980    50.7138
Night Mins           3333   23.2      395.0     200.872    50.5738
Intl Mins            3333             20        10.24      2.792
Day Calls            3333             165       100.44     20.069
Day Charge           3333   .00       59.64     30.5623    9.25943
Eve Calls            3333             170       100.11     19.923
Eve Charge           3333   .00       30.91     17.0835    4.31067
Night Calls          3333   33        175       100.11     19.569
Night Charge         3333   1.04      17.77     9.0393     2.27587
Intl Charge          3333   .0        5.4       2.765      .7538
Area Code            3333   408       510       437.18     42.371
Valid N (listwise)   3333

Multicollinearity on the binned data is shown below

From the above table, the VIF for every variable is less than 10, which means no
variable needs to be dropped because of multicollinearity
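
The VIF rule used above can be illustrated with a small sketch. It assumes the standard definition, in which the VIF of each predictor is the matching diagonal entry of the inverse correlation matrix; the function names and the toy columns are hypothetical and are not taken from the churn file or from the SPSS output.

```python
from statistics import mean, pstdev

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        standardized.append([(x - mu) / sigma for x in col])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in standardized]
            for ci in standardized]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    size = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def vif(columns):
    """VIF of each predictor: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]

# Toy columns (hypothetical values, only to show the calculation)
day_mins = [120.0, 200.0, 150.0, 310.0, 90.0, 260.0]
day_calls = [90.0, 110.0, 100.0, 95.0, 120.0, 85.0]
eve_mins = [210.0, 180.0, 250.0, 190.0, 230.0, 170.0]
vifs = vif([day_mins, day_calls, eve_mins])

# A predictor would be flagged only if its VIF reached the cut-off of 10
flagged = [i for i, value in enumerate(vifs) if value >= 10]
print([round(value, 2) for value in vifs], flagged)
```

A VIF close to 1 means the predictor is nearly uncorrelated with the others; values of 10 or more are the ones the rule above would treat as candidates for dropping.
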
Decision tree
KMO value

KMO and Bartlett's Test

Kaiser-Meyer-Olkin Measure of Sampling Adequacy.          .510
Bartlett's Test of Sphericity     Approx. Chi-Square      7421.300
                                  df                      120
                                  Sig.                    .000

Split 70:30 Method CHAID

Classification

Sample     Observed             Predicted 0   Predicted 1   Percent Correct
Training   0                    1966          48            97.6%
           1                    246           83            25.2%
           Overall Percentage   94.4%         5.6%          87.5%
Test       0                    805           31            96.3%
           1                    129           25            16.2%
           Overall Percentage   94.3%         5.7%          83.8%

Growing Method: CHAID
Dependent Variable: Churn

Training accuracy is 87.5% and testing accuracy is 83.8%, so the model is expected to
classify about 83.8% of new samples correctly
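
The percentages in the classification table follow directly from the four cell counts. A minimal sketch, using the test-sample counts from the CHAID 70:30 table above; the function name and the dictionary keys are illustrative choices, not SPSS terminology.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Overall accuracy and per-class percent correct from a 2x2 confusion matrix.

    tn: observed 0, predicted 0    fp: observed 0, predicted 1
    fn: observed 1, predicted 0    tp: observed 1, predicted 1
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "non_churn_correct": tn / (tn + fp),
        "churn_correct": tp / (fn + tp),
    }

# Test-sample counts from the CHAID 70:30 classification table
test_metrics = confusion_metrics(tn=805, fp=31, fn=129, tp=25)
for name, value in test_metrics.items():
    print(f"{name}: {value:.1%}")
```

The overall figure is driven mostly by the non-churn class, so the per-class percentages are worth reading alongside it.
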
Split 70:30 Method CRT

Classification

Sample     Observed             Predicted 0   Predicted 1   Percent Correct
Training   0                    1983          69            96.6%
           1                    172           177           50.7%
           Overall Percentage   89.8%         10.2%         90.0%
Test       0                    767           31            96.1%
           1                    62            72            53.7%
           Overall Percentage   88.9%         11.1%         90.0%

Growing Method: CRT
Dependent Variable: Churn

Training accuracy is 90% and testing accuracy is 90%, so the model is expected to
classify about 90% of new samples correctly

Split 60:40 Method CHAID

Classification

Sample     Observed             Predicted 0   Predicted 1   Percent Correct
Training   0                    1703          32            98.2%
           1                    193           86            30.8%
           Overall Percentage   94.1%         5.9%          88.8%
Test       0                    1095          20            98.2%
           1                    147           57            27.9%
           Overall Percentage   94.2%         5.8%          87.3%

Growing Method: CHAID
Dependent Variable: Churn

Split 60:40 Method CRT

Classification

Sample     Observed             Predicted 0   Predicted 1   Percent Correct
Training   0                    1542          142           91.6%
           1                    117           172           59.5%
           Overall Percentage   84.1%         15.9%         86.9%
Test       0                    1078          88            92.5%
           1                    92            102           52.6%
           Overall Percentage   86.0%         14.0%         86.8%

Growing Method: CRT
Dependent Variable: Churn

Cross validation CRT

Classification

Observed             Predicted 0   Predicted 1   Percent Correct
0                    2750          100           96.5%
1                    234           249           51.6%
Overall Percentage   89.5%         10.5%         90.0%

Growing Method: CRT
Dependent Variable: Churn

Cross validation CHAID

Classification

Observed             Predicted 0   Predicted 1   Percent Correct
0                    2698          152           94.7%
1                    241           242           50.1%
Overall Percentage   88.2%         11.8%         88.2%

Growing Method: CHAID
Dependent Variable: Churn

After analysing the above confusion matrices:

5) The comparison in tabular format is shown below

The CRT method with a 70:30 split gives a training accuracy of 90% and a testing accuracy of 90%
The best method is CRT with a 70:30 split
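
The ranking behind this choice can be reproduced from the confusion matrices above. The sketch below is only an illustration: it copies the test-sample (or cross-validated) cell counts from the classification tables, computes overall accuracy for each run, and sorts the runs. The labels in the dictionary are descriptive names chosen here, not SPSS output.

```python
# (tn, fp, fn, tp) for the held-out or cross-validated sample of each run,
# copied from the classification tables in this report
results = {
    "CHAID, 70:30 split": (805, 31, 129, 25),
    "CRT, 70:30 split": (767, 31, 62, 72),
    "CHAID, 60% training split": (1095, 20, 147, 57),
    "CRT, 60% training split": (1078, 88, 92, 102),
    "CRT, cross validation": (2750, 100, 234, 249),
    "CHAID, cross validation": (2698, 152, 241, 242),
}

def accuracy(cells):
    """Share of cases on the diagonal of the 2x2 confusion matrix."""
    tn, fp, fn, tp = cells
    return (tn + tp) / (tn + fp + fn + tp)

# Highest accuracy first
ranked = sorted(results.items(), key=lambda item: accuracy(item[1]), reverse=True)
for name, cells in ranked:
    print(f"{name:<28} {accuracy(cells):.1%}")

best_name = ranked[0][0]
```

Working from the counts rather than the rounded percentages separates the two runs that both display as 90.0%.
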

Logistic regression

Classification Table(a)

                                Predicted
         Observed               Churn 0   Churn 1   Percentage Correct
Step 1   Churn 0                2759      30        98.9
         Churn 1                436       42        8.8
         Overall Percentage                         85.7

a. The cut value is .500

The logistic regression model classifies 85.7% of cases correctly, which is a good enough measure

Omnibus Tests of Model Coefficients

                 Chi-square   df   Sig.
Step 1   Step    357.955      15   .000
         Block   357.955      15   .000
         Model   357.955      15   .000

Sig. is less than 0.05, which means the predictor variables have a significant impact on churn

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
       2361.871(a)         .104                   .184

a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.
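
The two pseudo R-square values can be cross-checked against the omnibus chi-square. This is a sketch under stated assumptions: it uses the usual Cox & Snell and Nagelkerke formulas, takes the number of cases as the sum of the four cells of the logistic classification table, and recovers the null-model -2 log likelihood as the model value plus the model chi-square. The function name is hypothetical.

```python
import math

def pseudo_r_squared(model_chi_square, neg2ll_model, n):
    """Cox & Snell and Nagelkerke R-square from SPSS-style summary figures."""
    # -2LL of the intercept-only model (assumed: model -2LL + model chi-square)
    neg2ll_null = neg2ll_model + model_chi_square
    cox_snell = 1.0 - math.exp(-model_chi_square / n)
    # Largest value Cox & Snell can reach for this data set
    max_cox_snell = 1.0 - math.exp(-neg2ll_null / n)
    return cox_snell, cox_snell / max_cox_snell

# Cases in the logistic classification table: 2759 + 30 + 436 + 42
n_cases = 2759 + 30 + 436 + 42
cox_snell, nagelkerke = pseudo_r_squared(357.955, 2361.871, n_cases)
print(f"Cox & Snell = {cox_snell:.3f}, Nagelkerke = {nagelkerke:.3f}")
```

Both values agree with the Model Summary to three decimals, which suggests the table figures and the case count are consistent with each other.
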

Neural network

Classification

Sample     Observed          Predicted 0   Predicted 1   Percent Correct
Training   0                 1933          53            97.3%
           1                 251           77            23.5%
           Overall Percent   94.4%         5.6%          86.9%
Testing    0                 774           29            96.4%
           1                 108           41            27.5%
           Overall Percent   92.6%         7.4%          85.6%

Dependent Variable: Churn

Area Under the Curve

Churn   Area
        .788
        .788

From the above table and ROC curve

Accuracy is 86.9% for training and 85.6% for testing

From the ROC curve
The curve is above the benchmark line, so the model can be accepted for testing future
sample data to determine churn
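
The "above the benchmark line" reading corresponds to an area under the curve greater than 0.5. A minimal sketch of how that area can be computed from predicted scores, using the pairwise (Mann-Whitney) formulation; the labels and scores below are hypothetical toy values, not the neural network's output.

```python
def roc_auc(labels, scores):
    """AUC as the share of (churner, non-churner) pairs ranked correctly.

    Ties between a churner and a non-churner count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical churn labels and predicted churn probabilities
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.30, 0.60, 0.70, 0.90]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 matches the benchmark diagonal and 1.0 is perfect ranking, so the reported .788 sits between the two.
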

Performance evaluation

Decision tree accuracy is 90% with the CRT method and a 70:30 split

Logistic regression is 85.7% accurate
Neural network is 86.9% accurate on the training sample and 85.6% on the testing sample

Conclusion:
From the above analysis, the decision tree with the CRT method and a 70:30 split is
the best technique.
This method will help in predicting churn from the various factors in the future.