

1. Analyse the attributes in the data, and consider their relative importance with respect to the target class.

The dataset is diabetes.arff, which is provided with Weka; its title is Pima Indians Diabetes Database.
What is an Attribute?
Each individual, independent instance that provides the input to machine learning is characterized by its
values on a fixed, predefined set of features or attributes.

Each instance has a number of attributes and a class. These attributes can be either discrete (nominal)
or continuous (numeric). The dataset (relation) is named pima_diabetes. It contains 768 instances and
9 attributes, the last of which is the class.

The first 8 attributes are continuous (numeric), while the class is discrete (nominal).
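
The attribute count and types described above can be read from the ARFF header. The following is a minimal sketch, not Weka's own parser: the header text is written here for illustration, the short attribute names (preg, plas, and so on) are assumptions, and any type that is not a brace-enclosed value list is simply treated as numeric.

```python
import re

# Illustrative ARFF header modelled on the attributes described in the text.
# The short attribute names are assumptions, not copied from the shipped file.
HEADER = """\
@relation pima_diabetes
@attribute preg numeric
@attribute plas numeric
@attribute pres numeric
@attribute skin numeric
@attribute insu numeric
@attribute mass numeric
@attribute pedi numeric
@attribute age numeric
@attribute class {tested_negative,tested_positive}
@data
"""

ATTRIBUTE = re.compile(r"@attribute\s+(\S+)\s+(.+)", re.IGNORECASE)


def attribute_types(header):
    """Return a list of (name, kind) pairs, kind being 'numeric' or 'nominal'."""
    result = []
    for line in header.splitlines():
        match = ATTRIBUTE.match(line.strip())
        if not match:
            continue  # skip @relation, @data and blank lines
        name, spec = match.group(1), match.group(2).strip()
        # Simplification: a {...} value list is nominal, anything else numeric.
        kind = "nominal" if spec.startswith("{") else "numeric"
        result.append((name, kind))
    return result


types = attribute_types(HEADER)
numeric = [name for name, kind in types if kind == "numeric"]
nominal = [name for name, kind in types if kind == "nominal"]
print(len(types), "attributes:", len(numeric), "numeric,", len(nominal), "nominal")
```

Run on this header, the sketch reports 9 attributes, of which 8 are numeric and 1 (the class) is nominal.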

The class has 2 values, and the labels of these values give some indication of what this dataset is
about. According to the figure above, the labels are tested_negative and tested_positive. In the bar
graph, the blue bar shows the 500 patients who do not have diabetes, and the remaining 268 patients have
diabetes. The type is nominal: each instance takes one of the two labels only. Predicting a nominal
class is known as classification.
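
The class counts above also give a simple reference point for the classifiers used later. As a small sketch (the variable names are my own), the share of each class and the accuracy of always predicting the larger class can be computed as follows:

```python
# Class counts as reported in the class histogram.
counts = {"tested_negative": 500, "tested_positive": 268}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the larger class reaches this accuracy.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority baseline ({majority_label}): {baseline_accuracy:.1%}")
```

Any classifier that scores close to this baseline has learned very little from the attributes.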

This attribute is the number of times pregnant; its minimum value is 0 and its maximum value is 17.
The type is numeric. When the quantity to be predicted is itself continuous (numeric), a value is
predicted rather than a yes or no label; this is known as regression. According to the graph, the
number of patients who do not have diabetes is greater than the number of patients who do. It can also
be seen that where the number of readings is small, the proportion of patients with diabetes is higher,
and vice versa. The graph therefore shows no clear positive correlation between the number of
pregnancies and diabetes.

According to the graph above, plasma glucose concentration in a 2-hour oral glucose tolerance test
ranges from a minimum of 0 to a maximum of 199, a value that indicates a true diabetes patient. The
normal value of plasma glucose concentration is 136 or below. About 2% of the patients are unusual
cases with a very low level of plasma glucose. The mean value is approximately 120. When the plasma
glucose concentration is greater than or equal to the mean, there are fewer readings but more of the
patients are affected by diabetes, and vice versa.

According to the graph above, the mean diastolic blood pressure is about 69 and the maximum is
approximately 122. Patients with a diastolic blood pressure of approximately 40 or below are exceptional
cases; they make up about 1% of the data and are more often affected by diabetes. On the other hand,
patients with a diastolic blood pressure of 60 or above have a higher chance of being affected by
diabetes, and the number of affected patients increases as diastolic blood pressure increases.

According to the graph above, triceps skin fold thickness ranges between 0 and 99 and the mean is about
20. A thickness between 0 and 31 is normal, which means that fewer patients in this range are affected
by diabetes. For a thickness of about 31 to 44, the number of patients affected by diabetes is
approximately equal to the number unaffected. For a thickness of about 40-49, the proportion of
patients affected by diabetes is higher. For 50-56, there are very few diabetes patients, and a
thickness in the range 56-99 is an exceptional case. Hence the graph shows that diabetes and triceps
skin fold thickness are not correlated.

2-Hour serum insulin (mu U/ml)

According to the graph above, serum insulin ranges from 0 to 846 and the mean value is about 79. In the
range 0-134, fewer patients are affected by diabetes than are unaffected. Between serum insulin levels
of 134 and 222, the numbers of affected and unaffected patients are approximately the same. From a
serum insulin level of 222 onwards, the number of patients affected by diabetes increases. The range
489-624 is a unique case and involves about 12%. Hence serum insulin and diabetes are strongly
correlated.

Body mass index (weight in kg/(height in m)^2)


According to the graph shown above, the minimum value is 0 and the maximum value is 67.1, while the
mean is approximately 32. When the body mass index approaches 30, the number of patients affected by
diabetes increases, and it continues to increase as the body mass index rises further. This suggests
that each individual has a healthy body mass range and that exceeding it is associated with diabetes.
About 10% of the patients are exceptional cases whose diabetes is not related to their body mass
index. It is expected that their diabetes is due to some other disease or abnormality.

Diabetes pedigree function


According to the graph shown above, the minimum value is 0.08, the maximum value is 2.42 and the mean
is 0.472. The graph shows that people with little or no family history of diabetes are less affected
by diabetes, while people with a family history of diabetes have a higher chance of getting it.
However, the graph also shows that only about 55% of cases can be attributed to hereditary factors;
the other 45% or so are unique cases who do not get diabetes even though they have a positive family
history. Hence we can conclude from the graph that hereditary factors account for roughly 50% of the
effect on whether patients suffer from diabetes.

Age (years)
According to the graph shown above, age has a strong relation with diabetes. The minimum age in this
graph is 21, the maximum is 81 and the mean is 33. Between the ages of 21 and 27 (approx.), patients
unaffected by diabetes outnumber those affected, but as age approaches 30 or more the proportion of
patients affected by diabetes increases. A likely reason is that the immune system is weaker in elderly
people. About 1% of the readings in the graph are exceptional cases. Hence age and diabetes are
directly correlated.
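
The readings above all follow the same procedure: split an attribute's range into bins and compare how many patients in each bin are affected. A minimal sketch of that procedure is given below. The function name is my own, and the sample values are hypothetical toy numbers, not taken from the dataset.

```python
def positive_rate_by_bin(values, labels, n_bins):
    """Split the value range into equal-width bins.

    Returns one (low, high, count, positive_rate) tuple per bin, where
    positive_rate is None for an empty bin. labels are 1 (affected) or 0.
    Assumes the values are not all identical (bin width must be non-zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    positives = [0] * n_bins
    for value, label in zip(values, labels):
        # The maximum value would land in bin n_bins, so clamp it to the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
        positives[index] += label
    rows = []
    for i in range(n_bins):
        rate = positives[i] / counts[i] if counts[i] else None
        rows.append((low + i * width, low + (i + 1) * width, counts[i], rate))
    return rows


# Hypothetical toy values, NOT taken from the dataset.
ages = [21, 22, 24, 25, 27, 29, 33, 38, 41, 45, 52, 60]
diabetic = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

rows = positive_rate_by_bin(ages, diabetic, 3)
for low, high, count, rate in rows:
    print(f"{low:.0f}-{high:.0f}: {count} readings, positive rate {rate:.2f}")
```

Comparing the rate per bin, rather than the raw counts, avoids mistaking a bin with few readings for a bin with few affected patients.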

2. Construct graphs of classification performance against training set size for a range of classifiers taken from
those considered in the module. You may need to experiment with different training sets, depending on what
you have discovered about the data in step (1).

(I ANALYSED THE DATASET AFTER FILTERING IT DOWN TO 5 ATTRIBUTES TO STUDY DIABETES, BUT I HAVE SHOWN THE
RESULTS FOR ALL 9 ATTRIBUTES TOO)
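
Each percentage split used in this section fixes how many of the 768 instances are used for training and how many are left for testing. The sketch below lists those sizes. The rounding rule (round to the nearest whole instance) is an assumption about how the split is taken, and the function name is my own.

```python
TOTAL_INSTANCES = 768


def split_sizes(total, train_percent):
    """Return (train_size, test_size) for a percentage split.

    Assumption: the training size is rounded to the nearest whole instance.
    """
    train = round(total * train_percent / 100)
    return train, total - train


sizes = {p: split_sizes(TOTAL_INSTANCES, p) for p in range(10, 100, 10)}
for percent, (train, test) in sizes.items():
    print(f"{percent}% split: {train} training, {test} test instances")
```

Under this assumption the 10% split leaves 691 test instances, which equals the sum of the four confusion matrix counts quoted for Figure 1 in section 3 (366 + 89 + 126 + 110).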
With 9 Attributes

Figure 1 SVM Percentage Split 10%
Figure 2 SVM Percentage Split 20%
Figure 3 SVM Percentage Split 30%
Figure 4 SVM Percentage Split 40%
Figure 5 SVM Percentage Split 50%
Figure 6 SVM Percentage Split 60%
Figure 7 SVM Percentage Split 70%
Figure 8 SVM Percentage Split 80%
Figure 9 SVM Percentage Split 90%

Figure 10 J48 Percentage Split 10%
Figure 11 J48 Percentage Split 20%
Figure 12 J48 Percentage Split 30%
Figure 13 J48 Percentage Split 40%
Figure 14 J48 Percentage Split 50%
Figure 15 J48 Percentage Split 60%
Figure 16 J48 Percentage Split 70%
Figure 17 J48 Percentage Split 80%
Figure 18 J48 Percentage Split 90%

Figure 19 MLP Percentage Split 10%
Figure 20 MLP Percentage Split 20%
Figure 21 MLP Percentage Split 30%
Figure 22 MLP Percentage Split 40%
Figure 23 MLP Percentage Split 50%
Figure 24 MLP Percentage Split 60%
Figure 25 MLP Percentage Split 70%
Figure 26 MLP Percentage Split 80%
Figure 27 MLP Percentage Split 90%

Figure 28 Naive Bayes Percentage Split 10%
Figure 29 Naive Bayes Percentage Split 20%
Figure 30 Naive Bayes Percentage Split 30%
Figure 31 Naive Bayes Percentage Split 40%
Figure 32 Naive Bayes Percentage Split 50%
Figure 33 Naive Bayes Percentage Split 60%
Figure 34 Naive Bayes Percentage Split 70%
Figure 35 Naive Bayes Percentage Split 80%
Figure 36 Naive Bayes Percentage Split 90%


After filtering: with 5 attributes, i.e. number of pregnancies, body mass index, diabetes pedigree function, age and the class
Figure 1 SVM Percentage Split 10%
Figure 2 SVM Percentage Split 20%
Figure 3 SVM Percentage Split 30%
Figure 4 SVM Percentage Split 40%
Figure 5 SVM Percentage Split 50%
Figure 6 SVM Percentage Split 60%
Figure 7 SVM Percentage Split 70%
Figure 8 SVM Percentage Split 80%
Figure 9 SVM Percentage Split 90%

Figure 10 J48 Percentage Split 10%
Figure 11 J48 Percentage Split 20%
Figure 12 J48 Percentage Split 30%
Figure 13 J48 Percentage Split 40%
Figure 14 J48 Percentage Split 50%
Figure 15 J48 Percentage Split 60%
Figure 16 J48 Percentage Split 70%
Figure 17 J48 Percentage Split 80%
Figure 18 J48 Percentage Split 90%

Figure 19 Naive Bayes Percentage Split 10%
Figure 20 Naive Bayes Percentage Split 20%
Figure 21 Naive Bayes Percentage Split 30%
Figure 22 Naive Bayes Percentage Split 40%
Figure 23 Naive Bayes Percentage Split 50%
Figure 24 Naive Bayes Percentage Split 60%
Figure 25 Naive Bayes Percentage Split 70%
Figure 26 Naive Bayes Percentage Split 80%
Figure 27 Naive Bayes Percentage Split 90%

Figure 28 MLP Percentage Split 10%
Figure 29 MLP Percentage Split 20%
Figure 30 MLP Percentage Split 30%
Figure 31 MLP Percentage Split 40%
Figure 32 MLP Percentage Split 50%
Figure 33 MLP Percentage Split 60%
Figure 34 MLP Percentage Split 70%
Figure 35 MLP Percentage Split 80%
Figure 36 MLP Percentage Split 90%

Table 1 Different performance metrics running in WEKA (With 9 attributes)

Table 2 Different performance metrics running in WEKA (With 5 attributes)

Table 3 Error measurement for different classifiers in WEKA (with 9 attributes)

Table 4 Error measurement for different classifiers in WEKA (with 5 attributes)

Table 5 Performance measuring in training and test data set using WEKA (with 9 attributes)

Table 6 Performance measuring in training and test data set using WEKA (with 5 attributes)

ALL GRAPHS ARE FOR 5 ATTRIBUTES

Graph 1 Percentage Split 10-90 vs Mean Absolute Error

Graph 2 Percentage Split 10-90 vs Root Mean Square Error

Graph 3 Percentage Split 10-90 vs Relative Absolute Error

Graph 4 Percentage Split 10-90 vs Root Relative Squared Error

Graph 5 Percentage Split 10-90 vs Accuracy

Graph 6 Percentage Split 10-90 vs Error Rate

Graph 7 Percentage Split 10-90 vs Time (s)

Graph 8 Percentage Split 10-90 vs Kappa Statistic

3. Analyse the data structure/representation generated by at least three classifiers when trained on the
complete dataset. What does your analysis tell you about the data set?

The diagrams, tables and graphs were produced using different classifiers. The classifiers used for the
interpretation are J48, MLP, Naive Bayes and SMO. The following test options are available:
Use training set:
This should be chosen if the actual data set is used as both the training set and the testing set.
Supplied test set:
This is an option if the actual data set is used as the training set and a separate testing set is available.
Cross-Validation:
Cross-validation provides the opportunity to use a single data set. It splits the data set into m folds and, in turn, uses m-1 folds as the training set and the remaining fold as the testing set.
Percentage split:
This splits the actual data set so that n percent is used for training and the remainder for testing.
Percentage split (10, 20, 30, 40, 50, 60, 70, 80, 90) is used. Table 2 is made for easier analysis and
evaluation. Different performance metrics such as TP rate, FP rate, precision, recall, F-measure and ROC
area are presented as numeric values for the training and testing phases. Table 4 presents different
types of error measurement, such as mean absolute error and root mean squared error, together with the
time taken to build the model in seconds and the kappa statistic.
Finally, graphs are provided to make the results easier to understand.
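
To make the cross-validation option concrete, the sketch below divides 768 instance indices into 10 folds and yields each training/testing pair in turn. It is only an illustration of the idea: the function name and seed are my own, and Weka's own fold construction may differ (for example by stratifying the folds by class).

```python
import random


def k_fold_indices(n_instances, k, seed=1):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatability
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(768, 10))
for number, (train, test) in enumerate(splits, start=1):
    print(f"fold {number}: {len(train)} training, {len(test)} test instances")
```

Every instance is used for testing exactly once and for training in the other folds, which is why a single data set is enough.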
Now let's start with the SMO classifier. According to Figure 1 (WITH 9 ATTRIBUTES), approximately 69% of
the instances are correctly classified and approximately 31% are incorrectly classified. The confusion
matrix states that 366 As are correctly classified as As, whereas 89 As are incorrectly classified as
Bs; 126 Bs are incorrectly classified as As, whereas 110 Bs are correctly classified as Bs. The kappa
statistic shown is 0.2811 and the ROC area is 0.635. The kappa statistic measures the agreement between
the predicted and the actual classes after correcting for the agreement expected by chance. A kappa of
1 indicates perfect agreement, whereas a kappa of 0 indicates agreement equivalent to chance. A value
of 0.60-0.70 is an acceptable figure.
The rest of the figures (the remaining figures for 9 attributes and those for 5 attributes) can be
interpreted in the same way.
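
The accuracy and kappa figures quoted for Figure 1 can be checked from the four confusion matrix counts. The sketch below assumes that rows are actual classes and columns are predicted classes, with A as tested_negative and B as tested_positive; that layout is my reading of the figure, not something stated in it.

```python
# Assumed layout: rows are actual classes, columns are predicted classes.
#          predicted A, predicted B
matrix = [[366, 89],    # actual A (assumed to be tested_negative)
          [126, 110]]   # actual B (assumed to be tested_positive)

total = sum(sum(row) for row in matrix)

# Observed agreement is the accuracy: the diagonal over the total.
observed = (matrix[0][0] + matrix[1][1]) / total

# Agreement expected by chance, from the row and column totals.
row_totals = [sum(row) for row in matrix]
col_totals = [matrix[0][j] + matrix[1][j] for j in range(2)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2

kappa = (observed - expected) / (1 - expected)

# Per-class metrics for class A.
tp, fn = matrix[0]
fp, tn = matrix[1]
tp_rate = tp / (tp + fn)            # also called recall
fp_rate = fp / (fp + tn)
precision = tp / (tp + fp)
f_measure = 2 * precision * tp_rate / (precision + tp_rate)

print(f"accuracy {observed:.4f}, kappa {kappa:.4f}")
print(f"class A: TP rate {tp_rate:.3f}, FP rate {fp_rate:.3f}, "
      f"precision {precision:.3f}, F-measure {f_measure:.3f}")
```

With this layout the counts give an accuracy of about 69% and a kappa of 0.2811, matching the values reported in the figure.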
Performance:
Performance should be analysed in two ways. First, the ability of each classifier to generalise is
compared in a table, which shows which classifier is better than the others. Second, the pattern of
errors is studied. The total time required to build the model is also an essential parameter when
comparing classification algorithms.
According to Table 4, SMO is the best because it has the lowest error rate, and MLP is second best.
Naive Bayes is third and J48 is fourth, which makes it the worst-performing algorithm.
According to Table 6, the Naive Bayes classifier requires the shortest model building time, around
0.011 seconds, whereas J48 is second with 0.014 seconds. The MLP algorithm requires the longest model
building time, around 0.37 seconds.
4. Combine the results from the previous three steps and all your classifiers to develop a model of why
instances fall into particular classes. (Your answer to this question should be understandable by someone
who is not a specialist in data mining.)

According to the graphs and my analysis, some attributes are causes of diabetes and some are effects of
diabetes. A few are neither a cause nor an effect of diabetes. Let's start with pregnancy: one of the
causes of diabetes is pregnancy. There is an increased chance of gestational diabetes if a woman had
symptoms of diabetes during a previous pregnancy. It is caused by a change in the way a woman's body
responds to the hormone insulin during pregnancy. As the number of pregnancies increases, the chance of
diabetes goes up with it, establishing a direct correlation between pregnancy and diabetes. As age
increases, the chance of diabetes is also observed to increase. Diabetes is mostly observed in elderly
people. One of the reasons for diabetes in elderly people is a weak immune system, caused by a lack of
exercise and of a proper diet, co-existing health issues and cognitive complications. The diabetes
pedigree function is also an attribute that contributes to the progression of diabetes. People with
diabetes in their family history have a significantly increased chance of developing diabetes at some
point in their life. Body mass index has a healthy range for an individual of any age and is one of the
main factors contributing to diabetes. Obesity gives rise to many problems: it causes abnormal glucose
tolerance in the body, which leads to diabetes. Most people who get diabetes do so because their weight
is above their healthy weight range.
Some attributes are effects of diabetes. Let's talk about blood pressure: diabetes is one of the main
causes of high blood pressure. Diabetes plays a role in damaging the arteries and makes them a target
for hardening. Hardening of the arteries raises the pressure in them and hence causes high blood
pressure. A patient with diabetes is very unlikely to have low blood pressure. On the other hand, being
overweight is also a factor that causes high blood pressure. Body mass index is also related to skin
fold thickness: as the body mass index increases, the skin fold thickness increases. Serum insulin and
plasma glucose concentration are the tests that are always taken in cases of diabetes. If the plasma
glucose concentration of a patient is more than 136 (approx.), the patient is likely to have diabetes,
and if it is 199 or more, the patient is confirmed to be a diabetes patient. Serum insulin is also a
test used to check for diabetes in a patient. So these two tests are used to establish whether diabetes
is present and how severe it is. Once the stage of diabetes is known from these tests, it becomes
easier to find a way to treat the patient and overcome the problem of diabetes.