The Second Iteration for Information Gain
In the second iteration we update the data table. Since the Expensive and Standard values of "Travel cost/km" already lead to pure classes, those rows are no longer needed. The data table D for the second iteration therefore contains only the rows with Cheap "Travel cost/km." We also remove the attribute Travel cost/km itself, because its value is identical in every remaining row and carries no further information, as follows.
| Gender | Car Ownership | Travel Cost ($)/km | Income Level | Transportation Mode (class) |
| --- | --- | --- | --- | --- |
| Female | 0 | Cheap | Low | Bus |
| Male | 0 | Cheap | Low | Bus |
| Male | 1 | Cheap | Medium | Bus |
| Male | 1 | Cheap | Medium | Bus |
| Female | 1 | Cheap | Medium | Train |
⇓
| Gender | Car Ownership | Income Level | Transportation Mode (class) |
| --- | --- | --- | --- |
| Female | 0 | Low | Bus |
| Male | 0 | Low | Bus |
| Male | 1 | Medium | Bus |
| Male | 1 | Medium | Bus |
| Female | 1 | Medium | Train |
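This row selection and column removal can be sketched in a few lines of pandas. The sketch below simply re-enters the five "Cheap" rows from the first table, so the variable names (`D`, `D2`) and the use of pandas are only illustrative, not part of the original tutorial:

```python
import pandas as pd

# The five "Cheap" rows of the training data, as shown in the first table above.
D = pd.DataFrame(
    [
        ["Female", 0, "Cheap", "Low", "Bus"],
        ["Male", 0, "Cheap", "Low", "Bus"],
        ["Male", 1, "Cheap", "Medium", "Bus"],
        ["Male", 1, "Cheap", "Medium", "Bus"],
        ["Female", 1, "Cheap", "Medium", "Train"],
    ],
    columns=["Gender", "Car Ownership", "Travel Cost ($)/km",
             "Income Level", "Transportation Mode"],
)

# Keep only the Cheap branch and drop the now-constant attribute.
D2 = D[D["Travel Cost ($)/km"] == "Cheap"].drop(columns=["Travel Cost ($)/km"])
print(D2)  # reproduces the second table above
```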
Now we have only three attributes: "Gender," "Car ownership," and "Income level."
Based on these data, the probability of each class and the degree of impurity are computed as follows:
Prob(Bus) = 4 / 5 = 0.8 (4 Bus rows out of 5)
Prob(Train) = 1 / 5 = 0.2 (1 Train row out of 5)
Entropy = –0.8 × log₂(0.8) – 0.2 × log₂(0.2) = 0.722
Gini index = 1 – (0.8² + 0.2²) = 0.320
Classification error = 1 – max{0.8, 0.2} = 1 – 0.8 = 0.200
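As a cross-check, the three impurity measures can be computed directly from the class column of the five remaining rows. The helper functions below are a minimal sketch (the function names are mine, and entropy uses the base-2 logarithm, which matches the 0.722 value above):

```python
import math
from collections import Counter

labels = ["Bus", "Bus", "Bus", "Bus", "Train"]  # class column of the 5 remaining rows

def class_probs(labels):
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def entropy(labels):
    # Entropy with base-2 logarithm, as in the text.
    return -sum(p * math.log2(p) for p in class_probs(labels) if p > 0)

def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_probs(labels))

def classification_error(labels):
    return 1.0 - max(class_probs(labels))

print(round(entropy(labels), 3))               # 0.722
print(round(gini(labels), 3))                  # 0.320
print(round(classification_error(labels), 3))  # 0.200
```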