The Second Iteration for Information Gain


In the second iteration, we update our data table. Since the Expensive and Standard values of "Travel cost/km" have already led to pure classes, we no longer need those rows. For the second iteration, our data table D contains only the rows where "Travel cost/km" is Cheap. We then remove the attribute "Travel cost/km" from the table, because its value is the same (Cheap) in every remaining row and therefore carries no further information. The reduced tables are shown below, followed by a small programmatic sketch of this filtering step.

(Attribute columns on the left; the class column is Transportation Mode.)

  Gender    Car Ownership    Travel Cost ($)/km    Income Level    Transportation Mode
  Female    0                Cheap                 Low             Bus
  Male      0                Cheap                 Low             Bus
  Male      1                Cheap                 Medium          Bus
  Male      1                Cheap                 Medium          Bus
  Female    1                Cheap                 Medium          Train

After removing the redundant attribute "Travel cost/km":

  Gender    Car Ownership    Income Level    Transportation Mode
  Female    0                Low             Bus
  Male      0                Low             Bus
  Male      1                Medium          Bus
  Male      1                Medium          Bus
  Female    1                Medium          Train
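
For readers who want to reproduce this step programmatically, here is a minimal Python sketch, assuming pandas is available; the variable names are illustrative only. It builds the Cheap branch from the table above and drops any attribute that has become constant:

  import pandas as pd

  # The five records in the Cheap branch, copied from the table above.
  cheap_branch = pd.DataFrame({
      "Gender":              ["Female", "Male", "Male", "Male", "Female"],
      "Car Ownership":       [0, 0, 1, 1, 1],
      "Travel Cost/km":      ["Cheap", "Cheap", "Cheap", "Cheap", "Cheap"],
      "Income Level":        ["Low", "Low", "Medium", "Medium", "Medium"],
      "Transportation Mode": ["Bus", "Bus", "Bus", "Bus", "Train"],
  })

  # Drop any attribute that is constant within this branch (here: Travel Cost/km),
  # since a constant column cannot reduce impurity any further.
  constant_cols = [c for c in cheap_branch.columns if cheap_branch[c].nunique() == 1]
  reduced = cheap_branch.drop(columns=constant_cols)
  print(reduced)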

Now we have only three attributes: "Gender," "Car ownership," and "Income level." Based on these data, the probability of each class and the degree of impurity are computed as follows:
  Prob( Bus )   = 4 / 5 = 0.8        # 4 Bus out of 5 rows
  Prob( Train ) = 1 / 5 = 0.2        # 1 Train out of 5 rows

Entropy = –0.8×log₂(0.8) – 0.2×log₂(0.2) = 0.722
Gini index = 1 – (0.8² + 0.2²) = 1 – 0.68 = 0.320
Classification error = 1 – Max{0.8, 0.2} = 1 – 0.8 = 0.200
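
As a cross-check, the same three impurity measures can be computed with a short Python sketch; the function name impurity_measures is just for illustration:

  import math

  def impurity_measures(class_counts):
      """Return (entropy, gini, classification_error) for a list of class counts."""
      total = sum(class_counts)
      probs = [c / total for c in class_counts]
      entropy = -sum(p * math.log2(p) for p in probs if p > 0)
      gini = 1 - sum(p ** 2 for p in probs)
      class_error = 1 - max(probs)
      return entropy, gini, class_error

  # Class counts in the Cheap branch: 4 Bus, 1 Train.
  entropy, gini, error = impurity_measures([4, 1])
  print(f"Entropy = {entropy:.3f}")              # 0.722
  print(f"Gini index = {gini:.3f}")              # 0.320
  print(f"Classification error = {error:.3f}")   # 0.200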


