Slide 12.7: Measuring impurity

Measuring Impurity

Given a data table that contains attributes and the class of each record, we can measure the homogeneity (or heterogeneity) of the table based on its classes. A table is pure, or homogeneous, if it contains only a single class. If it contains several classes, the table is impure, or heterogeneous. Several indices measure the degree of impurity quantitatively. The best-known ones are:

                Entropy = Σ[-p_j(log₂p_j)]     for all j
             Gini Index = 1 – Σ(p_j²)        for all j
   Classification Error = 1 – max{p_j}

where p_j is the probability of class value j, that is, the fraction of rows in the table that belong to class j. In the following example (10 rows), the Transportation Mode class takes three values: Bus, Car, and Train. The table has 4 buses, 3 cars, and 3 trains (4B, 3C, and 3T for short).

Attributes				Classes
Gender	Car Ownership	Travel Cost ($)/km	Income Level	Transportation Mode
Male	0	Cheap	Low	Bus
Male	1	Cheap	Medium	Bus
Female	0	Cheap	Low	Bus
Male	1	Cheap	Medium	Bus
Female	1	Expensive	High	Car
Male	2	Expensive	Medium	Car
Female	2	Expensive	High	Car
Female	1	Cheap	Medium	Train
Male	0	Standard	Medium	Train
Female	1	Standard	Medium	Train

Based on these data, we can compute the probability of each class:

    Prob( Bus )   = 4 / 10 = 0.4        # 4B / 10 rows
    Prob( Car )   = 3 / 10 = 0.3        # 3C / 10 rows
    Prob( Train ) = 3 / 10 = 0.3        # 3T / 10 rows
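
The same counting can be sketched in a few lines of Python. This is only an illustration: the list `modes` is a hand-copied version of the Transportation Mode column from the table above, and the variable names are made up for this example.

```python
from collections import Counter

# Transportation Mode column of the 10-row table (copied by hand)
modes = ["Bus", "Bus", "Bus", "Bus",
         "Car", "Car", "Car",
         "Train", "Train", "Train"]

counts = Counter(modes)   # number of rows per class
total = len(modes)        # total number of rows

# Probability of each class = rows in that class / total rows
probabilities = {label: count / total for label, count in counts.items()}

for label, p in probabilities.items():
    print(f"Prob( {label} ) = {counts[label]} / {total} = {p}")
```

Running it prints 0.4 for Bus and 0.3 for both Car and Train, matching the hand calculation.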

With the probability of each class in hand, we are ready to compute the quantitative impurity indices.
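
As a rough preview, the three formulas above can be written directly as small Python functions. This is a minimal sketch, not a reference implementation: the function names are chosen here for illustration, and the entropy function skips zero probabilities on the assumption that a class with p_j = 0 contributes nothing to the sum.

```python
import math

def entropy(probs):
    # Entropy = sum of -p_j * log2(p_j) over all classes j
    # (terms with p_j == 0 are skipped; they are treated as contributing 0)
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gini_index(probs):
    # Gini Index = 1 - sum of p_j^2 over all classes j
    return 1 - sum(p * p for p in probs)

def classification_error(probs):
    # Classification Error = 1 - max{p_j}
    return 1 - max(probs)

# Class probabilities from the example: Bus, Car, Train
probs = [0.4, 0.3, 0.3]

print(f"Entropy              = {entropy(probs):.3f}")
print(f"Gini Index           = {gini_index(probs):.3f}")
print(f"Classification Error = {classification_error(probs):.3f}")
```

For the 4B, 3C, 3T table this gives an entropy of about 1.571, a Gini index of 0.66, and a classification error of 0.6. A pure table, such as `[1.0]`, gives 0 for all three indices.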
