Measuring Impurity


Given a data table that contains attributes and a class for each record, we can measure the homogeneity (or heterogeneity) of the table based on the classes. A table is pure, or homogeneous, if it contains only a single class. If it contains several classes, the table is impure, or heterogeneous. Several indices measure the degree of impurity quantitatively. The best-known ones are:
                Entropy = Σ_j [ -p_j log2(p_j) ]     for all j
             Gini Index = 1 - Σ_j (p_j)^2            for all j
   Classification Error = 1 - max_j { p_j }
where p_j is the probability (relative frequency) of class value j. In the following example of 10 rows, the class, Transportation Mode, takes three values: Bus, Car, and Train. The table has 4 buses, 3 cars, and 3 trains (4B, 3C, and 3T for short).

Attributes                                               Class
Gender  Car Ownership  Travel Cost ($/km)  Income Level  Transportation Mode
------  -------------  ------------------  ------------  -------------------
Male    0              Cheap               Low           Bus
Male    1              Cheap               Medium        Bus
Female  0              Cheap               Low           Bus
Male    1              Cheap               Medium        Bus
Female  1              Expensive           High          Car
Male    2              Expensive           Medium        Car
Female  2              Expensive           High          Car
Female  1              Cheap               Medium        Train
Male    0              Standard            Medium        Train
Female  1              Standard            Medium        Train

Based on these data, we can compute the probability of each class as its share of the rows:
    Prob( Bus )   = 4 / 10 = 0.4        # 4B / 10 rows
    Prob( Car )   = 3 / 10 = 0.3        # 3C / 10 rows
    Prob( Train ) = 3 / 10 = 0.3        # 3T / 10 rows
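
The tally above can be reproduced with a short Python sketch. It assumes the Transportation Mode column has been copied into a plain list in table order; the variable names are illustrative and not part of the original example.

```python
from collections import Counter

# Transportation Mode column of the 10-row example table, in row order.
modes = ["Bus", "Bus", "Bus", "Bus",
         "Car", "Car", "Car",
         "Train", "Train", "Train"]

# Count the rows per class, then divide by the total number of rows.
counts = Counter(modes)
probs = {cls: n / len(modes) for cls, n in counts.items()}

print(probs)  # {'Bus': 0.4, 'Car': 0.3, 'Train': 0.3}
```

The probabilities always sum to 1, which is a quick sanity check on the counts.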
With the probability of each class in hand, we are ready to compute the impurity indices.
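
As a rough preview, the three indices defined earlier might be coded as follows. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, the class probabilities are passed as a plain list, and a class with zero probability is assumed to contribute nothing to the entropy (the usual convention that 0 * log2(0) = 0).

```python
import math

def entropy(probs):
    """Sum of -p_j * log2(p_j) over all classes j."""
    # Skip p == 0 terms (assumed convention: 0 * log2(0) counts as 0).
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gini_index(probs):
    """1 minus the sum of squared class probabilities."""
    return 1 - sum(p * p for p in probs)

def classification_error(probs):
    """1 minus the probability of the most frequent class."""
    return 1 - max(probs)

# Class probabilities of the example table: Bus, Car, Train.
probs = [0.4, 0.3, 0.3]

# Rounded to three decimals: about 1.571, 0.66 and 0.6.
print(round(entropy(probs), 3),
      round(gini_index(probs), 3),
      round(classification_error(probs), 3))
```

For a pure table, where a single class has probability 1, all three functions return 0, which matches the definition of purity given at the start of this section.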