Web Classification Using Decision Trees


The First-Level Categories of Yahoo Directories

Arts & Humanities
Recreation & Sports
Education
Science
Health
News & Media
Computers & Internet
Regional
Government
Society & Culture
Business & Economy
Reference
Entertainment
Social Science


Three URLs with Keywords, Descriptions, and Hyperlinks

URL   Field        Information
URL1  Keyword      baseball, MLB, bat, college baseball, home run, news
      Description  Baseball.com – catch your Major League Baseball info
      Hyperlink    baseball (22), sports (12), news (3), computer (2)
URL2  Keyword      basketball, NBA, block, defense, foul out, NBA news
      Description  Basketball! NBA news, basketball sites, NBA scores
      Hyperlink    basketball (17), football (6), sports (6), news (2)
URL3  Keyword      football, NFL, National Football League, block, defense
      Description  Football! Chat room for football fans, football sites
      Hyperlink    football (33), sports (9)

An Example
Web pages are classified with a modified decision tree of height three. The root splits on keywords, the depth-one nodes split on descriptions, and the depth-two nodes split on hyperlinks. Classification starts with a web page at the root and traces paths downward to the leaves, which give the categories of the page.
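
The traversal described above can be sketched in Python. This is a minimal illustration, not the tree from the original example. The split tests, the thresholds (5 and 1), the sports vocabulary, and the names Page, Node, and classify are all hypothetical. The sketch also assumes that the parenthesized numbers in the table are hyperlink counts per term, and that a page may follow more than one branch and so receive more than one category.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Page:
    keywords: list[str]
    description: str
    hyperlinks: dict[str, int]  # term -> count (assumed meaning of "(22)")


@dataclass
class Node:
    # Each branch pairs a test on the page with a child:
    # either another Node or a category name (a leaf).
    branches: list[tuple[Callable[[Page], bool], Union["Node", str]]]


def classify(node, page):
    """Follow every branch whose test passes and collect the leaves reached."""
    if isinstance(node, str):
        return {node}
    found = set()
    for test, child in node.branches:
        if test(page):
            found |= classify(child, page)
    return found


# Hypothetical vocabulary and split tests, chosen only for illustration.
SPORTS = {"baseball", "basketball", "football"}


def has_sport_keyword(page):
    return any(k.lower() in SPORTS for k in page.keywords)


def description_mentions_sport(page):
    text = page.description.lower()
    return any(term in text for term in SPORTS)


# Depth-two node: splits on hyperlinks, children are category leaves.
depth_two = Node([
    (lambda p: p.hyperlinks.get("sports", 0) >= 5, "Recreation & Sports"),
    (lambda p: p.hyperlinks.get("news", 0) >= 1, "News & Media"),
])
# Depth-one node: splits on the description.
depth_one = Node([(description_mentions_sport, depth_two)])
# Root: splits on keywords.
root = Node([(has_sport_keyword, depth_one)])

url1 = Page(
    keywords=["baseball", "MLB", "bat", "college baseball", "home run", "news"],
    description="Baseball.com – catch your Major League Baseball info",
    hyperlinks={"baseball": 22, "sports": 12, "news": 3, "computer": 2},
)
url3 = Page(
    keywords=["football", "NFL", "National Football League", "block", "defense"],
    description="Football! Chat room for football fans, football sites",
    hyperlinks={"football": 33, "sports": 9},
)

print(sorted(classify(root, url1)))  # ['News & Media', 'Recreation & Sports']
print(sorted(classify(root, url3)))  # ['Recreation & Sports']
```

Under these assumed rules, URL1 reaches two leaves because it has both sports and news hyperlinks, while URL3 reaches only one. Returning a set of categories rather than a single label reflects the plural "paths" and "categories" in the description above.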




