The Web flourishes because of its format-free style.
The lack of a unifying structure has helped popularize the Web, but the resulting complexity also makes Web searches difficult.
HTML pages provide the following information:
Audio/figure/flash/table/video captions:
A caption is usually a description of the subject.
Content:
Web page content provides the most accurate, full-text information.
However, it is also the information least used by search engines, since content extraction is still far less practical than using the other sources listed here.
Descriptions:
Web page descriptions can either be constructed from meta tags or be submitted by webmasters or reviewers.
A meta tag is an HTML tag that provides information about a web page, such as its author, expiration date, and a list of keywords.
Hyperlinks:
Hyperlinks contain high-quality semantic clues to a page’s topic.
A hyperlink to a web page represents an implicit endorsement of the page being pointed to.
Hyperlink text:
Hyperlink text is normally a title or brief summary of the target page.
Keywords:
Keywords can be extracted from full-text documents or meta tags.
Before keywords are obtained from a full-text document, filtering operations are applied to it.
Typical operations include removing common words using a list of stopwords and converting upper-case letters to lower case.
Page titles:
The title tag defines the title of an HTML document.
Text with a different font, style, color, or size:
Emphasized text is usually given a different font, style, color, or size to highlight its importance.
The first sentences:
The first sentence of a web page is usually an introduction or an abstract.
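
Several of the sources listed above can be pulled out of a page with a few lines of code. The following is a minimal sketch, not a production extractor: it uses Python's standard html.parser module to collect the page title, meta tags, hyperlinks with their text, and emphasized text, and it applies the two keyword filtering operations mentioned above (lower-casing and stopword removal). The names PageInfoParser, extract_keywords, and STOPWORDS are hypothetical, the stopword list is a tiny illustrative assumption, and nested tags are not handled.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for"}


class PageInfoParser(HTMLParser):
    """Collects title, meta tags, hyperlinks with their text, and emphasized text."""

    # Tags whose enclosed text we record. Nesting is not handled in this sketch.
    TRACKED = {"title", "a", "b", "strong", "i", "em"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}        # meta name -> content
        self.links = []       # (href, hyperlink text)
        self.emphasized = []  # text inside <b>, <strong>, <i>, <em>
        self._open = None     # tracked tag currently being read, if any
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag in self.TRACKED:
            self._open = tag
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        # Collapse runs of whitespace in the collected text.
        text = " ".join("".join(self._text).split())
        if tag == "title":
            self.title = text
        elif tag == "a":
            self.links.append((self._href, text))
        else:
            self.emphasized.append(text)
        self._open = None


def extract_keywords(text):
    """Lower-case the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]
```

A caller would create a PageInfoParser, pass the page source to its feed method, and then read the title, meta, links, and emphasized attributes; extract_keywords can be applied to the title, the meta keywords, or the page text.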
Wife: “What are you doing?”
Husband: “Nothing.”
Wife: “Nothing…? You’ve been reading our marriage certificate for an hour.”
Husband: “I was looking for the expiration date.”