World Wide Web Spiders (Crawlers or Robots) Help Page
A World Wide Web spider (also called a crawler or robot) is a program that automatically explores the World Wide Web by retrieving a document and recursively retrieving some or all of the documents referenced in it. Spiders follow the links on a site to find other relevant pages. Two traversal algorithms are commonly used in spiders: depth-first search and breadth-first search. A depth-first crawl builds a relatively comprehensive database on a few objects, while a breadth-first crawl builds a database that touches more lightly on a wider variety of documents. The following basic tools are usually enough to implement an experimental spider; to construct an efficient and practical spider, some other networking tools also have to be used.
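The two strategies differ only in the order in which discovered links are visited: a queue gives breadth-first crawling, a stack gives depth-first. Before the tool list, here is a minimal breadth-first sketch in Java; the seed URL, page limit, and href-matching regular expression are illustrative assumptions, not part of the tools listed on this page.

    // Minimal breadth-first spider sketch. Seed URL, page limit, and the
    // href regex below are assumptions for illustration only.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class BfsSpider {
        private static final Pattern HREF =
            Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            String seed = args.length > 0 ? args[0] : "http://www.example.com/"; // assumed seed
            int maxPages = 20;                       // small limit for an experimental run
            Queue<String> frontier = new ArrayDeque<>();
            Set<String> visited = new HashSet<>();
            frontier.add(seed);

            while (!frontier.isEmpty() && visited.size() < maxPages) {
                String page = frontier.poll();       // FIFO queue => breadth-first order
                if (!visited.add(page)) continue;    // skip pages already fetched
                System.out.println("Fetching " + page);
                StringBuilder body = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                         new InputStreamReader(new URL(page).openStream()))) {
                    String line;
                    while ((line = in.readLine()) != null) body.append(line).append('\n');
                } catch (Exception e) {
                    System.err.println("  skipped: " + e.getMessage());
                    continue;
                }
                Matcher m = HREF.matcher(body);      // naive link extraction
                while (m.find()) frontier.add(m.group(1));
                // Replacing the FIFO queue with a LIFO stack turns this loop
                // into a depth-first spider.
            }
        }
    }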
- Spider Download
Acme.Spider
CPAN: WWW
The Web Robots Pages
WWW::Robot
Yahoo!'s Web Search Robots and Spiders Page
- Spider Implementation Using Lynx
lynx is a general-purpose distributed information browser for the World Wide Web; because it can be driven from the command line, it can do the page fetching for a spider. A minimal sketch of invoking it from a program appears after the file list below.
/opt/local/bin/lynx
lynxc.cc
lynxj.java
lynxp.pl
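The following Java fragment is a sketch of how lynx might be driven from a program (not necessarily how lynxc.cc, lynxj.java, or lynxp.pl above do it): it runs lynx as a subprocess and reads the page source from its output. The target URL is an assumption for illustration; the -source option asks lynx to print the raw HTML of the page.

    // Sketch of driving lynx from Java; the target URL is an assumed example.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class LynxFetch {
        public static void main(String[] args) throws Exception {
            String url = args.length > 0 ? args[0] : "http://www.example.com/"; // assumed URL
            // "lynx -source <url>" writes the raw HTML of the page to standard output.
            ProcessBuilder pb = new ProcessBuilder("/opt/local/bin/lynx", "-source", url);
            pb.redirectErrorStream(true);            // merge stderr into stdout
            Process lynx = pb.start();
            try (BufferedReader out = new BufferedReader(
                     new InputStreamReader(lynx.getInputStream()))) {
                String line;
                while ((line = out.readLine()) != null) {
                    System.out.println(line);        // a spider would scan these lines for links
                }
            }
            lynx.waitFor();
        }
    }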
- Spider Implementation Using Java
java.net.URL represents a URL and allows the data referred to by the URL to be downloaded. A URL may be specified as a single string or with separate protocol, host, port, and file specifications.
java.net.URL
java.net.URLConnection defines a network connection to an object specified by a URL; a short fetch sketch using both classes appears after the file list below.
java.net.URLConnection
net.java
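The following fragment is a minimal sketch of fetching a page with these two classes (not necessarily how net.java above does it); the target host, port, and file are assumptions for illustration.

    // Fetching a page with java.net.URL and java.net.URLConnection.
    // The example host, port, and file are assumed for illustration.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public class UrlFetch {
        public static void main(String[] args) throws Exception {
            // A URL may be given as a single string, or as protocol/host/port/file parts.
            URL url = args.length > 0
                    ? new URL(args[0])
                    : new URL("http", "www.example.com", 80, "/index.html"); // assumed target
            URLConnection conn = url.openConnection();   // network connection to the object
            System.out.println("Content-Type: " + conn.getContentType());
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);            // a spider would extract href links here
                }
            }
        }
    }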
- Spider Implementation Using Perl
CPAN (Comprehensive Perl Archive Network): In CPAN, you will find all things Perl.
CPAN
webget: Given a URL on the command line, webget fetches the named object (HTML text, images, audio, whatever the object happens to be).
webget
network.pl (used by webget)
www.pl (used by webget)
spider.pl (Warning! Warning! Program is unstable!)