Programming Exercise I:
Web Data Crawling, Collection, and Storage


(Industry-Level, Second-to-None Comprehensive Specifications)



Absolutely no copying others’ work

Development Requirements
When you start developing the exercise, follow the two requirements below:

Due Date and Submission Methods
On or before Monday, October 02, 2023, send an email to the instructor at wenchen@cs.und.edu to remind him that the exercise is ready for grading.

Note that you are allowed to use any languages and tools for this exercise, but the exams will focus on PHP and MySQL unless otherwise specified.



Background and Objectives
A World Wide Web search engine includes the following three major components:
  • Crawlers, which visit and read every page on web sites, using hypertext links on each page to discover and read a site’s other pages,

  • Web page indexes, which are created from the pages that have been read, and

  • Search and ranking software, which receives a user’s search request, compares it to the entries in the index, and returns results to the user.

This exercise is for students to learn and practice the first steps of the data life cycle (data collection, preparation, and indexing & storage) by implementing a web crawler.
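
As a rough starting point, the crawl-and-discover behavior described above might be sketched in PHP as follows. This is only a sketch, not a required design: the seed URL, the page limit, and the use of file_get_contents (which requires allow_url_fopen to be enabled) are all illustrative assumptions.

  <?php
  // A minimal breadth-first crawl loop (illustrative sketch only).
  $queue   = array('https://example.com/');  // frontier; the seed URL is an assumption
  $visited = array();                        // pages already fetched
  $limit   = 10;                             // stop after this many pages (assumption)

  while ($queue && count($visited) < $limit) {
      $url = array_shift($queue);
      if (isset($visited[$url])) {
          continue;                          // skip pages we have already read
      }
      $html = @file_get_contents($url);      // fetch the page; ignore fetch warnings
      if ($html === false) {
          continue;                          // skip pages that could not be fetched
      }
      $visited[$url] = true;

      // Parse the page and follow its hypertext links to discover other pages.
      $doc = new DOMDocument();
      @$doc->loadHTML($html);                // suppress warnings from malformed HTML
      foreach ($doc->getElementsByTagName('a') as $a) {
          $href = $a->getAttribute('href');
          // This sketch only queues absolute http(s) links; a real crawler
          // would also resolve relative links against the current URL.
          if (preg_match('#^https?://#', $href) && !isset($visited[$href])) {
              $queue[] = $href;
          }
      }
      echo "Crawled: $url\n";
  }

Each fetched page would then be prepared (for example, stripped of HTML tags) and stored, which is the subject of the requirements below.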


The Requirements
Design and implement the first part of a World Wide Web search engine: crawling web pages, collecting data, and saving the collected data. The system must meet the following requirements:
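
Since the exams focus on PHP and MySQL, the saving step could be sketched with PDO as below. The database name, credentials, and the pages table with its columns are assumptions made for illustration, not part of the exercise specification.

  <?php
  // Illustrative storage step; assumes a table created beforehand, e.g.:
  //   CREATE TABLE pages (
  //     url   VARCHAR(255) PRIMARY KEY,
  //     title VARCHAR(255),
  //     body  MEDIUMTEXT
  //   );
  $pdo = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8mb4',
                 'user', 'password');        // assumed credentials
  $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

  // A prepared statement guards against SQL injection when saving a page.
  $stmt = $pdo->prepare('INSERT INTO pages (url, title, body) VALUES (?, ?, ?)');
  $stmt->execute(array(
      'https://example.com/',                // URL the crawler visited (assumed)
      'Example Domain',                      // page title extracted from <title>
      '...page text extracted by the crawler...'
  ));

A real implementation would also need to handle re-crawled URLs (for example with MySQL's INSERT ... ON DUPLICATE KEY UPDATE) so that revisiting a page updates the stored copy instead of failing the insert.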

An Example of System Interfaces
Note that the interfaces are just an example and may not meet the exercise requirements; you should design your own interfaces. The example without using an iframe can be found here.


Evaluations
The following features will be considered when grading: