Chapter 1

Background on spiders

What is a spider?

A spider is a program that search engines use to collect information about web sites on the Internet. These programs traverse the World Wide Web, gathering the structure and/or contents of web sites: a spider retrieves a document, recursively retrieves all the documents it references, and stores that information for later processing. Spiders are also called robots, webbots, crawlers, or web wanderers.
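The retrieve-and-recurse process described above can be sketched as a breadth-first traversal. The following is a minimal sketch in Python (an assumption; the project's spider is not written in Python), with an in-memory site map standing in for real network fetches:

```python
from collections import deque

def crawl(start_url, fetch):
    """Breadth-first traversal: fetch a page, then visit every
    linked page that has not been seen before."""
    seen = {start_url}
    queue = deque([start_url])
    order = []                      # pages in the order they were fetched
    while queue:
        url = queue.popleft()
        links = fetch(url)          # fetch() returns the links found on the page
        order.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# A tiny in-memory "web" standing in for real HTTP fetches.
site = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html":     ["/index.html", "/b.html"],
    "/b.html":     [],
}
print(crawl("/index.html", lambda url: site.get(url, [])))
```

The `seen` set is what keeps the recursion from looping forever on sites that link back to each other.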

There are two basic ways that a spider can find your web site: you can tell the search engine about your site, or let the spider find it on its own. Typically, a search engine will have a page on its web site that allows you to suggest a site. After a site has been suggested, the spider will visit that web site to collect information about it.

Spiders also follow the links on each web site they visit to find new sites. This is how a spider finds your site by itself: the more web sites that link to your site, the more likely a spider is to find it without you telling it your site's URL (1).

How does a spider interact with the Internet?

The most interesting aspect of the spider program is that it is a network program. The subroutine analysis_page() encapsulates all the network programming required to implement a spider; it performs the "fetch" alluded to in step 2 of the above algorithm. This subroutine opens a socket to a server and uses the HTTP protocol to retrieve a page. If the server name has a port number appended to it, that port is used to establish the connection; otherwise, the well-known port 80 is used.
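The port-defaulting rule can be sketched as follows (a Python sketch with an assumed helper name; analysis_page() itself is written differently):

```python
def split_host_port(server, default_port=80):
    """If the server string has a port appended (host:port), use it;
    otherwise fall back to the well-known HTTP port 80."""
    host, sep, port = server.partition(":")
    return (host, int(port)) if sep else (host, default_port)

print(split_host_port("example.com"))        # no port given: use 80
print(split_host_port("example.com:8080"))   # explicit port wins
```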

Once a connection to the remote machine has been established, analysis_page() sends a string such as:

GET /index.html HTTP/1.0

This string is followed by two newline characters. This is a snippet of the Hypertext Transfer Protocol (HTTP), the protocol on which the Web is based. The request asks the web server to which we are connected to send us the contents of the file /index.html. analysis_page() then reads the socket until an end of file is encountered: we submit a request, the web server sends a response, and the connection is terminated.
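The request-and-read-until-EOF exchange can be sketched in Python (an assumption; the helper names below are hypothetical and do not come from the project's code). Note that the strict HTTP standard terminates lines with "\r\n" rather than a bare newline, though most servers accept either:

```python
import socket

def build_request(path):
    """Form the HTTP/1.0 request described above: the GET line
    followed by two newline characters."""
    return f"GET {path} HTTP/1.0\n\n"

def fetch(host, port, path):
    """Open a socket, send the request, and read until end of file,
    mirroring what analysis_page() does."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_request(path).encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:            # server closed the connection: end of file
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1")

print(repr(build_request("/index.html")))
```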

The response from the web server consists of a header, as specified by the HTTP standard, and the HTML-tagged text making up the page. These two parts of the response are separated by a blank line. The following is a typical response from a web server.

HTTP/1.0 200 OK
Date: Wed, 05 Dec 2001 20:04:26 GMT
Server: Apache/1.3.2 (Unix) mod_perl/1.15_01 mod_ssl/2.0.11 SSLeay/0.9.0b
Content-type: text/html
Content-length: 79
Accept-Ranges: bytes
Last-modified: Mon, 03 Dec 2001 14:30:52 GMT

<HTML><TITLE>My Web Page</TITLE>
<BODY>
This is my web page.
</BODY>
</HTML>

(2)
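Splitting such a response into its header and body can be sketched as follows (a Python sketch with an assumed helper name; real servers separate lines with "\r\n", which this simplified version does not handle):

```python
def parse_response(raw):
    """Split an HTTP response at the blank line that separates the
    header from the HTML body, and read the status code from the
    first header line."""
    header, _, body = raw.partition("\n\n")
    status_line = header.splitlines()[0]        # e.g. "HTTP/1.0 200 OK"
    status = int(status_line.split()[1])
    headers = {}
    for line in header.splitlines()[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return status, headers, body

raw = (
    "HTTP/1.0 200 OK\n"
    "Content-type: text/html\n"
    "\n"
    "<HTML><TITLE>My Web Page</TITLE></HTML>\n"
)
status, headers, body = parse_response(raw)
print(status, headers["Content-type"])
```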

Applications of spiders

The students working on this project are Man Luo, Wenqi Su, Liqiang Xi.