Wednesday, May 9, 2007

Crawlers, thesis

The Anatomy of a Large-Scale Hypertextual Web Search Engine

http://infolab.stanford.edu/~backrub/google.html

Brin and Page (1998)

Crawlers

World Wide Web Worm (McBryan, 1994) was a crawler used to build a simple index of document titles and URLs. The index could be searched with the Unix grep command.
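To make the idea of a grep-searchable index concrete, here is a minimal sketch in Python. It is not the Worm's actual code or file format: the tab-separated "title, URL" layout, the example.org entries, and the function name grep_index are all assumptions made for illustration.

```python
import re

# Hypothetical index: one "title<TAB>URL" line per document.
# The entries and the line format are invented for this sketch.
INDEX_LINES = [
    "Web Crawling Basics\thttp://example.org/crawling.html",
    "Unix Tools Overview\thttp://example.org/unix.html",
    "Search Engine Notes\thttp://example.org/search.html",
]


def grep_index(pattern, lines):
    """Return the index lines matching a regular expression, like grep -i."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [line for line in lines if regex.search(line)]


if __name__ == "__main__":
    for hit in grep_index("crawl", INDEX_LINES):
        title, url = hit.split("\t")
        print(title, "->", url)
```

Because every query scans the whole file line by line, this kind of index stays simple but does not scale to large collections.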

Google Crawler (Brin and Page, 1998) is described in some detail, but the reference covers only an early version of its architecture, which was written in C++ and Python. The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. A URL server sent lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed back to the URL server, which checked whether each URL had been seen before. If not, the URL was added to the URL server's queue.
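The URL server loop described above can be sketched roughly as follows. This is a single-process illustration, not the architecture from the paper: the class URLServer, the function crawl, the batch size, and the in-memory FAKE_WEB link graph are all hypothetical names and data standing in for real fetching and parsing.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on it.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}


class URLServer:
    """Keeps the set of seen URLs and the queue of URLs still to be fetched."""

    def __init__(self, seeds):
        self.seen = set()
        self.queue = deque()
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Queue a URL only if it has not been seen before."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_batch(self, size):
        """Hand out up to `size` URLs for a crawling process to fetch."""
        batch = []
        while self.queue and len(batch) < size:
            batch.append(self.queue.popleft())
        return batch


def crawl(server, fetch_and_parse, batch_size=2):
    """Fetch batches until the queue is empty; report extracted links back."""
    fetched = []
    while True:
        batch = server.next_batch(batch_size)
        if not batch:
            break
        for url in batch:
            fetched.append(url)
            for link in fetch_and_parse(url):
                server.add(link)  # duplicates are dropped by the server
    return fetched


if __name__ == "__main__":
    server = URLServer(["a"])
    print(crawl(server, lambda url: FAKE_WEB.get(url, [])))
```

Keeping the seen-URL check inside the server means the crawling processes need no shared state of their own; they only receive batches and report back the links they extracted.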