PRESENTATION ON
WEB CRAWLING
SUBMITTED TO: MR. YUDHVIR SINGH
(LECTURER IN C.S.E. DEPT.)
INTRODUCTION
The World Wide Web (WWW), or the Web for short, is a collection of billions of documents written in a way that enables them to cite each other using hyperlinks, which is why they are a form of hypertext. These documents, or Web pages, are typically a few thousand characters long, written in a diversity of languages, and cover essentially all topics of human endeavor. The Web has become highly popular in recent years and is now one of the primary means of publishing information on the Internet. Once the Web grew beyond a few sites and a small number of documents, it became clear that manually browsing a significant portion of the hypertext structure was no longer feasible, let alone an effective method for resource discovery. Browsing is a useful but restrictive means of finding information: given a page with many links to follow, exploring them all in search of a specific piece of information is tedious and offers little guidance.
Breadth-First Crawling
The idea of breadth-first indexing is to retrieve all the pages around the starting point before following links further away from the start. This is the most common way that crawlers follow links. If your robot is indexing several hosts, this approach distributes the load quickly, so that no one server must respond constantly. It is also easier for crawler writers to implement parallel processing with this approach. In the diagram, the starting point is at the center, in the darkest gray. The next pages, in medium gray, are indexed first, followed by the pages they link to (in light gray), and finally the outer links, in white.
Fig. 1.1: Breadth-first crawling
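The crawl order above can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: extract_links(url) is a hypothetical helper that fetches a page and returns its outlinks as a list, and real concerns such as robots.txt handling and politeness delays are omitted.

from collections import deque

def bfs_crawl(start_url, extract_links, max_pages=100):
    """Breadth-first crawl: every page one link from the start is
    fetched before any page two links away."""
    frontier = deque([start_url])        # FIFO queue gives breadth-first order
    discovered = {start_url}
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()         # oldest discovered URL first
        crawled.append(url)
        for link in extract_links(url):  # fetch the page, collect its outlinks
            if link not in discovered:
                discovered.add(link)
                frontier.append(link)    # enqueue at the back of the queue
    return crawled

Because newly discovered links go to the back of the queue, every page at distance one from the start is fetched before any page at distance two, which matches the shading in Fig. 1.1.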
Depth-First Crawling
The alternative approach, depth-first indexing, follows the first link on the starting page, then the first link on that second page, and so on. Once it has indexed the first link on each page, it goes on to the second and subsequent links and follows them. Some unsophisticated crawlers use this method, as it can be easier to code. In this diagram, the starting point is at the center, in the darkest gray. The first linked page is in dark gray, and the first links from that page are in lighter grays.
Fig. 1.2: Depth-first crawling
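Depth-first order falls out of the same skeleton by changing only the frontier discipline: a stack (last in, first out) instead of a queue. As before, this is an illustrative sketch, and extract_links is a hypothetical helper that returns a page's outlinks as a list.

def dfs_crawl(start_url, extract_links, max_pages=100):
    """Depth-first crawl: keep following the most recently pushed
    link before returning to links discovered earlier."""
    frontier = [start_url]               # a plain list used as a LIFO stack
    discovered = {start_url}
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.pop()             # newest URL on the stack first
        crawled.append(url)
        for link in reversed(extract_links(url)):
            # push in reverse so the page's first link is popped first,
            # matching the "follow the first link on each page" behaviour
            if link not in discovered:
                discovered.add(link)
                frontier.append(link)
    return crawled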
Scalability
By scalable, we mean that a Web crawler is designed to scale up to the entire Web and can be used to fetch tens of millions of Web documents. Scalability is achieved by implementing the data structures used during a Web crawl such that they use a bounded amount of memory, regardless of the size of the crawl. Hence, the vast majority of the data structures are stored on disk, and only small parts of them are kept in memory for efficiency.
Extensibility
By extensible, we mean that a Web crawler is designed in a modular way, with the expectation that third parties will add new functionality.
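The bounded-memory point can be illustrated with a small sketch: keep the "already seen" URL set in an on-disk SQLite table instead of an in-memory set, so its size does not grow with the crawl. This is only an illustration of the idea, not how Mercator or any particular crawler actually lays out its data structures.

import sqlite3

class DiskBackedSeenSet:
    """Stores the crawl's 'already seen' URLs on disk so that memory
    use stays bounded regardless of how large the crawl grows."""
    def __init__(self, path="seen_urls.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def add(self, url):
        """Record url as seen; return True if it was not seen before."""
        try:
            self.db.execute("INSERT INTO seen (url) VALUES (?)", (url,))
            self.db.commit()
            return True
        except sqlite3.IntegrityError:   # primary-key clash: already seen
            return False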
Parallel Crawling
As the size of the Web grows, it becomes more difficult to retrieve the whole or even a significant portion of the Web using a single process. Therefore, many search engines run multiple processes in parallel to crawl the Web, so that the download rate is maximized. We refer to this type of crawler as a parallel crawler. The main design criterion for such crawlers is to maximize performance (e.g., download rate) while minimizing the overhead from parallelization.
The following issues make the study of a parallel crawler challenging and interesting:
Overlap: When multiple processes run in parallel to download pages, it is possible that different processes download the same page multiple times. One process may not be aware that another process has already downloaded the page. Clearly, such multiple downloads should be minimized to save network bandwidth and increase the crawler's effectiveness. How, then, can we coordinate the processes to prevent overlap? (One common partitioning scheme is sketched after this list.)
Quality: Often, a crawler wants to download "important" pages first, in order to maximize the "quality" of the downloaded collection. However, in a parallel crawler, each process may not be aware of the whole image of the Web that the processes have collectively downloaded so far. For this reason, each process may make crawling decisions based solely on its own image of the Web (the part it has itself downloaded) and thus make poor choices. How, then, can we make sure that the quality of the downloaded pages is as good for a parallel crawler as for a centralized one?
Communication bandwidth: In order to prevent overlap, or to improve the quality of the downloaded pages, crawling processes need to communicate periodically to coordinate with each other. However, this communication may grow significantly as the number of crawling processes increases. Exactly what do they need to communicate, and how significant would this overhead be? Can we minimize this communication overhead while maintaining the effectiveness of the crawler?
Network-load reduction: In addition to dispersing the load, a parallel crawler may actually reduce the overall network load.
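One widely used way to keep overlap low without constant communication is to partition the URL space statically, for example by hashing each URL's host name so that every crawling process owns a disjoint slice of the Web. The sketch below illustrates that assignment rule; the hash choice and process count are illustrative and not taken from the presentation.

import hashlib
from urllib.parse import urlparse

def assign_process(url, num_processes):
    """Map a URL to the crawling process responsible for it by hashing
    its host name. Every process applies the same rule, so a discovered
    URL can be routed to its owner instead of being fetched twice."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_processes

def belongs_to_me(url, my_id, num_processes):
    """A process keeps only the URLs assigned to it and forwards the rest."""
    return assign_process(url, num_processes) == my_id

Hashing by host rather than by full URL also keeps all pages of one site with a single process, which makes per-server politeness easier to enforce.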
Focused Crawling
Large-scale, topic-specific information gatherers are called focused crawlers. In contrast to giant, all-purpose crawlers, which must process large portions of the Web in a centralized manner, a distributed federation of focused crawlers can cover specialized topics in more depth and keep the crawl fresher, because each crawler has less ground to cover.
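A focused crawler can be viewed as the ordinary crawl loop with a priority queue in place of the FIFO frontier: candidate links are scored for topical relevance, and the highest-scoring ones are fetched first. The sketch below assumes hypothetical extract_links and relevance callables supplied by the caller; the cited focused-crawling work uses trained classifiers for the scoring step, which is not reproduced here.

import heapq

def focused_crawl(start_url, extract_links, relevance, max_pages=100):
    """Fetch the most topically promising URLs first by keeping the
    frontier in a priority queue keyed on a relevance score."""
    frontier = [(-relevance(start_url), start_url)]   # negate: heapq is a min-heap
    discovered = {start_url}
    crawled = []
    while frontier and len(crawled) < max_pages:
        _, url = heapq.heappop(frontier)              # best-scoring URL first
        crawled.append(url)
        for link in extract_links(url):
            if link not in discovered:
                discovered.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return crawled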
REFERENCES
Robots & Spiders & Crawlers: How Web and Intranet Search Engines Follow Links to Build Indexes (white paper), Avi Rappaport
The Anatomy of a Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page
Mercator: A Scalable, Extensible Web Crawler, Allan Heydon and Marc Najork
Robots in the Web: Threat or Treat?, Martijn Koster
How Public Is the Web?: Robots, Access, and Scholarly Communication, Center for Social Informatics, SLIS, Indiana University
Accelerated Focused Crawling through Online Relevance Feedback, Soumen Chakrabarti, Kunal Punera, and Mallela Subramanyam
Parallel Crawlers, Junghoo Cho
www.searchenginewatch.com
www.botspot.com
www.spiderline.com