WebCrawlingChapter Chapter 8

The document summarizes chapter 8 on web crawling from the book "Web Data Mining" by Bing Liu. It discusses the motivation and taxonomy of crawlers, including basic crawlers, universal crawlers, and preferential crawlers. It covers implementation issues for crawlers like data structures, parsing HTML, and handling stop words. The goal of crawlers is to retrieve web pages to support search engines, monitor sites, and harvest data, and there are challenges to building robust, efficient crawlers at scale.

Ch. 8: Web Crawling

By Filippo Menczer
Indiana University School of Informatics

in Web Data Mining by Bing Liu
Springer, 2007
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated
crawlers

Slides © 2007 Filippo Menczer, Indiana University School of Informatics


Bing Liu: Web Data Mining. Springer, 2007
Ch. 8 Web Crawling by Filippo Menczer
Q: How does a search engine know that all these pages contain the query terms?

A: Because all of those pages have been crawled

Crawler: basic idea

[Diagram: a crawl expanding outward from the starting pages (seeds)]
Many names
• Crawler
• Spider
• Robot (or bot)
• Web agent
• Wanderer, worm, …
• And famous instances: googlebot,
scooter, slurp, msnbot, …

Googlebot & you

Motivation for crawlers
• Support universal search engines (Google,
Yahoo, MSN/Windows Live, Ask, etc.)
• Vertical (specialized) search engines, e.g.
news, shopping, papers, recipes, reviews, etc.
• Business intelligence: keep track of potential
competitors, partners
• Monitor Web sites of interest
• Evil: harvest emails for spamming, phishing…
• … Can you think of some others?…

A crawler within a search engine

[Diagram: the crawler (googlebot) feeds a Web page repository; offline text & link analysis builds the text index and PageRank; at query time a ranker combines them to return hits]
One taxonomy of crawlers

Crawlers
• Universal crawlers
• Preferential crawlers
  – Focused crawlers
  – Topical crawlers
    • Adaptive topical crawlers (evolutionary crawlers, reinforcement learning crawlers, etc.)
    • Static crawlers (best-first, PageRank, etc.)

• Many other criteria could be used:
  – Incremental, Interactive, Concurrent, Etc.


Basic crawlers
• This is a sequential
crawler
• Seeds can be any list of
starting URLs
• Order of page visits is
determined by frontier
data structure
• Stop criterion can be
anything

Graph traversal
(BFS or DFS?)
• Breadth First Search
– Implemented with QUEUE (FIFO)
– Finds pages along shortest paths
– If we start with “good” pages, this
keeps us close; maybe other good
stuff…
• Depth First Search
– Implemented with STACK (LIFO)
– Wander away (“lost in cyberspace”)
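The two traversal orders differ only in the frontier discipline: FIFO queue for BFS, LIFO stack for DFS. A minimal sketch (in Python rather than the Perl used later in these slides), over a hypothetical toy link graph:

```python
from collections import deque

def crawl_order(graph, seed, mode="bfs"):
    """Return the page visit order for a BFS (FIFO queue) or DFS (LIFO stack) crawl."""
    frontier = deque([seed])
    visited = []
    seen = {seed}
    while frontier:
        # BFS pops from the front (FIFO); DFS pops from the back (LIFO)
        url = frontier.popleft() if mode == "bfs" else frontier.pop()
        visited.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical toy link graph: A links to B and C; B links to D
graph = {"A": ["B", "C"], "B": ["D"]}
```

BFS visits A, B, C, D (level by level, staying close to the seed); DFS visits A, C, B, D (wandering down one branch before backtracking).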

A basic crawler in Perl
• Queue: a FIFO list (shift and push)

my @frontier = read_seeds($file);
while (@frontier && $tot < $max) {
    my $next_link = shift @frontier;
    my $page = fetch($next_link);
    add_to_index($page);
    $tot++;   # count fetched pages so the $tot < $max stop criterion can trigger
    my @links = extract_links($page, $next_link);
    push @frontier, process(@links);
}

Implementation issues
• Don’t want to fetch same page twice!
– Keep lookup table (hash) of visited pages
– What if not visited but in frontier already?
• The frontier grows very fast!
– May need to prioritize for large crawls
• Fetcher must be robust!
– Don’t crash if download fails
– Timeout mechanism
• Determine file type to skip unwanted files
– Can try using extensions, but not reliable
– Can issue ‘HEAD’ HTTP commands to get Content-Type
(MIME) headers, but overhead of extra Internet requests
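The visited-lookup and in-frontier checks above can share one hash-based set, so a URL is skipped whether it was already fetched or is merely queued. A minimal Python sketch (the class name and API are illustrative, not from the slides):

```python
class Frontier:
    """Frontier that never yields the same URL twice.

    A URL is skipped if it was already fetched or is already queued --
    the two duplicate cases the slide distinguishes.
    """
    def __init__(self):
        self.queue = []      # FIFO list of URLs still to fetch
        self.seen = set()    # every URL ever enqueued (visited or pending)

    def add(self, url):
        if url not in self.seen:   # hash lookup: O(1) on average
            self.seen.add(url)
            self.queue.append(url)
            return True
        return False               # duplicate: silently ignored

    def next(self):
        return self.queue.pop(0) if self.queue else None

f = Frontier()
f.add("http://a.example/")
f.add("http://b.example/")
f.add("http://a.example/")   # duplicate, not enqueued again
```

For billion-page crawls the `seen` set would live on disk (or be replaced by a Bloom filter), but the logic is the same.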

More implementation issues

• Fetching
– Get only the first 10-100 KB per page
– Take care to detect and break
redirection loops
– Soft fail for timeout, server not
responding, file not found, and other
errors

More implementation issues: Parsing
• HTML has the structure of a DOM
(Document Object Model) tree
• Unfortunately actual HTML is often
incorrect in a strict syntactic sense
• Crawlers, like browsers, must be
robust/forgiving
• Fortunately there are tools that can
help
– E.g. tidy.sourceforge.net
• Must pay attention to HTML
entities and unicode in text
• What to do with a growing number
of other formats?
– Flash, SVG, RSS, AJAX…

More implementation issues
• Stop words
– Noise words that do not carry meaning should be eliminated
(“stopped”) before they are indexed
– E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc…
– Typically syntactic markers
– Typically the most common terms
– Typically kept in a negative dictionary
• 10–1,000 elements
• E.g. http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
– Parser can detect these right away and disregard them
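A negative dictionary is naturally a hash-based set, so the parser can test each token in constant time. A minimal sketch in Python (the stop list here is a tiny illustrative subset; real lists hold tens to hundreds of entries):

```python
# Tiny illustrative negative dictionary (real lists are much longer)
STOP_WORDS = {"and", "the", "a", "at", "or", "on", "for", "of", "in"}

def remove_stop_words(tokens):
    """Drop noise words before indexing, as the parser would."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```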

More implementation issues
Conflation and thesauri
• Idea: improve recall by merging words with same
meaning
1. We want to ignore superficial morphological
features, thus merge semantically similar tokens
– {student, study, studying, studious} => studi
2. We can also conflate synonyms into a single form
using a thesaurus
– 30-50% smaller index
– Doing this in both pages and queries allows us to retrieve
pages about ‘automobile’ when the user asks for ‘car’
– Thesaurus can be implemented as a hash table

More implementation issues
• Stemming
– Morphological conflation based on rewrite rules
– Language dependent!
– Porter stemmer very popular for English
• http://www.tartarus.org/~martin/PorterStemmer/
• Context-sensitive grammar rules, eg:
– “IES” except (“EIES” or “AIES”) --> “Y”
• Versions in Perl, C, Java, Python, C#, Ruby, PHP, etc.
– Porter has also developed Snowball, a language to create
stemming algorithms in any language
• http://snowball.tartarus.org/
• Ex. Perl modules: Lingua::Stem and Lingua::Stem::Snowball
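The single context-sensitive rule quoted above can be sketched on its own; a real stemmer such as Porter's chains dozens of such rules with measure conditions. A minimal Python illustration of just that one rule (function name is ours):

```python
def apply_ies_rule(word):
    """One context-sensitive rewrite rule from the slide:
    "IES" except ("EIES" or "AIES") --> "Y".
    A full stemmer applies many such rules in sequence."""
    w = word.lower()
    if w.endswith("ies") and not (w.endswith("eies") or w.endswith("aies")):
        return w[:-3] + "y"
    return w
```

For example, `apply_ies_rule("studies")` yields `"study"`, while words not matching the pattern pass through unchanged.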

More implementation issues
• Static vs. dynamic pages
– Is it worth trying to eliminate dynamic pages and only index
static pages?
– Examples:
• http://www.census.gov/cgi-bin/gazetteer
• http://informatics.indiana.edu/research/colloquia.asp
• http://www.amazon.com/exec/obidos/subst/home/home.html/002-8332429-6490452

• http://www.imdb.com/Name?Menczer,+Erico
• http://www.imdb.com/name/nm0578801/
– Why or why not? How can we tell if a page is dynamic? What
about ‘spider traps’?
– What do Google and other search engines do?

More implementation issues
• Relative vs. Absolute URLs
– Crawler must translate relative URLs into absolute
URLs
– Need to obtain Base URL from HTTP header, or
HTML Meta tag, or else current page path by
default
– Examples
• Base: http://www.cnn.com/linkto/
• Relative URL: intl.html
• Absolute URL: http://www.cnn.com/linkto/intl.html
• Relative URL: /US/
• Absolute URL: http://www.cnn.com/US/
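The resolution rules in the examples above are exactly what standard URL libraries implement; a short Python check using `urllib.parse.urljoin` with the slide's CNN base URL:

```python
from urllib.parse import urljoin

base = "http://www.cnn.com/linkto/"   # base URL of the page being parsed

# A relative path resolves under the base directory
assert urljoin(base, "intl.html") == "http://www.cnn.com/linkto/intl.html"
# A leading slash resolves from the server root
assert urljoin(base, "/US/") == "http://www.cnn.com/US/"
```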

More implementation issues
• URL canonicalization
– All of these:
• http://www.cnn.com/TECH
• http://WWW.CNN.COM/TECH/
• http://www.cnn.com:80/TECH/
• http://www.cnn.com/bogus/../TECH/
– Are really equivalent to this canonical form:
• http://www.cnn.com/TECH/
– In order to avoid duplication, the crawler must
transform all URLs into canonical form
– Definition of “canonical” is arbitrary, e.g.:
• Could always include port
• Or only include port when not default :80

More on Canonical URLs
• Some transformations are trivial, for example:

 http://informatics.indiana.edu
 http://informatics.indiana.edu/

 http://informatics.indiana.edu/index.html#fragment
 http://informatics.indiana.edu/index.html

 http://informatics.indiana.edu/dir1/./../dir2/
 http://informatics.indiana.edu/dir2/

 http://informatics.indiana.edu/%7Efil/
 http://informatics.indiana.edu/~fil/

 http://INFORMATICS.INDIANA.EDU/fil/
 http://informatics.indiana.edu/fil/
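One possible canonicalizer covering the trivial transformations above, sketched in Python; the exact canonical form chosen here (lowercase scheme and host, default port 80 dropped, `./`/`../` resolved, percent-escapes decoded, fragments removed) is one arbitrary but consistent convention, as the previous slide notes:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, unquote

def canonicalize(url):
    """Map equivalent URLs to one canonical form (one arbitrary convention)."""
    parts = urlsplit(url)
    host = parts.hostname.lower() if parts.hostname else ""
    if parts.port and parts.port != 80:     # keep the port only when not the default
        host = f"{host}:{parts.port}"
    # Decode %7E-style escapes and resolve ./ and ../ path segments
    path = posixpath.normpath(unquote(parts.path)) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"                          # normpath strips the trailing slash
    # Rebuild with the fragment dropped
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

All four equivalent CNN URLs from the slide collapse to `http://www.cnn.com/TECH/` under this function.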

More on Canonical URLs
Other transformations require heuristic assumptions about
the intentions of the author or the configuration of the
Web server:
1. Removing default file name
 http://informatics.indiana.edu/fil/index.html
 http://informatics.indiana.edu/fil/
– This is reasonable in general but would be wrong in this
case because the default happens to be ‘default.asp’
instead of ‘index.html’
2. Trailing directory
 http://informatics.indiana.edu/fil
 http://informatics.indiana.edu/fil/
– This is correct in this case but how can we be sure in
general that there isn’t a file named ‘fil’ in the root dir?

More implementation issues
• Spider traps
– Misleading sites: indefinite number of pages
dynamically generated by CGI scripts
– Paths of arbitrary depth created using soft
directory links and path rewriting features in
HTTP server
– Only heuristic defensive measures:
• Check URL length; assume spider trap above some
threshold, for example 128 characters
• Watch for sites with very large number of URLs
• Eliminate URLs with non-textual data types
• May disable crawling of dynamic pages, if they can be detected

More implementation issues
• Page repository
– Naïve: store each page as a separate file
• Can map URL to unique filename using a hashing function,
e.g. MD5
• This generates a huge number of files, which is inefficient
from the storage perspective
– Better: combine many pages into a single large file, using
some XML markup to separate and identify them
• Must map URL to {filename, page_id}
– Database options
• Any RDBMS -- large overhead
• Light-weight, embedded databases such as Berkeley DB
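The naive URL-to-filename mapping via a hash function can be sketched in one line of Python; MD5 is fine here because we only need a unique, fixed-length name, not cryptographic strength (the function name and `.html` suffix are our choices):

```python
import hashlib

def url_to_filename(url):
    """Map a URL to a unique fixed-length filename via MD5,
    as in the naive one-file-per-page scheme above."""
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"

fn = url_to_filename("http://example.com/page.html")
```

The same URL always maps to the same 32-hex-digit name, and distinct URLs collide only with negligible probability.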

Concurrency
• A crawler incurs several delays:
– Resolving the host name in the URL to an
IP address using DNS
– Connecting a socket to the server and
sending the request
– Receiving the requested page in response
• Solution: Overlap the above delays by
fetching many pages concurrently

Architecture of a concurrent crawler

[Diagram: multiple fetch threads/processes sharing the frontier and the page repository]

Concurrent crawlers
• Can use multi-processing or multi-threading
• Each process or thread works like a
sequential crawler, except they share data
structures: frontier and repository
• Shared data structures must be
synchronized (locked for concurrent
writes)
• Speedups by a factor of 5-10 are easy to obtain this
way
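A minimal multi-threaded sketch in Python of the scheme just described: workers share a frontier and repository, both guarded by a lock. The fetch here is a local stand-in (no network), purely to show the synchronization pattern; in a real crawler the fetch is where the overlapped network delay pays off:

```python
import threading
from collections import deque

frontier = deque(["u1", "u2", "u3", "u4", "u5", "u6"])  # hypothetical seed URLs
repository = {}
lock = threading.Lock()          # shared structures must be locked for writes

def fetch(url):
    return f"<html>{url}</html>"  # stand-in for a real HTTP fetch

def worker():
    while True:
        with lock:               # synchronized access to the frontier
            if not frontier:
                return
            url = frontier.popleft()
        page = fetch(url)        # network delay would overlap across threads here
        with lock:               # synchronized write to the repository
            repository[url] = page

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread behaves like the sequential crawler; only the dequeue and the store are serialized.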


Universal crawlers
• Support universal search engines
• Large-scale
• Huge cost (network bandwidth) of
crawl is amortized over many queries
from users
• Incremental updates to existing
index and other data repositories

Large-scale universal crawlers
• Two major issues:
1. Performance
• Need to scale up to billions of pages
2. Policy
• Need to trade-off coverage,
freshness, and bias (e.g. toward
“important” pages)

Large-scale crawlers: scalability
• Need to minimize overhead of DNS lookups
• Need to optimize utilization of network bandwidth
and disk throughput (I/O is bottleneck)
• Use asynchronous sockets
– Multi-processing or multi-threading do not scale up to
billions of pages
– Non-blocking: hundreds of network connections open
simultaneously
– Polling socket to monitor completion of network
transfers

High-level architecture of a scalable universal crawler

[Diagram annotations:]
• Several parallel queues to spread load across servers (keep connections alive)
• DNS server using UDP (less overhead than TCP), large persistent in-memory cache, and prefetching
• Optimize use of network bandwidth
• Optimize disk I/O throughput
• Huge farm of crawl machines

Universal crawlers: Policy
• Coverage
– New pages get added all the time
– Can the crawler find every page?
• Freshness
– Pages change over time, get removed, etc.
– How frequently can a crawler revisit ?
• Trade-off!
– Focus on most “important” pages (crawler bias)?
– “Importance” is subjective

Web coverage by search engine crawlers

This assumes we know the size of the entire Web. Do we? Can you define “the size of the Web”?

[Bar chart: estimated coverage by the best crawlers declining from 1997 to 2000; values shown are roughly 50%, 35%, 34%, and 16%]

Maintaining a “fresh” collection
• Universal crawlers are never “done”
• High variance in rate and amount of page changes
• HTTP headers are notoriously unreliable
– Last-modified
– Expires
• Solution
– Estimate the probability that a previously visited page
has changed in the meanwhile
– Prioritize by this probability estimate

Estimating page change rates
• Algorithms for maintaining a crawl in which
most pages are fresher than a specified
epoch
– Brewington & Cybenko; Cho, Garcia-Molina & Page
• Assumption: recent past predicts the future
(Ntoulas, Cho & Olston 2004)
– Frequency of change not a good predictor
– Degree of change is a better predictor

Do we need to crawl the entire Web?
• If we cover too much, it will get stale
• There is an abundance of pages in the Web
• For PageRank, pages with very low prestige are largely
useless
• What is the goal?
– General search engines: pages with high prestige
– News portals: pages that change often
– Vertical portals: pages on some topic
• What are appropriate priority measures in these
cases? Approximations?

Breadth-first crawlers
• BF crawler tends to
crawl high-
PageRank pages
very early
• Therefore, BF
crawler is a good
baseline to gauge
other crawlers
• But why is this so? (Najork and Wiener 2001)

Bias of breadth-first crawlers
• The structure of the
Web graph is very
different from a random
network
• Power-law distribution of
in-degree
• Therefore there are hub
pages with very high PR
and many incoming links
• These are attractors: you
cannot avoid them!


Preferential crawlers
• Assume we can estimate for each page an
importance measure, I(p)
• Want to visit pages in order of decreasing I(p)
• Maintain the frontier as a priority queue sorted by
I(p)
• Possible figures of merit:
– Precision ~
| p: crawled(p) & I(p) > threshold | / | p: crawled(p) |
– Recall ~
| p: crawled(p) & I(p) > threshold | / | p: I(p) > threshold |
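The priority-queue frontier and the precision figure of merit can be sketched together in Python; the importance scores and link graph below are hypothetical, and `heapq` is a min-heap, so priorities are stored as -I(p):

```python
import heapq

def preferential_crawl(importance, links, seed, max_pages):
    """Visit pages in order of decreasing importance I(p),
    keeping the frontier as a priority queue."""
    frontier = [(-importance[seed], seed)]   # max-priority via negated scores
    crawled, seen = [], {seed}
    while frontier and len(crawled) < max_pages:
        _, p = heapq.heappop(frontier)
        crawled.append(p)
        for q in links.get(p, []):
            if q not in seen:
                seen.add(q)
                heapq.heappush(frontier, (-importance[q], q))
    return crawled

def precision(crawled, importance, threshold):
    """Fraction of crawled pages whose importance exceeds the threshold."""
    return sum(1 for p in crawled if importance[p] > threshold) / len(crawled)

# Hypothetical importance scores and link graph
importance = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.5}
links = {"a": ["b", "c"], "c": ["d"]}
order = preferential_crawl(importance, links, "a", 3)
```

Starting from "a" with a budget of 3 pages, the crawler visits a, then c (0.8 beats b's 0.2), then d, skipping the low-importance page entirely.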

Preferential crawlers
• Selective bias toward some pages, e.g. most
“relevant”/topical, closest to seeds, most popular/largest
PageRank, unknown servers, highest rate/amount of
change, etc…
• Focused crawlers
– Supervised learning: classifier based on labeled examples
• Topical crawlers
– Best-first search based on similarity(topic, parent)
– Adaptive crawlers
• Reinforcement learning
• Evolutionary algorithms/artificial life

Preferential crawling algorithms:
Examples
• Breadth-First
– Exhaustively visit all links in order encountered
• Best-N-First
– Priority queue sorted by similarity, explore top N at a time
– Variants: DOM context, hub scores
• PageRank
– Priority queue sorted by keywords, PageRank
• SharkSearch
– Priority queue sorted by combination of similarity, anchor text, similarity of
parent, etc. (powerful cousin of FishSearch)
• InfoSpiders
– Adaptive distributed algorithm using an evolving population of learning
agents

Preferential crawlers: Examples

• For I(p) = PageRank (estimated based on pages crawled so far), we can find high-PR pages faster than a breadth-first crawler (Cho, Garcia-Molina & Page 1998)

[Plot: recall vs. crawl size]

Focused crawlers: Basic idea
• Naïve-Bayes classifier based
on example pages in desired
topic, c*
• Score(p) = Pr(c*|p)
– Soft focus: frontier is priority
queue using page score
– Hard focus:
• Find best leaf ĉ for p
• If an ancestor c’ of ĉ is in c*
then add links from p to
frontier, else discard
– Soft and hard focus work
equally well empirically
Example: Open Directory

Focused crawlers
• Can have multiple topics with as many classifiers,
with scores appropriately combined (Chakrabarti et
al. 1999)
• Can use a distiller to find topical hubs periodically,
and add these to the frontier
• Can accelerate with the use of a critic (Chakrabarti
et al. 2002)
• Can use alternative classifier algorithms to naïve-
Bayes, e.g. SVM and neural nets have reportedly
performed better (Pant & Srinivasan 2005)

Context-focused crawlers
• Same idea, but multiple classes (and
classifiers) based on link distance
from relevant targets
– ℓ=0 is topic of interest
– ℓ=1 link to topic of interest
– Etc.
• Initially needs a back-crawl from
seeds (or known targets) to train
classifiers to estimate distance
• Links in frontier prioritized based on
estimated distance from targets
• Outperforms standard focused
crawler empirically

Context graph
Topical crawlers
• All we have is a topic (query, description,
keywords) and a set of seed pages (not
necessarily relevant)
• No labeled examples
• Must predict relevance of unvisited links to
prioritize
• Original idea: Menczer 1997, Menczer &
Belew 1998

Example: myspiders.informatics.indiana.edu

Topical locality
• Topical locality is a necessary condition for a topical
crawler to work, and for surfing to be a worthwhile
activity for humans
• Links must encode semantic information, i.e. say
something about neighbor pages, not be random
• It is also a sufficient condition if we start from “good”
seed pages
• Indeed we know that Web topical locality is strong:
– Indirectly (crawlers work and people surf the Web)
– From direct measurements (Davison 2000; Menczer 2004, 2005)

Quantifying topical locality
• Different ways to pose the
question:
– How quickly does semantic
locality decay?
– How fast is topic drift?
– How quickly does content
change as we surf away from a
starting page?
• To answer these questions,
let us consider exhaustive
breadth-first crawls from
100 topic pages

The “link-cluster” conjecture

• Connection between semantic topology (relevance) and link topology (hypertext)
  – G = Pr[rel(p)] ~ fraction of relevant/topical pages (topic generality)
  – R = Pr[rel(p) | rel(q) AND link(q,p)] ~ conditional probability given a neighbor on topic
• Related nodes are clustered if R > G
  – Necessary and sufficient condition for a random crawler to find pages related to start points
  – Example: C = 2 topical clusters with stronger modularity within each cluster than outside; for the example graph, G = 5/15 and R = 3/6 = 2/4

Link-cluster conjecture

• Stationary hit rate for a random crawler:

  η(t+1) = η(t)·R + (1 − η(t))·G ≥ η(t)

  η(t) → η* = G / (1 − (R − G)) as t → ∞

• Conjecture: η* > G ⇔ R > G

• Value added of links:

  η*/G − 1 = (R − G) / (1 − (R − G))
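The fixed point of the hit-rate recurrence can be verified numerically; a short Python sketch iterating it with hypothetical values of R and G (clustered case, R > G):

```python
def stationary_hit_rate(R, G, steps=1000):
    """Iterate eta(t+1) = eta(t)*R + (1 - eta(t))*G from eta(0) = G."""
    eta = G
    for _ in range(steps):
        eta = eta * R + (1 - eta) * G
    return eta

R, G = 0.5, 0.2            # hypothetical: topic is clustered (R > G)
eta_star = stationary_hit_rate(R, G)
# Closed form: eta* = G / (1 - (R - G)); here 0.2 / 0.7
```

The iteration contracts by a factor |R − G| < 1 per step, so it converges to the closed-form value, which exceeds G exactly when R > G.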

Link-cluster conjecture: preservation of semantics (meaning) across links

  R(q,δ) ≡ Pr[ rel(p) | rel(q) ∧ path(q,p) ≤ δ ]

  G(q) ≡ Pr[ rel(p) ]

• 1000 times more likely to be on topic if near an on-topic page!

  L(q,δ) ≡ ( Σ_{p: path(q,p) ≤ δ} path(q,p) ) / |{p: path(q,p) ≤ δ}|
The “link-content” conjecture

  S(q,δ) ≡ ( Σ_{p: path(q,p) ≤ δ} sim(q,p) ) / |{p: path(q,p) ≤ δ}|

• Correlation of lexical (content) and linkage topology
• L(): average link distance from the start (topic) page
• S(): average content similarity to the start page over pages up to distance δ
• Correlation ρ(L,S) = −0.76

Heterogeneity of link-content correlation

  S = c + (1 − c)·e^(aL^b)

[Plot: fitted similarity-decay curves by top-level domain: edu, net, gov, org, com]
• .com has more drift
• signif. diff. a only (α = 0.05)
• signif. diff. a & b (α = 0.05)

Topical locality-inspired tricks
for topical crawlers
• Co-citation (a.k.a. sibling
locality): A and C are good
hubs, thus A and D should
be given high priority
• Co-reference (a.k.a. bibliographic coupling):
E and G are good
authorities, thus E and H
should be given high
priority

Correlations between different
similarity measures
• Semantic similarity measured
from ODP, correlated with:
– Content similarity: TF or TF-IDF
vector cosine
– Link similarity: Jaccard
coefficient of (in+out) link
neighborhoods
• Correlation overall is significant
but weak
• Much stronger topical locality in
some topics, e.g.:
– Links very informative in news
sources
– Text very informative in recipes

Naïve Best-First

Simplest topical crawler: frontier is a priority queue based on text similarity between topic and parent page

BestFirst(topic, seed_urls) {
    foreach link (seed_urls) {
        enqueue(frontier, link);
    }
    while (#frontier > 0 and visited < MAX_PAGES) {
        link := dequeue_link_with_max_score(frontier);
        doc := fetch_new_document(link);
        score := sim(topic, doc);
        foreach outlink (extract_links(doc)) {
            if (#frontier >= MAX_BUFFER) {
                dequeue_link_with_min_score(frontier);
            }
            enqueue(frontier, outlink, score);
        }
    }
}
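The sim(topic, doc) call in the pseudocode is left abstract; a common concrete choice is cosine similarity between term-frequency vectors, sketched here in Python (a simplification: no stop word removal, stemming, or TF-IDF weighting):

```python
from collections import Counter
from math import sqrt

def sim(topic, doc):
    """Cosine similarity between term-frequency vectors of two texts."""
    a, b = Counter(topic.lower().split()), Counter(doc.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Identical texts score 1, texts sharing no terms score 0, and partial overlap falls in between.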

Best-first variations
• Many in literature, mostly stemming from
different ways to score unvisited URLs. E.g.:
– Giving more importance to certain HTML markup in
parent page
– Extending text representation of parent page with
anchor text from “grandparent” pages (SharkSearch)
– Limiting link context to less than entire page
– Exploiting topical locality (co-citation)
– Exploration vs exploitation: relax priorities
• Any of these can be (and many have been)
combined

Link context based on text neighborhood
• Often consider a
fixed-size window, e.g.
50 words around
anchor
• Can weigh links based
on their distance from
topic keywords within
the document
(InfoSpiders, Clever)
• Anchor text deserves
extra importance

Link context based on DOM tree
• Consider DOM subtree
rooted at parent node of
link’s <a> tag
• Or can go further up in the
tree (Naïve Best-First is
special case of entire
document body)
• Trade-off between noise
due to too small or too
large context tree (Pant
2003)

DOM context
Link score = linear combination between page-based and context-based similarity score
Co-citation: hub scores
Link score_hub = linear combination between link score and hub score
Hub score = number of seeds linked from the page
Combining DOM context and hub scores
Experiment based on 159 ODP topics (Pant & Menczer 2003): split ODP URLs between seeds and targets; add 10 best hubs to seeds for 94 topics
Exploration vs Exploitation
• Best-N-First (or BFSN)
• Rather than re-sorting the frontier every time you add links, be lazy and sort only every N pages visited
• Empirically, being less greedy helps crawler performance significantly: escape “local topical traps” by exploring more
Pant et al. 2002
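Best-N-First can be sketched as follows: pop the N best links from the frontier as a batch, visit them all, and only then re-rank. Larger N means more exploration. The scoring and fetching callables are hypothetical stand-ins:

```python
import heapq

def best_n_first(seeds, score_fn, fetch, extract_links, n=25, max_pages=100):
    """Best-N-First (BFS_N) sketch: batch the N best frontier links
    before re-sorting, instead of greedily taking one at a time."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        batch = [heapq.heappop(frontier)
                 for _ in range(min(n, len(frontier)))]
        for _, link in batch:
            if link in visited or len(visited) >= max_pages:
                continue
            doc = fetch(link)
            visited.add(link)
            s = score_fn(doc)                  # score outlinks by this page
            for out in extract_links(doc):
                if out not in visited:
                    heapq.heappush(frontier, (-s, out))
    return visited
```

With N = 1 this degenerates to naive best-first; N around 256 was the empirically strong setting in the experiments cited on these slides.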
InfoSpiders
• A series of intelligent multi-agent topical crawling algorithms employing various adaptive techniques:
  – Evolutionary bias of exploration/exploitation
  – Selective query expansion
  – (Connectionist) reinforcement learning
Menczer & Belew 1998, 2000; Menczer et al. 2004
Link scoring and selection by each crawling agent

Stochastic link selector:
    Pr[l] = e^(β·λ_l) / Σ_{l'} e^(β·λ_{l'})

Link score from the agent’s neural net, with one input per query keyword k_1 … k_n:
    λ_l = nnet(in_1, …, in_N)

Each input sums the matches δ(k, w) of keyword k over the document, with inverse-distance weighting from the link instance:
    in_k = Σ_{w ∈ D} δ(k, w) / dist(w, l)
Artificial life-inspired Evolutionary Local Selection Algorithm (ELSA)

Foreach agent thread:
    Pick & follow link from local frontier
    Evaluate new links, merge frontier
    Adjust link estimator                // reinforcement learning
    E := E + payoff - cost               // payoff matches resource bias
    If E < 0:
        Die
    Elsif E > Selection_Threshold:
        Clone offspring
        Split energy with offspring
        Split frontier with offspring
        Mutate offspring                 // selective query expansion
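One generation of the ELSA loop can be sketched in Python. The agent representation, threshold, and cost constants here are illustrative assumptions, and `fetch`/`relevance` are caller-supplied stand-ins:

```python
import random

def elsa_step(agents, fetch, relevance, theta=2.0, cost=0.1):
    """One ELSA generation (sketch): each agent follows a link from its
    own frontier, gains energy = page relevance - cost, dies at E < 0,
    and clones (splitting energy and frontier) above threshold `theta`."""
    survivors = []
    for agent in agents:
        if not agent["frontier"]:
            continue
        link = agent["frontier"].pop(random.randrange(len(agent["frontier"])))
        page = fetch(link)
        agent["frontier"].extend(page["links"])   # merge new links
        agent["energy"] += relevance(page) - cost
        if agent["energy"] < 0:
            continue                              # agent dies
        if agent["energy"] > theta:               # clone offspring
            half = len(agent["frontier"]) // 2
            child = {"energy": agent["energy"] / 2,
                     "frontier": agent["frontier"][half:]}
            agent["frontier"] = agent["frontier"][:half]
            agent["energy"] /= 2
            survivors.append(child)
        survivors.append(agent)
    return survivors
```

Because energy is only gained in relevant regions, the population drifts toward relevant neighborhoods: selection matches the resource bias of the environment.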
Adaptation in InfoSpiders
• Unsupervised population evolution
  – Select agents to match resource bias
  – Mutate internal queries: selective query expansion
  – Mutate weights
• Unsupervised individual adaptation
  – Q-learning: adjust neural net weights to predict relevance locally
InfoSpiders
Evolutionary bias: an agent in a relevant area will spawn other agents to exploit/explore that neighborhood
[Figure: parent agent (keyword vector + neural net) splits its local frontier with its offspring]
Multithreaded InfoSpiders (MySpiders)
• Different ways to compute the cost of visiting a document:
  – Constant: cost_const = E_0 p_0 / T_max
  – Proportional to download time: cost_time = f(cost_const · t / timeout)
• The latter is of course more efficient (faster crawling), but it also yields better quality pages!
Selective Query Expansion in InfoSpiders: internalization of local text features
When a new agent is spawned, it picks up a common term from the current page (here the stem ‘th’)
[Figure: keyword vector (POLIT, CONSTITUT, TH, SYSTEM, GOVERN) feeding the neural net, with learned weights; the new term TH is added to the agent’s query]
Reinforcement Learning
• In general, reward function R: S × A → ℝ
• Learn policy π: S → A to maximize reward over time, typically discounted in the future:
    V = Σ_t γ^t r(t),  0 ≤ γ < 1
• Q-learning: optimal policy
    π*(s) = argmax_a Q(s, a) = argmax_a [ R(s, a) + γ V*(s') ]
  where V*(s') is the value of following the optimal policy in the future
[Diagram: state s with actions a1 → s1, a2 → s2]
Q-learning in InfoSpiders
• Use neural nets to estimate Q scores
• Compare estimated relevance of visited page with Q score of link estimated from parent page to obtain feedback signal
• Learn neural net weights using back-propagation of error with teaching input: E(D) + γ · max_{l ∈ D} λ_l
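The training signal can be illustrated with a tabular Q-learning update; InfoSpiders estimates Q with a neural net rather than a table, so this is a simplified sketch of the idea:

```python
def q_update(Q, state, link, reward, next_state, next_links,
             alpha=0.5, gamma=0.8):
    """Tabular Q-learning update: Q(state, link) moves toward the
    observed relevance of the fetched page plus the discounted best
    Q estimate over that page's outlinks (the "teaching input")."""
    best_next = max((Q.get((next_state, l), 0.0) for l in next_links),
                    default=0.0)
    target = reward + gamma * best_next      # teaching input
    old = Q.get((state, link), 0.0)
    Q[(state, link)] = old + alpha * (target - old)
    return Q[(state, link)]
```

In the crawler setting, `reward` is the estimated relevance of the page just fetched, and `next_links` are its outlinks; the feedback signal is the difference between this target and the score the agent had predicted for the followed link.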
Other Reinforcement Learning Crawlers
• Rennie & McCallum (1999):
  – Naïve-Bayes classifier trained on text nearby links in pre-labeled examples to estimate Q values
  – Immediate reward R=1 for “on-topic” pages (desired CS papers for the CORA repository)
  – All RL algorithms outperform Breadth-First Search
• Future discounting: “For spidering, it is always better to choose immediate over delayed rewards” -- Or is it?
  – We cannot possibly cover the entire search space, and recall that by being greedy we can be trapped in local topical clusters and fail to discover better ones
  – Need to explore!
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated
crawlers
Evaluation of topical crawlers
• Goal: build “better” crawlers to support applications (Srinivasan et al. 2005)
• Build an unbiased evaluation framework
  – Define common tasks of measurable difficulty
  – Identify topics, relevant targets
  – Identify appropriate performance measures
    • Effectiveness: quality of crawled pages, order, etc.
    • Efficiency: separate CPU & memory of crawler algorithms from bandwidth & common utilities
Evaluation
• Evaluation corpus = ODP + Web
• Automate evaluation using edited directories
• Different sources of relevance assessments
Topics and Targets
• topic level ~ specificity
• depth ~ generality
Tasks
• Start from seeds, find targets and/or pages similar to target descriptions (targets at link depth d=2 or d=3 from the seeds)
• Back-crawl from targets to get seeds
Target based performance measures
Q: What assumption are we making? A: Independence!…
Performance matrix

Let S_c^t be the set of pages crawled by crawler c at time t, and T_d and D_d the target pages and target descriptions at depth d (d = 0, 1, 2):

                          “recall”                          “precision”
target pages T_d:         |S_c^t ∩ T_d| / |T_d|             |S_c^t ∩ T_d| / |S_c^t|
target descriptions D_d:  Σ_{p ∈ S_c^t} σ(p, D_d)           Σ_{p ∈ S_c^t} σ(p, D_d) / |S_c^t|
                          (total similarity)                (average similarity)
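These four measures can be computed together in a few lines; `sim(page, descriptions)` is a caller-supplied similarity (e.g. cosine over TF-IDF vectors), and the function name is a hypothetical helper:

```python
def target_metrics(crawled, targets, sim, descriptions):
    """Target-based performance measures: recall/precision against known
    target pages, plus similarity-based analogues against target
    descriptions (total and average similarity)."""
    crawled, targets = set(crawled), set(targets)
    hits = crawled & targets
    sims = [sim(p, descriptions) for p in crawled]
    return {
        "target_recall": len(hits) / len(targets),
        "target_precision": len(hits) / len(crawled),
        "total_similarity": sum(sims),               # recall analogue
        "avg_similarity": sum(sims) / len(crawled),  # precision analogue
    }
```

The page-based measures assume the known targets are representative of all relevant pages (the independence assumption flagged above); the description-based analogues soften this by crediting pages that are merely similar to the target descriptions.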
Crawling evaluation framework
[Architecture diagram: keywords and seed URLs feed the main process; each of N crawler logic modules keeps private data structures (limited resource); concurrent fetch/parse/stem modules and common data structures are shared; HTTP connections to the Web]
Using framework to compare crawler performance
[Plot: average target page recall vs. pages crawled]
Efficiency & scalability
[Plot: performance/cost vs. link frontier size]
Topical crawler performance depends on topic characteristics:
• C = target link cohesiveness
• A = target authoritativeness
• P = popularity (topic keyword generality)
• L = seed–target similarity

[Table: ‘+’ entries mark which of C, A, P, L significantly affect each crawler, under both target-page and target-description measures: BreadthFirst (+ on four), BFS-1 (+ on three), BFS-256 (+ on three), InfoSpiders (+ on four)]
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments: social, collaborative, federated
crawlers
Crawler ethics and conflicts
• Crawlers can cause trouble, even unwillingly, if not properly designed to be “polite” and “ethical”
• For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
  – Server administrator and users will be upset
  – Crawler developer/admin IP address may be blacklisted
Crawler etiquette (important!)
• Identify yourself
  – Use ‘User-Agent’ HTTP header to identify crawler, with a website describing the crawler and contact information for the crawler developer
  – Use ‘From’ HTTP header to specify crawler developer email
  – Do not disguise crawler as a browser by using their ‘User-Agent’ string
• Always check that HTTP requests are successful, and in case of error, use HTTP error code to determine and immediately address the problem
• Pay attention to anything that may lead to too many requests to any one server, even unwillingly, e.g.:
  – redirection loops
  – spider traps
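Honest identification and status checking can be sketched with the standard library; the crawler name, URL, and email below are placeholders for your own:

```python
import urllib.request

def build_request(url,
                  crawler_name="MyCrawler/1.0 (+https://example.org/bot)",
                  admin_email="admin@example.org"):
    """Build a request that identifies the crawler honestly: a
    descriptive User-Agent pointing at a page about the crawler, and a
    From header with the developer's contact email (placeholders)."""
    return urllib.request.Request(url, headers={
        "User-Agent": crawler_name,   # never disguise as a browser
        "From": admin_email,
    })

def polite_fetch(url, timeout=10):
    """Fetch `url` and check the HTTP status instead of ignoring it."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        if resp.status != 200:        # always check the response code
            raise RuntimeError(f"status {resp.status} for {url}")
        return resp.read()
```

`urlopen` raises `HTTPError` on 4xx/5xx responses, so the caller can inspect the error code (e.g. back off on 429/503) and address the problem before retrying.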
Crawler etiquette (important!)
• Spread the load, do not overwhelm a server
  – Make sure to send no more than some max. number of requests to any single server per unit time, say < 1/second
• Honor the Robot Exclusion Protocol
  – A server can specify which parts of its document tree any crawler is or is not allowed to crawl by a file named ‘robots.txt’ placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt
  – Crawler should always check, parse, and obey this file before sending any requests to a server
  – More info at:
    • http://www.google.com/robots.txt
    • http://www.robotstxt.org/wc/exclusion.html
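A per-host rate limit can be enforced with a small throttle object; the one-second default mirrors the rule of thumb above, and the class name is a hypothetical helper:

```python
import time
from urllib.parse import urlsplit

class HostThrottle:
    """Enforce a minimum delay between successive requests to the same
    host (e.g. at most one request per second per server)."""
    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._last = {}                    # host -> time of last request

    def wait(self, url):
        """Block until it is polite to hit the host of `url` again."""
        host = urlsplit(url).netloc
        now = time.monotonic()
        due = self._last.get(host, 0.0) + self.min_delay
        if now < due:
            time.sleep(due - now)          # sleep off the remaining delay
        self._last[host] = time.monotonic()
```

Calling `throttle.wait(url)` before each fetch keeps a multithreaded crawler from hammering any one server, while requests to different hosts proceed without delay.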
More on robot exclusion
• Make sure URLs are canonical before checking against robots.txt
• Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler
• Let’s look at some examples to understand the protocol…
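Caching one parsed policy per host can be sketched with the standard-library robots.txt parser; the class name and user agent are placeholders:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Fetch and parse each host's robots.txt once, then answer
    can-I-fetch queries from the cache."""
    def __init__(self, user_agent="MyCrawler"):
        self.user_agent = user_agent
        self._cache = {}                   # host -> RobotFileParser

    def allowed(self, url):
        host = urlsplit(url).netloc
        rp = self._cache.get(host)
        if rp is None:
            rp = RobotFileParser(f"https://{host}/robots.txt")
            rp.read()                      # fetch and parse once per host
            self._cache[host] = rp
        return rp.can_fetch(self.user_agent, url)
```

A production crawler would also expire cached policies after a while and treat fetch failures sensibly (an unreachable robots.txt is usually treated as allow-all, a 401/403 as disallow-all).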
www.apple.com/robots.txt

# robots.txt for http://www.apple.com/

User-agent: *
Disallow:

All crawlers… …can go anywhere!
www.microsoft.com/robots.txt

# Robots.txt file for http://www.microsoft.com

User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
#etc…

All crawlers… …are not allowed in these paths…
www.springer.com/robots.txt

# Robots.txt for http://www.springer.com (fragment)

User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

Google’s crawler is allowed everywhere except the listed paths; Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down (Crawl-delay: 2); AltaVista (scooter) has no limits; everyone else keep off!
More crawler ethics issues
• Is compliance with robot exclusion a matter of law?
  – No! Compliance is voluntary, but if you do not comply, you may be blocked
  – Someone (unsuccessfully) sued Internet Archive over a robots.txt related issue
• Some crawlers disguise themselves
  – Using false User-Agent
  – Randomizing access frequency to look like a human/browser
  – Example: click fraud for ads
More crawler ethics issues
• Servers can disguise themselves, too
  – Cloaking: present different content based on User-Agent
  – E.g. stuff keywords on version of page shown to search engine crawler
  – Search engines do not look kindly on this type of “spamdexing” and remove from their index sites that perform such abuse
    • Case of bmw.de made the news
Gray areas for crawler ethics
• If you write a crawler that unwillingly follows links to ads, are you just being careless, or are you violating terms of service, or are you violating the law by defrauding advertisers?
  – Is non-compliance with Google’s robots.txt in this case equivalent to click fraud?
• If you write a browser extension that performs some useful service, should you comply with robot exclusion?
Outline
• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Evaluation of preferential crawlers
• Crawler ethics and conflicts
• New developments
New developments: social, collaborative, federated crawlers
• Idea: go beyond the “one-fits-all” model of centralized search engines
• Extend the search task to anyone, and distribute the crawling task
• Each search engine is a peer agent
• Agents collaborate by routing queries and results
6S: Collaborative Peer Search
[Diagram: each peer combines a crawler and an index over WWW bookmarks and local storage; queries and hits are routed among peers A, B, C]
Data mining & referral opportunities; emerging communities
Basic idea: Learn based on prior query/response interactions
Learning about other peers
Query routing in 6S
Emergent semantic clustering
Simulation 1: 70 peers, 7 groups
• The dynamic network of queries and results exchanged among 6S peer agents quickly forms a small-world, with small diameter and high clustering (Wu et al. 2005)
Simulation 2: ODP (dmoz.org), 500 users
Each synthetic user is associated with a topic
Semantic similarity
Peers with similar interests are more likely to talk to each other (Akavipat et al. 2006)
Quality of results
• The more interactions, the better
• More sophisticated learning algorithms do better
Download and try free 6S prototype: http://homer.informatics.indiana.edu/~nan/6S/
1-click configuration of personal crawler and setup of search engine
Download and try free 6S prototype: http://homer.informatics.indiana.edu/~nan/6S/
Search via Firefox browser extension
Need crawling code?
• Reference C implementation of HTTP, HTML parsing, etc
– w3c-libwww package from World-Wide Web Consortium: www.w3c.org/Library/
• LWP (Perl)
– http://www.oreilly.com/catalog/perllwp/
– http://search.cpan.org/~gaas/libwww-perl-5.804/
• Open source crawlers/search engines
– Nutch: http://www.nutch.org/ (Jakarta Lucene: jakarta.apache.org/lucene/)
– Heritrix: http://crawler.archive.org/
– WIRE: http://www.cwr.cl/projects/WIRE/
– Terrier: http://ir.dcs.gla.ac.uk/terrier/
• Open source topical crawlers, Best-First-N (Java)
– http://informatics.indiana.edu/fil/IS/JavaCrawlers/
• Evaluation framework for topical crawlers (Perl)
– http://informatics.indiana.edu/fil/IS/Framework/