BDA Presentation1

COMPUTATION OF PAGERANK AND PAGERANK
ITERATION, TOPIC SENSITIVE PAGERANK AND
LINK SPAM, HUBS AND AUTHORITIES, WEB
COMMUNITIES, LIMITATIONS OF LINK, RANK AND
WEB GRAPH ANALYSIS
PRESENTED BY:
Pushpa Rama devadiga-4SF20CS093
Subhiksha s-4SF20CS155
Sweekriti V Gaonkar-4SF20CS163
Medha P Shetty -4SF20CS066
COMPUTATION OF PAGERANK AND PAGERANK
ITERATION
Assume that a web graph models the web pages, with the hyperlinks of each page held as a property of its graph node (vertex). Assume a
page Pg(v) has an in-link from page Pg(u), and that Pg(u) out-links to Pg(v) and similar pages, NOUT[Pg(u)] pages in total. Figure 9.9
shows Pg(v) with in-links from Pg(u) and other pages.
1. PAGERANK ALGORITHM USING THE IN-DEGREES AS CONFERRING AUTHORITY
Assume that page U, when out-linking to page V, "considers" giving an equal fraction of its authority to each of
the pages it points to, such as Pgv. The following equation gives the initially suggested page rank, PR
(based on in-degrees), of a page Pgv:
$$PR(Pg_v) = nc \sum_{Pg_u \to Pg_v} \frac{PR(Pg_u)}{N(Pg_u)}$$
where N(Pgu) is the total number of out-links from U, and the sum is over all in-links of Pgv. The normalization
constant is denoted by nc and is chosen such that the PRs of all pages sum to 1.
2. PAGERANK ALGORITHM USING THE RELATIVE AUTHORITY OF
THE PARENTS OVER LINKED CHILDREN
This PageRank method considers the entire web instead of the local neighbourhood of a page. It uses the
relative authority of the parents over their linked children and adds a rank for each page from a rank source.
The method assigns weight according to the rank of the parents: the rank of a page is proportional to the
weight of each parent and inversely proportional to the number of out-links of that parent.
Assume that (i) page v (Pgv) has in-links from parent page u (Pgu) and from the other pages in the set PA(v) of parent pages of v,
that is, u ∈ PA(v), (ii) R(v) is the PageRank of Pgv, (iii) R(u) is the weight (importance/rank) of Pgu, and (iv) ch(u) is the
weight of the children (number of out-links) of Pgu. Then the
following equation gives the PageRank R(v) of page v:
$$R(v) = nc \sum_{u \in PA(v)} \frac{R(u)}{ch(u)}$$
where PA(v) is the set of pages that are parents (in-links) of page v, and the sum is over all parents of v. nc
is the normalization constant, chosen so that the weights sum to 1.
An alternative equation is as follows:
where nc = [1/R(v)]. R(v) is iterated and recomputed from the parents in the set PA(v) until the new value of R(v) changes by less
than a defined margin, say 0.001, between successive iterations.
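The iteration described above can be sketched in plain Python. This is a minimal sketch, not a reference implementation: it assumes the update R(v) = nc × Σ R(u)/ch(u) over the parents u of v, uniform starting ranks, and that every page has at least one out-link. The function name `pagerank`, the parameters `margin` and `max_iter`, and the three-page graph are illustrative assumptions.

```python
def pagerank(out_links, margin=0.001, max_iter=100):
    """Iterate R(v) = nc * sum(R(u) / ch(u) for u in PA(v)) until R stabilises.

    out_links maps every page to the list of pages it links to. Assumes each
    page appears as a key and has at least one out-link.
    """
    pages = list(out_links)
    # PA(v): the parents (in-linking pages) of each page v.
    parents = {v: [] for v in pages}
    for u, children in out_links.items():
        for v in children:
            parents[v].append(u)

    rank = {v: 1.0 / len(pages) for v in pages}  # uniform starting ranks
    for _ in range(max_iter):
        raw = {v: sum(rank[u] / len(out_links[u]) for u in parents[v])
               for v in pages}
        nc = 1.0 / sum(raw.values())  # normalise so that the ranks sum to 1
        new_rank = {v: nc * r for v, r in raw.items()}
        change = max(abs(new_rank[v] - rank[v]) for v in pages)
        rank = new_rank
        if change < margin:  # stop once no rank moved by the margin or more
            break
    return rank


# Hypothetical three-page web: A -> B, C; B -> C; C -> A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

For this small graph the ranks settle at approximately A = 0.4, B = 0.2 and C = 0.4: C collects rank from both A and B, and passes all of it back to A.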
Page Rank Iteration using MapReduce Functions in Spark GraphX:
PageRank can be computed using the Spark GraphX method (Section 8.5):
    val ranks = graph.pageRank(0.0001).vertices
    val ranksByUsername = users.join(ranks).map { case (id, (username, rank)) => (username, rank) }
The method includes conversions to MapReduce functions and uses HDFS-compatible files. The pageRank() call and the
ranksByUsername join do the computations using the PageRank object. GraphX provides pageRank() among its
graph operators (GraphOps).
Assume the tolerance specified at the start of the iterations is 0.0001 (1 in 10000). When the rank no longer changes by more than that
tolerance, the rank values have converged and the iterative process stops.
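To make the map and reduce roles concrete, here is a minimal plain-Python sketch of one way to split a PageRank iteration into a map step (each page emits rank/out-degree to its children) and a reduce step (contributions are summed per page), stopping at a tolerance of 0.0001. It is not the GraphX implementation: GraphX's `pageRank` also applies a reset probability, which this sketch omits. The names `map_contributions`, `reduce_ranks` and `iterate_ranks` are hypothetical.

```python
from collections import defaultdict


def map_contributions(rank, children):
    """Map step: a page sends an equal share of its rank to each child."""
    for child in children:
        yield child, rank / len(children)


def reduce_ranks(pairs):
    """Reduce step: sum the contributions received by each page."""
    totals = defaultdict(float)
    for page, contribution in pairs:
        totals[page] += contribution
    return dict(totals)


def iterate_ranks(out_links, tol=0.0001, max_iter=200):
    """Repeat map and reduce until no rank changes by more than tol."""
    ranks = {p: 1.0 / len(out_links) for p in out_links}
    for step in range(1, max_iter + 1):
        pairs = [kv for p, children in out_links.items()
                 for kv in map_contributions(ranks[p], children)]
        summed = reduce_ranks(pairs)
        new_ranks = {p: summed.get(p, 0.0) for p in out_links}
        converged = all(abs(new_ranks[p] - ranks[p]) <= tol for p in out_links)
        ranks = new_ranks
        if converged:
            return ranks, step
    return ranks, max_iter


# Hypothetical graph in which every page has at least one out-link.
final_ranks, steps = iterate_ranks({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Because the map step only needs one page's rank and out-links, and the reduce step only needs the contributions keyed by one page, both steps can be distributed across partitions, which is what makes the iteration suitable for a MapReduce-style engine.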
Topic Sensitive PageRank and Link Spam:
A number of methods have been suggested for computing the topic-sensitive page rank, RTS. The RTS(v) of a page P(v)
may be higher for one specific topic than for other topics. A topic is associated with a distinct bag of words, for which the page has
a higher probability of being surfed than for other bags of words.
The topic-sensitive PageRank method uses surfing weights (probabilities) for the pages containing the topic or the bag of words
corresponding to a topic. The method computes a bias to the rank R(v), and thus increases the
effect of the pages containing that topic or bag of words.
A simple method of introducing the bias assumes that an additional rank source E exists, besides the in-links from other
pages, and adds to the rank R(v) of each page by a fixed (uniform) or non-uniform weight factor n. The factor O is a
multiplication factor applied to the actual in-links without the bias.
An alternative equation for computing the topic-sensitive PageRank, R(v), for page P(v) is as follows:
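A minimal sketch of biased ranking, assuming the commonly used form R(v) = beta × Σ R(u)/ch(u) + (1 - beta) × E(v), where the sum is over the parents u of v and the rank source E is spread evenly over the pages of the chosen topic. The names `topic_sensitive_pagerank`, `topic_pages` and `beta`, the value 0.85, and the four-page graph are illustrative assumptions rather than values from the text.

```python
def topic_sensitive_pagerank(out_links, topic_pages, beta=0.85,
                             tol=1e-6, max_iter=200):
    """Rank pages with a rank source E concentrated on topic_pages.

    Assumes every page is a key of out_links and has at least one out-link.
    """
    pages = list(out_links)
    # Rank source E: uniform over the topic pages, zero elsewhere.
    source = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0)
              for p in pages}
    rank = dict(source)
    for _ in range(max_iter):
        new_rank = {p: (1.0 - beta) * source[p] for p in pages}  # bias term
        for u, children in out_links.items():
            for v in children:
                new_rank[v] += beta * rank[u] / len(children)    # in-link term
        change = max(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical web with two topic clusters: {A, B} and {C, D}.
web = {"A": ["B", "C"], "B": ["A"], "C": ["D", "A"], "D": ["C"]}
rank_ab = topic_sensitive_pagerank(web, {"A", "B"})
rank_cd = topic_sensitive_pagerank(web, {"C", "D"})
```

The link structure is identical in both runs; only the rank source changes. Page A ranks above page C when the source favours {A, B}, and the order reverses when it favours {C, D}.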
Link Spam:
The effects of link spam can be nullified using the topic-sensitive PageRank algorithm. Link spam tries to mislead the PageRank
algorithm and make it ineffective. The spam-assisting pages link to a page
repeatedly and increase its in-degree, thereby inflating its rank to a large value.
A link-spam creator website ws has a page ls whose PageRank ws attempts to enhance. The ws has a large number
of assisting pages als, which out-link to ls only. The als pages also prevent the PageRank of ls from being lost. A spam mass
consists of ws, ls and its als pages.
Following are the steps for finding spam mass:
1. A distant topic-sensitive page has unusually high in-degrees compared to the other pages of the same topic. A plot known as the
power-law plot is drawn with the log of the number of web pages N having a given in-degree on the y-axis and the log of that
in-degree on the x-axis.
2. The plot is nearly linear because N decays with the in-degree k as a power law: log N falls linearly with log k, so N is
proportional to exp(-d log k) = k^(-d), where d is the decay constant.
3. An unusual pattern with marked deviation from near linearity identifies the distant link spam mass.
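The steps above can be sketched as a least-squares fit on the log-log plot followed by a residual check. This is a minimal sketch under stated assumptions: the in-degree data are synthetic (roughly 1000/k² pages per in-degree k, plus 300 extra pages injected at in-degree 8 to imitate a spam mass), and the names `log_log_residuals` and `outlier_degrees` and the threshold of 1.0 are hypothetical choices.

```python
import math
from collections import Counter


def log_log_residuals(in_degrees):
    """Fit a line to (log k, log N(k)) and return the slope and residuals."""
    counts = Counter(d for d in in_degrees if d > 0)  # N(k) per in-degree k
    xs = [math.log(k) for k in counts]
    ys = [math.log(n) for n in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = {k: math.log(n) - (intercept + slope * math.log(k))
                 for k, n in counts.items()}
    return slope, residuals


def outlier_degrees(in_degrees, threshold=1.0):
    """In-degrees whose page count deviates strongly from the fitted line."""
    _, residuals = log_log_residuals(in_degrees)
    return sorted(k for k, r in residuals.items() if abs(r) > threshold)


# Synthetic data: page counts per in-degree, with a bump at in-degree 8.
page_counts = {1: 1000, 2: 250, 3: 111, 4: 62, 5: 40,
               6: 28, 7: 20, 8: 16 + 300, 9: 12, 10: 10}
sample = [k for k, n in page_counts.items() for _ in range(n)]
slope, residuals = log_log_residuals(sample)
suspects = outlier_degrees(sample)
```

On this data the fitted slope is negative and only in-degree 8 exceeds the threshold, which matches the injected bump. On real data the threshold would need tuning, and a flagged in-degree is only a candidate for closer inspection.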
Hubs and Authorities
A hub is an index page that out-links to a number of content pages. A content page is a topic authority. An authority is a page that
has recognition due to its useful, reliable and significant information. Figure 9.10(a) shows hubs (shaded circles) with the
number of out-links associated with each hub. Figure 9.10(b) shows authorities (dotted circles) with the number of in-links and
out-links associated with each authority.
Figure 9.10 (a) Hubs (shaded circles) and (b) Authorities (dotted circles)
In-degree (the number of in-edges from other vertices) can be one measure of authority. However, in-degree does not
distinguish between an in-link from a greater authority and one from a lesser authority. Authority auth1 in Figure 9.10(b) has in-links from 6
vertices (in-degree = 6) and auth2 has in-links from just 2 (in-degree = 2). However, auth1 is linked from six vertices with in-degrees
= 1, 1, 1, 1, 1 and 120 (total = 125), whereas auth2 is linked from two vertices with in-degrees = 120 and 200 (total = 320).
auth2 is thus associated with greater authorities. Therefore, in-degree alone may not be a good measure of authority.
Consider a specifically queried topic t. Following are the steps:
1. Discover a root set R of pages for t using a standard search engine. The root set may be limited to the top 200 pages for t.
2. Find a sub-graph S of pages, using a query that provides pages relevant to t and pointed to by the pages in R. The pages of
sub-graph S form the set for computations. It includes the children of the parents in R and is limited to a random set of at most 50 pages returned by a
"reverse link" query.
3. Eliminate purely navigational links and links between two pages on the same host.
4. Consider only u (|u| ≈ 4-8) pages from a given host as pointers to any individual page.
The sub-graph for HITS consists of the root set R of pages and the children of the parents in the sub-graph S. Figure 9.11 shows
the sub-graph S for HITS, consisting of the root set R of pages and all the pages pointed to by any page of R.
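A minimal sketch of how the sub-graph might be assembled and then scored. The scoring loop is the standard HITS update (authority = sum of the hub scores of in-linking pages, hub = sum of the authority scores of linked pages, each normalised), which the steps above do not spell out, so it is included here as an assumption. The names `base_set` and `hits`, the tiny graph, and taking the first `max_parents` parents instead of a random sample are illustrative choices.

```python
import math


def base_set(root, out_links, max_parents=50):
    """Root set plus its children and a capped number of parents per page."""
    in_links = {}
    for u, children in out_links.items():
        for v in children:
            in_links.setdefault(v, []).append(u)
    pages = set(root)
    for p in root:
        pages.update(out_links.get(p, []))               # children of p
        pages.update(in_links.get(p, [])[:max_parents])  # capped "reverse link" result
    return pages


def hits(pages, out_links, iterations=50):
    """Return (hub, authority) scores for the pages of the sub-graph."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for u in pages:
            for v in out_links.get(u, []):
                if v in pages:
                    auth[v] += hub[u]
        # Hub: sum of authority scores of the pages linked to.
        hub = {u: sum(auth[v] for v in out_links.get(u, []) if v in pages)
               for u in pages}
        # Normalise both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: three hub-like pages pointing at two content pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"],
         "a1": [], "a2": []}
sub_graph = base_set({"h1", "h3"}, graph)
hub_scores, auth_scores = hits(sub_graph, graph)
```

Here the sub-graph is {h1, h3, a1, a2}. Page a1 gets the highest authority score because both hubs point to it, and h1 gets the highest hub score because it points to both authorities.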
Web Communities:
Web communities are websites or collections of websites that limit the viewing of contents and links to members. Examples
of web communities are social networks, such as LinkedIn, SlideShare, Twitter and Facebook. The communities consist of
do-it-yourself sites, social networks, blogs or bulletin boards. The issues are privacy and reliability of information.
Metrics for the analysis of web-community sites are web graph parameters, such as triangle count, clustering coefficient and
K-neighbourhood.
K-neighbourhood analysis means the number of 1st neighbour nodes, 2nd neighbour nodes, and so on
(K = 1, 2, 3, 4 and so on). K-core analysis means the number of cores within a marked area. A core may consist of a
triangle of connected vertices, or of a rectangle with interconnected edges and diagonals. A core may also
be a group of cores. The functions described for Spark GraphX cover degree centralities, degree distribution, degrees of separation,
betweenness centralities, closeness centralities, neighbourhoods, strongly connected components, triangle counts,
PageRank, shortest path, Breadth First Search (BFS), minimum spanning tree (forest), spectral clustering and clustering
coefficient.
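The three metrics named above can be illustrated on a small undirected graph. This is a plain-Python sketch, not the GraphX implementation; the function names, the five-node graph and the use of adjacency sets are illustrative assumptions.

```python
from collections import deque


def k_neighbourhood(adj, start, k_max):
    """Count the nodes exactly 1, 2, ..., k_max hops from start (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == k_max:  # do not expand beyond k_max hops
            continue
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    counts = {k: 0 for k in range(1, k_max + 1)}
    for d in dist.values():
        if d >= 1:
            counts[d] += 1
    return counts


def triangle_count(adj):
    """Count triangles once each by requiring u < v < w."""
    return sum(1 for u in adj for v in adj[u] for w in adj[v]
               if u < v < w and w in adj[u])


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of node that are themselves linked."""
    nbs = list(adj[node])
    k = len(nbs)
    if k < 2:
        return 0.0
    linked = sum(1 for i in range(k) for j in range(i + 1, k)
                 if nbs[j] in adj[nbs[i]])
    return 2.0 * linked / (k * (k - 1))


# Hypothetical community: triangle A-B-C with a tail C-D-E.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
       "D": {"C", "E"}, "E": {"D"}}
neighbours_of_a = k_neighbourhood(adj, "A", 3)
triangles = triangle_count(adj)
cc_c = clustering_coefficient(adj, "C")
```

For this graph there is one triangle, node A has two 1st neighbours, one 2nd neighbour and one 3rd neighbour, and the clustering coefficient of C is 1/3 because only one of its three neighbour pairs (A and B) is linked.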
Limitations of Link, Rank and Web Graph Analysis:
1. Search engines rely on the metatags or metadata of the documents. Biased information in the metadata can therefore enhance
the rank.
2. Search engines themselves may introduce bias by ranking the pages of clients, such as advertising
companies, higher.
3. May provide higher searches and hence lead to biased ranks.
4. A top authority may be a hub of pages on a different topic resulting in increased rank of the authority page.
5. Topic drift and content evolution can affect the rank. Off-topic pages may be returned as authorities.
6. Mutually reinforcing affiliates or affiliated pages/sites can enhance each other's rank and authority.
7. The ranks may be unstable, as adding nodes may have a large influence on the ranks.
