Solving Linear Equations
If you look at the 4-node “Web” of Example 5.2, you might think that the
way to solve the equation v = M v is by Gaussian elimination. Indeed,
in that example, we argued what the limit would be essentially by doing
so. However, in realistic examples, where there are tens or hundreds of
billions of nodes, Gaussian elimination is not feasible. The reason is that
Gaussian elimination takes time that is cubic in the number of equations.
Thus, the only way to solve equations on this scale is to iterate as we
have suggested. Even that iteration is quadratic at each round, but we
can speed it up by taking advantage of the fact that the matrix M is very
sparse; there are on average about ten links per page, i.e., ten nonzero
entries per column.
Moreover, there is another diﬀerence between PageRank calculation
and solving linear equations. The equation v = M v has an inﬁnite number
of solutions, since we can take any solution v, multiply its components by
any ﬁxed constant c, and get another solution to the same equation. When
we include the constraint that the sum of the components is 1, as we have
done, then we get a unique solution.
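To make the contrast with Gaussian elimination concrete, here is a minimal sketch of one round of the iteration in Python (an illustration only, not code from the text), assuming the Web is stored as a dictionary mapping each page to the list of pages it links to; that representation and the function name one_step are assumptions made for the example.

    def one_step(v, out_links):
        """One round of v <- M v, where M is the column-stochastic transition
        matrix implied by out_links (page -> list of pages it links to).
        Only the nonzero entries of M are visited, so each round costs time
        proportional to the number of links, not to the square of the number
        of pages.  Assumes every page has at least one out-link; dead ends
        are treated later in this section."""
        nxt = dict.fromkeys(v, 0.0)
        for page, targets in out_links.items():
            share = v[page] / len(targets)   # the page splits its probability
            for q in targets:                # evenly among its successors
                nxt[q] += share
        return nxt

Because each column of M sums to 1, repeating one_step preserves the property that the components of v sum to 1, which is exactly the normalization that makes the solution of v = M v unique.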
The sequence of approximations to the limit that we get by multiplying at each step by M is:

    (1/4, 1/4, 1/4, 1/4), (9/24, 5/24, 5/24, 5/24), (15/48, 11/48, 11/48, 11/48),
    (11/32, 7/32, 7/32, 7/32), . . . , (3/9, 2/9, 2/9, 2/9)

where each vector gives the probabilities of the surfer being at A, B, C, and D, in that order.
Notice that in this example, the probabilities for B, C, and D remain the same. It is easy to see that B and C must always have the same values at any
iteration, because their rows in M are identical. To show that their values are
also the same as the value for D, an inductive proof works, and we leave it as
an exercise. Given that the last three values of the limiting vector must be the
same, it is easy to discover the limit of the above sequence. The ﬁrst row of
M tells us that the probability of A must be 3/2 times each of the other probabilities, so the
limit has the probability of A equal to 3/9, or 1/3, while the probability for the
other three nodes is 2/9.
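For readers who want to check the arithmetic, the following short Python sketch (again an illustration, not from the text; the use of exact fractions and the number of rounds are choices made here) multiplies by the transition matrix M of the four-node Web of Fig. 5.1 and reproduces the sequence above and its limit.

    from fractions import Fraction as F

    # Transition matrix M of the four-node Web of Fig. 5.1; column j shows how
    # page j splits its probability among the pages it links to.
    # Order of rows and columns: A, B, C, D.
    M = [[F(0),    F(1, 2), F(1), F(0)   ],
         [F(1, 3), F(0),    F(0), F(1, 2)],
         [F(1, 3), F(0),    F(0), F(1, 2)],
         [F(1, 3), F(1, 2), F(0), F(0)   ]]

    v = [F(1, 4)] * 4                      # start with all four probabilities equal
    for step in range(1, 51):
        v = [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]
        if step <= 3:
            print(step, [str(x) for x in v])   # step 1 gives 3/8 (= 9/24), 5/24, 5/24, 5/24

    print([float(x) for x in v])           # approaches 1/3, 2/9, 2/9, 2/9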
This diﬀerence in probability is not great. But in the real Web, with billions
of nodes of greatly varying importance, the true probability of being at a node
like www.amazon.com is orders of magnitude greater than the probability of typical nodes.
Structure of the Web
It would be nice if the Web were strongly connected like Fig. 5.1. However, it
is not, in practice. An early study of the Web found it to have the structure
shown in Fig. 5.2. There was a large strongly connected component (SCC), but
there were several other portions that were almost as large.
1. The in-component, consisting of pages that could reach the SCC by fol-
lowing links, but were not reachable from the SCC.
2. The out-component, consisting of pages reachable from the SCC but un-
able to reach the SCC.
3. Tendrils, which are of two types. Some tendrils consist of pages reachable
from the in-component but not able to reach the in-component. The
other tendrils can reach the out-component, but are not reachable from the out-component.
Figure 5.2: The “bowtie” picture of the Web
In addition, there were small numbers of pages found either in
(a) Tubes, which are pages reachable from the in-component and able to reach
the out-component, but unable to reach the SCC or be reached from the SCC.
(b) Isolated components that are unreachable from the large components (the SCC, in- and out-components) and unable to reach those components.
Several of these structures violate the assumptions needed for the Markov-
process iteration to converge to a limit. For example, when a random surfer
enters the out-component, they can never leave. As a result, surfers starting
in either the SCC or in-component are going to wind up in either the out-
component or a tendril oﬀ the in-component. Thus, no page in the SCC or in-
component winds up with any probability of a surfer being there. If we interpret
this probability as measuring the importance of a page, then we conclude falsely
that nothing in the SCC or in-component is of any importance.
As a result, PageRank is usually modiﬁed to prevent such anomalies. There
are really two problems we need to avoid. First is the dead end, a page that
has no links out. Surfers reaching such a page disappear, and the result is that
in the limit no page that can reach a dead end can have any PageRank at all.
The second problem is groups of pages that all have outlinks but never link to any pages outside the group. These structures are called spider traps.³
Both these problems are solved by a method called “taxation,” where we assume a random
surfer has a ﬁnite probability of leaving the Web at any step, and new surfers
are started at each page. We shall illustrate this process as we study each of
the two problem cases.
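As a rough preview of how taxation can be written down (the formulation and the constant β = 0.85 below are assumptions made for this illustration, not values fixed by the text): with probability β the surfer follows a random out-link of the current page, and with probability 1 − β a new surfer appears at a page chosen uniformly at random.

    import numpy as np

    def taxed_page_rank(M, beta=0.85, n_iter=100):
        """Iterate v <- beta * (M v) + (1 - beta) * e / n, where e is the
        all-ones vector: with probability beta the surfer follows a link,
        and with probability 1 - beta a new surfer starts at a random page."""
        n = M.shape[0]
        v = np.full(n, 1.0 / n)
        for _ in range(n_iter):
            v = beta * (M @ v) + (1.0 - beta) / n
        return v

    # A small spider trap for illustration: the Web of Fig. 5.1 altered so that
    # C links only to itself.  Without taxation, all probability ends up at C;
    # with taxation, every page keeps a positive share.
    M = np.array([[0,   1/2, 0, 0  ],
                  [1/3, 0,   0, 1/2],
                  [1/3, 0,   1, 1/2],
                  [1/3, 1/2, 0, 0  ]])
    print(taxed_page_rank(M))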
Avoiding Dead Ends
Recall that a page with no link out is called a dead end. If we allow dead
ends, the transition matrix of the Web is no longer stochastic, since some of
the columns will sum to 0 rather than 1. A matrix whose column sums are at
most 1 is called substochastic.
If we compute M^i v for increasing powers of a substochastic matrix M, then some or all of the components of the vector go
to 0. That is, importance “drains out” of the Web, and we get no information
about the relative importance of pages.
Example 5.3: In Fig. 5.3 we have modified Fig. 5.1 by removing the arc from
C to A. Thus, C becomes a dead end. In terms of random surfers, when
a surfer reaches C they disappear at the next round. The matrix M that
describes Fig. 5.3 is
    ⎡  0   1/2   0    0  ⎤
    ⎢ 1/3   0    0   1/2 ⎥
    ⎢ 1/3   0    0   1/2 ⎥
    ⎣ 1/3  1/2   0    0  ⎦
³ They are so called because the programs that crawl the Web, recording pages and links, are often referred to as “spiders.” Once a spider enters a spider trap, it can never leave.
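To see this draining numerically, one can iterate with the matrix M of Fig. 5.3; the snippet below is an illustrative sketch (NumPy and the number of rounds are choices made here, not part of the text).

    import numpy as np

    # The substochastic matrix of Fig. 5.3; the column for the dead end C is all 0.
    M = np.array([[0,   1/2, 0, 0  ],
                  [1/3, 0,   0, 1/2],
                  [1/3, 0,   0, 1/2],
                  [1/3, 1/2, 0, 0  ]])

    v = np.full(4, 0.25)           # start with equal probabilities
    for _ in range(50):
        v = M @ v                  # probability sitting at C is lost at every round
    print(v, v.sum())              # all components, and their sum, approach 0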