Stochastic analysis of web page ranking

Y. Volkovich

Research output: Thesis › PhD Thesis - Research UT, graduation UT › Academic


Abstract

Today, the study of the World Wide Web is one of the most challenging subjects. In this work we consider the Web from a probabilistic point of view and analyze the relations between its various characteristics. In particular, we are interested in the Web properties that affect Web page ranking, which is a measure of the popularity and importance of a page in the Web. We mainly restrict our attention to two widely used ranking algorithms: the number of references to a page (in-degree) and Google’s PageRank. For the majority of self-organizing networks, such as the Web and Wikipedia, the in-degree and the PageRank are observed to follow power laws. In this thesis we present a new methodology for analyzing the probabilistic behavior of the PageRank distribution and the dependence between various power law parameters of the Web. Our approach is based on techniques from the theory of regular variation and extreme value theory. We start Chapter 2 with models for the distributions of the number of incoming (in-degree) and outgoing (out-degree) links of a page. Next, we define the PageRank as a solution of the stochastic equation $R \stackrel{d}{=} \sum_{i=1}^{N} A_i R_i + B$, where the $R_i$ are distributed as $R$. This equation is inspired by the original definition of the PageRank. In particular, $N$ models the in-degree of a page, and $B$ stands for the user preference. We use a probabilistic approach to show that the equation has a unique non-trivial solution with fixed finite mean. Our analysis is based on a recurrent stochastic model for the power iteration algorithm commonly used in PageRank computations. Further, we obtain that the PageRank asymptotics after each iteration are determined by the asymptotics of the random variable with the heaviest tail among $N$ and $B$. If the tails of $N$ and $B$ are equally heavy, then we in fact get the sum of two asymptotic expressions. We predict the tail behavior of the limiting distribution of the PageRank from the convergence of the results for the iterations.
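
As a rough illustration of the iteration scheme described above, the sketch below simulates the recursion $R^{(k+1)} = \sum_{i=1}^{N} A_i R_i^{(k)} + B$ on a finite pool of samples. Everything concrete in it is an assumption made for illustration and is not taken from the thesis: the floor-of-Pareto law for $N$, the choice $A_i = c/D_i$ with a uniformly distributed out-degree $D_i$, the constant $B = 1 - c$, the damping factor $c = 0.85$, and the approximation of independent copies of $R$ by resampling from a pool.

```python
import random


def pareto_int(rng, alpha, x_min=1):
    """Integer heavy-tailed sample: floor of a Pareto(alpha) variable (assumed law)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
    return int(x_min * u ** (-1.0 / alpha))


def iterate_pool(pool, sample_n, sample_a, sample_b, rng):
    """Apply R_new = sum_{i=1}^{N} A_i * R_i + B once to every slot of the pool.

    The R_i are drawn with replacement from the current pool, which stands in
    for independent copies of R from the previous iteration.
    """
    new_pool = []
    for _slot in pool:
        n = sample_n(rng)
        total = sample_b(rng)
        for _ in range(n):
            total += sample_a(rng) * rng.choice(pool)
        new_pool.append(total)
    return new_pool


def simulate(pool_size=2000, iterations=10, seed=0):
    """Run the recursion from R = 1 with hypothetical distributions for N, A, B."""
    rng = random.Random(seed)
    c = 0.85  # damping factor (assumption)

    def sample_n(r):
        return pareto_int(r, alpha=2.1)  # in-degree with a power law tail

    def sample_a(r):
        return c / r.randint(2, 20)  # c divided by a uniform out-degree

    def sample_b(r):
        return 1.0 - c  # constant user preference term

    pool = [1.0] * pool_size
    for _ in range(iterations):
        pool = iterate_pool(pool, sample_n, sample_a, sample_b, rng)
    return pool
```

Calling `simulate()` returns a list of samples whose empirical tail can then be compared with the tail of the sampled in-degrees.
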
To prove the predicted behavior, we use other techniques in Chapter 3. There we characterize the tail behavior for the models of the in-degree and the PageRank distribution using Laplace-Stieltjes transforms and the Tauberian theorem. We derive the equation for the Laplace-Stieltjes transforms that corresponds to the general stochastic equation, and obtain our main result, which establishes the tail behavior of the solution of the stochastic equation. In Chapter 4 we perform a number of experiments on Web and Wikipedia data sets and on preferential attachment graphs in order to justify the results obtained in Chapters 2 and 3. The numerical results show good agreement with our stochastic model for the PageRank distribution. Moreover, in Section 4.1 we address the problem of evaluating power laws in real data sets. We describe several state-of-the-art techniques from the statistical analysis of heavy tails, and we provide empirical evidence of the asymptotic similarity between in-degree and PageRank. Inspired by the minor effect of the out-degree distribution on the asymptotics of the PageRank, in Section 4.4 we introduce a new ranking scheme, called PAR, which combines features of the HITS and PageRank ranking schemes. In Chapter 5 we examine the dependence structure in power law graphs. First, we analytically characterize the tail dependencies between the in-degree and the PageRank of one particular page by using the stochastic equation of the PageRank. We formally establish the relative importance of the two main factors for high ranking: a large in-degree and a high rank of one of the ancestors. Second, we compute the angular measures for in-degrees, out-degrees and PageRank scores in three large data sets. The analysis of extremal dependence leads us to propose a new rank correlation measure which is particularly suitable for power law data.
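
The abstract does not name the heavy-tail techniques used in Section 4.1. One standard tool for evaluating a power law exponent is the Hill estimator, so the following is a minimal sketch of it, offered as an assumption about what such an analysis might involve and not as the method of the thesis. The function name and the choice of `k` are hypothetical.

```python
import math


def hill_estimator(data, k):
    """Hill estimate of the tail index alpha from the k largest observations.

    For data with P(X > x) ~ C * x**(-alpha), the estimate is the reciprocal
    of the mean log-excess of the top k values over the (k+1)-th largest one.
    """
    if not 1 <= k < len(data):
        raise ValueError("k must satisfy 1 <= k < len(data)")
    ordered = sorted(data, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the k+1 largest observations must be positive")
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    if mean_log_excess == 0:
        raise ValueError("the top k+1 observations are all equal")
    return 1.0 / mean_log_excess
```

In practice the estimate is usually plotted against a range of `k` values, since it is sensitive to how many upper order statistics are included.
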
Finally, in Chapter 6 we apply the new rank correlation measure from Chapter 5 to various problems of rank aggregation. From the numerical results we conclude that methods defined by the angular measure can provide good precision for the top nodes in large data sets; however, they can fail on small data sets.
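
To make the notion of an angular measure more concrete, here is a small sketch of a common empirical construction from multivariate extreme value analysis: rank-transform each coordinate to an approximately unit-Pareto scale, keep the points with the largest radius, and inspect their angles. The rank transform, the use of the sum $x + y$ as radius, and the function names are assumptions for illustration; the abstract does not specify how the thesis computes its angular measures.

```python
import math


def rank_transform(values):
    """Map observations to an approximately unit-Pareto scale via their ranks.

    Ties are broken by input order, which is an arbitrary choice.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        transformed[idx] = 1.0 / (1.0 - rank / (n + 1.0))
    return transformed


def extreme_angles(xs, ys, k):
    """Angles in [0, pi/2] of the k points with the largest radius x + y."""
    tx, ty = rank_transform(xs), rank_transform(ys)
    points = sorted(zip(tx, ty), key=lambda p: p[0] + p[1], reverse=True)[:k]
    return [math.atan2(y, x) for x, y in points]
```

Angles clustered around $\pi/4$ indicate that both quantities tend to be extreme together, while angles near $0$ or $\pi/2$ indicate that extremes of one quantity occur without extremes of the other.
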
Original language: Undefined
Awarding Institution
  • University of Twente
Supervisors/Advisors
  • Litvak, Nelly, Advisor
  • Boucherie, Richardus J., Supervisor
Award date: 24 Apr 2009
Place of Publication: Zutphen
Publisher: University of Twente
Print ISBNs: 978-90-365-2823-8
DOIs: https://doi.org/10.3990/1.9789036528238
Publication status: Published - 24 Apr 2009

Keywords

  • METIS-263943
  • IR-61071
  • Multivariate extremes
  • Rank aggregation
  • Wikipedia
  • Extremal dependencies
  • Tauberian theorems
  • Regular variation
  • Web
  • PageRank
  • Statistical Analysis
  • Power laws
  • Preferential attachment
  • Stochastic equation
  • EWI-15767

Cite this

Volkovich, Y. (2009). Stochastic analysis of web page ranking. Zutphen: University of Twente. https://doi.org/10.3990/1.9789036528238
Volkovich, Y. / Stochastic analysis of web page ranking. Zutphen : University of Twente, 2009. 117 p.