Sunday, September 20, 2015

Power Iteration for Page Rank in R

Use Case of Page Rank Algorithm

Today, we get such a ranking thanks to the work of Jose Lages at the University of Franche-Comte in France and a few pals. They’ve used the way universities are mentioned on Wikipedia to produce a world ranking. Their results provide a new way to think about rankings that may help to avoid some of the biases that can occur in other ranking systems.

Biases can crop up remarkably easily. For example, in the last century, English has become the de facto language of science, and the advantage this gives English-speaking countries is hard to quantify.

And there are other factors that are unique to university rankings. Some institutions focus more on teaching than on research—how should these factors be balanced?

The new work attempts to get around some of these problems using the Pagerank algorithm that Google famously uses to rank websites in search results. This uses the network of links between nodes on a network to determine those that are the most important.

The key insight is that the algorithm counts a node as important if other important nodes point to it. So it repeatedly works through the links, recalculating the importance of every node on each iteration, to come up with a ranking.

Exactly this process can be applied to Wikipedia articles. Each university mentioned in an article is a node in the network, and the links pointing toward it are used to determine a ranking (see also “Artificial Intelligence Aims to Make Wikipedia Friendlier and Better”).

Lages and co apply this process to 24 different language editions of Wikipedia. This database contains some four million articles in English, 1.5 million in German and around a million in each of French, Dutch, Italian, Spanish, and Russian. It also includes Chinese, Hebrew, Hungarian, and so on. “These 24 languages cover 59% of world population and 68% of the total number of Wikipedia articles in all 287 languages,” they say.

The team first determines a ranking for each language and point out that each language edition tends to favor its own universities. So the top 100 list in French includes 32 French-speaking universities, the top 100 in German includes 63 German-speaking universities, and so on. They then combine the lists to produce a global ranking.

Mathematics of Page Rank Algorithm

http://www.ams.org/samplings/feature-column/fcarc-pagerank


# simple power iteration
a <- c(1/3,1/3,1/3);
A <- matrix(c(1/2,1/2,0,1/2,0,1,0,1/2,0), nrow=3);
for (i in 1:100){
  a=A%*%a;
  print(a)
}

# teleport
a <- c(1/3,1/3,1/3);
A <- matrix(c(1/2,1/2,0,1/2,0,0,0,1/2,1), nrow=3);
B <- matrix(rep(1/3,9), nrow=3);
C <- .8*A+.2*B;
for (i in 1:100){
  a=C%*%a;
  print(a)
}

# ex1
a <- c(1/3,1/3,1/3);
A <- matrix(c(0,0,0,1/2,0,0,1/2,1,1), nrow=3, byrow=T);
B <- matrix(rep(1/3,9), nrow=3);
C <- .7*A+.3*B;
for (i in 1:100){
  a=C%*%a;
  print(a)
}


a <- c(1,1,1);
A <- matrix(c(0,0,1,1/2,0,0,1/2,1,0), nrow=3, byrow=T);
for (i in 1:100){
  a=A%*%a;
  print(a)
}

Sunday, September 6, 2015

Estimating the exponent from empirical data in R

   pwrdist <- function(u,...) {
      # u is vector of event counts, e.g. how many
      # crimes was a given perpetrator charged for by the police
      fx <- table(u)
      i <- as.numeric(names(fx))
      y <- rep(0,max(i))
      y[i] <- fx
      m0 <- glm(y~log(1:max(i)),family=quasipoisson())
      print(summary(m0))
      sub <- paste("s=",round(m0$coef[2],2),"lambda=",sum(u),"/",length(u))
      plot(i,fx,log="xy",xlab="x",sub=sub,ylab="counts",...)
      grid()
      lines(1:max(i),(fitted(m0)),type="l")
      return(m0)

  }