The Common Crawl WWW Ranking

Crawl the Web

Everything starts with some big data. In 2012, the Common Crawl Foundation gathered a large general web crawl of about 3.5 billion HTML pages. The crawl is open data and is accessible to everyone.

Build the Graph

The Data and Web Science Group of the University of Mannheim parsed the crawl, obtaining a web graph: a mathematical representation of the links of the web. In a web graph nodes represent pages, and arcs represent hypertext links.

In collaboration with Sebastiano Vigna of the Laboratory for Web Algorithmics, the web graph was analyzed using the WebGraph framework. You can download the graph in WebGraph format, or you can see what we found in our paper or on the Web Data Commons site.

Do Your Link Analysis

We then extracted the host graph, a smaller graph where nodes represent hosts rather than pages. There is an arc between two hosts if some page within the first host points to some page within the second host. We used the information in the host graph to find the most important hosts of the web graph from Common Crawl 2012.
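As an illustration of the host-graph definition above, here is a minimal Python sketch that collapses page-level links into host-level arcs. It is not the code used to build the actual graph: the function name `host_graph`, the toy URLs, and the decision to drop links that stay within a single host are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def host_graph(page_links):
    """Collapse (source_url, target_url) page links into a set of host arcs."""
    arcs = set()
    for source_url, target_url in page_links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        # Assumption: skip unparsable URLs and links inside a single host.
        if source_host and target_host and source_host != target_host:
            arcs.add((source_host, target_host))
    return arcs


# Toy data: two pages of a.example link to b.example, but that is one arc.
page_links = [
    ("http://a.example/index.html", "http://b.example/news.html"),
    ("http://a.example/about.html", "http://b.example/"),
    ("http://a.example/index.html", "http://a.example/about.html"),
    ("http://b.example/news.html", "http://c.example/"),
]
print(sorted(host_graph(page_links)))
```

However many pages of one host link to pages of another, the host graph keeps a single arc between the two hosts, which is why it is so much smaller than the page graph.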

Harmonic Centrality

The default ranking we show you is by harmonic centrality. If you want, you can find its definition on Wikipedia, but we can explain it easily.

Suppose your site is example.com. Your score by harmonic centrality is, as a start, the number of sites with a link towards example.com. They are called sites at distance one. Say, there are 50 such sites: your score is now 50.

There will also be sites with a link towards sites that have a link towards example.com, but that are not themselves at distance one. They are called sites at distance two. Say there are 80 such sites: they are not as important as the previous ones, so we give them just half a point each. So you get 40 more points and your score is now 90.

We can go on: there will also be sites with a link towards sites that have a link towards sites that have a link towards example.com (!), but that are not at distance one or two. They are called sites at distance three. Say there are 100 such sites: as you can guess, we give them just one third of a point each. So you get 33.333… more points and your score is now 123.333….

You do this for every site that can get to example.com just by following links, and you have your score by harmonic centrality. And we have software that will approximate harmonic centrality for very large graphs.
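The procedure just described can be sketched in a few lines of Python. This is an exact computation by breadth-first search, which only suits small graphs; it is not the approximation software mentioned above. The function names and the toy graph are assumptions made for the example.

```python
from collections import deque


def harmonic_centrality(arcs, target):
    """Sum 1/d over every site at finite distance d > 0 from which target is reachable."""
    # For each site, list the sites that link to it.
    predecessors = {}
    for source, dest in arcs:
        predecessors.setdefault(dest, []).append(source)

    # Breadth-first search backwards from the target along incoming links.
    distance = {target: 0}
    queue = deque([target])
    score = 0.0
    while queue:
        site = queue.popleft()
        for pred in predecessors.get(site, []):
            if pred not in distance:
                distance[pred] = distance[site] + 1
                score += 1.0 / distance[pred]
                queue.append(pred)
    return score


def score_from_distance_counts(counts):
    """counts[0] sites at distance one, counts[1] at distance two, and so on."""
    return sum(count / d for d, count in enumerate(counts, start=1))


# Toy graph: 2 sites at distance one, 2 at distance two, 3 at distance three.
arcs = [
    ("a.example", "example.com"), ("b.example", "example.com"),
    ("c.example", "a.example"), ("d.example", "b.example"),
    ("e.example", "c.example"), ("f.example", "c.example"),
    ("g.example", "c.example"),
]
print(round(harmonic_centrality(arcs, "example.com"), 3))    # 2 + 2/2 + 3/3 = 4.0
print(round(score_from_distance_counts([50, 80, 100]), 3))   # 123.333, as in the text
```

Each site is counted once, at its shortest distance from example.com, which is what the "but that are not at distance one or two" condition expresses.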

Indegree, Katz, PageRank

Since we like options, we let you play with other rankings. Ranking by indegree simply means that your score is the number of sites with a link towards you: the more links, the higher your rank. Browsing the top sites by indegree is a nice way to get acquainted with spam, as it is very easy to artificially increase your indegree.
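Ranking by indegree is simple enough to sketch directly. The snippet below is only an illustration on a toy host graph. The function name, the tie-breaking by host name, and the omission of hosts with no incoming links are assumptions of the example.

```python
from collections import Counter


def indegree_ranking(arcs):
    """Return (host, indegree) pairs, highest indegree first."""
    # Duplicate arcs are collapsed: each linking host counts once.
    counts = Counter(dest for _, dest in set(arcs))
    # Ties are broken alphabetically so that the order is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


host_arcs = [
    ("a.example", "c.example"),
    ("b.example", "c.example"),
    ("a.example", "b.example"),
    ("a.example", "c.example"),  # duplicate, counted once
]
print(indegree_ranking(host_arcs))
```

The sketch also shows why indegree is easy to game: every new host that links to you adds a full point, no matter how unimportant that host is.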

Katz and PageRank are two well-known and very similar centrality indices that count the number of possible ways in which you can get from any other site to your site. The definition of PageRank appears in the first paper about Google. Both indices have been computed using the LAW library.
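To make the "counting ways to reach you" idea concrete, here is a small Python sketch of both indices using plain iteration. It is not how the LAW library computes them. The attenuation factor `beta`, the damping factor `alpha`, the iteration counts, and the uniform treatment of hosts without outgoing links are all assumptions chosen for the example.

```python
def katz(arcs, beta=0.5, iterations=50):
    """Count paths ending at each node, weighting a path of length i by beta**i."""
    arcs = set(arcs)
    nodes = sorted({node for arc in arcs for node in arc})
    term = {node: 1.0 for node in nodes}   # paths of length 0
    score = dict(term)
    for _ in range(iterations):
        longer = {node: 0.0 for node in nodes}
        for source, dest in arcs:
            longer[dest] += beta * term[source]   # extend each path by one arc
        term = longer
        for node in nodes:
            score[node] += term[node]
    return score


def pagerank(arcs, alpha=0.85, iterations=100):
    """Power iteration; rank held by nodes without outlinks is spread uniformly."""
    arcs = set(arcs)
    nodes = sorted({node for arc in arcs for node in arc})
    outlinks = {node: [] for node in nodes}
    for source, dest in arcs:
        outlinks[source].append(dest)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not outlinks[node])
        updated = {node: (1 - alpha) / n + alpha * dangling / n for node in nodes}
        for node in nodes:
            for dest in outlinks[node]:
                updated[dest] += alpha * rank[node] / len(outlinks[node])
        rank = updated
    return rank


toy_arcs = [("a", "c"), ("b", "c"), ("c", "d")]
print(katz(toy_arcs))
print(pagerank(toy_arcs))
```

In the toy graph, `d` has only one incoming link, but it inherits the paths that reach `c`, so its Katz score matches that of `c`. The Katz sum only converges on graphs with cycles when `beta` is small enough; the toy graph has no cycles, so any value works here.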

You can find a very detailed and readable discussion of all these indices in this paper.

Play

You can compare the results of different rankings in two ways. If you select “Compare ranks” (the default), the column selected for sorting displays sites, and the other columns show the rank of each site according to the other indices. For instance, the second page of results by harmonic centrality features the site of the Guardian at position 19, but you can see that PageRank would place it at position 270.

If you instead select “Compare listings”, you will see for each index the sites in the same position. The first page contains the top sites for each ranking, and so on.
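The difference between the two views can be sketched as follows. This is only an illustration of the idea, not the code behind the site: the function names are hypothetical, the scores are invented toy values, and every site is assumed to have a score under every index.

```python
def rank_positions(scores):
    """Map each site to its 1-based position, highest score first."""
    ordered = sorted(scores, key=lambda site: (-scores[site], site))
    return {site: position for position, site in enumerate(ordered, start=1)}


def compare_ranks(sort_index, indices):
    """One row per site, ordered by sort_index, with its position under each index."""
    positions = {name: rank_positions(scores) for name, scores in indices.items()}
    ordered = sorted(positions[sort_index], key=positions[sort_index].get)
    return [(site, {name: positions[name][site] for name in indices})
            for site in ordered]


def compare_listings(indices):
    """Row k holds, for each index, the site in position k under that index."""
    names = list(indices)
    listings = [sorted(indices[name], key=lambda s, n=name: (-indices[n][s], s))
                for name in names]
    return [dict(zip(names, row)) for row in zip(*listings)]


# Invented toy scores, for illustration only.
indices = {
    "harmonic": {"x.example": 3.0, "y.example": 2.0, "z.example": 1.0},
    "pagerank": {"x.example": 0.2, "y.example": 0.5, "z.example": 0.3},
}
for row in compare_ranks("harmonic", indices):
    print(row)
for row in compare_listings(indices):
    print(row)
```

In the first view each row follows one site across the indices. In the second view each row is a position, and the sites in it may differ from column to column.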