Webgraph
WebGraph is a framework for graph compression.
WebGraph is a framework for graph compression aimed at studying web graphs. It provides simple ways to manage very large graphs, exploiting modern compression techniques. More precisely, it is currently made of: The project is written primarily in Java, distributed under the GNU Lesser General Public License v2.1 license, first published in 2021. Key topics include: centrality-measures, compression-library, graph, graph-compression, java.
Welcome to WebGraph!
Introduction
WebGraph is a framework for graph compression aimed at studying web
graphs. It provides simple ways to manage very large graphs, exploiting
modern compression techniques. More precisely, it is currently made of:
-
A set of flat codes, called ζ codes, which are particularly
suitable for storing web graphs (or, in general, integers with power-law
distribution in a certain exponent range). The fact that these codes work
well can be easily tested empirically, but we also try to provide a
detailed mathematical
analysis. -
Algorithms for compressing web graphs that exploit gap compression and
referentiation (à la LINK),
intervalization and ζ codes to provide a high compression ratio (see our
datasets). The algorithms are
controlled by several parameters, which provide different tradeoffs
between access speed and compression ratio. -
Algorithms for accessing a compressed graph without actually
decompressing it, using lazy techniques that delay the decompression until
it is actually necessary. -
Algorithms for analysing very large graphs, such as
HyperBall
which has been used to show that Facebook has just four degrees of
separation. -
An implementation of the algorithms above in Java and
Rust distributed under either the GNU
Lesser General Public License
2.1+ or the
Apache Software License
2.0. Besides a clearly
defined API, we also provide several classes tha modify (e.g., transpose)
or recompress a graph, so to experiment with various settings. -
Datasets for large graph. These are either
gathered from public sources (such as
WebBase), or
produced by UbiCrawler and BUbiNG.
In the end, with WebGraph you can access and analyse very large web
graphs. Using WebGraph is as easy as installing a few jar files and
downloading a dataset. This makes studying phenomena such as PageRank,
distribution of graph properties of the web graph, etc. very easy.
You are welcome to use and improve WebGraph! If you find our software
useful for your research, please quote this
paper.
This version of WebGraph is limited to graphs with at most 2³¹ nodes. For
larger graphs, have a look at the big
version.
Users of WebGraph
<a href="https://www.softwareheritage.org/"><img src="svg/SWH.svg" width="200"></a>
<a href="https://www.commoncrawl.org/"><img src="svg/CC.svg" width="200"></a>
JGraphT
<a href="https://jgrapht.org/">JGraphT</a> has a
few <a href="https://jgrapht.org/guide/WebGraphAdapters">WebGraph adapters</a>.
Python
There are Python bindings for WebGraph.
Papers
-
A detailed description of
the compression algorithms used in WebGraph, published in the proceedings
of the Thirteenth International World–Wide Web
Conference. -
A mathematical analysis
of the performance of γ, δ and ζ codes against power-law distributions. -
Some quite surprising
experiments showing that the
transpose graph reacts very peculiarly to compression after
lexicographical or Gray-code sorting. -
A paper about
HyperBall
(then named HyperANF), our tool for computing an approximation of the
neighbourhood function, reachable nodes and geometric centralities of
massive graphs. More information can be found in this
preprint. -
HyperBall was used to
find out that on average there are just four degrees of
separation on
Facebook, and the experiment was reported by the
New York
Times.
Alas, the degrees were actually 3.74 (one less than the average
distance), but the off-by-one
between graph theory (“distance”) and sociology (“degrees of separation”)
generated a lot of confusion. -
A paper were we
describe our efforts to compress one of the largest social graphs
available—the graph of commits of the Software Heritage archive.
Contributors
Showing top 2 contributors by commit count.
