GitPedia

cc-notebooks

Various Jupyter notebooks about Common Crawl data

From commoncrawl · Updated April 24, 2026

* analyzing data using the [columnar index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/)
  - blocking of internet connections from and to the Islamic Republic of Iran during the November 2019 crawl: [net-blocking-iran-cc-main-2019-47.ipynb](./cc-index-table/net-blocking-iran-cc-main-2019-47.ipynb)
  - total number of captures 2013 – 2019, domain coverage and approximation of unique URLs for the `.edu` top-level domain: [cc-main-2013-2019-metrics.ipynb](./cc-index...

The project is written primarily in Jupyter Notebook, is distributed under the Apache License 2.0, and was first published in 2019. Key topics include: aws-athena, common-crawl, commoncrawl, jupyter-notebook, webarchiving.
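To give a feel for the kind of question the columnar index can answer, here is a minimal sketch that only builds an SQL string for counting captures and distinct registered domains of one top-level domain in one crawl. It is not taken from the notebooks. The function name `build_tld_metrics_query` is hypothetical, and the table name `ccindex.ccindex` and the columns `crawl`, `subset`, `url_host_tld` and `url_host_registered_domain` are assumptions based on the publicly documented columnar index schema; check them against your own Athena setup before running the query.

```python
def build_tld_metrics_query(crawl: str, tld: str, table: str = "ccindex.ccindex") -> str:
    """Build an SQL query counting captures and distinct domains for one TLD.

    Hypothetical helper for illustration; the table and column names are
    assumptions about the columnar index schema, not code from the notebooks.
    """
    # Reject anything other than letters, digits and hyphens so that the
    # values can be embedded in the SQL text without quoting problems.
    for value in (crawl, tld):
        if not value or not value.replace("-", "").isalnum():
            raise ValueError(f"unexpected characters in {value!r}")

    return (
        "SELECT COUNT(*) AS captures,\n"
        "       COUNT(DISTINCT url_host_registered_domain) AS domains\n"
        f"FROM {table}\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND url_host_tld = '{tld}'"
    )


if __name__ == "__main__":
    # Print the query text; submitting it to Athena is left to the reader.
    print(build_tld_metrics_query("CC-MAIN-2019-47", "edu"))
```

The helper stops at producing query text, so it runs without AWS credentials; the resulting string could then be pasted into the Athena console or submitted through whichever client you already use.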

Jupyter Notebooks to Analyze Common Crawl Data


This article is auto-generated from commoncrawl/cc-notebooks via the GitHub API. Last fetched: 6/25/2026