cc-notebooks
Various Jupyter notebooks about Common Crawl data
The project is written primarily in Jupyter Notebook, is distributed under the Apache License 2.0, and was first published in 2019. Key topics include: aws-athena, common-crawl, commoncrawl, jupyter-notebook, webarchiving.
Jupyter Notebooks to Analyze Common Crawl Data
- analyzing data using the [columnar index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/)
  - blocking of internet connections from and to the Islamic Republic of Iran during the November 2019 crawl: [net-blocking-iran-cc-main-2019-47.ipynb](./cc-index-table/net-blocking-iran-cc-main-2019-47.ipynb)
  - total number of captures 2013 – 2019, domain coverage and approximation of unique URLs for the `.edu` top-level domain: cc-main-2013-2019-metrics.ipynb
  - correlations between character sets and languages: correlation-language-charset.ipynb
- analyze the Common Crawl webgraph data sets and interactively explore the graphs: cc-webgraph-statistics
- how to explore WARC files running a notebook on AWS EMR
- truncated record payloads in WARC files:
  - verify that all truncated payloads are annotated by the WARC-Truncated header
  - which MIME types are mostly affected by truncation? Aggregations using the columnar index.
- a notebook version of our introductory whirlwind Python tour (external repository)
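
To give a rough idea of what the WARC-Truncated check listed above involves, here is a minimal, standard-library-only sketch. It is not the code from the notebook: the helper names (`iter_warc_records`, `make_record`, `unannotated`), the tiny `LIMIT` value and the `example.com` records are all hypothetical, and the synthetic records are simplified (they omit mandatory WARC fields such as the record ID and date). The heuristic assumed here is that a payload which reaches the size limit but carries no `WARC-Truncated` header is suspicious.

```python
import io

LIMIT = 16  # hypothetical payload limit in bytes, kept tiny for the demo


def iter_warc_records(stream):
    """Yield (headers, block) for each record of an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separating two records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def make_record(uri, payload, truncated=None):
    """Build a simplified synthetic WARC record (illustration only)."""
    lines = ["WARC/1.0", "WARC-Type: resource", f"WARC-Target-URI: {uri}"]
    if truncated:
        lines.append(f"WARC-Truncated: {truncated}")
    lines.append(f"Content-Length: {len(payload)}")
    head = "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def unannotated(stream, limit):
    """Return URIs whose payload reached the limit but lack WARC-Truncated."""
    return [
        headers.get("warc-target-uri")
        for headers, block in iter_warc_records(stream)
        if len(block) >= limit and "warc-truncated" not in headers
    ]


data = (
    make_record("http://example.com/a", b"short")
    + make_record("http://example.com/b", b"x" * LIMIT, truncated="length")
    + make_record("http://example.com/c", b"y" * LIMIT)
)
suspects = unannotated(io.BytesIO(data), LIMIT)
print(suspects)
```

Real Common Crawl WARC files are gzip-compressed per record and store HTTP headers in front of the payload, so a dedicated WARC library is likely the better choice for anything beyond this illustration. Note also that a payload exactly at the limit is not necessarily truncated, so the check only yields candidates for closer inspection.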
