commoncrawl/cc-downloader
A polite and user-friendly downloader for Common Crawl data
This new pre-release adds more documentation with details about the installation process. It also corrects some existing typos.
This release adds support for [CC-NEWS](https://commoncrawl.org/blog/news-dataset-available) as well as validation mechanisms for the crawl reference that the user supplies to the `download-paths` sub-command. The release also updates multiple dependencies and bumps the Rust edition to `2024` and the Rust compiler version to `1.85`.
📋 Changes
- In this pre-release we:
  - fixed issue #6 by adding a new User Agent
  - introduced refactors so that all linter checks pass
  - introduced a Rust workflow to ensure that the code compiles and the tests pass on the `dev` and `main` branches
  - changed the contributing policy so that PRs are merged into the `dev` branch
  - made slight updates to the documentation
💥 Breaking Changes
- There are no breaking changes for this release.
📦 Notes
- This pre-release starts organizing the `download.rs` file so that `cc-downloader` can also be used as a library and so that bindings can be more easily written.
Today we are happy to announce `cc-downloader`, an experimental command-line tool for downloading Common Crawl data via `https`. `cc-downloader` is intended to be a user-friendly and polite downloader. It was made in response to the significant increase in downloads of our data in recent months. That increase was very exciting to see, especially as a sign of growing interest in our dataset. But it also makes it harder for some users to download our data successfully, due to the quirks of downloading from a high-traffic storage bucket. `cc-downloader` is our solution to this problem, enabling our users to continue downloading our data via `https` without issues.

We have designed `cc-downloader` with a polite retry mechanism that allows our users to make sure that every single requested file is downloaded. It also implements [jitter](https://en.wikipedia.org/wiki/Jitter) and exponential [backoff](https://en.wikipedia.org/wiki/Exponential_backoff) strategies in order to avoid overwhelming our infrastructure.

If you wish to install `cc-downloader`, we have released pre-compiled binaries for all major operating systems and architectures. `cc-downloader` is written in [`Rust`](https://www.rust-lang.org/) and is distributed as a “crate”, so if you have [`cargo`](https://www.rust-lang.org/learn/get-started) installed, you can also install `cc-downloader` with this command:

```
cargo install cc-downloader
```

Once you have installed it, you’ll see that `cc-downloader` has two sub-commands.

First, `download-paths` downloads the list of file paths for a given crawl and subset from our bucket to a given destination folder in your file system:

```
cc-downloader download-paths CC-MAIN-2024-46 wet path/to/folder
```

In this case, the paths file will be `path/to/folder/wet.paths.gz`.
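
To make the `download-paths` step more concrete, here is a minimal sketch of how a crawl and subset could map to a remote listing and a local file. It is not taken from the `cc-downloader` source: the helper names `paths_url` and `paths_destination` are hypothetical, and both the `https://data.commoncrawl.org` base URL and the `crawl-data/<crawl>/<subset>.paths.gz` layout are assumptions that apply to `CC-MAIN` crawls only.

```rust
use std::path::{Path, PathBuf};

/// Assumed public HTTPS endpoint for Common Crawl data.
const BASE_URL: &str = "https://data.commoncrawl.org";

/// Hypothetical helper: URL of the paths listing for a crawl and subset.
/// The `crawl-data/<crawl>/<subset>.paths.gz` layout is an assumption.
fn paths_url(crawl: &str, subset: &str) -> String {
    format!("{BASE_URL}/crawl-data/{crawl}/{subset}.paths.gz")
}

/// Hypothetical helper: where the listing is saved inside the destination folder.
fn paths_destination(dest: &Path, subset: &str) -> PathBuf {
    dest.join(format!("{subset}.paths.gz"))
}

fn main() {
    let url = paths_url("CC-MAIN-2024-46", "wet");
    let file = paths_destination(Path::new("path/to/folder"), "wet");
    println!("{url} -> {}", file.display());
}
```

With the arguments from the command above, the local file resolves to `path/to/folder/wet.paths.gz`, matching the location described in the text.
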
Next, `download` reads this list of file paths and concurrently downloads the files to a given destination folder in your file system:

```
cc-downloader download path/to/folder/wet.paths.gz path/to/folder
```

By default, this preserves the tree structure that we use internally.

`cc-downloader` is still under active development, so if you find any issues or would like to submit a feature request, please visit our GitHub repository at https://github.com/commoncrawl/cc-downloader/. Contributions are always welcome!

We hope that with this tool our users will find it easier to download and use our data. If you’re encountering any problems with `cc-downloader` that look like high traffic, you can check our current traffic levels on our [infrastructure status webpage](https://status.commoncrawl.org/).
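
For readers curious about the retry behaviour mentioned in this announcement, the following is a minimal sketch of exponential backoff with "full jitter" using only the Rust standard library. It illustrates the general technique and is not the implementation used by `cc-downloader`. The names `backoff_ceiling`, `jittered_delay`, `retry`, and `XorShift64`, along with all timing values, are hypothetical. The tiny pseudo-random generator is there only to keep the example dependency-free.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Upper bound of the wait before retry `attempt` (0-based):
/// `base * 2^attempt`, capped at `max`.
fn backoff_ceiling(base: Duration, max: Duration, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Tiny xorshift64 generator, used only to avoid external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn from_clock() -> Self {
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos() as u64)
            .unwrap_or(0x9E37_79B9_7F4A_7C15);
        Self(nanos | 1) // the state must never be zero
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// "Full jitter": a wait picked (approximately uniformly) between zero
/// and the ceiling, so that many clients do not retry in lockstep.
fn jittered_delay(rng: &mut XorShift64, ceiling: Duration) -> Duration {
    let millis = ceiling.as_millis() as u64;
    Duration::from_millis(rng.next_u64() % millis.saturating_add(1))
}

/// Runs `op` until it succeeds or `max_attempts` calls have failed,
/// sleeping for a jittered, exponentially growing delay between calls.
fn retry<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut rng = XorShift64::from_clock();
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                let ceiling = backoff_ceiling(base, max, attempt);
                std::thread::sleep(jittered_delay(&mut rng, ceiling));
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Simulated download that fails twice before succeeding.
    let mut calls = 0;
    let result: Result<&str, &str> =
        retry(5, Duration::from_millis(10), Duration::from_millis(80), || {
            calls += 1;
            if calls < 3 { Err("simulated 503") } else { Ok("downloaded") }
        });
    println!("{result:?} after {calls} calls");
}
```

Capping the ceiling keeps waits bounded after many failures, while the random component spreads retries from different clients over time instead of letting them hit the server at the same moment.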
