centic9/CommonCrawlDocumentDownload

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or mime-types. It is used for mass-testing frameworks like Apache POI and Apache Tika.
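
As a rough illustration of the filtering idea described above, the following sketch decides whether a single URL index line is worth downloading. It is not the project's code: the class `IndexLineFilter`, its methods, the short mime-type and extension lists, and the sample line are all hypothetical. It assumes each index line carries a JSON object with `url`, `mime` and `mime-detected` string fields, and it extracts them with a regular expression only to stay dependency-free; a real implementation would more likely use a JSON parser.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the project): selects index lines by mime type or extension. */
final class IndexLineFilter {
    // Illustrative selections only; the tool ships its own, longer lists.
    private static final Set<String> MIME_TYPES = Set.of(
            "application/msword", "application/vnd.ms-excel", "application/vnd.ms-powerpoint");
    private static final Set<String> EXTENSIONS = Set.of(".doc", ".xls", ".ppt", ".pot");

    /** Extracts a string-valued JSON field with a regex (assumes no escaped quotes in the value). */
    static Optional<String> field(String line, String name) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"")
                .matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Returns true if the detected (or reported) mime type or the URL extension is of interest. */
    static boolean matches(String line) {
        // Prefer "mime-detected" when present, fall back to "mime".
        String mime = field(line, "mime-detected").or(() -> field(line, "mime")).orElse("");
        if (MIME_TYPES.contains(mime.toLowerCase(Locale.ROOT))) {
            return true;
        }
        String url = field(line, "url").orElse("").toLowerCase(Locale.ROOT);
        int query = url.indexOf('?');
        String path = query >= 0 ? url.substring(0, query) : url; // ignore the query string
        return EXTENSIONS.stream().anyMatch(path::endsWith);
    }

    public static void main(String[] args) {
        // Made-up sample line; field values are for illustration only.
        String line = "com,example)/report.bin 20190915000000 "
                + "{\"url\": \"http://example.com/report.bin\", "
                + "\"mime\": \"application/octet-stream\", "
                + "\"mime-detected\": \"application/msword\"}";
        System.out.println(matches(line)); // prints true
    }
}
```

Checking the mime type first and the extension second means a document served with a generic mime type can still be picked up by its file name.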

7 Releases
Latest: 7mo ago
1.0.0.11 (Latest)
centic9 · 7mo ago · November 10, 2025

📋 Changes

  • [Update to latest crawl and disable throttling, seems not necessary cu…](https://github.com/centic9/CommonCrawlDocumentDownload/commit/9756d08fa054b1ebd7bea4faad892fb7536ccba7)
  • [Don't add matches twice to file with matching mimetypes/extensions](https://github.com/centic9/CommonCrawlDocumentDownload/commit/7cc151a29f0506e0b92a18d9113fad1a26365a55)
  • [Add note about missing backoff and link to commoncrawl-fetcher-lite](https://github.com/centic9/CommonCrawlDocumentDownload/commit/2c5f852262a14f99e1d086a855d5d1cd453536fe)
  • [Update Github Action](https://github.com/centic9/CommonCrawlDocumentDownload/commit/258a113c3b979d4195634fa0d4ee33436590221f)
  • [Update to JDK 17](https://github.com/centic9/CommonCrawlDocumentDownload/commit/7ae7eb459e9eb97482f6288811b075d18f40a40c)
  • [Migrate to JUnit 5](https://github.com/centic9/CommonCrawlDocumentDownload/commit/993f081c877fe851e1adb479917c8cc140f0842d)
  • [Migrate to Apache Http Client 5](https://github.com/centic9/CommonCrawlDocumentDownload/commit/4cb82eb057d698c1f621c340399d301c5d86d754)
  • Update third party libraries
1.0.0.9
centic9 · 3y ago · January 15, 2023

📋 Changes

  • Switch to Gradle 7.6 and to the new maven-publish plugin
  • Update third-party libraries
  • Update to a more recent CC-MAIN crawl
  • Parse newer fields
  • Adjust logging configuration
1.0.0.8
centic9 · 3y ago · January 15, 2023

Intermediate release while switching to Gradle 7.6, not uploaded to Maven Central. **Full Changelog**: https://github.com/centic9/CommonCrawlDocumentDownload/compare/1.0.0.7...1.0.0.8

1.0.0.10
centic9 · 3y ago · January 15, 2023

📋 Changes

  • Re-publish with correct artifactId
1.0.0.7
centic9 · 4y ago · March 13, 2022

📋 Changes

  • Add extension .pot for PowerPoint
  • Switch to CC-MAIN-2019-39
  • Update third-party libraries
1.0.0.6
centic9 · 7y ago · March 21, 2019

📋 Changes

  • Update 3rd party libraries
  • Use common-crawl 2018-43 by default
  • Write accumulated mimetypes to a separate text-file after each index-file
  • Add some support for detecting duplicate files and moving them out of the list, so that the post-processing steps do not re-process the same file over and over
  • Some small adjustments for behavior changes in Java 11
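
The release notes do not say how duplicates are detected. One common approach, shown here purely as an assumption, is to hash each file's content and treat any file whose hash was already seen as a duplicate. The class `DuplicateFinder` and its methods are hypothetical, and the sketch works on in-memory byte arrays to stay self-contained; a real tool would read the files from disk.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not the project's code): content-hash based duplicate detection. */
final class DuplicateFinder {
    /** Returns a Base64-encoded SHA-256 digest of the given content. */
    static String sha256(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** Returns the names of all entries whose content equals that of an earlier entry. */
    static List<String> duplicates(Map<String, byte[]> files) {
        Map<String, String> firstNameByHash = new HashMap<>();
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            String hash = sha256(entry.getValue());
            // putIfAbsent returns the earlier name if this hash was already recorded.
            if (firstNameByHash.putIfAbsent(hash, entry.getKey()) != null) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>(); // keeps insertion order
        files.put("a.doc", "same content".getBytes(StandardCharsets.UTF_8));
        files.put("b.doc", "other content".getBytes(StandardCharsets.UTF_8));
        files.put("c.doc", "same content".getBytes(StandardCharsets.UTF_8));
        System.out.println(duplicates(files)); // prints [c.doc]
    }
}
```

The returned names could then be moved out of the working list so that later post-processing steps skip them.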
1.0.0.5
centic9 · 8y ago · October 30, 2017

📋 Changes

  • Update 3rd party libraries
  • Download some more mime-types out of the box
  • Use longer socket-timeout
  • Switch to the new S3 public dataset URL
  • Handle new item "mime-detected" in JSON
  • Some refactoring