centic9/CommonCrawlDocumentDownload
A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks like Apache POI and Apache Tika.
7 Releases
Latest: 7mo ago
1.0.0.11 (Latest)
📋 Changes
- [Update to latest crawl and disable throttling, seems not necessary cu…](https://github.com/centic9/CommonCrawlDocumentDownload/commit/9756d08fa054b1ebd7bea4faad892fb7536ccba7)
- [Don't add matches twice to file with matching mimetypes/extensions](https://github.com/centic9/CommonCrawlDocumentDownload/commit/7cc151a29f0506e0b92a18d9113fad1a26365a55)
- [Add note about missing backoff and link to commoncrawl-fetcher-lite](https://github.com/centic9/CommonCrawlDocumentDownload/commit/2c5f852262a14f99e1d086a855d5d1cd453536fe)
- [Update Github Action](https://github.com/centic9/CommonCrawlDocumentDownload/commit/258a113c3b979d4195634fa0d4ee33436590221f)
- [Update to JDK 17](https://github.com/centic9/CommonCrawlDocumentDownload/commit/7ae7eb459e9eb97482f6288811b075d18f40a40c)
- [Migrate to JUnit 5](https://github.com/centic9/CommonCrawlDocumentDownload/commit/993f081c877fe851e1adb479917c8cc140f0842d)
- [Migrate to Apache Http Client 5](https://github.com/centic9/CommonCrawlDocumentDownload/commit/4cb82eb057d698c1f621c340399d301c5d86d754)
- Update third-party libraries
1.0.0.9
📋 Changes
- Switch to Gradle 7.6 and to the new maven-publish plugin
- Update third-party libraries
- Update to a more recent CC-MAIN crawl
- Parse newer fields
- Adjust logging configuration
1.0.0.8
Intermediate release while switching to Gradle 7.6; not uploaded to Maven Central.
**Full Changelog**: https://github.com/centic9/CommonCrawlDocumentDownload/compare/1.0.0.7...1.0.0.8
1.0.0.10
📋 Changes
- Re-publish with correct artifactId
1.0.0.7
📋 Changes
- Add extension .pot for PowerPoint
- Switch to CC-MAIN-2019-39
- Update third-party libraries
1.0.0.6
📋 Changes
- Update 3rd party libraries
- Use CommonCrawl crawl 2018-43 by default
- Write accumulated MIME types to a separate text file after each index file
- Add basic support for detecting duplicate files and moving them out of the list, so that the post-processing steps do not re-process the same file over and over
- Some small adjustments for behavior changes in Java 11
1.0.0.5
📋 Changes
- Update 3rd party libraries
- Download more MIME types out of the box
- Use a longer socket timeout
- Switch to the new S3 public dataset URL
- Handle the new item "mime-detected" in the JSON
- Some refactoring
