โInspired by google c4, here is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing. Including Chinese data processing and cleaning methods in MassiveText.โ
โA small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tikaโ