This data was gathered from Common Crawl as part of TIKA-2750. Some details are documented in the CommonCrawl3 article on our wiki. As described there, we refetched some truncated files from their original websites and stored them in commoncrawl3_refetched.

Please adhere to Common Crawl's Terms of Use, including their full terms of use.