These files were gathered to support file research and regression testing for Apache Tika, Apache PDFBox, Apache POI and other open source parsers.

These files came from the internet and may contain malware. Use these files at your own risk.

These files are NOT released under the Apache 2.0 license. Please see the individual README files for each collection for license, provenance and other notes.

If we can make this corpus easier to navigate or use, or if you have any questions or comments, please email corpora-dev@tika.apache.org.

Some metadata:
1) MIME types are available here: https://corpora.tika.apache.org/base/metadata/mimes/
2) There's an H2 database from tika-eval-1.24.1, run in Profile mode: https://corpora.tika.apache.org/base/metadata/tika-eval/
3) There's a single SQLite database that combines 1 and 2: https://corpora.tika.apache.org/base/metadata/corpora-metadata.db
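
The schema of the combined SQLite database is not described in this README, so a reasonable first step is to inspect it rather than guess at table names. The following is a minimal sketch, assuming corpora-metadata.db has already been downloaded to a local path. The helper names list_tables and count_rows are hypothetical and exist only for this illustration; the sketch uses only Python's standard sqlite3 module and opens the file read-only.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def _connect_read_only(db_path):
    """Open an existing SQLite file read-only so the download is never modified."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def list_tables(db_path):
    """Return the names of all tables in the SQLite file, ordered by name."""
    with closing(_connect_read_only(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def count_rows(db_path, table):
    """Return the number of rows in one table (hypothetical helper)."""
    # Quote the identifier, since table names cannot be bound as parameters.
    quoted = '"' + table.replace('"', '""') + '"'
    with closing(_connect_read_only(db_path)) as conn:
        (count,) = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
    return count


# Example usage, assuming the file was saved next to this script:
#   for table in list_tables("corpora-metadata.db"):
#       print(table, count_rows("corpora-metadata.db", table))
```

Discovering the tables first keeps the sketch independent of any particular schema; once the table and column names are known, ordinary SQL queries can be run against them in the same way.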