To make access to some subsets of files easier for users, we've gathered sets of files into tar.gz files. For now, we've started with PDF files gathered from bug trackers.

Please see https://corpora.tika.apache.org/base/docs/bug_trackers/README.txt for a description of how these files were gathered and from where. As we mention in that README, this collection may contain malicious files. Beware!

The files that we initially gathered in Feb 2020 are in the archived/ directory, organized by project.

For the Nov 2020 crawl, we increased the number of projects dramatically. Note that we did not re-crawl pdfium in the Nov 2020 crawl.

For Nov 2020, we batched the PDFs by project:

batch1/
    PDFBOX

batch2/
    GHOSTSCRIPT
    TIKA

batch3/
    MOZILLA

batch4/
    LIBRE_OFFICE
    OOO (Open Office)
    pdf.js

batch5/
    androidpdfviewer
    cairo
    dejavu
    DSS
    FOP
    laravel-snappy
    libvips
    NUTCH
    ocrmypdf
    openpdf
    parsr
    pdfcpu
    PDFIUM
    pdfkit
    pdfminer.six
    pikepdf
    POI
    poppler
    prawn
    qpdf
    react-pdf
    REDHAT
    sumatrapdf
    tabula
    tabula-java

batch6/
    cairo-gitlab
    evince
    poppler-gitlab
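
Since the archives may contain malicious files, it can be worth unpacking them defensively. The following is a minimal Python sketch, not an official tool from this project. The function names (safe_members, extract_batch) and the archive path passed in are hypothetical. It extracts only regular files and directories whose paths stay inside the destination directory. It does nothing about the contents of the PDFs themselves, which should still be treated as untrusted.

```python
import os
import tarfile


def safe_members(archive, dest_dir):
    """Yield only regular files and directories whose paths stay inside dest_dir."""
    root = os.path.realpath(dest_dir)
    for member in archive.getmembers():
        target = os.path.realpath(os.path.join(root, member.name))
        inside = target == root or target.startswith(root + os.sep)
        # Skip symlinks, hard links, devices, and anything escaping dest_dir.
        if inside and (member.isfile() or member.isdir()):
            yield member


def extract_batch(archive_path, dest_dir):
    """Extract a .tar.gz into dest_dir and return the names that were extracted."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        members = list(safe_members(archive, dest_dir))
        archive.extractall(path=dest_dir, members=members)
    return [m.name for m in members]
```

For example, extract_batch("some_batch.tar.gz", "out") would unpack a downloaded archive into out/, where "some_batch.tar.gz" stands in for whatever local file name the archive was saved under. Running this inside a sandbox or throwaway virtual machine is a sensible extra precaution.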