These files were gathered from several issue trackers. We used APIs for Bugzilla- and JIRA-based bug trackers, and did straight HTML scraping for GitHub-based trackers. We also wrote a fully custom crawler for Chromium.

For Bugzilla-based trackers, we ran queries for issues that contained an attachment with a MIME type including the word 'application'. This was a broad query. We then gathered all attachments from those issues.

After downloading the files, we identified compressed files (e.g. *.bz2) and package files (e.g. *.zip), and we wrote code to extract/decompress them (including the .tgz combination). The extracted files are named, for example, OOO-5860-10.zip-0.pdf. For the files extracted from compressed/package files, we used Apache Tika for file identification and to attach a file extension. We also wrote code to change the file extension of the original files based on Tika's detection. We discovered a fairly large list of file types that need to be added to Tika (mostly specializations of .xml or .zip), and we created a deny list so that we wouldn't overwrite the file extensions on that list.

For GitHub-based sites, we found that users frequently supplied a URL to an external file rather than, or in addition to, actually attaching a file. For those files, we tried to download the file from the external URL, and we used Tika to modify the file extension. These files are named like MOZILLA-LINK-3185-0.pdf, which is the first valid external URL for mozilla issue 3185.

For all files, we changed the filesystem "last modified" date to the date the issue was opened to give a sense of the age of the file. At some point we might change this to "uploaded date" where available. We removed 0-byte files, *.diff files and *.txt files.

This is a work in progress and imperfect. Please let us know if you notice any areas for improvement.
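The last two clean-up steps described above (dropping 0-byte, *.diff and *.txt files, and setting the "last modified" date to the issue-opened date) can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the function names `should_remove`, `set_last_modified` and `post_process` are hypothetical, and matching extensions case-insensitively is an assumption.

```python
import os
from datetime import datetime, timezone
from pathlib import Path

# Extensions the text says were removed from the corpus.
# Assumption: matching is case-insensitive (e.g. "PATCH.DIFF" is also removed).
SKIPPED_SUFFIXES = {".diff", ".txt"}


def should_remove(path: Path) -> bool:
    """Return True for 0-byte files and for *.diff / *.txt files."""
    return path.stat().st_size == 0 or path.suffix.lower() in SKIPPED_SUFFIXES


def set_last_modified(path: Path, issue_opened: datetime) -> None:
    """Set the filesystem "last modified" time to the issue-opened date."""
    timestamp = issue_opened.timestamp()
    # os.utime takes (access_time, modified_time) in seconds since the epoch.
    os.utime(path, (timestamp, timestamp))


def post_process(path: Path, issue_opened: datetime) -> bool:
    """Delete unwanted files, otherwise stamp them. Returns True if kept."""
    if should_remove(path):
        path.unlink()
        return False
    set_last_modified(path, issue_opened)
    return True
```

A caller would pass each downloaded file together with the opened date of its issue, for example `post_process(Path("OOO-5860-10.zip-0.pdf"), datetime(2020, 11, 1, tzinfo=timezone.utc))`, where the date shown is made up for illustration.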
The source code for all crawlers except the pdfium crawler is available here: https://github.com/tballison/tika-addons/tree/master/bugtracker-crawler

In November 2020, we refreshed the crawl for all sites except for pdfium.

Sites

Bugzilla-based (using the standard REST-based API)
MOZILLA https://bugzilla.mozilla.org/
REDHAT https://bugzilla.redhat.com/
OOO https://bz.apache.org/ooo
POI https://bz.apache.org/bugzilla/
LIBRE_OFFICE https://bugs.documentfoundation.org/
GHOSTSCRIPT https://bugs.ghostscript.com/

Bugzilla-based HTML scraping (REST-based API is turned off) from https://bugs.freedesktop.org, products:
cairo
colord
dejavu
poppler

GitLab-based
https://gitlab.freedesktop.org/poppler/poppler (stored in poppler-gitlab/)
https://gitlab.freedesktop.org/cairo/cairo (stored in cairo-gitlab/)
https://gitlab.gnome.org/GNOME/evince

GitHub-based
https://github.com/sumatrapdfreader/sumatrapdf
https://github.com/mozilla/pdf.js
https://github.com/qpdf/qpdf
https://github.com/LibrePDF/OpenPDF
https://github.com/jbarlow83/OCRmyPDF
https://github.com/barryvdh/laravel-snappy
https://github.com/pdfminer/pdfminer.six
https://github.com/diegomura/react-pdf
https://github.com/foliojs/pdfkit
https://github.com/barteksc/AndroidPdfViewer
https://github.com/tabulapdf/tabula
https://github.com/tabulapdf/tabula-java
https://github.com/libvips/libvips
https://github.com/prawnpdf/prawn
https://github.com/axa-group/Parsr
https://github.com/pdfcpu/pdfcpu
https://github.com/pikepdf/pikepdf

On 8-9 March 2021, we also crawled JPEG sites, including:
https://github.com/libjpeg-turbo/libjpeg-turbo
https://github.com/haraldk/TwelveMonkeys
https://github.com/google/guetzli
https://github.com/mozilla/mozjpeg
https://github.com/tjko/jpegoptim
https://github.com/lovell/sharp
https://github.com/libvips/libvips
https://github.com/dropbox/lepton
https://github.com/SixLabors/ImageSharp
https://github.com/drewnoakes/metadata-extractor
https://github.com/contentful-labs/Concorde
https://github.com/spatie/image-optimizer
https://github.com/danielgtaylor/jpeg-archive

JIRA-based
https://issues.apache.org/jira/projects/COMPRESS
https://issues.apache.org/jira/projects/FOP
https://issues.apache.org/jira/projects/PDFBOX
https://issues.apache.org/jira/projects/TIKA
https://issues.apache.org/jira/projects/NUTCH
https://ec.europa.eu/cefdigital/tracker/projects/DSS

Other
pdfium https://bugs.chromium.org/p/pdfium/issues/list

Known issues

* We deleted REDHAT-894449-14.gz, which contained a 78 GB zip bomb.
* These files come straight from the internet. We've identified a handful of malicious documents, but there may be more. Let us know what you find!
* Tika's file type detection is imperfect.
* There can be duplicates in attachments if the same attachment is linked under different URLs within an issue (no obvious solution) or across issues.
* There can be duplicates in external links per issue (now fixed) and across issues.
* Still todo:
** Redhat is Bugzilla-based but overwhelming; we must do more precise queries for the file types of interest
** GitLab: poppler

Future work

* Modify the code to support incremental updates
* Figure out how to balance files better into subdirectories
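
Users of the corpus who want to work around the duplicate attachments and duplicate external links listed under the known issues can group files by content digest after download. The sketch below is one possible approach and is not part of the crawler: the helper names `sha256_of` and `find_duplicates` are hypothetical, and it only detects byte-identical files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Map each SHA-256 digest shared by two or more files to those files."""
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[sha256_of(path)].append(path)
    # Keep only digests that occur more than once.
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Calling `find_duplicates(Path("corpus/"))` on a local copy (the directory name is a placeholder) returns one entry per group of identical files, and the caller can then decide which copy to keep.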