These files were gathered from several issue trackers.

We used APIs for Bugzilla- and JIRA-based bug trackers, and scraped the HTML directly for GitHub-based trackers.
We also wrote a fully custom crawler for the Chromium tracker (pdfium).

For Bugzilla-based trackers, we queried for issues with at least one attachment whose MIME type includes the word 'application'.
This query was intentionally broad.  We then gathered all attachments from the matching issues.
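
As a rough illustration of the query described above, here is a minimal sketch against Bugzilla's REST API.
The search parameters (f1/o1/v1 with the field name attachments.mimetype) follow Bugzilla's custom-search
convention and are an assumption about how such a query could be written, not a copy of the crawler's
actual request. The helper names build_search_url and decode_attachments are hypothetical.

```python
import base64
from urllib.parse import urlencode


def build_search_url(base_url, limit=100, offset=0):
    """Build a REST search URL for bugs that have an 'application' attachment."""
    params = {
        "f1": "attachments.mimetype",  # assumed custom-search field name
        "o1": "substring",
        "v1": "application",
        "include_fields": "id,creation_time",
        "limit": limit,
        "offset": offset,
    }
    return base_url.rstrip("/") + "/rest/bug?" + urlencode(params)


def decode_attachments(payload, bug_id):
    """Yield (file_name, content_type, raw_bytes) for one bug.

    `payload` is the parsed JSON returned by GET /rest/bug/<id>/attachment,
    in which attachment data is base64-encoded.
    """
    for att in payload.get("bugs", {}).get(str(bug_id), []):
        yield att["file_name"], att["content_type"], base64.b64decode(att["data"])
```

Paging through results with limit/offset keeps each request small, which matters for trackers as large as REDHAT.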

After downloading the files, we identified compressed files (e.g. *.bz2) and package files (e.g. *.zip). We wrote
code to decompress/extract those files (including the .tgz combination).  The extracted files are named after their container, for example: OOO-5860-10.zip-0.pdf.
For files extracted from compressed/packaged files, we used Apache Tika to identify the file type and to attach a file extension.
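
The extraction and naming step can be sketched roughly as follows. This is an illustration under stated
assumptions, not the project's code: guess_extension is a stand-in for Tika's detection that checks only a
few magic numbers, unpack and child_name are hypothetical names, and .tgz handling is left out for brevity.

```python
import bz2
import gzip
import io
import zipfile
from pathlib import Path


def child_name(container_name, index, extension):
    """Name an extracted file after its container, e.g. OOO-5860-10.zip-0.pdf."""
    return f"{container_name}-{index}{extension}"


def guess_extension(data):
    """Stand-in for Tika detection (assumption): checks a few magic numbers."""
    if data.startswith(b"%PDF"):
        return ".pdf"
    if data.startswith(b"PK\x03\x04"):
        return ".zip"
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    return ".bin"


def unpack(path, out_dir):
    """Extract a .zip, or decompress a .bz2/.gz, into out_dir; return new paths."""
    path, out_dir = Path(path), Path(out_dir)
    raw = path.read_bytes()
    if path.suffix == ".zip":
        with zipfile.ZipFile(io.BytesIO(raw)) as zf:
            payloads = [zf.read(info) for info in zf.infolist() if not info.is_dir()]
    elif path.suffix == ".bz2":
        payloads = [bz2.decompress(raw)]
    elif path.suffix == ".gz":
        payloads = [gzip.decompress(raw)]
    else:
        return []
    written = []
    for i, data in enumerate(payloads):
        target = out_dir / child_name(path.name, i, guess_extension(data))
        target.write_bytes(data)
        written.append(target)
    return written
```

Note that this sketch reads everything into memory and places no cap on decompressed size, so it offers no
protection against zip bombs like the one listed under known issues.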

We wrote code to correct the file extension of the original files based on Apache Tika's detection.  We discovered a fairly large
list of file types that need to be added to Tika (mostly specializations of .xml or .zip), so we created a deny list of
file extensions that we do not overwrite.
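
A minimal sketch of the deny-list check, assuming the detected extension has already been obtained from Tika.
The entries in DENY_LIST are purely illustrative and are not the project's actual list; corrected_name is a
hypothetical helper.

```python
from pathlib import Path

# Illustrative entries only (assumption): formats that are specializations
# of .zip or .xml and would otherwise be flattened to the generic type.
DENY_LIST = {".docx", ".xlsx", ".odt", ".svg"}


def corrected_name(file_name, detected_ext, deny_list=DENY_LIST):
    """Return file_name with detected_ext, unless its extension is protected."""
    current = Path(file_name).suffix.lower()
    if current in deny_list or current == detected_ext:
        return file_name  # keep the original, more specific extension
    stem = file_name[: -len(current)] if current else file_name
    return stem + detected_ext
```

For example, corrected_name("POI-123-0.docx", ".zip") leaves the name alone, while a file with no extension
or a wrong one gets the detected extension.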

For GitHub-based sites, we found that users frequently supplied a URL to an external file instead of, or in addition
to, actually attaching a file.  For those files, we tried to download the file from the external URL,
and we used Tika to set the file extension.  These files are named, for example, MOZILLA-LINK-3185-0.pdf, which is the first valid
external URL for mozilla issue 3185.
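
The link handling can be sketched as below. The URL pattern and the helper names external_urls and link_name
are assumptions made for illustration; the real crawler may find links differently. The index in the file
name counts valid (successfully downloaded) links, so it should be assigned after the download attempt,
which is omitted here.

```python
import re

# Simplified URL pattern (assumption); stops at whitespace, quotes and brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def external_urls(issue_text):
    """Return the URLs found in issue_text, in order, without repeats."""
    seen, urls = set(), []
    for match in URL_PATTERN.findall(issue_text):
        url = match.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def link_name(project, issue_id, index, extension):
    """Build a file name such as MOZILLA-LINK-3185-0.pdf."""
    return f"{project}-LINK-{issue_id}-{index}{extension}"
```

Removing repeats within one issue avoids downloading the same external file twice for that issue.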

For all files, we changed the filesystem "last modified" date to the date the issue was opened to give a sense of
the age of the file -- at some point we might change this to "uploaded date" where available.
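
Setting the timestamp can be done with the standard library alone. This sketch assumes the issue's creation
time is available as an ISO 8601 string (as Bugzilla's creation_time field provides);
stamp_with_issue_date is a hypothetical name.

```python
import os
from datetime import datetime


def stamp_with_issue_date(path, opened_iso):
    """Set path's access and modified times to the issue's creation time."""
    # Older Python versions do not parse a trailing "Z", so normalize it.
    opened = datetime.fromisoformat(opened_iso.replace("Z", "+00:00"))
    ts = opened.timestamp()
    os.utime(path, (ts, ts))
    return ts
```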

We removed 0-byte files, *.diff and *.txt files.
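
A cleanup pass along these lines might look like the following sketch; prune is a hypothetical name and the
matching is by extension only.

```python
from pathlib import Path

UNWANTED_SUFFIXES = {".diff", ".txt"}


def prune(root):
    """Delete 0-byte, *.diff and *.txt files under root; return removed names."""
    removed = []
    # Materialize the listing first so deletions do not disturb the walk.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.stat().st_size == 0 or path.suffix.lower() in UNWANTED_SUFFIXES:
            path.unlink()
            removed.append(path.name)
    return removed
```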

This is a work in progress and imperfect.  Please let us know if you notice any areas for improvement.

The source code for all crawlers except the pdfium crawler is available here: 
https://github.com/tballison/tika-addons/tree/master/bugtracker-crawler

In November 2020, we refreshed the crawl for all sites except for pdfium.

Sites

Bugzilla-based (using the standard rest-based API)
MOZILLA https://bugzilla.mozilla.org/
REDHAT https://bugzilla.redhat.com/
OOO https://bz.apache.org/ooo
POI https://bz.apache.org/bugzilla/
LIBRE_OFFICE https://bugs.documentfoundation.org/ 
GHOSTSCRIPT https://bugs.ghostscript.com/

Bugzilla-based html scraping (rest-based API is turned off) from https://bugs.freedesktop.org, products:
cairo
colord
dejavu
poppler

GitLab-based
https://gitlab.freedesktop.org/poppler/poppler (stored in poppler-gitlab/)
https://gitlab.freedesktop.org/cairo/cairo (stored in cairo-gitlab/)
https://gitlab.gnome.org/GNOME/evince

GitHub-based
https://github.com/sumatrapdfreader/sumatrapdf
https://github.com/mozilla/pdf.js
https://github.com/qpdf/qpdf
https://github.com/LibrePDF/OpenPDF
https://github.com/jbarlow83/OCRmyPDF
https://github.com/barryvdh/laravel-snappy
https://github.com/pdfminer/pdfminer.six
https://github.com/diegomura/react-pdf
https://github.com/foliojs/pdfkit
https://github.com/barteksc/AndroidPdfViewer
https://github.com/tabulapdf/tabula
https://github.com/tabulapdf/tabula-java
https://github.com/libvips/libvips
https://github.com/prawnpdf/prawn
https://github.com/axa-group/Parsr
https://github.com/pdfcpu/pdfcpu
https://github.com/pikepdf/pikepdf

On 8-9 March 2021, we also crawled JPEG-related sites, including:
https://github.com/libjpeg-turbo/libjpeg-turbo
https://github.com/haraldk/TwelveMonkeys
https://github.com/google/guetzli
https://github.com/mozilla/mozjpeg
https://github.com/tjko/jpegoptim
https://github.com/lovell/sharp
https://github.com/libvips/libvips
https://github.com/dropbox/lepton
https://github.com/SixLabors/ImageSharp
https://github.com/drewnoakes/metadata-extractor
https://github.com/contentful-labs/Concorde
https://github.com/spatie/image-optimizer
https://github.com/danielgtaylor/jpeg-archive

JIRA-based
https://issues.apache.org/jira/projects/COMPRESS
https://issues.apache.org/jira/projects/FOP
https://issues.apache.org/jira/projects/PDFBOX
https://issues.apache.org/jira/projects/TIKA
https://issues.apache.org/jira/projects/NUTCH
https://ec.europa.eu/cefdigital/tracker/projects/DSS

Other
pdfium https://bugs.chromium.org/p/pdfium/issues/list

Known issues --
* We deleted REDHAT-894449-14.gz, which contained a 78GB zip bomb
* These files come straight from the internet.  We've identified a handful of malicious documents, but 
  there may be more.  Let us know what you find!
* Tika's file type detection is imperfect
* Attachments can be duplicated if an issue links to the same attachment
  through different URLs (no obvious solution), or if the same attachment appears in several issues
* External links can be duplicated across issues (duplicates within a single issue are now fixed)

* Still todo:
        ** Red Hat is Bugzilla-based but overwhelming; we need more precise queries for the file types of interest
        ** GitLab -- poppler

Future work
* Need to modify code for incremental updates
* Figure out how to balance files better into subdirectories