This post provides a distilled overview regarding the rediscovery of 50,000 samples within the MNIST dataset.
MNIST: The Potential Danger of Overfitting
Recently, Chhavi Yadav (NYU) and Leon Bottou (Facebook AI Research and NYU) described in their paper, “Cold Case: The Lost MNIST Digits”, how they reconstructed the MNIST (Modified National Institute of Standards and Technology) dataset and added 50,000 samples to the test set for a total of 60,000 samples. For over 20 years, many data scientists and researchers have trained models on MNIST and evaluated them against its test set of 10,000 samples. Yet the industry is aware that the popularity and heavy reuse of MNIST (and other popular datasets) may also increase the potential danger of overfitting. This has led researchers to address that danger by reconstructing datasets, measuring accuracy on the reconstructed data, and then sharing their process. Sharing the process makes the work easier to reproduce and lets others in the industry build on it.
For example, Yadav and Bottou ask in their paper,
“Hundreds of publications report increasingly good performance on this same test set [10,000 samples]. Did they overfit the test set? Can we trust any new conclusion drawn from this data set?” They also note that “50,000 samples have since been lost.”
MNIST Reconstruction Steps
To address these questions about the potential danger of overfitting, Yadav and Bottou reconstructed the MNIST dataset. While the paper covers the process in detail, the README in their GitHub repository provides a distilled summary of the steps they took:
“1. Start with a first reconstruction algorithm according to the information found in the [separate] paper introducing the MNIST dataset.
2. Use the Hungarian algorithm to find the best pairwise match between the MNIST training digits and our reconstructed training digits. Visually examine the poorest matches, trying to understand what the MNIST authors could have done differently to justify these differences without at the same time changing the existing close matches.
3. Try new variants of the reconstruction algorithm, match their outputs to their best counterpart in the MNIST training set, and repeat the process.”
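To make step 2 more concrete, here is a minimal sketch of how two sets of digit images could be paired and the poorest matches surfaced for visual inspection. It is an illustration, not the authors' code: the function name `match_digits`, the squared Euclidean distance used as the matching cost, and the tiny synthetic "images" are all assumptions made for this example. SciPy's `linear_sum_assignment` solves the same assignment problem that the Hungarian algorithm addresses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_digits(reference, reconstructed):
    """Pair each reference image with one reconstructed image.

    Returns (ref_idx, rec_idx, distances) for the one-to-one matching that
    minimizes the total squared Euclidean distance (an assumed cost).
    """
    ref = reference.reshape(len(reference), -1).astype(np.float64)
    rec = reconstructed.reshape(len(reconstructed), -1).astype(np.float64)
    # cost[i, j] = squared Euclidean distance between ref[i] and rec[j]
    cost = (
        (ref ** 2).sum(axis=1)[:, None]
        + (rec ** 2).sum(axis=1)[None, :]
        - 2.0 * ref @ rec.T
    )
    cost = np.maximum(cost, 0.0)  # guard against tiny negative round-off
    ref_idx, rec_idx = linear_sum_assignment(cost)
    return ref_idx, rec_idx, cost[ref_idx, rec_idx]


# Synthetic stand-ins for the two digit sets: a shuffled, slightly noisy copy.
rng = np.random.default_rng(0)
reference = rng.random((6, 28, 28))
perm = rng.permutation(6)
reconstructed = reference[perm] + rng.normal(scale=0.01, size=(6, 28, 28))

ref_idx, rec_idx, dist = match_digits(reference, reconstructed)

# Sort matched pairs from poorest to best, as a starting point for inspection.
worst_first = np.argsort(dist)[::-1]
for k in worst_first[:3]:
    print(f"reference {ref_idx[k]} <-> reconstructed {rec_idx[k]}: {dist[k]:.4f}")
```

On real data the cost matrix grows with the product of the two set sizes, so matching tens of thousands of digits this way would need far more memory than this toy example suggests.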
Sharing the process in this way supports research reproducibility and helps move the industry forward.
The Found MNIST Digits
Through this work, Yadav and Bottou were able to rediscover the lost 50,000 samples. As they put it:
“In the same spirit as [Recht et al., 2018, 2019], the rediscovery of the 50,000 lost MNIST test digits provides an opportunity to quantify the degradation of the official MNIST test set over a quarter-century of experimental research.”
They were also able to
“track each MNIST digit to its NIST source image and associated metadata”… “These fresh testing samples allow us to precisely investigate how the results reported on a standard testing set suffer from repeated experiments over a long period of time. Our results confirm the trends observed by Recht et al. [2018, 2019], albeit on a different dataset and in a substantially more controlled setup. All these results essentially show that the “testing set rot” problem exists but is far less severe than feared. Although the practice of repeatedly using the same testing samples impacts the absolute performance numbers, it also delivers pairing advantages that help model selection in the long run.”
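One way to read the "pairing advantages" mentioned in the quote: when two models are scored on the same test samples, their errors overlap, so the difference between their accuracies can be estimated more tightly than if each model had been scored on its own independent test set. The sketch below illustrates that idea with made-up per-sample results. The function name, the data, and the simple standard-error formulas are assumptions for illustration and are not taken from the paper's analysis.

```python
import math


def accuracy_gap_std_errors(correct_a, correct_b):
    """Compare two models scored on the same test samples.

    correct_a and correct_b are lists of 0/1 flags, one per test sample.
    Returns (gap, unpaired_se, paired_se) for the accuracy difference.
    """
    n = len(correct_a)
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    gap = acc_a - acc_b

    # Unpaired: pretend each accuracy came from an independent test set.
    unpaired_se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)

    # Paired: use per-sample differences, which cancel shared mistakes.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    variance = sum((d - gap) ** 2 for d in diffs) / (n - 1)
    paired_se = math.sqrt(variance / n)
    return gap, unpaired_se, paired_se


# Hypothetical results: both models miss the same 10 samples, B misses 2 more.
correct_a = [1] * 90 + [0] * 10
correct_b = [1] * 88 + [0] * 12

gap, unpaired_se, paired_se = accuracy_gap_std_errors(correct_a, correct_b)
print(f"gap={gap:.3f} unpaired_se={unpaired_se:.4f} paired_se={paired_se:.4f}")
```

In this toy case the paired standard error is roughly a third of the unpaired one, because the two models make mostly the same mistakes.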
The work also drew wider attention: Yadav and Bottou’s results were shared on Yann LeCun’s Twitter feed.
MNIST reborn, restored and expanded.
Now with an extra 50,000 training samples.
If you used the original MNIST test set more than a few times, chances are your models overfit the test set. Time to test them on those extra samples. https://t.co/l7QA1u94jF
— Yann LeCun (@ylecun) May 29, 2019
Resources to Consider
This post provided a distilled overview of a recently released paper on the rediscovery of 50,000 samples within MNIST. If you are interested in learning more about MNIST, then consider the following resources that were cited and referenced in this post.
- LeCun, Cortes, and Burges’ “The MNIST Database”
- Recht, Roelofs, Schmidt, and Shankar’s “Do CIFAR-10 Classifiers Generalize to CIFAR-10?”
- Recht, Roelofs, Schmidt, and Shankar’s “Do ImageNet Classifiers Generalize to ImageNet?”
- Yadav and Bottou’s paper “Cold Case: The Lost MNIST Digits” and GitHub repository: https://github.com/facebookresearch/qmnist
Domino Data Science Field Notes provide highlights of data science research, trends, techniques, and more, to help data scientists and data science leaders accelerate their work or careers. If you are interested in your data science work being covered in this blog series, please send us an email at content(at)dominodatalab(dot)com.