Last verified October 19, 2024 (though the datasets seem dated!)
Last night I was on a data science career panel (of awesome ladies!) as part of the Vancouver Datajam 2020, and I promised (as I’ve been meaning to do for a while…) to post a list of data resources. The hardest part of finding data isn’t finding such a list but finding one that is up to date, but I’ll try. Should I fail, here first is a list of well-maintained, curated lists of data sources.
An Aggregate of Other Aggregators
TensorFlow models/datasets resource is offered by Google. Many of the datasets below are accessible via tensorflow_datasets.
UCI ML Repository “currently maintain 557 data sets as a service to the machine learning community”
Kaggle including such gems as the arXiv and avocado prices
Google Public Data is curating datasets; they also have a Dataset Search
OpenSpending: “search over 3,446 data packages from 83 countries with over 159,706,407 fiscal records”
Harvard Dataverse is a repository for research data (and code!).
FiveThirtyEight posts all the data to back the articles
A compilation of Twitter/X datasets
Tableau Public hosts datasets
Stats Canada; DataBC; Vancouver Open Data; US Data.gov; NYC OpenData; Seattle Open Data; Our World in Data; etc…
Appen hosts some Open Source Datasets
KDnuggets has datasets galore and also aggregates yet more aggregators. Alas, some links are out of date.
Ok, but how to divvy up the data types?
Ultimately I have a taxonomy problem: divide the data by datatype, domain, or best-suited algorithm type? In the end, I’ll do a mixture of all three. This is how my mind divides them; this is how I ultimately search among them; this is hopefully how such a list will be most useful.
Curated Datasets
A breed all their own: they’re uniform, tidy, split into training/validation/test sets, and (over-)used to pit algorithms against each other (some are curated and shared for that purpose but aren’t adopted as readily). Older benchmarks are good for starting out or for hard variants of the problem statement (e.g. one-shot!). See the SOTA results by benchmark hosted at Papers With Code.
Disentanglement/Representation Learning
- MPI3D datasets simulated and real-world environments
- disentanglement_lib includes dSprites, Color/Noisy/Scream-dSprites, SmallNORB, Cars3D, and Shapes3D
- also, try generating points on a surface in 3d to represent in 2d, such as the swiss roll (more rolls are harder to learn)
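The swiss roll suggestion above can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function name `make_swiss_roll`, the `n_rolls` and `height` parameters, and the starting angle of 1.5π are all assumptions chosen for illustration. It returns the 3D points together with the 2D coordinates (angle along the spiral, height) that a good 2D representation should recover.

```python
import numpy as np


def make_swiss_roll(n_points=1000, n_rolls=1.5, height=10.0, seed=0):
    """Sample points on a swiss roll embedded in 3D (illustrative sketch).

    Returns (points, coords):
      points has shape (n_points, 3), the 3D embedding;
      coords has shape (n_points, 2), the intrinsic (angle, height) position.
    """
    rng = np.random.default_rng(seed)
    # Angle along the spiral; a larger n_rolls wraps the sheet around more times.
    t = 1.5 * np.pi + 2.0 * np.pi * n_rolls * rng.random(n_points)
    # Position across the width of the sheet.
    y = height * rng.random(n_points)
    # The radius grows with the angle, which turns the circle into a spiral.
    x = t * np.cos(t)
    z = t * np.sin(t)
    points = np.column_stack([x, y, z])
    coords = np.column_stack([t, y])
    return points, coords


# An easier roll (one turn) and a harder one (three turns).
easy_points, easy_coords = make_swiss_roll(n_rolls=1.0)
hard_points, hard_coords = make_swiss_roll(n_rolls=3.0)
```

Increasing `n_rolls` stacks more layers of the sheet close together in 3D, which is one way to dial up the difficulty. scikit-learn also ships a `sklearn.datasets.make_swiss_roll` helper if a fixed number of turns is enough.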
Images
- MNIST, CIFAR-10/100, and Fashion-MNIST each have 60k-70k images split among 10 classes (100 classes for CIFAR-100)
- ciFAIR-10/100 duplicate-free versions of CIFAR-10/100
- ImageNet is large, with bigger images, a decent subset of which are annotated with bounding boxes
- DanBooru2021 a large-scale anime image database with 4.9m+ images annotated with 162m+ tags
- Large-scale Fashion (DeepFashion) Database to scale up Fashion-MNIST
- Plant Disease is the most widely used in agriculture studies
- Unsplash lite and Full
- Zappos50K 4 categories of shoes
- UCSD Birds 200 categories of birds
- CelebA 200K images each with 40 attributes
- Visual Domain Decathlon 10 simultaneous visual challenges
NLP
- Large Movie Review Dataset and Sentiment140 for sentiment analysis
- Twenty Newsgroups for text classification
- curated Wikipedia Corpus or dumps from Wikipedia itself
- Blog Authorship Corpus many blogs of many bloggers
- Machine Translation ~15GB within various “tasks”
- Yelp Open Dataset mixes NLP with images, interaction timelines, coordinates
- One Billion Words a standard corpus of reasonable size (0.8 billion words)
- Fake News Corpus
- PG-19 extracted from Gutenberg
- Snowden archive
- Darknet Market Archives 2013-2015 scrapes covering vendor pages, feedback, images, etc.
- 3m Russian Troll tweets from FiveThirtyEight
NLU
- SQuAD 1-2 datasets
- GLUE and SuperGLUE
- Measuring Massive Multitask Language Understanding (MMLU), a bigger, harder benchmark for testing GPT-3
- and so many many more since the explosion of LLMs…
Recommenders
- MS Learning to Rank dataset
- MovieLens 25m ratings for ~60k movies of ~160k users
- Spotify Recsys Challenge 2018 assembled by MSc students independently of Spotify, which no longer hosts it
- Goodbooks-10k scraped from GoodReads
- Book-Crossing
- Netflix Prize, a classic
- GroupLens links to various datasets (book crossing is on Kaggle! look back two links)
Various
- Penn ML Benchmarks for supervised learning algorithms
- AutoML/AutoDL competitions datasets dating back to 2016; Springer has open access to the book with a chapter reviewing the challenge
- OKCupid dataset N=68,371, 2,620 variables from the dating site OKCupid
- Common Crawl has petabytes of data, regularly collected since 2008
- GDELT Project “watching our world unfold”, or (less creepy) “the GDELT Project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.”
- CS bibliography has many datasets in many domains
Outlier/Anomaly/Event Detection
- On the Evaluation of Unsupervised Outlier Detection data
- Outlier Detection Datasets (ODDS)
- Unsupervised Anomaly Detection Benchmark data
- Anomaly Detection Meta-Analysis Benchmarks data
- Numenta Anomaly Benchmark (NAB)
- Turing Change Point Dataset
- MAVEN: A massive general domain event detection dataset, and accompanying paper.
One/Few Shot
- miniImageNet was introduced in Matching Networks for One Shot Learning; Meta-Transfer Learning for Few-Shot Learning added tieredImageNet and Fewshot-CIFAR10, both available to download directly; also see mini on Kaggle
- Meta-Dataset assembles various datasets into one benchmark
- Chollet’s ARC-AGI dataset and a recent (ongoing as I write this in 2024) competition
Graphs
- Stanford Large Network Dataset Collection for social graphs, roads, communication networks and more
- Open Graph Benchmark (OGB) for “a collection of realistic, large-scale, and diverse benchmark datasets”
- OpenStreetMap
- SketchGraphs “A Large-Scale Dataset for Modeling Relational Geometry in Computer-Aided Design”
- Data for STREETS
- 2013 NYC Taxi Trip Data
Symbolic Regression
Both from the universe of Max Tegmark:
- AI Feynman all eqns from the Feynman lectures, includes bonus eqns
- AI Physicist considers different forces per region of space
Audio
- Free Spoken Digit Dataset = spoken MNIST
- Speech Command Dataset with 65k one-second utterances of 30 short spoken commands like “Yes”, “No”, “Stop”, “Go”
- Free Music Archive ~900GB/343 days of Creative-Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres
- Million Song Dataset audio features and metadata of ~1m popular music tracks
- LibriSpeech ~1k hrs of audiobooks from LibriVox
- VoxCeleb ~1m utterances by ~7k celebrities, >2k hrs
- Spotify OpenMic; their podcast dataset TREC is no longer available
and Video: AViD collected videos with a creative-commons license shared as a static dataset
Data in the Wild
Time Series
- S&P 500
- Spotify Sequential Skip Prediction Challenge but this has pages of User Agreements to scroll+click through
- CompEngine a self-organizing db of time-series data
Climate data
- Global climate data
- NOAA
- www.data.gov/climate
- AI for Earth may help with resources
- Catalyst Cooperative
- Washington Post Data behind the series “2°C: Beyond the Limit.”, also here, here, and here.
Recommended by Amanda Giang during the discussion:
- Pangeo and the dataset WeatherBench that they host
- Zenodo
- Google Earth Engine