Data is everywhere! – Lara Thompson:: ML/AI/Data Scientist

Last verified October 19, 2024 (though the datasets seem dated!)

Last night I was on a data science career panel (of awesome ladies!) as part of the Vancouver Datajam 2020, and I promised (as I’ve been meaning to do for a while…) to post a list of data resources. The hardest part of finding data isn’t finding such a list but finding one that is up to date, but I’ll try. Should I fail, here first is a list of well-maintained, curated lists of data sources.

An Aggregate of Other Aggregators

Huggingface datasets
PyTorch Vision and PyTorch NLP
Tensorflow models/datasets resource is offered by Google. Many of the datasets below are accessible via tensorflow_datasets
UCI ML Repository “currently maintain 557 data sets as a service to the machine learning community”
Kaggle including such gems as the arXiv and avocado prices
Google Public Data is curating datasets; they also have a Dataset Search
OpenSpending: “search over 3,446 data packages from 83 countries with over 159,706,407 fiscal records”
Harvard Dataverse is a repository for research data (and code!).
FiveThirtyEight posts all the data to back the articles
A compilation of Twitter/X datasets
Tableau Public hosts datasets
Stats Canada; DataBC; Vancouver Open Data; US Data.gov; NYC OpenData; Seattle Open Data; Our World in Data; etc…
Appen hosts some Open Source Datasets
KDnuggets has datasets galore and also aggregates yet more aggregators. Alas, some links are out of date.

Ok, but how to divvy up the data types?

Ultimately I have a taxonomy problem: divide the data by datatype, domain or best-suited algorithm type? In the end, I’ll do a mixture of all three. This is how my mind divides them; this is how I ultimately search among them; this is hopefully how such a list will be most useful.

Curated Datasets

A breed all their own: they’re uniform, tidy, split into training/validation/test sets, and (over-)used to pit algorithms against each other (some are curated and shared for that purpose but aren’t adopted as readily). Older benchmarks are good for starting out or for hard variants of the problem statement (e.g. one-shot!). See the SOTA by benchmark hosted by Papers With Code.

Disentanglement/Representation Learning

MPI3D datasets simulated and real-world environments
disentanglement_lib includes dSprites, Color/Noisy/Scream-dSprites, SmallNORB, Cars3D, and Shapes3D
also, try generating points on a surface in 3d to represent in 2d, such as the swiss roll (more rolls are harder to learn)
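
The swiss roll suggestion above can be sketched in a few lines. This is a minimal, standard-library-only version, not taken from any of the libraries listed; the function name and the `n_turns`, `height` and `noise` parameters are my own assumptions for illustration. It returns the 3d points along with each point's position along the spiral, which is the coordinate a good 2d representation should recover.

```python
import math
import random


def swiss_roll(n_points, n_turns=1.5, height=10.0, noise=0.0, seed=0):
    """Sample points on a swiss-roll surface in 3d (hypothetical helper).

    Returns (points, ts): points is a list of (x, y, z) tuples and ts holds
    each point's angle along the spiral, i.e. its "unrolled" coordinate.
    """
    rng = random.Random(seed)
    points, ts = [], []
    for _ in range(n_points):
        # Angle along the spiral; starting at pi leaves a hollow core.
        t = math.pi * (1.0 + 2.0 * n_turns * rng.random())
        # Position across the width of the sheet.
        y = height * rng.random()
        # The radius grows with the angle, which is what makes it a roll.
        x = t * math.cos(t) + noise * rng.gauss(0.0, 1.0)
        z = t * math.sin(t) + noise * rng.gauss(0.0, 1.0)
        points.append((x, y, z))
        ts.append(t)
    return points, ts


# A gentle roll, and a more tightly wound one that is harder to unroll.
points, t = swiss_roll(1000, n_turns=1.5)
harder_points, harder_t = swiss_roll(1000, n_turns=3.0)
```

Colouring a learned 2d embedding by `t` is a quick visual check of whether the roll was actually unrolled; raising `n_turns` or `noise` makes that harder.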

Images

MNIST, CIFAR-10/100, and Fashion-MNIST all have ~60k images split among 10 classes
ciFAIR-10/100 duplicate free versions of CIFAR-10/100
ImageNet is large, with bigger images; a decent subset is annotated with bounding boxes
DanBooru2021 a large-scale anime image database with 4.9m+ images annotated with 162m+ tags
Large-scale Fashion (DeepFashion) Database to scale up Fashion-MNIST
Plant Disease is the most widely used in agriculture studies
Unsplash lite and Full
Zappos50K 4 categories of shoes
UCSD Birds 200 categories of birds
CelebA 200K images each with 40 attributes
Visual Domain Decathlon 10 simultaneous visual challenges

Segmentation & Captioning

Street View House Numbers (SVHN)
COCO is a large-scale object detection, segmentation, and captioning dataset
Open Images image labels, bounding boxes, segmentation, relations, and narratives
VisualQA (VQA) open-ended questions about images requiring an understanding of vision, language and commonsense knowledge to answer

NLP

Large Movie Review Dataset and Sentiment140 for sentiment analysis
Twenty Newsgroups for text classification
curated Wikipedia Corpus or dumps from Wikipedia itself
Blog Authorship Corpus many blogs of many bloggers
Machine Translation ~15GB within various “tasks”
Yelp Open Dataset mixes NLP with images, interaction timelines, coordinates
One Billion Words a standard corpus of reasonable size (0.8 billion words)
Fake News Corpus
PG-19 extracted from Gutenberg
Snowden archive
Darknet Market Archives 2013-2015 scrapes covering vendor pages, feedback, images, etc.
3m Russian Troll tweets from FiveThirtyEight

SentiWordNet (“assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity”) may be interesting to compare against in sentiment analysis from supervised datasets.

NLU

SQuAD 1-2 datasets
GLUE and SuperGlue
Measuring Massive Multitask Language Understanding (MMLU) bigger and harder, built to test GPT-3
and so many many more since the explosion of LLMs…

Recommenders

MS Learning to Rank dataset
MovieLens 25m ratings for ~60k movies of ~160k users
Spotify Recsys Challenge 2018 assembled by MSc students independently of Spotify, who no longer host it
Goodbooks-10k scraped from GoodReads
Book-Crossing
Netflix Prize, a classic
GroupLens links to various datasets (book crossing is on Kaggle! look back two links)

Various

Penn ML Benchmarks for supervised learning algorithms
AutoML/AutoDL competitions datasets dating back to 2016; Springer has open access to the book with a chapter reviewing the challenge
OKCupid dataset N=68,371, 2,620 variables from the dating site OKCupid
Common Crawl has petabytes of data, regularly collected since 2008
GDELT Project “watching our world unfold”, or (less creepy) “the GDELT Project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.”
CS bibliography has many datasets in many domains

Outlier/Anomaly/Event Detection

On the Evaluation of Unsupervised Outlier Detection data
Outlier Detection Datasets (ODDS)
Unsupervised Anomaly Detection Benchmark data
Anomaly Detection Meta-Analysis Benchmarks data
Numenta Anomaly Benchmark (NAB)
Turing Change Point Dataset
MAVEN: A massive general domain event detection dataset, and accompanying paper.

One/Few Shot

miniImageNet was introduced in Matching Networks for One Shot Learning; Meta-Transfer Learning for Few-Shot Learning added tieredImageNet and Fewshot-CIFAR10, both available to download directly; also see mini on Kaggle
Meta-Dataset assembles various datasets into one benchmark
Chollet’s ARC-AGI dataset and a recent (ongoing as I write this in 2024) competition

Graphs

Stanford Large Network Dataset Collection for social graphs, roads, communication networks and more
Open Graph Benchmark (OGB) for “a collection of realistic, large-scale, and diverse benchmark datasets”
OpenStreetMap
SketchGraphs “A Large-Scale Dataset for Modeling Relational Geometry in Computer-Aided Design”
Data for STREETS
2013 NYC Taxi Trip Data

Symbolic Regression

Both from the universe of Max Tegmark:
AI Feynman all eqns from the Feynman lectures, includes bonus eqns
AI Physicist considers different forces per region of space

Audio

Free Spoken Digit Dataset = spoken MNIST
Speech Command Dataset with 65k 1s utterances of 30 short spoken commands like “Yes”, “No”, “Stop”, “Go”
Free Music Archive ~900GB/343 days of Creative-Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres
Million Song Dataset audio features and metadata of ~1m popular music tracks
LibriSpeech ~1k hrs of audiobooks from LibriVox
VoxCeleb ~1m utterances by ~7k celebrities, >2k hrs
Spotify OpenMic; their podcast dataset TREC is no longer available

and Video: AViD collected videos with a creative-commons license shared as a static dataset

Data in the Wild

Time Series

S&P 500
Spotify Sequential Skip Prediction Challenge but this has pages of User Agreements to scroll+click through
CompEngine a self-organizing db of time-series data

Climate data

Global climate data
NOAA
www.data.gov/climate
AI for Earth may help with resources
Catalyst Cooperative
Washington Post Data behind the series “2ºC: Beyond the Limit.”, also here, here, and here.

Recommended by Amanda Giang during the discussion:
Pangeo and the dataset WeatherBench that they host
Zenodo
Google Earth Engine