Data is everywhere!

Uh, where again?

Tags: machine learning, datasets

Author: Lara Thompson

Published: September 13, 2020

Last verified October 19, 2024 (though the datasets seem dated!)

Last night I was on a data science career panel (of awesome ladies!) as part of the Vancouver Datajam 2020 and I promised (as I’ve been meaning to do for a while…) to post a list of data resources. The hardest part of finding data isn’t finding such a list but finding one that is up-to-date, but I’ll try. Should I fail, here first is a list of well-maintained, curated lists of data sources.

An Aggregate of Other Aggregators

Ok, but how to divvy up the data types?

Ultimately I have a taxonomy problem: divide the data by datatype, domain or best-suited algorithm type? In the end, I’ll do a mixture of all three. This is how my mind divides them; this is how I ultimately search among them; this is hopefully how such a list will be most useful.

Curated Datasets

A breed all their own: they’re uniform, tidy, split into training/validation/test sets, and (over-)used to pit algorithms against each other (some datasets are curated and shared for that purpose but aren’t adopted as readily). Older benchmarks are good for starting out or for hard variants of the problem statement (e.g. one-shot!). See Papers With Code, which hosts SOTA results by benchmark.

Disentanglement/Representation Learning

  • MPI3D datasets simulated and real-world environments
  • disentanglement_lib includes dSprites, Color/Noisy/Scream-dSprites, SmallNORB, Cars3D, and Shapes3D
  • also, try generating points on a surface in 3d to represent in 2d, such as the swiss roll (more rolls are harder to learn)
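The swiss roll suggestion above can be sketched in a few lines. This is a minimal, standard-library-only version; the function name `make_swiss_roll` and its `turns`, `height` and `seed` parameters are my own assumptions for illustration, not taken from any of the libraries listed. It returns both the 3d points and the flat 2d coordinates they were rolled up from, so a learned 2d representation has something to be compared against.

```python
import math
import random


def make_swiss_roll(n_points, turns=1.5, height=10.0, seed=0):
    """Sample points on a swiss-roll surface in 3d.

    Returns (points, coords): points are (x, y, z) tuples and coords are the
    matching (t, y) pairs, i.e. the flat 2d sheet before it was rolled up.
    The parameter names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    t_min = 1.5 * math.pi
    t_max = t_min + 2.0 * math.pi * turns  # more turns = more rolls
    points, coords = [], []
    for _ in range(n_points):
        t = rng.uniform(t_min, t_max)   # position along the spiral
        y = rng.uniform(0.0, height)    # position across the sheet
        points.append((t * math.cos(t), y, t * math.sin(t)))
        coords.append((t, y))
    return points, coords


# An easier and a harder variant of the same problem.
points, coords = make_swiss_roll(500, turns=1.5)
harder_points, harder_coords = make_swiss_roll(500, turns=3.0)
```

Raising `turns` wraps the same sheet around more times, which is the "more rolls are harder to learn" knob. If scikit-learn is available, its `sklearn.datasets.make_swiss_roll` does a similar job with a fixed number of turns.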

Images

Segmentation & Captioning

  • Street View House Numbers (SVHN)
  • COCO is a large-scale object detection, segmentation, and captioning dataset
  • Open Images image labels, bounding boxes, segmentation, relations, and narratives
  • VisualQA (VQA) open-ended questions about images requiring an understanding of vision, language and commonsense knowledge to answer

NLP

SentiWordNet (“assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity”) may be an interesting baseline to compare against sentiment models trained on supervised datasets.
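To make the three-score idea concrete, here is a rough sketch of a lexicon baseline. Everything in it is an assumption for illustration: the lexicon is a toy keyed by word rather than by WordNet synset, its values are made up (they are not real SentiWordNet entries), and the helper names are hypothetical. The one property carried over from SentiWordNet is that the three scores sum to 1, so objectivity is whatever is left after positivity and negativity.

```python
# Toy lexicon: word -> (positivity, negativity).
# Values are invented for illustration, NOT real SentiWordNet entries.
TOY_LEXICON = {
    "good": (0.75, 0.0),
    "awful": (0.0, 0.875),
    "table": (0.0, 0.0),
}


def scores(word):
    """Return (positivity, negativity, objectivity); unknown words count as fully objective."""
    pos, neg = TOY_LEXICON.get(word.lower(), (0.0, 0.0))
    return pos, neg, 1.0 - pos - neg


def lexicon_polarity(text):
    """Average (positivity - negativity) over whitespace-separated tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(p - n for p, n, _ in (scores(tok) for tok in tokens)) / len(tokens)


def agreement(labelled):
    """Fraction of (text, label) pairs where the sign of the polarity matches the label (1 = positive)."""
    hits = sum((lexicon_polarity(text) > 0) == bool(label) for text, label in labelled)
    return hits / len(labelled)
```

With a real lexicon loaded in place of `TOY_LEXICON`, `agreement` gives a quick number to set beside the accuracy of a supervised model on the same labelled examples.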

NLU

Recommenders

Various

  • Penn ML Benchmarks for supervised learning algorithms
  • AutoML/AutoDL competitions datasets dating back to 2016; Springer has open access to the book with a chapter reviewing the challenge
  • OKCupid dataset N=68,371, 2,620 variables from the dating site OKCupid
  • Common Crawl has petabytes of data, regularly collected since 2008
  • GDELT Project “watching our world unfold”, or (less creepy) “the GDELT Project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.”
  • CS bibliography has many datasets in many domains

Outlier/Anomaly/Event Detection

One/Few Shot

Graphs

Symbolic Regression

Both from the universe of Max Tegmark:

  • AI Feynman all eqns from the Feynman lectures, includes bonus eqns
  • AI Physicist considers different forces per region of space

Audio

  • Free Spoken Digit Dataset = spoken MNIST
  • Speech Command Dataset with 65k one-second utterances of 30 short spoken commands like “Yes”, “No”, “Stop”, “Go”
  • Free Music Archive ~900GB/343 days of Creative-Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres
  • Million Song Dataset audio features and metadata of ~1m popular music tracks
  • LibriSpeech ~1k hrs of audiobooks from LibriVox
  • VoxCeleb ~1m utterances by ~7k celebrities, >2k hrs
  • Spotify OpenMic; their podcast dataset TREC is no longer available

And video: AViD, a collection of videos with Creative Commons licenses, shared as a static dataset.

Data in the Wild

Time Series

Climate data

Recommended by Amanda Giang during the discussion:

  • Pangeo and the dataset WeatherBench that they host
  • Zenodo
  • Google Earth Engine

Sports stats