September 2020

Data is everywhere!

Uh, where again?

Last verified October 4, 2020

Last night I was on a data science career panel (of awesome ladies!) as part of the Vancouver Datajam 2020, and I promised (as I’ve been meaning to do for a while…) to post a list of data resources. The hardest part of finding data isn’t finding such a list but finding one that is up to date, but I’ll try. Should I fail, here first is a list of well-maintained, curated lists of data sources.

An Aggregate of Other Aggregators

Ok, but how to divvy up the data types?

Ultimately I have a taxonomy problem: divide the data by datatype, by domain, or by best-suited algorithm type? In the end, I’ll do a mixture of all three. This is how my mind divides them; this is how I ultimately search among them; this is hopefully how such a list will be most useful.

Curated Datasets

A breed all their own: they’re uniform, tidy, split into training/validation/test sets, and (over-)used to pit algorithms against each other (some are curated and shared for that purpose but aren’t adopted as readily). Older benchmarks are good for starting out or for hard variants of the problem statement (e.g. one-shot!). See sotabench, a Papers With Code project that encourages reproducing published results. In no particular order:
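Data found in the wild rarely arrives pre-split the way these curated sets do. Here is a minimal sketch of one way to carve out training/validation/test sets yourself. The function name `split_dataset`, the 80/10/10 fractions, and the fixed seed are all assumptions for illustration, not a convention taken from any of the benchmarks listed here.

```python
import random


def split_dataset(items, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle items and split them into (train, validation, test) lists.

    The fractions and seed are illustrative defaults, not a standard.
    """
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    shuffled = list(items)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

A plain random shuffle like this is usually the wrong choice for time series or grouped data (say, several images of the same person), where the split should respect time order or group boundaries instead.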

Disentanglement/Representation Learning

Images

Segmentation & Captioning

NLP

For benchmarking:

Recommenders

Various

Outlier/Anomaly/Event Detection

One/Few Shot

Graphs

Symbolic Regression

Both from the universe of Max Tegmark:

Audio

and Video: AViD collects videos with a Creative Commons license and shares them as a static dataset.

Data in the Wild

Time Series

Climate data

Recommended by Amanda Giang during the discussion:

Sports stats

  1. SentiWordNet (“assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity”) may be an interesting baseline to compare against sentiment analysis trained on supervised datasets.

tags: machine learning
