Last verified October 4, 2020
Last night I was on a data science career panel (of awesome ladies!) as part of the Vancouver Datajam 2020, and I promised (as I’ve been meaning to do for a while…) to post a list of data resources. The hardest part of finding data isn’t finding such a list but finding one that is up to date, but I’ll try. Should I fail, here first is a list of well-maintained, curated lists of data sources.
The TensorFlow models/datasets resource is offered by Google. Many of the datasets below are accessible via tensorflow_datasets.
UCI ML Repository “currently maintain 557 data sets as a service to the machine learning community”
Kaggle, including such gems as the arXiv dataset and avocado prices
Google Public Data curates datasets; Google also offers Dataset Search
OpenSpending: “search over 3,446 data packages from 83 countries with over 159,706,407 fiscal records”
Harvard Dataverse is a repository for research data (and code!).
FiveThirtyEight posts the data behind its articles
Tableau Public hosted datasets
Stats Canada; DataBC; Vancouver Open Data; US Data.gov; NYC OpenData; Seattle Open Data; Switzerland’s data; Our World in Data; data.world; etc…
Appen hosts some Open Source Datasets
KDnuggets has datasets galore and also aggregates yet more aggregators. Alas, some links are out of date.
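Since many of the datasets in this post are reachable through tensorflow_datasets, here is a minimal sketch of peeking at one of them. The helper name `peek_dataset` and the choice of MNIST as the default are my own illustration, not something the lists above prescribe; it assumes the `tensorflow` and `tensorflow-datasets` packages are installed and that the machine can download the data on first use.

```python
def peek_dataset(name="mnist", split="train", n=3):
    """Return the first n (input, label) pairs of a TFDS dataset as NumPy values.

    `peek_dataset` is a hypothetical helper written for this post.
    """
    # Imported lazily so that defining the helper does not require the package.
    import tensorflow_datasets as tfds

    # as_supervised=True yields (input, label) tuples instead of feature dicts.
    ds = tfds.load(name, split=split, as_supervised=True)

    # tfds.as_numpy converts the tensors to NumPy arrays while iterating.
    return list(tfds.as_numpy(ds.take(n)))
```

Calling `peek_dataset()` downloads MNIST the first time it runs and caches it locally; swapping in another catalog name should work the same way, provided that dataset supports `as_supervised`.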
Ultimately I have a taxonomy problem: divide the data by datatype, domain or best-suited algorithm type? In the end, I’ll do a mixture of all three. This is how my mind divides them; this is how I ultimately search among them; this is hopefully how such a list will be most useful.
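One way to picture that mixed taxonomy is as a catalog where every dataset carries all three facets, so a search can filter on any combination. This is only a sketch: the `DatasetEntry` class, the `search` helper and the specific tags are my own illustrative assumptions, not an official classification of these datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    datatype: str  # e.g. "text", "tabular", "video"
    domain: str    # subject area the data comes from
    task: str      # best-suited algorithm type


# Illustrative tags only; they are guesses for the sake of the example.
CATALOG = [
    DatasetEntry("arXiv", "text", "research papers", "topic modelling"),
    DatasetEntry("avocado prices", "tabular", "economics", "regression"),
    DatasetEntry("AViD", "video", "everyday actions", "action recognition"),
    DatasetEntry("SentiWordNet", "text", "language", "sentiment analysis"),
]


def search(catalog, **facets):
    """Return the entries that match every facet given as a keyword argument."""
    return [
        entry
        for entry in catalog
        if all(getattr(entry, key) == value for key, value in facets.items())
    ]


text_sets = [entry.name for entry in search(CATALOG, datatype="text")]
```

Because each entry keeps all three facets, the same catalog can be browsed by datatype, by domain or by task without committing to a single hierarchy.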
A breed all their own: they’re uniform, tidy, split into training/validation/test sets, and (over-)used to pit algorithms against each other (some are curated and shared for that purpose but aren’t adopted as readily). Older benchmarks are good for starting out or for hard variants of the problem statement (e.g. one-shot!). See sotabench, a Papers With Code project to encourage reproducing published results. In no particular order:
For benchmarking:
Both from the universe of Max Tegmark:
and Video: AViD collected videos with a creative-commons license shared as a static dataset
Recommended by Amanda Giang during the discussion:
SentiWordNet (“assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity”) may be interesting to compare against in sentiment analysis from supervised datasets.