"a collection of full article texts extracted from historical U.S. newspaper images [that] includes nearly 20 million scans from the public domain"
Priya22/project-dialogism-novel-corpus: The official repository for the Project Dialogism Novel Corpus, a dataset of annotated quotations in full-length English novels.
(via data is plural): "every quotation from 22 novels, plus who speaks each line, who they’re addressing, the characters they mention, and more. With 35,000+ quotations, the corpus 'is by an order of magnitude the largest dataset of annotated quotations for literary texts in English.'"
"a unique catalog of oral traditions spanning approximately 1,000 societies"
"TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning."
"a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics"
allennlp's version of the c4 dataset
legislation in us states
"The extended largest dataset in first-person (egocentric) vision; multi-faceted, audio-visual, non-scripted recordings in native environments - i.e. the wearers' homes, capturing all daily activities in the kitchen over multiple days. Annotations are collected using a novel 'Pause-and-Talk' narration interface."
python interface for the HTRC Extracted Features dataset
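if you'd rather not pull in the python package, the Extracted Features files are plain JSON and can be poked at directly. a minimal stdlib sketch, assuming the page-level `features → pages → body → tokenPosCount` layout (the miniature volume below is made up, and field names may differ between EF schema versions):

```python
# Hypothetical miniature Extracted Features volume, mirroring the
# page-level tokenPosCount structure (real EF files are much larger,
# and the schema may differ between EF versions).
ef_volume = {
    "features": {
        "pages": [
            {"body": {"tokenPosCount": {"the": {"DT": 3}, "whale": {"NN": 2}}}},
            {"body": {"tokenPosCount": {"whale": {"NN": 1}, "sea": {"NN": 4}}}},
        ]
    }
}

def volume_token_counts(volume):
    """Aggregate per-page token counts across all parts of speech."""
    totals = {}
    for page in volume["features"]["pages"]:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            totals[token] = totals.get(token, 0) + sum(pos_counts.values())
    return totals

print(volume_token_counts(ef_volume))
# e.g. {'the': 3, 'whale': 3, 'sea': 4}
```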
"[a] dictionary and graphical data for over 9000 of the most common simplified and traditional Chinese characters. Among other things, this data includes stroke-order vector graphics for all these characters." (via gábor ugray's !!con 2020 talk)
"freely available for research with the condition that the research be used for the benefit of children"
"CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year."
mhagiwara/github-typo-corpus: GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors
"a large-scale dataset of misspellings and grammatical errors along with their corrections harvested from GitHub. It contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date."
"In the movie-oriented CCPE dataset, individuals posing as a user speak into a microphone and the audio is played directly to the person posing as a digital assistant. The “assistant” types out their response, which is in turn played to the user via text-to-speech. [...] The Taskmaster-1 dataset makes use of both the methodology described above as well as a one-person, written technique to increase the corpus size and speaker diversity—about 7.7k written “self-dialog” entries and ~5.5k 2-person, spoken dialogs. For written dialogs, we engaged people to create the full conversation themselves based on scenarios outlined for each task, thereby playing roles of both the user and assistant."
"Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200 contributors producing more than 100 treebanks in over 70 languages."
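UD treebanks ship in the CoNLL-U format: ten tab-separated columns per token, `#` comment lines, a blank line between sentences. a minimal stdlib sketch of reading one sentence (the sample sentence here is invented for illustration):

```python
# Column names per the CoNLL-U format used by Universal Dependencies.
CONLLU_COLS = ["id", "form", "lemma", "upos", "xpos",
               "feats", "head", "deprel", "deps", "misc"]

sample = """\
# text = Dogs bark.
1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_
"""

def parse_conllu_sentence(text):
    """Parse one CoNLL-U sentence into a list of token dicts."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip comment and blank lines
        tokens.append(dict(zip(CONLLU_COLS, line.split("\t"))))
    return tokens

tokens = parse_conllu_sentence(sample)
print([(t["form"], t["upos"], t["deprel"]) for t in tokens])
# [('Dogs', 'NOUN', 'nsubj'), ('bark', 'VERB', 'root'), ('.', 'PUNCT', 'punct')]
```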
"over 110,000 processed, cleaned street network graphs (which in turn comprise over 55 million nodes and over 137 million edges)"
"[A] dataset of UN Security Council debates between January 1995 and December 2017... split into distinct speeches" with metadata on "the speaker, the speaker's nation or affiliation, and the speaker's role in the meeting" and "the topic of the meeting." 65,393 speeches extracted from 4,958 meeting protocols (!). via data is plural