Allison's bookmarks (tagged datasets)

Places Overview

"100 million commercial points-of-interest (POI) worldwide and is rich with real-world information" some good open source data, but a lot of the actually interesting attributes (hours, tips, descriptions) are only available for $$$

datasets geography

Getty Vocabularies (Getty Research Institute)

"structured resources for the visual arts domain, including art, architecture, decorative arts, other cultural works, archival materials, visual surrogates, and art conservation" tons of fun stuff in here, love me a controlled vocabulary (via data is plural via lynn cherny)

datasets language art

AIAAIC - AIAAIC Repository

"The independent, open, public interest resource detailing incidents and controversies driven by and relating to AI, algorithms, and automation"

ai datasets journalism

CLLD Concepticon 3.2.0

"...links concept labels from different conceptlists to concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts..."

linguistics semantics datasets

Priya22/project-dialogism-novel-corpus: The official repository for the Project Dialogism Novel Corpus, a dataset of annotated quotations in full-length English novels.

(via data is plural): "every quotation from 22 novels, plus who speaks each line, who they’re addressing, the characters they mention, and more. With 35,000+ quotations, the corpus 'is by an order of magnitude the largest dataset of annotated quotations for literary texts in English.'"

data datasets corpora text

TextOCR

"TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning."

text ocr datasets language poetics machinelearning

EPIC-KITCHENS Dataset

"The extended largest dataset in first-person (egocentric) vision; multi-faceted, audio-visual, non-scripted recordings in native environments - i.e. the wearers' homes, capturing all daily activities in the kitchen over multiple days. Annotations are collected using a novel 'Pause-and-Talk' narration interface."

datasets cooking cv text corpus

Google AI Blog: Announcing Two New Natural Language Dialog Datasets

"In the movie-oriented CCPE dataset, individuals posing as a user speak into a microphone and the audio is played directly to the person posing as a digital assistant. The “assistant” types out their response, which is in turn played to the user via text-to-speech. [...] The Taskmaster-1 dataset makes use of both the methodology described above as well as a one-person, written technique to increase the corpus size and speaker diversity—about 7.7k written “self-dialog” entries and ~5.5k 2-person, spoken dialogs. For written dialogs, we engaged people to create the full conversation themselves based on scenarios outlined for each task, thereby playing roles of both the user and assistant."

datasets nlproc text conversation poetics

Universal Dependencies

"Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200 contributors producing more than 100 treebanks in over 70 languages."

datasets nlproc language text poetics

The UN Security Council Debates - Harvard Dataverse

"[A] dataset of UN Security Council debates between January 1995 and December 2017... split into distinct speeches" with metadata on "the speaker, the speaker's nation or affiliation, and the speaker's role in the meeting" and "the topic of the meeting." 65,393 speeches extracted from 4,958 meeting protocols (!). (via data is plural)

text datasets politics