Allison's bookmarks (tagged datasets)

Most recent

Priya22/project-dialogism-novel-corpus: The official repository for the Project Dialogism Novel Corpus, a dataset of annotated quotations in full-length English novels.

(via data is plural): "every quotation from 22 novels, plus who speaks each line, who they’re addressing, the characters they mention, and more. With 35,000+ quotations, the corpus 'is by an order of magnitude the largest dataset of annotated quotations for literary texts in English.'"

data datasets corpora text

TextOCR

"TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning."

text ocr datasets language poetics machinelearning

EPIC-KITCHENS Dataset

"The extended largest dataset in first-person (egocentric) vision; multi-faceted, audio-visual, non-scripted recordings in native environments - i.e. the wearers' homes, capturing all daily activities in the kitchen over multiple days. Annotations are collected using a novel 'Pause-and-Talk' narration interface."

datasets cooking cv text corpus

Google AI Blog: Announcing Two New Natural Language Dialog Datasets

"In the movie-oriented CCPE dataset, individuals posing as a user speak into a microphone and the audio is played directly to the person posing as a digital assistant. The “assistant” types out their response, which is in turn played to the user via text-to-speech. [...] The Taskmaster-1 dataset makes use of both the methodology described above as well as a one-person, written technique to increase the corpus size and speaker diversity—about 7.7k written “self-dialog” entries and ~5.5k 2-person, spoken dialogs. For written dialogs, we engaged people to create the full conversation themselves based on scenarios outlined for each task, thereby playing roles of both the user and assistant."

datasets nlproc text conversation poetics

Universal Dependencies

"Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200 contributors producing more than 100 treebanks in over 70 languages."

datasets nlproc language text poetics

The UN Security Council Debates - Harvard Dataverse

"[A] dataset of UN Security Council debates between January 1995 and December 2017... split into distinct speeches" with metadata on "the speaker, the speaker's nation or affiliation, and the speaker's role in the meeting" and "the topic of the meeting." 65393 speeches extracted from 4958 meeting protocols (!). via data is plural

text datasets politics