Allison's bookmarks (tagged datasets)

all text in nyc

"a search engine that finds text in New York City's Google Street View images. Search for any word or phrase to see where it appears across the city—in shop signs, graffiti, advertisements, and protest signs"

text poetics data datasets nyc geography

Saved 2025-09-01T19:25:45.877625Z

Places Overview

"100 million commercial points-of-interest (POI) worldwide and is rich with real-world information" some good open source data, but a lot of the actually interesting attributes (hours, tips, descriptions) are only available for $$$

datasets geography

Saved 2024-12-06T21:40:22.161883Z

100STYLE - Ian Mason

"over 4 million frames of motion capture data for 100 different styles of locomotion"

datasets 3d animation walking

Saved 2024-11-26T20:05:35.866916Z

Getty Vocabularies (Getty Research Institute)

"structured resources for the visual arts domain, including art, architecture, decorative arts, other cultural works, archival materials, visual surrogates, and art conservation" tons of fun stuff in here, love me a controlled vocabulary (via data is plural via lynn cherny)

datasets language art

Saved 2024-09-18T16:20:39.469541Z

AIAAIC - AIAAIC Repository

"The independent, open, public interest resource detailing incidents and controversies driven by and relating to AI, algorithms, and automation"

ai datasets journalism

Saved 2024-07-16T21:46:31.916771Z

CLLD Concepticon 3.2.0

"...links concept labels from different conceptlists to concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts..."

linguistics semantics datasets

Saved 2024-07-16T20:55:43.483933Z

dell-research-harvard/AmericanStories · Datasets at Hugging Face

"a collection of full article texts extracted from historical U.S. newspaper images [that] includes nearly 20 million scans from the public domain"

datasets corpora language text history

Saved 2023-09-13T18:51:55.137457Z

Priya22/project-dialogism-novel-corpus: The official repository for the Project Dialogism Novel Corpus, a dataset of annotated quotations in full-length English novels.

(via data is plural): "every quotation from 22 novels, plus who speaks each line, who they’re addressing, the characters they mention, and more. With 35,000+ quotations, the corpus 'is by an order of magnitude the largest dataset of annotated quotations for literary texts in English.'"

data datasets corpora text

Saved 2023-02-01T20:16:10Z

Folklore* | The Quarterly Journal of Economics | Oxford Academic

"a unique catalog of oral traditions spanning approximately 1,000 societies"

folklore datasets data

Saved 2022-12-21T20:37:34Z

TextOCR

"TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning."

text ocr datasets language poetics machinelearning

Saved 2021-05-17T21:47:42Z

100,000 Podcasts: A Spoken English Document Corpus - ACL Anthology

"a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics"

datasets language poetics podcasts audio speech

Saved 2021-03-29T15:48:21Z

Download the C4 dataset! · Discussion #5056 · allenai/allennlp

allennlp's version of the c4 dataset

machinelearning text datasets poetics

Saved 2021-03-22T22:16:57Z

Global Wind Atlas

geography energy wind datasets

Saved 2020-09-30T21:30:39Z

Open States: discover politics in your state

legislation in us states

politics datasets

Saved 2020-09-30T21:29:21Z

Leveraging Machine Learning to Fuel New Discoveries with the arXiv Dataset | arXiv.org blog

text poetics datasets corpora

Saved 2020-08-19T13:45:05Z

COVID-19 Case Surveillance Public Use Data | Data | Centers for Disease Control and Prevention

data coronavirus datasets

Saved 2020-07-15T18:27:17Z

Common Voice

datasets voice audio poetics

Saved 2020-07-01T21:11:23Z

EPIC-KITCHENS Dataset

"The extended largest dataset in first-person (egocentric) vision; multi-faceted, audio-visual, non-scripted recordings in native environments - i.e. the wearers' homes, capturing all daily activities in the kitchen over multiple days. Annotations are collected using a novel 'Pause-and-Talk' narration interface."

datasets cooking cv text corpus

Saved 2020-06-29T16:32:34Z

htrc/htrc-feature-reader: Tools for working with HTRC Feature Extraction files

python interface for the HTRC Extracted Features dataset

python programming text corpora datasets

Saved 2020-06-24T19:27:27Z

Make Me a Hanzi | Free, open-source Chinese character data

"[a] dictionary and graphical data for over 9000 of the most common simplified and traditional Chinese characters. Among other things, this data includes stroke-order vector graphics for all these characters." (via gábor ugray's !!con 2020 talk)

writing text chinese poetics mol datasets

Saved 2020-05-09T17:21:45Z

whipson/PoKi-Poems-by-Kids: PoKi: A Large Dataset of Poems by Children

"freely available for research with the condition that the research be used for the benefit of children"

text datasets corpora poetry poetics

Saved 2020-04-29T15:37:13Z

CCMatrix: A billion-scale bitext data set for training translation models

"CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year."

nlproc translation text poetics datasets

Saved 2020-02-10T22:19:38Z

mhagiwara/github-typo-corpus: GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

"a large-scale dataset of misspellings and grammatical errors along with their corrections harvested from GitHub. It contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date."

datasets language text poetics

Saved 2019-12-11T16:06:06Z

Google AI Blog: Announcing Two New Natural Language Dialog Datasets

"In the movie-oriented CCPE dataset, individuals posing as a user speak into a microphone and the audio is played directly to the person posing as a digital assistant. The “assistant” types out their response, which is in turn played to the user via text-to-speech. [...] The Taskmaster-1 dataset makes use of both the methodology described above as well as a one-person, written technique to increase the corpus size and speaker diversity—about 7.7k written “self-dialog” entries and ~5.5k 2-person, spoken dialogs. For written dialogs, we engaged people to create the full conversation themselves based on scenarios outlined for each task, thereby playing roles of both the user and assistant."

datasets nlproc text conversation poetics

Saved 2019-11-06T14:00:05Z

Universal Dependencies

"Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200 contributors producing more than 100 treebanks in over 70 languages."

datasets nlproc language text poetics

Saved 2019-10-15T14:11:54Z

US Street Network Models and Measures - Geoff Boeing

"over 110,000 processed, cleaned street network graphs (which in turn comprise over 55 million nodes and over 137 million edges)"

datasets network transportation

Saved 2019-09-20T16:11:27Z

ColoredConventions.org

text datasets dh

Saved 2019-09-09T21:52:54Z

The UN Security Council Debates - Harvard Dataverse

"[A] dataset of UN Security Council debates between January 1995 and December 2017... split into distinct speeches" with metadata on "the speaker, the speaker's nation or affiliation, and the speaker's role in the meeting" and "the topic of the meeting." 65,393 speeches extracted from 4,958 meeting protocols (!). via data is plural

text datasets politics

Saved 2019-07-17T21:13:50Z

Google's Natural Questions

nlproc datasets text

Saved 2019-01-28T23:36:40Z