"a collection of full article texts extracted from historical U.S. newspaper images [that] includes nearly 20 million scans from the public domain"
Priya22/project-dialogism-novel-corpus: The official repository for the Project Dialogism Novel Corpus, a dataset of annotated quotations in full-length English novels.
(via data is plural): "every quotation from 22 novels, plus who speaks each line, who they’re addressing, the characters they mention, and more. With 35,000+ quotations, the corpus 'is by an order of magnitude the largest dataset of annotated quotations for literary texts in English.'"
"a unique catalog of oral traditions spanning approximately 1,000 societies"
"TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning."
"a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics"
allennlp's version of the c4 dataset
legislation in us states
"The extended largest dataset in first-person (egocentric) vision; multi-faceted, audio-visual, non-scripted recordings in native environments - i.e. the wearers' homes, capturing all daily activities in the kitchen over multiple days. Annotations are collected using a novel 'Pause-and-Talk' narration interface."
python interface for the HTRC Extracted Features dataset
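if you'd rather not pull in the python package, the Extracted Features files are plain JSON and can be poked at directly. a minimal stdlib sketch, assuming the page-level `features → pages → body → tokenPosCount` layout (the miniature volume below is made up, and field names may differ between EF schema versions):

```python
# Hypothetical miniature Extracted Features volume, mirroring the
# page-level tokenPosCount structure (real EF files are much larger,
# and the schema may differ between EF versions).
ef_volume = {
    "features": {
        "pages": [
            {"body": {"tokenPosCount": {"the": {"DT": 3}, "whale": {"NN": 2}}}},
            {"body": {"tokenPosCount": {"whale": {"NN": 1}, "sea": {"NN": 4}}}},
        ]
    }
}

def volume_token_counts(volume):
    """Aggregate per-page token counts across all parts of speech."""
    totals = {}
    for page in volume["features"]["pages"]:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            totals[token] = totals.get(token, 0) + sum(pos_counts.values())
    return totals

print(volume_token_counts(ef_volume))
# e.g. {'the': 3, 'whale': 3, 'sea': 4}
```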
"[a] dictionary and graphical data for over 9000 of the most common simplified and traditional Chinese characters. Among other things, this data includes stroke-order vector graphics for all these characters." (via gábor ugray's !!con 2020 talk)
"freely available for research with the condition that the research be used for the benefit of children"
"CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year."
mhagiwara/github-typo-corpus: GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors
"a large-scale dataset of misspellings and grammatical errors along with their corrections harvested from GitHub. It contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date."
"In the movie-oriented CCPE dataset, individuals posing as a user speak into a microphone and the audio is played directly to the person posing as a digital assistant. The “assistant” types out their response, which is in turn played to the user via text-to-speech. [...] The Taskmaster-1 dataset makes use of both the methodology described above as well as a one-person, written technique to increase the corpus size and speaker diversity—about 7.7k written “self-dialog” entries and ~5.5k 2-person, spoken dialogs. For written dialogs, we engaged people to create the full conversation themselves based on scenarios outlined for each task, thereby playing roles of both the user and assistant."
"Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200 contributors producing more than 100 treebanks in over 70 languages."
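UD treebanks ship in the CoNLL-U format: ten tab-separated columns per token, `#` comment lines, a blank line between sentences. a minimal stdlib sketch of reading one sentence (the sample sentence here is invented for illustration):

```python
# Column names per the CoNLL-U format used by Universal Dependencies.
CONLLU_COLS = ["id", "form", "lemma", "upos", "xpos",
               "feats", "head", "deprel", "deps", "misc"]

sample = """\
# text = Dogs bark.
1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_
"""

def parse_conllu_sentence(text):
    """Parse one CoNLL-U sentence into a list of token dicts."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip comment and blank lines
        tokens.append(dict(zip(CONLLU_COLS, line.split("\t"))))
    return tokens

tokens = parse_conllu_sentence(sample)
print([(t["form"], t["upos"], t["deprel"]) for t in tokens])
# [('Dogs', 'NOUN', 'nsubj'), ('bark', 'VERB', 'root'), ('.', 'PUNCT', 'punct')]
```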
"over 110,000 processed, cleaned street network graphs (which in turn comprise over 55 million nodes and over 137 million edges)"
"[A] dataset of UN Security Council debates between January 1995 and December 2017... split into distinct speeches" with metadata on "the speaker, the speaker's nation or affiliation, and the speaker's role in the meeting" and "the topic of the meeting." 65,393 speeches extracted from 4,958 meeting protocols (!). via data is plural