"structured resources for the visual arts domain, including art, architecture, decorative arts, other cultural works, archival materials, visual surrogates, and art conservation" tons of fun stuff in here, love me a controlled vocabulary (via data is plural via lynn cherny)
"...links concept labels from different conceptlists to concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts..."
(via data is plural): "every quotation from 22 novels, plus who speaks each line, who they’re addressing, the characters they mention, and more. With 35,000+ quotations, the corpus 'is by an order of magnitude the largest dataset of annotated quotations for literary texts in English.'"
"TextOCR provides ~1M high quality word annotations on TextVQA images allowing application of end-to-end reasoning on downstream tasks such as visual question answering or image captioning."
"a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics"
"The extended largest dataset in first-person (egocentric) vision; multi-faceted, audio-visual, non-scripted recordings in native environments - i.e. the wearers' homes, capturing all daily activities in the kitchen over multiple days. Annotations are collected using a novel 'Pause-and-Talk' narration interface."
"[a] dictionary and graphical data for over 9000 of the most common simplified and traditional Chinese characters. Among other things, this data includes stroke-order vector graphics for all these characters." (via gábor ugray's !!con 2020 talk)
"CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year."
"a large-scale dataset of misspellings and grammatical errors along with their corrections harvested from GitHub. It contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date."
"In the movie-oriented CCPE dataset, individuals posing as a user speak into a microphone and the audio is played directly to the person posing as a digital assistant. The “assistant” types out their response, which is in turn played to the user via text-to-speech. [...] The Taskmaster-1 dataset makes use of both the methodology described above as well as a one-person, written technique to increase the corpus size and speaker diversity—about 7.7k written “self-dialog” entries and ~5.5k 2-person, spoken dialogs. For written dialogs, we engaged people to create the full conversation themselves based on scenarios outlined for each task, thereby playing roles of both the user and assistant."
"Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200 contributors producing more than 100 treebanks in over 70 languages."
"[A] dataset of UN Security Council debates between January 1995 and December 2017... split into distinct speeches" with metadata on "the speaker, the speaker's nation or affiliation, and the speaker's role in the meeting" and "the topic of the meeting." 65393 speeches extracted from 4958 meeting protocols (!). via data is plural