Allison's bookmarks (tagged nlproc)

Wikifunctions

"Wikifunctions is a Wikimedia project for everyone to collaboratively create and maintain a library of code functions to support the Wikimedia projects and beyond.... We are currently primarily focused on functions related to Wikidata Lexemes. The Lexicographical data from Wikidata and functions to process it are essential for the goal of an Abstract Wikipedia." lots of interesting implementations of nlproc-related stuff!

programming language nlproc

Saved 2025-07-07T21:05:55.109416Z

umbrella/packages/text-analysis at develop · thi-ng/umbrella

nice little text analysis library for javascript

programming javascript nlproc

Saved 2025-07-07T18:49:27.284916Z

wordfreq/SUNSET.md at master · rspeer/wordfreq

"The field I know as 'natural language processing' is hard to find these days.... It's rare to see NLP research that doesn't have a dependency on closed data controlled by OpenAI and Google, two companies that I already despise. [...] [C]ollecting a whole lot of text in a lot of languages... used to be a pretty reasonable thing to do, and not the kind of thing someone would be likely to object to. Now, the text-slurping tools are mostly used for training generative AI, and people are quite rightly on the defensive. If someone is collecting all the text from your books, articles, Web site, or public posts, it's very likely because they are creating a plagiarism machine that will claim your words as its own." i feel this in my very bones

ai nlproc text

Saved 2024-09-18T14:46:21.798924Z

Nomic Blog

"Open source, open data, open training code, fully reproducible and auditable text embedding model"

text machinelearning ai nlproc

Saved 2024-03-02T21:53:48.831044Z

alphacep/vosk-api: Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

"Vosk is an offline open source speech recognition toolkit. [...] Vosk models are small (50 Mb) but provide continuous large vocabulary transcription, zero-latency response with streaming API, reconfigurable vocabulary and speaker identification." Bindings for various languages, "scales from small devices like Raspberry Pi or Android smartphone to big clusters."

speech nlproc text language poetics

Saved 2021-07-20T05:15:40Z

eyung/Singling: Java application for sonification of linguistic data.

"Application for the sonification of text which can be transformed according to various triggers and parameters to facilitate the learning and analysis of literacoustics, reading by listening."

text poetics sonification nlproc

Saved 2021-06-15T21:08:56Z

Rainbow Zero by Spinfoam Games

"Rainbow Zero is a... toy? widget? thingy? that allows you to explore a part of the space defined by the GloVe word vectors."

games nlproc text poetics wordvectors

Saved 2021-05-28T16:56:45Z

Applied Language Technology - YouTube

spacy 3 tutorials

nlproc python spacy text

Saved 2021-03-01T23:30:10Z

Robustness Gym

"Despite impressive performance on standard benchmarks, deep neural networks often fail when deployed to real-world systems, due to distribution shifts, training artifacts, and noisy data. To address these vulnerabilities, we introduce Robustness Gym: a simple and extensible toolkit for robustness testing that supports the entire spectrum of evaluation methodologies, from adversarial attacks to rule-based data augmentations."

nlproc machinelearning evaluation poetics

Saved 2021-01-18T22:57:24Z

nipunsadvilkar/pySBD: 🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.

"a rule-based sentence boundary detection that works out-of-the-box"

nlproc programming text

Saved 2021-01-16T18:42:36Z

ToxMod — Modulate

"detects toxic, disruptive, or otherwise problematic speech in real-time, and gives you the option to have our software respond immediately, or to escalate to your moderation team"

moderation poetics nlproc chat

Saved 2021-01-08T16:19:56Z

Hateful Memes Challenge winners

"Hate speech can come in many forms, including memes that combine text and images. This kind of multimodal content can be particularly challenging for AI to detect because it requires a holistic understanding of the meme." that is not the reason that hate speech is difficult to detect, and it's actually harmful that you think it's the reason, sorry

language culture machinelearning nlproc hatespeech

Saved 2021-01-03T22:42:05Z

Natalie Schluter on Twitter: "Hey #NLProc and #linguistics people! I'm looking for key examples of government/national cultural institutions (not universities) being successful in the preservation of national/indigenous languages, and maybe even truly ena

language preservation and nlproc

linguistics nlproc culture

Saved 2020-12-18T16:01:55Z

UniMorph: Schema and datasets for universal morphological annotation

"a collaborative effort to improve how NLP handles complex morphology in the world’s languages. The goal of UniMorph is to annotate morphological data in a universal schema that allows an inflected word from any language to be defined by its lexical meaning, typically carried by the lemma, and by a rendering of its inflectional form in terms of a bundle of morphological features from our schema."

nlproc text programming poetics linguistics morphology

Saved 2020-12-01T00:16:02Z
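
note to self on the UniMorph entry above: the "lemma plus bundle of features" idea maps onto a very small data structure. a minimal sketch, assuming the usual tab-separated layout (lemma, inflected form, semicolon-separated feature tags). the `Entry` type and `parse_line` helper are my own names, not part of any UniMorph tooling, and the sample line is made up for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One inflected word: its lemma, surface form, and feature bundle."""
    lemma: str
    form: str
    features: frozenset


def parse_line(line: str) -> Entry:
    """Parse one tab-separated line: lemma, form, features joined by ';'."""
    lemma, form, feats = line.rstrip("\n").split("\t")
    # The feature bundle is unordered, so store it as a set.
    return Entry(lemma, form, frozenset(feats.split(";")))


# Illustrative line, not copied from a real UniMorph file.
entry = parse_line("walk\twalked\tV;PST\n")
print(entry.lemma, entry.form, sorted(entry.features))
```

storing the features as a set makes "find every past-tense form" a simple subset check, e.g. `{"PST"} <= entry.features`.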

on vocal cloning — Are.na

everest's bibliography on text-to-speech and vocal cloning

language nlproc voice speech

Saved 2020-11-11T16:46:03Z

In-browser topic modeling

from david mimno, lda in the browser (may be helpful for workshops?)

dh nlproc teaching

Saved 2020-10-28T15:47:21Z

NLP Course | For You

looks good!

nlproc learning syllabus

Saved 2020-09-28T14:10:36Z

StereoSet

"StereoSet is a dataset that measures stereotype bias in language models. StereoSet consists of 17,000 sentences that measures model preferences across gender, race, religion, and profession."

machinelearning nlproc text poetics culture

Saved 2020-06-17T20:08:01Z

CCMatrix: A billion-scale bitext data set for training translation models

"CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year."

nlproc translation text poetics datasets

Saved 2020-02-10T22:19:38Z

🦄🤝🦄 Encoder-decoders in Transformers: a hybrid pre-trained architecture for seq2seq

this looks promising

machinelearning nlproc text language

Saved 2019-12-10T19:04:02Z

alexwarstadt/blimp: The Benchmark of Linguistic Minimal Pairs

"a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English" it warms my heart to see an ngram baseline in there, haha

language linguistics text nlproc

Saved 2019-12-09T22:04:59Z

Autocat

very cool online text classifier generator (just upload your data and then you can pip install your model!)

nlproc programming language

Saved 2019-12-01T21:54:11Z

Semantic Specialization of Distributional Representation Models

another tutorial from emnlp-19

nlproc language semantics poetics text

Saved 2019-12-01T21:34:55Z

Data Collection and End-to-End Learning for Conversational AI

overview + materials for emnlp-19 workshop

nlproc language poetics chatbots

Saved 2019-12-01T21:32:18Z

Measuring gender imbalances in reporting on the creative industries

language text nlproc dataviz

Saved 2019-11-22T16:43:29Z

Google AI Blog: Announcing Two New Natural Language Dialog Datasets

"In the movie-oriented CCPE dataset, individuals posing as a user speak into a microphone and the audio is played directly to the person posing as a digital assistant. The “assistant” types out their response, which is in turn played to the user via text-to-speech. [...] The Taskmaster-1 dataset makes use of both the methodology described above as well as a one-person, written technique to increase the corpus size and speaker diversity—about 7.7k written “self-dialog” entries and ~5.5k 2-person, spoken dialogs. For written dialogs, we engaged people to create the full conversation themselves based on scenarios outlined for each task, thereby playing roles of both the user and assistant."

datasets nlproc text conversation poetics

Saved 2019-11-06T14:00:05Z

Learn how to make BERT smaller and faster

"ways to make huge models like BERT smaller and faster": quantization, pruning, distillation

machinelearning nlproc text poetics

Saved 2019-10-29T03:19:43Z

Universal Dependencies

"Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 200 contributors producing more than 100 treebanks in over 70 languages."

datasets nlproc language text poetics

Saved 2019-10-15T14:11:54Z

Other Orders

"Recommendation engines like the ones powering the endless feeds on Twitter, Facebook and YouTube, are designed to maximize ad revenue, and therefore to keep you online for as long as possible. In doing so they promote the most reactionary content on their platforms. Yet, these recommendation systems are nothing more than sorting mechanisms. Other Orders provides an alternate set of sorts, optimized for other outcomes."

text poetics programming nlproc

Saved 2019-09-06T19:26:43Z

The Annotated Transformer

nlproc machinelearning text poetics

Saved 2019-07-09T19:21:54Z

BPEmb

"a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing"

nlproc text poetics machinelearning

Saved 2019-06-30T20:07:12Z
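
for reference, a toy version of the byte-pair encoding idea behind the BPEmb entry above. this is a generic sketch of BPE merge learning on a tiny word-frequency table, not BPEmb's own code or API. `learn_bpe` and `segment` are hypothetical helper names, and the word counts are invented.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            key = tuple(merge_pair(list(symbols), best))
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Invented counts, just to have something to merge.
merges = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)
print(segment("lowly", merges))
```

the point of the exercise: unseen words like "lowly" still get split into pieces the model has seen, which is why subword embeddings work as input for neural models.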

trees are harlequins, words are harlequins — the transformer ... “explained”?

"The Transformer is nothing more than an architecture where the core functional unit is attention. You stack attention layers on top of attention layers, just like you would do with CNN or RNN layers."

algorithms ai nlproc text poetics

Saved 2019-06-28T17:42:33Z
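
a rough sketch of the "attention is the core unit, and you stack it" point from the quote above, in plain python. big simplification, labeled loudly: this is self-attention with no learned projections, no multiple heads, no residuals or feed-forward layers, so it shows the shape of the idea and nothing more. `attention` and `stack` are my own illustrative names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: each query returns a weighted average of the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def stack(x, n_layers):
    """Apply self-attention repeatedly, feeding each layer's output to the next."""
    for _ in range(n_layers):
        x = attention(x, x, x)
    return x


tokens = [[1.0, 0.0], [0.0, 1.0]]
print(stack(tokens, n_layers=2))
```

each output row is a convex combination of the input rows, so stacking these bare layers just blends the vectors together; the learned weights a real transformer adds are what make the layers do something useful.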

Tsvetshop: Home

"Yulia Tsvetkov's research group at Language Technologies Institute of Carnegie Mellon University. Our work focuses on natural language processing, particularly cross-lingual approaches, low-resource settings, and social good."

language poetics text machinelearning nlproc

Saved 2019-03-18T18:58:08Z

gpt-2-poetry

kyle mcdonald's take

poetry text poetics nlproc machinelearning gpt2

Saved 2019-03-08T18:14:58Z

Google's Natural Questions

nlproc datasets text

Saved 2019-01-28T23:36:40Z