Repositories tagged with "corpus-tools"
trafilatura
adbar
“Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML”
Wordless
BLKSerene
“An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation”
fundus
flairNLP
“A very simple news crawler with a funny name”
bitextor
“Bitextor generates translation memories from multilingual websites”
ua-gec
grammarly
“UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language”
MinerU-HTML
opendatalab
“MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.”
simplemma
“Simple multilingual lemmatizer for Python, especially useful for speed and efficiency”
audiomate
ynop
“Python library for handling audio datasets.”
OpusFilter
Helsinki-NLP
“OpusFilter - Parallel corpus processing toolkit”
kontext
czcorpus
“An advanced, extensible web front-end for the Manatee-open corpus search engine”
Switchboard-Corpus
NathanDuran
“Utilities for Processing the Switchboard Dialogue Act Corpus”
beta
koskenni
“An open source reimplementation of Benny Brodda's BETA in Python”
spect
lennes
“SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/”
ms3
johentsch
“A parser for annotated MuseScore 3 files.”
align-linguistic-alignment
nickduran
“Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.”