adbar/simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

19 Releases

Latest: 1w ago

simplemma-1.2.0 (v1.2.0, latest)

adbar·1w ago·June 1, 2026

📋 What's Changed

Add support for Esperanto @axel584 (#167)
Update setup and docs, modernize code (#168, #169)
Security: fix arbitrary file write during tarfile extraction in training (#165)
Maintenance: fix file opening in ``training/download-eval-data.py`` (#164)
Breaking: rename the ``trie_directory_factory`` module to ``trie_dictionary_factory``
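
The security fix in #165 concerns a well-known class of bug: archive members with absolute paths or `..` components can be written outside the intended directory. The following is a minimal sketch of one common guard, not the patch applied in the project; the function name `safe_extract` and the choice to reject all link members are assumptions made for illustration.

```python
import os
import tarfile


def safe_extract(archive_path: str, dest_dir: str) -> None:
    """Extract a tar archive, refusing members that would land outside dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()
        for member in members:
            # Resolve where the member would be written and compare it
            # with the destination root.
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name!r}")
            # Conservative choice for this sketch: reject symlinks and hard links.
            if member.issym() or member.islnk():
                raise ValueError(f"link member rejected: {member.name!r}")
        tar.extractall(dest_root, members=members)
```

On interpreters that ship extraction filters (Python 3.12 and the security releases that backported them), `tar.extractall(dest_dir, filter="data")` applies comparable checks without custom code.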

simplemma-1.1.2 (v1.1.2)

adbar·1y ago·November 19, 2024

📋 Changes

Fix cyclic import by @juanjoDiaz (#148)
Fix language detector proportion_in_each_language results by @juanjoDiaz (#150)
Init: use explicit re-exports (#151)
Fix data written by dictionary pickler by @Dunedan (#156)
Add demo rules for Latvian and Estonian (#154, #157)
Remove deprecated langdetect submodule (#160)
Test: remove dummy pickled data (#161)
Language data: upgrade pickle to v5 (#162)
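
The pickle upgrade in #162 refers to the pickle protocol version, which is chosen when data is written. As a rough illustration, the sketch below writes and reads a small compressed dictionary with protocol 5 (available from Python 3.8). The toy dictionary, the file name `toy.plzma`, and the use of `lzma` compression are assumptions for this example and may not match the project's actual storage layout.

```python
import lzma
import pickle
import tempfile
from pathlib import Path

# Hypothetical toy data standing in for a language dictionary.
toy_dictionary = {b"mice": b"mouse", b"tests": b"test"}

# Hypothetical file location in a temporary directory.
path = Path(tempfile.mkdtemp()) / "toy.plzma"

# Write the dictionary with pickle protocol 5 inside an LZMA stream.
with lzma.open(path, "wb") as handle:
    pickle.dump(toy_dictionary, handle, protocol=5)

# Read it back; pickle detects the protocol from the stream itself.
with lzma.open(path, "rb") as handle:
    restored = pickle.load(handle)
```

Pickled files should only be loaded from trusted sources, since unpickling can execute arbitrary code.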

simplemma-1.1.1 (v1.1.1)

adbar·1y ago·August 8, 2024

📋 Changes

Fix `ModuleNotFoundError` and test optional dependencies (#142)
Simplify code and add missing type annotations (#144)
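
The `ModuleNotFoundError` fix in #142 relates to optional dependencies. One common pattern for keeping an optional package from breaking imports is to load it only when the feature is requested and to raise a clearer error if it is missing. The sketch below shows that general pattern, not the project's code; the helper name `require_optional` is hypothetical.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional module on demand, with a readable error if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as error:
        raise RuntimeError(
            f"{feature} needs the optional package {module_name!r}; "
            "install it to enable this feature"
        ) from error
```

A caller would invoke the helper inside the code path that needs the package, so that importing the top-level library never fails because of a missing extra.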

simplemma-1.1.0 (v1.1.0)

adbar·1y ago·August 6, 2024

📋 Changes

Add a memory-efficient dictionary factory backed by MARISA-tries by @Dunedan (#133)
Drop support for Python 3.6 & 3.7 by @Dunedan (#134)
Update setup files (#138)

simplemma-1.0.0 (v1.0.0)

adbar·2y ago·May 31, 2024

📋 Changes

Series of modular classes
Different lemmatization strategies available
Customization of dictionary loading and handling (`DictionaryFactory`)
`LanguageDetector` class with extended options
See readme and [detailed documentation](https://adbar.github.io/simplemma/)
The `extensive` argument is now `greedy`
The `langdetect` submodule is now `language_detector`
`is_known()` function now restored to its state in v0.9.0 (full dictionary)
+ 5 more

simplemma-0.9.1 (v0.9.1)

adbar·3y ago·January 20, 2023

📋 What's Changed

smaller language data footprint with the smallest possible impact on performance, using a combination of rules, an upper limit on word length, and better data cleaning (#31)
unsupervised approach to affixes activated by default for some languages
reviewed rules for English and German (less greedy)
added rules for Dutch, Finnish, Polish and Russian
improved Russian and Ukrainian language data (#3)
improved tokenizer
Full Changelog: https://github.com/adbar/simplemma/compare/v0.9.0...v0.9.1

simplemma-0.9.0 (v0.9.0)

adbar·3y ago·October 18, 2022

📋 Changes

smaller data files (especially for fi, la, pl, pt, sk & tr, #19)
added support for Asturian (``ast``, #20)
bug fixes (#18, #26)

simplemma-0.8.2 (v0.8.2)

adbar·3y ago·September 5, 2022

📋 Changes

languages added: Albanian, Hindi, Icelandic, Malay, Middle English, Northern Sámi, Nynorsk, Serbo-Croatian, Swahili, Tagalog
fix for slow language detection introduced in 0.7.0

simplemma-0.8.1 (v0.8.1)

adbar·3y ago·September 1, 2022

📋 Changes

better rules for English and German
inconsistencies fixed for cy, de, en, ga, sv (#16)
docs: added language detection and citation info

simplemma-0.8.0 (v0.8.0)

adbar·3y ago·August 2, 2022

📋 Changes

code fully type checked, optional pre-compilation with ``mypyc``
fixes: logging error (#11), input type (#12)
code style: [black](https://github.com/psf/black)

simplemma-0.7.0 (v0.7.0)

adbar·3y ago·June 16, 2022

📋 Changes

breaking change: language data pre-loading now occurs internally, language codes are now directly provided in ``lemmatize()`` call, e.g. ``simplemma.lemmatize("test", lang="en")``
faster lemmatization and result cache
sentence-aware ``text_lemmatizer()``
optional iterators for tokenization and lemmatization
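
To make the ideas of a result cache and iterator-based lemmatization concrete, here is a toy sketch built only on the standard library. It is not simplemma's implementation: the lookup table `TOY_LEMMAS` and the functions `toy_lemmatize` and `toy_lemma_iterator` are hypothetical stand-ins, and the tokenizer is a plain `\w+` regular expression.

```python
import re
from functools import lru_cache
from typing import Iterator

# Hypothetical miniature lookup table standing in for real language data.
TOY_LEMMAS = {"tests": "test", "running": "run", "mice": "mouse"}


@lru_cache(maxsize=1024)
def toy_lemmatize(token: str) -> str:
    """Look up a lemma; repeated tokens are answered from the cache."""
    return TOY_LEMMAS.get(token.lower(), token)


def toy_lemma_iterator(text: str) -> Iterator[str]:
    """Yield lemmas one at a time instead of building a full list."""
    for match in re.finditer(r"\w+", text):
        yield toy_lemmatize(match.group(0))


print(list(toy_lemma_iterator("Mice running tests")))  # ['mouse', 'run', 'test']
```

A generator of this kind lets callers process long texts without holding every token in memory, while the cache avoids repeating lookups for frequent words.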

simplemma-0.6.0 (v0.6.0)

adbar·4y ago·April 6, 2022

📋 Changes

improved language models
improved tokenizer
maintenance and code efficiency
added basic language detection (undocumented)

simplemma-0.5.0 (v0.5.0)

adbar·4y ago·November 19, 2021

📋 Changes

faster, more efficient code
dropped support for Python 3.5

simplemma-0.4.0 (v0.4.0)

adbar·4y ago·October 19, 2021

GitHub

📋 Changes

new languages: Armenian, Greek, Macedonian, Norwegian (Bokmål), and Polish
language data reviewed for: Dutch, Finnish, German, Hungarian, Latin, Russian, and Swedish
Urdu removed from the language list due to issues with the data
add support for Python 3.10 and drop support for Python 3.4
improved decomposition and tokenization algorithms

simplemma-0.3.0 (v0.3.0)

adbar·5y ago·April 8, 2021

📋 Changes

improved models and disambiguation
improved tokenization
extended rules for German

simplemma-0.2.2 (v0.2.2)

adbar·5y ago·February 24, 2021

📋 Changes

Work on decomposition rules
Reviewed language data
Cleaner code

simplemma-0.2.1 (v0.2.1)

adbar·5y ago·February 2, 2021

📋 Changes

Better decomposition into subwords using a greedy algorithm
First benchmarks and data-based corrections: German, French, English, Spanish

simplemma-0.2.0 (v0.2.0)

adbar·5y ago·January 25, 2021

GitHub

📋 Changes

Languages added: Danish, Dutch, Finnish, Georgian, Indonesian, Latin, Latvian, Lithuanian, Luxembourgish, Turkish, Urdu
Improved word pair coverage
Tokenization functions added
Limit greediness and range of potential candidates

simplemma-0.1.0 (v0.1.0)

adbar·5y ago·January 18, 2021
