adbar/simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

19 Releases
simplemma-1.2.0 (v1.2.0, latest)
adbar · June 1, 2026

📋 What's Changed

  • Add support for Esperanto by @axel584 (#167)
  • Update setup and docs, modernize code (#168, #169)
  • Security: fix arbitrary file write during tarfile extraction in training (#165)
  • Maintenance: fix file opening in ``training/download-eval-data.py`` (#164)
  • Breaking: rename the ``trie_directory_factory`` module to ``trie_dictionary_factory``
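
Code that has to run against releases both before and after the module rename can try the new name first and fall back to the old one. The sketch below is a generic helper, not part of simplemma: `import_first` is a hypothetical name, and the package prefix `simplemma.strategies.dictionaries` is an assumption, since the release note only gives the final module names.

```python
import importlib


def import_first(candidates):
    """Return the first importable module from a list of dotted names."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ModuleNotFoundError:
            # Also raised when the package or an optional dependency is missing.
            continue
    raise ModuleNotFoundError(f"none of {candidates!r} could be imported")


# Assumed dotted paths: only the last component comes from the release note.
TRIE_FACTORY_MODULES = [
    "simplemma.strategies.dictionaries.trie_dictionary_factory",  # name since 1.2.0
    "simplemma.strategies.dictionaries.trie_directory_factory",   # name before 1.2.0
]
```

Calling `import_first(TRIE_FACTORY_MODULES)` returns whichever module is available, or raises `ModuleNotFoundError` if neither can be imported.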
simplemma-1.1.2 (v1.1.2)
adbar · November 19, 2024

📋 Changes

  • Fix cyclic import by @juanjoDiaz (#148)
  • Fix language detector proportion_in_each_language results by @juanjoDiaz (#150)
  • Init: use explicit re-exports (#151)
  • Fix data written by dictionary pickler by @Dunedan (#156)
  • Add demo rules for Latvian and Estonian (#154, #157)
  • Remove deprecated langdetect submodule (#160)
  • Test: remove dummy pickled data (#161)
  • Language data: upgrade pickle to v5 (#162)
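
Pickle protocol 5 was introduced in Python 3.8, which fits with the earlier drop of Python 3.6 and 3.7. As a minimal illustration of what writing with that protocol looks like in the standard library, the sketch below serializes a toy word-to-lemma mapping. The sample data and the helper name `dump_v5` are made up for the example, and it says nothing about how simplemma packages its language files.

```python
import pickle


def dump_v5(obj):
    """Serialize obj with pickle protocol 5 (requires Python 3.8 or later)."""
    return pickle.dumps(obj, protocol=5)


# Toy data for illustration only, not taken from simplemma's dictionaries.
sample = {"houses": "house", "mice": "mouse"}
payload = dump_v5(sample)

# Pickles using protocol 2 or later begin with the PROTO opcode (0x80)
# followed by one byte holding the protocol number.
protocol_used = payload[1]

restored = pickle.loads(payload)
```

Any Python version that supports protocol 5 can read the result with a plain `pickle.loads`; older interpreters cannot.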
simplemma-1.1.1 (v1.1.1)
adbar · August 8, 2024

📋 Changes

  • Fix `ModuleNotFoundError` and test optional dependencies (#142)
  • Simplify code and add missing type annotations (#144)
simplemma-1.1.0 (v1.1.0)
adbar · August 6, 2024

📋 Changes

  • Add a memory-efficient dictionary factory backed by MARISA-tries by @Dunedan in #133
  • Drop support for Python 3.6 & 3.7 by @Dunedan in #134
  • Update setup files (#138)
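
To give a rough idea of why a trie suits dictionary data, the sketch below stores word-to-lemma pairs in a plain dict-based trie so that words with a common prefix share nodes. It is a conceptual illustration only: it is not the MARISA-trie structure, which is a compact static representation, and `LemmaTrie` is a hypothetical class that does not exist in simplemma. A trie built from Python dicts is also not memory-efficient in itself.

```python
_END = ""  # marker key; cannot collide with single-character keys


class LemmaTrie:
    """Toy prefix tree mapping word forms to lemmas."""

    def __init__(self):
        self._root = {}

    def insert(self, word, lemma):
        node = self._root
        for char in word:
            node = node.setdefault(char, {})
        node[_END] = lemma

    def get(self, word, default=None):
        node = self._root
        for char in word:
            node = node.get(char)
            if node is None:
                return default
        return node.get(_END, default)

    def node_count(self):
        """Count trie nodes, including the root, to show prefix sharing."""
        def count(node):
            return 1 + sum(count(child) for key, child in node.items() if key != _END)
        return count(self._root)


trie = LemmaTrie()
for form in ("walking", "walked", "walks"):
    trie.insert(form, "walk")
```

The three forms above contain 18 characters in total but need fewer character nodes, because the prefix `walk` is stored once.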
simplemma-1.0.0 (v1.0.0)
adbar · May 31, 2024

📋 Changes

  • Series of modular classes
  • Different lemmatization strategies available
  • Customization of dictionary loading and handling (`DictionaryFactory`)
  • `LanguageDetector` class with extended options
  • See readme and [detailed documentation](https://adbar.github.io/simplemma/)
  • The `extensive` argument is now `greedy`
  • The `langdetect` submodule is now `language_detector`
  • `is_known()` function now restored to its state in v0.9.0 (full dictionary)
  • + 5 more
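
Callers that still pass the old `extensive` keyword can be migrated gradually with a small wrapper that maps it to `greedy`. The sketch below is one possible approach under stated assumptions: `call_with_greedy` is a hypothetical helper, not a simplemma function, and it takes the lemmatizing callable as an argument so that it does not depend on any particular release.

```python
import warnings


def call_with_greedy(lemmatize, token, lang, **options):
    """Call lemmatize, translating the pre-1.0.0 'extensive' keyword to 'greedy'."""
    if "extensive" in options:
        if "greedy" in options:
            raise TypeError("pass either 'extensive' or 'greedy', not both")
        warnings.warn(
            "'extensive' was renamed to 'greedy' in simplemma 1.0.0",
            DeprecationWarning,
            stacklevel=2,
        )
        options["greedy"] = options.pop("extensive")
    return lemmatize(token, lang=lang, **options)
```

Passing the real lemmatizing function as the first argument keeps old call sites working while the warning points to the places that still need updating.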
simplemma-0.9.1 (v0.9.1)
adbar · January 20, 2023

📋 What's Changed

  • smaller language data footprint with smallest possible impact on performance, using a combination of rules, upper limit on word length, and better data cleaning (#31)
  • unsupervised approach to affixes activated by default for some languages
  • reviewed rules for English and German (less greedy)
  • added rules for Dutch, Finnish, Polish and Russian
  • improved Russian and Ukrainian language data (#3)
  • improved tokenizer
  • Full Changelog: https://github.com/adbar/simplemma/compare/v0.9.0...v0.9.1
simplemma-0.9.0 (v0.9.0)
adbar · October 18, 2022

📋 Changes

  • smaller data files (especially for fi, la, pl, pt, sk & tr, #19)
  • added support for Asturian (``ast``, #20)
  • bug fixes (#18, #26)
simplemma-0.8.2 (v0.8.2)
adbar · September 5, 2022

📋 Changes

  • languages added: Albanian, Hindi, Icelandic, Malay, Middle English, Northern Sámi, Nynorsk, Serbo-Croatian, Swahili, Tagalog
  • fix for slow language detection introduced in 0.7.0
simplemma-0.8.1 (v0.8.1)
adbar · September 1, 2022

📋 Changes

  • better rules for English and German
  • inconsistencies fixed for cy, de, en, ga, sv (#16)
  • docs: added language detection and citation info
simplemma-0.8.0 (v0.8.0)
adbar · August 2, 2022

📋 Changes

  • code fully type checked, optional pre-compilation with ``mypyc``
  • fixes: logging error (#11), input type (#12)
  • code style: [black](https://github.com/psf/black)
simplemma-0.7.0 (v0.7.0)
adbar · June 16, 2022

📋 Changes

  • breaking change: language data pre-loading now occurs internally, language codes are now directly provided in ``lemmatize()`` call, e.g. ``simplemma.lemmatize("test", lang="en")``
  • faster lemmatization and result cache
  • sentence-aware ``text_lemmatizer()``
  • optional iterators for tokenization and lemmatization
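
The call style introduced by this breaking change, with the language code passed directly to ``lemmatize()``, can be wrapped as follows. This is a small sketch: `lemmatize_tokens` is a hypothetical helper, and the fallback branch exists only so the example still runs when simplemma is not installed.

```python
try:
    import simplemma
except ImportError:
    simplemma = None  # keep the sketch runnable without the package


def lemmatize_tokens(tokens, lang="en"):
    """Lemmatize tokens, passing the language code with each call."""
    if simplemma is None:
        # Illustration-only fallback: return the tokens unchanged.
        return list(tokens)
    return [simplemma.lemmatize(token, lang=lang) for token in tokens]


print(lemmatize_tokens(["test"], lang="en"))
```

No separate data-loading step appears in the caller, which is the point of the change described above.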
simplemma-0.6.0 (v0.6.0)
adbar · April 6, 2022

📋 Changes

  • improved language models
  • improved tokenizer
  • maintenance and code efficiency
  • added basic language detection (undocumented)
simplemma-0.5.0 (v0.5.0)
adbar · November 19, 2021

📋 Changes

  • faster, more efficient code
  • dropped support for Python 3.5
simplemma-0.4.0 (v0.4.0)
adbar · October 19, 2021

📋 Changes

  • new languages: Armenian, Greek, Macedonian, Norwegian (Bokmål), and Polish
  • language data reviewed for: Dutch, Finnish, German, Hungarian, Latin, Russian, and Swedish
  • Urdu removed from the language list due to issues with the data
  • add support for Python 3.10 and drop support for Python 3.4
  • improved decomposition and tokenization algorithms
simplemma-0.3.0 (v0.3.0)
adbar · April 8, 2021

📋 Changes

  • improved models and disambiguation
  • improved tokenization
  • extended rules for German
simplemma-0.2.2 (v0.2.2)
adbar · February 24, 2021

📋 Changes

  • Work on decomposition rules
  • Reviewed language data
  • Cleaner code
simplemma-0.2.1 (v0.2.1)
adbar · February 2, 2021

📋 Changes

  • Better decomposition into subwords using a greedy algorithm
  • First benchmarks and data-based corrections: German, French, English, Spanish
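
For readers unfamiliar with greedy decomposition, the sketch below shows the general idea with a longest-known-prefix split over a toy vocabulary. It is a generic textbook-style illustration and makes no claim to match simplemma's actual rules. The function name `greedy_split`, the `min_len` parameter, and the sample vocabulary are all assumptions made for the example.

```python
def greedy_split(word, vocabulary, min_len=3):
    """Split word into known parts, always taking the longest known prefix.

    Returns None when some remainder has no known prefix. Being greedy,
    the function never backtracks, so it can miss splits that a full
    search would find.
    """
    parts = []
    rest = word
    while rest:
        for end in range(len(rest), min_len - 1, -1):
            if rest[:end] in vocabulary:
                parts.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return None  # no known prefix: give up rather than guess
    return parts


# Toy vocabulary for illustration only.
vocabulary = {"haus", "tür", "schlüssel"}
parts = greedy_split("haustürschlüssel", vocabulary)
```

With this vocabulary the compound is split into its three known parts, while a word with an unknown remainder yields `None`.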
simplemma-0.2.0 (v0.2.0)
adbar · January 25, 2021

📋 Changes

  • Languages added: Danish, Dutch, Finnish, Georgian, Indonesian, Latin, Latvian, Lithuanian, Luxembourgish, Turkish, Urdu
  • Improved word pair coverage
  • Tokenization functions added
  • Limit greediness and range of potential candidates
simplemma-0.1.0 (v0.1.0)
adbar · January 18, 2021