adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
19 Releases
simplemma-1.2.0 (v1.2.0, latest)
📋 What's Changed
- Add support for Esperanto by @axel584 (#167)
- Update setup and docs, modernize code (#168, #169)
- Security: fix arbitrary file write during tarfile extraction in training (#165)
- Maintenance: fix file opening in ``training/download-eval-data.py`` (#164)
- Breaking: rename the ``trie_directory_factory`` module to ``trie_dictionary_factory``
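For code that has to run against releases on both sides of the module rename, an import fallback is one option. The following is a minimal sketch, not part of the release: the helper `import_first_available` is hypothetical, and the `simplemma.strategies.dictionaries` package prefix is an assumption, since the note above only gives the final module names.

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence

# Only the last path component comes from the release note; the
# "simplemma.strategies.dictionaries" prefix is an assumption.
CANDIDATE_MODULES = (
    "simplemma.strategies.dictionaries.trie_dictionary_factory",  # new name
    "simplemma.strategies.dictionaries.trie_directory_factory",   # old name
)


def import_first_available(names: Sequence[str]) -> Optional[ModuleType]:
    """Return the first module in `names` that imports cleanly, else None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError
            continue
    return None


# None if simplemma (or the optional trie dependency) is not installed.
trie_module = import_first_available(CANDIDATE_MODULES)
```

Trying the new name first keeps the fallback path cold on current releases; once older versions no longer need support, the shim can be replaced by a plain import.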
simplemma-1.1.2 (v1.1.2)
📋 Changes
- Fix cyclic import by @juanjoDiaz (#148)
- Fix language detector `proportion_in_each_language` results by @juanjoDiaz (#150)
- Init: use explicit re-exports (#151)
- Fix data written by dictionary pickler by @Dunedan (#156)
- Add demo rules for Latvian and Estonian (#154, #157)
- Remove deprecated langdetect submodule (#160)
- Test: remove dummy pickled data (#161)
- Language data: upgrade pickle to v5 (#162)
simplemma-1.1.1 (v1.1.1)
📋 Changes
- Fix `ModuleNotFoundError` and test optional dependencies (#142)
- Simplify code and add missing type annotations (#144)
simplemma-1.1.0 (v1.1.0)
📋 Changes
- Add a memory-efficient dictionary factory backed by MARISA-tries by @Dunedan in #133
- Drop support for Python 3.6 & 3.7 by @Dunedan in #134
- Update setup files (#138)
simplemma-1.0.0 (v1.0.0)
📋 Changes
- Series of modular classes
- Different lemmatization strategies available
- Customization of dictionary loading and handling (`DictionaryFactory`)
- `LanguageDetector` class with extended options
- See readme and [detailed documentation](https://adbar.github.io/simplemma/)
- The `extensive` argument is now `greedy`
- The `langdetect` submodule is now `language_detector`
- `is_known()` function now restored to its state in v0.9.0 (full dictionary)
- + 5 more
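Callers that still pass the old `extensive` keyword can be bridged to `greedy` with a small adapter while they migrate. This is a sketch under stated assumptions: `call_with_renamed_kwarg` and `fake_lemmatize` are hypothetical names used for illustration and are not part of simplemma; the stand-in function only mimics a lemmatizer that accepts `greedy`.

```python
import warnings
from typing import Any, Callable


def call_with_renamed_kwarg(
    func: Callable[..., Any], old: str, new: str, /, *args: Any, **kwargs: Any
) -> Any:
    """Call `func`, translating the deprecated keyword `old` into `new`."""
    if old in kwargs:
        if new in kwargs:
            raise TypeError(f"pass either {old!r} or {new!r}, not both")
        warnings.warn(
            f"{old!r} is deprecated, use {new!r}", DeprecationWarning, stacklevel=2
        )
        kwargs[new] = kwargs.pop(old)
    return func(*args, **kwargs)


# Hypothetical stand-in for a lemmatizer that accepts `greedy`.
def fake_lemmatize(token: str, lang: str, greedy: bool = False) -> tuple:
    return (token, lang, greedy)


# Old-style call: `extensive=True` is forwarded as `greedy=True`.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = call_with_renamed_kwarg(
        fake_lemmatize, "extensive", "greedy", "tests", lang="en", extensive=True
    )
```

The adapter raises if both spellings are given, so a half-migrated call site fails loudly instead of silently preferring one value.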
simplemma-0.9.1 (v0.9.1)
📋 What's Changed
- smaller language data footprint with smallest possible impact on performance, using a combination of rules, upper limit on word length, and better data cleaning (#31)
- unsupervised approach to affixes activated by default for some languages
- reviewed rules for English and German (less greedy)
- added rules for Dutch, Finnish, Polish and Russian
- improved Russian and Ukrainian language data (#3)
- improved tokenizer
- Full Changelog: https://github.com/adbar/simplemma/compare/v0.9.0...v0.9.1
simplemma-0.9.0 (v0.9.0)
📋 Changes
- smaller data files (especially for fi, la, pl, pt, sk & tr, #19)
- added support for Asturian (``ast``, #20)
- bug fixes (#18, #26)
simplemma-0.8.2 (v0.8.2)
📋 Changes
- languages added: Albanian, Hindi, Icelandic, Malay, Middle English, Northern Sámi, Nynorsk, Serbo-Croatian, Swahili, Tagalog
- fix for the language detection slowdown introduced in 0.7.0
simplemma-0.8.1 (v0.8.1)
📋 Changes
- better rules for English and German
- inconsistencies fixed for cy, de, en, ga, sv (#16)
- docs: added language detection and citation info
simplemma-0.8.0 (v0.8.0)
📋 Changes
- code fully type checked, optional pre-compilation with ``mypyc``
- fixes: logging error (#11), input type (#12)
- code style: [black](https://github.com/psf/black)
simplemma-0.7.0 (v0.7.0)
📋 Changes
- breaking change: language data pre-loading now occurs internally, language codes are now directly provided in ``lemmatize()`` call, e.g. ``simplemma.lemmatize("test", lang="en")``
- faster lemmatization and result cache
- sentence-aware ``text_lemmatizer()``
- optional iterators for tokenization and lemmatization
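The call style introduced by the breaking change (language code passed directly via `lang`) can be sketched as below. The wrapper `lemmatize_tokens` and the `stub_lemmatize` stand-in are hypothetical and only illustrate the calling convention; the real `simplemma.lemmatize` is imported lazily, so the sketch also runs where the package is not installed as long as a callable is supplied.

```python
from typing import Callable, Iterable, List, Optional


def lemmatize_tokens(
    tokens: Iterable[str],
    lang: str = "en",
    lemmatize: Optional[Callable[..., str]] = None,
) -> List[str]:
    """Lemmatize each token, passing the language code on every call."""
    if lemmatize is None:
        # Assumes simplemma 0.7.0 or later is installed.
        import simplemma

        lemmatize = simplemma.lemmatize
    return [lemmatize(token, lang=lang) for token in tokens]


# Hypothetical stand-in with the same call shape; NOT a real lemmatizer.
def stub_lemmatize(token: str, lang: str) -> str:
    return token.lower().rstrip("s")


print(lemmatize_tokens(["Tests", "Cats"], lang="en", lemmatize=stub_lemmatize))
# → ['test', 'cat']
```

No separate data-loading step appears anywhere in the caller, which is the point of the change: the language code is the only thing that has to be threaded through.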
simplemma-0.6.0 (v0.6.0)
📋 Changes
- improved language models
- improved tokenizer
- maintenance and code efficiency
- added basic language detection (undocumented)
simplemma-0.5.0 (v0.5.0)
📋 Changes
- faster, more efficient code
- dropped support for Python 3.5
simplemma-0.4.0 (v0.4.0)
📋 Changes
- new languages: Armenian, Greek, Macedonian, Norwegian (Bokmål), and Polish
- language data reviewed for: Dutch, Finnish, German, Hungarian, Latin, Russian, and Swedish
- Urdu removed from the language list due to issues with the data
- add support for Python 3.10 and drop support for Python 3.4
- improved decomposition and tokenization algorithms
simplemma-0.3.0 (v0.3.0)
📋 Changes
- improved models and disambiguation
- improved tokenization
- extended rules for German
simplemma-0.2.2 (v0.2.2)
📋 Changes
- Work on decomposition rules
- Reviewed language data
- Cleaner code
simplemma-0.2.1 (v0.2.1)
📋 Changes
- Better decomposition into subwords by greedy algorithm
- First benchmarks and data-based corrections: German, French, English, Spanish
simplemma-0.2.0 (v0.2.0)
📋 Changes
- Languages added: Danish, Dutch, Finnish, Georgian, Indonesian, Latin, Latvian, Lithuanian, Luxembourgish, Turkish, Urdu
- Improved word pair coverage
- Tokenization functions added
- Limit greediness and range of potential candidates
simplemma-0.1.0 (v0.1.0)
