
Simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

From adbar · Updated June 7, 2026

[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms. The project is written primarily in Python, is distributed under the MIT License, and was first published in 2021. Key topics include corpus-tools, language-detection, language-identification, lemmatiser, and lemmatization.

Latest release: v1.2.0 (simplemma-1.2.0), June 1, 2026

Simplemma: a simple multilingual lemmatizer for Python

Reference DOI: 10.5281/zenodo.4673264

<!-- include:intro:start -->

Purpose

Lemmatization is the
process of grouping together the inflected forms of a word so they can
be analysed as a single item, identified by the word's lemma, or
dictionary form. Unlike stemming, lemmatization outputs word units that
are still valid linguistic forms.

In modern natural language processing (NLP), this task is often
indirectly tackled by more complex systems encompassing a whole
processing pipeline. However, it appears that there is no
straightforward way to address lemmatization in Python although this
task can be crucial in fields such as information retrieval and NLP.

Simplemma provides a simple and multilingual approach to look for base
forms or lemmata. It may not be as powerful as full-fledged solutions
but it is generic, easy to install and straightforward to use. In
particular, it does not need morphosyntactic information and can process
a raw series of tokens or even a text with its built-in tokenizer. By
design it should be reasonably fast and work in a large majority of
cases, without being perfect.

With its comparatively small footprint it is especially useful when
speed and simplicity matter, in low-resource contexts, for educational
purposes, or as a baseline system for lemmatization and morphological
analysis.

Currently, 50 languages are partly or fully supported (see the list of supported languages).

Installation

The current library is written in pure Python with no dependencies:
pip install simplemma

  • pip install -U simplemma for updates
  • pip install git+https://github.com/adbar/simplemma for the cutting-edge version

The last version supporting Python 3.6 and 3.7 is simplemma==1.0.0.

Usage

Word-by-word

Simplemma is used by selecting a language of interest and then applying
its data to a single word or to a list of words.

```python
>>> import simplemma
# get a word
>>> myword = 'masks'
# decide which language to use and apply it on a word form
>>> simplemma.lemmatize(myword, lang='en')
'mask'
# apply it on a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> [simplemma.lemmatize(t, lang='de') for t in mytokens]
['hier', 'sein', 'Vaccines']
```

Chaining languages

Chaining several languages can improve coverage; they are used in
sequence:

```python
>>> from simplemma import lemmatize
>>> lemmatize('Vaccines', lang=('de', 'en'))
'vaccine'
>>> lemmatize('spaghettis', lang='it')
'spaghettis'
>>> lemmatize('spaghettis', lang=('it', 'fr'))
'spaghetti'
>>> lemmatize('spaghetti', lang=('it', 'fr'))
'spaghetto'
```
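The sequential behaviour can be pictured with a toy model. The sketch below is not Simplemma's implementation: the table `TOY_DATA` and the helper `chain_lookup` are hypothetical names introduced for illustration only. It assumes that languages are tried in the given order, that the first hit wins, and that an unknown token is returned unchanged.

```python
from typing import Dict, Sequence

# Hypothetical toy data: inflected form -> lemma, one table per language.
TOY_DATA: Dict[str, Dict[str, str]] = {
    "it": {"spaghetti": "spaghetto"},
    "fr": {"spaghettis": "spaghetti"},
}


def chain_lookup(token: str, langs: Sequence[str]) -> str:
    """Try each language in order and return the first lemma found."""
    for lang in langs:
        lemma = TOY_DATA.get(lang, {}).get(token)
        if lemma is not None:
            return lemma
    # No language knows the token: return it unchanged.
    return token


print(chain_lookup("spaghettis", ["it"]))        # spaghettis
print(chain_lookup("spaghettis", ["it", "fr"]))  # spaghetti
print(chain_lookup("spaghetti", ["it", "fr"]))   # spaghetto
```

Under these assumptions the order of the languages matters: a form known to an earlier language is never passed on to a later one.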

Greedier decomposition

For certain languages a greedier decomposition is activated by default
as it can be beneficial, mostly due to a certain capacity to address
affixes in an unsupervised way. This can be triggered manually by
setting the greedy parameter to True.

This option also triggers a stronger reduction through an additional
iteration of the search algorithm, e.g. "angekündigten" →
"angekündigt" (standard) → "ankündigen" (greedy). In some cases it
may be closer to stemming than to lemmatization.

```python
# same example as before, comes to this result in one step
>>> simplemma.lemmatize('spaghettis', lang=('it', 'fr'), greedy=True)
'spaghetto'
# German case described above
>>> simplemma.lemmatize('angekündigten', lang='de', greedy=True)
'ankündigen'  # 2 steps: reduction to infinitive verb
>>> simplemma.lemmatize('angekündigten', lang='de', greedy=False)
'angekündigt'  # 1 step: reduction to past participle
```
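The "additional iteration" can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the library's algorithm: `TOY_DE` is a hypothetical two-entry table built to mirror the German example, and `toy_lookup` simply performs one extra dictionary lookup on its own result when `greedy` is set.

```python
from typing import Dict

# Hypothetical toy table, chosen to mirror the German example above.
TOY_DE: Dict[str, str] = {
    "angekündigten": "angekündigt",
    "angekündigt": "ankündigen",
}


def toy_lookup(token: str, greedy: bool = False) -> str:
    """Single dictionary lookup, plus one extra lookup in greedy mode."""
    candidate = TOY_DE.get(token, token)
    if greedy:
        # Additional iteration: look up the intermediate result once more.
        candidate = TOY_DE.get(candidate, candidate)
    return candidate


print(toy_lookup("angekündigten"))               # angekündigt
print(toy_lookup("angekündigten", greedy=True))  # ankündigen
```

The sketch leaves out the affix handling mentioned above; it only shows why a second pass can reduce a form further than a single lookup does.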

is_known()

The additional function is_known() checks if a given word is present
in the language data:

```python
>>> from simplemma import is_known
>>> is_known('spaghetti', lang='it')
True
```

Tokenization

A simple tokenization function is provided for convenience:

```python
>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']
# for an iterator instead of a list, use the RegexTokenizer directly
>>> from simplemma import RegexTokenizer
>>> RegexTokenizer().split_text('Lorem ipsum dolor sit amet')
<generator object ...>
```

The functions text_lemmatizer() and lemma_iterator() chain
tokenization and lemmatization. They accept the same greedy argument
as lemmatize():

```python
>>> from simplemma import text_lemmatizer
>>> sentence = 'Sou o intervalo entre o que desejo ser e os outros me fizeram.'
>>> text_lemmatizer(sentence, lang='pt')
# caveat: desejo is also a noun, should be desejar here
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']
# same principle, returns a generator and not a list
>>> from simplemma import lemma_iterator
>>> lemma_iterator(sentence, lang='pt')
```
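How tokenization and lemmatization chain together, and how the list and generator variants relate, can be sketched with the standard library alone. Everything here is an assumption made for illustration: the regular expression `TOKEN_RE`, the tiny table `TOY_PT` and both function names are hypothetical and do not reproduce Simplemma's tokenizer or data.

```python
import re
from typing import Dict, Iterator, List

# Assumption: a token is a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Hypothetical toy table for a few Portuguese forms.
TOY_PT: Dict[str, str] = {"os": "o", "outros": "outro", "fizeram": "fazer"}


def toy_lemma_iterator(text: str) -> Iterator[str]:
    """Lazily yield one lemma per token."""
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        yield TOY_PT.get(token.lower(), token)


def toy_text_lemmatizer(text: str) -> List[str]:
    """Same pipeline, materialized as a list."""
    return list(toy_lemma_iterator(text))


print(toy_text_lemmatizer("os outros me fizeram."))
# ['o', 'outro', 'me', 'fazer', '.']
```

The generator form avoids building the whole list in memory, which matters mostly for long texts.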

Caveats

```python
# don't expect too much though
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', lang='it')
'spaghettini'  # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> simplemma.lemmatize('son', lang='es')
'son'  # valid common name, but what about the verb form?
```

As the focus lies on overall coverage, some short frequent words
(typically pronouns and conjunctions) may need post-processing. This
generally concerns a few dozen tokens per language.

The current absence of morphosyntactic information is an advantage in
terms of simplicity. However, it is also an impassable frontier regarding
lemmatization accuracy, for example when it comes to disambiguating
between past participles and adjectives derived from verbs in Germanic
and Romance languages. In most cases, simplemma does not change
such input words.

The greedy algorithm seldom produces invalid forms. It is designed to
work best in the low-frequency range, notably for compound words and
neologisms. Aggressive decomposition is only useful as a general
approach in the case of morphologically-rich languages, where it can
also act as a linguistically motivated stemmer.

Bug reports on the issues page are welcome.

Language detection

Language detection works by providing a text and a tuple lang consisting
of a series of languages of interest. Each score is a proportion between 0
and 1. The proportions are computed independently per language, so a token
recognized in several languages counts towards each of them and the scores
need not sum to 1.

The langdetect() function returns a list of language codes along
with their corresponding scores, appending "unk" for the proportion of
unknown or out-of-vocabulary tokens. The proportion of tokens that belong
to the target language(s) can also be obtained directly with the
in_target_language() function, which returns a single ratio.

```python
# import necessary functions
>>> from simplemma import in_target_language, langdetect
# language detection
>>> langdetect('"Exoplaneta, též extrasolární planeta, je planeta obíhající kolem jiné hvězdy než kolem Slunce."', lang=("cs", "sk"))
[("cs", 0.75), ("sk", 0.125), ("unk", 0.25)]
# proportion of known words
>>> in_target_language("opera post physica posita (τὰ μετὰ τὰ φυσικά)", lang="la")
0.5
```

The greedy argument (extensive in past software versions) triggers
use of the greedier decomposition algorithm described above, thus
extending word coverage and recall of detection at the potential cost of
lower accuracy.
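The scoring described in this section can be sketched as follows. This is a toy model, not the library's code: `TOY_VOCAB` and `toy_langdetect` are hypothetical, and the sketch assumes that each language is scored independently, that "unk" counts tokens known to none of the languages, and that results are sorted by descending score.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical toy vocabularies, one set of known forms per language.
TOY_VOCAB: Dict[str, Set[str]] = {
    "cs": {"je", "planeta", "kolem", "než"},
    "sk": {"je", "planéta"},
}


def toy_langdetect(
    tokens: Sequence[str], langs: Sequence[str]
) -> List[Tuple[str, float]]:
    """Score each language independently, then append the unknown share."""
    total = len(tokens)
    if total == 0:
        # Assumption: an empty input is treated as entirely unknown.
        return [("unk", 1.0)]
    scores = []
    for lang in langs:
        known = sum(1 for t in tokens if t in TOY_VOCAB[lang])
        scores.append((lang, known / total))
    unknown = sum(
        1 for t in tokens if not any(t in TOY_VOCAB[lang] for lang in langs)
    )
    scores.sort(key=lambda item: item[1], reverse=True)
    scores.append(("unk", unknown / total))
    return scores


print(toy_langdetect(["je", "planeta", "kolem", "xyz"], ("cs", "sk")))
# [('cs', 0.75), ('sk', 0.25), ('unk', 0.25)]
```

The token "je" is counted for both languages, which is why the three values add up to 1.25 rather than 1.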

Advanced usage via classes

The functions described above are suitable for simple usage, but you
can have more control by instantiating Simplemma classes and calling
their methods instead. Lemmatization is handled by the Lemmatizer
class, while language detection is handled by the LanguageDetector
class. These in turn rely on different lemmatization strategies, which
are implementations of the LemmatizationStrategy protocol. The
DefaultStrategy implementation uses a combination of different
strategies, one of which is DictionaryLookupStrategy. It looks up
tokens in a dictionary created by a DictionaryFactory.

For example, it is possible to conserve RAM by limiting the number of
cached language dictionaries (default: 8) by creating a custom
DefaultDictionaryFactory with a specific cache_max_size setting,
creating a DefaultStrategy using that factory, and then creating a
Lemmatizer and/or a LanguageDetector using that strategy:

```python
# import necessary classes
>>> from simplemma import LanguageDetector, Lemmatizer
>>> from simplemma.strategies import DefaultStrategy
>>> from simplemma.strategies.dictionaries import DefaultDictionaryFactory

# how many language dictionaries to keep in memory at once (max)
>>> LANG_CACHE_SIZE = 5
>>> dictionary_factory = DefaultDictionaryFactory(cache_max_size=LANG_CACHE_SIZE)
>>> lemmatization_strategy = DefaultStrategy(dictionary_factory=dictionary_factory)

# lemmatize using the above customized strategy
>>> lemmatizer = Lemmatizer(lemmatization_strategy=lemmatization_strategy)
>>> lemmatizer.lemmatize('doughnuts', lang='en')
'doughnut'

# detect languages using the above customized strategy
>>> language_detector = LanguageDetector('la', lemmatization_strategy=lemmatization_strategy)
>>> language_detector.proportion_in_target_languages("opera post physica posita (τὰ μετὰ τὰ φυσικά)")
0.5
```

For more information see the extended documentation.
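The effect of limiting the number of cached dictionaries can be sketched with `functools.lru_cache` from the standard library. Whether Simplemma's factory uses this exact mechanism is not stated here, so treat it as an assumption; `load_dictionary` and `LOAD_COUNT` are hypothetical names, and the "dictionary" is a placeholder.

```python
from functools import lru_cache
from typing import Dict

LOAD_COUNT = 0  # counts simulated (expensive) loads


@lru_cache(maxsize=2)  # keep at most 2 language dictionaries in memory
def load_dictionary(lang: str) -> Dict[str, str]:
    """Pretend to load a language dictionary from disk."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {f"{lang}-form": f"{lang}-lemma"}


# 'en' is served from the cache on its second use; 'de' has been evicted
# by the time it is requested again and must be reloaded.
for lang in ("en", "de", "en", "fr", "de"):
    load_dictionary(lang)

print(LOAD_COUNT)  # 4
```

A smaller cache lowers peak memory, at the price of reloading dictionaries when many languages are used in alternation.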

Reducing memory usage

Simplemma provides an alternative solution for situations where low
memory usage and fast initialization time are more important than
lemmatization and language detection performance. This solution uses a
DictionaryFactory that employs a trie as its underlying data structure,
rather than a Python dict.

The TrieDictionaryFactory reduces memory usage by an average of
20x and initialization time by 100x, but this comes at the cost of
potentially reducing performance by 50% or more, depending on the
specific usage.
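For readers unfamiliar with the data structure, here is a toy character trie that maps word forms to lemmas. It only illustrates the lookup principle (forms sharing a prefix share a path) and does not reproduce the compact storage of the marisa-trie package or its memory savings; the class `ToyTrie` is hypothetical.

```python
from typing import Dict, Optional


class ToyTrie:
    """Minimal character trie mapping word forms to lemmas."""

    def __init__(self) -> None:
        self.children: Dict[str, "ToyTrie"] = {}
        self.lemma: Optional[str] = None

    def insert(self, form: str, lemma: str) -> None:
        node = self
        for char in form:
            # Reuse the existing child for a shared prefix, else create one.
            node = node.children.setdefault(char, ToyTrie())
        node.lemma = lemma

    def get(self, form: str) -> Optional[str]:
        node: Optional["ToyTrie"] = self
        for char in form:
            node = node.children.get(char)
            if node is None:
                return None
        return node.lemma


trie = ToyTrie()
trie.insert("masks", "mask")
trie.insert("masked", "mask")
print(trie.get("masks"))  # mask
print(trie.get("mas"))    # None
```

Each lookup walks one node per character, which gives an intuition for why a trie lookup can be slower than a single hash lookup in a Python dict.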

To use the TrieDictionaryFactory you have to install Simplemma with
the marisa-trie extra dependency (available from version 1.1.0):

```bash
pip install simplemma[marisa-trie]
```

Then you have to create a custom strategy using the
TrieDictionaryFactory and use that for Lemmatizer and
LanguageDetector instances:

```python
>>> from simplemma import LanguageDetector, Lemmatizer
>>> from simplemma.strategies import DefaultStrategy
>>> from simplemma.strategies.dictionaries import TrieDictionaryFactory

>>> lemmatization_strategy = DefaultStrategy(dictionary_factory=TrieDictionaryFactory())

>>> lemmatizer = Lemmatizer(lemmatization_strategy=lemmatization_strategy)
>>> lemmatizer.lemmatize('doughnuts', lang='en')
'doughnut'

>>> language_detector = LanguageDetector('la', lemmatization_strategy=lemmatization_strategy)
>>> language_detector.proportion_in_target_languages("opera post physica posita (τὰ μετὰ τὰ φυσικά)")
0.5
```

While memory usage and initialization time when using the
TrieDictionaryFactory are significantly lower compared to the
DefaultDictionaryFactory, that's only true if the trie dictionaries
are available on disk. That's not the case when using the
TrieDictionaryFactory for the first time, as Simplemma only ships
the dictionaries as Python dicts. The trie dictionaries have to be
generated once from the Python dicts. That happens on-the-fly when
using the TrieDictionaryFactory for the first time for a language and
will take a few seconds and use as much memory as loading the Python
dicts for the language requires. For further invocations the trie
dictionaries get cached on disk.

If the machine that will run Simplemma doesn't have enough memory to
generate the trie dictionaries, they can also be generated on another
computer with the same CPU architecture and copied over to the cache
directory.

<!-- include:intro:end -->

Supported languages

<!-- include:languages:start -->

The following languages are available, identified by their BCP 47
language tag, which typically corresponds to the ISO 639-1 code.
If no such code exists, an ISO 639-3 code is used instead.

Available languages (2026-05-29):

The Forms column counts the inflected word forms stored in the
dictionary, while Lemmata counts the distinct base forms they map to
(both in thousands). A large gap between the two reflects rich
morphology rather than a data error.

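| Code | Language | Forms (10³) | Lemm. (10³) | Acc. | Comments |
| ---- | -------- | ----------- | ----------- | ---- | -------- |
| ast | Asturian | 154 | 36 | | |
| bg | Bulgarian | 215 | 18 | | |
| ca | Catalan | 640 | 63 | | |
| cs | Czech | 200 | 26 | 0.89 | on UD CS-PDT |
| cy | Welsh | 363 | 14 | | |
| da | Danish | 555 | 81 | 0.92 | on UD DA-DDT, alternative: lemmy |
| de | German | 730 | 246 | 0.95 | on UD DE-GSD, see also German-NLP list |
| el | Greek | 185 | 21 | 0.88 | on UD EL-GDT |
| en | English | 139 | 50 | 0.94 | on UD EN-GUM, alternative: LemmInflect |
| enm | Middle English | 43 | 6 | | |
| eo | Esperanto | 191 | 18 | | |
| es | Spanish | 666 | 72 | 0.95 | on UD ES-GSD |
| et | Estonian | 141 | 34 | | low coverage |
| fa | Persian | 13 | 4 | | experimental |
| fi | Finnish | 3,549 | 124 | | see this benchmark |
| fr | French | 248 | 37 | 0.94 | on UD FR-GSD |
| ga | Irish | 399 | 46 | | |
| gd | Gaelic | 59 | 12 | | |
| gl | Galician | 426 | 43 | | |
| gv | Manx | 76 | 13 | | |
| hbs | Serbo-Croatian | 674 | 52 | | Croatian and Serbian lists to be added later |
| hi | Hindi | 58 | 11 | | experimental |
| hu | Hungarian | 492 | 36 | | |
| hy | Armenian | 247 | 7 | | |
| id | Indonesian | 21 | 4 | 0.91 | on UD ID-CSUI |
| is | Icelandic | 177 | 15 | | |
| it | Italian | 357 | 28 | 0.93 | on UD IT-ISDT |
| ka | Georgian | 66 | 4 | | |
| la | Latin | 892 | 52 | | |
| lb | Luxembourgish | 306 | 79 | | |
| lt | Lithuanian | 268 | 25 | | |
| lv | Latvian | 166 | 14 | | |
| mk | Macedonian | 67 | 16 | | |
| ms | Malay | 18 | 4 | | |
| nb | Norwegian (Bokmål) | 618 | 134 | | |
| nl | Dutch | 366 | 124 | 0.92 | on UD-NL-Alpino |
| nn | Norwegian (Nynorsk) | 68 | 18 | | |
| pl | Polish | 3,670 | 264 | 0.91 | on UD-PL-PDB |
| pt | Portuguese | 924 | 94 | 0.92 | on UD-PT-GSD |
| ro | Romanian | 342 | 36 | | |
| ru | Russian | 633 | 54 | | alternative: pymorphy2 |
| se | Northern Sámi | 115 | 7 | | |
| sk | Slovak | 889 | 71 | 0.92 | on UD SK-SNK |
| sl | Slovene | 165 | 30 | | |
| sq | Albanian | 38 | 5 | | |
| sv | Swedish | 745 | 93 | | alternative: lemmy |
| sw | Swahili | 4,870 | 4 | | experimental |
| tl | Tagalog | 39 | 8 | | experimental |
| tr | Turkish | 1,236 | 40 | 0.89 | on UD-TR-Boun |
| uk | Ukrainian | 388 | 22 | | alternative: pymorphy2 |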

Languages marked as having low coverage may be better suited to
language-specific libraries, but Simplemma can still provide limited
functionality. Where possible, open-source Python alternatives are
referenced.

Experimental mentions indicate that the language remains untested or
that there could be issues with the underlying data or lemmatization
process.

The scores are calculated on Universal Dependencies treebanks on
single-word tokens (including some contractions but not merged
prepositions). They describe to what extent simplemma can accurately map
tokens to their lemma form. See the training/ folder of the code
repository for more information.
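As an illustration of what such a score measures, the sketch below computes the share of tokens whose predicted lemma matches a gold lemma. It is not the evaluation script from the repository: `lemma_accuracy`, `TOY_TABLE` and the four sample pairs are hypothetical, and details such as case handling are assumptions.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical toy lemmatizer data.
TOY_TABLE: Dict[str, str] = {"masks": "mask", "went": "go"}


def toy_lemmatize(token: str) -> str:
    """Dictionary lookup that returns unknown tokens unchanged."""
    return TOY_TABLE.get(token, token)


def lemma_accuracy(
    pairs: Iterable[Tuple[str, str]], lemmatize_fn: Callable[[str], str]
) -> float:
    """Share of (token, gold lemma) pairs predicted exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no evaluation pairs given")
    correct = sum(1 for token, gold in pairs if lemmatize_fn(token) == gold)
    return correct / len(pairs)


gold_pairs = [("masks", "mask"), ("went", "go"), ("mice", "mouse"), ("cat", "cat")]
print(lemma_accuracy(gold_pairs, toy_lemmatize))  # 0.75
```

Tokens that are already in base form, such as "cat" here, count as correct even though no dictionary entry was used.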

This library is particularly relevant as regards the lemmatization of
less frequent words. Its performance in this case is only incidentally
captured by the benchmark above. In some languages, a fixed number of
words such as pronouns can be further mapped by hand to enhance
performance.

<!-- include:languages:end -->

Speed

The following orders of magnitude are provided for reference only and
were measured on an old laptop to establish a lower bound:

  • Tokenization: > 1 million tokens/sec
  • Lemmatization: > 250,000 words/sec

Using the most recent Python version (e.g. with pyenv) can make the
package run faster.
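To obtain comparable figures on your own machine, a throughput measurement can be sketched with `time.perf_counter`. The helper `words_per_second` is hypothetical, and `str.lower` merely stands in for the function under test; replace it with the call you actually want to time.

```python
import time
from typing import Callable, Sequence


def words_per_second(func: Callable[[str], str], tokens: Sequence[str]) -> float:
    """Apply func to every token and return the observed throughput."""
    start = time.perf_counter()
    for token in tokens:
        func(token)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(tokens) / elapsed if elapsed > 0 else float("inf")


sample = ["Lorem", "ipsum", "dolor", "sit", "amet"] * 20_000
print(f"{words_per_second(str.lower, sample):,.0f} words/sec")
```

Results vary with hardware, Python version and input, so several runs on representative text give a more reliable picture than a single measurement.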

Roadmap

  • Add further lemmatization lists
  • Grammatical categories as option
  • Function as a meta-package?
  • Integrate optional, more complex models?

Credits and licenses

<!-- include:credits:start -->

The software is licensed under the MIT license. For information on the
licenses of the linguistic information databases, see the licenses folder.

The surface lookups (non-greedy mode) rely on lemmatization lists derived
from the following sources, listed in order of relative importance:

<!-- include:credits:end -->

Contributions

<!-- include:contributions:start -->

This package was first created and published by Adrien Barbaresi.
It then benefited from extensive refactoring by Juanjo Diaz (especially the new classes).
See the full list of contributors to the repository.

Feel free to contribute, notably by filing issues for feedback, bug
reports, or links to further lemmatization lists, rules and tests.

Contributions by pull request should follow these conventions: code
style with black, type hinting with mypy, and included tests with
pytest.

<!-- include:contributions:end -->

Other solutions

See lists: German-NLP and other awesome-NLP lists.

For another approach in Python see spaCy's edit tree lemmatizer.

References

To cite this software:


Barbaresi A. (year). Simplemma: a simple multilingual lemmatizer for
Python [Computer software] (Version version number). Available from
https://github.com/adbar/simplemma DOI: 10.5281/zenodo.4673264

This work draws from lexical analysis algorithms used in:


This article is auto-generated from adbar/simplemma via the GitHub API. Last fetched: 6/14/2026