JamSpell
Modern spell checking library - accurate, fast, multi-language
[![Build Status][travis-image]][travis] [![Release][release-image]][releases] The project is written primarily in C++, distributed under the MIT License, and was first published in 2017. Key topics include: cpp, csharp, java, ngrams, nlp.
JamSpell is a spell checking library with the following features:
- accurate - it considers a word's surroundings (context) for better correction
- fast - about 5K words per second
- multi-language - it's written in C++ and available for many languages through SWIG bindings
JamSpellPro
jamspell.com - check out a new JamSpell version with the following features:
- Improved accuracy (catboost gradient boosted decision trees candidates ranking model)
- Splits merged words
- Pre-trained models for many languages (small, medium, large) for: en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no
- Ability to add words / sentences at runtime
- Fine-tuning / additional training
- Memory optimization for training large models
- Static dictionary support
- Built-in Java, C#, Ruby support
- Windows support
Benchmarks
<table>
<tr> <td></td> <td>Errors</td> <td>Top 7 Errors</td> <td>Fix Rate</td> <td>Top 7 Fix Rate</td> <td>Broken</td> <td>Speed<br> (words/second)</td> </tr>
<tr> <td>JamSpell</td> <td>3.25%</td> <td>1.27%</td> <td>79.53%</td> <td>84.10%</td> <td>0.64%</td> <td>4854</td> </tr>
<tr> <td>Norvig</td> <td>7.62%</td> <td>5.00%</td> <td>46.58%</td> <td>66.51%</td> <td>0.69%</td> <td>395</td> </tr>
<tr> <td>Hunspell</td> <td>13.10%</td> <td>10.33%</td> <td>47.52%</td> <td>68.56%</td> <td>7.14%</td> <td>163</td> </tr>
<tr> <td>Dummy</td> <td>13.14%</td> <td>13.14%</td> <td>0.00%</td> <td>0.00%</td> <td>0.00%</td> <td>-</td> </tr>
</table>

The model was trained on 300K Wikipedia sentences + 300K news sentences (English). 95% of the data was used for training and 5% for evaluation. An error model was used to generate text with errors from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).
We used the following metrics:
- Errors - percentage of words that still contain errors after the spell checker has processed the text
- Top 7 Errors - percentage of words whose correct form is missing from the top 7 candidates
- Fix Rate - percentage of errored words fixed by the spell checker
- Top 7 Fix Rate - percentage of errored words fixed by one of the top 7 candidates
- Broken - percentage of non-errored words broken by the spell checker
- Speed - number of words processed per second
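The Errors, Fix Rate and Broken definitions above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the project's evaluation script: the function name `spelling_metrics` and the assumption that the three word lists are aligned one-to-one are ours.

```python
def spelling_metrics(original, errored, corrected):
    """Return (errors, fix_rate, broken) as percentages for aligned word lists.

    Assumption: the three lists have the same length and position i in each
    list refers to the same word.
    """
    if not (len(original) == len(errored) == len(corrected)):
        raise ValueError("word lists must be aligned and of equal length")

    total = len(original)
    still_wrong = had_error = fixed = clean = broken = 0
    for orig, err, corr in zip(original, errored, corrected):
        if corr != orig:
            still_wrong += 1      # counts towards Errors
        if err != orig:
            had_error += 1        # the word was errored before correction
            if corr == orig:
                fixed += 1        # counts towards Fix Rate
        else:
            clean += 1            # the word was fine before correction
            if corr != orig:
                broken += 1       # counts towards Broken

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return pct(still_wrong, total), pct(fixed, had_error), pct(broken, clean)


# Toy data: two errored words, one of them fixed, and one clean word broken.
original = ["i", "am", "the", "best", "spell", "checker"]
errored = ["i", "am", "the", "begt", "spell", "cherken"]
corrected = ["i", "an", "the", "best", "spell", "chicken"]
errors, fix_rate, broken = spelling_metrics(original, errored, corrected)
```

As a consistency check on the first table: with 13.14% of words errored (the Dummy row), a fix rate of 79.53% and 0.64% of clean words broken, the expected error rate is 13.14% × (1 - 0.7953) + 86.86% × 0.0064 ≈ 3.25%, which matches the JamSpell row.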
To check that our model is not overfitted to Wikipedia + news, we also evaluated it on the text of "The Adventures of Sherlock Holmes":
<table>
<tr> <td></td> <td>Errors</td> <td>Top 7 Errors</td> <td>Fix Rate</td> <td>Top 7 Fix Rate</td> <td>Broken</td> <td>Speed (words per second)</td> </tr>
<tr> <td>JamSpell</td> <td>3.56%</td> <td>1.27%</td> <td>72.03%</td> <td>79.73%</td> <td>0.50%</td> <td>5524</td> </tr>
<tr> <td>Norvig</td> <td>7.60%</td> <td>5.30%</td> <td>35.43%</td> <td>56.06%</td> <td>0.45%</td> <td>647</td> </tr>
<tr> <td>Hunspell</td> <td>9.36%</td> <td>6.44%</td> <td>39.61%</td> <td>65.77%</td> <td>2.95%</td> <td>284</td> </tr>
<tr> <td>Dummy</td> <td>11.16%</td> <td>11.16%</td> <td>0.00%</td> <td>0.00%</td> <td>0.00%</td> <td>-</td> </tr>
</table>

More details on reproducing these results are available in the "Train" section.
Usage
Python
- Install `swig3` (usually it is in your distro package manager)
- Install `jamspell`:

```bash
pip install jamspell
```

- Use it:

```python
import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
```
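In the calls above, the second argument of `GetCandidates` is the position of the word to correct within the word list. A minimal sketch of scanning a whole sentence this way is shown below. The helper `suspicious_words` and the `StubCorrector` class are hypothetical and exist only so the example runs without a model file. The sketch also assumes that the first candidate is the corrector's preferred choice and that a correctly spelled word comes back as its own first candidate.

```python
def suspicious_words(corrector, words, max_candidates=3):
    """Return (index, word, candidates) for words whose top candidate differs.

    `corrector` is any object with a GetCandidates(words, index) method,
    such as a loaded jamspell.TSpellCorrector.
    """
    findings = []
    for index, word in enumerate(words):
        candidates = list(corrector.GetCandidates(words, index))
        if candidates and candidates[0] != word:
            findings.append((index, word, candidates[:max_candidates]))
    return findings


class StubCorrector:
    """Hypothetical stand-in for jamspell.TSpellCorrector (illustration only)."""

    _known = {
        "begt": ("best", "beat", "belt", "bet", "bent"),
        "cherken": ("checker", "chicken", "checked", "wherein", "coherent"),
    }

    def GetCandidates(self, words, index):
        word = words[index]
        # Assumption: unknown-to-the-stub words are treated as correct.
        return self._known.get(word, (word,))


words = ["i", "am", "the", "begt", "spell", "cherken"]
findings = suspicious_words(StubCorrector(), words)
```

With a real model, the same helper could be called with the loaded corrector instead of the stub, for example `suspicious_words(corrector, words)`.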
C++
- Add the `jamspell` and `contrib` dirs to your project
- Use it:

```cpp
#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {
    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // Candidates for the word at position 3 ("begt")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    // Candidates for the word at position 5 ("cherken")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...

    return 0;
}
```
Other languages
You can generate extensions for other languages by following the SWIG tutorial. The SWIG interface file is `jamspell.i`. Pull requests with build scripts are welcome.
HTTP API
- Install `cmake`
- Clone and build jamspell (it includes an HTTP server):

```bash
git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
```

- Run the HTTP server:

```bash
./web_server/web_server en.bin localhost 8080
```

- GET request example:

```bash
$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker
```

- POST request example:

```bash
$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
```

- Candidates example:

```bash
curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates
```

```javascript
{
    "results": [
        {
            "candidates": [
                "best",
                "beat",
                "belt",
                "bet",
                "bent",
                "beet",
                "beit"
            ],
            "len": 4,
            "pos_from": 9
        },
        {
            "candidates": [
                "checker",
                "chicken",
                "checked",
                "wherein",
                "coherent",
                "cheered",
                "cherokee"
            ],
            "len": 7,
            "pos_from": 20
        }
    ]
}
```
Here `pos_from` is the position of the first letter of the misspelled word and `len` is the length of the misspelled word.
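A client might use `pos_from` and `len` to patch the original text with the first candidate for each misspelled word. The sketch below is an illustration under stated assumptions, not part of JamSpell: the helper names `build_url` and `apply_top_candidates` are ours, the offsets are assumed to be zero-based character positions (which matches the sample response), and the response is parsed from a string so that no running server is needed.

```python
import json
from urllib.parse import quote


def build_url(endpoint, text, host="localhost", port=8080):
    """Build a GET URL for the fix or candidates endpoint, percent-encoding the text."""
    return "http://{}:{}/{}?text={}".format(host, port, endpoint, quote(text))


def apply_top_candidates(text, response_body):
    """Replace each reported span in `text` with its first candidate."""
    results = json.loads(response_body)["results"]
    # Work right to left so earlier offsets stay valid after each replacement.
    for item in sorted(results, key=lambda r: r["pos_from"], reverse=True):
        if not item["candidates"]:
            continue
        start = item["pos_from"]
        end = start + item["len"]
        text = text[:start] + item["candidates"][0] + text[end:]
    return text


text = "I am the begt spell cherken"
# Shortened copy of the sample response shown above.
sample_response = json.dumps({
    "results": [
        {"candidates": ["best", "beat", "belt"], "len": 4, "pos_from": 9},
        {"candidates": ["checker", "chicken", "checked"], "len": 7, "pos_from": 20},
    ]
})

url = build_url("candidates", text)
fixed = apply_top_candidates(text, sample_response)
```

Applying the replacements from the end of the string backwards avoids having to adjust `pos_from` when a candidate is longer or shorter than the word it replaces.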
Train
To train a custom model:
- Install `cmake`
- Clone and build jamspell:

```bash
git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
```

- Prepare a UTF-8 text file with sentences to train on (e.g. `sherlockholmes.txt`) and another file with the language alphabet (e.g. `alphabet_en.txt`)
- Train the model:

```bash
./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
```

- To evaluate the spell checker you can use the `evaluate/evaluate.py` script:

```bash
python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt
```

- You can use `evaluate/generate_dataset.py` to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format and fb2 books.
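If you prepare the data yourself, a 95% / 5% split like the one described in the Benchmarks section could look roughly like the sketch below. It does not reproduce what `evaluate/generate_dataset.py` does. The helper names `split_sentences` and `train_command`, the fixed seed and the file name `train.txt` are assumptions made for illustration. The command list only mirrors the train invocation shown above.

```python
import random


def split_sentences(sentences, eval_fraction=0.05, seed=0):
    """Shuffle a copy of the sentences and return (train, evaluation) lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    eval_size = int(round(len(shuffled) * eval_fraction))
    return shuffled[eval_size:], shuffled[:eval_size]


def train_command(alphabet_path, corpus_path, model_path, binary="./main/jamspell"):
    """Return the argument list for the train step shown above."""
    return [binary, "train", alphabet_path, corpus_path, model_path]


# Toy corpus: 200 placeholder sentences.
sentences = ["sentence number {}".format(i) for i in range(200)]
train, evaluation = split_sentences(sentences)

# "train.txt" is a hypothetical file holding the training sentences.
command = train_command("../test_data/alphabet_en.txt", "train.txt", "model.bin")
```

After writing the training sentences to a file, the list could be passed to `subprocess.run(command, check=True)` from the build directory.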
Download models
Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.
