CAMeL-Lab/camel_tools
A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.
📋 Changes
- Supported Python versions are 3.11-3.14.
- Updated dependencies.
- Updated documentation.
- Bug fixes for `BERTUnfactoredDisambiguator`.
- Deprecated `pretrained_cache` in `BERTUnfactoredDisambiguator`.
- Speed improvements for `simple_word_tokenize`.
Fixed a couple of dependency issues.
Fixed errors in `camel_morphology` and `camel_diac`.
Added Python 3.12 support.
Fixed doc generation issues.
Added Python 3.11 support.
Fixed doc build issues.
📋 Changes
- Fixed an issue where pip tries to install camel-kenlm on Windows.
📋 Changes
- Improved BERT disambiguation accuracy.
- Added support for Python 3.10.
📋 Changes
- Fixed an issue importing Unfactored BERT Disambiguator.
📋 Changes
- Added Unfactored BERT disambiguator component.
- Bug fixes.
📋 Changes
- Fixed issue with downloading package catalogue on Google Colab.
📋 Changes
- Removed support for Python 3.6 (only 3.7-3.9 are now supported).
- Implemented a new package manager for fine-grained installation of datasets.
- Fixed GPU support for NER and Sentiment Analysis components.
- Added emoji charsets.
- [simple_word_tokenize](https://camel-tools.readthedocs.io/en/latest/api/tokenizers/word.html) now splits emojis correctly and can optionally split digits.
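
The tokenization behavior described above can be pictured with a rough, self-contained sketch. This is not the CAMeL Tools implementation: the function name `sketch_word_tokenize` and its regular expressions are assumptions made for illustration, and only the idea of an optional digit-splitting flag comes from the note above.

```python
import re


def sketch_word_tokenize(text, split_digits=False):
    """Illustrative tokenizer (hypothetical, not the library's code).

    Runs of letters/digits become word tokens; every other non-space
    character (punctuation, a single-code-point emoji) becomes its own token.
    """
    if split_digits:
        # Letters and digits form separate runs, so "2pm" -> "2", "pm".
        word = r"[^\W\d_]+|\d+"
    else:
        # Letters and digits stay together in one token.
        word = r"[^\W_]+"
    # \S picks up anything left over, one character at a time.
    return re.findall(rf"{word}|\S", text)


print(sketch_word_tokenize("hello😀world 2pm"))
# ['hello', '😀', 'world', '2pm']
print(sketch_word_tokenize("hello😀world 2pm", split_digits=True))
# ['hello', '😀', 'world', '2', 'pm']
```

Note that this sketch treats each code point separately, so multi-code-point emoji sequences and diacritized text would need extra handling that is left out here.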
📋 Changes
- Updated documentation and added more examples.
- Morphology improvements and bug fixes.
This release adds the `camel_data` command line tool for simplifying downloading data sets.
This update fixes installation issues caused by the kenlm dependency.
First official release of CAMeL Tools. See [this post](https://camel-lab.github.io/camel_tools_updates/2020/09/08/camel-tools-release-v1.0.0.html) for more information on this release.
This patch fixes errors in the `almor-msa` built-in database. It also adds a new parameter, `diac`, to `camel_tools.tokenizers.morphological.MorphologicalTokenizer` that determines whether output tokens are diacritized.
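
As a rough illustration of what a flag like `diac` controls, the sketch below either keeps a token as-is or strips its Arabic diacritics. The helper `format_token`, its default value, and the exact character range are assumptions for this example; the sketch does not reproduce how `MorphologicalTokenizer` works internally.

```python
import re

# Tanween, short vowels, shadda, sukun (U+064B-U+0652) and dagger alif (U+0670).
# This range is an assumption for the sketch, not the library's definition.
_AR_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def format_token(token, diac=False):
    """Return the token unchanged if diac is True, else without diacritics."""
    return token if diac else _AR_DIACRITICS.sub("", token)


diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, fully vocalized
print(format_token(diacritized, diac=True) == diacritized)  # True
print(format_token(diacritized) == "\u0643\u062A\u0628")    # True
```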
📋 Changes
- Fixed issue where diacritics and other marks broke tokenization.
- Fixed escaping of NOAN replacements in backoff analyses.
- Updated import of Mapping abstract base class for Python 3.
📋 Changes
- Fixed handling of sun letters in CALIMA Star.
- Added the ANY keyword as a value for certain features in CALIMA Star's reinflector.
Fixed a bug in CALIMA Star reinflector that prevented certain POS variants from being generated.
📋 Changes
- Fixed an issue in CalimaStarReinflector that prevented it from generating analyses.
- Fixed missing defines in the almor-msa database.
📋 Changes
- SimpleDisambiguator has been changed to a more fully featured MLEDisambiguator.
- Added word-boundary and morphological tokenizers.
- Added text normalization utilities.
- Analyzer APIs and Disambiguator APIs have been changed to be more general and output more descriptive named tuple objects.
- Almor-msa database now includes extensions.
- CALIMA Star Analyzer now has a built-in caching mode.
- CharMapper objects are now callable (no need to use `map_string()` method).
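
The callable-object change in the last item follows a common Python pattern, sketched below with a stand-in class. `SketchCharMapper` and its constructor are hypothetical and much simpler than the real `CharMapper`; only the idea that calling the object is equivalent to calling `map_string()` is taken from the note above.

```python
class SketchCharMapper:
    """Hypothetical stand-in for a character mapper (not the library class)."""

    def __init__(self, charmap):
        # Copy the mapping so later changes to the caller's dict have no effect.
        self._charmap = dict(charmap)

    def map_string(self, s):
        # Characters missing from the map are passed through unchanged.
        return "".join(self._charmap.get(ch, ch) for ch in s)

    def __call__(self, s):
        # Calling the object is shorthand for map_string().
        return self.map_string(s)


mapper = SketchCharMapper({"a": "A", "e": "E"})
print(mapper.map_string("camel"))  # cAmEl
print(mapper("camel"))             # cAmEl
```

Defining `__call__` lets a mapper be passed anywhere a plain function is expected, for example to `map()` over a list of strings.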
📋 Changes
- Implemented a simple disambiguation function using pos-lex frequencies.
- `camel_calima_star` has a new option to use simple disambiguation in analysis mode.
- `CalimaStarAnalyzer` now has a new method `analyze_words()` to analyze a list of words.
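
One way to picture disambiguation by pos-lex frequencies is to pick, for each word, the analysis whose (POS, lemma) pair is most frequent. The sketch below assumes analyses are dictionaries with `pos` and `lex` keys and uses an invented frequency table; the function names and toy data are hypothetical and do not reflect the actual CAMeL Tools data structures.

```python
def pick_by_pos_lex_freq(analyses, pos_lex_freq):
    """Return the analysis with the most frequent (pos, lex) pair, or None."""
    if not analyses:
        return None
    # Unseen pairs count as frequency 0; ties keep the first analysis.
    return max(analyses, key=lambda a: pos_lex_freq.get((a["pos"], a["lex"]), 0))


def disambiguate_words(words, analyze, pos_lex_freq):
    """Apply the frequency-based choice to each word in a list."""
    return [pick_by_pos_lex_freq(analyze(w), pos_lex_freq) for w in words]


# Toy data, invented purely for this example.
toy_analyses = {
    "ktb": [{"pos": "noun", "lex": "kitAb"}, {"pos": "verb", "lex": "katab"}],
}
toy_freq = {("verb", "katab"): 120, ("noun", "kitAb"): 45}

result = disambiguate_words(["ktb", "xyz"], lambda w: toy_analyses.get(w, []), toy_freq)
print(result)
# [{'pos': 'verb', 'lex': 'katab'}, None]
```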
