WorksApplications/Sudachi

A Japanese Tokenizer for Business

27 Releases

Sudachi version 0.8.0 (v0.8.0, Latest)

github-actions[bot] · May 26, 2026

📦 **CAUTION**

The v0.8.* series is intended as an intermediate release series before v1.
Please pin the exact version when using this series, as breaking behavioral changes may be introduced even in patch releases.

📋 Changed

Change PathAnchor behavior for elasticsearch-sudachi (https://github.com/WorksApplications/Sudachi/pull/361)
`PathAnchor.Classpath` now loads data via the class loader.
`PathAnchor.None` no longer resolves paths. You may need to use `PathAnchor.filesystem()` instead to resolve paths relative to the current working directory.
Fix `PathAnchor.Chain.resource`. We recommend using it instead of `toResource`.
The 0-th column of `DictionaryPrinter` output is now normalized (https://github.com/WorksApplications/Sudachi/pull/242)

✨ Added

Add TextNormalizer (https://github.com/WorksApplications/Sudachi/pull/242)
`TextNormalizer` normalizes text using the same process as the analysis.

Full Changelog: https://github.com/WorksApplications/Sudachi/compare/v0.7.5...v0.8.0

Sudachi version 0.7.5 (v0.7.5)

github-actions[bot] · November 5, 2024

📋 Changes

The behavior of the dictionary printer and builder has changed (#234)
`DictionaryPrinter` now prints word references in the (Surface, POS, Reading) triple format, instead of the line number format.
`DictionaryBuilder` now allows the dictionary form to be written in the triple format, not only the line number format.
Benchmark scripts are added (#235)
Tutorial and readme are updated (#237, #240)
`Config.Resource.asByteBuffer` now always returns ByteBuffer with little endian byte order (#239)
`StringUtil.readAllBytes` also now returns ByteBuffer with little endian byte order.
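
Why the byte order matters can be seen with plain `java.nio`, without any Sudachi classes. A freshly wrapped `ByteBuffer` is big endian by default, so the same four bytes decode to different integers depending on the order in effect. The sketch below is only an illustration; `ByteOrderDemo` and `readInt` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: shows why the byte order of a ByteBuffer matters when
// reading multi-byte values. It does not use any Sudachi classes.
class ByteOrderDemo {
    // Reads the first 4 bytes of `data` as an int using the given byte order.
    static int readInt(byte[] data, ByteOrder order) {
        return ByteBuffer.wrap(data).order(order).getInt(0);
    }

    public static void main(String[] args) {
        byte[] data = {0x01, 0x00, 0x00, 0x00};
        // 0x01000000 when the first byte is the most significant one.
        System.out.println(readInt(data, ByteOrder.BIG_ENDIAN));    // 16777216
        // 0x00000001 when the first byte is the least significant one.
        System.out.println(readInt(data, ByteOrder.LITTLE_ENDIAN)); // 1
    }
}
```

Code that previously set the order explicitly after obtaining the buffer should keep working; code that relied on the big-endian default may need a review.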

Sudachi version 0.7.4 (v0.7.4)

github-actions[bot] · July 2, 2024

📋 Changes

Add `Tokenizer.lazyTokenizeSentences(SplitMode mode, Readable input)`, which performs analysis lazily and reduces memory usage (#231)
`Tokenizer.tokenizeSentences(SplitMode mode, Reader input)` is marked as deprecated.
Do not segfault when tokenizing with a closed dictionary (#217)
Fix the default config `sudachi.json`, which set the non-existent property `joinKanjiNumeric` in `JoinNumericPlugin` (#221)
Fix incorrect size calculation when expanding (#227)
Update tutorial.md (#226)
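
The memory benefit of lazy processing mentioned above can be sketched with a plain iterator over a `Readable`. This is not Sudachi's implementation: `LazySentences` is a hypothetical class, and it splits naively on `。` instead of running real sentence detection. It only shows the general idea of pulling input on demand and holding one sentence at a time.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch of lazy, sentence-at-a-time reading from a Readable.
// It is NOT Sudachi's implementation; sentences are split naively on '。'.
class LazySentences implements Iterator<String> {
    private final Readable input;
    private final CharBuffer chunk = CharBuffer.allocate(16); // small on purpose
    private final StringBuilder pending = new StringBuilder();
    private boolean eof = false;

    LazySentences(Readable input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        // Pull more input only until one complete sentence is buffered.
        while (pending.indexOf("。") < 0 && !eof) {
            fill();
        }
        return pending.length() > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end = pending.indexOf("。");
        // Without a terminator, the remaining text is the last sentence.
        int cut = end < 0 ? pending.length() : end + 1;
        String sentence = pending.substring(0, cut);
        pending.delete(0, cut);
        return sentence;
    }

    private void fill() {
        try {
            chunk.clear();
            if (input.read(chunk) < 0) {
                eof = true;
                return;
            }
            chunk.flip();
            pending.append(chunk);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Iterator<String> it = new LazySentences(new StringReader("今日は晴れ。明日は雨。たぶん"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```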

Sudachi version 0.7.3 (v0.7.3)

github-actions[bot] · June 26, 2023

📋 Changes

Added `Config.fromResource` method for reading Configs via PathAnchor. (#212)
Plugin classloading is done by PathAnchor and supports multiple classloaders (#210, #209)

Sudachi version 0.7.1 (v0.7.1)

github-actions[bot] · March 9, 2023

📋 Changes

Fixed analysis truncation when using analysis with sentence splitting and the input does not contain data which can be treated as splittable sentences
Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
Stop calling into reader with full buffer
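
The reader-related fixes above concern inputs that return fewer characters than requested per call. A minimal sketch of a fill loop that tolerates such readers, and never calls `read` on a full buffer, might look like the following. It is not Sudachi's code; `BufferFill`, `fill`, and `trickle` are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;

// Hypothetical sketch, not Sudachi's code: fills a CharBuffer from a reader
// that may return fewer characters than requested on each call.
class BufferFill {
    // Reads until the buffer is full or the input ends.
    // Returns the number of characters added to the buffer.
    static int fill(Readable input, CharBuffer buffer) {
        int total = 0;
        try {
            // Checking hasRemaining() first avoids calling read() with a
            // full buffer, which would make no progress.
            while (buffer.hasRemaining()) {
                int n = input.read(buffer);
                if (n < 0) {
                    break; // end of input
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // A reader that hands out at most `step` characters per read call.
    static Reader trickle(String text, int step) {
        return new Reader() {
            private final StringReader inner = new StringReader(text);

            @Override
            public int read(char[] cbuf, int off, int len) throws IOException {
                return inner.read(cbuf, off, Math.min(len, step));
            }

            @Override
            public void close() {
                inner.close();
            }
        };
    }

    public static void main(String[] args) {
        CharBuffer buffer = CharBuffer.allocate(8);
        int n = fill(trickle("abcdefghij", 3), buffer);
        buffer.flip();
        System.out.println(n + " " + buffer); // 8 abcdefgh
    }
}
```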

0.6.4 (v0.6.4)

github-actions[bot] · March 9, 2023

📋 Changes

Fixed analysis truncation when using analysis with sentence splitting and the input does not contain data which can be treated as splittable sentences
Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
Stop calling into reader with full buffer

Sudachi version 0.6.3 (v0.6.3, Pre-release)

github-actions[bot] · August 29, 2022

Port relaxed boundary mode from 0.7.0 while keeping ABI compatibility with pre-0.7.0 versions.

Sudachi version 0.7.0 (v0.7.0)

github-actions[bot] · August 16, 2022

📋 Changes

`Tokenizer.tokenize` API returns `MorphemeList` instead of `List<Morpheme>`. This change is ABI-incompatible with previous versions and applications which use Sudachi require recompilation. The change should be source-compatible with no changes required to the source code which uses Sudachi.
New API: `MorphemeList.split`: resplit C-mode token sequence to lower level without re-analyzing the whole string.
Added relaxed boundary matching mode for Regex OOV handler
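
The difference between ABI and source compatibility described above can be illustrated with a stand-alone sketch. `TokenList` and `ReturnTypeDemo` are hypothetical stand-ins, not Sudachi classes, and the sketch assumes the new return type is itself a `List`. The JVM links calls by a method descriptor that includes the return type, so callers compiled against the old `List` signature need recompilation, while their source still compiles unchanged.

```java
import java.util.AbstractList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a more specific return type that is still a List.
class TokenList extends AbstractList<String> {
    private final String[] tokens;

    TokenList(String... tokens) {
        this.tokens = tokens;
    }

    @Override
    public String get(int index) {
        return tokens[index];
    }

    @Override
    public int size() {
        return tokens.length;
    }

    // An extra operation that is only reachable through the specific type.
    TokenList firstHalf() {
        return new TokenList(Arrays.copyOf(tokens, tokens.length / 2));
    }
}

class ReturnTypeDemo {
    // An earlier version of this hypothetical API could have been declared as
    //   static List<String> tokenize(String text)
    static TokenList tokenize(String text) {
        return new TokenList(text.split(" "));
    }

    public static void main(String[] args) {
        // Caller source written against the old signature still compiles.
        List<String> oldStyle = tokenize("a b c d");
        System.out.println(oldStyle.size()); // 4

        // New callers can use the additional method.
        TokenList newStyle = tokenize("a b c d");
        System.out.println(newStyle.firstHalf()); // [a, b]
    }
}
```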

Sudachi version 0.6.2 (v0.6.2)

github-actions[bot] · June 21, 2022

📋 Changes

Fixed invalid POS tags which appeared when using user-defined POS tags both in user dictionaries and OOV handlers. You are not affected by this bug if you did not use user-defined POS in OOV handlers.

Sudachi version 0.6.1 (v0.6.1)

github-actions[bot] · June 10, 2022

📋 Changes

DO NOT USE 0.6.0, IT IS INCOMPATIBLE WITH 0.6.1
Regex OOV plugin has configurable maximum token length
`SettingsAnchor` renamed to `PathAnchor` to make its purpose clearer
Add useful `Config` methods, e.g. for the common case of loading the default configuration with a provided `PathAnchor` to resolve default paths in another directory.
Filesystem-based `PathAnchor` now works correctly when a SecurityManager is present (e.g. in Elasticsearch).

📦 Regex OOV length

Use the `maxLength` field of the plugin configuration object to set the maximum allowed length in UTF-8 bytes (32 by default). The unit will change to Unicode code points in the future.
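
Because the unit matters for Japanese text, the following stand-alone sketch compares the three length measures that appear in these notes: UTF-8 bytes, UTF-16 code units, and Unicode code points. It uses only the Java standard library; `LengthUnits` and its methods are hypothetical helper names, not part of Sudachi.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of Sudachi: measures one string in three units.
class LengthUnits {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Units(String s) {
        return s.length(); // Java strings are sequences of UTF-16 code units
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    static void report(String s) {
        System.out.println(utf8Bytes(s) + " bytes, " + utf16Units(s)
                + " UTF-16 units, " + codePoints(s) + " code points");
    }

    public static void main(String[] args) {
        report("abc");          // 3 bytes, 3 UTF-16 units, 3 code points
        report("東京都");        // 9 bytes, 3 UTF-16 units, 3 code points
        report("\uD842\uDFB7"); // U+20BB7: 4 bytes, 2 UTF-16 units, 1 code point
    }
}
```

With a limit of 32 UTF-8 bytes, a match made of kanji or kana (3 bytes each) can span at most 10 characters, while an ASCII-only match can span 32.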

Sudachi version 0.6.0 (v0.6.0)

github-actions[bot] · June 9, 2022

📋 Changes

Improved analysis speed ~20% compared to 0.5.3
New typed configuration API (`Config`)
Regex matcher plugin
OOV Handlers can use fully-customized POS tags
API for compiling dictionaries

📦 API for building dictionaries

In addition to the command line interface for building dictionaries, Sudachi now provides an API for it.

📦 Configuration API

The new configuration framework allows some resources (dictionaries, character tables) to be specified preloaded and prebuilt.
For details on usage, see Javadoc for `Config` class.

📦 Fully-custom POS tags in OOV providers

```json
"oovProviderPlugin" : [
  { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
    "oovPOS" : [ "この", "たぐ", "は", "ぞんざい", "しない", "よ" ],
    "userPOS": "allow",
    "leftId" : 8,
    "rightId" : 8,
    "cost" : 6000 }
]
```

📦 Regex OOV Provider Plugin

Introduced a new OOV provider which matches a regular expression.
Recommendations:
Use non-capturing groups in regular expressions: `(?:like this)`, but not capturing groups `(like this)`
Caveats:
Matches may start only on boundaries where character type changes.
Matches may not produce words which are already present in the dictionary.
Match length is limited to 63 UTF-16 code units.
Example for matching URLs:
+ 11 more
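
The recommendation about non-capturing groups can be illustrated with `java.util.regex` alone. The sketch below does not configure the plugin and the URL-like pattern is a simplified, hypothetical one; it only shows that both forms match the same text while the non-capturing form records no groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: compares capturing and non-capturing groups.
class GroupDemo {
    // Number of capturing groups declared in the given regular expression.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // The first substring of `text` matched by `regex`, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String capturing = "(https?)://(\\w+)";
        String nonCapturing = "(?:https?)://(?:\\w+)";
        String text = "see http://example for details";

        System.out.println(groups(capturing));              // 2
        System.out.println(groups(nonCapturing));           // 0
        System.out.println(firstMatch(capturing, text));    // http://example
        System.out.println(firstMatch(nonCapturing, text)); // http://example
    }
}
```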

📦 Speedup

Improved lattice construction logic: it is now faster and generates less GC pressure.
Improved trie index lookup logic: it is now slightly faster and generates much less GC pressure.

📦 Deprecations

All deprecations in this section will be removed with the 1.0 release.
`DictionaryFactory` methods which use `Settings`
`getPath` method of `Settings`; use `getResource` instead.

📦 Other changes

Build now uses Gradle instead of Maven
Tests can be written in Kotlin in addition to Java
OOV provider plugin internal API has changed. It must now create candidate nodes in the provided list and return the number of created OOVs. See Javadoc for details.

Sudachi version 0.6.0-beta2 (v0.6.0-beta2, Pre-release)

github-actions[bot] · June 7, 2022

Sudachi version 0.6.0-beta1 (v0.6.0-beta1, Pre-release)

github-actions[bot] · June 3, 2022

Pre-release of 0.6.0

Sudachi version 0.5.3 (v0.5.3)

kazuma-t · November 4, 2021

📋 Changes

Changed the priority of user dictionaries
If the cost is the same, the words in the dictionary added later will take precedence
Fixed a bug where sentences were incorrectly separated by spaces.
Added a method to dump the internal structure as JSON

Sudachi version 0.5.2 (v0.5.2)

github-actions[bot] · March 13, 2021

📋 Changes

Added `IgnoreYomiganaPlugin` which removes yomigana in parentheses.
This feature is enabled by default
The default length of hiragana characters recognized as reading kana is up to 4 characters
See sudachi.json for details
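
To make the idea concrete, here is a rough stand-alone approximation written with a regular expression. It is not how `IgnoreYomiganaPlugin` is implemented (the plugin works inside the tokenizer and may differ in detail); `YomiganaSketch` is a hypothetical class that only mimics the described default of up to 4 hiragana characters in parentheses after a kanji.

```java
import java.util.regex.Pattern;

// Rough, hypothetical approximation of the idea only; the real
// IgnoreYomiganaPlugin works inside the tokenizer and may differ in detail.
class YomiganaSketch {
    // A kanji followed by 1 to 4 hiragana in half- or full-width parentheses.
    private static final Pattern YOMIGANA =
            Pattern.compile("(\\p{IsHan})[(（]\\p{IsHiragana}{1,4}[)）]");

    // Drops the parenthesized reading and keeps the kanji before it.
    static String strip(String text) {
        return YOMIGANA.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // 4 hiragana: treated as yomigana and removed.
        System.out.println(strip("徳島（とくしま）に行く")); // 徳島に行く
        // 5 hiragana: longer than the limit, left untouched.
        System.out.println(strip("東京（とうきょう）に行く")); // 東京（とうきょう）に行く
    }
}
```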

Sudachi version 0.5.1 (v0.5.1)

github-actions[bot] · November 25, 2020

📋 Changes

Added synonym group IDs field to user dictionary
Added `allowEmptyMorpheme` to settings
Setting this property to false suppresses tokens of length 0
The default value is true

Sudachi version 0.5.0 (v0.5.0)

kazuma-t · November 4, 2020

📋 Changes

Added synonym group IDs field to use Sudachi Synonym Dictionary
New dictionary format, which is backwards compatible
Command line output can now be customized via plugins

Sudachi version 0.4.3 (v0.4.3)

kazuma-t · June 19, 2020

📋 Changes

Fix overrun with surrogate pairs

Sudachi version 0.4.2 (v0.4.2)

kazuma-t · May 29, 2020

📋 Changes

Fix buffer overrun with character normalization in `Tokenizer#tokenize(Reader)`

Sudachi version 0.4.1 (v0.4.1)

kazuma-t · May 26, 2020

📋 Changes

Add `Tokenizer#tokenizeSentences(Reader)`

Sudachi version 0.4.0 (v0.4.0)

kazuma-t · April 5, 2020

📋 Changes

Add a new sentence boundary detector
Add `Tokenizer#tokenizeSentences`
Add `SentenceDetector`
The CLI now performs sentence boundary disambiguation
Fix a bug causing normalized characters to be misaligned

Sudachi version 0.3.2 (v0.3.2)

kazuma-t · December 17, 2019

📋 Changes

Fix a bug causing crash with old user dictionaries

Sudachi version 0.3.1 (v0.3.1)

kazuma-t · December 4, 2019

📋 Changes

Fix a bug causing the error `EOS isn't connected to BOS` with a particular input

Sudachi version 0.3.0 (v0.3.0)

kazuma-t · July 22, 2019

📋 Changes

Support B and C units, and user-defined POS in user dictionaries
Allow a tuple of headword, POS, and reading form instead of a word ID to describe A and B units
Add `-p` option to the CLI to specify the path of resources
Support merging additional settings into the default
Add `-s` option to the CLI to specify settings to be merged

Sudachi version 0.2.0 (v0.2.0)

kazuma-t · April 3, 2019

📋 Changes

The dictionaries were moved to https://github.com/WorksApplications/SudachiDict
The double-array module was moved to https://github.com/WorksApplications/jdartsclone
Fix for Java 9
Fix the bug causing an invalid word length of a very long word
Read dictionary sources from multiple files without concatenation

Sudachi version 0.1.1 (v0.1.1)

kazuma-t · December 16, 2018

📋 Changes

Add a Prolonged Sound Mark normalization plugin
OOV cannot start with IDEOGRAPHIC ITERATION MARK
Read the configuration files from the jar by default
`JoinNumericPlugin` can normalize numbers
Add APIs to get Part-Of-Speech ID
Add an option to the command line tool to ignore runtime errors
Fix a bug that occurs when using more than 8 user dictionaries
Change behaviors for `゛` (U+309B) and `゜` (U+309C) (refs #48)

v0.1.0

kazuma-t · November 17, 2017

Sudachi 0.1.0!!
