WorksApplications/Sudachi

A Japanese Tokenizer for Business

27 Releases
Latest: 3w ago
Sudachi version 0.8.0 (v0.8.0, Latest)
github-actions[bot] · May 26, 2026

📩 **CAUTION**

  ‱ The v0.8.* series is intended as an intermediate release series before v1.
  • Please pin the exact version when using this series, as breaking behavioral changes may be introduced even in patch releases.

📋 Changed

  • Change PathAnchor behavior for elasticsearch-sudachi (https://github.com/WorksApplications/Sudachi/pull/361)
  • `PathAnchor.Classpath` now loads data via class loader.
  ‱ `PathAnchor.None` no longer resolves paths. You may need to use `PathAnchor.filesystem()` instead to resolve based on the CWD.
  ‱ Fix `PathAnchor.Chain.resource`. We recommend using it instead of `toResource`.
  • 0-th column of DictionaryPrinter output is now normalized (https://github.com/WorksApplications/Sudachi/pull/242)
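
As a rough illustration of the difference between class-loader-based lookup and CWD-based resolution mentioned in the `PathAnchor` items above, the sketch below uses only the Java standard library. `AnchorSketch` and its methods are hypothetical names chosen for this example; they are not part of Sudachi's `PathAnchor` API, whose actual behavior is defined in the linked pull request.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class AnchorSketch {
    // Classpath-style lookup: ask a class loader whether it can see the resource.
    static boolean visibleToClassLoader(String name) {
        ClassLoader loader = AnchorSketch.class.getClassLoader();
        return loader != null && loader.getResource(name) != null;
    }

    // Filesystem-style lookup: resolve a relative name against the current working directory.
    static Path resolveAgainstCwd(String name) {
        return Paths.get("").toAbsolutePath().resolve(name);
    }

    public static void main(String[] args) {
        // The file name is only an example; results depend on the classpath and the CWD.
        System.out.println(visibleToClassLoader("sudachi.json"));
        System.out.println(resolveAgainstCwd("sudachi.json"));
    }
}
```

The two lookups can disagree: a resource packaged in a jar is visible to the class loader but has no path under the working directory, and the reverse holds for a file that sits only on disk.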

✹ Added

  • Add TextNormalizer (https://github.com/WorksApplications/Sudachi/pull/242)
  ‱ TextNormalizer normalizes text using the same process as the analysis.

Full Changelog: https://github.com/WorksApplications/Sudachi/compare/v0.7.5...v0.8.0
Sudachi version 0.7.5 (v0.7.5)
github-actions[bot] · November 5, 2024

📋 Changes

  ‱ The behavior of the dictionary printer and builder has changed (#234)
  • `DictionaryPrinter` now prints word references in the (Surface, POS, Reading) triple format, instead of the line number format.
  • `DictionaryBuilder` now allows the dictionary form to be written in the triple format, not only the line number format.
  • Benchmark scripts are added (#235)
  • Tutorial and readme are updated (#237, #240)
  • `Config.Resource.asByteBuffer` now always returns ByteBuffer with little endian byte order (#239)
  • `StringUtil.readAllBytes` also now returns ByteBuffer with little endian byte order.
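
Byte order matters because a Java `ByteBuffer` is big-endian by default, so the same four bytes decode to different integers depending on the order set on the buffer. The following standalone sketch shows the effect; `ByteOrderSketch` is a hypothetical name and does not call any Sudachi API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderSketch {
    // Decode the first four bytes of data as an int using the given byte order.
    static int readInt(byte[] data, ByteOrder order) {
        return ByteBuffer.wrap(data).order(order).getInt(0);
    }

    public static void main(String[] args) {
        byte[] data = {0x01, 0x00, 0x00, 0x00};
        System.out.println(readInt(data, ByteOrder.LITTLE_ENDIAN)); // prints 1
        System.out.println(readInt(data, ByteOrder.BIG_ENDIAN));    // prints 16777216
    }
}
```

Code that previously called `order(ByteOrder.LITTLE_ENDIAN)` on the returned buffer itself should keep working, since setting the same order again is harmless.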
Sudachi version 0.7.4 (v0.7.4)
github-actions[bot] · July 2, 2024

📋 Changes

  ‱ Add `Tokenizer.lazyTokenizeSentences(SplitMode mode, Readable input)`, which performs analysis lazily and reduces memory usage (#231)
  • `Tokenizer.tokenizeSentences(SplitMode mode, Reader input)` is marked as deprecated.
  • Do not segfault on tokenizing with closed dictionary (#217)
  ‱ Fix the default config sudachi.json setting the non-existent property `joinKanjiNumeric` in `JoinNumericPlugin` (#221)
  ‱ Fix incorrect size calculation on expansion (#227)
  • Update tutorial.md (#226)
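
The general idea behind lazy sentence-by-sentence processing of a `Readable` can be sketched with the standard library alone. `LazySentences` below is a hypothetical class for illustration only: it splits on the ideographic full stop (U+3002) and nothing else, and it is not how Sudachi's sentence detector or `lazyTokenizeSentences` is implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

class LazySentences implements Iterator<String> {
    private static final String STOP = "\u3002"; // ideographic full stop
    private final Readable input;
    private final CharBuffer buffer = CharBuffer.allocate(8); // deliberately small for the demo
    private final StringBuilder pending = new StringBuilder();
    private boolean eof = false;

    LazySentences(Readable input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        // Read only as much input as is needed to complete one sentence.
        while (pending.indexOf(STOP) < 0 && !eof) {
            fill();
        }
        return pending.length() > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int stop = pending.indexOf(STOP);
        // Without a terminator (end of input), the remainder is the last sentence.
        int cut = stop < 0 ? pending.length() : stop + 1;
        String sentence = pending.substring(0, cut);
        pending.delete(0, cut);
        return sentence;
    }

    private void fill() {
        try {
            buffer.clear();
            if (input.read(buffer) < 0) {
                eof = true;
                return;
            }
            buffer.flip();
            pending.append(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would wrap any `Readable`, for example `new LazySentences(new java.io.StringReader(text))`, and pull sentences one at a time, so only the unread tail of the current chunk is held in memory.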
Sudachi version 0.7.3 (v0.7.3)
github-actions[bot] · June 26, 2023

📋 Changes

  ‱ Added `Config.fromResource` method for reading Configs via PathAnchor. (#212)
  ‱ Plugin classloading is done by PathAnchor and supports multiple classloaders (#210, #209)
Sudachi version 0.7.1 (v0.7.1)
github-actions[bot] · March 9, 2023

📋 Changes

  ‱ Fixed analysis truncation when sentence splitting is used and the input contains no data that can be treated as splittable sentences
  • Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
  • Stop calling into reader with full buffer
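
A reader is allowed to return fewer characters than the buffer can hold, so a caller typically loops until the buffer is full or the input ends, and never asks for more once no space remains. The sketch below illustrates that pattern with hypothetical names (`FillSketch`, `TrickleReader`); it is not Sudachi's actual code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;

class FillSketch {
    // A reader that hands out at most one character per call, to simulate partial fills.
    static class TrickleReader extends Reader {
        private final String data;
        private int pos = 0;

        TrickleReader(String data) {
            this.data = data;
        }

        @Override
        public int read(char[] cbuf, int off, int len) {
            if (len == 0) return 0;
            if (pos >= data.length()) return -1;
            cbuf[off] = data.charAt(pos++);
            return 1;
        }

        @Override
        public void close() {
        }
    }

    // Fill buf as far as possible; returns the number of characters added.
    static int fill(Readable in, CharBuffer buf) {
        int total = 0;
        try {
            while (buf.hasRemaining()) { // never call read() with a full buffer
                int n = in.read(buf);
                if (n < 0) break;        // end of input
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```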
0.6.4 (v0.6.4)
github-actions[bot] · March 9, 2023

📋 Changes

  ‱ Fixed analysis truncation when sentence splitting is used and the input contains no data that can be treated as splittable sentences
  • Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
  • Stop calling into reader with full buffer
Sudachi version 0.6.3 (v0.6.3, Pre-release)
github-actions[bot] · August 29, 2022

Port relaxed boundary mode from 0.7.0 while keeping ABI compatibility with pre-0.7.0 versions.

Sudachi version 0.7.0 (v0.7.0)
github-actions[bot] · August 16, 2022

📋 Changes

  • `Tokenizer.tokenize` API returns `MorphemeList` instead of `List<Morpheme>`. This change is ABI-incompatible with previous versions and applications which use Sudachi require recompilation. The change should be source-compatible with no changes required to the source code which uses Sudachi.
  • New API: `MorphemeList.split`: resplit C-mode token sequence to lower level without re-analyzing the whole string.
  • Added relaxed boundary matching mode for Regex OOV handler
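
Narrowing a method's return type to a subtype of `List` is a typical case of a change that is source-compatible but not binary-compatible: calling code still compiles unchanged, but the JVM links calls by the full method descriptor, return type included, so classes compiled against the old signature must be recompiled. The sketch below assumes the new return type is itself a `List`; `WordList` and `CompatSketch` are hypothetical stand-ins, not Sudachi classes.

```java
import java.util.AbstractList;
import java.util.List;

// Hypothetical stand-in for a specialized list type returned by a tokenizer.
class WordList extends AbstractList<String> {
    private final String[] words;

    WordList(String... words) {
        this.words = words.clone();
    }

    @Override
    public String get(int index) {
        return words[index];
    }

    @Override
    public int size() {
        return words.length;
    }
}

class CompatSketch {
    // Imagine this method was previously declared to return List<String>.
    static WordList tokenize(String text) {
        return new WordList(text.split(" "));
    }

    public static void main(String[] args) {
        // Caller source written against the old signature still compiles as-is.
        List<String> words = tokenize("a b c");
        System.out.println(words.size()); // prints 3
    }
}
```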
Sudachi version 0.6.2 (v0.6.2)
github-actions[bot] · June 21, 2022

📋 Changes

  • Fixed invalid POS tags which appeared when using user-defined POS tags both in user dictionaries and OOV handlers. You are not affected by this bug if you did not use user-defined POS in OOV handlers.
Sudachi version 0.6.1 (v0.6.1)
github-actions[bot] · June 10, 2022

📋 Changes

  • DO NOT USE 0.6.0, IT IS INCOMPATIBLE WITH 0.6.1
  • Regex OOV plugin has configurable maximum token length
  ‱ `SettingsAnchor` renamed to `PathAnchor` to make its purpose clearer
  • Add useful Config methods, e.g. for a common case of loading default configuration with provided PathAnchor to resolve default paths in another directory.
  ‱ Filesystem-based PathAnchor now works correctly when a SecurityManager is present (e.g. in Elasticsearch).

📩 Regex OOV length

  ‱ Use the `maxLength` field of the plugin configuration object to set the maximum allowed length, in UTF-8 bytes (32 by default). The unit will change to Unicode code points in the future.
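
The choice of unit matters for Japanese text: kana and most kanji take three bytes each in UTF-8, so a 32-byte limit admits at most ten such characters. The standalone sketch below compares the three common length units for the same string; `LengthUnits` is a hypothetical helper, not part of Sudachi.

```java
import java.nio.charset.StandardCharsets;

class LengthUnits {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Units(String s) {
        return s.length();
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String tokyo = "\u6771\u4EAC";       // two kanji from the Basic Multilingual Plane
        String rareKanji = "\uD842\uDFB7";   // one supplementary character (U+20BB7)
        System.out.println(utf8Bytes(tokyo) + " " + utf16Units(tokyo) + " " + codePoints(tokyo));             // prints 6 2 2
        System.out.println(utf8Bytes(rareKanji) + " " + utf16Units(rareKanji) + " " + codePoints(rareKanji)); // prints 4 2 1
    }
}
```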
Sudachi version 0.6.0 (v0.6.0)
github-actions[bot] · June 9, 2022

📋 Changes

  • Improved analysis speed ~20% compared to 0.5.3
  • New typed configuration API (`Config`)
  • Regex matcher plugin
  • OOV Handlers can use fully-customized POS tags
  • API for compiling dictionaries

📩 API for building dictionaries

  • In addition to command line interface for building dictionaries, Sudachi now supports API.

📩 Configuration API

  ‱ The new configuration framework allows some resources (dictionaries, character tables) to be specified as preloaded and prebuilt.
  • For details on usage, see Javadoc for `Config` class.

📩 Fully-custom POS tags in OOV providers

```json
"oovProviderPlugin" : [
  { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
    "oovPOS" : [ "こぼ", "たぐ", "は", "ぞんざい", "しăȘい", "よ" ],
    "userPOS": "allow",
    "leftId" : 8,
    "rightId" : 8,
    "cost" : 6000 }
```
+ 2 more

📩 Regex OOV Provider Plugin

  • Introduced a new OOV provider which matches a regular expression.
  • Recommendations:
  • Use non-capturing groups in regular expressions: `(?:like this)`, but not capturing groups `(like this)`
  • Caveats:
  • Matches may start only on boundaries where character type changes.
  ‱ Matches may not produce words which are already present in the dictionary.
  ‱ Match length is limited to 63 UTF-16 code units
  • Example for matching URLs:
  • + 11 more
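
The recommendation about group syntax can be checked with plain `java.util.regex`: a non-capturing group matches the same text as a capturing one but records no group. The patterns below are made up for this illustration (they are not the elided URL example from the release notes), and `GroupSketch` is a hypothetical helper.

```java
import java.util.regex.Pattern;

class GroupSketch {
    // Number of capturing groups declared in the pattern.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True if the whole text matches the pattern.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String nonCapturing = "(?:https?)://\\S+";
        String capturing = "(https?)://\\S+";
        System.out.println(groupCount(nonCapturing)); // prints 0
        System.out.println(groupCount(capturing));    // prints 1
        System.out.println(matchesWhole(nonCapturing, "https://example.com")); // prints true
    }
}
```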

📩 Speedup

  ‱ Improved lattice construction logic; it is now faster and generates less GC pressure
  ‱ Improved trie index lookup logic; it is now slightly faster and generates much less GC pressure
  ‱ The following are deprecated and will be removed with the 1.0 release:
  ‱ `DictionaryFactory` methods which use `Settings`
  ‱ `getPath` method of `Settings`; use `getResource` instead.
  • Build now uses Gradle instead of Maven
  • Tests can be written in Kotlin in addition to Java
  ‱ OOV provider plugin internal API has changed. It must now create candidate nodes in the provided list and return the number of created OOVs. See Javadoc for details.
Sudachi version 0.6.0-beta2 (v0.6.0-beta2, Pre-release)
github-actions[bot] · June 7, 2022
Sudachi version 0.6.0-beta1 (v0.6.0-beta1, Pre-release)
github-actions[bot] · June 3, 2022

Pre-release of 0.6.0

Sudachi version 0.5.3 (v0.5.3)
kazuma-t · November 4, 2021

📋 Changes

  • Changed the priority of user dictionaries
  • If the cost is the same, the words in the dictionary added later will take precedence
  • Fixed a bug where sentences were incorrectly separated by spaces.
  • Added a method to dump the internal structure as JSON
Sudachi version 0.5.2 (v0.5.2)
github-actions[bot] · March 13, 2021

📋 Changes

  • Added `IgnoreYomiganaPlugin` which removes yomigana in parentheses.
  • This feature is enabled by default
  • The default length of hiragana characters recognized as reading kana is up to 4 characters
  • See sudachi.json for details
Sudachi version 0.5.1 (v0.5.1)
github-actions[bot] · November 25, 2020

📋 Changes

  • Added synonym group IDs field to user dictionary
  • Added `allowEmptyMorpheme` to settings
  • Setting this property to false suppresses tokens of length 0
  • The default value is true
Sudachi version 0.5.0 (v0.5.0)
kazuma-t · November 4, 2020

📋 Changes

  • Added synonym group IDs field to use Sudachi Synonym Dictionary
  ‱ New dictionary format, which is backwards compatible
  • Command line output can now be customized via plugins
Sudachi version 0.4.3 (v0.4.3)
kazuma-t · June 19, 2020

📋 Changes

  • Fix overrun with surrogate pairs
Sudachi version 0.4.2 (v0.4.2)
kazuma-t · May 29, 2020

📋 Changes

  • Fix buffer overrun with character normalization in `Tokenizer#tokenize(Reader)`
Sudachi version 0.4.1 (v0.4.1)
kazuma-t · May 26, 2020

📋 Changes

  • Add `Tokenizer#tokenizeSentences(Reader)`
Sudachi version 0.4.0 (v0.4.0)
kazuma-t · April 5, 2020

📋 Changes

  • Add a new sentence boundary detector
  • Add `Tokenizer#tokenizeSentences`
  • Add `SentenceDetector`
  ‱ The CLI now performs sentence boundary disambiguation
  • Fix a bug causing normalized characters to be misaligned
Sudachi version 0.3.2 (v0.3.2)
kazuma-t · December 17, 2019

📋 Changes

  • Fix a bug causing crash with old user dictionaries
Sudachi version 0.3.1 (v0.3.1)
kazuma-t · December 4, 2019

📋 Changes

  • Fix a bug causing the error `EOS isn't connected to BOS` with a particular input
Sudachi version 0.3.0 (v0.3.0)
kazuma-t · July 22, 2019

📋 Changes

  • Support B and C units, and user-defined POS in user dictionaries
  • Allow a tuple of headword, POS, and reading form instead of a word ID to describe A and B units
  • Add `-p` option to the CLI to specify the path of resources
  • Support merging additional settings into the default
  ‱ Add `-s` option to the CLI to specify settings to be merged
Sudachi version 0.2.0 (v0.2.0)
kazuma-t · April 3, 2019

📋 Changes

  ‱ The dictionaries have moved to https://github.com/WorksApplications/SudachiDict
  ‱ The double-array module has moved to https://github.com/WorksApplications/jdartsclone
  ‱ Fix for Java 9
  • Fix the bug causing an invalid word length of a very long word
  • Read dictionary sources from multiple files without concatenation
Sudachi version 0.1.1 (v0.1.1)
kazuma-t · December 16, 2018

📋 Changes

  • Add a Prolonged Sound Mark normalization plugin
  • OOV cannot start with IDEOGRAPHIC ITERATION MARK
  ‱ Read the configuration files from the jar by default
  • `JoinNumericPlugin` can normalize numbers
  • Add APIs to get Part-Of-Speech ID
  ‱ Add an option to the command line tool to ignore runtime errors
  ‱ Fix a bug triggered by using more than 8 user dictionaries
  • Change behaviors for `゛` (U+309B) and `゜` (U+309C) (refs #48)
v0.1.0
kazuma-t · November 17, 2017

Sudachi 0.1.0!!