WorksApplications/Sudachi
A Japanese Tokenizer for Business
Sudachi version 0.8.0 (v0.8.0, Latest)
**CAUTION**
- The v0.8.* series is intended as an intermediate release series before v1.
- Please pin the exact version when using this series, as breaking behavioral changes may be introduced even in patch releases.
Changed
- Change PathAnchor behavior for elasticsearch-sudachi (https://github.com/WorksApplications/Sudachi/pull/361)
  - `PathAnchor.Classpath` now loads data via the class loader.
  - `PathAnchor.None` no longer resolves paths. You may need to use `PathAnchor.filesystem()` instead to resolve paths relative to the current working directory (CWD).
  - Fix `PathAnchor.Chain.resource`. We recommend using it instead of `toResource`.
- The 0-th column of the DictionaryPrinter output is now normalized (https://github.com/WorksApplications/Sudachi/pull/242)
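The difference between classpath-based and CWD-based resolution described in the PathAnchor entries can be illustrated with plain JDK calls. This is a minimal sketch and not Sudachi's implementation: `AnchorSketch` and both method names are hypothetical, and the file names are placeholders.

```java
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

final class AnchorSketch {
    // Filesystem-style anchoring: a relative name is resolved against the CWD.
    static Path resolveAgainstCwd(String name) {
        Path cwd = Paths.get("").toAbsolutePath();
        return cwd.resolve(name).normalize();
    }

    // Classpath-style anchoring: the name is looked up through a class loader.
    // Returns null when the loader cannot see such a resource.
    static URL resolveOnClasspath(String name) {
        return AnchorSketch.class.getClassLoader().getResource(name);
    }

    public static void main(String[] args) {
        System.out.println(resolveAgainstCwd("system.dic"));
        System.out.println(resolveOnClasspath("sudachi.json"));
    }
}
```

A path that is neither absolute nor anchored in one of these ways has no well-defined location, which is presumably why an explicit filesystem anchor is suggested for CWD-relative paths.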
Added
- Add TextNormalizer (https://github.com/WorksApplications/Sudachi/pull/242)
  - TextNormalizer normalizes text using the same process as the analysis.
Full Changelog: https://github.com/WorksApplications/Sudachi/compare/v0.7.5...v0.8.0
Sudachi version 0.7.5 (v0.7.5)
Changes
- The behavior of the dictionary printer and builder has changed (#234)
  - `DictionaryPrinter` now prints word references in the (Surface, POS, Reading) triple format, instead of the line number format.
  - `DictionaryBuilder` now allows the dictionary form to be written in the triple format, not only the line number format.
- Benchmark scripts are added (#235)
- Tutorial and readme are updated (#237, #240)
- `Config.Resource.asByteBuffer` now always returns a ByteBuffer with little-endian byte order (#239)
  - `StringUtil.readAllBytes` now also returns a ByteBuffer with little-endian byte order.
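Byte order matters because a freshly created `java.nio.ByteBuffer` is big-endian by default. The sketch below uses only the JDK to show how the same four bytes decode differently under each order. `ByteOrderSketch` and `littleEndian` are hypothetical names, not part of Sudachi.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderSketch {
    // Wraps raw bytes and switches the view to little-endian order.
    static ByteBuffer littleEndian(byte[] raw) {
        return ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
    }

    public static void main(String[] args) {
        byte[] raw = {0x01, 0x00, 0x00, 0x00};
        // Default (big-endian) view reads 0x01000000.
        System.out.println(ByteBuffer.wrap(raw).getInt(0)); // prints 16777216
        // Little-endian view reads 0x00000001.
        System.out.println(littleEndian(raw).getInt(0));    // prints 1
    }
}
```

Note that `ByteBuffer.duplicate()` and `ByteBuffer.slice()` return big-endian buffers regardless of the source buffer's order, so code that derives new buffers may need to set the order again.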
Sudachi version 0.7.4 (v0.7.4)
Changes
- Add `Tokenizer.lazyTokenizeSentences(SplitMode mode, Readable input)`, which performs analysis lazily and reduces memory usage (#231)
  - `Tokenizer.tokenizeSentences(SplitMode mode, Reader input)` is marked as deprecated.
- Do not segfault when tokenizing with a closed dictionary (#217)
- The default config `sudachi.json` sets non-existent property `joinKanjiNumeric` in `JoinNumericPlugin` (#221)
- Fix incorrect size calculation on expansion (#227)
- Update tutorial.md (#226)
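The idea behind lazy sentence processing, as in the `lazyTokenizeSentences` entry, is to pull input in chunks and hand out one sentence at a time instead of materializing everything first. The sketch below illustrates that idea with the JDK only. It is not Sudachi's algorithm: `LazySentences` is a hypothetical class, and splitting on the ideographic full stop (U+3002) alone is a simplifying assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class LazySentences implements Iterator<String> {
    private static final String STOP = "\u3002"; // ideographic full stop
    private final Readable input;
    private final CharBuffer chunk = CharBuffer.allocate(4096);
    private final StringBuilder pending = new StringBuilder();
    private int scanned = 0;      // prefix of pending already searched for STOP
    private boolean eof = false;
    private String next;

    LazySentences(Readable input) { this.input = input; }

    @Override public boolean hasNext() {
        if (next == null) next = advance();
        return next != null;
    }

    @Override public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String s = next;
        next = null;
        return s;
    }

    // Returns the next sentence, reading more input only when needed.
    private String advance() {
        while (true) {
            int end = pending.indexOf(STOP, scanned);
            if (end >= 0) {
                String s = pending.substring(0, end + 1);
                pending.delete(0, end + 1);
                scanned = 0;
                return s;
            }
            scanned = pending.length(); // do not rescan the same characters
            if (eof) {
                if (pending.length() == 0) return null;
                String rest = pending.toString();
                pending.setLength(0);
                scanned = 0;
                return rest;
            }
            fill();
        }
    }

    private void fill() {
        try {
            chunk.clear();
            if (input.read(chunk) < 0) { eof = true; return; }
            chunk.flip();
            pending.append(chunk);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Iterator<String> it = new LazySentences(new StringReader("A\u3002B\u3002C"));
        while (it.hasNext()) System.out.println(it.next());
    }
}
```

Only the current chunk and the unfinished sentence are held in memory, so peak usage depends on sentence length rather than on total input size.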
Sudachi version 0.7.3 (v0.7.3)
Changes
- Added `Config.fromResource` method for reading Configs via PathAnchor (#212)
- Plugin classloading is done by PathAnchor and supports multiple classloaders (#210, #209)
Sudachi version 0.7.1 (v0.7.1)
Changes
- Fixed analysis truncation when using analysis with sentence splitting and the input does not contain data which can be treated as splittable sentences
- Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
- Stop calling into reader with full buffer
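A `java.io.Reader` is allowed to return fewer characters than requested, so buffer-filling code has to loop, and it should not call `read` once the buffer has no free space. The following is a minimal sketch of such a loop, not the code used in Sudachi. `FillSketch`, `TrickleReader`, and `fill` are hypothetical names, and `TrickleReader` exists only to simulate a reader that returns one character per call.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class FillSketch {
    /** Hypothetical reader that hands out at most one char per read call. */
    static final class TrickleReader extends Reader {
        private final String data;
        private int pos = 0;

        TrickleReader(String data) { this.data = data; }

        @Override
        public int read(char[] cbuf, int off, int len) {
            if (len == 0) return 0;
            if (pos >= data.length()) return -1;
            cbuf[off] = data.charAt(pos++);
            return 1;
        }

        @Override
        public void close() { }
    }

    // Appends to buf starting at index `filled` until the buffer is full or
    // the reader is exhausted; returns the new number of valid chars.
    static int fill(Reader reader, char[] buf, int filled) {
        try {
            while (filled < buf.length) { // never read into a full buffer
                int n = reader.read(buf, filled, buf.length - filled);
                if (n < 0) break;         // end of input
                filled += n;
            }
            return filled;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        char[] buf = new char[8];
        int n = fill(new TrickleReader("hello"), buf, 0);
        System.out.println(new String(buf, 0, n)); // prints "hello"
    }
}
```

Because each call appends at the current fill position, the work is proportional to the number of characters read and does not grow quadratically with short reads.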
0.6.4 (v0.6.4)
Changes
- Fixed analysis truncation when using analysis with sentence splitting and the input does not contain data which can be treated as splittable sentences
- Fixed O(N^2) performance in sentence splitting when underlying reader does not fill buffer fully at once
- Stop calling into reader with full buffer
Sudachi version 0.6.3 (v0.6.3, Pre-release)
Port relaxed boundary mode from 0.7.0 while keeping ABI compatibility with pre-0.7.0 versions.
Sudachi version 0.7.0 (v0.7.0)
Changes
- `Tokenizer.tokenize` API returns `MorphemeList` instead of `List<Morpheme>`. This change is ABI-incompatible with previous versions, so applications that use Sudachi must be recompiled. The change should be source-compatible, with no changes required to source code that uses Sudachi.
- New API: `MorphemeList.split` re-splits a C-mode token sequence into lower-level units without re-analyzing the whole string.
- Added relaxed boundary matching mode for Regex OOV handler
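Narrowing a return type from `List<T>` to a class that implements `List<T>` is source-compatible but not binary-compatible. Compiled call sites record the full method descriptor, including the return type, so old bytecode fails with `NoSuchMethodError` until it is recompiled. The toy below shows the source-compatible side. `TokenList` and `ToyTokenizer` are hypothetical stand-ins and not Sudachi classes.

```java
import java.util.AbstractList;
import java.util.List;

/** Hypothetical stand-in for a richer list type such as MorphemeList. */
final class TokenList extends AbstractList<String> {
    private final String[] tokens;

    TokenList(String[] tokens) { this.tokens = tokens; }

    @Override public String get(int index) { return tokens[index]; }

    @Override public int size() { return tokens.length; }

    /** Extra operation that a plain List would not offer. */
    String joined() { return String.join("|", tokens); }
}

final class ToyTokenizer {
    // Imagine an earlier version declared the return type as List<String>.
    static TokenList tokenize(String text) {
        return new TokenList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Caller source written against List<String> compiles unchanged.
        List<String> tokens = tokenize("a b c");
        System.out.println(tokens);                     // prints [a, b, c]
        System.out.println(tokenize("a b c").joined()); // prints a|b|c
    }
}
```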
Sudachi version 0.6.2 (v0.6.2)
Changes
- Fixed invalid POS tags which appeared when using user-defined POS tags both in user dictionaries and OOV handlers. You are not affected by this bug if you did not use user-defined POS in OOV handlers.
Sudachi version 0.6.1 (v0.6.1)
Changes
- DO NOT USE 0.6.0, IT IS INCOMPATIBLE WITH 0.6.1
- Regex OOV plugin has a configurable maximum token length
- SettingsAnchor is renamed to PathAnchor to make its purpose clearer
- Add useful Config methods, e.g. for the common case of loading the default configuration with a provided PathAnchor to resolve default paths in another directory.
- Filesystem-based PathAnchor now works correctly when a SecurityManager is present (e.g. in Elasticsearch).
Regex OOV length
- Use the `maxLength` field of the plugin configuration object to set the maximum allowed length, in UTF-8 bytes (32 by default). The unit will change to Unicode code points in the future.
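The unit of a length limit matters for Japanese text, because one character can occupy several UTF-8 bytes or two UTF-16 code units. The JDK-only sketch below measures the same string in all three units. `LengthUnits` and its methods are hypothetical helpers, not Sudachi APIs.

```java
import java.nio.charset.StandardCharsets;

final class LengthUnits {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Units(String s) {
        return s.length();
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+6771, U+4EAC, and U+20BB7 (written as a surrogate pair).
        String s = "\u6771\u4EAC\uD842\uDFB7";
        System.out.println(utf8Bytes(s));  // prints 10 (3 + 3 + 4 bytes)
        System.out.println(utf16Units(s)); // prints 4
        System.out.println(codePoints(s)); // prints 3
    }
}
```

With typical kanji and kana taking 3 bytes each in UTF-8, a limit of 32 bytes corresponds to roughly 10 such characters.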
Sudachi version 0.6.0 (v0.6.0)
Changes
- Improved analysis speed by ~20% compared to 0.5.3
- New typed configuration API (`Config`)
- Regex matcher plugin
- OOV Handlers can use fully-customized POS tags
- API for compiling dictionaries
API for building dictionaries
- In addition to the command line interface for building dictionaries, Sudachi now provides an API.
Configuration API
- The new configuration framework allows some resources (dictionaries, character tables) to be supplied preloaded and prebuilt.
- For details on usage, see the Javadoc for the `Config` class.
Fully-custom POS tags in OOV providers
```json
"oovProviderPlugin" : [
    { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
      "oovPOS" : [ "ăăź", "ăă", "ăŻ", "ăăăă", "ăăȘă", "ă" ],
      "userPOS": "allow",
      "leftId" : 8,
      "rightId" : 8,
      "cost" : 6000 }
]
```
Regex OOV Provider Plugin
- Introduced a new OOV provider which matches a regular expression.
- Recommendations:
  - Use non-capturing groups in regular expressions, `(?:like this)`, rather than capturing groups, `(like this)`.
- Caveats:
  - Matches may start only on boundaries where the character type changes.
  - Matches may not produce words which are already present in the dictionary.
  - Match length is limited to 63 UTF-16 code units.
- Example for matching URLs:
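Independently of the URL example from the release notes, the recommendation about non-capturing groups can be illustrated with plain `java.util.regex`. The sketch assumes the plugin's patterns follow `java.util.regex` syntax. `GroupSketch`, its methods, and the patterns are hypothetical and not taken from Sudachi.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupSketch {
    // Number of capturing groups the pattern asks the engine to track.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // First match of the pattern in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String nonCapturing = "(?:https?)://\\S+";
        String capturing = "(https?)://\\S+";
        System.out.println(groups(nonCapturing)); // prints 0
        System.out.println(groups(capturing));    // prints 1
        // Both patterns match the same text.
        System.out.println(firstMatch(nonCapturing, "see https://example.com now"));
        System.out.println(firstMatch(capturing, "see https://example.com now"));
    }
}
```

When only the overall match is used, the captured group is bookkeeping with no benefit, so the non-capturing form is the cheaper choice.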
Speedup
- Improved lattice construction logic; it is now faster and generates less GC pressure
- Improved trie index lookup logic; it is now slightly faster and generates much less GC pressure
- All deprecations in this section will be removed in the 1.0 release.
- `DictionaryFactory` methods which use `Settings`
- `getPath` method of `Settings`; use `getResource` instead.
- Build now uses Gradle instead of Maven
- Tests can be written in Kotlin in addition to Java
- The OOV provider plugin internal API has changed. It must now add candidate nodes to the provided list and return the number of created OOVs. See the Javadoc for details.
Sudachi version 0.6.0-beta2 (v0.6.0-beta2, Pre-release)
Sudachi version 0.6.0-beta1 (v0.6.0-beta1, Pre-release)
Pre-release of 0.6.0
Sudachi version 0.5.3 (v0.5.3)
Changes
- Changed the priority of user dictionaries
  - If the cost is the same, the words in the dictionary added later will take precedence
- Fixed a bug where sentences were incorrectly separated by spaces.
- Added a method to dump the internal structure as JSON
Sudachi version 0.5.2 (v0.5.2)
Changes
- Added `IgnoreYomiganaPlugin` which removes yomigana in parentheses.
  - This feature is enabled by default
  - The default length of hiragana characters recognized as reading kana is up to 4 characters
  - See sudachi.json for details
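The effect described for `IgnoreYomiganaPlugin` can be approximated with a regular expression. The sketch below is only an illustration and not the plugin's code. It assumes that yomigana means hiragana in ASCII or full-width parentheses directly after a kanji, which may be narrower than what the plugin accepts. `YomiganaSketch` and `strip` are hypothetical names.

```java
import java.util.regex.Pattern;

final class YomiganaSketch {
    // Removes "(hiragana)" of length 1..maxLength when it directly follows a
    // Han character. Both ASCII and full-width parentheses are accepted.
    static String strip(String text, int maxLength) {
        Pattern p = Pattern.compile(
            "(?<=\\p{IsHan})[(\\uFF08]\\p{IsHiragana}{1," + maxLength + "}[)\\uFF09]");
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // U+5C71 followed by "(" U+3084 U+307E ")" and U+3078.
        String input = "\u5C71(\u3084\u307E)\u3078";
        // The parenthesized reading is removed, leaving U+5C71 U+3078.
        System.out.println(strip(input, 4));
    }
}
```

A reading longer than the limit is left untouched, which mirrors the idea of a bounded default length of 4.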
Sudachi version 0.5.1 (v0.5.1)
Changes
- Added synonym group IDs field to user dictionary
- Added `allowEmptyMorpheme` to settings
  - Setting this property to false suppresses tokens of length 0
  - The default value is true
Sudachi version 0.5.0 (v0.5.0)
Changes
- Added synonym group IDs field to use Sudachi Synonym Dictionary
- New dictionary format, which is backwards compatible
- Command line output can now be customized via plugins
Sudachi version 0.4.3 (v0.4.3)
Changes
- Fix overrun with surrogate pairs
Sudachi version 0.4.2 (v0.4.2)
Changes
- Fix buffer overrun with character normalization in `Tokenizer#tokenize(Reader)`
Sudachi version 0.4.1 (v0.4.1)
Changes
- Add `Tokenizer#tokenizeSentences(Reader)`
Sudachi version 0.4.0 (v0.4.0)
Changes
- Add a new sentence boundary detector
- Add `Tokenizer#tokenizeSentences`
- Add `SentenceDetector`
- The CLI now performs sentence boundary disambiguation
- Fix a bug causing normalized characters to be misaligned
Sudachi version 0.3.2 (v0.3.2)
Changes
- Fix a bug causing a crash with old user dictionaries
Sudachi version 0.3.1 (v0.3.1)
Changes
- Fix a bug causing the error `EOS isn't connected to BOS` with a particular input
Sudachi version 0.3.0 (v0.3.0)
Changes
- Support B and C units, and user-defined POS in user dictionaries
- Allow a tuple of headword, POS, and reading form instead of a word ID to describe A and B units
- Add `-p` option to the CLI to specify the path of resources
- Support merging additional settings into the default
- Add `-s` option to the CLI to specify settings to be merged
Sudachi version 0.2.0 (v0.2.0)
Changes
- The dictionaries have moved to https://github.com/WorksApplications/SudachiDict
- The double-array module has moved to https://github.com/WorksApplications/jdartsclone
- Fix for Java 9
- Fix the bug causing an invalid word length of a very long word
- Read dictionary sources from multiple files without concatenation
Sudachi version 0.1.1 (v0.1.1)
Changes
- Add a Prolonged Sound Mark normalization plugin
- OOV cannot start with IDEOGRAPHIC ITERATION MARK
- Read the configuration files from the jar by default
- `JoinNumericPlugin` can normalize numbers
- Add APIs to get Part-Of-Speech ID
- Add an option to the command line tool to ignore runtime errors
- Fix a bug that causes using more than 8 user dictionaries
- Change behaviors for `゛` (U+309B) and `゜` (U+309C) (refs #48)
v0.1.0
Sudachi 0.1.0!!
