Package org.jbox.textCutter

This package defines APIs for cutting text into words.


Interface Summary
Cutter The root interface of all text cutters.

Class Summary
AbstractCutter An abstract class that defines the default behavior of a Cutter.
CutterBox A container of Cutters.

Package org.jbox.textCutter Description

This package defines APIs for cutting text into words.

A Cutter cuts text written in a language that is identified by a Unicode scope. Because the world's languages differ so widely, a single cutter cannot analyze every grammar. To deal with complex text, Cutters are placed in a CutterBox. When text is passed to a CutterBox, it is handed to a chain of Cutters. Each Cutter in the chain uses a LanguageFilter to fetch the text that belongs to its Unicode scope, cuts that text into words, and passes the residual text to the next Cutter.

Note that the text passed in is not guaranteed to be processed.
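
The chain described above can be illustrated with a minimal, self-contained sketch. It does not use the org.jbox.textCutter API: the class CutterChainSketch, the nested RangeCutter, and the method cutAll are hypothetical names invented for this illustration, and each "cutter" simply treats a maximal run of characters inside its Unicode range as one word, which is far simpler than real grammar analysis. The sketch also assumes that text left unprocessed is text that falls outside every cutter's scope.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a cutter chain; NOT the org.jbox.textCutter API. */
public class CutterChainSketch {

    /** A hypothetical cutter that only handles code points in [low, high]. */
    static class RangeCutter {
        final int low;
        final int high;

        RangeCutter(int low, int high) {
            this.low = low;
            this.high = high;
        }

        boolean inScope(int codePoint) {
            return codePoint >= low && codePoint <= high;
        }

        /**
         * Adds each maximal run of in-scope characters to words and returns
         * the residual text. A run that is followed by out-of-scope text is
         * replaced by a single space in the residual.
         */
        String cut(String text, List<String> words) {
            StringBuilder residual = new StringBuilder();
            StringBuilder word = new StringBuilder();
            int i = 0;
            while (i < text.length()) {
                int cp = text.codePointAt(i);
                if (inScope(cp)) {
                    word.appendCodePoint(cp);
                } else {
                    if (word.length() > 0) {
                        words.add(word.toString());
                        word.setLength(0);
                        residual.append(' ');
                    }
                    residual.appendCodePoint(cp);
                }
                i += Character.charCount(cp);
            }
            if (word.length() > 0) {
                words.add(word.toString());
            }
            return residual.toString();
        }
    }

    /** Passes the text through each cutter in order and returns all words found. */
    static List<String> cutAll(List<RangeCutter> chain, String text) {
        List<String> words = new ArrayList<>();
        String remaining = text;
        for (RangeCutter cutter : chain) {
            remaining = cutter.cut(remaining, words);
        }
        // Whatever is still in remaining was outside every scope and stays unprocessed.
        return words;
    }

    public static void main(String[] args) {
        List<RangeCutter> chain = new ArrayList<>();
        chain.add(new RangeCutter('a', 'z'));        // lowercase Latin letters
        chain.add(new RangeCutter(0x4E00, 0x9FFF));  // CJK Unified Ideographs
        // The input holds "hello", two CJK ideographs (U+4E16, U+754C), and digits.
        List<String> words = cutAll(chain, "hello \u4e16\u754c 123");
        System.out.println(words.size() + " words found");
    }
}
```

In this sketch the digits "123" belong to no cutter's range, so they are never turned into words, which mirrors the note that passed-in text is not guaranteed to be processed.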

CutterBox uses NoiseFilter to filter out noise words. All noise words must be defined in files in the directory "DICT/NOISE/". For example, if the word "fool" needs to be filtered, add it to an existing file in "DICT/NOISE/", or to a new file such as "myNoise.txt" in that directory. The word "fool" will then be ignored when text is cut. No function needs to be invoked to filter the text; CutterBox does it when cutPage(Page) is called. The construction of CutterBox is as follows:

Note that each word must be written on its own line. If two words are written on one line, for example "fool fun", they are treated as a single word.
Note that the directory "DICT/NOISE/" must exist, even if no noise words are defined.
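
The noise-file rules above can be made concrete with a small sketch. It is not the NoiseFilter implementation: the class NoiseWordsSketch and the methods loadNoiseWords and removeNoise are hypothetical, and UTF-8 file encoding is an assumption. It only shows the stated behavior: one entry per line, a line such as "fool fun" counting as a single entry, and a missing directory being an error.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of noise-word loading; NOT the NoiseFilter implementation. */
public class NoiseWordsSketch {

    /**
     * Reads every regular file in dir; each non-blank line is one noise entry.
     * Throws an IOException if dir does not exist, even though an empty
     * directory is fine.
     */
    static Set<String> loadNoiseWords(Path dir) throws IOException {
        Set<String> noise = new HashSet<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                // Assumption: files are UTF-8 encoded.
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    String entry = line.trim();
                    if (!entry.isEmpty()) {
                        noise.add(entry); // the whole line is one entry
                    }
                }
            }
        }
        return noise;
    }

    /** Returns the words that are not noise entries, in their original order. */
    static List<String> removeNoise(List<String> words, Set<String> noise) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!noise.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for "DICT/NOISE/".
        Path dir = Files.createTempDirectory("noise");
        Files.write(dir.resolve("myNoise.txt"),
                Arrays.asList("fool", "fool fun"), StandardCharsets.UTF_8);

        Set<String> noise = loadNoiseWords(dir);
        // "fool" is dropped; "fun" survives because "fool fun" is a single entry.
        System.out.println(removeNoise(Arrays.asList("a", "fool", "fun"), noise));
    }
}
```

With the sample file above, running main prints [a, fun].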