Quick Start
One of the first things you will probably want to do is crawl the
Internet. This is easy to do in jbox, as the following code
demonstrates:
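public class Foo {
    public void crawl(String[] args) {
        // Load the default configuration and build a spider from it.
        Configuration cfg = Configuration.config();
        WebSpider s = cfg.buildWebSpider();
        // Print every page the spider returns.
        while (s.hashNext()) {
            Page p = s.next();
            System.out.println(p);
        }
    }
}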
Cutting text
Cutting text into words is a very important part of a search engine.
Different languages need different algorithms to cut text. The current
version of jbox offers two implementations, one for English and one for
CJK. By default only the English cutter is used; you can change this in
the configuration file "jbox.cfg.xml". You can start cutting text like
this:
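public void cutText() {
    String text = "Cutting text into words is very important for a search engine.";
    // cfg is the configuration loaded from "jbox.cfg.xml".
    Configuration cfg = Configuration.config();
    CutterBox cb = cfg.buildCutterBox();
    System.out.println(cb.cutText(text));
}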
To cut the text of a Page:
public void cutPage() {
    String text = "Cutting text into words is very important for a search engine.";
    Configuration cfg = Configuration.config();
    Page page = new Page();
    page.setText(text);
    CutterBox cb = cfg.buildCutterBox();
    cb.cutPage(page);
    // The resulting words are read back from the page.
    System.out.println(page.getWords());
}
To handle text in more complex languages, you can specify which cutters
are used in the "jbox.cfg.xml" file.
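To make the idea of cutting concrete, here is a minimal sketch of an English-style cutter that treats anything that is not a letter or digit as a separator. It is an illustration only and is not how jbox's cutters are implemented; the class name SimpleEnglishCutter and its cut method are hypothetical. CJK text has no spaces between words, which is why it needs a different cutter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not part of the jbox API.
class SimpleEnglishCutter {

    // Splits text into lower-cased words, treating every run of
    // characters that are neither letters nor digits as a separator.
    static List<String> cut(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) { // split() can yield an empty leading token
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(cut("Cutting text into words is very important."));
    }
}
```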
Creating index
The index is the core of a search engine, and creating one is also easy
in jbox. The following code shows how to use the default implementation
of IndexWriter to create the index of a Page:
public void createIndex() {
    Configuration cfg = Configuration.config();
    WebSpider s = cfg.buildWebSpider();
    CutterBox cb = cfg.buildCutterBox();
    org.jbox.indexer.IndexWriter iw = cfg.buildIndexWriter();
    while (s.hashNext()) {
        Page p = s.next();
        // Skip missing pages and pages without text.
        if (p == null || p.getText() == null) continue;
        // A page must be cut before its index is created.
        cb.cutPage(p);
        iw.saveIndex(p);
    }
}
You can find the index in the "index" field of the "Word" table in
your database. An entry might look like this:
"22-0.166667-0,1"
The first field, "22", means the word appears in the Page with id 22.
The second field, "0.166667", is the TF (term frequency) of the word in
that page, and "0,1" gives the locations of the word.
For example:
"I have a cat. You have a dog. He is so funny."
The word "have" appears in the first and second sentences, so its
locations in the text are "0,1". It occurs twice and the text has 12
words, so the TF is 2/12 = 0.166667. Suppose the id of this text is
"22"; the index of this word in the text is then
"22-0.166667-0,1".
Searching index
Once the index has been created in the database, you can build your own
search client for your application. For example:
public void search() {
    // Search for the key words "successfully,status".
    String query = "successfully,status";
    // Load the default configuration file "jbox.cfg.xml".
    Configuration cfg = Configuration.config();
    Searcher s = cfg.buildSearcher();
    Page[] result = s.search(query);
    for (Page p : result) {
        System.out.println(p.getTitle());
        System.out.println(p.getText());
        System.out.println(p.getUrl());
    }
}
If you search with SimpleSearcher, you may find that the text of each
page is only a short introduction. This is because SimpleSearcher
returns a proxy of the Page rather than the Page itself. To get the
whole text, do the following:
for (Page p : result) {
    PageProxy proxy = (PageProxy) p;
    // Fetch the real Page behind the proxy to get the whole text.
    Page full = proxy.getPage();
    System.out.println(full.getText());
}
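The behaviour described here is an instance of the proxy pattern. The sketch below shows the general idea with hypothetical stand-in types (Doc, FullDoc, DocProxy); these are not jbox's Page or PageProxy, whose internals are not covered in this guide. The proxy exposes a short excerpt and fetches the full object only when asked.

```java
import java.util.function.Supplier;

// Hypothetical stand-ins; not the jbox Page / PageProxy classes.
interface Doc {
    String getText();
}

class FullDoc implements Doc {
    private final String text;
    FullDoc(String text) { this.text = text; }
    public String getText() { return text; }
}

class DocProxy implements Doc {
    private final String excerpt;       // short introduction shown in result lists
    private final Supplier<Doc> loader; // fetches the full document on demand
    private Doc loaded;                 // cached after the first load

    DocProxy(String excerpt, Supplier<Doc> loader) {
        this.excerpt = excerpt;
        this.loader = loader;
    }

    // Returns only the excerpt, so listing results stays cheap.
    public String getText() { return excerpt; }

    // Loads the full document the first time it is requested.
    Doc getPage() {
        if (loaded == null) {
            loaded = loader.get();
        }
        return loaded;
    }
}

class ProxyDemo {
    public static void main(String[] args) {
        DocProxy result = new DocProxy("Cutting text...",
                () -> new FullDoc("Cutting text into words is very important."));
        System.out.println(result.getText());           // the excerpt
        System.out.println(result.getPage().getText()); // the whole text
    }
}
```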
Highlighting key words
To highlight the key words, you can use the highLight function. For
example, add the code below to the loop above:
for (…) {
    …
    searcher.highLight(p.getText(), new StringBuffer(query), Color.RED)
    …
}
The highLight function simply wraps every key word in the text in a font tag of the specified color:
keyword ---> <font color="xxxxxx">keyword</font>
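The replacement can be sketched in plain Java as follows. This is an assumption about the general technique, not the jbox implementation; the class HighlightDemo, the method highlight and the six-digit hex color string are hypothetical. All key words are matched in a single pass so that the inserted tags are never themselves rewritten, and the sketch does not HTML-escape the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the jbox highlight implementation.
class HighlightDemo {

    // Wraps each occurrence of any key word in a <font> tag of the given color.
    // Key words are comma-separated, as in the query "successfully,status".
    static String highlight(String text, String query, String hexColor) {
        StringBuilder alternation = new StringBuilder();
        for (String keyword : query.split(",")) {
            String k = keyword.trim();
            if (k.isEmpty()) continue;
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(k)); // match the key word literally
        }
        if (alternation.length() == 0) return text; // nothing to highlight
        Pattern p = Pattern.compile(alternation.toString(), Pattern.CASE_INSENSITIVE);
        String open = Matcher.quoteReplacement("<font color=\"" + hexColor + "\">");
        // $0 is the matched key word, kept in its original case.
        return p.matcher(text).replaceAll(open + "$0</font>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("The status was set successfully.",
                "successfully,status", "ff0000"));
    }
}
```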
You have now finished your own simple search engine. You can find
additional examples in the example package:
createIndex/SimpleSample.java - a simple example of crawling the
Internet and creating an index in the database.
createIndex/ComplexExample.zip - an example that uses multiple threads
to improve the performance of crawling the Internet and creating an
index in the database.
MyJbox.zip - a complete example of building a search engine with
Tomcat and MySQL. It requires Tomcat 5.5, MySQL 5.0 and JDK 1.5.
You may download jbox
here.