Quick Start

Crawling the Internet

One of the first things you'll probably want to do is crawl the Internet. This is easy to do in jbox. The following code demonstrates how to do this:
   
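public class Foo {
    public void crawl(String[] args) {
        // Load the default configuration file "jbox.cfg.xml".
        Configuration cfg = Configuration.config();
        WebSpider s = cfg.buildWebSpider();
        // Print each page as the spider fetches it.
        while (s.hashNext()) {
            Page p = s.next();
            System.out.println(p);
        }
    }
}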
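To give a feel for what a spider loop like the one above does internally, here is a minimal sketch of a breadth-first crawl frontier exposed through the standard Iterator interface. It is not jbox's WebSpider: the class name FrontierSketch and the helper crawlOrder are hypothetical, and an in-memory map of links stands in for fetching and parsing real pages. It assumes Java 8 or later.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch, not part of jbox: a breadth-first crawl frontier.
final class FrontierSketch implements Iterator<String> {
    // url -> outgoing links; a stand-in for downloading a page and extracting its links
    private final Map<String, List<String>> links;
    private final Queue<String> frontier = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    FrontierSketch(String seed, Map<String, List<String>> links) {
        this.links = links;
        frontier.add(seed);
        seen.add(seed);
    }

    @Override
    public boolean hasNext() {
        return !frontier.isEmpty();
    }

    @Override
    public String next() {
        if (frontier.isEmpty()) {
            throw new NoSuchElementException();
        }
        String url = frontier.remove();
        // Queue every link that has not been seen before, so each page is visited once.
        for (String out : links.getOrDefault(url, Collections.<String>emptyList())) {
            if (seen.add(out)) {
                frontier.add(out);
            }
        }
        return url;
    }

    /** Collects the visit order starting from the seed. */
    static List<String> crawlOrder(String seed, Map<String, List<String>> links) {
        List<String> order = new ArrayList<>();
        FrontierSketch spider = new FrontierSketch(seed, links);
        while (spider.hasNext()) {
            order.add(spider.next());
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("a", "d"));
        links.put("c", Arrays.asList("d"));
        System.out.println(crawlOrder("a", links)); // prints [a, b, c, d]
    }
}
```

The "seen" set is what keeps the loop from running forever when pages link back to each other.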

Cutting text

Cutting text into words is a very important part of a search engine. Different languages need different algorithms for cutting text. The current version of jbox offers two implementations, one for English and one for CJK. By default only the English cutter is used; you can change this in the configuration file "jbox.cfg.xml". You can cut text like this:

public void cutText(){
        String text = "Cutting text into words is very important for search engines.";
        Configuration cfg = Configuration.config();
        CutterBox cb = cfg.buildCutterBox();
        System.out.println(cb.cutText(text));
}

To cut the text of a Page:

public void cutPage(){
        String text = "Cutting text into words is very important for search engines.";
        Page page = new Page();
        page.setText(text);
        Configuration cfg = Configuration.config();
        CutterBox cb = cfg.buildCutterBox();
        cb.cutPage(page);                 // stores the words in the page
        System.out.println(page.getWords());
}

To deal with text in complex languages, you can specify which cutters to use in the "jbox.cfg.xml" file.
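To illustrate what cutting English text involves, here is a minimal sketch of a word cutter. It is not jbox's English cutter: the class EnglishCutterSketch and its cut method are hypothetical, and the rule used (split on anything that is not a letter or digit, then lower-case) is only an assumption about what a simple cutter might do. CJK text usually has no spaces between words, which is why it needs a different algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not part of jbox: a naive English word cutter.
final class EnglishCutterSketch {

    /** Splits text on runs of non-letter, non-digit characters and lower-cases each word. */
    static List<String> cut(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            // split() yields an empty first token when the text starts with punctuation
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(cut("Cutting text into words is very important for search engines."));
        // prints [cutting, text, into, words, is, very, important, for, search, engines]
    }
}
```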

Creating index

The index is the core of a search engine, and creating one is also easy in jbox. The following code shows how to use the default implementation of IndexWriter to create the index of a Page:

public void createIndex(){
        Configuration cfg = Configuration.config();
        WebSpider s = cfg.buildWebSpider();
        CutterBox cb = cfg.buildCutterBox();
        org.jbox.indexer.IndexWriter iw = cfg.buildIndexWriter();
        while(s.hashNext()){
            Page p = s.next();
            // Skip pages that could not be fetched or have no text.
            if(p==null||p.getText()==null)continue;
            cb.cutPage(p);            // You must cut a page before creating its index.
            iw.saveIndex(p);
        }
}

You can find the index in the "index" field of the "Word" table in your database. It may look like this:

    "22-0.166667-0,1"

The first field, "22", means the word appears in the Page with id 22; the second field, "0.166667", is the TF (term frequency) of the word in that page; and "0,1" gives the locations of the word.
For example:

 "I have a cat. You have a dog. He is so funny."

The word "have" appears in the first and second sentences, so its locations in the text are "0,1". The text has 12 words and "have" occurs twice, so the TF is 2/12 = 0.166667. If the id of this text is 22, the index of this word in the text is "22-0.166667-0,1".
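The arithmetic in this example can be reproduced with a short sketch. It is not jbox's IndexWriter: the class IndexEntrySketch and its buildEntry method are hypothetical, sentences are split naively on ".", "!" and "?", and locations are taken to be sentence numbers starting at 0, which is inferred from the example above rather than from the jbox source. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not part of jbox: builds one "pageId-TF-locations" entry.
final class IndexEntrySketch {

    static String buildEntry(int pageId, String text, String word) {
        String[] sentences = text.split("[.!?]+");   // naive sentence split
        int totalWords = 0;
        int occurrences = 0;
        List<String> locations = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            boolean seenHere = false;
            for (String token : sentences[i].trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                totalWords++;
                if (token.equalsIgnoreCase(word)) {
                    occurrences++;
                    seenHere = true;
                }
            }
            // Record the sentence number once, however often the word occurs in it.
            if (seenHere) {
                locations.add(String.valueOf(i));
            }
        }
        // TF = occurrences of the word / total number of words in the text
        double tf = totalWords == 0 ? 0.0 : (double) occurrences / totalWords;
        return pageId + "-" + String.format(Locale.ROOT, "%.6f", tf)
                + "-" + String.join(",", locations);
    }

    public static void main(String[] args) {
        String text = "I have a cat. You have a dog. He is so funny.";
        System.out.println(buildEntry(22, text, "have")); // prints 22-0.166667-0,1
    }
}
```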

Searching index

After the index has been created in the database, you can build your own search client for your application. For example:

    public void search(){
        String query = "successfully,status";        //search the key words "successfully,status";
        Configuration cfg = Configuration.config(); //load default configuration file "jbox.cfg.xml";
        Searcher s = cfg.buildSearcher();
        Page[] result = s.search(query);
        for (Page p : result) {
            System.out.println(p.getTitle());
            System.out.println(p.getText());
            System.out.println(p.getUrl());
        }
    }
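To show how a comma-separated query could be answered from index entries in the "pageId-TF-locations" format described earlier, here is a minimal sketch. It is not jbox's Searcher: the class SearchSketch is hypothetical, an in-memory map stands in for the "Word" table, and ranking pages by the sum of the TF values of the matching key words is an assumption made only for illustration. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not part of jbox: ranks pages by summed TF of the key words.
final class SearchSketch {

    /** index maps a word to its entries, each of the form "pageId-TF-locations". */
    static List<Integer> search(Map<String, List<String>> index, String query) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String keyword : query.split(",")) {
            List<String> entries = index.get(keyword.trim().toLowerCase(Locale.ROOT));
            if (entries == null) {
                continue;   // the word does not occur in any indexed page
            }
            for (String entry : entries) {
                String[] fields = entry.split("-", 3);   // pageId, TF, locations
                int pageId = Integer.parseInt(fields[0]);
                double tf = Double.parseDouble(fields[1]);
                scores.merge(pageId, tf, Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        // Highest score first; ties are broken by the smaller page id.
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = new HashMap<>();
        index.put("successfully", Arrays.asList("22-0.100000-0", "7-0.050000-3"));
        index.put("status", Arrays.asList("7-0.200000-1,2", "9-0.010000-0"));
        System.out.println(search(index, "successfully,status")); // prints [7, 22, 9]
    }
}
```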

If you search with SimpleSearcher, you may find that the text of each page is only a short introduction. This is because SimpleSearcher returns a proxy of the Page rather than the Page itself. To get the whole text, you can do the following:

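        for (Page p : result) {
            PageProxy proxy = (PageProxy) p;   // SimpleSearcher returns PageProxy objects
            Page full = proxy.getPage();       // the underlying Page with the whole text
            System.out.println(full.getText());
        }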
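The idea of a proxy that shows a short introduction while still giving access to the whole text can be sketched as follows. This is not jbox's PageProxy: the names Doc, FullDoc and DocProxy are hypothetical, and truncating the text to a fixed number of characters is only an assumption about how a summary might be produced. A real proxy would typically load the full object from the database on demand instead of holding it in memory.

```java
// Hypothetical sketch, not part of jbox: a proxy that exposes a shortened text.
final class ProxySketch {

    interface Doc {
        String getText();
    }

    static final class FullDoc implements Doc {
        private final String text;

        FullDoc(String text) {
            this.text = text;
        }

        @Override
        public String getText() {
            return text;
        }
    }

    static final class DocProxy implements Doc {
        private final FullDoc target;
        private final int summaryLength;

        DocProxy(FullDoc target, int summaryLength) {
            this.target = target;
            this.summaryLength = summaryLength;
        }

        /** Returns only the first summaryLength characters, followed by "..." if truncated. */
        @Override
        public String getText() {
            String text = target.getText();
            return text.length() <= summaryLength
                    ? text
                    : text.substring(0, summaryLength) + "...";
        }

        /** Gives access to the wrapped object with the whole text. */
        FullDoc getDoc() {
            return target;
        }
    }

    public static void main(String[] args) {
        DocProxy proxy = new DocProxy(new FullDoc("abcdefghij"), 4);
        System.out.println(proxy.getText());          // prints abcd...
        System.out.println(proxy.getDoc().getText()); // prints abcdefghij
    }
}
```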

Highlighting key words

If you want to highlight the key words, you can use the highLight function. For example, you can add the code below to the loop above:

        for(…){
            …
            s.highLight(p.getText(), new StringBuffer(query), Color.RED);
        }
   
The highLight function simply wraps every key word in the text in a font tag with the specified color:
       
    keyword---><font color="xxxxxx">keyword</font>
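
The replacement shown above can be sketched with a regular expression. This is not jbox's highLight implementation: the class HighlightSketch and its highlight method are hypothetical, the color is passed as a six-digit hex string instead of a Color, and matching whole words case-insensitively is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of jbox: wraps key words in <font> tags.
final class HighlightSketch {

    /** query is a comma-separated list of key words, as in the search example. */
    static String highlight(String text, String query, String hexColor) {
        // Build one alternation so that inserted tags are never matched by a later key word.
        StringBuilder alternation = new StringBuilder();
        for (String keyword : query.split(",")) {
            String k = keyword.trim();
            if (k.isEmpty()) {
                continue;
            }
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(k));   // treat the key word literally
        }
        if (alternation.length() == 0) {
            return text;
        }
        Pattern p = Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
        String open = Matcher.quoteReplacement("<font color=\"" + hexColor + "\">");
        // $0 is the matched key word, so its original capitalization is kept.
        return p.matcher(text).replaceAll(open + "$0</font>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("Status: login successfully.", "successfully,status", "FF0000"));
        // prints <font color="FF0000">Status</font>: login <font color="FF0000">successfully</font>.
    }
}
```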

You have now finished your own simple search engine. You can find additional examples in the example package:

        createIndex/SimpleSample.java - a simple example of crawling the Internet and writing the index to the database.
        createIndex/ComplexExample.zip - an example that uses multiple threads to improve the performance of crawling the Internet and writing the index to the database.
        MyJbox.zip - a complete example of building a search engine with Tomcat and MySQL. Requires Tomcat 5.5, MySQL 5.0, and JDK 1.5.

You may download jbox here.
Copyright © 2007-2013 YiBin.h.
Licensed under the Apache License, Version 2.0.