org.jbox.webSpider
Interface WebSpider

All Known Implementing Classes:
SimpleSpider

public interface WebSpider

The root interface of WebSpider. An implementation crawls the Internet and fetches pages.

Version:
1.0
Author:
YiBin.H

Method Summary
 int getMaxPageNum()
          Return the maximum page number defined in the configuration file.
 boolean hashNext()
          Check if there is a next page to visit.
 Page next()
          Visit and return the next Page object.
 void setMaxPageNum(int maxPageNum)
          Set the maximum number of pages that the spider will crawl.
 void setRules(java.lang.String[] rules)
          Set crawl rules of WebSpider.
 void setStartUrls(java.lang.String[] startUrls)
          Set start URLs of WebSpider.
 

Method Detail

setStartUrls

void setStartUrls(java.lang.String[] startUrls)
Set start URLs of WebSpider.

Parameters:
startUrls - String array containing start URLs of WebSpider.

setRules

void setRules(java.lang.String[] rules)
Set crawl rules of WebSpider. Each rule is written as a regular expression. For example, the rule
{"http://.*(\.html)$"} limits the spider to URLs that end with ".html". The rules
{"http://.*(\.html)$","http://localhost/.*"} limit the spider to URLs that end with ".html" and start with "http://localhost/".

Parameters:
rules - String array containing rules written as regular expressions.
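
The two example rules above can be tried out with the standard java.util.regex package. This is only a sketch: the class RuleDemo and the method allowed are hypothetical names, and it assumes that a URL must fully match every rule to be crawled, which this page does not confirm. Note that in Java source code the backslash in \. has to be doubled.

```java
import java.util.regex.Pattern;

public class RuleDemo {
    // Assumption (not confirmed by the documentation): a URL is accepted
    // only if it fully matches every rule in the array.
    static boolean allowed(String url, String[] rules) {
        for (String rule : rules) {
            if (!Pattern.matches(rule, url)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The regular expression \. is written as \\. inside a Java string literal.
        String[] rules = {"http://.*(\\.html)$", "http://localhost/.*"};

        // Ends with ".html" and starts with "http://localhost/".
        System.out.println(allowed("http://localhost/docs/index.html", rules));
        // Ends with ".html" but is on another host.
        System.out.println(allowed("http://example.com/index.html", rules));
        // On localhost but does not end with ".html".
        System.out.println(allowed("http://localhost/docs/index.php", rules));
    }
}
```

Under the stated assumption, only the first URL satisfies both rules.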

setMaxPageNum

void setMaxPageNum(int maxPageNum)
Set the maximum number of pages that the spider will crawl.

Parameters:
maxPageNum - maximum number of pages the WebSpider will crawl.

hashNext

boolean hashNext()
Check if there is a next page to visit.

Returns:
true if there is a next page, false otherwise.

next

Page next()
Visit and return the next Page object.

Returns:
the next Page object.

getMaxPageNum

int getMaxPageNum()
Return the maximum page number defined in the configuration file.

Returns:
the maximum page number.
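
Usage sketch

The methods above suggest an iterator-style loop: configure the spider, then call hashNext() and next() until no pages remain. The following is a minimal sketch of that pattern, not the library's own code. So that it compiles on its own, it declares a local copy of the method signatures listed on this page, a hypothetical Page stand-in with a single url field, and a hypothetical in-memory FakeSpider that never touches the network and does not apply the rules. With the real library, SimpleSpider would presumably take the place of FakeSpider.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class CrawlLoopDemo {
    // Hypothetical stand-in for the real Page class, whose API is not shown here.
    static class Page {
        final String url;
        Page(String url) { this.url = url; }
    }

    // Local copy of the signatures documented above, so the sketch is self-contained.
    interface WebSpider {
        void setStartUrls(String[] startUrls);
        void setRules(String[] rules);
        void setMaxPageNum(int maxPageNum);
        boolean hashNext();
        Page next();
        int getMaxPageNum();
    }

    // Hypothetical in-memory implementation: it only hands back the start URLs.
    static class FakeSpider implements WebSpider {
        private final Deque<String> queue = new ArrayDeque<>();
        private String[] rules = new String[0]; // stored, but not applied in this stand-in
        private int maxPageNum = Integer.MAX_VALUE;
        private int visited = 0;

        public void setStartUrls(String[] startUrls) { queue.addAll(Arrays.asList(startUrls)); }
        public void setRules(String[] rules) { this.rules = rules; }
        public void setMaxPageNum(int maxPageNum) { this.maxPageNum = maxPageNum; }
        public int getMaxPageNum() { return maxPageNum; }
        public boolean hashNext() { return visited < maxPageNum && !queue.isEmpty(); }
        public Page next() { visited++; return new Page(queue.removeFirst()); }
    }

    // The loop itself: keep asking for pages until hashNext() reports none are left.
    static List<String> crawl(WebSpider spider) {
        List<String> urls = new ArrayList<>();
        while (spider.hashNext()) {
            urls.add(spider.next().url);
        }
        return urls;
    }

    public static void main(String[] args) {
        WebSpider spider = new FakeSpider();
        spider.setStartUrls(new String[] {
            "http://localhost/a.html", "http://localhost/b.html", "http://localhost/c.html"});
        spider.setRules(new String[] {"http://.*(\\.html)$"});
        spider.setMaxPageNum(2);
        System.out.println(crawl(spider));
    }
}
```

In this stand-in the page limit of 2 stops the loop before the third start URL is returned; how a real implementation counts pages against the limit is not specified on this page.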