org.jbox.webSpider.simpleSpider
Class SimpleSpider

java.lang.Object
  extended by org.jbox.webSpider.simpleSpider.SimpleSpider
All Implemented Interfaces:
WebSpider

public class SimpleSpider
extends java.lang.Object
implements WebSpider

An implementation of WebSpider. Note that SimpleSpider does not honor "robots.txt".

Version:
1.0
Author:
YiBin.H

Constructor Summary
SimpleSpider()
          Constructs a new SimpleSpider.
 
Method Summary
 int getMaxPageNum()
          Return the maximum number of pages that the spider will crawl.
 boolean hashNext()
          Check whether there is a next page to visit and the maximum number of pages has not been reached.
 Page next()
          Visit and return the next Page object.
 void setMaxPageNum(int maxPageNum)
          Set the maximum number of pages that the spider will crawl.
 void setRules(java.lang.String[] rules)
          Set crawl rules of WebSpider.
 void setStartUrls(java.lang.String[] startUrls)
          Set start URLs of WebSpider.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SimpleSpider

public SimpleSpider()
Constructs a new SimpleSpider.

Method Detail

setStartUrls

public void setStartUrls(java.lang.String[] startUrls)
Set start URLs of WebSpider.

Specified by:
setStartUrls in interface WebSpider
Parameters:
startUrls - String array containing start URLs of WebSpider.

setRules

public void setRules(java.lang.String[] rules)
Set crawl rules of WebSpider. Each rule is written as a regular expression (regexp). For example, the rule:
{"http://.*(\.html)$"} limits the spider to crawling URLs that end with ".html". The rules:
{"http://.*(\.html)$","http://localhost/.*"} limit the spider to crawling URLs that end with ".html" and start with "http://localhost/".

Specified by:
setRules in interface WebSpider
Parameters:
rules - String array containing rules written as regular expressions.
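
The rule semantics described above can be illustrated with plain java.util.regex. This is only a sketch, not the library's implementation: the class RuleCheck and the method accepts are hypothetical names, and it assumes that a URL is crawled only when the whole URL matches every rule, which is how the second example above reads. Note that in Java source the backslash in "\." must be doubled.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the jbox API: tests a URL against a rule
// set under the assumption that every rule has to match the whole URL.
final class RuleCheck {

    // Returns true only if the URL matches all rules (assumed AND semantics).
    static boolean accepts(String url, String[] rules) {
        for (String rule : rules) {
            if (!Pattern.matches(rule, url)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "\\." in Java source is the regular expression \. (a literal dot).
        String[] rules = {"http://.*(\\.html)$", "http://localhost/.*"};
        System.out.println(accepts("http://localhost/docs/index.html", rules)); // true
        System.out.println(accepts("http://example.com/index.html", rules));    // false
        System.out.println(accepts("http://localhost/logo.png", rules));        // false
    }
}
```

The same string array is what would be passed to setRules; whether SimpleSpider matches with Pattern.matches or with a partial match is not stated in this documentation.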

setMaxPageNum

public void setMaxPageNum(int maxPageNum)
Set the maximum number of pages that the spider will crawl.

Specified by:
setMaxPageNum in interface WebSpider
Parameters:
maxPageNum - maximum number of pages.

getMaxPageNum

public int getMaxPageNum()
Return the maximum number of pages that the spider will crawl.

Specified by:
getMaxPageNum in interface WebSpider
Returns:
maximum number of pages.

hashNext

public boolean hashNext()
Check whether there is a next page to visit and the maximum number of pages has not been reached.

Specified by:
hashNext in interface WebSpider
Returns:
true if there is a next page and the maximum number of pages has not been reached; false otherwise.

next

public Page next()
Visit and return the next Page object. The returned Page carries the URL, title, and text. Note that every page fetched by SimpleSpider is encoded as "UTF-8" rather than in the charset given by the "charset=" value of the Content-Type response header. If fetching content from a URL takes too long, the URL is skipped.

Specified by:
next in interface WebSpider
Returns:
a Page.
Throws:
UnknownEncodingException - thrown if the encoding of a page could not be resolved.
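
A typical crawl loop built on hashNext() and next() might look like the sketch below. To keep it runnable without the library or network access, it uses stand-in types: StubSpider, Page and UnknownEncodingException here are simplified, hypothetical versions that only mimic the documented method names. The Page fields, the checked nature of the exception, and the choice to count failed pages toward the limit are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class CrawlLoopSketch {

    // Stand-in for the library's Page; the real accessors are not documented here.
    static final class Page {
        final String url;
        final String title;
        Page(String url, String title) { this.url = url; this.title = title; }
    }

    // Stand-in for the exception documented for next() (assumed to be checked).
    static final class UnknownEncodingException extends Exception {
        UnknownEncodingException(String message) { super(message); }
    }

    // Stand-in spider that serves queued URLs without any network access.
    static final class StubSpider {
        private final Deque<String> queue = new ArrayDeque<>();
        private int maxPageNum = Integer.MAX_VALUE;
        private int visited = 0;

        void setStartUrls(String[] startUrls) { queue.addAll(Arrays.asList(startUrls)); }
        void setMaxPageNum(int maxPageNum) { this.maxPageNum = maxPageNum; }
        boolean hashNext() { return !queue.isEmpty() && visited < maxPageNum; }

        Page next() throws UnknownEncodingException {
            String url = queue.poll();
            visited++;
            // Simulate a page whose encoding cannot be resolved.
            if (url.endsWith(".bin")) {
                throw new UnknownEncodingException("cannot resolve encoding of " + url);
            }
            return new Page(url, "Title of " + url);
        }
    }

    // Drains the spider and collects the URLs of the pages that were fetched,
    // skipping any page whose encoding cannot be resolved.
    static List<String> crawl(StubSpider spider) {
        List<String> urls = new ArrayList<>();
        while (spider.hashNext()) {
            try {
                urls.add(spider.next().url);
            } catch (UnknownEncodingException e) {
                System.err.println("Skipped: " + e.getMessage());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        StubSpider spider = new StubSpider();
        spider.setStartUrls(new String[] {
            "http://localhost/a.html", "http://localhost/b.bin",
            "http://localhost/c.html", "http://localhost/d.html"});
        spider.setMaxPageNum(3);
        // Prints [http://localhost/a.html, http://localhost/c.html]
        System.out.println(crawl(spider));
    }
}
```

With the real SimpleSpider the loop body would stay the same: check hashNext(), call next(), and handle UnknownEncodingException per page so that one unreadable page does not stop the crawl.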