org.jbox.webSpider.simpleSpider
Class HtmlFetcher

java.lang.Object
  extended by org.jbox.webSpider.simpleSpider.HtmlFetcher

public class HtmlFetcher
extends java.lang.Object

An HTML fetcher.

Version:
1.0
Author:
YiBin.H

Field Summary
protected  java.net.URLConnection urlConn
           
 
Constructor Summary
HtmlFetcher()
           
 
Method Summary
 void connect(java.lang.String urlStr)
          Connect to the specified URL.
 java.lang.String fectch()
          Fetch the text of a page.
protected  java.lang.String fetchEncoding()
          Fetch the encoding of a page.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

urlConn

protected java.net.URLConnection urlConn
Constructor Detail

HtmlFetcher

public HtmlFetcher()
Method Detail

connect

public void connect(java.lang.String urlStr)
             throws java.io.IOException
Connect to the specified URL.

Parameters:
urlStr - the URL to connect to.
Throws:
java.io.IOException - thrown if the connection to the URL fails.

fetchEncoding

protected java.lang.String fetchEncoding()
                                  throws java.io.IOException,
                                         UnknownEncodingException
Fetch the encoding of a page. If "charset=" appears in the Content-Type response header, this method returns its value. Otherwise, the spider downloads the page content until it encounters the string "charset=" and returns the value that follows. If "charset=" is not found in the content either, the method throws an UnknownEncodingException.

Returns:
the encoding of the page.
Throws:
java.io.IOException - thrown if downloading the HTML of the page fails.
UnknownEncodingException - thrown if the encoding of the page cannot be determined.
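
The lookup order described above (response header first, then page content) can be sketched as follows. This is a minimal illustration, not the actual HtmlFetcher implementation: the class CharsetSniffer and its methods extractCharset and resolve are hypothetical names, the regular expression is one plausible way to read the value after "charset=", and IllegalStateException stands in for UnknownEncodingException, whose package is not shown on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of HtmlFetcher.
final class CharsetSniffer {

    // Matches "charset=" (case-insensitive), an optional quote, then the value
    // up to the next quote, whitespace, ';', '>' or '/'.
    private static final Pattern CHARSET =
            Pattern.compile("charset=[\"']?([^\"'\\s;>/]+)", Pattern.CASE_INSENSITIVE);

    private CharsetSniffer() {
    }

    // Returns the value following "charset=" in the text, or null if absent.
    static String extractCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Tries the Content-Type header first, then the downloaded page content.
    // Assumption: IllegalStateException replaces UnknownEncodingException here.
    static String resolve(String contentType, String pageContent) {
        String charset = extractCharset(contentType);
        if (charset == null) {
            charset = extractCharset(pageContent);
        }
        if (charset == null) {
            throw new IllegalStateException("no charset found in header or content");
        }
        return charset;
    }
}
```

In this sketch the header value takes precedence whenever it names a charset, which mirrors the order given in the description; the page content is consulted only as a fallback.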

fectch

public java.lang.String fectch()
                        throws java.io.IOException,
                               UnknownEncodingException
Fetch the text of a page.

Returns:
the text of the page.
Throws:
java.io.IOException - thrown if fetching the HTML of the page fails.
UnknownEncodingException - thrown if the encoding of the page cannot be resolved.