org.jbox.webSpider.simpleSpider
Class HtmlVisitor

java.lang.Object
  extended by org.htmlparser.visitors.NodeVisitor
      extended by org.htmlparser.beans.StringBean
          extended by org.jbox.webSpider.simpleSpider.HtmlVisitor
All Implemented Interfaces:
java.io.Serializable

public class HtmlVisitor
extends org.htmlparser.beans.StringBean

An HTML text visitor.

Version:
1.0
Author:
YiBin.H
See Also:
Serialized Form

Field Summary
 
Fields inherited from class org.htmlparser.beans.StringBean
mBuffer, mCollapse, mCollapseState, mIsPre, mIsScript, mIsStyle, mLinks, mParser, mPropertySupport, mReplaceSpace, mStrings, PROP_COLLAPSE_PROPERTY, PROP_CONNECTION_PROPERTY, PROP_LINKS_PROPERTY, PROP_REPLACE_SPACE_PROPERTY, PROP_STRINGS_PROPERTY, PROP_URL_PROPERTY
 
Constructor Summary
HtmlVisitor(java.lang.String[] rules)
          Constructs a new HtmlVisitor object with a String array of rules.
 
Method Summary
 java.util.LinkedList<java.lang.String> getLinksUnderRules()
          Returns the links in an HTML page that match the rules.
 java.lang.String getText()
          Returns the text with HTML tags removed.
 java.lang.String getTitle()
          Returns the title of the page.
 void parse(java.lang.String html, java.lang.String encoding)
          Parses HTML text using the specified encoding.
 void visitTag(org.htmlparser.Tag tag)
          Visits an HTML tag.
 
Methods inherited from class org.htmlparser.beans.StringBean
addPropertyChangeListener, carriageReturn, collapse, extractStrings, getCollapse, getConnection, getLinks, getReplaceNonBreakingSpaces, getStrings, getURL, main, removePropertyChangeListener, setCollapse, setConnection, setLinks, setReplaceNonBreakingSpaces, setStrings, setURL, updateStrings, visitEndTag, visitStringNode
 
Methods inherited from class org.htmlparser.visitors.NodeVisitor
beginParsing, finishedParsing, shouldRecurseChildren, shouldRecurseSelf, visitRemarkNode
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlVisitor

public HtmlVisitor(java.lang.String[] rules)
Constructs a new HtmlVisitor object with a String array of rules.

Parameters:
rules - String array containing rules written as regular expressions.
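
The page does not say how a rule is applied to a link. As a rough illustration only, the sketch below assumes that a link is accepted when it fully matches at least one rule. The class RuleFilterSketch and the method filterLinks are hypothetical names that are not part of HtmlVisitor, and the sketch uses only java.util.regex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual HtmlVisitor implementation.
public class RuleFilterSketch {

    // Keeps the links that fully match at least one rule (an assumption;
    // the real class may use partial matching or different semantics).
    public static LinkedList<String> filterLinks(List<String> links, String[] rules) {
        // Compile every rule once so that it can be reused for each link.
        List<Pattern> patterns = new ArrayList<>();
        for (String rule : rules) {
            patterns.add(Pattern.compile(rule));
        }

        LinkedList<String> accepted = new LinkedList<>();
        for (String link : links) {
            for (Pattern pattern : patterns) {
                if (pattern.matcher(link).matches()) {
                    accepted.add(link);
                    break; // one matching rule is enough
                }
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        String[] rules = { "https?://example\\.com/news/.*" };
        List<String> links = Arrays.asList(
                "http://example.com/news/1.html",
                "http://example.com/about.html");
        System.out.println(filterLinks(links, rules));
    }
}
```

Note that a literal dot must be escaped in a rule ("\\." in Java source), otherwise it matches any character.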
Method Detail

visitTag

public void visitTag(org.htmlparser.Tag tag)
Visits an HTML tag.

Overrides:
visitTag in class org.htmlparser.beans.StringBean

getText

public java.lang.String getText()
Returns the text with HTML tags removed.

Returns:
the text of the page with HTML tags removed.

getLinksUnderRules

public java.util.LinkedList<java.lang.String> getLinksUnderRules()
Returns the links in an HTML page that match the rules.

Returns:
a LinkedList containing the links in the HTML page that match the rules.

getTitle

public java.lang.String getTitle()
Returns the title of the page.

Returns:
the title of the page.

parse

public void parse(java.lang.String html,
                  java.lang.String encoding)
           throws org.htmlparser.util.ParserException
Parses HTML text using the specified encoding.

Parameters:
html - the HTML text to parse.
encoding - the name of the character encoding to use when parsing.
Throws:
org.htmlparser.util.ParserException - if the HTML cannot be parsed.
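
The page does not list the accepted encoding names. Assuming they are Java charset names such as "UTF-8" (an assumption, not something stated here), a caller could check a name before passing it to parse. The class EncodingCheckSketch and the method isSupportedEncoding below are hypothetical helpers, shown only as a sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper; not part of HtmlVisitor.
public class EncodingCheckSketch {

    // Returns true if the running JVM knows a charset with the given name.
    public static boolean isSupportedEncoding(String name) {
        if (name == null) {
            return false; // Charset.isSupported rejects null
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name contains characters that are not legal in a charset name
        }
    }

    public static void main(String[] args) {
        System.out.println(isSupportedEncoding("UTF-8"));
        System.out.println(isSupportedEncoding("no such charset"));
    }
}
```

A caller that gets false could fall back to a default such as "UTF-8" instead of calling parse with an unknown name.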