Parsing Engine

danbikel.parser.english
Class Treebank

java.lang.Object
  extended by danbikel.parser.lang.AbstractTreebank
      extended by danbikel.parser.english.Treebank
All Implemented Interfaces:
Treebank, Serializable

public class Treebank
extends AbstractTreebank

Provides data and methods specific to the structures found in the English Treebank (the Penn Treebank) or any other treebank that conforms to the Treebank II annotation guidelines for part-of-speech tagging and bracketing.

See Also:
Serialized Form

Field Summary
 
Fields inherited from class danbikel.parser.lang.AbstractTreebank
augmentationDelimSet, canonicalAugDelimSym, nonterminalExceptionSet
 
Constructor Summary
Treebank()
          Constructs an English Treebank object.
 
Method Summary
 String augmentationDelimiters()
          Returns a string of the three characters that serve as augmentation delimiters in the Penn Treebank: "-=|".
 Symbol baseNPLabel()
          Returns the symbol with which AbstractTraining.addBaseNPs(Sexp) will relabel base NPs.
 Symbol getCanonical(Symbol label)
          Returns a canonical mapping for the specified nonterminal label; if label already is in canonical form, it is returned.
 Symbol getCanonical(Symbol label, boolean stripAugmentations)
          Returns a canonical version of the specified nonterminal label; if label already is in canonical form, it is returned.
 boolean isBaseNP(Symbol label)
          Returns whether the specified label is for a base NP.
 boolean isComma(Symbol word)
          Returns true if the specified word is a comma.
 boolean isConjunction(Symbol label)
          Returns true if label is equal to the symbol whose print name is "CC".
 boolean isLeftParen(Symbol word)
          Returns true if the specified word is a left parenthesis.
 boolean isNP(Symbol label)
          Returns true if the canonical version of the specified label is an NP for the English Treebank.
 boolean isNullElementPreterminal(Sexp tree)
          Returns true if the specified S-expression represents a preterminal whose terminal element is the null element ("-NONE-") for the Penn Treebank.
 boolean isPossessivePreterminal(Sexp tree)
          Returns true if the specified S-expression represents a preterminal that is the possessive part of speech.
 boolean isPreterminal(Sexp tree)
          Returns true if tree represents a preterminal subtree (part-of-speech tag and word).
 boolean isPuncToRaise(Sexp preterm)
          Returns true if the specified S-expression is a preterminal whose part of speech is "," or ":".
 boolean isPunctuation(Symbol tag)
          Returns true if the specified part of speech tag is one for which AbstractTreebank.isPuncToRaise(Sexp) would return true.
 boolean isRightParen(Symbol word)
          Returns true if the specified word is a right parenthesis.
 boolean isSentence(Symbol label)
          Returns true if the specified nonterminal label represents a sentence in the Penn Treebank, that is, if the canonical version of label is equal to "S".
 boolean isVerb(Sexp preterminal)
          Returns true if preterminal represents a terminal with one of the following parts of speech: VB, VBD, VBG, VBN, VBP or VBZ.
 boolean isVerbTag(Symbol tag)
          Returns true if the specified symbol is the part of speech tag of a verb.
 boolean isWHNP(Symbol label)
          Returns true if the canonical version of the specified label is a WHNP in the English Treebank.
 Symbol NPLabel()
          Returns the symbol that AbstractTraining.addBaseNPs(Sexp) should add as a parent if a base NP is not dominated by an NP.
 Nonterminal parseNonterminal(Symbol label, Nonterminal nonterminal)
          Calls AbstractTreebank.defaultParseNonterminal(Symbol, Nonterminal) with the specified arguments.
 Symbol sentenceLabel()
          Returns the canonical label for a sentence, for de-transforming sentences that were transformed via Training.relabelSubjectlessSentences(Sexp).
 Symbol subjectAugmentation()
          Returns the symbol that is used to augment nonterminals to indicate matrix subjects in this language’s Treebank.
 Symbol subjectlessSentenceLabel()
          Returns the symbol that Training.relabelSubjectlessSentences(Sexp) will use for sentences that have no subjects.
 
Methods inherited from class danbikel.parser.lang.AbstractTreebank
addAugmentation, canonicalAugDelimiter, constructPreterminal, containsAugmentation, defaultParseNonterminal, getTag, getTraceIndex, isAugDelim, makeWord, nonTreebankDelimiter, nonTreebankLeftBracket, nonTreebankRightBracket, parseNonterminal, removeAugmentation, removeAugmentation, stripAllButIndex, stripAllButIndex, stripAugmentation, stripIndex, stripIndex
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Treebank

public Treebank()
Constructs an English Treebank object.

Method Detail

isPreterminal

public final boolean isPreterminal(Sexp tree)
Returns true if tree represents a preterminal subtree (part-of-speech tag and word). Specifically, this method returns true if tree is an instance of SexpList, has a length of 2 and has a first list element of type Symbol.

Specified by:
isPreterminal in interface Treebank
Specified by:
isPreterminal in class AbstractTreebank
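
The structural test described above can be illustrated without the parser's own classes. The following is a minimal sketch, not the library's implementation: it assumes a String stands in for Symbol and a java.util.List stands in for SexpList, and the class name PreterminalSketch is hypothetical. It checks only the three conditions stated above (list type, length of 2, symbol as first element).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the documented test; String models Symbol and
// List models SexpList. These are assumptions of this sketch only.
class PreterminalSketch {
    static boolean isPreterminal(Object tree) {
        if (!(tree instanceof List)) {
            return false;                       // a bare symbol is not a list
        }
        List<?> list = (List<?>) tree;
        // Length of 2 and a symbol in first position, as documented.
        return list.size() == 2 && list.get(0) instanceof String;
    }

    public static void main(String[] args) {
        Object preterm = Arrays.asList("NN", "dog");               // (NN dog)
        Object phrase = Arrays.asList("NP",
                                      Arrays.asList("DT", "the"),
                                      Arrays.asList("NN", "dog")); // (NP (DT the) (NN dog))
        System.out.println(isPreterminal(preterm)); // prints "true"
        System.out.println(isPreterminal(phrase));  // prints "false"
        System.out.println(isPreterminal("NN"));    // prints "false"
    }
}
```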

isSentence

public boolean isSentence(Symbol label)
Returns true if the specified nonterminal label represents a sentence in the Penn Treebank, that is, if the canonical version of label is equal to "S".

Specified by:
isSentence in interface Treebank
Specified by:
isSentence in class AbstractTreebank
See Also:
Training.relabelSubjectlessSentences(Sexp)

sentenceLabel

public Symbol sentenceLabel()
Description copied from class: AbstractTreebank
Returns the canonical label for a sentence, for de-transforming sentences that were transformed via Training.relabelSubjectlessSentences(Sexp).

Specified by:
sentenceLabel in interface Treebank
Specified by:
sentenceLabel in class AbstractTreebank

subjectlessSentenceLabel

public Symbol subjectlessSentenceLabel()
Returns the symbol that Training.relabelSubjectlessSentences(Sexp) will use for sentences that have no subjects.

Specified by:
subjectlessSentenceLabel in interface Treebank
Specified by:
subjectlessSentenceLabel in class AbstractTreebank

subjectAugmentation

public Symbol subjectAugmentation()
Description copied from class: AbstractTreebank
Returns the symbol that is used to augment nonterminals to indicate matrix subjects in this language’s Treebank.

Specified by:
subjectAugmentation in interface Treebank
Specified by:
subjectAugmentation in class AbstractTreebank
See Also:
Training.relabelSubjectlessSentences(Sexp)

isNullElementPreterminal

public boolean isNullElementPreterminal(Sexp tree)
Returns true if the specified S-expression represents a preterminal whose terminal element is the null element ("-NONE-") for the Penn Treebank.

Specified by:
isNullElementPreterminal in interface Treebank
Specified by:
isNullElementPreterminal in class AbstractTreebank
See Also:
Training.relabelSubjectlessSentences(Sexp)

isPuncToRaise

public boolean isPuncToRaise(Sexp preterm)
Returns true if the specified S-expression is a preterminal whose part of speech is "," or ":".

Specified by:
isPuncToRaise in interface Treebank
Specified by:
isPuncToRaise in class AbstractTreebank
Parameters:
preterm - the preterminal to test
See Also:
Training.raisePunctuation(Sexp)

isPunctuation

public boolean isPunctuation(Symbol tag)
Description copied from class: AbstractTreebank
Returns true if the specified part of speech tag is one for which AbstractTreebank.isPuncToRaise(Sexp) would return true.

Specified by:
isPunctuation in interface Treebank
Specified by:
isPunctuation in class AbstractTreebank
Parameters:
tag - the part of speech to test
See Also:
AbstractTreebank.isPuncToRaise(Sexp)
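
The relationship between isPuncToRaise(Sexp) and isPunctuation(Symbol) can be sketched as two methods that share one tag test. This is an illustrative sketch under stated assumptions, not the class's actual code: plain strings stand in for Symbol, a preterminal is passed as a tag and a word rather than as a Sexp, and the class name PunctuationSketch is hypothetical.

```java
// Hypothetical sketch: String models Symbol, and a preterminal is given as
// (tag, word) instead of a Sexp. Only the documented tags "," and ":" count.
class PunctuationSketch {
    // True for exactly the tags for which isPuncToRaise would return true.
    static boolean isPunctuation(String tag) {
        return ",".equals(tag) || ":".equals(tag);
    }

    // The preterminal qualifies when its part of speech is "," or ":";
    // the word itself plays no role in this sketch.
    static boolean isPuncToRaise(String tag, String word) {
        return isPunctuation(tag);
    }

    public static void main(String[] args) {
        System.out.println(isPuncToRaise(",", ","));  // prints "true"
        System.out.println(isPuncToRaise(":", ";"));  // prints "true"
        System.out.println(isPuncToRaise(".", "."));  // prints "false"
        System.out.println(isPunctuation("NN"));      // prints "false"
    }
}
```

Routing both checks through a single tag test keeps the two methods consistent by construction, which is the contract the descriptions above state.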

isPossessivePreterminal

public boolean isPossessivePreterminal(Sexp tree)
Returns true if the specified S-expression represents a preterminal that is the possessive part of speech. This method is intended to be used by implementations of AbstractTraining.addBaseNPs(Sexp).

Specified by:
isPossessivePreterminal in interface Treebank
Specified by:
isPossessivePreterminal in class AbstractTreebank
See Also:
Training.addBaseNPs(Sexp)

isNP

public boolean isNP(Symbol label)
Returns true if the canonical version of the specified label is an NP for the English Treebank.

Specified by:
isNP in interface Treebank
Specified by:
isNP in class AbstractTreebank
Parameters:
label - the label to test
See Also:
AbstractTraining.addBaseNPs(Sexp)

isBaseNP

public boolean isBaseNP(Symbol label)
Description copied from class: AbstractTreebank
Returns whether the specified label is for a base NP. The default implementation here simply tests for object equality between the specified label and the label returned by AbstractTreebank.baseNPLabel(). If a particular language package can have various types of base NP labels (such as those bearing node augmentations), then this method should be overridden.

Specified by:
isBaseNP in interface Treebank
Overrides:
isBaseNP in class AbstractTreebank
Parameters:
label - the label to test
Returns:
whether the specified label is for a base NP.

baseNPLabel

public Symbol baseNPLabel()
Returns the symbol with which AbstractTraining.addBaseNPs(Sexp) will relabel base NPs.

Specified by:
baseNPLabel in interface Treebank
Specified by:
baseNPLabel in class AbstractTreebank
See Also:
AbstractTraining.addBaseNPs(Sexp)

isWHNP

public boolean isWHNP(Symbol label)
Returns true if the canonical version of the specified label is a WHNP in the English Treebank.

Specified by:
isWHNP in interface Treebank
Specified by:
isWHNP in class AbstractTreebank
See Also:
AbstractTraining.addGapInformation(Sexp)

NPLabel

public Symbol NPLabel()
Returns the symbol that AbstractTraining.addBaseNPs(Sexp) should add as a parent if a base NP is not dominated by an NP.

Specified by:
NPLabel in interface Treebank
Specified by:
NPLabel in class AbstractTreebank
See Also:
Training.addBaseNPs(Sexp)

isConjunction

public boolean isConjunction(Symbol label)
Returns true if label is equal to the symbol whose print name is "CC".

Specified by:
isConjunction in interface Treebank
Specified by:
isConjunction in class AbstractTreebank

isVerb

public boolean isVerb(Sexp preterminal)
Returns true if preterminal represents a terminal with one of the following parts of speech: VB, VBD, VBG, VBN, VBP or VBZ. It is an error to call this method with a Sexp object for which isPreterminal(Sexp) returns false.

Specified by:
isVerb in interface Treebank
Specified by:
isVerb in class AbstractTreebank
Parameters:
preterminal - the preterminal to test
Returns:
true if preterminal is a verb
See Also:
HeadTreeNode, Trainer

isVerbTag

public boolean isVerbTag(Symbol tag)
Description copied from class: AbstractTreebank
Returns true if the specified symbol is the part of speech tag of a verb. This method should return true for exactly the same parts of speech for which AbstractTreebank.isVerb(Sexp) returns true, and is used to calculate the distance metric while decoding.

Specified by:
isVerbTag in interface Treebank
Specified by:
isVerbTag in class AbstractTreebank
See Also:
CKYItem.containsVerb(), Decoder
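
The requirement that isVerb(Sexp) and isVerbTag(Symbol) agree on the same six tags can be sketched as follows. This is a hypothetical illustration rather than the library's code: strings stand in for Symbol, a two-element list stands in for a preterminal Sexp, and the class name VerbTagSketch is invented for the example. The source says only that calling isVerb on a non-preterminal is an error; throwing IllegalArgumentException is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: String models Symbol, List<String> models a
// preterminal of the form (tag word).
class VerbTagSketch {
    // The six verb tags listed in the isVerb description.
    static final Set<String> VERB_TAGS = new HashSet<>(
        Arrays.asList("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"));

    static boolean isVerbTag(String tag) {
        return VERB_TAGS.contains(tag);
    }

    static boolean isVerb(List<String> preterminal) {
        // Assumption: signal misuse with an exception; the documented
        // behavior for non-preterminals is unspecified beyond "an error".
        if (preterminal.size() != 2) {
            throw new IllegalArgumentException("not a preterminal: " + preterminal);
        }
        return isVerbTag(preterminal.get(0));
    }

    public static void main(String[] args) {
        System.out.println(isVerb(Arrays.asList("VBD", "ran")));  // prints "true"
        System.out.println(isVerb(Arrays.asList("MD", "will")));  // prints "false"
        System.out.println(isVerbTag("VBZ"));                     // prints "true"
    }
}
```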

isComma

public boolean isComma(Symbol word)
Description copied from class: AbstractTreebank
Returns true if the specified word is a comma. This method is used by the Decoder class when performing the comma constraint on chart items.

Specified by:
isComma in interface Treebank
Specified by:
isComma in class AbstractTreebank
Parameters:
word - the word to test
See Also:
Settings.decoderUseCommaConstraint

isLeftParen

public boolean isLeftParen(Symbol word)
Description copied from class: AbstractTreebank
Returns true if the specified word is a left parenthesis. This method is used by the Decoder class when performing the comma constraint on chart items.

Specified by:
isLeftParen in interface Treebank
Specified by:
isLeftParen in class AbstractTreebank
Parameters:
word - the word to test
See Also:
Settings.decoderUseCommaConstraint

isRightParen

public boolean isRightParen(Symbol word)
Description copied from class: AbstractTreebank
Returns true if the specified word is a right parenthesis. This method is used by the Decoder class when performing the comma constraint on chart items.

Specified by:
isRightParen in interface Treebank
Specified by:
isRightParen in class AbstractTreebank
Parameters:
word - the word to test
See Also:
Settings.decoderUseCommaConstraint

getCanonical

public final Symbol getCanonical(Symbol label)
Returns a canonical mapping for the specified nonterminal label; if label already is in canonical form, it is returned. The canonical mapping refers to transformations performed on nonterminals during the training process. Before obtaining a label's canonical form, it is also stripped of all Treebank augmentations, meaning that only the characters before the first occurrence of '-', '=' or '|' are kept.

Specified by:
getCanonical in interface Treebank
Specified by:
getCanonical in class AbstractTreebank
Parameters:
label - the label to be canonicalized
Returns:
a Symbol with the same print name as label, except that all training transformations have been undone and all Treebank augmentations have been stripped
See Also:
HeadFinder.findHead(Sexp)
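
The augmentation-stripping step described above (keep only the characters before the first '-', '=' or '|') can be sketched on plain strings. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, strings stand in for Symbol, and the sketch models neither the canonical mapping itself nor any special-case labels the real class may exempt.

```java
// Hypothetical sketch of the stripping step only; String models Symbol.
class AugmentationSketch {
    // The three delimiter characters named in the description.
    static final String DELIMS = "-=|";

    static String stripAugmentations(String label) {
        for (int i = 0; i < label.length(); i++) {
            if (DELIMS.indexOf(label.charAt(i)) >= 0) {
                return label.substring(0, i);   // keep text before first delimiter
            }
        }
        return label;                           // no delimiter: already bare
    }

    public static void main(String[] args) {
        System.out.println(stripAugmentations("NP-SBJ-1")); // prints "NP"
        System.out.println(stripAugmentations("NP=2"));     // prints "NP"
        System.out.println(stripAugmentations("ADVP|PRT")); // prints "ADVP"
        System.out.println(stripAugmentations("S"));        // prints "S"
    }
}
```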

getCanonical

public final Symbol getCanonical(Symbol label,
                                 boolean stripAugmentations)
Description copied from interface: Treebank
Returns a canonical version of the specified nonterminal label; if label already is in canonical form, it is returned.

Specified by:
getCanonical in interface Treebank
Specified by:
getCanonical in class AbstractTreebank
Parameters:
label - the label to be canonicalized
stripAugmentations - indicates whether to strip any augmentations from the specified label before attempting to get its canonical form
Returns:
the canonical version of the specified label

parseNonterminal

public Nonterminal parseNonterminal(Symbol label,
                                    Nonterminal nonterminal)
Calls AbstractTreebank.defaultParseNonterminal(Symbol, Nonterminal) with the specified arguments.

Specified by:
parseNonterminal in interface Treebank
Specified by:
parseNonterminal in class AbstractTreebank
Parameters:
label - the nonterminal label to parse
nonterminal - the Nonterminal object to fill with the components of label

augmentationDelimiters

public String augmentationDelimiters()
Returns a string of the three characters that serve as augmentation delimiters in the Penn Treebank: "-=|".

Specified by:
augmentationDelimiters in interface Treebank
Specified by:
augmentationDelimiters in class AbstractTreebank
See Also:
AbstractTreebank.stripAugmentation(Symbol), AbstractTreebank.defaultParseNonterminal(Symbol,Nonterminal)

Author: Dan Bikel.