Parsing Engine

danbikel.parser
Interface Treebank

All Known Implementing Classes:
AbstractTreebank, BrokenTreebank, Treebank, Treebank, Treebank

public interface Treebank

A Treebank implementation provides data and methods specific to the structures found in a particular Treebank.

A language package must provide an implementation of this interface.


Method Summary
 void addAugmentation(Nonterminal nonterminal, Symbol augmentation)
          Adds the specified augmentation to the end of the (possibly empty) augmentation list of the specified Nonterminal object.
 String augmentationDelimiters()
          Returns a string whose characters are the set of delimiters for complex nonterminal labels.
 Symbol baseNPLabel()
          Returns the symbol with which Training.addBaseNPs(Sexp) will relabel core NPs.
 char canonicalAugDelimiter()
          Returns the first character of the string returned by augmentationDelimiters(), which will be considered the "canonical" augmentation delimiter when adding new augmentations, such as the argument augmentations added by implementations of Training.identifyArguments(Sexp).
 Sexp constructPreterminal(Word word)
          Converts a Word object into a preterminal subtree.
 boolean containsAugmentation(Symbol nonterminal, Symbol augmentation)
          Provides an efficient, thread-safe method for testing whether the specified nonterminal contains the specified augmentation (without parsing the nonterminal).
 void defaultParseNonterminal(Symbol label, Nonterminal nonterminal)
          Fills in the specified Nonterminal object to represent all the components of a complex nonterminal annotation: the base label, any augmentations and any index.
 Symbol getCanonical(Symbol label)
          Returns a canonical version of the specified nonterminal label; if label already is in canonical form, it is returned.
 Symbol getCanonical(Symbol label, boolean stripAugmentations)
          Returns a canonical version of the specified nonterminal label; if label already is in canonical form, it is returned.
 Symbol getTag(Sexp preterminal)
          Gets the component of the preterminal tree that corresponds to the part of speech tag.
 int getTraceIndex(Sexp preterm, Nonterminal nonterminal)
          Returns the index of a trace for the specified null element preterminal.
 boolean isAugDelim(Sexp sexp)
          Returns whether the specified S-expression is a symbol that is an augmentation delimiter for a complex nonterminal label.
 boolean isBaseNP(Symbol label)
          Returns whether the specified label is for a base NP.
 boolean isComma(Symbol word)
          Returns true if the specified word is a comma.
 boolean isConjunction(Symbol label)
          Returns true if the canonical version of the specified label is a conjunction tag or nonterminal in a particular Treebank.
 boolean isLeftParen(Symbol word)
          Returns true if the specified word is a left parenthesis.
 boolean isNP(Symbol label)
          Returns true if the canonical version of the specified label is an NP for the current language's Treebank.
 boolean isNullElementPreterminal(Sexp tree)
          Returns true if the specified S-expression represents a preterminal whose terminal element is the null element for the current language's Treebank.
 boolean isPossessivePreterminal(Sexp tree)
          Returns true if the specified S-expression represents a preterminal that is the possessive part of speech.
 boolean isPreterminal(Sexp tree)
          Returns whether tree represents a preterminal subtree in the parse trees for this language's Treebank.
 boolean isPuncToRaise(Sexp preterm)
          Returns true if the specified S-expression represents a preterminal and a part-of-speech tag that indicates punctuation to be raised when running Training.raisePunctuation(Sexp).
 boolean isPunctuation(Symbol tag)
          Returns true if the specified part of speech tag is one for which isPuncToRaise(Sexp) would return true.
 boolean isRightParen(Symbol word)
          Returns true if the specified word is a right parenthesis.
 boolean isSentence(Symbol label)
          Returns true if the specified nonterminal label represents a sentence in the current language's Treebank.
 boolean isVerb(Sexp preterminal)
          Returns true if the specified preterminal is that of a verb.
 boolean isVerbTag(Symbol tag)
          Returns true if the specified symbol is the part of speech tag of a verb.
 boolean isWHNP(Symbol label)
          Returns true if the canonical version of the specified label is an NP that undergoes WH-movement in a particular Treebank.
 Word makeWord(Sexp preterminal)
          Constructs a Word object from the specified preterminal subtree.
 char nonTreebankDelimiter()
          Returns a delimiter not already in use by the current treebank, for use when constructing lexicalized nonterminals when the Settings.decoderOutputHeadLexicalizedLabels is true.
 char nonTreebankLeftBracket()
          Returns a left-bracket character that is not an existing metacharacter in the current treebank, for use when constructing lexicalized nonterminals when the Settings.decoderOutputHeadLexicalizedLabels is true.
 char nonTreebankRightBracket()
          Returns a right-bracket character that is not an existing metacharacter in the current treebank, for use when constructing lexicalized nonterminals when the Settings.decoderOutputHeadLexicalizedLabels is true.
 Symbol NPLabel()
          Returns the symbol that Training.addBaseNPs(Sexp) should add as a parent if a base NP is not dominated by an NP.
 Nonterminal parseNonterminal(Symbol label)
          Returns a Nonterminal object to represent all the components of a complex nonterminal annotation: the base label, any augmentations and any index.
 Nonterminal parseNonterminal(Symbol label, Nonterminal nonterminal)
          Identical to parseNonterminal(Symbol), except that instead of returning a newly-created Nonterminal object, this method merely modifies the specified Nonterminal object.
 boolean removeAugmentation(Nonterminal nonterminal, Symbol augmentation)
          Removes the specified augmentation from the augmentation list of the specified Nonterminal object, and the previous augmentation delimiter.
 Sexp removeAugmentation(Sexp sexp, Nonterminal nonterminal, Symbol augmentation)
          Removes the specified nonterminal augmentation from the specified S-expression, using the specified Nonterminal object for temporary storage.
 Symbol sentenceLabel()
          Returns the canonical label for a sentence, for de-transforming sentences that were transformed via Training.relabelSubjectlessSentences(Sexp).
 Symbol stripAllButIndex(Symbol label)
          Returns a symbol identical to the specified label, except all augmentations other than the index will be removed.
 Symbol stripAllButIndex(Symbol label, Nonterminal nonterminal)
          Identical to stripAllButIndex(Symbol), except that instead of creating a new Nonterminal object for use by parseNonterminal(Symbol,Nonterminal), this method uses the specified nonterminal object.
 Symbol stripAugmentation(Symbol label)
          Returns the Symbol created by stripping off all augmentations, that is, all characters after and including the first character that appears in the string returned by augmentationDelimiters().
 Symbol stripIndex(Symbol label)
          Returns label, but stripped of any index augmentation.
 Symbol stripIndex(Symbol label, Nonterminal nonterminal)
          Identical to stripIndex(Symbol), except that instead of creating a new Nonterminal object for use by parseNonterminal(Symbol,Nonterminal), this method simply passes the specified nonterminal object.
 Symbol subjectAugmentation()
          Returns the symbol that is used to augment nonterminals to indicate matrix subjects in the current language's Treebank.
 Symbol subjectlessSentenceLabel()
          Returns the symbol with which Training.relabelSubjectlessSentences(Sexp) will relabel sentences when they have no subjects.
 

Method Detail

isPreterminal

boolean isPreterminal(Sexp tree)
Returns whether tree represents a preterminal subtree in the parse trees for this language's Treebank. Typically, preterminals are part-of-speech tags.


getTag

Symbol getTag(Sexp preterminal)
Gets the component of the preterminal tree that corresponds to the part of speech tag.

Parameters:
preterminal - a tree that is assumed to be a preterminal
Returns:
the symbol in preterminal that is a part of speech

makeWord

Word makeWord(Sexp preterminal)
Constructs a Word object from the specified preterminal subtree.

Parameters:
preterminal - a tree that is assumed to be a preterminal
Returns:
a Word object constructed from the specified preterminal subtree

constructPreterminal

Sexp constructPreterminal(Word word)
Converts a Word object into a preterminal subtree.

Parameters:
word - the word object from which to create a preterminal subtree
Returns:
a preterminal subtree constructed from word

getCanonical

Symbol getCanonical(Symbol label)
Returns a canonical version of the specified nonterminal label; if label already is in canonical form, it is returned.

Parameters:
label - the label to be canonicalized

getCanonical

Symbol getCanonical(Symbol label,
                    boolean stripAugmentations)
Returns a canonical version of the specified nonterminal label; if label already is in canonical form, it is returned.

Parameters:
label - the label to be canonicalized
stripAugmentations - indicates whether to strip any augmentations from the specified label before attempting to get its canonical form
Returns:
the canonical version of the specified label

isSentence

boolean isSentence(Symbol label)
Returns true if the specified nonterminal label represents a sentence in the current language's Treebank. This method is intended to be used by implementations of Training.relabelSubjectlessSentences(Sexp).


sentenceLabel

Symbol sentenceLabel()
Returns the canonical label for a sentence, for de-transforming sentences that were transformed via Training.relabelSubjectlessSentences(Sexp).


subjectlessSentenceLabel

Symbol subjectlessSentenceLabel()
Returns the symbol with which Training.relabelSubjectlessSentences(Sexp) will relabel sentences when they have no subjects.


subjectAugmentation

Symbol subjectAugmentation()
Returns the symbol that is used to augment nonterminals to indicate matrix subjects in the current language's Treebank.

See Also:
Training.relabelSubjectlessSentences(Sexp)

isNullElementPreterminal

boolean isNullElementPreterminal(Sexp tree)
Returns true if the specified S-expression represents a preterminal whose terminal element is the null element for the current language's Treebank. This method is intended to be used by implementations of Training.relabelSubjectlessSentences(Sexp).

See Also:
Training.relabelSubjectlessSentences(Sexp)

getTraceIndex

int getTraceIndex(Sexp preterm,
                  Nonterminal nonterminal)
Returns the index of a trace for the specified null element preterminal. If preterm is not a null element preterminal (that is, a preterminal for which isNullElementPreterminal(Sexp) returns false), the semantics of this method are undefined.

Parameters:
preterm - the null element preterminal whose trace index is to be returned
nonterminal - the object used as the second argument to parseNonterminal(Symbol,Nonterminal)
Returns:
the index of the trace of the terminal contained in preterm, or -1 if the null element does not have an index

isPuncToRaise

boolean isPuncToRaise(Sexp preterm)
Returns true if the specified S-expression represents a preterminal and a part-of-speech tag that indicates punctuation to be raised when running Training.raisePunctuation(Sexp). If punctuation raising is not desirable for a particular language package, this method may be implemented simply to return false.

Parameters:
preterm - the preterminal to test
See Also:
Training.raisePunctuation(Sexp)

isPunctuation

boolean isPunctuation(Symbol tag)
Returns true if the specified part of speech tag is one for which isPuncToRaise(Sexp) would return true.

Parameters:
tag - the part of speech to test
See Also:
isPuncToRaise(Sexp)

isPossessivePreterminal

boolean isPossessivePreterminal(Sexp tree)
Returns true if the specified S-expression represents a preterminal that is the possessive part of speech. This method is intended to be used by implementations of Training.addBaseNPs(Sexp).

See Also:
Training.addBaseNPs(Sexp)

isNP

boolean isNP(Symbol label)
Returns true if the canonical version of the specified label is an NP for the current language's Treebank.

Parameters:
label - the label to test
See Also:
Training.addBaseNPs(Sexp)

baseNPLabel

Symbol baseNPLabel()
Returns the symbol with which Training.addBaseNPs(Sexp) will relabel core NPs.
N.B.: This method should not be used as a predicate for testing whether a particular nonterminal label is that of a base NP. For that purpose, use isBaseNP(Symbol).

See Also:
Training.addBaseNPs(Sexp)

isBaseNP

boolean isBaseNP(Symbol label)
Returns whether the specified label is for a base NP.

Parameters:
label - the label to test
Returns:
whether the specified label is for a base NP.

isWHNP

boolean isWHNP(Symbol label)
Returns true if the canonical version of the specified label is an NP that undergoes WH-movement in a particular Treebank. This method is used by Training.addGapInformation(Sexp). If a particular language package does not require gap information, then this method may be implemented simply to return false.

See Also:
Training.addGapInformation(Sexp)

NPLabel

Symbol NPLabel()
Returns the symbol that Training.addBaseNPs(Sexp) should add as a parent if a base NP is not dominated by an NP.

See Also:
Training.addBaseNPs(Sexp)

isConjunction

boolean isConjunction(Symbol label)
Returns true if the canonical version of the specified label is a conjunction tag or nonterminal in a particular Treebank.


isVerb

boolean isVerb(Sexp preterminal)
Returns true if the specified preterminal is that of a verb. This method is used by HeadTreeNode to determine if a particular subtree contains a verb, which is in turn used by Trainer to calculate the distance metric, which depends on whether a verb occurs in the subtrees of the previous modifiers. It is the responsibility of the caller to ensure that preterminal is a Sexp object for which isPreterminal(Sexp) returns true.

See Also:
HeadTreeNode, Trainer

isVerbTag

boolean isVerbTag(Symbol tag)
Returns true if the specified symbol is the part of speech tag of a verb. This method should return true for exactly the same parts of speech for which isVerb(Sexp) returns true, and is used to calculate the distance metric while decoding.

See Also:
CKYItem.containsVerb(), Decoder

isComma

boolean isComma(Symbol word)
Returns true if the specified word is a comma. This method is used by the Decoder class when performing the comma constraint on chart items.

Parameters:
word - the word to test
See Also:
Settings.decoderUseCommaConstraint

isLeftParen

boolean isLeftParen(Symbol word)
Returns true if the specified word is a left parenthesis. This method is used by the Decoder class when performing the comma constraint on chart items.

Parameters:
word - the word to test
See Also:
Settings.decoderUseCommaConstraint

isRightParen

boolean isRightParen(Symbol word)
Returns true if the specified word is a right parenthesis. This method is used by the Decoder class when performing the comma constraint on chart items.

Parameters:
word - the word to test
See Also:
Settings.decoderUseCommaConstraint

augmentationDelimiters

String augmentationDelimiters()
Returns a string whose characters are the set of delimiters for complex nonterminal labels.

Implementation note: The return value of this method should be used only to implement the other methods of this interface. Construction of and predicates over complex nonterminals should be handled by the other methods specified in this interface that either take a Nonterminal as an argument or return a Nonterminal.

See Also:
isAugDelim(Sexp), stripAugmentation(Symbol), defaultParseNonterminal(Symbol,Nonterminal)

canonicalAugDelimiter

char canonicalAugDelimiter()
Returns the first character of the string returned by augmentationDelimiters(), which will be considered the "canonical" augmentation delimiter when adding new augmentations, such as the argument augmentations added by implementations of Training.identifyArguments(Sexp).


nonTreebankLeftBracket

char nonTreebankLeftBracket()
Returns a left-bracket character that is not an existing metacharacter in the current treebank, for use when constructing lexicalized nonterminals when the Settings.decoderOutputHeadLexicalizedLabels is true. For most treebanks, '[' is a good default.

Returns:
a left-bracket character that is not an existing metacharacter in the current treebank

nonTreebankRightBracket

char nonTreebankRightBracket()
Returns a right-bracket character that is not an existing metacharacter in the current treebank, for use when constructing lexicalized nonterminals when the Settings.decoderOutputHeadLexicalizedLabels is true. For most treebanks, ']' is a good default.

Returns:
a right-bracket character that is not an existing metacharacter in the current treebank

nonTreebankDelimiter

char nonTreebankDelimiter()
Returns a delimiter not already in use by the current treebank, for use when constructing lexicalized nonterminals when the Settings.decoderOutputHeadLexicalizedLabels is true.

Returns:
a delimiter not already in use by the current treebank

stripAugmentation

Symbol stripAugmentation(Symbol label)
Returns the Symbol created by stripping off all augmentations, that is, all characters after and including the first character that appears in the string returned by augmentationDelimiters().

Parameters:
label - the potentially-complex nonterminal label to be stripped
Returns:
a version of label with all augmentations removed
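
The stripping rule described above can be sketched on plain strings. This is a minimal illustration, not the library's implementation: it uses String in place of Symbol, and the delimiter set "-=" is a hypothetical stand-in for whatever a language package returns from augmentationDelimiters(). Note that a label that itself begins with a delimiter character would be reduced to the empty string by this sketch.

```java
final class StripAugmentationSketch {
    // Hypothetical delimiter set; a real language package supplies its own
    // through augmentationDelimiters().
    static final String DELIMS = "-=";

    // Returns the prefix of label that precedes the first delimiter character,
    // or label itself if it contains no delimiter.
    static String stripAugmentation(String label) {
        for (int i = 0; i < label.length(); i++) {
            if (DELIMS.indexOf(label.charAt(i)) != -1) {
                return label.substring(0, i);
            }
        }
        return label;
    }

    public static void main(String[] args) {
        System.out.println(stripAugmentation("NP-SBJ-1")); // prints NP
        System.out.println(stripAugmentation("VP"));       // prints VP
    }
}
```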

stripIndex

Symbol stripIndex(Symbol label)
Returns label, but stripped of any index augmentation. This method assumes that the index will always be the final augmentation in a complex nonterminal label.
N.B.: This method will create a new Nonterminal object, to be filled in by stripIndex(Symbol,Nonterminal).

Parameters:
label - the nonterminal to be stripped of any possible index
Returns:
a Symbol that is identical to label, except that all characters after and including the final delimiter are removed if the final augmentation is composed entirely of digits
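
The return contract above (drop the final delimiter and everything after it only when the final augmentation is all digits) might look like the following on plain strings. The class name and the delimiter set "-=" are assumptions made for illustration; the real method works on Symbol objects and goes through parseNonterminal(Symbol,Nonterminal).

```java
final class StripIndexSketch {
    // Hypothetical delimiter set standing in for augmentationDelimiters().
    static final String DELIMS = "-=";

    // Removes a trailing index such as "-1" if, and only if, the text after
    // the final delimiter consists entirely of digits.
    static String stripIndex(String label) {
        int last = -1;
        for (int i = label.length() - 1; i >= 0; i--) {
            if (DELIMS.indexOf(label.charAt(i)) != -1) {
                last = i;
                break;
            }
        }
        // No delimiter, or nothing after the final delimiter: no index.
        if (last == -1 || last == label.length() - 1) {
            return label;
        }
        for (int i = last + 1; i < label.length(); i++) {
            if (!Character.isDigit(label.charAt(i))) {
                return label; // final augmentation is not an index
            }
        }
        return label.substring(0, last);
    }

    public static void main(String[] args) {
        System.out.println(stripIndex("NP-SBJ-1")); // prints NP-SBJ
        System.out.println(stripIndex("NP-SBJ"));   // prints NP-SBJ
    }
}
```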

stripIndex

Symbol stripIndex(Symbol label,
                  Nonterminal nonterminal)
Identical to stripIndex(Symbol), except that instead of creating a new Nonterminal object for use by parseNonterminal(Symbol,Nonterminal), this method simply passes the specified nonterminal object. In a sequential run, this method provides maximum efficiency, as only one Nonterminal object need be created at the beginning of the run.


stripAllButIndex

Symbol stripAllButIndex(Symbol label)
Returns a symbol identical to the specified label, except all augmentations other than the index will be removed. If label had no index to begin with, then this method is functionally identical to stripAugmentation(Symbol).

Parameters:
label - the nonterminal label to strip of non-index augmentations
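
As a rough illustration of the behavior described above, the sketch below keeps the base label and a trailing numeric index and discards everything in between. It operates on plain strings with a hypothetical delimiter set "-=". Keeping the delimiter that originally preceded the index is an assumption of this sketch; the documentation above does not say which delimiter the real method emits.

```java
final class StripAllButIndexSketch {
    // Hypothetical delimiter set standing in for augmentationDelimiters().
    static final String DELIMS = "-=";

    static boolean isDelim(char c) {
        return DELIMS.indexOf(c) != -1;
    }

    static boolean isAllDigits(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Returns base label plus index (if any), dropping other augmentations.
    static String stripAllButIndex(String label) {
        int first = 0;
        while (first < label.length() && !isDelim(label.charAt(first))) {
            first++;
        }
        if (first == label.length()) {
            return label; // no augmentations at all
        }
        int last = label.length() - 1;
        while (!isDelim(label.charAt(last))) {
            last--; // terminates: a delimiter exists at position first
        }
        String base = label.substring(0, first);
        String tail = label.substring(last + 1);
        // With no index, this degenerates to stripping all augmentations.
        return isAllDigits(tail) ? base + label.charAt(last) + tail : base;
    }

    public static void main(String[] args) {
        System.out.println(stripAllButIndex("NP-SBJ-1")); // prints NP-1
        System.out.println(stripAllButIndex("NP-SBJ"));   // prints NP
    }
}
```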

stripAllButIndex

Symbol stripAllButIndex(Symbol label,
                        Nonterminal nonterminal)
Identical to stripAllButIndex(Symbol), except that instead of creating a new Nonterminal object for use by parseNonterminal(Symbol,Nonterminal), this method uses the specified nonterminal object. In a sequential run, this method provides maximum efficiency, as only one Nonterminal object need be created at the beginning of the run.


parseNonterminal

Nonterminal parseNonterminal(Symbol label)
Returns a Nonterminal object to represent all the components of a complex nonterminal annotation: the base label, any augmentations and any index. If there are no augmentations, the augmentations field of the returned object will contain a list with zero elements; if there is no index, the value of index will be -1. A final requirement of the contract of this method is to represent all the delimiters in the list of augmentations; this requirement is met, for example, by the helper method defaultParseNonterminal(Symbol,Nonterminal).
Efficiency note: This method creates and returns a new Nonterminal object with every invocation.

Parameters:
label - a (possibly complex) nonterminal label from a Treebank
Returns:
a Nonterminal object representing any and all components of the specified complex nonterminal
See Also:
Nonterminal
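
One way to read the contract above is sketched below on plain strings. The Parsed class is a hypothetical stand-in for Nonterminal, and the delimiter set "-=" is assumed. Following the requirement that delimiters be represented in the augmentation list, the list alternates delimiters and augmentations, and the delimiter preceding an index stays in the list while the index itself is stored separately. This is an interpretation of the contract, not the library's code.

```java
import java.util.ArrayList;
import java.util.List;

final class ParseNonterminalSketch {
    // Hypothetical delimiter set standing in for augmentationDelimiters().
    static final String DELIMS = "-=";

    // Hypothetical stand-in for the library's Nonterminal class.
    static final class Parsed {
        String base;
        List<String> augmentations = new ArrayList<>();
        int index = -1; // -1 means "no index"
    }

    static boolean isAllDigits(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    static Parsed parse(String label) {
        Parsed p = new Parsed();
        int i = 0;
        while (i < label.length() && DELIMS.indexOf(label.charAt(i)) == -1) {
            i++;
        }
        p.base = label.substring(0, i);
        while (i < label.length()) {
            // label.charAt(i) is a delimiter here; record it in the list.
            p.augmentations.add(String.valueOf(label.charAt(i)));
            i++;
            int start = i;
            while (i < label.length() && DELIMS.indexOf(label.charAt(i)) == -1) {
                i++;
            }
            String aug = label.substring(start, i);
            if (i == label.length() && isAllDigits(aug)) {
                p.index = Integer.parseInt(aug); // final, all-digit piece
            } else if (!aug.isEmpty()) {
                p.augmentations.add(aug);
            }
        }
        return p;
    }

    public static void main(String[] args) {
        Parsed p = parse("NP-SBJ-1");
        System.out.println(p.base + " " + p.augmentations + " " + p.index);
        // prints NP [-, SBJ, -] 1
    }
}
```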

parseNonterminal

Nonterminal parseNonterminal(Symbol label,
                             Nonterminal nonterminal)
Identical to parseNonterminal(Symbol), except that instead of returning a newly-created Nonterminal object, this method merely modifies the specified Nonterminal object. This method may be used for efficiency: in a sequential training run, only one Nonterminal object need be created and then passed repeatedly to this method for modification.

Parameters:
label - a (possibly complex) nonterminal label from a Treebank
nonterminal - the representation of any and all components present in label

defaultParseNonterminal

void defaultParseNonterminal(Symbol label,
                             Nonterminal nonterminal)
Fills in the specified Nonterminal object to represent all the components of a complex nonterminal annotation: the base label, any augmentations and any index. If there are no augmentations, the augmentations field of the specified object will contain a list with no elements; if there is no index, the value of index will be -1. Augmentation delimiters are the characters in the string returned by augmentationDelimiters().
N.B.: This method assumes that the index, if one exists for the specified nonterminal, will always be the final augmentation in the label.
This method is intended to be used by implementations of parseNonterminal(Symbol,Nonterminal).

Parameters:
label - a (possibly complex) nonterminal label from a Treebank
nonterminal - the Nonterminal object to be filled in
See Also:
Nonterminal

containsAugmentation

boolean containsAugmentation(Symbol nonterminal,
                             Symbol augmentation)
Provides an efficient, thread-safe method for testing whether the specified nonterminal contains the specified augmentation (without parsing the nonterminal).

N.B.: This method assumes that the augmentation is preceded by the canonical augmentation delimiter. To search for an augmentation preceded by any of the possible augmentation delimiters (as defined by augmentationDelimiters()), use

 parseNonterminal(nonterminal).augmentations.contains(augmentation)
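
For contrast with the parsing approach, a string-scanning test of the kind described above could be sketched as follows. It works on plain strings, assumes a hypothetical delimiter set "-=" whose first character is the canonical delimiter, and treats a match as valid only when it ends at the end of the label or at another delimiter. It keeps no shared mutable state, which is what makes a method like this safe to call from several threads.

```java
final class ContainsAugmentationSketch {
    // Hypothetical delimiter set; its first character plays the role of
    // canonicalAugDelimiter().
    static final String DELIMS = "-=";
    static final char CANONICAL = DELIMS.charAt(0);

    // Tests for CANONICAL + aug as a whole augmentation, without building
    // any intermediate parse of the label.
    static boolean containsAugmentation(String label, String aug) {
        String needle = CANONICAL + aug;
        int from = 0;
        while (true) {
            int pos = label.indexOf(needle, from);
            if (pos == -1) {
                return false;
            }
            int end = pos + needle.length();
            if (end == label.length() || DELIMS.indexOf(label.charAt(end)) != -1) {
                return true; // match is a complete augmentation
            }
            from = pos + 1; // partial match (e.g. a longer augmentation); keep looking
        }
    }

    public static void main(String[] args) {
        System.out.println(containsAugmentation("NP-SBJ-1", "SBJ")); // prints true
        System.out.println(containsAugmentation("NP=SBJ", "SBJ"));   // prints false
    }
}
```

The second call returns false because the augmentation is preceded by a non-canonical delimiter, which is the case the N.B. above warns about.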
 


addAugmentation

void addAugmentation(Nonterminal nonterminal,
                     Symbol augmentation)
Adds the specified augmentation to the end of the (possibly empty) augmentation list of the specified Nonterminal object. This method takes care to add the canonical augmentation delimiter before adding the augmentation itself, and also takes care to add these two elements before a final delimiter between the main augmentations and the index, if one exists.

Parameters:
nonterminal - the nonterminal to which to add an augmentation
augmentation - the augmentation to add to nonterminal's augmentation list
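
The insertion rule above can be illustrated with a plain list. This sketch assumes the augmentation list alternates delimiters and augmentations and, when an index exists, ends with the delimiter that separates the augmentations from the index. The list-of-strings representation and the "-" canonical delimiter are assumptions made for illustration, not the library's Nonterminal class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AddAugmentationSketch {
    // Hypothetical canonical delimiter, standing in for canonicalAugDelimiter().
    static final String CANONICAL = "-";

    // Inserts CANONICAL and aug at the end of augs, or just before the final
    // delimiter when that delimiter introduces an index (index != -1).
    static void addAugmentation(List<String> augs, int index, String aug) {
        boolean hasIndexDelim = index != -1 && !augs.isEmpty();
        int insertAt = hasIndexDelim ? augs.size() - 1 : augs.size();
        augs.add(insertAt, CANONICAL);
        augs.add(insertAt + 1, aug);
    }

    public static void main(String[] args) {
        // Models NP-SBJ-1: augmentations [-, SBJ, -] with index 1.
        List<String> augs = new ArrayList<>(Arrays.asList("-", "SBJ", "-"));
        addAugmentation(augs, 1, "A");
        System.out.println(augs); // prints [-, SBJ, -, A, -]
    }
}
```

Read back together with base NP and index 1, the resulting list corresponds to NP-SBJ-A-1, so the index remains the final augmentation.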

removeAugmentation

boolean removeAugmentation(Nonterminal nonterminal,
                           Symbol augmentation)
Removes the specified augmentation from the augmentation list of the specified Nonterminal object, and the previous augmentation delimiter. If the specified augmentation is not preceded by an augmentation delimiter, meaning it is the base label itself, then it is not removed.

Parameters:
nonterminal - the nonterminal from which to remove an augmentation
augmentation - the augmentation to remove from nonterminal
Returns:
true if augmentation and a preceding augmentation delimiter were removed from nonterminal's augmentation list, or false otherwise

removeAugmentation

Sexp removeAugmentation(Sexp sexp,
                        Nonterminal nonterminal,
                        Symbol augmentation)
Removes the specified nonterminal augmentation from the specified S-expression, using the specified Nonterminal object for temporary storage. If the specified S-expression is a list, then each element will be destructively replaced with the return value of this method; otherwise, if the specified S-expression is a symbol, its augmentation is removed and the new symbol is returned.

N.B.: While the description of the behavior of this method on lists is recursive, a concrete implementation need not use a recursive algorithm.

Parameters:
sexp - the S-expression containing symbols whose augmentations are to be removed
nonterminal - an object used for temporary storage during the invocation of this method
augmentation - the augmentation to be removed from all symbols in the specified S-expression
Returns:
the specified S-expression, but with all symbols changed so that none has the specified augmentation
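
The recursive description above can be sketched with ordinary Java collections, modeling an S-expression as either a String (a symbol) or a List<Object> (a list). Everything here is a hypothetical stand-in: the real method works on Sexp and Symbol objects and uses a Nonterminal for temporary storage, which this sketch omits. The delimiter set "-=" is also assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RemoveAugmentationSketch {
    // Hypothetical delimiter set standing in for augmentationDelimiters().
    static final String DELIMS = "-=";

    // Convenience constructor for a mutable list node.
    static List<Object> list(Object... items) {
        return new ArrayList<>(Arrays.asList(items));
    }

    // Removes aug and its preceding delimiter from a single label, provided
    // aug occurs as a complete augmentation. A bare base label is untouched.
    static String removeFromSymbol(String label, String aug) {
        if (aug.isEmpty()) {
            return label;
        }
        for (int pos = label.indexOf(aug, 1); pos != -1; pos = label.indexOf(aug, pos + 1)) {
            int end = pos + aug.length();
            boolean delimBefore = DELIMS.indexOf(label.charAt(pos - 1)) != -1;
            boolean boundaryAfter =
                end == label.length() || DELIMS.indexOf(label.charAt(end)) != -1;
            if (delimBefore && boundaryAfter) {
                return label.substring(0, pos - 1) + label.substring(end);
            }
        }
        return label;
    }

    // Lists are modified in place, element by element; symbols are replaced.
    @SuppressWarnings("unchecked")
    static Object removeAugmentation(Object sexp, String aug) {
        if (sexp instanceof List) {
            List<Object> elements = (List<Object>) sexp;
            for (int i = 0; i < elements.size(); i++) {
                elements.set(i, removeAugmentation(elements.get(i), aug));
            }
            return elements;
        }
        return removeFromSymbol((String) sexp, aug);
    }

    public static void main(String[] args) {
        List<Object> tree =
            list("S", list("NP-A", list("NN", "dogs")), list("VP", list("VBP", "bark")));
        System.out.println(removeAugmentation(tree, "A"));
        // prints [S, [NP, [NN, dogs]], [VP, [VBP, bark]]]
    }
}
```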

isAugDelim

boolean isAugDelim(Sexp sexp)
Returns whether the specified S-expression is a symbol that is an augmentation delimiter for a complex nonterminal label.

Parameters:
sexp - the S-expression to be tested
Returns:
whether the specified S-expression is a symbol that is an augmentation delimiter.
See Also:
augmentationDelimiters()

Author: Dan Bikel.