Parsing Engine
java.lang.Object
  danbikel.parser.lang.AbstractTreebank
public abstract class AbstractTreebank
A collection of mostly-abstract methods to be implemented by a
language-specific subclass. A Treebank implementation
provides data and methods specific to the structures found in a
particular Treebank.
Field Summary | |
---|---|
protected BitSet | augmentationDelimSet: A BitSet indexed by character (that is, whose size is Character.MAX_VALUE), where for each character c of the string returned by augmentationDelimiters(), augmentationDelimSet.get(c) returns true. |
protected Symbol | canonicalAugDelimSym: A Symbol created from the first character of Treebank.augmentationDelimiters(). |
protected Symbol[] | nonterminalExceptionSet: A set of nonterminal labels (Symbol objects) that defaultParseNonterminal(Symbol,Nonterminal) should use when determining the base nonterminal label. |
Constructor Summary | |
---|---|
AbstractTreebank() | No-arg constructor, to be called by all subclasses of this abstract class. |
Method Summary | |
---|---|
void | addAugmentation(Nonterminal nonterminal, Symbol augmentation): Adds the specified augmentation to the end of the (possibly empty) augmentation list of the specified Nonterminal object. |
abstract String | augmentationDelimiters(): Returns a string whose characters are the set of delimiters for complex nonterminal labels. |
abstract Symbol | baseNPLabel(): Returns the symbol with which Training.addBaseNPs(Sexp) will relabel core NPs. |
char | canonicalAugDelimiter(): Returns the first character of the string returned by augmentationDelimiters(), which will be considered the "canonical" augmentation delimiter when adding new augmentations, such as the argument augmentations added by implementations of Training.identifyArguments(Sexp). |
Sexp | constructPreterminal(Word word): Converts a Word object into a preterminal subtree. |
boolean | containsAugmentation(Symbol nonterminal, Symbol augmentation): Provides an efficient, thread-safe method for testing whether the specified nonterminal contains the specified augmentation (without parsing the nonterminal). |
void | defaultParseNonterminal(Symbol label, Nonterminal nonterminal): Fills in the specified Nonterminal object to represent all the components of a complex nonterminal annotation: the base label, any augmentations and any index. |
abstract Symbol | getCanonical(Symbol label): Returns a canonical mapping for the specified nonterminal label; if label already is in canonical form, it is returned. |
abstract Symbol | getCanonical(Symbol label, boolean stripAugmentations): Returns a canonical version of the specified nonterminal label; if label already is in canonical form, it is returned. |
Symbol | getTag(Sexp preterminal): Gets the component of the preterminal tree that corresponds to the part of speech tag. |
int | getTraceIndex(Sexp preterm, Nonterminal nonterminal): Returns the index of a trace for the specified null element preterminal. |
boolean | isAugDelim(Sexp sexp): Returns whether the specified S-expression is a symbol that is an augmentation delimiter for a complex nonterminal label. |
boolean | isBaseNP(Symbol label): Returns whether the specified label is for a base NP. |
abstract boolean | isComma(Symbol word): Returns true if the specified word is a comma. |
abstract boolean | isConjunction(Symbol label): Returns true if the canonical version of the specified label is a conjunction tag or nonterminal in a particular Treebank. |
abstract boolean | isLeftParen(Symbol word): Returns true if the specified word is a left parenthesis. |
abstract boolean | isNP(Symbol label): Returns true if the canonical version of the specified label is an NP for this language’s Treebank. |
abstract boolean | isNullElementPreterminal(Sexp tree): Returns true if the specified S-expression represents a preterminal whose terminal element is the null element for this language’s Treebank. |
abstract boolean | isPossessivePreterminal(Sexp tree): Returns true if the specified S-expression represents a preterminal that is the possessive part of speech. |
abstract boolean | isPreterminal(Sexp tree): Returns whether tree represents a preterminal subtree in the parse trees for this language's Treebank. |
abstract boolean | isPuncToRaise(Sexp preterm): Returns true if the specified S-expression represents a preterminal and a part-of-speech tag that indicates punctuation to be raised when running Training.raisePunctuation(Sexp). |
abstract boolean | isPunctuation(Symbol tag): Returns true if the specified part of speech tag is one for which isPuncToRaise(Sexp) would return true. |
abstract boolean | isRightParen(Symbol word): Returns true if the specified word is a right parenthesis. |
abstract boolean | isSentence(Symbol label): Returns true if the specified nonterminal label represents a sentence in this language’s Treebank. |
abstract boolean | isVerb(Sexp preterminal): Returns true if the specified preterminal is that of a verb. |
abstract boolean | isVerbTag(Symbol tag): Returns true if the specified symbol is the part of speech tag of a verb. |
abstract boolean | isWHNP(Symbol label): Returns true if the canonical version of the specified label is an NP that undergoes WH-movement in a particular Treebank. |
Word | makeWord(Sexp preterminal): Constructs a Word object from the specified preterminal subtree. |
char | nonTreebankDelimiter(): Returns a delimiter not already in use by the current treebank, for use when constructing lexicalized nonterminals when Settings.decoderOutputHeadLexicalizedLabels is true. |
char | nonTreebankLeftBracket(): Returns a left-bracket character that is not an existing metacharacter in the current treebank, for use when Settings.decoderOutputHeadLexicalizedLabels is true. |
char | nonTreebankRightBracket(): Returns a right-bracket character that is not an existing metacharacter in the current treebank, for use when constructing lexicalized nonterminals when Settings.decoderOutputHeadLexicalizedLabels is true. |
abstract Symbol | NPLabel(): Returns the symbol that Training.addBaseNPs(Sexp) should add as a parent if a base NP is not dominated by an NP. |
Nonterminal | parseNonterminal(Symbol label): Returns a Nonterminal object to represent all the components of a complex nonterminal annotation: the base label, any augmentations and any index. |
abstract Nonterminal | parseNonterminal(Symbol label, Nonterminal nonterminal): Identical to parseNonterminal(Symbol), except that instead of returning a newly-created Nonterminal object, this method merely modifies the specified Nonterminal object. |
boolean | removeAugmentation(Nonterminal nonterminal, Symbol augmentation): Removes the specified augmentation from the augmentation list of the specified Nonterminal object, and the previous augmentation delimiter. |
Sexp | removeAugmentation(Sexp sexp, Nonterminal nonterminal, Symbol augmentation): Removes the specified nonterminal augmentation from the specified S-expression, using the specified Nonterminal object for temporary storage. |
abstract Symbol | sentenceLabel(): Returns the canonical label for a sentence, for de-transforming sentences that were transformed via Training.relabelSubjectlessSentences(Sexp). |
Symbol | stripAllButIndex(Symbol label): Returns a symbol identical to the specified label, except all augmentations other than the index will be removed. |
Symbol | stripAllButIndex(Symbol label, Nonterminal nonterminal): Identical to stripAllButIndex(Symbol), except that instead of creating a new Nonterminal object for use by parseNonterminal(Symbol,Nonterminal), this method uses the specified nonterminal object. |
Symbol | stripAugmentation(Symbol label): Returns the Symbol created by stripping off all augmentations, that is, all characters after and including the first character that appears in the string returned by augmentationDelimiters(). |
Symbol | stripIndex(Symbol label): Returns label, but stripped of any index augmentation. |
Symbol | stripIndex(Symbol label, Nonterminal nonterminal): Identical to stripIndex(Symbol), except that instead of creating a new Nonterminal object for use by parseNonterminal(Symbol,Nonterminal), this method simply passes the specified nonterminal object. |
abstract Symbol | subjectAugmentation(): Returns the symbol that is used to augment nonterminals to indicate matrix subjects in this language’s Treebank. |
abstract Symbol | subjectlessSentenceLabel(): Returns the symbol with which Training.relabelSubjectlessSentences(Sexp) will relabel sentences when they have no subjects. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected BitSet augmentationDelimSet
A BitSet indexed by character (that is, whose size is Character.MAX_VALUE), where for each character c of the string returned by augmentationDelimiters(), augmentationDelimSet.get(c) returns true. The default constructor of this abstract class will appropriately initialize this data member.
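As a rough illustration of how such a delimiter set might be populated, the following sketch builds a BitSet from a delimiter string. The class name AugDelimSetSketch, the helper buildDelimSet, and the delimiter string "-=|" are assumptions made for this example; they are not part of the danbikel API, and the actual constructor may differ.

```java
import java.util.BitSet;

/** Illustrative stand-in for the delimiter-set setup; not the library's code. */
final class AugDelimSetSketch {
    /** Builds a BitSet with one bit set for each delimiter character. */
    static BitSet buildDelimSet(String delimiters) {
        // Sized to cover the char range, as the field description suggests.
        BitSet set = new BitSet(Character.MAX_VALUE);
        for (int i = 0; i < delimiters.length(); i++) {
            set.set(delimiters.charAt(i)); // the char widens to its int code
        }
        return set;
    }

    public static void main(String[] args) {
        // "-=|" is an assumed delimiter string used only for illustration.
        BitSet delimSet = buildDelimSet("-=|");
        System.out.println(delimSet.get('-')); // true
        System.out.println(delimSet.get('x')); // false
    }
}
```

A lookup in such a set is a single bit test, which is why it suits per-character scans of nonterminal labels.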
protected final Symbol canonicalAugDelimSym
A Symbol created from the first character of Treebank.augmentationDelimiters().
protected Symbol[] nonterminalExceptionSet
A set of nonterminal labels (Symbol objects) that defaultParseNonterminal(Symbol,Nonterminal) should use when determining the base nonterminal label. If this behavior is desired, this array should be assigned in the constructor of a subclass. This hook into the behavior of defaultParseNonterminal is primarily intended for the unfortunate case when Treebank designers have nonterminal labels that contain the delimiters used for augmenting nonterminal labels (as is the case with the English Treebank in the form of -LRB- and -RRB-).
Constructor Detail |
---|
public AbstractTreebank()
No-arg constructor, to be called by all subclasses of this abstract class. Initializes augmentationDelimSet based on the string returned by augmentationDelimiters().
Method Detail |
---|
public abstract boolean isPreterminal(Sexp tree)
Returns whether tree represents a preterminal subtree in the parse trees for this language's Treebank. Typically, preterminals are part-of-speech tags.
Specified by: isPreterminal in interface Treebank
public Symbol getTag(Sexp preterminal)
Gets the component of the preterminal tree that corresponds to the part of speech tag. This default implementation evaluates preterminal.list().get(0).symbol(). If this is not appropriate for a particular Treebank, then this method should be overridden.
Specified by: getTag in interface Treebank
Parameters: preterminal - a tree that is assumed to be a preterminal
Returns: the component of preterminal that is a part of speech
public Word makeWord(Sexp preterminal)
Constructs a Word object from the specified preterminal subtree. This default implementation creates a Word object from the terminal and preterminal symbols (word and part-of-speech tag symbols): Words.get(preterminal.list().get(1).symbol(), preterminal.list().get(0).symbol()). If a particular Treebank requires a different type of word object to be constructed, or has a different preterminal tree structure, this method should be overridden.
Specified by: makeWord in interface Treebank
Parameters: preterminal - a tree that is assumed to be a preterminal
Returns: a Word object constructed from preterminal
public Sexp constructPreterminal(Word word)
Converts a Word object into a preterminal subtree. This default implementation creates a tree whose sole nonterminal is the part of speech of the specified word object and whose terminal is the word component of the specified word object.
Specified by: constructPreterminal in interface Treebank
Parameters: word - the word object from which to create a preterminal subtree
Returns: a preterminal subtree constructed from word
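The preterminal layout implied by the three default implementations above (tag at position 0, word at position 1) can be sketched with plain lists. This is only a sketch: List<String> stands in for Sexp, and the WordTag class stands in for Word; both are hypothetical names introduced for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative stand-ins for Sexp and Word; not the danbikel classes. */
final class PreterminalSketch {
    /** Hypothetical counterpart of Word: a word paired with its tag. */
    static final class WordTag {
        final String word;
        final String tag;
        WordTag(String word, String tag) { this.word = word; this.tag = tag; }
    }

    /** Mirrors the default getTag: the tag is element 0 of the preterminal. */
    static String getTag(List<String> preterminal) {
        return preterminal.get(0);
    }

    /** Mirrors the default makeWord: word is element 1, tag is element 0. */
    static WordTag makeWord(List<String> preterminal) {
        return new WordTag(preterminal.get(1), preterminal.get(0));
    }

    /** Mirrors the default constructPreterminal: rebuilds (tag word). */
    static List<String> constructPreterminal(WordTag w) {
        return Arrays.asList(w.tag, w.word);
    }

    public static void main(String[] args) {
        List<String> preterm = Arrays.asList("NN", "dog"); // the tree (NN dog)
        WordTag w = makeWord(preterm);
        System.out.println(getTag(preterm) + " " + w.word);           // NN dog
        System.out.println(constructPreterminal(w).equals(preterm)); // true
    }
}
```

A Treebank whose preterminals carry extra structure would need to override all three methods together so that makeWord and constructPreterminal remain inverses.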
public abstract Symbol getCanonical(Symbol label)
Returns a canonical mapping for the specified nonterminal label; if label already is in canonical form, it is returned. This method is intended to be used by implementations of HeadFinder.findHead(Sexp).
Specified by: getCanonical in interface Treebank
Parameters: label - the label to be canonicalized
See Also: HeadFinder.findHead(Sexp)
public abstract Symbol getCanonical(Symbol label, boolean stripAugmentations)
Description copied from interface: Treebank
Returns a canonical version of the specified nonterminal label; if label already is in canonical form, it is returned.
Specified by: getCanonical in interface Treebank
Parameters: label - the label to be canonicalized
stripAugmentations - indicates whether to strip any augmentations from the specified label before attempting to get its canonical form
public abstract boolean isSentence(Symbol label)
Returns true if the specified nonterminal label represents a sentence in this language’s Treebank. This method is intended to be used by implementations of Training.relabelSubjectlessSentences(Sexp).
Specified by: isSentence in interface Treebank
public abstract Symbol sentenceLabel()
Returns the canonical label for a sentence, for de-transforming sentences that were transformed via Training.relabelSubjectlessSentences(Sexp).
Specified by: sentenceLabel in interface Treebank
public abstract Symbol subjectlessSentenceLabel()
Returns the symbol with which Training.relabelSubjectlessSentences(Sexp) will relabel sentences when they have no subjects.
Specified by: subjectlessSentenceLabel in interface Treebank
public abstract Symbol subjectAugmentation()
Returns the symbol that is used to augment nonterminals to indicate matrix subjects in this language’s Treebank.
Specified by: subjectAugmentation in interface Treebank
See Also: Training.relabelSubjectlessSentences(Sexp)
public abstract boolean isNullElementPreterminal(Sexp tree)
Returns true if the specified S-expression represents a preterminal whose terminal element is the null element for this language’s Treebank. This method is intended to be used by implementations of Training.relabelSubjectlessSentences(Sexp).
Specified by: isNullElementPreterminal in interface Treebank
See Also: Training.relabelSubjectlessSentences(Sexp)
public int getTraceIndex(Sexp preterm, Nonterminal nonterminal)
Returns the index of a trace for the specified null element preterminal. The default implementation relies on parseNonterminal(Symbol,Nonterminal). If this is not appropriate for a particular Treebank, this method should be overridden. If preterm is not a null element preterminal (that is, a preterminal for which isNullElementPreterminal(Sexp) returns false), the semantics of this method are undefined. This method is used by the default implementation of AbstractTraining.hasGap(Sexp,Sexp,ArrayList), which is a helper method for the default implementation of Training.addGapInformation(Sexp).
Specified by: getTraceIndex in interface Treebank
Parameters: preterm - the null element preterminal whose trace index is to be returned
nonterminal - the object used as the second argument to parseNonterminal(Symbol,Nonterminal)
Returns: the trace index of preterm, or -1 if the null element does not have an index
public abstract boolean isPuncToRaise(Sexp preterm)
Returns true if the specified S-expression represents a preterminal and a part-of-speech tag that indicates punctuation to be raised when running Training.raisePunctuation(Sexp). If punctuation raising is not desirable for a particular language package, this method may be implemented simply to return false.
Specified by: isPuncToRaise in interface Treebank
Parameters: preterm - the preterminal to test
See Also: Training.raisePunctuation(Sexp)
public abstract boolean isPunctuation(Symbol tag)
Returns true if the specified part of speech tag is one for which isPuncToRaise(Sexp) would return true.
Specified by: isPunctuation in interface Treebank
Parameters: tag - the part of speech to test
See Also: isPuncToRaise(Sexp)
public abstract boolean isPossessivePreterminal(Sexp tree)
Returns true if the specified S-expression represents a preterminal that is the possessive part of speech. This method is intended to be used by implementations of Training.addBaseNPs(Sexp).
Specified by: isPossessivePreterminal in interface Treebank
See Also: Training.addBaseNPs(Sexp)
public abstract boolean isNP(Symbol label)
Returns true if the canonical version of the specified label is an NP for this language’s Treebank.
Specified by: isNP in interface Treebank
Parameters: label - the label to test
See Also: Training.addBaseNPs(Sexp)
public abstract Symbol baseNPLabel()
Returns the symbol with which Training.addBaseNPs(Sexp) will relabel core NPs.
Specified by: baseNPLabel in interface Treebank
See Also: Training.addBaseNPs(Sexp)
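public boolean isBaseNP(Symbol label)
Returns whether the specified label is for a base NP. The default implementation tests the label against the one returned by baseNPLabel(). If a particular language package can have various types of base NP labels (such as those bearing node augmentations), then this method should be overridden.
Specified by: isBaseNP in interface Treebank
Parameters: label - the label to test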
public abstract boolean isWHNP(Symbol label)
Returns true if the canonical version of the specified label is an NP that undergoes WH-movement in a particular Treebank. This method is used by Training.addGapInformation(Sexp). If a particular language package does not require gap information, then this method may be implemented simply to return false.
Specified by: isWHNP in interface Treebank
See Also: Training.addGapInformation(Sexp)
public abstract Symbol NPLabel()
Returns the symbol that Training.addBaseNPs(Sexp) should add as a parent if a base NP is not dominated by an NP.
Specified by: NPLabel in interface Treebank
See Also: Training.addBaseNPs(Sexp)
public abstract boolean isConjunction(Symbol label)
Returns true if the canonical version of the specified label is a conjunction tag or nonterminal in a particular Treebank.
Specified by: isConjunction in interface Treebank
public abstract boolean isVerb(Sexp preterminal)
Returns true if the specified preterminal is that of a verb. This method is used by HeadTreeNode to determine if a particular subtree contains a verb, which is in turn used by Trainer to calculate the distance metric, which depends on whether a verb occurs in the subtrees of the previous modifiers. It is the responsibility of the caller to ensure that preterminal is a Sexp object for which isPreterminal(Sexp) returns true.
Specified by: isVerb in interface Treebank
See Also: HeadTreeNode, Trainer
public abstract boolean isVerbTag(Symbol tag)
Returns true if the specified symbol is the part of speech tag of a verb. This method should return true for exactly the same parts of speech for which isVerb(Sexp) returns true, and is used to calculate the distance metric while decoding.
Specified by: isVerbTag in interface Treebank
See Also: CKYItem.containsVerb(), Decoder
public abstract boolean isComma(Symbol word)
Returns true if the specified word is a comma. This method is used by the Decoder class when performing the comma constraint on chart items.
Specified by: isComma in interface Treebank
Parameters: word - the word to test
See Also: Settings.decoderUseCommaConstraint
public abstract boolean isLeftParen(Symbol word)
Returns true if the specified word is a left parenthesis. This method is used by the Decoder class when performing the comma constraint on chart items.
Specified by: isLeftParen in interface Treebank
Parameters: word - the word to test
See Also: Settings.decoderUseCommaConstraint
public abstract boolean isRightParen(Symbol word)
Returns true if the specified word is a right parenthesis. This method is used by the Decoder class when performing the comma constraint on chart items.
Specified by: isRightParen in interface Treebank
Parameters: word - the word to test
See Also: Settings.decoderUseCommaConstraint
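public abstract String augmentationDelimiters()
Returns a string whose characters are the set of delimiters for complex nonterminal labels.
Specified by: augmentationDelimiters in interface Treebank
See Also: stripAugmentation(Symbol), defaultParseNonterminal(Symbol,Nonterminal)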
public char canonicalAugDelimiter()
Returns the first character of the string returned by augmentationDelimiters(), which will be considered the "canonical" augmentation delimiter when adding new augmentations, such as the argument augmentations added by implementations of Training.identifyArguments(Sexp).
Specified by: canonicalAugDelimiter in interface Treebank
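public char nonTreebankLeftBracket()
Returns a left-bracket character that is not an existing metacharacter in the current treebank, for use when Settings.decoderOutputHeadLexicalizedLabels is true. The default implementation here returns '['.
Specified by: nonTreebankLeftBracket in interface Treebank
public char nonTreebankRightBracket()
Returns a right-bracket character that is not an existing metacharacter in the current treebank, for use when constructing lexicalized nonterminals when Settings.decoderOutputHeadLexicalizedLabels is true. The default implementation here returns ']'.
Specified by: nonTreebankRightBracket in interface Treebank
public char nonTreebankDelimiter()
Description copied from interface: Treebank
Returns a delimiter not already in use by the current treebank, for use when constructing lexicalized nonterminals when Settings.decoderOutputHeadLexicalizedLabels is true.
Specified by: nonTreebankDelimiter in interface Treebank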
public Symbol stripAugmentation(Symbol label)
Returns the Symbol created by stripping off all augmentations, that is, all characters after and including the first character that appears in the string returned by augmentationDelimiters().
Specified by: stripAugmentation in interface Treebank
Parameters: label - the potentially-complex nonterminal label to be stripped
Returns: label with all augmentations removed
public Symbol stripIndex(Symbol label)
Returns label, but stripped of any index augmentation. This method assumes that the index will always be the final augmentation in a complex nonterminal label. This method creates a new Nonterminal object, to be filled in by stripIndex(Symbol,Nonterminal).
Specified by: stripIndex in interface Treebank
Parameters: label - the nonterminal to be stripped of any possible index
Returns: a Symbol that is identical to label, except that all characters after and including the final delimiter are removed if the final augmentation is composed entirely of digits
public Symbol stripIndex(Symbol label, Nonterminal nonterminal)
Identical to stripIndex(Symbol), except that instead of creating a new Nonterminal object for use by parseNonterminal(Symbol,Nonterminal), this method simply passes the specified nonterminal object. In a sequential run, this method provides maximum efficiency, as only one Nonterminal object need be created at the beginning of the run.
Specified by: stripIndex in interface Treebank
public Symbol stripAllButIndex(Symbol label)
Returns a symbol identical to the specified label, except all augmentations other than the index will be removed. If label had no index to begin with, then this method is functionally identical to stripAugmentation(Symbol).
Specified by: stripAllButIndex in interface Treebank
Parameters: label - the nonterminal label to strip of non-index augmentations
public Symbol stripAllButIndex(Symbol label, Nonterminal nonterminal)
Identical to stripAllButIndex(Symbol), except that instead of creating a new Nonterminal object for use by parseNonterminal(Symbol,Nonterminal), this method uses the specified nonterminal object. In a sequential run, this method provides maximum efficiency, as only one Nonterminal object need be created at the beginning of the run.
Specified by: stripAllButIndex in interface Treebank
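The difference between the three stripping operations described above is easiest to see on a label such as NP-SBJ-1. The sketch below works on plain strings rather than Symbol objects, and it scans the string directly instead of going through parseNonterminal; the class name StripSketch and the delimiter string "-=|" are assumptions for illustration only.

```java
/** String-based illustration of the strip methods; not the danbikel code. */
final class StripSketch {
    static final String DELIMS = "-=|"; // assumed delimiter characters

    static boolean isDelim(char c) { return DELIMS.indexOf(c) >= 0; }

    /** Position of the delimiter before a trailing all-digit index, or -1. */
    static int indexStart(String label) {
        int last = label.length() - 1;
        while (last >= 0 && !isDelim(label.charAt(last))) last--;
        if (last < 0 || last == label.length() - 1) return -1;
        for (int i = last + 1; i < label.length(); i++) {
            char c = label.charAt(i);
            if (c < '0' || c > '9') return -1; // final augmentation is not an index
        }
        return last;
    }

    /** Drops everything from the first delimiter onward. */
    static String stripAugmentation(String label) {
        for (int i = 0; i < label.length(); i++) {
            if (isDelim(label.charAt(i))) return label.substring(0, i);
        }
        return label;
    }

    /** Drops only a trailing index, together with its delimiter. */
    static String stripIndex(String label) {
        int start = indexStart(label);
        return start < 0 ? label : label.substring(0, start);
    }

    /** Keeps the base label and the index, dropping other augmentations. */
    static String stripAllButIndex(String label) {
        int start = indexStart(label);
        String base = stripAugmentation(label);
        return start < 0 ? base : base + label.substring(start);
    }

    public static void main(String[] args) {
        System.out.println(stripAugmentation("NP-SBJ-1")); // NP
        System.out.println(stripIndex("NP-SBJ-1"));        // NP-SBJ
        System.out.println(stripAllButIndex("NP-SBJ-1"));  // NP-1
    }
}
```

Note that this simplified scan would mishandle labels such as -LRB-, whose base contains a delimiter; the nonterminalExceptionSet hook exists for that case.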
public Nonterminal parseNonterminal(Symbol label)
Returns a Nonterminal object to represent all the components of a complex nonterminal annotation: the base label, any augmentations and any index. If there are no augmentations, the augmentations field of the returned object will contain a list with zero elements; if there is no index, the value of index will be -1. A final requirement of the contract of this method is to represent all the delimiters in the list of augmentations; this requirement is met, for example, by the helper method defaultParseNonterminal(Symbol,Nonterminal). This method creates a new Nonterminal object with every invocation.
Specified by: parseNonterminal in interface Treebank
Parameters: label - a (possibly complex) nonterminal label from a Treebank
Returns: a new Nonterminal object representing any and all components of the specified complex nonterminal
See Also: Nonterminal
public abstract Nonterminal parseNonterminal(Symbol label, Nonterminal nonterminal)
Identical to parseNonterminal(Symbol), except that instead of returning a newly-created Nonterminal object, this method merely modifies the specified Nonterminal object. This method may be used for efficiency: in a particular sequential training run, only one Nonterminal need be created and then repeatedly passed in to this method for modification.
Specified by: parseNonterminal in interface Treebank
Parameters: label - a (possibly complex) nonterminal label from a Treebank
nonterminal - the representation of any and all components present in label
public void defaultParseNonterminal(Symbol label, Nonterminal nonterminal)
Fills in the specified Nonterminal object to represent all the components of a complex nonterminal annotation: the base label, any augmentations and any index. If there are no augmentations, the augmentations field of the specified object will contain a list with no elements; if there is no index, the value of index will be -1. Augmentation delimiters are the characters in the string returned by augmentationDelimiters(). This method is intended as a helper for implementations of parseNonterminal(Symbol,Nonterminal).
Specified by: defaultParseNonterminal in interface Treebank
Parameters: label - a (possibly complex) nonterminal label from a Treebank
See Also: Nonterminal
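To make the parsing contract concrete, here is a minimal sketch that splits a label into base, augmentation list (delimiters included) and index, consulting an exception list first so that labels like -LRB- keep their delimiters in the base. Everything here is hypothetical scaffolding: the Parsed class stands in for Nonterminal, strings stand in for Symbol, and the choice to keep the delimiter that precedes the index inside the augmentation list is an assumption based on the addAugmentation description, not a confirmed detail of the library.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the default parse; not the danbikel code. */
final class ParseNonterminalSketch {
    static final String DELIMS = "-=|";                    // assumed delimiters
    static final String[] EXCEPTIONS = {"-LRB-", "-RRB-"}; // assumed exception set

    /** Hypothetical counterpart of Nonterminal, holding plain strings. */
    static final class Parsed {
        String base;
        final List<String> augmentations = new ArrayList<>();
        int index = -1;
    }

    static boolean isDelim(char c) { return DELIMS.indexOf(c) >= 0; }

    /** True for a short, non-empty run of ASCII digits. */
    static boolean isDigits(String s) {
        if (s.isEmpty() || s.length() > 9) return false;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) < '0' || s.charAt(i) > '9') return false;
        }
        return true;
    }

    /** Fills in and returns the caller-supplied object, so it can be reused. */
    static Parsed parse(String label, Parsed out) {
        out.base = null;
        out.augmentations.clear();
        out.index = -1;
        int pos = 0;
        // Labels in the exception set keep their delimiters in the base.
        for (String ex : EXCEPTIONS) {
            if (label.startsWith(ex)) { out.base = ex; pos = ex.length(); break; }
        }
        if (out.base == null) {
            while (pos < label.length() && !isDelim(label.charAt(pos))) pos++;
            out.base = label.substring(0, pos);
        }
        // Tokenize the rest: each delimiter and each augmentation is one element.
        while (pos < label.length()) {
            int end = pos + 1;
            if (!isDelim(label.charAt(pos))) {
                while (end < label.length() && !isDelim(label.charAt(end))) end++;
            }
            out.augmentations.add(label.substring(pos, end));
            pos = end;
        }
        // A final all-digit augmentation is treated as the index.
        int n = out.augmentations.size();
        if (n >= 2 && isDigits(out.augmentations.get(n - 1))) {
            out.index = Integer.parseInt(out.augmentations.remove(n - 1));
        }
        return out;
    }

    public static void main(String[] args) {
        Parsed scratch = new Parsed(); // one object reused across calls
        parse("NP-SBJ-1", scratch);
        System.out.println(scratch.base + " " + scratch.augmentations + " " + scratch.index);
        // NP [-, SBJ, -] 1
        parse("-LRB-", scratch);
        System.out.println(scratch.base + " " + scratch.augmentations + " " + scratch.index);
        // -LRB- [] -1
    }
}
```

Passing the same scratch object to every call mirrors the efficiency argument made for parseNonterminal(Symbol,Nonterminal): a sequential run needs only one such object.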
public boolean containsAugmentation(Symbol nonterminal, Symbol augmentation)
Provides an efficient, thread-safe method for testing whether the specified nonterminal contains the specified augmentation (without parsing the nonterminal).
N.B.: This method assumes that the augmentation is preceded by the canonical augmentation delimiter. To search for an augmentation preceded by any of the possible augmentation delimiters (as defined by augmentationDelimiters()), use parseNonterminal(nonterminal).augmentations.contains(augmentation).
Specified by: containsAugmentation in interface Treebank
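The caveat about the canonical delimiter can be illustrated with a string search. The sketch below looks for the canonical delimiter followed by the augmentation, and additionally requires the match to end at a delimiter or at the end of the label. That boundary check, the class name ContainsAugSketch and the delimiter string "-=|" are assumptions of this example; the library's actual matching rules are not stated in this page.

```java
/** String-based illustration of a parse-free containment test. */
final class ContainsAugSketch {
    static final String DELIMS = "-=|";               // assumed delimiters
    static final char CANONICAL = DELIMS.charAt(0);   // first one is canonical

    static boolean containsAugmentation(String label, String augmentation) {
        String needle = CANONICAL + augmentation;
        int from = 0;
        while (true) {
            int at = label.indexOf(needle, from);
            if (at < 0) return false;
            int end = at + needle.length();
            // Accept only whole augmentations, not prefixes of longer ones.
            if (end == label.length() || DELIMS.indexOf(label.charAt(end)) >= 0) {
                return true;
            }
            from = at + 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(containsAugmentation("NP-SBJ-1", "SBJ")); // true
        System.out.println(containsAugmentation("NP=SBJ", "SBJ"));   // false
    }
}
```

The second call returns false because the augmentation follows a non-canonical delimiter, which is exactly the case in which the description recommends parsing the nonterminal instead.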
public void addAugmentation(Nonterminal nonterminal, Symbol augmentation)
Adds the specified augmentation to the end of the (possibly empty) augmentation list of the specified Nonterminal object. This method takes care to add the canonical augmentation delimiter before adding the augmentation itself, and also takes care to add these two elements before a final delimiter between the main augmentations and the index, if one exists.
Specified by: addAugmentation in interface Treebank
Parameters: nonterminal - the nonterminal to which to add an augmentation
augmentation - the augmentation to add to nonterminal's augmentation list
public boolean removeAugmentation(Nonterminal nonterminal, Symbol augmentation)
Removes the specified augmentation from the augmentation list of the specified Nonterminal object, and the previous augmentation delimiter. If the specified augmentation is not preceded by an augmentation delimiter, meaning it is the base label itself, then it is not removed.
Specified by: removeAugmentation in interface Treebank
Parameters: nonterminal - the nonterminal from which to remove an augmentation
augmentation - the augmentation to remove from nonterminal
Returns: true if augmentation and a preceding augmentation delimiter were removed from nonterminal's augmentation list, or false otherwise
public Sexp removeAugmentation(Sexp sexp, Nonterminal nonterminal, Symbol augmentation)
Description copied from interface: Treebank
Removes the specified nonterminal augmentation from the specified S-expression, using the specified Nonterminal object for temporary storage. If the specified S-expression is a list, then each element will be destructively replaced with the return value of this method; otherwise, if the specified S-expression is a symbol, its augmentation is removed and the new symbol is returned.
N.B.: While the description of the behavior of this method on lists is recursive, a concrete implementation need not use a recursive algorithm.
Specified by: removeAugmentation in interface Treebank
Parameters: sexp - the S-expression containing symbols whose augmentations are to be removed
nonterminal - an object used for temporary storage during the invocation of this method
augmentation - the augmentation to be removed from all symbols in the specified S-expression
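The destructive, element-by-element behavior described for the S-expression variant of removeAugmentation can be sketched with nested lists. In this sketch a tree is a List<Object> whose elements are either strings or further lists; the class RemoveAugSketch, its helper methods, and the omission of the temporary Nonterminal argument are all simplifications assumed for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Nested-list illustration of destructive augmentation removal. */
final class RemoveAugSketch {
    static final String DELIMS = "-=|"; // assumed delimiters

    /** Convenience builder for a mutable list node. */
    static List<Object> list(Object... items) {
        return new ArrayList<>(Arrays.asList(items));
    }

    /** Removes the augmentation and its preceding delimiter from one label. */
    static String removeFromLabel(String label, String aug) {
        for (int i = 0; i < label.length(); i++) {
            if (DELIMS.indexOf(label.charAt(i)) < 0 || !label.startsWith(aug, i + 1)) {
                continue;
            }
            int end = i + 1 + aug.length();
            if (end == label.length() || DELIMS.indexOf(label.charAt(end)) >= 0) {
                return label.substring(0, i) + label.substring(end);
            }
        }
        return label; // no delimiter-prefixed match, so the label is unchanged
    }

    /** Lists are rewritten in place; symbols are replaced by new strings. */
    @SuppressWarnings("unchecked")
    static Object removeAugmentation(Object sexp, String aug) {
        if (sexp instanceof List) {
            List<Object> node = (List<Object>) sexp;
            for (int i = 0; i < node.size(); i++) {
                node.set(i, removeAugmentation(node.get(i), aug));
            }
            return node;
        }
        return removeFromLabel((String) sexp, aug);
    }

    public static void main(String[] args) {
        List<Object> tree = list("S",
                list("NP-SBJ", list("NN", "dog")),
                list("VP", list("VBZ", "barks")));
        removeAugmentation(tree, "SBJ");
        System.out.println(tree); // [S, [NP, [NN, dog]], [VP, [VBZ, barks]]]
    }
}
```

Because lists are modified in place, the caller's tree is changed even if the return value is ignored; a label that merely equals the augmentation (a base label) is left alone, matching the rule stated for the Nonterminal variant.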
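public final boolean isAugDelim(Sexp sexp)
Description copied from interface: Treebank
Returns whether the specified S-expression is a symbol that is an augmentation delimiter for a complex nonterminal label.
Specified by: isAugDelim in interface Treebank
Parameters: sexp - the S-expression to be tested
See Also: Treebank.augmentationDelimiters()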