Parsing Engine
java.lang.Object
  danbikel.parser.lang.AbstractTraining
public abstract class AbstractTraining
Provides methods for language-specific preprocessing of training parse trees. The primary method to be invoked from this class is preProcess(Sexp). Additionally, as this class contains or has access to appropriate preprocessing data and methods, it also contains a crucial method to be used for post-processing, to "undo" after decoding what it has done during preprocessing. This post-processing method is postProcess(Sexp), and is invoked by default by the Decoder.
Concurrency note: As training is typically a sequential process, with very few noted exceptions, none of the default implementations of the methods of this abstract base class is thread-safe. If thread-safe guarantees are desired, the methods of this class should be overridden.
See Also: preProcess(Sexp), postProcess(Sexp), Decoder, Serialized Form
Field Summary | |
---|---|
protected static boolean |
addGapInfo
Caches the boolean value of the property Settings.addGapInfo . |
protected SexpList |
argAugmentations
A list representing the set of all argument augmentations. |
protected Map |
argContexts
Data member used to store the map required by the default implementation of the method identifyArguments(Sexp) . |
protected static Symbol |
argContextsSym
The symbol to indicate the list of argument-finding rules from a metadata resource. |
protected static Set |
argNonterminals
Static set for storing argument nonterminals. |
protected Symbol |
baseNP
The value of Treebank.baseNPLabel() , cached for efficiency and
convenience. |
protected Symbol |
canonicalAugDelimSym
A Symbol created from the first character of Treebank.augmentationDelimiters() . |
protected Symbol |
defaultArgAugmentation
The symbol that will be used to identify argument nonterminals. |
protected String |
delimAndGapStr
The string consisting of the canonical augmentation delimiter concatenated with the gap augmentation, to be used in identifying nonterminals that contain gap augmentations. |
protected int |
delimAndGapStrLen
The length of delimAndGapStr , cached here for efficiency
and convenience. |
protected Symbol |
gapAugmentation
The symbol that will be used to identify nonterminals whose subtrees contain a gap (a trace). |
protected HeadFinder |
headFinder
Holds the value of Language.headFinder() . |
protected static Symbol |
headPostSym
The symbol that is a possible mapping in argContexts to indicate to choose a child relative to the right side of the head as an argument. |
protected static Symbol |
headPreSym
The symbol that is a possible mapping in argContexts to indicate to choose a child relative to the left side of the head as an argument. |
protected static Symbol |
headSym
The symbol that is a possible mapping in argContexts to indicate
to choose a child relative to the head as an argument. |
protected static String |
metadataPropertyPrefix
The prefix of the property of the metadata resource required by the default constructor of concrete subclasses. |
protected Set |
nodesToPrune
Data member to store the set of nodes to prune for the default implementation of prune(Sexp) . |
protected static Symbol |
nodesToPruneSym
The symbol to indicate the list of nodes to prune. |
protected Symbol |
NP
The value of Treebank.NPLabel() , cached for efficiency and
convenience. |
protected Set |
prunedPreterms
The set of preterminals ( Sexp objects) that have been pruned
away. |
protected Set |
prunedPunctuation
The set of preterminals ( Sexp objects) that were "raised
away" by raisePunctuation(Sexp) because they appeared either at
the beginning or the end of a sentence. |
protected static boolean |
relabelHeadChildrenAsArgs
Indicates to relabel head children as arguments. |
protected static boolean |
repairBaseNPs
Caches the boolean value of the property Settings.collinsRepairBaseNPs . |
protected static Symbol |
semTagArgStopListSym
The symbol to indicate the list of node augmentations that prevent a node from being relabeled as an argument. |
protected Set |
semTagArgStopSet
Data member used to store the set required by the method identifyArguments(Sexp) . |
protected Symbol |
traceTag
The symbol that gets assigned as the part of speech for null preterminals that represent traces that have undergone WH-movement, as relabeled by the default implementation of addGapInformation(Sexp) . |
protected Treebank |
treebank
Holds the value of Language.treebank() . |
protected Set |
wordsToPrune
Data member to store the set of words to prune for the default implementation of prune(Sexp) . |
Constructor Summary | |
---|---|
protected |
AbstractTraining()
Default constructor for this abstract base class; sets argContexts to a new Map object, sets semTagArgStopSet to a new Set object and initializes canonicalAugDelimSym . |
Method Summary | |
---|---|
protected boolean |
addArgAugmentation(Symbol label,
Nonterminal nonterminal)
Adds the default argument augmentation to the specified nonterminal if the specified label is not already an argument. |
Sexp |
addBaseNPs(Sexp tree)
Adds and/or relabels base NPs, which are defined in this default implementation to be NPs that do not dominate other non-possessive NPs, where a possessive NP is defined to be an NP that itself dominates a possessive preterminal, as determined by the implementation of the method Treebank.isPossessivePreterminal(Sexp) . |
Sexp |
addGapInformation(Sexp tree)
Augments nonterminals to include gap information for WHNP's that have moved and leave traces (gaps), as in the GPSG framework. |
Set |
argNonterminals()
Returns a static set of possible argument nonterminals. |
protected void |
canonicalizeNonterminals(Sexp tree)
Modifies each nonterminal in the specified tree to be its canonical version. |
protected void |
collectPreterms(Set preterms,
Sexp tree)
Adds all preterminal subtrees to the specified set. |
protected void |
createArgAugmentationsList()
A helper method that runs through every nonterminal "pattern" for each context in argContexts , parses the pattern using Treebank.parseNonterminal(danbikel.lisp.Symbol) , runs through the resulting list of
augmentations and adds each augmentation symbol to the argAugmentations list. |
protected void |
createArgNonterminalsSet()
Sets the argNonterminals data member to be the static set
of argument nonterminals. |
Symbol |
defaultArgAugmentation()
The symbol that is used to mark argument (required) nonterminals by identifyArguments(Sexp) . |
Symbol |
gapAugmentation()
The symbol that will be used to identify nonterminals whose subtrees contain a gap (a trace). |
Symbol |
getCanonicalArg(Symbol label)
Returns the canonical version of the specified argument nonterminal. |
Symbol |
getCanonicalArg(Symbol label,
Nonterminal nonterminal)
Returns the canonical version of the specified argument nonterminal. |
Set |
getPrunedPreterms()
Returns the set of pruned preterminals ( Sexp objects). |
Set |
getPrunedPunctuation()
Returns the set of preterminals ( Sexp objects) that were
punctuation elements that were "raised away" because they were either at
the beginning or end of a sentence. |
protected int |
hasGap(Sexp tree,
Sexp root,
ArrayList indexStack)
Returns -1 if tree has no gap (trace), or the index of the
trace otherwise. |
boolean |
hasGap(Symbol label)
Returns true if and only if label has a
gap augmentation as added by addGapInformation(Sexp) . |
protected boolean |
hasPossessiveChild(Sexp tree)
Returns true if tree contains a child for which
Treebank.isPossessivePreterminal(Sexp) returns
true , false otherwise. |
Symbol |
headPostSym()
The symbol that is a possible mapping in argContexts to indicate to choose a child relative to the right side of the head as an argument. |
Symbol |
headPreSym()
The symbol that is a possible mapping in argContexts to indicate to choose a child relative to the left side of the head as an argument. |
Symbol |
headSym()
Returns the symbol used in the argContexts map to identify
an offset from the head child. |
Sexp |
identifyArguments(Sexp tree)
Augments labels of nonterminals that are arguments. |
boolean |
isAllNodesToPrune(Sexp tree)
Returns whether all words or preterminals of this tree are to be pruned. |
boolean |
isArgument(Symbol label)
Returns true if and only if label has an
argument augmentation as added by identifyArguments(Sexp) . |
protected boolean |
isArgument(Symbol label,
Nonterminal nonterminal)
Returns true if the specified nonterminal label has an
argument augmentation. |
protected boolean |
isArgument(Symbol label,
Nonterminal nonterminal,
boolean parseLabel)
Returns true if the specified nonterminal label has an
argument augmentation. |
boolean |
isArgumentFast(Symbol label)
Returns true if the specified nonterminal label has an
argument augmentation. |
protected boolean |
isCoordinatedPhrase(Sexp tree,
int headIdx)
Returns true if a non-head child of the specified
tree is a conjunction, and that conjunction is either post-head
but non-final, or immediately pre-head but non-initial (where
"immediately pre-head" means "at the first index
less than headIdx that is not punctuation, as determined
by Treebank.isPunctuation(Symbol) ). |
protected boolean |
isTypeOfSentence(Symbol label)
A helper method used by repairBaseNPs(Sexp,int,Sexp) . |
boolean |
isValidTree(Sexp tree)
Returns true if tree is a preterminal (the base
case) or is a list with the first element of type Symbol (the
node label) and subsequent elements are valid trees (the recursive case). |
static void |
main(String[] args)
Test driver for this class. |
protected boolean |
needToAddNormalNPLevel(Sexp grandparent,
int parentIdx,
Sexp tree)
Returns true if a unary NP needs to be added above the
specified base NP. |
void |
postProcess(Sexp tree)
Post-processes a parse tree after decoding, essentially undoing the steps performed in preprocessing. |
Sexp |
preProcess(Sexp tree)
The method to call before counting events in a training parse tree. |
SexpList |
preProcessTest(SexpList sentence,
SexpList originalWords,
SexpList tags)
Preprocesses the specified test sentence and its coordinated list of tags. |
void |
printMetadata()
Debugging method to print the metadata used by this class. |
Sexp |
prune(Sexp tree)
Prunes away subtrees that have a root that is an element of nodesToPrune . |
Sexp |
raisePunctuation(Sexp tree)
Raises punctuation to the highest possible point in a parse tree, resulting in a tree where no punctuation is the first or last child of a non-leaf node. |
protected void |
readMetadata(SexpTokenizer metadataTok)
Reads metadata to fill in argContexts and
semTagArgStopSet . |
protected void |
readMetadataHook(Symbol dataType,
int metadataLen,
SexpList metadata)
A hook for subclasses to have their own custom metadata types. |
protected void |
relabelArgChildren(SexpList treeList,
int headIdx,
SexpList candidatePatterns)
Relabels as arguments all immediately-dominated children in the specified subtree according to the specified argument-finding patterns. |
Sexp |
relabelSubjectlessSentences(Sexp tree)
Relabels sentences that have no subjects with the nonterminal label returned by Treebank.subjectlessSentenceLabel() . |
Symbol |
removeArgAugmentation(Symbol label)
Removes any argument augmentations from the specified nonterminal label. |
protected Symbol |
removeArgAugmentation(Symbol label,
Nonterminal nonterminal)
Parses label into the specified Nonterminal object and then
removes all argument augmentations. |
Sexp |
removeGapAugmentation(Sexp sexp)
If the specified S-expression is a list, this method modifies the list to contain only symbols without gap augmentations; otherwise, this method removes the gap augmentation (if one exists) in the specified symbol and returns that new symbol. |
Sexp |
removeNullElements(Sexp tree)
Removes all null elements, that is, those nodes of tree for
which Treebank.isNullElementPreterminal(Sexp) returns
true . |
protected void |
removeOnlyChildBaseNPs(Sexp tree)
Handle case where an NP dominates a base NP and has no other children (the base NP is an "only child" of the dominating NP). |
boolean |
removeWord(Symbol word,
Symbol tag,
int idx,
SexpList sentence,
SexpList tags,
SexpList originalTags,
Set prunedPretermsPosSet,
Map prunedPretermsPosMap)
Invoked by the decoder as the first step in preprocessing (prior to the invocation of Training.preProcessTest(danbikel.lisp.SexpList, danbikel.lisp.SexpList, danbikel.lisp.SexpList) ). |
Sexp |
repairBaseNPs(Sexp tree)
Changes the specified tree so that when the last child of an NPB is an S, the S gets raised to be a sibling immediately following the NPB. |
protected Sexp |
repairBaseNPs(Sexp grandparent,
int parentIdx,
Sexp tree)
Changes the specified tree so that when the last child of an NPB is an S, the S gets raised to be a sibling immediately following the NPB. |
void |
setUpFastArgMap(CountsTable nonterminals)
Indicates to set up a static map for quickly mapping argument nonterminals to their non-argument variants (that is, for quickly stripping away their argument augmentations). |
String |
skip(Sexp tree)
Returns whether the specified tree is to be skipped when training. |
Symbol |
startSym()
Returns the symbol to indicate hidden nonterminals that precede the first in a sequence of modifier nonterminals. |
Word |
startWord()
Returns the Word object that represents the hidden "head
word" of the start symbol. |
protected static void |
staticSetUpFastArgMap(CountsTable nonterminals)
Indicates to set up a static map for quickly mapping argument nonterminals to their non-argument variants (that is, for quickly stripping away their argument augmentations). |
Symbol |
stopSym()
Returns the symbol to indicate a hidden nonterminal that follows the last in a sequence of modifier nonterminals. |
Word |
stopWord()
Returns the Word object that represents the hidden "head
word" of the stop symbol. |
Sexp |
stripAugmentations(Sexp tree)
Strips any augmentations off all of the nonterminal labels of tree . |
protected Symbol |
stripAugmentations(Symbol label)
Parses the specified nonterminal label and removes all augmentations. |
protected void |
stripAugmentations(Symbol label,
Nonterminal nonterminal,
boolean parseLabel)
Fills in the specified Nonterminal object with the specified
nonterminal label but without any augmentations. |
protected Sexp |
threadNPArgAugmentations(Sexp tree)
Adds any argument augmentations on an NP to its head child, continuing recursively until reaching a preterminal. |
Symbol |
topSym()
Returns the symbol to indicate the hidden root of all parse trees. |
Word |
topWord()
Returns the Word object that represents the hidden "head
word" of the hidden root of all parse trees. |
Symbol |
traceTag()
The symbol that gets reassigned as the part of speech for null preterminals that represent traces that have undergone WH-movement, as relabeled by the default implementation of addGapInformation(Sexp) . |
Sexp |
transformSubjectNTs(Sexp tree)
Transforms nonterminals marked with a subject augmentation so that their unaugmented base label is the concatenation of the original base label plus the subject augmentation. |
protected boolean |
unaryProductionsToNull(Sexp tree)
Returns whether the specified subtree consists solely of unary productions going to a null element terminal. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
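protected static boolean relabelHeadChildrenAsArgs
Indicates to relabel head children as arguments.
protected static boolean addGapInfo
Caches the boolean value of the property Settings.addGapInfo.
protected static boolean repairBaseNPs
Caches the boolean value of the property Settings.collinsRepairBaseNPs.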
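protected static Set argNonterminals
Static set for storing argument nonterminals.
protected static final Symbol argContextsSym
The symbol to indicate the list of argument-finding rules from a metadata resource.
protected static final Symbol semTagArgStopListSym
The symbol to indicate the list of node augmentations that prevent a node from being relabeled as an argument.
protected static final Symbol nodesToPruneSym
The symbol to indicate the list of nodes to prune.
See Also: nodesToPrune, prune(Sexp)
protected static final String metadataPropertyPrefix
The prefix of the property of the metadata resource required by the default constructor of concrete subclasses. The value is "parser.training.metadata.".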
protected Treebank treebank
Holds the value of Language.treebank().
protected HeadFinder headFinder
Holds the value of Language.headFinder().
protected Symbol gapAugmentation
The symbol that will be used to identify nonterminals whose subtrees contain a gap (a trace). It is used by stripAugmentations(Sexp), so that gap augmentations that are added by addGapInformation(Sexp) do not get removed. The default value is the symbol returned by Symbol.add("g"). If this default value conflicts with an augmentation already used in a particular Treebank, this value should be reassigned in the constructor of a subclass.
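protected String delimAndGapStr
The string consisting of the canonical augmentation delimiter concatenated with the gap augmentation, to be used in identifying nonterminals that contain gap augmentations.
See Also: Treebank.canonicalAugDelimiter(), gapAugmentation
protected int delimAndGapStrLen
The length of delimAndGapStr, cached here for efficiency and convenience.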
protected Symbol defaultArgAugmentation
The symbol that will be used to identify argument nonterminals. It is used by stripAugmentations(Sexp), so that argument augmentations that are added by identifyArguments(Sexp) do not get removed. The default value is the symbol returned by Symbol.add("A"). If this default value conflicts with an augmentation already used in a particular Treebank, this value should be reassigned in the constructor of a subclass.
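The interaction between stripAugmentations(Sexp) and the gap and argument augmentations can be sketched roughly as follows. This is only an illustration on plain strings, not the library's implementation: the class name StripSketch, the "-" delimiter, and the use of String in place of Symbol are assumptions, while "g" and "A" are the documented default augmentations.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class StripSketch {
    // Assumed canonical augmentation delimiter (hypothetical stand-in for
    // Treebank.canonicalAugDelimiter()).
    static final String DELIM = "-";
    // The documented default gap ("g") and argument ("A") augmentations.
    static final Set<String> PRESERVED = new HashSet<>(Arrays.asList("g", "A"));

    /** Drops every augmentation of label except those in PRESERVED. */
    static String strip(String label) {
        String[] parts = label.split(DELIM);
        StringBuilder sb = new StringBuilder(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            if (PRESERVED.contains(parts[i])) {
                sb.append(DELIM).append(parts[i]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("NP-SBJ-A")); // NP-A
        System.out.println(strip("S-TMP-g"));  // S-g
        System.out.println(strip("PP-LOC"));   // PP
    }
}
```

Labels that begin with the delimiter (such as -NONE-) are not handled by this sketch; a real implementation would parse labels with the treebank's own nonterminal parser.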
protected SexpList argAugmentations
A list representing the set of all argument augmentations. It is filled in by the createArgAugmentationsList() method after filling in the argContexts map.
See Also: argContexts, createArgAugmentationsList()
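protected Symbol traceTag
The symbol that gets assigned as the part of speech for null preterminals that represent traces that have undergone WH-movement, as relabeled by the default implementation of addGapInformation(Sexp). The default value is the return value of Symbol.add("*TRACE*"). If this maps to an actual part of speech tag or nonterminal label in a particular Treebank, this data member should be reassigned in the constructor of a subclass.
protected final Symbol canonicalAugDelimSym
A Symbol created from the first character of Treebank.augmentationDelimiters().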
protected Set nodesToPrune
Data member to store the set of nodes to prune for the default implementation of prune(Sexp). The set should only contain objects of type Symbol, and the elements of this set should be added in the constructor of a subclass.
See Also: prune(Sexp)
protected Set wordsToPrune
Data member to store the set of words to prune for the default implementation of prune(Sexp). The set should only contain objects of type Symbol, and the elements of this set should be added in the constructor of a subclass. The default implementation will only prune a preterminal if both the part-of-speech tag is in nodesToPrune and the word is in this wordsToPrune set.
See Also: prune(Sexp)
protected Set prunedPreterms
The set of preterminals (Sexp objects) that have been pruned away.
protected Map argContexts
Data member used to store the map required by the default implementation of the method identifyArguments(Sexp). This data member maps parent nonterminals to lists of children nonterminals, to indicate that the children are candidates for being labeled as arguments in the presence of that parent. A children list may also be a list of the form (head <offset>), indicating to match a node <offset> away from the head child of the parent that was mapped to this children list. The keys and values of this map should be added in the constructor of a subclass. The keys of this map must be of type Symbol, and the values of this map must be of type SexpList.
Optionally, after this data member has been filled in by the constructor of a subclass, the method createArgAugmentationsList() should be invoked to automatically fill in the argAugmentations list.
See Also: identifyArguments(Sexp), argAugmentations, createArgAugmentationsList()
protected Set semTagArgStopSet
Data member used to store the set required by the method identifyArguments(Sexp). The set contains semantic tags (which is English Treebank parlance) that prohibit a candidate argument child from being relabeled as an argument. The objects in this set must all be of type Symbol. The members of this set should be added in the constructor of a subclass.
See Also: identifyArguments(Sexp)
protected static final Symbol headSym
The symbol that is a possible mapping in argContexts to indicate to choose a child relative to the head as an argument. For example, an argument context might be PP mapping to (head 1), meaning that the child that is 1 position to the right of the head child of a PP should be relabeled as an argument. The value of this data member is the symbol returned by Symbol.add("head"). In the unlikely event that this value conflicts with a nonterminal in a particular Treebank, this data member should be reassigned in the constructor of a subclass.
See Also: identifyArguments(Sexp)
protected static final Symbol headPreSym
The symbol that is a possible mapping in argContexts to indicate to choose a child relative to the left side of the head as an argument. For example, an argument context might be VP mapping to (head-left left MD VBD), meaning that the children to the left of the head child should be searched from left to right, and the first child found that is a member of the set {MD, VBD} should be considered a possible argument of the head.
protected static final Symbol headPostSym
The symbol that is a possible mapping in argContexts to indicate to choose a child relative to the right side of the head as an argument. For example, an argument context might be PP mapping to (head-right left PP NP WHNP ADJP), meaning that the children to the right of the head child should be searched from left to right, and the first child found that is a member of the set {PP, NP, WHNP, ADJP} should be considered a possible argument of the head.
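The head-left and head-right searches described for headPreSym and headPostSym might look roughly like the sketch below. It works on plain lists of label strings rather than SexpList objects; the class and method names are hypothetical, and only the left-to-right scan direction shown in the examples is modeled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class HeadRelativeSearch {
    /**
     * Scans the children on one side of the head from left to right and
     * returns the index of the first child whose label is a candidate,
     * or -1 if no such child exists.
     */
    static int findCandidate(List<String> children, int headIdx,
                             boolean rightOfHead, Set<String> candidates) {
        int from = rightOfHead ? headIdx + 1 : 0;
        int to = rightOfHead ? children.size() : headIdx; // exclusive bound
        for (int i = from; i < to; i++) {
            if (candidates.contains(children.get(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // PP context: (head-right left PP NP WHNP ADJP)
        List<String> pp = Arrays.asList("IN", "ADVP", "NP", "PP");
        Set<String> ppArgs = new HashSet<>(Arrays.asList("PP", "NP", "WHNP", "ADJP"));
        System.out.println(findCandidate(pp, 0, true, ppArgs));  // 2

        // VP context: (head-left left MD VBD)
        List<String> vp = Arrays.asList("ADVP", "MD", "VB", "NP");
        Set<String> vpArgs = new HashSet<>(Arrays.asList("MD", "VBD"));
        System.out.println(findCandidate(vp, 2, false, vpArgs)); // 1
    }
}
```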
protected final Symbol baseNP
The value of Treebank.baseNPLabel(), cached for efficiency and convenience.
protected final Symbol NP
The value of Treebank.NPLabel(), cached for efficiency and convenience.
protected Set prunedPunctuation
The set of preterminals (Sexp objects) that were "raised away" by raisePunctuation(Sexp) because they appeared either at the beginning or the end of a sentence.
Constructor Detail |
---|
protected AbstractTraining()
Default constructor for this abstract base class; sets argContexts to a new Map object, sets semTagArgStopSet to a new Set object and initializes canonicalAugDelimSym. Subclass constructors are responsible for filling in the data for argContexts and semTagArgStopSet.
Method Detail |
---|
public void setUpFastArgMap(CountsTable nonterminals)
Description copied from interface: Training
Indicates to set up a static map for quickly mapping argument nonterminals to their non-argument variants (that is, for quickly stripping away their argument augmentations).
N.B.: This method is necessarily thread-safe, as it is expected to be invoked by every Decoder as it starts up, and since there can be multiple Decoder instances within a given VM. However, note that it is inappropriate to invoke this method if the set of nonterminals in the specified counts table is incomplete (see the documentation for the SubcatBag class for an instance where this will be the case).
Specified by: setUpFastArgMap in interface Training
Parameters: nonterminals - a counts table whose keys form a complete set of all possible nonterminal labels, as is obtained from DecoderServerRemote.nonterminals() (the counts to which the nonterminals are mapped are not used by this method)
protected static void staticSetUpFastArgMap(CountsTable nonterminals)
Indicates to set up a static map for quickly mapping argument nonterminals to their non-argument variants (that is, for quickly stripping away their argument augmentations).
N.B.: This method is necessarily thread-safe, as it is expected to be invoked by every Decoder as it starts up, and since there can be multiple Decoder instances within a given VM.
Parameters: nonterminals - a counts table whose keys form a complete set of all possible nonterminal labels, as is obtained from DecoderServerRemote.nonterminals() (the counts to which the nonterminals are mapped are not used by this method)
public Sexp preProcess(Sexp tree)
The method to call before counting events in a training parse tree. This default implementation runs the following methods in order:
1. prune(Sexp)
2. addBaseNPs(Sexp)
3. repairBaseNPs(Sexp)
4. addGapInformation(Sexp)
5. relabelSubjectlessSentences(Sexp)
6. removeNullElements(Sexp)
7. raisePunctuation(Sexp)
8. identifyArguments(Sexp)
9. stripAugmentations(Sexp)
Ordering constraints:
- addGapInformation(Sexp) should be run after methods that introduce new nodes, which in this case is addBaseNPs(Sexp), as these new nodes may need to be used to thread the gap feature.
- relabelSubjectlessSentences(Sexp) should be run after addGapInformation(Sexp) because only those sentences whose empty subjects are not the result of WH-movement should be relabeled.
- removeNullElements(Sexp) should be run after any methods that depend on the presence of null elements, such as relabelSubjectlessSentences(Sexp), because a sentence cannot be determined to be subjectless unless a null element is present as a child of a subject-marked node, and addGapInformation(Sexp), because the determination of the location of a trace requires the presence of indexed null elements.
- raisePunctuation(Sexp) should be run after removeNullElements(Sexp) because a null element that is a leftmost or rightmost child can block detection of a punctuation element that needs to be raised after removal of the null element (if a punctuation element is the next-to-leftmost or next-to-rightmost child of an interior node).
- stripAugmentations(Sexp) should be run after all methods that may depend upon the presence of nonterminal augmentations: identifyArguments(Sexp), relabelSubjectlessSentences(Sexp) and addGapInformation(Sexp).
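One way to keep the ordering rules above honest when a subclass rearranges the pipeline is to encode them as data and check a proposed order against them. The sketch below is purely illustrative: the class PreProcessOrder and its members are hypothetical, and only the step names and the "runs after" relations are taken from the text above.

```java
import java.util.Arrays;
import java.util.List;

final class PreProcessOrder {
    // Step names in the documented default order.
    static final List<String> STEPS = Arrays.asList(
        "prune", "addBaseNPs", "repairBaseNPs", "addGapInformation",
        "relabelSubjectlessSentences", "removeNullElements",
        "raisePunctuation", "identifyArguments", "stripAugmentations");

    // Each pair {later, earlier} means "later should run after earlier".
    static final String[][] CONSTRAINTS = {
        {"addGapInformation", "addBaseNPs"},
        {"relabelSubjectlessSentences", "addGapInformation"},
        {"removeNullElements", "relabelSubjectlessSentences"},
        {"removeNullElements", "addGapInformation"},
        {"raisePunctuation", "removeNullElements"},
        {"stripAugmentations", "identifyArguments"},
        {"stripAugmentations", "relabelSubjectlessSentences"},
        {"stripAugmentations", "addGapInformation"},
    };

    /** Returns true if every constrained step appears after its prerequisite. */
    static boolean satisfies(List<String> order) {
        for (String[] c : CONSTRAINTS) {
            int later = order.indexOf(c[0]);
            int earlier = order.indexOf(c[1]);
            if (later < 0 || earlier < 0 || later <= earlier) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(satisfies(STEPS)); // true
    }
}
```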
Specified by: preProcess in interface Training
Parameters: tree - the parse tree to pre-process
Returns: tree having been pre-processed
public boolean removeWord(Symbol word, Symbol tag, int idx, SexpList sentence, SexpList tags, SexpList originalTags, Set prunedPretermsPosSet, Map prunedPretermsPosMap)
Description copied from interface: Training
Invoked by the decoder as the first step in preprocessing (prior to the invocation of Training.preProcessTest(danbikel.lisp.SexpList, danbikel.lisp.SexpList, danbikel.lisp.SexpList)). Returns whether the specified word should be removed from the sentence before parsing.
Specified by: removeWord in interface Training
Parameters:
word - a word in the sentence about to be parsed
tag - the supplied part-of-speech tag of the specified word, or null if tags were not supplied
idx - the index of the specified word in the specified sentence
sentence - a list of Symbol objects that represent the words of the sentence to be parsed
tags - coordinated list of supplied part-of-speech tag lists for each of the words in the specified sentence, or null if no tags were supplied
originalTags - the cached copy of the specified tags list, used when Settings.restorePrunedWords is true
prunedPretermsPosSet - the set of part-of-speech tags that were pruned during training
prunedPretermsPosMap - a map of words pruned during training to their part-of-speech tags when they were pruned
public SexpList preProcessTest(SexpList sentence, SexpList originalWords, SexpList tags)
Preprocesses the specified test sentence and its coordinated list of tags.
Specified by: preProcessTest in interface Training
Parameters:
sentence - the list of words, where a known word is a symbol and an unknown word is represented by a 3-element list (see DecoderServerRemote.convertUnknownWords(danbikel.lisp.SexpList))
originalWords - the list of unprocessed words (all symbols)
tags - the list of tag lists, where the list at index i is the list of possible parts of speech for the word at that index
Returns: a two-element list, the first element of which is a processed version of sentence and the second of which is a processed version of tags; if tags is null, then the returned list will contain only one element (since SexpList objects are not designed to handle null elements)
public boolean isValidTree(Sexp tree)
Returns true if tree is a preterminal (the base case) or is a list with the first element of type Symbol (the node label) and subsequent elements are valid trees (the recursive case). If a language package requires a different definition of training parse tree validity, this method should be overridden. However, changing the definition of tree validity should be done with care, as the default implementations of the tree-processing methods in this class require trees that correspond to the definition of validity implemented by this method. This method also ensures that not all words or preterminals in the tree are to be pruned.
Specified by: isValidTree in interface Training
Parameters: tree - the parse tree to check for validity
See Also: isAllNodesToPrune(Sexp), Treebank.isPreterminal(Sexp)
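The recursive definition of validity given for isValidTree(Sexp) can be sketched on a toy S-expression model, where a symbol is a String and a list is a java.util.List. Everything here is an assumption made for illustration: the class TreeValidity, the two-symbol (tag word) shape of a preterminal, and the requirement of at least one child. The pruning check mentioned above is not modeled.

```java
import java.util.Arrays;
import java.util.List;

final class TreeValidity {
    /** Assumed preterminal shape: a two-element list of symbols, (tag word). */
    static boolean isPreterminal(Object t) {
        if (!(t instanceof List)) {
            return false;
        }
        List<?> l = (List<?>) t;
        return l.size() == 2
            && l.get(0) instanceof String
            && l.get(1) instanceof String;
    }

    static boolean isValidTree(Object t) {
        if (isPreterminal(t)) {
            return true;                      // base case
        }
        if (!(t instanceof List)) {
            return false;                     // a bare symbol is not a tree
        }
        List<?> l = (List<?>) t;
        if (l.size() < 2 || !(l.get(0) instanceof String)) {
            return false;                     // needs a label and a child
        }
        for (int i = 1; i < l.size(); i++) {
            if (!isValidTree(l.get(i))) {
                return false;                 // recursive case
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Object good = Arrays.asList("S",
            Arrays.asList("NP", Arrays.asList("DT", "the"), Arrays.asList("NN", "dog")),
            Arrays.asList("VP", Arrays.asList("VBD", "barked")));
        Object bad = Arrays.asList("NP", "the", "dog", "extra");
        System.out.println(isValidTree(good)); // true
        System.out.println(isValidTree(bad));  // false
    }
}
```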
public boolean isAllNodesToPrune(Sexp tree)
Returns whether all words or preterminals of this tree are to be pruned.
Parameters: tree - the tree to inspect
See Also: prune(Sexp)
public String skip(Sexp tree)
isValidTree(Sexp)
.
skip
in interface Training
tree
- an annotated training tree
null otherwise
See Also: Trainer.train(SexpTokenizer,boolean,boolean)
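public Sexp transformSubjectNTs(Sexp tree)
Transforms nonterminals marked with a subject augmentation so that their unaugmented base label is the concatenation of the original base label plus the subject augmentation.
Parameters: tree - the tree in which to transform subject nonterminals
public Set getPrunedPreterms()
Returns the set of pruned preterminals (Sexp objects).
Specified by: getPrunedPreterms in interface Training
See Also: prune(Sexp)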
public Sexp prune(Sexp tree)
Prunes away subtrees that have a root that is an element of nodesToPrune.
Side effect: An internal set of pruned preterminals will be updated. This set may be accessed via getPrunedPreterms().
Bugs: Cannot prune away entire tree if the root label of the specified tree is in nodesToPrune.
Specified by: prune in interface Training
Parameters: tree - the parse tree to prune
Returns: tree having been pruned
See Also: nodesToPrune
protected final void collectPreterms(Set preterms, Sexp tree)
Adds all preterminal subtrees to the specified set.
Parameters:
preterms - the set to which preterminal subtrees of the specified tree are to be added
tree - the tree from which to collect preterminal subtrees
public Sexp identifyArguments(Sexp tree)
Augments labels of nonterminals that are arguments. Implementations should return tree untouched if argument identification is not desired for a particular language package.
Note that children in a coordinated phrase are never relabeled as arguments, as determined by subtrees for which isCoordinatedPhrase(Sexp,int) returns true.
Specified by: identifyArguments in interface Training
Parameters: tree - the parse tree to modify
Returns: the modified tree object
See Also: Treebank.canonicalAugDelimiter()
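protected void relabelArgChildren(SexpList treeList, int headIdx, SexpList candidatePatterns)
Relabels as arguments all immediately-dominated children in the specified subtree according to the specified argument-finding patterns.
Parameters:
treeList - the subtree in which to relabel arguments
headIdx - the index of the child of the specified subtree that is the head
candidatePatterns - the set of argument-finding rules
protected boolean addArgAugmentation(Symbol label, Nonterminal nonterminal)
Adds the default argument augmentation to the specified nonterminal if the specified label is not already an argument.
Parameters:
label - the label that has been parsed into the specified Nonterminal object
nonterminal - the parsed version of the specified label
Returns: true if the specified Nonterminal object was modified, false otherwise
public Symbol defaultArgAugmentation()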
The symbol that is used to mark argument (required) nonterminals by identifyArguments(Sexp).
Specified by: defaultArgAugmentation in interface Training
public Symbol getCanonicalArg(Symbol label)
Returns the canonical version of the specified argument nonterminal, using Treebank.getCanonical(Symbol). For example, in the English Penn Treebank, S nonterminals that dominate trees with no subjects get converted to SG; if one of these is identified as an argument, it will be converted to SG-A; this method will return S-A, since S is the canonical version of SG. This method is needed by the class SubcatBag. See also the getCanonicalArg(Symbol,Nonterminal) method.
Specified by: getCanonicalArg in interface Training
Parameters: label - the argument nonterminal label to be canonicalized
See Also: getCanonicalArg(Symbol, Nonterminal)
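The SG-A to S-A example can be mimicked on plain strings as below. This is a sketch under stated assumptions, not the library's code: the class CanonicalArgSketch, the lookup table standing in for Treebank.getCanonical(Symbol), and the trailing "-A" suffix convention are all hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.Map;

final class CanonicalArgSketch {
    // Hypothetical stand-in for Treebank.getCanonical(Symbol).
    static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("SG", "S");
    }
    // Assumed delimiter plus the documented default argument augmentation.
    static final String ARG_SUFFIX = "-A";

    /** Canonicalizes the base label while keeping a trailing argument mark. */
    static String getCanonicalArg(String label) {
        if (!label.endsWith(ARG_SUFFIX)) {
            return CANONICAL.getOrDefault(label, label);
        }
        String base = label.substring(0, label.length() - ARG_SUFFIX.length());
        return CANONICAL.getOrDefault(base, base) + ARG_SUFFIX;
    }

    public static void main(String[] args) {
        System.out.println(getCanonicalArg("SG-A")); // S-A
        System.out.println(getCanonicalArg("NP-A")); // NP-A
    }
}
```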
public Symbol getCanonicalArg(Symbol label, Nonterminal nonterminal)
Returns the canonical version of the specified argument nonterminal, using Treebank.getCanonical(Symbol). For example, in the English Penn Treebank, S nonterminals that dominate trees with no subjects get converted to SG; if one of these is identified as an argument, it will be converted to SG-A; this method will return S-A, since S is the canonical version of SG. This method is needed by the class SubcatBag.
Parameters:
label - the argument nonterminal label whose canonical version is to be returned
nonterminal - the Nonterminal instance to be used
public boolean isArgument(Symbol label)
Returns true if and only if label has an argument augmentation as added by identifyArguments(Sexp).
Specified by: isArgument in interface Training
protected boolean isArgument(Symbol label, Nonterminal nonterminal)
Returns true if the specified nonterminal label has an argument augmentation. This method is a synonym for isArgument(label, nonterminal, true). See also isArgumentFast(Symbol).
Parameters:
label - the label to be tested
nonterminal - the Nonterminal instance to be used for storing the parsed version of the specified nonterminal label
See Also: isArgument(Symbol,Nonterminal,boolean)
protected boolean isArgument(Symbol label, Nonterminal nonterminal, boolean parseLabel)
Returns true if the specified nonterminal label has an argument augmentation. See also isArgumentFast(Symbol).
Parameters:
label - the label to be tested
nonterminal - the Nonterminal instance to be used for storing the parsed version of the specified nonterminal label
parseLabel - indicates whether to parse the specified label before checking whether it is an argument
public boolean isArgumentFast(Symbol label)
Returns true if the specified nonterminal label has an argument augmentation. Unlike isArgument(Symbol), this method is thread-safe. Also, after setUpFastArgMap(CountsTable) has been invoked, this method is much more efficient than isArgument(Symbol), as it uses an internal cache for O(1) expected time operation.
Specified by: isArgumentFast in interface Training
See Also: setUpFastArgMap(CountsTable)
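The idea of a precomputed map giving O(1) expected-time argument checks can be sketched as follows. The class FastArgMapSketch is hypothetical, labels are plain strings, and an argument is assumed to be any label ending in "-A"; the real class works on Symbol objects and a CountsTable, and how it builds its cache is not shown in this documentation.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class FastArgMapSketch {
    // Maps each argument label to its non-argument variant. A concurrent
    // map keeps lookups safe when several decoders share one VM.
    private static final Map<String, String> ARG_TO_BASE = new ConcurrentHashMap<>();
    private static final String ARG_SUFFIX = "-A"; // assumed label convention

    /** Fills the cache from a complete collection of nonterminal labels. */
    static void setUpFastArgMap(Collection<String> nonterminals) {
        for (String nt : nonterminals) {
            if (nt.endsWith(ARG_SUFFIX)) {
                ARG_TO_BASE.put(nt, nt.substring(0, nt.length() - ARG_SUFFIX.length()));
            }
        }
    }

    /** Constant expected-time check, valid once the cache has been set up. */
    static boolean isArgumentFast(String label) {
        return ARG_TO_BASE.containsKey(label);
    }

    /** Returns the cached non-argument variant, or the label itself. */
    static String removeArgAugmentation(String label) {
        return ARG_TO_BASE.getOrDefault(label, label);
    }

    public static void main(String[] args) {
        setUpFastArgMap(Arrays.asList("NP", "NP-A", "S-A", "VP"));
        System.out.println(isArgumentFast("NP-A"));        // true
        System.out.println(isArgumentFast("NP"));          // false
        System.out.println(removeArgAugmentation("S-A"));  // S
    }
}
```

As the note on setUpFastArgMap(CountsTable) warns, a cache of this kind is only trustworthy when it is built from a complete set of nonterminals; a label missing from the input would be reported as a non-argument.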
public Sexp addGapInformation(Sexp tree)
tree
untouched if gap
information is desired for a particular language package. The default
implementation of this method checks the setting of the property Settings.addGapInfo
: if this property is false
, then
tree
is returned untouched; otherwise, this method simply
calls hasGap(Sexp,Sexp,ArrayList)
.
addGapInformation
in interface Training
tree
- the parse tree to which to add gapping
tree
that was passed in, with certain
nodes modified to include gap information
hasGap(Sexp, Sexp, ArrayList)
protected int hasGap(Sexp tree, Sexp root, ArrayList indexStack)
tree
has no gap (trace), or the index of the
trace otherwise. If tree
is a null preterminal with an
indexed terminal (a trace) that matches the index at the top of
indexStack
, then that index is popped off the stack, the
preterminal label is changed to be traceTag
, and the index of the
trace is returned. If a child of tree
has a gap but another
child is a WHNP that is coindexed, then the gap is "filled", and
this method returns -1; otherwise, this method augments the label of
tree
with gapAugmentation
and returns the gap index
of the child.
Put informally, this method does a depth-first search of tree
,
pushing the indices of any indexed WHNP nodes onto indexStack
and popping off those indices when the corresponding null element is found
someplace deeper in the tree. The stack is necessary to allow for
the nesting of gaps in a tree.
Algorithm:
  // base case
  if tree is a null-element preterminal with an index that matches top of indexStack then
    modify preterminal to be traceTag;
    return pop(indexStack);
  endif

  int numWHNPChildren = 0;
  Sexp whnpChild = null;
  foreach child of tree do
    if child is a WHNP with an index augmentation then
      if numWHNPChildren == 0 then
        whnpChild = child;
      endif
      numWHNPChildren++;
    endif
  end
  if numWHNPChildren > 0 then
    push(index of whnpChild, indexStack);
  endif

  int numTracesToBeLinked = 0, traceIndex = -1;
  foreach child of tree do
    int gapIndex = hasGap(child, root, indexStack); // recursive call
    if gapIndex != -1 then
      if numTracesToBeLinked == 0 then
        traceIndex = gapIndex;
      endif
      numTracesToBeLinked++;
    endif
  end

  if numTracesToBeLinked > 0 then
    add gap augmentation to the current parent (the root of tree);
    if numWHNPChildren > 0 and index of whnpChild == traceIndex then
      // a trace from a child subtree has been hooked up with the current WHNP child
      return -1;
    else
      return traceIndex;
    endif
  else
    if numWHNPChildren > 0 then
      print warning that a moved WHNP node doesn't have a coindexed trace in any of its parent's other child subtrees;
    endif
    return -1;
  endif

A warning will also be issued if there are crossing WHNP-trace dependencies.
This method is called by the default implementation of addGapInformation(danbikel.lisp.Sexp)
.
tree
- the tree to gapify
root
- always the root of the tree we're gapifying, for error and
warning reporting
indexStack
- a stack of Integer
objects (where the top
of the stack is the highest-indexed object), representing the pending
requests to find traces to match with coindexed WHNP's discovered higher
up in the tree (earlier in the DFS)
tree
has no gap, or the index of the trace
otherwise
gapAugmentation
,
traceTag
,
addGapInformation(Sexp)
,
Treebank.isWHNP(Symbol)
,
Treebank.isNullElementPreterminal(Sexp)
,
Treebank.getTraceIndex(Sexp, Nonterminal)
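The gap-threading search described above can be sketched on a toy tree type. This is an illustration under stated assumptions, not the parser's code: GapNode and GapSketch are hypothetical names, an integer field stands in for the index augmentation, the "-g" suffix assumes "-" as the delimiter, and the root parameter (used only for reporting) is omitted. The sketch follows the pseudocode in augmenting the parent even when the gap is filled at that node.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy node for this sketch only; the real method works on Sexp trees. */
final class GapNode {
  String label;                      // e.g. "S", "WHNP", "-NONE-"
  final int index;                   // coindexation index, or -1 if none
  final List<GapNode> kids = new ArrayList<>();

  GapNode(String label, int index, GapNode... kids) {
    this.label = label;
    this.index = index;
    this.kids.addAll(Arrays.asList(kids));
  }
}

final class GapSketch {
  static final String TRACE_TAG = "*TRACE*"; // default trace tag named in the text
  static final String GAP_SUFFIX = "-g";     // assumed delimiter plus default gap augmentation

  /** Returns -1 if tree has no unfilled gap, otherwise the index of the trace. */
  static int hasGap(GapNode tree, Deque<Integer> indexStack) {
    // Base case: a null element whose index matches the most recent pending WHNP.
    if (tree.label.equals("-NONE-") && !indexStack.isEmpty()
        && indexStack.peek() == tree.index) {
      tree.label = TRACE_TAG;
      return indexStack.pop();
    }
    // Remember the first indexed WHNP child and push its index as a pending request.
    GapNode whnpChild = null;
    for (GapNode child : tree.kids) {
      if (whnpChild == null && child.label.equals("WHNP") && child.index >= 0) {
        whnpChild = child;
      }
    }
    if (whnpChild != null) indexStack.push(whnpChild.index);

    // Recurse, keeping the index of the first trace reported by a child.
    int numTraces = 0;
    int traceIndex = -1;
    for (GapNode child : tree.kids) {
      int gapIndex = hasGap(child, indexStack);
      if (gapIndex != -1) {
        if (numTraces == 0) traceIndex = gapIndex;
        numTraces++;
      }
    }
    if (numTraces > 0) {
      tree.label = tree.label + GAP_SUFFIX;
      // The gap is filled here if the WHNP child carries the same index.
      if (whnpChild != null && whnpChild.index == traceIndex) return -1;
      return traceIndex;
    }
    if (whnpChild != null) {
      System.err.println("warning: WHNP with index " + whnpChild.index
          + " has no coindexed trace among its siblings' subtrees");
    }
    return -1;
  }
}
```

For a relative clause such as (SBAR (WHNP-1 ...) (S (NP ...) (VP (VBD ...) (NP (-NONE- *T*-1))))), every node on the path from the trace up to SBAR receives the gap suffix, and the call on SBAR returns -1 because the WHNP fills the gap there.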
public boolean hasGap(Symbol label)
true
if and only if label
has a
gap augmentation as added by addGapInformation(Sexp)
.
hasGap
in interface Training
public Symbol gapAugmentation()
stripAugmentations(Sexp)
, so that gap augmentations that are added by
addGapInformation(Sexp)
do not get removed. The default value is
the symbol returned by Symbol.add("g")
. If this
default value conflicts with an augmentation already used in a particular
Treebank, the value of the data member gapAugmentation
should be
reassigned in the constructor of a subclass.
gapAugmentation
in interface Training
public Symbol traceTag()
addGapInformation(Sexp)
. The default value is the return value of
Symbol.add("*TRACE*")
. If this maps to an actual
part of speech tag or nonterminal label in a particular Treebank, the
data member traceTag
should be reassigned in the constructor
of a subclass.
traceTag
in interface Training
public Sexp relabelSubjectlessSentences(Sexp tree)
Treebank.subjectlessSentenceLabel()
. This method is
optional, and may be overridden to simply return tree
untouched if subjectless sentence relabeling is not desired for a
particular language package.
The default implementation here assumes that a subjectless sentence is a
node for which Treebank.isSentence(Symbol)
returns
true
and has a child whose label carries the augmentation returned by Treebank.subjectAugmentation()
, and that this
child represents a subtree that is a series of unary productions, ending in
a subtree for which Treebank.isNullElementPreterminal(Sexp)
returns true
. Informally, this method looks for sentence
nodes that have a child marked as a subject, where that child has a null
element as its first (and presumably only) child. For example, in the
English Treebank, this would mean one of the following contexts:
(S (PREMOD ...) (NP-SBJ (-NONE- *T*)) ... )
or
(S (PREMOD ...) (NP-SBJ (NPB (-NONE- *T*))) ... )
where (PREMOD ...) represents zero or more premodifying phrases and where NPB represents a node inserted by a method such as
addBaseNPs(Sexp)
. Note that the subtree rooted by NPB
satisfies the condition of being a subtree that is the result of a
series of unary productions (one of them, in this case) ending
in a null element preterminal. (This seemingly over-complicated condition
is necessary for this method to run properly after tree
has been processed by addBaseNPs(Sexp)
.)
If a subclass of this class in a language package requires more extensive or different checking for the "subjectlessness" of a sentence, this method should be overridden.
relabelSubjectlessSentences
in interface Training
tree
- the parse tree in which to relabel subjectless sentences
tree
that was passed in, with
subjectless sentence nodes relabeled
Treebank.isSentence(Symbol)
,
Treebank.subjectAugmentation()
,
Treebank.isNullElementPreterminal(Sexp)
,
Treebank.subjectlessSentenceLabel()
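The "series of unary productions ending in a null element" test can be sketched on a toy tree. The names SubjNode and SubjectlessSketch are hypothetical, and the labels "S", "SG", "-SBJ" and "-NONE-" are English Treebank conventions assumed for the example; the real method consults the Treebank object instead of hard-coded strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy tree node standing in for Sexp; for illustration only. */
final class SubjNode {
  String label;
  final List<SubjNode> kids = new ArrayList<>();
  SubjNode(String label) { this.label = label; }

  /** Parses a parenthesized tree such as "(S (NP (NN dog)))". */
  static SubjNode parse(String s) {
    String spaced = s.replace("(", " ( ").replace(")", " ) ").trim();
    return read(new ArrayDeque<>(Arrays.asList(spaced.split("\\s+"))));
  }

  private static SubjNode read(Deque<String> toks) {
    String tok = toks.poll();
    if (!tok.equals("(")) return new SubjNode(tok);   // a bare word
    SubjNode node = new SubjNode(toks.poll());        // label follows "("
    while (!toks.peek().equals(")")) node.kids.add(read(toks));
    toks.poll();                                      // consume ")"
    return node;
  }
}

final class SubjectlessSketch {
  static boolean isNullPreterminal(SubjNode n) {
    return n.label.equals("-NONE-") && n.kids.size() == 1 && n.kids.get(0).kids.isEmpty();
  }

  /** True if the subtree is a chain of single-child nodes ending in a null element. */
  static boolean unaryProductionsToNull(SubjNode n) {
    while (true) {
      if (isNullPreterminal(n)) return true;
      if (n.kids.size() != 1) return false;
      n = n.kids.get(0);
    }
  }

  /** Relabels S nodes whose subject child dominates only a null element. */
  static void relabel(SubjNode tree) {
    if (tree.label.equals("S")) {
      for (SubjNode child : tree.kids) {
        if (child.label.endsWith("-SBJ") && unaryProductionsToNull(child)) {
          tree.label = "SG"; // assumed subjectless sentence label
          break;
        }
      }
    }
    for (SubjNode child : tree.kids) relabel(child);
  }
}
```

Both (S (NP-SBJ (-NONE- *T*)) ...) and (S (NP-SBJ (NPB (-NONE- *T*))) ...) are relabeled by this sketch, which is why the chain test is used instead of a check on the immediate child only.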
protected final boolean unaryProductionsToNull(Sexp tree)
tree
- the subtree to test
public Sexp stripAugmentations(Sexp tree)
tree
. The set of nonterminal labels does not include
preterminals, which are typically parts of speech. If a particular
language's Treebank augments preterminals, this method should be
overridden in a language package's subclass. The only augmentations that
will not be removed are those that are added by identifyArguments(Sexp)
, so as to preserve the transformations of that
method. This method should only be called subsequent to the invocations
of methods that require augmentations, such as relabelSubjectlessSentences(Sexp)
.
stripAugmentations
in interface Training
tree
- the tree all of the nonterminals of which are to be stripped
of all augmentations except those added by identifyArguments
tree
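The effect of stripping on a single label can be sketched with strings. This is a simplified illustration, not the library's code: StripSketch is a hypothetical name, "-" is assumed to be the only augmentation delimiter, and "A" and "g" are assumed to be the argument and gap augmentations that are retained (the gapAugmentation() description states that gap augmentations also survive stripping).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class StripSketch {
  // Augmentations that survive stripping in this sketch (assumed values).
  static final Set<String> KEEP = new HashSet<>(Arrays.asList("A", "g"));

  /** Removes every '-'-delimited augmentation that is not in KEEP. */
  static String strip(String label) {
    // Labels such as -NONE- start with the delimiter; leave them alone.
    if (label.startsWith("-")) return label;
    String[] parts = label.split("-");
    StringBuilder sb = new StringBuilder(parts[0]);
    for (int i = 1; i < parts.length; i++) {
      if (KEEP.contains(parts[i])) sb.append('-').append(parts[i]);
    }
    return sb.toString();
  }
}
```

Under these assumptions, NP-SBJ-A becomes NP-A and NP-TMP becomes NP, while S-A-g is left unchanged.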
protected Symbol stripAugmentations(Symbol label)
label
- the label from which to strip all augmentations
protected void stripAugmentations(Symbol label, Nonterminal nonterminal, boolean parseLabel)
Nonterminal
object with the specified
nonterminal label but without any augmentations.
label
- the label from which to strip augmentations
nonterminal
- the Nonterminal
object to use for
storage when optionally parsing the specified label and removing
all augmentations
parseLabel
- indicates whether to call
Treebank.parseNonterminal(Symbol)
; if false, this
method assumes that the specified Nonterminal
object
already contains the results of parsing the specified nonterminal
label (if this is not the case, then the behavior of this method
is undefined)
public Sexp raisePunctuation(Sexp tree)
Treebank.isPuncToRaise(Sexp)
.
Side effect: All preterminals removed from the beginning and end
of the sentence are stored in an internal set, which can be accessed
via getPrunedPunctuation()
.
Example of punctuation raising:
(S (NP (NPB Pierre Vinken) (, ,) (ADJP 61 years old) (, ,)) (VP joined (NP (NPB the board))) (. .))
becomes
(S (NP (NPB Pierre Vinken) (, ,) (ADJP 61 years old)) (, ,) (VP joined (NP (NPB the board))))
This method appropriately deals with the case of having multiple punctuation elements to be raised on the left or right side of the list of children for a nonterminal. For example, in English, if this method were passed the tree
(S (NP (DT The) (NN dog) (, ,) (NNP Barky) (. .) (. .) (. .)) (VP (VB was) (ADJP (JJ stupid))) (. .) (. .) (. .))
the result would be
(S (NP (DT The) (NN dog) (, ,) (NNP Barky)) (. .) (. .) (. .) (VP (VB was) (ADJP (JJ stupid))))
Bugs: In the pathological case where all the children of a node
are punctuation to raise, this method simply emits a warning to
System.err
and does not attempt to raise them (which would
cause an interior node to become a leaf).
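The raising behavior shown in the examples can be sketched on a toy tree. This is one possible reading of the description and not the library's algorithm: PuncNode and PunctuationSketch are hypothetical names, punctuation is recognized by the hard-coded tags "," and "." in place of Treebank.isPuncToRaise(Sexp), and pruned elements are kept in a list where the text describes a set.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy tree node standing in for Sexp; for illustration only. */
final class PuncNode {
  final String label;
  final List<PuncNode> kids = new ArrayList<>();
  PuncNode(String label) { this.label = label; }

  static PuncNode parse(String s) {
    String spaced = s.replace("(", " ( ").replace(")", " ) ").trim();
    return read(new ArrayDeque<>(Arrays.asList(spaced.split("\\s+"))));
  }

  private static PuncNode read(Deque<String> toks) {
    String tok = toks.poll();
    if (!tok.equals("(")) return new PuncNode(tok);   // a bare word
    PuncNode node = new PuncNode(toks.poll());        // label follows "("
    while (!toks.peek().equals(")")) node.kids.add(read(toks));
    toks.poll();                                      // consume ")"
    return node;
  }

  @Override public String toString() {
    if (kids.isEmpty()) return label;
    StringBuilder sb = new StringBuilder("(").append(label);
    for (PuncNode k : kids) sb.append(' ').append(k);
    return sb.append(')').toString();
  }
}

final class PunctuationSketch {
  final List<PuncNode> pruned = new ArrayList<>();

  static boolean isPunc(PuncNode n) {
    return n.kids.size() == 1 && n.kids.get(0).kids.isEmpty()
        && (n.label.equals(",") || n.label.equals("."));
  }

  /** Moves punctuation at either edge of each child up to be that child's sibling. */
  private static void raise(PuncNode tree) {
    for (int i = 0; i < tree.kids.size(); i++) {
      PuncNode child = tree.kids.get(i);
      if (child.kids.isEmpty() || isPunc(child)) continue;
      raise(child);
      boolean allPunc = true;
      for (PuncNode k : child.kids) allPunc &= isPunc(k);
      if (allPunc) {                                  // pathological case: leave as is
        System.err.println("warning: all children of " + child.label + " are punctuation");
        continue;
      }
      while (isPunc(child.kids.get(0))) tree.kids.add(i++, child.kids.remove(0));
      while (isPunc(child.kids.get(child.kids.size() - 1))) {
        tree.kids.add(i + 1, child.kids.remove(child.kids.size() - 1));
      }
    }
  }

  /** Raises punctuation, then prunes whatever ends up at the sentence edges. */
  PuncNode raisePunctuation(PuncNode root) {
    raise(root);
    while (root.kids.size() > 1 && isPunc(root.kids.get(0))) pruned.add(root.kids.remove(0));
    while (root.kids.size() > 1 && isPunc(root.kids.get(root.kids.size() - 1))) {
      pruned.add(root.kids.remove(root.kids.size() - 1));
    }
    return root;
  }
}
```

Run on the two example trees above, this sketch reproduces the results shown; for the "Barky" tree, the three sentence-final periods end up in the pruned list.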
raisePunctuation
in interface Training
tree
- the parse tree to destructively modify by raising punctuation
tree
object
public Set getPrunedPunctuation()
Sexp
objects) that were
punctuation elements that were "raised away" because they were either at
the beginning or end of a sentence.
getPrunedPunctuation
in interface Training
raisePunctuation(Sexp)
public Sexp addBaseNPs(Sexp tree)
Treebank.isPossessivePreterminal(Sexp)
. If an NP
is relabeled as a base NP but is not dominated by another NP, then
a new NP is interposed, for the sake of consistency. For example,
if the specified tree is the English Treebank tree
(S (NP-SBJ (DT The) (NN dog)) (VP (VBD sat)))
then this method will transform it to be
(S (NP-SBJ (NPB (DT The) (NN dog))) (VP (VBD sat)))
Note that the SBJ augmentation is transferred to the enclosing NP.
addBaseNPs
in interface Training
tree
- the parse tree in which to add and/or relabel base NPs
tree
hasPossessiveChild(Sexp)
,
Treebank.isNP(Symbol)
,
Treebank.baseNPLabel()
,
Treebank.NPLabel()
public Sexp repairBaseNPs(Sexp tree)
(NP (NPB (DT an) (NN effort) (S ...)))
get transformed to
(NP (NPB (DT an) (NN effort)) (S ...))
repairBaseNPs
in interface Training
tree
- the tree whose base NPs are to be repaired
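The transformation in the example can be sketched as moving sentence-like children off the right edge of a base NP so that they become its right siblings. This is a hedged illustration: RepairNode and RepairSketch are hypothetical names, "NPB" and "S" are hard-coded where the real code would consult the Treebank object, and a base NP at the root (no parent) is not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy tree node standing in for Sexp; for illustration only. */
final class RepairNode {
  final String label;
  final List<RepairNode> kids = new ArrayList<>();
  RepairNode(String label) { this.label = label; }

  static RepairNode parse(String s) {
    String spaced = s.replace("(", " ( ").replace(")", " ) ").trim();
    return read(new ArrayDeque<>(Arrays.asList(spaced.split("\\s+"))));
  }

  private static RepairNode read(Deque<String> toks) {
    String tok = toks.poll();
    if (!tok.equals("(")) return new RepairNode(tok);  // a bare word
    RepairNode node = new RepairNode(toks.poll());     // label follows "("
    while (!toks.peek().equals(")")) node.kids.add(read(toks));
    toks.poll();                                       // consume ")"
    return node;
  }

  @Override public String toString() {
    if (kids.isEmpty()) return label;
    StringBuilder sb = new StringBuilder("(").append(label);
    for (RepairNode k : kids) sb.append(' ').append(k);
    return sb.append(')').toString();
  }
}

final class RepairSketch {
  /** Stand-in for isTypeOfSentence(Symbol); only the bare label "S" counts here. */
  static boolean isTypeOfSentence(String label) { return label.equals("S"); }

  /** Moves trailing sentence children of each NPB to just after the NPB. */
  static void repair(RepairNode parent) {
    for (int i = 0; i < parent.kids.size(); i++) {
      RepairNode child = parent.kids.get(i);
      if (child.label.equals("NPB")) {
        // Always leave at least one child inside the base NP.
        while (child.kids.size() > 1
            && isTypeOfSentence(child.kids.get(child.kids.size() - 1).label)) {
          parent.kids.add(i + 1, child.kids.remove(child.kids.size() - 1));
        }
      }
      repair(child);
    }
  }
}
```

Inserting each moved node at position i + 1 keeps the original left-to-right order when more than one sentence is moved.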
protected Sexp repairBaseNPs(Sexp grandparent, int parentIdx, Sexp tree)
(NP (NPB (DT an) (NN effort) (S ...)))
get transformed to
(NP (NPB (DT an) (NN effort)) (S ...))
grandparent
- the grandparent of the specified tree, or
null
if the specified tree is the root
parentIdx
- the index of the specified tree in the
specified grandparent's list of children
tree
- the tree in which to repair base NPs
protected Sexp threadNPArgAugmentations(Sexp tree)
protected boolean isTypeOfSentence(Symbol label)
repairBaseNPs(Sexp,int,Sexp)
.
While the default implementation here simply returns the result of
calling Treebank.isSentence(Symbol)
with the specified label,
subclasses may override this method if different semantics are required
for identifying sentences that occur as siblings of base NPs.
label
- the nonterminal label to test
true
if the specified nonterminal represents a
sentence, false
otherwise
protected boolean needToAddNormalNPLevel(Sexp grandparent, int parentIdx, Sexp tree)
true
if a unary NP needs to be added above the
specified base NP.
grandparent
- the parent of the "parent" that is a
base NP
parentIdx
- the index of the child of grandparent
that is the base NP (that is,
grandparent.list().get(parentIdx) == tree
)
tree
- the base NP, whose parent is grandparent
protected boolean isCoordinatedPhrase(Sexp tree, int headIdx)
true
if a non-head child of the specified
tree is a conjunction, and that conjunction is either post-head
but non-final, or immediately pre-head but non-initial (where
"immediately pre-head" means "at the first index
less than headIdx
that is not punctuation, as determined
by Treebank.isPunctuation(Symbol)
"). A child is a
conjunction if its label is one for which
Treebank.isConjunction(Symbol)
returns true
.
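The positional test above can be sketched over a flat list of child labels. This is a minimal illustration: CoordinationSketch is a hypothetical name, and the hard-coded tag sets stand in for Treebank.isConjunction(Symbol) and Treebank.isPunctuation(Symbol).

```java
import java.util.List;

final class CoordinationSketch {
  // Assumed English tags, for this example only.
  static boolean isConj(String label) { return label.equals("CC") || label.equals("CONJP"); }
  static boolean isPunc(String label) { return label.equals(",") || label.equals(":"); }

  /** True if a conjunction is post-head but non-final, or immediately pre-head but non-initial. */
  static boolean isCoordinated(List<String> kids, int headIdx) {
    int last = kids.size() - 1;
    // Post-head conjunction that is not the final child.
    for (int i = headIdx + 1; i < last; i++) {
      if (isConj(kids.get(i))) return true;
    }
    // First non-punctuation index to the left of the head.
    int i = headIdx - 1;
    while (i >= 0 && isPunc(kids.get(i))) i--;
    // It must be a conjunction and must not be the first child.
    return i > 0 && isConj(kids.get(i));
  }
}
```

With the head at index 0, the children NP CC NP count as coordinated, while NP CC does not, because the conjunction is final.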
tree
- the (sub)tree to test
headIdx
- the index of the head child of the specified tree
protected boolean hasPossessiveChild(Sexp tree)
true
if tree
contains a child for which
Treebank.isPossessivePreterminal(Sexp)
returns
true
, false
otherwise. This is a helper method
used by the default implementation of addBaseNPs(Sexp)
.
Possessive children are often more even-tempered than possessive parents.
tree
- the parse subtree to check for possessive preterminal
children
public Sexp removeNullElements(Sexp tree)
tree
for
which Treebank.isNullElementPreterminal(Sexp)
returns
true
. Additionally, if the removal of a null element leaves
an interior node that is childless, then this interior node is removed as
well. For example, if we have the following sentence in English
(S (NP-SBJ (-NONE- *T*)) (VP ...))
it will be transformed to be
(S (VP ...))
N.B.: This method should only be invoked after preprocessing with
relabelSubjectlessSentences(Sexp)
and addGapInformation(Sexp)
, as these methods (and possibly others, if
overridden) rely on the presence of null elements.
removeNullElements
in interface Training
Treebank.isNullElementPreterminal(Sexp)
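The removal rule, including the cascade that deletes interior nodes left childless, can be sketched recursively. NullNode and NullElementSketch are hypothetical names, and the "-NONE-" tag is an English Treebank convention assumed in place of Treebank.isNullElementPreterminal(Sexp).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy tree node standing in for Sexp; for illustration only. */
final class NullNode {
  final String label;
  final List<NullNode> kids = new ArrayList<>();
  NullNode(String label) { this.label = label; }

  static NullNode parse(String s) {
    String spaced = s.replace("(", " ( ").replace(")", " ) ").trim();
    return read(new ArrayDeque<>(Arrays.asList(spaced.split("\\s+"))));
  }

  private static NullNode read(Deque<String> toks) {
    String tok = toks.poll();
    if (!tok.equals("(")) return new NullNode(tok);   // a bare word
    NullNode node = new NullNode(toks.poll());        // label follows "("
    while (!toks.peek().equals(")")) node.kids.add(read(toks));
    toks.poll();                                      // consume ")"
    return node;
  }

  @Override public String toString() {
    if (kids.isEmpty()) return label;
    StringBuilder sb = new StringBuilder("(").append(label);
    for (NullNode k : kids) sb.append(' ').append(k);
    return sb.append(')').toString();
  }
}

final class NullElementSketch {
  static boolean isNullPreterminal(NullNode n) {
    return n.label.equals("-NONE-") && n.kids.size() == 1 && n.kids.get(0).kids.isEmpty();
  }

  /** Prunes null elements below tree; returns false if tree itself should be removed. */
  static boolean prune(NullNode tree) {
    if (isNullPreterminal(tree)) return false;
    if (tree.kids.isEmpty()) return true;             // an ordinary word
    tree.kids.removeIf(k -> !prune(k));
    return !tree.kids.isEmpty();                      // childless interior nodes go too
  }
}
```

Applied to (S (NP-SBJ (-NONE- *T*)) (VP (VB go))), the sketch removes the null preterminal and then the NP-SBJ node that it leaves empty.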
public Symbol startSym()
Symbol.add("+START+")
; if this value
conflicts with an actual nonterminal in a particular Treebank, then this
method should be overridden.
startSym
in interface Training
Trainer
public Word startWord()
Word
object that represents the hidden "head
word" of the start symbol.
startWord
in interface Training
startSym
,
Trainer
public Symbol stopSym()
Symbol.add("+STOP+")
; if this value
conflicts with an actual nonterminal in a particular Treebank, then this
method should be overridden.
This symbol may also be used as a special value that is guaranteed not to conflict with any nonterminal in a given language's treebank.
stopSym
in interface Training
Trainer
public Word stopWord()
Word
object that represents the hidden "head
word" of the stop symbol.
stopWord
in interface Training
stopSym
,
Trainer
public Symbol topSym()
Symbol.add("+TOP+")
; if this value conflicts with
an actual nonterminal in a particular Treebank, then this method should be
overridden.
topSym
in interface Training
Trainer
public Word topWord()
Word
object that represents the hidden "head
word" of the hidden root of all parse trees.
topWord
in interface Training
public Symbol headSym()
argContexts
map to identify
an offset from the head child.
public Symbol headPreSym()
argContexts
to indicate
to choose a child relative to the left side of the head as an argument.
For example, an argument context might be VP
mapping to
(head-left left MD VBD)
, meaning that the children to the left
of the head child should be searched from left to right, and the first
child found that is a member of the set {MD, VBD} should be
considered a possible argument of the head.
public Symbol headPostSym()
argContexts
to indicate to
choose a child relative to the right side of the head as an argument. For
example, an argument context might be PP
mapping to
(head-right left PP NP WHNP ADJP)
, meaning that the children
to the right of the head child should be searched from left to right, and
the first child found that is a member of the set {PP, NP, WHNP,
ADJP} should be considered a possible argument of the head.
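The search that these head-relative contexts describe can be sketched over a flat list of child labels. ArgContextSketch and findCandidate are hypothetical names, and the right-to-left search direction is an assumption, since the examples above only show the "left" direction.

```java
import java.util.List;
import java.util.Set;

final class ArgContextSketch {
  /**
   * Returns the index of the first candidate child on one side of the head, or -1.
   * preHead selects the children left of the head (as with head-left);
   * leftToRight selects the scan direction (as with the "left" keyword).
   */
  static int findCandidate(List<String> children, int headIdx, boolean preHead,
                           boolean leftToRight, Set<String> candidates) {
    int lo = preHead ? 0 : headIdx + 1;               // inclusive lower bound
    int hi = preHead ? headIdx : children.size();     // exclusive upper bound
    for (int k = 0; k < hi - lo; k++) {
      int i = leftToRight ? lo + k : hi - 1 - k;
      if (candidates.contains(children.get(i))) return i;
    }
    return -1;
  }
}
```

For the VP example, the children ADVP MD VBD VB NP with the head at index 3 and candidates {MD, VBD} yield index 1, the MD, when scanning the pre-head children from left to right.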
protected void createArgAugmentationsList()
argContexts
, parses the pattern using Treebank.parseNonterminal(danbikel.lisp.Symbol)
, runs through the resulting list of
augmentations and adds each augmentation symbol to the argAugmentations
list.
protected void createArgNonterminalsSet()
argNonterminals
data member to be the static set
of argument nonterminals. The default implementation here scans the
argContexts
list, and adds every nonterminal "pattern" for a
given context to the set. If the nonterminal to be added is not
already an argument as determined by isArgument(danbikel.lisp.Symbol)
, then the
Treebank.canonicalAugDelimiter()
and defaultArgAugmentation
are appended before it is added to the set. This default implementation,
therefore, does not necessarily return a complete set of all possible arg
nonterminals, but merely those that are explicitly named in the
argument-finding contexts. As this method is primarily intended to be
used by SubcatBag
when setting up its static resources for
categorizing argument nonterminals, this implementation is sufficient,
as all nonterminals that are not explicitly named will be thrown into
the miscellaneous category.
public Set argNonterminals()
createArgNonterminalsSet()
if
the argNonterminals
data member has not been initialized
(that is, if it is null
).
argNonterminals
in interface Training
public Symbol removeArgAugmentation(Symbol label)
Training
removeArgAugmentation
in interface Training
label
- the label whose argument augmentations are to be removed
protected Symbol removeArgAugmentation(Symbol label, Nonterminal nonterminal)
Nonterminal
object and then
removes all argument augmentations.
label
- the label from which to remove argument augmentations
nonterminal
- the object to use as temporary storage during
execution of this method
public Sexp removeGapAugmentation(Sexp sexp)
delimAndGapStr
, which means that symbols consisting solely
of the gap augmentation itself (gapAugmentation
) will
be unaffected.
removeGapAugmentation
in interface Training
sexp
- a symbol or list of symbols from which to remove any
gap augmentations
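The point that a symbol consisting solely of the gap augmentation is unaffected follows from matching on the delimiter together with the augmentation. A minimal sketch on strings, assuming "-" as the canonical delimiter and "g" as the gap augmentation (GapRemovalSketch is a hypothetical name, and the real method operates on Sexp objects):

```java
final class GapRemovalSketch {
  static final String DELIM_AND_GAP = "-g"; // assumed delimiter plus gap augmentation

  /** Removes "-g" wherever it forms a complete augmentation of the label. */
  static String removeGap(String label) {
    int idx = label.indexOf(DELIM_AND_GAP);
    while (idx != -1) {
      int end = idx + DELIM_AND_GAP.length();
      if (end == label.length() || label.charAt(end) == '-') {
        // Whole augmentation matched: cut it out and rescan from the same spot.
        label = label.substring(0, idx) + label.substring(end);
        idx = label.indexOf(DELIM_AND_GAP, idx);
      } else {
        // Only a prefix of a longer augmentation: skip past it.
        idx = label.indexOf(DELIM_AND_GAP, idx + 1);
      }
    }
    return label;
  }
}
```

Because the bare label "g" contains no delimiter, it never matches and is returned unchanged.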
public void postProcess(Sexp tree)
Training
postProcess
in interface Training
tree
- the tree to be post-processed
protected void removeOnlyChildBaseNPs(Sexp tree)
protected void canonicalizeNonterminals(Sexp tree)
tree
- the tree whose nonterminals are to be converted to their
canonical versions
Treebank.getCanonical(Symbol)
public static void main(String[] args)
preProcess(Sexp)
method on
them, and then outputs the resulting trees to standard out.
Usage: <filename>, where <filename> contains S-expressions
representing trees.
protected void readMetadataHook(Symbol dataType, int metadataLen, SexpList metadata)
dataType
- the symbol representing the data type for this
metadata entry
metadataLen
- the length of the list of the specified
metadata entry
metadata
- the list of a metadata entry to be processed
by a subclass, if the data type is recognized
protected void readMetadata(SexpTokenizer metadataTok) throws IOException
argContexts
and
semTagArgStopSet
. Does no format
checking on the S-expressions of the metadata resource.
metadataTok
- tokenizer for stream of S-expressions containing
metadata for this class
IOException
public void printMetadata()