Parsing Engine
public interface Training
Specifies methods for language-specific preprocessing of training parse trees. The primary method to be invoked from an implementation of this interface is preProcess(Sexp). Additionally, as implementations are likely to contain or have access to appropriate preprocessing data and methods, this interface also specifies a crucial method to be used for post-processing, to "undo" what was done during preprocessing after decoding. This post-processing method is postProcess(Sexp), and is invoked by default by the Decoder.
A language package must include an implementation of this interface.
See Also:
preProcess(Sexp), postProcess(Sexp), Decoder
Method Summary | |
---|---|
Sexp | addBaseNPs(Sexp tree): Adds and/or relabels base NPs in the specified tree. |
Sexp | addGapInformation(Sexp tree): Augments nonterminals to include gap information for WHNP's that have moved and leave traces (gaps), as in the GPSG framework. |
Set | argNonterminals(): Returns a static set of possible argument nonterminals. |
Symbol | defaultArgAugmentation(): The symbol that is used to mark argument (required) nonterminals by identifyArguments(Sexp). |
Symbol | gapAugmentation(): The symbol that will be used to identify nonterminals whose subtrees contain a gap (a trace). |
Symbol | getCanonicalArg(Symbol argLabel): Returns the canonical version of the specified argument nonterminal, crucially including its argument augmentation. |
Set | getPrunedPreterms(): Returns the set of pruned preterminals (Sexp objects). |
Set | getPrunedPunctuation(): Returns the set of preterminals (Sexp objects) that were punctuation elements that were "raised away" because they were either at the beginning or end of a sentence. |
boolean | hasGap(Symbol label): Returns true if and only if label has a gap augmentation as added by addGapInformation(Sexp). |
Sexp | identifyArguments(Sexp tree): Augments labels of nonterminals that are arguments. |
boolean | isArgument(Symbol label): Returns true if and only if label has an argument augmentation as added by identifyArguments(Sexp). |
boolean | isArgumentFast(Symbol label): Returns true if and only if the specified nonterminal label has an argument augmentation preceded by the canonical augmentation delimiter. |
boolean | isValidTree(Sexp tree): Returns whether the specified tree is valid. |
void | postProcess(Sexp tree): Post-processes a parse tree after decoding, essentially undoing the steps performed in preprocessing. |
Sexp | preProcess(Sexp tree): The method to call before counting events in a training parse tree. |
SexpList | preProcessTest(SexpList sentence, SexpList originalWords, SexpList tags): Preprocesses the specified test sentence and its coordinated list of tags. |
Sexp | prune(Sexp tree): Prunes away subtrees that have a root that is an element of nodesToPrune. |
Sexp | raisePunctuation(Sexp tree): Raises punctuation to the highest possible point in a parse tree, resulting in a tree where no punctuation is the first or last child of a non-leaf node. |
Sexp | relabelSubjectlessSentences(Sexp tree): Relabels sentences that have no subjects with the nonterminal label returned by Treebank.subjectlessSentenceLabel(). |
Symbol | removeArgAugmentation(Symbol label): Removes any argument augmentations from the specified nonterminal label. |
Sexp | removeGapAugmentation(Sexp sexp): If the specified S-expression is a list, this method modifies the list to contain only symbols without gap augmentations; otherwise, this method removes the gap augmentation (if one exists) in the specified symbol and returns that new symbol. |
Sexp | removeNullElements(Sexp tree): Removes all null elements, that is, those nodes of tree for which Treebank.isNullElementPreterminal(Sexp) returns true. |
boolean | removeWord(Symbol word, Symbol tag, int idx, SexpList sentence, SexpList tags, SexpList originalTags, Set prunedPretermsPosSet, Map prunedPretermsPosMap): Invoked by the decoder as the first step in preprocessing (prior to the invocation of preProcessTest(danbikel.lisp.SexpList, danbikel.lisp.SexpList, danbikel.lisp.SexpList)). |
Sexp | repairBaseNPs(Sexp tree): Changes the specified tree so that when the last child of an NPB is an S, the S gets raised to be a sibling immediately following the NPB. |
void | setUpFastArgMap(CountsTable nonterminals): Indicates to set up a static map for quickly mapping argument nonterminals to their non-argument variants (that is, for quickly stripping away their argument augmentations). |
String | skip(Sexp tree): Returns whether the specified tree is to be skipped when training. |
Symbol | startSym(): Returns the symbol to indicate hidden nonterminals that precede the first in a sequence of modifier nonterminals. |
Word | startWord(): Returns the Word object that represents the hidden "head word" of the start symbol. |
Symbol | stopSym(): Returns the symbol to indicate a hidden nonterminal that follows the last in a sequence of modifier nonterminals. |
Word | stopWord(): Returns the Word object that represents the hidden "head word" of the stop symbol. |
Sexp | stripAugmentations(Sexp tree): Strips any augmentations off all of the nonterminal labels of tree. |
Symbol | topSym(): Returns the symbol to indicate the hidden root of all parse trees. |
Word | topWord(): Returns the Word object that represents the hidden "head word" of the hidden root of all parse trees. |
Symbol | traceTag(): The symbol that gets reassigned as the part of speech for null preterminals that represent traces that have undergone WH-movement, as relabeled by the default implementation of addGapInformation(Sexp). |
Method Detail |
---|
void setUpFastArgMap(CountsTable nonterminals)
Indicates to set up a static map for quickly mapping argument nonterminals to their non-argument variants (that is, for quickly stripping away their argument augmentations).
N.B.: This method is necessarily thread-safe, as it is expected to be invoked by every Decoder as it starts up, and since there can be multiple Decoder instances within a given VM. However, note that it is inappropriate to invoke this method if the set of nonterminals in the specified counts table is incomplete (see the documentation for the SubcatBag class for an instance where this will be the case).
Parameters:
nonterminals - a counts table whose keys form a complete set of all possible nonterminal labels, as is obtained from DecoderServerRemote.nonterminals() (the counts to which the nonterminals are mapped are not used by this method)
Sexp preProcess(Sexp tree)
The method to call before counting events in a training parse tree.
Parameters:
tree - the parse tree to pre-process
Returns:
tree having been pre-processed
boolean removeWord(Symbol word, Symbol tag, int idx, SexpList sentence, SexpList tags, SexpList originalTags, Set prunedPretermsPosSet, Map prunedPretermsPosMap)
Invoked by the decoder as the first step in preprocessing (prior to the invocation of preProcessTest(danbikel.lisp.SexpList, danbikel.lisp.SexpList, danbikel.lisp.SexpList)). Returns whether the specified word should be removed from the sentence before parsing.
Parameters:
word - a word in the sentence about to be parsed
tag - the supplied part-of-speech tag of the specified word, or null if tags were not supplied
idx - the index of the specified word in the specified sentence
sentence - a list of Symbol objects that represent the words of the sentence to be parsed
tags - coordinated list of supplied part-of-speech tag lists for each of the words in the specified sentence, or null if no tags were supplied
originalTags - the cached copy of the specified tags list, used when Settings.restorePrunedWords is true
prunedPretermsPosSet - the set of part-of-speech tags that were pruned during training
prunedPretermsPosMap - a map of words pruned during training to their part-of-speech tags when they were pruned
SexpList preProcessTest(SexpList sentence, SexpList originalWords, SexpList tags)
Preprocesses the specified test sentence and its coordinated list of tags.
Parameters:
sentence - the list of words, where a known word is a symbol and an unknown word is represented by a 3-element list (see DecoderServerRemote.convertUnknownWords(danbikel.lisp.SexpList))
originalWords - the list of unprocessed words (all symbols)
tags - the list of tag lists, where the list at index i is the list of possible parts of speech for the word at that index
Returns:
a list, the first element of which is a processed version of sentence and the second of which is a processed version of tags; if tags is null, then the returned list will contain only one element (since SexpList objects are not designed to handle null elements)
boolean isValidTree(Sexp tree)
Returns whether the specified tree is valid.
Parameters:
tree - the parse tree to check for validity
String skip(Sexp tree)
Returns whether the specified tree is to be skipped when training.
Parameters:
tree - an annotated training tree
Returns:
a non-null string if the specified tree is to be skipped when training, null otherwise
See Also:
Trainer.train(SexpTokenizer,boolean,boolean)
Set getPrunedPreterms()
Returns the set of pruned preterminals (Sexp objects).
See Also:
prune(Sexp)
Sexp prune(Sexp tree)
Prunes away subtrees that have a root that is an element of nodesToPrune.
Side effect: An internal set of pruned preterminals will be updated. This set may be accessed via getPrunedPreterms().
Bugs: Cannot prune away the entire tree if the root label of the specified tree is in nodesToPrune.
Parameters:
tree - the parse tree to prune
Returns:
tree having been pruned
Sexp identifyArguments(Sexp tree)
Augments labels of nonterminals that are arguments. May be overridden to simply return tree untouched if argument identification is not desired for a particular language package.
Parameters:
tree - the parse tree to modify
Returns:
the modified tree object
See Also:
Treebank.canonicalAugDelimiter()
Symbol defaultArgAugmentation()
The symbol that is used to mark argument (required) nonterminals by identifyArguments(Sexp).
boolean isArgument(Symbol label)
Returns true if and only if label has an argument augmentation as added by identifyArguments(Sexp).
boolean isArgumentFast(Symbol label)
Returns true if and only if the specified nonterminal label has an argument augmentation preceded by the canonical augmentation delimiter. Unlike isArgument(Symbol), this method is thread-safe. Also, it is more efficient than isArgument(Symbol), as it does not actually parse the specified nonterminal label.
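The difference between isArgument(Symbol) and isArgumentFast(Symbol) can be illustrated with a minimal sketch. It is not the parsing engine's code: labels are plain strings rather than Symbol objects, the delimiter "-" and the augmentation "A" are assumed values, and the class name ArgCheckSketch is hypothetical. The slow check splits the label on every call, while the fast check is a single lookup in a map built once, in the spirit of setUpFastArgMap(CountsTable).

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy stand-in for argument-augmentation checks; not the danbikel API. */
final class ArgCheckSketch {
    // Assumed for illustration: "-" as the delimiter, "A" as the argument augmentation.
    static final String DELIM = "-";
    static final String ARG_AUG = "A";

    // Maps each argument label to its non-argument variant, e.g. "NP-A" -> "NP".
    private static final Map<String, String> fastArgMap = new ConcurrentHashMap<>();

    /** Slow path: splits the label and inspects every augmentation. */
    static boolean isArgument(String label) {
        String[] parts = label.split(DELIM);
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].equals(ARG_AUG)) return true;
        }
        return false;
    }

    /** Removes every argument augmentation from the label, keeping all others. */
    static String removeArgAugmentation(String label) {
        String[] parts = label.split(DELIM);
        StringBuilder sb = new StringBuilder(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            if (!parts[i].equals(ARG_AUG)) sb.append(DELIM).append(parts[i]);
        }
        return sb.toString();
    }

    /** Precomputes the map once from a complete set of nonterminal labels. */
    static void setUpFastArgMap(Set<String> nonterminals) {
        for (String nt : nonterminals) {
            if (isArgument(nt)) fastArgMap.put(nt, removeArgAugmentation(nt));
        }
    }

    /** Fast path: a single lookup, with no parsing of the label. */
    static boolean isArgumentFast(String label) {
        return fastArgMap.containsKey(label);
    }
}
```

The fast check is only as good as the set it was built from: a label missing from that set is reported as a non-argument, which is why the set of nonterminals passed to setUpFastArgMap needs to be complete.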
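Symbol getCanonicalArg(Symbol argLabel)
Returns the canonical version of the specified argument nonterminal, crucially including its argument augmentation.
Parameters:
argLabel - the argument nonterminal to be canonicalized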
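Sexp addGapInformation(Sexp tree)
Augments nonterminals to include gap information for WHNP's that have moved and leave traces (gaps), as in the GPSG framework. May be overridden to simply return tree untouched if gap information is not desired for a particular language package.
Parameters:
tree - the parse tree to which to add gapping
Returns:
the tree that was passed in, with certain nodes modified to include gap information
boolean hasGap(Symbol label)
Returns true if and only if label has a gap augmentation as added by addGapInformation(Sexp).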
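Symbol gapAugmentation()
The symbol that will be used to identify nonterminals whose subtrees contain a gap (a trace). Used by stripAugmentations(Sexp), so that gap augmentations that are added by addGapInformation(Sexp) do not get removed.
Symbol traceTag()
The symbol that gets reassigned as the part of speech for null preterminals that represent traces that have undergone WH-movement, as relabeled by the default implementation of addGapInformation(Sexp).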
Sexp relabelSubjectlessSentences(Sexp tree)
Relabels sentences that have no subjects with the nonterminal label returned by Treebank.subjectlessSentenceLabel(). This method is optional, and may be overridden to simply return tree untouched if subjectless sentence relabeling is not desired for a particular language package.
Parameters:
tree - the parse tree in which to relabel subjectless sentences
Returns:
the tree that was passed in, with subjectless sentence nodes relabeled
See Also:
Treebank.isSentence(Symbol), Treebank.subjectAugmentation(), Treebank.isNullElementPreterminal(Sexp), Treebank.subjectlessSentenceLabel()
Sexp stripAugmentations(Sexp tree)
Strips any augmentations off all of the nonterminal labels of tree. The set of nonterminal labels does not include preterminals, which are typically parts of speech. If a particular language's Treebank augments preterminals, this method should be overridden in a language package's subclass. The only augmentations that will not be removed are those that are added by identifyArguments(Sexp), so as to preserve the transformations of that method. This method should only be called subsequent to the invocations of methods that require augmentations, such as relabelSubjectlessSentences(Sexp).
Parameters:
tree - the tree all of the nonterminals of which are to be stripped of all augmentations except those added by identifyArguments
Returns:
tree, with augmentations stripped
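As a rough illustration of the stripping behavior described for stripAugmentations(Sexp), the sketch below walks a toy tree and strips augmentations from nonterminal labels only, leaving preterminals untouched and preserving a caller-supplied set of augmentations. Everything here is an assumption made for the example: trees are nested java.util.List values instead of Sexp objects, the delimiter is taken to be "-", the class StripAugmentationsSketch is hypothetical, and the sketch returns a copy where a real implementation may modify the tree in place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy augmentation stripper over nested lists; not the danbikel API. */
final class StripAugmentationsSketch {
    static final String DELIM = "-"; // assumed augmentation delimiter

    /** Keeps the base label plus only those augmentations listed in keep. */
    static String stripLabel(String label, Set<String> keep) {
        String[] parts = label.split(DELIM);
        StringBuilder sb = new StringBuilder(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            if (keep.contains(parts[i])) sb.append(DELIM).append(parts[i]);
        }
        return sb.toString();
    }

    /** A node is [label, child...]; a preterminal is the two-element list [tag, word]. */
    static boolean isPreterminal(List<?> node) {
        return node.size() == 2 && node.get(1) instanceof String;
    }

    /** Returns a stripped copy of node; preterminals are copied unchanged. */
    static List<Object> strip(List<?> node, Set<String> keep) {
        List<Object> out = new ArrayList<>();
        if (isPreterminal(node)) {
            out.addAll(node);
            return out;
        }
        out.add(stripLabel((String) node.get(0), keep));
        for (Object child : node.subList(1, node.size())) {
            out.add(strip((List<?>) child, keep));
        }
        return out;
    }
}
```

Passing the argument augmentation in the keep set mirrors the rule that augmentations added by identifyArguments(Sexp) survive stripping; a tag such as -NONE- is left alone because it sits on a preterminal.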
Sexp raisePunctuation(Sexp tree)
Raises punctuation to the highest possible point in a parse tree, resulting in a tree where no punctuation is the first or last child of a non-leaf node. The punctuation to be raised is determined by Treebank.isPuncToRaise(Sexp).
Side effect: All preterminals removed from the beginning and end of the sentence are stored in an internal set, which can be accessed via getPrunedPunctuation().
Example of punctuation raising:
(S (NP (NPB Pierre Vinken) (, ,) (ADJP 61 years old) (, ,)) (VP joined (NP (NPB the board))) (. .))
becomes
(S (NP (NPB Pierre Vinken) (, ,) (ADJP 61 years old)) (, ,) (VP joined (NP (NPB the board))))
This method appropriately deals with the case of having multiple punctuation elements to be raised on the left or right side of the list of children for a nonterminal. For example, in English, if this method were passed the tree
(S (NP (DT The) (NN dog) (, ,) (NNP Barky) (. .) (. .) (. .)) (VP (VB was) (ADJP (JJ stupid))) (. .) (. .) (. .))
the result would be
(S (NP (DT The) (NN dog) (, ,) (NNP Barky)) (. .) (. .) (. .) (VP (VB was) (ADJP (JJ stupid))))
Bugs: In the pathological case where all the children of a node are punctuation to raise, this method simply emits a warning to System.err and does not attempt to raise them (which would cause an interior node to become a leaf).
Parameters:
tree - the parse tree to destructively modify by raising punctuation
Returns:
the modified tree object
Set getPrunedPunctuation()
Returns the set of preterminals (Sexp objects) that were punctuation elements that were "raised away" because they were either at the beginning or end of a sentence.
See Also:
raisePunctuation(Sexp)
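The punctuation-raising behavior documented for raisePunctuation(Sexp), together with the "raised away" collection returned by getPrunedPunctuation(), can be sketched as a bottom-up pass. This is a hedged illustration rather than the engine's implementation: the Node class, the helper names, and the fixed PUNC_TAGS set (standing in for Treebank.isPuncToRaise(Sexp)) are all assumptions, and the pathological all-punctuation case is simply left untouched without a warning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Toy punctuation raiser; not the danbikel API. */
final class RaisePunctuationSketch {
    // Assumed stand-in for Treebank.isPuncToRaise(Sexp): a fixed set of tags.
    static final Set<String> PUNC_TAGS = Set.of(",", ".", ":");

    /** Minimal tree node: a preterminal has a word and no children. */
    static final class Node {
        final String label;
        final String word;
        final List<Node> kids = new ArrayList<>();
        Node(String label, String word, Node... kids) {
            this.label = label;
            this.word = word;
            this.kids.addAll(Arrays.asList(kids));
        }
        @Override public String toString() {
            if (word != null) return "(" + label + " " + word + ")";
            StringBuilder sb = new StringBuilder("(" + label);
            for (Node k : kids) sb.append(' ').append(k);
            return sb.append(')').toString();
        }
    }

    static Node pre(String tag, String word) { return new Node(tag, word); }
    static Node nt(String label, Node... kids) { return new Node(label, null, kids); }
    static boolean isPunc(Node n) { return n.word != null && PUNC_TAGS.contains(n.label); }

    /** Removes punctuation from one edge of n and returns it in its original order. */
    static List<Node> peel(Node n, boolean fromLeft) {
        List<Node> peeled = new ArrayList<>();
        // Pathological case (all children are punctuation): leave the node as it is.
        if (n.kids.stream().allMatch(RaisePunctuationSketch::isPunc)) return peeled;
        while (isPunc(n.kids.get(fromLeft ? 0 : n.kids.size() - 1))) {
            Node p = n.kids.remove(fromLeft ? 0 : n.kids.size() - 1);
            if (fromLeft) peeled.add(p); else peeled.add(0, p);
        }
        return peeled;
    }

    /** Bottom-up: each child's edge punctuation becomes that child's sibling. */
    static void raiseWithin(Node n) {
        List<Node> out = new ArrayList<>();
        for (Node kid : n.kids) {
            if (kid.word != null) { out.add(kid); continue; }
            raiseWithin(kid);
            out.addAll(peel(kid, true));
            out.add(kid);
            out.addAll(peel(kid, false));
        }
        n.kids.clear();
        n.kids.addAll(out);
    }

    /** Raises punctuation; sentence-initial and sentence-final punctuation goes to pruned. */
    static Node raisePunctuation(Node tree, List<Node> pruned) {
        raiseWithin(tree);
        pruned.addAll(peel(tree, true));
        pruned.addAll(peel(tree, false));
        return tree;
    }
}
```

Because children are processed before their parent, punctuation at the edge of a deeply nested constituent climbs one level per recursion step until it is either interior to some node or falls off the root into the pruned list.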
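Sexp addBaseNPs(Sexp tree)
Adds and/or relabels base NPs in the specified tree.
Parameters:
tree - the parse tree in which to add and/or relabel base NPs
Returns:
tree, with base NPs added and/or relabeled
See Also:
Treebank.isNP(Symbol), Treebank.baseNPLabel(), Treebank.NPLabel()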
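Sexp repairBaseNPs(Sexp tree)
Changes the specified tree so that when the last child of an NPB is an S, the S gets raised to be a sibling immediately following the NPB. That is, subtrees of the form
(NP (NPB (DT an) (NN effort) (S ...)))
get transformed to
(NP (NPB (DT an) (NN effort)) (S ...))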
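The repairBaseNPs(Sexp) transformation can be sketched on a toy tree as follows. The representation (nested java.util.List values), the exact-match labels "NPB" and "S", and the class name RepairBaseNPsSketch are assumptions for illustration; a real language package would consult its Treebank for these labels, and the sketch builds a repaired copy rather than modifying the input.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy base-NP repair over nested lists; not the danbikel API. */
final class RepairBaseNPsSketch {
    // Assumed labels, matched exactly; augmented labels such as "S-A" are not handled.
    static final String NPB = "NPB";
    static final String S = "S";

    static boolean isNonterminal(Object o, String label) {
        return o instanceof List && !((List<?>) o).isEmpty()
                && label.equals(((List<?>) o).get(0));
    }

    /** Returns a repaired copy of node; a node is [label, child...], a preterminal is [tag, word]. */
    static List<Object> repair(List<?> node) {
        List<Object> out = new ArrayList<>();
        out.add(node.get(0));
        for (Object child : node.subList(1, node.size())) {
            if (!(child instanceof List)) { // a word under a preterminal
                out.add(child);
                continue;
            }
            List<Object> fixed = repair((List<?>) child);
            Object last = fixed.get(fixed.size() - 1);
            // Assumption: only raise the S when the NPB keeps at least one other child.
            if (isNonterminal(fixed, NPB) && fixed.size() > 2 && isNonterminal(last, S)) {
                fixed.remove(fixed.size() - 1); // detach the trailing S ...
                out.add(fixed);
                out.add(last);                  // ... and re-attach it right after the NPB
            } else {
                out.add(fixed);
            }
        }
        return out;
    }
}
```

An NPB at the root has no parent to receive the raised S, so this sketch leaves such a tree unchanged.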
Sexp removeNullElements(Sexp tree)
Removes all null elements, that is, those nodes of tree for which Treebank.isNullElementPreterminal(Sexp) returns true. Additionally, if the removal of a null element leaves an interior node that is childless, then this interior node is removed as well. For example, if we have the following sentence in English
(S (NP-SBJ (-NONE- *T*)) (VP ...))
it will be transformed to be
(S (VP ...))
N.B.: This method should only be invoked after preprocessing with relabelSubjectlessSentences(Sexp) and addGapInformation(Sexp), as these methods (and possibly others, if overridden) rely on the presence of null elements.
See Also:
Treebank.isNullElementPreterminal(Sexp)
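A minimal sketch of null-element removal, including the cascading removal of interior nodes left childless, might look like the following. It assumes the Penn Treebank convention of tagging null elements with -NONE- as a stand-in for Treebank.isNullElementPreterminal(Sexp), represents trees as nested java.util.List values, and uses the hypothetical class name RemoveNullElementsSketch; it returns a copy instead of modifying the tree.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy null-element remover over nested lists; not the danbikel API. */
final class RemoveNullElementsSketch {
    // Assumed stand-in for Treebank.isNullElementPreterminal(Sexp).
    static boolean isNullElementPreterminal(List<?> node) {
        return node.size() == 2 && "-NONE-".equals(node.get(0))
                && node.get(1) instanceof String;
    }

    /** Returns a copy of node without null elements, or null if nothing is left of it. */
    static List<Object> removeNulls(List<?> node) {
        if (isNullElementPreterminal(node)) return null;
        List<Object> out = new ArrayList<>();
        out.add(node.get(0));
        for (Object child : node.subList(1, node.size())) {
            if (child instanceof List) {
                List<Object> kept = removeNulls((List<?>) child);
                if (kept != null) out.add(kept);
            } else {
                out.add(child); // a word under a preterminal
            }
        }
        // An interior node whose children were all removed is dropped as well.
        return out.size() > 1 ? out : null;
    }
}
```

Since the traces are gone after this step, any transformation that needs to see them (subjectless-sentence relabeling, gap threading) has to run first, which is the ordering constraint stated in the N.B. above.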
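Symbol startSym()
Returns the symbol to indicate hidden nonterminals that precede the first in a sequence of modifier nonterminals.
See Also:
Trainer
Word startWord()
Returns the Word object that represents the hidden "head word" of the start symbol.
See Also:
startSym(), Trainer
Symbol stopSym()
Returns the symbol to indicate a hidden nonterminal that follows the last in a sequence of modifier nonterminals. This symbol may also be used as a special value that is guaranteed not to conflict with any nonterminal in a given language's treebank.
See Also:
Trainer
Word stopWord()
Returns the Word object that represents the hidden "head word" of the stop symbol.
See Also:
stopSym(), Trainer
Symbol topSym()
Returns the symbol to indicate the hidden root of all parse trees.
See Also:
Trainer
Word topWord()
Returns the Word object that represents the hidden "head word" of the hidden root of all parse trees.
Set argNonterminals()
Returns a static set of possible argument nonterminals.
Symbol removeArgAugmentation(Symbol label)
Removes any argument augmentations from the specified nonterminal label.
Parameters:
label - the label whose argument augmentations are to be removed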
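Sexp removeGapAugmentation(Sexp sexp)
If the specified S-expression is a list, this method modifies the list to contain only symbols without gap augmentations; otherwise, this method removes the gap augmentation (if one exists) in the specified symbol and returns that new symbol.
Parameters:
sexp - a symbol or list of symbols from which to remove any gap augmentations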
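The two behaviors of removeGapAugmentation(Sexp), returning a new symbol versus modifying a list in place, can be sketched as below. The sketch uses String for symbols and a mutable java.util.List of strings for a list of symbols; the delimiter "-", the gap augmentation "g", the assumption that the gap augmentation is always the last one on a label, and the class name GapAugmentationSketch are all illustrative choices, not values taken from the parsing engine.

```java
import java.util.List;
import java.util.ListIterator;

/** Toy gap-augmentation helpers; not the danbikel API. */
final class GapAugmentationSketch {
    // Assumed for illustration: "-" as the delimiter and "g" as the gap augmentation.
    static final String DELIM = "-";
    static final String GAP_AUG = "g";

    /** Assumes the gap augmentation, when present, is the last one on the label. */
    static boolean hasGap(String label) {
        return label.endsWith(DELIM + GAP_AUG);
    }

    static String removeFromSymbol(String label) {
        int suffixLength = (DELIM + GAP_AUG).length();
        return hasGap(label) ? label.substring(0, label.length() - suffixLength) : label;
    }

    /** A symbol (String) yields a new symbol; a list is modified in place and returned. */
    @SuppressWarnings("unchecked")
    static Object removeGapAugmentation(Object sexp) {
        if (sexp instanceof String) return removeFromSymbol((String) sexp);
        ListIterator<Object> it = ((List<Object>) sexp).listIterator();
        while (it.hasNext()) {
            it.set(removeFromSymbol((String) it.next()));
        }
        return sexp;
    }
}
```

Callers holding a reference to a list see the stripped labels without using the return value, whereas callers passing a single symbol must use the returned value.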
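void postProcess(Sexp tree)
Post-processes a parse tree after decoding, essentially undoing the steps performed in preprocessing.
Parameters:
tree - the tree to be post-processed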