Parsing Engine
java.lang.Object
  danbikel.parser.Trainer
public class Trainer
Derives all counts necessary to compute the probabilities for this parser, including the top-level counts and all derived counts. The two additional facilities of this class are (1) the loading and storing of a text file containing top-level event counts and (2) the loading and storing of a Java object file containing all derived event counts.
All top-level events or mappings are recorded as S-expressions with the format (name event count) for events and (name key value) for mappings. All derived counts are stored by the internal data structures of several Model objects, which are in turn all contained within a single ModelCollection object. This class provides methods to load and store a Java object file containing this ModelCollection, as well as some initial objects containing information about the ModelCollection object (see the -scan flag in the usage of the main method of this class).
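The record formats described above can be illustrated with a small, self-contained sketch. It does not use the danbikel.lisp classes; the class name EventRecordSketch, its methods, and the labels "head" and "pos" are all assumptions made for illustration, not the labels this class actually writes.

```java
import java.util.ArrayList;
import java.util.List;

final class EventRecordSketch {
  /** Formats an event record as (name event count). */
  static String formatEvent(String name, String event, double count) {
    return "(" + name + " " + event + " " + count + ")";
  }

  /** Formats a mapping record as (name key value). */
  static String formatMapping(String name, String key, String value) {
    return "(" + name + " " + key + " " + value + ")";
  }

  /** Splits a record into its top-level elements, keeping nested lists intact. */
  static List<String> topLevelElements(String record) {
    String s = record.trim();
    if (s.length() < 2 || s.charAt(0) != '(' || s.charAt(s.length() - 1) != ')') {
      throw new IllegalArgumentException("not a list: " + record);
    }
    List<String> elements = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    int depth = 0;
    // Walk the characters between the outermost parentheses.
    for (int i = 1; i < s.length() - 1; i++) {
      char c = s.charAt(i);
      if (c == '(') depth++;
      if (c == ')') depth--;
      if (Character.isWhitespace(c) && depth == 0) {
        // Whitespace at depth 0 separates top-level elements.
        if (current.length() > 0) {
          elements.add(current.toString());
          current.setLength(0);
        }
      } else {
        current.append(c);
      }
    }
    if (current.length() > 0) elements.add(current.toString());
    return elements;
  }

  public static void main(String[] args) {
    // "head" is a hypothetical label, used only for this example.
    String record = formatEvent("head", "(S VP (told VBD))", 3.0);
    System.out.println(record);
    System.out.println(topLevelElements(record));
  }
}
```

Each record splits into exactly three top-level elements, so a reader can dispatch on the first element (the name) before interpreting the other two.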
The various model objects capture the generation submodels of the different output elements of the parser. The smoothing levels of these submodels are represented by ProbabilityStructure objects, passed as parameters to the Model objects at construction time. This architecture provides a type of "plug-n-play" smoothing scheme for the various submodels of this parser.
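The "plug-n-play" idea can be sketched as follows: a model receives, at construction time, an object that decides what conditioning context is kept at each back-off level. The types below (LevelStructure, DropLastStructure, CountingModel) are hypothetical stand-ins written for this illustration; they are not the ProbabilityStructure or Model classes of this package, and the real classes may differ considerably.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PluggableSmoothingSketch {
  /** Hypothetical stand-in: maps a full context to its history at a back-off level. */
  interface LevelStructure {
    int numLevels();
    List<String> history(List<String> fullContext, int level);
  }

  /** One possible structure: drop one trailing context element per level. */
  static final class DropLastStructure implements LevelStructure {
    private final int levels;
    DropLastStructure(int levels) { this.levels = levels; }
    public int numLevels() { return levels; }
    public List<String> history(List<String> fullContext, int level) {
      int keep = Math.max(0, fullContext.size() - level);
      return fullContext.subList(0, keep);
    }
  }

  /** Hypothetical model: counts (history, future) pairs at every level of its structure. */
  static final class CountingModel {
    private final LevelStructure structure;
    private final Map<String, Double> counts = new HashMap<>();

    CountingModel(LevelStructure structure) { this.structure = structure; }

    void observe(List<String> context, String future, double count) {
      for (int level = 0; level < structure.numLevels(); level++) {
        counts.merge(key(context, future, level), count, Double::sum);
      }
    }

    double count(List<String> context, String future, int level) {
      return counts.getOrDefault(key(context, future, level), 0.0);
    }

    private String key(List<String> context, String future, int level) {
      return level + ":" + structure.history(context, level) + "->" + future;
    }
  }

  public static void main(String[] args) {
    CountingModel model = new CountingModel(new DropLastStructure(3));
    model.observe(Arrays.asList("S", "VP", "told"), "NP", 1.0);
    model.observe(Arrays.asList("S", "VP", "said"), "NP", 1.0);
    System.out.println(model.count(Arrays.asList("S", "VP", "told"), "NP", 0));  // 1.0
    System.out.println(model.count(Arrays.asList("S", "VP", "heard"), "NP", 1)); // 2.0
  }
}
```

Swapping in a different LevelStructure changes the back-off behavior without touching the model code, which is the design property the paragraph above describes.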
See Also: main(String[]), Model, ModelCollection, ProbabilityStructure, Serialized Form
Nested Class Summary | |
---|---|
static class |
Trainer.EventEntry
Class to represent a MapToPrimitive.Entry object for use by the getEventIterator(danbikel.lisp.SexpTokenizer, danbikel.lisp.Symbol) method. |
Field Summary | |
---|---|
protected Filter |
allPass
An instance of AllPass . |
protected Map |
canonicalSubcatMap
A reflexive map for storing canonical versions of Subcat objects. |
protected double |
countThreshold
The value of the Settings.countThreshold setting. |
protected double |
derivedCountThreshold
The value of the Settings.derivedCountThreshold setting. |
protected boolean |
downcaseWords
The value of the Settings.downcaseWords setting. |
protected Subcat |
emptySubcat
The value returned by Subcats.get() . |
protected Symbol |
gapAugmentation
The value of Training.gapAugmentation() . |
protected CountsTable |
gapEvents
A table for storing counts of gap-generation events. |
static Symbol |
gapEventSym
The label for gap events. |
protected Model |
gapModel
The gap-generation model. |
protected CountsTable |
headEvents
A table for storing counts of head-generation events. |
static Symbol |
headEventSym
The label for head nonterminal generation events. |
protected Model |
headModel
The head-generation model. |
protected Map |
headToParentMap
A map of head child nonterminals to their observed parent nonterminals. |
protected boolean |
keepAllWords
The value of the Settings.keepAllWords setting. |
protected boolean |
keepLowFreqTags
The value of the Settings.keepLowFreqTags setting. |
protected Map |
leftSubcatMap
A map of events from the last back-off level of the left subcat–generation submodel to the set of possible left subcats. |
protected Model |
leftSubcatModel
The model for generating subcats that fall on the left side of head children. |
protected Model |
lexPriorModel
The model for marginal probabilities of lexical elements (for the estimation of the joint event that is a fully lexicalized nonterminal). |
protected ModelCollection |
modelCollection
The set of Model objects and other resources that describe
an entire parsing model. |
static Symbol |
modEventSym
The label for modifier nonterminal generation events. |
protected CountsTable |
modifierEvents
A table for storing counts of modifier-generation events. |
protected Map |
modNonterminalMap
A map of events from the last back-off level of the modifier nonterminal–generation submodel to the set of possible futures (typically, a future is a modifier label and its head word's part-of-speech tag). |
protected Model |
modNonterminalModel
The modifying nonterminal–generation model. |
protected Model |
modWordModel
The model that generates head words of modifying nonterminals. |
protected Filter |
nonPreterm
A filter that only allows TrainerEvent instances that do not
represent preterminals (where the parent is identical to the part-of-speech
tag of the head word). |
protected Filter |
nonStop
A filter that disallows ModifierEvent instances where the
modifier is Training.stopSym() , but allows all other objects. |
protected Filter |
nonStopAndNonTop
A filter that disallows ModifierEvent instances where the modifier
is either Training.stopSym() or Training.topSym(), but
allows all other objects. |
static Symbol |
nonterminalEventSym
The label for nonterminal generation events. |
protected Model |
nonterminalPriorModel
The model for conditional probabilities of nonterminals given the lexical components (for the estimation of the joint event that is a fully lexicalized nonterminal). |
protected CountsTable |
nonterminals
A table for storing counts of (unlexicalized) nonterminals. |
protected Filter |
nonTop
A filter that only allows TrainerEvent instances where the parent
nonterminal is not Training.topSym() . |
protected Filter |
nonTopNonPreterm
A filter that is functionally equivalent to piping objects through both nonTop and nonPreterm . |
protected int |
numPrevMods
The value of the Settings.numPrevMods setting. |
protected int |
numPrevWords
The value of the Settings.numPrevWords setting. |
protected Map |
posMap
A map of words to lists of their observed part-of-speech tags. |
static Symbol |
posMapSym
The label for word to part-of-speech mappings. |
protected CountsTable |
priorEvents
A table for storing counts of lexicalized nonterminal prior events. |
protected Set |
prunedPreterms
A set of Sexp objects representing preterminals that were
pruned during training. |
static Symbol |
prunedPretermSym
The label for the set of pruned preterminals. |
static Symbol |
prunedPuncSym
The label for the set of pruned punctuation preterminals. |
protected Set |
prunedPunctuation
The set of preterminals ( Sexp objects) that were
punctuation elements “raised away” because they were
either at the beginning or end of a sentence. |
protected int |
reportingInterval
The value of the Settings.trainerReportingInterval setting. |
protected Map |
rightSubcatMap
A map of events from the last back-off level of the right subcat–generation submodel to the set of possible right subcats. |
protected Model |
rightSubcatModel
The model for generating subcats that fall on the right side of head children. |
protected Map |
simpleModNonterminalMap
A map from unlexicalized parent-head-side triples to all possible partially-lexicalized modifying nonterminals. |
protected Symbol |
startSym
The value of Training.startSym() . |
protected Word |
startWord
The value of Training.startWord() . |
protected Symbol |
stopSym
The value of Training.stopSym() . |
protected Word |
stopWord
The value of Training.stopWord() . |
protected Model |
topLexModel
The head-word generation model for heads of entire sentences. |
protected Model |
topNonterminalModel
The head-generation model for heads whose parents are Training.topSym() . |
protected Filter |
topOnly
A filter that only allows TrainerEvent instances where the parent
is Training.topSym() . |
protected Symbol |
topSym
The value of Training.topSym() . |
protected Symbol |
traceTag
The value of Training.traceTag() . |
protected static Class |
trainerClass
The class from which an instance will be constructed in main(String[]) . |
protected int |
unknownWordThreshold
The value of the Settings.unknownWordThreshold setting. |
protected static String[] |
usageMsg
The usage for the main method of this class. |
protected CountsTable |
vocabCounter
A table for storing counts of vocabulary items. |
static Symbol |
vocabSym
The label for vocabulary counts. |
protected CountsTable |
wordFeatureCounter
A table for storing counts of word feature–vectors. |
protected WordFeatures |
wordFeatures
A handle onto the static WordFeatures object contained inside
Language . |
static Symbol |
wordFeatureSym
The label for word feature (unknown vocabulary) counts. |
Constructor Summary | |
---|---|
Trainer()
Constructs a new training object, which uses values from Settings
for its settings. |
Method Summary | |
---|---|
protected void |
addGapEvent(GapEvent event)
This method is a synonym for addGapEvent(event, 1.0) . |
protected void |
addGapEvent(GapEvent event,
double count)
Adds the specified GapEvent to gapEvents with
the specified count. |
protected void |
addHeadEvent(HeadEvent event)
This method is a synonym for addHeadEvent(event, 1.0) . |
protected void |
addHeadEvent(HeadEvent event,
double count)
Adds the specified HeadEvent to headEvents with
the specified count. |
protected void |
addModifierEvent(ModifierEvent event)
This method is a synonym for addModifierEvent(event, 1.0) . |
protected void |
addModifierEvent(ModifierEvent event,
double count)
Adds the specified ModifierEvent to modifierEvents
with the specified count. |
protected void |
addToPosMap(Symbol word,
Symbol tag)
Called by addToPosMap(Word) . |
protected void |
addToPosMap(Word word)
Called by collectStats(danbikel.lisp.Sexp, danbikel.parser.HeadTreeNode, boolean) and
alterLowFrequencyWords(HeadTreeNode) . |
static void |
addToValueCounts(Map map,
Object key,
Object value)
Adds value to the set of values to which
key is mapped (if value is not already in
that set) and increments the count of that value by 1. |
static void |
addToValueCounts(Map map,
Object key,
Object value,
int count)
Adds value to the set of values to which
key is mapped (if value is not already
in that set) and increments the count of that value by
count . |
protected void |
alterLowFrequencyWords(HeadTreeNode tree)
For every Word in the specified tree, if it occurred
fewer than unknownWordThreshold times, then it is modified. |
protected void |
clearEventCounters()
Clears the priorEvents , headEvents , modifierEvents and gapEvents counts tables. |
protected void |
collectModifierStats(HeadTreeNode tree,
Subcat subcat,
int gapIdx,
boolean side)
Note the O(n) operation performed on the prevModList. |
protected void |
collectStats(Sexp orig,
HeadTreeNode tree,
boolean isRoot)
Collects the statistics from the specified tree. |
protected void |
countVocab(HeadTreeNode tree)
Counts number of occurrences of each word in the specified tree and adds the word with this count to vocabCounter . |
protected void |
createModelObjects()
Creates all of the internal model objects used by this trainer when constructing its internal ModelCollection object. |
void |
createPosMap()
Creates posMap from the headEvents , modifierEvents and gapEvents counts tables. |
void |
createPosMap(CountsTable events)
Adds to posMap using information contained in the specified
counts table. |
void |
deriveCounts()
Derives event counts for all back-off levels of all sub-models for the current parsing model. |
void |
deriveCounts(boolean setModelCollection)
Derives event counts for all back-off levels of all sub-models for the current parsing model. |
void |
deriveCounts(boolean setModelCollection,
FlexibleMap canonical)
Derives event counts for all back-off levels of all sub-models for the current parsing model. |
protected void |
deriveCounts(double derivedCountThreshold,
FlexibleMap canonical)
Derives all counts for creating a ModelCollection object. |
protected void |
deriveModelCounts(double derivedCountThreshold,
FlexibleMap canonical)
A helper method used by deriveCounts(double,FlexibleMap) to derive
counts for all Model instances contained within a ModelCollection . |
void |
doneCollectingObservations()
A hook that gets called by main(java.lang.String[]) after all observations are
collected via any calls to readStats(File) ,
readStats(SexpTokenizer) and
train(SexpTokenizer,boolean,boolean) . |
static SexpList |
getCanonicalList(Map map,
SexpList list)
Returns a canonical version of the specified list from the specified reflexive map. |
static Iterator |
getEventIterator(SexpTokenizer tokenizer,
Symbol type)
Returns an iterator over TrainerEvent objects that were written
out in S-expression form. |
static SexpTokenizer |
getStandardSexpStream(File file)
Returns a new SexpTokenizer wrapped around the specified file
using the encoding specified by Language.encoding() and
a buffer size equal to Constants.defaultFileBufsize . |
protected static void |
incrementallyTrain(Trainer trainer,
String inputFilename)
Incrementally updates derived model counts by reading chunks of TrainerEvent objects from the specified input file. |
static ModelCollection |
loadModelCollection(ObjectInputStream ois)
Loads the ModelCollection from the specified input stream. |
static ModelCollection |
loadModelCollection(String objectInputFilename)
Loads the ModelCollection from the specified file. |
static void |
main(String[] args)
Takes arguments according to the usage as specified in usageMsg . |
protected void |
modelCollectionSet(FlexibleMap canonical)
Sets all the data members of the modelCollection member of this
trainer with the internal resources constructed by this trainer (such as
all the Model instances). |
protected void |
modelCollectionSetHook()
A method called by deriveCounts() just after it calls
ModelCollection.set(danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.CountsTable, danbikel.parser.CountsTable, danbikel.parser.CountsTable, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Set, java.util.Set, danbikel.util.FlexibleMap) . |
protected ModelCollection |
newModelCollection()
Returns a new instance of ModelCollection . |
static SexpList |
newStartList()
Creates and returns a new start list. |
static WordList |
newStartWordList()
|
void |
outputHeadToParentMap()
Outputs the head map internal to this Trainer object
to System.err . |
static void |
outputMap(Map map,
String mapName)
Outputs the specified map to System.err. |
static void |
outputMap(Map map,
String mapName,
Writer writer)
Outputs the specified named map to the specified writer. |
static void |
outputMaps(Map leftMap,
String leftMapName,
Map rightMap,
String rightMapName)
Outputs both the specified maps to System.err . |
static void |
outputMaps(Map leftMap,
String leftMapName,
Map rightMap,
String rightMapName,
Writer writer)
Outputs both the specified maps to the specified writer. |
void |
outputModNonterminalMap()
Outputs the modifier map internal to this Trainer object
to System.err . |
void |
outputSubcatMaps()
Outputs the subcat maps internal to this Trainer object
to System.err . |
protected void |
precomputeProbs()
Precomputes all probabilities and smoothing parameters for all Model instances that are part of the ModelCollection of this
trainer. |
void |
readStats(File file)
Reads the statistics and observations from an output file in the format created by writeStats(Writer) . |
void |
readStats(SexpTokenizer tok)
Reads the observations and their counts contained in the specified S-expression tokenization stream. |
void |
readStats(SexpTokenizer tok,
int maxEventsToRead)
Reads at most the specified number of observations and their counts contained in the specified S-expression tokenization stream. |
void |
readStatsHook(SexpList event)
A hook for subclasses to read an event of a newly-defined type (called by readStats(SexpTokenizer) ). |
static void |
scanModelCollectionObjectFile(ObjectInputStream ois,
OutputStream os)
Scans the object file and prints out the information contained in its header objects. |
static void |
scanModelCollectionObjectFile(String scanObjectFilename,
OutputStream os)
Scans the object file and prints out the information contained in its header objects. |
void |
setModelCollection(ObjectInputStream ois)
Sets the internal modelCollection member of this class to the
instance loaded from the specified input stream. |
void |
setModelCollection(String objectInputFilename)
Sets the internal modelCollection data member of this class to the
object of that type loaded from the specified file. |
void |
train(SexpTokenizer tok,
boolean auto,
boolean stripOuterParens)
Records observations from the training trees contained in the specified S-expression tokenizer. |
void |
writeModelCollection(ObjectOutputStream oos,
String trainingInputFilename,
String trainingOutputFilename)
Writes the internal ModelCollection object to the specified output
stream, writing a header containing the names of the training input file
and training output file. |
void |
writeModelCollection(String objectOutputFilename,
String trainingInputFilename,
String trainingOutputFilename)
Writes the internal ModelCollection object to the specified output
file, writing a header containing the names of the training input file and
training output file. |
void |
writeStats(File file)
Writes the statistics and mappings collected by train(SexpTokenizer,boolean,boolean) to a human-readable text
file, by constructing a Writer around a stream around the
specified file and calling writeStats(Writer) . |
void |
writeStats(Writer writer)
Writes the statistics and mappings collected by train(SexpTokenizer,boolean,boolean) to a human-readable text
file. |
void |
writeStatsHook(Writer writer)
A hook for subclasses to write out any additional top-level events, or top-level events of a different, newly-defined type. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static Class trainerClass
The class from which an instance will be constructed in main(String[]). This data member may be re-assigned in a subclass' main method before invocation of this class' main method, so that all trainer method invocations done by this class' main method will be on an instance of the subclass.
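The re-assignment pattern described for trainerClass can be sketched with two stand-in classes. BaseTrainerSketch and CustomTrainerSketch are hypothetical names invented for this example, and the reflective construction shown is an assumption about how an instance might be created from the class object; it is not taken from the Trainer source.

```java
class BaseTrainerSketch {
  /** The class that main instantiates; subclasses may re-assign it before delegating. */
  protected static Class<? extends BaseTrainerSketch> trainerClass = BaseTrainerSketch.class;

  String describe() { return "base trainer"; }

  /** Constructs an instance of whatever class trainerClass currently names. */
  static BaseTrainerSketch newTrainer() throws ReflectiveOperationException {
    return trainerClass.getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws ReflectiveOperationException {
    System.out.println(newTrainer().describe());
  }
}

class CustomTrainerSketch extends BaseTrainerSketch {
  @Override
  String describe() { return "custom trainer"; }

  public static void main(String[] args) throws ReflectiveOperationException {
    // Re-assign the class object, then reuse the superclass' main unchanged.
    trainerClass = CustomTrainerSketch.class;
    BaseTrainerSketch.main(args);
  }
}
```

Running CustomTrainerSketch's main prints "custom trainer", because every instance method invoked by the inherited main logic is dispatched on the subclass instance.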
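public static final Symbol nonterminalEventSym
The label for nonterminal generation events.
public static final Symbol headEventSym
The label for head nonterminal generation events.
public static final Symbol modEventSym
The label for modifier nonterminal generation events.
public static final Symbol gapEventSym
The label for gap events.
public static final Symbol posMapSym
The label for word to part-of-speech mappings.
public static final Symbol vocabSym
The label for vocabulary counts.
public static final Symbol wordFeatureSym
The label for word feature (unknown vocabulary) counts.
public static final Symbol prunedPretermSym
The label for the set of pruned preterminals.
See Also: Training.prune(Sexp)
public static final Symbol prunedPuncSym
The label for the set of pruned punctuation preterminals.
See Also: Training.raisePunctuation(Sexp), Training.getPrunedPunctuation()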
protected int unknownWordThreshold
The value of the Settings.unknownWordThreshold setting.
protected double countThreshold
The value of the Settings.countThreshold setting.
protected double derivedCountThreshold
The value of the Settings.derivedCountThreshold setting.
protected int reportingInterval
The value of the Settings.trainerReportingInterval setting.
protected int numPrevMods
The value of the Settings.numPrevMods setting.
protected int numPrevWords
The value of the Settings.numPrevWords setting.
protected boolean keepAllWords
The value of the Settings.keepAllWords setting.
protected boolean keepLowFreqTags
The value of the Settings.keepLowFreqTags setting.
protected boolean downcaseWords
The value of the Settings.downcaseWords setting.
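protected CountsTable nonterminals
A table for storing counts of (unlexicalized) nonterminals. The keys are instances of Symbol.
protected CountsTable priorEvents
A table for storing counts of lexicalized nonterminal prior events. The keys are instances of PriorEvent.
protected CountsTable headEvents
A table for storing counts of head-generation events. The keys are instances of HeadEvent.
protected CountsTable modifierEvents
A table for storing counts of modifier-generation events. The keys are instances of ModifierEvent.
protected CountsTable gapEvents
A table for storing counts of gap-generation events. The keys are instances of GapEvent.
protected CountsTable vocabCounter
A table for storing counts of vocabulary items. The keys are instances of Symbol.
protected CountsTable wordFeatureCounter
A table for storing counts of word feature–vectors. The keys are instances of Symbol.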
protected Map posMap
A map of words to lists of their observed part-of-speech tags. The keys are instances of Symbol, and the values are SexpList instances that represent sets by containing lists of distinct Symbol objects.
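The shape of posMap, a map from each word to a list that behaves as a set of distinct tags, can be sketched with standard collections. The class PosMapSketch uses String in place of Symbol and List in place of SexpList; both substitutions, and the method names, are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PosMapSketch {
  /** Maps each word to the list of distinct tags it has been observed with. */
  private final Map<String, List<String>> posMap = new LinkedHashMap<>();

  /** Records that the word was observed with the tag, ignoring repeats. */
  void addToPosMap(String word, String tag) {
    List<String> tags = posMap.computeIfAbsent(word, w -> new ArrayList<>());
    if (!tags.contains(tag)) {
      tags.add(tag);
    }
  }

  /** Returns the observed tags for the word, or an empty list if it is unknown. */
  List<String> tagsFor(String word) {
    return posMap.getOrDefault(word, Collections.emptyList());
  }

  public static void main(String[] args) {
    PosMapSketch sketch = new PosMapSketch();
    sketch.addToPosMap("saw", "VBD");
    sketch.addToPosMap("saw", "NN");
    sketch.addToPosMap("saw", "VBD"); // duplicate, not added again
    System.out.println(sketch.tagsFor("saw")); // [VBD, NN]
  }
}
```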
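protected Map headToParentMap
A map of head child nonterminals to their observed parent nonterminals. The keys are instances of Symbol, and the values are Set instances containing Symbol objects.
protected Map leftSubcatMap
A map of events from the last back-off level of the left subcat–generation submodel to the set of possible left subcats. The keys are instances of Event, and the values are Set instances containing Subcat objects.
protected Map rightSubcatMap
A map of events from the last back-off level of the right subcat–generation submodel to the set of possible right subcats. The keys are instances of Event, and the values are Set instances containing Subcat objects.
protected Map modNonterminalMap
A map of events from the last back-off level of the modifier nonterminal–generation submodel to the set of possible futures (typically, a future is a modifier label and its head word's part-of-speech tag). The keys are instances of Event, and the values are Set instances containing Event objects.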
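protected Map simpleModNonterminalMap
A map from unlexicalized parent-head-side triples to all possible partially-lexicalized modifying nonterminals (compare modNonterminalMap).
The keys are SexpList objects containing exactly three Symbol elements representing the following in a production: the parent nonterminal, the head child nonterminal, and the side, which is either Constants.LEFT or Constants.RIGHT.
The values are Set objects containing SexpList objects that contain exactly two Symbol elements representing a partially-lexicalized modifying nonterminal, for example NP(NNP), which is a noun phrase headed by a singular proper noun.
See Also: Settings.useSimpleModNonterminalMap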
protected Set prunedPreterms
A set of Sexp objects representing preterminals that were pruned during training.
See Also: Training.prune(Sexp), Treebank.isPreterminal(Sexp)
protected Set prunedPunctuation
The set of preterminals (Sexp objects) that were punctuation elements “raised away” because they were either at the beginning or end of a sentence.
See Also: Training.raisePunctuation(Sexp), Treebank.isPuncToRaise(Sexp)
protected transient Map canonicalSubcatMap
A reflexive map for storing canonical versions of Subcat objects.
protected transient Subcat emptySubcat
The value returned by Subcats.get().
protected ModelCollection modelCollection
The set of Model objects and other resources that describe an entire parsing model.
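protected Model lexPriorModel
The model for marginal probabilities of lexical elements (for the estimation of the joint event that is a fully lexicalized nonterminal).
protected Model nonterminalPriorModel
The model for conditional probabilities of nonterminals given the lexical components (for the estimation of the joint event that is a fully lexicalized nonterminal).
protected Model topNonterminalModel
The head-generation model for heads whose parents are Training.topSym().
protected Model topLexModel
The head-word generation model for heads of entire sentences.
protected Model headModel
The head-generation model.
protected Model gapModel
The gap-generation model.
protected Model leftSubcatModel
The model for generating subcats that fall on the left side of head children.
protected Model rightSubcatModel
The model for generating subcats that fall on the right side of head children.
protected Model modNonterminalModel
The modifying nonterminal–generation model.
protected Model modWordModel
The model that generates head words of modifying nonterminals.
protected transient WordFeatures wordFeatures
A handle onto the static WordFeatures object contained inside Language.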
protected Symbol startSym
The value of Training.startSym().
protected Symbol stopSym
The value of Training.stopSym().
protected Symbol topSym
The value of Training.topSym().
protected Word startWord
The value of Training.startWord().
protected Word stopWord
The value of Training.stopWord().
protected Symbol gapAugmentation
The value of Training.gapAugmentation().
protected Symbol traceTag
The value of Training.traceTag().
protected Filter allPass
An instance of AllPass.
protected Filter nonTop
A filter that only allows TrainerEvent instances where the parent nonterminal is not Training.topSym().
protected Filter nonPreterm
A filter that only allows TrainerEvent instances that do not represent preterminals (where the parent is identical to the part-of-speech tag of the head word).
protected Filter nonTopNonPreterm
A filter that is functionally equivalent to piping objects through both nonTop and nonPreterm.
protected Filter topOnly
A filter that only allows TrainerEvent instances where the parent is Training.topSym().
protected Filter nonStop
A filter that disallows ModifierEvent instances where the modifier is Training.stopSym(), but allows all other objects.
protected Filter nonStopAndNonTop
A filter that disallows ModifierEvent instances where the modifier is either Training.stopSym() or Training.topSym(), but allows all other objects.
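The relationship among the nonTop, nonPreterm, nonTopNonPreterm and topOnly filters can be sketched with java.util.function.Predicate. This is only an illustration of filter composition: the Event class, the "+TOP+" placeholder and the use of Predicate in place of the package's Filter type are all assumptions.

```java
import java.util.function.Predicate;

final class EventFilterSketch {
  /** Hypothetical placeholder for the value of Training.topSym(). */
  static final String TOP = "+TOP+";

  /** Hypothetical, stripped-down event: a parent label and the head word's tag. */
  static final class Event {
    final String parent;
    final String headTag;

    Event(String parent, String headTag) {
      this.parent = parent;
      this.headTag = headTag;
    }
  }

  /** Passes events whose parent is not the top symbol. */
  static final Predicate<Event> NON_TOP = e -> !TOP.equals(e.parent);

  /** Passes events that are not preterminals (parent differs from the head tag). */
  static final Predicate<Event> NON_PRETERM = e -> !e.parent.equals(e.headTag);

  /** Equivalent to piping events through both filters above. */
  static final Predicate<Event> NON_TOP_NON_PRETERM = NON_TOP.and(NON_PRETERM);

  /** Passes only events whose parent is the top symbol. */
  static final Predicate<Event> TOP_ONLY = NON_TOP.negate();

  public static void main(String[] args) {
    Event phrase = new Event("VP", "VBD");
    Event preterm = new Event("VBD", "VBD");
    Event top = new Event(TOP, "VBD");
    System.out.println(NON_TOP_NON_PRETERM.test(phrase));  // true
    System.out.println(NON_TOP_NON_PRETERM.test(preterm)); // false
    System.out.println(TOP_ONLY.test(top));                // true
  }
}
```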
protected static final String[] usageMsg
The usage for the main method of this class. Run java danbikel.parser.Trainer -help to display the complete usage of this class.
Constructor Detail |
---|
public Trainer()
Constructs a new training object, which uses values from Settings for its settings. This class is not thread-safe, and there will typically be one instance of a Trainer object per process, constructed via the main(java.lang.String[]) method of this class.
See Also: Settings.unknownWordThreshold, Settings.countThreshold, Settings.derivedCountThreshold, Settings.trainerReportingInterval, Settings.numPrevMods
Method Detail |
---|
protected ModelCollection newModelCollection()
Returns a new instance of ModelCollection. Subclasses may override this method to return different sub-types of ModelCollection.
See Also: ModelCollection
public void train(SexpTokenizer tok, boolean auto, boolean stripOuterParens) throws IOException
Records observations from the training trees contained in the specified S-expression tokenizer. The observations are either mappings stored in Map objects or items to be counted, stored in CountsTable objects. All the trees obtained from tok are first preprocessed using Training.preProcess(Sexp).
Parameters:
tok - the S-expression tokenizer from which to obtain training parse trees
auto - indicates whether to automatically determine whether to strip off outer parens of training parse trees before preprocessing; if the value of this argument is false, then the value of stripOuterParens is used
stripOuterParens - indicates whether an outer layer of parentheses should be stripped off of trees obtained from tok before preprocessing and training (only used if the auto argument is false)
Throws:
IOException
See Also: CountsTable, Training.preProcess(Sexp)
protected void countVocab(HeadTreeNode tree)
Counts number of occurrences of each word in the specified tree and adds the word with this count to vocabCounter. Specifically, if the tree with which this recursive method is called represents a preterminal that is not a trace and that is not already a key in wordFeatureCounter, then the word field of the tree's headWord is added (with a count of 1) to vocabCounter.
Parameters:
tree - the tree whose words are to be counted
protected void alterLowFrequencyWords(HeadTreeNode tree)
For every Word in the specified tree, if it occurred fewer than unknownWordThreshold times, then it is modified. If keepAllWords is true, then the word's features field is set, using Word.setFeatures(Symbol); otherwise, the word's word field is set, using Word.setWord(Symbol). In addition, addToPosMap(danbikel.parser.Word) is invoked with posMap and the head word as arguments if keepLowFreqTags is true.
Parameters:
tree - the tree in which to alter word frequencies
protected void addHeadEvent(HeadEvent event)
This method is a synonym for addHeadEvent(event, 1.0).
Parameters:
event - the event to be added with a count of 1.0
See Also: addHeadEvent(HeadEvent,double)
protected void addModifierEvent(ModifierEvent event)
This method is a synonym for addModifierEvent(event, 1.0).
Parameters:
event - the event to be added with a count of 1.0
See Also: addModifierEvent(ModifierEvent,double)
protected void addGapEvent(GapEvent event)
This method is a synonym for addGapEvent(event, 1.0).
Parameters:
event - the event to be added with a count of 1.0
See Also: addGapEvent(GapEvent,double)
protected void addHeadEvent(HeadEvent event, double count)
Adds the specified HeadEvent to headEvents with the specified count. This is a helper method used by the collectStats and readStats methods. The purpose of using this protected method is to provide a hook for subclasses.
Parameters:
event - the HeadEvent to be added
count - the count of the event to be added
See Also: MapToPrimitive.add(Object,double)
protected void addModifierEvent(ModifierEvent event, double count)
Adds the specified ModifierEvent to modifierEvents with the specified count. This is a helper method used by the collectStats, collectModifierStats and readStats methods. The purpose of using this protected method is to provide a hook for subclasses.
Parameters:
event - the ModifierEvent to be added
count - the count of the event to be added
See Also: MapToPrimitive.add(Object,double)
protected void addGapEvent(GapEvent event, double count)
Adds the specified GapEvent to gapEvents with the specified count. This is a helper method used by the collectStats, collectModifierStats and readStats methods. The purpose of using this protected method is to provide a hook for subclasses.
Parameters:
event - the GapEvent to be added
count - the count of the event to be added
See Also: MapToPrimitive.add(Object,double)
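The hook arrangement described for the add*Event methods (a one-argument synonym that delegates to a protected two-argument method, which subclasses may override) can be sketched as follows. EventCounterSketch and CountingHookSketch are hypothetical classes, and String stands in for the event types; neither is part of this package.

```java
import java.util.HashMap;
import java.util.Map;

class EventCounterSketch {
  /** Stand-in for a counts table of head-generation events. */
  protected final Map<String, Double> headEvents = new HashMap<>();

  /** Synonym for addHeadEvent(event, 1.0). */
  protected void addHeadEvent(String event) {
    addHeadEvent(event, 1.0);
  }

  /** Single point through which every head event is recorded. */
  protected void addHeadEvent(String event, double count) {
    headEvents.merge(event, count, Double::sum);
  }

  double count(String event) {
    return headEvents.getOrDefault(event, 0.0);
  }
}

class CountingHookSketch extends EventCounterSketch {
  int calls = 0;

  /** Overriding the two-argument method intercepts both entry points. */
  @Override
  protected void addHeadEvent(String event, double count) {
    calls++;
    super.addHeadEvent(event, count);
  }

  public static void main(String[] args) {
    CountingHookSketch counter = new CountingHookSketch();
    counter.addHeadEvent("(S VP)");      // delegates with a count of 1.0
    counter.addHeadEvent("(S VP)", 2.5);
    System.out.println(counter.count("(S VP)")); // 3.5
    System.out.println(counter.calls);           // 2
  }
}
```

Because the one-argument form only delegates, a subclass needs to override a single method to observe or alter every recorded event.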
protected void collectStats(Sexp orig, HeadTreeNode tree, boolean isRoot)
Collects the statistics from the specified tree.
Parameters:
orig - the original (preprocessed) tree, used for debugging purposes
tree - the tree from which to collect statistics and mappings
isRoot - indicates whether tree is the observed root of a tree (the observed root is the child of the hidden root, represented by the symbol Training.topSym())
public static SexpList newStartList()
Creates and returns a new start list. The list is populated with Language.training.startSym(). This is the appropriate initial list of previously "generated" modifiers when beginning the Markov process of generating modifiers.
See Also: Training.startSym()
public static WordList newStartWordList()
protected void collectModifierStats(HeadTreeNode tree, Subcat subcat, int gapIdx, boolean side)
Note the O(n) operation performed on the prevModList.
public void createPosMap()
Creates posMap from the headEvents, modifierEvents and gapEvents counts tables.
public void createPosMap(CountsTable events)
Adds to posMap using information contained in the specified counts table.
Parameters:
events - the counts table of TrainerEvent instances from which to derive a mapping of words to their observed parts of speech
protected final void addToPosMap(Word word)
Called by collectStats(danbikel.lisp.Sexp, danbikel.parser.HeadTreeNode, boolean) and alterLowFrequencyWords(HeadTreeNode).
Parameters:
word - the Word object containing a word (and possibly a word-feature vector) and a tag with which that word (and possibly feature vector) has been observed
protected final void addToPosMap(Symbol word, Symbol tag)
Called by addToPosMap(Word).
Parameters:
word - the word with which to associate a part of speech
tag - the part-of-speech tag associated with the specified word
protected void createModelObjects()
Creates all of the internal model objects used by this trainer when constructing its internal ModelCollection object. Each model is created by first creating its ProbabilityStructure object, and then calling that object's ProbabilityStructure.newModel() method to wrap itself in a Model instance. There are ten Model members of this class:
lexPriorModel
nonterminalPriorModel
topNonterminalModel
topLexModel
headModel
gapModel
leftSubcatModel
rightSubcatModel
modNonterminalModel
modWordModel
To determine the ProbabilityStructure for each of the above models, the following algorithm is used:
... Settings, then it is appended to the default model structure classname and an instance is used.
See the Settings.globalModelStructureNumber setting for more details on all the model structure–specific settings that control which concrete subclasses of ProbabilityStructure are instantiated.
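public void deriveCounts()
Derives event counts for all back-off levels of all sub-models for the current parsing model. After the counts are derived, the modelCollectionSet(FlexibleMap) method will be invoked.
See Also: Model.deriveCounts(CountsTable,Filter,double,FlexibleMap)
public void deriveCounts(boolean setModelCollection)
Derives event counts for all back-off levels of all sub-models for the current parsing model.
Parameters:
setModelCollection - indicates whether to invoke modelCollectionSet(FlexibleMap) after deriving counts
See Also: Model.deriveCounts(CountsTable,Filter,double,FlexibleMap)
public void deriveCounts(boolean setModelCollection, FlexibleMap canonical)
Derives event counts for all back-off levels of all sub-models for the current parsing model.
Parameters:
setModelCollection - indicates whether to invoke modelCollectionSet(FlexibleMap) after deriving counts
canonical - the FlexibleMap instance to use for creating a reflexive map of canonical versions of event objects created when deriving counts
See Also: Model.deriveCounts(CountsTable,Filter,double,FlexibleMap)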
protected void clearEventCounters()
Clears the priorEvents, headEvents, modifierEvents and gapEvents counts tables.
protected void deriveCounts(double derivedCountThreshold, FlexibleMap canonical)
Derives all counts for creating a ModelCollection object.
Parameters:
derivedCountThreshold - the count threshold below which to throw away derived events
canonical - a reflexive map of canonical versions of derived Event and Transition objects, shared among all Model instances of this trainer
protected void deriveModelCounts(double derivedCountThreshold, FlexibleMap canonical)
A helper method used by deriveCounts(double,FlexibleMap) to derive counts for all Model instances contained within a ModelCollection.
Parameters:
derivedCountThreshold - the count threshold below which to throw away derived events
canonical - a reflexive map of canonical versions of derived Event and Transition objects, shared among all Model instances
protected void precomputeProbs()
Precomputes all probabilities and smoothing parameters for all Model instances that are part of the ModelCollection of this trainer.
See Also: Model.precomputeProbs()
protected void modelCollectionSet(FlexibleMap canonical)
Sets all the data members of the modelCollection member of this trainer with the internal resources constructed by this trainer (such as all the Model instances).
Parameters:
canonical - a reflexive map of canonical versions of derived Event and Transition objects, shared among all Model instances
protected void modelCollectionSetHook()
A method called by deriveCounts() just after it calls ModelCollection.set(danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.CountsTable, danbikel.parser.CountsTable, danbikel.parser.CountsTable, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Set, java.util.Set, danbikel.util.FlexibleMap).
public static final SexpList getCanonicalList(Map map, SexpList list)
Returns a canonical version of the specified list from the specified reflexive map.
Parameters:
map - a reflexive map of SexpList objects
list - the list to canonicalize
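A reflexive map is one in which each key maps to itself, so that equal objects can be collapsed onto a single shared instance. The sketch below shows that idea with standard collections. The class CanonicalListSketch is hypothetical, and its choice to insert an unseen item as its own canonical version is an assumption; the documentation above does not say what getCanonicalList does for a list that is absent from the map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CanonicalListSketch {
  /**
   * Returns the canonical instance equal to item. If none is present yet,
   * item itself becomes the canonical instance (an assumption of this sketch).
   */
  static <T> T getCanonical(Map<T, T> reflexiveMap, T item) {
    T existing = reflexiveMap.putIfAbsent(item, item);
    return existing != null ? existing : item;
  }

  public static void main(String[] args) {
    Map<List<String>, List<String>> canonical = new HashMap<>();
    List<String> first = new ArrayList<>(Arrays.asList("NP", "NNP"));
    List<String> second = new ArrayList<>(Arrays.asList("NP", "NNP"));
    // Two equal but distinct lists collapse onto the first instance seen.
    System.out.println(getCanonical(canonical, first) == first);  // true
    System.out.println(getCanonical(canonical, second) == first); // true
  }
}
```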
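public void outputHeadToParentMap()
Outputs the head map internal to this Trainer object to System.err.
public void outputSubcatMaps()
Outputs the subcat maps internal to this Trainer object to System.err.
public void outputModNonterminalMap()
Outputs the modifier map internal to this Trainer object to System.err.
public static void outputMap(Map map, String mapName)
Outputs the specified map to System.err.
public static void outputMaps(Map leftMap, String leftMapName, Map rightMap, String rightMapName)
Outputs both the specified maps to System.err.
public static void outputMap(Map map, String mapName, Writer writer) throws IOException
Outputs the specified named map to the specified writer.
Throws:
IOException
public static void outputMaps(Map leftMap, String leftMapName, Map rightMap, String rightMapName, Writer writer) throws IOException
Outputs both the specified maps to the specified writer.
Throws:
IOException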
public static final void addToValueCounts(Map map, Object key, Object value)
Adds value
to the set of values to which
key
is mapped (if value
is not already in
that set) and increments the count of that value by 1.
map
- the map of keys to sets of values, where each value has its
own count (map
is actually a map of keys to maps of values
to counts)
key
- the key in map
to associate with a set of values
with counts
value
- the value to add to the set of key
's values,
whose count is to be incremented by 1
public static final void addToValueCounts(Map map, Object key, Object value, int count)
Adds value
to the set of values to which
key
is mapped (if value
is not already
in that set) and increments the count of that value by
count
.
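The nested structure described here (a map of keys to maps of values to counts) can be sketched as follows. This is a generic illustration only: the documented methods take raw Map and Object arguments, while the class name ValueCountsSketch, the type parameters, and the sample keys are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a key -> (value -> count) map;
// this is not the Trainer implementation.
class ValueCountsSketch {
  static <K, V> void addToValueCounts(Map<K, Map<V, Integer>> map,
                                      K key, V value, int count) {
    // Create the inner value-to-count map the first time a key is seen.
    Map<V, Integer> valueCounts =
        map.computeIfAbsent(key, k -> new HashMap<>());
    // Insert the value with the given count, or add to its existing count.
    valueCounts.merge(value, count, Integer::sum);
  }

  public static void main(String[] args) {
    Map<String, Map<String, Integer>> counts = new HashMap<>();
    addToValueCounts(counts, "NN", "NP", 1);
    addToValueCounts(counts, "NN", "NP", 2);
    addToValueCounts(counts, "NN", "PP", 1);
    System.out.println(counts.get("NN").get("NP"));  // prints 3
    System.out.println(counts.get("NN").get("PP"));  // prints 1
  }
}
```

The three-argument form corresponds to calling this sketch with a count of 1.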
map
- the map of keys to sets of values, where each value has its
own count (map
is actually a map of keys to maps of values
to counts)
key
- the key in map
to associate with a set of values
with counts
value
- the value to add to the set of key
's values,
whose count is to be incremented by count
count
- the amount by which to increment value
's count
public void writeStats(File file) throws IOException
train(SexpTokenizer,boolean,boolean)
to a human-readable text
file, by constructing a Writer
around a stream around the
specified file and calling writeStats(Writer)
.
IOException
train(SexpTokenizer,boolean,boolean)
,
writeStats(Writer)
public void writeStatsHook(Writer writer) throws IOException
writer
-
IOException
public void writeStats(Writer writer) throws IOException
train(SexpTokenizer,boolean,boolean)
to a human-readable text
file. Calls writeStatsHook(Writer)
just before terminating.
IOException
train(SexpTokenizer,boolean,boolean)
,
SymbolicCollectionWriter.writeMap(Map,Symbol,Writer)
,
CountsTable.output(String,Writer)
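The relationship among writeStats(File), writeStats(Writer) and writeStatsHook(Writer) follows a common delegation-plus-hook pattern. A minimal sketch of that pattern is shown below; the class names, the sample event lines, and the use of UTF-8 are assumptions for illustration (the real output format and encoding are determined by the parser's own classes and settings).

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the delegation and hook pattern;
// this is not the Trainer implementation.
class StatsWriterSketch {
  /** Wraps the file in a Writer and delegates to writeStats(Writer). */
  void writeStats(File file) throws IOException {
    try (Writer writer = new BufferedWriter(new OutputStreamWriter(
             new FileOutputStream(file), StandardCharsets.UTF_8))) {
      writeStats(writer);
    }
  }

  /** Writes the built-in output, then gives subclasses a chance to add more. */
  void writeStats(Writer writer) throws IOException {
    writer.write("(example-event (a b) 1.0)\n");  // made-up sample line
    writeStatsHook(writer);                       // called just before returning
    writer.flush();
  }

  /** Default hook does nothing; subclasses override to emit extra output. */
  void writeStatsHook(Writer writer) throws IOException { }

  /** Convenience for demonstration: renders the output to a String. */
  static String render(StatsWriterSketch sketch) {
    StringWriter out = new StringWriter();
    try {
      sketch.writeStats(out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.print(render(new ExtendedStatsWriterSketch()));
  }
}

// A subclass that contributes one extra (made-up) line through the hook.
class ExtendedStatsWriterSketch extends StatsWriterSketch {
  @Override
  void writeStatsHook(Writer writer) throws IOException {
    writer.write("(extra-event (c) 2.0)\n");
  }
}
```

With this arrangement a subclass can add its own output without overriding either writeStats method.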
public void readStats(File file) throws FileNotFoundException, UnsupportedEncodingException, IOException
writeStats(Writer)
. Observations are one of several
types, all recorded as S-expressions where the first element is one of the
following symbols:
file
- the file containing the S-expressions representing
top-level observations and their counts
FileNotFoundException
UnsupportedEncodingException
IOException
public static SexpTokenizer getStandardSexpStream(File file) throws FileNotFoundException, UnsupportedEncodingException, IOException
SexpTokenizer
wrapped around the specified file
using the encoding specified by Language.encoding()
and
a buffer size equal to Constants.defaultFileBufsize
.
file
- the file around which to construct a SexpTokenizer
SexpTokenizer
wrapped around the specified file
using the encoding specified by Language.encoding()
and
a buffer size equal to Constants.defaultFileBufsize
FileNotFoundException
- if the specified file cannot be found
UnsupportedEncodingException
- if the encoding specified by
Language.encoding()
is unsupported
IOException
- if there is a problem opening a stream for the
specified file
public void readStatsHook(SexpList event)
readStats(SexpTokenizer)
). This method is responsible for
printing out any error messages if the specified event is improperly
formatted or is not recognized. New event types must still have
the same general S-expression format requirements as the existing event
types of this class: they must be lists of length 2 or 3.
System.err
indicating that the specified event is
an unrecognized event type.
event
- the event to be read
public static Iterator getEventIterator(SexpTokenizer tokenizer, Symbol type)
TrainerEvent
objects that were written
out in S-expression form.
tokenizer
- the S-expression reader from which to read
TrainerEvent
objects that were serialized as S-expression
strings
type
- the type of TrainerEvent
objects to retrieve; the value
of this argument may be one of
TrainerEvent
objects that were written
out in S-expression form
public void readStats(SexpTokenizer tok) throws IOException
writeStats(Writer)
. Observations are one of several
types, all recorded as S-expressions where the first element is one of the
following symbols:
tok
- the S-expression tokenization stream from which to read
top-level counts
IOException
- if the underlying stream throws an IOException
public void readStats(SexpTokenizer tok, int maxEventsToRead) throws IOException
writeStats(Writer)
. Observations are
one of several types, all recorded as S-expressions where the first
element is one of the following symbols:
tok
- the S-expression tokenization stream from which to read
top-level counts
maxEventsToRead
- the maximum number of events to read from the
specified stream; if the value of this parameter is less than 1,
then all observations are read from the underlying stream, and the
behavior of this method is identical to readStats(SexpTokenizer)
IOException
- if the underlying stream throws an IOException
public void doneCollectingObservations()
main(java.lang.String[])
after all observations are
collected via any calls to readStats(File)
,
readStats(SexpTokenizer)
and
train(SexpTokenizer,boolean,boolean)
. The default implementation
does nothing.
public void writeModelCollection(String objectOutputFilename, String trainingInputFilename, String trainingOutputFilename) throws FileNotFoundException, IOException
ModelCollection
object to the specified output
file, writing a header containing the names of the training input file and
training output file.
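A header-then-payload layout of this kind can be sketched with standard Java serialization: header objects are written first, so a later scan can read them without deserializing the large object that follows. The sketch below is illustrative only. The class name HeaderScanSketch, the choice of two String headers, the HashMap payload standing in for a ModelCollection, and the sample file names are all assumptions; the actual header contents are defined by writeModelCollection itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;

// Hypothetical illustration of writing header objects ahead of a payload;
// this is not the Trainer implementation.
class HeaderScanSketch {
  /** Serializes two header strings followed by the payload object. */
  static byte[] write(String trainingInput, String trainingOutput,
                      HashMap<String, Integer> payload) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
      oos.writeObject(trainingInput);   // header object 1
      oos.writeObject(trainingOutput);  // header object 2
      oos.writeObject(payload);         // the main object comes last
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads only the header objects, leaving the payload unread. */
  static String scan(byte[] data) {
    try (ObjectInputStream ois =
             new ObjectInputStream(new ByteArrayInputStream(data))) {
      String in = (String) ois.readObject();
      String out = (String) ois.readObject();
      return "training input: " + in + "; training output: " + out;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Skips past the header objects and returns the payload. */
  @SuppressWarnings("unchecked")
  static HashMap<String, Integer> load(byte[] data) {
    try (ObjectInputStream ois =
             new ObjectInputStream(new ByteArrayInputStream(data))) {
      ois.readObject();  // discard header object 1
      ois.readObject();  // discard header object 2
      return (HashMap<String, Integer>) ois.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    HashMap<String, Integer> payload = new HashMap<>();
    payload.put("events", 42);
    byte[] data = write("train.mrg", "train.observed", payload);
    System.out.println(scan(data));
    System.out.println(load(data));
  }
}
```

Because the readers must consume objects in the order they were written, the writer and all readers have to agree on the number and types of header objects.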
objectOutputFilename
- the output file to which to write the
internal ModelCollection
object
constructed by this trainer
trainingInputFilename
- the name of the input file of training parse
trees from which events and counts were
collected
trainingOutputFilename
- the name of the training output file of
top-level (maximal context) events
FileNotFoundException
- if the specified output filename cannot be
created
IOException
- if there is a problem writing to the stream
of the specified output file
public void writeModelCollection(ObjectOutputStream oos, String trainingInputFilename, String trainingOutputFilename) throws IOException
ModelCollection
object to the specified output
stream, writing a header containing the names of the training input file
and training output file.
oos
- the output stream to which to write the
internal ModelCollection
object
constructed by this trainer
trainingInputFilename
- the name of the input file of training parse
trees from which events and counts were
collected
trainingOutputFilename
- the name of the training output file of
top-level (maximal context) events
IOException
- if there is a problem writing to the stream of the
specified output file
public void setModelCollection(String objectInputFilename) throws ClassNotFoundException, IOException, OptionalDataException
modelCollection
data member of this class to the
object of that type loaded from the specified file.
objectInputFilename
- the name of the object file from which to load a ModelCollection
ClassNotFoundException
- if the concrete type of ModelCollection
read from the specified
file cannot be found
IOException
- if there is a problem reading from the
stream of the specified file
OptionalDataException
- if there is a problem reading primitive data
associated with the ModelCollection
object read from the specified file
public static ModelCollection loadModelCollection(String objectInputFilename) throws ClassNotFoundException, IOException, OptionalDataException
ModelCollection
from the specified file.
objectInputFilename
- the name of the Java serialized object file from
which to load a ModelCollection
instance; the file must contain a series of
header objects as produced by writeModelCollection(String,String,String)
ModelCollection
object contained in the specified file
ClassNotFoundException
- if the concrete type of the ModelCollection
or any of the header
objects in the specified file cannot be
found
IOException
- if there is a problem reading from the
specified file
OptionalDataException
- if there is a problem reading primitive data
associated with an object from the object
input stream created from the specified
file
public void setModelCollection(ObjectInputStream ois) throws ClassNotFoundException, IOException, OptionalDataException
modelCollection
member of this class to the
instance loaded from the specified input stream.
ois
- an object input stream containing a series of header objects and
ultimately a ModelCollection
instance
ClassNotFoundException
- if the concrete type of the ModelCollection
or any of the header
objects in the specified input stream cannot
be found
IOException
- if there is a problem reading from the
specified input stream
OptionalDataException
- if there is a problem reading primitive data
associated with an object from the specified
object input stream
public static ModelCollection loadModelCollection(ObjectInputStream ois) throws ClassNotFoundException, IOException, OptionalDataException
ModelCollection
from the specified stream.
ois
- the object input stream from which to load a ModelCollection
instance; the stream must contain a series of
header objects, as produced by
writeModelCollection(String,String,String)
ModelCollection
object contained in the specified file
ClassNotFoundException
- if the concrete type of the ModelCollection
or any of the header
objects in the specified file cannot be
found
IOException
- if there is a problem reading from the
specified file
OptionalDataException
- if there is a problem reading primitive data
associated with an object from the object
input stream created from the specified
file
public static void scanModelCollectionObjectFile(String scanObjectFilename, OutputStream os) throws ClassNotFoundException, IOException, OptionalDataException
writeModelCollection(String,String,String)
.
scanObjectFilename
- the name of the object file whose header is to be scanned
os
- the output stream to which to print information
ClassNotFoundException
- if any of the concrete types of the header
objects in the specified file cannot be
found
IOException
- if there is a problem reading from the
stream created from the specified file
OptionalDataException
- if there is a problem of extra primitive data
when deserializing an object from the object
input stream created from the specified
file
public static void scanModelCollectionObjectFile(ObjectInputStream ois, OutputStream os) throws ClassNotFoundException, IOException, OptionalDataException
writeModelCollection(String,String,String)
.
ois
- the object input stream whose header objects are to be scanned
os
- the output stream to which to print information
ClassNotFoundException
- if any of the concrete types of the header
objects in the specified stream cannot be
found
IOException
- if there is a problem reading from the
specified stream
OptionalDataException
- if there is a problem of extra primitive data
when deserializing an object from the
specified object input stream
protected static void incrementallyTrain(Trainer trainer, String inputFilename) throws FileNotFoundException, UnsupportedEncodingException, IOException
TrainerEvent
objects from the specified input file. The number of TrainerEvent
objects read at a time (the chunk size) is determined by the
value of the Settings.maxEventChunkSize
.
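Chunked reading of this kind, combined with the readStats(SexpTokenizer,int) convention that a limit below 1 means "read everything", can be sketched as follows. The class name ChunkedReadSketch, the String events, and the chunkSizes helper are assumptions for illustration; what the trainer does with each chunk is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of reading events in bounded chunks;
// this is not the Trainer implementation.
class ChunkedReadSketch {
  /** Reads up to maxEvents events; a value below 1 means read all of them. */
  static List<String> readChunk(Iterator<String> events, int maxEvents) {
    List<String> chunk = new ArrayList<>();
    while (events.hasNext()
           && (maxEvents < 1 || chunk.size() < maxEvents)) {
      chunk.add(events.next());
    }
    return chunk;
  }

  /** Consumes all events chunk by chunk and reports each chunk's size. */
  static List<Integer> chunkSizes(List<String> all, int chunkSize) {
    Iterator<String> it = all.iterator();
    List<Integer> sizes = new ArrayList<>();
    while (it.hasNext()) {
      List<String> chunk = readChunk(it, chunkSize);
      // Per-chunk processing would happen here before the next read.
      sizes.add(chunk.size());
    }
    return sizes;
  }

  public static void main(String[] args) {
    List<String> events = Arrays.asList("e1", "e2", "e3", "e4", "e5");
    System.out.println(chunkSizes(events, 2));  // prints [2, 2, 1]
    System.out.println(chunkSizes(events, 0));  // prints [5]
  }
}
```

Bounding the chunk size keeps only part of the observation set in memory at a time, at the cost of multiple processing passes.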
trainer
- the Trainer
instance for which incremental
training is to be performed
inputFilename
- the file containing observations to be read by the
readStats(SexpTokenizer,int)
method
FileNotFoundException
- if the specified file cannot be found
UnsupportedEncodingException
- if the encoding used to read
characters from the specified file is
not supported
IOException
- if there is a problem reading from the
specified file
public static void main(String[] args)
usageMsg
.
Please run java danbikel.parser.Trainer -help
to display the
complete usage of this class.