Parsing Engine

danbikel.parser
Class ModelCollection

java.lang.Object
  extended by danbikel.parser.ModelCollection
All Implemented Interfaces:
Serializable

public class ModelCollection
extends Object
implements Serializable

Provides access to all Model objects and maps necessary for parsing. Because this information is bundled together, all of the objects necessary for parsing can be stored and retrieved simply by serializing and de-serializing this object to a Java object file. The types of output elements that are modeled are determined by the ProbabilityStructure objects around which Model objects are wrapped, via the method ProbabilityStructure.newModel(). This collection holds ten different Model objects, each modeling a different output element of this parser (nonterminal, word, subcategorization frame, etc.), because each wraps a different type of ProbabilityStructure object. The concrete types of ProbabilityStructure objects are determined by various run-time settings, as described in the documentation for Settings.globalModelStructureNumber. The other counts tables, maps, and resources contained in this object are derived by the Trainer.

See Also:
Settings.globalModelStructureNumber, Settings.precomputeProbs, Settings.writeCanonicalEvents, Serialized Form
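The whole-collection serialization described above can be sketched with a minimal stand-in class (ModelBundle, its numModels field, and roundTrip are hypothetical names for illustration; the real class is danbikel.parser.ModelCollection, which carries actual Model objects):

```java
import java.io.*;

// Hypothetical stand-in for a serializable model bundle.
class ModelBundle implements Serializable {
  private static final long serialVersionUID = 1L;
  int numModels = 10;
}

public class RoundTrip {
  // Serialize the bundle and read it back, mirroring how a ModelCollection
  // is stored in and retrieved from a single Java object file.
  static int roundTrip() throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(new ModelBundle());
    }
    try (ObjectInputStream in =
             new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return ((ModelBundle) in.readObject()).numModels;
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("models restored: " + roundTrip());
  }
}
```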

Field Summary
protected static boolean callGCAfterReadingObject
          Indicates whether to invoke System.gc() after this object has been de-serialized from a stream.
protected  FlexibleMap canonicalEvents
          The reflexive map used to canonicalize objects created when deriving counts for all models in this model collection.
protected  Model gapModel
          The model for generating gaps.
protected  Model headModel
          The model for generating a head nonterminal given its (lexicalized) parent.
protected  Map headToParentMap
          A mapping from head labels to possible parent labels.
protected  Map leftSubcatMap
          A mapping from left subcat-prediction conditioning contexts (typically parent and head nonterminal labels) to all possible subcat frames.
protected  Model leftSubcatModel
          The model for generating subcats on the left side of the head child.
protected  Model lexPriorModel
          The model for lexical priors.
protected  Model[] modelArr
          An array containing all Model objects contained by this model collection, set up by createModelArray().
protected  Map modNonterminalMap
          A mapping from the last level of back-off of modifying nonterminal conditioning contexts to all possible modifying nonterminals.
protected  Model modNonterminalModel
          The model for generating partially-lexicalized nonterminals that modify the head child.
protected  Model modWordModel
          The model for generating head words of lexicalized nonterminals that modify the head child.
protected  Symbol[] nonterminalArr
          An array of all nonterminal labels, providing a mapping of unique integers (indices into this array) to nonterminal labels.
protected  Map nonterminalMap
          A map from nonterminal labels (Symbol objects) to unique integers that are indices in the nonterminal array.
protected  Model nonterminalPriorModel
          The model for nonterminal priors.
protected  CountsTable nonterminals
          A table that maps unlexicalized nonterminals to their counts in the training corpus.
protected  Map posMap
          A mapping from lexical items to all of their possible parts of speech.
protected  Set prunedPreterms
          The set of preterminals pruned during training.
protected  Set prunedPunctuation
          The set of punctuation preterminals pruned during training.
protected  Map rightSubcatMap
          A mapping from right subcat-prediction conditioning contexts (typically parent and head nonterminal labels) to all possible subcat frames.
protected  Model rightSubcatModel
          The model for generating subcats on the right side of the head child.
protected  Map simpleModNonterminalMap
          A map from unlexicalized parent-head-side triples to all possible partially-lexicalized modifying nonterminals.
protected  Model topLexModel
          The model for generating the head word and part of speech of observed root nonterminals given the hidden +TOP+ nonterminal.
protected  Model topNonterminalModel
          The model for generating observed root nonterminals given the hidden +TOP+ nonterminal.
protected static boolean verbose
          Indicates whether to output verbose messages to System.err.
protected  CountsTable vocabCounter
          A table that maps observed words to their counts in the training corpus.
protected  CountsTable wordFeatureCounter
          A table that maps observed word-feature vectors to their counts in the training corpus.
 
Constructor Summary
ModelCollection()
          Constructs a new ModelCollection that initially contains no data.
 
Method Summary
 FlexibleMap canonicalEvents()
          Returns the reflexive map used to canonicalize objects created when deriving counts for all models in this model collection.
protected  void createModelArray()
          Populates the modelArr with the Model objects that are contained in this model collection.
 Model gapModel()
          Returns the gap-generation model.
 String getModelCacheStats()
          Invokes Model.getCacheStats() on each Model contained in this model collection, and returns the results as a single String.
 Symbol[] getNonterminalArr()
          Returns the nonterminalArr member.
 Map getNonterminalMap()
          Returns the nonterminalMap member.
 Model headModel()
          Returns the head-generation model.
 Map headToParentMap()
          Returns a mapping from head labels to possible parent labels.
protected  void internalReadObject(ObjectInputStream s)
          Reads an instance of this class from the specified stream.
protected  void internalWriteObject(ObjectOutputStream s)
          Writes this object to the specified stream.
 Map leftSubcatMap()
          Returns a mapping from left subcat-prediction conditioning contexts (typically parent and head nonterminal labels) to all possible subcat frames.
 Model leftSubcatModel()
          Returns the left subcat-generation model.
 Model lexPriorModel()
          Returns the model for marginal probabilities of lexical elements (for the estimation of the joint event that is a fully lexicalized nonterminal).
 Iterator modelIterator()
          Syntactic sugar for modelList().iterator().
 List modelList()
          Returns an unmodifiable list view of the Model objects contained in this model collection.
 Map modNonterminalMap()
          Returns a mapping from the last level of back-off of modifying nonterminal conditioning contexts to all possible modifying nonterminals.
 Model modNonterminalModel()
          Returns the modifying nonterminal-generation model.
 Model modWordModel()
          Returns the model that generates head words of modifying nonterminals.
 Model nonterminalPriorModel()
          Returns the model for conditional probabilities of nonterminals given the lexical components (for the estimation of the joint event that is a fully lexicalized nonterminal).
 CountsTable nonterminals()
          Returns a mapping of (unlexicalized) nonterminals to their counts in the training data.
 int numNonterminals()
          Returns the number of unique (unlexicalized) nonterminals observed in the training data.
 Map posMap()
          Returns a mapping from Symbol objects representing words to SexpList objects that contain the set of their possible parts of speech (a list of Symbol).
 Set prunedPreterms()
          Returns the set of preterminals pruned during training.
 Set prunedPunctuation()
          Returns the set of punctuation preterminals pruned during training.
 Map rightSubcatMap()
          Returns a mapping from right subcat-prediction conditioning contexts (typically parent and head nonterminal labels) to all possible subcat frames.
 Model rightSubcatModel()
          Returns the right subcat-generation model.
 void set(Model lexPriorModel, Model nonterminalPriorModel, Model topNonterminalModel, Model topLexModel, Model headModel, Model gapModel, Model leftSubcatModel, Model rightSubcatModel, Model modNonterminalModel, Model modWordModel, CountsTable vocabCounter, CountsTable wordFeatureCounter, CountsTable nonterminals, Map posMap, Map headToParentMap, Map leftSubcatMap, Map rightSubcatMap, Map modNonterminalMap, Map simpleModNonterminalMap, Set prunedPreterms, Set prunedPunctuation, FlexibleMap canonicalEvents)
          Sets all the data members of this object.
 void shareCounts(boolean verbose)
          Shares, in a dangerous but effective way, counts for a back-off level of one model with another model: here, the last level of back-off of modWordModel is shared with (i.e., will be used as) the last level of back-off of topLexModel, since the last levels of both models typically estimate p(w | t).
 Map simpleModNonterminalMap()
          Returns a map from unlexicalized parent-head-side triples to all possible partially-lexicalized modifying nonterminals.
 Model topLexModel()
          Returns the head-word generation model for heads of entire sentences.
 Model topNonterminalModel()
          Returns the head-generation model for heads whose parents are Training.topSym().
 CountsTable vocabCounter()
          Returns a mapping from Symbol objects representing words to their count in the training data.
 CountsTable wordFeatureCounter()
          Returns a mapping from Symbol objects that are word features to their count in the training data.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

verbose

protected static final boolean verbose
Indicates whether to output verbose messages to System.err. The value of this constant is normally true (this is research software, after all).

See Also:
Constant Field Values

callGCAfterReadingObject

protected static final boolean callGCAfterReadingObject
Indicates whether to invoke System.gc() after this object has been de-serialized from a stream.

See Also:
Constant Field Values

modelArr

protected transient Model[] modelArr
An array containing all Model objects contained by this model collection, set up by createModelArray().


lexPriorModel

protected transient Model lexPriorModel
The model for lexical priors.


nonterminalPriorModel

protected transient Model nonterminalPriorModel
The model for nonterminal priors.


topNonterminalModel

protected transient Model topNonterminalModel
The model for generating observed root nonterminals given the hidden +TOP+ nonterminal.


topLexModel

protected transient Model topLexModel
The model for generating the head word and part of speech of observed root nonterminals given the hidden +TOP+ nonterminal.


headModel

protected transient Model headModel
The model for generating a head nonterminal given its (lexicalized) parent.


gapModel

protected transient Model gapModel
The model for generating gaps.


leftSubcatModel

protected transient Model leftSubcatModel
The model for generating subcats on the left side of the head child.


rightSubcatModel

protected transient Model rightSubcatModel
The model for generating subcats on the right side of the head child.


modNonterminalModel

protected transient Model modNonterminalModel
The model for generating partially-lexicalized nonterminals that modify the head child.


modWordModel

protected transient Model modWordModel
The model for generating head words of lexicalized nonterminals that modify the head child.


vocabCounter

protected transient CountsTable vocabCounter
A table that maps observed words to their counts in the training corpus.


wordFeatureCounter

protected transient CountsTable wordFeatureCounter
A table that maps observed word-feature vectors to their counts in the training corpus.


nonterminals

protected transient CountsTable nonterminals
A table that maps unlexicalized nonterminals to their counts in the training corpus.


posMap

protected transient Map posMap
A mapping from lexical items to all of their possible parts of speech.


headToParentMap

protected transient Map headToParentMap
A mapping from head labels to possible parent labels.


leftSubcatMap

protected transient Map leftSubcatMap
A mapping from left subcat-prediction conditioning contexts (typically parent and head nonterminal labels) to all possible subcat frames.


rightSubcatMap

protected transient Map rightSubcatMap
A mapping from right subcat-prediction conditioning contexts (typically parent and head nonterminal labels) to all possible subcat frames.


modNonterminalMap

protected transient Map modNonterminalMap
A mapping from the last level of back-off of modifying nonterminal conditioning contexts to all possible modifying nonterminals.


simpleModNonterminalMap

protected transient Map simpleModNonterminalMap
A map from unlexicalized parent-head-side triples to all possible partially-lexicalized modifying nonterminals.


prunedPreterms

protected transient Set prunedPreterms
The set of preterminals pruned during training.


prunedPunctuation

protected transient Set prunedPunctuation
The set of punctuation preterminals pruned during training.


canonicalEvents

protected transient FlexibleMap canonicalEvents
The reflexive map used to canonicalize objects created when deriving counts for all models in this model collection.


nonterminalMap

protected Map nonterminalMap
A map from nonterminal labels (Symbol objects) to unique integers that are indices in the nonterminal array.


nonterminalArr

protected Symbol[] nonterminalArr
An array of all nonterminal labels, providing a mapping of unique integers (indices into this array) to nonterminal labels. The inverse map is contained in nonterminalMap.
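The inverse relationship between nonterminalMap and nonterminalArr can be sketched as follows (the String labels and the NonterminalIndex class are hypothetical; the real array holds Symbol objects):

```java
import java.util.*;

public class NonterminalIndex {
  // Hypothetical nonterminal labels standing in for Symbol objects.
  static final String[] nonterminalArr = {"NP", "VP", "PP"};
  static final Map<String, Integer> nonterminalMap = new HashMap<>();
  static {
    // Build the inverse map: label -> unique index into nonterminalArr.
    for (int i = 0; i < nonterminalArr.length; i++)
      nonterminalMap.put(nonterminalArr[i], i);
  }

  public static void main(String[] args) {
    // The array and the map are inverses of each other:
    System.out.println(nonterminalMap.get("VP"));
    System.out.println(nonterminalArr[nonterminalMap.get("PP")]);
  }
}
```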

Constructor Detail

ModelCollection

public ModelCollection()
Constructs a new ModelCollection that initially contains no data.

Method Detail

set

public void set(Model lexPriorModel,
                Model nonterminalPriorModel,
                Model topNonterminalModel,
                Model topLexModel,
                Model headModel,
                Model gapModel,
                Model leftSubcatModel,
                Model rightSubcatModel,
                Model modNonterminalModel,
                Model modWordModel,
                CountsTable vocabCounter,
                CountsTable wordFeatureCounter,
                CountsTable nonterminals,
                Map posMap,
                Map headToParentMap,
                Map leftSubcatMap,
                Map rightSubcatMap,
                Map modNonterminalMap,
                Map simpleModNonterminalMap,
                Set prunedPreterms,
                Set prunedPunctuation,
                FlexibleMap canonicalEvents)
Sets all the data members of this object.

Parameters:
lexPriorModel - the model for marginal probabilities of lexical elements (for the estimation of the joint event that is a fully lexicalized nonterminal)
nonterminalPriorModel - the model for conditional probabilities of nonterminals given the lexical components (for the estimation of the joint event that is a fully lexicalized nonterminal)
topNonterminalModel - the head-generation model for heads whose parents are Training.topSym()
topLexModel - the head-word generation model for heads of entire sentences
headModel - the head-generation model
gapModel - the gap-generation model
leftSubcatModel - the left subcat-generation model
rightSubcatModel - the right subcat-generation model
modNonterminalModel - the modifying nonterminal-generation model
modWordModel - the modifying word-generation model
vocabCounter - a table of counts of all "known" words of the training data
wordFeatureCounter - a table of counts of all word features ("unknown" words) of the training data
nonterminals - a table of counts of all nonterminals occurring in the training data
posMap - a mapping from lexical items to all of their possible parts of speech
headToParentMap - a mapping from head labels to possible parent labels
leftSubcatMap - a mapping from left subcat-prediction conditioning contexts (typically parent and head nonterminal labels) to all possible subcat frames
rightSubcatMap - a mapping from right subcat-prediction conditioning contexts (typically parent and head nonterminal labels) to all possible subcat frames
modNonterminalMap - a mapping from the last level of back-off of modifying nonterminal conditioning contexts to all possible modifying nonterminals
simpleModNonterminalMap - a mapping from unlexicalized parent-head-side triples to all possible partially-lexicalized modifying nonterminals
prunedPreterms - the set of preterminals pruned during training
prunedPunctuation - the set of punctuation preterminals pruned during training
canonicalEvents - the reflexive map used to canonicalize objects created when deriving counts for all models in this model collection

createModelArray

protected void createModelArray()
Populates the modelArr with the Model objects that are contained in this model collection.


modelList

public List modelList()
Returns an unmodifiable list view of the Model objects contained in this model collection.

Returns:
an unmodifiable list view of the Model objects contained in this model collection

modelIterator

public Iterator modelIterator()
Syntactic sugar for modelList().iterator().

Returns:
the iterator of the list returned by modelList()

shareCounts

public void shareCounts(boolean verbose)
Shares, in a dangerous but effective way, counts for a back-off level of one model with another model: here, the last level of back-off of modWordModel is shared with (i.e., will be used as) the last level of back-off of topLexModel, since the last levels of both models typically estimate p(w | t).

Parameters:
verbose - indicates whether to print a message to System.err
See Also:
Settings.trainerShareCounts
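The count-sharing idea can be sketched with plain maps (ShareCountsDemo, the string-encoded events, and the counts are hypothetical; the real models keep their counts in Model-internal tables):

```java
import java.util.*;

public class ShareCountsDemo {
  // Sketch of sharing: the "last back-off level" of two models that both
  // estimate p(w | t) refer to the SAME counts table, so counts added
  // through either reference are visible to both models.
  static int sharedCount() {
    Map<String, Integer> modWordLastLevel = new HashMap<>();
    modWordLastLevel.put("dog|NN", 42);        // hypothetical p(w | t) event
    // topLexModel's last level is aliased to modWordModel's last level.
    Map<String, Integer> topLexLastLevel = modWordLastLevel;
    topLexLastLevel.put("runs|VBZ", 7);
    // The new count is visible through the modWordModel reference too.
    return modWordLastLevel.get("runs|VBZ");
  }

  public static void main(String[] args) {
    System.out.println(sharedCount());
  }
}
```

This aliasing is what makes the method "dangerous": mutating either model's last-level counts silently changes the other's.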

numNonterminals

public int numNonterminals()
Returns the number of unique (unlexicalized) nonterminals observed in the training data.


getNonterminalMap

public Map getNonterminalMap()
Returns the nonterminalMap member.


getNonterminalArr

public Symbol[] getNonterminalArr()
Returns the nonterminalArr member.


lexPriorModel

public Model lexPriorModel()
Returns the model for marginal probabilities of lexical elements (for the estimation of the joint event that is a fully lexicalized nonterminal).


nonterminalPriorModel

public Model nonterminalPriorModel()
Returns the model for conditional probabilities of nonterminals given the lexical components (for the estimation of the joint event that is a fully lexicalized nonterminal).


topNonterminalModel

public Model topNonterminalModel()
Returns the head-generation model for heads whose parents are Training.topSym().


topLexModel

public Model topLexModel()
Returns the head-word generation model for heads of entire sentences.


headModel

public Model headModel()
Returns the head-generation model.


gapModel

public Model gapModel()
Returns the gap-generation model.


leftSubcatModel

public Model leftSubcatModel()
Returns the left subcat-generation model.


rightSubcatModel

public Model rightSubcatModel()
Returns the right subcat-generation model.


modNonterminalModel

public Model modNonterminalModel()
Returns the modifying nonterminal-generation model.


modWordModel

public Model modWordModel()
Returns the model that generates head words of modifying nonterminals.


vocabCounter

public CountsTable vocabCounter()
Returns a mapping from Symbol objects representing words to their count in the training data.


wordFeatureCounter

public CountsTable wordFeatureCounter()
Returns a mapping from Symbol objects that are word features to their count in the training data.


nonterminals

public CountsTable nonterminals()
Returns a mapping of (unlexicalized) nonterminals to their counts in the training data.


posMap

public Map posMap()
Returns a mapping from Symbol objects representing words to SexpList objects that contain the set of their possible parts of speech (a list of Symbol).
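The shape of this mapping can be sketched as follows (PosMapDemo, the String keys, and the example entries are hypothetical; the real map uses Symbol keys and SexpList values):

```java
import java.util.*;

public class PosMapDemo {
  // Hypothetical word -> possible-parts-of-speech entries.
  static Map<String, List<String>> posMap() {
    Map<String, List<String>> posMap = new HashMap<>();
    posMap.put("saw", Arrays.asList("VBD", "NN")); // ambiguous word
    posMap.put("the", Arrays.asList("DT"));        // unambiguous word
    return posMap;
  }

  public static void main(String[] args) {
    // All parts of speech observed for a word in training:
    System.out.println(posMap().get("saw"));
  }
}
```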


headToParentMap

public Map headToParentMap()
Returns a mapping from head labels to possible parent labels.


leftSubcatMap

public Map leftSubcatMap()
Returns a mapping from left subcat-prediction conditioning contexts (typically parent and head nonterminal labels) to all possible subcat frames.


rightSubcatMap

public Map rightSubcatMap()
Returns a mapping from right subcat-prediction conditioning contexts (typically parent and head nonterminal labels) to all possible subcat frames.


modNonterminalMap

public Map modNonterminalMap()
Returns a mapping from the last level of back-off of modifying nonterminal conditioning contexts to all possible modifying nonterminals.


simpleModNonterminalMap

public Map simpleModNonterminalMap()
Returns a map from unlexicalized parent-head-side triples to all possible partially-lexicalized modifying nonterminals.


prunedPreterms

public Set prunedPreterms()
Returns the set of preterminals pruned during training.


prunedPunctuation

public Set prunedPunctuation()
Returns the set of punctuation preterminals pruned during training.


canonicalEvents

public FlexibleMap canonicalEvents()
Returns the reflexive map used to canonicalize objects created when deriving counts for all models in this model collection.


getModelCacheStats

public String getModelCacheStats()
Invokes Model.getCacheStats() on each Model contained in this model collection, and returns the results as a single String.

Returns:
a single string that is the concatenation of all of the Model.getCacheStats() strings for the models in this collection

internalWriteObject

protected void internalWriteObject(ObjectOutputStream s)
                            throws IOException
Writes this object to the specified stream.

Parameters:
s - the stream to which to write this object
Throws:
IOException - if there is a problem writing to the specified stream

internalReadObject

protected void internalReadObject(ObjectInputStream s)
                           throws IOException,
                                  ClassNotFoundException
Reads an instance of this class from the specified stream.

Parameters:
s - the stream from which to read an instance of this class
Throws:
IOException - if there is a problem reading from the specified stream
ClassNotFoundException - if any of the concrete types that are in the specified stream cannot be found


Author: Dan Bikel.