Parsing Engine

danbikel.parser.arabic
Class Training

java.lang.Object
  extended by danbikel.parser.lang.AbstractTraining
      extended by danbikel.parser.arabic.Training
All Implemented Interfaces:
Training, Serializable

public class Training
extends AbstractTraining

Provides methods for language-specific processing of training parse trees. Although this subclass of Training resides in the Arabic language package, its primary purpose is simply to fill in the AbstractTraining.argContexts, AbstractTraining.semTagArgStopSet and AbstractTraining.nodesToPrune data members using a metadata resource. If this capability is desired in another language package, this class may be subclassed.

This class also redefines the method AbstractTraining.hasPossessiveChild(Sexp).

See Also:
Serialized Form

Field Summary
protected static String[] caseMarkers
          An array of case markers in Arabic Treebank part-of-speech tags.
protected static String[] definiteMarkers
          An array of definite/indefinite markers in Arabic Treebank part-of-speech tags.
protected static String[] detPrefixMarkers
          An array of determiner markers in Arabic Treebank part-of-speech tags.
protected static String[] genderMarkers
          An array of gender markers in Arabic Treebank part-of-speech tags.
protected static String[][] markers
          An array of the various markers arrays.
protected static String[] moodMarkers
          An array of verb mood markers in Arabic Treebank part-of-speech tags.
protected static String[] nounSuffixMarkers
          An array of noun markers in Arabic Treebank part-of-speech tags.
protected static String[] numberMarkers
          An array of number markers in Arabic Treebank part-of-speech tags (Arabic has forms for singular, plural and dual).
protected static String[] personMarkers
          An array of person/number markers (indicating information such as “first person singular”) in Arabic Treebank part-of-speech tags.
protected static String[] pronounMarkers
          An array of pronoun markers in Arabic Treebank part-of-speech tags.
protected static boolean regularizeVerbs
          If regularizeVerbs is true, part-of-speech tags that contain any of the patterns in the verbPatterns array are transformed into the pattern itself.
protected static boolean[] remove
          Indicates which of the various types of markers should be removed from Arabic Treebank part-of-speech tags during preprocessing (currently unused).
protected static Symbol tagMapSym
          The symbol associated with tag map metadata.
protected static String[] verbPatterns
          The match patterns used when regularizeVerbs is true.
 
Fields inherited from class danbikel.parser.lang.AbstractTraining
addGapInfo, argAugmentations, argContexts, argNonterminals, baseNP, canonicalAugDelimSym, defaultArgAugmentation, delimAndGapStr, delimAndGapStrLen, gapAugmentation, headFinder, headPostSym, headPreSym, headSym, metadataPropertyPrefix, nodesToPrune, NP, prunedPreterms, prunedPunctuation, relabelHeadChildrenAsArgs, repairBaseNPs, semTagArgStopSet, traceTag, treebank, wordsToPrune
 
Constructor Summary
Training()
          The default constructor, to be invoked by Language.
 
Method Summary
protected  void canonicalizeNonterminals(Sexp tree)
          For Arabic, we do not want to transform preterminals (parts of speech) to their canonical forms, so this method is overridden.
protected  int contains(StringBuffer searchBuf, String[] searchPatterns, IntCounter patternIdx)
          Helper method used by TagMap.transformTag(Word).
protected  void createArgNonterminalsSet()
          An overridden version of AbstractTraining.createArgNonterminalsSet() that adds argument nonterminal patterns, such as *-SBJ, to the set of argument nonterminals.
protected  boolean hasPossessiveChild(Sexp tree)
          We override this method so that it always returns false, so that the default implementation of addBaseNPs(Sexp) never considers an NP to be a possessive NP.
 boolean isValidTree(Sexp tree)
          If the specified tree has a root label with a print name equal to "X", then this method returns false; otherwise, this method returns the value of the default implementation in the superclass with the specified tree (super.isValidTree(tree)).
static void main(String[] args)
          Test driver for this class.
 Sexp preProcess(Sexp tree)
          The method to call before counting events in a training parse tree.
 SexpList preProcessTest(SexpList sentence, SexpList originalWords, SexpList tags)
          Preprocesses the specified test sentence and its coordinated list of part-of-speech tags, leaving the original sentence untouched but providing a modified version of the coordinated list of tags, where each tag has been mapped using the value of the original word and the original tag using TagMap.transformTag(Word).
protected  void readMetadataHook(Symbol dataType, int metadataLen, SexpList metadata)
          Reads the tag map metadata if the specified data type is equal to tagMapSym.
 Symbol startSym()
          Returns the symbol to indicate hidden nonterminals that precede the first in a sequence of modifier nonterminals.
 Word startWord()
          Returns the Word object that represents the hidden "head word" of the start symbol.
 Symbol stopSym()
          Returns the symbol to indicate a hidden nonterminal that follows the last in a sequence of modifier nonterminals.
 Word stopWord()
          Returns the Word object that represents the hidden "head word" of the stop symbol.
 Symbol topSym()
          Returns the symbol to indicate the hidden root of all parse trees.
 Word topWord()
          Returns the Word object that represents the hidden "head word" of the hidden root of all parse trees.
protected  Symbol transformTagOld(Word word)
          Deprecated. This method is the old mechanism by which to transform the part-of-speech tag associated with an Arabic word; it has been superseded by the method TagMap.transformTag(Word).
protected  Sexp transformTags(Sexp tree)
          Does an in-place transformation of the part-of-speech tags in the specified tree.
 
Methods inherited from class danbikel.parser.lang.AbstractTraining
addArgAugmentation, addBaseNPs, addGapInformation, argNonterminals, collectPreterms, createArgAugmentationsList, defaultArgAugmentation, gapAugmentation, getCanonicalArg, getCanonicalArg, getPrunedPreterms, getPrunedPunctuation, hasGap, hasGap, headPostSym, headPreSym, headSym, identifyArguments, isAllNodesToPrune, isArgument, isArgument, isArgument, isArgumentFast, isCoordinatedPhrase, isTypeOfSentence, needToAddNormalNPLevel, postProcess, printMetadata, prune, raisePunctuation, readMetadata, relabelArgChildren, relabelSubjectlessSentences, removeArgAugmentation, removeArgAugmentation, removeGapAugmentation, removeNullElements, removeOnlyChildBaseNPs, removeWord, repairBaseNPs, repairBaseNPs, setUpFastArgMap, skip, staticSetUpFastArgMap, stripAugmentations, stripAugmentations, stripAugmentations, threadNPArgAugmentations, traceTag, transformSubjectNTs, unaryProductionsToNull
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

tagMapSym

protected static final Symbol tagMapSym
The symbol associated with tag map metadata.


nounSuffixMarkers

protected static final String[] nounSuffixMarkers
An array of noun markers in Arabic Treebank part-of-speech tags.


detPrefixMarkers

protected static final String[] detPrefixMarkers
An array of determiner markers in Arabic Treebank part-of-speech tags.


personMarkers

protected static final String[] personMarkers
An array of person/number markers (indicating information such as “first person singular”) in Arabic Treebank part-of-speech tags.


numberMarkers

protected static final String[] numberMarkers
An array of number markers in Arabic Treebank part-of-speech tags (Arabic has forms for singular, plural and dual).


genderMarkers

protected static final String[] genderMarkers
An array of gender markers in Arabic Treebank part-of-speech tags.


caseMarkers

protected static final String[] caseMarkers
An array of case markers in Arabic Treebank part-of-speech tags.


definiteMarkers

protected static final String[] definiteMarkers
An array of definite/indefinite markers in Arabic Treebank part-of-speech tags.


pronounMarkers

protected static final String[] pronounMarkers
An array of pronoun markers in Arabic Treebank part-of-speech tags.


moodMarkers

protected static final String[] moodMarkers
An array of verb mood markers in Arabic Treebank part-of-speech tags.


markers

protected static final String[][] markers
An array of the various markers arrays.

See Also:
nounSuffixMarkers, detPrefixMarkers, personMarkers, numberMarkers, genderMarkers, caseMarkers, definiteMarkers, pronounMarkers, moodMarkers

remove

protected static final boolean[] remove
Indicates which of the various types of markers should be removed from Arabic Treebank part-of-speech tags during preprocessing (currently unused). This array must be coordinated with markers.


regularizeVerbs

protected static final boolean regularizeVerbs
If regularizeVerbs is true, part-of-speech tags that contain any of the patterns in the verbPatterns array are transformed into the pattern itself. For example, the tag IV2D+VERB_IMPERFECT+IVSUFF_SUBJ:D_MOOD:SJ would be transformed into, simply, VERB_IMPERFECT.

See Also:
Constant Field Values

verbPatterns

protected static final String[] verbPatterns
The match patterns used when regularizeVerbs is true.
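
As a rough illustration of the regularization described for regularizeVerbs, the sketch below collapses any tag that contains one of a set of patterns into the pattern itself. The class name, the method name and the contents of the pattern array are assumptions made for illustration; only VERB_IMPERFECT is taken from the example above, and the actual contents of verbPatterns are not listed on this page.

```java
final class VerbRegularizationSketch {
    // Hypothetical pattern list; only VERB_IMPERFECT appears in the documentation.
    static final String[] PATTERNS = {"VERB_IMPERFECT", "VERB_PERFECT"};

    // Returns the first pattern contained in the tag, or the tag unchanged
    // when no pattern matches.
    static String regularize(String tag, String[] patterns) {
        for (String pattern : patterns) {
            if (tag.contains(pattern)) {
                return pattern;
            }
        }
        return tag;
    }

    public static void main(String[] args) {
        String tag = "IV2D+VERB_IMPERFECT+IVSUFF_SUBJ:D_MOOD:SJ";
        System.out.println(regularize(tag, PATTERNS)); // prints VERB_IMPERFECT
    }
}
```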

Constructor Detail

Training

public Training()
         throws FileNotFoundException,
                IOException
The default constructor, to be invoked by Language. This constructor looks for a resource named by the property metadataPropertyPrefix + language where metadataPropertyPrefix is the value of the constant AbstractTraining.metadataPropertyPrefix and language is the value of Settings.get(Settings.language). For example, the property for English is "parser.training.metadata.english".

Throws:
FileNotFoundException
IOException
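
The property lookup described above amounts to a string concatenation. The following minimal sketch shows how the property name could be formed; the prefix value is inferred from the English example and is an assumption, as are the class name, the method name and the language string "arabic".

```java
final class MetadataPropertySketch {
    // Assumed value, inferred from the example "parser.training.metadata.english";
    // the real constant is AbstractTraining.metadataPropertyPrefix.
    static final String METADATA_PROPERTY_PREFIX = "parser.training.metadata.";

    // Builds the name of the property that names the metadata resource.
    static String metadataProperty(String language) {
        return METADATA_PROPERTY_PREFIX + language;
    }

    public static void main(String[] args) {
        System.out.println(metadataProperty("english")); // prints parser.training.metadata.english
        System.out.println(metadataProperty("arabic"));  // "arabic" is an assumed language setting
    }
}
```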
Method Detail

readMetadataHook

protected void readMetadataHook(Symbol dataType,
                                int metadataLen,
                                SexpList metadata)
Reads the tag map metadata if the specified data type is equal to tagMapSym.

Overrides:
readMetadataHook in class AbstractTraining
Parameters:
dataType - the data type of the specified metadata resource; if the specified symbol is equal to tagMapSym then this method will read and store the associated tag map metadata
metadataLen - the length of the metadata list
metadata - the metadata resource

startSym

public Symbol startSym()
Returns the symbol to indicate hidden nonterminals that precede the first in a sequence of modifier nonterminals. This method overrides the default implementation so as to return a symbol that does not contain a plus sign (+), which is a nonterminal augmentation delimiter in the Arabic Treebank.

Specified by:
startSym in interface Training
Overrides:
startSym in class AbstractTraining
Returns:
the symbol to indicate hidden nonterminals that precede the first in a sequence of modifier nonterminals.
See Also:
Trainer

startWord

public Word startWord()
Returns the Word object that represents the hidden "head word" of the start symbol. This method overrides the default implementation so as to return a Word containing symbols that do not contain a plus sign (+), which is a nonterminal augmentation delimiter in the Arabic Treebank.

Specified by:
startWord in interface Training
Overrides:
startWord in class AbstractTraining
See Also:
startSym, Trainer

stopSym

public Symbol stopSym()
Returns the symbol to indicate a hidden nonterminal that follows the last in a sequence of modifier nonterminals. This method overrides the default implementation so as to return a symbol that does not contain a plus sign (+), which is a nonterminal augmentation delimiter in the Arabic Treebank.

This symbol may also be used as a special value that is guaranteed not to conflict with any nonterminal in a given language's treebank.

Specified by:
stopSym in interface Training
Overrides:
stopSym in class AbstractTraining
See Also:
Trainer

stopWord

public Word stopWord()
Returns the Word object that represents the hidden "head word" of the stop symbol. This method overrides the default implementation so as to return a Word containing symbols that do not contain a plus sign (+), which is a nonterminal augmentation delimiter in the Arabic Treebank.

Specified by:
stopWord in interface Training
Overrides:
stopWord in class AbstractTraining
See Also:
stopSym, Trainer

topSym

public Symbol topSym()
Returns the symbol to indicate the hidden root of all parse trees. This method overrides the default implementation so as to return a symbol that does not contain a plus sign (+), which is a nonterminal augmentation delimiter in the Arabic Treebank.

Specified by:
topSym in interface Training
Overrides:
topSym in class AbstractTraining
See Also:
Trainer

topWord

public Word topWord()
Returns the Word object that represents the hidden "head word" of the hidden root of all parse trees. This method overrides the default implementation so as to return a Word containing symbols that do not contain a plus sign (+), which is a nonterminal augmentation delimiter in the Arabic Treebank.

Specified by:
topWord in interface Training
Overrides:
topWord in class AbstractTraining

preProcess

public Sexp preProcess(Sexp tree)
The method to call before counting events in a training parse tree. This overridden implementation executes the following methods of this class in order:
  1. transformTags(Sexp)
  2. AbstractTraining.prune(Sexp)
  3. AbstractTraining.addBaseNPs(Sexp)
  4. AbstractTraining.removeNullElements(Sexp)
  5. AbstractTraining.raisePunctuation(Sexp)
  6. AbstractTraining.identifyArguments(Sexp)
  7. AbstractTraining.stripAugmentations(Sexp)
While every attempt has been made to make the implementations of these preprocessing methods independent of one another, the order above is not entirely arbitrary.

Specified by:
preProcess in interface Training
Overrides:
preProcess in class AbstractTraining
Parameters:
tree - the parse tree to pre-process
Returns:
tree having been pre-processed
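
The fixed ordering of the preprocessing steps can be pictured as a list of transformations applied one after another, each receiving the output of the previous one. The sketch below is not the class's implementation: it stands in a String for the Sexp tree and uses placeholder steps that only record their names, to show the ordering idea. All names other than the seven method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

final class PreProcessOrderSketch {
    // The step names in the order given by the documentation.
    static final List<String> STEP_NAMES = Arrays.asList(
        "transformTags", "prune", "addBaseNPs", "removeNullElements",
        "raisePunctuation", "identifyArguments", "stripAugmentations");

    // Applies each step to the result of the previous one.
    static <T> T applyInOrder(T tree, List<UnaryOperator<T>> steps) {
        T result = tree;
        for (UnaryOperator<T> step : steps) {
            result = step.apply(result);
        }
        return result;
    }

    // Runs placeholder steps that record their own names and returns the log.
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        List<UnaryOperator<String>> steps = new ArrayList<>();
        for (String name : STEP_NAMES) {
            steps.add(tree -> { log.add(name); return tree; });
        }
        applyInOrder("(S (NP) (VP))", steps);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(trace());
    }
}
```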

createArgNonterminalsSet

protected void createArgNonterminalsSet()
An overridden version of AbstractTraining.createArgNonterminalsSet() that adds argument nonterminal patterns, such as *-SBJ, to the set of argument nonterminals.

Overrides:
createArgNonterminalsSet in class AbstractTraining
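
To make the idea of an argument nonterminal pattern such as *-SBJ concrete, the sketch below treats the leading asterisk as a wildcard over the base label and checks whether the label carries the named augmentation. This matching rule, the hyphen delimiter and all names in the code are assumptions for illustration; the page does not specify how the parser interprets such patterns.

```java
final class ArgPatternSketch {
    // Assumed semantics: "*-AUG" matches any label with augmentation AUG;
    // a pattern without the wildcard must equal the label exactly.
    static boolean matches(String pattern, String label) {
        if (!pattern.startsWith("*-")) {
            return pattern.equals(label);
        }
        String augmentation = pattern.substring(2);
        String[] parts = label.split("-");
        // parts[0] is the base label; the remaining parts are augmentations.
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].equals(augmentation)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("*-SBJ", "NP-SBJ")); // prints true
        System.out.println(matches("*-SBJ", "NP-OBJ")); // prints false
    }
}
```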

preProcessTest

public SexpList preProcessTest(SexpList sentence,
                               SexpList originalWords,
                               SexpList tags)
Preprocesses the specified test sentence and its coordinated list of part-of-speech tags, leaving the original sentence untouched but providing a modified version of the coordinated list of tags, where each tag has been mapped using the value of the original word and the original tag using TagMap.transformTag(Word).

Specified by:
preProcessTest in interface Training
Overrides:
preProcessTest in class AbstractTraining
Parameters:
sentence - the list of words, where a known word is a symbol and an unknown word is represented by a 3-element list (see DecoderServerRemote.convertUnknownWords(danbikel.lisp.SexpList))
originalWords - the list of unprocessed words (all symbols)
tags - the list of tag lists, where the list at index i is the list of possible parts of speech for the word at that index
Returns:
a two-element list, containing two lists, the first of which is (in this case) an unprocessed version of sentence and the second of which is a processed version of tags; if tags is null, then the returned list will contain only one element (since SexpList objects are not designed to handle null elements)
See Also:
TagMap.transformTag(Word)
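
The shape of the return value can be sketched with plain Java collections standing in for SexpList. In this simplified model the sentence is returned untouched, each candidate tag is mapped using its original word, and the result has only one element when tags is null. The mapper firstSegment is a hypothetical stand-in for TagMap.transformTag(Word), whose actual mapping is not described on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

final class PreProcessTestShapeSketch {
    // Hypothetical tag mapper: keeps the text before the first '+' and ignores the word.
    static String firstSegment(String word, String tag) {
        return tag.split("\\+")[0];
    }

    // Returns [sentence, mappedTags], or just [sentence] when tags is null.
    static List<Object> preProcessTest(List<String> sentence,
                                       List<String> originalWords,
                                       List<List<String>> tags,
                                       BinaryOperator<String> tagMapper) {
        List<Object> result = new ArrayList<>();
        result.add(sentence); // the sentence itself is not modified
        if (tags == null) {
            return result;
        }
        List<List<String>> mappedTags = new ArrayList<>();
        for (int i = 0; i < tags.size(); i++) {
            List<String> mapped = new ArrayList<>();
            for (String tag : tags.get(i)) {
                mapped.add(tagMapper.apply(originalWords.get(i), tag));
            }
            mappedTags.add(mapped);
        }
        result.add(mappedTags);
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("w1", "w2");
        List<List<String>> tags = Arrays.asList(
            Arrays.asList("NOUN+CASE_DEF_NOM"), Arrays.asList("PREP"));
        System.out.println(preProcessTest(words, words, tags,
            PreProcessTestShapeSketch::firstSegment));
        System.out.println(preProcessTest(words, words, null,
            PreProcessTestShapeSketch::firstSegment).size());
    }
}
```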

isValidTree

public boolean isValidTree(Sexp tree)
If the specified tree has a root label with a print name equal to "X", then this method returns false; otherwise, this method returns the value of the default implementation in the superclass with the specified tree (super.isValidTree(tree)).

Specified by:
isValidTree in interface Training
Overrides:
isValidTree in class AbstractTraining
Parameters:
tree - the tree to test for validity
Returns:
false if the specified tree's root label is equal to Symbol.add("X"), or super.isValidTree(tree) otherwise
See Also:
AbstractTraining.isAllNodesToPrune(Sexp), Treebank.isPreterminal(Sexp)
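
The override pattern described here, rejecting one special case and otherwise deferring to the superclass, might look roughly like the sketch below. The two nested classes and the use of a bare root label in place of a Sexp tree are simplifications; the body of the base-class check is a placeholder, not the real default implementation.

```java
final class ValidTreeSketch {
    static class BaseTraining {
        // Placeholder for the superclass check; the real default is not reproduced here.
        boolean isValidTree(String rootLabel) {
            return !rootLabel.isEmpty();
        }
    }

    static class ArabicTraining extends BaseTraining {
        @Override
        boolean isValidTree(String rootLabel) {
            // Trees rooted at "X" are rejected outright.
            if ("X".equals(rootLabel)) {
                return false;
            }
            // Everything else is left to the superclass.
            return super.isValidTree(rootLabel);
        }
    }

    public static void main(String[] args) {
        ArabicTraining training = new ArabicTraining();
        System.out.println(training.isValidTree("X")); // prints false
        System.out.println(training.isValidTree("S")); // prints true
    }
}
```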

contains

protected int contains(StringBuffer searchBuf,
                       String[] searchPatterns,
                       IntCounter patternIdx)
Helper method used by TagMap.transformTag(Word).


transformTagOld

protected Symbol transformTagOld(Word word)
Deprecated. This method is the old mechanism by which to transform the part-of-speech tag associated with an Arabic word; it has been superseded by the method TagMap.transformTag(Word).

Parameters:
word - the word whose part-of-speech tag is to be transformed
Returns:
a transformed version of the part-of-speech tag contained in the specified Word object

See Also:
TagMap.transformTag(Word)


transformTags

protected Sexp transformTags(Sexp tree)
Does an in-place transformation of the part-of-speech tags in the specified tree.

Parameters:
tree - the tree whose part-of-speech tags are to be mapped
Returns:
the specified tree having been modified to contain transformed part-of-speech tags

hasPossessiveChild

protected boolean hasPossessiveChild(Sexp tree)
We override this method so that it always returns false, so that the default implementation of addBaseNPs(Sexp) never considers an NP to be a possessive NP. Thus, the behavior of addBaseNPs is much simpler: all and only NPs that do not dominate other NPs will be relabeled NPB.

Overrides:
hasPossessiveChild in class AbstractTraining
Parameters:
tree - the tree to be tested
Returns:
false, regardless of the value of the specified tree
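
The simplified relabeling rule, in which every NP that does not dominate another NP becomes NPB, can be sketched on a toy tree as follows. The Node class and both helper methods are hypothetical, and the sketch deliberately ignores anything else addBaseNPs may do (for instance, inserting extra NP levels).

```java
import java.util.Arrays;
import java.util.List;

final class BaseNPSketch {
    static final class Node {
        String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        @Override
        public String toString() {
            if (children.isEmpty()) {
                return label;
            }
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node child : children) {
                sb.append(' ').append(child);
            }
            return sb.append(')').toString();
        }
    }

    // True if any descendant of the node is labeled NP.
    static boolean dominatesNP(Node node) {
        for (Node child : node.children) {
            if (child.label.equals("NP") || dominatesNP(child)) {
                return true;
            }
        }
        return false;
    }

    // Relabels, in place, every NP that dominates no other NP as NPB.
    static void relabelBaseNPs(Node node) {
        // Decide before recursing, so relabeled children do not affect the test.
        boolean isBaseNP = node.label.equals("NP") && !dominatesNP(node);
        for (Node child : node.children) {
            relabelBaseNPs(child);
        }
        if (isBaseNP) {
            node.label = "NPB";
        }
    }

    public static void main(String[] args) {
        Node tree = new Node("NP",
            new Node("NP", new Node("NOUN")),
            new Node("PP", new Node("PREP"), new Node("NP", new Node("NOUN"))));
        relabelBaseNPs(tree);
        System.out.println(tree); // prints (NP (NPB NOUN) (PP PREP (NPB NOUN)))
    }
}
```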

canonicalizeNonterminals

protected void canonicalizeNonterminals(Sexp tree)
For Arabic, we do not want to transform preterminals (parts of speech) to their canonical forms, so this method is overridden.

Overrides:
canonicalizeNonterminals in class AbstractTraining
Parameters:
tree - the tree for which nonterminals, but not parts of speech, are to be transformed into their canonical forms
See Also:
Treebank.getCanonical(Symbol)

main

public static void main(String[] args)
Test driver for this class.


Author: Dan Bikel.