sotaq - segmentation and stressing
What is sotaq?
sotaq is a program that reads a collection of phrases and prints for
each a decomposition into rhythmic segments, with secondary stresses, following
a model based on Optimality Theory.
At least, that is what it is supposed to do. Actually, sotaq is very
experimental, and may be based on a lot of misunderstanding, in particular
misunderstanding of OT. So, to make its description more precise, we will define
a few terms as they are used in this document, which may be at variance
with normal use (this bug should be corrected in the future):
- syllable
- a sequence of letters.
- phrase
- a sequence of syllables.
- segment
- a sequence of successive syllables in a phrase, with exactly one of them
singled out as stressed. A lexical stress is a stress fixed
on input, which must be respected throughout.
- segment decomposition
- a sequence of disjoint segments covering all the syllables of a phrase.
To each segment decomposition an integer cost is assigned, and sotaq
outputs the decompositions of minimum cost. The cost is the sum of the
individual costs assigned to its segments, plus the sum of costs assigned to
pairs of successive segments.
Each individual cost is a sum over criteria, each consisting of a value
and a weight and contributing their product. The value is computed on each segment or pair of segments,
and may take into account properties like length, position of the stress, its
relation to lexical components of the phrase, and so on. The weight is just a
number assigned to a criterion, and can be used to establish a hierarchy of
preference among criteria.
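As a rough illustration of this cost model, here is a Python sketch. It is not taken from the sotaq source: the names `Segment` and `decomposition_cost` and the two sample criteria are hypothetical, chosen only to show how weighted values are summed over segments and over pairs of successive segments.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: int   # index in the phrase of the segment's first syllable
    length: int  # number of syllables in the segment
    stress: int  # index in the phrase of the stressed syllable


def decomposition_cost(segments, seg_criteria, pair_criteria):
    """Sum weight * value over all segments and all successive pairs.

    seg_criteria:  list of (value_function(segment), weight)
    pair_criteria: list of (value_function(left, right), weight)
    """
    cost = 0
    for seg in segments:
        cost += sum(weight * value(seg) for value, weight in seg_criteria)
    for left, right in zip(segments, segments[1:]):
        cost += sum(weight * value(left, right) for value, weight in pair_criteria)
    return cost


# Hypothetical criteria, for illustration only.
def too_long(seg):
    return 1 if seg.length > 2 else 0


def clash(left, right):
    return 1 if right.stress - left.stress == 1 else 0


segments = [Segment(0, 2, 0), Segment(2, 3, 2)]
total = decomposition_cost(segments, [(too_long, 10)], [(clash, 1)])
print(total)  # 10: only the second segment is longer than 2
```

Because every term involves either one segment or two successive ones, the total can be accumulated in a single left-to-right pass.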
Relation to optimality theory
An OT-based model would have a hierarchy of conditions and count violations of
these, so that any number of violations of lower-ranked conditions is preferred over a
single violation of a higher-ranked condition. Let us see an example:
Suppose we have conditions ranked in three tiers:
SegMax,SegMin >> AlignI/L >> AlignW/L
where the symbol >> points from high rank to low. To make sotaq
rank segment decompositions accordingly, one needs:
- A criterion for each condition, supported internally in the program. The
value of a segment according to each criterion is 1 if the segment violates
the condition, 0 otherwise.
- Weights chosen to reflect the hierarchy. Typically one would get
the desired results with weights 100, 10, 1. To be mathematically safe,
each weight should be at least n+1 times the next
one, where n is the number of syllables in the phrase. The weights
are chosen at the time of calling the program.
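The weight rule can be sketched in Python as follows. The helper `ranked_weights` is hypothetical and not part of sotaq. It assumes that the violations counted at any one rank add up to at most n for a phrase of n syllables.

```python
def ranked_weights(num_ranks, n_syllables):
    """Weights from highest rank to lowest, each (n+1) times the next."""
    ratio = n_syllables + 1
    return [ratio ** k for k in range(num_ranks - 1, -1, -1)]


n = 9
weights = ranked_weights(3, n)
print(weights)  # [100, 10, 1]

# One violation at the top rank costs more than n violations at every
# lower rank combined, so the hierarchy is never overturned.
worst_lower = n * sum(weights[1:])
print(weights[0] > worst_lower)  # True (100 > 99)
```

For phrases of up to 9 syllables this reproduces the weights 100, 10, 1 mentioned above; longer phrases need a larger ratio to stay on the safe side.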
Installing sotaq
First of all, in order to run sotaq you need perl, version 5.x (any x).
In order to check if perl is installed in your computer, go to a command line
and type the command
perl -v
The ensuing messages will be quite clear. If your computer hasn't got perl
installed, there are two options: change computers, or install perl. The first
option may be a real one in a lab; the second is feasible whatever computer you
are using, but you'll probably need some help.
Now get sotaq, by saving this link. Your browser
has an option for this, probably shift+some button.
This is it. You are ready to run sotaq. However, if your operating system is
Unix or Linux, there is an optional step that will pay off in later convenience.
First, execute
which perl
The computer will print something like /usr/local/bin/perl or /usr/bin/perl.
Now, get sotaq into a text editor, and look at the first line:
#!/usr/local/bin/perl
Edit it so that what comes after the magic characters #! is exactly
what which printed. For instance, on a Linux system you would probably have to
remove the /local part. Now, save the file, and execute the following
command:
chmod 755 sotaq
and that's it. You are ready to use sotaq.
Using sotaq
sotaq is a filter, that is, it reads from standard input, writes to
standard output. That means that you call it like this, from a shell or DOS
window:
perl sotaq [options] < infile > outfile
or, in case you followed the optional installation step:
sotaq [options] < infile > outfile
here,
- outfile is the name of the file where the results will be
saved. If the > outfile part is left out altogether, the
results will be printed on your screen (I like to work in an emacs shell
buffer for this).
- infile is the name of the input file, containing a collection
of annotated phrases to be processed. The format of the input file is
explained below.
- options
There are many different options, and most of them have the form
--name=N
meaning "assign weight N to the criterion dubbed name".
For a full list of options, execute:
sotaq --help
A section below details the current options related to criteria. There is a
further option --debug=N, which only concerns those who want to
fiddle with the source code. The source explains it.
The output will consist of a listing of options, followed by, for each input
phrase:
- A representation of its encoding, preceded by the string "I: ".
It shows in a semi-graphical way the split into lexical items and lexical
stresses.
- All minimum-cost segment decompositions, each preceded by the string "O: ",
in a semi-graphical way that resembles the presentation of the original
data.
- The minimum cost and the number of solutions.
The input file
There are two types of input files:
- Annotated phrases
- Each phrase is presented as a collection of separated syllables, each
syllable preceded by a number coding some properties; each property code
is a number, and joint properties are coded by adding the corresponding
numbers. The current recognized properties are:
- Has lexical stress -- code 1.
- Starts a lexical item -- code 2.
Codes and syllables are separated by one or more spaces. A
phrase can be continued over several successive input lines, provided
each line except the last ends with a backslash \ (with no
spaces after the \ ).
A sample input file can be copied from here.
- Bit encoding
- For an annotated phrase as above, keep just the codes; write each as a
two bit binary number, and string those numbers into a single
bit-string. This is maintained for compatibility with older programs,
and may be deprecated in the future. A sample input file can be copied
from here.
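To make the two formats concrete, here is a minimal Python sketch of a reader for them. It is not the parser used by sotaq: the function names and the sample phrase are made up for illustration, and writing each code in ordinary binary (so that code 2 becomes "10") is an assumption about the bit encoding.

```python
STRESS = 1      # syllable has lexical stress
WORD_START = 2  # syllable starts a lexical item


def join_continuations(lines):
    """Merge every line ending in a backslash with the line that follows."""
    phrases, pending = [], ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\\"):
            pending += line[:-1] + " "
        else:
            phrases.append(pending + line)
            pending = ""
    if pending:
        phrases.append(pending)
    return phrases


def parse_annotated(phrase):
    """Turn 'code syllable code syllable ...' into (code, syllable) pairs."""
    tokens = phrase.split()
    return [(int(code), syl) for code, syl in zip(tokens[0::2], tokens[1::2])]


def to_bits(pairs):
    """Keep only the codes, each written as a two-bit binary number."""
    return "".join(format(code, "02b") for code, _ in pairs)


# Made-up sample: one phrase split over two input lines.
lines = ["2 a 1 mi \\\n", "0 go\n"]
parsed = parse_annotated(join_continuations(lines)[0])
bits = to_bits(parsed)
print(parsed)  # [(2, 'a'), (1, 'mi'), (0, 'go')]
print(bits)    # 100100
```

A syllable that both starts a lexical item and carries lexical stress would get code 3 (`WORD_START + STRESS`), written as "11" under the same assumption.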
The options
The following options give weights to criteria which test the violation of a
condition on a segment. Thus, for each criterion, the value of the segment
is 1 if the condition is violated, and 0 otherwise. Some of the implemented criteria
come from linguistic modeling, others are just a programmer's fancy.
Option | Condition
--ini=N | stress at the first syllable of the segment
--max2=N | segment has length at most 2
--max4=N | segment has length at most 4
--min2=N | segment has length at least 2
--integ=N | segment is contained in a lexical item
--integ=N | segment is contained in the union of a lexical item with a preceding functional word (unstressed monosyllable)
--acmono=N | stress not on a monosyllable
--clash=N | adjacent segments without adjacent stressed syllables
--clashint=N | adjacent segments within a lexical item without adjacent stressed syllables
--lapse=N | adjacent segments with at most one syllable between stresses
--lapseint=N | adjacent segments within a lexical item with at most one syllable between stresses
Some of those criteria suggest a measure of failure, not simply its
occurrence. This leads to additional criteria, whose values
are not just 0 or 1:
Option | Value
--inidist=N | distance of stress from beginning of segment, discounting the first syllable if it is a word
--bindist=N | distance of segment length from 2
--integc=N | one less than number of words touched by segment
--integm=N | adjacent segments with adjacent stressed syllables
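The difference between 0-1 criteria and numerical ones can be sketched as follows. These Python functions are one reading of some of the table entries, not the code in sotaq. A segment is reduced here to its length and the offset of its stress from the segment start, and the discount for a leading word in --inidist is left out, hence the name `inidist_simplified`.

```python
def ini(length, stress_offset):
    """--ini: 1 if the stress is not on the first syllable of the segment."""
    return 0 if stress_offset == 0 else 1


def max2(length, stress_offset):
    """--max2: 1 if the segment is longer than 2 syllables."""
    return 0 if length <= 2 else 1


def min2(length, stress_offset):
    """--min2: 1 if the segment is shorter than 2 syllables."""
    return 0 if length >= 2 else 1


def bindist(length, stress_offset):
    """--bindist: distance of the segment length from 2."""
    return abs(length - 2)


def inidist_simplified(length, stress_offset):
    """--inidist without the word discount: distance of stress from the start."""
    return stress_offset


# A 5-syllable segment stressed on its 4th syllable (offset 3).
seg = (5, 3)
print(max2(*seg), bindist(*seg))           # 1 3
print(ini(*seg), inidist_simplified(*seg))  # 1 3
```

The 0-1 criteria only record that the segment fails, while the numerical ones grow with how badly it fails.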
So, for instance, if you want to rank the criteria as ini >>
max2 >> integ, and the input is in the file example, you can get the
best segment decompositions by calling:
sotaq --ini=1000 --max2=10 --integ=1 < example
Further developments
The only syllable properties currently being considered are start of word and
lexical stress. Maybe other properties of linguistic significance can be considered,
so that other criteria can be applied in evaluating segments. For instance, it
may be relevant if a syllable starts (or ends) in a vowel, if it is followed by
a punctuation mark, if it can be muted, and so on.
Some initial tests involving phrases with annotated secondary stresses
suggest that, at least with the already implemented criteria, just a plain
ranking of boolean conditions cannot achieve the desired results. That has led
to the introduction of numerical criteria. Besides, one can play with similar
weights for different criteria. If that yields good results, it will be a tough
act for the theory to follow.
In introducing new criteria, it is important to understand the following
requirement of the model: the value of a segment should be computable using only
information about the segment and syllable properties - it cannot depend on
other segments. Similarly, the value of a pair of successive segments should be
computable without reference to any other segments.
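This locality requirement is what makes a dynamic programming search possible. The Python sketch below is an illustration only and does not claim to be the algorithm used in sotaq. The names are hypothetical, it returns only the minimum cost rather than all optimal decompositions, and it reads "respected" as meaning that a segment containing a lexically stressed syllable must be stressed there, so no segment may contain two lexical stresses.

```python
def allowed_stresses(lexical, start, end):
    """Stress positions in [start, end) compatible with lexical stresses."""
    fixed = [k for k in range(start, end) if lexical[k]]
    if len(fixed) > 1:
        return []  # two lexical stresses cannot share a segment
    return fixed or list(range(start, end))


def min_cost(lexical, seg_cost, pair_cost):
    """Minimum cost over all segment decompositions of a non-empty phrase.

    A segment is a tuple (start, end_exclusive, stress_index).
    best[seg] is the cheapest cost of covering syllables 0..end-1
    with a decomposition whose last segment is seg.
    """
    n = len(lexical)
    best = {}
    for end in range(1, n + 1):
        for start in range(end):
            for stress in allowed_stresses(lexical, start, end):
                seg = (start, end, stress)
                if start == 0:
                    best[seg] = seg_cost(seg)
                    continue
                # Only the immediately preceding segment matters.
                options = [best[p] + pair_cost(p, seg)
                           for p in best if p[1] == start]
                best[seg] = seg_cost(seg) + min(options)
    return min(cost for seg, cost in best.items() if seg[1] == n)


# Hypothetical weights: max2 and min2 cost 10, ini costs 1, a clash costs 100.
def seg_cost(seg):
    start, end, stress = seg
    length = end - start
    return 10 * (length > 2) + 10 * (length < 2) + 1 * (stress != start)


def pair_cost(left, right):
    return 100 * (right[2] - left[2] == 1)


lexical = [False, True, False, False]  # lexical stress on the second syllable
print(min_cost(lexical, seg_cost, pair_cost))  # 2
```

In this example the optimum splits the four syllables into two segments of length two, stressed on the second and fourth syllables. If a criterion needed to look at a third segment, the table `best` would no longer hold enough information.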
Arnaldo Mandel <am@ime.usp.br>
Last modified: Fri Aug 13 19:22:01 EST 1999