This interdisciplinary project has two main objectives. From the point
of view of Probability Theory, we want to develop the tools needed to
identify patterns in trajectories of stochastic processes. From the
point of view of Linguistics, we want to use these tools to identify the
characteristic rhythmic patterns that distinguish Brazilian Portuguese
from Modern European Portuguese, henceforth BP and EP, respectively.
Within Probability Theory, this research belongs to the study of
stochastic processes with long-range memory, the so-called Chains with
Complete Connections. Onicescu and Mihoc introduced these chains in the
1930s in a series of articles. Recently this subject has advanced
considerably, with contributions from members of the present research
group (cf. Bressaud, Galves and Fernández 1999a and 1999b, Ferrari and
Galves 2000, Ferrari, Maass, Martinez and Ney 2000). These works present
the state of the art and contain detailed reference lists on the
subject.
The mathematical tools we are currently developing, namely Markovian
approximations and renewal schemes, have explicit applied goals: perfect
simulation and estimation of the entropy of the process. The objective
now is to consolidate the known results by developing the statistical
techniques needed to model the rhythmic patterns of BP and EP.
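As a rough illustration of how a Markovian approximation can feed an
entropy estimate, the sketch below computes the plug-in conditional
entropy of order k from a symbol sequence. It is not the estimator
developed in the project; the function name conditional_entropy and the
toy sequence are assumptions made only for this example.

```python
import math
from collections import Counter


def conditional_entropy(seq, k):
    """Plug-in estimate, in bits, of H(X_n | X_{n-k}, ..., X_{n-1}).

    The sequence is treated as if it came from a Markov chain of order k
    (an assumption of this sketch); k = 0 gives the marginal entropy.
    """
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(k, len(seq)):
        context = tuple(seq[i - k:i])
        context_counts[context] += 1
        joint_counts[(context, seq[i])] += 1
    total = len(seq) - k
    h = 0.0
    for (context, _symbol), n in joint_counts.items():
        # empirical joint probability times log of the empirical
        # conditional probability of the symbol given its context
        h -= (n / total) * math.log2(n / context_counts[context])
    return h


# Toy data (hypothetical): a strictly alternating sequence.
toy = [0, 1] * 500
h0 = conditional_entropy(toy, 0)  # ignores the past
h1 = conditional_entropy(toy, 1)  # one step of memory
```

On this toy sequence one step of memory removes all uncertainty, so the
order-1 estimate drops to zero while the order-0 estimate stays at one
bit. For real data the choice of k trades bias against the number of
observations available per context.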
The modeling of rhythmic patterns in natural languages is at the leading
edge of research in Linguistics. Even the hypothesis that rhythmic
classes divide the natural languages into large groups, although
supported by psycholinguistic evidence, was not supported by the data
until recently. One of the first pieces of acoustic evidence was given
in 1998 by Ramus, Nespor and Mehler (1999), the second author being one
of the external members of the group. This article showed that the
empirical measures of the relative time spent in vowels and the variance
of the length of the consonantal groups split a pilot set of languages
into three large groups.
The exact location of BP and EP in this division is an open question
that has been tackled by members of the group, and this research
includes the construction of a corpus of tagged samples of BP and EP to
serve as a basis for our statistical study. A preliminary analysis can
be found in Dorea, Galves, Kira, E. and A. Pereira Alencar (1997).
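The two empirical measures mentioned above can be sketched as follows,
assuming an utterance has already been segmented into vocalic and
consonantal intervals with known durations. The labels "V" and "C", the
function name rhythm_measures and the durations are hypothetical, and
the use of the population variance is a choice made for this example.

```python
from statistics import pvariance


def rhythm_measures(intervals):
    """Return (percent_vocalic, variance_consonantal).

    intervals: list of (label, duration_in_seconds) pairs, with label
    "V" for a vocalic interval and "C" for a consonantal one. Adjacent
    segments of the same type are assumed to be merged already.
    """
    vocalic = [d for label, d in intervals if label == "V"]
    consonantal = [d for label, d in intervals if label == "C"]
    total = sum(vocalic) + sum(consonantal)
    percent_vocalic = 100.0 * sum(vocalic) / total
    variance_consonantal = pvariance(consonantal)
    return percent_vocalic, variance_consonantal


# Hypothetical segmentation of a short utterance (durations in seconds).
utterance = [("C", 0.1), ("V", 0.2), ("C", 0.3), ("V", 0.4)]
percent_v, var_c = rhythm_measures(utterance)
```

The text speaks of the variance of the consonantal groups; taking its
square root gives the corresponding standard deviation if that scale is
preferred for comparing languages.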
A most important technical question arises at this point: the automatic
identification of vowels and consonants, and then the automatic
identification of stressed syllables in BP and EP. This question, of
evident scientific and technological significance, immediately presents
a difficulty that makes it unfeasible to study by frequency-domain
methods, the usual Time Series technique. In fact, the acoustic signal
produced by a speaker of a natural language is non-stationary in time
(in the sense of stationarity usual in Time Series), that is, its
spectrogram changes in time, more specifically from one vowel to the
next. The automatic extraction of the necessary information from the
acoustic signal will require more sophisticated techniques (wavelets,
hidden Markov chains, Bayesian models of pattern recognition, cf.
Ferrari, Frigessi and Gonzaga de Sá 1995).
An "a priori" distribution should encode the basic characteristics of a
rhythmic class, guiding the setting of the parameters of Universal
Grammar during a child's acquisition of a mother tongue. In several
articles (Collet, Galves and Lopes 1995, Cassandro, Collet, Galves and
Galves 1999, Fernández and Galves 1999), members of this research group
suggested that Gibbs states could be used as probability measures
governing the choice of sentences that simultaneously satisfy the
syntactic and prosodic restrictions, in particular the rhythmic patterns
of the speaker's language. Optimality
Theory, introduced into Linguistics by Prince and Smolensky, proposes a
model clearly inspired by Statistical Mechanics that is well suited to
this proposal. Sândalo, Abaurre and Galves (1999) propose a set of
restrictions and weights to describe the "energy" functions associated
with BP and EP. Computing the solutions of "minimal energy" involves a
computationally intensive combinatorial task. To address this, Arnaldo
Mandel developed the software Sotaq, refining a prototype built by
Pierre Collet and Antonio Galves. More details about Sotaq, the
optimality model we are considering and Optimality Theory in general can
be found at
http://www.ime.usp.br/~tycho/prosody/.
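To make the idea of an "energy" built from weighted restrictions more
concrete, here is a minimal sketch that enumerates candidate stress
patterns, scores them and returns those of minimal energy, together with
the corresponding Gibbs weights. It is not Sotaq and does not use the
restrictions of Sândalo, Abaurre and Galves (1999): the two restrictions
(no adjacent stresses, no adjacent unstressed syllables), their weights
and all function names are hypothetical.

```python
import math
from itertools import product


def clashes(pattern):
    """Number of adjacent pairs of stressed syllables (1 = stressed)."""
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == 1 and b == 1)


def lapses(pattern):
    """Number of adjacent pairs of unstressed syllables (0 = unstressed)."""
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == 0 and b == 0)


# Hypothetical (weight, restriction) pairs defining the energy.
RESTRICTIONS = [(2.0, clashes), (1.0, lapses)]


def energy(pattern):
    """Weighted number of violated restrictions."""
    return sum(weight * count(pattern) for weight, count in RESTRICTIONS)


def minimal_energy_patterns(n):
    """Brute-force search over all 2**n stress patterns of n syllables."""
    candidates = list(product((0, 1), repeat=n))
    e_min = min(energy(p) for p in candidates)
    return [p for p in candidates if energy(p) == e_min]


def gibbs_weights(n, beta):
    """Probability proportional to exp(-beta * energy) for each pattern."""
    candidates = list(product((0, 1), repeat=n))
    weights = [math.exp(-beta * energy(p)) for p in candidates]
    z = sum(weights)  # normalizing constant
    return {p: w / z for p, w in zip(candidates, weights)}


best = minimal_energy_patterns(4)
probabilities = gibbs_weights(4, beta=5.0)
```

The exhaustive search grows as 2**n, which gives a feel for why the
combinatorial task becomes computationally intensive for realistic
sentences. With these toy restrictions the minimizers for four
syllables are the two alternating patterns, and a large beta
concentrates the Gibbs weights on them.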
The optimality model raises several mathematical, computational,
statistical and linguistic questions. First, which conditions should the
restrictions defining the "energy" function satisfy to ensure
uniqueness, or at least low cardinality, of the solution set, the
so-called "ground states" of Statistical Mechanics? Second, how can we
ensure that such a system is periodic in some sense (cf. Van Enter,
A.C.D. and J. Miękisz (1992)), as suggested recently by Antonio Galves
and Roberto Fernández, in such a way that this periodicity recovers the
intuitive notion of rhythm?
From the statistical point of view, the major problem is fitting the
model to the data. Recently, Marzio Cassandro, Antonio Galves, Charlotte
Galves and Renato Assunção suggested the use of a "minimum dispersion"
criterion to discriminate among the candidate parameter values. This
criterion is compatible with the "minimum entropy" criterion suggested
by Collet, Galves and Lopes (1995). The objective now is, on the one
hand, to develop the Statistical Theory needed to implement this
proposal and, on the other hand, to build a corpus of phonetic-acoustic
data in order to test the suitability of these models.
From the linguistic point of view, the goal is to understand the
fine-tuning between a minimal "energy" description and the description
given by Ramus, Nespor and Mehler (1999), cited above. A question at the
interface between Linguistics and Statistical Mechanics is to find a
minimal set of restrictions defining the energy function.
This is a multidisciplinary project involving researchers from the areas
of Statistics and Probability Theory (Renato Assunção, Francisco
Cribari, Pablo Ferrari (project vice-coordinator), Luis Renato Fontes,
Antonio Galves (project coordinator), Cláudio Landim, Nancy Lopes Garcia
and André Toom), Linguistics (Charlotte Galves) and Computer Science
(Arnaldo Mandel). The group is completed by several external members,
researchers working in Linguistics (Anthony Kroch, Marina Nespor and
Jean-Roger Vergnaud) and in Probability, Statistical Physics and
Dynamical Systems (Xavier Bressaud, Marzio Cassandro, Pierre Collet and
Roberto Fernández). Francisco Cribari has just joined the group, making
it complete. All other members have been working together on several
issues directly related to the present proposal, within Fapesp's
Thematic Project "Rhythmic patterns, parameter setting and language
change" (http://www.ime.usp.br/~tycho) and the FINEP-Pronex Project
"Critical phenomena in probability and stochastic process"
(http://www.ime.usp.br/~gprob).
Bibliography