Probabilistic Tools for Pattern Identification Applied to Linguistics


This interdisciplinary project has two main objectives. From the point
of view of Probability Theory, we want to develop the tools needed to
identify patterns in trajectories of stochastic processes. From the
point of view of Linguistics, we want to use these tools to identify
the characteristic rhythmic patterns that distinguish Brazilian
Portuguese from Modern European Portuguese, henceforth BP and EP,
respectively.

Within Probability Theory, this research belongs to the study of
stochastic processes with long-range memory, the so-called Chains with
Complete Connections. Onicescu and Mihoc introduced these chains in a
series of articles in the 1930s. The subject has advanced considerably
in recent years, with contributions from members of the present
research group (cf. Bressaud, Galves and Fernández 1999a and 1999b,
Ferrari and Galves 2000, Ferrari, Maass, Martinez and Ney 2000). These
works present the state of the art and contain detailed reference
lists on the subject.

The mathematical tools we are currently developing, namely Markovian
approximations and renewal schemes, have explicit applied goals:
perfect simulation and estimation of the entropy of the process. The
objective now is to consolidate the known results by developing the
statistical techniques needed to model the rhythmic patterns of BP and
EP.
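
As an illustration of how a Markovian approximation can feed an entropy
estimate, the sketch below computes a plug-in estimate of the
conditional entropy of the next symbol given the previous k symbols. It
is a minimal sketch under stated assumptions, not the estimator
developed in the project: the function name markov_entropy_rate, the
coding of a stress contour as a 0/1 string and the choice of k are all
hypothetical.

```python
from collections import Counter
from math import log2


def markov_entropy_rate(seq, k):
    """Plug-in estimate, in bits per symbol, of H(X_k | X_0 .. X_{k-1}).

    seq is any sequence of hashable symbols; k is the order of the
    Markov approximation (an assumption chosen by the user).
    """
    if k < 0 or len(seq) <= k:
        raise ValueError("need len(seq) > k >= 0")
    ctx_counts = Counter()   # occurrences of each length-k context
    pair_counts = Counter()  # occurrences of (context, next symbol)
    for i in range(k, len(seq)):
        ctx = tuple(seq[i - k:i])
        ctx_counts[ctx] += 1
        pair_counts[(ctx, seq[i])] += 1
    n = len(seq) - k  # number of observed transitions
    h = 0.0
    for (ctx, _sym), c in pair_counts.items():
        # empirical joint probability times log of the empirical
        # conditional probability of the symbol given its context
        h -= (c / n) * log2(c / ctx_counts[ctx])
    return h


# Hypothetical stress contour: 1 = stressed syllable, 0 = unstressed.
contour = "01" * 50
order0 = markov_entropy_rate(contour, 0)
order1 = markov_entropy_rate(contour, 1)
```

For a strictly alternating contour the order-0 estimate sees two equally
frequent symbols, while the order-1 estimate sees a deterministic
sequence, so longer memory lowers the estimated entropy.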

The modeling of rhythmic patterns in natural languages is at the
leading edge of research in Linguistics. Even the hypothesis that
rhythmic classes divide the natural languages into large groups,
although supported by psycholinguistic evidence, was not supported by
the data until recently. Some of the first acoustic evidence was given
by Ramus, Nespor and Mehler (1999), the second author being one of the
external members of the group. That article showed that two empirical
measures, the proportion of time occupied by vowels and the variance of
the duration of consonantal groups, split a pilot set of languages into
three large groups. The exact position of BP and EP in this division is
an open question that members of the group have been working on. This
research includes the construction of a corpus of tagged samples of BP
and EP to serve as a basis for our statistical study. A preliminary
analysis can be found in Dorea, Galves, Kira and Pereira Alencar
(1997).
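
The two measures described above can be sketched as follows, assuming
an utterance has already been segmented into vocalic and consonantal
intervals with known durations. The function name rhythm_measures, the
"V"/"C" labels and the use of the population variance are assumptions
made for illustration; the text does not specify them.

```python
from statistics import pvariance


def rhythm_measures(intervals):
    """Return (percent_vocalic, consonantal_variance).

    intervals is a list of (label, duration_in_seconds) pairs, where
    label is "V" for a vocalic interval and "C" for a consonantal one.
    """
    vowel = [d for label, d in intervals if label == "V"]
    cons = [d for label, d in intervals if label == "C"]
    total = sum(vowel) + sum(cons)
    if total <= 0 or not cons:
        raise ValueError("need positive total duration and a consonantal interval")
    percent_v = 100.0 * sum(vowel) / total  # share of time spent in vowels
    var_c = pvariance(cons)                 # spread of consonantal durations
    return percent_v, var_c


# Hypothetical segmentation of a short utterance (durations in seconds).
sample = [("C", 0.1), ("V", 0.1), ("C", 0.3), ("V", 0.3)]
percent_v, var_c = rhythm_measures(sample)
```

Each tagged sentence of a corpus would then contribute one point in the
plane spanned by these two measures.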

A major technical question arises at this point: the automatic
identification of vowels and consonants, and then the automatic
identification of stressed syllables, in BP and EP. This question, of
evident scientific and technological significance, immediately presents
a difficulty that rules out frequency-domain methods, the usual Time
Series technique. The acoustic signal produced by a speaker of a
natural language is non-stationary in time (in the usual Time Series
sense of stationarity): its spectrogram changes over time, in
particular from one vowel to the next. The automatic extraction of the
necessary information from the acoustic signal will therefore require
more sophisticated techniques, such as wavelets, Bayesian models of
pattern recognition (cf. Ferrari, Frigessi and Gonzaga de Sá 1995) and
hidden Markov chains.

An "a priori" distribution should encode the basic characteristics of
a rhythmic class, guiding the setting of the parameters of Universal
Grammar while a child acquires a native language. In several articles
(Collet, Galves and Lopes 1995, Cassandro, Collet, Galves and Galves
1999, Fernández and Galves 1999), members of this research group
suggested that Gibbs states could be used as probability measures
governing the choice of sentences that simultaneously satisfy syntactic
and prosodic restrictions, in particular the rhythmic patterns of the
speaker's language. Optimality Theory, introduced into Linguistics by
Prince and Smolensky, proposes a model that is clearly inspired by
Statistical Mechanics and is well suited to this proposal. Sândalo,
Abaurre and Galves (1999) propose a set of restrictions and weights to
describe the "energy" functions associated with BP and EP. Computing
the solutions of "minimal energy" is a computationally intensive
combinatorial task. To address it, Arnaldo Mandel developed the
software Sotaq, refining a prototype built by Pierre Collet and Antonio
Galves. More details about Sotaq, the optimality model we are
considering and Optimality Theory in general can be found at
http://www.ime.usp.br/~tycho/prosody/.
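
To make the combinatorial nature of the "minimal energy" computation
concrete, the sketch below enumerates all stress patterns of a given
length and keeps those that minimize a weighted sum of restriction
violations. It is only a toy illustration: the restrictions clash and
lapse, their weights and the 0/1 coding of syllables are hypothetical,
and they are neither the restrictions proposed by Sândalo, Abaurre and
Galves (1999) nor the algorithm used by Sotaq.

```python
from itertools import product


def clash(pattern):
    """Count pairs of adjacent stressed syllables (hypothetical restriction)."""
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == 1 and b == 1)


def lapse(pattern):
    """Count pairs of adjacent unstressed syllables (hypothetical restriction)."""
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == 0 and b == 0)


def energy(pattern, weighted_restrictions):
    """Weighted sum of the violations of each restriction."""
    return sum(w * violations(pattern) for violations, w in weighted_restrictions)


def minimal_energy_patterns(n, weighted_restrictions):
    """Brute-force search over all 2**n stress patterns of n syllables."""
    best, minimizers = None, []
    for pattern in product((0, 1), repeat=n):
        e = energy(pattern, weighted_restrictions)
        if best is None or e < best:
            best, minimizers = e, [pattern]
        elif e == best:
            minimizers.append(pattern)
    return best, minimizers


# Hypothetical weights: a clash costs twice as much as a lapse.
weights = [(clash, 2.0), (lapse, 1.0)]
best, minimizers = minimal_energy_patterns(4, weights)
```

The search space doubles with every additional syllable, which is why
exhaustive enumeration quickly becomes impractical. With these toy
weights the minimizers for four syllables are the two alternating
patterns.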

The optimality model raises several mathematical, computational,
statistical and linguistic questions. First, which conditions should
the restrictions defining the "energy" function satisfy to ensure
uniqueness, or at least low cardinality, of the solution set, the
so-called "ground states" of Statistical Mechanics? Second, how can we
ensure that such a system is periodic in some sense (cf. Van Enter and
Miękisz 1992), as suggested recently by Antonio Galves and Roberto
Fernández, in such a way that this periodicity recovers the intuitive
notion of rhythm?


From the statistical point of view, the major problem is fitting the
model to the data. Recently, Marzio Cassandro, Antonio Galves,
Charlotte Galves and Renato Assunção suggested using a "minimum
dispersion" criterion to discriminate among parameter values. This
criterion is compatible with the "minimum entropy" criterion suggested
by Collet, Galves and Lopes (1995). The objective now is, on the one
hand, to develop the statistical theory needed to implement this
proposal and, on the other hand, to build a corpus of phonetic-acoustic
data in order to test the suitability of these models.

From the linguistic point of view, the goal is to understand how a
minimal "energy" description relates in detail to the description given
by Ramus, Nespor and Mehler (1999), cited above. A question at the
interface between Linguistics and Statistical Mechanics is to find a
minimal set of restrictions defining the energy function.

This is a multidisciplinary project involving researchers from
Statistics and Probability Theory (Renato Assunção, Francisco Cribari,
Pablo Ferrari (project vice-coordinator), Luis Renato Fontes, Antonio
Galves (project coordinator), Cláudio Landim, Nancy Lopes Garcia and
André Toom), Linguistics (Charlotte Galves) and Computer Science
(Arnaldo Mandel). The group also has several external members, working
in Linguistics (Anthony Kroch, Marina Nespor and Jean-Roger Vergnaud)
and in Probability, Statistical Physics and Dynamical Systems (Xavier
Bressaud, Marzio Cassandro, Pierre Collet and Roberto Fernández).
Francisco Cribari has just joined the group, completing it. All other
members have been working together on several issues directly related
to the present proposal, within FAPESP's Thematic Project "Rhythmic
patterns, parameter setting and language change"
(http://www.ime.usp.br/~tycho) and the FINEP-Pronex Project "Critical
phenomena in probability and stochastic process"
(http://www.ime.usp.br/~gprob).


Bibliography

  1. Bressaud, X., Galves, A. and R. Fernández (1999a). Speed of d-convergence for Markov approximations of chains with complete connections. A coupling approach. Stochastic Process. Appl., vol. 83, no. 1, 127-138.
  2. Bressaud, X., Galves, A. and R. Fernández (1999b). Decay of correlations for non Hölderian dynamics. A coupling approach. Electron. J. Probab., vol. 4, paper no. 3, 1-19.
  3. Cassandro, M., Collet, P., Galves, A. and Ch. Galves (1999). A Statistical-Physics Approach to Language Acquisition and Language Change. Physica A, vol. 263, 427-437.
  4. Collet, P., Galves, A. and A. Lopes (1995). Maximum likelihood and minimum entropy identification of grammars. Random and Computational Dynamics, vol. 3, 241-256.
  5. Dorea, C., Galves, A., Kira, E. and A. Pereira Alencar (1997). Markovian modeling of the stress contours of Brazilian and European Portuguese. REBRAPE, vol. 11, 161-173.
  6. Fernández, R. and A. Galves (2000). Identifying features in the presence of competing evidence. The case of first-language acquisition. WSSIAA, in press.
  7. Ferrari, P. and A. Galves (2000). Constructions of stochastic processes, coupling and regeneration, in press, available at http://www.ime.usp.br/~pablo/book
  8. Ferrari, P., Maass, A., Martinez, S. and P. Ney (2000). Cesaro mean distribution of group automata starting from measures with summable decay. Ergodic Theory and Dynamical Systems, in press.
  9. Ramus, F., Nespor, M. and J. Mehler (1999). Correlates of linguistic rhythm in the speech signal. Cognition, vol. 73, no. 3, 265-292.
  10. Sândalo, F., Abaurre, M. B. and Ch. Galves (1999). Otimizando o ritmo do Português. Technical Report, IEL-UNICAMP.

