Recently, several papers, starting with Ramus, Nespor and Mehler (1999; henceforth RNM), have provided evidence that simple statistics of the speech signal can discriminate between different rhythmic classes. In the present paper, we propose a new approach to the problem of finding acoustic correlates of the rhythmic classes. Its main ingredient is a rough measure of sonority defined directly from the spectrogram of the signal. This approach has the major advantage that it can be implemented in an entirely automatic way, with no need for prior hand-labelling of the acoustic signal. Applied to the same linguistic samples considered in RNM, it produces the same clusters, corresponding to the three conjectured rhythmic classes. The resulting statistics strongly suggest that rhythmic class discrimination can be based entirely on a measure of the obstruency present in the signal.
Large annotated speech corpora are a critical component of research in prosody. The classification of languages according to their speech rhythm, for example, requires a great number of annotated sentences by different speakers in different languages. We have developed {\it Vocale}, a tool for the semi-automatic annotation of the vocalic and consonantal parts of speech, because recent models have identified these units as reliable acoustic correlates of speech rhythm. {\it Vocale} is based on relative entropy and uses various additional classifiers, such as energy and length, for the annotation of vowels and consonants. It runs using the Praat speech analysis facilities and produces a Praat label file as output. {\it Vocale} is open source software and is available to the scientific community at http://www.ime.usp.br/$\sim$tycho/tipal/prosody/vocale/.
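The abstract does not say how {\it Vocale} applies relative entropy, so the following is only a minimal sketch of the quantity itself, under the assumption that it is computed between two spectral frames normalised to probability distributions. The function names, the smoothing constant `eps` and the threshold are hypothetical and are not taken from the tool.

```python
import math

def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between two non-negative
    vectors (e.g. power spectra), each normalised to sum to one.
    eps is a hypothetical smoothing constant that avoids log(0)."""
    n = len(p)
    sp = sum(p) + n * eps
    sq = sum(q) + n * eps
    total = 0.0
    for a, b in zip(p, q):
        pa = (a + eps) / sp
        qb = (b + eps) / sq
        total += pa * math.log(pa / qb)
    return total

def candidate_boundaries(frames, threshold):
    """Indices i where frame i differs strongly from frame i - 1.
    Assumption: a large divergence between consecutive frames hints at
    a transition between segment types."""
    return [i for i in range(1, len(frames))
            if relative_entropy(frames[i], frames[i - 1]) > threshold]
```

Identical frames give a divergence of zero, so a steady stretch of signal yields no candidate boundaries; any threshold would have to be tuned on annotated data.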
Typical postlexical interface phenomena, like secondary (rhythmic) stress, can be successfully modeled by OT analyses, which predict optimally stressed outputs from a set of possible inputs and a hierarchically ranked set of constraints. This paper presents an OT analysis of secondary stress in European and Brazilian Portuguese. Based on this analysis, a computer program, sotaq, has been developed, which allows proposed constraint hierarchies for both varieties of Portuguese to be tested automatically against large corpora. Test results are presented, showing suitable hierarchies that generate secondary stresses for both varieties.
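As an illustration of how a ranked constraint hierarchy selects a winner, here is a minimal sketch of OT evaluation. It is not the sotaq program: the encoding of candidates as digit strings ("0" unstressed, "1" secondary stress, "2" primary stress) and the two constraints `no_clash` and `no_lapse` are hypothetical stand-ins, not the constraints proposed in the paper.

```python
from itertools import product

def no_clash(cand):
    """Violations: pairs of adjacent stressed syllables (hypothetical constraint)."""
    return sum(1 for a, b in zip(cand, cand[1:]) if a != "0" and b != "0")

def no_lapse(cand):
    """Violations: pairs of adjacent unstressed syllables (hypothetical constraint)."""
    return sum(1 for a, b in zip(cand, cand[1:]) if a == "0" and b == "0")

def candidates(n_pretonic):
    """All secondary-stress assignments on the pretonic syllables,
    followed by a fixed primary stress."""
    return ["".join(p) + "2" for p in product("01", repeat=n_pretonic)]

def optimal(cands, ranking):
    """Winner under strict domination: compare violation profiles
    lexicographically, highest-ranked constraint first."""
    return min(cands, key=lambda c: tuple(con(c) for con in ranking))

print(optimal(candidates(3), [no_clash, no_lapse]))  # prints 0102
```

Testing a different hierarchy only requires reordering or replacing the entries of `ranking`, which is the kind of experiment the abstract describes running against corpora.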
We discuss mathematical issues suggested by the processes of first-language acquisition and language change. We present a model of language acquisition with two components: a probability measure describing sentence selection by a native speaker, and an identification principle modeling how the child chooses an element of the finite set of natural grammars. More generally, we present an approach to the problem of classifying existing evidence in order to choose among a finite set of policies in the presence of possibly conflicting hints.
Based on the Penn-Helsinki Parsed Corpus of Middle English, the Tycho Brahe Parsed Corpus of Historical Portuguese is an electronic corpus consisting of texts originally written by native speakers of European Portuguese born between 1550 and 1850. For the morphological codification of this corpus, an annotation system was developed, which basically consists of (i) a set of classificatory labels, for the identification of the lexical category of words; (ii) morphological variation labels, for the recognition of the [+marked] inflectional characteristics of lexical items; (iii) labels for the codification of text punctuation; and finally (iv) a set of diacritics, for the codification of the following information: the "/" sign, for the separation of the lexical item from its respective label; the "+" sign, to denote the association of labels in contracted lexical items; and the "-" sign, for linking morphological variation labels to classificatory labels.
In view of the morphological richness of Portuguese, a detailed study of its morphosyntactic properties was carried out in order to obtain robust sets of labels. The results show that the proposed annotation system for Portuguese guarantees efficient automatic searching across time. It was also shown that the annotation system could largely be applied to Romance languages in general.
The aim of this paper is to explain why Statistical Physics can help in understanding two related linguistic questions. The first question is how to model first language acquisition by a child. The second question is how language change proceeds in time. Our approach is based on a Gibbsian model for the interface between syntax and prosody. We also present a simulated annealing model of language acquisition, which extends the Triggering Learning Algorithm recently introduced in the linguistic literature.
Prosody plays a crucial role in grammar selection, restricting the possible values of the parameters to be set. In the present work we give a formal account of this claim, using the Thermodynamical Formalism. In this framework, the sample of positive evidence presented to the child is chosen according to a Gibbs state defined by the potential associated with the prosody. Given a sample of sentences of the parental grammar, the child selects a grammar according to a maximum likelihood principle. We argue that such a model accounts for both language acquisition and language change. As an example, we study the change in clitic placement from Classical to Modern European Portuguese.
Using the Thermodynamic Formalism, we introduce a Gibbsian model for the identification of regular grammars based only on positive evidence. This model mimics the natural language acquisition procedure driven by prosody, which is represented here by the thermodynamic potential. The statistical question we face is how to estimate the incidence matrix of a subshift of finite type from a sample produced by a Gibbs state whose potential is known. The model accounts for both the robustness of the language acquisition procedure and language change. The probabilistic approach we use avoids invoking ad hoc restrictions such as Berwick's Subset Principle.
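The estimation problem stated above can be illustrated with a small sketch. It is not the estimator of the paper; it assumes a nearest-neighbour potential phi(i, j), a finite list of irreducible candidate incidence matrices, and selection by maximum likelihood. For such a potential, the Gibbs state on the subshift with incidence matrix A is the Markov chain with P(i, j) = M(i, j) r_j / (lambda r_i), where M(i, j) = A(i, j) exp(phi(i, j)) and (lambda, r) is the Perron eigenpair of M. All function names are hypothetical.

```python
import math

def gibbs_transition_matrix(A, phi, iters=500):
    """Markov transition matrix of the Gibbs state for a pair potential
    phi on the subshift with incidence matrix A (assumed irreducible)."""
    n = len(A)
    M = [[A[i][j] * math.exp(phi[i][j]) for j in range(n)] for i in range(n)]
    r = [1.0] * n
    for _ in range(iters):
        # Power iteration on M + I: same eigenvectors as M, and it also
        # converges when M is periodic.
        r = [r[i] + sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        s = sum(r)
        r = [x / s for x in r]
    lam = sum(M[0][j] * r[j] for j in range(n)) / r[0]
    return [[M[i][j] * r[j] / (lam * r[i]) for j in range(n)] for i in range(n)]

def log_likelihood(sample, P):
    """Log-likelihood of the observed transitions; minus infinity if the
    sample contains a transition that the candidate forbids."""
    total = 0.0
    for a, b in zip(sample, sample[1:]):
        if P[a][b] <= 0.0:
            return float("-inf")
        total += math.log(P[a][b])
    return total

def select_incidence_matrix(sample, candidates, phi):
    """Index of the candidate incidence matrix with maximal likelihood."""
    scores = [log_likelihood(sample, gibbs_transition_matrix(A, phi))
              for A in candidates]
    return max(range(len(candidates)), key=lambda k: scores[k])
```

With the full shift and the golden-mean shift as candidates and a zero potential, a sample that never contains the forbidden pair is attributed to the smaller grammar, because that grammar concentrates more probability on what was actually observed. No subset principle has to be imposed by hand in this toy setting.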
This work addresses the question of modeling the stress contours of Brazilian and Modern European Portuguese as high-order Markov chains. We discuss three criteria for selecting the order of the chain: Akaike's Information Criterion, the Bayesian Information Criterion and the Minimum Entropy Criterion. A statistical analysis of a sample of spontaneous speech from both dialects indicates that the corresponding Markov chains are of different orders.
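Order selection with the first two criteria can be sketched as follows. This is a minimal illustration, not the analysis carried out in the paper: it assumes the stress contour is encoded as a sequence of symbols (for example "1" for stressed and "0" for unstressed syllables), all function names are hypothetical, and the Minimum Entropy Criterion is not covered.

```python
import math
from collections import Counter

def max_log_likelihood(seq, order, skip):
    """Maximised log-likelihood of an order-`order` Markov chain,
    conditioning on the first `skip` symbols so that all orders are
    fitted to the same observations (requires skip >= order)."""
    ctx_counts = Counter()
    trans_counts = Counter()
    for i in range(skip, len(seq)):
        ctx = tuple(seq[i - order:i])
        ctx_counts[ctx] += 1
        trans_counts[(ctx, seq[i])] += 1
    return sum(n * math.log(n / ctx_counts[ctx])
               for (ctx, _), n in trans_counts.items())

def select_order(seq, max_order, criterion="bic"):
    """Return (best order, scores) for orders 0..max_order.
    Score = -2 log L + penalty; AIC uses 2 per free parameter,
    BIC uses log(n) per free parameter."""
    alphabet = len(set(seq))
    n = len(seq) - max_order
    scores = {}
    for k in range(max_order + 1):
        ll = max_log_likelihood(seq, k, max_order)
        n_params = (alphabet ** k) * (alphabet - 1)
        penalty = 2 * n_params if criterion == "aic" else n_params * math.log(n)
        scores[k] = -2 * ll + penalty
    return min(scores, key=scores.get), scores

best, _ = select_order("01" * 100, max_order=3)
print(best)  # prints 1
```

A strictly alternating contour is fully predicted by the previous symbol, so both criteria pick order 1: order 0 fits poorly, and higher orders pay a larger penalty without improving the likelihood.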