::: Information-theoretic Approaches to Linguistics

Description
Program
Presentations
Registration
Contact

Description

A wide range of research has shown that tools from information theory (e.g. information content/surprisal, entropy) are useful tools in addressing questions of linguistic interest. These range from predicting the targets and outcomes of phonological and syntactic processes, to explaining the cognitive bases for these processes, to evaluating models of linguistic data.

A two-day workshop will bring together a number of researchers working on information-theoretic approaches to linguistics in an effort to share knowledge, tools, insights, and specific research findings. There will also be a tutorial on information theory for those not familiar with the approach. The tutorial will be followed by invited talks and a poster session.

Co-sponsored by The National Science Foundation and The Ohio State University.

Program

The workshop's main session will be held in room E0046 of Muenzinger. The poster session and lunches will be in the nearby University Memorial Center, room 382.

View the program (pdf).

Presentations

Invited Speakers

Petar Milin (University of Novi Sad, Serbia) (with Harald Baayen, Peter Hendrix, and Marco Marelli)

"From nominal case in Serbian to prepositional phrases in English: Modeling exemplar and prototype effects without exemplars and without prototypes, using discriminative learning"
handouts
[show/hide abstract]

We discuss the processing consequences of paradigmatic structure in inflectional morphology (nominal case in Serbian) and syntax (prepositional phrases in English) from an information- theoretic perspective. We first show that a greater distance from an exemplar to the prototype (evaluated with relative entropy measures) comes with increased processing costs, both for English and for Serbian. We then introduce an implemented symbolic computational model based on principles of discriminative learning that correctly simulates both the effects of exemplars and prototypes (as gauged with relative entropy) without assuming the presence of independent representations for prototypes or exemplars.
The basic engine of the model is parameter-free, and completely driven by corpus data. The model provides excellent fits to observed processing latencies, and faithfully simulates the observed effects of relative entropy. The model can be viewed as a very first step towards implementing a construction grammar in which the weights between form and meaning representations bear the weight of constructional schemas and exemplars. An important advantage of this approach is that the model is extremely sparse in the number of representations it requires, and therefore avoids mental lexicons or constructicons populated with many millions of exemplars.
John Goldsmith (University of Chicago)

"Learning morphophonology from unsupervised learning of morphology"
[show/hide abstract]

A number of researchers, including the present writer, have developed approaches to learn, or induce, finite-state morphologies directly from a corpus with no prior knowledge of the language, employing minimum description length (MDL) methods or other sets of machine learning tools. These methods tend to be best able to deal with concatenative morphology, and the presence of phonological rules in the language, especially those that are described as morphophonological rules, have the effect of making a morphology seem more complicated than it really is. A rule that deletes a stem-final segment, or doubles a stem-final segment, may be mis-analyzed by a naive automatic morphological learner, who misinterprets the effects of the phonological rule as evidence for different stems taking different sets of affixes. But this is an opportunity for us, as well as a challenge for the morphology-learner: the morphology learner can search for simple phonological rules whose effect is to greater simplify the morphology. By expanding the problem of learning morphology to the problem of learning both morphology and phonology, we end up simplifying the over-all resulting system. We illustrate how this works in cases from French, English, and other languages.
John Hale (Cornell University)

"Information-theoretic approaches to syntactic processing"
Citations
[show/hide abstract]

When understanding linguistic forms, people evidently do computational work. The Entropy Reduction Hypothesis (ERH) proposes a definition of this computational work in terms of the progress a comprehender makes in finding a derivation for the perceived linguistic form. It applies the idea of entropy from information theory to the set of derivations compatible with an initial substring of a sentence. Given a probabilistic grammar, this permits the set of such compatible derivations to be viewed as a random variable, and the change in derivational uncertainty from word to word to be calculated.
This talk reviews scientific consequences of this hypothesis, presenting results on relative clause comprehension in English and Korean. This application leverages the ERH's generality by combining it with mildly context-sensitive grammars such as Minimalist Grammars. If time permits, the talk will also address potential algorithmic models of the ERH and the relationship between the rational analysis of cognition generally and the ERH in particular.
Kathleen Currie Hall (CUNY: College of Staten Island & The Graduate Center), Andrew Wedel, and Adam Ussishkin (University of Arizona)

"Entropy and Phonological Contrast"
[show/hide abstract]

One way entropy can be used to understand phonological contrast is in terms of the degree to which any two sounds in a language are contrastive with each other. The typical criterion of "predictability of distribution," used to determine whether two sounds are contrastive or allophonic, can be recast in terms of entropy. This creates a continuous scale of unpredictability between two sounds. Our aim here is to provide evidence that this approach provides new avenues of explanation beyond the traditional dichotomy of contrastive vs. non-contrastive. We will show how the entropy-based view predicts the existence of so-called "marginal contrasts," particularly cases in which two sounds in a language are mostly predictable but are subject to a few unpredictable exceptions (e.g., Canadian Raising, New York /æ/-tensing, Japanese palatal consonants). We will also discuss work in progress suggesting that degree of contrast is inversely correlated with probability of neutralization in sound change.
Elizabeth Hume and Rory Turnbull (The Ohio State University)

"Redundancy in Phonology"
[show/hide abstract]

TBA
Florian Jaeger (University of Rochester) and Roger Levy (UC San Diego)

"Surprisal and information density in sentence comprehension and production"
[show/hide abstract]

The proposal that the negative log of the conditional probability of a linguistic element in its context—well-known in information theory as the element's "surprisal" or "Shannon information content"—has attracted considerable attention over the past several years as a quantity of fundamental interest and explanatory power in human language comprehension and production. In this talk, we describe key empirical results demonstrating the importance of this quantity as a determinant of human linguistic behavior, ranging from garden-path ambiguity resolution and syntactic comprehension difficulty in locally unambiguous contexts to speaker choice in structuring the utterance at levels of phonetic, morphosyntactic, and clause-level organization, and ultimately to the shape of the lexicon itself. We also discuss theoretical interpretations of these effects of Shannon information content as evidence for a drive towards communicative optimality on the part of native speakers. We close with speculation on how similar analysis might usefully be applied to the study of typologically attested universals of grammar — and the exceptions to them.
Jason Riggle (University of Chicago)

"The role of acoustic, articulatory, and distributional similarity in non-local dependencies"
[show/hide abstract]

Defining non-local dependencies as phenomena in which a particular subset of a language's phonemes interact as if they were adjacent even when separated captures key aspects of autosegmental phonology. However, the problem of discovering such dependencies rather than encoding them as phonological primitives presents the challenge that, for languages with n phonemes, there are 2n possible non-local dependencies. In this work, I examine the utility of three kinds of similarity-metrics: acoustic, articulatory, and distributional for their ability to identify classes of segments exhibiting non-local dependencies.
Andrea Sims (The Ohio State University)

"Information Theory and Paradigmatic Morphology"
[show/hide abstract]

Substantial evidence now indicates that paradigmatic structure is necessary for an adequate description of inflectional morphology (e.g. Matthews 1972, Anderson 1992, Aronoff 1994, Stump 2001, Baerman et al. 2005). However, that work has primarily been interested in deterministic relations holding among paradigm cells (e.g. rules of referral), and has not generally taken advantage of insights from information theory (an exception being Ackerman et al. (2009), which is concerned with how speakers learn inflection class membership). In this paper I explore the usefulness of treating the inflectional paradigm as an entropy system. Traditional Word and Paradigm models included the idea of a principal part — the word-form(s) from which all others could be predicted — but principal parts have played little role in modern paradigm-based morphology (but see Finkel and Stump (2007) for typological explorations).
In this paper I argue that these kinds of implicational relations are tied to the structural properties of inflection classes. For instance, in the Modern Greek nominal system, singular, plural and stress formatives cross-cut each other to an unusual degree, and I show that lack of interdependence between cells is tied to genitive plural defectiveness. I formalize the account using Shannon conditional entropy. Ultimately, this paper argues that the principal part, as a kind of paradigmatic relation, needs to be reintroduced to inflectional theory and that it can be formulated probabilistically in information-theoretic terms.

Poster Presentations

Sonia Barnes (The Ohio State University)

"The role of frequency in the deletion of intervocalic /d/ in Spanish first conjugation past participles"
sited references
[show/hide abstract]

Lexical items with high absolute frequency tend to be subject to phonetically reductive processes at a higher rate than lower frequency words. This study investigates the role that frequency plays in the alternation between /d/ and /Ø/ in first conjugation participles in Spanish, considering different manifestations of frequency. Relying on the concept of entropy, the study provides a unified approach of the frequency effects that model the variation observed.
The following independent variables were tested in a logistic regression analysis using the glm function in R: type of construction, token frequency, relative frequency, neighborhood density and number of segments. The results show that intervocalic /d/ in past participles is more likely to be omitted in contexts where it contributes little to system entropy: when it is less informative in the contrast between lexical neighbors (low-neighborhood density) and when it has a higher probability of occurrence (higher relative frequency).
Uriel Cohen-Priva (Stanford University)

"Information Utility promotes preservation"
[show/hide abstract]

A number of languages (English, Arabic, Huallaga Quechua) have multiple unrelated weakening processes that target specific segments. US English has several /t,d/ weakening processes, and Arabic /q/ weakens to /g/, /k/ and /ʔ/ in different dialects. Current linguistic theory does not predict language-specific conspiracies that weaken different segments in different languages.
I present a new model, MULE, that predicts which segments are likely to weaken in each language. In MULE, the information utility of linguistic elements (their expected predictability) promotes preservation while effort promotes reduction. The balance between the two forces leads to the language-specific weakening patterns. In OT terms, information utility represents faithfulness, and effort markedness. MULE's predictions are corroborated in standard OT models and experimentally in real-valued OT models. MULE's innovation lies in treating information utility as a preserving rather than weakening force, and in showing that expected predictability rather than local (contextual) predictability accounts for linguistic behavior.
Michael Collins (The Ohio State University)

"Information Theory and Linearization Patterns of Phrasal Prepositional Verbs"
[show/hide abstract]
English has a number of verb + preposition + preposition compounds, known as phrasal prepositional verbs. The position of adverbs within these constructions follows an extremely limited distribution.
1. a) I sometimes put up with nonsense.
2. b) ?I put sometimes up with nonsense.
3. c) ?I put up sometimes with nonsense.
4. d) ?I put up with sometimes nonsense.
5. e) I put up with nonsense sometimes.
In this poster, I show how the tools and concepts of Information Theory (Shannon 1948) can explain why language users are biased against certain configurations (b, c, d) and prefer others (a, e).
Robin Dodsworth (North Carolina State University)

"The Role of Complexity in Dialect Contact Outcomes"
[show/hide abstract]

This study investigates whether structural complexity – in this case, complex internal conditioning of phonetic variables – correlates with the speed of change in a dialect contact setting, hypothesizing that phonetic variables with high structural complexity are "high surprisal" elements (Hume & Mailhot forthcoming) and thus more vulnerable to change. Raleigh, NC, has housed large-scale contact between southern and non-southern dialects for 50 years. Acoustic analysis of the front vowel systems of 45 native residents from 3 generations reveals the gradual loss of southern variants. The two front vowels showing the strongest effects of adjacent segments (i.e., structural complexity) in the oldest generation are /æ/ tensing and /e/ retraction. Linear mixed-model regression shows that while all of the front vowels have shifted significantly away from their southern variants, /æ/ and /e/ have indeed shifted more quickly. Further, simplification of their internal factors was largely completed within two generations. The results are consistent with an information-theoretic model of large-scale dialect contact outcomes.
Richard Futrell & Michael Ramscar (Stanford University)

"German Grammatical Gender Manages Nominal Entropy"
[show/hide abstract]

We propose an information-theoretic functional motivation for grammatical gender. By lowering the entropy of nouns in context, grammatical gender allows German speakers to encode more information into the channel without increasing demands on the hearer. Gender marking mitigates the spikes in entropy caused by low frequency nouns in the context Determiner-Noun. Thus gender marking makes it easier to use low frequency nouns there without incurring processing difficulty. Accordingly, we find that in this context the average frequency of German nouns (NEGRA II corpus) is significantly lower than that of English nouns (NYT Gigaword corpus).
Next, we show how semantic (ir)regularities in German gender assignment facilitate the prediction of nouns. We show the existence of semantic regularities using an empirical distributional metric for semantic similarity in the Google N-Gram corpus. Further, we show that frequent nouns which are likely to co-occur are assigned to different genders.
Sunghoon Hong (Hankuk University of Foreign Studies / Indiana University)

"An Information-Theoretic Account of the Epenthetic Vowel in Korean"
[show/hide abstract]

Hume ∓ Bromberg (2005) and Hume (2006) proposed a novel way to deal with epenthetic vowels, arguing that epenthetic vowels should not be described as the unmarked vowels, determined by a universal principle of markedness, but rather as vowels having least overall "information content." The main purpose of this paper is to verify whether Hume & Bromberg's proposal holds for the epenthetic vowel [i] in Korean. The overall information content of the epenthetic vowel was calculated based on the frequencies of unigrams and bigrams obtained from the 21st-Century Sejong Project Morpheme-Tagged Corpus, constructed by the National Institute of the Korean Language from 1999 to 2004. The results show that the epenthetic vowel [i] in Korean is also least in the overall information content, as predicted by Hume & Bromberg and Hume.
Blake Stephen Howald (Georgetown University)

"Information-theoretic perspectives on the supervised machine learning of the spatiotemporal event structure of narrative discourses"
[show/hide abstract]

Research in the prediction of semantic and pragmatic relationships between elements of discourse structure with supervised machine learning techniques indicates that those algorithms relying on information-theoretic computations (Shannon 1948) outperform those relying on other types of computations (e.g. Bayesian). This presentation demonstrates that the accuracy of predicting explicit and implicit annotated structural information (i.e., spatial and temporal reference, event types and rhetorical relations) in 75 narrative discourses is up to 28% higher for the C4.5 decision tree and K* classifiers. Consequently, the spatiotemporal event structure of narrative discourses can be represented by a 40% reduction in entropy based on the distribution of coding elements as opposed to a uniform distribution model. Implications of this perspective on discourse structure generally, its potential use in differentiation of discourse genres, and approaches to NLP-based discourse processing, are additionally explored.
Michael Ramscar & Richard Futrell (Stanford University)

"The Predictive Function of Prenominal Adjectives"
[show/hide abstract]

Prenominal adjectives are supposed to modify or add to the meanings of head nouns, but it is often not clear how this works. For example, consider the NP "a cute little puppy"―since almost all puppies, especially the most prototypical ones, are cute and little, it is not clear what the adjectives are adding; and these are the most likely adjectives before “puppy”. The function of prenominal adjectives is better understood in terms of information theory. Prenominal adjectives lower the entropy of infrequent nouns and give incremental clues to help a hearer discriminate the intended message from other possible messages.
Supporting this function for adjectives, we show detailed evidence that more informative (i.e. infrequent) nouns are more likely to be preceded by adjectives in the Corpus of Contemporary American English (COCA). We also investigate the predictive role of certain adjectives that preferentially appear before high frequency nouns.
Purnima Thakur (CUNY Graduate Center)

"Sibilants in Gujarati Phonology"
[show/hide abstract]

There is a lack of consensus on the status of alveo-palatal [ʃ] in the phonology of Gujarati: some scholars treat /s/ and /ʃ/ as truly contrastive units (Adenwala, 1965; Dave, 1977; Masica, 1991); others treat them as quasi-phonemes, contrasting only in the environment of front vowels and glides (Pandit, 1954) and still others (Turner, 1921; Grierson, 1931) view [ʃ] as an allophone of /s/ conditioned by front vowels and glide /j/. In this paper, I show how an information-theoretic approach to phonological relationships (Hall, 2009) can be used to precisely calculate the status of [s] and [ʃ] in the individual grammars of 20 speakers of Gujarati. In the pool of speakers I sampled, none showed a truly allophonic status, but the rest were split between those with a (near) perfect contrast and those with a quasi-phonemic relationship between the two sibilants.
Tsz-Him Tsui (The Ohio State University)

"An Information-theoretic account of tonal merger in Hong Kong Cantonese"
[show/hide abstract]

Among the six lexical tones in Hong Kong Cantonese, recent studies have found that the low-rising Tone 5 may merge with the high-rising Tone 2 (Bauer, Cheung & Cheung 2003, Mok & Wong 2010a, b) or the mid-level Tone 3 (Wong 2008). I argue that the acoustic similarities in tone contours alone is inadequate to warrant the tone mergers, as the non-merging level tones have similar acoustics as well. Combining the acoustics into an information-theoretic approach (Shannon 1948, Hall 2009, Hume et al 2011), however, can predict the merge of rising Tone 5, as it carries the least information for distinguishing syllables in speech.
Marjolein van Egmond, Lizet van Ewijk, & Sergey Avrutin (Universiteit Utrecht)

"A New Theoretical Model for Word Finding Difficulties in Aphasic Patients"
[show/hide abstract]

It has been argued that aphasic patients suffer from a reduced processing capacity which renders part of the lexicon unavailable: the amount of energy (measured in entropy) necessary to retrieve these words exceeds the available resources. Therefore, to be able to communicate, patients unconsciously have to minimize the necessary amount of energy. We adapted the mathematical model for lexical retrieval developed by Ferrer i Cancho and Solé (2003) to model the aphasic lexicon. Based on this model, speaker entropy is calculated. Our adaptations were aimed at reducing entropy while maintaining as many words as possible. Our modifications to the model resulted in a lower entropy. Reducing entropy while maintaining as many words as possible was most effectively done by rendering words with an intermediate number of associations inaccessible, causing a gap in the distribution of associations. These results provide a detailed hypothesis for word finding difficulties in aphasic patients. We propose an experiment to test this hypothesis.
Lizet van Ewijk (Universiteit Utrecht)

"Verb retrieval in non-fluent aphasia: an information-theoretic approach."
[show/hide abstract]

Word finding difficulties (especially inflected verbs) are common in (non-fluent) aphasia. Previous research addressed this problem either in purely linguistic terms (e.g. verb movement) or in terms of lexical characteristics (e.g. frequency, age of acquisition.) We propose a new measure of verbal complexity (and related to this verb retrieval difficulties in Dutch aphasia), which is formulated in terms of Shannon’s information theory. We aim to explore the complexity of individual verbs and its effect on retrieval, in healthy and aphasic subjects. We use two information-theoretic measures: inflectional entropy (reflecting probabilistic variability of forms within a given verbal family) and information load (reflecting complexity of individual verb forms). Our results demonstrate that a decrease in lexical processing capacity characteristic for patients with aphasia has a measurable effect that can be calculated using information theoretical means.

Registration Information:

There is no registration fee for the workshop but we would appreciate having people register in advance in order to help with planning. Please do so by emailing LSAinfotheory@ling.osu.edu with your name and affiliation.

Contact and Links:

For more information, contact the organizers (Kathleen Currie Hall, Beth Hume, Rory Turnbull) at LSAinfotheory@ling.osu.edu.

Potentially of interest to our participants is the workshop Testing Models of Phonetics and Phonology, being held on Wednesday July 13 in Boulder, CO.

Information-theoretic Approaches to Linguistics

July 16–17, 2011, in conjunction with the 2011 LSA Linguistic Institute.