Friday, December 20, 2013

Morphological relations

At some point in this forum I think I posted about my old work on Whole Word Morphology.  Next month I am attending the International Symposium on Artificial Intelligence and Mathematics, and speaking about this approach in a special session on Mathematics of Natural Language Processing.  I think that it may be useful for NLP in a variety of languages.  Here is the abstract:

Whole Word Morphology does away with morphemes, instead representing all morphology as relations among sets of words, which we call lexical correspondences. This paper presents a more formal treatment of Whole Word Morphology than has been previously published, demonstrating how the morphological relations are mediated by unification with sequence variables. Examples from English are presented, as well as Eskimo, the latter providing an example of a highly complex polysynthetic lexicon. The lexical correspondences of Eskimo are operative through their interconnection in a network using a symmetric and an asymmetric relation. Finally, a learning algorithm for deriving lexical correspondences from an annotated lexicon is presented.

Link to the paper at the ISAIM website

This research program fits with my general theme of learning language through unification procedures, which I think is both computationally useful and cognitively relevant.  It seems to me that the cognitive version of unification is "analogical learning."
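For the curious, here is a minimal sketch of the sort of unification I have in mind for lexical correspondences.  The pattern notation (a single sequence variable X standing for the shared part of a word) and the example pair are my own toy simplifications for this post, not the formalism in the paper.

```python
# Toy sketch of a lexical correspondence mediated by unification with a
# sequence variable.  The "X" pattern notation is an illustrative
# simplification, not the paper's formalism.

def unify(pattern, word):
    """Bind the sequence variable X in a pattern like 'Xs' or 'Xing'."""
    prefix, suffix = pattern.split("X")
    if (word.startswith(prefix) and word.endswith(suffix)
            and len(word) > len(prefix) + len(suffix)):
        return {"X": word[len(prefix):len(word) - len(suffix)]}
    return None

def apply_correspondence(corr, word):
    """Carry a word across a correspondence (left pattern -> right)."""
    left, right = corr
    binding = unify(left, word)
    return None if binding is None else right.replace("X", binding["X"])

plural = ("X", "Xs")                          # relates cat ~ cats, dog ~ dogs
print(apply_correspondence(plural, "cat"))    # -> cats
print(unify("Xs", "dogs"))                    # -> {'X': 'dog'}
```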

Friday, October 4, 2013

Contextual grammars and term-labeled trees

I've recently discovered the work of Solomon Marcus, a brilliant mathematician who's been publishing in mathematical linguistics since the 1950s.  He worked to develop "contextual grammars," which derive from the American Structuralist idea that the "set of contexts" determined by a word in a given language is an important characteristic of the word, perhaps the most important. In a survey paper that appeared in the Handbook of Formal Languages (Rozenberg and Salomaa 1997), Marcus explains how the sets of contexts determined by the words in a language and the sets of words appearing in the contexts are related by a Galois connection.  He cites Sestier (1960) for first showing this.
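To make the Sestier observation concrete, here is a toy computation of the two maps involved, on a tiny finite language of my own making (the encoding is mine, not Marcus's notation).  The Galois connection shows up in the fact that the two maps are inclusion-reversing and their composition is a closure operator.

```python
# Toy computation of the two maps underlying the Galois connection:
# word sets -> the contexts they all accept, and
# context sets -> the words that fit them all.

L = {"the cat sleeps", "the dog sleeps", "the cat eats", "a cat sleeps"}

def contexts_of(words, language):
    """All contexts (u, v) such that u + w + v is a sentence for EVERY w."""
    candidates = {(s[:i], s[i + len(w):]) for s in language for w in words
                  for i in range(len(s)) if s[i:].startswith(w)}
    return {(u, v) for (u, v) in candidates
            if all(u + w + v in language for w in words)}

def words_of(ctxs, vocabulary, language):
    """All words (from a candidate pool) that fit EVERY given context."""
    return {w for w in vocabulary
            if all(u + w + v in language for (u, v) in ctxs)}

C = contexts_of({"cat"}, L)          # every context of "cat"
W = words_of(C, {"cat", "dog"}, L)   # the closure: words sharing all of them
print(sorted(C))   # [('a ', ' sleeps'), ('the ', ' eats'), ('the ', ' sleeps')]
print(sorted(W))   # ['cat']  ("dog" fails the context ('the ', ' eats'))
```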

My problem with this whole framework is that the contexts are defined as string contexts only--the strings that occur before and after a word in a sentence form the context.  I have worked to get beyond this string-based model of language syntax.  In my own work I proposed term-labeled trees, which are sentences provided with an immediate constituent analysis (a bare tree) together with a semantic term label, usually a lambda term.  Then, ignorant of contextual grammars at the time, I proposed the notion of a term-labeled tree context.  This is the term-labeled sentence tree with two holes in it, corresponding to the locations of the meaning term and the linked syntactic item (possibly a single word, possibly a subtree) that would fill the context.
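For concreteness, here is a small sketch of how I might encode a term-labeled tree and its two-holed context.  The nested-tuple encoding and the toy sentence are just illustrative assumptions for this post.

```python
# A sketch (my own encoding) of a term-labeled tree and a term-labeled
# tree context: the bare tree is nested tuples, the term label is a
# string, and the two linked holes are marked with "[]".

HOLE = "[]"

# A toy term-labeled tree for "John sees Mary":
sentence = (("John", ("sees", "Mary")), "((sees Mary) John)")

# The context obtained by extracting the word "sees" and its meaning:
context = (("John", (HOLE, "Mary")), "(([] Mary) John)")

def fill(ctx, subtree, term):
    """Plug a (subtree, term) pair into both holes of a context."""
    tree_ctx, term_ctx = ctx
    def plug(t):
        if t == HOLE:
            return subtree
        if isinstance(t, tuple):
            return tuple(plug(c) for c in t)
        return t
    return (plug(tree_ctx), term_ctx.replace(HOLE, term))

print(fill(context, "sees", "sees") == sentence)   # -> True
```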

Drawing on the result of Sestier, it appears that in a term-labeled tree language the term-labeled tree contexts are still related to the words and other possible subtrees by a Galois connection.  I would have to take some more steps to prove it, but it seems right at a glance.  This is a nice mathematical connection, and it would be great if it continues to hold in what I regard as the improved variation on contextual grammars.


Friday, August 30, 2013

Learning biases in constraint-based grammar learning

In my previous post I highlighted a new topic in TopiCS.  Here I will offer a few remarks on one of the papers, "Cognitive Biases, Linguistic Universals, and Constraint-Based Grammar Learning" by Culbertson, Smolensky, and Wilson.

The broad goals of this paper are (i) to exemplify the argument that human language learning is facilitated by learning biases, and (ii) to model some specific biases probabilistically in a Bayesian fashion.  Let me say first that I am very sympathetic to both of these general ideas.  But I think that this project is narrowly applicable only to the framework of optimality-theoretic syntax, a framework that is in no way fleshed out enough to generate an entire language.
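To illustrate idea (ii) in the simplest possible terms, here is a toy Bayesian update of my own devising (emphatically not the authors' actual model): a prior over four word-order "grammars" favors the two harmonic patterns, and is combined with the likelihood of some hypothetical learner input.

```python
# Toy Bayesian bias (my own illustration, not the model in the paper):
# four word-order "grammars" for numeral and adjective placement, a
# prior favoring the harmonic patterns, and a posterior update on a
# few hypothetical observed phrases.

from math import prod

patterns = ["Num-N & Adj-N", "N-Num & N-Adj",    # harmonic
            "Num-N & N-Adj", "N-Num & Adj-N"]    # non-harmonic
prior = [0.35, 0.35, 0.15, 0.15]                 # assumed harmony bias

def likelihood(pattern, data):
    """Each observed order is generated with prob. 0.9 if the grammar
    licenses it, 0.1 otherwise (a made-up noise parameter)."""
    return prod(0.9 if obs in pattern else 0.1 for obs in data)

data = ["Num-N", "Num-N", "N-Adj"]               # hypothetical input
posterior = [p * likelihood(pat, data) for p, pat in zip(prior, patterns)]
Z = sum(posterior)
for pat, p in zip(patterns, posterior):
    print(f"{pat}: {p / Z:.3f}")
```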

So, without going into too many details, I think the paper's result applies only to a tiny corner of a grammar, in particular the part that derives "nominal word order" within noun phrases involving a numeral + noun on one hand, and an adjective + noun on the other.   I'm not sure how to react to a result that only derives a couple of expressions from a whole language.  I agree there might be Bayesian probability at work in grammar learning, but a project like this really needs to be worked out on a grammatical system capable, at least in theory, of deriving an entire language.  I don't know if that capability has been shown for the kind of optimality-theoretic syntax under discussion here.  I do know there are about 50 frameworks waiting in the wings that are known to be able to generate real languages, at least in theory if not in practice.  Maybe we should try applying some of the ideas from this paper to fully fledged grammatical frameworks, instead of a half-baked one (sorry if that is a mixed metaphor!).

Sunday, August 18, 2013

Cognitive Modeling and Computational Linguistics

It has been some years since I quit my membership in the Association for Computational Linguistics.  I quit because, in simplest terms, I wasn't getting much out of the membership and I was not encouraged by the direction of the field of Comp Ling.  My impressions from about 2000 through 2008 were that Comp Ling was getting more and more "engineering" oriented, and more and more hostile to any other purpose for computational modeling.  I have a few anecdotes I could tell about my own papers on that score; one appeared in JoLLI after referees for Computational Linguistics suggested it be sent there, since it had no "practical application."  (Being naive at the time, I did not realize that every paper in Computational Linguistics had to have a practical application.) 

A new topic which appeared in the July issue of Topics in Cognitive Science gives some hope for a different future.  Here one finds 11 papers under Computational Models of Natural Language, edited by John Hale and David Reitter.  The overarching theme is basically computational psycholinguistics relaunched.  The papers include many which I would like to comment on here in later posts.   They were presented at the first workshop on Cognitive Modeling and Computational Linguistics, held at the ACL meeting in 2010.  This workshop has since been reprised in the succeeding years, so it seems that this is not a one-time aberration.  The notion of using computational linguistics to investigate linguistic theory was purged from the ACL (especially the North American chapter) before I finally quit.  I'm glad to see this research avenue explored under the auspices of this Association once again.

Sunday, May 12, 2013

The Eurasiatic sharpshooter fallacy?

A paper by Mark Pagel, Quentin Atkinson, Andreea Calude and Andrew Meade has the linguistic blogosphere buzzing, so I thought I'd contribute my own entry. Their paper "Ultraconserved words point to deep language ancestry across Eurasia" was published in Proceedings of the National Academy of Sciences ahead of print, and has already been the subject of fierce criticism from the ultra-doctrinaire community of comparative historical linguistics. The paper applies an intriguing statistical procedure to the LWED database of reconstructed proto-words in seven established language families, and purports to uncover 23 lexical items which unite the families into a Eurasiatic superfamily, much as was proposed many years ago by Joe Greenberg (among others).

I'm not going to contribute a detailed critique of the paper here; I will note that the critique posted on Language Log by Sally Thomason includes the caveat that she is not qualified to judge the statistical procedures.  I think that if one is going to critique a scholarly paper, it should really be critiqued in its entirety and not just in bits and pieces, but some may differ on that score.

I think two major criticisms have emerged from the various comments, which are "garbage in, garbage out" and the "Texas sharpshooter fallacy."  The second one (raised by Andrew McKenzie of the University of Kansas) is more interesting to me, since it actually involves the statistical interpretation. This statistical fallacy involves "discovering hidden structure" or clusters in data where there is really no evidence for anything.  It takes its name from the tale of a Texas gun for hire who was not a very good shot.  Being clever, he took out his two revolvers and fired 12 shots as best he could at the side of a barn, and then painted a target centered on the tightest cluster of bullet holes. He then showed the target to potential clients, claiming to be a sharpshooter.

In the Eurasiatic data, I guess the problem could be that the 23 "ultraconserved" lexical items which were found to unite the families could just resemble each other by chance, but it is hard for me to draw this analogy with the Texas sharpshooter because the statistical results in the paper are so significant that they seem to rule out problems of this kind.  For one thing, there are 7 language families involved and not just two.  For another, the 23 lexical items emerge from the typical 200-word Swadesh list comparison.  Without any rigorous argument, it seems to me that there is a very low chance of 23 items out of 200 (that's 11.5%) randomly being similar across 7 language families.  A commonly cited real instance of a scientific study waylaid by the Texas sharpshooter was a Swedish epidemiological study of 800 medical conditions, which found a significant difference in the incidence of one ailment out of 800 among people who lived near electric transmission lines (this is cited on the Wikipedia page about the Texas sharpshooter). This result is now regarded as not reproducible, an instance of the Texas sharpshooter.  But let's take note that 1 ailment out of 800 is quite different from 23 words out of 188.
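Just to convey the orders of magnitude, here is a crude null model of my own (nothing like the actual procedure in the paper): suppose each of 200 meanings independently looks cognate across the families with some small probability p, and ask how often 23 or more such chance matches turn up.  The value of p is pulled out of thin air, so this is only a sanity check on intuition.

```python
# Crude null model (my own toy, not the method of Pagel et al.): each
# of 200 meanings independently "matches" across the families with a
# small, made-up probability p; how often do 23+ chance matches occur?

import random

def trial(n_meanings=200, p=0.05):
    return sum(random.random() < p for _ in range(n_meanings))

random.seed(0)
runs = 100_000
hits = sum(trial() >= 23 for _ in range(runs))
print(f"P(>= 23 chance matches) ~ {hits / runs:.5f}")   # vanishingly small
```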

Quentin Atkinson assured me that he can stand behind this paper, and he may yet have to defend it in the pages of PNAS or some similar platform.  These authors are not going to make a clean getaway with such a provocative proposal, not now that the anti-mass-comparison folks in comparative linguistics have gotten wind of it.  My own view in general is that we should embrace new sources of evidence in linguistics, rather than closing ranks and saying that methods developed over a century ago are really the only way.  Let's not forget that the standard comparative method is strict enough that it can be carried out "by a trained eye," without any statistical processing.  Surely there must be some kind of computational analysis that can go beyond this.

Thursday, March 14, 2013

Toward the learnability theory of language as a complex adaptive system

Summary

There have recently been a number of efforts to model language as a complex adaptive system. A few successful projects have explicitly modeled evolving language using evolutionary game theory. When carefully applied, this technique has shown itself able to account for aspects of language change, and this approach deserves far more testing.  In the simplest kind of model, the speakers play “language games” in which the objective is to imitate each other. The game strategies are related by a payoff matrix, and the “imitation dynamics” governs the evolution in a fashion analogous to the replicator dynamics in biological models.
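For readers unfamiliar with the machinery, here is a minimal sketch of discrete-time replicator dynamics, with imitation read in place of replication; the payoff matrix is an arbitrary toy, not a model of any particular linguistic game.

```python
# Minimal sketch of discrete-time replicator ("imitation") dynamics:
# each strategy's share grows in proportion to how its payoff compares
# with the population average.

def step(x, A):
    fitness = [sum(A[i][j] * x[j] for j in range(len(x)))
               for i in range(len(x))]
    avg = sum(f * xi for f, xi in zip(fitness, x))
    return [xi * f / avg for xi, f in zip(x, fitness)]

A = [[1.0, 2.0],      # strategy 0 earns 1 against itself, 2 against 1
     [2.0, 1.0]]      # (an anti-coordination game with a mixed optimum)
x = [0.9, 0.1]        # initial population shares
for _ in range(100):
    x = step(x, A)
print(x)              # converges to the mixed equilibrium [0.5, 0.5]
```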
Learnability theory (of language) is the established mathematical study of the learning capabilities of inductive language learning algorithms.  Such theoretical analysis covers a wide range of learning models, and can be very helpful in evaluating the effectiveness of postulated language learning algorithms.  But in reality, of course, language learners are themselves the speakers in the evolving linguistic community.  Language learners are busy learning an evolving language—in fact they are part of the cause.  Mutation in an evolutionary language game can be modeled as imperfect learning. The program of research I am putting forward here is to forge a greater connection between the evolutionary modeling of language and the formal learnability theory of language. There is precious little existing research on this topic, but as a practicing linguist I believe we must seek to understand language as something which is at once inductively learned and evolving under a host of systemic pressures and functional adaptations.  Only then can we hope to achieve significant understanding of the most important human cognitive ability.

Language as a complex adaptive system

Efforts to model human language as a “complex adaptive system” have now been developing for at least two decades. Beginning in the 1990s with some pioneering early work on models of language evolution by Steels, Croft and other researchers, the idea of modeling language as an emergent property of a system of agents who are trying to communicate gradually became more popular, though it is still far from being a mainstream topic in Linguistics. The currency and modern development of the approach are evinced by such books as The Origin of Vowel Systems1, Self-Organization in the Evolution of Speech2, and Language as a Complex Adaptive System3.
De Boer1 describes how language meets the criteria of a complex system: the interacting elements are the speakers, and the local interactions are speakers talking to each other and learning language from the speech community. De Boer goes on to explain how language is also adaptive: it changes under the influence of cognitive and social forces which seek to optimize a variety of attributes, viz. communicative efficiency, communicative effectiveness, and ease of learning.
Steels4 introduced simulations of language as a complex system, consisting of a large number of agents interacting and playing a “language game” designed to foster increased communicative effectiveness.  The approach is clearly based upon Maynard Smith’s5 evolutionary game theory.  A number of the elements of natural language have been modeled as emerging from a population playing games: De Boer (op. cit.) showed how agents playing an “imitation game” with vowel sounds could spontaneously develop a vowel system bearing similarities to natural vowel systems; Steels6 simulated the emergence of conceptual categories and linguistic syntax, while Steels and Kaplan7 simulated the emergence of a lexicon—which is to say, a set of form-meaning relations shared among the population.
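Here is a stripped-down sketch of an imitation game of this sort, loosely in the spirit of de Boer's model; the parameter values and the two-agent setup are my own toy assumptions.

```python
# Stripped-down "imitation game" (every parameter is a toy assumption):
# agents hold vowel prototypes as points in (F1, F2) space; a speaker
# utters a noisy token and the listener nudges its nearest prototype
# toward what it heard.

import random

def nearest(protos, token):
    return min(range(len(protos)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(protos[i], token)))

def play_round(agents, noise=20.0, rate=0.1):
    speaker, listener = random.sample(agents, 2)
    token = tuple(f + random.gauss(0, noise) for f in random.choice(speaker))
    k = nearest(listener, token)
    listener[k] = tuple(p + rate * (t - p) for p, t in zip(listener[k], token))

random.seed(1)
# two agents with three vowel prototypes each, in rough (F1, F2) Hz terms,
# perturbed so that they start out disagreeing:
agents = [[tuple(f + random.gauss(0, 80) for f in v)
           for v in [(300, 2300), (700, 1200), (300, 800)]] for _ in range(2)]
for _ in range(5000):
    play_round(agents)
print(agents[0])
print(agents[1])   # the two vowel systems drift into close alignment
```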
Yet, something is missing from much of this development: a theory which can both describe and constrain how such complex systems of language learners might function while nevertheless changing their established language. It seems that the notion of complex adaptive system has been adopted in much functionally motivated linguistic research almost as a leitmotif rather than as a serious mathematical theory which can be used to study the interconnected processes of language learning and evolution.  There are apparent conundrums in this interconnection; chief among them is that while human children learn language successfully, language continues to evolve through the generations.

Evolutionary game theory

Evolutionary game theory was developed by J. Maynard Smith,5 and is well known as a model of evolving complex systems. In the study of language evolution, it has been used to formalize the notion of functional adaptation of a language to meet certain communicative needs.8 Jäger (op. cit.) used a stochastic evolutionary game simulation to demonstrate that certain common features of grammar are stable states, while other unattested conditions are evolutionarily unstable, using just a few uncontroversial facts about linguistic communication.  Despite this interesting and successful research, there has been little or no follow-up.  My feeling is that a lot more work deserves to be pursued in this area.  Jäger only dealt with functional adaptation affecting a couple of grammatical features; there are numerous other evolutionary forces which drive language change in other varied ways.
By way of example, let me consider the evolution of vowel systems. Some pioneering modeling was done by de Boer on the initial evolution of vowel systems (at the dawn of language), but there has been little or no complex systems modeling of the continuing evolution of the sounds of language in response to systemic pressures. There is indeed scarce agreement about what the systemic pressures affecting vowel systems actually are. For whatever reason, natural languages most frequently evolve to a point where they have approximately 5 vowel qualities (usually /i, e, a, o, u/ as in Spanish), and this seems to be an evolutionarily stable state of such systems—languages with approximately 5 vowels often keep them unchanged for many centuries (to wit, Spanish).  On the other hand, many languages have for unknown reasons developed vowel systems with more than 10 distinct vowel qualities—this is characteristic of the Germanic languages including English. These larger vowel inventories, however, are usually unstable; English dialects are constantly going through “vowel shifts” which threaten to render the many global varieties of English mutually unintelligible. A further fact of interest, however, is that the state in which a language has a large number of constantly shifting vowels itself appears to be evolutionarily stable.  English has had at least 10 vowels and diphthongs since Anglo-Saxon times (leaving long/short distinctions aside) and now has about 13, so rather than reducing the number of vowels we have cycled through many different qualities of these vowels in the intervening centuries.  This type of vowel shifting is reminiscent of the stable oscillatory states which have been demonstrated in evolutionary models of cooperative behavior which include mutation.9
In language evolution, successful imitation plays the part of the replication found in evolution models (so Maynard Smith’s replicator dynamics are now imitation dynamics),8 while imperfect learning by the next generation is the mutation. The above-mentioned vowel shifting is likely to be caused in part by a failure to accurately imitate so many vowels because of production/perception failures (a fitness failure in evolutionary terms), but also by quasi-random changes in the lexicon that affect the functional load of the vowel contrasts. Functional load (briefly: the amount of distinctive work done by a contrast in a given language)10 is likely to be an important force in the evolution of language, although along with many other such forces it has never been modeled in a full-fledged evolutionary game simulation.
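Since functional load will come up again, here is one crude way to operationalize it (several competing definitions exist in the literature): count the minimal pairs that a given contrast keeps distinct in a made-up toy lexicon.

```python
# One crude operationalization of functional load: the fraction of
# word pairs in a toy lexicon kept distinct solely by a given contrast.

from itertools import combinations

lexicon = ["bit", "bet", "bat", "but", "pit", "pat", "sit", "set"]

def functional_load(v1, v2, words):
    """Fraction of word pairs distinguished only by the v1/v2 contrast."""
    count = 0
    for a, b in combinations(words, 2):
        if len(a) == len(b):
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1 and set(diffs[0]) == {v1, v2}:
                count += 1
    return count / (len(words) * (len(words) - 1) // 2)

print(functional_load("i", "e", lexicon))   # bit/bet and sit/set -> 2/28
```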
Many other evolutionary phenomena have been postulated for natural language but have never been subjected to detailed study through dynamical simulation. One further example is the apparent tendency for the separate words of sentences to gradually “agglutinate,” so that sequences of words often become prefix-root or root-suffix combinations (witness the modern-day creation of English items like gonna and coulda). The reverse process, while easy to imagine, is all but unknown in reality.11 So we see that there are subtle questions of linguistic fitness that need to be carefully considered to achieve explanatory models.
 

Formal learning theory

Formal learning or “learnability” theory has recently been reviewed by Fulop and Chater12, who cover a number of distinct approaches to the mathematical modeling of learning functions or languages. A standard model of learning can be used in nearly all formal learning algorithms: we suppose that the learner receives a data sequence D—the example set, learning sample, or training set—one item at a time. The data sequence consists of examples, and the learner proposes a hypothesis to characterize what it has learned after each example. One natural goal for the language learner is to recover (or perhaps to approximate) the “true” language from which the data D has been generated. In this setting, an example might consist of a sequence of symbols, plus the information that this expression is within the target language. A learning sample would then be a sequence of such sentences.
Learnability theory is traditionally concerned chiefly with how to set up a problem so that whether the “true” function, concept, or language is learnable can be assessed by mathematical analysis. This kind of learning theory also usually focuses on learning as a process in which the learner’s hypothesis approaches the target language as more data is analyzed. Key differences among theoretical frameworks within learning theory center on the specific way to model the notion of “approaching the target.” For the research program I envision, it is not essential to select one particular learning-theoretic approach. One could imagine useful results being connected to a variety of methodologies including Bayesian learning, formalized inductive inference, and probably approximately correct inference, given that each of these disparate approaches has yielded important results pertaining to language learning.
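As a concrete miniature of the standard model, here is a toy learner in the identification-in-the-limit style: it enumerates a small hypothesis class and keeps the first hypothesis consistent with the data seen so far, revising only when contradicted.  The three-language class is, of course, an illustrative assumption of mine.

```python
# Toy "identification in the limit" learner (my own illustration): the
# class contains three regular languages; the learner outputs the first
# enumerated hypothesis consistent with all positive data seen so far.

import re

HYPOTHESES = [("a*", r"a*"), ("b*", r"b*"), ("(ab)*", r"(ab)*")]

def learner(stream):
    """Yield the current hypothesis after each positive example."""
    seen = []
    for example in stream:
        seen.append(example)
        for name, pattern in HYPOTHESES:
            if all(re.fullmatch(pattern, s) for s in seen):
                yield name
                break

text = ["", "ab", "abab", "ababab"]     # a positive presentation of (ab)*
print(list(learner(text)))              # ['a*', '(ab)*', '(ab)*', '(ab)*']
```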
An important finding to emerge from language learnability studies is that various elements of natural languages can be successfully learned by a variety of specific algorithms, but only if one allows either unrealistic computing power13 or tighter restrictions on the class of languages which can be learned14, 15 (i.e. some form of Universal Grammar or universal learning bias). The form of innate learning bias that will serve to permit language learning is not as extensive as that originally sought within the Chomskyan program of Principles and Parameters, however. The latter program called for such a rich innate component that the credo “most of language is innate” has sometimes been attributed to the Chomskyan paradigm.16

Learning in evolving systems

Seemingly the first source to combine learnability theory with the study of language as an evolving dynamical system is Niyogi (2006)17.  While impressive and mathematically sophisticated, this work should be viewed as only a starting point for the research program I envision.  There are many assumptions made in Niyogi’s approach that deserve reconsideration, including the language learning paradigm and the dynamical system model.  For the former, Niyogi stuck to the linguistic paradigm of Principles and Parameters, a model that has since fallen out of favor, in part due to Niyogi’s own proofs18 showing that the learning algorithms did not have the expected nice properties.  For the latter, Niyogi’s dynamical models are not complex systems; rather, they are simple deterministic systems with tractable analytic solutions.  While this enabled the calculation of handy mathematical results instead of messy simulations, now we must move into the complex systems regime—indeed the systems should be not only complex but adaptive as well.
In what is apparently the only literature to add substantially to Niyogi’s approach, Chatterjee et al.9 provide some useful methods for combining the study of evolutionary dynamical systems with the study of learning theory.  While mentioning linguistic applications, their work is focused on populations learning Prisoner’s Dilemma strategies.  My plan is to in essence combine the methods of Jäger and those of Chatterjee et al., in the search for novel substantive connections between the complex adaptive system model of language change and learning-theoretic results about language.  The main theoretical area which requires development is, as pointed out to me by Nick Chater (p.c.), the scenario in which inductive learners (i.e. children) are using the “results” of previous learners (i.e. adult speakers) who presumably have identical language learning biases, and all are together in the same evolutionary dynamical system.  This should make it possible to learn more easily from finite (and thus partial) language data.  But as mentioned, there seems to be no published literature which addresses these points or which does what I’m envisioning for my research program.
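To show the kind of machinery I have in mind, here is a sketch of a discrete replicator-mutator step, in the general spirit of the language-dynamics literature: the payoff matrix drives imitation, while a learning-fidelity matrix Q encodes mutation as imperfect learning.  All the numbers below are toy assumptions.

```python
# Sketch of a discrete replicator-mutator step: payoffs A drive
# imitation, and Q[j][i] is the chance that a learner exposed to
# grammar j acquires grammar i (imperfect learning = mutation).

def replicator_mutator_step(x, A, Q):
    n = len(x)
    f = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    phi = sum(fi * xi for fi, xi in zip(f, x))      # average payoff
    return [sum(f[j] * x[j] * Q[j][i] for j in range(n)) / phi
            for i in range(n)]

A = [[1.0, 0.5], [0.5, 1.0]]       # mutual-intelligibility payoffs (toy)
Q = [[0.95, 0.05], [0.05, 0.95]]   # 5% imperfect learning (toy)
x = [0.8, 0.2]                     # population shares of two grammars
for _ in range(100):
    x = replicator_mutator_step(x, A, Q)
print(x)   # imperfect learning keeps the minority grammar in circulation
```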

Research plans

My plans for carrying out the research involve a number of interrelated activities and phases. I plan to construct evolutionary language simulations which model a variety of linguistic forces beyond functional adaptation, such as cognitive constraints and the mechanics of speech.  Such simulations will progress to involve multiple generations of speakers, in which the younger speakers learn language from the older speakers by imitation. The next phase would add mutation to the simulation, in the form of imperfect learning.  My hope is eventually to be able to derive language learnability results in the specific setting of a multigenerational adaptive speech community with homogeneous learning biases.
To take a specific example, I plan to invoke an existing method for learning about the morphology of words19 to develop an evolutionary game simulation in which each succeeding generation of learners applies the method to the output of the previous generation.  Constructing the simulation will involve carefully considered parameters and entries in the payoff matrix which determines the outcomes of “games” played by the participants.  The overall goal of the game is not only successful imitation but correct word structure in relation to other similar words, which can be gauged by a number of possible measures. This will use a stochastic version of evolutionary game theory, as in Jäger (op. cit.).  Once the basic evolving system is set up, mutation can be introduced and the dynamics examined under different assignments to fitness parameters.
The basic learning results about this approach to morphology are quite straightforward19 and should be applicable within the dynamical systems approach; I imagine that the complexity of the learning is an interesting object of study in addition to the learnability per se. The effects of mutation in various forms will surely affect the learning results; it may become impossible to learn adequately if imperfection is too great, because a degree of homogeneity needs to be found among the adult speakers.  In general I am hoping to find some unforeseen results.

References


Wednesday, February 6, 2013

Why formal learning theory matters for cognitive science

A new special topic bearing this title, edited by myself and Nick Chater, has just been published in Topics in Cognitive Science.  The topic includes numerous papers on formal learning theory of languages, and a couple of others addressing Bayesian and semisupervised learning.

I won't bother to link to this journal, since you either have subscription access or you don't, and a direct link is not likely to work either way.

Athabaskan languages

Sometimes I get the distinct feeling that mathematical linguistics, and linguistic theory in general, has absolutely nothing to say about some languages.  I've been renewing my interest in Navajo lately, which is a typical representative of the Athabaskan (Na-Dene) family in general.

My big Analytical Lexicon of Navajo (Young, Morgan and Midgette 1992) organizes the verbs according to the roots, each of which is expressed by several stems in conjugated verbs.  Verbs are conjugated for two different kinds of aspect, arranged in a two-dimensional aspect matrix.  Some verbs have 8 or more different aspect combinations that they may be conjugated in.  The lexicon lists 550 roots, expressed using 2100 stems.  And it's all irregular.  All of this verbal morphology, for the entire language, is irregular.  There are no rules which would yield the pronounced forms, as far as I can see.
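For illustration only, here is how one might represent such a lexicon computationally.  The root and stem forms are placeholders (nothing here is actual Navajo), and the category labels are chosen merely for flavor; the point is that each root simply memorizes a stem for each occupied cell of its matrix.

```python
# How one might store such a lexicon (all forms are PLACEHOLDERS, not
# actual Navajo): each root memorizes a stem for each occupied cell of
# its two-dimensional aspect matrix, since no rule derives stem shapes.

lexicon = {
    "ROOT-1": {("imperfective", "momentaneous"): "stem-1a",
               ("perfective",   "momentaneous"): "stem-1b",
               ("imperfective", "continuative"): "stem-1c"},
    "ROOT-2": {("imperfective", "momentaneous"): "stem-2a",
               ("perfective",   "momentaneous"): "stem-2b"},
}

def stem(root, mode, aspect):
    """Look up the memorized stem; there is no rule to fall back on."""
    return lexicon[root].get((mode, aspect))   # None if the cell is unattested

print(stem("ROOT-1", "perfective", "momentaneous"))   # -> stem-1b
```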

Now, there is in fact some regular inflection on the verbs, such as subject and object agreement, and some of the aspectual prefixes are sort of regular and are sorted into multiple verbal classes, but the stems expressing the various aspect combinations are irregular.  A theory of aspectual meanings and their combinatorial possibilities would be greatly desirable.  My cross-linguistic surveys of aspectual systems tell me that the study of aspect is a total mess.  There are different terminologies used every time you turn around and look at a new language family.

When I look at Navajo I'm reminded that a major gap in mathematical linguistics is a theory of morphosyntax.  These Navajo verbs are sufficiently expressive that they can serve as a complete sentence, so long as you're happy to speak using pronouns.  The pronouns themselves are the agreement morphemes on the verb. There are hundreds of other verbal prefixes that serve to add specific characters to the action, like "going on and on", "descending from a height", "shape of a circle", and so forth.

If I could develop a theory of anything that would work for Navajo, I'd know I'd accomplished something important.