Saturday, February 27, 2010

Kornai's Chapter 2

Chapter 2 of Kornai's "Mathematical Linguistics" presents the basic elements familiar to many of us, such as the Chomsky hierarchy of languages. But the presentation is quite different from the usual one: it takes a strongly algebraic perspective and includes a few facts I had never really considered.

The most interesting things I learned involve analogies between the language hierarchy and recursion-theoretic views of the numbers. In particular, "the Kleene theorem guarantees that regular languages have the same distinguished status among languages that the rationals in Q have among numbers in R" (i.e. the reals). Moreover, "context-free languages play the same role among languages that algebraic numbers play among the reals." As a reminder, algebraic numbers are those reals that are roots of nonzero polynomials with integer coefficients, while the other reals are called "transcendental," with \pi as the canonical example. (One day I am going to have to enable this blog for MathML.) It turns out that the transcendental numbers are, in a well-defined sense, infinitely more numerous than the algebraic numbers: the algebraic numbers are countable, while the transcendentals are uncountable. I guess this fact carries over to the language hierarchy, so that languages beyond context-free are somehow infinitely more numerous? I'm not sure about this whole analogy; it is really something I just learned from the book.
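The size comparison behind "infinitely more numerous" can be written out explicitly. The following is only a sketch for a fixed nonempty finite alphabet \Sigma; the symbols REG_\Sigma and CFL_\Sigma (the regular and context-free languages over \Sigma) are notation introduced here for illustration, not Kornai's.

```latex
\[
  |\mathbb{Q}| \;=\; |\{x \in \mathbb{R} : x \text{ algebraic}\}| \;=\; \aleph_0
  \;<\; 2^{\aleph_0} \;=\; |\mathbb{R}|,
\]
\[
  |\mathrm{REG}_\Sigma| \;=\; |\mathrm{CFL}_\Sigma| \;=\; \aleph_0
  \;<\; 2^{\aleph_0} \;=\; |\mathcal{P}(\Sigma^{*})| .
\]
```

The countability on the language side follows because every regular or context-free language has a finite description (an automaton or a grammar), and there are only countably many finite descriptions. The same argument applies to any class defined by finite devices, including the recursively enumerable languages, so the uncountable remainder consists of languages that have no grammar at all rather than of languages sitting just above the context-free level.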

Thursday, February 25, 2010

Mathematical Linguistics by Kornai

I only just realized that Andras Kornai published the book "Mathematical Linguistics" in 2008. I am currently reading it and it is sparking ideas in a number of areas. I highly recommend it; it is a broad view of linguistics from a mathematical perspective.

In this blog I would like to highlight each chapter in turn. We don't get great books to study very often in this field. The first chapter is introductory, but it makes clear the author's rather unique view of things. The most interesting idea put forth there is that mathematical linguistics, viewed as a mathematical theory, is astoundingly complex, and encompasses far more axioms than is typical of well-studied areas of mathematics. This brings Kornai to the analogy with physics, and the idea that mathematical linguistics lies in a "mesoscopic" regime between the microscopic (which can be fully specified and axiomatized) and the macroscopic (which would presumably be typified in mathematics by nonlinear systems that can be chaotic and are impossible to specify axiomatically).

Tuesday, February 23, 2010

Book review: Mathematics of Language

A while ago I reviewed "The Mathematics of Language" by Marcus Kracht.
This is posted in the open access eLanguage, under Book Notices.
Here is the link:
http://elanguage.net/blogs/booknotices/?p=19

I'm putting this on the blog because I believe that very few people actually interested in this book are regular readers of Language or eLanguage.

The review is very general, intended for general linguists, but still offers something that might be of interest to people here.

Thursday, February 18, 2010

Parts of speech -- what are they again?

Syntactic "parts of speech" have bothered me for many years. If you are mathematically minded and not a traditional linguist, they probably bother you, too. There was a nice piece by Shuly Wintner in the recent issue of Computational Linguistics, in which he noted the "theoretical bankruptcy" of this field. How does this relate to parts of speech? Because computational linguists generally don't have a clue when it comes to parts of speech. I mean seriously, have you ever examined the tagging manual for the CLAWS tagset? (Google this and you can). This tagset (whichever alternative you choose) is the most incredibly ad hoc extension of the basic parts of speech which have descended to us from the ancient Greek grammarians. For modern computational linguists to actually use this suggests that perhaps modern physicists should rely more on Aristotle than Einstein.

If the notion of a "part of speech" is to have any valid theoretical foundation, then it must surely be formalized as a kind of "word usage class." It then reasonably follows that a good system of word usage classes should be inducible from language data. My hypothesis is that this is what the human learner does. The methodological million dollar question then becomes: can we induce good parts of speech from straight text (string-only data), or do we need syntactic structural information to get it right? My work in type-logical grammar, such as a recent paper published in JoLLI, uses structural information. This leads to the bootstrapping problem of where you get that information, and what it means to have structures prior to having any parts of speech. Plus the algorithms for it are grotesque, intractable batch unifiers, and generally useless for practical work.

I am wondering how much can be achieved by inducing parts of speech from string-only data. I have a current Master's student who is trying, by modifying some code posted by Alex Clark for a similar task. There are numerous difficulties, including the fact that the only contexts available to determine a word's usage in string-only data are the nearby words, and this invokes n-gram models. Hardly a theoretically wonderful approach to language. Another big problem is word ambiguity; the same word-form often has several parts of speech, and really is being used as a different word in each case. Clark's paper in the 2000 CoNLL proceedings tries to address this. Dan Klein's dissertation attacks the problem as well, but he appears to be evaluating against the old grammarians' system as a gold standard. This is a bit backward, to me. Does anyone know of work in this area that is getting anywhere? Part of speech induction seems like an area in which only a few researchers are dabbling, so there is not yet a clear methodology that has got everyone's attention.
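To make the "nearby words" idea concrete, here is a minimal toy sketch of distributional similarity from string-only data. It is not Clark's algorithm or code, and it is not the student's project; the function names, the boundary markers `<s>` and `</s>`, and the four-sentence corpus are all invented for illustration. Each word form gets a vector counting its immediate left and right neighbors, and two words are compared by the cosine of those vectors.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences):
    """Map each word form to counts of its immediate left/right neighbors."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        # Hypothetical boundary markers so edge words also get two contexts.
        padded = ["<s>"] + sentence + ["</s>"]
        for i in range(1, len(padded) - 1):
            word = padded[i]
            vectors[word][("L", padded[i - 1])] += 1
            vectors[word][("R", padded[i + 1])] += 1
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word form whose context vector is closest to word's."""
    target = vectors[word]
    candidates = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return max(candidates)[1]


# Invented toy corpus; real induction would need far more text.
TOY_CORPUS = [
    "the dog sleeps".split(),
    "the cat sleeps".split(),
    "a dog runs".split(),
    "a cat runs".split(),
]

if __name__ == "__main__":
    vecs = context_vectors(TOY_CORPUS)
    for w in ("dog", "sleeps", "the"):
        print(w, "->", most_similar(w, vecs))
```

The sketch also shows the ambiguity problem directly: there is exactly one vector per word form, so a form used both as a noun and as a verb ends up with a blend of the two context profiles rather than two separate usage classes.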

In the end, the parts of speech problem is a huge embarrassment for linguistics. It really shows that we are still at the beginning stages of the field's development, if we are tripped up by something so fundamental.

Tuesday, February 16, 2010

Categorial grammar, in some form

Since 1997 or so I have been busy with type-logical grammar, as known from the books by Morrill (1994) or myself (2004), and various other sources such as Moortgat's fine paper in the Handbook of Logic and Language (1997). These systems became quite baroque and complicated with the whole "multimodal" agenda, amply studied by Moot in his PhD thesis (2002).

Competing with the type-logical approach, one finds the combinatory categorial grammar championed by Steedman in a number of sources such as his books of 1996 and 2001. Now there is also the system known as "pregroup grammars" introduced by Lambek in the now-defunct journal Grammars in 2001. He had complained that the multimodal type-logical grammars were too complicated.

Now the question is, which of these frameworks is the most lively? Certainly pregroup grammars have that new-car smell in their first decade, and are being studied. But I am greatly encouraged by type-logical projects such as Jaeger's book on anaphora in the Springer series Trends in Logic. Linguist friends who know of my type-logical affections have often asked me about linguistic problems like the handling of anaphora and ellipsis. I have always demurred, sweeping such matters under the rug in my desire to just work on what I wanted, but also confident that these things could be taken care of. I have reason to believe that type-logical grammars are still a very good framework worthy of continuing study. I think their mathematical connections are quite rich, and my mind is full of things that will wait for future posts.

Monday, February 15, 2010

Opening lines. . .

Welcome to my blog about things linguistic and mathematical, and most frequently both at once.

This is to be a venue mostly aimed at professional researchers in linguistics, computational linguistics, and cognitive science, but mathematicians may also find much of interest. I am always interested in how mathematics can be applied to the study of language, and that often relates to cognition and computation, so here we are.

I've thought about doing a blog for a few years, always saying to myself I might start one when I have more time. Well, I don't have any more time but I am perhaps learning to type faster, so I just decided to give it a shot.

Watch for random musings of mine on math and language and how they relate. I wanted a venue to air out my mind that was less formal than publications, but still could access the broader community. I eagerly hope for serious commentary.

Current linguistics blogs seem to be of three kinds: there are maybe a few serious ones about linguistics, there are some of the usual things that are aimed at laypeople interested in language, and there are a few kind of nonserious blogs by grad students that are not very research oriented. This is the first blog devoted to research topics in the mathematics of language, I think. If it is not, please let me know.

Next post, something important.