Thursday, February 18, 2010

Parts of speech -- what are they again?

Syntactic "parts of speech" have bothered me for many years. If you are mathematically minded and not a traditional linguist, they probably bother you, too. There was a nice piece by Shuly Wintner in the recent issue of Computational Linguistics, in which he noted the "theoretical bankruptcy" of the field. How does this relate to parts of speech? Because computational linguists generally don't have a clue when it comes to parts of speech. I mean seriously, have you ever examined the tagging manual for the CLAWS tagset? (Google it and you can.) This tagset (whichever alternative you choose) is an incredibly ad hoc extension of the basic parts of speech that have descended to us from the ancient Greek grammarians. For modern computational linguists to actually use it suggests that perhaps modern physicists should rely more on Aristotle than Einstein.

If the notion of a "part of speech" is to have any valid theoretical foundation, then it must surely be formalized as a kind of "word usage class." It then reasonably follows that a good system of word usage classes should be inducible from language data. My hypothesis is that this is what the human learner does. The million-dollar methodological question then becomes: can we induce good parts of speech from raw text (string-only data), or do we need syntactic structural information to get it right? My work in type-logical grammar, such as a recent paper published in JoLLI, uses structural information. This raises a bootstrapping problem: where does that structural information come from, and what does it mean to have structures before having any parts of speech? Moreover, the algorithms involved are grotesque, intractable batch unifiers, generally useless for practical work.

I am wondering how much can be achieved by inducing parts of speech from string-only data. A current Master's student of mine is trying, by modifying some code posted by Alex Clark for a similar task. There are numerous difficulties, including the fact that the only contexts available for determining a word's usage in string-only data are the nearby words, and this invokes n-gram models. Hardly a theoretically wonderful approach to language. Another big problem is word ambiguity: the same word-form often has several parts of speech, and is really being used as a different word in each case. Clark's paper in the 2000 CoNLL proceedings tries to address this. Dan Klein's dissertation attacks the problem as well, but he appears to evaluate against the old grammarians' system as a gold standard, which seems a bit backward to me. Does anyone know of work in this area that is getting anywhere? Part-of-speech induction seems to be an area in which only a few researchers are dabbling, so no clear methodology has yet captured everyone's attention.
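To make the string-only setting concrete, here is a minimal sketch of the distributional idea, using a toy corpus of my own invention: represent each word-form by counts of its immediate left and right neighbors, and compare word-forms by cosine similarity. This is an illustration of the general technique, not anyone's actual system.

```python
from collections import Counter, defaultdict
import math

# Toy corpus (my own); real induction needs millions of tokens.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "a cat ran on a mat . a dog ran on a rug .").split()

def context_vectors(tokens):
    """Represent each word-form by counts of its immediate
    left and right neighbors (a 1-word n-gram context)."""
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        if i > 0:
            vecs[w]["L:" + tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            vecs[w]["R:" + tokens[i + 1]] += 1
    return vecs

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

vecs = context_vectors(corpus)
# Words occurring in the same local contexts come out similar:
print(cosine(vecs["cat"], vecs["dog"]))  # 1.0: identical contexts here
print(cosine(vecs["cat"], vecs["sat"]))  # 0.0: no shared contexts
```

Real systems replace the pairwise comparison with clustering over such vectors, and run straight into the ambiguity problem noted above: a single vector per word-form conflates its distinct uses.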

In the end, the parts of speech problem is a huge embarrassment for linguistics. It really shows that we are still at the beginning stages of the field's development, if we are tripped up by something so fundamental.


  1. Hi! Congratulations on the new blog!

    Regarding the claim that "the **only** contexts available to determine a word usage in string-only data are the nearby words", aren't you overlooking the form of the words themselves?

    Here is an example: when a human sees several words ending with, say, the suffix "-ly", he will tend to group them into the same POS category (which he could later label as the category of "adverbs"...), independently of which other words are nearby.

    Also, there are languages where the word order is free. For these languages, we cannot rely on nearby words to induce a category for a word.

    What do you think?

  2. Well OK,

    first tell all your friends about this blog.

    I agree that morphological information can certainly be harnessed, but with many pitfalls.
    English, for instance, has the suffix -s, which is homophonous between the plural noun marker and the 3rd-person singular verb marker.
    We also have many frequently homophonous word-forms like "likes, kills, hunts", each with both a noun and a verb reading.
    I believe that Alex Clark's 2003 paper (ACL conference) makes an attempt at combining distributional and morphological information.

    Your remark about free word order is well-taken. This puts the nail in the coffin, as it were, for pure string-only approaches to part of speech induction. I guess we have to use configurational information, probably mostly from the semantic structure.

  3. Yes. English is a very messy language. All the other languages that I speak have more structured and reliable morphology. And in those languages, morphology is more reliable than word order for identifying part of speech.

    It has also just occurred to me that word order and the form of words influence each other. For example, I have always thought that the suffixes "-er" and "-est" evolved from a particular word order:

    "fast + more" ---> "faster"
    "fast + most" ---> "fastest"

    Similarly, the rich verb-conjugation suffix system of my native language, Portuguese, seems to have resulted from the contraction of a verb with an auxiliary verb occurring in a particular word order.

    Conversely, the form of words indicates how they are pronounced, and people probably tend to choose word orders that sound better. Therefore, morphology influences word order too.

    But I am just guessing. I am a logician, not a linguist. I wonder if these observations about the evolution of languages really make sense, or if they are just coincidences...

  4. Indeed, it is a long-held slogan of historical linguists that "yesterday's syntax is today's morphology." This reflects the general presumption that perhaps most morphology results from "fusion" of adjacent words. One sees this in action in modern English, with the development of such modernisms as "coulda" from "could have" and "gonna" from "going to."

    The only problem is that morphological fusion takes so damn long that it is not, to my knowledge, a fully proven thesis: no language has been documented over a long enough span.

  5. Laszlo Kalman just gave a talk at the Hungarian Computational Linguistics meeting arguing that CL, rather than being theoretically bankrupt, is just conservative, showing a strong preference for the old structuralist theories that were designed very much with data and discovery procedures in mind. I believe that the old definitions of lexical categories, based on distributional equivalence, are very much applicable and retain their full power. First, in languages with significant morphology, two words belong in the same POS iff they share the same paradigm; see classic pieces by Bloch and Trager, Nida, Harris, and others (I think the standard pedagogical intro is Hockett's 1958 A Course in Modern Linguistics). One has to be careful with items that have defective paradigms, and a lot of indeclinabilia remain outside the category system; these we deal with individually. When the morphology is less extensive, word order is far more rigid, and distributional criteria based on neighboring words become more applicable.
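The paradigm-based criterion in the last comment can be sketched computationally. Here is a minimal Python illustration (the wordlist and suffix inventory are toy assumptions of mine) that assigns two stems to the same class iff they are attested with exactly the same set of suffixes:

```python
from collections import defaultdict

# Toy wordlist and suffix inventory -- purely illustrative assumptions.
words = ["walk", "walks", "walked", "walking",
         "jump", "jumps", "jumped", "jumping",
         "cat", "cats", "dog", "dogs"]
suffixes = ["", "s", "ed", "ing"]
vocab = set(words)

def paradigm(stem):
    """The set of suffixes with which this stem is attested."""
    return frozenset(s for s in suffixes if stem + s in vocab)

# Distributional equivalence over paradigms: two stems share a
# category iff they are attested with exactly the same suffix set.
classes = defaultdict(list)
for w in words:
    classes[paradigm(w)].append(w)

for para, members in sorted(classes.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(para), "->", members)
# "walk"/"jump" share the full verbal paradigm; "cat"/"dog" share
# only the bare and -s forms; suffixed forms fall out separately.
```

Defective paradigms and indeclinabilia surface here as singleton or degenerate classes, which is exactly where the "deal with individually" caveat applies.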