Syntactic "parts of speech" have bothered me for many years. If you are mathematically minded and not a traditional linguist, they probably bother you, too. There was a nice piece by Shuly Wintner in the recent issue of Computational Linguistics, in which he noted the "theoretical bankruptcy" of this field. How does this relate to parts of speech? Because computational linguists generally don't have a clue when it comes to parts of speech. I mean seriously, have you ever examined the tagging manual for the CLAWS tagset? (A quick web search will turn it up.) This tagset (whichever alternative you choose) is the most incredibly ad hoc extension of the basic parts of speech that have descended to us from the ancient Greek grammarians. For modern computational linguists to actually use it is rather like modern physicists relying more on Aristotle than on Einstein.
If the notion of a "part of speech" is to have any valid theoretical foundation, then it must surely be formalized as a kind of "word usage class." It then reasonably follows that a good system of word usage classes should be inducible from language data. My hypothesis is that this is what the human learner does. The methodological million-dollar question then becomes: can we induce good parts of speech from straight text (string-only data), or do we need syntactic structural information to get it right? My work in type-logical grammar, such as a recent paper published in JoLLI, uses structural information. This leads to the bootstrapping problem of where that structural information comes from, and what it means to have structures prior to having any parts of speech. Plus, the algorithms for it are grotesque, intractable batch unifiers, generally useless for practical work.
I am wondering how much can be achieved by inducing parts of speech from string-only data. I have a current Master's student who is trying, by modifying some code posted by Alex Clark for a similar task. There are numerous difficulties, including the fact that the only contexts available to determine a word's usage in string-only data are the nearby words, which brings in n-gram models. Hardly a theoretically wonderful approach to language. Another big problem is word ambiguity; the same word-form often has several parts of speech, and really is being used as a different word in each case. Clark's paper in the 2000 CoNLL proceedings tries to address this. Dan Klein's dissertation attacks the problem as well, but he appears to be evaluating against the old grammarians' system as a gold standard. This seems a bit backward to me. Does anyone know of work in this area that is getting anywhere? Part-of-speech induction seems like an area in which only a few researchers are dabbling, so there is not yet a clear methodology that has got everyone's attention.
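To make the "nearby words as context" idea concrete, here is a minimal sketch in Python. It is not Clark's code or algorithm, just an illustration of the general distributional approach: each word-form gets a vector counting its immediate left and right neighbors, and two word-forms are compared by cosine similarity. The function names, the boundary markers `<s>` and `</s>`, and the toy corpus are all my own assumptions for the sake of the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences):
    """Map each word-form to counts of its immediate left/right neighbors.

    Features are ("L", neighbor) and ("R", neighbor) pairs; "<s>" and "</s>"
    are hypothetical sentence-boundary markers added for this sketch.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] + sentence + ["</s>"]
        for i in range(1, len(padded) - 1):
            word = padded[i]
            vectors[word][("L", padded[i - 1])] += 1
            vectors[word][("R", padded[i + 1])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus (made up for illustration, not real data).
toy_corpus = [
    "the dog sleeps".split(),
    "the cat sleeps".split(),
    "a dog runs".split(),
    "a cat runs".split(),
]

vectors = context_vectors(toy_corpus)

# "dog" and "cat" occur in identical contexts here, so they look like
# members of one usage class; "dog" and "sleeps" share no contexts at all.
print(cosine(vectors["dog"], vectors["cat"]))
print(cosine(vectors["dog"], vectors["sleeps"]))
```

Note that this sketch keeps a single vector per word-form, so it runs straight into the ambiguity problem: a form used as both noun and verb gets one blended vector rather than one per usage.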
In the end, the parts-of-speech problem is a huge embarrassment for linguistics. If we are tripped up by something so fundamental, it really shows that we are still at the beginning stages of the field's development.