Friday, September 28, 2012

Bloomfield vs. Harris on the road to learnability

For some years I've been aware of work by Alex Clark on learnability of substitutable languages.  It always sort of reminded me of my own work on learning what I've called "syntactically homogeneous" languages.  This week I sat down and analyzed our approaches to figure out the differences, and it turns out that there are some pretty big ones, even if the driving spirits behind the two approaches are very similar.

I'm directly comparing a paper by Clark and Eyraud [in the ALT 2005 proceedings, edited by Jain, Simon, and Tomita] to my own paper published in JoLLI in 2010.  Both papers strive to find out what is learnable, and how, by restricting attention to languages that are somehow "reasonable" from rather older perspectives.  Both papers, in fact, appeal to criteria put forth by giants of the American Structuralist school.

Clark and Eyraud invoke Zellig Harris's notion of substitutability, and define a strongly substitutable context-free language as one in which any two words substitutable in some string context are substitutable in every string context where they occur.  They go on to prove that such context-free languages can be learned by a specific efficient algorithm (identification in the limit from positive data, in polynomial time).  This condition puts the focus, if you will, on the words themselves in order to define how the language is laid out.
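As a rough illustration of the word-level version of this condition (my own sketch, not Clark and Eyraud's actual construction), strong substitutability of a finite sample can be checked directly: collect each word's set of string contexts, and demand that any two words sharing one context share them all.

```python
from collections import defaultdict

def contexts_by_word(sentences):
    """Map each word to the set of string contexts (left, right) it occupies."""
    ctxs = defaultdict(set)
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            ctxs[w].add((tuple(words[:i]), tuple(words[i+1:])))
    return ctxs

def strongly_substitutable(sentences):
    """True iff any two words sharing one context share all contexts."""
    ctxs = contexts_by_word(sentences)
    words = list(ctxs)
    for i, a in enumerate(words):
        for b in words[i+1:]:
            if ctxs[a] & ctxs[b] and ctxs[a] != ctxs[b]:
                return False
    return True

sample = ["Tom hit John", "John hit Tom", "He hit John", "John hit him"]
print(strongly_substitutable(sample))  # → False: "He" and "Tom" share one context but not all
```

Note that this only tests word-for-word substitution in whole-sentence contexts; their definition ranges over substrings, so this is a necessary check, not the full condition.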

I, on the other hand, go all the way back to the founder of American Structuralism, Leonard Bloomfield, for my inspiration in trying to formulate a reasonably restricted type of language to learn.  I'm also dealing with term-labeled tree languages consisting of tree-structured sentences annotated by lambda terms, but that's not critical to the points here.  I try to channel Bloomfield's ideas about what "good parts of speech" ought to be, and put the focus on contexts for words, rather than words themselves.  A term-labeled tree context is basically a term-labeled tree with a hole in it where a word might go.  The condition I formulated to encapsulate Bloomfield's ideas was that, roughly, two shape-equivalent contexts should have the same expressions (and words) able to instantiate them in the language.  Such a language has "well-defined parts of speech" and is thus syntactically homogeneous in my terms.

My learnability analysis is quite different from Clark and Eyraud's, though.  I showed how a certain class of such term-labeled tree languages can be exactly learned from a finite set of "good examples" using a specific type unification procedure.
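The unification step at the heart of such a procedure can be illustrated with a generic Robinson-style unifier over simple functional types.  This is my own minimal sketch under assumed encodings (variables as strings starting with '?', functional types as pairs), not the algorithm from either paper.

```python
def walk(t, subst):
    """Follow variable bindings in the substitution."""
    while isinstance(t, str) and t in subst:
        t = subst[t]
    return t

def unify(t1, t2, subst=None):
    """Robinson-style unification over simple types.
    A type is a variable (string starting with '?'), an atom (other
    string), or a pair (arg_type, result_type) for functional types.
    Returns a substitution dict, or None on clash.  (No occurs
    check, to keep the sketch minimal.)"""
    if subst is None:
        subst = {}
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == t2:
        return subst
    if isinstance(t1, str) and t1.startswith('?'):
        return {**subst, t1: t2}
    if isinstance(t2, str) and t2.startswith('?'):
        return {**subst, t2: t1}
    if isinstance(t1, tuple) and isinstance(t2, tuple):
        subst = unify(t1[0], t2[0], subst)
        if subst is None:
            return None
        return unify(t1[1], t2[1], subst)
    return None  # clash, e.g. atom 'NP' vs a functional type

# Two hypothesized types for a transitive verb, merged by unification.
print(unify(('?x', ('NP', 'S')), ('NP', '?y')))
# → {'?x': 'NP', '?y': ('NP', 'S')}
```

Unifying two hypothesized types for the same word, as above, is the kind of step that lets a learner collapse candidate categories rather than enumerate them.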

So how do these differences play out in the learnable classes of languages for each approach?  Consider the following set of sentences:
i) Tom hit John.
ii) John hit Tom.
iii) He hit John.
iv) John hit him.
It seems to me that Clark and Eyraud's approach would require "Him hit John," "John hit he," and other dubious members to be added to the language, or else it is not strongly substitutable.  The above set of sentences is already syntactically homogeneous on my account, however.  My feeling is that my notion of syntactic homogeneity is more likely to be a property of natural languages.  But Clark and Eyraud do have their provable tractability results to beat me up with.
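To make the worry concrete, here is a toy construction of my own (a word-level sketch, not Clark and Eyraud's actual learner): merge any two words that ever share a context into one class, then regenerate each sentence slot by slot from the classes.

```python
import itertools
from collections import defaultdict

def substitution_closure(sentences):
    """Word-level substitution closure of a finite sample: words that
    ever share a context are treated as fully intersubstitutable, and
    sentences are regenerated slot by slot from the resulting classes."""
    # Union-find over words, merging words that share a context.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    owners = defaultdict(set)  # context -> words seen in it
    for s in sentences:
        ws = s.split()
        for i, w in enumerate(ws):
            owners[(tuple(ws[:i]), tuple(ws[i+1:]))].add(w)
    for group in owners.values():
        group = list(group)
        for w in group[1:]:
            parent[find(w)] = find(group[0])

    classes = defaultdict(set)  # class representative -> members
    for s in sentences:
        for w in s.split():
            classes[find(w)].add(w)

    closed = set()
    for s in sentences:
        slots = [sorted(classes[find(w)]) for w in s.split()]
        for combo in itertools.product(*slots):
            closed.add(" ".join(combo))
    return closed

closed = substitution_closure(
    ["Tom hit John", "John hit Tom", "He hit John", "John hit him"])
print("him hit John" in closed)  # → True: a dubious sentence is forced in
print("John hit He" in closed)   # → True
```

Even this crude word-level closure already injects sentences a speaker would reject, which is exactly the worry about imposing substitutability on natural language samples.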


  1. You are quite right that natural languages are not substitutable -- but that paper presents a non-probabilistic learner, so the learning model is too strict. One can enlarge the class of languages greatly using probabilistic models or membership queries.

    Learning tree languages is a different problem...

  2. OK, fair enough. At this point I am still trying to figure out if natural languages are plausibly "syntactically homogeneous" in the sense I defined, or some similar sense. Maybe I'll write a paper just on that, for a general linguistics journal.