A recent paper by Ed Stabler (in Language Universals, Christiansen et al., eds., 2009) focuses on an important question that is rarely formulated in the literature: what structural properties of natural languages guarantee learnability? We know from a variety of negative results, going back to Gold's famous theorem, that such properties must go far beyond the defining principles of the Chomsky hierarchy, since none of the traditional language classes except the finite languages is strictly learnable in the sense of identifiability in the limit. Because we presume to model natural languages as infinite languages of some sort, something else must be going on: there must be restrictions on the possible forms of natural language that permit learnability in some sense. Stabler says that to address this question, we need a proposal about how human learners generalize from finite data. There is as yet no complete answer to this problem, and indeed very little current research seems to be motivated by such questions.
While I do not generally use this forum to highlight my own published work, in this case I believe that my 2010 paper in the Journal of Logic, Language and Information (together with an erratum published this year) addresses Stabler's question directly. The paper is titled "Grammar Induction by Unification of Type-Logical Lexicons," and it lays out the basic proposal: human learners generalize from finite data by unifying the sets of syntactic categories discovered by an initial semantic bootstrapping procedure. The bootstrapping procedure extracts basic category information from semantically annotated sentence structures (this is highly enriched data, to be sure, but I argue in general terms that it is plausible). The basic system of categories is then unified by a two-step process that takes into account the distribution of the words (usage patterns, expressed structurally) and is able to generalize to "recursively redundant extensions" of the learning data. This is a specific proposal of the sort invited by Stabler. The resulting learnable class of languages is highly restricted, including only those that are in some sense closed under recursively redundant extensions. This accords with general thinking about human languages: a recognized recursive procedure (such as appending prepositional phrases to expand noun phrases, in rough terms) can normally be applied indefinitely to yield grammatical sentences of increasing length.
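To give a rough feel for what "unification of categories" means computationally, here is a minimal sketch of plain first-order unification over a toy categorial lexicon. It is only an illustration under my own simplifying assumptions, not the two-step procedure from the paper, and it does nothing about recursively redundant extensions. The category encoding and the names `unify`, `resolve`, and `unify_lexicon` are all hypothetical choices made for this example.

```python
# Toy encoding (an assumption for this sketch):
#   atom      -> a string such as "s" or "np"
#   variable  -> a string starting with "?", such as "?x1"
#   complex   -> a tuple (op, left, right) with op in {"/", "\\"}

def is_var(c):
    return isinstance(c, str) and c.startswith("?")

def walk(c, subst):
    """Follow variable bindings until a non-variable or unbound variable."""
    while is_var(c) and c in subst:
        c = subst[c]
    return c

def occurs(v, c, subst):
    """True if variable v appears inside category c (prevents cyclic bindings)."""
    c = walk(c, subst)
    if c == v:
        return True
    if isinstance(c, tuple):
        return occurs(v, c[1], subst) or occurs(v, c[2], subst)
    return False

def unify(a, b, subst):
    """Return a substitution extending subst that unifies a and b, or None."""
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return None if occurs(a, b, subst) else {**subst, a: b}
    if is_var(b):
        return None if occurs(b, a, subst) else {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0]:
        s = unify(a[1], b[1], subst)
        return None if s is None else unify(a[2], b[2], s)
    return None  # clash between distinct atoms or different connectives

def resolve(c, subst):
    """Apply the substitution throughout a category."""
    c = walk(c, subst)
    if isinstance(c, tuple):
        return (c[0], resolve(c[1], subst), resolve(c[2], subst))
    return c

def unify_lexicon(lexicon):
    """Collapse each word's candidate categories into one, or return None."""
    subst = {}
    for cats in lexicon.values():
        for other in cats[1:]:
            subst = unify(cats[0], other, subst)
            if subst is None:
                return None
    return {w: resolve(cats[0], subst) for w, cats in lexicon.items()}

# Two observed uses of "sleeps" force the unknown category of "john" to be np.
toy = {
    "john": ["?x1"],
    "sleeps": [("\\", "?x1", "s"), ("\\", "np", "s")],
}
result = unify_lexicon(toy)
```

In this toy run the variable `?x1` gets bound to `np`, so both words end up with a single category each. A lexicon whose candidate categories clash (say, a word seen as both `np` and `s`) simply fails to unify, which is the sense in which unification restricts what can be learned.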
I hope in the future to highlight my findings in a more general journal like Cognitive Science. It helps that Stabler has provided the leading question to my purported answer.