Thursday, October 14, 2010

Topology of language spaces

Very little has been written, it seems, about the topology of language spaces. Fix a (finite) vocabulary V. Then the Kleene-star set V* represents the (countably infinite) set of all expressions over V. The language space P(V*) is the set of all sets of expressions, so it is the power set of V*, meaning it has the cardinality of the reals. It is well known that most of these languages are uncomputable (non recursively enumerable), but it would be useful to provide a good topology to "metrize" the language space somehow. For an application of this, it would be very interesting to provide a suitable metric of the "distance" between two languages in a language space.

Very little work has been published which attacks this problem. The definitive work appears to be the PhD dissertation by David Kephart (University of South Florida, 2005) "Topology, Morphisms, and Randomness in the Space of Formal Languages." This dissertation first studies the only prior work on the subject, which defined the Bodnarchuk metric as a distance between languages. This is, roughly, the shortest expression length at which differences between the two languages appear. The resulting metric space of languages is in essence the Cantor space. It is an unsatisfying metric for linguistic, or any other, purposes.

Let's think for a moment about the problem to be modeled topologically. Consider two dialects of English, call them AE and BE. Assuming they have the same vocabulary (we can agree, I hope, that vocabulary differences alone are mundane and easy to describe), any differences between them have to be "grammatical," meaning speakers of AE may assent to certain sentences which are rejected by BE speakers as ungrammatical. We might like to know, then, how "far apart" are these dialects grammatically? As soon as any major grammatical difference appears, given the recursive nature of natural languages, it becomes quite likely that any difference between potential construction patterns will give rise to an infinite number of sentences which distinguish AE from BE. But yet, if AE and BE still agree on the majority of construction patterns, they may appear at a random glance to still be very nearly the same. So then, there is no hope for a simplistic metric like the cardinality of the distinguishing set of sentences, because in any interesting case the two dialects will be distinguished by an infinite number of sentences.

With that, let me get back to Kephart's thesis. After analyzing what he calls the "Cantor distance" mentioned above as fairly worthless, he goes on to propose two new metrics, which are in fact pseudo-metrics, because they give the distance 0 between some pairs of distinct languages. The first of these is called the Besicovitch norm when applied to one language, or Besicovitch distance when applied to the symmetric difference between two languages. (The name stems from the originator of the metric, where it was first applied in a treatment of cellular automata). Its definition is complicated because it counts cardinality by sectioning. The Besicovitch norm (or size of a language, by this measure) is defined as the least upper bound on limits of ratios between the cardinality of sections (up to length k) of the language to the cardinality of up to k-length sections of V*. To get a distance between AE and BE, simply use this norm to measure the size of the symmetric difference set between the languages. This quantity is a surjective mapping from P(V*) onto [0, 1], so all distances are a positive fraction at most 1.

The resulting pseudo-metric language space is not very nice topologically because it isn't even T0, but this can be improved by taking the quotient space which identifies all the "equivalent" distinct languages. This quotient space is called the Besicovitch topology. In it, all the finite languages have size 0, and so cannot contribute anything to the distance between two languages. This alone is a positive feature, in my view; for one thing, it implies that two languages which differ by only finitely many sentences are essentially the same. But the Besicovitch distance doesn't quite capture an intuitive feeling of the distance between languages for a number of reasons.

Kephart then proposes a refinement, a differently defined "entropic" pseudo-metric. It is this which I feel is getting close to a linguistically relevant language distance. As he summarizes, "where the Cantor metric hinges on the first apparent distinction between languages and disregards the rest, and the Besicovitch distance relies upon the total proportion of the monoid (V*) exhausted by language distinctions regardless of the expansion rate of this proportion, the entropic distance will take into account both the appearance of distinctions and their rate of increase." Instead of computing a proportion from the size of the up to k-length section of a language (or symmetric difference between languages), the entropic norm of a language computes a normalized topological entropy of the precisely k-length section. This is relevant because the topological entropy of each k-length section measures the exponential growth rate at that point. So, the entropic norm is defined as the "lim sup" (least upper bound on the limits) of this normalized entropy as k increases. The entropic distance between languages AE and BE is once again simply the entropic norm of their symmetric difference set.

I will need to do some further research to see how much I like this measure of the distance between languages, and the resulting topological language space. It certainly seems promising to me at the moment, and a further analysis could provide for a great deal of work for me or whoever was interested.


  1. Good, I was looking for people with similar view points to my own. I think using concepts of topology and geometry we can really study how languages form and change.

    Currently we have very little besides guess work to determine how languages work. If we can show that languages can be thought of as say manifolds, then we can apply all the available concepts of topology and geometry (including differential topology and differential geometry) to them.

  2. Sir, can you please elaborate your viewpoint?
    I know there is some categorical correlation between language emergence and topological structures but its baffling to think how it could be linked together.