Summary:
This is a reasonably detailed account of the statistical procedures we used to test our hypothesis. It is intended to be as non-technical as possible, but you may still find it pretty heavy going. For the really technical details please consult our paper.
Our hypothesis concerns the relationship between a typological linguistic feature (namely, tone) and the "derived" alleles (variants) of two human genes (ASPM and Microcephalin, which will be referred to in the following as ASPM-D and MCPH-D). We tested the hypothesis using linguistic and genetic data from 49 populations from the Old World. While we did not choose these specific populations, we gathered genetic data in the form of allele frequencies and linguistic data in the form of values of typological features from a variety of sources.
Step 1a: calculating the relationship between tone and each "derived" allele considered separately. We first computed the correlation (Pearson's r) between tone and ASPM-D (r = -0.53, p = 9.63 × 10^-5) and between tone and MCPH-D (r = -0.54, p = 7.22 × 10^-5). Both are quite large and highly significant (the minus sign indicates that the "derived" alleles correlate with non-tone; this is due to our choice of coding scheme and is irrelevant for the argument). Just to clarify: these correlations are between the tone status of the language of each population and the corresponding allele frequency in that population, based on the 49 populations in our sample.
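As a rough illustration of step 1a, the sketch below computes Pearson's r between a binary tone coding and allele frequencies. The six data points, the 0/1 coding and the function name `pearson_r` are invented for illustration and are not the study's data; the p-value calculation is left out.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data, one entry per population.
# Assumed coding: 1 = tone language, 0 = non-tone language.
tone = [1, 1, 1, 0, 0, 0]
allele_freq = [0.10, 0.20, 0.15, 0.45, 0.40, 0.50]

r = pearson_r(tone, allele_freq)
# With this coding a negative r means the allele goes with non-tone.
print(f"r = {r:.2f}")
```

With a different coding (for example 1 = non-tone) the sign of r would flip while its size would stay the same, which is why the sign is irrelevant for the argument.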
Step 1b: assessing the significance of the correlations in the context of genes and language. The standard significance tests in step 1a point to the conclusion that tone correlates with each of the two "derived" alleles and this relationship is probably not due to chance. However, this conclusion may not be a priori justified in the specific area of genes and language. It may be that genes generally correlate with the typological features of language because of patterns of past migration and contact between peoples speaking different languages.
Because of this possibility, we wanted to make sure that our correlation is "significant" not only in this standard statistical sense, but also that it is unusual when compared to other correlations between genes and typological features. So for the same 49 populations we gathered frequency data for 983 alleles and values for 26 typological features, in order to see how genes and typological features tend to behave. And it turned out that they behave nicely, which for a statistician means that the correlations between all pairs of genes and typological features follow a normal distribution around 0 (see image on the main page). This means that, as most people might have expected, there is no general correlation between genes and linguistic features.
Nevertheless, the correlations we were interested in (tone/ASPM-D and tone/MCPH-D) turned out to be in the tail of the distribution. Specifically, they are stronger than 98.6% of the thousands of gene-language correlations we tested.
Important! Step 1b was not a matter of searching the 983 alleles and 26 linguistic features for strong correlations, but of assessing the empirical "significance" of the previously established correlations against this database of alleles and features.
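The comparison in step 1b can be sketched as follows: take the correlation of interest and ask what share of the background correlations is weaker. The background values and the function name `empirical_percentile` are hypothetical, and comparing absolute values (ignoring the sign) is an assumption made for this sketch.

```python
def empirical_percentile(observed_r, background_rs):
    """Share of background correlations weaker (in absolute value) than observed_r."""
    weaker = sum(1 for r in background_rs if abs(r) < abs(observed_r))
    return weaker / len(background_rs)

# Hypothetical background: correlations for many allele/feature pairs.
# A real analysis would have one value per pair in the database.
background = [0.05, -0.12, 0.30, -0.02, 0.18, -0.25, 0.09, -0.60, 0.01, 0.14]

share = empirical_percentile(-0.53, background)
print(f"stronger than {share:.0%} of the background correlations")
```

The correlation being assessed is fixed before the background is consulted; the background only supplies the yardstick.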
Step 2a: computing the relationship between tone and both "derived" alleles considered together. Our hypothesis concerns the relationship between tone and both "derived" alleles at the same time. To test this, we used logistic regression of tone on the pair ASPM-D and MCPH-D: what this gives is an estimate of the "strength" with which the frequency of the two "derived" alleles in a population "predicts" the tonality of its language.
This proved to be significant (Nagelkerke's R² = 0.53, p = 0.015), again in a standard statistical sense. We then carried out a step 2b that was analogous to step 1b, in order to find out whether this relationship was unusual in the specific context of genes and language. We found that the logistic regression of tone on the pair ASPM-D and MCPH-D is in the top 2.7% of all the logistic regressions of all linguistic features on all pairs of alleles in our database.
This suggests that linguistic tone is predicted by the population frequency of the two "derived" alleles together.
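For readers wondering what Nagelkerke's R² measures, the sketch below computes it from the log-likelihoods of two models: a "null" model that ignores the alleles and a fitted model that uses them. Fitting the logistic regression itself is left to a statistics package; the fitted probabilities, the four toy populations and the function names are assumptions made for illustration.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's R2 from the null and fitted log-likelihoods (n observations)."""
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    max_cox_snell = 1 - math.exp(2 * ll_null / n)  # value reached by a perfect model
    return cox_snell / max_cox_snell

# Hypothetical toy example with four populations (1 = tone, 0 = non-tone).
tone = [1, 1, 0, 0]
p_null = [sum(tone) / len(tone)] * len(tone)  # intercept-only model: 0.5 everywhere
p_model = [0.9, 0.8, 0.2, 0.1]                # assumed output of a fitted regression

ll_null = log_likelihood(tone, p_null)
ll_model = log_likelihood(tone, p_model)
r2 = nagelkerke_r2(ll_null, ll_model, len(tone))
print(f"Nagelkerke R2 = {r2:.2f}")
```

The value is 0 when the alleles add nothing to the null model and 1 when they predict tone perfectly, so it can be read roughly like the familiar R² of ordinary regression.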
Step 3: Controlling for geography and history. Even though we were satisfied that the relation between tone and the two genes is unusual, we were aware that it could still just be a matter of historical and geographical links between languages and their speakers. It is generally accepted that correlations between languages and genes are explained by shared demographic processes, which are shaped, in turn, by geography, history and other factors (L. L. Cavalli-Sforza is probably the best-known scientist to have investigated these links). Therefore, we wanted to make sure that this was not the explanation for the results of steps 1 and 2. In effect, we tried very hard to do what any skeptical scientist should do: prove our own hypothesis false.
To do this, we transformed our data into distances. Distances have some very nice properties and there are quite a few statistical techniques for working with them. What we specifically applied is a technique known as Mantel correlation, which basically computes the correlation between two sets of distances while taking into account their special properties. It returns a correlation estimate (r) and a significance (p) which behave in the same way as for the more familiar Pearson correlation.
The distances we worked with are:
- the geographic distance between the populations, computed on land (thus avoiding long sea voyages);
- the genetic distance between populations, computed as Nei's D, a pretty standard measure in population genetics;
- the typological linguistic distance between languages, basically reflecting how "similar" or "alike" two languages are from a structural point of view;
- the historical linguistic distance between languages, reflecting how closely related they are (this is different from the typological one, as two languages that are related can be structurally different and vice-versa).
To assess the strength of the relationship between tone and the "derived" alleles, we basically computed the Mantel (second order partial) correlation between the typological linguistic distance (tone only) and the genetic distance (ASPM-D and MCPH-D only) controlling for the geographic distance and historical linguistic distance, simultaneously. The results (r = 0.283, p < 0.001) show that this relationship is not entirely explained by shared (known) history or spatial proximity.
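A minimal sketch of a second-order partial Mantel test is given below. It is not the code used for the study: the toy data, the function names, the one-sided permutation p-value and the stand-in numbers for historical distance are all assumptions. The idea is to correlate the pairwise distances while removing the effect of two control matrices, then to shuffle the populations in one matrix many times to see how often chance alone gives a correlation at least as large.

```python
import math
import random

def upper_triangle(m):
    """Pairwise distances above the diagonal, as a flat list."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial1(rxy, rxz, ryz):
    """First-order partial correlation; returns 0.0 in degenerate cases."""
    denom = math.sqrt(max(0.0, (1 - rxz ** 2) * (1 - ryz ** 2)))
    return (rxy - rxz * ryz) / denom if denom > 0 else 0.0

def partial2(x, y, z1, z2):
    """Correlation of x and y controlling for z1 and z2 (recursive formula)."""
    rxy_1 = partial1(pearson(x, y), pearson(x, z1), pearson(y, z1))
    rxz2_1 = partial1(pearson(x, z2), pearson(x, z1), pearson(z2, z1))
    ryz2_1 = partial1(pearson(y, z2), pearson(y, z1), pearson(z2, z1))
    return partial1(rxy_1, rxz2_1, ryz2_1)

def partial_mantel(X, Y, Z1, Z2, n_perm=999, seed=0):
    """Observed partial correlation and a one-sided permutation p-value."""
    rng = random.Random(seed)
    y, z1, z2 = upper_triangle(Y), upper_triangle(Z1), upper_triangle(Z2)
    observed = partial2(upper_triangle(X), y, z1, z2)
    order = list(range(len(X)))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        rng.shuffle(order)  # relabel populations: rows and columns together
        shuffled = [[X[i][j] for j in order] for i in order]
        if partial2(upper_triangle(shuffled), y, z1, z2) >= observed:
            hits += 1
    return observed, hits / (n_perm + 1)

def dist_matrix(items, d):
    return [[d(a, b) for b in items] for a in items]

# Hypothetical toy data for seven populations (not the study's data).
tone = [1, 1, 1, 0, 0, 0, 0]
freq = [0.10, 0.20, 0.15, 0.50, 0.40, 0.45, 0.30]
coords = [(0, 0), (1, 2), (2, 1), (5, 5), (6, 4), (7, 7), (3, 6)]
history = [0.0, 0.2, 0.5, 1.1, 1.3, 1.9, 0.8]  # arbitrary stand-in positions

TONE = dist_matrix(tone, lambda a, b: abs(a - b))
GENE = dist_matrix(freq, lambda a, b: abs(a - b))
GEO = dist_matrix(coords, lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1]))
HIST = dist_matrix(history, lambda a, b: abs(a - b))

r, p = partial_mantel(TONE, GENE, GEO, HIST)
print(f"partial Mantel r = {r:.3f}, p = {p:.3f}")
```

Whole populations are shuffled rather than individual distances because the distances involving one population are not independent of each other; this is the "special property" that an ordinary significance test would ignore.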
Holm's multiple comparisons correction:
Because we computed a large number of statistical measures, we had to adjust the significances (p-values) of our results. This is because, when computing, for example, 1000 correlations, about 10 are expected to turn out significant at an alpha level of 0.01 just by chance, even if no real relationship exists. Adequate adjustments are required to guard against this. We have systematically employed Holm's procedure, and our reported p-values are adjusted.
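The arithmetic behind this, and Holm's adjustment itself, can be sketched in a few lines. The three p-values and the function name `holm_adjust` are hypothetical. Holm's procedure sorts the p-values, multiplies the smallest by the number of tests, the next by one fewer, and so on, while never letting an adjusted value drop below the one before it.

```python
def holm_adjust(pvalues):
    """Holm-adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

# Expected number of chance "hits" among 1000 tests at alpha = 0.01.
expected_false_positives = 1000 * 0.01

# Hypothetical raw p-values from three tests.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
print(expected_false_positives, adjusted)
```

Here the smallest p-value (0.01) is multiplied by 3, the next (0.03) by 2, and the largest (0.04) by 1 but then raised to match the previous adjusted value.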
Last updated: 01 June 2007, Dan Dediu & D.R. Ladd