There were several challenges in using the Twitter API to meet the study's research goal. Here I look at population estimation: how to measure the population size by sampling.

## Capture/Recapture and the Lincoln-Petersen Formula

Twitter themselves do not provide data on the geographical distribution of their userbase, and they're a little opaque about their true user population, with dead accounts bolstering the numbers. What little official data there is focuses on those countries with the highest per capita Twitter penetration. In other words, announcements have generally ignored the Arabian Peninsula.

At the highest level, estimating population size is a simple two-step process, thanks to a clever idea from ecological surveying called the mark-recapture technique (also known as capture/recapture).

Capture/recapture looks at the overlap between different samples in order to estimate the total size of a population. The degree of overlap tells you how complete your samples were, and hence the size of the population.

We used a sampling process to collect two samples (sets of users) for each country. The samples were obtained by the simple process of looking for people tweeting. In principle, this is all that is required in order to apply the standard Lincoln-Petersen formula for mark-recapture population estimation.[1]
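As a minimal sketch of the standard formula: if the two samples contain `s1` and `s2` users and `m2` users appear in both, the Lincoln-Petersen estimate of the population is `s1 * s2 / m2`. The function name and the numbers below are illustrative only, not figures from the study.

```python
def lincoln_petersen(s1: int, s2: int, m2: int) -> float:
    """Estimate the total population from two samples and their overlap.

    s1, s2: number of individuals seen in the first and second sample.
    m2:     number of individuals seen in both samples.
    """
    if m2 <= 0:
        # With no overlap the formula divides by zero: the samples carry
        # no information about how complete they were.
        raise ValueError("no overlap between samples; estimate is undefined")
    return s1 * s2 / m2


if __name__ == "__main__":
    # Made-up sample sizes: a quarter of the second sample was already seen,
    # suggesting the first sample covered about a quarter of the population.
    print(lincoln_petersen(s1=1000, s2=1200, m2=300))
```

The smaller the overlap relative to the sample sizes, the larger the estimated population.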

## Not So Fast Mr Bond

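However, the Lincoln-Petersen formula relies on several assumptions: (1) the population does not change between samples, (2) being sampled does not affect an individual's behaviour, and (3) every individual is equally likely to be captured.

Assumption 1 is a reasonable approximation: although Twitter is growing, the change in the Twitter population over a short period will be relatively small. Assumption 2 is also valid, as the sampling process is completely passive and cannot affect individual behaviours.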

However, assumption 3 clearly does not hold for a message-based sampling technique. Different Twitter users exhibit *very* different patterns of activity: some post many times a day, others once a month. Such a population is said to be heterogeneous. We have attempted to correct for this as follows.

## Correcting for heterogeneity

The chances of being picked up in a sweep are linked to activity. More frequent Twitter users are more likely to turn up in both sweeps. Activity levels are far from uniform (as confirmed here and in various studies, e.g. [2]). As a result, the standard Lincoln-Petersen formula will very significantly underestimate the population size.

We correct for this by assigning a prior distribution for the likelihood of being seen. This allows a corrected formula to be calculated. This correction uses extra information about users supplied by Twitter, as described below.
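To see why uneven activity leads to an underestimate, here is a rough sketch that plugs expected sample sizes into the Lincoln-Petersen formula. It assumes each user is caught independently in each sweep with a fixed personal probability; the two populations below are invented for illustration and have the same average capture probability.

```python
def expected_lp_estimate(capture_probs):
    """Lincoln-Petersen estimate computed from expected sample sizes.

    Assumes user i is caught independently in each sweep with probability
    capture_probs[i]. The expected overlap is then the sum of squared
    probabilities, which is dominated by the most active users.
    """
    expected_s = sum(capture_probs)                  # expected size of each sweep
    expected_m = sum(p * p for p in capture_probs)   # expected overlap
    return expected_s * expected_s / expected_m


if __name__ == "__main__":
    true_size = 1000
    uniform = [0.2] * true_size                # everyone equally active
    mixed = [0.38] * 500 + [0.02] * 500        # same average, very uneven
    print(expected_lp_estimate(uniform))       # close to the true size
    print(expected_lp_estimate(mixed))         # roughly 550, far below 1000
```

The active half of the mixed population accounts for almost all of the overlap, so the formula behaves as if the quiet half barely existed.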

Note: given longer histories with more sweeps, there are other models which can be fitted for capture behaviour. See [3] for an overview of techniques. With two sweeps, it is necessary to use a prior distribution to model the heterogeneity effect. Without an informative prior, the maximum likelihood estimator for population size with heterogeneity is only the number of individuals seen.

We assume that an individual's capture probability is linearly correlated with their average post frequency, which we can calculate from their Twitter profile.

Let $s_1$, $s_2$ be the number of individuals captured in sweeps 1 and 2 respectively, let $m_2$ be the number of marked individuals found in sweep 2 (the overlap), and let $N$ be the total population (which is what we wish to estimate).

We divide the population into those caught in sweep 1, and those who were not captured in sweep 1. Let $M$ be the individuals captured and marked in sweep 1, and $U$ the individuals not captured in sweep 1. These sub-populations have different probabilities of being captured in sweep 2, which we denote $p_M$ and $p_U$ respectively. $p_M$ is the average posterior probability of capture given a previous capture, and $p_U$ is the average posterior probability of capture given no previous capture.

Given $p_U$, we could estimate $|U| = (s_2 - m_2)/p_U$ and hence $N = |U| + s_1$. But we don't have $p_U$; what we have is an estimator for $p_M = m_2/s_1$. So we have more work to do.

Let $F_M$, $F_U$ be the tweet frequency distributions for $M$ and $U$, i.e. $F_M$ has the probability function $P(\text{frequency} \mid M)$. We measure $F_M$ directly, creating a histogram. We can then estimate the prior distribution $F$, since $P(\text{frequency} = f \mid M)$ is proportional to $f \cdot \mathrm{Prior}(\text{frequency} = f)$ by the link with tweet frequency. This estimate is unstable around low frequencies, so we set a minimum activity threshold of one tweet per fortnight. We use the prior $F$ as an estimator for $F_U$. This is reasonable, but does have some bias towards higher frequencies, which will result in the final correction being smaller than it should be, and hence our result is below the ideal estimate.

By the link with tweet frequency described above, we have $p_M = k \cdot \mathrm{mean}(F_M)$ and $p_U = k \cdot \mathrm{mean}(F_U)$ for some value $k$.
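The reweighting step can be sketched in a few lines. If the chance of appearing in $M$ is proportional to tweet frequency, then weighting each marked user by 1/frequency undoes that bias, and the mean of the reweighted distribution works out to the harmonic mean of the observed frequencies. The sketch below works per user rather than on a histogram, and it assumes the activity threshold is applied by dropping users below it; both choices, and all names, are assumptions made for illustration.

```python
from statistics import fmean, harmonic_mean

# One tweet per fortnight, expressed in tweets per day.
MIN_TWEETS_PER_DAY = 1 / 14


def frequency_means(marked_freqs, threshold=MIN_TWEETS_PER_DAY):
    """Return (mean of F_M, estimated mean of the prior F).

    marked_freqs: average tweets per day for each user seen in sweep 1.
    Users below the threshold are dropped (an assumption), because the
    1/frequency weights blow up for near-silent accounts.
    """
    active = [f for f in marked_freqs if f >= threshold]
    if not active:
        raise ValueError("no users above the activity threshold")
    mean_f_m = fmean(active)
    # Weighted mean with weights 1/f:  sum(f * 1/f) / sum(1/f) = n / sum(1/f),
    # which is exactly the harmonic mean of the observed frequencies.
    mean_prior = harmonic_mean(active)
    return mean_f_m, mean_prior


if __name__ == "__main__":
    # Invented frequencies; the last user falls below the threshold.
    print(frequency_means([1.0, 2.0, 4.0, 0.01]))
```

The harmonic mean is never larger than the arithmetic mean, so the ratio of the two means is at least 1 and the correction can only push the population estimate upwards.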

And without further ado, this gives

$$N = s_1 + (s_2 - m_2) \cdot \frac{s_1}{m_2} \cdot \frac{\mathrm{mean}(F_M)}{\mathrm{mean}(F_U)}$$

Simples.\*
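For anyone who wants to plug in numbers, here is a small sketch of the corrected estimator. It follows from substituting $p_U = p_M \cdot \mathrm{mean}(F_U)/\mathrm{mean}(F_M)$ and $p_M \approx m_2/s_1$ into $|U| = (s_2 - m_2)/p_U$. The function name and the example values are illustrative, not figures from the study.

```python
def corrected_population(s1, s2, m2, mean_f_m, mean_f_u):
    """Heterogeneity-corrected population estimate.

    s1, s2:   individuals captured in sweeps 1 and 2.
    m2:       marked individuals recaptured in sweep 2.
    mean_f_m: mean tweet frequency of users marked in sweep 1.
    mean_f_u: estimated mean tweet frequency of users not seen in sweep 1.
    """
    if m2 <= 0:
        raise ValueError("no recaptures; estimate is undefined")
    if mean_f_u <= 0:
        raise ValueError("mean_f_u must be positive")
    unmarked_seen = s2 - m2  # users in sweep 2 who were not in sweep 1
    # |U| = unmarked_seen / p_U, with p_U = (m2 / s1) * (mean_f_u / mean_f_m)
    unmarked_total = unmarked_seen * s1 * mean_f_m / (m2 * mean_f_u)
    return s1 + unmarked_total


if __name__ == "__main__":
    # Equal means reproduce the plain Lincoln-Petersen estimate s1 * s2 / m2.
    print(corrected_population(1000, 1200, 300, mean_f_m=1.0, mean_f_u=1.0))
    # Marked users twice as active as unmarked ones: the estimate grows.
    print(corrected_population(1000, 1200, 300, mean_f_m=2.0, mean_f_u=1.0))
```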

Note that the Lincoln-Petersen formula is a special case of this equation, where $p_M = p_U$. Which is to say, if you put the homogeneity assumption back in, you'll get Lincoln-Petersen out.

\*The handful of lines shown here took some work, with help from the good Dr Halliwell, and some computational modelling work to double check the reasonableness of it all.

## Correcting for the unlocatable population

A significant number of users choose to withhold their location, or provide non-geographical locations, e.g. "wherever there is dancing". Because a country-based sampling process has to discard unresolved locations, without correction we would underestimate the population size.

In order to correct for this, we estimated the size of this effect for each country surveyed. We searched for tweets with the location-identifying phrases "I'm in X" and "here in X" for both the country name and large cities within that country. We performed these searches in both English and Arabic.

These searches pick up a mixture of:

Let's assume that: (a) the people who use such phrases do not have any bias for or against putting their proper location into their user description, and (b) the people who withhold their location are visitors or locals in proportion to the ratio of identified visitors and identified locals.

Now we can estimate the proportion of locals who withhold their location.
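One way the two assumptions could be turned into arithmetic is sketched below. The three input counts (phrase users whose profile resolves to the country, phrase users whose profile resolves elsewhere, and phrase users with no resolvable location) are hypothetical categories chosen for illustration, as are the function names and numbers; the exact calculation used in the study may differ.

```python
def withheld_local_fraction(identified_locals, identified_visitors, withheld):
    """Estimate the share of locals who withhold their location.

    Uses assumption (b): users with no resolvable location are split between
    locals and visitors in the same ratio as the identified users.
    """
    identified = identified_locals + identified_visitors
    if identified <= 0:
        raise ValueError("need at least one user with an identified location")
    withheld_locals = withheld * identified_locals / identified
    return withheld_locals / (identified_locals + withheld_locals)


def correct_for_unlocatable(located_population, fraction_withheld):
    """Scale a population estimate built only from locatable users."""
    if not 0 <= fraction_withheld < 1:
        raise ValueError("fraction_withheld must be in [0, 1)")
    return located_population / (1 - fraction_withheld)


if __name__ == "__main__":
    # Invented counts from a phrase search for one country.
    share = withheld_local_fraction(identified_locals=60,
                                    identified_visitors=20,
                                    withheld=20)
    print(share)
    print(correct_for_unlocatable(4000, share))
```

Under assumption (a), the share measured among phrase users is taken to apply to the whole local population, which is what justifies scaling the estimate this way.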

Putting these things together allowed us to go beyond sample measurements to estimate the underlying populations.