Platypus Innovation Blog

14 September 2011

Mark and Recapture Estimation on Twitter

This is a maths post -- some technical notes on work we did for the Arab Social Media Report produced by The Dubai School of Government. Non-mathematicians should see that study instead. If you proceed, you have been warned: there will be equations.

There were several challenges in using the Twitter API to meet the study's research goal. Here I look at population estimation: how to measure the population size by sampling.

Capture/Recapture and the Lincoln-Petersen Formula

Twitter themselves do not provide data on the geographical distribution of their userbase, and they're a little opaque about their true user population, with dead accounts bolstering the numbers. What little official data there is focuses on those countries with the highest per capita Twitter penetration. In other words, announcements have generally ignored the Arabian Peninsula.

At the highest level, estimating population size is a simple two-step process, thanks to a clever idea from ecological surveying called the mark-recapture technique (also known as capture/recapture).

Capture/recapture looks at the overlap between different samples in order to estimate the total size of a population. The degree of overlap tells you how complete your samples were, and hence the size of the population.

We used a sampling process to collect a couple of samples -- sets of users for each country. The samples were obtained by the simple process of looking for people tweeting. In principle, this is all that is required in order to apply the standard Lincoln-Petersen formula for mark-recapture population estimation.[1]
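For reference, the standard Lincoln-Petersen estimate is (size of sample 1) x (size of sample 2) / (size of the overlap). Here is a minimal Python sketch of it. The function name and the use of sets of user ids are my own choices for illustration, not the study's code.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate population size from two samples of user ids."""
    first, second = set(first_sample), set(second_sample)
    overlap = len(first & second)  # users seen in both samples
    if overlap == 0:
        raise ValueError("no overlap between the samples; estimate is undefined")
    return len(first) * len(second) / overlap


# Made-up example: 100 users in sample 1, 80 in sample 2, 20 in both.
sample_1 = set(range(100))
sample_2 = set(range(80, 160))
estimate = lincoln_petersen(sample_1, sample_2)  # 100 * 80 / 20
```

The smaller the overlap relative to the sample sizes, the larger the estimated population.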

Not So Fast Mr Bond

However, the Lincoln-Petersen formula relies on several assumptions:

• That the population does not significantly change between samples, i.e. there aren't many people joining or leaving Twitter across the sampling period.

• That the sampling procedure does not have an effect on individual behaviours (in ecological sampling the act of capturing an animal might sometimes kill it).

• That all individuals in the population are equally likely to be captured.

Assumption 1 is a reasonable approximation – although Twitter is growing, the change in the Twitter population over a short period will be relatively small. Assumption 2 is also valid as the sampling process is completely passive and cannot affect individual behaviours.

However, assumption 3 clearly does not hold for a message-based sampling technique. Different Twitter users exhibit very different patterns of activity: some post many times a day, others once a month. Such a population is said to be heterogeneous. We have attempted to correct for this as follows.

Correcting for heterogeneity

The chances of being picked up in a sweep are linked to activity. More frequent Twitter users are more likely to turn up in both sweeps. Activity levels are far from uniform (as confirmed here and in various studies, e.g. [2]). The effect is that the overlap is inflated by the most active users, so the standard Lincoln-Petersen formula will very significantly underestimate the population size.
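To see the size of the effect, here is a small Python sketch that works with expected values rather than real data. The two activity groups and their capture probabilities are invented for illustration, and I assume each user is captured independently, with the same probability in both sweeps.

```python
def expected_lincoln_petersen(groups):
    """groups: list of (group_size, capture_probability) pairs.

    Returns (true_population, estimate), where the estimate is the
    Lincoln-Petersen formula applied to the expected sample size and
    the expected overlap.
    """
    true_population = sum(size for size, _ in groups)
    expected_sample = sum(size * p for size, p in groups)       # same for both sweeps
    expected_overlap = sum(size * p * p for size, p in groups)  # captured twice
    return true_population, expected_sample ** 2 / expected_overlap


# Hypothetical population: 500 heavy tweeters, 500 occasional ones.
true_n, estimate = expected_lincoln_petersen([(500, 0.8), (500, 0.1)])
# estimate is roughly 623, well below the true population of 1000
```

With a single group, e.g. `[(1000, 0.3)]`, the same calculation returns the true population, which is the homogeneous case the formula was designed for.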

We correct for this by assigning a prior distribution for the likelihood of being seen. This allows a corrected formula to be calculated. This correction uses extra information about users supplied by Twitter, as described below.

Note: given longer histories with more sweeps, there are other models which can be fitted for capture behaviour. See [3] for an overview of techniques. With two sweeps, it is necessary to use a prior distribution to model the heterogeneity effect. Without an informative prior, the maximum likelihood estimator for population size with heterogeneity is only the number of individuals seen.

We assume that an individual's capture probability is proportional to their average post frequency, which we can calculate from their Twitter profile.

Let s1, s2 be the number of individuals captured in sweep 1 and 2 respectively, let m2 be the number of marked individuals found in sweep 2 (the overlap) and let N be the total population (which is what we wish to estimate).

We divide the population into those caught in sweep 1, and those who were not captured in sweep 1. Let M be the individuals captured and marked in sweep 1, and U the individuals not captured in sweep 1. These sub-populations have different probabilities of being captured in sweep 2, which we denote pM and pU respectively. pM is the average posterior probability of capture given a previous capture, and pU is the average posterior probability of capture given no previous capture.

Given pU, we could estimate |U| = (s2 - m2)/pU and hence N = |U| + s1.

But we don't have pU -- what we have is an estimator for pM, namely m2/s1.

So we have more work to do.

Let FM, FU be the tweet frequency distributions for M and U, i.e. FM has the probability function P(frequency|M). We measure FM directly, creating a histogram. We can then estimate the prior distribution for F since P(frequency=f |M) is proportional to f.Prior(frequency=f) by the link with tweet frequency. This estimate is unstable around low frequencies, so we set a minimum activity threshold of one tweet per fortnight. We use the prior F as an estimator for FU. This is reasonable, but does have some bias towards higher frequencies, which will result in the final correction being smaller than it should be, and hence our result is below the ideal estimate.
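As a rough illustration of this step, the Python sketch below builds the FM histogram from the marked users' tweet rates, reweights it by 1/f to get the prior, and returns the two means. Several details are my assumptions rather than the study's code: frequencies are in tweets per day, each distinct value is its own histogram bin (real data would need proper binning), and the activity threshold is applied by dropping users below it.

```python
from collections import Counter

MIN_FREQUENCY = 1 / 14  # tweets per day, i.e. one tweet per fortnight


def mean_frequencies(marked_frequencies, min_frequency=MIN_FREQUENCY):
    """Return (mean(FM), estimated mean(FU)) from the marked users' tweet rates."""
    kept = [f for f in marked_frequencies if f >= min_frequency]
    if not kept:
        raise ValueError("no users at or above the activity threshold")
    histogram = Counter(kept)  # observed counts for FM
    total = sum(histogram.values())
    f_m = {f: n / total for f, n in histogram.items()}  # P(frequency=f | M)
    # P(f | M) is proportional to f * Prior(f), so Prior(f) is proportional to P(f | M) / f.
    weights = {f: p / f for f, p in f_m.items()}
    norm = sum(weights.values())
    prior = {f: w / norm for f, w in weights.items()}
    mean_fm = sum(f * p for f, p in f_m.items())
    mean_fu = sum(f * p for f, p in prior.items())  # prior used as the estimator for FU
    return mean_fm, mean_fu


# Made-up data: two users at 1 tweet/day, two at 4 tweets/day.
mean_fm, mean_fu = mean_frequencies([1, 1, 4, 4])
```

In this toy example mean(FM) is 2.5 while the estimated mean(FU) is 1.6. Under this construction the estimated mean(FU) is the harmonic mean of the marked users' frequencies, which is why it is so sensitive to very low frequencies and why a threshold is needed.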

By the link with tweet frequency described above, we have pM = k.mean(FM) and pU = k.mean(FU) for some value k.

And without further ado, this gives N = s1 + (s2 - m2).(s1/m2).(mean(FM)/mean(FU))

Simples.*

Note that the Lincoln-Petersen formula is a special case of this equation, where pM = pU. Which is to say, if you put the homogeneity assumption back in, you'll get Lincoln-Petersen out.

*The handful of lines shown here took some work, with help from the good Dr Halliwell, and some computational modelling work to double check the reasonableness of it all.
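For anyone who prefers code to algebra, here is the corrected estimator as a short Python function. The function name and the example numbers are mine and purely illustrative. The formula follows from |U| = (s2 - m2)/pU, pM = m2/s1 and pU/pM = mean(FU)/mean(FM).

```python
def corrected_population(s1, s2, m2, mean_fm, mean_fu):
    """N = s1 + (s2 - m2) * (s1 / m2) * (mean(FM) / mean(FU))."""
    if m2 <= 0:
        raise ValueError("need at least one recaptured individual")
    if mean_fu <= 0:
        raise ValueError("mean(FU) must be positive")
    unmarked = (s2 - m2) * (s1 / m2) * (mean_fm / mean_fu)  # estimate of |U|
    return s1 + unmarked


# Made-up counts: 100 users in sweep 1, 80 in sweep 2, 20 in both.
homogeneous = corrected_population(100, 80, 20, mean_fm=2.0, mean_fu=2.0)
heterogeneous = corrected_population(100, 80, 20, mean_fm=2.5, mean_fu=1.6)
```

With equal means the result is 400, the same as the plain Lincoln-Petersen value of 100 x 80 / 20. When the marked users are more active on average than the rest, the estimate rises (to about 569 in this example).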

Correcting for the unlocatable population

A significant number of users choose to withhold their location, or provide non-geographical locations e.g. "wherever there is dancing". Because a country-based sampling process has to discard unresolved locations, without correction we would underestimate the population size.

In order to correct for this, we estimated the size of this effect for each country surveyed. We searched for tweets with the location-identifying phrases, "I'm in X" and "here in X" for both the country name and large cities within that country. We performed these searches in both English and Arabic.

These searches pick up a mixture of:

• Identifiable tourists and visitors (identified by the fact that they give their location as a different country).

• Identifiable local people (identified by their location).

• People who do not give out their proper location.

Let's assume that: (a) the people who use such phrases do not have any bias for or against putting their proper location into their user description, and (b) the people who withhold their location are visitors or locals in proportion to the ratio of identified visitors and identified locals.

Now we can estimate the proportion of locals who withhold their location.
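A minimal Python sketch of that calculation, using the three groups above. The function names and the counts are invented for illustration. The final scale-up step is my assumption about how such a proportion would be applied to a population estimate built from locatable users only.

```python
def withheld_local_fraction(locals_found, visitors_found, withheld):
    """Estimated share of locals who withhold their location."""
    if locals_found <= 0:
        raise ValueError("need at least one identified local")
    identified = locals_found + visitors_found
    # Assumption (b): split the withheld users in the identified local/visitor ratio.
    withheld_locals = withheld * locals_found / identified
    all_locals = locals_found + withheld_locals
    return withheld_locals / all_locals


def scale_up(located_estimate, withheld_fraction):
    """Assumed correction: the located users are (1 - fraction) of all locals."""
    return located_estimate / (1 - withheld_fraction)


# Made-up search results: 60 identified locals, 20 visitors, 20 with no usable location.
fraction = withheld_local_fraction(60, 20, 20)  # 15 of an estimated 75 locals
corrected = scale_up(800, fraction)
```

Under assumption (b) the fraction simplifies to withheld / (locals + visitors + withheld), which is 20/100 in this example.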

Putting these things together allowed us to go beyond sample measurements to estimate the underlying populations.

References

[1] Seber, G.A.F., The Estimation of Animal Abundance and Related Parameters. Caldwell, New Jersey: Blackburn Press.

[2] http://blogs.hbr.org/cs/2009/06/new_twitter_research_men_follo.html (10% of users create 90% of volume)

[3] Sophie Baillargeon and Louis-Paul Rivest, Rcapture: Loglinear Models for Capture-Recapture in R