Spam Riddle
Another probability riddle from FiveThirtyEight:
Last week’s Riddler column garnered six comments on Facebook. However, every single one of those comments was spam. Sometimes, spammers even reply to other spammers’ comments with yet more spam. This got me thinking.
Over the course of three days, suppose the probability of any spammer making a new comment on this week’s Riddler column over a very short time interval is proportional to the length of that time interval. (For those in the know, I’m saying that spammers follow a Poisson process.) On average, the column gets one brand-new comment of spam per day that is not a reply to any previous comments. Each spam comment or reply also gets its own spam reply at an average rate of one per day.
For example, after three days, I might have four comments that were not replies to any previous comments, and each of them might have a few replies (and their replies might have replies, which might have further replies, etc.).
After the three days are up, how many total spam posts (comments plus replies) can I expect to have?
Solution
This question boils down to finding the mean of a linear birth process, which models population growth — très à la mode! Every derivation I found for the distribution of this process was fairly messy, while the formula for the distribution P(k spams after t days)=P(St=k)=(1−e−t)ke−t begs an intuitive explanation connecting it to the geometric distribution. So I decided to find it myself!
Description of the Linear Birth Process
(1−e−t)ke−t is the probability that a coin that comes up heads with probability e−t will get k tails before its first heads. This is the geometric distribution with parameter e−t. e−t is the probability that an entity will take longer than t days to produce its first spam.
This shows that, but doesn’t explain why, you can model St by for instance counting how many times in a row the post births a new top-level spam comment (not a reply) in less than t days.
Connection to Coin Flipping
The main trick is to connect the linear birth process to the linear death process, Dk, which models a population starting with Dk0=k spams that each take an average of one day to die.
When Dk=j the death rate is j, since there are j spams dying. Meanwhile, when S=j−1 the birth rate is j (there are j−1 spams plus the original post). So the time it takes Dk to go from j to j−1 has the same distribution as the time it takes S to go from j−1 to j. Based on their starting points then, the extinction time of Dk has the same distribution as the time of the kth birth of S. In fact, running S+1 (the population of spams plus the post) until the kth birth is essentially the reverse of Dk, with the exception that S+1 will be one bigger than Dk at the jumps since a birth just occurred instead of a death.
Dk will go extinct by time t, i.e. Dkt=0, if all k spams die before t days. If we think of fj as a coin that’s heads when the jth spam lives longer than t days, this is like having fj=T for 1≤j≤k. Then P(St=k)
=P(St≥k)−P(St≥k+1) =P(Dkt=0)−P(Dk+1t=0) =P(f1,…,fk=T)−P(f1,…,fk+1=T) =P(f1,…,fk=T,fk+1=H) =(1−e−t)ke−t
Riddle Solution from Coin Flipping
The mean of Gp, a geometric random variable with parameter p, is 1p−1, so E[St]=et−1. To see this, imagine simulating Gp N times for large N by flipping the coin until there have been N heads. Then ∑Ni=1GipN=NTNH=NT+NHNH−1→1p−1.