Spam Riddle

Another probability riddle from FiveThirtyEight:

Last week’s Riddler column garnered six comments on Facebook. However, every single one of those comments was spam. Sometimes, spammers even reply to other spammers’ comments with yet more spam. This got me thinking.

Over the course of three days, suppose the probability of any spammer making a new comment on this week’s Riddler column over a very short time interval is proportional to the length of that time interval. (For those in the know, I’m saying that spammers follow a Poisson process.) On average, the column gets one brand-new comment of spam per day that is not a reply to any previous comments. Each spam comment or reply also gets its own spam reply at an average rate of one per day.

For example, after three days, I might have four comments that were not replies to any previous comments, and each of them might have a few replies (and their replies might have replies, which might have further replies, etc.).

After the three days are up, how many total spam posts (comments plus replies) can I expect to have?

Solution

This question boils down to finding the mean of a linear birth process, which models population growth — très à la mode! Every derivation I found for the distribution of this process was fairly messy, while the formula for the distribution P(k spams after t days)=P(St=k)=(1et)ket begs an intuitive explanation connecting it to the geometric distribution. So I decided to find it myself!

Description of the Linear Birth Process

(1et)ket is the probability that a coin that comes up heads with probability et will get k tails before its first heads. This is the geometric distribution with parameter et. et is the probability that an entity will take longer than t days to produce its first spam.

This shows that, but doesn’t explain why, you can model St by for instance counting how many times in a row the post births a new top-level spam comment (not a reply) in less than t days.

Connection to Coin Flipping

The main trick is to connect the linear birth process to the linear death process, Dk, which models a population starting with Dk0=k spams that each take an average of one day to die.

When Dk=j the death rate is j, since there are j spams dying. Meanwhile, when S=j1 the birth rate is j (there are j1 spams plus the original post). So the time it takes Dk to go from j to j1 has the same distribution as the time it takes S to go from j1 to j. Based on their starting points then, the extinction time of Dk has the same distribution as the time of the kth birth of S. In fact, running S+1 (the population of spams plus the post) until the kth birth is essentially the reverse of Dk, with the exception that S+1 will be one bigger than Dk at the jumps since a birth just occurred instead of a death.

Dk will go extinct by time t, i.e. Dkt=0, if all k spams die before t days. If we think of fj as a coin that’s heads when the jth spam lives longer than t days, this is like having fj=T for 1jk. Then P(St=k)

=P(Stk)P(Stk+1) =P(Dkt=0)P(Dk+1t=0) =P(f1,,fk=T)P(f1,,fk+1=T) =P(f1,,fk=T,fk+1=H) =(1et)ket

Riddle Solution from Coin Flipping

The mean of Gp, a geometric random variable with parameter p, is 1p1, so E[St]=et1. To see this, imagine simulating Gp N times for large N by flipping the coin until there have been N heads. Then Ni=1GipN=NTNH=NT+NHNH11p1.

Written on April 10, 2020