Twitter Fights Spam With BotMaker
For Twitter, fighting spam is harder than it is for many other companies, because of two constraints:
- Exposed developer APIs
- Real-time content
Because the APIs are public, spammers can see much of what the anti-spam team sees, and conventional machine learning models struggle to classify real-time content accurately because they need time to evaluate it. To compensate, the anti-spam systems must keep latency as low as possible.
To deal with content this dynamic, Twitter has developed a system called BotMaker, which defends against unwanted content. The main goals of BotMaker are as follows:
- Prevent the creation of spam content: make creating spam as difficult and costly as possible.
- Reduce the visibility time of spam on Twitter: when spam content does reach users, remove it as quickly as possible.
- Reduce the reaction time to new spam attacks: to defend against new spam, collecting and evaluating data and deploying new rules and models must happen extremely quickly.
Twitter Engineer Raghav Jeyaraman said in a blog post:
The system handles billions of events every day in production, and we have seen a 40% reduction in key spam metrics since launching BotMaker.
Ideally, the best defense against spam would be detecting and removing it at the time of creation. However, that is not always possible, because low latency is an integral requirement for actions like tweets, retweets, follows, and messages. To deal with spam at all stages, BotMaker breaks the problem down into three different jobs:
1. Real time (Scarecrow)
Scarecrow detects spam in real time and prevents it from entering the system, which means it must run with very low latency. Because it sits in the synchronous path of all write actions, Scarecrow can deny or accept a write, or challenge suspicious writes with countermeasures such as CAPTCHAs.
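The blog post does not publish Scarecrow's internals, but a synchronous, rule-based check with a three-way verdict can be sketched roughly as follows. All names (`check_write`, the example rules, the event fields) are hypothetical; the only constraint taken from the article is that each rule must be cheap enough to run in the write path and must resolve to accept, deny, or challenge.

```python
from enum import Enum

class Verdict(Enum):
    ACCEPT = "accept"
    DENY = "deny"
    CHALLENGE = "challenge"  # e.g. present a CAPTCHA to the user

def rate_limit_rule(event):
    # Deny writes from accounts tweeting implausibly fast (threshold invented).
    if event.get("tweets_last_minute", 0) > 60:
        return Verdict.DENY
    return None

def known_spam_url_rule(event):
    # Challenge writes linking to a blocklisted domain (domain invented).
    if event.get("url_domain") in {"spam.example"}:
        return Verdict.CHALLENGE
    return None

def check_write(event):
    """Run lightweight rules synchronously; each rule returns a Verdict
    or None to abstain. First non-abstaining rule wins."""
    for rule in (rate_limit_rule, known_spam_url_rule):
        verdict = rule(event)
        if verdict is not None:
            return verdict
    return Verdict.ACCEPT
```

A denied write never enters the system, which is why this stage, however coarse its rules, is the cheapest place to stop spam.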
2. Near real time (Sniper)
For spam that gets past Scarecrow, Sniper continuously classifies users and content off the write path. Machine learning models that cannot be evaluated in real time run in Sniper, and because Sniper is asynchronous, it can also make use of high-latency features.
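One way to picture Sniper's off-the-write-path role is a worker consuming events from a queue and taking down content after the fact. This is a minimal sketch, not Twitter's implementation: the queue, the `spam_score` field, and the threshold are all invented, and `classify` stands in for the heavier models the article says run here.

```python
import queue
import threading

def classify(item):
    # Placeholder for a heavier ML model; because this runs off the
    # write path, it can afford slow, high-latency features.
    return item["spam_score"] > 0.9

def sniper_worker(events, removals):
    """Consume write events asynchronously; flag spam for removal."""
    while True:
        item = events.get()
        if item is None:       # sentinel: shut the worker down
            break
        if classify(item):
            removals.append(item["id"])  # take content down after the fact
```

The key trade-off versus Scarecrow: the spam is briefly visible, but the classifier is free to be as expensive as it needs to be.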
3. Periodic/batch jobs
Models that examine user behavior over long periods and extract features from large data sets are not latency-sensitive, so they can run periodically as offline jobs. However, relying on batch jobs alone to detect all spam is not effective.
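A batch stage like this typically aggregates per-user behavior over a long window and flags accounts whose aggregate statistics look spammy. The sketch below illustrates the idea only; the feature (fraction of tweets containing links) and the thresholds are invented, standing in for whatever offline models Twitter actually runs.

```python
from collections import defaultdict

def batch_spam_candidates(events):
    """Aggregate per-user behavior over a long window (e.g. a day of
    logs) and flag users whose tweets are almost all links -- a crude
    stand-in for an offline behavioral model."""
    stats = defaultdict(lambda: {"tweets": 0, "links": 0})
    for e in events:
        s = stats[e["user"]]
        s["tweets"] += 1
        s["links"] += e.get("has_link", 0)
    return [user for user, s in stats.items()
            if s["tweets"] >= 10 and s["links"] / s["tweets"] > 0.9]
```

Because the job only runs periodically, anything it catches has already been visible for hours, which is why the article stresses that batch detection alone is not enough.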
BotMaker achieves its goals by receiving data from all of Twitter's distributed systems. Given its results, Twitter's engineers now describe BotMaker as a "fundamental interposition layer in our [Twitter's] distributed system."