Millions of Faked Pro-Repeal Net Neutrality Comments – The Importance of AI for Detection
A compelling project by data scientist Jeff Kao used AI-driven natural language processing techniques to unearth millions of faked form submissions in support of repealing net neutrality. Given the real-life consequences, it shows how crucial AI techniques are in establishing the authenticity of online comments.
Jeff’s fascinating post details how at least 1.3 million of the comments he analysed were generated by mail merge, swapping out synonyms in plausible-seeming sentences but ultimately ending up with a kind of word salad. Scattered amongst 22 million comments, these are hard to catch through traditional techniques. This is where natural language processing techniques such as word clustering proved powerful.
"When laying just five of these side-by-side with highlighting, as above, it’s clear that there’s something fishy going on. But when the comments are scattered among 22+ million, often with vastly different wordings between comment pairs, I can see how it’s hard to catch. Semantic clustering techniques, and not typical string-matching techniques, did a great job at nabbing these." -- Jeff Kao
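To see why string matching struggles here, consider a minimal Python sketch of a mail merge. The template slots and phrasings below are hypothetical, written for illustration rather than taken from the actual FCC comments. Every generated comment is a distinct string, so exact-duplicate detection flags nothing.

```python
import itertools

# Hypothetical template: each slot lists interchangeable phrasings.
# These are illustrative only, not text from the real comments.
SLOTS = [
    ["I urge", "I ask", "I encourage"],
    ["the FCC", "the commission"],
    ["to repeal", "to undo", "to reverse"],
    ["the current rules", "the existing regulations"],
]


def mail_merge(slots):
    """Return every comment the template can produce, one choice per slot."""
    return [" ".join(choice) + "." for choice in itertools.product(*slots)]


comments = mail_merge(SLOTS)

# Exact string matching treats every variant as a different comment.
unique_strings = set(comments)
print(len(comments), len(unique_strings))  # 36 36
```

Four short slots already yield 36 variants that all look unique to a string comparison. A longer template with more slots multiplies out to far more.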
After clustering comment categories and removing duplicates, Jeff ultimately found that fewer than 800,000 of the more than 22 million comments submitted to the FCC on the net neutrality repeal actually appeared to be organic.
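The idea of grouping comments by meaning rather than by exact wording can be sketched in a few lines of Python. This is a toy stand-in and not Jeff Kao's actual pipeline: the `CONCEPTS` table is a hand-written, hypothetical synonym map, whereas a real system would derive similarity from word or document embeddings and a proper clustering algorithm. The sample comments are also made up for illustration.

```python
import string
from collections import defaultdict

# Hypothetical synonym table mapping words to a shared concept label.
# A real pipeline would learn this similarity from embeddings instead.
CONCEPTS = {
    "urge": "ASK", "ask": "ASK", "encourage": "ASK",
    "fcc": "AGENCY", "commission": "AGENCY",
    "repeal": "UNDO", "undo": "UNDO", "reverse": "UNDO",
    "current": "NOW", "existing": "NOW",
    "rules": "RULES", "regulations": "RULES",
}

PUNCTUATION = str.maketrans("", "", string.punctuation)


def signature(comment):
    """Reduce a comment to its sequence of concepts, ignoring case and punctuation."""
    words = comment.lower().translate(PUNCTUATION).split()
    return tuple(CONCEPTS.get(word, word) for word in words)


def cluster(comments):
    """Group comments that share the same concept signature."""
    groups = defaultdict(list)
    for comment in comments:
        groups[signature(comment)].append(comment)
    return list(groups.values())


# Illustrative comments: three template variants and one organic comment.
comments = [
    "I urge the FCC to repeal the current rules.",
    "I ask the commission to undo the existing regulations.",
    "I encourage the FCC to reverse the existing rules.",
    "Please keep net neutrality, my small business depends on it.",
]

clusters = cluster(comments)
for group in clusters:
    print(len(group), "comment(s), e.g.:", group[0])
```

The three template variants collapse into a single cluster even though no two of them share the same wording, while the organic comment stands alone. Counting one comment per cluster then gives a rough estimate of how many submissions are genuinely distinct.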
This is a powerful example of how techniques such as semantic word clustering can identify repeated patterns of semantically similar phrases, and therefore repeated posts. It’s just one of the many techniques we use with our Ally solution to identify fake or spam chat in online communities.
You can read Jeff Kao’s full post here. He also helpfully included the datasets he used, so that others can process them and verify the results.