
Spam: Kill it at the root

Posted November 17, 2003 by Dougal Campbell. Filed under Meta.

Matt and I were chatting briefly earlier today about blog comment spam, and about an article that Mark Pilgrim posted on his site. Mark is pretty pessimistic about the outlook for anti-comment-spam efforts. And he points out the great lengths to which spammers will go in order to get their spams out to where eyeballs will see them. Both usenet and email are awash with the refuse of spammers, because for some reason, there are people out there who respond to their messages.

However, I, like Sam Ruby, believe that the outlook isn’t quite as grim for blogs as it has been for usenet and email. Blogs are fundamentally different in how they work and in how people use them.

It was easy for spammers to abuse usenet, because it is a store-and-forward system designed specifically for distributing messages to a large number of leaf nodes. You post a message into a newsgroup, and your message is replicated around the world, on thousands of other usenet servers, viewed by millions of users. Clients and servers speak a simple protocol, NNTP, and though most servers these days require authentication before you can post a message, there are still open servers to be found, if you know how to look.

Likewise, email is a store-and-forward system, with a method of relaying messages from one server to another. Unlike usenet, email is designed to deliver messages directly to a particular person, which makes it slightly less efficient for reaching large numbers of users. But only slightly, because it’s quite easy to specify multiple recipients for a message. So the spammer can deliver a single message to a server, and instruct the server to deliver it to hundreds of different people. And again, many modern servers attempt to prevent unauthorized users from relaying messages, but there are still plenty of vulnerable servers on the internet that the spammers can find.

The relative ease and extremely low cost of reaching large numbers of consumers have made usenet and email into juicy targets for spammers. Low-hanging fruit, if you will. Spammers have also targeted other services, such as instant messaging, pagers, cell phones, and more recently, blogs.

Obviously, I have a great interest in the problem of spam on blogs. I run multiple personal blogs, and I’m a developer for the WordPress blogging software. Spammers have hit my blogs on multiple occasions recently, and I delete the spam comments as soon as I become aware of them. But automating the process to some degree would certainly allow me to relax more. Fortunately, blogs are fundamentally different from usenet and email in several ways, and I think that these differences may allow us to stop this spam problem before it really gets bad. There’s still a chance for us to kill this weed at the root, before it has a chance to grow.

Why do spammers post comments on blogs? Do they really expect blog owners to leave their comments up for people to read? Do they expect people to follow their links to the spammers’ sites selling viagra and porn? Not really. What they are really after is search engine ranking, because the more links to their sites a search engine sees, the better the chance that their site will come up in a list of search results. Even if you’ve deleted the original comment, it won’t matter if a search engine has already indexed the links in the comment before you deleted it. So, one of our anti-spam tactics should be to bust up the search engine ranking for spammers’ links.

And once a spammer has given us the link to his page and we see it for what it is, we can store that information in a blacklist, and use it to block future messages that try to link to that site. And if we can make a common, distributed, authenticated way to share such blacklist info, we can be pretty effective at staunching the flow of the spam. And if we can team up now, while this problem is still young, we can basically create an environment where comment spam just isn’t cost-effective for the spammers.
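To make the idea a bit more concrete, here is a rough sketch of what such a blacklist check might look like in PHP. This isn’t code from WordPress (the function name and blacklist file are placeholders for illustration), but it shows the shape of the thing: a plain text file with one banned domain or URL fragment per line, checked against each incoming comment before it ever reaches the database.

<?php
// Sketch only -- not actual WordPress code.
// spam-blacklist.txt holds one banned domain or URL fragment per line.
function comment_is_blacklisted($comment_text, $blacklist_file = 'spam-blacklist.txt') {
    if (!file_exists($blacklist_file)) {
        return false;
    }
    $text = strtolower($comment_text);
    foreach (file($blacklist_file) as $pattern) {
        $pattern = strtolower(trim($pattern));
        if ($pattern !== '' && strpos($text, $pattern) !== false) {
            return true;
        }
    }
    return false;
}

// Refuse the comment before it is stored.
if (isset($_POST['comment']) && comment_is_blacklisted($_POST['comment'])) {
    die('Sorry, your comment looks like spam and was not accepted.');
}
?>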


  1. I would much rather have the ability to require users to authenticate in order to post comments. It doesn’t even require that much – just a password field added to the “Leave a Comment” section.

    Comment from Sage on November 17, 2003

  2. The WordPress development team is considering and implementing a variety of methods to give blog owners more control over their comment systems. The CVS code currently has otaku42’s comment moderation system in place, which allows you to make your comments manually moderated, requiring admin approval before they become public. There is also a facility in place which will allow us to add an “automatic moderation” system in the future, whereby comments could programmatically be flagged as approved, moderated, possible-spam, definite-spam, etc.

    And I’m sure that we’ll be adding some sort of comment registration system in the future, as well.

    Comment from Dougal Campbell on November 17, 2003

  3. I think you (and Sam Ruby) are mistaken in thinking Weblogs-with-comments have an anti-spam advantage over e-mail and Usenet. To start with, a Weblog-with-comments is functionally equivalent to a Usenet group that is stored only on one server: anyone who wants to read an item requests it from the server and it’s delivered to them.

    > Do they really expect blog owners to leave their comments up for people to read?

    Of course they do! Otherwise they wouldn’t try to disguise them as genuine comments (“Great site! I found just what I was looking for”), sacrificing the benefit they’d get from more relevant link text.

    > What they are really after is search engine ranking.

    Correct, and that makes the spam problem worse for Weblogs than for other media, not better.

    > Even if you’ve deleted the original comment, it won’t matter if a search engine has already indexed the links in the comment before you deleted it.

    No search engine I know of works like that. When a link to a site is removed, the site’s ranking goes down accordingly.

    > So, one of our anti-spam tactics should be to bust up the search engine ranking for spammers’ links.

    If you do that by asking search engines not to index comment pages, that would be worse than not allowing comments at all. Where comments are allowed they often contain useful information, which search engines would miss out on if they couldn’t index it. What you could do, however, is to prohibit commenters from making hyperlinks at all (by removing the URI field and disallowing the a href element).

    > And if we can make a common, distributed, authenticated way to share such blacklist info, we can be pretty effective at staunching the flow of the spam.

    I’m not sure how you can have read Mark’s article and still believe this. A blacklist scheme needs a trust model and a business model, and if those were so easy to create, you could create them for the comment system itself and not have the spam problem in the first place.

    In Sam’s case, the spam he was dealing with was created manually, and his deletion was highly automated. But that is the opposite of the usual case, which is why — in the long run — you can’t beat spammers: the process of anti-spamming (deleting spam, implementing registration schemes, developing blacklists and filters etc) will almost always be less automatable than the process of spamming is.

    The one advantage Weblogs do have over other media is that, in theory, you can change the commenting system on your Weblog as often as you like to spoil the spammer’s latest software, whereas you can’t do that with the NNTP or SMTP protocols. (You can’t do that with the Trackback protocol either, so you should remove support for it.) But I say “in theory” because each time you update it you’d need to convince your user base to upgrade, and if that was easy, viruses taking advantage of months-old Windows vulnerabilities wouldn’t be nearly so common.

    As I’ve told anyone who will listen since I turned off comments on my own Weblog in 2001: in the long run, the probability of any Weblog continuing with a comment system approaches zero.

    Comment from mpt on November 18, 2003

  4. > So, one of our anti-spam tactics should be to bust up the search engine ranking for spammers’ links.

    > If you do that by asking search engines not to index comment pages, that would be worse than not allowing comments at all. Where comments are allowed they often contain useful information, which search engines would miss out on if they couldn’t index it. What you could do, however, is to prohibit commenters from making hyperlinks at all (by removing the URI field and disallowing the a href element).

    No, I don’t plan to totally remove indexing for comment pages. And I don’t plan to prevent commenters from entering URLs. I plan to detect spammer URLs and either block them outright, or, if automatic detection can’t be sure whether or not something is spam, obfuscate the link in such a way that the search engines will not index it.

    > And if we can make a common, distributed, authenticated way to share such blacklist info, we can be pretty effective at staunching the flow of the spam.

    > I’m not sure how you can have read Mark’s article and still believe this. A blacklist scheme needs a trust model and a business model, and if those were so easy to create, you could create them for the comment system itself and not have the spam problem in the first place.

    As I said, I’m an optimist. As you said yourself, we have much more control over our blogs than administrators have over mail and usenet, because we can change our protocols at whim, for the most part. And each blog owner can decide for themselves what kind of trust model they want to implement. You have chosen the “no trust” model. Some people will stay with “complete trust”. And many of us will choose something in between. The key is to find a balance that lets us keep the lines of communication open for real web conversation, while making the spammer’s return on investment too low to bother with.

    Comment from Dougal Campbell on November 18, 2003

  5. I’ve started getting a couple of spam comments per day, and IP-specific blocking isn’t feasible — the spam is coming from a variety of addresses. Given this, are the comment spammers using robots to place the spam? If so, how hard would it be to implement some sort of system where the comments form contains a .jpg of a word, and the person posting the comment would have to type out the text of the word in a separate field before being able to submit the post?

    To save processing power, the revised b2comments.php (or whatever it’s called under WordPress; I haven’t switched yet) can be distributed with a set of pre-rendered images/word sets. The word set and the image can be associated with each other in a static array; no need to start providing dynamically generated words and images as of yet.
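    Roughly like this, just to illustrate the static-array idea (sketch only; none of these file names or words are real b2/WordPress code):

    <?php
    session_start();

    // Sketch only. A few pre-rendered images ship with the script; the array
    // maps each word to the image that displays it.
    $challenges = array(
        'wombat'  => 'challenge-01.jpg',
        'teacup'  => 'challenge-02.jpg',
        'lantern' => 'challenge-03.jpg',
    );

    // Rendering the comment form: pick one at random, remember it server-side.
    $word = array_rand($challenges);
    $_SESSION['challenge_word'] = $word;
    echo '<img src="/images/' . $challenges[$word] . '" alt="" /> ';
    echo '<input type="text" name="challenge_answer" />';

    // Processing the submission: the typed word has to match what we remembered.
    $passed = isset($_SESSION['challenge_word'], $_POST['challenge_answer'])
        && strtolower(trim($_POST['challenge_answer'])) === $_SESSION['challenge_word'];
    ?>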

    Comment from Cheng-Jih Chen on November 19, 2003

  6. > You have chosen the “no trust” model. Some people will stay with “complete trust”. And many of us will choose something in between.

    It seems likely that spamming will always be much more automatable than anti-spamming. So I don’t see any amount of trust as feasible, unless there are immediate financial and/or social penalties on unwanted communication. And on the Internet, neither of those is practical; financial penalties because of transaction costs, and social penalties because it’s trivial to adopt a new identity.

    Comment from mpt on November 20, 2003

  7. No matter what the technology is, there will always be some unscrupulous person out there to take advantage of it. When I was in the military, we were told that if someone wants to get in, they will find a way. The most you can do is slow him down. I think the same theory applies here. While the idea of the jpg with letters/numbers in it is a good idea, I’d like to see something a bit different (having to add something like that may turn off the average commenter). Here’s my idea. It’s a combination of the moderated-comments and register-to-comment ideas previously discussed.

    1) Allow guest comments, but then they become moderated by the blog owner. Or, if the commenter wants to…
    2) Allow people to sign up as “regular” commenters. They would then get a randomized PIN of sorts that they could then enter when posting. Comments from these people (the PIN would be x-ref’d with the email address) would then be automatically posted.
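    Something along these lines is what I’m picturing for the PIN part (sketch only; a real version would keep the email/PIN pairs in a database table instead of a plain array):

    <?php
    // Sketch only -- illustrative names and storage.
    $registered = array(
        'alice@example.com' => '493817',
        'bob@example.com'   => '072264',
    );

    // When someone signs up as a regular commenter, mail them a random PIN.
    function issue_pin($email) {
        $pin = sprintf('%06d', mt_rand(0, 999999));
        mail($email, 'Your commenting PIN', "Use this PIN when you post: $pin");
        return $pin;   // the caller stores it next to the email address
    }

    // When a comment arrives, a matching email+PIN pair skips the moderation queue.
    function comment_status_for($registered, $email, $pin) {
        if (isset($registered[$email]) && $registered[$email] === $pin) {
            return 'approved';    // known regular: publish immediately
        }
        return 'moderated';       // guest: wait for the blog owner
    }
    ?>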

    The idea of the jpg random image would only stop the automated system. Some of these spammers have nothing but time, and will do things by hand if pushed.
    -TG

    Comment from Chris Anderson on November 20, 2003

  8. http://www.w3.org/TR/turingtest/

    We’re not going to do image-based verification.

    Comment from Matt on November 20, 2003

  9. As I was reading the comments here, I thought of a way to handle spam comments.

    Allow all comments that do not have a URI in them to go through; all those that do have a URI get placed into the queue for the moderator to approve. As long as the grep statement that is looking for the URI is up to date, this should be very effective.

    As for trackbacks, we only have to make sure that they are processed in the same way.
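    Something as simple as this would do it (illustrative only, not code from WordPress):

    <?php
    // Hold anything that contains a URI for moderation; let the rest through.
    // The same test can be run against the content of a trackback ping.
    function needs_moderation($text) {
        // Catch http/https/ftp URLs and bare "www." links.
        return (bool) preg_match('#(https?://|ftp://|www\.)#i', $text);
    }

    $status = (isset($_POST['comment']) && needs_moderation($_POST['comment']))
        ? 'moderated'
        : 'approved';
    ?>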

    Comment from Ian Sheridan on November 20, 2003

  10. Yep, that would work … until the day when you got overwhelmed by a couple of hundred URI-containing comments, four or five of which were legitimate, and sorting through them required more time than you had to spare. That’s what I meant by “spamming will always be much more automatable than anti-spamming”.

    Comment from mpt on November 21, 2003

  11. Perhaps some sort of learning Bayesian spam filter could be implemented? It would slowly learn some of the common tactics used by comment spammers – but once the initial learning period is over, and a few reclassifications are in, it could be working well.

    Comment from totally on November 21, 2003

  12. Liked our site, bla bla bla, go play some Blackjack here! 😛 jk.
    I’ve thought about adding some sort of ‘humanity check’ to the commenting system of my still-b2 blog, e.g. images which display a word or something that must be correctly typed for the comment to be effectively published — something that can only be done by humans, I suppose…

    Comment from Blackbook on November 22, 2003

  13. At http://joelonsoftware.com/ – in the Oct 31st entry – Joel talks about how Fog Creek solved the link spam problem:

    “We’ve reengineered it so that URLs become links to a redirect server hosted by Fog Creek which, we hope, means that posting a URL in our discussion group will not boost its PageRank.”

    So, any posted URLs go through the company’s website, doing nothing for the spammer’s Google page rank. It doesn’t really stop the spammer from posting to the site, but it reduces the benefit of doing so considerably.
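    The general shape of the trick, sketched in PHP (this isn’t Fog Creek’s code, and redirect.php is just a made-up name; the point is that the redirect script is disallowed in robots.txt so the spammer’s page gets no credit for the link):

    <?php
    // Rewrite links in stored comments so they pass through a local script.
    function deflect_links($comment_html) {
        return preg_replace_callback(
            '#href="(https?://[^"]+)"#i',
            function ($m) {
                return 'href="/redirect.php?url=' . rawurlencode($m[1]) . '"';
            },
            $comment_html
        );
    }

    // redirect.php itself can be as small as this (a careful version would
    // check that the target is a URL it actually handed out, so it doesn't
    // become an open redirect):
    //   header('Location: ' . $_GET['url']);
    //   exit;
    ?>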

    Comment from Kitt Hodsden on November 23, 2003

  14. I just have a blacklist of words that, if they appear in the comment text or author name, cause the post to be rejected. The mangling of names inherent in e-mail spam will do comment spammers no good, as people are not likely to use search engines to search for misspellings of p0rn or se* or stuff like that, so the actual search words need to come through into the links – and so by blocking the very words that spammers want to use to achieve high page rank for a specific search query, the spamming attempt is thwarted. If they do choose to misspell in some misguided notion that it will help – delete it and ban the IP, the whole subnet if necessary – preferably they’re smart enough to realize that it won’t. Maybe even post some text there at the comment form explaining to spammers why misspellings won’t help.

    Comment from Andrew on November 23, 2003

  15. It seems to me there’s a lot of naiveté and defeatism going around here when we discuss comment-spam.

    First of all, none of the encoding or redirection tricks are going to work, because Google follows them regardless. You should just move on to putting your comments pages in a directory with a robots.txt noindex directive.

    mpt’s concern that “asking search engines not to index comment pages would be worse than not allowing comments at all” is touching, but misguided. If the page is useful and important, Google and the rest will hear about it in places OTHER than our comment sections (i.e., people will post about them in clean posts on their own ‘blogs, not just in our comments).

    Comment from Joel "Jaykul" Bennett on November 23, 2003

  16. As I see it there are only two solutions:

    1.) No Comments

    OR

    2.) Moderate them, and as far as I can tell, 0.73 has this functionality. I got it working on my test server and am now placing it on my live site (along with a redesign; it’s messy but all there).

    And both of these solutions are the same: they both eliminate the ability of the spammer to post any comments. Trackbacks should be dealt with in the same fashion. It’s a pain for those that get a ton of comment spam, but nothing is going to work better than using your own brain to filter out the unwanted.

    Comment from Ian Sheridan on November 23, 2003

  17. Joel, I’m not bothered about Google missing out on pages because they’re linked to only from comments. As you say, that’s unlikely.

    What I am bothered about is that Google will miss out on useful information because it’s posted as an (unindexed) comment on someone else’s Weblog, instead of being posted as an article on your own Weblog.

    (And yes, naïveté and pessimism are opposite extremes, but technology encourages extremes.)

    Comment from mpt on November 27, 2003

  18. > And once a spammer has given us the link to his page and we see it for what it is, we can store that information in a blacklist, and use it to block future messages that try to link to that site.

    My comment on this is that it’s a good idea if a person or group will take on responsibility for keeping the blacklist updated and accurate. Spammers have a high turnover on web domains, as a look at any expired domains list will show. That could be a bad thing for some unsuspecting person who purchases the domain later, only to find it blocked everywhere because of a spammer.

    I’ve had a somewhat similar experience with a shared server for one of my domains. I was having database problems and my webhost suggested switching to a different server that had CPanel. I made the switch, only to find out the server had previously hosted a spammer, and as a result I couldn’t send out email to my registered members. Everything from the monthly newsletter to requests for lost passwords has to be routed through another account. A little research showed the spammer had been gone for many months, but the server remains on the blacklist because whoever maintains the list apparently hasn’t updated it in over 6 months.

    My previous view of blacklists was very gung-ho, but now I am more moderate and think they are a good idea only if they are responsibly maintained. Better would be something like MethLabs’ list of P2P “enemies”, where people can post links, vote or comment on them, and block them if they choose.

    I really like the idea of busting up the search engine ranking. I get tired sometimes of being spam police for my sites, especially one that is top-ranked in its area on all the major engines. I’ve already had to delete “news” and comments submitted to that site that were spam, and pull links that have nothing to do with the subject matter of the site. I would love a way to put them in an oubliette where all their effort was wasted. The only thing that I can think of at the moment would be to deactivate links in comments, or require comments containing links to be reviewed before posting.

    Comment from Laura on November 27, 2003

  19. FWIW, I have had some preliminary success in integrating a Bayesian analysis function with WordPress. I’m still in the very early testing phase, but I foresee this becoming one of several tools that we will be using to detect and handle spam.
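    To give a rough idea of the approach (this isn’t the code I’m testing, just the general shape of the scoring step; the word counts would come from comments the blog owner has already classified as spam or not-spam):

    <?php
    // Sketch of a naive Bayesian spam score for a comment.
    // $spam_counts / $ham_counts map words to how many spam / non-spam
    // comments contained them; $n_spam / $n_ham are the totals (both > 0).
    function spam_probability($comment_text, $spam_counts, $ham_counts, $n_spam, $n_ham) {
        $tokens = preg_split('/[^a-z0-9]+/', strtolower($comment_text), -1, PREG_SPLIT_NO_EMPTY);
        // Work in log space so lots of small probabilities don't underflow.
        $log_spam = log($n_spam / ($n_spam + $n_ham));
        $log_ham  = log($n_ham  / ($n_spam + $n_ham));
        foreach (array_unique($tokens) as $t) {
            $s = isset($spam_counts[$t]) ? $spam_counts[$t] : 0;
            $h = isset($ham_counts[$t])  ? $ham_counts[$t]  : 0;
            // Add-one smoothing keeps unseen words from zeroing everything out.
            $log_spam += log(($s + 1) / ($n_spam + 2));
            $log_ham  += log(($h + 1) / ($n_ham + 2));
        }
        return 1 / (1 + exp($log_ham - $log_spam));
    }

    // e.g. flag a comment for moderation when the score crosses a threshold:
    // if (spam_probability($text, $spam_counts, $ham_counts, 120, 480) > 0.9) { ... }
    ?>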

    Comment from Dougal Campbell on December 1, 2003

  20. Another solution I heard somewhere was that of renaming the comments file(s), so that spam robots, not finding the default-named files, would be unsuccessful. That requires changing the name all over the WordPress code; maybe in some future revision it would be useful to put the comments filename in a config variable, so that it would only need to be changed once.

    Comment from Blackbook on December 8, 2003

  21. > Another solution I heard somewhere was that of renaming the comments file(s)

    This sounds promising. If you then choose the names of the POST vars in the input form at random (so every blog has a unique set of POST variable names), it gets even harder 🙂
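    Sketched out, it might look like this (made-up storage; a real version would keep the random suffix in the blog’s options rather than a text file):

    <?php
    // Pick a random suffix once and keep reusing it, so the comment field on
    // this blog is named something like "comment_7f3a9c" instead of "comment".
    $suffix_file = 'field-suffix.txt';
    if (!file_exists($suffix_file)) {
        file_put_contents($suffix_file, substr(md5(uniqid('', true)), 0, 6));
    }
    $suffix = trim(file_get_contents($suffix_file));

    // In the comment form template:
    echo '<textarea name="comment_' . $suffix . '"></textarea>';

    // In the script that processes the form: a robot that only knows the
    // default field names never even reaches the spam checks.
    $field = 'comment_' . $suffix;
    if (!isset($_POST[$field])) {
        die('No comment received.');
    }
    $comment_text = $_POST[$field];
    ?>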

    Comment from Dirk on February 4, 2004

  22. (Editor: Sorry for the anonymous email address… I don’t give it out on the web… we’ve gone for days without spam, and I aim to keep it that way. If you need to contact me, you can via our website. Please remove this portion of my message before posting it to your ‘blog. Thanks.)

    We use Radio UserLand for our ‘blog. This software has a feature that we use extensively that lets us forward email directly to the ‘blog. Thus, if someone wants to make a comment, they have to email us first, we look over the message to be sure it’s something we want in our ‘blog, then we forward it to the ‘blog. It’s the only way I’ve found of keeping 100% of the spammers out of our ‘blog.

    We also use it to forward the market research and news that we receive from our Clearing Firm. It has saved me many, many carpal tunnel operations…

    Comment from Darren on March 9, 2004

  23. Hey there-

    New to the WordPress world, but the recent 3.0 issues with MT have given me reason to check out the alternatives (interesting note: while ultimately I’m fine with the way SixApart has responded to the furor, I’m still thinking about jumping ship to WordPress for performance reasons). Anyhow, this particular thread is one of interest to me, not too surprisingly. Over in the MT world, James Seng has actually written two plug-ins to handle the comment-spam issue – the first was an image-based verification (which Matt has justifiably negged), the latter is a Bayesian filter written in Perl:

    http://james.seng.cc/archives/000152.html

    While it’s not perfect (check out the update to his post), it might be something worth checking out. (I personally don’t code – while I am interested in starting, I just don’t have the bandwidth right now).

    Comment from Brian on May 19, 2004

  24. First: I’d like a definition of “blog comment spam.” I sometimes worry about leaving a comment, with honest reactions to a post, and including text about “I have a recent article on this topic in my blog site, http://www.whatever...”

    I mean, I want to respond via comments, and I’d like to leave my blog URL for those who may be interested in my own article, but I’m not trying to scam my way into higher Google rankings.

    Second, I just read an article at the Six Apart dot com web site, on comment spam, and the last 15 comments (approx.) were porn comment spam.

    Third, even here, look up above my comment, what’s up with the classic comment spam “Interesting story, I read almost all this page”????? (and others also questionable)????

    Comment from Steven Streight on August 31, 2004
