Gmail's spam filtering

Thu Mar 22 13:19:15 EDT 2007

Kristian Hermansen wrote:
> How does Gmail do it?  Do they utilize the fact that millions of their
> users (agents) help in the learning process of what is 'spam' by
> clicking that 'Report Spam' button?
...

And in a later posting:
> Sender Reputation in a Large Webmail Service, Bradley Taylor, Third
> Conference on Email and Anti-Spam (CEAS 2006), 2006
> 
> Short read too:
> http://www.ceas.cc/2006/19.pdf

That was an interesting read.

Their technique is pretty simple, and essentially it does work as you 
originally speculated before you found the paper. The technique can be 
summed up as:

First, they determine who the sending party is. Unlike most spam 
filtering systems, they avoid relying on IP addresses. Instead they 
depend heavily on SPF[1] and DomainKeys[2]. Because these mechanisms 
aren't widely used yet, they expand the scope of SPF by using a "best 
guess" rule[3] to figure out whether the sending machine's IP address is 
a likely match for the domain. According to their stats, only 26% of the 
non-spam messages they receive can't be authenticated using one or more 
of these techniques, while only about 40% of the spam can be authenticated.

1. http://www.openspf.org/
2. http://www.ietf.org/html.charters/dkim-charter.html
3. http://www.openspf.org/FAQ/Best_guess_record
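
To make the "best guess" idea concrete, here is a rough Python sketch. As I understand the FAQ entry [3], the guess is roughly equivalent to an SPF record of "a/24 mx/24 ptr", but treat that as my assumption rather than something stated in the paper. The function name and its parameters are made up for illustration. The DNS results (A, MX, and validated PTR names) are passed in as plain lists so the example runs without a resolver, and it only makes sense for IPv4.

```python
import ipaddress


def best_guess_match(sender_ip, a_addrs, mx_addrs, ptr_names, domain):
    """Return True if sender_ip is a plausible sender for domain.

    Assumed rule (not from the paper): the IP is in the same /24 as one
    of the domain's A or MX addresses, or one of its already-validated
    reverse DNS names falls under the domain.
    """
    ip = ipaddress.ip_address(sender_ip)

    # a/24 and mx/24: same /24 network as any A or MX host address.
    for addr in list(a_addrs) + list(mx_addrs):
        net = ipaddress.ip_network(f"{addr}/24", strict=False)
        if ip in net:
            return True

    # ptr: a reverse DNS name equal to the domain or a subdomain of it.
    # The caller is assumed to have forward-confirmed these names.
    domain = domain.lower()
    for name in ptr_names:
        name = name.lower().rstrip(".")
        if name == domain or name.endswith("." + domain):
            return True

    return False
```

For example, best_guess_match("192.0.2.77", ["192.0.2.10"], [], [], "example.com") matches because both addresses sit in 192.0.2.0/24.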

Next, they calculate a "reputation" for the sender, which is a 
percentage showing how non-spammy they are. (0% is all spam, 100% is all 
non-spam.) Feeding into that calculation are the counts of users marking 
messages from that sender as spam, or not-spam, as well as stats showing 
how past messages from that sender were classified.
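
As a rough illustration of that calculation, here is a minimal Python sketch. The exact formula is my assumption, not a quote from the paper: it simply treats the reputation as the share of "good" events among all events, where events are automatic classifications plus user votes. The SenderStats class and its field names are hypothetical, and the sketch ignores the double counting that happens when a user re-marks a message that was already classified automatically.

```python
from dataclasses import dataclass


@dataclass
class SenderStats:
    """Per-sender counters (hypothetical names, for illustration only)."""
    auto_spam: int = 0      # messages the filter classified as spam
    auto_nonspam: int = 0   # messages the filter classified as non-spam
    user_spam: int = 0      # "Report Spam" clicks
    user_nonspam: int = 0   # "Not Spam" clicks


def reputation(stats):
    """Return a 0-100 score (0 = all spam, 100 = all non-spam).

    Returns None when there is no history for the sender at all.
    """
    good = stats.auto_nonspam + stats.user_nonspam
    bad = stats.auto_spam + stats.user_spam
    total = good + bad
    if total == 0:
        return None
    return 100.0 * good / total
```

A sender with 5 non-spam and 95 spam classifications and no user votes would score 5.0 under this formula.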

Their charts show that senders tend to cluster towards the top or bottom 
of the spectrum. Most are either below 5% or above 80%.

If the reputation is below a threshold, say 5%, it's spam. If it's above 
another threshold, say 80%, it's non-spam. All the stuff that falls in 
the middle gets sent to a statistical filter. (The paper didn't mention 
which filter. Similarly, the paper doesn't address what other anti-spam 
techniques, like greylisting, Gmail may or may not be using.)
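
The two-threshold routing can be sketched in a few lines of Python. The 5% and 80% cutoffs are just the example numbers used above, not values I can confirm from Gmail. The function name, the callback, and the choice to send senders with no reputation to the statistical filter are my own assumptions.

```python
SPAM_BELOW = 5.0      # example threshold: below this, treat as spam
NONSPAM_ABOVE = 80.0  # example threshold: above this, treat as non-spam


def route(reputation, statistical_filter):
    """Decide a message's fate from its sender's reputation (0-100).

    statistical_filter is a callable returning "spam" or "non-spam";
    it is only invoked for the uncertain middle band, or when the
    sender has no reputation yet (reputation is None).
    """
    if reputation is not None:
        if reputation < SPAM_BELOW:
            return "spam"
        if reputation > NONSPAM_ABOVE:
            return "non-spam"
    return statistical_filter()
```

The nice property is that the expensive content filter only runs on the minority of mail whose sender sits in the middle of the spectrum.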

So largely they depend on their users to determine whether a sender is 
spammy. (The paper seems to suggest that while the votes from all users 
are used in aggregate to calculate a sender's reputation, if an 
individual marks a sender a certain way, mail from that sender will be 
sorted accordingly for that specific user. In other words, individual 
users have their own white lists and black lists that override the 
normal formula.)
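
If I'm reading the paper right, the per-user override amounts to something like the following Python sketch. The class and method names are hypothetical, and the idea that a user's latest mark simply replaces the aggregate verdict is my interpretation, not a documented behavior.

```python
class UserOverrides:
    """Per-user sender marks that take priority over the aggregate verdict."""

    def __init__(self):
        # Maps user -> {sender: "spam" or "non-spam"}.
        self._marks = {}

    def mark(self, user, sender, verdict):
        """Record that this user marked this sender as spam or non-spam."""
        self._marks.setdefault(user, {})[sender] = verdict

    def verdict(self, user, sender, aggregate_verdict):
        """Return the user's own mark if present, else the aggregate one."""
        return self._marks.get(user, {}).get(sender, aggregate_verdict)
```

So a sender the aggregate formula considers spam would still land in the inbox of a user who had marked that sender as non-spam, while everyone else keeps getting the aggregate result.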

Their system seems to be heavily dependent on their ability to 
authenticate the sender. Oddly absent from the paper is a discussion of 
what they do about the senders that can't be authenticated (26% of 
non-spam and 60% of spam). I wonder how they are even counting those 
senders (in their stats), if they can't determine who they are, and they 
aren't falling back on using IP addresses. They could be counting 
thousands of fictitious domains as unique senders if they're only 
looking at the domain.

The paper says one of the challenges to their system is that some users 
don't log in to the web UI, and thus never classify messages. Perhaps 
some day they'll switch from POP to IMAP (so users can remotely browse 
their spam folder), and provide something like a Thunderbird extension 
so users can classify messages.

The paper concludes by comparing their system to several existing 
systems like SpamCop, Return Path’s Sender Score, and Habeas' SenderIndex, 
some of which return a binary spam/not-spam indicator, and a few of which 
return a score. But again the paper points out that these systems rely on 
the sender's IP address and says, "Using the authenticated domain, rather 
than the IP address though, would be a welcome improvement to these 
systems."

The author of the paper seems to be almost disappointed that Google has 
amassed this database of information on senders, but doesn't want to 
share it with the public, and he encourages the development of an open 
system that applies the same techniques: "It would be nice if a 
third-party service could provide something similar that everyone could 
use." Could be an interesting project...

I have an in-house developed anti-spam proxy that I use on our mail 
server, and I'll probably try incorporating some of these techniques.

> However, Gmail catches them every time :-)

I wouldn't say every time. But it does a darn good job. I primarily use 
Gmail via POP, so it is inconvenient to reclassify messages, but I have 
done so on a few occasions - both for false positives and false 
negatives. Perhaps a few times a quarter I'll get some spam, and more 
frequently when the spam is redistributed by mailing lists.

  -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
