Gmail's spam filtering
Tom Metro
blu at vl.com
Thu Mar 22 13:19:15 EDT 2007
Kristian Hermansen wrote:
> How does Gmail do it? Do they utilize the fact that millions of their
> users (agents) help in the learning process of what is 'spam' by
> clicking that 'Report Spam' button?
...
And in a later posting:
> Sender Reputation in a Large Webmail Service, Bradley Taylor, Third
> Conference on Email and Anti-Spam (CEAS 2006), 2006
>
> Short read too:
> http://www.ceas.cc/2006/19.pdf
That was an interesting read.
Their technique is pretty simple, and essentially it does work as you
originally speculated before you found the paper. The technique can be
summed up as:
First, they determine who the sending party is. Unlike most spam
filtering systems, they avoid relying on IP addresses. Instead they
depend heavily on SPF[1] and DomainKeys[2]. Because these mechanisms
aren't widely used yet, they expand the scope of SPF by using a "best
guess" rule[3] to figure out whether the sending machine's IP address is
a likely match for the domain. According to their stats, only 26% of the
non-spam messages they receive can't be authenticated using one or more
of these techniques, while only about 40% of the spam can be authenticated.
1. http://www.openspf.org/
2. http://www.ietf.org/html.charters/dkim-charter.html
3. http://www.openspf.org/FAQ/Best_guess_record
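To make the "best guess" idea concrete, here is a minimal sketch in Python. It assumes the guess amounts to checking whether the connecting IP sits in the same /24 block as one of the addresses the claimed domain's A or MX records resolve to. The function name, the /24 default, and the pre-resolved address list are illustrative assumptions, not details from the paper, and a real check would also do the DNS lookups itself.

```python
import ipaddress

def best_guess_match(sender_ip, domain_ips, prefix_len=24):
    """Return True if sender_ip falls in the same /prefix_len block as any
    address in domain_ips (hypothetical, already-resolved A/MX addresses)."""
    sender = ipaddress.ip_address(sender_ip)
    for ip in domain_ips:
        # strict=False lets us build the enclosing network from a host address
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        if sender in network:
            return True
    return False
```

For example, a message arriving from 192.0.2.77 would be treated as a likely match for a domain whose mail exchanger is 192.0.2.10, while one arriving from 198.51.100.5 would not.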
Next, they calculate a "reputation" for the sender, which is a
percentage showing how non-spammy they are. (0% is all spam, 100% is all
non-spam.) Feeding into that calculation are the counts of users marking
messages from that sender as spam, or not-spam, as well as stats showing
how past messages from that sender were classified.
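As a rough illustration of that calculation, the sketch below treats reputation as the share of "good" events among all recorded events for a sender. This is an assumed formula for illustration only; the exact weighting the paper uses may differ, and the class and field names are made up here.

```python
from dataclasses import dataclass

@dataclass
class SenderStats:
    # All four counters are hypothetical names for the inputs described above.
    user_spam: int = 0       # users marked a message as spam
    user_not_spam: int = 0   # users marked a message as not-spam
    auto_spam: int = 0       # past messages classified as spam
    auto_not_spam: int = 0   # past messages classified as non-spam

def reputation(stats):
    """Return a percentage (0 = all spam, 100 = all non-spam), or None
    when there is no history for the sender. Assumed formula, not the paper's."""
    good = stats.user_not_spam + stats.auto_not_spam
    bad = stats.user_spam + stats.auto_spam
    total = good + bad
    if total == 0:
        return None
    return 100.0 * good / total
```

With this formula, a sender with nine messages classified as non-spam and one user spam report would score 90%.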
Their charts show that senders tend to cluster towards the top or bottom
of the spectrum. Most are either below 5% or above 80%.
If the reputation is below a threshold, say 5%, it's spam. If it's above
another threshold, say 80%, it's non-spam. All the stuff that falls in
the middle gets sent to a statistical filter. (The paper didn't mention
which filter. Nor does it address which other anti-spam techniques,
like greylisting, Gmail may or may not be using.)
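The two-threshold decision can be sketched as follows. The 5% and 80% defaults are just the example values used above, the content filter is passed in as a stand-in callable because the paper doesn't name one, and sending senders with no reputation to the filter is an assumption.

```python
def classify(reputation, content_filter, spam_below=5.0, not_spam_above=80.0):
    """Decide on reputation alone when it is clearly low or high;
    otherwise defer to content_filter, a callable returning a verdict."""
    if reputation is not None:
        if reputation < spam_below:
            return "spam"
        if reputation > not_spam_above:
            return "not-spam"
    # Middle of the range, or no reputation at all (assumed behavior):
    # fall back to the statistical filter.
    return content_filter()
```

Only the messages in the middle band pay the cost of content analysis; the clustering of senders near the extremes means that band is comparatively small.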
So largely they depend on their users to determine whether a sender is
spammy. (The paper seems to suggest that while the votes from all users
are used in aggregate to calculate a sender's reputation, if an
individual marks a sender a certain way, mail from that sender will be
sorted accordingly for that specific user. In other words, individual
users have their own white lists and black lists that override the
normal formula.)
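If that reading of the paper is right, the per-user override might look something like the sketch below. The function and parameter names are hypothetical, and giving the white list priority over the black list is an arbitrary choice made here.

```python
def verdict_for_user(sender, aggregate_verdict, user_whitelist, user_blacklist):
    """Apply one user's own markings before the aggregate verdict.
    The two sets hold senders this user has marked not-spam or spam."""
    if sender in user_whitelist:
        return "not-spam"
    if sender in user_blacklist:
        return "spam"
    return aggregate_verdict
```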
Their system seems to be heavily dependent on their ability to
authenticate the sender. Oddly absent from the paper is a discussion of
what they do about the senders that can't be authenticated (26% of
non-spam and 60% of spam). I wonder how they are even counting those
senders (in their stats), if they can't determine who they are, and they
aren't falling back on using IP addresses. They could be counting
thousands of fictitious domains as unique senders if they're only
looking at the domain.
The paper says one of the challenges to their system is that some users
don't log in to the web UI, and thus never classify messages. Perhaps
some day they'll switch from POP to IMAP (so users can remotely browse
their spam folder), and provide something like a Thunderbird extension
so users can classify messages.
The paper concludes by comparing their system to several existing
systems like SpamCop, Return Path’s Sender Score, and Habeas' SenderIndex,
some of which return a binary spam/not-spam indicator, and a few that
return a score. But again they pointed out that these systems rely on
the sender's IP address and say, "Using the authenticated domain, rather
than the IP address though, would be a welcome improvement to these
systems."
The author of the paper seems to be almost disappointed that Google has
amassed this database of information on senders, but doesn't want to
share it with the public, and he encourages the development of an open
system that applies the same techniques: "It would be nice if a
third-party service could provide something similar that everyone could
use." Could be an interesting project...
I have an in-house developed anti-spam proxy that I use on our mail
server, and I'll probably try incorporating some of these techniques.
> However, Gmail catches them every time :-)
I wouldn't say every time. But it does a darn good job. I primarily use
Gmail via POP, so it is inconvenient to reclassify messages, but I have
done so on a few occasions, both for false positives and false
negatives. Perhaps a few times a quarter I'll get some spam, and more
frequently when the spam is redistributed by mailing lists.
-Tom
--
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/