logic question
David Kramer
david at thekramers.net
Tue Feb 22 13:13:15 EST 2005
On Tue, 22 Feb 2005, Bill Holt wrote:
> Hello, I have postfix/spam assassin/redhat es4.0 I'm stumped on how to
> seed the bayesian database. The corpus @ wiki is old (don't want to seed
> it with email from 2004), and I am using this machine as a gateway to an
> exchange server. So by the time the email gets to the exchange server,
> It's useless to me. My question is how to get the spam back on the
> gateway for processing. Do I just take spam from users and write rules
> accordingly? I'm a little lost at the best way to approach this. Any
> pointers in the right direction would be greatly appreciated. Thank you,
> Bill
I was just talking to a coworker (and now BLU member) about that this
morning. Steve, consider this your answer, too.
As you know, SpamAssassin doesn't just say whether an email is spam or
not; it gives each message a numerical score, and you can do different
things with emails of different scores. I have mailboxes named _SpamMaybe
and _SpamSAYes, where possible and very likely spam messages,
respectively, get dumped.
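To make the score-to-folder idea concrete, here is a rough sketch in bash. The thresholds (3 and 8) and the INBOX fallback are made-up values for illustration, not the ones I actually use, and the function names are hypothetical. It assumes the message carries an X-Spam-Status header of the form "Yes, score=7.2 required=5.0 ..." (older 2.x releases wrote "hits=" instead of "score=").

```bash
#!/bin/bash
# Hypothetical thresholds; pick values that suit your own mail.
MAYBE_THRESHOLD=3
YES_THRESHOLD=8

# Print the folder a message would be filed into, given its score.
folder_for_score() {
    local score=$1
    # awk does the decimal comparison that bash arithmetic cannot.
    awk -v s="$score" -v m="$MAYBE_THRESHOLD" -v y="$YES_THRESHOLD" 'BEGIN {
        if (s >= y)      print "_SpamSAYes"
        else if (s >= m) print "_SpamMaybe"
        else             print "INBOX"
    }'
}

# Read message headers on stdin and print the SpamAssassin score.
score_from_headers() {
    sed -nE 's/^X-Spam-Status: [A-Za-z]+, (score|hits)=(-?[0-9.]+).*/\2/p' |
        head -n 1
}
```

For example, `folder_for_score "$(score_from_headers < message.txt)"` prints the folder name for one saved message. In practice the filing itself is normally done by the delivery agent (procmail in my case), so treat this only as an illustration of the logic.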
I also have folders SpamSASpam and SpamSAHam. When I find messages that
were not scored highly enough as spam, whether in _SpamMaybe or any other
folder, I move them to SpamSASpam. Likewise, I copy any non-spam messages
that get caught as spam to SpamSAHam. Then a script on my mail server
trains the database from those folders and appends their contents to an
offline archive file. This is a cut-down version of that script:
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
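#!/bin/bash
# Train SpamAssassin's Bayes database from the hand-sorted mbox folders,
# then append each folder to the archive and empty it.
SRCDIR=~/IMAP
DSTDIR=~/IMAPARCHIVE

# Spam that SpamAssassin missed: learn it as spam, then archive it.
if [ -s "$SRCDIR/SpamSASpam" ] ; then
    echo "Found spam"
    sa-learn --spam --mbox "$SRCDIR/SpamSASpam"
    cat "$SRCDIR/SpamSASpam" >> "$DSTDIR/SpamSASpam"
    cp /dev/null "$SRCDIR/SpamSASpam"    # truncate the folder in place
fi

# Spam that was already caught: no training here, just archive it.
if [ -s "$SRCDIR/_SpamSAYes" ] ; then
    echo "Found spam already caught"
    cat "$SRCDIR/_SpamSAYes" >> "$DSTDIR/SpamSASpam"
    cp /dev/null "$SRCDIR/_SpamSAYes"
fi

# Legitimate mail wrongly flagged as spam: learn it as ham, then archive it.
if [ -s "$SRCDIR/SpamSAHam" ] ; then
    echo "Found ham"
    sa-learn --ham --mbox "$SRCDIR/SpamSAHam"
    cat "$SRCDIR/SpamSAHam" >> "$DSTDIR/SpamSAHam"
    cp /dev/null "$SRCDIR/SpamSAHam"
fi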
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NB: This script should really look for the procmail lock files before
copying/truncating the files, but it's just not that big a deal.
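If you do want the locking, something along these lines might work. It is only a sketch: it assumes procmail is using its default lock file name (the mailbox name plus ".lock"), the function names are mine, and it relies on bash's noclobber option to create the lock file in one step rather than on procmail's own lockfile utility. I would not trust it on NFS without testing.

```bash
#!/bin/bash
# Sketch only: assumes procmail's default lock name, "<mailbox>.lock".

# Try to take the dot-lock for a mailbox; give up after a few attempts.
acquire_lock() {
    local lock="$1.lock" tries=${2:-5}
    while (( tries-- > 0 )); do
        # With noclobber set, the redirection fails if the file already
        # exists, so testing for the lock and creating it are one step.
        if ( set -o noclobber; echo "$$" > "$lock" ) 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}

release_lock() {
    rm -f "$1.lock"
}

# Run a command while holding the mailbox lock; skip the mailbox if busy.
with_mailbox_lock() {
    local mbox=$1
    shift
    if acquire_lock "$mbox"; then
        "$@"
        local status=$?
        release_lock "$mbox"
        return $status
    fi
    echo "Skipping $mbox: locked" >&2
    return 1
}
```

To be useful, the whole learn/append/truncate sequence for one folder has to run under the same lock, for instance by putting those three commands in a function and calling `with_mailbox_lock "$SRCDIR/SpamSASpam" that_function`.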
You will note that $DSTDIR/SpamSASpam grows indefinitely. This is a good
thing. I just had a problem on my system where an update of Perl broke
DB_File (Thank you, SuSE), and all hell broke loose on my bayes files.
Upgrading SpamAssassin did no good (though the new version is MUCH
better). I eventually ended up deleting the bayes files, but I had my
big, fat corpus of spam from the past year or so to retrain with.
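Retraining from the archive is just the same sa-learn calls pointed at $DSTDIR instead of $SRCDIR. A minimal sketch, assuming the archive file names from the script above; the SA_LEARN variable and the function name are my own additions so the command can be swapped out for a dry run:

```bash
#!/bin/bash
# Sketch: rebuild the Bayes database from the archived mbox files.
DSTDIR=${DSTDIR:-"$HOME/IMAPARCHIVE"}
SA_LEARN=${SA_LEARN:-sa-learn}    # set SA_LEARN=echo for a dry run

retrain_from_archive() {
    local kind mbox
    for kind in spam ham; do
        case $kind in
            spam) mbox=$DSTDIR/SpamSASpam ;;
            ham)  mbox=$DSTDIR/SpamSAHam ;;
        esac
        if [ -s "$mbox" ]; then
            # Same flags the regular training script uses.
            "$SA_LEARN" "--$kind" --mbox "$mbox"
        else
            echo "No archived $kind at $mbox" >&2
        fi
    done
}
```

Calling `retrain_from_archive` after the broken database files are out of the way feeds both archives back in. On a year's worth of mail, expect it to take a while.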
WARNING: Bayes won't work well unless you feed it ham, too. Don't forget
to train both ham and spam.
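One way to check that you have fed it enough of both is to look at the counts that `sa-learn --dump magic` reports. The sketch below reads that output on stdin. It assumes the count is the third whitespace-separated field and that the relevant lines end in "nspam" and "nham", so check that against what your version prints. The 200-message minimum matches SpamAssassin's documented defaults for bayes_min_spam_num and bayes_min_ham_num; the function names are hypothetical.

```bash
#!/bin/bash
# Minimum messages of each kind before Bayes scoring kicks in
# (SpamAssassin's documented default; adjust if you changed it).
MIN_MESSAGES=200

# Read `sa-learn --dump magic` output on stdin, print the trained counts.
# Assumes the count is field 3 and the line ends in "nspam" or "nham".
bayes_counts() {
    awk '$NF == "nspam" { spam = $3 }
         $NF == "nham"  { ham = $3 }
         END { printf "spam=%d ham=%d\n", spam, ham }'
}

# Print the counts and succeed only if both reach the minimum.
bayes_ready() {
    local counts spam ham
    counts=$(bayes_counts)
    spam=${counts#spam=}; spam=${spam%% *}
    ham=${counts##*ham=}
    echo "$counts"
    (( spam >= MIN_MESSAGES && ham >= MIN_MESSAGES ))
}
```

Usage would be something like `sa-learn --dump magic | bayes_ready || echo "Bayes needs more training"`.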
You're welcome to my corpus, if the fact that the emails are to me instead
of you won't affect it. It's about 24MB.
--
DDDD David Kramer david at thekramers.net http://thekramers.net
DK KD
DKK D It is the business of the future to be dangerous
DK KD
DDDD -DJ SPooky