Spam

I relied on these two docs to set up my anti-spam relay server: Scott Vintinner's page and Scott Henderson's. Here is a collection of tips for further customizing a postfix/amavisd-new/SpamAssassin setup.

This is a PowerPoint presentation I put together for my local Linux user group. It covers a few techniques for fighting spam, along with a description of what I installed in my office. I did this on my Mac, so the fonts might look a bit funny on a PC; I haven't tried it in OpenOffice.

Sender Policy Framework (SPF)

Along with SpamAssassin et al., I think email authentication is definitely a good idea; I just wish it were a bit more widely deployed! Here's what I wrote about it.

Setting up white/blacklists in amavis

At some point after you've set up your box, you're almost certainly going to need to do one of three things:

whitelist certain individuals or domains.
blacklist a few spam-domains that still manage to get through.
stop filtering mail for users who have decided to opt out.

You can do this at the SA level, but it's better to use amavis for this sort of thing, as you'll save processing power by not running SpamAssassin, and SA doesn't always get the full set of addresses in all circumstances (BCC'ing, for instance).

Comment out this section in /etc/amavisd.conf:

map { $whitelist_sender{lc($_)}=1} (qw( cert-advisory-owner@cert.org ... returns.groups.yahoo.com ));

...add these lines:
read_hash(\%whitelist_sender, '/var/amavis/whitelist');
read_hash(\%blacklist_sender, '/var/amavis/blacklist');
read_hash(\%spam_lovers, '/var/amavis/spam_lovers');

...and then touch the files into existence:
# touch /var/amavis/whitelist
# touch /var/amavis/blacklist
# touch /var/amavis/spam_lovers

you can add names to these lists like so:

# echo thinkgeek.com >> /var/amavis/whitelist
# echo cas_likes_spam@caspergasper.com >> /var/amavis/spam_lovers

The lists don't support globbing (no * wildcards); you have to give either the whole email address or the domain (actually that's not completely true, but if you stick to those rules you won't go wrong). These text files are read only at startup, so after amending them you need to restart amavisd for the changes to take effect; on a RH box just type:

# service amavisd restart

or, if not, call the init script from wherever your services are kept:

# /etc/init.d/amavisd restart

Here are some very trivial scripts to further simplify whitelisting etc., so you can just type:

# add_to_whitelist caspergasper.com

and the name will be added to the list and amavisd will be restarted. Here's the whitelist, the blacklist and the spam-lovers scripts. Download them to somewhere in your path like /usr/local/sbin/
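The downloadable scripts are not reproduced on this page, but a wrapper of this kind might look roughly like the sketch below. It assumes the list files from the section above; the function name add_to_list and the RESTART_CMD variable are made up for illustration and are not taken from the original scripts.

```shell
#!/bin/sh
# Hypothetical sketch of an add_to_whitelist-style helper; not the original script.
# RESTART_CMD is an illustrative override so the restart step can be swapped out.
RESTART_CMD="${RESTART_CMD:-service amavisd restart}"

# add_to_list FILE ENTRY: append ENTRY (lower-cased) to FILE unless it is
# already listed, then restart amavisd so the list is re-read.
add_to_list() {
    list_file="$1"
    entry=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    if [ -z "$entry" ]; then
        echo "usage: add_to_list FILE ADDRESS_OR_DOMAIN" >&2
        return 1
    fi
    # -x matches the whole line, -F treats the entry as a literal string
    if grep -qxF -- "$entry" "$list_file" 2>/dev/null; then
        echo "$entry is already in $list_file"
        return 0
    fi
    printf '%s\n' "$entry" >> "$list_file"
    $RESTART_CMD
}

# Example: add_to_list /var/amavis/whitelist caspergasper.com
```

Skipping entries that are already present avoids restarting amavisd for no reason; one thin wrapper per list (whitelist, blacklist, spam_lovers) could then call the same function with a different file.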

In addition I often use this script that can stop and restart both postfix and amavisd at once.

pflogsumm double-reporting

If you're running the pflogsumm reporting tool for postfix along with amavis, you'll find it doubles the number of emails successfully passed through your system. The problem is simply that when amavis has finished with a mail it requeues it, so the mail genuinely does go through postfix twice. Here is a cover script that fixes the problem -- just download it to /usr/local/bin/ (or wherever). You might need to change the path to pflogsumm at the top of the file; then just alter the command in your cron job to run pflog_amavis instead of pflogsumm.

One minor change -- the script won't read the log file from standard input; the log file has to be the last parameter on the command line, i.e., this doesn't work:

cat /var/log/maillog | pflog_amavis

it has to be like this:

pflog_amavis -i /var/log/maillog

I think this is more readable anyway. Here's an example file that shows the format; put it in /etc/cron.daily/ or cron.weekly/, depending on how often you want to run the report. Just change the paths to pflog_amavis and your log file as needed. Incidentally, make sure the script gets called before logrotate, or else you'll be reading a blank file at the end of the week and you'll wonder why you get no mail on a Saturday. The easiest way of ensuring this is to prefix the script name with a zero, as they're called in alphabetical order.
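To see why the zero prefix works, the sketch below lists a scratch directory in sorted name order, which is the order the text above describes for the cron directories. The file name 0pflog_amavis is only an example name, not one the script requires.

```shell
#!/bin/sh
# Illustration only: names are compared character by character, so a name
# starting with a digit sorts ahead of "logrotate".
demo_dir=$(mktemp -d)
touch "$demo_dir/logrotate" "$demo_dir/0pflog_amavis" "$demo_dir/pflog_amavis"

# LC_ALL=C gives plain byte-order sorting, independent of the system locale
run_order=$(cd "$demo_dir" && LC_ALL=C ls)
echo "$run_order"

rm -rf "$demo_dir"
```

Here 0pflog_amavis is listed first, ahead of logrotate, whereas the unprefixed pflog_amavis lands after it and would see an already rotated log.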

The script takes a brute-force approach and parses the mail log looking for the one relay line that links the two queue ids together:

Jan 4 07:26:52 spamassassin postfix/smtp[14934]: 5C0631743C: to=<info@idmltd.co.uk>, relay=127.0.0.1[127.0.0.1], delay=14, status=sent (250 2.6.0 Ok, id=08466-05, from MTA: 250 Ok: queued as 173FE1743D)

It stores the original QID in a hash with the time as a value; then it goes back through the log, copying out all the postfix lines to a temp file except those that were requeued within a given time period. (The QID is only guaranteed to be unique at the instant it's created -- QIDs are reused over time.) Pflogsumm is then called with all the arguments and the temp mail log, which is deleted on exit.
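The linking step can be sketched in a few lines of shell. This is not the cover script itself, only a rough illustration of the first pass: it pulls the two queue ids out of each relay line of the kind shown above. The function name extract_qid_pairs is made up for this example, and the pattern assumes the log format of the sample line.

```shell
#!/bin/sh
# Sketch of the first pass only: find the relay lines that tie the original
# queue id to the id the mail was requeued under. Not the original script.

# extract_qid_pairs: read a mail log on stdin and print
# "ORIGINAL_QID REQUEUED_QID" for every postfix/smtp line that ends with
# "queued as ...)".
extract_qid_pairs() {
    sed -n 's|.*postfix/smtp\[[0-9]*]: \([0-9A-F]*\): .*queued as \([0-9A-F]*\)).*|\1 \2|p'
}

# Example: extract_qid_pairs < /var/log/maillog
```

Fed the sample log line above, this prints "5C0631743C 173FE1743D". A full implementation would also need to record the timestamp of each pair, since queue ids are reused.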

AFAIK, the script works without problems; I've been using it for a few months at my site trouble-free. Performance is quite good too -- the whole report takes only about 50% longer. No extra Perl dependencies are needed.

Training Bayes

If you enable auto-learning (as you probably should), SA will do a great job of building up your bayes db, but it's always a good idea to manually train on a collection of ham and spam. This is particularly important for the rare mail that gets mis-classified. SpamAssassin, like the rest of us, should learn from its mistakes.
The fiddly bit of this whole process is getting the mail into the required format.
The headers contain important information that will get scrubbed if you forward the mail. If you're running Exchange and Outlook, the best thing to do is create a couple of public IMAP folders called "ham" and "spam" respectively. From here, users can right-click and drag their mail into these folders.
Once you've got some mail in these folders, you can run a perl script that will copy over the mail from the folders, then run sa-learn on them. I'm using the bayesimappull script I found on the sa-talk list, although I've modified it to change ownership of the bayes db back to the correct user and group (defaults to amavis:amavis). I've never had a problem, but it is possible for the bayes files to be re-created with root ownership. After downloading the file, change the IP address to that of your Exchange server and alter the names of the spam and ham folders at the top of the script. This script needs an additional IMAP perl module which you can get from CPAN:

# perl -MCPAN -e shell
cpan> install Mail::IMAPClient

Before manually training, take a look and see how your database is doing:

# sa-learn --dump magic

you should get something like this:

0.000  0  2  0  non-token data: bayes db version
0.000  0  1966  0  non-token data: nspam
0.000  0  5138  0  non-token data: nham
0.000  0  139830  0  non-token data: ntokens
0.000  0  1067276800  0  non-token data: oldest atime
0.000  0  1078057434  0  non-token data: newest atime
0.000  0  1078055556  0  non-token data: last journal sync atime
0.000  0  1076350138  0  non-token data: last expiry atime
0.000  0  0  0  non-token data: last expire atime delta
0.000  0  0  0  non-token data: last expire reduction count

nspam and nham show the number of spams and hams learned, respectively. Now run bayesimappull like so:

# bayesimappull -uid="username" -pwd="fred"

with the username and password of an account on the Exchange server with permission to read these folders.
Try running sa-learn --dump magic again and see if the numbers have increased.

You may want to run this with the --norebuild option to speed things up (TODO: not currently implemented on the bayesimappull script). It would be easy to add this to a cron job running daily, but I prefer to do it manually so I can check the mails are being correctly classified.

If for some reason training fails, you may have a write-conflict; stop amavisd and try again.

A few general points about bayesian learning

The accepted wisdom is to learn on absolutely everything, with the exception of things like mailing lists that discuss spam (like sa-talk). The more the merrier.
Don't worry about having an imbalance of spam vs. ham, unless the figures become grossly distorted (like 20:1). Bayes can easily overcome these distortions, especially with a large corpus.
sa-learn remembers the message ids of mails it's learned from, so there's no danger of trying to learn from the same email twice. You can re-learn a mail if you've trained it as spam when it's really ham, or vice versa.
Remember, you need at least 200 hams and 200 spams in your database before bayes starts working; with a properly trained database, most of your ham should hit BAYES_00 and most of your spam BAYES_99. Not much should fall in the middle.
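A quick way to check the last point is to pull nspam and nham out of the dump shown earlier. The sketch below assumes the column layout of that sample output and a threshold of 200 of each, which is SpamAssassin's default; the function name bayes_ready is made up for this example.

```shell
#!/bin/sh
# Sketch: read `sa-learn --dump magic` output on stdin and report whether
# both message counts have reached the assumed threshold of 200.
bayes_ready() {
    awk '
        /non-token data: nspam/ { nspam = $3 }
        /non-token data: nham/  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            # exit status 0 only when both counts are at least 200
            exit !(nspam >= 200 && nham >= 200)
        }
    '
}

# Example: sa-learn --dump magic | bayes_ready && echo "Bayes is active"
```

The counts are printed either way, so the same function is handy for comparing the numbers before and after a training run.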

If you ever want to output your bayes db to a text file, try this:

# sa-learn --dump data | sort > bayes_dump.txt

This will show you every token in the database in ascending order of spam probability.

Checklist for improving SpamAssassin's accuracy

The first temptation is probably to reduce the tag score; I would advise against doing this if possible, as you will definitely increase the number of false positives and there are plenty of other options to try first. Here is a selection, in rough order of importance:

Manually train bayes, especially on the mistakes.
Make sure you're running the latest version of SpamAssassin. There is something of an arms race between spammers and the SA developers, so each upgrade is genuinely worth having. Sign up for the sa-announce mailing list to be notified.
The online spam databases (Razor, DCC, Pyzor) are valuable add-ons.
Take a look at adding custom rules. Over at the SA custom rules emporium, there are a host of very good rulesets to choose from. Just download the .cf file to /etc/mail/spamassassin/, run spamassassin --lint to check for any errors, and then restart amavisd. I can strongly recommend BigEvil -- it's a huge list of known spam domains which is updated very frequently. As the rule for inclusion is zero FPs there's no reason at all not to run it. Backhair and popcorn have also worked well for me.
Don't forget to blacklist! There will be occasional spam domains that aren't included in BigEvil and still manage to get through. Like whitelisting, it's also a good way of saving processing power (see the earlier section for how to do this with amavis).
SMTP restrictions. I reject all incoming mail whose sender domain has no A or MX record. I've never had a problem with this at all. Some people even restrict on the lack of reverse DNS, but I think that's just a recipe for losing mail.
RBLs. I would always advise against running realtime blackhole checks at the MTA level, as you will lose legitimate mail. Also (I believe I'm right in saying this) you'll only be checking against the last mail server's IP, not necessarily the originating server if you're using a proxy firewall, or your ISP receives mail for you. RBLs within SA check all the mail servers listed (except those excluded by the trusted_networks setting), and as they are used to score a mail rather than blanket-reject, they are a very useful addition to the armoury.
If spam hits a handful of rules but still falls short of the tag level, consider raising the score of a few rules.
If you're getting a very particular type of spam, try writing your own rule to catch it. There are plenty of good HOWTOs on rule-writing like this one.
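The custom-rules step in the checklist above (download, lint, restart) is easy to wrap up so that a broken ruleset never stays in place. This is only a sketch: RULES_DIR, LINT_CMD and RESTART_CMD are illustrative variables rather than settings that SpamAssassin or amavisd read, and bigevil.cf in the example is just a sample file name.

```shell
#!/bin/sh
# Sketch of the "download, lint, restart" routine for a custom ruleset.
# The three variables below are illustrative overrides for this script only.
RULES_DIR="${RULES_DIR:-/etc/mail/spamassassin}"
LINT_CMD="${LINT_CMD:-spamassassin --lint}"
RESTART_CMD="${RESTART_CMD:-service amavisd restart}"

# install_ruleset FILE.cf: copy the ruleset into place, lint the whole
# configuration, and back the file out again if lint reports a problem.
install_ruleset() {
    src="$1"
    dest="$RULES_DIR/$(basename "$src")"
    cp "$src" "$dest" || return 1
    if $LINT_CMD; then
        $RESTART_CMD
    else
        echo "lint failed, removing $dest" >&2
        rm -f "$dest"
        return 1
    fi
}

# Example: install_ruleset ./bigevil.cf
```

Note that the sketch simply removes the new file on failure; when updating a ruleset that is already installed, you would want to keep a copy of the previous version to restore instead.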

The key to fighting spam is a scatter-gun approach. You can't rely just on Bayes or just on regexp matching; a combination of those along with RBLs, SMTP restrictions and online spam databases works best. You really don't have to put up with an inbox full of spam -- it should be possible to achieve a very high accuracy with next to zero false positives. If you're doing all of the above and spam is still getting through, make sure your setup's working properly; ask for help on the sa-talk mailing list.