Spam
I relied on these two docs to set up my anti-spam relay server: Scott
Vintinner's page and Scott Henderson's one.
Here is a collection of tips for further customizing a postfix/amavisd-new/SpamAssassin
setup.
This is a PowerPoint presentation I put together for my local Linux user
group. It covers a few techniques for fighting spam, along with a description
of what I installed in my office. I made it on my Mac, so the fonts might look
a bit funny on a PC; I haven't tried it in OpenOffice.
Sender Policy Framework (SPF)
Along with SpamAssassin et al., I think email authentication is definitely a
good idea; I just wish it were a bit more widely deployed! Here's what I wrote
about it.
Setting up white/blacklists in amavis
At some point after you've set up your box, you're almost certainly going to
need to do one of three things:
- whitelist certain individuals or domains.
- blacklist a few spam-domains that still manage to get through.
- stop filtering mail for users who have decided to opt-out.
You can do this at the SA level, but it's better to use amavis for this sort
of thing, as you'll save processing power by not running SpamAssassin, and
SA doesn't always get the full set of addresses in all circumstances (BCC'ing,
for instance).
Comment out this section in /etc/amavisd.conf:
map { $whitelist_sender{lc($_)}=1} (qw(
cert-advisory-owner@cert.org
...
returns.groups.yahoo.com
));
...add these lines:
read_hash(\%whitelist_sender, '/var/amavis/whitelist');
read_hash(\%blacklist_sender, '/var/amavis/blacklist');
read_hash(\%spam_lovers, '/var/amavis/spam_lovers');
...and then touch the files into existence:
# touch /var/amavis/whitelist
# touch /var/amavis/blacklist
# touch /var/amavis/spam_lovers
You can add names to these lists like so:
# echo thinkgeek.com >> /var/amavis/whitelist
# echo cas_likes_spam@caspergasper.com >> /var/amavis/spam_lovers
The lists don't support globbing (no * wildcards); you have to put either
the whole email address or the domain (actually that's not completely
true, but if you stick to those rules you won't go wrong). These text
files are only read at startup, so after amending them you need to restart
amavisd for the changes to take effect; on a Red Hat box just type:
# service amavisd restart
or, on other systems, call the init script from wherever services are kept:
# /etc/init.d/amavisd restart
Here are some very trivial scripts to further simplify whitelisting etc., so
you can just type:
# add_to_whitelist caspergasper.com
and the name will be added to the list and amavisd will be restarted. Here's
the whitelist, the blacklist and the spam-lovers scripts. Download them to
somewhere in your path like /usr/local/sbin/.
In addition I often use this script that can stop and restart both postfix and amavisd at once.
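The linked scripts aren't reproduced on this page. As a rough sketch only (not the scripts themselves), a helper of this kind might look something like the following. The function name add_to_list and the LIST_FILE and RESTART_CMD variables are hypothetical, chosen for illustration; the paths are the ones used above.

```shell
#!/bin/sh
# Hypothetical sketch of an add_to_whitelist-style helper; not the linked script.
# LIST_FILE and RESTART_CMD are illustrative names and can be overridden.
LIST_FILE="${LIST_FILE:-/var/amavis/whitelist}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/amavisd restart}"

add_to_list() {
    # The stock amavisd.conf lower-cases its entries with lc(), so do the same.
    entry=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    if [ -z "$entry" ]; then
        echo "usage: add_to_list address-or-domain" >&2
        return 1
    fi
    # Skip entries that are already listed (exact, whole-line match).
    if grep -qxF -- "$entry" "$LIST_FILE" 2>/dev/null; then
        echo "$entry is already in $LIST_FILE"
        return 0
    fi
    printf '%s\n' "$entry" >> "$LIST_FILE" || return 1
    # The lists are only read at startup, so restart amavisd.
    $RESTART_CMD
}
```

To use it as a standalone command, save it with a final line of add_to_list "$1"; pointing LIST_FILE at the blacklist or spam_lovers file gives the other two variants.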
pflogsumm double-reporting
If you're running the pflogsumm reporting tool for postfix along with amavis,
you'll find it doubles the number of emails successfully passed through your
system. The problem is simply that when amavis has finished with the mail it
requeues it, so it genuinely does go through twice. Here is a cover script that
fixes the problem -- just download it to /usr/local/bin/ (or wherever).
You might need to change the location of pflogsumm at the top of the file,
and alter the command in your cron job to run pflog_amavis instead of
pflogsumm.
One minor change -- the script won't read the log file from standard
input; the log file has to be the last parameter on the command line, i.e. this
doesn't work:
cat /var/log/maillog | pflog_amavis
it has to be like this:
pflog_amavis -i /var/log/maillog
I think this is more readable anyway. Here's an example
file that shows the format; put it in /etc/cron.daily/ or cron.weekly/,
depending on how often you want to run the report. Just change the paths
to pflog_amavis and your log file as needed. Incidentally, make sure the
script gets called before logrotate, or else you'll be reading a blank file
at the end of the week and you'll wonder why you get no mail on a Saturday.
The easiest way of ensuring this is to prefix the script name with a zero, as
the scripts are called in alphabetical order.
The script takes a 'brute force' approach and parses the mail log
looking for the one relay line that links the two queue ids together:
Jan 4 07:26:52 spamassassin postfix/smtp[14934]: 5C0631743C:
to=<info@idmltd.co.uk>, relay=127.0.0.1[127.0.0.1], delay=14,
status=sent (250 2.6.0 Ok, id=08466-05, from MTA: 250 Ok: queued as
173FE1743D)
It stores the original QID in a hash with the time as the value; then it
goes back through the log, copying out all the postfix lines to a temp
file except those that were requeued within a given time period. (The
QID is only guaranteed to be unique at the instant it's created --
they are reused.) Pflogsumm is then called with all the arguments and
the temp mail log, which is deleted on exit.
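The two-pass idea can be sketched in a few lines of shell and awk. This is a simplified illustration, not the pflog_amavis script itself: the function name filter_first_pass is hypothetical, it assumes the queue id is the sixth whitespace-separated field (as in the sample line above) and that amavis listens on 127.0.0.1, and it leaves out the time-window check, so a reused QID would be filtered wrongly.

```shell
#!/bin/sh
# Simplified sketch of the two-pass filter; not the actual pflog_amavis script.
# Prints the log named in $1 without the lines belonging to queue ids that
# were handed to the content filter on 127.0.0.1 and requeued.
filter_first_pass() {
    awk '
        # Pass 1 (first read of the file): remember every queue id whose
        # delivery line shows a relay to localhost and a new "queued as" id.
        NR == FNR {
            if ($0 ~ /relay=127\.0\.0\.1\[.*queued as /) {
                qid = $6
                sub(/:$/, "", qid)
                requeued[qid] = 1
            }
            next
        }
        # Pass 2 (second read): copy out everything except those queue ids.
        {
            qid = $6
            sub(/:$/, "", qid)
            if (!(qid in requeued)) print
        }
    ' "$1" "$1"
}
```

The filtered output could then be written to a temp file and passed to pflogsumm in place of the real log.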
AFAIK, the script works without problems; I've been using it trouble-free
for a few months at my site. Performance is quite good too --
the whole report takes only about 50% longer. No extra perl dependencies
are needed.
Training Bayes
If you enable auto-learning (as you probably should), SA will do a great
job of building up your bayes db, but it's always a good idea to manually train
on a collection of ham and spam. This is particularly important for the rare
mail that gets mis-classified. SpamAssassin, like the rest of us, should learn
from his mistakes.
The fiddly bit of this whole process is getting the mail into the required
format.
The headers contain important information that will get scrubbed if you forward
the mail. If you're running Exchange and Outlook, the best thing to do is create
a couple of public IMAP folders called "ham" and "spam" respectively.
From here, users can right-click and drag their mail into these folders.
Once you've got some mail in these folders, you can run a perl script that
will copy over the mail from the folders, then run sa-learn on them. I'm using
the bayesimappull script I found on
the sa-talk list, although I've modified it to change ownership of the bayes
db back to the correct user and group (defaults to amavis:amavis). I've never
had a problem, but it is possible for the bayes files to be re-created with
root ownership. After downloading the file, change the IP address to that of
your Exchange server and alter the names of the spam and ham folders at the
top of the script. This script needs an additional IMAP perl module which
you can get from CPAN:
# perl -MCPAN -e shell
> install Mail::IMAPClient
Before manually training, take a look and see how your database is doing:
# sa-learn --dump magic
You should get something like this:
0.000 | 0 | 2 | 0 | bayes db version |
0.000 | 0 | 1966 | 0 | non-token data: nspam |
0.000 | 0 | 5138 | 0 | non-token data: nham |
0.000 | 0 | 139830 | 0 | non-token data: ntokens |
0.000 | 0 | 1067276800 | 0 | non-token data: oldest atime |
0.000 | 0 | 1078057434 | 0 | non-token data: newest atime |
0.000 | 0 | 1078055556 | 0 | non-token data: last journal sync atime |
0.000 | 0 | 1076350138 | 0 | non-token data: last expiry atime |
0.000 | 0 | 0 | 0 | non-token data: last expire atime delta |
0.000 | 0 | 0 | 0 | non-token data: last expire reduction count |
nspam and nham show the number of spams and hams learned, respectively.
Now run bayesimappull like so:
# bayesimappull -uid="username" -pwd="fred"
with the username and password of an account on the Exchange server that has
permission to read these folders.
Then run sa-learn --dump magic again and see if the numbers have increased.
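To make the before-and-after comparison less of a squint, the two relevant rows can be pulled out automatically. A minimal sketch, assuming sa-learn is in your path and labels the rows "non-token data: nspam" and "non-token data: nham" as in the dump above; the function names bayes_rows and train_and_compare are hypothetical.

```shell
#!/bin/sh
# Keep only the nspam/nham rows of `sa-learn --dump magic` output (read on stdin).
bayes_rows() {
    grep -E 'non-token data: n(spam|ham)'
}

# Show the counts, run the given training command, then show the counts again.
train_and_compare() {
    echo "before:"
    sa-learn --dump magic | bayes_rows
    "$@"
    echo "after:"
    sa-learn --dump magic | bayes_rows
}
```

For example: train_and_compare bayesimappull -uid="username" -pwd="fred"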
You may want to run this with the --norebuild option to speed things up
(TODO: not currently implemented in the bayesimappull script). It would be
easy to add this to a daily cron job, but I prefer to do it manually so I can
check the mails are being correctly classified. If for some reason training
fails, you may have a write-conflict; stop amavisd and try again.
A few general points about bayesian learning
- The accepted wisdom is to learn on absolutely everything, with the exception
of things like mailing lists that discuss spam (like sa-talk). The more the
merrier.
- Don't worry about having an imbalance of spam vs. ham, unless the figures
become grossly distorted (like 20:1). Bayes can easily overcome these
distortions, especially with a large corpus.
- sa-learn remembers the message ids of mails it's learned from,
so there's no danger of learning from the same email twice.
You can re-learn a mail if you've trained it as spam when it's really ham,
or vice versa.
- Remember, you need at least 200 hams and 200 spams in your database
before bayes starts working; with a properly trained database,
most of your ham should hit BAYES_00 and most of your spam BAYES_99.
Not much should fall in the middle.
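The 200-message threshold mentioned above is easy to check from the dump output. A small sketch, with the hypothetical function name bayes_ready; it reads sa-learn --dump magic output on stdin, takes the count from the third column, and strips any | characters so the table as shown on this page parses as well.

```shell
#!/bin/sh
# Exit 0 if both nspam and nham have reached the minimum (default 200),
# non-zero otherwise. Reads `sa-learn --dump magic` output on stdin.
bayes_ready() {
    awk -v min="${1:-200}" '
        { gsub(/[|]/, " ") }           # tolerate pipe-separated columns
        $NF == "nspam" { spam = $3 }   # third column holds the count
        $NF == "nham"  { ham = $3 }
        END {
            printf "nspam=%d nham=%d (need %d of each)\n", spam, ham, min
            exit !((spam + 0 >= min + 0) && (ham + 0 >= min + 0))
        }
    '
}
```

For example: sa-learn --dump magic | bayes_ready || echo "keep training"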
If you ever want to output your bayes db to a text file, try this:
# sa-learn --dump data | sort > bayes_dump.txt
This will show you every token in the database in ascending order of spam
probability.
Checklist for improving SpamAssassin's accuracy
The first temptation is probably to reduce the tag score; I would advise against
doing this if possible, as you definitely will increase the number of false
positives and there are plenty of other options to try first. Here is a
selection, in rough order of importance:
- Manually train bayes, especially on the mistakes.
- Make sure you're running the latest version of SpamAssassin. There is
something of an arms race between spammers and the SA developers, so
each upgrade is genuinely worth having. Sign up for the sa-announce
mailing list to be notified.
- The online spam databases (Razor, DCC, Pyzor) are valuable add-ons.
- Take a look at adding custom rules. Over at the SA custom rules emporium,
there are a host of very good rulesets to choose from. Just download
the .cf file to /etc/mail/spamassassin/, run spamassassin --lint to check
for any errors, and then restart amavisd. I can strongly recommend
BigEvil -- it's a huge list of known spam domains which is updated very
frequently. As the rule for inclusion is zero FPs there's no reason at all
not to run it. Backhair and popcorn have also worked well for me.
- Don't forget to blacklist! There will be occasional spam domains
that aren't included in BigEvil and still manage to get through.
Like whitelisting, it's also a good way of saving processing power (see the
earlier section for how to do this with amavis).
- SMTP restrictions. I reject all incoming mail whose sender domain has
no A or MX record. I've never had a problem with this at all. Some
people even reject on the lack of reverse DNS, but I think that's just
a recipe for losing mail.
- RBLs. I would always advise against running realtime blackhole
checks at the MTA level, as you will lose legitimate mail.
Also (I believe I'm right in saying this) you'll only be checking against
the last mail server's IP, not necessarily the originating server, if
you're using a proxy firewall or your ISP receives mail for you. RBLs
within SA check all the mail servers listed (except those excluded by the
trusted_networks setting), and as they are used to score a mail rather
than blanket-reject, they are a very useful addition to the armoury.
- If spam hits a handful of rules but still falls short
of the tag level, consider raising the score of a few rules.
- If you're getting a very particular type of spam, try
writing your own rule to catch it. There are plenty of
good HOWTOs on rule-writing like this one.
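As an illustration of the SMTP restrictions item above: the page doesn't say which Postfix setting was used, but reject_unknown_sender_domain is one restriction that refuses mail whose sender domain has no A or MX record. The sketch below appends it to smtpd_sender_restrictions; that choice of parameter and the function name enable_sender_domain_check are assumptions, so check your existing restrictions before running anything like it.

```shell
#!/bin/sh
# Sketch: add reject_unknown_sender_domain to smtpd_sender_restrictions,
# keeping whatever restrictions are already configured there.
enable_sender_domain_check() {
    # postconf -h prints the current value without the "name =" prefix.
    current=$(postconf -h smtpd_sender_restrictions)
    case "$current" in
        *reject_unknown_sender_domain*)
            echo "reject_unknown_sender_domain is already enabled"
            return 0
            ;;
    esac
    # Append to the existing value (if any) and reload postfix.
    postconf -e "smtpd_sender_restrictions = ${current:+$current, }reject_unknown_sender_domain" &&
        postfix reload
}
```

Run enable_sender_domain_check as root, then watch the mail log for a while to confirm that only mail you don't want is being turned away.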
The key to fighting spam is a scatter-gun approach. You
can't rely just on Bayes or just on regexp matching;
a combination of those along with RBLs, SMTP restrictions and online spam
databases works best. You really don't have to put up with an inbox full of
spam -- it should be possible to achieve a very
high accuracy with next to zero false positives. If you're doing all
of the above and spam is still getting through, make sure your setup's
working properly; ask for help on the sa-talk mailing list.