message tags

Tom Allison tallison at tacocat.net
Tue Sep 7 14:24:08 CEST 2004


David Relson wrote:
> Greetings,
> 
> It has been suggested that bogofilter should be able to avoid
> duplication registration of a message and avoid unregistering a message
> that was never registered.  Given a unique tag, bogofilter could check
> the tag before registering or unregistering a message.  To take the idea
> a step further, the tag would have a value of 1 for its ham count or its
> spam count.  The tag value would also make it easy to fix an incorrect
> classification.
> 
> All this is well and good, but what part of the message can be used for
> the tag (or to generate a tag)?  Bogofilter already recognizes the
> Message-ID, Queue-ID, and IP address for a message.  However none of
> these is unique (see note 1), which is a desirable characteristic.
> 

If we are talking about simple duplicated messages, then I would like to 
suggest a procmail ruleset of:

# remove duplicate emails
:0 Wh: msgid.lock
| formail -D 2048 msgid.cache

While you have mentioned that you found repeated use of a Message-ID, 
have you confirmed that these are not truly duplicate messages?  For me, 
legitimate email doesn't have this problem, so it's just another means 
to filter spam.  Can anyone show evidence of legitimate emails with 
different bodies that share the same Message-ID?
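
One way to look for such evidence in a mail directory that holds one 
message per file is sketched below.  The function name 
same_id_different_body is made up for this example, and the sketch 
assumes the Message-ID value sits on the same line as the header name 
(folded headers are skipped).

```shell
# same_id_different_body DIR
# Print every Message-ID that appears in DIR with more than one
# distinct body.  Hypothetical helper, not part of bogofilter.
same_id_different_body() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # First Message-ID header; stop reading at the end of the headers
        id=$(sed -n '/^$/q; s/^[Mm]essage-[Ii][Dd]:[[:space:]]*//p' "$f" | head -n 1)
        [ -n "$id" ] || continue
        # Checksum of everything after the first blank line (the body)
        sum=$(sed '1,/^$/d' "$f" | cksum | cut -d' ' -f1)
        printf '%s\t%s\n' "$id" "$sum"
    done | LC_ALL=C sort -u | cut -f1 | uniq -d
}
```

Exact duplicates collapse into a single line under sort -u, so only 
IDs with at least two different body checksums are printed.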

However, if you are trying to keep track of what email to register as 
spam/ham when you may/may not have registered it already...

I do this currently with the existing tags.  I have been able to develop 
some simple scripts that will take email and process it for corrections.

I use the fact that X-Bogosity exists as proof that the email has at 
least been seen by bogofilter, and whether or not -u is in the 
configuration to decide if the correction should use the -Sn or the -n 
option.

I have never been spoofed with an X-Bogosity header, but if I were to 
find any evidence of this, I would think that procmail/formail would 
provide a very nice means of stripping out the forged header.

It works something like this:

# $NEWHAM must end with a slash
for F in "$NEWHAM"*; do
     [ -f "$F" ] || continue    # nothing to do if the directory is empty
     doit H "$F"
     mv "$F" "$HAM"
done

Here $NEWHAM is the directory that I move email into after reviewing 
what bogofilter did with it.  That is to say, email comes in and 
bogofilter files it into one of:
bogofilter-ham
bogofilter-unsure
bogofilter-spam

I then have to move everything from these three directories into:
email-ham
email-spam

It is these two directories that the script reads email from.

This process looks at the final destination as the Right Answer and the 
X-Bogosity as the Best Guess and sorts out any differences between the two.

The function doit is called with the argument 'H' to indicate that 
everything in this directory is supposed to be HAM.

The function doit looks like:

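function doit {
     # First comma-separated field of X-Bogosity: Yes, No, or Unsure
     Bogosity=$(formail -zx X-Bogosity < "$2" | awk -F, '{print $1}')

     case "$1" in
         H )

             case "$Bogosity" in
                 Yes )
                     # Unregister as spam, then register as ham
                     $BOGOFILTER -Sn < "$2"
                     ;;
                 Unsure )
                     # Register as ham
                     $BOGOFILTER -n < "$2"
                     ;;
                 No )
                     # No Action Taken
                     # This is where it should be.
                     ;;
                 * )
                     echo "INVALID STATE $1 $2 $Bogosity"
                     ;;
             esac

             ;;

         # 'S' ) branch goes here
     esac
}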

Similarly for a case of 'S' for Spam.


In the event that you do not use -peu but only -pe, this can be easily 
modified as follows:

function doit {
     Bogosity=$(formail -zx X-Bogosity < "$2" | awk -F, '{print $1}')

     case "$1" in
         H )
             # Quoted so that a missing X-Bogosity header does not break the test
             if [ "$Bogosity" != "No" ]; then
                 $BOGOFILTER -nI "$2"
             fi
             ;;

         S )
             if [ "$Bogosity" != "Yes" ]; then
                 $BOGOFILTER -sI "$2"
             fi
             ;;
         * )
             echo "NO ACTION TAKEN"
             echo "specify 'H' or 'S'"
             exit 1
             ;;
     esac
}

Again, given that procmail and bash can do everything you are asking 
for, I would personally discourage putting into bogofilter anything 
that can be readily managed, debugged, and processed using basic Unix 
scripts.

In the event that bogofilter were daemonized such that it did not 
require procmail or similar to operate then there might be a better case 
for some of this, but I personally would still use the existing method 
of hand filtering the spam into two folders and running a crontab script 
from there.

I do have to confess that my scripts have another line which moves ALL 
spam/ham guesses that are over 4 days old into the email-ham/email-spam 
folders.  This gives me a window to review spam while still forcing 
some automatic processing to keep things from getting too big.
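
A minimal sketch of such an aging step, assuming one message per file.  
The function name age_out and its arguments are made up for this 
example, and -maxdepth is a GNU/BSD find extension rather than POSIX.

```shell
# age_out SRC DST [DAYS]
# Move regular files in SRC that are more than DAYS days old
# (default 4) into DST.  Hypothetical helper for illustration.
age_out() {
    local src=$1 dst=$2 days=${3:-4}
    # -mtime +N matches files whose age in whole 24-hour periods exceeds N
    find "$src" -maxdepth 1 -type f -mtime +"$days" -exec mv -- {} "$dst"/ \;
}
```

It would be called once per pair of folders, for example 
age_out bogofilter-spam email-spam, before the correction loop runs.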


