bogotune cleanup

Tom Allison tallison at tacocat.net
Fri Feb 18 12:00:56 CET 2005


I was tooling around with some bogoutil features and was wondering if 
there is some way to manage the logic between the -c and -a switches.

For example:

tallison@cling:~$ bogoutil -c 1 -a 20041130 -d .bogofilter/wordlist.db | more
$0 3 1 20041206
$0.00 4 12 20050123
$0.025 2 0 20041130
$0.07 4 0 20041130
$0.075 4 0 20041130
$0.12 4 0 20041130
$0.15 3 2 20041230
$0.24 2 0 20041130
$0.25 10 0 20041130
$0.29 2 0 20041130
$0.34 4 0 20041130
$0.35 5 0 20041130
$0.39 2 0 20041130
$0.45 2 0 20041130
$0.65 2 0 20041130
$0.75 4 0 20041130
$0.85 2 0 20041130
$0.86 2 2 20041210
$0.95 4 0 20041130
$0100007 2 0 20041130
$039 0 1 20050214
$04604 0 1 20050214

I rebuilt my database on 20041130, so a lot of tokens carry that timestamp.
I don't see any reason to remove them if they are effective in filtering 
spam, but since I only train on error, the chances that I will update 
any of them are reduced.

As a result of this approach, any database trimming I might want to do 
would be more along the lines of
   "if date is <= 20041130 AND count <= 1 then"
rather than the observed
   "if date is <= 20041130 OR count <= 1 then"



From a practical standpoint, or from a user's perspective, which makes more 
sense here: AND logic or OR logic?  One case for AND logic is that the 
smaller set of tokens it selects would skew the wordlist less.  The 
argument is that as you remove tokens, the wordlist may no longer 
accurately reflect the cumulative MSG_COUNT and TOKEN_COUNT.
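
To make the difference between the two conditions concrete, here is a rough
Python sketch that applies both to a few lines from the dump above. It only
illustrates the two predicates and is not how bogoutil works internally. The
assumptions are that "count" means spam + ham, that both comparisons are
inclusive, and the last sample row is made up so that the AND case selects
something.

```python
# Illustration only: compares the AND and OR trimming conditions on dump lines.
# It does not reproduce bogoutil's internal logic.

SAMPLE_DUMP = """\
$0 3 1 20041206
$0.025 2 0 20041130
$0.15 3 2 20041230
$039 0 1 20050214
$04604 0 1 20050214
example-old-rare 0 1 20041130
"""  # the last row is hypothetical, not taken from the real wordlist


def parse_dump(text):
    """Parse 'token spam ham yyyymmdd' lines into (token, spam, ham, date)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip anything that is not a four-field dump line
        token, spam, ham, date = parts
        rows.append((token, int(spam), int(ham), int(date)))
    return rows


def trim_candidates(rows, max_count, cutoff_date, mode):
    """Return tokens matching the trimming condition in the given mode."""
    selected = []
    for token, spam, ham, date in rows:
        old = date <= cutoff_date          # yyyymmdd compares correctly as int
        rare = (spam + ham) <= max_count   # assumption: count = spam + ham
        hit = (old and rare) if mode == "and" else (old or rare)
        if hit:
            selected.append(token)
    return selected


rows = parse_dump(SAMPLE_DUMP)
and_hits = trim_candidates(rows, 1, 20041130, "and")
or_hits = trim_candidates(rows, 1, 20041130, "or")

print("AND:", and_hits)
print("OR: ", or_hits)
```

With this sample the AND condition picks out only the made-up old, rarely
seen token, while the OR condition also picks up tokens that are merely old
or merely rare, so the OR set is the larger one.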


