bogotune cleanup
Tom Allison
tallison at tacocat.net
Fri Feb 18 12:00:56 CET 2005
I was tooling around with some bogoutil features and was wondering if
there is some way to manage the logic between the -c and -a switches.
For example:
tallison@cling:~$ bogoutil -c 1 -a 20041130 -d .bogofilter/wordlist.db | more
$0 3 1 20041206
$0.00 4 12 20050123
$0.025 2 0 20041130
$0.07 4 0 20041130
$0.075 4 0 20041130
$0.12 4 0 20041130
$0.15 3 2 20041230
$0.24 2 0 20041130
$0.25 10 0 20041130
$0.29 2 0 20041130
$0.34 4 0 20041130
$0.35 5 0 20041130
$0.39 2 0 20041130
$0.45 2 0 20041130
$0.65 2 0 20041130
$0.75 4 0 20041130
$0.85 2 0 20041130
$0.86 2 2 20041210
$0.95 4 0 20041130
$0100007 2 0 20041130
$039 0 1 20050214
$04604 0 1 20050214
I rebuilt my database on 20041130, so I have a lot of these timestamps.
I don't see any reason to remove them if they are effective in filtering
spam, but since I only train on error the chances that I will update
any of them are reduced.
As a result of this approach, any database trimming I might want to do
would be more along the lines of
"if date is <= 20041130 AND count <= 1 then"
rather than the observed
"if date is <= 20041130 OR count <=1 then"
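
To make the difference between the two rules concrete, here is a minimal
Python sketch. It is not bogoutil's code; it only applies each rule to a
few lines copied from the dump above. It assumes the dump columns are
token, spam count, good count and date, and that "count" means the sum
of the two counts. The names parse, is_old, is_rare and select are made
up for this illustration.

```python
# Sketch only: compares AND vs OR trimming rules on a few dump lines.
# Assumed column layout: token, spam count, good count, date (YYYYMMDD).
SAMPLE = """\
$0 3 1 20041206
$0.025 2 0 20041130
$0.15 3 2 20041230
$039 0 1 20050214
"""


def parse(line):
    """Split one dump line into (token, spam, good, date)."""
    # Split from the right so a token containing spaces stays intact.
    token, spam, good, date = line.rsplit(None, 3)
    return token, int(spam), int(good), int(date)


def is_old(rec, cutoff):
    """True if the token's date is on or before the cutoff date."""
    return rec[3] <= cutoff


def is_rare(rec, max_count):
    """True if spam + good count is at or below max_count (assumption)."""
    return rec[1] + rec[2] <= max_count


def select(records, cutoff, max_count, mode):
    """Return the tokens a trim would pick under 'and' or 'or' logic."""
    picked = []
    for rec in records:
        old, rare = is_old(rec, cutoff), is_rare(rec, max_count)
        hit = (old and rare) if mode == "and" else (old or rare)
        if hit:
            picked.append(rec[0])
    return picked


records = [parse(line) for line in SAMPLE.splitlines() if line.strip()]
and_tokens = select(records, 20041130, 1, "and")
or_tokens = select(records, 20041130, 1, "or")
print("AND:", and_tokens)
print("OR: ", or_tokens)
```

On these four lines the AND rule picks nothing, while the OR rule picks
both the old token ($0.025) and the low-count token ($039), so the OR
rule trims a larger set.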
From a practical standpoint, or a user's perspective, which makes more
sense here: AND logic or OR logic? One case for AND logic might
be that the smaller data set returned would have less potential for
skewing the wordlist from reality. The argument is based on the
idea that as you remove tokens, you may no longer accurately represent
the cumulative MSG_COUNT and TOKEN_COUNT.