Results - Filter on subject & body

Jozef Hitzinger hitzinger at phobos.fphil.uniba.sk
Tue Mar 9 15:06:52 CET 2004


Hi,

thanks to all who took the time to reply to my mails. Here are the
promised numbers. They are all in the form (ham/unsure/spam).

[1] spam mbox, 11848 messages

with headers: 3/283/11562
subject+body: 4/133/11711 (caught 149 more spam messages)

[2] unsorted mail mbox, 7517 messages

with headers: 3420/410/3687
subject+body: 3165/602/3750 (caught 63 more spam messages)
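
For anyone who wants to check the arithmetic, here is a minimal Python
sketch that recomputes the totals and the "caught N more" figures from
the counts above. The names (RESULTS, summarize, spam_gain) are made up
for illustration; only the numbers come from the tests.

```python
# Counts copied from the results above, each as (ham, unsure, spam).
RESULTS = {
    "spam mbox": {
        "with headers": (3, 283, 11562),
        "subject+body": (4, 133, 11711),
    },
    "unsorted mbox": {
        "with headers": (3420, 410, 3687),
        "subject+body": (3165, 602, 3750),
    },
}


def summarize(counts):
    """Return the total and the percentage share of each class."""
    ham, unsure, spam = counts
    total = ham + unsure + spam
    return {
        "total": total,
        "ham_pct": 100.0 * ham / total,
        "unsure_pct": 100.0 * unsure / total,
        "spam_pct": 100.0 * spam / total,
    }


def spam_gain(corpus):
    """Extra messages classified as spam with subject+body only."""
    return corpus["subject+body"][2] - corpus["with headers"][2]


for name, corpus in RESULTS.items():
    for mode, counts in corpus.items():
        s = summarize(counts)
        print(f"{name:14} {mode:13} total={s['total']:6d} "
              f"ham={s['ham_pct']:5.1f}% unsure={s['unsure_pct']:5.1f}% "
              f"spam={s['spam_pct']:5.1f}%")
    print(f"{name:14} spam gain: {spam_gain(corpus)}")
```

Both modes add up to the same total for each mbox, so the differences
come only from how messages move between the three classes.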


There was also a difference in what kind of messages got through. With
all headers, the spam that got through tended to come from debian-devel
and suchlike: places which already have filters in place, so spam from
there is not frequent enough to outweigh the headers, which look like ham.

With just subject & body, these spams were caught nicely. The kind that
got through in this case were small messages: too few words and often no
subject. Well, it's obvious that this will be the weak point of the
approach I proposed; without headers there's not much left to judge on.

I admit that the increase in performance wasn't too big, but these small
messages should be handled better by adding more such messages to the db,
while the problem with headers from already-filtered sources is
fundamental. Ok, I've argued that already. Stop.


All other settings were the same for both tests. I used my production
settings and mail corpora, with no special selection to get these results.
If you have more questions about the testing, just ask.


Thanks to all who tried to help me with scripts; I'll do the header
mangling that way.
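
To make "header mangling" concrete, here is a rough Python sketch of
reducing a raw message to its Subject header plus body. This is only an
illustration under my own assumptions, not one of the scripts that were
posted: the function name subject_and_body is hypothetical, and it
assumes the message uses plain "\n" line endings.

```python
def subject_and_body(raw: str) -> str:
    """Drop every header except Subject; keep the body unchanged.

    A leading mbox "From " separator line is kept so the result can
    still be stored in an mbox file.
    """
    # Headers end at the first empty line.
    head, _, body = raw.partition("\n\n")
    kept = []
    keeping = False
    for i, line in enumerate(head.split("\n")):
        if i == 0 and line.startswith("From "):
            # mbox separator line, not a real header
            kept.append(line)
            continue
        if line[:1] in (" ", "\t"):
            # Folded continuation: belongs to the previous header.
            if keeping:
                kept.append(line)
            continue
        keeping = line.lower().startswith("subject:")
        if keeping:
            kept.append(line)
    return "\n".join(kept) + "\n\n" + body


if __name__ == "__main__":
    import sys
    # Read one message on stdin, write the mangled message to stdout.
    sys.stdout.write(subject_and_body(sys.stdin.read()))
```

The output could then be piped to the filter in place of the original
message. Working on the raw text rather than a parsed message keeps the
body byte-for-byte as it arrived.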

-- 
jozef  :-)



