Idea for improving the learning stage

Sat Sep 8 09:35:03 CEST 2007

On Thu, 06 Sep 2007, Andrew wrote:

> It basically comes down to having the filter take note of this: did the 
> user need to open the email before flagging it as spam?
> 
> If the answer is "no", then concentrate your stats on the subject line 
> and ignore the body (which might be full of random words used by the 
> spammer to pollute the filter's database).
> 
> If the answer is "yes", the reverse applies: ignore the subject, which 
> must have looked "legitimate" to the user, and concentrate on the body, 
> which is what clued the user in about the email being spam.
> 
> By analyzing only the subject OR the body, you analyze only what 
> actually looks like spam, thus ignoring the parts of the email that are 
> there to deceive.
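
To make sure I read the proposal correctly, here is a minimal sketch of
the learning rule in Python. Everything in it is hypothetical: the names
(tokenize, learn_spam, opened_before_flagging) are made up for
illustration and are not bogofilter's code or API, and the mail client
would somehow have to report whether the message was opened.

```python
import re
from collections import Counter

# Crude word pattern; a stand-in for a real lexer (assumption).
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")

def tokenize(text):
    """Return lowercase word-like tokens from text."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

def learn_spam(spam_counts, subject, body, opened_before_flagging):
    """Register only the part of the mail that gave it away as spam."""
    # Not opened: the subject alone convinced the user, so skip the body.
    # Opened: the subject looked legitimate, so skip it and use the body.
    source = body if opened_before_flagging else subject
    spam_counts.update(tokenize(source))
    return spam_counts

counts = Counter()
# Flagged without opening: only the subject is learned.
learn_spam(counts, "Cheap pills now", "meeting agenda weather report", False)
# Flagged after opening: only the body is learned.
learn_spam(counts, "Re: our meeting", "cheap pills, click here", True)
```

With these two messages, "cheap" and "pills" are each counted twice,
while the padding words from the first body and the innocent-looking
second subject never reach the spam counts.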

How does bogofilter, for a newly arriving mail, decide whether to look
at the header or the body? If we modified just the learning side, we'd
still be evaluating both body and header at classification time, and the
part meant to deceive might still mislead bogofilter. So,
does your suggestion imply we'll have to keep header and body databases
separate? That's certainly doable technically, but what do you do with
nested MIME messages? (Postfix, for instance, lets you specify
regexp-based filters for the message as a whole, or for the headers of
embedded MIME parts, usually "attachments".)
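
One possible answer to the nested-MIME question, sketched in Python with
the standard email module: walk every part and send the headers of
embedded parts to the header side and decoded text payloads to the body
side. This is only an illustration under that assumed policy, not how
bogofilter works; split_header_body and the sample message are made up.

```python
import email

def split_header_body(raw_message):
    """Return (header_text, body_text), folding nested MIME parts in."""
    msg = email.message_from_string(raw_message)
    header_chunks, body_chunks = [], []
    for part in msg.walk():  # visits the top level and every nested part
        # Assumed policy: headers of embedded parts count as header text.
        header_chunks.extend(f"{name}: {value}" for name, value in part.items())
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                text = payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset name in the mail
                text = payload.decode("utf-8", errors="replace")
            body_chunks.append(text)
    return "\n".join(header_chunks), "\n".join(body_chunks)

# Made-up sample: one inline text part and one text attachment.
RAW = (
    "Subject: Re: our meeting\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "cheap pills, click here\n"
    "--XX\n"
    'Content-Type: text/plain; name="notes.txt"\n'
    'Content-Disposition: attachment; filename="notes.txt"\n'
    "\n"
    "quarterly figures\n"
    "--XX--\n"
)
headers, body = split_header_body(RAW)
```

The two strings could then be tokenized into separate header and body
databases. Whether an attachment's text should count as "body" at all is
exactly the kind of policy question that would have to be settled first.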

-- 
Matthias Andree