ignore text/plain part of multipart/alternative messages?

Simon Huggins huggie at earth.li
Wed Aug 13 10:35:09 CEST 2003


On Wed, Aug 13, 2003 at 07:28:00AM +0100, Peter Bishop wrote:
> Perhaps we should score the text/plain and text/html parts separately
> (i.e. score them like separate messages) then use the *highest* score
> to decide if the overall message is spam.
[..]
> There is also a problem if a padded message is used to update the database 
> as the padding tokens could dilute the database. So here again it would be 
> necessary to score the parts to decide which tokens go in the database.
> (header tokens + tokens in the highest scoring part)

Hmm, I think you might well have a point here.

Do you have a corpus you could test this on to give a vague feel for
results?  Something like taking a few messages, scoring each part, using
the highest score and so on and seeing whether it helps detect future
spam?
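
Roughly, such a test could look like the sketch below.  It is Python
purely for illustration, and toy_score and the SPAMMY word list are
made-up stand-ins for a real classifier, not anything bogofilter
provides.  It scores each text/plain and text/html part on its own and
keeps the highest score:

```python
import re
from email.message import EmailMessage, Message
from typing import Callable, List, Optional, Tuple

# Hypothetical word list, only here so the example runs on its own.
SPAMMY = {"viagra", "free", "winner"}


def toy_score(text: str) -> float:
    """Stand-in scorer: fraction of words found in SPAMMY (not a real filter)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in SPAMMY for w in words) / len(words)


def text_parts(msg: Message) -> List[Tuple[str, str]]:
    """Return (content_type, decoded_text) for each text/plain or text/html part."""
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to a byte-preserving decode.
            text = payload.decode("latin-1", errors="replace")
        parts.append((ctype, text))
    return parts


def score_message(
    msg: Message, score_text: Callable[[str], float]
) -> Tuple[float, Optional[str]]:
    """Score every text part separately; return (highest score, its content type)."""
    scored = [(score_text(text), ctype) for ctype, text in text_parts(msg)]
    if not scored:
        return 0.0, None
    return max(scored, key=lambda item: item[0])


# A padded message: innocent text/plain part, spammy text/html part.
sample = EmailMessage()
sample["Subject"] = "hello"
sample.set_content("meeting agenda for tuesday attached")
sample.add_alternative(
    "<html><body>free viagra winner</body></html>", subtype="html"
)

best_score, best_type = score_message(sample, toy_score)
print(best_type, best_score)
```

The same per-part scores could also be used to pick which part's tokens
get registered, as Peter suggests, but that would need checking against
a real corpus first.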

I guess until we have numbers there's not much incentive to code
it.

I'm busy at work at the moment, or I'd try to do something like this myself.



Simon.

-- 
[ "A computer's got to do what a computer's got to do." -Holly.        ]



