ignore text/plain part of multipart/alternative messages?
Simon Huggins
huggie at earth.li
Wed Aug 13 10:35:09 CEST 2003
On Wed, Aug 13, 2003 at 07:28:00AM +0100, Peter Bishop wrote:
> Perhaps we should score the text/plain and text/html parts separately
> (i.e. score them like separate messages) then use the *highest* score
> to decide if the overall message is spam.
[..]
> There is also a problem if a padded message is used to update the database
> as the padding tokens could dilute the database. So here again it would be
> necessary to score the parts to decide which tokens go in the database.
> (header tokens + tokens in the highest scoring part)
Hmm, I think you might well have a point here.
Do you have a corpus you could test this on to give a vague feel for
results? Something like taking a few messages, scoring each part, using
the highest score and so on and seeing whether it helps detect future
spam?
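
A rough sketch of the kind of test meant above, in Python rather than
bogofilter's own C. It splits a message into its text parts, scores each
part separately and keeps the highest score. Everything in it is an
assumption for illustration: toy_score, SPAMMY, score_message and SAMPLE
are made-up names, and the word-list scorer is only a stand-in, not
bogofilter's actual scoring.

```python
import re
from email import message_from_string
from email.message import Message
from typing import Callable, List, Optional, Tuple

# Hypothetical word list; a real test would use the trained database.
SPAMMY = {"free", "viagra", "winner"}


def toy_score(text: str) -> float:
    """Stand-in scorer: fraction of words that appear in SPAMMY."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in SPAMMY for w in words) / len(words)


def text_parts(msg: Message) -> List[Tuple[str, str]]:
    """Return (content type, decoded body) for every text/* part."""
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips the multipart container itself
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            body = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name
            body = payload.decode("ascii", errors="replace")
        parts.append((part.get_content_type(), body))
    return parts


def score_message(
    raw: str, scorer: Callable[[str], float] = toy_score
) -> Tuple[float, Optional[str]]:
    """Score each text part on its own and return the highest score
    together with the content type of the part that produced it."""
    best_score, best_type = 0.0, None
    for ctype, body in text_parts(message_from_string(raw)):
        score = scorer(body)
        if best_type is None or score > best_score:
            best_score, best_type = score, ctype
    return best_score, best_type


# Made-up message: innocent text/plain padding, spammy text/html.
SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "hello this is a normal note about lunch\n"
    "--b\n"
    "Content-Type: text/html; charset=us-ascii\n"
    "\n"
    "<p>free viagra winner</p>\n"
    "--b--\n"
)

if __name__ == "__main__":
    print(score_message(SAMPLE))
```

The content type that comes back also says which part scored highest,
which is the part whose tokens (plus the header tokens) would go into
the database under Peter's second suggestion.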
I guess until we have numbers there's not much incentive to code
it.
I'm busy at work atm or I'd try to do something like this myself.
Simon.
--
[ "A computer's got to do what a computer's got to do." -Holly. ]