libgmime
Greg Louis
glouis at dynamicro.on.ca
Fri Dec 13 20:25:32 CET 2002
On 20021213 (Fri) at 15:14:13 +0100, Matthias Andree wrote:
> > Currently that's not true; the uuencoded data are read and a very large
> > number of meaningless tokens is generated.
>
> I thought we had an "eat uuencode lines" rule in lexer.l now.
>
Only in CVS. The patch applies cleanly to older versions though; I'm
using it now.
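For anyone following along without CVS access: I haven't compared this
with the actual rule in lexer.l, but the general shape would be an
exclusive start condition that swallows the body lines. The [!-`]
alphabet and the 61-character line bound below are my guesses at what
uuencoded lines look like, not something copied from the patch:

    %x UUDATA
    %%
    ^"begin "[0-7]+" "[^\n]+\n  { BEGIN(UUDATA); }  /* "begin 644 foo": start eating */
    <UUDATA>^"end"\n            { BEGIN(INITIAL); } /* trailer: resume normal scanning */
    <UUDATA>^[!-`]{1,61}\n      { /* encoded data line: emit no token */ }
    <UUDATA>.|\n                { BEGIN(INITIAL); yyless(0); } /* malformed: bail out, rescan */

The bail-out rule matters: a truncated attachment shouldn't leave the
scanner stuck eating the rest of the message.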
> How do we identify tokens that cannot and will no longer contribute
> to the spamicity? We will want to have a cron job weed these entries
> out of the .db once a week.
I'd love to have a utility that could remove unwanted tokens from the
dbs, given a list of the tokens to delete (a rough sketch follows the
quoted discussion below). Would it be hard to make one? There was some
discussion a while back on automated pruning; I quote:
> If I do random deletions on the spamlist, I'm going to hit a token
> with a count of 1 more than 3 times out of 4, and one with a count
> under 11 more than 19 times out of 20. The goodlist is a bit
> riskier, but still not bad: 60% and 88% respectively.
> (Mind you, it might be as productive just to purge all the tokens
> with counts of 1 every so often ;-)
To which Thomas Allison replied:
> It would be simplest to delete the low-count ones.
> As you mentioned, if there is a new tendency developing, the
> important tokens will be quickly replenished and will rise above the
> threshold of 1 on short notice.
End quote.
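On that removal utility, here's a minimal sketch of what I have in
mind. It assumes Berkeley DB 4.x (db->open() grew a transaction
argument in 4.0) and that bogofilter keys the wordlist on the raw
token bytes with no trailing NUL; both assumptions want checking
against datastore.c before anyone trusts this. The name rmtokens is
just mine:

    /* rmtokens.c -- delete listed tokens from a bogofilter wordlist.
     * Usage: rmtokens wordlist.db < tokens-to-remove.txt
     */
    #include <db.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        DB *dbp;
        char line[256];
        int ret;

        if (argc != 2) {
            fprintf(stderr, "usage: %s wordlist.db < tokenlist\n", argv[0]);
            return EXIT_FAILURE;
        }
        if ((ret = db_create(&dbp, NULL, 0)) != 0) {
            fprintf(stderr, "db_create: %s\n", db_strerror(ret));
            return EXIT_FAILURE;
        }
        if ((ret = dbp->open(dbp, NULL, argv[1], NULL,
                             DB_BTREE, 0, 0664)) != 0) {
            dbp->err(dbp, ret, "%s", argv[1]);
            return EXIT_FAILURE;
        }
        while (fgets(line, sizeof(line), stdin) != NULL) {
            DBT key;
            line[strcspn(line, "\r\n")] = '\0';  /* strip the newline */
            if (line[0] == '\0')
                continue;
            memset(&key, 0, sizeof(key));
            key.data = line;
            key.size = strlen(line);  /* assumes no trailing NUL in the key */
            ret = dbp->del(dbp, NULL, &key, 0);
            if (ret != 0 && ret != DB_NOTFOUND)
                dbp->err(dbp, ret, "del %s", line);
        }
        dbp->close(dbp, 0);
        return EXIT_SUCCESS;
    }

Run it only while nothing else has the wordlist open; there's no
locking here.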
> Bogofilter will never be able to distinguish uuencode from random
> text.
You mean, "It will never be worth the effort of teaching bogofilter to
distinguish uuencode." I'd agree with that, but I think it _should_
ignore application/octet-stream attachments.
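Not that bogofilter walks the MIME structure today, but once something
does, the test itself is trivial. This helper is purely illustrative
(the function name and calling convention are mine); it would be handed
whatever follows the Content-Type: header of a part:

    #include <strings.h>  /* strncasecmp */

    /* Return nonzero if this part's body should be withheld from the
     * tokenizer.  A real version would also want to catch base64 audio,
     * video and image types. */
    static int ignore_part(const char *content_type)
    {
        while (*content_type == ' ' || *content_type == '\t')
            content_type++;  /* skip whitespace after the colon */
        return strncasecmp(content_type,
                           "application/octet-stream", 24) == 0;
    }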
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |