libgmime

Greg Louis glouis at dynamicro.on.ca
Fri Dec 13 20:25:32 CET 2002


On 20021213 (Fri) at 1514:13 +0100, Matthias Andree wrote:
> > Currently that's not true; the uuencoded data are read and a very large 
> > number of meaningless tokens is generated.
> 
> I thought we had an "eat uuencode lines" rule in lexer.l now.
> 
Only in cvs.  The patch applies cleanly to older stuff though; I'm
using it now.
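
Incidentally, recognizing uuencode lines is cheap: a body line encodes
its own decoded length in the first character (a full 45-byte line
starts with 'M'), the rest of the line is drawn from a fixed
64-character alphabet, and the block is bracketed by "begin <mode>
<name>" and "end" lines.  Roughly this, expressed as a plain C check
rather than the actual lexer.l rule (an untested sketch, not the patch
itself):

  #include <string.h>

  /* Rough test for a line belonging to a uuencoded block.  Only an
   * illustration of the idea behind an "eat uuencode lines" rule;
   * not the rule that's in lexer.l.  "line" is one NUL-terminated
   * input line without its newline. */
  static int looks_like_uuencode(const char *line, size_t len)
  {
      size_t i;

      if (len == 0)
          return 0;
      if (strncmp(line, "begin ", 6) == 0 || strcmp(line, "end") == 0)
          return 1;                        /* framing lines */
      for (i = 0; i < len; i++)
          if (line[i] < ' ' || line[i] > '`')
              return 0;                    /* outside the uuencode alphabet */
      /* the first char encodes the decoded byte count; at most 45 per line */
      return ((unsigned)(line[0] - ' ') & 0x3f) <= 45;
  }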
  
> How do we identify tokens that cannot and will no longer contribute to
> the spamicity? We will want to have a cron job weed these entries out
> of the .db once a week.
  
I'd love to have a utility that could strip unwanted tokens from the
dbs, given a list of the tokens to remove.  Would it be hard to make
one?  (A rough sketch of what I have in mind follows the quoted
discussion below.)  There was some discussion a while back on
automated pruning; I quote:

 > If I do random deletions on the spamlist, I'm going to hit a token
 > with a count of 1 more than 3 times out of 4, and one with a count
 > under 11 more than 19 times out of 20.  The goodlist is a bit
 > riskier, but still not bad: 60% and 88% respectively.

 > (Mind you, it might be as productive just to purge all the tokens
 > with counts of 1 every so often ;-)

To which Thomas Allison replied:

 > It would be simplest to delete the low count ones.

 > As you mentioned, if there is a new tendency developing then the
 > important ones will be quickly replenished and rise above the
 > threshold of 1 in short order.

End quote.
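
For the record, the weeding itself looks easy enough through Berkeley
DB's cursor interface (and since a random deletion hits a count-1
token with probability equal to the fraction of count-1 tokens, those
percentages just say most of the spamlist is count-1 entries anyway).
Something like the following, rough and untested, would purge
everything below a count threshold; removing tokens named on a list
would just be a dbp->del() per token instead.  I'm assuming the data
field is a bare 32-bit count, which may well not match what bogofilter
actually stores, and the DB 4.1 open() signature (older DB versions
omit the transaction argument); the filename is only an example.

  /* prune.c -- delete every token whose count is below a threshold.
   * Untested sketch; see the assumptions above. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <db.h>

  int main(int argc, char **argv)
  {
      const char *file = argc > 1 ? argv[1] : "spamlist.db";
      unsigned long threshold = argc > 2 ? strtoul(argv[2], NULL, 10) : 2;
      unsigned long removed = 0;
      DB *dbp;
      DBC *cur;
      DBT key, data;

      if (db_create(&dbp, NULL, 0) != 0)
          return 1;
      if (dbp->open(dbp, NULL, file, NULL, DB_BTREE, 0, 0664) != 0) {
          dbp->close(dbp, 0);
          return 1;
      }
      if (dbp->cursor(dbp, NULL, &cur, 0) != 0) {
          dbp->close(dbp, 0);
          return 1;
      }

      memset(&key, 0, sizeof key);
      memset(&data, 0, sizeof data);
      while (cur->c_get(cur, &key, &data, DB_NEXT) == 0) {
          u_int32_t count = 0;
          if (data.size >= sizeof count)
              memcpy(&count, data.data, sizeof count);
          if (count < threshold) {         /* e.g. purge the count-1 tokens */
              cur->c_del(cur, 0);
              removed++;
          }
      }

      cur->c_close(cur);
      dbp->close(dbp, 0);
      printf("removed %lu tokens with count < %lu from %s\n",
             removed, threshold, file);
      return 0;
  }

Build with -ldb, and run it on a copy of the wordlist first, of course.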

> Bogofilter will never be able to distinguish uuencode from random
> text.

You mean, "It will never be worth the effort of teaching bogofilter to
distinguish uuencode."  I'd agree with that, but I think it _should_
ignore application/octet-stream attachments.
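
By "ignore" I mean the lexer shouldn't see those bodies at all.  The
test itself is trivial; whether the content type comes from libgmime
or from our own header parsing, it boils down to something like this
(the function name and the call site are made up, just to show the
check):

  #include <strings.h>   /* strncasecmp */

  /* Should this MIME part's body be skipped entirely?  "ctype" is the
   * value of the part's Content-Type header.  Hypothetical helper,
   * not bogofilter's or libgmime's API. */
  static int skip_part_body(const char *ctype)
  {
      if (ctype == NULL)
          return 0;                     /* no header: treat as text */
      while (*ctype == ' ' || *ctype == '\t')
          ctype++;
      return strncasecmp(ctype, "application/octet-stream", 24) == 0;
  }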

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
