proposed improvements

David Relson relson at osagesoftware.com
Wed Apr 30 16:22:10 CEST 2003


Cedric,

Word pairs could be implemented quite easily in the collect_words() 
function in file collect.c.  The code would just need to keep track of the 
previous (old) token and create "$old $new".  You could probably get the 
code working in an hour or so.
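
To make the idea concrete, here is a minimal sketch of the "$old $new" bookkeeping. It does not reproduce the real collect_words() code or its signature. The names emit_with_pairs, emit_fn, append_line, struct outbuf and MAX_TOKEN are all hypothetical, and the token array stands in for whatever the lexer actually returns.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOKEN 64            /* assumed cap on token length (sketch only) */

/* Hypothetical stand-in for whatever the real code does with each token. */
typedef void (*emit_fn)(const char *token, void *ctx);

/* Small fixed buffer that collects emitted tokens, one per line. */
struct outbuf {
    char   text[256];
    size_t len;
};

/* Example callback: append "token\n" to the buffer, dropping it if full. */
static void append_line(const char *token, void *ctx)
{
    struct outbuf *out = ctx;
    size_t room = sizeof out->text - out->len;
    int w = snprintf(out->text + out->len, room, "%s\n", token);

    if (w > 0 && (size_t)w < room)
        out->len += (size_t)w;
    else
        out->text[out->len] = '\0';     /* undo a truncated write */
}

/*
 * Emit every token, plus "old new" for each adjacent pair.
 * For n tokens this yields n single words and n - 1 pairs.
 * Returns the total number of entries emitted.
 */
static size_t emit_with_pairs(const char *const *tokens, size_t n,
                              emit_fn emit, void *ctx)
{
    char pair[2 * MAX_TOKEN + 2];       /* old + ' ' + new + NUL */
    const char *old = NULL;             /* previous token; none at start */
    size_t emitted = 0;

    assert(emit != NULL);

    for (size_t i = 0; i < n; i++) {
        const char *cur = tokens[i];

        emit(cur, ctx);                 /* the single word, as today */
        emitted++;

        if (old != NULL) {
            /* snprintf truncates over-long pairs instead of overflowing */
            snprintf(pair, sizeof pair, "%s %s", old, cur);
            emit(pair, ctx);
            emitted++;
        }
        old = cur;                      /* remember for the next round */
    }
    return emitted;
}

#ifdef PAIR_DEMO_MAIN
int main(void)
{
    const char *tokens[] = { "A", "B", "C", "D" };
    struct outbuf out = { "", 0 };
    size_t n = emit_with_pairs(tokens, 4, append_line, &out);

    /* prints A, B, A B, C, B C, D, C D (one per line), then the count 7 */
    printf("%s%zu tokens emitted\n", out.text, n);
    return 0;
}
#endif
```

Compiled with -DPAIR_DEMO_MAIN, the text "A B C D" yields the four single words and the three pairs "A B", "B C" and "C D". In general, n tokens become 2n - 1 entries, which is where the extra database lookups and the extra database size come from.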

Database size is one concern with the change.  With 13,538 messages my 
spamlist.db is 4.5 MB.  My goodlist.db has 32,681 messages and is 
14 MB.  At the moment the overall size isn't too bad, but word pairs 
would add to it.  A second possible problem is speed.  With a larger 
database, caching becomes more important.  Using word pairs will also 
increase the number of database lookups.  So there will be a speed 
penalty, but I won't even try to predict how big it will be.

If you (or somebody else) decide to try that, I'm available for questions 
and code review.

Also, it would be a good idea to set up some tests.  The 
bogofilter/training scripts in 0.12.x could be used to see whether word 
pairs help and to measure how much.

By the way, 0.12.2 is now available on SourceForge as the current release.

David

At 09:20 AM 4/30/03, Cedric Foll wrote:

>Hi,
>
>I'm using bogofilter on my mail server and I'm really happy with it.
>I'm also trying spamassassin, and I'd like to propose a few things that
>I found good in that software.
>
>1) The possibility of analysing more than a single word. For example,
>"bigger" and "sex" are quite usual in non-spam e-mail, but "bigger sex"
>is already almost certainly spam.
>So there could be an option in bogofilter, during the learning process,
>to also analyse blocks of two words.
>For example, with the text "A B C D", to save not only
>A
>B
>C
>D
>but also
>A B
>B C
>C D
>I know that the db would be three times bigger, but disk space isn't a
>problem for a lot of people. And it could be an option.
>2) For each mail learned, keep a file where we save a checksum of the
>mail and its classification. Then it would not be possible to learn an
>e-mail more than once, and when a mistake is made, there would be no
>more need for -S and -N. (spamassassin uses something like that.) It
>should be optional too.
>
>
>Regards.
>
>




