proposed improvements
David Relson
relson at osagesoftware.com
Wed Apr 30 16:22:10 CEST 2003
Cedric,
Word pairs could be implemented quite easily in the collect_words()
function in file collect.c. The code would just need to keep track of the
previous (old) token and create "$old $new". You could probably get the
code working in an hour or so.
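
Here is a rough sketch of the idea, written as a standalone helper rather than as a patch to collect_words(). Everything in it (expand_with_pairs(), TOKEN_MAX, ENTRY_MAX, the out[] array) is made up for illustration and is not taken from collect.c; the real function would hand each token to the word list instead of copying it into an array, and would need to copy the previous token if the lexer reuses its buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOKEN_MAX 64                   /* assumed upper bound on token length */
#define ENTRY_MAX (2 * TOKEN_MAX + 2)  /* "old" + ' ' + "new" + NUL */

/*
 * Hypothetical helper, not bogofilter code: copy each token into out[] and,
 * from the second token on, also store the pair "old new".
 * Returns the number of entries written (at most out_cap).
 */
size_t expand_with_pairs(const char *const *tokens, size_t n,
                         char out[][ENTRY_MAX], size_t out_cap)
{
    const char *old = NULL;  /* previous token; NULL until one has been seen */
    size_t count = 0;

    assert(tokens != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        const char *cur = tokens[i];

        /* Assumption: an over-long token is skipped and breaks the pair chain. */
        if (strlen(cur) > TOKEN_MAX) {
            old = NULL;
            continue;
        }

        /* The single word is always stored, as it is today. */
        if (count < out_cap)
            snprintf(out[count++], ENTRY_MAX, "%s", cur);

        /* The pair is stored only once a previous token exists. */
        if (old != NULL && count < out_cap)
            snprintf(out[count++], ENTRY_MAX, "%s %s", old, cur);

        old = cur;
    }
    return count;
}
```

For the text "A B C D" from your message, this stores the four single words plus the pairs "A B", "B C" and "C D", seven entries in all.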
Database size is one concern with the change. With 13,538 messages my
spamlist.db is 4.5 MB. My goodlist.db has 32,681 messages and is
14 MB. Overall size isn't too bad. A second possible problem is
speed. With a larger database, caching becomes more important. Using word
pairs will also increase the number of database lookups. So there will be
a speed penalty, but I won't even try to predict how big it will be.
If you (or somebody else) decide to try that, I'm available for questions
and code review.
Also, it would be a good idea to set up some tests to measure the effect.
The bogofilter/training scripts in 0.12.x could be used to see whether
word pairs help and by how much.
By the way, 0.12.2 is now available on SourceForge as the current release.
David
At 09:20 AM 4/30/03, Cedric Foll wrote:
>Hi,
>
>I'm using bogofilter on my mail server and I'm really happy with it.
>I'm also trying SpamAssassin, and I'd like to propose a few things that I
>found good in that software.
>
>1) The possibility of analysing more than a single word. For example,
>"bigger" and "sex" are quite usual in a non-spam e-mail. But "bigger sex"
>is almost certainly spam.
>So it could be an option in bogofilter, during the learning process, to
>also analyse blocks of two words.
>For example, with the text "A B C D", to save not only
>A
>B
>C
>D
>but also
>A B
>B C
>C D
>I know that the db would be three times bigger, but disk space isn't a
>problem for a lot of people. And it could be an option.
>2) For each mail learned, to have a file where we save a checksum of the
>mail and its classification. Then it would not be possible to learn an
>e-mail more than once, and when a mistake is made there would be no more
>need for -S and -N. (SpamAssassin uses something like that.) It should be
>optional too.
>
>
>Regards.
>
>