bogofilter and charset recoding

Wed Dec 17 14:22:51 CET 2003

On Wed, 17 Dec 2003 16:17:55 +0300
ls+muttedf93d30 at gambit.com.ru wrote:

> >>> "ignore_case" is an option that exists
> >>> for compatibility with old versions
> >> Every word may have 2^word.length records in a wordlist then.
> > If people used all possible capitalizations for a word, there would
> > be that many different spellings.  In actual usage, I've not
> > observed that problem.
> 
> In actual usage I've observed the charset problem: many
> words are present in my wordlist in different character
> sets (KOI8-R, WINDOWS-1251, CP-866, UTF-8 and ISO-8859-5).
> 
> With a single character set for the wordlist, accuracy may increase,
> since the meaning of a word does not change across different
> representations.
> 
> > If you're concerned or space is at a premium, you can
> > create a filter to convert upper case characters to lower
> > case and filter the message before bogofilter gets it.
> 
> This preprocessor must support MIME, decode base64 and
> quoted-printable :-(

Hi,

It could be argued that having different representations gives
additional clues to help in the scoring :-)

I personally don't need Unicode and am not inclined to do the work.  If
you have the skills, feel free to start hacking the code.
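
For anyone who would rather experiment outside bogofilter first, here is
a rough sketch of the kind of preprocessor discussed above.  It is not
part of bogofilter; it assumes Python's standard "email" module, and the
function name normalized_text and its fallback parameter are made up for
illustration.  It undoes base64 and quoted-printable, recodes each text
part from its declared charset to Unicode, and lowercases the result.
It keeps only the bodies of text parts (headers and attachments are
dropped), so treat it as a starting point, not a finished filter.

```python
import email


def normalized_text(raw: bytes, fallback: str = "latin-1") -> str:
    """Return the lowercased text of all text/* parts as one string.

    Hypothetical helper: the name and the fallback charset are
    assumptions made for this sketch, not bogofilter behaviour.
    """
    msg = email.message_from_bytes(raw)
    chunks = []
    for part in msg.walk():
        # Skip multipart containers, images, attachments, etc.
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or fallback
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown or misspelled charset label: use the fallback.
            text = payload.decode(fallback, errors="replace")
        chunks.append(text.lower())
    return "\n".join(chunks)
```

The returned string can be written out in a single encoding (UTF-8, say)
and tokenized from there, so that the same word arriving in KOI8-R and
in WINDOWS-1251 ends up as one spelling instead of two.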

David

P.S.  Please use the mailing list so others can learn from our
discussions and participate in them.

-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800