... convert_unicode.c ...

David Relson relson at osagesoftware.com
Mon Jun 20 13:53:54 CEST 2005


On Mon, 20 Jun 2005 13:35:19 +0200
Matthias Andree wrote:

> David Relson <relson at osagesoftware.com> writes:
> 
> > The question of the moment is what to do when iconv_open() fails.  As
> > you suggest we could just ignore the message.  That seems like a bad
> > idea as one could just add a dummy mime body section with a bogus
> > charset and bogofilter would be disabled.  Not good!
> 
> Right you are - the question is what will mailers present to the user
> with strange character sets? We should probably log these for a while to
> obtain relevant information.

Attached is a list of 76 charsets seen in this month's spam.  I don't
have time to see what iconv_open( "UTF-8", from_charset ) thinks of
them -- have to head to work.
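
For whoever gets to it first, a minimal sketch of that check might look
like the following.  It assumes a POSIX iconv and a file of
whitespace-separated charset names; charset_supported() and
report_unsupported() are names made up for illustration, not existing
bogofilter functions.

```c
#include <assert.h>
#include <iconv.h>
#include <stdio.h>

/* Return 1 if this system's iconv can convert from_charset to UTF-8,
 * 0 if iconv_open() rejects the name.  Hypothetical helper. */
int charset_supported(const char *from_charset)
{
    assert(from_charset != NULL);

    /* Note the argument order: target encoding first, source second. */
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return 0;

    iconv_close(cd);
    return 1;
}

/* Read whitespace-separated charset names from fp and print the ones
 * iconv rejects.  Returns the number of rejected names. */
int report_unsupported(FILE *fp)
{
    char name[128];
    int rejected = 0;

    while (fscanf(fp, "%127s", name) == 1) {
        if (!charset_supported(name)) {
            printf("unsupported: %s\n", name);
            rejected++;
        }
    }
    return rejected;
}
```

Calling report_unsupported() on the attached list would show which of
the names this particular iconv does not know; the answer will differ
between iconv implementations.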

> 
> > It would be better to turn off translation and simply parse whatever
> > text is present. Translation will resume at the next 
> > "Content-Type: ... charset=" directive.  True, some untranslated text
> > would be passed through, but the impact would probably be minor.
> 
> I'm a bit concerned about storing non-UTF-8 tokens in a database that
> claims UTF-8 format. This is a can of worms we can avoid - like reading
> the database back to show it to the user (we don't do that yet) fails
> with EILSEQ or similar.

The solutions I can think of are all hacks:

  ignore tokens for invalid charsets (when scoring and registering)
  don't register tokens from messages with invalid charsets

The present policy of accepting tokens from invalid charsets is
relatively benign.  Of course, when the bogus charset name is seen
(after registration as spam), it has a very high spam score --
effectively a red flag!
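
To make the choices concrete, here is a rough sketch of how the state
could be tracked per mime section.  It is not the current
convert_unicode.c code; charset_state, charset_begin(), charset_end()
and token_usable() are hypothetical names, and the drop_invalid switch
is an assumption standing in for whichever policy we settle on.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical per-section state; not bogofilter's actual structure. */
typedef struct {
    iconv_t cd;     /* conversion descriptor, meaningful only if valid */
    int     valid;  /* 1 if iconv_open() accepted the charset, else 0 */
} charset_state;

/* Call at each "Content-Type: ... charset=" directive.
 * Returns 1 if translation to UTF-8 is available, 0 otherwise. */
int charset_begin(charset_state *st, const char *from_charset)
{
    st->cd = iconv_open("UTF-8", from_charset);
    st->valid = (st->cd != (iconv_t)-1);
    return st->valid;
}

/* Release the descriptor, if one was opened, when the section ends. */
void charset_end(charset_state *st)
{
    if (st->valid)
        iconv_close(st->cd);
    st->valid = 0;
}

/* Policy switch.  With drop_invalid set, tokens from a section whose
 * charset iconv rejected are ignored; with it clear, they are accepted
 * untranslated, which is the present behaviour. */
int token_usable(const charset_state *st, int drop_invalid)
{
    return st->valid || !drop_invalid;
}
```

With drop_invalid set, nothing untranslated reaches the database; with
it clear, the untranslated tokens pass through as they do today.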

David

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: charset.2005-06-Spam.txt
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20050620/4c26e80f/attachment.txt>

