charsets and lexer [was: "make check" fails on hp-ux]

Mon Nov 25 05:29:53 CET 2002

At 10:40 PM 11/24/02, Matthias Andree wrote:

>David Relson <relson at osagesoftware.com> writes:
>
> > The message is in file tests/t.systest.d/inputs/msg.3.txt and is:
> >
> >          Content-Type: text/html; charset="us-ascii"
>
>Hu. High bit set in US-ASCII? Reject it at the SMTP port and be
>done. You're lucky and can do that. A German version of Netscape 4 has
>umlauts in some headers without encoding "Visitenkarte für..."
>("business card for...") which cause false positives on these checks.

I'm just reporting on what I see in the messages I received.  The 8 test 
messages in tests/t.systest.d/inputs _are_ all spam, but that's not exactly 
why they were chosen for the test suite.  I wanted some messages that have 
noticeably different Robinson spamicities, in order to make their 
computations more "interesting".

2 of the 8 messages, i.e. msg.3.txt and msg.7.txt, have hi-bit characters, 
but only a few special ones: 0x92 (single quote), 0x93 and 0x94 
(left/right double quotes), 0xA9 (copyright sign), and 0xA0 (no-break 
space).  0xA0 shows up in a lot of webpages as an alternative to &nbsp;.

1 message within the non-spam training set is "iso-8859-1" and "8bit" and 
contains a German tag line with character 0xFC (u-umlaut) in "für".

Of the 8 test messages, 2 are us-ascii, 5 are iso-8859-1, and 1 doesn't 
specify any charset.

3 specify "Content-Transfer-Encoding: 7bit".  Both the hi-bit messages are 
in this 7bit grouping.
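
A minimal sketch of how one might spot such messages, i.e. ones that claim 
"7bit" but carry hi-bit bytes.  The function name count_hibit is made up 
for illustration and is not part of bogofilter:

```c
#include <stddef.h>

/* Hypothetical helper (not bogofilter code): count the bytes in
 * buf[0..len) that have the high bit (0x80) set.  A body that is
 * truly 7bit should give 0. */
size_t count_hibit(const unsigned char *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] & 0x80)
            n++;
    }
    return n;
}
```

A nonzero count for a part labelled "Content-Transfer-Encoding: 7bit" would 
flag a mislabelled message like msg.3.txt or msg.7.txt.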

Anyhow 'tis a variety of messages and types and it makes for an interesting 
test :-)

> > Perhaps the temporary solution is to modify the test messages so that
> > they don't have the problem characters (0x92, etc).  When we have a
> > reasonable charset handler, then we can treat those characters in a
> > better way.
>
>OK. bogofilter -p is unaffected so far (else, t.integrity2 would fail,
>and this test is there in anticipation of possible passthrough mode
>changes.)

The training sets have been sanitized with proper 7bit characters replacing 
the 8bit characters.  Having gone through the exercise of mapping 8bit to 
7bit in lexer.l, I already had spamicities from the training set for all 
the messages.  After switching to the sanitized files and removing the 
8bit-to-7bit code, the tests ran and gave the same results as before.
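
For illustration only, an 8bit-to-7bit mapping of the kind described above 
could look roughly like the sketch below.  This is not the code that was 
removed from lexer.l.  The function names and the choice of replacement 
characters are assumptions, and 0xA9 is left alone here because it has no 
single-character ASCII equivalent:

```c
#include <stddef.h>

/* Hypothetical mapping (not the removed lexer.l code): replace the
 * hi-bit characters seen in the test messages with 7bit look-alikes. */
unsigned char map_8bit_to_7bit(unsigned char c)
{
    switch (c) {
    case 0x92: return '\'';  /* single quote */
    case 0x93:               /* left double quote */
    case 0x94: return '"';   /* right double quote */
    case 0xA0: return ' ';   /* no-break space */
    default:   return c;     /* everything else is left unchanged */
    }
}

/* Apply the mapping in place to buf[0..len). */
void sanitize_buffer(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = map_8bit_to_7bit(buf[i]);
}
```

Running something like sanitize_buffer over the message text once, offline, 
has the same effect as doing the mapping in the lexer on every run, which 
would be consistent with the sanitized files giving the same results.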

This week, I'll see about adding my charset code so you guys can take pot 
shots at it, tell me what all is wrong, and maybe we can end up with a 
slightly better lexer!!

David