charsets and lexer [was: "make check" fails on hp-ux]
David Relson
relson at osagesoftware.com
Mon Nov 25 05:29:53 CET 2002
At 10:40 PM 11/24/02, Matthias Andree wrote:
>David Relson <relson at osagesoftware.com> writes:
>
> > The message is in file tests/t.systest.d/inputs/msg.3.txt and is:
> >
> > Content-Type: text/html; charset="us-ascii"
>
>Hu. High bit set in US-ASCII? Reject it at the SMTP port and be
>done. You're lucky and can do that. A German version of Netscape 4 has
>umlauts in some headers without encoding "Visitenkarte für..."
>("business card for...") which cause false positives on these checks.

I'm just reporting on what I see in the messages I received. The 8 test
messages in tests/t.systest.d/inputs _are_ all spam, but that's not exactly
why they were chosen for the test suite. I wanted some messages that have
noticeably different Robinson spamicities, in order to make their
computations more "interesting".

2 of the 8 messages, i.e. msg.3.txt and msg.7.txt, have hi-bit characters,
but only a few special ones: 0x92 (single quote), 0x93 and 0x94
(left/right double quotes), 0xA9 (copyright sign), and 0xA0 (no-break
space). 0xA0 shows up in a lot of webpages as an alternative to &nbsp;.

1 message within the non-spam training set is "iso-8859-1" and "8bit" and
contains a German tag line with character 0xFC in für.

Of the 8 test messages, 2 are us-ascii, 5 are iso-8859-1, and 1 doesn't
specify any charset.
3 specify "Content-Transfer-Encoding: 7bit". Both the hi-bit messages are
in this 7bit grouping.
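To make the mismatch concrete, here is a minimal sketch of a check that flags a body claiming "7bit" or "us-ascii" while carrying bytes above 0x7F. The function name has_high_bit is hypothetical and is not taken from bogofilter; it only illustrates the test being discussed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not bogofilter code.
 * Returns 1 if any of the first len bytes of buf has its high bit set
 * (a value in 0x80..0xFF), and 0 otherwise. */
int has_high_bit(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] & 0x80)
            return 1;
    }
    return 0;
}
```

A message that declares "Content-Transfer-Encoding: 7bit" and still makes this return 1 is in the same situation as msg.3.txt and msg.7.txt.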
Anyhow 'tis a variety of messages and types and it makes for an interesting
test :-)

> > Perhaps the temporary solution is to modify the test messages so that
> > they don't have the problem characters (0x92, etc). When we have a
> > reasonable charset handler, then we can treat those characters in a
> > better way.
>
>OK. bogofilter -p is unaffected so far (else, t.integrity2 would fail,
>and this test is there in anticipation of possible passthrough mode
>changes.)

The training sets have been sanitized with proper 7bit characters replacing
the 8bit characters. Having gone through the exercise of mapping 8bit to
7bit in lexer.l, I had spamicities from the training set for all the
messages. After switching to the sanitized files and removing the
8bit-to-7bit code, the tests ran and gave the same results as before.
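For illustration only, a rough sketch of what an 8bit-to-7bit mapping for the characters listed earlier could look like. This is not the code that was in lexer.l; the function names to_7bit and sanitize_7bit, and the choice of replacement strings (for example "(c)" for 0xA9), are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 7bit stand-in for a single high-bit byte.
 * Returns NULL for bytes this sketch has no replacement for. */
const char *to_7bit(unsigned char c)
{
    switch (c) {
    case 0x92: return "'";      /* single quote */
    case 0x93:                  /* left double quote */
    case 0x94: return "\"";     /* right double quote */
    case 0xA9: return "(c)";    /* copyright sign */
    case 0xA0: return " ";      /* no-break space */
    default:   return NULL;
    }
}

/* Copy len bytes of src into dst, replacing the bytes known to to_7bit().
 * Bytes without a replacement are copied unchanged. dst has room for
 * dstsize bytes including the terminating NUL; output that does not fit
 * is truncated. Returns the number of bytes written, excluding the NUL. */
size_t sanitize_7bit(const unsigned char *src, size_t len,
                     char *dst, size_t dstsize)
{
    size_t i, out = 0;

    assert(src != NULL && dst != NULL && dstsize > 0);
    for (i = 0; i < len; i++) {
        const char *rep = to_7bit(src[i]);
        size_t n = rep ? strlen(rep) : 1;

        if (out + n >= dstsize)
            break;              /* keep room for the NUL */
        if (rep)
            memcpy(dst + out, rep, n);
        else
            dst[out] = (char)src[i];
        out += n;
    }
    dst[out] = '\0';
    return out;
}
```

Running the test inputs through something like this once, and checking in the results, has the same effect as doing the mapping in the lexer on every run, which would be consistent with the tests giving identical results.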
This week, I'll see about adding my charset code so you guys can take pot
shots at it, tell me what all is wrong, and maybe we can end up with a
slightly better lexer!

David