"make check" fails on hp-ux
David Relson
relson at osagesoftware.com
Sun Nov 24 17:04:16 CET 2002
At 11:19 PM 11/23/02, Allyn Fratkin wrote:
>>Let me know how it goes. If you're so inclined, run both tests and send
>>me a tgz file of ./tmp.
>
>
>here it is. i have the "sh -x" output of t.systest and t.grftest,
>and the output directories. i hope this helps. i couldn't get
>much out of the systest one, but the grftest one definitely shows
>different spamicity values.
Allyn,
What you sent me was perfect. I know why the tests failed on hp-ux. I'll
tell you how I learned that ...
I put your outputs from t.systest into directory systest-hp-ux and those
from t.grftest in grftest-hp-ux. I put the reference outputs into
systest-linux and grftest-linux. With this organization I could diff the
directories to see what's going on. diff spit out 300+ lines when it
compared the msg.3.r.vvv files, which are msg.3.txt processed using the
Robinson algorithm and generating verbose level 3 output. Stated
differently, the output listed all the tokens and their
spamicities. Looking at the two files, I found that the hp-ux file has 214
lines and the reference (linux) file has 216 lines. The linux file has
lines for clancy, clancy's, king, and king's, while the hp-ux file only has
clancy and king (see below):
$$ egrep -w "(clancy|king)" */msg.3.r.vvv
systest-hp-ux/msg.3.r.vvv: 43 clancy 0.00 0 0.415000 -0.53614 -0.87948
systest-hp-ux/msg.3.r.vvv: 64 king 0.00 0 0.415000 -0.53614 -0.87948
systest-linux/msg.3.r.vvv: 43 clancy 0.00 0 0.415000 -0.53614 -0.87948
systest-linux/msg.3.r.vvv: 44 clancy's 0.00 0 0.415000 -0.53614 -0.87948
systest-linux/msg.3.r.vvv: 65 king 0.00 0 0.415000 -0.53614 -0.87948
systest-linux/msg.3.r.vvv: 66 king's 0.00 0 0.415000 -0.53614 -0.87948
Going back to the original message, I see the following lines:
Subject: Stephen King's latest thriller! Free Shipping & Handling
Red Rabbit By Tom Clancy
Tom Clancy's back with the story behind his most popular
character - Jack Ryan.
I ran these lines through my hexdump utility and got
000000 53 75 62 6A 65 63 74 3A 20 53 74 65 70 68 65 6E Subject: Stephen
000010 20 4B 69 6E 67 92 73 20 6C 61 74 65 73 74 20 74 King s latest t
000020 68 72 69 6C 6C 65 72 21 20 46 72 65 65 20 53 68 hriller! Free Sh
000030 69 70 70 69 6E 67 20 26 20 48 61 6E 64 6C 69 6E ipping & Handlin
000040 67 0A 0A 52 65 64 20 52 61 62 62 69 74 20 42 79 g Red Rabbit By
000050 20 54 6F 6D 20 43 6C 61 6E 63 79 0A 54 6F 6D 20 Tom Clancy Tom
000060 43 6C 61 6E 63 79 92 73 20 62 61 63 6B 20 77 69 Clancy s back wi
000070 74 68 20 74 68 65 20 73 74 6F 72 79 20 62 65 68 th the story beh
000080 69 6E 64 20 68 69 73 20 6D 6F 73 74 20 70 6F 70 ind his most pop
000090 75 6C 61 72 0A 63 68 61 72 61 63 74 65 72 20 2D ular character -
0000A0 20 4A 61 63 6B 20 52 79 61 6E 2E 20 0A 0A Jack Ryan.
Notice that at byte offsets 0x15 and 0x66 the character value is 0x92. linux
shows this as an apostrophe and lexer.l accepts it as a valid part of a
token. On hp-ux this character is evidently rejected, so the tokens that
bogofilter sees are slightly different.
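For anyone who wants to check other test messages for the same kind of byte, here is a minimal sketch of a scanner that reports the offsets of all bytes with the high bit set. The name find_high_bytes() and its signature are made up for illustration; nothing like it exists in bogofilter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of bogofilter): record the offsets of
 * all bytes >= 0x80 in buf.  At most max_out offsets are stored in out;
 * the return value is the total number of such bytes found. */
size_t find_high_bytes(const unsigned char *buf, size_t len,
                       size_t *out, size_t max_out)
{
    size_t found = 0;
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0x80) {
            if (found < max_out)
                out[found] = i;   /* remember where the byte was seen */
            found++;
        }
    }
    return found;
}
```

Run over the Subject: line above, it should report a single hit at offset 0x15, which matches the hexdump.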
My instinct is to modify yyinput() so that it translates 0x92 to
apostrophe. I've attached a patch. In addition to translating 0x92 to
apostrophe, it also translates 0xA0 (known as the "no-break space") to 0x20
(a space).
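The idea can be sketched roughly as follows. This is not the attached patch, only an illustration of the translation it describes; translate_char() and translate_buffer() are hypothetical names, and the real change would live wherever yyinput() fills the lexer's buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map the two troublesome bytes to plain ASCII.
 * 0x92 -> apostrophe, 0xA0 (no-break space) -> space; all other
 * values are returned unchanged. */
int translate_char(int c)
{
    switch (c & 0xFF) {
    case 0x92: return '\'';
    case 0xA0: return ' ';
    default:   return c & 0xFF;
    }
}

/* Apply the translation in place to a buffer of len bytes, as an
 * input routine might do before handing the data to the lexer. */
void translate_buffer(unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++)
        buf[i] = (unsigned char) translate_char(buf[i]);
}
```

Because the translation happens before the lexer sees the bytes, the token rules in lexer.l would not need to change, and the result no longer depends on how each platform classifies 0x92.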
Unfortunately, this patch alters the reference results, as token
"it\x92s" changes from an unknown token to matching "it's", with
corresponding spamicity change from 0.415000 to 0.237638. I'll regenerate
the reference results.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lexer.l.patch
Type: application/octet-stream
Size: 1441 bytes
Desc: not available
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20021124/68bc062d/attachment.obj>