"make check" fails on hp-ux

David Relson relson at osagesoftware.com
Sun Nov 24 17:04:16 CET 2002


At 11:19 PM 11/23/02, Allyn Fratkin wrote:
>>Let me know how it goes.  If you're so inclined, run both tests and send
>>me a tgz file of ./tmp.
>
>
>here it is.  i have the "sh -x" output of t.systest and t.grftest,
>and the output directories.  i hope this helps.  i couldn't get
>much out of the systest one, but the grftest one definitely shows
>different spamicity values.

Allyn,

What you sent me was perfect.  I know why the tests failed on hp-ux.  I'll 
tell you how I learned that ...

I put your outputs from t.systest into directory systest-hp-ux and those 
from t.grftest in grftest-hp-ux.  I put the reference outputs into 
systest-linux and grftest-linux.  With this organization I could diff the 
directories to see what's going on.  diff spit out 300+ lines when it 
compared the msg.3.r.vvv files, which are msg.3.txt processed using the 
Robinson algorithm and generating verbose level 3 output.  Stated 
differently, the output listed all the tokens and their 
spamicities.  Looking at the two files, I found that the hp-ux file has 214 
lines and the reference (linux) file has 216 lines.  The linux file has 
lines for clancy, clancy's, king, and king's, while the hp-ux file only has 
clancy and king (see below):

$$ egrep -w "(clancy|king)" */msg.3.r.vvv
systest-hp-ux/msg.3.r.vvv: 43  clancy                    0.00         0  0.415000  -0.53614  -0.87948
systest-hp-ux/msg.3.r.vvv: 64  king                      0.00         0  0.415000  -0.53614  -0.87948

systest-linux/msg.3.r.vvv: 43  clancy                    0.00         0  0.415000  -0.53614  -0.87948
systest-linux/msg.3.r.vvv: 44  clancy's                  0.00         0  0.415000  -0.53614  -0.87948
systest-linux/msg.3.r.vvv: 65  king                      0.00         0  0.415000  -0.53614  -0.87948
systest-linux/msg.3.r.vvv: 66  king's                    0.00         0  0.415000  -0.53614  -0.87948

Going back to the original message, I see the following lines:

         Subject: Stephen King's latest thriller! Free Shipping & Handling

         Red Rabbit By Tom Clancy
         Tom Clancy's back with the story behind his most popular
         character - Jack Ryan.

I ran these lines through my hexdump utility and got:

         000000  53 75 62 6A 65 63 74 3A  20 53 74 65 70 68 65 6E  Subject: Stephen
         000010  20 4B 69 6E 67 92 73 20  6C 61 74 65 73 74 20 74   King s latest t
         000020  68 72 69 6C 6C 65 72 21  20 46 72 65 65 20 53 68  hriller! Free Sh
         000030  69 70 70 69 6E 67 20 26  20 48 61 6E 64 6C 69 6E  ipping & Handlin
         000040  67 0A 0A 52 65 64 20 52  61 62 62 69 74 20 42 79  g  Red Rabbit By
         000050  20 54 6F 6D 20 43 6C 61  6E 63 79 0A 54 6F 6D 20   Tom Clancy Tom
         000060  43 6C 61 6E 63 79 92 73  20 62 61 63 6B 20 77 69  Clancy s back wi
         000070  74 68 20 74 68 65 20 73  74 6F 72 79 20 62 65 68  th the story beh
         000080  69 6E 64 20 68 69 73 20  6D 6F 73 74 20 70 6F 70  ind his most pop
         000090  75 6C 61 72 0A 63 68 61  72 61 63 74 65 72 20 2D  ular character -
         0000A0  20 4A 61 63 6B 20 52 79  61 6E 2E 20 0A 0A         Jack Ryan.

Notice that at byte offsets 0x15 and 0x66 the character value is 0x92.  linux 
shows this as an apostrophe and lexer.l accepts it as a valid part of a 
token.  On hp-ux this character is evidently rejected, so the tokens that 
bogofilter sees are slightly different.
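
For anyone who wants to repeat the check without a hexdump utility, here is 
a minimal sketch that reports where the high-bit bytes sit in a buffer.  The 
function name find_high_bytes() is made up for this illustration; it is not 
part of bogofilter.

```c
#include <stddef.h>

/* Hypothetical helper (not bogofilter code): record the offsets of all
 * bytes >= 0x80 in buf.  At most max offsets are stored in out[]; the
 * return value is the total number of such bytes found. */
static size_t find_high_bytes(const unsigned char *buf, size_t len,
                              size_t *out, size_t max)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0x80) {
            if (n < max)
                out[n] = i;   /* remember where the byte was seen */
            n++;
        }
    }
    return n;
}
```

Run over the message excerpt in the dump, it should report the two 0x92 
bytes at offsets 0x15 and 0x66.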

My instinct is to modify yyinput() so that it translates 0x92 to 
apostrophe.  I've attached a patch.  In addition to translating 0x92 to 
apostrophe, it also translates 0xA0 (known as the "no-break space") to 0x20 
(a space).
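
The idea, roughly, is shown below.  This is only a sketch of the 
translation and not the attached patch itself; the helper name 
translate_high_chars() is invented for the example, and the real change 
lives inside yyinput() in lexer.l.

```c
#include <stddef.h>

/* Hypothetical helper (not the attached patch): rewrite two troublesome
 * bytes in place before the lexer sees them.
 *   0x92 -> '\''  (apostrophe)
 *   0xA0 -> ' '   (no-break space becomes a plain space)
 * All other bytes are left untouched. */
static void translate_high_chars(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        switch (buf[i]) {
        case 0x92:
            buf[i] = '\'';
            break;
        case 0xA0:
            buf[i] = ' ';
            break;
        default:
            break;
        }
    }
}
```

With this applied to the input buffer, "King\x92s" reaches the lexer as 
"King's" on both platforms.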

Unfortunately, this patch alters the reference results: the token 
"it\x92s" goes from being an unknown token to matching "it's", and its 
spamicity changes accordingly from 0.415000 to 0.237638.  I'll regenerate 
the reference results.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lexer.l.patch
Type: application/octet-stream
Size: 1441 bytes
Desc: not available
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20021124/68bc062d/attachment.obj>

