word scores [was: Unregistering Mail / MD5]

Fri Feb 14 01:05:35 CET 2003

At 05:28 PM 2/13/03, Barry Gould wrote:

>At 02:16 PM 2/13/2003, David Relson wrote:
>
>>Try running command "bogoutil -p -w $YOUR_DIRECTORY barrygould 
>>pennysaverusa.net".  Both those tokens are already being scored in every 
>>message you receive.  I'd bet that the spamicity is 0.50000 (or darn 
>>close to it).
>>
>>I assert that there's little or no difference in the classification 
>>of the original message and the forwarded message.  Your address is 
>>already in the header, so a second copy is not going to matter.
>
>Sorry, but it doesn't look that way:
># bogoutil -p -w .bogofilter barrygould pennysaverusa.net
>                        spam    good  Gra prob  Rob prob
>barrygould              251   13047  0.056194  0.057041
>pennysaverusa.net      1780   85259  0.060692  0.060819
>
>'barrygould' I can understand: most of our spam goes to other addresses.
>I'm not so sure about pennysaverusa.net, as all mail should say
>"Received: from (server) by mail.pennysaverusa.net" ...

What are the .MSG_COUNT values?

Here are some of the numbers from my domain.  Not surprisingly, the counts 
for osagesoftware.com are almost identical to the .MSG_COUNT values and the 
probabilities are 0.50.  It is interesting to see that the userids (which 
have been anonymized) other than mine are mostly very spammy - which does 
correspond to userids 1, 2, and 3 getting most of the spam.

[relson at osage src]$ bogoutil -p -w ../../spam-fixups/wordlists.h.16 david relson \
userid1 userid2 userid3 userid4 osagesoftware osagesoftware.com .MSG_COUNT
                        spam    good  Gra prob  Rob prob
david                    84    1414  0.113138  0.114387
relson                  455    5709  0.146137  0.146403
userid1                 551     294  0.800980  0.799256
userid2                 800     387  0.816148  0.814889
userid3                1372     295  0.908987  0.907979
userid4                  22     112  0.296675  0.301403
osagesoftware           268      15  0.974598  0.968388
osagesoftware.com      3070    6527  0.502502  0.502457
.MSG_COUNT             3086    6627  0.500000  0.499958
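
The "Gra prob" column can be reproduced from the counts in this table.  Here
is a minimal sketch in Python, assuming the value is the token's spam
frequency divided by the sum of its spam and good frequencies, with each
frequency normalized by the matching .MSG_COUNT.  That formula is inferred
from the numbers shown, not taken from the bogofilter source, and
graham_prob is a made-up name.  The "Rob prob" column is not modeled here.

```python
# Message counts taken from the .MSG_COUNT row of the table.
SPAM_MSGS = 3086
GOOD_MSGS = 6627


def graham_prob(spam_count, good_count,
                spam_msgs=SPAM_MSGS, good_msgs=GOOD_MSGS):
    """Assumed formula: spam frequency over (spam + good) frequency."""
    spam_freq = spam_count / spam_msgs   # share of spam messages with the token
    good_freq = good_count / good_msgs   # share of good messages with the token
    return spam_freq / (spam_freq + good_freq)


# A few rows copied from the table: (token, spam count, good count).
rows = [
    ("david", 84, 1414),
    ("userid3", 1372, 295),
    ("osagesoftware.com", 3070, 6527),
]

for token, spam, good in rows:
    print(f"{token:<20} {graham_prob(spam, good):.6f}")
```

Under that assumption, a token that appears in nearly every message (such as
osagesoftware.com) lands near 0.50 because its counts track .MSG_COUNT.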

>Regardless, I don't think it's a good idea.
>It would pollute the db with other headers, like MUA=Eudora would start 
>appearing spammy, even though it shouldn't be, esp as all my friends and 
>family and many co-workers use Eudora.
>It would also mess up the block-on-subnets data.

I don't think the effect would be very significant, but who can say for 
sure?  Not I.

On a related subject, each day my mail server scans /var/log/messages, 
/var/log/syslog, etc and sends me a report of anomalous events.  The 
reports are ham - by definition.  Several times, though, bogofilter has 
been unsure.  When I look to see why, the cause is often log entries for 
bounced mail messages to spam sites, for example:

Feb 11 20:15:45 nic postfix/smtp[6456]: connect to 
secondmore.com[203.22.104.12]: Connection refused (port 25)
Feb 11 20:15:45 nic postfix/smtp[6456]: 869372873C: 
to=<girlie-porno-site-html-useriduserid=osagesoftware.com at secondmore.com>, 
relay=none, delay=1973, status=deferred (connect to 
secondmore.com[203.22.104.12]: Connection refused)

Bogofilter recognizes url:203, url:203.22, url:203.22.104, 
url:203.22.104.12 as being spammy.  Together with the other stuff in the 
message, the net result is "Unsure".
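
To illustrate the token pattern listed above, here is a small Python sketch
that expands a dotted IP address into the same four "url:" tokens.  It only
mimics the output described in this message; it is not bogofilter's lexer,
and url_prefix_tokens is a hypothetical name.

```python
def url_prefix_tokens(ip):
    """Return one 'url:' token per leading run of octets in a dotted IPv4 address."""
    octets = ip.split(".")
    return ["url:" + ".".join(octets[:i]) for i in range(1, len(octets) + 1)]


# The address from the postfix log lines quoted above.
for token in url_prefix_tokens("203.22.104.12"):
    print(token)
```

Because the shorter prefixes cover whole address blocks, a log line that
merely mentions a host in a spammy block picks up several spammy tokens at
once.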

>I can conceive that a possible alternative would be to forward messages 
>with some sort of delimiter line telling bogofilter (or procmail) to 
>ignore everything up to that line, but it would have to be defined by the 
>user, otherwise spammers could put that line at the end of all spam :)

Remember that the "X-Bogosity" string _is_ already user settable via config 
file.
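
For what it's worth, the delimiter idea quoted above could be sketched
roughly like this in Python.  This is purely hypothetical: it is not a
bogofilter or procmail feature, and the function name and the marker string
are invented for illustration.

```python
def strip_forward_wrapper(message, delimiter):
    """Return the text after the first line equal to `delimiter`.

    If the delimiter line never appears, the message is returned unchanged.
    """
    lines = message.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.strip() == delimiter:
            return "".join(lines[i + 1:])
    return message


# Hypothetical forwarded message: wrapper headers, marker, then the original.
forwarded = (
    "From: someone\n"
    "Subject: Fwd: spam\n"
    "--my-private-marker--\n"
    "Received: from spam.example\n"
    "Buy now\n"
)
print(strip_forward_wrapper(forwarded, "--my-private-marker--"))
```

The marker has to be a per-user setting for the reason given in the quoted
text: a well-known marker could be pasted into spam to hide its contents.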