word scores [was: Unregistering Mail / MD5]
David Relson
relson at osagesoftware.com
Fri Feb 14 01:05:35 CET 2003
At 05:28 PM 2/13/03, Barry Gould wrote:
>At 02:16 PM 2/13/2003, David Relson wrote:
>
>>Try running command "bogoutil -p -w $YOUR_DIRECTORY barrygould
>>pennysaverusa.net". Both those tokens are already being scored in every
>>message you receive. I'd bet that the spamicity is 0.50000 (or darn
>>close to it).
>>
>>I assert that there's little or no difference in the classification
>>of the original message and the forwarded message. Your address is
>>already in the header, so a second copy is not going to matter.
>
>Sorry, but it doesn't look that way:
># bogoutil -p -w .bogofilter barrygould pennysaverusa.net
>                     spam   good  Gra prob  Rob prob
>barrygould            251  13047  0.056194  0.057041
>pennysaverusa.net    1780  85259  0.060692  0.060819
>
>'barrygould' I can understand: most of our spam goes to other addresses.
>I'm not so sure about pennysaverusa.net, as all mail should say
>"Received: from (server) by mail.pennysaverusa.net" ...
What are the .MSG_COUNT values?
Here are some of the numbers from my domain. Not surprisingly, the counts
for osagesoftware.com are almost identical to the .MSG_COUNT values and the
probabilities are 0.50. It is interesting to see that the userids (which
have been anonymized) other than mine are mostly very spammy - which does
correspond to userids 1, 2, and 3 getting most of the spam.
[relson at osage src]$ bogoutil -p -w ../../spam-fixups/wordlists.h.16 david relson \
    userid1 userid2 userid3 userid4 osagesoftware osagesoftware.com .MSG_COUNT
                    spam   good  Gra prob  Rob prob
david                 84   1414  0.113138  0.114387
relson               455   5709  0.146137  0.146403
userid1              551    294  0.800980  0.799256
userid2              800    387  0.816148  0.814889
userid3             1372    295  0.908987  0.907979
userid4               22    112  0.296675  0.301403
osagesoftware        268     15  0.974598  0.968388
osagesoftware.com   3070   6527  0.502502  0.502457
.MSG_COUNT          3086   6627  0.500000  0.499958
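The .MSG_COUNT values matter because the "Gra prob" column in my table appears to be a ratio of per-message frequencies: a token's spam count divided by the spam message count, compared against the same figure for good mail. Here is a minimal Python sketch under that assumption. The function name is made up for illustration, the 0.5 fallback for unseen tokens is my own choice, and this is not bogofilter's actual code. It does not attempt the "Rob prob" column.

```python
def graham_prob(spam_count, good_count, spam_msgs, good_msgs):
    """Ratio of a token's per-message spam frequency to its total frequency.

    Assumed formula (not taken from the bogofilter source):
        (spam_count / spam_msgs) / (spam_count / spam_msgs + good_count / good_msgs)
    """
    spam_freq = spam_count / spam_msgs
    good_freq = good_count / good_msgs
    total = spam_freq + good_freq
    if total == 0:
        return 0.5  # illustrative choice: a token never seen is treated as neutral
    return spam_freq / total


# .MSG_COUNT values from the listing above: 3086 spam, 6627 good messages.
SPAM_MSGS, GOOD_MSGS = 3086, 6627

# (token, spam count, good count) rows copied from the same listing.
rows = [
    ("david", 84, 1414),
    ("osagesoftware", 268, 15),
    (".MSG_COUNT", 3086, 6627),
]

for token, spam, good in rows:
    print(f"{token:<18} {graham_prob(spam, good, SPAM_MSGS, GOOD_MSGS):.6f}")
```

A token that appears in nearly every message, spam or good, has both frequencies close to 1 and so lands near 0.5. That is why I expected a domain name found in every Received: header to be neutral, and why the counts alone can't be interpreted without the message totals.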
>Regardless, I don't think it's a good idea.
>It would pollute the db with other headers, like MUA=Eudora would start
>appearing spammy, even though it shouldn't be, esp as all my friends and
>family and many co-workers use Eudora.
>It would also mess up the block-on-subnets data.
I don't think the effect would be very significant, but who can say for
sure? Not I.
On a related subject, each day my mail server scans /var/log/messages,
/var/log/syslog, etc. and sends me a report of anomalous events. The
reports are ham - by definition. Several times, though, bogofilter has been
unsure. When I look to see why, it's often because the report includes log
entries for bounced mail to spam sites, for example:
Feb 11 20:15:45 nic postfix/smtp[6456]: connect to
secondmore.com[203.22.104.12]: Connection refused (port 25)
Feb 11 20:15:45 nic postfix/smtp[6456]: 869372873C:
to=<girlie-porno-site-html-useriduserid=osagesoftware.com at secondmore.com>,
relay=none, delay=1973, status=deferred (connect to
secondmore.com[203.22.104.12]: Connection refused)
Bogofilter recognizes url:203, url:203.22, url:203.22.104,
url:203.22.104.12 as being spammy. Together with the other stuff in the
message, the net result is "Unsure".
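To show how one IP address turns into the four tokens listed above, here is a small Python sketch that builds the successive dotted prefixes. The function name and the "url:" default are assumptions for illustration only; it mimics the token shapes named above and is not bogofilter's lexer.

```python
def ip_prefix_tokens(ip, prefix="url:"):
    """Return one token per leading group of octets of a dotted-quad address."""
    octets = ip.split(".")
    return [prefix + ".".join(octets[:i]) for i in range(1, len(octets) + 1)]


for token in ip_prefix_tokens("203.22.104.12"):
    print(token)
```

Because each shorter prefix is a token of its own, a spammy /8 or /16 can push a message toward spam even if the full address has never been seen before. That is what happens to my log reports.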
>I can conceive that a possible alternative would be to forward messages
>with some sort of delimiter line telling bogofilter (or procmail) to
>ignore everything up to that line, but it would have to be defined by the
>user, otherwise spammers could put that line at the end of all spam :)
Remember that the "X-Bogosity" string _is_ already user-settable via the
config file.