Random lettered word examples
Tom Allison
tallison at tacocat.net
Tue Mar 16 12:47:35 CET 2004
Eric Wood wrote:
> These slipped through bogofilter:
>
> http://www.interplas.com/spam.txt
> http://www.interplas.com/spam2.txt
>
> And I get email like this very consistently, which is why I'm looking for a
> procmail rule that can maybe score words consisting of all consonants as
> spamish, or a string of impossible consecutive consonants.
>
> My buddy's MS Outlook spam filter (Spam Inspector) catches email like this
> virtually all the time when bogofilter lets it through. My guess is that it
> can spell check only the [a-zA-Z] words then it gets trapped if there are
> over 50% or so misspellings in certain areas.
>
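The consonant heuristic Eric describes could be sketched roughly as follows. This is only an illustration in Python, not a procmail rule and not anything bogofilter or Spam Inspector is known to do; the function names, the 4-letter minimum, and the 6-consonant run threshold are all assumptions picked for the example.

```python
import re

VOWELS = set("aeiouy")  # assumption: treat 'y' as a vowel so "rhythm" passes


def looks_like_gibberish(word, max_run=6):
    """Return True if a purely alphabetic word has no vowels at all,
    or contains a run of max_run or more consecutive consonants."""
    w = word.lower()
    if len(w) < 4 or not re.fullmatch(r"[a-z]+", w):
        return False  # too short or not a plain [a-zA-Z] word: do not judge
    if not any(c in VOWELS for c in w):
        return True
    return re.search(r"[^aeiouy]{%d,}" % max_run, w) is not None


def gibberish_ratio(text):
    """Fraction of [a-zA-Z] words (4+ letters) that look like gibberish."""
    words = [w for w in re.findall(r"[A-Za-z]+", text) if len(w) >= 4]
    if not words:
        return 0.0
    return sum(looks_like_gibberish(w) for w in words) / len(words)
```

A ratio like this is better used as one piece of scoring evidence than as a verdict, since a few legitimate words (for example "catchphrase") contain long consonant runs.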
I played with some email like this (of my own) and found that all the
gibberish at the end comes in at the robx value (0.415), and since that is
within the min_dev parameter (0.10) it is simply ignored when scoring the email.
I accidentally set my min_dev below 0.085, which brought all the
initial-robx material into consideration, and the mail scored 'ham' like
clockwork. It might be worthwhile to add a note in the
bogofilter.cf.example file to this effect: you want
min_dev > abs(0.5 - robx).
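The rule of thumb above can be checked with a few lines of arithmetic. This is a minimal sketch using the robx and min_dev values quoted in this message; the function name is made up for illustration, and it assumes (as described above) that tokens whose score lies within min_dev of 0.5 are left out of the calculation.

```python
def robx_tokens_ignored(robx, min_dev):
    """True if tokens scored at robx fall inside the min_dev band around 0.5,
    i.e. the condition min_dev > abs(0.5 - robx) holds."""
    return min_dev > abs(0.5 - robx)


robx = 0.415  # value reported in this message

# With min_dev = 0.10 the unknown-word gibberish is ignored.
print(robx_tokens_ignored(robx, 0.10))

# With a min_dev below 0.085 the gibberish starts to count toward the score.
print(robx_tokens_ignored(robx, 0.08))
```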
--------
tallison at janus:~> bogofilter -vv < spam.txt
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.649109, version=0.17.2
int   cnt  prob      spamicity  histogram
0.00    5  0.024668  0.007974   #####
0.10    0  0.000000  0.007974
0.20    0  0.000000  0.007974
0.30    4  0.366755  0.098442   ####
0.40    0  0.000000  0.098442
0.50    0  0.000000  0.098442
0.60    3  0.650643  0.208138   ###
0.70    1  0.751575  0.244896   #
0.80    1  0.809796  0.282436   #
0.90   13  0.994421  0.588911   #############
tallison at janus:~> bogofilter -vv < spam2.txt
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.503607, version=0.17.2
int   cnt  prob      spamicity  histogram
0.00    5  0.002805  0.000747   #####
0.10    0  0.000000  0.000747
0.20    1  0.200982  0.012152   #
0.30    8  0.334807  0.132136   ########
0.40    0  0.000000  0.132136
0.50    0  0.000000  0.132136
0.60    1  0.623312  0.159974   #
0.70    0  0.000000  0.159974
0.80    2  0.835902  0.244146   ##
0.90   11  0.977496  0.533678   ###########
tallison at janus:~>
--------
These would have slipped through mine too. But after one round of training
they are classified correctly. I do use a train-to-exhaustion process on my mailbox.