Robinson-Fisher use / viewing tokens
David Relson
relson at osagesoftware.com
Tue Jan 21 03:34:39 CET 2003
At 09:11 PM 1/20/03, Barry Gould wrote:
>Hi,
>I'm playing with RF in 0.10.0
>
>What is the command to see the tokens with RF?
You can see tokens with "-vvv" or "-R". Technically, bogofilter is writing
a table of data formatted so that the R numeric package can read it, verify
the calculations, and do some other manipulations. Greg Louis is our
algorithm and R guy and could tell you more.
>When I run the new bogofilter with -vv or -r -vv or -rf -vv, I don't
>get the tokens output, only numbers. (see below)
>-vvv seems to work, but is very hard to read as it wraps lines, plus I'd
>like to see the 15 tokens it actually chose.
"15 tokens" only applies to Graham. Robinson and Robinson-Fisher use _all_
the tokens (with the possibility of excluding neutral values with the
min_dev setting).
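
As a rough illustration of what min_dev does (a sketch only, not
bogofilter's actual code): a token is dropped when its probability lies
within min_dev of the neutral value 0.5. The function name, the token
names, and the strict "<" comparison are all assumptions made for this
example.

```python
def apply_min_dev(token_probs, min_dev=0.1):
    """Keep only tokens whose probability is at least min_dev away from 0.5.

    Hypothetical helper: whether bogofilter compares with "<" or "<="
    at the boundary is an assumption here.
    """
    return {tok: p for tok, p in token_probs.items()
            if abs(p - 0.5) >= min_dev}


# Made-up tokens and scores, for illustration only.
scores = {"viagra": 0.99, "meeting": 0.03, "neverseen": 0.415, "the": 0.52}
kept = apply_min_dev(scores)
print(sorted(kept))
```

With min_dev=0.1, the two near-neutral tokens (0.415 and 0.52) are left
out of the calculation and the two strongly indicative ones remain.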
"-r" is not needed when using "-f". I'll look into your observation that
"algorithm=robinson" is needed along with "algorithm=fisher". A single one
should be sufficient. If two are needed, there's a problem to be fixed.
"-v" gives a minimal level of detail (1 line). "-vv" generates the
histogram that you see (11 lines). "-vvv" generates the complete Rtable,
which is 75 characters wide.
>BTW, RF seems to do much better with my false positive (spamicity=0.596015
>instead of 0.990) with the same dbs.
Yes. You can use the same databases.
>X-Bogosity: Yes, tests=bogofilter, spamicity=0.596015, version=0.10.0
> int  cnt   prob     spamicity histogram
> 0.00  74   0.056230 0.018743  #############
> 0.10 109   0.150253 0.052099  ##################
> 0.20 129   0.255298 0.099964  ######################
> 0.30 164   0.351663 0.162955  ###########################
> 0.40 258   0.449303 0.253415  ###########################################
> 0.50 306   0.552217 0.344295  ##################################################
> 0.60 301   0.648827 0.418358  ##################################################
> 0.70 260   0.753508 0.476234  ###########################################
> 0.80 106   0.847086 0.500720  ##################
> 0.90 211   0.985128 0.596015  ###################################
That's a big message - 1500+ distinct tokens, with values all over the map!
'Tis useful to have "min_dev=0.1". This "takes out" the tokens which are
not already known to the wordlists, since the spamicity calculation gives
them a 0.415 score. The 0.1 setting pretty much clears out the 0.40 and
0.50 lines. For your message, it'd cut the count by 550 or so words. Try
the min_dev setting and send the results to the list. My guesstimate is
that the spamicity value won't change a whole lot.
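
To see why the 0.40 and 0.50 lines empty out, here is a small sketch that
counts tokens per 0.1-wide interval with and without min_dev. It is only
an approximation of the "cnt" column in the "-vv" output, not
bogofilter's own code; the function name and the sample probabilities
are invented, apart from 0.415, the score given to unknown tokens.

```python
from collections import Counter


def histogram(probs, min_dev=0.0):
    """Count tokens per 0.1-wide interval, skipping near-neutral tokens.

    Hypothetical helper: a token is skipped when it lies strictly within
    min_dev of 0.5 (the strict comparison is an assumption).
    """
    counts = Counter()
    for p in probs:
        if abs(p - 0.5) < min_dev:
            continue
        # Clamp so that p == 1.0 falls in the top (0.90) interval.
        counts[min(int(p * 10), 9)] += 1
    return [counts[i] for i in range(10)]


# Invented sample; three unknown tokens score 0.415.
probs = [0.02, 0.15, 0.415, 0.415, 0.415, 0.47, 0.55, 0.58, 0.72, 0.97]
before = histogram(probs)
after = histogram(probs, min_dev=0.1)

print(" int before after")
for i in range(10):
    print(f"{i / 10:.2f} {before[i]:6d} {after[i]:5d}")
```

In this sample the 0.40 and 0.50 intervals hold six tokens before
filtering and none afterwards, while the other intervals are unchanged.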
>At 08:40 PM 11/26/2002, David Relson wrote:
>
>>Graham: bogofilter -g -vv <message -- prints the 15 tokens and their info
>>Robinson: bogofilter -r -vv <message -- prints a histogram of the
>>tokens evaluated
>> bogofilter -r -vvv <message - prints _all_ the tokens
>> evaluated and their info
... still true ...