Robinson-Fisher use / viewing tokens

David Relson relson at osagesoftware.com
Tue Jan 21 03:34:39 CET 2003


At 09:11 PM 1/20/03, Barry Gould wrote:

>Hi,
>I'm playing with RF in 0.10.0
>
>What is the command to see the tokens with RF?

You can see tokens with "-vvv" or "-R".  Technically, bogofilter is writing 
a table of data formatted so that the R numeric package can read it, verify 
the calculations, and do some other manipulations.  Greg Louis is our 
algorithm and R guy and could tell you more.

>When I run the new bogofilter with -vv  or  -r -vv  or  -rf -vv, I don't 
>get the tokens output, only numbers. (see below)
>-vvv seems to work, but is very hard to read as it wraps lines, plus I'd 
>like to see the 15 tokens it actually chose.

"15 tokens" only applies to Graham.  Robinson and Robinson-Fisher use _all_ 
the tokens (with the possibility of excluding neutral values with the 
min_dev setting).
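
To make the difference concrete, here is a rough Python sketch of the two selection rules.  It is not bogofilter's code: the function names and the sample probabilities are made up for illustration, and whether the min_dev comparison is ">=" or ">" is an assumption.

```python
# Illustrative only: hypothetical helpers, not bogofilter's implementation.

def graham_select(probs, n=15):
    """Graham-style: keep the n tokens whose probability is furthest from 0.5."""
    ranked = sorted(probs.items(), key=lambda kv: abs(kv[1] - 0.5), reverse=True)
    return ranked[:n]

def robinson_select(probs, min_dev=0.0):
    """Robinson-style: keep every token at least min_dev away from 0.5.

    The ">=" comparison is an assumption made for this sketch.
    """
    return {tok: p for tok, p in probs.items() if abs(p - 0.5) >= min_dev}

# Made-up token probabilities; 0.415 stands in for a token the wordlists
# have not seen before.
probs = {"viagra": 0.99, "meeting": 0.02, "the": 0.50, "unseen": 0.415, "offer": 0.85}

print(graham_select(probs, n=2))
print(sorted(robinson_select(probs, min_dev=0.1)))
```

With min_dev left at 0.0 the second function keeps all five tokens; with 0.1 it drops "the" and "unseen", which sit close to the neutral 0.5.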

"-r" is not needed when using "-f".  I'll check into your observation that 
"algorithm=robinson" is needed along with "algorithm=fisher".  A single one 
should be sufficient.  If two are needed, there's a problem to be fixed.

"-v" gives a minimal level of detail (1 line).  "-vv" generates the 
histogram that you see (11 lines). "-vvv" generates the complete Rtable, 
which is 75 characters wide.

>BTW, RF seems to do much better with my false positive (spamicity=0.596015 
>instead of 0.990) with the same dbs.

Yes.  You can use the same databases.

>X-Bogosity: Yes, tests=bogofilter, spamicity=0.596015, version=0.10.0
>           int  cnt    prob   spamicity  histogram
>          0.00   74  0.056230  0.018743  #############
>          0.10  109  0.150253  0.052099  ##################
>          0.20  129  0.255298  0.099964  ######################
>          0.30  164  0.351663  0.162955  ###########################
>          0.40  258  0.449303  0.253415  ###########################################
>          0.50  306  0.552217  0.344295  ##################################################
>          0.60  301  0.648827  0.418358  ##################################################
>          0.70  260  0.753508  0.476234  ###########################################
>          0.80  106  0.847086  0.500720  ##################
>          0.90  211  0.985128  0.596015  ###################################

That's a big message - 1500+ distinct tokens, with values all over the map!

'Tis useful to have "min_dev=0.1".  This "takes out" the tokens which are 
not already known to the wordlists, since the spamicity calculation gives 
them a 0.415 score.  The 0.1 setting pretty much clears out the 0.40 
and 0.50 lines.  For your message, it'd cut the count by 550 or so 
words.  Try the min_dev setting and send the results to the list.  My 
guesstimate is that the spamicity value won't change a whole lot.
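
The "550 or so" figure can be checked against the histogram quoted above.  The sketch below assumes the "cnt" column is the number of tokens in each 0.10-wide bin and ignores tokens sitting exactly on a bin boundary; the function name is made up, and bins are held as integer tenths to sidestep floating-point comparisons.

```python
# (bin lower bound in tenths, token count), copied from the quoted -vv output.
histogram = [(0, 74), (1, 109), (2, 129), (3, 164), (4, 258),
             (5, 306), (6, 301), (7, 260), (8, 106), (9, 211)]

def removed_by_min_dev(histogram, min_dev_tenths):
    """Count tokens in bins lying entirely within min_dev of 0.5."""
    lo, hi = 5 - min_dev_tenths, 5 + min_dev_tenths
    return sum(cnt for start, cnt in histogram if lo <= start < hi)

total = sum(cnt for _, cnt in histogram)
removed = removed_by_min_dev(histogram, 1)   # min_dev=0.1
print(total, removed, total - removed)       # 1918 564 1354
```

So the 0.40 and 0.50 lines account for 564 of the 1918 tokens counted, leaving roughly 1350 to be scored.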


>At 08:40 PM 11/26/2002, David Relson wrote:
>
>>Graham:   bogofilter -g -vv <message -- prints the 15 tokens and their info
>>Robinson: bogofilter -r -vv <message -- prints a histogram of the tokens evaluated
>>          bogofilter -r -vvv <message -- prints _all_ the tokens evaluated and their info

... still true ...




