token pairs [was: Algorithm limitations]

David Relson relson at osagesoftware.com
Tue Apr 13 14:06:34 CEST 2004


On 13 Apr 2004 07:37:01 -0400
Tom Anderson wrote:

> On Mon, 2004-04-12 at 00:07, michael at optusnet.com.au wrote:
> > I'm already doing word pairs. You might have seen the patch I posted
> > previously for a lossy token database. That was designed to support
> > exactly what you're talking about. Basically it would allow
> > bogofilter to generate a vast array of tokens and only keep the
> > ones that occur 'frequently'. 'frequently' here means "at a
> > frequency high enough that a second instance comes along before the
> > first has been discarded from the database". :)
> 
> Would it be possible to consider this patch for inclusion in the
> stable bogofilter branch, turned on via a switch?
> 
> Tom
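
For readers who have not seen that patch, the "lossy" idea quoted above can be sketched roughly as follows. This is only an illustration of the description, not the patch's code: the class name, the fixed capacity, and the least-recently-seen eviction rule are all assumptions made for the example.

```python
from collections import OrderedDict


class LossyTokenStore:
    """Hypothetical bounded token counter (not the actual patch).

    A new token enters with a count of 1. When the store is over capacity,
    the least recently seen token is discarded, so a token keeps its count
    only if it shows up again before it is pushed out.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = OrderedDict()

    def add(self, token):
        if token in self.counts:
            self.counts[token] += 1
            self.counts.move_to_end(token)  # a repeat sighting refreshes the token
        else:
            self.counts[token] = 1
            if len(self.counts) > self.capacity:
                self.counts.popitem(last=False)  # discard the least recently seen token

    def count(self, token):
        return self.counts.get(token, 0)


store = LossyTokenStore(capacity=3)
for tok in ["a:b", "c:d", "a:b", "e:f", "g:h"]:
    store.add(tok)
print(store.count("a:b"), store.count("c:d"))
```

Here "a:b" is seen twice before it can be discarded and so survives, while "c:d" is seen once and is dropped when the store fills up.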

Tom,

I'm not willing to include word pairs until after the 1.0 release, but
am willing to let users experiment with the technique.  Attached is a
patch from a couple of months ago, updated to work with 0.17.5.
Below is sample output using it, first without and then with -P:

[relson at osage src]$ echo this is a test of word pairs | bogofilter -C -H -vvv
X-Bogosity: No, tests=bogofilter, spamicity=0.100088, version=0.17.5.cvs
                                     n    pgood     pbad      fw     U
"test"                            5637  0.069284  0.007706  0.100088 +
"pairs"                            247  0.002963  0.000426  0.125785 -
"word"                            4374  0.040844  0.021773  0.347716 -
"this"                           71436  0.469233  0.597469  0.560108 -
N_P_Q_S_s_x_md                       1  0.899912  0.100088  0.100088
                                        0.017800  0.520000  0.375000

[relson at osage src]$ echo this is a test of word pairs | bogofilter -C -H -vvv -P
X-Bogosity: No, tests=bogofilter, spamicity=0.100088, version=0.17.5.cvs
                                     n    pgood     pbad      fw     U
"test"                            5637  0.069284  0.007706  0.100088 +
"pairs"                            247  0.002963  0.000426  0.125785 -
"word"                            4374  0.040844  0.021773  0.347716 -
"test:word"                          0  0.000000  0.000000  0.520000 -
"this:test"                          0  0.000000  0.000000  0.520000 -
"word:pairs"                         0  0.000000  0.000000  0.520000 -
"this"                           71436  0.469233  0.597469  0.560108 -
N_P_Q_S_s_x_md                       1  0.899912  0.100088  0.100088
                                        0.017800  0.520000  0.375000
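
The pair tokens in the second listing ("this:test", "test:word", "word:pairs") can be reproduced with a short sketch. It is written in Python for illustration and is not taken from the attached patch. The function names and the minimum word length of 3 are assumptions, the latter inferred only from the fact that "is", "a" and "of" do not appear in the output above.

```python
import re

# Assumption: words shorter than this are dropped before pairing, which is
# consistent with "is", "a" and "of" being absent from the sample output.
MIN_TOKEN_LEN = 3


def tokenize(text):
    """Split text into lowercase words and drop the short ones."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if len(w) >= MIN_TOKEN_LEN]


def with_pairs(tokens, sep=":"):
    """Return the single tokens followed by each adjacent pair joined by sep."""
    pairs = [a + sep + b for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs


tokens = tokenize("this is a test of word pairs")
print(with_pairs(tokens))
```

Note that in this sketch pairs are formed from the words that survive tokenizing, so "this" and "test" become neighbours even though "is a" sat between them in the input.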
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.0413.token.pairs.txt
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20040413/b91ca9f4/attachment.txt>