token pairs [was: Algorithm limitations]
David Relson
relson at osagesoftware.com
Tue Apr 13 14:06:34 CEST 2004
On 13 Apr 2004 07:37:01 -0400
Tom Anderson wrote:
> On Mon, 2004-04-12 at 00:07, michael at optusnet.com.au wrote:
> > I'm already doing word pairs. You might have seen the patch I posted
> > previously for a lossy token database. That was designed to support
> > exactly what you're talking about. Basically it would allow
> > bogofilter to generate a vast array of tokens and only keep the
> > ones that occur 'frequently'. 'frequently' here means "at a
> > frequency high enough that a second instance comes along before the
> > first has been discarded from the database". :)
>
> Would it be possible to consider this patch for inclusion in the
> stable bogofilter branch, turned on via a switch?
>
> Tom
Tom,
I'm not willing to include word pairs until after the 1.0 release, but
am willing to let users experiment with the technique. Attached is a
patch from a couple of months ago, updated to work with 0.17.5.
Below is a sample of the output using it:
[relson at osage src]$ echo this is a test of word pairs | bogofilter -C -H -vvv
X-Bogosity: No, tests=bogofilter, spamicity=0.100088, version=0.17.5.cvs
                      n     pgood      pbad        fw  U
"test"             5637  0.069284  0.007706  0.100088  +
"pairs"             247  0.002963  0.000426  0.125785  -
"word"             4374  0.040844  0.021773  0.347716  -
"this"            71436  0.469233  0.597469  0.560108  -
N_P_Q_S_s_x_md        1  0.899912  0.100088  0.100088
                         0.017800  0.520000  0.375000
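
For anyone reading the table above, here is a rough sketch of how the fw and U columns appear to be derived. It assumes that fw is Robinson's f(w), that the three values on the last line are s (0.0178), x (0.52) and the minimum deviation (0.375), and that U marks tokens whose fw is at least that far from 0.5. The function names are made up for this illustration and are not bogofilter's own code.

```python
# Illustrative sketch only; the names here are hypothetical, not bogofilter's.
ROBS = 0.0178     # s: weight given to the prior (assumed from the last output line)
ROBX = 0.52       # x: score assumed for a token never seen before
MIN_DEV = 0.375   # tokens closer to 0.5 than this are left out of the score


def fw(n, pgood, pbad, s=ROBS, x=ROBX):
    """Robinson's f(w) = (s*x + n*p) / (s + n), with p = pbad / (pgood + pbad)."""
    if n == 0:
        # No data for this token, so the estimate is just the prior x.
        return x
    p = pbad / (pgood + pbad)
    return (s * x + n * p) / (s + n)


def is_used(score, min_dev=MIN_DEV):
    """True when the token is far enough from 0.5 to count (the '+' in column U)."""
    return abs(score - 0.5) >= min_dev


if __name__ == "__main__":
    rows = [("test", 5637, 0.069284, 0.007706),
            ("pairs", 247, 0.002963, 0.000426),
            ("unseen", 0, 0.0, 0.0)]
    for word, n, pgood, pbad in rows:
        score = fw(n, pgood, pbad)
        print(f"{word:8s} {score:.6f} {'+' if is_used(score) else '-'}")
```

Under these assumptions only "test" clears the 0.375 deviation, which matches the single "+" in the table, and a token with n = 0 scores exactly x. Small differences in the last digits are expected because the printed pgood and pbad values are rounded.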
[relson at osage src]$ echo this is a test of word pairs | bogofilter -C -H -vvv -P
X-Bogosity: No, tests=bogofilter, spamicity=0.100088, version=0.17.5.cvs
                      n     pgood      pbad        fw  U
"test"             5637  0.069284  0.007706  0.100088  +
"pairs"             247  0.002963  0.000426  0.125785  -
"word"             4374  0.040844  0.021773  0.347716  -
"test:word"           0  0.000000  0.000000  0.520000  -
"this:test"           0  0.000000  0.000000  0.520000  -
"word:pairs"          0  0.000000  0.000000  0.520000  -
"this"            71436  0.469233  0.597469  0.560108  -
N_P_Q_S_s_x_md        1  0.899912  0.100088  0.100088
                         0.017800  0.520000  0.375000
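
To make the pair tokens in the second run concrete, here is a minimal sketch of how they could be produced. It is not the attached patch. It assumes that words shorter than three characters are dropped first (which would explain why "is", "a" and "of" do not appear) and that each surviving word is joined to its successor with a colon. The helper names are hypothetical.

```python
# Illustrative sketch only; not the code from the attached patch.
MIN_TOKEN_LEN = 3  # assumed minimum length, inferred from the sample output


def tokenize(text):
    """Split on whitespace and drop words shorter than MIN_TOKEN_LEN."""
    return [w for w in text.lower().split() if len(w) >= MIN_TOKEN_LEN]


def with_pairs(tokens):
    """Return the single tokens followed by 'left:right' pairs of neighbours."""
    pairs = [f"{a}:{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs


if __name__ == "__main__":
    for token in with_pairs(tokenize("this is a test of word pairs")):
        print(token)
```

For the sample message this yields the four single words plus "this:test", "test:word" and "word:pairs". None of the pairs has been seen in training yet (n = 0), so each one gets the default fw of 0.52 and does not change the result.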
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.0413.token.pairs.txt
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20040413/b91ca9f4/attachment.txt>