[PATCH] Better tagging.

michael at optusnet.com.au michael at optusnet.com.au
Sun Sep 14 07:06:44 CEST 2003


David Relson <relson at osagesoftware.com> writes:
[...]
> > No, haven't used bogotune. My working set is about 200k messages,
> > and bogotune looked like it was going to take weeks to finish. :)
> > 
> > I'm using pure defaults (mindev 0.1).
> > 
> > Michael.
[...] 
> With your 200k messages, bogotune could conceivably take weeks.  It
> might be useful to break the 200k into 10 or 20 equal groups and then
> pick several groups and run bogotune.  That would reduce the compute
> time a whole lot and it would be interesting to see if bogotune found
> the tested groups to be comparable, i.e. produced similar results.

That would be interesting...
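
A rough sketch of that splitting idea, in case it helps. This is only
an illustration: the function names, the fixed seed, and the use of
integer message ids are all assumptions, and feeding each picked group
to bogotune is left out.

```python
import random


def split_into_groups(messages, n_groups, seed=0):
    """Shuffle messages and deal them into n_groups nearly equal groups."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    # Striding deals items round-robin, so group sizes differ by at most one.
    return [shuffled[i::n_groups] for i in range(n_groups)]


def pick_groups(groups, n_pick, seed=0):
    """Pick n_pick distinct groups at random for separate tuning runs."""
    return random.Random(seed).sample(groups, n_pick)


# Hypothetical corpus: 200k message ids standing in for real messages.
message_ids = range(200_000)
groups = split_into_groups(message_ids, 20)
subset = pick_groups(groups, 3)
```

Shuffling before splitting keeps each group a random sample of the
whole corpus, so the parameters tuned on different groups can be
compared fairly.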

Note that I'm more interested in the relative improvement than the
absolute. I don't believe that bogotune will spit out parameters
that turn an improvement into a detriment.

My task is made harder because I'm working with the mail for
a large set of users, not just my mail. So the 'ham' corpus
is much less distinguished than it would commonly be.

PS: Interestingly, using word-pairs gives a HUGE improvement in
accuracy.  Without word-pairs as tokens, I get 43,705 false
negatives (in a 400k message corpus), and with word-pairs I get
27,596. Very nice.
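
To illustrate what word-pair tokens mean here, a minimal sketch. It is
not the code from the patch: splitting on whitespace, lowercasing, and
joining each pair with a space are assumptions made for the example.

```python
def tokens_with_pairs(text):
    """Return single-word tokens plus adjacent word-pair tokens."""
    # Assumption: naive whitespace tokenization, lowercased.
    words = text.lower().split()
    # Each pair token joins a word with the word that follows it.
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def relative_reduction(before, after):
    """Fraction by which a count dropped, e.g. false negatives."""
    return (before - after) / before


tokens = tokens_with_pairs("Buy cheap pills now")
# Figures quoted above: 43,705 false negatives without pairs, 27,596 with.
reduction = relative_reduction(43705, 27596)
```

For the sample text, the pair tokens are "buy cheap", "cheap pills"
and "pills now", added on top of the four single words. On the figures
above, the drop in false negatives works out to roughly 37%.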


Michael.




More information about the bogofilter-dev mailing list