train-on-error and mondo-tune project [was: bogus bogotuning]

David Relson relson at osagesoftware.com
Thu Jan 29 13:25:19 CET 2004


On Thu, 29 Jan 2004 08:48:40 +0100
Boris 'pi' Piwinger wrote:

> Greg Louis <glouis at dynamicro.on.ca> wrote:
> 
> >At one point there was actually an
> >option to do just what you wanted, but we took it out again because
> >it wasn't found helpful.  Now you come along and ask for it, we try
> >(persistently) to explain why it's a bad idea,
> 
> I asked many times (and never got an answer) why it would be
> bad for train-on-error. There the database will be
> significantly below the limit, even if you have tens of
> thousands of messages to test with.

My recollection is that bogotune had problems using your train-on-error
message set, i.e. the message set wasn't usable for the big test.

We've seen significant differences in bogotune results for different
message sets.  When tuning, bogotune needs to be able to tell whether
one set of parameters is better than another.  This means the messages
in the ham and spam sets need to produce a variety of scores.  Some
data sets are too clean, i.e. have too little scoring variation, for
bogotune's needs.

One cause of "too clean" is when the tuning messages are included in the
wordlist.  I don't recall if we identified other reasons why bogotune
couldn't use a message set.
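
To make "too little scoring variation" concrete, here is a minimal
sketch in Python.  It is not what bogotune does internally; the function
names and the 0.1/0.9 and 1% cutoffs are assumptions picked purely for
illustration.  It assumes each message has already been given a score
between 0 and 1, and it measures how many scores fall away from the two
extremes.

```python
def middle_fraction(scores, lo=0.1, hi=0.9):
    """Fraction of scores strictly between lo and hi (cutoffs are assumed)."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(1 for s in scores if lo < s < hi) / len(scores)


def looks_too_clean(ham_scores, spam_scores, min_middle=0.01):
    """Hypothetical check: True when almost every score sits at an extreme.

    min_middle is an illustrative threshold, not a bogotune setting.
    """
    combined = list(ham_scores) + list(spam_scores)
    return middle_fraction(combined) < min_middle


if __name__ == "__main__":
    # Every ham scores 0 and every spam scores 1: no variation at all.
    print(looks_too_clean([0.0] * 100, [1.0] * 100))
    # A set with some unsure messages in the middle of the range.
    print(looks_too_clean([0.0, 0.05, 0.3, 0.45], [0.6, 0.8, 0.99, 1.0]))
```

When nearly every ham scores near 0 and nearly every spam near 1, two
different parameter sets produce the same classifications, so there is
nothing to rank them by.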

> BTW: There was this call for message bases to review
> bogofilter's defaults. What was the result? I have never
> seen it. I also have never got an answer how that process
> went with my train-to-exhaustion database (only a
> preliminary test).

You're right.  A month or so ago, a call for message bases was put out
so that Greg could run bogotune with a message base of 200,000 or so
messages.  We called that the mondo-test because of the large number of
messages involved and the need for a big machine (fast CPU and lots of
RAM) for running it.

Greg hasn't run the big test yet.  Last I remember, he was waiting for
one more large message set.  I believe he got sidetracked with other
things in life.

Greg:  what's the status of the mondo-tune plan?

David

P.S.  My apologies if this message rambles.  There were a number of
concepts (related to the two questions) that I thought needed to be
commented on.



