invalid html warfare

John McCain jmccain at layer3al.com
Wed May 28 20:45:04 CEST 2003


Okay, I've written up some perl scripts and done a few tests.

The five vowel rule has some problems.  Among the text it matches are words 
combined with punctuation marks, domain names, and other tokens that are not 
necessarily trash data.

I've also found that the five non-vowel regex hits something like 40-50% of 
my tokens (!).  Most of these hits really are unique trash data, such as 
message IDs and other crap.
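
To make the false-positive problem concrete, here is a rough sketch (in Python 
rather than my perl, and not the actual scripts).  It assumes the rule means 
"five consecutive characters that are not vowels"; both patterns and the 
sample tokens are made up for illustration.  A naive negated class also counts 
digits and punctuation, which is why dotted names and numeric IDs hit; a 
consonant-only class is much narrower.

```python
import re

# Assumption: the rule flags any token containing a run of five characters
# that are not vowels. Both patterns are illustrative, not the real ones.
NON_VOWEL_RUN = re.compile(r"[^aeiou]{5}", re.IGNORECASE)
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]{5}", re.IGNORECASE)


def hit_rate(tokens, pattern):
    """Return the fraction of tokens that contain a match for pattern."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if pattern.search(token))
    return hits / len(tokens)


# Made-up sample tokens, not taken from any real wordlist.
sample = ["hello", "mx.brtl.example", "xkqzjw", "20030528.1234", "don't"]

naive_rate = hit_rate(sample, NON_VOWEL_RUN)      # digits and dots count
consonant_rate = hit_rate(sample, CONSONANT_RUN)  # letters only
```

On this sample the naive pattern flags the dotted name and the numeric ID as 
well as the random string, while the consonant-only pattern flags just the 
random string.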

If the goal is to "take out the trash" and then see how the database 
behaves, perhaps we should take a different approach.  If the spammers are 
attempting to pollute our databases by flooding them with unique trash 
tokens, maybe we can use bogofilter to find the trash for us.

What if we removed each token that occurs only once in the database?  
(bogoutil -c 1  *.db??)  This would only be practical if done infrequently 
enough for "good data" to accumulate more than one hit, but often enough to 
prevent database pollution.
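
A minimal sketch of the pruning idea, working on a text dump of a wordlist 
rather than on the .db file itself.  It assumes each dump line is 
whitespace-separated with the token first and its count second; I have not 
checked that against the real dump layout.  The function name, the min_count 
parameter, and the choice to keep tokens starting with "." (such as 
.MSGCOUNT) are all my own assumptions.

```python
def drop_singletons(dump_lines, min_count=2):
    """Yield dump lines whose token count is at least min_count."""
    for line in dump_lines:
        fields = line.split()
        # Assumption: field 0 is the token and field 1 is its count.
        if len(fields) < 2 or not fields[1].isdigit():
            yield line  # unknown layout: keep the line rather than lose data
            continue
        token, count = fields[0], int(fields[1])
        # Assumption: tokens starting with "." are bookkeeping entries
        # (for example .MSGCOUNT) and are never pruned.
        if token.startswith(".") or count >= min_count:
            yield line


# Made-up dump lines for illustration only.
example = ["viagra 12", "x7qk2 1", ".MSGCOUNT 0", "hello 2"]
kept = list(drop_singletons(example))
```

The filtered lines would then have to be loaded back into a fresh wordlist, 
and how often to run this is exactly the open question above.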

On Wednesday 28 May 2003 10:46 am, John McCain wrote:
> I keep everything.  Especially evil stuff, for just this sort of purpose. 
> The problem, really, is age.
>
> The absolute biggest corpus I could put together is about 2000 good
> messages and something like 1000 spams, but that includes messages so old
> that I question their scientific usefulness.  I've seen spam change quite a
> bit over the time I've been collecting it.  My production system bogofilter
> corpus consists of about half that amount (the more current half).
>
> .MSGCOUNT yields 0 for both spam and good   -   ?
>
> spamlist wc yields  108906
> goodlist wc yields  137097
>
> On Wednesday 28 May 2003 10:34 am, David Relson wrote:
> > At 10:42 AM 5/28/03, John McCain wrote:
> > >I'd love to pitch in, but I don't have access to enough test data.
> >
> > John,
> >
> > Do you save all your email, including the evil stuff?  How much do you
> > have?  How big are your wordlists?
> >
> > Here're some "size" commands to run:
> >
> > bogoutil -w $BOGODIR .MSGCOUNT
> > bogoutil -d $BOGODIR/spamlist.db | wc -l
> > bogoutil -d $BOGODIR/goodlist.db | wc -l
> >
> > If you've got a few thousand messages, you could experiment and develop
> > some tests.  There are some people on the list with large corpora who
> > might well be interested in testing any scripts you might develop.
> >
> > David
> >
> >
> > ---------------------------------------------------------------------
> > FAQ: http://bogofilter.sourceforge.net/bogofilter-faq.html
> > To unsubscribe, e-mail: bogofilter-unsubscribe at aotto.com
> > For summary digest subscription: bogofilter-digest-subscribe at aotto.com
> > For more commands, e-mail: bogofilter-help at aotto.com
>




