[PATCH] combined wordlist a.k.a. single list

Jeremy Blosser jblosser-bogofilter at firinn.org
Wed Jun 4 02:58:49 CEST 2003


On Jun 03, Greg Louis [glouis at dynamicro.on.ca] wrote:
> On 20030603 (Tue) at 1334:35 -0500, Jeremy Blosser wrote:
> > On May 31, David Relson [relson at osagesoftware.com] wrote:
> 
> > > This patch converts bogofilter so that it stores all the data in a
> > > single file, wordlist.db, with records containing token, spam count,
> > > nonspam count and (optionally) the timestamp.
> > 
> > From an administrative perspective I *much* prefer separate wordlists.  I'm
> > sure the db utils would probably make it possible to split them up and
> > manipulate them separately when required
> 
> Could you expand on this a little?  When would that be required?
> ...
> It's not obvious to me at all why keeping one copy of the token with
> the two counts is more complex to manage.  I guess that's because I
> don't understand why one would need to manipulate spam and nonspam
> tokens separately.

I don't need to manipulate them separately at the token level, but I do
want to manipulate them separately at the list level.

Our goodlist is a pretty important resource for us.  It took a lot of time
and effort to create the initial lists, and has taken even more time to
refine them with user feedback to something we can trust to filter all of
our mail, especially in a large heterogeneous environment like ours.  On my
personal accounts at home I keep all the spam and nonspam I receive so I
can do wordlist rebuilds as I need them, but it'd be foolish to try that
here due to the volume of mail we see, quite apart from the privacy concerns
about storing the nonspam.  We can't just recreate our existing goodlist from
mail we have stored somewhere for that purpose.  We need to keep several
levels of backups of the goodlist, because it'd be hard to replace if we
somehow lost it, and our ability to block only spam (and never good mail)
depends heavily on it.

Our spamlist is less important to us.  Though we may not have the original
mails used to build it, it's really easy for us to get a new one if we need
it... we see at least 30,000 spams a day.  And we have a few ways of
catching those if we need to do it outside of bogofilter.  We might lose
the granularity that bogofilter provides (which is the reason bf is what we
want filtering in production), but we wouldn't have to start anywhere near
ground zero.  (Note, though, that those "other" means for easily building a
basic spamlist aren't valid for the goodlist... it's easy to catch a lot of
obvious unambiguous spams, it's not very easy to catch a lot of obvious
unambiguous good mails.)

Also, our spamlist is more generic and not really company-sensitive
information; we could use it to jump-start another corpus if we needed
to, or possibly to contribute to any broad internet spam
collection efforts.  Our goodlist is unique to us and our current
implementation and has various security concerns associated with it, even
in wordlist format.

These distinctions may sound merely academic, but this is how we find ourselves looking at
our lists, and it's helpful to be able to deal with them as separate
entities for general system administration, if not for bogofilter
administration itself.
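
To make the list-level distinction concrete, here is a minimal Python sketch
of splitting a combined list into a spamlist and a goodlist and joining them
again.  It assumes a combined record is simply (token, spam count, nonspam
count), as described in the quoted patch summary; the function names and the
in-memory dict representation are hypothetical and are not bogofilter's
actual on-disk format or API.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical combined record: (token, spam_count, good_count).
# This mirrors the description of the patch, not bogofilter's real format.
Record = Tuple[str, int, int]


def split_wordlist(records: Iterable[Record]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split combined records into separate spam and good token counts."""
    spamlist: Dict[str, int] = {}
    goodlist: Dict[str, int] = {}
    for token, spam, good in records:
        # Only keep a token in a list where it has actually been seen.
        if spam > 0:
            spamlist[token] = spam
        if good > 0:
            goodlist[token] = good
    return spamlist, goodlist


def combine_wordlists(spamlist: Dict[str, int], goodlist: Dict[str, int]) -> List[Record]:
    """Join two separate lists back into combined records, sorted by token."""
    tokens = sorted(set(spamlist) | set(goodlist))
    return [(t, spamlist.get(t, 0), goodlist.get(t, 0)) for t in tokens]
```

With something like this, the goodlist half could be backed up and protected
on its own schedule, and a lost spamlist could be replaced by calling
combine_wordlists() with a freshly built spamlist and the existing goodlist.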




More information about the bogofilter-dev mailing list