Training without ham.

Sun Sep 7 20:12:25 CEST 2003

On Sun, 7 Sep 2003 19:58:32 +0200 (CEST)
"Hr. Daniel Mikkelsen" <daniel at copyleft.no> wrote:

> On Sun, 7 Sep 2003, David Relson wrote:
> 
> > On Sun, 7 Sep 2003 19:05:52 +0200 (CEST)
> > "Hr. Daniel Mikkelsen" <daniel at copyleft.no> wrote:
> >
> > > site wide spam filtering. Since spam is generally the same for all
> > > accounts, while ham can differ widely (between nationalities for
> > > instance), is it viable to set up a bogofilter that only uses a
> > > spam corpus provided by some of the site administrators?
> 
> > BF compares the tokens to ham and spam lists and determines which
> > one matches better.  If you only train on spam, the comparison
> > becomes one of "known" words (which are all spam) and "unknown"
> > words.  As the ham/spam comparison is lost, the results can't be
> > good.
> 
> So a comparable statistical filter package with another kind of logic
> for the comparison/determination part (not learning, not scanning)
> would possibly do the trick?
> 
> Are there such packages out there?
> 
> (Downloading the bogofilter sources now to have a look.)
> 
> -- Daniel Mikkelsen, Copyleft Software AS

Daniel,

It would be worth looking at bogofilter's FAQ and reading some of the
articles it mentions.  That'll give you a better idea of the theoretical
underpinnings of bogofilter and help you understand what it does
(and how it does it).

"possibly" covers a lot of territory.  The bayesian analysis used calls
for both good and bad wordlists, so that the incoming message can be
scored between 0 (only matches good words) and 1 (only matches bad
words).  Without a good wordlist, nothing can pull the score toward 0.
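
To make that concrete, here is a rough Python sketch of the idea.  It is
a simplified naive-Bayes combination, not bogofilter's actual code or
algorithm; the function names, the 0.01/0.99 clamp and the 0.5 value for
unknown words are all assumptions chosen for illustration.

```python
from math import prod

def token_spamicity(token, spam_counts, ham_counts, n_spam, n_ham, unknown=0.5):
    """Share of a token's per-message frequency that comes from spam."""
    s = spam_counts.get(token, 0) / n_spam if n_spam else 0.0
    h = ham_counts.get(token, 0) / n_ham if n_ham else 0.0
    if s + h == 0:
        return unknown          # never seen in either list: no evidence
    return s / (s + h)

def score(tokens, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-token values into one score between 0 (ham) and 1 (spam)."""
    # Clamp so a single token can never force the score to exactly 0 or 1.
    ps = [min(max(token_spamicity(t, spam_counts, ham_counts, n_spam, n_ham),
                  0.01), 0.99)
          for t in tokens]
    spam_like = prod(ps)
    ham_like = prod(1 - p for p in ps)
    return spam_like / (spam_like + ham_like)
```

With an empty ham list (n_ham = 0), every word that has ever appeared
in a spam gets the maximum value and every other word gets 0.5, so an
ordinary word like "meeting" counts against a message as heavily as
"viagra" does.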

If you want to count spam characteristics (shouting, occurrences of
"viagra", etc.) and score the message on those alone (without reference
to non-spam), then bogofilter isn't the answer for you, though
SpamAssassin may be what you want.
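
For contrast, a characteristic-counting scorer might look roughly like
the Python sketch below.  The rule names, weights and threshold are
invented for illustration and are not SpamAssassin's real rules or
syntax; the point is only that no ham corpus is needed.

```python
import re

def uppercase_ratio(text):
    """Fraction of alphabetic characters that are upper case."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)

# (name, weight, test) triples; all hypothetical.
RULES = [
    ("SHOUTING", 1.5, lambda text: uppercase_ratio(text) > 0.5),
    ("MENTIONS_VIAGRA", 2.0,
     lambda text: re.search(r"viagra", text, re.IGNORECASE) is not None),
]
THRESHOLD = 3.0  # assumed cut-off for calling a message spam

def rule_score(text):
    """Return (total weight, names of the rules that fired)."""
    hits = [(name, weight) for name, weight, test in RULES if test(text)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]

def is_spam(text):
    return rule_score(text)[0] >= THRESHOLD
```

Each rule looks only at the message itself, so the administrators
would maintain rules and weights rather than a spam corpus.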

Hope this helps.

David