support for multiple wordlists

Chris Fortune cfortune at telus.net
Tue May 18 01:18:38 CEST 2004


From: "Tom Allison" <tallison at tacocat.net>
> I guess I was just thinking of going with lots of procmail glue to make
> this all happen.

I'm using Perl to "glue" together two wordlists, a global one and a personal one.  Here's the algorithm, in pseudo-code.

$avg_score = $global_score = classify email using global wordlist;
check word_count of user's personal wordlist.db;
if (personal word_count is too low, or too imbalanced between spam and non-spam) {
    # ignore the personal wordlist; $avg_score stays equal to $global_score
}
else {
    # word_count is reasonably high and balanced between spam and non-spam
    $personal_score = classify email using personal wordlist;
    $avg_score = ($global_score + $personal_score) / 2;
}
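
For anyone who wants to experiment with this outside of Perl, here is a minimal Python sketch of the same logic.  It is only an illustration: the threshold values, the WordlistStats class and the function names are all hypothetical (the pseudo-code above does not say what "too low" or "too imbalanced" means), and the two scores are assumed to be spamicity values between 0 and 1 that have already been obtained from the classifier.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; the original post gives no concrete values.
MIN_WORD_COUNT = 5000   # below this, the personal wordlist counts as "too low"
MAX_IMBALANCE = 4.0     # largest tolerated ratio between spam and non-spam counts


@dataclass
class WordlistStats:
    spam_count: int  # tokens learned from spam
    ham_count: int   # tokens learned from non-spam


def is_usable(stats: WordlistStats) -> bool:
    """Return True if the personal wordlist is large and balanced enough."""
    if stats.spam_count + stats.ham_count < MIN_WORD_COUNT:
        return False
    smaller = min(stats.spam_count, stats.ham_count)
    larger = max(stats.spam_count, stats.ham_count)
    if smaller == 0:
        return False  # one class is empty, so the list is fully imbalanced
    return larger / smaller <= MAX_IMBALANCE


def combined_score(global_score: float,
                   personal_score: Optional[float],
                   stats: WordlistStats) -> float:
    """Average the two scores, or fall back to the global score alone."""
    if personal_score is None or not is_usable(stats):
        return global_score
    return (global_score + personal_score) / 2
```

With these assumed thresholds, a mail scoring 0.99 globally and 0.01 personally comes out at about 0.50 when the personal wordlist is usable, and stays at 0.99 when it is not.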

It's a hack (in the tradition of Perl hacks: just make the darn thing work today), but surprisingly it is working smoothly so far.
The "crazy" (anomalous) classifications of some users are buffered by the "sanity" (normality) of the global wordlist, and are kept
separate.  Many users want a small set of high-bogosity emails (for example: MLM that they are involved in, specialty
sales brochures, a viagra users mailing list, their favorite porno-by-mail or casino mail, etc.).  These mails would otherwise be
classified as >99% spam, but the users' small personal wordlists pull the averaged score down to about 50%, so the desired sales
emails are quarantined as Unsure and the user then has a chance to add them explicitly to his white-list.  The converse is also
true (e.g. hammy emails that the user has classified as spam simply because he doesn't like the sender).

I allow each user to build his own wordlist via a web interface & quarantine, but only the admin can build the global wordlist
(using emails submitted by users, then reviewed).

I tried allowing the personal wordlist to override the global wordlist, but because some users press the wrong buttons it was
disastrous for them, with many false positives and false negatives.  I don't recommend it for naive users.  The picture reverses
once the user's personal wordlist is sufficiently large, but until then it is better to rely on the global wordlist and gradually
include the personal wordlist.

To this end I would like to refine the average score equation so that the transition from global to personal is based on the
word_count of the personal wordlist.
$avg_score = ($global_score + $personal_score) / (2 * $factor);
Does anyone have any ideas how to determine $factor?
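
One possible starting point, offered only as a sketch and not as an answer to the question: use a weighted mean whose personal weight ramps up linearly with word_count.  Note that this is a different form from the equation above.  A weighted mean always stays between the two input scores, whereas dividing the sum by a varying 2 * $factor can push the result above 1.  Every name and number below (personal_weight, min_count, full_count, max_weight) is a hypothetical placeholder that would need tuning against real mail.

```python
def personal_weight(word_count: int,
                    min_count: int = 1000,
                    full_count: int = 20000,
                    max_weight: float = 0.5) -> float:
    """Weight given to the personal score; all defaults are assumed values.

    Returns 0 up to min_count, rises linearly, and is capped at max_weight
    from full_count onwards.
    """
    if word_count <= min_count:
        return 0.0
    if word_count >= full_count:
        return max_weight
    return max_weight * (word_count - min_count) / (full_count - min_count)


def blended_score(global_score: float,
                  personal_score: float,
                  word_count: int) -> float:
    """Weighted mean of the two scores, shifting toward personal as it grows."""
    w = personal_weight(word_count)
    return (1.0 - w) * global_score + w * personal_score
```

With max_weight = 0.5 the ramp ends at the plain average used today; raising it toward 1.0 would let a large personal wordlist progressively take over from the global one.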

Thanks,
Chris
http://spameater.com/