Bogofilter to filter porn site

Cedric Foll cedric.foll at ac-rouen.fr
Wed Mar 26 14:55:30 CET 2003


> You could skip your keyword search and let bogofilter do _its_ word
> search and spam/ham classification.  Bogofilter's job is to distinguish
> between good and bad messages.  With bogofilter we think of good and
> bad as being spam and non-spam.  There's no reason why bogofilter can't
> use web pages (rather than email) to distinguish porn from non-porn
> (rather than spam/non-spam).

I run a keyword search first so as not to waste CPU.
So, among the 16,000 sites visited each day, bogofilter only has to
analyse the ones that match a regexp.
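
A minimal sketch of such a keyword pre-filter, in Python. The keyword
list and the function names are hypothetical; the real regexp is not
given in this message.

```python
import re

# Hypothetical keyword list: the actual regexp is not given in the message.
KEYWORDS = re.compile(r"\b(?:xxx|porn|adult)\b", re.IGNORECASE)


def needs_classification(page_text: str) -> bool:
    """Return True if the page matches the cheap keyword regexp."""
    return KEYWORDS.search(page_text) is not None


def prefilter(pages):
    """Yield only the (url, text) pairs worth sending to the slower classifier."""
    for url, text in pages:
        if needs_classification(text):
            yield url, text
```

Only the pages yielded by prefilter() would then be handed to
bogofilter, so the expensive step runs on a small fraction of the
daily log.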


> Using the record of sites visited, retrieve the page (using wget or
> lynx or other such program) and feed it to bogofilter.  bogofilter will
> do its normal parsing, word lookup, and spamicity calculation.  For
> pages that show up as "spam" (which means "porn" in this case), you can
> add the site to the black list.
>
> I think you should find it pretty easy to train bogofilter to be a
> porn/non-porn classifier.
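
The workflow quoted above could be sketched as follows, in Python.
This is only an illustration under stated assumptions: it relies on
bogofilter reading its input on stdin and reporting the verdict through
its exit status (0 = spam, 1 = non-spam, 2 = unsure, as described in
recent bogofilter man pages; older releases may differ). The function
names and the in-memory blacklist are hypothetical.

```python
import subprocess
import urllib.request

# Exit statuses as described in recent bogofilter man pages (assumption
# for the version discussed in this thread).
VERDICTS = {0: "spam", 1: "ham", 2: "unsure"}


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw page body (stands in for wget or lynx)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def classify(page: bytes, cmd=("bogofilter",)) -> str:
    """Pipe the page to the classifier on stdin and map its exit status."""
    proc = subprocess.run(list(cmd), input=page, stdout=subprocess.DEVNULL)
    if proc.returncode not in VERDICTS:
        raise RuntimeError(f"classifier failed with status {proc.returncode}")
    return VERDICTS[proc.returncode]


def update_blacklist(urls, blacklist: set, fetch=fetch, cmd=("bogofilter",)) -> set:
    """Add every URL classified as "spam" (here: porn) to the blacklist."""
    for url in urls:
        if classify(fetch(url), cmd) == "spam":
            blacklist.add(url)
    return blacklist
```

Training would be done separately, for example by feeding known porn
pages to "bogofilter -s" and known clean pages to "bogofilter -n".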

In fact, that already works very well.
But I'd like to know whether somebody has already tested this and knows
some tricks to improve the results when analysing web pages instead of
e-mails.
The two tasks are quite similar (a lot of e-mail spam is written in
HTML), but the results differ. On e-mail they are very good (95% of spam
filtered and 0% false positives); when classifying web pages as
porn/non-porn I don't get those results, although they are still good
(80% of porn detected, 5% false positives).
The explanation is perhaps that I live in France, so all the mail I
receive is in French (except for mailing lists), and so almost all
non-French, non-mailing-list mail is spam.
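
For reference, the two figures quoted above can be computed from a
hand-labelled sample as in the Python sketch below. It assumes that
"false positives" means the share of non-porn pages wrongly flagged;
the function name and the sample counts are hypothetical, chosen only
to reproduce the 80% / 5% figures.

```python
def rates(results):
    """Compute (detection rate, false-positive rate).

    results is an iterable of (is_porn, flagged) boolean pairs, where
    is_porn is the manual label and flagged is the classifier verdict.
    """
    tp = fn = fp = tn = 0
    for is_porn, flagged in results:
        if is_porn and flagged:
            tp += 1
        elif is_porn:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    detection = tp / (tp + fn) if tp + fn else 0.0
    false_positive = fp / (fp + tn) if fp + tn else 0.0
    return detection, false_positive


# Hypothetical sample: 100 porn pages (80 flagged), 100 clean pages (5 flagged).
sample = (
    [(True, True)] * 80 + [(True, False)] * 20
    + [(False, True)] * 5 + [(False, False)] * 95
)
detection, false_positive = rates(sample)
```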


Regards.




