Maintaining a snappy bogofilter

David Relson relson at osagesoftware.com
Fri Apr 11 20:46:00 CEST 2003


At 02:12 PM 4/11/03, printer at moveupdate.com wrote:

>We were going along very well until the last couple of days.
>Now our false positives are way up due to the emails being sent
>with HTML formatting.
>
>Is there a way to tell bogofilter to ignore all the HTML tokens,
>but just the tokens and not anything in between them? ie:
>process "<td>this is it</td>" as "this is it" , "<! word here
>-> as "word here" ?

At present, HTML tags such as <td> are discarded, and the text on either
side is split into separate tokens.  A construct like abc<td>def becomes
"abc" and "def".

If you have an HTML comment, it too is discarded.  However, the text on
either side is joined together, so "beg<! word here>in" is treated the
same as "begin".

A future release will have options to allow the innards of tags and
comments to be processed as tokens.  Also, the parser will be smarter and
will know which tags represent textual breaks and which don't.  So
"com<br>ment" will become "com" and "ment", while "com<b>ment" will become
"comment".

David





