-vvv output [was: FAQ: Asian spam]

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Thu Mar 27 14:42:49 CET 2003


David Relson wrote:

>>So it really adds to the database, maybe it won't hurt more
>>to add all?
>>
>>Anyhow, the FAQ should say something on that issue.
> 
> Would you care to draft a few lines on Asian spam?  An idea would be to 
> include the two primary processing approaches.  1 - check for 
> "charset=gb2312|kc_5601_..." to discard it ; 2 - add to database.

OK, but someone really needs to verify my assumptions.

Q: How does bogofilter work with messages in languages not
based on European characters, such as Asian languages?

A: Good news first: bogofilter detects them quite
successfully. The bad news: this can be expensive in terms
of database size. You have basically two choices:

1) You are sure you will never receive a legitimate message
in those languages. Then you might as well discard such
messages right away, before bogofilter sees them. This keeps
the database smaller. To do this, you could use something
like the following procmail recipe before you call
bogofilter:
[recipe goes here]

2) You let bogofilter do the job.
2a) Just do it. Bogofilter will learn (as you teach it). The
database will contain many tokens that make no sense to you,
since the charset cannot be displayed (those languages
typically encode each character in several bytes, but you
see single bytes). It works as intended, though.
2b) You can set replace_nonascii_characters, which makes
all non-ASCII characters look like the same single character.
This keeps the database much smaller, but will most likely
not work all that well for languages that need characters
beyond ASCII (like many European languages).
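
For choice 1, the test boils down to looking at the charset
declared in the Content-Type header of each MIME part. The
following is only a Python sketch of that test, not a procmail
recipe and not anything bogofilter ships. The function name and
the charset list are assumptions made up for illustration, and
the list would need adapting to your own mail.

```python
import email

# Hypothetical list of charsets to reject; adapt it to your own mail.
UNWANTED_CHARSETS = {"gb2312", "big5", "euc-kr", "iso-2022-jp", "shift_jis"}


def declares_unwanted_charset(raw_message: str) -> bool:
    """Return True if any MIME part declares a charset from the list."""
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        # get_content_charset() returns the charset in lower case,
        # or None when the part declares no charset.
        if part.get_content_charset() in UNWANTED_CHARSETS:
            return True
    return False
```

Note that this only inspects Content-Type; a message that uses
such a charset solely in an encoded Subject line would pass.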

Future: Bogofilter will learn Unicode and hence be able to
understand all those languages and charsets.
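
To make choices 2a and 2b a bit more concrete, here is a small
Python sketch of what collapsing non-ASCII data does. It is an
illustration only, not bogofilter's code: the function name, the
"?" placeholder and the byte-wise replacement are my assumptions.

```python
def replace_nonascii(data: bytes, placeholder: bytes = b"?") -> bytes:
    """Replace every byte with the high bit set by one placeholder byte."""
    return bytes(placeholder[0] if b >= 0x80 else b for b in data)


# Two Chinese characters in GB2312 take two bytes each, so a reader
# without that charset sees four unrelated single bytes (choice 2a).
asian = "\u4f60\u597d hello".encode("gb2312")

# A Latin-1 word with one accented letter (choice 2b loses the accent).
european = "caf\xe9".encode("latin-1")

print(len(asian), replace_nonascii(asian))
print(replace_nonascii(european))
```

After the replacement all such byte runs look alike, which is why
the database stays small, and also why accented words from
European languages lose information.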

> Since bogofilter needed a way to display detailed info, it was reasonable 
> to adopt the R table ("-R") output for "-vvv".  Like you, I don't find the 
> last two columns particularly useful.  They are intermediate results in the 
> spamicity calculation and do not have easily understood interpretations (at 
> least for a person like me).  Perhaps the thing to do is have "-R" generate 
> _all_ columns of output and to have "-vvv" leave out the last two numeric 
> columns.  What do you think?

Either that, or just make it selectable through options.

> By the way, any explanation of the table format would be Greg's, as he's 
> our "R" expert.

I liked your text. I guess I still have it at home.

pi





