Funny characters [was: Levenshtein distance]
relson at osagesoftware.com
Fri Feb 25 08:52:44 EST 2005
On Fri, 25 Feb 2005 10:31:10 +0200
Edvard Majakari wrote:
> "Lee Dowthwaite" <lee at dowthwaite.net> writes:
> > for spam: indeed, almost all such juxtapositions were the result of spam.
> > Another thing it was very good at spotting - again, with minimal DB usage -
> > was foreign content. On these grounds it may well have caught the "Jmaes"
> > example also.
> What about code? Wouldn't procmail recipes, perl code, sendmail
> configuration files etc. in e-mail seem like spam then?
I don't think you need to worry. Remember that for a token it has not
yet been trained on, bogofilter uses the robs and robx parameters to
assign a score. Once bogofilter has been trained with a given token as
ham (or spam), it uses that information instead.
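As a rough illustration of how robs and robx come into play, here is a small sketch of Robinson-style smoothing. It is not bogofilter's actual source: the function name is made up, the raw proportion is simplified to plain counts (no normalization by the number of trained messages), and the numeric values are only illustrative, so check your own configuration.

```python
def token_score(spam_count, ham_count, robs, robx):
    """Robinson-style smoothed spam score for a single token (sketch).

    robx: score assumed for a token that has never been seen.
    robs: weight given to that assumption versus the trained counts.
    """
    n = spam_count + ham_count      # how often the token was seen in training
    if n == 0:
        return robx                 # unknown token: fall back to robx
    p = spam_count / n              # simplified raw spam proportion
    return (robs * robx + n * p) / (robs + n)


# Illustrative values only (assumption, not necessarily your settings).
ROBS = 0.0178
ROBX = 0.52

unknown = token_score(0, 0, ROBS, ROBX)      # never seen: scored as robx
known_ham = token_score(0, 3, ROBS, ROBX)    # trained three times as ham
known_spam = token_score(5, 0, ROBS, ROBX)   # trained five times as spam
```

With a small robs, even a handful of training examples pulls the score away from robx and toward what the token's history says, which is why a new perl script or a new name stops being a problem after it has been registered once or twice.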
The first time bogofilter encounters a perl script, it would see a lot
of unknown tokens. Once that script has been used to train as non-spam,
bogofilter would recognize those tokens. The situation is no different
than the first time bogofilter sees your name (or my name). That first
time, it's an unknown. After training it's known. A similar name, for
example "Ravid Nelson" (for me) or "Edward" for you would be treated the
same way -- unknown, then known.
With funny characters, for example "sp@m" rather than "spam", the
unknown quickly becomes known. After all, bogofilter _does_ learn.
P.S. Edvard - you seem to be posting from a non-subscribed address.
Doing that requires moderator approval, which delays your posting.
Subscribing to the list is open to all and will remove that delay.