Training on outgoing e-mails?

Tony L. Svanstrom tony at moon.pp.se
Sat Oct 7 11:28:47 CEST 2006


 This is something that I've thought about for years now, but so far I've never
gotten around to actually trying it; I bet at least one of you has some
interesting data about it to share...

 Have you tried automatically training all your outgoing e-mails as ham; and if
so, what have the results been, esp. when using word-combinations as tokens?


 The main reason I haven't tried that is that I'm already collecting some
headers from the outgoing e-mails, which I then use to automatically train some
incoming e-mails as ham (I mainly look at From:, In-Reply-To:, References:); so
I'm storing e-mail addresses and message-ids.
 This also allows me to use the faked [randomstring]@[mydomains]-message-ids to
send some spam directly to spam-training without passing go.
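
 A minimal sketch of that header bookkeeping, in Python rather than anything
bogofilter itself provides. It assumes the addresses and message-ids come from
the To:, Cc: and Message-ID: headers of outgoing mail and are then matched
against From:, In-Reply-To: and References: on incoming mail. The names
MY_DOMAINS, record_outgoing and classify_incoming are made up for the example,
and the in-memory sets stand in for whatever storage is really used.

```python
from email import message_from_string
from email.utils import getaddresses

MY_DOMAINS = {"example.org"}   # hypothetical: domains my own message-ids use

known_addresses = set()        # people I have written to
known_message_ids = set()      # message-ids of mail I actually sent


def _ids(msg, *headers):
    """Collect the whitespace-separated message-ids found in the given headers."""
    ids = set()
    for header in headers:
        for value in msg.get_all(header, []):
            ids.update(value.split())
    return ids


def record_outgoing(raw):
    """Remember the recipients and the Message-ID of an outgoing e-mail."""
    msg = message_from_string(raw)
    recipients = msg.get_all("To", []) + msg.get_all("Cc", [])
    for _name, addr in getaddresses(recipients):
        if addr:
            known_addresses.add(addr.lower())
    known_message_ids.update(_ids(msg, "Message-ID"))


def classify_incoming(raw):
    """Return 'ham', 'spam' or 'unknown' based on headers alone."""
    msg = message_from_string(raw)
    refs = _ids(msg, "In-Reply-To", "References")
    if refs & known_message_ids:
        return "ham"           # a reply to something I really sent
    for ref in refs:
        domain = ref.rstrip(">").rpartition("@")[2].lower()
        if domain in MY_DOMAINS:
            return "spam"      # cites a message-id at my domain that I never issued
    senders = [a.lower() for _n, a in getaddresses(msg.get_all("From", []))]
    if any(a in known_addresses for a in senders):
        return "ham"           # sender is someone I have written to
    return "unknown"           # leave it to the normal filter
```

 The 'ham' and 'spam' results would then be fed to the usual training step,
and everything marked 'unknown' goes through normal classification.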

 The day spammers start using fresh e-mail addresses (and message-ids) from
recently sent e-mails I'm seriously screwed though, so it'd be nice to know
whether training on outgoing e-mails would then have been a good idea.


 Oh, and how about this idea... You train on single words from incoming
e-mails, but you also train on word-pairs/combinations from outgoing ones;
only allowing ham-combinations in your database.
 This shouldn't hit on too many random strings, nor on texts from books not
written in the same style as the way you communicate with people; but it would
of course hit on the word-combinations used in e-mails quoting something you've
written; and it would in a crude way result in bogofilter learning to recognize
the style of your friends and coworkers (and foes on mailinglists, of course).
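
 To make the idea concrete, here is a rough Python sketch. It is not how
bogofilter tokenizes or scores anything; the tokenizer regex, the function
names and the "fraction of known pairs" measure are all assumptions made up
for illustration. Single words are counted per class from incoming mail,
while adjacent word-pairs are only ever learned as ham, from outgoing mail.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z0-9']+")   # crude tokenizer, an assumption for this sketch


def words(text):
    """Lower-case the text and split it into simple word tokens."""
    return WORD.findall(text.lower())


def pairs(text):
    """Return adjacent word-pairs, e.g. ('use', 'procmail')."""
    w = words(text)
    return list(zip(w, w[1:]))


single_counts = {"ham": Counter(), "spam": Counter()}  # trained on incoming mail
ham_pairs = Counter()                                  # trained on outgoing mail only


def train_incoming(text, label):
    """Incoming mail contributes single words only, as 'ham' or 'spam'."""
    single_counts[label].update(words(text))


def train_outgoing(text):
    """Outgoing mail contributes word-pairs, and only ever as ham."""
    ham_pairs.update(pairs(text))


def known_pair_fraction(text):
    """Share of the word-pairs in a message already seen in outgoing mail."""
    p = pairs(text)
    if not p:
        return 0.0
    return sum(1 for x in p if x in ham_pairs) / len(p)
```

 After train_outgoing("I use procmail to sort my mail"), a reply quoting that
sentence gets a fraction of 1.0, while random words sprinkled around
'procmail' get 0.0, even though the single word 'procmail' is known.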

 Such a solution would have stopped the spam I got in my inbox a few days ago
simply because there were random strings about procmail in it; but it might on
the other hand result in words that are currently pure ham on their own (like
maybe 'procmail') being reclassified as a lot spammier... Not a problem at all
on a mailing list like this one, as the headers would be quite hammy and the
general style of writing would slowly creep into your database, but if a lurker
on this list were to send an e-mail about procmail to me off-list, then his
maybe unknown style of writing could push his e-mail in the spam direction.
 "Unknown style" could in this case be if he writes to me about procmail in my
native tongue, a language which I never use when writing about procmail.

 Thoughts?


	/Tony
-- 
        /\___/\                                          /\___/\
        \_@ @_/                                          \_@ @_/
   .--oOO-(_)-OOo--------------------------------------oOO-(_)-OOo--.
   |  perl -e'print$_{$_} for sort%_=`lynx -dump svanstrom.org/t`'  |
   `---ôôô---ôôô----------------------------------------ôôô---ôôô---´
       \O/   \O/        ©1998-2006 svanstrom.org        \O/   \O/



