message count files

David Relson relson at osagesoftware.com
Wed Dec 3 13:41:23 CET 2003


Greetings,

Attached is a copy of the script msg-count.sh, which will be an official
part of bogofilter's next release.  It converts standard email formats
to the privacy-protecting, speed-enhancing message count format.  It
takes a while to run because each message is processed by bogolexer (to
determine the tokens), sort (to remove duplicates and reorder the
tokens), and bogoutil (to find the ham and spam counts for each token).
The sort step effectively obscures the content of the message, since
tokens in canonical sorted order no longer preserve the original word
order.
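
To illustrate the three stages described above, here is a minimal
sketch that uses stand-ins only.  It does not call bogolexer or
bogoutil and is not the attached script: canonical_tokens,
lookup_counts, the toy wordlist, its counts, and the output layout are
all assumptions made up for this example.

```shell
#!/bin/sh
# Toy wordlist: token, spam count, ham count (made-up numbers).
wordlist=$(mktemp)
cat > "$wordlist" <<'EOF'
buy 40 2
cheap 35 1
now 20 15
EOF

# Stand-in for the bogolexer and sort stages: split on non-alphanumerics,
# lowercase, drop empty lines, then dedupe and put tokens in canonical order.
canonical_tokens() {
    tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | LC_ALL=C sort -u
}

# Stand-in for the bogoutil stage: print each token with its spam and ham
# counts from the wordlist file given as $1 (0 0 when the token is unknown).
lookup_counts() {
    awk 'NR == FNR { spam[$1] = $2; ham[$1] = $3; next }
         { printf "%s %d %d\n", $1, spam[$1], ham[$1] }' "$1" -
}

printf 'Buy cheap pills now, buy now!\n' | canonical_tokens | lookup_counts "$wordlist"
rm -f "$wordlist"
```

The output lists buy, cheap, now, and pills, each once and in sorted
order, so the original sentence can no longer be read back from it.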

Below are samples of how msg-count.sh can be used to process mbox files
and directories of individual messages.

Enjoy!

David


### using the 2 mbox files from bogofilter's regression test ...

[relson at osage src]$ ls -l tests/inputs/????.mbx
-rw-r--r-- 1 relson relson 194817 Sep  9 20:16 tests/inputs/good.mbx
-rw-r--r-- 1 relson relson 164814 Sep  9 20:16 tests/inputs/spam.mbx

### use a for loop to process each .mbx file
[relson at osage src]$ for f in tests/inputs/????.mbx ; do cat $f | msg-count.sh > `basename $f .mbx`.mc ; done

### and generate a .mc (msg-count) file
[relson at osage src]$ ls -l good.mc spam.mc
-rw-r--r--    1 relson   relson      75808 Dec  3 07:26 good.mc
-rw-r--r--    1 relson   relson      34807 Dec  3 07:26 spam.mc

### alternatively, with good.d and spam.d directories containing messages

[relson at osage src]$ ls good.d
msg.n.01.txt  msg.n.13.txt  msg.n.25.txt  msg.n.37.txt
msg.n.02.txt  msg.n.14.txt  msg.n.26.txt  msg.n.38.txt
msg.n.03.txt  msg.n.15.txt  msg.n.27.txt  msg.n.39.txt
msg.n.04.txt  msg.n.16.txt  msg.n.28.txt  msg.n.40.txt
msg.n.05.txt  msg.n.17.txt  msg.n.29.txt  msg.n.41.txt
msg.n.06.txt  msg.n.18.txt  msg.n.30.txt  msg.n.42.txt
msg.n.07.txt  msg.n.19.txt  msg.n.31.txt  msg.n.43.txt
msg.n.08.txt  msg.n.20.txt  msg.n.32.txt  msg.n.44.txt
msg.n.09.txt  msg.n.21.txt  msg.n.33.txt  msg.n.45.txt
msg.n.10.txt  msg.n.22.txt  msg.n.34.txt  msg.n.46.txt
msg.n.11.txt  msg.n.23.txt  msg.n.35.txt  msg.n.47.txt
msg.n.12.txt  msg.n.24.txt  msg.n.36.txt  msg.n.48.txt
[relson at osage src]$ ls spam.d
msg.s.01.txt  msg.s.07.txt  msg.s.13.txt  msg.s.19.txt
msg.s.02.txt  msg.s.08.txt  msg.s.14.txt  msg.s.20.txt
msg.s.03.txt  msg.s.09.txt  msg.s.15.txt  msg.s.21.txt
msg.s.04.txt  msg.s.10.txt  msg.s.16.txt
msg.s.05.txt  msg.s.11.txt  msg.s.17.txt
msg.s.06.txt  msg.s.12.txt  msg.s.18.txt

### run with the wordlist directory as the first parameter and the message directory as the second

[relson at osage src]$ for d in good.d spam.d ; do msg-count.sh /var/lib/bogofilter $d > $d.dir.mc ; done

[relson at osage src]$ ls -l ????.d.dir.mc
-rw-r--r-- 1 relson relson 195244 Dec  3 07:28 good.d.dir.mc
-rw-r--r-- 1 relson relson  85462 Dec  3 07:28 spam.d.dir.mc
-------------- next part --------------
A non-text attachment was scrubbed...
Name: msg-count.sh
Type: application/x-sh
Size: 2114 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20031203/26b2d84a/attachment.sh>

