Random probabilistic split mailbox in two (random_split_mbox.py)
Arcady Genkin
agenkin-lst-bogofilter at thpoon.com
Fri Nov 5 22:02:52 CET 2004
Greetings:
While training bogofilter, I needed a way to split each of two huge
mailboxes (one of spam, one of ham) into two heaps: one for the
full initial training, and the other for training on error. I
wanted to split randomly, specifying the proportion of messages that
end up in each heap. Since I couldn't find a better way of
doing it, I wrote a rather trivial Python script, random_split_mbox.py,
which can be run like this:
    random_split_mbox.py ~/mail/spam /tmp/spam1 /tmp/spam2 0.7
This randomly picks out (roughly) 70% of all messages from ~/mail/spam
and puts them into /tmp/spam1, putting the rest into /tmp/spam2.
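The script itself is not reproduced in this message. For readers who want to see the idea, here is a minimal sketch of how such a split could be done with the mailbox module from the Python standard library. It is not the code of random_split_mbox.py: the function name split_mbox and its parameters are made up for illustration, and it assumes each message is assigned independently with the given probability, which is one way to get the "roughly 70%" behaviour described above.

```python
import mailbox
import random


def split_mbox(src_path, first_path, second_path, fraction, rng=None):
    """Copy every message from src_path into one of two mbox files.

    Hypothetical helper, not the original script. Each message goes to
    first_path with probability `fraction`, otherwise to second_path,
    so the proportions are only approximate.
    Returns (number written to first, number written to second).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    # create=False: fail loudly if the source mailbox does not exist.
    src = mailbox.mbox(src_path, create=False)
    first = mailbox.mbox(first_path)    # created if missing
    second = mailbox.mbox(second_path)  # created if missing
    n_first = n_second = 0
    try:
        for message in src:
            # rng.random() is uniform on [0, 1), so this branch is
            # taken with probability `fraction`.
            if rng.random() < fraction:
                first.add(message)
                n_first += 1
            else:
                second.add(message)
                n_second += 1
        # Write the pending additions to disk.
        first.flush()
        second.flush()
    finally:
        src.close()
        first.close()
        second.close()
    return n_first, n_second
```

Calling split_mbox("/path/to/spam", "/tmp/spam1", "/tmp/spam2", 0.7) would correspond to the command line above. Note that in this sketch, messages are appended if the output files already exist, and the source mailbox is left unmodified.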
I think this script may be useful to others. It can be
downloaded from here:
http://www.cdf.toronto.edu/~agenkin/downloads/random_split_mbox.py
Hope this helps,
--
Arcady Genkin : CDF Systems Administrator
http://www.cdf.toronto.edu/~agenkin/contact.html