Modularity

David Relson relson at osagesoftware.com
Mon Jan 13 22:43:03 CET 2003


Nick,

I'm with you :-)

At 04:10 PM 1/13/03, Nick Simicich wrote:
>At 11:53 PM 2003-01-12 -0200, Adriano Nagelschmidt Rodrigues wrote:
>
>>Why? I was thinking exactly about the "-u" switch when I listed 
>>"bogolearn" as
>>a possibility.
>
>99% of my invocations of bogofilter (ok, more than that) are with the -u 
>option.  About 99% of the remaining 1% are -S or -N invocations.  I pretty 
>much reclassify all of my misclassified mail.

I see all the mail that goes through my mail server and run bogofilter 
with "-u".  Using the Robinson-Fisher algorithm, I've not seen any false 
positives or false negatives in the last month, though there is a steady 
stream of "unsure" classifications.  I rename those to 
"unsure-spam.mmdd.hhmm.txt" or "unsure-good.mmdd.hhmm.txt" and let a cron 
job feed them to bogofilter.  My 64MB P133 handles the workload quite 
handily, with a usual load average around 0.03.
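
A minimal sketch of such a cron job, written in Python, might look like the 
following.  It is an illustration under stated assumptions, not the script 
actually in use here: the function names are hypothetical, it assumes the 
renamed files sit in one directory, that "-s" (register as spam) and "-n" 
(register as non-spam) are the right flags for messages that "-u" left 
unregistered, and that a file can be deleted once it has been fed.

```python
import subprocess
from pathlib import Path

# Assumed mapping from the filename prefix to a bogofilter registration flag:
# -s registers a message as spam, -n registers it as non-spam.
FLAGS = {"unsure-spam": "-s", "unsure-good": "-n"}


def flag_for(name):
    """Return the flag for a renamed file, or None if the name does not match."""
    if not name.endswith(".txt"):
        return None
    prefix = name.split(".", 1)[0]
    return FLAGS.get(prefix)


def feed(directory, command="bogofilter"):
    """Feed each renamed message to the command on stdin, one file at a time."""
    done = []
    for path in sorted(Path(directory).glob("unsure-*.txt")):
        flag = flag_for(path.name)
        if flag is None:
            continue  # e.g. a file not yet renamed to unsure-spam or unsure-good
        with path.open("rb") as message:
            subprocess.run([command, flag], stdin=message, check=True)
        # Assumption: a message is no longer needed once it has been registered.
        path.unlink()
        done.append((path.name, flag))
    return done
```

Calling feed() on the directory holding the renamed files from a small 
wrapper run by cron would be enough; check=True makes a failing run stop 
before the file is deleted.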


>>C'mon, modularity is the UNIX way. We like it :-)
>
>In my opinion, the main reason for the extreme modularity that is 
>traditional in Unix is limited segment size.  As I recall, early setups 
>(like PC/IX, the port of Unix to the PC/XT), had a 64k I segment size and 
>a 64k D segment size.  You simply could not run big complex programs in a 
>segment size like that. Nor was it convenient to compile them with the 
>machines of the time.  Whereas it is likely that I have a copy of PC/IX 
>around here somewhere, and I may even have an installed copy, it was 
>simply not that interesting - no support for LAN networking came with 
>it.  It could function as a UUCP node, and support multiple simultaneous 
>logins.  I do not believe that this was the only architecture that limited 
>segment size so strongly.

I thought it was a simple lack of memory.  Back in the '70s machines had 
kilobytes of RAM - perhaps several dozen KB, not hundreds of KB.  Did the 
early Unix machines even have segments?

>Another reason for modularity is to make things simpler, so that they are 
>more likely to be correct.  This program seems reasonably correct at its 
>current size.
>
>If you split the programs on me, you would exchange one module load for a 
>shell script and two module loads, and having all of the data move on a 
>pipe or something, rather than through the current memory transfer.  How 
>much inefficiency do you want to tolerate?

At the moment I view bogofilter as three programs - bogofilter, bogoutil, 
and bogolexer.  bogofilter and bogoutil share lots of code and implement 
very different capabilities that complement one another.  bogolexer is a 
simple program built around the lexer.  It mostly just parses the input and 
prints it out.  With the recent development of the MIME processing code, 
it's been very useful.

Of the three programs, bogolexer best fits the Unix convention of a single 
task, input from stdin, and output to stdout.  If we wanted to follow the 
Unix convention, we could take the lexer out of bogofilter and pipe 
bogolexer's output to bogofilter.


>Right now, it is still possible to run this program on light iron, I run 
>it on a P-90, but it pushes it.  Tripling the work (or more) that it takes 
>to do the job for nothing but purity is bogus if you ask me.
>
>If you want to have three man pages to make the arguments "pure", install 
>the program with two aliases, and have it act differently depending on the 
>alias that is called.  Then you can write one program with three man pages 
>and command interpretations. You can have your conceptual simplicity 
>without sacrificing efficiency.
>
>As someone pointed out, another important reason for modularity is if the 
>intermediate output is useful to, um, something in general.  A wc might be 
>useful to a program or to a human.  This is a case where, if the programs 
>were piped together, the output from program A would always be passed to 
>program B and would likely not even be an external interface.
>
>But please do not blindly do this split without doing some performance 
>studies, including some on small machines.  I have been wrong about 
>performance guesses in the past, but I do not think I am this time.
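
The alias idea quoted above (one installed program that behaves differently 
depending on the name it was called by) can be sketched roughly as below.  
This is Python for illustration only and says nothing about how bogofilter 
itself is built.  The mode names and the mode_for() helper are hypothetical, 
and "bogolearn" is only the name floated earlier in this thread, not an 
existing program.

```python
import os
import sys

# Hypothetical alias table: the names a single program might be linked under,
# and the behaviour each name selects.
MODES = {
    "bogofilter": "classify",
    "bogolearn": "register",
}


def mode_for(argv0, default="classify"):
    """Choose a behaviour from the name the program was invoked under."""
    return MODES.get(os.path.basename(argv0), default)


def main(argv):
    """Report the selected mode; a real program would branch on it here."""
    name = os.path.basename(argv[0])
    print(f"{name}: running in {mode_for(argv[0])} mode")
    return 0
```

With the script linked under both names, calling main(sys.argv) picks the 
mode from argv[0], so there is still only one program to load and each name 
can have its own man page.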

My judgement is that keeping the current trio of programs is the right 
thing to do.  I don't see a need for repartitioning the tasks.

David