bulk mode

bogofilter at bobvincent.org
Tue May 6 18:43:55 CEST 2003


On Tue, May 06, 2003 at 07:49:51AM -0400, David Relson wrote:
> At 04:28 AM 5/6/03, bogofilter at bobvincent.org wrote:
> 
> >On Mon, May 05, 2003 at 09:43:21PM -0400, David Relson wrote:
> >> No.  You can loop over files in a maildir and process them one at a
> >> time.  Bogofilter-0.12 has bulk mode switches ('-b' and '-B') which
> >> can be used to make maildir operations faster - assuming you can
> >> meaningfully process more than one file in a batch.
> >
> >Ah, but they only work for classifying messages, not for registering
> >them.
> >
> >Attempting to use the bulk mode switches when registering mail as
> >spam/nonspam results in a segfault.  (attempted file operation on null
> >file pointer).
> 
> bogofilter should _never_ segfault.  Is this a new discovery, or have you 
> known about it?  I'll take a look to see what's happening.

Does, dude.  Latest CVS checkout.  Took several hours with the
debugger to figure out why, though.  Been a long time.

From bogoconfig.c:

        case 'b':
          bulk_mode = B_STDIN;
          fpin = NULL;        /* Ensure that input file isn't stdin */
          break;

From main.c:

    if (run_type & (RUN_NORMAL | RUN_UPDATE)) {
      exitcode = classify(argc, argv,out);
    }
    else {
      register_messages(run_type);
      exitcode = 0;
    }

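When registering spam, run_type is 4.
When registering nonspam, run_type is 8.
Neither value matches the (RUN_NORMAL | RUN_UPDATE) test, so the else
branch runs and classify() is never called.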
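
To make the dispatch concrete, here is a rough sketch of that test. Only
the values 4 and 8 come from what I saw in the debugger; the values I give
RUN_NORMAL and RUN_UPDATE, and the names REG_SPAM, REG_GOOD and
takes_classify_path(), are assumptions for illustration and not copied from
the bogofilter headers.

```c
#include <stdbool.h>

/* Flag values: 4 and 8 are the run_type values observed when registering.
 * The values for RUN_NORMAL and RUN_UPDATE are ASSUMED for illustration. */
enum {
    RUN_NORMAL = 1,  /* assumed value */
    RUN_UPDATE = 2,  /* assumed value */
    REG_SPAM   = 4,  /* observed: registering spam (name is hypothetical) */
    REG_GOOD   = 8   /* observed: registering nonspam (name is hypothetical) */
};

/* Hypothetical helper: true when main() would take the classify() branch,
 * false when it would fall through to register_messages(). */
static bool takes_classify_path(int run_type)
{
    return (run_type & (RUN_NORMAL | RUN_UPDATE)) != 0;
}
```

Under those assumptions, takes_classify_path(4) and takes_classify_path(8)
are both false, which is why the registration path is the one that runs.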

Now follow through where main() calls register_messages() which calls
collect_words() which calls get_token() ...
...
which eventually calls xfgetsl() which does the following:

if (feof(s))
  return (EOF);

which segfaults because along the way, "s" is a reference to "fpin",
which is still NULL, and feof() on a null pointer is undefined.

So I patched it to read:

if (!s || feof(s))
  return (EOF);

but back in lexer.c,  the result gets assigned to the variable
"count", and we have:

if (count == -1) {
    if (ferror(fpin)

which segfaults again because ferror(0) is undefined.
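
Patching each call site one at a time is what I did, but the same guard
could be applied to both checks at once. A minimal sketch, assuming two
small wrappers (stream_at_eof() and stream_has_error() are names I made up,
they are not in the bogofilter source):

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical wrapper: treat a NULL stream as already at end of file,
 * so feof() is never called on a null pointer. */
static bool stream_at_eof(FILE *s)
{
    return s == NULL || feof(s) != 0;
}

/* Hypothetical wrapper: a NULL stream reports no error, so ferror()
 * is never called on a null pointer. */
static bool stream_has_error(FILE *s)
{
    return s != NULL && ferror(s) != 0;
}
```

This only stops the crash. With fpin still NULL, nothing is ever read, so
no messages get registered either.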

Now back in main.c, we have this code:

        case B_STDIN:           /* '-b' - streaming (stdin) mode */
        {
            size_t len;
            filename = buff;
            if (fgets(buff, sizeof(buff), stdin) == 0) {
              done = true;
              continue;
            }
            len = strlen(filename);
            if (len > 0 && filename[len-1] == '\n')
                filename[len-1] = '\0';
            break;
        }
        }

But it's in classify(), which never gets called when registering.
There is no equivalent code in the path that starts with
register_messages().
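
For what it's worth, the missing piece on the registration side would look
roughly like the B_STDIN case above: read one filename per line, open the
file, and hand the open stream to whatever registers a single message. This
is only a sketch under my own assumptions. register_from_list(), the
register_fn callback type and the count_bytes() stand-in are all
hypothetical, and none of this is the real register_messages() code.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback standing in for "register one message". */
typedef void (*register_fn)(FILE *msg, void *ctx);

/* Stand-in callback for demonstration: adds the message size in bytes
 * to the long that ctx points to. */
static void count_bytes(FILE *msg, void *ctx)
{
    long *total = ctx;
    while (fgetc(msg) != EOF)
        (*total)++;
}

/* Reads one filename per line from `list`, opens each file and passes it
 * to `reg`. Files that cannot be opened are reported and skipped.
 * Returns the number of files actually processed. */
static int register_from_list(FILE *list, register_fn reg, void *ctx)
{
    char buff[4096];
    int processed = 0;

    while (fgets(buff, sizeof(buff), list) != NULL) {
        size_t len = strlen(buff);
        if (len > 0 && buff[len - 1] == '\n')
            buff[len - 1] = '\0';      /* strip the trailing newline */
        if (buff[0] == '\0')
            continue;                  /* ignore empty lines */

        FILE *fp = fopen(buff, "r");
        if (fp == NULL) {
            perror(buff);
            continue;
        }
        reg(fp, ctx);                  /* never called with a NULL stream */
        fclose(fp);
        processed++;
    }
    return processed;
}
```

Called as register_from_list(stdin, some_callback, ctx), this would cover
the "ls | bogofilter -b -s" style of invocation without ever passing a
null file pointer down to the lexer.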

So I figured it wasn't DESIGNED to operate in bulk mode when
registering messages, only when classifying them.

Silly me.

But occasionally, on this list and in the docs, I see references which
suggest (though obviously the author hasn't tried it) that you can
register a maildir of spam with something like:

cd Spam/cur ; ls | bogofilter -b -s

Sorry; that segfaults.

Interestingly enough, this one doesn't:

cd Spam/cur ; bogofilter -B -s `ls`

but it doesn't work, either, because bogofilter sits forever waiting
for an email message to appear on stdin.




