Word pairs
Cedric Foll
cedric.foll at ac-rouen.fr
Wed May 14 11:00:38 CEST 2003
I've written a preprocessing script that returns word pairs.
It's written in Ruby. I'm using it successfully to classify HTML pages.
It can't deal with MIME messages.
You can use it like this:
mybogo.rb < file | bogofilter
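
To give an idea of what bogofilter receives on its input, here is a minimal sketch of the pairing step applied to a short sample string. The method name `word_pairs` and the reduced separator set are assumptions made for illustration; the attached script uses a longer separator list and also prints the single words.

```ruby
# Hypothetical helper, not part of the attached script: returns the
# lowercased words of +text+ joined pairwise with "_".
# The separator class below is a simplified stand-in for the script's.
def word_pairs(text)
  words = text.split(/[ ,;!:\n.?]+/).reject { |w| w.empty? }
  words = words.map { |w| w.downcase }
  # each_cons(2) walks the list as overlapping windows of two words.
  words.each_cons(2).map { |a, b| "#{a}_#{b}" }
end

puts word_pairs("Buy cheap pills now!").join(' ')
# prints: buy_cheap cheap_pills pills_now
```

Each pair then counts as one token, so "cheap_pills" can be scored separately from "cheap" and "pills".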
On Wed 14/05/2003 at 09:58, michael at optusnet.com.au wrote:
> Does anyone have a patch lying around that changes bogofilter
> to use word pairs as tokens instead of single words?
>
> Many thanks,
> michael, experimenting but lazy.
-------------- next part --------------
#!/usr/bin/ruby
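# Upper bound on the characters kept from the input; -1 keeps everything.
#LIMIT_SIZE = 100000
LIMIT_SIZE = -1
# Slurp all of standard input into one string and stop if there is none.
page = $stdin.read
exit if page.nil? || page.empty?
# page[0..-1] is the whole string, so the default setting truncates nothing.
page = page[0..LIMIT_SIZE]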
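# Split on spaces, newlines and punctuation, then drop empty tokens.
words = page.split(/[ ,\+;!:\n\.\(\)\/"'\?\*\[\]\-=><{}]+/)
words.reject! {|word| word.empty?}
# Lowercase ASCII letters so that "Free" and "free" yield the same token.
words.map! {|w| w.tr('A-Z','a-z')}
# Join each word to its successor with "_" to form the word pairs.
more = words.each_cons(2).map {|a, b| a + "_" + b}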
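# Also emit the contents of descriptive <meta> tags as "meta_"-prefixed tokens.
page.scan(/meta[^>]*name="(publisher|topic|theme|keywords|abstract|page-topic|description)"[^>]*content="([^"]*)"/mi) {|name, content|
  more += content.split(/[ ,]+/).map {|word| "meta_" + word} if content
}
# Print the single words first, then the pairs and the meta tokens.
puts words.join(' ')
puts more.join(' ')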
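
The meta-tag step at the end of the script can be tried on its own with a sketch like the following. The method name `meta_tokens`, the shortened list of tag names, and the sample HTML are assumptions for illustration only.

```ruby
# Hypothetical helper mirroring the script's meta-tag step: collects the
# words of selected <meta ... content="..."> attributes and prefixes each
# with "meta_". Only two tag names are matched here to keep it short.
def meta_tokens(html)
  pattern = /meta[^>]*name="(keywords|description)"[^>]*content="([^"]*)"/mi
  tokens = []
  html.scan(pattern) do |_name, content|
    words = content.split(/[ ,]+/).reject { |w| w.empty? }
    tokens += words.map { |w| "meta_" + w }
  end
  tokens
end

sample = '<meta name="keywords" content="linux, kernel">'
puts meta_tokens(sample).join(' ')
# prints: meta_linux meta_kernel
```

Note that the pattern expects `name` to appear before `content` inside the tag, as it does in the script, so tags written in the other order are not picked up.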