- Author
- E. Moura, G. Navarro, N. Ziviani, R. Baeza-Yates
- Title
- Fast Searching on Compressed Text Allowing Errors
- Booktitle
- Proceedings of the 21st Annual International
Conference on Research and Development in
Information Retrieval (SIGIR'98)
- Pages
- 298-306
- Year
- 1998
- Abstract
- We present a fast compression and decompression scheme for
natural language texts that allows efficient and flexible
string matching by searching the compressed text directly.
The compression scheme uses a word-based Huffman encoding
and the coding alphabet is byte-oriented rather than
bit-oriented. We compress typical English texts to about
30% of their original size, against 40% and 35% for
Compress and Gzip, respectively. Compression times are
close to the times of Compress and approximately half the
times of Gzip, and decompression times are lower than
those of Gzip and one third of those of Compress. The
searching algorithm allows a large number of variations of
the exact and approximate compressed string matching
problem, such as phrases, ranges, complements, wild cards
and arbitrary regular expressions. Separators and
stopwords can be discarded at search time without
significantly increasing the cost. The algorithm is based
on a word-oriented shift-or algorithm and a fast
Boyer-Moore-type filter. It concomitantly uses the
vocabulary of the text available as part of the Huffman
coding data. When searching for simple patterns, our
experiments show that running our algorithm on a
compressed text is twice as fast as running Agrep on the
uncompressed version of the same text. When searching
complex or approximate patterns, our algorithm is up to 8
times faster than Agrep. We also mention the impact of our
technique in inverted files pointing to documents or
logical blocks, as Glimpse does.
- Comment
- ftp://dcc.ufmg.br/pub/research/~nivio/cgrep
Apparently available as a program called Cgrep. Illustrative sketches of the
byte-oriented word Huffman coding and the word-oriented shift-or search
described in the abstract follow the record below.
- Category
- String
- Bibtype
- InProceedings
- URL
- http://www.dcc.uchile.cl/~gnavarro/publ.html
- Date
- 2003/08/01 04:59:50
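
The abstract describes word-based Huffman compression over a byte-oriented
coding alphabet and searching the compressed text directly. The sketch below
is a minimal illustration of that idea only, not the paper's or Cgrep's actual
implementation: it assumes a 128-ary Huffman code over the word vocabulary in
which the high bit tags the first byte of every codeword, so a word can be
located in the compressed bytes by scanning for its codeword without
decompressing. All function names and the tagging details are assumptions made
for this sketch.

# Sketch: word-based Huffman with byte-oriented codewords, plus direct search
# of the compressed bytes. Assumes at least two distinct words in the text.
import heapq
import itertools
import re
from collections import Counter

ARITY = 128  # 7 bits per byte; the 8th bit tags the first byte of a codeword


def build_codes(freqs):
    """Return {word: bytes codeword} from an ARITY-ary Huffman tree (sketch)."""
    tiebreak = itertools.count()
    heap = [(f, next(tiebreak), {w: []}) for w, f in freqs.items()]
    # Pad with dummy zero-frequency symbols so every merge combines ARITY nodes.
    while len(heap) > 1 and (len(heap) - 1) % (ARITY - 1) != 0:
        heap.append((0, next(tiebreak), {}))
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0
        for digit in range(min(ARITY, len(heap))):
            f, _, codes = heapq.heappop(heap)
            total += f
            for w, suffix in codes.items():
                merged[w] = [digit] + suffix  # prepend this level's digit
        heapq.heappush(heap, (total, next(tiebreak), merged))
    codes = heap[0][2]
    # Tag: set the high bit of the first byte of every codeword.
    return {w: bytes([c[0] | 0x80] + c[1:]) for w, c in codes.items()}


def compress(words, codes):
    """Concatenate the byte codewords of the word sequence (separators omitted)."""
    return b"".join(codes[w] for w in words)


def search(compressed, codes, word):
    """Yield byte offsets of `word`'s codeword; no decompression needed."""
    target = codes.get(word)
    if target is None:  # word not in the vocabulary: it cannot occur
        return
    start = 0
    while (pos := compressed.find(target, start)) != -1:
        yield pos  # the tag bit rules out matches starting mid-codeword
        start = pos + 1


if __name__ == "__main__":
    text = "to be or not to be that is the question"
    words = re.findall(r"\S+", text)
    codes = build_codes(Counter(words))
    data = compress(words, codes)
    print(list(search(data, codes, "be")))  # byte offsets of the two "be"s

In this sketch the tag bit prevents a codeword from being found starting
inside another codeword, which is what makes the plain byte scan safe; the
real scheme additionally encodes separators and, as the abstract notes,
handles complex and approximate patterns through the vocabulary.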
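
The abstract also mentions a word-oriented shift-or algorithm that exploits
the vocabulary stored with the Huffman coding data. The second sketch shows
the word-level shift-or idea in isolation: one bit mask is precomputed per
vocabulary word, each pattern position may be a simple wildcard class, and the
automaton advances one word at a time. In the paper's setting each text symbol
would be one codeword read from the compressed file; here the text is an
already tokenised word list so the sketch stays self-contained, and the use of
fnmatch-style classes is an assumption for illustration (the paper supports
richer patterns and a Boyer-Moore-type filter not shown here).

# Sketch: word-oriented shift-or phrase search with per-word bit masks.
import fnmatch
import re


def word_shift_or(text_words, pattern_words):
    """Yield indices where the phrase `pattern_words` ends in `text_words`.

    Each pattern element is matched against whole words with fnmatch, so
    'colo*r' matches both 'color' and 'colour'.
    """
    m = len(pattern_words)
    all_ones = (1 << m) - 1
    # Bit i of a word's mask is 0 iff the word matches pattern position i
    # (shift-or convention: 0 = "still alive").
    masks = {}
    for w in set(text_words):
        mask = all_ones
        for i, p in enumerate(pattern_words):
            if fnmatch.fnmatchcase(w, p):
                mask &= ~(1 << i)
        masks[w] = mask
    state = all_ones
    for pos, w in enumerate(text_words):
        state = ((state << 1) | masks[w]) & all_ones
        if state & (1 << (m - 1)) == 0:
            yield pos  # the phrase ends at this word position


if __name__ == "__main__":
    text = "the compressed text allows fast searching of compressed texts"
    words = re.findall(r"\w+", text.lower())
    print(list(word_shift_or(words, ["compressed", "text*"])))  # -> [2, 8]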