- Authors
- Gordon Linoff, Craig Stanfil
- Title
- Compression of Indexes with Full Positional Information
in Very Large Text Databases
- Book
- Proceedings of the Sixteenth Annual International
ACM SIGIR Conference on Research and Development in
Information Retrieval
- Pages
- 88-95
- Date
- June 1993
- Publisher
- ACM Press
- Abstract
- This paper describes a combination of compression
methods which may be used to reduce the size of inverted
indexes for very large text databases. These methods are
Prefix Omission, Run-Length Encoding, and a novel family
of numeric representations called n-s coding. Using
these compression methods on two different text sources
(the King James Version of the Bible and a sample of Wall
Street Journal stories), the compressed index occupies
less than 40% of the size of the original text, even
when both stopwords and numbers are included in the
index. The decreased time required for I/O can almost
fully compensate for the time needed to uncompress the
postings. This research is part of an effort to handle
very large text databases on the CM-5, a massively
parallel MIMD supercomputer.
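- Comments
- A method for compressing the index file (concordance) that
records the positions at which words occur in a text database.
It uses n-s coding, an Elias-code-like scheme, so that small
numbers are represented with fewer bits.
・It does not seem usable for dynamic text data.
・Oddly, the original text itself is not compressed.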
- Category
- IR,
Compress
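
The comment above likens n-s coding to an Elias code, in which small numbers take fewer bits. This record does not define n-s coding itself, so the sketch below shows the standard Elias gamma code instead, purely as an illustration of that idea. The function names, the use of bit strings rather than packed bytes, and the choice to encode gaps between sorted 1-based word positions are all assumptions made for this example, not details taken from the paper.

```python
from typing import List


def elias_gamma_encode(n: int) -> str:
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for positive integers only")
    binary = bin(n)[2:]  # binary digits without the '0b' prefix
    # Prefix with (length - 1) zeros so a decoder knows how many bits follow.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str) -> List[int]:
    """Decode a concatenation of well-formed Elias gamma codes."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_positions(positions: List[int]) -> str:
    """Encode strictly increasing 1-based positions as gamma-coded gaps.

    Gap encoding is an assumption of this sketch: it keeps the numbers
    small, which is where a variable-length code pays off.
    """
    out = []
    previous = 0
    for p in positions:
        out.append(elias_gamma_encode(p - previous))
        previous = p
    return "".join(out)


def decode_positions(bits: str) -> List[int]:
    """Invert encode_positions by summing the decoded gaps."""
    positions = []
    total = 0
    for gap in elias_gamma_decode(bits):
        total += gap
        positions.append(total)
    return positions


if __name__ == "__main__":
    encoded = encode_positions([3, 7, 8, 20])
    print(encoded, len(encoded))
    print(decode_positions(encoded))
```

In this sketch a gap of 1 costs a single bit while a gap of 12 costs seven, so postings for frequent words, whose occurrences lie close together, shrink the most.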
Organization: ACM
Category: IR Compress
Bibtype: InProceedings
Booktitle: Proceedings of the Sixteenth Annual International
ACM SIGIR Conference on Research and Development in
Information Retrieval
Month: jun
Pages: 88-95
Author: Gordon Linoff
Craig Stanfil
Title: Compression of Indexes with Full Positional Information
in Very Large Text Databases
Year: 1993
Super: SIGIR93
Publisher: ACM Press