indexing - Inverse index binary format


I'm trying to figure out what kind of binary file can support my needs for an inverse index. Let's say I have documents that I can identify by a unique ID, and each document has 360 fixed values in the range 0-65535. Something like this:

document0: [1, 10, 123, ...] // 360 values

document1: [1, 10, 345, ...] // 360 values

Now, the inverse index itself is easy: for each possible value I can create a list of the documents that contain it, and then a query can be executed fast, e.g.:

1: [document0, document1]

10: [document0, document1]

123: [document0]

345: [document1]
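
To make the mapping above concrete, here is a minimal in-memory sketch in Python. The function name `build_inverted_index` and the use of integer document IDs are assumptions made for illustration, and the documents are shortened to three values each instead of 360.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each value to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, values in documents.items():
        for value in values:
            index[value].add(doc_id)  # a set ignores repeated values in one document
    return {value: sorted(doc_ids) for value, doc_ids in index.items()}

# Shortened stand-ins for document0 and document1 from the question.
documents = {
    0: [1, 10, 123],
    1: [1, 10, 345],
}
index = build_inverted_index(documents)
```

A query for a single value is then a dictionary lookup, e.g. `index[10]` gives the IDs of all documents containing 10.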

But I want to store a large number of documents in some kind of (binary) file, and have the ability to query it fast and to add new documents without recreating the whole structure.

Now I'm struggling with how to organize that file. If I want fast access, I need fixed-length document arrays so that I can just seek in the file and read. But a fixed size means I will have a lot of empty space in each document list. One idea is to have some kind of bucketing system where each value belongs to a bucket of a specific size, e.g. there are buckets of size 1, 2, 4, 8, 16, 32, ... (or something like that), and then I need some kind of header that points me to where a bucket starts and tells me the size of the bucket. This idea would optimize storage size, but again I'm having a problem with the addition of new documents.

Any idea how to organize such an 'inverse index' file?

Best.

I would go with 65536 files, one per value, each holding the IDs of the documents that contain that value. If you want to go gentle on the filesystem, divide them into 256 directories of 256 files each.

00\00.idx 00\01.idx .. ff\ff.idx 
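
A minimal Python sketch of this file-per-value layout, assuming the directory name is the high byte of the value and the file name is the low byte, both in hex, as the listing above suggests. The function names, the `root` directory, and the 32-bit little-endian encoding of document IDs are assumptions made for illustration.

```python
import os
import struct
import tempfile

DOC_ID = struct.Struct("<I")  # assumption: document IDs fit in 32 bits

def index_path(root, value):
    """Path for a value 0-65535: high byte is the directory, low byte the file."""
    return os.path.join(root, f"{value >> 8:02x}", f"{value & 0xFF:02x}.idx")

def add_document(root, doc_id, values):
    """Append doc_id to the file of every distinct value in the document."""
    for value in set(values):
        path = index_path(root, value)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "ab") as f:   # append only: existing data is never rewritten
            f.write(DOC_ID.pack(doc_id))

def documents_with(root, value):
    """Return the IDs of all documents that contain value."""
    path = index_path(root, value)
    if not os.path.exists(path):
        return []
    with open(path, "rb") as f:
        return [doc_id for (doc_id,) in DOC_ID.iter_unpack(f.read())]

# Usage with the two shortened example documents.
with tempfile.TemporaryDirectory() as root:
    add_document(root, 0, [1, 10, 123])
    add_document(root, 1, [1, 10, 345])
    docs_with_10 = documents_with(root, 10)
    docs_with_345 = documents_with(root, 345)
```

Adding a document is then up to 360 small appends, and the filesystem takes care of growing each list, which sidesteps the bucket-resizing problem from the question.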
