indexing - Inverse index binary format
I'm trying to figure out what kind of binary file can support my needs for an inverse index. Let's say I have documents that I can identify by a unique id, and each document can have 360 fixed values in the range 0-65535. Something like this:
document0: [1, 10, 123, ...] // 360 values
document1: [1, 10, 345, ...] // 360 values
Now, the inverse index is easy - for each possible value I can create a list of the documents that contain it, and a query can then be executed fast, e.g.:
1: [document0, document1]
10: [document0, document1]
123: [document0]
345: [document1]
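For illustration, the in-memory mapping above could be built roughly like this. This is only a sketch: the function name `build_inverted_index` and the use of integer document ids are assumptions, and the sample documents are shortened to three values each.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each value to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, values in documents.items():
        for value in values:
            index[value].add(doc_id)
    # Sorted lists make the output stable and easy to compare.
    return {value: sorted(ids) for value, ids in index.items()}


# Shortened sample data: real documents would carry 360 values each.
documents = {
    0: [1, 10, 123],
    1: [1, 10, 345],
}
index = build_inverted_index(documents)
print(index[1], index[123], index[345])
```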
But I want to store a large number of documents in some kind of (binary) file, and be able to query it fast and also add new documents without recreating the whole structure.
Now I'm struggling with how to organize that file. If I want fast access, I need fixed-length document arrays so that I can just seek in the file and read. But a fixed size means I will have a lot of empty space in each document list. My idea was to have some kind of bucketing system where each value belongs to a bucket of a specific size, e.g. there would be buckets of size 1, 2, 4, 8, 16, 32, ... (or something like that), and I would need some kind of header that tells me where each bucket starts and how big it is. That idea would optimize the storage size, but again I'm having a problem with the addition of new documents.
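One possible way to make the bucket idea cope with additions, sketched under several assumptions: a fixed header holds one (offset, capacity, count) entry per value, and a full bucket is copied into a new bucket of twice the capacity at the end of the file. The layout, the 32-bit little-endian document ids, and all function names here are hypothetical. The old bucket is simply abandoned, so a real implementation would also need a free list or periodic compaction.

```python
import io
import struct

NUM_VALUES = 65536
# Hypothetical header entry: bucket offset (u64), capacity (u32), count (u32).
HEADER_ENTRY = struct.Struct("<QII")
DOC_ID = struct.Struct("<I")  # assumption: document ids fit in 32 bits
HEADER_SIZE = NUM_VALUES * HEADER_ENTRY.size


def create_index(f):
    """Write an all-zero header: every value starts with an empty bucket."""
    f.seek(0)
    f.write(bytes(HEADER_SIZE))


def _read_entry(f, value):
    f.seek(value * HEADER_ENTRY.size)
    return HEADER_ENTRY.unpack(f.read(HEADER_ENTRY.size))


def _write_entry(f, value, offset, capacity, count):
    f.seek(value * HEADER_ENTRY.size)
    f.write(HEADER_ENTRY.pack(offset, capacity, count))


def add_posting(f, value, doc_id):
    """Append doc_id to the bucket of value, growing the bucket if it is full."""
    offset, capacity, count = _read_entry(f, value)
    if count == capacity:
        # Bucket is full: move it to the end of the file with doubled capacity.
        new_capacity = max(1, capacity * 2)
        f.seek(offset)
        old = f.read(count * DOC_ID.size)
        new_offset = f.seek(0, io.SEEK_END)
        f.write(old)
        f.write(bytes((new_capacity - count) * DOC_ID.size))
        offset, capacity = new_offset, new_capacity
    f.seek(offset + count * DOC_ID.size)
    f.write(DOC_ID.pack(doc_id))
    _write_entry(f, value, offset, capacity, count + 1)


def read_postings(f, value):
    """Return the document ids stored for value: one header read, one bucket read."""
    offset, _capacity, count = _read_entry(f, value)
    f.seek(offset)
    data = f.read(count * DOC_ID.size)
    return list(struct.unpack(f"<{count}I", data))


# In-memory demo; a real index would use a file opened in "r+b" mode.
f = io.BytesIO()
create_index(f)
for doc_id in (0, 1, 2):
    add_posting(f, 1, doc_id)
add_posting(f, 123, 0)
print(read_postings(f, 1), read_postings(f, 123))
```

With doubling, a bucket that ends up holding n ids is copied only about log2(n) times, so additions stay cheap on average, at the cost of up to half of each live bucket being unused.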
Any idea how to organize such an 'inverse index' file?
Best.
I would go with 65536 files, each holding the ids of the documents that contain that value. If you want to go gentle on the filesystem, divide them into 256 directories of 256 files each.
00\00.idx 00\01.idx .. ff\ff.idx
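A rough sketch of that layout, assuming the directory name is the high byte of the value and the file name is the low byte, both in hex, and that each file is a flat run of 32-bit little-endian document ids. The function names and the id encoding are assumptions, not part of the suggestion itself. Adding a document then only appends to the files of the values it contains.

```python
import os
import struct
import tempfile

DOC_ID = struct.Struct("<I")  # assumption: document ids fit in 32 bits


def idx_path(root, value):
    """Path for a value: high byte is the directory, low byte is the file."""
    return os.path.join(root, f"{value >> 8:02x}", f"{value & 0xFF:02x}.idx")


def add_document(root, doc_id, values):
    """Append doc_id to the file of every distinct value in the document."""
    for value in set(values):
        path = idx_path(root, value)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "ab") as f:
            f.write(DOC_ID.pack(doc_id))


def documents_with(root, value):
    """Return the ids of all documents containing value, in insertion order."""
    try:
        with open(idx_path(root, value), "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return []  # no document has this value yet
    return [doc_id for (doc_id,) in DOC_ID.iter_unpack(data)]


# Demo in a temporary directory, with documents shortened to three values.
with tempfile.TemporaryDirectory() as demo_root:
    add_document(demo_root, 0, [1, 10, 123])
    add_document(demo_root, 1, [1, 10, 345])
    print(documents_with(demo_root, 1), documents_with(demo_root, 345))
```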