How to optimize Lucene.Net indexing
I need to index around 10 GB of data. Each of my "documents" is pretty small: think basic product info, about 20 fields of data, each only a few words. Only 1 column is indexed; the rest are stored. I'm grabbing the data from text files, so that part is pretty fast.
The current indexing speed is only about 40 MB per hour. I've heard other people say they have achieved 100x faster than this. For smaller files (around 20 MB) the indexing goes quite fast (5 minutes). However, when I have it loop through all of my data files (about 50 files totalling 10 GB), the growth of the index seems to slow down a lot as time goes on. Any ideas on how I can speed up the indexing, or what the optimal indexing speed is?
On a side note, I've noticed the API in the .NET port does not seem to contain all of the same methods as the original in Java...
Update: here are snippets of the indexing C# code. First I set things up:
    directory = FSDirectory.GetDirectory(@txtIndexFolder.Text, true);
    iWriter = new IndexWriter(directory, analyzer, true);
    iWriter.SetMaxFieldLength(25000);
    iWriter.SetMergeFactor(1000);
    iWriter.SetMaxBufferedDocs(Convert.ToInt16(txtBuffer.Text));
Then I read from a tab-delimited data file:
    using (System.IO.TextReader tr = System.IO.File.OpenText(file))
    {
        string line;
        while ((line = tr.ReadLine()) != null)
        {
            string[] items = line.Split('\t');
Then I create my fields and add the document to the index:
    fldName = new Field("name", items[4], Field.Store.YES, Field.Index.NO);
    doc.Add(fldName);
    fldUpc = new Field("upc", items[10], Field.Store.YES, Field.Index.NO);
    doc.Add(fldUpc);
    string contents = items[4] + " " + items[5] + " " + items[9] + " " + items[10] + " " + items[11] + " " + items[23] + " " + items[24];
    fldContents = new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED);
    doc.Add(fldContents);
    ...
    iWriter.AddDocument(doc);
Once it is completely done indexing:
    iWriter.Optimize();
    iWriter.Close();
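For anyone reading along, the fragments above can be pulled together into one loop, roughly like this. It's only a sketch, assuming a Lucene.Net 2.x build whose API matches the calls already used in the question (FSDirectory.GetDirectory, SetMergeFactor, Field.Index.TOKENIZED and so on). The class name, the command-line arguments, the "*.txt" file pattern, the StandardAnalyzer choice and the two tuning numbers are all made up for illustration; they are not recommendations and not the asker's actual settings.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical console wrapper around the snippets from the question.
class IndexerSketch
{
    static void Main(string[] args)
    {
        string indexFolder = args[0]; // assumption: where the index is written
        string dataFolder = args[1];  // assumption: folder holding the tab-delimited files

        // Fully qualified to avoid a clash with System.IO.Directory.
        Lucene.Net.Store.Directory directory = FSDirectory.GetDirectory(indexFolder, true);
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        try
        {
            writer.SetMaxFieldLength(25000);
            // Placeholder values only; measure with your own data before settling on any.
            writer.SetMergeFactor(10);
            writer.SetMaxBufferedDocs(1000);

            // One writer is kept open across all files instead of reopening per file.
            foreach (string file in System.IO.Directory.GetFiles(dataFolder, "*.txt"))
            {
                using (System.IO.TextReader tr = System.IO.File.OpenText(file))
                {
                    string line;
                    while ((line = tr.ReadLine()) != null)
                    {
                        string[] items = line.Split('\t');
                        if (items.Length < 25)
                        {
                            continue; // skip rows too short for the column indexes used below
                        }

                        Document doc = new Document();
                        // Stored-only fields: returned with hits but not searchable.
                        doc.Add(new Field("name", items[4], Field.Store.YES, Field.Index.NO));
                        doc.Add(new Field("upc", items[10], Field.Store.YES, Field.Index.NO));

                        // The single searchable field: tokenized, not stored.
                        string contents = items[4] + " " + items[5] + " " + items[9] + " " +
                                          items[10] + " " + items[11] + " " + items[23] + " " + items[24];
                        doc.Add(new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED));

                        writer.AddDocument(doc);
                    }
                }
            }

            // Optimize once, after every file has been added.
            writer.Optimize();
        }
        finally
        {
            writer.Close();
        }
    }
}
```

The sketch only includes the two stored fields shown in the question; the rest (the "..." above) would be added the same way. Timing each file separately with this layout should make it easier to see whether the slowdown tracks index size or something else.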
Apparently, I had downloaded a 3-year-old version of Lucene that is, for some reason, prominently linked from the home page of the project. I downloaded the most recent Lucene source code, compiled it, used the new DLL, and that fixed everything. The documentation kinda sucks, but the price is right and it's real fast.
From a helpful blog:
First things first, you have to add the Lucene libraries to your project. On the Lucene.Net web site, you'll see the most recent release builds of Lucene. These are 2 years old. Do not grab them; they have bugs. There has not been an official release of Lucene for some time, due to resource constraints of the maintainers. Use Subversion (or TortoiseSVN) to browse around and grab the most updated Lucene.Net code from the Apache SVN repository. The solution and projects are for Visual Studio 2005 and .NET 2.0, but I upgraded the projects to Visual Studio 2008 without issues. I was able to build the solution without errors. Go to the bin directory, grab the Lucene.Net DLL, and add it to your project.