performance - How to quickly read the content of 25k small txt files with Python -


I have downloaded many HTML files and stored them on disk. Now I need to read their content, extract data, and persist it to MySQL. Loading the files one by one in the traditional way is not efficient; it takes nearly 8 minutes.

Any advice is welcome.

g_fields = [
    'name', 'price', 'productid', 'site', 'link', 'smallimage', 'bigimage',
    'description', 'createdon', 'modifiedon', 'size', 'weight', 'wrap',
    'material', 'packagingcount', 'stock', 'location', 'popularity',
    'instock', 'categories',
]

import glob
import os.path
import mmap
from pyquery import PyQuery as pq  # pq appears to be pyquery's PyQuery

# cost_time, delete, g_xml2csv_file and g_filter are defined elsewhere
# in the poster's code.
@cost_time
def batch_xml2csv():
    "Batch-import the XML files into a single CSV file"
    delete(g_xml2csv_file)
    f = open(g_xml2csv_file, "a")
    for file in glob.glob(g_filter):
        print "Reading %s" % file
        ff = open(file, "r+")
        size = os.path.getsize(file)
        data = mmap.mmap(ff.fileno(), size)
        s = pq(data.read(size))
        data.close()
        ff.close()
        #s = pq(open(file, "r").read())
        line = []
        for field in g_fields:
            r = s("field[@name='%s']" % field).text()
            if r is None:
                line.append("\n")
            else:
                # escape embedded quotes (the original '\"' was a no-op)
                line.append('"%s"' % r.replace('"', '\\"'))
        f.write(",".join(line) + "\n")
    f.close()
    print "done!"

I tried mmap, but it didn't seem to help much.
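For files of only a few kilobytes, mapping each one with mmap usually costs more than it saves, because every file pays the mmap setup overhead. A minimal sketch of the plainer approach, reading each file in a single call (g_filter is the question's own placeholder for the glob pattern):

import glob

def read_all(pattern):
    # Read each matching file in one call; for small files this is
    # typically at least as fast as mmap, which adds per-file setup cost.
    contents = {}
    for path in glob.glob(pattern):
        with open(path, "rb") as fh:
            contents[path] = fh.read()
    return contents

# usage, assuming the same pattern the question's code uses:
# pages = read_all(g_filter)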

If you've got 25,000 text files on disk, you're doing it wrong. Depending on how you store them on disk, the slowness may literally be the disk seeks needed to find the files.

If you've got 25,000 of anything, it'll be faster if you put them in a database with an intelligent index -- and if you make the index field the filename, it'll be faster still.

If you have multiple directories that descend n levels deep, a database will still be faster.
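A minimal sketch of that idea, using SQLite from the standard library as a stand-in for MySQL (the table and column names here are illustrative, not from the question):

import glob
import sqlite3

def load_files_into_db(pattern, db_path="files.db"):
    # Store each file's content once, with an index on the filename,
    # so later lookups hit the index instead of the filesystem.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS files (filename TEXT, content TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_filename ON files (filename)")
    for path in glob.glob(pattern):
        with open(path, "r") as fh:
            conn.execute(
                "INSERT INTO files (filename, content) VALUES (?, ?)",
                (path, fh.read()),
            )
    conn.commit()
    conn.close()

The same schema carries over to MySQL; the point is that one indexed table replaces 25,000 separate disk seeks.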

