performance - How to quickly read the content of 25k small text files with Python
I downloaded a lot of HTML pages and stored them on disk. Now I need to extract data from their content and persist it to MySQL. Loading the files one by one the traditional way is not efficient; it takes nearly 8 minutes.
Any advice is welcome.
g_fields = [
    'name', 'price', 'productid', 'site', 'link', 'smallimage', 'bigimage',
    'description', 'createdon', 'modifiedon', 'size', 'weight', 'wrap',
    'material', 'packagingcount', 'stock', 'location', 'popularity',
    'instock', 'categories',
]

@cost_time
def batch_xml2csv():
    "Batch-import the XML files into a single CSV file"
    delete(g_xml2csv_file)
    f = open(g_xml2csv_file, "a")
    import os.path
    import glob
    import mmap
    for file in glob.glob(g_filter):
        print "Reading %s" % file
        ff = open(file, "r+")
        size = os.path.getsize(file)
        data = mmap.mmap(ff.fileno(), size)
        s = pq(data.read(size))  # pq is pyquery
        data.close()
        ff.close()
        #s = pq(open(file, "r").read())
        line = []
        for field in g_fields:
            r = s("field[@name='%s']" % field).text()
            if r is None:
                line.append("\n")
            else:
                line.append('"%s"' % r.replace('"', '\\"'))
        f.write(",".join(line) + "\n")
    f.close()
    print "done!"
I tried mmap, but it didn't seem to help much.
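For comparison, here is a minimal Python 3 sketch of the same loop without mmap: for thousands of small files, mmap only adds per-file setup cost, while a single read() call is simpler, and the CPU-bound parsing can be spread over a process pool. It assumes each file holds <field name="...">value</field> elements (inferred from the selector in the code above); the field list is trimmed and the glob pattern and output path are illustrative.

# Sketch: one read() per small file, parsing parallelised across processes.
import csv
import glob
import xml.etree.ElementTree as ET
from multiprocessing import Pool

G_FIELDS = ['name', 'price', 'productid', 'site', 'link']  # trimmed for brevity

def extract_row(path):
    # Read the whole small file in one call, then parse it once.
    with open(path, encoding="utf-8") as fh:
        root = ET.fromstring(fh.read())
    values = {el.get('name'): (el.text or '') for el in root.iter('field')}
    return [values.get(f, '') for f in G_FIELDS]

def batch_xml2csv(pattern="*.xml", out_path="out.csv"):
    paths = glob.glob(pattern)
    with Pool() as pool, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)  # csv module handles quoting and escaping
        # imap keeps memory flat; chunksize amortises inter-process overhead
        for row in pool.imap(extract_row, paths, chunksize=100):
            writer.writerow(row)

if __name__ == "__main__":
    batch_xml2csv()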
If you've got 25,000 text files sitting on disk, you're doing it wrong. Depending on how you store them, the slowness may literally be the disk seeking around to find the files.
If you've got 25,000 of anything, it will be faster to put them in a database with an intelligent index -- even if you make the index field nothing more than the filename, it will be faster.
If you have multiple directories that descend n levels deep, a database would still be faster.
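A minimal sketch of that suggestion, using the stdlib sqlite3 module purely as a stand-in for MySQL (the table and column names are illustrative): load each file's content once, keyed and indexed by filename, so later lookups are index lookups instead of filesystem seeks.

import glob
import sqlite3

def load_files_into_db(pattern="*.xml", db_path="pages.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS pages (
                        filename TEXT PRIMARY KEY,  -- PRIMARY KEY provides the index
                        content  TEXT NOT NULL)""")
    rows = ((path, open(path, encoding="utf-8").read())
            for path in glob.glob(pattern))
    with conn:  # one transaction for all inserts
        conn.executemany(
            "INSERT OR REPLACE INTO pages (filename, content) VALUES (?, ?)", rows)
    return conn

# Later, fetching any page is an indexed lookup rather than a disk seek:
# content, = conn.execute(
#     "SELECT content FROM pages WHERE filename = ?", ("foo.xml",)).fetchone()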