Python - Downloading a webpage using urllib2 results in garbled junk? (only sometimes)
How come when I hit this webpage, I get HTML text:
http://itunes.apple.com/us/app/mobile/id381057839
but when I hit this webpage, I get garbled junk?
http://itunes.apple.com/us/app/mobile/id375562663
I use the same download() function in Python for both URLs. Here it is:
    import socket
    import urllib2

    def download(source_url):
        try:
            # Apply a 10-second timeout to all new sockets
            socket.setdefaulttimeout(10)
            agent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 AlexaToolbar/alxf-1.54 Firefox/3.6.10 GTB7.1"
            # Build the request with the headers a browser would send
            ree = urllib2.Request(source_url)
            ree.add_header('User-Agent', agent)
            ree.add_header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
            ree.add_header("Accept-Language", "en-us,en;q=0.5")
            ree.add_header("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7")
            ree.add_header("Accept-Encoding", "gzip,deflate")
            ree.add_header("Host", "itunes.apple.com")
            resp = urllib2.urlopen(ree)
            # Read the raw response body and return it as-is
            htmlsource = resp.read()
            return htmlsource
        except Exception, e:
            print e
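Solved. It was a compression issue: the request sends Accept-Encoding: gzip, so the server may return a gzip-compressed body, and the raw bytes have to be decompressed before they read as HTML.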
    import gzip
    import random
    import socket
    import StringIO
    import urllib2

    def download(source_url):
        try:
            socket.setdefaulttimeout(10)
            agents = ['Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)',
                      'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)',
                      'Microsoft Internet Explorer/4.0b1 (Windows 95)',
                      'Opera/8.00 (Windows NT 5.1; U; en)']
            ree = urllib2.Request(source_url)
            ree.add_header('User-Agent', random.choice(agents))
            # Ask for gzip only, so there is a single encoding to handle
            ree.add_header('Accept-Encoding', 'gzip')
            opener = urllib2.build_opener()
            resp = opener.open(ree)
            h = resp.read()
            # Decompress only if the server actually sent gzip;
            # an uncompressed body would make GzipFile raise an error
            if resp.info().get('Content-Encoding') == 'gzip':
                compressedstream = StringIO.StringIO(h)
                gzipper = gzip.GzipFile(fileobj=compressedstream)
                h = gzipper.read()
            return h
        except Exception, e:
            print e
            return ""
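The first version of download() also advertised deflate, which the solution sidesteps by asking for gzip only. As a rough sketch of how both encodings could be handled, the helper below picks a decompression path from the Content-Encoding header value. The name decode_body and its parameters are hypothetical, not part of urllib2; it uses only zlib calls that exist in both Python 2 and Python 3, and it assumes the whole body has already been read into memory.

```python
import zlib


def decode_body(body, content_encoding):
    """Return the decompressed bytes of an HTTP response body.

    body: raw bytes as returned by resp.read()
    content_encoding: value of the Content-Encoding header, or None
    (decode_body is a hypothetical helper, not a urllib2 API)
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        try:
            # Standard form: a zlib-wrapped deflate stream
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib wrapper
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No encoding, or one this sketch does not handle: return the body unchanged
    return body
```

With the urllib2 code above, this would be called as decode_body(resp.read(), resp.info().get('Content-Encoding')). Keying off the response header rather than the request means a page that comes back uncompressed is passed through untouched.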