How to: remove part of a Unicode string in Python following a special character
First, a short summary:
Python version: 3.1; system: Linux (Ubuntu).
I am trying to do some data retrieval with Python and BeautifulSoup.
Unfortunately, some of the tables I am trying to process contain cells in which the following text string exists:
789.82 ± 10.28
For this to work I need two things:
How do I handle "weird" symbols such as ±, and how do I remove the part of the string containing ± and everything to the right of it?
Currently I get an error like: SyntaxError: Non-ASCII character '\xc2' in file ......
Thank you for your help.
[edit]:
    # Data retrieval from DETHERM HTML files
    # -*- coding: utf8 -*-
    import sys, os, re
    from BeautifulSoup import BeautifulSoup

    sys.path.insert(0, os.getcwd())
    raw_data = open('download.php.html', 'r')
    soup = BeautifulSoup(raw_data)

    for numdiv in soup.findAll('div', {"id": "sec"}):
        currenttable = numdiv.find('table', {"class": "data"})
        if currenttable:
            numrow = 0
            for row in currenttable.findAll('td', {"class": "datahead"}):
                numrow = numrow + 1
            for col in currenttable.findAll('td'):
                col2 = ''.join(col.findAll(text=True))
                if col2.index('±'):
                    col2 = col2[:col2.index('±')]
                print(col)
            print(numrow)
            ref = numdiv.find('a')
            niceref = ''.join(ref.findAll(text=True))
            print(niceref)

Now the code fails with the following error:
unicodedecodeerror: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Where did the ASCII reference come from?
You need to have your Python file encoded in UTF-8. Otherwise, it's quite trivial:
>>> s = '789.82 ± 10.28'
>>> s[:s.index('±')]
'789.82 '
>>> s.partition('±')
('789.82 ', '±', ' 10.28')
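Building on the slicing and partition calls above, here is a minimal sketch of how the two pieces might fit together in a script. The helper names strip_uncertainty and read_text are hypothetical, not part of BeautifulSoup or the original code. It assumes the HTML file is stored as UTF-8, and that the UnicodeDecodeError arises because open() in Python 3 falls back to the locale's preferred encoding (plain ASCII under a C/POSIX locale) when none is given. That is a likely cause, not a confirmed one. Writing the symbol as the escape '\u00b1' also sidesteps any doubt about how the source file itself is encoded.

```python
# -*- coding: utf-8 -*-
import io

# U+00B1 is the plus-minus sign; the escape keeps this source file pure ASCII.
PLUS_MINUS = '\u00b1'


def strip_uncertainty(text, marker=PLUS_MINUS):
    """Return the part of text before marker, without surrounding whitespace.

    str.partition does not raise when the marker is missing (unlike
    str.index), so a cell without the marker is returned whole.
    """
    value, _separator, _rest = text.partition(marker)
    return value.strip()


def read_text(path, encoding='utf-8'):
    """Read a text file with an explicit encoding (assumed UTF-8 here)
    instead of relying on the locale default."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()


print(strip_uncertainty('789.82 \u00b1 10.28'))  # prints 789.82
print(strip_uncertainty('42.0'))                 # prints 42.0
```

In the loop from the question, the index test and slice could then be replaced by something like col2 = strip_uncertainty(col2), and the file could be read via read_text('download.php.html') before handing the text to the parser.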