How to: remove part of a Unicode string in Python following a special character

- June 15, 2014

First, a short summary:

Python version: 3.1. System: Linux (Ubuntu).

I am trying to do some data retrieval with Python and BeautifulSoup.

Unfortunately, some of the tables I am trying to process contain cells in which the following text string exists:

789.82 ± 10.28

For this to work I need two things:

How do I handle "weird" symbols such as ±, and how do I remove the part of the string containing ± and everything to the right of it?

Currently I get an error like: SyntaxError: Non-ASCII character '\xc2' in file ......

Thank you for your help.

[edit]:

    # Data retrieval from DETHERM HTML files
    # -*- coding: utf8 -*-

    import sys, os, re
    from BeautifulSoup import BeautifulSoup

    sys.path.insert(0, os.getcwd())

    raw_data = open('download.php.html', 'r')
    soup = BeautifulSoup(raw_data)

    for numdiv in soup.findAll('div', {"id": "sec"}):
        currenttable = numdiv.find('table', {"class": "data"})
        if currenttable:
            # Count the header cells of the table
            numrow = 0
            for row in currenttable.findAll('td', {"class": "datahead"}):
                numrow = numrow + 1

            for col in currenttable.findAll('td'):
                col2 = ''.join(col.findAll(text=True))
                # Cut the cell text off at the ± sign, if there is one
                if '±' in col2:
                    col2 = col2[:col2.index('±')]
                print(col)
            print(numrow)
            ref = numdiv.find('a')
            niceref = ''.join(ref.findAll(text=True))
            print(niceref)

Now the code fails with the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

Where did the ASCII reference pop up from?

You need to have your Python file encoded in UTF-8. Otherwise, it's quite trivial:

    >>> s = '789.82 ± 10.28'
    >>> s[:s.index('±')]
    '789.82 '
    >>> s.partition('±')
    ('789.82 ', '±', ' 10.28')
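The partition approach above can be wrapped in a small helper that also copes with cells that have no ± at all. This is only a sketch for Python 3: the names PLUS_MINUS, strip_uncertainty and split_value_and_uncertainty are made up for illustration and do not come from the original post. Writing the symbol as the escape "\u00b1" keeps the source file pure ASCII, which should sidestep the "Non-ASCII character" complaint regardless of how the file is saved.

```python
# Hypothetical helpers (not from the original post) built on str.partition.
PLUS_MINUS = "\u00b1"  # the ± character, written as an escape so the source stays ASCII


def strip_uncertainty(cell_text):
    """Return the text before the first ±, with surrounding whitespace removed.

    If there is no ± in cell_text, the whole stripped text is returned,
    because partition() then puts everything into its first element.
    """
    value, _sep, _rest = cell_text.partition(PLUS_MINUS)
    return value.strip()


def split_value_and_uncertainty(cell_text):
    """Return (value, uncertainty) as floats; uncertainty is None without a ±."""
    value, sep, uncertainty = cell_text.partition(PLUS_MINUS)
    return float(value), (float(uncertainty) if sep else None)


print(strip_uncertainty("789.82 \u00b1 10.28"))  # prints 789.82
print(strip_uncertainty("42.0"))                 # prints 42.0
```

Unlike s.index('±'), partition does not raise ValueError when the symbol is missing, so no separate check is needed before slicing. If the text comes from a file read on Python 3, passing an explicit encoding such as open('download.php.html', encoding='utf-8') is one way to avoid depending on the platform default; whether that default is what triggers the UnicodeDecodeError in the question is an assumption.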
