Encoding error in Python with Chinese characters
I'm a beginner having trouble decoding several dozen CSV files containing numbers plus (simplified) Chinese characters to UTF-8 in Python 2.7.
I do not know the encoding of the input files. I have tried all the possible encodings I am aware of -- GB18030, UTF-7, UTF-8, UTF-16 and UTF-32 (LE and BE). Also, for good measure, GBK and GB2312, though these should be subsets of GB18030. The UTF ones all stop when they get to the first Chinese characters. The other encodings stop somewhere in the first line, except GB18030. I thought this would be the solution because it read through the first few files and decoded them fine. Part of my code, reading line by line, is:
line = line.decode("gb18030")
The first two files I tried to decode worked fine. Midway through the third file, Python spits out
UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 168-169: illegal multibyte sequence
In this file, there are about 5 such errors in about a million lines.
I opened the input file in a text editor and checked which characters were giving the decoding errors, and the first few all had Euro signs in a particular column of the CSV files. I am fairly confident these are typos, so I would just like to delete the Euro characters. I want to examine the types of encoding errors one by one; I would like to get rid of the Euro errors, but do not want to just ignore the others until I have looked at them first.
Edit: I used chardet, which gave GB2312 as the encoding with 0.99 confidence for all files. Trying GB2312 to decode gave:
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 108-109: illegal multibyte sequence
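For reference, this is roughly the chardet check I ran (the file name here is made up):

import chardet

# read the raw bytes and ask chardet for its best guess
raw = open('input3.csv', 'rb').read()
print chardet.detect(raw)
# e.g. {'encoding': 'GB2312', 'confidence': 0.99}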
""" ... gb18030. thought solution because read through first few files , decoded them fine.""" -- please explain mean. me, there 2 criteria successful decoding: firstly raw_bytes.decode('some_encoding') didn't fail, secondly resultant unicode when displayed makes sense in particular language. every file in universe pass first test when decoded latin1
aka iso_8859_1
. many files in east asian languages pass first test gb18030
, because used characters in chinese, japanese, , korean encoded using same blocks of two-byte sequences. how of second test have done?
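To illustrate the first point with a hypothetical byte string:

>>> '\xcb\xbe\x80\x80'.decode('latin1')   # never raises, but the result is gibberish
u'\xcb\xbe\x80\x80'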
Don't muck about looking at the data in an IDE or text editor. Look at it in a web browser; browsers make a better job of detecting encodings.
How do you know that it's a Euro character? By looking at the screen of a text editor that's decoding the raw bytes using what encoding? cp1252?
How do you know it contains Chinese characters? Are you sure it's not Japanese? Or Korean? Where did you get it from?
Chinese files created in Hong Kong, Taiwan, maybe Macao, and other places off the mainland use big5 or big5_hkscs encoding -- try that.
In any case, take Mark's advice and point chardet at it; chardet usually makes a reasonably good job of detecting the encoding used if the file is large enough and correctly encoded Chinese/Japanese/Korean -- however, if someone has been hand-editing the file in a text editor using a single-byte charset, a few illegal characters may cause the encoding used for the other 99.9% of the characters not to be detected.
You may like to do print repr(line) on, say, 5 lines from the file and edit the output into your question.
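Something like this, with the file name assumed:

with open('input3.csv', 'rb') as f:
    for i, line in enumerate(f):
        if i == 5:
            break
        print repr(line)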
If the file is not confidential, you may like to make it available for download.
Was the file created on Windows? How are you reading it in Python? (show code)
Update after OP comments:
Notepad etc don't attempt to guess the encoding; "ANSI" is the default. You have to tell it what to do. What you are calling the Euro character is the raw byte "\x80" decoded by your editor using the default encoding for your environment -- the usual suspect being "cp1252". Don't use such an editor to edit your file.
Earlier you were talking about "the first few errors". Now you say you have 5 errors total. Please explain.
If the file is indeed almost correct GB18030, you should be able to decode the file line by line, and when you get such an error, trap it, print the error message, extract the byte offsets from the message, print repr(two_bad_bytes), and keep going. I'm very interested in which of the two bytes the \x80 appears. If it doesn't appear at all, the "Euro character" is not part of your problem. Note that \x80 can appear validly in a GB18030 file, but only as the second byte of a two-byte sequence starting with \x81 to \xfe.
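A minimal sketch of that trap-and-continue loop (the file name is an assumption; note that the exception object carries the offsets directly in its start and end attributes, so there is no need to parse the message text):

import sys

with open('input3.csv', 'rb') as f:
    for lineno, raw in enumerate(f, 1):
        try:
            line = raw.decode('gb18030')
        except UnicodeDecodeError as e:
            # report the line number, the error, and the offending bytes, then keep going
            sys.stderr.write('line %d: %s\n' % (lineno, e))
            sys.stderr.write('bad bytes: %r\n' % raw[e.start:e.end])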
It's a good idea to know what your problem is before you try to fix it. Trying to fix it by bashing the file about with Notepad etc in "ANSI" mode is not a good idea.
You have been very coy about how you decided that the results of GB18030 decoding made sense. In particular, I would be closely scrutinising the lines where GBK fails but GB18030 "works" -- there must be some extremely rare Chinese characters in there, or maybe some non-Chinese non-ASCII characters ...
Here's a suggestion for a better way to inspect the damage: decode each file with raw_bytes.decode(encoding, 'replace') and write the result (encoded in UTF-8) to another file. Count the errors with result.count(u'\ufffd'). View the output file with whatever you used to decide that the GB18030 decoding made sense. The U+FFFD character should show up as a white question mark inside a black diamond.
If you decide that the undecodable pieces can be discarded, the easiest way is raw_bytes.decode(encoding, 'ignore').
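A sketch of that inspection pass (the file names are assumptions):

raw_bytes = open('input3.csv', 'rb').read()
result = raw_bytes.decode('gb18030', 'replace')     # undecodable spots become U+FFFD
print result.count(u'\ufffd')                       # how many such spots
open('inspect3.txt', 'wb').write(result.encode('utf8'))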
Update after further information:
All those \\ are confusing. It appears that "getting the bytes" involved repr(repr(bytes)) instead of just repr(bytes) ... at the interactive prompt, do either bytes (you'll get an implicit repr()), or print repr(bytes) (which won't get an implicit repr()).
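To see the difference with a hypothetical two-byte string:

>>> b = '\xcb\xbe'
>>> b                      # implicit repr() at the prompt
'\xcb\xbe'
>>> print repr(b)          # explicit repr(); print adds no extra layer
'\xcb\xbe'
>>> print repr(repr(b))    # two layers: the backslashes get doubled
"'\\xcb\\xbe'"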
The blank space: I presume that you mean that '\xf8\xf8'.decode('gb18030') is what you interpret as some kind of full-width space, and that the interpretation is done by visual inspection using some unnameable viewer software. Is that correct?
Actually, '\xf8\xf8'.decode('gb18030') -> u'\ue28b'. U+E28B is in the Unicode PUA (Private Use Area). The "blank space" presumably means that the viewer software unsurprisingly doesn't have a glyph for U+E28B in the font it is using.
Perhaps the source of the files is deliberately using PUA characters for characters that are not in standard GB18030, or for annotation, or for transmitting pseudosecret info. If so, you will need to resort to the decoding tambourine, an offshoot of recent Russian research reported here.
Alternative: the cp939-hkscs theory. According to the HK government, HKSCS big5 code FE57 was once mapped to U+E28B but is now mapped to U+28804.
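If you want to survey the decoded text for any other PUA characters, a quick scan would be something like this (assuming text is the decoded unicode object):

for i, ch in enumerate(text):
    if u'\ue000' <= ch <= u'\uf8ff':    # Basic Multilingual Plane PUA range
        print i, hex(ord(ch)), repr(ch)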
the "euro": said """due data can't share whole line, calling euro char in: \xcb\xbe\x80\x80" [i'm assuming \
omitted start of that, , "
literal]. "euro character", when appears, in same column don't need, hoping use "ignore". unfortunately, since "euro char" right next quotes in file, "ignore" gets rid of both euro character [as] quotes, poses problem csv module determine columns"""
It would help enormously if you could show the patterns of where these \x80 bytes appear in relation to quotes and Chinese characters -- keep it readable by just showing the hex, and hide your confidential data, e.g. by using C1 C2 to represent "two bytes which I am sure represent a Chinese character". For example:
C1 C2 C1 C2 cb 80 80 22 # `\x22` is the quote character
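For what it's worth, here is a simplified sketch of one way to generate such a masked dump -- in this variant, every confidential byte is hidden as '??' rather than labelled C1 C2:

def mask_hex(raw):
    # show only the bytes of interest (\x80 and the quote); hide the rest
    out = []
    for byte in raw:
        if byte in '\x80"':
            out.append(byte.encode('hex'))
        else:
            out.append('??')    # confidential byte, hidden
    return ' '.join(out)

print mask_hex('\xcb\xbe\x80\x80"')    # -> ?? ?? 80 80 22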
Please supply examples of cases (1) where the " is not lost by 'replace' or 'ignore' and (2) where the quote is lost. In the sole example that you have supplied to date, the " is not lost:
>>> '\xcb\xbe\x80\x80\x22'.decode('gb18030', 'ignore')
u'\u53f8"'
And the offer to send you some debugging code (see the example output below) is still open.
>>> import decode_debug as de
>>> def logger(s):
...    sys.stderr.write('*** ' + s + '\n')
...
>>> import sys
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'replace', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8\ufffd\ufffd"'
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'ignore', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8"'
>>>
Eureka: -- Probable cause of sometimes losing the quote character --
It appears there is a bug in the gb18030 decoder's replace/ignore mechanism: \x80 is not a valid gb18030 lead byte; when it is detected, the decoder should attempt to resync with the NEXT byte. However, it seems to be ignoring both the \x80 AND the following byte:
>>> '\x80abcd'.decode('gb18030', 'replace')
u'\ufffdbcd' # the 'a' is lost
>>> de.decode_debug('\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffdabcd'
>>> '\x80\x80abcd'.decode('gb18030', 'replace')
u'\ufffdabcd' # the second '\x80' is lost
>>> de.decode_debug('\x80\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80\x80ab') doesn't start with a plausible code sequence
*** input[1:5] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffd\ufffdabcd'
>>>
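Until that is fixed, a manual decode loop that resyncs exactly one byte past each bad byte (so the quote after a stray \x80 survives) might look like this sketch:

def decode_resync(raw, encoding='gb18030'):
    # decode raw bytes, advancing exactly one byte past each bad byte
    pieces = []
    pos = 0
    while pos < len(raw):
        try:
            pieces.append(raw[pos:].decode(encoding))
            break
        except UnicodeDecodeError as e:
            # e.start is the offset of the first bad byte within raw[pos:]
            pieces.append(raw[pos:pos + e.start].decode(encoding))
            pieces.append(u'\ufffd')
            pos += e.start + 1    # resync at the very next byte
    return u''.join(pieces)

print repr(decode_resync('\xcb\xbe\x80\x80\x22'))   # u'\u53f8\ufffd\ufffd"' -- quote kept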