C# char/byte encoding equality
I have some code that dumps strings to stdout so I can check their encoding. It looks like this:

    private void DumpString(string s)
    {
        System.Console.Write("{0}: ", s);
        foreach (byte b in s)
        {
            System.Console.Write("{0}({1}) ", (char)b, b.ToString("x2"));
        }
        System.Console.WriteLine();
    }

Consider two strings, each of which appears as "ë", but with different encodings. DumpString produces the following output:
ë: e(65)(08)
ë: ë(eb)
The calling code looks like this:

    DumpString(string1);
    DumpString(string2);

How can I convert string2, using System.Text.Encoding, so that it is byte-equivalent to string1?
They don't have different encodings. Strings in C# are always UTF-16 (so you shouldn't use byte to iterate over a string, because you'll lose the top 8 bits of every char). What they have is different normalization forms.
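To see why iterating as byte is misleading, here is a minimal sketch of a dump routine that walks the UTF-16 code units as char instead. The class name StringDump and the helper DescribeCodeUnits are hypothetical names chosen for illustration. They are not part of the question's code.

```csharp
using System;
using System.Text;

static class StringDump
{
    // Hypothetical helper: formats every UTF-16 code unit of s as U+XXXX,
    // keeping all 16 bits instead of truncating each char to a byte.
    public static string DescribeCodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(DescribeCodeUnits("e\u0308")); // U+0065 U+0308
        Console.WriteLine(DescribeCodeUnits("\u00eb"));  // U+00EB
    }
}
```

With the full code units visible, the second unit of the first string shows up as U+0308 rather than the truncated 08 in the original dump.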
Your first string is "\u0065\u0308": LATIN SMALL LETTER E followed by COMBINING DIAERESIS. This is the decomposed form (NFD).
The second is "\u00eb": LATIN SMALL LETTER E WITH DIAERESIS. This is the precomposed form (NFC).
You can convert between them with string.Normalize.
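As a rough sketch of that conversion, the following program normalizes each string into the other's form using System.Text.NormalizationForm. The variable names string1 and string2 mirror the question, the class name NormalizeDemo is made up for illustration, and the example assumes the runtime's normal Unicode/globalization support is available.

```csharp
using System;
using System.Text;

static class NormalizeDemo
{
    public static void Main()
    {
        string string1 = "e\u0308"; // decomposed: e + combining diaeresis
        string string2 = "\u00eb";  // precomposed: e with diaeresis

        // The == operator on strings compares code units, so these differ.
        Console.WriteLine(string1 == string2); // False

        // Decompose string2 so that it matches string1 (NFD).
        string decomposed = string2.Normalize(NormalizationForm.FormD);
        Console.WriteLine(decomposed == string1); // True

        // Or go the other way: compose string1 so that it matches string2 (NFC).
        string composed = string1.Normalize(NormalizationForm.FormC);
        Console.WriteLine(composed == string2); // True
    }
}
```

Calling Normalize() with no argument uses form C, so FormD has to be passed explicitly to get the decomposed result that matches string1.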