Wednesday, July 31, 2013

Help! How can I ensure I get the encoding right?

I've been struggling with character encoding.
Some tests in our suite use strings like "100µl" or "N°". The problem was that these tests were getting corrupted as they were pushed into Quality Centre. I found a workaround through trial and error, but someone must know why it works.
Here are some experiments. I saved the text "µ°" in three files: one as ANSI, one as UTF-8, and one as "ANSI as UTF-8" (so Notepad++ tells me). Then I ran the following code:

# encoding: utf-8

utf8 = File.read("files/utf8.txt", external_encoding: "UTF-8")
aautf8 = File.read("files/ascii_as_utf8.txt")
ascii = File.read("files/ascii.txt")

puts "UTF8   >> " + utf8
puts utf8.encoding.names.inspect

puts "AAUTF8 >> " + aautf8
puts aautf8.encoding.names.inspect

puts "ASCII  >> " + ascii
puts "ASCII  >> " + ascii.bytes.pack("U*")
puts ascii.encoding.names.inspect

This produces the following (assuming you are on Windows with Ruby 2.0 and the Lucida font, and you whispered the magic incantation "chcp 65001"):
UTF8   >> µ°
["UTF-8", "CP65001", "locale", "external"]
AAUTF8 >> µ°
["UTF-8", "CP65001", "locale", "external"]
ASCII  >> ��
ASCII  >> µ°
["UTF-8", "CP65001", "locale", "external"]
So I guess my questions are: How are you supposed to load a file and get it to appear correctly? And secondly, that last line... was that a fluke? Also, how do you tell whether the file has loaded correctly or not?
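
As a starting point for these questions, here is a minimal sketch rather than a definitive answer. It assumes the "ANSI" file is really Windows-1252 (Notepad++ does not say which code page it used), and the helper names to_utf8 and valid_utf8? are made up for illustration. The bytes are built inline so the snippet runs without the test files.

```ruby
# encoding: utf-8

# Hypothetical helper: label raw bytes with the encoding they were really
# saved in, then transcode them to UTF-8.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Hypothetical helper: true when the bytes form well-formed UTF-8.
# This catches ANSI bytes mislabeled as UTF-8, but it cannot prove that
# the characters are the ones the author intended.
def valid_utf8?(str)
  str.dup.force_encoding("UTF-8").valid_encoding?
end

# "µ°" as an ANSI (assumed Windows-1252) file stores it: one byte each.
ansi = "\xB5\xB0".b

puts valid_utf8?(ansi)                      # false
puts to_utf8(ansi, "Windows-1252") == "µ°"  # true

# pack("U*") treats every byte value as a Unicode code point. That matches
# Windows-1252 for bytes 0xA0..0xFF, which is why it worked for "µ°" ...
puts ansi.bytes.pack("U*") == "µ°"          # true

# ... but not for bytes 0x80..0x9F, such as the euro sign (0x80).
euro = "\x80".b
puts euro.bytes.pack("U*") == "€"           # false
puts to_utf8(euro, "Windows-1252") == "€"   # true
```

When reading from disk, the same transcoding can be requested up front with something like File.read("files/ascii.txt", mode: "r:Windows-1252:UTF-8"), again assuming that code page is the right one.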
