Thursday, June 8, 2017

Java and UTF-8 encoding

Java has supported Unicode since day one. The professor who introduced me to Java said, "Look, you don't have to limit yourself to English in variable names and strings; you can use Unicode, which supports just about every language out there!" Ooh, now that's forward thinking. C++, the leading language of the day, had nothing like that. Then he added, "But unfortunately there aren't many Unicode editors out there yet." That was 20+ years ago.

Now there are Unicode editors, of course. But can you actually work with non-English text?

public class UnicodeTest {

    public static void main(String[] args) {
        String test = "你好,世界";

        System.out.println(test);
    }

}
First, compile with UTF-8 encoding:
 javac -encoding "UTF-8" UnicodeTest.java
What do you expect to see? Whatever is in the test variable, of course. And yes, I expect that is what you get when running it inside an IDE. But at the DOS prompt? Boom, I get junk characters.

The command prompt does not like it. In theory, switching the console code page (for example with chcp 65001, the UTF-8 code page) should fix it, but that did not work for me.
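A likely reason for the junk is a mismatch between the charset the JVM uses when encoding output and the code page the console uses when decoding it. The sketch below is only a diagnostic under that assumption: it prints what the JVM reports as its default charset. The class name ConsoleEncodingCheck and the helper defaultCharsetName are made up for illustration, and on recent JVMs the default may already be UTF-8, in which case the console side is the more likely culprit.

```java
import java.nio.charset.Charset;

public class ConsoleEncodingCheck {

    // Hypothetical helper: the charset the JVM falls back to
    // when no charset is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // May print null or a platform-specific name depending on the JVM.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

If the reported charset is not UTF-8 and the console is not set to the same code page, mangled output is what I would expect.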

Got a workaround... using a byte array.

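public class UnicodeTest {

    public static void main(String[] args) {
        String test = "你好,世界";

        // System.out.println(test);

        try {
            // Encode explicitly as UTF-8 and write the raw bytes,
            // bypassing the charset System.out would otherwise use.
            byte[] bytes = test.getBytes("UTF-8");
            System.out.write(bytes);
            System.out.flush();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}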
Wait a minute, still junk. But I can redirect the output to a file: java UnicodeTest > output.txt

Aha, I got what I need... in a file.
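For what it's worth, the byte-array detour can also be written as a PrintStream that encodes as UTF-8, so println keeps working. This is only a sketch, not a fix for the console itself: whether the prompt displays the text still depends on its code page and font. The class name Utf8PrintStreamSketch and the helper utf8 are hypothetical, and the string is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8PrintStreamSketch {

    // Hypothetical helper: wraps any OutputStream so that print/println
    // encode text as UTF-8, with autoflush turned on.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        // Same text as the test string above, written as escapes.
        out.println("\u4f60\u597d,\u4e16\u754c");
    }
}
```

Redirected to a file the same way (java Utf8PrintStreamSketch > output.txt), this should produce UTF-8 bytes just like the byte-array version.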