Reading UTF-8 files – FileReader or FileInputStream?

FileReader and FileInputStream are two APIs for reading data from files. FileReader is preferred when you are dealing with text files and want to read characters instead of bytes. Depending on the encoding, a character may be stored as a single byte or as a sequence of several bytes. FileInputStream, on the other hand, reads raw bytes, and it is your responsibility to convert them into valid characters.

Let's see some code.

I have a text file which contains 2 characters: Ůb. The first character is the special character ‘Ů’ and the second character is ‘b’.

First, let's analyze the character Ů. Its decimal value is 366, its hexadecimal value is 0x16E and its binary value is 101101110. Encoded in UTF-8, it is represented by the two bytes 11000101 10101110, which are 197 and 174 in decimal.
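To check these numbers yourself, here is a small sketch that asks Java for the UTF-8 bytes of Ů. The class name Utf8EncodeDemo and the helper utf8Bytes are made up for this illustration; only String.getBytes and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

class Utf8EncodeDemo {
    // Returns the UTF-8 bytes of s as unsigned values (0-255).
    static int[] utf8Bytes(String s) {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        int[] unsigned = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            unsigned[i] = raw[i] & 0xFF; // Java bytes are signed, so mask to 0-255
        }
        return unsigned;
    }

    public static void main(String[] args) {
        String s = "\u016E"; // Ů, written as an escape so the source file encoding does not matter
        System.out.println((int) s.charAt(0)); // 366
        for (int b : utf8Bytes(s)) {
            // prints "197 = 11000101" and then "174 = 10101110"
            System.out.println(b + " = " + Integer.toBinaryString(b));
        }
    }
}
```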

If you want to understand how characters are converted to UTF-8 bytes, read this: How are characters represented in UTF-8 encoding

In this code we will use FileInputStream to read this file:

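// Needs: import java.io.File; import java.io.FileInputStream;
File file = new File("c:/downloads/a.txt");
int ch = 0;
String str = "";
// try-with-resources closes the stream even if read() throws
try (FileInputStream fis = new FileInputStream(file)) {
    // read() returns one raw byte (0-255), or -1 at end of file
    while ((ch = fis.read()) != -1) {
        System.out.println(ch + "-" + (char) ch);
        str += (char) ch;
    }
}
System.out.println("Final-" + str);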

The output of this program would be:

197-Å
174-®
98-b
Final-Å®b

As expected, FileInputStream read the file one byte at a time and printed each byte. Because the special character Ů is stored as 2 bytes in UTF-8, and each of those bytes was cast to a char on its own, FileInputStream ended up printing the unrelated characters Å and ® instead of Ů.
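If you do have to start from raw bytes, the conversion to characters has to be done explicitly. The sketch below compares the per-byte cast used in the loop above with decoding the whole byte array as UTF-8. The class and method names are invented for the example, and the three bytes are hard-coded instead of being read from the file.

```java
import java.nio.charset.StandardCharsets;

class DecodeBytesDemo {
    // Mimics the loop above: every byte becomes its own char.
    static String castEachByte(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append((char) (b & 0xFF)); // mask because Java bytes are signed
        }
        return sb.toString();
    }

    // Decodes the whole array as UTF-8, so multi-byte sequences are combined.
    static String decodeUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 197, (byte) 174, (byte) 98}; // the bytes of "Ůb" in UTF-8
        System.out.println(castEachByte(data).length());   // 3
        System.out.println(decodeUtf8(data).length());     // 2
        System.out.println((int) decodeUtf8(data).charAt(0)); // 366
    }
}
```

Reading all bytes first and decoding once avoids splitting a multi-byte character, at the cost of holding the whole file in memory.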

Now let's use FileReader. FileReader decodes bytes using the default charset, so here we will assume the system encoding has been set to UTF-8.

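// Needs: import java.io.FileReader;
// FileReader decodes bytes using the default charset (assumed UTF-8 here)
FileReader fr = new FileReader(file);
ch = 0;
// read() returns one decoded character (as an int), or -1 at end of file
while ((ch = fr.read()) != -1) {
    System.out.println(ch + "-" + (char) ch);
}
fr.close();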

The output will be:

366-Ů
98-b

So it prints the proper character instead of 2 separate bytes for the special character Ů. Internally, FileReader recognizes from the first byte that this UTF-8 character is represented by 2 bytes, so it reads both bytes and decodes them into a single character. FileInputStream, in contrast, just reads byte by byte and leaves the decoding to you.
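To make the decoding step concrete, here is a hand-written sketch of how a 2-byte UTF-8 sequence is turned back into a code point. This is not FileReader's actual implementation (the JDK uses a general-purpose charset decoder), and the method decodeTwoByte is invented for this example. It only handles the 2-byte case, 110xxxxx 10xxxxxx.

```java
class Utf8TwoByteDecode {
    // Combines a 2-byte UTF-8 sequence (values 0-255) into one code point.
    static int decodeTwoByte(int first, int second) {
        // A 2-byte sequence has the shape 110xxxxx 10xxxxxx
        if ((first & 0xE0) != 0xC0 || (second & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a 2-byte UTF-8 sequence");
        }
        int high = first & 0x1F;  // low 5 bits of the first byte
        int low = second & 0x3F;  // low 6 bits of the second byte
        return (high << 6) | low;
    }

    public static void main(String[] args) {
        int codePoint = decodeTwoByte(197, 174);
        System.out.println(codePoint);                      // 366
        System.out.println(Integer.toHexString(codePoint)); // 16e
    }
}
```

For 197 and 174 the payload bits are 00101 and 101110, which join to 00101101110, that is 366.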

If the system encoding is different from the encoding of the file, we can specify the charset ourselves by using InputStreamReader:

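// Needs: import java.io.InputStreamReader; import java.io.Reader;
// The second argument names the charset used to decode the bytes
Reader fr = new InputStreamReader(new FileInputStream(file), "UTF-8");
ch = 0;
while ((ch = fr.read()) != -1) {
    System.out.println(ch + "-" + (char) ch);
}
fr.close();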
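For a version you can run without creating c:/downloads/a.txt first, here is a self-contained sketch of the same idea. It writes "Ůb" to a temporary file as UTF-8 and reads it back through an InputStreamReader with an explicit charset. The class name, the helper methods and the temp-file name are assumptions made for the example. It passes StandardCharsets.UTF_8 instead of the string "UTF-8", which avoids a lookup by name that could fail at runtime.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetDemo {
    // Reads the whole file as UTF-8, one character at a time.
    static String readAll(Path path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new FileInputStream(path.toFile()), StandardCharsets.UTF_8)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        }
        return sb.toString();
    }

    // Writes text to a temporary file as UTF-8 and reads it back.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return readAll(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = roundTrip("\u016Eb"); // "Ůb"
        System.out.println(text.length());        // 2
        System.out.println((int) text.charAt(0)); // 366
        System.out.println((int) text.charAt(1)); // 98
    }
}
```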

Uday Ogra

Connect with me at http://facebook.com/tendulkarogra and let's have some healthy discussion :)
