Channel: cyotek.com Blog Summary Feed

Manually writing the byte order mark (BOM) for an encoding into a stream


I recently discovered a problem with our WebCopy and Cyotek Sitemap Creator products involving "corruption" of plain text documents, where non-ANSI characters appeared incorrectly. It didn't take long to realize that these programs were saving text content as ANSI files, which I found curious, as the Crawler library they use detects the response encoding and uses this to save the files.

Or does it? Consider the code below:

string fileName;
byte[] data;
Encoding encoding;

fileName = Path.GetTempFileName();
data = new byte[0]; // assume you have a populated byte array!
encoding = Encoding.UTF8;

using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
  using (BinaryWriter writer = new BinaryWriter(stream, encoding))
    writer.Write(data);
}

Looking at this, you might be tempted to assume (as I did) that this code would save the content in the given encoding. When I opened one of the files generated by similar code in Notepad++, however, I found it was detected as an ANSI file. Switching the encoding to UTF-8 immediately displayed the file correctly, without the "corruption". So it seems the byte order mark (BOM) isn't actually written by the BinaryWriter - I think it only uses the given encoding when converting strings to bytes, so a raw byte array is written exactly as supplied. All this time I assumed files were being saved as UTF-8 (or whatever the response encoding was) and properly supported Unicode, and all this time I was wrong.
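To illustrate the point, here is a minimal sketch that writes to a MemoryStream instead of a file so the resulting bytes can be inspected. It is not taken from the Crawler library, and the sample content and variable names are made up for the example. It suggests that the byte array overload copies the data untouched, while the string overload uses the encoding but adds a length prefix rather than a BOM.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical sample content: U+00E9 encoded as UTF-8 is the two bytes C3 A9.
byte[] sample = Encoding.UTF8.GetBytes("\u00E9");
byte[] rawResult;
byte[] stringResult;

using (MemoryStream stream = new MemoryStream())
{
  using (BinaryWriter writer = new BinaryWriter(stream, Encoding.UTF8))
    writer.Write(sample); // byte[] overload: copied as-is, the encoding is not consulted
  rawResult = stream.ToArray(); // ToArray still works after the stream is closed
}

using (MemoryStream stream = new MemoryStream())
{
  using (BinaryWriter writer = new BinaryWriter(stream, Encoding.UTF8))
    writer.Write("\u00E9"); // string overload: length prefix followed by the encoded bytes
  stringResult = stream.ToArray();
}

Console.WriteLine(BitConverter.ToString(rawResult));    // C3-A9
Console.WriteLine(BitConverter.ToString(stringResult)); // 02-C3-A9
```

In neither case do the bytes EF BB BF (the UTF-8 BOM) appear at the start of the output.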

So how do you manually write a BOM into a document? The oddly named GetPreamble method of the Encoding class is what you need - it returns the bytes that comprise the BOM, which you can then write directly to your stream:

string fileName;
byte[] data;
Encoding encoding;

fileName = Path.GetTempFileName();
data = new byte[0]; // assume you have a populated byte array!
encoding = Encoding.UTF8;

using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
  using (BinaryWriter writer = new BinaryWriter(stream, encoding))
  {
    writer.Write(encoding.GetPreamble());
    writer.Write(data);
  }
}
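Since the response encoding won't always be UTF-8, it may help to see what GetPreamble returns for a few of the built-in encodings. The sketch below is only an illustration; the Describe helper is a hypothetical name introduced for this example. Encodings without a BOM return an empty array, so writing the preamble unconditionally should be harmless for them.

```csharp
using System;
using System.Text;

// Hypothetical helper: formats an encoding's preamble as hex, or "(none)" if it is empty.
string Describe(Encoding encoding)
{
  byte[] preamble = encoding.GetPreamble();
  return preamble.Length == 0 ? "(none)" : BitConverter.ToString(preamble);
}

Console.WriteLine(Describe(Encoding.UTF8));             // EF-BB-BF
Console.WriteLine(Describe(Encoding.Unicode));          // FF-FE (UTF-16, little endian)
Console.WriteLine(Describe(Encoding.BigEndianUnicode)); // FE-FF (UTF-16, big endian)
Console.WriteLine(Describe(Encoding.ASCII));            // (none)
Console.WriteLine(Describe(new UTF8Encoding(false)));   // (none): UTF-8 configured without a BOM
```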

Note that you only need to write a BOM if your document is actually supposed to be a text file - if it is "normal" binary data (such as an image or a gzip stream) then you definitely do not want to write a BOM, or you truly will have a corrupt file.
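One way to apply that rule is sketched below, under some assumptions: WriteContent and Save are hypothetical helpers, not part of WebCopy or the Crawler library, and the isText flag stands in for whatever content type check your own code performs. The sketch also skips the preamble when the data already starts with it, which is an extra safeguard of my own against writing the BOM twice.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: writes the preamble only for text content that lacks one.
void WriteContent(Stream stream, byte[] content, Encoding encoding, bool isText)
{
  byte[] preamble = encoding.GetPreamble();
  bool alreadyHasBom = preamble.Length > 0
    && content.Length >= preamble.Length
    && content.Take(preamble.Length).SequenceEqual(preamble);

  if (isText && preamble.Length > 0 && !alreadyHasBom)
    stream.Write(preamble, 0, preamble.Length);

  stream.Write(content, 0, content.Length);
}

// Hypothetical helper: runs WriteContent against memory so the result can be inspected.
byte[] Save(byte[] content, Encoding encoding, bool isText)
{
  using (MemoryStream stream = new MemoryStream())
  {
    WriteContent(stream, content, encoding, isText);
    return stream.ToArray();
  }
}

byte[] text = Encoding.UTF8.GetBytes("abc");
Console.WriteLine(BitConverter.ToString(Save(text, Encoding.UTF8, true)));  // EF-BB-BF-61-62-63
Console.WriteLine(BitConverter.ToString(Save(text, Encoding.UTF8, false))); // 61-62-63
```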

Now the files produced by WebCopy and Sitemap Creator are encoded correctly, and I can be happy with yet another bug squashed, unhappy at yet another reminder of why I need to write a proper set of automated tests for the libraries I use, but happy again that I had another (albeit brief) tip to post on this blog.

All content Copyright © by Cyotek Ltd or its respective writers. Permission to reproduce news and web log entries and other RSS feed content in unmodified form without notice is granted provided they are not used to endorse or promote any products or opinions (other than what was expressed by the author) and without taking them out of context. Written permission from the copyright owner must be obtained for everything else.
Original URL of this content is http://www.cyotek.com/blog/manually-writing-the-byte-order-mark-bom-for-an-encoding-into-a-stream?source=rss

