Reading UTF File with BOM to UTF-8 Encoded std::string in C++11 on Windows

My boss gave me a task that required a C++ function to read a text file into a UTF-8 encoded string. I can’t believe that something this easy in a high-level language like VB.NET took me two days to figure out in C++. Some solutions use libraries such as UTF8-CPP or ICU, while others use the Windows API. These approaches work, but I don’t quite like them because:

  1. Usability: some solutions work well for most characters but fail to handle non-BMP characters.
  2. Portability: we foresee our code being migrated to Linux, and every use of the Windows API adds to that future migration work.
  3. Dependency: I don’t want to depend on a third-party library, especially since the C++ function will later be integrated into a program that depends on very few third-party libraries.
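
To make the first point concrete: a character outside the Basic Multilingual Plane is stored in UTF-16 as a surrogate pair, and a correct converter has to turn that pair into a single 4-byte UTF-8 sequence. Below is a minimal sketch of that arithmetic. The helper names combineSurrogates and appendUtf8 are made up for illustration, and the sketch assumes its inputs are already valid.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
// Assumes hi is in 0xD800..0xDBFF and lo is in 0xDC00..0xDFFF (not checked).
std::uint32_t combineSurrogates(std::uint16_t hi, std::uint16_t lo)
{
	return 0x10000u
		+ ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
		+ (static_cast<std::uint32_t>(lo) - 0xDC00u);
}

// Hypothetical helper: append one code point to out as UTF-8.
// Assumes cp is a valid code point no larger than 0x10FFFF (not checked).
void appendUtf8(std::string &out, std::uint32_t cp)
{
	if (cp < 0x80u) {
		// 1 byte: plain ASCII
		out += static_cast<char>(cp);
	}
	else if (cp < 0x800u) {
		// 2 bytes
		out += static_cast<char>(0xC0u | (cp >> 6));
		out += static_cast<char>(0x80u | (cp & 0x3Fu));
	}
	else if (cp < 0x10000u) {
		// 3 bytes: the rest of the BMP
		out += static_cast<char>(0xE0u | (cp >> 12));
		out += static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
		out += static_cast<char>(0x80u | (cp & 0x3Fu));
	}
	else {
		// 4 bytes: non-BMP characters, which arrive as surrogate pairs in UTF-16
		out += static_cast<char>(0xF0u | (cp >> 18));
		out += static_cast<char>(0x80u | ((cp >> 12) & 0x3Fu));
		out += static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
		out += static_cast<char>(0x80u | (cp & 0x3Fu));
	}
}
```

For example, the pair 0xD83D 0xDE00 combines to U+1F600, which appendUtf8 writes as the four bytes F0 9F 98 80. A converter that treats each 16-bit unit as a separate character would instead emit two invalid 3-byte sequences.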

Surprisingly, I could not find many solutions on the Internet that I was satisfied with, but in the end I managed to work the code out myself. It still uses one Windows-specific function (_swab, which strictly speaking comes from the Microsoft C runtime rather than the Windows API), but porting one function to Linux is much easier than porting ten. Anyway, here it is:

// Reading ASCII, UTF-8, UTF-16LE, UTF-16BE with auto BOM detection using C++11 on Windows platform
// Code tested on Microsoft Visual Studio 2013 on Windows 7
// Part of the code is referencing http://cfc.kizzx2.com/index.php/reading-a-unicode-utf16-file-in-windows-c/

#include <stdio.h>
#include <tchar.h>
#include <string>
#include <fstream>
#include <sstream>
#include <locale>
#include <codecvt>
#include <iostream>
#include <io.h>
#include <fcntl.h>
#include <stdlib.h>  // declares _swab

#define TEXT_FILE_PATH      "D:\\test.txt"
#define ENCODING_ASCII      0
#define ENCODING_UTF8       1
#define ENCODING_UTF16LE    2
#define ENCODING_UTF16BE    3

std::string readFile(const std::string &path)
{
	std::string result;
	std::ifstream ifs(path.c_str(), std::ios::binary);
	std::stringstream ss;
	int encoding = ENCODING_ASCII;

	if (!ifs.is_open()) {
		// Unable to read file
		result.clear();
		return result;
	}
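	else if (ifs.peek() == std::ifstream::traits_type::eof()) {
		// Empty file. eof() alone is never true right after opening,
		// because the flag is only set once a read has been attempted
		result.clear();
	}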
	else {
		int ch1 = ifs.get();
		int ch2 = ifs.get();
		if (ch1 == 0xff && ch2 == 0xfe) {
			// The file contains UTF-16LE BOM
			encoding = ENCODING_UTF16LE;
		}
		else if (ch1 == 0xfe && ch2 == 0xff) {
			// The file contains UTF-16BE BOM
			encoding = ENCODING_UTF16BE;
		}
		else {
			int ch3 = ifs.get();
			if (ch1 == 0xef && ch2 == 0xbb && ch3 == 0xbf) {
				// The file contains UTF-8 BOM
				encoding = ENCODING_UTF8;
			}
			else {
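				// The file does not have BOM
				encoding = ENCODING_ASCII;
				// A file shorter than 3 bytes leaves eofbit/failbit set,
				// and seekg fails unless the state is cleared first
				ifs.clear();
				ifs.seekg(0);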
			}
		}
	}
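	// Read the rest of the file (everything after the BOM, if any) as raw bytes
	ss << ifs.rdbuf();
	std::string bytes = ss.str();
	if (encoding == ENCODING_UTF16LE || encoding == ENCODING_UTF16BE) {
		if (encoding == ENCODING_UTF16BE) {
			// Swap each byte pair into little-endian order. _swab belongs to the
			// Microsoft C runtime rather than the Windows API; POSIX has swab()
			std::string swapped = bytes;
			_swab(&bytes[0u], &swapped[0u], static_cast<int>(bytes.size()));
			bytes = swapped;
		}
		// Assumes a 16-bit wchar_t as on Windows; a trailing odd byte is dropped.
		// Copying with an explicit length keeps embedded NUL characters intact
		std::wstring wide(bytes.size() / sizeof(wchar_t), L'\0');
		std::char_traits<char>::copy(reinterpret_cast<char *>(&wide[0u]), bytes.data(), wide.size() * sizeof(wchar_t));
		// to_bytes throws std::range_error if the UTF-16 data is malformed
		std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utfconv;
		result = utfconv.to_bytes(wide);
	}
	else {
		// UTF-8 (BOM already skipped) and BOM-less files are returned unchanged
		result = bytes;
	}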
	return result;
}

You can also find the above code at https://gist.github.com/VeryCrazyDog/c20b2cb83896e9975d22
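
Since _swab is the only platform-specific call in readFile, a Linux port could probably replace it with a few lines of standard C++. Here is a minimal sketch of that idea. The helper name swapBytePairs is hypothetical and is not part of the gist.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: swap every pair of adjacent bytes in place,
// turning UTF-16BE bytes into UTF-16LE bytes (or the reverse).
// A trailing odd byte, if any, is left untouched.
void swapBytePairs(std::string &bytes)
{
	for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
		std::swap(bytes[i], bytes[i + 1]);
	}
}
```

In readFile, a call such as swapBytePairs(buffer) on the bytes read from the file would stand in for the _swab call in the UTF-16BE branch. Note that the conversion itself still assumes a 16-bit wchar_t, which does not hold on Linux, so that part would need attention as well.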