如何正确判断文本文件的字符编码?-C/C++问题

How to correctly determine character encoding of text files?(如何正确判断文本文件的字符编码?)

本文介绍了如何正确判断文本文件的字符编码?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

这是我的情况:我需要正确确定给定文本文件使用哪种字符编码.希望它可以正确返回以下类型之一:

Here is my situation: I need to correctly determine which character encoding is used for given text file. Hopefully, it can correctly return one of the following types:

enum CHARACTER_ENCODING
{
    ANSI,
    Unicode,
    Unicode_big_endian,
    UTF8_with_BOM,
    UTF8_without_BOM
};

到目前为止，我可以通过调用以下功能.如果给定的文本文件最初不是 UTF-8 without BOM，它也可以正确确定 ANSI.问题是当文本文件为UTF-8 without BOM时，下面的函数会误认为是ANSI文件.

Up to now, I can correctly tell a text file is Unicode, Unicode big endian or UTF-8 with BOM by calling the following function. It also can correctly determine for ANSI if the given text file is not originally a UTF-8 without BOM. The problem is that when the text file is UTF-8 without BOM, the following function will mistakenly regard it as a ANSI file.

CHARACTER_ENCODING get_text_file_encoding(const char *filename)
{
    CHARACTER_ENCODING encoding;

    unsigned char uniTxt[] = {0xFF, 0xFE};// Unicode file header
    unsigned char endianTxt[] = {0xFE, 0xFF};// Unicode big endian file header
    unsigned char utf8Txt[] = {0xEF, 0xBB};// UTF_8 file header

    DWORD dwBytesRead = 0;
    HANDLE hFile = CreateFile(filename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
    {
        hFile = NULL;
        CloseHandle(hFile);
        throw runtime_error("cannot open file");
    }
    BYTE *lpHeader = new BYTE[2];
    ReadFile(hFile, lpHeader, 2, &dwBytesRead, NULL);
    CloseHandle(hFile);

    if (lpHeader[0] == uniTxt[0] && lpHeader[1] == uniTxt[1])// Unicode file
        encoding = CHARACTER_ENCODING::Unicode;
    else if (lpHeader[0] == endianTxt[0] && lpHeader[1] == endianTxt[1])//  Unicode big endian file
        encoding = CHARACTER_ENCODING::Unicode_big_endian;
    else if (lpHeader[0] == utf8Txt[0] && lpHeader[1] == utf8Txt[1])// UTF-8 file
        encoding = CHARACTER_ENCODING::UTF8_with_BOM;
    else
        encoding = CHARACTER_ENCODING::ANSI;   //Ascii

    delete []lpHeader;
    return encoding;
}

这个问题困扰了我很久，还是找不到好的解决办法.任何提示将不胜感激.

This problem has blocked me for a long time and I still cannot find a good solution. Any hint will be appreciated.