How to read a UCS-2 file?
Problem description
I'm writing a program to extract information from a *.rc file encoded in UCS-2 Little Endian.
int _tmain(int argc, _TCHAR* argv[]) {
    wstring csvLine(wstring sLine);
    wifstream fin("en.rc");
    wofstream fout("table.csv");
    wofstream fout_rm("temp.txt");
    wstring sLine;
    fout << "en\n";
    while(getline(fin,sLine)) {
        if (sLine.find(L"IDS") == -1)
            fout_rm << sLine << endl;
        else
            fout << csvLine(sLine);
    }
    fout << flush;
    system("pause");
    return 0;
}
The first line in "en.rc" is #include <windows.h>, but sLine shows as below:
[0] 255 L'ÿ'
[1] 254 L'þ'
[2] 35 L'#'
[3] 0
[4] 105 L'i'
[5] 0
[6] 110 L'n'
[7] 0
[8] 99 L'c'
. .
. .
. .
This program works correctly for UTF-8. How can I make it work for UCS-2?
Recommended answer
Wide streams use a wide stream buffer to access the file. The wide stream buffer reads bytes from the file and uses its codecvt facet to convert these bytes to wide characters. The default codecvt facet is std::codecvt<wchar_t, char, std::mbstate_t>, which converts between the native character sets for wchar_t and char (i.e., like mbstowcs() does).
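The dump in the question is consistent with each byte of the file being widened to one wchar_t. A minimal sketch of that effect, assuming a plain byte-wise conversion (widen_bytewise is a hypothetical helper written for illustration, not what any particular standard library literally runs):

```cpp
#include <string>

// Hypothetical helper: turns every input byte into one wchar_t.
// This only imitates a byte-wise char -> wchar_t conversion; a real
// codecvt facet depends on the locale in use.
std::wstring widen_bytewise(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<wchar_t>(b));
    }
    return out;
}
```

Feeding it the first bytes of a UTF-16LE file (FF FE 23 00 69 00) gives six wide characters, 255, 254, '#', 0, 'i', 0, which matches the values shown for sLine.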
You're not using the native char character set, so what you want is a codecvt facet that reads UCS-2 as a multibyte sequence and converts it to wide characters.
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main(int argc, char *argv[])
{
    // You need to open the file in binary mode
    std::wifstream fin("en.rc", std::ios::binary);

    // Imbue the file stream with a codecvt facet that uses UTF-16 as the external multibyte encoding
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf16<wchar_t, 0xffff, std::consume_header>));
    // ^ We set 0xFFFF as the maxcode because that's the largest that will fit in a single wchar_t
    //   We use consume_header to detect and use the UTF-16 'BOM'

    // The following is not really the correct way to write Unicode output, but it's easy
    std::wstring sLine;
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    while (std::getline(fin, sLine))
    {
        std::cout << convert.to_bytes(sLine) << '\n';
    }
}
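The <codecvt> header and std::wstring_convert have been deprecated since C++17, so it may be worth knowing how to do the same decoding by hand. The following is only a sketch under stated assumptions: decode_utf16 and read_all_bytes are hypothetical helper names, the input is assumed to be little endian when no BOM is present, and surrogate pairs are passed through as two code units rather than validated.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads a whole file as raw bytes (binary mode).
std::string read_all_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: converts UTF-16 bytes into UTF-16 code units.
// A leading BOM (FF FE or FE FF) selects the byte order and is dropped.
// Assumption: without a BOM the data is treated as little endian.
std::u16string decode_utf16(const std::string& bytes) {
    bool little_endian = true;
    std::size_t pos = 0;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) {
            pos = 2;
        } else if (b0 == 0xFE && b1 == 0xFF) {
            little_endian = false;
            pos = 2;
        }
    }
    if ((bytes.size() - pos) % 2 != 0) {
        throw std::runtime_error("UTF-16 data has an odd number of bytes");
    }
    std::u16string out;
    out.reserve((bytes.size() - pos) / 2);
    for (; pos < bytes.size(); pos += 2) {
        const unsigned first = static_cast<unsigned char>(bytes[pos]);
        const unsigned second = static_cast<unsigned char>(bytes[pos + 1]);
        const unsigned unit = little_endian ? (first | (second << 8))
                                            : ((first << 8) | second);
        out.push_back(static_cast<char16_t>(unit));
    }
    return out;
}
```

A call such as decode_utf16(read_all_bytes("en.rc")) would then yield the file contents as a std::u16string, which can be split into lines on u'\n'. Using char16_t instead of wchar_t keeps the code unit size the same on every platform.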
Note that there's an issue with UTF-16 here. The purpose of wchar_t is for one wchar_t to represent one codepoint. However, Windows uses UTF-16, which represents some codepoints as two wchar_ts. This means that the standard API doesn't work very well with Windows.
The consequence here is that when the file contains a surrogate pair, codecvt_utf16 will read that pair, convert it to a single codepoint value greater than 16 bits, and then have to truncate the value to 16 bits to fit it in a wchar_t. This means this code really is limited to UCS-2. I've set the maxcode template parameter to 0xFFFF to reflect this.
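To see why a surrogate pair cannot fit in 16 bits, here is a small sketch of the standard UTF-16 arithmetic that combines the two code units into one codepoint. The function names are hypothetical and chosen only for this illustration.

```cpp
#include <cassert>

// High (lead) surrogates occupy 0xD800..0xDBFF.
bool is_high_surrogate(char16_t unit) {
    return unit >= 0xD800 && unit <= 0xDBFF;
}

// Low (trail) surrogates occupy 0xDC00..0xDFFF.
bool is_low_surrogate(char16_t unit) {
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Combines a surrogate pair into the codepoint it encodes.
// Each surrogate carries 10 bits; the result is offset by 0x10000,
// so it is always above 0xFFFF.
char32_t combine_surrogates(char16_t high, char16_t low) {
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, the pair 0xD83D 0xDE00 combines to U+1F600, which needs more than 16 bits and therefore cannot be stored in a single 16-bit wchar_t.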
There are a number of other problems with wchar_t, and you might want to just avoid it entirely: What's "wrong" with C++ wchar_t?