Does (w)ifstream support different encodings?
Question
When I read a text file into a wide character string (std::wstring) using a wifstream, does the stream implementation support different encodings, i.e. can it be used to read e.g. ASCII, UTF-8, and UTF-16 files?
If not, what would I have to do?
(I need to read the entire file, if that makes a difference.)
Recommended answer
C++ supports character encodings by means of std::locale and the facet std::codecvt. The general idea is that a locale object describes the aspects of the system that might vary from culture to culture and from (human) language to language. These aspects are broken down into facets, which are objects held by a locale that localization-dependent components (including I/O streams) consult to do their work. When you read from an istream or write to an ostream, each character passes through the facets of the stream's locale. The facets cover not only the encoding of Unicode types but also such varied features as how large numbers are written (e.g. with commas or periods), currency, time, capitalization, and a slew of other details.
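As a small illustration of how a facet changes what a stream does, the sketch below installs a custom std::numpunct facet so that large numbers are written with commas. The names comma_grouping and format_grouped are made up for this example; only the standard library calls are real.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: group digits in threes with ','.
struct comma_grouping : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats a number through a stream whose locale carries the custom facet.
std::string format_grouped(long value) {
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when no longer used.
    out.imbue(std::locale(std::locale::classic(), new comma_grouping));
    out << value;
    return out.str();
}
```

Calling format_grouped(1234567) should yield "1,234,567". A codecvt facet is installed in the same way, except that a file stream uses it to convert between the bytes in the file and the characters in memory.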
However, just because the facilities exist to do encodings doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even something as basic as the size of the character type you should read into (let alone the encoding part) is difficult: wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers differ on its size (Visual C++ uses 16 bits, while GNU C++ on Linux uses 32 bits). So you generally need to find external libraries to do the actual encoding.
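Whether wchar_t is wide enough on a given platform can be checked at compile time. This is only a sketch; the constant name is made up for illustration.

```cpp
#include <cwchar>

// True when a single wchar_t can represent every Unicode code point
// (the highest is U+10FFFF). Hypothetical name, not a library constant.
constexpr bool wchar_holds_any_code_point = WCHAR_MAX >= 0x10FFFF;
```

Where this is false, a std::wstring can only hold characters outside the Basic Multilingual Plane as UTF-16 surrogate pairs, so one wchar_t is not always one character.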
- iconv is generally acknowledged to be correct, but examples of how to bind it to the C++ mechanism are hard to find.
- jla3ep mentions libICU, which is very thorough, but its C++ API does not try to play nicely with the standard. (As far as I can tell: you can scan the examples to see if you can do better.)
The most straightforward example I can find that covers all the bases is from Boost's UTF-8 codecvt facet, with an example that specifically converts between UTF-8 and UCS-4 for use by IO streams. It looks like this, though I don't suggest just copying it verbatim. It takes a little more digging in the source to understand it (and I don't claim to):
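typedef wchar_t ucs4_t;  // assumes wchar_t is wide enough for UCS-4
std::locale old_locale;
// The new locale takes ownership of the facet.
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);
...
std::wifstream input_file("data.utf8");
input_file.imbue(utf8_locale);  // decode UTF-8 on the way in
ucs4_t item = 0;
while (input_file >> item) { ... }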
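Since the question asks about reading an entire file, here is a minimal sketch of the same idea using only the standard library. It swaps Boost's facet for std::codecvt_utf8 from <codecvt>, which exists since C++11 but is deprecated since C++17, so treat this as an illustration rather than a recommendation. The helper name read_utf8_file is made up for the example.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: reads a whole UTF-8 encoded file into a wide string.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream in;
    // Imbue before opening so the facet is in place for the first read.
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::binary);
    if (!in) return std::wstring();

    // Copy the decoded characters from the file buffer in one go.
    std::wstringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

On platforms where wchar_t is 16 bits, std::codecvt_utf8<wchar_t> cannot represent characters outside the Basic Multilingual Plane; std::codecvt_utf8_utf16<wchar_t> is the standard facet that produces UTF-16 in that case. Neither facet detects the file's encoding, so the caller still has to know that the file is UTF-8.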
To understand more about locales and how they use facets (including codecvt), take a look at the following:
- Nathan Myers has a thorough explanation of locales and facets. Myers was one of the designers of the locale concept. He has more formal documentation if you want to wade through it.
- Apache's Standard Library implementation (formerly RogueWave's) has a full list of facets.
- Chapter 14 of Nicolai Josuttis' The C++ Standard Library is devoted to the subject.
- Angelika Langer and Klaus Kreft's Standard C++ IOStreams and Locales devotes a whole book to it.