What exactly can wchar_t represent?
Question
According to cppreference.com's doc on wchar_t:
wchar_t - type for wide character representation (see wide strings). Required to be large enough to represent any supported character code point (32 bits on systems that support Unicode. A notable exception is Windows, where wchar_t is 16 bits and holds UTF-16 code units). It has the same size, signedness, and alignment as one of the integer types, but is a distinct type.
The standard says in [basic.fundamental]/5:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.
So, if I want to deal with Unicode characters, should I use wchar_t?
Equivalently, how do I know if a specific Unicode character is "supported" by wchar_t?
Recommended answer
"So, if I want to deal with Unicode characters, should I use wchar_t?"
First of all, note that the encoding does not force you to use any particular type to represent a character. You may use char to represent Unicode characters just as well as wchar_t; you only have to remember that in UTF-8 up to 4 chars together form one code point, while with wchar_t one code point takes 1 unit (UTF-32 on Linux, etc.) or up to 2 units working together (UTF-16 on Windows).
Next, there is no single definitive Unicode encoding. Some Unicode encodings use a fixed width to represent code points (like UTF-32); others (such as UTF-8 and UTF-16) have variable lengths. In UTF-8, for instance, the letter 'a' takes just 1 byte, while characters outside the ASCII range take 2 to 4 bytes.
So you have to decide what kind of characters you want to represent and then choose your encoding accordingly. Depending on the kind of characters you want to represent, this will affect the number of bytes your data will take. E.g. using UTF-32 to represent mostly English characters will lead to many zero bytes. UTF-8 is a better choice for many Latin-based languages, while UTF-16 is usually a better choice for East Asian languages.
Once you have decided on this, you should minimize the number of conversions and stay consistent with your decision.
In the next step, you may decide what data type is appropriate to represent the data (or what kind of conversions you may need).
If you would like to do text manipulation or interpretation on a code-point basis, char is certainly not the way to go if you have, e.g., Japanese kanji. But if you just want to pass your data along and regard it as nothing more than a sequence of bytes, you may just go with char.
The link to "UTF-8 Everywhere" was already posted as a comment, and I suggest you have a look there as well. Another good read is "What every programmer should know about encodings".
As of now, there is only rudimentary language support in C++ for Unicode (such as the char16_t and char32_t data types and the u8/u/U literal prefixes). So choosing a library for managing encodings (especially conversions) is certainly good advice.