Using Unicode in C++ source code

Question

What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?

For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)

Furthermore, can I use Unicode for strings? For example:

wstring str = L"Strange chars: â Țđ ě €€";

Recommended answer

Encoding in C++ is quite complicated. Here is my understanding of it.

Every implementation has to support the characters of the basic source character set. These are the common characters listed in §2.2/1 (§2.3/1 in C++11), and each of them has to fit into one char. In addition, implementations have to support a way to name other characters, called universal-character-names. They look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).
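As a rough illustration of universal-character-names, here is a minimal sketch that spells the non-ASCII characters from the question with \u and \U escapes, so the source file itself stays pure ASCII. It assumes a C++11 compiler (for the char32_t literals). The function names are made up for this example and are not part of any library.

```cpp
#include <cassert>
#include <string>

// U+20AC (EURO SIGN) written with the 4-digit form \uXXXX.
inline std::wstring euro_short() { return L"\u20AC"; }

// The same character written with the 8-digit form \UXXXXXXXX.
inline std::wstring euro_long() { return L"\U000020AC"; }

// char32_t literals (C++11) expose the code point value directly.
inline bool ucn_forms_agree() {
    return U'\u20AC' == U'\U000020AC' && U'\u20AC' == 0x20AC;
}

// The string from the question, with every non-ASCII character replaced
// by its universal-character-name:
//   U+00E2 (a with circumflex), U+021A (T with comma below),
//   U+0111 (d with stroke),     U+011B (e with caron), U+20AC (euro sign)
inline std::wstring question_string() {
    return L"Strange chars: \u00E2 \u021A\u0111 \u011B \u20AC\u20AC";
}

// Hypothetical helper that bundles the checks; call it from main() if wanted.
inline void self_check() {
    assert(euro_short() == euro_long());
    assert(ucn_forms_agree());
}
```

Written this way, the meaning of the literal no longer depends on how the compiler decodes the source file. What the resulting wchar_t values look like still depends on the wide execution character set.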

This is all nice, but the mapping from the characters in the file to source characters (used at compile time) is implementation-defined. This mapping constitutes the encoding used. Here is what the standard says literally (C++98 version):

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

For gcc, you can change this mapping using the option -finput-charset=charset. Additionally, you can change the execution character set used to represent values at runtime. The proper options for this are -fexec-charset=charset for char (it defaults to UTF-8) and -fwide-exec-charset=charset for wchar_t (which defaults to either UTF-16 or UTF-32, depending on the size of wchar_t).
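To see the effect of the execution character set, one option is to dump the bytes that a narrow literal ends up with. The following is only a sketch: hex_bytes and euro_bytes are hypothetical helpers written for this example, and the byte values mentioned afterwards assume gcc with the charsets named there.

```cpp
#include <cstddef>
#include <string>

// Formats the bytes of a narrow string as uppercase hex, separated by spaces.
inline std::string hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}

// U+20AC is written as a universal-character-name, so the result depends
// only on the execution character set, not on the source file encoding.
inline std::string euro_bytes() { return hex_bytes("\u20AC"); }
```

With gcc's default execution character set (UTF-8), euro_bytes() should yield the three bytes E2 82 AC. Compiling the same file with -fexec-charset=ISO-8859-15 should yield the single byte A4 instead, assuming the iconv installation used by gcc knows that charset name.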
