Python returning the wrong length of string when using special characters
I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Is there any possible way in Python to have a character like ë́ counted as a single character?
I'm using UTF-8 encoding for the actual code and for the web page it is output to.
edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a form of Native American language) and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (itself and surrounding letters) and other characteristics, such as accents and other diacritic markings.
UTF-8 is a Unicode encoding that uses more than one byte for special characters. If you don't want the length of the encoded string, simply decode it and call len() on the unicode object (and not on the str object!).
Here are some examples:
>>> # creates a str literal (with utf-8 encoding, if this was
>>> # specified at the beginning of the file):
>>> len('ë́aúlt')
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt')
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt')
9
>>> # you can convert any str to a unicode object by calling decode() on it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8'))
6
Of course, you can also access single characters in a unicode object just as you would in a str object (both inherit from basestring and therefore have the same methods):
>>> test = u'ë́aúlt'
>>> print test[0]
ë
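Indexing as above returns ë without its acute accent, because the accent is a separate combining code point (U+0301); this is the double counting the question describes. One possible way to treat ë́ as a single unit is to attach each combining mark to the base character before it. The following is a rough sketch, assuming Python 3 and only the standard unicodedata module; split_clusters is a hypothetical helper name, and the approach is a simplification rather than full Unicode grapheme segmentation.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: a simplification, not full grapheme segmentation.
    """
    clusters = []
    # NFC first, so marks that have a precomposed form are merged already.
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch):
            # Non-zero combining class: attach to the previous cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "\u00eb\u0301a\u00falt"   # the string from the question
clusters = split_clusters(word)
print(len(clusters))             # 5
print(clusters[0])               # the base letter together with its accent
```

With a list of clusters, position-based rewrite rules can index letters together with their diacritics. Scripts with more complex clustering rules would likely need a proper grapheme segmentation library instead.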
If you develop localized applications, it's generally a good idea to use only unicode objects internally, by decoding all inputs you get. After the work is done, you can encode the result again as 'UTF-8'. If you keep to this principle, you will never see your server crash because of the internal UnicodeDecodeErrors you might otherwise get ;)
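The decode on input, encode on output principle might look like this in practice. This is a minimal sketch, assuming Python 3 and UTF-8 at both boundaries; handle_request is a hypothetical function name, and the upper-casing step merely stands in for whatever text processing the application does.

```python
def handle_request(raw_bytes):
    """Hypothetical handler: bytes in, bytes out, text only in between."""
    text = raw_bytes.decode("utf-8")   # decode once, at the input boundary
    result = text.upper()              # internal work happens on text only
    return result.encode("utf-8")      # encode once, at the output boundary


response = handle_request(b"\xc3\xab\xcc\x81a\xc3\xbalt")
print(response.decode("utf-8"))
```

Keeping the encoding steps at the edges means no function in the middle has to guess whether it received bytes or text.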
PS: Please note that the str and unicode data types changed significantly in Python 3. In Python 3 there are only unicode strings and plain byte strings, which can no longer be mixed. That should help avoid common pitfalls with unicode handling...
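As a small illustration of that point, assuming Python 3, concatenating text and bytes raises a TypeError instead of triggering an implicit decode; the variable names are only for the example.

```python
text = "\u00falt"              # a str (unicode text)
data = text.encode("utf-8")    # a bytes object

mixed_error = None
try:
    combined = text + data     # mixing str and bytes is rejected
except TypeError as exc:
    mixed_error = exc
    print("cannot mix:", exc)

# An explicit decode makes the intent clear and succeeds.
combined = text + data.decode("utf-8")
print(combined)
```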
Regards, Christoph