Character detection in a text file in Python using the Universal Encoding Detector (chardet)
Problem description
I am trying to use the Universal Encoding Detector (chardet) in Python to detect the most probable character encoding in a text file ('infile') and use that in further processing.
While chardet is designed primarily for detecting the character encoding of webpages, I have found an example of it being used on individual text files.
However, I cannot work out how to tell the script to set the most likely character encoding to the variable 'charenc' (which is used several times throughout the script).
My code, based on a combination of the aforementioned example and chardet's own documentation, is as follows:
import chardet
rawdata=open(infile,"r").read()
chardet.detect(rawdata)
Character detection is necessary as the script goes on to run the following (as well as several similar uses):
inF=open(infile,"rb")
s=unicode(inF.read(),charenc)
inF.close()
Any help would be greatly appreciated.
Recommended answer
chardet.detect() returns a dictionary that provides the encoding as the value associated with the key 'encoding'. So you can do this:
import chardet
rawdata = open(infile, 'rb').read()
result = chardet.detect(rawdata)
charenc = result['encoding']
The chardet documentation is not explicit about whether text strings and/or byte strings are supposed to work with the module, but it stands to reason that if you have a text string you don't need to run character detection on it, so you should probably be passing byte strings. Hence the binary mode flag (b) in the call to open(). But chardet.detect() might also work with a text string depending on which versions of Python and of the library you're using, i.e. if you omit the b you might find that it works anyway, even though you're technically doing something wrong.
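To tie this back to the question's later use of charenc, here is a minimal sketch that detects the encoding and decodes the file in one step. The function name read_text, the fallback parameter, and the choice of errors="replace" are assumptions made for illustration; they are not part of chardet or of the original script. The sketch also assumes that 'encoding' may be None when chardet cannot make a guess (for example, on an empty file).

```python
import chardet


def read_text(path, fallback="utf-8"):
    """Guess a file's encoding with chardet and return (text, encoding).

    Hypothetical helper: the name and the fallback default are
    illustrative assumptions, not part of chardet.
    """
    # Binary mode, so chardet sees the raw, undecoded bytes.
    with open(path, "rb") as f:
        rawdata = f.read()

    result = chardet.detect(rawdata)

    # result["encoding"] can be None when no guess is possible,
    # so fall back to a default in that case.
    charenc = result["encoding"] or fallback

    # Decoding the bytes plays the role of unicode(inF.read(), charenc)
    # in the question. errors="replace" is an assumption: it avoids an
    # exception if the guess turns out to be slightly wrong.
    return rawdata.decode(charenc, errors="replace"), charenc
```

Calling text, charenc = read_text(infile) then gives both the decoded contents and a charenc value that can be reused elsewhere in the script. The detected name is only a guess, so it may be worth checking result['confidence'] before relying on it.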