Getting duplicate keys in YAML using Python

Question

We need to parse YAML files that contain duplicate keys, and every occurrence must be preserved; skipping duplicates is not enough. I know this is against the YAML spec and I would prefer not to do it, but a third-party tool we use allows this usage and we have to deal with it.

Example file:

build:
  step: 'step1'

build:
  step: 'step2'

After parsing we should have a similar data structure to this:

yaml.load('file.yml')
# [('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

dict can no longer be used to represent the parsed contents.
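
To illustrate why a plain dict is not enough, here is a minimal sketch using only the standard library. The `pairs` value is the target structure from above, and `group_pairs` is a hypothetical helper name, not part of any YAML library:

```python
from collections import defaultdict

# The desired parse result: a list of (key, value) pairs keeps both entries.
pairs = [('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

# Converting to a dict keeps only the last value for each key.
as_dict = dict(pairs)


def group_pairs(pairs):
    """Collect the values of repeated keys into lists, preserving order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


grouped = group_pairs(pairs)
print(as_dict)
print(grouped)
```

Grouping like this is one way to get back to a dict-shaped structure without losing any of the duplicate entries.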

I am looking for a solution in Python but have not found a library that supports this. Have I missed anything?

Alternatively, I am happy to write my own parser, but would like to keep it as simple as possible. ruamel.yaml looks like the most advanced YAML parser in Python and it looks moderately extensible. Can it be extended to support duplicate fields?

Recommended answer

PyYAML will just silently overwrite the first entry. ruamel.yaml¹ will give a DuplicateKeyFutureWarning if used with the legacy API, and raise a DuplicateKeyError with the new API.

If you don't want to create a full Constructor for all types, overwriting the mapping constructor in SafeConstructor should do the job:

import sys
from ruamel.yaml import YAML
from ruamel.yaml.constructor import SafeConstructor

yaml_str = """
build:
  step: 'step1'

build:
  step: 'step2'
"""


def construct_yaml_map(self, node):
    # collect every (key, value) pair in document order, keeping duplicate keys
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))


SafeConstructor.add_constructor(u'tag:yaml.org,2002:map', construct_yaml_map)
yaml = YAML(typ='safe')
data = yaml.load(yaml_str)
print(data)

which gives:

[('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

However it doesn't seem necessary to make step: 'step1' into a list. The following will only create the list if there are duplicate items (could be optimised if necessary, by caching the result of the self.construct_object(key_node, deep=True)):

def construct_yaml_map(self, node):
    # test if there are duplicate node keys
    keys = set()
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        if key in keys:
            break
        keys.add(key)
    else:
        data = {}  # type: Dict[Any, Any]
        yield data
        value = self.construct_mapping(node)
        data.update(value)
        return
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))

which gives:

[('build', {'step': 'step1'}), ('build', {'step': 'step2'})]
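
With this second construct_yaml_map, a mapping comes back either as a dict (no duplicates) or as a list of tuples (duplicates present), so consuming code has to handle both shapes. A minimal sketch of one way to do that follows; `iter_items` is a hypothetical helper, and it assumes safe loading, where genuine YAML sequences become lists that never contain tuples:

```python
def iter_items(node):
    """Yield (key, value) pairs from a dict or from a list of 2-tuples."""
    if isinstance(node, dict):
        yield from node.items()
    elif isinstance(node, list) and all(
        isinstance(item, tuple) and len(item) == 2 for item in node
    ):
        yield from node
    else:
        raise TypeError(f"not a mapping-like node: {node!r}")


# Shape produced by the second construct_yaml_map for the example document.
data = [('build', {'step': 'step1'}), ('build', {'step': 'step2'})]

# Walk both levels uniformly, regardless of which shape each level has.
steps = [
    step_value
    for key, build in iter_items(data)
    if key == 'build'
    for step_key, step_value in iter_items(build)
    if step_key == 'step'
]
print(steps)
```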

Some points:

  • Probably needless to say, this will not work with YAML merge keys (<<: *xyz)
  • If you need ruamel.yaml's round-trip capabilities (yaml = YAML()), that will require a more complex construct_yaml_map.
  • If you want to dump the output, you should instantiate a new YAML() instance for that, instead of re-using the "patched" one used for loading (it might work, this is just to be sure):

yaml_out = YAML(typ='safe')
yaml_out.dump(data, sys.stdout)

which gives (with the first construct_yaml_map):

- - build
  - - [step, step1]
- - build
  - - [step, step2]

  • What works in neither PyYAML nor ruamel.yaml is yaml.load('file.yml'), because that parses the string 'file.yml' itself rather than the file's contents. If you don't want to open() the file yourself you can do:

    from pathlib import Path  # or: from ruamel.std.pathlib import Path
    yaml = YAML(typ='safe')
    yaml.load(Path('file.yml'))
    

¹ Disclaimer: I am the author of that package.
