Karate HTML parsing throws SAXException when the document begins with a lower-case <!doctype


Problem description

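I am trying to run a Karate test that calls GET on a URL, but I have found that when the site returns the <!doctype declaration in lower case (which is perfectly acceptable in 'normal' HTML), the Karate XML parser throws a fatal error and a warning. As far as I can tell, Karate uses an XML parser, so strictly speaking this may be correct behaviour, since a lower-case doctype is not valid XML. However, I cannot find a way around this for valid HTML. I have tried different headers and so on, but cannot get past it.

I have included a small test; conveniently, google.com also returns the lower-case declaration:

Example test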

Given url 'http://www.google.com'
When method GET
Then status 200

Error

[Fatal Error] :1:3: The markup in the document preceding the root element must be well-formed.
15:19:45.267 [main] WARN com.intuit.karate.FileUtils - parsing failed: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 3; The markup in the document preceding the root element must be well-formed.

<!doctype html><html .... blah

I downloaded the Karate source code and found where the warning is reported:

FileUtils.java

public static String toPrettyString(String raw) {
    raw = StringUtils.trimToEmpty(raw);
    try {
        if (Script.isJson(raw)) {
            return JsonUtils.toPrettyJsonString(JsonUtils.toJsonDoc(raw));
        } else if (Script.isXml(raw)) {
            return XmlUtils.toString(XmlUtils.toXmlDoc(raw), true);
        }
    } catch (Exception e) {
        logger.warn("parsing failed: {}", e.getMessage());
    }
    return raw;
}

The check appears to choose between JSON and XML by looking at the first character of the returned document:

Script.java

public static final boolean isXml(String text) {
    return text.startsWith("<");
}

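XmlUtils.java

I then think builder.parse fails because the document is not valid XHTML, since the comment further down implies that the <!doctype would only be removed later, in the recursive call.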

public static Document toXmlDoc(String xml) {
    ...

    Document doc = builder.parse(is);
    if (dtdEntityResolver.dtdPresent) { // DOCTYPE present
        // the XML was not parsed, but I think it hangs at the root as a text node
        // so conversion to string and back has the effect of discarding the DOCTYPE !
        return toXmlDoc(toString(doc, false));

Is it possible to make this flow work for valid HTML?
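
The behaviour described above can probably be reproduced outside Karate with the JDK's own XML parser. The following is a minimal sketch, not Karate's actual code: the class name DoctypeCheck and both method names are made up for illustration, and only the first-character test is copied from the isXml snippet quoted earlier.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Karate.
public class DoctypeCheck {

    // Same first-character test as the isXml method quoted in the question.
    static boolean looksLikeXml(String text) {
        return text.startsWith("<");
    }

    // Returns null if the JDK XML parser accepts the text,
    // otherwise the message of the exception it raised.
    static String parseError(String text) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(text)));
            return null;
        } catch (Exception e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String lower = "<!doctype html><html><body>hi</body></html>";
        String upper = "<!DOCTYPE html><html><body>hi</body></html>";
        // Both strings start with '<', so both are treated as XML candidates.
        System.out.println(looksLikeXml(lower));
        // XML keywords are case-sensitive, so only the upper-case form parses.
        System.out.println("lower-case: " + parseError(lower));
        System.out.println("upper-case: " + parseError(upper));
    }
}
```

The point of the sketch is that the first-character check cannot tell HTML from XML, so any HTML response reaches the strict parser, where the case of the doctype keyword matters.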

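Recommended answer

If you look at the logs, Karate also tells you that it has kept the complete response as a string (available in the response variable), even though it failed to "type-cast" it to XML. By the way, you even have a byte array in responseBytes. So it is now up to you to do whatever you want with it; for example, in theory you could find a "lenient" HTML parser and get a DOM tree or something similar.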

Given url 'http://www.google.com'
When method GET
Then status 200
* print response

A couple of hints: you can try a string replace on the response and then try to type-cast it to XML, see: https://github.com/intuit/karate#type-conversion
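
As a rough illustration of the string-replace idea, the sketch below strips a leading doctype in any letter case and then attempts an XML parse. It is written in plain Java rather than Karate syntax; the class DoctypeStripper and its methods are hypothetical names, and the regular expression assumes a simple doctype with no internal subset. Real-world HTML is often not well-formed XML even without the doctype, so the parse can still fail.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Karate.
public class DoctypeStripper {

    // Removes one leading doctype declaration, matched case-insensitively.
    static String stripDoctype(String html) {
        return html.replaceFirst("(?i)^\\s*<!doctype[^>]*>", "");
    }

    // Parses the stripped text as XML and returns the root element name,
    // or null if the remaining markup is not well-formed XML.
    static String rootName(String html) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(stripDoctype(html))));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Well-formed markup parses once the doctype is gone.
        System.out.println(rootName("<!doctype html><html><body/></html>"));
        // An unclosed <br> is fine in HTML but still breaks an XML parser.
        System.out.println(rootName("<!doctype html><html><br></html>"));
    }
}
```

A helper like this could in principle be called from a feature file through Karate's Java interop, but whether the cleaned response then converts to XML depends entirely on how well-formed the page is.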

Alternatively, maybe all you need to do is scrape some data, and some plain regex matching may do the trick; see these:

https://stackoverflow.com/a/53682733/143475

https://stackoverflow.com/a/50372295/143475
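
To make the regex suggestion concrete, here is a small sketch that pulls the page title out of the raw response string without any XML parsing. It is an assumption-laden example rather than what the linked answers do: the class TitleScraper, the method extractTitle, and the choice of the title element are all illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Karate.
public class TitleScraper {

    // Matches the first <title> element; case-insensitive, and DOTALL so
    // that a title spanning several lines is still captured.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null when no title is present.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<!doctype html><html><head><title>Google</title></head></html>";
        System.out.println(extractTitle(html));
    }
}
```

Because the doctype is never interpreted, its letter case makes no difference to this approach; the trade-off is that a regex only suits simple, predictable fragments of a page.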
