Detecting the encoding of XML documents with SAX and XNI

Posted by look_w on 2018-7-11 09:24


XML is defined in terms of Unicode characters. To be transmitted and stored on modern computers, those Unicode characters must be encoded as bytes, which the parser then decodes. Many encoding schemes serve this purpose: UTF-8, UTF-16, ISO-8859-1, Cp1252, SJIS, and others.

Frequently used acronyms

API: Application programming interface
HTTP: Hypertext Transfer Protocol
W3C: World Wide Web Consortium
XML: Extensible Markup Language

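Usually, though not always, you don't actually care about the underlying encoding. The XML parser converts the document, whatever encoding it was written in, into Unicode strings and character arrays, and your program operates on the decoded strings. This article discusses the less common cases in which you really do care about the underlying encoding: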
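To make the point about bytes versus characters concrete, here is a minimal sketch (the class name EncodingBytesDemo and the helper roundTrip are made up for illustration and are not part of the original article). It encodes the same four-character string in three of the encodings named above and shows that the byte sequences differ while the decoded string stays the same.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: same characters, different bytes per encoding.
final class EncodingBytesDemo {

    // Encode the text with the given charset, then decode it again.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // the last character is U+00E9
        Charset[] charsets = {
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE,
            StandardCharsets.ISO_8859_1
        };
        for (Charset cs : charsets) {
            byte[] bytes = text.getBytes(cs);
            // The byte count and values depend on the charset;
            // the decoded string does not.
            System.out.println(cs.name() + ": " + bytes.length + " bytes "
                    + Arrays.toString(bytes)
                    + ", decodes back correctly: " + roundTrip(text, cs).equals(text));
        }
    }
}
```

Once the bytes have been decoded, the program sees the same string in every case, which is why the encoding normally stays out of sight.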

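The most common case is wanting to preserve the input encoding for the output.
Another case is storing the document in a database as a string or a character large object (CLOB) without parsing it.
Similarly, some systems transmit XML documents over HTTP without reading them completely, yet still need to set the HTTP Content-type header so that it specifies the correct encoding. In that situation you need to know how the document is encoded.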

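Most of the time you know the encoding of documents you wrote yourself. But when you didn't write the document and merely received it from somewhere else (for example, from an Atom feed), the best approach is to use a streaming API such as the Simple API for XML (SAX), the Streaming API for XML (StAX), System.Xml.XmlReader, or the Xerces Native Interface (XNI). You can also use a tree API such as the Document Object Model (DOM). However, tree APIs have to read the entire document, even though the first 100 bytes (or fewer) are usually enough to determine the encoding. A streaming API can read only as much as it needs and stop parsing as soon as it has the answer, which is much more efficient.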
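The rest of this article concentrates on SAX and XNI, but the same idea can be sketched with StAX, which is mentioned above. The following example is only an assumption-laden illustration: the class StaxEncodingPeek and the method declaredEncoding are hypothetical names, and it relies on the standard javax.xml.stream method XMLStreamReader.getCharacterEncodingScheme(), which is specified to return the encoding named in the XML declaration (or null if there is none). The reader is closed without ever asking for the document body.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: ask a StAX reader for the declared encoding and stop.
final class StaxEncodingPeek {

    // Returns the encoding named in the XML declaration, or null if none is declared.
    static String declaredEncoding(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            // No call to next(): the rest of the document is never requested.
            return reader.getCharacterEncodingScheme();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(new ByteArrayInputStream(doc)));
    }
}
```

How much input a particular StAX implementation buffers before answering is implementation-specific, so treat this as a sketch rather than a guarantee about bytes read.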
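SAX

Most current SAX parsers, including the one bundled with Sun's Java software development kit (JDK) 6, can be used to detect the encoding. The technique is not difficult, but it isn't obvious either. Briefly: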

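In the setDocumentLocator method, cast the Locator argument to Locator2.
Save the Locator2 object in a field.
In the startDocument method, call the getEncoding() method of the Locator2 field.
(Optional) If that is all you wanted to know, throw a SAXException to end parsing early.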

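Listing 1 demonstrates the technique with a simple program that prints the encoding of every URL given on the command line.
Listing 1. Using SAX to determine a document's encoding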

import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;

import java.io.IOException;

public class SAXEncodingDetector extends DefaultHandler {

   public static void main(String[] args) throws SAXException, IOException {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      SAXEncodingDetector handler = new SAXEncodingDetector();
      parser.setContentHandler(handler);
      for (int i = 0; i < args.length; i++) {
         try {
            parser.parse(args[i]);   // each argument is the URL of one document
         }
         catch (SAXException ex) {
            // startDocument() always throws, so control ends up here
            System.out.println(handler.encoding);
         }
      }
   }

   private String encoding;
   private Locator2 locator;

   @Override
   public void setDocumentLocator(Locator locator) {
      if (locator instanceof Locator2) {
         this.locator = (Locator2) locator;
      }
      else {
         // the parser does not support Locator2
         this.encoding = "unknown";
      }
   }

   @Override
   public void startDocument() throws SAXException {
      if (locator != null) {
         this.encoding = locator.getEncoding();
      }
      // nothing more is needed, so stop parsing here
      throw new SAXException("Early termination");
   }

}
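
One limitation of catching every SAXException is that a genuine parse error looks the same as the deliberate early exit. The variant below is a sketch of one way around that, not the article's own code: the class InMemorySaxEncodingProbe, its detect method, and the stoppedEarly flag are all hypothetical names, and it uses the JAXP SAXParserFactory with an in-memory byte array purely so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative variant: tells the deliberate early exit apart from real errors.
final class InMemorySaxEncodingProbe extends DefaultHandler {

    private Locator2 locator;      // stays null if the parser offers no Locator2
    private String encoding;       // stays null if the parser reports nothing
    private boolean stoppedEarly;  // set just before the deliberate exception

    @Override
    public void setDocumentLocator(Locator locator) {
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
    }

    @Override
    public void startDocument() throws SAXException {
        if (locator != null) {
            encoding = locator.getEncoding();
        }
        stoppedEarly = true;
        throw new SAXException("Early termination");
    }

    // Returns the reported encoding, or "unknown" when the parser gives none.
    static String detect(byte[] document)
            throws SAXException, IOException, ParserConfigurationException {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        InMemorySaxEncodingProbe handler = new InMemorySaxEncodingProbe();
        reader.setContentHandler(handler);
        try {
            reader.parse(new InputSource(new ByteArrayInputStream(document)));
        } catch (SAXException ex) {
            if (!handler.stoppedEarly) {
                throw ex; // a real parse error, not our own signal
            }
        }
        return handler.encoding != null ? handler.encoding : "unknown";
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(doc));
    }
}
```

Creating a fresh handler per document also avoids carrying a stale encoding value over from a previous parse.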

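This approach works about 90% of the time, possibly a little more. However, SAX parsers are not required to support the Locator interface, let alone Locator2 and other extensions. If you know you are using Xerces, a second option is XNI.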
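Xerces Native Interface

Using XNI is very similar to using SAX (in fact, in Xerces the SAX parser is a thin layer on top of the native XNI parser). If anything, this approach is easier, because the encoding is passed directly to startDocument() as an argument. You only have to read it, as Listing 2 shows.
Listing 2. Using XNI to determine a document's encoding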

import java.io.IOException;
import org.apache.xerces.parsers.*;
import org.apache.xerces.xni.*;
import org.apache.xerces.xni.parser.*;

public class XNIEncodingDetector extends XMLDocumentParser {

   public static void main(String[] args) throws XNIException, IOException {
      XNIEncodingDetector parser = new XNIEncodingDetector();
      for (int i = 0; i < args.length; i++) {
         try {
            // arguments: public ID, system ID (the URL), base system ID
            XMLInputSource document = new XMLInputSource("", args[i], "");
            parser.parse(document);
         }
         catch (XNIException ex) {
            // startDocument() always throws, so control ends up here
            System.out.println(parser.encoding);
         }
      }
   }

   private String encoding = "unknown";

   @Override
   public void startDocument(XMLLocator locator, String encoding,
         NamespaceContext context, Augmentations augs)
         throws XNIException {
      this.encoding = encoding;
      // nothing more is needed, so stop parsing here
      throw new XNIException("Early termination");
   }

}

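Note that, for reasons unknown, this technique works only with the actual Xerces classes in org.apache.xerces, not with the repackaged Xerces classes in com.sun.org.apache.xerces.internal that are bundled with Sun's JDK 6.
XNI offers one more capability that SAX lacks. In rare cases, the encoding declared in the XML declaration is not the actual encoding. SAX reports only the actual encoding, but XNI can also tell you the declared encoding, through the xmlDecl() method, as Listing 3 shows.
Listing 3. Using XNI to determine a document's declared and actual encodings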

import java.io.IOException;
import org.apache.xerces.parsers.*;
import org.apache.xerces.xni.*;
import org.apache.xerces.xni.parser.*;

public class AdvancedXNIEncodingDetector extends XMLDocumentParser {

   public static void main(String[] args) throws XNIException, IOException {
      AdvancedXNIEncodingDetector parser = new AdvancedXNIEncodingDetector();
      for (int i = 0; i < args.length; i++) {
         try {
            // arguments: public ID, system ID (the URL), base system ID
            XMLInputSource document = new XMLInputSource("", args[i], "");
            parser.parse(document);
         }
         catch (XNIException ex) {
            // startElement() always throws, so control ends up here
            System.out.println("Actual: " + parser.actualEncoding);
            System.out.println("Declared: " + parser.declaredEncoding);
         }
      }
   }

   private String actualEncoding = "unknown";
   private String declaredEncoding = "none";

   @Override
   public void startDocument(XMLLocator locator, String encoding,
         NamespaceContext namespaceContext, Augmentations augs)
         throws XNIException {
      this.actualEncoding = encoding;
      this.declaredEncoding = "none"; // reset
   }

   // this method is not called if there's no XML declaration
   @Override
   public void xmlDecl(String version, String encoding,
         String standalone, Augmentations augs) throws XNIException {
      this.declaredEncoding = encoding;
   }

   // the XML declaration, if any, precedes the root element,
   // so by now both values are known and parsing can stop
   @Override
   public void startElement(QName element, XMLAttributes attributes,
         Augmentations augs) throws XNIException {
      throw new XNIException("Early termination");
   }

}

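Usually, a mismatch between the declared and actual encodings indicates a bug in the server. The most common cause is an HTTP Content-type header that specifies a different encoding from the one in the XML declaration. In that case, strict adherence to the specification requires that the value in the HTTP header take precedence. In practice, though, the value in the XML declaration is more likely to be the correct one.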
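As a rough illustration of checking for such a mismatch, the sketch below pulls the charset parameter out of a Content-type header value and compares it with a declared encoding. Everything here is an assumption for the sake of the example: the class ContentTypeCharset, its methods, and the sample header and declaration values are invented, and the parsing is deliberately simple rather than a complete implementation of HTTP header syntax.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Illustrative only: compare the HTTP charset with the XML declaration.
final class ContentTypeCharset {

    // Extracts the charset parameter from a Content-type header value, if present.
    static Optional<String> charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {   // parts[0] is the media type
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1); // strip quotes
                }
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    // True when both names resolve to the same charset (aliases count as equal).
    // Charset.forName throws if a name is illegal or unsupported on this JVM.
    static boolean sameCharset(String a, String b) {
        return Charset.forName(a).equals(Charset.forName(b));
    }

    public static void main(String[] args) {
        String header = "application/xml; charset=ISO-8859-1"; // sample header value
        String declared = "UTF-8"; // sample value, such as one reported by xmlDecl()
        charsetOf(header).ifPresent(fromHeader -> {
            if (!sameCharset(fromHeader, declared)) {
                System.out.println("Mismatch: header says " + fromHeader
                        + ", XML declaration says " + declared);
            }
        });
    }
}
```

Comparing Charset objects rather than raw strings avoids false alarms when the two sides merely use different spellings of the same encoding name.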
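Conclusion

Usually you don't need to know the encoding of an input document. You simply process the input with a parser and write the output in UTF-8. In the cases where you do need to know the input encoding, however, SAX and XNI offer a fast and efficient way to find it.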
