Double-Lazy XML Parser (with foreign-language source)
Abstract
XML is acknowledged as the most effective format for data encoding and exchange over domains ranging from the World Wide Web to desktop applications. However, large-scale adoption into actual system implementations is being slowed down by the inefficiency of its document-parsing methods. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have a key drawback: they must load the entire XML document in order to extract the overall document structure before document parsing can be performed. We have developed a framework for efficient parsing based on the idea of placing internal physical pointers within the XML document that allow the navigation process to skip large portions of the document during parsing. We show how to generate such internal pointers in a way that optimizes parsing, using constructs supported by the current W3C XML standard. A double-lazy parser (2LP) exploits these internal pointers to parse the document efficiently. Because the internal pointers are built from supported W3C constructs, 2LP is backward compatible; that is, the pointer-augmented documents can still be parsed by current XML parsers. We also implemented a mechanism to efficiently parse large documents with limited main memory, thereby overcoming a major limitation of current solutions. We study our pointer generation and parsing algorithms both theoretically and experimentally, and show that they perform considerably better than existing approaches.
© 2008 Elsevier B.V. All rights reserved.
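The pointer idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which W3C construct carries the pointers, so this sketch assumes a processing instruction of the hypothetical form `<?skip N?>` placed before each child of the root, where N is the byte length of the subtree that follows. The function names `add_skip_pointers` and `find_child` are also hypothetical. The sketch only indexes the root's direct children and drops root attributes and text, so it is not the 2LP algorithm itself.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pointer syntax (an assumption, not taken from the paper):
# <?skip N?> means "the next N bytes are one complete subtree".
PI = re.compile(rb"<\?skip (\d+)\?>")
NAME = re.compile(rb"<([A-Za-z_][\w.\-]*)")


def add_skip_pointers(xml_bytes: bytes) -> bytes:
    """Rewrite a document so each child of the root is preceded by a skip pointer.

    Simplification: root attributes, root text, and text between children are dropped.
    """
    root = ET.fromstring(xml_bytes)
    parts = [b"<" + root.tag.encode() + b">"]
    for child in root:
        child.tail = None              # keep trailing text out of the subtree bytes
        body = ET.tostring(child)      # serialized subtree as bytes
        parts.append(b"<?skip %d?>" % len(body))
        parts.append(body)
    parts.append(b"</" + root.tag.encode() + b">")
    return b"".join(parts)


def find_child(doc: bytes, wanted: str):
    """Return (subtree bytes or None, number of subtrees jumped over).

    Only the start-tag name of each subtree is inspected; unwanted subtrees
    are skipped by adding the pointer's length to the current offset.
    """
    pos = doc.index(b">") + 1          # just past the root start tag
    skipped = 0
    while True:
        m = PI.match(doc, pos)
        if m is None:                  # no more pointers: reached the root end tag
            return None, skipped
        start = m.end()
        end = start + int(m.group(1))
        name = NAME.match(doc, start).group(1)
        if name == wanted.encode():
            return doc[start:end], skipped
        pos = end                      # jump over the whole subtree
        skipped += 1


source = (b"<book><preface>hi</preface>"
          b"<chapter id='1'><p>a</p><p>b</p></chapter>"
          b"<index><e>x</e></index></book>")
augmented = add_skip_pointers(source)
subtree, skipped = find_child(augmented, "index")
print(subtree, skipped)

# Backward compatibility: a standard parser ignores the processing instructions.
print([child.tag for child in ET.fromstring(augmented)])
```

Storing a length rather than an absolute offset keeps each pointer valid when unrelated parts of the document change. The cost is that any edit inside a subtree requires updating that subtree's pointer.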