XML Parsing Made Simple

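Reading an XML file returns a DOM object.

What is a DOM object? DOM stands for Document Object Model: every item in an XML file corresponds to a node. The properties and methods of DOM nodes follow the standard published by the World Wide Web Consortium (W3C).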

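There are the following types of nodes:

Element nodes
Text nodes: every text node is the child of an element node.
Attribute nodes: neither the parent nor the child of any node; each belongs to an element node.
Comment nodes
Document nodes: new element, text, attribute, and comment nodes can be created only through methods of the document node.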
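
As a rough illustration of these node types, the sketch below builds a tiny document in memory and inspects it. The element name `item`, the attribute `id`, and the variable names are made up for this example. `com.mathworks.xml.XMLUtils.createDocument` is the same helper used later in this post, and the numeric type codes are the ones defined by the DOM standard.

```matlab
% Build <root><item id="1">hello<!-- note --></item></root> in memory.
docNode = com.mathworks.xml.XMLUtils.createDocument('root');
root = docNode.getDocumentElement;

item = docNode.createElement('item');
item.setAttribute('id', '1');
item.appendChild(docNode.createTextNode('hello'));
item.appendChild(docNode.createComment(' note '));
root.appendChild(item);

% Node type codes as defined by the DOM standard.
ELEMENT_NODE  = 1;
TEXT_NODE     = 3;
COMMENT_NODE  = 8;
DOCUMENT_NODE = 9;

isDocument = docNode.getNodeType == DOCUMENT_NODE;
isElement  = item.getNodeType == ELEMENT_NODE;
isText     = item.getFirstChild.getNodeType == TEXT_NODE;
isComment  = item.getLastChild.getNodeType == COMMENT_NODE;

% Attributes are not children: the element has two children (the text
% and the comment), while the attribute lives in a separate map.
numChildren   = item.getChildNodes.getLength;
numAttributes = item.getAttributes.getLength;
idValue       = char(item.getAttribute('id'));

fprintf('children: %d, attributes: %d, id = %s\n', ...
        numChildren, numAttributes, idValue);
```

The child count excludes the `id` attribute, which matches the statement above that attribute nodes belong to an element without being its children.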

Consider the following XML document:

<listitem>
    <label>Import Wizard</label>
    <callback>uiimport</callback>
    <icon>ApplicationIcon.GENERIC_GUI</icon>
</listitem>
<listitem>
    ...
</listitem>
    ...

One of the label tags contains the text Plot Tools. Suppose you want to find the text of the callback tag in the same listitem:

findLabel = 'Plot Tools';
findCbk = '';
xDoc = xmlread(fullfile(matlabroot, ...
               'toolbox','matlab','general','info.xml'));
allListitems = xDoc.getElementsByTagName('listitem');
for k = 0:allListitems.getLength-1
    thisListitem = allListitems.item(k);

    % Get the label element. In this file, each
    % listitem contains only one label.
    thisList = thisListitem.getElementsByTagName('label');
    thisElement = thisList.item(0);
    % Check whether this is the label you want.
    % The text is in the first child node.
    if strcmp(thisElement.getFirstChild.getData, findLabel)
        thisList = thisListitem.getElementsByTagName('callback');
        thisElement = thisList.item(0);
        findCbk = char(thisElement.getFirstChild.getData);
        break;
    end

end

if ~isempty(findCbk)
    msg = sprintf('Item "%s" has a callback of "%s."',...
                      findLabel, findCbk);
else
   msg = sprintf('Did not find the "%s" item.', findLabel);
end
disp(msg);

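MATLAB provides the xmlread function, which returns a Document node, the root of the DOM tree. The other methods available on the Document node are all defined by the DOM standard; see the standard for details. The code above uses several common API calls: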

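getElementsByTagName is a method of both Document and Element nodes (the code above calls it on each listitem as well); it returns a list.
The list is a list of nodes; to get one of its entries, call the list's item method, whose index starts at 0.
To get the text of an element node whose first child is a text node, call getFirstChild.getData.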
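
The three calls above can be combined into a small lookup loop. This is a minimal sketch rather than the code from the MATLAB documentation: the file contents and the variable names (`tmpFile`, `labels`, `callbacks`) are invented for the example, and it assumes every listitem has exactly one label and one callback, each with non-empty text.

```matlab
% Write a small, made-up XML file to a temporary location.
tmpFile = [tempname '.xml'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '<list>\n');
fprintf(fid, '  <listitem><label>Import Wizard</label>');
fprintf(fid, '<callback>uiimport</callback></listitem>\n');
fprintf(fid, '  <listitem><label>Plot Tools</label>');
fprintf(fid, '<callback>plottools</callback></listitem>\n');
fprintf(fid, '</list>\n');
fclose(fid);

xDoc = xmlread(tmpFile);

% Collect the label and callback text of every listitem.
items = xDoc.getElementsByTagName('listitem');
n = items.getLength;
labels = cell(1, n);
callbacks = cell(1, n);
for k = 0:n-1                        % NodeList indices start at 0
    thisItem = items.item(k);

    labelList = thisItem.getElementsByTagName('label');
    labelNode = labelList.item(0);
    cbkList = thisItem.getElementsByTagName('callback');
    cbkNode = cbkList.item(0);

    % char() converts the returned Java string to a MATLAB char array.
    labels{k+1} = char(labelNode.getFirstChild.getData);
    callbacks{k+1} = char(cbkNode.getFirstChild.getData);
end

% Look up the callback that belongs to a given label.
idx = find(strcmp(labels, 'Plot Tools'), 1);
foundCbk = callbacks{idx};

delete(tmpFile);
```

Collecting everything into cell arrays first trades a little memory for simpler lookups afterwards; for a single query, breaking out of the loop early as in the code above is just as reasonable.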

Suppose we want to write the following XML document:

<?xml version="1.0" encoding="utf-8"?>
<toc version="2.0">
    <tocitem target="upslope_product_page.html">Upslope Area Toolbox<!-- Functions -->
        <tocitem target="demFlow_help.html">demFlow</tocitem>
        <tocitem target="facetFlow_help.html">facetFlow</tocitem>
        <tocitem target="flowMatrix_help.html">flowMatrix</tocitem>
        <tocitem target="pixelFlow_help.html">pixelFlow</tocitem>
    </tocitem>
</toc>

The MATLAB code is as follows:

docNode = com.mathworks.xml.XMLUtils.createDocument('toc');

toc = docNode.getDocumentElement;
toc.setAttribute('version','2.0');

product = docNode.createElement('tocitem');
product.setAttribute('target','upslope_product_page.html');
product.appendChild(docNode.createTextNode('Upslope Area Toolbox'));
toc.appendChild(product);

product.appendChild(docNode.createComment(' Functions '));

functions = {'demFlow','facetFlow','flowMatrix','pixelFlow'};
for idx = 1:numel(functions)
    curr_node = docNode.createElement('tocitem');

    curr_file = [functions{idx} '_help.html'];
    curr_node.setAttribute('target',curr_file);

    % Child text is the function name.
    curr_node.appendChild(docNode.createTextNode(functions{idx}));
    product.appendChild(curr_node);
end

xmlwrite('info.xml',docNode);
type('info.xml');

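This code first creates a Document node, which is the all-important root node;
setAttribute is a method of Element nodes;
element, text, attribute, and comment nodes can be created only by docNode, through its create methods (createElement, createTextNode, createAttribute, createComment);
parent-child relationships between nodes are established with appendChild.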
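
To inspect a tree built this way without writing a file, `xmlwrite` can also be called with only the node, in which case it returns the serialized XML as text. The following is a small sketch under that assumption; it reuses the names from the example above, and the variables `entry`, `xmlText`, `target`, `entryText`, and `parentName` are introduced only for illustration.

```matlab
% Build a one-entry table of contents in memory.
docNode = com.mathworks.xml.XMLUtils.createDocument('toc');
toc = docNode.getDocumentElement;
toc.setAttribute('version', '2.0');

entry = docNode.createElement('tocitem');
entry.setAttribute('target', 'demFlow_help.html');
entry.appendChild(docNode.createTextNode('demFlow'));
toc.appendChild(entry);

% With a single input, xmlwrite returns the XML as a character array.
xmlText = xmlwrite(docNode);
disp(xmlText);

% Read the values back from the tree to confirm its structure.
target     = char(entry.getAttribute('target'));
entryText  = char(entry.getFirstChild.getData);
parentName = char(entry.getParentNode.getNodeName);
```

Reading `getParentNode` back is a quick way to confirm that `appendChild` attached the new element where it was intended to go.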

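That covers the most basic knowledge needed to handle XML documents in MATLAB. Because the DOM interface is standardized, the same approach carries over easily to other languages. There is a lot more to be learned from reading the code itself. In practice, the handful of XML API calls shown here will probably be far from enough; this article is meant only as an introductory guide.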
