Processing a hierarchical XML Document with XPATH in Java. Efficiency?
10-10-2019
Question
Variants of this question have been asked here several times, but mine is more about the general efficiency of using XPath in Java.
My task: take Wikipedia articles on geographic locations and create a hierarchical data structure from them.
I have already obtained XML versions of the wiki pages and reformatted them according to a schema that makes intuitive sense. I have also written a series of very simple classes representing the different levels of the administrative hierarchy, such as this:
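import java.util.ArrayList;

public class Province implements java.io.Serializable {
    // Cities (Level2 entries) that belong to this province
    private ArrayList<City> cities = new ArrayList<City>();
    private String hanzi;   // name in Chinese characters
    private String pinyin;  // romanized name

    public Province(String hanzi, String pinyin) {
        this.hanzi = hanzi;
        this.pinyin = pinyin;
    }
    // ...
}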
The class also has a method to add cities, some getter and setter methods, and a toString().
Here's an example of the type of XML file I'm dealing with:
<mediawiki>
    <page>
        <title>Tianjin</title>
        <revision>
            <id>2064019</id>
            <text xml:space="preserve">
                <province>
                    <hanzi>天津</hanzi>
                    <pinyin>Tianjin</pinyin>
                    <Level2>
                        <hanzi>和平</hanzi>
                        <pinyin>Heping</pinyin>
                        <zip>300000</zip>
                    </Level2>
                    <Level2>
                        <hanzi>河东</hanzi>
                        <pinyin>Hedong</pinyin>
                        <zip>300000</zip>
                    </Level2>
                </province>
            </text>
        </revision>
    </page>
    ...
</mediawiki>
I essentially have a functional setup at this point, but the code is extremely repetitive and doesn't take into account the inherent hierarchical nature of geographic data. Ideally, I could stop at a certain level (let's say "focusing" on a particular province), and only refer to things in relative terms from that point forward, to minimize the number of times I have to crawl through the entire document. As an example (note, I am using an abstraction over the traditional Document setup, but the methods below correspond almost exactly to traditional methods):
XPathReader reader = new XPathReader("sourceXML\\Provinces.xml");
String expression = "/mediawiki/page";
NodeList allProvinces = (NodeList) reader.read(expression, XPathConstants.NODESET);
// XPath positions are 1-based, so both loops run from 1 to getLength() inclusive
for (int i = 1; i <= allProvinces.getLength(); i++) {
    expression = "/mediawiki/page[" + i + "]/revision/text/province/hanzi";
    String hanzi = reader.read(expression, XPathConstants.STRING).toString();
    expression = "/mediawiki/page[" + i + "]/revision/text/province/pinyin";
    String pinyin = reader.read(expression, XPathConstants.STRING).toString();
    Province currProv = new Province(hanzi, pinyin);
    expression = "/mediawiki/page[" + i + "]/revision/text/province/Level2";
    NodeList level2 = (NodeList) reader.read(expression, XPathConstants.NODESET);
    for (int j = 1; j <= level2.getLength(); j++) {
        expression = "/mediawiki/page[" + i + "]/revision/text/province/Level2[" + j + "]/hanzi";
        String hanzi2 = reader.read(expression, XPathConstants.STRING).toString();
        expression = "/mediawiki/page[" + i + "]/revision/text/province/Level2[" + j + "]/pinyin";
        String pinyin2 = reader.read(expression, XPathConstants.STRING).toString();
        City currCity = new City(hanzi2, pinyin2);
        currProv.add(currCity);
        ...
    }
}
Frankly speaking, this seems dumb. I am not taking into account the fact that everything about these strings is identical once I get up to the Level I am concerned with. I am not referencing any kind of relative path, and whenever I traverse a part of the document I in fact traverse the entire thing. It would be great if I could block out the rest of the original XML document for a while and only focus on my Province, referring to everything thenceforth in relative terms.
I should especially note how expensive this is behind the "read" abstraction:
XPathExpression xPathExpression = xPath.compile(expression);
Object result = xPathExpression.evaluate(xmlDocument, returnType);
So on every call I am essentially recompiling an identical pattern with a slightly different ending. What about loading the portion of interest and then referring to its children with something like "currProv/hanzi"?
I have looked into other methods of parsing XML, and the "Digester" (http://commons.apache.org/digester/core.html) seems to do something similar to what I want, but I already have almost everything in place in this XPath implementation.
I have the nagging suspicion that the solution to this issue is very simple, but I can't quite grasp it. Anyhow, I thank you for your time!
Solution
Relative nested XPaths are the way to go.
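As a rough sketch of what that can look like with only the standard javax.xml.xpath API (this is not MOXy): each expression is compiled once, the province nodes are selected once, and every further lookup is evaluated relative to the current node. The class name RelativeXPathSketch, the parse method, and the SAMPLE string are made up for illustration, and the sample keeps only the pinyin and zip fields so that the source stays ASCII-only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathSketch {

    // Hypothetical sample, trimmed from the XML in the question
    static final String SAMPLE =
        "<mediawiki><page><title>Tianjin</title><revision><id>2064019</id>"
        + "<text xml:space=\"preserve\"><province><pinyin>Tianjin</pinyin>"
        + "<Level2><pinyin>Heping</pinyin><zip>300000</zip></Level2>"
        + "<Level2><pinyin>Hedong</pinyin><zip>300000</zip></Level2>"
        + "</province></text></revision></page></mediawiki>";

    // Returns one row per province, followed by one indented row per Level2 entry
    static List<String> parse(String xml) throws XPathExpressionException {
        XPath xPath = XPathFactory.newInstance().newXPath();

        // Compile every expression exactly once, outside the loops
        XPathExpression provinces = xPath.compile("/mediawiki/page/revision/text/province");
        XPathExpression level2 = xPath.compile("Level2"); // relative to a province node
        XPathExpression pinyin = xPath.compile("pinyin"); // relative to whatever node is passed in
        XPathExpression zip = xPath.compile("zip");       // relative to a Level2 node

        List<String> rows = new ArrayList<>();
        // The only absolute query: it parses the input and selects all province nodes
        NodeList provinceNodes = (NodeList) provinces.evaluate(
                new InputSource(new StringReader(xml)), XPathConstants.NODESET);

        for (int i = 0; i < provinceNodes.getLength(); i++) { // NodeList is 0-based
            Node province = provinceNodes.item(i);
            // The context node is the province, so "pinyin" means province/pinyin
            rows.add(pinyin.evaluate(province));

            NodeList cities = (NodeList) level2.evaluate(province, XPathConstants.NODESET);
            for (int j = 0; j < cities.getLength(); j++) {
                Node city = cities.item(j);
                // Same compiled expression, now evaluated against the Level2 node
                rows.add("  " + pinyin.evaluate(city) + " " + zip.evaluate(city));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws XPathExpressionException {
        for (String row : parse(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

In the question's setup, the same idea would mean having the read abstraction accept a context Node instead of always evaluating against the whole document, and building Province and City objects where this sketch builds strings.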
I lead the EclipseLink JAXB implementation (MOXy) and we offer this ability through an @XmlPath annotation. If you already have the XPaths it would be a relatively easy mapping.
For more information see: