Processing a hierarchical XML Document with XPATH in Java. Efficiency?

https://stackoverflow.com/questions/4458215

10-10-2019

Question

Variants of this question have been asked here several times, but mine is more about the general efficiency of using XPath in Java.

My task: take wikipedia articles on geographic locations and create a hierarchical data structure from them.

I have already obtained XML versions of the wiki pages and reformatted them according to a schema that makes intuitive sense. I have also written a series of very simple classes representing the different levels of the administrative hierarchy, such as this:

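import java.util.ArrayList;

public class Province implements java.io.Serializable {

    // Level-2 divisions (cities/districts) that belong to this province
    private ArrayList<City> cities = new ArrayList<City>();
    private String hanzi;   // name in Chinese characters
    private String pinyin;  // romanized name

    public Province(String hanzi, String pinyin) {
        this.hanzi = hanzi;
        this.pinyin = pinyin;
    }

    // remaining methods omitted here
}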

As well as a method to add cities, some getter and setter methods, and a toString().

Here's an example of the type of XML file I'm dealing with :

<mediawiki>
     <page>
           <title>Tianjin</title>
           <revision>
                    <id>2064019</id>
                    <text xml:space="preserve">
                              <province>
                                       <hanzi>天津</hanzi>
                                       <pinyin>Tianjin</pinyin>

                                       <Level2>
                                               <hanzi>和平</hanzi>
                                               <pinyin>Heping</pinyin>
                                               <zip>300000</zip>
                                       </Level2>

                                       <Level2>
                                                <hanzi>河东</hanzi>
                                                <pinyin>Hedong</pinyin>
                                                <zip>300000</zip>
                                        </Level2>

                                </province>
                    </text>
            </revision>
      </page>

...

</mediawiki>

I essentially have a functional setup at this point, but the code is extremely repetitive and doesn't take into account the inherent hierarchical nature of geographic data. Ideally, I could stop at a certain level (let's say "focusing" on a particular province), and only refer to things in relative terms from that point forward, to minimize the number of times I have to crawl through the entire document. As an example (note, I am using an abstraction over the traditional Document setup, but the methods below correspond almost exactly to traditional methods):

XPathReader reader = new XPathReader("sourceXML\\Provinces.xml");
String expression = "/mediawiki/page";
NodeList allProvinces = (NodeList) reader.read(expression, XPathConstants.NODESET);

// XPath positions are 1-based, so both loops run from 1 to getLength() inclusive
for (int i = 1; i <= allProvinces.getLength(); i++) {
    expression = "/mediawiki/page[" + i + "]/revision/text/province/hanzi";
    String hanzi = reader.read(expression, XPathConstants.STRING).toString();

    expression = "/mediawiki/page[" + i + "]/revision/text/province/pinyin";
    String pinyin = reader.read(expression, XPathConstants.STRING).toString();

    Province currProv = new Province(hanzi, pinyin);

    expression = "/mediawiki/page[" + i + "]/revision/text/province/Level2";
    NodeList level2 = (NodeList) reader.read(expression, XPathConstants.NODESET);

    for (int j = 1; j <= level2.getLength(); j++) {
        expression = "/mediawiki/page[" + i + "]/revision/text/province/Level2[" + j + "]/hanzi";
        String hanzi2 = reader.read(expression, XPathConstants.STRING).toString();

        expression = "/mediawiki/page[" + i + "]/revision/text/province/Level2[" + j + "]/pinyin";
        String pinyin2 = reader.read(expression, XPathConstants.STRING).toString();

        City currCity = new City(hanzi2, pinyin2);
        currProv.add(currCity);
...
    }
}

Frankly speaking, this seems dumb. I am not taking into account the fact that everything about these strings is identical once I get up to the Level I am concerned with. I am not referencing any kind of relative path, and whenever I traverse a part of the document I in fact traverse the entire thing. It would be great if I could block out the rest of the original XML document for a while and only focus on my Province, referring to everything thenceforth in relative terms.

I should especially note how expensive this is behind the "read" abstraction:

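// every call to read() compiles the expression again before evaluating it
XPathExpression xPathExpression = xPath.compile(expression);
Object result = xPathExpression.evaluate(xmlDocument, returnType);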

Am I essentially recompiling an identical pattern with a slightly different ending each time? What about loading the portion of interest and then referring to its children with something like "currProv/hanzi"?

I have looked into other methods of parsing XML, and the "Digester" seems to do something similar to what I want (http://commons.apache.org/digester/core.html), but I already have almost everything in place in this XPath implementation.

I have the nagging suspicion that the solution to this issue is very simple...but I can't quite grasp the solution. Anyhow, I thank you for your time!

Solution

Relative nested XPaths are the way to go.
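With the standard javax.xml.xpath API this could look roughly like the sketch below. It compiles each expression once, selects the province nodes with one absolute path, and then evaluates short relative paths with each province (or Level2) node as the context instead of the whole document. The class name RelativeXPathSketch, the method summarize, and the trimmed ASCII-only sample XML are assumptions made for illustration; they are not part of the question's XPathReader wrapper.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
public class RelativeXPathSketch {

    /** Returns one "province: city, city" line per province element. */
    public static List<String> summarize(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Compile each expression once, outside the loops.
        XPathExpression provinces =
                xpath.compile("/mediawiki/page/revision/text/province");
        XPathExpression level2 = xpath.compile("Level2"); // relative to a province
        XPathExpression pinyin = xpath.compile("pinyin"); // relative to the context node

        List<String> lines = new ArrayList<>();
        NodeList provinceNodes =
                (NodeList) provinces.evaluate(doc, XPathConstants.NODESET);
        for (int i = 0; i < provinceNodes.getLength(); i++) { // NodeList is 0-based
            Node province = provinceNodes.item(i);
            // The province node, not the document, is the evaluation context.
            StringBuilder line =
                    new StringBuilder(pinyin.evaluate(province)).append(":");
            NodeList cities =
                    (NodeList) level2.evaluate(province, XPathConstants.NODESET);
            for (int j = 0; j < cities.getLength(); j++) {
                line.append(j == 0 ? " " : ", ")
                    .append(pinyin.evaluate(cities.item(j)));
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Trimmed sample in the shape of the question's XML (pinyin only).
        String xml = "<mediawiki><page><title>Tianjin</title><revision><text>"
                + "<province><pinyin>Tianjin</pinyin>"
                + "<Level2><pinyin>Heping</pinyin></Level2>"
                + "<Level2><pinyin>Hedong</pinyin></Level2>"
                + "</province></text></revision></page></mediawiki>";
        summarize(xml).forEach(System.out::println); // prints "Tianjin: Heping, Hedong"
    }
}
```

The same three compiled expressions serve every province and every Level2 element, so no path strings have to be rebuilt or recompiled inside the loops.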

I lead the EclipseLink JAXB implementation (MOXy) and we offer this ability through an @XmlPath annotation. If you already have the XPaths it would be a relatively easy mapping.

For more information see:

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow