Personal Project "RSS FEED" XML Parser

Question 1

For my money, the easiest solution would be to use the XPath API.

Essentially, it's a query language for XML. See XPath Tutorial for a primer.

This example uses the RSS feed from SO, which uses <entry...> instead of <item>, but I've used the same technique for other RSS (and XML) files and even very complex HTML documents...

import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class TestRSSFeed {

    public static void main(String[] args) {
        try {
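            // Read the feed...
            // The factory is not namespace-aware by default, which is why the plain
            // "//entry" query further down matches the feed's namespaced elements...
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();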
            Document doc = factory.newDocumentBuilder().parse("http://stackoverflow.com/feeds/tag?tagnames=java&sort=newest");
            Element root = doc.getDocumentElement();

            // Create an XPath instance
            XPath xPath = XPathFactory.newInstance().newXPath();
            // Find every node named <entry...> anywhere in the document,
            // searching from the root node down...
            XPathExpression entryExpression = xPath.compile("//entry");
            // This is a sub node search. It is relative, so it is evaluated
            // against each entry and looks for a child node named "title".
            // It only needs to be compiled once, outside the loop...
            XPathExpression titleExpression = xPath.compile("title");
            NodeList nl = (NodeList) entryExpression.evaluate(root, XPathConstants.NODESET);

            System.out.println("Found " + nl.getLength() + " items...");
            for (int index = 0; index < nl.getLength(); index++) {
                Node node = nl.item(index);
                // XPathConstants.NODE returns a single node, which is what
                // I want because I'm only expecting one title per entry...
                Node child = (Node) titleExpression.evaluate(node, XPathConstants.NODE);
                if (child != null) {
                    System.out.println(child.getTextContent());
                }
            }

        } catch (IOException | ParserConfigurationException | SAXException | XPathExpressionException exp) {
            exp.printStackTrace();
        }
    }

}

Now, you can do some pretty complex queries, but I thought I'd start with a basic example ;)
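
To give a feel for a slightly more involved query, here is a minimal sketch that uses a predicate to pick only the entries with a particular category. It runs against a small inline document rather than a live feed. The sample XML, the class name XPathPredicateDemo and the method titlesByCategory are all made up for illustration; only the javax.xml and org.w3c.dom calls are real API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XPathPredicateDemo {

    // Hypothetical sample feed, not taken from any real site
    static final String SAMPLE =
            "<feed>"
            + "<entry><title>First</title><category term=\"java\"/></entry>"
            + "<entry><title>Second</title><category term=\"xml\"/></entry>"
            + "<entry><title>Third</title><category term=\"java\"/></entry>"
            + "</feed>";

    // Returns the titles of all entries that have a <category> with the given term.
    // Assumption: term contains no single quote, since it is pasted into the query.
    static List<String> titlesByCategory(String xml, String term) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            // The [...] predicate keeps only entries whose category/@term matches
            String query = "//entry[category/@term='" + term + "']/title";
            NodeList nl = (NodeList) xPath.evaluate(query, doc, XPathConstants.NODESET);
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < nl.getLength(); i++) {
                titles.add(nl.item(i).getTextContent());
            }
            return titles;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalStateException("Could not query the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titlesByCategory(SAMPLE, "java")); // prints [First, Third]
    }
}
```

The same predicate syntax should carry over to a real feed, as long as the element and attribute names are adjusted to whatever that feed actually uses.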

Question 2

Just in case anyone is still wondering how I managed to solve the CDATA puzzle:

The logic is as follows:

Once the program extracts all the XML and shows the same node tree as the RSS feed, any markup wrapped in a CDATA section is still exposed only as plain text. The only way I found to access that information is to create a new XML document from the text content of the CDATA section. Once you parse that new document, you should be able to access all the data you need.
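
A rough sketch of that idea, assuming the text inside the CDATA section is itself well-formed XML (HTML from real feeds often isn't, in which case this parse will fail). The sample item, the class name CdataReparseDemo, the helper names and the dummy <root> wrapper are my own assumptions for illustration, not part of any feed format.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CdataReparseDemo {

    // Hypothetical item: the description wraps markup inside a CDATA section
    static final String SAMPLE =
            "<item><title>Post</title>"
            + "<description><![CDATA[<p>Hello <a href=\"http://example.com/\">link</a></p>]]>"
            + "</description></item>";

    // Parses an XML string into a DOM document
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Returns the href of the first <a> found inside the item's CDATA description
    static String firstLinkHref(String itemXml) {
        Document outer = parse(itemXml);
        Element description = (Element) outer.getElementsByTagName("description").item(0);
        // To the outer parser the CDATA content is plain text, not child elements
        String inner = description.getTextContent();
        // Wrap it in a dummy root so the fragment has exactly one top-level element
        Document innerDoc = parse("<root>" + inner + "</root>");
        Element link = (Element) innerDoc.getElementsByTagName("a").item(0);
        return link.getAttribute("href");
    }

    public static void main(String[] args) {
        System.out.println(firstLinkHref(SAMPLE)); // prints http://example.com/
    }
}
```

Once the inner document exists, the same XPath technique from Question 1 can be run against it instead of getElementsByTagName.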