Regex for specific tags and their content, grouped by the tag name

https://stackoverflow.com/questions/200525

03-07-2019

Question

Here is the input (html, not xml):

... html content ...
<tag1> content for tag 1 </tag1>
<tag2> content for tag 2 </tag2>
<tag3> content for tag 3 </tag3>
... html content ...

I would like to get 3 matches, each with two groups. First group would contain the name of the tag and the second group would contain the inner text of the tag. There are just those three tags, so it doesn't need to be universal.

In other words:

match.Groups["name"] would be "tag1"
match.Groups["value"] would be "content for tag 1"

Any ideas?

Solution

I don't see why you would want to use match group names for that.

Here is a regular expression that would match tag name and tag content into numbered sub matches:

<(tag1|tag2|tag3)>(.*?)</\1>

Here is a variant with .NET-style group names:

<(?'name'tag1|tag2|tag3)>(?'value'.*?)</\k'name'>

EDIT

RegEx adapted as per question author's clarification.
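
For illustration, here is a minimal sketch of how the named-group variant could be used from C#. It assumes C# 9 top-level statements, and the variable names (input, tagRegex, matches) are made up for the example. Note that without RegexOptions.Singleline the dot does not match a newline, so this only covers tag content that sits on one line.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input modeled on the question; the surrounding text is filler.
string input = "... html content ...\n" +
               "<tag1> content for tag 1 </tag1>\n" +
               "<tag2> content for tag 2 </tag2>\n" +
               "<tag3> content for tag 3 </tag3>\n" +
               "... html content ...";

// Named groups "name" and "value"; \k<name> requires the closing tag
// to repeat whatever the opening tag matched.
var tagRegex = new Regex(@"<(?<name>tag1|tag2|tag3)>(?<value>.*?)</\k<name>>");

MatchCollection matches = tagRegex.Matches(input);
foreach (Match match in matches)
{
    // Trim() drops the padding spaces inside the tags.
    Console.WriteLine("{0}: {1}",
        match.Groups["name"].Value,
        match.Groups["value"].Value.Trim());
}
```

The (?<name>...) spelling used here is equivalent to the (?'name'...) spelling in the answer.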

OTHER TIPS

Regex for this might be:

/<([^>]+)>([^<]+)<\/\1>/

But it's generic, as I don't know much about the escaping mechanism of .NET. To translate it:

the first group matches the first tag's name, between < and >
the second group matches the contents (from > to the next <)
the end checks that the first tag is closed

HTH
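
On the escaping question: as far as .NET goes, the pattern is passed as a plain string, so the /.../ delimiters are dropped and the forward slash needs no backslash. A rough sketch, assuming C# 9 top-level statements and a made-up two-line sample:

```csharp
using System;
using System.Text.RegularExpressions;

// No /.../ delimiters in .NET; the verbatim string (@"...") keeps \1 as written.
var generalRegex = new Regex(@"<([^>]+)>([^<]+)</\1>");

// Hypothetical sample, shortened from the question's input.
string sample = "<tag1> content for tag 1 </tag1>\n<tag2> content for tag 2 </tag2>";

MatchCollection found = generalRegex.Matches(sample);
foreach (Match m in found)
{
    // Groups[1] is the tag name, Groups[2] the text between the tags.
    Console.WriteLine("{0} -> '{1}'", m.Groups[1].Value, m.Groups[2].Value);
}
```

Because the second group is [^<]+, this only matches tags whose content contains no further markup.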

Thanks all, but none of the regexes work. :( Maybe I wasn't specific enough, sorry for that. Here is the exact html I'm trying to parse:

...some html content <b> etc </b> ...
<user> hello <b>mitch</b> </user>
...some html content <b> etc </b> ...
<message> some html <i>message</i> <a href....>bla</a> </message>
...some html content <b> etc </b> ...

I hope it's clearer now. I'm after USER and MESSAGE tags.

I need to get two matches, each with two groups. The first group would give me the tag name (user or message) and the second group would give me the entire inner text of the tag.

Is the data proper xml, or does it just look like it?

If it is html, then the HTML Agility Pack is worth investigation - this provides a DOM (similar to XmlDocument) that you can use to query the data:

// requires a reference to the HtmlAgilityPack library
using System;
using HtmlAgilityPack;

string input = @"<html>...some html content <b> etc </b> ...
<user> hello <b>mitch</b> </user>
...some html content <b> etc </b> ...
<message> some html <i>message</i> <a href....>bla</a> </message>
...some html content <b> etc </b> ...</html>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(input);
// select every <user> and <message> element anywhere in the document
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//user | //message"))
{
    Console.WriteLine("{0}: {1}", node.Name, node.InnerText);
    // or node.InnerHtml to keep the formatting within the content
}

This outputs:

user:  hello mitch
message:  some html message bla

If you want the formatting tags, then use .InnerHtml instead of .InnerText.

If it is xml, then to cope with the full spectrum of xml, it would be better to use an xml parser. For small-to-mid size xml, loading it into a DOM such as XmlDocument would be fine - then query the nodes (for example, "//*"). For huge xml, XmlReader might be an option.
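
A minimal sketch of the XmlDocument route, assuming C# 9 top-level statements. The sample string and its root element are hypothetical, added only because the input has to be well-formed for an xml parser to accept it:

```csharp
using System;
using System.Xml;

// Hypothetical well-formed sample; XML needs a single root element.
string xml = "<root><tag1> content for tag 1 </tag1><tag2> content for tag 2 </tag2></root>";

var doc = new XmlDocument();
doc.LoadXml(xml);

// "/root/*" selects every child element of the root;
// "//*" would also return the root element itself.
XmlNodeList nodes = doc.SelectNodes("/root/*");
foreach (XmlNode node in nodes)
{
    Console.WriteLine("{0}: {1}", node.Name, node.InnerText.Trim());
}
```

LoadXml throws an XmlException on input that is not well-formed, which is one quick way to answer the "is it proper xml" question.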

If the data doesn't have to worry about the full xml, then some simple regex shouldn't be too tricky... a simplified example (no attributes, no namespaces, no nested xml) might be:

using System;
using System.Text.RegularExpressions;

string input = @"blah <tag1> content for tag 1 </tag1> blop
<tag2> content for tag 2 </tag2> bloop
<tag3> content for tag 3 </tag3> blip";

// group 1: tag name; group 2: text content; \1 requires a matching closing tag
const string pattern = @"<(\w+)>\s*([^<>]*)\s*</(\1)>";
Console.WriteLine(Regex.IsMatch(input, pattern));
foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine("{0}: {1}", match.Groups[1], match.Groups[2]);
}

The problem was that the ([^<]*) people were using to match the content stops at the opening < of a nested tag. What follows at that point is the nested tag rather than the closing tag of the outer element, so the regex fails.
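
To make the failure concrete, here is a small sketch comparing the two content patterns on the asker's user line. It assumes C# 9 top-level statements; the variable names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<user> hello <b>mitch</b> </user>";

// [^<]* stops at the "<" of <b>, so "</user>" cannot follow and the match fails.
bool strictMatches = Regex.IsMatch(html, @"<(user)>([^<]*)</\1>");

// .*? is allowed to cross nested tags and expands until the first "</user>".
Match lazy = Regex.Match(html, @"<(user)>(.*?)</\1>");

Console.WriteLine(strictMatches);        // False
Console.WriteLine(lazy.Groups[2].Value); // " hello <b>mitch</b> " (without the quotes)
```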

Here is a slightly more robust version of Tomalak's regex allowing for attributes and whitespace:

Regex tagRegex = new Regex(@"<\s*(?<tag>" + string.Join("|", tags) + @")[^>]*>(?<content>.*?)<\s*/\s*\k<tag>\s*>", RegexOptions.IgnoreCase);

Obviously, if you only ever need a specific set of tags, you can replace the

string.Join("|", tags)

with a hardcoded, pipe-separated list of tags.
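
Put together, usage might look like the sketch below. The tags array is an assumption (the snippet above does not show how tags is defined), and the code assumes C# 9 top-level statements. The input is the asker's sample, with the elided href left as written.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed definition: the two tag names the asker is after.
string[] tags = { "user", "message" };

var tagRegex = new Regex(
    @"<\s*(?<tag>" + string.Join("|", tags) + @")[^>]*>(?<content>.*?)<\s*/\s*\k<tag>\s*>",
    RegexOptions.IgnoreCase);

string input = @"...some html content <b> etc </b> ...
<user> hello <b>mitch</b> </user>
...some html content <b> etc </b> ...
<message> some html <i>message</i> <a href....>bla</a> </message>
...some html content <b> etc </b> ...";

MatchCollection matches = tagRegex.Matches(input);
foreach (Match match in matches)
{
    // "content" keeps the nested markup, e.g. <b>mitch</b>.
    Console.WriteLine("{0}: {1}",
        match.Groups["tag"].Value,
        match.Groups["content"].Value);
}
```

If the content can span several lines, adding RegexOptions.Singleline lets the dot match newlines as well.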

A limitation of the regex is that if one of the tags you are trying to match is nested inside another, it will only match the outer tag. For example:

<user>abc<message>def</message>ghi</user>

It will match the outer user tag, but not the inner message tag.

It also doesn't handle >'s quoted in attributes like so:

<user attrib="oops>">

It will just match

<user attrib="oops>

as the tag and the

">

will be a part of the tag's content.
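
Both limitations can be reproduced with a short sketch. It assumes C# 9 top-level statements and uses the hardcoded tag list; the closing part of the second test string ("hi</user>") is made up so that there is something to match.

```csharp
using System;
using System.Text.RegularExpressions;

var tagRegex = new Regex(
    @"<\s*(?<tag>user|message)[^>]*>(?<content>.*?)<\s*/\s*\k<tag>\s*>",
    RegexOptions.IgnoreCase);

// Nested case: the outer match consumes the inner <message>,
// so only one match is reported.
MatchCollection nested = tagRegex.Matches("<user>abc<message>def</message>ghi</user>");
Console.WriteLine(nested.Count);
Console.WriteLine(nested[0].Groups["content"].Value);

// Quoted ">" case: [^>]* ends the opening tag at the ">" inside the
// attribute value, so the leftover "> leaks into the content.
Match quoted = tagRegex.Match("<user attrib=\"oops>\">hi</user>");
Console.WriteLine(quoted.Groups["content"].Value);
```

One way to reach the inner tag would be to run the regex again on the content group of the outer match.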

This will give you named capture groups for what you want. It won't work for nested tags, however.

/<(?<name>[^>]+)>(?<value>[^<]+)<\/\1>/

Licensed under: CC-BY-SA with attribution
