How to write correct Regex for url's on the page without anchors?

https://stackoverflow.com/questions/878957

22-08-2019
|

Question

I want to cut all url's like (http://....) and replace them on anchors <a></a> but my requirement: Do not touch anchors and page definition(Doc type) like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

So I need to find just plain text with url's...

I'm trying to override my render inside page and I made BrowserAdapter:

<browser refID="default">
    <controlAdapters>
        <adapter controlType="System.Web.Mvc.ViewPage"
                 adapterType="Facad.Adapters.AnchorAdapter" />
    </controlAdapters>
</browser>

it looks like this:

public class AnchorAdapter : PageAdapter
{
    protected override void Render(HtmlTextWriter writer)
    {
        /* Get page output into string */
        var sb = new StringBuilder();
        TextWriter tw = new StringWriter(sb);
        var htw = new HtmlTextWriter(tw);

        // Render into my writer
        base.Render(htw);

        string page = sb.ToString();
        //regular expression 
        Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase); 

        //get the first match 
        Match match = regx.Match(page); 

        //loop through matches 
        while (match.Success)
        {

            //output the match info 
            System.Web.HttpContext.Current.Response.Write("<p>url match: " + match.Groups[0].Value+"</p>");

            //get next match 
            match = match.NextMatch();
        }

        writer.Write(page);
    }
}

Solution

You just need to search a bit ahead and behind the url to see if it's in quotes, it's unlikely someone would paste a quoted url as plaintext but urls are always quoted in tags and doctypes. So your regex becomes:

(^|[^'"])(http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?)([^'"]+|$)

(^|[^'"]+) means start of string or a character that is NOT a quote ([^'"]|$) means end of string or not a quote

The extra brackets around the old regex ensure it's a capture group so you can retrieve the actual URL with \2 (group 2) instead of getting the extra crap it might have matched on the edges of the url

BTW, your URL regex looks pretty bad, there are more compact and accurate forms. You really don't need to escape EVERYTHING.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow