How to write a correct regex for URLs on a page that are not already inside anchors?
22-08-2019
Question
I want to find all plain-text URLs (http://....) on the page and replace them with anchors (<a></a>),
but I have one requirement:
Do not touch existing anchors or the page definition (DOCTYPE), such as:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
So I need to find only the URLs that appear as plain text.
I'm trying to override how my page is rendered, so I registered a browser adapter:
<browser refID="default">
<controlAdapters>
<adapter controlType="System.Web.Mvc.ViewPage"
adapterType="Facad.Adapters.AnchorAdapter" />
</controlAdapters>
</browser>
The adapter looks like this:
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Web.UI;
using System.Web.UI.Adapters;

public class AnchorAdapter : PageAdapter
{
    protected override void Render(HtmlTextWriter writer)
    {
        /* Get page output into string */
        var sb = new StringBuilder();
        TextWriter tw = new StringWriter(sb);
        var htw = new HtmlTextWriter(tw);
        // Render into my writer
        base.Render(htw);
        string page = sb.ToString();
        // regular expression
        Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
        // get the first match
        Match match = regx.Match(page);
        // loop through matches
        while (match.Success)
        {
            // output the match info (debug output only, written ahead of the page)
            System.Web.HttpContext.Current.Response.Write("<p>url match: " + match.Groups[0].Value + "</p>");
            // get next match
            match = match.NextMatch();
        }
        // write the (still unmodified) page to the real writer
        writer.Write(page);
    }
}
Solution
You just need to look one character before and after the URL to see whether it's in quotes. It's unlikely that someone would paste a quoted URL as plain text, but URLs in tag attributes and doctypes are normally quoted. So your regex becomes:
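(^|[^'"])(http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?)([^'"]|$)
(^|[^'"]) means start of string or a character that is NOT a quote. ([^'"]|$) means a character that is not a quote, or end of string. The backslashes are doubled as in your C# string literal; inside that literal the double quote also has to be written as \".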
The extra parentheses around the old regex make it a capture group, so you can retrieve the actual URL with \2 (group 2, which is match.Groups[2].Value in .NET) instead of the whole match, which also includes the extra characters matched on either side of the URL.
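As a rough sketch of how the quote-guarded pattern could be applied, the snippet below runs it through Regex.Replace, wraps group 2 in an anchor, and puts the guard characters (groups 1 and 5) back. The class name UrlLinker and the method Linkify are made up for illustration, and the pattern is assumed to use a single guard character on each side.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answer.
public static class UrlLinker
{
    // Group 1: start of string or one non-quote character before the URL.
    // Group 2: the URL itself (the asker's original pattern, unchanged).
    // Group 5: one non-quote character after the URL, or end of string.
    private static readonly Regex Guarded = new Regex(
        "(^|[^'\"])(http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?)([^'\"]|$)",
        RegexOptions.IgnoreCase);

    public static string Linkify(string html)
    {
        // $1 and $5 restore the guard characters; $2 is the bare URL.
        return Guarded.Replace(html, "$1<a href=\"$2\">$2</a>$5");
    }
}
```

In the adapter's Render method this would replace the debug loop with something like writer.Write(UrlLinker.Linkify(page)). Because the guard characters are consumed by the match, two URLs separated by a single character are not both matched, and a URL used as the visible text of an existing anchor would still get wrapped, so treat this as a starting point only.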
BTW, your URL regex looks pretty bad; there are more compact and accurate forms. You really don't need to escape EVERYTHING: inside a character class most punctuation is already literal.
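One possible compact form, offered as an assumption rather than what the answer necessarily had in mind, uses a negative lookbehind for the quote check and a simple "anything up to whitespace, an angle bracket, or a quote" class for the URL body. The names CompactUrlLinker and Linkify are hypothetical, and accepting https as well as http is an extension beyond the original pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating a shorter pattern; not from the original answer.
public static class CompactUrlLinker
{
    // (?<!['"])   : the URL must not be directly preceded by a quote (nothing is consumed)
    // https?://   : scheme (https is an added assumption)
    // [^\s<>'"]+  : URL body, ending at whitespace, an angle bracket, or a quote
    private static readonly Regex PlainUrl = new Regex(
        @"(?<!['""])https?://[^\s<>'""]+",
        RegexOptions.IgnoreCase);

    public static string Linkify(string html)
    {
        // The whole match is the URL, so no capture-group bookkeeping is needed.
        return PlainUrl.Replace(html, m => "<a href=\"" + m.Value + "\">" + m.Value + "</a>");
    }
}
```

Since the lookbehind does not consume the character before the URL, two URLs separated by a single space are both matched. Trailing sentence punctuation is still swallowed into the URL, and URLs that form the visible text of an existing anchor are not detected, so a real page would need more care than this sketch shows.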