ZumGuy Publications and Network

Automatically replacing a URL with a link

Posted by Sean on Sunday, 7th October 2012 14:09

This is useful when content submitted through a form contains an internet address (URL) that is not automatically turned into a hyperlink. The solution is to make use of a technique called 'regular expressions'.

Using a regular expression we can identify any URLs in the submitted content. We can then replace any matches with a link to the match itself.
So, we will start with our Perl-Compatible Regular Expression:

$regexp = '/\b((http|https|ftp):\/\/|www)[\w\.\-]+\/?[\w\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}\b/i';


To make things easier to visualise, here is an example URL we might want to detect: https://www.example.com/foo/index.php?hello=world
Let's take this long string apart and look at each block separately.


$regexp is a string, which is why it is enclosed in quotes. Regular expressions have to be surrounded by delimiters, which can be any character that is not alphanumeric, whitespace or a backslash (\).
In this example I'll use simple forward slashes, /.
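
As a side note on delimiters, here is a small sketch showing that the choice of delimiter only affects which characters need escaping. The variable names and the use of '#' as the alternative delimiter are my own choices for illustration; the article itself sticks with forward slashes.

```php
<?php
// The same protocol sub-pattern written with two different delimiters.
$withSlashes = '/(http|https|ftp):\/\//'; // "/" delimiter: slashes inside must be escaped
$withHashes  = '#(http|https|ftp)://#';   // "#" delimiter: slashes need no escaping

$subject = 'https://www.example.com/foo/index.php?hello=world';

preg_match($withSlashes, $subject, $a);
preg_match($withHashes, $subject, $b);

echo $a[0], "\n";          // https://
var_dump($a[0] === $b[0]); // bool(true): both patterns match the same text
```

Note that '#' would not be a convenient delimiter for the full expression, because '#' also appears inside one of its character classes and would then need escaping instead.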

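Next, the \b at the beginning and end of the expression is a word boundary: it matches the position between a word character (a letter, digit or underscore) and a non-word character, or the start or end of the string. This stops the pattern from matching in the middle of a longer word. A space before and after the URL satisfies it, but so does punctuation such as a bracket or a comma.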
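
To see what \b accepts and rejects, here is a minimal sketch using a cut-down pattern (just \b followed by 'www', which is my simplification and not the full expression). It suggests that any non-word character before the match is enough, not only a space.

```php
<?php
// \b matches between a word character (letter, digit, underscore)
// and a non-word character, or at the start/end of the string.
$pattern = '/\bwww/';

$results = [
    'space'  => preg_match($pattern, 'visit www.example.com'), // 1: a space precedes "www"
    'paren'  => preg_match($pattern, '(www.example.com)'),     // 1: "(" is a non-word character
    'letter' => preg_match($pattern, 'awww.example.com'),      // 0: "a" is a word character
];

print_r($results);
```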


Getting into the juicy bits of the expression, we find our first subpattern. This identifies the beginning of the URL, which can be http://, https://, ftp:// or simply www (often people don't actually type in that 'weird http thingy', so we will account for that).

((http|https|ftp):\/\/|www)


This subpattern is therefore divided in two parts: either we have the protocol (http://, https://, ftp://) OR (expressed with the pipe, |) we simply have 'www'.

We could write the first part like this:

(http:\/\/|https:\/\/|ftp:\/\/)


Meaning 'http://' OR 'https://' OR 'ftp://'. Since we already use forward slashes as delimiters, we must escape every forward slash inside the pattern with the escape character, the backslash; that's why we get \/\/.
But since it's actually composed of a variable part ('http', 'https' or 'ftp') and a fixed part ('://'), we can save a few characters and move the '://' part outside the subpattern:

(http|https|ftp):\/\/


Meaning 'http' OR 'https' OR 'ftp' followed by '://'.
So, looking back to our example string, we have now detected 'https://'.
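
A quick way to convince yourself that the long and short forms are equivalent is to run both against a few sample strings. In this sketch the ^ anchor and the sample strings are assumptions added for the test; they are not part of the article's expression.

```php
<?php
// Long form: '://' repeated in every alternative.
$long  = '/^(http:\/\/|https:\/\/|ftp:\/\/)/';
// Short form: '://' factored out of the alternation.
$short = '/^(http|https|ftp):\/\//';

$samples = ['http://a', 'https://a', 'ftp://a', 'mailto:a'];
$results = [];

foreach ($samples as $s) {
    // Store the result of each pattern side by side for comparison.
    $results[$s] = [preg_match($long, $s), preg_match($short, $s)];
    printf("%-10s long=%d short=%d\n", $s, $results[$s][0], $results[$s][1]);
}
```

Both patterns should agree on every sample: the three protocols match and 'mailto:a' does not.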


Next in our example is the 'www.example.com' part, the domain, which can be just about any combination of letters, numbers, dots, hyphens and underscores.
We could express this with a character class (delimited by '[' and ']'):

[A-Za-z0-9\.\-_]


Again, to save ourselves a few characters, we can use the \w shortcut, which corresponds to [A-Za-z0-9_], and add a couple of characters (hyphens and dots):

[\w\.\-]+


That + I added means one or more, so this whole snippet results in: 'one or more of any word character (a-z, 0-9 and the underscore) plus the dot and hyphen'.
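
Here is a small sketch of the domain part in isolation. I have wrapped the character class in an extra pair of parentheses so that the domain can be read back on its own; that capture group is an addition for illustration and is not in the final expression.

```php
<?php
$subject = 'https://www.example.com/foo/index.php?hello=world';

// Protocol, then one or more word characters, dots or hyphens (captured as group 2).
preg_match('/(http|https|ftp):\/\/([\w\.\-]+)/', $subject, $m);

echo $m[2], "\n"; // www.example.com (the class stops at the first "/")
```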


The domain may or may not be followed by a forward slash. To identify this we can use the ? quantifier, which means 'zero or one':

\/?


Remember, forward slashes must be escaped by '\', or they'll be interpreted as the end delimiter!
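
The 'zero or one' behaviour of ? can be checked with a short sketch. The ^ and $ anchors and the fixed domain are assumptions added here so that each test string has to match in full.

```php
<?php
// An optional trailing slash: zero or one "/", nothing more.
$pattern = '/^www\.example\.com\/?$/';

$results = [];
foreach (['www.example.com', 'www.example.com/', 'www.example.com//'] as $s) {
    $results[$s] = preg_match($pattern, $s);
}

print_r($results); // 1 for the first two strings, 0 for the double slash
```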


Following this there may or may not be any combination of, well, just about any characters, representing the directories and possible values passed into the URL.
In our example, this is '/foo/index.php?hello=world'.

[\w\.\-\?\+\/~=&#;,]*


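We use the asterisk (*) quantifier, which means 'zero or more'.
Notice that the quantifier characters and meta-characters inside the character class have been escaped. Strictly speaking, most of them (such as ?, + and .) are already literal inside a character class, so those backslashes are optional but harmless; the forward slash, however, must be escaped because it is our delimiter.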
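
As a sanity check on the escaping inside the character class, this sketch compares the class as written with a variant that drops the backslashes in front of '.', '?' and '+'. The variant is my own and is only meant to show that, inside a character class, those three characters are literal either way.

```php
<?php
$escaped   = '/[\w\.\-\?\+\/~=&#;,]*/'; // as used in the article
$unescaped = '/[\w.\-?+\/~=&#;,]*/';    // ".", "?" and "+" left unescaped

$path = 'foo/index.php?hello=world';

preg_match($escaped, $path, $a);
preg_match($unescaped, $path, $b);

echo $a[0], "\n";          // foo/index.php?hello=world
var_dump($a[0] === $b[0]); // bool(true)
```

The forward slash still needs its backslash in both versions because it is the delimiter, and the hyphen is kept escaped so it cannot be read as a range.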


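Now we must specify that the last character must be alphanumeric or a forward slash, so that trailing punctuation is left out of the link: a URL followed by '&' or ',', for instance, should not swallow that character. The {1} quantifier means 'exactly one', which is also the default, so it is there only for clarity.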

[a-z0-9\/]{1}
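
The effect of this last block is easiest to see on text where a URL is followed by punctuation. The sample sentence below is made up for illustration. One thing worth knowing: because the closing \b needs a word character on one side, a trailing slash followed by a space is also left out of the match.

```php
<?php
$regexp = '/\b((http|https|ftp):\/\/|www)[\w\.\-]+\/?[\w\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}\b/i';

// One URL followed by "&," and one ending in "/" before a space.
$text = 'See http://example.com/page?x=1&, or http://example.org/ for more.';

preg_match_all($regexp, $text, $matches);

print_r($matches[0]);
// [0] => http://example.com/page?x=1   (the "&," is dropped)
// [1] => http://example.org            (the trailing "/" is dropped)
```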



At the end of our expression, after the closing delimiter, we can use pattern modifiers to alter its behaviour.
Here we use i to switch to case-insensitive mode.
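
A minimal sketch of what the i modifier changes, again using a cut-down pattern of my own rather than the full expression:

```php
<?php
$subject = 'WWW.EXAMPLE.COM';

$caseSensitive   = preg_match('/\bwww/', $subject);  // 0: "WWW" is not "www"
$caseInsensitive = preg_match('/\bwww/i', $subject); // 1: case is ignored

var_dump($caseSensitive, $caseInsensitive);
```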


And here it is, our whole regular expression, ready to match URLs!

/\b((http|https|ftp):\/\/|www)[\w\.\-]+\/?[\w\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}\b/i
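
Before replacing anything, it can help to list what the expression actually finds. This sketch uses preg_match_all on a made-up sentence (the sample text and the extra domains are assumptions) containing one URL of each supported kind plus one bare domain that should be ignored.

```php
<?php
$regexp = '/\b((http|https|ftp):\/\/|www)[\w\.\-]+\/?[\w\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}\b/i';

$text = 'Docs at https://www.example.com/foo/index.php?hello=world, '
      . 'mirror at ftp://files.example.com/pub and www.example.org '
      . 'but not example.net.';

// $matches[0] holds every full match, in order of appearance.
$count = preg_match_all($regexp, $text, $matches);

echo $count, "\n"; // 3
print_r($matches[0]);
```

The bare 'example.net' is skipped because it starts with neither a protocol nor 'www'.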




Now what's left to do is actually use it, to find and replace any URLs in our strings.
We can do this with the preg_replace function:

$linked = preg_replace($regexp, '<a href="$0" target="_blank">$0</a>', $body);


Here, $body is our subject string and $regexp our regular expression.
The second parameter is the replacement. A useful tool here is back referencing. Within the regular expression, PHP will number every parenthesized subpattern in order, starting from 1. The value matched by each subpattern is therefore accessible in the replacement string with $n, where n is the number of the subpattern.
$0 will return the string matched by the whole expression.
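
To make the numbering concrete, here is a small sketch that prints the numbered references instead of building a link. The replacement string is just a demonstration format I made up.

```php
<?php
$regexp = '/\b((http|https|ftp):\/\/|www)[\w\.\-]+\/?[\w\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}\b/i';
$body   = 'https://www.example.com/foo/index.php?hello=world';

// $1 = first (outer) subpattern, $2 = second (inner) subpattern, $0 = whole match.
$result = preg_replace($regexp, '1=[$1] 2=[$2] 0=[$0]', $body);

echo $result, "\n";
// 1=[https://] 2=[https] 0=[https://www.example.com/foo/index.php?hello=world]
```

Subpatterns are numbered by the position of their opening parenthesis, which is why the outer group comes first.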

So this example:

<?php
$regexp = '/\b((http|https|ftp):\/\/|www)[\w\.\-]+\/?[\w\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}\b/i';
$body = 'https://www.example.com/foo/index.php?hello=world';
$body = preg_replace($regexp, '<a href="$0" target="_blank" title="Go to $0">$0</a>', $body);
echo $body;
?>

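Will output the following HTML, which the browser displays as a clickable link:

<a href="https://www.example.com/foo/index.php?hello=world" target="_blank" title="Go to https://www.example.com/foo/index.php?hello=world">https://www.example.com/foo/index.php?hello=world</a>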
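
One caveat with the plain replacement: when the match starts with 'www' there is no protocol, so the href becomes a relative link. A possible way around this, sketched below, is preg_replace_callback. The linkify function, the choice of 'http://' as the default prefix and the use of htmlspecialchars are all my own assumptions, not part of the original solution.

```php
<?php
$regexp = '/\b((http|https|ftp):\/\/|www)[\w\.\-]+\/?[\w\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}\b/i';

// Hypothetical helper: wraps every URL found in $text in an <a> tag.
function linkify(string $text, string $regexp): string
{
    return preg_replace_callback($regexp, function (array $m): string {
        $url = $m[0];
        // $m[2] is the protocol name; it is absent when the match began with "www".
        $href = empty($m[2]) ? 'http://' . $url : $url;
        // Escape both values so characters like "&" are valid inside HTML.
        return '<a href="' . htmlspecialchars($href) . '" target="_blank">'
             . htmlspecialchars($url) . '</a>';
    }, $text);
}

$linked = linkify('Visit www.example.com today', $regexp);
echo $linked, "\n";
// Visit <a href="http://www.example.com" target="_blank">www.example.com</a> today
```

This only escapes the matched URLs. If the surrounding text comes from a form, it would still need its own escaping before being printed.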

I hope this is clear. If you have any questions or comments, I would love to hear from you.

Posted by Andrew on Sunday, 7th October 2012 14:15

This is very interesting and useful. Is there any application of this for added security? And for blocking form spammers? What do you recommend?

Posted by Sean on Sunday, 7th October 2012 16:22

Regular expressions can be used in all sorts of contexts.
You can use them to detect and filter out just about any specific combination of characters.
If you're writing one for a specific application and need a hand, just let me know!

Posted by Andrew on Sunday, 16th December 2012 13:48

I like the new Wysiwyg system!
