I have a PHP array of strings that looks like this:

Array
(
    [1] => Lorem ipsum dolor sit amet http://www.google.com/search?q=stackoverflow consectetur adipiscing elit.
    [2] => Phasellus tempor vehicula fringilla. www.google.com/search?q=stackoverflow&ie=utf-8
    [3] => google.com/search?q=stackoverflow&ie=utf-8 Aenean in cursus libero.
);

The URLs come in all sorts of forms. What I need is an array of those links, something like this:

Array
(
    [1] => http://www.google.com/search?q=stackoverflow
    [2] => http://www.google.com/search?q=stackoverflow&ie=utf-8
    [3] => http://www.google.com/search?q=stackoverflow&ie=utf-8
);
  • Do you think that nobody has ever in the history of the internet had to parse URLs from strings, and that the code to do so has never been shared? Good news! It's been done, and the code has been shared, several thousand times! Head to your nearest search box. Commented Jul 21, 2011 at 9:23
  • Duplicate. stackoverflow.com/questions/1113840/php-remove-url-from-string This will be helpful. Commented Jul 21, 2011 at 9:24
  • Neither a string beginning with "google.com" nor with "www.google.com" is a valid URL. It will be difficult and fuzzy to extract all possible variations. IMO you should first ensure that the URLs are valid. Commented Jul 21, 2011 at 9:24
  • I have tried using a regular expression with preg_match that was supposed to remove URLs, but nothing good came of it. Commented Jul 25, 2011 at 7:57

2 Answers

The code for you:

// The two host alternatives (IPv4 address or domain name) are wrapped in one group
// so that the optional scheme in front applies to both of them.
$pattern = '/((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?((((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+)(\.)(com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|[a-z]{2})))([\/][\/a-zA-Z0-9\.]*)*([\/]?(([\?][a-zA-Z0-9]+[\=][a-zA-Z0-9\%\(\)]*)([\&][a-zA-Z0-9]+[\=][a-zA-Z0-9\%\(\)]*)*))?/';

$a = array(
    'Lorem ipsum dolor sit amet http://www.google.com/search?q=stackoverflow consectetur adipiscing elit.',
    'Phasellus tempor vehicula fringilla. www.google.com/search?q=stackoverflow&ie=utf-8',
    'google.com/search?q=stackoverflow&ie=utf-8 Aenean in cursus libero.',
);

$urls = array();

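// Take the first URL-like match from each line; lines without a match are skipped.
foreach($a as $line)
{
    if(!preg_match($pattern, $line, $match))
        continue;

    // $match[0] holds the text matched by the whole pattern.
    $urls[] = $match[0];
}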

var_dump($urls);

The regular expression was taken from here and corrected a bit.
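
The expected output in the question has every link starting with http://, while the extracted matches keep whatever form they had in the text. As a minimal sketch of that last step (assuming PHP 7 or later; the helper name addDefaultScheme and the choice of http as the default are my assumptions, not part of the answer), a missing scheme could be prepended like this:

```php
<?php
// Hypothetical helper: prepend a default scheme when an extracted link has none.
function addDefaultScheme(string $url, string $scheme = 'http'): string
{
    // A scheme is a letter followed by letters, digits, "+", "." or "-", then "://".
    if (preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        return $url;
    }

    return $scheme . '://' . $url;
}

// Sample links in the three forms used in the question.
$links = array(
    'http://www.google.com/search?q=stackoverflow',
    'www.google.com/search?q=stackoverflow&ie=utf-8',
    'google.com/search?q=stackoverflow&ie=utf-8',
);

// array_map passes one argument, so the default scheme is used.
$normalized = array_map('addDefaultScheme', $links);
print_r($normalized);
```

This only adds a scheme. It does not add a missing www. prefix or check that the host actually exists.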

1 Comment

I have tested this script and found a few weak spots. It gets stuck on special symbols like - or _ or ?, and it does not handle URLs that end in .something (except .html) well.
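
One possible way around the weak spots described in this comment is a looser pattern that stops at whitespace instead of listing the allowed path and query characters. The sketch below is an illustration under that assumption, not a tested replacement for the answer's expression. The function name extractLinks and the pattern itself are hypothetical, and it assumes PHP 7 or later. It also uses preg_match_all, so a line containing several links yields all of them.

```php
<?php
// Hypothetical helper: collect every URL-like token from a list of strings.
function extractLinks(array $lines): array
{
    // Optional scheme, a dotted host name, then an optional path, query or
    // fragment that runs until whitespace, a quote or an angle bracket.
    $pattern = '~\b(?:(?:https?|ftp)://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:[/?#][^\s<>"\']*)?~i';

    $links = array();
    foreach ($lines as $line) {
        // preg_match_all returns 0 (or false on error) when nothing matches.
        if (!preg_match_all($pattern, $line, $matches)) {
            continue;
        }
        foreach ($matches[0] as $match) {
            // Drop sentence punctuation that ended up glued to the link.
            $links[] = rtrim($match, '.,;:!?)');
        }
    }

    return $links;
}

$lines = array(
    'Lorem ipsum dolor sit amet http://www.google.com/search?q=stackoverflow consectetur adipiscing elit.',
    'Phasellus tempor vehicula fringilla. www.google.com/search?q=stackoverflow&ie=utf-8',
    'google.com/search?q=stackoverflow&ie=utf-8 Aenean in cursus libero.',
);

print_r(extractLinks($lines));
```

The trade-off is more false positives: any word.word token, such as a file name or the domain part of an e-mail address, matches as well, so the results may need a further check before use.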

You should write a proper regular expression to achieve this. Have a look at this
