
In short, I'm using preg_replace to find style sheets and essentially proxy an external site for viewers on my website: I take the external domain and prepend it to the current href. A style sheet link starts like so.

<link rel="stylesheet" type="text/css" href="/assets/css/base.css">

I take the href and prepend the domain, so that it becomes

<link rel="stylesheet" type="text/css" href="http://www.website.com/assets/css/base.css">

My issue arises when a site uses a protocol-relative URL, that is, an href that omits http: or https: and starts with //

<link rel="stylesheet" type="text/css" href="//cdn.website.com/assets/css/base.css">

My current preg_replace does not handle this case and turns the stylesheet link into the following

<link rel="stylesheet" type="text/css" href="http://www.website.com//cdn.website.com/assets/css/base.css">

Is it possible to build some sort of if/then into preg_replace so that it leaves the "//" hrefs alone and only rewrites the ones that have no absolute base domain?

Current preg_replace being used:

$html = file_get_contents($website_url);
$domain = 'website.com';
$html = preg_replace("/(href|src)\=\"([^(http)])(\/)?/", "$1=\"$domain$2", $html);
echo $html;
Simple: don't use regexes. Use a DOM parser and then it's a simple string replace operation once you've got the href attribute's contents. Commented Jun 13, 2014 at 22:22

3 Answers

2

There are if/then/else conditionals in regex, although not really necessary for this to work:

(?!(href|src)=)(\")\/(\\w+.+)(\">)

Code:

$html = file_get_contents($website_url);
$domain = 'http://website.com';
$result = preg_replace("/(?!(href|src)=)(\")\/(\\w+.+)(\">)/u", "$2$domain/$3$4", $html);
echo $result;

Output:

<link rel="stylesheet" type="text/css" href="http://website.com/assets/css/base.css">

Example:

http://regex101.com/r/kU7pF1
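
As a quick check, the sketch below runs the pattern from this answer against the three kinds of link mentioned in the question. The pattern is copied as-is; the $samples array and the $results variable are assumptions added for illustration, and the replacement is written with concatenation, which should be equivalent to the interpolated string used above.

```php
<?php
// Pattern copied from the answer; only the sample inputs are new.
$domain = 'http://website.com';
$pattern = "/(?!(href|src)=)(\")\/(\\w+.+)(\">)/u";

// Hypothetical inputs: root-relative, protocol-relative, and absolute links.
$samples = [
    '<link rel="stylesheet" type="text/css" href="/assets/css/base.css">',
    '<link rel="stylesheet" type="text/css" href="//cdn.website.com/assets/css/base.css">',
    '<link rel="stylesheet" type="text/css" href="http://www.website.com/assets/css/base.css">',
];

$results = [];
foreach ($samples as $sample) {
    // Same replacement as "$2$domain/$3$4", built by concatenation.
    $results[] = preg_replace($pattern, '$2' . $domain . '/$3$4', $sample);
}

echo implode("\n", $results), "\n";
```

Only the first sample should change, because the pattern requires a quote followed by a single slash and then a word character. One caveat worth testing on real pages: since .+ is greedy, a line that holds several root-relative links may see only the first one rewritten.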


1

[^(http)] is not a negation. It's still a character class.

You are looking for a (?!...) negative lookahead:

 ~  (href|src) =\" (?!http:)  \/?  ~x

While I dispute the SO meme and overgeneralization of firing up a DOM traversal for every trivial task, it should be noted that regex is often only appropriate for normalized and well-known HTML input, not for proxying arbitrary websites.
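
A minimal sketch of the lookahead idea applied to the question's case, assuming double-quoted attributes and a base URL without a trailing slash. The function name prependBase is hypothetical, and the lookahead is widened beyond http: so that it also skips any other scheme and the protocol-relative "//" form.

```php
<?php
// Hypothetical helper; $base is assumed to be scheme plus host, no trailing slash.
function prependBase(string $html, string $base): string
{
    // Right after href=" or src=", refuse to match when the value already
    // starts with a scheme (http:, https:, data:, ...) or with "//".
    $pattern = '~\b(href|src)="(?![a-z][a-z0-9+.-]*:|//)/?~i';

    // A callback keeps any "$" or "\" in $base from being read as a backreference.
    return preg_replace_callback($pattern, function (array $m) use ($base) {
        return $m[1] . '="' . $base . '/';
    }, $html);
}

// Hypothetical input covering the three cases from the question.
$html = implode("\n", [
    '<link rel="stylesheet" type="text/css" href="/assets/css/base.css">',
    '<link rel="stylesheet" type="text/css" href="//cdn.website.com/assets/css/base.css">',
    '<link rel="stylesheet" type="text/css" href="http://www.website.com/assets/css/base.css">',
]);

echo prependBase($html, 'http://www.website.com'), "\n";
```

This sketch resolves every remaining value against the site root, so a document-relative path such as img/logo.png or a fragment such as #top also gets the base prepended. That is a simplification, not full URL resolution.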


0
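// Assumed to be a method of a class that also defines mungeHref(); the answer
// does not show mungeHref(), which is where each href actually gets rewritten.
function alterLinks($html) {

  $dom = new DOMDocument();
  $dom->loadHTML($html);

  // Note: this visits <a> elements only; stylesheets live on <link> elements.
  $links = $dom->getElementsByTagName('a');

  foreach ($links as $alink) {
    $href = $alink->getAttribute('href');
    $aMungedLink = $this->mungeHref($href);
    $alink->setAttribute("href", $aMungedLink);
  }

  // Serialize the modified document back to an HTML string.
  return $dom->saveHTML();
}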
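
For the case in the question, the per-attribute rewrite could look like the following sketch. It is not the answer author's mungeHref(): the function name rewriteRootRelativeUrls, the tag-to-attribute map, and the sample markup are all assumptions made for illustration. It only touches values that start with exactly one slash and leaves "//" and absolute URLs alone.

```php
<?php
// Hypothetical helper; $base is assumed to be scheme plus host, no trailing slash.
function rewriteRootRelativeUrls(string $html, string $base): string
{
    $dom = new DOMDocument();

    // Real-world markup often triggers libxml warnings; collect them quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumed set of elements worth rewriting; extend as needed.
    $attributeByTag = ['link' => 'href', 'a' => 'href', 'script' => 'src', 'img' => 'src'];

    foreach ($attributeByTag as $tag => $attribute) {
        foreach ($dom->getElementsByTagName($tag) as $node) {
            $url = $node->getAttribute($attribute);
            // Root-relative means one leading slash, not two.
            if (strncmp($url, '/', 1) === 0 && strncmp($url, '//', 2) !== 0) {
                $node->setAttribute($attribute, $base . $url);
            }
        }
    }

    return $dom->saveHTML();
}

// Hypothetical page with a root-relative link, a protocol-relative link, and an image.
$page = '<html><head>'
    . '<link rel="stylesheet" type="text/css" href="/assets/css/base.css">'
    . '<link rel="stylesheet" type="text/css" href="//cdn.website.com/assets/css/base.css">'
    . '</head><body><img src="/logo.png"></body></html>';

$out = rewriteRootRelativeUrls($page, 'http://www.website.com');
echo $out;
```

Because the check runs on the parsed attribute value, quoting style and attribute order in the source markup should not matter, which is the main advantage over a pattern applied to the raw HTML.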

3 Comments

Welcome to StackOverflow. While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. Consider editing your answer to add that context.
Some of the comments in this thread involved regular expressions. I recently had a "change hrefs" problem while writing a plugin for a dynamic CMS, so that I could optionally output static HTML instead. I tried but failed to get preg_replace and regular expressions to work. The code above is clean and simple. It worked for me. I didn't write the mungeHref($href) function above because my needs were different from yours. That's the easy part anyway.
fwiw I used almost identical code to rework the "src" attributes for all images in a dynamic HTML page, so it could then be written out as static HTML. But that's a different topic.
