I have been given a dump of strings made up of JavaScript object literals that were originally scraped from the web, and I need to extract some data from them in PHP. They are not valid JSON, so I can't use json_decode. They have the format below, where DETAILS is the part I need to capture.

...data: [DETAILS]...

In some of the sources, the data element shows up more than once, and I need to capture every match. DETAILS can contain any character, including [ { } ], quotes, and commas, and I need to capture all of it.

I am trying to use regex. Here is what I have tried by following some tutorials, but it's certainly wrong.

preg_match_all('~(?:\G(?!^),|(data: )\{)\s+([^:]+): (\d+|"[^"]*")~', $html, $out, PREG_SET_ORDER) ? $out : []

I would really appreciate some help with this.

Edit: This is just an example of one data field, showing DETAILS. It's not always in this form.

series:[{name:'Records',data:[[Date.parse('2013-11-01'),1],[Date.parse('2013-12-01'),2],[Date.parse('2014-01-01'),1],[Date.parse('2014-02-01'),4],[Date.parse('2014-03-01'),23],[Date.parse('2014-04-01'),22],[Date.parse('2014-05-01'),19],[Date.parse('2014-06-01'),26],[Date.parse('2014-07-01'),43],[Date.parse('2014-08-01'),29],[Date.parse('2014-09-01'),47],[Date.parse('2014-10-01'),31],[Date.parse('2014-11-01'),32],[Date.parse('2014-12-01'),17],[Date.parse('2015-01-01'),28],[Date.parse('2015-02-01'),2],[Date.parse('2015-03-01'),18],[Date.parse('2015-04-01'),16],[Date.parse('2015-05-01'),10],[Date.parse('2015-06-01'),25],[Date.parse('2015-07-01'),20],[Date.parse('2015-08-01'),21],[Date.parse('2015-09-01'),6],[Date.parse('2015-10-01'),10],[Date.parse('2015-11-01'),-11],[Date.parse('2015-12-01'),12],[Date.parse('2016-01-01'),46],[Date.parse('2016-02-01'),32],[Date.parse('2016-03-01'),16],[Date.parse('2016-04-01'),28],[Date.parse('2016-05-01'),34],[Date.parse('2016-06-01'),24],[Date.parse('2016-07-01'),40],[Date.parse('2016-08-01'),24],[Date.parse('2016-09-01'),57],[Date.parse('2016-10-01'),42],[Date.parse('2016-11-01'),51],[Date.parse('2016-12-01'),53],[Date.parse('2017-01-01'),63],[Date.parse('2017-02-01'),23],[Date.parse('2017-03-01'),80],[Date.parse('2017-04-01'),56],[Date.parse('2017-05-01'),61],[Date.parse('2017-06-01'),74],[Date.parse('2017-07-01'),107],[Date.parse('2017-08-01'),74],[Date.parse('2017-09-01'),120],[Date.parse('2017-10-01'),79],[Date.parse('2017-11-01'),163],[Date.parse('2017-12-01'),130],[Date.parse('2018-01-01'),126],[Date.parse('2018-02-01'),153],[Date.parse('2018-03-01'),236],[Date.parse('2018-04-01'),255],[Date.parse('2018-05-01'),236],[Date.parse('2018-06-01'),231],[Date.parse('2018-07-01'),223],[Date.parse('2018-08-01'),55],[Date.parse('2018-09-01'),171],[Date.parse('2018-10-01'),152],[Date.parse('2018-11-01'),139],[Date.parse('2018-12-01'),115],[Date.parse('2019-01-01'),83],[Date.parse('2019-02-01'),168],[Date.parse('2019-03-01'),79],[Date.parse('2019-04-01'),120],[Date.parse('2019-05-01'),221],[Date.parse('2019-06-01'),167],[Date.parse('2019-07-01'),192],[Date.parse('2019-08-01'),296],[Date.parse('2019-08-17'),40],]}],

  • How do you know when you have reached the end of [DETAILS]? Some more detailed sample data would help. Commented Aug 17, 2019 at 3:54
  • Thanks Nick. I just added one example. I was hoping to know when I reach the end of DETAILS by matching the opening and closing brackets (in the case where there are nested [s). Commented Aug 17, 2019 at 3:57
  • If you aren't already, I suggest using a regex tester like regex101. You're dangerously close to the difficulties of parsing HTML with regex (opening and closing "tags", uncertain content), which is a sure path to madness! Commented Aug 17, 2019 at 4:09

1 Answer

How about this little cheat:

$newtest = preg_replace('~.+data:\[~s', '', $html);  // remove everything up to and including the last "data:[" (.+ is greedy)

$matches = preg_match_all('~([^\]]+\])~s', $newtest, $out, PREG_SET_ORDER) ? $out : [];  // match each DETAILS segment; [] if nothing matched

Remember to escape (with backslash) your bracket characters, because they have special meaning in regex.
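
If the DETAILS blocks can nest, a different technique from the cheat above may be worth a look: PCRE supports recursive subpatterns, so one pattern can match a balanced [...] block after each data: key. The following is only a sketch under a few assumptions: the sample string in $html is made up for illustration, strings inside DETAILS are assumed to use ' or " with backslash escapes, and anything else (comments, regex literals) is not handled.

```php
<?php
// Hypothetical sample input; $html stands in for one scraped source.
$html = "series:[{name:'A',data:[[Date.parse('2013-11-01'),1],[2,']']]},{name:'B',data: [3,4]}]";

// Group 1 is a balanced [...] block. Inside it we accept:
//   - runs of characters that are neither brackets nor quotes,
//   - whole quoted strings (so brackets inside quotes are ignored),
//   - a nested [...] block, matched by recursing into group 1 via (?1).
$pattern = <<<'REGEX'
~data:\s*(\[(?:[^\[\]'"]++|'(?:[^'\\]|\\.)*+'|"(?:[^"\\]|\\.)*+"|(?1))*+\])~s
REGEX;

preg_match_all($pattern, $html, $out);
$details = $out[1];   // one entry per data element, outer brackets included

print_r($details);
```

Because preg_match_all keeps scanning after each match, this picks up every data element in the source in one call, so no outer loop is needed.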


7 Comments

Thanks, this seems pretty close! The problem however is that there can be multiple "data" elements within one html source, and not always only one.
In that case, I'd continue the logic above. The first "match" of preg_match_all will be the entire matched string. Use that in another preg_replace to again remove what you don't need anymore, and then run another preg_match_all. You could encapsulate the whole thing in a while loop, checking whether the preg_match_all found a match in order to determine whether to continue. Make sense?
This makes sense. Thank you! I will give that a try right away. I may have one more question if I run into problems.
I modified your regex to work better for my use case, and I'm very close now. Only two problems are left. First, the first match is preceded by an open bracket [ and some whitespace, which I need to exclude. Second, the last match begins with an open bracket and a brace [{, which I also need to exclude. How can I drop the last match based on that? Here is my code: $newtest = preg_replace('~.+data:~s', '', $html); preg_match_all('(\[+[^\]]+\])', $newtest, $out, PREG_SET_ORDER) ? $out : [];
Post a new sample with multiple "data" elements, please, so I can experiment. You could try adding the Ungreedy modifier: preg_replace('~.+data:~Us', '', $html);
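
For the multiple-"data" case discussed in these comments, the loop idea could also be written without repeated preg_replace calls. The sketch below is one possible take, not the code from the answer: it finds each "data:[" with preg_match and an offset, then walks forward counting bracket depth until the block closes. The function name extractDataBlocks and the sample string are hypothetical, and quoted strings are assumed to use ' or " with backslash escapes.

```php
<?php
// Hypothetical helper: returns every balanced [...] block that follows "data:".
function extractDataBlocks(string $src): array
{
    $blocks = [];
    $offset = 0;
    $len = strlen($src);
    // Each pass finds the next "data:[" at or after $offset.
    while (preg_match('~data:\s*\[~', $src, $m, PREG_OFFSET_CAPTURE, $offset)) {
        $start = $m[0][1] + strlen($m[0][0]) - 1;   // index of the opening [
        $depth = 0;
        $quote = null;                               // active quote character, if any
        $end = null;
        for ($i = $start; $i < $len; $i++) {
            $c = $src[$i];
            if ($quote !== null) {
                if ($c === '\\') {
                    $i++;                            // skip the escaped character
                } elseif ($c === $quote) {
                    $quote = null;                   // string closed
                }
            } elseif ($c === '"' || $c === "'") {
                $quote = $c;
            } elseif ($c === '[') {
                $depth++;
            } elseif ($c === ']' && --$depth === 0) {
                $end = $i;                           // matching close of the outer [
                break;
            }
        }
        if ($end === null) {
            break;                                   // unbalanced input: stop scanning
        }
        $blocks[] = substr($src, $start, $end - $start + 1);
        $offset = $end + 1;                          // continue after this block
    }
    return $blocks;
}

// Made-up input with two data elements, one containing a ] inside a string.
$sample = "a:{data: [1,[2,3]]}, b:{data:['x]',4]}";
print_r(extractDataBlocks($sample));
```

Each returned block starts at the [ that follows data: and ends at its matching ], so the stray leading bracket and the trailing [{ match mentioned above should not appear in the result.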