Extracting some data from a JS object literal string in PHP using Regex

Question

I have a dump of strings containing JS object literals that were originally scraped from the web, and I need to extract some data from them in PHP. They are not valid JSON, so I can't use json_decode. They have the format below, where DETAILS is what I need to capture.

...data: [DETAILS]...

In some of the sources, the data element appears more than once, and I need to capture every occurrence. DETAILS can contain any character, including [ { } ], quotes, and commas, and I need to capture all of it.

I am trying to use regex. Here is what I have tried by following some tutorials, but it's certainly wrong.

preg_match_all('~(?:\G(?!^),|(data: )\{)\s+([^:]+): (\d+|"[^"]*")~', $html, $out, PREG_SET_ORDER) ? $out : []

Please I really need some help with this.

Edit: This is just an example of one data field, showing DETAILS. It's not always in this form.

series:[{name:'Records',data:[[Date.parse('2013-11-01'),1],[Date.parse('2013-12-01'),2],[Date.parse('2014-01-01'),1],[Date.parse('2014-02-01'),4],[Date.parse('2014-03-01'),23],[Date.parse('2014-04-01'),22],[Date.parse('2014-05-01'),19],[Date.parse('2014-06-01'),26],[Date.parse('2014-07-01'),43],[Date.parse('2014-08-01'),29],[Date.parse('2014-09-01'),47],[Date.parse('2014-10-01'),31],[Date.parse('2014-11-01'),32],[Date.parse('2014-12-01'),17],[Date.parse('2015-01-01'),28],[Date.parse('2015-02-01'),2],[Date.parse('2015-03-01'),18],[Date.parse('2015-04-01'),16],[Date.parse('2015-05-01'),10],[Date.parse('2015-06-01'),25],[Date.parse('2015-07-01'),20],[Date.parse('2015-08-01'),21],[Date.parse('2015-09-01'),6],[Date.parse('2015-10-01'),10],[Date.parse('2015-11-01'),-11],[Date.parse('2015-12-01'),12],[Date.parse('2016-01-01'),46],[Date.parse('2016-02-01'),32],[Date.parse('2016-03-01'),16],[Date.parse('2016-04-01'),28],[Date.parse('2016-05-01'),34],[Date.parse('2016-06-01'),24],[Date.parse('2016-07-01'),40],[Date.parse('2016-08-01'),24],[Date.parse('2016-09-01'),57],[Date.parse('2016-10-01'),42],[Date.parse('2016-11-01'),51],[Date.parse('2016-12-01'),53],[Date.parse('2017-01-01'),63],[Date.parse('2017-02-01'),23],[Date.parse('2017-03-01'),80],[Date.parse('2017-04-01'),56],[Date.parse('2017-05-01'),61],[Date.parse('2017-06-01'),74],[Date.parse('2017-07-01'),107],[Date.parse('2017-08-01'),74],[Date.parse('2017-09-01'),120],[Date.parse('2017-10-01'),79],[Date.parse('2017-11-01'),163],[Date.parse('2017-12-01'),130],[Date.parse('2018-01-01'),126],[Date.parse('2018-02-01'),153],[Date.parse('2018-03-01'),236],[Date.parse('2018-04-01'),255],[Date.parse('2018-05-01'),236],[Date.parse('2018-06-01'),231],[Date.parse('2018-07-01'),223],[Date.parse('2018-08-01'),55],[Date.parse('2018-09-01'),171],[Date.parse('2018-10-01'),152],[Date.parse('2018-11-01'),139],[Date.parse('2018-12-01'),115],[Date.parse('2019-01-01'),83],[Date.parse('2019-02-01'),168],[Date.parse('2019-03-01'),79],[Date.parse('2019-04-01'),120],[Date.parse('2019-05-01'),221],[Date.parse('2019-06-01'),167],[Date.parse('2019-07-01'),192],[Date.parse('2019-08-01'),296],[Date.parse('2019-08-17'),40],]}],

How do you know when you have reached the end of [DETAILS]? Some more detailed sample data would help.
– Nick, Commented Aug 17, 2019 at 3:54
Thanks, Nick. I just added one example. I was hoping to detect the end of DETAILS by matching the opening and closing brackets (for the cases where there are nested [ characters).
– Cogicero, Commented Aug 17, 2019 at 3:57
If you aren't already, I suggest using a regex tester like regex101. You're dangerously close to the difficulties of parsing HTML with regex (opening and closing "tags", uncertain content), which is a sure path to madness!
– FoulFoot, Commented Aug 17, 2019 at 4:09

FoulFoot · Accepted Answer · 2019-08-17 04:34:27Z

1

How about this little cheat:

$newtest = preg_replace('~.+data:\[~s', '', $html);  // remove everything before the data you want to capture

$matches = preg_match_all('~([^\]]+\])~s', $newtest, $out, PREG_SET_ORDER) ? $out : [];   // match each DETAILS segment; $matches is [] when nothing is found

Remember to escape (with backslash) your bracket characters, because they have special meaning in regex.
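
If DETAILS can itself contain nested brackets, a recursive PCRE pattern is one possible alternative. This is only a minimal sketch and not part of the original answer: the function name extractDataBlocks is hypothetical, and it assumes the key is always spelled data and that quoted strings use backslash escapes.

```php
<?php
// Hypothetical helper: returns the bracketed value of every "data:" field.
function extractDataBlocks(string $source): array
{
    $pattern = '~
        \bdata\s*:\s*                      # the key, with optional whitespace
        (                                  # group 1: one balanced bracket block
            \[
            (?:
                [^\[\]\'"]++               # anything except brackets and quotes
              | \'(?:[^\'\\\\]|\\\\.)*+\'  # a single-quoted string
              | "(?:[^"\\\\]|\\\\.)*+"     # a double-quoted string
              | (?1)                       # a nested block: recurse into group 1
            )*+
            \]
        )
    ~xs';

    // $m[1] holds the group 1 capture of every match.
    return preg_match_all($pattern, $source, $m) ? $m[1] : [];
}

// Made-up input with two data fields; the second row hides a "]" in a string.
$sample = "series:[{name:'A',data:[[Date.parse('2013-11-01'),1],[2,'x]y']]},{name:'B',data:[3,4]}]";
print_r(extractDataBlocks($sample));
```

Because the quoted-string branches are tried before the recursion, a bracket inside a string does not upset the balance. Comments or regex literals inside the JS source are not handled.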

answered Aug 17, 2019 at 4:34


7 Comments

Cogicero Over a year ago

Thanks, this seems pretty close! The problem however is that there can be multiple "data" elements within one html source, and not always only one.

FoulFoot Over a year ago

In that case, I'd continue the logic above. The first "match" of preg_match_all will be the entire matched string. Use that in another preg_replace to again remove what you don't need anymore, and then run another preg_match_all. You could encapsulate the whole thing in a while loop, checking whether the preg_match_all found a match in order to determine whether to continue. Make sense?
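
One way to read the loop idea above is sketched here. It is an assumption rather than the commenter's code: it swaps the repeated preg_replace for a search offset, and it counts bracket depth instead of stopping at the first ]. The function name collectDataBlocks is hypothetical.

```php
<?php
// Hypothetical helper: finds each "data:[" marker in turn and walks forward,
// counting bracket depth, until the matching "]" is reached.
function collectDataBlocks(string $html): array
{
    $blocks = [];
    $offset = 0;
    $length = strlen($html);

    // Keep looping while another marker is found after the previous block.
    while (preg_match('~\bdata\s*:\s*\[~', $html, $m, PREG_OFFSET_CAPTURE, $offset)) {
        $start = $m[0][1] + strlen($m[0][0]) - 1; // position of the opening "["
        $depth = 0;
        $quote = null;
        $end = null;

        for ($i = $start; $i < $length; $i++) {
            $ch = $html[$i];
            if ($quote !== null) {
                if ($ch === '\\') {
                    $i++;                // skip the escaped character
                } elseif ($ch === $quote) {
                    $quote = null;       // closing quote reached
                }
            } elseif ($ch === '"' || $ch === "'") {
                $quote = $ch;            // entering a quoted string
            } elseif ($ch === '[') {
                $depth++;
            } elseif ($ch === ']' && --$depth === 0) {
                $end = $i;               // matching close for the first "["
                break;
            }
        }

        if ($end === null) {
            break; // unbalanced input: stop rather than guess
        }
        $blocks[] = substr($html, $start, $end - $start + 1);
        $offset = $end + 1;
    }

    return $blocks;
}
```

Each returned element includes its outer brackets, so a source with several data fields yields one array entry per field.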

Cogicero Over a year ago

This makes sense. Thank you! I will give that a try right away. Please I may have one more question if I run into too many problems!

Cogicero Over a year ago

I modified your regex to work better for my use case, and I'm so close now. Only two problems are left. The first match is preceded by an open bracket [ and some whitespace, which I need to exclude. The last match begins with an open bracket and a brace [{, which I also need to exclude. How can I drop the last match based on that? Here is my code: $newtest = preg_replace('~.+data:~s', '', $html); preg_match_all('(\[+[^\]]+\])', $newtest, $out, PREG_SET_ORDER) ? $out : [];

FoulFoot Over a year ago

Post a new sample with multiple "data" elements, please, so I can experiment. You could try adding the Ungreedy modifier: preg_replace('~.+data:~Us', '', $html);
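
A small sketch of what the U modifier changes may help when experimenting. The sample string is made up. Note that preg_replace replaces every match unless a limit is passed, so it is worth checking the result on real input.

```php
<?php
// Made-up sample with two data fields.
$html = "a data:[1] b data:[2]";

// Greedy: .+ runs to the last "data:", so only the final field survives.
$greedy = preg_replace('~.+data:~s', '', $html);

// Ungreedy (U): each match is shorter, but preg_replace() replaces every
// match, so the text between the two markers is removed as well.
$lazyAll = preg_replace('~.+data:~Us', '', $html);

// Ungreedy with a limit of 1: only the text before the first marker goes.
$lazyOnce = preg_replace('~.+data:~Us', '', $html, 1);

var_dump($greedy, $lazyAll, $lazyOnce);
```

In this sample the first two calls both leave "[2]", while the limited call leaves "[1] b data:[2]".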
