
I'm trying to use REGEXP in a MariaDB 11.0.2 query to extract all of the image source links from the HTML and text in the post_content field of each row in wp_posts. I've looked at related questions but still can't get this to work.

My Regex works in a fiddle: https://regex101.com/r/CyuWhi/1

But using it in a query in Adminer 4.8.1:

SELECT post_content  
FROM wp_posts  
WHERE post_content REGEXP 'src=["|\'](.*?)["|\']';

returns the src links, but also other HTML and text. What I want is just a list of the image source links.

So, either my REGEXP is wrong for Adminer, or the structure of my overall query is not working.

Any ideas?

Edit: Barmar makes the good point that I should use REGEXP_SUBSTR. How would I do that?

The regex101 fiddle has sample row data and the regex. But I have thousands of rows in wp_posts and each has a post_content column that I want to search.

So would I use REGEXP_SUBSTR() with table data? I.e., example #4:

https://www.mysqltutorial.org/mysql-regular-expressions/mysql-regexp_substr/

It would be like that example, except that all I need is a list of the src URLs, not the year.

The screenshot is of Adminer; the regex selects the row, but not just the image source URLs. (Image links are out of sight to the right.)


  • Why do you have | in the character sets? This will match src=|foo|, do you really want that? Commented Jul 15, 2024 at 18:38
  • It seems like you may be confusing square brackets with parentheses, e.g. ("|\') will match either " or '. Commented Jul 15, 2024 at 18:39
  • Or upgrade to a modern version of MySQL. Commented Jul 15, 2024 at 18:50
  • Show example input and expected output. You say "all of the image source links", implying there may be more than one in a given row; if so, include that in your input/output. Commented Jul 15, 2024 at 20:21
  • @ysth Thanks! There are multiple img links in each post_content column. The regex101 fiddle has sample data and the query, and the screenshot is of Adminer. So would REGEXP_SUBSTR "loop" through each wp_posts row and post_content column? Commented Jul 15, 2024 at 22:49
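
To illustrate the point in the comments above about | inside square brackets, here is a minimal sketch that uses string literals rather than table data. The test string src=|foo| comes from the first comment; the column aliases are made up for illustration, and the default sql_mode is assumed so that \' is an escaped quote.

```sql
-- Inside [...] the | is a literal character, not alternation,
-- so the first pattern also accepts | as if it were a quote.
SELECT
  'src=|foo|' REGEXP 'src=["|\'].*["|\']' AS with_pipe,    -- 1: matches
  'src=|foo|' REGEXP 'src=["\'].*["\']'   AS without_pipe; -- 0: no match
```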

1 Answer


Get rid of the ? lazy modifier, which isn't supported in MySQL 5.x regular expressions. It's also not needed when you are only testing for a regexp match; it matters only when you need to determine the matching portion (as you might with REGEXP_SUBSTR() or REGEXP_REPLACE() in MySQL 8.x).

SELECT post_content  
FROM wp_posts  
WHERE post_content REGEXP 'src=["\'].*["\']';

Note that MySQL doesn't use the PCRE library, so selecting that engine at regex101 may produce misleading results.
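
As a sketch of the same row filter written without any lazy quantifier, negated character classes can keep the match inside a single tag and a single quoted value. This assumes the default sql_mode (so \' is an escaped quote), the standard WordPress ID column, and that the src attribute of interest sits inside an img tag. Like the query above, it only selects rows and still returns the whole post_content.

```sql
-- Row filter only: returns every row whose post_content contains
-- an <img ...> tag with a quoted src attribute.
SELECT ID, post_content
FROM wp_posts
WHERE post_content REGEXP '<img[^>]*src=["\'][^"\']*["\']';
```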


4 Comments

Thanks! Sorry, I checked for sure and I'm actually on 11.0.2-MariaDB (through the PHP extension MySQLi) and Adminer 4.8.1. Still no luck with your query; I still get what appears to be the entire post_content field. Does MariaDB make a big difference?
Why shouldn't it return the entire field? You're only using the regexp to select rows, not to extract a substring from the field. The error would be if it returns rows that don't have src="..." in it.
Use REGEXP_SUBSTR() to just extract the links from the HTML. SELECT REGEXP_SUBSTR(...)
The question was updated; it looks like the OP uses MariaDB 11.0.2 (which has used PCRE since 10.0.5).
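
Following up on the REGEXP_SUBSTR() suggestion, here is a minimal sketch for MariaDB, where REGEXP_SUBSTR(subject, pattern) takes two arguments and the PCRE engine allows a lookbehind. Assumptions: the default sql_mode (so \' is an escaped quote), the standard WordPress ID column, and the made-up alias first_src. It returns only the first src value in each row, and it matches any src attribute, not only those on img tags.

```sql
-- Extract the first quoted src value from each matching row.
-- The lookbehind (?<=src=["']) anchors the match just after the
-- opening quote, so only the URL itself is returned.
SELECT
  ID,
  REGEXP_SUBSTR(post_content, '(?<=src=["\'])[^"\']+') AS first_src
FROM wp_posts
WHERE post_content REGEXP 'src=["\']';
```

Since each post can contain several images, collecting every URL would need something beyond a single REGEXP_SUBSTR() call, for example processing the selected rows in application code.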
