How to match the first word after an expression with regex?

Question

For example, in this text:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc eu tellus vel nunc pretium lacinia. Proin sed lorem. Cras sed ipsum. Nunc a libero quis risus sollicitudin imperdiet.

I want to match the word after 'ipsum'.

Ben Blank · Accepted Answer · 2009-02-13 16:45:14Z

55

This sounds like a job for lookbehinds, though you should be aware that not all regex flavors support them. In your example:

(?<=\bipsum\s)(\w+)

This will match any sequence of word characters (letters, digits, and underscores) that follows "ipsum" as a whole word followed by a single whitespace character. Because the lookbehind does not match "ipsum" itself, you don't need to worry about reinserting it in the case of, e.g., replacements.

As I said, though, some flavors (JavaScript before ES2018, for example) don't support lookbehind at all. Many others (most, in fact) only support "fixed width" lookbehinds, so you could use this example but not any of the repetition operators. (In other words, (?<=\b\w+\s+)(\w+) wouldn't work.)
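As a rough illustration of the lookbehind approach, here is a minimal Java sketch. The class name, method names, and the shortened sample text are assumptions made for this example, not part of the original answer. It shows that only the word after "ipsum" is matched, so a replacement leaves "ipsum" in place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindDemo {
    // Shortened version of the question's text (an assumption for brevity).
    static final String TEXT = "Lorem ipsum dolor sit amet. Cras sed ipsum. Nunc a libero.";

    // The pattern from the answer: a word preceded by "ipsum" and one whitespace character.
    static final Pattern AFTER_IPSUM = Pattern.compile("(?<=\\bipsum\\s)(\\w+)");

    // Collects every word that the lookbehind pattern matches.
    static List<String> wordsAfterIpsum(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = AFTER_IPSUM.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    // Replaces only the matched word; "ipsum" is never consumed by the match.
    static String replaceWordAfterIpsum(String text, String replacement) {
        return AFTER_IPSUM.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(wordsAfterIpsum(TEXT));              // [dolor]
        System.out.println(replaceWordAfterIpsum(TEXT, "XXX")); // Lorem ipsum XXX sit amet. Cras sed ipsum. Nunc a libero.
    }
}
```

Only "dolor" is found here, because the second "ipsum" is followed by a period rather than whitespace.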

edited Feb 13, 2009 at 16:45

answered Feb 13, 2009 at 15:01


6 Comments

cletus Over a year ago

Lookbehinds tend to be pretty limited when it comes to using wildcards though.

user55400 Over a year ago

Lookbehinds might not even be necessary here. Depending on what 'I want to match' in the question refers to, see David Kemp's solution.

annakata Over a year ago

zero-width tends to be what you want though, it's just that grouping is a trivial get out of jail card.

Peter Boughton Over a year ago

Fixed width is a misleading term - it is more "max width", yes? In most cases it is possible to use a suitable limit, for example: (?<=\b\w{1,100}\s{1,100})

Ben Blank Over a year ago

@Peter — No, it really is fixed width. Try your regex there in Python; it throws an exception.


Alan Moore · Answer · 2009-02-13 20:49:29Z

Some of the other responders have suggested using a regex that doesn't depend on lookbehinds, but I think a complete, working example is needed to get the point across. The idea is that you match the whole sequence ("ipsum" plus the next word) in the normal way, then use a capturing group to isolate the part that interests you. For example:

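import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sample text from the question.
String s = "Lorem ipsum dolor sit amet, consectetur " +
    "adipiscing elit. Nunc eu tellus vel nunc pretium " +
    "lacinia. Proin sed lorem. Cras sed ipsum. Nunc " +
    "a libero quis risus sollicitudin imperdiet.";

// Match "ipsum", skip any non-word characters, and capture the next word.
Pattern p = Pattern.compile("ipsum\\W+(\\w+)");
Matcher m = p.matcher(s);
while (m.find())
{
  // group(1) is the captured word; group(0) would include "ipsum" as well.
  System.out.println(m.group(1));
}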

Note that this prints both "dolor" and "Nunc". To do that with the lookbehind version, you would have to do something hackish like:

Pattern p = Pattern.compile("(?<=ipsum\\W{1,2})(\\w+)");

That's in Java, which requires the lookbehind to have an obvious maximum length. Some flavors don't have even that much flexibility, and of course, some don't support lookbehinds at all.
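To compare the two approaches side by side, here is a small self-contained sketch. The class name and the findAll helper are hypothetical, introduced only for this illustration. It runs the capturing-group pattern and the bounded-lookbehind pattern over the question's text and collects what each one yields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureVsLookbehind {
    static final String TEXT = "Lorem ipsum dolor sit amet, consectetur "
            + "adipiscing elit. Nunc eu tellus vel nunc pretium "
            + "lacinia. Proin sed lorem. Cras sed ipsum. Nunc "
            + "a libero quis risus sollicitudin imperdiet.";

    // Hypothetical helper: collects the given group from every match of regex in text.
    static List<String> findAll(String regex, int group, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        // Capturing group: the whole match includes "ipsum", group 1 is the next word.
        System.out.println(findAll("ipsum\\W+(\\w+)", 1, TEXT));          // [dolor, Nunc]

        // Bounded lookbehind: the whole match (group 0) is already just the word.
        System.out.println(findAll("(?<=ipsum\\W{1,2})(\\w+)", 0, TEXT)); // [dolor, Nunc]
    }
}
```

The {1,2} bound is enough for this text because at most two non-word characters (". ") separate "ipsum" from the next word; other inputs may need a larger bound.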

However, the biggest problem people seem to be having in their examples is not with lookbehinds, but with word boundaries. Both David Kemp and ck seem to expect \b to match the space character following the 'm', but it doesn't; it matches the position (or boundary) between the 'm' and the space.

It's a common mistake, one I've even seen repeated in a few books and tutorials, but the word-boundary construct, \b, never matches any characters. It's a zero-width assertion, like lookarounds and anchors (^, $, \z, etc.), and what it matches is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.
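The zero-width behaviour is easy to check. The following sketch (class and method names are assumptions for this example) runs a \b-based pattern and a \s+-based pattern against a short string and prints what group 1 captures in each case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Returns group 1 of the first match, or null if the pattern does not match.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor";

        // \b consumes nothing, so \w* is tried at the space and captures an empty string.
        System.out.println("[" + firstGroup("ipsum\\b(\\w*)", text) + "]");   // []

        // \s+ actually consumes the whitespace, so \w+ can reach the next word.
        System.out.println("[" + firstGroup("ipsum\\s+(\\w+)", text) + "]");  // [dolor]
    }
}
```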

kͩeͣmͮpͥ ͩ · Answer · 2009-02-13 14:54:19Z

2

ipsum\b(\w*)

answered Feb 13, 2009 at 14:54


6 Comments

Matthew Taylor Over a year ago

That seems to only match ipsum.

cletus Over a year ago

I'd probably make that \b+(\w+) at least

Matthew Taylor Over a year ago

ipsum\b+(\w+) is not valid regex.

Ateş Göral Over a year ago

@Matthew Taylor: It depends on your platform. You didn't specify which platform/language you're using.

Alan Moore Over a year ago

\b+ matches one or more word boundaries, which makes no sense because a word boundary has zero length. Some flavors will ignore the + but others will reject it as an error. I think "ipsum\s+(\w+)" is what you're groping for.


Vijay Anand Pandian · Answer · 2020-11-11 06:18:02Z

(?<=\bipsum\s|\bipsum\.\s)(\w+)

/(?<=\bipsum\s|\bipsum\.\s)(\w+)/gm

Positive Lookbehind (?<=\bipsum\s|\bipsum\.\s)
  Assert that the Regex below matches

1st Alternative \bipsum\s
  \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
  ipsum matches the characters ipsum literally (case sensitive)
  \s matches any whitespace character (equal to [\r\n\t\f\v ])

2nd Alternative \bipsum\.\s
  \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
  ipsum matches the characters ipsum literally (case sensitive)
  \. matches the character . literally (case sensitive)
  \s matches any whitespace character (equal to [\r\n\t\f\v ])

1st Capturing Group (\w+)
  \w+ matches any word character (equal to [a-zA-Z0-9_])
  + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

Global pattern flags
  g modifier: global. All matches (don't return after first match)
  m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
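The answer does not name a flavor; as an assumption, the sketch below tries the same pattern in Java, whose lookbehind accepts alternatives of different lengths. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationLookbehindDemo {
    static final String TEXT = "Lorem ipsum dolor sit amet, consectetur "
            + "adipiscing elit. Nunc eu tellus vel nunc pretium "
            + "lacinia. Proin sed lorem. Cras sed ipsum. Nunc "
            + "a libero quis risus sollicitudin imperdiet.";

    // Lookbehind with two alternatives: "ipsum" + whitespace, or "ipsum." + whitespace.
    static final Pattern AFTER_IPSUM =
            Pattern.compile("(?<=\\bipsum\\s|\\bipsum\\.\\s)(\\w+)");

    // Collects every word matched after either alternative.
    static List<String> wordsAfterIpsum(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = AFTER_IPSUM.matcher(text);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsAfterIpsum(TEXT)); // [dolor, Nunc]
    }
}
```

Each alternative has a fixed length, which is why this form tends to be accepted by more flavors than a lookbehind containing + or *; engines that require the whole lookbehind to have a single width may still reject it.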

JLCDev · Answer · 2017-07-12 02:41:01Z

1

With JavaScript you can use (?=ipsum.*?(\w+)); the word you want ends up in capture group 1.

This will get the second occurrence as well (Nunc).
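The answer targets JavaScript; as an assumption, the sketch below runs the same pattern in Java to show how it behaves. The overall match is empty (a lookahead consumes nothing) and the word is read from group 1. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadCaptureDemo {
    static final String TEXT = "Lorem ipsum dolor sit amet, consectetur "
            + "adipiscing elit. Nunc eu tellus vel nunc pretium "
            + "lacinia. Proin sed lorem. Cras sed ipsum. Nunc "
            + "a libero quis risus sollicitudin imperdiet.";

    // Lookahead that sees "ipsum", lazily skips ahead, and captures the next run of word characters.
    static final Pattern LOOKAHEAD = Pattern.compile("(?=ipsum.*?(\\w+))");

    // Collects group 1 from every (zero-length) match.
    static List<String> capturedWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = LOOKAHEAD.matcher(text);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(capturedWords(TEXT)); // [dolor, Nunc]
    }
}
```

Because the match itself is zero-length, this form is useful for reading the word but not for replacing it directly.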

answered Jul 12, 2017 at 2:41


Deniz Babat · Answer · 2021-11-16 19:05:44Z

0

Example statement: "availebleLimit: Double?". If you want to find the words after the ':' character, the regex below can be used:

Regex => :.+$
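A quick check in Java (class and helper names are assumptions for this sketch) suggests that this pattern's match includes the colon and the following space. The second pattern, which is not from the answer, is one possible variant that captures only the text after them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AfterColonDemo {
    // Hypothetical helper: returns the requested group of the first match, or null.
    static String firstMatch(String regex, int group, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String statement = "availebleLimit: Double?";

        // The answer's pattern: the whole match starts at the colon.
        System.out.println("[" + firstMatch(":.+$", 0, statement) + "]");        // [: Double?]

        // Variant (an assumption, not from the answer): skip the colon and
        // any whitespace, then capture the remainder in group 1.
        System.out.println("[" + firstMatch(":\\s*(.+)$", 1, statement) + "]");  // [Double?]
    }
}
```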

answered Nov 16, 2021 at 19:05


cjk · Answer · 2009-02-13 14:53:15Z

-1

ipsum\b(.*)\b

EDIT: although depending on your regex implementation, this could be greedy and find all the words after ipsum.

answered Feb 13, 2009 at 14:53


6 Comments

cletus Over a year ago

That'll match the rest of the sentence.

tliff Over a year ago

you have to make that ungreedy

cletus Over a year ago

Actually it's not implementation dependent, or at least I've never come across a regex implementation that is non-greedy by default. Non-greedy is always a switch (at least in Perl, PHP, Java and .Net).

cjk Over a year ago

@cletus: regex implementation can by definition include passing switches to the call to the regex function

Alan Moore Over a year ago

Even if you make it non-greedy, i.e., "ipsum\b(.*?)\b", it still won't work. The "(.*?)" will match nothing at all, because the second \b succeeds at the same position as the first one, right after the 'm'.
