I have a very long string, like so:

_acordion">
               <ul>
                   <li>
                     <a href="#" class="contentFrame_acordionToggle" aria-label="לחץ להציג או להסתיר מידע נוסף" aria-controls="contentFrame_acordion01"
           aria-expanded="false">קרנות פנסיה</a>
                     <div class="contentFrame_acordionPanel" aria-hidden="true" id="contentFrame_acordion01" style=""><ul class="V contentFrame_bulletList">
           <li class="rteBulletChk"><a href="/media/27294/512237744_pn_p_0424.xlsx" title="512237744_pn_p_0424.xlsx" data-udi="umb://media/79fe37604a6e408c96937b3a209fcd5f"
           aria-label="על מנת להפוך את האתר לנגיש לקורא מסך לחץ alt + 1. על מנת להפסיק הודעה זאת לחץ alt + 2.">רשימת נכסים ליום 31.12.2024</a></li>
           <li class="rteBulletChk"><a href="/media/26301/512237744_pn_p_0324.xlsx" title="512237744_pn_p_0324.xlsx" data-udi="umb://media/eb65b69244414dc4ad19933013ca168a"
           aria-label="על מנת להפוך את

from which I want to extract a substring that follows a definite pattern:

Hebrew letters, max 200 chars, href="(substring of interest)", max 200 chars, 31.12.2024

I'm using this regex:

קרנות פנסיה(\s|\S){0,200}href=`"(.*?)`"(\s|\S){0,200}31.12.2024

This gives me the overall match (.Groups[0]), which is correct: it starts with the Hebrew letters and ends with 31.12.2024.

I'm interested in the substring between the quotes after href= (a file path), but I'm struggling to end the match at the second quote after href=.

Groups[2] insists on returning the file path with " title= still attached.

I also tried a lookahead:

קרנות פנסיה(\s|\S){0,200}href=`"(.*?)(?=`")(\s|\S){0,200}31.12.2024

And I still get this "title=" part. How do I get whatever is between the quotes after "href="?

  • Instead of using a lookahead for the closing quote, use [^"]+ to describe the preceding content: $pattern = 'קרנות פנסיה(\s|\S){0,200}href="([^"]+)"(\s|\S){0,200}31.12.2024' Commented May 1 at 18:24
  • @Mathias R. Jessen Your suggestion doesn't work. Commented May 1 at 18:39
  • That's because the input string has 201 characters between the Hebrew word and href, so (\s|\S){0,200} doesn't quite cut it :) Commented May 1 at 18:45
  • First, do not use (\s|\S); use [\s\S] or (?s:.) instead (or use . with (?s) at the start of the pattern). Then, you can just use href=`"([^`"]*)`" Commented May 1 at 19:00
  • It is generally a bad idea to attempt to parse HTML with regular expressions. Instead, use a dedicated HTML parser such as the HtmlDocument class (and the Uri class for URIs), see e.g.: stackoverflow.com/a/71855426/1701026 and stackoverflow.com/a/75269044/1701026 Commented May 2 at 8:56

1 Answer

Building on the helpful comments on the question.

Syntax note:

  • Instead of (\s|\S) for matching any single character, it is better to make . do the same by activating the single-line regex option, e.g. by placing (?s) at the start of the regex, as shown below.[1]

  • Avoid hard-coding the length range of unspecified substrings, such as (\s|\S){0,200} - it makes your solution less flexible, including with respect to newlines, whose length can be one char. in input strings using Unix-format newlines (LF-only) or two chars. with Windows-format newlines (CR, LF sequences).

    • Assuming that the variable-length part is always followed by a terminating part (e.g. 31.12.2024), you can replace {0,200} with *?, i.e. the non-greedy variant of the * quantifier, or with +?, the non-greedy variant of +, as shown below.[2]
  • The best (unambiguous and most efficient) way to match all chars. following " up to but excluding the closing ", is to use [^"]* or - if you can assume that at least one char. is enclosed in "..." - [^"]+

Thus, use the following, which assumes that your input string is stored in variable $string. Also note the use of a verbatim string literal (single-quoted, i.e. '...'), which obviates the need to escape " as `":[3]

# Note the use of (?s) to make . match newlines and
# .+? to non-greedily match unspecified parts of the string.
# \. matches a literal dot (an unescaped . would match any character).
if ($string -match '(?s)קרנות פנסיה.+?href="([^"]*)".+?31\.12\.2024') {
  "href attribute value: " + $Matches[1]
}
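
As a variation, the same approach can be combined with a named capture group, which makes the result easier to read back. This is only a sketch: $sample is a hypothetical, shortened stand-in for the real HTML, and the group name path is an illustrative choice.

```powershell
# Hypothetical, shortened stand-in for the real HTML (not the full page).
$sample = @'
<a href="#" class="contentFrame_acordionToggle">קרנות פנסיה</a>
<li><a href="/media/27294/512237744_pn_p_0424.xlsx" title="512237744_pn_p_0424.xlsx">רשימת נכסים ליום 31.12.2024</a></li>
'@

# (?<path>...) names the capture group; \. matches a literal dot.
$pattern = '(?s)קרנות פנסיה.+?href="(?<path>[^"]*)".+?31\.12\.2024'

# -match populates the automatic $Matches variable on success.
$path = if ($sample -match $pattern) { $Matches['path'] } else { $null }
$path
```

With this sample, $path should contain only the href value, without the closing quote or the title attribute.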

As for what you tried (leaving aside that {0,200} may be too small a character-count range for your use case):

  • "(.*?)" - unlike "([^"]*)" - isn't guaranteed to stop capturing at the very next ": if the subexpression that follows can only match after a later ", the non-greedy .*? keeps expanding past the nearer one.

  • The following minimal example demonstrates this:

$string = 'one "two" and "three"!'

if ($string -match '"(.*?)"!') {
  "Group 1 value: " + $Matches[1]
}

This outputs Group 1 value: two" and "three, showing that matching was performed across both "..."-enclosed substrings:

  • Matching continued past the closing " of "two", because only the closing " of the later "three" was followed by !
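
For comparison, here is a minimal sketch that runs the same input through both forms; the variable names are illustrative only.

```powershell
$string = 'one "two" and "three"!'

# Lazy .*? may cross a quote when the rest of the regex only fits later.
$lazy = if ($string -match '"(.*?)"!') { $Matches[1] }

# [^"]* can never cross a quote, so the match has to start at "three".
$negated = if ($string -match '"([^"]*)"!') { $Matches[1] }

"lazy:    $lazy"      # two" and "three
"negated: $negated"   # three
```

The negated character set keeps the capture inside a single pair of quotes, whichever pair the rest of the regex ends up requiring.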

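However, note that while your symptom - your capture group capturing /media/27294/512237744_pn_p_0424.xlsx" title= instead of just /media/27294/512237744_pn_p_0424.xlsx - suggests that something analogous happened in your case, the regex shown in your question would produce it only if the subsequent subexpression cannot match right after the closing " of the href="/media/27294/512237744_pn_p_0424.xlsx" substring, e.g. if more than 200 characters separate that " from 31.12.2024, so that (\s|\S){0,200} falls short and .*? is forced to expand to a later ". Otherwise, the subsequent subexpression does match right after that closing ".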
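
To illustrate how a bounded quantifier can push .*? past the closing quote, here is a minimal sketch with a made-up string and a deliberately small bound ({0,15} standing in for {0,200}). Whether the real input behaves this way depends on its exact character counts, which is an assumption here.

```powershell
# Made-up input: 21 characters separate the closing quote of href="..." from END.
$s = 'href="a.xlsx" title="t" ' + ('x' * 10) + 'END'

# The bound {0,15} is too small to reach END from the first closing quote,
# so .*? expands to the next quote from which END is reachable.
$bounded = if ($s -match 'href="(.*?)"[\s\S]{0,15}END') { $Matches[1] }

# With an unbounded lazy gap, [^"]* stops at the first closing quote.
$fixed = if ($s -match 'href="([^"]*)".*?END') { $Matches[1] }

"bounded: $bounded"   # a.xlsx" title=
"fixed:   $fixed"     # a.xlsx
```

Note that swapping in [^"]* while keeping the too-small bound would make the match fail outright in this sketch, which is why relaxing the length range matters as well.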


[1] By default, . matches any character except a newline. The single-line option, whose inline form is (?s), makes it match newlines too - in the simplest case, place it at the start of your regex, as shown later, but you can also apply it to parts of your regex, selectively, including the ability to turn it off again with (?-s).
An alternative that is still preferable to (\s|\S) is to use a character set, [\s\S].
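
A quick way to see the difference, sketched with a two-line string (the variable names are illustrative):

```powershell
$text = "a`nb"   # 'a', LF, 'b'

$plainDot   = $text -match 'a.b'        # False: . stops at the newline
$singleLine = $text -match '(?s)a.b'    # True: (?s) lets . match the newline
$charSet    = $text -match 'a[\s\S]b'   # True: [\s\S] matches any character

"plain dot: $plainDot, (?s): $singleLine, [\s\S]: $charSet"
```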

[2] A ? following quantifiers such as * or + makes the latter non-greedy (a.k.a. lazy), i.e. makes them match as few characters as possible; this contrasts with the greedy default behavior, which matches as many as possible. This is useful for input strings with repeating patterns in which you want to match a single pattern without accidentally matching across multiple ones to the very last one.
A simple example (verify results with $Matches[0]): The greedy regex 'YOLO' -match '.+O' matches the entire input string (up to the last 'O'), whereas 'YOLO' -match '.+?O' matches only 'YO' (up to the first one).
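
The same effect shows up with repeated attributes. The following sketch uses a made-up HTML fragment with two href attributes:

```powershell
$html = '<a href="x"> <a href="y">'

# Greedy .+ runs to the last quote in the string.
$greedy = if ($html -match 'href="(.+)"') { $Matches[1] }

# Lazy .+? stops at the first quote that lets the match succeed.
$lazy = if ($html -match 'href="(.+?)"') { $Matches[1] }

"greedy: $greedy"   # x"> <a href="y
"lazy:   $lazy"     # x
```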

[3] For a detailed explanation of why '...' strings are preferable for specifying regexes, see this answer.


6 Comments

When I use regular expressions, I work with named capture groups. Instead of ([^"]*) it could be (?<FileOfInterest>[^"]*). The result would be $Matches[ 'FileOfInterest' ]. There are situations where you need to capture several groups. Naming them makes the source code easier to debug and understand.
I didn't understand this part .+?. The plus sign + indicates one or more occurrences of the previous element. The previous element is the dot ., which represents any single character. The question mark ? indicates zero or one occurrence of the previous element. I couldn't understand the need to use ?. It seems to be unnecessary.
@JoãoMac, re the non-greedy ? quantifier modifier, please see the footnote I've added to the answer.
@POL, as noted in the first bullet point, you must activate the single-line regex option ((?s)) in order to make . match newlines too. This is also part of the recommended solution. I've added a footnote that explains this option in detail.
@mklement0, I use regular expressions a lot. That's why I understand your approach. But I didn't know about the existence of greedy and lazy. Thank you again for taking the time to teach us.