Building on the helpful comments on the question.
Syntax note:
In lieu of (\s|\S) for matching any single character, it is better to make . do the same via first activating the single-line regex option, which can be done by placing (?s) at the start of a regex, for instance, as shown below.[1]
Avoid hard-coding the length range of unspecified substrings, such as (\s|\S){0,200} - it makes your solution less flexible, including with respect to newlines, whose length can be one char. in input strings using Unix-format newlines (LF-only) or two chars. with Windows-format newlines (CR, LF sequences).
- Assuming that the variable-length part is always followed by a terminating part (e.g. 31.12.2024), instead of {0,200} you can simply use *?, i.e. the non-greedy variant of the * quantifier, or the non-greedy variant of + (+?), as shown below.[2]
The best (unambiguous and most efficient) way to match all chars. following " up to but excluding the closing ", is to use [^"]* or - if you can assume that at least one char. is enclosed in "..." - [^"]+
Thus, use the following, which assumes that your input string is stored in variable $string; also note the use of a verbatim string literal (single-quoted, i.e. '...'), which obviates the need to escape " as `":[3]
# Note the use of (?s) to make . match newlines and
# .+? to non-greedily match unspecified parts of the string.
# \. matches a literal dot (an unescaped . would match any character).
if ($string -match '(?s)קרנות פנסיה.+?href="([^"]*).+?31\.12\.2024') {
"href attribute value: " + $Matches[1]
}
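To see that this approach does not depend on the newline format, here is a minimal sketch that runs the regex against a hypothetical HTML fragment, once with LF-only and once with CRLF newlines. The fragment itself (including the title attribute value) and the variable names $lf, $crlf and $regex are assumptions made for illustration, not your actual input.

```powershell
# Hypothetical input, for illustration only: three lines joined with LF-only newlines.
$lf = @(
  '<h2>קרנות פנסיה</h2>'
  '<a href="/media/27294/512237744_pn_p_0424.xlsx" title="report">'
  '31.12.2024</a>'
) -join "`n"

# The same content with Windows-format (CRLF) newlines.
$crlf = $lf -replace "`n", "`r`n"

# (?s) lets . match newlines; .+? spans the unspecified parts non-greedily;
# the dots in the date are escaped so that they match literally.
$regex = '(?s)קרנות פנסיה.+?href="([^"]*).+?31\.12\.2024'

foreach ($string in $lf, $crlf) {
  if ($string -match $regex) {
    "href attribute value: " + $Matches[1]
  }
}
```

Both iterations should print href attribute value: /media/27294/512237744_pn_p_0424.xlsx, whereas a fixed range such as {0,200} would have to account for the extra CR characters in the second input.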
As for what you tried (leaving aside that {0,200} may be too small a character-count range for your use case):
"(.*?)" - unlike "([^"]*)" - isn't guaranteed to capture merely everything up to the very next " instance, namely if the following subexpression only matches a later occurrence of ".
The following minimal example demonstrates this:
$string = 'one "two" and "three"!'
if ($string -match '"(.*?)"!') {
"Group 1 value: " + $Matches[1]
}
This outputs Group 1 value: two" and "three, showing that matching was performed across both "..."-enclosed substrings:
- Matching continued past the closing " of "two", because only the closing " of the later "three" was followed by !
However, note that while your symptom - your capture group capturing /media/27294/512237744_pn_p_0424.xlsx" title= instead of just /media/27294/512237744_pn_p_0424.xlsx - suggests this is what happened analogously in your case, the regex shown in your question would not produce this symptom, because the subsequent subexpression does match the closing " of the href="/media/27294/512237744_pn_p_0424.xlsx" substring.
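For comparison, here is the minimal example from above again, as a small sketch with [^"]* in place of .*?, everything else unchanged:

```powershell
$string = 'one "two" and "three"!'
# [^"]* can never cross a " character, so the capture group
# cannot span more than one "..."-enclosed substring.
if ($string -match '"([^"]*)"!') {
  "Group 1 value: " + $Matches[1]
}
```

This outputs Group 1 value: three, because the only "..."-enclosed substring that is directly followed by ! is "three".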
[1] By default, . matches any character except a newline. The single-line option, whose inline form is (?s), makes it match newlines too - in the simplest case, place it at the start of your regex, as shown later, but you can also apply it to parts of your regex, selectively, including the ability to turn it off again with (?-s).
An alternative that is still preferable to (\s|\S) is to use a character set, [\s\S].
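The following sketch illustrates these variants with a made-up two-line string ($text is just an illustrative variable name); each line outputs a Boolean:

```powershell
$text = "first`nsecond"   # two lines, separated by an LF newline

$text -match 'first.second'           # False: by default, . doesn't match a newline
$text -match '(?s)first.second'       # True: single-line option on for the whole regex
$text -match 'first(?s).(?-s)second'  # True: option turned on, then off again with (?-s)
$text -match 'first(?s:.)second'      # True: option scoped to a single group
$text -match 'first[\s\S]second'      # True: character set that matches any character
```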
[2] A ? following quantifiers such as * or + makes the latter non-greedy (a.k.a. lazy), i.e. makes them match as few characters as possible; this contrasts with the greedy default behavior, which matches as many as possible. This is useful for input strings with repeating patterns in which you want to match a single pattern without accidentally matching across multiple ones to the very last one.
A simple example (verify results with $Matches[0]): The greedy regex 'YOLO' -match '.+O' matches the entire input string (up to the last 'O'), whereas 'YOLO' -match '.+?O' matches only 'YO' (up to the first one).
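In runnable form, the same comparison looks as follows; $null = ... merely suppresses the Boolean result of -match so that only the matched text is output:

```powershell
# Greedy: .+ matches as much as possible, up to the last 'O'.
$null = 'YOLO' -match '.+O'
$Matches[0]   # YOLO

# Non-greedy: .+? matches as little as possible, up to the first 'O'.
$null = 'YOLO' -match '.+?O'
$Matches[0]   # YO
```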
[3] For a detailed explanation of why '...' strings are preferable for specifying regexes, see this answer.