Can someone help me with the following?

I'm trying to find specific date and time strings in a text (to be used within VBA Word). Currently working with the following RegEx string:

(?:([0-9]{1,2})[ |-])?(?:(jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|jun(?:i)?|jul(?:i)?|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?))?(?: |-)?(?(3)(?: around | at | ))?(?:([0-9]{1,2}:[0-9]{1,2})?(?: uur| u|u)?)?

Tested on the following text:

  1. date with around time: 26 sep 2016 around 09:00u
  2. date with at time: 1 sep 2016 at 09:00 uur
  3. date and time u: 1 sep 2018 09:00 u
  4. time without date: 08:30 uur
  5. date with time u: 1 sep 2016 at 09:00u
  6. only time: 09:00
  7. only month: jan
  8. month and year: feb 2019
  9. only day: 02
  10. only day with '-': 2-
  11. day and month: 2 jan
  12. month year: jan 2018
  13. date with '-': 2-feb-2018 09:00
  14. other month: 01 sept 2016
  15. full month: 1 september 2018
  16. shortened year: jul '18

Rules:

  • a date followed by time is valid
  • a date followed by text 'around' or 'at', followed by time is valid
  • a date without day number is valid
  • a date without year is valid
  • a month on its own (no day, no year) is not valid
  • a day on its own (no month or year) is not valid
  • a date may contain dashes '-'
  • a year may be shortened with ', like jun '18
  • the month name can be short or long
  • the full match includes ' uur' or 'u' (to highlight the text in MS Word)
  • submatch text from the capture groups has no leading or trailing spaces

Example at: https://regex101.com/r/6CFgBP/1/

Expected output (when used in VBA Word): a regex Matches collection in which each Match.SubMatches contains the individual items d, m, y, hh:mm from the capture groups of the pattern. So for example 1, the SubMatches (capture groups) contain the values '26', 'sep', '2016', '09:00'.

The RegEx works fine, but some false positives need to be excluded:

  • A day without month/year should not be matched (examples 9 and 10)
  • A month on its own, without day or year, should not be matched (example 7)

(I tried some lookaheads and the references \1 and ?(1), but was not able to get them working properly...)

Any advice highly appreciated!

  • What output do you want to get from your test strings? Commented Sep 24, 2018 at 8:20
  • Quick reply :-) I'm using the capture groups in the Matches.SubMatches VBA object. So, e.g., for item 1 the Match returns an object with submatches '26', 'sep', '2016', '09:00'. Commented Sep 24, 2018 at 8:26
  • Try this pattern. You may analyze submatches and build the result you need accordingly. Commented Sep 24, 2018 at 9:24
  • Thanks for looking at the pattern. The problem still exists in examples 7, 9 and 10. I'd like the pattern not to match those items. Commented Sep 24, 2018 at 11:45
  • It is not a problem with regex. You may easily find out what to keep and what to reject upon a match when you check the submatches(x) length. Commented Sep 24, 2018 at 12:26

2 Answers

As I understand it, you require that each date/time part (day, month, year, hour and minute) must be present.

So you should remove ? after relevant groups (they are not optional).

It is also a good practice to have each group captured as a relevant capturing group.

There is no need to write something like jun(?:i)?. It is enough (and easier to read) when you write just juni? (the ? refers just to preceding i).
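To illustrate the point, here is a small sketch in Python's `re` module (used only as a stand-in for checking; the target is VBA, and the word list is made up for the demonstration). It checks that the compact spelling accepts exactly the same words as the verbose one:

```python
import re

# Hypothetical test words: valid month abbreviations plus two near-misses.
words = ["jun", "juni", "jul", "juli", "junix", "ju"]

verbose = re.compile(r"jun(?:i)?|jul(?:i)?")  # spelling used in the question
compact = re.compile(r"juni?|juli?")          # spelling suggested here

# Both patterns must agree on every word when matched against the whole word.
for word in words:
    assert bool(verbose.fullmatch(word)) == bool(compact.fullmatch(word))

matches = [w for w in words if compact.fullmatch(w)]
print(matches)  # ['jun', 'juni', 'jul', 'juli']
```

The `?` binds only to the single preceding character, so a group is needed only when more than one character is optional, as in `jan(?:uari)?`.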

Another hint: As the regex language contains the \d character class, use it instead of [0-9] (the regex is shorter and easier to read).

Optional parts (at / around) should be an optional and non-capturing group.

Anything after the minute part is not needed in the regex.

So I propose a regex like below (for readability, I divided it into rows):

(\d{1,2})[ -](jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|juni?
|juli?|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?)
[ -](\d{4}) (?:around |at )?(\d{1,2}:\d{1,2})

Details:

  • (\d{1,2}) - Day.
  • [ -] - A separator after the day (either a space or a minus).
  • (jan(?:uari)?|...dec(?:ember)?) - Month.
  • [ -] - A separator after the month.
  • (\d{4}) - year.
  • (?:around |at )? - Actually, 3 variants of a separator between year and hour (space / around / at), note the space before (...)?.
  • (\d{1,2}:\d{1,2}) - Hour and minute.

It matches variants 1, 2, 3, 5 and 13. All remaining fail to contain each required part, so they are not matched.
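As a quick check of that claim, the pattern can be run against the sample values from the question. This is a sketch in Python's `re` module rather than VBA (the names `MONTH`, `STRICT` and `SAMPLES` are made up here); the pattern sticks to syntax that a VBA RegExp object should also accept, but that part is an assumption and worth verifying in Word itself.

```python
import re

# Month alternation from the pattern above, joined onto one line.
MONTH = (r"jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|juni?|juli?"
         r"|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?")

# Day, month, year and time are all required.
STRICT = re.compile(
    r"(\d{1,2})[ -](" + MONTH + r")[ -](\d{4}) (?:around |at )?(\d{1,2}:\d{1,2})"
)

# Sample values from the question, keyed by their example number.
SAMPLES = {
    1: "26 sep 2016 around 09:00u",
    2: "1 sep 2016 at 09:00 uur",
    3: "1 sep 2018 09:00 u",
    4: "08:30 uur",
    5: "1 sep 2016 at 09:00u",
    6: "09:00",
    7: "jan",
    8: "feb 2019",
    9: "02",
    10: "2-",
    11: "2 jan",
    12: "jan 2018",
    13: "2-feb-2018 09:00",
    14: "01 sept 2016",
    15: "1 september 2018",
    16: "jul '18",
}

matched = [n for n, text in SAMPLES.items() if STRICT.search(text)]
print(matched)  # [1, 2, 3, 5, 13]

# The four capture groups correspond to d, m, y, hh:mm.
print(STRICT.search(SAMPLES[1]).groups())  # ('26', 'sep', '2016', '09:00')
```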

If the hour/minute part may be optional, change the respective fragment into:

( (?:around |at )?(\d{1,2}:\d{1,2}))?

i.e. surround the space/around/at / hour / minute part with ( and )?, making this part an optional group. Then, variants 14 and 15 will also be matched.

One more extension: If you also allow the hour/minute part alone, add |(\d{1,2}:\d{1,2}) to the regex (everything before it is the first variant and the added part is the second variant, for just hour/minute).

Then, your variants No 4 and 6 will also be matched.
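Both extensions together can be sketched as follows, again in Python's `re` as a stand-in for VBA (all names are made up for the example). The second pattern, `NON_CAPTURING`, is not part of the proposal above: it is a variation that assumes you want the time captured only once, without the leading space.

```python
import re

MONTH = (r"jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|juni?|juli?"
         r"|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?")
DATE = r"(\d{1,2})[ -](" + MONTH + r")[ -](\d{4})"
TIME = r"(\d{1,2}:\d{1,2})"

# As described above: optional time tail (a capturing group), or a time alone.
AS_PROPOSED = re.compile(DATE + r"( (?:around |at )?" + TIME + r")?|" + TIME)

# Variation (assumption): the tail is non-capturing, so the time appears once.
NON_CAPTURING = re.compile(DATE + r"(?: (?:around |at )?" + TIME + r")?|" + TIME)

# Sample values from the question, keyed by their example number.
SAMPLES = {
    1: "26 sep 2016 around 09:00u", 2: "1 sep 2016 at 09:00 uur",
    3: "1 sep 2018 09:00 u", 4: "08:30 uur", 5: "1 sep 2016 at 09:00u",
    6: "09:00", 7: "jan", 8: "feb 2019", 9: "02", 10: "2-", 11: "2 jan",
    12: "jan 2018", 13: "2-feb-2018 09:00", 14: "01 sept 2016",
    15: "1 september 2018", 16: "jul '18",
}

matched = [n for n, text in SAMPLES.items() if AS_PROPOSED.search(text)]
print(matched)  # [1, 2, 3, 4, 5, 6, 13, 14, 15]

# With the capturing tail, the time shows up twice (once with its space).
print(AS_PROPOSED.search(SAMPLES[13]).groups())
# ('2', 'feb', '2018', ' 09:00', '09:00', None)

print(NON_CAPTURING.search(SAMPLES[13]).groups())
# ('2', 'feb', '2018', '09:00', None)
```

Note that the time-only alternative gets its own capture group at the end, so the submatch index of hh:mm differs between the two variants.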

For a working example see https://regex101.com/r/33t1ps/1

Edit

Following your list of rules, I propose the following regex:

  • (\d{1,2}[ -])? - Day + separator, optional.
  • (jan(?:uari)?|...|dec(?:ember)?) - Month.
  • (?:[ -](\d{4}|'\d{2}))? - Separator + year (either 4 or 2 digits with "'").
  • ( (?:around |at )?(\d{1,2}:\d{1,2}))? - Separator + hour/minute - optional end of variant 1.
  • |(\d{1,2}:\d{1,2}) - Variant 2 - only hour and minute.

The only variants it does not match are No. 9 and 10.

For the full regex, which also includes "uur", see https://regex101.com/r/33t1ps/3
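The pieces listed above can be assembled and checked as below. This is a sketch in Python's `re`, not VBA, and it leaves out the "uur" suffix. The `keep` function is a hypothetical helper that follows the suggestion from the comments on the question: match first, then inspect the submatches and reject a month that has neither day nor year. In VBA the same test would be done on Match.SubMatches.

```python
import re

MONTH = (r"jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|juni?|juli?"
         r"|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?")

# Groups: 1 = day + separator, 2 = month, 3 = year,
#         4 = space + time, 5 = time, 6 = time on its own.
RULES = re.compile(
    r"(\d{1,2}[ -])?(" + MONTH + r")(?:[ -](\d{4}|'\d{2}))?"
    r"( (?:around |at )?(\d{1,2}:\d{1,2}))?|(\d{1,2}:\d{1,2})"
)

# Sample values from the question, keyed by their example number.
SAMPLES = {
    1: "26 sep 2016 around 09:00u", 2: "1 sep 2016 at 09:00 uur",
    3: "1 sep 2018 09:00 u", 4: "08:30 uur", 5: "1 sep 2016 at 09:00u",
    6: "09:00", 7: "jan", 8: "feb 2019", 9: "02", 10: "2-", 11: "2 jan",
    12: "jan 2018", 13: "2-feb-2018 09:00", 14: "01 sept 2016",
    15: "1 september 2018", 16: "jul '18",
}


def keep(match):
    """Reject a match that consists of a month with neither day nor year."""
    if match.group(2) is None:  # time-only variant, nothing to reject
        return True
    return match.group(1) is not None or match.group(3) is not None


found = {n: RULES.search(text) for n, text in SAMPLES.items()}
raw = [n for n, m in found.items() if m]
kept = [n for n, m in found.items() if m and keep(m)]

print(raw)   # [1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16]
print(kept)  # [1, 2, 3, 4, 5, 6, 8, 11, 12, 13, 14, 15, 16]

# Group 1 still carries its separator, so strip it before use.
day = found[1].group(1).strip(" -")
print(day)  # 26
```

The regex alone still matches the bare month in example 7; only the check on the submatches drops it.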

4 Comments

Thanks for your thoughts. I can indeed use \d for numbers; it's more readable. Still, I'd like 8, 11, 12 and 16 to be matched. So the combination of day + month, or month + year, is valid for a match. Likewise a year on its own. But a single number in the text, as in 9 and 10, is not valid, and neither is a single month name, as in 7.
Besides this, the captured items are not what I'm looking for, e.g. in example 13. The captured text occurs twice: as <space>09:00 and as 09:00.
Updated the question with 'my rules'.
Thanks! Only example 9 is matched, which it shouldn't be, due to 'month without day/year is not valid'.
0

I finally found something that helps me handle the month properly :-)

\b(?:([1-3]|[0-3]\d)[ |-](?'month'(?:[1-9]|\d[12])|(?:jan(?:uari)?|feb(?:ruari)?|m(?:aa)?rt|apr(?:il)?|mei|jun(?:i)?|jul(?:i)?|aug(?:ustus)?|sep(?:tember|t)?|okt(?:ober)?|nov(?:ember)?|dec(?:ember)?))?)?(?:(\g'month')[ |-]((?:19|20|\')(?:\d{2})))?\b(?: omstreeks | om | )?(?:(\d{1,2}[:]\d{2}(?: uur|u)?|[0-2]\d{3}(?: uur|u)))?\b

It uses a named group that is called again as a subroutine. Found here: https://www.regular-expressions.info/subroutine.html

2 Comments

in which 'omstreeks' and 'om' correspond to 'around' and 'at' (local language)
This is a PCRE pattern, it won't work in MS Word VBA.
