
For the life of me, I can't figure out which combination of regular expression characters will parse the part of the string I want. Each string is one line out of about 400 thousand (in no particular order); I find it inside a for loop by matching it against a unique number taken from an array.

For every string, I'm trying to get the date number (such as 20151212 below).

Given the following examples of the strings (pulled from a CSV file with 400k++ lines of strings):

String1:

314513,,Jr.,John,Doe,652622,U51523144,,20151212,A,,,,,,,

String2:

365422,[email protected],John,Doe.,Jr,987235,U23481,z725432,20160221,,,,,,,,

String3:

6231,,,,31248,U51523144,,,CB,,,,,,,

There are several complications here...

  1. Some names have a "," in them, so a line can contain more than 15 commas.

  2. We don't know the value of the date, only that it is in a date format such as (Get-Date).ToString("yyyyMMdd")

For those who can think of a better way...

We are given two CSV files to match. Algorithmic steps:

  • Look in CSV file 1 for the ID number (found in the 2nd column)

    ** No ID Numbers will be blank for CSV file 1

  • Look in CSV file 2 for the ID number taken from CSV file 1. On that same line, get the date. Once you have the date, append it as the 5th column of CSV file 1, in the same row as the ID number

    ** Note: CSV file 2 will have $null for some of the values in the ID number column

I'm open to suggestions (including using the Import-Csv cmdlet, although I am not too familiar yet with its flags or with the syntax of for loops over its values).

  • Do you have a year range? Commented Dec 11, 2015 at 2:15
  • If names might have commas, can you use a text indicator? (like putting quotes around names like "St. John", "John-Smith", or "Solo") Commented Dec 11, 2015 at 2:19

1 Answer


You could try something like this:

,(19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01]),

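This will match any date in the given format from 1900 to 2099, as long as it sits between two commas. It is specific enough to rule out most other numbers in the samples, but it does not check the day against the month (20150231 would match), and without a larger sample of data it's impossible to say whether other eight-digit fields could collide.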

Then in PowerShell:

Get-Content data.csv | Where-Object { $_ -match ',((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),' } | ForEach-Object { $Matches[1] }

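In the PowerShell match we added capturing parentheses around the part we want, and we reference that group by its number in the $matches index. The -match operator itself returns only $true or $false; the captured text lands in the automatic $matches variable.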
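To get the captured value rather than the Boolean into a variable, a minimal sketch along these lines may help. It runs the same pattern over the three sample strings from the question; the variable names ($pattern, $samples, $results) are made up for illustration.

```powershell
# Same pattern as above; single quotes keep PowerShell from expanding anything in it.
$pattern = ',((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),'

# The three sample lines from the question.
$samples = @(
    '314513,,Jr.,John,Doe,652622,U51523144,,20151212,A,,,,,,,',
    '365422,[email protected],John,Doe.,Jr,987235,U23481,z725432,20160221,,,,,,,,',
    '6231,,,,31248,U51523144,,,CB,,,,,,,'
)

$results = foreach ($line in $samples) {
    # -match returns $true/$false; on success it fills the automatic $Matches variable.
    # $Matches is not cleared by a failed match, so always test the result first.
    if ($line -match $pattern) {
        $date = $Matches[1]        # group 1 is the date without the surrounding commas
        "Found date: $date"
    } else {
        'No date found'
    }
}

$results
```

The third sample has no yyyyMMdd field, so it falls through to the else branch instead of reusing the date left in $Matches by the previous line.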

If you are only interested in matching one line based on a preceding id you could use a lookbehind. For example,

 $id = '314513'  # or maybe 'U23481'
 # \b on both sides keeps the id from matching inside a longer number
 gc c:\temp\reg.txt | where { $_ -match "(?<=\b$id\b.*),((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),"} | % { $matches[1] }
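For the larger task in the question (look up each ID from CSV file 1 in CSV file 2 and carry the date across), scanning all 400k lines once per ID is likely to be slow. One possible sketch is to index file 2 once in a hashtable and then do a lookup per row of file 1. It rests on assumptions the question does not confirm: the ID is the first field of each file 2 line, file 1 has four columns with no embedded commas, and each line has at most one date. The helper names New-DateIndex and Add-DateColumn and the file 1 rows are hypothetical. Plain string splitting is used instead of Import-Csv because unquoted commas inside names would shift the columns.

```powershell
$datePattern = ',((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),'

# Hypothetical helper: read file 2 once and build a map of ID -> date.
# ASSUMPTION: the ID is the first field of each file 2 line.
function New-DateIndex {
    param([string[]]$Lines)
    $index = @{}
    foreach ($line in $Lines) {
        $id = $line.Split(',')[0]
        # Skip blank IDs and lines without a date; keep the first date seen per ID.
        if ($id -and -not $index.ContainsKey($id) -and $line -match $datePattern) {
            $index[$id] = $Matches[1]
        }
    }
    return $index
}

# Hypothetical helper: append the looked-up date to each file 1 line.
# ASSUMPTIONS: file 1 has four columns, no embedded commas, and the ID in column 2.
function Add-DateColumn {
    param([string[]]$Lines, [hashtable]$Index)
    foreach ($line in $Lines) {
        $id = $line.Split(',')[1]
        # Leave the new column empty when file 2 has no date for this ID.
        $date = if ($id -and $Index.ContainsKey($id)) { $Index[$id] } else { '' }
        "$line,$date"
    }
}

# Two of the sample lines from the question stand in for file 2.
$file2 = @(
    '314513,,Jr.,John,Doe,652622,U51523144,,20151212,A,,,,,,,',
    '6231,,,,31248,U51523144,,,CB,,,,,,,'
)
# Made-up file 1 rows with the ID in the 2nd column.
$file1 = @('a,314513,b,c', 'd,6231,e,f')

$index  = New-DateIndex -Lines $file2
$result = Add-DateColumn -Lines $file1 -Index $index
$result
```

With real files, the two arrays could be replaced by Get-Content calls on each file and the result piped to Set-Content; the file paths would be whatever your data uses.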

7 Comments

Well, it works for sure, thanks. But it returns only a Boolean value. I'm trying to say something like $value = $string1 -match ",(19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01]),". In this case, $value is equal to true, when I need it to hold the [string]$date. Any suggestions?
Updated the answer, assuming you were using a CSV file, although you could pipe any string into it.
Ah, I got it... [regex]$regex = ",(19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01]),", next... $date = $regex.Matches($string) | ForEach-Object { $_.Value }, next... remove the "," with $date = $date -replace ",","" and finally $date is the value of the date. Yay, thanks swestner, Google, and Stack Overflow <3
Ah, I just read your updated answer, maybe that is better (less verbose).
I'm happy you're happy, but not knowing what you mean, and feeling like the answer is incomplete, is making my OCD kick in. Care to explain?