2

I have a big .csv file (~600k lines, 56Mo), and inside there is database lines (on each line, there's an id, a client name, a client address, a client birthday date, etc). The problem is that, on some lines, some data is written badly (commas not supposed to be there, that mess up the columns).

I guessed that I had to do some RegEx to detect the problematic lines, and to replace the wrong commas with a dash or anything. I followed this article, and, after some tries, I got him to detect the messed-up lines.

Private Sub simpleRegex()
Dim strPattern As String: strPattern = "[^a-zA-Z0-9_,\-]([A-z]+)\,[^a-zA-Z0-9_,\-]([A-z]+)"

Dim strReplace As String: strReplace = "[^a-zA-Z0-9_,\-][A-z]+\-[^a-zA-Z0-9_,\-][A-z]"

Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
Dim strInput As String
Dim Myrange As Range

Set Myrange = ActiveSheet.Range("A1:A2000")

For Each cell In Myrange
    If strPattern <> "" Then
        strInput = cell.Value

        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .Pattern = strPattern
        End With

        If regEx.Test(strInput) Then
            MsgBox (regEx.Replace(strInput, strReplace))
        Else

        End If
    End If
Next End Sub

The problem is, this solution works if I wanted to change the cibled lines with an unique value, a predefined string (like strReplace="replacement words"). What I want, is to be able to target a succession of characters that match my pattern, and to replace only one character (the comma) in it. An example of it would be :

728,"HAY,HAYE",Marie,François,RAUTUREAU,85,29/05/1856,68;

into :

728,"HAY-HAYE",Marie,François,RAUTUREAU,85,29/05/1856,68;

Do you have a solution?

(Sorry if bad english, it's not my mother tongue).

4
  • Your regex does not match the provided string. Also, strReplace should be a replacement pattern, not a regex pattern. Check backreferences, too. Commented Nov 7, 2016 at 8:10
  • The Regex I wrote seems to work on some lines only, thanks for pointing that out for me, I'm not exactly an expert in Regex. And thank you for the reference, I'm gonna read it out (my mother tongue is French, so I have some difficulties reading big articles like this one). Commented Nov 7, 2016 at 9:17
  • It seems you need to match substrings enclosed with double quotation marks. Is the double quote used as a field qualifier in this CSV? Are there cases where " does not delimit a field? Commented Nov 7, 2016 at 9:22
  • No, in fact, it seems like everytime I have a problematic comma that messes up the columns, it's a comma between two double quotes. Here is how it looks like by the way : puu.sh/s9mC4/d870a2d915.png Commented Nov 7, 2016 at 9:26

3 Answers 3

1

If I got you correct, then there is no need for Regex at all.

With a simple Split you can do it too.

Private Sub simpleReplace()
  Dim str() As String, cell As Variant, Myrange As Range, i As Long
  Set Myrange = ActiveSheet.Range("A1:A2000")
  For Each cell In Myrange
    str = Split(cell.Value, """")
    If UBound(str) Then
      For i = 1 To UBound(str) Step 2
        str(i) = Replace(str(i), ",", "-")
      Next
      cell.Value = Join(str, """")
    End If
  Next
End Sub

this will split your 728,"HAY,HAYE",Marie,François,RAUTUREAU,85,29/05/1856,68; into:

(0) 728,
(1) HAY,HAYE
(2) ,Marie,François,RAUTUREAU,85,29/05/1856,68;

Now every second part of the Split (odd numbers) will be normally enclosed in ". So all that is left, is to Replace the commas there and push it into the cell again.

And if there is no " found, then it will skip this line (because the upper bound is 0)

If you still have any Questions or if that is not what you are looking for, pls tell :)

Sign up to request clarification or add additional context in comments.

3 Comments

Thanks a lot! It seems to work correctly, I'm gonna try to insert all the data in my tables in Postgresql, I keep you in touch !
you are welcome... just tell if there is any trouble :)
Well.. Turns out my file was corrumpted, and I didn't even had to modify my file. The strange characters, the double quotes, ..., I don't even know how they appeared, but they weren't supposed to be there. I lost a week-end of my life. Sigh. But thanks for the help !
1

You can use (?<=(Your Word)) to catch specific characters after a specific word. In your case, this code will help you find the comma:

(?<=(HAY))\,

Update:

Try this and I also updated the demo:

,(?=[^"]+")

Demo: https://regex101.com/r/0rtcFt/6

17 Comments

And if there isn't a "HAY"? what about "HAZ" or "lhzikjrtjfnosrg"? Check hardcoded for every possible combination?
Thanks for the answer! But Dirk is right, the problem is, I took "HAY,HAYE" as an example, but I also have "BREMAND,BREMAUD", or stuff like that, and there are way too many lines to hardcode all of them.
@DirkReichel I updated my code, thank you for pointing that out.
@Alan The code is updated. Let me know if it works for you.
Seems to work for me ! I'll just have to adapt the RegEx a little bit, because I have other issues (the same issues with different columns in fact, like columns getting messed up with decimal numbers), but I should be able to work it out ! The last problem I have is to understand how to use the replacement variable, and how to give him something else than a fixed value (to just change the comma into something else).
|
1

The correct approach here (since you commented that double quotes only appear as field delimiters) is to match double quoted substrings with a simple "[^"]+" regex and replace commas with hyphens only inside the matches.

Here is a sample code:

Sub CallbackTest()
Dim rxStr As RegExp
Dim s As String
Dim m As Object

s = """SOME,MORE,HERE"",728,""HAY,HAYE"",Marie,François,RAUTUREAU,85,29/05/1856,68;"

Set rxStr = New RegExp
rxStr.pattern = """[^""]+"""
rxStr.Global = True

For Each m In rxStr.Execute(s)
   s = Left(s, m.FirstIndex) & Replace(m.Value, ",", "-") & Mid(s, m.FirstIndex + Len(m.Value) + 1)
Next m
Debug.Print s              ' Print demo results
' => "SOME-MORE-HERE",728,"HAY-HAYE",Marie,François,RAUTUREAU,85,29/05/1856,68;

End Sub

2 Comments

Thanks a lot ! So, if I want to adapt this to my file, I have to remove the variable s, and to replace it with all the cells/lines of the file?
The s is the cell value, the input. You may use a temp value that you can assign the original cell value, and only change the cell contents if it was changed. Or add a If rxStr.Test(s) Then For Each ... End If.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.