
I have a CSV file with more than 2 million records and no header row. The first field is a date, the second field is a time, and the third and fourth fields are a latitude and a longitude. My task is to validate each record against a regex and print an error message for any record that is invalid. Some records may contain null values for some fields. I am not sure how to check each row and, if there are errors, how to print an error message for that record.

2018-01-01  00:15:49    43.24116    -79.85282   Lockout 134 43.39425    -79.98044   H23 9   F109    CCG     00:48:16
2018-01-01  00:16:47    43.76756    -79.41196   Flatbed Tow 435 43.77409    -79.49313   C23 10  FB88    CCG     00:18:19
2018-01-01  00:18:53    43.26671    -79.96222   Tow 172 43.2412 -79.85274   H23 11  F109    CCG     02:42:04
2018-01-01  00:22:59    43.8088942  -79.2698542 No service  35  43.78196    -79.2351    C2  50001   WL5 CLUB_AUTO       00:23:04
2018-01-01  00:25:39    43.57866    -79.63927   Tow 304 43.59991    -79.67094   C950    14  F157    CCG     02:46:21
2018-01-01  00:26:27    43.72097    -79.47553   Lockout 152 43.81375    -79.36767   C950    15  F124    CCG P2  00:50:35
2018-01-01  00:26:56    43.785702   -79.729198  Jump Start/Battery Test 55  43.68537    -79.80871   C28 50003   FB6 CCG     00:52:26
2018-01-01  00:28:08    43.79901    -79.42031   Flatbed Tow 67  43.94571    -79.44134   C950    50004   F124    CLUB_AUTO       00:35:10
2018-01-01  00:33:26    43.67615    -79.7707    Tow 84  0   0   C28 19  FB6 CCG P2  00:54:30

Below is my code

import pandas as pd
import re
# reading CSV (raw string so the backslash in the path is not treated as an escape)
df = pd.read_csv(r"E:\ERS_DATA_HOOSIER.txt", delimiter='\t', dtype=str, header=None, error_bad_lines=False)
x = len(df.index)
print(x)
# check date
df[0] = df[0].str.extract(r'(\d{4}-\d{2}-\d{2})', expand=False)
# check time
df[1] = df[1].str.extract(r'(\d{2}:\d{2}:\d{2})', expand=False)
# check latitude
df[2] = df[2].str.extract(r'(\d{2}\.\d{3,})', expand=False)
# check longitude (may be negative)
df[3] = df[3].str.extract(r'(-?\d{2}\.\d{3,})', expand=False)

Can anyone suggest an efficient way to do this?

  • I think you have to be more specific about what the problem is here (the regex? or you don't know how to loop over the file and print errors?). And could you give some samples of your records? That would probably be helpful. Commented Jan 31, 2020 at 23:34
  • Did you try wrapping your whole code in a try/catch? Commented Feb 3, 2020 at 14:21
  • @marke I am having difficulties with how to print an error message when the data is invalid against the regex. I have updated the question with a brief sample of the dataset I am working with. Commented Feb 3, 2020 at 14:21

1 Answer


You can do it this way, one column at a time:

df = pd.read_csv('data.txt', delimiter='\t', dtype=str, header=None, error_bad_lines=False)

def check_regex(df, col, rgx):
    # returns the rows where the column does NOT match the regex;
    # na=False makes null fields count as non-matching instead of NaN
    return df[~df[col].str.contains(rgx, na=False)]

check_regex(df, 0, r'\d{4}-\d{2}-\d{2}')
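A minimal usage sketch of this approach on a toy frame (the sample values here are placeholders, not your real file), showing how to print an error message for each record the regex rejects:

```python
import pandas as pd

# toy frame standing in for the 2M-row file; column 0 holds the date field
df = pd.DataFrame({0: ['2018-01-01', 'bad-date', None]}, dtype=str)

def check_regex(df, col, rgx):
    # na=False treats null fields as non-matching instead of NaN
    return df[~df[col].str.contains(rgx, na=False)]

bad = check_regex(df, 0, r'^\d{4}-\d{2}-\d{2}$')
for idx, row in bad.iterrows():
    print(f"Invalid record at row {idx}: column 0 value {row[0]!r}")
```

The anchors `^` and `$` make the regex match the whole field, so partial matches like `xx2018-01-01` are also rejected.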

EDIT:

You can also do it like this, where the order of rgx in rgxs is the order of columns to check:

rgxs = [r'\d{4}-\d{2}-\d{2}', r'\d{2}:\d{2}:\d{2}', ...]

def check_rgx(col):
    # col.name is the column label, used to pick the matching regex
    return col.str.contains(rgxs[col.name], na=False)

mask = df.apply(check_rgx)
mask.apply(all, axis=1)
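A self-contained sketch of this column-wise mask approach on a two-column toy frame (the regexes and sample values here are assumptions for the date and time fields, not taken from your real data):

```python
import pandas as pd

# toy frame: column 0 = date, column 1 = time
df = pd.DataFrame({0: ['2018-01-01', '2018-01-02'],
                   1: ['00:15:49', 'not-a-time']}, dtype=str)

# one regex per column, in column order
rgxs = [r'^\d{4}-\d{2}-\d{2}$', r'^\d{2}:\d{2}:\d{2}$']

def check_rgx(col):
    # col.name is the integer column label, used to pick the matching regex
    return col.str.contains(rgxs[col.name], na=False)

mask = df.apply(check_rgx)     # True where a cell passes its regex
valid_rows = mask.all(axis=1)  # True only if every field in the row passes
for idx in df.index[~valid_rows]:
    print(f"Invalid record at row {idx}")
```

`mask.all(axis=1)` is equivalent to the answer's `mask.apply(all, axis=1)` but vectorized, which matters at 2 million rows.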

