I have a CSV file with more than 2 million records and no header row. The first field is a date, the second is a time, and the third and fourth fields are the latitude and longitude. My task is to validate each record against a regex and print an error message for every record that contains invalid data. Some fields in a record can be null. I am not sure how to check each row and, when a row has errors, how to print an error message for that record. Here are a few sample records:
2018-01-01 00:15:49 43.24116 -79.85282 Lockout 134 43.39425 -79.98044 H23 9 F109 CCG 00:48:16
2018-01-01 00:16:47 43.76756 -79.41196 Flatbed Tow 435 43.77409 -79.49313 C23 10 FB88 CCG 00:18:19
2018-01-01 00:18:53 43.26671 -79.96222 Tow 172 43.2412 -79.85274 H23 11 F109 CCG 02:42:04
2018-01-01 00:22:59 43.8088942 -79.2698542 No service 35 43.78196 -79.2351 C2 50001 WL5 CLUB_AUTO 00:23:04
2018-01-01 00:25:39 43.57866 -79.63927 Tow 304 43.59991 -79.67094 C950 14 F157 CCG 02:46:21
2018-01-01 00:26:27 43.72097 -79.47553 Lockout 152 43.81375 -79.36767 C950 15 F124 CCG P2 00:50:35
2018-01-01 00:26:56 43.785702 -79.729198 Jump Start/Battery Test 55 43.68537 -79.80871 C28 50003 FB6 CCG 00:52:26
2018-01-01 00:28:08 43.79901 -79.42031 Flatbed Tow 67 43.94571 -79.44134 C950 50004 F124 CLUB_AUTO 00:35:10
2018-01-01 00:33:26 43.67615 -79.7707 Tow 84 0 0 C28 19 FB6 CCG P2 00:54:30
Below is my code:
import pandas as pd
import re
#reading the CSV (tab-delimited, no header); raw string so the backslash in the path is not treated as an escape
df = pd.read_csv(r"E:\ERS_DATA_HOOSIER.txt", delimiter='\t', dtype=str, header=None, error_bad_lines=False)
x= len(df.index)
print(x)
#check date (YYYY-MM-DD)
df[0] = df[0].str.extract(r'(\d{4}-\d{2}-\d{2})', expand=False)
#check time (HH:MM:SS)
df[1] = df[1].str.extract(r'(\d{2}:\d{2}:\d{2})', expand=False)
#check latitude
df[2] = df[2].str.extract(r'(\d{1,2}\.\d+)', expand=False)
#check longitude (may be negative)
df[3] = df[3].str.extract(r'(-?\d{1,3}\.\d+)', expand=False)
Can anyone suggest an efficient way to do this?
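For what it's worth, one direction I have been considering (a rough sketch, not tested against the full file, assuming the file is tab-delimited and the first four columns are date, time, latitude and longitude as in the sample above) is to build one boolean mask per column with str.match and only print the rows that fail:

import pandas as pd

df = pd.read_csv(r"E:\ERS_DATA_HOOSIER.txt", delimiter='\t', dtype=str, header=None, error_bad_lines=False)

# one full-match regex per column index
patterns = {
    0: r'^\d{4}-\d{2}-\d{2}$',   # date, e.g. 2018-01-01
    1: r'^\d{2}:\d{2}:\d{2}$',   # time, e.g. 00:15:49
    2: r'^-?\d{1,3}\.\d+$',      # latitude
    3: r'^-?\d{1,3}\.\d+$',      # longitude
}

# combined "invalid" mask; fillna('') makes empty fields fail the match
# instead of producing NaN (the patterns could allow '' if nulls are valid)
bad = pd.Series(False, index=df.index)
for col, pat in patterns.items():
    bad |= ~df[col].fillna('').str.match(pat)

# print an error message only for the invalid records
for idx, row in df[bad].iterrows():
    print("Invalid record at row {}: {}".format(idx + 1, row.tolist()))

This keeps the checks vectorized across all 2 million rows and only loops over the records that actually fail, but I am not sure whether str.match is the most efficient option here.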