I am trying to convert pandas dtypes into a PySpark schema,
i.e. convert the following text:
PERSONID int64
LASTNAME object
FIRSTNAME object
ADDRESS object
CITY object
RESULT bool
into:
StructField('PERSONID',IntegerType(),True),
StructField('LASTNAME',StringType(),True),
StructField('FIRSTNAME',StringType(),True),
StructField('ADDRESS',StringType(),True),
StructField('CITY',StringType(),True),
StructField('RESULT',BooleanType(),True)
So far I have done this:
import re
query = """
PERSONID int64
LASTNAME object
FIRSTNAME object
ADDRESS object
CITY object
RESULT bool
"""
mapping = {'int64': 'IntegerType()',
           'float64': 'DoubleType()',
           'bool': 'BooleanType()',
           'object': 'StringType()'
           }
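regexp = r'(\w+)\s+(\w+)'  # raw string, so the backslashes reach the regex engine unchanged
re.findall(regexp, query)  # pattern first, then the text; returns a list of (name, dtype) tuples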
I am new to regex syntax.
How can I achieve the required result?
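One possible approach, as a minimal sketch rather than a definitive solution: extract every (column, dtype) pair with `re.findall`, look each dtype up in the mapping, and join the formatted lines. It assumes the input always has exactly one `name dtype` pair per line and that every dtype appears in the mapping. The names `pairs`, `fields` and `result` are only illustrative.

```python
import re

# Sample input from the question: one "<column> <dtype>" pair per line.
query = """
PERSONID int64
LASTNAME object
FIRSTNAME object
ADDRESS object
CITY object
RESULT bool
"""

# Mapping from pandas dtype names to PySpark type constructors (as text).
mapping = {
    'int64': 'IntegerType()',
    'float64': 'DoubleType()',
    'bool': 'BooleanType()',
    'object': 'StringType()',
}

# re.findall takes the pattern first and the text second; with two groups
# it returns a list of (name, dtype) tuples.
pairs = re.findall(r'(\w+)\s+(\w+)', query)

# One StructField(...) line per pair; a dtype missing from the mapping
# raises KeyError instead of being silently skipped.
fields = [
    "StructField('{}',{},True)".format(name, mapping[dtype])
    for name, dtype in pairs
]

result = ",\n".join(fields)
print(result)
```

Note that `re.match` only tries to match at the very start of the string, which is why `re.findall` is used here. Also, Spark's `IntegerType` is 32-bit, so `LongType()` may be a safer target for `int64` if the values can exceed that range.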