Python regex problem

Question

I am trying to convert pandas dtypes into a PySpark schema.

That is, convert the following text

PERSONID      int64
LASTNAME     object
FIRSTNAME    object
ADDRESS      object
CITY         object
RESULT         bool

Into

StructField('PERSONID',IntegerType(),True),
StructField('LASTNAME',StringType(),True),
StructField('FIRSTNAME',StringType(),True),
StructField('ADDRESS',StringType(),True),
StructField('CITY',StringType(),True),
StructField('RESULT',BooleanType(),True)

So far I have done this:

import re

query = """
PERSONID      int64
LASTNAME     object
FIRSTNAME    object
ADDRESS      object
CITY         object
RESULT         bool
""";

mapping = {'int64': 'IntegerType()',
           'float64': 'DoubleType()',
           'bool': 'BooleanType()',
           'object': 'StringType()'
          }


regexp = '(\w+)\s+(\w+)'

re.match(query,regexp)

I am new to regex syntax.

How can I achieve the required result?

If the structure is consistent, using the string.strip() method might work better
– Sam, Commented Mar 10, 2020 at 0:26
May I suggest mentioning it in your original post next time? But your regex actually works according to regex101 (although it is quite slow). So could you clarify what exactly you are struggling with?
– Sam, Commented Mar 10, 2020 at 0:35

deenende · Accepted Answer · 2020-03-10 02:07:06Z

2

You can solve this without regular expressions. Regexes are often not the most readable option, especially when you come back to them after a while or when the expression runs to 50 characters. Plain string methods are usually clearer, and you are less likely to forget how they work.

Syntax solution:

I have divided the solution into steps, so you can study it one step at a time.

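query_s = query.strip()                                # drop the leading/trailing blank lines
query_s = query_s.split(sep='\n')                      # one string per column
query_s = [ x.split() for x in query_s ]               # [name, dtype] pairs
query_s = [ [x[0], mapping[x[1]]] for x in query_s ]   # pandas dtype -> Spark type
query_s = [ [ "StructField('", x[0], "',", x[1], ",True)" ] for x in query_s ]
query_s = [ ''.join(x) for x in query_s ]
out = ',\n'.join(query_s)                              # commas between fields, none after the last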
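
The steps above can also be folded into a single function. This is only a sketch: the name dtypes_to_fields is hypothetical, the mapping is copied from the question (with the float64 entry written as 'DoubleType()'), and the shortened query is just sample input.

```python
# Mapping from the question; values are the text to emit, not pyspark objects.
mapping = {
    'int64': 'IntegerType()',
    'float64': 'DoubleType()',
    'bool': 'BooleanType()',
    'object': 'StringType()',
}

def dtypes_to_fields(text, mapping):
    """Turn 'NAME dtype' lines into StructField(...) source lines."""
    fields = []
    for line in text.strip().splitlines():
        name, dtype = line.split()  # e.g. ['PERSONID', 'int64']
        fields.append(f"StructField('{name}',{mapping[dtype]},True)")
    # Commas go between fields only, so the last line has none.
    return ',\n'.join(fields)

query = """
PERSONID      int64
LASTNAME     object
RESULT         bool
"""
print(dtypes_to_fields(query, mapping))
```

A dtype that is missing from mapping raises a KeyError here, which is probably what you want while the mapping is still incomplete.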

Regex solution:

pattern = re.compile(r"^(\w+)\s+(\w+)")    # compile once: group 1 = name, group 2 = dtype
matches = [pattern.match(x) for x in query.split(sep='\n') if x.strip()]    # skip blank lines
query_s = ["StructField('" + m.group(1) + "'," + mapping[m.group(2)] + ",True)" for m in matches]
out = ',\n'.join(query_s)                  # commas between fields, none after the last
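
For reference, re.match takes the pattern first and the string second, and it only matches at the start of the string; the call in the question passes them the other way round. A minimal sketch that runs the pattern over the whole text in one go with findall is shown here. The anchored pattern and the shortened query are my own choices, not something from the question.

```python
import re

mapping = {
    'int64': 'IntegerType()',
    'float64': 'DoubleType()',
    'bool': 'BooleanType()',
    'object': 'StringType()',
}

query = """
PERSONID      int64
LASTNAME     object
RESULT         bool
"""

# Pattern first, string second. re.MULTILINE makes ^ and $ anchor at every line,
# and [ \t]+ keeps a match from spilling over a newline.
pattern = re.compile(r"^(\w+)[ \t]+(\w+)[ \t]*$", re.MULTILINE)
pairs = pattern.findall(query)  # list of (name, dtype) tuples

fields = [f"StructField('{name}',{mapping[dtype]},True)" for name, dtype in pairs]
out = ',\n'.join(fields)
print(out)
```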

re.sub also accepts a callable as the replacement, so for a tidier version you can write a function that builds the output from each match and pass it to re.sub().
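
A rough sketch of that idea, assuming the mapping from the question; the function name to_struct_field and the anchored pattern are hypothetical choices for illustration.

```python
import re

mapping = {
    'int64': 'IntegerType()',
    'float64': 'DoubleType()',
    'bool': 'BooleanType()',
    'object': 'StringType()',
}

def to_struct_field(match):
    """Build one StructField(...) line from a (name, dtype) match."""
    name, dtype = match.group(1), match.group(2)
    return f"StructField('{name}',{mapping[dtype]},True),"

query = """
PERSONID      int64
RESULT         bool
"""

pattern = re.compile(r"^(\w+)[ \t]+(\w+)[ \t]*$", re.MULTILINE)
# sub() calls to_struct_field once per matching line and splices in its return value.
out = pattern.sub(to_struct_field, query.strip())
out = out.rstrip(',')  # drop the comma after the last field
print(out)
```

Lines that do not match the pattern are left untouched by sub(), so unexpected input shows up unchanged in the output rather than raising an error.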

edited Mar 10, 2020 at 2:07

answered Mar 10, 2020 at 1:11


2 Comments

deenende Over a year ago

After posting an answer I see that you have commented about the necessity of regex use. As Sam wrote, please mention it in your original post next time.

BhishanPoudel Over a year ago

Never mind. Even though I need regex, I will give this an upvote. Congrats on your first 10 points.
