
I am trying to convert pandas dtypes into a PySpark schema.

That is, convert the following text

PERSONID      int64
LASTNAME     object
FIRSTNAME    object
ADDRESS      object
CITY         object
RESULT         bool

Into

StructField('PERSONID',IntegerType(),True),
StructField('LASTNAME',StringType(),True),
StructField('FIRSTNAME',StringType(),True),
StructField('ADDRESS',StringType(),True),
StructField('CITY',StringType(),True),
StructField('RESULT',BooleanType(),True)

So far I have done this:

import re

query = """
PERSONID      int64
LASTNAME     object
FIRSTNAME    object
ADDRESS      object
CITY         object
RESULT         bool
"""

mapping = {'int64': 'IntegerType()',
           'float64': 'DoubleType()',
           'bool': 'BooleanType()',
           'object': 'StringType()'
          }


regexp = r'(\w+)\s+(\w+)'

re.match(regexp, query)

I am new to regex syntax.

How can I achieve the required result?

  • If the structure is consistent, using the string.strip() method might work better Commented Mar 10, 2020 at 0:26
  • I am trying to learn Regex here. Commented Mar 10, 2020 at 0:29
  • May I suggest mentioning it in your original post next time? But your regex actually works according to regex101 (although it is quite slow). So could you clarify what exactly you are struggling with? Commented Mar 10, 2020 at 0:35
  • I need the output as shown in the question. Commented Mar 10, 2020 at 0:41

1 Answer

You can solve this without regex. Regular expressions are often not the most readable solution, especially when you have not used them for a while or when the pattern is a 50-character string. Plain string methods are usually clearer, and you are less likely to forget how they work.

Syntax solution:

I have divided the solution into steps so you can study it one step at a time.

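# Drop the leading/trailing newlines, then split into one line per column.
query_s = query.strip().split('\n')
# Split each line on whitespace into [name, dtype].
query_s = [x.split() for x in query_s]
# Replace each pandas dtype with its PySpark type string.
query_s = [[x[0], mapping[x[1]]] for x in query_s]
# Assemble one StructField(...) string per column.
query_s = ["StructField('" + x[0] + "'," + x[1] + ",True)" for x in query_s]
# Join with ",\n" so that the last line gets no trailing comma.
out = ',\n'.join(query_s)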

Regex solution:

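# Compile once; group 1 is the column name, group 2 the pandas dtype.
pattern = re.compile(r"^(\w+)\s+(\w+)")
# Keep only the non-empty lines and match each line a single time.
matches = [pattern.match(x) for x in query.split('\n') if x]
query_s = ["StructField('" + m.group(1) + "'," + mapping[m.group(2)] + ",True)" for m in matches]
# Join with ",\n" so that the last line gets no trailing comma.
out = ',\n'.join(query_s)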

You can also pass a callable to re.sub(). To make this tidier, write a small function that handles each match and pass it to re.sub() as the replacement.
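A minimal sketch of the re.sub() approach, assuming the same query text and mapping as in the question. The helper name to_struct_field and the stricter line pattern (with re.MULTILINE) are illustrative choices, not part of the original post.

```python
import re

# Same dtype-to-type-string mapping as in the question.
mapping = {
    'int64': 'IntegerType()',
    'float64': 'DoubleType()',
    'bool': 'BooleanType()',
    'object': 'StringType()',
}

query = """
PERSONID      int64
LASTNAME     object
FIRSTNAME    object
ADDRESS      object
CITY         object
RESULT         bool
"""


def to_struct_field(match):
    # Hypothetical helper: re.sub() calls it once per matched line and
    # substitutes whatever string it returns.
    name, dtype = match.group(1), match.group(2)
    return f"StructField('{name}',{mapping[dtype]},True),"


# re.MULTILINE makes ^ and $ match at the start and end of every line.
pattern = re.compile(r"^(\w+)[ \t]+(\w+)$", flags=re.MULTILINE)
result = pattern.sub(to_struct_field, query.strip())

# Drop the comma after the last field to match the requested output.
result = result.rstrip(',')
print(result)
```

A dtype that is missing from mapping raises a KeyError here, which is arguably better than silently emitting a wrong type.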


2 Comments

After posting an answer I see that you have commented about the necessity of regex use. As Sam wrote, please mention it in your original post next time.
Never mind. Even though I need regex, I will award an upvote. Congrats on your first 10 points.
