
I have a semi-structured text file which I want to convert to a DataFrame in Spark. I do have a schema in mind, which is shown below. However, I am finding it challenging to parse my text file and apply the schema.

Following is my sample text file:

    "good service"
    Tom Martin (USA) 17th October 2015    
    4    
    Long review..    
    Type Of Traveller   Couple Leisure    
    Cabin Flown Economy    
    Route   Miami to Chicago    
    Date Flown  September 2015    
    Seat Comfort    12345    
    Cabin Staff Service 12345    
    Ground Service  12345    
    Value For Money 12345    
    Recommended no

    "not bad"
    M Muller (Canada) 22nd September 2015
    6
    Yet another long review..
    Aircraft    TXT-101
    Type Of Customer    Couple Leisure
    Cabin Flown FirstClass
    Route   IND to CHI
    Date Flown  September 2015
    Seat Comfort    12345
    Cabin Staff Service 12345
    Food & Beverages    12345
    Inflight Entertainment  12345
    Ground Service  12345
    Value For Money 12345
    Recommended yes

.
.

The resulting schema, with the output I expect, is as follows:

+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| Review_Header  | User_Name  | User_Country |  User_Review_Date   | Overall Score |          Review           | Aircraft | Type of Traveler | Cabin Flown | Route_Source | Route_Destination |   Date Flown   | Seat Comfort | Cabin Staff Service | Food & Beverage | Inflight Entertainment | Ground Service | Wifi & Connectivity | Value for Money |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| "good service" | Tom Martin | USA          | 17th October 2015   |             4 | Long review..             |          | Couple Leisure   | Economy     | Miami        | Chicago           | September 2015 |        12345 |               12345 |                 |                        |          12345 |                     |           12345 |
| "not bad"      | M Muller   | Canada       | 22nd September 2015 |             6 | Yet another long review.. | TXT-101  | Couple Leisure   | FirstClass  | IND          | CHI               | September 2015 |        12345 |               12345 |           12345 |                  12345 |          12345 |                     |           12345 |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+

As you may notice, for each block of data in the text file, the first four lines map to user-defined columns such as Review_Header, User_Name, User_Country, and User_Review_Date, whereas each of the remaining lines maps to a named column.
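
For reference, here is a rough sketch of the schema I have in mind, written as a Spark StructType (I've normalized the column names with underscores, and the types are just my assumption):

    import org.apache.spark.sql.types._

    // Target schema; field names follow the expected table above, types are assumed.
    val reviewSchema = StructType(Seq(
      StructField("Review_Header", StringType),
      StructField("User_Name", StringType),
      StructField("User_Country", StringType),
      StructField("User_Review_Date", StringType),
      StructField("Overall_Score", IntegerType),
      StructField("Review", StringType),
      StructField("Aircraft", StringType),
      StructField("Type_of_Traveler", StringType),
      StructField("Cabin_Flown", StringType),
      StructField("Route_Source", StringType),
      StructField("Route_Destination", StringType),
      StructField("Date_Flown", StringType),
      StructField("Seat_Comfort", IntegerType),
      StructField("Cabin_Staff_Service", IntegerType),
      StructField("Food_and_Beverage", IntegerType),
      StructField("Inflight_Entertainment", IntegerType),
      StructField("Ground_Service", IntegerType),
      StructField("Wifi_and_Connectivity", IntegerType),
      StructField("Value_for_Money", IntegerType)
    ))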

What would be the best way to use a schema-inference technique in such a scenario, rather than writing verbose code?

UPDATE: I would like to make this problem a little trickier. What if "Long review.." and "Yet another long review.." could themselves span multiple lines? How can I parse the multi-line review for each block?

  • Saw your update - I added some more info in my answer. Commented Apr 17, 2018 at 15:08

2 Answers


If you can guarantee that the semi-structured text file has records separated by two newlines, and that those two newlines never appear inside the "Long review..." section, you may be able to use textFile with a modified record delimiter ("\n\n") and then process the lines without writing a custom file format.

    sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n\n")
    val records = sc.textFile("sample-file.txt")   // one element per review block

Then you can do further splitting on "\n" and "\t" to create your fields and columns.
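
For example, here is a minimal sketch that continues from the records RDD above (it assumes a spark-shell session where sc and spark are already defined, tab-separated attribute lines, the single-line review layout from the question, and column names of my own choosing):

    import spark.implicits._   // enables .toDF on the RDD of tuples

    val df = records
      .filter(_.trim.nonEmpty)
      .map { block =>
        val lines = block.split("\n").map(_.trim).filter(_.nonEmpty)
        // Attribute lines are assumed to be "<name>\t<value>" pairs.
        val attrs = lines.drop(4).flatMap { l =>
          l.split("\t", 2) match {
            case Array(k, v) => Some(k.trim -> v.trim)
            case _           => None
          }
        }.toMap
        (lines(0), lines(1), lines(2), lines(3),
         attrs.getOrElse("Cabin Flown", null), attrs.getOrElse("Route", null))
      }
      .toDF("Review_Header", "User_Line", "Overall_Score", "Review",
            "Cabin_Flown", "Route")

Splitting lines(1) further into user name, country and review date, and covering the remaining attribute columns, follows the same pattern.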

Seeing your update, it's a somewhat difficult problem. You have to ask yourself what identifying information is in the attribute lines that isn't in the review, or what is guaranteed to be in a specific format. E.g.

  • Can you guarantee there are never two consecutive newlines in the long review? This is important if we're splitting on "\n\n" to generate the blocks.
  • Can you guarantee there are no tabs in the long review?
  • Is Aircraft, Cabin Flown, Cabin Staff Service, Date Flown, Food & Beverages, Ground Service, ... the full list of attributes? Do you have a full list of possible attributes?

As well as some meta questions:

  • Where is this data coming from?
  • Can we request it in a better format?
  • Can we find this data, or the aspects we're looking for from a better source?

With those answered, you'll have a better idea of how to proceed. E.g. if there are no tabs in the review text (or they're escaped as "\t" or something), you could do the following (see the sketch after this list):

  • Extract lines[0] - the review header, e.g. "good service"
  • Extract lines[1] - split into user name, country and review date
  • Extract lines[2] - the overall score
  • Find the lowest index i >= 3 such that lines[i] contains a tab - lines[i:] split into the attributes
  • Join lines[3:i] with "\n" - this is the (possibly multi-line) review
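
Putting those steps together, a minimal sketch of the per-block parsing (the function name is mine, and it still assumes tab-separated attribute lines with no tabs inside the review):

    // Parse one review block where the free-text review may span several lines.
    // Assumes attribute lines contain a tab and review lines do not.
    def parseBlock(block: String): (String, String, String, String, Map[String, String]) = {
      val lines = block.split("\n").filter(_.trim.nonEmpty)
      val header   = lines(0).trim                 // e.g. "good service"
      val userLine = lines(1).trim                 // e.g. Tom Martin (USA) 17th October 2015
      val score    = lines(2).trim                 // e.g. 4
      val rest     = lines.drop(3)
      // The first line after the score that contains a tab starts the attribute section.
      val firstAttr = rest.indexWhere(_.contains("\t")) match {
        case -1 => rest.length
        case i  => i
      }
      val review = rest.take(firstAttr).map(_.trim).mkString("\n")
      val attrs = rest.drop(firstAttr).flatMap { l =>
        l.split("\t", 2) match {
          case Array(k, v) => Some(k.trim -> v.trim)
          case _           => None
        }
      }.toMap
      (header, userLine, score, review, attrs)
    }

    // One parsed tuple per "\n\n"-delimited block from the records RDD above.
    val parsed = records.map(parseBlock)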

2 Comments

Wow! This is what I was actually looking for. :-) Thank you!!
I am wondering how I can set the same property, i.e. textinputformat.record.delimiter, when reading from the local file system.

What could be the best possible way to use schema inference technique in such scenario rather than writing verbose code?

You don't have much choice: you have to write verbose code or a custom FileFormat (which would hide the complexity of loading such files into a DataFrame).

Use DataFrameReader.textFile to load the file and transform it accordingly.

textFile(path: String): Dataset[String] Loads text files and returns a Dataset of String. See the documentation on the other overloaded textFile() method for more details.
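
For example, a minimal sketch of that approach (assuming a spark-shell session and the sample file name from the question):

    // Each element of the Dataset is one raw line of the file.
    val lines = spark.read.textFile("sample-file.txt")
    lines.show(5, truncate = false)

From there, grouping the lines back into review blocks and mapping them onto the desired columns still has to be written by hand, which is the verbose part mentioned above.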

