I have a semi-structured text file that I want to convert to a DataFrame in Spark. I have a schema in mind, shown below, but I am finding it challenging to parse the text file and apply that schema.
Here is a sample of the text file:
"good service"
Tom Martin (USA) 17th October 2015
4
Long review..
Type Of Traveller Couple Leisure
Cabin Flown Economy
Route Miami to Chicago
Date Flown September 2015
Seat Comfort 12345
Cabin Staff Service 12345
Ground Service 12345
Value For Money 12345
Recommended no
"not bad"
M Muller (Canada) 22nd September 2015
6
Yet another long review..
Aircraft TXT-101
Type Of Customer Couple Leisure
Cabin Flown FirstClass
Route IND to CHI
Date Flown September 2015
Seat Comfort 12345
Cabin Staff Service 12345
Food & Beverages 12345
Inflight Entertainment 12345
Ground Service 12345
Value For Money 12345
Recommended yes
.
.
The schema and the result that I expect are as follows:
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| Review_Header | User_Name | User_Country | User_Review_Date | Overall Score | Review | Aircraft | Type of Traveler | Cabin Flown | Route_Source | Route_Destination | Date Flown | Seat Comfort | Cabin Staff Service | Food & Beverage | Inflight Entertainment | Ground Service | Wifi & Connectivity | Value for Money |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| "good service" | Tom Martin | USA | 17th October 2015 | 4 | Long review.. | | Couple Leisure | Economy | Miami | Chicago | September 2015 | 12345 | 12345 | | | 12345 | | 12345 |
| "not bad" | M Muller | Canada | 22nd September 2015 | 6 | Yet another long review.. | TXT-101 | Couple Leisure | FirstClass | IND | CHI | September 2015 | 12345 | 12345 | 12345 | 12345 | 12345 | | 12345 |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
As you may notice, for each block of data in the text file, the first four lines map to user-defined columns such as Review_Header, User_Name, User_Country, and User_Review_Date, whereas each of the remaining lines begins with a label that names its own column.
What would be the best way to use schema inference in this scenario, rather than writing verbose code?
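To make the intended mapping concrete, here is a minimal sketch in plain Python of how one block could be turned into a row. It is only an illustration under several assumptions of mine: a new block starts at a line wrapped in double quotes, the label-to-column table `LABELS` is hand-written from the sample, and the helper names `split_blocks` and `parse_block` are hypothetical. Labels not in the expected schema (for example Recommended) are simply skipped.

```python
import re

# Assumed mapping from labels seen in the sample to the expected column names.
LABELS = {
    "Aircraft": "Aircraft",
    "Type Of Traveller": "Type of Traveler",
    "Type Of Customer": "Type of Traveler",
    "Cabin Flown": "Cabin Flown",
    "Route": "Route",
    "Date Flown": "Date Flown",
    "Seat Comfort": "Seat Comfort",
    "Cabin Staff Service": "Cabin Staff Service",
    "Food & Beverages": "Food & Beverage",
    "Inflight Entertainment": "Inflight Entertainment",
    "Ground Service": "Ground Service",
    "Wifi & Connectivity": "Wifi & Connectivity",
    "Value For Money": "Value for Money",
}

# "Tom Martin (USA) 17th October 2015" -> name, country, review date
USER_LINE = re.compile(r"^(.*?)\s*\((.*?)\)\s*(.*)$")


def split_blocks(text):
    """Group non-empty lines into blocks; a fully quoted line starts a new block."""
    blocks, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith('"') and line.endswith('"') and current:
            blocks.append(current)
            current = []
        current.append(line)
    if current:
        blocks.append(current)
    return blocks


def parse_block(lines):
    """Map the four positional lines, then the labelled lines, to columns."""
    match = USER_LINE.match(lines[1])
    if match is None:
        raise ValueError("unexpected user line: " + lines[1])
    name, country, date = match.groups()
    row = {
        "Review_Header": lines[0],
        "User_Name": name,
        "User_Country": country,
        "User_Review_Date": date,
        "Overall Score": lines[2],
        "Review": lines[3],
    }
    for line in lines[4:]:
        for label, column in LABELS.items():
            if line.startswith(label + " "):
                value = line[len(label) + 1:]
                if column == "Route":
                    # "Miami to Chicago" -> source and destination
                    source, _, destination = value.partition(" to ")
                    row["Route_Source"] = source
                    row["Route_Destination"] = destination
                else:
                    row[column] = value
                break
    return row


SAMPLE = '''"good service"
Tom Martin (USA) 17th October 2015
4
Long review..
Type Of Traveller Couple Leisure
Route Miami to Chicago
"not bad"
M Muller (Canada) 22nd September 2015
6
Yet another long review..
Aircraft TXT-101
Type Of Customer Couple Leisure
Food & Beverages 12345
'''

rows = [parse_block(block) for block in split_blocks(SAMPLE)]
```

Columns that a block does not mention are simply absent from its dictionary. Presumably these rows, with the absent columns filled with None, could then be handed to `spark.createDataFrame` together with an explicit schema, but I have not worked that part out.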
UPDATE: I would like to make this problem a little trickier. What if "Long review.." and "Yet another long review.." could themselves span multiple lines? How can I parse a review that spans multiple lines within each block?
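One idea I am considering for the multi-line case, sketched in plain Python: treat everything after the score line as review text until the first line that starts with a known label. The tuple `KNOWN_LABELS` and the helper `split_review` are hypothetical names of mine, and the sketch assumes that no review line happens to begin with one of those labels.

```python
# Assumed set of labels that can follow the review, taken from the sample.
KNOWN_LABELS = (
    "Aircraft", "Type Of Traveller", "Type Of Customer", "Cabin Flown",
    "Route", "Date Flown", "Seat Comfort", "Cabin Staff Service",
    "Food & Beverages", "Inflight Entertainment", "Ground Service",
    "Wifi & Connectivity", "Value For Money", "Recommended",
)


def split_review(block_lines, labels=KNOWN_LABELS):
    """Return (review_text, labelled_lines) for one block.

    Lines 0-2 are the header, the user line and the score; the review
    runs from line 3 up to the first line starting with a known label.
    """
    prefixes = tuple(label + " " for label in labels)
    body = block_lines[3:]
    for index, line in enumerate(body):
        if line.startswith(prefixes):
            return "\n".join(body[:index]), body[index:]
    # No labelled line found: the whole remainder is review text.
    return "\n".join(body), []


block = [
    '"good service"',
    "Tom Martin (USA) 17th October 2015",
    "4",
    "Long review, first line.",
    "Second line of the same review.",
    "Type Of Traveller Couple Leisure",
    "Cabin Flown Economy",
]
review, labelled = split_review(block)
```

The weak point is the assumption itself: a review sentence that starts with, say, "Route " would be cut short, so I am not sure this is robust enough.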