
I have a semi-structured text file which I want to convert to a DataFrame in Spark. I do have a schema in mind, which is shown below. However, I am finding it challenging to parse my text file and apply the schema.

Following is my sample text file:

    "good service"
    Tom Martin (USA) 17th October 2015    
    4    
    Long review..    
    Type Of Traveller   Couple Leisure    
    Cabin Flown Economy    
    Route   Miami to Chicago    
    Date Flown  September 2015    
    Seat Comfort    12345    
    Cabin Staff Service 12345    
    Ground Service  12345    
    Value For Money 12345    
    Recommended no

    "not bad"
    M Muller (Canada) 22nd September 2015
    6
    Yet another long review..
    Aircraft    TXT-101
    Type Of Customer    Couple Leisure
    Cabin Flown FirstClass
    Route   IND to CHI
    Date Flown  September 2015
    Seat Comfort    12345
    Cabin Staff Service 12345
    Food & Beverages    12345
    Inflight Entertainment  12345
    Ground Service  12345
    Value For Money 12345
    Recommended yes

.
.

The resulting schema, with the output I expect, is as follows:

+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| Review_Header  | User_Name  | User_Country |  User_Review_Date   | Overall Score |          Review           | Aircraft | Type of Traveler | Cabin Flown | Route_Source | Route_Destination |   Date Flown   | Seat Comfort | Cabin Staff Service | Food & Beverage | Inflight Entertainment | Ground Service | Wifi & Connectivity | Value for Money |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
| "good service" | Tom Martin | USA          | 17th October 2015   |             4 | Long review..             |          | Couple Leisure   | Economy     | Miami        | Chicago           | September 2015 |        12345 |               12345 |                 |                        |          12345 |                     |           12345 |
| "not bad"      | M Muller   | Canada       | 22nd September 2015 |             6 | Yet another long review.. | TXT-101  | Couple Leisure   | FirstClass  | IND          | CHI               | September 2015 |        12345 |               12345 |           12345 |                  12345 |          12345 |                     |           12345 |
+----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+

As you may notice, for each block of data in the text file, the first four lines map to user-defined columns such as Review_Header, User_Name, User_Country, and User_Review_Date, whereas each of the remaining lines maps to a named column.
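
For reference, here is a rough sketch of the schema I have in mind, written as a Spark StructType (I've normalized the column names with underscores, and the types are just my assumption):

    import org.apache.spark.sql.types._

    // Target schema; field names follow the expected table above, types are assumed.
    val reviewSchema = StructType(Seq(
      StructField("Review_Header", StringType),
      StructField("User_Name", StringType),
      StructField("User_Country", StringType),
      StructField("User_Review_Date", StringType),
      StructField("Overall_Score", IntegerType),
      StructField("Review", StringType),
      StructField("Aircraft", StringType),
      StructField("Type_of_Traveler", StringType),
      StructField("Cabin_Flown", StringType),
      StructField("Route_Source", StringType),
      StructField("Route_Destination", StringType),
      StructField("Date_Flown", StringType),
      StructField("Seat_Comfort", IntegerType),
      StructField("Cabin_Staff_Service", IntegerType),
      StructField("Food_and_Beverage", IntegerType),
      StructField("Inflight_Entertainment", IntegerType),
      StructField("Ground_Service", IntegerType),
      StructField("Wifi_and_Connectivity", IntegerType),
      StructField("Value_for_Money", IntegerType)
    ))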

What would be the best way to use a schema-inference technique in such a scenario, rather than writing verbose code?

UPDATE: I would like to make this problem a little trickier. What if "Long review.." and "Yet another long review.." could themselves span multiple lines? How can I parse the multi-line review for each block?

  • Saw your update - I added some more info in my answer. Commented Apr 17, 2018 at 15:08

2 Answers


If you can guarantee that the semi-structured text file has records separated by two newlines, and that those two newlines never appear inside the "Long review..." section, you may be able to use textFile with a modified record delimiter ("\n\n") and then process the lines without writing a custom file format.

    sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\n\n")
    val records = sc.textFile("sample-file.txt")   // one element per review block

Then you can do further splitting on "\n" and "\t" to create your fields and columns.
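
For example, here is a minimal sketch that continues from the records RDD above (it assumes a spark-shell session where sc and spark are already defined, tab-separated attribute lines, the single-line review layout from the question, and column names of my own choosing):

    import spark.implicits._   // enables .toDF on the RDD of tuples

    val df = records
      .filter(_.trim.nonEmpty)
      .map { block =>
        val lines = block.split("\n").map(_.trim).filter(_.nonEmpty)
        // Attribute lines are assumed to be "<name>\t<value>" pairs.
        val attrs = lines.drop(4).flatMap { l =>
          l.split("\t", 2) match {
            case Array(k, v) => Some(k.trim -> v.trim)
            case _           => None
          }
        }.toMap
        (lines(0), lines(1), lines(2), lines(3),
         attrs.getOrElse("Cabin Flown", null), attrs.getOrElse("Route", null))
      }
      .toDF("Review_Header", "User_Line", "Overall_Score", "Review",
            "Cabin_Flown", "Route")

Splitting lines(1) further into user name, country and review date, and covering the remaining attribute columns, follows the same pattern.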

Seeing your update, it's a somewhat difficult problem. You have to ask yourself what identifying information is in the attribute lines that isn't in the review, or what is guaranteed to be in a specific format. E.g.

  • Can you guarantee there are never two consecutive newlines in the long review? This is important if we're splitting on "\n\n" to generate the blocks.
  • Can you guarantee there are no tabs in the long review?
  • Is Aircraft, Cabin Flown, Cabin Staff Service, Date Flown, Food & Beverages, Ground Service, ... the full list of attributes? Do you have a full list of possible attributes?

As well as some meta questions:

  • Where is this data coming from?
  • Can we request it in a better format?
  • Can we find this data, or the aspects we're looking for from a better source?

With those answered, you'll have a better idea of how to proceed. E.g. if there are no tabs in the review text (or they're escaped as "\t" or something), you could do the following (see the sketch after this list):

  • Extract lines[0] - the review header, e.g. "good service"
  • Extract lines[1] - split into user name, country and review date
  • Extract lines[2] - the overall score
  • Find the lowest index i >= 3 such that lines[i] contains a tab - lines[i:] split into the attributes
  • Join lines[3:i] with "\n" - this is the (possibly multi-line) review
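
Putting those steps together, a minimal sketch of the per-block parsing (the function name is mine, and it still assumes tab-separated attribute lines with no tabs inside the review):

    // Parse one review block where the free-text review may span several lines.
    // Assumes attribute lines contain a tab and review lines do not.
    def parseBlock(block: String): (String, String, String, String, Map[String, String]) = {
      val lines = block.split("\n").filter(_.trim.nonEmpty)
      val header   = lines(0).trim                 // e.g. "good service"
      val userLine = lines(1).trim                 // e.g. Tom Martin (USA) 17th October 2015
      val score    = lines(2).trim                 // e.g. 4
      val rest     = lines.drop(3)
      // The first line after the score that contains a tab starts the attribute section.
      val firstAttr = rest.indexWhere(_.contains("\t")) match {
        case -1 => rest.length
        case i  => i
      }
      val review = rest.take(firstAttr).map(_.trim).mkString("\n")
      val attrs = rest.drop(firstAttr).flatMap { l =>
        l.split("\t", 2) match {
          case Array(k, v) => Some(k.trim -> v.trim)
          case _           => None
        }
      }.toMap
      (header, userLine, score, review, attrs)
    }

    // One parsed tuple per "\n\n"-delimited block from the records RDD above.
    val parsed = records.map(parseBlock)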

2 Comments

Wow! This is what I was actually looking for. :-) Thank you!!
I am wondering how I can set the same property, i.e. textinputformat.record.delimiter, when reading from the local file system.

What could be the best possible way to use schema inference technique in such scenario rather than writing verbose code?

You don't have much choice: you have to write verbose code or a custom FileFormat (which would hide the complexity of loading such files into a DataFrame).

Use DataFrameReader.textFile to load the file and transform it accordingly.

textFile(path: String): Dataset[String] Loads text files and returns a Dataset of String. See the documentation on the other overloaded textFile() method for more details.
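
For example, a minimal sketch of that approach (assuming a spark-shell session and the sample file name from the question):

    // Each element of the Dataset is one raw line of the file.
    val lines = spark.read.textFile("sample-file.txt")
    lines.show(5, truncate = false)

From there, grouping the lines back into review blocks and mapping them onto the desired columns still has to be written by hand, which is the verbose part mentioned above.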

