I'm working in a Microsoft Fabric notebook, which uses PySpark. The API response has the header 'Content-Type' = 'text/csv', and response.text looks like '"ColA","ColB","ColC"\r"Row1A","Row1B","Row1C"...' — i.e., rows appear to be separated by a bare \r. The formatting seems fine: if I copy the raw Postman output, paste it into Notepad, and save it as .csv, it opens cleanly in Excel.
The response.text is about 45 MB and 16K-17K rows. I would like to load the data into a DataFrame, so I can use the DF to overwrite my Fabric Lakehouse table.
I've attempted df = spark.read.csv(responseText, header=False, schema=mySchema). The data and my schema both have 64 fields. This ran for over 15 minutes before I cancelled it, since I wasn't sure whether I had done something wrong. I plan to try again with header=True and inferSchema=True tomorrow, but first: is there preprocessing I need to apply to response.text before calling spark.read.csv() on it? Is it supposed to take this long? Is there a more efficient way to convert it to a DataFrame? (For example: can I save the response as a .csv file within the notebook, then read.csv() from that?)
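One alternative I've been sketching is to parse the text in memory first and hand Spark the parsed rows, instead of pointing spark.read.csv() at the raw string. A minimal sketch with the stdlib csv module (response and mySchema are just the names from my attempt above, and I haven't verified this is the right approach):

```python
import csv
import io

def parse_csv_text(text):
    """Parse an in-memory CSV string into a header tuple and a list of row tuples."""
    # newline="" disables newline translation, so the csv reader sees the
    # bare \r endings and treats each one as a row separator.
    reader = csv.reader(io.StringIO(text, newline=""))
    rows = [tuple(r) for r in reader]
    return rows[0], rows[1:]

# In the notebook, I imagine something like:
# header, data = parse_csv_text(response.text)
# df = spark.createDataFrame(data, schema=mySchema)
```

(The csv module handles the \r-only line endings on its own here, so no preprocessing of response.text seems necessary for this route.)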
(For the Fabric-adept: is it possible to save the text as a file.csv, upload it to the Lakehouse 'Files' section, and turn it into a table there — all programmatically, by running the notebook?)
Thank you!