I am incredibly new to python, so I might not have the right terminology...

I've extracted text from a PDF using pdfplumber. That's been saved as an object. The code I used for that is:

with pdfplumber.open('Bell_2014.pdf') as pdf:
    page = pdf.pages[0]
    bell = page.extract_text()
    print(bell)

So "bell" is all of the text from the first page of the imported PDF. I need to write all of that text as a string to a csv. I tried using:

with open('Bell_2014_ex.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(bell)

and

bell_ex = 'bell_2014_ex.csv'

with open(bell_ex, 'w', newline='') as csvfile:
   file_writer = csv.writer(csvfile,delimiter=',')
   file_writer.writerow(bell)

All I keep finding when I search this is how to create a csv from specific, hard-coded characters or numbers, but nothing about writing the output of code that has already run. For instance, I can get the above code:

bell_ex = 'bell_2014_ex.csv'

with open(bell_ex, 'w', newline='') as csvfile:
   file_writer = csv.writer(csvfile,delimiter=',')
   file_writer.writerow(['bell'])

to create a csv that has "bell" in one cell of the csv, but that's as close as I can get. I feel like this should be super easy, but I just can't seem to get it to work. Any thoughts? Please and thank you for helping my inexperienced self.

5
  • We don't know what bell looks like. Can you post what print(bell) outputs? Or, since it's likely longer than we need, a trimmed-down version? Commented Jun 19, 2020 at 23:34
  • Hi! I added a screen cap of what it looks like Commented Jun 19, 2020 at 23:59
  • That looks like a single multiline string. Not a "dataframe" (you have to clarify what this is, the popular pandas.DataFrame or something else). CSV is for columnar data and I'm not seeing anything columnar. Commented Jun 20, 2020 at 0:03
  • Thank you for that clarification, I corrected my post to say object instead of dataframe. Commented Jun 20, 2020 at 0:15
  • @DMM. Off-topic, but you should actually accept the working answer to your question. It's simple courtesy and just how this site works Commented Jul 15, 2020 at 9:47

3 Answers

1

page.extract_text() is defined as: "Collates all of the page's character objects into a single string." which would make bell just a very long string.

The CSV writerow() expects by default a list of strings, with each item in the list corresponding to a single column.

Your main issue is a type mismatch: you're trying to write a single string where a list of strings is expected. Because a string is itself iterable, writerow(bell) treats every character as its own column. You will need to further operate on your bell object to convert it into a format acceptable to be written to a CSV.

Without knowing what bell contains or what you intend to write, I can't get any more specific, but the documentation for Python's csv module is very comprehensive on setting delimiters, dialects, column definitions, etc. Once you have converted bell into a proper iterable of lists of strings, you can then write it to a CSV.
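
To make the mismatch concrete, here is a minimal sketch. The sample string is only a stand-in for bell, and io.StringIO replaces a real file so the result can be inspected directly; both are assumptions for illustration, as is the write_row helper.

```python
import csv
import io

# Hypothetical stand-in for the text returned by page.extract_text()
bell = "Title line\nSecond line"

def write_row(row):
    """Write one row to an in-memory CSV and return the resulting text."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerow(row)
    return buffer.getvalue()

# A bare string is iterated character by character: one column per character
per_character = write_row(bell)

# A one-item list puts the whole string into a single cell
single_cell = write_row([bell])

print(repr(per_character[:20]))
print(repr(single_cell))
```

Reading single_cell back with csv.reader should give one row containing one field, the original string, because the writer quotes the embedded newline.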


4 Comments

I went back and added a screencap of what "bell" looks like. It's very long since it's all the text of the first page, so I cropped it.
And that screenshot just reinforces that bell is a giant string. What is also missing is what you expect. Is your CSV intended to just have one single row with one single column that contains the value of bell? If that's the case then file_writer.writerow([bell]) with the quotes removed and you're done. If your intended final CSV structure is more complex then you will need to define that structure and manipulate bell into a corresponding Python iterable before writing to CSV. One list per row, one list item per column.
Okay, that makes a lot more sense. I'll have to get clarification on what I need to do with the csv's. (I'm learning this technique for my thesis, but without any actual training, so this is a "learn as ya go" situation for me.) Thank you for helping me, truly.
FWIW you're probably better off understanding this by experimenting with the operation in reverse. Take any spreadsheet and export it as csv. Use Python's CSV reader to read the file into the default output (a list of lists) and inspect it, both at the row level and overall. When you see what Python outputs each row as in relation to the original csv, you'll be able to model what the writer expects when you want to write your data.
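
The "one list per row, one list item per column" shape described in these comments can be sketched as follows. The CSV text and the bell string are made-up stand-ins, and splitting the page text into one row per line is just one possible structure, not necessarily the one the OP needs.

```python
import csv
import io

# Hypothetical CSV text, roughly what a spreadsheet export might contain
exported = "name,year\r\nBell,2014\r\n"

# csv.reader yields one list of strings per row
rows = list(csv.reader(io.StringIO(exported, newline="")))
print(rows)

# The writer expects the same shape. Here: one row per line of text,
# with each line placed in a single column.
bell = "First line\nSecond line"  # stand-in for the extracted page text
line_rows = [[line] for line in bell.splitlines()]

out = io.StringIO(newline="")
csv.writer(out).writerows(line_rows)
print(repr(out.getvalue()))
```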
0

Some similar code I wrote recently converts a tab-separated file to csv for insertion into sqlite3 database:

Maybe this is helpful:

    import csv
    import os

    # Convert tab-delimited listfile.txt to a comma-separated values (.csv) file
    out_file = os.path.join('input', 'listfile.csv')

    # newline='' lets the csv module handle line endings itself
    with open('listfile.txt', 'r', newline='') as in_text, \
            open(out_file, 'w', newline='') as out_csv:
        in_reader = csv.reader(in_text, delimiter='\t')
        out_writer = csv.writer(out_csv, dialect=csv.excel)

        # Each row read from the tab-separated file is a list of strings
        for _line in in_reader:
            out_writer.writerow(_line)

... and that's it, not too tough

1 Comment

But OP isn't reading from a CSV so this likely doesn't apply. As an aside, you could use out_writer.writerows(in_reader) and avoid the for loop.
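
A minimal sketch of that writerows suggestion, using in-memory text in place of the real files (the sample data is made up for illustration):

```python
import csv
import io

# Hypothetical tab-separated input, standing in for listfile.txt
tsv_text = "id\tname\n1\tBell\n"

in_reader = csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t")
out_csv = io.StringIO(newline="")
out_writer = csv.writer(out_csv, dialect=csv.excel)

# writerows accepts any iterable of rows, including the reader itself
out_writer.writerows(in_reader)

print(repr(out_csv.getvalue()))
```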
0

So my problem was that I was missing the "encoding = 'utf-8'" for special characters and my delimiter needed to be a space instead of a comma. What ended up working was:

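import csv

from pdfminer.high_level import extract_text

# "text" is used instead of "object" so the built-in name is not shadowed
text = extract_text('filepath.pdf')
print(text)

new_csv = 'filename.csv'

with open(new_csv, 'w', newline='', encoding='utf-8') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=' ')
    # writerow() on a plain string writes each character as its own field
    file_writer.writerow(text)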

However, since a lot of my PDFs weren't true PDFs but scans, the csv ended up having a lot of weird symbols. This worked for about half of the PDFs I have. If you have true PDFs, this will be great. If not, I'm currently trying to figure out how to extract all the text into a pandas dataframe separated by headers within the PDFs, since pdfminer extracted all text perfectly. Thank you to everyone who helped!
