
I am using python-docx to extract the data from one particular table in a Word file. The file contains multiple tables. I need to pick out that one table from the others and rearrange the data retrieved from it into a different layout.

Challenges:

  1. Can I find a particular table in a Word file using python-docx?
  2. Can I achieve my requirement using python-docx?
  • Can anyone help me with this? Thanks Commented Mar 8, 2018 at 17:21
  • You can iterate over tables in the document. See this thread for code examples: github.com/python-openxml/python-docx/issues/… Commented Mar 9, 2018 at 14:00
  • Did you get anywhere with this @sivanarayana ? I am working on a similar challenge. Commented May 24, 2018 at 20:55
  • @Watty62, I did not get any. Please share if you have. Thanks. Commented May 26, 2018 at 15:43
  • I did - see below @sivanarayana Commented May 28, 2018 at 13:32

1 Answer

This is not a complete answer, but it should point you in the right direction. It is based on a similar task I have been working on.

I ran the following code in Python 3.6 in a Jupyter notebook, but it should work just as well in a plain Python script.

First we start by importing Document from python-docx and pointing it at the document we want to work with.

from docx import Document

# Replace the placeholder below with the path to your own .docx file
document = Document("path/to/your/document.docx")

We get the list of tables in the document and print how many there are. We also create a list to hold all the tabular data.

tables = document.tables
print(len(tables))

big_data = []

Next we loop through the tables:

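for table in document.tables:
    data = []  # one dict per data row of this table
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)

        # The first row is treated as the column headings
        if i == 0:
            keys = tuple(text)
            continue

        # Pair each heading with the matching cell in this row
        row_data = dict(zip(keys, text))
        data.append(row_data)

    # Append once per table, after all of its rows have been read
    big_data.append(data)

print(big_data)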

By looping through all the tables, we read the data, creating a list of lists. Each inner list represents a table, and within it there is one dictionary per row. Each dictionary contains one key / value pair per column: the key is the column heading from the table (taken from the table's first row) and the value is the cell contents for that row in that column.
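On the first question (finding one particular table), one possible approach is to match on the text of the header row, since each table's first row is already being read as column headings. This is only a sketch: the helper names `header_texts` and `find_table` are made up for illustration and are not part of python-docx. It relies only on the `table.rows`, `row.cells` and `cell.text` attributes used above, and it assumes the wanted table has a first row whose headings identify it.

```python
def header_texts(table):
    """Return the stripped text of each cell in the table's first row."""
    if len(table.rows) == 0:
        return ()
    return tuple(cell.text.strip() for cell in table.rows[0].cells)


def find_table(tables, wanted_headers):
    """Return the first table whose header row contains every wanted heading.

    Hypothetical helper: returns None when no table matches.
    """
    wanted = {heading.strip() for heading in wanted_headers}
    for table in tables:
        if wanted.issubset(header_texts(table)):
            return table
    return None
```

With the `document` object from earlier, a call such as `find_table(document.tables, ["Version", "Changes"])` would return the first table whose heading row includes both of those column names. Headings are stripped before comparison, so a heading like 'Date ' with a trailing space still matches 'Date'.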

So, that is half of your problem. The next part would be to use python-docx to create a new table in your output document - and to fill it with the appropriate content from the list / list / dictionary data.
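As a rough sketch of that second half, the function below writes one table's list of dictionaries into a new table. The name `write_rows_to_table` is hypothetical. It assumes every dictionary shares the keys of the first one and that the output layout is simply one heading row followed by one row per dictionary; a different target layout would need its own mapping. The python-docx calls it uses are `document.add_table(rows=..., cols=...)`, `table.add_row()`, `row.cells` and assignment to `cell.text`.

```python
def write_rows_to_table(document, rows):
    """Append a table to `document`: one heading row, then one row per dict.

    Hypothetical helper; `rows` is a list of dicts such as one entry of big_data.
    Returns the new table, or None if `rows` is empty.
    """
    if not rows:
        return None

    # Assumption: the first dict's keys define the column order
    keys = list(rows[0].keys())
    table = document.add_table(rows=1, cols=len(keys))

    # Fill the heading row
    for cell, key in zip(table.rows[0].cells, keys):
        cell.text = key

    # Add one row per dictionary, leaving a cell empty if a key is missing
    for row_data in rows:
        cells = table.add_row().cells
        for cell, key in zip(cells, keys):
            cell.text = str(row_data.get(key, ""))

    return table
```

A typical use would be to create an empty output document with `Document()`, call `write_rows_to_table(output, big_data[-1])` for the table of interest, and then call `output.save(...)` with whatever output path you choose.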

In the example I have been working on, the table of interest is the final table in the document.

When I run the routine above, this is my output:

[{'Version': '1', 'Changes': 'Local Outcome Improvement Plan ', 'Page Number': '1-34 and 42-61', 'Approved By': 'CPA Board\n', 'Date ': '22 August 2016'}, 
{'Version': '2', 'Changes': 'People are resilient, included and supported when in need section added ', 'Page Number': '35-41', 'Approved By': 'CPA Board', 'Date ': '12 December 2016'}, 
{'Version': '2', 'Changes': 'Updated governance and accountability structure following approval of the Final Report for the Review of CPA Infrastructure', 'Page Number': '59', 'Approved By': 'CPA Board', 'Date ': '12 December 2016'}]]