5

I've been using Python to implement a custom parser and then using the parsed data to format a Word document for internal distribution. All of the formatting has been straightforward so far, but I'm completely stumped on how to insert a checkbox into individual table cells.

I've tried using the element methods within python-docx (get_or_add_tcPr(), etc.), which causes MS Word to throw the following error when I try to open the file: "The file xxxx cannot be opened because there are problems with the contents. Details: The file is corrupt and cannot be opened".

After struggling with this for a while, I moved to a second approach: manipulating the word/document.xml file of the output document directly. I retrieved what I believe to be the correct XML for a checkbox, saved as replacementXML, and inserted filler text into the cells to act as a tag that can be searched for and replaced, searchXML. The following code runs under Python on Linux (Fedora 25), but Word displays the same error when I try to open the document. This time, however, the document is recoverable and reverts to the filler text. I've been able to get this to work with a manually created document and an empty table cell, so I believe it should be possible. NOTE: I've included the whole XML element for the table cell in the searchXML variable, but I've also tried regular expressions and shorter search strings rather than only an exact match, since I know the cell XML can differ from cell to cell.

import os
import re

searchXML = r'<w:tc><w:tcPr><w:tcW w:type="dxa" w:w="4320"/><w:gridSpan w:val="2"/></w:tcPr><w:p><w:pPr><w:jc w:val="right"/></w:pPr><w:r><w:rPr><w:sz w:val="16"/></w:rPr><w:t>IN_CHECKB</w:t></w:r></w:p></w:tc>'

def addCheckboxes():
    os.system("mkdir unzipped")
    os.system("unzip tempdoc.docx -d unzipped/")

    with open('unzipped/word/document.xml', encoding="ISO-8859-1") as file:
        filedata = file.read()

    # Replace one placeholder cell per pass so each checkbox gets its own id
    rep_count = 0
    while re.search(searchXML, filedata):
        filedata = replaceXML(filedata, rep_count)
        rep_count += 1

    # Write with the same encoding used for reading so the bytes round-trip unchanged
    with open('unzipped/word/document.xml', 'w', encoding="ISO-8859-1") as file:
        file.write(filedata)

    os.system("zip -r ../buildcfg/tempdoc.docx unzipped/*")
    os.system("rm -rf unzipped")

def replaceXML(filedata, rep_count):
    # rep_count is an int, so convert it before concatenating it into the XML
    n = str(rep_count)
    # Adjacent string literals are joined, which keeps stray newlines out of the XML
    replacementXML = (
        '<w:tc><w:tcPr><w:tcW w:w="4320" w:type="dxa"/><w:gridSpan w:val="2"/></w:tcPr>'
        '<w:p w:rsidR="00D2569D" w:rsidRDefault="00FD6FDF"><w:pPr><w:jc w:val="right"/></w:pPr>'
        '<w:r><w:rPr><w:sz w:val="16"/></w:rPr>'
        '<w:fldChar w:fldCharType="begin"><w:ffData><w:name w:val="Check1"/><w:enabled/>'
        '<w:calcOnExit w:val="0"/><w:checkBox><w:sizeAuto/><w:default w:val="0"/></w:checkBox>'
        '</w:ffData></w:fldChar></w:r>'
        '<w:bookmarkStart w:id="' + n + '" w:name="Check' + n + '"/>'
        '<w:r><w:rPr><w:sz w:val="16"/></w:rPr>'
        '<w:instrText xml:space="preserve"> FORMCHECKBOX </w:instrText></w:r>'
        '<w:r><w:rPr><w:sz w:val="16"/></w:rPr></w:r>'
        '<w:r><w:rPr><w:sz w:val="16"/></w:rPr><w:fldChar w:fldCharType="end"/></w:r>'
        '<w:bookmarkEnd w:id="' + n + '"/></w:p></w:tc>'
    )
    filedata = re.sub(searchXML, replacementXML, filedata, count=1)

    return filedata
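
For comparison, the same unzip/edit/zip round trip can be sketched with Python's standard zipfile module instead of shelling out. This is only a rough sketch under a few assumptions: rewrite_document_xml and its transform callback are hypothetical names, document.xml is assumed to be UTF-8 encoded, and the source and destination paths must differ. One thing that may be worth checking in the shell version is that `zip -r ... unzipped/*` run from the parent directory stores every member under an `unzipped/` prefix, whereas the sketch below writes each member back under its original name.

```python
import zipfile


def rewrite_document_xml(src_path, dst_path, transform):
    """Copy a .docx package, passing word/document.xml through `transform`.

    `transform` is a hypothetical callback taking and returning a str.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                # Assumption: the part is UTF-8; decode, edit, re-encode
                data = transform(data.decode("utf-8")).encode("utf-8")
            # Member names are written back unchanged, so the package
            # layout ([Content_Types].xml, word/..., etc.) is preserved
            dst.writestr(item, data)
```

A call such as `rewrite_document_xml("tempdoc.docx", "out.docx", lambda xml: xml.replace("IN_CHECKB", ""))` would then replace the filler text without touching any other part of the package.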

I have a strong feeling that there is a much simpler (and correct!) way of doing this through the python-docx library but for some reason I can't seem to get it right.

Is there a way to easily insert checkbox fields into a table cell in an MS Word doc? And if yes, how would I do that? If no, is there a better approach than manipulating the .xml file?

UPDATE: I've been able to inject XML into the document successfully using python-docx, but the checkbox and the added XML are not appearing.

I've added the following XML into a table cell:

<w:tc>
  <w:tcPr>
    <w:tcW w:type="dxa" w:w="4320"/>
    <w:gridSpan w:val="2"/>
  </w:tcPr>
  <w:p>
    <w:r>
      <w:bookmarkStart w:id="0" w:name="testName">
        <w:complexType w:name="CT_FFCheckBox">
          <w:sequence>
            <w:choice>
              <w:element w:name="size" w:type="CT_HpsMeasure"/>
              <w:element w:name="sizeAuto" w:type="CT_OnOff"/>
            </w:choice>
            <w:element w:name="default" w:type="CT_OnOff" w:minOccurs="0"/>
            <w:element w:name="checked" w:type="CT_OnOff" w:minOccurs="0"/>
          </w:sequence>
        </w:complexType>
      </w:bookmarkStart>
      <w:bookmarkEnd w:id="0" w:name="testName"/>
    </w:r>
  </w:p>
</w:tc>

by using the following python-docx code:

run = p.add_run()
tag = run._r
start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
start.set(docx.oxml.ns.qn('w:id'), '0')
start.set(docx.oxml.ns.qn('w:name'), n)
tag.append(start)

ctype = docx.oxml.OxmlElement('w:complexType')
ctype.set(docx.oxml.ns.qn('w:name'), 'CT_FFCheckBox')
seq = docx.oxml.OxmlElement('w:sequence')
choice = docx.oxml.OxmlElement('w:choice')
el = docx.oxml.OxmlElement('w:element')
el.set(docx.oxml.ns.qn('w:name'), 'size')
el.set(docx.oxml.ns.qn('w:type'), 'CT_HpsMeasure')
el2 = docx.oxml.OxmlElement('w:element')
el2.set(docx.oxml.ns.qn('w:name'), 'sizeAuto')
el2.set(docx.oxml.ns.qn('w:type'), 'CT_OnOff')

choice.append(el)
choice.append(el2)

el3 = docx.oxml.OxmlElement('w:element')
el3.set(docx.oxml.ns.qn('w:name'), 'default')
el3.set(docx.oxml.ns.qn('w:type'), 'CT_OnOff')
el3.set(docx.oxml.ns.qn('w:minOccurs'), '0')
el4 = docx.oxml.OxmlElement('w:element')
el4.set(docx.oxml.ns.qn('w:name'), 'checked')
el4.set(docx.oxml.ns.qn('w:type'), 'CT_OnOff')
el4.set(docx.oxml.ns.qn('w:minOccurs'), '0')

seq.append(choice)
seq.append(el3)
seq.append(el4)

ctype.append(seq)
start.append(ctype)

end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
end.set(docx.oxml.ns.qn('w:id'), '0')
end.set(docx.oxml.ns.qn('w:name'), n)
tag.append(end)

I can't find the reason why the XML is not reflected in the output document, but I will update with whatever I find.

4 Answers

9

I've finally been able to accomplish this after lots of digging and help from @scanny.

Checkboxes can be inserted into any paragraph in python-docx using the following function. I am inserting a checkbox into specific cells in a table.

import docx

def addCheckbox(para, box_id, name, checked):

  run = para.add_run()
  tag = run._r
  fldchar = docx.oxml.shared.OxmlElement('w:fldChar')
  fldchar.set(docx.oxml.ns.qn('w:fldCharType'), 'begin')

  ffdata = docx.oxml.shared.OxmlElement('w:ffData')
  # Keep the <w:name> element in its own variable so the `name` argument
  # (a string) is still available for the bookmark attributes further down
  name_el = docx.oxml.shared.OxmlElement('w:name')
  name_el.set(docx.oxml.ns.qn('w:val'), name)
  enabled = docx.oxml.shared.OxmlElement('w:enabled')
  calconexit = docx.oxml.shared.OxmlElement('w:calcOnExit')
  calconexit.set(docx.oxml.ns.qn('w:val'), '0')

  checkbox = docx.oxml.shared.OxmlElement('w:checkBox')
  sizeauto = docx.oxml.shared.OxmlElement('w:sizeAuto')
  default = docx.oxml.shared.OxmlElement('w:default')

  # w:default holds the initial state: '1' is checked, '0' is unchecked
  if checked:
    default.set(docx.oxml.ns.qn('w:val'), '1')
  else:
    default.set(docx.oxml.ns.qn('w:val'), '0')

  checkbox.append(sizeauto)
  checkbox.append(default)
  ffdata.append(name_el)
  ffdata.append(enabled)
  ffdata.append(calconexit)
  ffdata.append(checkbox)
  fldchar.append(ffdata)
  tag.append(fldchar)

  run2 = para.add_run()
  tag2 = run2._r
  start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
  start.set(docx.oxml.ns.qn('w:id'), str(box_id))
  start.set(docx.oxml.ns.qn('w:name'), name)
  tag2.append(start)

  run3 = para.add_run()
  tag3 = run3._r
  instr = docx.oxml.OxmlElement('w:instrText')
  instr.text = 'FORMCHECKBOX'
  tag3.append(instr)

  run4 = para.add_run()
  tag4 = run4._r
  fld2 = docx.oxml.shared.OxmlElement('w:fldChar')
  fld2.set(docx.oxml.ns.qn('w:fldCharType'), 'end')
  tag4.append(fld2)

  run5 = para.add_run()
  tag5 = run5._r
  end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
  end.set(docx.oxml.ns.qn('w:id'), str(box_id))
  end.set(docx.oxml.ns.qn('w:name'), name)
  tag5.append(end)

  return

The fldData.text object seems random, but it was taken from the XML generated by a Word document with an existing checkbox. The function fails without setting this text. I have not confirmed this, but I have heard of one scenario where a developer arbitrarily changed the string and, once the document was saved, it reverted to the original generated value.

1 Comment

Hi, is this working? When I tried it, it printed the text 'FORMCHECKBOX' but no checkbox.
2

In case anyone is looking for a "Check Box Content Control" style of check box, here's a function modeled on @Crudough's answer:

    from docx import oxml

    def add_input_checkbox_content_control(paragraph, checked):
        paragraph_tag = paragraph._element
        sdt = oxml.shared.OxmlElement('w:sdt')
        sdt_props = oxml.shared.OxmlElement('w:sdtPr')

        w_checkbox = oxml.shared.OxmlElement('w14:checkbox')
        w_checked = oxml.shared.OxmlElement('w14:checked')
        w_checked.set(oxml.ns.qn('w14:val'), "1" if checked else "0")
        w_checked_state = oxml.shared.OxmlElement('w14:checkedState')
        w_checked_state.set(oxml.ns.qn('w14:font'), "MS Gothic")
        w_checked_state.set(oxml.ns.qn('w14:val'), "2612")  # unicode value for box with x in it

        w_unchecked_state = oxml.shared.OxmlElement('w14:uncheckedState')
        w_unchecked_state.set(oxml.ns.qn('w14:font'), "MS Gothic")
        w_unchecked_state.set(oxml.ns.qn('w14:val'), "2610")  # unicode value for empty box

        w_checkbox.append(w_checked)
        w_checkbox.append(w_checked_state)
        w_checkbox.append(w_unchecked_state)
        sdt_props.append(w_checkbox)

        sdt_content = oxml.shared.OxmlElement('w:sdtContent')
        sdt_content_run = oxml.OxmlElement('w:r')
        # the box doesn't appear until clicked, so a default unicode box will need to be set as a placeholder
        sdt_content_run.text = "\u2612" if checked else "\u2610"
        sdt_content.append(sdt_content_run)

        sdt.append(sdt_props)
        sdt.append(sdt_content)
        paragraph_tag.append(sdt)

1

The key thing with these workaround functions is to have an example of XML that works, and to be able to compare the XML you generate. If you generate XML that matches the working example, it will work every time. opc-diag is handy for inspecting the XML in a Word document. Working with really small documents (like single paragraph or two-row table, for analysis purposes) makes it a lot easier to work out how Word is structuring the XML.
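
If opc-diag is not at hand, the comparison described above can be roughly approximated with the standard library alone. The following is a minimal sketch, not a replacement for opc-diag: dump_part and diff_parts are hypothetical helper names, and the file paths you pass in are your own.

```python
import difflib
import zipfile
from xml.dom import minidom


def dump_part(docx_path, part="word/document.xml"):
    """Return one part of a .docx package as indented XML text."""
    with zipfile.ZipFile(docx_path) as package:
        raw = package.read(part)
    return minidom.parseString(raw).toprettyxml(indent="  ")


def diff_parts(working_path, generated_path, part="word/document.xml"):
    """Return unified-diff lines between the same part of two documents."""
    working = dump_part(working_path, part).splitlines()
    generated = dump_part(generated_path, part).splitlines()
    return list(difflib.unified_diff(
        working, generated, "working", "generated", lineterm=""))
```

Printing `"\n".join(diff_parts("word_made.docx", "python_made.docx"))` on two very small documents should make any missing or misplaced elements easier to spot.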

An important thing to note is that the XML elements in a Word document are sequence sensitive, meaning the child elements within any other element generally have a set order in which they must appear. If you get this swapped around, you get the "repair" error you mentioned.
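
To make the ordering point concrete, here is a small standard-library sketch that inserts a child element at the position a given order calls for. Everything in it is illustrative: insert_in_order is a hypothetical helper, and PPR_ORDER is only a short excerpt of the child sequence for w:pPr, so the real list in wml.xsd should be consulted before relying on it.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Partial, illustrative excerpt of the <w:pPr> child order; the full
# sequence in wml.xsd contains many more elements between these
PPR_ORDER = ["pStyle", "keepNext", "spacing", "ind", "jc"]


def insert_in_order(parent, child, order):
    """Insert `child` before the first sibling that must come after it.

    Elements missing from `order` are treated as coming last, which is a
    simplification that may not hold for a real schema.
    """
    def rank(element):
        local_name = element.tag.split("}")[-1]
        return order.index(local_name) if local_name in order else len(order)

    child_rank = rank(child)
    for position, sibling in enumerate(parent):
        if rank(sibling) > child_rank:
            parent.insert(position, child)
            return
    parent.append(child)
```

For example, adding a `w:spacing` element to a `w:pPr` that already holds `w:pStyle` and `w:jc` places it between the two instead of appending it at the end.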

I find it much easier to manipulate the XML from within python-docx, as it takes care of all the unzipping and rezipping for you, along with a lot of the other details.

To get the sequencing right, you'll need to be familiar with the XML Schema specifications for the elements you're working with. There is an example here: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/text/paragraph-format.html

The full schema is in the code tree under ref/xsd/. Most of the elements for text are in the wml.xsd file (wml stands for WordProcessing Markup Language).

You can find examples of other so-called "workaround functions" by searching on "python-docx" workaround function. Pay particular attention to the parse_xml() function and the OxmlElement objects which will allow you to create new XML subtrees and individual elements respectively. XML elements can be positioned using regular lxml._Element methods; all XML elements in python-docx are based on lxml. http://lxml.de/api/lxml.etree._Element-class.html
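
One common stumbling block when parsing an XML fragment is that the `w:` prefix has to be declared on the root element of the fragment itself. The sketch below shows the cause and the fix using the standard library's xml.etree.ElementTree rather than lxml, purely as an illustration; the fragment and the parses helper are made up for the example. As far as I know, python-docx's docx.oxml.ns module has an nsdecls() helper that produces this kind of declaration string for use with parse_xml().

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A fragment copied out of document.xml has no namespace declaration,
# because in the full file the declaration lives on the root element
fragment = "<w:p><w:r><w:t>box</w:t></w:r></w:p>"


def parses(xml_text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Undeclared prefix: the parser cannot resolve "w:"
undeclared_ok = parses(fragment)

# Declaring the prefix on the fragment's root element resolves it
declared = fragment.replace("<w:p>", '<w:p xmlns:w="%s">' % W_NS, 1)
declared_ok = parses(declared)
```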

5 Comments

Thank you for the response scanny! I am looking into the schemas now and trying to apply it to a simpler (python generated) word doc. I will update with my progress and any other issues that arise. Also, thank you for being so active in the community. I wouldn't have made any progress with this issue if it weren't for your responses!
Hi @scanny, I've been trying to get the parse_xml() function to work for the checkbox, but I receive an lxml.etree.XMLSyntaxError complaining about the namespace not being defined. I understand the error, but I am not very familiar with XML and do not know how to add the declaration correctly. I'm using the XML from the schema you provided. I've worked with some simpler documents to understand the sequence for a checkmark, and it seems to require a paragraph within the cell. Is it correct to use 'cell._tc._add_p' and then insert the XML for the checkbox? Any help is greatly appreciated!
Perfect! Will do!
I've updated the question with my progress. Any idea why the document is not reflecting the XML that I've added?
I absolutely understand! I was unable to yesterday as I was under the minimum 'score'. I absolutely appreciate all of your help!
0

The answer by @artsiom-vahin is a good one. As the XML changes, you will need to revise this, unfortunately. In case you already have a checklist in mind, this may be helpful:

import docx

try:

    # Create an instance of a Word document
    doc = docx.Document()
    # Save the empty Word document (overwrites an existing document of the same name)
    doc.save(r"C:\My Documents\checkbox_test_document.docx")

    # Create a new instance of the empty Word document that has been created
    doc = docx.Document(r"C:\My Documents\checkbox_test_document.docx")

# Handle the PermissionError raised when a previously generated Word document is still open
except PermissionError:

    print("Need to close the open document")

check1 = "text for checkbox1"
check2 = "text for checkbox2"

checklist = [check1, check2]

def add_checlist_to_docx(document, checklist_item):  
    from docx.oxml.shared import OxmlElement, qn
    
    paragraph = document.add_paragraph()
    tag = paragraph._p
    
    sdt = OxmlElement('w:sdt')
    sdtPr = OxmlElement('w:sdtPr')
    checkbox = OxmlElement('w14:checkbox')
    checked = OxmlElement('w14:checked')
    checked.set(qn('w14:val'), '0')
    checkedState = OxmlElement('w14:checkedState')
    checkedState.set(qn('w14:val'), '2612')
    checkedState.set(qn('w14:font'), 'MS Gothic')
    uncheckedState = OxmlElement('w14:uncheckedState')
    uncheckedState.set(qn('w14:val'), '2610')
    uncheckedState.set(qn('w14:font'), 'MS Gothic')

    sdtContent = OxmlElement('w:sdtContent')
    r_box = OxmlElement('w:r')
    rPr = OxmlElement('w:rPr')
    rFonts = OxmlElement('w:rFonts')
    rFonts.set(qn('w:ascii'), 'MS Gothic')
    rFonts.set(qn('w:eastAsia'), 'MS Gothic')
    rFonts.set(qn('w:hAnsi'), 'MS Gothic')
    rFonts.set(qn('w:hint'), 'eastAsia')
    t_box = OxmlElement('w:t')

    r_text = OxmlElement('w:r')
    t_text = OxmlElement('w:t')
    t_text.set(qn('xml:space'), 'preserve')
    
    checkbox.append(checked)
    checkbox.append(checkedState)
    checkbox.append(uncheckedState)
    sdtPr.append(checkbox)
    sdt.append(sdtPr)
    rPr.append(rFonts)
    t_box.text = '☐'
    r_box.append(rPr)
    r_box.append(t_box)
    sdtContent.append(r_box)
    sdt.append(sdtContent)
    tag.append(sdt)
    t_text.text = checklist_item
    r_text.append(t_text)
    tag.append(r_text)
    return
    
for check in checklist:
    add_checlist_to_docx(doc, check)

doc.save(r"C:\My Documents\checkbox_test_document.docx")
