Convert .doc files to pdf using python COM interface to Microsoft Word

Question

How can I convert a Word document to PDF by calling the Word COM interface from Python?

Steven · Accepted Answer · 2011-05-16 13:19:36Z

101

A simple example using comtypes that converts a single file, with the input and output filenames given as command-line arguments:

import sys
import os
import comtypes.client

wdFormatPDF = 17

in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

You could also use pywin32, which would be the same except for:

import win32com.client

and then:

word = win32com.client.Dispatch('Word.Application')

answered May 16, 2011 at 13:19

Steven



6 Comments

ecoe Over a year ago

For many files, consider setting word.Visible = False to save time: Word will not display its window, so the conversion essentially runs in the background.

Snorfalorpagus Over a year ago

I've managed to get this working for powerpoint documents. Use Powerpoint.Application, Presentations.Open and FileFormat=32.

user3732708 Over a year ago

When I run this, I get an error:

File "test.py", line 7, in <module>
    in_file = os.path.abspath(sys.argv[1])
IndexError: list index out of range

Open AI - Opting Out Over a year ago

@user3732708 argv[1] and argv[2] will be the names of the input and output files. You get that error if you don't specify the files on the command line.
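
One way to turn that IndexError into a readable usage message is to let argparse check the arguments. This is only a sketch: the helper name `parse_args` is hypothetical, and the two positional arguments mirror argv[1] and argv[2] from the answer.

```python
import argparse
import os


def parse_args(argv):
    """Return absolute (in_file, out_file) paths, or exit with a usage message."""
    parser = argparse.ArgumentParser(description='Convert a Word document to PDF.')
    parser.add_argument('in_file', help='path of the Word document to convert')
    parser.add_argument('out_file', help='path of the PDF to write')
    args = parser.parse_args(argv)
    return os.path.abspath(args.in_file), os.path.abspath(args.out_file)

# Typical use in the script:
#     in_file, out_file = parse_args(sys.argv[1:])
```

Running the script without both filenames then prints the expected arguments and exits instead of raising a traceback.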

asetniop Over a year ago

When running the doc.SaveAs() command I got an error and had to drop the "FileFormat=" prefix, and then it worked fine.


Al Johri · Accepted Answer · 2019-12-24 08:04:16Z

58

You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.

from docx2pdf import convert

convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")

pip install docx2pdf
docx2pdf input.docx output.pdf

Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf
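
A comment below reports that the package does not handle legacy .doc files, so it may help to select only .docx files before converting. The following is a rough sketch under that assumption; `find_docx` and `convert_folder` are hypothetical helpers, and only the `convert(input, output)` form shown above is used.

```python
from pathlib import Path


def find_docx(folder):
    """Return the .docx files directly inside folder, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == '.docx'
    )


def convert_folder(folder):
    """Convert each .docx in folder to a PDF written next to it."""
    from docx2pdf import convert  # requires Microsoft Word to be installed

    for docx_path in find_docx(folder):
        convert(str(docx_path), str(docx_path.with_suffix('.pdf')))
```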

answered Dec 24, 2019 at 8:04

Al Johri


5 Comments

abdelhedi hlel Over a year ago

@AlJohri take a look here: michalzalecki.com/converting-docx-to-pdf-using-python. This solution works on both Windows and Linux. Running on Linux is a must because most deployment servers use Linux.

diek Over a year ago

The solution asked for doc, and docx2pdf does not work for doc...

Marc Over a year ago

Lib is outdated and thus does not work.

robertspierre Jul 19 at 12:52

This doesn't hide revisions and comments, see github.com/AlJohri/docx2pdf/issues/93

TylerH Jul 19 at 20:00

@AlJohri The question specifically mentions Word 2010 so that makes this entire question Windows-specific, since Word 2010 is Windows-only.

TylerH · Accepted Answer · 2025-07-19 20:01:09Z

23

I have tested many solutions, but none of them works well on Linux distributions.

I recommend this solution:

import sys
import subprocess
import re


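def convert_to(folder, source, timeout=None):
    args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]

    # Capture stdout so the path of the generated file can be read from it
    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
    filename = re.search('-> (.*?) using filter', process.stdout.decode())

    if filename is None:
        # LibreOffice did not report a converted file
        raise RuntimeError(process.stderr.decode())
    return filename.group(1)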


def libreoffice_exec():
    # TODO: Provide support for more platforms
    if sys.platform == 'darwin':
        return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
    return 'libreoffice'

and you call your function:

result = convert_to('TEMP Directory',  'Your File', timeout=15)

Sources from: https://michalzalecki.com/converting-docx-to-pdf-using-python/
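
On the question of why stdout and stderr are piped: the function reads the name of the generated file from LibreOffice's standard output, so that output has to be captured. A minimal sketch of the same idea with the parsing separated from the process call; the names `parse_converted_path` and `convert_to_pdf` are illustrative, and `capture_output=True` (Python 3.7+) is shorthand for passing both PIPE arguments.

```python
import re
import subprocess


def parse_converted_path(stdout_text):
    """Extract the output path from a 'convert ... -> ... using filter' line."""
    match = re.search(r'-> (.*?) using filter', stdout_text)
    return match.group(1) if match else None


def convert_to_pdf(source, outdir, soffice='libreoffice', timeout=None):
    """Run a headless conversion and return the PDF path, raising if none was reported."""
    result = subprocess.run(
        [soffice, '--headless', '--convert-to', 'pdf', '--outdir', outdir, source],
        capture_output=True,  # same as stdout=PIPE, stderr=PIPE
        text=True,            # decode the captured bytes to str
        timeout=timeout,
    )
    pdf_path = parse_converted_path(result.stdout)
    if pdf_path is None:
        raise RuntimeError('conversion failed: ' + result.stderr.strip())
    return pdf_path
```

Without the captured output, the call would still convert the file, but the function would have no way to report where the PDF was written.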

edited Jul 19 at 20:01

TylerH


answered Feb 28, 2020 at 20:05

abdelhedi hlel


3 Comments

not2qubit Over a year ago

This is not using Python, this is just running the libre office exe from a python script.

johnson Nov 29, 2024 at 14:23

Why do I need stdout=subprocess.PIPE and stderr=subprocess.PIPE?

robertspierre Jul 19 at 13:27

The question specifically asks to use the Microsoft Word COM interface to Python. I have modified the title to reflect this.

Yang · Accepted Answer · 2016-12-15 16:52:12Z

I spent half a day on this problem, so I think I should share my experience. Steven's answer is right, but it fails on my computer. Two key points fixed it:

(1). After creating the 'Word.Application' object for the first time, I have to make the Word app visible before opening any documents. (I cannot explain why this works. If I do not do this on my computer, the program crashes when I try to open a document in invisible mode, and the 'Word.Application' object is then deleted by the OS.)

(2). After doing (1), the program sometimes works but still fails often. The error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM server may not be able to respond quickly enough, so I added a delay before opening a document.

With these two changes, the program works without failures for me. The demo code is below. If you have run into the same problems, try these two steps. Hope it helps.

    import os
    import comtypes.client
    import time


    wdFormatPDF = 17


    # absolute path is needed
    # be careful about the slash '\', use '\\' or '/' or raw string r"..."
    in_file=r'absolute path of input docx file 1'
    out_file=r'absolute path of output pdf file 1'

    in_file2=r'absolute path of input docx file 2'
    out_file2=r'absolute path of output pdf file 2'

    # print out filenames
    print(in_file)
    print(out_file)
    print(in_file2)
    print(out_file2)


    # create COM object
    word = comtypes.client.CreateObject('Word.Application')
    # key point 1: make word visible before open a new document
    word.Visible = True
    # key point 2: wait for the COM Server to prepare well.
    time.sleep(3)

    # convert docx file 1 to pdf file 1
    doc=word.Documents.Open(in_file) # open docx file 1
    doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
    doc.Close() # close docx file 1
    word.Visible = False
    # convert docx file 2 to pdf file 2
    doc = word.Documents.Open(in_file2) # open docx file 2
    doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
    doc.Close() # close docx file 2   
    word.Quit() # close Word Application
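
As an alternative to a fixed sleep, the "Call was rejected by callee" case could be handled by retrying the call a few times. This is a generic sketch, not part of the answer above: `retry` is a hypothetical helper, and the attempt count and delay are arbitrary values to tune.

```python
import time


def retry(func, attempts=5, delay=1.0, exceptions=(Exception,)):
    """Call func() until it succeeds, sleeping `delay` seconds after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            time.sleep(delay)
```

With comtypes this might be used as `doc = retry(lambda: word.Documents.Open(in_file), exceptions=(comtypes.COMError,))`, so the script waits only as long as Word actually needs.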

patrick · Accepted Answer · 2018-05-24 19:06:17Z

8

As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.

doc.ExportAsFixedFormat(
    OutputFileName=pdf_file,
    ExportFormat=17,        # 17 = PDF output, 18 = XPS output
    OpenAfterExport=False,
    OptimizeFor=0,          # 0 = Print (higher res), 1 = Screen (lower res)
    CreateBookmarks=1,      # 0 = No bookmarks, 1 = Heading bookmarks only, 2 = Bookmarks match Word bookmarks
    DocStructureTags=True,
)

The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
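
To avoid scattering bare numbers through the call, the options can be wrapped in a small helper with named constants. A sketch only: `export_pdf` and its parameters are hypothetical, and the numeric values are the ones listed in the comments of the snippet above.

```python
# Numeric values as given in the snippet above.
WD_EXPORT_FORMAT_PDF = 17
WD_EXPORT_OPTIMIZE_FOR_PRINT = 0
WD_EXPORT_OPTIMIZE_FOR_ONSCREEN = 1
BOOKMARKS = {'none': 0, 'headings': 1, 'word': 2}


def export_pdf(doc, pdf_file, bookmarks='headings', for_screen=False):
    """Export an open Word document to PDF with a few commonly tuned options."""
    doc.ExportAsFixedFormat(
        OutputFileName=pdf_file,
        ExportFormat=WD_EXPORT_FORMAT_PDF,
        OpenAfterExport=False,
        OptimizeFor=(WD_EXPORT_OPTIMIZE_FOR_ONSCREEN if for_screen
                     else WD_EXPORT_OPTIMIZE_FOR_PRINT),
        CreateBookmarks=BOOKMARKS[bookmarks],
    )
```

Here `doc` is the object returned by `word.Documents.Open(...)`; for example, `export_pdf(doc, pdf_file, bookmarks='word', for_screen=True)` would request a smaller file with bookmarks taken from the Word bookmarks.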

answered May 24, 2018 at 19:06

patrick


1 Comment

robertspierre Jul 19 at 13:10

Even better use ExportAsFixedFormat3

ljmc · Accepted Answer · 2023-01-05 22:57:19Z

8

unoconv (written in Python) and OpenOffice running as a headless daemon.

https://github.com/unoconv/unoconv

http://dag.wiee.rs/home-made/unoconv/

Works very nicely for doc, docx, ppt, pptx, xls, xlsx.

Very useful if you need to convert docs or save/convert to certain formats on a server.
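
A rough sketch of driving it from a Python script, assuming unoconv is used as a command-line tool on the PATH rather than as an importable module. The helper names are hypothetical; `-f` selects the output format and `-o` the output location.

```python
import subprocess


def unoconv_command(source, output=None):
    """Build the unoconv command line for a PDF conversion."""
    cmd = ['unoconv', '-f', 'pdf']
    if output is not None:
        cmd += ['-o', output]
    cmd.append(source)
    return cmd


def convert_with_unoconv(source, output=None, timeout=None):
    """Run unoconv and raise CalledProcessError if it exits with an error."""
    subprocess.run(unoconv_command(source, output), check=True, timeout=timeout)
```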

edited Jan 5, 2023 at 22:57

ljmc


answered Oct 15, 2014 at 22:15

lxx


4 Comments

Basj Over a year ago

Can you include a sample code to show how to do it from a python script (import unoconv unoconv.dosomething(...))? The documentation only shows how to do it from command line.

Att Righ Over a year ago

"Please note that there is a rewrite of Unoconv called "Unoserver": github.com/unoconv/unoserver We are running Unoserver successfully in production, and it’s now the recommended solution. Unoserver does not have all the features of Unoconv, which features it will get depends on a combination of what people want, and if someone wants to implement it. Until Unoserver has all the major features people need, Unoconv is in bugfix mode, there will be no major changes...." from github.com/unoconv/unoconv I think I'm still placing my money on unoconv.

Att Righ Over a year ago

Heads up for other users: I had some issues making unoconv work. The approach I went for (which works okay on Linux and within Docker) was to call libreoffice directly, as described in this answer.

robertspierre Aug 5 at 7:40

The question is specifically about using the Word COM interface.

TylerH · Accepted Answer · 2025-07-19 20:02:16Z

3

I was working with this solution, but I needed to find all .docx, .dotm, .docm, .odt, .doc or .rtf files and then convert them all to .pdf (Python 3.7.5).

import os
import win32com.client

wdFormatPDF = 17
extensions = ('.doc', '.odt', '.rtf', '.docx', '.dotm', '.docm')

# Start Word once and reuse it for every file
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
try:
    for root, dirs, files in os.walk(r'your directory here'):
        for f in files:
            if not f.lower().endswith(extensions):
                continue
            print(f)
            # Word needs absolute paths
            in_file = os.path.abspath(os.path.join(root, f))
            out_file = os.path.splitext(in_file)[0] + '.pdf'
            try:
                doc = word.Documents.Open(in_file)
                try:
                    doc.SaveAs(out_file, FileFormat=wdFormatPDF)
                finally:
                    doc.Close()
            except Exception:
                # Skip documents Word cannot open or convert and carry on
                print('could not open')
                continue
            print('done')
            os.remove(in_file)  # WARNING: deletes the original document
finally:
    word.Quit()

The try/except skips documents that could not be read, so the script keeps going until the last document instead of exiting.

edited Jul 19 at 20:02

TylerH


answered Feb 17, 2020 at 18:28

John Paul Lemmon


2 Comments

not2qubit Over a year ago

What are you importing?

robertspierre Aug 4 at 1:47

This answer has unnecessary boilerplate code

user2921789 · Accepted Answer · 2017-07-01 12:37:04Z

I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing, which were usually an order of magnitude bigger than expected. After looking into how to disable the dialogs when using a virtual PDF printer, I came across Bullzip PDF Printer and I've been rather impressed with its features. It has now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.

The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.

import os, re, time, datetime, win32com.client

def print_to_Bullzip(file):
    util = win32com.client.Dispatch("Bullzip.PDFUtil")
    settings = win32com.client.Dispatch("Bullzip.PDFSettings")
    settings.PrinterName = util.DefaultPrinterName      # make sure we're controlling the right PDF printer

    # raw strings avoid invalid-escape warnings for "\." in Python 3
    outputFile = re.sub(r"\.[^.]+$", ".pdf", file)
    statusFile = re.sub(r"\.[^.]+$", ".status", file)

    settings.SetValue("Output", outputFile)
    settings.SetValue("ConfirmOverwrite", "no")
    settings.SetValue("ShowSaveAS", "never")
    settings.SetValue("ShowSettings", "never")
    settings.SetValue("ShowPDF", "no")
    settings.SetValue("ShowProgress", "no")
    settings.SetValue("ShowProgressFinished", "no")     # disable balloon tip
    settings.SetValue("StatusFile", statusFile)         # created after print job
    settings.WriteSettings(True)                        # write settings to the runonce.ini
    util.PrintFile(file, util.DefaultPrinterName)       # send to Bullzip virtual printer

    # wait until print job completes before continuing
    # otherwise settings for the next job may not be used
    timestamp = datetime.datetime.now()
    while( (datetime.datetime.now() - timestamp).seconds < 10):
        if os.path.exists(statusFile) and os.path.isfile(statusFile):
            error = util.ReadIniString(statusFile, "Status", "Errors", '')
            if error != "0":
                raise IOError("PDF was created with errors")
            os.remove(statusFile)
            return
        time.sleep(0.1)
    raise IOError("PDF creation timed out")
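
The wait-for-status-file step can also be factored into a reusable polling helper. This is a sketch of the general pattern only, with a hypothetical `wait_for_file` function; it uses `time.monotonic()` so the timeout is unaffected by changes to the system clock.

```python
import os
import time


def wait_for_file(path, timeout=10.0, interval=0.1):
    """Return True once `path` exists as a file, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(path):
            return True
        time.sleep(interval)
    return os.path.isfile(path)  # one last check at the deadline
```

In the function above, this could replace the while loop: call `wait_for_file(statusFile)`, raise the timeout error if it returns False, and otherwise read the status file as before.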

Mobasshir Bhuiya · Accepted Answer · 2023-12-12 08:53:35Z

I have modified it to support PowerPoint as well. My solution supports all the extensions listed below.

import sys
import os
from pathlib import Path
from tqdm.auto import tqdm

word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]


def windows(paths, keep_active):
    import win32com.client
    import pythoncom

    pythoncom.CoInitialize()
    word = win32com.client.dynamic.Dispatch("Word.Application")
    ppt = win32com.client.dynamic.Dispatch("Powerpoint.Application")
    wdFormatPDF = 17
    pptFormatPDF = 32

    if paths["batch"]:
        for ext in word_extensions:
            for docx_filepath in tqdm(sorted(Path(paths["input"]).glob(f"*{ext}"))):
                pdf_filepath = Path(paths["output"]) / (
                    str(docx_filepath.stem) + ".pdf"
                )
                doc = word.Documents.Open(str(docx_filepath))
                doc.SaveAs(str(pdf_filepath), FileFormat=wdFormatPDF)
                doc.Close()
        for ext in ppt_extensions:
            for ppt_filepath in tqdm(sorted(Path(paths["input"]).glob(f"*{ext}"))):
                pdf_filepath = Path(paths["output"]) / (str(ppt_filepath.stem) + ".pdf")
                ppt_ = ppt.Presentations.Open(str(ppt_filepath))
                ppt_.SaveAs(str(pdf_filepath), FileFormat=pptFormatPDF)
                ppt_.Close()
    else:
        pbar = tqdm(total=1)
        input_filepath = Path(paths["input"]).resolve()
        pdf_filepath = Path(paths["output"]).resolve()
        if input_filepath.suffix in word_extensions:
            doc = word.Documents.Open(str(input_filepath))
            doc.SaveAs(str(pdf_filepath), FileFormat=wdFormatPDF)
            doc.Close()
        else:
            ppt_ = ppt.Presentations.Open(str(input_filepath))
            ppt_.SaveAs(str(pdf_filepath), FileFormat=pptFormatPDF)
            ppt_.Close()
        pbar.update(1)

    if not keep_active:
        word.Quit()
        ppt.Quit()


def resolve_paths(input_path, output_path):
    input_path = Path(input_path).resolve()
    output_path = Path(output_path).resolve() if output_path else None
    output = {}
    if input_path.is_dir():
        output["batch"] = True
        output["input"] = str(input_path)
        if output_path:
            assert output_path.is_dir()
        else:
            output_path = str(input_path)
        output["output"] = output_path
    else:
        output["batch"] = False
        # assert str(input_path).endswith(".docx")
        output["input"] = str(input_path)
        if output_path and output_path.is_dir():
            output_path = str(output_path / (str(input_path.stem) + ".pdf"))
        elif output_path:
            assert str(output_path).endswith(".pdf")
        else:
            output_path = str(input_path.parent / (str(input_path.stem) + ".pdf"))
        output["output"] = output_path
    return output


def convert(input_path, output_path=None, keep_active=False):
    paths = resolve_paths(input_path, output_path)
    if sys.platform == "win32":
        return windows(paths, keep_active)
    else:
        raise NotImplementedError(
            "This script is not implemented for linux and macOS as it requires Microsoft Word to be installed"
        )


def main():
    print("Processing...")
    input_path = os.path.abspath(sys.argv[1])
    convert(input_path)
    print("Processed...")


if __name__ == "__main__":
    main()

My Solution: Github Link

I have modified code from Docx2PDF
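
In the single-file branch above, any file that is not a Word extension is handed to PowerPoint. A small sketch of a stricter lookup that rejects unknown extensions and ignores case; `office_app_for` is a hypothetical helper, and the extension lists are the ones from the code above.

```python
from pathlib import Path

WORD_EXTENSIONS = {'.doc', '.odt', '.rtf', '.docx', '.dotm', '.docm'}
PPT_EXTENSIONS = {'.ppt', '.pptx'}


def office_app_for(path):
    """Return the COM ProgID of the application that should open `path`."""
    suffix = Path(path).suffix.lower()
    if suffix in WORD_EXTENSIONS:
        return 'Word.Application'
    if suffix in PPT_EXTENSIONS:
        return 'Powerpoint.Application'
    raise ValueError('unsupported file type: ' + repr(suffix))
```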

Bas Bossink · Accepted Answer · 2011-05-15 21:05:51Z

1

If you don't mind using PowerShell, have a look at this Hey, Scripting Guy! article. The code presented could be adapted to use the wdFormatPDF enumeration value of WdSaveFormat (see here). This blog article presents a different implementation of the same idea.

edited May 15, 2011 at 21:05

answered May 15, 2011 at 20:53

Bas Bossink


Comments

robertspierre · Accepted Answer · 2025-08-04 01:51:09Z

Expanding on Steven's answer, if you are using Microsoft Office 365 and still have this task, you can use the newer ExportAsFixedFormat3 which lets you control things like whether to export markup (comments and revisions), whether to create bookmarks, whether to optimize for screen or print.

Look at the documentation for all the different options.

from pathlib import Path
import win32com.client

word = win32com.client.Dispatch('Word.Application')
word.Visible = False

wdExportFormatPDF = 17
wdExportDocumentContent = 0  # Exports the document without markup.
wdExportCreateNoBookmarks = 0  # Do not create bookmarks in the exported document.
wdExportOptimizeForPrint = 0  # Export for print, which is higher quality and results in a larger file size.

file = Path("document.docx")
out_file = file.with_suffix('.pdf')
doc = word.Documents.Open(str(file.resolve()))
doc.ExportAsFixedFormat3(
   str(out_file.resolve()),
   OptimizeFor=wdExportOptimizeForPrint,
   Item=wdExportDocumentContent,
   CreateBookmarks=wdExportCreateNoBookmarks,
   ExportFormat=wdExportFormatPDF
)
doc.Close()

word.Quit()
