2
\$\begingroup\$

Autodidact and inspired by David Beazley code (to improve my skills in Python), I would like to get your feedback on this parser code.

The lazy property lets the code be computed only one time if used (https://github.com/dabeaz/python-cookbook/blob/master/src/8/lazily_computed_attributes/example1.py)



class lazyproperty:
    def __init__(self, func):
        self.func = func
    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            value = self.func(instance)
            setattr(instance, self.func.__name__, value)
            return value


class PDFParser():
    """

    """
    def __init__(self,filepath,page_num=0):
        self.filepath = filepath
        try:
            self._doc = fitz.open(filepath)
            self.page_num = page_num
            self._page = self._doc[page_num]
        except Exception as e: 
            print("Lecture PDF impossible. {}".format(e))

    @lazyproperty
    def text(self):
        return self._page.getText()

    @lazyproperty
    def _pixs(self):
        imgs = self._doc.getPageImageList(self.page_num)
        pixs =[]
        for img in imgs:
            xref = img[0]
            pix = fitz.Pixmap(self._doc, xref)
            pixs.append(pix)
        return pixs

    @lazyproperty
    def _pixpage(self):
        pix = self._page.getPixmap(colorspace=fitz.csGRAY)
        return pix
    
    @property   
    def img(self):
        return self.imgs[0]


    @lazyproperty
    def imgs(self):
        pixs = self._pixs
        imgsarray = []
        for pix in pixs:
            img = self.pix2np(pix)
            imgsarray.append(img)
        return imgsarray
        

    def write(self,outputdir,fullpage=False):
        filename = os.path.basename(self.filepath).split('.pdf')[0]
        def writePNG(pix,filepath):
            # This is GRAY or RGB
            try:       
                pix.writePNG(filepath)
            # CMYK: convert to RGB first
            except:               
                pix = fitz.Pixmap(fitz.csRGB, pix)
                pix.writePNG(filepath)
                pix = None
        if fullpage:
            filepath = os.path.join(outputdir,'{}_p{}.png'.format(filename,self.page_num))
            pix = self._pixpage
            writePNG(pix,filepath)
            return
        pixs = self._pixs
        for i,pix in enumerate(pixs):
            filepath = os.path.join(outputdir,'{}_p{}_i{}.png'.format(filename,self.page_num,i))
            writePNG(pix,filepath)
        return

    @staticmethod
    def pix2np(pix):
        """
        Convert pixmap to image np.ndarray
        https://stackoverflow.com/questions/53059007/python-opencv
        param pix: pixmap
        """
        import numpy as np
        #https://stackoverflow.com/questions/22236749/numpy-what-is-the-difference-between-frombuffer-and-fromstring
        im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.h, pix.w, pix.n)
        try:
            im = np.ascontiguousarray(im[..., [2, 1, 0]])  # rgb to bgr
        except IndexError:
            #Trick to convert Gray rto BGR, (im.reshape)
            logger.warning("Shape of image array is {}".format(im.shape)) 
            im = cv2.cvtColor(im,cv2.COLOR_GRAY2RGB)
            im = np.ascontiguousarray(im[..., [2, 1, 0]])
        return im

\$\endgroup\$
0

2 Answers 2

3
\$\begingroup\$

@cache

Thank you for the "lets the code be computed only one time" context, and for the URL citation.

It's not clear that @lazyproperty does anything for your use case that the familiar and well-tested @cache decorator hasn't already covered. If it is bringing something new to the party, please write a """docstring""" which explains that.

The Beazly examples are for immutable shapes. You seem to invite the caller to update the public .page_num attribute, which can interact with some of those lazy evaluations. It's worth documenting the details. If there's restrictions on what caller should do, then tell us.

swallowed exception

This is bad, very bad:

        try:
            self._doc = fitz.open(filepath)
            ...
        except Exception as e: 
            print("Lecture PDF impossible. {}".format(e))

If you feel that's a better diagnostic, and your user needs to see it, then great! But the next line must be:

            raise

The OP code is willing to construct an unusable object in the case of FileNotFound.

context manager

I saw the call to fitz.open(filepath), but I didn't notice a close() call. Consider turning this class into a with context manager, so caller will automatically close the PDF file once it goes out of scope.

docstring

class PDFParser():
    """

    """

Well, that was a good beginning. Now finish it.

bare except

            except:               

There are good reasons for avoiding this anti-pattern, such as messing up CTRL/C handling.

It's not clear that pix was ever set and that the initial write attempt could ever succeed.

                pix.writePNG(filepath)
                pix = None

If you were hoping to save some RAM there, it's not clear it actually produced any interesting effect, given that pix will go out of scope once we return.

explicit return

        return

We evaluated write() for side effects. Prefer to just fall off the end, rather than explicitly saying return.

The effect is the same: caller will get a None value.

\$\endgroup\$
2
\$\begingroup\$

In addition to the good advice of the previous answer, consider the following as well.

Naming

Many of the functions and variables do not have descriptive names. For example, if img and imgs mean "image" and "images", then use the actual words because they are easier to read.

def imgs(self):

If that function gets or creates images, perhaps rename it as:

def get_images(self):
    pixels = self._pixs
    images = []
    for pixel in pixels:
        images.append(self.pix2np(pixel))
    return images

The above has the words spelled out.

Also, the code is simpler with the temporary img variable removed.

Documentation

Add a docstring at the top of the file to summarize the purpose of the code. Also add docstrings to describe more of the functions.

Commas

The code has inconsistent comma usage. The black program can be used to automatically add consistent space after commas.

Import

import lines are conventionally declared at the top of the code, not inside a class function:

def pix2np(pix):
    """
    Convert pixmap to image np.ndarray
    https://stackoverflow.com/questions/53059007/python-opencv
    param pix: pixmap
    """
    import numpy as np
\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.