
I have a method that converts PDF text into a list. After it runs, memory usage increases too much. For example, a 1000-page PDF uses 300 MB of memory and I can't free it. I have read some articles about the LOH (large object heap) but have not found a solution.

    public List<string> GetTextFromPdf()
    {
        if (_pdfDoc.Pages == null) return null;
        List<string> ocrList = new List<string>();

        foreach (var words in _pdfDoc.Pages.Select(s => s.Value.WordList))
        {
            ocrList.AddRange(words.Select(word => word.Word).Select(input => Regex.Replace(input, @"[\W]", "")));
        }

        GC.Collect();
        return ocrList;
    }
  • Don't re-parse the regex every time - use a shared Regex instance Commented Jun 19, 2011 at 13:35
  • This is probably an issue with your PDF library Commented Jun 19, 2011 at 13:35
  • Is your _pdfDoc object disposable? Commented Jun 19, 2011 at 13:36
  • The _pdfDoc wrapper class is disposable Commented Jun 19, 2011 at 13:39
  • blogs.msdn.com/b/bclteam/archive/2010/06/25/… Commented Jun 19, 2011 at 13:58
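
To illustrate the first comment above, here is a minimal sketch of a shared Regex instance. The names WordCleaner, NonWordChars and Clean are made up for the example and are not part of the original code. It only shows the reuse pattern; whether it changes the memory numbers in the question is not established here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class WordCleaner
{
    // Built once and reused for every word (hypothetical helper, not from the question).
    private static readonly Regex NonWordChars = new Regex(@"\W", RegexOptions.Compiled);

    // Strips every non-word character from each input word.
    public static List<string> Clean(IEnumerable<string> words)
    {
        return words.Select(w => NonWordChars.Replace(w, "")).ToList();
    }
}

static class Program
{
    static void Main()
    {
        var cleaned = WordCleaner.Clean(new[] { "Hello,", "world!", "(pdf)" });
        Console.WriteLine(string.Join(" ", cleaned)); // prints "Hello world pdf"
    }
}
```

In the question's method, the call to Regex.Replace(input, @"[\W]", "") would become NonWordChars.Replace(input, "").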

3 Answers


This is about normal for a 100 megabyte .pdf. You load the entire thing into memory, and that takes double the amount of memory, since a character in .NET takes 2 bytes. You will also create a bunch of garbage in the large object heap for the list. Add the typical .NET runtime overhead, and 300 megabytes is not an unexpected result.

Check this answer for details on how using the List<>.Capacity property can help reduce the LOH demands.


5 Comments

I'm curious about why you think PDFs aren't already in Unicode?
@Eric - PDF has been around too long to benefit from Unicode standardization. It has the typical 8-bit encoding zoo, "WinAnsi" is one of them.
I cleared the list and set its capacity to 0, but the memory usage is still the same.
Yes, it is rare for the Windows memory manager to release virtual memory. The odds that the released memory exactly matches a memory mapping are very low. Not a problem, it is virtual. Minimize your main window to make yourself feel better. The linked answer tried to explain how to reduce LOH usage by not allocating it in the first place.
@Orhan: It's not (just) about clearing the list but about allocating it wisely, like ocrList = new List<string>(_pdfDoc.Pages.Count * OverEstimateWordsPerPage);
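
A rough sketch of the pre-sizing idea from the comments above. The value of OverEstimateWordsPerPage and the pageCount variable (standing in for _pdfDoc.Pages.Count) are assumptions chosen for illustration only.

```csharp
using System;
using System.Collections.Generic;

static class Program
{
    // Hypothetical over-estimate; tune it for your own documents.
    const int OverEstimateWordsPerPage = 500;

    static void Main()
    {
        int pageCount = 1000; // stands in for _pdfDoc.Pages.Count

        // Allocate the backing array once, instead of letting the list
        // grow and copy repeatedly, which leaves discarded arrays behind.
        var ocrList = new List<string>(pageCount * OverEstimateWordsPerPage);
        Console.WriteLine(ocrList.Capacity); // 500000

        for (int i = 0; i < 1000; i++) ocrList.Add("word" + i);

        // Optionally shrink the backing array to the actual count when done.
        ocrList.TrimExcess();
        Console.WriteLine(ocrList.Capacity); // 1000
    }
}
```

A backing array of this size is still larger than the 85,000 byte threshold and so still lands in the LOH; the point of the sketch is only that it is allocated once rather than several times at growing sizes.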

Check whether your PDF loader is still referenced somewhere, so that it cannot be disposed.

1 Comment

The problem appears after adding the words to my ocrList. When I call _pdfDoc.Dispose() the app crashes, because the PDF is still open inside a viewer.

Is your PDF library COM-based? You may need to call Marshal.ReleaseComObject on some of your references when you have finished with them.
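
A minimal sketch of what that could look like, assuming the library really does hand out COM objects (the thread does not confirm this). ComCleanup and ReleaseIfCom are made-up names for the example; Marshal.IsComObject and Marshal.ReleaseComObject are the real framework calls.

```csharp
using System;
using System.Runtime.InteropServices;

static class ComCleanup
{
    // Releases one reference on the runtime callable wrapper if obj is a COM object.
    // Returns true when a release was attempted, false otherwise.
    public static bool ReleaseIfCom(object obj)
    {
        if (obj == null || !Marshal.IsComObject(obj)) return false;
        Marshal.ReleaseComObject(obj);
        return true;
    }
}

static class Program
{
    static void Main()
    {
        // A plain managed object is not a COM object, so nothing is released.
        Console.WriteLine(ComCleanup.ReleaseIfCom(new object())); // False
    }
}
```

Only release an object once nothing else uses it. Given that the viewer in the question still holds the document, releasing references it depends on would likely fail in the same way as the Dispose() call.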
