
I'm having a memory issue in my application with a nested for loop, and I can't figure out how to improve it. I've tried using LINQ, but I guess that internally it's the same, because the memory leak is still there.

EDIT: As requested, I'll provide more information about my problem.

I've got all of my customers (about 400,000) indexed in a Lucene document store. Each customer can be present in more than one agency, and some of them belong to 200-300 agencies.

I need to retrieve all of my customers from the 'global' customer index and build a separate index for each agency, only containing the customers it can see. There are some business rules and security rules that need to be applied to each agency index, so right now, I can't afford to maintain a single customer index for all my agencies.

My process looks like this:

int numDocuments = 400000;

// Get a Lucene Index Searcher from an Index Factory
IndexSearcher searcher = SearcherFactory.Instance.GetSearcher(Enums.CUSTOMER);

// Builds a query that gets everything in the index
Query query = QueryHelper.GetEverythingQuery();
Filter filter = new CachingWrapperFilter(new QueryWrapperFilter(query));

// Sorts by Agency Id
SortField sortField = new SortField("AgencyId", SortField.LONG);
Sort sort = new Sort(sortField);

TopDocs documents = searcher.Search(query, filter, numDocuments, sort);

for (int i = 0; i < documents.scoreDocs.Length; i++)
{
     Document document = searcher.Doc(documents.scoreDocs[i].doc);

     // Builds a customer object from the lucene document
     Customer customer = new Customer(document);

     // If this nested loop is removed, the memory doesn't grow
     foreach(Agency agency in customer.Agencies)
     {
          // Gets a writer from a factory for the agency id.
          IndexWriter writer = WriterFactory.Instance.GetWriter(agency.Id);

          // Builds an agency-specific document from the customer
          Document customerDocument = customer.GetAgencyDocument(agency.Id);

          // Adds the document to the agency's lucene index
          writer.AddDocument(customerDocument);
     }
}

EDIT: The solution

The problem was that I wasn't reusing the instances of the Document object in the inner loop, and that caused an enormous growth in the memory usage of my service. Just reusing a single instance of Document for the full process solved my problem.
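For reference, here is a rough sketch of the change (hypothetical: FillAgencyDocument stands in for a variant of GetAgencyDocument that overwrites the fields of a shared Document instead of building a new one):

// Created once for the full process instead of once per agency.
Document reusableDocument = new Document();

// ... the outer loop over the search hits stays exactly as above ...
foreach (Agency agency in customer.Agencies)
{
     IndexWriter writer = WriterFactory.Instance.GetWriter(agency.Id);

     // Hypothetical helper: overwrites the shared document's field values
     // for this agency rather than allocating a new Document.
     customer.FillAgencyDocument(reusableDocument, agency.Id);

     writer.AddDocument(reusableDocument);
}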

Thanks everyone.

9 Comments

  • it really depends on what you are doing, this code by itself isn't leaky. Commented Apr 16, 2012 at 16:17
  • There's a difference between "using a lot of memory" and "a memory leak." Commented Apr 16, 2012 at 16:18
  • You have simplified it too much. Commented Apr 16, 2012 at 16:19
  • If the problem is that these lists are just "too big" and the nesting is a major performance hit... Does the inner loop do anything Customer-specific? That is, can you logically loop through all of the Customers and then loop through all of the Agencies and still accomplish the same task? That would make the execution closer to O(2x) instead of O(x^2). Commented Apr 16, 2012 at 16:22
  • If you have a list of 400k customers, their memory is going to hang around. If you then retrieve the entire list of agencies per customer, you're not going to avoid increasing memory usage as long as the customers remain in memory. Commented Apr 16, 2012 at 16:32

6 Answers


What I believe to be happening here is:

You have too much object creation inside the loops. If at all possible do no use the new() keyword inside the loops. Initialize objects that are reusable across the loops and pass them data to work on. DO not construct new objects inside that many loops because garbage collection will become a serious problem and the garbage collector may not be able to keep up with you, and will defer collection.

The first thing you can do to test whether this is true is to force garbage collection every X iterations and wait for pending finalizers. If this brings memory down, you know that this is the problem. And solving it is easy: just do not create new instances on every loop iteration.
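A minimal sketch of that diagnostic (the interval of 10,000 iterations is an arbitrary choice for illustration):

const int gcInterval = 10000; // arbitrary interval for the experiment

for (int i = 0; i < numDocuments; i++)
{
    // ... the existing per-document work from the question ...

    if (i > 0 && i % gcInterval == 0)
    {
        // Force a full collection and wait for finalizers. If memory drops
        // here, the growth is GC pressure from short-lived allocations,
        // not a true leak.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect(); // reclaim anything freed by those finalizers
    }
}

If memory stays flat with this in place, the fix is to hoist the allocations out of the loop rather than to keep the forced collections.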




First, you should re-use the Document and Field instances that you pass to IndexWriter.AddDocument() to minimize memory usage and relieve pressure on the garbage collector.

• Re-use Document and Field instances As of Lucene 2.3 there are new setValue(...) methods that allow you to change the value of a Field. This allows you to re-use a single Field instance across many added documents, which can save substantial GC cost. It's best to create a single Document instance, then add multiple Field instances to it, but hold onto these Field instances and re-use them by changing their values for each added document. For example you might have an idField, bodyField, nameField, storedField1, etc. After the document is added, you then directly change the Field values (idField.setValue(...), etc), and then re-add your Document instance.

Note that you cannot re-use a single Field instance within a Document, and, you should not change a Field's value until the Document containing that Field has been added to the index.

http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
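A sketch of that pattern in Lucene.NET terms (the field names are illustrative, and it assumes a Lucene.NET version that exposes Field.SetValue, the port of the setValue methods mentioned above):

// Create the Document and its Field instances once, outside the loop.
Document reusableDoc = new Document();
Field idField = new Field("Id", "", Field.Store.YES, Field.Index.NOT_ANALYZED);
Field nameField = new Field("Name", "", Field.Store.YES, Field.Index.ANALYZED);
reusableDoc.Add(idField);
reusableDoc.Add(nameField);

foreach (Customer customer in customers)
{
    // Change the field values in place instead of allocating new objects...
    idField.SetValue(customer.Id.ToString());
    nameField.SetValue(customer.Name);

    // ...and re-add the same Document instance.
    writer.AddDocument(reusableDoc);
}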



The key may be how you are initializing customers and customer.Agencies. If you can, rather than returning a List, make the return types IEnumerable<Customer> and IEnumerable<Agency>. This may allow deferred execution to happen, which should consume less memory, but may make the operation take longer.

Another option would be to run the code in batches, so use your code above, but populate List<Customer> customers in batches of, e.g., 10,000 at a time.
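A rough sketch of that batching idea against the search from the question (the batch size and the ProcessBatch helper are illustrative assumptions):

const int batchSize = 10000; // illustrative batch size

for (int start = 0; start < documents.scoreDocs.Length; start += batchSize)
{
    int end = Math.Min(start + batchSize, documents.scoreDocs.Length);

    // Materialize only one batch of customers at a time.
    List<Customer> batch = new List<Customer>(end - start);
    for (int i = start; i < end; i++)
    {
        batch.Add(new Customer(searcher.Doc(documents.scoreDocs[i].doc)));
    }

    ProcessBatch(batch); // hypothetical per-batch processing

    // The batch becomes collectible here, before the next one is built.
}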

3 Comments

I wouldn't go for IEnumerable with 400000 customers
@memetolsen I guess that depends on the data source - with SQL it could definitely be an issue.
The data source is a Lucene index. Any queries to the database are done when loading the customer list (with the agency list included in each customer).

As @RedFilter said, try using IEnumerable along with the yield statement.

This may help:

http://csharpindepth.com/Articles/Chapter11/StreamingAndIterators.aspx

http://www.alteridem.net/2007/08/22/the-yield-statement-in-c/
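A minimal sketch of streaming the customers with yield (a hypothetical helper; it reuses the searcher and documents from the question and assumes the usual Lucene.Net and System.Collections.Generic usings):

// Yields customers one at a time instead of materializing a full list,
// so only the Customer currently being processed has to stay alive.
static IEnumerable<Customer> EnumerateCustomers(IndexSearcher searcher, TopDocs documents)
{
    foreach (ScoreDoc scoreDoc in documents.scoreDocs)
    {
        yield return new Customer(searcher.Doc(scoreDoc.doc));
    }
}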



Looping through a list that is already loaded in memory does not change the amount of memory the list is using.

It must be something that you are doing to the items in the list that is causing the memory growth.

You need to look at what you are trying to achieve and redesign your program to not have all data in memory at the same time.



If you mean you want to reduce the memory usage, then the basic answer is to break it up.

So get all the customers for one agency into a CustomersForAgency collection, then process just that (see the sketch below).

Clearing the CustomersForAgency collection, or letting it go out of scope, will take all those customers (and optionally that agency) out of scope, allowing .NET to reuse the memory.

That's assuming, of course, that the bulk of the memory allocation is for Customers, and not for other persistent instances used for processing that you simplified out.
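A rough per-agency sketch of that approach (the agencyIds list and the GetCustomersForAgency helper are hypothetical placeholders):

foreach (long agencyId in agencyIds) // hypothetical list of agency ids
{
    // Load only this agency's customers; the previous agency's batch is
    // now unreferenced and eligible for collection.
    List<Customer> customersForAgency = GetCustomersForAgency(agencyId);

    IndexWriter writer = WriterFactory.Instance.GetWriter(agencyId);
    foreach (Customer customer in customersForAgency)
    {
        writer.AddDocument(customer.GetAgencyDocument(agencyId));
    }
}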

2 Comments

That's something I tried before, but a customer can be present in more than one agency. That means that if I iterate over the agencies, a customer is queried from the data source once for every agency.
That was one way to break the job up; there are bound to be more. You are going for a brute-force approach at the moment, and optimisation requires a deep knowledge of the requirement and the implementation. That's you. Two ways to go: start with what you know, attack it, and see what happens; or let lateral thinking lead to a cunning plan. The latter might be possible, but if you are living this code, you need to "sleep on it" and do something else for a bit. Often cunning plans arise from not thinking too hard about it.
