PHP for repetitive large data and array processing

Question

My question really revolves around the repetitive use of a large amount of data.

I have about 50 MB of data that I need to cross-reference repeatedly during a single PHP page execution. This task is most easily solved by using SQL queries with table joins. The problem is the sheer volume of data that I need to process in a very short amount of time and the number of queries required to do it.

What I am currently doing is dumping the relevant part of each table (usually in excess of 30% or 10k rows) into an array and looping. The table joins are always on a single field, so I built a really basic 'index' of sorts to identify which rows are relevant.

The system works. It's been in my production environment for over a year, but now I'm trying to squeeze even more performance out of it. On one particular page I'm profiling, the second highest total time is attributed to the increment line that loops through these arrays. Its hit count is 1.3 million, for a total execution time of 30 seconds. This represents the work that would have been performed by about 8200 SQL queries to achieve the same result.

What I'm looking for is anyone else who has run into a situation like this. I really can't believe that I'm anywhere near the first person to have large amounts of data that need to be processed in PHP.

Thanks!

Thank you very much to everyone who offered some advice here. It looks like there isn't really a silver bullet here like I was hoping. I think what I'm going to end up doing is using a mix of MySQL memory tables and some version of a paged memcache.

There are tons of people here who deal with big data in PHP. I'm not sure what your question is.
– goat, Commented Dec 7, 2012 at 18:31
@rambo coder I guess what I'm trying to get at here is: how do you store your data locally? How do you join without SQL (or the hack that I'm using)? There has to be a more popular, tested solution that people are using than custom coding it.
– Beachhouse, Commented Dec 7, 2012 at 18:36
You could store the data in a more optimized data structure that's suitable for your specific access patterns. It's also possible that your data doesn't change often, in which case you could keep it in memory in this optimized data structure by writing a daemon that answers queries over socket connections. There's also parallel processing (split the task up into small chunks and use a different machine for each chunk). My guess is you can just improve your code and data structure to get huge gains, though.
– goat, Commented Dec 7, 2012 at 18:43
Parallelization would totally work for this. Each of the 1000 objects could be created on any server. I'm leveraging memcache now to store the arrays, but all that really saves me is the upfront query and creation time, which is small compared to the looping time.
– Beachhouse, Commented Dec 7, 2012 at 18:48
I may have gotten ahead of myself. I think the first thing you need to be sure of is that you are highly proficient with database indexes. Also, maybe look into data warehousing methods, like star schema.
– goat, Commented Dec 7, 2012 at 18:54

Diego · Accepted Answer · 2012-12-10 18:52:57Z

1

This solution depends closely on what you are doing with the data, but I have found that using unique-value columns as array keys speeds things up a lot when you need to look up a row by the value of a column. This is because PHP uses a hash table to store the keys for fast lookups. It's hundreds of times faster than iterating over the array or using array_search. But without seeing a code example, it's hard to say.
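As a rough illustration of the keyed-array idea, here is a minimal sketch. The rows, the column names, and the buildGroupedIndex helper are all made up for the example and are not taken from the question. When the join column is not unique, the same approach still works if each key holds a list of rows, so only the matching rows are ever looped over.

```php
<?php
// Hypothetical sample rows; in practice these would come from a table dump.
$orders = [
    ['id' => 1, 'customer_id' => 10, 'total' => 25.0],
    ['id' => 2, 'customer_id' => 11, 'total' => 40.0],
    ['id' => 3, 'customer_id' => 10, 'total' => 15.5],
];

/**
 * Group rows by the value of one column (hypothetical helper).
 * Each key of the result maps to the list of rows sharing that value.
 */
function buildGroupedIndex(array $rows, string $column): array
{
    $index = [];
    foreach ($rows as $row) {
        // PHP arrays are hash tables, so this append is a keyed access,
        // not a scan of the existing entries.
        $index[$row[$column]][] = $row;
    }
    return $index;
}

// Built once per page execution, then reused for every lookup.
$ordersByCustomer = buildGroupedIndex($orders, 'customer_id');

// The "join": one hash lookup, then a loop over the matches only.
// A missing key yields an empty list instead of a notice.
$matches = $ordersByCustomer[10] ?? [];
$total = array_sum(array_column($matches, 'total'));
```

The index costs one pass over the data to build, so it pays off when the same table is probed many times, as in the batch job described in the question.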

Added from comment:

The next step is to use an in-memory database. You can use memory tables in MySQL, or SQLite. It also depends on how much of your running environment you control, because those methods would need more memory than a shared hosting provider would usually allow. It would probably also simplify your code, since you get grouping, sorting, aggregate functions, etc.
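A minimal sketch of the SQLite variant, assuming the pdo_sqlite extension is available. The table names, column names, and sample rows are hypothetical. The data is loaded once into a database that lives only in memory for the duration of the script, and the join and aggregation are left to SQL.

```php
<?php
// Assumes pdo_sqlite is enabled. Tables, columns, and rows are hypothetical.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)');
$db->exec('CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)');
// Index the join column so lookups do not scan the whole table.
$db->exec('CREATE INDEX idx_orders_customer ON orders (customer_id)');

// Load everything inside one transaction; this is much faster than
// committing after every insert.
$db->beginTransaction();
$insertCustomer = $db->prepare('INSERT INTO customers (id, name) VALUES (?, ?)');
foreach ([[10, 'Alice'], [11, 'Bob']] as $customer) {
    $insertCustomer->execute($customer);
}
$insertOrder = $db->prepare('INSERT INTO orders (id, customer_id, total) VALUES (?, ?, ?)');
foreach ([[1, 10, 25.0], [2, 11, 40.0], [3, 10, 15.5]] as $order) {
    $insertOrder->execute($order);
}
$db->commit();

// Join, grouping, and aggregation are done by SQLite rather than PHP loops.
$stmt = $db->query(
    'SELECT c.name, COUNT(o.id) AS order_count, SUM(o.total) AS order_total
     FROM customers c
     JOIN orders o ON o.customer_id = c.id
     GROUP BY c.id
     ORDER BY c.id'
);
$summary = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Depending on the PHP version, numeric columns may come back as strings, so cast them before doing arithmetic.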

edited Dec 10, 2012 at 18:52

answered Dec 7, 2012 at 18:46

Diego


4 Comments

Beachhouse Over a year ago

This is pretty much what my 'index' does. The iteration is for dealing with more than one match.

Diego Over a year ago

OK. I guess the next step is to use an in-memory database. You can use memory tables in MySQL, or SQLite (sqlite.org/inmemorydb.html). It also depends on how much of your running environment you control, because those methods would need more memory than a shared hosting provider would usually allow. It would probably also simplify your code, since you get grouping, sorting, aggregate functions, etc.

Beachhouse Over a year ago

I'm running on AWS EC2 c1.medium. The database is on AWS RDS, also running c1.medium. I'll look into in-memory tables; I'm not sure I've heard of that as an option for MySQL before.

Diego Over a year ago

You probably already found this link: dev.mysql.com/doc/refman/5.6/en/memory-storage-engine.html

RonaldBarzell · Accepted Answer · 2012-12-07 18:29:15Z

1

Well, I'm looking at a similar situation in which I have a large amount of data to process and a choice between doing as much as possible via MySQL queries and offloading the work to PHP.

So far, my experience has been this:

PHP is a lot slower than using MySQL queries.
MySQL query speed is only acceptable if I cram the logic into a single call, as the latency between calls is severe.

I'm particularly shocked by how slow PHP is for looping over an even modest amount of data. I keep thinking/hoping I'm doing something wrong...

answered Dec 7, 2012 at 18:29

RonaldBarzell


2 Comments

Beachhouse Over a year ago

8200 queries, though. It's OOP, so basically it instantiates 1000 objects and processes the data for each of them, which takes 8 queries. I could rewrite it all, but this is a batch job, and many times these objects are used one at a time.

Diego Over a year ago

If you are doing a simple "SELECT * FROM <table> WHERE id = <id>" query for each object, you can consider using memcache to cache those rows individually, so you would hit memcache before the database. Again, it depends on your use case: if the objects are used only once a day in a batch job and your cache is always "cold" (meaning it has no elements cached), it will not make any difference. Again, if I could get a code example, I would be glad to help.
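The "hit memcache before the database" pattern could look roughly like the sketch below. To keep it runnable without a memcache server, it uses an array-backed stand-in (ArrayCache, a hypothetical class) and a fake database loader. With the real Memcached extension, Memcached::get() (which returns false on a miss) and Memcached::set() would take the place of the stand-in's methods. The fetchRow helper and the key format are assumptions made for the example.

```php
<?php
// Array-backed stand-in for a memcache client (hypothetical, for illustration).
final class ArrayCache
{
    private array $items = [];

    /** Returns the cached value, or false on a miss (like Memcached::get). */
    public function get(string $key)
    {
        return $this->items[$key] ?? false;
    }

    public function set(string $key, $value): void
    {
        $this->items[$key] = $value;
    }
}

/**
 * Cache-aside lookup: try the cache first and fall back to the database
 * loader only on a miss, storing the result for later calls.
 */
function fetchRow(ArrayCache $cache, callable $loadFromDb, string $table, int $id): ?array
{
    $key = "$table:$id";
    $row = $cache->get($key);
    if ($row !== false) {
        return $row; // cache hit, no query issued
    }
    $row = $loadFromDb($table, $id);
    if ($row !== null) {
        $cache->set($key, $row);
    }
    return $row;
}

// Fake loader that counts how often the "database" is actually queried.
$dbHits = 0;
$loadFromDb = function (string $table, int $id) use (&$dbHits): ?array {
    $dbHits++;
    return ['id' => $id, 'name' => "row $id"];
};

$cache = new ArrayCache();
$first = fetchRow($cache, $loadFromDb, 'customers', 10);  // miss: queries the loader
$second = fetchRow($cache, $loadFromDb, 'customers', 10); // hit: served from cache
```

As noted above, this only helps when the same rows are requested again while the cache is warm.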
