I'm building a script that needs to count the number of occurrences of each word in each file, across about 2000 files of around 500 KB each.

So that is about 1 GB of data, but MySQL usage climbs past 30 GB (then it runs out and the script ends).

I've tracked down the cause of this to my liberal use of associative arrays, which looks like this:

for ($runc = 0; $runc < $numwords; $runc++)
{
    $word = trim($content[$runc]);

    // Add the word to the word list once, when its count reaches $wordacceptance
    if ($words[$run][$word] == $wordacceptance && !$wordused[$word])
    {
        $wordlist[$onword] = $word;
        $onword++;
        $wordused[$word] = true;
    }

    $words[$run][$word]++; // +1 to number of occurrences of this word in current category
    $nwords[$run]++;
}

$run is the current category.

You can see that to count the words I'm just adding them to the associative array $words[$run][$word], which is incremented with each occurrence of each word in each category of files.

Then $wordused[$word] is used to make sure that a word doesn't get added twice to the wordlist.

$wordlist is a simple array (0,1,2,3,etc.) with a list of all different words used.

This eats up gigantic amounts of memory. Is there a more efficient way of doing this? I was considering using a MySQL memory table, but I want to do the whole thing in PHP so it's fast and portable.

  • I don't see how the code you show could cause massive memory use by mySQL? Commented Nov 2, 2011 at 10:36
  • I don't have that much data on me so I can't test it :D. But how does PHP's array_count_values function stack up on memory and processing? Commented Nov 2, 2011 at 10:39
  • Combining array_count_values is good, I'll use that to count the words after the arrays are merged and sorted. Commented Nov 2, 2011 at 11:27
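
To make the array_count_values suggestion concrete, here is a minimal sketch. The helper name countWords and the whitespace-based splitting are assumptions made for illustration; the original script's tokenization is not shown.

```php
<?php
// Hypothetical helper (not from the original script): count how often
// each word occurs in a string. Splitting on whitespace is an assumption.
function countWords(string $text): array
{
    // PREG_SPLIT_NO_EMPTY drops empty strings caused by leading/trailing whitespace.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // array_count_values builds a word => occurrences map in a single call.
    return array_count_values($words);
}

$counts = countWords("the cat and the hat and the bat");
print_r($counts);
```

This still keeps one array entry per distinct word, so it mainly replaces the hand-written loop rather than removing the associative array.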

1 Answer

Have you tried the builtin function for counting words?
http://hu2.php.net/manual/en/function.str-word-count.php

EDIT: Or use explode to get an array of words, trim them all with array_walk, then sort, and then go through the sorted array with a for loop and count the occurrences. Whenever a new word comes up in the list you can flush the count for the previous word, so there is no need to keep track of which words were seen before.
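
The sort-then-count idea could be sketched roughly as follows. The function name countSortedRuns and the sample input are hypothetical, and array_map is used for trimming instead of array_walk purely for brevity.

```php
<?php
// Hypothetical helper: count occurrences by sorting first, then scanning
// runs of equal words. Each word => count pair is yielded as soon as the
// run ends, so no associative array of counts is built.
function countSortedRuns(array $words): Generator
{
    $words = array_map('trim', $words);
    sort($words, SORT_STRING);

    $current = null; // word of the run being counted
    $count = 0;      // length of that run so far
    foreach ($words as $word) {
        if ($word === '') {
            continue; // skip empty tokens left over from splitting
        }
        if ($word !== $current) {
            if ($current !== null) {
                yield $current => $count; // "flush" the finished run
            }
            $current = $word;
            $count = 0;
        }
        $count++;
    }
    if ($current !== null) {
        yield $current => $count; // flush the last run
    }
}

$words = explode(' ', "b a c a b a");
foreach (countSortedRuns($words) as $word => $n) {
    echo "$word $n\n";
}
```

Note that the word list itself still has to fit in memory for sort to work; what this avoids is the extra per-word bookkeeping arrays.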

5 Comments

Didn't know about that one, but it doesn't look like it can count the occurrences of each word. It only gives the total number of words, or a list of the words found. But I need the number of occurrences of each word.
Use the method after "EDIT:". I think you can sort the input and then count. You don't need any bookkeeping if the input is sorted.
Hmmmm... exploding the words, combining arrays (from different files in same category) and then sorting might just work. Then no need for associative array.
What do you mean about using flush? I don't understand how it would be used there.
By flushing I meant: you don't need to keep the number of occurrences in PHP, you can just write it out somewhere. So no $counts["foo"] = 45; $counts["bar"] = 71; etc. Just write these numbers to a file, or to stdout.
