3

I'm writing a basic categorization tool that will take a title and then compare it to an array of keywords. Example:

$cat['dining'] = array('food','restaurant','brunch','meal','cand(y|ies)');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';

Are there creative ways to loop through these categories or to see which category has the most matches? Note that in the 'dining' array, I have regex to match variations on the word candy. I tried the following, but with these category lists getting pretty long, I'm wondering if this is the best way:

$keywordRegex = implode("|",$cat['dining']); 
preg_match_all("/(\b{$keywordRegex}\b)/i",$string,$matches]);

Thanks, Steve

EDIT: Thanks to @jmathai, I was able to add ranking:

    $matches = array(); 
    foreach($keywords as $k => $v) {
        str_replace($v, '#####', $masterString,$count);
        if($count > 0){
            $matches[$k] = $count;
        }
    }
    arsort($matches);
1
  • Don't know php too well, but I suspect a hash would be faster than a regex. If you have one of the values as an actual regex (like cand(y|ies) run it as a regex on the subject, e.g. put the regex values into a separate hash. Commented Feb 5, 2011 at 2:04

5 Answers 5

4

This can be done with a single loop.

I would split candy and candies into separate entries for efficiency. A clever trick would be to replace matches with some token. Let's use 10 #'s.

$cat['dining'] = array('food','restaurant','brunch','meal','candy','candies');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';

$max = array(null, 0); // category, occurences
foreach($cat as $k => $v) {
  $replaced = str_replace($v, '##########', $string);
  preg_match_all('/##########/i', $replaced, $matches);
  if(count($matches[0]) > $max[1]) {
    $max[0] = $k;
    $max[1] = count($matches[0]);
  }
}

echo "Category {$max[0]} has the most ({$max[1]}) matches.\n";
Sign up to request clarification or add additional context in comments.

3 Comments

Pure Genius. I simplified it a little and added match counts in my edited post.
@daxiang28: Still, this way you have to match the string individually against every word in $cat rather than using a quick lookup per word in the string, and in the variant above you only get the first category name with the largest number of matching words. (Your alteration is better in that respect). Also, still the issue of having to enter all variants. Clever though.
I might try to eliminate the str_replace step by using \b word boundry markers in the preg_match_all call. But a more troubling aspect of this problem is runtime of the solution. As the number of elements in each category Array increases, the looping increases by the number of words in the searh term. Turning the arrays in associative lists (think Set here) will trade memory for runtime. Even so, I think there is a scaling issue with is approach. Probably good enough for small projects though.
2
$cat['dining'] = array('food','restaurant','brunch','meal');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';

$string = explode(' ',$string);
foreach ($cat as $key => $val) {
  $kwdMatches[$key] = count(array_intersect($string,$val));
}
arsort($kwdMatches);

echo "<pre>";
print_r($kwdMatches);

1 Comment

this doesn't account for the regular expressions?
1

Providing the number of words is not too great, then creating a reverse lookup table might be an idea, then run the title against it.

// One-time reverse category creation
$reverseCat = array();    
foreach ($cat as $cCategory => $cWordList) {
   foreach ($cWordList as $cWord) {
       if (!array_key_exists($cWord, $reverseCat)) {
           $reverseCat[$cWord] = array($cCategory);
       } else if (!in_array($cCategory, $reverseCat[$cWord])) {
           $reverseCat[$cWord][] = $cCategory;
       }
   }
}

// Processing a title
$stringWords = preg_split("/\b/", $string);

$matchingCategories = array();
foreach ($stringWords as $cWord) {
   if (array_key_exists($cWord, $reverseCat)) {
       $matchingCategories = array_merge($matchingCategories, $reverseCat[$cWord]);
   }
}

$matchingCategories = array_unique($matchingCategories);

5 Comments

Note if a ranking was required, then instead of the array_unique() call at the end, a quick O(n) loop on $matchingCategories to build a count table, followed by an arsort() would give the descending ranking.
Does this handle the 'cand(y|ies)' conditions?
@daxiang28: Sorry I did not notice that, if you just wrote it out as 'candy','candies' then it would. Do you really need the regexp match there? If you can have arbitrary regexps in all target words, then I think you would have to match every word in $string against every single word in subarrays of $cat, quite slow.
The reverse lookup is interesting. But in this case would I have to handle every variation of the word? If I have a keyword 'piece' but the string is '4 pieces of candy', it wouldn't match right?
@daxiang28: Correct, you would have to type all variants. You could get around the need to type regular plurals though. Say all regular plurals you just put a + on the end of the word, rather than writing the word again. Then within foreach ($cWordList as $cWord), strip and detect any + on the end, where detected, add a .'s' variant at the same time to $reverseCat. Then you only need to type irregular plurals, and alternate forms.
0

You are performing O(n*m) lookup on n being the size of your categories and m being the size of a title. You could try organizing them like this:

const $DINING = 0;
const $SERVICES = 1;

$categories = array(
    "food" => $DINING,
    "restaurant" => $DINING,
    "service" => $SERVICES,
);

Then for each word in a title, check $categories[$word] to find the category - this gets you O(m).

Comments

0

Okay here's my new answer that lets you use regex in $cat[n] values...there's only one caveat about this code that I can't figure out...for some reason, it fails if you have any kind of metacharacter or character class at the beginning of your $cat[n] value.

Example: .*food will not work. But s.afood or sea.* etc... or your example of cand(y|ies) will work. I sort of figured this would be good enough for you since I figured the point of the regex was to handle different tenses of words, and the beginnings of words rarely change in that case.

function rMatch ($a,$b) {
  if (preg_match('~^'.$b.'$~i',$a)) return 0;
  if ($a>$b) return 1;
  return -1;
}

$string = explode(' ',$string);
foreach ($cat as $key => $val) {
  $kwdMatches[$key] = count(array_uintersect($string,$val,'rMatch'));
}
arsort($kwdMatches);

echo "<pre>";
print_r($kwdMatches);

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.