I am trying to split a string on non-alphanumeric characters; simply put, I want to split it into words. The approach that immediately came to mind was to use regular expressions.

Example:
$string = 'php_php-php php';
$splitArr = preg_split('/[^a-z0-9]/i', $string);

But there are two problems that I see with this approach.

  1. It is not a native PHP function; it depends entirely on the PCRE library being available on the server.
  2. An equally important problem: what if a word contains punctuation?
    Example:
    $string = "U.S.A-men's-vote";
    $splitArr = preg_split('/[^a-z0-9]/i', $string);

    This will split the string as [{U}{S}{A}{men}{s}{vote}]
    But I want it as [{U.S.A}{men's}{vote}]

So my questions are:

  • How can I split the string into words?
  • Is it possible to do this with a native PHP function, or in some other way that does not depend on an extension?

Regards

  • What is your definition of a word? Is it allowed to contain periods? What about something like "this sentence.and this one too."? And what about "I am sure this regex is a no-go but I'll use it anyway"? Commented Oct 24, 2012 at 10:47
  • It depends on what you define as "word". For U.S.A to be a word, you'd need a non-space-padded stop mark to not be a word separator. So you could split on whitespaces, question marks, commas, colons, and so on, OR spaced stop marks. Commented Oct 24, 2012 at 10:47
  • It is possible. Iterate over the string (char by char) and apply your own rules whether the char belongs to a word or not. Commented Oct 24, 2012 at 10:48
  • preg_split is not native? Please show me a PHP installation since the late 1920s that does not support preg_split. Commented Oct 24, 2012 at 10:52
  • So you're also not using mysql/mysqli/PDO, because these are extensions? What about mb_*? You just have to be realistic at some point... Commented Oct 24, 2012 at 11:04

4 Answers

This sounds like a case for str_word_count(), using the oft-forgotten value 1 or 2 for the second argument and a third argument that adds hyphens, full stops and apostrophes (or whatever other characters you wish to treat as word parts) to the set of word characters. Follow it with an array_walk() that trims those characters from the beginning and end of each resulting array value, so that they are kept only when they are actually embedded in the "word".
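A minimal sketch of that idea follows. The function name split_words() and the sample sentence are made up for illustration, and only the full stop is passed as an extra word character, on the assumption (from the manual's definition of a "word") that str_word_count() already accepts apostrophes and hyphens inside words. Tokens that consist of punctuation only are dropped after trimming.

```php
<?php
// Hypothetical helper; the name is not from the original answer.
function split_words(string $text): array
{
    // Format 1 returns the words as a list. The third argument names extra
    // characters that count as part of a word (here: the full stop).
    $words = str_word_count($text, 1, ".");

    // Trim punctuation that sits at the edge of a word rather than inside it.
    array_walk($words, function (&$word) {
        $word = trim($word, ".'-");
    });

    // Drop tokens that were punctuation only, then reindex the array.
    return array_values(array_filter($words, 'strlen'));
}

print_r(split_words("The U.S.A. isn't far."));
// Words found: The, U.S.A, isn't, far
```

Because hyphens count as word characters here, a string such as the one in the question would stay in one piece; splitting on hyphens as well would need an extra step, which fits the "not 100% accurate" caveat.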

1 Comment

Thanks Mark. I think considering my situation this will give me the closest to best results. Not 100% accurate but almost there.

Either you have PHP installed (then you also have PCRE), or you don't. So your first point is a non-issue.

Then, if you want to exclude punctuation from your splitting delimiters, you need to add them to your character class:

preg_split('/[^a-z0-9.\']+/i', $string);

If you want to treat punctuation characters differently depending on context (say, make a dot only be a delimiter if followed by whitespace), you can do that, too:

preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);
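As a quick check, here are both patterns applied to sample input. The first string is the one from the question; the second sentence is made up for illustration.

```php
<?php
// The string from the question, plus an illustrative sentence.
$question = "U.S.A-men's-vote";
$sentence = "Made in the U.S.A. Most men's votes count";

// Dots and apostrophes are no longer delimiters.
$simple = preg_split('/[^a-z0-9.\']+/i', $question);

// A dot followed by whitespace is a delimiter; an embedded dot is not.
$contextual = preg_split('/\.\s+|[^a-z0-9.\']+/i', $sentence);

print_r($simple);     // U.S.A, men's, vote
print_r($contextual); // Made, in, the, U.S.A, Most, men's, votes, count
```

Note that a dot at the very end of the input is not followed by whitespace, so it would stay attached to the last word unless it is trimmed separately.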

1 Comment

Well, it is possible to have a PHP installation without preg_* functions enabled. In practice it just doesn't really happen.

As per my comment, you might want to try (add as many separators as needed)

$splitArr = preg_split('/[\s,!?;:-]+|[.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

You'd then have to handle the case of a "quoted" word (it's not so easy to do in a regular expression, because 'is" "this' quoted? And how?).

So I think it's best to keep ' and " within words (so that "it's" is a single word, and "they 'll" is two words) and then deal with those cases separately. For example a regexp would have some trouble in correctly handling

they 're 'just friends'. Or that's what they say.

whereas a token such as "'re", or a sequence of words in which the first is left-quoted and the last is right-quoted (the first not being a known contraction suffix such as 's, 're, 'll, 'd ...), can be handled at the application level.
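To see what the application level would receive, here is the split above applied to that sentence. This is only a demonstration of the tokens produced; the quote handling itself is not attempted.

```php
<?php
// Pattern from this answer: separators, or a full stop followed by whitespace.
$pattern = '/[\s,!?;:-]+|[.]\s+/';
$text = "they 're 'just friends'. Or that's what they say.";

// PREG_SPLIT_NO_EMPTY drops empty pieces between adjacent delimiters.
$tokens = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($tokens);
// Tokens: they, 're, 'just, friends', Or, that's, what, they, say.
```

The quotes stay attached to 're, 'just and friends', and the final "say." keeps its dot because no whitespace follows it, so both cases are left for later processing.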

This is not a PHP problem, but a logical one.

Words can be joined by a hyphen. Abbreviations can look like short sentences.

You can match your example directly by writing a solution that fits only this particular phrase, but you can't get a solution that works for all possible phrases. That would require content recognition based on neural computing.
