I have a somewhat complex filename following the pattern s[num][alpha1][alpha2].ext that I'm trying to tokenize. The lexicons from which alpha1 and alpha2 are drawn are contained in two lists.

I found the question at https://stackoverflow.com/questions/4998629/python-split-string-with-multiple-delimiters useful, but it didn't solve my problem.

The boundary between [num] and [alpha1] is a digit followed by a letter (a fairly easy regex), but the boundary between [alpha1] and [alpha2] falls between two words.

Given the filename s13LoremIpsum.ext, for instance, I'd want ("s", "13", "Lorem", "Ipsum").

What would be the best way to accomplish this?

Note that in this particular case, [alpha2] is a single letter, but I'm interested in solutions for both this case and the general case where [alpha1] and [alpha2] are words of arbitrary length. Note also that the general case could introduce ambiguity if there is more than one possible splitting by combining words from the respective lexicons, e.g.

alpha1 = ["a", "ab"]
alpha2 = ["bc", "c"]
# How will we split "abc"?
splitString == ("a", "bc")
# --OR--
splitString == ("ab", "c")

Solving this ambiguity is a secondary concern, however.
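
To make the ambiguity concrete, here is a minimal sketch that lists every valid split of a string against the two lexicons. It assumes the string being split is "abc" (implied by the two candidate results above), and the helper name candidate_splits is hypothetical, introduced only for illustration.

```python
def candidate_splits(text, alpha1, alpha2):
    """Return every (a1, a2) pair with a1 in alpha1, a2 in alpha2 and a1 + a2 == text."""
    second = set(alpha2)  # set membership keeps the lookup cheap
    return [
        (word, text[len(word):])
        for word in alpha1
        if text.startswith(word) and text[len(word):] in second
    ]


alpha1 = ["a", "ab"]
alpha2 = ["bc", "c"]
# "abc" is an assumed input; both splits are valid, so the result has two entries.
print(candidate_splits("abc", alpha1, alpha2))
# [('a', 'bc'), ('ab', 'c')]
```

A result with more than one entry means the split is ambiguous, and some tie-breaking rule (for example, preferring the longest alpha1) would have to be chosen separately.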

  • Do alpha1 and alpha2 always start with a capital letter? And do they ever have capital letters within? Is there ever an alpha3? Commented Jan 14, 2014 at 17:52
  • Are alpha1 and alpha2 to match pre-defined values? Your other question implies that they would. Commented Jan 14, 2014 at 17:54
  • In the specific case, both alpha1 and alpha2 are all-capital. In the general case, any words in either could be any mix of capital and lowercase. Commented Jan 14, 2014 at 17:56
  • And yes, alpha1 and alpha2 are drawn from two lists containing predefined possible values for each. Commented Jan 14, 2014 at 17:56
  • In that case you'll need to do what thefourtheye is suggesting. Commented Jan 14, 2014 at 17:57

1 Answer

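import re

alpha1, alpha2 = ["a", "ab", "Lorem"], ["bc", "c", "Ipsum"]
# Build one alternation per lexicon; re.escape guards against regex metacharacters in the words.
first = "|".join(map(re.escape, alpha1))
second = "|".join(map(re.escape, alpha2))
pattern = re.compile(r"(s)(\d+)(" + first + ")(" + second + ")")
data = "s13LoremIpsum.ext"
# Groups 1-4 hold the "s" prefix, the number, alpha1 and alpha2.
result = list(pattern.match(data).groups())
print(result)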

Output

['s', '13', 'Lorem', 'Ipsum']

The pattern string that was actually compiled can be checked like this:

print(pattern.pattern)

which prints

(s)(\d+)(a|ab|Lorem)(bc|c|Ipsum)

4 Comments

Awesome, that looks really good! I ran into an issue here, however: Let alpha1, alpha2 = ["AB", "ABC"], ["C", "D"]. Now let data = "s13ABCC.ext". We output ['s', '13', 'AB', 'C']. Note that we get the right answer if alpha1 has the order of its items switched. How could we fix this behavior?
@Walker If we have to manually split them, how would we do it? Shouldn't CC be part of alpha2?
I guess we're skirting the ambiguity I mentioned in my post, but in this example, the only word at the end that is contained in the alpha2 lexicon would be C, so alpha1 should then evaluate to ABC, rather than just AB.
@Walker Then include ABC in alpha1 before AB. Problem solved :)
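
The ordering problem raised in these comments can also be handled without reordering the lists by hand. The sketch below is one possible approach, not the answerer's code: it anchors the pattern on the dot that starts the extension, so the regex engine has to backtrack until alpha1 and alpha2 together consume the whole stem, and it sorts each lexicon longest-first as a tie-break for genuinely ambiguous names. The helper names build_pattern and tokenize are hypothetical, and the sketch assumes every filename has the form s[num][alpha1][alpha2].ext with a literal dot before the extension.

```python
import re


def build_pattern(alpha1, alpha2):
    # Longest words first, so "ABC" is tried before its prefix "AB".
    first = "|".join(re.escape(w) for w in sorted(alpha1, key=len, reverse=True))
    second = "|".join(re.escape(w) for w in sorted(alpha2, key=len, reverse=True))
    # The trailing \. forces alpha2 to end exactly where the extension begins.
    return re.compile(r"(s)(\d+)(" + first + ")(" + second + r")\.")


def tokenize(filename, alpha1, alpha2):
    # Returns a 4-tuple of tokens, or None when the filename does not fit the pattern.
    match = build_pattern(alpha1, alpha2).match(filename)
    return match.groups() if match else None


print(tokenize("s13ABCC.ext", ["AB", "ABC"], ["C", "D"]))
# ('s', '13', 'ABC', 'C')
```

Returning None for a non-matching name avoids the AttributeError that calling .group() on a failed match would raise.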
