Split a Python String Using Multiple Delimiters

Question

I have a somewhat complex filename following the pattern s[num][alpha1][alpha2].ext that I'm trying to tokenize. The lexicons from which alpha1 and alpha2 are drawn are contained in two lists.

I found the question at https://stackoverflow.com/questions/4998629/python-split-string-with-multiple-delimiters useful, but it didn't solve my problem.

Between [num] and [alpha1], a number precedes a letter (a fairly easy regex), but between [alpha1] and [alpha2], I'm splitting between two words.

Given the filename s13LoremIpsum.ext, for instance, I'd want ("s", "13", "Lorem", "Ipsum").

What would be the best way to accomplish this?

Note that in this particular case, [alpha2] is a single letter, but I'm interested in solutions for both this case and the general case where [alpha1] and [alpha2] are words of arbitrary length. Note also that the general case could introduce ambiguity if there is more than one possible splitting by combining words from the respective lexicons, e.g.

alpha1 = ["a", "ab"]
alpha2 = ["bc", "c"]
# How will we split?
splitString == ("a", "bc")
# --OR--
splitString == ("ab", "c")

Solving this ambiguity is a secondary concern, however.
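
To make the ambiguity concrete, here is a minimal sketch that lists every way a string can be split into one word from each lexicon. The helper name candidate_splits is hypothetical (it is not part of the original question) and it only handles the [alpha1][alpha2] portion of the filename.

```python
def candidate_splits(text, alpha1, alpha2):
    """Return every (word1, word2) pair whose concatenation equals text."""
    second = set(alpha2)  # set membership test for the remainder
    return [
        (word, text[len(word):])
        for word in alpha1
        if text.startswith(word) and text[len(word):] in second
    ]


alpha1 = ["a", "ab"]
alpha2 = ["bc", "c"]
print(candidate_splits("abc", alpha1, alpha2))  # [('a', 'bc'), ('ab', 'c')]
```

More than one pair in the result means the split is ambiguous and some tie-breaking rule is needed; an empty list means no valid split exists.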

Do alpha1 and alpha2 always start with a capital letter? And do they ever have capital letters within? Is there ever an alpha3?
– brandonscript, Commented Jan 14, 2014 at 17:52
Are alpha1 and alpha2 to match pre-defined values? Your other question implies that they would.
– Martijn Pieters, Commented Jan 14, 2014 at 17:54
In the specific case, both alpha1 and alpha2 are all-capital. In the general case, any words in either could be any mix of capital and lowercase.
– Walker, Commented Jan 14, 2014 at 17:56
And yes, alpha1 and alpha2 are drawn from two lists containing predefined possible values for each.
– Walker, Commented Jan 14, 2014 at 17:56
In that case you'll need to do what thefourtheye is suggesting.
– brandonscript, Commented Jan 14, 2014 at 17:57

thefourtheye · Accepted Answer · 2014-01-14 17:53:15Z

3

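import re

alpha1, alpha2 = ["a", "ab", "Lorem"], ["bc", "c", "Ipsum"]
# One capture group per token: "s", the digits, an alpha1 word, an alpha2 word.
pattern = re.compile(r"(s)(\d+)(" + "|".join(alpha1) + ")(" + "|".join(alpha2) + ")")
data = "s13LoremIpsum.ext"
# match() anchors at the start of the string; groups 1 to 4 hold the tokens.
result = [pattern.match(data).group(i) for i in range(1, 5)]
print(result)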

Output

['s', '13', 'Lorem', 'Ipsum']

The actual compiled pattern can be checked like this:

print(pattern.pattern)

which prints

(s)(\d+)(a|ab|Lorem)(bc|c|Ipsum)
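
The pattern above is built by joining the lexicon words directly, which works here because they contain only letters. As a hedged sketch of a slightly more defensive variant: the function names build_pattern and tokenize are hypothetical, re.escape is applied in case a word contains a regex metacharacter, and a non-matching filename yields None instead of raising AttributeError.

```python
import re


def build_pattern(alpha1, alpha2):
    # re.escape keeps characters such as "." or "+" inside a word literal.
    first = "|".join(re.escape(word) for word in alpha1)
    second = "|".join(re.escape(word) for word in alpha2)
    return re.compile(r"(s)(\d+)(" + first + ")(" + second + ")")


def tokenize(filename, alpha1, alpha2):
    match = build_pattern(alpha1, alpha2).match(filename)
    # match() returns None when the name does not fit the pattern.
    return match.groups() if match else None


print(tokenize("s13LoremIpsum.ext", ["a", "ab", "Lorem"], ["bc", "c", "Ipsum"]))
# ('s', '13', 'Lorem', 'Ipsum')
print(tokenize("readme.txt", ["Lorem"], ["Ipsum"]))
# None
```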


4 Comments

Walker Over a year ago

Awesome, that looks really good! I ran into an issue here, however: let alpha1, alpha2 = ["AB", "ABC"], ["C", "D"], and let data = "s13ABCC.ext". The output is ['s', '13', 'AB', 'C']. Note that we get the right answer if the items in alpha1 are listed in the opposite order. How could we fix this behavior?

thefourtheye Over a year ago

@Walker If we have to manually split them, how would we do it? Shouldn't CC be part of alpha2?

Walker Over a year ago

I guess we're skirting the ambiguity I mentioned in my post, but in this example, the only word at the end that is contained in the alpha2 lexicon would be C, so alpha1 should then evaluate to ABC, rather than just AB.

thefourtheye Over a year ago

@Walker Then include ABC in alpha1 before AB. Problem solved :)
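
The behavior discussed in these comments comes from regex alternation trying alternatives left to right and stopping at the first one that lets the overall match succeed. A minimal sketch of two possible remedies follows; the function name tokenize_anchored is hypothetical, and the code assumes that alpha2 is immediately followed by the dot of the extension, as in the question's s[num][alpha1][alpha2].ext pattern. Sorting longest-first automates the manual reordering suggested above, and the trailing dot forces the engine to backtrack when a short alpha1 match leaves characters unconsumed.

```python
import re


def tokenize_anchored(filename, alpha1, alpha2):
    # Longest-first ordering makes the regex prefer longer words
    # when more than one split would fit.
    first = "|".join(re.escape(w) for w in sorted(alpha1, key=len, reverse=True))
    second = "|".join(re.escape(w) for w in sorted(alpha2, key=len, reverse=True))
    # The trailing \. requires alpha2 to end right before the extension,
    # so a premature short match for alpha1 is rejected.
    pattern = re.compile(r"(s)(\d+)(" + first + ")(" + second + r")\.")
    match = pattern.match(filename)
    return match.groups() if match else None


print(tokenize_anchored("s13ABCC.ext", ["AB", "ABC"], ["C", "D"]))
# ('s', '13', 'ABC', 'C')
```

With the anchor in place, a name such as "s1ABCX.ext" with alpha1 = ["AB"] and alpha2 = ["C"] returns None rather than silently dropping the trailing "X". Genuinely ambiguous inputs still resolve to whichever split the longest-first ordering reaches first.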
