I have a somewhat complex filename following the pattern s[num][alpha1][alpha2].ext that I'm trying to tokenize. The lexicons from which alpha1 and alpha2 are drawn are contained in two lists.
I found the question at https://stackoverflow.com/questions/4998629/python-split-string-with-multiple-delimiters useful, but it didn't solve my problem.
Between [num] and [alpha1], the boundary is a digit followed by a letter (a fairly easy regex), but between [alpha1] and [alpha2] the split falls between two words.
Given the filename s13LoremIpsum.ext, for instance, I'd want ("s", "13", "Lorem", "Ipsum").
What would be the best way to accomplish this?
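For the concrete example above, a single regex gets me most of the way, at least under the assumption that [alpha1] and [alpha2] are both capitalized words of letters and the prefix is a literal "s" (a minimal sketch, not necessarily the best way):

import re

def tokenize(filename):
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    # Assumes a literal "s", then digits, then two capitalized words.
    m = re.fullmatch(r"(s)(\d+)([A-Z][a-z]*)([A-Z][a-z]*)", stem)
    return m.groups() if m else None

print(tokenize("s13LoremIpsum.ext"))  # ('s', '13', 'Lorem', 'Ipsum')

This only works because the capitalization happens to mark the word boundary, though; it never consults the lexicons.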
Note that in my particular case [alpha2] is a single letter, but I'm interested in solutions for both this case and the general case, where [alpha1] and [alpha2] are words of arbitrary length. Note also that the general case can be ambiguous: the remaining string may split in more than one way into a word from each lexicon, e.g.
alpha1 = ["a", "ab"]
alpha2 = ["bc", "c"]
# How will we split?
splitString == ("a", "bc")
# --OR--
splitString == ("ab", "c")
Solving this ambiguity is a secondary concern, however.
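One direction I've considered is letting the lexicons drive the split and simply enumerating every pair of words that covers the remaining string, so an ambiguous input just produces more than one candidate. A rough sketch with placeholder lexicons (the real lists would come from my application):

import re

alpha1 = ["a", "ab", "Lorem"]   # placeholder lexicon for [alpha1]
alpha2 = ["bc", "c", "Ipsum"]   # placeholder lexicon for [alpha2]

def split_candidates(s):
    # Every (word1, word2) pair from the two lexicons that covers s exactly.
    return [(w1, w2) for w1 in alpha1 for w2 in alpha2 if w1 + w2 == s]

def tokenize(filename):
    stem = filename.rsplit(".", 1)[0]
    m = re.fullmatch(r"(s)(\d+)([A-Za-z]+)", stem)
    if not m:
        return []
    prefix, num, rest = m.groups()
    return [(prefix, num, w1, w2) for w1, w2 in split_candidates(rest)]

print(tokenize("s13LoremIpsum.ext"))  # [('s', '13', 'Lorem', 'Ipsum')]
print(split_candidates("abc"))        # [('a', 'bc'), ('ab', 'c')] -- the ambiguity above

With this approach the ambiguity just shows up as multiple candidates, which is good enough for now; I'm mainly wondering whether there is a cleaner or more idiomatic way to do the split itself.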