looping a regex replace over a string in python -- missing something obvious

Question

I'm trying to iteratively remove the 2nd, 3rd... nth authors from citations in a document, where those citations are (after some cleanup steps) in the form:

Straat, Ark, Sijtsma 2013, 75-99.

Nardulli, Peyton, Bajjalieh 2013, 139-192.

My strategy is, taking citations of the form AUTHOR1... AUTHORn-1 AUTHORn YEAR:

1) match AUTHORn-1 AUTHORn YEAR,

2) using a group replace, replace the matched substring with AUTHORn-1 YEAR, so that the overall citation becomes AUTHOR1... AUTHORn-1 YEAR.

3) Then loop around and do it again until all that remains is AUTHOR1 YEAR. I've got ten iterations in here because I know there are no multi-author citations with more than ten people.

My code is as follows:

def multiAuthor(citestring):
    longcite = r'([\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*),[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]?( \(?\d\d\d\d[a-z]?[\s.,)])'
    for x in range(0, 10):
        newstring = re.sub(longcite, '\g<1>\g<2>', citestring)
    return(newstring)

This is called on a string of footnotes separated by newlines, and it works on the first iteration. For my two sample matches given above, it correctly returns:

Straat, Ark 2013, 75-99.

Nardulli, Peyton 2013, 139-192.

But that's it. It does not successfully carry out replacement on any loop beyond the first, and accordingly fails to strip away the second author.

I've been debugging with regex101, but am officially stumped. The first iteration of the expression: https://www.regex101.com/r/jM2fF4/3 --- then after running the replacement, the regex on the second loop also matches, and ought to replace again: https://regex101.com/r/fZ1pX7/4

So I think my regex is right. Am I just missing something dumb and obvious? (I'm pretty new to python-land, but I've double and triple-checked my loop syntax, and I think it's right.)

Using python 3.

If you want to see it in action for yourself, I've also put a minimal runnable example (with spaces instead of newlines, but otherwise no difference) here: https://github.com/paultopia/stray-cites/blob/master/minimal-test.py

Save me, StackObi Wan, you're my only hope...?

Edit: I indeed was missing something obvious, see my self-answer below; leaving this up because it's probably a common oopsie.

First of all, why do you want to use regex for this task, and why such a complicated regex? Also, how does it not work for the other lines? What does your result look like?
– Kasravnd, Commented Jun 6, 2015 at 21:27
Unfortunately, the text in question has lots of different kinds of citations it needs to match---cites with hyphens, cites with commas in the middle and without, cites with the year in parens and without---this complicated regex is all I can come up with that catches everything. Do you have an idea for a different technique?
– Paul Gowder, Commented Jun 6, 2015 at 21:34
:-) The nutty thing is that there are still a few edge cases that I know of that my code can't catch, even after a bunch of preprocessing (getting rid of von, van, etc.). Fortunately it's good enough for a personal tool (finding unmatched citations/references in my 300-page academic book). But I would have to take a totally different approach for general use.
– Paul Gowder, Commented Jun 6, 2015 at 22:06

bjfletcher · Accepted Answer · 2015-06-06 21:32:56Z

1

Is this something you wanted?

([^,]*).*?([0-9].*?)\.\s*

See a fork on regex101.

([^,]*) matches everything up to the first , (comma)
.*? lazily skips whatever follows, until...
([0-9].*?)\. matches from the first digit up to the next . (dot)
\s* matches any whitespace after this

Then, in the substitution:

`\1 \2`

which are the first and second capture groups from above: the name and the year/page numbers, respectively.
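As a rough sketch of how this pattern could be applied from Python: the helper name first_author_only and the line-by-line handling are assumptions made for illustration, not part of the answer itself.

```python
import re

# Pattern and replacement copied from the answer above.
PATTERN = re.compile(r'([^,]*).*?([0-9].*?)\.\s*')

def first_author_only(citation):
    """Hypothetical helper: keep the first author plus the year/page part."""
    return PATTERN.sub(r'\1 \2', citation)

footnotes = (
    "Straat, Ark, Sijtsma 2013, 75-99.\n"
    "Nardulli, Peyton, Bajjalieh 2013, 139-192."
)

# Apply line by line: [^,]* and \s* can both match newlines, so running the
# pattern over the whole block would merge neighbouring citations.
for line in footnotes.splitlines():
    print(first_author_only(line))
# Straat 2013, 75-99
# Nardulli 2013, 139-192
```

Note that the trailing dot is consumed by the match and is not part of either group, so it disappears from the output; add it back in the replacement string if you need it.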

edited Jun 6, 2015 at 21:32

answered Jun 6, 2015 at 21:27


2 Comments

Paul Gowder Over a year ago

Figured out the immediate problem in answer below, but will also try your more concise regex---always interested in getting better at composing non-insane expressions. Thanks!

bjfletcher Over a year ago

That does happen! - go for a walk, then the problem is solved! Good times. :) You're very welcome - that was a very nicely written question BTW.

Paul Gowder · Accepted Answer · 2015-06-06 21:32:35Z

0

And I'm an idiot. Every time I post on stackoverflow, I turn off my computer and walk away and five minutes later, the answer comes to me.

The loop doesn't work because re.sub returns a new string and leaves citestring untouched, so every iteration searches the original string, not the result of the previous pass. Correct code:

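import re

def multiAuthor(citestring):
    # Matches "AUTHORn-1, AUTHORn YEAR"; group 1 is AUTHORn-1, group 2 is the year.
    longcite = r'([\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*),[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]?( \(?\d\d\d\d[a-z]?[\s.,)])'
    for x in range(0, 10):
        # Reassign citestring so each pass works on the previous pass's output.
        citestring = re.sub(longcite, r'\g<1>\g<2>', citestring)
    return citestring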
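If you'd rather not hard-code ten passes, one possible variant (a sketch only; the name strip_coauthors is made up for this example) uses re.subn, which returns the new string together with the number of substitutions, and stops as soon as a pass changes nothing. Every substitution shortens the string, so the loop has to terminate.

```python
import re

# Same pattern as in the answer above, compiled once.
LONGCITE = re.compile(
    r'([\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*),[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]?'
    r'( \(?\d\d\d\d[a-z]?[\s.,)])'
)

def strip_coauthors(citestring):
    """Hypothetical variant: repeat the substitution until nothing matches."""
    while True:
        # subn returns (new_string, number_of_substitutions_made).
        citestring, count = LONGCITE.subn(r'\g<1>\g<2>', citestring)
        if count == 0:
            return citestring

# The pattern needs whitespace or "(" before each author name, so every
# citation here starts on its own line after a newline.
footnotes = (
    "\nStraat, Ark, Sijtsma 2013, 75-99."
    "\nNardulli, Peyton, Bajjalieh 2013, 139-192.\n"
)
print(strip_coauthors(footnotes))
```

With the two sample citations this takes two substituting passes plus one final pass that finds nothing, and leaves "Straat 2013, 75-99." and "Nardulli 2013, 139-192." on their own lines.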

answered Jun 6, 2015 at 21:32

