
I'm trying to iteratively remove the 2nd, 3rd... nth authors from citations in a document, where those citations (after some cleanup steps) take this form:

Straat, Ark, Sijtsma 2013, 75-99.

Nardulli, Peyton, Bajjalieh 2013, 139-192.

My strategy is, taking citations of the form AUTHOR1... AUTHORn-1 AUTHORn YEAR:

1) match AUTHORn-1 AUTHORn YEAR,

2) using a group replace, replace the matched substring with AUTHORn-1 YEAR, so that the overall citation becomes AUTHOR1... AUTHORn-1 YEAR.

3) Then loop around and do it again until all that remains is AUTHOR1 YEAR. I've got ten iterations in here because I know there are no multi-author citations with more than ten people.

My code is as follows:

def multiAuthor(citestring):
    longcite = r'([\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*),[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]?( \(?\d\d\d\d[a-z]?[\s.,)])'
    for x in range(0, 10):
        newstring = re.sub(longcite, '\g<1>\g<2>', citestring)
    return(newstring)

This is called on a string of footnotes separated by newlines, and it works on the first iteration. For my two sample matches given above, it correctly returns:

Straat, Ark 2013, 75-99.

Nardulli, Peyton 2013, 139-192.

But that's it. It does not successfully carry out replacement on any loop beyond the first, and accordingly fails to strip away the second author.

I've been debugging with regex101, but am officially stumped. The first iteration of the expression: https://www.regex101.com/r/jM2fF4/3 --- then after running the replacement, the regex on the second loop also matches, and ought to replace again: https://regex101.com/r/fZ1pX7/4

So I think my regex is right. Am I just missing something dumb and obvious? (I'm pretty new to python-land, but I've double and triple-checked my loop syntax, and I think it's right.)

Using Python 3.

If you want to see it in action for yourself, I've also put a minimal runnable example (with spaces instead of newlines, which makes no difference) here: https://github.com/paultopia/stray-cites/blob/master/minimal-test.py

Save me, StackObi Wan, you're my only hope...?

Edit: I indeed was missing something obvious, see my self-answer below; leaving this up because it's probably a common oopsie.

  • First of all, why do you want to use regex for this task, and why such a complicated regex? Also, how does it fail on the other lines? What does your result look like? Commented Jun 6, 2015 at 21:27
  • Unfortunately, the text in question has lots of different kinds of citations it needs to match---cites with hyphens, cites with commas in the middle and without, cites with the year in parens and without---this complicated regex is all I can come up with that catches everything. Do you have an idea for a different technique? Commented Jun 6, 2015 at 21:34
  • In this case NO ;) I think regex is a good choice! Commented Jun 6, 2015 at 21:37
  • :-) the nutty thing is that there are still a few edge cases that I know of that my code can't catch, even after a bunch of preprocessing (getting rid of von, van, etc. etc.). Fortunately it's good enough for a personal tool (finding unmatched citations/references in my 300 page academic book). But would have to take a totally different approach for general use. Commented Jun 6, 2015 at 22:06

2 Answers

1

Is this something you wanted?

([^,]*).*?([0-9].*?)\.\s*

See a fork on regex101.

  • ([^,]*) captures everything up to the first , (comma)
  • .*? lazily skips characters until...
  • ([0-9].*?)\. captures from a digit up to the next . (dot)
  • \s* matches any whitespace after that

Then, in the substitution:

`\1 \2`

These are the first and second capture groups from above: the name and the year/page numbers, respectively.
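For reference, the substitution above could be applied in Python roughly as follows. This is only a sketch: the helper name first_author_only is made up for illustration, and it assumes each citation is handled one line at a time, because the trailing \s* would otherwise swallow the newline between footnotes. Note that the pattern also consumes the final period, and it has not been checked against the other citation styles mentioned in the comments.

```python
import re

# Pattern and replacement taken from the answer above.
SHORT_CITE = re.compile(r'([^,]*).*?([0-9].*?)\.\s*')

def first_author_only(line):
    # Hypothetical helper: keep group 1 (first author) and group 2 (year and pages).
    return SHORT_CITE.sub(r'\1 \2', line)

samples = [
    "Straat, Ark, Sijtsma 2013, 75-99.",
    "Nardulli, Peyton, Bajjalieh 2013, 139-192.",
]
for line in samples:
    print(first_author_only(line))
```

With the two sample citations this prints "Straat 2013, 75-99" and "Nardulli 2013, 139-192".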


2 Comments

Figured out the immediate problem in answer below, but will also try your more concise regex---always interested in getting better at composing non-insane expressions. Thanks!
That does happen! - go for a walk, then the problem is solved! Good times. :) You're very welcome - that was a very nicely written question BTW.
0

And I'm an idiot. Every time I post on stackoverflow, I turn off my computer and walk away and five minutes later, the answer comes to me.

The loop doesn't work because every iteration runs re.sub on the original citestring and stores the result in newstring, so each pass starts over from the unmodified input instead of from the output of the previous pass. Correct code:

import re

def multiAuthor(citestring):
    longcite = r'([\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*),[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]?( \(?\d\d\d\d[a-z]?[\s.,)])'
    for x in range(0, 10):
        # Reassign citestring so each pass works on the previous pass's output.
        citestring = re.sub(longcite, r'\g<1>\g<2>', citestring)
    return citestring
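To check the fix against the two sample citations, and as a possible alternative to hard-coding ten passes, the loop could run until a pass makes no replacements. This is a sketch rather than the code I actually used: strip_coauthors is a made-up name, re.subn is swapped in for re.sub because it also reports how many substitutions were made, and each citation is assumed to be preceded by whitespace (here a newline), since the pattern's first group requires a leading whitespace character or opening parenthesis.

```python
import re

# Same pattern as above, split across two raw strings for readability.
LONGCITE = re.compile(
    r'([\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*),[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]?'
    r'( \(?\d\d\d\d[a-z]?[\s.,)])'
)

def strip_coauthors(citestring):
    # Hypothetical variant: repeat until a pass replaces nothing.
    # Every replacement shortens the string, so the loop terminates.
    while True:
        citestring, count = LONGCITE.subn(r'\g<1>\g<2>', citestring)
        if count == 0:
            return citestring

footnotes = "\nStraat, Ark, Sijtsma 2013, 75-99.\nNardulli, Peyton, Bajjalieh 2013, 139-192.\n"
print(strip_coauthors(footnotes))
```

On this input the first pass drops the third author from each citation, the second pass drops the second author, and the third pass finds nothing left to replace, leaving "Straat 2013, 75-99." and "Nardulli 2013, 139-192." on their own lines.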
