I have a corpus of text documents, some of which contain a sequence of substrings. The first and last substrings are consistent and mark the beginning and end of the part I want to replace. I would also like to delete or replace every substring that falls between those two positions.

origSent = 'This is the sentence I am intending to edit'

Using the above as an example, how would I use 'the' as the start substring and 'intending' as the end substring, deleting both along with the words between them, to produce the following:

newSent = 'This is to edit'
  • You'll need to be a lot clearer on the rules for defining these substrings. If 'the' and 'intending' are always the defining words, then this is trivial via str.split(), of course. Commented Oct 30, 2019 at 16:00

2 Answers

You could use regex replacement here:

import re

origSent = 'This is the sentence I am intending to edit'
newSent = re.sub(r'\bthe((?!\bthe\b).)*\bintending\b', '', origSent)
print(newSent)

This prints the following (note the double space left behind where the match was removed):

This is  to edit

The "secret sauce" in the regex pattern is the tempered dot:

((?!\bthe\b).)*

This consumes content one character at a time, but only as long as it does not cross another occurrence of the word 'the'. As a result, the match starts at the last 'the' before 'intending' rather than at some earlier 'the', which we don't want.
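
As a rough sketch of how the tempered dot could be wrapped for arbitrary start and end words: the helper name remove_span and the trailing \s* (added so the leftover double space is swallowed) are assumptions for illustration, not part of the answer above. It also assumes both markers begin and end with word characters, since \b only behaves as expected in that case.

```python
import re

def remove_span(text, start, end):
    """Delete from the last `start` word before `end` through `end` itself."""
    s, e = re.escape(start), re.escape(end)
    # Tempered dot: advance one character at a time, but never step onto
    # another whole-word occurrence of `start`. The trailing \s* removes
    # the whitespace that would otherwise be left after `end`.
    pattern = rf'\b{s}\b(?:(?!\b{s}\b).)*?\b{e}\b\s*'
    return re.sub(pattern, '', text)

print(remove_span('This is the sentence I am intending to edit', 'the', 'intending'))
# This is to edit
```

Because the dot does not match newlines by default, a span that crosses a line break would need flags=re.DOTALL passed to re.sub.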



I would do this:

s_list = origSent.split()
newSent = ' '.join(s_list[:s_list.index('the')] + s_list[s_list.index('intending')+1:])
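
The one-liner above raises ValueError when either word is missing, which is likely for a corpus where only some documents contain the markers. A minimal sketch of a guarded version follows; the function name remove_words_between is hypothetical, and it assumes the markers appear as standalone whitespace-separated tokens (a token such as 'intending,' with attached punctuation would not match).

```python
def remove_words_between(sentence, start, end):
    """Drop `start`, `end`, and every word between them, if both are present."""
    words = sentence.split()
    try:
        i = words.index(start)
        # Look for `end` only after `start`, so a stray earlier `end` is ignored.
        j = words.index(end, i + 1)
    except ValueError:
        # One of the markers is missing: return the text unchanged.
        return sentence
    return ' '.join(words[:i] + words[j + 1:])

print(remove_words_between('This is the sentence I am intending to edit', 'the', 'intending'))
# This is to edit
```

Note that this starts from the first occurrence of the start word, whereas the regex approach starts from the last occurrence before the end word, so the two can differ on sentences that contain the start word more than once.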

Hope this helps.

2 Comments

I think you missed an = sign in that second line. Should it say "newSent = ' '.join . . ."?
Corrected the answer
