Substring extraction from variable length strings with regex python

Question

I have a dataset of texts, from which I am extracting all "sentences" containing a pattern r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'.

I now want to shorten all long "sentences" (> 200 words) to more readable ones by keeping, e.g., only 30 words before and after my pattern and replacing the trimmed parts with "...".

Is there a clean way to do so?

EDIT: The search is conducted on preprocessed text (lowercased, with stop words, punctuation and other hand-selected words removed); the matched sentences are then stored in their original form. I want to do the trimming on the original sentence (with punctuation and stop words).

EXAMPLE:

import re

list_of_sentences = []
t1 = "This is a complete sentence, containing colors and other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black, sofa, brown. It will be preprocessed"
# preprocess() is defined elsewhere
t2 = preprocess(t1)  # ---> "complete sentence containing colors words pink blue yellow tree chair orange green hello world black sofa brown preprocessed"
my_words_markers = "yellow orange".split()
pattern = r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'
match = re.search(pattern, t2, re.I)
if match:
    list_of_sentences.append(t1)  # store the original, unprocessed sentence

In this list_of_sentences, I want to trim the longest ones:

# what I want is a trimmed version of t1, with, e.g., 4 words before and after pattern: 
"... other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black ..."

Please post some examples of processed and unprocessed sentences. I don't want to update my answer just to be hit with "thanks but it doesn't work with my data" a 2nd time.
– Aran-Fey, commented Feb 1, 2018 at 14:01

Aran-Fey · Accepted Answer · 2018-02-01 11:46:18Z


You can extend your regex so that it also matches up to 30 words before and after the pattern:

pattern = r'(?:\w+\W+){,30}\b' + \
          r' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + \
          r'\b(?:\W+\w+){,30}'

Then loop over all sentences, and if the regex matches, use match.start() and match.end() to check whether you have to insert an ellipsis (...) on either side:

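import re

for sentence in sentences:  # sentences: the strings the pattern is run against
    match = re.search(pattern, sentence)
    if match:
        # Add an ellipsis on each side where text was cut off
        text = '{}{}{}'.format('...' if match.start() > 0 else '',
                               match.group(),
                               '...' if match.end() < len(sentence) else '')
        print(text)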
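
As a self-contained illustration, the two snippets can be combined into one function. This is only a sketch: the name trim_around_match and the context parameter are made up for the example, re.escape and re.I are additions not present in the snippets, and the input is assumed to be preprocessed text (words separated by single spaces), because the marker separator matches literal spaces.

```python
import re

def trim_around_match(sentence, markers, context=30):
    """Return the match plus up to `context` words on each side, or None.

    Hypothetical helper; assumes `sentence` is preprocessed text in which
    words are separated by single spaces.
    """
    # Markers in order, with at most two extra words between neighbours.
    core = r' (?:\w+ )?(?:\w+ )?'.join(map(re.escape, markers))
    pattern = ((r'(?:\w+\W+){0,%d}\b' % context)
               + core
               + (r'\b(?:\W+\w+){0,%d}' % context))
    match = re.search(pattern, sentence, re.I)
    if match is None:
        return None
    # Add an ellipsis only on the sides where text was actually cut off.
    prefix = '...' if match.start() > 0 else ''
    suffix = '...' if match.end() < len(sentence) else ''
    return prefix + match.group() + suffix

t2 = ("complete sentence containing colors words pink blue yellow tree chair "
      "orange green hello world black sofa brown preprocessed")
print(trim_around_match(t2, ["yellow", "orange"], context=4))
# ...colors words pink blue yellow tree chair orange green hello world black...
```

A sentence that is already short enough comes back unchanged, because the match then spans the whole string and neither ellipsis is added.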


2 Comments

Fed Over a year ago

Thanks, but it doesn't work properly. I've edited my question.

Aran-Fey Over a year ago

@Fed Please try to include all of your requirements in the question next time...
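
For the edited requirement (search on the preprocessed text, trim the original sentence), the pattern from the answer does not apply directly, because the original sentence has punctuation and stop words between the markers. One possible sketch, which is not the answer's method, works on whitespace-separated tokens instead of a regex. The function name trim_original and its context parameter are hypothetical. It assumes the sentence has already been matched on its preprocessed form, it uses the first occurrence of the first and last markers, and it does not re-check the two-word gap limit.

```python
import string

def trim_original(sentence, markers, context=30):
    """Keep `context` tokens before the first marker and after the last one.

    Hypothetical helper; assumes the sentence was already matched on its
    preprocessed form. Runs of whitespace are collapsed to single spaces.
    """
    tokens = sentence.split()
    # Normalised view of each token, used only to locate the markers.
    norm = [t.strip(string.punctuation).lower() for t in tokens]
    try:
        first = norm.index(markers[0].lower())
        last = norm.index(markers[-1].lower(), first)
    except ValueError:
        return sentence  # markers not found: leave the sentence untouched
    start = max(first - context, 0)
    end = min(last + context + 1, len(tokens))
    body = ' '.join(tokens[start:end])
    prefix = '... ' if start > 0 else ''
    suffix = ''
    if end < len(tokens):
        # Drop a dangling separator before the closing ellipsis.
        body = body.rstrip(',;:')
        suffix = ' ...'
    return prefix + body + suffix

t1 = ("This is a complete sentence, containing colors and other words: pink, "
      "blue, yellow, tree and chair, orange, green, hello, world, black, sofa, "
      "brown. It will be preprocessed")
print(trim_original(t1, ["yellow", "orange"], context=4))
# ... other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black ...
```

Counting tokens of the original sentence means stop words such as "and" count towards the context, which is what the example in the question appears to expect.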
