
I have a dataset of texts from which I extract every "sentence" that matches the pattern r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'.

I now want to shorten the long "sentences" (> 200 words) into more readable ones by keeping, e.g., only 30 words before and after my pattern and replacing the trimmed parts with "...".

Is there a clean way to do this?

EDIT: The search is run on preprocessed text (lowercased, with stop words, punctuation, and other hand-selected words removed), but the matched sentences are stored in their original form. I want to trim the original sentence (with its punctuation and stop words).

EXAMPLE:

import re

list_of_sentences = []
t1 = "This is a complete sentence, containing colors and other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black, sofa, brown. It will be preprocessed"
t2 = preprocess(t1)  # ---> "complete sentence containing colors words pink blue yellow tree chair orange green hello world black sofa brown preprocessed"
my_words_markers = "yellow orange".split()
pattern = r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'
match = re.search(pattern, t2, re.I)  # search the preprocessed text
if match:
    list_of_sentences.append(t1)  # but store the original sentence

In list_of_sentences, I want to trim the longest ones:

# what I want is a trimmed version of t1, with, e.g., 4 words before and after pattern: 
"... other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black ..."
  • Please post some examples of processed and unprocessed sentences. I don't want to update my answer just to be hit with "thanks but it doesn't work with my data" a 2nd time. Commented Feb 1, 2018 at 14:01
  • @Rawing happy now? Commented Feb 1, 2018 at 17:07
  • Yes and no. I don't think this is possible... Commented Feb 1, 2018 at 17:14

1 Answer

You can extend your regex so that it also matches up to 30 words before and after the pattern:

pattern = r'(?:\w+\W+){,30}\b' + \
          r' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + \
          r'\b(?:\W+\w+){,30}'

Then loop over all sentences and, whenever the regex matches, compare match.start() and match.end() against the sentence bounds to decide whether an ellipsis (...) is needed on each side:

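import re

# `sentences` is the list of sentences to trim; `pattern` is the regex built above
for sentence in sentences:
    match = re.search(pattern, sentence)
    if match:
        # add an ellipsis only on the side where text was actually cut off
        text = '{}{}{}'.format('...' if match.start() > 0 else '',
                               match.group(),
                               '...' if match.end() < len(sentence) else '')
        print(text)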
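
For reference, here is a minimal self-contained sketch of the same idea, run on the preprocessed example sentence from the question with a window of 4 words. The helper names build_context_pattern and trim_sentence are hypothetical, and re.escape and re.I are assumptions added for safety. Because the pattern still expects single spaces between the marker words, it is only meant for preprocessed text, not for the original sentences with punctuation and stop words.

```python
import re

def build_context_pattern(markers, n_words=30):
    """Regex for the markers plus up to n_words words of context on each side."""
    # Marker words may be separated by up to two extra words, as in the question.
    core = r' (?:\w+ )?(?:\w+ )?'.join(re.escape(m) for m in markers)
    before = r'(?:\w+\W+){0,%d}' % n_words   # up to n_words words before
    after = r'(?:\W+\w+){0,%d}' % n_words    # up to n_words words after
    return re.compile(before + r'\b' + core + r'\b' + after, re.I)

def trim_sentence(sentence, pattern):
    """Return the matched window with '...' where text was cut, or None if no match."""
    match = pattern.search(sentence)
    if match is None:
        return None
    prefix = '...' if match.start() > 0 else ''
    suffix = '...' if match.end() < len(sentence) else ''
    return prefix + match.group() + suffix

# Preprocessed example sentence taken from the question.
demo = ("complete sentence containing colors words pink blue yellow tree chair "
        "orange green hello world black sofa brown preprocessed")
pattern = build_context_pattern(["yellow", "orange"], n_words=4)
trimmed = trim_sentence(demo, pattern)
print(trimmed)
# ...colors words pink blue yellow tree chair orange green hello world black...
```

Compiling the pattern once and reusing it avoids rebuilding the regex for every sentence in the loop.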

2 Comments

Thanks, but it doesn't work properly. I've edited my question.
@Fed Please try to include all of your requirements in the question next time...
