0

I'm working with tweets, and after text processing the code returns something like:

  • Lorem ipsum dolor sit amaet vi
  • Lorem ipsum dolor sit amaet
  • Lorem ipsum dolor sit amaet via

So the SQLite database treats these records as unique. My question is: how can I detect that two strings share 5 words and skip the duplicate? Should I change my regex code or add an if statement?

My code:

        clean1 = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", tweet.text)
        clean2 = re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t:])|(\w+:\/\/\S+)", " ", clean1)
        final = re.sub(r'^RT[\s]+', '', clean2)

Thanks!

1
  • Does my answer solve your problem? Commented Aug 3, 2017 at 3:43

2 Answers

2

I don't think regex will help in this situation.

You could do this to check whether two lines have 5 words in common:

str1 = "Lorem ipsum dolor sit amaet vi"
str2 = "Lorem ipsum dolor sit amaet"

count = 0
str1_split = str1.split(" ")
# Count each word of str2 that also occurs anywhere in str1
for word in str2.split(" "):
    if word in str1_split:
        count += 1

print(count)
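
To turn that count into the "skip it" decision the question asks about, the comparison could be wrapped in a small helper. This is only a sketch: the names `shared_word_count` and `is_near_duplicate`, the case-insensitive matching, and the use of sets (so a repeated word is counted once) are assumptions, not something stated in the question.

```python
def shared_word_count(a, b):
    """Return how many distinct words occur in both strings (case-insensitive)."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def is_near_duplicate(a, b, threshold=5):
    """Treat two strings as duplicates when they share at least `threshold` words."""
    return shared_word_count(a, b) >= threshold


str1 = "Lorem ipsum dolor sit amaet vi"
str2 = "Lorem ipsum dolor sit amaet"
print(is_near_duplicate(str1, str2))  # True
```

With a helper like this, a plain `if not is_near_duplicate(...)` check before the insert is probably enough, and the regex code can stay as it is.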

0

Here is a way to count the words that match at the same position in two strings:

a = "Lorem ipsum dolor sit amaet vi"
b = "Lorem ipsum dolor sit amaet"
count = 0
# zip pairs words by position and stops at the end of the shorter string
for i, j in zip(a.split(), b.split()):
    if i == j:
        count += 1
print(count)

Output:

5
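
For the original use case, skipping near-duplicate tweets before they reach the database, the same idea could be applied to a whole batch. The sketch below is an assumption about how that might look: `drop_near_duplicates` is a hypothetical name, matching is by shared words regardless of position, and texts with fewer than `threshold` words are never treated as duplicates.

```python
def drop_near_duplicates(texts, threshold=5):
    """Keep a text only if it shares fewer than `threshold` words with every kept text."""
    kept = []
    kept_word_sets = []
    for text in texts:
        words = set(text.lower().split())
        # Skip this text if it overlaps too much with something already kept
        if any(len(words & seen) >= threshold for seen in kept_word_sets):
            continue
        kept.append(text)
        kept_word_sets.append(words)
    return kept


tweets = [
    "Lorem ipsum dolor sit amaet vi",
    "Lorem ipsum dolor sit amaet",
    "Lorem ipsum dolor sit amaet via",
]
print(drop_near_duplicates(tweets))  # ['Lorem ipsum dolor sit amaet vi']
```

Only the texts returned by such a filter would then be inserted into the SQLite table.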
