I'm working with tweets, and after text processing the code returns something like:
- Lorem ipsum dolor sit amaet vi
- Lorem ipsum dolor sit amaet
- Lorem ipsum dolor sit amaet via
So the SQLite database treats these records as unique.
My question is: how can I check whether two strings share 5 or more words and, if they do, skip the record? Should I change my regex code or add an if statement?
My code:
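import re
# Remove @mentions, #hashtags and http(s) links
clean1 = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", tweet.text)
# Replace everything except letters, digits, spaces, tabs and colons with a space
clean2 = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t:])|(\w+:\/\/\S+)", " ", clean1)
final = re.sub(r'^RT[\s]+', '', clean2)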
Thanks!
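
A regex alone probably cannot decide this, since it only sees one string at a time; the check needs a comparison against the texts already kept. Below is a minimal sketch of that idea, assuming "similar" means "the same word, ignoring case". The helper names word_set, is_near_duplicate and drop_near_duplicates are hypothetical, not part of any library, and the fourth sample tweet is made up for illustration.

```python
import re

def word_set(text):
    """Return the set of lowercased alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_near_duplicate(a, b, min_shared=5):
    """True if a and b have at least min_shared distinct words in common."""
    return len(word_set(a) & word_set(b)) >= min_shared

def drop_near_duplicates(texts, min_shared=5):
    """Keep each text only if it is not a near duplicate of one already kept."""
    kept = []
    for text in texts:
        if any(is_near_duplicate(text, seen, min_shared) for seen in kept):
            continue  # shares enough words with an earlier text, so skip it
        kept.append(text)
    return kept

tweets = [
    "Lorem ipsum dolor sit amaet vi",
    "Lorem ipsum dolor sit amaet",
    "Lorem ipsum dolor sit amaet via",
    "A completely different tweet about something else",  # made-up example
]
print(drop_near_duplicates(tweets))
# ['Lorem ipsum dolor sit amaet vi', 'A completely different tweet about something else']
```

In practice the same check could run on each cleaned tweet just before the insert, comparing it with the texts already stored. Two caveats: texts with fewer than 5 distinct words are never flagged, and comparing every new tweet against all kept ones gets slow for large tables. If word overlap turns out to be too strict or too loose, difflib.SequenceMatcher from the standard library gives a similarity ratio that may be worth trying instead.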