I need a regex pattern that extracts all hashtags from tweets stored in a table. My data looks like this:

select regexp_substr('My twwet #HashTag1 and this is the #SecondHashtag    sample','#\S+')
from dual

This returns only #HashTag1, not #SecondHashtag.

I need output like #HashTag1 #SecondHashtag

Thanks

  • You say you need the output in that format, but in most cases you should look for output in separate rows (like GurV shows in his second approach). Commented Mar 18, 2017 at 13:14

1 Answer

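You can use regexp_replace to remove everything that doesn't match your pattern. The first alternative captures a hashtag (plus one optional trailing whitespace character) as group 1; the second alternative matches any other single character, for which \1 is empty, so that character is dropped.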

with t (col) as (
  select 'My twwet #HashTag1 and this is the #SecondHashtag    sample, #onemorehashtag'
  from dual
)
select 
  regexp_replace(col, '(#\S+\s?)|.', '\1')
from t;

Produces:

#HashTag1 #SecondHashtag #onemorehashtag

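regexp_substr returns only one match per call; its fourth argument selects which occurrence. To get every match, you can turn the string into rows using connect by and pass level as the occurrence number: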

with t (col) as (
  select 'My twwet #HashTag1 and this is the #SecondHashtag    sample, #onemorehashtag'
  from dual
)
select 
  regexp_substr(col, '#\S+', 1, level)
from t
connect by regexp_substr(col, '#\S+', 1, level) is not null;

Returns:

#HashTag1
#SecondHashtag
#onemorehashtag
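
The query above works on a single-row source. Since the question mentions tweets in a table, here is a minimal sketch of one common way to adapt it to several rows. The id column and the two sample rows are hypothetical, added only for illustration; the extra connect by conditions keep each row's hierarchy separate so that rows are not combined with each other.

```sql
-- Sketch only: "id" and the sample rows are assumptions, not from the question.
with t (id, col) as (
  select 1, 'My tweet #HashTag1 and this is the #SecondHashtag sample' from dual
  union all
  select 2, 'Another tweet with #onemorehashtag' from dual
)
select
  id,
  regexp_substr(col, '#\S+', 1, level) as hashtag
from t
connect by regexp_substr(col, '#\S+', 1, level) is not null
       and prior id = id                   -- stay within the same tweet
       and prior sys_guid() is not null    -- avoids the connect by loop error
order by id, level;
```

This should return one row per hashtag per tweet: two rows for id 1 and one row for id 2.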

EDIT:

\S matches any non-whitespace character, so it also picks up punctuation attached to a hashtag. It would be better to use \w, which matches a-z, A-Z, 0-9 and _.

As commented by @mathguy and from this site: a hashtag starts with a letter, after which alphanumeric characters or underscores are allowed.

So the pattern #[[:alpha:]]\w* will work better.

with t (col) as (
  select 'My twwet #HashTag1, this is the #SecondHashtag. #onemorehashtag'
  from dual
)
select 
  regexp_substr(col, '#[[:alpha:]]\w*', 1, level)
from t
connect by regexp_substr(col, '#[[:alpha:]]\w*', 1, level) is not null;

Produces:

#HashTag1
#SecondHashtag
#onemorehashtag
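
If the single-line output from the question is still wanted with the stricter pattern, one possible sketch is to aggregate the rows back together with listagg. This assumes an Oracle version where listagg is available (11gR2 or later); the hashtags alias is just an illustrative name.

```sql
-- Sketch only: collapses the per-row matches into one space-separated string.
with t (col) as (
  select 'My twwet #HashTag1, this is the #SecondHashtag. #onemorehashtag'
  from dual
)
select
  listagg(regexp_substr(col, '#[[:alpha:]]\w*', 1, level), ' ')
    within group (order by level) as hashtags   -- keep order of appearance
from t
connect by regexp_substr(col, '#[[:alpha:]]\w*', 1, level) is not null;
```

For this sample the result should be the single value #HashTag1 #SecondHashtag #onemorehashtag, without the comma and period that \S+ would have picked up.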

3 Comments

This looks good, and the second solution is probably the one that makes more sense in most situations. You also need to handle punctuation though, and non-alphanumeric characters more generally; you may have something like 'My hashtag is #MyHashtag, yours is #YourHashtag, etc.' - here the hashtags should not pick up the comma as part of the hashtag.
I just checked Twitter hashtags: they must begin with a letter and must contain only letters, digits and the underscore. So something like '#[[:alpha:]][[:alnum:]_]*' should work. (Not sure if there is also a minimum length; that can be easily accommodated.)
@mathguy - Updated the answer. Thanks for the comment.
