How to extract all hashtags from string by using regexp_substr

Question

I need a regex pattern that extracts all hashtags from tweets in a table. My data looks like this:

select regexp_substr('My twwet #HashTag1 and this is the #SecondHashtag    sample','#\S+')
from dual

It only returns #HashTag1, not #SecondHashtag.

I need output like #HashTag1 #SecondHashtag

Thanks

You say you need the output in that format, but in most cases you should look for output in separate rows (like GurV shows in his second approach).
– user5683823, Commented Mar 18, 2017 at 13:14

Gurwinder Singh · Accepted Answer · 2017-03-18 14:16:18Z

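You can use regexp_replace to remove everything that doesn't match your pattern. The alternation tries (#\S+\s?) first and keeps that text through \1; any other character is matched by the dot and replaced with the empty group, which deletes it.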

with t (col) as (
  select 'My twwet #HashTag1 and this is the #SecondHashtag    sample, #onemorehashtag'
  from dual
)
select 
  regexp_replace(col, '(#\S+\s?)|.', '\1')
from t;

Produces:

#HashTag1 #SecondHashtag #onemorehashtag

regexp_substr returns only one match per call (the occurrence selected by its fourth argument). To get every match, generate one row per occurrence using connect by:

with t (col) as (
  select 'My twwet #HashTag1 and this is the #SecondHashtag    sample, #onemorehashtag'
  from dual
)
select 
  regexp_substr(col, '#\S+', 1, level)
from t
connect by regexp_substr(col, '#\S+', 1, level) is not null;

Returns:

#HashTag1
#SecondHashtag
#onemorehashtag
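
If the single-line output asked for in the question is still needed, the rows produced by the connect by query can be aggregated back into one string. This is only a minimal sketch, assuming Oracle 11g Release 2 or later (for LISTAGG) and the same sample string; the column alias hashtags is an illustrative name, not something from the question.

```sql
with t (col) as (
  select 'My twwet #HashTag1 and this is the #SecondHashtag    sample, #onemorehashtag'
  from dual
)
select
  -- LEVEL is the occurrence number, so ordering by it keeps the tags
  -- in the order they appear in the text
  listagg(regexp_substr(col, '#\S+', 1, level), ' ')
    within group (order by level) as hashtags
from t
connect by regexp_substr(col, '#\S+', 1, level) is not null;
```

This should return the three hashtags in one row, separated by single spaces.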

EDIT:

\S matches any non-whitespace character, so it also picks up punctuation attached to a hashtag. It would be better to use \w, which matches a-z, A-Z, 0-9 and _.

As commented by @mathguy and from this site: a hashtag starts with a letter, followed by any number of alphanumeric characters or underscores.

So the pattern #[[:alpha:]]\w* will work better.

with t (col) as (
  select 'My twwet #HashTag1, this is the #SecondHashtag. #onemorehashtag'
  from dual
)
select 
  regexp_substr(col, '#[[:alpha:]]\w*', 1, level)
from t
connect by regexp_substr(col, '#[[:alpha:]]\w*', 1, level) is not null;

Produces:

#HashTag1
#SecondHashtag
#onemorehashtag
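
The queries above run against a single row. On a table with several tweets, a bare connect by would also connect rows of different tweets to each other, so the hierarchy has to be restricted to one row at a time. Below is a sketch of a commonly used idiom, assuming a hypothetical table tweets with a unique id column (both names are made up for illustration).

```sql
with tweets (id, col) as (
  select 1, 'My twwet #HashTag1, this is the #SecondHashtag.' from dual union all
  select 2, 'Another one with #onemorehashtag' from dual
)
select
  id,
  regexp_substr(col, '#[[:alpha:]]\w*', 1, level) as hashtag
from tweets
connect by regexp_substr(col, '#[[:alpha:]]\w*', 1, level) is not null
       and prior id = id                 -- stay within the same tweet
       and prior sys_guid() is not null  -- non-deterministic call, avoids the cycle error (ORA-01436)
order by id, level;
```

A tweet without any hashtag still appears once, with a NULL hashtag, because every row is a root of the hierarchy at level 1; filter those rows out in an outer query if they are unwanted.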

edited Mar 18, 2017 at 14:16

answered Mar 18, 2017 at 9:20

3 Comments

user5683823 Over a year ago

This looks good, and the second solution is probably the one that makes more sense in most situations. You also need to handle punctuation though, and non-alphanumeric characters more generally; you may have something like 'My hashtag is #MyHashtag, yours is #YourHashtag, etc.' - here the hashtags should not pick up the comma as part of the hashtag.

user5683823 Over a year ago

I just checked Twitter hashtags: they must begin with a letter and must contain only letters, digits and the underscore. So something like '#[[:alpha:]][[:alnum:]_]*' should work. (Not sure if there is also a minimum length; that can be easily accommodated.)

Gurwinder Singh Over a year ago

@mathguy - Updated the answer. Thanks for the comment.
