1

I have this HTML code:

<a class="button block left icon-phone" data-reveal="\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">

This is a string, and I want to extract the content in front of data-reveal. I tried a regex like

p = re.compile('data-reveal=*')

but it didn't work. How can I do it? Thanks.

3
  • Everything in front of data-reveal, including href="#"? Can you post your exact desired outcome? Commented Apr 11, 2016 at 3:28
  • 2
    BS4 is a better option for this. Commented Apr 11, 2016 at 3:31
  • Use data-reveal="([^"]*)" if you want just the data within data-reveal. Commented Apr 11, 2016 at 3:35

3 Answers 3

3

You are using the wrong tool for this. You should use an HTML parser like BeautifulSoup.

>>> from bs4 import BeautifulSoup
>>> doc = """<a class="button block left icon-phone" data-reveal="\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">"""
>>> soup = BeautifulSoup(doc, 'html.parser')
>>> print(soup.find('a').get('data-reveal'))
۰۹۳۶۵۶۸۱۶۲۱
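
If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not the BeautifulSoup approach above; the class name DataRevealParser and the function extract_data_reveal are made up for illustration.

```python
from html.parser import HTMLParser


class DataRevealParser(HTMLParser):
    """Hypothetical helper: collects the data-reveal value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased
        if tag == "a":
            for name, value in attrs:
                if name == "data-reveal" and value is not None:
                    self.values.append(value)


def extract_data_reveal(html):
    """Return a list with the data-reveal value of each <a> tag in html."""
    parser = DataRevealParser()
    parser.feed(html)
    parser.close()
    return parser.values


doc = '<a class="button block left icon-phone" data-reveal="\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">'
print(extract_data_reveal(doc))
```

Returning a list means a tag without the attribute yields an empty list rather than an error.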

2

You shouldn't use regex for this, but I'll assume you want to, since that's what you do in the question. I'm not exactly sure what you want, so here is how to do each of the two things I think you could be asking for.

Match only the value inside data-reveal:
data-reveal="(.+?)"
matches: \u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1

Match EVERYTHING after data-reveal=" up to the end of the line:
data-reveal="(.+)
matches: \u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">

first regex: https://regex101.com/r/jW9fT4/1

second regex: https://regex101.com/r/uZ7vX2/1
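
For reference, here is a minimal sketch of how the two patterns above might be applied in Python with re.search. The variable names are illustrative only, and the input is assumed to be the single-line tag from the question.

```python
import re

html = '<a class="button block left icon-phone" data-reveal="\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">'

# First pattern: the lazy (.+?) stops at the first closing quote,
# so only the attribute value is captured.
value_only = re.search(r'data-reveal="(.+?)"', html)

# Second pattern: the greedy (.+) runs to the end of the line,
# so the rest of the tag is captured as well.
rest_of_line = re.search(r'data-reveal="(.+)', html)

# re.search returns None when nothing matches, so guard before calling group()
attr_value = value_only.group(1) if value_only else None
tail = rest_of_line.group(1) if rest_of_line else None

print(attr_value)
print(tail)
```

Note that . does not match a newline by default, so the second pattern stops at the end of the line that contains data-reveal.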


2

Try this:

import re

html = """<a class="button block left icon-phone" data-reveal="\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"  href="#">"""

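# [^"]* stops at the closing quote, so the match cannot run into later attributes
regexObj = re.compile(r'data-reveal="([^"]*)"')
result = regexObj.search(html)
if result:  # search() returns None when nothing matches
    print(result.group(1))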

Output:

۰۹۳۶۵۶۸۱۶۲۱

2 Comments

Hi ren, thanks a lot for your answer. Now when I print the output, it shows the escape sequences (\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1). How can I convert it to plain text that shows ۰۹۳۶۵۶۸۱۶۲۱? Thanks again.
Thanks, I figured it out: I should use b'\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1'
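
One possible way to do the conversion asked about in these comments, sketched under the assumption that you are on Python 3 and that the extracted string contains literal backslash-u sequences (for example because the HTML was read from a file or a scraped page rather than typed as a Python string literal). The helper name decode_unicode_escapes is made up for illustration.

```python
# A raw string keeps the backslashes literally, mimicking text read from a page
raw = r"\u06f0\u06f9\u06f3\u06f6\u06f5\u06f6\u06f8\u06f1\u06f6\u06f2\u06f1"


def decode_unicode_escapes(text):
    """Hypothetical helper: turn literal \\uXXXX sequences into characters.

    Assumes text is pure ASCII; encode("ascii") raises UnicodeEncodeError
    if the input already contains non-ASCII characters.
    """
    return text.encode("ascii").decode("unicode_escape")


decoded = decode_unicode_escapes(raw)
print(decoded)
```

If the string was written directly in Python source as a normal (non-raw) literal, the escapes are already interpreted and no conversion is needed.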
