In Python, how can I extract a string with regex?

Question

I want to write a simple markdown parser function that takes in a single line of markdown and translates it into the appropriate HTML. To keep it simple, I want to support only one feature of markdown in atx syntax: headers.

Headers are designated by 1-6 hashes followed by a space, followed by text. The number of hashes determines the header level of the HTML output. Examples:

# Header will become <h1>Header</h1>

## Header will become <h2>Header</h2>

###### Header will become <h6>Header</h6>

Rules are as follows

Header content should only come after the initial hashtag(s) plus a space character.

Invalid headers should just be returned as the markdown that was received, no translation necessary.

Spaces before and after both the header content and the hashtag(s) should be ignored in the resulting output.

This is the code I wrote.

import re
def markdown_parser(markdown):
    results =''
    pattern = re.compile("#+\s")
    matches = pattern.search(markdown.strip())
    if (matches != None):
        tag = matches[0]
        hashTagLen = len(tag) - 1
        htmlTag = "h" + str(hashTagLen)
        content = markdown.strip()[(hashTagLen + 1):]
        results = "<" + htmlTag + ">" + content + "</" + htmlTag + ">"
    else:
        results = markdown
    return results

When I run this code, the following exception occurs.

Unhandled Exception: '_sre.SRE_Match' object is not subscriptable

I'm not sure why this error occurs.

When I run the script in my shell, it works well. But when I run it in a unittest environment (import unittest), the error occurs.

Please help me.

Why don't you just use the markdown module and have it do all the dirty work for you?
– zwer, Commented Jul 5, 2017 at 15:15
matches is not the matched text but a match object; use matches.group() to interact with it (cf. the docs: m = re.search('(?<=abc)def', 'abcdef'); m.group(0)). So in your case: matches[0] --> matches.group(0)
– patrick, Commented Jul 5, 2017 at 15:16

duFF · Accepted Answer · 2017-07-05 15:44:25Z

That code looks quite verbose, and a lot of that logic can be performed in the regex itself.

If you look at the original Markdown library written in Perl, you can see that you only need one pattern; the first capture group then tells you which level of header it is.

The original implementation is here:

sub _DoHeaders {
my $text = shift;

# Setext-style headers:
#     Header 1
#     ========
#  
#     Header 2
#     --------
#
$text =~ s{ ^(.+)[ \t]*\n=+[ \t]*\n+ }{
    "<h1>"  .  _RunSpanGamut($1)  .  "</h1>\n\n";
}egmx;

$text =~ s{ ^(.+)[ \t]*\n-+[ \t]*\n+ }{
    "<h2>"  .  _RunSpanGamut($1)  .  "</h2>\n\n";
}egmx;


# atx-style headers:
#   # Header 1
#   ## Header 2
#   ## Header 2 with closing hashes ##
#   ...
#   ###### Header 6
#
$text =~ s{
        ^(\#{1,6})  # $1 = string of #'s
        [ \t]*
        (.+?)       # $2 = Header text
        [ \t]*
        \#*         # optional closing #'s (not counted)
        \n+
    }{
        my $h_level = length($1);
        "<h$h_level>"  .  _RunSpanGamut($2)  .  "</h$h_level>\n\n";
    }egmx;

return $text;

}
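For readers more at home in Python, the atx-style substitution above can be approximated as follows. This is a rough sketch rather than a faithful port: the names ATX_HEADER and do_atx_headers are made up for illustration, and the _RunSpanGamut call (inline formatting) is left out, so the header text is inserted unchanged.

```python
import re

# Same shape as the Perl atx pattern: 1-6 hashes, optional blanks,
# lazily matched header text, optional closing hashes, then newline(s).
ATX_HEADER = re.compile(r"^(#{1,6})[ \t]*(.+?)[ \t]*#*\n+", re.MULTILINE)


def do_atx_headers(text):
    """Replace atx-style headers in text with <hN> elements."""
    def replace(m):
        level = len(m.group(1))  # number of leading hashes
        return "<h{0}>{1}</h{0}>\n\n".format(level, m.group(2))

    return ATX_HEADER.sub(replace, text)


print(do_atx_headers("## Header 2 with closing hashes ##\n"))
```

Like the Perl pattern, this needs a trailing newline to match and does not require a space after the hashes, so "#Header\n" is converted as well. That is looser than the rules in the question.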

Unless for some reason you can't, it would be better to use the markdown library, as that is an implementation of the original library, warts and all.

You can see how the Python-Markdown library implements it here:

class HashHeaderProcessor(BlockProcessor):
    """ Process Hash Headers. """

    # Detect a header at start of any line in block
    RE = re.compile(r'(^|\n)(?P<level>#{1,6})(?P<header>.*?)#*(\n|$)')

    def test(self, parent, block):
        return bool(self.RE.search(block))

    def run(self, parent, blocks):
        block = blocks.pop(0)
        m = self.RE.search(block)
        if m:
            before = block[:m.start()]  # All lines before header
            after = block[m.end():]     # All lines after header
            if before:
                # As the header was not the first line of the block and the
                # lines before the header must be parsed first,
                # recursively parse this lines as a block.
                self.parser.parseBlocks(parent, [before])
            # Create header using named groups from RE
            h = util.etree.SubElement(parent, 'h%d' % len(m.group('level')))
            h.text = m.group('header').strip()
            if after:
                # Insert remaining lines as first block for future parsing.
                blocks.insert(0, after)
        else:  # pragma: no cover
            # This should never happen, but just in case...
            logger.warn("We've got a problem header: %r" % block)

Isis Binder · Accepted Answer · 2017-07-05 15:21:39Z

Match objects only support indexing (matches[0]) from Python 3.6 onward; on older versions, use matches.group(0) instead. See https://docs.python.org/2/library/re.html#match-objects
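A minimal sketch of reading the match through group() rather than indexing. The function name header_level is hypothetical, and the pattern is tightened to 1-6 hashes at the start of the line, which is an assumption based on the rules in the question rather than the asker's original pattern.

```python
import re


def header_level(line):
    """Return the header level (1-6) of a markdown line, or None."""
    match = re.match(r"(#{1,6})\s", line.strip())
    if match is None:
        return None
    # group(0) is the whole match and group(1) the first capture group;
    # calling group() works regardless of the Python version in use.
    return len(match.group(1))


print(header_level("## Header"))
```

Seven or more hashes, or hashes with no following space, give None here, so the caller can return the input unchanged.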


Faibbus · Accepted Answer · 2017-07-05 15:54:47Z

You can use re.sub to substitute one to six # characters followed by a space and a word (the pattern being (#{1,6}) (\w+)) with the HTML you want.

re.sub can be used with a function to handle the replacement.

import re

def replacer(m):
    return '<h{level}>{header}</h{level}>'.format(level=len(m.group(1)), header=m.group(2))

def markdown_parser(markdown):
    results = [re.sub(r'(#{1,6}) (\w+)', replacer, line) for line in markdown.split('\n')]
    return "\n".join(results).strip()

sourceText = "##header#content## smaller header#contents### something"
print(markdown_parser(sourceText))

Prints ##header#content<h2>smaller</h2> header#contents<h3>something</h3>
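As the output shows, this substitution rewrites header-like runs anywhere in the line and captures only a single word. If the stricter single-line behavior from the question is wanted, an anchored match is one option. The following is a sketch under assumptions: the name parse_header_line is hypothetical, and a header with no text after the hashes is treated as invalid, which the question does not specify.

```python
import re

# Optional leading blanks, 1-6 hashes, at least one space, then header
# text that starts with a non-space character; trailing blanks are dropped.
HEADER_LINE = re.compile(r"^\s*(#{1,6})\s+(\S.*?)\s*$")


def parse_header_line(markdown):
    """Translate one markdown header line to HTML, else return it as-is."""
    m = HEADER_LINE.match(markdown)
    if m is None:
        return markdown  # invalid header: hand back the input untouched
    level = len(m.group(1))
    return "<h{0}>{1}</h{0}>".format(level, m.group(2))


print(parse_header_line("  ## Header  "))
```

Because the pattern is anchored at both ends, a line such as "#Header" or one with seven hashes fails to match and is returned unchanged.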

edited Jul 5, 2017 at 15:54

answered Jul 5, 2017 at 15:40
