1

I want to write a simple markdown parser function that takes a single line of markdown and translates it into the appropriate HTML. To keep it simple, I want to support only one feature of markdown in atx syntax: headers.

Headers are designated by 1-6 hashes followed by a space, followed by text. The number of hashes determines the header level of the HTML output. Examples:

# Header will become <h1>Header</h1>

## Header will become <h2>Header</h2>

###### Header will become <h6>Header</h6>

The rules are as follows:

Header content should only come after the initial hashtag(s) plus a space character.

Invalid headers should just be returned as the markdown that was received, no translation necessary.

Spaces before and after both the header content and the hashtag(s) should be ignored in the resulting output.

This is the code I wrote:

import re
def markdown_parser(markdown):
    results =''
    pattern = re.compile("#+\s")
    matches = pattern.search(markdown.strip())
    if (matches != None):
        tag = matches[0]
        hashTagLen = len(tag) - 1
        htmlTag = "h" + str(hashTagLen)
        content = markdown.strip()[(hashTagLen + 1):]
        results = "<" + htmlTag + ">" + content + "</" + htmlTag + ">"
    else:
        results = markdown
    return results

When I run this code, the following exception occurs:

Unhandled Exception: '_sre.SRE_Match' object is not subscriptable

I'm not sure why this error occurs.

When I run the script in my shell, it works fine. But when I run it in a unittest environment (import unittest), the error occurs.

Please help me.

  • Why don't you just use the markdown module and have it do all the dirty work for you? Commented Jul 5, 2017 at 15:15
  • matches is not the matched text but a match object; use matches.group() to interact with it (cf. the docs: m = re.search('(?<=abc)def', 'abcdef'); m.group(0)). So in your case: matches[0] --> matches.group(0). Commented Jul 5, 2017 at 15:16

3 Answers

1

That code looks quite verbose, and a lot of that logic can be performed in the regex itself.

If you look at the original Markdown library, written in Perl, you can see that you only need one pattern; the length of the first capture group then tells you which header level it is.

Here is the original implementation:

sub _DoHeaders {
my $text = shift;

# Setext-style headers:
#     Header 1
#     ========
#  
#     Header 2
#     --------
#
$text =~ s{ ^(.+)[ \t]*\n=+[ \t]*\n+ }{
    "<h1>"  .  _RunSpanGamut($1)  .  "</h1>\n\n";
}egmx;

$text =~ s{ ^(.+)[ \t]*\n-+[ \t]*\n+ }{
    "<h2>"  .  _RunSpanGamut($1)  .  "</h2>\n\n";
}egmx;


# atx-style headers:
#   # Header 1
#   ## Header 2
#   ## Header 2 with closing hashes ##
#   ...
#   ###### Header 6
#
$text =~ s{
        ^(\#{1,6})  # $1 = string of #'s
        [ \t]*
        (.+?)       # $2 = Header text
        [ \t]*
        \#*         # optional closing #'s (not counted)
        \n+
    }{
        my $h_level = length($1);
        "<h$h_level>"  .  _RunSpanGamut($2)  .  "</h$h_level>\n\n";
    }egmx;

return $text;

}
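For comparison, here is a rough Python sketch of the atx-style substitution from the Perl code above. It is not taken from any library: the names ATX_RE and do_atx_headers are made up for illustration, and the span-level processing done by _RunSpanGamut is left out.

```python
import re

# Rough Python counterpart of the Perl atx pattern (hypothetical names).
ATX_RE = re.compile(
    r"""
    ^(\#{1,6})   # group 1: the leading hashes
    [ \t]*
    (.+?)        # group 2: header text
    [ \t]*
    \#*          # optional closing hashes (not counted)
    \n+
    """,
    re.MULTILINE | re.VERBOSE,
)


def do_atx_headers(text):
    """Replace atx-style header lines in text with <hN> elements."""
    def replace(m):
        level = len(m.group(1))  # number of hashes gives the header level
        return "<h{0}>{1}</h{0}>\n\n".format(level, m.group(2))
    return ATX_RE.sub(replace, text)


print(repr(do_atx_headers("## Header 2 with closing hashes ##\n")))
# '<h2>Header 2 with closing hashes</h2>\n\n'
```

Like the Perl pattern, this one requires a newline after the header, so a bare "# Header" with no trailing newline is left unchanged.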

Unless you can't for some reason, it would be better to use the markdown library, as it is an implementation of the original library, warts and all.

You can see how the Python-Markdown library implements it here:

class HashHeaderProcessor(BlockProcessor):
    """ Process Hash Headers. """

    # Detect a header at start of any line in block
    RE = re.compile(r'(^|\n)(?P<level>#{1,6})(?P<header>.*?)#*(\n|$)')

    def test(self, parent, block):
        return bool(self.RE.search(block))

    def run(self, parent, blocks):
        block = blocks.pop(0)
        m = self.RE.search(block)
        if m:
            before = block[:m.start()]  # All lines before header
            after = block[m.end():]     # All lines after header
            if before:
                # As the header was not the first line of the block and the
                # lines before the header must be parsed first,
                # recursively parse this lines as a block.
                self.parser.parseBlocks(parent, [before])
            # Create header using named groups from RE
            h = util.etree.SubElement(parent, 'h%d' % len(m.group('level')))
            h.text = m.group('header').strip()
            if after:
                # Insert remaining lines as first block for future parsing.
                blocks.insert(0, after)
        else:  # pragma: no cover
            # This should never happen, but just in case...
            logger.warn("We've got a problem header: %r" % block)
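
To see what the named groups in that pattern capture, the regex can be tried on its own with nothing but the re module. This is only a sketch: the pattern is copied from the excerpt above, while parse_header is a hypothetical helper that is not part of Python-Markdown.

```python
import re

# Pattern copied from the HashHeaderProcessor excerpt above.
RE = re.compile(r'(^|\n)(?P<level>#{1,6})(?P<header>.*?)#*(\n|$)')


def parse_header(block):
    """Return (level, text) for the first header in block, or None."""
    m = RE.search(block)
    if m is None:
        return None
    # The level is the number of hashes; the text is stripped of spaces.
    return len(m.group('level')), m.group('header').strip()


print(parse_header("## Header 2 with closing hashes ##"))
# (2, 'Header 2 with closing hashes')
print(parse_header("#Header"))
# (1, 'Header')
print(parse_header("plain text"))
# None
```

Note that this pattern does not require a space after the hashes, so "#Header" counts as a header here, which differs from the rules stated in the question.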

0

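On Python versions before 3.6 you can't use indexing to access a match object; use matches.group(0) instead of matches[0]. Support for indexing (m[0]) was only added in Python 3.6, which is likely why the code works in one environment and fails in the other. https://docs.python.org/2/library/re.html#match-objects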
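
A minimal sketch of the function from the question with .group() used instead of indexing, assuming the input is a single line. HEADER_RE is a name introduced here for illustration, and limiting the pattern to 1-6 hashes is an assumption based on the rules stated in the question.

```python
import re

# 1-6 hashes, at least one whitespace character, then the header text.
HEADER_RE = re.compile(r"(#{1,6})\s+(.+)$")


def markdown_parser(markdown):
    stripped = markdown.strip()
    match = HEADER_RE.match(stripped)  # anchored at the start of the line
    if match is None:
        # Invalid header: return the markdown exactly as received.
        return markdown
    level = len(match.group(1))        # .group() works on every Python version
    content = match.group(2).strip()
    return "<h{0}>{1}</h{0}>".format(level, content)


print(markdown_parser("  ## Header  "))
# <h2>Header</h2>
print(markdown_parser("####### Header"))
# ####### Header
```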

0

You can use re.sub to substitute one to six # characters followed by a space and a word (the pattern being (#{1,6}) (\w+)) with the HTML you want.

re.sub also accepts a function that computes the replacement.

import re

def replacer(m):
    return '<h{level}>{header}</h{level}>'.format(level=len(m.group(1)), header=m.group(2))

def markdown_parser(markdown):
    results = [re.sub(r'(#{1,6}) (\w+)', replacer, line) for line in markdown.split('\n')]
    return "\n".join(results).strip()

sourceText = "##header#content## smaller header#contents### something"
print(markdown_parser(sourceText))

Prints ##header#content<h2>smaller</h2> header#contents<h3>something</h3>
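
The pattern above is not anchored, so it also matches hashes in the middle of a line, and \w+ captures only the first word after the space. If the goal is the single-line behaviour described in the question, one possible variation anchors the pattern to the whole line. This is a sketch rather than part of the original answer; LINE_RE and markdown_line_parser are names made up here.

```python
import re

# Optional leading spaces, 1-6 hashes, whitespace, then text up to the line end.
LINE_RE = re.compile(r'^\s*(#{1,6})\s+(\S.*?)\s*$')


def replacer(m):
    return '<h{level}>{header}</h{level}>'.format(level=len(m.group(1)), header=m.group(2))


def markdown_line_parser(line):
    # Lines that do not match the pattern are returned unchanged by re.sub.
    return LINE_RE.sub(replacer, line, count=1)


print(markdown_line_parser("## smaller header"))
# <h2>smaller header</h2>
print(markdown_line_parser("##header#content## smaller"))
# ##header#content## smaller
```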
