How to get only text values from a markdown string in Javascript

Question

I currently have some code that uses marked.js to transform one big markdown string (read from a .md file) into HTML for display in the browser. 'md' is the markdown string, and calling 'marked(md)' translates it to HTML.

useEffect(() => {
    getContent(filePath)
        .then(response => {
            if (!response.ok) {
                return Promise.reject(response);
            }
            return response.text().then(md => setContent(marked(md)));
        })
        .catch(e => Dialog.error('Page failed to load!', e));
}, [filePath]);

How can I (either using marked.js, or another solution) parse the markdown/html to get only the text values? Some sample Markdown below.

### HEADER TEXT

---

# Some Page Title

<a href="cafe" target="_blank">Go to Cafe Page</a>

    <Cafe host>/portos/cafe

## Links
- ##### [Tacos](#cafe_tacos)
- ##### [Burritos](#cafe_burritos)
- ##### [Bebidas](#cafe_bebidas)


## Overview
This is the overview text for the page. I really like tacos and burritos.

[![Taco Tabs](some/path/to/images/hello.png 'Tacos')](some/path/to/images/hello.png)

## Dining <a name="dining"></a>

Dining is foo bar burrito taco mulita. 

[![Cafe Overview](some/path/to/images/hello2.png 'Cafe Overview')](some/path/to/images/hello2.png)

The cafe has been open since 1661. It has lots of food.

It was declared the top 1 cafe of all time.

### How to order food

You can order food by ordering food.

<div class="alert alert-info">
    <strong> Note: </strong> TACOS ARE AMAZING.
</div>

You might investigate mdast, which creates a usable syntax tree from markdown text. You would still need to do the work of pulling the data out of the AST, but that should be an easier task.
– Scott Sauyet, Commented Oct 18, 2022 at 20:27
For anyone checking this in 2025 (or later!), there's a nice write-up on converting markdown to plaintext here.
– JPlusPlus, Commented Jul 28 at 1:46

Emiel Zuurbier · Accepted Answer · 2022-10-18 19:43:35Z

3

One way to do it is to parse the HTML string with the DOMParser API, which turns the string into a Document object, and then walk through that document with a TreeWalker to collect the textContent of each Text node. The result should be an array of strings.

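function parseTextFromMarkDown(mdString) {
  const htmlString = marked(mdString);
  const parser = new DOMParser();
  const doc = parser.parseFromString(htmlString, 'text/html');
  // Create the walker from the parsed document and visit only its Text nodes.
  const walker = doc.createTreeWalker(doc.body, NodeFilter.SHOW_TEXT);

  const textList = [];
  // walker.currentNode starts at the root, which is not a Text node,
  // so advance to the first Text node before collecting anything.
  let currentNode = walker.nextNode();

  while (currentNode) {
    textList.push(currentNode.textContent);
    currentNode = walker.nextNode();
  }

  return textList;
}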
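
The array collected this way tends to include whitespace-only entries, because the line breaks between block elements in the generated HTML are Text nodes too. Below is a minimal sketch of a clean-up step, assuming that trimmed, non-empty strings are what is wanted. `cleanTextList` is a hypothetical helper name, not part of marked or the DOM API.

```javascript
// Hypothetical helper: tidy up the strings collected from the Text nodes.
function cleanTextList(textList) {
  return textList
    .filter((text) => typeof text === 'string') // drop null/undefined entries
    .map((text) => text.replace(/\s+/g, ' ').trim()) // collapse runs of whitespace
    .filter((text) => text.length > 0); // drop entries that were only whitespace
}

// Sample input shaped like what a text-node walk might collect.
const sample = ['Some Page Title', '\n', 'Go to Cafe Page', '\n', '  Note:  ', null];
console.log(cleanTextList(sample));
// → [ 'Some Page Title', 'Go to Cafe Page', 'Note:' ]
```

If a single string is preferred over an array, joining the cleaned list with `cleanTextList(list).join(' ')` is one option.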


4 Comments

Barry Over a year ago

Thanks! How would I rewrite the while loop for TypeScript? I'm currently trying to get it to work but am getting the error 'Type 'Node | null' is not assignable to type 'Node'. Type 'null' is not assignable to type 'Node'.'

Scott Sauyet Over a year ago

Is there a reason not to just use the .textContent property of the document itself, and not bother walking at all?

Emiel Zuurbier Over a year ago

I'm afraid I don't know the TS solution for the while loop. @ScottSauyet that result would be a giant string with lots of whitespace and line breaks, which could be fine if that is not a limitation. But having an array of strings allows for more control; for example, you could change the order of sentences relatively easily.

Scott Sauyet Over a year ago

Sure, and my solution adds flexibility in another way, but it's pretty unclear just what's wanted.

Scott Sauyet · Accepted Answer · 2022-10-19 02:49:00Z

While I think Emiel already gave the best answer, another approach would be to use an abstract syntax tree in the mdast format, which a parser such as mdast-util-from-markdown creates from the markdown string. Then we can walk the syntax tree, extracting all the text and combining it into reasonable output. One approach looks like this:

// Each handler receives the already-formatted children and the node's properties.
const astToText = ((types) => ({type, children = [], ...rest}) =>
  (types [type] || types .default) (children .map (astToText), {type, ...rest})
)(Object .fromEntries (Object .entries ({
  // `type` is passed along so that the fallback can report it
  'default': (ns, {type}) => ` *** Missing type: ${type} *** `,
  'root': (ns) => ns .join ('\n'),
  'heading, paragraph': (ns) => ns .join ('') + '\n',
  'text, code': (ns, {value}) => value,
  // Document#textContent is null, so read the text from the parsed <body>
  'html': (ns, {value}) =>
      new DOMParser () .parseFromString (value, 'text/html') .body .textContent,
  'listItem, link, emphasis': (ns) => ns .join (''),
  'list': (ns, {ordered}) => ordered
      ? ns .map ((n, i) => `${i + 1} ${n}`) .join ('\n')
      : ns .map ((n) => `• ${n}`) .join ('\n'),
  'image': (ns, {title, url, alt}) => `Image "${title}" ("${alt}" - ${url})`,
  // ... probably many more
  // comma-separated keys are split so that each type gets its own entry
}) .flatMap (([k, v]) => k .split (/,\s*/) .map (n => [n, v]))))

// import {fromMarkdown} from 'mdast-util-from-markdown'
// const ast = fromMarkdown (<your string>)

// dummy version
const ast = {type: "root", children: [{type: "heading", depth:1, children: [{type: "text", value: "Some Page Title", children: []}]}, {type: "paragraph", children: [{type: "html", value: '<a href="cafe" target="_blank">', children: []}, {type: "text", value: "Go to Cafe Page", children: []}, {type: "html", value: "</a>", children: []}]}, {type: "code", lang:null, meta:null, value: "<Cafe host>/portos/cafe", children: []}, {type: "heading", depth:2, children: [{type: "text", value: "Links", children: []}]}, {type: "list", ordered:!1, start:null, spread:!1, children: [{type: "listItem", spread:!1, checked:null, children: [{type: "heading", depth:5, children: [{type: "link", title:null, url: "#cafe_tacos", children: [{type: "text", value: "Tacos", children: []}]}]}]}, {type: "listItem", spread:!1, checked:null, children: [{type: "heading", depth:5, children: [{type: "link", title:null, url: "#cafe_burritos", children: [{type: "text", value: "Burritos", children: []}]}]}]}, {type: "listItem", spread:!1, checked:null, children: [{type: "heading", depth:5, children: [{type: "link", title:null, url: "#cafe_bebidas", children: [{type: "text", value: "Bebidas", children: []}]}]}]}]}, {type: "heading", depth:2, children: [{type: "text", value: "Overview", children: []}]}, {type: "paragraph", children: [{type: "text", value: "This is the overview text for the page.\nI really like tacos and burritos.", children: []}]}, {type: "paragraph", children: [{type: "link", title:null, url: "some/path/to/images/hello.png", children: [{type: "image", title: "Tacos", url: "some/path/to/images/hello.png", alt: "Taco Tabs", children: []}]}]}, {type: "heading", depth:2, children: [{type: "text", value: "Dining ", children: []}, {type: "html", value: '<a name="dining">', children: []}, {type: "html", value: "</a>", children: []}]}, {type: "paragraph", children: [{type: "text", value: "Dining is foo bar burrito taco mulita.", children: []}]}, {type: "paragraph", children: [{type: "link", title:null, url: "some/path/to/images/hello2.png", children: [{type: "image", title: "Cafe Overview", url: "some/path/to/images/hello2.png", alt: "Cafe Overview", children: []}]}]}, {type: "paragraph", children: [{type: "text", value: "The cafe has been open since 1661. It has lots of food.", children: []}]}, {type: "paragraph", children: [{type: "text", value: "It was declared the top 1 cafe of all time.", children: []}]}, {type: "heading", depth:3, children: [{type: "text", value: "How to order food", children: []}]}, {type: "paragraph", children: [{type: "text", value: "You can order food by ordering food.", children: []}]}, {type: "html", value: '<div class="alert alert-info">\n    <strong> Note: </strong> TACOS ARE AMAZING.\n</div>', children: []}]}

console .log (astToText (ast))


The advantage of this approach over the plain HTML one is that we can decide how certain nodes are rendered in plain text. For instance, here we choose to render this image markup:

![Taco Tabs](some/path/to/images/hello.png 'Tacos')

as

Image "Tacos" ("Taco Tabs" - some/path/to/images/hello.png)

Of course, HTML nodes are still going to be problematic. Here I use DOMParser and .textContent, but you could instead add html to the 'text, code' entry to include the raw HTML text.

Each function passed to the configuration receives a list of already formatted children as well as the remaining properties of the node.
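
To illustrate that handler contract, here is a minimal, self-contained sketch written in a more conventional style than the code above. It assumes the mdast node types `strong` and `blockquote`. The names `handlers` and `toText`, the hand-built tree, and the choice to upper-case strong text are all hypothetical and only for illustration.

```javascript
// Each handler gets (formattedChildren, node) and returns a string.
const handlers = {
  root: (kids) => kids.join('\n'),
  paragraph: (kids) => kids.join(''),
  text: (kids, node) => node.value,
  inlineCode: (kids, node) => node.value,
  strong: (kids) => kids.join('').toUpperCase(), // illustrative rendering choice
  blockquote: (kids) => kids.map((k) => `> ${k}`).join('\n'),
};

// Format the children first, then hand the results to the node's own handler.
function toText(node) {
  const kids = (node.children || []).map(toText);
  const handler = handlers[node.type] || (() => `[unhandled: ${node.type}]`);
  return handler(kids, node);
}

// Hand-built tree for the markdown: > Tacos are **amazing**.
const tree = {
  type: 'root',
  children: [{
    type: 'blockquote',
    children: [{
      type: 'paragraph',
      children: [
        { type: 'text', value: 'Tacos are ' },
        { type: 'strong', children: [{ type: 'text', value: 'amazing' }] },
        { type: 'text', value: '.' },
      ],
    }],
  }],
};

console.log(toText(tree)); // → > Tacos are AMAZING.
```

Because children are formatted before their parent, a handler never needs to recurse itself; it only decides how to combine strings it is given.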

Vikram Ray · Accepted Answer · 2023-01-25 09:39:29Z

2

You can also explore https://www.npmjs.com/package/markdown-to-txt.

import markdownToTxt from 'markdown-to-txt';
markdownToTxt('Some *bold text*'); // "Some bold text"
