Medium to Markdown via Python: the basics

How to create a Python script for your conversion needs

Danielle H
5 min read · Mar 1, 2024

If you publish in one place, such as Medium, you may also publish in another, such as your own Jekyll or Hugo site. But why write everything twice? Take the time to craft a Python script that fits your needs exactly, then sit back and enjoy the extra time to do interesting things :)

[Image: Time to take over the world! (generated by DALL-E)]

Getting the HTML content

First, we need to get the Medium article as HTML. This means you need to publish your article to Medium before converting it; this will not work with a draft (though that’s on my to-do list).

Then we use the excellent requests library:

import requests

url = "https://medium.com/@dsavir-h/flood-of-tears-2edc7bcf306b"
### get html content from url
response = requests.get(url)
# optionally check response.status_code first (see the note below)
html_content = response.text

This is simple: send a GET request to the URL and keep the response body as text.

You can also check response.status_code and make sure it’s 200, but Medium redirects to a custom 404 page when the article is not found (while still returning 200), so the check doesn’t really make a difference.
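
If you do want the check anyway, a minimal sketch could look like this. The helper name fetch_html, the get parameter, and the 10-second timeout are assumptions made for illustration, not part of the original script; get exists only so the function can be tried offline with a stand-in for requests.get. Since Medium answers 200 even for missing articles, this only catches outright HTTP errors.

```python
def fetch_html(url: str, get=None, timeout: float = 10.0) -> str:
    """Return the page body as text, raising on HTTP error statuses."""
    if get is None:
        # Imported lazily so the helper can be exercised offline with a stand-in.
        import requests
        get = requests.get
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses; does nothing for 200.
    response.raise_for_status()
    return response.text
```

In the script, this would replace the two lines that call requests.get and read response.text: html_content = fetch_html(url).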

Converting to Markdown

There is a specific library for this, too (of course!): html2text. It’s pretty simple to use:

import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False # preserve hyperlinks

# Convert the HTML to Markdown
markdown_content = converter.handle(html_content)

However, the result will still need a lot of work.

  • It will include lots of Medium links (like the Medium share button and the Medium end matter) that you don’t need.
  • It will format tags as Medium links.
  • It removes text separators (see below for why) and ignores code blocks.
  • There are no images or GIFs.
  • You will need to create your own front matter and a proper filename.
  • And more.

So let’s start with the basics: text separators, code blocks, filename.

Getting specific elements from HTML

We want to get the title and subtitle (for front matter) and any other custom elements such as images and text separators. For this, we use the totally awesome BeautifulSoup4:

from bs4 import BeautifulSoup
import sys

def get_html_element(element, soup) -> str:
    """
    Searches for the first occurrence of a specified HTML element in a BeautifulSoup object and returns its text.

    Parameters:
    - element (str): The tag name of the HTML element to search for (e.g., 'h1', 'div').
    - soup (BeautifulSoup): A BeautifulSoup object containing the parsed HTML document.

    Returns:
    - str: The text of the first occurrence of the specified element if found; otherwise, an empty string.
    """
    result = soup.find(element)
    if result:
        return result.text
    else:
        print(f"No element {element} found.")
        return ""

### define soup
soup = BeautifulSoup(html_content, 'lxml')

### get title
title = get_html_element('h1', soup) # for front matter
title_name = title.lower().replace(" ", "-") # for filename

if title == "":
    print("no title")
    sys.exit()

### get subtitle
subtitle = get_html_element('h2', soup) # for front matter

if subtitle == "":
    print("no subtitle")
    sys.exit()

In the above code:

  • We create the BeautifulSoup object.
  • We extract the title in two forms: one for the front matter of the Jekyll (or other) post, the other for the filename.
  • We extract the subtitle for the front matter.
  • If either the title or the subtitle is missing, we abort the script.

If you don’t always have subtitles in your posts, remove the sys.exit() line when finding the subtitle.
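
One caveat with the filename: replacing spaces alone leaves punctuation such as ":" or "?" in title_name, which can produce awkward or invalid filenames. Here is a minimal sketch of a stricter slug, assuming you only want lowercase ASCII letters, digits, and hyphens. The function name slugify is hypothetical and not part of the original script.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a post title into a filename-friendly slug."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower())
    # Remove hyphens left over at either end.
    return slug.strip("-")
```

With this in place, title_name = slugify(title) would stand in for the lower().replace() line.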

Text separators are another issue. <hr> elements are automatically converted to the Markdown separator, but Medium does not use the <hr> element; it marks separators with the attribute role="separator" instead. So I find all the elements with role="separator" and replace them with <hr>.

### text separators
# Find all elements with role="separator"
separator_elements = soup.find_all(attrs={"role": "separator"})

# replace with <hr> element, markdown recognizes this
for element in separator_elements:
    html_content = html_content.replace(str(element), "<hr>")
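
One thing to watch: str(element) is the parser's re-serialization of the tag, so it may not match the raw text in html_content character for character (attribute quoting, for instance, can differ), in which case the replace silently does nothing. A minimal alternative sketch edits the soup itself and serializes it afterwards. The function name replace_separators is hypothetical, and 'html.parser' is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup


def replace_separators(html: str) -> str:
    """Swap every role="separator" element for a plain <hr> tag."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all(attrs={"role": "separator"}):
        # replace_with() edits the parsed tree, so no string matching is needed.
        element.replace_with(soup.new_tag("hr"))
    # Serialize the modified tree back to HTML for html2text.
    return str(soup)
```

Usage would be html_content = replace_separators(html_content), run before the conversion step.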

Code blocks in Medium are wrapped in the <pre> element. html2text has no idea what to do with that. However, by using a dirty hack and wrapping the <pre> element with code markers ```, the resulting code blocks are preserved. You’ll still need to state what syntax highlighting you want manually, though.

### code blocks
html_content = html_content.replace("<pre", "```<pre")
html_content = html_content.replace("</pre>", "</pre>```")
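
The manual step of adding a syntax-highlighting language can be partly automated after conversion. The sketch below assumes the fences end up on their own lines in the Markdown output and that every code block in the post uses the same language. The names tag_code_fences and FENCE are hypothetical.

```python
FENCE = "`" * 3  # the three-backtick code marker


def tag_code_fences(markdown: str, language: str = "python") -> str:
    """Append a language name to every opening fence that lacks one."""
    tagged_lines = []
    inside_block = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # A bare fence seen outside a block opens one: tag it.
            if not inside_block and stripped == FENCE:
                line = line.replace(FENCE, FENCE + language, 1)
            # Every fence line toggles between "inside" and "outside".
            inside_block = not inside_block
        tagged_lines.append(line)
    return "\n".join(tagged_lines)
```

It would run on markdown_content after converter.handle(); posts that mix languages would still need manual touch-ups.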

Create file

Assuming you want today’s date in your filename (for Jekyll), first you need to format the date:

from datetime import datetime

today = datetime.now()
formatted_date_str = today.strftime("%Y-%m-%d")

And with the title from before:

filename = f"{formatted_date_str}-{title_name}.md"

### save file
with open(f"_posts/{filename}", 'w', encoding='utf-8') as file:
    file.write(markdown_content)
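
Note that this writes only the converted body; the title and subtitle extracted earlier are not used yet. Here is a minimal sketch of building Jekyll-style front matter to prepend. The keys layout, title, subtitle, and date, as well as the function name build_front_matter, are assumptions; adjust them to whatever your theme expects. json.dumps is used to quote the values, since a JSON string is also a valid double-quoted YAML scalar.

```python
import json
from datetime import datetime


def build_front_matter(title: str, subtitle: str, date: datetime) -> str:
    """Return a YAML front matter block followed by one blank line."""
    lines = [
        "---",
        "layout: post",  # assumed layout name, change to match your theme
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"subtitle: {json.dumps(subtitle, ensure_ascii=False)}",
        f"date: {date.strftime('%Y-%m-%d')}",
        "---",
        "",
    ]
    return "\n".join(lines) + "\n"
```

The write call would then become file.write(build_front_matter(title, subtitle, today) + markdown_content).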

Full code

from datetime import datetime
import html2text
import requests
from bs4 import BeautifulSoup
import sys

def get_html_element(element, soup) -> str:
    """
    Searches for the first occurrence of a specified HTML element in a BeautifulSoup object and returns its text.

    Parameters:
    - element (str): The tag name of the HTML element to search for (e.g., 'h1', 'div').
    - soup (BeautifulSoup): A BeautifulSoup object containing the parsed HTML document.

    Returns:
    - str: The text of the first occurrence of the specified element if found; otherwise, an empty string.
    """
    result = soup.find(element)
    if result:
        return result.text
    else:
        print(f"No element {element} found.")
        return ""

### get html content from url
url = "https://medium.com/@dsavir-h/flood-of-tears-2edc7bcf306b"
response = requests.get(url)
html_content = response.text

### define soup
soup = BeautifulSoup(html_content, 'lxml')

### get title
title = get_html_element('h1', soup) # for front matter
title_name = title.lower().replace(" ", "-") # for filename

if title == "":
    print("no title")
    sys.exit()

### get subtitle
subtitle = get_html_element('h2', soup) # for front matter

if subtitle == "":
    print("no subtitle")
    sys.exit()

### code blocks
html_content = html_content.replace("<pre", "```<pre")
html_content = html_content.replace("</pre>", "</pre>```")

### text separators
# Find all elements with role="separator"
separator_elements = soup.find_all(attrs={"role": "separator"})

# replace with <hr> element, markdown recognizes this
for element in separator_elements:
    html_content = html_content.replace(str(element), "<hr>")

### convert to markdown
converter = html2text.HTML2Text()
converter.ignore_links = False # preserve hyperlinks
markdown_content = converter.handle(html_content)

### get formatted date
today = datetime.now()
formatted_date_str = today.strftime("%Y-%m-%d")

### save file to _posts folder
filename = f"{formatted_date_str}-{title_name}.md"

with open(f"_posts/{filename}", 'w', encoding='utf-8') as file:
    file.write(markdown_content)

Summary

This is a great starting point, and you can modify it as suits your needs.

There are still things I want to add, such as images, GIFs, automatic front matter (including the tags), and automatically trimming the Medium-specific material at the start and end, so I will keep you updated :)

Check out the full code on GitHub here. What do you use? Any tips and tricks to suggest? I would love to hear from you.

146 days (more than 4 and a half months!) and still counting. #BringThemHome.

Check out my free and open source online game Space Short. If you like my stories and site, you can also buy me a coffee.

Danielle H

I started programming in LabVIEW and Matlab, and quickly expanded to include Android, Swift, Flutter, Web(PHP, HTML, Javascript), Arduino and Processing.