Medium to Markdown via Python: Ease of use

Danielle H
Mar 15, 2024

If you write on Medium and also on a site that supports Markdown (Jekyll, Hugo), you probably have your own method of copying posts from one to the other. In an earlier article, I introduced a basic Python script that converts a Medium post to Markdown. In the last article, we polished it.

Now we’ll make it easier to use by automating the front matter and using input parameters.

Easy packaging (Dall-E)

Recap

The script, after the last two articles, looks like this:

from datetime import datetime
import html2text
import requests
from bs4 import BeautifulSoup
import sys
import re


def get_html_element(element, soup) -> str:
    """
    Searches for the first occurrence of a specified HTML element in a BeautifulSoup object and returns its text.

    Parameters:
    - element (str): The tag name of the HTML element to search for (e.g., 'h1', 'div').
    - soup (BeautifulSoup): A BeautifulSoup object containing the parsed HTML document.

    Returns:
    - str: The text of the first occurrence of the specified element if found; otherwise, an empty string.
    """
    result = soup.find(element)
    if result:
        return result.text
    else:
        print(f"No element {element} found.")
        return ""

def cut_text_at_marker(marker: str, text: str, beginning: bool):
    """
    Cuts the text at the specified marker and returns the resulting substring. The function can return the
    text after the first occurrence of the marker (if beginning is True) or before the last occurrence
    of the marker (if beginning is False). If the marker is not found, an empty string is returned.
    """
    # Find the index of the substring
    cut_off_index = 0
    if beginning:
        cut_off_index = text.find(marker)
    else:
        cut_off_index = text.rfind(marker)
    # Slice the string if the substring is found
    newText = ""
    if cut_off_index != -1:
        if beginning:
            newText = text[cut_off_index + len(marker):]
        else:
            newText = text[:cut_off_index]
    return newText

### get html content from url
url = "https://medium.com/@dsavir-h/flood-of-tears-2edc7bcf306b"
response = requests.get(url)
# you can check the response.status_code if you like (see comment)
html_content = response.text

### define soup
soup = BeautifulSoup(html_content, 'lxml')

### get title
title = get_html_element('h1',soup) # for front matter
title_name = title.lower().replace(" ","-") # for filename
title_name = title_name.replace(":","") # remove : - from experience :)
title_name = title_name.replace(".","") # remove .

if title == "":
    print("no title")
    sys.exit()

### get subtitle
subtitle = get_html_element('h2',soup) # for front matter

if subtitle == "":
    print("no subtitle")
    sys.exit()

### code blocks
html_content = html_content.replace("<pre", "```<pre")
html_content = html_content.replace("</pre>", "</pre>```")

### text separators
# Find all elements with role="separator"
separator_elements = soup.find_all(attrs={"role": "separator"})

# replace with <hr> element, markdown recognizes this
for element in separator_elements:
    html_content = html_content.replace(str(element), "<hr>")

### convert to markdown
converter = html2text.HTML2Text()
converter.ignore_links = False # preserve hyperlinks
markdown_text = converter.handle(html_content)

### cut end
markdown_text = cut_text_at_marker('\\--',markdown_text,False) # marker is a literal backslash followed by two dashes

### cut beginning
markdown_text = cut_text_at_marker('Share',markdown_text,True)

### get tags
pattern = r"\[\s*([^\]]+?)\s*\]"
matches = re.findall(pattern, markdown_text)
tags = matches[-5:]


### cut end part II: remove the tags from the content
pattern = r'\[\s*{}'.format(re.escape(tags[0]))
all_patterns = list(re.finditer(pattern, markdown_text))
first_tag = all_patterns[-1]
second_cutoff = first_tag.start()
if second_cutoff != -1:
    markdown_text = markdown_text[:second_cutoff]

### code blocks part II: remove empty lines
pattern = r'(^```$)(\s*\n\s*)+'
# Replace matches with just the "```" line
markdown_text = re.sub(pattern, r'\1\n', markdown_text, flags=re.MULTILINE)

### get formatted date
today = datetime.now()
formatted_date_str = today.strftime("%Y-%m-%d")

### save file
filename = f"{formatted_date_str}-{title_name}.md"
print(filename)

with open(filename, 'w', encoding='utf-8') as file:
    file.write(markdown_text)

Check out the full code in GitHub here.
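The filename step in the script only strips ":" and "." from the title, so other punctuation (question marks, quotes, slashes) would still end up in the filename. As a rough sketch of a more general approach, a regular expression can drop everything that is not a word character, whitespace, or a hyphen. The helper name slugify_title is my own and is not part of the script above.

```python
import re


def slugify_title(title: str) -> str:
    """Turn a post title into a lowercase, hyphen-separated filename slug."""
    slug = title.lower()
    # Drop everything except word characters (letters, digits, underscore),
    # whitespace and hyphens
    slug = re.sub(r"[^\w\s-]", "", slug)
    # Collapse runs of whitespace, underscores or hyphens into a single hyphen
    slug = re.sub(r"[\s_-]+", "-", slug)
    # Remove hyphens left over at either end
    return slug.strip("-")


print(slugify_title("Medium to Markdown via Python: Ease of use"))
# → medium-to-markdown-via-python-ease-of-use
```

If you try this, title_name = slugify_title(title) could stand in for the three replace calls, assuming you are happy with the slightly different slugs it produces.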

Front matter

We already have most of what we need for the front matter:

  • We got the tags of the post
  • We have the title and subtitle. In my Jekyll site, I don’t have subtitles. I either add them to the title or add them to the description.
  • Category: that’s a problem. We can hardcode it for now, right next to the hardcoded url at the beginning.

### variables
url = "https://medium.com/@dsavir-h/flood-of-tears-2edc7bcf306b"
category = "processing"

### ... all the other stuff

### add front matter content
tags_str = ", ".join([f'"{tag}"' for tag in tags])

front_matter = f"""---
layout: post
title: "{title}"
categories: [{category}]
tags: [{tags_str}]
description: {title} {subtitle}
comments: true
---
"""
markdown_text = front_matter + markdown_text

### save file etc.

Ta-da! Got our front matter. For example, the resulting front matter for one of my posts is this:

---
layout: post
title: "How to convert Medium to Markdown"
categories: [blogging]
tags: ["Medium", "Markdown", "Jekyll", "Vscode", "Stories"]
description: How to convert Medium to Markdown Using VSCode keybindings
comments: true
---
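One thing to watch for: the description line is written without quotes, so a title that contains a colon (like the title of this very post) or a double quote can produce front matter that a YAML parser rejects. Below is a minimal sketch of one way around it, assuming we are fine with every free-text value being double-quoted. It uses json.dumps to do the quoting and escaping, relying on YAML generally accepting JSON-style double-quoted strings. The function name build_front_matter is hypothetical and not from the script.

```python
import json


def build_front_matter(title: str, subtitle: str, category: str, tags: list[str]) -> str:
    """Build Jekyll front matter, quoting free-text values so the YAML stays valid."""
    # json.dumps returns a double-quoted string with inner quotes and
    # backslashes escaped; ensure_ascii=False keeps non-ASCII text readable
    description = f"{title} {subtitle}".strip()
    tags_str = ", ".join(json.dumps(tag, ensure_ascii=False) for tag in tags)
    lines = [
        "---",
        "layout: post",
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"categories: [{category}]",
        f"tags: [{tags_str}]",
        f"description: {json.dumps(description, ensure_ascii=False)}",
        "comments: true",
        "---",
        "",  # so the result ends with a newline after the closing ---
    ]
    return "\n".join(lines)


print(build_front_matter(
    "Medium to Markdown via Python: Ease of use",
    "Automating the front matter",
    "blogging",
    ["Medium", "Markdown"],
))
```

The category is left unquoted here because it comes from us rather than from the post, but it could be passed through json.dumps in the same way.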

Input arguments

On the other hand, hardcoding is annoying. It is much better to use input arguments: instead of running python convert.py, we can run the script with python convert.py url category, e.g.

python convert.py https://medium.com/@dsavir-h/flood-of-tears-2edc7bcf306b processing

and then we don’t need to edit our script every time we want to convert a post.

### get url and category from command line
if len(sys.argv) < 3:
    print("Usage: python convert.py <URL> <category>")
    sys.exit(1)

url = sys.argv[1]
category = sys.argv[2]
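Reading sys.argv directly is enough for two arguments. If the script grows more options later, the standard library's argparse module gives us the usage message and the missing-argument check for free. Here is a small sketch of the same two positional arguments with argparse. The function name parse_args and the help texts are my own choices, not part of the script.

```python
import argparse


def parse_args(argv=None):
    """Parse the post URL and the category from the command line."""
    parser = argparse.ArgumentParser(
        description="Convert a Medium post to a Markdown file with front matter."
    )
    parser.add_argument("url", help="URL of the Medium post")
    parser.add_argument("category", help="category to put in the front matter")
    # With argv=None, argparse reads the arguments from sys.argv[1:]
    return parser.parse_args(argv)


# Passing a list explicitly makes the function easy to try without a shell
args = parse_args(["https://medium.com/@dsavir-h/flood-of-tears-2edc7bcf306b", "processing"])
print(args.url)
print(args.category)
```

In the script itself we would call parse_args() with no arguments and then use args.url and args.category where url and category are used now.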

Full code

from datetime import datetime
import html2text
import requests
from bs4 import BeautifulSoup
import sys
import re


def get_html_element(element, soup) -> str:
    """
    Searches for the first occurrence of a specified HTML element in a BeautifulSoup object and returns its text.
    """
    result = soup.find(element)
    if result:
        return result.text
    else:
        print(f"No element {element} found.")
        return ""

def cut_text_at_marker(marker: str, text: str, beginning: bool):
    """
    Cuts the text at the specified marker and returns the resulting substring. The function can return the
    text after the first occurrence of the marker (if beginning is True) or before the last occurrence
    of the marker (if beginning is False). If the marker is not found, an empty string is returned.
    """
    # Find the index of the substring
    cut_off_index = 0
    if beginning:
        cut_off_index = text.find(marker)
    else:
        cut_off_index = text.rfind(marker)
    # Slice the string if the substring is found
    newText = ""
    if cut_off_index != -1:
        if beginning:
            newText = text[cut_off_index + len(marker):]
        else:
            newText = text[:cut_off_index]
    return newText


### get url and category from command line
if len(sys.argv) < 3:
    print("Usage: python convert.py <URL> <category>")
    sys.exit(1)

url = sys.argv[1]
category = sys.argv[2]

### get html content from url
response = requests.get(url)
html_content = response.text

### define soup
soup = BeautifulSoup(html_content, 'lxml')

### get title
title = get_html_element('h1',soup) # for front matter
title_name = title.lower().replace(" ","-") # for filename
title_name = title_name.replace(":","") # remove :
title_name = title_name.replace(".","") # remove .

if title == "":
    print("no title")
    sys.exit()

### get subtitle
subtitle = get_html_element('h2',soup) # for front matter

if subtitle == "":
    print("no subtitle")
    sys.exit()

### code blocks
html_content = html_content.replace("<pre", "```<pre")
html_content = html_content.replace("</pre>", "</pre>```")

### text separators
# Find all elements with role="separator"
separator_elements = soup.find_all(attrs={"role": "separator"})
# replace with <hr> element, markdown recognizes this
for element in separator_elements:
    html_content = html_content.replace(str(element), "<hr>")

### convert to markdown
converter = html2text.HTML2Text()
converter.ignore_links = False # preserve hyperlinks
markdown_text = converter.handle(html_content)

### cut end
markdown_text = cut_text_at_marker('\\--',markdown_text,False) # marker is a literal backslash followed by two dashes

### cut beginning
markdown_text = cut_text_at_marker('Share',markdown_text,True)

### get tags
pattern = r"\[\s*([^\]]+?)\s*\]"
matches = re.findall(pattern, markdown_text)
tags = matches[-5:]

### cut end part II: remove the tags from the content
pattern = r'\[\s*{}'.format(re.escape(tags[0]))
all_patterns = list(re.finditer(pattern, markdown_text))
first_tag = all_patterns[-1]
second_cutoff = first_tag.start()
if second_cutoff != -1:
    markdown_text = markdown_text[:second_cutoff]

### code blocks part II: remove empty lines
pattern = r'(^```$)(\s*\n\s*)+'
# Replace matches with just the "```" line
markdown_text = re.sub(pattern, r'\1\n', markdown_text, flags=re.MULTILINE)

### get formatted date
today = datetime.now()
formatted_date_str = today.strftime("%Y-%m-%d")

### add front matter content
tags_str = ", ".join([f'"{tag}"' for tag in tags])

front_matter = f"""---
layout: post
title: "{title}"
categories: [{category}]
tags: [{tags_str}]
description: {title} {subtitle}
comments: true
---
"""
markdown_text = front_matter + markdown_text

### save file
filename = f"{formatted_date_str}-{title_name}.md"
print(filename)

with open(filename, 'w', encoding='utf-8') as file:
    file.write(markdown_text)
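A note on the tag step in the script above: it assumes the post has at least one bracketed tag, so tags[0] raises an IndexError on a post where nothing matches. As a hedged sketch, the two tag steps could be folded into one helper that returns the content unchanged when no tags are found. The name split_tags and the sample text in the usage lines are hypothetical, and the helper cuts at the position of the first trailing match instead of searching for the tag text again, which is a small change from the script's approach.

```python
import re


def split_tags(markdown_text: str, count: int = 5):
    """Return (body, tags): the last `count` bracketed texts and the content before them."""
    matches = list(re.finditer(r"\[\s*([^\]]+?)\s*\]", markdown_text))
    if not matches:
        # No bracketed text at all: leave the content untouched
        return markdown_text, []
    tag_matches = matches[-count:]
    tags = [m.group(1) for m in tag_matches]
    # Cut where the first of the trailing tags begins
    body = markdown_text[:tag_matches[0].start()]
    return body, tags


# Made-up sample text, only to show the shape of the result
sample = "Post body.\n\n[Medium](/tag/medium)\n\n[Markdown](/tag/markdown)\n"
body, tags = split_tags(sample, count=2)
print(tags)
```

Like the original, this still treats the last few bracketed texts as tags, so a post with fewer tags than count would pick up ordinary links from the end of the content.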

Summary

There is plenty more to add — images, captions, gifs, GitHub/Gitlab previews, etc. But this does make things much easier, especially paired with VSCode keybindings.

If you know how to add more, I would love to hear from you here or in the repo.

160 days. When will this end? #BringThemHomeNow.

Check out my free and open source online game Space Short. If you like my stories and site, you can also buy me a coffee.

