Medium to Markdown via Python: Polish

Danielle H
6 min read · Mar 8, 2024


If you write on Medium and also on a site that supports Markdown, you probably have your own method to copy from one to the other. In my last article, I introduced a basic Python script to convert Medium to Markdown. Now we’ll add some polish.

Polish the code (generated by Dall-E)

Check out the full code on GitHub here.

In the last article, we created the following script:

from datetime import datetime
import html2text
import requests
from bs4 import BeautifulSoup
import sys

def get_html_element(element, soup) -> str:
    """
    Searches for the first occurrence of a specified HTML element in a BeautifulSoup object and returns its text.

    Parameters:
    - element (str): The tag name of the HTML element to search for (e.g., 'h1', 'div').
    - soup (BeautifulSoup): A BeautifulSoup object containing the parsed HTML document.

    Returns:
    - str: The text of the first occurrence of the specified element if found; otherwise, an empty string.
    """
    result = soup.find(element)
    if result:
        return result.text
    else:
        print(f"No element {element} found.")
        return ""

### get html content from url
url = "https://medium.com/@dsavir-h/flood-of-tears-2edc7bcf306b"
response = requests.get(url)
html_content = response.text

### define soup
soup = BeautifulSoup(html_content, 'lxml')

### get title
title = get_html_element('h1',soup) # for front matter
title_name = title.lower().replace(" ","-") # for filename

if (title == ""):
print("no title")
sys.exit()

### get subtitle
subtitle = get_html_element('h2',soup) # for front matter

if (subtitle == ""):
print("no subtitle")
sys.exit()

### code blocks
html_content = html_content.replace("<pre", "```<pre")
html_content = html_content.replace("</pre>", "</pre>```")

### text separators
# Find all elements with role="separator"
separator_elements = soup.find_all(attrs={"role": "separator"})

# replace with <hr> element, markdown recognizes this
for element in separator_elements:
    html_content = html_content.replace(str(element), "<hr>")

### convert to markdown
converter = html2text.HTML2Text()
converter.ignore_links = False # preserve hyperlinks
markdown_content = converter.handle(html_content)

### get formatted date
today = datetime.now()
formatted_date_str = today.strftime("%Y-%m-%d")

### save file to _posts folder
filename = f"{formatted_date_str}-{title_name}.md"

with open(f"_posts/{filename}", 'w', encoding='utf-8') as file:
file.write(markdown_content)

This gives us a Markdown file with all our text and code blocks. However, it has lots of useless stuff, like the following opening paragraphs:

[Sign
in](https://medium.com/m/signin?operation=login&redirect=https%3A%2F%2Fdsavir-h.medium.com%2Fflood-
of-tears-2edc7bcf306b&source=post_page---two_column_layout_nav
-----------------------global_nav-----------)

[](https://medium.com/?source=---two_column_layout_nav----------------------------------)

[Write](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fmedium.com%2Fnew-
story&source=---two_column_layout_nav-----------------------
new_post_topnav-----------)

[](https://medium.com/search?source=---two_column_layout_nav----------------------------------)

We definitely don’t need this.

The Beginning

The actual article text (without the title and subtitle) starts after the word “Share”. So let’s find the first occurrence of the word “Share” and cut all the text preceding it:

def cut_text_at_marker(marker: str, text: str):
    """
    Cuts everything from the beginning of the text up to and including the marker,
    and returns what comes after it.
    """
    # Find the index of the substring
    cut_off_index = text.find(marker)

    # Slice the string if the substring is found
    newText = ""
    if cut_off_index != -1:
        newText = text[cut_off_index + len(marker):]

    return newText

# after converting to Markdown:

### Cut off beginning stuff
markdown_content = cut_text_at_marker("Share", markdown_content)

And now our converted file starts at the beginning :)
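Just to see what the cut does, here’s a toy call (the sample string is made up, not real Medium output):

text = "Header junk Share The real article starts here."
print(cut_text_at_marker("Share", text))  # prints " The real article starts here." (note the leading space)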

The End

Our converted article also ends with a lot of useless and strange links. First come our tags, in their odd Medium form:

[ Processing](https://medium.com/tag/processing?source=post_page-----
2edc7bcf306b---------------processing-----------------)

What kind of link is that?

In any case, they don’t work in my Markdown posts.

After that comes the footer with the claps, author bio, and so on. So what we need to do is:

  • Get our tags (for the front matter)
  • Cut everything from the first tag to the end

To get the tags, we use a regular expression. They are one or two words in square brackets…

…like every other link in the post.
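To see the problem, here’s a quick sketch on a made-up line of converted Markdown (the pattern is the same one we’ll use below); a naive bracket matcher picks up ordinary links just as happily as tags:

import re

sample = "a [link](https://example.com) in the text, and a tag [ Processing](https://medium.com/tag/processing)"
print(re.findall(r"\[\s*([^\]]+?)\s*\]", sample))
# ['link', 'Processing']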

So, what we’ll do is cut off at something that we know is just after the tags: \--, which is how html2text marks that we have arrived at the footer of the post.

Then, we’ll use regexp to get the last 5 links, and those will be our tags.

And after that, we’ll cut off the end again at the first tag.

I always put 5 tags. If you put a different number of tags in your post, consider passing this as a parameter to your script.
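A minimal sketch of that, using argparse (the --num-tags flag and its name are my own suggestion, not part of the original script):

import argparse

parser = argparse.ArgumentParser(description="Convert a Medium post to Markdown")
parser.add_argument("--num-tags", type=int, default=5,
                    help="how many tags to take from the end of the converted post")
args = parser.parse_args()

# later on, instead of the hard-coded 5:
# tags = matches[-args.num_tags:]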

So let's modify our cut_text_at_marker function. For the regexp, ask ChatGPT what the pattern would be :)

def cut_text_at_marker(marker: str, text: str, beginning: bool):
    """
    Cuts the text at the specified marker and returns the resulting substring. The function can return the
    text after the first occurrence of the marker (if beginning is True) or before the last occurrence
    of the marker (if beginning is False).
    """
    # Find the index of the substring
    if beginning:
        cut_off_index = text.find(marker)
    else:
        cut_off_index = text.rfind(marker)
    # Slice the string if the substring is found
    newText = ""
    if cut_off_index != -1:
        if beginning:
            newText = text[cut_off_index + len(marker):]
        else:
            newText = text[:cut_off_index]
    return newText

### Cut end part I
markdown_text = cut_text_at_marker(r'\--', markdown_text, False)

### get tags
pattern = r"\[\s*([^\]]+?)\s*\]" # text inside square brackets
matches = re.findall(pattern, markdown_text)
tags = matches[-5:] # get last 5

### Cut end part II
pattern = r'\[\s*{}'.format(re.escape(tags[0])) # first tag in []
all_patterns = list(re.finditer(pattern, markdown_text)) # find all occurrences
first_tag = all_patterns[-1] # take the last one
second_cutoff = first_tag.start() # and finally..
if second_cutoff != -1:
    markdown_text = markdown_text[:second_cutoff]

And we now have a proper end.
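If you want to see the mechanics in isolation, here’s the same end-trimming logic on an invented tail of a converted post (the strings are made up):

import re

tail = (
    "Real content ends here.\n\n"
    "[ Processing](https://medium.com/tag/processing)\n"
    "[ Python](https://medium.com/tag/python)\n"
)

matches = re.findall(r"\[\s*([^\]]+?)\s*\]", tail)
tags = matches[-2:]  # two tags in this toy example instead of five
pattern = r'\[\s*{}'.format(re.escape(tags[0]))
cutoff = list(re.finditer(pattern, tail))[-1].start()
print(tail[:cutoff])  # prints "Real content ends here." followed by a blank line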

The Empty

Last time, I showed a dirty hack to preserve code blocks. It works, but it leaves empty lines in code blocks, like this:





import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False # preserve hyperlinks

# Convert the HTML to Markdown
markdown_content = converter.handle(html_content)

Critical? No. But annoying nonetheless.

So let’s ask our GPT regexp expert how to find these empty lines and remove them:

### code blocks part II: remove empty lines
pattern = r'(^```$)(\s*\n\s*)+'
# Replace matches with just the "```" line
markdown_text = re.sub(pattern, r'\1\n', markdown_text, flags=re.MULTILINE)

Much better :)
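As a sanity check, here’s the substitution on a tiny invented snippet:

import re

snippet = "```\n\n\n\nimport html2text\n```"
pattern = r'(^```$)(\s*\n\s*)+'
print(re.sub(pattern, r'\1\n', snippet, flags=re.MULTILINE))
# ```
# import html2text
# ```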

Summary

Full code is now:

from datetime import datetime
import html2text
import requests
from bs4 import BeautifulSoup
import sys
import re


def get_html_element(element, soup) -> str:
    """
    Searches for the first occurrence of a specified HTML element in a BeautifulSoup object and returns its text.

    Parameters:
    - element (str): The tag name of the HTML element to search for (e.g., 'h1', 'div').
    - soup (BeautifulSoup): A BeautifulSoup object containing the parsed HTML document.

    Returns:
    - str: The text of the first occurrence of the specified element if found; otherwise, an empty string.
    """
    result = soup.find(element)
    if result:
        return result.text
    else:
        print(f"No element {element} found.")
        return ""

def cut_text_at_marker(marker: str, text: str, beginning: bool):
    """
    Cuts the text at the specified marker and returns the resulting substring. The function can return the
    text after the first occurrence of the marker (if beginning is True) or before the last occurrence
    of the marker (if beginning is False).
    """
    # Find the index of the substring
    if beginning:
        cut_off_index = text.find(marker)
    else:
        cut_off_index = text.rfind(marker)
    # Slice the string if the substring is found
    newText = ""
    if cut_off_index != -1:
        if beginning:
            newText = text[cut_off_index + len(marker):]
        else:
            newText = text[:cut_off_index]
    return newText

### get html content from url
url = "https://medium.com/@dsavir-h/flood-of-tears-2edc7bcf306b"
response = requests.get(url)
# you can check response.status_code here if you like
html_content = response.text

### define soup
soup = BeautifulSoup(html_content, 'lxml')

### get title
title = get_html_element('h1',soup) # for front matter
title_name = title.lower().replace(" ","-") # for filename
title_name = title_name.replace(":","") # remove : - from experience :)
title_name = title_name.replace(".","") # remove .

if (title == ""):
print("no title")
sys.exit()

### get subtitle
subtitle = get_html_element('h2',soup) # for front matter

if (subtitle == ""):
print("no subtitle")
sys.exit()

### code blocks
html_content = html_content.replace("<pre", "```<pre")
html_content = html_content.replace("</pre>", "</pre>```")

### text separators
# Find all elements with role="separator"
separator_elements = soup.find_all(attrs={"role": "separator"})

# replace with <hr> element, markdown recognizes this
for element in separator_elements:
    html_content = html_content.replace(str(element), "<hr>")

### convert to markdown
converter = html2text.HTML2Text()
converter.ignore_links = False # preserve hyperlinks
markdown_text = converter.handle(html_content)

### cut end
markdown_text = cut_text_at_marker(r'\--', markdown_text, False)

### cut beginning
markdown_text = cut_text_at_marker('Share',markdown_text,True)

### get tags
pattern = r"\[\s*([^\]]+?)\s*\]"
matches = re.findall(pattern, markdown_text)
tags = matches[-5:]


### cut end part II: remove the tags from the content
pattern = r'\[\s*{}'.format(re.escape(tags[0]))
all_patterns = list(re.finditer(pattern, markdown_text))
first_tag = all_patterns[-1]
second_cutoff = first_tag.start()
if second_cutoff != -1:
    markdown_text = markdown_text[:second_cutoff]

### code blocks part II: remove empty lines
pattern = r'(^```$)(\s*\n\s*)+'
# Replace matches with just the "```" line
markdown_text = re.sub(pattern, r'\1\n', markdown_text, flags=re.MULTILINE)

### get formatted date
today = datetime.now()
formatted_date_str = today.strftime("%Y-%m-%d")

### save file
filename = f"{formatted_date_str}-{title_name}.md"
print(filename)

with open(filename, 'w', encoding='utf-8') as file:
    file.write(markdown_text)

We now have a script that converts to Markdown and also polishes the result, putting your new, converted post at your fingertips.

Next time, we’ll create the front matter and remove the hard-coded URL.

153 days (5 months!😢) and still waiting. #BringThemHome.

Check out my free and open source online game Space Short. If you like my stories and site, you can also buy me a coffee.

