初心 shoshin

Converting Kindle Notes to Markdown

A hacky Python script to parse Amazon's janky HTML

Published:   2023-07-20 09:46
Updated:   2023-09-03 11:49
Words: 1618
Tags:   #perl  #python  #regex  #tech 

Kindle1 has a really cool feature that allows you to email yourself all the notes you've taken for a book. They do this by attaching an HTML file with your notes to the email.

I store all of my notes in Obsidian, which uses Markdown. I thought, "Okay, I'll just use Pandoc to convert the HTML to Markdown. No problem!"

pandoc -w markdown_strict -s -r html my_notes.html -o my_notes.md

Enter: Amazon's Dumpster Fire HTML

This basically just output a bunch of plain text—no title, headings, lists, etc. That's when I realized that the notes file that Amazon sent me uses absolutely zero semantic HTML, opting instead for the ole divs-for-everything approach to web design 2. Lamentably, I thought, "Well, I guess it's time to bust out some Perl."

Here's an example of what I want my notes to look like. From a book I recently read, Moonwalking with Einstein:

# Moonwalking with Einstein

## Notes

### ONE: THE SMARTEST MAN IS HARD TO FIND

+ *Page 13*: The reason for the monitored decline in human memory performance is because we actually do anti-Olympic training.

### TWO: THE MAN WHO REMEMBERED TOO MUCH

+ *Page 27*: "Somewhere in your mind there's a trace from everything you've ever seen."
+ *Page 42*: if there were one precept that could be said to govern his life, it is that one's highest calling is to engage in enriching escapades at every turn.
+ *Page 44*: "It is always to associate the sound of a person's name with something you can clearly imagine. It's all about creating a vivid image in your mind"
+ *Page 44*: "Baker/baker paradox."

Now it's just a matter of converting Amazon's HTML to the above.

Easier said than done. Here's the HTML corresponding to the above that Amazon sent me:

<div class="sectionHeading">
    ONE: THE SMARTEST MAN IS HARD TO FIND
</div><div class="noteHeading">
    Highlight(<span class="highlight_yellow">yellow</span>) - Page 13 · Location 287
</div>
<div class="noteText">
    The reason for the monitored decline in human memory performance is because we actually do anti-Olympic training.
</div><div class="sectionHeading">
    TWO: THE MAN WHO REMEMBERED TOO MUCH
</div><div class="noteHeading">
    Highlight(<span class="highlight_yellow">yellow</span>) - Page 27 · Location 472
</div>
<div class="noteText">
    Somewhere in your mind there’s a trace from everything you’ve ever seen.”
</div><div class="noteHeading">
    Highlight(<span class="highlight_yellow">yellow</span>) - Page 42 · Location 697
</div>
<div class="noteText">
    if there were one precept that could be said to govern his life, it is that one’s highest calling is to engage in enriching escapades at every turn.
</div><div class="noteHeading">
    Highlight(<span class="highlight_blue">blue</span>) - Page 44 · Location 721
</div>
<div class="noteText">
    “It is always to associate the sound of a person’s name with something you can clearly imagine. It’s all about creating a vivid image in your mind
</div><div class="noteHeading">
    Highlight(<span class="highlight_yellow">yellow</span>) - Page 44 · Location 728
</div>
<div class="noteText">
    “Baker/ baker paradox.”
</div>

Genius.3

The Allure of Regular Expressions

I'm well aware of the fact that regular expressions are insufficient to parse HTML. Regular expressions are for parsing regular languages, without context, which HTML is not.

But, alas, regexes are like sirens leading unwary sailors astray. "Just one line," they sing. I oblige. They call out again, "There you go. That wasn't so bad. Just one more." Rapt in their pithy syntax, I indulge myself further.

cat my_notes.html |\
perl -pe 's|\n||g' | \
perl -pe 's|<div class="sectionHeading">(.*?)</div>|\n\n##\1\n\n|g' | \
perl -pe 's|<div class="noteHeading">|+|g' | \
perl -pe 's&Highlight\(<span class="highlight_(yellow|blue|orange)">(yellow|blue|orange)</span>\)&&g' | \
perl -pe 's| · Location \d+</div><div class="noteText">|:|g' | \
perl -pe 's|</div>||g' | \
perl -pe 's|\+\s+Note -|\n+ **Note** — |g' | \
perl -pe 's|\+\s+- Page (\d+):|\n+ Page \1:|g' | \
perl -pe 's|(Page \d+):|*\1*:|g' | \
tail -n +2 | \
pbcopy

Here's an explanation of what each of these commands does:

Should've Used BeautifulSoup From the Get-go

The above solution worked fine until I got to a book that had no page numbers. It turns out you can have any variation of page and/or chapter number. Here's the function I use to parse that:

def parse_chapter(text):
    # There is NO chapter number, but there is A page number. Example ->
    # Highlight(<span class="highlight_orange">orange</span>) - Page xxvii · Location 381
    if '>' not in text and 'Page' in text:
        page = text.split('Page')[1].split('·')[0].strip()
        return None, page

    # There is A chapter number, but there is NO page number. Example ->
    # Highlight(<span class="highlight_orange">orange</span>) - I > Location 38
    # Note: BS parses the <span>
    if '>' in text and 'Page' not in text:
        chapter = text.split('>')[0].strip()
        chapter = chapter.split('Highlight(orange) -')[1].strip()
        return chapter, None

    # There is A chapter number, and there is A page number. Example ->
    # Highlight(<span class="highlight_orange">orange</span>) - I.1 Enter the two Bishops, [the Archbishop] of Canterbury and [the Bishop of] Ely. > Page 7 · Location 778
    if '>' in text and 'Page' in text:
        chapter = text.split('Highlight(orange) -')[1]\
                      .split('>')[0].strip()
        page = text.split('Page')[1].split('·')[0].strip()
        return chapter, page

    # There is NO chapter number, and there is NO page number
    return None, None

From there, we basically just find the first "section" node (start_node = soup.find('div', {'class': 'sectionHeading'})) and loop until there are no more siblings: while current.next_sibling:.

Github Gist of the entire script. I also have a repository with unit tests, but my unit tests run on my actual notes, which I'd like to keep private.

Footnotes


  1. Overall, I'm much more of a fan of Kobo; however, I use Kindle for 3 reasons: 1. Using multiple libraries (via Overdrive) with Kobo is a nightmare, 2. I can read my Kindle books on my phone, and 3. What this post is about — the ability to email myself the notes I've taken is really convenient. 

  2. Not sure if body counts as semantic HTML, but they don't even use it! (See next footnote). 😒 

  3. They even have a <div class="bodyContainer">, completely redundant to body! I'm convinced that whoever wrote this intentionally made it bad. 

  4. Perl does not use POSIX regexes like sed. As we'll see, this is actually imperative, since there's no way to use the lazy quantifier, ?, in POSIX. See Comparison of regular expression engines

  5. | is my favorite delimiter. It's easier to see, and you have to escape it less. 


Return to top