How to convert a website’s content into simple text files.

20080211.monday   comments=1   propeller_beanie  

Every so often I find the need to convert masses of web pages into simple, editable text files. (Who among us doesn’t?) Programmer that I am, I also want to do this with as little manual intervention as possible.

For example, I recently wanted to gather together some of my Yukon College course notes to give to other instructors. The notes were originally written in HTML, but some people might prefer plain, unadulterated text.

Now, there’s text, and then there’s text. There are a bewildering variety of “lightweight” formats or conventions for specifying headings, emphasis, lists, hyperlinks, and so forth. My favourite is a format called Markdown.

To make a heading in Markdown, just underline it with equal signs or hyphens. To make a bullet list, start each point with an asterisk. To italicize, surround the word with underscores. These are all the same sorts of formatting tricks you might key into a quick e-mail.

(You can do the same in MS Word, if you can spare an hour or two to undo some of Word’s more aggressive auto-corrections.)

Of course, you don’t actually see any of the bullets, italics, or hyperlinks. C’mon it’s just text. Instead, you have the option to — presto-chango — translate Markdown into HTML. Beats typing angle brackets all the live long day.

But today’s exercise is in the other direction. Here are the steps I take to convert a website into Markdown.

  1. “Rip” the website: copy all of its HTML and image content to your computer. On Windows, I use HTTrack. On Linux, something like wget --convert-links --html-extension --mirror --random-wait --wait 3 http://microsoft.com/ will do (consider an extra hard drive or two to rip that site).
  2. Run Aaron Swartz’s html2text.py Python script to convert each ripped HTML file into the equivalent Markdown.
  3. Rename each Markdown text file to something more meaningful than the name typically assigned by HTTrack or wget. The contents of the <title> element makes for a pretty fair filename.

Unfortunately, steps 2 and 3 contain that tedious word “each.” There might be a couple of hundred eaches for one of my course sites. Any time you find yourself doing the same thing over, and over, and over, chances are you can get the computer do it more quickly and with fewer errors. That’s kinda what they’re good at.

So, I wrote some Linux shell script code to automate both steps. The full convert-html-to-md script is part of my in-progress, yet freely-downloadable, Public Domain scripnix project, and depends on some of the project’s other utilities.

If that’s just too much to contemplate, the following is a quick ‘n dirty approximation of the full script. It doesn’t handle filename collisions, and suffers from an excess of hyphenation, but it gets the job done.

#!/bin/bash
# Usage: convert-html-to-md <path-to-html2text.py> <file>[...]
# Convert the specified HTML files into Markdown text-format equivalents
# in the current working directory. The file extension will be .md.txt.
# Requires the html2text.py Python script by Aaron Swartz to convert
# from HTML to Markdown text [www.aaronsw.com/2002/html2text/].
html2text="${1}"
shift

while [ -n "${1}" ] ; do
    # Use the contents of the title element for the filename. In case
    # the title element spans multiple lines, the entire file is first
    # converted to a single line before the sed pattern is applied. Any
    # "unsafe" characters are then replaced with hyphens to produce a
    # valid filename.
    title=$(cat "${1}" | \
            tr -d '\n\r' | \
            sed -nre 's/^.*<title>(.*?)<\/title>.*$/\1\n/ip' | \
            tr "\`~\!@#$%^&*()+={}|[]\\:;\"\'<>?,/ \t" '[-*]')

    # If there's no title, then just use the original filename.
    if [ -z "${title}" ] ; then
        title=$(basename "${1}" .html)
    fi

    # Convert the HTML to Markdown.
    cat "${1}" | python "${html2text}" > "${title}.md.txt"
    shift
done

Your mileage may vary on Mac OS. Without Cygwin, Windows users are better off sticking to their pointee-clickee routine.