I noticed one of my old posts on the Stack Overflow blog no longer has paragraphs. It’s a sure sign the blog has moved to a new Content Management System (CMS) and the import process was broken. Since I wanted to reference that post in a new blog post I’m working on, I figured I’d import it to my blog and properly format it.
Is this legal? Well, I don’t own the post since it was written for my employer on their platform. But I think the odds anyone will care are super low. If they cared, they’d have made sure the formatting wasn’t broken. And if they ask, I’ll take it down.
I spent a good deal of time trying to automate the import, but HTML is a rotten markup language for parsing. In the end, I copied and pasted into Emacs and used a bit of elisp I wrote a few years ago to convert HTML into Markdown:
;; https://emacs.stackexchange.com/questions/249/how-to-search-and-replace-in-the-entire-buffer/253#253
(defun my/query-replace-regexp (regexp to-string &optional delimited start end)
  "Replace some things after point matching REGEXP with TO-STRING. As each
match is found, the user must type a character saying what to do with
it. This is a modified version of the standard `query-replace-regexp'
function in `replace.el'. This modified version defaults to operating on the
entire buffer instead of working only from POINT to the end of the
buffer. For more information, see the documentation of `query-replace-regexp'."
  (interactive
   (let ((common
          (query-replace-read-args
           (concat "Query replace"
                   (if current-prefix-arg " word" "")
                   " regexp"
                   (if (and transient-mark-mode mark-active) " in region" ""))
           t)))
     (list (nth 0 common) (nth 1 common) (nth 2 common)
           (if (and transient-mark-mode mark-active)
               (region-beginning)
             (buffer-end -1))
           (if (and transient-mark-mode mark-active)
               (region-end)
             (buffer-end 1)))))
  (perform-replace regexp to-string t t delimited nil nil start end))
;; Replace the default key mapping
;; But I don't actually want this.
; (define-key esc-map [?\C-%] 'my/query-replace-regexp)
(defun my/scrub-html ()
  "Replace things in HTML documents with the Markdown equivalent and kill annoying things I don't want."
  (interactive)
  ;; Helper: query-replace REGEX with TO-STRING across the whole buffer.
  (defun replace-all (regex to-string)
    (perform-replace regex to-string t t nil nil nil (point-min) (point-max)))
  (replace-all "<span [^>]*>" "")
  (replace-all "</span>" "")
  (replace-all "</p>" "\n\n")
  (replace-all "<p>" "")
  (replace-all "<br>" "\n\n")
  (replace-all "<br/>" "\n\n")
  (replace-all "</h2>" "\n\n")
  (replace-all "<h2>" "\n## ")
  (replace-all "</strong>" "**")
  (replace-all "<strong>" "**")
  (replace-all "</blockquote>" "\n\n")
  (replace-all "<blockquote>" "> ")
  (replace-all "</em>" "__")
  (replace-all "<em>" "__")
  (replace-all "</i>" "_")
  (replace-all "<i>" "_")
  (replace-all "<!---->" "")
  ;; Links: capture the href and the link text, then emit [text](href).
  (replace-all "<a [^>]*href=\"\\([^\"]*\\)\"[^>]*>\\([^<]*\\)</a>" "[\\2](\\1)")
  (replace-all "…" "...")
  (replace-all "’" "'")
  (replace-all "[“”]" "\"")
  (replace-all " - " "—"))
If you put that in a buffer and execute it (M-x eval-buffer), you can run M-x my/scrub-html in a buffer with HTML to get an interactive find-and-replace session. It’s annoying to do this all manually, but I only have 7 more posts to do, so that’s probably what I’ll do.
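If the prompts ever get too tedious, the same kind of rules could probably be applied without confirmation. Here is a minimal sketch of that idea, not something I have run against real posts. The names `my/scrub-html-rules` and `my/scrub-html-string` are hypothetical (they are not part of the code above), and the rule list is only a small subset of the replacements. It works on a string in a temporary buffer, using `re-search-forward` and `replace-match` in place of the interactive `perform-replace`.

```elisp
;; Hypothetical, non-interactive variant of the scrubber.
;; Both names below are made up for illustration.
(defvar my/scrub-html-rules
  '(("<span [^>]*>" . "")
    ("</span>" . "")
    ("</p>" . "\n\n")
    ("<p>" . "")
    ("</?strong>" . "**")
    ("<a [^>]*href=\"\\([^\"]*\\)\"[^>]*>\\([^<]*\\)</a>" . "[\\2](\\1)"))
  "Alist of (REGEXP . REPLACEMENT) pairs, applied in order.")

(defun my/scrub-html-string (html)
  "Return HTML with every rule in `my/scrub-html-rules' applied, without prompting."
  (with-temp-buffer
    (insert html)
    (dolist (rule my/scrub-html-rules)
      ;; Restart from the top of the buffer for each rule.
      (goto-char (point-min))
      (while (re-search-forward (car rule) nil t)
        ;; FIXEDCASE is t so the replacement text is inserted as written;
        ;; \\1 and \\2 still refer to the captured groups.
        (replace-match (cdr rule) t)))
    (buffer-string)))
```

For example, `(my/scrub-html-string "<p>Hello <strong>world</strong></p>")` should return `"Hello **world**\n\n"`. The trade-off is that nothing gets reviewed before it is replaced, so a bad regex can quietly mangle a post.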