Importing outside writing

I noticed one of my old posts on the Stack Overflow blog no longer has paragraphs. It’s a sure sign the blog has moved to a new Content Management System (CMS) and the import process was broken. Since I wanted to reference that post in a new blog post I’m working on, I figured I’d import it to my blog and properly format it.

Is this legal? Well, I don’t own the post since it was written for my employer on their platform. But I think the odds anyone will care are super low. If they cared, they’d have made sure the formatting wasn’t broken. And if they ask, I’ll take it down.

I spent a good deal of time trying to automate the import, but HTML is a rotten markup language for parsing. In the end, I copied and pasted into Emacs and used a bit of elisp I wrote a few years ago to convert HTML into Markdown:


(defun my/query-replace-regexp (regexp to-string &optional delimited start end)
  "Replace some things after point matching REGEXP with TO-STRING.  As each
match is found, the user must type a character saying what to do with
it. This is a modified version of the standard `query-replace-regexp'
function in `replace.el'. This modified version defaults to operating on the
entire buffer instead of working only from POINT to the end of the
buffer. For more information, see the documentation of `query-replace-regexp'."
  (interactive
   (let ((common
          (query-replace-read-args
           (concat "Query replace"
                   (if current-prefix-arg " word" "")
                   " regexp"
                   (if (and transient-mark-mode mark-active) " in region" ""))
           t)))
     (list (nth 0 common) (nth 1 common) (nth 2 common)
           ;; Use the active region if there is one; otherwise the whole buffer.
           (if (and transient-mark-mode mark-active)
               (region-beginning)
             (buffer-end -1))
           (if (and transient-mark-mode mark-active)
               (region-end)
             (buffer-end 1)))))
  (perform-replace regexp to-string t t delimited nil nil start end))
;; Replace the default key mapping
;; But I don't actually want this.
; (define-key esc-map [?\C-%] 'my/query-replace-regexp)

(defun my/scrub-html ()
  "Replace things in HTML documents with the Markdown equivalent and kill annoying things I don't want."
  ;; `interactive' is what makes this callable via M-x.
  (interactive)
  (defun replace-all (regex to-string)
    (perform-replace regex to-string t t nil nil nil (point-min) (point-max)))
  (replace-all "<span [^>]*>" "")
  (replace-all "</span>" "" )
  (replace-all "</p>" "\n\n" )
  (replace-all "<p>" "" )
  (replace-all "<br>" "\n\n" )
  (replace-all "<br/>" "\n\n" )
  (replace-all "</h2>" "\n\n" )
  (replace-all "<h2>" "\n## " )
  (replace-all "</strong>" "**" )
  (replace-all "<strong>" "**" )
  (replace-all "</blockquote>" "\n\n" )
  (replace-all "<blockquote>" "> " )
  ;; A single underscore is emphasis in Markdown; a double one would render as bold.
  (replace-all "</em>" "_" )
  (replace-all "<em>" "_" )
  (replace-all "</i>" "_" )
  (replace-all "<i>" "_" )
  (replace-all "<!---->" "")

  (replace-all "<a [^>]*href=\"\\([^\"]*\\)\"[^>]*>\\([^<]*\\)</a>" "[\\2](\\1)")
  (replace-all "…" "...")
  (replace-all "’" "'")
  (replace-all "[“”]" "\"")
  (replace-all " - " "&mdash;"))

If you put that in a buffer and execute it (M-x eval-buffer), you can do M-x my/scrub-html in a buffer with HTML to get an interactive find-and-replace session. It’s annoying to do this all manually, but I only have 7 more posts to do, so that’s probably what I’ll do.
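
If the prompting ever gets too tedious, the same kind of rules could be applied without the query step. What follows is only a rough sketch, not something I have run against the real posts: the names `my/scrub-html-rules` and `my/scrub-html-string` are made up for illustration, the rule list is a small subset of the ones above, and it assumes the input is simple enough that plain regexps don't mangle it.

```elisp
;; Hypothetical non-interactive variant; these names are illustrative only.
(defvar my/scrub-html-rules
  '(("<span [^>]*>" . "")
    ("</span>" . "")
    ("</p>" . "\n\n")
    ("<p>" . "")
    ("</?strong>" . "**")
    ("<a [^>]*href=\"\\([^\"]*\\)\"[^>]*>\\([^<]*\\)</a>" . "[\\2](\\1)"))
  "Alist of (REGEXP . REPLACEMENT) pairs, applied in order.")

(defun my/scrub-html-string (html)
  "Return HTML with every rule in `my/scrub-html-rules' applied, without prompting."
  (with-temp-buffer
    (insert html)
    (let ((case-fold-search nil))
      (dolist (rule my/scrub-html-rules)
        ;; Each rule starts over from the top of the buffer.
        (goto-char (point-min))
        (while (re-search-forward (car rule) nil t)
          ;; FIXEDCASE is t so the replacement text is inserted as written;
          ;; \\1 and \\2 still expand to the matched groups.
          (replace-match (cdr rule) t))))
    (buffer-string)))

;; (my/scrub-html-string "<p>Hello <strong>world</strong></p>")
;; returns "Hello **world**\n\n"
```

The trade-off is that nothing asks for confirmation, so a bad rule quietly damages every match. That is the reason the interactive version is still the safer choice for a handful of posts.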