Turning a WordPress Blog into a Book: Part One
Lately my lovely wife has been asking me to help her turn her blog (a WordPress installation like this one that I maintain for her) into a book. It’s a pretty daunting task, since her blog would easily surpass 1,000 pages if printed. After a few days of trying out different services and a variety of search terms, I just couldn’t find a good match for her blog.
Most services that turn blogs into books require you to use one of the big hosting sites (like yourblog.wordpress.com). Of those that will take a simple WordPress XML export, the only one I could find that could handle the enormous file was a site called FastPencil. A universal problem, though, was finding a service that would import the file AND retain some semblance of layout: pictures alongside text, block quotes, centering, etc.
Then I discovered a project called WPTEX. It is a collection of PHP scripts that asks a few questions and then converts (or attempts to convert) a WordPress blog to LaTeX, which can in turn be converted to a PDF using pdfLaTeX. The only problem was that WPTEX didn’t do a graceful job with my wife’s blog, which is rich in pictures and special formatting.
By this time I had enough information to decide to forge out on my own. I started by writing some C# that would read a WordPress export file (quick and dirty-like) into some data structures. Then I went off looking for an HTML-to-LaTeX converter online. I discovered Pandoc. After installing the Haskell-based markup converter, I started using it to convert my wife’s post titles and content to LaTeX. It was a brilliant success. I then wrote some code to recognize when it needed to download an image to include in the book and to strip out the obvious hyperlinks. Now we’re in the process of manually doing a lot of the centering and layout changes that were so cumbersome using those online services. To my surprise, LaTeX has been very easy to work with. In part two I’ll cover some of the (did I say dirty?) code that made it all come together.
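To give a rough idea of the shape of that pipeline before part two, here is a minimal sketch in Python. It is not the C# code from this project. It assumes a standard WordPress export in which each post is an `<item>` with a `<title>` and a `<content:encoded>` body, and that `pandoc` is installed and on the PATH. The function names (`read_posts`, `image_urls`, `strip_links`, `html_to_latex`) are made up for illustration, and the regular expressions are deliberately naive.

```python
import re
import subprocess
import xml.etree.ElementTree as ET

# Namespace WordPress export files use for the post body (content:encoded).
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def read_posts(export_xml):
    """Return a list of (title, html) pairs, one per <item> in the export."""
    root = ET.fromstring(export_xml)
    posts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        html = item.findtext(CONTENT_NS + "encoded", default="")
        posts.append((title, html))
    return posts


def image_urls(html):
    """Find the src of every <img> tag so the files can be downloaded.

    Naive: only handles double-quoted src attributes.
    """
    return re.findall(r'<img[^>]*\bsrc="([^"]+)"', html)


def strip_links(html):
    """Replace <a ...>text</a> with just its inner content."""
    return re.sub(r"<a\b[^>]*>(.*?)</a>", r"\1", html,
                  flags=re.DOTALL | re.IGNORECASE)


def html_to_latex(html):
    """Convert an HTML fragment to LaTeX by piping it through pandoc."""
    result = subprocess.run(
        ["pandoc", "-f", "html", "-t", "latex"],
        input=html, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With something like this, each post could be run through `strip_links` and then `html_to_latex`, with the URLs from `image_urls` downloaded separately so the images can be included in the book. The hand-tuning of centering and layout still happens afterward in the LaTeX itself.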