1 Introduction

I have always told myself that I should start blogging to keep track of things (projects, side-projects, side-projects of side-projects), but I have a terrible memory and always end up forgetting to do it.

This time around I was thinking: "Hey, I should start blogging about something."

But you know what? I really suck at managing this kind of stuff. Jekyll helps quite a lot, but I still need to write everything in Markdown.

I am not a huge fan of Markdown either: the syntax patterns are weird, formatting tables is a mess, there are no equations, no BibTeX (I also miss that when working with Word), and the list goes on.

Every engineer knows the best way to solve a problem is to overengineer a solution for themselves (this is meant as a joke... even though it sounds too accurate to be just that...).

So, how can we make things better for ourselves and get some decent syntax for writing somewhat large texts?

Here is a tip: it is in the title of this post.

Yes, we can translate LaTeX into HTML, do some stripping, add a new Jekyll header and make things work nicely.

But how do we do that? Pandoc (Section 2), Jekyll (Section 3) and some script glue (Section 4) written in the best programming language ever: Python.

2 Pandoc

Of course each one of us could write a LaTeX parser and do the conversion for ourselves (if we had unlimited time, big brains and were determined to do it).

I do not know about you, but I am the kind who starts writing programs right away, only to realize how big things will end up being.

As usual when doing these things, you realize "ain’t nobody got time for that" (NobodyGotTimeForThis, n.d.).

Started looking for alternatives. Found latex2html and make4ht, which are pretty cool. Tried setting them up, but did not manage to make them work the way I wanted. Started looking for more alternatives.

Found a post on Dr. Joyce’s blog.

Pandoc: a universal markup converter, which can magically translate a subset of LaTeX+BibTeX into HTML+MathJax (yay /o/, equations). It is not perfect, but it does not need to be.

I did not quite manage to get equation numbering, citation styles and cross-references working correctly, although pandoc-xnos could make them work. I think the problem has to do with the glue in Section 4. Maybe I will try again in the future, but for now it does what I need.

The magic command ended up being:

    pandoc --number-sections --mathjax -f latex -t html -s --bibliography=file.bib -o file.html file.tex

  • --number-sections for numbered sections

  • --mathjax to render equations in the browser with MathJax (JavaScript)

  • -f latex to read the input as LaTeX

  • -t html to write the output as HTML

  • -s to produce a standalone HTML document

  • --bibliography=file.bib to get references from the bib file

  • -o file.html to indicate the path to the output file

  • file.tex to indicate the input file
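
The flags above can also be assembled programmatically. Here is a minimal sketch; `build_pandoc_command` and its `citeproc` parameter are hypothetical names of mine, not part of pandoc. As far as I know, pandoc 2.11 and later only resolve citations when `--citeproc` is passed as well, so treat that switch as an assumption to check against your installed version.

```python
from pathlib import Path
from typing import List, Optional


def build_pandoc_command(tex_path: str, bib_path: Optional[str] = None,
                         citeproc: bool = False) -> List[str]:
    """Assemble the pandoc argument list described above (hypothetical helper)."""
    output_path = str(Path(tex_path).with_suffix(".html"))
    command = [
        "pandoc",
        "--number-sections",  # numbered sections
        "--mathjax",          # equations rendered by MathJax
        "-f", "latex",        # input format
        "-t", "html",         # output format
        "-s",                 # standalone document
    ]
    if bib_path is not None:
        command.append("--bibliography=%s" % bib_path)
        if citeproc:
            # Assumption: needed on pandoc 2.11+ for citations to be resolved
            command.append("--citeproc")
    command += ["-o", output_path, tex_path]
    return command


# Prints the same command line as the one shown above
print(" ".join(build_pandoc_command("file.tex", "file.bib")))
```

Keeping the command as a list (rather than one string) means file names with spaces survive when the list is handed to `subprocess`.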

3 Jekyll

You have probably heard of Jekyll by now. It is a nice static site generator used by GitHub Pages, which is how a ton of people got used to it. Jekyll does not support LaTeX documents, but it does accept HTML files as input.

If you place the post header (the front matter) at the top of an HTML file, Jekyll treats the rest of the file as the post content.

---
layout: post # could be a different layout
author: "authorname"
title: "postname"
---
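
A header like this is easy to generate from Python. The sketch below is only an illustration: `jekyll_front_matter` is a hypothetical helper, and the escaping assumes the values are written as YAML double-quoted strings, so a title containing quotes does not break the header.

```python
def jekyll_front_matter(title: str, author: str, layout: str = "post") -> str:
    """Return a Jekyll front matter block (hypothetical helper)."""

    def quote(value: str) -> str:
        # Escape backslashes and double quotes for a YAML double-quoted string
        return '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')

    lines = [
        "---",
        "layout: %s" % layout,
        "author: %s" % quote(author),
        "title: %s" % quote(title),
        "---",
        "",  # makes the block end with a newline
    ]
    return "\n".join(lines)


print(jekyll_front_matter("postname", "authorname"))
```

The trailing newline matters: whatever HTML is written after the header then starts on its own line.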

4 Script glue

The final piece is the code glue written in Python. It just scans for .tex files, uses pandoc to convert them, strips out the unnecessary HTML header/footer and prepends the post header expected by Jekyll, containing the post author names, title and layout.

If there is a .bib file along with the .tex, it is used as the bibliography source file.
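
The scanning step is not shown in this post, so here is a rough sketch of how it could look. `find_tex_sources` is a hypothetical helper, and searching subfolders recursively is my assumption; the default folder name matches the `source_dir` default used in the function below.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple


def find_tex_sources(source_dir: str = "latex_posts") -> Iterator[Tuple[Path, Optional[Path]]]:
    """Yield (tex_file, bib_file_or_None) pairs found under source_dir."""
    for tex_file in sorted(Path(source_dir).rglob("*.tex")):
        # A bibliography counts only if it sits next to the .tex with the same name
        bib_file = tex_file.with_suffix(".bib")
        yield tex_file, (bib_file if bib_file.exists() else None)
```

Each pair can then be handed to the conversion step, with the bibliography flag added only when the second element is not None.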

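import os
import subprocess

# NOTE: re_match_html_author and re_match_html_title are compiled regexes
# defined elsewhere in the script (not shown here).

def latex_to_html_via_pandoc(source_file, source_dir="latex_posts", target_dir="_posts"):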

    output_file = source_file.replace(".tex", ".html").replace(source_dir, target_dir)

    # Latex to html
    command = """pandoc --number-sections --mathjax -f latex -t html -s"""
    command = command.split(" ")

    # Load references from bib file if it exists
    bibliography = source_file.replace(".tex", ".bib")
    if os.path.exists(bibliography):
        command.append("--bibliography=%s" % bibliography)  # use external bib file or not

    command.append("-o")
    command.append(output_file)  # output path
    command.append(source_file)  # source path

    # Run pandoc; chain the original error so the traceback stays useful
    try:
        subprocess.check_output(command)
    except Exception as e:
        raise RuntimeError("Error during pandoc conversion of %s: %s" % (source_file, e)) from e

    # Open the html file and get only contents to let jekyll handle style, links, etc
    with open(output_file, "r", encoding="utf-8") as f:
        contents = f.read()

        # Extract the authors (joined with commas) and the title
        authors = re_match_html_author.findall(contents)
        author = ", ".join(authors)
        title = re_match_html_title.findall(contents)[0]

        # Remove unnecessary header and trailer
        contents = contents.split("</header>")[1]
        contents = contents.split("</body>")[0]

    # Rewrite the file with the Jekyll post header (front matter)
    with open(output_file, "w", encoding="utf-8") as f:
        # Write the post header, ending with a newline so the HTML starts on its own line
        f.write("""---\nlayout: post\nauthor: "%s"\ntitle: "%s"\n---\n""" % (author, title))

        # Write stripped down html
        f.write(contents)
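
The function relies on two compiled patterns, re_match_html_author and re_match_html_title, that are not shown above. Here is a minimal sketch of what they might look like, assuming pandoc's default HTML template, which (as far as I can tell) puts the title in `<h1 class="title">` and each author in `<p class="author">` inside `<header>`. The patterns and the `split_pandoc_html` helper are my guesses for illustration, not the code from the actual script.

```python
import re

# Hypothetical patterns, assuming pandoc's default HTML template
re_match_html_title = re.compile(r'<h1 class="title">(.*?)</h1>', re.DOTALL)
re_match_html_author = re.compile(r'<p class="author">(.*?)</p>', re.DOTALL)


def split_pandoc_html(contents: str):
    """Return (author, title, body) from a standalone pandoc HTML page."""
    author = ", ".join(re_match_html_author.findall(contents))
    titles = re_match_html_title.findall(contents)
    title = titles[0] if titles else ""
    # Keep only what sits between the title block and the end of the body
    body = contents.split("</header>", 1)[-1].split("</body>", 1)[0]
    return author, title, body


sample = """<html><body>
<header id="title-block-header">
<h1 class="title">postname</h1>
<p class="author">Alice</p>
<p class="author">Bob</p>
</header>
<p>Hello</p>
</body></html>"""

author, title, body = split_pandoc_html(sample)
print(author, "|", title, "|", body.strip())
# prints: Alice, Bob | postname | <p>Hello</p>
```

Unlike indexing the split result with [1], taking the last piece of the split does not fail when the document has no title block; in that case the whole page up to `</body>` is kept.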

Once the posts are converted, you can run the jekyll build command manually, or jekyll serve (which also starts a local web server).

Or run Latex2Markdown.py --serve to convert the LaTeX files to HTML, build the site with Jekyll and start the server in one go.
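
For the curious, a command-line switch like --serve can be wired up roughly as below. This is a sketch under my own assumptions, not the actual contents of Latex2Markdown.py: `jekyll_command`, `parse_args` and `main` are hypothetical names, and the conversion step is left as a comment.

```python
import argparse
import subprocess
from typing import List, Optional


def jekyll_command(serve: bool) -> List[str]:
    """Pick the Jekyll subcommand: 'serve' builds and hosts, 'build' only builds."""
    return ["jekyll", "serve" if serve else "build"]


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Convert LaTeX posts and run Jekyll")
    parser.add_argument("--serve", action="store_true",
                        help="start the Jekyll web server after converting")
    return parser.parse_args(argv)


def main(argv: Optional[List[str]] = None) -> None:
    args = parse_args(argv)
    # Convert every .tex file here first (for example with latex_to_html_via_pandoc)
    subprocess.check_call(jekyll_command(args.serve))
```

Calling main(["--serve"]) would then run jekyll serve, while main([]) would only run jekyll build; both need Jekyll installed and on the PATH.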

Sources for this blog are available here.

5 Conclusions

It works surprisingly well and I am pretty happy with the results.

References

NobodyGotTimeForThis. n.d. “Ain’t Nobody Got Time for That (Original + Autotune).” https://youtu.be/waEC-8GFTP4.