Literate Programming

In a nutshell

Filippo Gambarota

Last modified: 2025-09-29

Doing research is hard…

  • you have to read papers, textbooks, slides and track information
  • you have to plan your experiment or research
  • you have to collect, organize and manage your data
  • you have to analyze data, create figures and tables
  • you have to write reports, papers, slides, etc.
  • you have to keep track of reviews from reviewers, co-authors, supervisor, etc.

Doing reproducible research is even harder 😱

  • organize and share data in a comprehensive format
  • choose a future-proof place to share data
  • analyze data using reproducible tools, i.e., scripting
  • create research reports in multiple formats: slides, reports, papers

Is there an issue about reproducibility?

Hardwicke et al. (2022) estimated the prevalence of open science practices in 250 psychology articles published between 2014 and 2017.

Not only sharing

Curating and sharing data, scripts, materials is an important practice that improves the scientific community as a whole.

But, what are the short-term benefits of changing our workflow focusing on reproducibility?

  • using a programming language instead of a GUI program (e.g., SPSS) is hard with a steep learning curve
  • documenting and sharing data can be tedious and time consuming
  • writing reproducible documents can be initially hard with a lot of friction

What are the actual benefits?

Especially related to data analysis and writing, some important features are:

  • The number of errors is drastically reduced
  • The workflow (after the initial effort) is faster, smoother and somewhat more stimulating
  • Huge changes such as adding new data, fixing several figures, changing citation style, etc. can be done automatically

Are there errors in scientific papers?

Nuijten et al. (2016) analyzed reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013. About half of the articles contained at least one p-value that was inconsistent with the reported test statistic and degrees of freedom.

Bakker & Wicherts (2011) analyzed 281 articles and found that around 18% of statistical results were incorrectly reported.

So, how can we improve the way we produce scientific papers?

The big (reproducible) picture

The big picture

The usual workflow

Usually we work with two (or more) separate tools:

  1. Word processing software (Word, Google Docs, etc.)
  2. Data analysis software (R, Python, SPSS, Jamovi, JASP, etc.)
  3. Image processing software (for combining and modifying images; GIMP, Photoshop, Illustrator)

The problematic aspect of this approach is that we have to manually combine the elements into the final document. This manual merge reduces reproducibility and increases the probability of making errors.

The elements of a (scientific) document

There are roughly 4 elements in a scientific document:

  1. plain and formatted text: this is the written part of your document. Just bare words with formatting (e.g., bold, italic, etc.)
  2. figures: these are produced by software (e.g., MATLAB, ggplot2 in R, etc.) or imported from other sources (e.g., a picture of your equipment, the experimental paradigm, etc.)
  3. tables: these are created manually or exported from some software. Mainly tables resulting from computations (e.g., regression tables)
  4. bibliography: this is technically text but usually created with other software (e.g., Zotero) thus we can consider it a separate component

The 4 elements in standard word processors

In Microsoft Word or Google Docs, the elements are usually combined in a uniform and coherent writing experience. You can write, add styling (bold, italic, etc.), add/remove images, create tables, etc.

This is usually called a What You See Is What You Get (WYSIWYG) approach. MS Word (or Google Docs) is called a WYSIWYG editor.

The common struggles of word-like documents

  • everything is ready, but you discover an error that will change all the numbers, tables and figures
  • you need to add/remove a figure or table and all the cross-references (“see Figure x”) need to be updated
  • you realize that the journal requires rounding to the third decimal and you did everything with two decimals
  • you submit your previous paper to another journal that requires another citation style: e.g., from (Author, Year) to [1].
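
To make the rounding struggle concrete: if the numbers in the text are produced by code instead of being typed by hand, the number of decimals becomes a single setting. The following is only a minimal Python sketch; the values and the names `results`, `DIGITS` and `report` are made up for illustration.

```python
# Hypothetical results; in a real project these would come from the analysis.
results = {"mean_rt": 0.51234, "sd_rt": 0.09876}

DIGITS = 3  # the only place to change when a journal asks for another precision


def report(values, digits=DIGITS):
    """Return a sentence with every number formatted to `digits` decimals."""
    mean = f"{values['mean_rt']:.{digits}f}"
    sd = f"{values['sd_rt']:.{digits}f}"
    return f"The mean response time was {mean} s (SD = {sd} s)."


print(report(results))            # three decimals
print(report(results, digits=2))  # two decimals, same data, no retyping
```

Every sentence built this way follows the same setting, so switching between two and three decimals does not require touching the text.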

How is text formatted?

With Google Docs or Word, it is not clear how the text is formatted (we just press “bold”). But the idea is roughly the following.

We need to choose some tags, i.e., some text and/or symbols that the software will recognize and interpret not as text but as special elements.

This system of tags is called a markup language, and it defines how the plain text needs to be formatted.
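
As a toy illustration of "tags interpreted as special elements", the Python sketch below assumes a convention where text wrapped in two asterisks means bold and translates it into HTML `<b>` tags. The function name `render_bold` is hypothetical, and a real markup processor handles many more cases than this single rule.

```python
import re


def render_bold(text):
    """Replace every **...** pair with HTML <b>...</b> tags."""
    # The asterisks are not printed: they are read as an instruction.
    return re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", text)


source = "This is **important**, this is not."
print(render_bold(source))
```

The source stays plain text that anyone can open, while the software decides how the tagged part is displayed.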

Examples of markup languages

Practical example with HTML

We can open a very simple HTML document and see the source code.

Practical example with LaTeX

We can see a very simple LaTeX document and the pdf result.

What is the problem with markup languages?

Apparently (beyond learning nerdy stuff) there is no advantage of using a markup language such as HTML or LaTeX for standard documents. In fact there are some problems:

  • Hard to write: we need to learn the tags, writing is not as intuitive as in WYSIWYG editors
  • Hard to read: the tags make the text harder to read
  • Hard to collaborate: people that are not used to markup languages can be very confused

What are the advantages?

There are several advantages (trust me):

  • focus on the text: the formatting is managed by the markup and can easily be changed afterwards
  • high quality: markup languages are usually used for high-quality products (e.g., LaTeX is used for typesetting)
  • integration with other software: using a plain-text + markup approach we can easily include other software and languages (we will see later)

The proposed trade-off? Markdown!

Markdown is a very simple and readable markup language that is nowadays used by a lot of software and services. It is (out of the box) less flexible than HTML or LaTeX, but you can learn it in 20 minutes (really) and even people who have never used it can read Markdown documents.

You can see an example of a Markdown document. The final result can be easily seen here.

The real power of Markdown? conversion!

The real power of Markdown is that the document can be converted into almost any format, such as Word, HTML, PDF, etc. A program called pandoc takes an .md file and creates different types of documents.

For example, using the pandoc command line:

pandoc input.md -o input.pdf
pandoc input.md -o input.html
pandoc input.md -o input.docx

Basically, pandoc takes the .md file as a general input, converts it into the plain-text target format (e.g., .tex or .html) and then, when needed, compiles that file to create the final output (e.g., a PDF from the .tex file).
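
The same conversions can be scripted, so that all outputs are regenerated from one source. The Python sketch below is only an illustration: the helper `pandoc_command` is a hypothetical name, `input.md` is assumed to exist in the working directory, and asking for a .tex target is a way to look at the intermediate plain-text file mentioned above.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(source, target):
    """Build the pandoc call that converts `source` into `target`."""
    return ["pandoc", source, "-o", target]


# One source, several targets (the .tex file is the plain-text intermediate).
commands = [
    pandoc_command("input.md", f"input.{ext}") for ext in ("tex", "html", "docx")
]

for cmd in commands:
    print(" ".join(cmd))
    # Run the conversion only if pandoc is installed and the input exists.
    if shutil.which("pandoc") and Path("input.md").exists():
        subprocess.run(cmd, check=True)
```

Re-running the script after editing input.md updates every output at once.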

https://pandoc.org/diagram.svgz?v=20250907114133

What about other elements? Figures? Tables?

Figures in Word/Google Docs

In standard WYSIWYG editors, figures are external files embedded into the document. To update a figure we need to re-create it externally and then re-import it into the document.

The essence of figures…

But in essence, if you use a programming language (e.g., MATLAB, Python or R), the figure is no longer the output file but some lines of code.

library(ggplot2)

# scatter plot of two variables from the built-in iris dataset
iris |> 
    ggplot(
        aes(
            x = Sepal.Length, 
            y = Petal.Width
        )
    ) +
    geom_point() +
    coord_fixed(ratio = 1)

What about text + markup + code??

What about having a method that interprets not only the markup but also a programming language? Then we no longer produce the figure externally: we include, directly in the document, the code that produces some elements of the final result (e.g., a figure or a table).

This is the core idea of literate programming. There are several tools, but the most famous ones are:

  • R Markdown
  • Quarto
  • Jupyter Notebook
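
The idea of mixing text and code can be sketched in a few lines of Python. This is a toy example and not how the tools above are implemented: the `{{ ... }}` placeholder syntax, the `render` function and the `scores` data are all assumptions made for illustration.

```python
import re
import statistics

# Hypothetical data; in a real document this would be loaded from a file.
scores = [12, 15, 11, 18, 13]

# Text with embedded code: everything inside {{ ... }} is a Python expression.
template = (
    "We collected {{ len(scores) }} observations. "
    "The mean score was {{ round(statistics.mean(scores), 1) }}."
)


def render(text, env):
    """Evaluate each {{ expression }} in `text` and insert its result."""
    def evaluate(match):
        # eval() is acceptable for a toy example, never for untrusted input.
        return str(eval(match.group(1), env))

    return re.sub(r"\{\{\s*(.+?)\s*\}\}", evaluate, text)


document = render(template, {"scores": scores, "statistics": statistics})
print(document)
```

If a new observation is added to `scores`, both numbers in the sentence change on the next run, without editing the text.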

Literate programming, the big picture

Quarto

Quarto, in a nutshell

The idea of Quarto is to write a single *.qmd file. When the file is rendered, the R code is evaluated by the knitr package (if using R) and its output is included in a Markdown document. The Markdown is then converted to the required format (e.g., LaTeX if the output is PDF, or HTML). Finally, the document is compiled into the output format.

From r4ds.hadley.nz/quarto.html

References

Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43, 666–678. https://doi.org/10.3758/s13428-011-0089-5
Hardwicke, T. E., Thibault, R. T., Kosie, J. E., Wallach, J. D., Kidwell, M. C., & Ioannidis, J. P. A. (2022). Estimating the prevalence of transparency and reproducibility-related research practices in psychology (2014–2017). Perspectives on Psychological Science, 17, 239–251. https://doi.org/10.1177/1745691620979806
Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48, 1205–1226. https://doi.org/10.3758/s13428-015-0664-2

Let’s see a practical example!