Babylon as a Feature
Multi-lingual documentation, made simple
The Tower of Babylon is a myth meant to explain why the world’s peoples speak different languages. Refer to Bibliography [0] below for details. In modern IT systems, it’s often a requirement to support multiple languages.
Such internationalization (i18n for short) is a tough challenge, and this post describes a simple solution to just a tiny part of it: multilingual documents. Our solution combines the simplicity of the plain-text format AsciiDoc with a simple yet versatile build script to support multiple languages (like EN and DE) and multiple output formats (like PDF and HTML).
Let’s start with some requirements:
- Imagine you need to maintain documents.
- The desired output format is PDF or HTML, although our approach could easily handle *.docx or LaTeX. But we will keep it simple for now.
- Several people constantly provide updates to these documents and need to collaborate without interfering with each other.
- Changes should be reviewed and approved by somebody else.
- From time to time, you need to release updated versions of your documents.
- For readers, the version number is important, so it needs to be contained within the documents.
- Maybe it is self-evident to you, but we strive for a high degree of automation. So please don’t come up with a “Save document as PDF” function within a word processor.
If some of these requirements sound familiar because source code is maintained the same way, here is the good news: you will recognize a few of our proposals.
Let’s visualize the situation: Figure 1 depicts a few authors that independently update distinct parts of an English and a German document.
Figure 1: Authors maintain documents
Figure 2 shows three hypothetical document releases with two languages.
Figure 2: Document releases
What Kind of Documents?
We (Ben and Gernot) are (co-)authors and maintainers of a few documents, for example, an extensive glossary of software architecture terminology (refer to Bibliography [1] below) and a number of technical curricula (see Bibliography [2] below).
We maintain these documents (together with a group of additional authors) in English and German. Our problem is that we write and speak only these two languages, but you will see below that additional languages can be easily integrated.
Collaboration First
As software developers, you will have experienced the numerous advantages of professional version control, namely Git. Combined with services like GitLab or GitHub, it gives you a rock-solid and proven platform for collaboration, including pull/merge requests (in our case: document reviews and approvals).
Therefore, we obviously maintain our documents on such a Git platform.
Pull and merge requests require that differences between documents can be automatically determined, so the technical format for documents needs to be plain text. Several such formats are used in practice (see our explanatory box below). Several of these lack the babylonic features we require to process several languages automatically, which is why we decided to use AsciiDoc (see Bibliography [3] below). AsciiDoc is open-source and provides several incredibly powerful features that will come in handy later on.
Markup Languages
A few markup languages have become popular in software developer communities:
- Markdown is likely the most common markup language. Used primarily for shorter documents, like blog posts (this one has actually been authored in Markdown). On the positive side, it is extremely easy to use. However, it also has a few downsides:
- There are several dialects in the wild that each add certain features, usually not compatible with the other dialects.
- No built-in support to modularize/structure documents.
- AsciiDoc is our language of choice, as it has been designed with large documents and language simplicity in mind, has excellent documentation and is used in several open-source projects. For example, the arc42 architecture template relies on AsciiDoc.
- Textile has been designed to be a shorthand syntax for creating HTML. We haven’t seen it in our projects and therefore did not consider using it.
- reStructuredText and Sphinx: Used heavily in the Python world. Can create a variety of output formats, like HTML, LaTeX, Windows Help, ePub, and others.
Wikipedia has a nice overview of these and other lightweight markup languages.
AsciiDoc HelloWorld
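As an illustrative sketch (not the original listing), a minimal AsciiDoc document could look like the following. The title, author line, and text are placeholders chosen for this example.

```asciidoc
= Hello AsciiDoc
Jane Doe <jane.doe@example.org>
:toc:

// A section heading; the line starting with a single "=" above is the document title
== First Section

This is a paragraph with *bold* and _italic_ text.

* a bullet point
* another bullet point

// A listing block for source code or shell output
----
$ echo "Hello, AsciiDoc"
----
```

If this were saved as hello.adoc (a hypothetical file name), the Asciidoctor command line would convert it to HTML with `asciidoctor hello.adoc`.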
Using the AsciiDoc processor (either on your favorite shell or wrapped in a build script), you get the following output from the text above:
Image 1: Screenshot Hello Asciidoc(uments)
We compiled the AsciiDoc with Gradle, using the following simple build file:
Split Documents into Parts
Now that we know how to create a document, let’s prepare for more complicated stuff. First, we should modularize our document and split it into distinct parts. It’s like creating a larger software system from distinct components or modules, but for AsciiDoc documents. Luckily, AsciiDoc comes with a highly practical feature called include, which allows for the modularization of documents – see the following diagram.
Figure 3: Document made up of distinct parts
Of course, these include directives may contain path or directory information so that you can organize your files in adequate ways.
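A master document built from parts might look like the sketch below. All file and directory names are hypothetical and only illustrate the include directive with relative paths.

```asciidoc
= Our Document
:toc: left

// Each part lives in its own file and can be edited and reviewed independently
include::chapters/01-introduction.adoc[]

include::chapters/02-concepts.adoc[]

// leveloffset=+1 pushes the headings of the included file one level down
include::appendix/glossary.adoc[leveloffset=+1]
```

The blank lines between the include directives keep the last paragraph of one file from merging with the first block of the next.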
Hey Babylon: Multiple Languages
For multiple languages, you have two different options to organize your content (explained in Fig. 4 for EN and DE, English and German):
- Put EN content in an English-only file tree and DE content in a second, German-only file tree.
- Put EN and DE content in the same files, and find a clever mechanism to separate these languages when creating output for a single language.
Figure 4: Multi-language options
Let’s consider an important text passage in both English and German (we took the liberty of using the introductory paragraph of the Agile Manifesto):
We are uncovering better ways of developing software by doing it and helping others do it.
Wir erschließen bessere Wege, Software zu entwickeln, indem wir es selbst tun und anderen dabei helfen.
We have the two language versions next to each other, but we need to create an English-only output, without the German stuff in it.
Excursion: The C Preprocessor
A few old-generation developers might remember the days of the C programming language. Programs sometimes contained nerdy statements like the following:
In C or C++, these conditional includes are quite common. Sometimes, even the behavior of the compiler is controlled via such directives. We tell you this for a reason, just read on.
But We Are Writing Documents, Not C?
If we had a similar directive, a kind of conditional compilation, for our documents, then we could for example write #ifdef ENGLISH #include page-1-EN.adoc, and leave out the other languages for a moment.
The AsciiDoc processors have learned their lessons from history and have come up with a conditional include on steroids: One can include specific parts of a file, for example just the English parts. Such include statements can even be written with variables, and these variables can be set during the build process. Wow!
Fig. 5 gives an overview.
Figure 5: One build per language
AsciiDoc performs this magic by using tags, explicitly marked parts of a document. Here is a simple example:
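A sketch of such a tagged file, using the Agile Manifesto sentence from above. The file name manifesto.adoc and the tag names EN and DE are assumptions for illustration; Asciidoctor recognizes the tag:: and end:: markers inside comment lines.

```asciidoc
// file: manifesto.adoc (hypothetical name)

// tag::EN[]
We are uncovering better ways of developing software by doing it and helping others do it.
// end::EN[]

// tag::DE[]
Wir erschließen bessere Wege, Software zu entwickeln, indem wir es selbst tun und anderen dabei helfen.
// end::DE[]
```

Because the markers are ordinary comments, they never show up in the rendered output.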
We can then tell AsciiDoc to pass the tag for EN when including the file. See the following image.
Figure 6: Include only certain parts
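In AsciiDoc source, the selection could be written as sketched below. The file name manifesto.adoc and the attribute name language are assumptions for this example. With a fixed tag, the directive would simply read `include::manifesto.adoc[tag=EN]`; the variant shown here takes the tag from an attribute instead.

```asciidoc
// Fallback value; a value passed in by the build takes precedence
:language: EN

// The tag is taken from the attribute, so one source serves every language build
include::manifesto.adoc[tag={language}]
```

On the Asciidoctor command line, the attribute could be set per build with `-a language=DE`; build tool plugins typically offer an equivalent way to pass attributes.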
Now our build script needs to iterate over all the desired output languages, call the AsciiDoc processor, and create a distinct output for each one. Common build tools like Gradle, Maven, or make each have their specific mechanisms; a detailed explanation would exceed the scope of this article. The structure of such a build script (in Gradle) looks as follows:
You find a specific task definition per language (here: EN and DE), where the generic RenderDocumentTask gets called with the filename and the language as parameters. The heavy lifting of AsciiDoc conversion is done by the Asciidoctor Gradle plugin.
More Conditions
AsciiDoc offers additional options to include conditions in your documents: You can use ifeval:: or the plain old ifdef::.
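As a sketch, both directives could be used like this. The attribute names draft and language are assumptions for this example and would be set in the document header or by the build.

```asciidoc
// Rendered only if the attribute "draft" is set
ifdef::draft[]
NOTE: This document is a draft and has not been reviewed yet.
endif::[]

// Rendered only if the attribute "language" has the value DE
ifeval::["{language}" == "DE"]
Dieses Dokument ist auch auf Englisch verfügbar.
endif::[]

// Short single-line form of ifdef
ifdef::draft[Draft version, do not distribute.]
```

ifdef:: only checks whether an attribute is set, while ifeval:: compares values, which makes it the better fit for language switches.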
But let’s have a look at a more realistic example.
Configuring the Output
When we started with this toolchain, we knew that we had to find a way to be able to create either a PDF file or an HTML representation of our documents. Fortunately, AsciiDoc allows us to do both.
PDF Files
AsciiDoc allows you to create a PDF theme which is used to configure the output. It lets you configure all sorts of things, like a cover image, the position of elements on the pages, background images, and more. You can even use variables in the theme file, which in our case are filled with language-dependent text, like the date in the footer (you can have a look at our PDF theme here). All you need to do is tell the Asciidoctor task where to look for the theme, and that’s it. Let’s have a look at our Gradle task to generate the PDF.
We removed everything from the task that is not relevant to the PDF creation (you can check the full file here). You have to enable pdf as the backend (line 11) and then set the name of the theme (pdf-style), the directory where to look for the fonts that are used (pdf-fontsdir), and the directory where to look for the theme (pdf-stylesdir). Why are there two more lines that don’t seem to be related to PDF? Well, glad you asked!
HTML Files?
The two additional lines you see in the code snippet above can be used to also style the HTML output. Asciidoctor has a default theme that is used for HTML output. If you want to adjust the result, all you have to do is provide a CSS file that contains all the magic you want for your result. Enable HTML as the backend and tell AsciiDoc where to find the stylesheet (stylesheet) and where to look for images or fonts that might be referenced in the stylesheet (stylesheet-dir). You can check one of our examples below to see the PDF and HTML results.
Ok, that’s fine for a single project, but the Advanced Level has more than ten curricula, so we would have to copy the themes to each project. If we adjusted the PDF theme in one repository, how can we make sure that all other curricula also benefit from the changes?
A Family of Similar Documents
To define both the HTML theme and the PDF theme only once, we moved them to separate repositories. These repositories are then linked into each curriculum repository as Git submodules. This has several advantages.
- There is only one place where we have to change the themes. If we’re working on a specific curriculum and want to improve on one of the themes, we can open the submodule and commit/push our changes.
- Owners (curators) of other curricula don’t have to think about doing the same changes. All they have to do is to update the respective submodule.
- Should owners of a curriculum not want to upgrade the themes for whatever reason, they can decide to just keep their submodules at the revision they are happy with.
We also identified the copyright of each curriculum as a candidate for a separate submodule. It changes every year (to add the current year), and that change would otherwise have to be made in each repository. Extracting the copyright file as a submodule allows us to change only one single file. Everyone who updates their curriculum also updates the submodule to the latest revision, and that’s it.
Real World Examples
The Curriculum for Software Architecture, iSAQB CPSA‑F®
Worldwide courses and classes in software architecture are taught based upon the iSAQB Software Architecture Foundation curriculum, guiding thousands of developers towards their “Professional for Software Architecture” certification, CPSA‑F. Therefore, the iSAQB needs to provide versions in different languages, both in HTML and PDF formats. This curriculum consists of approximately 40 learning goals (LGs) in 5 parts, resulting in about 30 pages. Every two years the iSAQB releases an updated version of the curriculum, based on new ideas and input from the international software architecture community.
We (Ben and Gernot) belong to the core maintainers’ group of this document.
Let’s dissect its structure:
- The entry point is the file curriculum-foundation.adoc, which contains a number of include statements.
- The first included file defines several variables that are used throughout the document, among others the document type (book), the position of the table of contents (left), and the location of the image directory.
- Now a list of all learning goals is included. Please note that this list is generated as part of the build process, to ensure we always have an up-to-date list of learning goals.
- Next all the chapters are included, one by one, which in turn include important terms, all learning goals, and the references helpful for this chapter.
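The structure described above could be sketched as follows. Apart from curriculum-foundation.adoc, all file and directory names and the document title are hypothetical; the actual repository may be organized differently.

```asciidoc
= Software Architecture Curriculum
// Shared settings such as :doctype: book, :toc: left, and :imagesdir:,
// included directly below the title so that they belong to the document header
include::config/setup.adoc[]

// List of learning goals, generated during the build
include::generated/learning-goals.adoc[]

// Chapters; each one includes its terms, learning goals, and references
include::01-basic-concepts/chapter.adoc[]

include::02-design/chapter.adoc[]
```

Keeping every learning goal in its own included file is what keeps reviews small: a pull request usually touches a single short file.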
This allows us to change and review each learning goal without conflicting with other learning goals of the document. We keep both the English and the German version of a learning goal in one file, so if one language is changed, the other one is less likely to be overlooked.
For translations in other languages, we added the possibility to easily upload PDF files to the repository which will be added to the next release automatically.
The Curricula of the iSAQB Advanced Level CPSA‑A®
For each Advanced Level module, we use the same template described in the previous example. This ensures a clear and overarching design and structure of the documents, so that participants can navigate through the different modules with ease, always knowing where to find what. Updating the formatting is no real effort since this is done via the submodules. Only changes to the build environment or GitHub Actions require manual adjustments in each repository.
A Large Glossary
We maintain a glossary of software architecture terminology (available for free from the iSAQB), with close to a dozen authors. A few parts of this document change quite frequently (new terms are added, explanations are updated), while others are highly stable (e.g., the introduction, copyright notice, and authors’ biographies).
We maintained this glossary on GitHub before, but we had to manually create a PDF and upload it to Leanpub. The current approach with AsciiDoc and our build pipeline allows us to create a new release by creating a new Git tag and pushing it to GitHub. That’s it.
Summary
You can maintain multi-lingual documents with a pragmatic, simple, and free (as in open-source) toolchain that is developer-friendly and proven in practice. Business- and other non-IT people might miss their favorite word processing tool, but the benefit of multiple languages organized along the principle of one fact, one place will help you in the long run. Until then – may the power of expressive wording be with you.
Bibliography
[0] Tower of Babylon: Brief explanation and history on Wikipedia
[1] iSAQB Glossary of Software Architecture Terminology, available in the following formats:
[3] AsciiDoc
About the Authors
Gernot Starke, INNOQ Fellow, co-founder of arc42.org and aim42.org. He “drinks his own champagne”:
Within iSAQB, he leads the Foundation Level Working Group and urgently needs to create and manage multilingual documents.
That’s why he sat down with Ben to create (and use) the toolchain described here.
Ben Wolf is an architect, iSAQB member, and a developer at INNOQ. He barely puts up with bad code and does not shy away from enormous refactorings. He shares his ideas about software quality and proper software development as a trainer, consultant, and speaker at conferences and meetups.
It is important to him that we recognize that the attitude of a team is crucial for good software quality and far exceeds the value that is provided by technology alone.