Over the last few years, I have been toying with various plain text to document conversion methods. My most recent side project involved the conversion of millennia-old manuscripts into a modern, portable, digital format. This case study will outline the solution design process I followed, the trade-offs I made, and the outcomes of the initiative.
Note that the process described here took about 30 minutes, over a cup of coffee and a few sketches. This article is a detailed breakdown of my thought process. Writing this case study took a lot longer than creating the actual solution did.
The goal of this blog post is to provide some insight into both the pet project and how you can analyse and solve relatively simple challenges using automation techniques.
The high-level steps I follow for breakable toys like this are: (1) figure out what I want to achieve, (2) impose some constraints to guide the decision making, (3) brainstorm a couple of possible approaches, (4) pick a viable approach and design a solution, (5) attempt to implement it within a specified amount of time, (6) evaluate the outcomes and decide on next actions. We can group these steps in the following categories:
The image below shows these categories, and the steps they contain.
I have started writing a book using the LeanPub platform, which combines some of my previous content in a more structured way. One of the chapters will deal with communication and persuasion, and I wanted to include some excerpts from Aristotle’s Rhetoric. As I remembered Aristotle’s work from my high school Latin classes, I thought it would be a nice touch to revisit the original text, and include excerpted passages or references in my book.
As this is a side project (of a side project), I wanted to keep the scope limited, and not invest too much time/money in it.
The main challenge was finding a suitable digital version of the text. I found a few candidate translations, but none of them seemed to be available in portable digital formats. Most well-reviewed translations are available in print, and require overseas shipping, or visits to local university libraries. I found some online versions of the text, but none of them were immediately usable for my purposes.
After some searching, I found a translation of Aristotle’s “Ars Rhetorica” by J.H. Freese, available on the Perseus website. Opening the website shows you the text in a browser, and allows you to navigate the text one paragraph at a time. See: Perseus Hopper: Aristotle, Rhetoric.
This isn’t convenient for reading, and certainly not convenient for reading on a mobile e-reader device.
Conclusion: The input data is structured, and can be downloaded in bulk. This allows for automated processing and conversion. Assuming the XML format is syntactically correct, and the source material does not contain too many transcription errors, this is a viable input source.
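The assumption that the XML is syntactically correct can be checked cheaply before writing any conversion code. The sketch below is one possible way to do that with the SAX parser that ships with the JDK; the class and method names (`WellFormedCheck`, `isWellFormed`) are made up for illustration and are not part of the actual project.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks well-formedness only, not validity against a schema.
class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            // DefaultHandler ignores all content, but rethrows fatal (well-formedness) errors.
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The parser hit malformed XML, such as a mismatched closing tag.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Could not run the XML parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<chapter><p>Some text.</p></chapter>"));
        System.out.println(isWellFormed("<chapter><p>Some text.</chapter>"));
    }
}
```

For a downloaded file, the same check could read from a `FileInputStream` instead of an in-memory string.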
Since the input data is structured XML, and it can be downloaded in bulk, it should be quite trivial to implement a parser to read the data. The parser can then convert the data to a more portable format, such as markdown. As I already have some experience with both XML parsing and markdown generation, this seemed like a good fit.
Whichever tool I pick, the general approach will be the same:
This means we split the problem into two main parts: reading the input data, and writing it out in the desired format.
The reading part will likely be the most complex and important, as the writing part is significantly easier if the data is imported properly. In order to know how to read the data, I find it easier to first think about how I want to use it. As such, I will first describe how I intend to write out the parsed text to a file-based output.
The writing part is quite trivial, once the data is in a logical, in-memory format. It consists of simple looping over the chapters, sections, and annotations, and writing them to the desired output format. This gives us a simple, linear process, that can be implemented in a few lines of code.
The envisioned data model we want to use to write the content to a file-based output will be something very similar to the UML diagram displayed below.
As my existing toolchain is markdown based, this will be the first (and likely only) output format the solution will be able to deal with. The implementation will probably be more simplistic than the model provided above, as I do not particularly care about supporting other formats, or reusing this solution in the future.
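To make the writing part concrete, here is a minimal sketch of looping over an in-memory model and emitting markdown. The `Chapter` and `Section` records, their fields, and the heading and blockquote layout are all assumptions made for this example; they are not taken from the UML diagram or from the actual repository. Records require Java 16 or later.

```java
import java.util.List;

// Hypothetical in-memory model; the real project may structure this differently.
record Section(int number, String text, List<String> annotations) {}

record Chapter(int number, String title, List<Section> sections) {}

class MarkdownWriter {

    // Walks chapters, then sections, then annotations, in document order.
    static String toMarkdown(List<Chapter> chapters) {
        StringBuilder out = new StringBuilder();
        for (Chapter chapter : chapters) {
            out.append("## Chapter ").append(chapter.number())
               .append(": ").append(chapter.title()).append("\n\n");
            for (Section section : chapter.sections()) {
                out.append(section.text()).append("\n\n");
                for (String annotation : section.annotations()) {
                    // Annotations are rendered as blockquotes (an arbitrary choice).
                    out.append("> ").append(annotation).append("\n\n");
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Section section = new Section(1, "Some translated text.", List.of("A translator's note."));
        Chapter chapter = new Chapter(1, "Example", List.of(section));
        System.out.print(toMarkdown(List.of(chapter)));
    }
}
```

Returning a `String` keeps the sketch easy to test; writing it to disk is then a single call such as `Files.writeString(path, markdown)`.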
The main question is what tooling to use for the conversion. I have used both StAX parsers and JAXB in the past. Both seem suitable, so it will be a matter of picking the one that is most convenient to implement.
Main concern: choice of tooling, and fit with the desiderata.
A JAXB-based implementation would require obtaining or creating an XSD schema, and using it to generate Java classes. A StAX parser implementation will require manual parsing of the XML.
As a JAXB-based approach would require me to look for the schema definition, add a bunch of dependencies to my code, generate the Java classes, and then write the actual conversion code, I decided to go with the StAX parser. This is a simpler, more straightforward approach, and will likely do the job just fine.
If you are unfamiliar with the concept of a StAX parser: it streams through an XML file one event at a time (start tag, text, end tag), and your code keeps track of what context it is dealing with. A simplified pseudo-code version, which approximates this by scanning the file line-by-line for tags, looks something like this:
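// storeChapter, storeSection and stripHtml are helpers that are not shown here.
// State lives outside the loop, so it survives from one line to the next.
boolean inChapter = false;
boolean inSection = false;
int chapterNumber = 0;
List<String> sectionContent = new ArrayList<>();

for (String line : fileLines) {
    if (line.contains("<chapter>")) {
        inChapter = true;
        chapterNumber += 1;
        storeChapter(chapterNumber);
    }
    if (line.contains("<p>")) {
        inSection = true;
    }
    if (inChapter && inSection) {
        // Collect the text before handling closing tags, so the last line of a section is kept.
        var sanitizedLine = stripHtml(line);
        sectionContent.add(sanitizedLine);
    }
    if (line.contains("</p>") && inSection) {
        // Store a copy: clearing the buffer must not empty the stored section.
        storeSection(new ArrayList<>(sectionContent));
        sectionContent.clear();
        inSection = false;
    }
    if (line.contains("</chapter>")) {
        inChapter = false;
    }
}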
The real implementation will be a bit more complex, but this is the general idea. If you are interested, the source code is available over at github.com/stijn-dejongh/ars-rethorica.
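For comparison with the line-based pseudo-code, here is a minimal sketch of the same idea using the StAX API that ships with the JDK (`javax.xml.stream`). It is not the code from the repository. The element names `chapter` and `p` are carried over from the pseudo-code and are assumptions about the input, and the class and method names are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxSketch {

    // Returns the text of every <p> element that sits inside a <chapter> element.
    static List<String> readParagraphs(String xml) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inChapter = false;
        boolean inParagraph = false;
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent text chunks so each run of text arrives as one event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, true);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("chapter".equals(name)) inChapter = true;
                    if ("p".equals(name)) inParagraph = true;
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    // Text inside nested tags is kept too, so inline markup is dropped for free.
                    if (inChapter && inParagraph) current.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("p".equals(name)) {
                        if (inChapter) paragraphs.add(current.toString().strip());
                        current.setLength(0);
                        inParagraph = false;
                    }
                    if ("chapter".equals(name)) inChapter = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the XML input", e);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String xml = "<text><chapter><p>First <b>bold</b> text.</p><p>Second.</p></chapter></text>";
        readParagraphs(xml).forEach(System.out::println);
    }
}
```

Because the parser reports events rather than lines, this version does not care how the XML is split across lines, which is the main weakness of the line-based approximation.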
You can see the result here: LeanPub: Aristotle’s Rhetoric