Case Study: Automated Book Conversion

Over the last few years, I have been toying with various plain text to document conversion methods. My most recent side project involved the conversion of millennia-old manuscripts into a modern, portable, digital format. This case study will outline the solution design process I followed, the trade-offs I made, and the outcomes of the initiative.

Note that the process described here took about 30 minutes, over a cup of coffee and a few sketches. This article is a detailed breakdown of my thought process. Writing this case study took a lot longer than creating the actual solution did.

The goal of this blog post is to provide some insight into both the pet project and how you can analyse and solve relatively simple challenges using automation techniques.

The high-level steps I follow for breakable toys like this are: (1) figure out what I want to achieve, (2) impose some constraints to guide the decision making, (3) brainstorm a couple of possible approaches, (4) pick a viable approach and design a solution, (5) attempt to implement it within a specified amount of time, (6) evaluate the outcomes and decide on next actions. We can group these steps into the following categories:

  • Problem and context definition
  • Solution design
  • Implementation
  • Evaluation

The image below shows these categories, and the steps they contain.

Context

I have started writing a book using the LeanPub platform, which combines some of my previous content in a more structured way. One of the chapters will deal with communication and persuasion, and I wanted to include some excerpts from Aristotle’s Rhetoric. As I remembered Aristotle’s work from my high school Latin classes, I thought it would be a nice touch to revisit the original text, and include excerpted passages or references in my book.

As this is a side project (of a side project), I wanted to keep the scope limited, and not invest too much time/money in it.

Problem Outline

The main challenge was finding a suitable digital version of the text. I found a few candidate translations, but none of them seemed to be available in portable digital formats. Most well-reviewed translations are available in print only, and require overseas shipping or visits to local university libraries. I found some online versions of the text, but none of them were immediately usable for my purposes.

After some searching, I found a translation of Aristotle’s “Ars Rhetorica” by J.H. Freese, available on the Perseus website. Opening the website shows you the text in a browser, and allows you to navigate the text one paragraph at a time. See: Perseus Hopper: Aristotle, Rhetoric.

This isn’t convenient for reading, and certainly not convenient for reading on a mobile e-reader device.

Desiderata

Goals

  • Have a portable, digital version of the text that is compatible with most e-readers
  • Use a data format that is easy to reuse in other contexts (e.g. markdown, website, blog)
  • Share the result with others asynchronously, preferably via online download

Constraints

  • No additional monetary investment
  • Time investment limited to a few hours, split into 30-minute chunks (playing around over morning coffee)
  • As much reuse of existing tools and side-projects as possible
  • My ancient Greek / Latin is fairly rusty, so I need to rely on English translations

Input Analysis

  • The Hopper text is licensed under a Creative Commons Attribution-ShareAlike 3.0 United States License
    • requires attribution
    • requires sharing under the same license
    • allows for reuse and modification, including commercial use
  • The text is available in a structured format (XML)
    • this allows for parsing and conversion
    • the XML text is available per-paragraph
    • further investigation: manually changing the URL allows for downloading on a per-chapter basis
  • The structured XML format contains metadata about the text
    • chapter, paragraph, section, subsection, etc.
    • annotations by translators

Conclusion: The input data is structured, and can be downloaded in bulk. This allows for automated processing and conversion. Assuming the XML format is syntactically correct, and the source material does not contain too many transcription errors, this is a viable input source.
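The assumption that the XML is syntactically correct can be checked before writing any conversion code. The snippet below is a minimal sketch of such a check using the XML parser that ships with the JDK. The class name `WellFormedCheck` and the method `isWellFormed` are made up for this illustration and are not part of the project.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class WellFormedCheck {

    /** Returns true when the given XML text parses without syntax errors. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input: it is not well-formed.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Could not run the XML parser", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<chapter><p>text</p></chapter>"));
        System.out.println(isWellFormed("<chapter><p>text</chapter>"));
    }
}
```

This only checks syntax, not whether the content makes sense. Real downloads that reference an external DTD or use custom entities may need additional parser configuration.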

Solutioning

Technical Solution

Since the input data is structured XML, and it can be downloaded in bulk, it should be quite trivial to implement a parser to read the data. The parser can then convert the data to a more portable format, such as markdown. As I already have some experience with both XML parsing and markdown generation, this seemed like a good fit.

Envisioned Approach

Whichever tool I pick, the general approach will be the same:

  1. Download the raw XML content from the website
  2. Take the input XML, read it, and parse it into a logical structure
  3. Write this structure out to a different format
  4. Use the existing toolchain to create an e-book, and publish it online
  5. Transfer the e-book to a reader, read it, and highlight issues. Go to step 2 if needed.
  6. Read the text again after fixing issues and enjoy the content
high level solution sketch

This means we split the problem into two main parts:

  • Reading the raw XML input
  • Writing the parsed data to a suitable output format

The reading part will likely be the most complex and important, as the writing part is significantly easier if the data is imported properly. In order to know how to read the data, I find it easier to first think about how I want to use it. As such, I will first describe how I intend to write out the parsed text to a file-based output.

Writing

The writing part is quite trivial, once the data is in a logical, in-memory format. It consists of simple looping over the chapters, sections, and annotations, and writing them to the desired output format. This gives us a simple, linear process, that can be implemented in a few lines of code.

The envisioned data model we want to use to write the content to a file-based output will be something very similar to the UML diagram displayed below.

basic UML diagram of the envisioned data model

As my existing toolchain is markdown based, this will be the first (and likely only) output format the solution will be able to deal with. The implementation will probably be more simplistic than the model provided above, as I do not particularly care about supporting other formats, or reusing this solution in the future.
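To make the writing step concrete, here is a minimal sketch of a simplified data model and a markdown writer that loops over it. It assumes Java 16 or newer (for records). The names `Book`, `Chapter`, `Section` and `toMarkdown`, as well as the exact markdown layout, are assumptions made for this example. They are not taken from the UML diagram or from the actual implementation.

```java
import java.util.List;

// Hypothetical, simplified model and writer, for illustration only.
final class MarkdownSketch {

    record Section(String text, List<String> annotations) {}
    record Chapter(int number, List<Section> sections) {}
    record Book(String title, List<Chapter> chapters) {}

    /** Walks the model top-down and renders it as a single markdown string. */
    static String toMarkdown(Book book) {
        StringBuilder out = new StringBuilder();
        out.append("# ").append(book.title()).append("\n\n");
        for (Chapter chapter : book.chapters()) {
            out.append("## Chapter ").append(chapter.number()).append("\n\n");
            for (Section section : chapter.sections()) {
                out.append(section.text()).append("\n\n");
                // Translator annotations are rendered as block quotes below the section.
                for (String annotation : section.annotations()) {
                    out.append("> Note: ").append(annotation).append("\n\n");
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Section section = new Section("First paragraph.", List.of("A translator note."));
        Book book = new Book("Rhetoric", List.of(new Chapter(1, List.of(section))));
        System.out.print(toMarkdown(book));
    }
}
```

Because the writer only depends on the in-memory model, it can be written and tried out before the XML reading part exists.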

Reading

The main question is what tooling to use for the conversion. I have used both StAX parsers and JAXB in the past. Both seem suitable, so it will be a matter of picking the one that is most convenient to implement.

Main concern: choice of tooling, and fit with the desiderata.

A JAXB-based implementation will require the creation of an XSD schema, which is then used to generate Java code. A StAX parser implementation will require manual parsing of the XML.

As a JAXB-based approach would require me to look for the schema definition, add a bunch of dependencies to my code, generate the Java classes, and then write the actual conversion code, I decided to go with the StAX parser. This is a simpler, more straightforward approach, and will likely do the job just fine.

If you are unfamiliar with the concept of a StAX parser: it is a streaming parser that walks through the document from start to finish, one piece at a time, and leaves it to your code to keep track of the context it is currently in. A simplified pseudo-code version, approximating this with a line-by-line loop and keyword checks, looks something like this:


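// State lives outside the loop, so it survives from one line to the next
boolean inChapter = false;
boolean inSection = false;
int chapterNumber = 0;
List<String> sectionContent = new ArrayList<>();

for (String line : fileLines) {
    if (line.contains("<chapter>")) {
        inChapter = true;
        chapterNumber += 1;
        storeChapter(chapterNumber);
    }
    if (line.contains("<p>")) {
        inSection = true;
    }

    // Collect the text while we are inside a paragraph of a chapter
    if (inChapter && inSection) {
        var sanitizedLine = stripHtml(line);
        sectionContent.add(sanitizedLine);
    }

    if (line.contains("</p>")) {
        // storeSection should copy the list, as it is cleared right after
        storeSection(sectionContent);
        sectionContent.clear();
        inSection = false;
    }
    if (line.contains("</chapter>")) {
        inChapter = false;
    }
}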

The real implementation will be a bit more complex, but this is the general idea. If you are interested, the source code is available over at github.com/stijn-dejongh/ars-rethorica.
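For comparison with the pseudo-code, the sketch below shows the same idea using the StAX API that ships with the JDK (`javax.xml.stream`). It is not the code from the repository. The class `StaxSketch`, the method `readParagraphs`, and the tag names `chapter` and `p` are assumptions carried over from the pseudo-code, and the real Perseus XML may use different element names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical reader, for illustration only.
final class StaxSketch {

    /** Returns the text of every <p> found inside a <chapter>, in document order. */
    static List<String> readParagraphs(String xml) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inChapter = false;
        boolean inParagraph = false;
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Each call to next() moves to the next event: start tag, text, end tag, ...
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if ("chapter".equals(reader.getLocalName())) {
                            inChapter = true;
                        } else if ("p".equals(reader.getLocalName())) {
                            inParagraph = true;
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        // Text of nested elements is kept, their tags are dropped.
                        if (inChapter && inParagraph) {
                            current.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if ("p".equals(reader.getLocalName()) && inChapter && inParagraph) {
                            paragraphs.add(current.toString().trim());
                            current.setLength(0);
                            inParagraph = false;
                        } else if ("chapter".equals(reader.getLocalName())) {
                            inChapter = false;
                        }
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String xml = "<book><chapter><p>Hello <i>world</i></p><p>Second</p></chapter></book>";
        System.out.println(readParagraphs(xml));
    }
}
```

The context tracking with boolean flags is the same as in the pseudo-code. The difference is that the parser hands over XML events instead of raw lines, so a tag split across lines or two tags on one line do not confuse it.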

Considered Alternatives

  • Use a different implementation language (shell scripting, python, Kotlin)
    • pros: possibly better data conversion libraries exist, as these tools are commonly used in data processing tasks
    • cons: I have less experience with these languages, and would need to invest time in setting up my development environment
  • Use a different data source (HTML, PDF, plain text)
    • pros: someone else might have done the conversion already, and I can reuse their work
    • cons: ethical considerations about reusing someone else’s work without permission, licensing issues
  • Avoid the problem entirely, and use a different text or simply buy a book
    • pros: less work
    • cons: less fun

Result

Implementation

  • The implementation was done in about an hour, and the code was written in a single sitting
  • I wrote a simple StAX parser, but did not bother with making it testable or extensible

Outcomes

  • Initial parsing of the XML data was successful
    • The output was written to markdown, and converted to a LeanPub publication
    • Some minor duplications and formatting issues existed, which I decided to fix manually
  • The LeanPub publication was shared with a few friends, who provided feedback

You can see the result here: LeanPub: Aristotle’s Rhetoric

Lessons Learned

  • While I went for a simple solution, I did not go for a robust solution
  • I regret not having spent a bit more time writing tests
    • the duplication issue could have been avoided or fixed afterwards
  • Using existing tooling saved me a lot of time
    • I used a markdown to LeanPub converter to create a publication
    • I used a LeanPub publication to share the content with others
  • If I were to do this again, I would likely throw away this implementation and start over, using a more robust approach
  • I quite like the LeanPub platform, and will likely use it again in the future
  • Aristotle’s Rhetoric is a fascinating read, and I am glad I took the time to revisit it
  • The ancient Athenians really hated the Spartans and would go out of their way to insult them, given half a chance
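Coming back to the regret about missing tests: a small automated check is often enough to flag problems like the duplication issue before the e-book is generated. The sketch below assumes the duplicates were identical paragraphs appearing directly after each other, which is a guess, as the exact form of the duplication is not described here. The class `DuplicateCheck` and the method `consecutiveDuplicates` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical check, for illustration only.
final class DuplicateCheck {

    /** Returns the indexes of paragraphs that repeat the paragraph directly before them. */
    static List<Integer> consecutiveDuplicates(List<String> paragraphs) {
        List<Integer> duplicates = new ArrayList<>();
        for (int i = 1; i < paragraphs.size(); i++) {
            String previous = paragraphs.get(i - 1).trim();
            String current = paragraphs.get(i).trim();
            // Empty paragraphs are ignored, as they are not meaningful duplicates.
            if (!current.isEmpty() && current.equals(previous)) {
                duplicates.add(i);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println(consecutiveDuplicates(List.of("a", "b", "b", "c")));
    }
}
```

Running a check like this on the parsed paragraphs, and failing the conversion when the returned list is not empty, would turn the manual clean-up into a fix in the parser.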

References

Project technical references

Approach and method