Publishing Technical Documents with ePub

Prerelease Version

ePub in Production

The golden rule for ePub production is this: we’re all in this together. We’re all working towards the same specification and using the same kinds of tools. You may have different needs than most of us, but the goal is the same for us all: to produce content that is semantically rich and that looks good across various Reader platforms. This also means that many of the problems you may experience in production are problems for everyone, and it’s likely that someone out there will be able to help.

This chapter will look at a number of different software tools and processes. Some of these are for content creation, some of them are strictly for ePub production, and some cover both areas. The goal here is not to steer you towards the One True Platform™ but to help you to examine your workflow and understand the tradeoffs that are being made (consciously or not).

Managing Your Content

Though it is based on older standards, ePub itself is a fairly new technology. Depending upon the size of your organization, and how it came to start using ePub, you may not have a formal plan to manage your content. Or the plan may be “use our existing system, and add ePub at the end.” And when you’re first learning how to use ePub, and first adding it to your system, this may work out fine.

But as time goes on, you may start to develop areas of friction. These can pile up until you’re at the point where you spend more time fixing small creeping problems than making quality ePubs. These may even turn into major issues where you are spending hours of time fixing production issues that weren’t apparent until they reached your stage of the process.

While the cause of these issues may seem mysterious to those involved, they likely stem from a few common causes:

  • Scaling. Your workflow was manageable when you had x documents/week to create. Now you have 5x or 10x documents.
  • Conversion. Your content spends some of its time as something that isn’t HTML (or an equivalent). The tools you have to convert this to HTML are no longer adequate to your needs.
  • Growing edge cases. Your “special” cases are no longer “special.” For example, you originally planned for only having to deal with MathML 1–2 times per month. Now you have 20 per month.
  • Content mismatch. Other production areas are geared towards something besides ePub. Those areas introduce problems: data goes missing, or useless or even harmful data is added to the content. However, those issues aren’t actually a problem for anyone but you.
  • Platform mismatch. Quite simply, your tools no longer meet your needs. Maybe you started out with just vanilla markup, for example, but are now asked to use more scripting in your ePubs. Whatever the reason, you’re now spending more and more time doing “under the hood” customization to a process that used to be straightforward. You know you should retool, but don’t know what to use and don’t want to spend the time or money learning something new.

Common Tools

Before we address these issues, we need to look at a few of the common tools used in ePub production. While the specific apps may be different, the kinds of tools used, and the solutions that come with them, should still apply.

Adobe InDesign

Built by a multi-billion dollar software company that caters to the publishing industry, this is one of the major workhorses of the industry. The major advantage to this for most publishers is that you can use it to build the print version of your document, then take all of that content and turn it into an ePub without having to hire people who know HTML, CSS, or any other technology besides Adobe. And the resulting ePub can look exactly like the print product, if you want.

Of course, there are downsides:

  • You are locked into a single proprietary format.
  • Since they aren’t required to learn about the underlying tech, employees may not know how to fix any problems that crop up in the resulting ePub.
  • The CSS and HTML may be less than ideal since they are geared towards making the ePub look exactly like the print product, rather than making it look good in the ePub Reader.
  • InDesign doesn’t handle Math or other technical content as well as other tools.
  • If you want some of the more advanced ePub features, you end up having to spend a lot of time “under the hood” in the software to implement them.
  • Because the file format is proprietary, you are often limited to whatever automation can be accomplished inside the software, instead of making use of the tools that exist to work with HTML, XML, CSS, and ePub itself.

iBooks Author

Basically InDesign with a lower learning curve and better Math support. The upside is that it’s free; the downside is that you don’t have as many export options as InDesign.

Sigil (and Calibre)

A free, open-source alternative to InDesign, Sigil is another major workhorse simply because it’s free, it works everywhere, and does exactly what most users need it to do. However, like many FOSS projects, it’s not been updated regularly and has some rough edges. Also, its support for ePub 3.x has lagged.

Calibre, which is an ePub reader/manager, has added some editing capabilities to fill in the hole left by Sigil’s slower development. It’s good enough for most users, but less focused than Sigil. Both have HTML editors, for when you need to work with the code directly, but also rich text editing for those who are more comfortable with a word processor-like interface.

The downsides:

  • Lacks support for Math, and other technical content.
  • “Book view” may not look like it does in most ePub Readers. Unlike with InDesign, what you see is not always what you get.
  • Editor lags behind the latest changes to ePub Readers and HTML5.
  • Limited capabilities for automation inside the app, though you can use external tools as it just uses HTML and XML.
  • Since both are cross platform, you may be unable to use some common features available in your OS of choice.

Word Processor to ePub

MS Word, Pages, and even OpenOffice all have ways to take your word processing document and turn it into an ePub. Some of them even have decent support for Math. So you have a low cost, a very low learning curve, and a decent enough looking result. The disadvantages, however, are obvious once you unzip the resulting ePub.

  • The HTML is not very semantically rich. The CSS is not very useful as it is geared towards making the result look exactly like it did in the word processor.
  • Extra and unnecessary characters and markup are often added to the resulting document.
  • Unless you closely follow the template in your document, you end up with unnecessary classes and tags added to your ePub and CSS.
  • Automation can be very tricky to use inside the word processor, and the resulting HTML is often too messy to work with easily outside it.

Markdown to HTML to ePub

I haven’t brought up this workflow before, but these tools can be very good once you have them set up. Markdown is a way to turn minimally marked up plain text documents into very clean HTML. MultiMarkdown is an extension of Markdown which has a number of features that allow you to better integrate Math, References, etc. You can take the resulting XHTML (and other formats, in the case of MultiMarkdown) and stitch them together with some command line tools to build ePubs, PDFs, or whatever else you need.

The advantages:

  • Fairly low learning curve for most users.
  • Very good HTML.
  • Very friendly to customization and automation.
  • A number of tools to help you integrate it into content management systems and other production systems.

The disadvantages:

  • Hard to create a good preview-friendly workflow without a lot of scripting or command line work.
  • Difficult for non-technical users to work with.
  • High startup costs (in terms of time and technical knowledge) may not be worth it for lower volume ePub production workflows.
  • You will likely still need to work with the resulting HTML a little before it is good enough to put into your ePub.

HTML to ePub (via command line tools)

I’m including this more for completeness’ sake. If you have a good HTML editor (like, say, BBEdit) and a way to zip everything together, you don’t really need anything else.

The main advantages are:

  • If you know what you’re doing, your markup will be semantically better than any other option.
  • You have total control over every aspect of the ePub. You don’t need to depend on another developer’s interpretation of the standards; you can work with the markup directly.

The disadvantages:

  • A much higher learning curve, especially for users that aren’t comfortable with HTML.
  • Requires a lot of templating and automation if you want the production speed to match the software tools.
  • It’s easier to introduce small errors in production when working directly with the code. Software tools are very useful when it comes to automating tedious and error-prone parts of the standard, even if they’re not so good at creating semantically useful markup on their own.
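
If you go the hand-built route, the zipping step has one quirk worth knowing: the mimetype entry has to be the first thing in the archive, and it must be stored without compression. Here is a minimal Python sketch of that step. It assumes you have an unpacked ePub directory and that the output file lives outside that directory; build_epub is a name made up for this example, not part of any tool mentioned in this chapter.

```python
import zipfile
from pathlib import Path

MIMETYPE = "application/epub+zip"


def build_epub(source_dir, epub_path):
    """Zip an unpacked ePub directory. Returns the member names in archive order.

    Assumption: epub_path is NOT inside source_dir.
    """
    src = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w") as zf:
        # A bare ZipInfo defaults to ZIP_STORED, so mimetype goes in first
        # and uncompressed.
        zf.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE)
        for path in sorted(src.rglob("*")):
            rel = path.relative_to(src).as_posix()
            if path.is_file() and rel != "mimetype":
                zf.write(path, rel, compress_type=zipfile.ZIP_DEFLATED)
        return zf.namelist()
```

Calling build_epub("my-book", "my-book.epub") would give you a file you can hand straight to a validator. It does nothing to check that the contents make sense; it only gets the packaging right.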

Version Control Software

If you’re using an XML or text-based workflow (i.e. everything but InDesign), you should look into some sort of version control software, like git or subversion. These tools track changes made line by line and let you keep track of multiple versions of your content over time. This can be especially useful if you’re worried about automation causing problems, as you can store a “before” and “after” picture of your documents and roll back the changes if something is wrong.

These tools, especially git, have some learning curve, but many have good graphical front ends available and are really invaluable if you have (or are planning to have) a lot of content.

About Automation

This book hasn’t really discussed coding or assumed any sort of programming experience. If you’re a novice with this sort of thing, there are two important things to keep in mind:

  1. “Coding” or “scripting” or whatever you want to call it is not as difficult as you think. It’s tough to do at a high level, but if you’re reading this far into the book, you’re likely already technical enough that you can manage simple tasks that are still helpful without having to be an Über Leet Hacker Person.
  2. Coding is not some kind of magic cure all. There are tasks that are well suited for it, and some that aren’t. And there are some tasks that seem like they should be something you can automate, but turn out to be very difficult. Knowing how and why takes experience.

That being said, we’re not going to get into the nuts and bolts of coding, but rather discuss some software tools, and let you decide how helpful it would be to learn how to use that tool/feature.

Scaling Issues

There are two ways to handle huge increases in the amount of content: work harder, or work smarter. Since this isn’t a book about how to convince your boss to hire more people and/or approve unlimited overtime, we’ll cover working smarter, which means using automation.

Automation at its core is about using a computer’s speed to allow you to quickly apply a known solution to a known problem that crops up repeatedly in your workflow. If you want to use automation, you have to identify something that you do that is routine and you have to then examine what it is that makes it routine. Then you use your computer to find that problem and fix it for you.

The upshot of this is that, if you feel like your workflow isn’t set up to benefit from automation, you need to change your workflow so that it does benefit. This is the only way to fix scaling issues in the long run.

Reuse, Retrain, Retool

The first thing to recognize is that there is no Silver Bullet, or Wise Old Elf, or Dungeonmaster that will help slay the Evil Scaly Scaling Monster. There’s likely not even a single change you can make to your process that will cure everything. You’re going to need a series of changes, small and large. It’s going to require work, and a lot of thinking about how you work.

There are a few strategies that can help:

  • Reuse: Computers are great at storing your past work. Look for places in your workflow where you can just take something you did before, tweak it a bit, and use it again.
  • Retrain: Some of the software tools you are using may already have solutions. You, or your work colleagues, may not have taken the time to learn them. Time to dust off the manual (metaphorically speaking) and give it a look.
  • Retool: If your existing software can’t help you, you need to start looking for new tools or different workflows.

Reuse

Every one of the previously mentioned tools has a way to reuse your past work, even if it’s just saving a file as a template and building your new work from that. And if that’s not enough, each file in your resulting ePub can also be reused or replaced to meet your needs (just zip them up when you’re finished).

If you want to improve your process, you need to think about your work at a higher level. You can’t think of each project as an island, rather you should think of each ePub you’re building as it relates to your workflow as a whole.

Your templates (and other shared files) are the tools you use to build your ePubs and you need to make everything as reusable as you can. You want your toolbox to have a few adjustable wrenches, rather than a box full of mixed up sockets that can only be used once per project.

Organize your templates so that you know which one to grab to create the kind of ePub you need. If you’re in a small group, share your work. If you find a solution to a thorny design problem, add it to an existing template or store it in a new one. Make sure to document your work so others can understand why you did what you did. Make sure to back up your templates regularly.

This may not seem like automation (where’s the “coding”?) but your templates are known solutions to known problems, and they’re used by your software to fix those problems. In that sense, they can be the most important piece of automation you have.

Retrain

If you’re working on a computer and you’re fixing the same thing over and over by hand, you’re probably doing something wrong. Just as you can use templates to reuse solutions at a document level, your software (if it’s any good at all) has tools to help you reuse solutions at a lower level. And you need to learn about those solutions.

This can be as simple as learning to use regular expressions to make a single “smart” search and replace do the work of several “dumb” ones, or learning how to use scripts others have written for InDesign. If you’re ambitious, you can even write your own scripts.
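
To make the “smart” versus “dumb” distinction concrete, here is a small Python sketch. Say your incoming text spells ePub five different ways. You could run five separate replacements, or one pattern that covers all of them. The pattern and the function name are just illustrations I made up; adjust them to whatever variants you actually see.

```python
import re

# One pattern covers "epub", "EPUB", "e-pub", "E-Pub", "e pub", and so on.
# The \b word boundaries keep it from touching words like "republish".
VARIANTS = re.compile(r"\be[- ]?pub\b", re.IGNORECASE)


def normalize_epub_spelling(text):
    """Rewrite every spelling variant as "ePub" in a single pass."""
    return VARIANTS.sub("ePub", text)
```

Most good text editors accept the same pattern in their search dialog, so you don’t need a script to benefit from it.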

It’s hard to talk about specifics, but in general, if you’re doing an action over and over, find a way to teach it to your computer. It may seem slower at first, but once you learn how, it more than pays for the time you spend. And, equally important, once you take that time to create a solution, save it. Even if it’s just saving a note to yourself about how you fixed it.

Retool

If the software you’re using isn’t automation friendly, or if it can’t handle the scale at which you are using it, you need to find a tool that does work. Choosing a new tool always involves tradeoffs. Choosing the right tool for your workflow requires that you look at how it will be used from multiple perspectives.

On top of that, trying to make major changes in your organization, like introducing a Markdown-centric workflow in an Adobe publishing house, can be difficult. There may be enough pushback from people who are used to doing things the current way that it isn’t possible. At least, not until things are so bad no one can fight against change.

However, there are plenty of smaller tools you can use that can make your work more efficient without causing issues in other areas. And demonstrating how well your small change works can help you gain enough momentum to introduce major changes later on.

Sass

Sass is a “CSS extension language” used by many web developers. Basically, it’s a tool that lets you build your stylesheets smarter and makes them easier to reuse. If you already know CSS, the learning curve is pretty gentle.

There are GUI interfaces (some of them free, some not) and a command line interface as well. If you find yourself making stylesheets for your ePubs (from scratch or not), it can be very helpful and is worth looking into.

Text Editors

If you don’t have to export your work back into InDesign (or equivalent), then you may as well unzip your ePub, work with the files in a better text/HTML editor, and zip it up when you’re done. The best text editors are fully scriptable, have amazing search and replace functions, understand the markup you are using and can point out problems and typos while you’re working. Many have had years of polish so they are just better than what the “average” user is given to work with.

There are tons of them out there. The Internet is your friend here, but some of the better ones I’ve seen include:

  • BBEdit (or TextWrangler which is the freeware version). Mac OS. Easy to learn, worth mastering.
  • Sublime Text. Popular, multi-platform, powerful, but not that easy to learn.
  • Atom. Multi-platform, fairly powerful, very customizable, easy to learn.
  • vim family. Very powerful, but more geared towards programmers than web designers. Relatively high learning curve.
  • emacs. Very powerful, very high learning curve, highly customizable, geared more towards programmers.

The Command Line

This may seem either stupidly obvious, or slightly scary, depending on your comfort level. But learning to use the command line, whether Unix or Windows PowerShell, can help you handle many complex tasks quickly and easily.

Need to replace a misspelled name across an entire ePub? You could open up each file in a text editor and do a global search and replace, or you could just go to the directory with the files and type:

sed -i '' 's/Jeferson/Jefferson/g' *.xhtml

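and hit ‘Enter’. The spelling mistake is now fixed for every XHTML file in the directory. (The empty quotes after -i are the macOS/BSD form of the command; GNU sed on Linux takes -i on its own.) And that’s just one simple example that could save you half an hour’s worth of work.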

The learning curve for the command line can be high, but it is very powerful and learning and applying the basics can help you automate and customize your workflow better than something even a billion-dollar software company could create. And that’s not even including the ability to add the hundreds of scripts and programs that developers have designed and put out there for free to handle HTML and ePub.

A few places to get started:

  • Unix for the Beginning Mage. This is a very good guide for those with limited experience on the command line. The PDF is a fun read even for experienced Unix Magi.
  • Stronger Shell. A very good article with a number of links to other good resources. With the web being the way it is, hopefully it’ll still be there when you’re reading this.
  • The Art of the Command Line. A good guide/breakdown/cheatsheet for all of the tools at your disposal, but not a great tutorial page.

Work Smarter

So again, when dealing with scaling issues, you need to look at your workflow and work smarter. Make it easier to reuse your files, learn how to better use your existing tools, and take the time to acquire and learn to use the more powerful tools. No one is an island when it comes to ePub production. If you have a problem, there’s a good chance someone else has it as well and they may even have come up with a workable solution. Remember, we’re all in this together!

Conversion Issues

Converting content to good HTML can be one of the more time-consuming issues with ePub production. Depending on the file formats involved, and the structure of the incoming content, handling this can either be a minor annoyance or a major source of pain.

How you handle this can greatly depend on your workflow and the tools available. But no matter what your toolset, the best way to diagnose and fix your problem is to approach it scientifically.

Collect Good Data

Conversion problems are, first and foremost, quality control problems. Potential solutions need analysis, not subjective judgement. For example, you may feel like there is a single conversion problem that is causing you the most annoyance, only to discover later on that you are spending more actual time on routine conversion fixes. If you don’t take the time to collect impartial data, you may miss out on real areas of improvement.

Ideally, you want all the data, but realistically this is not always possible. You need to think about the kinds of data you want to collect, and what measurements you want to take. This can be as simple as taking a random sample, tracking the file size (e.g. number of words), counting the number of errors, and then timing how long it takes to clean up those errors. Or it can be as complex as a script that runs tests on each file as it comes in and logs the results.

The key is to make sure that the tests are meaningful and unbiased and that the sampling is done correctly (truly random or systematic, not just arbitrary). This method may be overkill for small workflows, but those can be fixed relatively easily no matter what method you’re using. But if your workflow has a high volume, or has multiple content types (or both), you need real data to uncover the root problems.
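
If “truly random or systematic” sounds abstract, here is roughly what the two look like in Python. This is only a sketch: the function names are mine, and it assumes you already have a list of incoming file names to pick from.

```python
import random


def random_sample(files, k, seed=None):
    """Simple random sample: every file has the same chance of being picked.

    Passing a seed makes the pick repeatable, which helps when you need
    to show someone else exactly which files you measured.
    """
    pool = list(files)
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


def systematic_sample(files, step, start=0):
    """Systematic sample: every step-th file, beginning at index start."""
    return list(files)[start::step]
```

For a systematic sample, it is common to pick the starting index at random (somewhere below step) so that you aren’t always measuring the first file of every batch. What you want to avoid is the “arbitrary” sample: the files you happened to remember, or the ones that annoyed you most.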

Analyze Your Data

This likely seems obvious, but it still needs to be spelled out. Look at what you’ve measured. Establish a baseline for the numbers and kinds of errors you deal with. Look at the outliers (both high and low) and try to uncover why those are different. If you want to diagnose your problem, and justify any potential solutions, you need to look for patterns in the data.
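
A baseline and a list of outliers don’t require a statistics package. The sketch below uses the median as the baseline and the common “1.5 times the interquartile range” rule of thumb to flag outliers. That rule is my choice for the example, not the only reasonable one, and summarize is a made-up name. It assumes you have a count of errors for each file.

```python
import statistics


def summarize(errors_per_file):
    """Return the median error count and the files that sit far from it.

    errors_per_file maps a file name to its number of errors.
    """
    values = list(errors_per_file.values())
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    outliers = {
        name: count
        for name, count in errors_per_file.items()
        if count < low or count > high
    }
    return {"baseline": statistics.median(values), "outliers": outliers}
```

The files that come back in outliers are the ones to open up and read. The numbers tell you where to look; they don’t tell you why.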

Form a Hypothesis and Test

This may seem both overly formal and overly obvious, but you should still spell out in clear language what you believe is causing the problem and why. You also need a way to test that hypothesis. That way, when you go back and collect the data again, it will show that what you changed actually did improve the results.

And if it turns out that you were wrong and it didn’t do what you thought it would, you now know that you haven’t actually fixed the problem. This is still better than believing you fixed a problem, because you did a lot of work on it, only to have it recur in the future.

Collect More Data

Good Quality Control is about having reliable data. Obviously, you don’t want to make your coworkers run around with a stopwatch all the time, but if you can collect meaningful data automatically (even if it is just validating and logging it after conversion), there’s no reason not to.
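
As an example of the “validate and log” idea, here is a small Python sketch that checks every XHTML file in a directory for well-formedness and appends one row per file to a CSV log. The function name and the log layout are assumptions made for the example. It is also much weaker than a real ePub validator: it only asks whether each file parses as XML, and Python’s parser will reject named entities (such as &nbsp;) that it doesn’t know about.

```python
import csv
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from pathlib import Path


def log_wellformedness(content_dir, log_path):
    """Append one CSV row per .xhtml file: timestamp, name, status, detail.

    Returns the number of files that failed to parse.
    """
    failures = 0
    with open(log_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for path in sorted(Path(content_dir).rglob("*.xhtml")):
            try:
                ET.parse(path)
                status, detail = "ok", ""
            except ET.ParseError as err:
                status, detail = "error", str(err)
                failures += 1
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            writer.writerow([stamp, path.name, status, detail])
    return failures
```

Run something like this after every conversion and, a few months on, you have a history you can actually analyze, with nobody holding a stopwatch.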

Edge Cases

This is something that, again, is very hard to generalize about. What you need to do, first and foremost, is research. There are tools out there to help you, but you need to find them and learn how they work. At least, enough about how they work that you can decide whether they would be suitable and fit well into your workflow.

The other approach may be to change your workflow so that someone on the team can specialize in these cases. From a management perspective, this is not always desirable, but the reality is that your group always ends up doing this, in a de facto manner if nothing else, as there’s someone who knows more about the problem and ends up answering the questions about them.

In any case, research, research, research, then try to figure out how much these cases are costing you. And how much they may be costing in the future in terms of time and lost productivity. Use those costs as a rough budget for how much time, money, etc. you should spend on a solution. Again, it’s hard to generalize about this.
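
The “rough budget” is simple arithmetic, but it helps to write it down. In the sketch below, only the 20 cases per month comes from the MathML example earlier in the chapter; the hours per case and the hourly rate are invented numbers, so plug in your own.

```python
def yearly_edge_case_cost(cases_per_month, hours_per_case, hourly_rate):
    """Rough yearly cost of handling an edge case by hand."""
    return cases_per_month * 12 * hours_per_case * hourly_rate


# 20 cases a month, 1.5 hours each (assumed), at 40 per hour (assumed):
# 20 * 12 * 1.5 * 40 = 14400 per year.
budget = yearly_edge_case_cost(20, 1.5, 40)
```

If a tool or a week of scripting costs noticeably less than that figure, you have your argument.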

Content Mismatch

This is pretty much always a case of problems not cropping up until the “Rubber meets the Road” which is likely your area, and maybe only your area. You can complain about whatever it is that’s making your job difficult, but it’s much harder to get people to listen or change.

What you need, then, is a way to make it obvious to people where the problem is and make it relatively easy to fix before it comes to your area. And if you can’t do that, then you need to take some of these automation tools and put them to work finding and fixing the issue when it first comes into your area. Even if the “fix” is to just send it back with a notice for someone else to work on it.

Validation

The great thing about ePub content is that nearly all of the files that carry your text are XML: the content documents are XHTML, and the package and navigation files are XML as well. Which means there are tools that will automatically look over your files and let you know right away if there are structural issues. Of course, you can only use these tools if the content is in XML or another structured format.

Which creates an incentive to move your content out of a word processing format and into XML, Markdown, LaTeX, plain text with YAML, or anything else that’s automation friendly as early in the process as possible. The more content your group is dealing with, the bigger the incentive. If you can train the authors, or whoever, to create their content using some kind of markup, then it opens up a large number of possibilities compared to the usual MS Word or Adobe CS ecosystem.

There are businesses that manage to do just fine with these tools. And at a large enough scale, you can catch Microsoft’s or Adobe’s attention and come up with a good solution. But those mega-software companies don’t care about small scale customers, and if history is any guide, that means they aren’t likely to care about the needs of most people who do technical writing. Because this area will always be small scale.

So which tools are good for validation? Here’s a quick list:

ePubCheck. This is Free, Open Source software that was built to validate your ePubs. This should already be integrated into your workflow, whether from the command line, or some internal web tool or script that runs this. It’s written in Java so it should work pretty much everywhere.

LibXML. This is another FOSS library (libxml2) that includes tools you can use from the command line, most notably xmllint. It’s already installed on many systems (run xmllint --version to check) but you can get a binary or build it yourself if you know how. Depending on the mode, you can even set it up to check against a DTD or Schema to make sure your documents are tagged correctly and not just well-formed.

LibXML may be very useful if your workflow has a lot of XML or HTML/Markdown, but is less useful if your location is the only part of the process that uses these formats. Ideally, you’ll just set up a script (or other process) to convert your documents (if needed), validate them, and then check the result.
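
Here is one way the “validate, then check the result” step might look if you drive xmllint from a Python script. Treat it as a sketch: the two function names are mine, it assumes xmllint is installed and on your PATH, and it only uses the --noout and --dtdvalid options.

```python
import shutil
import subprocess


def xmllint_command(path, dtd=None):
    """Build the xmllint argument list; --noout suppresses the echoed document."""
    cmd = ["xmllint", "--noout"]
    if dtd:
        cmd += ["--dtdvalid", dtd]  # also validate against the given DTD
    return cmd + [path]


def check_with_xmllint(path, dtd=None):
    """Return (ok, messages). Raises FileNotFoundError if xmllint is missing."""
    if shutil.which("xmllint") is None:
        raise FileNotFoundError("xmllint not found on PATH")
    result = subprocess.run(
        xmllint_command(path, dtd), capture_output=True, text=True
    )
    return result.returncode == 0, result.stderr.strip()
```

Keeping the command-building separate from the running makes it easy to print the command and try it by hand when something looks off.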

XSLT or other scripting. Often your biggest headaches come from content that is technically correct but is marked up in a way that just comes out wrong. This could be really bad HTML with too many unnecessary elements (like extra <span> tags) or something else.

What you need is a way to make these mistakes obvious at a glance. One way is to convert the result to HTML (if it isn’t already) using XSLT or another tool. Then write a stylesheet that shows the bad markup in a way that is hard to miss (like, say, an angry red). This can then be used in your workflow as a way to quality check your markup.
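
Here is a sketch of that idea in Python rather than XSLT. It treats a span with no attributes at all as suspicious, tags each one with a class, and pairs that with a one-line stylesheet. The class name, the CSS rule, and the definition of “bad” are all assumptions for the example. Note that Python’s parser drops the XML declaration and DOCTYPE, so the output is for a QA preview only, not something to ship.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)  # serialize without ns0: prefixes

# Hypothetical QA stylesheet: paint anything tagged "qa-flag" an angry red.
QA_CSS = ".qa-flag { background: #c00; color: #fff; }"


def flag_bare_spans(xhtml):
    """Add class="qa-flag" to every <span> that carries no attributes.

    Returns the modified markup and the number of spans flagged.
    """
    root = ET.fromstring(xhtml)
    flagged = 0
    for span in root.iter(f"{{{XHTML_NS}}}span"):
        if not span.attrib:
            span.set("class", "qa-flag")
            flagged += 1
    return ET.tostring(root, encoding="unicode"), flagged
```

Open the flagged copy in a browser with QA_CSS applied and the problem spots jump out. The count is also a handy number to log over time.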

Automated Fixes

Using automation to fix content should not be your first choice, mainly because you end up with an additional piece of software you are constantly tweaking to fix whatever new weird problem is being caused upstream. However, if you don’t have any leverage with the people generating your content, your choice is to fix it manually or use automation.

Automation is better, but you have to be careful about a few things. First, make sure the original content is saved somewhere before you start hacking away at it with your script. Ideally, you would use some sort of version control software, as that can show you the changes and roll back if needed, but even separate input/output directories will do. Second, make sure to use the correct tool for the job. If many of your problems are easier to fix using regular expressions, use them, but if some of them are easier with an XML DOM type approach, then use that.

Third, don’t be afraid to break your scripts down into smaller pieces. In fact, this is probably the best approach as you can just chain the output of one into the input of another process. This also makes it easier to maintain and track your scripts. Breaking things down can also be useful if you intend to add some sort of interface, like an internal website, later on.
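
Here is a minimal Python sketch that combines two of these points: small fixes chained one after another, and separate input and output directories so the originals are never touched. The two fixes are stand-ins I made up; yours will be whatever your upstream keeps breaking.

```python
import re
from functools import reduce
from pathlib import Path


def strip_trailing_spaces(text):
    """Remove spaces and tabs that sit just before a line break."""
    return re.sub(r"[ \t]+(?=\n)", "", text)


def drop_empty_paragraphs(text):
    """Remove <p> elements that contain nothing but whitespace."""
    return re.sub(r"<p>\s*</p>\n?", "", text)


PIPELINE = [strip_trailing_spaces, drop_empty_paragraphs]


def run_pipeline(text, steps=PIPELINE):
    """Feed the output of each step into the next one."""
    return reduce(lambda current, step: step(current), steps, text)


def fix_directory(in_dir, out_dir, steps=PIPELINE):
    """Write fixed copies of in_dir/*.xhtml to out_dir; return the changed names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    changed = []
    for path in sorted(Path(in_dir).glob("*.xhtml")):
        original = path.read_text(encoding="utf-8")
        fixed = run_pipeline(original, steps)
        (out / path.name).write_text(fixed, encoding="utf-8")
        if fixed != original:
            changed.append(path.name)
    return changed
```

Because each fix is its own small function, you can test it on its own, reorder the list, or drop a step for one troublesome project without rewriting the rest.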

Also, you may need to examine what you are trying to accomplish and decide whether the thing you are fixing is always wrong, or whether it is something that may be weird but shouldn’t be automatically “fixed” without giving someone the option to reject the change. Automated “fix it” tools are the kind of software that can end up being put into place and used blindly by other coworkers for a long time. So it’s best to think hard and evaluate what you’re doing because no one else may know how and why the software does what it does.
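
One low-tech way to leave room for a “no” is to have the script propose changes instead of applying them. The Python sketch below produces a standard unified diff that a person can read and then accept or reject. The function name and the labels are made up for the example.

```python
import difflib


def review_diff(before, after, name="chapter.xhtml"):
    """Return a unified diff of a proposed fix, or "" if nothing changed."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"{name} (original)",
        tofile=f"{name} (proposed)",
    )
    return "".join(lines)
```

An empty result means the fix had nothing to do. Anything else can be saved next to the file or pasted into an email, so the decision stays with a human.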

This also means you should try and document what you’re doing and why, early and often. You will likely be setting this work aside for months or longer before picking it up again. There is nothing worse than trying to figure out what you were thinking six months or a year ago when you are looking at a script.

Platform Mismatch

As I mentioned earlier in the chapter, this is my term for when your tools no longer meet your needs. The previously mentioned tools are a good place to start if you need an idea of what’s available.

This is one of those issues that will need a lot of thought. If you’re looking at new file formats, new software tools, or more automation, there will be some education and learning curves for everyone. There may also be a lot of rework if you’re trying to move some of your old content into a new system.

The best approach is to start small. Pick a small amount of new content to use as a prototype and examine it thoroughly to work out all of the issues you can think of in the new system.

Trying to move everything all at once is a bad idea, but then again, you don’t want to wait so long that all momentum stalls. If this happens, you can end up with some kind of hybrid system where some of your work is permanently stuck in the old system while the rest is in the new. Unless you’re an expert in the old system and want job security. Then it’s okay.

Remember, there are many barriers to change. In most cases, it’s very difficult to get any kind of change to happen unless the situation is unbearable.

The only good news is that there are a number of excellent software tools out there that are both free and robust. Which means you can evaluate them as much as you like, tweak them if needed, and don’t have to worry about licensing or other costs. And again, even if you can’t convince your entire group to use these tools, you likely can still argue for enough freedom to go ahead and use them yourself.
