PDFs are now smaller thanks to the financial support of Code & Co. It has been a real pleasure to develop this feature with them.
We would also like to thank @suvtur for his great ideas to reduce the size of embedded images.
Would you love to get a new feature, a bug fix, or some support for WeasyPrint? Don’t hesitate to get in touch with us! Or if you simply want to see the project grow beautifully, you can donate on OpenCollective 😉.
Reducing the size of PDFs is an important feature for WeasyPrint’s users, according to the 2-year survey. It saves disk space when documents are archived, it reduces download times for documents generated by web applications, and it avoids problems when the size is limited, for example for email attachments.
When Code & Co. contacted us to reduce the size of our documents, we were excited to find a solution to reach this goal. Great news, but… how can we reduce the size of our PDFs? Is it even possible?
We already had two main ideas: optimize images and remove useless content. And finally, a third one we didn’t think of was even better… Which one? Let’s go through the whole process to find out!
Optimizing images is maybe the most obvious solution when we want to save space. In some documents, images are by far the biggest content, and finding ways to optimize them can lead to very interesting results.
Some optimizations are lossless, allowing users to get exactly the same rendering while saving some space. Such an option already exists in WeasyPrint: with the -O images option (since renamed), WeasyPrint asks Pillow to generate smaller images if possible. Let’s trust Pillow for this, there’s probably nothing more we can do.
Other optimizations are lossy. Reducing the image size or the image quality is a great way to get smaller files, even if the overall rendering may be slightly worse. In situations where the document’s size is more important than its quality, it can be useful for users to have this possibility.
Thanks to good ideas proposed by @suvtur, we now have two new options:
--jpeg-quality, which lowers the quality of JPEG images so that they compress better, and
--dpi, which reduces the width and height of images to reach a maximum given resolution.
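To make the --dpi idea more concrete, here is a minimal sketch of the arithmetic involved. It is not WeasyPrint’s actual implementation: the function name `downscale_for_dpi` and its parameters are hypothetical, and it assumes we know both the pixel size of an image and the size at which it is rendered on the page, in inches.

```python
def downscale_for_dpi(pixel_width, pixel_height,
                      rendered_width_in, rendered_height_in, max_dpi):
    """Return the pixel size needed to stay at or below max_dpi.

    Hypothetical helper for illustration only, not WeasyPrint's code.
    """
    # Current resolution along each axis, in pixels per inch.
    dpi_x = pixel_width / rendered_width_in
    dpi_y = pixel_height / rendered_height_in
    current_dpi = max(dpi_x, dpi_y)

    # Never upscale: images already below the limit are left untouched.
    if current_dpi <= max_dpi:
        return pixel_width, pixel_height

    # Apply the same ratio to both axes to keep the aspect ratio.
    ratio = max_dpi / current_dpi
    return (max(1, round(pixel_width * ratio)),
            max(1, round(pixel_height * ratio)))


# A 3000×2400 px image rendered at 5×4 inches has a resolution of 600 dpi.
# Limiting it to 150 dpi divides each dimension by 4.
print(downscale_for_dpi(3000, 2400, 5, 4, 150))
```

With these numbers, the image keeps only one sixteenth of its original pixels, which is where the size reduction comes from.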
The second step of our journey was to try removing useless data. Because, let’s be honest, we do store useless data in our PDFs.
Now that we use pydyf, our own PDF generator, we have more flexibility about how PDF objects are stored. The PDF format gives a lot of possibilities to remove some characters (some spaces, for example) without changing the actual structure (and the actual text) of the document. Of course, we didn’t want to sacrifice the readability of our documents to save a few bytes: it’s actually interesting to read them for debugging purposes, even if it sometimes gives the impression of being in the Matrix! Nevertheless, we found two very common cases where removing spaces could save quite a lot of kilobytes with exactly the same PDF structure.
Some spaces are "useless" in code, most developers know that, and that’s true for PDF objects too. But during our journey, we’ve also found that some spaces are useless in the real PDF text! As strange as it may seem, PDF readers don’t really rely on space characters to determine word boundaries. Many of these spaces are actually impossible to select… and that’s a shame for Python code! So, we can safely remove spaces when they’re alone on their lines, and shave a few extra kilobytes off our PDFs.
We won’t go through all the successful attempts (removing zeros at the right of decimal numbers…) and less successful ones (using ASCII characters instead of hexadecimal codes…), but you get the idea!
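As an illustration of the "zeros at the right of decimal numbers" case, here is a small sketch of how a number can be written with as few characters as possible. This is not the code used by WeasyPrint or pydyf: the function name `compact_number` and the precision of 4 decimal places are assumptions made for the example.

```python
def compact_number(value):
    """Format a number for a PDF stream without useless trailing zeros.

    Hypothetical helper; the 4-decimal precision is an arbitrary assumption.
    """
    # Fixed-point notation only, so no exponent ever appears in the output.
    text = f"{value:.4f}"
    # Drop the zeros at the right, then the dot if nothing follows it.
    text = text.rstrip("0").rstrip(".")
    # Values rounded to zero can end up as "-0".
    return "0" if text == "-0" else text


for number in (12.5, 3.0, 0.25, -0.00001):
    print(f"{number:.4f}", "becomes", compact_number(number))
```

Each number saves only a few bytes, but documents can contain many coordinates, so these bytes add up.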
Let’s Change the PDF Structure
Optimizing images can give impressive results, but it often alters the quality of generated documents. Removing useless data doesn’t change anything for users, but the size it saves is not as large as we could hope. Can we do better?
Yes, we can. Version 1.5 of the PDF specification introduced "object streams", a type of PDF object that can group other objects. Why is that useful? Because the content of PDF streams can be compressed, while normal objects can’t. Do our PDFs contain a lot of objects that can be grouped and compressed in object streams?
Of course, it depends. But even with a small number of objects, object streams are really useful: PDF objects are mainly composed of repetitive ASCII characters, and that’s exactly when compression is really effective!
Objects that can be grouped in object streams are mainly metadata. Among metadata, one type is particularly interesting: links. Internal (reaching a specific position in the current PDF) or external (URLs opened in a web browser), they mainly contain repetitive text and numbers. Moreover, many documents include a lot of links: tables of contents, hyperlinks, headers, footers…
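To get a feeling for why grouping such objects pays off, here is a rough sketch that compresses a batch of link-like dictionaries with zlib, the algorithm behind PDF’s Flate compression. It does not build a valid object stream (the offset table and the stream dictionary are left out), and the helper `fake_link_dictionaries`, the URLs and the coordinates are all made up for the example.

```python
import zlib


def fake_link_dictionaries(count):
    """Build text that looks like many PDF link annotations (made-up data)."""
    dictionaries = []
    for number in range(1, count + 1):
        top = 700 - number % 600
        dictionaries.append(
            "<< /Type /Annot /Subtype /Link "
            f"/Rect [ 56 {top} 300 {top + 12} ] "
            f"/A << /S /URI /URI (https://example.org/page/{number}) >> >>\n"
        )
    return "".join(dictionaries).encode("ascii")


raw = fake_link_dictionaries(200)
compressed = zlib.compress(raw)

print(len(raw), "bytes uncompressed")
print(len(compressed), "bytes compressed")
```

The exact numbers depend on the content, but since almost every byte of these dictionaries is repeated from one link to the next, the compressed version is only a fraction of the original.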
Obviously, the results are very different for different documents. But in the end, was it worth the effort?
Short answer: yes, it was.
For some documents, the results can be really impressive. For our HTML5 sample document, the generated PDF loses 80% of its size, from 1,449 kB to 290 kB, without any data loss! 🚀
If you have documents with a lot of images, you can also get equivalent results. Lossless optimizations, and lossy ones if you don’t mind losing a little bit of rendering quality, can have a huge impact on the results. If you don’t want to spend time manually optimizing your images, you can now let WeasyPrint do this for you!
What Are the Next Steps?
The next step is yours! Don’t hesitate to try this beta and report bugs you may find 🐞.
The one after is to fix the reported bugs and release a nice WeasyPrint version 59.
Have fun with this beta 💜.