- 📝 Posted:
- 💰 Funded by:
- Ember2528
- 🏷️ Tags:
Well, that fell apart surprisingly quickly. The release of Shuusou Gyoku's Linux port just happened to be surrounded by the unluckiest sequence of events in Arch Linux land:
- Jan. 21: The SDL team releases the first stable version of SDL 3
- Jan. 24: Arch Linux packages SDL 3
- Jan. 25: I release 📝 the first version of Shuusou Gyoku's Linux port, completing the SDL 2 porting work I started in 2023
- Jan. 28: Arch Linux removes SDL 2 and replaces it with the previously packaged sdl2-compat, a compatibility layer that is meant to perfectly implement the SDL 2 API on top of SDL 3. In reality, though, it broke lots of applications including my Shuusou Gyoku port, and turned their users into quite disgruntled beta testers.
After I fixed a silly mistake on my part, Shuusou Gyoku was still playable on sdl2-compat as it was only affected by rather minor bugs, but these bugs still undermined the effort I put into the port. That left us with three options:
- Let the more involved SDL community sort out sdl2-compat on their own. After all, why should we bother if rogue distros randomly mess with our dependencies?
- Become part of that community and help fix the issues in either sdl2-compat or SDL 3.
- Properly update Shuusou Gyoku to SDL 3 right now, while keeping SDL 2 support for the Flatpak, more conservative Linux distributions, and the upcoming Windows 98 backport.
I really would have preferred to delay this migration for a few years until the dust had settled. For this project, I already picked C++ as the dependency I want to be on the bleeding edge of, and SDL 2 was supposed to balance this out by being the conservative and stable choice. Oh well, if we've got to update at some point, we might as well do it now. The ReC98 development schedule at least gave me another month of waiting for the community to sort out SDL 3's growing pains…
- Forced onto an unstable SDL 2 compatibility layer
- Updating to SDL 3
- Picking a screenshot format
- QOI, the expected disappointment
- JPEG XL, the unexpected disappointment
- Benchmark results
- Lossless WebP
- Letting players pick an effort level
- Future performance improvements
- Rendering the build ID with some unused glyphs
So, why does something like sdl2-compat even exist if it only causes problems? And why are distros rolling it out so soon after SDL 3 if SDL 2 has been working fine all the time? In a nutshell, sdl2-compat is the second pillar in SDL's forward compatibility strategy. While the 📝 dynamic API mechanism ensures compatibility with future minor versions by integrating dynamic linking so deeply that static linking is made entirely useless, sdlN-compat ensures compatibility with one future major version by implementing version N's API in terms of SDL version N+1. This allows the SDL team to very quickly stop updating version N while still allowing programs linked against that version to run well on modern systems by using all the actively maintained backends of version N+1. This worked out well with sdl12-compat, which nowadays seems to do a great job at preserving abandoned SDL 1 games – especially if we consider that you'd be running sdl12-compat on top of sdl2-compat on top of SDL 3 from now on.
So it only makes sense that the SDL developers would want to repeat this success story with the transition from SDL 2 to 3. The problem is that they're already selling sdl2-compat as a perfect drop-in replacement for proper SDL 2, and wanted to push it onto people even before SDL 3 was officially released. The sales pitch follows their usual "trust me bro" rhetoric:
If you absolutely must have the real SDL2 ("SDL 2 Classic"), please use the SDL2 branch at https://github.com/libsdl-org/SDL, which occasionally gets bug fixes (and eventually, no new formal releases). But we strongly encourage you not to do that.
Followed by zero arguments to back up this audacious suggestion. So they not only imply that sdl2-compat is already perfectly compatible and works without bugs for every SDL 2 program ever, but also that the underlying SDL 3 implementation doesn't introduce any bugs on top – and it only takes a single look into either project's issue tracker to disprove that notion. There is no technical reason why a distro couldn't ship SDL 3 and 2 in parallel. The continued existence of the SDL 2 AUR package is proof of that, and it was still receiving upset comments justifying its existence as of mid-March.
There was absolutely no reason to push sdl2-compat on everyone by default other than forcefully turning users into beta testers. SDL 2 was still stable, maintained, and working well. People who needed SDL 3 before its release for whatever feature already used SDL 3. People who want to use the SDL 3 backends to solve some obscure backend-related issue in an SDL 2 program can use sdl2-compat without needing it to be the only option available. And with a package size of 1.2 MiB, you can't convince me that SDL 2 is somehow a burden on the packaging front either – especially if your distro has separate packages for every commonly used fiddly Python and Haskell library.
I can't help but imagine the reaction if Microsoft pushed an enforced update of this magnitude. They're already getting regularly lambasted by the press for much smaller and ultimately inconsequential offenses…
For all the 📝 criticism I had about Flatpak and Flathub last time, they made the right choice of not treating their base package as a rolling and bleeding-edge distribution. The Freedesktop platform will only ship SDL 3 in its next version releasing in August, which will probably leave enough time for the SDL developers to address all but the rarest remaining issues in sdl2-compat. Although I'm not sure how I should interpret this commit being made at that specific time: This is either very considerate (because they've chosen to take up the job of early-adopting SDL 3 as part of developing the new SDK version, and thus will be helping out with reporting bugs), or very inconsiderate because they bought the whole sdl2-compat story just like Arch did. If Freedesktop SDK updates had shipped in February rather than August with the release tag on this branch, they would have screwed over their users just as much. Also, there's still not much point in force-updating everyone onto a compatibility layer in freaking 2025…
Then again, I can empathize with the SDL developers to a degree. Lots of developers have been asking the "when is SDL 3 ready and stable enough for regular use?" question while picturing SDL as this highly important and central library that surely has a big team of testers who could ensure its stability at one point. But if there just isn't enough Valve money to form such a team, what else should you do as a developer other than turn your personal hype into a "it's ready now, go use it and please leave feedback" reply? Maybe, turning your users into beta testers is the only realistic way to ever approach stability in this economy. And sure, they call it 3.2.0 for… reasons, but they're not fooling anyone.
The big irony, however, is this: At one point in the future, sdl2-compat will be that perfect solution for running abandoned SDL 2 (and SDL 1) programs on top of SDL 3. But it's the exact opposite of what you'd want during active development: You want to update to SDL 3 and use the new APIs and function names to be ready for the future, but also retain the option to run on the stable SDL 2 foundation for at least a little longer until every distribution has caught up. Or, in other words, you want to run SDL 3 on top of SDL 2.
You could totally have a library that implements this alternate kind of compatibility layer. It would still be prone to bugs just like sdl2-compat, but unlike that one, the chance for new bugs is halved since you'd be running on top of the proven and stable SDL 2. But of course, such a library would restrict your codebase to SDL 2's feature set, which is probably why something like this doesn't exist. So instead, our SDL platform layer now contains 64 conditional branches and a bunch of function renaming macros and generic helper code to support compiling against both SDL 3 and SDL 2. At least I wrote it all in a way that allows us to quickly rip out SDL 2 support once we no longer need it…
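To give an idea, here's a minimal sketch of that approach – `USE_SDL_3` and `HELP_SUCCEEDED()` are illustrative names for this post, not the ones actually used in the game's platform layer:

```cpp
#if defined(USE_SDL_3)
	#include <SDL3/SDL.h>

	// SDL 3: API functions return true on success.
	#define HELP_SUCCEEDED(ret) (ret)
#else
	#include <SDL.h>

	// The rest of the code uses the SDL 3 names; map them back onto SDL 2…
	#define SDL_DestroySurface SDL_FreeSurface

	// …and paper over the changed return value convention (0 = success).
	#define HELP_SUCCEEDED(ret) ((ret) == 0)
#endif
```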
Oh well, enough ranting. Because once it works, there are plenty of things to like about SDL 3. Limited to, of course, everything notable that applies to Shuusou Gyoku:
- Requesting fullscreen from SDL 3's basic window creation API will now always give you a borderless window as they went with the times and removed the option to directly create a window in exclusive fullscreen mode. In isolation, this might look bad enough to not even consider updating to SDL 3. However, this doesn't mean that boomer fullscreen is gone – it has only been relegated to a separate and, in fact, much more comprehensive mode-changing API that also covers refresh rates. Using it does require significantly more and different code compared to SDL 2 (see the sketch after this list), but being explicit about the refresh rate is crucial for games whose speed depends on the frame rate, like this one. If your display supports a 62.5 Hz mode by any chance, we select it now.
- SDL 3's software blitters come with optimized SSE2, SSE4.1, and AVX implementations, replacing SDL 2's aging and nowadays actually suboptimal MMX code paths. On the surface, this only seems to speed up the software renderer as far as we're concerned, but it will also be very welcome once we have to do pixel format conversions. (Which, spoiler, I managed to just barely avoid on the SDL level for this new code.)
- The new `SDL_SetRenderLogicalPresentation()` function now implements all of the three borderless fullscreen layouts as part of SDL. Together with the now cleaned-up handling of render target state, this removes almost all of the complexity and state juggling that SDL 2 previously required for the combination of fullscreen and clipping. Too bad that I still have to retain all of that SDL 2 code for the time being…
- The filesystem API that originated in SDL 2 is finally joined by a matching set of file access functions that Do The Right Thing, explicitly take UTF-8 filenames, and use the Unicode APIs on Windows. If this had existed 📝 at the end of 2022, I wouldn't have felt the need to write my own abstractions. Sure, the lack of UTF-16 overloads means that this API is not strictly, perfectly optimal on Windows, but in turn, we get this API for free with the rest of SDL. It'll even be very welcome for the Windows 9x port, which could simply translate UTF-8 to the system codepage without requiring any other kind of Unicode layer. Besides, I've found myself using these *strictly optimal* UTF-16 strings less and less: These have always been an implementation detail of the Windows version, and any path we save in a .CFG file should better be in UTF-8 to allow configuration sharing between Linux and Windows.
- `SDL_RenderReadPixels()`, the "screenshot" function that transfers pixel data from the GPU to system memory, now allocates a new pixel surface instead of writing pixel data in a specific format to pre-allocated memory. This is another change that looks bad on the surface because we sure love them freedoms to self-allocate our memory in C/C++ land. However:
  - This single allocation is far from being the bottleneck in the screenshotting process. It doesn't even clearly stick out in execution timings because it gets completely masked by the variance of the actual GPU→CPU pixel transfer.
  - In SDL 2's version of the function, you decided the pixel format that SDL would write into your buffer, which might have incurred a conversion if your chosen format didn't match the pixels returned by the GPU. In Shuusou Gyoku, this could have easily happened with geometry scaling. By newly allocating the returned surface, SDL 3 can keep the original pixel format and thus needs to involve at most a single `memcpy()` – which is always measurably faster than converting pixels, even if that conversion is SIMD-optimized.
  - Not even having the option to overthink memory pre-allocation sure simplifies your code a lot.
- Graphics APIs are now addressed by their identifier string rather than their index within the platform-specific list of APIs. SDL 2 has always provided ways to map between both indices and strings, but the fact that every function now takes a string is a nice way of nudging developers to use strings in their configuration as well. They would allow a user's API selection to be retained independently of the SDL developers later changing the order of that list – once I adapt our config format from numbers to strings in a future release, that is.
- This unassuming change to the OpenGL defaults on Windows removed the seemingly unfixable mode change for borderless fullscreen on some displays! 🙌
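As for the sketch promised in the first list item, the explicit SDL 3 flow roughly looks like this; a simplified example rather than the game's actual code:

```cpp
// Enumerate the exclusive-fullscreen modes of the window's current display
// and switch to a 640×480 mode that (roughly) matches the 62.5 Hz target.
int count = 0;
SDL_DisplayMode **modes = SDL_GetFullscreenDisplayModes(
	SDL_GetDisplayForWindow(window), &count
);
for (int i = 0; i < count; i++) {
	if ((modes[i]->w == 640) && (modes[i]->h == 480) &&
		(SDL_fabsf(modes[i]->refresh_rate - 62.5f) < 0.1f)) {
		SDL_SetWindowFullscreenMode(window, modes[i]); // nullptr = borderless
		SDL_SetWindowFullscreen(window, true);
		break;
	}
}
SDL_free(modes);
```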
A few changes have good and bad elements:
- SDL apps can now define metadata strings. Most of these currently don't do anything, but the identifier now gets used as the Wayland and X11 window class name and thus represents a much cleaner way of having class-derived icons than 📝 the previous undocumented `SDL_VIDEO_X11_WMCLASS` environment variable. But if you read that post again, my main issue wasn't SDL's implementation, but the fact that support for class-derived icons is so rare among window managers to begin with. Not only does this change not help the situation, but it arguably makes it even worse due to a slightly different mapping decision: The app identifier is assigned to the `WM_CLASS` class name, but the additional instance name receives the binary's file name, which unfortunately breaks class-derived icons in IceWM where the instance name takes precedence.
- Draw calls are now batched on all renderers, and batching can no longer be deactivated. 📝 During my previous experiments, SDL's Direct3D 11 backend turned out to be by far the fastest batching renderer on Windows, and SDL 3 coincidentally also made it the new default. So it makes sense to follow suit and remove our previous OpenGL override, restoring 📝 pixel-perfect line rendering in framebuffer-scaled mode by default.
  The massive downside, however, is that the combination of framebuffer rendering and OpenGL ES 2 is now completely broken on integrated Intel graphics, in the worst way: The game initializes fine and responds to input, but only shows a black screen. If we offer such a menu, we'd better also have a feature to unbrick your game in a non-graphical way if it only renders a black screen. That's why you now can
  - press F7 to cycle through the list of APIs at any point, or
  - use the environment variable `SDL_RENDER_DRIVER` to override any previous manual API selection, which didn't work before.
- Draw call batching even extends to the software renderer now, for some reason. Doesn't software rendering boil down to nothing more than writing pixels into a system-memory buffer on a single thread? There's no penalty for just doing the thing, but there certainly is a small penalty for gathering all the things into a queue. I'd rather not pepper that procedural mess of a graphics backend with even more imperative function calls, but you can make just as much of an argument for the consistency of requiring a flush regardless of whether a renderer represents software or hardware.
- The new Vulkan and GPU render backends are perhaps the most exciting change for a certain group of people. The GPU API in particular provides an abstraction for the common modern paradigm of command buffers and shaders, which is shared among Vulkan, Direct3D 12, and Metal. Given the amount of attention it received, this feature is undoubtedly great for everyone developing modern games. However, not only couldn't we care less for a game of this vintage, but it's also just more of the same dilemma: While more backends can offer a higher chance of the game working well on some potato out there, they primarily mean more code surface, which means more bugs.
- Luckily, the GPU API was so much of a success that the SDL team is thinking about removing SDL_Renderer's non-GPU Direct3D 12, Vulkan, and Metal backends. All of these implement the immediate SDL_Renderer API in terms of command buffers and shaders, so it makes perfect sense to just replace these specific implementations with the single GPU abstraction that in turn uses any one of the three APIs under the hood. Ideally, the API menu should then also let players choose among this second layer of backends once SDL has taken that step.
Thankfully, the list of entirely bad changes is quite short:
- All API functions now return `true`/nonzero on success and `false`/zero on failure, rather than 0 on success and <0 on failure as in SDL 2. Sure, `true` = success makes intuitive sense when you just start out programming, but then you realize that the overwhelming majority of functions can fail in multiple ways and success is just the absence of failure. SDL 2 got the right idea about this, but SDL 3 chose to regress to said beginner levels because Sam Lantinga got increasingly convinced of this idea that he, and everyone else, initially considered horrible.
- `#include` directives must now be prefixed with an explicit `SDL3/` path, unlike SDL 2 which didn't use a prefix. This was apparently necessary to fulfill some macOS requirement, but they've also removed the path from their `pkg-config --cflags`, turning the prefixed syntax into the only sanctioned cross-platform way of including SDL 3's headers. Being able to compile SDL3-using code without any additional `CFLAGS` might look pretty, but no sane build system is going to make an exception and not call `pkg-config --cflags` as it does for any other external library. And now I have to duplicate the `#include` section in every translation unit for the SDL 2 code path…
- All SDL threads must now be manually awaited before calling `SDL_Quit()`. If they aren't, SDL reports a "leaked thread" even if the underlying OS thread might have cleanly finished. I get it, structured concurrency is probably a good idea, but it only works naturally if the rest of your program is structured accordingly, which doesn't apply to this 25-year-old codebase. Enforcing this leak check just forces me to write cleanup code for the sole purpose of satisfying SDL's bookkeeping to avoid that error.
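In practice, satisfying that bookkeeping amounts to nothing more than a join right before shutdown – with `bgm_thread` here being a stand-in name for whichever thread is still around:

```cpp
// SDL 3 wants every SDL_Thread joined before SDL_Quit(), even if the thread
// finished its work long ago. SDL_WaitThread() also frees the thread handle.
SDL_WaitThread(bgm_thread, nullptr);
SDL_Quit();
```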
Still, the constant stumbling over bugs and deliberate instabilities made this take way longer than it had any right to. For three of these bugs, I was the first one to report them, and I could have even reported a fourth one if I actually cared about Vulkan and didn't happen to find a workaround right before I pushed out the release.
With the additional API unbricking feature, we've ended up well into a second push. Replays were too big of a feature for now, but screenshot compression sounded like a nice task for the rest of that push. Really, how hard can it be? Add reference C library of our encoder of choice, call API with pixel buffer we get from SDL, write compressed pixel buffer to file. Easy, right? Well…
For starters, which format do we choose? Ember2528 had a clear preference, but it makes sense to compare it against other contenders first. There will be a complete benchmark further below, but let's get the seemingly most obvious candidate out of the way first:
QOI
Because who doesn't want a fast encoder for a simple format with steadily growing adoption? Sure, part of the adoption might be hype-driven, but as far as hype goes, there are definitely worse targets than a codec that fits in less than 300 lines of C. The low-color images we want to compress are rather simple from a modern point of view as well, so you'd expect QOI to be a perfect match…
…until you actually try encoding a few representative images and are greeted with file sizes that are way further removed from PNG than you'd expect after seeing the official benchmarks. Since the specification is short enough, we can easily explain these results:
- All of Shuusou Gyoku's sprites are intended to be rendered within a palettized 256-color framebuffer. 3D-rendered gradients and transparency will drive up the number of unique colors in screenshots into the low 4-digit range at times, but it still makes sense to assume uncompressed 8-bit BMPs as the baseline. At our native resolution of 640×480, these are 308,278 bytes large (307,200 bytes of pixel data plus 1,078 bytes of BMP headers and the 256-entry palette). This is what we expect our chosen codec to beat, by hopefully a quite significant margin.
- The 32-bit `QOI_OP_RGB` chunk would already blow up each affected pixel to 4× the size it would have had in a palettized image. Let's hope that the QOI encoder largely uses this chunk to define palette colors, and that we don't get to see it that often otherwise.
- The 16-bit `QOI_OP_LUMA` chunk can maybe help compress unknown pixels that haven't yet been put into the running palette, but would still not contribute any compression compared to our baseline size. Fortunately, we shouldn't see too many of those as the encoder is specified to prefer 8-bit chunks where possible…
- …except that `QOI_OP_INDEX` spends 8 bits on encoding a 6-bit palette index. With only 64 colors in the palette rather than the 256 we want, we're bound to see a lot more of those bulky 32-bit `QOI_OP_RGB` chunks after all. Not to mention the fact that colors are mapped onto these 64 palette slots using a simple multiplicative hash that will cause collisions at regular color intervals.
- Any compression gains over uncompressed 8-bit BMP would therefore come from `QOI_OP_RUN`. If run-length encoding is the best an image codec can do, that's rather basic instead of OK, I'd say.
  - Actually… wait a moment, doesn't BMP also have a run-length-encoded mode that was mostly forgotten after the 90s? And indeed, the compression rates between vintage BMP/RLE and QOI are very similar, with any differences stemming from the way these two formats encode their run lengths. QOI typically does slightly better, but BMP/RLE still beats it in the 西方Project logo and the main menu.
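For reference, the palette-slot mapping in question is this one-liner from the QOI specification:

```cpp
// QOI's running palette index. Two colors collide whenever their weighted
// channel sums end up 64 apart, which happens at fairly regular intervals.
unsigned int index_position = ((r * 3) + (g * 5) + (b * 7) + (a * 11)) % 64;
```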
So while reduced complexity and blazingly fast encoding speed are good arguments, they don't cut it if decent compression of our source images relies on all the complexity found in PNG. But shouldn't this deficiency have stuck out in the official benchmark in some way? After all, 43% of the images in QOI's test suite have ≤256 colors, with most of them coming from Philip K's Ancient Collection in the `textures_pk` directory, where they make up 80%. For this directory, the official numbers claim average compressed sizes of 80 KiB for PNG and 75 KiB for QOI, and running the benchmark myself confirms these numbers…
…but wait, the input PNG files in the test suite package are actually half that size?! Yup – this benchmark merely tests the fixed, untunable QOI format against two specific PNG encoders, libpng and stb_image, at their default compression level and filter settings. It does not claim anything about QOI's relation to the known limits of PNG as a format, despite what the hype drivers would lead you to conclude all too easily. In any case, it paints a much different picture of QOI's 256-color capabilities:
Encoder | Average file size (bytes) |
---|---|
stb_image | 110,337 |
libpng | 82,136 |
QOI | 77,404 |
PNG source files | 43,437 |
oxipng -o max -Z | 41,032 |
The final nail in QOI's coffin is this concession at the end of its release announcement:
SIMD acceleration for QOI would also be cool but (from my very limited knowledge about some SIMD instructions on ARM), the format doesn't seem to be well suited for it. Maybe someone with a bit more experience can shed some light?
I'd rather take a new image format that's designed around modern SIMD instructions from the start. Then, it can invest these performance gains into more complex filters to end up with better compression at a roughly similar encoding performance. Heck, it can even be slightly slower for all I care. SIMD-first design worked great for non-cryptographic hashes, and we'll see in a minute that it works just as well for image formats.
But Ember2528 had a different codec in mind anyway. Let's jump right to the polar opposite of the complexity spectrum:
Lossless JPEG XL
Because why wouldn't you use the currently best and most popular image format according to actual professionals who know a couple of things about image compression? It's winning benchmarks left and right, and blog posts like these make it appear as if even version 0.10 of its reference encoder already beats out every other widely used codec. And after it unfairly got removed from Chromium in 2022, you can't help but root for it. Time to do my small part in bringing its adoption to a level that Google can no longer deny!
Too bad that the enthusiasm immediately drops after cloning the libjxl repo and running a CMake test build. What are all these library dependencies, and why can't I just reduce the build to the lossless encoder? The resulting binaries are way larger than what I'd consider appropriate in relation to game code. 😩
Looking through the repo more thoroughly, however, reveals a very welcome little surprise: If a few basic requirements are met, the fastest lossless speed tier actually uses an entirely separate encoder that's implemented in a single source file and can be used independently from the rest of libjxl. Nice to see that someone thought about simple integration after all! That's exactly what I've hoped to find. Sadly, Linux distributions don't have a separate standalone package for this encoder, but it wouldn't be the only library we'd statically link on Linux.
Having a single function as an easy entry point is always a good sign, too. Those parameters, though…
- Only accepting pixels in RGBA memory order sure is awkward in a 3D-accelerated world where everything else prefers BGRX, including BMP files. Sure, it doesn't matter for us because we live in SDL land where we have SIMD-optimized pixel format converters (see the sketch after this list), but I don't think you should assume that everyone has these kinds of batteries included. "Just roll your own" isn't a good argument either because you'd want pixel format conversions to be SIMD-optimized. We'd all love it if compilers perfectly auto-vectorized such code, but we're not there yet; Visual Studio in particular is pretty bad at optimizing naive byte-flipping code. But writing SIMD code always comes with the same CPU feature detection and alignment boilerplate, and JPEG XL already has all of that in its codebase. Thus, it makes a lot more sense for it to include pixel format converters than forcing that onto every caller. It's API designs like this one that almost necessitate turning SDL into a hard dependency of the cross-platform frontend in the long run.
- The not further documented `big_endian` parameter is the first indication that a lot of development effort went into aspects we don't care about. You'd think that passing `true` would cause the `rgba` buffer to be interpreted as ABGR, but it's only used to select the per-channel endianness of images with 16 bits per color channel. For 8-bit-per-channel images like the ones we're exclusively dealing with, it silently does nothing.
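As a point of reference for the first item, this is all the conversion amounts to on the SDL side – a simplified sketch with made-up buffer names:

```cpp
// Convert one 640×480 frame from BGRA byte order into the RGBA byte order
// that this encoder insists on, using SDL's SIMD-optimized converters.
SDL_ConvertPixels(
	640, 480,
	SDL_PIXELFORMAT_BGRA32, bgra_pixels, (640 * 4), // source
	SDL_PIXELFORMAT_RGBA32, rgba_pixels, (640 * 4)  // destination
);
```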
As the FJXL abbreviation implies, this encoder actually started as an independent project that, coincidentally, was a direct response to the hype surrounding QOI. By using AVX2 instructions within the confines of an existing format, it managed to beat QOI in both encoded file sizes and compression speed for every type of image its developer tested. But it's this competitive focus that brings us to its most questionable implementation decision.
The good news is that FJXL acknowledges that low-color images exist, are a prime use case for lossless compression, and are best dealt with using JPEG XL's palette features. However, detecting and optimizing that palette takes up a lot of time relative to QOI. If the input image uses more colors than a palette would make sense for, you'd want to fail as early as possible. Slide 11 explains the solution FJXL came up with:
- Hash table with 65k possible entries
- Any collision -> no palette
- […]
On non-palette-friendly images, this fails quickly (birthday paradox says after ~256 distinct pixels).
On palette images, encoding 1 channel rather than 4 more than compensates the cost of detection.
With 10 additional bits and a widely renowned multiplier, the hash function looks leaps and bounds ahead of the one in QOI:
```cpp
// has to map 0 to 0
uint16_t pixel_hash(uint32_t p) {
	return ((p * 2654435761) >> 16);
}
```
But since we're still hashing 32-bit RGBA pixels to 16 bits, we're bound to run into a collision sooner or later. You can certainly think of this hash function as mapping color values to uniformly distributed random numbers and then reason about its efficacy using probability theory, as we saw in the slide above. However, the conclusion drawn in that slide is rather abbreviated and ultimately misleading: The birthday paradox does not return a binary success/failure result, but a probability. In this case of 256 distinct colors:
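Using the usual birthday-paradox approximation $p(n) \approx 1 - e^{-n(n-1)/(2d)}$ with $d = 2^{16}$ hash buckets:

$$p(256) \approx 1 - e^{-\frac{256 \cdot 255}{2 \cdot 2^{16}}} \approx 39\%$$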
Let's plug in 191, for no reason whatsoever:
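$$p(191) \approx 1 - e^{-\frac{191 \cdot 190}{2 \cdot 2^{16}}} \approx 24\%$$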
That's a smaller probability, but a 1/4 failure rate would still be way too high for our use case. And sure enough, it actually happens in the main menu, where a single #583732FF pixel (or `0xFF323758` in its little-endian representation) collides with #FFFFFFFF:


The resulting 143 KiB file immediately tells us how not palettizing such images completely ruins the compression ratio. If this one pixel had any other non-colliding color, FJXL would have compressed it into a still decent 52 KiB. Therefore, the slides would have done better to add a graph of the failure probability and say something like:
Not perfect, and likely to misdetect even low-color images with <256 distinct colors as not palette-friendly according to the birthday paradox.
For our use case of screenshots without an alpha channel, we could work around this whole issue by having a separate non-alpha code path. Detecting the potential palette of an RGBA image within a worst-case time complexity of 𝑂(𝑛) without using hashes requires a bit array that covers the entire RGBA color space, i.e. (2³² / 8) bytes = 512 MiB, which is probably too steep of a memory requirement. Removing the alpha channel, however, would shrink this array to a definitely appropriate 2 MiB.
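A quick sketch of that bit-array idea for the non-alpha case – hypothetical code, not something that made it into the game:

```cpp
#include <cstdint>
#include <vector>

// Returns whether the given pixels use at most 256 distinct colors, in O(n)
// and without any hashing, by spending one bit per possible 24-bit RGB value.
bool HasAtMost256Colors(const uint32_t *pixels, size_t pixel_count) {
	std::vector<uint8_t> seen((1 << 24) / 8); // the 2 MiB bit array
	size_t distinct = 0;
	for (size_t i = 0; i < pixel_count; i++) {
		const uint32_t rgb = (pixels[i] & 0xFFFFFF); // ignore the alpha byte
		const uint8_t bit = (1 << (rgb & 7));
		if (!(seen[rgb >> 3] & bit)) {
			seen[rgb >> 3] |= bit;
			if (++distinct > 256) {
				return false;
			}
		}
	}
	return true;
}
```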
Ultimately though, we decided against doing any of that because FJXL by itself is as untunable from the outside as the codec it was inspired by. Ember2528 preferred the opposite: an encoder with multiple effort levels that offer different trade-offs between encoding speed and file size, which would allow faster CPUs to produce the smallest files at still reasonable speeds. So let's look past the bloat, link in the complete libjxl reference encoder, and see how it performs on higher effort levels…
…um, what is this API? Adapting the example code gave me encoding times that are at least 1.5× slower than the `cjxl` command-line encoder, and already hit the 100 ms mark at `-e 2`. Even `-e 1` is suddenly much slower than using FJXL in isolation while yielding the same compressed sizes. Also, pushing speculative allocation onto the caller? 🤨 📝 stb_vorbis is a bad joke, not a model to be emulated.
The compressed file sizes are pretty underwhelming as well. Most of the test cases don't even get close to oxipng at `-e ≤6` while still taking absurdly long to encode within the game. Even at peak effort, it's a mixed bag at best, with both oxipng and JPEG XL `-e 10` massively beating the other in 3 out of 7 cases. And if that's the best we can say about this format…
All this is echoed by this recent issue that points out JPEG XL's inadequacy with an even more retro 16-color example. In the end, the documentation said it all along:
They are about 60-75% of size of PNG, and smaller than WebP lossless for photos.
But there is one widely-used image codec that both perfectly fits Ember2528's priorities and compresses well on lower effort levels. Let's finally look at the complete benchmark numbers:
main_menu / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
JPEG XL | 146,352 | 51,851 | 59,453 | 45,329 | 37,864 | 37,276 | 36,130 | 35,222 | 33,793 | 31,724 |
WebP | 54,116 | 32,194 | 28,112 | 27,860 | 27,712 | 28,272 | 28,178 | 28,120 | 28,684 | 27,816 |
AVIF | 272,604 | 272,604 | 136,220 | 131,235 | 119,398 | 117,525 | 111,380 | 110,684 | 110,543 | 109,601 |
BMP (8 bpp) | 308,278 | |||||||||
BMP/RLE | 92,034 | |||||||||
QOI | 93,884 | |||||||||
oxipng -o max -Z | 30,702 | |||||||||
ingame / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
JPEG XL | 123,606 | 102,949 | 130,689 | 102,944 | 84,916 | 72,590 | 68,302 | 49,618 | 45,865 | 46,997 |
WebP | 50,678 | 49,030 | 43,620 | 41,760 | 40,724 | 40,854 | 38,608 | 37,940 | 37,842 | 37,138 |
AVIF | 462,703 | 462,703 | 197,818 | 156,007 | 141,043 | 139,689 | 133,399 | 132,573 | 126,270 | 125,379 |
BMP (8 bpp) | 308,278 | |||||||||
BMP/RLE | 185,842 | |||||||||
QOI | 175,949 | |||||||||
oxipng -o max -Z | 38,409 | |||||||||
BMP, cropped | 185,398 | |||||||||
BMP/RLE, cropped | 177,456 | |||||||||
QOI, cropped | 165,620 |
stage6 / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
JPEG XL | 32,204 | 24,146 | 35,053 | 24,599 | 19,936 | 19,560 | 19,336 | 18,444 | 17,423 | 16,183 |
WebP | 20,856 | 19,916 | 17,070 | 16,524 | 16,380 | 16,562 | 15,488 | 15,386 | 15,404 | 15,124 |
AVIF | 185,676 | 185,676 | 84,437 | 62,354 | 57,791 | 56,524 | 52,956 | 52,611 | 51,969 | 51,795 |
BMP (8 bpp) | 308,278 | |||||||||
BMP/RLE | 55,838 | |||||||||
QOI | 52,302 | |||||||||
oxipng -o max -Z | 18,741 | |||||||||
BMP, cropped | 185,398 | |||||||||
BMP/RLE, cropped | 48,954 | |||||||||
QOI, cropped | 45,874 |
laser / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
JPEG XL | 345,199 | 287,279 | 301,608 | 248,852 | 92,463 | 85,529 | 81,206 | 66,811 | 61,445 | 47,173 |
WebP | 85,318 | 56,724 | 51,558 | 53,964 | 53,492 | 53,492 | 51,860 | 51,460 | 51,460 | 41,726 |
AVIF | 218,858 | 218,858 | 122,100 | 88,490 | 82,675 | 81,245 | 75,866 | 75,395 | 75,462 | 75,138 |
BMP (24 bpp) | 921,654 | |||||||||
QOI | 290,088 | |||||||||
oxipng -o max -Z | 61,595 | |||||||||
BMP, cropped | 553,014 | |||||||||
QOI, cropped | 280,462 |
laserbomb / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
JPEG XL | 332,706 | 125,197 | 150,436 | 128,755 | 110,357 | 102,891 | 99,718 | 68,968 | 66,975 | 64,484 |
WebP | 129,472 | 94,564 | 86,538 | 64,990 | 64,062 | 64,062 | 60,776 | 60,318 | 60,318 | 59,198 |
AVIF | 313,731 | 313,731 | 168,388 | 114,111 | 109,239 | 107,121 | 104,109 | 102,054 | 99,106 | 99,103 |
BMP (24 bpp) | 921,654 | |||||||||
QOI | 210,496 | |||||||||
oxipng -o max -Z | 87,286 | |||||||||
BMP, cropped | 553,014 | |||||||||
QOI, cropped | 200,002 |
gates / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
JPEG XL | 208,293 | 185,662 | 212,615 | 172,008 | 124,466 | 117,509 | 113,563 | 110,992 | 97,454 | 91,146 |
WebP | 124,308 | 125,070 | 113,896 | 102,656 | 102,482 | 102,482 | 95,536 | 94,768 | 94,768 | 57,850 |
AVIF | 306,742 | 306,742 | 293,874 | 293,276 | 254,073 | 243,953 | 243,947 | 242,188 | 241,943 | 241,359 |
BMP (24 bpp) | 921,654 | |||||||||
QOI | 157,705 | |||||||||
oxipng -o max -Z | 90,545 | |||||||||
BMP, cropped | 553,014 | |||||||||
QOI, cropped | 147,670 |
seihou / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
JPEG XL | 6,124 | 5,088 | 4,732 | 4,468 | 4,427 | 4,416 | 4,377 | 4,112 | 4,016 | 4,040 |
WebP | 39,518 | 5,904 | 5,642 | 5,574 | 5,500 | 5,518 | 5,518 | 5,504 | 5,486 | 5,490 |
AVIF | 26,984 | 26,984 | 25,085 | 24,927 | 22,582 | 21,698 | 21,697 | 21,627 | 21,631 | 21,505 |
BMP (8 bpp) | 308,278 | |||||||||
BMP/RLE | 17,654 | |||||||||
QOI | 18,047 | |||||||||
oxipng -o max -Z | 5,383 | |||||||||
BMP, cropped | 23,798 | |||||||||
BMP/RLE, cropped | 14,144 | |||||||||
QOI, cropped | 13,371 |
The effort levels correspond to `cwebp`'s `-z` parameter. Add 1 to get `cjxl`'s `-e` parameter, and subtract from 10 for `avifenc`'s `-s` parameter. I definitely could have surveyed the landscape of PNG encoders more thoroughly, but since Ember2528 prioritized compression ratio over compression speed, there was no need to. oxipng is as good as it gets, but even its strongest and most sluggish setting is still outperformed by regular WebP at some level, and often as early as `-z 2`.
FJXL palette detection collision chance: 24.21%.
FJXL palette detection collision chance: 6.20%.
FJXL palette detection collision chance: 6.72%.
FJXL palette detection collision chance: 1.18%.







Lossless WebP
Yup, it's 📝 ZMBV beating AV1 all over again. For these kinds of retro game screenshots, JPEG XL is vastly outperformed by its counterpart from the previous generation of widely-used image formats. And not just in terms of compressed file sizes, but also in every single other aspect that matters to us:
- Faster compression times across every effort level? ✅ You bet. Imagine adapting its example code and actually getting encoding speeds that match the `cwebp` command-line encoder! Which brings us to…
- Better C API? ✅ Check – well-documented and significantly easier to use, and I'm not even using the easiest entry point due to its fixed effort level. libwebp does use a single 32-bit pixel format internally, just like JPEG XL, but what's that, importers for other 32-bit pixel formats and even palettized 8-bit images? Sure, the latter ones are part of the extra code that typically isn't part of Linux distribution packages and it just does a simple unoptimized loop. But that's how a library communicates that it's the right tool for the job.
- Less bloat? ✅ Obviously. The unmodified reference library with all of its SSE and AVX optimizations adds an acceptable 274.5 KiB to the statically linked and optimized release binary.
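To illustrate, the advanced-API call sequence roughly looks like this – a trimmed-down sketch without error handling, with `effort` and `bgra_pixels` standing in for the obvious, and not the game's actual integration:

```cpp
#include <webp/encode.h>

// Losslessly encode a 640×480 BGRA buffer at a given effort level (0–9,
// matching cwebp's -z parameter), into a growable memory buffer.
WebPConfig config;
WebPConfigInit(&config);
WebPConfigLosslessPreset(&config, effort); // also sets config.lossless = 1

WebPPicture pic;
WebPPictureInit(&pic);
pic.use_argb = 1; // lossless encoding works on 32-bit ARGB internally
pic.width = 640;
pic.height = 480;
WebPPictureImportBGRA(&pic, bgra_pixels, (640 * 4));

WebPMemoryWriter writer;
WebPMemoryWriterInit(&writer);
pic.writer = WebPMemoryWrite;
pic.custom_ptr = &writer;
WebPEncode(&config, &pic);
// writer.mem now holds writer.size bytes of WebP data, ready to be written.

WebPPictureFree(&pic);
WebPMemoryWriterClear(&writer);
```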
That's not to say that libwebp is perfect. Its code makes it very obvious that lossless WebP was designed for 2010-era hardware as the encoder never got optimized for modern CPUs. There was an attempt at optimizing at least the lossy encoder for AVX2, but it was ultimately abandoned because it never got fast enough. Surprisingly, the codebase did receive new AVX2 code one week before I released this build, but it only covers the lossless decoder so far.
As for concurrency, libwebp does come with support for multi-threaded encoding, and I did activate it for the Shuusou Gyoku integration, but it's only used at effort levels 8 and 9. Also, why is `argb` in this structure interpreted as native-endian and therefore BGRA memory order, but these are interpreted as big-endian?
But the main criticism is the same that also applies to JPEG XL: The lossless and lossy modes are lumped into the same repository despite having virtually no code in common, and are selected via a structure field rather than having unrelated API entry points. This once again makes it very difficult for static linkers to remove all the code on the lossy branches that I never asked for in the first place.
And I sure never want to run the lossy encoder under any circumstance. Lossy WebP deserves all its bad reputation for basically being VP8's intra-frame coding applied to still images. VP8, 📝 if you remember, is that bad video codec from two generations ago that I'm only serving on this website due to sheer inertia. Applying its enforced YCbCr 4:2:0 chroma subsampling to images not only makes it utterly unsuitable for pixel art, but also even worse than well-compressed JPEG, which isn't limited to a single subsampling scheme. If anything in the `GIAN07` process accidentally flips the "I want lossless" flag, I'd rather want the WebP encoder to error out and have the screenshot frontend fall back on BMP than save an image with mutilated colors.
But while JPEG XL is a lost cause as far as I'm concerned, I've grown to like lossless WebP too much to leave it trapped within the unfortunate organization of its codebase. Also, there seems to be a lot of untapped potential in the format – really, why does PNG get all the attention of people writing alternative encoders when lossless WebP is the demonstrably much more capable format?
So I've decided to fork libwebp and surgically remove all code related to the lossy encoder. The statically linked result now only takes up ~100 KiB in the Windows build while still being API- and ABI-compatible. Of course, Linux users will still use their distribution's libwebp package with the lossy encoder included, but let's hope that the aforementioned possibility of accidents stays purely theoretical.
Really though, why have people started to bundle lossless and lossy image codecs under the same format in the first place if their algorithms have nothing in common? It might make sense for Opus where SILK and CELT are different kinds of lossy, but lossless and lossy are two completely different paradigms. The bloat and usability confusion far outweigh any situational tricks this might offer.
Alright, we found a good format with configurable effort levels, and we're only missing a way for players to pick an effort level. Depending on how they want to use this rapid-fire screenshot feature, almost all of the options make sense in some context:
- You'd like to screenshot a whole section of a stage as fast as possible with the help of the disabled frame rate limiter, and you got plenty of free disk space? You probably want to stick with BMP and compress the screenshots outside of the game, just like how you would have done it without this feature.
- A slight slowdown is OK or maybe even welcome for providing additional feedback that you're actually taking screenshots? Pick one of WebP's higher effort values that certainly take longer than 16 ms to encode, but are still reasonably fast and won't turn the game into a <2-FPS slideshow.
- Want the lowest file size that your system can encode while staying at 62.5 FPS? Well, how fast is your system? And not just the CPU – maybe your system is actually bottlenecked by I/O and writing a large uncompressed BMP file takes much longer than encoding it into WebP and writing the resulting smaller file.
The latter two use cases would be covered by automatic detection of the maximum effort value that encodes within a given number of frames. The problem, however, is that encoding times are always relative to the complexity of the image. Once we're in-game and have lots of bullets and lasers, any choice that might have been appropriate for the main menu might suddenly start dropping frames after all. Thus, we can't solve this with an upfront benchmark, but have to dynamically adapt to the complexity of the current game scene. But then the whole idea falls apart as we can't possibly treat the configurable allowed screenshot time as a hard limit. To figure out whether it's safe to raise the effort level again, there's no way around periodically exceeding that limit and thus dropping more frames after all.
The ideal solution would involve deep hooks into the WebP encoder that could dynamically adjust the compression algorithms depending on the remaining time in the current frame. An image compressor with real-time guarantees… sure sounds like an interesting research project.
In the end, letting players choose a fixed format and effort level remains the best option. However, they can only make an informed choice if they know the performance of all options relative to each other. And that's how we arrive at this new submenu:

…the `SDL_RenderReadPixels()` call.
These specific numbers I got on my now almost 7-year-old Intel Core i5 8400T are very peculiar. `-z 0` gets quite close to the 16 ms we have per frame, but would still be too slow to reliably compress every gameplay situation without dropping frames. A 64-bit build would speed up `-z 0` by 10%, `-z 2` through `-z 7` by 25%, `-z 8` by 210% (!), and `-z 9` by 60%. Linux users already enjoy these higher speeds, and the Windows build is just a few compiler settings away from matching them. 📝 Last time, the bitness argument was a lot more balanced, but WebP encoding performance presents the first compelling reason for going 64-bit.
Or we could always go multi-threaded, which already is a much more popular idea within the Seihou development Discord group.
Or I could investigate PNG after all to find out how exactly its encoding speed compares to WebP…
But then, Ember2528 posted the encoding times he got on his new Ryzen 9 9950X3D:

Finally, you probably already noticed another small change in this build: The ReC98 push ID is now shown in the bottom-right corner of the title screen image, just below the original game version number. This was the one part of replay preparations that I wanted to get in sooner rather than later. Since the game binary and the data files can be updated or modded independently from each other, I'm going to tag future replays with both of their respective versions to guarantee reproducibility. Of course, newer builds should never introduce bugs that affect gameplay and desynchronize existing replays. But if they ever do, the included push ID allows hosting sites to remove any replays recorded on such a broken build from the official competition tier associated with a specific data file version.
As for rendering the push ID, it should obviously look similar to the VERSION 1.005 text above. We can find these glyphs in `GRAPH.DAT` file #0, but this particular text is actually baked into the main menu's background image, which explains why the decimal point glyph isn't part of that data file. The glyphs for 0-9 are also used in-game for the score popups, but the A-Z glyphs remain unused – so unused, in fact, that pbg didn't even leave any reference to them in the source code:

This means that the game provides us with all the glyphs we would need to display the ReC98 push ID. However:
- The 0-9 glyphs have a size of 5×7 and would stick out a bit too much against a capital P rendered as a smaller 5×5 glyph.
- In WIP builds, the build ID should also include the Git commit, which traditionally uses small letters. Surrounding the commit info with (brackets) would also be nice.

And that's all I've got for these very packed three pushes! In exchange, I'll reserve the next Shuusou Gyoku push for another round of maintenance and forward compatibility.
The new builds:
Next up: The long-awaited Windows 98 backport of our Shuusou Gyoku build! This has been in development for quite a while, so this should now be a matter of days rather than weeks.