Unfortunately, I have to quickly interrupt the current PC-98 Touhou progress with breaking news of a replay desync bug in my Shuusou Gyoku build. Yup, it's free mod bugfix time again, this time featuring a bug with the most complex implications so far…
The bug in question dates back to the P0295 build from last October. While that giant release mostly focused on porting the game's rendering to SDL, it also included 📝 fixes for three pbg bugs in Shuusou Gyoku's handling of Extra Stage replays. Unfortunately, these fixes would introduce a bug of my own that was even worse.
Ever since that build, the replay header has consistently stored the difficulty and starting life count as shown in the Config → Difficulty menu. This looks fine on the surface until you consider the exact issue behind the three bugs: Shuusou Gyoku's Extra Stage is supposed to run on Hard difficulty and 2 starting lives, not on whatever you set for the regular 6 stages.
You can probably already imagine how invalid difficulty settings will cause desyncs shortly into a replay. Running a debug build at any commit from P0295 up to and including P0310 quickly reveals the issue:
Different difficulties come with different initial rank values (Pr), which cause bullets and lasers to spawn at different speeds than what the player maneuvered around while recording the replay, which in turn will manifest as a desync.
The only way to protect a replay from this bug was to set the regular game to Hard difficulty and 2 starting lives in the Config → Difficulty menu before recording. This is probably one of the rarest configurations imaginable – most people will have set the difficulty to either Normal to get that survival clear that unlocks Extra in the first place, or to Lunatic because they're superplayers and that's the only difficulty that matters to them.
Note how the bug only affects the saved replay file. You were still playing and recording Extra in its intended Hard difficulty and with 2 starting lives, and any clear you've achieved was still valid.
This is exactly the kind of bug that can easily fall through the cracks of the regular testing that my backers and I do for every new build. Replays are a key item on my testing checklist, but I primarily test whether existing ones still work. With only one replay slot per stage, recording a new replay is always a cumbersome process: Is my previous replay for that stage worth keeping? If yes, what made it special? After all, I now need to give a more descriptive name to the file. Do I remember, or do I have to watch the replay again?
Also, the primary concern of replays is compatibility with pbg's original 1.005 version. In that context, they can provide important evidence that I haven't accidentally forked the gameplay. Therefore, replays should hit as many gameplay aspects and potential failure/desync points as possible, which requires actual gameplay effort. From that point of view, it makes more sense to just keep testing with existing replays, especially when it comes to the Extra Stage.
Since this was just a metadata issue, we can both easily fix this bug for future replays and repair any existing affected ones. We simply have to set the replay's difficulty and starting lives to the only official values for the Extra Stage, and the replays will play back correctly again.
But doing that creates a potential problem. What if you actually modded the game before P0295, intentionally changed the difficulty and/or number of starting lives for the Extra Stage, and then recorded a replay? If that was the whole extent of your mod, such a replay would play back correctly on not just your modded build of Shuusou Gyoku, but on every single one of my builds and pbg's original 1.005 build. "Fixing" these settings in the replay header would then actually break such a replay. Since we're still using pbg's old replay format, there is no way we can distinguish valid modded replays from broken and desyncing ones by looking at just the replay header.
We could tell after we've run the replay – if the game ends before the replay has reached its last recorded frame, we know that something is wrong. However, we're not quite at the point where we can quickly simulate an entire round of gameplay logic on just the CPU, without rendering anything. The best we could do until then is to pop up a message at the end of a rendered replay, informing the player that they've just watched a desync and offering an automatic repair of known issues. But that would be a lot of work for a policy-bugfix, and even fall under the planned paid feature of improved replay-related error reporting. And if we zoom out, such a window won't be much of a help in the general case of people watching replays from incompatible builds. The game can't possibly know the specific mod a desyncing replay originated from, so what could it possibly do, other than to say "Sorry, that was, in fact, a desync 🤷"?
That's why it's so important to me that 📝 the new replay format stores the exact game binary and stage script versions a replay was recorded on. As well as any gameplay tweaking options, if we ever go that route: Properly fixing Shuusou Gyoku's fake deathbomb quirk is not just about the few lines of custom gameplay code you can find in Tasos500's fork, but mainly about the bureaucracy of cleanly establishing a separate competition tier, not breaking existing replays, and making sure that replay hosting sites deal with the distinction as well.
That said, that's a lot of thought for a very specific potential scenario. Any change to the Extra Stage settings would have required modding the game at either the C++ or machine code level. If you were able to pull that off and you're considering updating to the new build, you'll probably also read these lines and will have no problem adapting whatever fix I roll out for that issue.
So, let's go for that unconditional rewrite of every affected Extra Stage replay upon clicking the ExStage デモ再生 (demo playback) option… but wait, why are the rewritten files suddenly smaller than the old ones?
Turns out that there was another replay-related bug that dates back to 📝 my very first Shuusou Gyoku release from September 2022. This one boiled down to the classic C/C++ footgun of confusing byte sizes with element counts, but pbg's misleading variable names certainly played their part as well.
This bug is mostly harmless within the unmodded game, which also explains why I didn't detect it for so long. The game doesn't care about compressing and uncompressing twice as many bytes, the loader still copied the correct number of bytes and wouldn't have overflowed the buffer, and at a few KB per replay, it doesn't really stick out if the files are roughly twice as large as they need to be. But this was still a landmine that would have exploded once modders crafted stages longer than half of the 20-minute buffer that pbg designated for replays.
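For illustration, here's the shape of that footgun in a minimal sketch – hypothetical types, names, and constants, not pbg's actual code:

#include <stddef.h>
#include <stdint.h>

// Stand-in for the compression entry point, for illustration only.
void Compress(const void *buf, size_t size_in_bytes);

#define REPLAY_FRAMES_MAX 75000 /* placeholder, not pbg's actual constant */
uint16_t ReplayBuf[REPLAY_FRAMES_MAX]; // one 16-bit input record per recorded frame
size_t ReplaySize;                     // despite the name: already a *byte* count

void SaveReplayInputs(void)
{
  // Multiplying an already-byte-sized "size" by the element size once more
  // compresses twice as many bytes as were ever recorded…
  Compress(ReplayBuf, (ReplaySize * sizeof(uint16_t))); // 🐞
  // …when the byte count alone would have been correct:
  Compress(ReplayBuf, ReplaySize);                      // ✔
}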
Since I'm already implementing an automated fix here, I might as well also recompress every watched non-Extra Stage replay if its amount of decompressed bytes doesn't match the replay's indicated frame count. Of course, recompression won't work if you've marked the replay files as read-only, which I often do as a means of protecting them from getting overwritten with accidentally recorded new replays of the same stage…
…but wait, how about restricting both fixes to writable replay files? This would create at least one possibility of protecting existing modded replays, and also make sense from a consistency point of view. If the game isn't allowed to fix the replay, it also shouldn't silently hotpatch its header and play it back differently than any other build of the game would, even if that way is the correct one in the vast majority of cases. Sure, this is slightly annoying for people who use that same read-only trick, but those people will probably also read either these lines or the release notes.
As a neat bonus, I also made sure to preserve the original timestamps of any repaired and/or recompressed replay file. This is the only other piece of meaningful identifying metadata we have with these files, and I don't want to throw it away just because I messed up the saving code at one point. Without that extra level of care, I probably wouldn't have gone for such an unconditional automatic fix in the first place. Instead, this little detail makes the whole fix as invisible as it could possibly be. If you only recorded an Extra Stage replay once, haven't watched it since, and haven't touched the file either, you won't even notice that there was a bug in the first place. SDL 3's filesystem API does not cover file timestamp modification, so this required more OS-specific code of the kind 📝 I'd rather want to get rid of. SDL 3 does support timestamp retrieval though, and that's all we need for the new replay format where I'll take the timestamp from the filesystem and properly write it into the file itself.
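On Windows, that extra OS-specific care boils down to capturing the timestamps before the rewrite and restoring them afterward. A minimal sketch of the idea (not the actual platform-layer code; a Linux build would do the same with stat() and utimensat()):

#include <windows.h>

// Capture a file's timestamps, rewrite it in place, then put the times back.
// Error handling trimmed for brevity.
bool RewritePreservingTimestamps(const wchar_t *path)
{
  FILETIME creation, access, write;
  HANDLE handle = CreateFileW(
    path, GENERIC_READ, FILE_SHARE_READ, nullptr, OPEN_EXISTING, 0, nullptr
  );
  if(handle == INVALID_HANDLE_VALUE) {
    return false;
  }
  GetFileTime(handle, &creation, &access, &write);
  CloseHandle(handle);

  // … repair and/or recompress the replay here …

  handle = CreateFileW(
    path, FILE_WRITE_ATTRIBUTES, 0, nullptr, OPEN_EXISTING, 0, nullptr
  );
  if(handle == INVALID_HANDLE_VALUE) {
    return false;
  }
  SetFileTime(handle, &creation, &access, &write);
  CloseHandle(handle);
  return true;
}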
And there we go, no more replay bugs! Also, did I just write down all the justification anyone would ever need for the new replay format? That should be shortening that future blog post by quite a bit at least…
Thanks to >>49320040 on /jp/ for pointing out that desyncs exist. Please do tell me about this sort of thing! I'm not ZUN; desyncs are critical bugs that will always receive my immediate attention. If they turn out to be my fault, they definitely fall under my free bugfix policy, and if they don't, we at least get to document them as bugs in the original game that might get fixed in a later push.
Alright, back to writing blitters for the PC-98…
P0307
Seihou / Shuusou Gyoku (SDL 3 platform layer)
P0308
Seihou / Shuusou Gyoku (Render API unbricking / ReC98 build label on the title screen / Revamped pixel format handling)
P0309
Seihou / Shuusou Gyoku (WebP screenshot compression / Compression benchmark in the main menu)
💰 Funded by:
Ember2528
Well, that fell apart surprisingly quickly. The release of Shuusou Gyoku's Linux port just happened to be surrounded by the unluckiest sequence of events in Arch Linux land:
After a policy-bugfix for a silly mistake on my part, Shuusou Gyoku was still playable on sdl2-compat as it was only affected by rather minor bugs, but these bugs still undermined the effort I put into the port. That left us with three options:
Let the more involved SDL community fix sdl2-compat out on their own. After all, why should we bother if rogue distros randomly mess with our dependencies?
Become part of that community and help fix the issues in either sdl2-compat or SDL 3.
Properly update Shuusou Gyoku to SDL 3 right now, while keeping SDL 2 support for the Flatpak, more conservative Linux distributions, and the upcoming Windows 98 backport.
I really would have preferred to delay this migration for a few years until the dust has settled. For this project, I already picked C++ as the dependency I want to be on the bleeding edge of, and SDL 2 was supposed to balance this out by being the conservative and stable choice. Oh well, if we've got to update at some point, we might as well do it now. The ReC98 development schedule at least gave me another month of waiting for the community to sort out SDL 3's growing pains…
So, why does something like sdl2-compat even exist if it only causes problems? And why are distros rolling it out so soon after SDL 3 if SDL 2 has been working fine all the time? In a nutshell, sdl2-compat is the second pillar in SDL's forward compatibility strategy. While the 📝 dynamic API mechanism ensures compatibility with future minor versions by integrating dynamic linking so deeply that static linking is made entirely useless, sdlN-compat ensures compatibility with one future major version by implementing version N's API in terms of SDL version N+1. This allows the SDL team to very quickly stop updating version N while still allowing programs linked against that version to run well on modern systems by using all the actively maintained backends of version N+1. This worked out well with sdl12-compat, which nowadays seems to do a great job at preserving abandoned SDL 1 games – especially if we consider that you'd be running sdl12-compat on top of sdl2-compat on top of SDL 3 from now on.
If you absolutely must have the real SDL2 ("SDL 2 Classic"), please use the SDL2 branch at https://github.com/libsdl-org/SDL, which occasionally gets bug fixes (and eventually, no new formal releases). But we strongly encourage you not to do that.
Followed by zero arguments to back up this audacious suggestion. So they not only imply that sdl2-compat is already perfectly compatible and works without bugs for every SDL 2 program ever, but also that the underlying SDL 3 implementation doesn't introduce any bugs on top – and it only takes a single look into either project's issue tracker to disprove that notion. There is no technical reason why a distro couldn't ship SDL 3 and 2 in parallel. The continued existence of the SDL 2 AUR package is proof of that, and as of mid-March, it was still receiving upset comments that justified its existence.
There was absolutely no reason to push sdl2-compat on everyone by default other than forcefully turning users into beta testers. SDL 2 was still stable, maintained, and working well. People who needed SDL 3 before its release for whatever feature already used SDL 3. People who want to use the SDL 3 backends to solve some obscure backend-related issue in an SDL 2 program can use sdl2-compat without needing it to be the only option available. And with a package size of 1.2 MiB, you can't convince me that SDL 2 is somehow a burden on the packaging front either – especially if your distro has separate packages for every commonly used fiddly Python and Haskell library.
I can't help but imagine the reaction if Microsoft pushed an enforced update of this magnitude. They're already getting regularly lambasted by the press for much smaller and ultimately inconsequential offenses…
For all the 📝 criticism I had about Flatpak and Flathub last time, they made the right choice of not treating their base package as a rolling and bleeding-edge distribution. The Freedesktop platform will only ship SDL 3 in its next version releasing in August, which will probably leave enough time for the SDL developers to address all but the rarest remaining issues in sdl2-compat. Although I'm not sure how I should interpret this commit being made at that specific time: This is either very considerate (because they've chosen to take up the job of early-adopting SDL 3 as part of developing the new SDK version, and thus will be helping out with reporting bugs), or very inconsiderate because they bought the whole sdl2-compat story just like Arch did. If Freedesktop SDK updates shipped in February rather than August and the release tag was on this branch, they would have screwed over their users just as much. Also, there's still not much point in force-updating everyone onto a compatibility layer in freaking 2025…
Then again, I can empathize with the SDL developers to a degree. Lots of developers have been asking the "when is SDL 3 ready and stable enough for regular use?" question while picturing SDL as this highly important and central library that surely has a big team of testers who could ensure its stability at one point. But if there just isn't enough Valve money to form such a team, what else should you do as a developer other than turn your personal hype into a "it's ready now, go use it and please leave feedback" reply? Maybe, turning your users into beta testers is the only realistic way to ever approach stability in this economy. And sure, they call it 3.2.0 for… reasons, but they're not fooling anyone.
The big irony, however, is this: At one point in the future, sdl2-compat will be that perfect solution for running abandoned SDL 2 (and SDL 1) programs on top of SDL 3. But it's the exact opposite of what you'd want during active development: You want to update to SDL 3 and use the new APIs and function names to be ready for the future, but also retain the option to run on the stable SDL 2 foundation for at least a little longer until every distribution has caught up. Or, in other words, you want to run SDL 3 on top of SDL 2.
You could totally have a library that implements this alternate kind of compatibility layer. It would still be prone to bugs just like sdl2-compat, but unlike that one, the chance for new bugs is halved since you'd be running on top of the proven and stable SDL 2. But of course, such a library would restrict your codebase to SDL 2's feature set, which is probably why something like this doesn't exist. So instead, our SDL platform layer now contains 64 conditional branches and a bunch of function renaming macros and generic helper code to support compiling against both SDL 3 and SDL 2. At least I wrote it all in a way that allows us to quickly rip out SDL 2 support once we no longer need it…
Oh well, enough ranting. Because once it works, there are plenty of things to like about SDL 3. Limited to, of course, everything notable that applies to Shuusou Gyoku:
Requesting fullscreen from SDL 3's basic window creation API will now always give you a borderless window as they went with the times and removed the option to directly create a window in exclusive fullscreen mode. In isolation, this might look bad enough to not even consider updating to SDL 3. However, this doesn't mean that boomer fullscreen is gone – it only has been relegated to a separate and, in fact, much more comprehensive mode-changing API that also covers refresh rates. Using it does require significantly more and different code compared to SDL 2, but being explicit about the refresh rate is crucial for games whose speed depends on the frame rate, like this one. If your display supports a 62.5 Hz mode by any chance, we select it now.
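Roughly, the new mode selection looks like this – a simplified sketch against the SDL 3 API, not the game's actual platform layer:

#include <SDL3/SDL.h>

// Pick the 640×480 fullscreen mode whose refresh rate is closest to 62.5 Hz,
// then switch the window into exclusive ("boomer") fullscreen.
static bool EnterBoomerFullscreen(SDL_Window *window)
{
  int count = 0;
  SDL_DisplayID display = SDL_GetDisplayForWindow(window);
  SDL_DisplayMode **modes = SDL_GetFullscreenDisplayModes(display, &count);
  if(!modes) {
    return false;
  }

  const SDL_DisplayMode *best = nullptr;
  for(int i = 0; i < count; i++) {
    const SDL_DisplayMode *mode = modes[i];
    if((mode->w != 640) || (mode->h != 480)) {
      continue;
    }
    if(!best || (
      SDL_fabsf(mode->refresh_rate - 62.5f) <
      SDL_fabsf(best->refresh_rate - 62.5f)
    )) {
      best = mode;
    }
  }

  // A nullptr mode falls back to borderless desktop fullscreen.
  const bool ret = (
    SDL_SetWindowFullscreenMode(window, best) &&
    SDL_SetWindowFullscreen(window, true)
  );
  SDL_free(modes);
  return ret;
}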
SDL 3's software blitters come with optimized SSE2, SSE4.1, and AVX implementations, replacing SDL 2's aging and nowadays actually suboptimal MMX code paths. On the surface, this only seems to speed up the software renderer as far as we're concerned, but it will also be very welcome once we have to do pixel format conversions. (Which, spoiler, I managed to just barely avoid on the SDL level for this new code.)
The new SDL_SetRenderLogicalPresentation() function now implements all of the three borderless fullscreen layouts as part of SDL. Together with the now cleaned-up handling of render target state, this removes almost all of the complexity and state juggling that SDL 2 previously required for the combination of fullscreen and clipping. Too bad that I still have to retain all of that SDL 2 code for the time being…
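For reference, this is the single call that replaces all of that juggling, with the letterbox layout standing in for whichever of the game's three scaling options happens to be active (renderer setup omitted):

// SDL_LOGICAL_PRESENTATION_INTEGER_SCALE and _STRETCH presumably cover the
// other two layouts; SDL applies the scaling and clipping on every present.
SDL_SetRenderLogicalPresentation(
  renderer, 640, 480, SDL_LOGICAL_PRESENTATION_LETTERBOX
);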
The filesystem API that originated in SDL 2 is finally joined by a matching set of file access functions that Do The Right Thing, explicitly take UTF-8 filenames, and use the Unicode APIs on Windows. If this had existed 📝 at the end of 2022, I wouldn't have felt the need to write my own abstractions. Sure, the lack of UTF-16 overloads means that this API is not strictly, perfectly optimal on Windows, but in turn, we get this API for free with the rest of SDL. It'll even be very welcome for the Windows 9x port, which could simply translate UTF-8 to the system codepage without requiring any other kind of Unicode layer. Besides, I've found myself using these strictly optimal UTF-16 strings less and less: These have always been an implementation detail of the Windows version, and any path we save in a .CFG file should better be in UTF-8 to allow configuration sharing between Linux and Windows.
SDL_RenderReadPixels(), the "screenshot" function that transfers pixel data from the GPU to system memory, now allocates a new pixel surface instead of writing pixel data in a specific format to pre-allocated memory. This is another change that looks bad on the surface because we sure love them freedoms to self-allocate our memory in C/C++ land (a sketch follows these points). However:
This single allocation is far from being the bottleneck in the screenshotting process. It doesn't even clearly stick out in execution timings because it gets completely masked by the variance of the actual GPU→CPU pixel transfer.
In SDL 2's version of the function, you decided the pixel format that SDL would write into your buffer, which might have incurred a conversion if your chosen format didn't match the pixels returned by the GPU. In Shuusou Gyoku, this could have easily happened with geometry scaling. By newly allocating the returned surface, SDL 3 can keep the original pixel format and thus needs to involve at most a single memcpy() – which is always measurably faster than converting pixels, even if that conversion is SIMD-optimized.
Not even having the option to overthink memory pre-allocation sure simplifies your code a lot.
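In practice, the whole SDL 3 side of a screenshot shrinks to something like this – a sketch, with SaveScreenshot() as a hypothetical stand-in for the game's encoding frontend:

#include <SDL3/SDL.h>

void SaveScreenshot(
  const void *pixels, int pitch, int w, int h, SDL_PixelFormat format
); // hypothetical game-side function

void TakeScreenshot(SDL_Renderer *renderer)
{
  // nullptr = read back the renderer's entire current output, in whatever
  // pixel format the GPU happened to return.
  SDL_Surface *shot = SDL_RenderReadPixels(renderer, nullptr);
  if(shot) {
    SaveScreenshot(shot->pixels, shot->pitch, shot->w, shot->h, shot->format);
    SDL_DestroySurface(shot);
  }
}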
Graphics APIs are now addressed by their identifier string rather than their index within the platform-specific list of APIs. SDL 2 has always provided ways to map between both indices and strings, but the fact that every function now takes a string is a nice way of nudging developers to use strings in their configuration as well. They would allow a user's API selection to be retained independently of the SDL developers later changing the order of that list – once I adapt our config format from numbers to strings in a future release, that is.
SDL apps can now define metadata strings. Most of these currently don't do anything, but the identifier now gets used as the Wayland and X11 window class name and thus represents a much cleaner way of having class-derived icons than 📝 the previous undocumented SDL_VIDEO_X11_WMCLASS environment variable. But if you read that post again, my main issue wasn't SDL's implementation, but the fact that support for class-derived icons is so rare among window managers to begin with. Not only does this change not help the situation, but it arguably makes it even worse due to a slightly different mapping decision: The app identifier is assigned to the WM_CLASS class name, but the additional instance name receives the binary's file name, which unfortunately breaks class-derived icons in IceWM where the instance name takes precedence.
Draw calls are now batched on all renderers, and batching can no longer be deactivated. 📝 During my previous experiments, SDL's Direct3D 11 backend turned out to be by far the fastest batching renderer on Windows, and SDL 3 coincidentally also made it the new default. So it makes sense to follow suit and remove our previous OpenGL override, restoring 📝 pixel-perfect line rendering in framebuffer-scaled mode by default.
The massive downside, however, is that the combination of framebuffer rendering and OpenGL ES 2 is now completely broken on integrated Intel graphics, in the worst way: The game initializes fine and responds to input, but only shows a black screen. If we offer a render API selection menu, we'd better also have a way of unbricking your game non-graphically if the selected API only renders a black screen. That's why you now can
press F7 to cycle through the list of APIs at any point, or
use the environment variable SDL_RENDER_DRIVER to override any previous manual API selection, which didn't work before.
Draw call batching even extends to the software renderer now, for some reason. Doesn't software rendering boil down to nothing more than writing pixels into a system-memory buffer on a single thread? There's no penalty for just doing the thing, but there certainly is a small penalty for gathering all the things into a queue. I'd rather not pepper that procedural mess of a graphics backend with even more imperative function calls, but you can make just as much of an argument for the consistency of requiring a flush regardless of whether a renderer represents software or hardware.
The new Vulkan and GPU render backends are perhaps the most exciting change for a certain group of people. The GPU API in particular provides an abstraction for the common modern paradigm of command buffers and shaders, which is shared among Vulkan, Direct3D 12, and Metal. Given the amount of attention it received, this feature is undoubtedly great for everyone developing modern games. However, not only couldn't we care less for a game of this vintage, but it's also just more of the same dilemma: While more backends can offer a higher chance of the game working well on some potato out there, they primarily mean more code surface, which means more bugs.
Thankfully, the list of entirely bad changes is quite short:
All API functions now return true/nonzero on success and false/zero on failure, rather than 0 on success and <0 on failure as in SDL 2. Sure, true = success makes intuitive sense when you just start out programming, but then you realize that the overwhelming majority of functions can fail in multiple ways and success is just the absence of failure. SDL 2 got the right idea about this, but SDL 3 chose to regress to said beginner levels because Sam Lantinga got increasingly convinced of this idea that he, and everyone else, initially considered horrible.
#include directives must now be prefixed with an explicit SDL3/ path, unlike SDL 2 which didn't use a prefix. This was apparently necessary to fulfill some macOS requirement, but they've also removed the path from their pkg-config --cflags, turning the prefixed syntax into the only sanctioned cross-platform way of including SDL 3's headers. Being able to compile SDL3-using code without any additional CFLAGS might look pretty, but no sane build system is going to make an exception and not call pkg-config --cflags as it does for any other external library. And now I have to duplicate the #include section in every translation unit for the SDL 2 code path…
All SDL threads must now be manually awaited before calling SDL_Quit(). If they aren't, SDL reports a "leaked thread" even if the underlying OS thread might have cleanly finished. I get it, structured concurrency is probably a good idea, but it only works naturally if the rest of your program is structured accordingly, which doesn't apply to this 25-year-old codebase. Enforcing this leak check just forces me to write cleanup code for the sole purpose of satisfying SDL's bookkeeping to avoid that error.
Still, the constant stumbling over bugs and deliberate instabilities made this take way longer than it had any right to. For three of these bugs, I was the first one to report them, and I could have even reported a fourth one if I actually cared about Vulkan and didn't happen to find a workaround right before I pushed out the release.
With the additional API unbricking feature, we've ended up well into a second push. Replays were too big of a feature for now, but screenshot compression sounded like a nice task for the rest of that push. Really, how hard can it be? Add reference C library of our encoder of choice, call API with pixel buffer we get from SDL, write compressed pixel buffer to file. Easy, right? Well…
For starters, which format do we choose? Ember2528 had a clear preference, but it makes sense to compare it against other contenders first. There will be a complete benchmark further below, but let's get the seemingly most obvious candidate out of the way first:
QOI
Because who doesn't want a fast encoder for a simple format with steadily growing adoption? Sure, part of the adoption might be hype-driven, but as far as hype goes, there are definitely worse targets than a codec that fits in less than 300 lines of C. The low-color images we want to compress are rather simple from a modern point of view as well, so you'd expect QOI to be a perfect match…
…until you actually try encoding a few representative images and are greeted with file sizes that are way further removed from PNG than you'd expect after seeing the official benchmarks. Since the specification is short enough, we can easily explain these results:
All of Shuusou Gyoku's sprites are intended to be rendered within a palettized 256-color framebuffer. 3D-rendered gradients and transparency will drive up the number of unique colors in screenshots into the low 4-digit range at times, but it still makes sense to assume uncompressed 8-bit BMPs as the baseline. At our native resolution of 640×480, these are 308,278 bytes large: 307,200 bytes of pixels, plus 54 bytes of BMP headers and a 1,024-byte palette. This is what we expect our chosen codec to beat, hopefully by a quite significant margin.
The 32-bit QOI_OP_RGB chunk would already blow up each affected pixel to 4× the size it would have had in a palettized image. Let's hope that the QOI encoder largely uses this chunk to define palette colors, and that we don't get to see it that often otherwise.
The 16-bit QOI_OP_LUMA chunk can maybe help compress unknown pixels that haven't yet been put into the running palette, but would still not contribute any compression compared to our baseline size. Fortunately, we shouldn't see too many of those as the encoder is specified to prefer 8-bit chunks where possible…
…except that QOI_OP_INDEX spends 8 bits on encoding a 6-bit palette index. With only 64 colors in the palette rather than the 256 we want, we're bound to see a lot more of those bulky 32-bit QOI_OP_RGB chunks after all. Not to mention the fact that colors are mapped onto these 64 palette slots using a simple multiplicative hash that will cause collisions at regular color intervals, as sketched below.
Any compression gains over uncompressed 8-bit BMP would therefore come from QOI_OP_RUN. If run-length encoding is the best an image codec can do, that's rather basic instead of OK, I'd say.
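For reference, this is the running-array position from the QOI specification, wrapped into a function for illustration; any two colors whose weighted channel sums are congruent modulo 64 fight over the same slot:

#include <stdint.h>

// QOI_OP_INDEX position as defined by the specification. A previously seen
// color only survives in the 64-entry running array until the next color
// that hashes to the same slot overwrites it.
int qoi_color_hash(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
  return (((r * 3) + (g * 5) + (b * 7) + (a * 11)) % 64);
}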
Actually… wait a moment, doesn't BMP also have a run-length-encoded mode that was mostly forgotten after the 90s? And indeed, the compression rates between vintage BMP/RLE and QOI are very similar, with any differences stemming from the way these two formats encode their run lengths. QOI typically does slightly better, but BMP/RLE still beats it in the 西方Project logo and the main menu.
So while reduced complexity and blazingly fast encoding speed are good arguments, they don't cut it if decent compression of our source images relies on all the complexity found in PNG. But shouldn't this deficiency have stuck out in the official benchmark in some way? After all, 43% of the images in QOI's test suite have ≤256 colors, with most of them coming from Philip K's Ancient Collection in the textures_pk directory, where they make up 80%. For this directory, the official numbers claim average compressed sizes of 80 KiB for PNG and 75 KiB for QOI, and running the benchmark myself confirms these numbers…
…but wait, the input PNG files in the test suite package are actually half that size?! Yup – this benchmark merely tests the fixed, untunable QOI format against two specific PNG encoders, libpng and stb_image, at their default compression level and filter settings. It does not claim anything about QOI's relation to the known limits of PNG as a format, despite what the hype drivers would lead you to conclude all too easily. In any case, it paints a much different picture of QOI's 256-color capabilities:
textures_pk | Average file size (bytes)
stb_image | 110,337
libpng | 82,136
QOI | 77,404
PNG source files | 43,437
oxipng -o max -Z | 41,032
We will later see why comparing the slowest PNG encoders against the constantly fast QOI is, in fact, not unfair.
The final nail in QOI's coffin is this concession at the end of its release announcement:
SIMD acceleration for QOI would also be cool but (from my very limited knowledge about some SIMD instructions on ARM), the format doesn't seem to be well suited for it. Maybe someone with a bit more experience can shed some light?
I'd rather take a new image format that's designed around modern SIMD instructions from the start. Then, it can invest these performance gains into more complex filters to end up with better compression at a roughly similar encoding performance. Heck, it can even be slightly slower for all I care. SIMD-first design worked great for non-cryptographic hashes, and we'll see in a minute that it works just as well for image formats.
But Ember2528 had a different codec in mind anyway. Let's jump right to the polar opposite of the complexity spectrum:
Lossless JPEG XL
Because why wouldn't you use the currently best and most popular image format according to actual professionals who know a couple of things about image compression? It's winning benchmarks left and right, and blog posts like these make it appear as if even version 0.10 of its reference encoder already beats out every other widely used codec. And after it unfairly got removed from Chromium in 2022, you can't help but root for it. Time to do my small part in bringing its adoption to a level that Google can no longer deny!
Too bad that the enthusiasm immediately drops after cloning the libjxl repo and running a CMake test build. What are all these library dependencies, and why can't I just reduce the build to the lossless encoder? The resulting binaries are way larger than what I'd consider appropriate in relation to game code. 😩
Looking through the repo more thoroughly, however, reveals a very welcome little surprise: If a few basic requirements are met, the fastest lossless speed tier actually uses an entirely separate encoder that's implemented in a single source file and can be used independently from the rest of libjxl. Nice to see that someone thought about simple integration after all! That's exactly what I've hoped to find. Sadly, Linux distributions don't have a separate standalone package for this encoder, but it wouldn't be the only library we'd statically link on Linux.
Having a single function as an easy entry point is always a good sign, too. Those parameters, though…
Only accepting pixels in RGBA memory order sure is awkward in a 3D-accelerated world where everything else prefers BGRX, including BMP files. Sure, it doesn't matter for us because we live in SDL land where we have SIMD-optimized pixel format converters (see the sketch after this list), but I don't think you should assume that everyone has these kinds of batteries included. "Just roll your own" isn't a good argument either because you'd want pixel format conversions to be SIMD-optimized. We'd all love it if compilers perfectly auto-vectorized such code, but we're not there yet; Visual Studio in particular is pretty bad at optimizing naive byte-flipping code. But writing SIMD code always comes with the same CPU feature detection and alignment boilerplate, and JPEG XL already has all of that in its codebase. Thus, it makes a lot more sense for it to include pixel format converters than forcing that onto every caller. It's API designs like this one that almost necessitate turning SDL into a hard dependency of the cross-platform frontend in the long run.
The not further documented big_endian parameter is the first indication that a lot of development effort went into aspects we don't care about. You'd think that passing true would cause the rgba buffer to be interpreted as ABGR, but it's only used to select the per-channel endianness of images with 16 bits per color channel. For 8-bit-per-channel images like the ones we're exclusively dealing with, it silently does nothing.
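For the record, those SDL batteries look like this – a sketch of converting a BGRX buffer into the RGBA memory order this entry point wants, using SDL 3's names (SDL 2's SDL_ConvertPixels() works the same way):

#include <SDL3/SDL.h>

// BGRX memory order = SDL_PIXELFORMAT_XRGB8888 on little-endian systems;
// SDL_PIXELFORMAT_RGBA32 is the endian-independent "RGBA in memory" alias.
bool BGRXToRGBA(const void *bgrx, int bgrx_pitch, void *rgba, int w, int h)
{
  return SDL_ConvertPixels(
    w, h,
    SDL_PIXELFORMAT_XRGB8888, bgrx, bgrx_pitch,
    SDL_PIXELFORMAT_RGBA32, rgba, (w * 4)
  );
}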
As the FJXL abbreviation implies, this encoder actually started as an independent project that, coincidentally, was a direct response to the hype surrounding QOI. By using AVX2 instructions within the confines of an existing format, it managed to beat QOI in both encoded file sizes and compression speed for every type of image its developer tested. But it's this competitive focus that brings us to its most questionable implementation decision.
The good news is that FJXL acknowledges that low-color images exist, are a prime use case for lossless compression, and are best dealt with using JPEG XL's palette features. However, detecting and optimizing that palette takes up a lot of time relative to QOI. If the input image uses more colors than a palette would make sense for, you'd want to fail as early as possible. Slide 11 explains the solution FJXL came up with:
Hash table with 65k possible entries
Any collision -> no palette
[…]
On non-palette-friendly images, this fails quickly (birthday paradox says after ~256 distinct pixels).
On palette images, encoding 1 channel rather than 4 more than compensates the cost of detection.
With 10 additional bits and a widely renowned multiplier, the hash function looks leaps and bounds ahead of the one in QOI:
// has to map 0 to 0
uint16_t pixel_hash(uint32_t p) {
  return ((p * 2654435761) >> 16);
}
But since we're still hashing 32-bit RGBA pixels to 16 bits, we're bound to run into a collision sooner or later. You can certainly think of this hash function as mapping color values to uniformly distributed random numbers and then reason about its efficacy using probability theory, as we saw in the slide above. However, the conclusion drawn in that slide is rather abbreviated and ultimately misleading: The birthday paradox does not return a binary success/failure result, but a probability. In this case of 256 distinct colors:
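Running the numbers myself with the exact birthday-paradox product over the 65,536 hash slots (my own back-of-the-envelope figures, which also reproduce the per-image collision chances quoted in the captions further below; 191 is the color count of our main menu image):

P_{\text{collision}}(n) = 1 - \prod_{k=1}^{n-1} \left(1 - \frac{k}{65536}\right), \qquad P_{\text{collision}}(256) \approx 39.3\%, \qquad P_{\text{collision}}(191) \approx 24.2\%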
That's a smaller probability, but a 1/4 failure rate would still be way too high for our use case. And sure enough, it actually happens in the main menu, where a single #583732FF pixel (or 0xFF323758 in its little-endian representation) collides with #FFFFFFFF:
The resulting 143 KiB file immediately tells us how not palettizing such images completely ruins the compression ratio. If this one pixel had any other non-colliding color, FJXL would have compressed it into a still decent 52 KiB. Therefore, the slides would have done better to add a graph of the failure probability and say something like:
Not perfect, and likely to misdetect even low-color images with <256 distinct colors as not palette-friendly according to the birthday paradox.
For our use case of screenshots without an alpha channel, we could work around this whole issue by having a separate non-alpha code path. Detecting the potential palette of an RGBA image within a worst-case time complexity of 𝑂(𝑛) without using hashes requires a 2³²-bit (512 MiB) array to cover the entire RGBA color space, which is probably too steep of a memory requirement. Removing the alpha channel, however, would shrink this array to a definitely appropriate 2 MiB.
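A sketch of what such an exact, hash-free detection pass could look like – my own illustration, not code from FJXL or the game:

#include <stdint.h>
#include <stdlib.h>

// One presence bit per 24-bit RGB value = 2 MiB, filled in a single O(n) pass
// over BGRX/RGBX pixels. Returns the distinct color count, or -1 if it
// exceeds 256 (or if the allocation fails).
int CountColorsExact(const uint32_t *pixels, size_t pixel_count)
{
  uint8_t *seen = (uint8_t *)calloc(((1u << 24) / 8), 1);
  if(!seen) {
    return -1;
  }
  int colors = 0;
  for(size_t i = 0; i < pixel_count; i++) {
    const uint32_t rgb = (pixels[i] & 0x00FFFFFF); // ignore the X byte
    const uint8_t bit = (1u << (rgb & 7));
    if(!(seen[rgb >> 3] & bit)) {
      seen[rgb >> 3] |= bit;
      colors++;
      if(colors > 256) {
        break;
      }
    }
  }
  free(seen);
  return ((colors > 256) ? -1 : colors);
}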
Ultimately though, we decided against doing any of that because FJXL by itself is as untunable from the outside as the codec it was inspired by. Ember2528 preferred the opposite: an encoder with multiple effort levels that offer different trade-offs between encoding speed and file size, which would allow faster CPUs to produce the smallest files at still reasonable speeds. So let's look past the bloat, link in the complete libjxl reference encoder, and see how it performs on higher effort levels…
…um, what is this API? Adapting the example code gave me encoding times that are at least 1.5× slower than the cjxl command-line encoder, and already hit the 100 ms mark at -e 2. Even -e 1 is suddenly much slower than using FJXL in isolation while yielding the same compressed sizes. Also, pushing speculative allocation onto the caller? 🤨 📝 stb_vorbis is a bad joke, not a model to be emulated.
The compressed file sizes are pretty underwhelming as well. Most of the test cases don't even get close to oxipng at -e ≤6 while still taking absurdly long to encode within the game. Even at peak effort, it's a mixed bag at best, with oxipng and JPEG XL -e 10 each massively beating the other in 3 of the 7 cases. And if that's the best we can say about this format…
All this is echoed by this recent issue that points out JPEG XL's inadequacy with an even more retro 16-color example. In the end, the documentation said it all along:
They are about 60-75% of size of PNG, and smaller than WebP lossless for photos.
But there is one widely-used image codec that both perfectly fits Ember2528's priorities and compresses well on lower effort levels. Let's finally look at the complete benchmark numbers:
main_menu / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
JPEG XL | 146,352 | 51,851 | 59,453 | 45,329 | 37,864 | 37,276 | 36,130 | 35,222 | 33,793 | 31,724
WebP | 54,116 | 32,194 | 28,112 | 27,860 | 27,712 | 28,272 | 28,178 | 28,120 | 28,684 | 27,816
AVIF | 272,604 | 272,604 | 136,220 | 131,235 | 119,398 | 117,525 | 111,380 | 110,684 | 110,543 | 109,601
BMP (8 bpp) | 308,278
BMP/RLE | 92,034
QOI | 93,884
oxipng -o max -Z | 30,702
ingame / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
JPEG XL | 123,606 | 102,949 | 130,689 | 102,944 | 84,916 | 72,590 | 68,302 | 49,618 | 45,865 | 46,997
WebP | 50,678 | 49,030 | 43,620 | 41,760 | 40,724 | 40,854 | 38,608 | 37,940 | 37,842 | 37,138
AVIF | 462,703 | 462,703 | 197,818 | 156,007 | 141,043 | 139,689 | 133,399 | 132,573 | 126,270 | 125,379
BMP (8 bpp) | 308,278
BMP/RLE | 185,842
QOI | 175,949
oxipng -o max -Z | 38,409
BMP, cropped | 185,398
BMP/RLE, cropped | 177,456
QOI, cropped | 165,620
stage6 / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
JPEG XL | 32,204 | 24,146 | 35,053 | 24,599 | 19,936 | 19,560 | 19,336 | 18,444 | 17,423 | 16,183
WebP | 20,856 | 19,916 | 17,070 | 16,524 | 16,380 | 16,562 | 15,488 | 15,386 | 15,404 | 15,124
AVIF | 185,676 | 185,676 | 84,437 | 62,354 | 57,791 | 56,524 | 52,956 | 52,611 | 51,969 | 51,795
BMP (8 bpp) | 308,278
BMP/RLE | 55,838
QOI | 52,302
oxipng -o max -Z | 18,741
BMP, cropped | 185,398
BMP/RLE, cropped | 48,954
QOI, cropped | 45,874
laser / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
JPEG XL | 345,199 | 287,279 | 301,608 | 248,852 | 92,463 | 85,529 | 81,206 | 66,811 | 61,445 | 47,173
WebP | 85,318 | 56,724 | 51,558 | 53,964 | 53,492 | 53,492 | 51,860 | 51,460 | 51,460 | 41,726
AVIF | 218,858 | 218,858 | 122,100 | 88,490 | 82,675 | 81,245 | 75,866 | 75,395 | 75,462 | 75,138
BMP (24 bpp) | 921,654
QOI | 290,088
oxipng -o max -Z | 61,595
BMP, cropped | 553,014
QOI, cropped | 280,462
laserbomb / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
JPEG XL | 332,706 | 125,197 | 150,436 | 128,755 | 110,357 | 102,891 | 99,718 | 68,968 | 66,975 | 64,484
WebP | 129,472 | 94,564 | 86,538 | 64,990 | 64,062 | 64,062 | 60,776 | 60,318 | 60,318 | 59,198
AVIF | 313,731 | 313,731 | 168,388 | 114,111 | 109,239 | 107,121 | 104,109 | 102,054 | 99,106 | 99,103
BMP (24 bpp) | 921,654
QOI | 210,496
oxipng -o max -Z | 87,286
BMP, cropped | 553,014
QOI, cropped | 200,002
gates / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
JPEG XL | 208,293 | 185,662 | 212,615 | 172,008 | 124,466 | 117,509 | 113,563 | 110,992 | 97,454 | 91,146
WebP | 124,308 | 125,070 | 113,896 | 102,656 | 102,482 | 102,482 | 95,536 | 94,768 | 94,768 | 57,850
AVIF | 306,742 | 306,742 | 293,874 | 293,276 | 254,073 | 243,953 | 243,947 | 242,188 | 241,943 | 241,359
BMP (24 bpp) | 921,654
QOI | 157,705
oxipng -o max -Z | 90,545
BMP, cropped | 553,014
QOI, cropped | 147,670
seihou / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
JPEG XL | 6,124 | 5,088 | 4,732 | 4,468 | 4,427 | 4,416 | 4,377 | 4,112 | 4,016 | 4,040
WebP | 39,518 | 5,904 | 5,642 | 5,574 | 5,500 | 5,518 | 5,518 | 5,504 | 5,486 | 5,490
AVIF | 26,984 | 26,984 | 25,085 | 24,927 | 22,582 | 21,698 | 21,697 | 21,627 | 21,631 | 21,505
BMP (8 bpp) | 308,278
BMP/RLE | 17,654
QOI | 18,047
oxipng -o max -Z | 5,383
BMP, cropped | 23,798
BMP/RLE, cropped | 14,144
QOI, cropped | 13,371
All file sizes are in bytes. The effort value directly corresponds to cwebp's -z parameter. Add 1 to get cjxl's -e parameter, and subtract from 10 for avifenc's -s parameter.
I definitely could have surveyed the landscape of PNG encoders more thoroughly, but since Ember2528 prioritized compression ratio over compression speed, there was no need to. oxipng is as good as it gets, but even its strongest and most sluggish setting is still outperformed by regular WebP at some level, and often as early as -z 2.
main_menu: 191 colors. The large areas in black and #DDE4FA are a great test case for an encoder's RLE capabilities. The menu's half-transparent background is slightly nasty, but should still keep this image well within the range of potential palette-based compression. (Unless you're QOI, of course.)
FJXL palette detection collision chance: 24.21%.
ingame: 92 colors. Lots of repeated bullet sprites to appropriately represent gameplay, plus a small transparency effect in the Evade gauge that shouldn't complicate compression all too much.
FJXL palette detection collision chance: 6.20%.
stage6: 96 colors. The wavy clock animation makes Stage 6 look complex, but we expect encoders to actually have a much easier time on the last three stages due to their backgrounds being mostly black.
FJXL palette detection collision chance: 6.72%.
laser: 1219 colors. A simple repeated tile in the background, with a big gradient that is likely to push the color count beyond palette-based algorithms.
laserbomb: 831 colors. Similar to enemy-fired lasers, but with multiple smaller gradients rather than a single big one.
gates: 2326 colors. With a comparatively complex background, bullets, and a big laser, this is probably the most intense test case for lossless compression that this game has to offer.
seihou: 40 colors. A small consolation prize for JPEG XL, as the smoothly feathered and blurred colors match the photo-like characteristics this codec was meant to target. Even oxipng gets to barely outperform WebP on this one. Then again, the difference between JPEG XL and WebP is still less than 1.5 KiB at most, for an image that doesn't represent the rest of the game.
FJXL palette detection collision chance: 1.18%.
Lossless WebP
Yup, it's 📝 ZMBV beating AV1 all over again. For these kinds of retro game screenshots, JPEG XL is vastly outperformed by its counterpart from the previous generation of widely-used image formats. And not just in terms of compressed file sizes, but also in every single other aspect that matters to us:
Faster compression times across every effort level? ✅ You bet. Imagine adapting its example code and actually getting encoding speeds that match the cwebp command-line encoder! Which brings us to…
Better C API? ✅ Check – well-documented and significantly easier to use, and I'm not even using the easiest entry point due to its fixed effort level (see the sketch after this list). libwebp does use a single 32-bit pixel format internally, just like JPEG XL, but what's that, importers for other 32-bit pixel formats and even palettized 8-bit images? Sure, the latter are part of the extra code that typically isn't included in Linux distribution packages, and just do a simple unoptimized loop. But that's how a library communicates that it's the right tool for the job.
Less bloat? ✅ Obviously. The unmodified reference library with all of its SSE and AVX optimizations adds an acceptable 274.5 KiB to the statically linked and optimized release binary.
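As a taste of that API, here's roughly what the lossless setup looks like with libwebp's advanced entry points – a simplified sketch with hypothetical surrounding names, not the actual integration:

#include <stddef.h>
#include <webp/encode.h>

// Lossless encode of a BGRA/BGRX buffer at a cwebp-style effort level (0–9).
// Returns the encoded size, or 0 on failure; *out must be WebPFree()d.
size_t EncodeLosslessWebP(
  const uint8_t *bgra, int width, int height, int stride, int effort,
  uint8_t **out
)
{
  WebPConfig config;
  WebPPicture pic;
  WebPMemoryWriter writer;

  if(!WebPConfigInit(&config) || !WebPPictureInit(&pic)) {
    return 0;
  }
  WebPConfigLosslessPreset(&config, effort); // maps 1:1 to cwebp's -z levels
  config.thread_level = 1;                   // only kicks in at levels 8 and 9

  pic.use_argb = 1; // mandatory for the lossless encoder
  pic.width = width;
  pic.height = height;
  WebPPictureImportBGRA(&pic, bgra, stride);

  WebPMemoryWriterInit(&writer);
  pic.writer = WebPMemoryWrite;
  pic.custom_ptr = &writer;

  const int ok = WebPEncode(&config, &pic);
  WebPPictureFree(&pic);
  if(!ok) {
    WebPMemoryWriterClear(&writer);
    return 0;
  }
  *out = writer.mem;
  return writer.size;
}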
That's not to say that libwebp is perfect. Its code makes it very obvious that lossless WebP was designed for 2010-era hardware as the encoder never got optimized for modern CPUs. There was an attempt at optimizing at least the lossy encoder for AVX2, but it was ultimately abandoned because it never got fast enough. Surprisingly, the codebase did receive new AVX2 code one week before I released this build, but it only covers the lossless decoder so far.
As for concurrency, libwebp does come with support for multi-threaded encoding, and I did activate it for the Shuusou Gyoku integration, but it's only used at effort levels 8 and 9. Also, why is argb in this structure interpreted as native-endian and therefore BGRA memory order, but these are interpreted as big-endian?
But the main criticism is the same that also applies to JPEG XL: The lossless and lossy modes are lumped into the same repository despite having virtually no code in common, and are selected via a structure field rather than having unrelated API entry points. This once again makes it very difficult for static linkers to remove all the code on the lossy branches that I never asked for in the first place.
And I sure never want to run the lossy encoder under any circumstance. Lossy WebP deserves all its bad reputation for basically being VP8's intra-frame coding applied to still images. VP8, 📝 if you remember, is that bad video codec from two generations ago that I'm only serving on this website due to sheer inertia. Applying its enforced YCbCr 4:2:0 chroma subsampling to images does not only make it utterly unsuitable for pixel art, but also even worse than well-compressed JPEG which isn't limited to a single subsampling scheme. If anything in the GIAN07 process accidentally flips the "I want lossless" flag, I'd rather want the WebP encoder to error out and have the screenshot frontend fall back on BMP than save an image with mutilated colors.
But while JPEG XL is a lost cause as far as I'm concerned, I've grown to like lossless WebP too much to leave it trapped within the unfortunate organization of its codebase. Also, there seems to be a lot of untapped potential in the format – really, why does PNG get all the attention of people writing alternative encoders when lossless WebP is the demonstrably much more capable format?
So I've decided to fork libwebp and surgically remove all code related to the lossy encoder. The statically linked result now only takes up ~100 KiB in the Windows build while still being API- and ABI-compatible. Of course, Linux users will still use their distribution's libwebp package with the lossy encoder included, but let's hope that the aforementioned possibility of accidents stays purely theoretical.
Really though, why have people started to bundle lossless and lossy image codecs under the same format in the first place if their algorithms have nothing in common? It might make sense for Opus where SILK and CELT are different kinds of lossy, but lossless and lossy are two completely different paradigms. The bloat and usability confusion far outweigh any situational tricks this might offer.
Alright, we found a good format with configurable effort levels, and we're only missing a way for players to pick an effort level. Depending on how they want to use this rapid-fire screenshot feature, almost all of the options make sense in some context:
You'd like to screenshot a whole section of a stage as fast as possible with the help of the disabled frame rate limiter, and you got plenty of free disk space? You probably want to stick with BMP and compress the screenshots outside of the game, just like how you would have done it without this feature.
A slight slowdown is OK or maybe even welcome for providing additional feedback that you're actually taking screenshots? Pick one of WebP's higher effort values that certainly take longer than 16 ms to encode, but are still reasonably fast and won't turn the game into a <2-FPS slideshow.
Want the lowest file size that your system can encode while staying at 62.5 FPS? Well, how fast is your system? And not just the CPU – maybe your system is actually bottlenecked by I/O and writing a large uncompressed BMP file takes much longer than encoding it into WebP and writing the resulting smaller file.
The latter two use cases would be covered by automatic detection of the maximum effort value that encodes within a given number of frames. The problem, however, is that encoding times are always relative to the complexity of the image. Once we're in-game and have lots of bullets and lasers, any choice that might have been appropriate for the main menu might suddenly start dropping frames after all. Thus, we can't solve this with an upfront benchmark, but have to dynamically adapt to the complexity of the current game scene. But then the whole idea falls apart as we can't possibly treat the configurable allowed screenshot time as a hard limit. To figure out whether it's safe to raise the effort level again, there's no way around periodically exceeding that limit and thus dropping more frames after all.
The ideal solution would involve deep hooks into the WebP encoder that could dynamically adjust the compression algorithms depending on the remaining time in the current frame. An image compressor with real-time guarantees… sure sounds like an interesting research project.
In the end, letting players choose a fixed format and effort level remains the best option. However, they can only make an informed choice if they know the performance of all options relative to each other. And that's how we arrive at this new submenu:
These measurements start before retrieving the framebuffer's pixels, and end after the file writing syscalls. If you save to a reasonably fast and write-cached storage medium, these syscalls are unlikely to have a big impact. Thus, the BMP times almost purely represent the fixed cost of the SDL_RenderReadPixels() call.
These specific numbers I got on my now almost 7-year-old Intel Core i5 8400T are very peculiar. -z 0 gets quite close to the 16 ms we have per frame, but would still be too slow to reliably compress every gameplay situation without dropping frames. A 64-bit build would speed up -z 0 by 10%, -z 2 through -z 7 by 25%, -z 8 by 210% (!), and -z 9 by 60%. Linux users already enjoy these higher speeds, and the Windows build is just a few compiler settings away from matching them. 📝 Last time, the bitness argument was a lot more balanced, but WebP encoding performance presents the first compelling reason for going 64-bit.
Or we could always go multi-threaded, which already is a much more popular idea within the Seihou development Discord group.
Or I could investigate PNG after all to find out how exactly its encoding speed compares to WebP…
But then, Ember2528 posted the encoding times he got on his new Ryzen 9 9950X3D:
…yeah, I probably won't get funding for performance tuning.
Finally, you probably already noticed another small change in this build: The ReC98 push ID is now shown in the bottom-right corner of the title screen image, just below the original game version number. This was the one part of replay preparations that I wanted to get in sooner rather than later. Since the game binary and the data files can be updated or modded independently from each other, I'm going to tag future replays with both of their respective versions to guarantee reproducibility. Of course, newer builds should never introduce bugs that affect gameplay and desynchronize existing replays. But if they ever do, the included push ID allows hosting sites to remove any replays recorded on such a broken build from the official competition tier associated with a specific data file version.
As for rendering the push ID, it should obviously look similar to the VERSION 1.005 text above. We can find these glyphs in GRAPH.DAT file #0, but this particular text is actually baked into the main menu's background image, which explains why the decimal point glyph isn't part of that data file. The glyphs for 0-9 are also used in-game for the score popups, but the A-Z glyphs remain unused – so unused, in fact, that pbg didn't even leave any reference to them in the source code:
This means that the game provides us with all the glyphs we would need to display the ReC98 push ID. However:
The 0-9 glyphs have a size of 5×7 and would stick out a bit too much against a capital P rendered as a smaller 5×5 glyph.
In WIP builds, the build ID should also include the Git commit, which traditionally uses small letters. Surrounding the commit info with (brackets) would also be nice.
So, all the glyphs next to the BUILD label actually come from the TrueType text renderer. The non-slashed zeroes immediately give this away, but exactly emulating the color gradient of the 0-9 glyphs makes MS Gothic blend in very well regardless:
And that's all I've got for these very packed three pushes! In exchange, I'll reserve the next Shuusou Gyoku push for another round of maintenance and forward compatibility.
The new builds:
Next up: The long-awaited Windows 98 backport of our Shuusou Gyoku build! This has been in development for quite a while, so this should now be a matter of days rather than weeks.
P0264
TH03/TH04/TH05 decompilation (Music Rooms, part 1/2)
P0265
TH03/TH04/TH05 decompilation (Music Rooms, part 2/2 + MAINE.EXE main()) + TH02 PI/RE (Boss damage and position)
💰 Funded by:
Blue Bolt, [Anonymous], iruleatgames
Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.
But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!
As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.
The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:
Let's call this "the blank image".
That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right?
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:
The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.
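To make the trick a bit more tangible: in the PC-98's 16-color mode, every pixel's hardware palette index is assembled from one bit of each of the four bitplanes, with the first (B) plane contributing the lowest bit. As a quick illustrative sketch – not actual game code, just the bit arithmetic:
// How a pixel's 4-bit hardware palette index is put together.
// B = bit 0, R = bit 1, G = bit 2, E = bit 3.
int palette_index(int b, int r, int g, int e)
{
	return ((b << 0) | (r << 1) | (g << 2) | (e << 3));
}
// palette_index(0, r, g, e) → even index → color from the blank sub-image
// palette_index(1, r, g, e) → odd index  → color from the spacey sub-image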
On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.
Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
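Put together, the per-frame flow looks roughly like this. All names are made up and the actual decompiled code is structured quite differently – not least because the drawing happens on whichever VRAM page is currently not displayed – but the idea is:
// Hypothetical per-frame sketch; plane B of graphics page 0 starts at
// segment 0xA800 and covers (640 / 8) * 400 = 32,000 bytes.
#include <dos.h> // MK_FP()

#define PLANE_SIZE ((640 / 8) * 400)

char nopoly_B[PLANE_SIZE]; // plane B, captured right after blitting the .PI

void music_room_render_polygons(void)
{
	char far *plane_B = (char far *)MK_FP(0xA800, 0x0000);
	int i;

	// Reblit the original first bitplane, including the 1 bits that keep
	// the text on top of everything…
	for(i = 0; i < PLANE_SIZE; i++) {
		plane_B[i] = nopoly_B[i];
	}

	// …then add the polygons on top, with the GRCG restricted to this
	// plane so that the other three stay untouched. The 1 bits flip the
	// covered pixels from even to odd palette indices – see the mode
	// register example below for the exact GRCG setup.
	// grcg_polygon_c(…);
}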
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:
// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);
// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);
// Regular GRCG blitting code to any VRAM segment…
pokeb(0xA800, offset, …);
// We're done, turn off the GRCG.
outportb(0x7C, 0x00);
This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck?
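Or, in (pseudo-)C terms – a hypothetical single-byte write, not master.lib's actual code:
// `mask` has 1 bits at the pixels that a polygon scanline wants to set
// within one byte of a bitplane.

// With the GRCG in RMW mode, a plain write is all it takes: the chip only
// touches the pixels at the mask's 1 bits and fills them with the tile
// register contents, on every enabled plane at once.
void put_with_grcg(unsigned char far *vram_byte, unsigned char mask)
{
	*vram_byte = mask; // a single MOV
}

// Without the GRCG, that same MOV would store the raw mask and wipe the
// other pixels in the byte. Preserving them requires an explicit OR –
// separately for every plane you want to modify.
void put_without_grcg(unsigned char far *vram_byte, unsigned char mask)
{
	*vram_byte |= mask; // the MOV turned into an OR
}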
An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.
As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:
Colors 0 or 1 can't be used, because those don't include any of the bits that can stay constant between frames.
If the lowest bit of a palette color index has no effect on the displayed color, text drawn in either of the two colors won't be visually affected by the polygon animation and will always appear on top. TH04 and TH05 rely on this property with their colors 2/3, 4/5, and 6/7 being identical, but this would work in TH02 and TH03 as well.
But this doesn't apply to TH02 and TH03's palettes, so how do they do it? The secret: They simply include all text pixels in nopoly_B. This allows text to use any color with an odd palette index – the lowest bit then won't be affected by the polygons ORed into the first bitplane, and the other bitplanes remain unchanged.
TH04 is a curious case. Ostensibly, it seems to remove support for odd text colors, probably because the new 10-frame fade-in animation on the comment text would require at least the comment area in VRAM to be captured into nopoly_B on every one of the 10 frames. However, the initial pixels of the tracklist are still included in nopoly_B, which would allow those to still use any odd color in this game. ZUN only removed those from nopoly_B in TH05, where it had to be changed because that game lets you scroll and browse through multiple tracklists.
The contents of nopoly_B with each game's first track selected.
Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:
Due to the polygon animation, the Music Room is one of the few double-buffered menus in PC-98 Touhou, rendering to both VRAM pages on alternate frames instead of using the other page to store a background image. Unfortunately though, this doesn't actually translate to tearing-free rendering because ZUN's initial implementation for TH02 mixed up the order of the required operations. You're supposed to first wait for the GDC's VSync interrupt and then, within the display's vertical blanking interval, write to the relevant I/O ports to flip the accessed and shown pages. Doing it the other way around and flipping as soon as you're finished with the last draw call of a frame means that you'll very likely hit a point where the (real or emulated) electron beam is still traveling across the screen. This ensures that there will be a tearing line somewhere on the screen on all but the fastest PC-98 models that can render an entire frame of the Music Room completely within the vertical blanking interval, causing the very issue that double-buffering was supposed to prevent.
ZUN only fixed this landmine in TH05.
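For reference, the correct order that TH05 uses boils down to something like this – a sketch that assumes master.lib's vsync_wait() helper and made-up page bookkeeping, not ZUN's actual code:
#include <dos.h>    // outportb()
#include <master.h> // vsync_wait(); assumes vsync_start() was called at startup

static int page_accessed = 1; // the page we just finished drawing to

void flip_pages(void)
{
	// First, wait for the GDC's VSync interrupt…
	vsync_wait();

	// …and only then, inside the vertical blanking interval, show the page
	// we just drew (port 0xA4) and move CPU accesses to the other one
	// (port 0xA6). TH02-TH04 write these two ports right after the last
	// draw call and *then* wait, flipping mid-frame and causing the tear.
	outportb(0xA4, page_accessed);
	page_accessed = (1 - page_accessed);
	outportb(0xA6, page_accessed);
}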
The polygons have a fixed vertex count and radius depending on their index, everything else is randomized. They are also never reinitialized while OP.EXE is running – if you leave the Music Room and reenter it, they will continue animating from the same position.
TH02 and TH04 don't handle it at all, causing held keys to be processed again after about a second.
TH03 and TH05 correctly work around the quirk, at the usual cost of a 614.4 µs delay per frame. Except that the delay is actually twice as long in frames in which a previously held key is released, because this code is a mess.
But even in 2024, DOSBox-X is the only emulator that actually replicates this detail of real hardware. On anything else, keyboard input will behave as ZUN intended it to. At least I've now mentioned this once for every game, and can just link back to this blog post for the other menus we still have to go through, in case their game-specific behavior matches this one.
TH02 is the only game that
1) separately lists the stage and boss themes of the main game, rather than following the in-game order of appearance,
2) continues playing the selected track when leaving the Music Room,
3) always loads both MIDI and PMD versions, regardless of the currently selected mode, and
4) does not stop the currently playing track before loading the new one into the PMD and MMD drivers.
The combination of 2) and 3) allows you to leave the Music Room and change the music mode in the Option menu to listen to the same track in the other version, without the game changing back to the title screen theme. 4), however, might cause the PMD and MMD drivers to play garbage for a short while if the music data is loaded from a slow storage device that takes longer than a single period of the OPN timer to fill the driver's song buffer. Probably not worth mentioning anymore though, now that people no longer try fitting PC-98 Touhou games on floppy disks.
As for the track comments, the text is laid out in a rather rigid fixed-width format:
Exactly 40 (TH02/TH03) / 38 (TH04/TH05) visible bytes per line,
padded with 2 bytes that can hold a CR/LF newline sequence for easier editing.
Every track starts with a title line that mostly just duplicates the names from the hardcoded tracklist,
followed by a fixed 19 (TH02/TH03/TH04) / 9 (TH05) comment lines.
In TH04 and TH05, lines can start with a semicolon (;) to prevent them from being rendered. This is purely a performance hint, and is visually equivalent to filling the line with spaces.
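In other words, each track's comment block is a fixed-size record of fixed-size lines, which would allow seeking straight to a given track's block. A hypothetical loader for the TH05 layout (1 title line + 9 comment lines of 38 + 2 bytes each, assuming the title line shares the same width) could look like this – the file name and all identifiers are made up:
#include <stdio.h>

#define LINE_VISIBLE 38 // TH04/TH05; 40 in TH02/TH03
#define LINE_PADDING  2 // room for an optional CR/LF pair
#define LINE_SIZE    (LINE_VISIBLE + LINE_PADDING)
#define TRACK_LINES  (1 + 9) // TH05: title line + comment lines

int comments_load(char buf[TRACK_LINES][LINE_SIZE], int track)
{
	FILE* fp = fopen("MUSIC.TXT", "rb"); // hypothetical file name
	if(fp == NULL) {
		return -1;
	}
	// Fixed-size records: the track ID is all we need to find the block.
	fseek(fp, ((long)track * TRACK_LINES * LINE_SIZE), SEEK_SET);
	fread(buf, LINE_SIZE, TRACK_LINES, fp);
	fclose(fp);
	return 0;
}
// When rendering, any line that starts with ';' is simply skipped – same
// look as a line full of spaces, minus the cost of blitting 38 of them.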
All in all, the quality of the code is even slightly below the already poor standard for PC-98 Touhou: More VRAM page copies than necessary, conditional logic that is nested way too deeply, a distinct avoidance of state in favor of loops within loops, and – of course – a couple of gotos to jump around as needed.
In TH05, this gets so bad with the scrolling and game-changing tracklist that it all gives birth to a wonderfully obscure inconsistency: When pressing both ⬆️/⬇️ and ⬅️/➡️ at the same time, the game first processes the vertical input and then the horizontal one in the next frame, making it appear as if the latter takes precedence. Except when the cursor is highlighting the first (⬆️ ) or 12th (⬇️ ) element of the list, and said list element is not the first track (⬆️ ) or the quit option (⬇️ ), in which case the horizontal input is ignored.
And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier…
For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, "Try to Normal Rank!!" could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".
With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌
The remaining 1/6th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.
Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there!
Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.
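As a made-up illustration of why that ordering pays off (not the decompiled code):
// Stone IDs in fight order: inner pair = 0 and 1, outer pair = 2 and 3,
// center = 4. All per-stone arrays are indexed with these IDs.
int stone_x[5], stone_y[5], stone_hp[5];

// Because the active stones of each phase form a contiguous ID range, the
// phase control functions only need a [first, last] pair to touch exactly
// the stones that are currently part of the fight – no flags, no scattered
// index lists.
void stones_update(int first, int last)
{
	int i;
	for(i = first; i <= last; i++) {
		// fire patterns, check collisions, etc. for stone i
	}
}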
This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!
These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just 1/6th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.
Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.
Turns out I was not quite done with the TH01 Anniversary Edition yet. You might have noticed some white streaks at the beginning of Sariel's second form, which are in fact a bug that I accidentally added to the initial release.
These can be traced back to a quirk I wasn't aware of, and hadn't documented so far. When defeating Sariel's first form during a pattern that spawns pellets, it's likely for the second form to start with additional pellets that resemble the previous pattern, but come out of seemingly nowhere. This shouldn't really happen if you look at the code: Nothing outside the typical pattern code spawns new pellets, and all existing ones are reset before the form transition…
Except if they're currently showing the 10-frame delay cloud animation, activated for all pellets during the symmetrical radial 2-ring pattern in Phase 2 and left activated for the rest of the fight. These pellets will continue their animation after the transition to the second form, and turn into regular pellets you have to dodge once their animation has completed.
By itself, this is just one more quirk to keep in mind during refactoring. It only turned into a bug in the Anniversary Edition because the game tracks the number of living pellets in a separate counter variable. After resetting all pellets, this counter is simply set to 0, regardless of any delay cloud pellets that may still be alive, and it's merely incremented or decremented when pellets are spawned or leave the playfield.
In the original game, this counter is only used as an optimization to skip spawning new pellets once the cap is reached. But with batched EGC-accelerated unblitting, it also makes sense to skip the rather costly setup and shutdown of the EGC if no pellets are active anyway. Except if the counter you use to check for that case can be 0 even if there are pellets alive, which consequently don't get unblitted…
There is an optimal fix though: Instead of unconditionally resetting the living pellet counter to 0, we decrement it for every pellet that does get reset. This preserves the quirk and gives us a consistently correct counter, allowing us to still skip every unnecessary loop over the pellet array.
Cutting out the lengthy defeat animation makes it easier to see where the additional pellets come from. Also, note how regular unblitting resumes once the first pellet gets clipped at the top of the playfield – the living pellet counter then gets decremented to -1, and who uses <= rather than == on a seemingly unsigned counter, right?
Ultimately, this was a harmless bug that didn't affect gameplay, but it's still something that players would have probably reported a few more times. So here's a free bugfix: