📝 Posted:
🏷️ Tags:

Unfortunately, I have to quickly interrupt the current PC-98 Touhou progress with breaking news of a replay desync bug in my Shuusou Gyoku build. Yup, it's free mod bugfix time again, this time featuring a bug with the most complex implications so far…

  1. The Extra Stage replay desync bug introduced in P0295
  2. How it happened
  3. Thinking about the best way to fix it
  4. Another filesize-related bug in replay saving

The bug in question dates back to the P0295 build from last October. While that giant release mostly focused on porting the game's rendering to SDL, it also included 📝 fixes for three pbg bugs in Shuusou Gyoku's handling of Extra Stage replays. Unfortunately, these fixes would introduce a bug of my own that was even worse. :tannedcirno:
Ever since that build, the replay header has consistently stored the difficulty and starting life count as shown in the Config → Difficulty menu. This looks fine on the surface until you consider the exact issue behind the three bugs: Shuusou Gyoku's Extra Stage is supposed to run on Hard difficulty and 2 starting lives, not on whatever you set for the regular 6 stages.
You can probably already imagine how invalid difficulty settings will cause desyncs shortly into a replay. Running a debug build at any commit from P0295 up to and including P0310 quickly reveals the issue:

Screenshot of Shuusou Gyoku's debug mode at the beginning of the Extra Stage, showing off the intended internal difficulty (Hard) and initial rank value (Pr = 10240).
Screenshot of Shuusou Gyoku's debug mode at the beginning of an Extra Stage replay recorded on a ReC98 build between P0295 and P0310, showing off how the replay incorrectly uses the configured difficulty (Lunatic) along with the higher rank that comes along with it.
Different difficulties come with different initial rank values (Pr), which cause bullets and lasers to spawn at different speeds than what the player maneuvered around while recording the replay, which in turn will manifest as a desync.

The only way to protect a replay from this bug was to set the regular game to Hard difficulty and 2 starting lives in the Config → Difficulty menu before recording. This is probably one of the rarest configurations imaginable – most people will have set the difficulty to either Normal to get that survival clear that unlocks Extra in the first place, or to Lunatic because they're superplayers and that's the only difficulty that matters to them. :godzun:
Note how the bug only affects the saved replay file. You were still playing and recording Extra in its intended Hard difficulty and with 2 starting lives, and any clear you've achieved was still valid.


This is exactly the kind of bug that can easily fall through the cracks of the regular testing that my backers and I do for every new build. Replays are a key item on my testing checklist, but I primarily test whether existing ones still work. With only one replay slot per stage, recording a new replay is always a cumbersome process: Is my previous replay for that stage worth keeping? If yes, what made it special? After all, I now need to give a more descriptive name to the file. Do I remember, or do I have to watch the replay again?
Also, the primary concern of replays is compatibility with pbg's original 1.005 version. In that context, they can provide important evidence that I haven't accidentally forked the gameplay. Therefore, replays should hit as many gameplay aspects and potential failure/desync points as possible, which requires actual gameplay effort. :onricdennat: From that point of view, it makes more sense to just keep testing with existing replays, especially when it comes to the Extra Stage.


Since this was just a metadata issue, we can both easily fix this bug for future replays and repair any existing affected ones. We simply have to set the replay's difficulty and starting lives to the one and only official values for the Extra Stage, and they will play back correctly again.
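In code, the repair is as simple as it sounds. Here's a minimal sketch with hypothetical structure and field names; the actual layout follows pbg's original replay format:

#include <cstdint>

// Hypothetical names, not the actual ReC98 code.
struct ReplayHeader {
	uint8_t difficulty; // e.g. 0 = Easy, 1 = Normal, 2 = Hard, 3 = Lunatic
	uint8_t lives;      // starting life count
	// …
};

// The one and only official settings for the Extra Stage.
void RepairExtraHeader(ReplayHeader& header)
{
	header.difficulty = 2; // Hard
	header.lives = 2;
}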

But doing that creates a potential problem. What if you actually modded the game before P0295, intentionally changed the difficulty and/or number of starting lives for the Extra Stage, and then recorded a replay? If that was the whole extent of your mod, such a replay would play back correctly on not just your modded build of Shuusou Gyoku, but on every single one of my builds and pbg's original 1.005 build. "Fixing" these settings in the replay header would then actually break such a replay. Since we're still using pbg's old replay format, there is no way we can distinguish valid modded replays from broken and desyncing ones by looking at just the replay header.
We could tell after we've run the replay – if the game ends before the replay has reached its last recorded frame, we know that something is wrong. However, we're not quite at the point where we can quickly simulate an entire round of gameplay logic on just the CPU, without rendering anything. The best we could do until then is to pop up a message at the end of a rendered replay, informing the player that they've just watched a desync and offering an automatic repair of known issues. But that would be a lot of work for a policy-bugfix, and even fall under the planned paid feature of improved replay-related error reporting. And if we zoom out, such a window won't be much of a help in the general case of people watching replays from incompatible builds. The game can't possibly know the specific mod a desyncing replay originated from, so what could it possibly do, other than to say "Sorry, that was, in fact, a desync 🤷"?
That's why it's so important to me that 📝 the new replay format stores the exact game binary and stage script versions a replay was recorded on. As well as any gameplay tweaking options, if we ever go that route: Properly fixing Shuusou Gyoku's fake deathbomb quirk is not just about the few lines of custom gameplay code you can find in Tasos500's fork, but mainly about the bureaucracy of cleanly establishing a separate competition tier, not breaking existing replays, and making sure that replay hosting sites deal with the distinction as well.

That said, that's a lot of thought for a very specific potential scenario. Any change to the Extra Stage settings would have required modding the game at either the C++ or machine code level. If you were able to pull that off and you're considering updating to the new build, you'll probably also read these lines and will have no problem adapting whatever fix I roll out for that issue.


So, let's go for that unconditional rewrite of every affected Extra Stage replay upon clicking the ExStage デモ再生 option… but wait, why are the rewritten files suddenly smaller than the old ones? :thonk:
Turns out that there was another replay-related bug that dates back to 📝 my very first Shuusou Gyoku release from September 2022. This one boiled down to the classic C/C++ footgun of confusing byte sizes with element counts, but pbg's misleading variable names certainly played their part as well.
This bug is mostly harmless within the unmodded game, which also explains how I didn't detect it for so long. The game doesn't care about compressing and uncompressing twice as many bytes, the loader still copied the correct amount of bytes and wouldn't have overflowed the buffer, and at a few KB per replay, it doesn't really stick out if the files are roughly twice as large as they needed to be. But this was still a landmine that would have exploded once modders crafted stages longer than half of the 20-minute buffer that pbg designated for replays.
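In the abstract, the footgun looks like this – a sketch with hypothetical names, not the actual code:

#include <cstddef>
#include <cstdint>

// Each recorded frame takes up one 2-byte element…
struct ReplayInput {
	uint8_t key_flags;
	uint8_t padding;
};

// …and this hypothetical helper expects a count of *elements*,
// multiplying by sizeof(ReplayInput) internally.
void CompressReplay(const ReplayInput *buf, size_t element_count);

void SaveReplay(const ReplayInput *buf, size_t recorded_frames)
{
	// Wrong: a misleadingly named "size" variable all too easily ends up
	// holding a byte count. The callee multiplies by the element size
	// again and compresses twice as many bytes as were recorded – or, for
	// stages longer than half the buffer, reads past its end.
	const size_t replay_size = (recorded_frames * sizeof(ReplayInput));
	CompressReplay(buf, replay_size);

	// Right: pass the number of recorded frames, i.e., elements.
	CompressReplay(buf, recorded_frames);
}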

Since I'm already implementing an automated fix here, I might as well also recompress every watched non-Extra Stage replay if its number of decompressed bytes doesn't match the replay's indicated frame count. Of course, recompression won't work if you've marked the replay files as read-only, which I often do as a means of protecting them from getting overwritten with accidentally recorded new replays of the same stage…
…but wait, how about restricting both fixes to writable replay files? This would create at least one possibility of protecting existing modded replays, and also make sense from a consistency point of view. If the game isn't allowed to fix the replay, it also shouldn't silently hotpatch its header and play it differently than any other build of the game would play it, even if that way is the correct one in the vast majority of cases. Sure, this is slightly annoying for people who use that same read-only trick, but those people will probably also read either these lines or the release notes.

As a neat bonus, I also made sure to preserve the original timestamps of any repaired and/or recompressed replay file. This is the only other piece of meaningful identifying metadata we have with these files, and I don't want to throw it away just because I messed up the saving code at one point. Without that extra level of care, I probably wouldn't have gone for such an unconditional automatic fix in the first place. Instead, this little detail makes the whole fix as invisible as it could possibly be. If you only recorded an Extra Stage replay once, haven't watched it since, and haven't touched the file either, you won't even notice that there was a bug in the first place.
SDL 3's filesystem API does not cover file timestamp modification, so this required more OS-specific code of the kind 📝 I'd rather get rid of. SDL 3 does support timestamp retrieval though, and that's all we need for the new replay format where I'll take the timestamp from the filesystem and properly write it into the file itself.
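On Windows, that OS-specific code boils down to a GetFileTime()/SetFileTime() sandwich around the actual repair. A sketch, with most error handling omitted:

#include <windows.h>

// Captures the timestamps, runs the given repair operation on the file,
// and restores the timestamps afterward.
bool RepairWithTimestamps(const wchar_t *fn, void (*repair)(const wchar_t *))
{
	HANDLE file = CreateFileW(
		fn, GENERIC_READ, FILE_SHARE_READ, nullptr, OPEN_EXISTING, 0, nullptr
	);
	if (file == INVALID_HANDLE_VALUE) {
		return false;
	}
	FILETIME creation, access, write;
	GetFileTime(file, &creation, &access, &write);
	CloseHandle(file);

	repair(fn); // rewrite the header, recompress, …

	file = CreateFileW(
		fn, FILE_WRITE_ATTRIBUTES, 0, nullptr, OPEN_EXISTING, 0, nullptr
	);
	if (file == INVALID_HANDLE_VALUE) {
		return false;
	}
	SetFileTime(file, &creation, &access, &write);
	CloseHandle(file);
	return true;
}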

And there we go, no more replay bugs! Also, did I just write down all the justification anyone would ever need for the new replay format? That should be shortening that future blog post by quite a bit at least…

Thanks to >>49320040 on /jp/ for pointing out that desyncs exist. Please tell me this sort of thing! I'm not ZUN, desyncs are critical bugs that will always receive my immediate attention. If they turn out to be my fault, they definitely fall under my free bugfix policy, and if they don't, we at least get to document them as bugs in the original game that might get fixed in a later push.
Alright, back to writing blitters for the PC-98…

📝 Posted:
💰 Funded by:
Ember2528
🏷️ Tags:

Well, that fell apart surprisingly quickly. The release of Shuusou Gyoku's Linux port just happened to be surrounded by the unluckiest sequence of events in Arch Linux land:

After a policy-bugfix for a silly mistake on my part, Shuusou Gyoku was still playable on sdl2-compat as it was only affected by rather minor bugs, but these bugs still undermined the effort I put into the port. That left us with three options:

  1. Let the more involved SDL community sort out sdl2-compat on their own. After all, why should we bother if rogue distros randomly mess with our dependencies?
  2. Become part of that community and help fix the issues in either sdl2-compat or SDL 3.
  3. Properly update Shuusou Gyoku to SDL 3 right now, while keeping SDL 2 support for the Flatpak, more conservative Linux distributions, and the upcoming Windows 98 backport.

I really would have preferred to delay this migration for a few years until the dust has settled. For this project, I already picked C++ as the dependency I want to be on the bleeding edge of, and SDL 2 was supposed to balance this out by being the conservative and stable choice. Oh well, if we've got to update at some point, we might as well do it now. The ReC98 development schedule at least gave me another month of waiting for the community to sort out SDL 3's growing pains…

  1. Forced onto an unstable SDL 2 compatibility layer
  2. Updating to SDL 3
  3. Picking a screenshot format
  4. Letting players pick an effort level
  5. Future performance improvements
  6. Rendering the build ID with some unused glyphs

So, why does something like sdl2-compat even exist if it only causes problems? And why are distros rolling it out so soon after SDL 3 if SDL 2 has been working fine all the time? In a nutshell, sdl2-compat is the second pillar in SDL's forward compatibility strategy. While the 📝 dynamic API mechanism ensures compatibility with future minor versions by integrating dynamic linking so deeply that static linking is made entirely useless, sdlN-compat ensures compatibility with one future major version by implementing version N's API in terms of SDL version N+1. This allows the SDL team to very quickly stop updating version N while still allowing programs linked against that version to run well on modern systems by using all the actively maintained backends of version N+1. This worked out well with sdl12-compat, which nowadays seems to do a great job at preserving abandoned SDL 1 games – especially if we consider that you'd be running sdl12-compat on top of sdl2-compat on top of SDL 3 from now on. :tannedcirno:

So it only makes sense why the SDL developers would want to repeat this success story with the transition from SDL 2 to 3. The problem is that they're already selling sdl2-compat as a perfect drop-in replacement for proper SDL 2, and wanted to push it onto people even before SDL 3 was officially released. The sales pitch follows their usual "trust me bro" rhetoric:

If you absolutely must have the real SDL2 ("SDL 2 Classic"), please use the SDL2 branch at https://github.com/libsdl-org/SDL, which occasionally gets bug fixes (and eventually, no new formal releases). But we strongly encourage you not to do that.

Followed by zero arguments to back up this audacious suggestion. So they not only imply that sdl2-compat is already perfectly compatible and works without bugs for every SDL 2 program ever, but also that the underlying SDL 3 implementation doesn't introduce any bugs on top – and it only takes a single look into either project's issue tracker to disprove that notion. There is no technical reason why a distro couldn't ship SDL 3 and 2 in parallel. The continued existence of the SDL 2 AUR package is proof of that, and as of mid-March, that package still received upset comments justifying its existence.
There was absolutely no reason to push sdl2-compat on everyone by default other than forcefully turning users into beta testers. SDL 2 was still stable, maintained, and working well. People who needed SDL 3 before its release for whatever feature already used SDL 3. People who want to use the SDL 3 backends to solve some obscure backend-related issue in an SDL 2 program can use sdl2-compat without needing it to be the only option available. And with a package size of 1.2 MiB, you can't convince me that SDL 2 is somehow a burden on the packaging front either – especially if your distro has separate packages for every commonly used fiddly Python and Haskell library.
I can't help but imagine the reaction if Microsoft pushed an enforced update of this magnitude. They're already getting regularly lambasted by the press for much smaller and ultimately inconsequential offenses…

For all the 📝 criticism I had about Flatpak and Flathub last time, they made the right choice of not treating their base package as a rolling and bleeding-edge distribution. The Freedesktop platform will only ship SDL 3 in its next version releasing in August, which will probably leave enough time for the SDL developers to address all but the rarest remaining issues in sdl2-compat. Although I'm not sure how I should interpret this commit being made at that specific time: This is either very considerate (because they've chosen to take up the job of early-adopting SDL 3 as part of developing the new SDK version, and thus will be helping out with reporting bugs), or very inconsiderate because they bought the whole sdl2-compat story just like Arch did. If Freedesktop SDK updates had shipped in February rather than August and the release tag had been on this branch, they would have screwed over their users just as much. Also, there's still not much point in force-updating everyone onto a compatibility layer in freaking 2025…

Then again, I can empathize with the SDL developers to a degree. Lots of developers have been asking the "when is SDL 3 ready and stable enough for regular use?" question while picturing SDL as this highly important and central library that surely has a big team of testers who could ensure its stability at one point. But if there just isn't enough Valve money to form such a team, what else should you do as a developer other than turn your personal hype into a "it's ready now, go use it and please leave feedback" reply? Maybe, turning your users into beta testers is the only realistic way to ever approach stability in this economy. And sure, they call it 3.2.0 for… reasons, but they're not fooling anyone.

The big irony, however, is this: At one point in the future, sdl2-compat will be that perfect solution for running abandoned SDL 2 (and SDL 1) programs on top of SDL 3. But it's the exact opposite of what you'd want during active development: You want to update to SDL 3 and use the new APIs and function names to be ready for the future, but also retain the option to run on the stable SDL 2 foundation for at least a little longer until every distribution has caught up. Or, in other words, you want to run SDL 3 on top of SDL 2.
You could totally have a library that implements this alternate kind of compatibility layer. It would still be prone to bugs just like sdl2-compat, but unlike that one, the chance for new bugs is halved since you'd be running on top of the proven and stable SDL 2. But of course, such a library would restrict your codebase to SDL 2's feature set, which is probably why something like this doesn't exist. So instead, our SDL platform layer now contains 64 conditional branches and a bunch of function renaming macros and generic helper code to support compiling against both SDL 3 and SDL 2. At least I wrote it all in a way that allows us to quickly rip out SDL 2 support once we no longer need it…
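To give you an idea of what the renaming macros look like: the codebase is written against the SDL 3 names, which get mapped back onto their SDL 2 equivalents when compiling against SDL 2. A sketch using one of SDL 3's actual renames, assuming the SDL headers are already included:

#if !SDL_VERSION_ATLEAST(3, 0, 0)
	// Same signature in both versions, so a plain rename is enough:
	#define SDL_DestroySurface SDL_FreeSurface

	// Functions whose signatures changed as well (bool return values,
	// int → float rectangles, …) get small wrapper functions instead.
#endif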


Oh well, enough ranting. Because once it works, there are plenty of things to like about SDL 3. Limited to, of course, everything notable that applies to Shuusou Gyoku:

A few changes have good and bad elements:

Thankfully, the list of entirely bad changes is quite short:

Still, the constant stumbling over bugs and deliberate instabilities made this take way longer than it had any right to. For three of these bugs, I was the first one to report them, and I could have even reported a fourth one if I actually cared about Vulkan and didn't happen to find a workaround right before I pushed out the release.
With the additional API unbricking feature, we've ended up well into a second push. Replays were too big of a feature for now, but screenshot compression sounded like a nice task for the rest of that push. Really, how hard can it be? Add the reference C library of our encoder of choice, call its API with the pixel buffer we get from SDL, write the compressed pixel buffer to a file. Easy, right? Well…


For starters, which format do we choose? Ember2528 had a clear preference, but it makes sense to compare it against other contenders first. There will be a complete benchmark further below, but let's get the seemingly most obvious candidate out of the way first:

QOI

Because who doesn't want a fast encoder for a simple format with steadily growing adoption? Sure, part of the adoption might be hype-driven, but as far as hype goes, there are definitely worse targets than a codec that fits in less than 300 lines of C. The low-color images we want to compress are rather simple from a modern point of view as well, so you'd expect QOI to be a perfect match…
…until you actually try encoding a few representative images and are greeted with file sizes that are way further removed from PNG than you'd expect after seeing the official benchmarks. Since the specification is short enough, we can easily explain these results:

So while reduced complexity and blazingly fast encoding speed are good arguments, they don't cut it if decent compression of our source images relies on all the complexity found in PNG. But shouldn't this deficiency have stuck out in the official benchmark in some way? After all, 43% of the images in QOI's test suite have ≤256 colors, with most of them coming from Philip K's Ancient Collection in the textures_pk directory, where they make up 80%. For this directory, the official numbers claim average compressed sizes of 80 KiB for PNG and 75 KiB for QOI, and running the benchmark myself confirms these numbers…
…but wait, the input PNG files in the test suite package are actually half that size?! Yup – this benchmark merely tests the fixed, untunable QOI format against two specific PNG encoders, libpng and stb_image, at their default compression level and filter settings. It does not claim anything about QOI's relation to the known limits of PNG as a format, despite what the hype drivers would lead you to conclude all too easily. In any case, it paints a much different picture of QOI's 256-color capabilities:

| Encoder | Average file size (bytes) |
|---|---:|
| stb_image | 110,337 |
| libpng | 82,136 |
| QOI | 77,404 |
| PNG source files | 43,437 |
| oxipng -o max -Z | 41,032 |
We will later see why comparing the slowest PNG encoders against the consistently fast QOI is, in fact, not unfair.

The final nail in QOI's coffin is this concession at the end of its release announcement:

SIMD acceleration for QOI would also be cool but (from my very limited knowledge about some SIMD instructions on ARM), the format doesn't seem to be well suited for it. Maybe someone with a bit more experience can shed some light?

I'd rather take a new image format that's designed around modern SIMD instructions from the start. Then, it can invest these performance gains into more complex filters to end up with better compression at a roughly similar encoding performance. Heck, it can even be slightly slower for all I care. SIMD-first design worked great for non-cryptographic hashes, and we'll see in a minute that it works just as well for image formats.
But Ember2528 had a different codec in mind anyway. Let's jump right to the polar opposite of the complexity spectrum:

Lossless JPEG XL

Because why wouldn't you use the currently best and most popular image format according to actual professionals who know a couple of things about image compression? It's winning benchmarks left and right, and blog posts like these make it appear as if even version 0.10 of its reference encoder already beats out every other widely used codec. And after it unfairly got removed from Chromium in 2022, you can't help but root for it. Time to do my small part in bringing its adoption to a level that Google can no longer deny!

Too bad that the enthusiasm immediately drops after cloning the libjxl repo and running a CMake test build. What are all these library dependencies, and why can't I just reduce the build to the lossless encoder? The resulting binaries are way larger than what I'd consider appropriate in relation to game code. 😩
Looking through the repo more thoroughly, however, reveals a very welcome little surprise: If a few basic requirements are met, the fastest lossless speed tier actually uses an entirely separate encoder that's implemented in a single source file and can be used independently from the rest of libjxl. Nice to see that someone thought about simple integration after all! That's exactly what I've hoped to find. Sadly, Linux distributions don't have a separate standalone package for this encoder, but it wouldn't be the only library we'd statically link on Linux.
Having a single function as an easy entry point is always a good sign, too. Those parameters, though… :thonk:

As the FJXL abbreviation implies, this encoder actually started as an independent project that, coincidentally, was a direct response to the hype surrounding QOI. By using AVX2 instructions within the confines of an existing format, it managed to beat QOI in both encoded file sizes and compression speed for every type of image its developer tested. But it's this competitive focus that brings us to its most questionable implementation decision.
The good news is that FJXL acknowledges that low-color images exist, are a prime use case for lossless compression, and are best dealt with using JPEG XL's palette features. However, detecting and optimizing that palette takes up a lot of time relative to QOI. If the input image uses more colors than a palette would make sense for, you'd want to fail as early as possible. Slide 11 explains the solution FJXL came up with:

  • Hash table with 65k possible entries
  • Any collision -> no palette
  • […]

On non-palette-friendly images, this fails quickly (birthday paradox says after ~256 distinct pixels).

On palette images, encoding 1 channel rather than 4 more than compensates the cost of detection.

With 10 additional bits and a widely renowned multiplier, the hash function looks leaps and bounds ahead of the one in QOI:

// has to map 0 to 0
uint16_t pixel_hash(uint32_t p) {
	return ((p * 2654435761u) >> 16);
}
Adapted from the original code.
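For comparison, this is the hash that QOI uses to map each pixel into its 64-entry color index, straight from the specification:

uint8_t qoi_index_position(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
	return ((r * 3 + g * 5 + b * 7 + a * 11) % 64);
}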

But since we're still hashing 32-bit RGBA pixels to 16 bits, we're bound to run into a collision sooner or later. You can certainly think of this hash function as mapping color values to uniformly distributed random numbers and then reason about its efficacy using probability theory, as we saw in the slide above. However, the conclusion drawn in that slide is rather abbreviated and ultimately misleading: The birthday paradox does not return a binary success/failure result, but a probability. In this case of 256 distinct colors:

(1 − (65536! / (65536 − 256)!) / 65536^256) ≈ 39.27%

Let's plug in 191, for no reason whatsoever:

(1 − (65536! / (65536 − 191)!) / 65536^191) ≈ 24.21%

That's a smaller probability, but a 1/4 failure rate would still be way too high for our use case. And sure enough, it actually happens in the main menu, where a single #583732FF pixel (or 0xFF323758 in its little-endian representation) collides with #FFFFFFFF:

The `main_menu` benchmark image.
A 16× zoomed view of the `main_menu` benchmark image, highlighting the single #583732FF pixel that causes the hash collision in FJXL's palette detection code.

The resulting 143 KiB file immediately tells us how not palettizing such images completely ruins the compression ratio. If this one pixel had any other non-colliding color, FJXL would have compressed it into a still decent 52 KiB. Therefore, the slides should have better added a graph of the failure probability, and said something like:

Not perfect, and likely to misdetect even low-color images with <256 distinct colors as not palette-friendly according to the birthday paradox.

For our use case of screenshots without an alpha channel, we could work around this whole issue by having a separate non-alpha code path. Detecting the potential palette of an RGBA image within a worst-case time complexity of O(n) without using hashes requires a (2^32 / 8) = 512 MiB bit array to cover the entire RGBA color space, which is probably too steep of a memory requirement. Removing the alpha channel, however, would shrink this array to a definitely appropriate 2 MiB.
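Such an exact, hash-free check could look like this – a hypothetical sketch, not part of FJXL:

#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the number of distinct 24-bit RGB colors, stopping early once
// `limit` is exceeded and the image can no longer be palettized.
size_t CountColors(const uint32_t *rgb, size_t pixel_count, size_t limit)
{
	std::vector<uint8_t> seen(((1 << 24) / 8), 0); // the 2 MiB bit array
	size_t distinct = 0;
	for (size_t i = 0; i < pixel_count; i++) {
		const uint32_t c = (rgb[i] & 0xFFFFFF); // alpha channel removed
		if (!(seen[c / 8] & (1 << (c % 8)))) {
			seen[c / 8] |= (1 << (c % 8));
			if (++distinct > limit) {
				break;
			}
		}
	}
	return distinct;
}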

Ultimately though, we decided against doing any of that because FJXL by itself is as untunable from the outside as the codec it was inspired by. Ember2528 preferred the opposite: an encoder with multiple effort levels that offer different trade-offs between encoding speed and file size, which would allow faster CPUs to produce the smallest files at still reasonable speeds. So let's look past the bloat, link in the complete libjxl reference encoder, and see how it performs on higher effort levels…

…um, what is this API? Adapting the example code gave me encoding times that are at least 1.5× slower than the cjxl command-line encoder, and already hit the 100 ms mark at -e 2. Even -e 1 is suddenly much slower than using FJXL in isolation while yielding the same compressed sizes. Also, pushing speculative allocation onto the caller? 🤨 📝 stb_vorbis is a bad joke, not a model to be emulated.
The compressed file sizes are pretty underwhelming as well. Most of the test cases don't even get close to oxipng at -e ≤6 while still taking absurdly long to encode within the game. Even at peak effort, it's a mixed bag at best, with oxipng and JPEG XL -e 10 each massively beating the other in 3 out of 7 cases. And if that's the best we can say about this format…

All this is echoed by this recent issue that points out JPEG XL's inadequacy with an even more retro 16-color example. In the end, the documentation said it all along:

They are about 60-75% of size of PNG, and smaller than WebP lossless for photos.

But there is one widely-used image codec that both perfectly fits Ember2528's priorities and compresses well on lower effort levels. Let's finally look at the complete benchmark numbers:

All file sizes in bytes.

| main_menu / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| JPEG XL | 146,352 | 51,851 | 59,453 | 45,329 | 37,864 | 37,276 | 36,130 | 35,222 | 33,793 | 31,724 |
| WebP | 54,116 | 32,194 | 28,112 | 27,860 | 27,712 | 28,272 | 28,178 | 28,120 | 28,684 | 27,816 |
| AVIF | 272,604 | 272,604 | 136,220 | 131,235 | 119,398 | 117,525 | 111,380 | 110,684 | 110,543 | 109,601 |

BMP (8 bpp): 308,278
BMP/RLE: 92,034
QOI: 93,884
oxipng -o max -Z: 30,702

| ingame / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| JPEG XL | 123,606 | 102,949 | 130,689 | 102,944 | 84,916 | 72,590 | 68,302 | 49,618 | 45,865 | 46,997 |
| WebP | 50,678 | 49,030 | 43,620 | 41,760 | 40,724 | 40,854 | 38,608 | 37,940 | 37,842 | 37,138 |
| AVIF | 462,703 | 462,703 | 197,818 | 156,007 | 141,043 | 139,689 | 133,399 | 132,573 | 126,270 | 125,379 |

BMP (8 bpp): 308,278
BMP/RLE: 185,842
QOI: 175,949
oxipng -o max -Z: 38,409
BMP, cropped: 185,398
BMP/RLE, cropped: 177,456
QOI, cropped: 165,620

| stage6 / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| JPEG XL | 32,204 | 24,146 | 35,053 | 24,599 | 19,936 | 19,560 | 19,336 | 18,444 | 17,423 | 16,183 |
| WebP | 20,856 | 19,916 | 17,070 | 16,524 | 16,380 | 16,562 | 15,488 | 15,386 | 15,404 | 15,124 |
| AVIF | 185,676 | 185,676 | 84,437 | 62,354 | 57,791 | 56,524 | 52,956 | 52,611 | 51,969 | 51,795 |

BMP (8 bpp): 308,278
BMP/RLE: 55,838
QOI: 52,302
oxipng -o max -Z: 18,741
BMP, cropped: 185,398
BMP/RLE, cropped: 48,954
QOI, cropped: 45,874

| laser / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| JPEG XL | 345,199 | 287,279 | 301,608 | 248,852 | 92,463 | 85,529 | 81,206 | 66,811 | 61,445 | 47,173 |
| WebP | 85,318 | 56,724 | 51,558 | 53,964 | 53,492 | 53,492 | 51,860 | 51,460 | 51,460 | 41,726 |
| AVIF | 218,858 | 218,858 | 122,100 | 88,490 | 82,675 | 81,245 | 75,866 | 75,395 | 75,462 | 75,138 |

BMP (24 bpp): 921,654
QOI: 290,088
oxipng -o max -Z: 61,595
BMP, cropped: 553,014
QOI, cropped: 280,462

| laserbomb / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| JPEG XL | 332,706 | 125,197 | 150,436 | 128,755 | 110,357 | 102,891 | 99,718 | 68,968 | 66,975 | 64,484 |
| WebP | 129,472 | 94,564 | 86,538 | 64,990 | 64,062 | 64,062 | 60,776 | 60,318 | 60,318 | 59,198 |
| AVIF | 313,731 | 313,731 | 168,388 | 114,111 | 109,239 | 107,121 | 104,109 | 102,054 | 99,106 | 99,103 |

BMP (24 bpp): 921,654
QOI: 210,496
oxipng -o max -Z: 87,286
BMP, cropped: 553,014
QOI, cropped: 200,002

| gates / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| JPEG XL | 208,293 | 185,662 | 212,615 | 172,008 | 124,466 | 117,509 | 113,563 | 110,992 | 97,454 | 91,146 |
| WebP | 124,308 | 125,070 | 113,896 | 102,656 | 102,482 | 102,482 | 95,536 | 94,768 | 94,768 | 57,850 |
| AVIF | 306,742 | 306,742 | 293,874 | 293,276 | 254,073 | 243,953 | 243,947 | 242,188 | 241,943 | 241,359 |

BMP (24 bpp): 921,654
QOI: 157,705
oxipng -o max -Z: 90,545
BMP, cropped: 553,014
QOI, cropped: 147,670

| seihou / Effort | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| JPEG XL | 6,124 | 5,088 | 4,732 | 4,468 | 4,427 | 4,416 | 4,377 | 4,112 | 4,016 | 4,040 |
| WebP | 39,518 | 5,904 | 5,642 | 5,574 | 5,500 | 5,518 | 5,518 | 5,504 | 5,486 | 5,490 |
| AVIF | 26,984 | 26,984 | 25,085 | 24,927 | 22,582 | 21,698 | 21,697 | 21,627 | 21,631 | 21,505 |

BMP (8 bpp): 308,278
BMP/RLE: 17,654
QOI: 18,047
oxipng -o max -Z: 5,383
BMP, cropped: 23,798
BMP/RLE, cropped: 14,144
QOI, cropped: 13,371
The effort value directly corresponds to cwebp's -z parameter. Add 1 to get cjxl's -e parameter, and subtract from 10 for avifenc's -s parameter.
I definitely could have surveyed the landscape of PNG encoders more thoroughly, but since Ember2528 prioritized compression ratio over compression speed, there was no need to. oxipng is as good as it gets, but even its strongest and most sluggish setting is still outperformed by regular WebP at some effort level, and often as early as -z 2.
191 colors. The large areas in black and #DDE4FA are a great test case for an encoder's RLE capabilities. The menu's half-transparent background is slightly nasty, but should still keep this image well within the range of potential palette-based compression. (Unless you're QOI, of course.)
FJXL palette detection collision chance: 24.21%.
92 colors. Lots of repeated bullet sprites to appropriately represent gameplay, plus a small transparency effect in the Evade gauge that shouldn't complicate compression all too much.
FJXL palette detection collision chance: 6.20%.
96 colors. The wavy clock animation makes Stage 6 look complex, but we expect encoders to actually have a much easier time on the last three stages due to their backgrounds being mostly black.
FJXL palette detection collision chance: 6.72%.
1219 colors. A simple repeated tile in the background, with a big gradient that is likely to push the color count beyond palette-based algorithms.
831 colors. Similar to enemy-fired lasers, but with multiple smaller gradients rather than a single big one.
2326 colors. With a comparatively complex background, bullets, and a big laser, this is probably the most intense test case for lossless compression that this game has to offer.
40 colors. A small consolation prize for JPEG XL, as the smoothly feathered and blurred colors match the photo-like characteristics this codec was meant to target. Even oxipng gets to barely outperform WebP on this one. Then again, the difference between JPEG XL and WebP is still less than 1.5 KiB at most, for an image that doesn't represent the rest of the game.
FJXL palette detection collision chance: 1.18%.
The seven benchmark images: `main_menu`, `ingame`, `stage6`, `laser`, `laserbomb`, `gates`, and `seihou`.

Lossless WebP

Yup, it's 📝 ZMBV beating AV1 all over again. For these kinds of retro game screenshots, JPEG XL is vastly outperformed by its counterpart from the previous generation of widely-used image formats. And not just in terms of compressed file sizes, but also in every single other aspect that matters to us:

That's not to say that libwebp is perfect. Its code makes it very obvious that lossless WebP was designed for 2010-era hardware as the encoder never got optimized for modern CPUs. There was an attempt at optimizing at least the lossy encoder for AVX2, but it was ultimately abandoned because it never got fast enough. Surprisingly, the codebase did receive new AVX2 code one week before I released this build, but it only covers the lossless decoder so far.
As for concurrency, libwebp does come with support for multi-threaded encoding, and I did activate it for the Shuusou Gyoku integration, but it's only used at effort levels 8 and 9. Also, why is argb in this structure interpreted as native-endian and therefore BGRA memory order, but these are interpreted as big-endian?
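For reference, here's roughly what our lossless encoding call boils down to – a sketch of the libwebp API usage, with picture-setup error handling omitted:

#include <webp/encode.h>

// Compresses a (w × h) RGBA screenshot losslessly. The returned buffer
// belongs to the caller; a size of 0 indicates failure.
uint8_t *EncodeScreenshot(
	const uint8_t *rgba, int w, int h, int effort, size_t *size_out
)
{
	WebPConfig config;
	WebPConfigInit(&config);
	WebPConfigLosslessPreset(&config, effort); // 0 (fast) … 9 (small)
	config.thread_level = 1; // only has an effect at efforts 8 and 9

	WebPPicture pic;
	WebPPictureInit(&pic);
	pic.use_argb = 1; // mandatory for lossless
	pic.width = w;
	pic.height = h;
	WebPPictureImportRGBA(&pic, rgba, (w * 4));

	WebPMemoryWriter writer;
	WebPMemoryWriterInit(&writer);
	pic.writer = WebPMemoryWrite;
	pic.custom_ptr = &writer;

	const int ok = WebPEncode(&config, &pic);
	WebPPictureFree(&pic);
	if (!ok) {
		WebPMemoryWriterClear(&writer);
		*size_out = 0;
		return nullptr;
	}
	*size_out = writer.size;
	return writer.mem;
}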

But the main criticism is the same that also applies to JPEG XL: The lossless and lossy modes are lumped into the same repository despite having virtually no code in common, and are selected via a structure field rather than having unrelated API entry points. This once again makes it very difficult for static linkers to remove all the code on the lossy branches that I never asked for in the first place.
And I sure never want to run the lossy encoder under any circumstance. Lossy WebP deserves all its bad reputation for basically being VP8's intra-frame coding applied to still images. VP8, 📝 if you remember, is that bad video codec from two generations ago that I'm only serving on this website due to sheer inertia. Applying its enforced YCbCr 4:2:0 chroma subsampling to images does not only make it utterly unsuitable for pixel art, but also even worse than well-compressed JPEG which isn't limited to a single subsampling scheme. If anything in the GIAN07 process accidentally flips the "I want lossless" flag, I'd rather want the WebP encoder to error out and have the screenshot frontend fall back on BMP than save an image with mutilated colors.

But while JPEG XL is a lost cause as far as I'm concerned, I've grown to like lossless WebP too much to leave it trapped within the unfortunate organization of its codebase. Also, there seems to be a lot of untapped potential in the format – really, why does PNG get all the attention of people writing alternative encoders when lossless WebP is the demonstrably much more capable format?
So I've decided to fork libwebp and surgically remove all code related to the lossy encoder. The statically linked result now only takes up ~100 KiB in the Windows build while still being API- and ABI-compatible. Of course, Linux users will still use their distribution's libwebp package with the lossy encoder included, but let's hope that the aforementioned possibility of accidents stays purely theoretical.

Really though, why have people started to bundle lossless and lossy image codecs under the same format in the first place if their algorithms have nothing in common? It might make sense for Opus where SILK and CELT are different kinds of lossy, but lossless and lossy are two completely different paradigms. The bloat and usability confusion far outweigh any situational tricks this might offer.


Alright, we found a good format with configurable effort levels, and we're only missing a way for players to pick an effort level. Depending on how they want to use this rapid-fire screenshot feature, almost all of the options make sense in some context:

  1. You'd like to screenshot a whole section of a stage as fast as possible with the help of the disabled frame rate limiter, and you got plenty of free disk space? You probably want to stick with BMP and compress the screenshots outside of the game, just like how you would have done it without this feature.
  2. A slight slowdown is OK or maybe even welcome for providing additional feedback that you're actually taking screenshots? Pick one of WebP's higher effort values that certainly take longer than 16 ms to encode, but are still reasonably fast and won't turn the game into a <2-FPS slideshow.
  3. Want the lowest file size that your system can encode while staying at 62.5 FPS? Well, how fast is your system? And not just the CPU – maybe your system is actually bottlenecked by I/O and writing a large uncompressed BMP file takes much longer than encoding it into WebP and writing the resulting smaller file.

The latter two use cases would be covered by automatic detection of the maximum effort value that encodes within a given number of frames. The problem, however, is that encoding times are always relative to the complexity of the image. Once we're in-game and have lots of bullets and lasers, any choice that might have been appropriate for the main menu might suddenly start dropping frames after all. Thus, we can't solve this with an upfront benchmark, but have to dynamically adapt to the complexity of the current game scene. But then the whole idea falls apart as we can't possibly treat the configurable allowed screenshot time as a hard limit. To figure out whether it's safe to raise the effort level again, there's no way around periodically exceeding that limit and thus dropping more frames after all.
The ideal solution would involve deep hooks into the WebP encoder that could dynamically adjust the compression algorithms depending on the remaining time in the current frame. An image compressor with real-time guarantees… sure sounds like an interesting research project.

In the end, letting players choose a fixed format and effort level remains the best option. However, they can only make an informed choice if they know the performance of all options relative to each other. And that's how we arrive at this new submenu:

These specific numbers I got on my now almost 7-year-old Intel Core i5 8400T are very peculiar. -z 0 gets quite close to the 16 ms we have per frame, but would still be too slow to reliably compress every gameplay situation without dropping frames. A 64-bit build would speed up -z 0 by 10%, -z 2 through -z 7 by 25%, -z 8 by 210% (!), and -z 9 by 60%. Linux users already enjoy these higher speeds, and the Windows build is just a few compiler settings away from matching them. 📝 Last time, the bitness argument was a lot more balanced, but WebP encoding performance presents the first compelling reason for going 64-bit.
Or we could always go multi-threaded, which already is a much more popular idea within the Seihou development Discord group.
Or I could investigate PNG after all to find out how exactly its encoding speed compares to WebP… :thonk:

But then, Ember2528 posted the encoding times he got on his new Ryzen 9 9950X3D:


Finally, you probably already noticed another small change in this build: The ReC98 push ID is now shown in the bottom-right corner of the title screen image, just below the original game version number. This was the one part of replay preparations that I wanted to get in sooner rather than later. Since the game binary and the data files can be updated or modded independently from each other, I'm going to tag future replays with both of their respective versions to guarantee reproducibility. Of course, newer builds should never introduce bugs that affect gameplay and desynchronize existing replays. But if they ever do, the included push ID allows hosting sites to remove any replays recorded on such a broken build from the official competition tier associated with a specific data file version.
As for rendering the push ID, it should obviously look similar to the VERSION 1.005 text above. We can find these glyphs in GRAPH.DAT file #0, but this particular text is actually baked into the main menu's background image, which explains why the decimal point glyph isn't part of that data file. The glyphs for 0-9 are also used in-game for the score popups, but the A-Z glyphs remain unused – so unused, in fact, that pbg didn't even leave any reference to them in the source code:

The unused 5×5 uppercase gradient font in GRAPH.DAT file #0

This means that the game provides us with all the glyphs we would need to display the ReC98 push ID. However:

So, all the glyphs next to the BUILD label actually come from the TrueType text renderer. The non-slashed zeroes immediately give this away, but exactly emulating the color gradient of the 0-9 glyphs makes MS Gothic blend in very well regardless:

Screenshot of the bottom-right corner of Shuusou Gyoku's title screen in the P0309 build, showing the new ReC98 build tag below the version number baked into the original title screen image

And that's all I've got for these very packed three pushes! In exchange, I'll reserve the next Shuusou Gyoku push for another round of maintenance and forward compatibility.
The new builds:

Next up: The long-awaited Windows 98 backport of our Shuusou Gyoku build! This has been in development for quite a while, so this should now be a matter of days rather than weeks.

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous], iruleatgames
🏷️ Tags:

Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.

But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!

  1. The Music Room background masking effect
  2. The GRCG's plane disabling flags
  3. Text color restrictions
  4. The entire messy rest of the Music Room code
  5. TH04's partially consistent congratulation picture on Easy Mode
  6. TH02's boss position and damage variables

As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.

The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:

TH02's Music Room background in its on-disk state TH03's Music Room background in its on-disk state TH04's Music Room background in its on-disk state TH05's Music Room background in its on-disk state
Let's call this "the blank image".

That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right? :thonk:
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:

TH02's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH03's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH04's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH05's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom
The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.

On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.
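In heavily simplified Borland-style C++, the nopoly_B mechanism amounts to this sketch:

#include <dos.h> // MK_FP()
#include <mem.h> // _fmemcpy()

#define PLANE_SIZE ((640 / 8) * 400)

// The first (blue) bitplane lives at segment 0xA800.
static unsigned char far *const plane_B =
	(unsigned char far *)MK_FP(0xA800, 0x0000);
static unsigned char far nopoly_B[PLANE_SIZE];

// Once, right after blitting the .PI image to VRAM: capture the plane's
// original state, including the caption's 1 bits.
void nopoly_capture(void)
{
	_fmemcpy(nopoly_B, plane_B, PLANE_SIZE);
}

// At the start of every frame: reblit that copy, which clears the
// previous frame's polygons and restores the caption in one go.
void nopoly_restore(void)
{
	_fmemcpy(plane_B, nopoly_B, PLANE_SIZE);
}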


Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:

// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);

// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);

// Regular GRCG blitting code to any VRAM plane segment…
pokeb(0xA800, offset, …);

// We're done, turn off the GRCG.
outportb(0x7C, 0x00);

This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck?
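In code, the difference between the two versions of such a blitting function is a single operator:

void put_span_grcg(unsigned char far *vram, unsigned offset, unsigned char mask)
{
	// With the GRCG in RMW mode, the written byte is only a mask that
	// selects pixels: the chip merges its tile register into the other
	// pixels of the VRAM byte. A plain assignment (MOV) is correct.
	vram[offset] = mask;
}

void put_span_no_grcg(unsigned char far *vram, unsigned offset, unsigned char mask)
{
	// Without the GRCG, that same MOV would treat the mask as a literal
	// dot pattern and overwrite the byte's other pixels, so the function
	// has to merge them itself (OR).
	vram[offset] |= mask;
}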

Three overlapping Music Room polygons rendered using master.lib's grcg_polygon_c() function with a disabled GRCG
Three overlapping Music Room polygons rendered as in the original game, with the GRCG enabled
An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.

As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:

The contents of nopoly_B with each game's first track selected.

Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:

And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier… :onricdennat:


For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, Try to Normal Rank!! could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".

With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌


The remaining 1/6th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.

Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there! Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.
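Or, as a sketch with hypothetical function names:

// Positions indexed with the consistent stone ID described above:
// 0–1 = inner, 2–3 = outer, 4 = center.
extern int stone_x[5];
extern int stone_y[5];

// Every phase control function then only iterates over the contiguous
// ID subrange of the stones that are active during that phase:
void stones_update(int first, int last)
{
	for (int i = first; i <= last; i++) {
		// fire the current pattern from (stone_x[i], stone_y[i]) …
	}
}

// Phase 1: stones_update(0, 1); – the two inner stones
// Phase 2: stones_update(2, 3); – the two outer stones
// Phase 3: stones_update(4, 4); – the center stone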

Screenshot of TH02's Five Magic Stones, with the first two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the second two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the last one (both internally and in the order you fight them in) alive and activated

This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!

These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just 1/6th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.

Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.

📝 Posted:
🏷️ Tags:

Turns out I was not quite done with the TH01 Anniversary Edition yet. You might have noticed some white streaks at the beginning of Sariel's second form, which are in fact a bug that I accidentally added to the initial release. :tannedcirno:
These can be traced back to a quirk I wasn't aware of, and hadn't documented so far. When defeating Sariel's first form during a pattern that spawns pellets, it's likely for the second form to start with additional pellets that resemble the previous pattern, but come out of seemingly nowhere. This shouldn't really happen if you look at the code: Nothing outside the typical pattern code spawns new pellets, and all existing ones are reset before the form transition…

Except if they're currently showing the 10-frame delay cloud animation, activated for all pellets during the symmetrical radial 2-ring pattern in Phase 2 and left activated for the rest of the fight. These pellets will continue their animation after the transition to the second form, and turn into regular pellets you have to dodge once their animation has completed.

By itself, this is just one more quirk to keep in mind during refactoring. It only turned into a bug in the Anniversary Edition because the game tracks the number of living pellets in a separate counter variable. After resetting all pellets, this counter is simply set to 0, regardless of any delay cloud pellets that may still be alive, and it's merely incremented or decremented when pellets are spawned or leave the playfield. :zunpet:
In the original game, this counter is only used as an optimization to skip spawning new pellets once the cap is reached. But with batched EGC-accelerated unblitting, it also makes sense to skip the rather costly setup and shutdown of the EGC if no pellets are active anyway. Except if the counter you use to check for that case can be 0 even if there are pellets alive, which consequently don't get unblitted… :onricdennat:
There is an optimal fix though: Instead of unconditionally resetting the living pellet counter to 0, we decrement it for every pellet that does get reset. This preserves the quirk and gives us a consistently correct counter, allowing us to still skip every unnecessary loop over the pellet array.
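In simplified code, with hypothetical names:

struct Pellet {
	bool alive;
	bool in_delay_cloud;
	// …
};

constexpr int PELLET_CAP = 100; // hypothetical cap

void pellets_reset(Pellet (&pellets)[PELLET_CAP], int& pellet_count)
{
	for (auto& p : pellets) {
		// Delay cloud pellets survive the reset – that's the quirk we
		// want to preserve.
		if (p.alive && !p.in_delay_cloud) {
			p.alive = false;

			// The fix: decrement for every pellet that does get reset,
			// rather than a blanket `pellet_count = 0;` after the loop.
			// The counter stays consistent with any surviving delay
			// cloud pellets.
			pellet_count--;
		}
	}
}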

Cutting out the lengthy defeat animation makes it easier to see where the additional pellets come from. Also, note how regular unblitting resumes once the first pellet gets clipped at the top of the playfield – the living pellet counter then gets decremented to -1, and who uses <= rather than == on a seemingly unsigned counter, right?

Ultimately, this was a harmless bug that didn't affect gameplay, but it's still something that players would have probably reported a few more times. So here's a free bugfix:

:th01: TH01 Anniversary Edition, version P0234-1 2023-03-14-th01-anniv.zip

Thanks to mu021 for reporting this issue and providing helpful videos to identify the cause!