P0327
TH03 RE (Character-specific attack function pointers / Bullet dependencies / Bullet structure)
P0328
TH03 decompilation (Bullets, part 1)
P0329
TH03 decompilation (Bullets, part 2) / Twitter→Fediverse migration, part 1
💰 Funded by:
[Anonymous], Ember2528
🏷️ Tags:
Alright! I've been announcing a big look at TH03's in-game systems all throughout 2025, and I technically still made it before the end of the year. TH03's enemy, fireball, and explosion systems are a great fit for this occasion: They fulfill both of the netplay-relevant criteria I mentioned 📝 at the end of the previous blog post, but also unfortunately share the same structure and overload some of their fields with vastly different meanings, much like 📝 TH04's and TH05's custom entities. Hence, they will take rather long to untangle, which ensures that the resulting look will be appropriately big.
Until I noticed that explosions spawn bullets in a not-so-straightforward way that basically requires complete knowledge of the bullet system. What a great discovery to make 2½ pushes into development… Oh well, we have the budget, and bullets also happen to match our netplay-relevant criteria, so let's get those done first.
As usual, we'd also like to identify and name all character-specific functions in the ASM code so that we can immediately correlate certain interesting features of the bullet system with the characters and attacks that use them. In TH03, this is particularly worthwhile because it's all we need for a 100% complete overview of how bullets are used. Apart from the transferred pellets fired from exploding enemies outside of Gauge or Boss Attacks, every bullet pattern in the game is part of such a hardcoded and character-specific Extra, Gauge, or Boss Attack, since enemy scripts cannot fire bullets in this game.
This quick look also showed how ZUN implemented the 9 characters in a highly consistent manner. Gauge Attacks in particular follow a predictable convention:
The "Level 2" attack (available at 50% gauge and consuming 25% gauge) always fires 8×8 pellets.
The "Level 3" attack (available at 75% gauge and consuming 50% gauge) always fires 16×16 bullets.
The game only provides a single function pointer for each of the two levels, which gets called as part of game logic and before the game starts rendering the current frame. With no room for custom rendering calls, characters can only define these attacks as patterns that are made up of common entities.
The funny part in all of this: All characters follow these conventions, yet ZUN still architected TH03 as if they don't. Each of the two Gauge Attack levels gets a separate per-player function pointer, but every character just uses these two functions to call a single common function with a flag that indicates Level 2 or Level 3. This common function then follows a similar structure for all 9 characters as well. The same trend continues with the Boss Attacks, where we find 9 copies of the more or less unchanged update and rendering boilerplate… so yeah, ZUN basically copy-pasted the same code 9 times with minor variations.
And now we can be very hyped for the future of TH03 decompilation. Lots of duplicates of the same functionality means that I'll basically only have to decompile them once, which means that TH03 decompilation is very likely to progress very quickly in terms of absolute numbers once I get the basic gameplay systems done. And there's not a lot of that code left either: After this delivery, we're left with a mere 128 undecompiled foundational functions that are not related to any specific character. After the next delivery, that number will drop to 95. I'm expecting a return to the glorious days of 2020, where the 3 copies of TH01's foundational graphics code allowed me to decompile 10% of its entire code within 2½weeks. With 9 copies tripling that speed, we may even get to finish this game next year?
Onto bullets then! As you'd expect, TH03's bullet system is based on 📝 TH02's system we looked at earlier this year, which in turn was based on TH01's system. In some respects, it's a minor iteration of TH02's system adapted to the new features in TH03, but some of these new features also form the missing link between TH02 and 📝 TH04. The high-level overview:
Like most entities in TH03, bullets are stored in a single array that is shared between both players. Each bullet has a structure field that denotes which playfield it is moving on and constrained to.
The total bullet cap shared among both players is 320, slightly more than twice the 150-bullet cap we saw in TH02.
Just like in TH02, this system covers both 8×8 pellets and 16×16 sprite bullets. The former are once again hardcoded and rendered using the GRCG, while the latter are rendered by SPRITE16.
The system defines a default set of four adjacent 16×16 bullet sprites, starting at (64, 0) within 📝 SPRITE16's sprite area. These can be used in two ways:
Four animation frames for a single bullet type, animated at the maximum speed of 1 cel per frame. This is how they are used by most characters.
Alternatively, they can represent one non-animated bullet type and three trail sprites, as seen in Mima's and Chiyuri's Boss Attacks.
Patterns can override the default 16×16 sprites with an arbitrary other set of four adjacent sprites, but this feature is only used in Kana's Extra Attack.
Character
Sprites
Reimu
Mima
Marisa
Ellen
Kotohime
Kana
Rikako
Chiyuri
Yumemi
The SPRITE16 area of certain characters might contain other bullet-like sprites, but these are used by a different gameplay system than the one described in this post.
I've reduced the animation speed to 1/8 its original length because 18 ms would look very obnoxious in the context of a web page. Run webpmux -duration 18ms on the animated WebP files to restore the original speed.
And yes, Yumemi's sprite is animated very subtly. Click the animated sprites for the raw sprite sheet.
As we already found out 📝 in 2022, both 8×8 pellets and 16×16 bullets have the same "hitbox" – a single 2×2-pixel tile in the game's collision bitmap that gets compared against the 8×8 square surrounding the player's center.
Delay clouds are back after their absence in TH02. They are still limited to pellets, though.
The 📝 bullet template makes its debut in this game, replacing TH01's and TH02's spawn function parameters with a single piece of global data. This introduces the usual trade-offs with this sort of thing: Code size savings in patterns that spawn multiple groups with minor variations to their parameters, in exchange for the usual confusion that comes with widely mutated global state. As a result, code quality suffers greatly, especially when it comes to the derived transfer pellets fired within the bullet system itself. Sure, there are lots of patterns where retained state comes in handy, but local per-pattern template instances would have solved that as well.
This decline in code quality also extends to the rest of the logic code. Continuing his general trend of micro-optimizing based purely on vibes we've seen 📝 time and 📝 time again, ZUN wrote a significant part of TH03's bullet logic in ASM, especially in the update function. And once again, it's the same tragic conclusion: While TH04's and TH05's full-on ASM approach might later bring some measurable runtime benefits to bullet logic via self-modifying code and whatnot, TH03 is left with pretty much only the downsides of its partial ASM approach. If ZUN just wrote idiomatic C++ code without any optimization tricks and inlined just one function, the whole bullet update code would have been 87 lines of C++ shorter and exactly as large when compiled. And that's with all the redundant code still in place! TH02's not-great-but-passable implementation of bullets indeed marked the high point for the PC-98 series, and it only went downhill from there.
Three features of the bullet system deserve a deeper look:
Trail sprites
These work by remembering the last 6 positions of a bullet and rendering the sprites at the 2nd, 4th, and 6th position, respectively:
Obviously, this requires (6 ×
2 ×
2) =
24 additional bytes per bullet. Adding these to the regular bullet structure would waste
(24 × 320) = 7,680 bytes of conventional RAM, which would in no way be justified for a feature that ZUN used in a grand total of two patterns. For once, ZUN agreed, and instead provided a single ring buffer that can hold these 6 additional positions for up to 48 bullets. This allowed ZUN to reduce the per-bullet cost to 3 bytes: 1 byte for the has trail flag, and 2 bytes for a near pointer into the ring buffer. That's still two more bytes than absolutely needed, and the debloated branch will definitely free up these wasted 640 bytes for portability reasons alone.
This trail sprite cap of 48 seems a bit random at first. Unlike the regular bullet cap that the game enforces by just not spawning any new bullets if all 320 slots are occupied, the trail sprite cap is not enforced or even just checked in any way. Due to the circular nature of the buffer, the 49th simultaneously active bullet with a trail sprite will then share its position memory with the 1st trail sprite bullet, leading to one additional position memory update per frame and trail sprites appearing in wrong positions.
Thus, it's the game design's responsibility to make minimal use of trail sprites to avoid these glitches. On the surface, it certainly looks as if ZUN was careful here:
The cap happens to exactly match the 48 bullets fired as part of the ring group in Chiyuri's pattern, which is definitely the more bullet-intensive pattern of the two.
Both trail-using patterns are part of Boss Attacks, and only a single player's Boss Attack can be active at any given time.
The ring groups in Chiyuri's pattern move fast and are separated by a 96-frame delay, as captured in the video above. By the time Chiyuri spawns the next group, every bullet of the previous one should have long been removed due to flying past the edges of the playfield.
Mima fires her alternating 5- and 4-spreads at a much shorter interval, but that interval is still long enough to never leave more than 5 of these 9-bullet subpatterns on screen at once.
Until you test Chiyuri's pattern on the (📝 announced) Boss Attack level 1 and notice that the bullets move slow enough for 2) to no longer apply. The result:
Missing bullets on every group beyond the first, and even sporadic trail sprites at mixed-up X and Y coordinates. This nicely demonstrates how these trail sprites are not just cosmetic, but also take control of clipping and affect gameplay as a result. For obvious optical reasons, trail sprite bullets will only get removed after the 6th remembered position lies outside of the clipping area – i.e., 6 frames later than bullets without trail sprites. If the remembered positions are then shared with a second bullet, the game would also clip that bullet if the clipping condition of the first one is met – regardless of the fact that the second bullet's main sprite might be nowhere close to the boundaries of the playfield. This clipping then either happens on the same frame if the second bullet's slot number within the 320-element bullet array is higher than the slot number of the first one, or on the next frame if the second bullet's slot number is lower.
The mixed-up X and Y positions on frames 139 and 235 can also be explained by clipping. The update function processes the X and Y coordinates independently from each other: It starts with the horizontal clipping checks, updates the position memory for the X coordinate, and then repeats both steps for the Y coordinate, immediately removing the bullet and moving on to the next one if it failed the respective clipping check. If the vertical clipping checks fail in a situation where two bullets share the same position memory, you'll end up with a mismatched X/Y pair where X comes from a clipped bullet and Y comes from an active one… for a single frame, until the same clipping check is applied to the other bullet and removes it as well. Hence, this is the only "fixable" bug in the bullet system that won't affect gameplay, as the mixed-up positions are unrelated to the result of the clipping condition that ultimately removes both bullets.
It's rare for Chiyuri's Boss Attack to launch the same pattern multiple times in a row, and once the (announced) Boss Attack level is ≥3, bullets already move fast enough to prevent this quirk from happening. But it's definitely possible to run into it during regular gameplay.
For a clearer and more extreme demonstration of the resulting glitches, let's turn Mima's trail sprite pattern into a 64-ring:
Explaining every single quirk in this hypothetical video is left as an exercise to the reader.
Rings and other groups
If there's one aspect where TH03's bullet system shows its TH02 lineage most clearly, it's the set of predefined bullet groups. The 2-, 3-, 4-, and 5-spreads with fixed narrow, medium, and wide arc angles, as well as the multi-bullet groups with randomized angles and speeds, are not only available in TH03 once again, but reuse the exact same code from TH02.
Instead, TH03's main innovation can be found in its ring system. Rings can now have any number of bullets between 0 and 255, and are no longer limited to the first six powers of 2. This allowed ZUN to fine-tune most ring groups based on the Gauge or Boss Attack level, and to also just have a few static ring patterns with non-power-of-two bullet counts. Chiyuri's aforementioned 48-ring trail sprite pattern falls in this category, and the rotating 5-ring pattern seen in Kana's Boss Attack is another example.
And yes, storing the number of ring bullets in a regular uint8_t field now also allows patterns to spawn 0-rings. And sure enough, the bug that would later cause 📝 Kurumi's Divide Error crash in TH04 was actually introduced in TH03! The underlying code wasn't modified between the two games, which further proves that TH04's bullet system also traces back to TH03 and wasn't rewritten from scratch, at least concerning this aspect. TH03 just doesn't have any (known) way of triggering the bug in the unmodded original game.
Interestingly, TH02's ring system with predefined power-of-2 bullet counts is still part of TH03, and ZUN does use it for some ring groups in a few Boss Attack patterns. Did he do this because it's shorter than adding a second line of code that sets bullet_template.count? Did he deliberately need to preserve the previous value of bullet_template.count across groups? Or did he code these patterns at an earlier time in development when the arbitrary ring system didn't exist yet? Until I've decompiled every single bullet pattern in this game, we can only guess.
However, ZUN also removed two of TH02's group-related features from TH03:
The eight special motion types have been reduced to a single gravity type. While gravity is now a separate flag in the bullet structure and template that can now be applied to any group, this vast removal of options still severely limits the expressivity of bullet patterns in TH03. This means that every non-gravity bullet in the game moves at a constant velocity.
Gravity is also exclusively used by Kana, in both her Extra attack with the
alternative 16×16 bullet sprites as well as in one certain pattern of her Boss Attack, which demonstrates gravity in combination with a ring group.
One of the rare patterns that arguably looks prettier on Easy, where the slower bullet speeds leave more room for the gravity effect to accelerate the fall.
The auto-stacking system was removed without any direct replacement. With TH03's more 📝 numeric method of defining difficulty, ZUN no longer needed this quick mechanism to 📝 distinguish Easy and Normal from Hard and Lunatic. This was one of the better changes between the two games though; the auto-stacking system added a quite annoying asterisk to the documentation of the random groups that is no longer needed in TH03.
Manually creating stacks is obviously still possible by spawning separate versions of the same group with gradually reduced speed. This might be considered another practical advantage of the global bullet template, since you only need to mutate a single field before calling bullets_add() again. But really, nothing justifies global data.
Transferred pellets
This is the final gameplay feature that deserves its own section. Let's follow the pellet's X coordinate on its way from the spawn point to its destination on the other playfield:
ZUN calculates the destination coordinate on the target playfield as a random Q12.4 X coordinate between 0 and 288.
This subpixel coordinate is translated to screen space, adding either 16 for the left playfield or 336 for the right one. ZUN does this using the regular pixel-space conversion function that is typically used to calculate blitting coordinates, losing subpixel precision in the process and forming a very minor quirk.
The pellet's movement angle is calculated in screen space, aiming a screen-space version of the pellet's origin point at the coordinate from 2).
The screen-space pixel coordinate from 2) is translated back to a Q12.4 subpixel coordinate on the pellet's originating playfield. The result will deliberately lie outside the boundaries of this playfield: For a pellet flying from left to right, it will be between 320 and 608, while it will be between -320 and -32 for a pellet flying from right to left.
The pellet then flies to this out-of-bounds coordinate while internally staying on the playfield it originated on. This means that neither the update nor the rendering code can clip the pellet at the borders of its originating playfield. Once it flew past the border, it only visually appears on the other playfield because that's what the out-of-bounds X coordinate translates to when the renderer converts it to screen space.
Once the pellet's X coordinate has approached or flown past this relative target coordinate from its respective movement direction, the pellet is removed and respawned as a delay cloud.
This shows that the 32-pixel border between the two playfields is not just visual, but an actual part of the simulated game world. We can visualize this by removing the black cells on the text layer:
Also, we need to clear these 32 border pixels in VRAM on every frame to nicely visualize just these pellet transfers. TH03 obviously doesn't do that for performance reasons and lets partially clipped sprites accumulate below the border, 📝 just like the other shmups do.
This video also demonstrates another minor quirk: Transferred pellets are aimed at a random center Y coordinate between 0 and 16, but the subsequent delay cloud is always spawned at center_y = 2.0.
Speaking of, there's also a lot to cover in…
TH03's bullet renderer
And at first, it looks pretty good! TH03 retains the best idea from TH02 and batches rendering into three passes:
16×16 bullets, rendered normally via SPRITE16
32×32 delay clouds, rendered via SPRITE16's monochrome mode
8×8 pellets, rendered from hardcoded sprites via the GRCG
The second pass is skipped if the first pass didn't detect at least a single active delay cloud. However, this skip can only remove 320 out of the 960 iterations over the entire bullet array, every frame. Combine that with the most unlucky allocation of registers, and the resulting instructions end up wasting a low 5-digit number of CPU cycles per frame on a 486 in the worst case of no bullets being active. Same game that wrote large parts of its bullet update function in ASM, by the way.
While that number is still an order of magnitude away from causing significant performance problems, this issue became serious enough in TH04 for ZUN to introduce a display list for at least pellets that would cut down the number of iterations.
And then, we look at…
Pellet rendering
… and are greeted by the single strangest set of hardcoded sprites across all of PC-98 Touhou so far. TH03's pellets are not only the first time we see a doubly-preshifted sprite sheet, but 2 of the 16 variants for the transfer pellet sprites are also shifted incorrectly:
You can see this bug all over the video above, for example in frame 97, 105, 107, 109, 111…
The doubly-preshifted nature of this sprite sheet, on the other hand, raises a whole lot of PC-98 blitting performance questions. This only possibly makes sense as an attempt at optimizing away the unaligned 16-bit VRAM writes you'd naturally run into when shifting an 8-wide sprite to cover two bytes.
Let's look at a regular 8-wide pellet sprite that was singly-preshifted to 16 pixels/bits. If we want to blit such a sprite to a left X position of 12 with the minimum amount of instructions, we would perform a single 2-byte write to VRAM address 0x0001, which itself is not divisible by 2:
On most 16-bit architectures, unaligned memory writes like these are either slower than aligned writes or entirely unsupported. The x86 MOV and MOVS instructions fall into the first category, so it makes sense to think that the GRCG might add a performance penalty of its own on top of the already higher latency of these instructions.
The natural workaround, then, is to add a second set of preshifted sprites to cover the remaining 8 possible start bit positions within a 16-pixel VRAM word. This would expand pellet sprites to a total width of 23 pixels. Understandably, ZUN also wanted to optimize for the low instruction counts, so he had to round up the physical width of the sprite to 32 pixels. Then, every preshifted variant could be blitted with a single MOVSD instruction:
But does this actually matter for the PC-98 and the GRCG? Are unaligned writes actually slow enough to justify writing 2× as much sprite data per frame and hardcoding 4× as many bytes? Unfortunately, I don't know of any hardware-level documentation about the GRCG that would conclusively answer this question. All the usual books and text files are disappointingly surface-level and only document the same programmer interfaces over and over, and hardware researchers are still waiting for EGC and GRCG die shots to even get started.
There are a few signs that this might be a good idea:
Any VRAM-reading EGC operation must use aligned 16-bit accesses, which probably has a deeper reason that goes beyond the size of its internal shift register. And since you activate the EGC by first activating the GRCG in TDW mode…
Neko Project spends the same number of clock cycles on both 8- and 16-bit GRCG writes. The absence of a dedicated 32-bit write handler suggests that real hardware breaks down 32-bit writes into two 16-bit writes, implying that we don't also have to worry about 32-bit alignment of our single MOVSD instruction.
Shouldn't byte access be a given? Clearly, this would only deserve special mention if it wasn't because the previous contents of this book heavily implied some sort of 16-bit nature and I just missed it.
But without documentation or benchmarks, none of this means anything.
This is also why I haven't yet explored this whole field of optimizing VRAM writes for alignment. It would always involve branching to alignment-respecting code similar to how master.lib does it, but code like this is at odds with the more tangible goal of minimizing instruction counts in the generic case. Not to mention that we'll once again have to test this across every PC-98 hardware generation and possibly even GRCG revision if we ever go down to that level of optimization…
But even if alignment matters, ZUN's unconditional MOVSD instructions approach still appears to be slower on average. Consider the optimal 56.25% of cases where the sprite does lie within a single 16-bit word:
8 start positions within the first byte + 1 start position on the second byte = 9/16 = 56.25%. The 9th variant for (x % 16) == 8 wouldn't be part of a regular singly-preshifted sprite sheet where the renderer blits the (x % 8)th = 0th variant. But it would definitely be worth adding if alignment does matter at all.
Keep in mind that we still use the GRCG here, and that it will also have to perform its fast-but-not-entirely-free four-plane Read-Modify-Write operation for the empty sprite bytes 3 and 4. Unconditional 32-bit writes would only be worth it if the GRCG somehow optimizes away empty writes at the microarchitecture level. That assumption is even more of a stretch, because 📝 why would master.lib even check for emptiness if that were true?
In the end, doubly-preshifted sprites slow down 56.25% of all blitting operations in a dubious attempt to speed up the other 43.75%. Unaligned 16-bit writes would have to be really slow to justify this approach – and judging from the fact that TH04 went back to single-byte preshifting, this is not the case. Maybe I'll write a benchmark for this someday, but honestly, this is the least interesting PC-98 benchmark question I've encountered so far. There are slowdown issues at 📝 our performance target of 66 MHz in Neko Project, but pellet sprite alignment is unlikely to significantly contribute to those.
16×16 bullets are simply rendered using standard SPRITE16 calls, nothing special there. That only leaves…
Pellet delay clouds
Just like the 48×48📝 hit circle, these 32×32 sprites are rendered using SPRITE16's single-color render-path, which uses the EGC's GRCG-equivalent mode. Last year, I took a very brief look at this mode and wondered whether this was actually faster than just using the GRCG. 1½ years and 📝 one benchmark won by the EGC later, it certainly seems so, especially since we want to blit these to unaligned X positions. The EGC's hardware-accelerated pixel shifting seems highly preferable once sprite widths exceed 24 pixels and you can't fit a row of pixels in a 32-bit register anymore.
Stepping through SPRITE16 reveals that this GRCG-equivalent mode matches the GRCG even in how it doesn't read monochrome sprite data from VRAM, but from SPRITE16's 1bpp alpha mask buffer in conventional RAM.
But that only raises the question of why you'd want to use SPRITE16 over the raw EGC. It makes sense why SPRITE16 would have this feature; flashing existing sprites in a single color every once in a while is a useful thing to have in a game-focused rendering API. But using this feature for sprites that are only rendered in this monochrome mode just wastes the VRAM that these sprites occupy in SPRITE16's sprite area. You still blit such a sprite by passing a byte offset into the sprite area, which then gets interpreted as an offset into SPRITE16's alpha mask buffer.
If SPRITE16 had a function for directly blitting from a pointer to 1bpp data, ZUN could have freed up quite a bit of VRAM and maybe even added more sprites for character-specific attacks. Conceptually, it makes sense why SPRITE16 would restrict itself to a single sprite source, but it is quite an unfortunate omission, I'd say.
13,568 pixels, to be exact. And yeah, you could technically overwrite the affected portions of VRAM after generating alpha masks via INT 42h, AH=01h. But since SPRITE16 only stores one such alpha mask buffer, you still couldn't reuse this space for other SPRITE16 sprites.
And that was the last PC-98 Touhou bullet system we were still missing! But at a little over 2 pushes, I have to find something else to do to round out the third one… wait, what about that one incident?
Migrating away from Twitter
On November 6, Twitter was hit by an automated ban wave that suspended all accounts that were using the OldTweetDeck extension. After Twitter discontinued the official free TweetDeck frontend on 2023-08-17, I quickly switched to OldTweetDeck – not just because it was free, but because it supported multiple accounts and was simply more performant than X's own premium offering at the time. In return, I gladly paid my €10 a month to dimden instead, who deserved it much more for continuously updating OldTweetDeck to all of Twitter's API changes over the years. It's very impressive how he kept it running for 2⅓ years without any such critical issues and still keeps maintaining it to this day.
Aside from Touhou Patch Center and all of my accounts, the ban wave affected enough people that Twitter decided to gradually revert it a day later. But without any public postmortem or excuse, this feels more like an act of gratitude that we shouldn't take for granted.
Ever since Elon took over, the Internet has been full of sensationalist doomposts about Twitter's imminent downfall any moment now OMG. For the longest time, I could ignore all these pundits because nothing of what they were complaining about was affecting my little corner. But sudden account suspensions are an existential threat to my business, and finally provided the first actual technical and non-political argument to get my data off Twitter in the medium term. I've put too much effort into all of the content there to let it be exclusively controlled by any one company.
Hilariously, things only got worse from there. Until two days ago, Twitter's data download option was inaccessible due to an infinite redirection bug. Call it malice or incompetence, but leaving such an issue unfixed for weeks is a definite sign of a platform in decline. Thus, I had to run the import on an older archive I happened to request on 2023-07-02.
And then I looked inside that archive and noticed that it was missing at least three key pieces of data that Twitter demonstrably stores for my account:
Poll options
Alt text for images. (Also known as the actually most annoying and time-consuming part of every tweet if you actually want to properly explain an image with all its context and implications. AI won't help with that as long as its context window doesn't span every piece of knowledge related to this project. 🤷)
The original version of each uploaded image, which they do have for a fact because it's shown in the /status view. The archive only contains the processed versions shown in the timeline, which were resized to at most 1200 pixels along their larger dimension and which may or may not have been converted to JPEG based on rules I didn't bother to reverse-engineer.
That's a rather selective interpretation of Art. 20 GDPR. If the argument is that you can just scrape that data out of the HTML yourself, why are they even bothering with sending me anything more than a nested list of tweet IDs, then? 🤨 Someone with more time and care could probably turn this into a lawsuit…
Presenting all media in its original quality is one of the more important reasons for moving to a self-hosted service as far as I'm concerned, especially since mainstream media conversion pipelines are infamous for destroying pixel art. So I went through my hard drives and replaced Twitter's images with the original versions of all 167 non-retweeted images I had uploaded to Twitter until July 2023. The videos also desperately needed to be replaced with their original AV1 versions; Twitter's enforced x264 YUV420P format has been the single worst implementation detail of that entire platform…
You can backdate posts by modifying their creation time, but Bluesky's crawlers will also record the indexed time when they first saw each post on the network. Unfortunately, the bsky.app frontend that everyone uses will then present this indexed time as a post's main timestamp, demoting your intended creation time to an archival time that Bluesky can't confirm the authenticity of:
The PDS database schema does track an indexedAt timestamp in addition to the createdAt timestamp you specify during the import, but indexedAt might as well not exist because it doesn't seem to be used anywhere.
There is a PR that would slightly improve the UI in this case, but it's been languishing unmerged throughout most of 2025. Probably because it has to be merged by the same people who came up with the current UI in the first place, and who prioritized resilience against pranks and disinformation campaigns.
But even if the UI is fixed, these imports would spam the timeline of everyone who follows the existing Bluesky account that we obviously want to import into.
Figuring out and confirming that first issue required remote debugging of the PDS server written in Node.js. Visual Studio Code's LSP quickly ran up against my server's low amount of RAM, which forced me to upgrade my server just to efficiently navigate through the source code…
Typical Node.js criticisms aside, the architecture of the PDS server is quite bizarre. A whole lot of the apparent API surface is never directly called, but generically proxied to some other node in the AT Protocol network at the byte level. If you log into your PDS via bsky.app, it seems as if the AppView calls API endpoints like /xrpc/app.bsky.unspecced.getPostThreadV2 on your PDS, but good luck meaningfully intercepting any of these requests, or even just getting your debugger to break on them.
Together with lots of bulky API schema descriptions in the form of lexicons, all this XRPC code makes up a big proportion of the code in the @atproto/pds package. But for… what exactly? Why would the PDS server need a thick layer of type safety and validation for payloads it doesn't look at, and that the relays will have to verify anyway? Why do they install all this dead code that will confuse most people who are trying to understand this system?
In the end, we just can't thoroughly backdate our imported posts because the crawl timestamps are set by the relays, whose code we have no control over. Now, I could ignore all these issues and still upload some sort of full archive to the platform that now houses 1/6 of my following, but this just doesn't match the quality I expect from the canonical, definitive source of my short-form news posts.
Trying Mastodon
That leaves the Fediverse as the only remaining alternative for a service where people can still follow, like, and repost my content using relatively commonly used clients. Among the various ActivityPub implementations, Misskey is particularly popular among the Japanese Touhou community, but I've only heard bad things about its resource usage. Mastodon isn't the most lightweight option either – as aptly implied by its name – but you can make the argument that it's become the default option across the Fediverse over the years. Thus, there'll be at least a slight chance that people will be familiar with the web UI of what I'm about to self-host.
Too bad that I didn't even get through the first page of the setup guide before being stumped by obscure asset precompilation errors that apparently no one else has ever faced. In a way, it's commendable that a project would exclusively explain a bare-metal from-source setup in the Docker-dominated DevOps seascape of 2025. But why would you want to do this for a project that requires servers to be infested with npm and Postgres and a bleeding-edge self-compiled version of Ruby and several -dev packages for C dependencies of certain Ruby gems? Unsurprisingly, Japanese Python behaves just like Dutch Ruby in how the community effectively treats every minor version as a major version because there are no adults left in the room to put all the children and Ph.Ds in their place…
Fortunately, ActivityPub is relatively simple to implement and there are plenty of existing servers that are better suited to the kind of PR channel I'm actually looking for. After a very quick search, I settled on…
GoToSocial
…which immediately impresses in pretty much every single area:
After the two previous bloatfests, it's very refreshing to see a single binary next to a bunch of static assets. Sure, 87.4 MiB is certainly way more bulky than necessary, but still much smaller than either of our two competitors.
The documentation is extremely well-organized and polished, especially for a project that's on version 0.20.2.
I can write Markdown in posts!
WebP and AV1? Just work too, without any attempt to convert the main image or video that gets attached to a post. Sure, the thumbnailer does convert images, but that's way less critical…
…and you can effectively bypass it by passing some five-digit size to media-thumb-max-pixels.
The whole thing works exactly like a lightweight server should work: A single binary serving posts from a SQLite database and media attachments from static files lying next to it. With easy access to every piece of data, fixing typos and import errors after the fact is trivial. Applying these might need a server restart for caching reasons, but they're immediately reflected in whatever app is accessing the data.
Drawbacks? The database schema is highly redundant, poster image conversion for videos results in weirdly green images for every one of my AV1 source files, and the paginated timeline view could use just a few more navigation options and customizability. Other seemingly missing features like posting and search are handled by third-party clients like the very admirable Pinafore. And except for the first issue, these are all relatively minor, and I might even fix them myself one day. That's how you get new contributors to your free software project.
And just in case you ever want to import a Twitter archive onto a GoToSocial instance, here is the no-nonsense importer I used:
So if you've got an account on Misskey, Mastodon, or another ActivityPub server, please follow @rec98@nmlgc.net. I'll keep posting everything to both Twitter and Bluesky for the time being, but will no longer advertise either of them. If they ever go down, I'll make no attempt at restoring them.
And that was 2025! It surely brought lots of words, breaking even last year's record by an additional 37% of blog post content. 😮 Here's to 2026 bringing more of the actual reverse-engineering we've been sorely lacking with all the modding and porting projects over the past few years. And with at least four TH03 gameplay pushes queued up, things are already looking quite promising…
Next up: Enemies! Formation scripts! Fireballs! Explosions! Combos, or at least the first part of them! And a slightly more common glitch that players have been wondering about for many years…
Remember when ReC98 was about researching the PC-98 Touhou games? After over half a year, we're finally back with some actual RE and decompilation work. The 📝 build system improvement break was definitely worth it though, the new system is a pure joy to use and injected some newfound excitement into day-to-day development.
And what game would be better suited for this occasion than TH03, which currently has the highest number of individual backers interested in it. Funding the full decompilation of TH03's OP.EXE is the clearest signal you can send me that 📝 you want your future TH03 netplay to be as seamlessly integrated and user-friendly as possible. We're just two menu screens away from reaching that goal anyway, and the character selection screen fits nicely into a single push.
The code of a menu typically starts with loading all its graphics, and TH03's character selection already stands out in that regard due to the sheer amount of image data it involves. Each of the game's 9 selectable characters comes with
a 192×192-pixel portrait (??SL.CD2),
a 32×44-pixel pictogram describing her Extra Attack (in SLEX.CD2), and
a 128×16-pixel image of her name (in CHNAME.BFT). While this image just consists of regular boldfaced versions of font ROM glyphs that the game could just render procedurally, pre-rendering these names and keeping them around in memory does make sense for performance reasons, as we're soon going to see. What doesn't make sense, though, is the fact that this is a 16-color BFNT image instead of a monochrome one, wasting both memory and rendering time.
Luckily, ZUN was sane enough to draw each character's stats programmatically. If you've ever looked through this game's data, you might have wondered where the game stores the sprite for an individual stat star. There's SLWIN.CDG, but that file just contains a full stat window with five stars in all three rows. And sure enough, ZUN renders each character's stats not by blitting sprites, but by painting (5 - value) yellow rectangles over the existing stars in that image.
The only stat-related image you will find as part of the game files. The number of stat stars per character is hardcoded and not based on any other internal constant we know about.
Together with the EXTRA🎔 window and the question mark portrait for Story Mode, all of this sums up to 255,216 bytes of image data across 14 files. You could remove the unnecessary alpha plane from SLEX.CD2 (-1,584 bytes) or store CHNAME.BFT in a 1-bit format (-6,912 bytes), but using 3.3% less memory barely makes a difference in the grand scheme of things.
From the code, we can assume that loading such an amount of data all at once would have led to a noticeable pause on the game's target PC-98 models. The obvious alternative would be to just start out with the initially visible images and lazy-load the data for other characters as the cursors move through the menu, but the resulting mini-latencies would have been bound to cause minor frame drops as well. Instead, ZUN opted for a rather creative solution: By segmenting the loading process into four parts and moving three of these parts ahead into the main menu, we instead get four smaller latencies in places where they don't stick out as much, if at all:
The loading process starts at the logo animation, with Ellen's, Kotohime's, and Kana's portraits getting loaded after the 東方夢時空 letters finished sliding in. Why ZUN chose to start with characters #3, #4, and #5 is anyone's guess.
Reimu's, Mima's, and Marisa's portraits as well as all 9 EXTRA🎔 attack pictograms are loaded at the end of the flash animation once the full title image is shown on screen and before the game is waiting for the player to press a key.
The stat and EXTRA🎔 windows are loaded at the end of the main menu's slide-in animation… together with the question mark portrait for Story Mode, even though the player might not actually want to play Story Mode.
Finally, the game loads Rikako's, Chiyuri's, and Yumemi's portraits after it cleared VRAM upon entering the Select screen, regardless of whether the latter two are even unlocked.
I don't like how ZUN implemented this split by using three separately named standalone functions with their own copy-pasted character loop, and the load calls for specific files could have also been arranged in a more optimal order. But otherwise, this has all the ingredients of good-code. As usual, though, ZUN then definitively ruins it all by counteracting the intended latency hiding with… deliberately added latency frames:
The entire initialization process of the character selection screen, including Step #4 of image loading, is enforced to take at least 30 frames, with the count starting before the switch to the Selection theme. Presumably, this is meant to give the player enough time to release the Z key that entered this menu, because holding it would immediately select Reimu (in Story mode) or the previously selected 1P character (in VS modes) on the very first frame. But this is a workaround at best – and a completely unnecessary one at that, given that regular navigation in this menu already needs to lock keys until they're released. In the end, you can still auto-select the default choice by just not releasing the Z key.
And if that wasn't enough, the 1P vs. 2P variant of the menu adds 16 more frames of startup delay on top.
Sure, maybe loading the fourth part's 69,120 bytes from a highly fragmented hard drive might have even taken longer than 30 frames on a period-correct PC-98, but the point still stands that these delays don't solve the problem they are supposed to solve.
But the unquestionable main attraction of this menu is its fancy background animation. Mathematically, it consists of Lissajous curves with a twist: Instead of calculating each point as
x = sin((fx·t)+ẟx)y = sin((fy·t)+ẟy), TH03 effectively calculates its points as
x = cos(fx·((t+ẟx) % 0xFF))y = sin(fy·((t+ẟy) % 0xFF)), due to t and ẟ being 📝 8-bit angles. Since the result of the addition remains 8-bit as well, it can and will regularly overflow before the frequency scaling factors fx and fy are applied, thus leading to sudden jumps between both ends of the 8-bit value range. The combination of this overflow and the gradual changes to fx and fy create all these interesting splits along the 360° of the curve:
At a high level, there really is just one big curve and one small curve, plus an array of trailing curves that approximate motion blur by subtracting from ẟx and ẟy.
In a rather unusual display of mathematical purity, ZUN fully re-calculates all variables and every point on every frame from just the single byte of state that indicates the current time within the animation's 128-frame cycle. However, that beauty is quickly tarnished by the sheer cost of fully recalculating these curves every frame:
In total, the effect calculates, clips, and plots 16 curves: 2 main ones, with up to 7×2 = 14 darker trailing curves.
Each of these curves is made up of the 256 maximum possible points you can get with 8-bit angles, giving us 4,096 points in total.
Each of these points takes at least 333 cycles on a 486 if it passes all clipping checks, not including VRAM latencies or the performance impact of the 📝 GRCG's RMW mode.
Due to the larger curve's diameter of 440 pixels, a few of the points at its edges are needlessly calculated only to then be discarded by the clipping checks as they don't fit within the 400 VRAM rows. Still, >1.3 million cycles for a single frame remains a reasonable ballpark assumption.
This is decidedly more than the 1.17 million cycles we have between each VSync on the game's target 66 MHz CPUs. So it's not surprising that this effect is not rendered at 56.4 FPS, but instead drops the frame rate of the entire menu by targeting a hardcoded 1 frame per 3 VSync interrupts, or 18.8 FPS. Accordingly, I reduced the frame rate of the video above to represent the actual animation cycle as cleanly as possible.
Apparently, ZUN also tested the game on the 33 MHz PC-98 model that he targeted with TH01, and realized that 4,096 points were way too much even at 18.8 FPS. So he also added a mechanism that decrements the number of trailing curves if the last frame took ≥5 VSync interrupts, down to a minimum of only a single extra curve. You can see this in action by underclocking the CPU in your Neko Project fork of choice.
But were any of these measures really necessary? Couldn't ZUN just have allocated a 12 KiB ring buffer to keep the coordinates of previous curves, thus reducing per-frame calculations to just 512 points? Well, he could have, but we now can't use such a buffer to optimize the original animation. The 8-bit main angle offset/animation cycle variable advances by 0x02 every frame, but some of the trailing curves subtract odd numbers from this variable and thus fall between two frames of the main curves.
So let's shelve the idea of high-level algorithmic optimizations. In this particular case though, even micro-optimizations can have massive benefits. The sheer number of points magnifies the performance impact of every suboptimal code generation decision within the inner point loop:
Frequency scaling works by multiplying the 8-bit angles with a fixed-point Q8.8 factor. The result is then scaled back to regular integers via… two divisions by 256 rather than two bitshifts? That's another ≥46 cycles where ≥10 would have sufficed. Edit (2025-08-29): The initial version of this post miscounted the number of required cycles as ≥4, or 2× the cycle count of a single SAR instruction. That number didn't consider that the frequency scaling multiplication occasionally produces negative numbers, which 📝 must be conditionally rounded up when replacing signed divisions with arithmetic bitshifts to still produce the exact original animation. This conditional rounding adds ≥8 cycles in the more common positive case, and ≥6 in the rarer negative case.
The biggest gains, however, would come from inlining the two far calls to the 5-instruction function that calculates one dimension of a polar coordinate, saving another ≥100 cycles.
Multiplied by the number of points, even these low-hanging fruit already save a whopping ≥729,088 cycles per frame on an i486, without writing a single line of ASM! On Pentium CPUs such as the one in the PC-9821Xa7 that ZUN supposedly developed this game on, the savings are slightly smaller because far calls are much faster, but still come in at a hefty ≥466,944 cycles. Thus, this animation easily beats 📝 TH01's sprite blitting and unblitting code, which just barely hit the 6-digit mark of wasted cycles, and snatches the crown of being the single most unoptimized code in all of PC-98 Touhou.
The incredible irony here is that TH03 is the point where ZUN 📝 really📝 started📝 going📝 overboard with useless ASM micro-optimizations, yet he didn't even begin to optimize the one thing that would have actually benefitted from it. Maybe he 📝 once again went for the 📽️ cinematic look 📽️ on purpose?
Unlike TH01's sprites though, all this wasted performance doesn't really matter much in the end. Sure, optimizing the animation would give us more trailing curves on slower PC-98 models, but any attempt to increase the frame rate by interpolating angles would send us straight into fanfiction territory. Due to the 0x02/2.8125° increment per cycle, tripling the frame rate of this animation would require a change to a very awkward (log2384) = 8.58-bit angle format, complete with a new 384-entry sine/cosine lookup table. And honestly, the effect does look quite impressive even at 18.8 FPS.
There are three more bugs and quirks in this animation that are unrelated to performance:
If you've tried counting the number of trailing dots in the video above, you might have noticed that the very first frame actually renders 8×2 trailing curves instead of 7×2, thus rendering an even higher 4,608 points. What's going on there is that ZUN actually requested 8 trailing curves, but then forgot to reset the VSync counter after the initial 30-frame delay. As a result, the game always thinks that the first frame of the menu took ≥30 VSync interrupts to render, thus causing the decrement mechanism to kick in and deterministically reduce the trailing curve count to 7.
This is a textbook example of my definition of a ZUN bug: The code unmistakably says 8, and we only don't get 8 because ZUN forgot to mutate a piece of global state.
The small trailing curves have a noticeable discontinuity where they suddenly get rotated by ±90° between the last and first frame of the animation cycle.
This quirk comes down to the small curve's ẟy angle offset being calculated as ((c/2)-i), with i being the number of the trailing curve. Halving the main cycle variable effectively restricts this smaller curve to only the first half of the sine oscillation, between [0x00, 0x80[. For the main curve, this is fine as i is always zero. But once the trailing curves leave us with a negative value after the subtraction, the resulting angle suddenly flips over into the second half of the sine oscillation that the regular curve never touches. And if you recall how a sine wave looks, the resulting visual rotation immediately makes sense:
Removing the division would be the most obvious fix, but that would double the speed of the sine oscillation and change the shape of the curve way beyond ZUN's intentions. The second-most obvious fix involves matching the trailing curves to the movement of the main one by restricting the subtraction to the first half of the oscillation, i.e., calculating ẟy as (((c/2)-i) % 0x80) instead. With c increasing by 0x02 on each frame of the animation, this fix would only affect the first 8 frames.
ZUN decided to plot the darker trailing curves on top of the lighter main ones. Maybe it should have been the other way round?
Now with the full 18 curves, a direction change of the smaller trailing curves at the end of the loop that only looks slightly odd, and a reversed and more natural plotting order.
Now that we fully understand how the curve animation works, there's one more issue left to investigate. Let's actually try holding the Z key to auto-select Reimu on the very first frame of the Story Mode Select screen:
The confirmation flash even happens before the menu's first page flip.
Stepping through the individual frames of the video above reveals quite a bit of tearing, particularly when VRAM is cleared in frame 1 and during the menu's first page flip in frame 49. This might remind you of 📝 the tearing issues in the Music Rooms – and indeed, this tearing is once again the expected result of ZUN landmines in the code, not an emulation bug. In fact, quite the contrary: Scanline-based rendering is a mark of quality in an emulator, as it always requires more coding effort and processing power than not doing it. Everyone's favorite two PC-98 emulators from 20 years ago might look nicer on a per-frame basis, but only because they effectively hide ZUN's frequent confusion around VRAM page flips.
To understand these tearing issues, we need to consider two more code details:
If a frame took longer than 3 VSync interrupts to render, ZUN flips the VRAM pages immediately without waiting for the next VSync interrupt.
The hardware palette fade-out is the last thing done at the end of the per-frame rendering loop, but before busy-waiting for the VSync interrupt.
The combination of 1) and the aforementioned 30-frame delay quirk explains Frame 49. There, the page flip happens within the second frame of the three-frame chunk while the electron beam is drawing row #156. DOSBox-X doesn't try to be cycle-accurate to specific CPUs, but 1 menu frame taking 1.39 real-time frames at 56.4 FPS is roughly in line with the cycle counting we did earlier.
Frame 97 is the much more intriguing one, though. While it's mildly amusing to see the palette actually go brighter for a single frame before it fades out, the interesting aspect here is that 2) practically guarantees its palette changes to happen mid-frame. And since the CRT's electron beam might be anywhere at that point… yup, that's how you'd get more than 16 colors out of the PC-98's 16-color graphics mode. 🎨
Let's exaggerate the brightness difference a bit in case the original difference doesn't come across too clearly on your display:
Probably not too much of a reason for demosceners to get excited; generic PC-98 code that doesn't try to target specific CPUs would still need a way of reliably timing such mid-frame palette changes. Bit 6 (0x40) of I/O port 0xA0 indicates HBlank, and the usual documentation suggests that you could just busy-wait for that bit to flip, but an HBlank interrupt would be much nicer.
This reproduces on both DOSBox-X and Neko Project 21/W, although the latter needs the Screen → Real palettes option enabled to actually emulate a CRT electron beam. Unfortunately, I couldn't confirm it on real hardware because my PC-9821Nw133's screen vinegar'd at the beginning of the year. But just as with the image loading times, TH03's remaining code sorts of indicate that mid-frame palette changes were noticeable on real hardware, by means of this little flag I RE'd way back in March 2019. Sure, palette_show() takes >2,850 cycles on a 486 to downconvert master.lib's 8-bit palette to the GDC's 4-bit format and send it over, and that might add up with more than one palette-changing effect per frame. But tearing is a way more likely explanation for deferring all palette updates until after VSync and to the next frame.
And that completes another menu, placing us a very likely 2 pushes away from completing TH03's OP.EXE! Not many of those left now…
To balance out this heavy research into a comparatively small amount of code, I slotted in 2024's Part 2 of my usual bi-annual website improvements. This time, they went toward future-proofing the blog and making it a lot more navigable. You've probably already noticed the changes, but here's the full changelog:
The Progress blog link in the main navigation bar now points to a new list page with just the post headers and each post's table of contents, instead of directly overwhelming your browser with a view of every blog post ever on a single page.
If you've been reading this blog regularly, you've probably been starting to dread clicking this link just as much as I've been. 14 MB of initially loaded content isn't too bad for 136 posts with an increasing amount of media content, but laying out the now 2 MB of HTML sure takes a while, leaving you with a sluggish and unresponsive browser in the meantime. The old one-page view is still available at a dedicated URL in case you want to Ctrl-F over the entire history from time to time, but it's no longer the default.
The new 🔼 and 🔽 buttons now allow quick jumps between blog posts without going through the table of contents or the old one-page view. These work as expected on all views of the blog: On single-post pages, the buttons link to the adjacent single-post pages, whereas they jump up and down within the same page on the list of posts or the tag-filtered and one-page views.
The header section of each post now shows the individual goals of each push that the post documents, providing a sort of title. This is much more useful than wasting space with meaningless commit hashes; just like in the log, links to the commit diffs don't need to be longer than a GitHub icon.
The web feeds that 📝 handlerug implemented two years ago are now prominently displayed in the new blog navigation sub-header. Listing them using <link rel="alternate"> tags in the HTML <head> is usually enough for integrated feed reader extensions to automatically discover their presence, but it can't hurt to draw more attention to them. Especially now that Twitter has been locking out unregistered users for quite some time…
Speaking of microblogging platforms, I've now also followed a good chunk of the Touhou community to Bluesky! The algorithms there seem to treat my posts much more favorably than Twitter has been doing lately, despite me having less than 1/10 of mostly automatically migrated followers there. For now, I'm going to cross-post new stuff to both platforms, but I might eventually spend a push to migrate my entire tweet history over to a self-hosted PDS to own the primary source of this data.
Next up: Staying with main menus, but jumping forward to TH04 and TH05 and finalizing some code there. Should be a quick one.
P0262
Decompilation (TH04/TH05 main/option menu)
P0263
Decompilation (TH04/TH05 first-launch sound setup menu + TH05 title screen animation)
💰 Funded by:
Blue Bolt, [Anonymous]
🏷️ Tags:
And once again, the Shuusou Gyoku task was too complex to be satisfyingly solved within a single month. Even just finding provably correct loop sections in both the original and arranged MIDI files required some rather involved detection algorithms. I could have just defined what sounded like correct loops, but the results of these algorithms were quite surprising indeed. Turns out that not even Seihou is safe from ZUN quirks, and some tracks technically loop much later than you'd think they do, or don't loop at all. And since I then wanted to put these MIDI loops back into the game to ensure perfect synchronization between the recordings and MIDI versions, I ended up rewriting basically all the MIDI code in a cross-platform way. This rewrite also uncovered a pbg bug that has traveled from Shuusou Gyoku into Windows Touhou, where it survived until ZUN ultimately removed all MIDI code in TH11 (!)…
Fortunately, the backlog still had enough general PC-98 Touhou funds that I could spend on picking some soon-important low-hanging fruit, giving me something to deliver for the end of the month after all. TH04 and TH05 use almost identical code for their main/option menus, so decompiling it would make number go up quite significantly and the associated blog post won't be that long…
Wait, what's this, a bug report from touhou-memories concerning the website?
Tab switchers tended to break on certain Firefox versions, and
video playback didn't work on Microsoft Edge at all?
Those are definitely some high-priority bugs that demand immediate attention.
The tab switcher issue was easily fixed by replacing the previous z-index trickery with a more robust solution involving the hidden attribute. The second one, however, is much more aggravating, because video playback on Edge has been broken ever since I 📝 switched the preferred video codec to AV1.
This goes so far beyond not supporting a specific codec. Usually, unsupported codecs aren't supposed to be an issue: As soon as you start using the HTML <video> tag, you'll learn that not every browser supports all codecs. And so you set up an encoding pipeline to serve each video in a mix of new and ancient formats, put the <source> tag of the most preferred codec first, and rest assured that browsers will fall back on the best-supported option as necessary. Except that Edge doesn't even try, and insists on staying on a non-playing AV1 video. 🙄
The codecs parameter for the <source> type attribute was the first potential solution I came across. Specifying the video codec down to the finest encoding details right in the HTML markup sounds like a good idea, similar to specifying sizes of images and videos to prevent layout reflows on long pages during the initial page load. So why was this the first time I heard of this feature? The fact that there isn't a simple ffprobe -show_html_codecs_string command to retrieve this string might already give a clue about how useful it is in practice. Instead, you have to manually piece the string together by grepping your way through all of a video's metadata…
…and then it still doesn't change anything about Edge's behavior, even when also specifying the string for the VP9 and VP8 sources. Calling the infamously ridiculous HTMLMediaElement.canPlayType() method with a representative parameter of "video/webm; codecs=av01.1.04M.08.0.000.01.13.00.0" explains why: Both the AV1-supporting Chrome and Edge return "probably", but only the former can actually play this format. 🤦
But wait, there is an AV1 video extension in the Microsoft Store that would add support to any unspecified favorite video app. Except that it stopped working inside Edge as of version 116. And even if it did: If you can't query the presence of this extension via JavaScript, it might as well not exist at all.
Not to mention that the favorite video app part is obviously a lie as a lot of widely preferred Windows video apps are bundled with their own codecs, and have probably long supported AV1.
In the end, there's no way around the utter desperation move of removing the AV1 <source> for Edge users. Serving each video in two other formats means that we can at least do something here – try visiting the GitHub release page of the P0234-1 TH01 Anniversary Edition build in Edge and you also don't get to see anything, because that video uses AV1 and GitHub understandably doesn't re-encode every uploaded video into a variety of old formats.
Just for comparison, I tried both that page and the ReC98 blog on an old Android 6 phone from 2014, and even that phone picked and played the AV1 videos with the latest available Chrome and Firefox versions. This was the phone whose available Firefox version didn't support VP9 in 2019, which was my initial reason for adding the VP8 versions. Looks like it's finally time to drop those… 🤔 Maybe in the far future once I start running out of space on this server.
Removing the <source> tags can be done in one of two places:
server-side, detecting Edge via the User-Agent header, or
I went with 2) because more dynamic server-side code would only move us further away from static site generation, which would make a lot of sense as the next evolutionary step in the architecture of this website. The client-side solution is much simpler too, and we can defer the deletion until a user actually hovers over a specific video.
And while we're at it, let's also add a popup complaining about this whole state of affairs. Edge is heavily marketed inside Windows as "the modern browser recommended by Microsoft", and you sure wouldn't expect low-quality chroma-subsampled VP9 from such a tagline. With such a level of anti-support for AV1, Edge users deserve to know exactly what's going on, especially since this post also explains what they will encounter on other websites.
That's the polite way of putting it.
Alright, where was I? For TH01, the main menu was the last thing I decompiled before the 100% finalization mark, so it's rather anticlimactic to already cover the TH04/TH05 one now, with both of the games still being very far away from 100%, just because people will soon want to translate the description text in the bottom-right corner of the screen. But then again, the ZUN Soft logo animation would make for an even nicer final piece of decompiled code, especially since the bouncing-ball logo from TH01, TH02, and TH03 was the very first decompilation I did, all the way back in 2015.
The code quality of ZUN's VRAM-based menus has barely increased between TH01 and TH05. Both the top-level and option menu still need to know the bounding rectangle of the other one to unblit the right pixels when switching between the two. And since ZUN sure loved hardcoded and copy-pasted numbers in the PC-98 days, the coordinates both tend to be excessively large, and excessively wrong. Luckily, each menu item comes with its own correct unblitting rectangle, which avoids any graphical glitches that would otherwise occur.
As for actual observable quirks and bugs, these menus only contain one of each, and both are exclusive to TH04:
Quitting out of the Music Room moves the cursor to the Start option. In TH05, it stays on Music Room.
Changing the S.E. mode seems to do nothing within TH04's menus, and would only take effect if you also change the Music mode afterward, or launch into the game.
And yes, these videos do have a frame rate of 2 FPS.
Now that 100% finalization of their OP.EXE binaries is within reach, all this bloat made me think about the viability of a 📝 single-executable build for TH04's and TH05's debloated and anniversary versions. It would be really nice to have such a build ready before I start working on the non-ASCII translations – not just because they will be based on the anniversary branch by default, but also because it would significantly help their development if there are 4 fewer executables to worry about.
However, it's not as simple for these games as it was for TH01. The unique code in their OP.EXE and MAINE.EXE binaries is much larger than Borland's easily removed C++ exception handler, so I'd have to remove a lot more bloat to keep the resulting single binary at or below the size of the original MAIN.EXE. But I'm sure going to try.
Speaking of code that can be debloated for great effect: The second push of this delivery focused on the first-launch sound setup menu, whose BGM and sound effect submenus are almost complete code duplicates of each other. The debloated branch could easily remove more than half of the code in there, yielding another ≈800 bytes in case we need them.
If hex-editing MIKO.CFG is more convenient for you than deleting that file, you can set its first byte to FF to re-trigger this menu. Decompiling this screen was not only relevant now because it contains text rendered with font ROM glyphs and it would help dig our way towards more important strings in the data segment, but also because of its visual style. I can imagine many potential mods that might want to use the same backgrounds and box graphics for their menus.
How about an initial language selection menu in the same style?
With the two submenus being shown in a fixed sequence, there's not a lot of room for the code to do anything wrong, and it's even more identical between the two games than the main menu already was. Thankfully, ZUN just reblits the respective options in the new color when moving the cursor, with no 📝 palette tricks. TH04's background image only uses 7 colors, so he could have easily reserved 3 colors for that. In exchange, the TH05 image gets to use the full 16 colors with no change to the code.
Rounding out this delivery, we also got TH05's rolling Yin-Yang Orb animation before the title screen… and it's just more bloat and landmines on a smaller scale that might be noticeable on slower PC-98 models. In total, there are three unnecessary inter-page copies of the entire VRAM that can easily insert lag frames, and two minor page-switching landmines that can potentially lead to tearing on the first frame of the roll or fade animation. Clearly, ZUN did not have smoothness or code quality in mind there, as evidenced by the fact that this animation simply displays 8 .PI files in sequence. But hey, a short animation like this is 📝 another perfectly appropriate place for a quick-and-dirty solution if you develop with a deadline.
And that's 1.30% of all PC-98 Touhou code finalized in two pushes! We're slowly running out of these big shared pieces of ASM code…
I've been neglecting TH03's OP.EXE quite a bit since it simply doesn't contain any translatable plaintext outside the Music Room. All menu labels are gaiji, and even the character selection menu displays its monochrome character names using the 4-plane sprites from CHNAME.BFT. Splitting off half of its data into a separate .ASM file was more akin to getting out a jackhammer to free up the room in front of the third remaining Music Room, but now we're there, and I can decompile all three of them in a natural way, with all referenced data.
Next up, therefore: Doing just that, securing another important piece of text for the upcoming non-ASCII translations and delivering another big piece of easily finalized code. I'm going to work full-time on ReC98 for almost all of December, and delivering that and the Shuusou Gyoku SC-88Pro recording BGM back-to-back should free up about half of the slightly higher cap for this month.
And now we're taking this small indie game from the year 2000 and porting
its game window, input, and sound to the industry-standard cross-platform
API with "simple" in its name.
Why did this have to be so complicated?! I expected this to take maybe 1-2
weeks and result in an equally short blog post. Instead, it raised so many
questions that I ended up with the longest blog post so far, by quite a wide
margin. These pushes ended up covering so many aspects that could be
interesting to a general and non-Seihou-adjacent audience, so I think we
need a table of contents for this one:
Before we can start migrating to SDL, we of course have to integrate it into
the build somehow. On Linux, we'd ideally like to just dynamically link to a
distribution's SDL development package, but since there's no such thing on
Windows, we'd like to compile SDL from source there. This allows us to reuse
our debug and release flags and ensures that we get debug information,
without needing to clone build scripts for every
C++ library ever in the process or something.
So let's get my Tup build scripts ready for compiling vendored libraries… or
maybe not? Recently, I've kept hearing about a hot new
technology that not only provides the rare kind of jank-free
cross-compiling build system for C/C++ code, but innovates by even
bundling a C++ compiler into a single 279 MiB package with no
further dependencies. Realistically replacing both Visual Studio and Tup
with a single tool that could target every OS is quite a selling point. The
upcoming Linux port makes for the perfect occasion to evaluate Zig, and to
find out whether Tup is still my favorite build system in 2023.
Even apart from its main selling point, there's a lot to like about Zig:
First and foremost: It's a modern systems programming language with
seamless C interop that we could gradually migrate parts of the codebase to.
The feature set of the core language seems to hit the sweet spot between C
and C++, although I'd have to use it more to be completely sure.
A native, optimized Hello World binary with no string formatting is
4 KiB when compiled for Windows, and 6.4 KiB when cross-compiled
from Windows to Linux. It's so refreshing to see a systems language in 2023
that doesn't bundle a bulky runtime for trivial programs and then defends it
with the old excuse of "but all this runtime code will come in handy the
larger your program gets". With a first impression like this, Zig
managed to realize the "don't pay for what you don't use" mantra that C++
typically claims for itself, but only pulls off maybe half of the time.
You can directly
target specific CPU models, down to even the oldest 386 CPUs?! How
amazing is that?! In contrast, Visual Studio only describes its /arch:IA32
compatibility option in very vague terms, leaving it up to you to figure out
that "legacy 32-bit x86 instruction set without any vector
operations" actually means "i586/P5 Pentium, because the startup code
still includes an unconditional CPUID instruction". In any
case, it means that Zig could also cover the i586 build.
Even better, changing Zig's CPU model setting recompiles both its
bundled C/C++ standard library and Zig's own compiler-rt polyfill
library for that architecture. This ensures that no unsupported
instructions ever show up in the binary, and also removes the need for
any CPUID checks. This is so much better than the Visual
Studio model of linking against a fixed pre-compiled standard library
because you don't have to trust that all these newer instructions
wouldn't actually be executed on older CPUs that don't have them.
I love the auto-formatter. Want to lay out your struct literal into
multiple lines? Just add a trailing comma to the end of the last element.
It's very snappy, and a joy to use.
Like every modern programming language, Zig comes with a test framework
built into the language. While it's not all too important for my grand plan
of having one big test that runs a bunch of replays and compares their game
states against the original binary, small tests could still be useful for
protecting gameplay code against accidental changes. It would be great if I
didn't have to evaluate and choose among
the many testing frameworks for C++ and could just use a language
standard.
Package
management is still in its infancy, but it's looking pretty good so far,
resembling Go's decentralized approach of just pointing to a URL but with
specific version selection from the get-go.
However, as a version number of 0.11.0 might already suggest, the whole
experience was then bogged down by quite a lot of issues:
While Zig's C/C++ compilation feature is very
well architected to reuse the C/C++ standard libraries of GCC and MinGW and
thus automatically keeps up with changes to the C++ standard library,
it's ultimately still just a Clang frontend. If you've been working with a
Visual Studio-exclusive codebase – which, as we're going to see below, can
easily happen even if you compile in C++23 mode – you'd now have to
migrate to Clang and Zig in a single step. Obviously, this can't ever
be fixed without Microsoft open-sourcing their C++ compiler. And even then,
supporting a separate set of command-line flags might not be worth it.
The standard library is very poorly documented, especially in the
build-related parts that are meant to attract the C++ audience.
Often, the only documentation is found in blog posts from a few years
ago, with example code written against old Zig versions that doesn't compile
on the newest version anymore. It's all very far from stable.
However, Zig's project generation sub-commands (zig
init-exe and friends) do emit well-documented boilerplate
code? It does make sense for that code to double as a comprehensive example,
but Zig advertises itself as so simple that I didn't even think about
bootstrapping my project with a CLI tool at first – unlike, say, Rust, where
a project always starts with filling out a small form in
Cargo.toml.
There's no progress output for C/C++ compilation? Like, at all?
This hurts especially because compilation times are significantly longer
than they were with Visual Studio. By default, the current Tupfile builds
Shuusou Gyoku in both debug and release configurations simultaneously. If I
fully rebuild everything from a clean cache, Visual Studio finishes such a
build in roughly the same amount of time that Zig takes to compile just a
debug build.
The --global-cache-dir option is only supported by specific
subcommands of the zig CLI rather than being a top-level
setting, and throws an error if used for any other subcommand. Not having a
system-wide way to change it and being forced into writing a wrapper script
for that is fine, but it would be nice if said wrapper script didn't have to
also parse and switch over the subcommand just to figure out whether it is
allowed to append the setting.
compiler-rt still needs a bit of dead code elimination work. As soon as
your program needs a single polyfilled function, you get all of them,
because they get referenced in some exception-related table even if nothing
uses them? Changing the link_eh_frame_hdr option had no
effect.
And that was not the only std.Build.Step.Compile option
that did nothing. Worse, if I just tweaked the options and changed nothing
about the code itself, Zig simply copied a previously built executable
out of its build cache into the output directory, as revealed by the
timestamp on the .EXE. While I am willing to believe that Zig correctly
detects that all these settings would just produce the same binary, I do not
like how this behavior inspires distrust and uncertainty in Zig's build
process as a whole. After all, we still live in a world where clearing
the build cache is way too often the solution for weird problems in
software, especially when using CMake. And it makes sense why it would be:
If you develop a complex system and then try solving the infamously hard
problem of cache invalidation on top, the risk of getting cache invalidation
wrong is, by definition, higher than if that was the only thing your system
did. That's the reason why I like Tup so much: It solely focuses on
getting cache invalidation right, and rather errs on the side of caution by
maybe unnecessarily rebuilding certain files every once in a while because
the compiler may have read from an environment variable that has changed in
the meantime. But this is the one job I expect a build system to do, and Tup
has been delivering for years and has become fundamentally more trustworthy
as a result.
Zig activates Clang's UBSan
in debug builds by default, which executes a program-crashing
UD2 instruction whenever the program is about to rely on
undefined C++ behavior. In theory, that's a great help for spotting hidden
portability issues, but it's not helpful at all if these crashes are
seemingly caused by C++ standard library code?! Without any clear info
about the actual cause, this just turned into yet another annoyance on
top of all the others. Especially because I apparently kept searching for
the wrong terms when I first encountered this issue, and only found
out how to deactivate it after I already decided against Zig.
Also, can we get /PDBALTPATH?
Baking absolute paths from the filesystem of the developer's machine into
released binaries is not only cringe in itself, but can also cause potential
privacy or security accidents.
So for the time being, I still prefer Tup. But give it maybe two or three
years, and I'm sure that Zig will eventually become the best tool for
resurrecting legacy C++ codebases. That is, if the proposed divorce of the
core Zig compiler from LLVMisn't an indication that the
productive parts of the Zig community consider the C/C++ building features
to be "good enough", and are about to de-emphasize them to focus more
strongly on the actual Zig language. Gaining adoption for your new systems
language by bundling it with a C/C++ build system is such a great and unique
strategy, and it almost worked in my case. And who knows, maybe Zig will
already be good enough by the time I get to port PC-98 Touhou to modern
systems.
(If you came from the Zig
wiki, you can stop reading here.)
A few remnants of the Zig experiment still remain in the final delivery. If
that experiment worked out, I would have had to immediately change the
execution encoding to UTF-8, and decompile a few ASM functions exclusive to
the 8-bit rendering mode which we could have otherwise ignored. While Clang
does support inline assembly with Intel syntax via
-fms-extensions, it has trouble with ; comments
and instructions like REP STOSD, and if I have to touch that
code anyway… (The REP STOSD function translated into a single
call to memcpy(), by the way.)
Another smaller issue was Visual Studio's lack of standard library header
hygiene, where #including some of the high-level STL features also includes
more foundational headers that Clang requires to be included separately, but
I've already known about that. Instead, the biggest shocker was that Visual
Studio accepts invalid syntax for a language feature as recent as C++20
concepts:
// Defines the interface of a text rendering session class. To simplify this
// example, it only has a single `Print(const char* str)` method.
template <class T> concept Session = requires(T t, const char* str) {
t.Print(str);
};
// Once the rendering backend has started a new session, it passes the session
// object as a parameter to a user-defined function, which can then freely call
// any of the functions defined in the `Session` concept to render some text.
template <class F, class S> concept UserFunctionForSession = (
Session<S> && requires(F f, S& s) {
{ f(s) };
}
);
// The rendering backend defines a `Prerender()` method that takes the
// aforementioned user-defined function object. Unfortunately, C++ concepts
// don't work like this: The standard doesn't allow `auto` in the parameter
// list of a `requires` expression because it defines another implicit
// template parameter. Nevertheless, Visual Studio compiles this code without
// errors.
template <class T, class S> concept BackendAttempt = requires(
T t, UserFunctionForSession<S> auto func
) {
t.Prerender(func);
};
// A syntactically correct definition would use a different constraint term for
// the type of the user-defined function. But this effectively makes the
// resulting concept unusable for actual validation because you are forced to
// specify a type for `F`.
template <class T, class S, class F> concept SyntacticallyFixedBackend = (
UserFunctionForSession<F, S> && requires(T t, F func) {
t.Prerender(func);
}
);
// The solution: Defining a dummy structure that behaves like a lambda as an
// "archetype" for the user-defined function.
struct UserFunctionArchetype {
void operator ()(Session auto& s) {
}
};
// Now, the session type disappears from the template parameter list, which
// even allows the concrete session type to be private.
template <class T> concept CorrectBackend = requires(
T t, UserFunctionArchetype func
) {
t.Prerender(func);
};
What's this, Visual Studio's infamous delayed template parsing applied to
concepts, because they're templates as well? Didn't
they get rid of that 6 years ago? You would think that we've moved
beyond the age where compilers differed in their interpretation of the core
language, and that opting into a current C++ standard turns off any
remaining antiquated behaviors…
So let's actually get my Tup build scripts ready for compiling
vendored libraries, because the
📝 previous 70 lines of Lua definitely
weren't. For this use case, we'd like to have some notion of distinct build
targets that can have a unique set of compilation and linking flags. We'd
also like to always build them in debug and release versions even if you
only intend to build your actual program in one of those versions – with the
previous system of specifying a single version for all code, Tup would
delete the other one, which forces a time-consuming and ultimately needless
rebuild once you switch to the other version.
The solution I came up with treats the set of compiler command-line options
like a tree whose branches can concatenate new options and/or filter the
versions that are built on this branch. In total, this is my 4th
attempt at writing a compiler abstraction layer for Tup. Since we're
effectively forced to write such layers in Lua, it will always be a
bit janky, but I think I've finally arrived at a solid underlying design
that might also be interesting for others. Hence, I've split off the result
into its own separate
repository and added high-level documentation and a documented example.
And yes, that's a Code Nutrition
label! I've wanted to add one of these ever since I first heard about the
idea, since it communicates nicely how seriously such an open-source project
should be taken. Which, in this case, is actually not all too
seriously, especially since development of the core Tup project has all but
stagnated. If Zig does indeed get better and better at being a Clang
frontend/build system, the only niches left for Tup will be Visual
Studio-exclusive projects, or retrocoding with nonstandard toolchains (i.e.,
ReC98). Quite ironic, given Tup's Unix heritage…
Oh, and maybe general Makefile-like tasks where you just want to run
specific programs. Maybe once the general hype swings back around and people
start demanding proper graph-based dependency tracking instead of just a command runner…
Alright, alternatives evaluated, build system ready, time to include SDL!
Once again, I went for Git submodules, but this time they're held together
by a
batch file that ensures that the intended versions are checked out before
starting Tup. Git submodules have a bad rap mainly because of their
usability issues, and such a script should hopefully work around
them? Let's see how this plays out. If it ends up causing issues after all,
I'll just switch to a Zig-like model of downloading and unzipping a source
archive. Since Windows comes with curl and tar
these days, this can even work without any further dependencies.
Compiling SDL from a non-standard build system requires a
bit of globbing to include all the code that is being referenced, as
well as a few linker settings, but it's ultimately not much of a big deal.
I'm quite happy that it was possible at all without pre-configuring a build,
but hey, that's what maintaining a Visual Studio project file does to a
project.
By building SDL with the stock Windows configuration, we then end up with
exactly what the SDL developers want us to use… which is a DLL. You
can statically link SDL, but they really don't want you to do
that. So strongly, in fact, that they not
merely argue how well the textbook advantages of dynamic linking have worked
for them and gamers as a whole, but implemented a whole dynamic API
system that enforces overridable dynamic function loading even in static
builds. Nudging developers to their preferred solution by removing most
advantages from static linking by default… that's certainly a strategy. It
definitely fits with SDL's grassroots marketing, which is very good at
painting SDL as the industry standard and the only reliable way to keep your
game running on all originally supported operating systems. Well, at least
until SDL 3 is so stable that SDL 2 gets deprecated and won't
receive any code for new backends…
However, dynamic linking does make sense if you consider what SDL is.
Offering all those multiple rendering, input, and sound backends is what
sets it apart from its more hip competition, and you want to have all of
them available at any time so that SDL can dynamically select them based on
what works best on a system. As a result, everything in SDL is being
referenced somewhere, so there's no dead code for the linker to eliminate.
Linking SDL statically with link-time code generation just prolongs your
link time for no benefit, even without the dynamic API thwarting any chance
of SDL calls getting inlined.
There's one thing I still don't like about all this, though. The dynamic
API's table references force you to include all of SDL's subsystems in the
DLL even if your game doesn't need some of them. But it does fit with their
intention of having SDL2.dll be swappable: If an older game
stopped working because of an outdated SDL2.dll, it should be
possible for anyone to get that game working again by replacing that DLL
with any newer version that was bundled with any random newer game. And
since that would fail if the newer SDL2.dll was size-optimized
to not include some of the subsystems that the older game required, they
simply removed (or de-prioritized) the possibility altogether.
Maybe that was their train of thought? You can always just use the official Windows
DLL, whose whole point is to include everything, after all. 🤷
So, what do we get in these 1.5 MiB? There are:
renderer backends for Direct3D 9/11/12, regular OpenGL, OpenGL ES 2.0,
Vulkan, and a software renderer,
and audio backends for WinMM, DirectSound, WASAPI, and direct-to-disk
recording.
Unfortunately, SDL 2 also statically references some newer Windows API
functions and therefore doesn't run on Windows 98. Since this build of
Shuusou Gyoku doesn't introduce any new features to the input or sound
interfaces, we can still use pbg's original DirectSound and DirectInput code
for the i586 build to keep it working with the rest of the
platform-independent game logic code, but it will start to lag behind in
features as soon as we add support for SC-88Pro BGM or more sophisticated input
remapping. If we do want to keep this build at the same feature level as
the SDL one, we now have a choice: Do we write new DirectInput and
DirectSound code and get it done quickly but only for Shuusou Gyoku, or do
we port SDL 2 to Windows 98 and benefit all other SDL 2 games as
well? I leave
that for my backers to decide.
Immediately after writing the first bits of actual SDL code to initialize
the library and create the game window, you notice that SDL makes it very
simple to gradually migrate a game. After creating the game window, you can
call SDL_GetWindowWMInfo()
to retrieve HWND and HINSTANCE handles that allow
you to continue using your original DirectDraw, DirectSound, and DirectInput
code and focus on porting one subsystem at a time.
Sadly, D3DWindower can no longer turn SDL's fullscreen mode into a windowed
one, but DxWnd still works, albeit behaving a bit janky and insisting on
minimizing the game whenever its window loses focus. But in exchange, the
game window can surprisingly be moved now! Turns out that the originally
fixed window position had nothing to do with the way the game created its
DirectDraw context, and everything to do with pbg
blocking the Win32 "syscommand" that allows a window to be moved. By
deleting a system menu… seriously?! Now I'm dying to hear the Raymond
Chen explanation for how this behavior dates back to an unfortunate decision
during the Win16 days or something.
As implied by that commit, I immediately backported window movability to the
i586 build.
However, the most important part of Shuusou Gyoku's main loop is its frame
rate limiter, whose Win32 version leaves a bit of room for improvement.
Outside of the uncapped [おまけ] DrawMode, the
original main loop continuously checks whether at least 16 milliseconds have
elapsed since the last simulated (but not necessarily rendered) frame. And
by that I mean continuously, and deliberately without using any of
the Windows system facilities to sleep the process in the meantime, as
evidenced by a commented-out Sleep(1) call. This has two
important effects on the game:
The 60Fps DrawMode actually corresponds to a
frame rate of
(1000 / 16) = 62.5 FPS,
not 60. Since the game didn't account for the missing
2/3 ms to bring the limit down to exactly 60 FPS,
62.5 FPS is Shuusou Gyoku's actual official frame rate in a
non-VSynced setting, which we should also maintain in the SDL port.
Not sleeping the process turns Shuusou Gyoku's frame rate limitation
into a busy-waiting loop, which always uses 100% of a single CPU core just
to wait for the next frame.
Sure, modern computers are fast, but a frame won't ever take an
infinitely fast 0 milliseconds to render. So we still need to take the
current frame time into account.
SDL_Delay()'s documentation says that the wake-up could be
further delayed due to OS scheduling.
To address both of these issues, I went with a base delay time of
15 ms minus the time spent on the current frame, followed by
busy-waiting for the last millisecond to make sure that the next frame
starts on the exact frame boundary. And lo and behold: Even though this
still technically wastes up to 1 ms of CPU time, it still dropped CPU
usage into the 0%-2% range during gameplay on my Intel Core i5-8400T CPU,
which is over 5 years old at this point. Your laptop battery will appreciate
this new build quite a bit.
Time to look at audio then, because it sure looks less complicated than
input, doesn't it? Loading sounds from .WAV file buffers, playing a fixed
number of instances of every sound at a given position within the stereo
field and with optional looping… and that's everything already. The
DirectSound implementation is so straightforward that the most complex part
of its code is the .WAV file parser.
Well, the big problem with audio is actually finding a cross-platform
backend that implements these features in a way that seamlessly works with
Shuusou Gyoku's original files. DirectSound really is the perfect sound API
for this game:
It doesn't require the game code to specify any output sample format.
Just load the individual sound effects in their original format, and
playback just works and sounds correctly.
Its final sound stream seems to have a latency of 10 ms, which is
perfectly fine for a game running at 62.5 FPS. Even 15 ms would be
OK.
Sound effect looping? Specified by passing the
DSBPLAY_LOOPING flag to
IDirectSoundBuffer::Play().
Stereo panning balancing? One method call.
Playing the same sound multiple times simultaneously from a single
memory buffer? One
method call. (It can fail though, requiring you to copy the data after
all.)
Pausing all sounds while the game window is not focused? That's the
default behavior, but it can be equally easily disabled with just
a single per-buffer flag.
Future streaming of waveform BGM? No problem either. Windows Touhou has
always done that, and here's
some code I wrote 12½ years ago that would even work without DirectSound
8's notification feature.
No further binary bloat, because it's part of the operating system.
The last point can't really be an argument against anything, but we'd still
be left with 7 other boxes that a cross-platform alternative would have to
tick. We already picked SDL for our portability needs, so how does its audio
subsystem stack up? Unfortunately, not great:
It's fully DIY. All you get is a single output buffer, and you have to
do all the mixing and effect processing yourself. In other words, it's the
masochistic approach to cross-platform audio.
There are helper functions for resampling and mixing, but the
documentation of the latter is full of FUD. With a disclaimer that so
vehemently discourages the use of this function, what are you supposed to do
if you're newly integrating SDL audio into a game? Hunt for a separate sound
mixing library, even though your only quality goal is parity with stone-age
DirectSound? 🙄
It forces the game to explicitly define the PCM sampling rate, bit
depth, and channel count of the output buffer. You can't
just pass a nullptr to SDL_OpenAudioDevice(),
and if you pass a zeroed SDL_AudioSpec structure, SDL just defaults
to an unacceptable 22,050 Hz sampling rate, regardless of what the
audio device would actually prefer. It took until last year for them to
notice that people would at least like to query the native
format. But of course, this approach requires the backend to actually
provide this information – and since we've seen above that DirectSound
doesn't care, the
DirectSound version of this function has to actually use the more modern
WASAPI, and remains unimplemented if that API is not available.
Standardizing the game on a single sampling rate, bit depth, and channel
count might be a decent choice for games that consistently use a single
format for all its sounds anyway. In that case, you get to do all mixing and
processing in that format, and the audio backend will at most do one final
conversion into the playback device's native format. But in Shuusou Gyoku,
most sound effects use 22,050 Hz, the boss explosion sound effect uses
11,025 Hz, and the future SC-88Pro BGM will obviously use
44,100 Hz. In such a scenario, you would have to pick the highest
sampling rate among all sound sources, and resample any lower-quality sounds
to that rate. But if the audio device uses a different sampling rate, those
lower-quality sounds would get resampled a second time.
I know that this
will be fixed in SDL 3, but that version is still under heavy
development.
Positives? Uh… the callback-based nature means that BGM streaming is
rather trivial, and would even be comparatively less complicated than with
DirectSound. Having a mutex to prevent
writes to your sound instance structures while they're being read by the
audio thread is nice too.
OK, sure, but you're not supposed to use it for anything more than a
single stream of audio. SDL_mixer exists precisely to cover such non-trivial
use cases, and it even supports sound effect looping and panning with just a
single function call! But as far as the rest of the library is concerned, it
manages to be an even bigger disappointment than raw SDL audio:
As it sits on top of SDL's audio subsystem, it still can't just use your
audio device's native sample format.
It only offers a very opinionated system for streaming – and of course,
its opinion is wrong. 😛 The fact that it only supports a single streaming
audio track wouldn't matter all too much if you could switch to another
track at sample precision. But since you can't, you're forced to implement
looping BGM using a single file…
…which brings us to the unfortunate issue of loop point definitions.
And, perhaps most importantly, the complete lack of any way to set them
through the API?! It doesn't take long until you come up with a theory for
why the API only offers a function to retrieve loop points: The
"music" abstraction is so format-agnostic that it even supports MIDI
and tracker formats where a typical loop point in PCM samples doesn't make
sense. Both of these formats already have in-band ways of specifying loop
points in their respective time units. They
might not be standardized, but it's still much better than usual
single-file solutions for PCM streams where the loop point has to be stored
in an out-of-band way – such as in a metadata tag or an entirely separate
file.
Speaking of MIDI, why is it so common among these APIs to not have
any way of specifying the MIDI device? The fact that Windows Vista
removed the Control Panel option for specifying the system-wide default
MIDI output device is no excuse for your API lacking the option as well.
In fact, your MIDI API now needs such a setting more than it was
needed in the Windows XP and 9x days.
Funnily enough, they did once receive a patch for a function to set loop
points which was never upstreamed… and this patch came from
the main developer behind PyTouhou, who needed that feature for obvious
reasons. The world sure is a small place.
As a result, they turned loop points into a property that each
individual format may
or may
not have. Want to loop
MP3 files at sample precision? Tough luck, time to reconvert to another
lossy format. 🙄 This is the exact jank I decided against when I implemented
BGM modding for thcrap back in 2018,
where I concluded that separate intro and
loop files are the way to go.
But OK, we only plan to use FLAC and Ogg Vorbis for the SC-88Pro BGM, for
which SDL_mixer does support loop points in the form of Vorbiscomments,
and hey, we can even pass them at sample accuracy. Sure, it's wrong and
everything, but nothing I couldn't work with…
However, the final straw that makes SDL_mixer unsuitable for Shuusou
Gyoku is its core sound mixing paradigm of distributing all sound effects
onto a fixed number of channels, set to 8
by default. Which raises the quite ridiculous question of how many we
would actually need to cover the maximum amount of sounds that can
simultaneously be played back in any game situation. The theoretic maximum
would be 41, which is the combined sum of individual sound buffer instances
of all 20 original sound effects. The practical limit would surely be a lot
smaller, but we could only find out that one through experiments, which
honestly is quite a silly proposition.
It makes you wonder why they went with this paradigm in the first
place. And sure enough, they actually
use the aforementioned SDL core function for mixing audio. Yes, the
same function whose current documentation advises against using it for
this exact use case. 🙄 What's the argument here? "Sure, 8 is
significantly more than 2, but any mixing artifacts that will occur for
the next 6 sounds are not worrying about, but they get really bad
after the 8th sound, so we're just going to protect you from
that"?
This dire situation made me wonder if SDL was the wrong choice for Shuusou
Gyoku to begin with. Looking at other low-level cross-platform game
libraries, you'll quickly notice that all of them come with mostly
equally capable 2D renderers these days, and mainly differentiate themselves
in minute API details that you'd only notice upon a really close look. raylib is another one of those
libraries and has been getting exceptionally popular in recent years, to the
point of even having more than twice as many GitHub stars as SDL. By
restricting itself to OpenGL, it can even offer an
abstraction for shaders, which we'd really like for the 西方Project lens ball effect.
In the case of raylib's audio system, the lack of sound effect looping is
the minute API detail that would make it annoying to use for Shuusou Gyoku.
But it might be worth a look at how raylib implements all this if it doesn't
use SDL… which turned out to be the best look I've taken in a long time,
because raylib builds on top of miniaudio
which is exactly the kind of audio library I was hoping to find.
Let's check the list from above:
🟢 miniaudio's high-level API initialization defaults to the native
sample format of the playback device. Its internal processing uses 32-bit
floating-point samples and only converts back to the native bit depth as
necessary when writing the final stream into the backend's audio buffer.
WASAPI, for example, never needs any further conversion because it operates
with 32-bit floats as well.
🟢 The final audio stream uses the same 10 ms update period (and
thus, sound effect latency) that I was getting with DirectSound.
🟢 Stereo panning balancing? ma_sound_set_pan(),
although it does require a conversion from Shuusou Gyoku's dB units into a
linear attenuation factor.
🟢 Sound effect looping? ma_sound_set_looping().
🟢 Playing the same sound multiple times simultaneously from a single
memory buffer? Perfectly possible, but requires a bit of digging in the
header to find the best solution. More on that below.
🟢 Future streaming of waveform BGM? Just call
ma_sound_init_from_file() with the
MA_SOUND_FLAG_STREAM flag.
👍 It also comes with a FLAC decoder in the core library and an Ogg
Vorbis one as part of the repo, …
🤩 … and even supports gapless switching between the intro and loop
files via a single declarative call to
ma_data_source_set_next()!
(Oh, and it also has ma_data_set_loop_point_in_pcm_frames()
for anyone who still believes in obviously and objectively
inferior out-of-band loop points.)
🟢 Pausing all sounds while the game window is not focused? It's not
automatic, but adding new functions to the sound interface and calling
ma_engine_stop() and ma_engine_start() does the
trick, and most importantly doesn't cause any samples to be lost in the
process.
🟡 Sound control is implemented in a lock-free way, allowing your main
game thread to call these at any time without causing glitches on the audio
thread. While that looks nice and optimal on the surface, you now have to
either believe in the soundness (ha) of the implementation, or verify that
atomic structure fields actually are enough to not cause any race
conditions (which I did for the calls that Shuusou Gyoku uses, and I didn't
find any). "It's all lock-free, don't worry about it" might be
easier, but I consider SDL's approach of just providing a mutex to
prevent the output callback from running while you mutate the sound state to
actually be simpler conceptually.
🟡 miniaudio adds 247 KB to the binary in its minimum
configuration, a bit more than expected. Some of that is bloat from effect
code that we never use, but it does include backends for all three Windows
audio subsystems (WASAPI, DirectSound, and WinMM).
✅ But perhaps most importantly: It natively supports all modern
operating systems that one could seriously want to port this game to, and
could be easily ported to any other backend, including
SDL.
Oh, and it's written by the same developer who also wrote the best FLAC
library back in 2018. And that's despite them being single-file C libraries,
which I consider to be massively overrated…
The drawback? Similar to Zig, it's only on version 0.11.18, and also focuses
on good high-level documentation at the expense of an API reference. Unlike
Zig though, the three issues I ran into turned out to be actual and fixable
bugs: Two minor
ones related to looping of streamed sounds shorter than 2 seconds which
won't ever actually affect us before we get into BGM modding, and a critical one that
added high-frequency corruption to any mono sound effect during its
expansion to stereo. The latter took days to track down – with symptoms
like these, you'd immediately suspect the bug to lie in the resampler or its
low-pass filter, both of which are so much more of a fickle and configurable
part of the conversion chain here. Compared to that, stereo expansion is so
conceptually simple that you wouldn't imagine anyone getting it wrong.
While the latter PR has been merged, the fix is still only part of the
dev branch and hasn't been properly released yet. Fortunately,
raylib is not affected by this bug: It does currently
ship version 0.11.16 of miniaudio, but its usage of the library predates
miniaudio's high-level API and it therefore uses a different,
non-SSE-optimized code path for its format conversions.
The only slightly tricky part of implementing a miniaudio backend for
Shuusou Gyoku lies in setting up multiple simultaneously playing instances
for each individual sound. The documentation and answers on the issue
tracker heavily push you toward miniaudio's resource manager and its file
abstractions to handle this use case. We surely could turn Shuusou Gyoku's
numeric sound effect IDs into fake file names, but it doesn't really fit the
existing architecture where the sound interface just receives in-memory .WAV
file buffers loaded from the SOUND.DAT packfile.
In that case, this seems to be the best way:
Call ma_decode_memory() to decode from any of the supported
audio formats to a buffer of raw PCM samples. At this point, you can
choose between
decoding into the original format the sound effect is stored in,
which would require it to be converted to the playback format every
time it's played, or
decoding into 32-bit floats (the native bit depth of the miniaudio
engine) and the native sampling rate of the playback device, which
avoids any further resampling and floating-point conversion, but takes
up more memory.
Nowadays, it's not clear at all which of the two approaches is faster.
Does it actually matter if we save the audio thread from doing all those
floating-point operations on every sample? Or is that no longer true these
days because the audio thread is probably running on a different CPU core,
the rest of the game largely doesn't touch the floating-point parts of your
CPU anyway, and you'd rather want to keep sound effects small so that they
can better fit into the CPU cache? That would be an interesting question to
benchmark, but just like the similar text rendering question from the last
blog posts, it doesn't matter for this tiny 2000s retro game. 😌
I went with 2) mainly because it simplified all the debugging I was doing.
At a sampling rate of 48,000 Hz, this increases the memory usage for
all sound effects from 379 KiB to 3.67 MiB. At least I'm not
channel-expanding all sound effects as well here…
We've seen earlier that mono➜stereo expansion
is SSE-optimized, so it's very hard to justify a further doubling of the
memory usage here.
Then, for each instance of the sound, call
ma_audio_buffer_ref_init() to create a reference
buffer with its own playback cursor, and
ma_sound_init_from_data_source() to create a new
high-level sound node that will play back the reference buffer.
As a side effect of hunting that one critical bug in miniaudio, I've now
learned a fair bit about audio resampling in general. You'll probably need
some knowledge about basic
digital signal behavior to follow this section, and that video is still
probably the best introduction to the topic.
So, how could this ever be an issue? The only time I ever consciously
thought about resampling used to be in the context of the Opus codec and its
enforced sampling rate of 48,000 Hz, and how Opus advocates
claim that resampling is a solved problem and nothing to worry about,
especially in the context of a lossy codec. Still, I didn't add Opus to
thcrap's BGM modding feature entirely because the mere thought of having to
downsample to 44,100 Hz in the decoder was off-putting enough. But even
if my worries were unfounded in that specific case: Recording the
Stereo Mix of Shuusou Gyoku's now two audio backends revealed that
apparently not every audio processing chain features an Opus-quality
resampler…
If we take a look at the material that resamplers actually have to work with
here, it quickly becomes obvious why their results are so varied. As
mentioned above, Shuusou Gyoku's sound effects use rather low sampling rates
that are pretty far away from the 48,000 Hz your audio device is most
definitely outputting. Therefore, any potential imaging noise across the
extended high-frequency range – i.e., from the original Nyquist frequencies
of 11,025 Hz/5,512.5 Hz up to the new limit of 24,000 Hz – is
still within the audible range of most humans and can clearly color the
resulting sound.
But it gets worse if the audio data you put into the resampler is
objectively defective to begin with, which is exactly the problem we're
facing with over half of Shuusou Gyoku's sound effects. Encoding them all as
8-bit PCM is definitely excusable because it was the turn of the millennium
and the resulting noise floor is masked by the BGM anyway, but the blatant
clipping and DC offsets definitely aren't:
KEBARI
TAME
LASER
LASER2
BOMB
SELECT
HIT
CANCEL
WARNING
SBLASER
BUZZ
MISSILE
JOINT
DEAD
SBBOMB
BOSSBOMB
ENEMYSHOT
HLASER
TAMEFAST
WARP
Waveforms for all 20 of Shuusou Gyoku's sound effects, in the order they
appear inside SOUND.DAT and with their internal names. We can
see quite an abundance of clipping, as well
as a significant DC
offset in WARNING, BUZZ, JOINT,
SBBOMB, and BOSSBOMB.
Wait a moment, true peaks? Where do those come from? And, equally
importantly, how can we even observe, measure, and store anything
above the maximum amplitude of a digital signal?
The answer to the first question can be directly derived from the Xiph.org
video I linked above: Digital signals are lollipop graphs, not stairsteps as
commonly depicted in audio editing software. Converting them back to an
analog signal involves constructing a continuous curve that passes through
each sample point, and whose frequency components stay below the Nyquist
frequency. And if the amplitude of that reconstructed wave changes too
strongly and too rapidly, the resulting curve can easily overshoot the
maximum digital amplitude of 0
dBFS even if none of the defined samples are above that limit.
So let's store the resampled output as a FLAC file and load it into Audacity
to visualize the clipped peaks… only to find all of them replaced with the
typical kind of clipping distortion? 😕 Turns out that I've stumbled over
the one case where the FLAC format isn't lossless and there's
actually no alternative to .WAV: FLAC just doesn't support
floating-point samples and simply truncates them to discrete integers during
encoding. When we measured inter-sample peaks above, we weren't only
resampling to a floating-point format to avoid any quantization to discrete
integer values, but also to make it possible to store amplitudes beyond the
0 dBFS point of ±1.0 in the first place. Once we lose that ability,
these amplitudes are clipped to the maximum value of the integer bit depth,
and baked into the waveform with no way to get rid of them again. After all,
the resampled file now uses a higher sampling rate, and the clipping
distortion is now a defined part of what the sound is.
Finally, storing a digital signal with inter-sample peaks in a
floating-point format also makes it possible for you to reduce the
volume, which moves these peaks back into the regular, unclipped amplitude
range. This is especially relevant for Shuusou Gyoku as you'll probably
never listen to sound effects at full volume.
Now that we understand what's going on there, we can finally compare the
output of various resamplers and pick a suitable one to use with miniaudio.
And immediately, we see how they fall into two categories:
High-quality resamplers are the ones I described earlier: They cleanly
recreate the signal at a higher sampling rate from its raw frequency
representation and thus add no high-frequency noise, but can lead to
inter-sample peaks above 0 dBFS.
Linear resamplers use much simpler math to merely interpolate
between neighboring samples. Since the newly interpolated samples can only
ever stay within 0 dBFS, this approach fully avoids inter-sample
clipping, but at the expense of adding high-frequency imaging noise that has
to then be removed using a low-pass filter.
miniaudio only comes with a linear resampler – but so does DirectSound as it
turns out, so we can get actually pretty close to how the game sounded
originally:
All of Shuusou Gyoku's sound effects combined and resampled into a
single 48,000 Hz / 32-bit float .WAV file, using GoldWave's File Merger tool. By
converting to 32-bit float first and then resampling, the
conversion preserved the exact frequency range of the original
22,050 Hz and 11,025 Hz files, even despite clipping. There
are small noise peaks across the entire frequency range, but they
only occur at the exact boundary between individual sound effects. These
are a simple result of the discontinuities that naturally occur in the
waveform when concatenating signals that don't start or end at a 0
sample.
As mentioned above, you'll only get this sound out of your DAC at lower
volumes where all of the resampled peaks still fit within 0 dBFS.
But you most likely will have reduced your volume anyway, because these
effects would be ear-splittingly loud otherwise.
The result of converting 1️⃣ into FLAC. The necessary bit depth
conversion from 32-bit float to 16-bit integers clamps any data above
0 dBFS or ±1.0f to the discrete
[-32,678; 32,767] range, the maximum value of such
an integer. The resulting straight lines at maximum amplitude in the
time domain then turn into distortion across the entire 24,000 Hz
frequency domain, which then remains a part of the waveform even at
lower volumes. The locations of the high-frequency noise exactly match
the clipped locations in the time-domain waveform images above.
The resulting additional distortion can be best heard in
BOSSBOMB, where the low source frequency ensures that any
distortion stays firmly within the hearing range of most humans.
All of Shuusou Gyoku's sound effects as played through DirectSound and
recorded through Stereo Mix. DirectSound also seems to use a linear
low-pass filter that leaves quite a bit of high-frequency noise in the
signals, making these effects sound crispier than they should be.
Depending on where you stand, this is either highly inaccurate and
something that should be fixed, or actually good because the sound
effects really benefit from that added high end. I myself am definitely
in the latter camp – and hey, this sound is the result of original game
code, so it is accurate at least in that regard.
All of Shuusou Gyoku's sound effects as converted by miniaudio and
directly saved to a file, with the same low-pass filter setting used in
the P0256 build. This first-order low-pass filter is a decent
approximation of DirectSound's resampler, even though it sounds slightly
crispier as the high-frequency noise is boosted a little further. By
default, miniaudio would use a 4th-order low-pass filter, so
this is the second-lowest resampling quality you can get, short of
disabling the low-pass filter altogether.
Conversion results when using miniaudio's 8th-order low-pass
filter for resampling, the highest quality supported. This is the
closest we can get to the reference conversion without using a custom
resampler. If we do want to go for perfect accuracy though, we might as
well go
for 1️⃣ directly?
These spectrum images were initially created using ffmpeg's -lavfi
showspectrumpic=mode=combined:s=1280x720 filter. The samples
appear in the same order as in the waveform above.
And yes, these are indeed the first videos on this blog to have sound! I
spent another push on preparing the
📝 video conversion pipeline for audio
support, and on adding the highly important volume control to the player.
Web video codecs only support lossy audio, so the sound in these videos will
not exactly match the spectrum image, but the lossless source files do
contain the original audio as uncompressed PCM streams.
Compared to that whole mess of signals and noise, keyboard and joypad input
is indeed much simpler. Thanks to SDL, it's almost trivial, and only
slightly complicated because SDL offers two subsystems with seemingly
identical APIs:
SDL_GameController provides a consistent interface for the typical kind
of modern gamepad with two analog sticks, a D-pad, and at least 4 face and 2
shoulder buttons. This API is implemented by simply combining SDL_Joystick
with a
long list of mappings for specific controllers, and therefore doesn't
work with joypads that don't match this standard.
To match Shuusou Gyoku's original WinMM backend, we'd ideally want to keep
the best aspects from both APIs but without being restricted to
SDL_GameController's idea of a controller. The Joy
Pad menu just identifies each button with a numeric ID, so
SDL_Joystick would be a natural fit. But what do we do about directional
controls if SDL_Joystick doesn't tell us which joypad axes correspond to the
X and Y directions, and we don't have the SDL-recommended configuration UI yet?
Doing that right would also mean supporting
POV hats and D-pads, after all… Luckily, all joypads we've tested map
their main X axis to ID 0 and their main Y axis to ID 1, so this seems like
a reasonable default guess.
The necessary consolidation of the game's original input handling uncovered
several minor bugs around the High Score and Game Over screen that I
sufficiently described in the release notes of the new build. But it also
revealed an interesting detail about the Joy Pad
screen: Did you know that Shuusou Gyoku lets you unbind all these
actions by pressing more than one joypad button at the same time? The
original game indicated unbound actions with a [Button
0] label, which is pretty confusing if you have ever programmed
anything because you now no longer know whether the game starts numbering
buttons at 0 or 1. This is now communicated much more clearly.
ESC is not bound to any joypad button in
either screenshot, but it's only really obvious in the P0256
build.
With that, we're finally feature-complete as far as this delivery is
concerned! Let's send a build over to the backers as a quick sanity check…
a~nd they quickly found a bug when running on Linux and Wine. When holding a
button, the game randomly stops registering directional inputs for a short
while on some joypads? Sounds very much like a Wine bug, especially if the
same pad works without issues on Windows.
And indeed, on certain joypads, Wine maps the buttons to completely
different and disconnected IDs, as if it simply invents new buttons or axes
to fill the resulting gaps. Until we can differentiate joypad bindings
per controller, it's therefore unlikely that you can use the same joypad
mapping on both Windows and Linux/Wine without entering the Joy Pad menu and remapping the buttons every time you
switch operating systems.
Still, by itself, this shouldn't cause any issues with my SDL event handling
code… except, of course, if I forget a break; in a switch case.
🫠
This completely preventable implicit fallthrough has now caused a few hours
of debugging on my end. I'd better crank up the warning level to keep this
from ever happening again. Opting into this specific warning also revealed
why we haven't been getting it so far: Visual Studio did gain a whole host
of new warnings related to the C++ Core
Guidelines a while ago, including the one I
was looking for, but actually getting the compiler to throw these
requires activating
a separate static analysis mode together with a plugin, which
significantly slows down build times. Therefore I only activate them for
release builds, since these already take long enough.
Since all that input debugging already started a 5th push, I
might as well fill that one by restoring the original screenshot feature.
After all, it's triggered by a key press (and is thus related to the input
backend), reads the contents of the frame buffer (and is thus related to the
graphics backend), and it honestly looks bad to have this disclaimer in the
release notes just because we're one small feature away from 100% parity
with pbg's original binary.
Coincidentally, I had already written code to save a DirectDraw surface to a
.BMP file for all the debugging I did in the last delivery, so we were
basically only missing filename generation. Except that Shuusou
Gyoku's original choice of mapping screenshots to the PrintScreen key did
not age all too well:
And as of Windows 11, the OS takes full control of the key by binding it
to the Snipping Tool by default, complete with a UI that politely steals
focus when hitting that key.
As a result, both Arandui and I independently arrived at the
idea of remapping screenshots to the P key, which is the same screenshot key
used by every Windows Touhou game since TH08.
The rest of the feature remains unchanged from how it was in pbg's original
build and will save every distinct frame rendered by the game (i.e., before
flipping the two framebuffers) to a .BMP file as long as the P key is being
held. At a 32-bit color depth, these screenshots take up 1.2 MB per
frame, which will quickly add up – especially since you'll probably hold the
P key for more than 1/60 of a second and therefore end
up saving multiple frames in a row. We should probably compress
them one day.
Since I already translated some of Shuusou Gyoku's ASM code to C++ during
the Zig experiment, it made sense to finish the fifth push by covering the
rest of those functions. The integer math functions are used all throughout
the game logic, and are the main reason why this goal is important for a
Linux port, or any port to a 64-bit architecture for that matter. If you've
ever read a micro-optimization-related blog post, you'll know that hand-written ASM is a great recipe that often results in the finest jank, and the game's square root function definitely delivers in that regard, right out of the gate.
What slightly differentiates this algorithm from the typical definition of
an integer
square root is that it rounds up: In real numbers, √3 is
≈ 1.73, so isqrt(3) returns 2 instead of 1. However, if
the result is always rounded down, you can determine whether you have to
round up by simply squaring the calculated root and comparing it to the radicand. And even that
is only necessary if the difference between the two doesn't naturally fall
out of the algorithm – which is what also happens with Shuusou Gyoku's
original ASM code, but pbg
didn't realize this and squared the result regardless.
That's one suboptimal detail already. Let's call the original ASM function
in a loop over the entire supported range of radicands from 0 to
231 and produce a list of results that I can verify my C++
translation against… and watch as the function's linear time complexity with
regard to the radicand causes the loop to run for over 15 hours on my
system. 🐌 In a way, I've found the literal opposite of Q_rsqrt()
here: Not fast, not inverse, no bit hacks, and surely without the
awe-inspiring kind of WTF.
I really didn't want to run the same loop over a
literal C++ translation of the same algorithm afterward. Calculating
integer square roots is a common problem with lots of solutions, so let's
see if we can go better than linear.
And indeed, Wikipedia
also has a bitwise algorithm that runs in logarithmic time, uses only
additions, subtractions, and bit shifts, and even ends up with an error term
that we can use to round up the result as necessary, without a
multiplication. And this algorithm delivers the exact same results over the
exact same range in… 50 seconds. 🏎️ And that's with the I/O to print
the first value that returns each of the 46,341 different square root
results.
"But wait a moment!", I hear you say. "Why are you bothering with
an integer square root algorithm to begin with? Shouldn't good old
round(sqrt(x)) from <math.h> do the trick
just fine? Our CPUs have had SSE for a long time, and this probably compiles
into the single SQRTSD instruction. All that extra
floating-point hardware might mean that this instruction could even run in
parallel with non-SSE code!"
And yes, all of that is technically true. So I tested it, and my very
synthetic and constructed micro-benchmark did indeed deliver the same
results in… 48 seconds. That's not enough of a
difference to justify breaking the spirit of treating the FPU as lava that
permeates Shuusou Gyoku's code base. Besides, it's not used for that much to
begin with:
pre-calculating the 西方Project lens ball effect
the fade animation when entering and leaving stages
rendering the circular part of stationary lasers
pulling items to the player when bombing
After a quick C++ translation of the RNG function that spells out a 32-bit
multiplication on a 32-bit CPU using 16-bit instructions, we reach the final
pieces of ASM code for the 8-bit atan2() and trapezoid
rendering. These could actually pass for well-written ASM code in how they
express their 64-bit calculations: atan8() prepares its 64-bit
dividend in the combined EDX and EAX registers in
a way that isn't obvious at all from a cursory look at the code, and the
trapezoid functions effectively use Q32.32 subpixels. C++ allows us to
cleanly model all these calculations with 64-bit variables, but
unfortunately compiles the divisions into a call to a comparatively much
more bloated 64-bit/64-bit-division polyfill function. So yeah, we've
actually found a well-optimized piece of inline assembly that even Visual
Studio 2022's optimizer can't compete with. But then again, this is all
about code generation details that are specific to 32-bit code, and it
wouldn't be surprising if that part of the optimizer isn't getting much
attention anymore. Whether that optimization was useful, on the other hand…
Oh well, the new C++ version will be much more efficient in 64-bit builds.
And with that, there's no more ASM code left in Shuusou Gyoku's codebase,
and the original DirectXUTYs directory is slowly getting
emptier and emptier.
Phew! Was that everything for this delivery? I think that was everything.
Here's the new build, which checks off 7 of the 15 remaining portability
boxes:
Next up: Taking a well-earned break from Shuusou Gyoku and starting with the
preparations for multilingual PC-98 Touhou translatability by looking at
TH04's and TH05's in-game dialog system, and definitely writing a shorter
blog post about all that…
Stripe is now
properly integrated into this website as an alternative to PayPal! Now, you
can also financially support the project if PayPal doesn't work for you, or
if you prefer using a
provider out of Stripe's greater variety. It's unfortunate that I had to
ship this integration while the store is still sold out, but the Shuusou
Gyoku OpenGL backend has turned out way too complicated to be finished next
to these two pushes within a month. It will take quite a while until the
store reopens and you all can start using Stripe, so I'll just link back to
this blog post when it happens.
Integrating Stripe wasn't the simplest task in the world either. At first,
the Checkout API
seems pretty friendly to developers: The entire payment flow is handled on
the backend, in the server language of your choice, and requires no frontend
JavaScript except for the UI feedback code you choose to write. Your
backend API endpoint initiates the Stripe Checkout session, answers with a
redirect to Stripe, and Stripe then sends a redirect back to your server if
the customer completed the payment. Superficially, this server-based
approach seems much more GDPR-friendly than PayPal, because there are no
remote scripts to obtain consent for. In reality though, Stripe shares
much more potential personal data about your credit card or bank
account with a merchant, compared to PayPal's almost bare minimum of
necessary data.
It's also rather annoying how the backend has to persist the order form
information throughout the entire Checkout session, because it would
otherwise be lost if the server restarts while a customer is still busy
entering data into Stripe's Checkout form. Compare that to the PayPal
JavaScript SDK, which only POSTs back to your server after the
customer completed a payment. In Stripe's case, more JavaScript actually
only makes the integration harder: If you trigger the initial payment
HTTP request from JavaScript, you will have
to improvise a bit to avoid the CORS error when redirecting away to a
different domain.
But sure, it's all not too bad… for regular orders at least. With
subscriptions, however, things get much worse. Unlike PayPal, Stripe
kind of wants to stay out of the way of the payment process as much as
possible, and just be a wrapper around its supported payment methods. So if
customers aren't really meant to register with Stripe, how would they cancel
their subscriptions?
Answer: Through
the… merchant? Which I quite dislike in principle, because why should
you have to trust me to actually cancel your subscription after you
requested it? It also means that I probably should add some sort of UI for
self-canceling a Stripe subscription, ideally without adding full-blown user
accounts. Not that this solves the underlying trust issue, but it's more
convenient than contacting me via email or, worse, going through your bank
somehow. Here is how my solution works:
When setting up a Stripe subscription, the server will generate a random
ID for authentication. This ID is then used as a salt for a hash
of the Stripe subscription ID, linking the two without storing the latter on
my server.
The thank you page, which is parameterized with the Stripe
Checkout session ID, will use that ID to retrieve the subscription
ID via an API call to Stripe, and display it together with the above
salt. This works indefinitely – contrary to what the expiry field in the
Checkout session object suggests, Stripe sessions are indeed stored
forever. After all, Stripe also displays this session information in a
merchant's transaction log with an excessive amount of detail. It might have
been better to add my own expiration system to these pages, but this had
been taking long enough already. For now, be aware that sharing the link to
a Stripe thank you page is equivalent to sharing your subscription
cancellation password.
The salt is then used as the key for a subscription management page. To
cancel, you visit this page and enter the Stripe subscription ID to confirm.
The server then checks whether the salt and subscription ID pair belong to
each other, and sends the actual cancellation
request back to Stripe if they do.
I might have gone a bit overboard with the crypto there, but I liked the
idea of not storing any of the Stripe session IDs in the server database.
It's not like that makes the system more complex anyway, and it's nice to
have a separate confirmation step before canceling a subscription.
But even that wasn't everything I had to keep in mind here. Once you
switch from test to production mode for the final tests, you'll notice that
certain SEPA-based
payment providers take their sweet time to process and activate new
subscriptions. The Checkout session object even informs you about that, by
including a payment status field. Which initially seems just like
another field that could indicate hacking attempts, but treating it as such
and rejecting any unpaid session can also reject perfectly valid
subscriptions. I don't want all this control… 🥲
Instead, all I can do in this case is to tell you about it. In my test, the
Stripe dashboard said that it might take days or even weeks for the initial
subscription transaction to be confirmed. In such a case, the respective
fraction of the cap will unfortunately need to remain red for that entire time.
And that was 1½ pushes just to replicate the basic functionality of a simple
PayPal integration with the simplest type of Stripe integration. On the
architectural site, all the necessary refactoring work made me finally
upgrade my frontend code to TypeScript at least, using the amazing esbuild to handle transpilation inside
the server binary. Let's see how long it will now take for me to upgrade to
SCSS…
With the new payment options, it makes sense to go for another slight price
increase, from up to per push.
The amount of taxes I have to pay on this income is slowly becoming
significant, and the store has been selling out almost immediately for the
last few months anyway. If demand remains at the current level or even
increases, I plan to gradually go up to by the end
of the year. 📝 As📝 usual,
I'm going to deliver existing orders in the backlog at the value they were
originally purchased at. Due to the way the cap has to be calculated, these
contributions now appear to have increased in value by a rather awkward
13.33%.
This left ½ of a push for some more work on the TH01 Anniversary Edition.
Unfortunately, this was too little time for the grand issue of removing
byte-aligned rendering of bigger sprites, which will need some additional
blitting performance research. Instead, I went for a bunch of smaller
bugfixes:
ANNIV.EXE now launches ZUNSOFT.COM if
MDRV98 wasn't resident before. In hindsight, it's completely obvious
why this is the right thing to do: Either you start
ANNIV.EXE directly, in which case there's no resident
MDRV98 and you haven't seen the ZUN Soft logo, or you have
made a single-line edit to GAME.BAT and replaced
op with anniv, in which case MDRV98 is
resident and you have seen the logo. These are the two
reasonable cases to support out of the box. If you are doing
anything else, it shouldn't be that hard to adjust though?
You might be wondering why I didn't just include all code of
ZUNSOFT.COM inside ANNIV.EXE together with
the rest of the game. The reason: ZUNSOFT.COM has
almost nothing in common with regular TH01 code. While the rest of
TH01 uses the custom image formats and bad rendering code I
documented again and again during its RE process,
ZUNSOFT.COM fully relies on master.lib for everything
about the bouncing-ball logo animation. Its code is much closer to
TH02 in that respect, which suggests that ZUN did in fact write this
animation for TH02, and just included the binary in TH01 for
consistency when he first sold both games together at Comiket 52.
Unlike the 📝 various bad reasons for splitting the PC-98 Touhou games into three main executables,
it's still a good idea to split off animations that use a completely
different set of rendering and file format functions. Combined with
all the BFNT and shape rendering code, ZUNSOFT.COM
actually contains even more unique code than OP.EXE,
and only slightly less than FUUIN.EXE.
The optional AUTOEXEC.BAT is now correctly encoded in
Shift-JIS instead of accidentally being UTF-8, fixing the previous
mojibake in its final ECHO line.
The command-line option that just adds a stage selection without
other debug features (anniv s) now works reliably.
This one's quite interesting because it only ever worked
because of a ZUN bug. From a superficial look at the code, it
shouldn't: While the presence of an 's' branch proves
that ZUN had such a mode during development, he nevertheless forgot
to initialize the debug flag inside the resident structure within
this branch. This mode only ever worked because master.lib's
resdata_create() function doesn't clear the resident
structure after allocation. If anything on the system previously
happened to write something other than 0x00,
0x01, or 0x03 to the specific byte that
then gets repurposed as the debug mode flag, this lack of
initialization does in fact result in a distinct non-test and
non-debug stage selection mode.
This is what happens on a certain widely circulated .HDI copy of
TH01 that boots MS-DOS 3.30C. On this system, the memory that
master.lib will allocate to the TH01 resident structure was
previously used by DOS as stack for its kernel, which left the
future resident debug flag byte at address 9FF6:0012 at
a value of 0x12. This might be the entire reason why
game s is even widely documented to trigger a stage
selection to begin with – on the widely circulated TH04 .HDI that
boots MS-DOS 6.20, or on DOSBox-X, the s parameter
doesn't work because both DOS systems leave the resident debug flag
byte at 0x00. And since ANNIV.EXE pushes
MDRV98 into that area of conventional DOS RAM, anniv s
previously didn't work even on MS-DOS 3.30C.
Both bugs in the
📝 1×1 particle system during the Mima fight
have been fixed. These include the off-by-one error that killed off the
very first particle on the 80th
frame and left it in VRAM, and, just like every other entity type, a
replacement of ZUN's EGC unblitter with the new pixel-perfect and fast
one. Until I've rearchitected unblitting as a whole, the particles will
now merely rip barely visible 1×1 holes into the sprites they overlap.
The bomb value shown in the lowest line of the in-game
debug mode output is now right-aligned together with the rest of the
values. This ensures that the game always writes a consistent number
of characters to TRAM, regardless of the magnitude of the
bomb value, preventing the seemingly wrong
timer values that appeared in the original game
whenever the value of the bomb variable changed to a
lower number of digits:
Finally, I've streamlined VRAM page access changes, which allowed me to
consistently replace ZUN's expensive function call with the optimal two
inlined x86 instructions. Interestingly, this change alone removed
2 KiB from the binary size, which is almost all of the difference
between 📝 the P0234-1 release and this
one. Let's see how much longer we can make each new release of
ANNIV.EXE smaller than the previous one.
The final point, however, raised the question of what we're now going to do
about
📝 a certain issue in the 地獄/Jigoku Bad Ending.
ZUN's original expensive way of switching the accessed VRAM page was the
main reason behind the lag frames on slower PC-98 systems, and
search-replacing the respective function calls would immediately get us to
the optimized version shown in that blog post. But is this something we
actually want? If we wanted to retain the lag, we could surely preserve that
function just for this one instance… The discovery of this issue
predates the clear distinction between bloat, quirks, and bugs, so it makes
sense to first classify what this issue even is. The distinction comes all
down to observability, which I defined as changes to rendered frames
between explicitly defined frame boundaries. That alone would be enough to
categorize any cause behind lag frames as bloat, but it can't hurt to be
more explicit here.
Therefore, I now officially judge observability in terms of an infinitely
fast PC-98 that can instantly render everything between two explicitly
defined frames, and will never add additional lag frames. If we plan to port
the games to faster architectures that aren't bottlenecked by disappointing
blitter chips, this is the only reasonable assumption to make, in my
opinion: The minimum system requirements in the games' README files are
minimums, after all, not recommendations. Chasing the exact frame
drop behavior that ZUN must have experienced during the time he developed
these games can only be a guessing game at best, because how can we know
which PC-98 model ZUN actually developed the games on? There might even be
more than one model, especially when it comes to TH01 which had been in
development for at least two years before ZUN first sold it. It's also not
like any current PC-98 emulator even claims to emulate the specific timing
of any existing model, and I sure hope that nobody expects me to import a
bunch of bulky obsolete hardware just to count dropped frames.
That leaves the tearing, where it's much more obvious how it's a bug. On an
infinitely fast PC-98, the ドカーン
frame would never be visible, and thus falls into the same category as the
📝 two unused animations in the Sariel fight.
With only a single unconditional 2-frame delay inside the animation loop, it
becomes clear that ZUN intended both frames of the animation to be displayed
for 2 frames each:
No tearing, and 34 frames in total for the first of the two
instances of this animation.
Next up: Taking the oldest still undelivered push and working towards TH04
position independence in preparation for multilingual translations. The
Shuusou Gyoku OpenGL backend shouldn't take that much longer either,
so I should have lots of stuff coming up in May afterward.
P0218
Website (Video pipeline, part 1/2: Preparations / Source file gathering)
P0219
Website (Video pipeline, part 2/2: Encoding / Tweaking settings / Caching)
P0220
Website (Video player, part 1/3: Basic controls and frame-accurate seeking)
P0221
Website (Video player, part 2/3: Tabs and markers)
P0222
Website (Video player, part 3/3: Dynamic captions and useful fullscreen mode)
💰 Funded by:
[Anonymous], Yanga, Ember2528
🏷️ Tags:
Yes, I'm still alive. This delivery was just plagued by all of the worst
luck: Data loss, physical hard drive failure, exploding phone batteries,
minor illness… and after taking 4 weeks to recover from all of that, I had
to face this beast of a task. 😵
Turns out that neither part of improving video performance and usability on
this blog was particularly easy. Decently encoding the videos into all
web-supported formats required unexpected trade-offs even for the low-res,
low-color material we are working with, and writing custom video player
controls added the timing precision resistance of HTML
<video> on top of the inherent complexity of frontend web
development. Why did this need to be 800 lines of commented JavaScript and
200 lines of commented CSS, and consume almost more than 5 pushes?!
Apparently, the latest price increase also seemed to have raised the minimum
level of acceptable polish in my work, since that's more than the maximum of
3.67 pushes it should have taken. To fund the rest, I stole some of the
reserved JIS trail word rendering research pushes, which means that the next
towards anything will go back towards that goal.
The codec situation is especially sad because it seems like so much of a
solved problem. ZMBV, the lossless capture codec introduced by DOSBox, is
both very well suited for retro game footage and remarkably simple too:
DOSBox-X's implementation of both an encoder and decoder comes in at under
650 lines of C++, excluding the Deflate implementation. Heck, the AVI
container around the codec is more complicated to write than the
compressed video data itself, and AVI is already the easiest choice you have
for a widely supported video container format.
Currently, this blog contains 9:02 minutes of video across 86 files, with a
total frame count of 24,515. In case this post attracts a general video
encoding audience that isn't familiar with what I'm encoding here: The
maximum resolution is 640×400, and most of the video uses 16 colors, with
some parts occasionally using more. With ZMBV, the lossless source files
take up 43.8 MiB, and that's even with AVI's infamously bad
overhead. While you can always spend more time on any compression task and
precisely tune your algorithm to match your source data even better,
43.8 MiB looks like a more than reasonable amount for this type of
content.
Especially compared with what I actually have to ship here, because sadly,
ZMBV is not supported by browsers. 😔 Writing a WebAssembly player for ZMBV
would have certainly been interesting, but it already took 5 pushes to get
to what we have now. So, let's instead shell out to ffmpeg and build a
pipeline to convert ZMBV to the ill-suited codecs supported by web browsers,
replacing the previously committed VP9 and VP8 files. From that point, we
can then look into AV1, the latest and greatest web-supported video codec,
to save some additional bandwidth.
But first, we've got to gather all the ZMBV source files. While I was
working on the 📝 2022-07-10 blog post, I
noticed some weirdly washed-out colors in the converted videos, leading to
the shocking realization that my previous, historically grown conversion
script didn't actually encode in a lossless way. 😢 By extension,
this meant that every video before that post could have had minor
discolorations as well.
For the majority of videos, I still had the original ZMBV capture files
straight out of DOSBox-X, and reproducing the final videos wasn't too big of
a deal. For the few cases where I didn't, I went the extra mile, took the
VP9 files, and manually fixed up all the minor color errors based on
reference videos from the same gameplay stage. There might be a huge ffmpeg
command line with a complicated filter graph to do the job, but for such a
small 4-digit number of frames, it is much more straightforward to just dump
each frame as an image and perform the color replacement with ImageMagick's
-opaque and -fill options.
So, time to encode our new definite collection of source files into AV1, and
what the hell, how slow is this codec? With ffmpeg's
libaom-av1, fully encoding all 86 videos takes almost 9
hours on my mid-range
development system, regardless of the quality selected.
But sure, the encoded videos are managed by a cache, and this obviously only
needs to be done once. If the results are amazing, they might even justify
these glacial encoding speeds. Unfortunately, they don't: In its lossless
-crf 0 mode, AV1 performs even worse than VP9, taking up
222 MiB rather than 182 MiB. It might not sound bad now,
but as we're later going to find out, we want to have a lot of
keyframes in these videos, which will blow up video sizes even further.
So, time to go lossy and maybe take a deep dive into AV1 tuning? Turns out
that it only gets worse from there:
The alternative libsvtav1 encoder is fast and creates small
files… but even on the highest-quality settings, -crf 0 and
-qp 0, the video quality resembled the terrible x264 YUV420P
format that Twitter enforces on uploaded videos.
I don't remember the librav1e results, but they sure
weren't convincing either.
libaom-av1's -usage realtime option is a
complete joke. 771 MiB for all videos, and it doesn't even compress
in real time on my system, more like 2.5× real-time. For comparison,
a certain stone-age technology by the name of "animated GIF" would take
54.3 MiB, encode in sub-realtime (0.47×), and the only necessary tuning
you need is an easily
googled palette generation and usage filter. Why can't I just use
those in a <video> tag?! These results have
clearly proven the top-voted just use modern video codecs Stack
Overflow answers wrong.
What you're actually supposed to do is to drop -cpu-used to
maybe 2 or 3, and then selectively add back prediction filters that suit
your type of content. In our case, these are
and maybe others, depending on much time you want to waste.
Because that's what all this tuning ended up being: a complete waste of
time. No matter which tuning options I tried, all they did was cut down
encoding time in exchange for slightly larger files on average. If there is
a magic tuning option that would suddenly cause AV1 to maybe even beat ZMBV,
I haven't found it. Heck, at particularly low settings,
-enable-intrabc even caused blocky glitches with certain pellet
patterns that looked like the internal frame block hashes were colliding all
over the place. Unfortunately, I didn't save the video where it happened.
So yeah, if you've already invested the computation time and encoded your
content by just specifying a -crf value and keeping the
remaining settings at their time-consuming defaults, any further tuning will
make no difference. Which is… an interesting choice from a usability
perspective. I would have expected the exact
opposite: default to a reasonably fast and efficient profile, and leave the
vast selection of tuning options for those people to explore who do
want to wait 5× as long for their encoder for that additional 5% of
compression efficiency. On the other hand, that surely is one way to get
people to extensively study your glorious engineering efforts, I guess? You
know what would maybe even motivate people to intrinsically do that?
Good documentation, with examples of the intent behind every option and its
optimal use case. Nobody needs long help strings that just spell out all of
the abbreviations that occur in the name of the option…
But hey, that at least means there's no reason to not use anything but ZMBV
for storing and archiving the lossless source files. Best compression
efficiency, encodes in real-time, and the files are much easier to edit.
OK, end of rant. To understand why anyone could be hyped about AV1 to begin
with, we just have to compare it to VP9, not to ZMBV. In that light, AV1
is pretty impressive even at -crf 1, compressing all 86
videos to 68.9 MiB, and even preserving 22.3% of frames completely
losslessly. The remaining frames exhibit the exact kind of quality loss
you'd want for retro game footage: Minor discoloration in individual pixels,
so minuscule that subtracting the encoded image from the source yields an
almost completely black image. Even after highlighting the errors by
normalizing such a difference image, they are barely visible even if you
know where to look. If "compressed PNG size of the normalized difference
between ZMBV and AV1 -crf 1" is a useful metric, this would be
its median frame among the 77.7% of non-lossless frames:
Whether you can actually spot the difference is pretty much down to the
glass between the physical pixels and your eyes. In any case, it's very
hard, even if you know where to look. As far as I'm concerned, I can
confidently call this "visually lossless", and it's definitely good enough
for regular watching and even single-frame stepping on this blog.
Since the appeal of the original lossless files is undeniable though, I also
made those more easily available. You can directly download the one for the
currently active video with the ⍗ button in the new video player – or directly
get all of them from the Git repository if you don't like clicking.
Unfortunately, even that only made up for half of the complexity in this
pipeline. As impressive as the AV1 -crf 1 result may be, it
does in fact come with the drawback of also being impressively heavy to
decode within today's browsers. Seeking is dog slow, with even the latencies
for single-frame stepping being way beyond what I'd consider
tolerable. To compensate, we have to invest another 78 MiB into turning
every 10th frame into a keyframe until single-stepping through an
entire video becomes as fast as it could be on my system.
But fine, 146 MiB, that's still less than the 178 MiB that the old
committed VP9 files used to take up. However, we still want to support VP9
for older browsers, older
hardware, and people who use Safari. And it's this codec where keyframes
are so bad that there is no clear best solution, only compromises. The main
issue: The lower you turn VP9's -crf value, the slower the
seeking performance with the same number of keyframes. Conversely,
this means that raising quality also requires more keyframes for the same
seeking performance – and at these file sizes, you really don't want to
raise either. We're talking 1.2 GiB for all 86 videos at
-crf 10 and -g 5, and even on that configuration,
seeking takes 1.3× as long as it would in the optimal case.
Thankfully, a full VP9 encode of all 86 videos only takes some 30 minutes as
opposed to 9 hours. At that speed, it made sense to try a larger number of
encoding settings during the ongoing development of the player. Here's a
table with all the trials I've kept:
Codec
-crf
-g
Other parameters
Total size
Seek time
VP9
32
20
-vf format=yuv420p
111 MiB
32 s
VP8
10
30
-qmin 10 -qmax 10 -b:v 1G
120 MiB
32 s
VP8
7
30
-qmin 7 -qmax 7 -b:v 1G
140 MiB
32 s
AV1
1
10
146 MiB
32 s
VP8
10
20
-qmin 10 -qmax 10 -b:v 1G
147 MiB
32 s
VP8
6
30
-qmin 6 -qmax 6 -b:v 1G
149 MiB
32 s
VP8
15
10
-qmin 15 -qmax 15 -b:v 1G
177 MiB
32 s
VP8
10
10
-qmin 10 -qmax 10 -b:v 1G
225 MiB
32 s
VP9
32
10
-vf format=yuv422p
329 MiB
32 s
VP8
0-4
10
-qmin 0 -qmax 4 -b:v 1G
376 MiB
32 s
VP8
5
30
-qmin 5 -qmax 5 -b:v 1G
169 MiB
33 s
VP9
63
40
47 MiB
34 s
VP9
32
20
-vf format=yuv422p
146 MiB
34 s
VP8
4
30
-qmin 0 -qmax 4 -b:v 1G
192 MiB
34 s
VP8
4
40
-qmin 4 -qmax 4 -b:v 1G
168 MiB
35 s
VP9
25
20
-vf format=yuv422p
173 MiB
36 s
VP9
15
15
-vf format=yuv422p
252 MiB
36 s
VP9
32
25
-vf format=yuv422p
118 MiB
37 s
VP9
20
20
-vf format=yuv422p
190 MiB
37 s
VP9
19
21
-vf format=yuv422p
187 MiB
38 s
VP9
32
10
553 MiB
38 s
VP9
32
10
-tune-content screen
553 MiB
VP9
32
10
-tile-columns 6 -tile-rows 2
553 MiB
VP9
15
20
-vf format=yuv422p
207 MiB
39 s
VP9
10
5
1210 MiB
43 s
VP9
32
20
264 MiB
45 s
VP9
32
20
-vf format=yuv444p
215 MiB
46 s
VP9
32
20
-vf format=gbrp10le
272 MiB
49 s
VP9
63
24 MiB
67 s
VP8
0-4
-qmin 0 -qmax 4 -b:v 1G
119 MiB
76 s
VP9
32
107 MiB
170 s
The bold rows correspond to the final encoding choices that
are live right now. The seeking time was measured by holding → Right on
the 📝 cheeto dodge strategy video.
Yup, the compromise ended up including a chroma subsampling conversion to
YUV422P. That's the one thing you don't want to do for retro pixel
graphics, as it's the exact cause behind washed-out colors and red fringing
around edges:
The worst example of chroma subsampling in a VP9-encoded file according
to the above metric, from frame 130 (0-based) of
📝 Sariel's restored leaf "spark" animation,
featuring smeared-out contours and even an all-around darker image,
blowing up the image to a whopping 3653 colors. It's certainly an
aesthetic.
But there simply was no satisfying solution around the ~200 MiB mark
with RGB colors, and even this compromise is still a disappointment in both
size and seeking speed. Let's hope that Safari
users do get AV1 support soon… Heck, even VP8, with its exclusive
support for YUV420P, performs much better here, with the impact of
-crf on seeking speed being much less pronounced. Encoding VP8
also just takes 3 minutes for all 86 videos, so I could have experimented
much more. Too bad that it only matters for really ancient systems…
Two final takeaways about VP9:
-tune-content screen and the tile options make no
difference at all.
All results used two-pass encoding. VP9 is the only codec where two
passes made a noticeable difference, cutting down the final encoded size
from 224 MiB to 207 MiB. For AV1, compression even seems to be
slightly worse with two passes, yielding 154,201,892 bytes rather than the
153,643,316 bytes we get with a single pass. But that's a difference of
0.36%, and hardly significant.
Alright, now we're done with codecs and get to finish the work on the
pipeline with perhaps its biggest advantage. With a ffmpeg conversion
infrastructure in place, we can also easily output a video's first frame as
a poster image to be passed into the <video> tag.
If this image is kept at the exact resolution of the video, the browser
doesn't need to wait for an indeterminate amount of "video metadata" to be
loaded, and can reserve the necessary space in the page layout much faster
and without any of these dreaded loading spinners. For the big
/blog page, this cuts down the minimum amount of required
resources from 69.5 MB to 3.6 MB, finally making it usable again without
waiting an eternity for the page to fully load. It's become pretty bad, so I
really had to prioritize this task before adding any more blog posts on top.
That leaves the player itself, which is basically a sum of lots of little
implementation challenges. Single-frame stepping and seeking to discrete
frames is the biggest one of them, as it's technically
not possible within the <video> tag, which only
returns the current time as a continuous value in seconds. It only sort
of works for us because the backend can pass the necessary FPS and frame
count values to the frontend. These allow us to place a discrete grid of
frame "frets" at regular intervals, and thus establish a consistent mapping
from frames to seconds and back. The only drawback here is a noticeably
weird jump back by one frame when pausing a video within the second half of
a frame, caused by snapping the continuous time in seconds back onto the
frame grid in order to maintain a consistent frame counter. But the whole
feature of frame-based seeking more than makes up for that.
The new scrubbable timeline might be even nicer to use with a mouse or a
finger than just letting a video play regularly. With all the tuning work I
put into keyframes, seeking is buttery smooth, and much better than the
built-in <video> UI of either Chrome or Firefox.
Unfortunately, it still costs a whole lot of CPU, but I'd say it's worth it.
🥲
Finally, the new player also has a few features that might not be
immediately obvious:
Keybindings for almost everything you might want them for, indicated by
hovering on top of each button. The tab switchers additionally support the
↑ Up and ↓ Down keys to cycle through all tabs, or the number keys
to jump to a specific tab. Couldn't find a way to indicate these mappings in
the UI yet.
Per-video captions now reserve the maximum height of any caption in the
layout. This prevents layout reflows when switching through such videos,
which previously caused quite annoying lag on the big /blog
page.
Useful fullscreen modes on both desktop and mobile, including all
markers and the video caption. Firefox made this harder than it needed to
be, and if it weren't for display: contents, the implementation
would have been even worse. In the end though, we didn't even need any video
pixel sizes from the backend – just as it should be…
… and supporting Firefox was definitely worth it, as it's the only
browser to support nearest-neighbor interpolation on videos.
As some of the Unicode codepoints on the buttons aren't covered by the
default fonts of some operating systems, I've taken them from the Catrinity font, licensed under the SIL
Open Font License. With all
the edits I did on this font, that license definitely was necessary. I
hope I applied it correctly though; it's not straightforward at all how to
properly license a Modified Version of an original font with a
Reserved Font Name.
And with that, development hell is over, and I finally get to return to the
core business! Just more than one month late.
Next up: Shipping the oldest still pending order, covering the TH04/TH05
ending script format. Meanwhile, the Seihou community also wants to keep
investing in Shuusou Gyoku, so we're also going to see more of that on the
side.
The "bad" news first: Expanding to Stripe in order to support Google Pay
requires bureaucratic effort that is not quite justified yet, and would only
be worth it after the next price increase.
Visualizing technical debt has definitely been overdue for a while though.
With 1 of these 2 pushes being focused on this topic, it makes sense to
summarize once again what "technical debt"
means in the context of ReC98, as this info was previously kind of scattered
over multiple blog posts. Mainly, it encompasses
any ZUN-written code
that we did name and reverse-engineer,
but which we simply moved out into dedicated files that are then
#included back into the big .ASM translation units,
without worrying about decompilation or proving undecompilability for
now.
Technically (ha), it would also include all of master.lib, which has
always been compiled into the binaries in this way, and which will require
quite a bit of dedicated effort to be moved out into a properly linkable
library, once it's feasible. But this code has never been part of any
progress metric – in fact, 0% RE is
defined as the total number of x86 instructions in the binary minus
any library code. There is also no relation between instruction numbers and
the time it will take to finalize master.lib code, let alone a precedent of
how much it would cost.
If we now want to express technical debt as a percentage, it's clear where
the 100% point would be: when all RE'd code is also compiled in from a
translation unit outside the big .ASM one. But where would 0% be? Logically,
it would be the point where no reverse-engineered code has ever been moved
out of the big translation units yet, and nothing has ever been decompiled.
With these boundary points, this is what we get:
Not too bad! So it's 6.22% of total RE that we will have to revisit at some
point, concentrated mostly around TH04 and TH05 where it resulted from a
focus on position independence. The prices also give an accurate impression
of how much more work would be required there.
But is that really the best visualization? After all, it requires an
understanding of our definition of technical debt, so it's maybe not the
most useful measurement to have on a front page. But how about subtracting
those 6.22% from the number shown on the RE% bars? Then, we get this:
Which is where we get to the good news: Twitter surprisingly helped me out
in choosing one visualization over the other, voting
7:2 in favor of the Finalized version. While this one requires
you to manually calculate € finalized - € RE'd to
obtain the raw financial cost of technical debt, it clearly shows, for the
first time, how far away we are from the main goal of fully decompiling all
5 games… at least to the extent it's possible.
Now that the parser is looking at these recursively included .ASM files for
the first time, it needed a small number of improvements to correctly handle
the more advanced directives used there, which no automatic disassembler
would ever emit. Turns out I've been counting some directives as
instructions that never should have been, which is where the additional
0.02% total RE came from.
One more overcounting issue remains though. Some of the RE'd assembly slices
included by multiple games contain different if branches for
each game, like this:
; An example assembly file included by both TH04's and TH05's MAIN.EXE:
if (GAME eq 5)
; (Code for TH05)
else
; (Code for TH04)
endif
Currently, the parser simply ignores if, else, and
endif, leading to the combined code of all branches being
counted for every game that includes such a file. This also affects the
calculated speed, and is the reason why finalization seems to be slightly
faster than reverse-engineering, at currently 471 instructions per push
compared to 463. However, it's not that bad of a signal to send: Most of the
not yet finalized code is shared between TH04 and TH05, so finalizing it
will roughly be twice as fast as regular reverse-engineering to begin with.
(Unless the code then turns out to be twice as complex than average code…
).
For completeness, finalization is now also shown as part of the per-commit metrics. Now it's clearly visible what I was
doing in those very slow five months between P0131 and P0140, where
the progress bar didn't move at all: Repaying 3.49% of previously
accumulated technical debt across all games. 👌
As announced, I've also implemented a new caching system for this website,
as the second main feature of these two pushes. By appending a hash string
to the URLs of static resources, your browser should now both cache them
forever and re-download them once they did change on the server. This
avoids the unnecessary (and quite frankly, embarrassing) re-requests for all
static resources that typically just return a 304 Not Modified
response. As a result, the blog should now load a bit faster on repeated
visits, especially on slower connections. That should allow me to
deliberately not paginate it for another few years, without it getting all
too slow – and should prepare us for the day when our first game
reaches 100% and the server will get smashed.
However, I am open to changing the progress blog link in the
navigation bar at the top to the list of tags, once
people start complaining.
Apart frome some more invisible correctness and QoL improvements, I've also
prepared some new funding goals, but I'll cover those once the store
reopens, next year. Syntax highlighting for code snippets would have also
been cool, but unfortunately didn't make it into those two pushes. It's
still on the list though!
Next up: Back to RE with the TH03 score file format, and other code that
surrounds it.
P0143
Website (Progress number caching)
P0144
Website (Blog tag system, part 1: Manual and automatic tag assignment, blog filtering, design)
P0145
Website (Blog tag system, part 2: Combining tags, per-tag descriptions)
Who said working on the website was "fun"? That code is a mess.
This right here is the first time I seriously
wrote a website from (almost) scratch. Its main job is to parse over a Git
repository and calculate numbers, so any additional bulky frameworks would
only be in the way, and probably need to be run on some sort of wobbly,
unmaintainable "stack" anyway, right? 😛
📝 As with the main project though, I'm only
beginning to figure out the best structure for this, and these new features
prompted quite a lot of upfront refactoring…
Before I start ranting though, let's quickly summarize the most visible
change, the new tag system for this blog!
Yes, I manually went through every one of the 82 posts I've written so
far, and assigned labels to them.
The per-project (rec98 and
website) and per-game (th01th02th03th04th05) tags are automatically generated from the
database and the Git commit history, respectively. That might have
ended us up with a fair bit of category clutter, as any single change
to a tiny aspect is enough for a blog post to be tagged with an
otherwise unrelated game. For now, it doesn't seem too much of
an issue though.
Filtering already works for an arbitrary number of tags. Right now,
these are always combined with AND – no arbitrary boolean expressions for tag filtering yet.
Adding filters simply works by adding components to the URL path:
https://rec98.nmlgc.net/blog/tag/tag1/tag2/tag3/… and so
on.
Hovering over any tag shows a brief description of what that tag is
about. Some of the terms really needed a definition, so I just added one for
all of them. Hope you all enjoy them!
These descriptions are also shown on the new
tag overview page, which now kind of doubles as a
glossary.
Finally, the order page now shows the exact number of pushes a contribution
will fund – no more manual divisions required.
Shoutout to the one email I received, which pointed out this potential
improvement!
As for the "invisible" changes: The one main feature of this website, the
aforementioned calculation of the progress metrics, also turned out as its
biggest annoyance over the years. It takes a little while to parse all the
big .ASM files in the source tree, once for every push that can affect the
average number of removed instructions and unlabeled addresses. And without
a cache, we've had to do that every time we re-launch the app server
process.
Fundamentally, this is – you might have guessed it – a dependency tracking
problem, with two inputs: the .ASM files from the ReC98 repo, and the
Golang code that calculates the instruction and PI numbers. Sure, the code
has been pretty stable, but what if we do end up extending it one day? I've
always disliked manually specified version numbers for use cases like this
one, where the problem at hand could be exactly solved with a hashing
function, without being prone to human error.
(Sidenote: That's why I never actively supported thcrap mods that affected
gameplay while I was still working on that project. We still want to be
able to save and share replays made on modded games, but I do not
want to subject users to the unacceptable burden of manually remembering
which version of which patch stack they've recorded a given replay with.
So, we'd somehow need to calculate a hash of everything that defines the
gameplay, exclude the things that don't, and only show
replays that were recorded on the hash that matches the currently running
patch stack. Well, turns out that True Touhou Fans™ quite enjoy watching
the games get broken in every possible way. That's the way ZUN intended the
games to be experienced, after all. Otherwise, he'd be constantly
maintaining the games and shipping bugfix patches… 🤷)
Now, why haven't I been caching the progress numbers all along? Well,
parallelizing that parsing process onto all available CPU cores seemed
enough in 2019 when this site launched. Back then, the estimates were
calculated from slightly over 10 million lines of ASM, which took about 7
seconds to be parsed on my mid-range dev system.
Fast forward to P0142 though, and we have to parse 34.3 million lines of
ASM, which takes about 26 seconds on my dev system. That would have only
got worse with every new delivery, especially since this production server
doesn't have as many cores.
I was thinking about a "doing less" approach for a while: Parsing only the
files that had changed between the start and end commit of a push, and
keeping those deltas across push boundaries. However, that turned out to be
slightly more complex than the few hours I wanted to spend on it.
And who knows how well that would have scaled. We've still got a few
hundred pushes left to go before we're done here, after all.
So with the tag system, as always, taking longer and consuming more pushes
than I had planned, the time had come to finally address the underlying
dependency tracking problem.
Initially, this sounded like a nail that was tailor-made for
📝 my favorite hammer, Tup: Move the parser
to a separate binary, gather the list of all commits via git
rev-list, and run that parser binary on every one of the commits
returned. That should end up correctly tracking the relevant parts of
.git/ and the new binary as inputs, and cause the commits to
be re-parsed if the parser binary changes, right? Too bad that Tup both
refuses to track
anything inside .git/, and can't track a Golang binary
either, due to all of the compiler's unpredictable outputs into its build
cache. But can't we at least turn off–
> The build cache is now required as a step toward eliminating $GOPATH/pkg.
— Go 1.12 release notes
Oh, wonderful. Hey, I always liked $GOPATH! 🙁
But sure, Golang is too smart anyway to require an external build system.
The compiler's
build
ID is exactly what we need to correctly invalidate the progress number
cache. Surely there is a way to retrieve the build ID for any package that
makes up a binary at runtime via some kind of reflection, right? Right? …Of
course not, in the great Unix tradition, this functionality is only
available as a CLI tool that prints its result to stdout.
🙄
But sure, no problem, let's just exec() a separate process on
the parser's library package file… oh wait, such a thing doesn't exist
anymore, unless you manually install the package. This would
have added another complication to the build process, and you'd
still have to manually locate the package file, with its version-specific
directory name. That might have worked out in the end, but figuring
all this out would have probably gone way beyond the budget.
OK, but who cares about packages? We just care about one single file here,
anyway. Didn't they put the official Golang source code parser into the
standard library? Maybe that can give us something close to the
build ID, by hashing the abstract syntax tree of that file. Well, for
starters, one does not simply serialize the returned AST. At least
into Golang's own, most "native" Gob
format, which requires all types from the go/ast package
to be manually registered first.
That leaves
ast.Fprint() as the
only thing close to a ready-made serialization function… and guess what,
that one suffers from Golang's typical non-deterministic order when
rendering any map to a string. 🤦
Guess there's no way around the simplest, most stupid way of simply
calculating any cryptographically secure hash over the ASM parser file. 😶
It's not like we frequently change comments in this file, but still, this
could have been so much nicer.
Oh well, at least I did get that issue resolved now, in an
acceptable way. If you ever happened to see this website rebuilding: That
should now be a matter of seconds, rather than minutes. Next up: Shinki's
background animations!
Calculating the average speed of the previous crowdfunded pushes, we arrive at estimated "goals" of…
So, time's up, and I didn't even get to the entire PayPal integration and FAQ parts… 😕 Still got to clarify a couple of legal questions before formally starting this, though. So for now, let's continue with zorg's next 5 TH05 reverse-engineering and decompilation pushes, and watch those prices go down a bit… hopefully quite significantly!
In order to be able to calculate how many instructions and absolute memory references are actually being removed with each push, we first need the database with the previous pushes from the Discord crowdfunding days. And while I was at it, I also imported the summary posts from back then.
Also, we now got something resembling a web design!
So yeah, "upper bound" and "probability". In reality it's certainly better than the numbers suggest, but as I keep saying, we can't say much about position independence without having reverse-engineered everything.
Now with the number of not yet RE'd x86 instructions the you might have seen in the thpatch Discord. They're a bit smaller now, didn't filter out a couple of directives back then.
Yes, requesting these currently is super slow. That's why I didn't want to have everyone here yet!
Next step: Figuring out the actual total number of game code instructions, for that nice "% done". Also, trying to do the same for position independence.