ReC98

⮜ Blog

⮜ List of tags

Showing all posts tagged

📝 Posted:: 2024-04-24 13:14 UTC
🚚 Summary of:: P0280
⌨ Commits:: 20bac82...87eed57
💰 Funded by:: Blue Bolt, JonathKane, [Anonymous]
🏷 Tags:

TH03 gameplay! 📝 It's been over two years. People have been investing some decent money with the intention of eventually getting netplay, so let's cover some more foundations around player movement… and quickly notice that there's almost no overlap between gameplay RE and netplay preparations? That makes for a fitting opportunity to think about what TH03 netplay would look like:

Data exchange protocol: The unofficial TH19 battle tool considered a few possibilities and ultimately chose WebRTC Data Channels. The argument goes like this:
- You'd want UDP rather than TCP for both its low latency and its NAT hole-punching ability
- However, raw UDP does not guarantee that the packets arrive in order, or that they even arrive at all
- WebRTC implements these reliability guarantees on top of UDP in a modern package, providing the best of both worlds
- NAT traversal via public or self-hosted STUN/TURN servers is built into the connection establishment protocol and APIs, so you don't even have to understand the underlying issue
I'm not too deep into networking to argue here, and it clearly works for Ju.N.Owen. If we do explore other options, it would mainly be because I can't easily get something as modern as WebRTC to natively run on Windows 9x or DOS, if we decide to go for that route.
Matchmaking: I like Ju.N.Owen's initial way of copy-pasting signaling codes into chat clients to establish a peer-to-peer connection without a dedicated matchmaking server. progre eventually implemented rooms on the AWS cloud, but signaling codes are still used for spectating and the Pure P2P mode. We'll probably copy the same evolution, with a slight preference for Pure P2P – if only because you would have to check a GDPR consent box before I can put the combination of your room name and IP address into a database. Server costs shouldn't be an issue at the scale I expect this to have.
Rollback: In emulators, rollback netcode can be and has been implemented by keeping savestates of the last few frames together with the local player's inputs and then replaying the emulation with updated inputs of the remote player if a prediction turned out to be incorrect. This technique is a great fit for TH03 for two reasons:
- All game state is contained within a relatively small bit of memory. The only heap allocations done in MAIN.EXE are the 📝 .MRS images for gauge attack portraits and bomb backgrounds, and the enemy scripts and formations, both of which remain constant throughout a round. All other state is statically allocated, which can reduce per-frame snapshots from the naive 640 KiB of conventional DOS memory to just the 37 KiB of MAIN.EXE's data segment. And that's the upper bound – this number is only going to go down as we move towards 100% PI, figure out how TH03 uses all its static data, and get to consolidate all mutated data into an even smaller block of memory.
- For input prediction, we could even let the game's existing AI play the remote player until the actual inputs come in, guaranteeing perfect play until the remote inputs prove otherwise. Then again… probably only while the remote player is not moving, because the chance for a human to replicate the AI's infamous erratic dodging is fairly low.
The only issue with rollback in specifically a PC-98 emulator is its implications for performance. Rendering is way more computationally expensive on PC-98 than it is on consoles with hardware sprites, involving lots of memory writes to the disjointed 4 bitplane segments that make up the 128 KB framebuffer, and equally as many reads and bitshift operations on sprite data. TH03 lessens the impact somewhat thanks to most of its rendering being EGC-accelerated and thus running inside the emulator as optimized native code, but we'd still be emulating all the x86 code surrounding the EGC accesses – from the emulator's point of view, it looks no different than game logic. Let's take my aging i5 system for example:
- With the Screen → No wait option, Neko Project 21/W can emulate TH03 gameplay at 260 FPS, or 4.6× its regular speed.
- This leaves room for each frame to contain 3.6 frames of rollback in addition to the frame that's supposed to be displayed,
- which results in a maximum safe network latency of ≈63 ms, or a ping of ≈126 ms. According to this site, that's enough for a smooth connection from Germany to any other place in Europe and even out to the US Midwest. At this ping, my system could still run the game without slowdown even if every single frame required a rollback, which is highly unlikely.
- Any higher ping, however, could occasionally lead to a rollback queue that's too large for my system to process within a single frame at the intended 56.4 FPS rate. As a result, me playing anyone in the western US is highly likely to involve at least occasional slowdowns. Delaying inputs on purpose is the usual workaround, but isn't Touhou that kind of game series where people use vpatch to get rid of even the default input delay in the Windows games?
So we'd ideally want to put TH03 into an update-only mode that skips all rendering calls during re-simulation of rolled-back frames. Ironically, this means that netplay-focused RE would actually focus on the game's rendering code and ensure that it doesn't mutate any statically allocated data, allowing it to be freely skipped without affecting the game. Imagine palette-based flashing animations that are implemented by gradually mutating statically allocated values – these would cause wrong colors for the rest of the game if the animation doesn't run on every frame.

Implementing all of this into TH03 can be done in one, a few, or all of the following 6 ways, depending on what the backers prefer. Sorted from the most generic to the most specialized solution (and, coincidentally, from least to most total effort required):

Generic PC-98 netcode for one or more emulators
This is the most basic and puristic variant that implements generic netplay for PC-98 games in general by effectively providing remote control of the emulated keyboard and joypad. The emulator will be unaware of the game, and the game will be unaware of being netplayed, which makes this solution particularly interesting for the non-Touhou PC-98 scene, or competitive players who absolutely insist on using ZUN's original binaries and won't trust any of my modded game builds.
Applied to TH03, this means that players would select the regular hot-seat 1P vs 2P mode and then initiate a match through a new menu in the emulator UI. The same UI must then provide an option to manually remap incoming key and button presses to the 2P controls (newly introducing remapping to the emulator if necessary), as well as blocking any non-2P keys. The host then sends an initial savestate to the guest to ensure an identical starting state, and starts synchronizing and rolling back inputs at VSync boundaries.
This generic nature means that we don't get to include any of the TH03-specific rollback optimizations mentioned above, leading to the highest CPU and memory requirements out of all the variants. It sure is the easiest to implement though, as we get to freely use modern C++ WebRTC libraries that are designed to work with the network stack of the underlying OS.
I can try to build this netcode as a generic library that can work with any PC-98 emulator, but it would ultimately be up to the respective upstream developers to integrate it into official releases. Therefore, expect this variant to require separate funding and custom builds for each individual emulator codebase that we'd like to support.
Emulator-level netcode with optional game integration
Takes the generic netcode developed in 1) and adds the possibility for the game to control it via a special interrupt API. This enables several improvements:
- Online matches could be initiated through new options in TH03's main menu rather than the emulator's UI.
- The game could communicate the memory region that should be backed up every frame, cutting down memory usage as described above.
- The exchanged input data could use the game's internal format instead of keyboard or joypad inputs. This removes the need for key remapping at the emulator level and naturally prevents the inherent issue of remote control where players could mess with each other's controls.
- The game could be aware of the rollbacks, allowing it to jump over its rendering code while processing the queue of remote inputs and thus gain some performance as explained above.
- The game could add synchronization points that block gameplay until both players have reached them, preventing the rollback queue from growing infinitely. This solves the issue of 1) not having any inherent way of working around desyncs and the resulting growth of the rollback queue. As an example, if one of the two emulators in 1) took, say, 2 seconds longer to load the game due to a random CPU spike caused by some bloatware on their system, the two players would be out of sync by 2 seconds for the rest of the session, forcing the faster system to render 113 frames every time an input prediction turned out to be incorrect.
  Good places for synchronization points include the beginning of each round, the WARNING!! You are forced to evade / Your life is in peril popups that pause the game for a few frames anyway, and whenever the game is paused via the ESC key.
- During such pauses, the game could then also block the resuming ESC key of the player who didn't pause the game.
Edit (2024-04-30): Emulated serial port communicating over named pipes with a standalone netplay tool
This approach would take the netcode developed in 2) out of the emulator and into a separate application running on the (modern) host OS, just like Ju.N.Owen or Adonis. The previous interrupt API would then be turned into binary protocol communicated over the PC-98's serial port, while the rollback snapshots would be stored inside the emulated PC-98 in EMS or XMS/Protected Mode memory. Netplay data would then move through these stages:
🖥️ PC-98 game logic ⇄ Serial port ⇄ Emulator ⇄ Named pipe ⇄ Netcode logic ⇄ WebRTC Data Channel ⇄ Internet 🛜
All green steps run natively on the host OS.
Sending serial port data over named pipes is only a semi-common feature in PC-98 emulators, and would currently restrict netplay to Neko Project 21/W and NP2kai on Windows. This is a pretty clean and generally useful feature to have in an emulator though, and emulator maintainers will be much more likely to include this than the custom netplay code I proposed in 1) and 2). DOSBox-X has an open issue that we could help implement, and the NP2kai Linux port would probably also appreciate a mkfifo(3) implementation.
This could even work with emulators that only implement PC-98 serial ports in terms of, well, native Windows serial ports. This group currently includes Neko Project II fmgen, SL9821, T98-Next, and rare bundles of Anex86 that replace MIDI support with COM port emulation. These would require separately installed and configured virtual serial port software in place of the named pipe connection, as well as support for actual serial ports in the netplay tool itself. In fact, this is the only way that die-hard Anex86 and T98-Next fans could enjoy any kind of netplay on these two ancient emulators.
If it works though, it's the optimal solution for the emulated use case if we don't want to fork the emulator. From the point of view of the PC-98, the serial port is the cheapest way to send a couple of bytes to some external thing, and named pipes are one of many native ways for two Windows/Linux applications to efficiently communicate.
The only slight drawback of this approach is the expected high DOS memory requirement for rollback. Unless we find a way to really compress game state snapshots to just a few KB, this approach will require a more modern DOS setup with EMS/XMS support instead of the pre-installed MS-DOS 3.30C on a certain widely circulated .HDI copy. But apart from that, all you'd need to do is run the separate netplay tool, pick the same pipe name in both the tool and the emulator, and you're good to go.

It could even work for real hardware, but would require the PC-98 to be linked to the separately running modern system via a null modem cable.
Native PC-98 Windows 9x netcode (only for real PC-98 hardware equipped with an Ethernet card)
Equivalent in features to 2), but pulls the netcode into the PC-98 system itself. The tool developed in 3) would then as a separate 32-bit or 16-bit Windows application that somehow communicates with the game running in a DOS window. The handful of real-hardware owners who have actually equipped their PC-98 with a network card such as the LGY-98 would then no longer require the modern PC from 3) as a bridge in the middle.
This specific card also happens to be low-level-emulated by the 21/W fork of Neko Project. However, it makes little sense to use this technique in an emulator when compared to 3), as NP21/W requires a separately installed and configured TAP driver to actually be able to access your native Windows Internet connection. While the setup is well-documented and I did manage to get a working Internet connection inside an emulated Windows 95, it's definitely not foolproof. Not to mention DOSBox-X, which currently emulates the apparently hardware-compatible NE2000 card, but disables its emulation in PC-98 mode, most likely because its I/O ports clash with the typical peripherals of a PC-98 system.
And that's not the end of the drawbacks:
- Netplay would depend on the PC-98 versions of Windows 9x and its full network stack, nothing of which is required for the game itself.
- Porting libdatachannel (and especially the required transport encryption) to Windows 95 will probably involve a bit of effort as well.
- As would actually finding a way to access V86 mode memory from a 32-bit or 16-bit Windows process, particularly due to how isolated DOS processes are from the rest of the system and even each other. A quick investigation revealed three potential approaches:
  - A 32-bit process could read the memory out of the address space of the console host process (WINOA32.MOD). There seems to be no way of locating the specific base address of a DOS process, but you could always do a brute-force search through the memory map.
  - If started before Windows, TSRs will share their resident memory with both DOS and Win16 processes. The segment pointer would then be retrieved through a typical interrupt API.
  - Writing a VxD driver 😩
- Correctly setting up TH03 to run within Windows 95 to begin with can be rather tricky. The GDC clock speed check needs to be either patched out or overridden using mode-setting tools, Windows needs to be blocked from accessing the FM chip, and even then, MAIN.EXE might still immediately crash during the first frame and leave all of VRAM corrupted:
  
  This is probably a bug in the latest ver0.86 rev92β3 version of Neko Project 21/W; I got it to work fine on real hardware. 📝 StormySpace did run on the same emulated Windows 95 system without any issues, though. Regardless, it's still worth mentioning as a symbol of everything that can go wrong.
- A matchmaking server would be much more of a requirement than in any of the emulator variants. Players are unlikely to run their favorite chat client on the same PC-98 system, and the signaling codes are way too unwieldy to type them in manually. (Then again, IRC is always an option, and the people who would fund this variant are probably the exact same people who are already running IRC clients on their PC-98.)
Native PC-98 DOS netcode (only for real PC-98 hardware equipped with an Ethernet card)
Conceptually the same as 4), but going yet another level deeper, replacing the Windows 9x network stack with a DOS-based one. This might look even more intimidating and error-prone, but after I got ping and even Telnet working, I was pleasantly surprised at how much simpler it is when compared to the Windows variant. The whole stack consists of just one LGY-98 hardware information tool, a LGY-98 packet driver TSR, and a TSR that implements TCP/IP/UDP/DNS/ICMP and is configured with a plaintext file. I don't have any deep experience with these protocols, so I was quite surprised that you can implement all of them in a single 40 KiB binary. Installed as TSRs, the entire stack takes up an acceptable 82 KiB of conventional memory, leaving more than enough space for the game itself. And since both of the TSRs are open-source, we can even legally bundle them with the future modified game binaries.
The matchmaking issue from the Windows 9x approach remains though, along with the following issues:
- Porting libdatachannel and the required transport encryption to the TEEN stack seems even more time-consuming than a Windows 95 port.
- The TEEN stack has no UI for specifying the system's or gateway's IP addresses outside of its plaintext configuration file. This provides a nice opportunity for adding a new Internet settings menu with great error feedback to the game itself. Great for UX, but it's another thing I'd have to write.
- The LGY-98 is not the only network card for the PC-98. Others might have more complicated DOS drivers that might not work as seamlessly with the TEEN stack, or have no preserved DOS drivers at all. Heck, the most time-consuming part of the DOS setup was finding the correct download link for the LGY-98 packet driver, as the one link that appears in a lot of places only throws an access denied error these days. Edit (2024-04-30): spaztron64 is now hosting both the LGY-98 packet driver and the entire TEEN bundle on his homepage.
  If you're interested in funding this variant and are using a non-LGY-98 card on real hardware, make sure you get general Internet working on DOS first.
Porting the game first
As always, this is the premium option. If the entire game already runs as a standalone executable on a modern system, we can just put all the netcode into the same binary and have the most seamless integration possible.

That leaves us with these prerequisites:

1), by definition, needs nothing from ReC98, and I could theoretically start implementing it right now. If you're interested in funding it, just tell me via the usual Twitter or Discord channels.
2) through 5) require at least 100% RE of TH03's OP.EXE to facilitate the new menu code. Reverse-engineering all rendering-related code in MAIN.EXE would be nice for performance, but we don't strictly need all of it before we start. Re-simulated frames can just skip over the few pieces of rendering code we do know, and we can gradually increase the skipped area of code in future pushes.
100% PI won't be a requirement either, as I expect the MAIN.EXE part of the interfacing netcode layer to be thin enough that it can easily fit within the original game's code layout.
6), obviously, requires all of TH03 to be RE'd, decompiled, cleaned up, and ported to modern systems. Currently, TH03 appears to be the second-easiest game to port behind TH02:
- Although TH03 already has more needlessly micro-optimized ASM code than TH02 and there's even more to come, it still appears to have way less than TH04 or TH05.
- Its game logic and rendering code seem to be somewhat neatly separated from each other, unlike TH01 which deeply intertwines them.
- Its graphics seem free of obvious bugs, unlike – again — the flicker-fest that is TH01.
But still, it's the game with the least amount of RE%. Decompilation might get easier once I've worked myself up to the higher levels of game code, and even more so if we're lucky and all of the 9 characters are coded in a similar way, but I can't promise anything at this point.

Once we've reached any of these prerequisites, I'll set up a separate campaign funding method that runs parallel to the cap. As netplay is one of those big features where incremental progress makes little sense and we can expect wide community support for the idea, I'll go for a more classic crowdfunding model with a fixed goal for the minimum feature set and stretch goals for optional quality-of-life features. Since I've still got two other big projects waiting to be finished, I'd like to at least complete the Shuusou Gyoku Linux port before I start working on TH03 netplay, even if we manage to hit any of the funding goals before that.

For the first time in a long while, the actual content of this push can be listed fairly quickly. I've now RE'd:

conversions from playfield-relative coordinates to screen coordinates and back (a first in PC-98 Touhou; even TH02 uses screen space for every coordinate I've seen so far),
the low-level code that moves the player entity across the screen,
a copy of the per-round frame counter that, for some reason, resets to 0 at the start of the Win/Lose animation, resetting a bunch of animations with it,
a global hitbox with one variable that sometimes stores the center of an entity, and sometimes its top-left corner,
and the 48×48 hit circles from EN2.PI.

It's also the third TH03 gameplay push in a row that features inappropriate ASM code in places that really, really didn't need any. As usual, the code is worse than what Turbo C++ 4.0J would generate for idiomatic C code, and the surrounding code remains full of untapped and quick optimization opportunities anyway. This time, the biggest joke is the sprite offset calculation in the hit circle rendering code:

_BX = (circle->age - 1);
_BX >>= 2;
_BX *= 2;
uint16_t sprite_offset_in_sprite16_area = (0x1910 + _BX + _BX + _BX);

A multiplication with 6 would have compiled into a single IMUL instruction. This compiles into 4 MOVs, one IMUL (with 2), and two ADDs.

This surely must have been left in on purpose for us to laugh about it one day?

But while we've all come to expect the usual share of ZUN bloat by now, this is also the first push without either a ZUN bug or a landmine since I started using these terms! 🎉 It does contain a single ZUN quirk though, which can also be found in the hit circles. This animation comes in two types with different caps: 12 animation slots across both playfields for the enemy circles shown in alternating bright/dark yellow colors, whereas the white animation for the player characters has a cap of… 1? P2 takes precedence over P1 because its update code always runs last, which explains what happens when both players get hit within the 16 frames of the animation:

If they both get hit on the exact same frame, the animation for P1 never plays, as P2 takes precedence.

If the other player gets hit within 16 frames of an active white circle animation, the animation is reinitialized for the other player as there's only a single slot to hold it. Is this supposed to telegraph that the other player got hit without them having to look over to the other playfield? After all, they're drawn on top of most other entities, but below the player.

SPRITE16 uses the PC-98's EGC to draw these single-color sprites. If the EGC is already set up, it can be set into a GRCG-equivalent RMW mode using the pattern/read plane register (0x4A2) and foreground color register (0x4A6), together with setting the mode register (0x4A4) to 0x0CAC. Unlike the typical blitting operations that involve its 16-dot pattern register, the EGC even supports 8- or 32-bit writes in this mode, just like the GRCG. 📝 As expected for EGC features beyond the most ordinary ones though, T98-Next simply sets every written pixel to black on a 32-bit write.

Comparing the actual performance of such writes to the GRCG would be 📝 yet another interesting question to benchmark.

Next up: I think it's time for ReC98's build system to reach its final form. For almost 5 years, I've been using an unreleased sane build system on a parallel private branch that was just missing some final polish and bugfixes. Meanwhile, the public repo is still using the project's initial Makefile that, 📝 as typical for Makefiles, is so unreliable that BUILD16B.BAT force-rebuilds everything by default anyway. While my build system has scaled decently over the years, something even better happened in the meantime: MS-DOS Player, a DOS emulator exclusively meant for seamless integration of CLI programs into the Windows console, has been forked and enhanced enough to finally run Turbo C++ 4.0J at an acceptable speed. So let's remove DOSBox from the equation, merge the 32-bit and 16-bit build steps into a single 32-bit one, set all of this up in a user-friendly way, and maybe squeeze even more performance out of MS-DOS Player specifically for this use case.

📝 Posted:: 2023-09-30 21:01 UTC
🚚 Summary of:: P0252, P0253, P0254, P0255, P0256, P0257
⌨ Commits:: (Seihou) P0251...e98feef, (Seihou) e98feef...24df71c, (Seihou) 24df71c...b7b863b, (Seihou) b7b863b...2b8218e, (Seihou) 2b8218e...P0256, (Website) b79c667...e2ba49b
💰 Funded by:: Arandui, Ember2528, [Anonymous]
🏷 Tags:

And now we're taking this small indie game from the year 2000 and porting its game window, input, and sound to the industry-standard cross-platform API with "simple" in its name.

Why did this have to be so complicated?! I expected this to take maybe 1-2 weeks and result in an equally short blog post. Instead, it raised so many questions that I ended up with the longest blog post so far, by quite a wide margin. These pushes ended up covering so many aspects that could be interesting to a general and non-Seihou-adjacent audience, so I think we need a table of contents for this one:

Evaluating Zig
Visual Studio doesn't implement concepts correctly?
Reusable building blocks for Tup
Compiling SDL 2
The new frame rate limiter
Audio via SDL or SDL_mixer? (Nope, neither)
miniaudio
Resampling defective sound effects (including FLAC not always being lossless)
Joypad input with SDL
Restoring the original screenshot feature
Integer math in hand-written ASM

Before we can start migrating to SDL, we of course have to integrate it into the build somehow. On Linux, we'd ideally like to just dynamically link to a distribution's SDL development package, but since there's no such thing on Windows, we'd like to compile SDL from source there. This allows us to reuse our debug and release flags and ensures that we get debug information, without needing to clone build scripts for every C++ library ever in the process or something.
So let's get my Tup build scripts ready for compiling vendored libraries… or maybe not? Recently, I've kept hearing about a hot new technology that not only provides the rare kind of jank-free cross-compiling build system for C/C++ code, but innovates by even bundling a C++ compiler into a single 279 MiB package with no further dependencies. Realistically replacing both Visual Studio and Tup with a single tool that could target every OS is quite a selling point. The upcoming Linux port makes for the perfect occasion to evaluate Zig, and to find out whether Tup is still my favorite build system in 2023.

Even apart from its main selling point, there's a lot to like about Zig:

First and foremost: It's a modern systems programming language with seamless C interop that we could gradually migrate parts of the codebase to. The feature set of the core language seems to hit the sweet spot between C and C++, although I'd have to use it more to be completely sure.
A native, optimized Hello World binary with no string formatting is 4 KiB when compiled for Windows, and 6.4 KiB when cross-compiled from Windows to Linux. It's so refreshing to see a systems language in 2023 that doesn't bundle a bulky runtime for trivial programs and then defends it with the old excuse of "but all this runtime code will come in handy the larger your program gets". With a first impression like this, Zig managed to realize the "don't pay for what you don't use" mantra that C++ typically claims for itself, but only pulls off maybe half of the time.
You can directly target specific CPU models, down to even the oldest 386 CPUs?! How amazing is that?! In contrast, Visual Studio only describes its /arch:IA32 compatibility option in very vague terms, leaving it up to you to figure out that "legacy 32-bit x86 instruction set without any vector operations" actually means "i586/P5 Pentium, because the startup code still includes an unconditional CPUID instruction". In any case, it means that Zig could also cover the i586 build.
- Even better, changing Zig's CPU model setting recompiles both its bundled C/C++ standard library and Zig's own compiler-rt polyfill library for that architecture. This ensures that no unsupported instructions ever show up in the binary, and also removes the need for any CPUID checks. This is so much better than the Visual Studio model of linking against a fixed pre-compiled standard library because you don't have to trust that all these newer instructions wouldn't actually be executed on older CPUs that don't have them.
I love the auto-formatter. Want to lay out your struct literal into multiple lines? Just add a trailing comma to the end of the last element. It's very snappy, and a joy to use.
Like every modern programming language, Zig comes with a test framework built into the language. While it's not all too important for my grand plan of having one big test that runs a bunch of replays and compares their game states against the original binary, small tests could still be useful for protecting gameplay code against accidental changes. It would be great if I didn't have to evaluate and choose among the many testing frameworks for C++ and could just use a language standard.
Package management is still in its infancy, but it's looking pretty good so far, resembling Go's decentralized approach of just pointing to a URL but with specific version selection from the get-go.

However, as a version number of 0.11.0 might already suggest, the whole experience was then bogged down by quite a lot of issues:

While Zig's C/C++ compilation feature is very well architected to reuse the C/C++ standard libraries of GCC and MinGW and thus automatically keeps up with changes to the C++ standard library, it's ultimately still just a Clang frontend. If you've been working with a Visual Studio-exclusive codebase – which, as we're going to see below, can easily happen even if you compile in C++23 mode – you'd now have to migrate to Clang and Zig in a single step. Obviously, this can't ever be fixed without Microsoft open-sourcing their C++ compiler. And even then, supporting a separate set of command-line flags might not be worth it.
The standard library is very poorly documented, especially in the build-related parts that are meant to attract the C++ audience.
Often, the only documentation is found in blog posts from a few years ago, with example code written against old Zig versions that doesn't compile on the newest version anymore. It's all very far from stable.
However, Zig's project generation sub-commands (zig init-exe and friends) do emit well-documented boilerplate code? It does make sense for that code to double as a comprehensive example, but Zig advertises itself as so simple that I didn't even think about bootstrapping my project with a CLI tool at first – unlike, say, Rust, where a project always starts with filling out a small form in Cargo.toml.
There's no progress output for C/C++ compilation? Like, at all?
This hurts especially because compilation times are significantly longer than they were with Visual Studio. By default, the current Tupfile builds Shuusou Gyoku in both debug and release configurations simultaneously. If I fully rebuild everything from a clean cache, Visual Studio finishes such a build in roughly the same amount of time that Zig takes to compile just a debug build.
The --global-cache-dir option is only supported by specific subcommands of the zig CLI rather than being a top-level setting, and throws an error if used for any other subcommand. Not having a system-wide way to change it and being forced into writing a wrapper script for that is fine, but it would be nice if said wrapper script didn't have to also parse and switch over the subcommand just to figure out whether it is allowed to append the setting.
compiler-rt still needs a bit of dead code elimination work. As soon as your program needs a single polyfilled function, you get all of them, because they get referenced in some exception-related table even if nothing uses them? Changing the link_eh_frame_hdr option had no effect.
And that was not the only std.Build.Step.Compile option that did nothing. Worse, if I just tweaked the options and changed nothing about the code itself, Zig simply copied a previously built executable out of its build cache into the output directory, as revealed by the timestamp on the .EXE. While I am willing to believe that Zig correctly detects that all these settings would just produce the same binary, I do not like how this behavior inspires distrust and uncertainty in Zig's build process as a whole. After all, we still live in a world where clearing the build cache is way too often the solution for weird problems in software, especially when using CMake. And it makes sense why it would be: If you develop a complex system and then try solving the infamously hard problem of cache invalidation on top, the risk of getting cache invalidation wrong is, by definition, higher than if that was the only thing your system did. That's the reason why I like Tup so much: It solely focuses on getting cache invalidation right, and rather errs on the side of caution by maybe unnecessarily rebuilding certain files every once in a while because the compiler may have read from an environment variable that has changed in the meantime. But this is the one job I expect a build system to do, and Tup has been delivering for years and has become fundamentally more trustworthy as a result.
Zig activates Clang's UBSan in debug builds by default, which executes a program-crashing UD2 instruction whenever the program is about to rely on undefined C++ behavior. In theory, that's a great help for spotting hidden portability issues, but it's not helpful at all if these crashes are seemingly caused by C++ standard library code?! Without any clear info about the actual cause, this just turned into yet another annoyance on top of all the others. Especially because I apparently kept searching for the wrong terms when I first encountered this issue, and only found out how to deactivate it after I already decided against Zig.
Also, can we get /PDBALTPATH? Baking absolute paths from the filesystem of the developer's machine into released binaries is not only cringe in itself, but can also cause potential privacy or security accidents.

So for the time being, I still prefer Tup. But give it maybe two or three years, and I'm sure that Zig will eventually become the best tool for resurrecting legacy C++ codebases. That is, if the proposed divorce of the core Zig compiler from LLVM isn't an indication that the productive parts of the Zig community consider the C/C++ building features to be "good enough", and are about to de-emphasize them to focus more strongly on the actual Zig language. Gaining adoption for your new systems language by bundling it with a C/C++ build system is such a great and unique strategy, and it almost worked in my case. And who knows, maybe Zig will already be good enough by the time I get to port PC-98 Touhou to modern systems.

(If you came from the Zig wiki, you can stop reading here.)

A few remnants of the Zig experiment still remain in the final delivery. If that experiment worked out, I would have had to immediately change the execution encoding to UTF-8, and decompile a few ASM functions exclusive to the 8-bit rendering mode which we could have otherwise ignored. While Clang does support inline assembly with Intel syntax via -fms-extensions, it has trouble with ; comments and instructions like REP STOSD, and if I have to touch that code anyway… (The REP STOSD function translated into a single call to memcpy(), by the way.)

Another smaller issue was Visual Studio's lack of standard library header hygiene, where #including some of the high-level STL features also includes more foundational headers that Clang requires to be included separately, but I've already known about that. Instead, the biggest shocker was that Visual Studio accepts invalid syntax for a language feature as recent as C++20 concepts:

// Defines the interface of a text rendering session class. To simplify this
// example, it only has a single `Print(const char* str)` method.
template <class T> concept Session = requires(T t, const char* str) {
	t.Print(str);
};

// Once the rendering backend has started a new session, it passes the session
// object as a parameter to a user-defined function, which can then freely call
// any of the functions defined in the `Session` concept to render some text.
template <class F, class S> concept UserFunctionForSession = (
	Session<S> && requires(F f, S& s) {
		{ f(s) };
	}
);

// The rendering backend defines a `Prerender()` method that takes the
// aforementioned user-defined function object. Unfortunately, C++ concepts
// don't work like this: The standard doesn't allow `auto` in the parameter
// list of a `requires` expression because it defines another implicit
// template parameter. Nevertheless, Visual Studio compiles this code without
// errors.
template <class T, class S> concept BackendAttempt = requires(
	T t, UserFunctionForSession<S> auto func
) {
	t.Prerender(func);
};

// A syntactically correct definition would use a different constraint term for
// the type of the user-defined function. But this effectively makes the
// resulting concept unusable for actual validation because you are forced to
// specify a type for `F`.
template <class T, class S, class F> concept SyntacticallyFixedBackend = (
	UserFunctionForSession<F, S> && requires(T t, F func) {
		t.Prerender(func);
	}
);

// The solution: Defining a dummy structure that behaves like a lambda as an
// "archetype" for the user-defined function.
struct UserFunctionArchetype {
	void operator ()(Session auto& s) {
	}
};

// Now, the session type disappears from the template parameter list, which
// even allows the concrete session type to be private.
template <class T> concept CorrectBackend = requires(
	T t, UserFunctionArchetype func
) {
	t.Prerender(func);
};

Here's a Godbolt link, configured with both Visual Studio and Clang compilers.

What's this, Visual Studio's infamous delayed template parsing applied to concepts, because they're templates as well? Didn't they get rid of that 6 years ago? You would think that we've moved beyond the age where compilers differed in their interpretation of the core language, and that opting into a current C++ standard turns off any remaining antiquated behaviors…

So let's actually get my Tup build scripts ready for compiling vendored libraries, because the 📝 previous 70 lines of Lua definitely weren't. For this use case, we'd like to have some notion of distinct build targets that can have a unique set of compilation and linking flags. We'd also like to always build them in debug and release versions even if you only intend to build your actual program in one of those versions – with the previous system of specifying a single version for all code, Tup would delete the other one, which forces a time-consuming and ultimately needless rebuild once you switch to the other version.

The solution I came up with treats the set of compiler command-line options like a tree whose branches can concatenate new options and/or filter the versions that are built on this branch. In total, this is my 4^th attempt at writing a compiler abstraction layer for Tup. Since we're effectively forced to write such layers in Lua, it will always be a bit janky, but I think I've finally arrived at a solid underlying design that might also be interesting for others. Hence, I've split off the result into its own separate repository and added high-level documentation and a documented example. And yes, that's a Code Nutrition label! I've wanted to add one of these ever since I first heard about the idea, since it communicates nicely how seriously such an open-source project should be taken. Which, in this case, is actually not all too seriously, especially since development of the core Tup project has all but stagnated. If Zig does indeed get better and better at being a Clang frontend/build system, the only niches left for Tup will be Visual Studio-exclusive projects, or retrocoding with nonstandard toolchains (i.e., ReC98). Quite ironic, given Tup's Unix heritage…
Oh, and maybe general Makefile-like tasks where you just want to run specific programs. Maybe once the general hype swings back around and people start demanding proper graph-based dependency tracking instead of just a command runner…

Alright, alternatives evaluated, build system ready, time to include SDL! Once again, I went for Git submodules, but this time they're held together by a batch file that ensures that the intended versions are checked out before starting Tup. Git submodules have a bad rap mainly because of their usability issues, and such a script should hopefully work around them? Let's see how this plays out. If it ends up causing issues after all, I'll just switch to a Zig-like model of downloading and unzipping a source archive. Since Windows comes with curl and tar these days, this can even work without any further dependencies, and will also remove all the test code bloat.

Compiling SDL from a non-standard build system requires a bit of globbing to include all the code that is being referenced, as well as a few linker settings, but it's ultimately not much of a big deal. I'm quite happy that it was possible at all without pre-configuring a build, but hey, that's what maintaining a Visual Studio project file does to a project.
By building SDL with the stock Windows configuration, we then end up with exactly what the SDL developers want us to use… which is a DLL. You can statically link SDL, but they really don't want you to do that. So strongly, in fact, that they not merely argue how well the textbook advantages of dynamic linking have worked for them and gamers as a whole, but implemented a whole dynamic API system that enforces overridable dynamic function loading even in static builds. Nudging developers to their preferred solution by removing most advantages from static linking by default… that's certainly a strategy. It definitely fits with SDL's grassroots marketing, which is very good at painting SDL as the industry standard and the only reliable way to keep your game running on all originally supported operating systems. Well, at least until SDL 3 is so stable that SDL 2 gets deprecated and won't receive any code for new backends…

However, dynamic linking does make sense if you consider what SDL is. Offering all those multiple rendering, input, and sound backends is what sets it apart from its more hip competition, and you want to have all of them available at any time so that SDL can dynamically select them based on what works best on a system. As a result, everything in SDL is being referenced somewhere, so there's no dead code for the linker to eliminate. Linking SDL statically with link-time code generation just prolongs your link time for no benefit, even without the dynamic API thwarting any chance of SDL calls getting inlined.
There's one thing I still don't like about all this, though. The dynamic API's table references force you to include all of SDL's subsystems in the DLL even if your game doesn't need some of them. But it does fit with their intention of having SDL2.dll be swappable: If an older game stopped working because of an outdated SDL2.dll, it should be possible for anyone to get that game working again by replacing that DLL with any newer version that was bundled with any random newer game. And since that would fail if the newer SDL2.dll was size-optimized to not include some of the subsystems that the older game required, they simply removed (or de-prioritized) the possibility altogether. Maybe that was their train of thought? You can always just use the official Windows DLL, whose whole point is to include everything, after all. 🤷

So, what do we get in these 1.5 MiB? There are:

renderer backends for Direct3D 9/11/12, regular OpenGL, OpenGL ES 2.0, Vulkan, and a software renderer,
input backends for DirectInput, XInput, Raw Input, and all the official game console controllers that can be connected via USB,
and audio backends for WinMM, DirectSound, WASAPI, and direct-to-disk recording.

Unfortunately, SDL 2 also statically references some newer Windows API functions and therefore doesn't run on Windows 98. Since this build of Shuusou Gyoku doesn't introduce any new features to the input or sound interfaces, we can still use pbg's original DirectSound and DirectInput code for the i586 build to keep it working with the rest of the platform-independent game logic code, but it will start to lag behind in features as soon as we add support for SC-88Pro BGM or more sophisticated input remapping. If we do want to keep this build at the same feature level as the SDL one, we now have a choice: Do we write new DirectInput and DirectSound code and get it done quickly but only for Shuusou Gyoku, or do we port SDL 2 to Windows 98 and benefit all other SDL 2 games as well? I leave that for my backers to decide.

Immediately after writing the first bits of actual SDL code to initialize the library and create the game window, you notice that SDL makes it very simple to gradually migrate a game. After creating the game window, you can call SDL_GetWindowWMInfo() to retrieve HWND and HINSTANCE handles that allow you to continue using your original DirectDraw, DirectSound, and DirectInput code and focus on porting one subsystem at a time.
Sadly, D3DWindower can no longer turn SDL's fullscreen mode into a windowed one, but DxWnd still works, albeit behaving a bit janky and insisting on minimizing the game whenever its window loses focus. But in exchange, the game window can surprisingly be moved now! Turns out that the originally fixed window position had nothing to do with the way the game created its DirectDraw context, and everything to do with pbg blocking the Win32 "syscommand" that allows a window to be moved. By deleting a system menu… seriously?! Now I'm dying to hear the Raymond Chen explanation for how this behavior dates back to an unfortunate decision during the Win16 days or something.
As implied by that commit, I immediately backported window movability to the i586 build.

However, the most important part of Shuusou Gyoku's main loop is its frame rate limiter, whose Win32 version leaves a bit of room for improvement. Outside of the uncapped [おまけ] DrawMode, the original main loop continuously checks whether at least 16 milliseconds have elapsed since the last simulated (but not necessarily rendered) frame. And by that I mean continuously, and deliberately without using any of the Windows system facilities to sleep the process in the meantime, as evidenced by a commented-out Sleep(1) call. This has two important effects on the game:

The 60Fps DrawMode actually corresponds to a frame rate of (1000 / 16) = 62.5 FPS, not 60. Since the game didn't account for the missing ²/₃ ms to bring the limit down to exactly 60 FPS, 62.5 FPS is Shuusou Gyoku's actual official frame rate in a non-VSynced setting, which we should also maintain in the SDL port.
Not sleeping the process turns Shuusou Gyoku's frame rate limitation into a busy-waiting loop, which always uses 100% of a single CPU core just to wait for the next frame.

Unsurprisingly, SDL features a delay function that properly sleeps the process for a given number of milliseconds. But just specifying 16 here is not exactly what we want:

Sure, modern computers are fast, but a frame won't ever take an infinitely fast 0 milliseconds to render. So we still need to take the current frame time into account.
SDL_Delay()'s documentation says that the wake-up could be further delayed due to OS scheduling.

To address both of these issues, I went with a base delay time of 15 ms minus the time spent on the current frame, followed by busy-waiting for the last millisecond to make sure that the next frame starts on the exact frame boundary. And lo and behold: Even though this still technically wastes up to 1 ms of CPU time, it still dropped CPU usage into the 0%-2% range during gameplay on my Intel Core i5-8400T CPU, which is over 5 years old at this point. Your laptop battery will appreciate this new build quite a bit.

Time to look at audio then, because it sure looks less complicated than input, doesn't it? Loading sounds from .WAV file buffers, playing a fixed number of instances of every sound at a given position within the stereo field and with optional looping… and that's everything already. The DirectSound implementation is so straightforward that the most complex part of its code is the .WAV file parser.
Well, the big problem with audio is actually finding a cross-platform backend that implements these features in a way that seamlessly works with Shuusou Gyoku's original files. DirectSound really is the perfect sound API for this game:

It doesn't require the game code to specify any output sample format. Just load the individual sound effects in their original format, and playback just works and sounds correctly.
Its final sound stream seems to have a latency of 10 ms, which is perfectly fine for a game running at 62.5 FPS. Even 15 ms would be OK.
Sound effect looping? Specified by passing the DSBPLAY_LOOPING flag to IDirectSoundBuffer::Play().
Stereo ~~panning~~ balancing? One method call.
Playing the same sound multiple times simultaneously from a single memory buffer? One method call. (It can fail though, requiring you to copy the data after all.)
Pausing all sounds while the game window is not focused? That's the default behavior, but it can be equally easily disabled with just a single per-buffer flag.
Future streaming of waveform BGM? No problem either. Windows Touhou has always done that, and here's some code I wrote 12½ years ago that would even work without DirectSound 8's notification feature.
No further binary bloat, because it's part of the operating system.

The last point can't really be an argument against anything, but we'd still be left with 7 other boxes that a cross-platform alternative would have to tick. We already picked SDL for our portability needs, so how does its audio subsystem stack up? Unfortunately, not great:

It's fully DIY. All you get is a single output buffer, and you have to do all the mixing and effect processing yourself. In other words, it's the masochistic approach to cross-platform audio.
There are helper functions for resampling and mixing, but the documentation of the latter is full of FUD. With a disclaimer that so vehemently discourages the use of this function, what are you supposed to do if you're newly integrating SDL audio into a game? Hunt for a separate sound mixing library, even though your only quality goal is parity with stone-age DirectSound? 🙄
It forces the game to explicitly define the PCM sampling rate, bit depth, and channel count of the output buffer. You can't just pass a nullptr to SDL_OpenAudioDevice(), and if you pass a zeroed SDL_AudioSpec structure, SDL just defaults to an unacceptable 22,050 Hz sampling rate, regardless of what the audio device would actually prefer. It took until last year for them to notice that people would at least like to query the native format. But of course, this approach requires the backend to actually provide this information – and since we've seen above that DirectSound doesn't care, the DirectSound version of this function has to actually use the more modern WASAPI, and remains unimplemented if that API is not available.
Standardizing the game on a single sampling rate, bit depth, and channel count might be a decent choice for games that consistently use a single format for all its sounds anyway. In that case, you get to do all mixing and processing in that format, and the audio backend will at most do one final conversion into the playback device's native format. But in Shuusou Gyoku, most sound effects use 22,050 Hz, the boss explosion sound effect uses 11,025 Hz, and the future SC-88Pro BGM will obviously use 44,100 Hz. In such a scenario, you would have to pick the highest sampling rate among all sound sources, and resample any lower-quality sounds to that rate. But if the audio device uses a different sampling rate, those lower-quality sounds would get resampled a second time.
I know that this will be fixed in SDL 3, but that version is still under heavy development.
Positives? Uh… the callback-based nature means that BGM streaming is rather trivial, and would even be comparatively less complicated than with DirectSound. Having a mutex to prevent writes to your sound instance structures while they're being read by the audio thread is nice too.

OK, sure, but you're not supposed to use it for anything more than a single stream of audio. SDL_mixer exists precisely to cover such non-trivial use cases, and it even supports sound effect looping and panning with just a single function call! But as far as the rest of the library is concerned, it manages to be an even bigger disappointment than raw SDL audio:

As it sits on top of SDL's audio subsystem, it still can't just use your audio device's native sample format.
Even worse, it insists on initializing the audio device itself, and thus always needs to duplicate whatever you would do for raw SDL.
It only offers a very opinionated system for streaming – and of course, its opinion is wrong. 😛 The fact that it only supports a single streaming audio track wouldn't matter all too much if you could switch to another track at sample precision. But since you can't, you're forced to implement looping BGM using a single file…
…which brings us to the unfortunate issue of loop point definitions. And, perhaps most importantly, the complete lack of any way to set them through the API?! It doesn't take long until you come up with a theory for why the API only offers a function to retrieve loop points: The "music" abstraction is so format-agnostic that it even supports MIDI and tracker formats where a typical loop point in PCM samples doesn't make sense. Both of these formats already have in-band ways of specifying loop points in their respective time units. They might not be standardized, but it's still much better than usual single-file solutions for PCM streams where the loop point has to be stored in an out-of-band way – such as in a metadata tag or an entirely separate file.
- Speaking of MIDI, why is it so common among these APIs to not have any way of specifying the MIDI device? The fact that Windows Vista removed the Control Panel option for specifying the system-wide default MIDI output device is no excuse for your API lacking the option as well. In fact, your MIDI API now needs such a setting more than it was needed in the Windows XP and 9x days.
- Actually, wait, the API does have a function that is exclusive to tracker formats. Which means that they aren't actually insisting on a clean, consistent, and minimal API here… 🤔
Funnily enough, they did once receive a patch for a function to set loop points which was never upstreamed… and this patch came from the main developer behind PyTouhou, who needed that feature for obvious reasons. The world sure is a small place.
As a result, they turned loop points into a property that each individual format may or may not have. Want to loop MP3 files at sample precision? Tough luck, time to reconvert to another lossy format. 🙄 This is the exact jank I decided against when I implemented BGM modding for thcrap back in 2018, where I concluded that separate intro and loop files are the way to go.
But OK, we only plan to use FLAC and Ogg Vorbis for the SC-88Pro BGM, for which SDL_mixer does support loop points in the form of Vorbis comments, and hey, we can even pass them at sample accuracy. Sure, it's wrong and everything, but nothing I couldn't work with…
However, the final straw that makes SDL_mixer unsuitable for Shuusou Gyoku is its core sound mixing paradigm of distributing all sound effects onto a fixed number of channels, set to 8 by default. Which raises the quite ridiculous question of how many we would actually need to cover the maximum amount of sounds that can simultaneously be played back in any game situation. The theoretic maximum would be 41, which is the combined sum of individual sound buffer instances of all 20 original sound effects. The practical limit would surely be a lot smaller, but we could only find out that one through experiments, which honestly is quite a silly proposition.
- It makes you wonder why they went with this paradigm in the first place. And sure enough, they actually use the aforementioned SDL core function for mixing audio. Yes, the same function whose current documentation advises against using it for this exact use case. 🙄 What's the argument here? "Sure, 8 is significantly more than 2, but any mixing artifacts that will occur for the next 6 sounds are not worrying about, but they get really bad after the 8^th sound, so we're just going to protect you from that"?

There is a fork that does add support for an arbitrary number of music streams, but the rest of its features leave me questioning the priorities and focus of this project. Because surely, when I think about missing features in an audio backend, I immediately think about support for a vast array of chiptune file formats… 🤪
And wait, what, they merged this piece of bloat back into the official SDL_mixer library?! Thanks for opening up a vast attack surface for potential security vulnerabilities in code that would never run for the majority of users, just to cover some niche formats that nobody would seriously expect in a general audio library. And that's coming from someone who loves listening to that stuff!
At this rate, I'm expecting SDL_mixer to gain a mail client by the end of the decade. Hmm, what's the closest audio thing to a mail client… oh, right, WebRTC! Yeah, let's just casually drop a giant part of the Chromium codebase into SDL_mixer, what could possibly go wrong?

This dire situation made me wonder if SDL was the wrong choice for Shuusou Gyoku to begin with. Looking at other low-level cross-platform game libraries, you'll quickly notice that all of them come with mostly equally capable 2D renderers these days, and mainly differentiate themselves in minute API details that you'd only notice upon a really close look.
raylib is another one of those libraries and has been getting exceptionally popular in recent years, to the point of even having more than twice as many GitHub stars as SDL. By restricting itself to OpenGL, it can even offer an abstraction for shaders, which we'd really like for the 西方Ｐｒｏｊｅｃｔ lens ball effect.
In the case of raylib's audio system, the lack of sound effect looping is the minute API detail that would make it annoying to use for Shuusou Gyoku. But it might be worth a look at how raylib implements all this if it doesn't use SDL… which turned out to be the best look I've taken in a long time, because raylib builds on top of miniaudio which is exactly the kind of audio library I was hoping to find. Let's check the list from above:

🟢 miniaudio's high-level API initialization defaults to the native sample format of the playback device. Its internal processing uses 32-bit floating-point samples and only converts back to the native bit depth as necessary when writing the final stream into the backend's audio buffer. WASAPI, for example, never needs any further conversion because it operates with 32-bit floats as well.
🟢 The final audio stream uses the same 10 ms update period (and thus, sound effect latency) that I was getting with DirectSound.
🟢 Stereo ~~panning~~ balancing? ma_sound_set_pan(), although it does require a conversion from Shuusou Gyoku's dB units into a linear attenuation factor.
🟢 Sound effect looping? ma_sound_set_looping().
🟢 Playing the same sound multiple times simultaneously from a single memory buffer? Perfectly possible, but requires a bit of digging in the header to find the best solution. More on that below.
🟢 Future streaming of waveform BGM? Just call ma_sound_init_from_file() with the MA_SOUND_FLAG_STREAM flag.
- 👍 It also comes with a FLAC decoder in the core library and an Ogg Vorbis one as part of the repo, …
- 🤩 … and even supports gapless switching between the intro and loop files via a single declarative call to ma_data_source_set_next()!
  (Oh, and it also has ma_data_set_loop_point_in_pcm_frames() for anyone who still believes in obviously and objectively inferior out-of-band loop points.)
🟢 Pausing all sounds while the game window is not focused? It's not automatic, but adding new functions to the sound interface and calling ma_engine_stop() and ma_engine_start() does the trick, and most importantly doesn't cause any samples to be lost in the process.
🟡 Sound control is implemented in a lock-free way, allowing your main game thread to call these at any time without causing glitches on the audio thread. While that looks nice and optimal on the surface, you now have to either believe in the soundness (ha) of the implementation, or verify that atomic structure fields actually are enough to not cause any race conditions (which I did for the calls that Shuusou Gyoku uses, and I didn't find any). "It's all lock-free, don't worry about it" might be easier, but I consider SDL's approach of just providing a mutex to prevent the output callback from running while you mutate the sound state to actually be simpler conceptually.
🟡 miniaudio adds 247 KB to the binary in its minimum configuration, a bit more than expected. Some of that is bloat from effect code that we never use, but it does include backends for all three Windows audio subsystems (WASAPI, DirectSound, and WinMM).
✅ But perhaps most importantly: It natively supports all modern operating systems that one could seriously want to port this game to, and could be easily ported to any other backend, including SDL.

Oh, and it's written by the same developer who also wrote the best FLAC library back in 2018. And that's despite them being single-file C libraries, which I consider to be massively overrated…

The drawback? Similar to Zig, it's only on version 0.11.18, and also focuses on good high-level documentation at the expense of an API reference. Unlike Zig though, the three issues I ran into turned out to be actual and fixable bugs: Two minor ones related to looping of streamed sounds shorter than 2 seconds which won't ever actually affect us before we get into BGM modding, and a critical one that added high-frequency corruption to any mono sound effect during its expansion to stereo. The latter took days to track down – with symptoms like these, you'd immediately suspect the bug to lie in the resampler or its low-pass filter, both of which are so much more of a fickle and configurable part of the conversion chain here. Compared to that, stereo expansion is so conceptually simple that you wouldn't imagine anyone getting it wrong.
While the latter PR has been merged, the fix is still only part of the dev branch and hasn't been properly released yet. Fortunately, raylib is not affected by this bug: It does currently ship version 0.11.16 of miniaudio, but its usage of the library predates miniaudio's high-level API and it therefore uses a different, non-SSE-optimized code path for its format conversions.

The only slightly tricky part of implementing a miniaudio backend for Shuusou Gyoku lies in setting up multiple simultaneously playing instances for each individual sound. The documentation and answers on the issue tracker heavily push you toward miniaudio's resource manager and its file abstractions to handle this use case. We surely could turn Shuusou Gyoku's numeric sound effect IDs into fake file names, but it doesn't really fit the existing architecture where the sound interface just receives in-memory .WAV file buffers loaded from the SOUND.DAT packfile.
In that case, this seems to be the best way:

Call ma_decode_memory() to decode from any of the supported audio formats to a buffer of raw PCM samples.
At this point, you can choose between
1. decoding into the original format the sound effect is stored in, which would require it to be converted to the playback format every time it's played, or
2. decoding into 32-bit floats (the native bit depth of the miniaudio engine) and the native sampling rate of the playback device, which avoids any further resampling and floating-point conversion, but takes up more memory.
Nowadays, it's not clear at all which of the two approaches is faster. Does it actually matter if we save the audio thread from doing all those floating-point operations on every sample? Or is that no longer true these days because the audio thread is probably running on a different CPU core, the rest of the game largely doesn't touch the floating-point parts of your CPU anyway, and you'd rather want to keep sound effects small so that they can better fit into the CPU cache? That would be an interesting question to benchmark, but just like the similar text rendering question from the last blog posts, it doesn't matter for this tiny 2000s retro game. 😌
I went with 2) mainly because it simplified all the debugging I was doing. At a sampling rate of 48,000 Hz, this increases the memory usage for all sound effects from 379 KiB to 3.67 MiB. At least I'm not channel-expanding all sound effects as well here… We've seen earlier that mono➜stereo expansion is SSE-optimized, so it's very hard to justify a further doubling of the memory usage here.
Then, for each instance of the sound, call
- ma_audio_buffer_ref_init() to create a reference buffer with its own playback cursor, and
- ma_sound_init_from_data_source() to create a new high-level sound node that will play back the reference buffer.

As a side effect of hunting that one critical bug in miniaudio, I've now learned a fair bit about audio resampling in general. You'll probably need some knowledge about basic digital signal behavior to follow this section, and that video is still probably the best introduction to the topic.

So, how could this ever be an issue? The only time I ever consciously thought about resampling used to be in the context of the Opus codec and its enforced sampling rate of 48,000 Hz, and how Opus advocates claim that resampling is a solved problem and nothing to worry about, especially in the context of a lossy codec. Still, I didn't add Opus to thcrap's BGM modding feature entirely because the mere thought of having to downsample to 44,100 Hz in the decoder was off-putting enough. But even if my worries were unfounded in that specific case: Recording the Stereo Mix of Shuusou Gyoku's now two audio backends revealed that apparently not every audio processing chain features an Opus-quality resampler…

If we take a look at the material that resamplers actually have to work with here, it quickly becomes obvious why their results are so varied. As mentioned above, Shuusou Gyoku's sound effects use rather low sampling rates that are pretty far away from the 48,000 Hz your audio device is most definitely outputting. Therefore, any potential imaging noise across the extended high-frequency range – i.e., from the original Nyquist frequencies of 11,025 Hz/5,512.5 Hz up to the new limit of 24,000 Hz – is still within the audible range of most humans and can clearly color the resulting sound.
But it gets worse if the audio data you put into the resampler is objectively defective to begin with, which is exactly the problem we're facing with over half of Shuusou Gyoku's sound effects. Encoding them all as 8-bit PCM is definitely excusable because it was the turn of the millennium and the resulting noise floor is masked by the BGM anyway, but the blatant clipping and DC offsets definitely aren't:

<code>SOUND.DAT</code>, file 1/20 — Waveforms for all 20 of Shuusou Gyoku's sound effects, in the order they appear inside `SOUND.DAT` and with their internal names. We can see quite an abundance of clipping, as well as a significant DC offset in `WARNING`, `BUZZ`, `JOINT`, `SBBOMB`, and `BOSSBOMB`.
Show true peaks

<code>SOUND.DAT</code>, file 2/20 — Waveforms for all 20 of Shuusou Gyoku's sound effects, in the order they appear inside `SOUND.DAT` and with their internal names. We can see quite an abundance of clipping, as well as a significant DC offset in `WARNING`, `BUZZ`, `JOINT`, `SBBOMB`, and `BOSSBOMB`.
Show true peaks

Wait a moment, true peaks? Where do those come from? And, equally importantly, how can we even observe, measure, and store anything above the maximum amplitude of a digital signal?

The answer to the first question can be directly derived from the Xiph.org video I linked above: Digital signals are lollipop graphs, not stairsteps as commonly depicted in audio editing software. Converting them back to an analog signal involves constructing a continuous curve that passes through each sample point, and whose frequency components stay below the Nyquist frequency. And if the amplitude of that reconstructed wave changes too strongly and too rapidly, the resulting curve can easily overshoot the maximum digital amplitude of 0 dBFS even if none of the defined samples are above that limit.

But I can assure you that I did not create the waveform images above by recording the analog output of some speakers or headphones and then matching the levels to the original files, so how did I end up with that image? It's not an Audacity feature either because the development team argues that there is no "true waveform" to be visualized as every DAC behaves differently. While this is correct in theory, we'd be happy just to get a rough approximation here.
ffmpeg's ebur128 filter has a parameter to measure the true peak of a waveform and fairly understandable source code, and once I looked at it, all the pieces suddenly started to make sense: For our purpose of only looking at digital signals, 💡 resampling to a floating-point signal with an infinite sampling rate is equivalent to a DAC. And that's exactly what this filter does: It picks 192,000 Hz and 64-bit float as a format that's close enough to the ideal of "analog infinity" for all practical purposes that involve digital audio, and then simply converts each incoming 100 ms of audio and keeps the sample with the largest floating-point value.

So let's store the resampled output as a FLAC file and load it into Audacity to visualize the clipped peaks… only to find all of them replaced with the typical kind of clipping distortion? 😕 Turns out that I've stumbled over the one case where the FLAC format isn't lossless and there's actually no alternative to .WAV: FLAC just doesn't support floating-point samples and simply truncates them to discrete integers during encoding. When we measured inter-sample peaks above, we weren't only resampling to a floating-point format to avoid any quantization to discrete integer values, but also to make it possible to store amplitudes beyond the 0 dBFS point of ±1.0 in the first place. Once we lose that ability, these amplitudes are clipped to the maximum value of the integer bit depth, and baked into the waveform with no way to get rid of them again. After all, the resampled file now uses a higher sampling rate, and the clipping distortion is now a defined part of what the sound is.
Finally, storing a digital signal with inter-sample peaks in a floating-point format also makes it possible for you to reduce the volume, which moves these peaks back into the regular, unclipped amplitude range. This is especially relevant for Shuusou Gyoku as you'll probably never listen to sound effects at full volume.

Now that we understand what's going on there, we can finally compare the output of various resamplers and pick a suitable one to use with miniaudio. And immediately, we see how they fall into two categories:

High-quality resamplers are the ones I described earlier: They cleanly recreate the signal at a higher sampling rate from its raw frequency representation and thus add no high-frequency noise, but can lead to inter-sample peaks above 0 dBFS.
Linear resamplers use much simpler math to merely interpolate between neighboring samples. Since the newly interpolated samples can only ever stay within 0 dBFS, this approach fully avoids inter-sample clipping, but at the expense of adding high-frequency imaging noise that has to then be removed using a low-pass filter.

miniaudio only comes with a linear resampler – but so does DirectSound as it turns out, so we can get actually pretty close to how the game sounded originally:

All of Shuusou Gyoku's sound effects combined and resampled into a single 48,000 Hz / 32-bit float .WAV file, using GoldWave's File Merger tool. By converting to 32-bit float first and then resampling, the conversion preserved the exact frequency range of the original 22,050 Hz and 11,025 Hz files, even despite clipping. There are small noise peaks across the entire frequency range, but they only occur at the exact boundary between individual sound effects. These are a simple result of the discontinuities that naturally occur in the waveform when concatenating signals that don't start or end at a 0 sample.
As mentioned above, you'll only get this sound out of your DAC at lower volumes where all of the resampled peaks still fit within 0 dBFS. But you most likely will have reduced your volume anyway, because these effects would be ear-splittingly loud otherwise.

The result of converting 1️⃣ into FLAC. The necessary bit depth conversion from 32-bit float to 16-bit integers clamps any data above 0 dBFS or ±1.0f to the discrete -32,678 32,767, the maximum value of such an integer. The resulting straight lines at maximum amplitude in the time domain then turn into distortion across the entire 24,000 Hz frequency domain, which then remains a part of the waveform even at lower volumes. The locations of the high-frequency noise exactly match the clipped locations in the time-domain waveform images above.
The resulting additional distortion can be best heard in BOSSBOMB, where the low source frequency ensures that any distortion stays firmly within the hearing range of most humans.

All of Shuusou Gyoku's sound effects as played through DirectSound and recorded through Stereo Mix. DirectSound also seems to use a linear low-pass filter that leaves quite a bit of high-frequency noise in the signals, making these effects sound crispier than they should be. Depending on where you stand, this is either highly inaccurate and something that should be fixed, or actually good because the sound effects really benefit from that added high end. I myself am definitely in the latter camp – and hey, this sound is the result of original game code, so it is accurate at least in that regard.

All of Shuusou Gyoku's sound effects as converted by miniaudio and directly saved to a file, with the same low-pass filter setting used in the P0256 build. This first-order low-pass filter is a decent approximation of DirectSound's resampler, even though it sounds slightly crispier as the high-frequency noise is boosted a little further. By default, miniaudio would use a 4^th-order low-pass filter, so this is the second-lowest resampling quality you can get, short of disabling the low-pass filter altogether.

Conversion results when using miniaudio's 8^th-order low-pass filter for resampling, the highest quality supported. This is the closest we can get to the reference conversion without using a custom resampler. If we do want to go for perfect accuracy though, we might as well go for 1️⃣ directly?

These spectrum images were initially created using ffmpeg's

-lavfi
		showspectrumpic=mode=combined:s=1280x720

filter. The samples appear in the same order as in the waveform above.

And yes, these are indeed the first videos on this blog to have sound! I spent another push on preparing the 📝 video conversion pipeline for audio support, and on adding the highly important volume control to the player. Web video codecs only support lossy audio, so the sound in these videos will not exactly match the spectrum image, but the lossless source files do contain the original audio as uncompressed PCM streams.

Compared to that whole mess of signals and noise, keyboard and joypad input is indeed much simpler. Thanks to SDL, it's almost trivial, and only slightly complicated because SDL offers two subsystems with seemingly identical APIs:

SDL_Joystick simply numbers all axes and buttons of a joypad and doesn't assign any meaning to these numbers, but works with every joypad ever.
SDL_GameController provides a consistent interface for the typical kind of modern gamepad with two analog sticks, a D-pad, and at least 4 face and 2 shoulder buttons. This API is implemented by simply combining SDL_Joystick with a long list of mappings for specific controllers, and therefore doesn't work with joypads that don't match this standard.

According to SDL, this is what a "game controller" looks like. Here's the source of the SVG.

To match Shuusou Gyoku's original WinMM backend, we'd ideally want to keep the best aspects from both APIs but without being restricted to SDL_GameController's idea of a controller. The Joy Pad menu just identifies each button with a numeric ID, so SDL_Joystick would be a natural fit. But what do we do about directional controls if SDL_Joystick doesn't tell us which joypad axes correspond to the X and Y directions, and we don't have the SDL-recommended configuration UI yet? Doing that right would also mean supporting POV hats and D-pads, after all… Luckily, all joypads we've tested map their main X axis to ID 0 and their main Y axis to ID 1, so this seems like a reasonable default guess.

Fortunately, there is a solution for our exact issue. We can still try to open a joypad via SDL_GameController, and if that succeeds, we can use a function to retrieve the SDL_Joystick ID for the main X and Y axis, close the SDL_GameController instance, and keep using SDL_Joystick for the rest of the game.
And with that, the SDL build no longer needs DirectInput 7, certain antivirus scanners will no longer complain about its low-level keyboard hook, and I turned the original game's single-joypad hot-plugging into multi-joypad hot-plugging with barely any code. 🎮

The necessary consolidation of the game's original input handling uncovered several minor bugs around the High Score and Game Over screen that I sufficiently described in the release notes of the new build. But it also revealed an interesting detail about the Joy Pad screen: Did you know that Shuusou Gyoku lets you unbind all these actions by pressing more than one joypad button at the same time? The original game indicated unbound actions with a [Button 0] label, which is pretty confusing if you have ever programmed anything because you now no longer know whether the game starts numbering buttons at 0 or 1. This is now communicated much more clearly.

Joypad button unbinding in the original version of Shuusou Gyoku, indicated by a rather confusing [Button 0] label — `ESC` is not bound to any joypad button in either screenshot, but it's only really obvious in the P0256 build.

Joypad button unbinding in the P0256 build of Shuusou Gyoku, using a much clearer [--------] label — `ESC` is not bound to any joypad button in either screenshot, but it's only really obvious in the P0256 build.

With that, we're finally feature-complete as far as this delivery is concerned! Let's send a build over to the backers as a quick sanity check… a~nd they quickly found a bug when running on Linux and Wine. When holding a button, the game randomly stops registering directional inputs for a short while on some joypads? Sounds very much like a Wine bug, especially if the same pad works without issues on Windows.
And indeed, on certain joypads, Wine maps the buttons to completely different and disconnected IDs, as if it simply invents new buttons or axes to fill the resulting gaps. Until we can differentiate joypad bindings per controller, it's therefore unlikely that you can use the same joypad mapping on both Windows and Linux/Wine without entering the Joy Pad menu and remapping the buttons every time you switch operating systems.

Still, by itself, this shouldn't cause any issues with my SDL event handling code… except, of course, if I forget a break; in a switch case. 🫠
This completely preventable implicit fallthrough has now caused a few hours of debugging on my end. I'd better crank up the warning level to keep this from ever happening again. Opting into this specific warning also revealed why we haven't been getting it so far: Visual Studio did gain a whole host of new warnings related to the C++ Core Guidelines a while ago, including the one I was looking for, but actually getting the compiler to throw these requires activating a separate static analysis mode together with a plugin, which significantly slows down build times. Therefore I only activate them for release builds, since these already take long enough.

But that wasn't the only step I took as a result of this blunder. In addition, I now offer free fixes for regressions in my mod releases if anyone else reports an issue before I find it myself. I've already been following this policy 📝 earlier this year when mu021 reported the unblitting bug in the initial release of the TH01 Anniversary Edition, and merely made it official now. If I was the one who broke a thing, I'll fix it for free.

Since all that input debugging already started a 5^th push, I might as well fill that one by restoring the original screenshot feature. After all, it's triggered by a key press (and is thus related to the input backend), reads the contents of the frame buffer (and is thus related to the graphics backend), and it honestly looks bad to have this disclaimer in the release notes just because we're one small feature away from 100% parity with pbg's original binary.
Coincidentally, I had already written code to save a DirectDraw surface to a .BMP file for all the debugging I did in the last delivery, so we were basically only missing filename generation. Except that Shuusou Gyoku's original choice of mapping screenshots to the PrintScreen key did not age all too well:

As of Windows XP's 64-bit version, you can no longer use standard window messages to detect that this key is being pressed.
And as of Windows 11, the OS takes full control of the key by binding it to the Snipping Tool by default, complete with a UI that politely steals focus when hitting that key.

As a result, both Arandui and I independently arrived at the idea of remapping screenshots to the P key, which is the same screenshot key used by every Windows Touhou game since TH08.

The rest of the feature remains unchanged from how it was in pbg's original build and will save every distinct frame rendered by the game (i.e., before flipping the two framebuffers) to a .BMP file as long as the P key is being held. At a 32-bit color depth, these screenshots take up 1.2 MB per frame, which will quickly add up – especially since you'll probably hold the P key for more than ¹/₆₀ of a second and therefore end up saving multiple frames in a row. We should probably compress them one day.

Since I already translated some of Shuusou Gyoku's ASM code to C++ during the Zig experiment, it made sense to finish the fifth push by covering the rest of those functions. The integer math functions are used all throughout the game logic, and are the main reason why this goal is important for a Linux port, or any port to a 64-bit architecture for that matter. If you've ever read a micro-optimization-related blog post, you'll know that hand-written ASM is a great recipe that often results in the finest jank, and the game's square root function definitely delivers in that regard, right out of the gate.
What slightly differentiates this algorithm from the typical definition of an integer square root is that it rounds up: In real numbers, √3 is ≈ 1.73, so isqrt(3) returns 2 instead of 1. However, if the result is always rounded down, you can determine whether you have to round up by simply squaring the calculated root and comparing it to the radicand. And even that is only necessary if the difference between the two doesn't naturally fall out of the algorithm – which is what also happens with Shuusou Gyoku's original ASM code, but pbg didn't realize this and squared the result regardless.

That's one suboptimal detail already. Let's call the original ASM function in a loop over the entire supported range of radicands from 0 to 2³¹ and produce a list of results that I can verify my C++ translation against… and watch as the function's linear time complexity with regard to the radicand causes the loop to run for over 15 hours on my system. 🐌 In a way, I've found the literal opposite of Q_rsqrt() here: Not fast, not inverse, no bit hacks, and surely without the awe-inspiring kind of WTF.
I really didn't want to run the same loop over a literal C++ translation of the same algorithm afterward. Calculating integer square roots is a common problem with lots of solutions, so let's see if we can go better than linear.

And indeed, Wikipedia also has a bitwise algorithm that runs in logarithmic time, uses only additions, subtractions, and bit shifts, and even ends up with an error term that we can use to round up the result as necessary, without a multiplication. And this algorithm delivers the exact same results over the exact same range in… 50 seconds. 🏎️ And that's with the I/O to print the first value that returns each of the 46,341 different square root results.

"But wait a moment!", I hear you say. "Why are you bothering with an integer square root algorithm to begin with? Shouldn't good old round(sqrt(x)) from <math.h> do the trick just fine? Our CPUs have had SSE for a long time, and this probably compiles into the single SQRTSD instruction. All that extra floating-point hardware might mean that this instruction could even run in parallel with non-SSE code!"
And yes, all of that is technically true. So I tested it, and my very synthetic and constructed micro-benchmark did indeed deliver the same results in… 48 seconds. That's not enough of a difference to justify breaking the spirit of treating the FPU as lava that permeates Shuusou Gyoku's code base. Besides, it's not used for that much to begin with:

pre-calculating the 西方Ｐｒｏｊｅｃｔ lens ball effect
the fade animation when entering and leaving stages
rendering the circular part of stationary lasers
pulling items to the player when bombing

After a quick C++ translation of the RNG function that spells out a 32-bit multiplication on a 32-bit CPU using 16-bit instructions, we reach the final pieces of ASM code for the 8-bit atan2() and trapezoid rendering. These could actually pass for well-written ASM code in how they express their 64-bit calculations: atan8() prepares its 64-bit dividend in the combined EDX and EAX registers in a way that isn't obvious at all from a cursory look at the code, and the trapezoid functions effectively use Q32.32 subpixels. C++ allows us to cleanly model all these calculations with 64-bit variables, but unfortunately compiles the divisions into a call to a comparatively much more bloated 64-bit/64-bit-division polyfill function. So yeah, we've actually found a well-optimized piece of inline assembly that even Visual Studio 2022's optimizer can't compete with. But then again, this is all about code generation details that are specific to 32-bit code, and it wouldn't be surprising if that part of the optimizer isn't getting much attention anymore. Whether that optimization was useful, on the other hand… Oh well, the new C++ version will be much more efficient in 64-bit builds.

And with that, there's no more ASM code left in Shuusou Gyoku's codebase, and the original DirectXUTYs directory is slowly getting emptier and emptier.

Phew! Was that everything for this delivery? I think that was everything. Here's the new build, which checks off 7 of the 15 remaining portability boxes:

Shuusou Gyoku P0256

Next up: Taking a well-earned break from Shuusou Gyoku and starting with the preparations for multilingual PC-98 Touhou translatability by looking at TH04's and TH05's in-game dialog system, and definitely writing a shorter blog post about all that…

📝 Posted:: 2023-03-30 13:23 UTC
🚚 Summary of:: P0235, P0236, P0237
⌨ Commits:: e7a9262...62c4b7f, 62c4b7f...7fa9038, 7fa9038...c5e51e6
💰 Funded by:: Ember2528, Yanga
🏷 Tags:

So, TH02! Being the only game whose main binary hadn't seen any dedicated attention ever, we get to start the TH02-related blog posts at the very beginning with the most foundational pieces of code. The stage tile system is the best place to start here: It not only blocks every entity that is rendered on top of these tiles, but is curiously placed right next to master.lib code in TH02, and would need to be separated out into its own translation unit before we can do the same with all the master.lib functions.

In late 2018, I already RE'd 📝 TH04's and TH05's stage tile implementation, but haven't properly documented it on this blog yet, so this post is also going to include the details that are unique to those games. On a high level, the stage tile system works identically in all three games:

The tiles themselves are 16×16 pixels large, and a stage can use 100 of them at the same time.
The optimal way of blitting tiles would involve VRAM-to-VRAM copies within the same page using the EGC, and that's exactly what the games do. All tiles are stored on both VRAM pages within the rightmost 64×400 pixels of the screen just right next to the HUD, and you only don't see them because the games cover the same area in text RAM with black cells:

The initial screen of TH02's Stage 1, with the tile source area uncovered by filling the same area in text RAM with transparent cells instead of black ones. In TH02, this also reveals how the tile area ends with a bunch of glitch tiles, tinted blue in the image. These are the result of ZUN unconditionally blitting 100 tile images every time, regardless of how many are actually contained in an .MPN file.
These glitch tiles are another good example of a ZUN landmine. Their appearance is the result of reading heap memory outside allocated boundaries, which can easily cause segmentation faults when porting the game to a system with virtual memory. Therefore, these would not just be removed in this game's Anniversary Edition, but on the more conservative debloated branch as well. Since the game never uses these tiles and you can't observe them unless you manipulate text RAM from outside the confines of the game, it's not a bug according to our definition.
To reduce the memory required for a map, tiles are arranged into fixed vertical sections of a game-specific constant size.
The 6 24×8-tile sections defined in TH02's STAGE0.MAP, in reverse order compared to how they're defined in the file. Note the duplicated row at the top of the final section: The boss fight starts once the game scrolled the last full row of tiles onto the top of the screen, not the playfield. But since the PC-98 text chip covers the top tile row of the screen with black cells, this final row is never visible, which effectively reduces a map's final tile section to 7 rows rather than 8.
The actual stage map then is simply a list of these tile sections, ordered from the start/bottom to the top/end.
Any manipulation of specific tiles within the fixed tile sections has to be hardcoded. An example can be found right in Stage 1, where the Shrine Tank leaves track marks on the tiles it appears to drive over:

This video also shows off the two issues with Touhou's first-ever midboss: The replaced tiles are rendered below the midboss during their first 4 frames, and maybe ZUN should have stopped the tile replacements one row before the timeout. The first one is clearly a bug, but it's not so clear-cut with the second one. I'd need to look at the code to tell for sure whether it's a quirk or a bug.

The differences between the three games can best be summarized in a table:

	TH02	TH04	TH05
Tile image file extension	.MPN
Tile section format	.MAP
Tile section order defined as part of	.DT1	.STD
Tile section index format	0-based ID		0-based ID × 2
Tile image index format	Index between 0 and 100, 1 byte	VRAM offset in tile source area, 2 bytes
Scroll speed control	Hardcoded	Part of the .STD format, defined per referenced tile section
Redraw granularity	Full tiles (16×16)	Half tiles (16×8)
Rows per tile section	8	5
Maximum number of tile sections	16	32
Lowest number of tile sections used	5 (Stage 3 / Extra)	8 (Stage 6)	11 (Stage 2 / 4)
Highest number of tile sections used	13 (Stage 4)	19 (Extra)	24 (Stage 3)
Maximum length of a map	320 sections (static buffer)	256 sections (format limitation)
Shortest map	14 sections (Stage 5)	20 sections (Stage 5)	15 sections (Stage 2)
Longest map	143 sections (Stage 4)	95 sections (Stage 4)	40 sections (Stage 1 / 4 / Extra)

The most interesting part about stage tiles is probably the fact that some of the .MAP files contain unused tile sections. 👀 Many of these are empty, duplicates, or don't really make sense, but a few are unique, fit naturally into their respective stage, and might have been part of the map during development. In TH02, we can find three unused sections in Stage 5:

Section 0 of TH02's STAGE4.MAP — The non-empty tile sections defined in TH02's `STAGE4.MAP`, showing off three unused ones.

Section 1 of TH02's STAGE4.MAP — The non-empty tile sections defined in TH02's `STAGE4.MAP`, showing off three unused ones.

These unused tile sections are much more common in the later games though, where we can find them in TH04's Stage 3, 4, and 5, and TH05's Stage 1, 2, and 4. I'll document those once I get to finalize the tile rendering code of these games, to leave some more content for that blog post. TH04/TH05 tile code would be quite an effective investment of your money in general, as most of it is identical across both games. Or how about going for a full-on PC-98 Touhou map viewer and editor GUI?

Compared to TH04 and TH05, TH02's stage tile code definitely feels like ZUN was just starting to understand how to pull off smooth vertical scrolling on a PC-98. As such, it comes with a few inefficiencies and suboptimal implementation choices:

The redraw flag for each tile is stored in a 24×25 bool array that does nothing with 7 of the 8 bits.
During bombs and the Stage 4, 5, and Extra bosses, the game disables the tile system to render more elaborate backgrounds, which require the playfield to be flood-filled with a single color on every frame. ZUN uses the GRCG's RMW mode rather than TDW mode for this, leaving almost half of the potential performance on the table for no reason. Literally, changing modes only involves changing a single constant.
The scroll speed could theoretically be changed at any time. However, the function that scrolls in new stage tiles can only ever blit part of a single tile row during every call, so it's up to the caller to ensure that scrolling always ends up on an exact 16-pixel boundary. TH02 avoids this problem by keeping the scroll speed constant across a stage, using 2 pixels for Stage 4 and 1 pixel everywhere else.
Since the scroll speed is given in pixels, the slowest speed would be 1 pixel per frame. To allow the even slower speeds seen in the final game, TH02 adds a separate scroll interval variable that only runs the scroll function every 𝑛th frame, effectively adding a prescaler to the scroll speed. In TH04 and TH05, the speed is specified as a Q12.4 value instead, allowing true fractional speeds at any multiple of ¹/₁₆ pixels. This also necessitated a fixed algorithm that correctly blits tile lines from two rows.
Finally, we've got a few inconsistencies in the way the code handles the two VRAM pages, which cause a few unnecessary tiles to be rendered to just one of the two pages. Mentioning that just in case someone tries to play this game with a fully cleared text RAM and wonders where the flickering tiles come from.

Even though this was ZUN's first attempt at scrolling tiles, he already saw it fit to write most of the code in assembly. This was probably a reaction to all of TH01's performance issues, and the frame rate reduction workarounds he implemented to keep the game from slowing down too much in busy places. "If TH01 was all C++ and slow, TH02 better contain more ASM code, and then it will be fast, right?"
Another reason for going with ASM might be found in the kind of documentation that may have been available to ZUN. Last year, the PC-98 community discovered and scanned two new game programming tutorial books from 1991 (1, 2). Their example code is not only entirely written in assembly, but restricts itself to the bare minimum of x86 instructions that were available on the 8086 CPU used by the original PC-9801 model 9 years earlier. Such code is not only suboptimal on the 486, but can often be actually worse than what your C++ compiler would generate. TH02 is where the trend of bad hand-written ASM code started, and it 📝 only intensified in ZUN's later games. So, don't copy code from these books unless you absolutely want to target the earlier 8086 and 286 models. Which, 📝 as we've gathered from the recent blitting benchmark results, are not all too common among current real-hardware owners.
That said, all that ASM code really only impacts readability and maintainability. Apart from the aforementioned issues, the algorithms themselves are mostly fine – especially since most EGC and GRCG operations are decently batched this time around, in contrast to TH01.

Luckily, the tile functions merely use inline assembly within a typical C function and can therefore be at least part of a C++ source file, even if the result is pretty ugly. This time, we can actually be sure that they weren't written directly in a .ASM file, because they feature x86 instruction encodings that can only be generated with Turbo C++ 4.0J's inline assembler, not with TASM. The same can't unfortunately be said about the following function in the same segment, which marks the tiles covered by the spark sprites for redrawing. In this one, it took just one dumb hand-written ASM inconsistency in the function's epilog to make the entire function undecompilable.
The standard x86 instruction sequence to set up a stack frame in a function prolog looks like this:

PUSH	BP
MOV 	BP, SP
SUB 	SP, ?? ; if the function needs the stack for local variables

When compiling without optimizations, Turbo C++ 4.0J will replace this sequence with a single ENTER instruction. That one is two bytes smaller, but much slower on every x86 CPU except for the 80186 where it was introduced.

In functions without local variables, BP and SP remain identical, and a single POP BP is all that's needed in the epilog to tear down such a stack frame before returning from the function. Otherwise, the function needs an additional

MOV SP,
	BP

instruction to pop all local variables. With x86 being the helpful CISC architecture that it is, the 80186 also introduced the LEAVE instruction to perform both tasks. Unlike ENTER, this single instruction is faster than the raw two instructions on a lot of x86 CPUs (and even current ones!), and it's always smaller, taking up just 1 byte instead of 3.
So what if you use LEAVE even if your function doesn't use local variables?

The fact that the instruction first does the equivalent of MOV SP, BP doesn't matter if these registers are identical, and who cares about the additional CPU cycles of LEAVE compared to just POP BP, right? So that's definitely something you could theoretically do, but not something that any compiler would ever generate.

And so, TH02 MAIN.EXE decompilation already hits the first brick wall after two pushes. Awesome! Theoretically, we could slowly mash through this wall using the 📝 code generator. But having such an inconsistency in the function epilog would mean that we'd have to keep Turbo C++ 4.0J from emitting any epilog or prolog code so that we can write our own. This means that we'd once again have to hide any use of the SI and DI registers from the compiler… and doing that requires code generation macros for 22 of the 49 instructions of the function in question, almost none of which we currently have. So, this gets quite silly quite fast, especially if we only need to do it for one single byte.

Instead, wouldn't it be much better if we had a separate build step between compile and link time that allowed us to replicate mistakes like these by just patching the compiled .OBJ files? These files still contain the names of exported functions for linking, which would allow us to look up the code of a function in a robust manner, navigate to specific instructions using a disassembler, replace them, and write the modified .OBJ back to disk before linking. Such a system could then naturally expand to cover all other decompilation issues, culminating in a full-on optimizer that could even recreate ZUN's self-modifying code. At that point, we would have sealed away all of ZUN's ugly ASM code within a separate build step, and could finally decompile everything into readable C++.

Pulling that off would require a significant tooling investment though. Patching that one byte in TH02's spark invalidation function could be done within 1 or 2 pushes, but that's just one issue, and we currently have 32 other .ASM files with undecompilable code. Also, note that this is fundamentally different from what we're doing with the debloated branch and the Anniversary Editions. Mistake patching would purely be about having readable code on master that compiles into ZUN's exact binaries, without fixing weird code. The Anniversary Editions go much further and rewrite such code in a much more fundamental way, improving it further than mistake patching ever could.
Right now, the Anniversary Editions seem much more popular, which suggests that people just want 100% RE as fast as possible so that I can start working on them. In that case, why bother with such undecompilable functions, and not just leave them in raw and unreadable x86 opcode form if necessary… But let's first see how much backer support there actually is for mistake patching before falling back on that.

The best part though: Once we've made a decision and then covered TH02's spark and particle systems, that was it, and we will have already RE'd all ZUN-written PC-98-specific blitting code in this game. Every further sprite or shape is rendered via master.lib, and is thus decently abstracted. Guess I'll need to update 📝 the assessment of which PC-98 Touhou game is the easiest to port, because it sure isn't TH01, as we've seen with all the work required for the first Anniversary Edition build.

Until then, there are still enough parts of the game that don't use any of the remaining few functions in the _TEXT segment. Previously, I mentioned in the 📝 status overview blog post that TH02 had a seemingly weird sprite system, but the spark and point popup ( 〇一二三四五六七八九十×÷ ) structures showed that the game just stores the current and previous position of its entities in a slightly different way compared to the rest of PC-98 Touhou. Instead of having dedicated structure fields, TH02 uses two-element arrays indexed with the active VRAM page. Same thing, and such a pattern even helps during RE since it's easy to spot once you know what to look for.
There's not much to criticize about the point popup system, except for maybe a landmine that causes sprite glitches when trying to display more than 99,990 points. Sadly, the final push in this delivery was rounded out by yet another piece of code at the opposite end of the quality spectrum. The particle and smear effects for Reimu's bomb animations consist almost entirely of assembly bloat, which would just be replaced with generic calls to the generic blitter in this game's future Anniversary Edition.

If I continue to decompile TH02 while avoiding the brick wall, items would be next, but they probably require two pushes. Next up, therefore: Integrating Stripe as an alternative payment provider into the order form. There have been at least three people who reported issues with PayPal, and Stripe has been working much better in tests. In the meantime, here's a temporary Stripe order link for everyone. This one is not connected to the cap yet, so please make sure to stay within whatever value is currently shown on the front page – I will treat any excess money as donations. If there's some time left afterward, I might also add some small improvements to the TH01 Anniversary Edition.

📝 Posted:: 2023-01-17 17:52 UTC
🚚 Summary of:: P0227, P0228
⌨ Commits:: 4f85326...bfd24c6, bfd24c6...739e1d8
💰 Funded by:: nrook, [Anonymous]
🏷 Tags:

Starting the year with a delivery that wasn't delayed until the last day of the month for once, nice! Still, very soon and high-maintenance did not go well together…

It definitely wasn't Sara's fault though. As you would expect from a Stage 1 Boss, her code was no challenge at all. Most of the TH02, TH04, and TH05 bosses follow the same overall structure, so let's introduce a new table to replace most of the boilerplate overview text:

	Phase #	Patterns	HP boundary	Timeout condition
	(Entrance)		4,650	288 frames
	2	4	2,550	2,568 frames	(= 32 patterns)
	3	4	450	5,296 frames	(= 24 patterns)
	4	1	0	1,300 frames
Total		9		9,452 frames

In Phases 2 and 3, Sara cycles between waiting, moving randomly for a fixed 28 frames, and firing a random pattern among the 4 phase-specific ones. The pattern selection makes sure to never pick any pattern twice in a row. Both phases contain spiral patterns that only differ in the clockwise or counterclockwise turning direction of the spawner; these directions are treated as individual unrelated patterns, so it's possible for the "same" pattern to be fired multiple times in a row with a flipped direction.
The two phases also differ in the wait and pattern durations:
- In Phase 2, the wait time starts at 64 frames and decreases by 12 frames after the first 5 patterns each, ending on a minimum of 4 frames. In Phase 3, it's a constant 16 frames instead.
- All Phase 2 patterns are fired for 28 frames, after a 16-frame gather animation. The Phase 3 pattern time starts at 80 frames and increases by 24 frames for the first 6 patterns, ending at 200 frames for all later ones.
Phase 4 consists of the single laser corridor pattern with additional random bullets every 16 frames.

And that's all the gameplay-relevant detail that ZUN put into Sara's code. It doesn't even make sense to describe the remaining patterns in depth, as their groups can significantly change between difficulties and rank values. The 📝 general code structure of TH05 bosses won't ever make for good-code, but Sara's code is just a lesser example of what I already documented for Shinki.
So, no bugs, no unused content, only inconsequential bloat to be found here, and less than 1 push to get it done… That makes 9 PC-98 Touhou bosses decompiled, with 22 to go, and gets us over the sweet 50% overall finalization mark! 🎉 And sure, it might be possible to pass through the lasers in Sara's final pattern, but the boss script just controls the origin, angle, and activity of lasers, so any quirk there would be part of the laser code… wait, you can do what?!?

TH05 expands TH04's one-off code for Yuuka's Master and Double Sparks into a more featureful laser system, and Sara is the first boss to show it off. Thus, it made sense to look at it again in more detail and finalize the code I had purportedly 📝 reverse-engineered over 4 years ago. That very short delivery notice already hinted at a very time-consuming future finalization of this code, and that prediction certainly came true. On the surface, all of the low-level laser ray rendering and collision detection code is undecompilable: It uses the SI and DI registers without Turbo C++'s safety backups on the stack, and its helper functions take their input and output parameters from convenient registers, completely ignoring common calling conventions. And just to raise the confusion even further, the code doesn't just set these registers for the helper function calls and then restores their original values, but permanently shifts them via additions and subtractions. Unfortunately, these convenient registers also include the BP base pointer to the stack frame of a function… and shifting that register throws any intuition behind accessed local variables right out of the window for a good part of the function, requiring a correctly shifted view of the stack frame just to make sense of it again. How could such code even have been written?! This goes well beyond the already wrong assumption that using more stack space is somehow bad, and straight into the territory of self-inflicted pain.

So while it's not a lot of instructions, it's quite dense and really hard to follow. This code would really benefit from a decompilation that anchors all this madness as much as possible in existing C++ structures… so let's decompile it anyway?
Doing so would involve emitting lots of raw machine code bytes to hide the SI and DI registers from the compiler, but I already had a certain 📝 batshit insane compiler bug workaround abstraction lying around that could make such code more readable. Hilariously, it only took this one additional use case for that abstraction to reveal itself as premature and way too complicated. Expanding the core idea into a full-on x86 instruction generator ended up simplifying the code structure a lot. All we really want there is a way to set all potential parameters to e.g. a specific form of the MOV instruction, which can all be expressed as the parameters to a force-inlined __emit__() function. Type safety can help by providing overloads for different operand widths here, but there really is no need for classes, templates, or explicit specialization of templates based on classes. We only need a couple of enums with opcode, register, and prefix constants from the x86 reference documentation, and a set of associated macros that token-paste pseudoregisters onto the prefixes of these enum constants.
And that's how you get a custom compile-time assembler in a 1994 C++ compiler and expand the limits of decompilability even further. What's even truly left now? Self-modifying code, layout tricks that can't be replicated with regularly structured control flow… and that's it. That leaves quite a few functions I previously considered undecompilable to be revisited once I get to work on making this game more portable.

With that, we've turned the low-level laser code into the expected horrible monstrosity that exposes all the hidden complexity in those few ASM instructions. The high-level part should be no big deal now… except that we're immediately bombarded with Fixup overflow errors at link time? Oh well, time to finally learn the true way of fixing this highly annoying issue in a second new piece of decompilation tech – and one that might actually be useful for other x86 Real Mode retro developers at that.
Earlier in the RE history of TH04 and TH05, I often wrote about the need to split the two original code segments into multiple segments within two groups, which makes it possible to slot in code from different translation units at arbitrary places within the original segment. If we don't want to define a unique segment name for each of these slotted-in translation units, we need a way to set custom segment and group names in C land. Turbo C++ offers two #pragmas for that:

#pragma option -zCsegment -zPgroup – preferred in most cases as it's equivalent to setting the default segment and group via the command line, but can only be used at the beginning of a translation unit, before the first non-preprocessor and non-comment C language token
#pragma codeseg segment <group> – necessary if a translation unit needs to emit code into two or more segments

For the most part, these #pragmas work well, but they seemed to not help much when it came to calling near functions declared in different segments within the same group. It took a bit of trial and error to figure out what was actually going on in that case, but there is a clear logic to it:

Symbols are allocated to the segment and group that's active during their first appearance, no matter whether that appearance is a declaration or definition. Any later appearance of the function in a different segment is ignored.
The linker calculates the 16-bit offsets of such references relative to the symbol's declared segment, not its actual one. Turbo C++ does not show an error or warning if the declared and actual segments are different, as referencing the same symbol from multiple segments is a valid use case. The linker merely throws the Fixup overflow error if the calculated distance exceeds 64 KiB and thus couldn't possibly fit within a near reference. With a wrong segment declaration though, your code can be incorrect long before a fixup hits that limit.

Summarized in code:

#pragma option -zCfoo_TEXT -zPfoo

void bar(void);
void near qux(void); // defined somewhere else, maybe in a different segment

#pragma codeseg baz_TEXT baz

// Despite the segment change in the line above, this function will still be
// put into `foo_TEXT`, the active segment during the first appearance of the
// function name.
void bar(void) {
}

// This function hasn't been declared yet, so it will go into `baz_TEXT` as
// expected.
void baz(void) {
	// This `near` function pointer will be calculated by subtracting the
	// flat/linear address of qux() inside the binary from the base address
	// of qux()'s declared segment, i.e., `foo_TEXT`.
	void (near *ptr_to_qux)(void) = qux;
}

So yeah, you might have to put #pragma codeseg into your headers to tell the linker about the correct segment of a near function in advance. 🤯 This is an important insight for everyone using this compiler, and I'm shocked that none of the Borland C++ books documented the interaction of code segment definitions and near references at least at this level of clarity. The TASM manuals did have a few pages on the topic of groups, but that syntax obviously doesn't apply to a C compiler. Fixup overflows in particular are such a common error and really deserved better than the unhelpful 🤷 of an explanation that ended up in the User's Guide. Maybe this whole technique of custom code segment names was considered arcane even by 1993, judging from the mere three sentences that #pragma codeseg was documented with? Still, it must have been common knowledge among Amusement Makers, because they couldn't have built these exact binaries without knowing about these details. This is the true solution to 📝 any issues involving references to near functions, and I'm glad to see that ZUN did not in fact lie to the compiler. 👍

OK, but now the remaining laser code compiles, and we get to write C++ code to draw some hitboxes during the two collision-detected states of each laser. These confirm what the low-level code from earlier already uncovered: Collision detection against lasers is done by testing a 12×12-pixel box at every 16 pixels along the length of a laser, which leaves obvious 4-pixel gaps at regular intervals that the player can just pass through. This adds 📝 yet 📝 another 📝 quirk to the growing list of quirks that were either intentional or must have been deliberately left in the game after their initial discovery. This is what constants were invented for, and there really is no excuse for not using them – especially during intoxicated coding, and/or if you don't have a compile-time abstraction for Q12.4 literals.

When detecting laser collisions, the game checks the player's single center coordinate against any of the aforementioned 12×12-pixel boxes. Therefore, it's correct to split these 12×12 pixels into two 6×6-pixel boxes and assign the other half to the player for a more natural visualization. Always remember that hitbox visualizations need to keep all colliding entities in mind – 📝 assigning a constant-sized hitbox to "the player" and "the bullets" will be wrong in most other cases.

Using subpixel coordinates in collision detection also introduces a slight inaccuracy into any hitbox visualization recorded in-engine on a 16-color PC-98. Since we have to render discrete pixels, we cannot exactly place a Q12.4 coordinate in the 93.75% of cases where the fractional part is non-zero. This is why pretty much every laser segment hitbox in the video above shows up as 7×7 rather than 6×6: The actual W×H area of each box is 13 pixels smaller, but since the hitbox lies between these pixels, we cannot indicate where it lies exactly, and have to err on the side of caution. It's also why Reimu's box slightly changes size as she moves: Her non-diagonal movement speed is 3.5 pixels per frame, and the constant focused movement in the video above halves that to 1.75 pixels, making her end up on an exact pixel every 4 frames. Looking forward to the glorious future of displays that will allow us to scale up the playfield to 16× its original pixel size, thus rendering the game at its exact internal resolution of 6144×5888 pixels. Such a port would definitely add a lot of value to the game…

The remaining high-level laser code is rather unremarkable for the most part, but raises one final interesting question: With no explicitly defined limit, how wide can a laser be? Looking at the laser structure's 1-byte width field and the unsigned comparisons all throughout the update and rendering code, the answer seems to be an obvious 255 pixels. However, the laser system also contains an automated shrinking state, which can be most notably seen in Mai's wheel pattern. This state shrinks a laser by 2 pixels every 2 frames until it reached a width of 0. This presents a problem with odd widths, which would fall below 0 and overflow back to 255 due to the unsigned nature of this variable. So rather than, I don't know, treating width values of 0 as invalid and stopping at a width of 1, or even adding a condition for that specific case, the code just performs a signed comparison, effectively limiting the width of a shrinkable laser to a maximum of 127 pixels. This small signedness inconsistency now forces the distinction between shrinkable and non-shrinkable lasers onto every single piece of code that uses lasers. Yet another instance where 📝 aiming for a cinematic 30 FPS look made the resulting code much more complicated than if ZUN had just evenly spread out the subtraction across 2 frames. 🤷
Oh well, it's not as if any of the fixed lasers in the original scripts came close to any of these limits. Moving lasers are much more streamlined and limited to begin with: Since they're hardcoded to 6 pixels, the game can safely assume that they're always thinner than the 28 pixels they get gradually widened to during their decay animation.

Finally, in case you were missing a mention of hitboxes in the previous paragraph: Yes, the game always uses the aforementioned 12×12 boxes, regardless of a laser's width.

This video also showcases the 127-pixel limit because I wanted to include the shrink animation for a seamless loop.

That was what, 50% of this blog post just being about complications that made laser difficult for no reason? Next up: The first TH01 Anniversary Edition build, where I finally get to reap the rewards of having a 100% decompiled game and write some good code for once.

📝 Posted:: 2022-11-30 22:20 UTC
🚚 Summary of:: P0223, P0224, P0225
⌨ Commits:: 139746c...371292d, 371292d...8118e61, 8118e61...4f85326
💰 Funded by:: rosenrose, Blue Bolt, Splashman, -Tom-, Yanga, Enderwolf, 32th System
🏷 Tags:

More than three months without any reverse-engineering progress! It's been way too long. Coincidentally, we're at least back with a surprising 1.25% of overall RE, achieved within just 3 pushes. The ending script system is not only more or less the same in TH04 and TH05, but actually originated in TH03, where it's also used for the cutscenes before stages 8 and 9. This means that it was one of the final pieces of code shared between three of the four remaining games, which I got to decompile at roughly 3× the usual speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The Music Room is largely equivalent in all three remaining games as well, and the sound device selection, ZUN Soft logo screens, and main/option menus are the same in TH04 and TH05. A lot of that code is in the "technically RE'd but not yet decompiled" ASM form though, so it would shift Finalized% more significantly than RE%. Therefore, make sure to order the new Finalization option rather than Reverse-engineering if you want to make number go up.

General overview
Game-specific differences
Command reference
Thoughts about translation support

So, cutscenes. On the surface, the .TXT files look simple enough: You directly write the text that should appear on the screen into the file without any special markup, and add commands to define visuals, music, and other effects at any place within the script. Let's start with the basics of how text is rendered, which are the same in all three games:

First off, the text area has a size of 480×64 pixels. This means that it does not correspond to the tiled area painted into TH05's EDBK?.PI images:
The yellow area is designated for character names.
Since the font weight can be customized, all text is rendered to VRAM. This also includes gaiji, despite them ignoring the font weight setting.
The system supports automatic line breaks on a per-glyph basis, which move the text cursor to the beginning of the red text area. This might seem like a piece of long-forgotten ancient wisdom at first, considering the absence of automatic line breaks in Windows Touhou. However, ZUN probably implemented it more out of pure necessity: Text in VRAM needs to be unblitted when starting a new box, which is way more straightforward and performant if you only need to worry about a fixed area.
The system also automatically starts a new (key press-separated) text box after the end of the 4^th line. However, the text cursor is also unconditionally moved to the top-left corner of the yellow name area when this happens, which is almost certainly not what you expect, given that automatic line breaks stay within the red area. A script author might as well add the necessary text box change commands manually, if you're forced to anticipate the automatic ones anyway…
Due to ZUN forgetting an unblitting call during the TH05 refactoring of the box background buffer, this feature is even completely broken in that game, as any new text will simply be blitted on top of the old one:
Wait, why are we already talking about game-specific differences after all? Also, note how the ⏎ animation appears one line below where you'd expect it.
Overall, the system is geared toward exclusively full-width text. As exemplified by the 2014 static English patches and the screenshots in this blog post, half-width text is possible, but comes with a lot of asterisks attached:
- Each loop of the script interpreter starts by looking at the next byte to distinguish commands from text. However, this step also skips over every ASCII space and control character, i.e., every byte ≤ 32. If you only intend to display full-width glyphs anyway, this sort of makes sense: You gain complete freedom when it comes to the physical layout of these script files, and it especially allows commands to be freely separated with spaces and line breaks for improved readability. Still, enforcing commands to be separated exclusively by line breaks might have been even better for readability, and would have freed up ASCII spaces for regular text…
- Non-command text is blindly processed and rendered two bytes at a time. The rendering function interprets these bytes as a Shift-JIS string, so you can use half-width characters here. While the second byte can even be an ASCII 0x20 space due to the parser's blindness, all half-width characters must still occur in pairs that can't be interrupted by commands:
- As a workaround for at least the ASCII space issue, you can replace them with any of the unassigned Shift-JIS lead bytes – 0x80, 0xA0, or anything between 0xF0 and 0xFF inclusive. That's what you see in all screenshots of this post that display half-width spaces.
Finally, did you know that you can hold ESC to fast-forward through these cutscenes, which skips most frame delays and reduces the rest? Due to the blocking nature of all commands, the ESC key state is only updated between commands or 2-byte text groups though, so it can't interrupt an ongoing delay.

Superficially, the list of game-specific differences doesn't look too long, and can be summarized in a rather short table:

	TH03	TH04	TH05
Script size limit	65536 bytes (heap-allocated)	8192 bytes (statically allocated)
Delay between every 2 bytes of text	1 frame by default, customizable via `\v`	None
Text delay when holding `ESC`	Varying speed-up factor	None
Visibility of new text	Immediately typed onto the screen	Rendered onto invisible VRAM page, faded in on wait commands
Visibility of old text	Unblitted when starting a new box		Left on screen until crossfaded out with new text
Key binding for advancing the script	Any key		`⏎ Return`, `Shot`, or `ESC`
Animation while waiting for an advance key	None		, past right edge of current row
Inexplicable delays	None		1 frame before changing pictures and after rendering new text boxes
Additional delay per interpreter loop	614.4 µs	None	614.4 µs

The 614.4 µs correspond to the necessary delay for working around the repeated key up and key down events sent by PC-98 keyboards when holding down a key. While the absence of this delay significantly speeds up TH04's interpreter, it's also the reason why that game will stop recognizing a held ESC key after a few seconds, requiring you to press it again.

It's when you get into the implementation that the combined three systems reveal themselves as a giant mess, with more like 56 differences between the games. Every single new weird line of code opened up another can of worms, which ultimately made all of this end up with 24 pieces of bloat and 14 bugs. The worst of these should be quite interesting for the general PC-98 homebrew developers among my audience:

The final official 0.23 release of master.lib has a bug in graph_gaiji_put*(). To calculate the JIS X 0208 code point for a gaiji, it is enough to ADD 5680h onto the gaiji ID. However, these functions accidentally use ADC instead, which incorrectly adds the x86 carry flag on top, causing weird off-by-one errors based on the previous program state. ZUN did fix this bug directly inside master.lib for TH04 and TH05, but still needed to work around it in TH03 by subtracting 1 from the intended gaiji ID. Anyone up for maintaining a bug-fixed master.lib repository?
The worst piece of bloat comes from TH03 and TH04 needlessly switching the visibility of VRAM pages while blitting a new 320×200 picture. This makes it much harder to understand the code, as the mere existence of these page switches is enough to suggest a more complex interplay between the two VRAM pages which doesn't actually exist. Outside this visibility switch, page 0 is always supposed to be shown, and page 1 is always used for temporarily storing pixels that are later crossfaded onto page 0. This is also the only reason why TH03 has to render text and gaiji onto both VRAM pages to begin with… and because TH04 doesn't, changing the picture in the middle of a string of text is technically bugged in that game, even though you only get to temporarily see the new text on very underclocked PC-98 systems.
These performance implications made me wonder why cutscenes even bother with writing to the second VRAM page anyway, before copying each crossfade step to the visible one. 📝 We learned in June how costly EGC-"accelerated" inter-page copies are; shouldn't it be faster to just blit the image once rather than twice?
Well, master.lib decodes .PI images into a packed-pixel format, and unpacking such a representation into bitplanes on the fly is just about the worst way of blitting you could possibly imagine on a PC-98. EGC inter-page copies are already fairly disappointing at 42 cycles for every 16 pixels, if we look at the i486 and ignore VRAM latencies. But under the same conditions, packed-pixel unpacking comes in at 81 cycles for every 8 pixels, or almost 4× slower. On lower-end systems, that can easily sum up to more than one frame for a 320×200 image. While I'd argue that the resulting tearing could have been an acceptable part of the transition between two images, it's understandable why you'd want to avoid it in favor of the pure effect on a slower framerate.
Really makes me wonder why master.lib didn't just directly decode .PI images into bitplanes. The performance impact on load times should have been negligible? It's such a good format for the often dithered 16-color artwork you typically see on PC-98, and deserves better than master.lib's implementation which is both slow to decode and slow to blit.

That brings us to the individual script commands… and yes, I'm going to document every single one of them. Some of their interactions and edge cases are not clear at all from just looking at the code.

Almost all commands are preceded by… well, a 0x5C lead byte. Which raises the question of whether we should document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded ¥ yen sign. From a gaijin perspective, it seems obvious that it's a backslash, as it's consistently displayed as one in most of the editors you would actually use nowadays. But interestingly, iconv -f shift-jis -t utf-8 does convert any 0x5C lead bytes to actual ¥ U+00A5 YEN SIGN code points .
Ultimately, the distinction comes down to the font. There are fonts that still render 0x5C as ¥, but mainly do so out of an obvious concern about backward compatibility to JIS X 0201, where this mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho, the old Japanese fonts from Windows 3.1, but even Meiryo and Yu Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much every other modern font, and freely licensed ones in particular, render this code point as \, even if you set your editor to Shift-JIS. And while ZUN most definitely saw it as a ¥, documenting this code point as \ is less ambiguous in the long run. It can only possibly correspond to one specific code point in either Shift-JIS or UTF-8, and will remain correct even if we later mod the cutscene system to support full-blown Unicode.

Now we've only got to clarify the parameter syntax, and then we can look at the big table of commands:

Numeric parameters are read as sequences of up to 3 ASCII digits. This limits them to a range from 0 to 999 inclusive, with 000 and 0 being equivalent. Because there's no further sentinel character, any further digit from the 4^th one onwards is interpreted as regular text.
Filename parameters must be terminated with a space or newline and are limited to 12 characters, which translates to 8.3 basenames without any directory component. Any further characters are ignored and displayed as text as well.
Each .PI image can contain up to four 320×200 pictures ("quarters") for the cutscene picture area. In the script commands, they are numbered like this:

0 1

2 3

			\@	Clears both VRAM pages by filling them with VRAM color 0. 🐞 In TH03 and TH04, this command does not update the internal text area background used for unblitting. This bug effectively restricts usage of this command to either the beginning of a script (before the first background image is shown) or its end (after no more new text boxes are started). See the image below for an example of using it anywhere else.
			\b`2`	Sets the font weight to a value between 0 (raw font ROM glyphs) to 3 (very thicc). Specifying any other value has no effect.
			\b`2`	🐞 In TH04 and TH05, `\b3` leads to glitched pixels when rendering half-width glyphs due to a bug in the newly micro-optimized ASM version of `📝 graph_putsa_fx()`; see the image below for an example. In these games, the parameter also directly corresponds to the `graph_putsa_fx()` effect function, removing the sanity check that was present in TH03. In exchange, you can also access the four dissolve masks for the bold font (`\b2`) by specifying a parameter between `4` (fewest pixels) to `7` (most pixels). Demo video below.
			\c`15`	Changes the text color to VRAM color `15`.
			\c=`字`,`15`	Adds a color map entry: If `字` is the first code point inside the name area on a new line, the text color is automatically set to `15`. Up to 8 such entries can be registered before overflowing the statically allocated buffer. 🐞 The comma is assumed to be present even if the color parameter is omitted.
			\e`0`	Plays the sound effect with the given ID.
			\f	(no-op)
			\fi`1` \fo`1`	Calls master.lib's `palette_black_in()` or `palette_black_out()` to play a hardware palette fade animation from or to black, spending roughly `1` frame on each of the 16 fade steps.
			\fm`1`	Fades out BGM volume via PMD's `AH=02h` interrupt call, in a non-blocking way. The fade speed can range from `1` (slowest) to `127` (fastest). Values from `128` to `255` technically correspond to `AH=02h`'s fade-in feature, which can't be used from cutscene scripts because it requires BGM volume to first be lowered via `AH=19h`, and there is no command to do that.
			\g`8`	Plays a blocking `8`-frame screen shake animation.
			\ga`0`	Shows the gaiji with the given ID from 0 to 255 at the current cursor position. Even in TH03, gaiji always ignore the text delay interval configured with `\v`.
			@`3`	TH05's replacement for the `\ga` command from TH03 and TH04. The default ID of `3` corresponds to the gaiji. Not to be confused with `\@`, which starts with a backslash, unlike this command.
			@h	Shows the gaiji.
			@t	Shows the gaiji.
			@!	Shows the gaiji.
			@?	Shows the gaiji.
			@!!	Shows the gaiji.
			@!?	Shows the gaiji.
			\k`0`	Waits `0` frames (0 = forever) for an advance key to be pressed before continuing script execution. Before waiting, TH05 crossfades in any new text that was previously rendered to the invisible VRAM page… 🐞 …but TH04 doesn't, leaving the text invisible during the wait time. As a workaround, `\vp1` can be used before `\k` to immediately display that text without a fade-in animation.
			\m$	Stops the currently playing BGM.
			\m*	Restarts playback of the currently loaded BGM from the beginning.
			\m,`filename`	Stops the currently playing BGM, loads a new one from the given file, and starts playback.
			\n	Starts a new line at the leftmost X coordinate of the box, i.e., the start of the name area. This is how scripts can "change" the name of the currently speaking character, or use the entire 480×64 pixels without being restricted to the non-name area. Note that automatic line breaks already move the cursor into a new line. Using this command at the "end" of a line with the maximum number of 30 full-width glyphs would therefore start a second new line and leave the previously started line empty. If this command moved the cursor into the 5^th line of a box, `\s` is executed afterward, with any of `\n`'s parameters passed to `\s`.
			\p	(no-op)
			\p-	Deallocates the loaded .PI image.
			\p,`filename`	Loads the .PI image with the given file into the single .PI slot available to cutscenes. TH04 and TH05 automatically deallocate any previous image, 🐞 TH03 would leak memory without a manual prior call to `\p-`.
			\pp	Sets the hardware palette to the one of the loaded .PI image.
			\p@	Sets the loaded .PI image as the full-screen 640×400 background image and overwrites both VRAM pages with its pixels, retaining the current hardware palette.
			\p=	Runs `\pp` followed by `\p@`.
			\s`0` \s-	Ends a text box and starts a new one. Fades in any text rendered to the invisible VRAM page, then waits `0` frames (0 = forever) for an advance key to be pressed. Afterward, the new text box is started with the cursor moved to the top-left corner of the name area. `\s-` skips the wait time and starts the new box immediately.
			\t`100`	Sets palette brightness via master.lib's `palette_settone()` to any value from 0 (fully black) to 200 (fully white). 100 corresponds to the palette's original colors. Preceded by a 1-frame delay unless `ESC` is held.
			\v`1`	Sets the number of frames to wait between every 2 bytes of rendered text.
			\v`1`	Sets the number of frames to spend on each of the 4 fade steps when crossfading between old and new text. The game-specific default value is also used before the first use of this command.
			\v`2`
			\vp`0`	Shows VRAM page `0`. Completely useless in TH03 (this game always synchronizes both VRAM pages at a command boundary), only of dubious use in TH04 (for working around a bug in `\k`), and the games always return to their intended shown page before every blitting operation anyway. A debloated mod of this game would just remove this command, as it exposes an implementation detail that script authors should not need to worry about. None of the original scripts use it anyway.
			\w`64`	`\w` and `\wk` wait for the given number of frames `\wm` and `\wmk` wait until PMD has played back the current BGM for the total number of measures, including loops, given in the first parameter, and fall back on calling `\w` and `\wk` with the second parameter as the frame number if BGM is disabled. 🐞 Neither PMD nor MMD reset the internal measure when stopping playback. If no BGM is playing and the previous BGM hasn't been played back for at least the given number of measures, this command will deadlock. Since both TH04 and TH05 fade in any new text from the invisible VRAM page, these commands can be used to simulate TH03's typing effect in those games. Demo video below. Contrary to `\k` and `\s`, specifying 0 frames would simply remove any frame delay instead of waiting forever. The TH03-exclusive `k` variants allow the delay to be interrupted if `⏎ Return` or `Shot` are held down. TH04 and TH05 recognize the `k` as well, but removed its functionality. All of these commands have no effect if `ESC` is held.
			\wm`64`,`64`
			\wk`64`
			\wmk`64`,`64`
			\wi`1` \wo`1`	Calls master.lib's `palette_white_in()` or `palette_white_out()` to play a hardware palette fade animation from or to white, spending roughly `1` frame on each of the 16 fade steps.
			\=`4`	Immediately displays the given quarter of the loaded .PI image in the picture area, with no fade effect. Any value ≥ `4` resets the picture area to black.
			\==`4`,`1`	Crossfades the picture area between its current content and quarter #`4` of the loaded .PI image, spending `1` frame on each of the 4 fade steps unless `ESC` is held. Any value ≥ `4` is replaced with quarter #0.
			\$	Stops script execution. Must be called at the end of each file; otherwise, execution continues into whatever lies after the script buffer in memory. TH05 automatically deallocates the loaded .PI image, TH03 and TH04 require a separate manual call to `\p-` to not leak its memory.

Bold values signify the default if the parameter is omitted; \c is therefore equivalent to \c15.

Using the \@ command in the middle of a TH03 or TH04 cutscene script — The `\@` bug. Yes, the ¥ is fake. It was easier to GIMP it than to reword the sentences so that the backslashes landed on the second byte of a 2-byte half-width character pair.

Cutscene font weights in TH03 — The font weights and effects available through `\b`, including the glitch with `\b3` in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels at the left side of each glyph, which would extend into the previous VRAM byte. Ironically, the TH04/TH05 version is more *correct* in this regard: For half-width glyphs, it preserves any further pixel columns generated by the weight functions in the high byte of the 16-dot glyph variable. Unlike TH03, which still cuts them off when rendering text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them towards their correct place (4️⃣). It's only at byte-aligned X positions (2️⃣) where they remain at their internally calculated place, and appear on screen as these glitched pixel columns, 15 pixels away from the glyph they belong to. It's easy to blame bugs like these on micro-optimized ASM code, but in this instance, you really can't argue against it if the original C++ version was equally incorrect.

Cutscene font weights in TH05, demonstrating the <code>\b3</code> bug that also affects TH04 — The font weights and effects available through `\b`, including the glitch with `\b3` in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels at the left side of each glyph, which would extend into the previous VRAM byte. Ironically, the TH04/TH05 version is more *correct* in this regard: For half-width glyphs, it preserves any further pixel columns generated by the weight functions in the high byte of the 16-dot glyph variable. Unlike TH03, which still cuts them off when rendering text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them towards their correct place (4️⃣). It's only at byte-aligned X positions (2️⃣) where they remain at their internally calculated place, and appear on screen as these glitched pixel columns, 15 pixels away from the glyph they belong to. It's easy to blame bugs like these on micro-optimized ASM code, but in this instance, you really can't argue against it if the original C++ version was equally incorrect.

Combining \b and s- into a partial dissolve animation. The speed can be controlled with \v.

Simulating TH03's typing effect in TH04 and TH05 via \w. Even prettier in TH05 where we also get an additional fade animation after the box ends.

So yeah, that's the cutscene system. I'm dreading the moment I will have to deal with the other command interpreter in these games, i.e., the stage enemy system. Luckily, that one is completely disconnected from any other system, so I won't have to deal with it until we're close to finishing MAIN.EXE… that is, unless someone requests it before. And it won't involve text encodings or unblitting…

The cutscene system got me thinking in greater detail about how I would implement translations, being one of the main dependencies behind them. This goal has been on the order form for a while and could soon be implemented for these cutscenes, with 100% PI being right around the corner for the TH03 and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching for Latin-script languages could be implemented fairly quickly:

Establish basic UTF-8 parsing for less painful manual editing of the source files
Procedurally generate glyphs for the few required additional letters based on existing font ROM glyphs. For example, we'd generate ä by painting two short lines on top of the font ROM's a glyph, or generate ¿ by vertically flipping the question mark. This way, the text retains a consistent look regardless of whether the translated game is run with an NEC or EPSON font ROM, or the that Neko Project II auto-generates if you don't provide either.
(Optional) Change automatic line breaks to work on a per-word basis, rather than per-glyph

That's it – script editing and distribution would be handled by your local translation group. It might seem as if this would also work for Greek and Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not sure if I want to attempt procedurally shrinking these glyphs from 16×16 to 8×16… For any more thorough solution, we'd need to go for a more "Chad" kind of full-blown translation support:

Implement text subdivisions at a sensible granularity while retaining automatic line and box breaks
Compile translatable text into a Japanese→target language dictionary (I'm too old to develop any further translation systems that would overwrite modded source text with translations of the original text)
Implement a custom Unicode font system (glyphs would be taken from GNU Unifont unless translators provide a different 8×16 font for their language)
Combine the text compiler with the font compiler to only store needed glyphs as part of the translation's font file (dealing with a multi-MB font file would be rather ugly in a Real Mode game)
Write a simple install/update/patch stacking tool that supports both .HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to warrant a separate tool – each patch stack would be statically compiled into a single package file in the game's directory)
Add a nice language selection option to the main menu
(Optional) Support proportional fonts

Which sounds more like a separate project to be commissioned from Touhou Patch Center's Open Collective funds, separate from the ReC98 cap. This way, we can make sure that the feature is completely implemented, and I can talk with every interested translator to make sure that their language works.
It's still cheaper overall to do this on PC-98 than to first port the games to a modern system and then translate them. On the other hand, most of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with the difficulty of getting arbitrary Unicode characters to work natively in a PC-98 DOS game at all, and would be either unnecessary or trivial if we had already ported the game. Depending on where the patrons' interests lie, it may not be worth it. So let's see what all of you think about which way we should go~~, or whether it's worth doing at all~~. (Edit (2022-12-01): With Splashman's order towards the stage dialogue system, we've pretty much confirmed that it is.) Maybe we want to meet in the middle – using e.g. procedural glyph generation for dynamic translations to keep text rendering consistent with the rest of the PC-98 system, and just not support non-Latin-script languages in the beginning? In any case, I've added both options to the order form.
Edit (2023-07-28): Touhou Patch Center has agreed to fund a basic feature set somewhere between the Virgin and Chad level. Check the 📝 dedicated announcement blog post for more details and ideas, and to find out how you can support this goal!

Surprisingly, there was still a bit of RE work left in the third push after all of this, which I filled with some small rendering boilerplate. Since I also wanted to include TH02's playfield overlay functions, ¹/₁₅ of that last push went towards getting a TH02-exclusive function out of the way, which also ended up including that game in this delivery.
The other small function pointed out how TH05's Stage 5 midboss pops into the playfield quite suddenly, since its clipping test thinks it's only 32 pixels tall rather than 64:

~~Good chance that the pop-in might have been intended.~~
Edit (2023-06-30): Actually, it's a 📝 systematic consequence of ZUN having to work around the lack of clipping in master.lib's sprite functions.
There's even another quirk here: The white flash during its first frame is actually carried over from the previous midboss, which the game still considers as actively getting hit by the player shot that defeated it. It's the regular boilerplate code for rendering a midboss that resets the responsible damage variable, and that code doesn't run during the defeat explosion animation.

Next up: Staying with TH05 and looking at more of the pattern code of its boss fights. Given the remaining TH05 budget, it makes the most sense to continue in in-game order, with Sara and the Stage 2 midboss. If more money comes in towards this goal, I could alternatively go for the Mai & Yuki fight and immediately develop a pretty fix for the cheeto storage glitch. Also, there's a rather intricate pull request for direct ZMBV decoding on the website that I've still got to review…

📝 Posted:: 2022-04-30 22:58 UTC
🚚 Summary of:: P0190, P0191, P0192
⌨ Commits:: 5734815...293e16a, 293e16a...71cb7b5, 71cb7b5...e1f3f9f
💰 Funded by:: nrook, -Tom-, [Anonymous]
🏷 Tags:

The important things first:

TH05 has passed the 50% RE mark, with both MAIN.EXE and the game as a whole! With that, we've also reached what -Tom- wanted out of the project, so he's suspending his discount offer for a bit.
Curve bullets are now officially called cheetos! 76.7% of fans prefer this term, and it fits into the 8.3 DOS filename scheme much better than homing lasers (as they're called in OMAKE.TXT) or Taito lasers (which would indeed have made sense as well).
…oh, and I managed to decompile Shinki within 2 pushes after all. That left enough budget to also add the Stage 1 midboss on top.

So, Shinki! As far as final boss code is concerned, she's surprisingly economical, with 📝 her background animations making up more than ⅓ of her entire code. Going straight from TH01's 📝 final 📝 bosses to TH05's final boss definitely showed how much ZUN had streamlined danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there is still room for improvement: TH05 not only 📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month, but also uses them 4× as often, and even for midbosses. Most importantly though, defining danmaku patterns using a single global instance of the group template structure is just bad no matter how you look at it:

The script code ends up rather bloated, with a single MOV instruction for setting one of the fields taking up 5 bytes. By comparison, the entire structure for regular bullets is 14 bytes large, while the template structure for Shinki's 32×32 ball bullets could have easily been reduced to 8 bytes.
Since it's also one piece of global state, you can easily forget to set one of the required fields for a group type. The resulting danmaku group then reuses these values from the last time they were set… which might have been as far back as another boss fight from a previous stage. And of course, I wouldn't point this out if it didn't actually happen in Shinki's pattern code. Twice.

Declaring a separate structure instance with the static data for every pattern would be both safer and more space-efficient, and there's more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow. The "devil" pattern is significantly more complex than the others, but still far from TH01's final bosses at their worst. I especially like the clear architectural separation between "one-shot pattern" functions that return true once they're done, and "looping pattern" functions that run as long as they're being called from a boss's main function. Not many all too interesting things in these pattern functions for the most part, except for two pieces of evidence that Shinki was coded after Yumeko:

The gather animation function in the first two phases contains a bullet group configuration that looks like it's part of an unused danmaku pattern. It quickly turns out to just be copy-pasted from a similar function in Yumeko's fight though, where it is turned into actual bullets.
As one of the two places where ZUN forgot to set a template field, the lasers at the end of the white wing preparation pattern reuse the 6-pixel width of Yumeko's final laser pattern. This actually has an effect on gameplay: Since these lasers are active for the first 8 frames after Shinki's wings appear on screen, the player can get hit by them in the last 2 frames after they grew to their final width.

Of course, there are more than enough safespots between the lasers.

Speaking about that wing sprite: If you look at ST05.BB2 (or any other file with a large sprite, for that matter), you notice a rather weird file layout:

Raw file layout of TH05's ST05.BB2, demonstrating master.lib's supposed BFNT width limit of 64 pixels — A large sprite split into multiple smaller ones with a width of 64 pixels each? What's this, hardware sprite limitations? On my PC-98?!

And it's not a limitation of the sprite width field in the BFNT+ header either. Instead, it's master.lib's BFNT functions which are limited to sprite widths up to 64 pixels… or at least that's what MASTER.MAN claims. Whatever the restriction was, it seems to be completely nonexistent as of master.lib version 0.23, and none of the master.lib functions used by the games have any issues with larger sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the game that expects Shinki's winged form to consist of 4 physical sprites, not just 1. Any conversion from another, more logical sprite sheet layout back into BFNT+ must therefore replicate the original number of sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly loaded sprite no longer match ZUN's hardcoded IDs, causing the game to crash. This is exactly what used to happen with -Tom-'s MysticTK automation scripts, which combined these exact sprites into a single large one. This issue has now been fixed – just in case there are some underground modders out there who used these scripts and wonder why their game crashed as soon as the Shinki fight started.

And then the code quality takes a nosedive with Shinki's main function. Even in TH05, these boss and midboss update functions are still very imperative:

The origin point of all bullet types used by a boss must be manually set to the current boss/midboss position; there is no concept of a bullet type tracking a certain entity.
The same is true for the target point of a player's homing shots…
… and updating the HP bar. At least the initial fill animation is abstracted away rather decently.
Incrementing the phase frame variable also must be done manually. TH05 even "innovates" here by giving the boss update function exclusive ownership of that variable, in contrast to TH04 where that ownership is given out to the player shot collision detection (?!) and boss defeat helper functions.
Speaking about collision detection: That is done by calling different functions depending on whether the boss is supposed to be invincible or not.
Timeout conditions? No standard way either, and all done with manual if statements. In combination with the regular phase end condition of lowering (mid)boss HP to a certain value, this leads to quite a convoluted control flow.
The manual calls to the score bonus functions for cleared phases at least provide some sense of orientation.
One potentially nice aspect of all this imperative freedom is that phases can end outside of HP boundaries… by manually incrementing the phase variable and resetting the phase frame variable to 0.

The biggest WTF in there, however, goes to using one of the 16 state bytes as a "relative phase" variable for differentiating between boss phases that share the same branch within the switch(boss.phase) statement. While it's commendable that ZUN tried to reduce code duplication for once, he could have just branched depending on the actual boss.phase variable? The same state byte is then reused in the "devil" pattern to track the activity state of the big jerky lasers in the second half of the pattern. If you somehow managed to end the phase after the first few bullets of the pattern, but before these lasers are up, Shinki's update function would think that you're still in the phase before the "devil" pattern. The main function then sequence-breaks right to the defeat phase, skipping the final pattern with the burning Makai background. Luckily, the HP boundaries are far away enough to make this impossible in practice.
The takeaway here: If you want to use the state bytes for your custom boss script mods, alias them to your own 16-byte structure, and limit each of the bytes to a clearly defined meaning across your entire boss script.

One final discovery that doesn't seem to be documented anywhere yet: Shinki actually has a hidden bomb shield during her two purple-wing phases. uth05win got this part slightly wrong though: It's not a complete shield, and hitting Shinki will still deal 1 point of chip damage per frame. For comparison, the first phase lasts for 3,000 HP, and the "devil" pattern phase lasts for 5,800 HP.

And there we go, 3rd PC-98 Touhou boss script* decompiled, 28 to go! 🎉 In case you were expecting a fix for the Shinki death glitch: That one is more appropriately fixed as part of the Mai & Yuki script. It also requires new code, should ideally look a bit prettier than just removing cheetos between one frame and the next, and I'd still like it to fit within the original position-dependent code layout… Let's do that some other time.
Not much to say about the Stage 1 midboss, or midbosses in general even, except that their update functions have to imperatively handle even more subsystems, due to the relative lack of helper functions.

The remaining ¾ of the third push went to a bunch of smaller RE and finalization work that would have hardly got any attention otherwise, to help secure that 50% RE mark. The nicest piece of code in there shows off what looks like the optimal way of setting up the 📝 GRCG tile register for monochrome blitting in a variable color:

mov ah, palette_index ; Any other non-AL 8-bit register works too.
                      ; (x86 only supports AL as the source operand for OUTs.)

rept 4                ; For all 4 bitplanes…
    shr ah,  1        ; Shift the next color bit into the x86 carry flag
    sbb al,  al       ; Extend the carry flag to a full byte
                      ; (CF=0 → 0x00, CF=1 → 0xFF)
    out 7Eh, al       ; Write AL to the GRCG tile register
endm

Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles into a surprisingly nice one-liner. What a beautiful micro-optimization, at a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there, becoming increasingly dumb and undecompilable. Was it really necessary to save 4 x86 instructions in the highly unlikely case of a new spark sprite being spawned outside the playfield? That one 2D polar→Cartesian conversion function then pointed out Turbo C++ 4.0J's woefully limited support for 32-bit micro-optimizations. The code generation for 32-bit 📝 pseudo-registers is so bad that they almost aren't worth using for arithmetic operations, and the inline assembler just flat out doesn't support anything 32-bit. No use in decompiling a function that you'd have to entirely spell out in machine code, especially if the same function already exists in multiple other, more idiomatic C++ variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC replay file reading code, which should finally prove that nothing about the game's original replay system could serve as even just the foundation for community-usable replays. Just in case anyone was still thinking that.

Next up: Back to TH01, with the Elis fight! Got a bit of room left in the cap again, and there are a lot of things that would make a lot of sense now:

TH04 would really enjoy a large number of dedicated pushes to catch up with TH05. This would greatly support the finalization of both games.
Continuing with TH05's bosses and midbosses has shown to be good value for your money. Shinki would have taken even less than 2 pushes if she hadn't been the first boss I looked at.
I've got a new idea for 📝 properly linking in master.lib and getting rid of the 32-bit build step… (Edit (2022-05-31): Nope, that didn't work out after all.)
Oh, and I also added Seihou as a selectable goal, for the two people out there who genuinely like it. If I ever want to quit my day job, I need to branch out into safer territory that isn't threatened by takedowns, after all.

📝 Posted:: 2022-02-18 19:40 UTC
🚚 Summary of:: P0182, P0183
⌨ Commits:: 313450f...1e2c7ad, 1e2c7ad...f9d983e
💰 Funded by:: Lmocinemod, [Anonymous], Yanga
🏷 Tags:

Been 📝 a while since we last looked at any of TH03's game code! But before that, we need to talk about Y coordinates.

During TH03's MAIN.EXE, the PC-98 graphics GDC runs in its line-doubled 640×200 resolution, which gives the in-game portion its distinctive stretched low-res look. This lower resolution is a consequence of using 📝 Promisence Soft's SPRITE16 driver: Its performance simply stems from the fact that it expects sprites to be stored in the bottom half of VRAM, which allows them to be blitted using the same EGC-accelerated VRAM-to-VRAM copies we've seen again and again in all other games. Reducing the visible resolution also means that the sprites can be stored on both VRAM pages, allowing the game to still be double-buffered. If you force the graphics chip to run at 640×400, you can see them:

TH03's VRAM at regular line-doubled 640×200 resolution — The full VRAM contents during TH03's in-game portion, as seen when forcing the system into a 640×400 resolution.
•

TH03's VRAM at full 640×400 resolution, including the SPRITE16 sprite area — The full VRAM contents during TH03's in-game portion, as seen when forcing the system into a 640×400 resolution.
•

Note that the text chip still displays its overlaid contents at 640×400, which means that TH03's in-game portion technically runs at two resolutions at the same time.

But that means that any mention of a Y coordinate is ambiguous: Does it refer to undoubled VRAM pixels, or on-screen stretched pixels? Especially people who have known about the line doubling for years might almost expect technical blog posts on this game to use undoubled VRAM coordinates. So, let's introduce a new formatting convention for both on-screen 640×400 and undoubled 640×200 coordinates, and always write out both to minimize the confusion.

Alright, now what's the thing gonna be? The enemy structure is highly overloaded, being used for enemies, fireballs, and explosions with seemingly different semantics for each. Maybe a bit too much to be figured out in what should ideally be a single push, especially with all the functions that would need to be decompiled? Bullet code would be easier, but not exactly single-push material either. As it turns out though, there's something more fundamental left to be done first, which both of these subsystems depend on: collision detection!

And it's implemented exactly how I always naively imagined collision detection to be implemented in a fixed-resolution 2D bullet hell game with small hitboxes: By keeping a separate 1bpp bitmap of both playfields in memory, drawing in the collidable regions of all entities on every frame, and then checking whether any pixels at the current location of the player's hitbox are set to 1. It's probably not done in the other games because their single data segment was already too packed for the necessary 17,664 bytes to store such a bitmap at pixel resolution, and 282,624 bytes for a bitmap at Q12.4 subpixel resolution would have been prohibitively expensive in 16-bit Real Mode DOS anyway. In TH03, on the other hand, this bitmap is doubly useful, as the AI also uses it to elegantly learn what's on the playfield. By halving the resolution and only tracking tiles of 2×2 / 2×1 pixels, TH03 only requires an adequate total of 6,624 bytes of memory for the collision bitmaps of both playfields.

So how did the implementation not earn the good-code tag this time? Because the code for drawing into these bitmaps is undecompilable hand-written x86 assembly. And not just your usual ASM that was basically compiled from C and then edited to maybe optimize register allocation and maybe replace a bunch of local variables with self-modifying code, oh no. This code is full of overly clever bit twiddling, abusing the fact that the 16-bit AX, BX, CX, and DX registers can also be accessed as two 8-bit registers, calculations that change the semantic meaning behind the value of a register, or just straight-up reassignments of different values to the same small set of registers. Sure, in some way it is impressive, and it all does work and correctly covers every edge case, but come on. This could have all been a lot more readable in exchange for just a few CPU cycles.

What's most interesting though are the actual shapes that these functions draw into the collision bitmap. On the surface, we have:

vertical slopes at any angle across the whole playfield; exclusively used for Chiyuri's diagonal laser EX attack
straight vertical lines, with a width of 1 tile; exclusively used for the 2×2 / 2×1 hitboxes of bullets
rectangles at arbitrary sizes

But only 2) actually draws a full solid line. 1) and 3) are only ever drawn as horizontal stripes, with a hardcoded distance of 2 vertical tiles between every stripe of a slope, and 4 vertical tiles between every stripe of a rectangle. That's 66-75% of each rectangular entity's intended hitbox not actually taking part in collision detection. Now, if player hitboxes were ≤ 6 / 3 pixels, we'd have one possible explanation of how the AI can "cheat", because it could just precisely move through those blank regions at TAS speeds. So, let's make this two pushes after all and tell the complete story, since this is one of the more interesting aspects to still be documented in this game.

And the code only gets worse. While the player collision detection function is decompilable, it might as well not have been, because it's just more of the same "optimized", hard-to-follow assembly. With the four splittable 16-bit registers having a total of 20 different meanings in this function, I would have almost preferred self-modifying code…

In fact, it was so bad that it prompted some maintenance work on my inline assembly coding standards as a whole. Turns out that the _asm keyword is not only still supported in modern Visual Studio compilers, but also in Clang with the -fms-extensions flag, and compiles fine there even for 64-bit targets. While that might sound like amazing news at first ("awesome, no need to rewrite this stuff for my x86_64 Linux port!"), you quickly realize that almost all inline assembly in this codebase assumes either PC-98 hardware, segmented 16-bit memory addressing, or is a temporary hack that will be removed with further RE progress.
That's mainly because most of the raw arithmetic code uses Turbo C++'s register pseudovariables where possible. While they certainly have their drawbacks, being a non-standard extension that's not supported in other x86-targeting C compilers, their advantages are quite significant: They allow this code to stay in the same language, and provide slightly more immediate portability to any other architecture, together with 📝 readability and maintainability improvements that can get quite significant when combined with inlining:

// This one line compiles to five ASM instructions, which would need to be
// spelled out in any C compiler that doesn't support register pseudovariables.
// By adding typed aliases for these registers via `#define`, this code can be
// both made even more readable, and be prepared for an easier transformation
// into more portable local variables.
_ES = (((_AX * 4) + _BX) + SEG_PLANE_B);

However, register pseudovariables might cause potential portability issues as soon as they are mixed with inline assembly instructions that rely on their state. The lazy way of "supporting pseudo-registers" in other compilers would involve declaring the full set as global variables, which would immediately break every one of those instances:

_DI = 0;
_AX = 0xFFFF;

// Special x86 instruction doing the equivalent of
//
// 	*reinterpret_cast(MK_FP(_ES, _DI)) = _AX;
// 	_DI += sizeof(uint16_t);
//
// Only generated by Turbo C++ in very specific cases, and therefore only
// reliably available through inline assembly.
asm { movsw; }

What's also not all too standardized, though, are certain variants of the asm keyword. That's why I've now introduced a distinction between the _asm keyword for "decently sane" inline assembly, and the slightly less standard asm keyword for inline assembly that relies on the contents of pseudo-registers, and should break on compilers that don't support them.
So yeah, have some minor portability work in exchange for these two pushes not having all that much in RE'd content.

With that out of the way and the function deciphered, we can confirm the player hitboxes to be a constant 8×8 / 8×4 pixels, and prove that the hit stripes are nothing but an adequate optimization that doesn't affect gameplay in any way.

And what's the obvious thing to immediately do if you have both the collision bitmap and the player hitbox? Writing a "real hitbox" mod, of course:

Reorder the calls to rendering functions so that player and shot sprites are rendered after bullets
Blank out all player sprite pixels outside an 8×8 / 8×4 box around the center point
After the bullet rendering function, turn on the GRCG in RMW mode and set the tile register set to the background color
Stretch the negated contents of collision bitmap onto each playfield, leaving only collidable pixels untouched
Do the same with the actual, non-negated contents and a white color, for extra contrast against the background. This also makes sure to show any collidable areas whose sprite pixels are transparent, such as with the moon enemy. (Yeah, how unfair.) Doing that also loses a lot of information about the playfield, such as enemy HP indicated by their color, but what can you do:

A decently busy TH03 in-game frame. — A decently busy TH03 in-game frame and its underlying collision bitmap, showing off all three different collision shapes together with the player hitboxes.

The underlying content of the collision bitmap, showing off all three different shapes together with the player hitboxes. — A decently busy TH03 in-game frame and its underlying collision bitmap, showing off all three different collision shapes together with the player hitboxes.

2022-02-18-TH03-real-hitbox.zip The secret for writing such mods before having reached a sufficient level of position independence? Put your new code segment into DGROUP, past the end of the uninitialized data section. That's why this modded MAIN.EXE is a lot larger than you would expect from the raw amount of new code: The file now actually needs to store all these uninitialized 0 bytes between the end of the data segment and the first instruction of the mod code – normally, this number is simply a part of the MZ EXE header, and doesn't need to be redundantly stored on disk. Check the th03_real_hitbox branch for the code.

And now we know why so many "real hitbox" mods for the Windows Touhou games are inaccurate: The games would simply be unplayable otherwise – or can you dodge rapidly moving 2×2 / 2×1 blocks as an 8×8 / 8×4 rectangle that is smaller than your shot sprites, especially without focused movement? I can't. Maybe it will feel more playable after making explosions visible, but that would need more RE groundwork first.
It's also interesting how adding two full GRCG-accelerated redraws of both playfields per frame doesn't significantly drop the game's frame rate – so why did the drawing functions have to be micro-optimized again? It would be possible in one pass by using the GRCG's TDW mode, which should theoretically be 8× faster, but I have to stop somewhere.

Next up: The final missing piece of TH04's and TH05's bullet-moving code, which will include a certain other type of projectile as well.

📝 Posted:: 2021-07-31 19:19 UTC
🚚 Summary of:: P0149, P0150, P0151, P0152
⌨ Commits:: e1a26bb...05e4c4a, 05e4c4a...768251d, 768251d...4d24ca5, 4d24ca5...81fc861
💰 Funded by:: Blue Bolt, Ember2528, -Tom-, [Anonymous]
🏷 Tags:

…or maybe not that soon, as it would have only wasted time to untangle the bullet update commits from the rest of the progress. So, here's all the bullet spawning code in TH04 and TH05 instead. I hope you're ready for this, there's a lot to talk about!

(For the sake of readability, "bullets" in this blog post refers to the white 8×8 pellets and all 16×16 bullets loaded from MIKO16.BFT, nothing else.)

But first, what was going on 📝 in 2020? Spent 4 pushes on the basic types and constants back then, still ended up confusing a couple of things, and even getting some wrong. Like how TH05's "bullet slowdown" flag actually always prevents slowdown and fires bullets at a constant speed instead. Or how "random spread" is not the best term to describe that unused bullet group type in TH04.
Or that there are two distinct ways of clearing all bullets on screen, which deserve different names:

Mechanic #1: Clearing bullets for a custom amount of time, awarding 1000 points for all bullets alive on the first frame, and 100 points for all bullets spawned during the clear time.

Mechanic #2: Zapping bullets for a fixed 16 frames, awarding a semi-exponential and loudly announced Bonus!! for all bullets alive on the first frame, and preventing new bullets from being spawned during those 16 frames. In TH04 at least; thanks to a ZUN bug, zapping got reduced to 1 frame and no animation in TH05…

Bullets are zapped at the end of most midboss and boss phases, and cleared everywhere else – most notably, during bombs, when losing a life, or as rewards for extends or a maximized Dream bonus. The Bonus!! points awarded for zapping bullets are calculated iteratively, so it's not trivial to give an exact formula for these. For a small number 𝑛 of bullets, it would exactly be 5𝑛³ - 10𝑛² + 15𝑛 points – or, using uth05win's (correct) recursive definition, Bonus(𝑛) = Bonus(𝑛-1) + 15𝑛² - 5𝑛 + 10. However, one of the internal step variables is capped at a different number of points for each difficulty (and game), after which the points only increase linearly. Hence, "semi-exponential".

On to TH04's bullet spawn code then, because that one can at least be decompiled. And immediately, we have to deal with a pointless distinction between regular bullets, with either a decelerating or constant velocity, and special bullets, with preset velocity changes during their lifetime. That preset has to be set somewhere, so why have separate functions? In TH04, this separation continues even down to the lowest level of functions, where values are written into the global bullet array. TH05 merges those two functions into one, but then goes too far and uses self-modifying code to save a grand total of two local variables… Luckily, the rest of its actual code is identical to TH04.

Most of the complexity in bullet spawning comes from the (thankfully shared) helper function that calculates the velocities of the individual bullets within a group. Both games handle each group type via a large switch statement, which is where TH04 shows off another Turbo C++ 4.0 optimization: If the range of case values is too sparse to be meaningfully expressed in a jump table, it usually generates a linear search through a second value table. But with the -G command-line option, it instead generates branching code for a binary search through the set of cases. 𝑂(log 𝑛) as the worst case for a switch statement in a C++ compiler from 1994… that's so cool. But still, why are the values in TH04's group type enum all over the place to begin with?
Unfortunately, this optimization is pretty rare in PC-98 Touhou. It only shows up here and in a few places in TH02, compared to at least 50 switch value tables.

In all of its micro-optimized pointlessness, TH05's undecompilable version at least fixes some of TH04's redundancy. While it's still not even optimal, it's at least a decently written piece of ASM… if you take the time to understand what's going on there, because it certainly took quite a bit of that to verify that all of the things which looked like bugs or quirks were in fact correct. And that's how the code for this function ended up with 35% comments and blank lines before I could confidently call it "reverse-engineered"…
Oh well, at least it finally fixes a correctness issue from TH01 and TH04, where an invalid bullet group type would fill all remaining slots in the bullet array with identical versions of the first bullet.

Something that both games also share in these functions is an over-reliance on globals for return values or other local state. The most ridiculous example here: Tuning the speed of a bullet based on rank actually mutates the global bullet template… which ZUN then works around by adding a wrapper function around both regular and special bullet spawning, which saves the base speed before executing that function, and restores it afterward. Add another set of wrappers to bypass that exact tuning, and you've expanded your nice 1-function interface to 4 functions. Oh, and did I mention that TH04 pointlessly duplicates the first set of wrapper functions for 3 of the 4 difficulties, which can't even be explained with "debugging reasons"? That's 10 functions then… and probably explains why I've procrastinated this feature for so long.

At this point, I also finally stopped decompiling ZUN's original ASM just for the sake of it. All these small TH05 functions would look horribly unidiomatic, are identical to their decompiled TH04 counterparts anyway, except for some unique constant… and, in the case of TH05's rank-based speed tuning function, actually become undecompilable as soon as we want to return a C++ class to preserve the semantic meaning of the return value. Mainly, this is because Turbo C++ does not allow register pseudo-variables like _AX or _AL to be cast into class types, even if their size matches. Decompiling that function would have therefore lowered the quality of the rest of the decompiled code, in exchange for the additional maintenance and compile-time cost of another translation unit. Not worth it – and for a TH05 port, you'd already have to decompile all the rest of the bullet spawning code anyway!

The only thing in there that was still somewhat worth being decompiled was the pre-spawn clipping and collision detection function. Due to what's probably a micro-optimization mistake, the TH05 version continues to spawn a bullet even if it was spawned on top of the player. This might sound like it has a different effect on gameplay… until you realize that the player got hit in this case and will either lose a life or deathbomb, both of which will cause all on-screen bullets to be cleared anyway. So it's at most a visual glitch.

But while we're at it, can we please stop talking about hitboxes? At least in the context of TH04 and TH05 bullets. The actual collision detection is described way better as a kill delta of 8×8 pixels between the center points of the player and a bullet. You can distribute these pixels to any combination of bullet and player "hitboxes" that make up 8×8. 4×4 around both the player and bullets? 1×1 for bullets, and 8×8 for the player? All equally valid… or perhaps none of them, once you keep in mind that other entity types might have different kill deltas. With that in mind, the concept of a "hitbox" turns into just a confusing abstraction.

The same is true for the 36×44 graze ~~box~~ delta. For some reason, this one is not exactly around the center of a bullet, but shifted to the right by 2 pixels. So, a bullet can be grazed up to 20 pixels right of the player, but only up to 16 pixels left of the player. uth05win also spotted this… and rotated the deltas clockwise by 90°?!

Which brings us to the bullet updates… for which I still had to research a decompilation workaround, because 📝 P0148 turned out to not help at all? Instead, the solution was to lie to the compiler about the true segment distance of the popup function and declare its signature far rather than near. This allowed ZUN to save that ridiculous overhead of 1 additional far function call/return per frame, and those precious 2 bytes in the BSS segment that he didn't have to spend on a segment value. 📝 Another function that didn't have just a single declaration in a common header file… really, 📝 how were these games even built???

The function itself is among the longer ones in both games. It especially stands out in the indentation department, with 7 levels at its most indented point – and that's the minimum of what's possible without goto. Only two more notable discoveries there:

Bullets are the only entity affected by Slow Mode. If the number of bullets on screen is ≥ (24 + (difficulty * 8) + rank) in TH04, or (42 + (difficulty * 8)) in TH05, Slow Mode reduces the frame rate by 33%, by waiting for one additional VSync event every two frames.
The code also reveals a second tier, with 50% slowdown for a slightly higher number of bullets, but that conditional branch can never be executed
Bullets must have been grazed in a previous frame before they can be collided with. (Note how this does not apply to bullets that spawned on top of the player, as explained earlier!)

Whew… When did ReC98 turn into a full-on code review?! 😅 And after all this, we're still not done with TH04 and TH05 bullets, with all the special movement types still missing. That should be less than one push though, once we get to it. Next up: Back to TH01 and Konngara! Now have fun rewriting the Touhou Wiki Gameplay pages 😛

📝 Posted:: 2021-06-10 20:56 UTC
🚚 Summary of:: P0146
⌨ Commits:: 08bc188...456b621
💰 Funded by:: Ember2528, -Tom-
🏷 Tags:

Y'know, I kinda prefer the pending crowdfunded workload to stay more near the middle of the cap, rather than being sold out all the time. So to reach this point more quickly, let's do the most relaxing thing that can be easily done in TH05 right now: The boss backgrounds, starting with Shinki's, 📝 now that we've got the time to look at it in detail.

… Oh come on, more things that are borderline undecompilable, and require new workarounds to be developed? Yup, Borland C++ always optimizes any comparison of a register with a literal 0 to OR reg, reg, no matter how many calculations and inlined function calls you replace the 0 with. Shinki's background particle rendering function contains a CMP AX, 0 instruction though… so yeah, 📝 yet another piece of custom ASM that's worse than what Turbo C++ 4.0J would have generated if ZUN had just written readable C. This was probably motivated by ZUN insisting that his modified master.lib function for blitting particles takes its X and Y parameters as registers. If he had just used the __fastcall convention, he also would have got the sprite ID passed as a register. 🤷
So, we really don't want to be forced into inline assembly just because of the third comparison in the otherwise perfectly decompilable four-comparison if() expression that prevents invisible particles from being drawn. The workaround: Comparing to a pointer instead, which only the linker gets to resolve to the actual value of 0. This way, the compiler has to make room for any 16-bit literal, and can't optimize anything.

And then we go straight from micro-optimization to waste, with all the duplication in the code that animates all those particles together with the zooming and spinning lines. This push decompiled 1.31% of all code in TH05, and thanks to alignment, we're still missing Shinki's high-level background rendering function that calls all the subfunctions I decompiled here.
With all the manipulated state involved here, it's not at all trivial to see how this code produces what you see in-game. Like:

If all lines have the same Y velocity, how do the other three lines in background type B get pushed down into this vertical formation while the top one stays still? (Answer: This velocity is only applied to the top line, the other lines are only pushed based on some delta.)
How can this delta be calculated based on the distance of the top line with its supposed target point around Shinki's wings? (Answer: The velocity is never set to 0, so the top line overshoots this target point in every frame. After calculating the delta, the top line itself is pushed down as well, canceling out the movement. )
Why don't they get pushed down infinitely, but stop eventually? (Answer: We only see four lines out of 20, at indices #0, #6, #12, and #18. In each frame, lines [0..17] are copied to lines [1..18], before anything gets moved. The invisible lines are pushed down based on the delta as well, which defines a distance between the visible lines of (velocity * array gap). And since the velocity is capped at -14 pixels per frame, this also means a maximum distance of 84 pixels between the midpoints of each line.)
And why are the lines moving back up when switching to background type C, before moving down? (Answer: Because type C increases the velocity rather than decreasing it. Therefore, it relies on the previous velocity state from type B to show a gapless animation.)

So yeah, it's a nice-looking effect, just very hard to understand. 😵

With the amount of effort I'm putting into this project, I typically gravitate towards more descriptive function names. Here, however, uth05win's simple and seemingly tiny-brained "background type A/B/C/D" was quite a smart choice. It clearly defines the sequence in which these animations are intended to be shown, and as we've seen with point 4 from the list above, that does indeed matter.

Next up: At least EX-Alice's background animations, and probably also the high-level parts of the background rendering for all the other TH05 bosses.

📝 Posted:: 2021-04-23 00:46 UTC
🚚 Summary of:: P0138
⌨ Commits:: 8d953dc...864e864
💰 Funded by:: [Anonymous], Blue Bolt
🏷 Tags:

Technical debt, part 9… and as it turns out, it's highly impractical to repay 100% of it at this point in development. 😕

The reason: graph_putsa_fx(), ZUN's function for rendering optionally boldfaced text to VRAM using the font ROM glyphs, in its ridiculously micro-optimized TH04 and TH05 version. This one sets the "callback function" for applying the boldface effect by self-modifying the target of two CALL rel16 instructions… because there really wasn't any free register left for an indirect CALL, eh? The necessary distance, from the call site to the function itself, has to be calculated at assembly time, by subtracting the target function label from the call site label.
This usually wouldn't be a problem… if ZUN didn't store the resulting lookup tables in the .DATA segment. With code segments, we can easily split them at pretty much any point between functions because there are multiple of them. But there's only a single .DATA segment, with all ZUN and master.lib data sandwiched between Borland C++'s crt0 at the top, and Borland C++'s library functions at the bottom of the segment. Adding another split point would require all data after that point to be moved to its own translation unit, which in turn requires EXTERN references in the big .ASM file to all that moved data… in short, it would turn the codebase into an even greater mess.
Declaring the labels as EXTERN wouldn't work either, since the linker can't do fancy arithmetic and is limited to simply replacing address placeholders with one single address. So, we're now stuck with this function at the bottom of the SHARED segment, for the foreseeable future.

We can still continue to separate functions off the top of that segment, though. Pretty much the only thing noteworthy there, so far: TH04's code for loading stage tile images from .MPN files, which we hadn't reverse-engineered so far, and which nicely fit into one of Blue Bolt's pending ⅓ RE contributions. Yup, we finally moved the RE% bars again! If only for a tiny bit.
Both TH02 and TH05 simply store one pointer to one dynamically allocated memory block for all tile images, as well as the number of images, in the data segment. TH04, on the other hand, reserves memory for 8 .MPN slots, complete with their color palettes, even though it only ever uses the first one of these. There goes another 458 bytes of conventional RAM… I should start summing up all the waste we've seen so far. Let's put the next website contribution towards a tagging system for these blog posts.

At 86% of technical debt in the SHARED segment repaid, we aren't quite done yet, but the rest is mostly just TH04 needing to catch up with functions we've already separated. Next up: Getting to that practical 98.5% point. Since this is very likely to not require a full push, I'll also decompile some more actual TH04 and TH05 game code I previously reverse-engineered – and after that, reopen the store!

📝 Posted:: 2021-03-20 20:19 UTC
🚚 Summary of:: P0135, P0136
⌨ Commits:: a6eed55...252c13d, 252c13d...07bfcf2
💰 Funded by:: [Anonymous]
🏷 Tags:

Alright, no more big code maintenance tasks that absolutely need to be done right now. Time to really focus on parts 6 and 7 of repaying technical debt, right? Except that we don't get to speed up just yet, as TH05's barely decompilable PMD file loading function is rather… complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98 Touhou, I first consult the disassembly of Wolfenstein 3D. That game was originally compiled with the quite similar Borland C++ 3.0, so it's quite helpful to compare its ASM to the officially released source code. If I find the instructions in question, they mostly come from that game's ASM code, leading to the amusing realization that "even John Carmack was unable to get these instructions out of this compiler" This time though, Wolfenstein 3D did point me to Borland's intrinsics for common C functions like memcpy() and strchr(), available via #pragma intrinsic. Bu~t those unfortunately still generate worse code than what ZUN micro-optimized here. Commenting how these sequences of instructions should look in C is unfortunately all I could do here.
The conditional branches in this function did compile quite nicely though, clarifying the control flow, and clearly exposing a ZUN bug: TH05's snd_load() will hang in an infinite loop when trying to load a non-existing -86 BGM file (with a .M2 extension) if the corresponding -26 BGM file (with a .M extension) doesn't exist either.

Unsurprisingly, the PMD channel monitoring code in TH05's Music Room remains undecompilable outside the two most "high-level" initialization and rendering functions. And it's not because there's data in the middle of the code segment – that would have actually been possible with some #pragmas to ensure that the data and code segments have the same name. As soon as the SI and DI registers are referenced anywhere, Turbo C++ insists on emitting prolog code to save these on the stack at the beginning of the function, and epilog code to restore them from there before returning. Found that out in September 2019, and confirmed that there's no way around it. All the small helper functions here are quite simply too optimized, throwing away any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate that I do try.

Within that same 6th push though, we've finally reached the one function in TH05 that was blocking further progress in TH04, allowing that game to finally catch up with the others in terms of separated translation units. Feels good to finally delete more of those .ASM files we've decompiled a while ago… finally!

But since that was just getting started, the most satisfying development in both of these pushes actually came from some more experiments with macros and inline functions for near-ASM code. By adding "unused" dummy parameters for all relevant registers, the exact input registers are made more explicit, which might help future port authors who then maybe wouldn't have to look them up in an x86 instruction reference quite as often. At its best, this even allows us to declare certain functions with the __fastcall convention and express their parameter lists as regular C, with no additional pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even more amazing than previously thought when it comes to returning pseudo-registers from inline functions. A nice example for how this can improve readability can be found in this piece of TH02 code for polling the PC-98 keyboard state using a BIOS interrupt:

inline uint8_t keygroup_sense(uint8_t group) {
	_AL = group;
	_AH = 0x04;
	geninterrupt(0x18);
	// This turns the output register of this BIOS call into the return value
	// of this function. Surprisingly enough, this does *not* naively generate
	// the `MOV AL, AH` instruction you might expect here!
	return _AH;
}

void input_sense(void)
{
	// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
	// never emits as such, giving us only the three instructions we need.
	_AH = keygroup_sense(8);

	// Whereas this one gives us the one additional `MOV BH, AH` instruction
	// we'd expect, and nothing more.
	_BH = keygroup_sense(7);

	// And now it's obvious what both of these registers contain, from just
	// the assignments above.
	if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
		key_det |= INPUT_UP;
	}
	// […]
}

I love it. No inline assembly, as close to idiomatic C code as something like this is going to get, yet still compiling into the minimum possible number of x86 instructions on even a 1994 compiler. This is how I keep this project interesting for myself during chores like these. We might have even reached peak inline already?

And that's 65% of technical debt in the SHARED segment repaid so far. Next up: Two more of these, which might already complete that segment? Finally!

📝 Posted:: 2021-02-21 13:36 UTC
🚚 Summary of:: P0134
⌨ Commits:: 1d5db71...a6eed55
💰 Funded by:: [Anonymous]
🏷 Tags:

Technical debt, part 5… and we only got TH05's stupidly optimized .PI functions this time?

As far as actual progress is concerned, that is. In maintenance news though, I was really hyped for the #include improvements I've mentioned in 📝 the last post. The result: A new x86real.h file, bundling all the declarations specific to the 16-bit x86 Real Mode in a smaller file than Turbo C++'s own DOS.H. After all, DOS is something else than the underlying CPU. And while it didn't speed up build times quite as much as I had hoped, it now clearly indicates the x86-specific parts of PC-98 Touhou code to future port authors.

After another couple of improvements to parameter declaration in ASM land, we get to TH05's .PI functions… and really, why did ZUN write all of them in ASM? Why (re)declare all the necessary structures and data in ASM land, when all these functions are merely one layer of abstraction above master.lib, which does all the actual work?
I get that ZUN might have wanted masked blitting to be faster, which is used for the fade-in effect seen during TH05's main menu animation and the ending artwork. But, uh… he knew how to modify master.lib. In fact, he did already modify the graph_pack_put_8() function used for rendering a single .PI image row, to ignore master.lib's VRAM clipping region. For this effect though, he first blits each row regularly to the invisible 400th row of VRAM, and then does an EGC-accelerated VRAM-to-VRAM blit of that row to its actual target position with the mask enabled. It would have been way more efficient to add another version of this function that takes a mask pattern. No amount of REP MOVSW is going to change the fact that two VRAM writes per line are slower than a single one. Not to mention that it doesn't justify writing every other .PI function in ASM to go along with it…
This is where we also find the most hilarious aspect about this: For most of ZUN's pointless micro-optimizations, you could have maybe made the argument that they do save some CPU cycles here and there, and therefore did something positive to the final, PC-98-exclusive result. But some of the hand-written ASM here doesn't even constitute a micro-optimization, because it's worse than what you would have got out of even Turbo C++ 4.0J with its 80386 optimization flags!

At least it was possible to "decompile" 6 out of the 10 functions here, making them easy to clean up for future modders and port authors. Could have been 7 functions if I also decided to "decompile" pi_free(), but all the C++ code is already surrounded by ASM, resulting in 2 ASM translation units and 2 C++ translation units. pi_free() would have needed a single translation unit by itself, which wasn't worth it, given that I would have had to spell out every single ASM instruction anyway.

void pascal pi_free(int slot)
{
	if(pi_buffers[slot]) {
		graph_pi_free(&pi_headers[slot], &pi_buffers[slot]);
		pi_buffers[slot] = NULL;
	}
}

There you go. What about this needed to be written in ASM?!?

The function calls between these small translation units even seemed to glitch out TASM and the linker in the end, leading to one CALL offset being weirdly shifted by 32 bytes. Usually, TLINK reports a fixup overflow error when this happens, but this time it didn't, for some reason? Mirroring the segment grouping in the affected translation unit did solve the problem, and I already knew this, but only thought of it after spending quite some RTFM time… during which I discovered the -lE switch, which enables TLINK to use the expanded dictionaries in Borland's .OBJ and .LIB files to speed up linking. That shaved off roughly another second from the build time of the complete ReC98 repository. The more you know… Binary blobs compiled with non-Borland tools would be the only reason not to use this flag.

So, even more slowdown with this 5th dedicated push, since we've still only repaid 41% of the technical debt in the SHARED segment so far. Next up: Part 6, which hopefully manages to decompile the FM and SSG channel animations in TH05's Music Room, and hopefully ends up being the final one of the slow ones.

📝 Posted:: 2021-01-31 15:54 UTC
🚚 Summary of:: P0133
⌨ Commits:: 045450c...1d5db71
💰 Funded by:: [Anonymous]
🏷 Tags:

Wow, 31 commits in a single push? Well, what the last push had in progress, this one had in maintenance. The 📝 master.lib header transition absolutely had to be completed in this one, for my own sanity. And indeed, it reduced the build time for the entirety of ReC98 to about 27 seconds on my system, just as expected in the original announcement. Looking forward to even faster build times with the upcoming #include improvements I've got up my sleeve! The port authors of the future are going to appreciate those quite a bit.

As for the new translation units, the funniest one is probably TH05's function for blitting the 1-color .CDG images used for the main menu options. Which is so optimized that it becomes decompilable again, by ditching the self-modifying code of its TH04 counterpart in favor of simply making better use of CPU registers. The resulting C code is still a mess, but what can you do.
This was followed by even more TH05 functions that clearly weren't compiled from C, as evidenced by their padding bytes. It's about time I've documented my lack of ideas of how to get those out of Turbo C++.

And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…

In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?

📝 Posted:: 2020-11-16 20:39 UTC
🚚 Summary of:: P0126, P0127
⌨ Commits:: 6c22af7...8b01657, 8b01657...dc65b59
💰 Funded by:: Blue Bolt, [Anonymous]
🏷 Tags:

Alright, back to continuing the master.hpp transition started in P0124, and repaying technical debt. The last blog post already announced some ridiculous decompilations… and in fact, not a single one of the functions in these two pushes was decompilable into idiomatic C/C++ code.

As usual, that didn't keep me from trying though. The TH04 and TH05 version of the infamous 16-pixel-aligned, EGC-accelerated rectangle blitting function from page 1 to page 0 was fairly average as far as unreasonable decompilations are concerned.
The big blocker in TH03's MAIN.EXE, however, turned out to be the .MRS functions, used to render the gauge attack portraits and bomb backgrounds. The blitting code there uses the additional FS and GS segment registers provided by the Intel 386… which

are not supported by Turbo C++'s inline assembler, and
can't be turned into pointers, due to a compiler bug in Turbo C++ that generates wrong segment prefix opcodes for the _FS and _GS pseudo-registers.

Apparently I'm the first one to even try doing that with this compiler? I haven't found any other mention of this bug…
Compiling via assembly (#pragma inline) would work around this bug and generate the correct instructions. But that would incur yet another dependency on a 16-bit TASM, for something honestly quite insignificant.

What we can always do, however, is using __emit__() to simply output x86 opcodes anywhere in a function. Unlike spelled-out inline assembly, that can even be used in helper functions that are supposed to inline… which does in fact allow us to fully abstract away this compiler bug. Regular if() comparisons with pseudo-registers wouldn't inline, but "converting" them into C++ template function specializations does. All that's left is some C preprocessor abuse to turn the pseudo-registers into types, and then we do retain a normal-looking poke() call in the blitting functions in the end. 🤯

Yeah… the result is batshit insane. I may have gone too far in a few places…

One might certainly argue that all these ridiculous decompilations actually hurt the preservation angle of this project. "Clearly, ZUN couldn't have possibly written such unreasonable C++ code. So why pretend he did, and not just keep it all in its more natural ASM form?" Well, there are several reasons:

Future port authors will merely have to translate all the pseudo-registers and inline assembly to C++. For the former, this is typically as easy as replacing them with newly declared local variables. No need to bother with function prolog and epilog code, calling conventions, or the build system.
No duplication of constants and structures in ASM land.
As a more expressive language, C++ can document the code much better. Meticulous documentation seems to have become the main attraction of ReC98 these days – I've seen it appreciated quite a number of times, and the continued financial support of all the backers speaks volumes. Mods, on the other hand, are still a rather rare sight.
Having as few .ASM files in the source tree as possible looks better to casual visitors who just look at GitHub's repo language breakdown. This way, ReC98 will also turn from an "Assembly project" to its rightful state of "C++ project" much sooner.
And finally, it's not like the ASM versions are gone – they're still part of the Git history.

Unfortunately, these pushes also demonstrated a second disadvantage in trying to decompile everything possible: Since Turbo C++ lacks TASM's fine-grained ability to enforce code alignment on certain multiples of bytes, it might actually be unfeasible to link in a C-compiled object file at its intended original position in some of the .EXE files it's used in. Which… you're only going to notice once you encounter such a case. Due to the slightly jumbled order of functions in the 📝 second, shared code segment, that might be long after you decompiled and successfully linked in the function everywhere else.

And then you'll have to throw away that decompilation after all 😕 Oh well. In this specific case (the lookup table generator for horizontally flipping images), that decompilation was a mess anyway, and probably helped nobody. I could have added a dummy .OBJ that does nothing but enforce the needed 2-byte alignment before the function if I really insisted on keeping the C version, but it really wasn't worth it.

Now that I've also described yet another meta-issue, maybe there'll really be nothing to say about the next technical debt pushes? Next up though: Back to actual progress again, with TH01. Which maybe even ends up pushing that game over the 50% RE mark?

📝 Posted:: 2020-09-21 13:01 UTC
🚚 Summary of:: P0119
⌨ Commits:: cbf14eb...453dd3c
💰 Funded by:: [Anonymous], -Tom-
🏷 Tags:

So, TH05 OP.EXE. The first half of this push started out nicely, with an easy decompilation of the entire player character selection menu. Typical ZUN quality, with not much to say about it. While the overall function structure is identical to its TH04 counterpart, the two games only really share small snippets inside these functions, and do need to be RE'd separately.

The high score viewing (not registration) menu would have been next. Unfortunately, it calls one of the GENSOU.SCR loading functions… which are all a complete mess that still needed to be sorted out first. 5 distinct functions in 6 binaries, and of course TH05 also micro-optimized its MAIN.EXE version to directly use the DOS INT 21h file loading API instead of master.lib's wrappers. Could have all been avoided with a single method on the score data structure, taking a player character ID and a difficulty level as parameters…

So, no score menu in this push then. Looking at the other end of the ASM code though, we find the starting functions for the main game, the Extra Stage, and the demo replays, which did fit perfectly to round out this push.

Which is where we find an easter egg! 🥚 If you've ever looked into 怪綺談2.DAT, you might have noticed 6 .REC files with replays for the Demo Play mode. However, the game only ever seems to cycle between 4 replays. So what's in the other two, and why are they 40 KB instead of just 10 KB like the others? Turns out that they combine into a full Extra Stage Clear replay with Mima, with 3 bombs and 1 death, obviously recorded by ZUN himself. The split into two files for the stage (DEMO4.REC) and boss (DEMO5.REC) portion is merely an attempt to limit the amount of simultaneously allocated heap memory.
To watch this replay without modding the game, unlock the Extra Stage with all 4 characters, then hold both the ⬅️ left and ➡️ right arrow keys in the main menu while waiting for the usual demo replay. I can't possibly be the first one to discover this, but I couldn't find any other mention of it.
Edit (2021-03-15): ZUN did in fact document this replay in Section 6 of TH05's OMAKE.TXT, along with the exact method to view it. Thanks to Popfan for the discovery!

Here's a recording of the whole replay:

Note how the boss dialogue is skipped. MAIN.EXE actually contains no less than 6 if() branches just to distinguish this overly long replay from the regular ones.

I'd really like to do the TH04 and TH05 main menus in parallel, since we can expect a bit more shared code after all the initial differences. Therefore, I'm going to put the next "anything" push towards covering the TH04 version of those functions. Next up though, it's back to TH01, with more redundant image format code…

📝 Posted:: 2020-08-16 19:48 UTC
🚚 Summary of:: P0109
⌨ Commits:: dcf4e2c...2c7d86b
💰 Funded by:: [Anonymous], Blue Bolt
🏷 Tags:

Back to TH05! Thanks to the good funding situation, I can strike a nice balance between getting TH05 position-independent as quickly as possible, and properly reverse-engineering some missing important parts of the game. Once 100% PI will get the attention of modders, the code will then be in better shape, and a bit more usable than if I just rushed that goal.

By now, I'm apparently also pretty spoiled by TH01's immediate decompilability, after having worked on that game for so long. Reverse-engineering in ASM land is pretty annoying, after all, since it basically boils down to meticulously editing a piece of ASM into something I can confidently call "reverse-engineered". Most of the time, simply decompiling that piece of code would take just a little bit longer, but be massively more useful. So, I immediately tried decompiling with TH05… and it just worked, at every place I tried!? Whatever the issue was that made 📝 segment splitting so annoying at my first attempt, I seem to have completely solved it in the meantime. 🤷 So yeah, backers can now request pretty much any part of TH04 and TH05 to be decompiled immediately, with no additional segment splitting cost.

(Protip for everyone interested in starting their own ReC project: Just declare one segment per function, right from the start, then group them together to restore the original code segmentation…)

Except that TH05 then just throws more of its infamous micro-optimized and undecompilable ASM at you. 🙄 This push covered the function that adjusts the bullet group template based on rank and the selected difficulty, called every time such a group is configured. Which, just like pretty much all of TH05's bullet spawning code, is one of those undecompilable functions. If C allowed labels of other functions as goto targets, it might have been decompilable into something useful to modders… maybe. But like this, there's no point in even trying.

This is such a terrible idea from a software architecture point of view, I can't even. Because now, you suddenly have to mirror your C++ declarations in ASM land, and keep them in sync with each other. I'm always happy when I get to delete an ASM declaration from the codebase once I've decompiled all the instances where it was referenced. But for TH05, we now have to keep those declarations around forever. 😕 And all that for a performance increase you probably couldn't even measure. Oh well, pulling off Galaxy Brain-level ASM optimizations is kind of fun if you don't have portability plans… I guess?

If I started a full fangame mod of a PC-98 Touhou game, I'd base it on TH04 rather than TH05, and backport selected features from TH05 as needed. Just because it was released later doesn't make it better, and this is by far not the only one of ZUN's micro-optimizations that just went way too far.

Dropping down to ASM also makes it easier to introduce weird quirks. Decompiled, one of TH05's tuning conditions for stack groups on Easy Mode would look something like:

case BP_STACK:
	// […]
	if(spread_angle_delta >= 2) {
		stack_bullet_count--;
	}

The fields of the bullet group template aren't typically reset when setting up a new group. So, spread_angle_delta in the context of a stack group effectively refers to "the delta angle of the last spread group that was fired before this stack – whenever that was". uth05win also spotted this quirk, considered it a bug, and wrote fanfiction by changing spread_angle_delta to stack_bullet_count.
As usual for functions that occur in more than one game, I also decompiled the TH04 bullet group tuning function, and it's perfectly sane, with no such quirks.

In the more PI-focused parts of this push, we got the TH05-exclusive smooth boss movement functions, for flying randomly or towards a given point. Pretty unspectacular for the most part, but we've got yet another uth05win inconsistency in the latter one. Once the Y coordinate gets close enough to the target point, it actually speeds up twice as much as the X coordinate would, whereas uth05win used the same speedup factors for both. This might make uth05win a couple of frames slower in all boss fights from Stage 3 on. Hard to measure though – and boss movement partly depends on RNG anyway.

Next up: Shinki's background animations – which are actually the single biggest source of position dependence left in TH05.

📝 Posted:: 2020-02-16 20:51 UTC
🚚 Summary of:: P0072, P0073, P0074, P0075
⌨ Commits:: 4bb04ab...cea3ea6, cea3ea6...5286417, 5286417...1807906, 1807906...222fc99
💰 Funded by:: [Anonymous], -Tom-, Myles
🏷 Tags:

Long time no see! And this is exactly why I've been procrastinating bullets while there was still meaningful progress to be had in other parts of TH04 and TH05: There was bound to be quite some complexity in this most central piece of game logic, and so I couldn't possibly get to a satisfying understanding in just one push.

Or in two, because their rendering involves another bunch of micro-optimized functions adapted from master.lib.

Or in three, because we'd like to actually name all the bullet sprites, since there are a number of sprite ID-related conditional branches. And so, I was refining things I supposedly RE'd in the the commits from the first push until the very end of the fourth.

When we talk about "bullets" in TH04 and TH05, we mean just two things: the white 8×8 pellets, with a cap of 240 in TH04 and 180 in TH05, and any 16×16 sprites from MIKO16.BFT, with a cap of 200 in TH04 and 220 in TH05. These are by far the most common types of… err, "things the player can collide with", and so ZUN provides a whole bunch of pre-made motion, animation, and n-way spread / ring / stack group options for those, which can be selected by simply setting a few fields in the bullet template. All the other "non-bullets" have to be fired and controlled individually.

Which is nothing new, since uth05win covered this part pretty accurately – I don't think anyone could just make up these structure member overloads. The interesting insights here all come from applying this research to TH04, and figuring out its differences compared to TH05. The most notable one there is in the default groups: TH05 allows you to add a stack to any single bullet, n-way spread or ring, but TH04 only lets you create stacks separately from n-way spreads and rings, and thus gets by with fewer fields in its bullet template structure. On the other hand, TH04 has a separate "n-way spread with random angles, yet still aimed at the player" group? Which seems to be unused, at least as far as midbosses and bosses are concerned; can't say anything about stage enemies yet.

In fact, TH05's larger bullet template structure illustrates that these distinct group types actually are a rather redundant piece of over-engineering. You can perfectly indicate any permutation of the basic groups through just the stack bullet count (1 = no stack), spread bullet count (1 = no spread), and spread delta angle (0 = ring instead of spread). Add a 4-flag bitfield to cover the rest (aim to player, randomize angle, randomize speed, force single bullet regardless of difficulty or rank), and the result would be less redundant and even slightly more capable.

Even those 4 pushes didn't quite finish all of the bullet-related types, stopping just shy of the most trivial and consistent enum that defines special movement. This also left us in a 📝 TH03-like situation, in which we're still a bit away from actually converting all this research into actual RE%. Oh well, at least this got us way past 50% in overall position independence. On to the second half! 🎉

For the next push though, we'll first have a quick detour to the remaining C code of all the ZUN.COM binaries. Now that the 📝 TH04 and TH05 resident structures no longer block those, -Tom- has requested TH05's RES_KSO.COM to be covered in one of his outstanding pushes. And since 32th System recently RE'd TH03's resident structure, it makes sense to also review and merge that, before decompiling all three remaining RES_*.COM binaries in hopefully a single push. It might even get done faster than that, in which case I'll then review and merge some more of WindowsTiger's research.

📝 Posted:: 2019-11-29 00:15 UTC
🚚 Summary of:: P0060
⌨ Commits:: 29385dd...73f5ae7
💰 Funded by:: Touhou Patch Center
🏷 Tags:

So, where to start? Well, TH04 bullets are hard, so let's ~~procrastinate~~ start with TH03 instead The 📝 sprite display functions are the obvious blocker for any structure describing a sprite, and therefore most meaningful PI gains in that game… and I actually did manage to fit a decompilation of those three functions into exactly the amount of time that the Touhou Patch Center community votes alloted to TH03 reverse-engineering!

And a pretty amazing one at that. The original code was so obviously written in ASM and was just barely decompilable by exclusively using register pseudovariables and a bit of goto, but I was able to abstract most of that away, not least thanks to a few helpful optimization properties of Turbo C++… seriously, I can't stop marveling at this ancient compiler. The end result is both readable, clear, and dare I say portable?! To anyone interested in porting TH03, take a look. How painful would it be to port that away from 16-bit x86?

However, this push is also a typical example that the RE/PI priorities can only control what I look at, and the outcome can actually differ greatly. Even though the priorities were 65% RE and 35% PI, the progress outcome was +0.13% RE and +1.35% PI. But hey, we've got one more push with a focus on TH03 PI, so maybe that one will include more RE than PI, and then everything will end up just as ordered?

📝 Posted:: 2019-09-21 12:11 UTC
🚚 Summary of:: P0031, P0032, P0033
⌨ Commits:: dea40ad...9f764fa, 9f764fa...e6294c2, e6294c2...6cdd229
💰 Funded by:: zorg
🏷 Tags:

The glacial pace continues, with TH05's unnecessarily, inappropriately micro-optimized, and hence, un-decompilable code for rendering the current and high score, as well as the enemy health / dream / power bars. While the latter might still pass as well-written ASM, the former goes to such ridiculous levels that it ends up being technically buggy. If you enjoy quality ZUN code, it's definitely worth a read.

In TH05, this all still is at the end of code segment #1, but in TH04, the same code lies all over the same segment. And since I really wanted to move that code into its final form now, I finally did the research into decompiling from anywhere else in a segment.

Turns out we actually can! It's kinda annoying, though: After splitting the segment after the function we want to decompile, we then need to group the two new segments back together into one "virtual segment" matching the original one. But since all ASM in ReC98 heavily relies on being assembled in MASM mode, we then start to suffer from MASM's group addressing quirk. Which then forces us to manually prefix every single function call

from inside the group
to anywhere else within the newly created segment

with the group name. It's stupidly boring busywork, because of all the function calls you mustn't prefix. Special tooling might make this easier, but I don't have it, and I'm not getting crowdfunded for it.

So while you now definitely can request any specific thing in any of the 5 games to be decompiled right now, it will take slightly longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)

Only one function away from the TH05 shot type control functions now!

📝 Posted:: 2019-03-06 19:39 UTC
🚚 Summary of:: P0051, P0052, P0053
⌨ Commits:: 6ed8e60...3ba536a
💰 Funded by:: -Tom-
🏷 Tags:

Boss explosions! And… urgh, I really also had to wade through that overly complicated HUD rendering code. Even though I had to pick -Tom-'s 7th push here as well, the worst of that is still to come. TH04 and TH05 exclusively store the current and high score internally as unpacked little-endian BCD, with some pretty dense ASM code involving the venerable x86 BCD instructions to update it.

So, what's actually the goal here. Since I was given no priorities , I still haven't had to (potentially) waste time researching whether we really can decompile from anywhere else inside a segment other than backwards from the end. So, the most efficient place for decompilation right now still is the end of TH05's main_01_TEXT segment. With maybe 1 or 2 more reverse-engineering commits, we'd have everything for an efficient decompilation up to sub_123AD. And that mass of code just happens to include all the shot type control functions, and makes up 3,007 instructions in total, or 12% of the entire remaining unknown code in MAIN.EXE.

So, the most reasonable thing would be to actually put some of the upcoming decompilation pushes towards reverse-engineering that missing part. I don't think that's a bad deal since it will allow us to mod TH05 shot types in C sooner, but zorg and qp might disagree

Next up: thcrap TL notes, followed by finally finishing GhostPhanom's old ReC98 future-proofing pushes. I really don't want to decompile without a proper build system.

📝 Posted:: 2019-01-01 01:38 UTC
🚚 Summary of:: P0046
⌨ Commits:: 612beb8...deb45ea
💰 Funded by:: -Tom-
🏷 Tags:

Stumbled across one more drawing function in the way… which was only a duplicated and seemingly pointlessly micro-optimized copy of master.lib's super_roll_put_tiny() function, used for fast display of 4-color 16×16 sprites.

With this out of the way, we can tackle player shot sprite animation next. This will get rid of a lot of code, since every power level of every character's shot type is implemented in its own function. Which makes up thousands of instructions in both TH04 and TH05 that we can nicely decompile in the future without going through a dedicated reverse-engineering step.

📝 Posted:: 2018-12-16 23:34 UTC
🚚 Summary of:: P0023, P0024
⌨ Commits:: 807df3d...0cde4b7
💰 Funded by:: zorg
🏷 Tags:

Actually, I lied, and lasers ended up coming with everything that makes reverse-engineering ZUN code so difficult: weirdly reused variables, unexpected structures within structures, and those TH05-specific nasty, premature ASM micro-optimizations that will waste a lot of time during decompilation, since the majority of the code actually was C, except for where it wasn't.

KEBARI	TAME	LASER	LASER2	BOMB	SELECT	HIT	CANCEL	WARNING	SBLASER	BUZZ	MISSILE	JOINT	DEAD	SBBOMB	BOSSBOMB	ENEMYSHOT	HLASER	TAMEFAST	WARP

⮜ Blog

⮜ List of tags

Showing all posts tagged micro-optimization-

Showing all posts tagged