I'm 13 days late, but 🎉 ReC98 is now 10 years old! 🎉 On June 26, 2014, I first tried exporting IDA's disassembly of TH05's OP.EXE and reassembling and linking the resulting file back into a binary, and was amazed that it actually yielded an identical binary. Now, this doesn't actually mean that I've spent 10 years working on this project; priorities have been shifting and continue to shift, and time-consuming mistakes were certainly made. Still, it's a good occasion to finally fully realize the good future for ReC98 that GhostPhanom invested in with the very first financial contribution back in 2018, deliver the last three of the first four reserved pushes, cross another piece of time-consuming maintenance off the list, and prepare the build process for hopefully the next 10 years.
But why did it take 8 pushes and over two months to restore feature parity with the old system? 🥲
The original plan for ReC98's good future was quite different from what I ended up shipping here. Before I started writing the code for this website in August 2019, I focused on feature-completing the experimental 16-bit DOS build system for Borland compilers that I'd been developing since 2018, and which would form the foundation of my internal development work in the following years. Eventually, I wanted to polish and publicly release this system as soon as people stopped throwing money at me. But as of November 2019, just one month after launch, the store kept selling out with everyone investing into all the flashier goals, so that release never happened.
The main idea behind the system still has its charm: Your build script is a regular C++ program that #includes the build system as a static library and passes fixed structures with names of source files and build flags. By employing static constructors, even a 1994 Turbo C++ would let you define the whole build at compile time, although this certainly requires some dank preprocessor magic to remain anywhere near readable at ReC98 scale. 🪄 While this system does require a bootstrapping process, the resulting binary can then use the same dependency-checking mechanisms to recompile and overwrite itself if you change the C++ build code later. Since DOS just simply loads an entire binary into RAM before executing it, there is no lock to worry about, and overwriting the originating binary is something you can just do.
Later on, the system also made use of batched compilation: By passing more than one source file to TCC.EXE, you get to avoid TCC's quite noticeable startup times, thus speeding up the build proportional to the number of translation units in each batch. Of course, this requires that every passed source file is supposed to be compiled with the same set of command-line flags, but that's a generally good complexity-reducing guideline to follow in a build script. I went even further and enforced this guideline in the system itself, thus truly making per-file compiler command line switches considered harmful. Thanks to Turbo C++'s #pragma option, changing the command line isn't even necessary for the few unfortunate cases where parts of ZUN's code were compiled with inconsistent flags.
I combined all these ideas with a general approach of "targeting DOSBox": By maximizing DOS syscalls and minimizing algorithms and data structures, we spend as much time as possible in DOSBox's native-code DOS implementation, which should give us a performance advantage over DOS-native implementations of MAKE that typically follow the opposite approach.
Of course, all this only matters if the system is correct and reliable at its core. Tup teaches us that it's fundamentally impossible to have a reliable generic build system without
augmenting the build graph with all actual files read and written by each invoked build tool, which involves tracing all file-related syscalls, and
persistently serializing the full build graph every time the system runs, allowing later runs to detect every possible kind of change in the build script and rebuild or clean up accordingly.
Unfortunately, the design limitations of my system only allowed half-baked attempts at solving both of these prerequisites:
If your build system is not supposed to be generic and only intended to work with specific tools that emit reliable dependency information, you can replace syscall tracing with a parser for those specific formats. This is what my build system was doing, reading dependency information out of each .OBJ file's OMF COMENT record.
Since DOS command lines are limited to 127 bytes, DOS compilers support reading additional arguments from response files, typically indicated with an @ next to their path on the command line. If we now put every parameter passed to TCC or TLINK into a response file and leave these files on disk afterward, we've effectively serialized all command-line arguments of the entire build into a makeshift database. In later builds, the system can then detect changed command-line arguments by comparing the existing response files from the previous run with the new contents it would write based on the current build structures. This way, we still only recompile the parts of the codebase that are affected by the changed arguments, which is fundamentally impossible with Makefiles.
But this strategy only covers changes within each binary's compile or link arguments, and ignores the required deletions in "the database" when removing binaries between build runs. This is a non-issue as long as we keep decompiling on master, but as soon as we switch between master and similarly old commits on the debloated/anniversary branches, we can get very confusing errors:
Apparently, there's also such a thing as "too much batching", because TCC would suddenly stop applying certain compiler optimizations at very specific places if too many files were compiled within a single process? At least you quickly remember which source files you then need to manually touch and recompile to make the binaries match ZUN's original ones again…
But the final nail in the coffin was something I'd notice on every single build: 5 years down the line, even the performance argument wasn't convincing anymore. The strategy of minimizing emulated code still left me with an 𝑂(𝑛) algorithm, and with this entire thing still being single-threaded, there was no force to counteract the dependency check times as they grew linearly with the number of source files.
At P0280, each build run would perform a total of 28,130 file-related DOS syscalls to figure out which source files have changed and need to be rebuilt. At some point, this was bound to become noticeable even despite these syscalls being native, not to mention that they're still surrounded by emulator code that must convert their parameters and results to and from the DOS ABI. And with the increasing delays before TCC would do its actual work, the entire thing started feeling increasingly jankier.
While this system was waiting to be eventually finished, the public master branch kept using the Makefile that dates back to early 2015. Back then, it didn't takelong for me to abandon raw dumb batch files because Make was simply the most straightforward way of ensuring that the build process would abort on the first compile error.
The following years also proved that Makefile syntax is quite well-suited for expressing the build rules of a codebase at this scale. The built-in support for automatically turning long commands into response files was especially helpful because of how naturally it works together with batched compilation. Both of these advantages culminate in this wonderfully arcane incantation of ASCII special characters and syntactically significant linebreaks:
Which translates to "take the filenames of all dependents of this explicit rule, write them into a temporary file with an autogenerated name, insert this filename into the tcc … @ command line, and delete the file after the command finished executing". The @ is part of TCC's command-line interface, the rest is all MAKE syntax.
But 📝 as we all know by now, these surface-level niceties change nothing about Makefiles inherently being unreliable trash due to implementing none of the aforementioned two essential properties of a generic build system. Borland got so close to a correct and reliable implementation of autodependencies, but that would have just covered one of the two properties. Due to this unreliability, the old build16b.bat called Borland's MAKER.EXE with the -B flag, recompiling everything all the time. Not only did this leave modders with a much worse build process than I was using internally, but it also eventually got old for me to merge my internal branch onto master before every delivery. Let's finally rectify that and work towards a single good build process for everyone.
As you would expect by now, I've once again migrated to Tup's Lua syntax. Rewriting it all makes you realize once again how complex the PC-98 Touhou build process is: It has to cover 2 programming languages, 2 pipeline steps, and 3 third-party libraries, and currently generates a total of 39 executables, including the small programs I wrote for research. The final Lua code comprises over 1,300 lines – but then again, if I had written it in 📝 Zig, it would certainly be as long or even longer due to manual memory management. The Tup building blocks I constructed for Shuusou Gyoku quickly turned out to be the wrong abstraction for a project that has no debug builds, but their 📝 basic idea of a branching tree of command-line options remained at the foundation of this script as well.
This rewrite also provided an excellent opportunity for finally dumping all the intermediate compilation outputs into a separate dedicated obj/ subdirectory, finally leaving bin/ nice and clean with only the final executables. I've also merged this new system into most of the public branches of the GitHub repo.
As soon as I first tried to build it all though, I was greeted with a particularly nasty Tup bug. Due to how DOS specified file metadata mutation, MS-DOS Player has to open every file in a way that current Tup treats as a write access… but since unannotated file writes introduce the risk of a malformed build graph if these files are read by another build command later on, Tup providently deletes these files after the command finished executing. And by these files, I mean TCC.EXE as well as every one of its C library header files opened during compilation.
Due to a minor unsolved question about a failing test case, my fix has not been merged yet. But even if it was, we're now faced with a problem: If you previously chose to set up Tup for ReC98 or 📝 Shuusou Gyoku and are maybe still running 📝 my 32-bit build from September 2020, running the new build.bat would in fact delete the most important files of your Turbo C++ 4.0J installation, forcing you to reinstall it or restore it from a backup. So what do we do?
Should my custom build get a special version number so that the surrounding batch file can fail if the version number of your installed Tup is lower?
Or do I just put a message somewhere, which some people invariably won't read?
The easiest solution, however, is to just put a fixed Tup binary directly into the ReC98 repo. This not only allows me to make Tup mandatory for 64-bit builds, but also cuts out one step in the build environment setup that at least one person previously complained about. *nix users might not like this idea all too much (or do they?), but then again, TASM32 and the Windows-exclusive MS-DOS Player require Wine anyway. Running Tup through Wine as well means that there's only one PATH to worry about, and you get to take advantage of the tool checks in the surrounding batch file.
If you're one of those people who doesn't trust binaries in Git repos, the repo also links to instructions for building this binary yourself. Replicating this specific optimized binary is slightly more involved than the classic ./configure && make && make install trinity, so having these instructions is a good idea regardless of the fact that Tup's GPL license requires it.
One particularly interesting aspect of the Lua code is the way it handles sprite dependencies:
If build commands read from files that were created by other build commands, Tup requires these input dependencies to be spelled out so that it can arrange the build graph and parallelize the build correctly. We could simply put every sprite into a single array and automatically pass that as an extra input to every source file, but that would effectively split the build into a "sprite convert" and "code compile" phase. Spelling out every individual dependency allows such source files to be compiled as soon as possible, before (and in parallel to) the rest of the sprites they don't depend on. Similarly, code files without sprite dependencies can compile before the first sprite got converted, or even before the sprite converter itself got compiled and linked, maximizing the throughput of the overall build process.
Running a 30-year-old DOS toolchain in a parallel build system also introduces new issues, though. The easiest and recommended way of compiling and linking a program in Turbo C++ is a single tcc invocation:
This performs a batched compilation of main.cpp and utils.cpp within a single TCC process, and then launches TLINK to link the resulting .obj files into main.exe, together with the C++ runtime library and any needed objects from master.lib. The linking step works by TCC generating a TLINK command line and writing it into a response file with the fixed name turboc.$ln… which obviously can't work in a parallel build where multiple TCC processes will want to link different executables via the same response file.
Therefore, we have to launch TLINK with a custom response file ourselves. This file is echo'd as a separate parallel build rule, and the Lua code that constructs its contents has to replicate TCC's logic for picking the correct C++ runtime .lib file for the selected memory model.
While this does add more string formatting logic, not relying on TCC to launch TLINK actually removes the one possible PATH-related error case I previously documented in the README. Back in 2021 when I first stumbled over the issue, it took a few hours of RE to figure this out. I don't like these hours to go to waste, so here's a Gist, and here's the text replicated for SEO reasons:
Issue: TCC compiles, but fails to link, with Unable to execute command 'tlink.exe'
Cause: This happens when invoking TCC as a compiler+linker, without the -c flag. To locate TLINK, TCC needlessly copies the PATH environment variable into a statically allocated 128-byte buffer. It then constructs absolute tlink.exe filenames for each of the semicolon- or \0-terminated paths, writing these into a buffer that immediately follows the 128-byte PATH buffer in memory. The search is finished as soon as TCC finds an existing file, which gives precedence to earlier paths in the PATH. If the search didn't complete until a potential "final" path that runs past the 128 bytes, the final attempted filename will consist of the part that still managed to fit into the buffer, followed by the previously attempted path.
Workaround: Make sure that the BIN\ path to Turbo C++ is fully contained within the first 127 bytes of the PATH inside your DOS system. (The 128th byte must either be a separating ; or the terminating \0 of the PATH string.)
Now that DOS emulation is an integral component of the single-part build process, it even makes sense to compile our pipeline tools as 16-bit DOS executables and then emulate them as part of the build. Sure, it's technically slower, but realistically it doesn't matter: Our only current pipeline tools are 📝 the converter for hardcoded sprites and the 📝 ZUN.COM generators, both of which involve very little code and are rarely run during regular development after the initial full build. In return, we get to drop that awkward dependency on the separate Borland C++ 5.5 compiler for Windows and yet another additional manual setup step. 🗑️ Once PC-98 Touhou becomes portable, we're probably going to require a modern compiler anyway, so you can now delete that one as well.
That gives us perfect dependency tracking and minimal parallel rebuilds across the whole codebase! While MS-DOS Player is noticeably slower than DOSBox-X, it's not going to matter all too much; unless you change one of the more central header files, you're rarely if ever going to cause a full rebuild. Then again, given that I'm going to use this setup for at least a couple of years, it's worth taking a closer look at why exactly the compilation performance is so underwhelming …
On the surface, MS-DOS Player seems like the right tool for our job, with a lot of advantages over DOSBox:
It doesn't spawn a window that boots an entire emulated PC, but is instead
perfectly integrated into the Windows console. Using it in a modern developer console would allow you to click on a compile error and have your editor immediately open the relevant file and jump to that specific line! With DOSBox, this basic comfort feature was previously unthinkable.
Heck, Takeda Toshiya originally developed it to run the equally vintage LSI C-86 compiler on 64-bit Windows. Fixing any potential issues we'd run into would be well within the scope of the project.
It consists of just a single comparatively small binary that we could just drop into the ReC98 repo. No manual setup steps required.
But once I began integrating it, I quickly noticed two glaring flaws:
Back in 2009, Takeda Toshiya chose to start the project by writing a custom DOS implementation from scratch. He was aware of DOSBox, but only adapted small tricky parts of its source code rather than starting with the DOSBox codebase and ripping out everything he didn't need. This matches the more research-oriented nature that all of his projects appear to follow, where the primary goal of writing the code is a personal understanding of the problem domain rather than a widely usable piece of software. MS-DOS Player is even the outlier in this regard, with Takeda Toshiya describing it as 珍しく実用的かもしれません. I am definitely sympathetic to this mindset; heck, my old internal build system falls under this category too, being so specialized and narrow that it made little sense to use it outside of ReC98. But when you apply it to emulators for niche systems, you end up with exactly the current PC-98 emulation scene, where there's no single universally good emulator because all of them have some inaccuracy somewhere. This scene is too small for you not to eventually become part of someone else's supply chain… 🥲
Emulating DOS is a particularly poor fit for a research/NIH project because it's Hyrum's Law incarnate. With the lack of memory protection in Real Mode, programs could freely access internal DOS (and even BIOS) data structures if they only knew where to look, and frequently did. It might look as if "DOS command-line tools" just equals x86 plus INT 21h, but soon you'll also be emulating the BIOS, PIC, PIT, EMS, XMS, and probably a few more things, all with their individual quirks that some application out there relies on. DOSBox simply had much more time to grow and mature and figure out all of these details by trial and error. If you start a DOS emulator from scratch, you're bound to duplicate all this research as people want to use your emulator to run more and more programs, until you've ended up with what's effectively a clone of DOSBox's exact logic. Unless, of course, if you draw a line somewhere and limit the scope of the DOS and BIOS emulation. But given how many people have wanted to use MS-DOS Player for running DOS TUIs in arbitrarily sized terminal windows with arbitrary fonts, that's not what happened. I guess it made sense for this use case before DOSBox-X gained a TTF output mode in late 2020?
As usual, I wouldn't mention this if I didn't run into twobugs when combining MS-DOS Player with Turbo C++ and Tup. Both of these originated from workarounds for inaccuracies in the DOS emulation that date back to MS-DOS Player's initial release and were thankfully no longer necessary with the accuracy improvements implemented in the years since.
For CPU emulation, MS-DOS Player can use either MAME's or Neko Project 21/W's x86 core, both of which are interpreters and won't win any performance contests. The NP21/W core is significantly better optimized and runs ≈41% faster, but still pales in comparison to DOSBox-X's dynamic recompiler. Running the same sequential commands that the P0280 Makefile would execute, the upstream 2024-03-02 NP21/W core build of MS-DOS Player would take to compile the entire ReC98 codebase on my system, whereas DOSBox-X's dynamic core manages the same in , or 94% faster.
Granted, even the DOSBox-X performance is much slower than we would like it to be. Most of it can be blamed on the awkward time in the early-to-mid-90s when Turbo C++ 4.0J came out. This was the time when DOS applications had long grown past the limitations of the x86 Real Mode and required DOS extenders or even sillier hacks to actually use all the RAM in a typical system of that period, but Win32 didn't exist yet to put developers out of this misery. As such, this compiler not only requires at least a 386 CPU, but also brings its own DOS extender (DPMI16BI.OVL) plus a loader for said extender (RTM.EXE), both of which need to be emulated alongside the compiler, to the great annoyance of emulator maintainers 30 years later. Even MS-DOS Player's README file notes how Protected Mode adds a lot of complexity and slowdown:
8086 binaries are much faster than 80286/80386/80486/Pentium4/IA32 binaries.
If you don't need the protected mode or new mnemonics added after 80286,
I recommend i86_x86 or i86_x64 binary.
The immediate reaction to these performance numbers is obvious: Let's just put DOSBox-X's dynamic recompiler into MS-DOS Player, right?! 🙌 Except that once you look at DOSBox-X, you immediately get why Takeda Toshiya might have preferred to start from scratch. Its codebase is a historically grown tangled mess, requiring intimate familiarity and a significant engineering effort to isolate the dynamic core in the first place. I did spend a few days trying to untangle and copy it all over into MS-DOS Player… only to be greeted with an infinite loop as soon as everything compiled for the first time. 😶 Yeah, no, that's bound to turn into a budget-exceeding maintenance nightmare.
Instead, let's look at squeezing at least some additional performance out of what we already have. A generic emulator for the entire CISCy instruction set of the 80386, with complete support for Protected Mode, but it's only supposed to run the subset of instructions and features used by a specific compiler and linker as fast as possible… wait a moment, that sounds like a use case for profile-guided optimization! This is the first time I've encountered a situation that would justify the required 2-phase build process and lengthy profile collection – after all, writing into some sort of database for every function call does slow down MS-DOS Player by roughly 15×. However, profiling just the compilation of our most complex translation unit (📝 TH01 YuugenMagan) and the linking of our largest executable (TH01's REIIDEN.EXE) should be representative enough.
I'll get to the performance numbers later, but even the build output is quite intriguing. Based on this profile, Visual Studio chooses to optimize only 104 out of MS-DOS Player's 1976 functions for speed and the rest for size, shaving off a nice 109 KiB from the binary. Presumably, keeping rare code small is also considered kind of fast these days because it takes up less space in your CPU's instruction cache once it does get executed?
With PGO as our foundation, let's run a performance profile and see if there are any further code-level optimizations worth trying out:
Removing redundant memset() calls: MS-DOS Player is written in a very C-like style of C++, and initializes a bunch of its statically allocated data by memset()ing it with 00 bytes at startup. This is strictly redundant even in C; Section 6.7.9/10 of the C standard mandates that all static data is zero-initialized by default. In turn, the program loaders of modern operating systems employ all sorts of paging tricks to reduce the CPU cost (and actual RAM usage!) of this initialization as much as possible. If you manually memset() afterward, you throw all these advantages out of the window.
Of course, these calls would only ever show up among the top CPU consumers in a performance profile if a program uses a large amount of static data, but the hardcoded 32 MiB of emulated RAM in ≥i386-supporting builds definitely qualifies. Zeroing 32.8 MiB of memory makes up a significant chunk of the runtime of some of the shorter build steps and quickly adds up; a full rebuild of the ReC98 codebase currently spawns a total of 361 MS-DOS Player instances, totaling 11.5 GiB of needless memory writes.
Limiting the emulated instruction set: NP21/W's x86 core emulates everything up to the SSE3 extension from 2004, but Turbo C++ 4.0J's x86 instruction set usage doesn't stretch past the 386. It doesn't even need the x87 FPU for compiling code that involves floating-point constants. Disabling all these unneeded extensions speeds up x86's infamously annoying instruction decoding, and also reduces the size of the MS-DOS Player binary by another 149.5 KiB. The source code already had macros for this purpose, and only needed a slight fix for the code to compile with these macros disabled.
Removing x86 paging: Borland's DOS extender uses segmented memory addressing even in Protected Mode. This allows us to remove the MMU emulation and the corresponding "are we paging" check for every memory access.
Removing cycle counting: When emulating a whole system, counting the cycles of each instruction is important for accurately synchronizing the CPU with other pieces of hardware. As hinted above, MS-DOS Player does emulate and periodically update a few pieces of hardware outside the CPU, but we need none of them for a build tool.
Testing Takeda Toshiya's optimizations: In a nice turn of events, Takeda Toshiya merged every single one of my bugfixes and optimization flags into his upstream codebase. He even agreed with my memset() and cycle counting removal optimizations, which are now part of all upstream builds as of 2024-06-24. For the 2024-06-27 build, he claims to have gone even further than my more minimal optimization, so let's see how these additional changes affect our build process.
Further risky optimizations: A lot of the remaining slowness of x86 emulation comes from the segmentation and protection fault checks required for every memory access. If we assume that the emulator only ever executes correct code, we can remove these checks and implement further shortcuts based on their absence.
The L[DEFGS]S group of instructions that load a segment and offset register from a 32-bit far pointer, for example, are both frequently used in Turbo C++ 4.0J code and particularly expensive to emulate. Intel specified their Real Mode operation as loading the segment and offset part in two separate 16-bit reads. But if we assume that neither of those reads can fault, we can compress them into a single 32-bit read and thus only perform the costly address translation once rather than twice. Emulator authors are probably rolling their eyes at this gross violation of Intel documentation now, but it's at least worth a try to see just how much performance we could get out of it.
Measured on a 6-year-old 6-core Intel Core i5 8400T on Windows 11. The first number in each column represents the codebase before the #include cleanup explained below, and the second one corresponds to this commit. All builds are 64-bit, 32-bit builds were ≈5% slower across the board. I kept the fastest run within three attempts; as Tup parallelizes the build process across all CPU cores, it's common for the long-running full build to take up to a few seconds longer depending on what else is running on your system. Tup's standard output is also redirected to a file here; its regular terminal output and nice progress bar will add more slowdown on top.
The key takeaways:
By merely disabling certain x86 features from MS-DOS Player and retaining the accuracy of the remaining emulation, we get speedups of ≈60% (full build), ≈70% (median TU), and ≈80% (largest TU).
≈25% (full build), ≈29% (median TU), and ≈41% (largest TU) of this speedup came from Visual Studio's profile-guided optimization, with no changes to the MS-DOS Player codebase.
The effects of removing cycle counting are the biggest surprise. Between ≈17% and ≈23%, just for removing one subtraction per emulated instruction? Turns out that in the absence of a "target cycle amount" setting, the x86 emulation loop previously ran for only a single cycle. This caused the PIC check to run after every instruction, followed by PIT, serial I/O, keyboard, mouse, and CRTC update code every millisecond. Without cycle counting, the x86 loop actually keeps running until a CPU exception is raised or the emulated process terminates, skipping the hardware code during the vast majority of the program's execution time.
While Takeda Toshiya's changes in the 2024-06-27 build completely throw out the cycle counter and clean up process termination, they also reintroduce the hardware updates that made up the majority of the cycle removal speedup. This explains the results we're getting: The small speedup for full rebuilds is too insignificant to bother with and might even fall within a statistical margin of error, but the build slows down more and more the longer the emulated process runs. Compiling and linking YuugenMagan takes a whole 14% longer on generic builds, and ≈9-12% longer on PGO builds. I did another in-between test that just removed the x86 loop from the cycle removal version, and got exactly the same numbers. This just goes to show how much removing two writes to a fixed memory address per emulated instruction actually matters. Let's not merge back this one, and stay on top of 2024-06-24 for the time being.
The risky optimizations of ignoring segment limits and speeding up 32-bit segment+offset pointer load instructions could yield a further speedup. However, most of these changes boil down to removing branches that would never be taken when emulating correct x86 code. Consequently, these branches get recorded as unlikely during PGO training, which then causes the profile-guided rebuild to rearrange the instructions on these branches in a way that favors the common case, leaving the rest of their effective removal to your CPU's branch predictor. As such, the 10%-15% speedup we can observe in generic builds collapses down to 2%-6% in PGO builds. At this rate and with these absolute durations, it's not worth it to maintain what's strictly a more inaccurate fork of Neko Project 21/W's x86 core.
The redundant header inclusions afforded by #include guards do in fact have a measurable performance cost on Turbo C++ 4.0J, slowing down compile times by 5%.
But how does this compare to DOSBox-X's dynamic core? Dynamic recompilers need some kind of cache to ensure that every block of original ASM gets recompiled only once, which gives them an advantage in long-running processes after the initial warmup. As a result, DOSBox-X compiles and links YuugenMagan in , ≈92% faster than even our optimized MS-DOS Player build. That percentage resembles the slowdown we were initially getting when comparing full rebuilds between DOSBox-X and MS-DOS Player, as if we hadn't optimized anything.
On paper, this would mean that DOSBox-X barely lost any of its huge advantage when it comes to single-threaded compile+link performance. In practice, though, this metric is supposed to measure a typical decompilation or modding workflow that focuses on repeatedly editing a single file. Thus, a more appropriate comparison would also have to add the aforementioned constant 28,130 syscalls that my old build system required to detect that this is the one file/binary that needs to be recompiled/relinked. The video at the top of this blog post happens to capture the best time () I got for the detection process on DOSBox-X. This is almost as slow as the compilation and linking itself, and would have only gotten slower as we continue decompiling the rest of the games. Tup, on the other hand, performs its filesystem scan in a near-constant , matching the claim in Section 4.7 of its paper, and thus shrinking the performance difference to ≈14% after all. Sure, merging the dynamic core would have been even better (contribution-ideas, anyone?), but this is good enough for now.
Just like with Tup, I've also placed this optimized binary directly into the ReC98 repo and added the specific build instructions to the GitHub release page.
I do have more far-reaching ideas for further optimizing Neko Project 21/W's x86 core for this specific case of repeated switches between Real Mode and Protected Mode while still retaining the interpreted nature of this core, but these already strained the budget enough.
The perhaps more important remaining bottleneck, however, is hiding in the actual DOS emulation. Right now, a Tup-driven full rebuild spawns a total of 361 MS-DOS Player processes, which means that we're booting an emulated DOS 361 times. This isn't as bad as it sounds, as "booting DOS" basically just involves initializing a bunch of internal DOS structures in conventional memory to meaningful values. However, these structures also include a few environment variables like PATH, APPEND, or TEMP/TMP, which MS-DOS Player seamlessly integrates by translating them from their value on the Windows host system to the DOS 8.3 format. This could be one of the main reasons why MS-DOS Player is a native Windows program rather than being cross-platform:
On Windows, this path translation is as simple as calling GetShortPathNameA(), which returns a unique 8.3 name for every component along the path.
Also, drive letters are an integralpart of the DOS INT 21h API, and Windows still uses them as well.
However, the NT kernel doesn't actually use drive letters either, and views them as just a legacy abstraction over its reality of volume GUIDs. Converting paths back and forth between these two views therefore requires it to communicate with a
mount point manager service, which can coincidentally also be observed in debug builds of Tup.
As a result, calling any path-retrieving API is a surprisingly expensive operation on modern Windows. When running a small sprite through our 📝 sprite converter, MS-DOS Player's boot process makes up 56% of the runtime, with 64% of that boot time (or 36% of the entire runtime) being spent on path translation. The actual x86 emulation to run the program only takes up 6.5% of the runtime, with the remaining 37.5% spent on initializing the multithreaded C++ runtime.
But then again, the truly optimal solution would not involve MS-DOS Player at all. If you followed general video game hacking news in May, you'll probably remember the N64 community putting the concept of statically recompiled game ports on the map. In case you're wondering where this seemingly sudden innovation came from and whether a reverse-engineered decompilation project like ReC98 is obsolete now, I wrote a new FAQ entry about why this hype, although justified, is at least in part misguided. tl;dr: None of this can be meaningfully applied to PC-98 games at the moment.
On the other hand, recompiling our compiler would not only be a reasonable thing to attempt, but exactly the kind of problem that recompilation solves best. A 16-bit command-line tool has none of the pesky hardware factors that drag down the usefulness of recompilations when it comes to game ports, and a recompiled port could run even faster than it would on 32-bit Windows. Sure, it's not as flashy as a recompiled game, but if we got a few generous backers, it would still be a great investment into improving the state of static x86 recompilation by simply having another open-source project in that space. Not to mention that it would be a great foundation for improving Turbo C++ 4.0J's code generation and optimizations, which would allow us to simplify lots of awkward pieces of ZUN code… 🤩
That takes care of building ReC98 on 64-bit platforms, but what about the 32-bit ones we used to support? The previous split of the build process into a Tup-driven 32-bit part and a Makefile-driven 16-bit part sure was awkward and I'm glad it's gone, but it did give you the choice between 1) emulating the 16-bit part or 2) running both parts natively on 32-bit Windows. While Tup's upstream Windows builds are 64-bit-only, it made sense to 📝 compile a custom 32-bit version and thus turn any 32-bit Windows ≥Vista into the perfect build platform for ReC98. Older Windows versions that can't run Tup had to build the 32-bit part using a separately maintained dumb batch script created by tup generate, but again, due to Make being trash, they were fully rebuilding the entire codebase every time anyway.
Driving the entire build via Tup changes all of that. Now, it makes little sense to continue using 32-bit Tup:
We need to DLL-inject into a 64-bit MS-DOS Player. Sure, we could compile a 32-bit build of MS-DOS Player, but why would we? If we look at current marketshares, nobody runs 32-bit Windows anymore, not even by accident. If you run 32-bit Windows in 2024, it's because you know what you're doing and made a conscious choice for the niche use case of natively running DOS programs. Emulating them defeats the whole point of setting up this environment to begin with.
It would make sense if Tup could inject into DOS programs, but it can't.
Also, as we're going to see later, requiring Windows ≥Vista goes in the opposite direction of what we want for a 32-bit build. The earlier the Windows version, the better it is at running native DOS tools.
This means that we could now only support 32-bit Windows via an even larger tup generated batch file. We'd have to move the MS-DOS Player prefix of the respective command lines into an environment variable to make Tup use the same rules for both itself and the batch file, but the result seems to work…
…but it's really slow, especially on Windows 9x. 🐌 If we look back at the theory behind my previous custom build system, we can already tell why: Efficiently building ReC98 requires a completely different approach depending on whether you're running a typical modern multi-core 64-bit system or a vintage single-core 32-bit system. On the former, you'd want to parallelize the slow emulation as much as you can, so you maximize the amount of TCC processes to keep all CPU cores as busy as possible. But on the latter, you'd want the exact opposite – there, the biggest annoyance is the repeated startup and shutdown of the VDM, TCC, and its DOS extender, so you want to continue batching translation units into as few TCC processes as possible.
CMake fans will probably feel vindicated now, thinking "that sounds exactly like you need a meta build system 🤪". Leaving aside the fact that the output vomited by all of CMake's Makefile generators is a disgusting monstrosity that's far removed from addressing any performance concerns, we sure could solve this problem by adding another layer of abstraction. But then, I'd have to rewrite my working Lua script into either C++ or (heaven forbid) Batch, which are the only options we'd have for bootstrapping without adding any further dependencies, and I really wouldn't want to do that. Alternatively, we could fork Tup and modify tup generate to rewrite the low-level build rules that end up in Tup's database.
But why should we go for any of these if the Lua script already describes the build in a high-level declarative way? The most appropriate place for transforming the build rules is the Lua script itself…
… if there wasn't the slight problem of Tup forbidding file writes from Lua. 🥲 Presumably, this limitation exists because there is no way of replicating these writes in a tup generated dumb shell script, and it does make sense from that point of view.
But wait, printing to stdout or stderr works, and we always invoke Tup from a batch file anyway. You can now tell where this is going. Hey, exfiltrating commands from a build script to the build system via standard I/O streams works for Rust's Cargo too!
Just like Cargo, we want to add a sufficiently unique prefix to every line of the generated batch script to distinguish it from Tup's other output. Since Tup only reruns the Lua script – and would therefore print the batch file – if the script changed between the previous and current build run, we only want to overwrite the batch file if we got one or more lines. Getting all of this to work wasn't all too easy; we're once again entering the more awful parts of Batch syntax here, which apparently are so terrible that Wine doesn't even bother to correctly implement parts of it. 😩
Most importantly, we don't really want to redirect any of Tup's standard I/O streams. Redirecting stdout disables console output coloring and the pretty progress bar at the bottom, and looping over stderr instead of stdout in Batch is incredibly awkward. Ideally, we'd run a second Tup process with a sub-command that would just evaluate the Lua script if it changed - and fortunately, tup parse does exactly that. 😌
In the end, the optimally fast and ERRORLEVEL-preserving solution involves two temporary files. But since creating files between two Tup runs causes it to reparse the Lua code, which would print the batch file to the unfiltered stdout, we have to hide these temporary files from Tup by placing them into its .tup/ database directory. 🤪
On a more positive note, programmatically generating batches from single-file TCC rules turned out to be a great idea. Since the Lua code maps command-line flags to arrays of input files, it can also batch across binaries, surpassing my old system in this regard. This works especially well on the debloated and anniversary branches, which replace ZUN's little command-line flag inconsistencies with a single set of good optimization flags that every translation unit is compiled with.
Time to fire up some VMs then… only to see the build failing on Windows 9x with multiple unhelpful Bad command or file name errors. Clearly, the long echo lines that write our response files run up against some length limit in command.com and need to be split into multiple ones. Windows 9x's limit is larger than the 127 characters of DOS, that's for sure, and the exact number should just be one search away…
…except that it's not the 1024 characters recounted in a surviving newsgroup post. Sure, lines are truncated to 1023 bytes and that off-by-one error is no big deal in this context, but that's not the whole story:
Wait, what, something about / being the SWITCHAR? And not even just that…
And what's perhaps the worst example:
My complete set of test cases: 2024-07-09-Win9x-batch-tokenizer-tests.bat
So, time to load command.com into DOSBox-X's debugger and step through some code. 🤷 The earliest NT-based Windows versions were ported to a variety of CPUs and therefore received the then-all-new cmd.exe shell written in C, whereas Windows 9x's command.com was still built on top of the dense hand-written ASM code that originated in the very first DOS versions. Fortunately though, Microsoft open-sourced one of the later DOS versions in April. This made it somewhat easier to cross-reference the disassembly even though the Windows 9x version significantly diverged in the parts we're interested in.
And indeed: After truncating to 1023 bytes and parsing out any redirectors, each line is split into tokens around whitespace and = signs and before every occurrence of the SWITCHAR. These tokens are written into a statically allocated 64-element array, and once the code tries to write the 65th element, we get the Bad command or file name error instead.
#
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
String
echo
-DA
1
2
3
a
/B
/C
/D
/1
a
/B
/C
/D
/2
Switch flag
🚩
🚩
🚩
🚩
🚩
🚩
🚩
🚩
The first few elements of command.com's internal argument array after calling the Windows 9x equivalent of parseline with my initial example string. Note how all the "switches" got capitalized and annotated with a flag, whereas the = sign no longer appears in either string or flag form.
Needless to say, this makes no sense. Both DOS and Windows pass command lines as a single string to newly created processes, and since this tokenization is lossy, command.com will just have to pass the original string anyway. If your shell wants to handle tokenization at a central place, it should happen after it decided that the command matches a builtin that can actually make use of a pointer to the resulting token array – or better yet, as the first call of each builtin's code. Doing it before is patently ridiculous.
I don't know what's worse – the fact that Windows 9x blindly grinds each batch line through this tokenizer, or the fact that no documentation of this behavior has survived on today's Internet, if any even ever existed. The closest thing I found was this page that doesn't exist anymore, and it also just contains a mere hint rather than a clear description of the issue. Even the usual Batch experts who document everything else seem to have a blind spot when it comes to this specific issue. As do emulators: DOSBox and FreeDOS only reimplement the sane DOS versions of command.com, and Wine only reimplements cmd.exe.
Oh well. 71 lines of Lua later, the resulting batch file does in fact work everywhere:
But wait, there's more! The codebase now compiles on all 32-bit Windows systems I've tested, and yields binaries that are equivalent to ZUN's… except on 32-bit Windows 10. 🙄 Suddenly, we're facing the exact same batched compilation bug from my custom build system again, with REIIDEN.EXE being 16 bytes larger than it's supposed to be.
Looks like I have to look into that issue after all, but figuring out the exact cause by debugging TCC would take ages again. Thankfully, trial and error quickly revealed a functioning workaround: Separating translation unit filenames in the response file with two spaces rather than one. Really, I couldn't make this up. This is the most ridiculous workaround for a bug I've encountered in a long time.
Hopefully, you've now got the impression that supporting any kind of 32-bit Windows build is way more of a liability than an asset these days, at least for this specific project. "Real hardware", "motivating a TCC recompilation", and "not dropping previous features" really were the only reasons for putting up with the sheer jank and testing effort I had to go through. And I wouldn't even be surprised if real-hardware developers told me that the first reason doesn't actually hold up because compiling ReC98 on actual PC-98 hardware is slow enough that they'd rather compile it on their main machine and then transfer the binaries over some kind of network connection.
I guess it also made for some mildly interesting blog content, but this was definitely the last time I bothered with such a wide variety of Windows versions without being explicitly funded to do so. If I ever get to recompile TCC, it will be 64-bit only by default as well.
Instead, let's have a tier list of supported build platforms that clearly defines what I am maintaining, with just the most convincing 32-bit Windows version in Tier 1. Initially, that was supposed to be Windows 98 SE due to its superior performance, but that's just unreasonable if key parts of the OS remain undocumented and make no sense. So, XP it is.
*nix fans will probably once again be disappointed to see their preferred OS in Tier 2. But at least, all we'd need for that to move up to Tier 1 is a CI configuration, contributed either via funding me or sending a PR. (Look, even more contribution-ideas!)
Getting rid of the Wine requirement for a fully cross-platform build process wouldn't be too unrealistic either, but would require us to make a few quality decisions, as usual:
Do we run the DOS tools by creating a cross-platform MS-DOS Player fork, or do we statically recompile them?
Do we replace 32-bit Windows TASM with the 16-bit DOS TASM.EXE or TASMX.EXE, which we then either run through our forked MS-DOS Player or recompile? This would further slow down the build and require us to get rid of these nice long non-8.3 filenames… 😕 I'd only recommend this after the looming librarization of ZUN's master.lib fork is completed.
Or do we try migrating to JWasm again? As an open-source assembler that aims for MASM compatibility, it's the closest we can get to TASM, but it's not a drop-in replacement by any means. I already tried in late 2014, but encountered too many issues and quickly abandoned the idea. Maybe it works better now that we have less ASM? In any case, this migration would only get easier the less ASM code we have remaining in the codebase as we get closer to the 100% finalization mark.
Y'know what I think would be the best idea for right now, though? Savoring this new build system and spending an extended amount of time doing actual decompilation or modding for a change.
Now that even full rebuilds are decently fast, let's make use of that productivity boost by doing some urgent and far-reaching code cleanup that touches almost every single C++ source file. The most immediately annoying quirk of this codebase was the silly way each translation unit #included the headers it needed. Many years ago, I measured that repeatedly including the same header did significantly impact Turbo C++ 4.0J's compilation times, regardless of any include guards inside. As a consequence of this discovery, I slightly overreacted and decided to just not use any include guards, ever. After all, this emulated build process is slow enough, and we don't want it to needlessly slow down even more! This way, redundantly including any file that adds more than just a few #define macros won't even compile, throwing lots of Multiple definition errors.
Consequently, the headers themselves #included almost nothing. Starting a new translation unit therefore always involved figuring and spelling out the transitive dependencies of the headers the new unit actually wants to use, in a short trial-and-error process. While not too bad by itself, this was bound to become quite counterproductive once we get closer to porting these games: If some inlined function in a header needed access to, let's say, PC-98-specific I/O ports as an implementation detail, the header would have externalized this dependency to the top-level translation unit, which in turn made that that unit appear to contain PC-98-native code even if the unit's code itself was perfectly portable.
But once we start making some of these implicit transitive dependencies optional, it all stops being justifiable. Sometimes, a.hpp declared things that required declarations from b.hpp but these things are used so rarely that it didn't justify adding #include "b.hpp" to all translation units that #include "a.hpp". So how about conditionally declaring these things based on previously #included headers?
Now that we've measured that the sane alternative of include guards comes with a performance cost of just 5% and we've further reduced its effective impact by parallelizing the build, it's worth it to take that cost in exchange for a tidy codebase without such surprises. From now on, every header file will #include its own dependencies and be a valid translation unit that must compile on its own without errors. In turn, this allows us to remove at least 1,000 #include of transitive dependencies from .cpp files. 🗑️
However, that 5% number was only measured after I reduced these redundant #includes to their absolute minimum. So it still makes sense to only add include guards where they are absolutely necessary – i.e., transitively dependent headers included from more than one other file – and continue to (ab)use the Multiple definition compiler errors as a way of communicating "you're probably #including too many headers, try removing a few". Certainly a less annoying error than Undefined symbol.
Since all of this went way over the 7-push mark, we've got some small bits of RE and PI work to round it all out. The .REC loader in TH04 and TH05 is completely unremarkable, but I've got at least a bit to say about TH02's High Score menu. I already decompiled MAINE.EXE's post-Staff Roll variant in 2015, so we were only missing the almost identical MAIN.EXE variant shown after a Game Over or when quitting out of the game. The two variants are similar enough that it mostly needed just a small bit of work to bring my old 2015 code up to current standards, and allowed me to quickly push TH02 over the 40% RE mark.
Functionally, the two variants only differ in two assignments, but ZUN once again chose to copy-paste the entire code to handle them. This was one of ZUN's better copy-pasting jobs though – and honestly, I can't even imagine how you would mess up a menu that's entirely rendered on the PC-98's text RAM. It almost makes you wonder whether ZUN actually used the same #if ENDING preprocessor branching that my decompilation uses… until the visual inconsistencies in the alignment of the place numbers and the and labels clearly give it away as copy-pasted:
Next up: Starting the big Seihou summer! Fortunately, waiting two more months was worth it: In mid-June, Microsoft released a preview version of Visual Studio that, in response to my bug report, finally, finally makes C++ standard library modules fully usable. Let's clean up that codebase for real, and put this game into a window.
TH03 gameplay! 📝 It's been over two years. People have been investing some decent money with the intention of eventually getting netplay, so let's cover some more foundations around player movement… and quickly notice that there's almost no overlap between gameplay RE and netplay preparations? That makes for a fitting opportunity to think about what TH03 netplay would look like:
You'd want UDP rather than TCP for both its low latency and its NAT hole-punching ability
However, raw UDP does not guarantee that the packets arrive in order, or that they even arrive at all
WebRTC implements these reliability guarantees on top of UDP in a modern package, providing the best of both worlds
NAT traversal via public or self-hosted STUN/TURN servers is built into the connection establishment protocol and APIs, so you don't even have to understand the underlying issue
I'm not too deep into networking to argue here, and it clearly works for Ju.N.Owen. If we do explore other options, it would mainly be because I can't easily get something as modern as WebRTC to natively run on Windows 9x or DOS, if we decide to go for that route.
Matchmaking: I like Ju.N.Owen's initial way of copy-pasting signaling codes into chat clients to establish a peer-to-peer connection without a dedicated matchmaking server. progre eventually implemented rooms on the AWS cloud, but signaling codes are still used for spectating and the Pure P2P mode. We'll probably copy the same evolution, with a slight preference for Pure P2P – if only because you would have to check a GDPR consent box before I can put the combination of your room name and IP address into a database. Server costs shouldn't be an issue at the scale I expect this to have.
Rollback: In emulators, rollback netcode can be and has been implemented by keeping savestates of the last few frames together with the local player's inputs and then replaying the emulation with updated inputs of the remote player if a prediction turned out to be incorrect. This technique is a great fit for TH03 for two reasons:
All game state is contained within a relatively small bit of memory. The only heap allocations done in MAIN.EXE are the 📝 .MRS images for gauge attack portraits and bomb backgrounds, and the enemy scripts and formations, both of which remain constant throughout a round. All other state is statically allocated, which can reduce per-frame snapshots from the naive 640 KiB of conventional DOS memory to just the 37 KiB of MAIN.EXE's data segment. And that's the upper bound – this number is only going to go down as we move towards 100% PI, figure out how TH03 uses all its static data, and get to consolidate all mutated data into an even smaller block of memory.
For input prediction, we could even let the game's existing AI play the remote player until the actual inputs come in, guaranteeing perfect play until the remote inputs prove otherwise. Then again… probably only while the remote player is not moving, because the chance for a human to replicate the AI's infamous erratic dodging is fairly low.
The only issue with rollback in specifically a PC-98 emulator is its implications for performance. Rendering is way more computationally expensive on PC-98 than it is on consoles with hardware sprites, involving lots of memory writes to the disjointed 4 bitplane segments that make up the 128 KB framebuffer, and equally as many reads and bitshift operations on sprite data. TH03 lessens the impact somewhat thanks to most of its rendering being EGC-accelerated and thus running inside the emulator as optimized native code, but we'd still be emulating all the x86 code surrounding the EGC accesses – from the emulator's point of view, it looks no different than game logic. Let's take my aging i5 system for example:
With the Screen → No wait option, Neko Project 21/W can emulate TH03 gameplay at 260 FPS, or 4.6× its regular speed.
This leaves room for each frame to contain 3.6 frames of rollback in addition to the frame that's supposed to be displayed,
which results in a maximum safe network latency of ≈63 ms, or a ping of ≈126 ms. According to this site, that's enough for a smooth connection from Germany to any other place in Europe and even out to the US Midwest. At this ping, my system could still run the game without slowdown even if every single frame required a rollback, which is highly unlikely.
Any higher ping, however, could occasionally lead to a rollback queue that's too large for my system to process within a single frame at the intended 56.4 FPS rate. As a result, me playing anyone in the western US is highly likely to involve at least occasional slowdowns. Delaying inputs on purpose is the usual workaround, but isn't Touhou that kind of game series where people use vpatch to get rid of even the default input delay in the Windows games?
So we'd ideally want to put TH03 into an update-only mode that skips all rendering calls during re-simulation of rolled-back frames. Ironically, this means that netplay-focused RE would actually focus on the game's rendering code and ensure that it doesn't mutate any statically allocated data, allowing it to be freely skipped without affecting the game. Imagine palette-based flashing animations that are implemented by gradually mutating statically allocated values – these would cause wrong colors for the rest of the game if the animation doesn't run on every frame.
Implementing all of this into TH03 can be done in one, a few, or all of the following 6 ways, depending on what the backers prefer. Sorted from the most generic to the most specialized solution (and, coincidentally, from least to most total effort required):
Generic PC-98 netcode for one or more emulators
This is the most basic and puristic variant that implements generic netplay for PC-98 games in general by effectively providing remote control of the emulated keyboard and joypad. The emulator will be unaware of the game, and the game will be unaware of being netplayed, which makes this solution particularly interesting for the non-Touhou PC-98 scene, or competitive players who absolutely insist on using ZUN's original binaries and won't trust any of my modded game builds.
Applied to TH03, this means that players would select the regular hot-seat 1P vs 2P mode and then initiate a match through a new menu in the emulator UI. The same UI must then provide an option to manually remap incoming key and button presses to the 2P controls (newly introducing remapping to the emulator if necessary), as well as blocking any non-2P keys. The host then sends an initial savestate to the guest to ensure an identical starting state, and starts synchronizing and rolling back inputs at VSync boundaries.
This generic nature means that we don't get to include any of the TH03-specific rollback optimizations mentioned above, leading to the highest CPU and memory requirements out of all the variants. It sure is the easiest to implement though, as we get to freely use modern C++ WebRTC libraries that are designed to work with the network stack of the underlying OS.
I can try to build this netcode as a generic library that can work with any PC-98 emulator, but it would ultimately be up to the respective upstream developers to integrate it into official releases. Therefore, expect this variant to require separate funding and custom builds for each individual emulator codebase that we'd like to support.
Emulator-level netcode with optional game integration
Takes the generic netcode developed in 1) and adds the possibility for the game to control it via a special interrupt API. This enables several improvements:
Online matches could be initiated through new options in TH03's main menu rather than the emulator's UI.
The game could communicate the memory region that should be backed up every frame, cutting down memory usage as described above.
The exchanged input data could use the game's internal format instead of keyboard or joypad inputs. This removes the need for key remapping at the emulator level and naturally prevents the inherent issue of remote control where players could mess with each other's controls.
The game could be aware of the rollbacks, allowing it to jump over its rendering code while processing the queue of remote inputs and thus gain some performance as explained above.
The game could add synchronization points that block gameplay until both players have reached them, preventing the rollback queue from growing infinitely. This solves the issue of 1) not having any inherent way of working around desyncs and the resulting growth of the rollback queue. As an example, if one of the two emulators in 1) took, say, 2 seconds longer to load the game due to a random CPU spike caused by some bloatware on their system, the two players would be out of sync by 2 seconds for the rest of the session, forcing the faster system to render 113 frames every time an input prediction turned out to be incorrect.
Good places for synchronization points include the beginning of each round, the WARNING!! You are forced to evade / Your life is in peril popups that pause the game for a few frames anyway, and whenever the game is paused via the ESC key.
During such pauses, the game could then also block the resuming ESC key of the player who didn't pause the game.
Edit (2024-04-30): Emulated serial port communicating over named pipes with a standalone netplay tool
This approach would take the netcode developed in 2) out of the emulator and into a separate application running on the (modern) host OS, just like Ju.N.Owen or Adonis. The previous interrupt API would then be turned into binary protocol communicated over the PC-98's serial port, while the rollback snapshots would be stored inside the emulated PC-98 in EMS or XMS/Protected Mode memory. Netplay data would then move through these stages:
Sending serial port data over named pipes is only a semi-common feature in PC-98 emulators, and would currently restrict netplay to Neko Project 21/W and NP2kai on Windows. This is a pretty clean and generally useful feature to have in an emulator though, and emulator maintainers will be much more likely to include this than the custom netplay code I proposed in 1) and 2). DOSBox-X has an open issue that we could help implement, and the NP2kai Linux port would probably also appreciate a mkfifo(3) implementation.
This could even work with emulators that only implement PC-98 serial ports in terms of, well, native Windows serial ports. This group currently includes Neko Project II fmgen, SL9821, T98-Next, and rare bundles of Anex86 that replace MIDI support with COM port emulation. These would require separately installed and configured virtual serial port software in place of the named pipe connection, as well as support for actual serial ports in the netplay tool itself. In fact, this is the only way that die-hard Anex86 and T98-Next fans could enjoy any kind of netplay on these two ancient emulators.
If it works though, it's the optimal solution for the emulated use case if we don't want to fork the emulator. From the point of view of the PC-98, the serial port is the cheapest way to send a couple of bytes to some external thing, and named pipes are one of many native ways for two Windows/Linux applications to efficiently communicate.
The only slight drawback of this approach is the expected high DOS memory requirement for rollback. Unless we find a way to really compress game state snapshots to just a few KB, this approach will require a more modern DOS setup with EMS/XMS support instead of the pre-installed MS-DOS 3.30C on a certain widely circulated .HDI copy. But apart from that, all you'd need to do is run the separate netplay tool, pick the same pipe name in both the tool and the emulator, and you're good to go.
It could even work for real hardware, but would require the PC-98 to be linked to the separately running modern system via a null modem cable.
Native PC-98 Windows 9x netcode (only for real PC-98 hardware equipped with an Ethernet card)
Equivalent in features to 2), but pulls the netcode into the PC-98 system itself. The tool developed in 3) would then as a separate 32-bit or 16-bit Windows application that somehow communicates with the game running in a DOS window. The handful of real-hardware owners who have actually equipped their PC-98 with a network card such as the LGY-98 would then no longer require the modern PC from 3) as a bridge in the middle.
This specific card also happens to be low-level-emulated by the 21/W fork of Neko Project. However, it makes little sense to use this technique in an emulator when compared to 3), as NP21/W requires a separately installed and configured TAP driver to actually be able to access your native Windows Internet connection. While the setup is well-documented and I did manage to get a working Internet connection inside an emulated Windows 95, it's definitely not foolproof. Not to mention DOSBox-X, which currently emulates the apparently hardware-compatible NE2000 card, but disables its emulation in PC-98 mode, most likely because its I/O ports clash with the typical peripherals of a PC-98 system.
And that's not the end of the drawbacks:
Netplay would depend on the PC-98 versions of Windows 9x and its full network stack, nothing of which is required for the game itself.
Porting libdatachannel (and especially the required transport encryption) to Windows 95 will probably involve a bit of effort as well.
As would actually finding a way to access V86 mode memory from a 32-bit or 16-bit Windows process, particularly due to how isolated DOS processes are from the rest of the system and even each other. A quick investigation revealed three potential approaches:
A 32-bit process could read the memory out of the address space of the console host process (WINOA32.MOD). There seems to be no way of locating the specific base address of a DOS process, but you could always do a brute-force search through the memory map.
If started before Windows, TSRs will share their resident memory with both DOS and Win16 processes. The segment pointer would then be retrieved through a typical interrupt API.
Writing a VxD driver 😩
Correctly setting up TH03 to run within Windows 95 to begin with can be rather tricky. The GDC clock speed check needs to be either patched out or overridden using mode-setting tools, Windows needs to be blocked from accessing the FM chip, and even then, MAIN.EXE might still immediately crash during the first frame and leave all of VRAM corrupted:
A matchmaking server would be much more of a requirement than in any of the emulator variants. Players are unlikely to run their favorite chat client on the same PC-98 system, and the signaling codes are way too unwieldy to type them in manually. (Then again, IRC is always an option, and the people who would fund this variant are probably the exact same people who are already running IRC clients on their PC-98.)
Native PC-98 DOS netcode (only for real PC-98 hardware equipped with an Ethernet card)
Conceptually the same as 4), but going yet another level deeper, replacing the Windows 9x network stack with a DOS-based one. This might look even more intimidating and error-prone, but after I got pingand even Telnet working, I was pleasantly surprised at how much simpler it is when compared to the Windows variant. The whole stack consists of just one LGY-98 hardware information tool, a LGY-98 packet driver TSR, and a TSR that implements TCP/IP/UDP/DNS/ICMP and is configured with a plaintext file. I don't have any deep experience with these protocols, so I was quite surprised that you can implement all of them in a single 40 KiB binary. Installed as TSRs, the entire stack takes up an acceptable 82 KiB of conventional memory, leaving more than enough space for the game itself. And since both of the TSRs are open-source, we can even legally bundle them with the future modified game binaries.
The matchmaking issue from the Windows 9x approach remains though, along with the following issues:
Porting libdatachannel and the required transport encryption to the TEEN stack seems even more time-consuming than a Windows 95 port.
The TEEN stack has no UI for specifying the system's or gateway's IP addresses outside of its plaintext configuration file. This provides a nice opportunity for adding a new Internet settings menu with great error feedback to the game itself. Great for UX, but it's another thing I'd have to write.
As always, this is the premium option. If the entire game already runs as a standalone executable on a modern system, we can just put all the netcode into the same binary and have the most seamless integration possible.
That leaves us with these prerequisites:
1), by definition, needs nothing from ReC98, and I could theoretically start implementing it right now. If you're interested in funding it, just tell me via the usual Twitter or Discord channels.
2) through 5) require at least 100% RE of TH03's OP.EXE to facilitate the new menu code. Reverse-engineering all rendering-related code in MAIN.EXE would be nice for performance, but we don't strictly need all of it before we start. Re-simulated frames can just skip over the few pieces of rendering code we do know, and we can gradually increase the skipped area of code in future pushes.
100% PI won't be a requirement either, as I expect the MAIN.EXE part of the interfacing netcode layer to be thin enough that it can easily fit within the original game's code layout.
6), obviously, requires all of TH03 to be RE'd, decompiled, cleaned up, and ported to modern systems. Currently, TH03 appears to be the second-easiest game to port behind TH02:
Although TH03 already has more needlessly micro-optimized ASM code than TH02 and there's even more to come, it still appears to have way less than TH04 or TH05.
Its game logic and rendering code seem to be somewhat neatly separated from each other, unlike TH01 which deeply intertwines them.
Its graphics seem free of obvious bugs, unlike – again — the flicker-fest that is TH01.
But still, it's the game with the least amount of RE%. Decompilation might get easier once I've worked myself up to the higher levels of game code, and even more so if we're lucky and all of the 9 characters are coded in a similar way, but I can't promise anything at this point.
Once we've reached any of these prerequisites, I'll set up a separate campaign funding method that runs parallel to the cap. As netplay is one of those big features where incremental progress makes little sense and we can expect wide community support for the idea, I'll go for a more classic crowdfunding model with a fixed goal for the minimum feature set and stretch goals for optional quality-of-life features. Since I've still got two other big projects waiting to be finished, I'd like to at least complete the Shuusou Gyoku Linux port before I start working on TH03 netplay, even if we manage to hit any of the funding goals before that.
For the first time in a long while, the actual content of this push can be listed fairly quickly. I've now RE'd:
conversions from playfield-relative coordinates to screen coordinates and back (a first in PC-98 Touhou; even TH02 uses screen space for every coordinate I've seen so far),
the low-level code that moves the player entity across the screen,
a copy of the per-round frame counter that, for some reason, resets to 0 at the start of the Win/Lose animation, resetting a bunch of animations with it,
a global hitbox with one variable that sometimes stores the center of an entity, and sometimes its top-left corner,
and the 48×48 hit circles from EN2.PI.
It's also the third TH03 gameplay push in a row that features inappropriate ASM code in places that really, really didn't need any. As usual, the code is worse than what Turbo C++ 4.0J would generate for idiomatic C code, and the surrounding code remains full of untapped and quick optimization opportunities anyway. This time, the biggest joke is the sprite offset calculation in the hit circle rendering code:
But while we've all come to expect the usual share of ZUN bloat by now, this is also the first push without either a ZUN bug or a landmine since I started using these terms! 🎉 It does contain a single ZUN quirk though, which can also be found in the hit circles. This animation comes in two types with different caps: 12 animation slots across both playfields for the enemy circles shown in alternating bright/dark yellow colors, whereas the white animation for the player characters has a cap of… 1? P2 takes precedence over P1 because its update code always runs last, which explains what happens when both players get hit within the 16 frames of the animation:
SPRITE16 uses the PC-98's EGC to draw these single-color sprites. If the EGC is already set up, it can be set into a GRCG-equivalent RMW mode using the pattern/read plane register (0x4A2) and foreground color register (0x4A6), together with setting the mode register (0x4A4) to 0x0CAC. Unlike the typical blitting operations that involve its 16-dot pattern register, the EGC even supports 8- or 32-bit writes in this mode, just like the GRCG. 📝 As expected for EGC features beyond the most ordinary ones though, T98-Next simply sets every written pixel to black on a 32-bit write. Comparing the actual performance of such writes to the GRCG would be 📝 yet another interesting question to benchmark.
Next up: I think it's time for ReC98's build system to reach its final form.
For almost 5 years, I've been using an unreleased sane build system on a parallel private branch that was just missing some final polish and bugfixes. Meanwhile, the public repo is still using the project's initial Makefile that, 📝 as typical for Makefiles, is so unreliable that BUILD16B.BAT force-rebuilds everything by default anyway. While my build system has scaled decently over the years, something even better happened in the meantime: MS-DOS Player, a DOS emulator exclusively meant for seamless integration of CLI programs into the Windows console, has been forked and enhanced enough to finally run Turbo C++ 4.0J at an acceptable speed. So let's remove DOSBox from the equation, merge the 32-bit and 16-bit build steps into a single 32-bit one, set all of this up in a user-friendly way, and maybe squeeze even more performance out of MS-DOS Player specifically for this use case.
That was quick: In a surprising turn of events, Romantique Tp themselves came in just one day after the last blog post went up, updated me with their current and much more positive opinion on Sound Canvas VA, and confirmed that real SC-88Pro hardware clamps invalid Reverb Macro values to the specified range. I promised to release a new Sound Canvas VA BGM pack for free once I knew the exact behavior of real hardware, so let's go right back to Seihou and also integrate the necessary SysEx patches into the game's MIDI player behind a toggle. This would also be a great occasion to quickly incorporate some long overdue code maintenance and build system improvements, and a migration to C++ modules in particular. When I started the Shuusou Gyoku Linux port a year ago, the combination of modules and <windows.h> threw lots of weird errors and even crashed the Visual Studio compiler. But nowadays, Microsoft even uses modules in the Office code base. This must mean that these issues are fixed by now, right?
Well, there's still a bug that causes the modularized C++ standard library to be basically unusable in combination with the static analyzer, and somehow, I was the first one to report it. So it's 3½ years after C++20 was finalized, and somehow, modules are still a bleeding-edge feature and a second-class citizen in even the compiler that supports them the best. I want fast compile times already! 😕
Thankfully, Microsoft agrees that this is a bug, and will work on it at some point. While we're waiting, let's return to the original plan of decompiling the endings of the one PC-98 Touhou game that still needed them decompiled.
After the textless slideshows of TH01, TH02 was the first Touhou game to feature lore text in its endings. Given that this game stores its 📝 in-game dialog text in fixed-size plaintext files, you wouldn't expect anything more fancy for the endings either, so it's not surprising to see that the END?.TXT files use the same concept, with 44 visible bytes per line followed by two bytes of padding for the CR/LF newline sequence. Each of these lines is typed to the screen in full, with all whitespace and a fixed time for each 2-byte chunk.
As a result, everything surrounding the text is just as hardcoded as TH01's endings were, which once again opens up the possibility of freely integrating all sorts of creative animations without the overhead of an interpreter. Sadly, TH02 only makes use of this freedom in a mere two cases: the picture scrolling effect from Reimu's head to Marisa's head in the Bad Endings, and a single hardware palette change in the Good Endings.
Hardcoding also still made sense for this game because of how the ending text is structured. The Good and Bad Endings for the individual shot types respectively share 55% and 77% of their text, and both only diverge after the first 27 lines. In straight-line procedural code, this translates to one branch for each shot type at a single point, neatly matching the high-level structure of these endings.
But that's the end of the positive or neutral aspects I can find in these scripts. The worst part, by far, is ZUN's approach to displaying the text in alternating colors, and how it impacts the entire structure of the code.
The simplest solution would have involved a hardcoded array with the color of each line, just like how the in-game dialogs store the face IDs for each text box. But for whatever reason, ZUN did not apply this piece of wisdom to the endings and instead hardcoded these color changes by… mutating a global variable before calling the text typing function for every individual line. This approach ruins any possibility of compressing the script code into loops. While ZUN did use loops, all of them are very short because they can only last until the next color change. In the end, the code contains 90 explicitly spelled-out calls to the 5-parameter line typing function that only vary in the pointer to each line and in the slower speed used for the one or two final lines of each ending. As usual, I've deduplicated the code in the ReC98 repository down to a sensible level, but here's the full inlined and macro-expanded horror:
It's highly likely that this is what ZUN hacked into his PC-98 and was staring at back in 1997.
All this redundancy bloats the two script functions for the 6 endings to a whopping 3,344 bytes inside TH02's MAINE.EXE. In particular, the single function that covers the three Good Endings ends up with a total of 631 x86 ASM instructions, making it the single largest function in TH02 and the 7th longest function in all of PC-98 Touhou. If the 📝 single-executable build for TH02's debloated and anniversary branches ends up needing a few more KB to reduce its size below the original MAIN.EXE, there are lots of opportunities to compress it all.
The ending text can also be fast-forwarded by holding any key. As we've come to expect for this sort of ZUN code, the text typing function runs its own rendering loop with VSync delays and input detection, which means that we 📝 once📝 again have to talk about the infamous quirk of the PC-98 keyboard controller in relation to held keys. We've still got 54 not yet decompiled calls to input detection functions left in this codebase, are you excited yet?!
Holding any key speeds up the text of all ending lines before the last one by displaying two kana/kanji instead of one per rendered frame and reducing the delay between the rendered frames to 1/3 of its regular length. In pseudocode:
for(i = 0; i < number_of_2_byte_chunks_on_displayed_line; i++) {
input = convert_current_pc98_bios_input_state_to_game_specific_bitflags();
add_chunk_to_internal_text_buffer(i);
blit_internal_text_buffer_from_the_beginning();
if(input == INPUT_NONE) {
// Basic case, no key pressed
frame_delay(frames_per_chunk);
} else if((i % 2) == 1) {
// Key pressed, chunk number is odd.
frame_delay(frames_per_chunk / 3);
} else {
// Key pressed, chunk number is even.
// No delay; next iteration adds to the same frame.
}
}
This is exactly the kind of code you would write if you wanted to deliberately maximize the impact of this hardware quirk. If the game happens to read the current input state right after a key up scancode for the last previously held and game-relevant key, it will then wrongly take the branch that uninterruptibly waits for the regular, non-divided amount of VSync interrupts. In my tests, this broke the rhythm of the fast-forwarded text about once per line. Note how this branch can also be taken on an even chunk: Rendering glyphs straight from font ROM to VRAM is not exactly cheap, and if each iteration (needlessly) blits one more full-width glyph than the last one, the probability of a key up scancode arriving in the middle of a frame only increases.
The fact that TH02 allows any of the supported input keys to be held points to another detail of this quirk I haven't mentioned so far. If you press multiple keys at once, the PC-98's keyboard controller only sends the periodic key up scancodes as long as you are holding the last key you pressed. Because the controller only remembers this last key, pressing and releasing any other key would get rid of these scancodes for all keys you are still holding.
As usual, this ZUN bug only occurs on real hardware and with DOSBox-X's correct emulation of the PC-98 keyboard controller.
After the ending, we get to witness the most seamless transition between ending and Staff Roll in any Touhou game as the BGM immediately changes to the Staff Roll theme, and the ending picture is shifted into the same place where the Staff Roll pictures will appear. Except that the code misses the exact position by four pixels, and cuts off another four pixels at the right edge of the picture:
Also, note the green 1-pixel line at the right edge of this specific picture. This is a bug in the .PI file where the picture is indeed shifted one pixel to the left.
What follows is a comparatively large amount of unused content for a single scene. It starts right at the end of this underappreciated 11-frame animation loaded from ENDFT.BFT:
Wastefully using the 4bpp BFNT format. The single frame at the end of the animation is unused; while it might look identical to the ZUN glyphs later on in the Staff Roll, that's only because both are independently rendered boldfaced versions of the same font ROM glyphs. Then again, it does prove that ZUN created this animation on a PC-98 model made by NEC, as the Epson clones used a font ROM with a distinctly different look.
TH02's Staff Roll is also unique for the pre-made screenshots of all 5 stages that get shown together with a fancy rotating rectangle animation while the Staff Roll progresses in sync with the BGM. The first interesting detail shows up immediately after the first image, where the code jumps over one of the 320×200 quarters in ED06.PI, leaving the screenshot of the Stage 2 midboss unused.
All of the cutscenes in PC-98 Touhou store their pictures as 320×200 quarters within a single 640×400 .PI file. Anywhere else, all four quarters are supposed to be displayed with the same palette specified in the .PI header, but TH02's Staff Roll screenshots are also unique in how all quarters beyond the top-left one require palettes loaded from external .RGB files to look right. Consequently, the game doesn't clearly specify the intended palette of this unused screenshot, and leaves two possibilities:
The unused second 320×200 quarter of TH02's ED06.PI, displayed in the Stage 2 color palette used in-game.
The unused second 320×200 quarter of TH02's ED06.PI, displayed in the palette specified in the .PI header. These are the colors you'd see when looking at the file in a .PI viewer, when converting it into another format with the usual tools, or in sprite rips that don't take TH02's hardcoded palette changes into account. These colors are only intended for the Stage 1 screenshot in the top-left quarter of the file.
The unused second 320×200 quarter of TH02's ED06.PI, displayed in the palette from ED06B.RGB, which the game uses for the following screenshot of the Meira fight. As it's from the same stage, it almost matches the in-game colors seen in 1️⃣, and only differs in the white color (#FFF) being slightly red-tinted (#FCC).
It might seem obvious that the Stage 2 palette in 1️⃣ is the correct one, but ZUN indeed uses ED06B.RGB with the red-tinted white color for the following screenshot of the Meira fight. Not only does this palette not match Meira's in-game appearance, but it also discolors the rectangle animation and the surrounding Staff Roll text:
Also, that tearing on frame #1 is not a recording artifact, but the expected result of yet another VSync-related landmine. 💣 This time, it's caused by the combination of 1) the entire sequence from the ending to the verdict screen being single-buffered, and 2) this animation always running immediately after an expensive operation (640×400 .PI image loading and blitting to VRAM, 320×200 VRAM inter-page copy, or hardware palette loading from a packed file), without waiting for the VSync interrupt. This makes it highly likely for the first frame of this animation to start rendering at a point where the (real or emulated) electron beam has already traveled over a significant portion of the screen.
But when I went into Stage 2 to compare these colors to the in-game palette, I found something even more curious. ZUN obviously made this screenshot with the Reimu-C shot type, but one of the shot sprites looks slightly different from how it does in-game. These screenshots must have been made earlier in development when the sprite didn't yet feature the second ring at the top. The same applies to the Stage 4 screenshot later on:
Finally, the rotating rectangle animation delivers one more minor rendering bug. Each of the 20 frames removes the largest and outermost rectangle from VRAM by redrawing it in the same black color of the background before drawing the remaining rectangles on top. The corners of these rectangles are placed on a shrinking circle that starts with a radius of 256 pixels and is centered at (192, 200), which results in a maximum possible X coordinate of 448 for the rightmost corner of the rectangle. However, the Staff Roll text starts at an X coordinate of 416, causing the first two full-width glyphs to still fall within the area of the circle. Each line of text is also only rendered once before the animation. So if any of the rectangles then happens to be placed at an angle that causes its edges to overlap the text, its removal will cut small holes of black pixels into the glyphs:
The green dotted circle corresponds to the newest/smallest rectangle. Note how ZUN only happened to avoid the holes for the two final animations by choosing an initial angle and angular velocity that causes the resulting rectangles to just barely avoid touching the TEST PLAYER glyphs.
At least the following verdict screen manages to have no bugs aside from the slightly imperfect centering of its table values, and only comes with a small amount of additional bloat. Let's get right to the mapping from skill points to the 12 title strings from END3.TXT, because one of them is not like the others:
Skill
Title
≥100
神を超えた巫女!!
90 - 99
もはや神の領域!!
80 - 99
A級シューター!!
78 - 79
うきうきゲーマー!
77
バニラはーもにー!
70 - 76
うきうきゲーマー!
60 - 69
どきどきゲーマー!
50 - 59
要練習ゲーマー
40 - 49
非ゲーマー級
30 - 39
ちょっとだめ
20 - 29
非人間級
10 - 19
人間でない何か
≤9
死んでいいよ、いやいやまじで
Looks like I'm the first one to document the required skill points as well? Everyoneelse just copy-pastes END3.TXT without providing context.
So how would you get exactly 77 and achieve vanilla harmony? Here's the formula:
* Ranges from 0 (Easy) to 3 (Lunatic). † Across all 5 stages.
With Easy Mode capping out at 85, this is possible on every difficulty, although it requires increasingly perfect play the lower you go. Reaching 77 on purpose, however, pretty much demands a careful route through the entire game, as every collected and missed item will influence the item_skill in some way. This almost feels it's like the ultimate challenge that this game has to offer. Looking forward to the first Vanilla Harmony% run!
And with that, TH02's MAINE.EXE is both fully position-independent and ready for translation. There's a tiny bit of undecompiled bit of code left in the binary, but I'll leave that for rounding up a future TH02 decompilation push.
With one of the game's skill-based formulas decompiled, it's fitting to round out the second push with the other two. The in-game bonus tables at the end of a stage also have labels that we'd eventually like to translate, after all.
The bonus formula for the 4 regular stages is also the first place where we encounter TH02's rank value, as well as the only instance in PC-98 Touhou where the game actually displays a rank-derived value to the player. KirbyComment and Colin Douglas Howell accurately documented the rank mechanics over at Touhou Wiki two years ago, which helped quite a bit as rank would have been slightly out of scope for these two pushes. 📝 Similar to TH01, TH02's rank value only affects bullet speed, but the exact details of how rank is factored in will have to wait until RE progress arrives at this game's bullet system.
These bonuses are calculated by taking a sum of various gameplay metrics and multiplying it with the amount of point items collected during the stage. In the 4 regular stages, the sum consists of:
難易度
Difficulty level* × 2,000
ステージ
(Rank + 16) × 200
ボム
max((2,500 - (Bombs used* × 500)), 0)
ミス
max((3,000 - (Lives lost* × 1,000)), 0)
靈撃初期数
(4 - Starting bombs) × 800
靈夢初期数
(5 - Starting lives) × 1,000
* Within this stage, across all continues.
Yup, 封魔録.TXT does indeed document this correctly.
As rank can range from -6 to +4 on Easy and +16 on the other difficulties, this sum can range between:
Easy
Normal
Hard
Lunatic
Minimum
2,800
4,800
6,800
8,800
Maximum
16,700
21,100
23,100
25,100
The sum for the Extra Stage is not documented in 封魔録.TXT:
クリア
10,000
ミス回数
max((20,000 - (Lives lost × 4,000)), 0)
ボム回数
max((20,000 - (Bombs used × 4,000)), 0)
クリアタイム
⌊max((20,000 - Boss fight frames*), 0) ÷ 10⌋ × 10
* Amount of frames spent fighting Evil Eye Σ, counted from the end of the pre-boss dialog until the start of the defeat animation.
And that's two pushes packed full of the most bloated and copy-pasted code that's unique to TH02! So bloated, in fact, that TH02 RE as a whole jumped by almost 7%, which in turn finally pushed overall RE% over the 60% mark. 🎉 It's been a while since we hit a similar milestone; 50% overall RE happened almost 2 years ago during 📝 P0204, a month before I completed the TH01 decompilation.
Next up: Continuing to wait for Microsoft to fix the static analyzer bug until May at the latest, and working towards the newly popular dreams of TH03 netplay by looking at some of its foundational gameplay code.
Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.
But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!
As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.
The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:
Let's call this "the blank image".
That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right?
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:
The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.
On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.
Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:
// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);
// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);
// Regular GRCG blitting code to any VRAM segment…
pokeb(0xA8000, offset, …);
// We're done, turn off the GRCG.
outportb(0x7C, 0x00);
This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck.
An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.
As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:
Colors 0 or 1 can't be used, because those don't include any of the bits that can stay constant between frames.
If the lowest bit of a palette color index has no effect on the displayed color, text drawn in either of the two colors won't be visually affected by the polygon animation and will always appear on top. TH04 and TH05 rely on this property with their colors 2/3, 4/5, and 6/7 being identical, but this would work in TH02 and TH03 as well.
But this doesn't apply to TH02 and TH03's palettes, so how do they do it? The secret: They simply include all text pixels in nopoly_B. This allows text to use any color with an odd palette index – the lowest bit then won't be affected by the polygons ORed into the first bitplane, and the other bitplanes remain unchanged.
TH04 is a curious case. Ostensibly, it seems to remove support for odd text colors, probably because the new 10-frame fade-in animation on the comment text would require at least the comment area in VRAM to be captured into nopoly_B on every one of the 10 frames. However, the initial pixels of the tracklist are still included in nopoly_B, which would allow those to still use any odd color in this game. ZUN only removed those from nopoly_B in TH05, where it had to be changed because that game lets you scroll and browse through multiple tracklists.
The contents of nopoly_B with each game's first track selected.
Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:
Due to the polygon animation, the Music Room is one of the few double-buffered menus in PC-98 Touhou, rendering to both VRAM pages on alternate frames instead of using the other page to store a background image. Unfortunately though, this doesn't actually translate to tearing-free rendering because ZUN's initial implementation for TH02 mixed up the order of the required operations. You're supposed to first wait for the GDC's VSync interrupt and then, within the display's vertical blanking interval, write to the relevant I/O ports to flip the accessed and shown pages. Doing it the other way around and flipping as soon as you're finished with the last draw call of a frame means that you'll very likely hit a point where the (real or emulated) electron beam is still traveling across the screen. This ensures that there will be a tearing line somewhere on the screen on all but the fastest PC-98 models that can render an entire frame of the Music Room completely within the vertical blanking interval, causing the very issue that double-buffering was supposed to prevent.
ZUN only fixed this landmine in TH05.
The polygons have a fixed vertex count and radius depending on their index, everything else is randomized. They are also never reinitialized while OP.EXE is running – if you leave the Music Room and reenter it, they will continue animating from the same position.
TH02 and TH04 don't handle it at all, causing held keys to be processed again after about a second.
TH03 and TH05 correctly work around the quirk, at the usual cost of a 614.4 µs delay per frame. Except that the delay is actually twice as long in frames in which a previously held key is released, because this code is a mess.
But even in 2024, DOSBox-X is the only emulator that actually replicates this detail of real hardware. On anything else, keyboard input will behave as ZUN intended it to. At least I've now mentioned this once for every game, and can just link back to this blog post for the other menus we still have to go through, in case their game-specific behavior matches this one.
TH02 is the only game that
separately lists the stage and boss themes of the main game, rather than following the in-game order of appearance,
continues playing the selected track when leaving the Music Room,
always loads both MIDI and PMD versions, regardless of the currently selected mode, and
does not stop the currently playing track before loading the new one into the PMD and MMD drivers.
The combination of 2) and 3) allows you to leave the Music Room and change the music mode in the Option menu to listen to the same track in the other version, without the game changing back to the title screen theme. 4), however, might cause the PMD and MMD drivers to play garbage for a short while if the music data is loaded from a slow storage device that takes longer than a single period of the OPN timer to fill the driver's song buffer. Probably not worth mentioning anymore though, now that people no longer try fitting PC-98 Touhou games on floppy disks.
Exactly 40 (TH02/TH03) / 38 (TH04/TH05) visible bytes per line,
padded with 2 bytes that can hold a CR/LF newline sequence for easier editing.
Every track starts with a title line that mostly just duplicates the names from the hardcoded tracklist,
followed by a fixed 19 (TH02/TH03/TH04) / 9 (TH05) comment lines.
In TH04 and TH05, lines can start with a semicolon (;) to prevent them from being rendered. This is purely a performance hint, and is visually equivalent to filling the line with spaces.
All in all, the quality of the code is even slightly below the already poor standard for PC-98 Touhou: More VRAM page copies than necessary, conditional logic that is nested way too deeply, a distinct avoidance of state in favor of loops within loops, and – of course – a couple of gotos to jump around as needed.
In TH05, this gets so bad with the scrolling and game-changing tracklist that it all gives birth to a wonderfully obscure inconsistency: When pressing both ⬆️/⬇️ and ⬅️/➡️ at the same time, the game first processes the vertical input and then the horizontal one in the next frame, making it appear as if the latter takes precedence. Except when the cursor is highlighting the first (⬆️ ) or 12th (⬇️ ) element of the list, and said list element is not the first track (⬆️ ) or the quit option (⬇️ ), in which case the horizontal input is ignored.
And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier…
For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, Try to Normal Rank!! could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".
With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌
The remaining 1/6th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.
Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there!
Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.
This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!
These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just 1/6th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.
Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.
And once again, the Shuusou Gyoku task was too complex to be satisfyingly solved within a single month. Even just finding provably correct loop sections in both the original and arranged MIDI files required some rather involved detection algorithms. I could have just defined what sounded like correct loops, but the results of these algorithms were quite surprising indeed. Turns out that not even Seihou is safe from ZUN quirks, and some tracks technically loop much later than you'd think they do, or don't loop at all. And since I then wanted to put these MIDI loops back into the game to ensure perfect synchronization between the recordings and MIDI versions, I ended up rewriting basically all the MIDI code in a cross-platform way. This rewrite also uncovered a pbg bug that has traveled from Shuusou Gyoku into Windows Touhou, where it survived until ZUN ultimately removed all MIDI code in TH11 (!)…
Fortunately, the backlog still had enough general PC-98 Touhou funds that I could spend on picking some soon-important low-hanging fruit, giving me something to deliver for the end of the month after all. TH04 and TH05 use almost identical code for their main/option menus, so decompiling it would make number go up quite significantly and the associated blog post won't be that long…
Wait, what's this, a bug report from touhou-memories concerning the website?
Tab switchers tended to break on certain Firefox versions, and
video playback didn't work on Microsoft Edge at all?
Those are definitely some high-priority bugs that demand immediate attention.
The tab switcher issue was easily fixed by replacing the previous z-index trickery with a more robust solution involving the hidden attribute. The second one, however, is much more aggravating, because video playback on Edge has been broken ever since I 📝 switched the preferred video codec to AV1.
This goes so far beyond not supporting a specific codec. Usually, unsupported codecs aren't supposed to be an issue: As soon as you start using the HTML <video> tag, you'll learn that not every browser supports all codecs. And so you set up an encoding pipeline to serve each video in a mix of new and ancient formats, put the <source> tag of the most preferred codec first, and rest assured that browsers will fall back on the best-supported option as necessary. Except that Edge doesn't even try, and insists on staying on a non-playing AV1 video. 🙄
The codecs parameter for the <source> type attribute was the first potential solution I came across. Specifying the video codec down to the finest encoding details right in the HTML markup sounds like a good idea, similar to specifying sizes of images and videos to prevent layout reflows on long pages during the initial page load. So why was this the first time I heard of this feature? The fact that there isn't a simple ffprobe -show_html_codecs_string command to retrieve this string might already give a clue about how useful it is in practice. Instead, you have to manually piece the string together by grepping your way through all of a video's metadata…
…and then it still doesn't change anything about Edge's behavior, even when also specifying the string for the VP9 and VP8 sources. Calling the infamously ridiculous HTMLMediaElement.canPlayType() method with a representative parameter of "video/webm; codecs=av01.1.04M.08.0.000.01.13.00.0" explains why: Both the AV1-supporting Chrome and Edge return "probably", but only the former can actually play this format. 🤦
But wait, there is an AV1 video extension in the Microsoft Store that would add support to any unspecified favorite video app. Except that it stopped working inside Edge as of version 116. And even if it did: If you can't query the presence of this extension via JavaScript, it might as well not exist at all.
Not to mention that the favorite video app part is obviously a lie as a lot of widely preferred Windows video apps are bundled with their own codecs, and have probably long supported AV1.
In the end, there's no way around the utter desperation move of removing the AV1 <source> for Edge users. Serving each video in two other formats means that we can at least do something here – try visiting the GitHub release page of the P0234-1 TH01 Anniversary Edition build in Edge and you also don't get to see anything, because that video uses AV1 and GitHub understandably doesn't re-encode every uploaded video into a variety of old formats.
Just for comparison, I tried both that page and the ReC98 blog on an old Android 6 phone from 2014, and even that phone picked and played the AV1 videos with the latest available Chrome and Firefox versions. This was the phone whose available Firefox version didn't support VP9 in 2019, which was my initial reason for adding the VP8 versions. Looks like it's finally time to drop those… 🤔 Maybe in the far future once I start running out of space on this server.
Removing the <source> tags can be done in one of two places:
server-side, detecting Edge via the User-Agent header, or
I went with 2) because more dynamic server-side code would only move us further away from static site generation, which would make a lot of sense as the next evolutionary step in the architecture of this website. The client-side solution is much simpler too, and we can defer the deletion until a user actually hovers over a specific video.
And while we're at it, let's also add a popup complaining about this whole state of affairs. Edge is heavily marketed inside Windows as "the modern browser recommended by Microsoft", and you sure wouldn't expect low-quality chroma-subsampled VP9 from such a tagline. With such a level of anti-support for AV1, Edge users deserve to know exactly what's going on, especially since this post also explains what they will encounter on other websites.
That's the polite way of putting it.
Alright, where was I? For TH01, the main menu was the last thing I decompiled before the 100% finalization mark, so it's rather anticlimactic to already cover the TH04/TH05 one now, with both of the games still being very far away from 100%, just because people will soon want to translate the description text in the bottom-right corner of the screen. But then again, the ZUN Soft logo animation would make for an even nicer final piece of decompiled code, especially since the bouncing-ball logo from TH01, TH02, and TH03 was the very first decompilation I did, all the way back in 2015.
The code quality of ZUN's VRAM-based menus has barely increased between TH01 and TH05. Both the top-level and option menu still need to know the bounding rectangle of the other one to unblit the right pixels when switching between the two. And since ZUN sure loved hardcoded and copy-pasted numbers in the PC-98 days, the coordinates both tend to be excessively large, and excessively wrong. Luckily, each menu item comes with its own correct unblitting rectangle, which avoids any graphical glitches that would otherwise occur.
As for actual observable quirks and bugs, these menus only contain one of each, and both are exclusive to TH04:
Quitting out of the Music Room moves the cursor to the Start option. In TH05, it stays on Music Room.
Changing the S.E. mode seems to do nothing within TH04's menus, and would only take effect if you also change the Music mode afterward, or launch into the game.
And yes, these videos do have a frame rate of 2 FPS.
Now that 100% finalization of their OP.EXE binaries is within reach, all this bloat made me think about the viability of a 📝 single-executable build for TH04's and TH05's debloated and anniversary versions. It would be really nice to have such a build ready before I start working on the non-ASCII translations – not just because they will be based on the anniversary branch by default, but also because it would significantly help their development if there are 4 fewer executables to worry about.
However, it's not as simple for these games as it was for TH01. The unique code in their OP.EXE and MAINE.EXE binaries is much larger than Borland's easily removed C++ exception handler, so I'd have to remove a lot more bloat to keep the resulting single binary at or below the size of the original MAIN.EXE. But I'm sure going to try.
Speaking of code that can be debloated for great effect: The second push of this delivery focused on the first-launch sound setup menu, whose BGM and sound effect submenus are almost complete code duplicates of each other. The debloated branch could easily remove more than half of the code in there, yielding another ≈800 bytes in case we need them.
If hex-editing MIKO.CFG is more convenient for you than deleting that file, you can set its first byte to FF to re-trigger this menu. Decompiling this screen was not only relevant now because it contains text rendered with font ROM glyphs and it would help dig our way towards more important strings in the data segment, but also because of its visual style. I can imagine many potential mods that might want to use the same backgrounds and box graphics for their menus.
How about an initial language selection menu in the same style?
With the two submenus being shown in a fixed sequence, there's not a lot of room for the code to do anything wrong, and it's even more identical between the two games than the main menu already was. Thankfully, ZUN just reblits the respective options in the new color when moving the cursor, with no 📝 palette tricks. TH04's background image only uses 7 colors, so he could have easily reserved 3 colors for that. In exchange, the TH05 image gets to use the full 16 colors with no change to the code.
Rounding out this delivery, we also got TH05's rolling Yin-Yang Orb animation before the title screen… and it's just more bloat and landmines on a smaller scale that might be noticeable on slower PC-98 models. In total, there are three unnecessary inter-page copies of the entire VRAM that can easily insert lag frames, and two minor page-switching landmines that can potentially lead to tearing on the first frame of the roll or fade animation. Clearly, ZUN did not have smoothness or code quality in mind there, as evidenced by the fact that this animation simply displays 8 .PI files in sequence. But hey, a short animation like this is 📝 another perfectly appropriate place for a quick-and-dirty solution if you develop with a deadline.
And that's 1.30% of all PC-98 Touhou code finalized in two pushes! We're slowly running out of these big shared pieces of ASM code…
I've been neglecting TH03's OP.EXE quite a bit since it simply doesn't contain any translatable plaintext outside the Music Room. All menu labels are gaiji, and even the character selection menu displays its monochrome character names using the 4-plane sprites from CHNAME.BFT. Splitting off half of its data into a separate .ASM file was more akin to getting out a jackhammer to free up the room in front of the third remaining Music Room, but now we're there, and I can decompile all three of them in a natural way, with all referenced data.
Next up, therefore: Doing just that, securing another important piece of text for the upcoming non-ASCII translations and delivering another big piece of easily finalized code. I'm going to work full-time on ReC98 for almost all of December, and delivering that and the Shuusou Gyoku SC-88Pro recording BGM back-to-back should free up about half of the slightly higher cap for this month.
And we're back to PC-98 Touhou for a brief interruption of the ongoing Shuusou Gyoku Linux port.
Let's clear some of the Touhou-related progress from the backlog, and use
the unconstrained nature of these contributions to prepare the
📝 upcoming non-ASCII translations commissioned by Touhou Patch Center.
The current budget won't cover all of my ambitions, but it would at least be
nice if all text in these games was feasibly translatable by the time I
officially start working on that project.
At a little over 3 pushes, it might be surprising to see that this took
longer than the
📝 TH03/TH04/TH05 cutscene system. It's
obvious that TH02 started out with a different system for in-game dialog,
but while TH04 and TH05 look identical on the surface, they only
actually share 30% of their dialog code. So this felt more like decompiling
2.4 distinct systems, as opposed to one identical base with tons of
game-specific differences on top.
The table of contents was pretty popular last time around, so let's have
another one:
Let's start with the ones from TH04 and TH05, since they are not that
broken. For TH04, ZUN started out by copy-pasting the cutscene system,
causing the result to inherit many of the caveats I already described in the
cutscene blog post:
It's still a plaintext format geared exclusively toward full-width
Japanese text.
The parser still ignores all whitespace, forcing ASCII text into hacks
with unassigned Shift-JIS lead bytes outside the second byte of a 2-byte
chunk.
Commands are still preceded by a 0x5C byte, which renders
as either a \ or a ¥ depending on your font and
interpretation of Shift-JIS.
Command parameters are parsed in exactly the same way, with all the same
limits.
A lot of the same script commands are identical, including 7 of them
that were not used in TH04's original dialog scripts.
Then, however, he greatly simplified the system. Mainly, this was done by
moving text rendering from the PC-98 graphics chip to the text chip, which
avoids the need for any text-related unblitting code, but ZUN also added a
bunch of smaller changes:
The player must advance through every dialog box by releasing any held
keys and then pressing any key mapped to a game action. There are no
timeouts.
The delay for every 2 bytes of text was doubled to 2 frames, and can't
be overridden.
Instead of holding ESC to fast-forward, pressing any key
will immediately print the entire rest of a text box.
Dialogs run in their own single-buffered frame loop, interrupting the
rest of the game. The other VRAM page keeps the background pixels required
for unblitting the face images.
All script commands that affect the graphics layer are preceded by a
1-frame delay. ZUN most likely did this because of the single-buffered
nature, as it prevents tearing on the first frame by waiting for the CRT
beam to return to the top-left corner before changing any pixels.
Both boxes are intended to contain up to 30 half-width characters on
each of their up to 3 lines, but nothing in the code enforces these limits.
There is no support for automatic line breaks or starting new boxes.
While it would seem that TH05 has no issues with ASCII 0x20
spaces, the text as a whole is still blindly processed two bytes at a
time, and any commands can only appear at even byte positions within a
line. I dimmed the VRAM pixels to 25% of their original brightness to make the
text easier to read.
The same text backported to TH04, additionally demonstrating how that
game's dialog system inherited the whitespace skipping behavior of
TH03's cutscene system. Just like there, ASCII 0x20 spaces
only work at odd byte positions because the game treats them as the
trailing byte of a full-width Shift-JIS codepoint. I don't know how
large the budget for the upcoming non-ASCII translations will be, but
I'm going to fix this even in the very basic fully static variant.
I dimmed the VRAM pixels to 25% of their original brightness to make the
text easier to read.
TH05 then moved from TH04's plaintext scripts to the binary
.TX2 format while removing all the unused commands copy-pasted
from the cutscene system. Except for a
single additional command intended to clear a text box, TH05's dialog
system only supports a strict subset of the features of TH04's system.
This change also introduced the following differences compared to TH04:
The game now stores the dialog of all 4 playable characters in the same
file, with a (4 + 1)-word header that indicates the byte offset
and length of each character's script. This way, it can load only the one
script for the currently played character.
Since there is no need for whitespace in a binary format, you can now
use ASCII 0x20 spaces even as the first byte of a 2-byte text
chunk! 🥳
All command parameters are now mandatory.
Filenames are now passed directly by pointer to the respective game
function. Therefore, they now need to be null-terminated, but can in turn be
as long as
📝 the number of remaining bytes in the allocated dialog segment.
In practice though, the game still runs on DOS and shares its restriction of
8.3 filenames…
When starting a new dialog box, any existing text in the other box is
now colored blue.
Thanks to ZUN messing up the return values of the command-interpreting
switch function, you can effectively use only line break and gaiji commands in the middle of text. All other
commands do execute, but the interpreter then also treats their command byte
as a Shift-JIS lead byte and places it in text RAM together with whatever
other byte follows in the script.
This is why TH04 can and does put its \= commandsinto the boxes
started with the 0 or 1 commands, but TH05 has to
put its 0x02 commands before the equivalent 0x0D.
Writing the 0x02 byte to text RAM results in an character, which is simply the PC-98 font ROM's glyph for that
Shift-JIS codepoint. Also note how each face change is now
preceded by two frames of delay.
No problem in TH04. Note how the dialog also runs a bit faster – TH04
only adds the aforementioned one frame of delay to each face change, and
has fewer two-byte chunks of text to display overall.
For modding these files, you probably want to use TXDEF from
-Tom-'s MysticTK. It decodes these
files into a text representation, and its encoder then takes care of the
character-specific byte offsets in the 10-byte header. This text
representation simplifies the format a lot by avoiding all corner cases and
landmines you'd experience during hex-editing – most notably by interpreting
the box-starting 0x0D as a
command to show text that takes a string parameter, avoiding the broken
calls to script commands in the middle of text. However, you'd still have to
manually ensure an even number of bytes on every line of text.
In the entry function of TH05's dialog loop, we also encounter the hack that
is responsible for properly handling
📝 ZUN's hidden Extra Stage replay. Since the
dialog loop doesn't access the replay inputs but still requires key presses
to advance through the boxes, ZUN chose to just skip the dialog altogether in the
specific case of the Extra Stage replay being active, and replicated all
sprite management commands from the dialog script by just hardcoding
them.
And you know what? Not only do I not mind this hack, but I would have
preferred it over the actual dialog system! The aforementioned sprite
management commands effectively boil down to manual memory management,
deallocating all stage enemy and midboss sprites and thus ensuring that the
boss sprites end up at specific master.lib sprite IDs (patnums). The
hardcoded boss rendering function then expects these sprites to be available
at these exact IDs… which means that the otherwise hardcoded bosses can't
render properly without the dialog script running before them.
There is absolutely no excuse for the game to burden dialog scripts with
this functionality. Sure, delayed deallocation would allow them to blit
stage-specific sprites, but the original games don't do that; probably
because none of the two games feature an unblitting command. And even if
they did, it would have still been cleaner to expose the boss-specific
sprite setup as a single script command that can then also be called from
game code if the script didn't do so. Commands like these just are a recipe
for crashes, especially with parsers that expect fullwidth Shift-JIS
text and where misaligned ASCII text can easily cause these commands to be
skipped.
But then again, it does make for funny screenshot material if you
accidentally the deallocation and then see bosses being turned into stage
enemies:
Some of the more amusing consequences of not calling the
sprite-deallocating
\c /
0x04 command inside a dialog
script.
In the case of 4️⃣, the game then even crashes on this frame at the end
of the dialog, in a way that resembles the infamous
📝 TH04 crash before Stage 5 Yuuka if no EMS driver is loaded.
Both the stage- and boss-specific BFNT sprites are loaded into memory at
this point, leaving no room for the 256×256-pixel background image on
the size-limited master.lib heap.
With all the general details out of the way, here's the command reference:
0 1
0x00 0x01
Selects either the player character (0) or the boss (1) as the
currently speaking character, and moves the cursor to the beginning of
the text box. In TH04, this command also directly starts the new dialog
box, which is probably why it's not prefixed with a \ as it
only makes sense outside of text. TH05 requires a separate 0x0D command to do the
same.
\=1
0x02 0x!!
Replaces the face portrait of the currently active speaking
character with image #1 within her .CD2
file.
\=255
0x02 0xFF
Removes the face portrait from the currently active text box.
\l,filename
0x03 filename 0x00
Calls master.lib's super_entry_bfnt() function, which
loads sprites from a BFNT file to consecutive IDs starting at the
current patnum write cursor.
\c
0x04
Deallocates all stage-specific BFNT sprites (i.e., stage enemies and
midbosses), freeing up conventional RAM for the boss sprites and
ensuring that master.lib's patnum write cursor ends up at
128 /
180.
In TH05's Extra Stage, this command also replaces
📝 the sprites loaded from MIKO16.BFT with the ones from ST06_16.BFT.
\d
Deallocates all face portrait images.
The game automatically does this at the end of each dialog sequence.
However, ZUN wanted to load Stage 6 Yuuka's 76 KiB of additional
animations inside the script via \l, and would have once again
run up against the master.lib heap size limit without that extra free
memory.
\m,filename
0x05 filename 0x00
Stops the currently playing BGM, loads a new one from the given
file, and starts playback.
\m$
0x05 $ 0x00
Stops the currently playing BGM.
Note that TH05 interprets $ as a null-terminated filename as
well.
\m*
Restarts playback of the currently loaded BGM from the
beginning.
\b0,0,0
0x06 0x!!!!0x!!!!0x!!
Blits the master.lib patnum with the ID indicated by the third
parameter to the current VRAM page at the top-left screen position
indicated by the first two parameters.
\e0
Plays the sound effect with the given ID.
\t100
Sets palette brightness via master.lib's
palette_settone() to any value from 0 (fully black) to 200
(fully white). 100 corresponds to the palette's original colors.
\fo1
\fi1
Calls master.lib's palette_black_out() or
palette_black_in() to play a hardware palette fade
animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
\wo1
\wi1
0x09 0x!!
0x0A 0x!!
Calls master.lib's palette_white_out() or
palette_white_in() to play a hardware palette fade
animation from or to white, spending roughly 1 frame on each of the 16 fade steps. The
TH05 version of 0x09 also clears the text in both boxes
before the animation.
\n
0x0B
Starts a new line by resetting the X coordinate of the TRAM cursor
to the left edge of the text area and incrementing the Y coordinate.
The new line will always be the next one below the last one that was
properly started, regardless of whether the text previously wrapped to
the next TRAM row at the edge of the screen.
\g8
Plays a blocking 8-frame screen shake
animation. Copy-pasted from the cutscene parser, but actually used right
at the end of the dialog shown before TH04's Bad Ending.
\ga0
0x0C 0x!!
Shows the gaiji with the given ID from 0 to 255
at the current cursor position, ignoring the per-glyph delay.
\k0
Waits 0 frames (0 = forever) for any key
to be pressed before continuing script execution.
Takes the current dialog cursor as the top-left corner of a
240×48-pixel rectangle, and replaces all text RAM characters within that
rectangle with whitespace.
This is only used to clear the player character's text box before
Shinki's final いくよ‼ box. Shinki has two
consecutive text boxes in all 4 scripts here, and ZUN probably wanted to
clear the otherwise blue text to imply a dramatic pause before Shinki's
final sentence. Nice touch.
(You could, however, also use it after a
box-ending 0xFF command to mess with text RAM in
general.)
\#
Quits the currently running loop. This returns from either the text
loop to the command loop, or it ends the dialog sequence by returning
from the command loop back to gameplay. If this stage of the game later
starts another dialog sequence, it will start at the next script
byte.
\$
Like \#, but first waits for any key to be
pressed.
0xFF
Behaves like TH04's \$ in the text loop, and like
\# in the command loop. Hence, it's not possible in TH05 to
automatically end a text box and advance to the next one without waiting
for a key press.
Unused commands are in gray.
At the end of the day, you might criticize the system for how its landmines
make it annoying to mod in ASCII text, but it all works and does what it's
supposed to. ZUN could have written the cleanest single and central
Shift-JIS iterator that properly chunks a byte buffer into halfwidth and
fullwidth codepoints, and I'd still be throwing it out for the upcoming
non-ASCII translations in favor of something that either also supports UTF-8
or performs dictionary lookups with a full box of text.
The only actual bug can be found in the input detection, which once
again doesn't correctly handle the infamous key
up/key down scancode quirk of PC-98 keyboards. All it takes
is one wrongly placed input polling call, and suddenly you have to think
about how the update cycle behind the PC-98 keyboard state bytes
might cause the game to run the regular 2-frame delay for a single
2-byte chunk of text before it shows the full text of a box after
all… But even this bug is highly theoretical and could probably only be
observed very, very rarely, and exclusively on real hardware.
The same can't be said about TH02 though, but more on that later. Let's
first take a look at its data, which started out much simpler in that game.
The STAGE?.TXT files contain just raw Shift-JIS text with no
trace of commands or structure. Turning on the whitespace display feature in
your editor reveals how the dialog system even assumes a fixed byte
length for each box: 36 bytes per line which will appear on screen, followed
by 4 bytes of padding, which the original files conveniently use to visually
split the lines via a CR/LF newline sequence. Make sure to disable trimming
of trailing whitespace in your editor to not ruin the file when modding the
text…
Two boxes from TH02's STAGE5.TXT with visualized whitespace.
These also demonstrate how the CR/LF newlines only make up 2 of the 4
padding bytes, and require each line to be padded with two more bytes; you
could not use these trailing spaces for actual text. Also note how
the exquisite mixture of fullwidth and halfwidth spaces demands the text to
be viewed with only the most metrically consistent monospace fonts to
preserve the intended alignment. 🍷 It appears quite misaligned on my phone.
Consequently, everything else is hardcoded – every effect shown between text
boxes, the face portrait shown for each box, and even how many boxes are
part of each dialog sequence. Which means that the source code now contains
a
long hardcoded list of face IDs for most of the text boxes in the game,
with the rest being part of the
dedicated hardcoded dialog scripts for 2/3 of the
game's stages.
Without the restriction to a fixed set of scripting commands, TH02 naturally
gravitated to having the most varied dialog sequences of all PC-98 Touhou
games. This flexibility certainly facilitated Mima's grand entrance
animation in Stage 4, or the different lines in Stage 4 and 5 depending on
whether you already used a continue or not. Marisa's post-boss dialog even
inserts the number of continues into the text itself – by, you guessed it,
writing to hardcoded byte offsets inside the dialog text before printing it
to the screen. But once again, I have nothing to
criticize here – not even the fact that the alternate dialog scripts have to
mutate the "box cursor" to jump to the intended boxes within the file. I
know that some people in my audience like VMs, but I would have considered
it more bloated if ZUN had implemented a full-blown scripting
language just to handle all these special cases.
Another unique aspect of TH02 is the way it stores its face portraits, which
are infamous for how hard they are to find in the original data files. These
sprites are actually map tiles, stored in MIKO_K.MPN,
and drawn using the same functions used to blit the regular map tiles to the
📝 tile source area in VRAM. We can only guess
why ZUN chose this one out of the three graphics formats he used in TH02:
BFNT supports transparency, but sacrifices one of the 16 colors to do
so. ZUN only used 15 colors for the face portraits, but might have wanted to
keep open the option to use that 16th color. The detailed
backgrounds also suggest that these images were never supposed to be
transparent to begin with.
PI is used for all bigger and non-transparent images, but ZUN would have
had to write a separate small function to blit a 48×48 subsection of such an
image. That certainly wouldn't have stopped him in the TH01 days, but he
probably was already past that point by this game.
That only leaves .MPN. Sure, he did have to slice each face into 9
separate 16×16 "map" tiles to use this format, but that's a small price to
pay in exchange for not having to write any new low-level blitting code,
especially since he must have already had an asset pipeline to generate
these files.
TH02's MIKO_K.PTN, arranged into a 16×16-tile layout that
reveals how these tiles are combined into face portraits. MPNDEF from -Tom-'s MysticTK conveniently uses
this exact layout in its .BMP output. Earlier MPNDEF
versions crashed when converting this file as its 256 tiles led to an
8-bit overflow bug, so make sure you've updated to the current version
from the end of October 2023 if you want to convert this file yourself.
The format stores the 4 bitplanes of each 16×16 tile in order, so good
luck finding a different planar image viewer that would support both
such a tiled layout and a custom palette. Sometimes, a weird
internal format is the best type of obfuscation.
And since you're certainly wondering about all these black tiles at the
edges: Yes, these are not only part of the file and pad it from the required
240×192 pixels to 256×256, but also kept in memory during a stage, wasting
9.5 KiB of conventional RAM. That's 172 seconds of potential input
replay data, just for those people who might still think that we need EMS
for replays.
Alright, we've got the text, we've got the faces, let's slide in the box and
display it all on screen. Apparently though, we also have to blit the player
and option sprites using raw, low-level master.lib function calls in the
process? This can't be right, especially because ZUN
always blits the option sprite associated with the Reimu-A shot type,
regardless of which one the player actually selected. And if you keep moving
above the box area before the dialog starts, you get to see exactly how
wrong this is:
Let's look closer at Reimu's sprite during the slide-in animation, and in
the two frames before:
This one image shows off no less than 4 bugs:
ZUN blits the stationary player sprite here, regardless of whether the
player was previously moving left or right. This is a nice way of indicating
that Reimu stops moving once the dialog starts, but maybe ZUN should
have unblitted the old sprite so that the new one wouldn't have appeared on
top. The game only unblits the 384×64 pixels covered by the dialog box on
every frame of the slide-in animation, so Reimu would only appear correctly
if her sprite happened to be entirely located within that area.
All sprites are shifted up by 1 pixel in frame 2️⃣. This one is not a
bug in the dialog system, but in the main game loop. The game runs the
relevant actions in the following order:
Invalidate any map tiles covered by entities
Redraw invalidated tiles
Decrement the Y coordinate at the top of VRAM according to the
scroll speed
Update and render all game entities
Scroll in new tiles as necessary according to the scroll speed, and
report whether the game has scrolled one pixel past the end of the
map
If that happened, pretend it didn't by incrementing the value
calculated in #3 for all further frames and skipping to
#8.
Issue a GDC SCROLL command to reflect the line
calculated in #3 on the display
Wait for VSync
Flip VRAM pages
Start boss if we're past the end of the map
The problem here: Once the dialog starts, the game has already rendered
an entire new frame, with all sprites being offset by a new Y scroll
offset, without adjusting the graphics GDC's scroll registers to
compensate. Hence, the Y position in 3️⃣ is the correct one, and the
whole existence of frame 2️⃣ is a bug in itself. (Well… OK, probably a
quirk because speedrunning exists, and it would be pretty annoying to
synchronize any video regression tests of the future TH02 Anniversary
Edition if it renders one fewer frame in the middle of a stage.)
ZUN blits the option sprites to their position from frame 1️⃣. This
brings us back to
📝 TH02's special way of retaining the previous and current position in a two-element array, indexed with a VRAM page ID.
Normally, this would be equivalent to using dedicated prev and
cur structure fields and you'd just index it with the back page
for every rendering call. But if you then decide to go single-buffered for
dialogs and render them onto the front page instead…
Note that fixing bug #2 would not cancel out this one – the sprites would
then simply be rendered to their position in the frame before 1️⃣.
And of course, the fixed option sprite ID also counts as a bug.
As for the boxes themselves, it's yet another loop that prints 2-byte chunks
of Shift-JIS text at an even slower fixed interval of 3 frames. In an
interesting quirk though, ZUN assumes that every box starts with the name of
the speaking character in its first two fullwidth Shift-JIS characters,
followed by a fullwidth colon. These 6 bytes are displayed immediately at
the start of every box, without the usual delay. The resulting alignment
looks rather janky with Genjii, whose single right-padded 亀
kanji looks quite awkward with the fullwidth space between the name
and the colon. Kind of makes you wonder why ZUN just didn't spell out his
proper name, 玄爺, instead, but I get the stylistic
difference.
In Stage 4, the two-kanji assumption then breaks with Marisa's three-kanji
name, which causes the full-width colon to be printed as the first delayed
character in each of her boxes:
That's all the issues and quirks in the system itself. The scripts
themselves don't leave much room for bugs as they basically just loop over
the hardcoded face ID array at this level… until we reach the end of the
game. Previously, the slide-in animation could simply use the tile
invalidation and re-rendering system to unblit the box on each frame, which
also explained why Reimu had to be separately rendered on top. But this no
longer works with a custom-rendered boss background, and so the game just
chooses to flood-fill the area with graphics chip color #0:
Then again, transferring pixels from the back page would be just
as wrong as they lag one frame behind. No way around capturing these 384×64
pixels to main memory here… Oh well, this flood-fill at least adds even more
legibility on top of the already half-transparent text box. A property that
the following dialog sequence unfortunately lacks…
For Mima's final defeat dialog though, ZUN chose to not even show the box.
He might have realized the issue by that point, or simply preferred the more
dramatic effect this had on the lines. The resulting issues, however, might
even have ramifications for such un-technical things as lore and
character dynamics. As it turns out, the code
for this dialog sequence does in fact render Mima's smiling face for all
boxes?! You only don't see it in the original game because it's rendered to
the other VRAM page that remains invisible during the dialog sequence:
Caution, flashing lights.
Here's how I interpret the situation:
The function that launches into the final part of the dialog script
starts with dedicated
code to re-render Mima to the back page, on top of the previously
rendered planet background. Since the entire script runs on the front
page (and thus, on top of the previous frame) and the game launches into
the ending immediately after, you don't ever get to see this new partial
frame in the original game.
Showing this partial frame would also ensure that you can actually
read the dialog text without a surrounding box. Then, the white
letters won't ever be put on top of any white bullets – or, worse, be completely invisible if the
dialog is triggered in the middle of Reimu-B's bomb animation, which
fills VRAM with lots of white pixels.
Hence, we've got enough evidence to classify not showing the back page
as a ZUN
bug. 🐞
However, Mima's smiling face jars with the words she says here. Adding
the face would deviate more significantly from the original game than
removing the player shot, item, bullet, or spark sprites would. It's
imaginable that ZUN just forgot about the dedicated code that
re-rendered just Mima to the back page, but the faces add
something to the dialog, and ZUN would have clearly noticed and
fixed it if their absence wasn't intended. Heck, ZUN might have just put
something related to Mima into the code because TH02's dialog system has
no way of not drawing a face for a dialog box. Filling the face
area with graphics chip color #0, as seen in the first and third boxes
of the Extra Stage pre-boss dialog, would have been an alternative, but
that would have been equally wrong with regard to the background.
Hence, the invisible face portrait from the original game is a ZUN
quirk. 🎺
So, the future TH02 Anniversary Edition will fix the bug by showing
the back page, but retain the quirk by rewriting the dialog code to
not blit the face.
And with that, we've secured all in-game dialog for the upcoming non-ASCII
translations! The remaining 2/3 of the last push made
for a good occasion to also decompile the small amount of code related to
TH03's win messages, stored in the @0?TX.TXT files. Similar to
TH02's dialog format, these files are also split into fixed-size blocks of
3×60 bytes. But this time, TH03 loads all 60 bytes of a line, including the
CR/LF line breaking codepoints in the original files, into the statically
allocated buffer that it renders from. These control characters are then
only filtered to whitespace by ZUN's graph_putsa_fx() function.
If you remove the line breaks, you get to use the full 60 bytes on every
line.
The final commits went to the MIKO.CFG loading and saving
functions used in TH04's and TH05's OP.EXE, as well as TH04's
game startup code to finally catch up with
📝 TH05's counterpart from over 3 years ago.
This brought us right in front of the main menu rendering code in both TH04
and TH05, which is identical in both games and will be tackled in the next
PC-98 Touhou delivery.
Next up, though: Returning to Shuusou Gyoku, and adding support for SC-88Pro
recordings as BGM. Which may or may not come with a slight controversy…
And then, the supposed boilerplate code revealed yet another confusing issue
that quickly forced me back to serial work, leading to no parallel progress
made with Shuusou Gyoku after all. 🥲 The list of functions I put together
for the first ½ of this push seemed so boring at first, and I was so sure
that there was almost nothing I could possibly talk about:
TH02's gaiji animations at the start and end of each stage, resembling
opening and closing window blind slats. ZUN should have maybe not defined
the regular whitespace gaiji as what's technically the last frame of the
closing animation, but that's a minor nitpick. Nothing special there
otherwise.
The remaining spawn functions for TH04's and TH05's gather circles. The
only dumb antic there is the way ZUN initializes the template for bullets
fired at the end of the animation, featuring ASM instructions that are
equivalent to what Turbo C++ 4.0J generates for the __memcpy__
intrinsic, but show up in a different order. Which means that they must have
been handwritten. I already figured that out in 2022
though, so this was just more of the same.
EX-Alice's override for the game's main 16×16 sprite sheet, loaded
during her dialog script. More of a naming and consistency challenge, if
anything.
The regular version of TH05's big 16×16 sprite sheet.
EX-Alice's variant of TH05's big 16×16 sprite sheet.
The rendering function for TH04's Stage 4 midboss, which seems to
feature the same premature clipping quirk we've seen for
📝 TH05's Stage 5 midboss, 7 months ago?
The rendering function for the big 48×48 explosion sprite, which also
features the same clipping quirk?
That's three instances of ZUN removing sprites way earlier than you'd want
to, intentionally deciding against those sprites flying smoothly in and out
of the playfield. Clearly, there has to be a system and a reason behind it.
Turns out that it can be almost completely blamed on master.lib. None of the
super_*() sprite blitting functions can clip the rendered
sprite to the edges of VRAM, and much less to the custom playfield rectangle
we would actually want here. This is exactly the wrong choice to make for a
game engine: Not only is the game developer now stuck with either rendering
the sprite in full or not at all, but they're also left with the burden of
manually calculating when not to display a sprite.
However, strictly limiting the top-left screen-space coordinate to
(0, 0) and the bottom-right one to (640, 400) would actually
stop rendering some of the sprites much earlier than the clipping conditions
we encounter in these games. So what's going on there?
The answer is a combination of playfield borders, hardware scrolling, and
master.lib needing to provide at least some help to support the
latter. Hardware scrolling on PC-98 works by dividing VRAM into two vertical
partitions along the Y-axis and telling the GDC to display one of them at
the top of the screen and the other one below. The contents of VRAM remain
unmodified throughout, which raises the interesting question of how to deal
with sprites that reach the vertical edges of VRAM. If the top VRAM row that
starts at offset 0x0000 ends up being displayed below
the bottom row of VRAM that starts at offset 0x7CB0 for 399 of
the 400 possible scrolling positions, wouldn't we then need to vertically
wrap most of the rendered sprites?
For this reason, master.lib provides the super_roll_*()
functions, which unconditionally perform exactly this vertical wrapping. But
this creates a new problem: If these functions still can't clip, and don't
even know which VRAM rows currently correspond to the top and bottom row of
the screen (since master.lib's graph_scrollup() function
doesn't retain this information), won't we also see sprites wrapping around
the actual edges of the screen? That's something we certainly
wouldn't want in a vertically scrolling game…
The answer is yes, and master.lib offers no solution for this issue. But
this is where the playfield borders come in, and helpfully cover 16 pixels
at the top and 16 pixels at the bottom of the screen. As a result, they can
hide up to 32 rows of potentially wrapped sprite pixels below them:
•
The earliest possible frame that TH05 can start rendering the Stage 5
midboss on. Hiding the text layer reveals how master.lib did in fact
"blindly" render the top part of her sprite to the bottom of the
playfield. That's where her sprite starts before it is correctly
wrapped around to the top of VRAM.
If we scrolled VRAM by another 200 pixels (and faked an equally shifted
TRAM for demonstration purposes), we get an equally valid game scene
that points out why a vertically scrolling PC-98 game must wrap all sprites
at the vertical edges of VRAM to begin with.
Also, note how the HP bar has filled up quite a bit before the midboss can
actually appear on screen.
And that's how the lowest possible top Y coordinate for sprites blitted
using the master.lib super_roll_*() functions during the
scrolling portions of TH02, TH04, and TH05 is not 0, but -16. Any lower, and
you would actually see some of the sprite's upper pixels at the
bottom of the playfield, as there are no more opaque black text cells to
cover them. Theoretically, you could lower this number for
some animation frames that start with multiple rows of transparent
pixels, but I thankfully haven't found any instance of ZUN using such a
hack. So far, at least…
Visualized like that, it all looks quite simple and logical, but for days, I
did not realize that these sprites were rendered to a scrolling VRAM.
This led to a much more complicated initial explanation involving the
invisible extra space of VRAM between offsets 0x7D00 and
0x7FFF that effectively grant a hidden additional 9.6 lines
below the playfield. Or even above, since PC-98 hardware ignores the highest
bit of any offset into a VRAM bitplane segment
(& 0x7FFF), which prevents blitting operations from
accidentally reaching into a different bitplane. Together with the
aforementioned rows of transparent pixels at the top of these midboss
sprites, the math would have almost worked out exactly.
The need for manual clipping also applies to the X-axis. Due to the lack of
scrolling in this dimension, the boundaries there are much more
straightforward though. The minimum left coordinate of a sprite can't fall
below 0 because any smaller coordinate would wrap around into the
📝 tile source area and overwrite some of the
pixels there, which we obviously don't want to re-blit every frame.
Similarly, the right coordinate must not extend into the HUD, which starts
at 448 pixels.
The last part might be surprising if you aren't familiar with the PC-98 text
chip. Contrary to the CGA and VGA text modes of IBM-compatibles, PC-98 text
cells can only use a single color for either their foreground or
background, with the other pixels being transparent and always revealing the
pixels in VRAM below. If you look closely at the HUD in the images above,
you can see how the background of cells with gaiji glyphs is slightly
brighter (◼ #100) than the opaque black
cells (◼ #000) surrounding them. This
rather custom color clearly implies that those pixels must have been
rendered by the graphics GDC. If any other sprite was rendered below the
HUD, you would equally see it below the glyphs.
So in the end, I did find the clear and logical system I was looking for,
and managed to reduce the new clipping conditions down to a
set of basic rules for each edge. Unfortunately, we also need a second
macro for each edge to differentiate between sprites that are smaller or
larger than the playfield border, which is treated as either 32×32 (for
super_roll_*()) or 32×16 (for non-"rolling"
super_*() functions). Since smaller sprites can be fully
contained within this border, the games can stop rendering them as soon as
their bottom-right coordinate is no longer seen within the playfield, by
comparing against the clipping boundaries with <= and
>=. For example, a 16×16 sprite would be completely
invisible once it reaches (16, 0), so it would still be rendered at
(17, 1). A larger sprite during the scrolling part of a stage, like,
say, the 64×64 midbosses, would still be rendered if their top-left
coordinate was (0, -16), so ZUN used < and
> comparisons to at least get an additional pixel before
having to stop rendering such a sprite. Turbo C++ 4.0J sadly can't
constant-fold away such a difference in comparison operators.
And for the most part, ZUN did follow this system consistently. Except for,
of course, the typical mistakes you make when faced with such manual
decisions, like how he treated TH04's Stage 4 midboss as a "small" sprite
below 32×32 pixels (it's 64×64), losing that precious one extra pixel. Or
how the entire rendering code for the 48×48 boss explosion sprite pretends
that it's actually 64×64 pixels large, which causes even the initial
transformation into screen space to be misaligned from the get-go.
But these are additional bugs on top of the single
one that led to all this research.
Because that's what this is, a bug. 🐞 Every resulting pixel boundary is a
systematic result of master.lib's unfortunate lack of clipping. It's as much
of a bug as TH01's byte-aligned rendering of entities whose internal
position is not byte-aligned. In both cases, the entities are alive,
simulated, and partake in collision detection, but their rendered appearance
doesn't accurately reflect their internal position.
Initially, I classified
📝 the sudden pop-in of TH05's Stage 5 midboss
as a quirk because we had no conclusive evidence that this wasn't
intentional, but now we do. There have been multiple explanations for why
ZUN put borders around the playfield, but master.lib's lack of sprite
clipping might be the biggest reason.
And just like byte-aligned rendering, the clipping conditions can easily be
removed when porting the game away from PC-98 hardware. That's also what
uth05win chose to do: By using OpenGL and not having to rely on hardware
scrolling, it can simply place every sprite as a textured quad at its exact
position in screen space, and then draw the black playfield borders on top
in the end to clip everything in a single draw call. This way, the Stage 5
midboss can smoothly fly into the playfield, just as defined by its movement
code:
The entire smooth Stage 5 midboss entrance animation as shown in
uth05win. If the simultaneous appearance of the Enemy!! label
doesn't lend further proof to this having been ZUN's actual intention, I
don't know what will.
Meanwhile, I designed the interface of the 📝 generic blitter used in the TH01 Anniversary Edition entirely around
clipping the blitted sprite at any explicit combination of VRAM edges. This
was nothing I tacked on in the end, but a core aspect that informed the
architecture of the code from the very beginning. You really want to
have one and only one place where sprite clipping is done right – and
only once per sprite, regardless of how many bitplanes you want to write to.
Which brings us to the goal that the final ¼ of this push went toward. I
thought I was going to start cleaning up the
📝 player movement and rendering code, but
that turned out too complicated for that amount of time – especially if you
want to start with just cleanup, preserving all original bugs for the
time being.
Fixing and smoothening player and Orb movement would be the next big task in
Anniversary Edition development, needing about 3 pushes. It would start with
more performance research into runtime-shifting of larger sprites, followed
by extending my generic blitter according to the results, writing new
optimized loaders for the original image formats, and finally rewriting all
rendering code accordingly. With that code in place, we can then start
cleaning up and fixing the unique code for each boss, one by one.
Until that's funded, the code still contains a few smaller and easier pieces
of code that are equally related to rendering bugs, but could be dealt with
in a more incremental way. Line rendering is one of those, and first needs
some refactoring of every call site, including
📝 the rotating squares around Mima and
📝 YuugenMagan's pentagram. So far, I managed
to remove another 1,360 bytes from the binary within this final ¼ of a push,
but there's still quite a bit to do in that regard.
This is the perfect kind of feature for smaller (micro-)transactions. Which
means that we've now got meaningful TH01 code cleanup and Anniversary
Edition subtasks at every price range, no matter whether you want to invest
a lot or just a little into this goal.
If you can, because Ember2528 revealed the plan behind
his Shuusou Gyoku contributions: A full-on Linux port of the game, which
will be receiving all the funding it needs to happen. 🐧 Next up, therefore:
Turning this into my main project within ReC98 for the next couple of
months, and getting started by shipping the long-awaited first step towards
that goal.
I've raised the cap to avoid the potential of rounding errors, which might
prevent the last needed Shuusou Gyoku push from being correctly funded. I
already had to pick the larger one of the two pending TH02 transactions for
this push, because we would have mathematically ended up
1/25500 short of a full push with the smaller
transaction. And if I'm already at it, I might
as well free up enough capacity to potentially ship the complete OpenGL
backend in a single delivery, which is currently estimated to cost 7 pushes
in total.
🎉 After almost 3 years, TH04 finally caught up to TH05 and is now 100%
position-independent as well! 🎉
For a refresher on what this means and does not mean, check the
announcements from back in 2019 and 2020 when we chased the goal for TH05's
📝 OP.EXE and
📝 the rest of the game. These also feature
some demo videos that show off the kind of mods you were able to efficiently
code back then. With the occasional reverse-engineering attention it
received over the years, TH04's code should now be slightly easier to work
with than TH05's was back in the day. Although not by much – TH04 has
remained relatively unpopular among backers, and only received more than the
funded attention because it shares most of its core code with the more
popular TH05. Which, coincidentally, ended up becoming
📝 the reason for getting this done now.
Not that it matters a lot. Ever since we reached 100% PI for TH05, community
and backer interest in position independence has dropped to near zero. We
just didn't end up seeing the expected large amount of community-made mods
that PI was meant to facilitate, and even the
📝 100% decompilation of TH01 changed nothing
about that. But that's OK; after all, I do appreciate the business of
continually getting commissioned for all the
📝 large-scale mods. Not focusing on PI is
also the correct choice for everyone who likes reading these blog posts, as
it often means that I can't go that much into detail due to cutting corners
and piling up technical debt left and right.
Surprisingly, this only took 1.25 pushes, almost twice as fast as expected.
As that's closer to 1 push than it is to 2, I'm OK with releasing it like
this – especially since it was originally meant to come out three days ago.
🍋 Unfortunately, it was delayed thanks to surprising
website bugs and a certain piece of code that was way more difficult to
document than it was to decompile… The next push will have slightly less
content in exchange, though.
📝 P0240 and P0241 already covered the final
remaining structures, so I only needed to do some superficial RE to prove
the remaining numeric literals as either constants or memory addresses. For
example, I initially thought I'd have to decompile the dissolve animations
in the staff roll, but I only needed to identify a single function pointer
type to prove all false positives as screen coordinates there. Now, the TH04
staff roll would be another fast and cheap decompilation, similar to the
custom entity types of TH04. (And TH05 as well!)
The one piece of code I did have to decompile was Stage 4's carpet
lighting animation, thanks to hex literals that were way too complicated to
leave in ASM. And this one probably takes the crown for TH04's worst set of
landmines and bloat that still somehow results in no observable bugs or
quirks.
This animation starts at frame 1664, roughly 29.5 seconds into the stage,
and quickly turns the stage background into a repeated row of dark-red plaid
carpet tiles by moving out from the center of the playfield towards the
edges. Afterward, the animation repeats with a brighter set of tiles that is
then used for the rest of the stage. As I explained
📝 a while ago in the context of TH02, the
stage tile and map formats in PC-98 Touhou can't express animations, so all
of this needed to be hardcoded in the binary.
The repeating 384×16 row of carpet tiles at the beginning of TH04's
Stage 4 in all three light levels, shown twice for better visibility.
And ZUN did start out making the right decision by only using fully-lit
carpet tiles for all tile sections defined in ST03.MAP. This
way, the animation can simply disable itself after it completed, letting the
rest of the stage render normally and use new tile sections that are only
defined for the final light level. This means that the "initial" dark
version of the carpet is as much a result of hardcoded tile manipulation as
the animation itself.
But then, ZUN proceeded to implement it all by directly manipulating the
ring buffer of on-screen tiles. This is the lowest level before the tiles
are rendered, and rather detached from the defined content of the
📝 .MAP tile sections. Which leads to a whole
lot of problems:
If you decide to do this kind of tile ring modification, it should ideally
happen at a very specific point: after scrolling in new tiles into
the ring buffer, but before blitting any scrolled or invalidated
tiles to VRAM based on the ring buffer. Which is not where ZUN chose to put
it, as he placed the call to the stage-specific render function after both
of those operations. By the time the function is
called, the tile renderer has already blitted a few lines of the fully-lit
carpet tiles from the defined .MAP tile section, matching the scroll speed.
Fortunately, these are hidden behind the black TRAM cells above and below
the playfield…
Still, the code needs to get rid of them before they would become visible.
ZUN uses the regular tile invalidation function for this, which will only
cause actual redraws on the next frame. Again, the tile rendering call has
already happened by the time the Stage 4-specific rendering function gets
called.
But wait, this game also flips VRAM pages between frames to provide a
tear-free gameplay experience. This means that the intended redraw of the
new tiles actually hits the wrong VRAM page.
And sure, the code does attempt to invalidate these newly blitted lines
every frame – but only relative to the current VRAM Y coordinate that
represents the top of the hardware-scrolled screen. Once we're back on the
original VRAM page on the next frame, the lines we initially set out to
remove could have already scrolled past that point, making it impossible to
ever catch up with them in this way.
The only real "solution": Defining the height of the tile invalidation
rectangle at 3× the scroll speed, which ensures that each invalidation call
covers 3 frames worth of newly scrolled-in lines. This is not intuitive at
all, and requires an understanding of everything I have just written to even
arrive at this conclusion. Needless to say that ZUN didn't comprehend it
either, and just hardcoded an invalidation height that happened to be enough
for the small scroll speeds defined in ST03.STD for the first
30 seconds of the stage.
The effect must consistently modify the tile ring buffer to "fix" any new
tiles, overriding them with the intended light level. During the animation,
the code not only needs to set the old light level for any tiles that are
still waiting to be replaced, but also the new light level for any tiles
that were replaced – and ZUN forgot the second part. As a result, newly scrolled-in tiles within the already animated
area will "remain" untouched at light level 2 if the scroll speed is fast
enough during the transition from light level 0 to 1.
All that means that we only have to raise the scroll speed for the effect to
fall apart. Let's try, say, 4 pixels per frame rather than the original
0.25:
By hiding the text RAM layer and revealing what's below the usually
opaque black cells above and below the playfield, we can observe all
three landmines – 1) and 2) throughout light level 0, and 3) during the
transition from level 0 to 1.
All of this could have been so much simpler and actually stable if ZUN
applied the tile changes directly onto the .MAP. This is a much more
intuitive way of expressing what is supposed to happen to the map, and would
have reduced the code to the actually necessary tile changes for the first
frame and each individual frame of the animation. It would have still
required a way to force these changes into the tile ring buffer, but ZUN
could have just used his existing full-playfield redraw functions for that.
In any case, there would have been no need for any per-frame tile
fixing and redrawing. The CPU cycles saved this way could have then maybe
been put towards writing the tile-replacing part of the animation in C++
rather than ASM…
Wow, that was an unreasonable amount of research into a feature that
superficially works fine, just because its decompiled code didn't make
sense. To end on a more positive note, here are
some minor new discoveries that might actually matter to someone:
The laser part of Marisa's Illusion Laser shot type always does 3
points of damage per frame, regardless of the player's power level. Its
hitbox also remains identical on all power levels, no matter how wide the
laser appears on screen. The strength difference between the levels purely
comes from the number of frames the laser stays active before a fixed
non-damaging 32-frame cooldown time:
Power level
Frames per cycle (including 32-frame cooldown)
2
64
3
72
4
88
5
104
6
128
7
144
8
168
9
192
The decay animation for player shots is faster in TH05 (12 frames) than in
TH04 (16 frames).
In the first phase of her Stage 6 fight, Yuuka moves along one of two
randomly chosen hardcoded paths, defined as a set of 5 movement angles.
After reaching the final point and firing a danmaku pattern, she teleports
back to her initial position to repeat the path one more time before the
phase times out.
Similarly, TH04's Stage 3 midboss also goes through 12 fixed movement angles
before flying off the playfield.
The formulas for calculating the skill rating on both TH04's and TH05's
final verdict screen are going to be very long and complicated.
Next up: ¾ of a push filled with random boilerplate, finalization, and TH01
code cleanup work, while I finish the preparations for Shuusou Gyoku's
OpenGL backend. This month, everything should finally work out as intended:
I'll complete both tasks in parallel, ship the former to free up the cap,
and then ship the latter once its 5th push is fully funded.
OK, let's decompile TH02's HUD code first, gain a solid understanding of how
increasing the score works, and then look at the item system of this game.
Should be no big deal, no surprises expected, let's go!
…Yeah, right, that's never how things end up in ReC98 land.
And so, we get the usual host of newly discovered
oddities in addition to the expected insights into the item mechanics. Let's
start with the latter:
Some regular stage enemies appear to randomly drop either or items. In reality, there is
very little randomness at play here: These items are picked from a
hardcoded, repeating ring of 10 items
(𝄆 𝄇), and the only source of
randomness is the initial position within this ring, which changes at
the beginning of every stage. ZUN further increased the illusion of
randomness by only dropping such a semi-random item for every
3rd defeated enemy that is coded to drop one, and also having
enemies that drop fixed, non-random items. I'd say it's a decent way of
ensuring both randomness and balance.
There's a 1/512 chance for such a semi-random
item drop to turn into a item instead –
which translates to 1/1536 enemies due to the
fixed drop rate.
Edit (2023-06-11): These are the only ways that items can randomly drop in this game. All other drops, including
any items, are scripted and deterministic.
After using a continue (both after a Game Over, or after manually
choosing to do so through the Pause menu for whatever reason), the
next
(Stage number + 1) semi-random item
drops are turned into items instead.
Items can contribute up to 25 points to the skill value and subsequent
rating (あなたの腕前) on the final verdict
screen. Doing well at item collection first increases a separate
collect_skill value:
Item
Collection condition
collect_skill change
below max power
+1
at or above max power
+2
value == 51,200
+8
value ≥20,000 and <51,200
+4
value ≥10,000 and <20,000
+2
value <10,000
+1
with 5 bombs in stock
+16
Note, again, the lack of anything involving
items. At the maximum of 5 lives, the item spawn function transforms
them into bomb items anyway. It is possible though to gain
the 5th life by reaching one of the extend scores while a
item is still on screen; in that case,
collecting the 1-up has no effect at all.
Every 32 collect_skill points will then raise the
item_skill by 1, whereas every 16 dropped items will lower
it by 1. Before launching into the ending sequence,
item_skill is clamped to the [0; 25] range and
added to the other skill-relevant metrics we're going to look at in
future pushes.
When losing a life, the game will drop a single
and 4 randomly picked or items in a random order
around Reimu's position. Contrary to an
unsourced Touhou Wiki edit from 2009, each of the 4 does have an
equal and independent chance of being either a
or item.
Finally, and perhaps most
interestingly, item values! These are
determined by the top Y coordinate of an item during the frame it is
collected on. The maximum value of 51,200 points applies to the top 48
pixels of the playfield, and drops off as soon as an item falls below
that line. For the rest of the playfield, point items then use a formula
of (28,000 - (top Y coordinate of item in
screen space × 70)):
Point items and their collection value in TH02. The numbers
correspond to items that are collected while their top Y coordinate
matches the line they are directly placed on. The upper
item in the image would therefore give
23,450 points if the player collected it at that specific
position.
Reimu collects any item whose 16×16 bounding box lies fully within
the red 48×40 hitbox. Note that
the box isn't cut off in this specific case: At Reimu's lowest
possible position on the playfield, the lowest 8 pixels of her
sprite are clipped, but the item hitbox still happens to end exactly
at the bottom of the playfield. Since an item's Y velocity
accelerates on every frame, it's entirely possible to collect a
point item at the lowest value of 2,240 points, on the exact frame
before it falls below the collection hitbox.
Onto score tracking then, which only took a single commit to raise another
big research question. It's widely known that TH02 grants extra lives upon
reaching a score of 1, 2, 3, 5, or 8 million points. But what hasn't been
documented is the fact that the game does not stop at the end of the
hardcoded extend score array. ZUN merely ends it with a sentinel value of
999,999,990 points, but if the score ever increased beyond this value, the
game will interpret adjacent memory as signed 32-bit score values and
continue giving out extra lives based on whatever thresholds it ends up
finding there. Since the following bytes happen to turn into a negative
number, the next extra life would be awarded right after gaining another 10
points at exactly 1,000,000,000 points, and the threshold after that would
be 11,114,905,600 points. Without an explicit counterstop, the number of
score-based extra lives is theoretically unlimited, and would even continue
after the signed 32-bit value overflowed into the negative range. Although
we certainly have bigger problems once scores ever reach that point…
That said, it seems impossible that any of this could ever happen
legitimately. The current high scores of 42,942,800 points on
Lunatic and 42,603,800 points on
Extra don't even reach 1/20 of ZUN's sentinel
value. Without either a graze or a bullet cancel system, the scoring
potential in this game is fairly limited, making it unlikely for high scores
to ever increase by that additional order of magnitude to end up anywhere
near the 1 billion mark.
But can we really be sure? Is this a landmine because it's impossible
to ever reach such high scores, or is it a quirk because these extends
could be observed under rare conditions, perhaps as the result of
other quirks? And if it's the latter, how many of these adjacent bytes do we
need to preserve in cleaned-up versions and ports? We'd pretty much need to
know the upper bound of high scores within the original stage and boss
scripts to tell. This value should be rather easy to calculate in a
game with such a simple scoring system, but doing that only makes sense
after we RE'd all scoring-related code and could efficiently run such
simulations. It's definitely something we'd need to look at before working
on this game's debloated version in the far future, which is
when the difference between quirks and landmines will become relevant.
Still, all that uncertainty just because ZUN didn't restrict a loop to the
size of the extend threshold array…
TH02 marks a pivotal point in how the PC-98 Touhou games handle the current
score. It's the last game to use a 32-bit variable before the later games
would regrettably start using arrays of binary-coded
decimals. More importantly though, TH02 is also the first game to
introduce the delayed score counting animation, where the displayed score
intentionally lags behind and gradually counts towards the real one over
multiple frames. This could be implemented in one of two ways:
Keep the displayed score as a separate variable inside the presentation
layer, and let it gradually count up to the real score value passed in from
the logic layer
Burden the game logic with this presentation detail, and split the score
into two variables: One for the displayed score, and another for the
delta between that score and the actual one. Newly gained points are
first added to the delta variable, and then gradually subtracted from there
and added to the real score before being displayed.
And by now, we can all tell which option ZUN picked for the rest of the
PC-98 games, even if you don't remember
📝 me mentioning this system last year.
📝 Once again, TH02 immortalized ZUN's initial
attempt at the concept, which lacks the abstraction boundaries you'd want
for managing this one piece of state across two variables, and messes up the
abstractions it does have. In addition to the regular score
transfer/render function, the codebase therefore has
a function that transfers the current delta to the score immediately,
but does not re-render the HUD, and
a function that adds the delta to the score and re-renders the HUD, but
does not reset the delta.
And – you guessed it – I wouldn't have mentioned any of this if it didn't
result in one bug and one quirk in TH02. The bug resulting from 1) is pretty
minor: The function is called when losing a life, and simply stops any
active score-counting animation at the value rendered on the frame where the
player got hit. This one is only a rendering issue – no points are lost, and
you just need to gain 10 more for the rendered value to jump back up to its
actual value. You'll probably never notice this one because you're likely
busy collecting the single spawned around Reimu
when losing a life, which always awards at least 10 points.
The quirk resulting from 2) is more intriguing though. Without a separate
reset of the score delta, the function effectively awards the current delta
value as a one-time point bonus, since the same delta will still be
regularly transferred to the score on further game frames.
This function is called at the start of every dialog sequence. However, TH02
stops running the regular game loop between the post-boss dialog and the
next stage where the delta is reset, so we can only observe this quirk for
the pre-boss sequences and the dialog before Mima's form change.
Unfortunately, it's not all too exploitable in either case: Each of the
pre-boss dialog sequences is preceded by an ungrazeable pellet pattern and
followed by multiple seconds of flying over an empty playfield with zero
scoring opportunities. By the time the sequence starts, the game will have
long transferred any big score delta from max-valued point items. It's
slightly better with Mima since you can at least shoot her and use a bomb to
keep the delta at a nonzero value, but without a health bar, there is little
indication of when the dialog starts, and it'd be long after Mima
gave out her last bonus items in any case.
But two of the bosses – that is, Rika, and the Five Magic Stones – are
scrolled onto the playfield as part of the stage script, and can also be hit
with player shots and bombs for a few seconds before their dialog starts.
While I'll only get to cover shot types and bomb damage within the next few
TH02 pushes, there is an obvious initial strategy for maximizing the effect
of this quirk: Spreading out the A-Type / Wide / High Mobility shot to land
as many hits as possible on all Five Magic Stones, while firing off a bomb.
Turns out that the infamous button-mashing mechanics of the
player shot are also more complicated than simply pressing and releasing the
Shot key at alternating frames. Even this result took way too many
takes.
Wow, a grand total of 1,750 extra points! Totally worth wasting a bomb for…
yeah, probably not. But at the very least, it's
something that a TAS score run would want to keep in mind. And all that just
because ZUN "forgot" a single score_delta = 0; assignment at
the end of one function…
And that brings TH02 over the 30% RE mark! Next up: 100% position
independence for TH04. If anyone wants to grab the
that have now been freed up in the cap: Any small Touhou-related task would
be perfect to round out that upcoming TH04 PI delivery.
Well, well. My original plan was to ship the first step of Shuusou Gyoku
OpenGL support on the next day after this delivery. But unfortunately, the
complications just kept piling up, to a point where the required solutions
definitely blow the current budget for that goal. I'm currently sitting on
over 70 commits that would take at least 5 pushes to deliver as a meaningful
release, and all of that is just rearchitecting work, preparing the
game for a not too Windows-specific OpenGL backend in the first place. I
haven't even written a single line of OpenGL yet… 🥲
This shifts the intended Big Release Month™ to June after all. Now I know
that the next round of Shuusou Gyoku features should better start with the
SC-88Pro recordings, which are much more likely to get done within their
current budget. At least I've already completed the configuration versioning
system required for that goal, which leaves only the actual audio part.
So, TH04 position independence. Thanks to a bit of funding for stage
dialogue RE, non-ASCII translations will soon become viable, which finally
presents a reason to push TH04 to 100% position independence after
📝 TH05 had been there for almost 3 years. I
haven't heard back from Touhou Patch Center about how much they want to be
involved in funding this goal, if at all, but maybe other backers are
interested as well.
And sure, it would be entirely possible to implement non-ASCII translations
in a way that retains the layout of the original binaries and can be easily
compared at a binary level, in case we consider translations to be a
critical piece of infrastructure. This wouldn't even just be an exercise in
needless perfectionism, and we only have to look to Shuusou Gyoku to realize
why: Players expected
that my builds were compatible with existing SpoilerAL SSG files, which
was something I hadn't even considered the need for. I mean, the game is
open-source 📝 and I made it easy to build.
You can just fork the code, implement all the practice features you want in
a much more efficient way, and I'd probably even merge your code into my
builds then?
But I get it – recompiling the game yields just yet another build that can't
be easily compared to the original release. A cheat table is much more
trustworthy in giving players the confidence that they're still practicing
the same original game. And given the current priorities of my backers,
it'll still take a while for me to implement proof by replay validation,
which will ultimately free every part of the community from depending on the
original builds of both Seihou and PC-98 Touhou.
However, such an implementation within the original binary layout would
significantly drive up the budget of non-ASCII translations, and I sure
don't want to constantly maintain this layout during development. So, let's
chase TH04 position independence like it's 2020, and quickly cover a larger
amount of PI-relevant structures and functions at a shallow level. The only
parts I decompiled for now contain calculations whose intent can't be
clearly communicated in ASM. Hitbox visualizations or other more in-depth
research would have to wait until I get to the proper decompilation of these
features.
But even this shallow work left us with a large amount of TH04-exclusive
code that had its worst parts RE'd and could be decompiled fairly quickly.
If you want to see big TH04 finalization% gains, general TH04 progress would
be a very good investment.
The first push went to the often-mentioned stage-specific custom entities
that share a single statically allocated buffer. Back in 2020, I
📝 wrongly claimed that these were a TH05 innovation,
but the system actually originated in TH04. Both games use a 26-byte
structure, but TH04 only allocates a 32-element array rather than TH05's
64-element one. The conclusions from back then still apply, but I also kept
wondering why these games used a static array for these entities to begin
with. You know what they call an area of memory that you can cleanly
repurpose for things? That's right, a heap!
And absolutely no one would mind one additional heap allocation at the start
of a stage, next to the ones for all the sprites and portraits.
However, we are still running in Real Mode with segmented memory. Accessing
anything outside a common data segment involves modifying segment registers,
which has a nonzero CPU cycle cost, and Turbo C++ 4.0J is terrible at
optimizing away the respective instructions. Does this matter? Probably not,
but you don't take "risks" like these if you're in a permanent
micro-optimization mindset…
In TH04, this system is used for:
Kurumi's symmetric bullet spawn rays, fired from her hands towards the left
and right edges of the playfield. These are rather infamous for being the
last thing you see before
📝 the Divide Error crash that can happen in ZUN's original build.
Capped to 6 entities.
The 4 📝 bits used in Marisa's Stage 4 boss
fight. Coincidentally also related to the rare Divide Error
crash in that fight.
Stage 4 Reimu's spinning orbs. Note how the game uses two different sets
of sprites just to have two different outline colors. This was probably
better than messing with the palette, which can easily cause unintended
effects if you only have 16 colors to work with. Heck, I have an entire blog post tag just to highlight
these cases. Capped to the full 32 entities.
The chasing cross bullets, seen in Phase 14 of the same Stage 6 Yuuka
fight. Featuring some smart sprite work, making use of point symmetry to
achieve a fluid animation in just 4 frames. This is
good-code in sprite form. Capped to 31 entities, because the 32nd custom entity during this fight is defined to be…
The single purple pulsating and shrinking safety circle, seen in Phase 4 of
the same fight. The most interesting aspect here is actually still related
to the cross bullets, whose spawn function is wrongly limited to 32 entities
and could theoretically overwrite this circle. This
is strictly landmine territory though:
Yuuka never uses these bullets and the safety circle
simultaneously
She never spawns more than 24 cross bullets
All cross bullets are fast enough to have left the screen by the
time Yuuka restarts the corresponding subpattern
The cross bullets spawn at Yuuka's center position, and assign its
Q12.4 coordinates to structure fields that the safety circle interprets
as raw pixels. The game does try to render the circle afterward, but
since Yuuka's static position during this phase is nowhere near a valid
pixel coordinate, it is immediately clipped.
The flashing lines seen in Phase 5 of the Gengetsu fight,
telegraphing the slightly random bullet columns.
These structures only took 1 push to reverse-engineer rather than the 2 I
needed for their TH05 counterparts because they are much simpler in this
game. The "structure" for Gengetsu's lines literally uses just a single X
position, with the remaining 24 bytes being basically padding. The only
minor bug I found on this shallow level concerns Marisa's bits, which are
clipped at the right and bottom edges of the playfield 16 pixels earlier
than you would expect:
The remaining push went to a bunch of smaller structures and functions:
The structure for the up to 2 "thick" (a.k.a. "Master Spark") lasers. Much
saner than the
📝 madness of TH05's laser system while being
equally customizable in width and duration.
The structure for the various monochrome 16×16 shapes in the background of
the Stage 6 Yuuka fight, drawn on top of the checkerboard.
The rendering code for the three falling stars in the background of Stage 5.
The effect here is entirely palette-related: After blitting the stage tiles,
the 📝 1bpp star image is ORed
into only the 4th VRAM plane, which is equivalent to setting the
highest bit in the palette color index of every pixel within the star-shaped
region. This of course raises the question of how the stage would look like
if it was fully illuminated:
The full tile map of TH04's Stage 5, in both dark and fully
illuminated views. Since the illumination effect depends on two
matching sets of palette colors that are distinguished by a single
bit, the illuminated view is limited to only 8 of the 16 colors. The
dark view, on the other hand, can freely use colors from the
illuminated set, since those are unaffected by the OR
operation.
Most code that modifies a stage's tile map, and directly specifies tiles via
their top-left offset in VRAM.
Thanks to code alignment reasons, this forced a much longer detour into the
.STD format loader. Nothing all too noteworthy there since we're still
missing the enemy script and spawn structures before we can call .STD
"reverse-engineered", but maybe still helpful if you're looking for an
overview of the format. Also features a buffer overflow landmine if a .STD
file happens to contain more than 32 enemy scripts… you know, the usual
stuff.
To top off the second push, we've got the vertically scrolling checkerboard
background during the Stage 6 Yuuka fight, made up of 32×32 squares. This
one deserves a special highlight just because of its needless complexity.
You'd think that even a performant implementation would be pretty simple:
Set the GRCG to TDW mode
Set the GRCG tile to one of the two square colors
Start with Y as the current scroll offset, and X
as some indicator of which color is currently shown at the start of each row
of squares
Iterate over all lines of the playfield, filling in all pixels that
should be displayed in the current color, skipping over the other ones
Count down Y for each line drawn
If Y reaches 0, reset it to 32 and flip X
At the bottom of the playfield, change the GRCG tile to the other color,
and repeat with the initial value of X flipped
The most important aspect of this algorithm is how it reduces GRCG state
changes to a minimum, avoiding the costly port I/O that we've identified
time and time again as one of the main bottlenecks in TH01. With just 2
state variables and 3 loops, the resulting code isn't that complex either. A
naive implementation that just drew the squares from top to bottom in a
single pass would barely be simpler, but much slower: By changing the GRCG
tile on every color, such an implementation would burn a low 5-digit number
of CPU cycles per frame for the 12×11.5-square checkerboard used in the
game.
And indeed, ZUN retained all important aspects of this algorithm… but still
implemented it all in ASM, with a ridiculous layer of x86 segment arithmetic
on top? Which blows up the complexity to 4 state
variables, 5 nested loops, and a bunch of constants in unusual units. I'm
not sure what this code is supposed to optimize for, especially with that
rather questionable register allocation that nevertheless leaves one of the
general-purpose registers unused. Fortunately,
the function was still decompilable without too many code generation hacks,
and retains the 5 nested loops in all their goto-connected
glory. If you want to add a checkerboard to your next PC-98
demo, just stick to the algorithm I gave above.
(Using a single XOR for flipping the starting X offset between 32 and 64
pixels is pretty nice though, I have to give him that.)
This makes for a good occasion to talk about the third and final GRCG mode,
completing the series I started with my previous coverage of the
📝 RMW and
📝 TCR modes. The TDW (Tile Data Write) mode
is the simplest of the three and just writes the 8×1 GRCG tile into VRAM
as-is, without applying any alpha bitmask. This makes it perfect for
clearing rectangular areas of pixels – or even all of VRAM by doing a single
memset():
// Set up the GRCG in TDW mode.
outportb(0x7C, 0x80);
// Fill the tile register with color #7 (0111 in binary).
outportb(0x7E, 0xFF); // Plane 0: (B): (********)
outportb(0x7E, 0xFF); // Plane 1: (R): (********)
outportb(0x7E, 0xFF); // Plane 2: (G): (********)
outportb(0x7E, 0x00); // Plane 3: (E): ( )
// Set the 32 pixels at the top-left corner of VRAM to the exact contents of
// the tile register, effectively repeating the tile 4 times. In TDW mode, the
// GRCG ignores the CPU-supplied operand, so we might as well just pass the
// contents of a register with the intended width. This eliminates useless load
// instructions in the compiled assembly, and even sort of signals to readers
// of this code that we do not care about the source value.
*reinterpret_cast<uint32_t far *>(MK_FP(0xA800, 0)) = _EAX;
// Fill the entirety of VRAM with the GRCG tile. A simple C one-liner that will
// probably compile into a single `REP STOS` instruction. Unfortunately, Turbo
// C++ 4.0J only ever generates the 16-bit `REP STOSW` here, even when using
// the `__memset__` intrinsic and when compiling in 386 mode. When targeting
// that CPU and above, you'd ideally want `REP STOSD` for twice the speed.
memset(MK_FP(0xA800, 0), _AL, ((640 / 8) * 400));
However, this might make you wonder why TDW mode is even necessary. If it's
functionally equivalent to RMW mode with a CPU-supplied bitmask made up
entirely of 1 bits (i.e., 0xFF, 0xFFFF, or
0xFFFFFFFF), what's the point? The difference lies in the
hardware implementation: If all you need to do is write tile data to
VRAM, you don't need the read and modify parts of RMW mode
which require additional processing time. The PC-9801 Programmers'
Bible claims a speedup of almost 2× when using TDW mode over equivalent
operations in RMW mode.
And that's the only performance claim I found, because none of these old
PC-98 hardware and programming books did any benchmarks. Then again, it's
not too interesting of a question to benchmark either, as the byte-aligned
nature of TDW blitting severely limits its use in a game engine anyway.
Sure, maybe it makes sense to temporarily switch from RMW to TDW mode
if you've identified a large rectangular and byte-aligned section within a
sprite that could be blitted without a bitmask? But the necessary
identification work likely nullifies the performance gained from TDW mode,
I'd say. In any case, that's pretty deep
micro-optimization territory. Just use TDW mode for the
few cases it's good at, and stick to RMW mode for the rest.
So is this all that can be said about the GRCG? Not quite, because there are
4 bits I haven't talked about yet…
And now we're just 5.37% away from 100% position independence for TH04! From
this point, another 2 pushes should be enough to reach this goal. It might
not look like we're that close based on the current estimate, but a
big chunk of the remaining numbers are false positives from the player shot
control functions. Since we've got a very special deadline to hit, I'm going
to cobble these two pushes together from the two current general
subscriptions and the rest of the backlog. But you can, of course, still
invest in this goal to allow the existing contributions to go to something
else.
… Well, if the store was actually open. So I'd better
continue with a quick task to free up some capacity sooner rather than
later. Next up, therefore: Back to TH02, and its item and player systems.
Shouldn't take that long, I'm not expecting any surprises there. (Yeah, I
know, famous last words…)
Stripe is now
properly integrated into this website as an alternative to PayPal! Now, you
can also financially support the project if PayPal doesn't work for you, or
if you prefer using a
provider out of Stripe's greater variety. It's unfortunate that I had to
ship this integration while the store is still sold out, but the Shuusou
Gyoku OpenGL backend has turned out way too complicated to be finished next
to these two pushes within a month. It will take quite a while until the
store reopens and you all can start using Stripe, so I'll just link back to
this blog post when it happens.
Integrating Stripe wasn't the simplest task in the world either. At first,
the Checkout API
seems pretty friendly to developers: The entire payment flow is handled on
the backend, in the server language of your choice, and requires no frontend
JavaScript except for the UI feedback code you choose to write. Your
backend API endpoint initiates the Stripe Checkout session, answers with a
redirect to Stripe, and Stripe then sends a redirect back to your server if
the customer completed the payment. Superficially, this server-based
approach seems much more GDPR-friendly than PayPal, because there are no
remote scripts to obtain consent for. In reality though, Stripe shares
much more potential personal data about your credit card or bank
account with a merchant, compared to PayPal's almost bare minimum of
necessary data.
It's also rather annoying how the backend has to persist the order form
information throughout the entire Checkout session, because it would
otherwise be lost if the server restarts while a customer is still busy
entering data into Stripe's Checkout form. Compare that to the PayPal
JavaScript SDK, which only POSTs back to your server after the
customer completed a payment. In Stripe's case, more JavaScript actually
only makes the integration harder: If you trigger the initial payment
HTTP request from JavaScript, you will have
to improvise a bit to avoid the CORS error when redirecting away to a
different domain.
But sure, it's all not too bad… for regular orders at least. With
subscriptions, however, things get much worse. Unlike PayPal, Stripe
kind of wants to stay out of the way of the payment process as much as
possible, and just be a wrapper around its supported payment methods. So if
customers aren't really meant to register with Stripe, how would they cancel
their subscriptions?
Answer: Through
the… merchant? Which I quite dislike in principle, because why should
you have to trust me to actually cancel your subscription after you
requested it? It also means that I probably should add some sort of UI for
self-canceling a Stripe subscription, ideally without adding full-blown user
accounts. Not that this solves the underlying trust issue, but it's more
convenient than contacting me via email or, worse, going through your bank
somehow. Here is how my solution works:
When setting up a Stripe subscription, the server will generate a random
ID for authentication. This ID is then used as a salt for a hash
of the Stripe subscription ID, linking the two without storing the latter on
my server.
The thank you page, which is parameterized with the Stripe
Checkout session ID, will use that ID to retrieve the subscription
ID via an API call to Stripe, and display it together with the above
salt. This works indefinitely – contrary to what the expiry field in the
Checkout session object suggests, Stripe sessions are indeed stored
forever. After all, Stripe also displays this session information in a
merchant's transaction log with an excessive amount of detail. It might have
been better to add my own expiration system to these pages, but this had
been taking long enough already. For now, be aware that sharing the link to
a Stripe thank you page is equivalent to sharing your subscription
cancellation password.
The salt is then used as the key for a subscription management page. To
cancel, you visit this page and enter the Stripe subscription ID to confirm.
The server then checks whether the salt and subscription ID pair belong to
each other, and sends the actual cancellation
request back to Stripe if they do.
I might have gone a bit overboard with the crypto there, but I liked the
idea of not storing any of the Stripe session IDs in the server database.
It's not like that makes the system more complex anyway, and it's nice to
have a separate confirmation step before canceling a subscription.
But even that wasn't everything I had to keep in mind here. Once you
switch from test to production mode for the final tests, you'll notice that
certain SEPA-based
payment providers take their sweet time to process and activate new
subscriptions. The Checkout session object even informs you about that, by
including a payment status field. Which initially seems just like
another field that could indicate hacking attempts, but treating it as such
and rejecting any unpaid session can also reject perfectly valid
subscriptions. I don't want all this control… 🥲
Instead, all I can do in this case is to tell you about it. In my test, the
Stripe dashboard said that it might take days or even weeks for the initial
subscription transaction to be confirmed. In such a case, the respective
fraction of the cap will unfortunately need to remain red for that entire time.
And that was 1½ pushes just to replicate the basic functionality of a simple
PayPal integration with the simplest type of Stripe integration. On the
architectural site, all the necessary refactoring work made me finally
upgrade my frontend code to TypeScript at least, using the amazing esbuild to handle transpilation inside
the server binary. Let's see how long it will now take for me to upgrade to
SCSS…
With the new payment options, it makes sense to go for another slight price
increase, from up to per push.
The amount of taxes I have to pay on this income is slowly becoming
significant, and the store has been selling out almost immediately for the
last few months anyway. If demand remains at the current level or even
increases, I plan to gradually go up to by the end
of the year. 📝 As📝 usual,
I'm going to deliver existing orders in the backlog at the value they were
originally purchased at. Due to the way the cap has to be calculated, these
contributions now appear to have increased in value by a rather awkward
13.33%.
This left ½ of a push for some more work on the TH01 Anniversary Edition.
Unfortunately, this was too little time for the grand issue of removing
byte-aligned rendering of bigger sprites, which will need some additional
blitting performance research. Instead, I went for a bunch of smaller
bugfixes:
ANNIV.EXE now launches ZUNSOFT.COM if
MDRV98 wasn't resident before. In hindsight, it's completely obvious
why this is the right thing to do: Either you start
ANNIV.EXE directly, in which case there's no resident
MDRV98 and you haven't seen the ZUN Soft logo, or you have
made a single-line edit to GAME.BAT and replaced
op with anniv, in which case MDRV98 is
resident and you have seen the logo. These are the two
reasonable cases to support out of the box. If you are doing
anything else, it shouldn't be that hard to adjust though?
You might be wondering why I didn't just include all code of
ZUNSOFT.COM inside ANNIV.EXE together with
the rest of the game. The reason: ZUNSOFT.COM has
almost nothing in common with regular TH01 code. While the rest of
TH01 uses the custom image formats and bad rendering code I
documented again and again during its RE process,
ZUNSOFT.COM fully relies on master.lib for everything
about the bouncing-ball logo animation. Its code is much closer to
TH02 in that respect, which suggests that ZUN did in fact write this
animation for TH02, and just included the binary in TH01 for
consistency when he first sold both games together at Comiket 52.
Unlike the 📝 various bad reasons for splitting the PC-98 Touhou games into three main executables,
it's still a good idea to split off animations that use a completely
different set of rendering and file format functions. Combined with
all the BFNT and shape rendering code, ZUNSOFT.COM
actually contains even more unique code than OP.EXE,
and only slightly less than FUUIN.EXE.
The optional AUTOEXEC.BAT is now correctly encoded in
Shift-JIS instead of accidentally being UTF-8, fixing the previous
mojibake in its final ECHO line.
The command-line option that just adds a stage selection without
other debug features (anniv s) now works reliably.
This one's quite interesting because it only ever worked
because of a ZUN bug. From a superficial look at the code, it
shouldn't: While the presence of an 's' branch proves
that ZUN had such a mode during development, he nevertheless forgot
to initialize the debug flag inside the resident structure within
this branch. This mode only ever worked because master.lib's
resdata_create() function doesn't clear the resident
structure after allocation. If anything on the system previously
happened to write something other than 0x00,
0x01, or 0x03 to the specific byte that
then gets repurposed as the debug mode flag, this lack of
initialization does in fact result in a distinct non-test and
non-debug stage selection mode.
This is what happens on a certain widely circulated .HDI copy of
TH01 that boots MS-DOS 3.30C. On this system, the memory that
master.lib will allocate to the TH01 resident structure was
previously used by DOS as stack for its kernel, which left the
future resident debug flag byte at address 9FF6:0012 at
a value of 0x12. This might be the entire reason why
game s is even widely documented to trigger a stage
selection to begin with – on the widely circulated TH04 .HDI that
boots MS-DOS 6.20, or on DOSBox-X, the s parameter
doesn't work because both DOS systems leave the resident debug flag
byte at 0x00. And since ANNIV.EXE pushes
MDRV98 into that area of conventional DOS RAM, anniv s
previously didn't work even on MS-DOS 3.30C.
Both bugs in the
📝 1×1 particle system during the Mima fight
have been fixed. These include the off-by-one error that killed off the
very first particle on the 80th
frame and left it in VRAM, and, just like every other entity type, a
replacement of ZUN's EGC unblitter with the new pixel-perfect and fast
one. Until I've rearchitected unblitting as a whole, the particles will
now merely rip barely visible 1×1 holes into the sprites they overlap.
The bomb value shown in the lowest line of the in-game
debug mode output is now right-aligned together with the rest of the
values. This ensures that the game always writes a consistent number
of characters to TRAM, regardless of the magnitude of the
bomb value, preventing the seemingly wrong
timer values that appeared in the original game
whenever the value of the bomb variable changed to a
lower number of digits:
Finally, I've streamlined VRAM page access changes, which allowed me to
consistently replace ZUN's expensive function call with the optimal two
inlined x86 instructions. Interestingly, this change alone removed
2 KiB from the binary size, which is almost all of the difference
between 📝 the P0234-1 release and this
one. Let's see how much longer we can make each new release of
ANNIV.EXE smaller than the previous one.
The final point, however, raised the question of what we're now going to do
about
📝 a certain issue in the 地獄/Jigoku Bad Ending.
ZUN's original expensive way of switching the accessed VRAM page was the
main reason behind the lag frames on slower PC-98 systems, and
search-replacing the respective function calls would immediately get us to
the optimized version shown in that blog post. But is this something we
actually want? If we wanted to retain the lag, we could surely preserve that
function just for this one instance… The discovery of this issue
predates the clear distinction between bloat, quirks, and bugs, so it makes
sense to first classify what this issue even is. The distinction comes all
down to observability, which I defined as changes to rendered frames
between explicitly defined frame boundaries. That alone would be enough to
categorize any cause behind lag frames as bloat, but it can't hurt to be
more explicit here.
Therefore, I now officially judge observability in terms of an infinitely
fast PC-98 that can instantly render everything between two explicitly
defined frames, and will never add additional lag frames. If we plan to port
the games to faster architectures that aren't bottlenecked by disappointing
blitter chips, this is the only reasonable assumption to make, in my
opinion: The minimum system requirements in the games' README files are
minimums, after all, not recommendations. Chasing the exact frame
drop behavior that ZUN must have experienced during the time he developed
these games can only be a guessing game at best, because how can we know
which PC-98 model ZUN actually developed the games on? There might even be
more than one model, especially when it comes to TH01 which had been in
development for at least two years before ZUN first sold it. It's also not
like any current PC-98 emulator even claims to emulate the specific timing
of any existing model, and I sure hope that nobody expects me to import a
bunch of bulky obsolete hardware just to count dropped frames.
That leaves the tearing, where it's much more obvious how it's a bug. On an
infinitely fast PC-98, the ドカーン
frame would never be visible, and thus falls into the same category as the
📝 two unused animations in the Sariel fight.
With only a single unconditional 2-frame delay inside the animation loop, it
becomes clear that ZUN intended both frames of the animation to be displayed
for 2 frames each:
No tearing, and 34 frames in total for the first of the two
instances of this animation.
Next up: Taking the oldest still undelivered push and working towards TH04
position independence in preparation for multilingual translations. The
Shuusou Gyoku OpenGL backend shouldn't take that much longer either,
so I should have lots of stuff coming up in May afterward.
So, TH02! Being the only game whose main binary hadn't seen any dedicated
attention ever, we get to start the TH02-related blog posts at the very
beginning with the most foundational pieces of code. The stage tile system
is the best place to start here: It not only blocks every entity that is
rendered on top of these tiles, but is curiously placed right next to
master.lib code in TH02, and would need to be separated out into its own
translation unit before we can do the same with all the master.lib
functions.
In late 2018, I already RE'd
📝 TH04's and TH05's stage tile implementation, but haven't properly documented it on this
blog yet, so this post is also going to include the details that are unique
to those games. On a high level, the stage tile system works identically in
all three games:
The tiles themselves are 16×16 pixels large, and a stage can use 100 of
them at the same time.
The optimal way of blitting tiles would involve VRAM-to-VRAM copies
within the same page using the EGC, and that's exactly what the games do.
All tiles are stored on both VRAM pages within the rightmost 64×400 pixels
of the screen just right next to the HUD, and you only don't see them
because the games cover the same area in text RAM with black cells:
The initial screen of TH02's Stage 1, with the tile source
area uncovered by filling the same area in text RAM with transparent
cells instead of black ones. In TH02, this also reveals how the tile
area ends with a bunch of glitch tiles, tinted blue in the image. These
are the result of ZUN unconditionally blitting 100 tile images every
time, regardless of how many are actually contained in an
.MPN file.
These glitch tiles are another good example of a ZUN
landmine. Their appearance is the result of reading heap memory
outside allocated boundaries, which can easily cause segmentation faults
when porting the game to a system with virtual memory. Therefore, these
would not just be removed in this game's Anniversary Edition, but on the
more conservative debloated branch as well. Since the game
never uses these tiles and you can't observe them unless you manipulate
text RAM from outside the confines of the game, it's not a bug
according to our definition.
To reduce the memory required for a map, tiles are arranged into fixed
vertical sections of a game-specific constant size.
The 6 24×8-tile sections defined in TH02's STAGE0.MAP, in
reverse order compared to how they're defined in the file. Note the
duplicated row at the top of the final section: The boss fight starts
once the game scrolled the last full row of tiles onto the top of the
screen, not the playfield. But since the PC-98 text chip
covers the top tile row of the screen with black cells, this final row
is never visible, which effectively reduces a map's final tile section
to 7 rows rather than 8.
The actual stage map then is simply a list of these tile sections,
ordered from the start/bottom to the top/end.
Any manipulation of specific tiles within the fixed tile sections has to
be hardcoded. An example can be found right in Stage 1, where the Shrine
Tank leaves track marks on the tiles it appears to drive over:
This video also shows off the two issues with Touhou's first-ever
midboss: The replaced tiles are rendered below the midboss
during their first 4 frames, and maybe ZUN should have stopped the
tile replacements one row before the timeout. The first one is
clearly a bug, but it's not so clear-cut with the second one. I'd
need to look at the code to tell for sure whether it's a quirk or a
bug.
The differences between the three games can best be summarized in a table:
TH02
TH04
TH05
Tile image file extension
.MPN
Tile section format
.MAP
Tile section order defined as part of
.DT1
.STD
Tile section index format
0-based ID
0-based ID × 2
Tile image index format
Index between 0 and 100, 1 byte
VRAM offset in tile source area, 2 bytes
Scroll speed control
Hardcoded
Part of the .STD format, defined per referenced tile
section
Redraw granularity
Full tiles (16×16)
Half tiles (16×8)
Rows per tile section
8
5
Maximum number of tile sections
16
32
Lowest number of tile sections used
5 (Stage 3 / Extra)
8 (Stage 6)
11 (Stage 2 / 4)
Highest number of tile sections used
13 (Stage 4)
19 (Extra)
24 (Stage 3)
Maximum length of a map
320 sections (static buffer)
256 sections (format limitation)
Shortest map
14 sections (Stage 5)
20 sections (Stage 5)
15 sections (Stage 2)
Longest map
143 sections (Stage 4)
95 sections (Stage 4)
40 sections (Stage 1 / 4 / Extra)
The most interesting part about stage tiles is probably the fact that some
of the .MAP files contain unused tile sections. 👀 Many
of these are empty, duplicates, or don't really make sense, but a few
are unique, fit naturally into their respective stage, and might have
been part of the map during development. In TH02, we can find three unused
sections in Stage 5:
The non-empty tile sections defined in TH02's STAGE4.MAP,
showing off three unused ones.
These unused tile sections are much more common in the later games though,
where we can find them in TH04's Stage 3, 4, and 5, and TH05's Stage 1, 2,
and 4. I'll document those once I get to finalize the tile rendering code of
these games, to leave some more content for that blog post. TH04/TH05 tile
code would be quite an effective investment of your money in general, as
most of it is identical across both games. Or how about going for a full-on
PC-98 Touhou map viewer and editor GUI?
Compared to TH04 and TH05, TH02's stage tile code definitely feels like ZUN
was just starting to understand how to pull off smooth vertical scrolling on
a PC-98. As such, it comes with a few inefficiencies and suboptimal
implementation choices:
The redraw flag for each tile is stored in a 24×25 bool
array that does nothing with 7 of the 8 bits.
During bombs and the Stage 4, 5, and Extra bosses, the game disables the
tile system to render more elaborate backgrounds, which require the
playfield to be flood-filled with a single color on every frame. ZUN uses
the GRCG's RMW mode rather than TDW mode for this, leaving almost half of
the potential performance on the table for no reason. Literally,
changing modes only involves changing a single constant.
The scroll speed could theoretically be changed at any time. However,
the function that scrolls in new stage tiles can only ever blit part of a
single tile row during every call, so it's up to the caller to ensure
that scrolling always ends up on an exact 16-pixel boundary. TH02 avoids
this problem by keeping the scroll speed constant across a stage, using 2
pixels for Stage 4 and 1 pixel everywhere else.
Since the scroll speed is given in pixels, the slowest speed would be 1
pixel per frame. To allow the even slower speeds seen in the final game,
TH02 adds a separate scroll interval variable that only runs the
scroll function every 𝑛th frame, effectively adding a prescaler to the
scroll speed. In TH04 and TH05, the speed is specified as a Q12.4 value
instead, allowing true fractional speeds at any multiple of
1/16 pixels. This also necessitated a fixed algorithm
that correctly blits tile lines from two rows.
Finally, we've got a few inconsistencies in the way the code handles the
two VRAM pages, which cause a few unnecessary tiles to be rendered to just
one of the two pages. Mentioning that just in case someone tries to play
this game with a fully cleared text RAM and wonders where the flickering
tiles come from.
Even though this was ZUN's first attempt at scrolling tiles, he already saw
it fit to write most of the code in assembly. This was probably a reaction
to all of TH01's performance issues, and the frame rate reduction
workarounds he implemented to keep the game from slowing down too much in
busy places. "If TH01 was all C++ and slow, TH02 better contain more ASM
code, and then it will be fast, right?"
Another reason for going with ASM might be found in the kind of
documentation that may have been available to ZUN. Last year, the PC-98
community discovered and scanned two new game programming tutorial books
from 1991 (1, 2).
Their example code is not only entirely written in assembly, but restricts
itself to the bare minimum of x86 instructions that were available on the
8086 CPU used by the original PC-9801 model 9 years earlier. Such code is
not only suboptimal
on the 486, but can often be actually worse than what your C++
compiler would generate. TH02 is where the trend of bad hand-written ASM
code started, and it
📝 only intensified in ZUN's later games. So,
don't copy code from these books unless you absolutely want to target the
earlier 8086 and 286 models. Which,
📝 as we've gathered from the recent blitting benchmark results,
are not all too common among current real-hardware owners.
That said, all that ASM code really only impacts readability and
maintainability. Apart from the aforementioned issues, the algorithms
themselves are mostly fine – especially since most EGC and GRCG operations
are decently batched this time around, in contrast to TH01.
Luckily, the tile functions merely use inline assembly within a
typical C function and can therefore be at least part of a C++ source file,
even if the result is pretty ugly. This time, we can actually be sure that
they weren't written directly in a .ASM file, because they feature x86
instruction encodings that can only be generated with Turbo C++ 4.0J's
inline assembler, not with TASM. The same can't unfortunately be said about
the following function in the same segment, which marks the tiles covered by
the spark sprites for redrawing. In this one, it took just one dumb hand-written ASM
inconsistency in the function's epilog to make the entire function
undecompilable.
The standard x86 instruction sequence to set up a stack frame in a function prolog looks like this:
PUSH BP
MOV BP, SP
SUB SP, ?? ; if the function needs the stack for local variables
When compiling without optimizations, Turbo C++ 4.0J will
replace this sequence with a single ENTER instruction. That one
is two bytes smaller, but much slower on every x86 CPU except for the 80186
where it was introduced.
In functions without local variables, BP and SP
remain identical, and a single POP BP is all that's needed in
the epilog to tear down such a stack frame before returning from the
function. Otherwise, the function needs an additional MOV SP,
BP instruction to pop all local variables. With x86 being the helpful
CISC architecture that it is, the 80186 also introduced the
LEAVE instruction to perform both tasks. Unlike
ENTER, this single instruction
is faster than the raw two instructions on a lot of x86 CPUs (and
even current ones!), and it's always smaller, taking up just 1 byte instead
of 3. So what if you use LEAVE even if your function
doesn't use local variables? The fact that the
instruction first does the equivalent of MOV SP, BP doesn't
matter if these registers are identical, and who cares about the additional
CPU cycles of LEAVE compared to just POP BP,
right? So that's definitely something you could theoretically do, but
not something that any compiler would ever generate.
And so, TH02 MAIN.EXE decompilation already hits the first
brick wall after two pushes. Awesome! Theoretically,
we could slowly mash through this wall using the 📝 code generator. But having such an inconsistency in the
function epilog would mean that we'd have to keep Turbo C++ 4.0J from
emitting any epilog or prolog code so that we can write our
own. This means that we'd once again have to hide any use of the
SI and DI registers from the compiler… and doing
that requires code generation macros for 22 of the 49 instructions of
the function in question, almost none of which we currently have. So, this
gets quite silly quite fast, especially if we only need to do it
for one single byte.
Instead, wouldn't it be much better if we had a separate build step between
compile and link time that allowed us to replicate mistakes like these by
just patching the compiled .OBJ files? These files still contain the names
of exported functions for linking, which would allow us to look up the code
of a function in a robust manner, navigate to specific instructions using a
disassembler, replace them, and write the modified .OBJ back to disk before
linking. Such a system could then naturally expand to cover all other
decompilation issues, culminating in a full-on optimizer that could even
recreate ZUN's self-modifying code. At that point, we would have sealed away
all of ZUN's ugly ASM code within a separate build step, and could finally
decompile everything into readable C++.
Pulling that off would require a significant tooling investment though.
Patching that one byte in TH02's spark invalidation function could be done
within 1 or 2 pushes, but that's just one issue, and we currently have 32
other .ASM files with undecompilable code. Also, note that this is
fundamentally different from what we're doing with the
debloated branch and the Anniversary Editions. Mistake patching
would purely be about having readable code on master that
compiles into ZUN's exact binaries, without fixing weird
code. The Anniversary Editions go much further and rewrite such code in
a much more fundamental way, improving it further than mistake patching ever
could.
Right now, the Anniversary Editions seem much more
popular, which suggests that people just want 100% RE as fast as
possible so that I can start working on them. In that case, why bother with
such undecompilable functions, and not just leave them in raw and unreadable
x86 opcode form if necessary… But let's first
see how much backer support there actually is for mistake patching before
falling back on that.
The best part though: Once we've made a decision and then covered TH02's
spark and particle systems, that was it, and we will have already RE'd
all ZUN-written PC-98-specific blitting code in this game. Every further
sprite or shape is rendered via master.lib, and is thus decently abstracted.
Guess I'll need to update
📝 the assessment of which PC-98 Touhou game is the easiest to port,
because it sure isn't TH01, as we've seen with all the work required for the first Anniversary Edition build.
Until then, there are still enough parts of the game that don't use any of
the remaining few functions in the _TEXT segment. Previously, I
mentioned in the 📝 status overview blog post
that TH02 had a seemingly weird sprite system, but the spark and point popup
() structures showed that the game just
stores the current and previous position of its entities in a slightly
different way compared to the rest of PC-98 Touhou. Instead of having
dedicated structure fields, TH02 uses two-element arrays indexed with the
active VRAM page. Same thing, and such a pattern even helps during RE since
it's easy to spot once you know what to look for.
There's not much to criticize about the point popup system, except for maybe
a landmine that causes sprite glitches when trying to display more than
99,990 points. Sadly, the final push in this delivery was rounded out by yet
another piece of code at the opposite end of the quality spectrum. The
particle and smear effects for Reimu's bomb animations consist almost
entirely of assembly bloat, which would just be replaced with generic calls
to the generic blitter in this game's future Anniversary Edition.
If I continue to decompile TH02 while avoiding the brick wall, items would
be next, but they probably require two pushes. Next up, therefore:
Integrating Stripe as an alternative payment provider into the order form.
There have been at least three people who reported issues with PayPal, and
Stripe has been working much better in tests. In the meantime, here's a temporary Stripe
order link for everyone. This one is not connected to the cap yet, so
please make sure to stay within whatever value is currently shown on the
front page – I will treat any excess money as donations.
If there's some time left afterward, I might
also add some small improvements to the TH01 Anniversary Edition.
📝 Posted:
🏷 Tags:
Turns out I was not quite done with the TH01 Anniversary Edition yet.
You might have noticed some white streaks at the beginning of Sariel's
second form, which are in fact a bug that I accidentally added to the
initial release.
These can be traced back to a quirk
I wasn't aware of, and hadn't documented so far. When defeating Sariel's
first form during a pattern that spawns pellets, it's likely for the second
form to start with additional pellets that resemble the previous pattern,
but come out of seemingly nowhere. This shouldn't really happen if you look
at the code: Nothing outside the typical pattern code spawns new pellets,
and all existing ones are reset before the form transition…
Except if they're currently showing the 10-frame delay cloud
animation , activated for all pellets during the symmetrical radial 2-ring
pattern in Phase 2 and left activated for the rest of the fight. These
pellets will continue their animation after the transition to the second
form, and turn into regular pellets you have to dodge once their animation
completed.
By itself, this is just one more quirk to keep in mind during refactoring.
It only turned into a bug in the Anniversary Edition because the game tracks
the number of living pellets in a separate counter variable. After resetting
all pellets, this counter is simply set to 0, regardless of any delay cloud
pellets that may still be alive, and it's merely incremented or decremented
when pellets are spawned or leave the playfield.
In the original game, this counter is only used as an optimization to skip
spawning new pellets once the cap is reached. But with batched
EGC-accelerated unblitting, it also makes sense to skip the rather costly
setup and shutdown of the EGC if no pellets are active anyway. Except if the
counter you use to check for that case can be 0 even if there are
pellets alive, which consequently don't get unblitted…
There is an optimal fix though: Instead of unconditionally resetting the
living pellet counter to 0, we decrement it for every pellet that
does get reset. This preserves the quirk and gives us a
consistently correct counter, allowing us to still skip every unnecessary
loop over the pellet array.
Cutting out the lengthy defeat animation makes it easier to see where the
additional pellets come from.
Cutting out the lengthy defeat animation makes it easier to see where the
additional pellets come from. Also, note how regular unblitting resumes
once the first pellet gets clipped at the top of the playfield – the
living pellet counter then gets decremented to -1, and who uses
<= rather than == on a seemingly unsigned
counter, right?
Cutting out the lengthy defeat animation makes it easier to see where the
additional pellets come from.
Ultimately, this was a harmless bug that didn't affect gameplay, but it's
still something that players would have probably reported a few more times.
So here's a free bugfix:
128 commits! Who would have thought that the ideal first release of the TH01
Anniversary Edition would involve so much maintenance, and raise so many
research questions? It's almost as if the real work only starts after
the 100% finalization mark… Once again, I had to steal some funding from the
reserved JIS trail word pushes to cover everything I liked to research,
which means that the next towards the
anything goal will repay this debt. Luckily, this doesn't affect any
immediate plans, as I'll be spending March with tasks that are already fully
funded.
So, how did this end up so massive? The list of things I originally set out
to do was pretty short:
Build entire game into single executable
Fix rendering issues in the one or two most important parts of the game
for a good initial impression
But even the first point already started with tons of little cleanup
commits. A part of them can definitely be blamed on the rush to hit the 100%
decompilation mark before the 25th anniversary last August.
However, all the structural changes that I can't commit to
master reveal how much of a mess the TH01 codebase actually
is.
Merging the executables is mainly difficult because of all the
inconsistencies between REIIDEN.EXE and FUUIN.EXE.
The worst parts can be found in the REYHI*.DAT format code and
the High Score menu, but the little things are just as annoying, like how
the current score is an unsigned variable in
REIIDEN.EXE, but a signed one in FUUIN.EXE.
If it takes me this long and this many
commits just to sort out all of these issues, it's no wonder that the only
thing I've seen being done with this codebase since TH01's 100%
decompilation was a single porting attempt that ended in a rather quick
ragequit.
So why are we merging the executables in preparation for the Anniversary
Edition, and not waiting with it until we start doing ports?
Distributing and updating one executable is cleaner than doing the same
with three, especially as long as installation will still involve manually
dropping the new binary into the game directory.
The Anniversary Edition won't be the only fork binary. We are already
going to start out with a separate DEBLOAT.EXE that contains
only the bloat removal changes without any bug fixes, and spaztron64
will probably redo his seizure-less edition. We don't want to clutter
the game directory with three binaries for each of these fork builds, and we
especially don't want to remember things like oh, but this fork
only modifies REIIDEN.EXE…
All forks should run side-by-side with the original game. During the
time I was maintaining thcrap, I've had countless bug reports of people
assuming that thcrap was
responsible for bugs that were present in the original game, and the
same is certain to happen with the Anniversary Edition. Separate binaries
will make it easier for everyone to check where these bugs came from.
Also, I'd like to make a point about how bloated the original
three-executable structure really is, since I've heard people defending it
as neat software architecture. Really, even in Real Mode where you typically
want to use as little of the 640 KiB of conventional memory as possible, you
don't want to split your game up like this.
The game actually is so bloated that the combined binary ended up
smaller than the original REIIDEN.EXE. If all you see are the
file sizes of the original three executables, this might look like a
pretty impressive feat. Like, how can we possibly get 407,812
bytes into less than 238,612 bytes, without using compression?
If you've ever looked at the linker map though, it's not at all surprising.
Excluding the aforementioned inconsistencies that are hard to quantify,
OP.EXE and FUUIN.EXE only feature 5,767 and 6,475
bytes of unique code and data, respectively. All other code in these
binaries is already part of REIIDEN.EXE, with more than half of
the size coming from the Borland C++ runtime. The single worst offender here
is the C++ exception handler that Borland forces
onto every non-.COM binary by default, which alone adds 20,512 bytes
even if your binary doesn't use C++ exceptions.
On a more hilarious note, this
single line is responsible for pulling another unnecessary 14,242 bytes
into OP.EXE and FUUIN.EXE. This floating-point
multiplication is completely unnecessary in this context because all
possible parameters are integers, but it's enough for Turbo C++ and TLINK to
pull in the entire x87 FPU emulation machinery. These two binaries don't
even draw lines, but since this function is part of the general
graphics code translation unit and contains other functions that these
binaries do need, TLINK links in the entire thing. Maybe, multiple
executables aren't the best choice either if you use a linker that can't do
dead code elimination…
Since the 📝 Orb's physics do turn the entire
precision of a double variable into gameplay effects, it's not
feasible to ever get rid of all FPU code in TH01. The exception handler,
however, can
be removed, which easily brings the combined binary below the size of
the original REIIDEN.EXE. Compiling all code with a single set
of compiler optimization flags, including the more x86-friendly
pascal calling convention, then gets us a few more KB on top.
As does, of course, removing unused code: The only remaining purpose of
features such as 📝 resident palettes is to
potentially make porting more difficult for anyone who doesn't immediately
realize that nothing in the game uses these functions.
Technically, all unused code would be bloat, but for now, I'm keeping
the parts that may tell stories about the game's development history (such
as unused effects or the 📝 mouse cursor), or
that might help with debugging. Even with that in mind, I've only scratched
the surface when it comes to bloat removal, and the binary is only going to
get smaller from here. A lot smaller.
If only we now could start MDRV98 from this new combined binary, we wouldn't
need a second batch file either…
Which brings us to the first big research question of this delivery. Using
the C spawn() function works fine on this compiler, so
spawn("MDRV98.COM") would be all we need to do, right? Except
that the game crashes very soon after that subprocess returned.
So it's not going to be that easy if the spawned process is a TSR.
But why should this be a problem? Let's take a look at the DOS heap, and how
DOS lays out processes in conventional memory if we launch the game
regularly through GAME.BAT:
The rough layout of the DOS heap when launching TH01 from
GAME.BAT.
The batch file starts MDRV98 first, which will therefore end up below
the game in conventional memory. This is perfect for a TSR: The program can
resize itself arbitrarily before returning to DOS, and the rest of memory
will be left over for the game. If we assume such a layout, a DOS program
can implement a custom memory allocator in a very simple way, as it only has
to search for free memory in one direction – and this is exactly how Borland
implemented the C heap for functions like malloc() and
free(), and the C++ new and delete
operators.
But if we spawn MDRV98 after starting TH01, well…
MDRV98 will spawn in the next free memory location, allocate itself, return
to TH01… which suddenly finds its C heap blocked from growing. As a result,
the next big allocation will immediately fail with a rather misleading "out
of memory" error.
So, what can we do about this? Still in a bloat removal mindset, my gut
reaction was to just throw out Borland's C heap implementation, and replace
it with a very thin wrapper around the DOS heap as managed by INT 21h,
AH=48h/49h/4Ah. Like, why
did these DOS compilers even bother with a custom allocator in the first
place if DOS already comes with a perfectly fine native one? Using the
native allocator would completely erase the distinction between TSR memory
and game memory, and inherently allow the game to allocate beyond
MDRV98.
I did in fact implement this, and noticed even more benefits:
While DOS uses 16 bytes rather than Borland's 4 bytes for the control
structure of each memory block, this larger size automatically aligns all
allocations to 16-byte boundaries. Therefore, all allocation addresses would
fit into 16-bit segment-only pointers rather than needing 32-bit
far ones. On the Borland heap, the 4-byte header further limits
regular far pointers to 65,532 bytes, forcing you into
expensive huge pointers for bigger allocations.
Debuggers in DOS emulators typically have features to show and manage
the DOS heap. No need for custom debugging code.
You can change the memory placement
strategy to allocate from the top of conventional memory down to the
bottom. This is how the games allocate their resident structures.
Ultimately though, the drawbacks became too significant. Most of them are
related to the PC-98 Touhou games only ever creating a single DOS
process, even though they contain multiple executables.
Switching executables is done via exec(), which resizes a
program's main allocation to match the new binary and then overwrites the
old program image with the new one. If you've ever wondered why DOSBox-X
only ever shows OP as the active process name in the title bar,
you now know why. As far as DOS is concerned, it's still the same
OP.EXE process rooted at the same segment, and
exec() doesn't bother rewriting the name either. Most
importantly though, this is how REIIDEN.EXE can launch into
another REIIDEN.EXE process even if there are less than 238,612
bytes free when exec() is called, and without consuming more
memory for every successive binary.
For now, ANNIV.EXE still re-exec()s itself at
every point where the original game did, as ZUN's original code really
depends on being reinitialized at boss and scene boundaries. The resulting
accidental semi-hot reloading is also a useful property to retain
during development.
So why is the DOS heap a bad idea for regular game allocation after all?
Even DOS automatically releases all memory associated with a process
during its termination. But since we keep running the same process until the
player quits out of the main menu, we lose the C heap's implicit cleanup on
exec(), and have to manually free all memory ourselves.
Since the binary can be larger after hot reloading, we in fact have
to allocate all regular memory using the last fit strategy.
Otherwise, exec() fails to resize the program's main block for
the same reason that crashed the game on our initial attempt to
spawn("MDRV98.COM").
Just like Borland's heap implementation, the DOS heap stores its control
structures immediately before each allocation, forming a singly linked list.
But since the entire OS shares this single list, corruptions from heap
overflows also affect the whole system, and become much more disastrous.
Theoretically, it might be possible to recover from them by forcibly
releasing all blocks after the last correct one, or even by doing a
brute-force search for valid memory
control blocks, but in reality, DOS will likely just throw error code #7
(ERROR_ARENA_TRASHED) on the next memory management syscall,
forcing a reboot.
With a custom allocator, small corruptions remain isolated to the process.
They can be even further limited if the process adds some padding between
its last internal allocation and the end of the allocated DOS memory block;
Borland's heap sort of does this as well by always rounding up the DOS block
to a full KiB. All this might not make a difference in today's emulated and
single-tasked usage, but would have back then when software was still
developed inside IDEs running on the same system.
TH01's debug mode uses heapcheck() and
heapchecknode(), and reimplementing these on top of the DOS
heap is not trivial. On the contrary, it would be the most complicated part
of such a wrapper, by far.
I could release this DOS heap wrapper in unused form for another push if
anyone's interested, but for now, I'm pretty happy with not actually using
it in the games. Instead, let's stay with the Borland C heap, and find a way
to push MDRV98 to the very top of conventional RAM. Like this:
Which is much easier said than done. It would be nice if we could just use
the last fit allocation strategy here, but .COM executables always
receive all free memory by default anyway, which eliminates any difference
between the strategies.
But we can still change memory itself. So let's temporarily claim all
remaining free memory, minus the exact amount we need for MDRV98, for our
process. Then, the only remaining free space to spawn MDRV98 is at the exact
place where we want it to be:
Obviously, we release all the additional memory after spawning MDRV98.
Now we only need to know how much memory to not temporarily allocate. First,
we need to replicate the assumption that MDRV98's -M7
command-line parameter corresponds to a resident size of 23,552 bytes. This
is not as bad as it seems, because the -M parameter explicitly
has a KiB unit, and we can nicely abstract it away for the API.
The (env.) block though? Its minimum size equals the combined length
of all environment variables passed to the process, but its maximum size is…
not limited at all?! As in, DOS implementations can add and have
historically added more free space because some programs insisted on storing
their own new environment variables in this exact segment. DOSBox and
DOSBox-X follow this tradition by providing a configuration option for the
additional amount of environment space, with the latter adding 1024
additional bytes by default, y'know, just in case someone wants to compile
FreeDOS on a slow emulator. It's not even worth sending a bug report for
this specific case, because it's only a symptom of the fact that
unexpectedly large program environment blocks can and will happen, and are
to be expected in DOS land.
So thanks to this cruel joke, it's technically impossible to achieve what we
want to do there. Hooray! The only thing we can kind of do here is an
educated guess: Sum up the length of all environment variables in our
environment block, compare that length against the allocated size of the
block, and assume that the MDRV98 process will get as much additional memory
as our process got. 🤷
The remaining hurdles came courtesy of some Borland C runtime implementation
details. You would think that the temporary reallocation could even be done
in pure C using the sbrk(), coreleft(), and
brk() functions, but all values passed to or returned from
these functions are inaccurate because they don't factor in the
aforementioned KiB padding to the underlying DOS memory block. So we have to
directly use the DOS syscalls after all. Which at least means that learning
about them wasn't completely useless…
The final issue is caused inside Borland's
spawn() implementation. The environment block for the
child process is built out of all the strings reachable from C's
environ pointer, which is what that FreeDOS build process
should have used. Coalescing them into a single buffer involves yet
another C heap allocation… and since we didn't report our DOS memory block
manipulation back to the C heap, the malloc() call might think
it needs to request more memory from DOS. This resets the DOS memory block
back to its intended level, undoing our manipulation right before the actual
INT 21h, AH=4Bh
EXEC syscall. Or in short:
Manipulate DOS heap ➜ spawn() call ➜_LoadProg() ➜ allocate and prepare environment block ➜ _spawn() ➜ DOS EXEC syscall
The obvious solution: Replace _LoadProg(), implement the
coalescing ourselves, and do it before the heap manipulation. Fortunately,
Borland's internal low-level _spawn() function is not
static, so we can call it ourselves whenever we want to:
Allocate and prepare environment block ➜ manipulate DOS heap ➜ _spawn() call ➜EXEC syscall
So yes, launching MDRV98 from C can be done, but it involves advanced
witchcraft and is completely ridiculous.
Launching external sound drivers from a batch file is the right way
of doing things.
Fortunately, you don't have to rely on this auto-launching feature. You can
still launch DEBLOAT.EXE or ANNIV.EXE from a batch
file that launched MDRV98.COM before, and the binaries will
detect this case and skip the attempt of launching MDRV98 from C. It's
unlikely that my heuristic will ever break, but I definitely recommend
replicating GAME.BAT just to be completely sure – especially
for user-friendly repacks that don't want to include the original game
anyway.
This is also why ANNIV.EXE doesn't launch
ZUNSOFT.COM: The "correct" and stable way to launch
ANNIV.EXE still involves a batch file, and I would say that
expecting people to remove ZUNSOFT.COM from that file is worse
than not playing the animation. It's certainly a debate we can have, though.
This deep dive into memory allocation revealed another previously
undocumented bug in the original game. The RLE decompression code for the
東方靈異.伝 packfile contains two heap overflows, which are
actually triggered by SinGyoku's BOSS1_3.BOS and Konngara's
BOSS8_1.BOS. They only do not immediately crash the game when
loading these bosses thanks to two implementation details of Borland's C
heap.
Obviously, this is a bug we should fix, but according to the definition of
bugs, that fix would be exclusive to the anniversary branch.
Isn't that too restrictive for something this critical? This code is
guaranteed to blow up with a different heap implementation, if only in a
Debug build. And besides, nobody would notice a fix
just by looking at the game's rendered output…
Looks like we have to introduce a fourth category of weird code, in addition
to the previous bloat, bug, and quirk categories, for
invisible internal issues like these. Let's call it landmine, and fix
them on the debloated branch as well. Thanks to
Clerish for the naming inspiration!
With this new category, the full definitions for all categories have become
quite extensive. Thus, they now live in CONTRIBUTING.md
inside the ReC98 repository.
With the new discoveries and the new landmine category, TH01 is now at 67
bugs and 20 landmines. And the solution for the landmine in question? Simplifying
the 61 lines of the original code down to 16. And yes, I'm including
comments in these numbers – if the interactions of the code are complex
enough to require multi-paragraph comments, these are a necessary and
valid part of the code.
While we're on the topic of weird code and its visible or invisible effects,
there's one thing you might be concerned about. With all the rearchitecting
and data shifting we're doing on the debloated branch, what
will happen to the 📝 negative glitch stages?
These are the result of a clearly observable bug that, by definition, must
not be fixed on the debloated branch. But given that the
observable layout of the glitch stages is defined by the memory
surrounding the scene stage variable, won't the
debloated branch inherently alter their appearance (= ⚠️
fanfiction ⚠️), or even remove them completely?
Well, yes, it will. But we can still preserve their layout by
hardcoding
the exact original data that the game would originally read, and even emulate
the original segment relocations and other pieces of global data.
Doing this is feasible thanks to the fact that there are only 4 glitch
stages. Unfortunately, the same can't be said for the timer values, which
are determined by an array lookup with the un-modulo'd stage ID. If we
wanted to preserve those as well, we'd have to bundle an exact copy of the
original REIIDEN.EXE data segment to preserve the values of all
32,768 negative stages you could possibly enter, together with a map
of all relocations in this segment. 😵 Which I've decided against for now,
since this has been going on for far too long already. Let's first see if
anyone ever actually complains about details like this…
Alright, time to start the anniversary branch by rendering
everything at its correct internal unaligned X position? Eh… maybe not quite
yet. If we just hacked all the necessary bit-shifting code into all the
format-specific blitting functions, we'd still retain all this largely
redundant, bad, and slow code, and would make no progress in terms of
portability. It'd be much better to first write a single generic blitter
that's decently optimized, but supports all kinds of sprites to make this
optimization actually worth something.
So, next research question: How would such a blitter look like? After I
learned during my
📝 first foray into cycle counting that port
I/O is slow on 486 CPUs, it became clear that TH04's
📝 GRCG batching for pellets was one of the
more useful optimizations that probably contributed a big deal towards
achieving the high bullet counts of that game. This leads to two
conclusions:
master.lib's super_*() sprite functions are slow, and not
worth looking at for inspiration. Even the 📝 tiny format reinitializes the GRCG on every color change, wasting 80
cycles.
Hence, our low-level blitting API should not even care about colors. It
should only concern itself with blitting a given 1bpp sprite to a single
VRAM segment. This way, it can work for both 4-plane sprites and
single-plane sprites, and just assume that the GRCG is active.
Maybe we should also start by not even doing these unaligned bit shifts
ourselves, and instead expect the call site to
📝 always deliver a byte-aligned sprite that is correctly preshifted,
if necessary? Some day, we definitely should measure how slow runtime
shifting would really be…
What we should do, however, are some further general optimizations that I
would have expected from master.lib: Unrolling the vertical
loop, and baking a single function for every sprite width to eliminate
the horizontal loop. We can then use the widest possible x86
MOV instruction for the lowest possible number of cycles per
row – for example, we'd blit a 56-wide sprite with three MOVs
(32-bit + 16-bit + 8-bit), and a 64-wide one with two 32-bit
MOVs.
Or maybe not? There's a lot of blitting code in both master.lib and PC-98
Touhou that checks for empty bytes within sprites to skip needlessly writing
them to VRAM:
Which goes against everything you seem to know about computers. We aren't
running on an 8-bit CPU here, so wouldn't it be faster to always write both
halves of a sprite in a single operation?
That's a single CPU instruction, compared to two instructions and two
branches. The only possible explanation for this would be that VRAM writes
are so slow on PC-98 that you'd want to avoid them at all costs, even
if that means additional branching on the CPU to do so. Or maybe that was
something you would want to do on certain models with slow VRAM, but not on
others?
So I wrote a benchmark to answer all these questions, and to compare my new
blitter against typical TH01 blitting code:
A not really representative run on DOSBox-X. Since the master.lib sprite
functions are also unbatched, I expect them to not be much faster than
the naive C implementation.
2023-03-05-blitperf.zip
And here are the real-hardware results I've got from the PC-9800
Central Discord server:
PC-286LS
PC-9801ES
PC-9821Cb/Cx
PC-9821Ap3
PC-9821An
PC-9821Nw133
PC-9821Ra20
80286, 12 MHz
i386SX, 16 MHz
486SX, 33 MHz
486DX4, 100 MHz
Pentium, 90 MHz
Pentium, 133 MHz
Pentium Pro, 200 MHz
1987
1989
1994
1994
1994
1997
1996
Unchecked
C
GRCG
36,85
38,42
26,02
26,87
3,98
4,13
2,08
2,16
1,81
1,87
0,86
0,89
1,25
1,25
MOVS
GRCG
15,22
16,87
9,33
10,19
1,22
1,37
0,44
0,44
MOV
GRCG
15,42
17,08
9,65
10,53
1,15
1,3
0,44
0,44
4-plane
37,23
43,97
29,2
32,96
4,44
5,01
4,39
4,67
5,11
5,32
5,61
5,74
6,63
6,64
Checking first
GRCG
17,49
19,15
10,84
11,72
1,27
1,44
1,04
1,07
0,54
0,54
4-plane
46,49
53,36
35,01
38,79
5,66
6,26
5,43
5,74
6,56
6,8
8,08
8,29
10,25
10,29
Checking second
GRCG
16,47
18,12
10,77
11,65
1,25
1,39
1,02
0,51
0,51
4-plane
43,41
50,26
33,79
37,82
5,22
5,81
5,14
5,43
6,18
6,4
7,57
7,77
9,58
9,62
Checking both
GRCG
16,14
18,03
10,84
11,71
1,33
1,49
1,01
0,49
0,49
4-plane
43,61
50,45
34,11
37,87
5,39
5,99
4,92
5,23
5,88
6,11
7,19
7,43
9,1
9,13
Amount of frames required to render 2000 16×8 pellet sprites on a variety of
PC-98 models, using the new generic blitter. Both preshifted (first column)
and runtime-shifted (second column) sprites were tested; empty columns
correspond to times faster than a single frame. Thanks to cuba200611,
Shoutmon, cybermind, and Digmac for running the tests!
The key takeaways:
Checking for empty bytes has never been a good idea.
Preshifting sprites made a slight difference on the 286. Starting with
the 386 though, that difference got smaller and smaller, until it completely
vanished on Pentium models. The memory tradeoff is especially not worth it
for 4-plane sprites, given that you would have to preshift each of the 4
planes and possibly even a fifth alpha plane. Ironically, ZUN only ever
preshifted monochrome single-bitplane sprites with a width of 8 pixels.
That's the smallest possible amount of memory a sprite can possibly take,
and where preshifting consequently has the smallest effect on performance.
Shifting 8-wide sprites on the fly literally takes a single ROL
or ROR instruction per row.
You might want to use MOVS instead of MOV when
targeting the 286 and 386, but the performance gains are barely worth the
resulting mess you would make out of your blitting code. On Pentium models,
there is no difference.
Use the GRCG whenever you have to render lots of things that share a
static 8×1 pattern.
These are the PC-98 models that the people who are willing to test your
newly written PC-98 code actually use.
Since this won't be the only piece of game-independent and explicitly
PC-98-specific custom code involved in this delivery, it makes sense to
start a
dedicated PC-98 platform layer. This code will gradually eliminate the
dependency on master.lib and replace it with better optimized and more
readable C++ code. The blitting benchmark, for example, is already
implemented completely without master.lib.
While this platform layer is mainly written to generate optimal code within
Turbo C++ 4.0J, it can also serve as general PC-98 documentation for
everyone who prefers code over machine-translating old Japanese books. Not
to mention the immediacy of having all actual relevant information in
one place, which might otherwise be pretty well hidden in these books, or
some obscure old text file. For example, did you know that uploading gaiji
via INT 18h might end up disabling the VSync interrupt trigger,
deadlocking the process on the next frame delay loop? This nuisance is not
replicated by any emulators, and it's quite frustrating to encounter it when
trying to run your code on real hardware. master.lib works around it by
simply hooking INT 18h and unconditionally reenabling the VSync
interrupt trigger after the original handler returns, and so does our
platform layer.
So, with the pellet draw calls batched and routed through the new renderer,
we should have gained enough free CPU cycles to disable
📝 interlaced pellet rendering without any
impact on frame rates?
Well, kinda. We do get 56.4 FPS, but only together with noticeable and
reproducible tearing in the top part of the playfield, suggesting exactly
why ZUN interlaced the rendering in the first place. 😕 So have we
already reached the limit of single-buffered PC-98 games here, or can we
still do something about it?
As it turns out, the main bottleneck actually lies in the pellet
unblitting code. Every EGC-"accelerated" unblitting call in TH01 is
as unbatched as the pellet blitting calls were, spending an additional 17
I/O port writes per call to completely set up and shut down the EGC, every
time. And since this is TH01, the two-instruction operation of changing the
active PC-98 VRAM page isn't inlined either, but instead done via a function
call to a faraway segment. On the 486, that's:
>341 cycles for EGC setup and teardown, plus
>72 cycles for each 16-pixel chunk to be unblitted.
This sums up to
>917 cycles of completely unnecessary work for every active pellet,
in the optimal 50% of cases where it lies on an even VRAM byte,
or
>1493 cycles if it lies on an odd VRAM byte, because ZUN's code
extends the unblitted rectangle to a gargantuan 32×8 pixels in this case
And this calculation even ignores the lack of small micro-optimizations that
could further optimize the blitting loop. Multiply that by the game's pellet
cap of 100, and we get a 6-digit number of wasted CPU cycles. On
paper, that's roughly 1/6 of the time we have for each
of our target 56.423 FPS on the game's target 33 MHz systems. Might not
sound all too critical, but the single-buffered nature of the game means
that we're effectively racing the beam on every frame. In turn, we have to
be even more serious about performance.
So, time to also add a batched EGC API to our PC-98 platform layer? Writing
our own EGC code presents a nice opportunity to finally look deeper into all
its registers and configuration options, and see what exactly we can do
about ZUN's enforced 16-pixel alignment.
To nobody's surprise, this alignment is completely unnecessary, and only
displays a lack of knowledge about the chip. While it is true that
the EGC wants VRAM to be exclusively addressed in 16-bit chunks at
16-bit-aligned addresses, it specifically provides
an address register (0x4AC) for shifting the horizontal
start offsets of the source and destination to any pixel within the
16 pixels of such a chunk, and
a bit length register (0x4AE) for specifying the total
width of pixels to be transferred, which also implies the correct end
offsets.
And it gets even better: After ⌈bitlength ÷ 16⌉ write
instructions, the EGC's internal shifter state automatically reinitializes
itself in preparation for blitting another row of pixels with the same
initially configured bit addresses and length. This is perfect for blitting
rectangles, as two I/O port writes before the start of your blitting loop
are enough to define your entire rectangle.
The manual nature of reading and writing in 16-pixel chunks does come with a
slight pitfall though. If the source bit address is larger than the
destination bit address, the first 16-bit read won't fill the EGC's internal
shift register with all pixels that should appear in the first 16-pixel
destination chunk. In this case, the EGC simply won't write anything and
leave the first chunk unchanged. In a
📝 regular blitting loop, however, you expect
that memory to be written and immediately move on to the next chunks within
the row. As a result, the actual blitting process for such a rectangle will
no longer be aligned to the configured address and bit length. The first row
of the rectangle will appear 16 pixels to the right of the destination
address, and the second one will start at bit offset 0 with pixels from the
rightmost byte of the first line, which weren't blitted and remained in the
tile register.
There is an easy solution though: Before the horizontal loop on each line of
the rectangle, simply read one additional 16-pixel chunk from the source
location to prefill the shift register. Thankfully, it's large enough to
also fit the second read of the then full 16 pixels, without dropping any
pixels along the way.
And that's how we get arbitrarily unaligned rectangle copies with the EGC!
Except for a small register allocation trick to use two-register addressing,
there's not much use in further optimizations, as the runtime of these
inter-page blit operations is dominated by the VRAM page switches anyway.
Except that T98-Next seems to disagree about the register prefilling issue:
Every other emulator agrees with real hardware in this regard, so we can
safely assume this to be a bug in T98-Next. Just in case this old emulator
with its last release from June 2010 still has any fans left nowadays… For
now though, even they can still enjoy the TH01 Anniversary Edition: The only
EGC copy algorithm that TH01 actually needs is the left one during the
single-buffered tests, which even that emulator gets right.
That only leaves
📝 my old offer of documenting the EGC raster ops,
and we've got the EGC figured out completely!
And that did in fact remove tearing from the pellet rendering function! For
the first time, we can now fight Elis, Kikuri, Sariel, and Konngara with a
doubled pellet frame rate:
Switchable videos like these can nicely provide evidence that these
changes have no effect on gameplay, making it easy to see that the Orb
still collides with all pellets on the same frames. Also, check out the
difference in remaining conventional memory (coreleft)…
With only pellets and no other animation on screen, this exact pattern
presents the optimal demonstration case for the new unblitter. But as you
can already tell from the invincibility sprites, we'd also need to route
every other kind of sprite through the same new code. This isn't all too
trivial: Most sprites are still rendered at byte-aligned positions, and
their blitting APIs hide that fact by taking a pixel position regardless.
This is why we can't just replace ZUN's original 16-pixel-aligned EGC
unblitting function with ours, and always have to replace both the blitter
and the unblitter on a per-sprite basis.
To completely remove all flickering, we'd also like to get rid of all the
sprite-specific unblit ➜ update ➜ render sequences, and instead
gather all unblitting code to the beginning of the game loop, before any
update and rendering calls. So yeah, it will take a long time to completely
get rid of all flickering. Until we're there, I recommend any backer to tell
me their favorite boss, so that I can focus on getting that one
rendered without any flickering. Remember that here at ReC98, we can have a
Touhou character popularity contest at any time during the year, whenever
the store is open!
In the meantime, the consistent use of 8×8 rectangles during pellet
unblitting does significantly reduce flickering across the entire game,
and shrinks certain holes that pellets tend to rip into lazily reblitted
sprites:
SinGyoku's "crossing pellets" pattern, shortly before completing
the transformation back to the sphere.
To round out the first release, I added all the other bug fixes to achieve
parity with my previously released patched REIIDEN.EXE builds:
I removed the 📝 shootout laser crash by
simply leaving the lasers on screen if a boss is defeated,
prevented the HP bar heap corruption bug in test or debug mode by not
letting it display negative HP in the first place, and
So here it is, the first build of TH01's Anniversary Edition:
2023-03-05-th01-anniv.zip Edit (2023-03-12): If you're playing on Neko Project and seeing more
flickering than in the original game, make sure you've checked the Screen
→ Disp vsync option.
Next up: The long overdue extended trip through the depths of TH02's
low-level code. From what I've seen of it so far, the work on this project
is finally going to become a bit more relaxing. Which is quite welcome
after, what, 6 months of stressful research-heavy work?
Starting the year with a delivery that wasn't delayed until the last
day of the month for once, nice! Still, very soon and
high-maintenance did not go well together…
It definitely wasn't Sara's fault though. As you would expect from a Stage 1
Boss, her code was no challenge at all. Most of the TH02, TH04, and TH05
bosses follow the same overall structure, so let's introduce a new table to
replace most of the boilerplate overview text:
Phase #
Patterns
HP boundary
Timeout condition
(Entrance)
4,650
288 frames
2
4
2,550
2,568 frames
(= 32 patterns)
3
4
450
5,296 frames
(= 24 patterns)
4
1
0
1,300 frames
Total
9
9,452 frames
In Phases 2 and 3, Sara cycles between waiting, moving randomly for a
fixed 28 frames, and firing a random pattern among the 4 phase-specific
ones. The pattern selection makes sure to never
pick any pattern twice in a row. Both phases contain spiral patterns that
only differ in the clockwise or counterclockwise turning direction of the
spawner; these directions are treated as individual unrelated patterns, so
it's possible for the "same" pattern to be fired multiple times in a row
with a flipped direction.
The two phases also differ in the wait and pattern durations:
In Phase 2, the wait time starts at 64 frames and decreases by 12
frames after the first 5 patterns each, ending on a minimum of 4 frames.
In Phase 3, it's a constant 16 frames instead.
All Phase 2 patterns are fired for 28 frames, after a 16-frame
gather animation. The Phase 3 pattern time starts at 80 frames and
increases by 24 frames for the first 6 patterns, ending at 200 frames
for all later ones.
Phase 4 consists of the single laser corridor pattern with additional
random bullets every 16 frames.
And that's all the gameplay-relevant detail that ZUN put into Sara's code. It doesn't even make sense to describe the remaining
patterns in depth, as their groups can significantly change between
difficulties and rank values. The
📝 general code structure of TH05 bosses
won't ever make for good-code, but Sara's code is just a
lesser example of what I already documented for Shinki.
So, no bugs, no unused content, only inconsequential bloat to be found here,
and less than 1 push to get it done… That makes 9 PC-98 Touhou bosses
decompiled, with 22 to go, and gets us over the sweet 50% overall
finalization mark! 🎉 And sure, it might be possible to pass through the
lasers in Sara's final pattern, but the boss script just controls the
origin, angle, and activity of lasers, so any quirk there would be part of
the laser code… wait, you can do what?!?
TH05 expands TH04's one-off code for Yuuka's Master and Double Sparks into a
more featureful laser system, and Sara is the first boss to show it off.
Thus, it made sense to look at it again in more detail and finalize the code
I had purportedly
📝 reverse-engineered over 4 years ago.
That very short delivery notice already hinted at a very time-consuming
future finalization of this code, and that prediction certainly came true.
On the surface, all of the low-level laser ray rendering and
collision detection code is undecompilable: It uses the SI and
DI registers without Turbo C++'s safety backups on the stack,
and its helper functions take their input and output parameters from
convenient registers, completely ignoring common calling conventions. And
just to raise the confusion even further, the code doesn't just set
these registers for the helper function calls and then restores their
original values, but permanently shifts them via additions and
subtractions. Unfortunately, these convenient registers also include the
BP base pointer to the stack frame of a function… and shifting
that register throws any intuition behind accessed local variables right out
of the window for a good part of the function, requiring a correctly shifted
view of the stack frame just to make sense of it again.
How could such code even have been written?! This
goes well beyond the already wrong assumption that using more stack space is
somehow bad, and straight into the territory of self-inflicted pain.
So while it's not a lot of instructions, it's quite dense and really hard to
follow. This code would really benefit from a decompilation that
anchors all this madness as much as possible in existing C++ structures… so
let's decompile it anyway?
Doing so would involve emitting lots of raw machine code bytes to hide the
SI and DI registers from the compiler, but I
already had a certain
📝 batshit insane compiler bug workaround abstraction
lying around that could make such code more readable. Hilariously, it only
took this one additional use case for that abstraction to reveal itself as
premature and way too complicated. Expanding
the core idea into a full-on x86 instruction generator ended up simplifying
the code structure a lot. All we really want there is a way to set all
potential parameters to e.g. a specific form of the MOV
instruction, which can all be expressed as the parameters to a force-inlined
__emit__() function. Type safety can help by providing
overloads for different operand widths here, but there really is no need for
classes, templates, or explicit specialization of templates based on
classes. We only need a couple of enums with opcode, register,
and prefix constants from the x86 reference documentation, and a set of
associated macros that token-paste pseudoregisters onto the prefixes of
these enum constants.
And that's how you get a custom compile-time assembler in a 1994 C++
compiler and expand the limits of decompilability even further. What's even
truly left now? Self-modifying code, layout tricks that can't be replicated
with regularly structured control flow… and that's it. That leaves quite a
few functions I previously considered undecompilable to be revisited once I
get to work on making this game more portable.
With that, we've turned the low-level laser code into the expected horrible
monstrosity that exposes all the hidden complexity in those few ASM
instructions. The high-level part should be no big deal now… except that
we're immediately bombarded with Fixup overflow errors at link
time? Oh well, time to finally learn the true way of fixing this highly
annoying issue in a second new piece of decompilation tech – and one
that might actually be useful for other x86 Real Mode retro developers at
that.
Earlier in the RE history of TH04 and TH05, I often wrote about the need to
split the two original code segments into multiple segments within two
groups, which makes it possible to slot in code from different
translation units at arbitrary places within the original segment. If we
don't want to define a unique segment name for each of these slotted-in
translation units, we need a way to set custom segment and group names in C
land. Turbo C++ offers two #pragmas for that:
#pragma option -zCsegment -zPgroup – preferred in most
cases as it's equivalent to setting the default segment and group via the
command line, but can only be used at the beginning of a translation unit,
before the first non-preprocessor and non-comment C language token
#pragma codeseg segment <group> – necessary if a
translation unit needs to emit code into two or more segments
For the most part, these #pragmas work well, but they seemed to
not help much when it came to calling near functions declared
in different segments within the same group. It took a bit of trial and
error to figure out what was actually going on in that case, but there
is a clear logic to it:
Symbols are allocated to the segment and group that's active during
their first appearance, no matter whether that appearance is a declaration
or definition. Any later appearance of the function in a different segment
is ignored.
The linker calculates the 16-bit offsets of such references relative to
the symbol's declared segment, not its actual one. Turbo C++ does
not show an error or warning if the declared and actual segments are
different, as referencing the same symbol from multiple segments is a valid
use case. The linker merely throws the Fixup overflow error if
the calculated distance exceeds 64 KiB and thus couldn't possibly fit
within a near reference. With a wrong segment declaration
though, your code can be incorrect long before a fixup hits that limit.
Summarized in code:
#pragma option -zCfoo_TEXT -zPfoo
void bar(void);
void near qux(void); // defined somewhere else, maybe in a different segment
#pragma codeseg baz_TEXT baz
// Despite the segment change in the line above, this function will still be
// put into `foo_TEXT`, the active segment during the first appearance of the
// function name.
void bar(void) {
}
// This function hasn't been declared yet, so it will go into `baz_TEXT` as
// expected.
void baz(void) {
// This `near` function pointer will be calculated by subtracting the
// flat/linear address of qux() inside the binary from the base address
// of qux()'s declared segment, i.e., `foo_TEXT`.
void (near *ptr_to_qux)(void) = qux;
}
So yeah, you might have to put #pragma codeseg into your
headers to tell the linker about the correct segment of a
near function in advance. 🤯 This is an important insight for
everyone using this compiler, and I'm shocked that none of the Borland C++
books documented the interaction of code segment definitions and
near references at least at this level of clarity. The TASM
manuals did have a few pages on the topic of groups, but that syntax
obviously doesn't apply to a C compiler. Fixup overflows in particular are
such a common error and really deserved better than the unhelpful 🤷
of an explanation that ended up in the User's Guide. Maybe this whole
technique of custom code segment names was considered arcane even by 1993,
judging from the mere three sentences that #pragma codeseg was
documented with? Still, it must have been common knowledge among Amusement
Makers, because they couldn't have built these exact binaries without
knowing about these details. This is the true solution to
📝 any issues involving references to near functions,
and I'm glad to see that ZUN did not in fact lie to the compiler. 👍
OK, but now the remaining laser code compiles, and we get to write
C++ code to draw some hitboxes during the two collision-detected states of
each laser. These confirm what the low-level code from earlier already
uncovered: Collision detection against lasers is done by testing a
12×12-pixel box at every 16 pixels along the length of a laser, which leaves
obvious 4-pixel gaps at regular intervals that the player can just pass
through. This adds
📝 yet📝 another📝 quirk to the growing list of quirks that
were either intentional or must have been deliberately left in the game
after their initial discovery. This is what constants were invented for, and
there really is no excuse for not using them – especially during
intoxicated coding, and/or if you don't have a compile-time abstraction for
Q12.4 literals.
When detecting laser collisions, the game checks the player's single
center coordinate against any of the aforementioned 12×12-pixel boxes.
Therefore, it's correct to split these 12×12 pixels into two 6×6-pixel
boxes and assign the other half to the player for a more natural
visualization. Always remember that hitbox visualizations need to keep
all colliding entities in mind –
📝 assigning a constant-sized hitbox to "the player" and "the bullets" will be wrong in most other cases.
Using subpixel coordinates in collision detection also introduces a slight
inaccuracy into any hitbox visualization recorded in-engine on a 16-color
PC-98. Since we have to render discrete pixels, we cannot exactly place a
Q12.4 coordinate in the 93.75% of cases where the fractional part is
non-zero. This is why pretty much every laser segment hitbox in the video
above shows up as 7×7 rather than 6×6: The actual W×H area of each box is 13
pixels smaller, but since the hitbox lies between these pixels, we
cannot indicate where it lies exactly, and have to err on the
side of caution. It's also why Reimu's box slightly changes size as she
moves: Her non-diagonal movement speed is 3.5 pixels per frame, and the
constant focused movement in the video above halves that to 1.75 pixels,
making her end up on an exact pixel every 4 frames. Looking forward to the
glorious future of displays that will allow us to scale up the playfield to
16× its original pixel size, thus rendering the game at its exact internal
resolution of 6144×5888 pixels. Such a port would definitely add a lot of
value to the game…
The remaining high-level laser code is rather unremarkable for the most
part, but raises one final interesting question: With no explicitly defined
limit, how wide can a laser be? Looking at the laser structure's 1-byte
width field and the unsigned comparisons all throughout the update and
rendering code, the answer seems to be an obvious 255 pixels. However, the
laser system also contains an automated shrinking state, which can be most
notably seen in Mai's wheel pattern. This state shrinks a laser by 2 pixels
every 2 frames until it reached a width of 0. This presents a problem with
odd widths, which would fall below 0 and overflow back to 255 due to the
unsigned nature of this variable. So rather than, I don't know, treating
width values of 0 as invalid and stopping at a width of 1, or even adding a
condition for that specific case, the code just performs a signed
comparison, effectively limiting the width of a shrinkable laser to a
maximum of 127 pixels. This small signedness
inconsistency now forces the distinction between shrinkable and
non-shrinkable lasers onto every single piece of code that uses lasers. Yet
another instance where
📝 aiming for a cinematic 30 FPS look
made the resulting code much more complicated than if ZUN had just evenly
spread out the subtraction across 2 frames. 🤷
Oh well, it's not as if any of the fixed lasers in the original scripts came
close to any of these limits. Moving lasers are much more streamlined and
limited to begin with: Since they're hardcoded to 6 pixels, the game can
safely assume that they're always thinner than the 28 pixels they get
gradually widened to during their decay animation.
Finally, in case you were missing a mention of hitboxes in the previous
paragraph: Yes, the game always uses the aforementioned 12×12 boxes,
regardless of a laser's width.
This video also showcases the 127-pixel limit because I wanted
to include the shrink animation for a seamless loop.
That was what, 50% of this blog post just being about complications that
made laser difficult for no reason? Next up: The first TH01 Anniversary
Edition build, where I finally get to reap the rewards of having a 100%
decompiled game and write some good code for once.
More than three months without any reverse-engineering progress! It's been
way too long. Coincidentally, we're at least back with a surprising 1.25% of
overall RE, achieved within just 3 pushes. The ending script system is not
only more or less the same in TH04 and TH05, but actually originated in
TH03, where it's also used for the cutscenes before stages 8 and 9. This
means that it was one of the final pieces of code shared between three of
the four remaining games, which I got to decompile at roughly 3× the usual
speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The
Music Room is largely equivalent in all three remaining games as well, and
the sound device selection, ZUN Soft logo screens, and main/option menus are
the same in TH04 and TH05. A lot of that code is in the "technically RE'd
but not yet decompiled" ASM form though, so it would shift Finalized% more
significantly than RE%. Therefore, make sure to order the new
Finalization option rather than Reverse-engineering if you
want to make number go up.
So, cutscenes. On the surface, the .TXT files look simple enough: You
directly write the text that should appear on the screen into the file
without any special markup, and add commands to define visuals, music, and
other effects at any place within the script. Let's start with the basics of
how text is rendered, which are the same in all three games:
First off, the text area has a size of 480×64 pixels. This means that it
does not correspond to the tiled area painted into TH05's
EDBK?.PI images:
The yellow area is designated for character names.
Since the font weight can be customized, all text is rendered to VRAM.
This also includes gaiji, despite them ignoring the font weight
setting.
The system supports automatic line breaks on a per-glyph basis, which
move the text cursor to the beginning of the red text area. This might seem like a piece of long-forgotten
ancient wisdom at first, considering the absence of automatic line breaks in
Windows Touhou. However, ZUN probably implemented it more out of pure
necessity: Text in VRAM needs to be unblitted when starting a new box, which
is way more straightforward and performant if you only need to worry
about a fixed area.
The system also automatically starts a new (key press-separated) text
box after the end of the 4th line. However, the text cursor is
also unconditionally moved to the top-left corner of the yellow name
area when this happens, which is almost certainly not what you expect, given
that automatic line breaks stay within the red area. A script author might
as well add the necessary text box change commands manually, if you're
forced to anticipate the automatic ones anyway…
Due to ZUN forgetting an unblitting call during the TH05 refactoring of the
box background buffer, this feature is even completely broken in that game,
as any new text will simply be blitted on top of the old one:
Wait, why are we already talking about game-specific differences after
all? Also, note how the ⏎ animation appears one line below where you'd
expect it.
Overall, the system is geared toward exclusively full-width text. As
exemplified by the 2014 static English patches and the screenshots in this
blog post, half-width text is possible, but comes with a lot of
asterisks attached:
Each loop of the script interpreter starts by looking at the next
byte to distinguish commands from text. However, this step also skips
over every ASCII space and control character, i.e., every byte
≤ 32. If you only intend to display full-width glyphs anyway, this
sort of makes sense: You gain complete freedom when it comes to the
physical layout of these script files, and it especially allows commands
to be freely separated with spaces and line breaks for improved
readability. Still, enforcing commands to be separated exclusively by
line breaks might have been even better for readability, and would have
freed up ASCII spaces for regular text…
Non-command text is blindly processed and rendered two bytes at a
time. The rendering function interprets these bytes as a Shift-JIS
string, so you can use half-width characters here. While the
second byte can even be an ASCII 0x20 space due to the
parser's blindness, all half-width characters must still occur in pairs
that can't be interrupted by commands:
As a workaround for at least the ASCII space issue, you can replace
them with any of the unassigned
Shift-JIS lead bytes – 0x80, 0xA0, or
anything between 0xF0 and 0xFF inclusive.
That's what you see in all screenshots of this post that display
half-width spaces.
Finally, did you know that you can hold ESC to fast-forward
through these cutscenes, which skips most frame delays and reduces the rest?
Due to the blocking nature of all commands, the ESC key state is
only updated between commands or 2-byte text groups though, so it can't
interrupt an ongoing delay.
Superficially, the list of game-specific differences doesn't look too long,
and can be summarized in a rather short table:
It's when you get into the implementation that the combined three systems
reveal themselves as a giant mess, with more like 56 differences between the
games. Every single new weird line of code opened up
another can of worms, which ultimately made all of this end up with 24
pieces of bloat and 14 bugs. The worst of these should be quite interesting
for the general PC-98 homebrew developers among my audience:
The final official 0.23 release of master.lib has a bug in
graph_gaiji_put*(). To calculate the JIS X 0208 code point for
a gaiji, it is enough to ADD 5680h onto the gaiji ID. However,
these functions accidentally use ADC instead, which incorrectly
adds the x86 carry flag on top, causing weird off-by-one errors based on the
previous program state. ZUN did fix this bug directly inside master.lib for
TH04 and TH05, but still needed to work around it in TH03 by subtracting 1
from the intended gaiji ID. Anyone up for maintaining a bug-fixed master.lib
repository?
The worst piece of bloat comes from TH03 and TH04 needlessly
switching the visibility of VRAM pages while blitting a new 320×200 picture.
This makes it much harder to understand the code, as the mere existence of
these page switches is enough to suggest a more complex interplay between
the two VRAM pages which doesn't actually exist. Outside this visibility
switch, page 0 is always supposed to be shown, and page 1 is always used
for temporarily storing pixels that are later crossfaded onto page 0. This
is also the only reason why TH03 has to render text and gaiji onto both VRAM
pages to begin with… and because TH04 doesn't, changing the picture in the
middle of a string of text is technically bugged in that game, even though
you only get to temporarily see the new text on very underclocked PC-98
systems.
These performance implications made me wonder why cutscenes even bother with
writing to the second VRAM page anyway, before copying each crossfade step
to the visible one.
📝 We learned in June how costly EGC-"accelerated" inter-page copies are;
shouldn't it be faster to just blit the image once rather than twice?
Well, master.lib decodes .PI images into a packed-pixel format, and
unpacking such a representation into bitplanes on the fly is just about the
worst way of blitting you could possibly imagine on a PC-98. EGC inter-page
copies are already fairly disappointing at 42 cycles for every 16 pixels, if
we look at the i486 and ignore VRAM latencies. But under the same
conditions, packed-pixel unpacking comes in at 81 cycles for every 8
pixels, or almost 4× slower. On lower-end systems, that can easily sum up to
more than one frame for a 320×200 image. While I'd argue that the resulting
tearing could have been an acceptable part of the transition between two
images, it's understandable why you'd want to avoid it in favor of the
pure effect on a slower framerate.
Really makes me wonder why master.lib didn't just directly decode .PI images
into bitplanes. The performance impact on load times should have been
negligible? It's such a good format for
the often dithered 16-color artwork you typically see on PC-98, and
deserves better than master.lib's implementation which is both slow to
decode and slow to blit.
That brings us to the individual script commands… and yes, I'm going to
document every single one of them. Some of their interactions and edge cases
are not clear at all from just looking at the code.
Almost all commands are preceded by… well, a 0x5C lead byte.
Which raises the question of whether we should
document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded
¥ yen sign. From a gaijin perspective, it seems obvious that it's a
backslash, as it's consistently displayed as one in most of the editors you
would actually use nowadays. But interestingly, iconv
-f shift-jis -t utf-8 does convert any 0x5C
lead bytes to actual ¥ U+00A5 YEN SIGN code points
.
Ultimately, the distinction comes down to the font. There are fonts
that still render 0x5C as ¥, but mainly do so out
of an obvious concern about backward compatibility to JIS X 0201, where this
mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho,
the old Japanese fonts from Windows 3.1, but even Meiryo and Yu
Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much
every other modern font, and freely licensed ones in particular, render this
code point as \, even if you set your editor to Shift-JIS. And
while ZUN most definitely saw it as a ¥, documenting this code
point as \ is less ambiguous in the long run. It can only
possibly correspond to one specific code point in either Shift-JIS or UTF-8,
and will remain correct even if we later mod the cutscene system to support
full-blown Unicode.
Now we've only got to clarify the parameter syntax, and then we can look at
the big table of commands:
Numeric parameters are read as sequences of up to 3 ASCII digits. This
limits them to a range from 0 to 999 inclusive, with 000 and
0 being equivalent. Because there's no further sentinel
character, any further digit from the 4th one onwards is
interpreted as regular text.
Filename parameters must be terminated with a space or newline and are
limited to 12 characters, which translates to 8.3 basenames without any
directory component. Any further characters are ignored and displayed as
text as well.
Each .PI image can contain up to four 320×200 pictures ("quarters") for
the cutscene picture area. In the script commands, they are numbered like
this:
0
1
2
3
\@
Clears both VRAM pages by filling them with VRAM color 0. 🐞
In TH03 and TH04, this command does not update the internal text area
background used for unblitting. This bug effectively restricts usage of
this command to either the beginning of a script (before the first
background image is shown) or its end (after no more new text boxes are
started). See the image below for an
example of using it anywhere else.
\b2
Sets the font weight to a value between 0 (raw font ROM glyphs) to 3
(very thicc). Specifying any other value has no effect.
🐞 In TH04 and TH05, \b3 leads to glitched pixels when
rendering half-width glyphs due to a bug in the newly micro-optimized
ASM version of
📝 graph_putsa_fx(); see the image below for an example.
In these games, the parameter also directly corresponds to the
graph_putsa_fx() effect function, removing the sanity check
that was present in TH03. In exchange, you can also access the four
dissolve masks for the bold font (\b2) by specifying a
parameter between 4 (fewest pixels) to 7 (most
pixels). Demo video below.
\c15
Changes the text color to VRAM color 15.
\c=字,15
Adds a color map entry: If 字 is the first code point
inside the name area on a new line, the text color is automatically set
to 15. Up to 8 such entries can be registered
before overflowing the statically allocated buffer.
🐞 The comma is assumed to be present even if the color parameter is omitted.
\e0
Plays the sound effect with the given ID.
\f
(no-op)
\fi1
\fo1
Calls master.lib's palette_black_in() or
palette_black_out() to play a hardware palette fade
animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
\fm1
Fades out BGM volume via PMD's AH=02h interrupt call,
in a non-blocking way. The fade speed can range from 1 (slowest) to 127 (fastest).
Values from 128 to 255 technically correspond to
AH=02h's fade-in feature, which can't be used from cutscene
scripts because it requires BGM volume to first be lowered via
AH=19h, and there is no command to do that.
\g8
Plays a blocking 8-frame screen shake
animation.
\ga0
Shows the gaiji with the given ID from 0 to 255
at the current cursor position. Even in TH03, gaiji always ignore the
text delay interval configured with \v.
@3
TH05's replacement for the \ga command from TH03 and
TH04. The default ID of 3 corresponds to the
gaiji. Not to be confused with \@, which starts with a backslash,
unlike this command.
@h
Shows the gaiji.
@t
Shows the gaiji.
@!
Shows the gaiji.
@?
Shows the gaiji.
@!!
Shows the gaiji.
@!?
Shows the gaiji.
\k0
Waits 0 frames (0 = forever) for an advance key to be pressed before
continuing script execution. Before waiting, TH05 crossfades in any new
text that was previously rendered to the invisible VRAM page…
🐞 …but TH04 doesn't, leaving the text invisible during the wait time.
As a workaround, \vp1 can be
used before \k to immediately display that text without a
fade-in animation.
\m$
Stops the currently playing BGM.
\m*
Restarts playback of the currently loaded BGM from the
beginning.
\m,filename
Stops the currently playing BGM, loads a new one from the given
file, and starts playback.
\n
Starts a new line at the leftmost X coordinate of the box, i.e., the
start of the name area. This is how scripts can "change" the name of the
currently speaking character, or use the entire 480×64 pixels without
being restricted to the non-name area.
Note that automatic line breaks already move the cursor into a new line.
Using this command at the "end" of a line with the maximum number of 30
full-width glyphs would therefore start a second new line and leave the
previously started line empty.
If this command moved the cursor into the 5th line of a box,
\s is executed afterward, with
any of \n's parameters passed to \s.
\p
(no-op)
\p-
Deallocates the loaded .PI image.
\p,filename
Loads the .PI image with the given file into the single .PI slot
available to cutscenes. TH04 and TH05 automatically deallocate any
previous image, 🐞 TH03 would leak memory without a manual prior call to
\p-.
\pp
Sets the hardware palette to the one of the loaded .PI image.
\p@
Sets the loaded .PI image as the full-screen 640×400 background
image and overwrites both VRAM pages with its pixels, retaining the
current hardware palette.
\p=
Runs \pp followed by \p@.
\s0
\s-
Ends a text box and starts a new one. Fades in any text rendered to
the invisible VRAM page, then waits 0 frames
(0 = forever) for an advance key to be
pressed. Afterward, the new text box is started with the cursor moved to
the top-left corner of the name area. \s- skips the wait time and starts the new box
immediately.
\t100
Sets palette brightness via master.lib's
palette_settone() to any value from 0 (fully black) to 200
(fully white). 100 corresponds to the palette's original colors.
Preceded by a 1-frame delay unless ESC is held.
\v1
Sets the number of frames to wait between every 2 bytes of rendered
text.
Sets the number of frames to spend on each of the 4 fade
steps when crossfading between old and new text. The game-specific
default value is also used before the first use of this command.
\v2
\vp0
Shows VRAM page 0. Completely useless in
TH03 (this game always synchronizes both VRAM pages at a command
boundary), only of dubious use in TH04 (for working around a bug in \k), and the games always return to
their intended shown page before every blitting operation anyway. A
debloated mod of this game would just remove this command, as it exposes
an implementation detail that script authors should not need to worry
about. None of the original scripts use it anyway.
\w64
\w and \wk wait for the given number
of frames
\wm and \wmk wait until PMD has played
back the current BGM for the total number of measures, including
loops, given in the first parameter, and fall back on calling
\w and \wk with the second parameter as
the frame number if BGM is disabled.
🐞 Neither PMD nor MMD reset the internal measure when stopping
playback. If no BGM is playing and the previous BGM hasn't been
played back for at least the given number of measures, this command
will deadlock.
Since both TH04 and TH05 fade in any new text from the invisible VRAM
page, these commands can be used to simulate TH03's typing effect in
those games. Demo video below.
Contrary to \k and \s, specifying 0 frames would
simply remove any frame delay instead of waiting forever.
The TH03-exclusive k variants allow the delay to be
interrupted if ⏎ Return or Shot are held down.
TH04 and TH05 recognize the k as well, but removed its
functionality.
All of these commands have no effect if ESC is held.
\wm64,64
\wk64
\wmk64,64
\wi1
\wo1
Calls master.lib's palette_white_in() or
palette_white_out() to play a hardware palette fade
animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
\=4
Immediately displays the given quarter of the loaded .PI image in
the picture area, with no fade effect. Any value ≥ 4 resets the picture area to black.
\==4,1
Crossfades the picture area between its current content and quarter
#4 of the loaded .PI image, spending 1 frame on each of the 4 fade steps unless
ESC is held. Any value ≥ 4 is
replaced with quarter #0.
\$
Stops script execution. Must be called at the end of each file;
otherwise, execution continues into whatever lies after the script
buffer in memory.
TH05 automatically deallocates the loaded .PI image, TH03 and TH04
require a separate manual call to \p- to not leak its memory.
Bold values signify the default if the parameter
is omitted; \c is therefore
equivalent to \c15.
The \@ bug. Yes, the ¥ is fake. It
was easier to GIMP it than to reword the sentences so that the backslashes
landed on the second byte of a 2-byte half-width character pair.
The font weights and effects available through \b, including the glitch with
\b3 in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if
you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels
at the left side of each glyph, which would extend into the previous
VRAM byte. Ironically, the TH04/TH05 version is more correct in
this regard: For half-width glyphs, it preserves any further pixel
columns generated by the weight functions in the high byte of the 16-dot
glyph variable. Unlike TH03, which still cuts them off when rendering
text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them
towards their correct place (4️⃣). It's only at byte-aligned X positions
(2️⃣) where they remain at their internally calculated place, and appear
on screen as these glitched pixel columns, 15 pixels away from the glyph
they belong to. It's easy to blame bugs like these on micro-optimized
ASM code, but in this instance, you really can't argue against it if the
original C++ version was equally incorrect.
Combining \b and s- into a partial dissolve
animation. The speed can be controlled with \v.
Simulating TH03's typing effect in TH04 and TH05 via \w. Even prettier in TH05 where we
also get an additional fade animation
after the box ends.
So yeah, that's the cutscene system. I'm dreading the moment I will have to
deal with the other command interpreter in these games, i.e., the
stage enemy system. Luckily, that one is completely disconnected from any
other system, so I won't have to deal with it until we're close to finishing
MAIN.EXE… that is, unless someone requests it before. And it
won't involve text encodings or unblitting…
The cutscene system got me thinking in greater detail about how I would
implement translations, being one of the main dependencies behind them. This
goal has been on the order form for a while and could soon be implemented
for these cutscenes, with 100% PI being right around the corner for the TH03
and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching
for Latin-script languages could be implemented fairly quickly:
Establish basic UTF-8 parsing for less painful manual editing of the
source files
Procedurally generate glyphs for the few required additional letters
based on existing font ROM glyphs. For example, we'd generate ä
by painting two short lines on top of the font ROM's a glyph,
or generate ¿ by vertically flipping the question mark. This
way, the text retains a consistent look regardless of whether the translated
game is run with an NEC or EPSON font ROM, or the that Neko Project II auto-generates if you
don't provide either.
(Optional) Change automatic line breaks to work on a per-word
basis, rather than per-glyph
That's it – script editing and distribution would be handled by your local
translation group. It might seem as if this would also work for Greek and
Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not
sure if I want to attempt procedurally shrinking these glyphs from 16×16 to
8×16… For any more thorough solution, we'd need to go for a more "Chad" kind
of full-blown translation support:
Implement text subdivisions at a sensible granularity while retaining
automatic line and box breaks
Compile translatable text into a Japanese→target language dictionary
(I'm too old to develop any further translation systems that would overwrite
modded source text with translations of the original text)
Implement a custom Unicode font system (glyphs would be taken from GNU
Unifont unless translators provide a different 8×16 font for their
language)
Combine the text compiler with the font compiler to only store needed
glyphs as part of the translation's font file (dealing with a multi-MB font
file would be rather ugly in a Real Mode game)
Write a simple install/update/patch stacking tool that supports both
.HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to
warrant a separate tool – each patch stack would be statically compiled into
a single package file in the game's directory)
Add a nice language selection option to the main menu
(Optional) Support proportional fonts
Which sounds more like a separate project to be commissioned from
Touhou Patch Center's Open Collective funds, separate from the ReC98 cap.
This way, we can make sure that the feature is completely implemented, and I
can talk with every interested translator to make sure that their language
works.
It's still cheaper overall to do this on PC-98 than to first port the games
to a modern system and then translate them. On the other hand, most
of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with
the difficulty of getting arbitrary Unicode characters to work natively in a
PC-98 DOS game at all, and would be either unnecessary or trivial if we had
already ported the game. Depending on where the patrons' interests lie, it
may not be worth it. So let's see what all of you think about which
way we should go, or whether it's worth doing at all. (Edit
(2022-12-01): With Splashman's
order towards the stage dialogue system, we've pretty much confirmed that it
is.) Maybe we want to meet in the middle – using e.g. procedural glyph
generation for dynamic translations to keep text rendering consistent with
the rest of the PC-98 system, and just not support non-Latin-script
languages in the beginning? In any case, I've added both options to the
order form. Edit (2023-07-28):Touhou Patch Center has agreed to fund
a basic feature set somewhere between the Virgin and Chad level. Check the
📝 dedicated announcement blog post for more
details and ideas, and to find out how you can support this goal!
Surprisingly, there was still a bit of RE work left in the third push after
all of this, which I filled with some small rendering boilerplate. Since I
also wanted to include TH02's playfield overlay functions,
1/15 of that last push went towards getting a
TH02-exclusive function out of the way, which also ended up including that
game in this delivery.
The other small function pointed out how TH05's Stage 5 midboss pops into
the playfield quite suddenly, since its clipping test thinks it's only 32
pixels tall rather than 64:
Good chance that the pop-in might have been intended. Edit (2023-06-30): Actually, it's a
📝 systematic consequence of ZUN having to work around the lack of clipping in master.lib's sprite functions.
There's even another quirk here: The white flash during its first frame
is actually carried over from the previous midboss, which the
game still considers as actively getting hit by the player shot that
defeated it. It's the regular boilerplate code for rendering a
midboss that resets the responsible damage variable, and that code
doesn't run during the defeat explosion animation.
Next up: Staying with TH05 and looking at more of the pattern code of its
boss fights. Given the remaining TH05 budget, it makes the most sense to
continue in in-game order, with Sara and the Stage 2 midboss. If more money
comes in towards this goal, I could alternatively go for the Mai & Yuki
fight and immediately develop a pretty fix for the cheeto storage
glitch. Also, there's a rather intricate
pull request for direct ZMBV decoding on the website that I've still got
to review…
On August 15, 1997, at Comiket 52, an unknown doujin developer going by the
name of ZUN released his first game, 東方靈異伝 ~
The Highly Responsive to Prayers, marking the start of the
Touhou Project game series that keeps running to this day. Today, exactly 25
years later, the C++ source code to version 1.10 of that game has been
completely and perfectly reconstructed, reviewed, and documented.
And with that, a warm welcome to all game journalists who have
(re-)discovered this project through these news! Here's a summary for
everyone who doesn't want to go through 3 years worth of blog posts:
What does this mean?
All code that ZUN wrote as part of a TH01 installation has now been
decompiled to C++ code. The only parts left in assembly are two third-party
libraries (master.lib and PiLoad), which were originally written in
assembly, and are built from their respective official source code.
You can clone the ReC98
repository, set up the build environment, and get a binary with an
identical program image. The hashes of the resulting executables won't match
those of ZUN's original release, but all differences there stem from details
in the .EXE header that don't influence program execution, such as the
on-disk order of the conceptually unordered set of x86 memory segment
relocations. If you're interested in that level of correctness, you can
order Easier verification against original binaries from the store.
For now though, use mzdiff for
verifying the builds against ZUN's binaries.
Ever since this crowdfunding has started 3 years ago, the goal of this
project has shifted more and more towards a full-on code review rather than
being just a mechanical decompilation:
Hardcoded constants were derived from as few truly hardcoded values
as possible, which uncovered their intended meaning and highlighted any
inconsistencies
Code was deduplicated to a perhaps obsessive level (I'm still trying
to find a balance)
Tons of comments everywhere to put everything into context
And, of course, 2½ years worth of blog
posts summarizing any highlights, glitches, and secrets. (There
might still be some left to be discovered!)
As a result, modding the games and porting them away from the PC-98
platform is now a lot easier.
What does this not mean?
This is not a piracy release. ReC98 only provides the code that the
game's .EXE and .COM files are built out of. Without the rest of the
original data files, supplied from a pre-existing game copy, the code won't
do very much.
Even apart from ZUN's own code quality, the ReC98 repository is not as
polished and consistent as it could be, having seen multiple code structure
evolutions over the 8 years of its existence.
TH01 hasn't magically reached Doom levels of easy portability now. As a
decompilation of the exact code that ZUN wrote for the PC-98 platform, it is
very PC-98-native, and wildly mixes game logic with hardware
accesses. As ZUN's first foray into game development, he understandably
didn't see the need for writing an engine or hardware abstraction layer
yet.
So while this milestone opened the floodgates to PC-98-native mods, I
wouldn't advise trying to attempt a port away from PC-98 right now. But then
again, I have a financial interest in being a part of the porting process,
and who knows, maybe you can just merge in a PC-98 emulator core and
get started with something halfway decent in a short amount of time. After
all, TH01 is by far the easiest PC-98 Touhou game to port to other systems,
as it makes the least use of hardware features. (Edit
(2023-03-30): 📝 Turns out that this
crown actually goes to TH02. It features the least amount of ZUN-written
PC-98-specific rendering code out of all the 5 games, with most of it
being decently abstracted via master.lib.)
However, this game in particular raises the question of what exactly
one would even want to port. TH01 is a broken flicker-fest that
overwhelmingly suffers the drawbacks of PC-98 hardware rather than using it
to its advantage. Out of the 78 bugs that I ended up labeling as such, the
majority are sprite blitting issues,
while you can count the instances of
good hardware use on one hand.
And even at the level of game logic, this game features a lot of
weird, inconsistent behavior. Less rigorous projects such as uth05win would probably
promptly identify these issues as bugs and fix them. On the one hand, this
shows that there is a part of the community that wants sane versions of
these games which behave as expected. In other parts of the community
though, such projects quickly gain the reputation of being too inaccurate to
bother about them.
Some terminology might help here. If you look over the ReC98 codebase,
you'll find that I classified any weird code into three categories.
Edit (2023-03-05): These have been overhauled with a new
landmine category for invisible issues. Check CONTRIBUTING.md
for the complete and current current definition of all weird code
categories.
🔗 ZUN bugs: Broken
code that results from logic errors or incorrect programming
language/API/hardware use, with enough evidence in the code to indicate that
ZUN did not intend the bug. Fixing these issues must not affect hypothetical
replay compatibility, and any resulting visual changes must match ZUN's
provable intentions.
🔗 ZUN quirks:
Weird code that looks incorrect in context. Fixing these issues would change
gameplay enough to desync a hypothetical replay recorded on the original
version, or affect the visuals so much that the result is no longer faithful
to ZUN's original release. It might very well be called a fangame at that
point.
🔗 ZUN bloat:
Code that wastes memory, CPU cycles, or even just the mental capacity of
anyone trying to read and understand the code. If you want to write a
particularly resource-intensive mod, these are the places you could claim
some of those resources from.
Some examples:
All crashes are bugs
All blitting issues related to inappropriate VRAM byte alignment are
bugs
The idea of splitting TH01 across three executables is its biggest
source of bloat. It wastes disk space, the game doesn't even make use of the
memory gained from unloading unneeded code and data, it complicates the
build process and code structure with inconsistencies between the individual
binaries, and the required inter-process communication via shared memory
adds another piece of global state mutation headache.
Since I'm not in the business of writing fanfiction, I won't offer any
option that fixes quirks. That's where all of you can come in, and
use ReC98 as a base for remasters and remakes. As for bloat and bugs though,
there are many ways we could go from here:
If you want to ultimately try porting the game yourself, but still
support ReC98 somehow, I can recommend the ZUN code cleanup goal.
This is the most conservative option that leaves all bugs and quirks in
place and only removes bloat, rearchitecting the codebase so that
it's easier to work with.
For an improved gameplay experience on PC-98, choose the TH01
Anniversary Edition goal. In addition to the above code cleanup, this
goal fixes every bug with the game, most notably all the sprite
flickering by implementing a completely new renderer, while maintaining
hypothetical replay compatibility to ZUN's original release.
If you're mainly interested in seeing any variety of TH01 ported away
from PC-98 to any system, choose the Portability to non-PC-98 systems
goal. In this one, I'm going to develop the abstraction layers that would
ultimately bring this game to the aforementioned Doom level of portability,
while still keeping it running with better than original performance on
PC-98.
Replay support is also something you could order…
… as is Multilingual translation support (on PC-98), for those
sweet non-ASCII characters if that's your thing.
Then again, with all these choices in mind, maybe we should just let TH01 be
what it is: ZUN's first game, evidence for the truth that no programmer
writes good code the first time around, and more of a historical curiosity
than anything you'd want to maintain and modernize. The idea of moving on to
the next game and decompiling all 5 PC-98 Touhou games in order has
certainly shown to be
popular among the backers who funded this 100% goal.
Since the beginning of the year, I've been dramatically raising the level of
quality and care I've been putting into this project, leading to 9 of the 10
longest blog posts having been written in the past 8 months. The community
reception has been even more supportive as well, with all of you still
regularly selling out the store in return. To match the level of quality
with the community demand, I'm raising push prices from
to per push, as of this blog
post. 📝 As usual, I'm going to deliver any
existing orders in the backlog at the value they were originally purchased
at. Due to the way the cap has to be calculated, these contributions now
appear to have increased in value by 25%.
However, I do realize that this might make regular pushes prohibitively
expensive for some. This could especially prevent all these exciting modding
goals from ever getting off the ground. Thinking about it though, the push
system is only really necessary for the core reverse-engineering business,
where longer, concentrated stretches of work allow me to study a new piece
of code in a larger context and improve the quality of the final result. In
contrast, modding-related goals could theoretically be segmented into
arbitrarily small portions of work, as I have a clear idea of where I want
to go and how to get there.
Thus, I'm introducing microtransactions, now available for all
modding-related goals. These allow you to order fractional pieces of work
for as low as 1 €, which I will immediately deliver without requiring others
to fund a full push first. Edit (2022-08-16): And then the
store still sold out with a single regular contribution by
nrook towards more reverse-engineering. Guess that this
experiment will have to wait a little while longer, then… 😅
Next up: Taking a break and recovering from crunch time by improving video
playback on this blog and working on Shuusou Gyoku,
before returning to Touhou in September.
Last blog post before the 100% completion of TH01! The final parts of
REIIDEN.EXE would feel rather out of place in a celebratory
blog post, after all. They provided quite a neat summary of the typical
technical details that are wrong with this game, and that I now get to
mention for one final time:
The Orb's animation cycle is maybe two frames shorter than it should
have been, showing its last sprite for just 1 frame rather than 3:
The text in the Pause and Continue menus is not quite correctly
centered.
The memory info screen hides quite a bit of information about the .PTN
buffers, and obscures even the info that it does show behind
misleading labels. The most vital information would have been that ZUN could
have easily saved 20% of the memory by using a structure without the
unneeded alpha plane… Oh, and the REWIRTE option
mapped to the ⬇️ down arrow key simply redraws the info screen. Might be
useful after a NODE CHEAK, which replaces the output
with its own, but stays within the same input loop.
But hey, there's an error message if you start REIIDEN.EXE
without a resident MDRV2 or a correctly prepared resident structure! And
even a good, user-friendly one, asking the user to launch the batch file
instead. For some reason, this convenience went out of fashion in the later
games.
The Game Over animation (how fitting) gives us TH01's final piece of weird
sprite blitting code, which seriously manages to include 2 bugs and 3 quirks
in under 50 lines of code. In test mode (game t or game
d), you can trigger this effect by pressing the ⬇️ down arrow key,
which certainly explains why I encountered seemingly random Game Over events
during all the tests I did with this game…
The animation appears to have changed quite a bit during development, to the
point that probably even ZUN himself didn't know what he wanted it to look
like in the end:
The original version unblits a 32×32 rectangle around Reimu that only
grows on the X axis… for the first 5 frames. The unblitting call is
only run if the corresponding sprite wasn't clipped at the edges of the
playfield in the frame before, and ZUN uses the animation's frame
number rather than the sprite loop variable to index the per-sprite
clip flag array. The resulting out-of-bounds access then reads the
sprite coordinates instead, which are never 0, thus interpreting
all 5 sprites as clipped.
This variant would interpret the declared 5 effect coordinates as
distinct sprites and unblit them correctly every frame. The end result
is rather wimpy though… hardly appropriate for a Game Over, especially
with the original animation in mind.
This variant would not unblit anything, and is probably closest to what
the final animation should have been.
Finally, we get to the big main() function, serving as the duct
tape that holds this game together. It may read rather disorganized with all
the (actually necessary) assignments and function calls, but the only
actual minor issue I've seen there is that you're robbed of any
pellet destroy bonus collected on the final frame of the final boss. There
is a certain charm in directly nesting the infinite main gameplay loop
within the infinite per-life loop within the infinite stage loop. But come
on, why is there no fourth scene loop? Instead, the
game just starts a new REIIDEN.EXE process before and after a
boss fight. With all the wildly mutated global state, that was probably a
much saner choice.
The final secrets can be found in the debug stage selection. ZUN
implemented the prompts using the C standard library's scanf()
function, which is the natural choice for quick-and-dirty testing features
like this one. However, the C standard library is also complete and utter
trash, and so it's not surprising that both of the scanf()
calls do… well, probably not what ZUN intended. The guaranteed out-of-bounds
memory access in the select_flag route prompt thankfully has no
real effect on the game, but it gets really interesting with the 面数 stage prompt.
Back in 2020, I already wrote about
📝 stages 21-24, and how they're loaded from actual data that ZUN shipped with the game.
As it now turns out, the code that maps stage IDs to STAGE?.DAT
scene numbers contains an explicit branch that maps any (1-based) stage
number ≥21 to scene 7. Does this mean that an Extra Stage was indeed planned
at some point? That branch seems way too specific to just be meant as a
fallback. Maybe
Asprey was on to something after all…
However, since ZUN passed the stage ID as a signed integer to
scanf(), you can also enter negative numbers. The only place
that kind of accidentally checks for them is the aforementioned stage
ID → scene mapping, which ensures that (1-based) stages < 5 use
the shrine's background image and BGM. With no checks anywhere else, we get
a new set of "glitch stages":
Stage -1Stage -2Stage -3Stage -4Stage -5
The scene loading function takes the entered 0-based stage ID value modulo
5, so these 4 are the only ones that "exist", and lower stage numbers will
simply loop around to them. When loading these stages, the function accesses
the data in REIIDEN.EXE that lies before the statically
allocated 5-element stages-of-scene array, which happens to encompass
Borland C++'s locale and exception handling data, as well as a small bit of
ZUN's global variables. In particular, the obstacle/card HP on the tile I
highlighted in green corresponds to the
lowest byte of the 32-bit RNG seed. If it weren't for that and the fact that
the obstacles/card HP on the few tiles before are similarly controlled by
the x86 segment values of certain initialization function addresses, these
glitch stages would be completely deterministic across PC-98 systems, and
technically canon…
Stage -4 is the only playable one here as it's the only stage to end up
below the
📝 heap corruption limit of 102 stage objects.
Completing it loads Stage -3, which crashes with a Divide Error
just like it does if it's directly selected. Unsurprisingly, this happens
because all 50 card bytes at that memory location are 0, so one division (or
in this case, modulo operation) by the number of cards is enough to crash
the game.
Stage -5 is modulo'd to 0 and thus loads the first regular stage. The only
apparent broken element there is the timer, which is handled by a completely
different function that still operates with a (0-based) stage ID value of
-5. Completing the stage loads Stage -4, which also crashes, but only
because its 61 cards naturally cause the
📝 stack overflow in the flip-in animation for any stage with more than 50 cards.
And that's REIIDEN.EXE, the biggest and most bloated PC-98
Touhou executable, fully decompiled! Next up: Finishing this game with the
main menu, and hoping I'll actually pull it off within 24 hours. (If I do,
we might all have to thank 32th
System, who independently decompiled half of the remaining 14
functions…)
Wow, it's been 3 days and I'm already back with an unexpectedly long post
about TH01's bonus point screens? 3 days used to take much longer in my
previous projects…
Before I talk about graphics for the rest of this post, let's start with the
exact calculations for both bonuses. Touhou Wiki already got these right,
but it still makes sense to provide them here, in a format that allows you
to cross-reference them with the source code more easily. For the
card-flipping stage bonus:
Time
min((Stage timer * 3), 6553)
Continuous
min((Highest card combo * 100), 6553)
Bomb&Player
min(((Lives * 200) + (Bombs * 100)), 6553)
STAGE
min(((Stage number - 1) * 200), 6553)
BONUS Point
Sum of all above values * 10
The boss stage bonus is calculated from the exact same metrics, despite half
of them being labeled differently. The only actual differences are in the
higher multipliers and in the cap for the stage number bonus. Why remove it
if raising it high enough also effectively disables it?
Time
min((Stage timer * 5), 6553)
Continuous
min((Highest card combo * 200), 6553)
MIKOsan
min(((Lives * 500) + (Bombs * 200)), 6553)
Clear
min((Stage number * 1000), 65530)
TOTLE
Sum of all above values * 10
The transition between the gameplay and TOTLE screens is one of the more
impressive effects showcased in this game, especially due to how wavy it
often tends to look. Aside from the palette interpolation (which is, by the
way, the first time ZUN wrote a correct interpolation algorithm between two
4-bit palettes), the core of the effect is quite simple. With the TOTLE
image blitted to VRAM page 1:
Shift the contents of a line on VRAM page 0 by 32 pixels, alternating
the shift direction between right edge → left edge (even Y
values) and the other way round (odd Y values)
Keep a cursor for the destination pixels on VRAM page 1 for every line,
starting at the respective opposite edge
Blit the 32 pixels at the VRAM page 1 cursor to the newly freed 32
pixels on VRAM page 0, and advance the cursor towards the other edge
Successive line shifts will then include these newly blitted 32 pixels
as well
Repeat (640 / 32) = 20 times, after which all new pixels
will be in their intended place
So it's really more like two interlaced shift effects with opposite
directions, starting on different scanlines. No trigonometry involved at
all.
Horizontally scrolling pixels on a single VRAM page remains one of the few
📝 appropriate uses of the EGC in a fullscreen 640×400 PC-98 game,
regardless of the copied block size. The few inter-page copies in this
effect are also reasonable: With 8 new lines starting on each effect frame,
up to (8 × 20) = 160 lines are transferred at any given time, resulting
in a maximum of (160 × 2 × 2) = 640 VRAM page switches per frame for the newly
transferred pixels. Not that frame rate matters in this situation to begin
with though, as the game is doing nothing else while playing this effect.
What does sort of matter: Why 32 pixels every 2 frames, instead of 16
pixels on every frame? There's no performance difference between doing one
half of the work in one frame, or two halves of the work in two frames. It's
not like the overhead of another loop has a serious impact here,
especially with the PC-98 VRAM being said to have rather high
latencies. 32 pixels over 2 frames is also harder to code, so ZUN
must have done it on purpose. Guess he really wanted to go for that 📽
cinematic 30 FPS look 📽 here…
Removing the palette interpolation and transitioning from a black screen
to CLEAR3.GRP makes it a lot clearer how the effect works.
Once all the metrics have been calculated, ZUN animates each value with a
rather fancy left-to-right typing effect. As 16×16 images that use a single
bright-red color, these numbers would be
perfect candidates for gaiji… except that ZUN wanted to render them at the
more natural Y positions of the labels inside CLEAR3.GRP that
are far from aligned to the 8×16 text RAM grid. Not having been in the mood
for hardcoding another set of monochrome sprites as C arrays that day, ZUN
made the still reasonable choice of storing the image data for these numbers
in the single-color .GRC form– yeah, no, of course he once again
chose the .PTN hammer, and its
📝 16×16 "quarter" wrapper functions around nominal 32×32 sprites.
The three 32×32 TOTLE metric digit sprites inside
NUMB.PTN.
Why do I bring up such a detail? What's actually going on there is that ZUN
loops through and blits each digit from 0 to 9, and then continues the loop
with "digit" numbers from 10 to 19, stopping before the number whose ones
digit equals the one that should stay on screen. No problem with that in
theory, and the .PTN sprite selection is correct… but the .PTN
quarter selection isn't, as ZUN wrote (digit % 4)
instead of the correct ((digit % 10) % 4).
Since .PTN quarters are indexed in a row-major
way, the 10-19 part of the loop thus ends up blitting
2 →
3 →
0 →
1 →
6 →
7 →
4 →
5 →
(nothing):
This footage was slowed down to show one sprite blitting operation per
frame. The actual game waits a hardcoded 4 milliseconds between each
sprite, so even theoretically, you would only see roughly every
4th digit. And yes, we can also observe the empty quarter
here, only blitted if one of the digits is a 9.
Seriously though? If the deadline is looming and you've got to rush
some part of your game, a standalone screen that doesn't affect
anything is the best place to pick. At 4 milliseconds per digit, the
animation goes by so fast that this quirk might even add to its
perceived fanciness. It's exactly the reason why I've always been rather
careful with labeling such quirks as "bugs". And in the end, the code does
perform one more blitting call after the loop to make sure that the correct
digit remains on screen.
The remaining ¾ of the second push went towards transferring the final data
definitions from ASM to C land. Most of the details there paint a rather
depressing picture about ZUN's original code layout and the bloat that came
with it, but it did end on a real highlight. There was some unused data
between ZUN's non-master.lib VSync and text RAM code that I just moved away
in September 2015 without taking a closer look at it. Those bytes kind of
look like another hardcoded 1bpp image though… wait, what?!
Lovely! With no mouse-related code left in the game otherwise, this cursor
sprite provides some great fuel for wild fan theories about TH01's
development history:
Could ZUN have 📝 stolen the basic PC-98
VSync or text RAM function code from a source that also implemented mouse
support?
Or was this game actually meant to have mouse-controllable portions at
some point during development? Even if it would have just been the
menus.
… Actually, you know what, with all shared data moved to C land, I might as
well finish FUUIN.EXE right now. The last secret hidden in its
main() function: Just like GAME.BAT supports
launching the game in various debug modes from the DOS command line,
FUUIN.EXE can directly launch one of the game's endings. As
long as the MDRV2 driver is installed, you can enter
fuuin t1 for the 魔界/Makai Good Ending, or
fuuin t for 地獄/Jigoku Good Ending.
Unfortunately, the command-line parameter can only control the route.
Choosing between a Good or Bad Ending is still done exclusively through
TH01's resident structure, and the continues_per_scene array in
particular. But if you pre-allocate that structure somehow and set one of
the members to a nonzero value, it would work. Trainers, anyone?
Alright, gotta get back to the code if I want to have any chance of
finishing this game before the 15th… Next up: The final 17
functions in REIIDEN.EXE that tie everything together and add
some more debug features on top.
Whew, TH01's boss code just had to end with another beast of a boss, taking
way longer than it should have and leaving uncomfortably little time for the
rest of the game. Let's get right into the overview of YuugenMagan, the most
sequential and scripted battle in this game:
The fight consists of 14 phases, numbered (of course) from 0 to 13.
Unlike all other bosses, the "entrance phase" 0 is a proper gameplay-enabled
part of the fight itself, which is why I also count it here.
YuugenMagan starts with 16 HP, second only to Sariel's 18+6. The HP bar
visualizes the HP threshold for the end of phases 3 (white part) and 7
(red-white part), respectively.
All even-numbered phases change the color of the 邪 kanji in the stage
background, and don't check for collisions between the Orb and any eye.
Almost all of them consequently don't feature an attack, except for phase
0's 1-pixel lasers, spawning symmetrically from the left and right edges of
the playfield towards the center. Which means that yes, YuugenMagan is in
fact invincible during this first attack.
All other attacks are part of the odd-numbered phases:
Phase 1: Slow pellets from the lateral eyes. Ends
at 15 HP.
Phase 3: Missiles from the southern eyes, whose
angles first shift away from Reimu's tracked position and then towards
it. Ends at 12 HP.
Phase 5: Circular pellets sprayed from the lateral
eyes. Ends at 10 HP.
Phase 7: Another missile pattern, but this time
with both eyes shifting their missile angles by the same
(counter-)clockwise delta angles. Ends at 8 HP.
Phase 9: The 3-pixel 3-laser sequence from the
northern eye. Ends at 2 HP.
Phase 11: Spawns the pentagram with one corner out
of every eye, then gradually shrinks and moves it towards the center of
the playfield. Not really an "attack" (surprise) as the pentagram can't
reach the player during this phase, but collision detection is
technically already active here. Ends at 0 HP, marking the earliest
point where the fight itself can possibly end.
Phase 13: Runs through the parallel "pentagram
attack phases". The first five consist of the pentagram alternating its
spinning direction between clockwise and counterclockwise while firing
pellets from each of the five star corners. After that, the pentagram
slams itself into the player, before YuugenMagan loops back to phase
10 to spawn a new pentagram. On the next run through phase 13, the
pentagram grows larger and immediately slams itself into the player,
before starting a new pentagram attack phase cycle with another loop
back to phase 10.
Since the HP bar fills up in a phase with no collision detection,
YuugenMagan is immune to
📝 test/debug mode heap corruption. It's
generally impossible to get YuugenMagan's HP into negative numbers, with
collision detection being disabled every other phase, and all odd-numbered
phases ending immediately upon reaching their HP threshold.
All phases until the very last one have a timeout condition, independent
from YuugenMagan's current HP:
Phase 0: 331 frames
Phase 1: 1101 frames
Phases 2, 4, 6, 8, 10, and 12: 70 frames each
Phases 3 and 7: 5 iterations of the pattern, or
1845 frames each
Phase 5: 5 iterations of the pattern, or 2230
frames
Phase 9: The full duration of the sequence, or 491
frames
Phase 11: Until the pentagram reached its target
position, or 221 frames
This makes it possible to reach phase 13 without dealing a single point of
damage to YuugenMagan, after almost exactly 2½ minutes on any difficulty.
Your actual time will certainly be higher though, as you will have to
HARRY UP at least once during the attempt.
And let's be real, you're very likely to subsequently lose a
life.
At a pixel-perfect 81×61 pixels, the Orb hitboxes are laid out rather
generously this time, reaching quite a bit outside the 64×48 eye sprites:
And that's about the only positive thing I can say about a position
calculation in this fight. Phase 0 already starts with the lasers being off
by 1 pixel from the center of the iris. Sure, 28 may be a nicer number to
add than 29, but the result won't be byte-aligned either way? This is
followed by the eastern laser's hitbox somehow being 24 pixels larger than
the others, stretching a rather unexpected 70 pixels compared to the 46 of
every other laser.
On a more hilarious note, the eye closing keyframe contains the following
(pseudo-)code, comprising the only real accidentally "unused" danmaku
subpattern in TH01:
// Did you mean ">= RANK_HARD"?
if(rank == RANK_HARD) {
eye_north.fire_aimed_wide_5_spread();
eye_southeast.fire_aimed_wide_5_spread();
eye_southwest.fire_aimed_wide_5_spread();
// Because this condition can never be true otherwise.
// As a result, no pellets will be spawned on Lunatic mode.
// (There is another Lunatic-exclusive subpattern later, though.)
if(rank == RANK_LUNATIC) {
eye_west.fire_aimed_wide_5_spread();
eye_east.fire_aimed_wide_5_spread();
}
}
Featuring the weirdly extended hitbox for the eastern laser, as well as
an initial Reimu position that points out the disparity between
byte-aligned rendering and the internal coordinates one final time.
After a few utility functions that look more like a quickly abandoned
refactoring attempt, we quickly get to the main attraction: YuugenMagan
combines the entire boss script and most of the pattern code into a single
2,634-instruction function, totaling 9,677 bytes inside
REIIDEN.EXE. For comparison, ReC98's version of this code
consists of at least 49 functions, excluding those I had to add to work
around ZUN's little inconsistencies, or the ones I added for stylistic
reasons.
In fact, this function is so large that Turbo C++ 4.0J refuses to generate
assembly output for it via the -S command-line option, aborting
with a Compiler table limit exceeded in function error.
Contrary to what the Borland C++ 4.0 User Guide suggests, this
instance of the error is not at all related to the number of function bodies
or any metric of algorithmic complexity, but is simply a result of the
compiler's internal text representation for a single function overflowing a
64 KiB memory segment. Merely shortening the names of enough identifiers
within the function can help to get that representation down below 64 KiB.
If you encounter this error during regular software development, you might
interpret it as the compiler's roundabout way of telling you that it inlined
way more function calls than you probably wanted to have inlined. Because
you definitely won't explicitly spell out such a long function
in newly-written code, right?
At least it wasn't the worst copy-pasting job in this
game; that trophy still goes to 📝 Elis. And
while the tracking code for adjusting an eye's sprite according to the
player's relative position is one of the main causes behind all the bloat,
it's also 100% consistent, and might have been an inlined class method in
ZUN's original code as well.
The clear highlight in this fight though? Almost no coordinate is
precisely calculated where you'd expect it to be. In particular, all
bullet spawn positions completely ignore the direction the eyes are facing
to:
Combining the bottom of the pupil with the exact horizontal
center of the sprite as a whole might sound like a good idea, but looks
especially wrong if the eye is facing right.Here it's the other way round: OK for a right-facing eye, really
wrong for a left-facing one.Dude, the eye is even supposed to track the laser in this
one!Hint: That's not the center of the playfield. At least the
pellets spawned from the corners are sort of correct, but with the corner
calculates precomputed, you could only get them wrong on
purpose.
Due to their effect on gameplay, these inaccuracies can't even be called
"bugs", and made me devise a new "quirk" category instead. More on that in
the TH01 100% blog post, though.
While we did see an accidentally unused bullet pattern earlier, I can
now say with certainty that there are no truly unused danmaku
patterns in TH01, i.e., pattern code that exists but is never called.
However, the code for YuugenMagan's phase 5 reveals another small piece of
danmaku design intention that never shows up within the parameters of
the original game.
By default, pellets are clipped when they fly past the top of the playfield,
which we can clearly observe for the first few pellets of this pattern.
Interestingly though, the second subpattern actually configures its pellets
to fall straight down from the top of the playfield instead. You never see
this happening in-game because ZUN limited that subpattern to a downwards
angle range of 0x73 or 162°, resulting in none of its pellets
ever getting close to the top of the playfield. If we extend that range to a
full 360° though, we can see how ZUN might have originally planned the
pattern to end:
YuugenMagan's phase 5 patterns on every difficulty, with the
second subpattern extended to reveal the different pellet behavior that
remained in the final game code. In the original game, the eyes would stop
spawning bullets on the marked frame.
If we also disregard everything else about YuugenMagan that fits the
upcoming definition of quirk, we're left with 6 "fixable" bugs, all
of which are a symptom of general blitting and unblitting laziness. Funnily
enough, they can all be demonstrated within a short 9-second part of the
fight, from the end of phase 9 up until the pentagram starts spinning in
phase 13:
General flickering whenever any sprite overlaps an eye. This is caused
by only reblitting each eye every 3 frames, and is an issue all throughout
the fight. You might have already spotted it in the videos above.
Each of the two lasers is unblitted and blitted individually instead of
each operation being done for both lasers together. Remember how
📝 ZUN unblits 32 horizontal pixels for every row of a line regardless of its width?
That's why the top part of the left, right-moving laser is never visible,
because it's blitted before the other laser is unblitted.
ZUN forgot to unblit the lasers when phase 9 ends. This footage was
recorded by pressing ↵ Return in test mode (game t or
game d), and it's probably impossible to achieve this during
actual gameplay without TAS techniques. You would have to deal the required
6 points of damage within 491 frames, with the eye being invincible during
240 of them. Simply shooting up an Orb with a horizontal velocity of 0 would
also only work a single time, as boss entities always repel the Orb with a
horizontal velocity of ±4.
The shrinking pentagram is unblitted after the eyes were blitted,
adding another guaranteed frame of flicker on top of the ones in 1). Like in
2), the blockiness of the holes is another result of unblitting 32 pixels
per row at a time.
Another missing unblitting call in a phase transition, as the pentagram
switches from its not quite correctly interpolated shrunk form to a regular
star polygon with a radius of 64 pixels. Indirectly caused by the massively
bloated coordinate calculation for the shrink animation being done
separately for the unblitting and blitting calls. Instead of, y'know, just
doing it once and storing the result in variables that can later be
reused.
The pentagram is not reblitted at all during the first 100 frames of
phase 13. During that rather long time, it's easily possible to remove
it from VRAM completely by covering its area with player shots. Or HARRY UP pellets.
Definitely an appropriate end for this game's entity blitting code.
I'm really looking forward to writing a
proper sprite system for the Anniversary Edition…
And just in case you were wondering about the hitboxes of these pentagrams
as they slam themselves into Reimu:
62 pixels on the X axis, centered around each corner point of the star, 16
pixels below, and extending infinitely far up. The latter part becomes
especially devious because the game always collision-detects
all 5 corners, regardless of whether they've already clipped through
the bottom of the playfield. The simultaneously occurring shape distortions
are simply a result of the line drawing function's rather poor
re-interpolation of any line that runs past the 640×400 VRAM boundaries;
📝 I described that in detail back when I debugged the shootout laser crash.
Ironically, using fixed-size hitboxes for a variable-sized pentagram means
that the larger one is easier to dodge.
The final puzzle in TH01's boss code comes
📝 once again in the form of weird hardware
palette changes. The 邪 kanji on the background
image goes through various colors throughout the fight, which ZUN
implemented by gradually incrementing and decrementing either a single one
or none of the color's three 4-bit components at the beginning of each
even-numbered phase. The resulting color sequence, however, doesn't
quite seem to follow these simple rules:
Phase 0: #DD5邪
Phase 2: #0DF邪
Phase 4: #F0F邪
Phase 6: #00F邪, but at the
end of the phase?!
Phase 8: #0FF邪, at the start
of the phase, #0F5邪, at the end!?
Phase 10: #FF5邪, at the start of
the phase, #F05邪, at the end
Second repetition of phase 12: #005邪
shortly after the start of the phase?!
Adding some debug output sheds light on what's going on there:
Since each iteration of phase 12 adds 63 to the red component, integer
overflow will cause the color to infinitely alternate between dark-blue
and red colors on every 2.03 iterations of the pentagram phase loop. The
65th iteration will therefore be the first one with a dark-blue color
for a third iteration in a row – just in case you manage to stall the
fight for that long.
Yup, ZUN had so much trust in the color clamping done by his hardware
palette functions that he did not clamp the increment operation on the
stage_palette itself. Therefore, the 邪
colors and even the timing of their changes from Phase 6 onwards are
"defined" by wildly incrementing color components beyond their intended
domain, so much that even the underlying signed 8-bit integer ends up
overflowing. Given that the decrement operation on the
stage_paletteis clamped though, this might be another
one of those accidents that ZUN deliberately left in the game,
📝 similar to the conclusion I reached with infinite bumper loops.
But guess what, that's also the last time we're going to encounter this type
of palette component domain quirk! Later games use master.lib's 8-bit
palette system, which keeps the comfort of using a single byte per
component, but shifts the actual hardware color into the top 4 bits, leaving
the bottom 4 bits for added precision during fades.
OK, but now we're done with TH01's bosses! 🎉That was the
8th PC-98 Touhou boss in total, leaving 23 to go.
With all the necessary research into these quirks going well into a fifth
push, I spent the remaining time in that one with transferring most of the
data between YuugenMagan and the upcoming rest of REIIDEN.EXE
into C land. This included the one piece of technical debt in TH01 we've
been carrying around since March 2015, as well as the final piece of the
ending sequence in FUUIN.EXE. Decompiling that executable's
main() function in a meaningful way requires pretty much all
remaining data from REIIDEN.EXE to also be moved into C land,
just in case you were wondering why we're stuck at 99.46% there.
On a more disappointing note, the static initialization code for the
📝 5 boss entity slots ultimately revealed why
YuugenMagan's code is as bloated and redundant as it is: The 5 slots really
are 5 distinct variables rather than a single 5-element array. That's why
ZUN explicitly spells out all 5 eyes every time, because the array he could
have just looped over simply didn't exist. 😕 And while these slot variables
are stored in a contiguous area of memory that I could just have
taken the address of and then indexed it as if it were an array, I
didn't want to annoy future port authors with what would technically be
out-of-bounds array accesses for purely stylistic reasons. At least it
wasn't that big of a deal to rewrite all boss code to use these distinct
variables, although I certainly had to get a bit creative with Elis.
Next up: Finding out how many points we got in totle, and hoping that ZUN
didn't hide more unexpected complexities in the remaining 45 functions of
this game. If you have to spare, there are two ways
in which that amount of money would help right now:
I'm expecting another subscription transaction
from Yanga before the 15th, which would leave to
round out one final TH01 RE push. With that, there'd be a total of 5 left in
the backlog, which should be enough to get the rest of this game done.
I really need to address the performance and usability issues
with all the small videos in this blog. Just look at the video immediately
above, where I disabled the controls because they would cover the debug text
at the bottom… Edit (2022-10-31):… which no longer is an
issue with our 📝 custom video player.
I already reserved this month's anonymous contribution for this work, so it would take another to be turned into a full push.
Oh look, it's another rather short and straightforward boss with a rather
small number of bugs and quirks. Yup, contrary to the character's
popularity, Mima's premiere is really not all that special in terms of code,
and continues the trend established with
📝 Kikuri and
📝 SinGyoku. I've already covered
📝 the initial sprite-related bugs last November,
so this post focuses on the main code of the fight itself. The overview:
The TH01 Mima fight consists of 3 phases, with phases 1 and 3 each
corresponding to one half of the 12-HP bar.
📝 Just like with SinGyoku, the distinction
between the red-white and red parts is purely visual once again, and doesn't
reflect anything about the boss script. As usual, all of the phases have to
be completed in order.
Phases 1 and 3 cycle through 4 danmaku patterns each, for a total of 8.
The cycles always start on a fixed pattern.
3 of the patterns in each phase feature rotating white squares, thus
introducing a new sprite in need of being unblitted.
Phase 1 additionally features the "hop pattern" as the last one in its
cycle. This is the only pattern where Mima leaves the seal in the center of
the playfield to hop from one edge of the playfield towards the other, while
also moving slightly higher up on the Y axis, and staying on the final
position for the next pattern cycle. For the first time, Mima selects a
random starting edge, which is then alternated on successive cycles.
Since the square entities are local to the respective pattern function,
Phase 1 can only end once the current pattern is done, even if Mima's HP are
already below 6. This makes Mima susceptible to the
📝 test/debug mode HP bar heap corruption bug.
Phase 2 simply consists of a spread-in teleport back to Mima's initial
position in the center of the playfield. This would only have been strictly
necessary if phase 1 ended on the hop pattern, but is done regardless of the
previous pattern, and does provide a nice visual separation between the two
main phases.
That's it – nothing special in Phase 3.
And there aren't even any weird hitboxes this time. What is maybe
special about Mima, however, is how there's something to cover about all of
her patterns. Since this is TH01, it's won't surprise anyone that the
rotating square patterns are one giant copy-pasta of unblitting, updating,
and rendering code. At least ZUN placed the core polar→Cartesian
transformation in a separate function for creating regular polygons
with an arbitrary number of sides, which might hint toward some more varied
shapes having been planned at one point?
5 of the 6 patterns even follow the exact same steps during square update
frames:
Calculate square corner coordinates
Unblit the square
Update the square angle and radius
Use the square corner coordinates for spawning pellets or missiles
Recalculate square corner coordinates
Render the square
Notice something? Bullets are spawned before the corner coordinates
are updated. That's why their initial positions seem to be a bit off – they
are spawned exactly in the corners of the square, it's just that it's
the square from 8 frames ago.
Mima's first pattern on Normal difficulty.
Once ZUN reached the final laser pattern though, he must have noticed that
there's something wrong there… or maybe he just wanted to fire those
lasers independently from the square unblit/update/render timer for a
change. Spending an additional 16 bytes of the data segment for conveniently
remembering the square corner coordinates across frames was definitely a
decent investment.
When Mima isn't shooting bullets from the corners of a square or hopping
across the playfield, she's raising flame pillars from the bottom of the playfield within very specifically calculated
random ranges… which are then rendered at byte-aligned VRAM positions, while
collision detection still uses their actual pixel position. Since I don't
want to sound like a broken record all too much, I'll just direct you to
📝 Kikuri, where we've seen the exact same issue with the teardrop ripple sprites.
The conclusions are identical as well.
Mima's flame pillar pattern. This video was recorded on a particularly
unlucky seed that resulted in great disparities between a pillar's
internal X coordinate and its byte-aligned on-screen appearance, leading
to lots of right-shifted hitboxes.
Also note how the change from the meteor animation to the three-arm 🚫
casting sprite doesn't unblit the meteor, and leaves that job to
any sprite that happens to fly over those pixels.
However, I'd say that the saddest part about this pattern is how choppy it
is, with the circle/pillar entities updating and rendering at a meager 7
FPS. Why go that low on purpose when you can just make the game render ✨
smoothly ✨ instead?
So smooth it's almost uncanny.
The reason quickly becomes obvious: With TH01's lack of optimization, going
for the full 56.4 FPS would have significantly slowed down the game on its
intended 33 MHz CPUs, requiring more than cheap surface-level ASM
optimization for a stable frame rate. That might very well have been ZUN's
reason for only ever rendering one circle per frame to VRAM, and designing
the pattern with these time offsets in mind. It's always been typical for
PC-98 developers to target the lowest-spec models that could possibly still
run a game, and implementing dynamic frame rates into such an engine-less
game is nothing I would wish on anybody. And it's not like TH01 is
particularly unique in its choppiness anyway; low frame rates are actually a
rather typical part of the PC-98 game aesthetic.
The final piece of weirdness in this fight can be found in phase 1's hop
pattern, and specifically its palette manipulation. Just from looking at the
pattern code itself, each of the 4 hops is supposed to darken the hardware
palette by subtracting #444 from every color. At the last hop,
every color should have therefore been reduced to a pitch-black
#000, leaving the player completely blind to the movement of
the chasing pellets for 30 frames and making the pattern quite ghostly
indeed. However, that's not what we see in the actual game:
Nothing in the pattern's code would cause the hardware palette to get
brighter before the end of the pattern, and yet…
The expected version doesn't look all too unfair, even on Lunatic…
well, at least at the default rank pellet speed shown in this
video. At maximum pellet speed, it is in fact rather brutal.
Looking at the frame counter, it appears that something outside the
pattern resets the palette every 40 frames. The only known constant with a
value of 40 would be the invincibility frames after hitting a boss with the
Orb, but we're not hitting Mima here…
But as it turns out, that's exactly where the palette reset comes from: The
hop animation darkens the hardware palette directly, while the
📝 infamous 12-parameter boss collision handler function
unconditionally resets the hardware palette to the "default boss palette"
every 40 frames, regardless of whether the boss was hit or not. I'd classify
this as a bug: That function has no business doing periodic hardware palette
resets outside the invincibility flash effect, and it completely defies
common sense that it does.
That explains one unexpected palette change, but could this function
possibly also explain the other infamous one, namely, the temporary green
discoloration in the Konngara fight? That glitch comes down to how the game
actually uses two global "default" palettes: a default boss
palette for undoing the invincibility flash effect, and a default
stage palette for returning the colors back to normal at the end of
the bomb animation or when leaving the Pause menu. And sure enough, the
stage palette is the one with the green color, while the boss
palette contains the intended colors used throughout the fight. Sending the
latter palette to the graphics chip every 40 frames is what corrects
the discoloration, which would otherwise be permanent.
The green color comes from BOSS7_D1.GRP, the scrolling
background of the entrance animation. That's what turns this into a clear
bug: The stage palette is only set a single time in the entire fight,
at the beginning of the entrance animation, to the palette of this image.
Apart from consistency reasons, it doesn't even make sense to set the stage
palette there, as you can't enter the Pause menu or bomb during a blocking
animation function.
And just 3 lines of code later, ZUN loads BOSS8_A1.GRP, the
main background image of the fight. Moving the stage palette assignment
there would have easily prevented the discoloration.
But yeah, as you can tell, palette manipulation is complete jank in this
game. Why differentiate between a stage and a boss palette to begin with?
The blocking Pause menu function could have easily copied the original
palette to a local variable before darkening it, and then restored it after
closing the menu. It's not so easy for bombs as the intended palette could
change between the start and end of the animation, but the code could have
still been simplified a lot if there was just one global "default palette"
variable instead of two. Heck, even the other bosses who manipulate their
palettes correctly only do so because they manually synchronize the two
after every change. The proper defense against bugs that result from wild
mutation of global state is to get rid of global state, and not to put up
safety nets hidden in the middle of existing effect code.
The easiest way of reproducing the green discoloration bug in
the TH01 Konngara fight, timed to show the maximum amount of time the
discoloration can possibly last.
In any case, that's Mima done! 7th PC-98 Touhou boss fully
decompiled, 24 bosses remaining, and 59 functions left in all of TH01.
In other thrilling news, my call for secondary funding priorities in new
TH01 contributions has given us three different priorities so far. This
raises an interesting question though: Which of these contributions should I
now put towards TH01 immediately, and which ones should I leave in the
backlog for the time being? Since I've never liked deciding on priorities,
let's turn this into a popularity contest instead: The contributions with
the least popular secondary priorities will go towards TH01 first, giving
the most popular priorities a higher chance to still be left over after TH01
is done. As of this delivery, we'd have the following popularity order:
TH05 (1.67 pushes), from T0182
Seihou (1 push), from T0184
TH03 (0.67 pushes), from T0146
Which means that T0146 will be consumed for TH01 next, followed by T0184 and
then T0182. I only assign transactions immediately before a delivery though,
so you all still have the chance to change up these priorities before the
next one.
Next up: The final boss of TH01 decompilation, YuugenMagan… if the current
or newly incoming TH01 funds happen to be enough to cover the entire fight.
If they don't turn out to be, I will have to pass the time with some Seihou
work instead, missing the TH01 anniversary deadline as a result.Edit (2022-07-18): Thanks to Yanga for
securing the funding for YuugenMagan after all! That fight will feature
slightly more than half of all remaining code in TH01's
REIIDEN.EXE and the single biggest function in all of PC-98
Touhou, let's go!
More than 50% of all PC-98 Touhou game code has now been
reverse-engineered! 🎉 While this number isn't equally distributed among the
games, we've got one game very close to 100% and reverse-engineered most of
the core features of two others. During the last 32 months of continuous
funding, I've averaged an overall speed of 1.11% total RE per month. That
looks like a decent prediction of how much more time it will take for 100%
across all games – unless, of course, I'd get to work towards some of the
non-RE goals in the meantime.
70 functions left in TH01, with less than 10,000 ASM instructions
remaining! Due to immense hype, I've temporarily raised the cap by 50% until
August 15. With the last TH01 pushes delivering at roughly 1.5× of the
currently calculated average speed, that should be more than enough to get
TH01 done – especially since I expect YuugenMagan to come with lots of
redundant code. Therefore, please also request a secondary priority for
these final TH01 RE contributions.
So, how did this card-flipping stage obstacle delivery get so horribly
delayed? With all the different layouts showcased in the 28 card-flipping
stages, you'd expect this to be among the more stable and bug-free parts of
the codebase. Heck, with all stage objects being placed on a 32×32-pixel
grid, this is the first TH01-related blog post this year that doesn't have
to describe an alignment-related unblitting glitch!
That alone doesn't mean that this code is free from quirky behavior though,
and we have to look no further than the first few lines of the collision
handling for round bumpers to already find a whole lot of that. Simplified,
they do the following:
Immediately, you wonder why these assignments only exist for the Y
coordinate. Sure, hitting a bumper from the left or right side should happen
less often, but it's definitely possible. Is it really a good idea to warp
the Orb to the top or bottom edge of a bumper regardless?
What's more important though: The fact that these immediate assignments
exist at all. The game's regular Orb physics work by producing a Y velocity
from the single force acting on the Orb and a gravity factor, and are
completely independent of its current Y position. A bumper collision does
also apply a new force onto the Orb further down in the code, but these
assignments still bypass the physics system and are bound to have
some knock-on effect on the Orb's movement.
To observe that effect, we just have to enter Stage 18 on the 地獄/Jigoku route, where it's particularly trivial to
reproduce. At a 📝 horizontal velocity of ±4,
these assignments are exactly what can cause the Orb to endlessly
bounce between two bumpers. As rudimentary as the Orb's physics may be, just
letting them do their work would have entirely prevented these loops:
One of at least three infinite bumper loop constellations within just
this 10×5-tile section of TH01's Stage 18 on the 地獄/Jigoku route. With an effective 56 horizontal
pixels between both hitboxes, the Orb would have to travel an absolute
Y distance of at least 16 vertical pixels within
(56 / 4) = 14 frames to escape the
other bumper's hitbox. If the initial bounce reduces the Orb's Y
velocity far enough for it to not manage that distance the first time,
it will never reach the necessary speed again. In this loop, the
bounce-off force even stabilizes, though this doesn't have to happen.
The blue areas indicate the pixel-perfect* hitboxes of each bumper.
TH01 bumper collision handling without ZUN's manual assignment of the Y
coordinate. The Orb still bounces back and forth between two bumpers
for a while, but its top position always follows naturally
from its Y velocity and the force applied to it, and gravity wins out
in the end. The blue areas indicate the pixel-perfect* hitboxes of each bumper.
Now, you might be thinking that these Y assignments were just an attempt to
prevent the Orb from colliding with the same bumper again on the next frame.
After all, those 24 pixels exactly correspond to ⅓ of the height of a
bumper's hitbox with an additional pixel added on top. However, the game
already perfectly prevents repeated collisions by turning off collision
testing with the same bumper for the next 7 frames after a collision. Thus,
we can conclude that ZUN either explicitly coded bumper collision handling
to facilitate these loops, or just didn't take out that code after
inevitably discovering what it did. This is not janky code, it's not a
glitch, it's not sarcasm from my end, and it's not the game's physics being
bad.
But wait. Couldn't these assignments just be a remnant from a time in
development before ZUN decided on the 7-frame delay on further
collisions? Well, even that explanation stops holding water after the next
few lines of code. Simplified, again:
What's important here is the part that's not in the code – namely,
anything that handles X velocities of -8 or +8. In those cases, the Orb
simply continues in the same horizontal direction. The manual Y assignment
is the only part of the code that actually prevents a collision there, as
the newly applied force is not guaranteed to be enough:
An infinite loop across three bumpers, made possible by the edge of the
playfield and bumper bars on opposite sides, an unchanged horizontal
direction, and the Y assignments neatly placing the Orb on either the
top or bottom side of a bumper. The alternating sign of the force
further ensures that the Orb will travel upwards half the time,
canceling out gravity during the short time between two hitboxes.
With the unchanged horizontal direction and the Y assignments removed,
nothing keeps an Orb at ±8 pixels per frame from flying into/over a
bumper. The collision force pushes the Orb slightly, but not enough to
truly matter. The final force sends the Orb on a significant downward
trajectory beyond the next bumper's hitbox, breaking the original loop.
Forgetting to handle ⅖ of your discrete X velocity cases is simply not
something you do by accident. So we might as well say that ZUN deliberately
designed the game to behave exactly as it does in this regard.
Bumpers also come in vertical or horizontal bar shapes. Their collision
handling also turns off further collision testing for the next 7 frames, and
doesn't do any manual coordinate assignment. That's definitely a step up in
cleanliness from round bumpers, but it doesn't seem to keep in mind that the
player can fire a new shot every 4 frames when standing still. That makes it
immediately obvious why this works:
The green numbers show the amount of
frames since the last detected collision with the respective bumper bar,
and indicate that collision testing with the bar below is currently
disabled.
That's the most well-known case of reducing the Orb's horizontal velocity to
0 by exactly hitting it with shots in its center and then button-mashing it
through a horizontal bar. This also works with vertical bars and yields even
more interesting results there, but if we want to have any chance of
understanding what happens there, we have to first go over some basics:
Collision detection for all stage obstacles is done in row-major
order from the top-left to the bottom-right corner of the
playfield.
All obstacles are collision-tested independently from each other, with
the collision response code immediately following the test.
The hitboxes for bumper bars extend far past their 32×32 sprites to make
sure that the Orb can collide with them from any side. They are a
pixel-perfect* 87×56 pixels for horizontal bars, and 57×87 pixels for
vertical ones. Yes, that's no typo, they really do differ in one pixel.
Changing the Y velocity during such a collision just involves applying a
new force with the magnitude of the negated current Y velocity, which can be
done multiple times during a frame without changing the result. This
explains why the force is correctly inverted in the clip above, despite the
Orb colliding with two bumpers simultaneously.
Lacking a similar force system, the X coordinate is simply directly
inverted.
However, if that were everything the game did, kicking the Orb into a column
of vertical bumper bars would lead them to behave more like a rope that the
Orb can climb, as the initial collision with two hitboxes cancels out the
intended sign change that reflects the Orb away from the bars:
This footage was recorded without the workaround I am about to describe.
It does not reflect the behavior of the original game. You
cannot do this in the original game.
While the visualization reveals small sections where three hitboxes
overlap, the Orb can never actually collide with three of them at the
same time, as those 3-hitbox regions are 2 pixels smaller than they
would need to be to fit the Orb. That's exactly the difference between
using < rather than <= in these hitbox
comparisons.
While that would have been a fun gameplay mechanic on its own, it
immediately breaks apart once you place two vertical bumper bars next to
each other. Due to how these bumper bar hitboxes extend past their sprites,
any two adjacent vertical bars will end up with the exact same hitbox in
absolute screen coordinates. Stage 17 on the
魔界/Makai route contains exactly such a layout:
The collision handlers of adjacent vertical bars always activate in the
same frame, independently invert the Orb's X velocity, and therefore
fully cancel out their intended effect on the Orb… if the game did not
have the workaround I am about to describe. This cannot happen
in the original game.
ZUN's workaround: Setting a "vertical bumper bar block flag" after any
collision with such a bar, which simply disables any collision with
any vertical bar for the next 7 frames. This quick hack made all
vertical bars work as intended, and avoided the need for involving the Orb's
X velocity in any kind of physics system.
Edit (2022-07-12): This flag only works around glitches
that would be caused by simultaneously colliding with more than one vertical
bar. The actual response to a bumper bar collision still remains unaffected,
and is very naive:
Horizontal bars always invert the Orb's Y velocity
Vertical bars invert either the Y or X velocity depending on whether
the Orb's current X velocity is 0 (Y) or not (X)
These conditions are only correct if the Orb comes in at an angle roughly
between 45° and 135° on either side of a bar. If it's anywhere close to 0°
or 180°, this response will be incorrect, and send the Orb straight
through the bar. Since the large hitboxes make this easily possible, you can
still get the Orb to climb a vertical column, or glide along a horizontal
row:
Here's the hitbox overlay for
地獄/Jigoku Stage 19, and here's an updated
version of the 📝 Orb physics debug mod that
now also shows bumper bar collision frame numbers:
2022-07-10-TH01OrbPhysicsDebug.zip
See the th01_orb_debug
branch for the code. To use it, simply replace REIIDEN.EXE, and
run the game in debug mode, via game d on the DOS prompt. If you
encounter a gameplay situation that doesn't seem to be covered by this blog
post, you can now verify it for yourself. Thanks to touhou-memories for bringing these
issues to my attention! That definitely was a glaring omission from the
initial version of this blog post.
With that clarified, we can now try mashing the Orb into these two vertical
bars:
At first, that workaround doesn't seem to make a difference here. As we
expect, the frame numbers now tell us that only one of the two bumper bars
in a row activates, but we couldn't have told otherwise as the number of
bars has no effect on newly applied Y velocity forces. On a closer look, the
Orb's rise to the top of the playfield is in fact caused by that
workaround though, combined with the unchanged top-to-bottom order of
collision testing. As soon as any bumper bar completed its 7
collision delay frames, it resets the aforementioned flag, which already
reactivates collision handling for any remaining vertical bumper bars during
the same frame. Look out for frames with both a 7 and a 1, like the one marked in the video above:
The 7 will always appear before
the 1 in the row-major order. Whenever
this happens, the current oscillation period is cut down from 7 to 6
frames – and because collision testing runs from top to bottom, this will
always happen during the falling part. Depending on the Y velocity, the
rising part may also be cut down to 6 frames from time to time, but that one
at least has a chance to last for the full 7 frames. This difference
adds those crucial extra frames of upward movement, which add up to send the
Orb to the top. Without the flag, you'd always see the Orb oscillating
between a fixed range of the bar column.
Finally, it's the "top of playfield" force that gradually slows down the Orb
and makes sure it ultimately only moves at sub-pixel velocities, which have
no visible effect. Because
📝 the regular effect of gravity is reset with
each newly applied force, it's completely negated during most of the climb.
This even holds true once the Orb reached the top: Since the Orb requires a
negative force to repeatedly arrive up there and be bounced back, this force
will stay active for the first 5 of the 7 collision frames and not move the
Orb at all. Once gravity kicks in at the 5th frame and adds 1 to
the Y velocity, it's already too late: The new velocity can't be larger than
0.5, and the Orb only has 1 or 2 frames before the flag reset causes it to
be bounced back up to the top again.
Portals, on the other hand, turn out to be much simpler than the old
description that ended up on Touhou Wiki in October 2005 might suggest.
Everything about their teleportations is random: The destination portal, the
exit force (as an integer between -9 and +9), as well as the exit X
velocity, with each of the
📝 5 distinct horizontal velocities having an
equal chance of being chosen. Of course, if the destination portal is next
to the left or right edge of the playfield and it chooses to fire the Orb
towards that edge, it immediately bounces off into the opposite direction,
whereas the 0 velocity is always selected with a constant 20% probability.
The selection process for the destination portal involves a bit more than a
single rand() call. The game bundles all obstacles in a single
structure of dynamically allocated arrays, and only knows how many obstacles
there are in total, not per type. Now, that alone wouldn't have much
of an impact on random portal selection, as you could simply roll a random
obstacle ID and try again if it's not a portal. But just to be extra cute,
ZUN instead iterates over all obstacles, selects any non-entered portal with
a chance of ¼, and just gives up if that dice roll wasn't successful after
16 loops over the whole array, defaulting to the entered portal in that
case.
In all its silliness though, this works perfectly fine, and results in a
chance of 0.7516(𝑛 - 1) for the Orb exiting out of the
same portal it entered, with 𝑛 being the total number of portals in a
stage. That's 1% for two portals, and 0.01% for three. Pretty decent for a
random result you don't want to happen, but that hurts nobody if it does.
The one tiny ZUN bug with portals is technically not even part of the newly
decompiled code here. If Reimu gets hit while the Orb is being sent through
a portal, the Orb is immediately kicked out of the portal it entered, no
matter whether it already shows up inside the sprite of the destination
portal. Neither of the two portal sprites is reset when this happens,
leading to "two Orbs" being visible simultaneously.
This makes very little sense no matter how you look at it. The Orb doesn't
receive a new velocity or force when this happens, so it will simply
re-enter the same portal once the gameplay resumes on Reimu's next life:
That left another ½ of a push over at the end. Way too much time to finish
FUUIN.exe, way too little time to start with Mima… but the bomb
animation fit perfectly in there. No secrets or bugs there, just a bunch of
sprite animation code wasting at least another 82 bytes in the data segment.
The special effect after the kuji-in sprites uses the same single-bitplane
32×32 square inversion effect seen at the end of Kikuri's and Sariel's
entrance animation, except that it's a 3-stack of 16-rings moving at 6, 7,
and 8 pixels per frame respectively. At these comparatively slow speeds, the
byte alignment of each square adds some further noise to the discoloration
pattern… if you even notice it below all the shaking and seizure-inducing
hardware palette manipulation.
And yes, due to the very destructive nature of the effect, the game does in
fact rely on it only being applied to VRAM page 0. While that will cause
every moving sprite to tear holes into the inverted squares along its
trajectory, keeping a clean playfield on VRAM page 1 is what allows all that
pixel damage to be easily undone at the end of this 89-frame animation.
Next up: Mima! Let's hope that stage obstacles already were the most complex
part remaining in TH01…
It only took a record-breaking 1½ pushes to get SinGyoku done!
No 📝 entity synchronization code after
all! Since all of SinGyoku's sprites are 96×96 pixels, ZUN made the rather
smart decision of just using the sphere entity's position to render the
📝 flash and person entities – and their only
appearance is encapsulated in a single sphere→person→sphere transformation
function.
Just like Kikuri, SinGyoku's code as a whole is not a complete
disaster.
The negative:
It's still exactly as buggy as Kikuri, with both of the ZUN bugs being
rendering glitches in a single function once again.
It also happens to come with a weird hitbox, …
… and some minor questionable and weird pieces of code.
The overview:
SinGyoku's fight consists of 2 phases, with the first one corresponding
to the white part from 8 to 6 HP, and the second one to the rest of the HP
bar. The distinction between the red-white and red parts is purely visual,
and doesn't reflect anything about the boss script.
Both phases cycle between a pellet pattern and SinGyoku's sphere form
slamming itself into the player, followed by it slightly overshooting its
intended base Y position on its way back up.
Phase 1 only consists of the sphere form's half-circle spray pattern.
Technically, the phase can only end during that pattern, but adding
that one additional condition to allow it to end during the slam+return
"pattern" wouldn't have made a difference anyway. The code doesn't rule out
negative HP during the slam (have fun in test or debug mode), but the sum of
invincibility frames alone makes it impossible to hit SinGyoku 7 times
during a single slam in regular gameplay.
Phase 2 features two patterns for both the female and male forms
respectively, which are selected randomly.
This time, we're back to the Orb hitbox being a logical 49×49 pixels in
SinGyoku's center, and the shot hitbox being the weird one. What happens if
you want the shot hitbox to be both offset to the left a bit
and stretch the entire width of SinGyoku's sprite? You get a hitbox
that ends in mid-air, far away from the right edge of the sprite:
Due to VRAM byte alignment, all player shots fired between
gx = 376 and gx = 383 inclusive
appear at the same visual X position, but are internally already partly
outside the hitbox and therefore won't hit SinGyoku – compare the
marked shot at gx = 376 to the one at gx =
380. So much for precisely visualizing hitboxes in this game…
Since the female and male forms also use the sphere entity's coordinates,
they share the same hitbox.
Onto the rendering glitches then, which can – you guessed it – all be found
in the sphere form's slam movement:
ZUN unblits the delta area between the sphere's previous and current
position on every frame, but reblits the sphere itself on… only every second
frame?
For negative X velocities, ZUN made a typo and subtracted the Y velocity
from the right edge of the area to be unblitted, rather than adding the X
velocity. On a cursory look, this shouldn't affect the game all too
much due to the unblitting function's word alignment. Except when it does:
If the Y velocity is much smaller than the X one, the left edge of the
unblitted area can, on certain frames, easily align to a word address past
the previous right edge of the sphere. As a result, not a single sphere
pixel will actually be unblitted, and a small stripe of the sphere will be
left in VRAM for one frame, until the alignment has caught up with the
sphere's movement in the next one.
By having the sphere move from the right edge of the playfield to the
left, this video demonstrates both the lazy reblitting and broken
unblitting at the right edge for negative X velocities. Also, isn't it
funny how Reimu can partly disappear from all the sloppy
SinGyoku-related unblitting going on after her sprite was blitted?
Due to the low contrast of the sphere against the background, you typically
don't notice these glitches, but the white invincibility flashing after a
hit really does draw attention to them. This time, all of these glitches
aren't even directly caused by ZUN having never learned about the
EGC's bit length register – if he just wrote correct code for SinGyoku, none
of this would have been an issue. Sigh… I wonder how many more glitches will
be caused by improper use of this one function in the last 18% of
REIIDEN.EXE.
There's even another bug here, with ZUN hardcoding a horizontal delta of 8
pixels rather than just passing the actual X velocity. Luckily, the maximum
movement speed is 6 pixels on Lunatic, and this would have only turned into
an additional observable glitch if the X velocity were to exceed 24 pixels.
But that just means it's the kind of bug that still drains RE attention to
prove that you can't actually observe it in-game under some
circumstances.
The 5 pellet patterns are all pretty straightforward, with nothing to talk
about. The code architecture during phase 2 does hint towards ZUN having had
more creative patterns in mind – especially for the male form, which uses
the transformation function's three pattern callback slots for three
repetitions of the same pellet group.
There is one more oddity to be found at the very end of the fight:
Right before the defeat white-out animation, the sphere form is explicitly
reblitted for no reason, on top of the form that was blitted to VRAM in the
previous frame, and regardless of which form is currently active. If
SinGyoku was meant to immediately transform back to the sphere form before
being defeated, why isn't the person form unblitted before then? Therefore,
the visibility of both forms is undeniably canon, and there is some
lore meaning to be found here…
In any case, that's SinGyoku done! 6th PC-98 Touhou boss fully
decompiled, 25 remaining.
No FUUIN.EXE code rounding out the last push for a change, as
the 📝 remaining missile code has been
waiting in front of SinGyoku for a while. It already looked bad in November,
but the angle-based sprite selection function definitely takes the cake when
it comes to unnecessary and decadent floating-point abuse in this game.
The algorithm itself is very trivial: Even with
📝 .PTN requiring an additional quarter parameter to access 16×16 sprites,
it's essentially just one bit shift, one addition, and one binary
AND. For whatever reason though, ZUN casts the 8-bit missile
angle into a 64-bit double, which turns the following explicit
comparisons (!) against all possible 4 + 16 boundary angles (!!)
into FPU operations. Even with naive and readable
division and modulo operations, and the whole existence of this function not
playing well with Turbo C++ 4.0J's terrible code generation at all, this
could have been 3 lines of code and 35 un-inlined constant-time
instructions. Instead, we've got this 207-instruction monster… but hey, at
least it works. 🤷
The remaining time then went to YuugenMagan's initialization code, which
allowed me to immediately remove more declarations from ASM land, but more
on that once we get to the rest of that boss fight.
That leaves 76 functions until we're done with TH01! Next up: Card-flipping
stage obstacles.
What's this? A simple, straightforward, easy-to-decompile TH01 boss with
just a few minor quirks and only two rendering-related ZUN bugs? Yup, 2½
pushes, and Kikuri was done. Let's get right into the overview:
Just like 📝 Elis, Kikuri's fight consists
of 5 phases, excluding the entrance animation. For some reason though, they
are numbered from 2 to 6 this time, skipping phase 1? For consistency, I'll
use the original phase numbers from the source code in this blog post.
The main phases (2, 5, and 6) also share Elis' HP boundaries of 10, 6,
and 0, respectively, and are once again indicated by different colors in the
HP bar. They immediately end upon reaching the given number of HP, making
Kikuri immune to the
📝 heap corruption in test or debug mode that can happen with Elis and Konngara.
Phase 2 solely consists of the infamous big symmetric spiral
pattern.
Phase 3 fades Kikuri's ball of light from its default bluish color to bronze over 100 frames. Collision detection is deactivated
during this phase.
In Phase 4, Kikuri activates her two souls while shooting the spinning
8-pellet circles from the previously activated ball. The phase ends shortly
after the souls fired their third spread pellet group.
Note that this is a timed phase without an HP boundary, which makes
it possible to reduce Kikuri's HP below the boundaries of the next
phases, effectively skipping them. Take this video for example,
where Kikuri has 6 HP by the end of Phase 4, and therefore directly
starts Phase 6.
(Obviously, Kikuri's HP can also be reduced to 0 or below, which will
end the fight immediately after this phase.)
Phase 5 combines the teardrop/ripple "pattern" from the souls with the
"two crossed eye laser" pattern, on independent cycles.
Finally, Kikuri cycles through her remaining 4 patterns in Phase 6,
while the souls contribute single aimed pellets every 200 frames.
Interestingly, all HP-bounded phases come with an additional hidden
timeout condition:
Phase 2 automatically ends after 6 cycles of the spiral pattern, or
5,400 frames in total.
Phase 5 ends after 1,600 frames, or the first frame of the
7th cycle of the two crossed red lasers.
If you manage to keep Kikuri alive for 29 of her Phase 6 patterns,
her HP are automatically set to 1. The HP bar isn't redrawn when this
happens, so there is no visual indication of this timeout condition even
existing – apart from the next Orb hit ending the fight regardless of
the displayed HP. Due to the deterministic order of patterns, this
always happens on the 8th cycle of the "symmetric gravity
pellet lines from both souls" pattern, or 11,800 frames. If dodging and
avoiding orb hits for 3½ minutes sounds tiring, you can always watch the
byte at DS:0x1376 in your emulator's memory viewer. Once
it's at 0x1E, you've reached this timeout.
So yeah, there's your new timeout challenge.
The few issues in this fight all relate to hitboxes, starting with the main
one of Kikuri against the Orb. The coordinates in the code clearly describe
a hitbox in the upper center of the disc, but then ZUN wrote a < sign
instead of a > sign, resulting in an in-game hitbox that's not
quite where it was intended to be…
Kikuri's actual hitbox.
Since the Orb sprite doesn't change its shape, we can visualize the
hitbox in a pixel-perfect way here. The Orb must be completely within
the red area for a hit to be registered.
Much worse, however, are the teardrop ripples. It already starts with their
rendering routine, which places the sprites from TAMAYEN.PTN
at byte-aligned VRAM positions in the ultimate piece of if(…) {…}
else if(…) {…} else if(…) {…} meme code. Rather than
tracking the position of each of the five ripple sprites, ZUN suddenly went
purely functional and manually hardcoded the exact rendering and collision
detection calls for each frame of the animation, based on nothing but its
total frame counter.
Each of the (up to) 5 columns is also unblitted and blitted individually
before moving to the next column, starting at the center and then
symmetrically moving out to the left and right edges. This wouldn't be a
problem if ZUN's EGC-powered unblitting function didn't word-align its X
coordinates to a 16×1 grid. If the ripple sprites happen to start at an
odd VRAM byte position, their unblitting coordinates get rounded both down
and up to the nearest 16 pixels, thus touching the adjacent 8 pixels of the
previously blitted columns and leaving the well-known black vertical bars in
their place.
OK, so where's the hitbox issue here? If you just look at the raw
calculation, it's a slightly confusingly expressed, but perfectly logical 17
pixels. But this is where byte-aligned blitting has a direct effect on
gameplay: These ripples can be spawned at any arbitrary, non-byte-aligned
VRAM position, and collisions are calculated relative to this internal
position. Therefore, the actual hitbox is shifted up to 7 pixels to the
right, compared to where you would expect it from a ripple sprite's
on-screen position:
Due to the deterministic nature of this part of the fight, it's
always 5 pixels for this first set of ripples. These visualizations are
obviously not pixel-perfect due to the different potential shapes of
Reimu's sprite, so they instead relate to her 32×32 bounding box, which
needs to be entirely inside the red
area.
We've previously seen the same issue with the
📝 shot hitbox of Elis' bat form, where
pixel-perfect collision detection against a byte-aligned sprite was merely a
sidenote compared to the more serious X=Y coordinate bug. So why do I
elevate it to bug status here? Because it directly affects dodging: Reimu's
regular movement speed is 4 pixels per frame, and with the internal position
of an on-screen ripple sprite varying by up to 7 pixels, any micrododging
(or "grazing") attempt turns into a coin flip. It's sort of mitigated
by the fact that Reimu is also only ever rendered at byte-aligned
VRAM positions, but I wouldn't say that these two bugs cancel out each
other.
Oh well, another set of rendering issues to be fixed in the hypothetical
Anniversary Edition – obviously, the hitboxes should remain unchanged. Until
then, you can always memorize the exact internal positions. The sequence of
teardrop spawn points is completely deterministic and only controlled by the
fixed per-difficulty spawn interval.
Aside from more minor coordinate inaccuracies, there's not much of interest
in the rest of the pattern code. In another parallel to Elis though, the
first soul pattern in phase 4 is aimed on every difficulty except
Lunatic, where the pellets are once again statically fired downwards. This
time, however, the pattern's difficulty is much more appropriately
distributed across the four levels, with the simultaneous spinning circle
pellets adding a constant aimed component to every difficulty level.
Kikuri's phase 4 patterns, on every difficulty.
That brings us to 5 fully decompiled PC-98 Touhou bosses, with 26 remaining…
and another ½ of a push going to the cutscene code in
FUUIN.EXE.
You wouldn't expect something as mundane as the boss slideshow code to
contain anything interesting, but there is in fact a slight bit of
speculation fuel there. The text typing functions take explicit string
lengths, which precisely match the corresponding strings… for the most part.
For the "Gatekeeper 'SinGyoku'" string though, ZUN passed 23
characters, not 22. Could that have been the "h" from the Hepburn
romanization of 神玉?!
Also, come on, if this text is already blitted to VRAM for no reason,
you could have gone for perfect centering at unaligned byte positions; the
rendering function would have perfectly supported it. Instead, the X
coordinates are still rounded up to the nearest byte.
The hardcoded ending cutscene functions should be even less interesting –
don't they just show a bunch of images followed by frame delays? Until they
don't, and we reach the 地獄/Jigoku Bad Ending with
its special shake/"boom" effect, and this picture:
Picture #2 from ED2A.GRP.
Which is rendered by the following code:
for(int i = 0; i <= boom_duration; i++) { // (yes, off-by-one)
if((i & 3) == 0) {
graph_scrollup(8);
} else {
graph_scrollup(0);
}
end_pic_show(1); // ← different picture is rendered
frame_delay(2); // ← blocks until 2 VSync interrupts have occurred
if(i & 1) {
end_pic_show(2); // ← picture above is rendered
} else {
end_pic_show(1);
}
}
Notice something? You should never see this picture because it's
immediately overwritten before the frame is supposed to end. And yet
it's clearly flickering up for about one frame with common emulation
settings as well as on my real PC-9821 Nw133, clocked at 133 MHz.
master.lib's graph_scrollup() doesn't block until VSync either,
and removing these calls doesn't change anything about the blitted images.
end_pic_show() uses the EGC to blit the given 320×200 quarter
of VRAM from page 1 to the visible page 0, so the bottleneck shouldn't be
there either…
…or should it? After setting it up via a few I/O port writes, the common
method of EGC-powered blitting works like this:
Read 16 bits from the source VRAM position on any single
bitplane. This fills the EGC's 4 16-bit tile registers with the VRAM
contents at that specific position on every bitplane. You do not care
about the value the CPU returns from the read – in optimized code, you would
make sure to just read into a register to avoid useless additional stores
into local variables.
Write any 16 bits
to the target VRAM position on any single bitplane. This copies the
contents of the EGC's tile registers to that specific position on
every bitplane.
To transfer pixels from one VRAM page to another, you insert an additional
write to I/O port 0xA6 before 1) and 2) to set your source and
destination page… and that's where we find the bottleneck. Taking a look at
the i486 CPU and its cycle
counts, a single one of these page switches costs 17 cycles – 1 for
MOVing the page number into AL, and 16 for the
OUT instruction itself. Therefore, the 8,000 page switches
required for EGC-copying a 320×200-pixel image require 136,000 cycles in
total.
And that's the optimal case of using only those two
instructions. 📝 As I implied last time, TH01
uses a function call for VRAM page switches, complete with creating
and destroying a useless stack frame and unnecessarily updating a global
variable in main memory. I tried optimizing ZUN's code by throwing out
unnecessary code and using 📝 pseudo-registers
to generate probably optimal assembly code, and that did speed up the
blitting to almost exactly 50% of the original version's run time. However,
it did little about the flickering itself. Here's a comparison of the first
loop with boom_duration = 16, recorded in DOSBox-X with
cputype=auto and cycles=max, and with
i overlaid using the text chip. Caution, flashing lights:
The original animation, completing in 50 frames instead of the expected
34, thanks to slow blitting. Combined with the lack of
double-buffering, this results in noticeable tearing as the screen
refreshes while blitting is still in progress.
(Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic
#1.)
This optimized version completes in the expected 34 frames. No tearing
happens to be visible in this recording, but the ドカーン image is still visible on every
second loop iteration. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic
#1.)
I pushed the optimized code to the th01_end_pic_optimize
branch, to also serve as an example of how to get close to optimal code out
of Turbo C++ 4.0J without writing a single ASM instruction.
And if you really want to use the EGC for this, that's the best you can do.
It really sucks that it merely expanded the GRCG's 4×8-bit tile register to
4×16 bits. With 32 bits, ≥386 CPUs could have taken advantage of their wider
registers and instructions to double the blitting performance. Instead, we
now know the reason why
📝 Promisence Soft's EGC-powered sprite driver that ZUN later stole for TH03
is called SPRITE16 and not SPRITE32. What a massive disappointment.
But what's perhaps a bigger surprise: Blitting planar
images from main memory is much faster than EGC-powered inter-page
VRAM copies, despite the required manual access to all 4 bitplanes. In
fact, the blitting functions for the .CDG/.CD2 format, used from TH03
onwards, would later demonstrate the optimal method of using REP
MOVSD for blitting every line in 32-pixel chunks. If that was also
used for these ending images, the core blitting operation would have taken
((12 + (3 × (320 / 32))) × 200 × 4) =
33,600 cycles, with not much more overhead for the surrounding row
and bitplane loops. Sure, this doesn't factor in the whole infamous issue of
VRAM being slow on PC-98, but the aforementioned 136,000 cycles don't even
include any actual blitting either. And as you move up to later PC-98
models with Pentium CPUs, the gap between OUT and REP
MOVSD only becomes larger. (Note that the page I linked above has a
typo in the cycle count of REP MOVSD on Pentium CPUs: According
to the original Intel Architecture and Programming Manual, it's
13+𝑛, not 3+𝑛.)
This difference explains why later games rarely use EGC-"accelerated"
inter-page VRAM copies, and keep all of their larger images in main memory.
It especially explains why TH04 and TH05 can get away with naively redrawing
boss backdrop images on every frame.
In the end, the whole fact that ZUN did not define how long this image
should be visible is enough for me to increment the game's overall bug
counter. Who would have thought that looking at endings of all things
would teach us a PC-98 performance lesson… Sure, optimizing TH01 already
seemed promising just by looking at its bloated code, but I had no idea that
its performance issues extended so far past that level.
That only leaves the common beginning part of all endings and a short
main() function before we're done with FUUIN.EXE,
and 98 functions until all of TH01 is decompiled! Next up: SinGyoku, who not
only is the quickest boss to defeat in-game, but also comes with the least
amount of code. See you very soon!
With Elis, we've not only reached the midway point in TH01's boss code, but
also a bunch of other milestones: Both REIIDEN.EXE and TH01 as
a whole have crossed the 75% RE mark, and overall position independence has
also finally cracked 80%!
And it got done in 4 pushes again? Yup, we're back to
📝 Konngara levels of redundancy and
copy-pasta. This time, it didn't even stop at the big copy-pasted code
blocks for the rift sprite and 256-pixel circle animations, with the words
"redundant" and "unnecessary" ending up a total of 18 times in my source
code comments.
But damn is this fight broken. As usual with TH01 bosses, let's start with a
high-level overview:
The Elis fight consists of 5 phases (excluding the entrance animation),
which must be completed in order.
In all odd-numbered phases, Elis uses a random one-shot danmaku pattern
from an exclusive per-phase pool before teleporting to a random
position.
There are 3 exclusive girl-form patterns per phase, plus 4
additional bat-form patterns in phase 5, for a total of 13.
Due to a quirk in the selection algorithm in phases 1 and 3, there
is a 25% chance of Elis skipping an attack cycle and just teleporting
again.
In contrast to Konngara, Elis can freely select the same pattern
multiple times in a row. There's nothing in the code to prevent that
from happening.
This pattern+teleport cycle is repeated until Elis' HP reach a certain
threshold value. The odd-numbered phases correspond to the white (phase 1),
red-white (phase 3), and red (phase 5) sections of the health bar. However,
the next phase can only start at the end of each cycle, after a
teleport.
Phase 2 simply teleports Elis back to her starting screen position of
(320, 144) and then advances to phase 3.
Phase 4 does the same as phase 2, but adds the initial bat form
transformation before advancing to phase 5.
Phase 5 replaces the teleport with a transformation to the bat form.
Rather than teleporting instantly to the target position, the bat gradually
flies there, firing a randomly selected looping pattern from the 4-pattern
bat pool on the way, before transforming back to the girl form.
This puts the earliest possible end of the fight at the first frame of phase
5. However, nothing prevents Elis' HP from reaching 0 before that point. You
can nicely see this in 📝 debug mode: Wait
until the HP bar has filled up to avoid heap corruption, hold ↵ Return
to reduce her HP to 0, and watch how Elis still goes through a total of
two patterns* and four
teleport animations before accepting defeat.
But wait, heap corruption? Yup, there's a bug in the HP bar that already
affected Konngara as well, and it isn't even just about the graphical
glitches generated by negative HP:
The initial fill-up animation is drawn to both VRAM pages at a rate of 1
HP per frame… by passing the current frame number as the
current_hp number.
The target_hp is indicated by simply passing the current
HP…
… which, however, can be reduced in debug mode at an equal rate of up to
1 HP per frame.
The completion condition only checks if
((target_hp - 1) == current_hp). With the
right timing, both numbers can therefore run past each other.
In that case, the function is repeatedly called on every frame, backing
up the original VRAM contents for the current HP point before blitting
it…
… until frame ((96 / 2) + 1), where the
.PTN slot pointer overflows the heap buffer and overwrites whatever comes
after. 📝 Sounds familiar, right?
Since Elis starts with 14 HP, which is an even number, this corruption is
trivial to cause: Simply hold ↵ Return from the beginning of the
fight, and the completion condition will never be true, as the
HP and frame numbers run past the off-by-one meeting point.
Edit (2023-07-21): Pressing ↵ Return to reduce HP
also works in test mode (game t). There, the game doesn't
even check the heap, and consequently won't report any corruption,
allowing the HP bar to be glitched even further.
Regular gameplay, however, entirely prevents this due to the fixed start
positions of Reimu and the Orb, the Orb's fixed initial trajectory, and the
50 frames of delay until a bomb deals damage to a boss. These aspects make
it impossible to hit Elis within the first 14 frames of phase 1, and ensure
that her HP bar is always filled up completely. So ultimately, this bug ends
up comparable in seriousness to the
📝 recursion / stack overflow bug in the memory info screen.
These wavy teleport animations point to a quite frustrating architectural
issue in this fight. It's not even the fact that unblitting the yellow star
sprites rips temporary holes into Elis' sprite; that's almost expected from
TH01 at this point. Instead, it's all because of this unused frame of the
animation:
With this sprite still being part of BOSS5.BOS, Girl-Elis has a
total of 9 animation frames, 1 more than the
📝 8 per-entity sprites allowed by ZUN's architecture.
The quick and easy solution would have been to simply bump the sprite array
size by 1, but… nah, this would have added another 20 bytes to all 6 of the
.BOS image slots. Instead, ZUN wrote the manual
position synchronization code I mentioned in that 2020 blog post.
Ironically, he then copy-pasted this snippet of code often enough that it
ended up taking up more than 120 bytes in the Elis fight alone – with, you
guessed it, some of those copies being redundant. Not to mention that just
going from 8 to 9 sprites would have allowed ZUN to go down from 6 .BOS
image slots to 3. That would have actually saved 420 bytes in
addition to the manual synchronization trouble. Looking forward to SinGyoku,
that's going to be fun again…
As for the fight itself, it doesn't take long until we reach its most janky
danmaku pattern, right in phase 1:
The "pellets along circle" pattern on Lunatic, in its original version
and with fanfiction fixes for everything that can potentially be
interpreted as a bug.
For whatever reason, the lower-right quarter of the circle isn't
animated? This animation works by only drawing the new dots added with every
subsequent animation frame, expressed as a tiny arc of a dotted circle. This
arc starts at the animation's current 8-bit angle and ends on the sum of
that angle and a hardcoded constant. In every other (copy-pasted, and
correct) instance of this animation, ZUN uses 0x02 as the
constant, but this one uses… 0.05 for the lower-right quarter?
As in, a 64-bit double constant that truncates to 0 when added
to an 8-bit integer, thus leading to the start and end angles being
identical and the game not drawing anything.
On Easy and Normal, the pattern then spawns 32 bullets along the outline
of the circle, no problem there. On Lunatic though, every one of these
bullets is instead turned into a narrow-angled 5-spread, resulting in 160
pellets… in a game with a pellet cap of 100.
Now, if Elis teleported herself to a position near the top of the playfield,
most of the capped pellets would have been clipped at that top edge anyway,
since the bullets are spawned in clockwise order starting at Elis' right
side with an angle of 0x00. On lower positions though, you can
definitely see a difference if the cap were high enough to allow all coded
pellets to actually be spawned.
The Hard version gets dangerously close to the cap by spawning a total of 96
pellets. Since this is the only pattern in phase 1 that fires pellets
though, you are guaranteed to see all of the unclipped ones.
The pellets also aren't spawned exactly on the telegraphed circle, but 4 pixels to the left.
Then again, it might very well be that all of this was intended, or, most
likely, just left in the game as a happy accident. The latter interpretation
would explain why ZUN didn't just delete the rendering calls for the
lower-right quarter of the circle, because seriously, how would you not spot
that? The phase 3 patterns continue with more minor graphical glitches that
aren't even worth talking about anymore.
And then Elis transforms into her bat form at the beginning of Phase 5,
which displays some rather unique hitboxes. The one against the Orb is fine,
but the one against player shots…
… uses the bat's X coordinate for both X and Y dimensions.
In regular gameplay, it's not too bad as most
of the bat patterns fire aimed pellets which typically don't allow you to
move below her sprite to begin with. But if you ever tried destroying these
pellets while standing near the middle of the playfield, now you know why
that didn't work. This video also nicely points out how the bat, like any
boss sprite, is only ever blitted at positions on the 8×1-pixel VRAM byte
grid, while collision detection uses the actual pixel position.
The bat form patterns are all relatively simple, with little variation
depending on the difficulty level, except for the "slow pellet spreads"
pattern. This one is almost easiest to dodge on Lunatic, where the 5-spreads
are not only always fired downwards, but also at the hardcoded narrow delta
angle, leaving plenty of room for the player to move out of the way:
The "slow pellet spreads" pattern of Elis' bat form, on every
difficulty. Which version do you think is the easiest one?
Finally, we've got another potential timesave in the girl form's "safety
circle" pattern:
After the circle spawned completely, you lose a life by moving outside it,
but doing that immediately advances the pattern past the circle part. This
part takes 200 frames, but the defeat animation only takes 82 frames, so
you can save up to 118 frames there.
Final funny tidbit: As with all dynamic entities, this circle is only
blitted to VRAM page 0 to allow easy unblitting. However, it's also kind of
static, and there needs to be some way to keep the Orb, the player shots,
and the pellets from ripping holes into it. So, ZUN just re-blits the circle
every… 4 frames?! 🤪 The same is true for the Star of David and its
surrounding circle, but there you at least get a flash animation to justify
it. All the overlap is actually quite a good reason for not even attempting
to 📝 mess with the hardware color palette instead.
Reproducing the crash was the whole challenge here. Even after moving Elis
and Reimu to the exact positions seen in Pearl's video and setting Elis' HP
to 0 on the exact same frame, everything ran fine for me. It's definitely no
division by 0 this time, the function perfectly guards against that
possibility. The line specified in the function's parameters is always
clipped to the VRAM region as well, so we can also rule out illegal memory
accesses here…
… or can we? Stepping through it all reminded me of how this function brings
unblitting sloppiness to the next level: For each VRAM byte touched, ZUN
actually unblits the 4 surrounding bytes, adding one byte to the left
and two bytes to the right, and using a single 32-bit read and write per
bitplane. So what happens if the function tries to unblit the topmost byte
of VRAM, covering the pixel positions from (0, 0) to (7, 0)
inclusive? The VRAM offset of 0x0000 is decremented to
0xFFFF to cover the one byte to the left, 4 bytes are written
to this address, the CPU's internal offset overflows… and as it turns out,
that is illegal even in Real Mode as of the 80286, and will raise a General Protection
Fault. Which is… ignored by DOSBox-X,
every Neko Project II version in common use, the CSCP
emulators, SL9821, and T98-Next. Only Anex86 accurately emulates the
behavior of real hardware here.
OK, but no laser fired by Elis ever reaches the top-left corner of the
screen. How can such a fault even happen in practice? That's where the
broken laser reset+unblit function comes in: Not only does it just flat out pass the wrong
parameters to the line unblitting function – describing the line
already traveled by the laser and stopping where the laser begins –
but it also passes them
wrongly, in the form of raw 32-bit fixed-point Q24.8 values, with no
conversion other than a truncation to the signed 16-bit pixels expected by
the function. What then follows is an attempt at interpolation and clipping
to find a line segment between those garbage coordinates that actually falls
within the boundaries of VRAM:
right/bottom correspond to a laser's origin position, and
left/top to the leftmost pixel of its moved-out top line. The
bug therefore only occurs with lasers that stopped growing and have started
moving.
Moreover, it will only happen if either (left % 256) or
(right % 256) is ≤ 127 and the other one of the two is ≥ 128.
The typecast to signed 16-bit integers then turns the former into a large
positive value and the latter into a large negative value, triggering the
function's clipping code.
The function then follows Bresenham's
algorithm: left is ensured to be smaller than right
by swapping the two values if necessary. If that happened, top
and bottom are also swapped, regardless of their value – the
algorithm does not care about their order.
The slope in the X dimension is calculated using an integer division of
((bottom - top) /
(right - left)). Both subtractions are done on signed
16-bit integers, and overflow accordingly.
(-left × slope_x) is added to top,
and left is set to 0.
If both top and bottom are < 0 or
≥ 640, there's nothing to be unblitted. Otherwise, the final
coordinates are clipped to the VRAM range of [(0, 0),
(639, 399)].
If the function got this far, the line to be unblitted is now very
likely to reach from
the top-left to the bottom-right corner, starting out at
(0, 0) right away, or
from the bottom-left corner to the top-right corner. In this case,
you'd expect unblitting to end at (639, 0), but thanks to an
off-by-one error,
it actually ends at (640, -1), which is equivalent to
(0, 0). Why add clipping to VRAM offset calculations when
everything else is clipped already, right?
Possible laser states that will cause the fault, with some debug
output to help understand the cause, and any pellets removed for better
readability. This can happen for all bosses that can potentially have
shootout lasers on screen when being defeated, so it also applies to Mima.
Fixing this is easier than understanding why it happens, but since y'all
love reading this stuff…
tl;dr: TH01 has a high chance of freezing at a boss defeat sequence if there
are diagonally moving lasers on screen, and if your PC-98 system
raises a General Protection Fault on a 4-byte write to offset
0xFFFF, and if you don't run a TSR with an INT
0Dh handler that might handle this fault differently.
The easiest fix option would be to just remove the attempted laser
unblitting entirely, but that would also have an impact on this game's…
distinctive visual glitches, in addition to touching a whole lot of
code bytes. If I ever get funded to work on a hypothetical TH01 Anniversary
Edition that completely rearchitects the game to fix all these glitches, it
would be appropriate there, but not for something that purports to be the
original game.
(Sidenote to further hype up this Anniversary Edition idea for PC-98
hardware owners: With the amount of performance left on the table at every
corner of this game, I'm pretty confident that we can get it to work
decently on PC-98 models with just an 80286 CPU.)
Since we're in critical infrastructure territory once again, I went for the
most conservative fix with the least impact on the binary: Simply changing
any VRAM offsets >= 0xFFFD to 0x0000 to avoid
the GPF, and leaving all other bugs in place. Sure, it's rather lazy and
"incorrect"; the function still unblits a 32-pixel block there, but adding a
special case for blitting 24 pixels would add way too much code. And
seriously, it's not like anything happens in the 8 pixels between
(24, 0) and (31, 0) inclusive during gameplay to begin with.
To balance out the additional per-row if() branch, I inlined
the VRAM page change I/O, saving two function calls and one memory write per
unblitted row.
That means it's time for a new community_choice_fixes
build, containing the new definitive bugfixed versions of these games:
2022-05-31-community-choice-fixes.zip
Check the th01_critical_fixes
branch for the modified TH01 code. It also contains a fix for the HP bar
heap corruption in test or debug mode – simply changing the ==
comparison to <= is enough to avoid it, and negative HP will
still create aesthetic glitch art.
Once again, I then was left with ½ of a push, which I finally filled with
some FUUIN.EXE code, specifically the verdict screen. The most
interesting part here is the player title calculation, which is quite
sneaky: There are only 6 skill levels, but three groups of
titles for each level, and the title you'll see is picked from a random
group. It looks like this is the first time anyone has documented the
calculation?
As for the levels, ZUN definitely didn't expect players to do particularly
well. With a 1cc being the standard goal for completing a Touhou game, it's
especially funny how TH01 expects you to continue a lot: The code has
branches for up to 21 continues, and the on-screen table explicitly leaves
room for 3 digits worth of continues per 5-stage scene. Heck, these
counts are even stored in 32-bit long variables.
Next up: 📝 Finally finishing the long
overdue Touhou Patch Center MediaWiki update work, while continuing with
Kikuri in the meantime. Originally I wasn't sure about what to do between
Elis and Seihou,
but with Ember2528's surprise
contribution last week, y'all have
demonstrated more than enough interest in the idea of getting TH01 done
sooner rather than later. And I agree – after all, we've got the 25th
anniversary of its first public release coming up on August 15, and I might
still manage to completely decompile this game by that point…
TH05 has passed the 50% RE mark, with both MAIN.EXE and the
game as a whole! With that, we've also reached what -Tom-
wanted out of the project, so he's suspending his discount offer for a
bit.
Curve bullets are now officially called cheetos! 76.7% of
fans prefer this term, and it fits into the 8.3 DOS filename scheme much
better than homing lasers (as they're called in
OMAKE.TXT) or Taito
lasers (which would indeed have made sense as well).
…oh, and I managed to decompile Shinki within 2 pushes after all. That
left enough budget to also add the Stage 1 midboss on top.
So, Shinki! As far as final boss code is concerned, she's surprisingly
economical, with 📝 her background animations
making up more than ⅓ of her entire code. Going straight from TH01's
📝 final📝 bosses
to TH05's final boss definitely showed how much ZUN had streamlined
danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there
is still room for improvement: TH05 not only
📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month,
but also uses them 4× as often, and even for midbosses. Most importantly
though, defining danmaku patterns using a single global instance of the
group template structure is just bad no matter how you look at it:
The script code ends up rather bloated, with a single MOV
instruction for setting one of the fields taking up 5 bytes. By comparison,
the entire structure for regular bullets is 14 bytes large, while the
template structure for Shinki's 32×32 ball bullets could have easily been
reduced to 8 bytes.
Since it's also one piece of global state, you can easily forget to set
one of the required fields for a group type. The resulting danmaku group
then reuses these values from the last time they were set… which might have
been as far back as another boss fight from a previous stage.
And of course, I wouldn't point this out if it
didn't actually happen in Shinki's pattern code. Twice.
Declaring a separate structure instance with the static data for every
pattern would be both safer and more space-efficient, and there's
more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow.
The "devil"
patternis significantly more complex than the others, but still
far from TH01's final bosses at their worst. I especially like the clear
architectural separation between "one-shot pattern" functions that return
true once they're done, and "looping pattern" functions that
run as long as they're being called from a boss's main function. Not many
all too interesting things in these pattern functions for the most part,
except for two pieces of evidence that Shinki was coded after Yumeko:
The gather animation function in the first two phases contains a bullet
group configuration that looks like it's part of an unused danmaku
pattern. It quickly turns out to just be copy-pasted from a similar function
in Yumeko's fight though, where it is turned into actual
bullets.
As one of the two places where ZUN forgot to set a template field, the
lasers at the end of the white wing preparation pattern reuse the 6-pixel
width of Yumeko's final laser pattern. This actually has an effect on
gameplay: Since these lasers are active for the first 8 frames after
Shinki's wings appear on screen, the player can get hit by them in the last
2 frames after they grew to their final width.
Of course, there are more than enough safespots between the lasers.
Speaking about that wing sprite: If you look at ST05.BB2 (or
any other file with a large sprite, for that matter), you notice a rather
weird file layout:
A large sprite split into multiple smaller ones with a width of
64 pixels each? What's this, hardware sprite limitations? On my
PC-98?!
And it's not a limitation of the sprite width field in the BFNT+ header
either. Instead, it's master.lib's BFNT functions which are limited to
sprite widths up to 64 pixels… or at least that's what
MASTER.MAN claims. Whatever the restriction was, it seems to be
completely nonexistent as of master.lib version 0.23, and none of the
master.lib functions used by the games have any issues with larger
sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the
game that expects Shinki's winged form to consist of 4 physical
sprites, not just 1. Any conversion from another, more logical sprite sheet
layout back into BFNT+ must therefore replicate the original number of
sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly
loaded sprite no longer match ZUN's hardcoded IDs, causing the game to
crash. This is exactly what used to happen with -Tom-'s
MysticTK automation scripts,
which combined these exact sprites into a single large one. This issue has
now been fixed – just in case there are some underground modders out there
who used these scripts and wonder why their game crashed as soon as the
Shinki fight started.
And then the code quality takes a nosedive with Shinki's main function.
Even in TH05, these boss and midboss update
functions are still very imperative:
The origin point of all bullet types used by a boss must be manually set
to the current boss/midboss position; there is no concept of a bullet type
tracking a certain entity.
The same is true for the target point of a player's homing shots…
… and updating the HP bar. At least the initial fill animation is
abstracted away rather decently.
Incrementing the phase frame variable also must be done manually. TH05
even "innovates" here by giving the boss update function exclusive ownership
of that variable, in contrast to TH04 where that ownership is given out to
the player shot collision detection (?!) and boss defeat helper
functions.
Speaking about collision detection: That is done by calling different
functions depending on whether the boss is supposed to be invincible or
not.
Timeout conditions? No standard way either, and all done with manual
if statements. In combination with the regular phase end
condition of lowering (mid)boss HP to a certain value, this leads to quite a
convoluted control flow.
The manual calls to the score bonus functions for cleared phases at least provide some sense of orientation.
One potentially nice aspect of all this imperative freedom is that
phases can end outside of HP boundaries… by manually incrementing the
phase variable and resetting the phase frame variable to 0.
The biggest WTF in there, however, goes to using one of the 16 state bytes
as a "relative phase" variable for differentiating between boss phases that
share the same branch within the switch(boss.phase)
statement. While it's commendable that ZUN tried to reduce code duplication
for once, he could have just branched depending on the actual
boss.phase variable? The same state byte is then reused in the
"devil" pattern to track the activity state of the big jerky lasers in the
second half of the pattern. If you somehow managed to end the phase after
the first few bullets of the pattern, but before these lasers are up,
Shinki's update function would think that you're still in the phase
before the "devil" pattern. The main function then sequence-breaks
right to the defeat phase, skipping the final pattern with the burning Makai
background. Luckily, the HP boundaries are far away enough to make this
impossible in practice.
The takeaway here: If you want to use the state bytes for your custom
boss script mods, alias them to your own 16-byte structure, and limit each
of the bytes to a clearly defined meaning across your entire boss script.
One final discovery that doesn't seem to be documented anywhere yet: Shinki
actually has a hidden bomb shield during her two purple-wing phases.
uth05win got this part slightly wrong though: It's not a complete
shield, and hitting Shinki will still deal 1 point of chip damage per
frame. For comparison, the first phase lasts for 3,000 HP, and the "devil"
pattern phase lasts for 5,800 HP.
And there we go, 3rd PC-98 Touhou boss
script* decompiled, 28 to go! 🎉 In case you were expecting a fix for
the Shinki death glitch: That one
is more appropriately fixed as part of the Mai & Yuki script. It also
requires new code, should ideally look a bit prettier than just removing
cheetos between one frame and the next, and I'd still like it to fit within
the original position-dependent code layout… Let's do that some other
time.
Not much to say about the Stage 1 midboss, or midbosses in general even,
except that their update functions have to imperatively handle even more
subsystems, due to the relative lack of helper functions.
The remaining ¾ of the third push went to a bunch of smaller RE and
finalization work that would have hardly got any attention otherwise, to
help secure that 50% RE mark. The nicest piece of code in there shows off
what looks like the optimal way of setting up the
📝 GRCG tile register for monochrome blitting
in a variable color:
mov ah, palette_index ; Any other non-AL 8-bit register works too.
; (x86 only supports AL as the source operand for OUTs.)
rept 4 ; For all 4 bitplanes…
shr ah, 1 ; Shift the next color bit into the x86 carry flag
sbb al, al ; Extend the carry flag to a full byte
; (CF=0 → 0x00, CF=1 → 0xFF)
out 7Eh, al ; Write AL to the GRCG tile register
endm
Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles
into a surprisingly nice one-liner. What a beautiful micro-optimization, at
a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there,
becoming increasingly dumb and undecompilable. Was it really necessary to
save 4 x86 instructions in the highly unlikely case of a new spark sprite
being spawned outside the playfield? That one 2D polar→Cartesian
conversion function then pointed out Turbo C++ 4.0J's woefully limited
support for 32-bit micro-optimizations. The code generation for 32-bit
📝 pseudo-registers is so bad that they almost
aren't worth using for arithmetic operations, and the inline assembler just
flat out doesn't support anything 32-bit. No use in decompiling a function
that you'd have to entirely spell out in machine code, especially if the
same function already exists in multiple other, more idiomatic C++
variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC
replay file reading code, which should finally prove that nothing about the
game's original replay system could serve as even just the foundation for
community-usable replays. Just in case anyone was still thinking that.
Next up: Back to TH01, with the Elis fight! Got a bit of room left in the
cap again, and there are a lot of things that would make a lot of
sense now:
TH04 would really enjoy a large number of dedicated pushes to catch up
with TH05. This would greatly support the finalization of both games.
Continuing with TH05's bosses and midbosses has shown to be good value
for your money. Shinki would have taken even less than 2 pushes if she
hadn't been the first boss I looked at.
Oh, and I also added Seihou as a selectable goal, for the two people out
there who genuinely like it. If I ever want to quit my day job, I need to
branch out into safer territory that isn't threatened by takedowns, after
all.
Slight change of plans, because we got instructions for
reliably reproducing the TH04 Kurumi Divide Error crash! Major thanks to
Colin Douglas Howell. With those, it also made sense to immediately look at
the crash in the Stage 4 Marisa fight as well. This way, I could release
both of the obligatory bugfix mods at the same time.
Especially since it turned out that I was wrong: Both crashes are entirely
unrelated to the custom entity structure that would have required PI-centric
progress. They are completely specific to Kurumi's and Marisa's
danmaku-pattern code, and really are two separate bugs
with no connection to each other. All of the necessary research nicely fit
into Arandui's 0.5 pushes, with no further deep understanding
required here.
But why were there still three weeks between Colin's message and this blog
post? DMCA distractions aside: There are no easy fixes this time, unlike
📝 back when I looked at the Stage 5 Yuuka crash.
Just like how division by zero is undefined in mathematics, it's also,
literally, undefined what should happen instead of these two
Divide error crashes. This means that any possible "fix" can
only ever be a fanfiction interpretation of the intentions behind ZUN's
code. The gameplay community should be aware of this, and
might decide to handle these cases differently. And if we
have to go into fanfiction territory to work around crashes in the
canon games, we'd better document what exactly we're fixing here and how, as
comprehensible as possible.
With that out of the way, let's look at Kurumi's crash first, since it's way
easier to grasp. This one is known to primarily happen to new players, and
it's easy to see why:
In one of the patterns in her third phase, Kurumi fires a series of 3
aimed rings from both edges of the playfield. By default (that is, on Normal
and with regular rank), these are 6-way rings.
6 happens to be quite a peculiar number here, due to how rings are
(manually) tuned based on the current "rank" value (playperf)
before being fired. The code, abbreviated for clarity:
Let's look at the range of possible playperf values per
difficulty level:
Easy
Normal
Hard
Lunatic
Extra
playperf_min
4
11
20
22
16
playperf_max
16
24
32
34
20
Edit (2022-05-24): This blog post initially had
26 instead of 16 for playperf_min for the Extra Stage. Thanks
to Popfan for pointing out that typo!
Reducing rank to its minimum on Easy mode will therefore result in a
0-ring after tuning.
To calculate the individual angles of each bullet in a ring, ZUN divides
360° (or, more correctly,
📝 0x100) by the total number of
bullets…
Boom, division by zero.
The pattern that causes the crash in Kurumi's fight. Also
demonstrates how the number of bullets in a ring is always halved on
Easy Mode after the rank-based tuning, leading to just a 3-ring on
playperf = 16.
So, what should the workaround look like? Obviously, we want to modify
neither the default number of ring bullets nor the tuning algorithm – that
would change all other non-crashing variations of this pattern on other
difficulties and ranks, creating a fork of the original gameplay. Instead, I
came up with four possible workarounds that all seemed somewhat logical to
me:
Firing no bullet, i.e., interpreting 0-ring literally. This would
create the only constellation in which a call to the bullet group spawn
functions would not spawn at least one new bullet.
Firing a "1-ring", i.e., a single bullet. This would be consistent with
how the bullet spawn functions behave for "0-way" stack and spread
groups.
Firing a "∞-ring", i.e., 200 bullets, which is as much as the game's cap
on 16×16 bullets would allow. This would poke fun at the whole "division by
zero" idea… but given that we're still talking about Easy Mode (and
especially new players) here, it might be a tad too cruel. Certainly the
most trollish interpretation.
Triggering an immediate Game Over, exchanging the hard crash for a
softer and more controlled shutdown. Certainly the option that would be
closest to the behavior of the original games, and perhaps the only one to
be accepted in Serious, High-Level Play™.
As I was writing this post, it felt increasingly wrong for me to make this
decision. So I once again went to Twitter, where 56.3%
voted in favor of the 1-bullet option. Good that I asked! I myself was
more leaning towards the 0-bullet interpretation, which only got 28.7% of
the vote. Also interesting are the 2.3% in favor of the Game Over option but
I get it, low-rank Easy Mode isn't exactly the most competitive mode of
playing TH04.
There are reports of Kurumi crashing on higher difficulties as well, but I
could verify none of them. If they aren't fixed by this workaround, they're
caused by an entirely different bug that we have yet to discover.
Onto the Stage 4 Marisa crash then, which does in fact apply to all
difficulty levels. I was also wrong on this one – it's a hell of a lot more
intricate than being just a division by the number of on-screen bits.
Without having decompiled the entire fight, I can't give a completely
accurate picture of what happens there yet, but here's the rough idea:
Marisa uses different patterns, depending on whether at least one of her
bits is still alive, or all of them have been destroyed.
Destroying the last bit will immediately switch to the bit-less
counterpart of the current pattern.
The bits won't respawn before the pattern ended, which ensures that the
bit-less version is always shown in its entirety after being started or
switched into.
In two of the bit-less patterns, Marisa gradually moves to the point
reflection of her position at the start of the pattern across the playfield
coordinate of (192, 112), or (224, 128) on screen.
Reference points for Marisa's point-reflected movement. Cyan:
Marisa's position, green: (192, 112), yellow: the intended end
point.
The velocity of this movement is determined by both her distance to that
point and the total amount of frames that this instance of the bit-less
pattern will last.
Since this frame amount is directly tied to the frame the player
destroyed the last bit on, it becomes a user-controlled variable. I think
you can see where this is going…
The last 12 frames of this duration, however, are always reserved for a
"braking phase", where Marisa's velocity is halved on each frame.
This part of the code only runs every 4 frames though. This expands the
time window for this crash to 4 frames, rather than just the two frames you
would expect from looking at the division itself.
Both of the broken patterns run for a maximum of 160 frames. Therefore,
the crash will occur when Marisa's last bit is destroyed between frame 152
and 155 inclusive. On these frames, the
last_frame_with_bits_alive variable is set to 148, which is the
crucial 12 duration frames away from the maximum of 160.
Interestingly enough, the calculated velocity is also only
applied every 4 frames, with Marisa actually staying still for the 3 frames
inbetween. As a result, she either moves
too slowly to ever actually reach the yellow point if the last bit
was destroyed early in the pattern (see destruction frames 68 or
112),
or way too quickly, and almost in a jerky, teleporting way (see
destruction frames 144 or 148).
Finally, as you may have already gathered from the formula: Destroying
the last bit between frame 156 and 160 inclusive results in
duration values of 8 or 4. These actually push Marisa
away from the intended point, as the divisor becomes negative.
One of the two patterns in TH04's Stage 4 Marisa boss fight that feature
frame number-dependent point-reflected movement. The bits were hacked to
self-destruct on the respective frame.
tl;dr: "Game crashes if last bit destroyed within 4-frame window near end of
two patterns". For an informed decision on a new movement behavior for these
last 8 frames, we definitely need to know all the details behind the crash
though. Here's what I would interpret into the code:
Not moving at all, i.e., interpreting 0 as the middle ground between
positive and negative movement. This would also make sense because a
12-frame duration implies 100% of the movement to consist of
the braking phase – and Marisa wasn't moving before, after all.
Move at maximum speed, i.e., dividing by 1 rather than 0. Since the
movement duration is still 12 in this case, Marisa will immediately start
braking. In total, she will move exactly ¾ of the way from her initial
position to (192, 112) within the 8 frames before the pattern
ends.
Directly warping to (192, 112) on frame 0, and to the
point-reflected target on 4, respectively. This "emulates" the division by
zero by moving Marisa at infinite speed to the exact two points indicated by
the velocity formula. It also fits nicely into the 8 frames we have to fill
here. Sure, Marisa can't reach these points at any other duration, but why
shouldn't she be able to, with infinite speed? Then again, if Marisa
is far away enough from (192, 112), this workaround would warp her
across the entire playfield. Can Marisa teleport according to lore? I
have no idea…
Triggering an immediate Game O– hell no, this is the Stage 4 boss,
people already hate losing runs to this bug!
Asking Twitter worked great for the Kurumi workaround, so let's do it again!
Gotta attach a screenshot of an earlier draft of this blog post though,
since this stuff is impossible to explain in tweets…
…and it went
through the roof, becoming the most successful ReC98 tweet so far?!
Apparently, y'all really like to just look at descriptions of overly complex
bugs that I'd consider way beyond the typical attention span that can be
expected from Twitter. Unfortunately, all those tweet impressions didn't
quite translate into poll turnout. The results
were pretty evenly split between 1) and 2), with option 1) just coming out
slightly ahead at 49.1%, compared to 41.5% of option 2).
(And yes, I only noticed after creating the poll that warping to both the
green and yellow points made more sense than warping to just one of the two.
Let's hope that this additional variant wouldn't have shifted the results
too much. Both warp options only got 9.4% of the vote after all, and no one
else came up with the idea either. In the end,
you can always merge together your preferred combination of workarounds from
the Git branches linked below.)
So here you go: The new definitive version of TH04, containing not only the
community-chosen Kurumi and Stage 4 Marisa workaround variant, but also the
📝 No-EMS bugfix from last year.
Edit (2022-05-31): This package is outdated, 📝 the current version is here!2022-04-18-community-choice-fixes.zip
Oh, and let's also add spaztron64's TH03 GDC clock fix
from 2019 because why not. This binary was built from the community_choice_fixes
branch, and you can find the code for all the individual workarounds on
these branches:
Again, because it can't be stated often enough: These fixes are
fanfiction. The gameplay community should be aware of
this, and might decide to handle these cases differently.
With all of that taking way more time to evaluate and document, this
research really had to become part of a proper push, instead of just being
covered in the quick non-push blog post I initially intended. With ½ of a
push left at the end, TH05's Stage 1-5 boss background rendering functions
fit in perfectly there. If you wonder how these static backdrop images even
need any boss-specific code to begin with, you're right – it's basically the
same function copy-pasted 4 times, differing only in the backdrop image
coordinates and some other inconsequential details.
Only Sara receives a nice variation of the typical
📝 blocky entrance animation: The usually
opaque bitmap data from ST00.BB is instead used as a transition
mask from stage tiles to the backdrop image, by making clever use of the
tile invalidation system:
TH04 uses the same effect a bit more frequently, for its first three bosses.
Next up: Shinki, for real this time! I've already managed to decompile 10 of
her 11 danmaku patterns within a little more than one push – and yes,
that one is included in there. Looks like I've slightly
overestimated the amount of work required for TH04's and TH05's bosses…
Did you know that moving on top of a boss sprite doesn't kill the player in
TH04, only in TH05?
Yup, Reimu is not getting hit… yet.
That's the first of only three interesting discoveries in these 3 pushes,
all of which concern TH04. But yeah, 3 for something as seemingly simple as
these shared boss functions… that's still not quite the speed-up I had hoped
for. While most of this can be blamed, again, on TH04 and all of its
hardcoded complexities, there still was a lot of work to be done on the
maintenance front as well. These functions reference a bunch of code I RE'd
years ago and that still had to be brought up to current standards, with the
dependencies reaching from 📝 boss explosions
over 📝 text RAM overlay functionality up to
in-game dialog loading.
The latter provides a good opportunity to talk a bit about x86 memory
segmentation. Many aspiring PC-98 developers these days are very scared
of it, with some even going as far as to rather mess with Protected Mode and
DOS extenders just so that they don't have to deal with it. I wonder where
that fear comes from… Could it be because every modern programming language
I know of assumes memory to be flat, and lacks any standard language-level
features to even express something like segments and offsets? That's why
compilers have a hard time targeting 16-bit x86 these days: Doing anything
interesting on the architecture requires giving the programmer full
control over segmentation, which always comes down to adding the
typical non-standard language extensions of compilers from back in the day.
And as soon as DOS stopped being used, these extensions no longer made sense
and were subsequently removed from newer tools. A good example for this can
be found in an old version of the
NASM manual: The project started as an attempt to make x86 assemblers
simple again by throwing out most of the segmentation features from
MASM-style assemblers, which made complete sense in 1996 when 16-bit DOS and
Windows were already on their way out. But there was a point to all
those features, and that's why ReC98 still has to use the supposedly
inferior TASM.
Not that this fear of segmentation is completely unfounded: All the
segmentation-related keywords, directives, and #pragmas
provided by Borland C++ and TASM absolutely can be the cause of many
weird runtime bugs. Even if the compiler or linker catches them, you are
often left with confusing error messages that aged just as poorly as memory
segmentation itself.
However, embracing the concept does provide quite the opportunity for
optimizations. While it definitely was a very crazy idea, there is a small
bit of brilliance to be gained from making proper use of all these
segmentation features. Case in point: The buffer for the in-game dialog
scripts in TH04 and TH05.
// Thanks to the semantics of `far` pointers, we only need a single 32-bit
// pointer variable for the following code.
extern unsigned char far *dialog_p;
// This master.lib function returns a `void __seg *`, which is a 16-bit
// segment-only pointer. Converting to a `far *` yields a full segment:offset
// pointer to offset 0000h of that segment.
dialog_p = (unsigned char far *)hmem_allocbyte(/* … */);
// Running the dialog script involves pointer arithmetic. On a far pointer,
// this only affects the 16-bit offset part, complete with overflow at 64 KiB,
// from FFFFh back to 0000h.
dialog_p += /* … */;
dialog_p += /* … */;
dialog_p += /* … */;
// Since the segment part of the pointer is still identical to the one we
// allocated above, we can later correctly free the buffer by pulling the
// segment back out of the pointer.
hmem_free((void __seg *)dialog_p);
If dialog_p was a huge pointer, any pointer
arithmetic would have also adjusted the segment part, requiring a second
pointer to store the base address for the hmem_free call. Doing
that will also be necessary for any port to a flat memory model. Depending
on how you look at it, this compression of two logical pointers into a
single variable is either quite nice, or really, really dumb in its
reliance on the precise memory model of one single architecture.
Why look at dialog loading though, wasn't this supposed to be all about
shared boss functions? Well, TH04 unnecessarily puts certain stage-specific
code into the boss defeat function, such as loading the alternate Stage 5
Yuuka defeat dialog before a Bad Ending, or initializing Gengetsu after
Mugetsu's defeat in the Extra Stage.
That's TH04's second core function with an explicit conditional branch for
Gengetsu, after the
📝 dialog exit code we found last year during EMS research.
And I've heard people say that Shinki was the most hardcoded fight in PC-98
Touhou… Really, Shinki is a perfectly regular boss, who makes proper use of
all internal mechanics in the way they were intended, and doesn't blast
holes into the architecture of the game. Even within TH05, it's Mai and Yuki
who rely on hacks and duplicated code, not Shinki.
The worst part about this though? How the function distinguishes Mugetsu
from Gengetsu. Once again, it uses its own global variable to track whether
it is called the first or the second time within TH04's Extra Stage,
unrelated to the same variable used in the dialog exit function. But this
time, it's not just any newly created, single-use variable, oh no. In a
misguided attempt to micro-optimize away a few bytes of conventional memory,
TH04 reserves 16 bytes of "generic boss state", which can (and are) freely
used for anything a boss doesn't want to store in a more dedicated
variable.
It might have been worth it if the bosses actually used most of these
16 bytes, but the majority just use (the same) two, with only Stage 4 Reimu
using a whopping seven different ones. To reverse-engineer the various uses
of these variables, I pretty much had to map out which of the undecompiled
danmaku-pattern functions corresponds to which boss
fight. In the end, I assigned 29 different variable names for each of the
semantically different use cases, which made up another full push on its
own.
Now, 16 bytes of wildly shared state, isn't that the perfect recipe for
bugs? At least during this cursory look, I haven't found any obvious ones
yet. If they do exist, it's more likely that they involve reused state from
earlier bosses – just how the Shinki death glitch in
TH05 is caused by reusing cheeto data from way back in Stage 4 – and
hence require much more boss-specific progress.
And yes, it might have been way too early to look into all these tiny
details of specific boss scripts… but then, this happened:
Looks similar to another
screenshot of a crash in the same fight that was reported in December,
doesn't it? I was too much in a hurry to figure it out exactly, but notice
how both crashes happen right as the last of Marisa's four bits is destroyed.
KirbyComment has suspected
this to be the cause for a while, and now I can pretty much confirm it
to be an unguarded division by the number of on-screen bits in
Marisa-specific pattern code. But what's the cause for Kurumi then?
As for fixing it, I can go for either a fast or a slow option:
Superficially fixing only this crash will probably just take a fraction
of a push.
But I could also go for a deeper understanding by looking at TH04's
version of the 📝 custom entity structure. It
not only stores the data of Marisa's bits, but is also very likely to be
involved in Kurumi's crash, and would get TH04 a lot closer to 100%
PI. Taking that look will probably need at least 2 pushes, and might require
another 3-4 to completely decompile Marisa's fight, and 2-3 to decompile
Kurumi's.
OK, now that that's out of the way, time to finish the boss defeat function…
but not without stumbling over the third of TH04's quirks, relating to the
Clear Bonus for the main game or the Extra Stage:
To achieve the incremental addition effect for the in-game score display
in the HUD, all new points are first added to a score_delta
variable, which is then added to the actual score at a maximum rate of
61,110 points per frame.
There are a fixed 416 frames between showing the score tally and
launching into MAINE.EXE.
As a result, TH04's Clear Bonus is effectively limited to
(416 × 61,110) = 25,421,760 points.
Only TH05 makes sure to commit the entirety of the
score_delta to the actual score before switching binaries,
which fixes this issue.
And after another few collision-related functions, we're now truly,
finally ready to decompile bosses in both TH04 and TH05! Just as the
anything funds were running out… The
remaining ¼ of the third push then went to Shinki's 32×32 ball bullets,
rounding out this delivery with a small self-contained piece of the first
TH05 boss we're probably going to look at.
Next up, though: I'm not sure, actually. Both Shinki and Elis seem just a
little bit larger than the 2¼ or 4 pushes purchased so far, respectively.
Now that there's a bunch of room left in the cap again, I'll just let the
next contribution decide – with a preference for Shinki in case of a tie.
And if it will take longer than usual for the store to sell out again this
time (heh), there's still the
📝 PC-98 text RAM JIS trail word rendering research
waiting to be documented.
Two years after
📝 the first look at TH04's and TH05's bullets,
we finally get to finish their logic code by looking at the special motion
types. Bullets as a whole still aren't completely finished as the
rendering code is still waiting to be RE'd, but now we've got everything
about them that's required for decompiling the midboss and boss fights of
these games.
Just like the motion types of TH01's pellets, the ones we've got here really
are special enough to warrant an enum, despite all the
overlap in the "slow down and turn" and "bounce at certain edges of the
playfield" types. Sure, including them in the bitfield I proposed two years
ago would have allowed greater variety, but it wouldn't have saved any
memory. On the contrary: These types use a single global state variable for
the maximum turn count and delta speed, which a proper customizable
architecture would have to integrate into the bullet structure. Maybe it is
possible to stuff everything into the same amount of bytes, but not without
first completely rearchitecting the bullet structure and removing every
single piece of redundancy in there. Simply extending the system by adding a
new enum value for a new motion type would be way more
straightforward for modders.
Speaking about memory, TH05 already extends the bullet structure by 6 bytes
for the "exact linear movement" type exclusive to that game. This type is
particularly interesting for all the prospective PC-98 game developers out
there, as it nicely points out the precision limits of Q12.4 subpixels.
Regular bullet movement works by adding a Q12.4 velocity to a Q12.4 position
every frame, with the velocity typically being calculated only once on spawn
time from an 8-bit angle and a Q12.4 speed. Quantization errors from this
initial calculation can quickly compound over all the frames a bullet spends
moving across the playfield. If a bullet is only supposed to move on a
straight line though, there is a more precise way of calculating its
position: By storing the origin point, movement angle, and total distance
traveled, you can perform a full polar→Cartesian transformation every frame.
Out of the 10 danmaku patterns in TH05 that use this motion type, the
difference to regular bullet movement can be best seen in Louise's final
pattern:
Louise's final pattern in its original form, demonstrating
exact linear bullet movement. Note how each bullet spawns slightly
behind the delay cloud: ZUN simply forgot to shift the fixed origin
point along with it.The same pattern with standard bullet movement, corrupting
its intended appearance. No delay cloud-related oversights here though,
at least.
Not far away from the regular bullet code, we've also got the movement
function for the infamous curve / "cheeto" bullets. I would have almost
called them "cheetos" in the code as well, which surely fits more nicely
into 8.3 filenames than "curve bullets" does, but eh, trademarks…
As for hitboxes, we got a 16×16 one on the head node, and a 12×12 one on the
16 trail nodes. The latter simply store the position of the head node during
the last 16 frames, Snake style. But what you're all here for is probably
the turning and homing algorithm, right? Boiled down to its essence, it
works like this:
// [head] points to the controlled "head" part of a curve bullet entity.
// Angles are stored with 8 bits representing a full circle, providing free
// normalization on arithmetic overflow.
// The directions are ordered as you would expect:
// • 0x00: right (sin(0x00) = 0, cos(0x00) = +1)
// • 0x40: down (sin(0x40) = +1, cos(0x40) = 0)
// • 0x80: left (sin(0x80) = 0, cos(0x80) = -1)
// • 0xC0: up (sin(0xC0) = -1, cos(0xC0) = 0)
uint8_t angle_delta = (head->angle - player_angle_from(
head->pos.cur.x, head->pos.cur.y
));
// Stop turning if the player is 1/128ths of a circle away from this bullet
const uint8_t SNAP = 0x02;
// Else, turn either clockwise or counterclockwise by 1/256th of a circle,
// depending on what would reach the player the fastest.
if((angle_delta > SNAP) && (angle_delta < static_cast<uint8_t>(-SNAP))) {
angle_delta = (angle_delta >= 0x80) ? -0x01 : +0x01;
}
head_p->angle -= angle_delta;
5 lines of code, and not all too difficult to follow once you are familiar
with 8-bit angles… unlike what ZUN actually wrote. Which is 26 lines,
and includes an unused "friction" variable that is never set to any value
that makes a difference in the formula. uth05win
correctly saw through that all and simplified this code to something
equivalent to my explanation. Redoing that work certainly wasted a bit of my
time, and means that I now definitely need to spend another push on RE'ing
all the shared boss functions before I can start with Shinki.
So while a curve bullet's speed does get faster over time, its
angular velocity is always limited to 1/256th of a
circle per frame. This reveals the optimal strategy for dodging them:
Maximize this delta angle by staying as close to 180° away from their
current direction as possible, and let their acceleration do the rest.
At least that's the theory for dodging a single one. As a danmaku
designer, you can now of course place other bullets at these technically
optimal places to prevent a curve bullet pattern from being cheesed like
that. I certainly didn't record the video above in a single take either…
After another bunch of boring entity spawn and update functions, the
playfield shaking feature turned out as the most notable (and tricky) one to
round out these two pushes. It's actually implemented quite well in how it
simply "un-shakes" the screen by just marking every stage tile to be
redrawn. In the context of all the other tile invalidation that can take
place during a frame, that's definitely more performant than
📝 doing another EGC-accelerated memmove().
Due to these two games being double-buffered via page flipping, this
invalidation only really needs to happen for the frame after the next
one though. The immediately next frame will show the regular, un-shaken
playfield on the other VRAM page first, except during the multi-frame
shake animation when defeating a midboss, where it will also appear shifted
in a different direction… 😵 Yeah, no wonder why ZUN just always invalidates
all stage tiles for the next two frames after every shaking animation, which
is guaranteed to handle both sporadic single-frame shakes and continuous
ones. So close to good-code here.
Finally, this delivery was delayed a bit because -Tom-
requested his round-up amount to be limited to the cap in the future. Since
that makes it kind of hard to explain on a static page how much money he
will exactly provide, I now properly modeled these discounts in the website
code. The exact round-up amount is now included in both the pre-purchase
breakdown, as well as the cap bar on the main page.
With that in place, the system is now also set up for round-up offers from
other patrons. If you'd also like to support certain goals in this way, with
any amount of money, now's the time for getting in touch with me about that.
Known contributors only, though! 😛
Next up: The final bunch of shared boring boss functions. Which certainly
will give me a break from all the maintenance and research work, and speed
up delivery progress again… right?
Been 📝 a while since we last looked at any of
TH03's game code! But before that, we need to talk about Y coordinates.
During TH03's MAIN.EXE, the PC-98 graphics GDC runs in its
line-doubled 640×200 resolution, which gives the in-game portion its
distinctive stretched low-res look. This lower resolution is a consequence
of using 📝 Promisence Soft's SPRITE16 driver:
Its performance simply stems from the fact that it expects sprites to be
stored in the bottom half of VRAM, which allows them to be blitted using the
same EGC-accelerated VRAM-to-VRAM copies we've seen again and again in all
other games. Reducing the visible resolution also means that the sprites can
be stored on both VRAM pages, allowing the game to still be double-buffered.
If you force the graphics chip to run at 640×400, you can see them:
The full VRAM contents during TH03's in-game portion, as seen when forcing the system into a 640×400 resolution.
•
Note that the text chip still displays its overlaid contents at 640×400,
which means that TH03's in-game portion technically runs at two
resolutions at the same time.
But that means that any mention of a Y coordinate is ambiguous: Does it
refer to undoubled VRAM pixels, or on-screen stretched pixels? Especially
people who have known about the line doubling for years might almost expect
technical blog posts on this game to use undoubled VRAM coordinates. So,
let's introduce a new formatting convention for both on-screen
640×400 and undoubled 640×200 coordinates,
and always write out both to minimize the confusion.
Alright, now what's the thing gonna be? The enemy structure is highly
overloaded, being used for enemies, fireballs, and explosions with seemingly
different semantics for each. Maybe a bit too much to be figured out in what
should ideally be a single push, especially with all the functions that
would need to be decompiled? Bullet code would be easier, but not exactly
single-push material either. As it turns out though, there's something more
fundamental left to be done first, which both of these subsystems depend on:
collision detection!
And it's implemented exactly how I always naively imagined collision
detection to be implemented in a fixed-resolution 2D bullet hell game with
small hitboxes: By keeping a separate 1bpp bitmap of both playfields in
memory, drawing in the collidable regions of all entities on every frame,
and then checking whether any pixels at the current location of the player's
hitbox are set to 1. It's probably not done in the other games because their
single data segment was already too packed for the necessary 17,664 bytes to
store such a bitmap at pixel resolution, and 282,624 bytes for a bitmap at
Q12.4 subpixel resolution would have been prohibitively expensive in 16-bit
Real Mode DOS anyway. In TH03, on the other hand, this bitmap is doubly
useful, as the AI also uses it to elegantly learn what's on the playfield.
By halving the resolution and only tracking tiles of 2×2 / 2×1 pixels, TH03 only requires an adequate total
of 6,624 bytes of memory for the collision bitmaps of both playfields.
So how did the implementation not earn the good-code tag this time? Because the code for drawing into these bitmaps is undecompilable hand-written x86 assembly. And not just your usual ASM that was basically compiled from C and then edited to maybe optimize register allocation and maybe replace a bunch of local variables with self-modifying code, oh no. This code is full of overly clever bit twiddling, abusing the fact that the 16-bit AX,
BX, CX, and DX registers can also be
accessed as two 8-bit registers, calculations that change the semantic
meaning behind the value of a register, or just straight-up reassignments of
different values to the same small set of registers. Sure, in some way it is
impressive, and it all does work and correctly covers every edge
case, but come on. This could have all been a lot more readable in
exchange for just a few CPU cycles.
What's most interesting though are the actual shapes that these functions
draw into the collision bitmap. On the surface, we have:
vertical slopes at any angle across the whole playfield; exclusively
used for Chiyuri's diagonal laser EX attack
straight vertical lines, with a width of 1 tile; exclusively used for
the 2×2 / 2×1 hitboxes of bullets
rectangles at arbitrary sizes
But only 2) actually draws a full solid line. 1) and 3) are only ever drawn
as horizontal stripes, with a hardcoded distance of 2 vertical tiles
between every stripe of a slope, and 4 vertical tiles between every stripe
of a rectangle. That's 66-75% of each rectangular entity's intended hitbox
not actually taking part in collision detection. Now, if player hitboxes
were ≤ 6 / 3 pixels, we'd have one
possible explanation of how the AI can "cheat", because it could just
precisely move through those blank regions at TAS speeds. So, let's make
this two pushes after all and tell the complete story, since this is one of
the more interesting aspects to still be documented in this game.
And the code only gets worse. While the player
collision detection function is decompilable, it might as well not
have been, because it's just more of the same "optimized", hard-to-follow
assembly. With the four splittable 16-bit registers having a total of 20
different meanings in this function, I would have almost preferred
self-modifying code…
In fact, it was so bad that it prompted some maintenance work on my inline
assembly coding standards as a whole. Turns out that the _asm
keyword is not only still supported in modern Visual Studio compilers, but
also in Clang with the -fms-extensions flag, and compiles fine
there even for 64-bit targets. While that might sound like amazing news at
first ("awesome, no need to rewrite this stuff for my x86_64 Linux
port!"), you quickly realize that almost all inline assembly in this
codebase assumes either PC-98 hardware, segmented 16-bit memory addressing,
or is a temporary hack that will be removed with further RE progress.
That's mainly because most of the raw arithmetic code uses Turbo C++'s
register pseudovariables where possible. While they certainly have their
drawbacks, being a non-standard extension that's not supported in other
x86-targeting C compilers, their advantages are quite significant: They
allow this code to stay in the same language, and provide slightly more
immediate portability to any other architecture, together with
📝 readability and maintainability improvements that can get quite significant when combined with inlining:
// This one line compiles to five ASM instructions, which would need to be
// spelled out in any C compiler that doesn't support register pseudovariables.
// By adding typed aliases for these registers via `#define`, this code can be
// both made even more readable, and be prepared for an easier transformation
// into more portable local variables.
_ES = (((_AX * 4) + _BX) + SEG_PLANE_B);
However, register pseudovariables might cause potential portability issues
as soon as they are mixed with inline assembly instructions that rely on
their state. The lazy way of "supporting pseudo-registers" in other
compilers would involve declaring the full set as global variables, which
would immediately break every one of those instances:
_DI = 0;
_AX = 0xFFFF;
// Special x86 instruction doing the equivalent of
//
// *reinterpret_cast(MK_FP(_ES, _DI)) = _AX;
// _DI += sizeof(uint16_t);
//
// Only generated by Turbo C++ in very specific cases, and therefore only
// reliably available through inline assembly.
asm { movsw; }
What's also not all too standardized, though, are certain variants of
the asm keyword. That's why I've now introduced a distinction
between the _asm keyword for "decently sane" inline assembly,
and the slightly less standard asm keyword for inline assembly
that relies on the contents of pseudo-registers, and should break on
compilers that don't support them. So yeah, have some minor
portability work in exchange for these two pushes not having all that much
in RE'd content.
With that out of the way and the function deciphered, we can confirm the
player hitboxes to be a constant 8×8 /
8×4 pixels, and prove that the hit stripes are nothing but
an adequate optimization that doesn't affect gameplay in any way.
And what's the obvious thing to immediately do if you have both the
collision bitmap and the player hitbox? Writing a "real hitbox" mod, of
course:
Reorder the calls to rendering functions so that player and shot sprites
are rendered after bullets
Blank out all player sprite pixels outside an
8×8 / 8×4 box around the center
point
After the bullet rendering function, turn on the GRCG in RMW mode and
set the tile register set to the background color
Stretch the negated contents of collision bitmap onto each playfield,
leaving only collidable pixels untouched
Do the same with the actual, non-negated contents and a white color, for
extra contrast against the background. This also makes sure to show any
collidable areas whose sprite pixels are transparent, such as with the moon
enemy. (Yeah, how unfair.) Doing that also loses a lot of information about
the playfield, such as enemy HP indicated by their color, but what can you
do:
A decently busy TH03 in-game frame and its underlying collision bitmap,
showing off all three different collision shapes together with the
player hitboxes.
2022-02-18-TH03-real-hitbox.zip
The secret for writing such mods before having reached a sufficient level of
position independence? Put your new code segment into DGROUP,
past the end of the uninitialized data section. That's why this modded
MAIN.EXE is a lot larger than you would expect from the raw amount of new code: The file now actually needs to store all these
uninitialized 0 bytes between the end of the data segment and the first
instruction of the mod code – normally, this number is simply a part of the
MZ EXE header, and doesn't need to be redundantly stored on disk. Check the
th03_real_hitbox
branch for the code.
And now we know why so many "real hitbox" mods for the Windows Touhou games
are inaccurate: The games would simply be unplayable otherwise – or can
you dodge rapidly moving 2×2 /
2×1 blocks as an 8×8 /
8×4 rectangle that is smaller than your shot sprites,
especially without focused movement? I can't.
Maybe it will feel more playable after making explosions visible, but that
would need more RE groundwork first.
It's also interesting how adding two full GRCG-accelerated redraws of both
playfields per frame doesn't significantly drop the game's frame rate – so
why did the drawing functions have to be micro-optimized again? It
would be possible in one pass by using the GRCG's TDW mode, which
should theoretically be 8× faster, but I have to stop somewhere.
Next up: The final missing piece of TH04's and TH05's
bullet-moving code, which will include a certain other
type of projectile as well.
Here we go, TH01 Sariel! This is the single biggest boss fight in all of
PC-98 Touhou: If we include all custom effect code we previously decompiled,
it amounts to a total of 10.31% of all code in TH01 (and 3.14%
overall). These 8 pushes cover the final 8.10% (or 2.47% overall),
and are likely to be the single biggest delivery this project will ever see.
Considering that I only managed to decompile 6.00% across all games in 2021,
2022 is already off to a much better start!
So, how can Sariel's code be that large? Well, we've got:
16 danmaku patterns; including the one snowflake detonating into a giant
94×32 hitbox
Gratuitous usage of floating-point variables, bloating the binary thanks
to Turbo C++ 4.0J's particularly horrid code generation
The hatching birds that shoot pellets
3 separate particle systems, sharing the general idea, overall code
structure, and blitting algorithm, but differing in every little detail
The "gust of wind" background transition animation
5 sets of custom monochrome sprite animations, loaded from
BOSS6GR?.GRC
A further 3 hardcoded monochrome 8×8 sprites for the "swaying leaves"
pattern during the second form
In total, it's just under 3,000 lines of C++ code, containing a total of 8
definite ZUN bugs, 3 of them being subpixel/pixel confusions. That might not
look all too bad if you compare it to the
📝 player control function's 8 bugs in 900 lines of code,
but given that Konngara had 0… (Edit (2022-07-17):
Konngara contains two bugs after all: A
📝 possible heap corruption in test or debug mode,
and the infamous
📝 temporary green discoloration.)
And no, the code doesn't make it obvious whether ZUN coded Konngara or
Sariel first; there's just as much evidence for either.
Some terminology before we start: Sariel's first form is separated
into four phases, indicated by different background images, that
cycle until Sariel's HP reach 0 and the second, single-phase form
starts. The danmaku patterns within each phase are also on a cycle,
and the game picks a random but limited number of patterns per phase before
transitioning to the next one. The fight always starts at pattern 1 of phase
1 (the random purple lasers), and each new phase also starts at its
respective first pattern.
Sariel's bugs already start at the graphics asset level, before any code
gets to run. Some of the patterns include a wand raise animation, which is
stored in BOSS6_2.BOS:
Umm… OK? The same sprite twice, just with slightly different
colors? So how is the wand lowered again?
The "lowered wand" sprite is missing in this file simply because it's
captured from the regular background image in VRAM, at the beginning of the
fight and after every background transition. What I previously thought to be
📝 background storage code has therefore a
different meaning in Sariel's case. Since this captured sprite is fully
opaque, it will reset the entire 128×128 wand area… wait, 128×128, rather
than 96×96? Yup, this lowered sprite is larger than necessary, wasting 1,967
bytes of conventional memory. That still doesn't quite explain the
second sprite in BOSS6_2.BOS though. Turns out that the black
part is indeed meant to unblit the purple reflection (?) in the first
sprite. But… that's not how you would correctly unblit that?
The first sprite already eats up part of the red HUD line, and the second
one additionally fails to recover the seal pixels underneath, leaving a nice
little black hole and some stray purple pixels until the next background
transition. Quite ironic given that both
sprites do include the right part of the seal, which isn't even part of the
animation.
Just like Konngara, Sariel continues the approach of using a single function
per danmaku pattern or custom entity. While I appreciate that this allows
all pattern- and entity-specific state to be scoped locally to that one
function, it quickly gets ugly as soon as such a function has to do more than one thing.
The "bird function" is particularly awful here: It's just one if(…)
{…} else if(…) {…} else if(…) {…} chain with different
branches for the subfunction parameter, with zero shared code between any of
these branches. It also uses 64-bit floating-point double as
its subpixel type… and since it also takes four of those as parameters
(y'know, just in case the "spawn new bird" subfunction is called), every
call site has to also push four double values onto the stack.
Thanks to Turbo C++ even using the FPU for pushing a 0.0 constant, we
have already reached maximum floating-point decadence before even having
seen a single danmaku pattern. Why decadence? Every possible spawn position
and velocity in both bird patterns just uses pixel resolution, with no
fractional component in sight. And there goes another 720 bytes of
conventional memory.
Speaking about bird patterns, the red-bird one is where we find the first
code-level ZUN bug: The spawn cross circle sprite suddenly disappears after
it finished spawning all the bird eggs. How can we tell it's a bug? Because
there is code to smoothly fly this sprite off the playfield, that
code just suddenly forgets that the sprite's position is stored in Q12.4
subpixels, and treats it as raw screen pixels instead.
As a result, the well-intentioned 640×400
screen-space clipping rectangle effectively shrinks to 38×23 pixels in the
top-left corner of the screen. Which the sprite is always outside of, and
thus never rendered again.
The intended animation is easily restored though:
Sariel's third pattern, and the first to spawn birds, in its original
and fixed versions. Note that I somewhat fixed the bird hatch animation
as well: ZUN's code never unblits any frame of animation there, and
simply blits every new one on top of the previous one.
Also, did you know that birds actually have a quite unfair 14×38-pixel
hitbox? Not that you'd ever collide with them in any of the patterns…
Another 3 of the 8 bugs can be found in the symmetric, interlaced spawn rays
used in three of the patterns, and the 32×32 debris "sprites" shown at their endpoint, at
the edge of the screen. You kinda have to commend ZUN's attention to detail
here, and how he wrote a lot of code for those few rapidly animated pixels
that you most likely don't
even notice, especially with all the other wrong pixels
resulting from rendering glitches. One of the bugs in the very final pattern
of phase 4 even turns them into the vortex sprites from the second pattern
in phase 1 during the first 5 frames of
the first time the pattern is active, and I had to single-step the blitting
calls to verify it.
It certainly was annoying how much time I spent making sense of these bugs,
and all weird blitting offsets, for just a few pixels… Let's look at
something more wholesome, shall we?
So far, we've only seen the PC-98 GRCG being used in RMW (read-modify-write)
mode, which I previously
📝 explained in the context of TH01's red-white HP pattern.
The second of its three modes, TCR (Tile Compare Read), affects VRAM reads
rather than writes, and performs "color extraction" across all 4 bitplanes:
Instead of returning raw 1bpp data from one plane, a VRAM read will instead
return a bitmask, with a 1 bit at every pixel whose full 4-bit color exactly
matches the color at that offset in the GRCG's tile register, and 0
everywhere else. Sariel uses this mode to make sure that the 2×2 particles
and the wind effect are only blitted on top of "air color" pixels, with
other parts of the background behaving like a mask. The algorithm:
Set the GRCG to TCR mode, and all 8 tile register dots to the air
color
Read N bits from the target VRAM position to obtain an N-bit mask where
all 1 bits indicate air color pixels at the respective position
AND that mask with the alpha plane of the sprite to be drawn, shifted to
the correct start bit within the 8-pixel VRAM byte
Set the GRCG to RMW mode, and all 8 tile register dots to the color that
should be drawn
Write the previously obtained bitmask to the same position in VRAM
Quite clever how the extracted colors double as a secondary alpha plane,
making for another well-earned good-code tag. The wind effect really doesn't deserve it, though:
ZUN calculates every intermediate result inside this function
over and over and over again… Together with some ugly
pointer arithmetic, this function turned into one of the most tedious
decompilations in a long while.
This gradual effect is blitted exclusively to the front page of VRAM,
since parts of it need to be unblitted to create the illusion of a gust of
wind. Then again, anything that moves on top of air-colored background –
most likely the Orb – will also unblit whatever it covered of the effect…
As far as I can tell, ZUN didn't use TCR mode anywhere else in PC-98 Touhou.
Tune in again later during a TH04 or TH05 push to learn about TDW, the final
GRCG mode!
Speaking about the 2×2 particle systems, why do we need three of them? Their
only observable difference lies in the way they move their particles:
Up or down in a straight line (used in phases 4 and 2,
respectively)
Left or right in a straight line (used in the second form)
Left and right in a sinusoidal motion (used in phase 3, the "dark
orange" one)
Out of all possible formats ZUN could have used for storing the positions
and velocities of individual particles, he chose a) 64-bit /
double-precision floating-point, and b) raw screen pixels. Want to take a
guess at which data type is used for which particle system?
If you picked double for 1) and 2), and raw screen pixels for
3), you are of course correct! Not that I'm implying
that it should have been the other way round – screen pixels would have
perfectly fit all three systems use cases, as all 16-bit coordinates
are extended to 32 bits for trigonometric calculations anyway. That's what,
another 1.080 bytes of wasted conventional memory? And that's even
calculated while keeping the current architecture, which allocates
space for 3×30 particles as part of the game's global data, although only
one of the three particle systems is active at any given time.
That's it for the first form, time to put on "Civilization
of Magic"! Or "死なばもろとも"? Or "Theme of 地獄めくり"? Or whatever SYUGEN is
supposed to mean…
… and the code of these final patterns comes out roughly as exciting as
their in-game impact. With the big exception of the very final "swaying
leaves" pattern: After 📝 Q4.4,
📝 Q28.4,
📝 Q24.8, and double variables,
this pattern uses… decimal subpixels? Like, multiplying the number by
10, and using the decimal one's digit to represent the fractional part?
Well, sure, if you really insist on moving the leaves in cleanly
represented integer multiples of ⅒, which is infamously impossible in IEEE
754. Aside from aesthetic reasons, it only really combines less precision
(10 possible fractions rather than the usual 16) with the inferior
performance of having to use integer divisions and multiplications rather
than simple bit shifts. And it's surely not because the leaf sprites needed
an extended integer value range of [-3276, +3276], compared to
Q12.4's [-2047, +2048]: They are clipped to 640×400 screen space
anyway, and are removed as soon as they leave this area.
This pattern also contains the second bug in the "subpixel/pixel confusion
hiding an entire animation" category, causing all of
BOSS6GR4.GRC to effectively become unused:
The "swaying leaves" pattern. ZUN intended a splash animation to be
shown once each leaf "spark" reaches the top of the playfield, which is
never displayed in the original game.
At least their hitboxes are what you would expect, exactly covering the
30×30 pixels of Reimu's sprite. Both animation fixes are available on the th01_sariel_fixes
branch.
After all that, Sariel's main function turned out fairly unspectacular, just
putting everything together and adding some shake, transition, and color
pulse effects with a bunch of unnecessary hardware palette changes. There is
one reference to a missing BOSS6.GRP file during the
first→second form transition, suggesting that Sariel originally had a
separate "first form defeat" graphic, before it was replaced with just the
shaking effect in the final game.
Speaking about the transition code, it is kind of funny how the… um,
imperative and concrete nature of TH01 leads to these 2×24
lines of straight-line code. They kind of look like ZUN rattling off a
laundry list of subsystems and raw variables to be reinitialized, making
damn sure to not forget anything.
Whew! Second PC-98 Touhou boss completely decompiled, 29 to go, and they'll
only get easier from here! 🎉 The next one in line, Elis, is somewhere
between Konngara and Sariel as far as x86 instruction count is concerned, so
that'll need to wait for some additional funding. Next up, therefore:
Looking at a thing in TH03's main game code – really, I have little
idea what it will be!
Now that the store is open again, also check out the
📝 updated RE progress overview I've posted
together with this one. In addition to more RE, you can now also directly
order a variety of mods; all of these are further explained in the order
form itself.
TH03 finally passed 20% RE, and the newly decompiled code contains no
serious ZUN bugs! What a nice way to end the year.
There's only a single unlockable feature in TH03: Chiyuri and Yumemi as
playable characters, unlocked after a 1CC on any difficulty. Just like the
Extra Stages in TH04 and TH05, YUME.NEM contains a single
designated variable for this unlocked feature, making it trivial to craft a
fully unlocked score file without recording any high scores that others
would have to compete against. So, we can now put together a complete set
for all PC-98 Touhou games: 2021-12-27-Fully-unlocked-clean-score-files.zip
It would have been cool to set the randomly generated encryption keys in
these files to a fixed value so that they cancel out and end up not actually
encrypting the file. Too bad that TH03 also started feeding each encrypted
byte back into its stream cipher, which makes this impossible.
The main loading and saving code turned out to be the second-cleanest
implementation of a score file format in PC-98 Touhou, just behind TH02.
Only two of the YUME.NEM functions come with nonsensical
differences between OP.EXE and MAINL.EXE, rather
than 📝 all of them, as in TH01 or
📝 too many of them, as in TH04 and TH05. As
for the rest of the per-difficulty structure though… well, it quickly
becomes clear why this was the final score file format to be RE'd. The name,
score, and stage fields are directly stored in terms of the internal
REGI*.BFT sprite IDs used on the high score screen. TH03 also
stores 10 score digits for each place rather than the 9 possible ones, keeps
any leading 0 digits, and stores the letters of entered names in reverse
order… yeah, let's decompile the high score screen as well, for a full
understanding of why ZUN might have done all that. (Answer: For no reason at
all. )
And wow, what a breath of fresh air. It's surely not
good-code: The overlapping shadows resulting from using
a 24-pixel letterspacing with 32-pixel glyphs in the name column led ZUN to
do quite a lot of unnecessary and slightly confusing rendering work when
moving the cursor back and forth, and he even forgot about the EGC there.
But it's nowhere close to the level of jank we saw in
📝 TH01's high score menu last year. Good to
see that ZUN had learned a thing or two by his third game – especially when
it comes to storing the character map cursor in terms of a character ID,
and improving the layout of the character map:
That's almost a nicely regular grid there. With the question mark and the
double-wide SP, BS, and END options, the cursor
movement code only comes with a reasonable two exceptions, which are easily
handled. And while I didn't get this screen completely decompiled,
one additional push was enough to cover all important code there.
The only potential glitch on this screen is a result of ZUN's continued use
of binary-coded
decimal digits without any bounds check or cap. Like the in-game HUD
score display in TH04 and TH05, TH03's high score screen simply uses the
next glyph in the character set for the most significant digit of any score
above 1,000,000,000 points – in this case, the period. Still, it only
really gets bad at 8,000,000,000 points: Once the glyphs are
exhausted, the blitting function ends up accessing garbage data and filling
the entire screen with garbage pixels. For comparison though, the current world record
is 133,650,710 points, so good luck getting 8 billion in the first
place.
Next up: Starting 2022 with the long-awaited decompilation of TH01's Sariel
fight! Due to the 📝 recent price increase,
we now got a window in the cap that
is going to remain open until tomorrow, providing an early opportunity to
set a new priority after Sariel is done.
EMS memory! The
infamous stopgap measure between the 640 KiB ("ought to be enough for
everyone") of conventional
memory offered by DOS from the very beginning, and the later XMS standard for
accessing all the rest of memory up to 4 GiB in the x86 Protected Mode. With
an optionally active EMS driver, TH04 and TH05 will make use of EMS memory
to preload a bunch of situational .CDG images at the beginning of
MAIN.EXE:
The "eye catch" game title image, shown while stages are loaded
The character-specific background image, shown while bombing
The player character dialog portraits
TH05 additionally stores the boss portraits there, preloading them
at the beginning of each stage. (TH04 instead keeps them in conventional
memory during the entire stage.)
Once these images are needed, they can then be copied into conventional
memory and accessed as usual.
Uh… wait, copied? It certainly would have been possible to map EMS
memory to a regular 16-bit Real Mode segment for direct access,
bank-switching out rarely used system or peripheral memory in exchange for
the EMS data. However, master.lib doesn't expose this functionality, and
only provides functions for copying data from EMS to regular memory and vice
versa.
But even that still makes EMS an excellent fit for the large image files
it's used for, as it's possible to directly copy their pixel data from EMS
to VRAM. (Yes, I tried!) Well… would, because ZUN doesn't do
that either, and always naively copies the images to newly allocated
conventional memory first. In essence, this dumbs down EMS into just another
layer of the memory hierarchy, inserted between conventional memory and
disk: Not quite as slow as disk, but still requiring that
memcpy() to retrieve the data. Most importantly though: Using
EMS in this way does not increase the total amount of memory
simultaneously accessible to the game. After all, some other data will have
to be freed from conventional memory to make room for the newly loaded data.
The most idiomatic way to define the game-specific layout of the EMS area
would be either a struct or an enum.
Unfortunately, the total size of all these images exceeds the range of a
16-bit value, and Turbo C++ 4.0J supports neither 32-bit enums
(which are silently degraded to 16-bit) nor 32-bit structs
(which simply don't compile). That still leaves raw compile-time constants
though, you only have to manually define the offset to each image in terms
of the size of its predecessor. But instead of doing that, ZUN just placed
each image at a nice round decimal offset, each slightly larger than the
actual memory required by the previous image, just to make sure that
everything fits. This results not only in quite
a bit of unnecessary padding, but also in technically the single
biggest amount of "wasted" memory in PC-98 Touhou: Out of the 180,000 (TH04)
and 320,000 (TH05) EMS bytes requested, the game only uses 135,552 (TH04)
and 175,904 (TH05) bytes. But hey, it's EMS, so who cares, right? Out of all
the opportunities to take shortcuts during development, this is among the
most acceptable ones. Any actual PC-98 model that could run these two games
comes with plenty of memory for this to not turn into an actual issue.
On to the EMS-using functions themselves, which are the definition of
"cross-cutting concerns". Most of these have a fallback path for the non-EMS
case, and keep the loaded .CDG images in memory if they are immediately
needed. Which totally makes sense, but also makes it difficult to find names
that reflect all the global state changed by these functions. Every one of
these is also just called from a single place, so inlining
them would have saved me a lot of naming and documentation trouble
there.
The TH04 version of the EMS allocation code was actually displayed on ZUN's monitor in the
2010 MAG・ネット documentary; WindowsTiger already transcribed the low-quality video image
in 2019. By 2015 ReC98 standards, I would have just run with that, but
the current project goal is to write better code than ZUN, so I didn't. 😛
We sure ain't going to use magic numbers for EMS offsets.
The dialog init and exit code then is completely different in both games,
yet equally cross-cutting. TH05 goes even further in saving conventional
memory, loading each individual player or boss portrait into a single .CDG
slot immediately before blitting it to VRAM and freeing the pixel data
again. People who play TH05 without an active EMS driver are surely going to
enjoy the hard drive access lag between each portrait change…
TH04, on the other hand, also abuses the dialog
exit function to preload the Mugetsu defeat / Gengetsu entrance and
Gengetsu defeat portraits, using a static variable to track how often the
function has been called during the Extra Stage… who needs function
parameters anyway, right?
This is also the function in which TH04 infamously crashes after the Stage 5
pre-boss dialog when playing with Reimu and without any active EMS driver.
That crash is what motivated this look into the games' EMS usage… but the
code looks perfectly fine? Oh well, guess the crash is not related to EMS
then. Next u–
OK, of course I can't leave it like that. Everyone is expecting a fix now,
and I still got half of a push left over after decompiling the regular EMS
code. Also, I've now RE'd every function that could possibly be involved in
the crash, and this is very likely to be the last time I'll be looking at
them.
Turns out that the bug has little to do with EMS, and everything to do with
ZUN limiting the amount of conventional RAM that TH04's
MAIN.EXE is allowed to use, and then slightly miscalculating
this upper limit. Playing Stage 5 with Reimu is the most asset-intensive
configuration in this game, due to the combination of
6 player portraits (Marisa has only 5), at 128×128 pixels each
a 288×256 background for the boss fight, tied in size only with the
ones in the Extra Stage
the additional 96×80 image for the vertically scrolling stars during
the stage, wastefully stored as 4 bitplanes rather than a single one.
This image is never freed, not even at the end of the stage.
The star image used in TH04's Stage 5.
Remove any single one of the above points, and this crash would have never
occurred. But with all of them combined, the total amount of memory consumed
by TH04's MAIN.EXE just barely exceeds ZUN's limit of 320,000
bytes, by no more than 3,840 bytes, the size of the star image.
But wait: As we established earlier, EMS does nothing to reduce the amount
of conventional memory used by the game. In fact, if you disabled TH04's EMS
handling, you'd still get this crash even if you are running an EMS
driver and loaded DOS into the High Memory Area to free up as much
conventional RAM as possible. How can EMS then prevent this crash in the
first place?
The answer: It's only because ZUN's usage of EMS bypasses the need to load
the cached images back out of the XOR-encrypted 東方幻想.郷
packfile. Leaving aside the general
stupidity of any game data file encryption*, master.lib's decryption
implementation is also quite wasteful: It uses a separate buffer that
receives fixed-size chunks of the file, before decrypting every individual
byte and copying it to its intended destination buffer. That really
resembles the typical slowness of a C fread() implementation
more than it does the highly optimized ASM code that master.lib purports to
be… And how large is this well-hidden decryption buffer? 4 KiB.
So, looking back at the game, here is what happens once the Stage 5
pre-battle dialog ends:
Reimu's bomb background image, which was previously freed to make space
for her dialog portraits, has to be loaded back into conventional memory
from disk
BB0.CDG is found inside the 東方幻想.郷
packfile
file_ropen() ends up allocating a 4 KiB buffer for the
encrypted packfile data, getting us the decisive ~4 KiB closer to the memory
limit
The .CDG loader tries to allocate 52 608 contiguous bytes for the
pixel data of Reimu's bomb image
This would exceed the memory limit, so hmem_allocbyte()
fails and returns a nullptr
ZUN doesn't check for this case (as usual)
The pixel data is loaded to address 0000:0000,
overwriting the Interrupt Vector Table and whatever comes after
The game crashes
The final frame rendered by a crashing TH04.
The 4 KiB encryption buffer would only be freed by the corresponding
file_close() call, which of course never happens because the
game crashes before it gets there. At one point, I really did suspect the
cause to be some kind of memory leak or fragmentation inside master.lib,
which would have been quite delightful to fix.
Instead, the most straightforward fix here is to bump up that memory limit
by at least 4 KiB. Certainly easier than squeezing in a
cdg_free() call for the star image before the pre-boss dialog
without breaking position dependence.
Or, even better, let's nuke all these memory limits from orbit
because they make little sense to begin with, and fix every other potential
out-of-memory crash that modders would encounter when adding enough data to
any of the 4 games that impose such limits on themselves. Unless you want to
launch other binaries (which need to do their own memory allocations) after
launching the game, there's really no reason to restrict the amount of
memory available to a DOS process. Heck, whenever DOS creates a new one, it
assigns all remaining free memory by default anyway.
Removing the memory limits also removes one of ZUN's few error checks, which
end up quitting the game if there isn't at least a given maximum amount of
conventional RAM available. While it might be tempting to reserve enough
memory at the beginning of execution and then never check any allocation for
a potential failure, that's exactly where something like TH04's crash
comes from.
This game is also still running on DOS, where such an initial allocation
failure is very unlikely to happen – no one fills close to half of
conventional RAM with TSRs and then tries running one of these games. It
might have been useful to detect systems with less than 640 KiB of
actual, physical RAM, but none of the PC-98 models with that little amount
of memory are fast enough to run these games to begin with. How ironic… a
place where ZUN actually added an error check, and then it's mostly
pointless.
Here's an archive that contains both fix variants, just in case. These were
compiled from the th04_noems_crash_fix
and mem_assign_all
branches, and contain as little code changes as possible. Edit (2022-04-18): For TH04, you probably want to download
the 📝 community choice fix package instead,
which contains this fix along with other workarounds for the Divide
error crashes.
2021-11-29-Memory-limit-fixes.zip
So yeah, quite a complex bug, leaving no time for the TH03 scorefile format
research after all. Next up: Raising prices.
OK, TH01 missile bullets. Can we maybe have a well-behaved entity type,
without any weirdness? Just once?
Ehh, kinda. Apart from another 150 bytes wasted on unused structure members,
this code is indeed more on the low end in terms of overall jank. It does
become very obvious why dodging these missiles in the YuugenMagan, Mima, and
Elis fights feels so awful though: An unfair 46×46 pixel hitbox around
Reimu's center pixel, combined with the comeback of
📝 interlaced rendering, this time in every
stage. ZUN probably did this because missiles are the only 16×16 sprite in
TH01 that is blitted to unaligned X positions, which effectively ends up
touching a 32×16 area of VRAM per sprite.
But even if we assume VRAM writes to be the bottleneck here, it would
have been totally possible to render every missile in every frame at roughly
the same amount of CPU time that the original game uses for interlaced
rendering:
Note that all missile sprites only use two colors, white and green.
Instead of naively going with the usual four bitplanes, extract the
pixels drawn in each of the two used colors into their own bitplanes.
master.lib calls this the "tiny format".
Use the GRCG to draw these two bitplanes in the intended white and green
colors, halving the amount of VRAM writes compared to the original
function.
(Not using the .PTN format would have also avoided the inconsistency of
storing the missile sprites in boss-specific sprite slots.)
That's an optimization that would have significantly benefitted the game, in
contrast to all of the fake ones
introduced in later games. Then again, this optimization is
actually something that the later games do, and it might have in fact been
necessary to achieve their higher bullet counts without significant
slowdown.
After some effectively unused Mima sprite effect code that is so broken that
it's impossible to make sense out of it, we get to the final feature I
wanted to cover for all bosses in parallel before returning to Sariel: The
separate sprite background storage for moving or animated boss sprites in
the Mima, Elis, and Sariel fights. But, uh… why is this necessary to begin
with? Doesn't TH01 already reserve the other VRAM page for backgrounds?
Well, these sprites are quite big, and ZUN didn't want to blit them from
main memory on every frame. After all, TH01 and TH02 had a minimum required
clock speed of 33 MHz, half of the speed required for the later three games.
So, he simply blitted these boss sprites to both VRAM pages, leading
the usual unblitting calls to only remove the other sprites on top of the
boss. However, these bosses themselves want to move across the screen…
and this makes it necessary to save the stage background behind them
in some other way.
Enter .PTN, and its functions to capture a 16×16 or 32×32 square from VRAM
into a sprite slot. No problem with that approach in theory, as the size of
all these bigger sprites is a multiple of 32×32; splitting a larger sprite
into these smaller 32×32 chunks makes the code look just a little bit clumsy
(and, of course, slower).
But somewhere during the development of Mima's fight, ZUN apparently forgot
that those sprite backgrounds existed. And once Mima's 🚫 casting sprite is
blitted on top of her regular sprite, using just regular sprite
transparency, she ends up with her infamous third arm:
Ironically, there's an unused code path in Mima's unblit function where ZUN
assumes a height of 48 pixels for Mima's animation sprites rather than the
actual 64. This leads to even clumsier .PTN function calls for the bottom
128×16 pixels… Failing to unblit the bottom 16 pixels would have also
yielded that third arm, although it wouldn't have looked as natural. Still
wouldn't say that it was intentional; maybe this casting sprite was just
added pretty late in the game's development?
So, mission accomplished, Sariel unblocked… at 2¼ pushes. That's quite some time left for some smaller stage initialization
code, which bundles a bunch of random function calls in places where they
logically really don't belong. The stage opening animation then adds a bunch
of VRAM inter-page copies that are not only redundant but can't even be
understood without knowing the hidden internal state of the last VRAM page
accessed by previous ZUN code…
In better news though: Turbo C++ 4.0 really doesn't seem to have any
complexity limit on inlining arithmetic expressions, as long as they only
operate on compile-time constants. That's how we get macro-free,
compile-time Shift-JIS to JIS X 0208 conversion of the individual code
points in the 東方★靈異伝 string, in a compiler from 1994. As long as you
don't store any intermediate results in variables, that is…
But wait, there's more! With still ¼ of a push left, I also went for the
boss defeat animation, which includes the route selection after the SinGyoku
fight.
As in all other instances, the 2× scaled font is accomplished by first
rendering the text at regular 1× resolution to the other, invisible VRAM
page, and then scaled from there to the visible one. However, the route
selection is unique in that its scaled text is both drawn transparently on
top of the stage background (not onto a black one), and can also change
colors depending on the selection. It would have been no problem to unblit
and reblit the text by rendering the 1× version to a position on the
invisible VRAM page that isn't covered by the 2× version on the visible one,
but ZUN (needlessly) clears the invisible page before rendering any text.
Instead, he assigned a separate VRAM color for both
the 魔界 and 地獄 options, and only changed the palette value for
these colors to white or gray, depending on the correct selection. This is
another one of the
📝 rare cases where TH01 demonstrates good use of PC-98 hardware,
as the 魔界へ and 地獄へ strings don't need to be reblitted during the selection process, only the Orb "cursor" does.
Then, why does this still not count as good-code? When
changing palette colors, you kinda need to be aware of everything
else that can possibly be on screen, which colors are used there, and which
aren't and can therefore be used for such an effect without affecting other
sprites. In this case, well… hover over the image below, and notice how
Reimu's hair and the bomb sprites in the HUD light up when Makai is
selected:
This push did end on a high note though, with the generic, non-SinGyoku
version of the defeat animation being an easily parametrizable copy. And
that's how you decompile another 2.58% of TH01 in just slightly over three
pushes.
Now, we're not only ready to decompile Sariel, but also Kikuri, Elis, and
SinGyoku without needing any more detours into non-boss code. Thanks to the
current TH01 funding subscriptions, I can plan to cover most, if not all, of
Sariel in a single push series, but the currently 3 pending pushes probably
won't suffice for Sariel's 8.10% of all remaining code in TH01. We've got
quite a lot of not specifically TH01-related funds in the backlog to pass
the time though.
Due to recent developments, it actually makes quite a lot of sense to take a
break from TH01: spaztron64 has
managed what every Touhou download site so far has failed to do: Bundling
all 5 game onto a single .HDI together with pre-configured PC-98
emulators and a nice boot menu, and hosting the resulting package on a
proper website. While this first release is already quite good (and much
better than my attempt from 2014), there is still a bit of room for
improvement to be gained from specific ReC98 research. Next up,
therefore:
Researching how TH04 and TH05 use EMS memory, together with the cause
behind TH04's crash in Stage 5 when playing as Reimu without an EMS driver
loaded, and
reverse-engineering TH03's score data file format
(YUME.NEM), which hopefully also comes with a way of building a
file that unlocks all characters without any high scores.
No technical obstacles for once! Just pure overcomplicated ZUN code. Unlike
📝 Konngara's main function, the main TH01
player function was every bit as difficult to decompile as you would expect
from its size.
With TH01 using both separate left- and right-facing sprites for all of
Reimu's moves and separate classes for Reimu's 32×32 and 48×*
sprites, we're already off to a bad start. Sure, sprite mirroring is
minimally more involved on PC-98, as the planar
nature of VRAM requires the bits within an 8-pixel byte to also be
mirrored, in addition to writing the sprite bytes from right to left. TH03
uses a 256-byte lookup table for this, generated at runtime by an infamous
micro-optimized and undecompilable ASM algorithm. With TH01's existing
architecture, ZUN would have then needed to write 3 additional blitting
functions. But instead, he chose to waste a total of 26,112 bytes of memory
on pre-mirrored sprites…
Alright, but surely selecting those sprites from code is no big deal? Just
store the direction Reimu is facing in, and then add some branches to the
rendering code. And there is in fact a variable for Reimu's direction…
during regular arrow-key movement, and another one while shooting and
sliding, and a third as part of the special attack types,
launched out of a slide.
Well, OK, technically, the last two are the same variable. But that's even
worse, because it means that ZUN stores two distinct enums at
the same place in memory: Shooting and sliding uses 1 for left,
2 for right, and 3 for the "invalid" direction of
holding both, while the special attack types indicate the direction in their
lowest bit, with 0 for right and 1 for left. I
decompiled the latter as bitflags, but in ZUN's code, each of the 8
permutations is handled as a distinct type, with copy-pasted and adapted
code… The interpretation of this
two-enum "sub-mode" union variable is controlled
by yet another "mode" variable… and unsurprisingly, two of the bugs in this
function relate to the sub-mode variable being interpreted incorrectly.
Also, "rendering code"? This one big function basically consists of separate
unblit→update→render code snippets for every state and direction Reimu can
be in (moving, shooting, swinging, sliding, special-attacking, and bombing),
pasted together into a tangled mess of nested if(…) statements.
While a lot of the code is copy-pasted, there are still a number of
inconsistencies that defeat the point of my usual refactoring treatment.
After all, with a total of 85 conditional branches, anything more than I did
would have just obscured the control flow too badly, making it even harder
to understand what's going on.
In the end, I spotted a total of 8 bugs in this function, all of which leave
Reimu invisible for one or more frames:
2 frames after all special attacks
2 frames after swing attacks, and
4 frames before swing attacks
Thanks to the last one, Reimu's first swing animation frame is never
actually rendered. So whenever someone complains about TH01 sprite
flickering on an emulator: That emulator is accurate, it's the game that's
poorly written.
And guess what, this function doesn't even contain everything you'd
associate with per-frame player behavior. While it does
handle Yin-Yang Orb repulsion as part of slides and special attacks, it does
not handle the actual player/Orb collision that results in lives being lost.
The funny thing about this: These two things are done in the same function…
Therefore, the life loss animation is also part of another function. This is
where we find the final glitch in this 3-push series: Before the 16-frame
shake, this function only unblits a 32×32 area around Reimu's center point,
even though it's possible to lose a life during the non-deflecting part of a
48×48-pixel animation. In that case, the extra pixels will just stay on
screen during the shake. They are unblitted afterwards though, which
suggests that ZUN was at least somewhat aware of the issue?
Finally, the chance to see the alternate life loss sprite is exactly ⅛.
As for any new insights into game mechanics… you know what? I'm just not
going to write anything, and leave you with this flowchart instead. Here's
the definitive guide on how to control Reimu in TH01 we've been waiting for
24 years:
Pellets are deflected during all gray
states. Not shown is the obvious "double-tap Z and X" transition from
all non-(#1) states to the Bomb state, but that would have made this
diagram even more unwieldy than it turned out. And yes, you can shoot
twice as fast while moving left or right.
While I'm at it, here are two more animations from MIKO.PTN
which aren't referenced by any code:
With that monster of a function taken care of, we've only got boss sprite animation as the final blocker of uninterrupted Sariel progress. Due to some unfavorable code layout in the Mima segment though, I'll need to spend a bit more time with some of the features used there. Next up: The missile bullets used in the Mima and YuugenMagan fights.
Nothing really noteworthy in TH01's stage timer code, just yet another HUD
element that is needlessly drawn into VRAM. Sure, ZUN applies his custom
boldfacing effect on top of the glyphs retrieved from font ROM, but he could
have easily installed those modified glyphs as gaiji.
Well, OK, halfwidth gaiji aren't exactly well documented, and sometimes not
even correctly emulated
📝 due to the same PC-98 hardware oddity I was researching last month.
I've reserved two of the pending anonymous "anything" pushes for the
conclusion of this research, just in case you were wondering why the
outstanding workload is now lower after the two delivered here.
And since it doesn't seem to be clearly documented elsewhere: Every 2 ticks
on the stage timer correspond to 4 frames.
So, TH01 rank pellet speed. The resident pellet speed value is a
factor ranging from a minimum of -0.375 up to a maximum of 0.5 (pixels per
frame), multiplied with the difficulty-adjusted base speed for each pellet
and added on top of that same speed. This multiplier is modified
every time the stage timer reaches 0 and
HARRY UP is shown (+0.05)
for every score-based extra life granted below the maximum number of
lives (+0.025)
every time a bomb is used (+0.025)
on every frame in which the rand value (shown in debug
mode) is evenly divisible by
(1800 - (lives × 200) - (bombs × 50)) (+0.025)
every time Reimu got hit (set to 0 if higher, then -0.05)
when using a continue (set to -0.05 if higher, then -0.125)
Apparently, ZUN noted that these deltas couldn't be losslessly stored in an
IEEE 754 floating-point variable, and therefore didn't store the pellet
speed factor exactly in a way that would correspond to its gameplay effect.
Instead, it's stored similar to Q12.4 subpixels: as a simple integer,
pre-multiplied by 40. This results in a raw range of -15 to 20, which is
what the undecompiled ASM calls still use. When spawning a new pellet, its
base speed is first multiplied by that factor, and then divided by 40 again.
This is actually quite smart: The calculation doesn't need to be aware of
either Q12.4 or the 40× format, as
((Q12.4 * factor×40) / factor×40) still comes out as a
Q12.4 subpixel even if all numbers are integers. The only limiting issue
here would be the potential overflow of the 16-bit multiplication at
unadjusted base speeds of more than 50 pixels per frame, but that'd be
seriously unplayable.
So yeah, pellet speed modifications are indeed gradual, and don't just fall
into the coarse three "high, normal, and low" categories.
That's ⅝ of P0160 done, and the continue and pause menus would make good
candidates to fill up the remaining ⅜… except that it seemed impossible to
figure out the correct compiler options for this code?
The issues centered around the two effects of Turbo C++ 4.0J's
-O switch:
Optimizing jump instructions: merging duplicate successive jumps into a
single one, and merging duplicated instructions at the end of conditional
branches into a single place under a single branch, which the other branches
then jump to
Compressing ADD SP and POP CX
stack-clearing instructions after multiple successive CALLs to
__cdecl functions into a single ADD SP with the
combined parameter stack size of all function calls
But how can the ASM for these functions exhibit #1 but not #2? How
can it be seemingly optimized and unoptimized at the same time? The
only option that gets somewhat close would be -O- -y, which
emits line number information into the .OBJ files for debugging. This
combination provides its own kind of #1, but these functions clearly need
the real deal.
The research into this issue ended up consuming a full push on its own.
In the end, this solution turned out to be completely unrelated to compiler
options, and instead came from the effects of a compiler bug in a totally
different place. Initializing a local structure instance or array like
const uint4_t flash_colors[3] = { 3, 4, 5 };
always emits the { 3, 4, 5 } array into the program's data
segment, and then generates a call to the internal SCOPY@
function which copies this data array to the local variable on the stack.
And as soon as this SCOPY@ call is emitted, the -O
optimization #1 is disabled for the entire rest of the translation
unit?!
So, any code segment with an SCOPY@ call followed by
__cdecl functions must strictly be decompiled from top to
bottom, mirroring the original layout of translation units. That means no
TH01 continue and pause menus before we haven't decompiled the bomb
animation, which contains such an SCOPY@ call. 😕
Luckily, TH01 is the only game where this bug leads to significant
restrictions in decompilation order, as later games predominantly use the
pascal calling convention, in which each function itself clears
its stack as part of its RET instruction.
What now, then? With 51% of REIIDEN.EXE decompiled, we're
slowly running out of small features that can be decompiled within ⅜ of a
push. Good that I haven't been looking a lot into OP.EXE and
FUUIN.EXE, which pretty much only got easy pieces of
code left to do. Maybe I'll end up finishing their decompilations entirely
within these smaller gaps? I still ended up finding one more small
piece in REIIDEN.EXE though: The particle system, seen in the
Mima fight.
I like how everything about this animation is contained within a single
function that is called once per frame, but ZUN could have really
consolidated the spawning code for new particles a bit. In Mima's fight,
particles are only spawned from the top and right edges of the screen, but
the function in fact contains unused code for all other 7 possible
directions, written in quite a bloated manner. This wouldn't feel quite as
unused if ZUN had used an angle parameter instead…
Also, why unnecessarily waste another 40 bytes of
the BSS segment?
But wait, what's going on with the very first spawned particle that just
stops near the bottom edge of the screen in the video above? Well, even in
such a simple and self-contained function, ZUN managed to include an
off-by-one error. This one then results in an out-of-bounds array access on
the 80th frame, where the code attempts to spawn a 41st
particle. If the first particle was unlucky to be both slow enough and
spawned away far enough from the bottom and right edges, the spawning code
will then kill it off before its unblitting code gets to run, leaving its
pixel on the screen until something else overlaps it and causes it to be
unblitted.
Which, during regular gameplay, will quickly happen with the Orb, all the
pellets flying around, and your own player movement. Also, the RNG can
easily spawn this particle at a position and velocity that causes it to
leave the screen more quickly. Kind of impressive how ZUN laid out the
structure
of arrays in a way that ensured practically no effect of this bug on the
game; this glitch could have easily happened every 80 frames instead.
He almost got close to all bugs canceling out each other here!
Next up: The player control functions, including the second-biggest function
in all of PC-98 Touhou.
Of course, Sariel's potentially bloated and copy-pasted code is blocked by
even more definitely bloated and copy-pasted code. It's TH01, what did you
expect?
But even then, TH01's item code is on a new level of software architecture
ridiculousness. First, ZUN uses distinct arrays for both types of items,
with their own caps of 4 for bomb items, and 10 for point items. Since that
obviously makes any type-related switch statement redundant,
he also used distinct functions for both types, with copy-pasted
boilerplate code. The main per-item update and render function is
shared though… and takes every single accessed member of the item
structure as its own reference parameter. Like, why, you have a
structure, right there?! That's one way to really practice the C++ language
concept of passing arbitrary structure fields by mutable reference…
To complete the unwarranted grand generic design of this function, it calls
back into per-type collision detection, drop, and collect functions with
another three reference parameters. Yeah, why use C++ virtual methods when
you can also implement the effectively same polymorphism functionality by
hand? Oh, and the coordinate clamping code in one of these callbacks could
only possibly have come from nested min() and
max() preprocessor macros. And that's how you extend such
dead-simple functionality to 1¼ pushes…
Amidst all this jank, we've at least got a sensible item↔player hitbox this
time, with 24 pixels around Reimu's center point to the left and right, and
extending from 24 pixels above Reimu down to the bottom of the playfield.
It absolutely didn't look like that from the initial naive decompilation
though. Changing entity coordinates from left/top to center was one of the
better lessons from TH01 that ZUN implemented in later games, it really
makes collision detection code much more intuitive to grasp.
The card flip code is where we find out some slightly more interesting
aspects about item drops in this game, and how they're controlled by a
hidden cycle variable:
At the beginning of every 5-stage scene, this variable is set to a
random value in the [0..59] range
Point items are dropped at every multiple of 10
Every card flip adds 1 to its value after this mod 10
check
At a value of 140, the point item is replaced with a bomb item, but only
if no damaging bomb is active. In any case, its value is then reset to
1.
Then again, score players largely ignore point items anyway, as card
combos simply have a much bigger effect on the score. With this, I should
have RE'd all information necessary to construct a tool-assisted score run,
though? Edit: Turns out that 1) point items are becoming
increasingly important in score runs, and 2) Pearl already did a TAS some
months ago. Thanks to
spaztron64 for the info!
The Orb↔card hitbox also makes perfect sense, with 24 pixels around
the center point of a card in every direction.
The rest of the code confirms the
card
flip score formula documented on Touhou Wiki, as well as the way cards
are flipped by bombs: During every of the 90 "damaging" frames of the
140-frame bomb animation, there is a 75% chance to flip the card at the
[bomb_frame % total_card_count_in_stage] array index. Since
stages can only have up to 50 cards
📝 thanks to a bug, even a 75% chance is high
enough to typically flip most cards during a bomb. Each of these flips
still only removes a single card HP, just like after a regular collision
with the Orb.
Also, why are the card score popups rendered before the cards
themselves? That's two needless frames of flicker during that 25-frame
animation. Not all too noticeable, but still.
And that's over 50% of REIIDEN.EXE decompiled as well! Next
up: More HUD update and rendering code… with a direct dependency on
rank pellet speed modifications?
Yup, there still are features that can be fully covered in a single push
and don't lead to sprawling blog posts. The giant
STAGE number and
HARRY UP messages, as well as the
flashing transparent 東方★靈異伝 at the beginning of each scene are drawn
by retrieving the glyphs for each letter from font ROM, and then "blitting"
them to text RAM by placing a colored fullwidth 16×16 square at every pixel
that is set in the font bitmap.
And 📝 once again, ZUN's code there matches
the mediocre example code for the related hardware interrupt from the
PC-9801 Programmers' Bible. It's not 100% copied this time, but
definitely inspired by the code on page 121. Therefore, we can conclude
that these letters are probably only displayed as these 16× scaled glyphs
because that book had code on how to achieve this effect.
ZUN "improved" on the example code by implementing a write-only cursor over
the entire text RAM that fills every 16×16 cell with a differently colored
space character, fully clearing the text RAM as a side effect. For once, he
even removed some redundancy here by using helper functions! It's all still
far from good-code though. For example, there's a
function for filling 5 rows worth of cells, which he uses for both the top
and bottom margin of these letters. But since the bottom margin starts at
the 22nd line, the code writes past the 25th line and into the second TRAM
page. Good that this page is not used by either the hardware or the game.
These cursor functions can actually write any fullwidth JIS code point to
text RAM… and seem to do that in a rather simplified way, because shouldn't
you set the most significant bit to indicate the right half of a fullwidth
character? That's what's written in the same book that ZUN copied all
functions out of, after all. 🤔 Researching this led me down quite the
rabbit hole, where I found an oddity in PC-98 text RAM rendering that no
single one of the widely-used PC-98 emulators gets completely right. I'm
almost done with the 2-push