ReC98

⮜ Blog

⮜ List of tags

Showing all posts tagged

📝 Posted:: 2024-07-09 11:30 UTC
🚚 Summary of:: P0002, P0003, P0004, P0281, P0282, P0283, P0284, P0285
⌨ Commits:: 87eed57...f131878, f131878...d60fb3e, b86a2b1...62bd5b1, (MS-DOS Player) 07d1088...P0281, d60fb3e...056085b, 056085b...b86a2b1, 62bd5b1...18b9cd7, 18b9cd7...23fc9e7
💰 Funded by:: GhostPhanom, [Anonymous], Blue Bolt, iruleatgames
🏷 Tags:

I'm 13 days late, but 🎉 ReC98 is now 10 years old! 🎉 On June 26, 2014, I first tried exporting IDA's disassembly of TH05's OP.EXE and reassembling and linking the resulting file back into a binary, and was amazed that it actually yielded an identical binary. Now, this doesn't actually mean that I've spent 10 years working on this project; priorities have been shifting and continue to shift, and time-consuming mistakes were certainly made. Still, it's a good occasion to finally fully realize the good future for ReC98 that GhostPhanom invested in with the very first financial contribution back in 2018, deliver the last three of the first four reserved pushes, cross another piece of time-consuming maintenance off the list, and prepare the build process for hopefully the next 10 years.
But why did it take 8 pushes and over two months to restore feature parity with the old system? 🥲

The previous build system(s)
Migrating the 16-bit build part to Tup
Optimizing MS-DOS Player
Continued support for building on 32-bit Windows
The new tier list of supported build platforms
Cleaning up #include lists
TH02's High Score menu

The original plan for ReC98's good future was quite different from what I ended up shipping here. Before I started writing the code for this website in August 2019, I focused on feature-completing the experimental 16-bit DOS build system for Borland compilers that I'd been developing since 2018, and which would form the foundation of my internal development work in the following years. Eventually, I wanted to polish and publicly release this system as soon as people stopped throwing money at me. But as of November 2019, just one month after launch, the store kept selling out with everyone investing into all the flashier goals, so that release never happened.

In theory, this build system remains the optimal way of developing with old Borland compilers on a real PC-98 (or any other 32-bit single-core system) and outside of Borland's IDE, even after the changes introduced by this delivery. In practice though, you're soon going to realize that there are lots of issues I'd have to revisit in case any PC-98 homebrew developers are interested in funding me to finish and release this tool…

The main idea behind the system still has its charm: Your build script is a regular C++ program that #includes the build system as a static library and passes fixed structures with names of source files and build flags. By employing static constructors, even a 1994 Turbo C++ would let you define the whole build at compile time, although this certainly requires some dank preprocessor magic to remain anywhere near readable at ReC98 scale. 🪄 While this system does require a bootstrapping process, the resulting binary can then use the same dependency-checking mechanisms to recompile and overwrite itself if you change the C++ build code later. Since DOS just simply loads an entire binary into RAM before executing it, there is no lock to worry about, and overwriting the originating binary is something you can just do.
Later on, the system also made use of batched compilation: By passing more than one source file to TCC.EXE, you get to avoid TCC's quite noticeable startup times, thus speeding up the build proportional to the number of translation units in each batch. Of course, this requires that every passed source file is supposed to be compiled with the same set of command-line flags, but that's a generally good complexity-reducing guideline to follow in a build script. I went even further and enforced this guideline in the system itself, thus truly making per-file compiler command line switches considered harmful. Thanks to Turbo C++'s #pragma option, changing the command line isn't even necessary for the few unfortunate cases where parts of ZUN's code were compiled with inconsistent flags.
I combined all these ideas with a general approach of "targeting DOSBox": By maximizing DOS syscalls and minimizing algorithms and data structures, we spend as much time as possible in DOSBox's native-code DOS implementation, which should give us a performance advantage over DOS-native implementations of MAKE that typically follow the opposite approach.

Of course, all this only matters if the system is correct and reliable at its core. Tup teaches us that it's fundamentally impossible to have a reliable generic build system without

augmenting the build graph with all actual files read and written by each invoked build tool, which involves tracing all file-related syscalls, and
persistently serializing the full build graph every time the system runs, allowing later runs to detect every possible kind of change in the build script and rebuild or clean up accordingly.

Unfortunately, the design limitations of my system only allowed half-baked attempts at solving both of these prerequisites:

If your build system is not supposed to be generic and only intended to work with specific tools that emit reliable dependency information, you can replace syscall tracing with a parser for those specific formats. This is what my build system was doing, reading dependency information out of each .OBJ file's OMF COMENT record.
Since DOS command lines are limited to 127 bytes, DOS compilers support reading additional arguments from response files, typically indicated with an @ next to their path on the command line. If we now put every parameter passed to TCC or TLINK into a response file and leave these files on disk afterward, we've effectively serialized all command-line arguments of the entire build into a makeshift database. In later builds, the system can then detect changed command-line arguments by comparing the existing response files from the previous run with the new contents it would write based on the current build structures. This way, we still only recompile the parts of the codebase that are affected by the changed arguments, which is fundamentally impossible with Makefiles.

But this strategy only covers changes within each binary's compile or link arguments, and ignores the required deletions in "the database" when removing binaries between build runs. This is a non-issue as long as we keep decompiling on master, but as soon as we switch between master and similarly old commits on the debloated/anniversary branches, we can get very confusing errors:

Screenshot of a seemingly weird error in my 16-bit build system that complains about TH01's vector functions being undefined when linking REIIDEN.EXE, shown when switching between the `anniversary` and `master` branches. — The symptom is a calling convention mismatch: The two vector functions use `__cdecl` on `master` and `pascal` on `debloated`/`anniversary`. We've switched from `anniversary` (which compiles to `ANNIV.EXE`) back to `master` (which compiles to `REIIDEN.EXE`) here, so the .obj file on disk still uses the `pascal` calling convention. The build system, however, only checks the response files associated with the current target binary (`REIIDEN.EXE`) and therefore assumes that the .obj files still reflect the (unchanged) command-line flags in the TCC response file associated with this binary. And if none of the inputs of these .obj files changed between the two branches, they aren't rebuilt after switching, even though they would need to be.

Apparently, there's also such a thing as "too much batching", because TCC would suddenly stop applying certain compiler optimizations at very specific places if too many files were compiled within a single process? At least you quickly remember which source files you then need to manually touch and recompile to make the binaries match ZUN's original ones again…

But the final nail in the coffin was something I'd notice on every single build: 5 years down the line, even the performance argument wasn't convincing anymore. The strategy of minimizing emulated code still left me with an 𝑂(𝑛) algorithm, and with this entire thing still being single-threaded, there was no force to counteract the dependency check times as they grew linearly with the number of source files.
At P0280, each build run would perform a total of 28,130 file-related DOS syscalls to figure out which source files have changed and need to be rebuilt. At some point, this was bound to become noticeable even despite these syscalls being native, not to mention that they're still surrounded by emulator code that must convert their parameters and results to and from the DOS ABI. And with the increasing delays before TCC would do its actual work, the entire thing started feeling increasingly jankier.

While this system was waiting to be eventually finished, the public master branch kept using the Makefile that dates back to early 2015. Back then, it didn't take long for me to abandon raw dumb batch files because Make was simply the most straightforward way of ensuring that the build process would abort on the first compile error.
The following years also proved that Makefile syntax is quite well-suited for expressing the build rules of a codebase at this scale. The built-in support for automatically turning long commands into response files was especially helpful because of how naturally it works together with batched compilation. Both of these advantages culminate in this wonderfully arcane incantation of ASCII special characters and syntactically significant linebreaks:

tcc … @&&|
$**
|

Which translates to "take the filenames of all dependents of this explicit rule, write them into a temporary file with an autogenerated name, insert this filename into the tcc … @ command line, and delete the file after the command finished executing". The @ is part of TCC's command-line interface, the rest is all MAKE syntax.

But 📝 as we all know by now, these surface-level niceties change nothing about Makefiles inherently being unreliable trash due to implementing none of the aforementioned two essential properties of a generic build system. Borland got so close to a correct and reliable implementation of autodependencies, but that would have just covered one of the two properties. Due to this unreliability, the old build16b.bat called Borland's MAKER.EXE with the -B flag, recompiling everything all the time. Not only did this leave modders with a much worse build process than I was using internally, but it also eventually got old for me to merge my internal branch onto master before every delivery. Let's finally rectify that and work towards a single good build process for everyone.

As you would expect by now, I've once again migrated to Tup's Lua syntax. Rewriting it all makes you realize once again how complex the PC-98 Touhou build process is: It has to cover 2 programming languages, 2 pipeline steps, and 3 third-party libraries, and currently generates a total of 39 executables, including the small programs I wrote for research. The final Lua code comprises over 1,300 lines – but then again, if I had written it in 📝 Zig, it would certainly be as long or even longer due to manual memory management. The Tup building blocks I constructed for Shuusou Gyoku quickly turned out to be the wrong abstraction for a project that has no debug builds, but their 📝 basic idea of a branching tree of command-line options remained at the foundation of this script as well.
This rewrite also provided an excellent opportunity for finally dumping all the intermediate compilation outputs into a separate dedicated obj/ subdirectory, finally leaving bin/ nice and clean with only the final executables. I've also merged this new system into most of the public branches of the GitHub repo.

As soon as I first tried to build it all though, I was greeted with a particularly nasty Tup bug. Due to how DOS specified file metadata mutation, MS-DOS Player has to open every file in a way that current Tup treats as a write access… but since unannotated file writes introduce the risk of a malformed build graph if these files are read by another build command later on, Tup providently deletes these files after the command finished executing. And by these files, I mean TCC.EXE as well as every one of its C library header files opened during compilation.
Due to a minor unsolved question about a failing test case, my fix has not been merged yet. But even if it was, we're now faced with a problem: If you previously chose to set up Tup for ReC98 or 📝 Shuusou Gyoku and are maybe still running 📝 my 32-bit build from September 2020, running the new build.bat would in fact delete the most important files of your Turbo C++ 4.0J installation, forcing you to reinstall it or restore it from a backup. So what do we do?

Should my custom build get a special version number so that the surrounding batch file can fail if the version number of your installed Tup is lower?
Or do I just put a message somewhere, which some people invariably won't read?

The easiest solution, however, is to just put a fixed Tup binary directly into the ReC98 repo. This not only allows me to make Tup mandatory for 64-bit builds, but also cuts out one step in the build environment setup that at least one person previously complained about. *nix users might not like this idea all too much (or do they?), but then again, TASM32 and the Windows-exclusive MS-DOS Player require Wine anyway. Running Tup through Wine as well means that there's only one PATH to worry about, and you get to take advantage of the tool checks in the surrounding batch file.
If you're one of those people who doesn't trust binaries in Git repos, the repo also links to instructions for building this binary yourself. Replicating this specific optimized binary is slightly more involved than the classic ./configure && make && make install trinity, so having these instructions is a good idea regardless of the fact that Tup's GPL license requires it.

One particularly interesting aspect of the Lua code is the way it handles sprite dependencies:

th04:branch(MODEL_LARGE):link("main", {
	{ "th04_main.asm", extra_inputs = {
		th02_sprites["pellet"],
		th02_sprites["sparks"],
		th04_sprites["pelletbt"],
		th04_sprites["pointnum"],
	} },
	-- …
}

If build commands read from files that were created by other build commands, Tup requires these input dependencies to be spelled out so that it can arrange the build graph and parallelize the build correctly. We could simply put every sprite into a single array and automatically pass that as an extra input to every source file, but that would effectively split the build into a "sprite convert" and "code compile" phase. Spelling out every individual dependency allows such source files to be compiled as soon as possible, before (and in parallel to) the rest of the sprites they don't depend on. Similarly, code files without sprite dependencies can compile before the first sprite got converted, or even before the sprite converter itself got compiled and linked, maximizing the throughput of the overall build process.

Running a 30-year-old DOS toolchain in a parallel build system also introduces new issues, though. The easiest and recommended way of compiling and linking a program in Turbo C++ is a single tcc invocation:

tcc … main.cpp utils.cpp master.lib

This performs a batched compilation of main.cpp and utils.cpp within a single TCC process, and then launches TLINK to link the resulting .obj files into main.exe, together with the C++ runtime library and any needed objects from master.lib. The linking step works by TCC generating a TLINK command line and writing it into a response file with the fixed name turboc.$ln… which obviously can't work in a parallel build where multiple TCC processes will want to link different executables via the same response file.
Therefore, we have to launch TLINK with a custom response file ourselves. This file is echo'd as a separate parallel build rule, and the Lua code that constructs its contents has to replicate TCC's logic for picking the correct C++ runtime .lib file for the selected memory model.

	-c -s -t c0t.obj obj\th02\zun_res1.obj obj\th02\zun_res2.obj, bin\th02\zun_res.com, obj\th02\zun_res.map, bin\masters.lib emu.lib maths.lib ct.lib

The response file for TH02's ZUN_RES.COM, consisting of the C++ standard library, two files of ZUN code, and master.lib.

While this does add more string formatting logic, not relying on TCC to launch TLINK actually removes the one possible PATH-related error case I previously documented in the README. Back in 2021 when I first stumbled over the issue, it took a few hours of RE to figure this out. I don't like these hours to go to waste, so here's a Gist, and here's the text replicated for SEO reasons:

Issue: TCC compiles, but fails to link, with Unable to execute command 'tlink.exe'

Cause: This happens when invoking TCC as a compiler+linker, without the -c flag. To locate TLINK, TCC needlessly copies the PATH environment variable into a statically allocated 128-byte buffer. It then constructs absolute tlink.exe filenames for each of the semicolon- or \0-terminated paths, writing these into a buffer that immediately follows the 128-byte PATH buffer in memory. The search is finished as soon as TCC finds an existing file, which gives precedence to earlier paths in the PATH. If the search didn't complete until a potential "final" path that runs past the 128 bytes, the final attempted filename will consist of the part that still managed to fit into the buffer, followed by the previously attempted path.

Workaround: Make sure that the BIN\ path to Turbo C++ is fully contained within the first 127 bytes of the PATH inside your DOS system. (The 128^th byte must either be a separating ; or the terminating \0 of the PATH string.)

Now that DOS emulation is an integral component of the single-part build process, it even makes sense to compile our pipeline tools as 16-bit DOS executables and then emulate them as part of the build. Sure, it's technically slower, but realistically it doesn't matter: Our only current pipeline tools are 📝 the converter for hardcoded sprites and the 📝 ZUN.COM generators, both of which involve very little code and are rarely run during regular development after the initial full build. In return, we get to drop that awkward dependency on the separate Borland C++ 5.5 compiler for Windows and yet another additional manual setup step. 🗑️ Once PC-98 Touhou becomes portable, we're probably going to require a modern compiler anyway, so you can now delete that one as well.

That gives us perfect dependency tracking and minimal parallel rebuilds across the whole codebase! While MS-DOS Player is noticeably slower than DOSBox-X, it's not going to matter all too much; unless you change one of the more central header files, you're rarely if ever going to cause a full rebuild. Then again, given that I'm going to use this setup for at least a couple of years, it's worth taking a closer look at why exactly the compilation performance is so underwhelming …

On the surface, MS-DOS Player seems like the right tool for our job, with a lot of advantages over DOSBox:

It doesn't spawn a window that boots an entire emulated PC, but is instead
perfectly integrated into the Windows console. Using it in a modern developer console would allow you to click on a compile error and have your editor immediately open the relevant file and jump to that specific line! With DOSBox, this basic comfort feature was previously unthinkable.
Heck, Takeda Toshiya originally developed it to run the equally vintage LSI C-86 compiler on 64-bit Windows. Fixing any potential issues we'd run into would be well within the scope of the project.
It consists of just a single comparatively small binary that we could just drop into the ReC98 repo. No manual setup steps required.

But once I began integrating it, I quickly noticed two glaring flaws:

Back in 2009, Takeda Toshiya chose to start the project by writing a custom DOS implementation from scratch. He was aware of DOSBox, but only adapted small tricky parts of its source code rather than starting with the DOSBox codebase and ripping out everything he didn't need. This matches the more research-oriented nature that all of his projects appear to follow, where the primary goal of writing the code is a personal understanding of the problem domain rather than a widely usable piece of software. MS-DOS Player is even the outlier in this regard, with Takeda Toshiya describing it as 珍しく実用的かもしれません. I am definitely sympathetic to this mindset; heck, my old internal build system falls under this category too, being so specialized and narrow that it made little sense to use it outside of ReC98. But when you apply it to emulators for niche systems, you end up with exactly the current PC-98 emulation scene, where there's no single universally good emulator because all of them have some inaccuracy somewhere. This scene is too small for you not to eventually become part of someone else's supply chain… 🥲
Emulating DOS is a particularly poor fit for a research/NIH project because it's Hyrum's Law incarnate. With the lack of memory protection in Real Mode, programs could freely access internal DOS (and even BIOS) data structures if they only knew where to look, and frequently did. It might look as if "DOS command-line tools" just equals x86 plus INT 21h, but soon you'll also be emulating the BIOS, PIC, PIT, EMS, XMS, and probably a few more things, all with their individual quirks that some application out there relies on. DOSBox simply had much more time to grow and mature and figure out all of these details by trial and error. If you start a DOS emulator from scratch, you're bound to duplicate all this research as people want to use your emulator to run more and more programs, until you've ended up with what's effectively a clone of DOSBox's exact logic. Unless, of course, if you draw a line somewhere and limit the scope of the DOS and BIOS emulation. But given how many people have wanted to use MS-DOS Player for running DOS TUIs in arbitrarily sized terminal windows with arbitrary fonts, that's not what happened. I guess it made sense for this use case before DOSBox-X gained a TTF output mode in late 2020?
As usual, I wouldn't mention this if I didn't run into two bugs when combining MS-DOS Player with Turbo C++ and Tup. Both of these originated from workarounds for inaccuracies in the DOS emulation that date back to MS-DOS Player's initial release and were thankfully no longer necessary with the accuracy improvements implemented in the years since.
For CPU emulation, MS-DOS Player can use either MAME's or Neko Project 21/W's x86 core, both of which are interpreters and won't win any performance contests. The NP21/W core is significantly better optimized and runs ≈41% faster, but still pales in comparison to DOSBox-X's dynamic recompiler. Running the same sequential commands that the P0280 Makefile would execute, the upstream 2024-03-02 NP21/W core build of MS-DOS Player would take 128.509s to compile the entire ReC98 codebase on my system, whereas DOSBox-X's dynamic core manages the same in 66.202s, or 94% faster.

Granted, even the DOSBox-X performance is much slower than we would like it to be. Most of it can be blamed on the awkward time in the early-to-mid-90s when Turbo C++ 4.0J came out. This was the time when DOS applications had long grown past the limitations of the x86 Real Mode and required DOS extenders or even sillier hacks to actually use all the RAM in a typical system of that period, but Win32 didn't exist yet to put developers out of this misery. As such, this compiler not only requires at least a 386 CPU, but also brings its own DOS extender (DPMI16BI.OVL) plus a loader for said extender (RTM.EXE), both of which need to be emulated alongside the compiler, to the great annoyance of emulator maintainers 30 years later. Even MS-DOS Player's README file notes how Protected Mode adds a lot of complexity and slowdown:

8086 binaries are much faster than 80286/80386/80486/Pentium4/IA32 binaries. If you don't need the protected mode or new mnemonics added after 80286, I recommend i86_x86 or i86_x64 binary.

The immediate reaction to these performance numbers is obvious: Let's just put DOSBox-X's dynamic recompiler into MS-DOS Player, right?! 🙌 Except that once you look at DOSBox-X, you immediately get why Takeda Toshiya might have preferred to start from scratch. Its codebase is a historically grown tangled mess, requiring intimate familiarity and a significant engineering effort to isolate the dynamic core in the first place. I did spend a few days trying to untangle and copy it all over into MS-DOS Player… only to be greeted with an infinite loop as soon as everything compiled for the first time. 😶 Yeah, no, that's bound to turn into a budget-exceeding maintenance nightmare.

Instead, let's look at squeezing at least some additional performance out of what we already have. A generic emulator for the entire CISCy instruction set of the 80386, with complete support for Protected Mode, but it's only supposed to run the subset of instructions and features used by a specific compiler and linker as fast as possible… wait a moment, that sounds like a use case for profile-guided optimization! This is the first time I've encountered a situation that would justify the required 2-phase build process and lengthy profile collection – after all, writing into some sort of database for every function call does slow down MS-DOS Player by roughly 15×. However, profiling just the compilation of our most complex translation unit (📝 TH01 YuugenMagan) and the linking of our largest executable (TH01's REIIDEN.EXE) should be representative enough.
I'll get to the performance numbers later, but even the build output is quite intriguing. Based on this profile, Visual Studio chooses to optimize only 104 out of MS-DOS Player's 1976 functions for speed and the rest for size, shaving off a nice 109 KiB from the binary. Presumably, keeping rare code small is also considered kind of fast these days because it takes up less space in your CPU's instruction cache once it does get executed?

With PGO as our foundation, let's run a performance profile and see if there are any further code-level optimizations worth trying out:

Removing redundant memset() calls: MS-DOS Player is written in a very C-like style of C++, and initializes a bunch of its statically allocated data by memset()ing it with 00 bytes at startup. This is strictly redundant even in C; Section 6.7.9/10 of the C standard mandates that all static data is zero-initialized by default. In turn, the program loaders of modern operating systems employ all sorts of paging tricks to reduce the CPU cost (and actual RAM usage!) of this initialization as much as possible. If you manually memset() afterward, you throw all these advantages out of the window.
Of course, these calls would only ever show up among the top CPU consumers in a performance profile if a program uses a large amount of static data, but the hardcoded 32 MiB of emulated RAM in ≥i386-supporting builds definitely qualifies. Zeroing 32.8 MiB of memory makes up a significant chunk of the runtime of some of the shorter build steps and quickly adds up; a full rebuild of the ReC98 codebase currently spawns a total of 361 MS-DOS Player instances, totaling 11.5 GiB of needless memory writes.
Limiting the emulated instruction set: NP21/W's x86 core emulates everything up to the SSE3 extension from 2004, but Turbo C++ 4.0J's x86 instruction set usage doesn't stretch past the 386. It doesn't even need the x87 FPU for compiling code that involves floating-point constants. Disabling all these unneeded extensions speeds up x86's infamously annoying instruction decoding, and also reduces the size of the MS-DOS Player binary by another 149.5 KiB. The source code already had macros for this purpose, and only needed a slight fix for the code to compile with these macros disabled.
Removing x86 paging: Borland's DOS extender uses segmented memory addressing even in Protected Mode. This allows us to remove the MMU emulation and the corresponding "are we paging" check for every memory access.
Removing cycle counting: When emulating a whole system, counting the cycles of each instruction is important for accurately synchronizing the CPU with other pieces of hardware. As hinted above, MS-DOS Player does emulate and periodically update a few pieces of hardware outside the CPU, but we need none of them for a build tool.
Testing Takeda Toshiya's optimizations: In a nice turn of events, Takeda Toshiya merged every single one of my bugfixes and optimization flags into his upstream codebase. He even agreed with my memset() and cycle counting removal optimizations, which are now part of all upstream builds as of 2024-06-24. For the 2024-06-27 build, he claims to have gone even further than my more minimal optimization, so let's see how these additional changes affect our build process.
Further risky optimizations: A lot of the remaining slowness of x86 emulation comes from the segmentation and protection fault checks required for every memory access. If we assume that the emulator only ever executes correct code, we can remove these checks and implement further shortcuts based on their absence.
The L[DEFGS]S group of instructions that load a segment and offset register from a 32-bit far pointer, for example, are both frequently used in Turbo C++ 4.0J code and particularly expensive to emulate. Intel specified their Real Mode operation as loading the segment and offset part in two separate 16-bit reads. But if we assume that neither of those reads can fault, we can compress them into a single 32-bit read and thus only perform the costly address translation once rather than twice. Emulator authors are probably rolling their eyes at this gross violation of Intel documentation now, but it's at least worth a try to see just how much performance we could get out of it.

So, what do we get?

MS-DOS Player build	Full build (Pipeline + 5 games + research code)		Median translation unit + median link		📝 YuugenMagan compile + link
MS-DOS Player build	Generic	PGO	Generic	PGO	Generic	PGO
MAME x86 core	46.522s / 50.854s	32.162s / 34.885s	1.346s / 1.429s	0.966s / 0.963s	6.975s / 7.155s	4.024s / 3.981s
NP21/W core, before optimizations	34.620s / 36.151s	30.218s / 31.318s	1.031s / 1.065s	0.885s / 0.916s	5.294s / 5.330s	4.260s / 4.299s
No initial `memset()`	31.886s / 34.398s	27.151s / 29.184s	0.945s / 1.009s	0.802s / 0.852s	5.094s / 5.266s	4.104s / 4.190s
Limited instructions	32.404s / 34.276s	26.602s / 27.833s	0.963s / 1.001s	0.783s / 0.819s	5.086s / 5.182s	3.886s / 3.987s
No paging	29.836s / 31.646s	25.124s / 26.356s	0.865s / 0.918s	0.748s / 0.769s	4.611s / 4.717s	3.500s / 3.572s
No cycle counting	25.407s / 26.691s	21.461s / 22.599s	0.735s / 0.752s	0.617s / 0.625s	3.747s / 3.868s	2.873s / 2.979s
2024-06-27 build	26.297s / 27.629s	21.014s / 22.143s	0.771s / 0.779s	0.612s / 0.632s	4.372s / 4.506s	3.253s / 3.272s
Risky optimizations	23.168s / 24.193s	20.711s / 21.782s	0.658s / 0.663s	0.582s / 0.603s	3.269s / 3.414s	2.823s / 2.805s

Measured on a 6-year-old 6-core Intel Core i5 8400T on Windows 11. The first number in each column represents the codebase before the #include cleanup explained below, and the second one corresponds to this commit. All builds are 64-bit, 32-bit builds were ≈5% slower across the board. I kept the fastest run within three attempts; as Tup parallelizes the build process across all CPU cores, it's common for the long-running full build to take up to a few seconds longer depending on what else is running on your system. Tup's standard output is also redirected to a file here; its regular terminal output and nice progress bar will add more slowdown on top.

The key takeaways:

By merely disabling certain x86 features from MS-DOS Player and retaining the accuracy of the remaining emulation, we get speedups of ≈60% (full build), ≈70% (median TU), and ≈80% (largest TU).
≈25% (full build), ≈29% (median TU), and ≈41% (largest TU) of this speedup came from Visual Studio's profile-guided optimization, with no changes to the MS-DOS Player codebase.
The effects of removing cycle counting are the biggest surprise. Between ≈17% and ≈23%, just for removing one subtraction per emulated instruction? Turns out that in the absence of a "target cycle amount" setting, the x86 emulation loop previously ran for only a single cycle. This caused the PIC check to run after every instruction, followed by PIT, serial I/O, keyboard, mouse, and CRTC update code every millisecond. Without cycle counting, the x86 loop actually keeps running until a CPU exception is raised or the emulated process terminates, skipping the hardware code during the vast majority of the program's execution time.
While Takeda Toshiya's changes in the 2024-06-27 build completely throw out the cycle counter and clean up process termination, they also reintroduce the hardware updates that made up the majority of the cycle removal speedup. This explains the results we're getting: The small speedup for full rebuilds is too insignificant to bother with and might even fall within a statistical margin of error, but the build slows down more and more the longer the emulated process runs. Compiling and linking YuugenMagan takes a whole 14% longer on generic builds, and ≈9-12% longer on PGO builds. I did another in-between test that just removed the x86 loop from the cycle removal version, and got exactly the same numbers. This just goes to show how much removing two writes to a fixed memory address per emulated instruction actually matters. Let's not merge back this one, and stay on top of 2024-06-24 for the time being.
The risky optimizations of ignoring segment limits and speeding up 32-bit segment+offset pointer load instructions could yield a further speedup. However, most of these changes boil down to removing branches that would never be taken when emulating correct x86 code. Consequently, these branches get recorded as unlikely during PGO training, which then causes the profile-guided rebuild to rearrange the instructions on these branches in a way that favors the common case, leaving the rest of their effective removal to your CPU's branch predictor. As such, the 10%-15% speedup we can observe in generic builds collapses down to 2%-6% in PGO builds. At this rate and with these absolute durations, it's not worth it to maintain what's strictly a more inaccurate fork of Neko Project 21/W's x86 core.
The redundant header inclusions afforded by #include guards do in fact have a measurable performance cost on Turbo C++ 4.0J, slowing down compile times by 5%.

But how does this compare to DOSBox-X's dynamic core? Dynamic recompilers need some kind of cache to ensure that every block of original ASM gets recompiled only once, which gives them an advantage in long-running processes after the initial warmup. As a result, DOSBox-X compiles and links YuugenMagan in 1.548s, ≈92% faster than even our optimized MS-DOS Player build. That percentage resembles the slowdown we were initially getting when comparing full rebuilds between DOSBox-X and MS-DOS Player, as if we hadn't optimized anything.
On paper, this would mean that DOSBox-X barely lost any of its huge advantage when it comes to single-threaded compile+link performance. In practice, though, this metric is supposed to measure a typical decompilation or modding workflow that focuses on repeatedly editing a single file. Thus, a more appropriate comparison would also have to add the aforementioned constant 28,130 syscalls that my old build system required to detect that this is the one file/binary that needs to be recompiled/relinked. The video at the top of this blog post happens to capture the best time (1.313s) I got for the detection process on DOSBox-X. This is almost as slow as the compilation and linking itself, and would have only gotten slower as we continue decompiling the rest of the games. Tup, on the other hand, performs its filesystem scan in a near-constant 0.08s, matching the claim in Section 4.7 of its paper, and thus shrinking the performance difference to ≈14% after all. Sure, merging the dynamic core would have been even better (contribution-ideas, anyone?), but this is good enough for now.
Just like with Tup, I've also placed this optimized binary directly into the ReC98 repo and added the specific build instructions to the GitHub release page.

I do have more far-reaching ideas for further optimizing Neko Project 21/W's x86 core for this specific case of repeated switches between Real Mode and Protected Mode while still retaining the interpreted nature of this core, but these already strained the budget enough.
The perhaps more important remaining bottleneck, however, is hiding in the actual DOS emulation. Right now, a Tup-driven full rebuild spawns a total of 361 MS-DOS Player processes, which means that we're booting an emulated DOS 361 times. This isn't as bad as it sounds, as "booting DOS" basically just involves initializing a bunch of internal DOS structures in conventional memory to meaningful values. However, these structures also include a few environment variables like PATH, APPEND, or TEMP/TMP, which MS-DOS Player seamlessly integrates by translating them from their value on the Windows host system to the DOS 8.3 format. This could be one of the main reasons why MS-DOS Player is a native Windows program rather than being cross-platform:

On Windows, this path translation is as simple as calling GetShortPathNameA(), which returns a unique 8.3 name for every component along the path.
Also, drive letters are an integral part of the DOS INT 21h API, and Windows still uses them as well.

However, the NT kernel doesn't actually use drive letters either, and views them as just a legacy abstraction over its reality of volume GUIDs. Converting paths back and forth between these two views therefore requires it to communicate with a mount point manager service, which can coincidentally also be observed in debug builds of Tup.
As a result, calling any path-retrieving API is a surprisingly expensive operation on modern Windows. When running a small sprite through our 📝 sprite converter, MS-DOS Player's boot process makes up 56% of the runtime, with 64% of that boot time (or 36% of the entire runtime) being spent on path translation. The actual x86 emulation to run the program only takes up 6.5% of the runtime, with the remaining 37.5% spent on initializing the multithreaded C++ runtime.

But then again, the truly optimal solution would not involve MS-DOS Player at all. If you followed general video game hacking news in May, you'll probably remember the N64 community putting the concept of statically recompiled game ports on the map. In case you're wondering where this seemingly sudden innovation came from and whether a reverse-engineered decompilation project like ReC98 is obsolete now, I wrote a new FAQ entry about why this hype, although justified, is at least in part misguided. tl;dr: None of this can be meaningfully applied to PC-98 games at the moment.
On the other hand, recompiling our compiler would not only be a reasonable thing to attempt, but exactly the kind of problem that recompilation solves best. A 16-bit command-line tool has none of the pesky hardware factors that drag down the usefulness of recompilations when it comes to game ports, and a recompiled port could run even faster than it would on 32-bit Windows. Sure, it's not as flashy as a recompiled game, but if we got a few generous backers, it would still be a great investment into improving the state of static x86 recompilation by simply having another open-source project in that space. Not to mention that it would be a great foundation for improving Turbo C++ 4.0J's code generation and optimizations, which would allow us to simplify lots of awkward pieces of ZUN code… 🤩

That takes care of building ReC98 on 64-bit platforms, but what about the 32-bit ones we used to support? The previous split of the build process into a Tup-driven 32-bit part and a Makefile-driven 16-bit part sure was awkward and I'm glad it's gone, but it did give you the choice between 1) emulating the 16-bit part or 2) running both parts natively on 32-bit Windows. While Tup's upstream Windows builds are 64-bit-only, it made sense to 📝 compile a custom 32-bit version and thus turn any 32-bit Windows ≥Vista into the perfect build platform for ReC98. Older Windows versions that can't run Tup had to build the 32-bit part using a separately maintained dumb batch script created by tup generate, but again, due to Make being trash, they were fully rebuilding the entire codebase every time anyway.
Driving the entire build via Tup changes all of that. Now, it makes little sense to continue using 32-bit Tup:

We need to DLL-inject into a 64-bit MS-DOS Player. Sure, we could compile a 32-bit build of MS-DOS Player, but why would we? If we look at current market shares, nobody runs 32-bit Windows anymore, not even by accident. If you run 32-bit Windows in 2024, it's because you know what you're doing and made a conscious choice for the niche use case of natively running DOS programs. Emulating them defeats the whole point of setting up this environment to begin with.
It would make sense if Tup could inject into DOS programs, but it can't.
Also, as we're going to see later, requiring Windows ≥Vista goes in the opposite direction of what we want for a 32-bit build. The earlier the Windows version, the better it is at running native DOS tools.

This means that we could now only support 32-bit Windows via an even larger tup generated batch file. We'd have to move the MS-DOS Player prefix of the respective command lines into an environment variable to make Tup use the same rules for both itself and the batch file, but the result seems to work…

…but it's really slow, especially on Windows 9x. 🐌 If we look back at the theory behind my previous custom build system, we can already tell why: Efficiently building ReC98 requires a completely different approach depending on whether you're running a typical modern multi-core 64-bit system or a vintage single-core 32-bit system. On the former, you'd want to parallelize the slow emulation as much as you can, so you maximize the amount of TCC processes to keep all CPU cores as busy as possible. But on the latter, you'd want the exact opposite – there, the biggest annoyance is the repeated startup and shutdown of the VDM, TCC, and its DOS extender, so you want to continue batching translation units into as few TCC processes as possible.

CMake fans will probably feel vindicated now, thinking "that sounds exactly like you need a meta build system 🤪". Leaving aside the fact that the output vomited by all of CMake's Makefile generators is a disgusting monstrosity that's far removed from addressing any performance concerns, we sure could solve this problem by adding another layer of abstraction. But then, I'd have to rewrite my working Lua script into either C++ or (heaven forbid) Batch, which are the only options we'd have for bootstrapping without adding any further dependencies, and I really wouldn't want to do that. Alternatively, we could fork Tup and modify tup generate to rewrite the low-level build rules that end up in Tup's database.
But why should we go for any of these if the Lua script already describes the build in a high-level declarative way? The most appropriate place for transforming the build rules is the Lua script itself…

… if there wasn't the slight problem of Tup forbidding file writes from Lua. 🥲 Presumably, this limitation exists because there is no way of replicating these writes in a tup generated dumb shell script, and it does make sense from that point of view.
But wait, printing to stdout or stderr works, and we always invoke Tup from a batch file anyway. You can now tell where this is going. Hey, exfiltrating commands from a build script to the build system via standard I/O streams works for Rust's Cargo too!

Just like Cargo, we want to add a sufficiently unique prefix to every line of the generated batch script to distinguish it from Tup's other output. Since Tup only reruns the Lua script – and would therefore print the batch file – if the script changed between the previous and current build run, we only want to overwrite the batch file if we got one or more lines. Getting all of this to work wasn't all too easy; we're once again entering the more awful parts of Batch syntax here, which apparently are so terrible that Wine doesn't even bother to correctly implement parts of it. 😩
Most importantly, we don't really want to redirect any of Tup's standard I/O streams. Redirecting stdout disables console output coloring and the pretty progress bar at the bottom, and looping over stderr instead of stdout in Batch is incredibly awkward. Ideally, we'd run a second Tup process with a sub-command that would just evaluate the Lua script if it changed - and fortunately, tup parse does exactly that. 😌
In the end, the optimally fast and ERRORLEVEL-preserving solution involves two temporary files. But since creating files between two Tup runs causes it to reparse the Lua code, which would print the batch file to the unfiltered stdout, we have to hide these temporary files from Tup by placing them into its .tup/ database directory. 🤪

On a more positive note, programmatically generating batches from single-file TCC rules turned out to be a great idea. Since the Lua code maps command-line flags to arrays of input files, it can also batch across binaries, surpassing my old system in this regard. This works especially well on the debloated and anniversary branches, which replace ZUN's little command-line flag inconsistencies with a single set of good optimization flags that every translation unit is compiled with.

Time to fire up some VMs then… only to see the build failing on Windows 9x with multiple unhelpful Bad command or file name errors. Clearly, the long echo lines that write our response files run up against some length limit in command.com and need to be split into multiple ones. Windows 9x's limit is larger than the 127 characters of DOS, that's for sure, and the exact number should just be one search away…
…except that it's not the 1024 characters recounted in a surviving newsgroup post. Sure, lines are truncated to 1023 bytes and that off-by-one error is no big deal in this context, but that's not the whole story:

: This not unrealistic command line is 137 bytes long and fails on Windows 9x?!
> echo -DA=1 2 3 a/b/c/d/1 a/b/c/d/2 a/b/c/d/3 a/b/c/d/4 a/b/c/d/5 a/b/c/d/6 a/b/c/d/7 a/b/c/d/8 a/b/c/d/9 a/b/c/d/10 a/b/c/d/11 a/b/c/d/12
Bad command or file name

Wait, what, something about / being the SWITCHAR? And not even just that…

: Down to 132 bytes… and 32 "assignments"?
> echo a=0 b=1 c=2 d=3 e=4 f=5 g=6 h=7 i=8 j=9 k=0 l=1 m=2 n=3 o=4 p=5 q=6 r=7 s=8 t=9 u=0 v=1 w=2 x=3 y=4 z=5 a=0 b=1 c=2 d=3 e=4 f=5
Bad command or file name

And what's perhaps the worst example:

: 64 slashes. Works on DOS, works on `cmd.exe`, fails on 9x.
> echo ////////////////////////////////////////////////////////////////
Bad command or file name

My complete set of test cases: 2024-07-09-Win9x-batch-tokenizer-tests.bat So, time to load command.com into DOSBox-X's debugger and step through some code. 🤷 The earliest NT-based Windows versions were ported to a variety of CPUs and therefore received the then-all-new cmd.exe shell written in C, whereas Windows 9x's command.com was still built on top of the dense hand-written ASM code that originated in the very first DOS versions. Fortunately though, Microsoft open-sourced one of the later DOS versions in April. This made it somewhat easier to cross-reference the disassembly even though the Windows 9x version significantly diverged in the parts we're interested in.
And indeed: After truncating to 1023 bytes and parsing out any redirectors, each line is split into tokens around whitespace and = signs and before every occurrence of the SWITCHAR. These tokens are written into a statically allocated 64-element array, and once the code tries to write the 65^th element, we get the Bad command or file name error instead.

#	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
String	echo	-DA	1	2	3	a	/B	/C	/D	/1	a	/B	/C	/D	/2
Switch flag							🚩	🚩	🚩	🚩		🚩	🚩	🚩	🚩

The first few elements of command.com's internal argument array after calling the Windows 9x equivalent of parseline with my initial example string. Note how all the "switches" got capitalized and annotated with a flag, whereas the = sign no longer appears in either string or flag form.

Needless to say, this makes no sense. Both DOS and Windows pass command lines as a single string to newly created processes, and since this tokenization is lossy, command.com will just have to pass the original string anyway. If your shell wants to handle tokenization at a central place, it should happen after it decided that the command matches a builtin that can actually make use of a pointer to the resulting token array – or better yet, as the first call of each builtin's code. Doing it before is patently ridiculous.
I don't know what's worse – the fact that Windows 9x blindly grinds each batch line through this tokenizer, or the fact that no documentation of this behavior has survived on today's Internet, if any even ever existed. The closest thing I found was this page that doesn't exist anymore, and it also just contains a mere hint rather than a clear description of the issue. Even the usual Batch experts who document everything else seem to have a blind spot when it comes to this specific issue. As do emulators: DOSBox and FreeDOS only reimplement the sane DOS versions of command.com, and Wine only reimplements cmd.exe.

Oh well. 71 lines of Lua later, the resulting batch file does in fact work everywhere:

The clear performance winner at 11.15 seconds after the initial tool check, though sadly bottlenecked by strangely long TASM32 startup times. As for TCC though, even this performance is the slowest a recompiled port would be. Modern compiler optimizations are probably going to shave off another second or two, and implementing support for #pragma once into the recompiled code will get us the aforementioned 5% on top.
If you run this on VirtualBox on modern Windows, make sure to disable Hyper-V to avoid the slower snail execution mode. 🐢

Building in Windows XP under Hyper-V exchanges Windows 98's slow TASM32 startup times for slightly slower DOS performance, resulting in a still decent 13.4 seconds.

29.5 seconds?! Surely something is getting emulated here. And this is the best time I randomly got; my initial preview recording took 55 seconds which is closer to DOSBox-X's dynamic core than it is to Windows 9x. Given how poorly 32-bit Windows 10 performs, Microsoft should have probably discontinued 32-bit Windows after 8 already. If any 16-bit program you could possibly want to run is either too slow or likely to exhibit other compatibility issues (📝 Shuusou Gyoku, anyone?), the existence of 32-bit Windows 10 is nothing but a maintenance burden. Especially because Windows 10 simultaneously overhauled the console subsystem, which is bound to cause compatibility issues anyway. It sure did for me back in 2019 when I tried to get my build system to work…

But wait, there's more! The codebase now compiles on all 32-bit Windows systems I've tested, and yields binaries that are equivalent to ZUN's… except on 32-bit Windows 10. 🙄 Suddenly, we're facing the exact same batched compilation bug from my custom build system again, with REIIDEN.EXE being 16 bytes larger than it's supposed to be.
Looks like I have to look into that issue after all, but figuring out the exact cause by debugging TCC would take ages again. Thankfully, trial and error quickly revealed a functioning workaround: Separating translation unit filenames in the response file with two spaces rather than one. Really, I couldn't make this up. This is the most ridiculous workaround for a bug I've encountered in a long time.

echo -c  -I.  -O  -b-  -3  -Z  -d  -DGAME=4  -ml  -nobj/th04/  th04/op_main.cpp  th04/input_w.cpp  th04/vector.cpp  th04/snd_pmdr.c  th04/snd_mmdr.c  th04/snd_kaja.cpp  th04/snd_mode.cpp  th04/snd_dlym.cpp  th04/snd_load.cpp  th04/exit.cpp  th04/initop.cpp  th04/cdg_p_na.cpp  th04/snd_se.cpp  th04/egcrect.cpp  th04/bgimage.cpp  th04/op_setup.cpp  th04/zunsoft.cpp  th04/op_music.cpp  th04/m_char.cpp  th04/slowdown.cpp  th04/demo.cpp  th04/ems.cpp  th04/tile_set.cpp  th04/std.cpp  th04/tile.cpp>obj\batch014.@c
echo th04/playfld.cpp  th04/midboss4.cpp  th04/f_dialog.cpp  th04/dialog.cpp  th04/boss_exp.cpp  th04/stages.cpp  th04/player_m.cpp  th04/player_p.cpp  th04/hud_ovrl.cpp  th04/cfg_lres.cpp  th04/checkerb.cpp  th04/mb_inv.cpp  th04/boss_bd.cpp  th04/mpn_free.cpp  th04/mpn_l_i.cpp  th04/initmain.cpp  th04/gather.cpp  th04/scrolly3.cpp  th04/midboss.cpp  th04/hud_hp.cpp  th04/mb_dft.cpp  th04/grcg_3.cpp  th04/it_spl_u.cpp  th04/boss_4m.cpp  th04/bullet_u.cpp  th04/bullet_a.cpp  th04/boss.cpp  th04/boss_4r.cpp  th04/boss_x2.cpp  th04/maine_e.cpp  th04/cutscene.cpp>>obj\batch014.@c
echo th04/staff.cpp>>obj\batch014.@c

The TCC response file generation code for all current decompiled TH04 code, split into multiple echo calls based on the Windows 9x batch tokenizer rules and with double spaces between each parameter for added "safety". Would this also have been the solution for the batched compilation bugs I was experiencing with my old build system in DOSBox? I suddenly was unable to reproduce these bugs, so we won't know for the time being…

Hopefully, you've now got the impression that supporting any kind of 32-bit Windows build is way more of a liability than an asset these days, at least for this specific project. "Real hardware", "motivating a TCC recompilation", and "not dropping previous features" really were the only reasons for putting up with the sheer jank and testing effort I had to go through. And I wouldn't even be surprised if real-hardware developers told me that the first reason doesn't actually hold up because compiling ReC98 on actual PC-98 hardware is slow enough that they'd rather compile it on their main machine and then transfer the binaries over some kind of network connection.
I guess it also made for some mildly interesting blog content, but this was definitely the last time I bothered with such a wide variety of Windows versions without being explicitly funded to do so. If I ever get to recompile TCC, it will be 64-bit only by default as well.

Instead, let's have a tier list of supported build platforms that clearly defines what I am maintaining, with just the most convincing 32-bit Windows version in Tier 1. Initially, that was supposed to be Windows 98 SE due to its superior performance, but that's just unreasonable if key parts of the OS remain undocumented and make no sense. So, XP it is.
*nix fans will probably once again be disappointed to see their preferred OS in Tier 2. But at least, all we'd need for that to move up to Tier 1 is a CI configuration, contributed either via funding me or sending a PR. (Look, even more contribution-ideas!)
Getting rid of the Wine requirement for a fully cross-platform build process wouldn't be too unrealistic either, but would require us to make a few quality decisions, as usual:

Do we run the DOS tools by creating a cross-platform MS-DOS Player fork, or do we statically recompile them?
Do we replace 32-bit Windows TASM with the 16-bit DOS TASM.EXE or TASMX.EXE, which we then either run through our forked MS-DOS Player or recompile? This would further slow down the build and require us to get rid of these nice long non-8.3 filenames… 😕 I'd only recommend this after the looming librarization of ZUN's master.lib fork is completed.
Or do we try migrating to JWasm again? As an open-source assembler that aims for MASM compatibility, it's the closest we can get to TASM, but it's not a drop-in replacement by any means. I already tried in late 2014, but encountered too many issues and quickly abandoned the idea. Maybe it works better now that we have less ASM? In any case, this migration would only get easier the less ASM code we have remaining in the codebase as we get closer to the 100% finalization mark.

Y'know what I think would be the best idea for right now, though? Savoring this new build system and spending an extended amount of time doing actual decompilation or modding for a change.

Now that even full rebuilds are decently fast, let's make use of that productivity boost by doing some urgent and far-reaching code cleanup that touches almost every single C++ source file. The most immediately annoying quirk of this codebase was the silly way each translation unit #included the headers it needed. Many years ago, I measured that repeatedly including the same header did significantly impact Turbo C++ 4.0J's compilation times, regardless of any include guards inside. As a consequence of this discovery, I slightly overreacted and decided to just not use any include guards, ever. After all, this emulated build process is slow enough, and we don't want it to needlessly slow down even more! This way, redundantly including any file that adds more than just a few #define macros won't even compile, throwing lots of Multiple definition errors.
Consequently, the headers themselves #included almost nothing. Starting a new translation unit therefore always involved figuring and spelling out the transitive dependencies of the headers the new unit actually wants to use, in a short trial-and-error process. While not too bad by itself, this was bound to become quite counterproductive once we get closer to porting these games: If some inlined function in a header needed access to, let's say, PC-98-specific I/O ports as an implementation detail, the header would have externalized this dependency to the top-level translation unit, which in turn made that that unit appear to contain PC-98-native code even if the unit's code itself was perfectly portable.

But once we start making some of these implicit transitive dependencies optional, it all stops being justifiable. Sometimes, a.hpp declared things that required declarations from b.hpp but these things are used so rarely that it didn't justify adding #include "b.hpp" to all translation units that #include "a.hpp". So how about conditionally declaring these things based on previously #included headers?

#if (defined(SUBPIXEL_HPP) && defined(PLANAR_H))
	// Sets the [tile_ring] tile at (x, y) to the given VRAM offset.
	void tile_ring_set_vo(subpixel_t x, subpixel_t y, vram_offset_t image_vo);
#endif

You can maybe do this in a project that consistently sorts the #include lists in every translation unit… err, no, don't do this, ever, it's awful. Just separate that declaration out into another header.

Now that we've measured that the sane alternative of include guards comes with a performance cost of just 5% and we've further reduced its effective impact by parallelizing the build, it's worth it to take that cost in exchange for a tidy codebase without such surprises. From now on, every header file will #include its own dependencies and be a valid translation unit that must compile on its own without errors. In turn, this allows us to remove at least 1,000 #include of transitive dependencies from .cpp files. 🗑️
However, that 5% number was only measured after I reduced these redundant #includes to their absolute minimum. So it still makes sense to only add include guards where they are absolutely necessary – i.e., transitively dependent headers included from more than one other file – and continue to (ab)use the Multiple definition compiler errors as a way of communicating "you're probably #including too many headers, try removing a few". Certainly a less annoying error than Undefined symbol.

Since all of this went way over the 7-push mark, we've got some small bits of RE and PI work to round it all out. The .REC loader in TH04 and TH05 is completely unremarkable, but I've got at least a bit to say about TH02's High Score menu. I already decompiled MAINE.EXE's post-Staff Roll variant in 2015, so we were only missing the almost identical MAIN.EXE variant shown after a Game Over or when quitting out of the game. The two variants are similar enough that it mostly needed just a small bit of work to bring my old 2015 code up to current standards, and allowed me to quickly push TH02 over the 40% RE mark.
Functionally, the two variants only differ in two assignments, but ZUN once again chose to copy-paste the entire code to handle them. This was one of ZUN's better copy-pasting jobs though – and honestly, I can't even imagine how you would mess up a menu that's entirely rendered on the PC-98's text RAM. It almost makes you wonder whether ZUN actually used the same #if ENDING preprocessor branching that my decompilation uses… until the visual inconsistencies in the alignment of the place numbers and the POINT and labels clearly give it away as copy-pasted:

Screenshot of TH02's High Score screen as seen in MAIN.EXE when quitting out of the game, with scores initialized to show off the maximum number of digits and the incorrect alignment of the POINT and ST headers

Screenshot of TH02's High Score screen as seen in MAINE.EXE when entering a new high score after the Staff Roll, with scores initialized to show off the maximum number of digits and the incorrect alignment of the POINT header

Next up: Starting the big Seihou summer! Fortunately, waiting two more months was worth it: In mid-June, Microsoft released a preview version of Visual Studio that, in response to my bug report, finally, finally makes C++ standard library modules fully usable. Let's clean up that codebase for real, and put this game into a window.

📝 Posted:: 2024-04-11 22:45 UTC
🚚 Summary of:: P0278, P0279
⌨ Commits:: b6a7285...f0fbaf6, f0fbaf6...20bac82
💰 Funded by:: Yanga, Blue Bolt
🏷 Tags:

That was quick: In a surprising turn of events, Romantique Tp themselves came in just one day after the last blog post went up, updated me with their current and much more positive opinion on Sound Canvas VA, and confirmed that real SC-88Pro hardware clamps invalid Reverb Macro values to the specified range. I promised to release a new Sound Canvas VA BGM pack for free once I knew the exact behavior of real hardware, so let's go right back to Seihou and also integrate the necessary SysEx patches into the game's MIDI player behind a toggle. This would also be a great occasion to quickly incorporate some long overdue code maintenance and build system improvements, and a migration to C++ modules in particular. When I started the Shuusou Gyoku Linux port a year ago, the combination of modules and <windows.h> threw lots of weird errors and even crashed the Visual Studio compiler. But nowadays, Microsoft even uses modules in the Office code base. This must mean that these issues are fixed by now, right?
Well, there's still a bug that causes the modularized C++ standard library to be basically unusable in combination with the static analyzer, and somehow, I was the first one to report it. So it's 3½ years after C++20 was finalized, and somehow, modules are still a bleeding-edge feature and a second-class citizen in even the compiler that supports them the best. I want fast compile times already! 😕
Thankfully, Microsoft agrees that this is a bug, and will work on it at some point. While we're waiting, let's return to the original plan of decompiling the endings of the one PC-98 Touhou game that still needed them decompiled.

TH02's endings
TH02's Staff Roll
TH02's verdict screen, and its hidden challenge
TH02's end-of-stage bonus screens

After the textless slideshows of TH01, TH02 was the first Touhou game to feature lore text in its endings. Given that this game stores its 📝 in-game dialog text in fixed-size plaintext files, you wouldn't expect anything more fancy for the endings either, so it's not surprising to see that the END?.TXT files use the same concept, with 44 visible bytes per line followed by two bytes of padding for the CR/LF newline sequence. Each of these lines is typed to the screen in full, with all whitespace and a fixed time for each 2-byte chunk.
As a result, everything surrounding the text is just as hardcoded as TH01's endings were, which once again opens up the possibility of freely integrating all sorts of creative animations without the overhead of an interpreter. Sadly, TH02 only makes use of this freedom in a mere two cases: the picture scrolling effect from Reimu's head to Marisa's head in the Bad Endings, and a single hardware palette change in the Good Endings.

Screenshot of the (0-based) line #13 in TH02's Good Endings, together with its associated (and colored) picture — Powered by master.lib's `egc_shift_down()`.

Screenshot of the (0-based) line #14 in TH02's Good Endings, showing off how it doesn't change the picture of the previous line and only applies a different grayscale palette — Powered by master.lib's `egc_shift_down()`.

Hardcoding also still made sense for this game because of how the ending text is structured. The Good and Bad Endings for the individual shot types respectively share 55% and 77% of their text, and both only diverge after the first 27 lines. In straight-line procedural code, this translates to one branch for each shot type at a single point, neatly matching the high-level structure of these endings.

But that's the end of the positive or neutral aspects I can find in these scripts. The worst part, by far, is ZUN's approach to displaying the text in alternating colors, and how it impacts the entire structure of the code.
The simplest solution would have involved a hardcoded array with the color of each line, just like how the in-game dialogs store the face IDs for each text box. But for whatever reason, ZUN did not apply this piece of wisdom to the endings and instead hardcoded these color changes by… mutating a global variable before calling the text typing function for every individual line. This approach ruins any possibility of compressing the script code into loops. While ZUN did use loops, all of them are very short because they can only last until the next color change. In the end, the code contains 90 explicitly spelled-out calls to the 5-parameter line typing function that only vary in the pointer to each line and in the slower speed used for the one or two final lines of each ending. As usual, I've deduplicated the code in the ReC98 repository down to a sensible level, but here's the full inlined and macro-expanded horror:

Raw decompilation of TH02's script function for its three Bad Endings, without inline function or macro trickery — It's highly likely that this is what ZUN hacked into his PC-98 and was staring at back in 1997.

Raw decompilation of TH02's script function for its three Good Endings, without inline function or macro trickery — It's highly likely that this is what ZUN hacked into his PC-98 and was staring at back in 1997.

All this redundancy bloats the two script functions for the 6 endings to a whopping 3,344 bytes inside TH02's MAINE.EXE. In particular, the single function that covers the three Good Endings ends up with a total of 631 x86 ASM instructions, making it the single largest function in TH02 and the 7^th longest function in all of PC-98 Touhou. If the 📝 single-executable build for TH02's debloated and anniversary branches ends up needing a few more KB to reduce its size below the original MAIN.EXE, there are lots of opportunities to compress it all.

The ending text can also be fast-forwarded by holding any key. As we've come to expect for this sort of ZUN code, the text typing function runs its own rendering loop with VSync delays and input detection, which means that we 📝 once 📝 again have to talk about the infamous quirk of the PC-98 keyboard controller in relation to held keys. We've still got 54 not yet decompiled calls to input detection functions left in this codebase, are you excited yet?!
Holding any key speeds up the text of all ending lines before the last one by displaying two kana/kanji instead of one per rendered frame and reducing the delay between the rendered frames to ¹/₃ of its regular length. In pseudocode:

for(i = 0; i < number_of_2_byte_chunks_on_displayed_line; i++) {
	input = convert_current_pc98_bios_input_state_to_game_specific_bitflags();
	add_chunk_to_internal_text_buffer(i);
	blit_internal_text_buffer_from_the_beginning();
	if(input == INPUT_NONE) {
		// Basic case, no key pressed
		frame_delay(frames_per_chunk);
	} else if((i % 2) == 1) {
		// Key pressed, chunk number is odd.
		frame_delay(frames_per_chunk / 3);
	} else {
		// Key pressed, chunk number is even.
		// No delay; next iteration adds to the same frame.
	}
}

This is exactly the kind of code you would write if you wanted to deliberately maximize the impact of this hardware quirk. If the game happens to read the current input state right after a key up scancode for the last previously held and game-relevant key, it will then wrongly take the branch that uninterruptibly waits for the regular, non-divided amount of VSync interrupts. In my tests, this broke the rhythm of the fast-forwarded text about once per line. Note how this branch can also be taken on an even chunk: Rendering glyphs straight from font ROM to VRAM is not exactly cheap, and if each iteration (needlessly) blits one more full-width glyph than the last one, the probability of a key up scancode arriving in the middle of a frame only increases.
The fact that TH02 allows any of the supported input keys to be held points to another detail of this quirk I haven't mentioned so far. If you press multiple keys at once, the PC-98's keyboard controller only sends the periodic key up scancodes as long as you are holding the last key you pressed. Because the controller only remembers this last key, pressing and releasing any other key would get rid of these scancodes for all keys you are still holding.
As usual, this ZUN bug only occurs on real hardware and with DOSBox-X's correct emulation of the PC-98 keyboard controller.

After the ending, we get to witness the most seamless transition between ending and Staff Roll in any Touhou game as the BGM immediately changes to the Staff Roll theme, and the ending picture is shifted into the same place where the Staff Roll pictures will appear. Except that the code misses the exact position by four pixels, and cuts off another four pixels at the right edge of the picture:

Also, note the green 1-pixel line at the right edge of this specific picture. This is a bug in the .PI file where the picture is indeed shifted one pixel to the left.

What follows is a comparatively large amount of unused content for a single scene. It starts right at the end of this underappreciated 11-frame animation loaded from ENDFT.BFT:

TH02's ENDFT.BFT — Wastefully using the 4bpp BFNT format. The single frame at the end of the animation is unused; while it might look identical to the ＺＵＮ glyphs later on in the Staff Roll, that's only because both are independently rendered boldfaced versions of the same font ROM glyphs. Then again, it does prove that ZUN created this animation on a PC-98 model made by NEC, as the Epson clones used a font ROM with a distinctly different look.

Wastefully using the 4bpp BFNT format. The single frame at the end of the animation is unused; while it might look identical to the ＺＵＮ glyphs later on in the Staff Roll, that's only because both are independently rendered boldfaced versions of the same font ROM glyphs. Then again, it does prove that ZUN created this animation on a PC-98 model made by NEC, as the Epson clones used a font ROM with a distinctly different look.

TH02's Staff Roll is also unique for the pre-made screenshots of all 5 stages that get shown together with a fancy rotating rectangle animation while the Staff Roll progresses in sync with the BGM. The first interesting detail shows up immediately after the first image, where the code jumps over one of the 320×200 quarters in ED06.PI, leaving the screenshot of the Stage 2 midboss unused.
All of the cutscenes in PC-98 Touhou store their pictures as 320×200 quarters within a single 640×400 .PI file. Anywhere else, all four quarters are supposed to be displayed with the same palette specified in the .PI header, but TH02's Staff Roll screenshots are also unique in how all quarters beyond the top-left one require palettes loaded from external .RGB files to look right. Consequently, the game doesn't clearly specify the intended palette of this unused screenshot, and leaves two possibilities:

The unused second 320×200 quarter of TH02's `ED06.PI`, displayed in the Stage 2 color palette used in-game.
The unused second 320×200 quarter of TH02's `ED06.PI`, displayed in the palette specified in the .PI header. These are the colors you'd see when looking at the file in a .PI viewer, when converting it into another format with the usual tools, or in sprite rips that don't take TH02's hardcoded palette changes into account. These colors are only intended for the Stage 1 screenshot in the top-left quarter of the file.
The unused second 320×200 quarter of TH02's `ED06.PI`, displayed in the palette from `ED06B.RGB`, which the game uses for the following screenshot of the Meira fight. As it's from the same stage, it *almost* matches the in-game colors seen in 1️⃣, and only differs in the white color (`#FFF`) being slightly red-tinted (`#FCC`).

It might seem obvious that the Stage 2 palette in 1️⃣ is the correct one, but ZUN indeed uses ED06B.RGB with the red-tinted white color for the following screenshot of the Meira fight. Not only does this palette not match Meira's in-game appearance, but it also discolors the rectangle animation and the surrounding Staff Roll text:

Also, that tearing on frame #1 is not a recording artifact, but the expected result of yet another VSync-related landmine. 💣 This time, it's caused by the combination of 1) the entire sequence from the ending to the verdict screen being single-buffered, and 2) this animation always running immediately after an expensive operation (640×400 .PI image loading and blitting to VRAM, 320×200 VRAM inter-page copy, or hardware palette loading from a packed file), without waiting for the VSync interrupt. This makes it highly likely for the first frame of this animation to start rendering at a point where the (real or emulated) electron beam has already traveled over a significant portion of the screen.

But when I went into Stage 2 to compare these colors to the in-game palette, I found something even more curious. ZUN obviously made this screenshot with the Reimu-C shot type, but one of the shot sprites looks slightly different from how it does in-game. These screenshots must have been made earlier in development when the sprite didn't yet feature the second ring at the top. The same applies to the Stage 4 screenshot later on:

Original version of the third 320×200 quarter from TH02's ED06.PI, representing the Meira boss fight and showing off an old version of the Reimu-C shot sprites

Original version of the first 320×200 quarter from TH02's ED07.PI, representing Stage 4 and showing off an old version of the Reimu-C shot sprites

Finally, the rotating rectangle animation delivers one more minor rendering bug. Each of the 20 frames removes the largest and outermost rectangle from VRAM by redrawing it in the same black color of the background before drawing the remaining rectangles on top. The corners of these rectangles are placed on a shrinking circle that starts with a radius of 256 pixels and is centered at (192, 200), which results in a maximum possible X coordinate of 448 for the rightmost corner of the rectangle. However, the Staff Roll text starts at an X coordinate of 416, causing the first two full-width glyphs to still fall within the area of the circle. Each line of text is also only rendered once before the animation. So if any of the rectangles then happens to be placed at an angle that causes its edges to overlap the text, its removal will cut small holes of black pixels into the glyphs:

The green dotted circle corresponds to the newest/smallest rectangle. Note how ZUN only happened to avoid the holes for the two final animations by choosing an initial angle and angular velocity that causes the resulting rectangles to just barely avoid touching the ＴＥＳＴ　ＰＬＡＹＥＲ glyphs.

At least the following verdict screen manages to have no bugs aside from the slightly imperfect centering of its table values, and only comes with a small amount of additional bloat. Let's get right to the mapping from skill points to the 12 title strings from END3.TXT, because one of them is not like the others:

Skill	Title
≥100	神を超えた巫女！！
90 - 99	もはや神の領域！！
80 - 99	Ａ級シューター！！
78 - 79	うきうきゲーマー！
77	バニラはーもにー！
70 - 76	うきうきゲーマー！
60 - 69	どきどきゲーマー！
50 - 59	要練習ゲーマー
40 - 49	非ゲーマー級
30 - 39	ちょっとだめ
20 - 29	非人間級
10 - 19	人間でない何か
≤9	死んでいいよ、いやいやまじで

Looks like I'm the first one to document the required skill points as well? Everyone else just copy-pastes END3.TXT without providing context.

So how would you get exactly 77 and achieve vanilla harmony? Here's the formula:

	Difficulty level* `× 20`
+	`10 - (`Continues used `× 3)`
+	`max((50 - (`Lives lost^† `× 3) -` Bombs used^†`), 0)`
+	`min(max(📝 item_skill, 0), 25)`

* Ranges from 0 (Easy) to 3 (Lunatic).
^† Across all 5 stages.

With Easy Mode capping out at 85, this is possible on every difficulty, although it requires increasingly perfect play the lower you go. Reaching 77 on purpose, however, pretty much demands a careful route through the entire game, as every collected and missed item will influence the item_skill in some way. This almost feels it's like the ultimate challenge that this game has to offer. Looking forward to the first Vanilla Harmony% run!

And with that, TH02's MAINE.EXE is both fully position-independent and ready for translation. There's a tiny bit of undecompiled bit of code left in the binary, but I'll leave that for rounding up a future TH02 decompilation push.

With one of the game's skill-based formulas decompiled, it's fitting to round out the second push with the other two. The in-game bonus tables at the end of a stage also have labels that we'd eventually like to translate, after all.
The bonus formula for the 4 regular stages is also the first place where we encounter TH02's rank value, as well as the only instance in PC-98 Touhou where the game actually displays a rank-derived value to the player. KirbyComment and Colin Douglas Howell accurately documented the rank mechanics over at Touhou Wiki two years ago, which helped quite a bit as rank would have been slightly out of scope for these two pushes. 📝 Similar to TH01, TH02's rank value only affects bullet speed, but the exact details of how rank is factored in will have to wait until RE progress arrives at this game's bullet system.
These bonuses are calculated by taking a sum of various gameplay metrics and multiplying it with the amount of point items collected during the stage. In the 4 regular stages, the sum consists of:

難易度	Difficulty level* `× 2,000`
ステージ	`(`Rank `+ 16) × 200`
ボム	`max((2,500 - (`Bombs used* `× 500)), 0)`
ミス	`max((3,000 - (`Lives lost* `× 1,000)), 0)`
靈撃初期数	`(4 -` Starting bombs`) × 800`
靈夢初期数	`(5 -` Starting lives`) × 1,000`

* Within this stage, across all continues.
Yup, 封魔録.TXT does indeed document this correctly.

As rank can range from -6 to +4 on Easy and +16 on the other difficulties, this sum can range between:

	Easy	Normal	Hard	Lunatic
Minimum	2,800	4,800	6,800	8,800
Maximum	16,700	21,100	23,100	25,100

The sum for the Extra Stage is not documented in 封魔録.TXT:

クリア	`10,000`
ミス回数	`max((20,000 - (`Lives lost `× 4,000)), 0)`
ボム回数	`max((20,000 - (`Bombs used `× 4,000)), 0)`
クリアタイム	`⌊max((20,000 -` Boss fight frames*`), 0) ÷ 10⌋ × 10`

* Amount of frames spent fighting Evil Eye Σ, counted from the end of the pre-boss dialog until the start of the defeat animation.

And that's two pushes packed full of the most bloated and copy-pasted code that's unique to TH02! So bloated, in fact, that TH02 RE as a whole jumped by almost 7%, which in turn finally pushed overall RE% over the 60% mark. 🎉 It's been a while since we hit a similar milestone; 50% overall RE happened almost 2 years ago during 📝 P0204, a month before I completed the TH01 decompilation.
Next up: Continuing to wait for Microsoft to fix the static analyzer bug until May at the latest, and working towards the newly popular dreams of TH03 netplay by looking at some of its foundational gameplay code.

📝 Posted:: 2024-02-03 08:03 UTC
🚚 Summary of:: P0264, P0265
⌨ Commits:: 46cd6e7...78728f6, 78728f6...ff19bed
💰 Funded by:: Blue Bolt, [Anonymous], iruleatgames
🏷 Tags:

Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.

But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!

The Music Room background masking effect
The GRCG's plane disabling flags
Text color restrictions
The entire messy rest of the Music Room code
TH04's partially consistent congratulation picture on Easy Mode
TH02's boss position and damage variables

As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.

The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:

TH02's Music Room background in its on-disk state — Let's call this "the blank image".

TH03's Music Room background in its on-disk state — Let's call this "the blank image".

That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right?
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:

TH02's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom — The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.

TH03's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom — The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.

On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.

Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:

// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);

// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);

// Regular GRCG blitting code to any VRAM segment…
pokeb(0xA8000, offset, …);

// We're done, turn off the GRCG.
outportb(0x7C, 0x00);

This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck.

Three overlapping Music Room polygons rendered using master.lib's grcg_polygon_c() function with a disabled GRCG — An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.

Three overlapping Music Room polygons rendered as in the original game, with the GRCG enabled — An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.

As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:

Colors 0 or 1 can't be used, because those don't include any of the bits that can stay constant between frames.
If the lowest bit of a palette color index has no effect on the displayed color, text drawn in either of the two colors won't be visually affected by the polygon animation and will always appear on top. TH04 and TH05 rely on this property with their colors 2/3, 4/5, and 6/7 being identical, but this would work in TH02 and TH03 as well.
But this doesn't apply to TH02 and TH03's palettes, so how do they do it? The secret: They simply include all text pixels in nopoly_B. This allows text to use any color with an odd palette index – the lowest bit then won't be affected by the polygons ORed into the first bitplane, and the other bitplanes remain unchanged.
TH04 is a curious case. Ostensibly, it seems to remove support for odd text colors, probably because the new 10-frame fade-in animation on the comment text would require at least the comment area in VRAM to be captured into nopoly_B on every one of the 10 frames. However, the initial pixels of the tracklist are still included in nopoly_B, which would allow those to still use any odd color in this game. ZUN only removed those from nopoly_B in TH05, where it had to be changed because that game lets you scroll and browse through multiple tracklists.

The contents of `nopoly_B` with each game's first track selected.

Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:

Due to the polygon animation, the Music Room is one of the few double-buffered menus in PC-98 Touhou, rendering to both VRAM pages on alternate frames instead of using the other page to store a background image. Unfortunately though, this doesn't actually translate to tearing-free rendering because ZUN's initial implementation for TH02 mixed up the order of the required operations. You're supposed to first wait for the GDC's VSync interrupt and then, within the display's vertical blanking interval, write to the relevant I/O ports to flip the accessed and shown pages. Doing it the other way around and flipping as soon as you're finished with the last draw call of a frame means that you'll very likely hit a point where the (real or emulated) electron beam is still traveling across the screen. This ensures that there will be a tearing line somewhere on the screen on all but the fastest PC-98 models that can render an entire frame of the Music Room completely within the vertical blanking interval, causing the very issue that double-buffering was supposed to prevent.
ZUN only fixed this landmine in TH05.
The polygons have a fixed vertex count and radius depending on their index, everything else is randomized. They are also never reinitialized while OP.EXE is running – if you leave the Music Room and reenter it, they will continue animating from the same position.
Except for TH05's repeatable ⬆️ up and ⬇️ down inputs, the games force you to release any pressed key before they will handle any new input. But since the game is running on a PC-98, we 📝 once again have to mention the infamous quirk of the keyboard controller with regard to held keys. Funnily enough, the games go back and forth in how they address it, in a way that matches the 📝 additional delay of the cutscene interpreter loop:
- TH02 and TH04 don't handle it at all, causing held keys to be processed again after about a second.
- TH03 and TH05 correctly work around the quirk, at the usual cost of a 614.4 µs delay per frame. Except that the delay is actually twice as long in frames in which a previously held key is released, because this code is a mess.
But even in 2024, DOSBox-X is the only emulator that actually replicates this detail of real hardware. On anything else, keyboard input will behave as ZUN intended it to. At least I've now mentioned this once for every game, and can just link back to this blog post for the other menus we still have to go through, in case their game-specific behavior matches this one.
TH02 is the only game that
1. separately lists the stage and boss themes of the main game, rather than following the in-game order of appearance,
2. continues playing the selected track when leaving the Music Room,
3. always loads both MIDI and PMD versions, regardless of the currently selected mode, and
4. does not stop the currently playing track before loading the new one into the PMD and MMD drivers.
The combination of 2) and 3) allows you to leave the Music Room and change the music mode in the Option menu to listen to the same track in the other version, without the game changing back to the title screen theme. 4), however, might cause the PMD and MMD drivers to play garbage for a short while if the music data is loaded from a slow storage device that takes longer than a single period of the OPN timer to fill the driver's song buffer. Probably not worth mentioning anymore though, now that people no longer try fitting PC-98 Touhou games on floppy disks.
The comment text files use another fixed-size plaintext format, just like 📝 TH02's in-game dialog system or 📝 TH03's win messages:
- Exactly 40 (TH02/TH03) / 38 (TH04/TH05) visible bytes per line,
- padded with 2 bytes that can hold a CR/LF newline sequence for easier editing.
- Every track starts with a title line that mostly just duplicates the names from the hardcoded tracklist,
- followed by a fixed 19 (TH02/TH03/TH04) / 9 (TH05) comment lines.
In TH04 and TH05, lines can start with a semicolon (;) to prevent them from being rendered. This is purely a performance hint, and is visually equivalent to filling the line with spaces.
All in all, the quality of the code is even slightly below the already poor standard for PC-98 Touhou: More VRAM page copies than necessary, conditional logic that is nested way too deeply, a distinct avoidance of state in favor of loops within loops, and – of course – a couple of gotos to jump around as needed.
In TH05, this gets so bad with the scrolling and game-changing tracklist that it all gives birth to a wonderfully obscure inconsistency: When pressing both ⬆️/⬇️ and ⬅️/➡️ at the same time, the game first processes the vertical input and then the horizontal one in the next frame, making it appear as if the latter takes precedence. Except when the cursor is highlighting the first (⬆️ ) or 12^th (⬇️ ) element of the list, and said list element is not the first track (⬆️ ) or the quit option (⬇️ ), in which case the horizontal input is ignored.

And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier…

For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, Try to Normal Rank!! could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".

With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌

The remaining ¹/₆th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.

Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there! Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.

Screenshot of TH02's Five Magic Stones, with the first two (both internally and in the order you fight them in) alive and activated

Screenshot of TH02's Five Magic Stones, with the second two (both internally and in the order you fight them in) alive and activated

This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!

These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just ¹/₆th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.

Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.

📝 Posted:: 2023-11-01 23:58 UTC
🚚 Summary of:: P0258, P0259, P0260, P0261
⌨ Commits:: 5876755...e8a0b3e, e8a0b3e...dfaa3c6, dfaa3c6...ed9ee93, ed9ee93...ae2fc28
💰 Funded by:: Blue Bolt, [Anonymous], Yanga, Splashman
🏷 Tags:

And we're back to PC-98 Touhou for a brief interruption of the ongoing Shuusou Gyoku Linux port. Let's clear some of the Touhou-related progress from the backlog, and use the unconstrained nature of these contributions to prepare the 📝 upcoming non-ASCII translations commissioned by Touhou Patch Center. The current budget won't cover all of my ambitions, but it would at least be nice if all text in these games was feasibly translatable by the time I officially start working on that project.

At a little over 3 pushes, it might be surprising to see that this took longer than the 📝 TH03/TH04/TH05 cutscene system. It's obvious that TH02 started out with a different system for in-game dialog, but while TH04 and TH05 look identical on the surface, they only actually share 30% of their dialog code. So this felt more like decompiling 2.4 distinct systems, as opposed to one identical base with tons of game-specific differences on top.

The table of contents was pretty popular last time around, so let's have another one:

Overview of TH04's dialog system
Changes introduced in TH05
Command reference for the TH04 and TH05 systems
Overview of TH02's dialog system
TH02's face portrait images
Bugs during TH02's dialog box slide-in animation
Bugs and quirks in Mima's defeat dialog (might be lore-relevant)
TH03 win messages

Let's start with the ones from TH04 and TH05, since they are not that broken. For TH04, ZUN started out by copy-pasting the cutscene system, causing the result to inherit many of the caveats I already described in the cutscene blog post:

It's still a plaintext format geared exclusively toward full-width Japanese text.
The parser still ignores all whitespace, forcing ASCII text into hacks with unassigned Shift-JIS lead bytes outside the second byte of a 2-byte chunk.
Commands are still preceded by a 0x5C byte, which renders as either a \ or a ¥ depending on your font and interpretation of Shift-JIS.
Command parameters are parsed in exactly the same way, with all the same limits.
A lot of the same script commands are identical, including 7 of them that were not used in TH04's original dialog scripts.

Then, however, he greatly simplified the system. Mainly, this was done by moving text rendering from the PC-98 graphics chip to the text chip, which avoids the need for any text-related unblitting code, but ZUN also added a bunch of smaller changes:

The player must advance through every dialog box by releasing any held keys and then pressing any key mapped to a game action. There are no timeouts.
The delay for every 2 bytes of text was doubled to 2 frames, and can't be overridden.
Instead of holding ESC to fast-forward, pressing any key will immediately print the entire rest of a text box.
Dialogs run in their own single-buffered frame loop, interrupting the rest of the game. The other VRAM page keeps the background pixels required for unblitting the face images.
All script commands that affect the graphics layer are preceded by a 1-frame delay. ZUN most likely did this because of the single-buffered nature, as it prevents tearing on the first frame by waiting for the CRT beam to return to the top-left corner before changing any pixels.
Both boxes are intended to contain up to 30 half-width characters on each of their up to 3 lines, but nothing in the code enforces these limits. There is no support for automatic line breaks or starting new boxes.

Demonstrating the lack of automatic line or box breaks in TH05's dialog system — While it would seem that TH05 has no issues with ASCII `0x20` spaces, the text as a whole is still blindly processed two bytes at a time, and any commands can only appear at even byte positions within a line. I dimmed the VRAM pixels to 25% of their original brightness to make the text easier to read.
The same text backported to TH04, additionally demonstrating how that game's dialog system inherited the whitespace skipping behavior of TH03's cutscene system. Just like there, ASCII `0x20` spaces only work at odd byte positions because the game treats them as the trailing byte of a full-width Shift-JIS codepoint. I don't know how large the budget for the upcoming non-ASCII translations will be, but I'm going to fix this even in the very basic fully static variant. I dimmed the VRAM pixels to 25% of their original brightness to make the text easier to read.

Demonstrating the lack of automatic line or box breaks in TH04's dialog system, in addition to its lack of support for ASCII 0x20 spaces carried over from TH03's cutscene system — While it would seem that TH05 has no issues with ASCII `0x20` spaces, the text as a whole is still blindly processed two bytes at a time, and any commands can only appear at even byte positions within a line. I dimmed the VRAM pixels to 25% of their original brightness to make the text easier to read.
The same text backported to TH04, additionally demonstrating how that game's dialog system inherited the whitespace skipping behavior of TH03's cutscene system. Just like there, ASCII `0x20` spaces only work at odd byte positions because the game treats them as the trailing byte of a full-width Shift-JIS codepoint. I don't know how large the budget for the upcoming non-ASCII translations will be, but I'm going to fix this even in the very basic fully static variant. I dimmed the VRAM pixels to 25% of their original brightness to make the text easier to read.

TH05 then moved from TH04's plaintext scripts to the binary .TX2 format while removing all the unused commands copy-pasted from the cutscene system. Except for a single additional command intended to clear a text box, TH05's dialog system only supports a strict subset of the features of TH04's system.
This change also introduced the following differences compared to TH04:

The game now stores the dialog of all 4 playable characters in the same file, with a (4 + 1)-word header that indicates the byte offset and length of each character's script. This way, it can load only the one script for the currently played character.
Since there is no need for whitespace in a binary format, you can now use ASCII 0x20 spaces even as the first byte of a 2-byte text chunk! 🥳
All command parameters are now mandatory.
Filenames are now passed directly by pointer to the respective game function. Therefore, they now need to be null-terminated, but can in turn be as long as 📝 the number of remaining bytes in the allocated dialog segment. In practice though, the game still runs on DOS and shares its restriction of 8.3 filenames…
When starting a new dialog box, any existing text in the other box is now colored blue.
Thanks to ZUN messing up the return values of the command-interpreting switch function, you can effectively use only line break and gaiji commands in the middle of text. All other commands do execute, but the interpreter then also treats their command byte as a Shift-JIS lead byte and places it in text RAM together with whatever other byte follows in the script.
This is why TH04 can and does put its \= commands into the boxes started with the 0 or 1 commands, but TH05 has to put its 0x02 commands before the equivalent 0x0D.

Writing the `0x02` byte to text RAM results in an character, which is simply the PC-98 font ROM's glyph for that Shift-JIS codepoint.
Also note how each face change is now preceded by two frames of delay.
No problem in TH04. Note how the dialog also runs a bit faster – TH04 only adds the aforementioned one frame of delay to each face change, and has fewer two-byte chunks of text to display overall.

For modding these files, you probably want to use TXDEF from -Tom-'s MysticTK. It decodes these files into a text representation, and its encoder then takes care of the character-specific byte offsets in the 10-byte header. This text representation simplifies the format a lot by avoiding all corner cases and landmines you'd experience during hex-editing – most notably by interpreting the box-starting 0x0D as a command to show text that takes a string parameter, avoiding the broken calls to script commands in the middle of text. However, you'd still have to manually ensure an even number of bytes on every line of text.

In the entry function of TH05's dialog loop, we also encounter the hack that is responsible for properly handling 📝 ZUN's hidden Extra Stage replay. Since the dialog loop doesn't access the replay inputs but still requires key presses to advance through the boxes, ZUN chose to just skip the dialog altogether in the specific case of the Extra Stage replay being active, and replicated all sprite management commands from the dialog script by just hardcoding them.
And you know what? Not only do I not mind this hack, but I would have preferred it over the actual dialog system! The aforementioned sprite management commands effectively boil down to manual memory management, deallocating all stage enemy and midboss sprites and thus ensuring that the boss sprites end up at specific master.lib sprite IDs (patnums). The hardcoded boss rendering function then expects these sprites to be available at these exact IDs… which means that the otherwise hardcoded bosses can't render properly without the dialog script running before them.
There is absolutely no excuse for the game to burden dialog scripts with this functionality. Sure, delayed deallocation would allow them to blit stage-specific sprites, but the original games don't do that; probably because none of the two games feature an unblitting command. And even if they did, it would have still been cleaner to expose the boss-specific sprite setup as a single script command that can then also be called from game code if the script didn't do so. Commands like these just are a recipe for crashes, especially with parsers that expect fullwidth Shift-JIS text and where misaligned ASCII text can easily cause these commands to be skipped.

But then again, it does make for funny screenshot material if you accidentally the deallocation and then see bosses being turned into stage enemies:

Some of the more amusing consequences of not calling the sprite-deallocating

\c /

0x04 command inside a dialog script.
In the case of 4️⃣, the game then even crashes on this frame at the end of the dialog, in a way that resembles the infamous 📝 TH04 crash before Stage 5 Yuuka if no EMS driver is loaded. Both the stage- and boss-specific BFNT sprites are loaded into memory at this point, leaving no room for the 256×256-pixel background image on the size-limited master.lib heap.

With all the general details out of the way, here's the command reference:


0 1	0x00 0x01	Selects either the player character (0) or the boss (1) as the currently speaking character, and moves the cursor to the beginning of the text box. In TH04, this command also directly starts the new dialog box, which is probably why it's not prefixed with a `\` as it only makes sense outside of text. TH05 requires a separate `0x0D` command to do the same.
\=`1`	0x02 `0x!!`	Replaces the face portrait of the currently active speaking character with image #`1` within her .CD2 file.
\=255	0x02 0xFF	Removes the face portrait from the currently active text box.
\l,`filename`	0x03 `filename` 0x00	Calls master.lib's `super_entry_bfnt()` function, which loads sprites from a BFNT file to consecutive IDs starting at the current patnum write cursor.
\c	0x04	Deallocates all stage-specific BFNT sprites (i.e., stage enemies and midbosses), freeing up conventional RAM for the boss sprites and ensuring that master.lib's patnum write cursor ends up at 128 / 180. In TH05's Extra Stage, this command also replaces 📝 the sprites loaded from `MIKO16.BFT` with the ones from `ST06_16.BFT`.
\d		Deallocates all face portrait images. The game automatically does this at the end of each dialog sequence. However, ZUN wanted to load Stage 6 Yuuka's 76 KiB of additional animations inside the script via `\l`, and would have once again run up against the master.lib heap size limit without that extra free memory.
\m,`filename`	0x05 `filename` 0x00	Stops the currently playing BGM, loads a new one from the given file, and starts playback.
\m$	0x05 `$` 0x00	Stops the currently playing BGM. Note that TH05 interprets `$` as a null-terminated filename as well.
\m*		Restarts playback of the currently loaded BGM from the beginning.
\b`0`,`0`,`0`	0x06 `0x!!!!` `0x!!!!` `0x!!`	Blits the master.lib patnum with the ID indicated by the third parameter to the current VRAM page at the top-left screen position indicated by the first two parameters.
\e`0`		Plays the sound effect with the given ID.
\t`100`		Sets palette brightness via master.lib's `palette_settone()` to any value from 0 (fully black) to 200 (fully white). 100 corresponds to the palette's original colors.
\fo`1` \fi`1`		Calls master.lib's `palette_black_out()` or `palette_black_in()` to play a hardware palette fade animation from or to black, spending roughly `1` frame on each of the 16 fade steps.
\wo`1` \wi`1`	0x09 `0x!!` 0x0A `0x!!`	Calls master.lib's `palette_white_out()` or `palette_white_in()` to play a hardware palette fade animation from or to white, spending roughly `1` frame on each of the 16 fade steps. The TH05 version of `0x09` also clears the text in both boxes before the animation.
\n	0x0B	Starts a new line by resetting the X coordinate of the TRAM cursor to the left edge of the text area and incrementing the Y coordinate. The new line will always be the next one below the last one that was properly started, regardless of whether the text previously wrapped to the next TRAM row at the edge of the screen.
\g`8`		Plays a blocking `8`-frame screen shake animation. Copy-pasted from the cutscene parser, but actually used right at the end of the dialog shown before TH04's Bad Ending.
\ga`0`	0x0C `0x!!`	Shows the gaiji with the given ID from 0 to 255 at the current cursor position, ignoring the per-glyph delay.
\k`0`		Waits `0` frames (0 = forever) for any key to be pressed before continuing script execution.
	0x0D	Starts a new dialog box with the previously selected speaker. All text until the next `0xFF` command will appear on screen. Inside dialogs, this is a no-op.
	0x0E	Takes the current dialog cursor as the top-left corner of a 240×48-pixel rectangle, and replaces all text RAM characters within that rectangle with whitespace. This is only used to clear the player character's text box before Shinki's final いくよ‼ box. Shinki has two consecutive text boxes in all 4 scripts here, and ZUN probably wanted to clear the otherwise blue text to imply a dramatic pause before Shinki's final sentence. Nice touch. (You could, however, also use it after a box-ending `0xFF` command to mess with text RAM in general.)
\#		Quits the currently running loop. This returns from either the text loop to the command loop, or it ends the dialog sequence by returning from the command loop back to gameplay. If this stage of the game later starts another dialog sequence, it will start at the next script byte.
\$		Like `\#`, but first waits for any key to be pressed.
	0xFF	Behaves like TH04's `\$` in the text loop, and like `\#` in the command loop. Hence, it's not possible in TH05 to automatically end a text box and advance to the next one without waiting for a key press.

Unused commands are in gray.

At the end of the day, you might criticize the system for how its landmines make it annoying to mod in ASCII text, but it all works and does what it's supposed to. ZUN could have written the cleanest single and central Shift-JIS iterator that properly chunks a byte buffer into halfwidth and fullwidth codepoints, and I'd still be throwing it out for the upcoming non-ASCII translations in favor of something that either also supports UTF-8 or performs dictionary lookups with a full box of text.
The only actual bug can be found in the input detection, which once again doesn't correctly handle the infamous key up/key down scancode quirk of PC-98 keyboards. All it takes is one wrongly placed input polling call, and suddenly you have to think about how the update cycle behind the PC-98 keyboard state bytes might cause the game to run the regular 2-frame delay for a single 2-byte chunk of text before it shows the full text of a box after all… But even this bug is highly theoretical and could probably only be observed very, very rarely, and exclusively on real hardware.

The same can't be said about TH02 though, but more on that later. Let's first take a look at its data, which started out much simpler in that game. The STAGE?.TXT files contain just raw Shift-JIS text with no trace of commands or structure. Turning on the whitespace display feature in your editor reveals how the dialog system even assumes a fixed byte length for each box: 36 bytes per line which will appear on screen, followed by 4 bytes of padding, which the original files conveniently use to visually split the lines via a CR/LF newline sequence. Make sure to disable trimming of trailing whitespace in your editor to not ruin the file when modding the text…

靈夢：あんた、まだ名前も聞いてないの··
······に覚えられないわよ。・・・・・··
里香：あたいは、里香よ。覚えときなさ··
・・・い。・・・・・・················

Two boxes from TH02's STAGE5.TXT with visualized whitespace. These also demonstrate how the CR/LF newlines only make up 2 of the 4 padding bytes, and require each line to be padded with two more bytes; you could not use these trailing spaces for actual text. Also note how the exquisite mixture of fullwidth and halfwidth spaces demands the text to be viewed with only the most metrically consistent monospace fonts to preserve the intended alignment. 🍷 It appears quite misaligned on my phone.

Consequently, everything else is hardcoded – every effect shown between text boxes, the face portrait shown for each box, and even how many boxes are part of each dialog sequence. Which means that the source code now contains a long hardcoded list of face IDs for most of the text boxes in the game, with the rest being part of the dedicated hardcoded dialog scripts for ²/₃ of the game's stages.
Without the restriction to a fixed set of scripting commands, TH02 naturally gravitated to having the most varied dialog sequences of all PC-98 Touhou games. This flexibility certainly facilitated Mima's grand entrance animation in Stage 4, or the different lines in Stage 4 and 5 depending on whether you already used a continue or not. Marisa's post-boss dialog even inserts the number of continues into the text itself – by, you guessed it, writing to hardcoded byte offsets inside the dialog text before printing it to the screen. But once again, I have nothing to criticize here – not even the fact that the alternate dialog scripts have to mutate the "box cursor" to jump to the intended boxes within the file. I know that some people in my audience like VMs, but I would have considered it more bloated if ZUN had implemented a full-blown scripting language just to handle all these special cases.

Another unique aspect of TH02 is the way it stores its face portraits, which are infamous for how hard they are to find in the original data files. These sprites are actually map tiles, stored in MIKO_K.MPN, and drawn using the same functions used to blit the regular map tiles to the 📝 tile source area in VRAM. We can only guess why ZUN chose this one out of the three graphics formats he used in TH02:

BFNT supports transparency, but sacrifices one of the 16 colors to do so. ZUN only used 15 colors for the face portraits, but might have wanted to keep open the option to use that 16^th color. The detailed backgrounds also suggest that these images were never supposed to be transparent to begin with.
PI is used for all bigger and non-transparent images, but ZUN would have had to write a separate small function to blit a 48×48 subsection of such an image. That certainly wouldn't have stopped him in the TH01 days, but he probably was already past that point by this game.
That only leaves .MPN. Sure, he did have to slice each face into 9 separate 16×16 "map" tiles to use this format, but that's a small price to pay in exchange for not having to write any new low-level blitting code, especially since he must have already had an asset pipeline to generate these files.

TH02's MIKO_K.PTN — Show tile ID grid
TH02's `MIKO_K.PTN`, arranged into a 16×16-tile layout that reveals how these tiles are combined into face portraits.
`MPNDEF` from -Tom-'s MysticTK conveniently uses this exact layout in its .BMP output. Earlier `MPNDEF` versions crashed when converting this file as its 256 tiles led to an 8-bit overflow bug, so make sure you've updated to the current version from the end of October 2023 if you want to convert this file yourself. The format stores the 4 bitplanes of each 16×16 tile in order, so good luck finding a different planar image viewer that would support both such a tiled layout *and* a custom palette. Sometimes, a weird internal format is the best type of obfuscation.

And since you're certainly wondering about all these black tiles at the edges: Yes, these are not only part of the file and pad it from the required 240×192 pixels to 256×256, but also kept in memory during a stage, wasting 9.5 KiB of conventional RAM. That's 172 seconds of potential input replay data, just for those people who might still think that we need EMS for replays.

Alright, we've got the text, we've got the faces, let's slide in the box and display it all on screen. Apparently though, we also have to blit the player and option sprites using raw, low-level master.lib function calls in the process? This can't be right, especially because ZUN always blits the option sprite associated with the Reimu-A shot type, regardless of which one the player actually selected. And if you keep moving above the box area before the dialog starts, you get to see exactly how wrong this is:

Let's look closer at Reimu's sprite during the slide-in animation, and in the two frames before:

Zoomed-in area around Reimu's sprite from frame 35 of the video above

Zoomed-in area around Reimu's sprite from frame 36 of the video above

This one image shows off no less than 4 bugs:

ZUN blits the stationary player sprite here, regardless of whether the player was previously moving left or right. This is a nice way of indicating that Reimu stops moving once the dialog starts, but maybe ZUN should have unblitted the old sprite so that the new one wouldn't have appeared on top. The game only unblits the 384×64 pixels covered by the dialog box on every frame of the slide-in animation, so Reimu would only appear correctly if her sprite happened to be entirely located within that area.
All sprites are shifted up by 1 pixel in frame 2️⃣. This one is not a bug in the dialog system, but in the main game loop. The game runs the relevant actions in the following order:
1. Invalidate any map tiles covered by entities
2. Redraw invalidated tiles
3. Decrement the Y coordinate at the top of VRAM according to the scroll speed
4. Update and render all game entities
5. Scroll in new tiles as necessary according to the scroll speed, and report whether the game has scrolled one pixel past the end of the map
6. If that happened, pretend it didn't by incrementing the value calculated in #3 for all further frames and skipping to #8.
7. Issue a GDC SCROLL command to reflect the line calculated in #3 on the display
8. Wait for VSync
9. Flip VRAM pages
10. Start boss if we're past the end of the map
The problem here: Once the dialog starts, the game has already rendered an entire new frame, with all sprites being offset by a new Y scroll offset, without adjusting the graphics GDC's scroll registers to compensate. Hence, the Y position in 3️⃣ is the correct one, and the whole existence of frame 2️⃣ is a bug in itself. (Well… OK, probably a quirk because speedrunning exists, and it would be pretty annoying to synchronize any video regression tests of the future TH02 Anniversary Edition if it renders one fewer frame in the middle of a stage.)
ZUN blits the option sprites to their position from frame 1️⃣. This brings us back to 📝 TH02's special way of retaining the previous and current position in a two-element array, indexed with a VRAM page ID. Normally, this would be equivalent to using dedicated prev and cur structure fields and you'd just index it with the back page for every rendering call. But if you then decide to go single-buffered for dialogs and render them onto the front page instead…
Note that fixing bug #2 would not cancel out this one – the sprites would then simply be rendered to their position in the frame before 1️⃣.
And of course, the fixed option sprite ID also counts as a bug.

As for the boxes themselves, it's yet another loop that prints 2-byte chunks of Shift-JIS text at an even slower fixed interval of 3 frames. In an interesting quirk though, ZUN assumes that every box starts with the name of the speaking character in its first two fullwidth Shift-JIS characters, followed by a fullwidth colon. These 6 bytes are displayed immediately at the start of every box, without the usual delay. The resulting alignment looks rather janky with Genjii, whose single right-padded 亀 kanji looks quite awkward with the fullwidth space between the name and the colon. Kind of makes you wonder why ZUN just didn't spell out his proper name, 玄爺, instead, but I get the stylistic difference.
In Stage 4, the two-kanji assumption then breaks with Marisa's three-kanji name, which causes the full-width colon to be printed as the first delayed character in each of her boxes:

That's all the issues and quirks in the system itself. The scripts themselves don't leave much room for bugs as they basically just loop over the hardcoded face ID array at this level… until we reach the end of the game. Previously, the slide-in animation could simply use the tile invalidation and re-rendering system to unblit the box on each frame, which also explained why Reimu had to be separately rendered on top. But this no longer works with a custom-rendered boss background, and so the game just chooses to flood-fill the area with graphics chip color #0:

Then again, transferring pixels from the back page would be just as wrong as they lag one frame behind. No way around capturing these 384×64 pixels to main memory here… Oh well, this flood-fill at least adds even more legibility on top of the already half-transparent text box. A property that the following dialog sequence unfortunately lacks…

For Mima's final defeat dialog though, ZUN chose to not even show the box. He might have realized the issue by that point, or simply preferred the more dramatic effect this had on the lines. The resulting issues, however, might even have ramifications for such un-technical things as lore and character dynamics. As it turns out, the code for this dialog sequence does in fact render Mima's smiling face for all boxes?! You only don't see it in the original game because it's rendered to the other VRAM page that remains invisible during the dialog sequence:

Caution, flashing lights.

Here's how I interpret the situation:

The function that launches into the final part of the dialog script starts with dedicated code to re-render Mima to the back page, on top of the previously rendered planet background. Since the entire script runs on the front page (and thus, on top of the previous frame) and the game launches into the ending immediately after, you don't ever get to see this new partial frame in the original game.
Showing this partial frame would also ensure that you can actually read the dialog text without a surrounding box. Then, the white letters won't ever be put on top of any white bullets – or, worse, be completely invisible if the dialog is triggered in the middle of Reimu-B's bomb animation, which fills VRAM with lots of white pixels.
Hence, we've got enough evidence to classify not showing the back page as a ZUN bug. 🐞
However, Mima's smiling face jars with the words she says here. Adding the face would deviate more significantly from the original game than removing the player shot, item, bullet, or spark sprites would. It's imaginable that ZUN just forgot about the dedicated code that re-rendered just Mima to the back page, but the faces add something to the dialog, and ZUN would have clearly noticed and fixed it if their absence wasn't intended. Heck, ZUN might have just put something related to Mima into the code because TH02's dialog system has no way of not drawing a face for a dialog box. Filling the face area with graphics chip color #0, as seen in the first and third boxes of the Extra Stage pre-boss dialog, would have been an alternative, but that would have been equally wrong with regard to the background.
Hence, the invisible face portrait from the original game is a ZUN quirk. 🎺

So, the future TH02 Anniversary Edition will fix the bug by showing the back page, but retain the quirk by rewriting the dialog code to not blit the face.

And with that, we've secured all in-game dialog for the upcoming non-ASCII translations! The remaining ²/₃ of the last push made for a good occasion to also decompile the small amount of code related to TH03's win messages, stored in the @0?TX.TXT files. Similar to TH02's dialog format, these files are also split into fixed-size blocks of 3×60 bytes. But this time, TH03 loads all 60 bytes of a line, including the CR/LF line breaking codepoints in the original files, into the statically allocated buffer that it renders from. These control characters are then only filtered to whitespace by ZUN's graph_putsa_fx() function. If you remove the line breaks, you get to use the full 60 bytes on every line.
The final commits went to the MIKO.CFG loading and saving functions used in TH04's and TH05's OP.EXE, as well as TH04's game startup code to finally catch up with 📝 TH05's counterpart from over 3 years ago. This brought us right in front of the main menu rendering code in both TH04 and TH05, which is identical in both games and will be tackled in the next PC-98 Touhou delivery.

Next up, though: Returning to Shuusou Gyoku, and adding support for SC-88Pro recordings as BGM. Which may or may not come with a slight controversy…

📝 Posted:: 2023-06-30 23:52 UTC
🚚 Summary of:: P0245
⌨ Commits:: 97f0c3b...5876755
💰 Funded by:: Blue Bolt, Ember2528, [Anonymous], Yanga
🏷 Tags:

And then, the supposed boilerplate code revealed yet another confusing issue that quickly forced me back to serial work, leading to no parallel progress made with Shuusou Gyoku after all. 🥲 The list of functions I put together for the first ½ of this push seemed so boring at first, and I was so sure that there was almost nothing I could possibly talk about:

TH02's gaiji animations at the start and end of each stage, resembling opening and closing window blind slats. ZUN should have maybe not defined the regular whitespace gaiji as what's technically the last frame of the closing animation, but that's a minor nitpick. Nothing special there otherwise.
The remaining spawn functions for TH04's and TH05's gather circles. The only dumb antic there is the way ZUN initializes the template for bullets fired at the end of the animation, featuring ASM instructions that are equivalent to what Turbo C++ 4.0J generates for the __memcpy__ intrinsic, but show up in a different order. Which means that they must have been handwritten. I already figured that out in 2022 though, so this was just more of the same.
EX-Alice's override for the game's main 16×16 sprite sheet, loaded during her dialog script. More of a naming and consistency challenge, if anything.

The regular version of TH05's big 16×16 sprite sheet.
EX-Alice's variant of TH05's big 16×16 sprite sheet.
The rendering function for TH04's Stage 4 midboss, which seems to feature the same premature clipping quirk we've seen for 📝 TH05's Stage 5 midboss, 7 months ago?
The rendering function for the big 48×48 explosion sprite, which also features the same clipping quirk?

That's three instances of ZUN removing sprites way earlier than you'd want to, intentionally deciding against those sprites flying smoothly in and out of the playfield. Clearly, there has to be a system and a reason behind it.

Turns out that it can be almost completely blamed on master.lib. None of the super_*() sprite blitting functions can clip the rendered sprite to the edges of VRAM, and much less to the custom playfield rectangle we would actually want here. This is exactly the wrong choice to make for a game engine: Not only is the game developer now stuck with either rendering the sprite in full or not at all, but they're also left with the burden of manually calculating when not to display a sprite.
However, strictly limiting the top-left screen-space coordinate to (0, 0) and the bottom-right one to (640, 400) would actually stop rendering some of the sprites much earlier than the clipping conditions we encounter in these games. So what's going on there?

The answer is a combination of playfield borders, hardware scrolling, and master.lib needing to provide at least some help to support the latter. Hardware scrolling on PC-98 works by dividing VRAM into two vertical partitions along the Y-axis and telling the GDC to display one of them at the top of the screen and the other one below. The contents of VRAM remain unmodified throughout, which raises the interesting question of how to deal with sprites that reach the vertical edges of VRAM. If the top VRAM row that starts at offset 0x0000 ends up being displayed below the bottom row of VRAM that starts at offset 0x7CB0 for 399 of the 400 possible scrolling positions, wouldn't we then need to vertically wrap most of the rendered sprites?
For this reason, master.lib provides the super_roll_*() functions, which unconditionally perform exactly this vertical wrapping. But this creates a new problem: If these functions still can't clip, and don't even know which VRAM rows currently correspond to the top and bottom row of the screen (since master.lib's graph_scrollup() function doesn't retain this information), won't we also see sprites wrapping around the actual edges of the screen? That's something we certainly wouldn't want in a vertically scrolling game…
The answer is yes, and master.lib offers no solution for this issue. But this is where the playfield borders come in, and helpfully cover 16 pixels at the top and 16 pixels at the bottom of the screen. As a result, they can hide up to 32 rows of potentially wrapped sprite pixels below them:

VRAM contents of the first possible frame that TH05's Stage 5 midboss can appear on, at their original scrolling position. Also featuring the 64×64 bounding box of the midboss sprite. — •
The earliest possible frame that TH05 can start rendering the Stage 5 midboss on. Hiding the text layer reveals how master.lib did in fact "blindly" render the top part of her sprite to the bottom of the playfield. That's where her sprite *starts* before it is correctly wrapped around to the top of VRAM.
If we scrolled VRAM by another 200 pixels (and faked an equally shifted TRAM for demonstration purposes), we get an equally valid game scene that points out why a vertically scrolling PC-98 game must wrap all sprites at the vertical edges of VRAM to begin with.
Also, note how the HP bar has filled up quite a bit before the midboss can actually appear on screen.

VRAM contents of the first possible frame that TH05's Stage 5 midboss can appear on, scrolled down by a further 200 pixels. Also featuring the 64×64 bounding box of the midboss sprite. — •
The earliest possible frame that TH05 can start rendering the Stage 5 midboss on. Hiding the text layer reveals how master.lib did in fact "blindly" render the top part of her sprite to the bottom of the playfield. That's where her sprite *starts* before it is correctly wrapped around to the top of VRAM.
If we scrolled VRAM by another 200 pixels (and faked an equally shifted TRAM for demonstration purposes), we get an equally valid game scene that points out why a vertically scrolling PC-98 game must wrap all sprites at the vertical edges of VRAM to begin with.
Also, note how the HP bar has filled up quite a bit before the midboss can actually appear on screen.

And that's how the lowest possible top Y coordinate for sprites blitted using the master.lib super_roll_*() functions during the scrolling portions of TH02, TH04, and TH05 is not 0, but -16. Any lower, and you would actually see some of the sprite's upper pixels at the bottom of the playfield, as there are no more opaque black text cells to cover them. Theoretically, you could lower this number for some animation frames that start with multiple rows of transparent pixels, but I thankfully haven't found any instance of ZUN using such a hack. So far, at least…
Visualized like that, it all looks quite simple and logical, but for days, I did not realize that these sprites were rendered to a scrolling VRAM. This led to a much more complicated initial explanation involving the invisible extra space of VRAM between offsets 0x7D00 and 0x7FFF that effectively grant a hidden additional 9.6 lines below the playfield. Or even above, since PC-98 hardware ignores the highest bit of any offset into a VRAM bitplane segment (& 0x7FFF), which prevents blitting operations from accidentally reaching into a different bitplane. Together with the aforementioned rows of transparent pixels at the top of these midboss sprites, the math would have almost worked out exactly.

The need for manual clipping also applies to the X-axis. Due to the lack of scrolling in this dimension, the boundaries there are much more straightforward though. The minimum left coordinate of a sprite can't fall below 0 because any smaller coordinate would wrap around into the 📝 tile source area and overwrite some of the pixels there, which we obviously don't want to re-blit every frame. Similarly, the right coordinate must not extend into the HUD, which starts at 448 pixels.
The last part might be surprising if you aren't familiar with the PC-98 text chip. Contrary to the CGA and VGA text modes of IBM-compatibles, PC-98 text cells can only use a single color for either their foreground or background, with the other pixels being transparent and always revealing the pixels in VRAM below. If you look closely at the HUD in the images above, you can see how the background of cells with gaiji glyphs is slightly brighter (◼ #100) than the opaque black cells (◼ #000) surrounding them. This rather custom color clearly implies that those pixels must have been rendered by the graphics GDC. If any other sprite was rendered below the HUD, you would equally see it below the glyphs.

So in the end, I did find the clear and logical system I was looking for, and managed to reduce the new clipping conditions down to a set of basic rules for each edge. Unfortunately, we also need a second macro for each edge to differentiate between sprites that are smaller or larger than the playfield border, which is treated as either 32×32 (for super_roll_*()) or 32×16 (for non-"rolling" super_*() functions). Since smaller sprites can be fully contained within this border, the games can stop rendering them as soon as their bottom-right coordinate is no longer seen within the playfield, by comparing against the clipping boundaries with <= and >=. For example, a 16×16 sprite would be completely invisible once it reaches (16, 0), so it would still be rendered at (17, 1). A larger sprite during the scrolling part of a stage, like, say, the 64×64 midbosses, would still be rendered if their top-left coordinate was (0, -16), so ZUN used < and > comparisons to at least get an additional pixel before having to stop rendering such a sprite. Turbo C++ 4.0J sadly can't constant-fold away such a difference in comparison operators.

And for the most part, ZUN did follow this system consistently. Except for, of course, the typical mistakes you make when faced with such manual decisions, like how he treated TH04's Stage 4 midboss as a "small" sprite below 32×32 pixels (it's 64×64), losing that precious one extra pixel. Or how the entire rendering code for the 48×48 boss explosion sprite pretends that it's actually 64×64 pixels large, which causes even the initial transformation into screen space to be misaligned from the get-go. But these are additional bugs on top of the single one that led to all this research.
Because that's what this is, a bug. 🐞 Every resulting pixel boundary is a systematic result of master.lib's unfortunate lack of clipping. It's as much of a bug as TH01's byte-aligned rendering of entities whose internal position is not byte-aligned. In both cases, the entities are alive, simulated, and partake in collision detection, but their rendered appearance doesn't accurately reflect their internal position.
Initially, I classified 📝 the sudden pop-in of TH05's Stage 5 midboss as a quirk because we had no conclusive evidence that this wasn't intentional, but now we do. There have been multiple explanations for why ZUN put borders around the playfield, but master.lib's lack of sprite clipping might be the biggest reason.

And just like byte-aligned rendering, the clipping conditions can easily be removed when porting the game away from PC-98 hardware. That's also what uth05win chose to do: By using OpenGL and not having to rely on hardware scrolling, it can simply place every sprite as a textured quad at its exact position in screen space, and then draw the black playfield borders on top in the end to clip everything in a single draw call. This way, the Stage 5 midboss can smoothly fly into the playfield, just as defined by its movement code:

The entire smooth Stage 5 midboss entrance animation as shown in uth05win. If the simultaneous appearance of the Enemy!! label doesn't lend further proof to this having been ZUN's actual intention, I don't know what will.

Meanwhile, I designed the interface of the 📝 generic blitter used in the TH01 Anniversary Edition entirely around clipping the blitted sprite at any explicit combination of VRAM edges. This was nothing I tacked on in the end, but a core aspect that informed the architecture of the code from the very beginning. You really want to have one and only one place where sprite clipping is done right – and only once per sprite, regardless of how many bitplanes you want to write to.

Which brings us to the goal that the final ¼ of this push went toward. I thought I was going to start cleaning up the 📝 player movement and rendering code, but that turned out too complicated for that amount of time – especially if you want to start with just cleanup, preserving all original bugs for the time being.
Fixing and smoothening player and Orb movement would be the next big task in Anniversary Edition development, needing about 3 pushes. It would start with more performance research into runtime-shifting of larger sprites, followed by extending my generic blitter according to the results, writing new optimized loaders for the original image formats, and finally rewriting all rendering code accordingly. With that code in place, we can then start cleaning up and fixing the unique code for each boss, one by one.

Until that's funded, the code still contains a few smaller and easier pieces of code that are equally related to rendering bugs, but could be dealt with in a more incremental way. Line rendering is one of those, and first needs some refactoring of every call site, including 📝 the rotating squares around Mima and 📝 YuugenMagan's pentagram. So far, I managed to remove another 1,360 bytes from the binary within this final ¼ of a push, but there's still quite a bit to do in that regard.
This is the perfect kind of feature for smaller (micro-)transactions. Which means that we've now got meaningful TH01 code cleanup and Anniversary Edition subtasks at every price range, no matter whether you want to invest a lot or just a little into this goal.

If you can, because Ember2528 revealed the plan behind his Shuusou Gyoku contributions: A full-on Linux port of the game, which will be receiving all the funding it needs to happen. 🐧 Next up, therefore: Turning this into my main project within ReC98 for the next couple of months, and getting started by shipping the long-awaited first step towards that goal.
I've raised the cap to avoid the potential of rounding errors, which might prevent the last needed Shuusou Gyoku push from being correctly funded. I already had to pick the larger one of the two pending TH02 transactions for this push, because we would have mathematically ended up ¹/₂₅₅₀₀ short of a full push with the smaller transaction. And if I'm already at it, I might as well free up enough capacity to potentially ship the complete OpenGL backend in a single delivery, which is currently estimated to cost 7 pushes in total.

📝 Posted:: 2023-06-07 23:17 UTC
🚚 Summary of:: P0242, P0243
⌨ Commits:: 08352a5...dfa758d, dfa758d...ac33bd2
💰 Funded by:: Yanga
🏷 Tags:

OK, let's decompile TH02's HUD code first, gain a solid understanding of how increasing the score works, and then look at the item system of this game. Should be no big deal, no surprises expected, let's go!

…Yeah, right, that's never how things end up in ReC98 land. And so, we get the usual host of newly discovered oddities in addition to the expected insights into the item mechanics. Let's start with the latter:

Some regular stage enemies appear to randomly drop either or items. In reality, there is very little randomness at play here: These items are picked from a hardcoded, repeating ring of 10 items (𝄆 𝄇), and the only source of randomness is the initial position within this ring, which changes at the beginning of every stage. ZUN further increased the illusion of randomness by only dropping such a semi-random item for every 3^rd defeated enemy that is coded to drop one, and also having enemies that drop fixed, non-random items. I'd say it's a decent way of ensuring both randomness and balance.
- There's a ¹/₅₁₂ chance for such a semi-random item drop to turn into a item instead – which translates to ¹/₁₅₃₆ enemies due to the fixed drop rate.
- Edit (2023-06-11): These are the only ways that items can randomly drop in this game. All other drops, including any items, are scripted and deterministic.
- After using a continue (both after a Game Over, or after manually choosing to do so through the Pause menu for whatever reason), the next (Stage number + 1) semi-random item drops are turned into items instead.

Items can contribute up to 25 points to the skill value and subsequent rating (あなたの腕前) on the final verdict screen. Doing well at item collection first increases a separate collect_skill value:

Item	Collection condition	`collect_skill` change
	below max power	+1
	at or above max power	+2
	value == 51,200	+8
	value ≥20,000 and <51,200	+4
	value ≥10,000 and <20,000	+2
	value <10,000	+1
	with 5 bombs in stock	+16

Note, again, the lack of anything involving

items. At the maximum of 5 lives, the item spawn function transforms them into bomb items anyway. It is possible though to gain the 5^th life by reaching one of the extend scores while a

item is still on screen; in that case, collecting the 1-up has no effect at all.

Every 32 collect_skill points will then raise the item_skill by 1, whereas every 16 dropped items will lower it by 1. Before launching into the ending sequence, item_skill is clamped to the [0; 25] range and added to the other skill-relevant metrics we're going to look at in future pushes.

When losing a life, the game will drop a single and 4 randomly picked or items in a random order around Reimu's position. Contrary to an unsourced Touhou Wiki edit from 2009, each of the 4 does have an equal and independent chance of being either a or item.
Finally, and perhaps most interestingly, item values! These are determined by the top Y coordinate of an item during the frame it is collected on. The maximum value of 51,200 points applies to the top 48 pixels of the playfield, and drops off as soon as an item falls below that line. For the rest of the playfield, point items then use a formula of (28,000 - (top Y coordinate of item in screen space × 70)):

Point items and their collection value in TH02. The numbers correspond to items that are collected while their top Y coordinate matches the line they are directly placed on. The upper item in the image would therefore give 23,450 points if the player collected it at that specific position.
Reimu collects any item whose 16×16 bounding box lies fully within the red 48×40 hitbox. Note that the box isn't cut off in this specific case: At Reimu's lowest possible position on the playfield, the lowest 8 pixels of her sprite are clipped, but the item hitbox still happens to end exactly at the bottom of the playfield. Since an item's Y velocity accelerates on every frame, it's entirely possible to collect a point item at the lowest value of 2,240 points, on the exact frame before it falls below the collection hitbox.

Onto score tracking then, which only took a single commit to raise another big research question. It's widely known that TH02 grants extra lives upon reaching a score of 1, 2, 3, 5, or 8 million points. But what hasn't been documented is the fact that the game does not stop at the end of the hardcoded extend score array. ZUN merely ends it with a sentinel value of 999,999,990 points, but if the score ever increased beyond this value, the game will interpret adjacent memory as signed 32-bit score values and continue giving out extra lives based on whatever thresholds it ends up finding there. Since the following bytes happen to turn into a negative number, the next extra life would be awarded right after gaining another 10 points at exactly 1,000,000,000 points, and the threshold after that would be 11,114,905,600 points. Without an explicit counterstop, the number of score-based extra lives is theoretically unlimited, and would even continue after the signed 32-bit value overflowed into the negative range. Although we certainly have bigger problems once scores ever reach that point…
That said, it seems impossible that any of this could ever happen legitimately. The current high scores of 42,942,800 points on Lunatic and 42,603,800 points on Extra don't even reach ¹/₂₀ of ZUN's sentinel value. Without either a graze or a bullet cancel system, the scoring potential in this game is fairly limited, making it unlikely for high scores to ever increase by that additional order of magnitude to end up anywhere near the 1 billion mark.
But can we really be sure? Is this a landmine because it's impossible to ever reach such high scores, or is it a quirk because these extends could be observed under rare conditions, perhaps as the result of other quirks? And if it's the latter, how many of these adjacent bytes do we need to preserve in cleaned-up versions and ports? We'd pretty much need to know the upper bound of high scores within the original stage and boss scripts to tell. This value should be rather easy to calculate in a game with such a simple scoring system, but doing that only makes sense after we RE'd all scoring-related code and could efficiently run such simulations. It's definitely something we'd need to look at before working on this game's debloated version in the far future, which is when the difference between quirks and landmines will become relevant. Still, all that uncertainty just because ZUN didn't restrict a loop to the size of the extend threshold array…

TH02 marks a pivotal point in how the PC-98 Touhou games handle the current score. It's the last game to use a 32-bit variable before the later games would regrettably start using arrays of binary-coded decimals. More importantly though, TH02 is also the first game to introduce the delayed score counting animation, where the displayed score intentionally lags behind and gradually counts towards the real one over multiple frames. This could be implemented in one of two ways:

Keep the displayed score as a separate variable inside the presentation layer, and let it gradually count up to the real score value passed in from the logic layer
Burden the game logic with this presentation detail, and split the score into two variables: One for the displayed score, and another for the delta between that score and the actual one. Newly gained points are first added to the delta variable, and then gradually subtracted from there and added to the real score before being displayed.

And by now, we can all tell which option ZUN picked for the rest of the PC-98 games, even if you don't remember 📝 me mentioning this system last year. 📝 Once again, TH02 immortalized ZUN's initial attempt at the concept, which lacks the abstraction boundaries you'd want for managing this one piece of state across two variables, and messes up the abstractions it does have. In addition to the regular score transfer/render function, the codebase therefore has

a function that transfers the current delta to the score immediately, but does not re-render the HUD, and
a function that adds the delta to the score and re-renders the HUD, but does not reset the delta.

And – you guessed it – I wouldn't have mentioned any of this if it didn't result in one bug and one quirk in TH02. The bug resulting from 1) is pretty minor: The function is called when losing a life, and simply stops any active score-counting animation at the value rendered on the frame where the player got hit. This one is only a rendering issue – no points are lost, and you just need to gain 10 more for the rendered value to jump back up to its actual value. You'll probably never notice this one because you're likely busy collecting the single spawned around Reimu when losing a life, which always awards at least 10 points.

The quirk resulting from 2) is more intriguing though. Without a separate reset of the score delta, the function effectively awards the current delta value as a one-time point bonus, since the same delta will still be regularly transferred to the score on further game frames.
This function is called at the start of every dialog sequence. However, TH02 stops running the regular game loop between the post-boss dialog and the next stage where the delta is reset, so we can only observe this quirk for the pre-boss sequences and the dialog before Mima's form change. Unfortunately, it's not all too exploitable in either case: Each of the pre-boss dialog sequences is preceded by an ungrazeable pellet pattern and followed by multiple seconds of flying over an empty playfield with zero scoring opportunities. By the time the sequence starts, the game will have long transferred any big score delta from max-valued point items. It's slightly better with Mima since you can at least shoot her and use a bomb to keep the delta at a nonzero value, but without a health bar, there is little indication of when the dialog starts, and it'd be long after Mima gave out her last bonus items in any case.
But two of the bosses – that is, Rika, and the Five Magic Stones – are scrolled onto the playfield as part of the stage script, and can also be hit with player shots and bombs for a few seconds before their dialog starts. While I'll only get to cover shot types and bomb damage within the next few TH02 pushes, there is an obvious initial strategy for maximizing the effect of this quirk: Spreading out the A-Type / Wide / High Mobility shot to land as many hits as possible on all Five Magic Stones, while firing off a bomb.

Turns out that the infamous button-mashing mechanics of the player shot are also more complicated than simply pressing and releasing the Shot key at alternating frames. Even this result took way too many takes.

Wow, a grand total of 1,750 extra points! Totally worth wasting a bomb for… yeah, probably not. But at the very least, it's something that a TAS score run would want to keep in mind. And all that just because ZUN "forgot" a single score_delta = 0; assignment at the end of one function…

And that brings TH02 over the 30% RE mark! Next up: 100% position independence for TH04. If anyone wants to grab the that have now been freed up in the cap: Any small Touhou-related task would be perfect to round out that upcoming TH04 PI delivery.

📝 Posted:: 2023-03-30 13:23 UTC
🚚 Summary of:: P0235, P0236, P0237
⌨ Commits:: e7a9262...62c4b7f, 62c4b7f...7fa9038, 7fa9038...c5e51e6
💰 Funded by:: Ember2528, Yanga
🏷 Tags:

So, TH02! Being the only game whose main binary hadn't seen any dedicated attention ever, we get to start the TH02-related blog posts at the very beginning with the most foundational pieces of code. The stage tile system is the best place to start here: It not only blocks every entity that is rendered on top of these tiles, but is curiously placed right next to master.lib code in TH02, and would need to be separated out into its own translation unit before we can do the same with all the master.lib functions.

In late 2018, I already RE'd 📝 TH04's and TH05's stage tile implementation, but haven't properly documented it on this blog yet, so this post is also going to include the details that are unique to those games. On a high level, the stage tile system works identically in all three games:

The tiles themselves are 16×16 pixels large, and a stage can use 100 of them at the same time.
The optimal way of blitting tiles would involve VRAM-to-VRAM copies within the same page using the EGC, and that's exactly what the games do. All tiles are stored on both VRAM pages within the rightmost 64×400 pixels of the screen just right next to the HUD, and you only don't see them because the games cover the same area in text RAM with black cells:

The initial screen of TH02's Stage 1, with the tile source area uncovered by filling the same area in text RAM with transparent cells instead of black ones. In TH02, this also reveals how the tile area ends with a bunch of glitch tiles, tinted blue in the image. These are the result of ZUN unconditionally blitting 100 tile images every time, regardless of how many are actually contained in an .MPN file.
These glitch tiles are another good example of a ZUN landmine. Their appearance is the result of reading heap memory outside allocated boundaries, which can easily cause segmentation faults when porting the game to a system with virtual memory. Therefore, these would not just be removed in this game's Anniversary Edition, but on the more conservative debloated branch as well. Since the game never uses these tiles and you can't observe them unless you manipulate text RAM from outside the confines of the game, it's not a bug according to our definition.
To reduce the memory required for a map, tiles are arranged into fixed vertical sections of a game-specific constant size.
The 6 24×8-tile sections defined in TH02's STAGE0.MAP, in reverse order compared to how they're defined in the file. Note the duplicated row at the top of the final section: The boss fight starts once the game scrolled the last full row of tiles onto the top of the screen, not the playfield. But since the PC-98 text chip covers the top tile row of the screen with black cells, this final row is never visible, which effectively reduces a map's final tile section to 7 rows rather than 8.
The actual stage map then is simply a list of these tile sections, ordered from the start/bottom to the top/end.
Any manipulation of specific tiles within the fixed tile sections has to be hardcoded. An example can be found right in Stage 1, where the Shrine Tank leaves track marks on the tiles it appears to drive over:

This video also shows off the two issues with Touhou's first-ever midboss: The replaced tiles are rendered below the midboss during their first 4 frames, and maybe ZUN should have stopped the tile replacements one row before the timeout. The first one is clearly a bug, but it's not so clear-cut with the second one. I'd need to look at the code to tell for sure whether it's a quirk or a bug.

The differences between the three games can best be summarized in a table:

	TH02	TH04	TH05
Tile image file extension	.MPN
Tile section format	.MAP
Tile section order defined as part of	.DT1	.STD
Tile section index format	0-based ID		0-based ID × 2
Tile image index format	Index between 0 and 100, 1 byte	VRAM offset in tile source area, 2 bytes
Scroll speed control	Hardcoded	Part of the .STD format, defined per referenced tile section
Redraw granularity	Full tiles (16×16)	Half tiles (16×8)
Rows per tile section	8	5
Maximum number of tile sections	16	32
Lowest number of tile sections used	5 (Stage 3 / Extra)	8 (Stage 6)	11 (Stage 2 / 4)
Highest number of tile sections used	13 (Stage 4)	19 (Extra)	24 (Stage 3)
Maximum length of a map	320 sections (static buffer)	256 sections (format limitation)
Shortest map	14 sections (Stage 5)	20 sections (Stage 5)	15 sections (Stage 2)
Longest map	143 sections (Stage 4)	95 sections (Stage 4)	40 sections (Stage 1 / 4 / Extra)

The most interesting part about stage tiles is probably the fact that some of the .MAP files contain unused tile sections. 👀 Many of these are empty, duplicates, or don't really make sense, but a few are unique, fit naturally into their respective stage, and might have been part of the map during development. In TH02, we can find three unused sections in Stage 5:

Section 0 of TH02's STAGE4.MAP — The non-empty tile sections defined in TH02's `STAGE4.MAP`, showing off three unused ones.

Section 1 of TH02's STAGE4.MAP — The non-empty tile sections defined in TH02's `STAGE4.MAP`, showing off three unused ones.

These unused tile sections are much more common in the later games though, where we can find them in TH04's Stage 3, 4, and 5, and TH05's Stage 1, 2, and 4. I'll document those once I get to finalize the tile rendering code of these games, to leave some more content for that blog post. TH04/TH05 tile code would be quite an effective investment of your money in general, as most of it is identical across both games. Or how about going for a full-on PC-98 Touhou map viewer and editor GUI?

Compared to TH04 and TH05, TH02's stage tile code definitely feels like ZUN was just starting to understand how to pull off smooth vertical scrolling on a PC-98. As such, it comes with a few inefficiencies and suboptimal implementation choices:

The redraw flag for each tile is stored in a 24×25 bool array that does nothing with 7 of the 8 bits.
During bombs and the Stage 4, 5, and Extra bosses, the game disables the tile system to render more elaborate backgrounds, which require the playfield to be flood-filled with a single color on every frame. ZUN uses the GRCG's RMW mode rather than TDW mode for this, leaving almost half of the potential performance on the table for no reason. Literally, changing modes only involves changing a single constant.
The scroll speed could theoretically be changed at any time. However, the function that scrolls in new stage tiles can only ever blit part of a single tile row during every call, so it's up to the caller to ensure that scrolling always ends up on an exact 16-pixel boundary. TH02 avoids this problem by keeping the scroll speed constant across a stage, using 2 pixels for Stage 4 and 1 pixel everywhere else.
Since the scroll speed is given in pixels, the slowest speed would be 1 pixel per frame. To allow the even slower speeds seen in the final game, TH02 adds a separate scroll interval variable that only runs the scroll function every 𝑛th frame, effectively adding a prescaler to the scroll speed. In TH04 and TH05, the speed is specified as a Q12.4 value instead, allowing true fractional speeds at any multiple of ¹/₁₆ pixels. This also necessitated a fixed algorithm that correctly blits tile lines from two rows.
Finally, we've got a few inconsistencies in the way the code handles the two VRAM pages, which cause a few unnecessary tiles to be rendered to just one of the two pages. Mentioning that just in case someone tries to play this game with a fully cleared text RAM and wonders where the flickering tiles come from.

Even though this was ZUN's first attempt at scrolling tiles, he already saw it fit to write most of the code in assembly. This was probably a reaction to all of TH01's performance issues, and the frame rate reduction workarounds he implemented to keep the game from slowing down too much in busy places. "If TH01 was all C++ and slow, TH02 better contain more ASM code, and then it will be fast, right?"
Another reason for going with ASM might be found in the kind of documentation that may have been available to ZUN. Last year, the PC-98 community discovered and scanned two new game programming tutorial books from 1991 (1, 2). Their example code is not only entirely written in assembly, but restricts itself to the bare minimum of x86 instructions that were available on the 8086 CPU used by the original PC-9801 model 9 years earlier. Such code is not only suboptimal on the 486, but can often be actually worse than what your C++ compiler would generate. TH02 is where the trend of bad hand-written ASM code started, and it 📝 only intensified in ZUN's later games. So, don't copy code from these books unless you absolutely want to target the earlier 8086 and 286 models. Which, 📝 as we've gathered from the recent blitting benchmark results, are not all too common among current real-hardware owners.
That said, all that ASM code really only impacts readability and maintainability. Apart from the aforementioned issues, the algorithms themselves are mostly fine – especially since most EGC and GRCG operations are decently batched this time around, in contrast to TH01.

Luckily, the tile functions merely use inline assembly within a typical C function and can therefore be at least part of a C++ source file, even if the result is pretty ugly. This time, we can actually be sure that they weren't written directly in a .ASM file, because they feature x86 instruction encodings that can only be generated with Turbo C++ 4.0J's inline assembler, not with TASM. The same can't unfortunately be said about the following function in the same segment, which marks the tiles covered by the spark sprites for redrawing. In this one, it took just one dumb hand-written ASM inconsistency in the function's epilog to make the entire function undecompilable.
The standard x86 instruction sequence to set up a stack frame in a function prolog looks like this:

PUSH	BP
MOV 	BP, SP
SUB 	SP, ?? ; if the function needs the stack for local variables

When compiling without optimizations, Turbo C++ 4.0J will replace this sequence with a single ENTER instruction. That one is two bytes smaller, but much slower on every x86 CPU except for the 80186 where it was introduced.

In functions without local variables, BP and SP remain identical, and a single POP BP is all that's needed in the epilog to tear down such a stack frame before returning from the function. Otherwise, the function needs an additional

MOV SP,
	BP

instruction to pop all local variables. With x86 being the helpful CISC architecture that it is, the 80186 also introduced the LEAVE instruction to perform both tasks. Unlike ENTER, this single instruction is faster than the raw two instructions on a lot of x86 CPUs (and even current ones!), and it's always smaller, taking up just 1 byte instead of 3.
So what if you use LEAVE even if your function doesn't use local variables?

The fact that the instruction first does the equivalent of MOV SP, BP doesn't matter if these registers are identical, and who cares about the additional CPU cycles of LEAVE compared to just POP BP, right? So that's definitely something you could theoretically do, but not something that any compiler would ever generate.

And so, TH02 MAIN.EXE decompilation already hits the first brick wall after two pushes. Awesome! Theoretically, we could slowly mash through this wall using the 📝 code generator. But having such an inconsistency in the function epilog would mean that we'd have to keep Turbo C++ 4.0J from emitting any epilog or prolog code so that we can write our own. This means that we'd once again have to hide any use of the SI and DI registers from the compiler… and doing that requires code generation macros for 22 of the 49 instructions of the function in question, almost none of which we currently have. So, this gets quite silly quite fast, especially if we only need to do it for one single byte.

Instead, wouldn't it be much better if we had a separate build step between compile and link time that allowed us to replicate mistakes like these by just patching the compiled .OBJ files? These files still contain the names of exported functions for linking, which would allow us to look up the code of a function in a robust manner, navigate to specific instructions using a disassembler, replace them, and write the modified .OBJ back to disk before linking. Such a system could then naturally expand to cover all other decompilation issues, culminating in a full-on optimizer that could even recreate ZUN's self-modifying code. At that point, we would have sealed away all of ZUN's ugly ASM code within a separate build step, and could finally decompile everything into readable C++.

Pulling that off would require a significant tooling investment though. Patching that one byte in TH02's spark invalidation function could be done within 1 or 2 pushes, but that's just one issue, and we currently have 32 other .ASM files with undecompilable code. Also, note that this is fundamentally different from what we're doing with the debloated branch and the Anniversary Editions. Mistake patching would purely be about having readable code on master that compiles into ZUN's exact binaries, without fixing weird code. The Anniversary Editions go much further and rewrite such code in a much more fundamental way, improving it further than mistake patching ever could.
Right now, the Anniversary Editions seem much more popular, which suggests that people just want 100% RE as fast as possible so that I can start working on them. In that case, why bother with such undecompilable functions, and not just leave them in raw and unreadable x86 opcode form if necessary… But let's first see how much backer support there actually is for mistake patching before falling back on that.

The best part though: Once we've made a decision and then covered TH02's spark and particle systems, that was it, and we will have already RE'd all ZUN-written PC-98-specific blitting code in this game. Every further sprite or shape is rendered via master.lib, and is thus decently abstracted. Guess I'll need to update 📝 the assessment of which PC-98 Touhou game is the easiest to port, because it sure isn't TH01, as we've seen with all the work required for the first Anniversary Edition build.

Until then, there are still enough parts of the game that don't use any of the remaining few functions in the _TEXT segment. Previously, I mentioned in the 📝 status overview blog post that TH02 had a seemingly weird sprite system, but the spark and point popup ( 〇一二三四五六七八九十×÷ ) structures showed that the game just stores the current and previous position of its entities in a slightly different way compared to the rest of PC-98 Touhou. Instead of having dedicated structure fields, TH02 uses two-element arrays indexed with the active VRAM page. Same thing, and such a pattern even helps during RE since it's easy to spot once you know what to look for.
There's not much to criticize about the point popup system, except for maybe a landmine that causes sprite glitches when trying to display more than 99,990 points. Sadly, the final push in this delivery was rounded out by yet another piece of code at the opposite end of the quality spectrum. The particle and smear effects for Reimu's bomb animations consist almost entirely of assembly bloat, which would just be replaced with generic calls to the generic blitter in this game's future Anniversary Edition.

If I continue to decompile TH02 while avoiding the brick wall, items would be next, but they probably require two pushes. Next up, therefore: Integrating Stripe as an alternative payment provider into the order form. There have been at least three people who reported issues with PayPal, and Stripe has been working much better in tests. In the meantime, here's a temporary Stripe order link for everyone. This one is not connected to the cap yet, so please make sure to stay within whatever value is currently shown on the front page – I will treat any excess money as donations. If there's some time left afterward, I might also add some small improvements to the TH01 Anniversary Edition.

📝 Posted:: 2022-11-30 22:20 UTC
🚚 Summary of:: P0223, P0224, P0225
⌨ Commits:: 139746c...371292d, 371292d...8118e61, 8118e61...4f85326
💰 Funded by:: rosenrose, Blue Bolt, Splashman, -Tom-, Yanga, Enderwolf, 32th System
🏷 Tags:

More than three months without any reverse-engineering progress! It's been way too long. Coincidentally, we're at least back with a surprising 1.25% of overall RE, achieved within just 3 pushes. The ending script system is not only more or less the same in TH04 and TH05, but actually originated in TH03, where it's also used for the cutscenes before stages 8 and 9. This means that it was one of the final pieces of code shared between three of the four remaining games, which I got to decompile at roughly 3× the usual speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The Music Room is largely equivalent in all three remaining games as well, and the sound device selection, ZUN Soft logo screens, and main/option menus are the same in TH04 and TH05. A lot of that code is in the "technically RE'd but not yet decompiled" ASM form though, so it would shift Finalized% more significantly than RE%. Therefore, make sure to order the new Finalization option rather than Reverse-engineering if you want to make number go up.

General overview
Game-specific differences
Command reference
Thoughts about translation support

So, cutscenes. On the surface, the .TXT files look simple enough: You directly write the text that should appear on the screen into the file without any special markup, and add commands to define visuals, music, and other effects at any place within the script. Let's start with the basics of how text is rendered, which are the same in all three games:

First off, the text area has a size of 480×64 pixels. This means that it does not correspond to the tiled area painted into TH05's EDBK?.PI images:
The yellow area is designated for character names.
Since the font weight can be customized, all text is rendered to VRAM. This also includes gaiji, despite them ignoring the font weight setting.
The system supports automatic line breaks on a per-glyph basis, which move the text cursor to the beginning of the red text area. This might seem like a piece of long-forgotten ancient wisdom at first, considering the absence of automatic line breaks in Windows Touhou. However, ZUN probably implemented it more out of pure necessity: Text in VRAM needs to be unblitted when starting a new box, which is way more straightforward and performant if you only need to worry about a fixed area.
The system also automatically starts a new (key press-separated) text box after the end of the 4^th line. However, the text cursor is also unconditionally moved to the top-left corner of the yellow name area when this happens, which is almost certainly not what you expect, given that automatic line breaks stay within the red area. A script author might as well add the necessary text box change commands manually, if you're forced to anticipate the automatic ones anyway…
Due to ZUN forgetting an unblitting call during the TH05 refactoring of the box background buffer, this feature is even completely broken in that game, as any new text will simply be blitted on top of the old one:
Wait, why are we already talking about game-specific differences after all? Also, note how the ⏎ animation appears one line below where you'd expect it.
Overall, the system is geared toward exclusively full-width text. As exemplified by the 2014 static English patches and the screenshots in this blog post, half-width text is possible, but comes with a lot of asterisks attached:
- Each loop of the script interpreter starts by looking at the next byte to distinguish commands from text. However, this step also skips over every ASCII space and control character, i.e., every byte ≤ 32. If you only intend to display full-width glyphs anyway, this sort of makes sense: You gain complete freedom when it comes to the physical layout of these script files, and it especially allows commands to be freely separated with spaces and line breaks for improved readability. Still, enforcing commands to be separated exclusively by line breaks might have been even better for readability, and would have freed up ASCII spaces for regular text…
- Non-command text is blindly processed and rendered two bytes at a time. The rendering function interprets these bytes as a Shift-JIS string, so you can use half-width characters here. While the second byte can even be an ASCII 0x20 space due to the parser's blindness, all half-width characters must still occur in pairs that can't be interrupted by commands:
- As a workaround for at least the ASCII space issue, you can replace them with any of the unassigned Shift-JIS lead bytes – 0x80, 0xA0, or anything between 0xF0 and 0xFF inclusive. That's what you see in all screenshots of this post that display half-width spaces.
Finally, did you know that you can hold ESC to fast-forward through these cutscenes, which skips most frame delays and reduces the rest? Due to the blocking nature of all commands, the ESC key state is only updated between commands or 2-byte text groups though, so it can't interrupt an ongoing delay.

Superficially, the list of game-specific differences doesn't look too long, and can be summarized in a rather short table:

	TH03	TH04	TH05
Script size limit	65536 bytes (heap-allocated)	8192 bytes (statically allocated)
Delay between every 2 bytes of text	1 frame by default, customizable via `\v`	None
Text delay when holding `ESC`	Varying speed-up factor	None
Visibility of new text	Immediately typed onto the screen	Rendered onto invisible VRAM page, faded in on wait commands
Visibility of old text	Unblitted when starting a new box		Left on screen until crossfaded out with new text
Key binding for advancing the script	Any key		`⏎ Return`, `Shot`, or `ESC`
Animation while waiting for an advance key	None		, past right edge of current row
Inexplicable delays	None		1 frame before changing pictures and after rendering new text boxes
Additional delay per interpreter loop	614.4 µs	None	614.4 µs

The 614.4 µs correspond to the necessary delay for working around the repeated key up and key down events sent by PC-98 keyboards when holding down a key. While the absence of this delay significantly speeds up TH04's interpreter, it's also the reason why that game will stop recognizing a held ESC key after a few seconds, requiring you to press it again.

It's when you get into the implementation that the combined three systems reveal themselves as a giant mess, with more like 56 differences between the games. Every single new weird line of code opened up another can of worms, which ultimately made all of this end up with 24 pieces of bloat and 14 bugs. The worst of these should be quite interesting for the general PC-98 homebrew developers among my audience:

The final official 0.23 release of master.lib has a bug in graph_gaiji_put*(). To calculate the JIS X 0208 code point for a gaiji, it is enough to ADD 5680h onto the gaiji ID. However, these functions accidentally use ADC instead, which incorrectly adds the x86 carry flag on top, causing weird off-by-one errors based on the previous program state. ZUN did fix this bug directly inside master.lib for TH04 and TH05, but still needed to work around it in TH03 by subtracting 1 from the intended gaiji ID. Anyone up for maintaining a bug-fixed master.lib repository?
The worst piece of bloat comes from TH03 and TH04 needlessly switching the visibility of VRAM pages while blitting a new 320×200 picture. This makes it much harder to understand the code, as the mere existence of these page switches is enough to suggest a more complex interplay between the two VRAM pages which doesn't actually exist. Outside this visibility switch, page 0 is always supposed to be shown, and page 1 is always used for temporarily storing pixels that are later crossfaded onto page 0. This is also the only reason why TH03 has to render text and gaiji onto both VRAM pages to begin with… and because TH04 doesn't, changing the picture in the middle of a string of text is technically bugged in that game, even though you only get to temporarily see the new text on very underclocked PC-98 systems.
These performance implications made me wonder why cutscenes even bother with writing to the second VRAM page anyway, before copying each crossfade step to the visible one. 📝 We learned in June how costly EGC-"accelerated" inter-page copies are; shouldn't it be faster to just blit the image once rather than twice?
Well, master.lib decodes .PI images into a packed-pixel format, and unpacking such a representation into bitplanes on the fly is just about the worst way of blitting you could possibly imagine on a PC-98. EGC inter-page copies are already fairly disappointing at 42 cycles for every 16 pixels, if we look at the i486 and ignore VRAM latencies. But under the same conditions, packed-pixel unpacking comes in at 81 cycles for every 8 pixels, or almost 4× slower. On lower-end systems, that can easily sum up to more than one frame for a 320×200 image. While I'd argue that the resulting tearing could have been an acceptable part of the transition between two images, it's understandable why you'd want to avoid it in favor of the pure effect on a slower framerate.
Really makes me wonder why master.lib didn't just directly decode .PI images into bitplanes. The performance impact on load times should have been negligible? It's such a good format for the often dithered 16-color artwork you typically see on PC-98, and deserves better than master.lib's implementation which is both slow to decode and slow to blit.

That brings us to the individual script commands… and yes, I'm going to document every single one of them. Some of their interactions and edge cases are not clear at all from just looking at the code.

Almost all commands are preceded by… well, a 0x5C lead byte. Which raises the question of whether we should document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded ¥ yen sign. From a gaijin perspective, it seems obvious that it's a backslash, as it's consistently displayed as one in most of the editors you would actually use nowadays. But interestingly, iconv -f shift-jis -t utf-8 does convert any 0x5C lead bytes to actual ¥ U+00A5 YEN SIGN code points .
Ultimately, the distinction comes down to the font. There are fonts that still render 0x5C as ¥, but mainly do so out of an obvious concern about backward compatibility to JIS X 0201, where this mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho, the old Japanese fonts from Windows 3.1, but even Meiryo and Yu Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much every other modern font, and freely licensed ones in particular, render this code point as \, even if you set your editor to Shift-JIS. And while ZUN most definitely saw it as a ¥, documenting this code point as \ is less ambiguous in the long run. It can only possibly correspond to one specific code point in either Shift-JIS or UTF-8, and will remain correct even if we later mod the cutscene system to support full-blown Unicode.

Now we've only got to clarify the parameter syntax, and then we can look at the big table of commands:

Numeric parameters are read as sequences of up to 3 ASCII digits. This limits them to a range from 0 to 999 inclusive, with 000 and 0 being equivalent. Because there's no further sentinel character, any further digit from the 4^th one onwards is interpreted as regular text.
Filename parameters must be terminated with a space or newline and are limited to 12 characters, which translates to 8.3 basenames without any directory component. Any further characters are ignored and displayed as text as well.
Each .PI image can contain up to four 320×200 pictures ("quarters") for the cutscene picture area. In the script commands, they are numbered like this:

0 1

2 3

			\@	Clears both VRAM pages by filling them with VRAM color 0. 🐞 In TH03 and TH04, this command does not update the internal text area background used for unblitting. This bug effectively restricts usage of this command to either the beginning of a script (before the first background image is shown) or its end (after no more new text boxes are started). See the image below for an example of using it anywhere else.
			\b`2`	Sets the font weight to a value between 0 (raw font ROM glyphs) to 3 (very thicc). Specifying any other value has no effect.
			\b`2`	🐞 In TH04 and TH05, `\b3` leads to glitched pixels when rendering half-width glyphs due to a bug in the newly micro-optimized ASM version of `📝 graph_putsa_fx()`; see the image below for an example. In these games, the parameter also directly corresponds to the `graph_putsa_fx()` effect function, removing the sanity check that was present in TH03. In exchange, you can also access the four dissolve masks for the bold font (`\b2`) by specifying a parameter between `4` (fewest pixels) to `7` (most pixels). Demo video below.
			\c`15`	Changes the text color to VRAM color `15`.
			\c=`字`,`15`	Adds a color map entry: If `字` is the first code point inside the name area on a new line, the text color is automatically set to `15`. Up to 8 such entries can be registered before overflowing the statically allocated buffer. 🐞 The comma is assumed to be present even if the color parameter is omitted.
			\e`0`	Plays the sound effect with the given ID.
			\f	(no-op)
			\fi`1` \fo`1`	Calls master.lib's `palette_black_in()` or `palette_black_out()` to play a hardware palette fade animation from or to black, spending roughly `1` frame on each of the 16 fade steps.
			\fm`1`	Fades out BGM volume via PMD's `AH=02h` interrupt call, in a non-blocking way. The fade speed can range from `1` (slowest) to `127` (fastest). Values from `128` to `255` technically correspond to `AH=02h`'s fade-in feature, which can't be used from cutscene scripts because it requires BGM volume to first be lowered via `AH=19h`, and there is no command to do that.
			\g`8`	Plays a blocking `8`-frame screen shake animation.
			\ga`0`	Shows the gaiji with the given ID from 0 to 255 at the current cursor position. Even in TH03, gaiji always ignore the text delay interval configured with `\v`.
			@`3`	TH05's replacement for the `\ga` command from TH03 and TH04. The default ID of `3` corresponds to the gaiji. Not to be confused with `\@`, which starts with a backslash, unlike this command.
			@h	Shows the gaiji.
			@t	Shows the gaiji.
			@!	Shows the gaiji.
			@?	Shows the gaiji.
			@!!	Shows the gaiji.
			@!?	Shows the gaiji.
			\k`0`	Waits `0` frames (0 = forever) for an advance key to be pressed before continuing script execution. Before waiting, TH05 crossfades in any new text that was previously rendered to the invisible VRAM page… 🐞 …but TH04 doesn't, leaving the text invisible during the wait time. As a workaround, `\vp1` can be used before `\k` to immediately display that text without a fade-in animation.
			\m$	Stops the currently playing BGM.
			\m*	Restarts playback of the currently loaded BGM from the beginning.
			\m,`filename`	Stops the currently playing BGM, loads a new one from the given file, and starts playback.
			\n	Starts a new line at the leftmost X coordinate of the box, i.e., the start of the name area. This is how scripts can "change" the name of the currently speaking character, or use the entire 480×64 pixels without being restricted to the non-name area. Note that automatic line breaks already move the cursor into a new line. Using this command at the "end" of a line with the maximum number of 30 full-width glyphs would therefore start a second new line and leave the previously started line empty. If this command moved the cursor into the 5^th line of a box, `\s` is executed afterward, with any of `\n`'s parameters passed to `\s`.
			\p	(no-op)
			\p-	Deallocates the loaded .PI image.
			\p,`filename`	Loads the .PI image with the given file into the single .PI slot available to cutscenes. TH04 and TH05 automatically deallocate any previous image, 🐞 TH03 would leak memory without a manual prior call to `\p-`.
			\pp	Sets the hardware palette to the one of the loaded .PI image.
			\p@	Sets the loaded .PI image as the full-screen 640×400 background image and overwrites both VRAM pages with its pixels, retaining the current hardware palette.
			\p=	Runs `\pp` followed by `\p@`.
			\s`0` \s-	Ends a text box and starts a new one. Fades in any text rendered to the invisible VRAM page, then waits `0` frames (0 = forever) for an advance key to be pressed. Afterward, the new text box is started with the cursor moved to the top-left corner of the name area. `\s-` skips the wait time and starts the new box immediately.
			\t`100`	Sets palette brightness via master.lib's `palette_settone()` to any value from 0 (fully black) to 200 (fully white). 100 corresponds to the palette's original colors. Preceded by a 1-frame delay unless `ESC` is held.
			\v`1`	Sets the number of frames to wait between every 2 bytes of rendered text.
			\v`1`	Sets the number of frames to spend on each of the 4 fade steps when crossfading between old and new text. The game-specific default value is also used before the first use of this command.
			\v`2`
			\vp`0`	Shows VRAM page `0`. Completely useless in TH03 (this game always synchronizes both VRAM pages at a command boundary), only of dubious use in TH04 (for working around a bug in `\k`), and the games always return to their intended shown page before every blitting operation anyway. A debloated mod of this game would just remove this command, as it exposes an implementation detail that script authors should not need to worry about. None of the original scripts use it anyway.
			\w`64`	`\w` and `\wk` wait for the given number of frames `\wm` and `\wmk` wait until PMD has played back the current BGM for the total number of measures, including loops, given in the first parameter, and fall back on calling `\w` and `\wk` with the second parameter as the frame number if BGM is disabled. 🐞 Neither PMD nor MMD reset the internal measure when stopping playback. If no BGM is playing and the previous BGM hasn't been played back for at least the given number of measures, this command will deadlock. Since both TH04 and TH05 fade in any new text from the invisible VRAM page, these commands can be used to simulate TH03's typing effect in those games. Demo video below. Contrary to `\k` and `\s`, specifying 0 frames would simply remove any frame delay instead of waiting forever. The TH03-exclusive `k` variants allow the delay to be interrupted if `⏎ Return` or `Shot` are held down. TH04 and TH05 recognize the `k` as well, but removed its functionality. All of these commands have no effect if `ESC` is held.
			\wm`64`,`64`
			\wk`64`
			\wmk`64`,`64`
			\wi`1` \wo`1`	Calls master.lib's `palette_white_in()` or `palette_white_out()` to play a hardware palette fade animation from or to white, spending roughly `1` frame on each of the 16 fade steps.
			\=`4`	Immediately displays the given quarter of the loaded .PI image in the picture area, with no fade effect. Any value ≥ `4` resets the picture area to black.
			\==`4`,`1`	Crossfades the picture area between its current content and quarter #`4` of the loaded .PI image, spending `1` frame on each of the 4 fade steps unless `ESC` is held. Any value ≥ `4` is replaced with quarter #0.
			\$	Stops script execution. Must be called at the end of each file; otherwise, execution continues into whatever lies after the script buffer in memory. TH05 automatically deallocates the loaded .PI image, TH03 and TH04 require a separate manual call to `\p-` to not leak its memory.

Bold values signify the default if the parameter is omitted; \c is therefore equivalent to \c15.

Using the \@ command in the middle of a TH03 or TH04 cutscene script — The `\@` bug. Yes, the ¥ is fake. It was easier to GIMP it than to reword the sentences so that the backslashes landed on the second byte of a 2-byte half-width character pair.

Cutscene font weights in TH03 — The font weights and effects available through `\b`, including the glitch with `\b3` in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels at the left side of each glyph, which would extend into the previous VRAM byte. Ironically, the TH04/TH05 version is more *correct* in this regard: For half-width glyphs, it preserves any further pixel columns generated by the weight functions in the high byte of the 16-dot glyph variable. Unlike TH03, which still cuts them off when rendering text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them towards their correct place (4️⃣). It's only at byte-aligned X positions (2️⃣) where they remain at their internally calculated place, and appear on screen as these glitched pixel columns, 15 pixels away from the glyph they belong to. It's easy to blame bugs like these on micro-optimized ASM code, but in this instance, you really can't argue against it if the original C++ version was equally incorrect.

Cutscene font weights in TH05, demonstrating the <code>\b3</code> bug that also affects TH04 — The font weights and effects available through `\b`, including the glitch with `\b3` in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels at the left side of each glyph, which would extend into the previous VRAM byte. Ironically, the TH04/TH05 version is more *correct* in this regard: For half-width glyphs, it preserves any further pixel columns generated by the weight functions in the high byte of the 16-dot glyph variable. Unlike TH03, which still cuts them off when rendering text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them towards their correct place (4️⃣). It's only at byte-aligned X positions (2️⃣) where they remain at their internally calculated place, and appear on screen as these glitched pixel columns, 15 pixels away from the glyph they belong to. It's easy to blame bugs like these on micro-optimized ASM code, but in this instance, you really can't argue against it if the original C++ version was equally incorrect.

Combining \b and s- into a partial dissolve animation. The speed can be controlled with \v.

Simulating TH03's typing effect in TH04 and TH05 via \w. Even prettier in TH05 where we also get an additional fade animation after the box ends.

So yeah, that's the cutscene system. I'm dreading the moment I will have to deal with the other command interpreter in these games, i.e., the stage enemy system. Luckily, that one is completely disconnected from any other system, so I won't have to deal with it until we're close to finishing MAIN.EXE… that is, unless someone requests it before. And it won't involve text encodings or unblitting…

The cutscene system got me thinking in greater detail about how I would implement translations, being one of the main dependencies behind them. This goal has been on the order form for a while and could soon be implemented for these cutscenes, with 100% PI being right around the corner for the TH03 and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching for Latin-script languages could be implemented fairly quickly:

Establish basic UTF-8 parsing for less painful manual editing of the source files
Procedurally generate glyphs for the few required additional letters based on existing font ROM glyphs. For example, we'd generate ä by painting two short lines on top of the font ROM's a glyph, or generate ¿ by vertically flipping the question mark. This way, the text retains a consistent look regardless of whether the translated game is run with an NEC or EPSON font ROM, or the that Neko Project II auto-generates if you don't provide either.
(Optional) Change automatic line breaks to work on a per-word basis, rather than per-glyph

That's it – script editing and distribution would be handled by your local translation group. It might seem as if this would also work for Greek and Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not sure if I want to attempt procedurally shrinking these glyphs from 16×16 to 8×16… For any more thorough solution, we'd need to go for a more "Chad" kind of full-blown translation support:

Implement text subdivisions at a sensible granularity while retaining automatic line and box breaks
Compile translatable text into a Japanese→target language dictionary (I'm too old to develop any further translation systems that would overwrite modded source text with translations of the original text)
Implement a custom Unicode font system (glyphs would be taken from GNU Unifont unless translators provide a different 8×16 font for their language)
Combine the text compiler with the font compiler to only store needed glyphs as part of the translation's font file (dealing with a multi-MB font file would be rather ugly in a Real Mode game)
Write a simple install/update/patch stacking tool that supports both .HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to warrant a separate tool – each patch stack would be statically compiled into a single package file in the game's directory)
Add a nice language selection option to the main menu
(Optional) Support proportional fonts

Which sounds more like a separate project to be commissioned from Touhou Patch Center's Open Collective funds, separate from the ReC98 cap. This way, we can make sure that the feature is completely implemented, and I can talk with every interested translator to make sure that their language works.
It's still cheaper overall to do this on PC-98 than to first port the games to a modern system and then translate them. On the other hand, most of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with the difficulty of getting arbitrary Unicode characters to work natively in a PC-98 DOS game at all, and would be either unnecessary or trivial if we had already ported the game. Depending on where the patrons' interests lie, it may not be worth it. So let's see what all of you think about which way we should go~~, or whether it's worth doing at all~~. (Edit (2022-12-01): With Splashman's order towards the stage dialogue system, we've pretty much confirmed that it is.) Maybe we want to meet in the middle – using e.g. procedural glyph generation for dynamic translations to keep text rendering consistent with the rest of the PC-98 system, and just not support non-Latin-script languages in the beginning? In any case, I've added both options to the order form.
Edit (2023-07-28): Touhou Patch Center has agreed to fund a basic feature set somewhere between the Virgin and Chad level. Check the 📝 dedicated announcement blog post for more details and ideas, and to find out how you can support this goal!

Surprisingly, there was still a bit of RE work left in the third push after all of this, which I filled with some small rendering boilerplate. Since I also wanted to include TH02's playfield overlay functions, ¹/₁₅ of that last push went towards getting a TH02-exclusive function out of the way, which also ended up including that game in this delivery.
The other small function pointed out how TH05's Stage 5 midboss pops into the playfield quite suddenly, since its clipping test thinks it's only 32 pixels tall rather than 64:

~~Good chance that the pop-in might have been intended.~~
Edit (2023-06-30): Actually, it's a 📝 systematic consequence of ZUN having to work around the lack of clipping in master.lib's sprite functions.
There's even another quirk here: The white flash during its first frame is actually carried over from the previous midboss, which the game still considers as actively getting hit by the player shot that defeated it. It's the regular boilerplate code for rendering a midboss that resets the responsible damage variable, and that code doesn't run during the defeat explosion animation.

Next up: Staying with TH05 and looking at more of the pattern code of its boss fights. Given the remaining TH05 budget, it makes the most sense to continue in in-game order, with Sara and the Stage 2 midboss. If more money comes in towards this goal, I could alternatively go for the Mai & Yuki fight and immediately develop a pretty fix for the cheeto storage glitch. Also, there's a rather intricate pull request for direct ZMBV decoding on the website that I've still got to review…

📝 Posted:: 2022-04-30 22:58 UTC
🚚 Summary of:: P0190, P0191, P0192
⌨ Commits:: 5734815...293e16a, 293e16a...71cb7b5, 71cb7b5...e1f3f9f
💰 Funded by:: nrook, -Tom-, [Anonymous]
🏷 Tags:

The important things first:

TH05 has passed the 50% RE mark, with both MAIN.EXE and the game as a whole! With that, we've also reached what -Tom- wanted out of the project, so he's suspending his discount offer for a bit.
Curve bullets are now officially called cheetos! 76.7% of fans prefer this term, and it fits into the 8.3 DOS filename scheme much better than homing lasers (as they're called in OMAKE.TXT) or Taito lasers (which would indeed have made sense as well).
…oh, and I managed to decompile Shinki within 2 pushes after all. That left enough budget to also add the Stage 1 midboss on top.

So, Shinki! As far as final boss code is concerned, she's surprisingly economical, with 📝 her background animations making up more than ⅓ of her entire code. Going straight from TH01's 📝 final 📝 bosses to TH05's final boss definitely showed how much ZUN had streamlined danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there is still room for improvement: TH05 not only 📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month, but also uses them 4× as often, and even for midbosses. Most importantly though, defining danmaku patterns using a single global instance of the group template structure is just bad no matter how you look at it:

The script code ends up rather bloated, with a single MOV instruction for setting one of the fields taking up 5 bytes. By comparison, the entire structure for regular bullets is 14 bytes large, while the template structure for Shinki's 32×32 ball bullets could have easily been reduced to 8 bytes.
Since it's also one piece of global state, you can easily forget to set one of the required fields for a group type. The resulting danmaku group then reuses these values from the last time they were set… which might have been as far back as another boss fight from a previous stage. And of course, I wouldn't point this out if it didn't actually happen in Shinki's pattern code. Twice.

Declaring a separate structure instance with the static data for every pattern would be both safer and more space-efficient, and there's more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow. The "devil" pattern is significantly more complex than the others, but still far from TH01's final bosses at their worst. I especially like the clear architectural separation between "one-shot pattern" functions that return true once they're done, and "looping pattern" functions that run as long as they're being called from a boss's main function. Not many all too interesting things in these pattern functions for the most part, except for two pieces of evidence that Shinki was coded after Yumeko:

The gather animation function in the first two phases contains a bullet group configuration that looks like it's part of an unused danmaku pattern. It quickly turns out to just be copy-pasted from a similar function in Yumeko's fight though, where it is turned into actual bullets.
As one of the two places where ZUN forgot to set a template field, the lasers at the end of the white wing preparation pattern reuse the 6-pixel width of Yumeko's final laser pattern. This actually has an effect on gameplay: Since these lasers are active for the first 8 frames after Shinki's wings appear on screen, the player can get hit by them in the last 2 frames after they grew to their final width.

Of course, there are more than enough safespots between the lasers.

Speaking about that wing sprite: If you look at ST05.BB2 (or any other file with a large sprite, for that matter), you notice a rather weird file layout:

Raw file layout of TH05's ST05.BB2, demonstrating master.lib's supposed BFNT width limit of 64 pixels — A large sprite split into multiple smaller ones with a width of 64 pixels each? What's this, hardware sprite limitations? On my PC-98?!

And it's not a limitation of the sprite width field in the BFNT+ header either. Instead, it's master.lib's BFNT functions which are limited to sprite widths up to 64 pixels… or at least that's what MASTER.MAN claims. Whatever the restriction was, it seems to be completely nonexistent as of master.lib version 0.23, and none of the master.lib functions used by the games have any issues with larger sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the game that expects Shinki's winged form to consist of 4 physical sprites, not just 1. Any conversion from another, more logical sprite sheet layout back into BFNT+ must therefore replicate the original number of sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly loaded sprite no longer match ZUN's hardcoded IDs, causing the game to crash. This is exactly what used to happen with -Tom-'s MysticTK automation scripts, which combined these exact sprites into a single large one. This issue has now been fixed – just in case there are some underground modders out there who used these scripts and wonder why their game crashed as soon as the Shinki fight started.

And then the code quality takes a nosedive with Shinki's main function. Even in TH05, these boss and midboss update functions are still very imperative:

The origin point of all bullet types used by a boss must be manually set to the current boss/midboss position; there is no concept of a bullet type tracking a certain entity.
The same is true for the target point of a player's homing shots…
… and updating the HP bar. At least the initial fill animation is abstracted away rather decently.
Incrementing the phase frame variable also must be done manually. TH05 even "innovates" here by giving the boss update function exclusive ownership of that variable, in contrast to TH04 where that ownership is given out to the player shot collision detection (?!) and boss defeat helper functions.
Speaking about collision detection: That is done by calling different functions depending on whether the boss is supposed to be invincible or not.
Timeout conditions? No standard way either, and all done with manual if statements. In combination with the regular phase end condition of lowering (mid)boss HP to a certain value, this leads to quite a convoluted control flow.
The manual calls to the score bonus functions for cleared phases at least provide some sense of orientation.
One potentially nice aspect of all this imperative freedom is that phases can end outside of HP boundaries… by manually incrementing the phase variable and resetting the phase frame variable to 0.

The biggest WTF in there, however, goes to using one of the 16 state bytes as a "relative phase" variable for differentiating between boss phases that share the same branch within the switch(boss.phase) statement. While it's commendable that ZUN tried to reduce code duplication for once, he could have just branched depending on the actual boss.phase variable? The same state byte is then reused in the "devil" pattern to track the activity state of the big jerky lasers in the second half of the pattern. If you somehow managed to end the phase after the first few bullets of the pattern, but before these lasers are up, Shinki's update function would think that you're still in the phase before the "devil" pattern. The main function then sequence-breaks right to the defeat phase, skipping the final pattern with the burning Makai background. Luckily, the HP boundaries are far away enough to make this impossible in practice.
The takeaway here: If you want to use the state bytes for your custom boss script mods, alias them to your own 16-byte structure, and limit each of the bytes to a clearly defined meaning across your entire boss script.

One final discovery that doesn't seem to be documented anywhere yet: Shinki actually has a hidden bomb shield during her two purple-wing phases. uth05win got this part slightly wrong though: It's not a complete shield, and hitting Shinki will still deal 1 point of chip damage per frame. For comparison, the first phase lasts for 3,000 HP, and the "devil" pattern phase lasts for 5,800 HP.

And there we go, 3rd PC-98 Touhou boss script* decompiled, 28 to go! 🎉 In case you were expecting a fix for the Shinki death glitch: That one is more appropriately fixed as part of the Mai & Yuki script. It also requires new code, should ideally look a bit prettier than just removing cheetos between one frame and the next, and I'd still like it to fit within the original position-dependent code layout… Let's do that some other time.
Not much to say about the Stage 1 midboss, or midbosses in general even, except that their update functions have to imperatively handle even more subsystems, due to the relative lack of helper functions.

The remaining ¾ of the third push went to a bunch of smaller RE and finalization work that would have hardly got any attention otherwise, to help secure that 50% RE mark. The nicest piece of code in there shows off what looks like the optimal way of setting up the 📝 GRCG tile register for monochrome blitting in a variable color:

mov ah, palette_index ; Any other non-AL 8-bit register works too.
                      ; (x86 only supports AL as the source operand for OUTs.)

rept 4                ; For all 4 bitplanes…
    shr ah,  1        ; Shift the next color bit into the x86 carry flag
    sbb al,  al       ; Extend the carry flag to a full byte
                      ; (CF=0 → 0x00, CF=1 → 0xFF)
    out 7Eh, al       ; Write AL to the GRCG tile register
endm

Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles into a surprisingly nice one-liner. What a beautiful micro-optimization, at a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there, becoming increasingly dumb and undecompilable. Was it really necessary to save 4 x86 instructions in the highly unlikely case of a new spark sprite being spawned outside the playfield? That one 2D polar→Cartesian conversion function then pointed out Turbo C++ 4.0J's woefully limited support for 32-bit micro-optimizations. The code generation for 32-bit 📝 pseudo-registers is so bad that they almost aren't worth using for arithmetic operations, and the inline assembler just flat out doesn't support anything 32-bit. No use in decompiling a function that you'd have to entirely spell out in machine code, especially if the same function already exists in multiple other, more idiomatic C++ variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC replay file reading code, which should finally prove that nothing about the game's original replay system could serve as even just the foundation for community-usable replays. Just in case anyone was still thinking that.

Next up: Back to TH01, with the Elis fight! Got a bit of room left in the cap again, and there are a lot of things that would make a lot of sense now:

TH04 would really enjoy a large number of dedicated pushes to catch up with TH05. This would greatly support the finalization of both games.
Continuing with TH05's bosses and midbosses has shown to be good value for your money. Shinki would have taken even less than 2 pushes if she hadn't been the first boss I looked at.
I've got a new idea for 📝 properly linking in master.lib and getting rid of the 32-bit build step… (Edit (2022-05-31): Nope, that didn't work out after all.)
Oh, and I also added Seihou as a selectable goal, for the two people out there who genuinely like it. If I ever want to quit my day job, I need to branch out into safer territory that isn't threatened by takedowns, after all.

📝 Posted:: 2022-03-27 16:13 UTC
🚚 Summary of:: P0186, P0187, P0188
⌨ Commits:: a21ab3d...bab5634, bab5634...426a531, 426a531...e881f95
💰 Funded by:: Blue Bolt, [Anonymous], nrook
🏷 Tags:

Did you know that moving on top of a boss sprite doesn't kill the player in TH04, only in TH05?

Screenshot of Reimu moving on top of Stage 6 Yuuka, demonstrating the lack of boss↔player collision in TH04 — Yup, Reimu is not getting hit… yet.

That's the first of only three interesting discoveries in these 3 pushes, all of which concern TH04. But yeah, 3 for something as seemingly simple as these shared boss functions… that's still not quite the speed-up I had hoped for. While most of this can be blamed, again, on TH04 and all of its hardcoded complexities, there still was a lot of work to be done on the maintenance front as well. These functions reference a bunch of code I RE'd years ago and that still had to be brought up to current standards, with the dependencies reaching from 📝 boss explosions over 📝 text RAM overlay functionality up to in-game dialog loading.

The latter provides a good opportunity to talk a bit about x86 memory segmentation. Many aspiring PC-98 developers these days are very scared of it, with some even going as far as to rather mess with Protected Mode and DOS extenders just so that they don't have to deal with it. I wonder where that fear comes from… Could it be because every modern programming language I know of assumes memory to be flat, and lacks any standard language-level features to even express something like segments and offsets? That's why compilers have a hard time targeting 16-bit x86 these days: Doing anything interesting on the architecture requires giving the programmer full control over segmentation, which always comes down to adding the typical non-standard language extensions of compilers from back in the day. And as soon as DOS stopped being used, these extensions no longer made sense and were subsequently removed from newer tools. A good example for this can be found in an old version of the NASM manual: The project started as an attempt to make x86 assemblers simple again by throwing out most of the segmentation features from MASM-style assemblers, which made complete sense in 1996 when 16-bit DOS and Windows were already on their way out. But there was a point to all those features, and that's why ReC98 still has to use the supposedly inferior TASM.

Not that this fear of segmentation is completely unfounded: All the segmentation-related keywords, directives, and #pragmas provided by Borland C++ and TASM absolutely can be the cause of many weird runtime bugs. Even if the compiler or linker catches them, you are often left with confusing error messages that aged just as poorly as memory segmentation itself.
However, embracing the concept does provide quite the opportunity for optimizations. While it definitely was a very crazy idea, there is a small bit of brilliance to be gained from making proper use of all these segmentation features. Case in point: The buffer for the in-game dialog scripts in TH04 and TH05.

// Thanks to the semantics of `far` pointers, we only need a single 32-bit
// pointer variable for the following code.
extern unsigned char far *dialog_p;

// This master.lib function returns a `void __seg *`, which is a 16-bit
// segment-only pointer. Converting to a `far *` yields a full segment:offset
// pointer to offset 0000h of that segment.
dialog_p = (unsigned char far *)hmem_allocbyte(/* … */);

// Running the dialog script involves pointer arithmetic. On a far pointer,
// this only affects the 16-bit offset part, complete with overflow at 64 KiB,
// from FFFFh back to 0000h.
dialog_p += /* … */;
dialog_p += /* … */;
dialog_p += /* … */;

// Since the segment part of the pointer is still identical to the one we
// allocated above, we can later correctly free the buffer by pulling the
// segment back out of the pointer.
hmem_free((void __seg *)dialog_p);

If dialog_p was a huge pointer, any pointer arithmetic would have also adjusted the segment part, requiring a second pointer to store the base address for the hmem_free call. Doing that will also be necessary for any port to a flat memory model. Depending on how you look at it, this compression of two logical pointers into a single variable is either quite nice, or really, really dumb in its reliance on the precise memory model of one single architecture.

Why look at dialog loading though, wasn't this supposed to be all about shared boss functions? Well, TH04 unnecessarily puts certain stage-specific code into the boss defeat function, such as loading the alternate Stage 5 Yuuka defeat dialog before a Bad Ending, or initializing Gengetsu after Mugetsu's defeat in the Extra Stage.
That's TH04's second core function with an explicit conditional branch for Gengetsu, after the 📝 dialog exit code we found last year during EMS research. And I've heard people say that Shinki was the most hardcoded fight in PC-98 Touhou… Really, Shinki is a perfectly regular boss, who makes proper use of all internal mechanics in the way they were intended, and doesn't blast holes into the architecture of the game. Even within TH05, it's Mai and Yuki who rely on hacks and duplicated code, not Shinki.

The worst part about this though? How the function distinguishes Mugetsu from Gengetsu. Once again, it uses its own global variable to track whether it is called the first or the second time within TH04's Extra Stage, unrelated to the same variable used in the dialog exit function. But this time, it's not just any newly created, single-use variable, oh no. In a misguided attempt to micro-optimize away a few bytes of conventional memory, TH04 reserves 16 bytes of "generic boss state", which can (and are) freely used for anything a boss doesn't want to store in a more dedicated variable.
It might have been worth it if the bosses actually used most of these 16 bytes, but the majority just use (the same) two, with only Stage 4 Reimu using a whopping seven different ones. To reverse-engineer the various uses of these variables, I pretty much had to map out which of the undecompiled danmaku-pattern functions corresponds to which boss fight. In the end, I assigned 29 different variable names for each of the semantically different use cases, which made up another full push on its own.

Now, 16 bytes of wildly shared state, isn't that the perfect recipe for bugs? At least during this cursory look, I haven't found any obvious ones yet. If they do exist, it's more likely that they involve reused state from earlier bosses – just how the Shinki death glitch in TH05 is caused by reusing cheeto data from way back in Stage 4 – and hence require much more boss-specific progress.
And yes, it might have been way too early to look into all these tiny details of specific boss scripts… but then, this happened:

TH04 crashing to the DOS prompt in the Stage 4 Marisa fight, right as the last of her bits is destroyed

Looks similar to another screenshot of a crash in the same fight that was reported in December, doesn't it? I was too much in a hurry to figure it out exactly, but notice how both crashes happen right as the last of Marisa's four bits is destroyed. KirbyComment has suspected this to be the cause for a while, and now I can pretty much confirm it to be an unguarded division by the number of on-screen bits in Marisa-specific pattern code. But what's the cause for Kurumi then?
As for fixing it, I can go for either a fast or a slow option:

Superficially fixing only this crash will probably just take a fraction of a push.
But I could also go for a deeper understanding by looking at TH04's version of the 📝 custom entity structure. It not only stores the data of Marisa's bits, but is also very likely to be involved in Kurumi's crash, and would get TH04 a lot closer to 100% PI. Taking that look will probably need at least 2 pushes, and might require another 3-4 to completely decompile Marisa's fight, and 2-3 to decompile Kurumi's.

OK, now that that's out of the way, time to finish the boss defeat function… but not without stumbling over the third of TH04's quirks, relating to the Clear Bonus for the main game or the Extra Stage:

To achieve the incremental addition effect for the in-game score display in the HUD, all new points are first added to a score_delta variable, which is then added to the actual score at a maximum rate of 61,110 points per frame.
There are a fixed 416 frames between showing the score tally and launching into MAINE.EXE.
As a result, TH04's Clear Bonus is effectively limited to (416 × 61,110) = 25,421,760 points.
Only TH05 makes sure to commit the entirety of the score_delta to the actual score before switching binaries, which fixes this issue.

And after another few collision-related functions, we're now truly, finally ready to decompile bosses in both TH04 and TH05! Just as the anything funds were running out… The remaining ¼ of the third push then went to Shinki's 32×32 ball bullets, rounding out this delivery with a small self-contained piece of the first TH05 boss we're probably going to look at.

Next up, though: I'm not sure, actually. Both Shinki and Elis seem just a little bit larger than the 2¼ or 4 pushes purchased so far, respectively. Now that there's a bunch of room left in the cap again, I'll just let the next contribution decide – with a preference for Shinki in case of a tie. And if it will take longer than usual for the store to sell out again this time (heh), there's still the 📝 PC-98 text RAM JIS trail word rendering research waiting to be documented.

📝 Posted:: 2021-05-12 18:01 UTC
🚚 Summary of:: P0139
⌨ Commits:: 864e864...d985811
💰 Funded by:: [Anonymous]
🏷 Tags:

Technical debt, part 10… in which two of the PMD-related functions came with such complex ramifications that they required one full push after all, leaving no room for the additional decompilations I wanted to do. At least, this did end up being the final one, completing all SHARED segments for the time being.

The first one of these functions determines the BGM and sound effect modes, combining the resident type of the PMD driver with the Option menu setting. The TH04 and TH05 version is apparently coded quite smartly, as PC-98 Touhou only needs to distinguish "OPN- / PC-9801-26K-compatible sound sources handled by PMD.COM" from "everything else", since all other PMD varieties are OPNA- / PC-9801-86-compatible.
Therefore, I only documented those two results returned from PMD's AH=09h function. I'll leave a comprehensive, fully documented enum to interested contributors, since that would involve research into basically the entire history of the PC-9800 series, and even the clearly out-of-scope PC-88VA. After all, distinguishing between more versions of the PMD driver in the Option menu (and adding new sprites for them!) is strictly mod territory.

The honor of being the final decompiled function in any SHARED segment went to TH04's snd_load(). TH04 contains by far the sanest version of this function: Readable C code, no new ZUN bugs (and still missing file I/O error handling, of course)… but wait, what about that actual file read syscall, using the INT 21h, AH=3Fh DOS file read API? Reading up to a hardcoded number of bytes into PMD's or MMD's song or sound effect buffer, 20 KiB in TH02-TH04, 64 KiB in TH05… that's kind of weird. About time we looked closer into this.

Turns out that no, KAJA's driver doesn't give you the full 64 KiB of one memory segment for these, as especially TH05's code might suggest to anyone unfamiliar with these drivers. Instead, you can customize the size of these buffers on its command line. In GAME.BAT, ZUN allocates 8 KiB for FM songs, 2 KiB for sound effects, and 12 KiB for MMD files in TH02… which means that the hardcoded sizes in snd_load() are completely wrong, no matter how you look at them. Consequently, this read syscall will overflow PMD's or MMD's song or sound effect buffer if the given file is larger than the respective buffer size.
Now, ZUN could have simply hardcoded the sizes from GAME.BAT instead, and it would have been fine. As it also turns out though, PMD has an API function (AH=22h) to retrieve the actual buffer sizes, provided for exactly that purpose. There is little excuse not to use it, as it also gives you PMD's default sizes if you don't specify any yourself.
(Unless your build process enumerates all PMD files that are part of the game, and bakes the largest size into both snd_load() and GAME.BAT. That would even work with MMD, which doesn't have an equivalent for AH=22h.)

What'd be the consequence of loading a larger file then? Well, since we don't get a full segment, let's look at the theoretical limit first.
PMD prefers to keep both its driver code and the data buffers in a single memory segment. As a result, the limit for the combined size of the song, instrument, and sound effect buffer is determined by the amount of code in the driver itself. In PMD86 version 4.8o (bundled with TH04 and TH05) for example, the remaining size for these buffers is exactly 45,555 bytes. Being an actually good programmer who doesn't blindly trust user input, KAJA thankfully validates the sizes given via the /M, /V, and /E command-line options before letting the driver reside in memory, and shuts down with an error message if they exceed 40 KiB. Would have been even better if he calculated the exact size – even in the current PMD version 4.8s from January 2020, it's still a hardcoded value (see line 8581).
Either way: If the file is larger than this maximum, the concrete effect is down to the INT 21h, AH=3Fh implementation in the underlying DOS version. DOS 3.3 treats the destination address as linear and reads past the end of the segment, DOS 5.0 and DOSBox-X truncate the number of bytes to not exceed the remaining space in the segment, and maybe there's even a DOS that wraps around and ends up overwriting the PMD driver code. In any case: You will overwrite what's after the driver in memory – typically, the game .EXE and its master.lib functions.

It almost feels like a happy accident that this doesn't cause issues in the original games. The largest PMD file in any of the 4 games, the -86 version of 幽夢　～ Inanimate Dream, takes up 8,099 bytes, just under the 8,192 byte limit for BGM. For modders, I'd really recommend implementing this properly, with PMD's AH=22h function and error handling, once position independence has been reached.

Whew, didn't think I'd be doing more research into KAJA's drivers during regular ReC98 development! That's probably been the final time though, as all involved functions are now decompiled, and I'm unlikely to iterate over them again.

And that's it! Repaid the biggest chunk of technical debt, time for some actual progress again. Next up: Reopening the store tomorrow, and waiting for new priorities. If we got nothing by Sunday, I'm going to put the pending [Anonymous] pushes towards some work on the website.

📝 Posted:: 2021-04-23 00:46 UTC
🚚 Summary of:: P0138
⌨ Commits:: 8d953dc...864e864
💰 Funded by:: [Anonymous], Blue Bolt
🏷 Tags:

Technical debt, part 9… and as it turns out, it's highly impractical to repay 100% of it at this point in development. 😕

The reason: graph_putsa_fx(), ZUN's function for rendering optionally boldfaced text to VRAM using the font ROM glyphs, in its ridiculously micro-optimized TH04 and TH05 version. This one sets the "callback function" for applying the boldface effect by self-modifying the target of two CALL rel16 instructions… because there really wasn't any free register left for an indirect CALL, eh? The necessary distance, from the call site to the function itself, has to be calculated at assembly time, by subtracting the target function label from the call site label.
This usually wouldn't be a problem… if ZUN didn't store the resulting lookup tables in the .DATA segment. With code segments, we can easily split them at pretty much any point between functions because there are multiple of them. But there's only a single .DATA segment, with all ZUN and master.lib data sandwiched between Borland C++'s crt0 at the top, and Borland C++'s library functions at the bottom of the segment. Adding another split point would require all data after that point to be moved to its own translation unit, which in turn requires EXTERN references in the big .ASM file to all that moved data… in short, it would turn the codebase into an even greater mess.
Declaring the labels as EXTERN wouldn't work either, since the linker can't do fancy arithmetic and is limited to simply replacing address placeholders with one single address. So, we're now stuck with this function at the bottom of the SHARED segment, for the foreseeable future.

We can still continue to separate functions off the top of that segment, though. Pretty much the only thing noteworthy there, so far: TH04's code for loading stage tile images from .MPN files, which we hadn't reverse-engineered so far, and which nicely fit into one of Blue Bolt's pending ⅓ RE contributions. Yup, we finally moved the RE% bars again! If only for a tiny bit.
Both TH02 and TH05 simply store one pointer to one dynamically allocated memory block for all tile images, as well as the number of images, in the data segment. TH04, on the other hand, reserves memory for 8 .MPN slots, complete with their color palettes, even though it only ever uses the first one of these. There goes another 458 bytes of conventional RAM… I should start summing up all the waste we've seen so far. Let's put the next website contribution towards a tagging system for these blog posts.

At 86% of technical debt in the SHARED segment repaid, we aren't quite done yet, but the rest is mostly just TH04 needing to catch up with functions we've already separated. Next up: Getting to that practical 98.5% point. Since this is very likely to not require a full push, I'll also decompile some more actual TH04 and TH05 game code I previously reverse-engineered – and after that, reopen the store!

📝 Posted:: 2021-04-04 18:46 UTC
🚚 Summary of:: P0137
⌨ Commits:: 07bfcf2...8d953dc
💰 Funded by:: [Anonymous]
🏷 Tags:

Whoops, the build was broken again? Since P0127 from mid-November 2020, on TASM32 version 5.3, which also happens to be the one in the DevKit… That version changed the alignment for the default segments of certain memory models when requesting .386 support. And since redefining segment alignment apparently is highly illegal and absolutely has to be a build error, some of the stand-alone .ASM translation units didn't assemble anymore on this version. I've only spotted this on my own because I casually compiled ReC98 somewhere else – on my development system, I happened to have TASM32 version 5.0 in the PATH during all this time.
At least this was a good occasion to get rid of some weird segment alignment workarounds from 2015, and replace them with the superior convention of using the USE16 modifier for the .MODEL directive.

ReC98 would highly benefit from a build server – both in order to immediately spot issues like this one, and as a service for modders. Even more so than the usual open-source project of its size, I would say. But that might be exactly because it doesn't seem like something you can trivially outsource to one of the big CI providers for open-source projects, and quickly set it up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular 64-bit Windows 10 and DOSBox running the exact build tools from the DevKit. Ideally, though, such a server should really run the optimal configuration of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step to run natively, which already is something that no popular CI service out there offers. Then, we'd optimally expand to Linux, every other Windows version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd be a lot. An experimental project all on its own, with additional hosting costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest there is once the store reopens (which will be at the beginning of May, at the latest). That aside, it would 📝 also be a great project for outside contributors!

So, technical debt, part 8… and right away, we're faced with TH03's low-level input function, which 📝 once 📝 again 📝 insists on being word-aligned in a way we can't fake without duplicating translation units. Being undecompilable isn't exactly the best property for a function that has been interesting to modders in the past: In 2018, spaztron64 created an ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human multiplayer mode: 2021-04-04-TH03-WASD-2player.zip However, this remapping attempt remained quite limited, since we hadn't (and still haven't) reached full position independence for TH03 yet. There's quite some potential for size optimizations in this function, which would allow more BIOS key groups to already be used right now, but it's not all that obvious to modders who aren't intimately familiar with x86 ASM. Therefore, I really wouldn't want to keep such a long and important function in ASM if we don't absolutely have to…

… and apparently, that's all the motivation I needed? So I took the risk, and spent the first half of this push on reverse-engineering TCC.EXE, to hopefully find a way to get word-aligned code segments out of Turbo C++ after all.

And there is! The -WX option, used for creating DPMI applications, messes up all sorts of code generation aspects in weird ways, but does in fact mark the code segment as word-aligned. We can consider ourselves quite lucky that we get to use Turbo C++ 4.0, because this feature isn't available in any previous version of Borland's C++ compilers.
That allowed us to restore all the decompilations I previously threw away… well, two of the three, that lookup table generator was too much of a mess in C. But what an abuse this is. The subtly different code generation has basically required one creative workaround per usage of -WX. For example, enabling that option causes the regular PUSH BP and POP BP prolog and epilog instructions to be wrapped with INC BP and DEC BP, for some reason:

a_function_compiled_with_wx proc
	inc 	bp    	; ???
	push	bp
	mov 	bp, sp
	    	      	; [… function code …]
	pop 	bp
	dec 	bp    	; ???
	ret
a_function_compiled_with_wx endp

Luckily again, all the functions that currently require -WX don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty close: snd_se_reset(void) is one of the functions that require word alignment. Previously, it shared a translation unit with the immediately following snd_se_play(int new_se), which does take a parameter, and therefore would have had its prolog and epilog code messed up by -WX. Since the latter function has a consistent (and thus, fakeable) alignment, I simply split that code segment into two, with a new -WX translation unit for just snd_se_reset(void). Problem solved – after all, two C++ translation units are still better than one ASM translation unit. Especially with all the previous #include improvements.

The rest was more of the usual, getting us 74% done with repaying the technical debt in the SHARED segment. A lot of the remaining 26% is TH04 needing to catch up with TH03 and TH05, which takes comparatively little time. With some good luck, we might get this done within the next push… that is, if we aren't confronted with all too many more disgusting decompilations, like the two functions that ended this push. If we are, we might be needing 10 pushes to complete this after all, but that piece of research was definitely worth the delay. Next up: One more of these.

📝 Posted:: 2021-03-20 20:19 UTC
🚚 Summary of:: P0135, P0136
⌨ Commits:: a6eed55...252c13d, 252c13d...07bfcf2
💰 Funded by:: [Anonymous]
🏷 Tags:

Alright, no more big code maintenance tasks that absolutely need to be done right now. Time to really focus on parts 6 and 7 of repaying technical debt, right? Except that we don't get to speed up just yet, as TH05's barely decompilable PMD file loading function is rather… complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98 Touhou, I first consult the disassembly of Wolfenstein 3D. That game was originally compiled with the quite similar Borland C++ 3.0, so it's quite helpful to compare its ASM to the officially released source code. If I find the instructions in question, they mostly come from that game's ASM code, leading to the amusing realization that "even John Carmack was unable to get these instructions out of this compiler" This time though, Wolfenstein 3D did point me to Borland's intrinsics for common C functions like memcpy() and strchr(), available via #pragma intrinsic. Bu~t those unfortunately still generate worse code than what ZUN micro-optimized here. Commenting how these sequences of instructions should look in C is unfortunately all I could do here.
The conditional branches in this function did compile quite nicely though, clarifying the control flow, and clearly exposing a ZUN bug: TH05's snd_load() will hang in an infinite loop when trying to load a non-existing -86 BGM file (with a .M2 extension) if the corresponding -26 BGM file (with a .M extension) doesn't exist either.

Unsurprisingly, the PMD channel monitoring code in TH05's Music Room remains undecompilable outside the two most "high-level" initialization and rendering functions. And it's not because there's data in the middle of the code segment – that would have actually been possible with some #pragmas to ensure that the data and code segments have the same name. As soon as the SI and DI registers are referenced anywhere, Turbo C++ insists on emitting prolog code to save these on the stack at the beginning of the function, and epilog code to restore them from there before returning. Found that out in September 2019, and confirmed that there's no way around it. All the small helper functions here are quite simply too optimized, throwing away any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate that I do try.

Within that same 6th push though, we've finally reached the one function in TH05 that was blocking further progress in TH04, allowing that game to finally catch up with the others in terms of separated translation units. Feels good to finally delete more of those .ASM files we've decompiled a while ago… finally!

But since that was just getting started, the most satisfying development in both of these pushes actually came from some more experiments with macros and inline functions for near-ASM code. By adding "unused" dummy parameters for all relevant registers, the exact input registers are made more explicit, which might help future port authors who then maybe wouldn't have to look them up in an x86 instruction reference quite as often. At its best, this even allows us to declare certain functions with the __fastcall convention and express their parameter lists as regular C, with no additional pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even more amazing than previously thought when it comes to returning pseudo-registers from inline functions. A nice example for how this can improve readability can be found in this piece of TH02 code for polling the PC-98 keyboard state using a BIOS interrupt:

inline uint8_t keygroup_sense(uint8_t group) {
	_AL = group;
	_AH = 0x04;
	geninterrupt(0x18);
	// This turns the output register of this BIOS call into the return value
	// of this function. Surprisingly enough, this does *not* naively generate
	// the `MOV AL, AH` instruction you might expect here!
	return _AH;
}

void input_sense(void)
{
	// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
	// never emits as such, giving us only the three instructions we need.
	_AH = keygroup_sense(8);

	// Whereas this one gives us the one additional `MOV BH, AH` instruction
	// we'd expect, and nothing more.
	_BH = keygroup_sense(7);

	// And now it's obvious what both of these registers contain, from just
	// the assignments above.
	if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
		key_det |= INPUT_UP;
	}
	// […]
}

I love it. No inline assembly, as close to idiomatic C code as something like this is going to get, yet still compiling into the minimum possible number of x86 instructions on even a 1994 compiler. This is how I keep this project interesting for myself during chores like these. We might have even reached peak inline already?

And that's 65% of technical debt in the SHARED segment repaid so far. Next up: Two more of these, which might already complete that segment? Finally!

📝 Posted:: 2021-01-31 15:54 UTC
🚚 Summary of:: P0133
⌨ Commits:: 045450c...1d5db71
💰 Funded by:: [Anonymous]
🏷 Tags:

Wow, 31 commits in a single push? Well, what the last push had in progress, this one had in maintenance. The 📝 master.lib header transition absolutely had to be completed in this one, for my own sanity. And indeed, it reduced the build time for the entirety of ReC98 to about 27 seconds on my system, just as expected in the original announcement. Looking forward to even faster build times with the upcoming #include improvements I've got up my sleeve! The port authors of the future are going to appreciate those quite a bit.

As for the new translation units, the funniest one is probably TH05's function for blitting the 1-color .CDG images used for the main menu options. Which is so optimized that it becomes decompilable again, by ditching the self-modifying code of its TH04 counterpart in favor of simply making better use of CPU registers. The resulting C code is still a mess, but what can you do.
This was followed by even more TH05 functions that clearly weren't compiled from C, as evidenced by their padding bytes. It's about time I've documented my lack of ideas of how to get those out of Turbo C++.

And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…

In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?

📝 Posted:: 2021-01-06 17:29 UTC
🚚 Summary of:: P0132
⌨ Commits:: dc9e3ee...045450c
💰 Funded by:: [Anonymous]
🏷 Tags:

Now that's the amount of translation unit separation progress I was looking for! Too bad that RL is keeping me more and more occupied these days, and ended up delaying this push until 2021. Now that Touhou Patch Center is also commissioning me to update their infrastructure, it's going to take a while for ReC98 to return to full speed, and for the store to be reopened. Should happen by April at the latest, though!

With everything related to this separation of translation units explained earlier, we've really got a push with nothing to talk about, this time. Except, maybe, for the realization that 📝 this current approach might not be the best fit for TH02 after all: Not only did it force us to 📝 throw away the previous decompilation of the sound effect playback functions, but OP.EXE also contains obviously copy-pasted code in addition to the common, shared set of library functions. How was that game even built, originally??? No way around compiling that one instance of the "delay until given BGM measure" function separately then, if it insists on using its own instance of the VSync delay function…
Oh well, this separated layout still works better for the later games, and consistency is good. Smooth sailing with all of the other functions, at least.

Next up: One more of these, which might even end up completing the 📝 transition to our own master.lib header file. In terms of the total number of ASM code left in the SHARED code segments, we're now 30% done after 3 dedicated pushes. It really shouldn't require 7 more pushes, though!

📝 Posted:: 2020-11-03 01:44 UTC
🚚 Summary of:: P0124, P0125
⌨ Commits:: 72dfa09...056b1c7, 056b1c7...f6a3246
💰 Funded by:: Blue Bolt, [Anonymous]
🏷 Tags:

Turns out that TH04's player selection menu is exactly three times as complicated as TH05's. Two screens for character and shot type rather than one, and a way more intricate implementation for saving and restoring the background behind the raised top and left edges of a character picture when moving the cursor between Reimu and Marisa. TH04 decides to backup precisely only the two 256×8 (top) and 8×244 (left) strips behind the edges, indicated in red in the picture below.

Backed-up VRAM area in TH04's player character selection

These take up just 4 KB of heap memory… but require custom blitting functions, and expanding this explicitly hardcoded approach to TH05's 4 characters would have been pretty annoying. So, rather than, uh, not explicitly hardcoding it all, ZUN decided to just be lazy with the backup area in TH05, saving the entire 640×400 screen, and thus spending 128 KB of heap memory on this rather simple selection shadow effect.

So, this really wasn't something to quickly get done during the first half of a push, even after already having done TH05's equivalent of this menu. But since life is very busy right now, I also used the occasion to start addressing another code organization annoyance: master.lib's single master.h header file.

Now that ReC98 is trying to develop (or at least mimic) a more type-safe C++ foundation to model the PC-98 hardware, a pure C header (with counter-productive C++ extensions) is becoming increasingly unidiomatic. By moving some of the original assumptions about function parameters into the type system, we can also reduce the reliance on its Japanese-only documentation without having to translate it
It's far from complete with regards to providing compile-time PC-98 hardware constants and helpful types. In fact, we started to add these in our own header files a long time ago.
It's quite bloated, with at least 2800 lines of code that currently are #included into the vast majority of files, not counting master.h's recursively included C standard library headers. PC-98 Touhou only makes direct use of a rather small fraction of its contents.
And finally, all the DOS/V compatibility definitions are especially useless in the context of ReC98. As I've noted 📝 time and 📝 time again, porting PC-98 Touhou to IBM-compatible DOS won't be easy, and MASTER_DOSV won't be helping much. Therefore, my upstream version of ReC98 will never include all of master.lib. There's no point in lengthening compile times for everyone by default, and those will be getting quite noticeable after moving to a full 16-bit build process.
(Actually, what retro system ports should rather be doing: Get rid of master.lib's original ASM code, replace it with readable, modern C++, and then simply convert the optimized assembly output of modern compilers to your ISA of choice. Improving the landscape of such assembly or object file converters would benefit everyone!)

So, time to start a new master.hpp header that would contain just the declarations from master.h that PC-98 Touhou actually needs, plus some semantic (yes, semantic) sugar. Comparing just the old master.h to just the new master.hpp after roughly 60% of the transition has been completed, we get median build times of 319 ms for master.h, and 144 ms for master.hpp on my (admittedly rather slow) DOSBox setup. Nice!
As of this push, ReC98 consists of 107 translation units that have to be compiled with Turbo C++ 4.0J. Fully rebuilding all of these currently takes roughly 37.5 seconds in DOSBox. After the transition to master.hpp is done, we could therefore shave some 10 to 15 seconds off this time, simply by switching header files. And that's just the beginning, as this will also pave the way for further #include optimizations. Life in this codebase will be great!

Unfortunately, there wasn't enough time to repay some of the actual technical debt I was looking forward to, after all of this. Oh well, at least we now also have nice identifiers for the three different boldface options that are used when rendering text to VRAM, after procrastinating that issue for almost 11 months. Next up, assuming the existing subscriptions: More ridiculous decompilations of things that definitely weren't originally written in C, and a big blocker in TH03's MAIN.EXE.

📝 Posted:: 2020-09-17 20:58 UTC
🚚 Summary of:: P0118
⌨ Commits:: 0bb5bc3...cbf14eb
💰 Funded by:: -Tom-, Ember2528
🏷 Tags:

🎉 TH05 is finally fully position-independent! 🎉 To celebrate this milestone, -Tom- coded a little demo, which we recorded on both an emulator and on real PC-98 hardware:

For all the new people who are unfamiliar with PC-98 Touhou internals: Boss behavior is hardcoded into MAIN.EXE, rather than being scriptable via separate .ECL files like in Windows Touhou. That's what makes this kind of a big deal.

What does this mean?

You can now freely add or remove both data and code anywhere in TH05, by editing the ReC98 codebase, writing your mod in ASM or C/C++, and recompiling the code. Since all absolute memory addresses have now been converted to labels, this will work without causing any instability. See the position independence section in the FAQ for a more thorough explanation about why this was a problem.

By extension, this also means that it's now theoretically possible to use a different compiler on the source code. But:

What does this not mean?

The original ZUN code hasn't been completely reverse-engineered yet, let alone decompiled. As the final PC-98 Touhou game, TH05 also happens to have the largest amount of actual ZUN-written ASM that can't ever be decompiled within ReC98's constraints of a legit source code reconstruction. But a lot of the originally-in-C code is also still in ASM, which might make modding a bit inconvenient right now. And while I have decompiled a bunch of functions, I selected them largely because they would help with PI (as requested by the backers), and not because they are particularly relevant to typical modding interests.

As a result, the code might also be a bit confusingly organized. There's quite a conflict between various goals there: On the one hand, I'd like to only have a single instance of every function shared with earlier games, as well as reduce ZUN's code duplication within a single game. On the other hand, this leads to quite a lot of code being scattered all over the place and then #include-pasted back together, except for the places where 📝 this doesn't work, and you'd have to use multiple translation units anyway… I'm only beginning to figure out the best structure here, and some more reverse-engineering attention surely won't hurt.

Also, keep in mind that the code still targets x86 Real Mode. To work effectively in this codebase, you'd need some familiarity with memory segmentation, and how to express it all in code. This tends to make even regular C++ development about an order of magnitude harder, especially once you want to interface with the remaining ASM code. That part made -Tom- struggle quite a bit with implementing his custom scripting language for the demo above. For now, he built that demo on quite a limited foundation – which is why he also chose to release neither the build nor the source publically for the time being.
So yeah, you're definitely going to need the TASM and Borland C++ manuals there.

tl;dr: We now know everything about this game's data, but not quite as much about this game's code.

So, how long until source ports become a realistic project?

You probably want to wait for 100% RE, which is when everything that can be decompiled has been decompiled.

Unless your target system is 16-bit Windows, in which case you could theoretically start right away. 📝 Again, this would be the ideal first system to port PC-98 Touhou to: It would require all the generic portability work to remove the dependency on PC-98 hardware, thus paving the way for a subsequent port to modern systems, yet you could still just drop in any undecompiled ASM.

Porting to IBM-compatible DOS would only be a harder and less universally useful version of that. You'd then simply exchange one architecture, with its idiosyncrasies and limits, for another, with its own set of idiosyncrasies and limits. (Unless, of course, you already happen to be intimately familiar with that architecture.) The fact that master.lib provides DOS/V support would have only mattered if ZUN consistently used it to abstract away PC-98 hardware at every single place in the code, which is definitely not the case.

The list of actually interesting findings in this push is, 📝 again, very short. Probably the most notable discovery: The low-level part of the code that renders Marisa's laser from her TH04 Illusion Laser shot type is still present in TH05. Insert wild mass guessing about potential beta version shot types… Oh, and did you know that the order of background images in the Extra Stage staff roll differs by character?

Next up: Finally driving up the RE% bar again, by decompiling some TH05 main menu code.

📝 Posted:: 2020-09-07 19:19 UTC
🚚 Summary of:: P0113, P0114
⌨ Commits:: 150d2c6...6204fdd, 6204fdd...967bb8b
💰 Funded by:: Lmocinemod
🏷 Tags:

Alright, tooling and technical debt. Shouldn't be really much to talk about… oh, wait, this is still ReC98

For the tooling part, I finished up the remaining ergonomics and error handling for the 📝 sprite converter that Jonathan Campbell contributed two months ago. While I familiarized myself with the tool, I've actually ran into some unreported errors myself, so this was sort of important to me. Still got no command-line help in there, but the error messages can now do that job probably even better, since we would have had to write them anyway.

So, what's up with the technical debt then? Well, by now we've accumulated quite a number of 📝 ASM code slices that need to be either decompiled or clearly marked as undecompilable. Since we define those slices as "already reverse-engineered", that decision won't affect the numbers on the front page at all. But for a complete decompilation, we'd still have to do this someday. So, rather than incorporating this work into pushes that were purchased with the expectation of measurable progress in a certain area, let's take the "anything goes" pushes, and focus entirely on that during them.

The second code segment seemed like the best place to start with this, since it affects the largest number of games simultaneously. Starting with TH02, this segment contains a set of random "core" functions needed by the binary. Image formats, sounds, input, math, it's all there in some capacity. You could maybe call it all "libzun" or something like that? But for the time being, I simply went with the obvious name, seg2. Maybe I'll come up with something more convincing in the future.

Oh, but wait, why were we assembling all the previous undecompilable ASM translation units in the 16-bit build part? By moving those to the 32-bit part, we don't even need a 16-bit TASM in our list of dependencies, as long as our build process is not fully 16-bit.
And with that, ReC98 now also builds on Windows 95, and thus, every 32-bit Windows version. 🎉 Which is certainly the most user-visible improvement in all of these two pushes.

Back in 2015, I already decompiled all of TH02's seg2 functions. As suggested by the Borland compiler, I tried to follow a "one translation unit per segment" layout, bundling the binary-specific contents via #include. In the end, it required two translation units – and that was even after manually inserting the original padding bytes via #pragma codestring… yuck. But it worked, compiled, and kept the linker's job (and, by extension, segmentation worries) to a minimum. And as long as it all matched the original binaries, it still counted as a valid reconstruction of ZUN's code.

However, that idea ultimately falls apart once TH03 starts mixing undecompilable ASM code inbetween C functions. Now, we officially have no choice but to use multiple C and ASM translation units, with maybe only just one or two #includes in them…

…or we finally start reconstructing the actual seg2 library, turning every sequence of related functions into its own translation unit. This way, we can simply reuse the once-compiled .OBJ files for all the binaries those functions appear in, without requiring that additional layer of translation units mirroring the original segmentation.
The best example for this is TH03's almost undecompilable function that generates a lookup table for horizontally flipping 8 1bpp pixels. It's part of every binary since TH03, but only used in that game. With the previous approach, we would have had to add 9 C translation units, which would all have just #included that one file. Now, we simply put the .OBJ file into the correct place on the linker command line, as soon as we can.

💡 And suddenly, the linker just inserts the correct padding bytes itself.

The most immediate gains there also happened to come from TH03. Which is also where we did get some tiny RE% and PI% gains out of this after all, by reverse-engineering some of its sprite blitting setup code. Sure, I should have done even more RE here, to also cover those 5 functions at the end of code segment #2 in TH03's MAIN.EXE that were in front of a number of library functions I already covered in this push. But let's leave that to an actual RE push 😛

All in all though, I was just getting started with this; the real gains in terms of removed ASM files are still to come. But in the meantime, the funding situation has become even better in terms of allowing me to focus on things nobody asked for. 🙂 So here's a slightly better idea: Instead of spending two more pushes on this, let's shoot for TH05 MAINE.EXE position independence next. If I manage to get it done, we'll have a 100% position-independent TH05 by the time -Tom- finishes his MAIN.EXE PI demo, rather than the 94% we'd get from just MAIN.EXE. That's bound to make a much better impression on all the people who will then (re-)discover the project.

📝 Posted:: 2020-08-28 13:51 UTC
🚚 Summary of:: P0111, P0112
⌨ Commits:: 8b5c146...4ef4c9e, 4ef4c9e...e447a2d
💰 Funded by:: [Anonymous], Blue Bolt
🏷 Tags:

Only one newly ordered push since I've reopened the store? Great, that's all the justification I needed for the extended maintenance delay that was part of these two pushes 😛

Having to write comments to explain whether coordinates are relative to the top-left corner of the screen or the top-left corner of the playfield has finally become old. So, I introduced distinct types for all the coordinate systems we typically encounter, applying them to all code decompiled so far. Note how the planar nature of PC-98 VRAM meant that X and Y coordinates also had to be different from each other. On the X side, there's mainly the distinction between the [0; 640] screen space and the corresponding [0; 80] VRAM byte space. On the Y side, we also have the [0; 400] screen space, but the visible area of VRAM might be limited to [0; 200] when running in the PC-98's line-doubled 640×200 mode. A VRAM Y coordinate also always implies an added offset for vertical scrolling.
During all of the code reconstruction, these types can only have a documenting purpose. Turning them into anything more than just typedefs to int, in order to define conversion operators between them, simply won't recompile into identical binaries. Modding and porting projects, however, now have a nice foundation for doing just that, and can entirely lift coordinate system transformations into the type system, without having to proofread all the meaningless int declarations themselves.

So, what was left in terms of memory references? EX-Alice's fire waves were our final unknown entity that can collide with the player. Decently implemented, with little to say about them.

That left the bomb animation structures as the one big remaining PI blocker. They started out nice and simple in TH04, with a small 6-byte star animation structure used for both Reimu and Marisa. TH05, however, gave each character her own animation… and what the hell is going on with Reimu's blue stars there? Nope, not going to figure this out on ASM level.

A decompilation first required some more bomb-related variables to be named though. Since this was part of a generic RE push, it made sense to do this in all 5 games… which then led to nice PI gains in anything but TH05. Most notably, we now got the "pulling all items to player" flag in TH04 and TH05, which is actually separate from bombing. The obvious cheat mod is left as an exercise to the reader.

So, TH05 bomb animations. Just like the 📝 custom entity types of this game, all 4 characters share the same memory, with the superficially same 10-byte structure.
But let's just look at the very first field. Seen from a low level, it's a simple struct { int x, y; } pos, storing the current position of the character-specific bomb animation entity. But all 4 characters use this field differently:

For Reimu's blue stars, it's the top-left position of each star, in the 12.4 fixed-point format. But unlike the vast majority of these values in TH04 and TH05, it's relative to the top-left corner of the screen, not the playfield. Much better represented as struct { Subpixel screen_x, screen_y; } topleft.
For Marisa's lasers, it's the center of each circle, as a regular 12.4 fixed-point coordinate, relative to the top-left corner of the playfield. Much better represented as struct { Subpixel x, y; } center.
For Mima's shrinking circles, it's the center of each circle in regular pixel coordinates. Much better represented as struct { screen_x_t x; screen_y_t y; } center.
For Yuuka's spinning heart, it's the top-left corner in regular pixel coordinates. Much better represented as struct { screen_x_t x; screen_y_t y; } topleft.
And yes, singular. The game is actually smart enough to only store a single heart, and then create the rest of the circle on the fly. (If it were even smarter, it wouldn't even use this structure member, but oh well.)

Therefore, I decompiled it as 4 separate structures once again, bundled into an union of arrays.

As for Reimu… yup, that's some pointer arithmetic straight out of Jigoku* for setting and updating the positions of the falling star trails. While that certainly required several comments to wrap my head around the current array positions, the one "bug" in all this arithmetic luckily has no effect on the game.
There is a small glitch with the growing circles, though. They are spawned at the end of the loop, with their position taken from the star pointer… but after that pointer has already been incremented. On the last loop iteration, this leads to an out-of-bounds structure access, with the position taken from some unknown EX-Alice data, which is 0 during most of the game. If you look at the animation, you can easily spot these bugged circles, consistently growing from the top-left corner (0, 0) of the playfield:

After all that, there was barely enough remaining time to filter out and label the final few memory references. But now, TH05's MAIN.EXE is technically position-independent! 🎉 -Tom- is going to work on a pretty extensive demo of this unprecedented level of efficient Touhou game modding. For a more impactful effect of both the 100% PI mark and that demo, I'll be delaying the push covering the remaining false positives in that binary until that demo is done. I've accumulated a pretty huge backlog of minor maintenance issues by now…
Next up though: The first part of the long-awaited build system improvements. I've finally come up with a way of sanely accelerating the 32-bit build part on most setups you could possibly want to build ReC98 on, without making the building experience worse for the other few setups.

📝 Posted:: 2020-08-19 18:14 UTC
🚚 Summary of:: P0110
⌨ Commits:: 2c7d86b...8b5c146
💰 Funded by:: [Anonymous], Blue Bolt
🏷 Tags:

… and just as I explained 📝 in the last post how decompilation is typically more sensible and efficient than ASM-level reverse-engineering, we have this push demonstrating a counter-example. The reason why the background particles and lines in the Shinki and EX-Alice battles contributed so much to position dependence was simply because they're accessed in a relatively large amount of functions, one for each different animation. Too many to spend the remaining precious crowdfunded time on reverse-engineering or even decompiling them all, especially now that everyone anticipates 100% PI for TH05's MAIN.EXE.

Therefore, I only decompiled the two functions of the line structure that also demonstrate best how it works, which in turn also helped with RE. Sadly, this revealed that we actually can't 📝 overload operator =() to get that nice assignment syntax for 12.4 fixed-point values, because one of those new functions relies on Turbo C++'s built-in optimizations for trivially copyable structures. Still, impressive that this abstraction caused no other issues for almost one year.

As for the structures themselves… nope, nothing to criticize this time! Sure, one good particle system would have been awesome, instead of having separate structures for the Stage 2 "starfield" particles and the one used in Shinki's battle, with hardcoded animations for both. But given the game's short development time, that was quite an acceptable compromise, I'd say.
And as for the lines, there just has to be a reason why the game reserves 20 lines per set, but only renders lines #0, #6, #12, and #18. We'll probably see once we get to look at those animation functions more closely.

This was quite a 📝 TH03-style RE push, which yielded way more PI% than RE%. But now that that's done, I can finally not get distracted by all that stuff when looking at the list of remaining memory references. Next up: The last few missing structures in TH05's MAIN.EXE!

📝 Posted:: 2020-05-04 14:16 UTC
🚚 Summary of:: P0088, P0089
⌨ Commits:: 97ce7b7...da6b856, da6b856...90252cc
💰 Funded by:: -Tom-, [Anonymous], Blue Bolt
🏷 Tags:

As expected, we've now got the TH04 and TH05 stage enemy structure, finishing position independence for all big entity types. This one was quite straightfoward, as the .STD scripting system is pretty simple.

Its most interesting aspect can be found in the way timing is handled. In Windows Touhou, all .ECL script instructions come with a frame field that defines when they are executed. In TH04's and TH05's .STD scripts, on the other hand, it's up to each individual instruction to add a frame time parameter, anywhere in its parameter list. This frame time defines for how long this instruction should be repeatedly executed, before it manually advances the instruction pointer to the next one. From what I've seen so far, these instruction typically apply their effect on the first frame they run on, and then do nothing for the remaining frames.
Oh, and you can't nest the LOOP instruction, since the enemy structure only stores one single counter for the current loop iteration.

Just from the structure, the only innovation introduced by TH05 seems to have been enemy subtypes. These can be used to parametrize scripts via conditional jumps based on this value, as a first attempt at cutting down the need to duplicate entire scripts for similar enemy behavior. And thanks to TH05's favorable segment layout, this game's version of the .STD enemy script interpreter is even immediately ready for decompilation, in one single future push.

As far as I can tell, that now only leaves

.MPN file loading
player bomb animations
some structures specific to the Shinki and EX-Alice battles
plus some smaller things I've missed over the years

until TH05's MAIN.EXE is completely position-independent.
Which, however, won't be all it needs for that 100% PI rating on the front page. And with that many false positives, it's quite easy to get lost with immediately reverse-engineering everything around them. This time, the rendering of the text dissolve circles, used for the stage and BGM title popups, caught my eye… and since the high-level code to handle all of that was near the end of a segment in both TH04 and TH05, I just decided to immediately decompile it all. Like, how hard could it possibly be? Sure, it needed another segment split, which was a bit harder due to all the existing ASM referencing code in that segment, but certainly not impossible…

Oh wait, this code depends on 9 other sets of identifiers that haven't been declared in C land before, some of which require vast reorganizations to bring them up to current consistency standards. Whoops! Good thing that this is the part of the project I'm still offering for free…
Among the referenced functions was tiles_invalidate_around(), which marks the stage background tiles within a rectangular area to be redrawn this frame. And this one must have had the hardest function signature to figure out in all of PC-98 Touhou, because it actually seems impossible. Looking at all the ways the game passes the center coordinate to this function, we have

X and Y as 16-bit integer literals, merged into a single PUSH of a 32-bit immediate
X and Y calculated and pushed independently from each other
by-value copies of entire Point instances

Any single declaration would only lead to at most two of the three cases generating the original instructions. No way around separately declaring the function in every translation unit then, with the correct parameter list for the respective calls. That's how ZUN must have also written it.

Oh well, we would have needed to do all of this some time. At least there were quite a bit of insights to be gained from the actual decompilation, where using const references actually made it possible to turn quite a number of potentially ugly macros into wholesome inline functions.

But still, TH04 and TH05 will come out of ReC98's decompilation as one big mess. A lot of further manual decompilation and refactoring, beyond the limits of the original binary, would be needed to make these games portable to any non-PC-98, non-x86 architecture.
And yes, that includes IBM-compatible DOS – which, for some reason, a number of people see as the obvious choice for a first system to port PC-98 Touhou to. This will barely be easier. Sure, you'll save the effort of decompiling all the remaining original ASM. But even with master.lib's MASTER_DOSV setting, these games still very much rely on PC-98 hardware, with corresponding assumptions all over ZUN's code. You will need to provide abstractions for the PC-98's superimposed text mode, the gaiji, and planar 4-bit color access in general, exchanging the use of the PC-98's GRCG and EGC blitter chips with something else. At that point, you might as well port the game to one generic 640×400 framebuffer and away from the constraints of DOS, resulting in that Doom source code-like situation which made that game easily portable to every architecture to begin with. But ZUN just wasn't a John Carmack, sorry.

Or what do I know. I've never programmed for IBM-compatible DOS, but maybe ReC98's audience does include someone who is intimately familiar with IBM-compatible DOS so that the constraints aren't much of an issue for them? But even then, 16-bit Windows would make much more sense as a first porting target if you don't want to bother with that undecompilable ASM.

At least I won't have to look at TH04 and TH05 for quite a while now. The delivery delays have made it obvious that my life has become pretty busy again, probably until September. With a total of 9 TH01 pushes from monthly subscriptions now waiting in the backlog, the shop will stay closed until I've caught up with most of these. Which I'm quite hyped for!

📝 Posted:: 2020-04-16 13:09 UTC
🚚 Summary of:: P0086, P0087
⌨ Commits:: 54ee99b...24b96cd, 24b96cd...97ce7b7
💰 Funded by:: [Anonymous], Blue Bolt, -Tom-
🏷 Tags:

Alright, the score popup numbers shown when collecting items or defeating (mid)bosses. The second-to-last remaining big entity type in TH05… with quite some PI false positives in the memory range occupied by its data. Good thing I still got some outstanding generic RE pushes that haven't been claimed for anything more specific in over a month! These conveniently allowed me to RE most of these functions right away, the right way.

Most of the false positives were boss HP values, passed to a "boss phase end" function which sets the HP value at which the next phase should end. Stage 6 Yuuka, Mugetsu, and EX-Alice have their own copies of this function, in which they also reset certain boss-specific global variables. Since I always like to cover all varieties of such duplicated functions at once, it made sense to reverse-engineer all the involved variables while I was at it… and that's why this was exactly the right time to cover the implementation details of Stage 6 Yuuka's parasol and vanishing animations in TH04.

With still a bit of time left in that RE push afterwards, I could also start looking into some of the smaller functions that didn't quite fit into other pushes. The most notable one there was a simple function that aims from any point to the current player position. Which actually only became a separate function in TH05, probably since it's called 27 times in total. That's 27 places no longer being blocked from further RE progress.

WindowsTiger already did most of the work for the score popup numbers in January, which meant that I only had to review it and bring it up to ReC98's current coding styles and standards. This one turned out to be one of those rare features whose TH05 implementation is significantly less insane than the TH04 one. Both games lazily redraw only the tiles of the stage background that were drawn over in the previous frame, and try their best to minimize the amount of tiles to be redrawn in this way. For these popup numbers, this involves calculating the on-screen width, based on the exact number of digits in the point value. TH04 calculates this width every frame during the rendering function, and even resorts to setting that field through the digit iteration pointer via self-modifying code… yup. TH05, on the other hand, simply calculates the width once when spawning a new popup number, during the conversion of the point value to binary-coded decimal. The "×2" multiplier suffix being removed in TH05 certainly also helped in simplifying that feature in this game.

And that's ⅓ of TH05 reverse-engineered! Next up, one more TH05 PI push, in which the stage enemies hopefully finish all the big entity types. Maybe it will also be accompanied by another RE push? In any case, that will be the last piece of TH05 progress for quite some time. The next TH01 stretch will consist of 6 pushes at the very least, and I currently have no idea of how much time I can spend on ReC98 a month from now…

📝 Posted:: 2020-04-03 15:46 UTC
🚚 Summary of:: P0085
⌨ Commits:: 110d6dd...54ee99b
💰 Funded by:: -Tom-
🏷 Tags:

Wait, PI for FUUIN.EXE is mainly blocked by the high score menu? That one should really be properly decompiled in a separate RE push, since it's also present in largely identical form in REIIDEN.EXE… but I currently lack the explicit funding to do that.

And as it turns out, I shouldn't really capture any of the existing generic RE contributions for it either. Back in 2018 when I ran the crowdfunding on the Touhou Patch Center Discord server, I said that generic RE contributions would never go towards TH01. No one was interested in that game back then, and as it's significantly different from all the other games, it made sense to only cover it if explicitly requested.
As Touhou Patch Center still remains one of the biggest supporters and advertisers for ReC98, someone recently believed that this rule was still in effect, despite not being mentioned anywhere on this website.

Fast forward to today, and TH01 has become the single most supported game lately, with plenty of incomplete pushes still open to be completed. Reverse-engineering it has proven to be quite efficient, yielding lots of completion percentage points per push. This, I suppose, is exactly what backers that don't give any specific priorities are mainly interested in. Therefore, I will allocate future partial contributions to TH01, whenever it makes sense.

So, instead of rushing TH01 PI, let's wait for Ember2528's April subscription, and get the 25% total RE milestone with some TH05 PI progress instead. This one primarily focused on the gather circles (spirals…?), the third-last missing entity type in TH05. These are rendered using the same 8×8 pellet sprite introduced in TH02… except that the actual pellets received a darkened bottom part in TH04 . Which, in turn, is actually rendered quite efficiently – the games first render the top white part of all pellets, followed by the bottom gray part of all pellets. The PC-98 GRCG is used throughout the process, doing its typical job of accelerating monochrome blitting, and by arranging the rendering like this, only two GRCG color changes are required to draw any number of pellets. I guess that makes it quite a worthwhile optimization? Don't ask me for specific performance numbers or even saved cycles, though

Next up, one more TH05 PI push!

📝 Posted:: 2020-02-29 15:05 UTC
🚚 Summary of:: P0078, P0079
⌨ Commits:: f4eb7a8...9e52cb1, 9e52cb1...cd48aa3
💰 Funded by:: iruleatgames, -Tom-
🏷 Tags:

To finish this TH05 stretch, we've got ~~a feature that's exclusive to TH05 for once! As the final memory management innovation in PC-98 Touhou, TH05 provides~~ a single static (64 * 26)-byte array for storing up to 64 entities of a custom type, specific to a stage or boss portion. (Edit (2023-05-29): This system actually debuted in 📝 TH04, where it was used for much simpler entities.)

TH05 uses this array for

the Stage 2 star particles,
Alice's puppets,
the tip of curve ("jello") bullets,
Mai's snowballs and Yuki's fireballs,
Yumeko's swords,
and Shinki's 32×32 bullets,

which makes sense, given that only one of those will be active at any given time.

On the surface, they all appear to share the same 26-byte structure, with consistently sized fields, merely using its 5 generic fields for different purposes. Looking closer though, there actually are differences in the signedness of certain fields across the six types. uth05win chose to declare them as entirely separate structures, and given all the semantic differences (pixels vs. subpixels, regular vs. tiny master.lib sprites, …), it made sense to do the same in ReC98. It quickly turned out to be the only solution to meet my own standards of code readability.

Which blew this one up to two pushes once again… But now, modders can trivially resize any of those structures without affecting the other types within the original (64 * 26)-byte boundary, even without full position independence. While you'd still have to reduce the type-specific number of distinct entities if you made any structure larger, you could also have more entities with fewer structure members.

As for the types themselves, they're full of redundancy once again – as you might have already expected from seeing #4, #5, and #6 listed as unrelated to each other. Those could have indeed been merged into a single 32×32 bullet type, supporting all the unique properties of #4 (destructible, with optional revenge bullets), #5 (optional number of twirl animation frames before they begin to move) and #6 (delay clouds). The *_add(), *_update(), and *_render() functions of #5 and #6 could even already be completely reverse-engineered from just applying the structure onto the ASM, with the ones of #3 and #4 only needing one more RE push.

But perhaps the most interesting discovery here is in the curve bullets: TH05 only renders every second one of the 17 nodes in a curve bullet, yet hit-tests every single one of them. In practice, this is an acceptable optimization though – you only start to notice jagged edges and gaps between the fragments once their speed exceeds roughly 11 pixels per second:

And that brings us to the last 20% of TH05 position independence! But first, we'll have more cheap and fast TH01 progress.

📝 Posted:: 2020-02-23 16:56 UTC
🚚 Summary of:: P0076, P0077
⌨ Commits:: 222fc99...9ae9754, 9ae9754...f4eb7a8
💰 Funded by:: [Anonymous], -Tom-, Splashman
🏷 Tags:

Well, that took twice as long as I thought, with the two pushes containing a lot more maintenance than actual new research. Spending some time improving both field names and types in 32th System's TH03 resident structure finally gives us all of those structures. Which means that we can now cover all the remaining decompilable ZUN.COM parts at once…

Oh wait, their main() functions have stayed largely identical since TH02? Time to clean up and separate that first, then… and combine two recent code generation observations into the solution to a decompilation puzzle from 4½ years ago. Alright, time to decomp-

Oh wait, we'd kinda like to properly RE all the code in TH03-TH05 that deals with loading and saving .CFG files. Almost every outside contributor wanted to grab this supposedly low-hanging fruit a lot earlier, but (of course) always just for a single game, while missing how the format evolved.

So, ZUN.COM. For some reason, people seem to consider it particularly important, even though it contains neither any game logic nor any code specific to PC-98 hardware… All that this decompilable part does is to initialize a game's .CFG file, allocate an empty resident structure using master.lib functions, release it after you quit the game, error-check all that, and print some playful messages~ (OK, TH05's also directly fills the resident structure with all data from MIKO.CFG, which all the other games do in OP.EXE.) At least modders can now freely change and extend all the resident structures, as well as the .CFG files? And translators can translate those messages that you won't see on a decently fast emulator anyway? Have fun, I guess 🤷‍

And you can in fact do this right now – even for TH04 and TH05, whose ZUN.COM currently isn't rebuilt by ReC98. There is actually a rather involved reason for this:

One of the missing files is TH05's GJINIT.COM.
Which contains all of TH05's gaiji characters in hardcoded 1bpp form, together with a bit of ASM for writing them to the PC-98's hardware gaiji RAM
Which means we'd ideally first like to have a sprite compiler, for all the hardcoded 1bpp sprites
Which must compile to an ASM slice in the meantime, but should also output directly to an OMF .OBJ file (for performance now), as well as to C code (for portability later)
The custom build system I've been using since mid-August has some declarations for OMF .OBJ files, but it needs maybe 1 or 2 more weeks of polish to be shipped
Which I won't put in as long as the backlog contains actual progress to drive up the percentages on the front page.

So yeah, no meaningful RE and PI progress at any of these levels. Heck, even as a modder, you can just replace the zun zun_res (TH02), zun -5 (TH03), or zun -s (TH04/TH05) calls in GAME.BAT with a direct call to your modified *RES*.COM. And with the alternative being "manually typing 0 and 1 bits into a text file", editing the sprites in TH05's GJINIT.COM is way more comfortable in a binary sprite editor anyway.

For me though, the best part in all of this was that it finally made sense to throw out the old Borland C++ run-time assembly slices 🗑 This giant waste of time became obvious 5 years ago, but any ASM dump of a .COM file would have needed rather ugly workarounds without those slices. Now that all .COM binaries that were originally written in C are compiled from C, we can all enjoy slightly faster grepping over the entire repository, which now has 229 fewer files. Productivity will skyrocket!

Next up: Three weeks of almost full-time ReC98 work! Two more PI-focused pushes to finish this TH05 stretch first, before switching priorities to TH01 again.

📝 Posted:: 2019-12-29 20:23 UTC
🚚 Summary of:: P0064
⌨ Commits:: 80cec5b...9faa29a
💰 Funded by:: Touhou Patch Center
🏷 Tags:

🎉 TH04's and TH05's OP.EXE are now fully position-independent! 🎉

What does this mean?

You can now add any data or code to the main menus of the two games, by simply editing the ReC98 source, writing your mod in ASM or C/C++, and recompiling the code. Since all absolute memory addresses have now been converted to labels, this will work without causing any instability. See the position independence section in the FAQ for a more thorough explanation about why this was a problem.

What does this not mean?

The original ZUN code hasn't been completely reverse-engineered yet, let alone decompiled. Pretty much all of that is still ASM, which might make modding a bit inconvenient right now.

Since this push was otherwise pretty unremarkable, I made a video demonstrating a few basic things you can do with this:

Now, what to do for the last outstanding Touhou Patch Center push? Bullets, or resident structures?

📝 Posted:: 2019-11-29 00:15 UTC
🚚 Summary of:: P0060
⌨ Commits:: 29385dd...73f5ae7
💰 Funded by:: Touhou Patch Center
🏷 Tags:

So, where to start? Well, TH04 bullets are hard, so let's ~~procrastinate~~ start with TH03 instead The 📝 sprite display functions are the obvious blocker for any structure describing a sprite, and therefore most meaningful PI gains in that game… and I actually did manage to fit a decompilation of those three functions into exactly the amount of time that the Touhou Patch Center community votes alloted to TH03 reverse-engineering!

And a pretty amazing one at that. The original code was so obviously written in ASM and was just barely decompilable by exclusively using register pseudovariables and a bit of goto, but I was able to abstract most of that away, not least thanks to a few helpful optimization properties of Turbo C++… seriously, I can't stop marveling at this ancient compiler. The end result is both readable, clear, and dare I say portable?! To anyone interested in porting TH03, take a look. How painful would it be to port that away from 16-bit x86?

However, this push is also a typical example that the RE/PI priorities can only control what I look at, and the outcome can actually differ greatly. Even though the priorities were 65% RE and 35% PI, the progress outcome was +0.13% RE and +1.35% PI. But hey, we've got one more push with a focus on TH03 PI, so maybe that one will include more RE than PI, and then everything will end up just as ordered?

📝 Posted:: 2019-09-24 21:12 UTC
🚚 Summary of:: P0034, P0035
⌨ Commits:: 6cdd229...6f1f367, 6f1f367...a533b5d
💰 Funded by:: zorg
🏷 Tags:

Deathbombs confirmed, in both TH04 and TH05! On the surface, it's the same 8-frame window as in most Windows games, but due to the slightly lower PC-98 frame rate of 56.4 Hz, it's actually slightly more lenient in TH04 and TH05.

The last function in front of the TH05 shot type control functions marks the player's previous position in VRAM to be redrawn. But as it turns out, "player" not only means "the player's option satellites on shot levels ≥ 2", but also "the explosion animation if you lose a life", which required reverse-engineering both things, ultimately leading to the confirmation of deathbombs.

It actually was kind of surprising that we then had reverse-engineered everything related to rendering all three things mentioned above, and could also cover the player rendering function right now. Luckily, TH05 didn't decide to also micro-optimize that function into un-decompilability; in fact, it wasn't changed at all from TH04. Unlike the one invalidation function whose decompilation would have actually been the goal here…

But now, we've finally gotten to where we wanted to… and only got 2 outstanding decompilation pushes left. Time to get the website ready for hosting an actual crowdfunding campaign, I'd say – It'll make a better impression if people can still see things being delivered after the big announcement.

📝 Posted:: 2019-09-21 12:11 UTC
🚚 Summary of:: P0031, P0032, P0033
⌨ Commits:: dea40ad...9f764fa, 9f764fa...e6294c2, e6294c2...6cdd229
💰 Funded by:: zorg
🏷 Tags:

The glacial pace continues, with TH05's unnecessarily, inappropriately micro-optimized, and hence, un-decompilable code for rendering the current and high score, as well as the enemy health / dream / power bars. While the latter might still pass as well-written ASM, the former goes to such ridiculous levels that it ends up being technically buggy. If you enjoy quality ZUN code, it's definitely worth a read.

In TH05, this all still is at the end of code segment #1, but in TH04, the same code lies all over the same segment. And since I really wanted to move that code into its final form now, I finally did the research into decompiling from anywhere else in a segment.

Turns out we actually can! It's kinda annoying, though: After splitting the segment after the function we want to decompile, we then need to group the two new segments back together into one "virtual segment" matching the original one. But since all ASM in ReC98 heavily relies on being assembled in MASM mode, we then start to suffer from MASM's group addressing quirk. Which then forces us to manually prefix every single function call

from inside the group
to anywhere else within the newly created segment

with the group name. It's stupidly boring busywork, because of all the function calls you mustn't prefix. Special tooling might make this easier, but I don't have it, and I'm not getting crowdfunded for it.

So while you now definitely can request any specific thing in any of the 5 games to be decompiled right now, it will take slightly longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)

Only one function away from the TH05 shot type control functions now!

📝 Posted:: 2019-09-15 19:29 UTC
🚚 Summary of:: P0029, P0030
⌨ Commits:: 6ff427a...c7fc4ca, c7fc4ca...dea40ad
💰 Funded by:: zorg
🏷 Tags:

Here we go, new C code! …eh, it will still take a bit to really get decompilation going at the speeds I was hoping for. Especially with the sheer amount of stuff that is set in the first few significant functions we actually can decompile, which now all has to be correctly declared in the C world. Turns out I spent the last 2 years screwing up the case of exported functions, and even some of their names, so that it didn't actually reflect their calling convention… yup. That's just the stuff you tend to forget while it doesn't matter.

To make up for that, I decided to research whether we can make use of some C++ features to improve code readability after all. Previously, it seemed that TH01 was the only game that included any C++ code, whereas TH02 and later seemed to be 100% C and ASM. However, during the development of the soon to be released new build system, I noticed that even this old compiler from the mid-90's, infamous for prioritizing compile speeds over all but the most trivial optimizations, was capable of quite surprising levels of automatic inlining with class methods…

…leading the research to culminate in the mindblow that is 9d121c7 – yes, we can use C++ class methods and operator overloading to make the code more readable, while still generating the same code than if we had just used C and preprocessor macros.

Looks like there's now the potential for a few pull requests from outside devs that apply C++ features to improve the legibility of previously decompiled and terribly macro-ridden code. So, if anyone wants to help without spending money…

📝 Posted:: 2019-02-28 16:58 UTC
🚚 Summary of:: P0047, P0048
⌨ Commits:: 9a2c6f7...893bd46
💰 Funded by:: -Tom-
🏷 Tags:

So, let's continue with player shots! …eh, or maybe not directly, since they involve two other structure types in TH05, which we'd have to cover first. One of them is a different sort of sprite, and since I like me some context in my reverse-engineering, let's disable every other sprite type first to figure out what it is.

One of those other sprite types were the little sparks flying away from killed stage enemies, midbosses, and grazed bullets; easy enough to also RE right now. Turns out they use the same 8 hardcoded 8×8 sprites in TH02, TH04, and TH05. Except that it's actually 64 16×8 sprites, because ZUN wanted to pre-shift them for all 8 possible start pixels within a planar VRAM byte (rather than, like, just writing a few instructions to shift them programmatically), leading to them taking up 1,024 bytes rather than just 64.
Oh, and the thing I wanted to RE *actually* was the decay animation whenever a shot hits something. Not too complex either, especially since it's exclusive to TH05.

And since there was some time left and I actually have to pick some of the next RE places strategically to best prepare for the upcoming 17 decompilation pushes, here's two more function pointers for good measure.

📝 Posted:: 2018-12-30 01:46 UTC
🚚 Summary of:: P0043, P0044, P0045
⌨ Commits:: 261d503...612beb8
💰 Funded by:: -Tom-
🏷 Tags:

Turns out I had only been about half done with the drawing routines. The rest was all related to redrawing the scrolling stage backgrounds after other sprites were drawn on top. Since the PC-98 does have hardware-accelerated scrolling, but no hardware-accelerated sprites, everything that draws animated sprites into a scrolling VRAM must then also make sure that the background tiles covered by the sprite are redrawn in the next frame, which required a bit of ZUN code. And that are the functions that have been in the way of the expected rapid reverse-engineering progress that uth05win was supposed to bring. So, looks like everything's going to go really fast now?

📝 Posted:: 2018-12-26 17:57 UTC
🚚 Summary of:: P0025, P0026, P0027
⌨ Commits:: 0cde4b7...261d503
💰 Funded by:: zorg
🏷 Tags:

… yeah, no, we won't get very far without figuring out these drawing routines.
Which process data that comes from the .STD files. Which has various arrays related to the background… including one to specify the scrolling speed. And wait, setting that to 0 actually is what starts a boss battle?

So, have a TH05 Boss Rush patch: 2018-12-26-TH05BossRush.zip Theoretically, this should have also worked for TH04, but for some reason, the Stage 3 boss gets stuck on the first phase if we do this?

Here's the diff for the Boss Rush. Turning it into a thcrap-style Skipgame patch is left as an exercise for the reader.

📝 Posted:: 2018-12-16 23:34 UTC
🚚 Summary of:: P0023, P0024
⌨ Commits:: 807df3d...0cde4b7
💰 Funded by:: zorg
🏷 Tags:

Actually, I lied, and lasers ended up coming with everything that makes reverse-engineering ZUN code so difficult: weirdly reused variables, unexpected structures within structures, and those TH05-specific nasty, premature ASM micro-optimizations that will waste a lot of time during decompilation, since the majority of the code actually was C, except for where it wasn't.

📝 Posted:: 2018-09-02 19:33 UTC
🚚 Summary of:: P0018
⌨ Commits:: 746681d...178d589
💰 Funded by:: zorg
🏷 Tags:

What do you do if the TH06 text image feature for thcrap should have been done 3 days™ ago, but keeps getting more and more complex, and you have a ton of other pushes to deliver anyway? Get some distraction with some light ReC98 reverse-engineering work. This is where it becomes very obvious how much uth05win helps us with all the games, not just TH05.

5a5c347 is the most important one in there, this was the missing substructure that now makes every other sprite-like structure trivial to figure out.

📝 Posted:: 2018-04-21 13:52 UTC
🚚 Summary of:: P0008
⌨ Commits:: 47e5601...d62dd06
💰 Funded by:: -Tom-
🏷 Tags:

You could use this to get a homing Mima, for example.

⮜ Blog

⮜ List of tags

Showing all posts tagged th02-

What does this mean?

What does this not mean?

So, how long until source ports become a realistic project?

What does this mean?

What does this not mean?

Showing all posts tagged