Blog

Showing all posts tagged

📝 Posted:
💰 Funded by:
[Anonymous], Ember2528
🏷️ Tags:

Part 4! Let's get this over with, fix all these landmines, and conclude 📝 this 4-post series about the big 2025 PC-98 Touhou portability subproject. Unless I find something big in TH03, this is likely to be the final long blog post for this year. I wanna code again!

  1. The scope
  2. Retaining menu backgrounds in conventional RAM
  3. Resolving screen tearing landmines
  4. Intermission: Handling unused code on fork branches
  5. Merging TH02-TH05's OP.EXE and MAINE.EXE/MAINL.EXE
  6. Replicating TH02-TH05's GAME.BAT in C++
  7. Topping it off with an actual feature

As you can already tell by this table of contents, this "initial" cleanup work was quite larger in scope than its counterpart for 📝 the first TH01 Anniversary Edition release. Even that already took unexpectedly long 2½ years ago, and now imagine doing that across four games simultaneously while keeping all the little required inconsistencies in place. Then you'll get why this has taken over four months…
With an overall goal of "general portability", it's very tempting to escalate the scope towards covering everything in these menu and cutscene binaries. So I had to draw at least some boundaries:

Even then, this was way premature. Not only because we still need to maintain the memory layout of TH02's and TH03's MAIN.EXE, but also because of all the undecompilable ASM code in all four games that blocks certain architectural simplifications.
The biggest problem, however: I haven't quite decided on how to use static libraries within my build environment yet. Since Turbo C++ 4.0J's linker just blindly includes every explicitly named object file without eliminating dead code, static libraries are essential for reducing bloat by providing a layer of optional files to be included on demand. However, the Windows and DOS versions of TLIB are easily confused, TLIB's usual paradigm of mutating existing library files goes against Tup's explicit dependency graph, and should we really depend on an ancient proprietary tool for a job that I could reimplement in a few hundred lines? Famous last words, I know. But since I 📝 didn't want to do any dedicated build system work this year, I also didn't want to sort out these questions in a 12th or even 13th push. Leaving the build environment woefully ill-equipped for the complexity of this task was probably a mistake; while the resulting workaround of feature bundles does the job, it's very silly and hopefully won't last very long. I'm definitely going to spend some time sorting out the static library situation before I ever attempt something like this again. Or at some general point before we hit the overall 100% finalization mark, because we've still got that long-awaited librarization of ZUN's master.lib fork ahead of us.


Let's get to it then, starting with the feature that will remove lag in menus by removing PC-98-specific page-flipping and EGC code:

Retaining menu backgrounds in conventional RAM

At first, this seems to be no problem. We just swap out master.lib's .PI functions with our forked PiLoad and our generic blitter, and make sure to keep the images allocated. master.lib's graph_pi_load_pack() has always loaded .PI images into one big contiguous buffer in conventional RAM, so this shouldn't negatively affect the heap layout. If anything, we'd be saving memory by 📝 not allocating these extra two rows, right?
Unfortunately, it's that second goal that would turn out to be a massive problem. 📝 The end of part 1 already hinted at how the majority of menu backgrounds are only rendered to VRAM a single time before ZUN immediately frees them from memory. These cases are so common that I defined a macro for them:

#define pi_fullres_load_palette_apply_put_free(slot, fn) { \
	pi_load(slot, fn); \
	pi_palette_apply(slot); \
	pi_put_8(0, 0, slot); \
	pi_free(slot); \
}
At the current state of decompilation, this macro is used 16 times across TH02-TH05, and it will appear an additional 12 times by the time decompilation is done.

In these cases, the games only need that single 128 KiB block temporarily, and then get to reuse that memory for other, more dynamic graphics. Consequently, ZUN probably dimensioned the master.lib heap sizes for TH02-TH05 to leave ample headroom with this fact in mind. I wasn't so sure about deliberately limiting the amount of heap memory 📝 in late 2021 when I fixed the one out-of-memory landmine that remained in TH04, but I've begun to appreciate these memory limits quite a lot as the scope of my research has deepened. Specify the right amount of bytes, perform the single allocation from the DOS heap at startup, and if that allocation succeeds, you've removed an entire class of out-of-memory bugs from consideration. Sure, modders might prefer mem_assign_all() for simplicity during development, but it does make sense to return to a static limit when shipping. For once, ZUN was right, and there is no excuse. :godzun:

On the surface, this macro is equivalent to PiLoad's original direct-to-VRAM approach. And indeed, we can replace this code with a call to PiLoad's original code path in the few cases where we just want to show a static image without unblitting any of its regions later on, removing even the requirement for that temporary 128 KiB block in the process. But in the majority of cases, we do need these images in RAM, and ZUN's original heap sizes simply weren't intended for that.
But how much of a problem are ZUN's limits in practice? Well, there's at least one instance where retaining all images would require significantly more memory than ZUN anticipated. TH02's OP.EXE requests a 256,000-byte master.lib heap, but then wants to do this:

If you step through this video, you'll see that this effect indeed page-flips between the later menu background (128,000 bytes) and the three images with the resized text (3 × 54,720 = 164,160 bytes). Hence, we must have already loaded all four of them into the heap before this animation starts. While the menu's main text is rendered on the text layer, its shadow is part of the graphics layer and must be unblitted when switching to the option menu and back, thus requiring this image in at least some portion of memory.
Since VRAM page 1 always shows the unmodified menu background image, we could cheat, use PiLoad's direct-to-VRAM code path, and then do a second load of the image to conventional RAM on frame 19, before the white-in palette effect. 📝 Frame-perfection rule #2 would also allow us to do that. But adding a second load time is lame, especially because the white-in effect wouldn't even hide that many expensive calls in the debloated build anymore. After replacing the original final graph_pack_put_8() call for the main menu image with a faster planar blit, it only hides the draw calls for the ©ZUN text and some minor file and port I/O to load and apply the menu's final 48-byte palette from OP.RGB.

There's really no reason against just increasing the size of the master.lib heap to incorporate all of these four images at the same time. But how much additional memory do we actually need? Obviously, these four images are not the only allocations on the heap, which also needs to fit at least the following buffers:

We can bypass the super_buffer allocation through a dumb trick. But the single worst aspect hides between all these individual allocations:

Fighting heap fragmentation

Let's look at TH05's OP.EXE, whose heap limit of 336,000 bytes is much more lenient. This limit should be more than enough to fit the new additional 128,000-byte buffer for the background image in addition to the original heap contents on every menu screen, and we can indeed enter the main menu without any issue. But then, we're still greeted with an out-of-memory crash after entering and leaving the Music Room? Let's take a look into the master.lib heap, with the retained background images we'd like to have:

Step Heap layout Total remaining Largest free block Fragmentation loss
Entering the main menu op1.pi (128,000) sft*.cd2 + car.cd2 (7,360) sl*.cdg (90,240) scnum.bft + hi_m.bft (7,680) 76,352 66,240 10,112
Leaving the main menu scnum.bft + hi_m.bft (7,680) 301,984 234,608 67,376
Entering the Music Room music.pi (128,000) 📝 nopoly_B (32,000) scnum.bft + hi_m.bft (7,680) 139,536 66,240 73,296
Leaving the Music Room music.pi (!) (128,000) scnum.bft + hi_m.bft (7,680) 165,552 81,920 83,632
Re-entering the main menu .PI load buffer (16,384) sft*.cd2 + car.cd2 (7,360) sl*.cdg (90,240) scnum.bft + hi_m.bft (7,680) 179,760 115,648 64,112

Something gradually shreds our heap into tiny pieces that ultimately prevent us from allocating the main menu background image a second time. We can surely blame this on ZUN's suboptimal order of load calls that doesn't prioritize larger images over smaller ones, or on the 16 KiB .PI load buffer that we maybe should have allocated statically. But the biggest hidden offender turns out to be… master.lib's packfile implementation?!

Yup. Every time you load a file out of an archive, master.lib heap-allocates a 31-byte state structure and a file read buffer that master.lib originally dimensioned at 520 bytes. TH03's MAIN.EXE and MAINL.EXE then increased the size of that buffer to 4,104 bytes, before TH04 and TH05 went up to 8,200 bytes. :zunpet:
Ultimately though, it's not the size that's the problem here, but the fact that we repeatedly allocate any memory that could have been allocated once when setting up the INT 21h handlers. You'd think that master.lib went for dynamic allocations in order to support the fact that the INT 21h file API lets you open multiple file handles simultaneously, which would point to different RLE-compressed files within the archive. But no, master.lib doesn't even support this case, and even incorrectly returns FileNotFound if you attempt to open a second file from an archive before closing the first one! :onricdennat:

After I identified this whole issue, I immediately wanted to replace master.lib's packfile code with the much saner and more explicit C++ implementation from TH01. Sooner or later, we'll have to do this anyway because we can't just hook file syscalls on other operating systems the way we can hook them on DOS.
However, TH01's implementation would quickly turn out to have its own share of heap fragmentation issues. Every time the game loads an RLE-compressed file from 東方靈異.伝, the archived file is completely decompressed into a newly allocated temporary buffer, from where the game then copies out parts into the actual game structures. The resulting fragmentation is at least easily fixable though, and that's what the TH01 part of the very first push assigned to part 1 went to. Switching to a zero-copy architecture basically only required persisting the RLE state and brought a significant improvement: 15,776 bytes of heap memory during Stage 1 freed up by that switch alone, as reported by the coreleft() output seen in debug mode?! That much for just removing temporary allocations that the game was freeing anyway?
Let's check the Borland C++ DOS Reference for how this value is actually calculated. Turns out that it is simply intended to be a measure of unused RAM memory, and sure enough:

In the large data models, coreleft returns the amount of memory between the highest allocated block and the end of memory.

That's a reasonably meaningful measurement that can be determined in constant time, compared to the 𝑂(𝑛) operation of finding the true total size of available heap memory by walking over every node.

In the end though, rolling out this C++ implementation to the other four games was way premature and would have pushed this delivery way above 12 pushes. After all, both ZUN's and master.lib's code is still full of INT 21h file syscalls that would all need to be replaced. Conditionally, even, given the two binaries that are not yet position-independent…
Fortunately, .PI loading itself is just as much of an issue and can be worked around in the much simpler way I already spoiled when explaining 📝 the API changes I made to PiLoad: We simply hold on to the menu background pixel buffer for as long as possible. Ideally, we only allocate these 128 KiB once, decode every new menu background into that same buffer, and only explicitly free it when we really need to. That's why the platform layer logic requires full control over .PI buffer allocation.
This was enough to keep the required amount of additional heap memory to a more than acceptable level:

And after a few rewritten function calls, we've indeed removed every single EGC-powered inter-page copy from all menus and cutscenes of TH02-TH05! On to the next goal…


Resolving screen tearing landmines

…which requires individual solutions for every case that merely follow a common pattern. If things are already done close to a VSync wait loop and just in a slightly wrong order, the solution is easy and we just have to shift around a few function calls.
But what can we do if ZUN mutates visible pixels at some place far away from the last VSync wait loop? After all, a lot of these landmines result from confusing the current CRT beam position across multiple functions. Often, it's impossible to see at a glance where these menu-specific subfunctions are called within a frame without tracing execution back to the last VSync delay loop at the call site. For starters, it would be nice to clearly formalize that a specific section of code must be run in VBLANK.

master.lib's vsync_Proc function pointer already gets us most of the way there. Its VSync subsystem automatically calls any non-nullptr shortly after the VSync interrupt fires, and our task function would then set vsync_Proc back to a nullptr to ensure the intended one-shot behavior.
However, this approach can at best defer a task to the next VBLANK interval, which might leave us one frame behind the original game and hurt our frame-perfection goals. What we actually want is a conditional approach for timing-sensitive tasks, as a common operation that only requires a single line of code:

Now we're only missing that crucial one bit of information, which is delivered by Bit 5 of the graphics GDC's status register at I/O port 0xA0. In fact, ZUN uses the same bit in all hand-written VSync wait code throughout TH01 and in the bouncing-ball ZUN Soft logo:

void vsync_wait_via_gdc_polling(void)
{
	// Bit 5 of the graphics GDC register indicates VBLANK. Wait until this bit
	// is set.
	while((inportb(0xA0) & 0x20) != 0) {
	}

	// Once Bit 5 is no longer set, the CRT has started drawing the next frame.
	// I have no idea why you would ever want to throw away all your precious
	// vertical retrace time, but ZUN does this all throughout TH01.
	while((inportb(0xA0) & 0x20) == 0) {
	}
}

Of course, this only solves the problem in theory, as the tasks themselves don't come with any real-time guarantees. It's entirely possible for the resulting vblank_run() function to get called near the end of VBLANK, start the task immediately, and return long after the CRT beam has started drawing again. Heck, if the system is slow enough, the task might not even complete within the VBLANK interval if we run it immediately after VSync. But this is a much more complex problem to solve, requiring upfront measurements of both the VBLANK interval and the execution time for each potential task, which can then be factored into the run-now-or-defer decision. We definitely don't need to go there as long as we're mainly targeting emulated 66 MHz systems.

In easy cases, vblank_run() can then resolve screen tearing landmines completely by itself. Towards the end of PC-98 Touhou, ZUN's menus made more and more use of master.lib's blocking palette fading functions, which delay themselves to the next VSync signal and thus avoid any tearing issues. Hence, TH04's and TH05's screen tearing landmines are limited to the very few sudden palette changes that remained in these games:

void return_from_other_screen_to_main_menu(void)
{
	// Loads the .PI image into our persistent menu image background buffer,
	// and overwrites master.lib's 8-bit palette. Takes a few frames and
	// probably won't return during VBLANK.
	GrpSurface_LoadPI(bgimage, &Palettes, "op1.pi");

	graph_accesspage(0);
	bgimage.write(0, 0); // Planar blit
	PaletteTone = 100;   // Use original brightness in palette_show()

-	// ZUN landmine: Updating the hardware palette right now will most likely
-	// cause screen tearing.
-	palette_show();
+	vblank_run(palette_show);

	[…]
}

The fixes for the landmines in TH03 and TH02, however, require much more thought and care to stay as close to ZUN's defined logical frame sequence as possible. TH03's character selection screen, which prompted this whole subproject in the first place, houses one of the harder groups of landmines:

Recorded on DOSBox-X with 375,000 cycles, since exact machine specifications are not important to demonstrate these landmines.

Very finicky work, where every single branch has the potential to introduce an off-by-one-frame error, and vblank_run() doesn't help at all.

And then you reach TH02, which asks for way too much to happen within a single frame, in plain sight, and with no palette tricks to hide it. The screen transitions into and out of its HiScore screen are by far the worst example:

Recorded at 1.9968 × 17 = 33.9456 MHz for a change to magnify the jank that you perhaps wouldn't see at higher clock speeds.

These screen transitions exhibit no less than 6 landmines and 2 bugs:

  1. 💣 Frame 1 shows how TRAM (containing the actual menu text as gaiji) gets cleared immediately, but VRAM (containing the shadow) remains untouched as ZUN decides to load HUUHI.DAT first.

  2. 💣 While the following VRAM clear appears to produce a well-defined black frame 2, it's anything but well-defined, as the load operation only happens to conclude within VBLANK by sheer chance in this recording.

  3. 💣 Frame 4 is wild. First of all, the code still hasn't waited for a single VBLANK signal ever since entering the menu, and therefore shouldn't be writing to TRAM to begin with.
    But even then, you wouldn't expect to see only the name and nothing else on the scanlines of a score record in such a partial rendering. How can TRAM operations possibly be that slow? This almost seemed as if I was missing some crucial timing-related detail about the hardware. But in the end, what we're seeing here is simply Neko Project not actually using scanline rendering for the text layer. If you write to a TRAM cell, Neko Project just marks the entire 16-pixel row to be redrawn during the next screen refresh event.

  4. 🐞 Frame 5, then, is the first well-defined frame that actually renders the way it's defined in the code. The green 東方封魔録 logo is indeed only meant to be visible from the next frame onwards. This certainly meets all criteria for a bug, but the debloated build isn't allowed to fix those. In fact, it needs a dedicated conditional branch to preserve this bug.

  5. Once you leave the menu, you'll first have to sit through a stylistic and non-productive 20-frame delay, before… the screen switches back to the last frame rendered before the delay on frame 77?
    💣 By that point, we're technically already back to the main menu, where the first thing ZUN does is to switch from double-buffering back to single-buffering with VRAM page 0 shown. If you happened to leave the menu by hitting a key on the 50% of frames where VRAM page 1 is shown, the screen will therefore flip back to the frame rendered before the 20-frame delay, and keep it visible while master.lib decodes the title screen image.

  6. 💣 This decoding process finishes after ~20.4 frames in this recording, near the middle of frame 98. Clearly, we then have to immediately switch the hardware palette to the one we just loaded. Let's completely disregard that we're probably not in VBLANK, or that the screen is still showing the last High Score menu frame… :zunpet:

  7. Then, we need to get the image onto both VRAM pages. 📝 As we found out in Part 2, a low-clocked 386 is pretty much the most suboptimal system for master.lib's packed→planar conversion code, and 12 frames exactly match the performance we would expect from Neko Project at 33 MHz.
    💣 But that only rendered the image to the invisible VRAM page 1. We could now temporarily show page 1 after the next VSync signal to hide the pretty much guaranteed multi-frame VRAM writes… but nah, who cares except for some researcher 28 years later. By leaving VRAM page 0 on screen, ZUN doesn't even attempt to hide the jank that is about to occur. Once again, he reaches for master.lib's graph_copy_page(), 📝 whose slowness I already talked about in Part 2. At 33 MHz, Neko Project takes 3 frames to copy one page after another, leaving us with two frames of mixed pixels. This can be even worse on real hardware: On 📝 spaztron64's K6-2-upgraded and southbridge-bottlenecked PC-9821V166 model, this copy took 100 ms. I was able to watch every single bitplane getting individually copied in the recording. Unpacking the .PI image a second time would have been faster on that machine. :onricdennat:

  8. 🐞 Also, ZUN should have definitely cleared TRAM before the page copy instead of deferring this responsibility to the main menu rendering code. Since we then return to the main menu's VSync-timed loop and regularly wait for VSync while the scoreboard remains on screen and part of the current logical frame, this is not a landmine.

Compare this with the debloated version:

Then, I did that 13 more times for the other screen tearing landmines fixed in this build. And no, these new builds don't even fix every instance of this issue…


Intermission: Handling unused code on fork branches

Given that all of these improvements are taking place on the debloated branch, it's time to decide on how to handle the biggest unneeded obstacle in the way of our portability efforts, after 📝 I procrastinated this question 2½ years ago.
In Shuusou Gyoku, I've been trying to retain every single line of unused code in a dedicated directory, not least because that game has 📝 some very wild effects that should be reasonably preserved. The problem with this approach is that all this unused code quickly stopped compiling as I started to refactor the game into its current cross-platform state. For discoverability, this is still better than outright deleting the code and expecting people to read pbg's original codebase, but it's not all too practical either. :thonk:

In the ReC98 codebase, we have a different situation: All the unused code doesn't just exist at some old commit that maybe won't even compile going forward, but is an integral part of the master branch. Therefore, removing this code from fork branches is not only in line with their goals, but also completely non-destructive, since its compilable form on master keeps getting maintained for a handful of building platforms.
Then again, I like the added overview and discoverability of the Shuusou Gyoku approach. So let's meet in the middle: From now on, the debloated branch will only keep unused code in the form of its declarations and some short explanatory comments, in files within the unused/ directory whose names point to the actual implementations on the master branch.

Funnily enough, unused code wasn't even the main reason why TH01's ANNIV.EXE lost 10,834 bytes between the previous and current builds. Although TH01 is the one game with by far the most unused engine code, that code only made up 3,728 bytes of that difference. The rest came from the work surrounding the zero-copy unpacker and the few portability features that already made sense to be rolled out for this game. Yes, TH01 really is that bloated.


Merging TH02-TH05's OP.EXE and MAINE.EXE/MAINL.EXE

Onto the second most exciting feature, 📝 as motivated by the blog post from May! A true single-executable build 📝 never looked that viable for TH04 and TH05 to begin with, so let's just go for the one viable partial merge that makes sense for all of the four games. With all of MAINE.EXE/MAINL.EXE being position-independent, the remaining bunch of ASM code there isn't much of an obstacle either.

And once again, this merge means that we have to resolve all 📝 binary-specific inconsistencies at once. While ZUN thankfully eliminated most of them by the end of the PC-98 era, the scorefile code remained inconsistent until the very end, 📝 as Part 3 already mentioned. Hopefully, this is the second-to-last time I have to mention these formats…
Funnily enough, all of their most noteworthy inconsistencies are found in how these formats deal with corrupted files:

The actual merge then indeed delivers what we were hoping for: In three of the four games, the added unique code from OP.EXE and MAINE.EXE/MAINL.EXE comes in at far below the 20,512 bytes we freed by removing 📝 Borland's C++ exception handler, both in the binaries themselves and in their loaded in-memory state.
But it's TH05 where both OP.EXE's expanded Music Room and MAINE.EXE's Staff Roll and All Cast sequence add so much unique data that the initial merge ended up slightly larger than the size of the original MAINE.EXE. Getting the binary and run-time size of the new DEBLOAT.EXE below that point required every trick in the book and then some. The more critical tricks were good ideas in their own right:

But the rest of them definitely crossed over into silly micro-optimization territory:


Alright, another idealistic bonus goal reached! That means we're only missing a single aspect to reach feature parity with the debloated TH01 build:

Replicating TH02-TH05's GAME.BAT in C++

In TH03, this is slightly more involved. We not only need to launch PMD using this technique, but also apply it to the INTvector set program and SPRITE16. 📝 You know the way this goes:

…actually, wait a moment! TH02-TH05 don't even use the C heap, and the master.lib heap works in a completely different way. Borland implemented their heap in a traditional sbrk(2) style that dynamically resizes a program's main DOS memory block as needed, which is how we end up with the whole concept of the heap growing in one direction. master.lib, however, needs to place its heap entirely within a single pre-allocated and fixed-size block of memory. And since mem_assign_dos() simply allocates this block via the usual DOS INT 21h, AH=48h API, it doesn't matter where on the DOS heap this block is located. This means that we don't even have to do the error-prone witchcraft of pushing these TSRs to the top of conventional memory that we had to do for TH01! Right?

Just like last time, these visualizations are not even remotely to scale.
Also, note the small gap between SPRITE16 and PMD. That's supposed to be PMD's environment segment, and DOS does allocate it, but PMD explicitly frees it before going resident. Sure, it's just a few bytes on real DOS, but still, very considerate!

Unfortunately though, the fixed position of all these TSRs would still prevent the game allocation from being replaced with a binary that asks for more memory than the one this block was initially allocated for. In TH01, this would have been a minor issue because it only applied to hot-reloading the single DEBLOAT.EXE or ANNIV.EXE that contains all game code. For the other four games, however, we still keep the larger MAIN.EXE as a separate binary, and most likely will do so for the foreseeable future. And we're surely not getting into the business of moving already allocated TSRs…
So we're back to the technique from two years ago after all. Let's precalculate the size of each TSR, push that TSR to the top of conventional RAM by temporarily claiming all free memory minus its expected size, and then we get…

…a ZUN.COM spawn failure from DOS as we try to start the ZUNINIT sub-binary. :zunpet:
Yup. Thanks to ZUN's fantastic idea of bundling these small utility tools and TSRs into a single binary that's larger than each individual TSR, we can't just reuse the strategy that worked for TH01. DOS must load the entirety of ZUN.COM into conventional RAM before the bundling code gets a chance to shift the selected sub-binary to the top of the program's memory block and then reduce the size of that block.
So how are we going to solve this?

Sure, we could allocate the resident structure into the gap left by ZUNINIT's instance of ZUN.COM, but that's just a few bytes. Keep in mind that the (env.) blocks are much smaller than they appear here. Thus, the free blocks are also a lot larger in reality, but still not large enough to fit PMD and its music buffer.

But if we can't get rid of ZUN.COM's high load-time memory requirements, how about using that memory more productively? Is there a way we could maybe spawn the other TSRs into the hole left by ZUN.COM after it went resident?
Let's take a step back from individual TSRs and instead look at the full picture of spawning a bundle of TSRs in a defined order. First, we determine both the binary size (file size of the .COM binary + Program Segment Prefix + 256 bytes of stack) and the resident size (the size of its memory block after it goes resident) of each TSR. With these metrics, we can calculate a minimum and resident size for the full bundle by simulating the TSR spawns in order:

uint32_t bundle_size_min = 0;
uint32_t bundle_size_resident = 0;

for(const auto& tsr : tsrs) {
	// Since DOS has freed all excess binary memory before we get to spawn a new TSR, the new
	// one will end up next to the previous resident allocations. We only need to consider
	// the previous minimum size because it might be larger than the one we calculate here.
	bundle_size_min = std::max((bundle_size_resident + tsr.size_binary), bundle_size_min);

	bundle_size_resident += tsr.size_resident;
}

Let's step through the bundle construction for TH03:

TSR Binary Resident Bundle minimum Bundle resident Naive
ZUNINIT (ZUN.COM) 23,276 1,056 23,276 1,056 23,276
SPRITE16 (ZUN.COM) 23,276 36,528 24,332 37,584 59,804
PMD86.COM 29,295 30,144 66,879 67,728 89,948

Then, we only need to resize our main memory block a single time to leave a gap at the top of conventional RAM whose size matches the larger of the minimum or resident bundle sizes. If we then spawn the TSRs into this gap, we indeed save 22,220 bytes over the naive approach! Let's visualize the resulting memory layout with TH02 because there's a nice detail with MMD and PMD:

MMD also frees its environment segment before going resident, so we can subtract the size of one environment segment from the reserved heap size if we spawn both PMD and MMD.
Unfortunately, one of these gaps will always remain with this approach. We could get rid of it by spawning MMD and PMD first, which would merge it into the remaining free memory. But then, we'd have nothing to spawn into the hole left by the ZUN.COM binary for ZUNINIT, and we'd be back to the naive fragmented situation above. Thus, this trick remains unique to TH02.

However, there's one crucial detail in all of this that would prove to be more complicated:

Calculating correct resident sizes

In TH01, this was no big deal. MDRV98 was the only TSR we had to care about, and there was no reason not to just replicate its simple resident size calculation within the code. After all, people would either run the version bundled with the game or the smaller previous version if they played on a real-hardware CanBe model. No one really cares about MDRV98 beyond that level; the driver is almost universally disliked for just not being PMD, which managed to attract a sizable community, documentation, and even new developments over the years. A PMD port of TH01 has been one of the most common mod requests as well.
The TSRs in later games, however, are much more flexible. We compile both ZUNINIT and SPRITE16 from source and should therefore expect people to mod them, but these two in particular might just be considered uninteresting and static enough to justify hardcoding their sizes. But this approach utterly breaks with PMD, whose chip-specific variants come in multiple versions depending on the game:

:th02: TH02 :th03: TH03 :th04: TH04 :th05: TH05
PMD.COM 4.8l (1996-12-28)
14,336 bytes
PMDB2.COM
(ADPCM)
4.8l (1996-12-28)
18,496 bytes
4.8o (1997-06-19)
18,592 bytes
PMD86.COM
(86PCM)
4.8l (1996-12-28)
19,904 bytes
4.8o (1997-06-19)
19,984 bytes
PMDPPZ.COM
(PPZ8/CanBe)
4.8l (1996-12-28)
20,768 bytes
4.8o (1997-06-19)
21,024 bytes
The PMD versions that ZUN shipped with each game. The byte size refers to the in-memory TSR size without any music, voice, or effect data added on top.

In theory, nothing stops us from hardcoding these sizes for each game as well. But these physical details about specific PMD versions are even less of a property of the game. There's no reason why modders shouldn't be able to replace any of the hardware-specific driver versions with any other – and given the sizable PMD composer and arranger community, this is a much more likely kind of mod to happen. SSG-EG, anyone?

But how could we figure out the required resident size of arbitrary PMD versions without hardcoding anything? From the outside, we can only really know for sure by running the driver and seeing how much memory it keeps resident…
…so that's exactly what we need to do. The merged binaries spawn each driver three times during setup – once to figure out its size, a second time to remove this test TSR, and a third time to respawn the TSR at its designated place at the top of conventional memory. And if we have such a system in place, nothing stops us from applying it to all other TSRs as well, removing the need to precalculate or hardcode any size… well, except for SPRITE16, which still needs a hack to factor in its extra two blocks on the DOS heap. In TH03, these 2×3 additional processes do slow down startup by about 6 frames on our target 66 MHz Neko Project configuration when compared to the batch file, which should still be tolerable relative to the .PI load times we removed by switching to PiLoad.

The whole feature has a few other nice properties as well:

And then you test with the actual ZUN.COM and notice that you're still not done:

TSR Binary Resident Bundle minimum Bundle resident
ZUNINIT (ZUN.COM) 13,394 784 13,394 784
PMD86.COM 29,383 30,224 30,167 31,008
TSR Binary Resident Bundle minimum Bundle resident
ZUNINIT (ZUN.COM) 65,968 784 65,968 784
PMD86.COM 29,383 30,224 66,752 31,008

It's this TH04 issue that raises the question of whether this whole TSR bundling solution was even worthwhile in the first place. It sure was an interesting problem to solve, but it'd be much simpler and less bloated to just integrate the INTvector set program into every binary. For TH03, we could similarly integrate all SPRITE16 functionality directly into DEBLOATM.EXE/ANNIVM.EXE and still end up with a smaller-than-original binary after removing Borland's C++ exception handler. That would leave PMD and MMD as the only TSRs we'd need to spawn from C++, and those do have good reasons to be separate from game code.
Oh well, gotta get TH03's MAIN.EXE position-independent first…

Also, the usual caveats from two years ago still apply. This whole trick of pushing TSRs to the top of conventional RAM still relies on witchcraft that may not work on certain DOS kernels. For developers, tinkerers, and people who know what they're doing, it does succeed at nicely decluttering the game directory. But for… ahem, distributors, I still recommend shipping the modified version of GAME.BAT and GAMECB.BAT in the package below to defend against any potential stability issues.


Finally, if the performance improvements aren't enough of a reason to upgrade to these new builds, how about an actual new feature? TH03's Anniversary Edition now lets you quit out of the VS Start menu via either ESC or a new menu item, without going through the Select screen. 🙌

Screenshot of the VS Start menu in the P0323 build of TH03's Anniversary Edition, showing off the new Quit option and the version text
Matching the style of the version text to the style of ☪ The Phantasmagoria of Dim. Dream on the other side seemed like the least bad option here. That outline is indeed created by rendering every line 9 times…

And with that, I'm finally done with 2025's most indulgent subproject! Let's quickly check the overall impact on the codebase:

$ git diff --stat debloated~193 debloated -- . ":(exclude)Tupfile.lua" ":(exclude)build_dumb.bat" ":(exclude)unused/"
[…]
259 files changed, 4145 insertions(+), 8099 deletions(-)

That's almost 4,000 lines of ad-hoc PC-98-native graphics code, bloat, landmines, bloat- and landmine-documenting comments, and binary-specific inconsistencies removed from game code, in exchange for…

git diff --stat master~203 master -- platform/
[…]
28 files changed, 2213 insertions(+), 258 deletions(-)

…not even half that many additional lines in the platform layer. And here's what all of this compiles to:

Richard Stallman cosplaying as a shrine maiden ReC98 (version P0323) 2025-09-29-ReC98.zip

After the Shuusou Gyoku debacle and the many last-minute fixes that cropped up while I was writing this post, I'm not particularly confident in these builds, despite the weeks of testing that went into them. Still, we've got to start somewhere. At least for TH03, we're bound to quickly find any issues that slipped through the cracks while I'm implementing netplay into the Anniversary Edition.

Next up: The very quick round of 📝 Shuusou Gyoku maintenance and forward compatibility I announced in April, to clear out the backlog a bit. This whole series also really stretched the concept of what 11 pushes should be, so I'll charge 2 pushes for that maintenance round to compensate. In exchange, I'll also incorporate a small bit of new Windows 98 feature work, since it fits nicely with the cleanup work.

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous], Yanga, Ember2528
🏷️ Tags:

Part 3 of 📝 the 4-post series about the big 2025 PC-98 Touhou portability subproject, and we actually get to move some percentages on the front page with this one! For once, there truly isn't a lot to mention about most of these five disconnected small-feature decompilations, so let's go for more of a touhou-memories style and string together a few shorter bullet points and paragraphs. For even greater brevity, I'll also use the ZUN code issue emoji you might already know from Twitter or Bluesky: 🐞 denotes a bug, 💣 denotes a landmine, and 🎺 denotes a quirk.

  1. Revising TH02's main menu
  2. Finishing TH03's High Score menu
  3. TH04's title screen animation
  4. TH05's All Cast sequence
  5. Finishing TH03/TH04/TH05 scorefiles
  6. A typo in TH04 Reimu A's Good Ending

Revising TH02's main menu

This was one of those old decompilations from 2015 that I really wanted to bring up to current standards before the debloated branch would roll out the new more portable and performant blitting code. Replacing the magic-number coordinates with constants and calculations revealed 📝 the usual off-by-one text positioning bugs in the Option menu, despite ZUN still using monospaced text in this game…
As for more unique and exciting details in this screen: ZUN's defined gaiji strings contain an unused adaptation of TH01's blinking HIT KEY text. On screen, it might have looked something like this:

Mockup of the TH01's &HIT KEY& string in TH02
This string is so unused that we don't even know its intended position, though.

Finishing TH03's High Score menu

At the end of 2021, 📝 I already decompiled most of this menu, but left two functions in ASM due to push scope constraints. Originally, I thought that this menu would need a few changes to address a certain scorefile inconsistency I'll mention in Part 4, but I ended up finding a better solution. Still, we got one interesting discovery per function out of it:


TH04's title screen animation

This decompilation was necessary because its palette manipulation code did the very dubious thing of accessing the palette in a freed .PI slot. I don't think that the stylish effect of separately whiting in the image's black outlines is appreciated enough. And yes, that formally was the last non-RE'd tiny bit of any OP.EXE binary!

Also note that single black pixel in Reimu's gohei. :zunpet:

TH05's All Cast sequence

This sequence contained the last not yet decompiled instance of 📝 masked crossfading, which the debloated branch wants to replace with our single optimized implementation.
Most picture and text cues in this sequence are synced to the BGM, using PMD's AH=05h function to retrieve the current measure. And yes, that's measures, which is indeed the only time unit you get from PMD. The cues appear to be timed based on beats rather than measures, but the secret there is that ZUN simply wrote Peaceful Romancer in the internal time signature of 1/4. Just in case anyone tries to mod this BGM and starts wondering why the sequence suddenly progresses more slowly. I'll just use beats below since it's shorter.
Any cues that don't appear synced only do so because of – you guessed it – weird ZUN code issues.


Finishing TH03/TH04/TH05 scorefiles

Well, at least as far as decompilation is concerned. Cleaning up all these binary-specific inconsistencies on the debloated branch will be just as annoying as reconstructing them in the first place, and I won't even get it all the way done within these 11 pushes. TH05 made this even worse by continuing its general trend of taking TH04's slightly bloated but overall fine C++ code and needlessly rewriting it in micro-optimized and only semi-decompilable ASM. If you still believe that the master branch is a good foundation for any kind of serious work, this file should convince you otherwise.
Two more discoveries here:


Finally, I stumbled over a script bug in TH04's Good Ending for Reimu A:

Screenshot of the &,4魔理沙& script bug in TH04's Good Ending for Reimu A
The 2014 static English patch fixes this issue. That's probably why this isn't talked about anywhere.

This looks unintentional, and the same line in Reimu B's Good Ending confirms that this is indeed a typo:

\p,ed07.pi \=0,4 魔理沙:なんだよ、そりゃ\ga9\s160\c
\p,ed07.pi \==0,4 魔理沙:なんだよ、そりゃ\ga9\s160\c

The 📝 cutscene command reference tells us that the line in the Reimu B variant is preceded by \==, the picture crossfading command, followed by both possible parameters, 0 and 4. Reimu A's script, however, lacks that second = and instead spells out \=, the immediate picture display command, which doesn't take a second parameter. Thus, the command stops reading after the 0 and leaves the trailing ,4 as text to be displayed in the newly started box. The line break is then ignored as usual, causing 魔理沙 to be displayed right next to these two characters.

Whew! Once again, this did turn into more of the typical ReC98 research by the end after all. :godzun: And that was just 75% of the pushes assigned to this post, because the rest already went towards the debloating work. Next up: Concluding this series and actually applying all this research to the games.

📝 Posted:
💰 Funded by:
Splashman, Ember2528
🏷️ Tags:

Talk about a nerd snipe! I just wanted to take the first meaningful step towards getting PC-98 Touhou portable. But then, that step massively escalated and resulted in not only the single biggest subproject of 2025, but also in the most productive dev cycle this project has seen since the beginning of the crowdfunding era. 405 commits over 11 pushes, and touching on so many topics that writing a single blog post would have been way too much for even me to handle. So let's try something new and split this delivery into four "smaller" and thematically more focused posts that I'll release in quick succession:

  1. Deciding on a porting strategy
  2. Accurate slowdown
  3. Picking a CPU clock speed for emulators
  4. Frame-perfect
  5. Port implementation thoughts

So, how do we get the PC-98 Touhou codebase into a portable state? That entirely depends on what kind of port we want in the first place, and how much of ZUN's code we are willing to change. Three particularly efficient options immediately come to mind:

  1. On one end of the spectrum, we have a preconfigured PC-98 emulator with disabled configuration options and a stripped-down UI that tricks people into believing they're playing a port and prevents them from accidentally breaking the working configuration.
    This might sound like a joke, but it's unironically the most efficient and pragmatic solution that will be good enough for the overwhelming majority of players. If you ask people what they expect from a port, they primarily name ease of use and not having to configure emulators. Both of these can be solved with a preconfigured emulator and thus don't justify the monumental engineering effort of the more complex porting methods described below. That effort also wouldn't be justified if people just wanted a port and had no standards regarding its technical implementation, besides maybe no input lag. Someone has to put in the effort to solve every little challenge on the way from PC-98 to modern systems, and if that effort is not appreciated…

    By the way, I have no idea what people are talking about when they claim that PC-98 Touhou has input lag, because there sure is nothing like that in the code that would indicate anything above 1 frame / 17.7 ms for the in-game portions. Any investigation into these issues would therefore have to come from someone else, I'm afraid. Everything points to input lag being the result of misconfigured emulators.

    This is not like Shuusou Gyoku, where a port to modern APIs made sense because almost every subsystem still performs suboptimally on modern Windows even after you set up DxWnd, a better MIDI synth, and whatever people are using to make modern gamepads work with ancient DirectInput these days. If you correctly set up a PC-98 emulator, the games do run at full speed, and are highly likely to continue running fine after emulator and operating system version updates.
    Thus, can we conclude that wishing for ports is primarily a symptom of the Touhou community's past failure and negligence to spread preconfigured emulators to people? Because this surely shouldn't be a problem in this day and age anymore? While I did my part way back in 2013, it would take until spaztron64's 2021 package for the community at large to finally wake up and realize that this was a problem. Nowadays though, we have at least three decent packages made by separate people that have my personal seal of approval. And yes, this even includes the offering you can obtain at a certain mountaintop place of worship. That site used to be infamous for pushing out slop that violated their own mission statement and externalized costs to the tech support departments of their supply chain, but I'm glad to announce that they've leveled up and now provide a decent solution. And once they remove that archive inside their archive, it will be even better!
    Still, if your emulator configuration guides are presented more prominently than your preconfigured emulator downloads, you're doing a disservice to the community. Make guides available, yes, but clearly label them as background information for people who already played the games and then got curious about this old Japanese computer architecture.

  2. OK, but what if you do have standards and would appreciate a technically more solid port that removes layers and maybe even improves the games beyond the limits of the PC-98's architecture? If we take a single step towards native code and native performance, we end up with what people call a "static recompilation" these days. As I explained in the FAQ entry I wrote last year, this kind of port would still emulate the graphics, sound, input, and memory subsystems of a PC-98, but it would cut out CPU emulation.
    For PC-98 Touhou, this is actually quite a huge deal: CPU speed is the single biggest point of contention when configuring PC-98 emulators for Touhou, and the vastly different x86 cores of each emulator result in vastly different performance characteristics once you start to benchmark them all more thoroughly. With no more CPU cycles to count, we'd also lose all the VRAM access latencies that emulators typically strive to replicate, and thus pretty much guarantee 0% slowdown in the resulting port. While the aforementioned kind of modded emulator could theoretically also remove cycle counting and VRAM latencies, it would still interpret x86 instructions and thus have a harder time actually reaching the native performance required for 0% slowdown.

    This kind of port would also find immediate acceptance within the gameplay community. Since it would only take ZUN's original binaries as input and ignore our reconstructed source, we're guaranteed to retain the exact gameplay logic. The entire instruction translation process would be automated, leaving no room for modernizing the codebase by hand 📝 and accidentally breaking gameplay. We'd still have to defuse at least a few landmines to get the port running without issue, but those would be limited to things like filename casing, for example. Nothing even remotely close to gameplay code.

  3. On the other end of the spectrum, we have something like uth05win: A fully native rewrite of the graphics code that takes every liberty and cuts every corner it needs to rework the game into something that naturally renders within a modern graphics API of our choice. Unlike uth05win, however, our ports will be based on complete decompilations and thus retain the original gameplay code instead of freely rewriting certain parts because they look strange. In turn, we would basically scrap all of ZUN's menu and cutscene code and write quirk-free and sane replacements. Part 4 will drive home just how much more relaxing this course of action would have been…
    There's certainly an argument to be had that a modern port should reimagine the game to look and feel as modern as you can get within the original assets, and not stick to PC-98 limitations. After all, the unmodified PC-98 version is always there for you to play on your correctly configured emulator, right? In fact, if we ever wanted to port the games to weaker systems or consoles, this kind of port would be our only option.

But as you might have guessed, we're not going for either of these options:

  1. The first option doesn't even need anything from ReC98. Even the sleekest imaginable release could be done by anyone who either knows about PC-98 emulation or keeps in contact with someone who does, and is comfortable messing around with emulator source code. In fact, I'm not even a particularly qualified person for this job; I frequently mess with emulator configurations for research reasons, and then forget the correct values for certain obscure settings. :tannedcirno:
    This is such an obvious and efficient move that I seriously wonder why nobody has done it so far… but then again, I thought the same about every other idea I ended up doing myself in this space over the past 15 years. If that idea sounds great to you, feel free to go ahead – it represents the opposite of what this project is about, so the resulting fame is yours for the taking. If y'all see "ports" popping up from a place that isn't this project in the not-too-distant future, you can be pretty sure that their developers followed this strategy.

  2. The second option would indeed be an interesting project in its own right, as I've stated in the FAQ entry. But if you remember 📝 the last time I thought about static recompilation, I was way more excited for recompiling the old compiler we use for the PC-98 code rather than the games themselves. Ironically, this is primarily because of how much a recompilation would complicate the new features we plan to add to the games. Since I can only develop new features on top of a previous reverse-engineering effort, they will necessarily remain tied to the PC-98-native version of the codebase at first. How would we port them, then?

    • Do I continue developing these features for the PC-98 and then simply recompile them along with the rest of the game? The issue with that approach is that most features won't have a version that could work with the original ZUN codebase that we'd prefer to recompile. For everyone's sanity, most features will only exist as part of a respective game's anniversary branch, which in turn is based on the rearchitected and de-landmined debloated branch. Recompiling these branches would undermine the entire selling point of delivering the pure, untainted ZUN code that would have probably convinced the gameplay community to invest in this strategy in the first place. It might be good enough for the rest of the community, but if I'm going to rearchitect the PC-98 codebase anyway, would there even be a point in developing the required recompilation techniques on the side? Would this give us ports faster than following a more classical approach?
    • Then again, I could still try slicing out the code for these features in a way that would allow them to be shared between the rearchitected PC-98 and recompiled ZUN codebases. But that's bound to create an unnatural and awkward mess that's probably even worse than the way I have to arrange ZUN's code on the unmodified master branch. I'd definitely charge extra for that.
    • Do I just copy-paste and maintain two versions of the feature code for both platforms, manually transferring all required reverse-engineering to the recompilation? That might feel very dull, but it's probably more efficient than any attempt at sharing that code.
    • Or do I just abandon the PC-98-native codebase? In favor of a pseudo-PC-98 codebase that still very much assumes PC-98 hardware but doesn't actually run on real or conventionally emulated PC-98 hardware… :thonk:

    The last point in particular demonstrates just how little of a help a recompilation would actually be. Since it would continue to emulate the PC-98's graphics system, I'd still have to write any new graphics code against the PC-98's planar and two-page VRAM. Automatically porting the games to a friendlier and more generic rendering paradigm is infeasible for even an advanced recompiler: Every part of the original game expects PC-98 hardware, and a generic rewrite requires engineering decisions at a much higher level than the individual x86 instructions a recompilation operates at.
    And ultimately, it's these individual features that people should be (and mostly are) hyped for. Community-usable replays, translations, and TH03 netplay can all be implemented natively on PC-98. Sure, netplay would be easier to develop and easier to use within a TH03 recompilation since we can just use the native network stack of your host OS 📝 without any intermediaries. But developing both a recompiler and netplay would still take longer than 📝 following through with our current PC-98-native plan.

  3. The third option is actually quite popular, or would at least be acceptable to the majority of the general fandom. This is what non-technical people have in mind anyway when they think about ports, even if they don't confuse ports with remakes.
    To find out just how acceptable such a port would be, I picked screen fade effects as a representative detail for the corners that such a port would cut, and asked how people judge the natural alpha-blended implementation in uth05win against the palette-based method you'd use on a PC-98. Surprisingly, a whopping 79% of respondents don't have any problem with a port using whatever is most natural for the system it runs on. And that's 79% of my audience, which certainly is at least somewhat aware of PC-98 hardware details and the limitations that shaped these games into what they are. Of course, the 21% of die-hard PC-98 supremacists would then loudly complain that such a choice would make the port literally unplayable, but we could easily dismiss them by pointing to the poll where the community decided in favor of the smoother option. After all, ZUN's intention was to have a fade, and manipulation of a 12-bit color palette was simply the only tool he had on a PC-98.

    However, the gameplay community has much higher hopes for ReC98. Both them and I don't just want to supplement the original PC-98 versions with something that's playable on modern systems, but

    > replace the need for the proprietary, PC-98-exclusive original releases and their emulation for even the most conservative fan

    as I wrote back in 2014. Sure, the community can manage spreading pre-configured emulators for a few more years, but wouldn't it be great if they could stop doing that at some point in the far future?

So if all the "easy" solutions either don't have much of a purpose or disappoint in some way, we're only left with the hard one: A classic, manual port done primarily for the sake of solving an engineering challenge. But hey, this means that it'll also produce tons of blog posts for all of you to read, which apparently is at least equally as popular as actually playing the games. :onricdennat:
Here's what we're going to do:

Sure, the main drawback here is the immense development effort required. But in exchange, the port retains readable and moddable code and continues to deliver the insights that this project has always stood for. Imagine stepping through gameplay code using a native C/C++ debugger at your native screen resolution!


But before we can get to how I'm going to do all that, there are two popular misconceptions I have to address.

Accurate slowdown

The initial version of Maribel Hearn's new emulator guide for PC-98 Touhou had the following sentence that spaztron64 and I successfully lobbied against:

Note that none of the emulators have accurate slowdown; the slowdown will not match real hardware.

Objectively, this is a true statement. Neko Project's i386 core is the closest thing to cycle-accurate PC-98 emulation we have, as its per-instruction cycle counts match Intel's documentation. But even its performance characteristics are wildly inaccurate compared to a real PC-98 system with a 386, as we're going to see in the next blog post.
The problem I have with this sentence is that it's very misleading in this specific context. The mere mention of accurate slowdown in a beginner's guide on PC-98 emulation paints said slowdown as something desirable and worthy of preservation. It evokes stories of console speedrunners and emulator developers who deal with fixed, well-defined hardware where the concept of accurate slowdown makes sense. Stories that probably originated from a time before decompilations of classic games became commonplace, when it was hard to say whether a particular instance of slowdown was intended or not. And even with a decompilation, these things remain a matter of interpretation if you can't ask the original developer. Thus, it's completely understandable why observable behavior of real hardware remains the one benchmark of accuracy and quality that people can understand and rally around.
The PC-98, however, is very much not that kind of fixed system, but a computer architecture that spanned 18 years of hardware evolution, from 1982 to 2000. Even if we reduce this list of models to the ones that match ZUN's stated minimum system requirements, we're still looking at 7 years of hardware, running different microarchitectures at different clock speeds and with different resulting bottlenecks. If there's such a big variety of systems, which particular slowdown behavior should the ports even preserve?

The obvious answer is "the one from the exact system ZUN wrote these games on", but we don't know that system. 📝 Last year, I claimed that ZUN developed these games on a PC-9821Xa7, but I didn't add a citation back then and can't find one now. The closest piece of related known info is this note on the Amusement Makers page that hosts the official downloads for the trial versions, listing three PC-98 models that they confirmed to run the games without issues:

なお当サークルでは ・ NEC PC-9821Xs i486DX2 66MHz ・ NEC PC-9821La13 Pentium Processor (P54C) 133MHz ・ EPSON PC-486MS AMD 5x86-P133 換装 などで正常に動くことを確認しています

These models are one whole CPU generation apart and their clock speed differs by 100%. Which one of these is supposed to have the accurate slowdown?
But even if we knew, it doesn't matter. The README is clear about ZUN's intentions:

  PC98、またはその互換機専用です。(EGC搭載機種)   386以上で動作しますが、486ぐらい無いときついかも知れません。実際は   VRAMアクセスが速いことが重要です。   オプションで、処理の重い演出を減らすことも出来ます。   また、MSDOSが必要です   CPUは486(66MHz)でしか動作確認を取っておりませんので、あんま   り遅い機種ですと不幸かもしれません。   ちなみに、486(66MHz)ですといっさい処理落ちや、欠けなどは出ませ   ん。
  PC98、またはその互換機専用です。(EGC搭載機種)   CPU:486(66MHz)以上推奨      (386でも動作はしますがゲームにならないでしょう。       ただし、386命令を使っているので286は不可です。       また、低クロックの486でもかなり処理落ちするかも知れません)   実際は、CPUの他にもVRAMアクセスが速いことも重要です。
  PC98(PC98-NX除く)、またはその互換機専用です。   (EGC搭載機種)   CPU:486(66MHz)以上推奨      (386でも動作はしますがゲームにならないでしょう。       ただし、386命令を使っているので286は不可です。       はっきり言って、486でも66MHz位はないとかなり       処理落ちするかも知れません)   実際は、CPUの他にもVRAMアクセスが速いことも重要です。
  ●PC98(PC98-NX除く)、またはその互換機専用です。    (EGC搭載機種)   ●CPU:486(66MHz)以上推奨      (386でも動作はしますがゲームにならないでしょう。       ただし、386命令を使っているので286は不可です。       はっきり言って、486でも66MHz位はないとかなり       処理落ちするかも知れません)       実際は、CPUの他にもVRAMアクセスが速いことも重要です。

If ZUN recommends a 486 or faster to avoid slowdown, this necessarily means that any unintentional slowdown is indeed unwanted.
Also, note how only TH02's README claims that the game was exclusively tested on a 66 MHz model, which is highly likely to be that PC-9821Xs listed on the Amusement Makers page. Did ZUN switch to a faster PC-98 model for the development of the last three games? That late into the architecture's lifespan? Or did he merely test the game on faster models while the main development still took place on his 66 MHz model?

Picking a CPU clock speed for emulators

Of course, this now creates a problem for everyone wanting to configure emulators for PC-98 Touhou. If the ideal Touhou machine is infinitely fast, we should always pick the fastest possible emulated CPU speed, right? Historically, this has been bad advice: Most emulators will then stick to exactly the amount of cycles per emulated second you specified in the menu, slowing down the emulated system as a result. It's this kind of emulator behavior that gets players to manually look for "the sweet spot" – the maximum possible explicitly specified CPU clock speed that still manages to render without slowdown on their system. This is a tragedy for many reasons:

Ever since 2019, however, SimK has been developing an Async CPU mode for Neko Project 21/W, which finally got stabilized in ver0.86 rev.93, back in April of this year. Activate this mode with the Screen → CPU clock stabilizer and Screen → Dynamic CPU clock adjustment options, and then you should theoretically be able to finally stop worrying: Just specify the maximum possible clock speed in the usual configuration menu, and Neko Project will dynamically reduce the emulated clock speed to the fastest speed your system can handle.

Screenshot of Neko Project 21/W's Async CPU options

Then, the games are supposed to run similarly to how a correctly configured Anex86 has been running them all along, but with an additional 21 years of emulation accuracy improvements.
Sadly, this mode still needs a bit of work. Excessively high clock speeds will result in wildly fluctuating frame rates and even BGM tempos during the first few seconds of a game session as Neko Project 21/W apparently takes a while to find the optimal clock speed. Even afterwards, emulation remains noticeably slower than Anex86:

This is Neko Project 21/W ver.0.86 rev.95 configured with a clock speed of 1 GHz, running on an Intel Core i5-8400T. The fluctuations are not nearly as intense during the rest of a game session, but remain noticeable throughout.

But what about DOSBox-X, the other good emulator recommended these days? This Async CPU mode is very similar to the cycles=max option that DOSBox-X has supported all along. If you try running my 📝 past and future blitting benchmarks using this option, you can observe how DOSBox-X also starts with a low cycle count and then gradually speeds up to accommodate the actual processing load.
In the much less synthetic test case of running PC-98 Touhou, however, DOSBox-X's cycle adjustment reveals itself as much more sophisticated than Neko Project 21/W's implementation. The showdetails=true option reveals that the cycle count does fluctuate quite heavily, which does translate into minor BGM dropouts particularly near the start of a session. But these dropouts are tiny in comparison to what you'd get on Neko Project 21/W, and the framerate remains stable throughout.

As for overall performance, DOSBox-X's simple interpreter core is not nearly as optimized as Neko Project 21/W's interpreter and peaks at roughly half of its speed. The dynamic_nodhfpu core, however, solidly beats Neko Project 21/W by the same 50%. And it's this added bit of performance that makes all the difference: It eradicates slowdown in most of the usual spots in PC-98 Touhou where emulators and even Anex86 typically struggle, and turns DOSBox-X into the first emulator to finally beat Anex86's performance on the same hardware in all the workloads that matter. The dynamic core still doesn't quite reach the speeds of the hypothetical infinitely fast PC-98 on my outdated system, but it remains the most reliable configuration option when it comes to delivering ZUN's intended vision. If we ignore the BGM dropouts. :thonk:
Just make sure to explicitly select the dynamic_nodhfpu variant, not the regular dynamic core. The latter is infamous for recompilation errors in FPU code that break TH01 gameplay. While that specific issue is ostensibly fixed, I still managed to occasionally run into smaller FPU-related bugs in current DOSBox-X versions. Unfortunately, I didn't manage to capture them on video; I would have reopened the issue on the spot if I did.

Screenshot of the ReC98 Shuusou Gyoku build running in a 640×480 window on a Windows 98 desktop, next to the System Properties window (as usual for backport screenshots) and the D3DWindower window (demonstrating how the game was windowed). The game window shows the DOSBox-X with `cycles=max` and `core=simple`Screenshot of the ReC98 Shuusou Gyoku build running in a 640×480 window on a Windows 98 desktop, next to the System Properties window (as usual for backport screenshots) and the D3DWindower window (demonstrating how the game was windowed). The game window shows the Neko Project 21/W in Async CPU mode with a maximum clock speed of 2.4576 × 407 = 1000.2432 MHzScreenshot of the ReC98 Shuusou Gyoku build running in a 640×480 window on a Windows 98 desktop, next to the System Properties window (as usual for backport screenshots) and the D3DWindower window (demonstrating how the game was windowed). The game window shows the DOSBox-X with `cycles=max` and `core=dynamic`Screenshot of the ReC98 Shuusou Gyoku build running in a 640×480 window on a Windows 98 desktop, next to the System Properties window (as usual for backport screenshots) and the D3DWindower window (demonstrating how the game was windowed). The game window shows the Anex86 in Sync 1 mode
Of course, any performance measurement of an emulator with dynamic cycle adjustment can only ever represent a snapshot of the ever-changing adjustment state, and should therefore be taken with a grain of salt. Hence, these screenshots are purely decorative; I just added them because I'm sure that someone would have asked for exact numbers otherwise. Also, the exact relations between emulators are highly dependent on the workload…
And yes, that's a new benchmark! More about this one 📝 in part 2.

(Still, it's remarkable how close Anex86 gets despite its interpreter core, and how it even beats DOSBox-X in MOVS performance. I looked at Anex86's disassembly for 10 minutes and saw big tables of tiny per-instruction functions with custom calling conventions that make remarkably efficient use of the few registers you get in x86. Also, negative offsets? They must have written this entire x86-on-x86 core in ASM.)

While this is great news for players, the whole situation remains very unsatisfying at a technical level. Even if you don't care about the remaining BGM dropouts, running these games at the highest possible emulated clock speed means that you constantly spend 100% of all CPU cores assigned to your emulator just to avoid slowdown and lag in a few particularly CPU-intensive sections. Power saving might be the single best practical argument in favor of a port. :thonk:

Also, all this complexity involved in dynamic cycle adjustment raises one question you might have had all along. Why don't we just leave our emulated CPUs at 66 MHz? After all, ZUN said that 66 MHz is enough to eliminate all slowdown in at least TH02 and TH03, so how about just living with whatever slowdown we'd still experience in TH04 and TH05? This is certainly a healthier approach, much more appropriate for these silly little indie games that were never meant to be obsessed about at this level, and we get rid of those last few BGM dropouts in DOSBox-X!
Well, if that statement was ever correct to begin with, it would have only applied to real hardware and not to emulators. mu021 reported that the final phase of TH02's Mima fight slowed down even at 78 MHz in Neko Project, and part 4 will contain 📝 even more examples of how 66 MHz slows down several effects in menus and cutscenes, and thus paints a wrong picture of them. Hence, choosing 66 MHz for a preconfigured emulator package might have a particularly annoying side effect: If people get used to how slow these effects run on emulators, they might be rather irritated once the modern ports will invariably run them at their intended speed denoted in the code. I can already imagine them yelling too fast!, inaccurate!, and literally unplayable!, oblivious to the fact that they had the wrong idea about these effects all along.
Or maybe it'll all be fine once part 4 has documented these issues in depth. I certainly wouldn't criticize a package for choosing 66 MHz. All choices are unsatisfying at some level…

If only we could optimize the games enough to remove any unwanted slowdown at 66 MHz. Then, people could freely choose one emulator over another for reasons unrelated to performance, because even cycle-limited emulators could then actually deliver on ZUN's statements in the README files… :thonk:
And since we've defined debloating as an integral part of port development earlier, that's exactly what we're going to do.


But can we even do that within our high standards? Obviously, our ports should remain…

Frame-perfect

Since all five games are explicitly timed around VSync, it's immediately clear what we mean by this term:

Everything rendered to a single page of VRAM between two VSync wait loops defines one single logical frame.

If we are double-buffering correctly and the PC-98 system running the game is fast enough to finish rendering such a logical frame to VRAM within two VSync signals, everything is fine: The sequence of frames you can observe on your screen matches the logical sequence of internal frames, and we can easily record this sequence and compare the port against it.
But what about unintentional slowdown? In these cases, ZUN asks the system to do way more work than it can execute between two VSync signals. Notably, this also includes most loading times: Once we add disk access into the mix, we can't guarantee hitting any VSync deadlines anymore, and decompressing all these 640×400 images is quite expensive as well. Obviously, we don't want to abandon our goal of frame-perfection and the comparability of ports just because of this variability, so let's add another rule:

Individual defined frames may be shown on screen for any integer multiple of the frame time.

The reason for the integer restriction is obvious: If we start drawing to the screen in the middle of a frame, we get screen tearing and thus a non-perfect frame – not just because tearing looks bad, but also because the position of the tearing line always depends on the overall performance of the system you run the game on.
The combination of these two rules leads to an immediate consequence:

The games must only ever display complete logical frames.

And now we have a problem. Our rules have just outlawed screen tearing, but nearly every menu and cutscene screen in ZUN's original code has some kind of screen tearing issue. 📝 The Music Room of TH02-TH04 represents probably the worst example as it suffers from screen tearing on every single frame:

Screenshot of TH04's Music Room, demonstrating the screen tearing landmine that the original game exhibits on every frameColored visualization of which lines correspond to which frame in the TH04 Music Room screenshot
Which frame are we even on? 😵 This landmine is the reason why the first rule explicitly restricts rendering to a single VRAM page. 📝 Check the Music Room blog post for an explanation of what went wrong here.

Also, how would you possibly preserve these tearing lines once you've ported the game? After all, modern platforms not only imply much faster CPUs, but also completely different rendering methods, especially once we add scaling into the mix.
This can only mean one thing:

It is fundamentally impossible to port the unmodified codebase of PC-98 Touhou and remain frame-perfect to the original release.

You could maybe get there by throwing out the integer multiple rule and accepting teared frames as legitimate. But then you'd have to decide on a particular model whose slowdown behavior you'd want to replicate and lock down exactly – and as I've stated in the section above, that's quite a silly and impractical proposition.

Resolving screen tearing

So, how do we get back to a comparable sequence of well-defined frames? This can only work if we leave the confines of real hardware and instead reach for the infinitely fast PC-98 that ZUN wanted to have anyway. Such a system would never exhibit screen tearing because it would naturally complete all rendering within the vertical blanking interval preceding each displayed frame. Once our code then ends a frame by entering a busy-waiting loop for the next VSync signal, the screen would then get to draw static and well-defined VRAM contents. This behavior is the whole reason why I get to classify screen tearing issues as landmines that must always be fixed, as opposed to bugs that a port could potentially retain.
If we actually had such an infinitely fast PC-98, we could just run ZUN's unmodified code on that system and be done now. But as we've seen above, not even DOSBox-X's dynamic core manages to run PC-98 Touhou at the infinitely fast level we'd need. Also, we wanted to get rid of relying on specific emulators and have already planned to optimize all this code anyway…

So let's defuse each screen tearing landmine one by one by rewriting its code to match the output of an infinitely fast PC-98. This is a lot more feasible than it sounds because these landmines aren't actually caused by a lack of CPU power. Every screen tearing issue comes down to ZUN misplacing certain screen-affecting operations within the hellscape of imperative hardware state mutations that is his menu and cutscene code. You can either hide the issue by throwing an infinite amount of processing power at the problem so that the order of mutations no longer observably matters, or you can just write good code.
In theory, we only have to follow a few rules:

The difficulty of actually pulling this off, however, can range anywhere from Easy to Lunatic, depending on the screen, because of course every one of them is different. Even after these 11 pushes, I'll be far from done. But in the end, we'll have perfect and easily verifiable frame parity between the PC-98 versions and the future ports, even though we had to bend the code a little. Or a lot. Oh well.


If you only opened this post for the required reading part, you can stop reading now. I've got a few more technical thoughts about a few implementation details of the future ports that tend to come up in discussions, but these aren't as essential as the high-level issues above.

So we've now decided on what to do in order to make the ports good, but what are the basic challenges we have to solve in order to port these games to modern systems in the first place? Let's start with a perhaps surprising list of non-issues that some people might perceive as challenges:

In short: If the feature in question is consistently used through an API, it's not a challenge in itself. The hard parts are all the opposite cases – when ZUN suddenly starts writing to VRAM segments or I/O ports in the middle of gameplay code, like he does all over TH01. All of these instances need to be manually cleaned up and abstracted away. Conversely, this is also why 📝 TH02 remains by far the easiest individual game to port – it has the least amount of hand-written blitting code and mostly sticks to master.lib functions.

Instead, the biggest immediate challenge is something far more basic:

🎨 Palettized and planar graphics 🎨

After all, PC-98 Touhou doesn't just view the PC-98's graphics subsystem as an obstacle to overcome, but occasionally makes creative use of both palettes and individual bitplanes. How would we possibly cover these effects in a modern graphics API that will be far removed from these concepts? Three challenges immediately come to mind in that regard:

  1. The whole concept of enforcing a single 16-color palette across the entire screen in a world where 32-bit RGBA is the only reliably available texture format. Shaders offer a simple solution: We simply wouldn't use traditional textures, and just write our own sampler that takes both the original palettized 16-color+alpha image and the global palette as input, and performs a lookup for each texel. But what are we supposed to do in SDL_Renderer's fixed-function pipeline? Use the CPU to update all loaded textures on every palette color change? Split each sprite into a separate texture for each color and consume 16× the amount of VRAM just so that we can use vertex colors for each individual color layer? :onricdennat: Or break down every sprite into a point list to save the VRAM? :tannedcirno::onricdennat:
  2. Any kind of sprite-shaped palette color bit flipping effect, such as 📝 the falling polygons in the Music Room. Effects like these could potentially be hardware-rendered even in a fixed-function pipeline if we split the background image into two and render the polygons using regular triangles with their UV coordinates matched to the pixel coordinates on every frame. But would all the involved interpolation reliably give us the original sharp edges without reaching for a shader to ensure that it does? In any case, this solution would need a completely different implementation for a modern port than it currently uses in ZUN's PC-98-native code, which gets by with less per-frame redraw than you'd think that this effect would need.
    uth05win didn't even get to port the Music Room, which is probably not without reason.
  3. TH01's square-shaped inverting effects used during bomb and entrance animations. Flipping a given bit of a pixel's palette index? Based on what's there before? No way around a shader for this one…
    Note how the flipped cards rip holes into the square trails. I'm not even sure what the TH01 Anniversary Edition would change about the effect, or whether it even should change anything about it. Good luck porting this effect pixel-perfectly without pixel-level access.

However, writing all this custom graphics code for the modern port would run against my previously stated goal of sharing as much code as possible between PC-98 and modern platforms. While shaders are the conceptually simpler solution for all of these challenges, they aren't easy in practical terms, and I already 📝 decided against using them for Shuusou Gyoku for good reasons. Also, is all of this really worth the effort if these games demonstrably don't even need the performance of GPU rendering?
But that only leaves one conclusion:

The future ports of PC-98 Touhou to modern systems will software-render the graphics layer on the CPU.

I know, that sounds very shocking and probably disappointing at first. But at a closer look, it's really not all that bad. These games have been software-rendered all along by not only PC-98 emulators, but by real hardware at mid-90's CPU speeds. You might point to the GRCG and EGC chips as evidence for at least some capacity of hardware acceleration, but I see them more as workarounds for the unfortunate planar nature of VRAM on this Japanese business computer architecture. In the end, "software rendering" only means that the CPU receives access to every pixel in the framebuffer. Once all graphical functionality is neatly abstracted away and the game no longer directly accesses the four physical bitplanes, the ports can store sprites and the rendered graphics layer in the most performant way.
Also, note how I only said "graphics layer". Besides 📝 the obvious candidate of framebuffer scaling, the ports will use the GPU for two more important aspects:

As a result, the software renderer of our hand-crafted ports would still internally produce a graphics and text layer that persists across frames and receives minimal redraws, just like the PC-98 originals did. In fact, it would have to produce the exact same graphics layer if we wanted to port the non-Anniversary Edition, including the tile source area. There's no technical need to keep tiles on the graphics layer in a port, but certain intense shake effects temporarily reveal individual tiles below the HUD:

This definitely counts as a bug to be fixed in this game's Anniversary Edition, but how would we fix this one on PC-98 where we do need the tile area in VRAM? Moving the tiles to another place and patching the 📝 .MAP at runtime?

Applying the palette to produce the final rendered image then raises another set of exciting engineering questions. Would we actually use a palettized 4bpp buffer in memory, storing two pixels in a byte? Perhaps with an 8-bit palette that maps each possible pair of pixels to a pair of 32-bit RGBA values, halving the amount of per-frame palette lookups? Or would we always store an RGBA image and merely offer a palettized API around it? As far as I'm concerned, these challenges are way more exciting than the prospect of locking ourselves into some shader language.

📃 Page flipping 📃

But wait. If the port produces a persistent graphics layer, shouldn't it produce two, one for each VRAM page on the PC-98? From the point of view of a modern port, we really don't need to. We only ever upload one "VRAM page" to the GPU anyway, which is then scrolled and scaled onto one of the GPU's backbuffers inside the swapchain. Then, the game can immediately continue drawing onto the same software-rendered VRAM buffer in the next frame without affecting the GPU output.

Obviously, this rendering paradigm doesn't translate back to the PC-98. There, we must render each frame to either the invisible or the visible page. Also, minimal redraw is crucial because we can neither afford the memory nor the performance to regularly copy an entire 128 KB of pixel data from whatever place to VRAM. As a result, page flips are a common sight in even the highest levels of menu and cutscene code, adding yet another unsightly piece of state you have to keep track of while reviewing and modding the code. I've grown to hate them quite a lot over the past four months because of just how often they are associated with bad code: In most menu and cutscene screens, ZUN just uses the second VRAM page as pixel storage for inter-page copies using the EGC, 📝 whose slowness is a regular topic on this blog. Once you've replaced these copies with optimized blits from conventional RAM, you've not only removed all these page flips and clearly revealed these screens as the single-buffered affairs they've always been, but you've also accelerated them enough to remove any screen tearing issues they might have had at 66 MHz.

Unfortunately, things are not that easy everywhere:

Can we rewrite all of these cases in a way that high-level game code no longer has to care about pages? Can we perhaps even banish page flipping to a new lower level of the architecture that all menus and cutscenes are built on top of, and thus unconditionally double-buffer every screen while still maintaining minimal redraw? Or is none of this worth it and we'll just live with two VRAM pages on all platforms? I'm honestly not sure. And that's just a small preview of the porting challenges that still await us and were far beyond the scope of even these 11 pushes…


As for the commits that are formally assigned to this blog post: It was all maintenance, build system setup, and some debloating work on TH01 around its packfile support that I thought would be necessary but thankfully didn't yet need after all. More about that in, you guessed it, 📝 part 4.

Alright! Improving performance, fixing screen tearing issues, establishing better cross-platform interfaces, and cleaning up ZUN's code to facilitate all of that… I've got a lot to do now. Next up: Getting closer to our performance goals by optimizing all PC-98-native code surrounding the .PI files used for backgrounds and cutscene pictures, since we later want to draw our TH03 netplay menus on top.

📝 Posted:
💰 Funded by:
32th System, [Anonymous], iruleatgames, Blue Bolt
🏷️ Tags:

Surprise! The last missing main menu in PC-98 Touhou was, in fact, not that hard. Finishing the rest of TH03's OP.EXE took slightly shorter than the expected 2 pushes, which left enough room to uncover an unexpected mystery and take important leaps in position independence…

  1. TH03's main/option menu
  2. Picking opponents for Story Mode
  3. Demo Play difficulty?
  4. More variables from TH02's Stage 3
  5. Technical position independence for TH03's MAINL.EXE

For TH03, ZUN stepped up the visual quality of the main menu items by exchanging TH02's monospaced font with fixed, pre-composited strings of proportional text. While TH04 would later place its menu text in VRAM, TH03 still wanted to stay with TH02's approach of using gaiji to display the menu items on the PC-98 text layer. Since gaiji have a fixed size of 16×16 pixels, this requires the pre-composited bitmaps to be cut into blocks of that size and padded with blank pixels as necessary:

The proportional text section from TH03's MIKOFT.BFT, with the 16×16 gaiji grid overlaid

If your combined amount of text is short enough to fit into the PC-98's 256 gaiji slots, this is a nice way of using hardware features to replace the need for a proportional text renderer. It especially simplifies transitions between menus – simply wiping the entire TRAM is both cheap and certainly less error-prone than (un)blitting pixels in VRAM, which 📝 ZUN was always kind of sloppy at.
However, all this text still needs to be composited and cut into gaiji somewhere. If you do that manually, it's easy to lose sight of how the text is supposed to appear on screen, especially if you decide to horizontally center it. Then, you're in for some awkward coordinate fiddling as you try to place these 16-pixel bricks into the 8-pixel text grid to somehow make it all appear centered:

TH03's main menu box as it appears in the original gameTH03's main menu box with opaque gaiji to highlight their exact locations in TRAMTH03's main menu box with correct centering
TH03's Option menu box as it appears in the original gameTH03's Option menu box with opaque gaiji to highlight their exact locations in TRAMTH03's Option menu box with correct centering
The VS Start menu actually is correctly centered.

Then again, did ZUN actually want to center the Option menu like this? :thonk: Even the main menu looks kind of uncanny with perfect centering, probably because I'm so used to the original. Imperfect centering usually counts as a bug, but this case is quirky enough to leave it as is. We might want to perfectly center any future translations, but that would definitely cost a bit as I'd then actually need to write that proportional text renderer.
Apart from that, we're left with only a very short list of actual bugs and landmines:

While the rest of the code is not free of the usual nitpicks, those don't matter in the grand scheme of things. The code for the sliding 東方時空 animation is even better: it makes decent use of the EGC and page flipping, and places the 📝 loading calls for the character selection portraits at sensible points where the animation naturally wants to have a delay anyway. We're definitely ending the main menus of PC-98 Touhou on a high note here.


You might have already spotted some unfamiliar text in the gaiji above, and indeed, we've got three pieces of unused text in these two menus! Starting from the top, the Watch label is entirely unused as none of its gaiji IDs are referenced anywhere in the final code. The label's placement within the gaiji IDs would imply that this option was once part of the main menu, but nothing in the game suggests that the main menu ever had a bigger box that could fit a 7th element. On the contrary, every piece of menu code assumes that the box sprites loaded from OPWIN.BFT are exactly 128 pixels high:

TH03's OPWIN.BFT
Fun fact: The code doesn't even use the 16 pixels in the middle, and instead just assumes that the pixels between the X coordinates of [8; 16[ and [32; 40[ are identical.

The unused MIDI music option has already been widely documented elsewhere. Changing the first byte in YUME.CFG to 02 has no functional effect because ZUN removed most MIDI-related code before release. He did forget a few instances though, and the surviving dedicated switch case in the Option menu is now the entire reason why you can reveal this text without modifying the binary. Changing the option will always flip its value back to either off or FM(86).
Last but not least, we have the Type label and its associated numbers. These are the most interesting ones in my eyes; nobody talks about them, even though we have definite proof that they were used for the KeyConfig options at some earlier point in development:

But how exactly can we prove this? The code does string together the respective gaiji IDs and defines the resulting arrays right next to the final KeyConfig types, but doesn't use these arrays anywhere. By itself, this only means that these labels were associated with some option that may have existed at one point in development. The proof must therefore come from outside the code – and in this case, it comes from both 夢時空.TXT and 時空_1.TXT, which still refer to the KeyConfig options as numbered types:

 ■6.操作方法 […]    FM音源のジョイスティックが無い場合は、TYPE1にしてください。    ○TYPE1 Key vs Key […]    ○TYPE2 Joy vs Key […]    ○TYPE3 Key vs Joy

That's all I've got about the menus, so let's talk characters and gameplay! When playing Story Mode, OP.EXE picks the opponents for all stages immediately after the 📝 Select screen has faded out. Each character fights a fixed and hardcoded opponent in Stage 7's Decisive Match:

PlayerStage 7 opponent
Reimu ReimuMima Mima
Mima MimaReimu Reimu
Marisa MarisaReimu Reimu
Ellen EllenMarisa Marisa
Kotohime KotohimeReimu Reimu
Kana KanaEllen Ellen
Rikako RikakoKana Kana
Chiyuri ChiyuriKotohime Kotohime
Yumemi YumemiRikako Rikako

The opponents for the first 6 stages, however, are indeed completely random, and picked by master.lib's reimplementation of the Borland RNG. The game only needs to ensure that no character is picked twice, which it does like this:

const int stage_7_opponent = HARDCODED_STAGE_7_OPPONENT_FOR[playchar];
bool opponent_seen[7] = { false };

for(int stage = 0; stage < 6; stage++) {
	int candidate;
	do {
		// Pick a random character between Reimu and Rikako
		candidate = (irand() % 7);
	} while(opponent_seen[candidate] || (stage_7_opponent == candidate));
	opponent_seen[candidate] = true;
	story_opponent[stage] = candidate;
}
Characters are numbered from 0 (Reimu Reimu) to 8 (Yumemi Yumemi), following the order in the Stage 7 table above.

Yup. For every stage, ZUN re-rolls until the RNG returns a character who hasn't yet been seen in a previous stage – even in Stage 6 where there's only one possible character left. :zunpet: Since each successive stage makes it harder for the inner loop to find a valid answer, you start to wonder if there is some unlucky combination of seed and player character that causes the game to just hang forever.
So I tested all possible 232 seed values for all 9 player characters and… nope, Borland's RNG is good enough to eventually always return the only possible answer. The inner loop for Stage 6 does occasionally run for a disproportionate number of iterations, with the worst case being 134 re-rolls when playing Rikako Rikako's Story Mode with a seed value of 0x099BDA86. But even that is many orders of magnitude away from manifesting as any kind of noticeable delay. And on average, it just takes 17.15 iterations to determine all 6 random opponents.


The attract demos are another intriguing aspect that I initially didn't even have on my radar for the main menu. touhou-memories raises an interesting question: The demos start at Gauge and Boss Attack level 9, which would imply Lunatic difficulty, but the enemy formations don't match what you'd normally get on Lunatic. So, which difficulty were they recorded on?
Our already RE'd code clears up the first part of that question. TH03's demos are not recordings, but simply regular VS rounds in CPU vs. CPU mode that automatically quit back to the title screen after 7,000 frames. They can only possibly appear pre-recorded because the game cycles through a mere four hardcoded character pairings with fixed RNG seeds:

Demo #P1P2Seed
1Mima MimaReimu Reimu600
2Marisa MarisaRikako Rikako1000
3Ellen EllenKana Kana3200
4Kotohime KotohimeMarisa Marisa500
Certainly an odd choice if your game already had the feature to let arbitrary CPU-controlled characters fight each other. That would have even naturally worked for the trial version, which doesn't contain demos at all.

Then again, even a "random" character selection would have appeared deterministic to an outside observer. As usual for PC-98 Touhou, the RNG seed is initialized to 0 at startup and then simply increments after every frame you spend on the title screen and inside the top-level main, Option, and character selection menus – and yes, it does stay constant inside the VS Start menu. But since these demos always start after waiting exactly 520 frames on the title screen without pressing any key to enter the main menu, there's no actual source of randomness anywhere. ZUN could have classically initialized the RNG with the current system time, which is what we used to do back in the day before operating systems had easily accessible APIs for true randomness, but he chose not to, for whatever reason.

The difficulty question, however, is not so easy to answer. The demo startup code in the main menu doesn't override the configured difficulty, and neither does any other of the binaries depending on the demo ID. This seems to suggest that the demos simply run at the difficulty you last configured in the Option menu, just like regular VS matches. But then, you'd expect them to run differently depending on that difficulty, which they demonstrably don't. They always start on Gauge and Boss Attack level 9, and their last frame before the exit animation is always identical, right down to the score, reinforcing the pre-recorded impression:

Screenshot of the last frame (#7,000) of TH03's first demo (Mima vs. Reimu).Screenshot of the last frame (#7,000) of TH03's second demo (Marisa vs. Rikako).Screenshot of the last frame (#7,000) of TH03's third demo (Ellen vs. Kana).Screenshot of the last frame (#7,000) of TH03's fourth demo (Kotohime vs. Marisa).
Note that it takes much longer than the expected 2:04 minutes for the game to reach this end state. Each WARNING!! You are forced to evade / Your life is in peril popup freezes gameplay for 26 frames which don't count toward the demo frame counter. That's why these popups will provide such a great 📝 resynchronization opportunity for netplay. It's almost as if Versus Touhou was designed from the start with rollback netcode in mind! :godzun:

With quite a bit of time left over in the second push, it made sense to look at a bit of code around the Gauge and Boss Attack levels to hopefully get a better idea of what's going on there. The Gauge Attack levels are very straightforward – they can range from 1 to 16 inclusive, which matches the range that the game can depict with its gaiji, and all parts of the game agree about how they're interpreted:

The 16 proportional digit gaiji from TH03's GAMEFT.BFT
Stored in GAMEFT.BFT.

The same can't be said about the Boss Attack level though, as the gauge and the WARNING!! popup interpret the same internal variable as two different levels?

Darkened screenshot of a TH03 Boss Attack fired off near the beginning of a match at Lunatic difficulty, highlighting the discrepancy between the Boss Attack level shown in the gauge at the bottom of each playfield (10) and the one shown in the WARNING!! popup (9)

This apparent inconsistency raises quite a few questions. After all, these gaiji have to be addressed by adding an offset from 0 to 15 to the ID of the 1 gaiji, but the levels are supposed to range from 1 to 16. Does this mean that one of these two displays has an off-by-one error? You can't fire a Level 0 Boss Attack because the level always increments before every attack, but would 0 still be a technically valid Boss Attack level?
Decompiling the static HUD code debunks at least the first question as ZUN resolves the apparent off-by-one error by explicitly capping the displayed level to 16. And indeed, if a round lasts until the maximum Boss Attack level, the two numbers end up matching:

Darkened screenshot of a TH03 Boss Attack fired off near the end of a match, highlighting how both the gauge and the WARNING!! popup agree on the level once it reaches 16

This suggests that the popup indicates the level of the incoming attack while the gauge indicates the level of the next one to be fired by any player. That said, this theory not only needs tons of comments to explain it within the code, but also contradicts 夢時空.TXT, which explicitly describes the level next to the gauge as the 現在のBOSSアタックのレベル. Still, it remains our best bet until we've decompiled a few of the Boss Attacks and saw how they actually use this single variable.


So, what does this tell us about the demo difficulty? Now that we can search the code for these variables, we quickly come across the dedicated demo-specific branch that initializes these levels to the observable fixed values, along with two other variables I haven't researched so far. This confirms that demos run at a custom difficulty, as the two other variables receive slightly different values in regular gameplay.

However, it's still a good idea to check the code for any other potential effects of the difficulty setting. Maybe they're just hard to spot in demos? Doesn't difficulty typically affect a whole lot of other things in Touhou game code? Well, not in TH03 – MAIN.EXE only ever looks at the configured difficulty in three places, and all of them are part of the code that initializes a new round.
This reveals the true nature of difficulty in TH03: It's exclusively specified in terms of these five variables, and the Easy/Normal/Hard/Lunatic/"Demo" settings can be thought of as simply being presets for them. Story Mode adds 📝 the AI's number of safety frames to the list of variables and factors the current stage number into their values, but the concept stays the same. In this regard, TH03's design is unusually clean, making it perhaps the only Touhou game with not even a single "if difficulty is this, then do that" branch in script code. It's certainly the only PC-98 Touhou game with this property.

But it gets even better if we consider what this means for netplay. We now know that the configured difficulty is part of the match-defining parameters that must be synced between both players, just like the selected characters and the RNG seed. But why stop there? How about letting players not just choose between the presets, but allowing them to customize each of the five variables independently? Boom, we've just skyrocketed the replay value of netplay. 🚀 It's discoveries like these that justify my decision to start the road toward netplay by decompiling all of OP.EXE: In-engine menus are the cleanest and most friendly way of allowing players to configure all these variables, and now they're also the easiest and most natural choice from a technical point of view.


But wait, there's still some time left in that second push! The remaining fraction of the OP.EXE reverse-engineering contribution had repeating decimals, so let's do some quick TH02 PI work to remove the matching instance of repeating decimals from the backlog. This was very much a continuation of 📝 last year's light PI work; while the regular TH02 decompilation progress has focused and will continue to focus on the big features, it still left plenty of low-hanging PI fruit in boss code.
The first animation frame of TH02's Stage 3 midboss, taken from STAGE2.BMT Back then, we left with the positions of the Five Magic Stones, where ZUN's choice of storing them in arrays was almost revolutionary compared to what we saw in TH01. The same now applies to the state flags and total damage amount of not just the boss of Stage 3, but also the two independently damageable entities of the stage's midboss. In total, all of the newly identified arrays made up 3.36% of all memory references in TH02, and we're not even done with Stage 3. The first animation frame of TH02's Stage 3 midboss, taken from STAGE2.BMT


Actually, you know what, let's round out that second push with even more low-hanging PI fruit and ensure 📝 technical position independence for TH03's MAINL.EXE. This was very helpful considering that I'm going to build netplay into the anniversary branch, whose debloated foundation 📝 aims to merge every game into as few executables as possible. Due to TH03's overall lower level of bloat and the dedicated SPRITE16-based rendering code in MAIN.EXE, it might not make as much sense to merge all three of TH03's .EXE binaries as it did for TH01, and MAIN.EXE's lack of position independence currently prevents this anyway. However, merging just OP.EXE and MAINL.EXE makes tremendous sense not just for TH03, but for the other three games as well. These binaries have a much smaller ratio of ZUN code to library code, and use the same file formats and subsystems.
But that's not even the best part! Once we've factored out all the invisible inconsistencies between the games, we get to share all of this code across all of the four games. Hence, technical position independence for TH03's MAINL.EXE also was the final obstacle in the way of a single consistent and ultimately portable version of all of this code. 🙌

So, how do we go from here to 📝 the short-term half-PC-98/half-modern netplay option that Ember2528 is now funding? Most of the netcode will be unrelated to TH03 in particular, but we'd obviously still want to reverse-engineer more of MAIN.EXE to ensure a high-quality integration. So how about alternating the upcoming deliveries between pure RE work and any new or modded code? Next up, therefore, I'll go for the latter and debloat OP.EXE so that I can later add the netplay features without pulling my hair out. At that point, it also makes sense to take the first steps into portability; I've got some initial ideas I'm excited to implement, and Congrio's tiny bit of funding just begs to be removed from the backlog. :tannedcirno:
(And I'm definitely going to defuse all the tearing landmines because my goodness are they infuriating when slowing down the game or working with screen recordings.)

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous], Splashman
🏷️ Tags:

TH05's OP.EXE? It's not one of the 📝 main blockers for multilingual translation support, but fine, let's push it to 100% RE. This didn't go all too quickly after all, though – sure, we were only missing the High Score viewer, but that's technically a menu. By now, we all know the level of code quality we can reasonably expect from ZUN's menu code, especially if we simultaneously look at how it's implemented in TH04 as well. But how much could I possibly say about even a static screen?

  1. Generating the initial High Score lists
  2. TH04/TH05's High Score viewer
  3. TH05-exclusive palette bugs when switching between the main menu and the High Score viewer
  4. Please don't fund ZUN Soft logo finalization just yet 🙇

Then again, with half of the funding for this push not being constrained to RE, OP.EXE wasn't the worst choice. In both TH04 and TH05, the High Score viewer's code is preceded by all the functions needed to handle the GENSOU.SCR scorefile format, which I already RE'd 📝 in late 2019. Back then, it turned out to be one of the most needlessly inconsistent pieces of code in all of PC-98 Touhou, with a slightly different implementation in each of the 6 binaries that was waiting for its equally messy decompilation ever since.
Most of these inconsistencies just add bloat, but TH05's different stage number defaults for the Extra Stage do have the tiniest visible impact on the game. Since 2019 was before we had our current system of classifying weird code, let's take a quick look at this again:

Screenshot of TH05's High Score viewer, showing the default GENSOU.SCR data for the Extra Stage as generated by OP.EXE Screenshot of TH05's High Score viewer, showing the default GENSOU.SCR data for the Extra Stage as generated by MAINE.EXE with the same decreasing stage numbers as for the regular stages
In the end, this is a landmine, albeit a slightly unusual one. OP.EXE always needs to load GENSOU.SCR to determine whether the Extra Stage is unlocked and can be selected in the main menu. If that file is corrupted or doesn't exist yet, OP.EXE will always recreate it. Therefore, MAINE.EXE's recreation code would only ever run if GENSOU.SCR got deleted or corrupted while playing the game. This can only happen through code that runs outside the game or as the result of failing hardware, and thus goes beyond our criteria for observability.

On to the actual High Score screen then! The OP.EXE code I decompiled here only covers the viewer, the actual score registration is part of MAINE.EXE and is a completely different beast that only shares a few code snippets at best. This means that I'll have to do this all over again at some point down the line, which will result in another few pushes that look very similar to this one. 🥲
By now, it's no surprise that even this static screen has more or less the same density of bugs, landmines, and bloat as ZUN's more dynamic and animated menus. This time however, the worst source of bloat lies more on the meta level: TH04's version explicitly spells out every single loading and rendering call for both of that game's playable characters, rather than covering them with loops like TH05 does for its four characters. As a result, the two games only share 3¼ out of the 7 functions in even this simple viewer screen. It definitely didn't have to be this way.

On the bright side, the code starts off with a feature that probably only scoreplayers and their followers have been consciously aware of: The High Score screens can display 9-digit scores without glitches, unlike the in-game HUD's infamous overflow that turns the 8th digit into a letter once the score exceeds 100 million points.
To understand why this is such a surprise, we have to look at how scores are tracked in-game where the glitch does happen. This brings us back to the binary-coded decimal format that the final three PC-98 Touhou games use for their scores, which we didn't have to deal with 📝 for almost three years. On paper, the fixed-size array of 8 digits used by the three games would leave no room for a 9th one, so why don't we get a counterstop at 99,999,999 points, similar to what happens in modern Touhou? Let's look at the concrete example of adding, say, 200,000 points to a score of 99,899,990 points, and step through the algorithm for the most significant four digits:

score BCD delta
09 09 08 09 09 09 09 00 + 00 00 02 00 00 00 00 00
= 09 09 08 09 09 09 09 00 + 00 00 02 00 00 00 00 00
= 09 0A 00 09 09 09 09 00 + 00 00 02 00 00 00 00 00
= 0A 00 00 09 09 09 09 00 + 00 00 02 00 00 00 00 00
= 0A 00 00 09 09 09 09 00
It sure is neat how ZUN arranged the gaiji font in such a way that the HUD's rendering is an exact visual representation of the bytes in memory… at least for scores between 100,000,000 (A0000000) and 159,999,999 (F9999999) inclusive.
Formatted as big-endian for easier reading. Here's the relevant undecompilable ASM code, featuring the venerable AAA instruction.

In other words: The carry of each addition is regularly added to the next digit as if it were binary, and then the next iteration has to adjust that value as necessary and pass along any carry to the digit after that. But once we've reached the most significant digit, there is no way for its carry to go. So it just stays there, leaving the last digit with a value greater than 9 and effectively turning it from a BCD digit into a regular old 8-bit binary value. This leaves us with a maximum representable score of 2,559,999,999 points (FF 09 09 09 09 09 09 09) – and with the scores achieved by current TAS runs being far below that limit in both games, it's definitely not worth it to bother about rendering that 10th score digit anywhere.
In the High Score screens, ZUN also zero-padded each score to 8 digits, but only blitted the 9th digit into the padding between name and score if it's nonzero. From this code detail alone, we can tell that ZUN was fully aware of ≥100 million points being possible, but probably considered such high scores unlikely enough to not bother rearranging the in-game HUD to support 9 digits. After all, it only looks like there's plenty of unused space next to the HUD, but in reality, it's tightly surrounded by important VRAM regions on both sides: The 32 pixels to the left provide the much-needed sprite garbage area to support 📝 visually clipped sprites despite master.lib's lack of sprite clipping, and the 64 pixels to the right are home to the 📝 tile source area:

Constructed screenshot of TH04's in-game layout during the Elly boss fight, showing off how a score of 99,999,999 points would be reduced to the 8-digit 🎝9999999 string, with the PC-98's text layer hidden to reveal the sprite garbage areas and tile source areas that tightly surround the HUD Constructed screenshot of TH04's in-game layout during the Elly boss fight, showing off how a score of 99,999,999 points would be reduced to the 8-digit 🎝9999999 string; with the PC-98's text layer visible, it might seem as if it was no big deal to expand the HUD into render 9-digit scores
It sure wouldn't have been impossible. You could either sacrifice the two tiles that would cover the 9th digit in both the HiScore and Score row, or – even better – move these tiles under the existing padding space within the HUD. 📝 The tile sections of TH04 and TH05 already address their images using raw VRAM addresses, so this wouldn't have even required an additional tile index→VRAM address lookup table.

And sure enough, ZUN confirms this awareness in TH04's OMAKE.TXT:

9999万でもカンストしませんが、ゲーム中の得点表示、1千万の位が英字 表記となります。つまり、100000000点はA0000000点と表示 されます。(ネームレジスト画面は普通に表示されます) でも、1億点以上出せる人いるのかな。

(And indeed, the first documented legitimate run that crossed 100 million only happened 11 years after the game's release, with Reimu A on Normal difficulty.)

However, the highest score that the High Score screens of both games can display without visual glitches is not 999,999,999, as you would expect from 9 digits, but rather…

Screenshot of TH04's High Score viewer showing a score of 959,999,999 points for both Reimu and Marisa, which is the highest score that this screen can display without glitches Screenshot of TH05's High Score viewer showing a score of 959,999,999 points for all characters, which is the highest score that this screen can display without glitches
959 million?
(Also, this 9th digit nicely highlights a slight asymmetry in TH04's screen, where Marisa gets 4 fewer pixels of padding between names and scores.)

What a weird limit. Regardless of whether GENSOU.SCR saves its scores in a sane unsigned 32-bit format or a silly 8-digit BCD one, this limit makes no sense in either representation. In fact, GENSOU.SCR goes even further than BCD values, and instead uses… the ID of the corresponding gaiji in the 📝 bold font? :zunpet:
How cute. No matter how you look at it, storing digits with an added offset of 160 makes no sense:

It does start to explain the 959 million limit, though. Since each digit in GENSOU.SCR takes up 1 byte as well, they are indeed limited to a maximum value of (255 - 160) = 95 before they wrap back to 0.
But wait. If the game simply subtracts 160 from the gaiji index to get the digit value, shouldn't this subtraction also wrap back around from 0 to 255 and recover higher values without issue? The answer is, 📝 again, C's integer promotion: Splitting the binary value into two digits involves a division by 10, the C standard mandates that a regular untyped 10 is always of type int, the uint8_t digit operand gets promoted to match, and the result is actually negative and thus doesn't even get recognized as a 9th digit because no negative value is ≥10.

So what would happen if we were to enter a score that exceeds this limit? The registration screen in MAINE.EXE doesn't display the 9th digit and the 8th one wraps around. But it still sorts the score correctly, so at least the internal processing seems to work without any problem…

Screenshot of registering a score of 999,999,999 points in TH04, showing off how such a score would wrap around to 39,999,999 points, despite being sorted correctly. Screenshot of registering a score of 999,999,999 points in TH05, showing off how such a score would wrap around to 39,999,999 points, despite being sorted correctly.
(160 + 99) = 259, which wraps around to 3, so this makes perfect sense. We'll figure out the exact logic behind the differently colored sprite once RE progress reaches this screen.

But once you try viewing this score, you're instead greeted with VRAM corruption resulting from master.lib's super_put() function not bounds-checking the negative sprite IDs passed by the viewer:

Screenshot showing how trying to view a score above 959 million points in TH04's High Score viewer leads to VRAM corruption due to master.lib trying to blit a sprite with a negative ID Screenshot showing how trying to view a score above 959 million points in TH05's High Score viewer leads to VRAM corruption due to master.lib trying to blit a sprite with a negative ID

In a rare case for PC-98 Touhou, the High Score viewer also hides two interesting details regarding its BGM. Just like for the graphics, ZUN also coded a fade-in call for the music. In abbreviated ASM code:

mov ax, 0000h ; PMD AH=00H (start music playback)
int 60h
mov ax, 0280h ; PMD AH=02H (fade in/out)
int 60h
PMD API documentation is here.

However, the AH=02H fade-in call has no effect because AH=00h resets the music volume and would need to be followed by a volume-lowering AH=19h call. But even if there was such a call, the fade-in would sound terrible. 80h corresponds to the fastest possible fade-in speed of -128, which is almost but not quite instant. As such, the fade-in would leave the initial note on each channel muted while the rest of the track fades in very abruptly, which clashes badly with the bass and chord notes you'd expect to hear in the name registration themes of the two games:

At least the first issue could have been avoided if PMD's AH=00h call took optional parameters that describe the initial playback state instead of relying on these mutating calls later on. After all, it might be entirely possible for a bunch of interrupts to fire between AH=00h and these further calls, and if those interrupts take a while, the FM chip might have already played a few samples at PMD's default volume. Sure, Real Mode doesn't stop you from wrapping this sequence in CLI and STI instructions to work around this issue, but why rely on even more CPU state mutation when there would have been plenty of free x86 registers for passing more initial state to AH=00h?

The second detail is the complete opposite: It's a fade-out when leaving the menu, it uses PMD's slowest fade speed, and it does work and sound good. However, the speed is so slow that you typically barely notice the feature before the main menu theme starts playing again. But ZUN hid a small easter egg in the code: After the title screen background faded back in, the games wait for all inputs to be released before moving back into the main menu and playing the title screen theme. By holding any key when leaving the High Score viewer, you can therefore listen to the fade-out for as long as you want.
Although when I said that it works, this does not include TH04. 📝 As 📝 usual, this game's menus do not address the PC-98's keyboard scancode quirk with regard to held keys, causing the loop to break even while the player is still holding a key. There are 21 not yet RE'd input polling calls in TH02 and TH04 that will most certainly reveal similar inconsistencies, are you excited yet? :tannedcirno:
But in TH05, holding a key indeed reveals the hidden-content of a 37-second fade-out:

I'm holding Esc here, but this works with any key, even the ⬅️ left and ➡️ right arrow keys that don't quit out of the menu.

As you can already tell by the markers, the final bugs in TH05's (and only TH05's) OP.EXE are palette-related and revealed by switching between these two screens:

  1. Why does the title screen initially use an ever so slightly darker palette than it does when returning from the menu?
  2. What's with the sudden palette change between frames 1 and 2? Why are the colors suddenly much brighter?

1) is easily traced and attributed to an off-by-one error in the animation's palette fade code, but 2) is slightly more complex. This palette glitch only happens if the High Score viewer is the first palette-changing submenu you enter after the 📝 title animation. Just like 📝 TH03's character portraits, both TH04 and TH05 load the sprites for the High Score screen's digits (SCNUM.BFT) and rank indicator (HI_M.BFT) as soon as the title animation has finished. Since these are regular BFNT sprite sheets, ZUN loads them using master.lib's super_entry_bfnt(), and that's where the issue hides: master.lib's blocking palette fade functions operate on master.lib's main 8-bit palette, and super_entry_bfnt() overwrites this palette with the one in the BFNT header. Synchronizing the hardware palette with this newly loaded one would have immediately revealed this possibly unintended state mutation, but I get why master.lib might not have wanted to do that – after all, 📝 palette uploads aren't exactly cheap and would be very noticeable when loading multiple sprite sheets in a row.
In any case, this is no problem in TH04 as that game's HI_M.BFT and OP1.PI have identical palettes. But in TH05, HI_M.BFT has a significantly brighter palette:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
OP1.PI
HI01.PI / HI_M.BFT

And that's 100% RE for TH05's OP.EXE! 🎉 TH04's counterpart is not far behind either now, and only misses its title screen animation to reach the same mark.
As for 100% finalization, there's still the not yet decompiled TH04/TH05 version of the ZUN Soft logo that separates both OP.EXE binaries from this goal. But as I've mentioned 📝 time and time again, the most fitting moment for decompiling that animation would be right before reaching 100% on the entirety of either game. Really – as long as we aren't there, your funding is better invested into literally anything else. The ZUN Soft logo does not interact with or block work on any other part of the game, and any potential modding should be easy enough on the ASM level.
But thankfully, nobody actually scrolls down to the Finalized section. So I can rest assured that no one will take that moment away from me! :onricdennat:

Next up: I'd kinda like to stay with PC-98 Touhou for a little longer, but the current backlog is pulling into too many different directions and doesn't convincingly point toward one goal over any other. TH02 is close, but with an active subscription, it makes more sense to accumulate 3 pushes of funding and then go for that game's bullet system in January. This is why I'm OK with subscriptions exceeding the cap every once in a while, because they do allow me to plan ahead in the long term.
So, let's wait a few days for all of you to capture the open towards something more specific. But if the backlog stays as indecisive as it is now, I'll instead go for finishing the Shuusou Gyoku Linux port, hopefully in time for the holiday season.
As for prices, indeed seems to be the point where my supply meets the community's demand for this project and the store no longer sells out immediately. So for the time being, we're going to stay at that push price and I won't increase it any further upon hitting the cap.

📝 Posted:
💰 Funded by:
[Anonymous], 32th System
🏷️ Tags:

Remember when ReC98 was about researching the PC-98 Touhou games? After over half a year, we're finally back with some actual RE and decompilation work. The 📝 build system improvement break was definitely worth it though, the new system is a pure joy to use and injected some newfound excitement into day-to-day development.
And what game would be better suited for this occasion than TH03, which currently has the highest number of individual backers interested in it. Funding the full decompilation of TH03's OP.EXE is the clearest signal you can send me that 📝 you want your future TH03 netplay to be as seamlessly integrated and user-friendly as possible. We're just two menu screens away from reaching that goal anyway, and the character selection screen fits nicely into a single push.

  1. TH03's character selection screen
  2. Improved blog navigability

The code of a menu typically starts with loading all its graphics, and TH03's character selection already stands out in that regard due to the sheer amount of image data it involves. Each of the game's 9 selectable characters comes with

  1. a 192×192-pixel portrait (??SL.CD2),
  2. a 32×44-pixel pictogram describing her Extra Attack (in SLEX.CD2), and
  3. a 128×16-pixel image of her name (in CHNAME.BFT). While this image just consists of regular boldfaced versions of font ROM glyphs that the game could just render procedurally, pre-rendering these names and keeping them around in memory does make sense for performance reasons, as we're soon going to see. What doesn't make sense, though, is the fact that this is a 16-color BFNT image instead of a monochrome one, wasting both memory and rendering time.

Luckily, ZUN was sane enough to draw each character's stats programmatically. If you've ever looked through this game's data, you might have wondered where the game stores the sprite for an individual stat star. There's SLWIN.CDG, but that file just contains a full stat window with five stars in all three rows. And sure enough, ZUN renders each character's stats not by blitting sprites, but by painting (5 - value) yellow rectangles over the existing stars in that image. :tannedcirno:

TH03's SLWIN.CDG, showing off how ZUN baked all 15 possible stat stars into the image
The only stat-related image you will find as part of the game files. The number of stat stars per character is hardcoded and not based on any other internal constant we know about.

Together with the EXTRA🎔 window and the question mark portrait for Story Mode, all of this sums up to 255,216 bytes of image data across 14 files. You could remove the unnecessary alpha plane from SLEX.CD2 (-1,584 bytes) or store CHNAME.BFT in a 1-bit format (-6,912 bytes), but using 3.3% less memory barely makes a difference in the grand scheme of things.
From the code, we can assume that loading such an amount of data all at once would have led to a noticeable pause on the game's target PC-98 models. The obvious alternative would be to just start out with the initially visible images and lazy-load the data for other characters as the cursors move through the menu, but the resulting mini-latencies would have been bound to cause minor frame drops as well. Instead, ZUN opted for a rather creative solution: By segmenting the loading process into four parts and moving three of these parts ahead into the main menu, we instead get four smaller latencies in places where they don't stick out as much, if at all:

  1. The loading process starts at the logo animation, with Ellen's, Kotohime's, and Kana's portraits getting loaded after the 東方時空 letters finished sliding in. Why ZUN chose to start with characters #3, #4, and #5 is anyone's guess. :zunpet:
  2. Reimu's, Mima's, and Marisa's portraits as well as all 9 EXTRA🎔 attack pictograms are loaded at the end of the flash animation once the full title image is shown on screen and before the game is waiting for the player to press a key.
  3. The stat and EXTRA🎔 windows are loaded at the end of the main menu's slide-in animation… together with the question mark portrait for Story Mode, even though the player might not actually want to play Story Mode.
  4. Finally, the game loads Rikako's, Chiyuri's, and Yumemi's portraits after it cleared VRAM upon entering the Select screen, regardless of whether the latter two are even unlocked.

I don't like how ZUN implemented this split by using three separately named standalone functions with their own copy-pasted character loop, and the load calls for specific files could have also been arranged in a more optimal order. But otherwise, this has all the ingredients of good-code. As usual, though, ZUN then definitively ruins it all by counteracting the intended latency hiding with… deliberately added latency frames:

Sure, maybe loading the fourth part's 69,120 bytes from a highly fragmented hard drive might have even taken longer than 30 frames on a period-correct PC-98, but the point still stands that these delays don't solve the problem they are supposed to solve.


But the unquestionable main attraction of this menu is its fancy background animation. Mathematically, it consists of Lissajous curves with a twist: Instead of calculating each point as x = sin((fx·t)+ẟx) y = sin((fy·t)+ẟy) , TH03 effectively calculates its points as x = cos(fx·((t+ẟx) % 0xFF)) y = sin(fy·((t+ẟy) % 0xFF)) , due to t and being 📝 8-bit angles. Since the result of the addition remains 8-bit as well, it can and will regularly overflow before the frequency scaling factors fx and fy are applied, thus leading to sudden jumps between both ends of the 8-bit value range. The combination of this overflow and the gradual changes to fx and fy create all these interesting splits along the 360° of the curve:

At a high level, there really is just one big curve and one small curve, plus an array of trailing curves that approximate motion blur by subtracting from ẟx and ẟy.

In a rather unusual display of mathematical purity, ZUN fully re-calculates all variables and every point on every frame from just the single byte of state that indicates the current time within the animation's 128-frame cycle. However, that beauty is quickly tarnished by the sheer cost of fully recalculating these curves every frame:

This is decidedly more than the 1.17 million cycles we have between each VSync on the game's target 66 MHz CPUs. So it's not surprising that this effect is not rendered at 56.4 FPS, but instead drops the frame rate of the entire menu by targeting a hardcoded 1 frame per 3 VSync interrupts, or 18.8 FPS. Accordingly, I reduced the frame rate of the video above to represent the actual animation cycle as cleanly as possible.
Apparently, ZUN also tested the game on the 33 MHz PC-98 model that he targeted with TH01, and realized that 4,096 points were way too much even at 18.8 FPS. So he also added a mechanism that decrements the number of trailing curves if the last frame took ≥5 VSync interrupts, down to a minimum of only a single extra curve. You can see this in action by underclocking the CPU in your Neko Project fork of choice.

But were any of these measures really necessary? Couldn't ZUN just have allocated a 12 KiB ring buffer to keep the coordinates of previous curves, thus reducing per-frame calculations to just 512 points? Well, he could have, but we now can't use such a buffer to optimize the original animation. The 8-bit main angle offset/animation cycle variable advances by 0x02 every frame, but some of the trailing curves subtract odd numbers from this variable and thus fall between two frames of the main curves.
So let's shelve the idea of high-level algorithmic optimizations. In this particular case though, even micro-optimizations can have massive benefits. The sheer number of points magnifies the performance impact of every suboptimal code generation decision within the inner point loop:

Multiplied by the number of points, even these low-hanging fruit already save a whopping ≥729,088 cycles per frame on an i486, without writing a single line of ASM! On Pentium CPUs such as the one in the PC-9821Xa7 that ZUN supposedly developed this game on, the savings are slightly smaller because far calls are much faster, but still come in at a hefty ≥466,944 cycles. Thus, this animation easily beats 📝 TH01's sprite blitting and unblitting code, which just barely hit the 6-digit mark of wasted cycles, and snatches the crown of being the single most unoptimized code in all of PC-98 Touhou.
The incredible irony here is that TH03 is the point where ZUN 📝 really 📝 started 📝 going 📝 overboard with useless ASM micro-optimizations, yet he didn't even begin to optimize the one thing that would have actually benefitted from it. Maybe he 📝 once again went for the 📽️ cinematic look 📽️ on purpose?

Unlike TH01's sprites though, all this wasted performance doesn't really matter much in the end. Sure, optimizing the animation would give us more trailing curves on slower PC-98 models, but any attempt to increase the frame rate by interpolating angles would send us straight into fanfiction territory. Due to the 0x02/2.8125° increment per cycle, tripling the frame rate of this animation would require a change to a very awkward (log2384) = 8.58-bit angle format, complete with a new 384-entry sine/cosine lookup table. And honestly, the effect does look quite impressive even at 18.8 FPS.


There are three more bugs and quirks in this animation that are unrelated to performance:

Now with the full 18 curves, a direction change of the smaller trailing curves at the end of the loop that only looks slightly odd, and a reversed and more natural plotting order.

If you want to play with the math in a more user-friendly and high-res way, here's a Desmos graph of the full animation, converted to 360° angles and with toggles for the discontinuity and trail count fixes.


Now that we fully understand how the curve animation works, there's one more issue left to investigate. Let's actually try holding the Z key to auto-select Reimu on the very first frame of the Story Mode Select screen:

The confirmation flash even happens before the menu's first page flip.

Stepping through the individual frames of the video above reveals quite a bit of tearing, particularly when VRAM is cleared in frame 1 and during the menu's first page flip in frame 49. This might remind you of 📝 the tearing issues in the Music Rooms – and indeed, this tearing is once again the expected result of ZUN landmines in the code, not an emulation bug. In fact, quite the contrary: Scanline-based rendering is a mark of quality in an emulator, as it always requires more coding effort and processing power than not doing it. Everyone's favorite two PC-98 emulators from 20 years ago might look nicer on a per-frame basis, but only because they effectively hide ZUN's frequent confusion around VRAM page flips.
To understand these tearing issues, we need to consider two more code details:

  1. If a frame took longer than 3 VSync interrupts to render, ZUN flips the VRAM pages immediately without waiting for the next VSync interrupt.
  2. The hardware palette fade-out is the last thing done at the end of the per-frame rendering loop, but before busy-waiting for the VSync interrupt.

The combination of 1) and the aforementioned 30-frame delay quirk explains Frame 49. There, the page flip happens within the second frame of the three-frame chunk while the electron beam is drawing row #156. DOSBox-X doesn't try to be cycle-accurate to specific CPUs, but 1 menu frame taking 1.39 real-time frames at 56.4 FPS is roughly in line with the cycle counting we did earlier.
Frame 97 is the much more intriguing one, though. While it's mildly amusing to see the palette actually go brighter for a single frame before it fades out, the interesting aspect here is that 2) practically guarantees its palette changes to happen mid-frame. And since the CRT's electron beam might be anywhere at that point… yup, that's how you'd get more than 16 colors out of the PC-98's 16-color graphics mode. 🎨
Let's exaggerate the brightness difference a bit in case the original difference doesn't come across too clearly on your display:

Frame 97 of the video above, with a brighter initial palette to highlight the mid-frame palette change
Probably not too much of a reason for demosceners to get excited; generic PC-98 code that doesn't try to target specific CPUs would still need a way of reliably timing such mid-frame palette changes. Bit 6 (0x40) of I/O port 0xA0 indicates HBlank, and the usual documentation suggests that you could just busy-wait for that bit to flip, but an HBlank interrupt would be much nicer.

This reproduces on both DOSBox-X and Neko Project 21/W, although the latter needs the Screen → Real palettes option enabled to actually emulate a CRT electron beam. Unfortunately, I couldn't confirm it on real hardware because my PC-9821Nw133's screen vinegar'd at the beginning of the year. But just as with the image loading times, TH03's remaining code sorts of indicate that mid-frame palette changes were noticeable on real hardware, by means of this little flag I RE'd way back in March 2019. Sure, palette_show() takes >2,850 cycles on a 486 to downconvert master.lib's 8-bit palette to the GDC's 4-bit format and send it over, and that might add up with more than one palette-changing effect per frame. But tearing is a way more likely explanation for deferring all palette updates until after VSync and to the next frame.

And that completes another menu, placing us a very likely 2 pushes away from completing TH03's OP.EXE! Not many of those left now…


To balance out this heavy research into a comparatively small amount of code, I slotted in 2024's Part 2 of my usual bi-annual website improvements. This time, they went toward future-proofing the blog and making it a lot more navigable. You've probably already noticed the changes, but here's the full changelog:

Speaking of microblogging platforms, I've now also followed a good chunk of the Touhou community to Bluesky! The algorithms there seem to treat my posts much more favorably than Twitter has been doing lately, despite me having less than 1/10 of mostly automatically migrated followers there. For now, I'm going to cross-post new stuff to both platforms, but I might eventually spend a push to migrate my entire tweet history over to a self-hosted PDS to own the primary source of this data.

Next up: Staying with main menus, but jumping forward to TH04 and TH05 and finalizing some code there. Should be a quick one.

📝 Posted:
💰 Funded by:
Ember2528, [Anonymous]
🏷️ Tags:

📝 Over two years since the previous largest delivery, we've now got a new record in every regard: 12 pushes across 5 repos, 215 commits, and a blog post with over 14,000 words and 48 pieces of media. 😱 Who would have thought that the superficially simple task of putting SC-88Pro recordings into Shuusou Gyoku would actually mainly focus on deep research into the underlying MIDI files? I don't typically cover much music-related content because it's a non-issue as far as PC-98 Touhou code is concerned, so it's quite fitting how extensive this one turned out. So here we go, the result of virtually unlimited funding and patience:

  1. The SC-88Pro recording controversy
  2. Undefined SysEx behavior
  3. Resolving the controversy, and making a choice (contains personal opinion)
  4. A Unix-style command-line MIDI filter (in Rust BTW)
  5. Visualizing MIDI files (for science, and not for playing them on a keyboard)
  6. Shuusou Gyoku's individual loop quirks 🎺
  7. Rewriting pbg's MIDI code
  8. Putting together the BGM packs
  9. Outgrowing miniaudio (and raging about single-file C libraries for a while)
  10. Remaining implementation details
  11. Pricing changes (and no, not everything's getting more expensive)

So where's the controversy? Romantique Tp obviously made the best and most careful real-hardware SC-88Pro recordings of all of ZUN's old MIDIs, including the original (OST) and arranged (AST) soundtrack of Shuusou Gyoku, right? Surely all I have to do now is to cut them into seamless loops to save a bit of disk space, and then put them into the game? Let's start at the end of the track list with the name registration theme, since it's light on instruments and has an obvious loop point that will be easy to spot in the waveform. But, um… wait a moment, that very first drum note comes a bit late, doesn't it?

This can also be heard in Romantique Tp's YouTube upload.
At a notated tempo of 96 BPM, these first four beats should take exactly 2.5 seconds, which they do in this seamlessly looping softsynth rendering.

That's… not quite the accuracy and perfection I was expecting. :thonk: But I think I know what we're seeing and hearing there. Let's look at the first few MIDI events across all channels:

Delta	Pulse	 Beat	Channel	Event
 +540	   960	  2:000	      1	Controller { CC   0, value   0 }
   +0	   960	  2:000	      1	Controller { CC  32, value   0 }
   +0	   960	  2:000	      1	ProgramChange {  37 }
[…]
   +0	   960	  2:000	      2	Controller { CC   0, value   0 }
   +0	   960	  2:000	      2	Controller { CC  32, value   0 }
   +0	   960	  2:000	      2	ProgramChange {  19 }
[…]
   +0	   960	  2:000	      3	Controller { CC   0, value   0 }
   +0	   960	  2:000	      3	Controller { CC  32, value   0 }
   +0	   960	  2:000	      3	ProgramChange {   6 }
[…]
   +0	   960	  2:000	      4	Controller { CC   0, value   0 }
   +0	   960	  2:000	      4	Controller { CC  32, value   0 }
   +0	   960	  2:000	      4	ProgramChange {   2 }
[…]
Delta	Pulse	 Beat	Channel	Event
   +0	  960	2:000	     10	Controller { CC   0, value   0 }
   +0	  960	2:000	     10	Controller { CC  32, value   0 }
   +0	  960	2:000	     10	ProgramChange {  25 }
   +0	  960	2:000	     10	Controller { CC   7, value 127 }
   +0	  960	2:000	     10	Controller { CC  11, value 127 }
   +0	  960	2:000	     10	Controller { CC  10, value  64 }
   +0	  960	2:000	     10	Controller { CC  91, value  80 }
   +0	  960	2:000	     10	Controller { CC  93, value  40 }
   +0	  960	2:000	     10	NoteOn { Key  42, Vel.  94 }
   +0	  960	2:000	     10	NoteOn { Key  36, Vel. 110 }
   +1	  961	2:001	     10	NoteOn { Key  42, Vel.   0 }
   +0	  961	2:001	     10	NoteOn { Key  36, Vel.   0 }
 +119	 1080	2:120	     10	NoteOn { Key  42, Vel.  34 }
   +1	 1081	2:121	     10	NoteOn { Key  42, Vel.   0 }
 +119	 1200	2:240	     10	NoteOn { Key  42, Vel.  64 }
   +0	 1200	2:240	     10	NoteOn { Key  36, Vel.  64 }
Also, the fact that GS doesn't put its drums on a non-general voice bank and instead relies on external channel configuration to differentiate drums from pitched instruments is making this Yamaha kid uncontrollably furious. 🤬

Yup. That's the sound of a vintage hardware synth being slow and taking a two-digit number of milliseconds to process a barrage of simultaneous Program Change messages, playing a MIDI file that doesn't take this reality into account and expects program changes to happen instantly.
I can only speak from my own experience of writing MIDIs for hardware synths here, but having the first note displaced by 50 ms is very much not the way a composer would have intended the music to be heard if the note is clearly notated to occur on the beat. If you had told me about such an issue when playing one of my MIDIs on a certain synth, I would have thanked you for the bug report! And I would have promptly released a fixed version of the MIDI with the Program Change events moved back by a beat or two. In the case of Shuusou Gyoku's MIDIs, this wouldn't even have added any additional delay in-game, as all of these files already start with at least one beat of leading silence to make room for setting Roland-specific synth parameters.

OK, but that's just a single isolated bass drum hit. If we wanted to, we could even fix this issue ourselves by splicing the same note from around the loop end point. Maybe this is just an isolated case and the rest of Romantique Tp's recordings are fine? Well…

Again, check Romantique Tp's YouTube upload for proof.
By the way, this seamless audio player is what consumed most of the two website pushes this time. The rest went to the slightly redesigned main page, whose progress bars now use the cap bar style and the GitHub badge colors.

This one is even worse. Here, the delay is so long relative to the tempo of the piece that the intended five drum hits pretty much turn into four.

This type of issue doesn't even have to be isolated to the very beginning of a piece. A few of the tracks in both the OST and AST start with an anacrusis on just one or two channels and leave the Program Change event barrage at the beginning of the first full measure. In 幻想科学 ~ Doll's Phantom for example, this creates a flam-like glitch where the bass on channel 2 is pretty much on time, but the crash hit on channel 10 only follows 50 ms later, after the SC-88Pro took its sweet time to process all the Program Change events on the channels between:

This is from the arranged soundtrack for a change. In that one, ZUN at least fixed the issue in the final three MIDIs (シルクロードアリス, 魔女達の舞踏会, and 二色蓮花蝶 ~ Ancients) that closed out this rearranging project in May 2001, which spread out their per-channel setup events over at least a single measure before playing any note.

Let's listen to that at half speed:

Romantique Tp's YouTube upload.
Still on point.

Sure, all of this is barely noticeable in casual listening, but very noticeable if you're the one who now has to cut these recordings into seamless loops. And these are just the most obvious timing issues that can be easily pinpointed and documented – the actual worst aspects are all the minor tempo and timing fluctuations throughout most of the pieces. With recordings that deviate ever so slightly from the tempo defined in the MIDI files, you can no longer rely on mathematically exact sample positions when cutting loops. Even if those positions do work out from time to time, there'd pretty much always be a discontinuity in the waveform at both ends of the loop, manifesting as a clearly audible click. In the end, the only way of finding good loop points in existing recordings involves straining your ears and listening very, very closely to avoid any audible glitches. 😩

But if you've taken a look at the second tabs in the clips above, you will have noticed that we don't necessarily have to be stuck with recordings from real hardware. In late 2015, Roland released Sound Canvas VA, a VST plugin that emulates the classic core of Roland's old Sound Canvas lineup, including the SC-88Pro. As long as we run such a software synthesizer through a quality VST host, a purely software-based solution should be way superior for recording looped BGM:

Any drawbacks? For our use case, all of them are found in the abysmal software quality of everything around the synth engine. As it's typical for the VST industry, Sound Canvas VA is excessively DRM'd – it takes multiple seconds to start up, and even then only allows a single process to run at any given time, immediately quitting every process beyond the first one with a misleading Parameter File1 Read Error message box. I totally believe anyone who claims that this makes SCVA more annoying than real hardware when composing new music. Retro gamers also dislike how Roland themselves no longer sells the 32-bit builds they used to offer for the first few versions. These old versions are now exclusively available through resellers, or on the seven seas.
But as far as the SC-88Pro emulation is concerned, there don't seem to be any technical reasons against it. There is a long thread over at VOGONS discussing all sorts of issues, but you have to dig quite deep to find any clear descriptions of bugs in SCVA's synth engine. Everything I found either only applies to the SC-55 emulation and not the SC-88Pro, was fixed by Roland in the meantime, or turned out to be a fixable bug in a MIDI file.

Nevertheless, Romantique Tp has a very negative opinion about SCVA, getting quite angry and defensive in this instance where someone favorably compared SCVA to their recordings. Edit (2024-03-10): These days, Romantique Tp has a much more favorable opinion on SCVA as well.
8 years after their release, however, the community unanimously accepts the Romantique Tp recordings as the intended way to listen to ZUN's old MIDIs, so choosing Sound Canvas VA for our Shuusou Gyoku builds might be a bad idea purely for PR reasons. At best, people would slightly wonder why I intentionally went with the opposite of the accepted reference recordings, but at worst, this entire project could face a violent backlash…


But wait, we've already heard one obvious difference between the real SC-88Pro and Sound Canvas VA. Let's listen to the very first clip again:

Ha! You can clearly hear a panning echo in the real-hardware recording that is missing from the Sound Canvas VA rendering. That's an obvious case of a core system effect not being reproduced correctly. If even that's undeniably broken, who knows which other subtle bugs SCVA suffers from, right? Case closed, Romantique Tp was right all along, SCVA is trash, real hardware reigns supreme :godzun:

Actually, let's look closer into this one. Panning delay effects like this are typically reverb-related, but General MIDI only specifies a single controller to specify the per-channel reverb level from 0 to 127. Any specific characteristics of the reverb therefore have to be configured using vendor-specific system-exclusive messages, or SysEx for short.
So it's down to one of the four SysEx messages at the beginning of the MIDI file:

Delta	Pulse	 Beat	Event
   +0	    0	0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)
 +240	  240	0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)
 +120	  360	0:360	SysEx(41 10 42 12 40 01 33 0F 7D F7)
  +60	  420	0:420	SysEx(41 10 42 12 40 01 34 30 5B F7)

Since these byte strings represent Roland-specific instructions, we can't learn anything from a raw MIDI event dump alone here. No problem though, let's just load these files into some old MIDI sequencer that targeted Roland synths, open its MIDI event list, and then they will be automatically decoded into a human-readable representation…
…or at least that's what I expected. In Yamaha land, XGworks has done that for Yamaha's own XG SysEx messages ever since 1997:

Screenshot of the MIDI Event Viewer in Yamaha's XGworks, showing off its automatic XG SysEx decoding feature.
No configuration required. You can even edit the textual Value1 representation and XGworks parses it back into the closest supported value!

But for Roland synths, there's… nothing similar? Seriously? 😶 Roland fanboys, how do you even live?! I mean, they are quick to recommend the typical bloated and sluggish big-name DAWs that take up multiple gigabytes of disk space, but none of the ones I tried seemed to have this feature. They can't have possibly been flinging around raw byte strings for the past 33 years?!
But once you look more into today's MIDI community, it becomes clear that this is exactly what they've been doing. Why else would so many people use the word complicated to describe Roland SysEx, or call it an old school/cryptic communication protocol in hexadecimal format? The latter is particularly hilarious because if you removed the word cryptic, this might as well describe all of MIDI, not just SysEx. :tannedcirno: Everything about this is a tooling issue, and Yamaha showed how easily it could have been solved. Instead, we get Sound Canvas experts, who should know more about the ecosystem than I do, making the incredible mental leap from "my DAW doesn't decode or easily generate SysEx" to "SysEx is antiquated" to "please just lift up these settings to the VST level and into my proprietary DAW's proprietary project format, that would be so much better"

Thankfully that's not entirely true. After some more digging and configuration, I found a somewhat workable solution involving a comparatively modern sequencer called Domino:

  1. Download either Domino's original Japanese version or the partial English translation. The .zip file on the release page contains a full standalone build.
  2. Open the File → Preferences menu and associate your MIDI output device with a module map. This makes sense for SysEx encoding/generation since it can limit the options in the UI to what's actually available on your target hardware, but is also required for selecting the respective SysEx map into Domino's SysEx decoder. There is no technical reason for this because SC-88Pro SysEx messages can be uniquely identified by the three vendor, device, and model ID bytes that every message starts with, but would be too easy and user-friendly. The perception of SysEx being a black art must be upheld at all costs.
    Screenshot of Domino's MIDI-OUT window, complete with garbled text
    I've kept the garbled text of the partial translation to emphasize the sheer amount of jank involved in this entire process.
  3. Load a MIDI file and let Domino "analyze" it:
    Screenshot of Domino's analysis message box
  4. Strangely enough, this will take quite a while – on my system, this analysis step runs at a speed of roughly 4.25 KB/s of MIDI data. Yes, kilobytes.
  5. Unfortunately, "control change macro restoration" also seems to mean that you don't get to see any raw bytes when selecting the respective MIDI track in the UI, but at least we get what we were looking for:
    Screenshot of the four SysEx messages of タイトルドメイド, Shuusou Gyoku's name registration theme, as decoded by Domino
    …for the most part?
    Pulse	Event
        0	SysEx(41 10 42 12 40 00 7F 00 41 F7)
      240	SysEx(41 10 42 12 40 01 30 14 7B F7)
      360	SysEx(41 10 42 12 40 01 33 0F 7D F7)
      420	SysEx(41 10 42 12 40 01 34 30 5B F7)

Alright, that's something we can work with. The GS Reset message is something that every Roland GS MIDI should start with, but it's immediately followed by a message that Domino failed to decode? The two subsequent reverb parameters make sense, but panning delays typically have more parameters than just a reverb level and time.
That unknown SysEx message shares much of the same bytes with the decoded ones though. So let's do what we maybe should have done all along, return to caveman, and check the SC-88Pro manual:

The relevant section from page 194. We can see how the address and value correspond to bytes 5-7 and 8 in the SysEx messages. Byte 9 is a checksum and byte 10 signals the end of the message.

And that's where we find what this particular issue boils down to. The missing SysEx message is clearly intended to be a Reverb Macro command, whose value can range from 0 to 7 inclusive on the SC-88Pro, but ZUN tries to specify Reverb Macro #14h, or 20 in decimal. The SC-88Pro manual does not specify what happens if a SysEx message wants to write an invalid value to a valid address, which means that we've firmly entered the territory of undefined behavior.
Edit (2024-03-10): Romantique Tp confirmed that the real SC-88Pro clamps these Reverb Macro IDs to the supported range of 0-7. Therefore, the appropriate course of action for guaranteeing the same sound on other Roland synths would be to fix the MIDI file and specify Reverb Macro #7 instead. But since this behavior remains technically undefined, we can still argue about ZUN's intention behind specifying the Reverb Macro like this:

In fact, 32 out of the 39 MIDIs across both of Shuusou Gyoku's soundtrack use this invalid Reverb Macro. The only ones that don't are


And that's where this quest seemed to end, until Romantique Tp themselves came in and suggested that I take a closer look at the GS Advanced Editor, or GSAE for short.

The splash screen of GSAE version 4.01e.
Make sure to connect a MIDI input device before starting GSAE, or it will silently crash immediately after this splash screen. At least it accepts any controller, so this might just be a bug instead of the typical user-hostile kind of hardware dongle DRM that is pervasive in today's synth industry. 1999 would seem a bit too early for that, thankfully.

I was aware of this tool, but hadn't initially considered it because it's always described as just a SysEx generator/encoder. In fact, the very existence of such a tool made no sense to me at first, and seemed to prove my point that the usability of GS SysEx was wholly inferior to what I was used to in Yamaha land. Like, why not build at least a tiny and stripped-down MIDI sequencer around this functionality that would allow you to insert SC-88Pro-specific messages at any point within a sequence, and not just the beginning? I can see the need for such a tool in today's world of closed-source DAWs where hardware MIDI modules are niche and retro and are only kept alive by a small community of enthusiasts. But why would its developers guarantee that MIDI composers would have to hop between programs even back in 1997? I can only imagine that they saw how every just slightly advanced MIDI sequencer or DAW back then already used its own project format instead of raw Standard MIDI Files, and assumed that composers would therefore be program-hopping anyway?
However, GSAE does support the import of settings from a MIDI file and features a SysEx history window that decodes every newly processed Roland SysEx byte string, which is all I was looking for. So let's throw in that same MIDI and…

Screenshot of GSAE's SysEx history window,showing the results of sending a GS Reverb Macro #20 message
That's the result of sending just the single F0 41 10 42 12 40 01 30 14 7B F7 message at the top.

Now that's some wild numbers. An equally invalid Reverb Character, and Reverb Level and Time values that even exceed their defined range of 0-127? Could it be that GSAE emulates the real-hardware response to invalid Reverb Macros here, and gives us the exact reverb setting we can hear in Romantique Tp's recording? This could even be the reason why GSAE is still used and recommended within today's Roland MIDI sequencing scene, and hasn't been supplanted by some more modern open-source tool written by the community.

In any case, these values have to come from somewhere, so let's reverse-engineer GSAE and figure out the logic behind them. Shoutout to IDR for being a great help with its automatic generation of IDC debug symbols for the Delphi standard library, and even including a few names of application-level widget class methods by reading Delphi-specific type information from the binary. This little sub-project made me also come around to appreciating Ghidra, whose decompiler and data type manager helped a lot and allowed me to find the relevant code section within just a few hours.
A~nd it turns out that the values all come from out-of-bounds accesses into arrays on the stack. :onricdennat: If we combine 25, 235, and 132 back into a 32-bit value, we get 0x19EB84, which is the virtual address of the relevant function's stack frame base pointer.
But it gets even more hilarious: If you enable debug text output via Option → Other Options → SMF → Insert text events to setup measures and export these imported settings back into a MIDI file, GSAE not only retains these invalid Reverb Macro IDs, but stringifies them via a simple lookup into a hardcoded string pointer array, again without any bounds checks. The effects of this are roughly what you would expect:

In the end, we have Domino not decoding the Reverb Macro message, and GSAE, the premier SysEx tool for Roland synths, responding to it in even more undefined and clearly bugged ways than real hardware apparently does. That's two programs confirming that whatever ZUN intended was never supposed to work reliably. And while we still don't know exactly what these reverb parameters are supposed to be, these observations solve the mystery as far as I'm concerned, and solidify my personal opinion on the matter.


So what do we do now, and which version do we go with? Optimally, I'd offer both versions and turn this controversy into a personal choice so that everybody wins… and Ember2528 agreed and generously provided all the funding to make it happen. 💸
If you haven't picked your favorite yet, here are some final arguments:

The Romantique Tp recordings certainly have something going for them with their provenance of coming from real hardware, and the care that Romantique Tp put into manually recording every single track, warts and all. I wholeheartedly agree that preserving the raw sound of playing the MIDI files into the hardware without thinking about bugs or quirks is an important angle to take when it comes to preservation. It's good that these recordings exist – after all, you wouldn't know which musical elements you'd possibly be missing in an emulation if you have nothing to compare it to. Even the muffled sound in the half-speed clip above can be an argument in their favor, as the SC-88Pro's DAC operates at 32 kHz and you wouldn't expect any meaningful frequency content between 16,000 and 22,050 Hz to begin with. Any frequency content in that range that does remain in Romantique Tp's recording is simply 📝 rolled-off imaging noise added during the ADC's resampling process.
All this is why they are a definite improvement over kaorin's 2007 recordings of only the AST, which used to be the previous reference recordings within the community. Those had all of the same timing issues and more, in addition to being so excessively volume-boosted that 0.15% of the samples across the entire soundtrack ended up clipped. That's 6.25 seconds out of 68:39m being lost to pure digital noise.

Most importantly though: ZUN himself said that only the real SC-88Pro will play back these files as he intended them to sound. This quote is likely where the tagline of Romantique Tp's entire recording project came from in the first place:

> 全てのデエタはSC-88ProもしくはSC-8850(ロオランド社)にて最適に聴けるように調整してあります > それ以外の音源でも、作者の意図した音ではない場合があります。 — ZUN on 東方幻想的音楽, his old MIDI page

However. ZUN is not exactly known for accurately and carefully preserving the legacy of his series, or really doing anything beyond parading his old games as unobtainable showpieces at conventions. With all the issues we've seen, preferring real hardware is ultimately just that: an angle, and a preference. This is why I disagree with the heavy and uncritical advertising that is mainly responsible for elevating the Romantique Tp recordings to their current reference status within the community, especially if at least half of the alleged superiority of real hardware is founded on undefined behavior that can easily be fixed in the MIDI files themselves if people only bothered to look.

Here's where I stand: MIDI files are digital sheet music first and foremost, not an inferior version of tracker modules where the samples are sold separately. As such, the specific synth a MIDI file was written for is merely a secondary property of the composition – and even more so if the MIDI file contains little to nothing in terms of sound design and mostly restricts itself to the basic feature set of General MIDI. In turn, synth quirks and bugs are not a defined part of the composition either, unless they are clearly annotated and documented in the file itself. And most importantly: If the MIDI file specifies a certain timing and a recording fails to reproduce that timing, then that recording is not an accurate representation of the MIDI file.
In that regard, Sound Canvas VA is not only the closest alternative to the real thing, as a few people in the MIDI and retrogaming scene do have to admit, but superior to the real thing. I'll gladly take clarity and perfect timing accuracy in exchange for minor differences in effects, especially if the MIDI file does not explicitly and correctly define said effects to begin with. If I want a panning delay as part of the reverb, I add the respective and correct SysEx message to define one – and if I don't, I do not care about the reverb. You might still get a panning delay on a certain synth, and you might even prefer how it sounds, but it's ultimately a rendering artifact and not a consciously intended part of the composition. In that way, it's similar to the individual flavor a musician adds to a performance of a piece of classical music.
And as far as the differences in frequency response and resonant filters are concerned: In Yamaha land, these are exactly the main distinguishing factors between vintage WF-192XG sound cards (resembling the real SC-88Pro in these characteristics) and the S-YXG50 softsynth (resembling SCVA). Once I found out about that softsynth and how much clearer it sounded in comparison, I sold that old PCI sound card soon after.

In the interest of preservation though, there's still one more unexplored solution that could be the ideal middle ground between the two approaches:

  1. Play the MIDIs through a real-hardware SC-88Pro again
  2. Capture the actually observed system-exclusive settings that fall within the synth's supported and documented ranges
  3. Insert them back into the MIDI file, creating a new bugfixed version
  4. Re-record that bugfixed version through Sound Canvas VA

Edit (2024-03-10): And since Romantique Tp has confirmed what exactly happens on real hardware, I'm going to do exactly that. These bugfixed Sound Canvas VA renderings will be a free bonus of the single next Shuusou Gyoku push, and will add another angle to the preservation of these soundtracks. In the meantime though, the Sound Canvas VA packs will sound like they do in the preview videos above.

Or, you know… Maybe none of this actually matters. Here's beatMARIO streaming some Shuusou Gyoku gameplay using what looks like a real-hardware SC-8850, which plays these MIDIs with occasionally noticeably different instrument patches and no panning delay in the name registration theme, and he still enjoyed every second of it. Imagine undefined SysEx behavior not even being consistent within the same family of Roland synths… nah, I'm done arguing, let's get back to the actual work and cut some loops.


Just to be clear: I'm not suggesting that Romantique Tp should have been the one to cut their recordings into loops, or even just the one who defined where the loop points are supposed to be. On the surface, this seems to be a non-issue, and you'd just pick a point wherever each track appears to loop, right? But with 39 MIDIs to cut and all the financial support from Ember2528, it made sense to also solve this problem more thoroughly, and algorithmically detect provably correct loop points for all of these files. Who knows, maybe we even find some surprises that make it all worth it?
This is the algorithm I came up with:

Of course, this algorithm isn't perfect and won't work for every MIDI file out there. It doesn't consider things like differently ordered events within the same MIDI pulse, (non-)registered parameter numbers, or the effect that SysEx messages can have on the state of individual channels. The latter would require the general SysEx decoding logic that I would have liked to have for the research above… actually, let's add an issue and add the project to the order form. I'd really like to see a comprehensive open-source cross-vendor SysEx decoder library in my lifetime.

As for the implementation, I was happy to write some Rust again for a change, as it's a great fit for these standalone greenfield command-line tools that don't have to directly interact with the legacy C++ code bases that this project usually deals with. It's even better if the foundational functionality is not just available in a crate, but in four, with the community already having gone through multiple iterations to arrive at a tried and tested winner. Who knows, maybe I even get to rewrite this website in it one day? Just for the sheer meme value of doing so, of course.
I also enjoyed this a lot from a technical point of view:

This algorithm works well for the long MIDI files of Shuusou Gyoku's OST that all contain multiple duplicates of their loop section, but it quickly reaches its limit with the AST. Following the classic two-loop + fade-out format, that soundtrack was meant to be played back in generic MIDI players, and not to actually be put back into the game in looped form. Since the loop algorithm did, in fact, find inconsistencies even in the OST, two copies of the apparent loop are sometimes not enough to prove cases where the actual loop ends much later than you think it does. In a few cases, it would be enough to simply remove all volume change events from the fade-out to prove the actual loop, but in others, the algorithm would need MIDI event data far past the end of the fade-out.

However, just giving up and not looping any of these tracks would be equally unfortunate. So how about shifting the question, from what's the best loop in this MIDI file to what's the best loop if the MIDI didn't fade out and instead repeated its apparent second loop a third time? As long as the detected loop in such a pre-processed file ends before the repeated range, it's still a valid loop in terms of the unmodified original.
Ideally, we want to do this pre-processing programmatically with the same Rust library instead of manually editing the MIDI. Many sequencers (and especially XGworks) apply significant changes to a MIDI file's internal structure when saving its internal representation back to a MIDI file, which might even mess with our loop algorithm. So it would be very nice to have a more trustworthy tool that applies only the edit we actually want, and perfectly retains the rest of the MIDI.

And that's how this sub-project turned into a small suite of command-line MIDI operations in the classic Unix filter/pipeline style: Each command reads a MIDI file from stdin, transforms it, and outputs text or the resulting MIDI file on stdout. This way, we gain maximum transparency and reproducibility as I can document the unique pre-processing steps for each AST track by simply providing the command lines. And sure, we're re-encoding and re-decoding the full MIDI sequence at every step along such a pipeline, but computers are fast, Rust and the midly library in particular are ⚡ blazingly fast ⚡, and the usability benefits of this pipeline model far outweigh any theoretical performance drops.
Here's the full list of commands that made it into the resulting mly tool:

This feature set should strike a good balance between not spending too much of the Shuusou Gyoku budget on tangential problems, but still offering a decent solution for the problem at hand. As a counterexample, the obvious killer feature – deserializing a dump back into a Standard MIDI File – would have gone way past the budget. While there are crates that free you from the need to write manual parsing code for basic data structures, they would instead require a lot of attribute boilerplate – and if the library that provided the structures doesn't already come with these attributes, you now have to duplicate all the structures, and convert back and forth between the original structures and your copies. Not to mention that we'd still have to write code for the high-level structure of the dump output…

If we put it all together, this is what we can do:

$ <ssg_02.mid mly loop-find
Best loop in note space: 4 events (between event #[117, 121[ and [121, 125[)
First note: event    71 / pulse    960 / beat   2:000 / 0:00:800m
Loop start: event   117 / pulse   1680 / beat   3:240 / 0:01:400m
  Loop end: event   121 / pulse   1920 / beat   4:000 / 0:01:600m

$ <ssg_02.mid mly cut 466: | mly loop-unfold 240: | mly -r 44100 loop-find
Track #0: Removing events #[16439, 19881[
Track #0: Repeating events #[8344, 16439[ at the end of the sequence
Best loop in note space: 8095 events (between event #[5625, 13720[ and [13720, 21815[)
First note: event    71 / pulse    960 / beat   2:000 / 0:00:800m
Loop start: event  5625 / pulse  75361 / beat 157:001 / 1:03:531m
  Loop end: event 13720 / pulse 183841 / beat 383:001 / 2:34:726m

Best loop in recording space:  8095 events (between event #[5709, 13804[ and [13804, 21899[)
First note: event    71 / pulse    960 / beat   2:000 / 0:00:800m / sample    35280.00
Loop start: event  5709 / pulse  77280 / beat 161:000 / 1:05:163m / sample  2873667.66
  Loop end: event 13804 / pulse 185760 / beat 387:000 / 2:36:358m / sample  6895375.27

Translation:


So, where are these loop quirks that justify why some of these audio files are longer than you'd think they should be? Just listing them as text wouldn't really communicate just how minor these are. It would be much nicer to visualize them in a way that highlights the exact inconsistencies within a fixed range of MIDI measures. Screenshots of MIDI sequencer or DAW windows won't capture these aspects all too well because these programs are geared toward fine-grained editing of single tracks, not visualization of details across all channels.

Screenshot of the first 8 measures of Shuusou Gyoku's Stage 1 theme (フォルスストロベリー) in its OST version, as visualized by REAPER's piano roll
REAPER's piano roll nicely snaps to a certain range, but good luck picking out the individual lines from the single volume lane at the bottom of the screen, or spotting a 7-point difference. Not to mention that CC #11 (Expression) makes up an equal part of a channel's final perceived volume, which is the metric we'd actually want to visualize.

Typical MIDI visualizers, however, are on the complete opposite end of the spectrum. In recent years, MIDI visualization has become synonymous with the typical Synthesia style of YouTube videos with a big keyboard at the bottom, note bars flying in from the top, and optional fancy effects once those notes hit the top of the keyboard. The Black MIDI community has been churning out tons of identically looking MIDI visualizers in recent years that mainly seem to differ in the programming language they're written in, and in how well they can cope with the blackest of black MIDIs.
Thankfully, most of these visualizers are open-source and have small and manageable codebases. The project with the most GitHub stars and the most generic name seemed to be the best starting point for hacking in the missing features, despite using GLSL shaders which I had no prior experience with. It was long overdue that I did something with GLSL though – it added a nice educational aspect to these hacks, and it still was easier than deciphering whatever the fastest and hyper-optimized Rust visualizer is doing.
Still, this visualizer needed a total of 18 small features and bugfixes to be actually usable for demonstrating Shuusou Gyoku's loop quirks. As such, these hacks turned into yet another tangential sub-project that could have easily consumed another two pushes if I cleaned up the code and published the result. But that would have really gone way past the budget for something that people might not even care about. So here's what we're going to do:


Alright then! Here's how to read the visualizations:



Before we package up these looped soundtracks, let's take a quick look at how they would be shown off in the Music Room. The Seihou Music Rooms carry over the per-channel keyboards from TH05, add the current per-channel volume, expression, and pan pot values, and top it off with a fake spectrum analyzer. All of these visualizations rely on MIDI data, and the Music Room would feel very dull and boring without them. Just look at Kioh Gyoku, whose Music Room basically turns into a still image in WAVE mode.
Retaining these visualizations even when playing waveform BGM was very important for me, and not just because it would make for a unique high-quality feature that would break new ground. It can also double as proof that the waveform versions are, in fact, in perfect sync with both the MIDIs they are based on, and, by extension, the respective stage scripts.
However, this would require the game to process the MIDIs and update the internal visualization state without simultaneously playing them back through the WinMM / MME / midiOut*() API. And just like graphics and text rendering, Shuusou Gyoku's original code came with zero architectural separation between platform-independent processing logic and platform-specific playback…

So I accidentally rewrote almost the entire MIDI code to achieve said separation. :tannedcirno: This also provided a great occasion to modernize this code and add some much-needed robustness for potential MIDI mods, while retaining the original code's approach of iterating over raw SMF byte streams. It might all have been very excessive for a delivery that was supposed to be just about waveform BGM support, but on the plus side, MIDI output is now portable to any other system's MIDI API as well.

Surprisingly though, it was Shuusou Gyoku's original MIDI timing that quickly turned out to be rather inaccurate, and not the waveforms. The exact numbers vary depending on the piece, but the game played back every MIDI about 1% slower than notated, adding about 2 or 3 seconds to their total playback time after 5 minutes. Tempo changes in particular were the biggest causes of desynchronizations with the waveforms… :thonk:
To understand how this can happen to begin with, we have to look closer at how you're supposed to use the midiOut*() API. This API is as low-level as it gets, only covering the transmission of a single MIDI message to the selected output device right now. There is no concept of note timing at this low level, so it's completely up to the program to parse delta times and tempo change events out of the MIDI file and correctly time the calls to this API for each MIDI message. With all the code that runs between the API and the actual renderer of the synth for every single message, the resulting timing can only ever be an approximation of the MIDI file. This doesn't really matter for the timescales and polyphony levels of typical music because, again, computers are fast, but such an API is fundamentally unsuitable for accurately playing back even just a moderately complex million-note Black MIDI. :onricdennat:

Shuusou Gyoku handles this required manual timing in the simplest possible way: It runs a MIDI processing function (Mid_Proc() in the code) at an interval of 10 ms, which processes and instantly sends out all MIDI events that have occurred at any point within the last 10 ms, maintaining merely their order. This explains not only why the original game incremented its MIDI TIMER by multiples of 10, but also the infamous missing drums when playing the soundtrack through the Microsoft GS Wavetable Synth:

But while sending MIDI events in such quantized chunks might not be perfect, it can't be the cause behind multi-second playback slowdowns. Instead, this issue has to boil down to the way Shuusou Gyoku times each individual message, and specifically how it converts between MIDI pulse units and real-time (milli)seconds. pbg's original MIDI code chose to do this in an equally confusing and inaccurate way: it kept two counters that tracked the current MIDI pulse before and after the latest tempo change, used the value of the latter counter to decide which events to process, and only added the pulse equivalent of 10 ms to this counter at the end of Mid_Proc() in the then current tempo. The commit message for my rewritten algorithm details the problems with this approach using nice ASCII art in case you're interested, but in short, the main problem lies in how the single final addition can only consider a single tempo change within each call to Mid_Proc(). If a MIDI file contains tempo ramps with less than 10 ms between each different tempo, the original game would only use the last of these tempo values as the basis for converting the entire 10 ms back into MIDI pulses. Not to mention that maybe MIDI pulses aren't the best unit in a game that still 📝 treats the FPU as lava and doesn't use any fixed-point means of increasing the resolution of the 10 ms→pulse division either…

On the contrary, it's much more accurate to immediately convert every encountered MIDI delta time to a real-time quantity and use that unit for event timing, especially if we want to restrict ourselves to integer math. Signed 64-bit integers are enough to fit the product of the slowest possible MIDI tempo ((224 - 1) µs per quarter note) and the highest possible MIDI delta time (228 - 1) at nanosecond precision (103), with one bit to spare. Then, we arrive at a much simpler timing algorithm:

The additive nature of this timer not only naturally allows more than one event to happen within a single Mid_Proc() call, but also averages out any minor timing inconsistencies across the length of a track.

This new algorithm did improve the overall timing accuracy, but only barely, shaving off just ≈100 ms of the total duration. Turns out that the main source behind the slowness was hiding somewhere else entirely, in the single line that deserializes tempo values from MIDI's big-endian representation into the native integer format:

assert(length_of_tempo_message == 3);
uint32_t tempo = 0;
for(int i = 0; i < length_of_tempo_message; i++) {
-	tempo += ((tempo << 8) + (*track_data++));
+	tempo  = ((tempo << 8) + (*track_data++));
}

Yup – the original code performed two additions per byte, which incorrectly added the interim value at every byte to the final result, and yielded a tempo that is ≈0.8% / ≈1 BPM slower than notated in the MIDI file, matching the number we were looking for. That's why the |/OR operator is the safer one to use in such a bit-twiddling context…
But now I'm curious. This is such a tiny bug that is bound to remain unnoticed until someone compares the game's MIDI output to another renderer. It must have certainly made it into other games whose MIDI code is based on Shuusou Gyoku's, or that pbg was involved with. And sure enough, not only did this bug survive Kioh Gyoku's OOP refactoring, but it even traveled into Windows Touhou, where it remained in every single game that supported MIDI playback. Now we know for a fact that pbg's Program Support role in the TH06 credits involved sharing ready-made, finished code with ZUN:

Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH06Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH07Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH08Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH09Disassembly of the Shuusou Gyoku MIDI tempo deserialization bug in TH10
The broken tempo deserialization in the respective latest full versions of TH06 through TH10. And yes, that's TH10 – even though TH09's trial version was the last game to ship MIDI versions of its soundtrack, TH10 still contained all of pbg's MIDI code that originated back in Shuusou Gyoku, before TH11 finally removed it.
Amusingly, ZUN's compiler even started optimizing the combination of left-shifting and addition to a multiplication with 257 for TH09, which even sort of highlights this bug if you're used to reading x86 ASM.

That leaves support for MIDI loop points as the only missing feature for syncing MIDI data with a looping waveform track. While it didn't require all too much code, pbg's original zero-copy approach of iterating over raw MIDI data definitely injected a lot of complexity into the required branches. Multi-track/SMF Type 1 files require quite a bit of extra thought to correctly calculate delta times across loop boundaries that reach past the end of the respective track, while still allowing the real-time delta values to be resynchronized at tempo changes within the loop – and yes, 3 of ZUN's 19 arranged MIDI files actually do use more than one track, so this wasn't just about maximizing MIDI compatibility for mods. I stuck to the original approach mostly as a challenge and to prove that it's possible without first parsing the entire MIDI sequence into a friendlier internal representation, but I absolutely do not recommend this to anyone else. :tannedcirno:

After hardcoding the loop points detected by mly into the binary, we only need to call Mid_Proc() once per frame in the Music Room and pass the frame delta time instead of the 10 ms constant. And then, we get this:

The MIDI TIMER now shows off the arguably more interesting current MIDI pulse value rather than just formatting the PASSED TIME in milliseconds. Ironically, displaying this value in a constantly counting way takes more effort now – the new nanosecond-based timing code doesn't use any measure of total MIDI pulses anymore, and they don't naturally fall out of the algorithm either. Instead, the code remembers the total pulse value of the last event it processed and adds the real-time duration that has passed since, similar to the original timing algorithm.
This naturally causes the timer to jump from the loop end pulse to the loop start pulse, proving that Mid_Proc() is in fact looping the sequence.

Alright, now we know what to package:

Unfortunately, we still haven't reached the end of the complications and weird issues that haunt Shuusou Gyoku's music:

  1. The original game reads the in-game track title directly out of the first Sequence Name event of the playing MIDI file. The waveform equivalent would be the Vorbis comment TITLE tag, which therefore should exactly match the original track's title, down to the exact placement of whitespace. As usual, if I emphasize minor things like this, it's not without reason: 幻想科学 ~ Doll's Phantom inconsistently uses halfwidth spaces at both sides of the , and wouldn't fit into the Music Room's limited space otherwise.

  2. However, the AST MIDI files jam a bunch of other metadata into their Sequence Names, roughly following the format
    【 $title 】 from 秋霜玉  for sc88Pro comp.ZUN
    The track titles should definitely not appear in this format in-game, but how do we get rid of this format without hardcoding either the names or the magic to parse the names out of this format? :thonk:
  3. The absolute state of GS SysEx tooling rears its ugly head one final time in three of the AST MIDIs, which for some reason are missing the Roland vendor prefix byte in all of their SysEx messages and are therefore undeniably bugged. There even seemed to be another SysEx-related bug which Romantique Tp explained away, but not this one:

    ssg_04.mid

    0:000	SysEx(   10 42 12 40 00 7F 00 41 F7)
    0:240	SysEx(   10 42 12 40 01 30 14 7B F7)
    0:360	SysEx(   10 42 12 40 01 33 14 78 F7)
    0:420	SysEx(   10 42 12 40 01 34 50 3B F7)

    ssg_05.mid

    0:000	SysEx(   10 42 12 40 00 7F 00 41 F7)
    0:240	SysEx(   10 42 12 40 01 30 14 7B F7)
    0:360	SysEx(   10 42 12 40 01 33 00 0C F7)
    0:420	SysEx(   10 42 12 40 01 34 14 77 F7)

    ssg_10.mid

    0:000	SysEx(   10 42 12 40 00 7F 00 41 F7)
    0:240	SysEx(   10 42 12 40 01 30 14 7B F7)
    0:360	SysEx(   10 42 12 40 01 33 00 0C F7)
    0:420	SysEx(   10 42 12 40 01 34 60 2B F7)

    ssg_04.mid

    0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)	GS Reset
    0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)	Reverb Macro #20
    0:360	SysEx(41 10 42 12 40 01 33 14 78 F7)	Reverb Level 20
    0:420	SysEx(41 10 42 12 40 01 34 50 3B F7)	Reverb Time 80

    ssg_05.mid

    0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)	GS Reset
    0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)	Reverb Macro #20
    0:360	SysEx(41 10 42 12 40 01 33 00 0C F7)	Reverb Level 0
    0:420	SysEx(41 10 42 12 40 01 34 14 77 F7)	Reverb Time 20

    ssg_10.mid

    0:000	SysEx(41 10 42 12 40 00 7F 00 41 F7)	GS Reset
    0:240	SysEx(41 10 42 12 40 01 30 14 7B F7)	Reverb Macro #20
    0:360	SysEx(41 10 42 12 40 01 33 00 0C F7)	Reverb Level 0
    0:420	SysEx(41 10 42 12 40 01 34 60 2B F7)	Reverb Time 96
    The irony of using invalid Reverb Macros within already invalid SysEx messages is not lost on me.

    This is something we should fix even before running these files through Sound Canvas VA in order to render these with the reverb settings that ZUN clearly (and, for once, unironically) intended.

  4. For perfect preservation of the original BGM/gameplay synchronicity, it makes sense for the waveform versions to retain the leading 1 or 2 beats of silence that the original MIDI files use for their SysEx setup. While some of the AST tracks use a slightly different tempo compared to their OST counterparts, they would still be largely in sync as ZUN didn't rearrange the layout of their setup area… except for, once again, the three tracks used in the Extra Stage. :zunpet: Marisa's and Reimu's boss themes aren't too bad with their 4 beats of setup, but シルクロードアリス takes the cake with a whopping 12 beats of leading silence. That's 5 seconds from the start of the Extra Stage to the first note you'd hear. 🐌

2) and 4) could theoretically be worked around in Shuusou Gyoku's MIDI code, but there's no way around editing the MIDI files themselves as far as 3) is concerned. Thus, it makes sense to apply all of the workarounds to the AST MIDIs as part of the BGM build process – parsing the titles out of the 【brackets】, inserting the Roland vendor prefix byte where necessary, and compressing the setup bars in the Extra Stage themes to match their OST counterparts. Adding any hidden magic to the MIDI code would only have needlessly increased complexity and/or annoyed some modder in the future who would then have to work around it.
Ideally, these edits would involve taking the mly dump output, performing the necessary replacements at a plaintext level, and rebuilding the result back into a MIDI file, bu~t we're unfortunately missing the latter feature. Luckily, someone else had the same idea 13 years ago and wrote a tool in C that does exactly what we need. Getting it to compile in 2024 only required fixing a typical C thing… why are students and boomers defending this antique of a language again? 🙄

The single most glaring issue, however, is the drastic difference in volume between the individual tracks in both soundtracks. While Romantique Tp had to normalize each track to the maximum possible volume individually as a consequence of the recording process, the Sound Canvas VA renderings reveal just how inconsistent the volume levels of these MIDI files really are:

The peak amplitudes of every track in both soundtracks, as rendered by Sound Canvas VA at maximum volume. Looking at these, you might think that kaorin's 2007 recordings were purposely trying to preserve the clipping that would come out of an SC-88Pro if you don't manually adjust the volume knob for each song, but those recordings are still much louder than even these numbers.

So how do we interpret this? Is this a bug, because no one in their right mind would want their music to clip on purpose, and that in turn means that everything about these volume levels is arbitrary and unintentional? Or is this a quirk, and ZUN deliberately chose these volume levels for compositional reasons? It certainly would make sense for the name registration theme.
Once again, the AST version of シルクロードアリス is the worst offender in this regard as well, but it might also provide some evidence for the quirk interpretation. The fact that almost all of its MIDI channels blast away at full volume might have been an accident that could have gone unnoticed if the volume knob of ZUN's SC-88Pro was turned rather low during the time he arranged this piece, but the excessive left-panning must have been deliberate. Even Romantique Tp agrees:

Stereo waveform of the Sound Canvas VA rendering of Shuusou Gyoku's Extra Stage theme (シルクロードアリス), highlighting the excessive left-panningStereo waveform of Romantique Tp's recording of Shuusou Gyoku's Extra Stage theme (シルクロードアリス), highlighting the excessive left-panning
It might have even made compositional sense if Silk Road Alice was supposed to be a "Western-style piece", but it's not. :zunpet:

And that's with the volume already normalized. Because this one channel of this one track is almost twice as loud as anything else in the AST, we would consequently have to bring down the volume of every other arranged track and the right channel of the same track by almost 50% if we wanted to maintain the volume differences between the individual tracks of the AST. In the process, we lose almost one entire bit of dynamic range. At this rate, you might even consider remixing and remastering the entire thing, but that would involve so many creative decisions to definitely fall into fanfiction territory…

However, normalizing each track to a peak level of 0 dBFS makes much more sense for in-game playback if you consider how loud Shuusou Gyoku's sound effects are. Once again, the best solution would involve offering both versions, but should we really add two more SCVA BGM packs just to cover volume differences? :thonk:
ReplayGain solves this exact problem for regular music listening in a non-destructive way by writing the per-track and per-album gain levels into an audio file's metadata. Since we need metadata support for titles anyway, we can do something similar, albeit not exactly the same for two reasons:

And so, we hard-apply the album-level gain during the conversion from 32-bit float to FLAC to preserve the volume differences between the tracks, calculate the track-level GAIN FACTOR based on the resulting peak levels, add a volume normalization toggle to the Sound / Config menu, enable it by default, and thus make everyone happy. ✅

The final interesting tidbit in building these packages can be found in the way the Sound Canvas VA recordings are looped. When manually cutting loops, you always have to consider that the intro might end with unique notes that aren't present at the end of the loop, which will still be fading out at the calculated loop start point. This necessitates shifting the loop start point by a few bars until these notes are no longer audible – or you could simply ignore the issue because ZUN's compositions are so frantic that no one would ever notice. :onricdennat:
With the separate intro and loop files generated by mly, on the other hand, the reverb/release trails are immediately visible and, after trimming trailing silence, exactly define the number of samples that the calculated loop start point needs to be shifted by. The .loop file then remains always exactly as long, in samples, as the duration of the loop reported by mly. If a piece happens to have a constant tempo whose beat duration corresponds to an integer number of samples, we get some very satisfying, round loop durations out of this process. ☺️


So let's play it all back in-game… and immediately run into two unexpected miniaudio limitations, what the…?!

  1. miniaudio uses a fixed linear function for its fade-out envelope, and doesn't offer anything else? We might not even want a logarithmic one this time because symmetry with MIDI's simple quadratic curve would be neat, but we sure don't want a linear function – those stay near the original volume for too long, and then turn quiet way too quickly.
  2. There is no way to access FLAC metadata from miniaudio's public API, even though the library bundles the author's own FLAC library which has this feature?

📝 Back when I evaluated miniaudio, I alluded that I consider single-file C libraries to be massively overrated, and this is exactly why: Once they grow as massive as miniaudio (how ironic), they can quickly lead to their authors treating their dependencies as implementation details and melting down the interfaces that would naturally arise. In a regular library, dr_flac would be a separate, proper dependency, and the API would have a way to initialize a stream from an externally loaded drflac object. But since the C community collectively pretends that multi-file libraries are a burden on other developers, miniaudio ended up with dr_flac copy-pasted into its giant single file, with a silly ma_ namespacing prefix added to all its functions. And why? Did we have to move so far in the other direction just because CMake doesn't support globbing? That's a symptom of CMake not actually solving any problem, not a valid architectural decision that libraries should bend around. 🙄
So unless we fork and hack around in miniaudio, there's now no way around depending on a second, regular copy of dr_flac. Which has now led to the same project organization bloat that single-file libraries originally set out to prevent…

Sigh. At this rate, it makes more sense to just copy-paste and adapt the old BGM streaming code I wrote for thcrap in late 2018, which used dr_flac directly, and extend it with metadata support. With the streaming code moved out of the platform layer and into game logic, it also makes much more sense to implement the squared fade-out curve at that same level instead of copy-pasting and adjusting an unhealthy amount of miniaudio's verbose C code.
While I'm doing the same for the old Vorbis streaming code, it would also make sense to rewrite that one to use stb_vorbis instead of the old libogg+libvorbis reference libraries. There's no need to add two more dependencies if miniaudio already comes with stb_vorbis.c, and that library is widely acclaimed. So, integration should be a breeze, right?
Well, surprise, rarely have I seen a C library so actively hostile toward being integrated. Both of its API variants are completely unreasonable:

What happened to the tried-and-true idea of providing a structure with read, tell, and seek callbacks, and then providing an optional variant for C FILE* handles if you absolutely must? Sure, the whole point of Vorbis is to be small and nobody these days would care about spending a few MB on keeping an entire Vorbis file in memory, but come on. If pulldata made the deliberate and opinionated choice to only support buffers of complete Vorbis streams and argued in the name of simplicity that hand-coded disk streaming isn't worth it in this day and age, I might have even been convinced. And this is from the guy who popularized the concept of single-file C libraries in the first place? :thonk:

Oh well, tupblocks go brrr. libvorbis definitely shows its age with all the old command-line tools in the lib/ directory that they never moved away and that we now have to remove from our glob. But even that just adds a single line to the Tupfile, and then we get to enjoy its much friendlier API. That sure beats the almost 800 lines of code that miniaudio had to write to integrate stb_vorbis… which I can't even link because the file is too big for GitHub. 🤷
At this point, it would have even made sense to upgrade from a 24-year-old lossy codec to an 11-year-old lossy codec and use Opus instead, since the enforced 48,000 Hz sampling rate is a non-issue when you control the entire audio pipeline. But let's keep compatibility with existing thcrap mods for now.

The last time I added dependencies, 📝 I wondered whether just downloading and extracting official Windows binary builds might be superior to pasting batch script duct tape over the usability issues of Git submodules. However, I still wanted to try out Git's sparse checkout feature before, in an attempt to remove all the unneeded bloat… and as it turned out, this might just be the idealistic and perfect nirvana of vendoring libraries in C++ projects. I particularly like how the limitations of its default mode (always checking out all files within each directory level that shows up in a filter) can be turned into a guideline about how to structure a repository: All non-essential stuff that consumers of your code might not need – tests, high-level documentation, or optional features – should go into a subdirectory where it can be easily filtered.
And that's how the size of our libs/ directory went down from 82.7 MiB in the P0256 build to 30.4 MiB in the P0275 build, despite adding 4 more libraries in the latter. Now if only this didn't require even more duct tape to actually set up shallow clones correctly

In the end, the Windows build ended up using only a single one of the miniaudio features that DirectSound doesn't have, and that's the ability to use the more modern WASAPI instead of DirectSound. We're still going to use miniaudio for the Linux port, but as far as Windows is concerned, it would be quite nice to backport BGM streaming to the game's original DirectSound backend. The P0275 build is pushing 1 MiB of binary size for a game that originally came in a 220 KiB binary, so it would remove a noticeable amount of bloat from GIAN07.EXE, but it would also allow waveform BGM to work in the Windows 98-compatible i586 build. If that sounds cool to you, this is the issue you want to fund.


That only left some logic and UI busywork to put it all together, which means that we've almost reached the end of things to talk about! Here's what it all looks like:



After half a year of being bought out way past the cap, I've finally got some small room left for new orders again. If it weren't for this blog post and the required research and web development work, this delivery would have probably come out in early January, taking half the time it ended up taking. So I really have to start factoring the blog posts into the push prices in a better and fairer way.
Meanwhile, the hate toward my day job only keeps growing, but there's little point in looking for a new one as long as ReC98 remains this motivating and complex. It leaves pretty much no cognitive room for any similarly demanding job. Thus, I want 2024 to be the year where ReC98 either becomes profitable enough to be my only full-time job, or where we conclusively find out that it can't, I go look for a better day job, and ReC98 shifts to a slower pace. Here's the plan:

With the new price of per push, this means that there's now a small window in which you can get a full push worth of functionality for , until the current cap is filled up again.

Next up: Probably TH02's endings to relax a bit. Maybe we're also getting some new Touhou-related contributions?

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous], iruleatgames
🏷️ Tags:

Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.

But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!

  1. The Music Room background masking effect
  2. The GRCG's plane disabling flags
  3. Text color restrictions
  4. The entire messy rest of the Music Room code
  5. TH04's partially consistent congratulation picture on Easy Mode
  6. TH02's boss position and damage variables

As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.

The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:

TH02's Music Room background in its on-disk state TH03's Music Room background in its on-disk state TH04's Music Room background in its on-disk state TH05's Music Room background in its on-disk state
Let's call this "the blank image".

That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right? :thonk:
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:

TH02's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH03's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH04's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH05's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom
The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.

On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.


Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:

// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);

// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);

// Regular GRCG blitting code to any VRAM segment…
pokeb(0xA8000, offset, …);

// We're done, turn off the GRCG.
outportb(0x7C, 0x00);

This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck.

Three overlapping Music Room polygons rendered using master.lib's grcg_polygon_c() function with a disabled GRCGThree overlapping Music Room polygons rendered as in the original game, with the GRCG enabled
An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.

As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:

The contents of nopoly_B with each game's first track selected.

Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:

And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier… :onricdennat:


For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, Try to Normal Rank!! could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".

With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌


The remaining 1/6th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.

Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there! Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.

Screenshot of TH02's Five Magic Stones, with the first two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the second two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the last one (both internally and in the order you fight them in) alive and activated

This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!

These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just 1/6th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.

Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous]
🏷️ Tags:

And once again, the Shuusou Gyoku task was too complex to be satisfyingly solved within a single month. Even just finding provably correct loop sections in both the original and arranged MIDI files required some rather involved detection algorithms. I could have just defined what sounded like correct loops, but the results of these algorithms were quite surprising indeed. Turns out that not even Seihou is safe from ZUN quirks, and some tracks technically loop much later than you'd think they do, or don't loop at all. And since I then wanted to put these MIDI loops back into the game to ensure perfect synchronization between the recordings and MIDI versions, I ended up rewriting basically all the MIDI code in a cross-platform way. This rewrite also uncovered a pbg bug that has traveled from Shuusou Gyoku into Windows Touhou, where it survived until ZUN ultimately removed all MIDI code in TH11 (!)

Fortunately, the backlog still had enough general PC-98 Touhou funds that I could spend on picking some soon-important low-hanging fruit, giving me something to deliver for the end of the month after all. TH04 and TH05 use almost identical code for their main/option menus, so decompiling it would make number go up quite significantly and the associated blog post won't be that long…

Wait, what's this, a bug report from touhou-memories concerning the website?

  1. Tab switchers tended to break on certain Firefox versions, and
  2. video playback didn't work on Microsoft Edge at all?

Those are definitely some high-priority bugs that demand immediate attention.

  1. Microsoft Edge's anti-support of AV1
  2. TH04/TH05's main/option menu
  3. TH04/TH05's first-launch sound setup menu
  4. TH05's title animation ☯️

The tab switcher issue was easily fixed by replacing the previous z-index trickery with a more robust solution involving the hidden attribute. The second one, however, is much more aggravating, because video playback on Edge has been broken ever since I 📝 switched the preferred video codec to AV1.
This goes so far beyond not supporting a specific codec. Usually, unsupported codecs aren't supposed to be an issue: As soon as you start using the HTML <video> tag, you'll learn that not every browser supports all codecs. And so you set up an encoding pipeline to serve each video in a mix of new and ancient formats, put the <source> tag of the most preferred codec first, and rest assured that browsers will fall back on the best-supported option as necessary. Except that Edge doesn't even try, and insists on staying on a non-playing AV1 video. 🙄

The codecs parameter for the <source> type attribute was the first potential solution I came across. Specifying the video codec down to the finest encoding details right in the HTML markup sounds like a good idea, similar to specifying sizes of images and videos to prevent layout reflows on long pages during the initial page load. So why was this the first time I heard of this feature? The fact that there isn't a simple ffprobe -show_html_codecs_string command to retrieve this string might already give a clue about how useful it is in practice. Instead, you have to manually piece the string together by grepping your way through all of a video's metadata
…and then it still doesn't change anything about Edge's behavior, even when also specifying the string for the VP9 and VP8 sources. Calling the infamously ridiculous HTMLMediaElement.canPlayType() method with a representative parameter of "video/webm; codecs=av01.1.04M.08.0.000.01.13.00.0" explains why: Both the AV1-supporting Chrome and Edge return "probably", but only the former can actually play this format. 🤦

But wait, there is an AV1 video extension in the Microsoft Store that would add support to any unspecified favorite video app. Except that it stopped working inside Edge as of version 116. And even if it did: If you can't query the presence of this extension via JavaScript, it might as well not exist at all.
Not to mention that the favorite video app part is obviously a lie as a lot of widely preferred Windows video apps are bundled with their own codecs, and have probably long supported AV1.

In the end, there's no way around the utter desperation move of removing the AV1 <source> for Edge users. Serving each video in two other formats means that we can at least do something here – try visiting the GitHub release page of the P0234-1 TH01 Anniversary Edition build in Edge and you also don't get to see anything, because that video uses AV1 and GitHub understandably doesn't re-encode every uploaded video into a variety of old formats.
Just for comparison, I tried both that page and the ReC98 blog on an old Android 6 phone from 2014, and even that phone picked and played the AV1 videos with the latest available Chrome and Firefox versions. This was the phone whose available Firefox version didn't support VP9 in 2019, which was my initial reason for adding the VP8 versions. Looks like it's finally time to drop those… 🤔 Maybe in the far future once I start running out of space on this server.

Removing the <source> tags can be done in one of two places:

  1. server-side, detecting Edge via the User-Agent header, or
  2. client-side, using navigator.userAgentData.brands.

I went with 2) because more dynamic server-side code would only move us further away from static site generation, which would make a lot of sense as the next evolutionary step in the architecture of this website. The client-side solution is much simpler too, and we can defer the deletion until a user actually hovers over a specific video.
And while we're at it, let's also add a popup complaining about this whole state of affairs. Edge is heavily marketed inside Windows as "the modern browser recommended by Microsoft", and you sure wouldn't expect low-quality chroma-subsampled VP9 from such a tagline. With such a level of anti-support for AV1, Edge users deserve to know exactly what's going on, especially since this post also explains what they will encounter on other websites.

A popup on top of a ReC98 blog video, showing the caption "⚠️ Edge does not support AV1, falling back on low-quality video…"
That's the polite way of putting it.

Alright, where was I? For TH01, the main menu was the last thing I decompiled before the 100% finalization mark, so it's rather anticlimactic to already cover the TH04/TH05 one now, with both of the games still being very far away from 100%, just because people will soon want to translate the description text in the bottom-right corner of the screen. But then again, the ZUN Soft logo animation would make for an even nicer final piece of decompiled code, especially since the bouncing-ball logo from TH01, TH02, and TH03 was the very first decompilation I did, all the way back in 2015.

The code quality of ZUN's VRAM-based menus has barely increased between TH01 and TH05. Both the top-level and option menu still need to know the bounding rectangle of the other one to unblit the right pixels when switching between the two. And since ZUN sure loved hardcoded and copy-pasted numbers in the PC-98 days, the coordinates both tend to be excessively large, and excessively wrong. :zunpet: Luckily, each menu item comes with its own correct unblitting rectangle, which avoids any graphical glitches that would otherwise occur.
As for actual observable quirks and bugs, these menus only contain one of each, and both are exclusive to TH04:

And yes, these videos do have a frame rate of 2 FPS.

Now that 100% finalization of their OP.EXE binaries is within reach, all this bloat made me think about the viability of a 📝 single-executable build for TH04's and TH05's debloated and anniversary versions. It would be really nice to have such a build ready before I start working on the non-ASCII translations – not just because they will be based on the anniversary branch by default, but also because it would significantly help their development if there are 4 fewer executables to worry about.
However, it's not as simple for these games as it was for TH01. The unique code in their OP.EXE and MAINE.EXE binaries is much larger than Borland's easily removed C++ exception handler, so I'd have to remove a lot more bloat to keep the resulting single binary at or below the size of the original MAIN.EXE. But I'm sure going to try.


Speaking of code that can be debloated for great effect: The second push of this delivery focused on the first-launch sound setup menu, whose BGM and sound effect submenus are almost complete code duplicates of each other. The debloated branch could easily remove more than half of the code in there, yielding another ≈800 bytes in case we need them.
If hex-editing MIKO.CFG is more convenient for you than deleting that file, you can set its first byte to FF to re-trigger this menu. Decompiling this screen was not only relevant now because it contains text rendered with font ROM glyphs and it would help dig our way towards more important strings in the data segment, but also because of its visual style. I can imagine many potential mods that might want to use the same backgrounds and box graphics for their menus.

TH04's first-launch sound setup menu, showing the BGM mode selectionTH05's first-launch sound setup menu, showing the sound effect mode selection
How about an initial language selection menu in the same style?

With the two submenus being shown in a fixed sequence, there's not a lot of room for the code to do anything wrong, and it's even more identical between the two games than the main menu already was. Thankfully, ZUN just reblits the respective options in the new color when moving the cursor, with no 📝 palette tricks. TH04's background image only uses 7 colors, so he could have easily reserved 3 colors for that. In exchange, the TH05 image gets to use the full 16 colors with no change to the code.


Rounding out this delivery, we also got TH05's rolling Yin-Yang Orb animation before the title screen… and it's just more bloat and landmines on a smaller scale that might be noticeable on slower PC-98 models. In total, there are three unnecessary inter-page copies of the entire VRAM that can easily insert lag frames, and two minor page-switching landmines that can potentially lead to tearing on the first frame of the roll or fade animation. Clearly, ZUN did not have smoothness or code quality in mind there, as evidenced by the fact that this animation simply displays 8 .PI files in sequence. But hey, a short animation like this is 📝 another perfectly appropriate place for a quick-and-dirty solution if you develop with a deadline.
And that's 1.30% of all PC-98 Touhou code finalized in two pushes! We're slowly running out of these big shared pieces of ASM code…

I've been neglecting TH03's OP.EXE quite a bit since it simply doesn't contain any translatable plaintext outside the Music Room. All menu labels are gaiji, and even the character selection menu displays its monochrome character names using the 4-plane sprites from CHNAME.BFT. Splitting off half of its data into a separate .ASM file was more akin to getting out a jackhammer to free up the room in front of the third remaining Music Room, but now we're there, and I can decompile all three of them in a natural way, with all referenced data.
Next up, therefore: Doing just that, securing another important piece of text for the upcoming non-ASCII translations and delivering another big piece of easily finalized code. I'm going to work full-time on ReC98 for almost all of December, and delivering that and the Shuusou Gyoku SC-88Pro recording BGM back-to-back should free up about half of the slightly higher cap for this month.

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous]
🏷️ Tags:

TH03 finally passed 20% RE, and the newly decompiled code contains no serious ZUN bugs! What a nice way to end the year.

There's only a single unlockable feature in TH03: Chiyuri and Yumemi as playable characters, unlocked after a 1CC on any difficulty. Just like the Extra Stages in TH04 and TH05, YUME.NEM contains a single designated variable for this unlocked feature, making it trivial to craft a fully unlocked score file without recording any high scores that others would have to compete against. So, we can now put together a complete set for all PC-98 Touhou games: 2021-12-27-Fully-unlocked-clean-score-files.zip It would have been cool to set the randomly generated encryption keys in these files to a fixed value so that they cancel out and end up not actually encrypting the file. Too bad that TH03 also started feeding each encrypted byte back into its stream cipher, which makes this impossible.

The main loading and saving code turned out to be the second-cleanest implementation of a score file format in PC-98 Touhou, just behind TH02. Only two of the YUME.NEM functions come with nonsensical differences between OP.EXE and MAINL.EXE, rather than 📝 all of them, as in TH01 or 📝 too many of them, as in TH04 and TH05. As for the rest of the per-difficulty structure though… well, it quickly becomes clear why this was the final score file format to be RE'd. The name, score, and stage fields are directly stored in terms of the internal REGI*.BFT sprite IDs used on the high score screen. TH03 also stores 10 score digits for each place rather than the 9 possible ones, keeps any leading 0 digits, and stores the letters of entered names in reverse order… yeah, let's decompile the high score screen as well, for a full understanding of why ZUN might have done all that. (Answer: For no reason at all. :zunpet:)


And wow, what a breath of fresh air. It's surely not good-code: The overlapping shadows resulting from using a 24-pixel letterspacing with 32-pixel glyphs in the name column led ZUN to do quite a lot of unnecessary and slightly confusing rendering work when moving the cursor back and forth, and he even forgot about the EGC there. But it's nowhere close to the level of jank we saw in 📝 TH01's high score menu last year. Good to see that ZUN had learned a thing or two by his third game – especially when it comes to storing the character map cursor in terms of a character ID, and improving the layout of the character map:

The alphabet available for TH03 high score names.

That's almost a nicely regular grid there. With the question mark and the double-wide SP, BS, and END options, the cursor movement code only comes with a reasonable two exceptions, which are easily handled. And while I didn't get this screen completely decompiled, one additional push was enough to cover all important code there.

The only potential glitch on this screen is a result of ZUN's continued use of binary-coded decimal digits without any bounds check or cap. Like the in-game HUD score display in TH04 and TH05, TH03's high score screen simply uses the next glyph in the character set for the most significant digit of any score above 1,000,000,000 points – in this case, the period. Still, it only really gets bad at 8,000,000,000 points: Once the glyphs are exhausted, the blitting function ends up accessing garbage data and filling the entire screen with garbage pixels. For comparison though, the current world record is 133,650,710 points, so good luck getting 8 billion in the first place.

Next up: Starting 2022 with the long-awaited decompilation of TH01's Sariel fight! Due to the 📝 recent price increase, we now got a window in the cap that is going to remain open until tomorrow, providing an early opportunity to set a new priority after Sariel is done.

📝 Posted:
💰 Funded by:
Ember2528
🏷️ Tags:

OK, TH01 missile bullets. Can we maybe have a well-behaved entity type, without any weirdness? Just once?

Ehh, kinda. Apart from another 150 bytes wasted on unused structure members, this code is indeed more on the low end in terms of overall jank. It does become very obvious why dodging these missiles in the YuugenMagan, Mima, and Elis fights feels so awful though: An unfair 46×46 pixel hitbox around Reimu's center pixel, combined with the comeback of 📝 interlaced rendering, this time in every stage. ZUN probably did this because missiles are the only 16×16 sprite in TH01 that is blitted to unaligned X positions, which effectively ends up touching a 32×16 area of VRAM per sprite.
But even if we assume VRAM writes to be the bottleneck here, it would have been totally possible to render every missile in every frame at roughly the same amount of CPU time that the original game uses for interlaced rendering:

That's an optimization that would have significantly benefitted the game, in contrast to all of the fake ones introduced in later games. Then again, this optimization is actually something that the later games do, and it might have in fact been necessary to achieve their higher bullet counts without significant slowdown.

Unfortunately, it was only worth decompiling half of the missile code right now, thanks to gratuitous FPU usage in the other half, where 📝 double variables are compared to float literals. That one will have to wait 📝 until after SinGyoku.


After some effectively unused Mima sprite effect code that is so broken that it's impossible to make sense out of it, we get to the final feature I wanted to cover for all bosses in parallel before returning to Sariel: The separate sprite background storage for moving or animated boss sprites in the Mima, Elis, and Sariel fights. But, uh… why is this necessary to begin with? Doesn't TH01 already reserve the other VRAM page for backgrounds?
Well, these sprites are quite big, and ZUN didn't want to blit them from main memory on every frame. After all, TH01 and TH02 had a minimum required clock speed of 33 MHz, half of the speed required for the later three games. So, he simply blitted these boss sprites to both VRAM pages, leading the usual unblitting calls to only remove the other sprites on top of the boss. However, these bosses themselves want to move across the screen… and this makes it necessary to save the stage background behind them in some other way.

Enter .PTN, and its functions to capture a 16×16 or 32×32 square from VRAM into a sprite slot. No problem with that approach in theory, as the size of all these bigger sprites is a multiple of 32×32; splitting a larger sprite into these smaller 32×32 chunks makes the code look just a little bit clumsy (and, of course, slower).
But somewhere during the development of Mima's fight, ZUN apparently forgot that those sprite backgrounds existed. And once Mima's 🚫 casting sprite is blitted on top of her regular sprite, using just regular sprite transparency, she ends up with her infamous third arm:

TH01 Mima's third arm

Ironically, there's an unused code path in Mima's unblit function where ZUN assumes a height of 48 pixels for Mima's animation sprites rather than the actual 64. This leads to even clumsier .PTN function calls for the bottom 128×16 pixels… Failing to unblit the bottom 16 pixels would have also yielded that third arm, although it wouldn't have looked as natural. Still wouldn't say that it was intentional; maybe this casting sprite was just added pretty late in the game's development?


So, mission accomplished, Sariel unblocked… at 2¼ pushes. :thonk: That's quite some time left for some smaller stage initialization code, which bundles a bunch of random function calls in places where they logically really don't belong. The stage opening animation then adds a bunch of VRAM inter-page copies that are not only redundant but can't even be understood without knowing the hidden internal state of the last VRAM page accessed by previous ZUN code…
In better news though: Turbo C++ 4.0 really doesn't seem to have any complexity limit on inlining arithmetic expressions, as long as they only operate on compile-time constants. That's how we get macro-free, compile-time Shift-JIS to JIS X 0208 conversion of the individual code points in the 東方★靈異伝 string, in a compiler from 1994. As long as you don't store any intermediate results in variables, that is… :tannedcirno:

But wait, there's more! With still ¼ of a push left, I also went for the boss defeat animation, which includes the route selection after the SinGyoku fight.
As in all other instances, the 2× scaled font is accomplished by first rendering the text at regular 1× resolution to the other, invisible VRAM page, and then scaled from there to the visible one. However, the route selection is unique in that its scaled text is both drawn transparently on top of the stage background (not onto a black one), and can also change colors depending on the selection. It would have been no problem to unblit and reblit the text by rendering the 1× version to a position on the invisible VRAM page that isn't covered by the 2× version on the visible one, but ZUN (needlessly) clears the invisible page before rendering any text. :zunpet: Instead, he assigned a separate VRAM color for both the 魔界 and 地獄 options, and only changed the palette value for these colors to white or gray, depending on the correct selection. This is another one of the 📝 rare cases where TH01 demonstrates good use of PC-98 hardware, as the 魔界へ and 地獄へ strings don't need to be reblitted during the selection process, only the Orb "cursor" does.

Then, why does this still not count as good-code? When changing palette colors, you kinda need to be aware of everything else that can possibly be on screen, which colors are used there, and which aren't and can therefore be used for such an effect without affecting other sprites. In this case, well… hover over the image below, and notice how Reimu's hair and the bomb sprites in the HUD light up when Makai is selected:

Demonstration of palette changes in TH01's route selection

This push did end on a high note though, with the generic, non-SinGyoku version of the defeat animation being an easily parametrizable copy. And that's how you decompile another 2.58% of TH01 in just slightly over three pushes.


Now, we're not only ready to decompile Sariel, but also Kikuri, Elis, and SinGyoku without needing any more detours into non-boss code. Thanks to the current TH01 funding subscriptions, I can plan to cover most, if not all, of Sariel in a single push series, but the currently 3 pending pushes probably won't suffice for Sariel's 8.10% of all remaining code in TH01. We've got quite a lot of not specifically TH01-related funds in the backlog to pass the time though.

Due to recent developments, it actually makes quite a lot of sense to take a break from TH01: spaztron64 has managed what every Touhou download site so far has failed to do: Bundling all 5 game onto a single .HDI together with pre-configured PC-98 emulators and a nice boot menu, and hosting the resulting package on a proper website. While this first release is already quite good (and much better than my attempt from 2014), there is still a bit of room for improvement to be gained from specific ReC98 research. Next up, therefore:

📝 Posted:
💰 Funded by:
[Anonymous]
🏷️ Tags:

Alright, no more big code maintenance tasks that absolutely need to be done right now. Time to really focus on parts 6 and 7 of repaying technical debt, right? Except that we don't get to speed up just yet, as TH05's barely decompilable PMD file loading function is rather… complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98 Touhou, I first consult the disassembly of Wolfenstein 3D. That game was originally compiled with the quite similar Borland C++ 3.0, so it's quite helpful to compare its ASM to the officially released source code. If I find the instructions in question, they mostly come from that game's ASM code, leading to the amusing realization that "even John Carmack was unable to get these instructions out of this compiler" :onricdennat: This time though, Wolfenstein 3D did point me to Borland's intrinsics for common C functions like memcpy() and strchr(), available via #pragma intrinsic. Bu~t those unfortunately still generate worse code than what ZUN micro-optimized here. Commenting how these sequences of instructions should look in C is unfortunately all I could do here.
The conditional branches in this function did compile quite nicely though, clarifying the control flow, and clearly exposing a ZUN bug: TH05's snd_load() will hang in an infinite loop when trying to load a non-existing -86 BGM file (with a .M2 extension) if the corresponding -26 BGM file (with a .M extension) doesn't exist either.

Unsurprisingly, the PMD channel monitoring code in TH05's Music Room remains undecompilable outside the two most "high-level" initialization and rendering functions. And it's not because there's data in the middle of the code segment – that would have actually been possible with some #pragmas to ensure that the data and code segments have the same name. As soon as the SI and DI registers are referenced anywhere, Turbo C++ insists on emitting prolog code to save these on the stack at the beginning of the function, and epilog code to restore them from there before returning. Found that out in September 2019, and confirmed that there's no way around it. All the small helper functions here are quite simply too optimized, throwing away any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate that I do try.


Within that same 6th push though, we've finally reached the one function in TH05 that was blocking further progress in TH04, allowing that game to finally catch up with the others in terms of separated translation units. Feels good to finally delete more of those .ASM files we've decompiled a while ago… finally!

But since that was just getting started, the most satisfying development in both of these pushes actually came from some more experiments with macros and inline functions for near-ASM code. By adding "unused" dummy parameters for all relevant registers, the exact input registers are made more explicit, which might help future port authors who then maybe wouldn't have to look them up in an x86 instruction reference quite as often. At its best, this even allows us to declare certain functions with the __fastcall convention and express their parameter lists as regular C, with no additional pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even more amazing than previously thought when it comes to returning pseudo-registers from inline functions. A nice example for how this can improve readability can be found in this piece of TH02 code for polling the PC-98 keyboard state using a BIOS interrupt:

inline uint8_t keygroup_sense(uint8_t group) {
	_AL = group;
	_AH = 0x04;
	geninterrupt(0x18);
	// This turns the output register of this BIOS call into the return value
	// of this function. Surprisingly enough, this does *not* naively generate
	// the `MOV AL, AH` instruction you might expect here!
	return _AH;
}

void input_sense(void)
{
	// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
	// never emits as such, giving us only the three instructions we need.
	_AH = keygroup_sense(8);

	// Whereas this one gives us the one additional `MOV BH, AH` instruction
	// we'd expect, and nothing more.
	_BH = keygroup_sense(7);

	// And now it's obvious what both of these registers contain, from just
	// the assignments above.
	if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
		key_det |= INPUT_UP;
	}
	// […]
}

I love it. No inline assembly, as close to idiomatic C code as something like this is going to get, yet still compiling into the minimum possible number of x86 instructions on even a 1994 compiler. This is how I keep this project interesting for myself during chores like these. :tannedcirno: We might have even reached peak inline already?

And that's 65% of technical debt in the SHARED segment repaid so far. Next up: Two more of these, which might already complete that segment? Finally!

📝 Posted:
💰 Funded by:
Blue Bolt, [Anonymous]
🏷️ Tags:

Turns out that TH04's player selection menu is exactly three times as complicated as TH05's. Two screens for character and shot type rather than one, and a way more intricate implementation for saving and restoring the background behind the raised top and left edges of a character picture when moving the cursor between Reimu and Marisa. TH04 decides to backup precisely only the two 256×8 (top) and 8×244 (left) strips behind the edges, indicated in red in the picture below.

Backed-up VRAM area in TH04's player character selection

These take up just 4 KB of heap memory… but require custom blitting functions, and expanding this explicitly hardcoded approach to TH05's 4 characters would have been pretty annoying. So, rather than, uh, not explicitly hardcoding it all, ZUN decided to just be lazy with the backup area in TH05, saving the entire 640×400 screen, and thus spending 128 KB of heap memory on this rather simple selection shadow effect. :zunpet:


So, this really wasn't something to quickly get done during the first half of a push, even after already having done TH05's equivalent of this menu. But since life is very busy right now, I also used the occasion to start addressing another code organization annoyance: master.lib's single master.h header file.

So, time to start a new master.hpp header that would contain just the declarations from master.h that PC-98 Touhou actually needs, plus some semantic (yes, semantic) sugar. Comparing just the old master.h to just the new master.hpp after roughly 60% of the transition has been completed, we get median build times of 319 ms for master.h, and 144 ms for master.hpp on my (admittedly rather slow) DOSBox setup. Nice!
As of this push, ReC98 consists of 107 translation units that have to be compiled with Turbo C++ 4.0J. Fully rebuilding all of these currently takes roughly 37.5 seconds in DOSBox. After the transition to master.hpp is done, we could therefore shave some 10 to 15 seconds off this time, simply by switching header files. And that's just the beginning, as this will also pave the way for further #include optimizations. Life in this codebase will be great!


Unfortunately, there wasn't enough time to repay some of the actual technical debt I was looking forward to, after all of this. Oh well, at least we now also have nice identifiers for the three different boldface options that are used when rendering text to VRAM, after procrastinating that issue for almost 11 months. Next up, assuming the existing subscriptions: More ridiculous decompilations of things that definitely weren't originally written in C, and a big blocker in TH03's MAIN.EXE.

📝 Posted:
💰 Funded by:
[Anonymous], -Tom-
🏷️ Tags:

So, TH05 OP.EXE. The first half of this push started out nicely, with an easy decompilation of the entire player character selection menu. Typical ZUN quality, with not much to say about it. While the overall function structure is identical to its TH04 counterpart, the two games only really share small snippets inside these functions, and do need to be RE'd separately.

The high score viewing (not registration) menu would have been next. Unfortunately, it calls one of the GENSOU.SCR loading functions… which are all a complete mess that still needed to be sorted out first. 5 distinct functions in 6 binaries, and of course TH05 also micro-optimized its MAIN.EXE version to directly use the DOS INT 21h file loading API instead of master.lib's wrappers. Could have all been avoided with a single method on the score data structure, taking a player character ID and a difficulty level as parameters…

So, no score menu in this push then. Looking at the other end of the ASM code though, we find the starting functions for the main game, the Extra Stage, and the demo replays, which did fit perfectly to round out this push.

Which is where we find an easter egg! 🥚 If you've ever looked into 怪綺談2.DAT, you might have noticed 6 .REC files with replays for the Demo Play mode. However, the game only ever seems to cycle between 4 replays. So what's in the other two, and why are they 40 KB instead of just 10 KB like the others? Turns out that they combine into a full Extra Stage Clear replay with Mima, with 3 bombs and 1 death, obviously recorded by ZUN himself. The split into two files for the stage (DEMO4.REC) and boss (DEMO5.REC) portion is merely an attempt to limit the amount of simultaneously allocated heap memory.
To watch this replay without modding the game, unlock the Extra Stage with all 4 characters, then hold both the ⬅️ left and ➡️ right arrow keys in the main menu while waiting for the usual demo replay. I can't possibly be the first one to discover this, but I couldn't find any other mention of it.
Edit (2021-03-15): ZUN did in fact document this replay in Section 6 of TH05's OMAKE.TXT, along with the exact method to view it. Thanks to Popfan for the discovery!

Here's a recording of the whole replay:

Note how the boss dialogue is skipped. MAIN.EXE actually contains no less than 6 if() branches just to distinguish this overly long replay from the regular ones.


I'd really like to do the TH04 and TH05 main menus in parallel, since we can expect a bit more shared code after all the initial differences. Therefore, I'm going to put the next "anything" push towards covering the TH04 version of those functions. Next up though, it's back to TH01, with more redundant image format code…

📝 Posted:
💰 Funded by:
Lmocinemod, Blue Bolt, [Anonymous]
🏷️ Tags:

Finally, after a long while, we've got two pushes with barely anything to talk about! Continuing the road towards 100% PI for TH05, these were exactly the two pushes that TH05 MAINE.EXE PI was estimated to additionally cost, relative to TH04's. Consequently, they mostly went to TH05's unique data structures in the ending cutscenes, the score name registration menu, and the staff roll.

A unique feature in there is TH05's support for automatic text color changes in its ending scripts, based on the first full-width Shift-JIS codepoint in a line. The \c=codepoint,color commands at the top of the _ED??.TXT set up exactly this codepoint→color mapping. As far as I can tell, TH05 is the only Touhou game with a feature like this – even the Windows Touhou games went back to manually spelling out each color change.

The orb particles in TH05's staff roll also try to be a bit unique by using 32-bit X and Y subpixel variables for their current position. With still just 4 fractional bits, I can't really tell yet whether the extended range was actually necessary. Maybe due to how the "camera scrolling" through "space" was implemented? All other entities were pretty much the usual fare, though.
12.4, 4.4, and now a 28.4 fixed-point format… yup, 📝 C++ templates were definitely the right choice.

At the end of its staff roll, TH05 not only displays the usual performance verdict, but then scrolls in the scores at the end of each stage before switching to the high score menu. The simplest way to smoothly scroll between two full screens on a PC-98 involves a separate bitmap… which is exactly what TH05 does here, reserving 28,160 bytes of its global data segment for just one overly large monochrome 320×704 bitmap where both the screens are rendered to. That's… one benefit of splitting your game into multiple executables, I guess? :tannedcirno:
Not sure if it's common knowledge that you can actually scroll back and forth between the two screens with the Up and Down keys before moving to the score menu. I surely didn't know that before. But it makes sense – might as well get the most out of that memory.


The necessary groundwork for all of this may have actually made TH04's (yes, TH04's) MAINE.EXE technically position-independent. Didn't quite reach the same goal for TH05's – but what we did reach is ⅔ of all PC-98 Touhou code now being position-independent! Next up: Celebrating even more milestones, as -Tom- is about to finish development on his TH05 MAIN.EXE PI demo…

📝 Posted:
💰 Funded by:
Yanga, Ember2528
🏷️ Tags:

Three pushes to decompile the TH01 high score menu… because it's completely terrible, and needlessly complicated in pretty much every aspect:

In the end, I just gave up with my usual redundancy reduction efforts for this one. Anyone wanting to change TH01's high score name entering code would be better off just rewriting the entire thing properly.

And that's all of the shared code in TH01! Both OP.EXE and FUUIN.EXE are now only missing the actual main menu and ending code, respectively. Next up, though: The long awaited TH01 PI push. Which will not only deliver 100% PI for OP.EXE and FUUIN.EXE, but also probably quite some gains in REIIDEN.EXE. With now over 30% of the game decompiled, it's about time we get to look at some gameplay code!

📝 Posted:
💰 Funded by:
Yanga, Ember2528
🏷️ Tags:

Back to TH01, and its high score menu… oh, wait, that one will eventually involve keyboard input. And thanks to the generous TH01 funding situation, there's really no reason not to cover that right now. After all, TH01 is the last game where input still hadn't been RE'd.
But first, let's also cover that one unused blitting function, together with REIIDEN.CFG loading and saving, which are in front of the input function in OP.EXE… (By now, we all know about the hidden start bomb configuration, right?)

Unsurprisingly, the earliest game also implements input in the messiest way, with a different function for each of the three executables. "Because they all react differently to keyboard inputs :zunpet:", apparently? OP.EXE even has two functions for it, one for the START / CONTINUE / OPTION / QUIT main menu, and one for both Option and Music Test menus, both of which directly perform the ring arithmetic on the menu cursor variable. A consistent separation of keyboard polling from input processing apparently wasn't all too obvious of a thought, since it's only truly done from TH02 on.

This lack of proper architecture becomes actually hilarious once you notice that it did in fact facilitate a recursion bug! :godzun: In case you've been living under a rock for the past 8 years, TH01 shipped with debugging features, which you can enter by running the game via game d from the DOS prompt. These features include a memory info screen, shown when pressing PgUp, implemented as one blocking function (test_mem()) called directly in response to the pressed key inside the polling function. test_mem() only returns once that screen is left by pressing PgDown. And in order to poll input… it directly calls back into the same polling function that called it in the first place, after a 3-frame delay.

Which means that this screen is actually re-entered for every 3 frames that the PgUp key is being held. And yes, you can, of course, also crash the system via a stack overflow this way by holding down PgUp for a few seconds, if that's your thing.
Edit (2020-09-17): Here's a video from spaztron64, showing off this exact stack overflow crash while running under the VEM486 memory manager, which displays additional information about these sorts of crashes:

What makes this even funnier is that the code actually tracks the last state of every polled key, to prevent exactly that sort of bug. But the copy-pasted assignment of the last input state is only done after test_mem() already returned, making it effectively pointless for PgUp. It does work as intended for PgDown… and that's why you have to actually press and release this key once for every call to test_mem() in order to actually get back into the game. Even though a single call to PgDown will already show the game screen again.

In maybe more relevant news though, this function also came with what can be considered the first piece of actual gameplay logic! Bombing via double-tapping the Z and X keys is also handled here, and now we know that both keys simply have to be tapped twice within a window of 20 frames. They are tracked independently from each other, so you don't necessarily have to press them simultaneously.
In debug mode, the bomb count tracks precisely this window of time. That's why it only resets back to 0 when pressing Z or X if it's ≥20.

Sure, TH01's code is expectedly terrible and messy. But compared to the micro-optimizations of TH04 and TH05, it's an absolute joy to work on, and opening all these ZUN bug loot boxes is just the icing on the cake. Looking forward to more of the high score menu in the next pushes!

📝 Posted:
💰 Funded by:
Touhou Patch Center
🏷️ Tags:

🎉 TH04's and TH05's OP.EXE are now fully position-independent! 🎉

What does this mean?

You can now add any data or code to the main menus of the two games, by simply editing the ReC98 source, writing your mod in ASM or C/C++, and recompiling the code. Since all absolute memory addresses have now been converted to labels, this will work without causing any instability. See the position independence section in the FAQ for a more thorough explanation about why this was a problem.

What does this not mean?

The original ZUN code hasn't been completely reverse-engineered yet, let alone decompiled. Pretty much all of that is still ASM, which might make modding a bit inconvenient right now.

Since this push was otherwise pretty unremarkable, I made a video demonstrating a few basic things you can do with this:

Now, what to do for the last outstanding Touhou Patch Center push? Bullets, or resident structures? :thonk: