Blog

📝 Posted:
💰 Funded by:
[Anonymous], Ember2528
🏷️ Tags:

Part 4! Let's get this over with, fix all these landmines, and conclude 📝 this 4-post series about the big 2025 PC-98 Touhou portability subproject. Unless I find something big in TH03, this is likely to be the final long blog post for this year. I wanna code again!

  1. The scope
  2. Retaining menu backgrounds in conventional RAM
  3. Resolving screen tearing landmines
  4. Intermission: Handling unused code on fork branches
  5. Merging TH02-TH05's OP.EXE and MAINE.EXE/MAINL.EXE
  6. Replicating TH02-TH05's GAME.BAT in C++
  7. Topping it off with an actual feature

As you can already tell by this table of contents, this "initial" cleanup work was quite larger in scope than its counterpart for 📝 the first TH01 Anniversary Edition release. Even that already took unexpectedly long 2½ years ago, and now imagine doing that across four games simultaneously while keeping all the little required inconsistencies in place. Then you'll get why this has taken over four months…
With an overall goal of "general portability", it's very tempting to escalate the scope towards covering everything in these menu and cutscene binaries. So I had to draw at least some boundaries:

Even then, this was way premature. Not only because we still need to maintain the memory layout of TH02's and TH03's MAIN.EXE, but also because of all the undecompilable ASM code in all four games that blocks certain architectural simplifications.
The biggest problem, however: I haven't quite decided on how to use static libraries within my build environment yet. Since Turbo C++ 4.0J's linker just blindly includes every explicitly named object file without eliminating dead code, static libraries are essential for reducing bloat by providing a layer of optional files to be included on demand. However, the Windows and DOS versions of TLIB are easily confused, TLIB's usual paradigm of mutating existing library files goes against Tup's explicit dependency graph, and should we really depend on an ancient proprietary tool for a job that I could reimplement in a few hundred lines? Famous last words, I know. But since I 📝 didn't want to do any dedicated build system work this year, I also didn't want to sort out these questions in a 12th or even 13th push. Leaving the build environment woefully ill-equipped for the complexity of this task was probably a mistake; while the resulting workaround of feature bundles does the job, it's very silly and hopefully won't last very long. I'm definitely going to spend some time sorting out the static library situation before I ever attempt something like this again. Or at some general point before we hit the overall 100% finalization mark, because we've still got that long-awaited librarization of ZUN's master.lib fork ahead of us.


Let's get to it then, starting with the feature that will remove lag in menus by removing PC-98-specific page-flipping and EGC code:

Retaining menu backgrounds in conventional RAM

At first, this seems to be no problem. We just swap out master.lib's .PI functions with our forked PiLoad and our generic blitter, and make sure to keep the images allocated. master.lib's graph_pi_load_pack() has always loaded .PI images into one big contiguous buffer in conventional RAM, so this shouldn't negatively affect the heap layout. If anything, we'd be saving memory by 📝 not allocating these extra two rows, right?
Unfortunately, it's that second goal that would turn out to be a massive problem. 📝 The end of part 1 already hinted at how the majority of menu backgrounds are only rendered to VRAM a single time before ZUN immediately frees them from memory. These cases are so common that I defined a macro for them:

#define pi_fullres_load_palette_apply_put_free(slot, fn) { \
	pi_load(slot, fn); \
	pi_palette_apply(slot); \
	pi_put_8(0, 0, slot); \
	pi_free(slot); \
}
At the current state of decompilation, this macro is used 16 times across TH02-TH05, and it will appear an additional 12 times by the time decompilation is done.

In these cases, the games only need that single 128 KiB block temporarily, and then get to reuse that memory for other, more dynamic graphics. Consequently, ZUN probably dimensioned the master.lib heap sizes for TH02-TH05 to leave ample headroom with this fact in mind. I wasn't so sure about deliberately limiting the amount of heap memory 📝 in late 2021 when I fixed the one out-of-memory landmine that remained in TH04, but I've begun to appreciate these memory limits quite a lot as the scope of my research has deepened. Specify the right amount of bytes, perform the single allocation from the DOS heap at startup, and if that allocation succeeds, you've removed an entire class of out-of-memory bugs from consideration. Sure, modders might prefer mem_assign_all() for simplicity during development, but it does make sense to return to a static limit when shipping. For once, ZUN was right, and there is no excuse. :godzun:

On the surface, this macro is equivalent to PiLoad's original direct-to-VRAM approach. And indeed, we can replace this code with a call to PiLoad's original code path in the few cases where we just want to show a static image without unblitting any of its regions later on, removing even the requirement for that temporary 128 KiB block in the process. But in the majority of cases, we do need these images in RAM, and ZUN's original heap sizes simply weren't intended for that.
But how much of a problem are ZUN's limits in practice? Well, there's at least one instance where retaining all images would require significantly more memory than ZUN anticipated. TH02's OP.EXE requests a 256,000-byte master.lib heap, but then wants to do this:

If you step through this video, you'll see that this effect indeed page-flips between the later menu background (128,000 bytes) and the three images with the resized text (3 × 54,720 = 164,160 bytes). Hence, we must have already loaded all four of them into the heap before this animation starts. While the menu's main text is rendered on the text layer, its shadow is part of the graphics layer and must be unblitted when switching to the option menu and back, thus requiring this image in at least some portion of memory.
Since VRAM page 1 always shows the unmodified menu background image, we could cheat, use PiLoad's direct-to-VRAM code path, and then do a second load of the image to conventional RAM on frame 19, before the white-in palette effect. 📝 Frame-perfection rule #2 would also allow us to do that. But adding a second load time is lame, especially because the white-in effect wouldn't even hide that many expensive calls in the debloated build anymore. After replacing the original final graph_pack_put_8() call for the main menu image with a faster planar blit, it only hides the draw calls for the ©ZUN text and some minor file and port I/O to load and apply the menu's final 48-byte palette from OP.RGB.

There's really no reason against just increasing the size of the master.lib heap to incorporate all of these four images at the same time. But how much additional memory do we actually need? Obviously, these four images are not the only allocations on the heap, which also needs to fit at least the following buffers:

We can bypass the super_buffer allocation through a dumb trick. But the single worst aspect hides between all these individual allocations:

Fighting heap fragmentation

Let's look at TH05's OP.EXE, whose heap limit of 336,000 bytes is much more lenient. This limit should be more than enough to fit the new additional 128,000-byte buffer for the background image in addition to the original heap contents on every menu screen, and we can indeed enter the main menu without any issue. But then, we're still greeted with an out-of-memory crash after entering and leaving the Music Room? Let's take a look into the master.lib heap, with the retained background images we'd like to have:

Step Heap layout Total remaining Largest free block Fragmentation loss
Entering the main menu op1.pi (128,000) sft*.cd2 + car.cd2 (7,360) sl*.cdg (90,240) scnum.bft + hi_m.bft (7,680) 76,352 66,240 10,112
Leaving the main menu scnum.bft + hi_m.bft (7,680) 301,984 234,608 67,376
Entering the Music Room music.pi (128,000) 📝 nopoly_B (32,000) scnum.bft + hi_m.bft (7,680) 139,536 66,240 73,296
Leaving the Music Room music.pi (!) (128,000) scnum.bft + hi_m.bft (7,680) 165,552 81,920 83,632
Re-entering the main menu .PI load buffer (16,384) sft*.cd2 + car.cd2 (7,360) sl*.cdg (90,240) scnum.bft + hi_m.bft (7,680) 179,760 115,648 64,112

Something gradually shreds our heap into tiny pieces that ultimately prevent us from allocating the main menu background image a second time. We can surely blame this on ZUN's suboptimal order of load calls that doesn't prioritize larger images over smaller ones, or on the 16 KiB .PI load buffer that we maybe should have allocated statically. But the biggest hidden offender turns out to be… master.lib's packfile implementation?!

Yup. Every time you load a file out of an archive, master.lib heap-allocates a 31-byte state structure and a file read buffer that master.lib originally dimensioned at 520 bytes. TH03's MAIN.EXE and MAINL.EXE then increased the size of that buffer to 4,104 bytes, before TH04 and TH05 went up to 8,200 bytes. :zunpet:
Ultimately though, it's not the size that's the problem here, but the fact that we repeatedly allocate any memory that could have been allocated once when setting up the INT 21h handlers. You'd think that master.lib went for dynamic allocations in order to support the fact that the INT 21h file API lets you open multiple file handles simultaneously, which would point to different RLE-compressed files within the archive. But no, master.lib doesn't even support this case, and even incorrectly returns FileNotFound if you attempt to open a second file from an archive before closing the first one! :onricdennat:

After I identified this whole issue, I immediately wanted to replace master.lib's packfile code with the much saner and more explicit C++ implementation from TH01. Sooner or later, we'll have to do this anyway because we can't just hook file syscalls on other operating systems the way we can hook them on DOS.
However, TH01's implementation would quickly turn out to have its own share of heap fragmentation issues. Every time the game loads an RLE-compressed file from 東方靈異.伝, the archived file is completely decompressed into a newly allocated temporary buffer, from where the game then copies out parts into the actual game structures. The resulting fragmentation is at least easily fixable though, and that's what the TH01 part of the very first push assigned to part 1 went to. Switching to a zero-copy architecture basically only required persisting the RLE state and brought a significant improvement: 15,776 bytes of heap memory during Stage 1 freed up by that switch alone, as reported by the coreleft() output seen in debug mode?! That much for just removing temporary allocations that the game was freeing anyway?
Let's check the Borland C++ DOS Reference for how this value is actually calculated. Turns out that it is simply intended to be a measure of unused RAM memory, and sure enough:

In the large data models, coreleft returns the amount of memory between the highest allocated block and the end of memory.

That's a reasonably meaningful measurement that can be determined in constant time, compared to the 𝑂(𝑛) operation of finding the true total size of available heap memory by walking over every node.

In the end though, rolling out this C++ implementation to the other four games was way premature and would have pushed this delivery way above 12 pushes. After all, both ZUN's and master.lib's code is still full of INT 21h file syscalls that would all need to be replaced. Conditionally, even, given the two binaries that are not yet position-independent…
Fortunately, .PI loading itself is just as much of an issue and can be worked around in the much simpler way I already spoiled when explaining 📝 the API changes I made to PiLoad: We simply hold on to the menu background pixel buffer for as long as possible. Ideally, we only allocate these 128 KiB once, decode every new menu background into that same buffer, and only explicitly free it when we really need to. That's why the platform layer logic requires full control over .PI buffer allocation.
This was enough to keep the required amount of additional heap memory to a more than acceptable level:

And after a few rewritten function calls, we've indeed removed every single EGC-powered inter-page copy from all menus and cutscenes of TH02-TH05! On to the next goal…


Resolving screen tearing landmines

…which requires individual solutions for every case that merely follow a common pattern. If things are already done close to a VSync wait loop and just in a slightly wrong order, the solution is easy and we just have to shift around a few function calls.
But what can we do if ZUN mutates visible pixels at some place far away from the last VSync wait loop? After all, a lot of these landmines result from confusing the current CRT beam position across multiple functions. Often, it's impossible to see at a glance where these menu-specific subfunctions are called within a frame without tracing execution back to the last VSync delay loop at the call site. For starters, it would be nice to clearly formalize that a specific section of code must be run in VBLANK.

master.lib's vsync_Proc function pointer already gets us most of the way there. Its VSync subsystem automatically calls any non-nullptr shortly after the VSync interrupt fires, and our task function would then set vsync_Proc back to a nullptr to ensure the intended one-shot behavior.
However, this approach can at best defer a task to the next VBLANK interval, which might leave us one frame behind the original game and hurt our frame-perfection goals. What we actually want is a conditional approach for timing-sensitive tasks, as a common operation that only requires a single line of code:

Now we're only missing that crucial one bit of information, which is delivered by Bit 5 of the graphics GDC's status register at I/O port 0xA0. In fact, ZUN uses the same bit in all hand-written VSync wait code throughout TH01 and in the bouncing-ball ZUN Soft logo:

void vsync_wait_via_gdc_polling(void)
{
	// Bit 5 of the graphics GDC register indicates VBLANK. Wait until this bit
	// is set.
	while((inportb(0xA0) & 0x20) != 0) {
	}

	// Once Bit 5 is no longer set, the CRT has started drawing the next frame.
	// I have no idea why you would ever want to throw away all your precious
	// vertical retrace time, but ZUN does this all throughout TH01.
	while((inportb(0xA0) & 0x20) == 0) {
	}
}

Of course, this only solves the problem in theory, as the tasks themselves don't come with any real-time guarantees. It's entirely possible for the resulting vblank_run() function to get called near the end of VBLANK, start the task immediately, and return long after the CRT beam has started drawing again. Heck, if the system is slow enough, the task might not even complete within the VBLANK interval if we run it immediately after VSync. But this is a much more complex problem to solve, requiring upfront measurements of both the VBLANK interval and the execution time for each potential task, which can then be factored into the run-now-or-defer decision. We definitely don't need to go there as long as we're mainly targeting emulated 66 MHz systems.

In easy cases, vblank_run() can then resolve screen tearing landmines completely by itself. Towards the end of PC-98 Touhou, ZUN's menus made more and more use of master.lib's blocking palette fading functions, which delay themselves to the next VSync signal and thus avoid any tearing issues. Hence, TH04's and TH05's screen tearing landmines are limited to the very few sudden palette changes that remained in these games:

void return_from_other_screen_to_main_menu(void)
{
	// Loads the .PI image into our persistent menu image background buffer,
	// and overwrites master.lib's 8-bit palette. Takes a few frames and
	// probably won't return during VBLANK.
	GrpSurface_LoadPI(bgimage, &Palettes, "op1.pi");

	graph_accesspage(0);
	bgimage.write(0, 0); // Planar blit
	PaletteTone = 100;   // Use original brightness in palette_show()

-	// ZUN landmine: Updating the hardware palette right now will most likely
-	// cause screen tearing.
-	palette_show();
+	vblank_run(palette_show);

	[…]
}

The fixes for the landmines in TH03 and TH02, however, require much more thought and care to stay as close to ZUN's defined logical frame sequence as possible. TH03's character selection screen, which prompted this whole subproject in the first place, houses one of the harder groups of landmines:

Recorded on DOSBox-X with 375,000 cycles, since exact machine specifications are not important to demonstrate these landmines.

Very finicky work, where every single branch has the potential to introduce an off-by-one-frame error, and vblank_run() doesn't help at all.

And then you reach TH02, which asks for way too much to happen within a single frame, in plain sight, and with no palette tricks to hide it. The screen transitions into and out of its HiScore screen are by far the worst example:

Recorded at 1.9968 × 17 = 33.9456 MHz for a change to magnify the jank that you perhaps wouldn't see at higher clock speeds.

These screen transitions exhibit no less than 6 landmines and 2 bugs:

  1. 💣 Frame 1 shows how TRAM (containing the actual menu text as gaiji) gets cleared immediately, but VRAM (containing the shadow) remains untouched as ZUN decides to load HUUHI.DAT first.

  2. 💣 While the following VRAM clear appears to produce a well-defined black frame 2, it's anything but well-defined, as the load operation only happens to conclude within VBLANK by sheer chance in this recording.

  3. 💣 Frame 4 is wild. First of all, the code still hasn't waited for a single VBLANK signal ever since entering the menu, and therefore shouldn't be writing to TRAM to begin with.
    But even then, you wouldn't expect to see only the name and nothing else on the scanlines of a score record in such a partial rendering. How can TRAM operations possibly be that slow? This almost seemed as if I was missing some crucial timing-related detail about the hardware. But in the end, what we're seeing here is simply Neko Project not actually using scanline rendering for the text layer. If you write to a TRAM cell, Neko Project just marks the entire 16-pixel row to be redrawn during the next screen refresh event.

  4. 🐞 Frame 5, then, is the first well-defined frame that actually renders the way it's defined in the code. The green 東方封魔録 logo is indeed only meant to be visible from the next frame onwards. This certainly meets all criteria for a bug, but the debloated build isn't allowed to fix those. In fact, it needs a dedicated conditional branch to preserve this bug.

  5. Once you leave the menu, you'll first have to sit through a stylistic and non-productive 20-frame delay, before… the screen switches back to the last frame rendered before the delay on frame 77?
    💣 By that point, we're technically already back to the main menu, where the first thing ZUN does is to switch from double-buffering back to single-buffering with VRAM page 0 shown. If you happened to leave the menu by hitting a key on the 50% of frames where VRAM page 1 is shown, the screen will therefore flip back to the frame rendered before the 20-frame delay, and keep it visible while master.lib decodes the title screen image.

  6. 💣 This decoding process finishes after ~20.4 frames in this recording, near the middle of frame 98. Clearly, we then have to immediately switch the hardware palette to the one we just loaded. Let's completely disregard that we're probably not in VBLANK, or that the screen is still showing the last High Score menu frame… :zunpet:

  7. Then, we need to get the image onto both VRAM pages. 📝 As we found out in Part 2, a low-clocked 386 is pretty much the most suboptimal system for master.lib's packed→planar conversion code, and 12 frames exactly match the performance we would expect from Neko Project at 33 MHz.
    💣 But that only rendered the image to the invisible VRAM page 1. We could now temporarily show page 1 after the next VSync signal to hide the pretty much guaranteed multi-frame VRAM writes… but nah, who cares except for some researcher 28 years later. By leaving VRAM page 0 on screen, ZUN doesn't even attempt to hide the jank that is about to occur. Once again, he reaches for master.lib's graph_copy_page(), 📝 whose slowness I already talked about in Part 2. At 33 MHz, Neko Project takes 3 frames to copy one page after another, leaving us with two frames of mixed pixels. This can be even worse on real hardware: On 📝 spaztron64's K6-2-upgraded and southbridge-bottlenecked PC-9821V166 model, this copy took 100 ms. I was able to watch every single bitplane getting individually copied in the recording. Unpacking the .PI image a second time would have been faster on that machine. :onricdennat:

  8. 🐞 Also, ZUN should have definitely cleared TRAM before the page copy instead of deferring this responsibility to the main menu rendering code. Since we then return to the main menu's VSync-timed loop and regularly wait for VSync while the scoreboard remains on screen and part of the current logical frame, this is not a landmine.

Compare this with the debloated version:

Then, I did that 13 more times for the other screen tearing landmines fixed in this build. And no, these new builds don't even fix every instance of this issue…


Intermission: Handling unused code on fork branches

Given that all of these improvements are taking place on the debloated branch, it's time to decide on how to handle the biggest unneeded obstacle in the way of our portability efforts, after 📝 I procrastinated this question 2½ years ago.
In Shuusou Gyoku, I've been trying to retain every single line of unused code in a dedicated directory, not least because that game has 📝 some very wild effects that should be reasonably preserved. The problem with this approach is that all this unused code quickly stopped compiling as I started to refactor the game into its current cross-platform state. For discoverability, this is still better than outright deleting the code and expecting people to read pbg's original codebase, but it's not all too practical either. :thonk:

In the ReC98 codebase, we have a different situation: All the unused code doesn't just exist at some old commit that maybe won't even compile going forward, but is an integral part of the master branch. Therefore, removing this code from fork branches is not only in line with their goals, but also completely non-destructive, since its compilable form on master keeps getting maintained for a handful of building platforms.
Then again, I like the added overview and discoverability of the Shuusou Gyoku approach. So let's meet in the middle: From now on, the debloated branch will only keep unused code in the form of its declarations and some short explanatory comments, in files within the unused/ directory whose names point to the actual implementations on the master branch.

Funnily enough, unused code wasn't even the main reason why TH01's ANNIV.EXE lost 10,834 bytes between the previous and current builds. Although TH01 is the one game with by far the most unused engine code, that code only made up 3,728 bytes of that difference. The rest came from the work surrounding the zero-copy unpacker and the few portability features that already made sense to be rolled out for this game. Yes, TH01 really is that bloated.


Merging TH02-TH05's OP.EXE and MAINE.EXE/MAINL.EXE

Onto the second most exciting feature, 📝 as motivated by the blog post from May! A true single-executable build 📝 never looked that viable for TH04 and TH05 to begin with, so let's just go for the one viable partial merge that makes sense for all of the four games. With all of MAINE.EXE/MAINL.EXE being position-independent, the remaining bunch of ASM code there isn't much of an obstacle either.

And once again, this merge means that we have to resolve all 📝 binary-specific inconsistencies at once. While ZUN thankfully eliminated most of them by the end of the PC-98 era, the scorefile code remained inconsistent until the very end, 📝 as Part 3 already mentioned. Hopefully, this is the second-to-last time I have to mention these formats…
Funnily enough, all of their most noteworthy inconsistencies are found in how these formats deal with corrupted files:

The actual merge then indeed delivers what we were hoping for: In three of the four games, the added unique code from OP.EXE and MAINE.EXE/MAINL.EXE comes in at far below the 20,512 bytes we freed by removing 📝 Borland's C++ exception handler, both in the binaries themselves and in their loaded in-memory state.
But it's TH05 where both OP.EXE's expanded Music Room and MAINE.EXE's Staff Roll and All Cast sequence add so much unique data that the initial merge ended up slightly larger than the size of the original MAINE.EXE. Getting the binary and run-time size of the new DEBLOAT.EXE below that point required every trick in the book and then some. The more critical tricks were good ideas in their own right:

But the rest of them definitely crossed over into silly micro-optimization territory:


Alright, another idealistic bonus goal reached! That means we're only missing a single aspect to reach feature parity with the debloated TH01 build:

Replicating TH02-TH05's GAME.BAT in C++

In TH03, this is slightly more involved. We not only need to launch PMD using this technique, but also apply it to the INTvector set program and SPRITE16. 📝 You know the way this goes:

…actually, wait a moment! TH02-TH05 don't even use the C heap, and the master.lib heap works in a completely different way. Borland implemented their heap in a traditional sbrk(2) style that dynamically resizes a program's main DOS memory block as needed, which is how we end up with the whole concept of the heap growing in one direction. master.lib, however, needs to place its heap entirely within a single pre-allocated and fixed-size block of memory. And since mem_assign_dos() simply allocates this block via the usual DOS INT 21h, AH=48h API, it doesn't matter where on the DOS heap this block is located. This means that we don't even have to do the error-prone witchcraft of pushing these TSRs to the top of conventional memory that we had to do for TH01! Right?

Just like last time, these visualizations are not even remotely to scale.
Also, note the small gap between SPRITE16 and PMD. That's supposed to be PMD's environment segment, and DOS does allocate it, but PMD explicitly frees it before going resident. Sure, it's just a few bytes on real DOS, but still, very considerate!

Unfortunately though, the fixed position of all these TSRs would still prevent the game allocation from being replaced with a binary that asks for more memory than the one this block was initially allocated for. In TH01, this would have been a minor issue because it only applied to hot-reloading the single DEBLOAT.EXE or ANNIV.EXE that contains all game code. For the other four games, however, we still keep the larger MAIN.EXE as a separate binary, and most likely will do so for the foreseeable future. And we're surely not getting into the business of moving already allocated TSRs…
So we're back to the technique from two years ago after all. Let's precalculate the size of each TSR, push that TSR to the top of conventional RAM by temporarily claiming all free memory minus its expected size, and then we get…

…a ZUN.COM spawn failure from DOS as we try to start the ZUNINIT sub-binary. :zunpet:
Yup. Thanks to ZUN's fantastic idea of bundling these small utility tools and TSRs into a single binary that's larger than each individual TSR, we can't just reuse the strategy that worked for TH01. DOS must load the entirety of ZUN.COM into conventional RAM before the bundling code gets a chance to shift the selected sub-binary to the top of the program's memory block and then reduce the size of that block.
So how are we going to solve this?

Sure, we could allocate the resident structure into the gap left by ZUNINIT's instance of ZUN.COM, but that's just a few bytes. Keep in mind that the (env.) blocks are much smaller than they appear here. Thus, the free blocks are also a lot larger in reality, but still not large enough to fit PMD and its music buffer.

But if we can't get rid of ZUN.COM's high load-time memory requirements, how about using that memory more productively? Is there a way we could maybe spawn the other TSRs into the hole left by ZUN.COM after it went resident?
Let's take a step back from individual TSRs and instead look at the full picture of spawning a bundle of TSRs in a defined order. First, we determine both the binary size (file size of the .COM binary + Program Segment Prefix + 256 bytes of stack) and the resident size (the size of its memory block after it goes resident) of each TSR. With these metrics, we can calculate a minimum and resident size for the full bundle by simulating the TSR spawns in order:

uint32_t bundle_size_min = 0;
uint32_t bundle_size_resident = 0;

for(const auto& tsr : tsrs) {
	// Since DOS has freed all excess binary memory before we get to spawn a new TSR, the new
	// one will end up next to the previous resident allocations. We only need to consider
	// the previous minimum size because it might be larger than the one we calculate here.
	bundle_size_min = std::max((bundle_size_resident + tsr.size_binary), bundle_size_min);

	bundle_size_resident += tsr.size_resident;
}

Let's step through the bundle construction for TH03:

TSR Binary Resident Bundle minimum Bundle resident Naive
ZUNINIT (ZUN.COM) 23,276 1,056 23,276 1,056 23,276
SPRITE16 (ZUN.COM) 23,276 36,528 24,332 37,584 59,804
PMD86.COM 29,295 30,144 66,879 67,728 89,948

Then, we only need to resize our main memory block a single time to leave a gap at the top of conventional RAM whose size matches the larger of the minimum or resident bundle sizes. If we then spawn the TSRs into this gap, we indeed save 22,220 bytes over the naive approach! Let's visualize the resulting memory layout with TH02 because there's a nice detail with MMD and PMD:

MMD also frees its environment segment before going resident, so we can subtract the size of one environment segment from the reserved heap size if we spawn both PMD and MMD.
Unfortunately, one of these gaps will always remain with this approach. We could get rid of it by spawning MMD and PMD first, which would merge it into the remaining free memory. But then, we'd have nothing to spawn into the hole left by the ZUN.COM binary for ZUNINIT, and we'd be back to the naive fragmented situation above. Thus, this trick remains unique to TH02.

However, there's one crucial detail in all of this that would prove to be more complicated:

Calculating correct resident sizes

In TH01, this was no big deal. MDRV98 was the only TSR we had to care about, and there was no reason not to just replicate its simple resident size calculation within the code. After all, people would either run the version bundled with the game or the smaller previous version if they played on a real-hardware CanBe model. No one really cares about MDRV98 beyond that level; the driver is almost universally disliked for just not being PMD, which managed to attract a sizable community, documentation, and even new developments over the years. A PMD port of TH01 has been one of the most common mod requests as well.
The TSRs in later games, however, are much more flexible. We compile both ZUNINIT and SPRITE16 from source and should therefore expect people to mod them, but these two in particular might just be considered uninteresting and static enough to justify hardcoding their sizes. But this approach utterly breaks with PMD, whose chip-specific variants come in multiple versions depending on the game:

:th02: TH02 :th03: TH03 :th04: TH04 :th05: TH05
PMD.COM 4.8l (1996-12-28)
14,336 bytes
PMDB2.COM
(ADPCM)
4.8l (1996-12-28)
18,496 bytes
4.8o (1997-06-19)
18,592 bytes
PMD86.COM
(86PCM)
4.8l (1996-12-28)
19,904 bytes
4.8o (1997-06-19)
19,984 bytes
PMDPPZ.COM
(PPZ8/CanBe)
4.8l (1996-12-28)
20,768 bytes
4.8o (1997-06-19)
21,024 bytes
The PMD versions that ZUN shipped with each game. The byte size refers to the in-memory TSR size without any music, voice, or effect data added on top.

In theory, nothing stops us from hardcoding these sizes for each game as well. But these physical details about specific PMD versions are even less of a property of the game. There's no reason why modders shouldn't be able to replace any of the hardware-specific driver versions with any other – and given the sizable PMD composer and arranger community, this is a much more likely kind of mod to happen. SSG-EG, anyone?

But how could we figure out the required resident size of arbitrary PMD versions without hardcoding anything? From the outside, we can only really know for sure by running the driver and seeing how much memory it keeps resident…
…so that's exactly what we need to do. The merged binaries spawn each driver three times during setup – once to figure out its size, a second time to remove this test TSR, and a third time to respawn the TSR at its designated place at the top of conventional memory. And if we have such a system in place, nothing stops us from applying it to all other TSRs as well, removing the need to precalculate or hardcode any size… well, except for SPRITE16, which still needs a hack to factor in its extra two blocks on the DOS heap. In TH03, these 2×3 additional processes do slow down startup by about 6 frames on our target 66 MHz Neko Project configuration when compared to the batch file, which should still be tolerable relative to the .PI load times we removed by switching to PiLoad.

The whole feature has a few other nice properties as well:

And then you test with the actual ZUN.COM and notice that you're still not done:

TSR Binary Resident Bundle minimum Bundle resident
ZUNINIT (ZUN.COM) 13,394 784 13,394 784
PMD86.COM 29,383 30,224 30,167 31,008
TSR Binary Resident Bundle minimum Bundle resident
ZUNINIT (ZUN.COM) 65,968 784 65,968 784
PMD86.COM 29,383 30,224 66,752 31,008

It's this TH04 issue that raises the question of whether this whole TSR bundling solution was even worthwhile in the first place. It sure was an interesting problem to solve, but it'd be much simpler and less bloated to just integrate the INTvector set program into every binary. For TH03, we could similarly integrate all SPRITE16 functionality directly into DEBLOATM.EXE/ANNIVM.EXE and still end up with a smaller-than-original binary after removing Borland's C++ exception handler. That would leave PMD and MMD as the only TSRs we'd need to spawn from C++, and those do have good reasons to be separate from game code.
Oh well, gotta get TH03's MAIN.EXE position-independent first…

Also, the usual caveats from two years ago still apply. This whole trick of pushing TSRs to the top of conventional RAM still relies on witchcraft that may not work on certain DOS kernels. For developers, tinkerers, and people who know what they're doing, it does succeed at nicely decluttering the game directory. But for… ahem, distributors, I still recommend shipping the modified version of GAME.BAT and GAMECB.BAT in the package below to defend against any potential stability issues.


Finally, if the performance improvements aren't enough of a reason to upgrade to these new builds, how about an actual new feature? TH03's Anniversary Edition now lets you quit out of the VS Start menu via either ESC or a new menu item, without going through the Select screen. 🙌

Screenshot of the VS Start menu in the P0323 build of TH03's Anniversary Edition, showing off the new Quit option and the version text
Matching the style of the version text to the style of ☪ The Phantasmagoria of Dim. Dream on the other side seemed like the least bad option here. That outline is indeed created by rendering every line 9 times…

And with that, I'm finally done with 2025's most indulgent subproject! Let's quickly check the overall impact on the codebase:

$ git diff --stat debloated~193 debloated -- . ":(exclude)Tupfile.lua" ":(exclude)build_dumb.bat" ":(exclude)unused/"
[…]
259 files changed, 4145 insertions(+), 8099 deletions(-)

That's almost 4,000 lines of ad-hoc PC-98-native graphics code, bloat, landmines, bloat- and landmine-documenting comments, and binary-specific inconsistencies removed from game code, in exchange for…

git diff --stat master~203 master -- platform/
[…]
28 files changed, 2213 insertions(+), 258 deletions(-)

…not even half that many additional lines in the platform layer. And here's what all of this compiles to:

Richard Stallman cosplaying as a shrine maiden ReC98 (version P0323) 2025-09-29-ReC98.zip

After the Shuusou Gyoku debacle and the many last-minute fixes that cropped up while I was writing this post, I'm not particularly confident in these builds, despite the weeks of testing that went into them. Still, we've got to start somewhere. At least for TH03, we're bound to quickly find any issues that slipped through the cracks while I'm implementing netplay into the Anniversary Edition.

Next up: The very quick round of 📝 Shuusou Gyoku maintenance and forward compatibility I announced in April, to clear out the backlog a bit. This whole series also really stretched the concept of what 11 pushes should be, so I'll charge 2 pushes for that maintenance round to compensate. In exchange, I'll also incorporate a small bit of new Windows 98 feature work, since it fits nicely with the cleanup work.