- 📝 Posted:
- 💰 Funded by:
- [Anonymous], Ember2528
- 🏷️ Tags:
Part 4! Let's get this over with, fix all these landmines, and conclude 📝 this 4-post series about the big 2025 PC-98 Touhou portability subproject. Unless I find something big in TH03, this is likely to be the final long blog post for this year. I wanna code again!
- The scope
- Retaining menu backgrounds in conventional RAM
- Resolving screen tearing landmines
- Intermission: Handling unused code on fork branches
- Merging TH02-TH05's OP.EXE and MAINE.EXE/MAINL.EXE
- Replicating TH02-TH05's GAME.BAT in C++
- Topping it off with an actual feature
As you can already tell by this table of contents, this "initial" cleanup work was quite a bit larger in scope than its counterpart for 📝 the first TH01 Anniversary Edition release. Even that already took unexpectedly long 2½ years ago, and now imagine doing that across four games simultaneously while keeping all the little required inconsistencies in place. Then you'll get why this has taken over four months…
With an overall goal of "general portability", it's very tempting to escalate the scope towards covering everything in these menu and cutscene binaries. So I had to draw at least some boundaries:
- No big feature work in TH01
- No work on any graphics formats besides PI
- Graphical text will still get rendered directly to VRAM
Even then, this was way premature. Not only because we still need to maintain the memory layout of TH02's and TH03's MAIN.EXE, but also because of all the undecompilable ASM code in all four games that blocks certain architectural simplifications.
The biggest problem, however: I haven't quite decided on how to use static libraries within my build environment yet. Since Turbo C++ 4.0J's linker just blindly includes every explicitly named object file without eliminating dead code, static libraries are essential for reducing bloat by providing a layer of optional files to be included on demand. However, the Windows and DOS versions of TLIB are easily confused, TLIB's usual paradigm of mutating existing library files goes against Tup's explicit dependency graph, and should we really depend on an ancient proprietary tool for a job that I could reimplement in a few hundred lines? Famous last words, I know. But since I 📝 didn't want to do any dedicated build system work this year, I also didn't want to sort out these questions in a 12th or even 13th push. Leaving the build environment woefully ill-equipped for the complexity of this task was probably a mistake; while the resulting workaround of feature bundles does the job, it's very silly and hopefully won't last very long. I'm definitely going to spend some time sorting out the static library situation before I ever attempt something like this again. Or at some general point before we hit the overall 100% finalization mark, because we've still got that long-awaited librarization of ZUN's master.lib fork ahead of us.
Let's get to it then, starting with the feature that will remove lag in menus by removing PC-98-specific page-flipping and EGC code:
Retaining menu backgrounds in conventional RAM
At first, this seems to be no problem. We just swap out master.lib's .PI functions with our forked PiLoad and our generic blitter, and make sure to keep the images allocated. master.lib's graph_pi_load_pack() has always loaded .PI images into one big contiguous buffer in conventional RAM, so this shouldn't negatively affect the heap layout. If anything, we'd be saving memory by 📝 not allocating these extra two rows, right?
Unfortunately, it's that second goal that would turn out to be a massive problem. 📝 The end of part 1 already hinted at how the majority of menu backgrounds are only rendered to VRAM a single time before ZUN immediately frees them from memory. These cases are so common that I defined a macro for them:
#define pi_fullres_load_palette_apply_put_free(slot, fn) { \
pi_load(slot, fn); \
pi_palette_apply(slot); \
pi_put_8(0, 0, slot); \
pi_free(slot); \
}
In these cases, the games only need that single 128 KiB block temporarily, and then get to reuse that memory for other, more dynamic graphics. Consequently, ZUN probably dimensioned the master.lib heap sizes for TH02-TH05 to leave ample headroom with this fact in mind. I wasn't so sure about deliberately limiting the amount of heap memory 📝 in late 2021 when I fixed the one out-of-memory landmine that remained in TH04, but I've begun to appreciate these memory limits quite a lot as the scope of my research has deepened. Specify the right amount of bytes, perform the single allocation from the DOS heap at startup, and if that allocation succeeds, you've removed an entire class of out-of-memory bugs from consideration. Sure, modders might prefer mem_assign_all() for simplicity during development, but it does make sense to return to a static limit when shipping. For once, ZUN was right, and there is no excuse.
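As a minimal sketch of this fail-fast idea – with hypothetical names, standing in for master.lib's actual heap setup calls:

```cpp
#include <cstdlib>

// A fixed limit, like TH02 OP.EXE's 256,000 bytes.
enum { HEAP_SIZE = 256000 };

static char *heap_base = nullptr;

// The single allocation at startup. If this succeeds, no amount of
// in-game allocation from this arena can hit an unexpected
// out-of-memory condition later.
bool heap_init(void)
{
	heap_base = static_cast<char *>(std::malloc(HEAP_SIZE));
	return (heap_base != nullptr);
}
```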
On the surface, this macro is equivalent to PiLoad's original direct-to-VRAM approach. And indeed, we can replace this code with a call to PiLoad's original code path in the few cases where we just want to show a static image without unblitting any of its regions later on, removing even the requirement for that temporary 128 KiB block in the process. But in the majority of cases, we do need these images in RAM, and ZUN's original heap sizes simply weren't intended for that.
But how much of a problem are ZUN's limits in practice? Well, there's at least one instance where retaining all images would require significantly more memory than ZUN anticipated. TH02's OP.EXE requests a 256,000-byte master.lib heap, but then wants to do this:
If you step through this video, you'll see that this effect indeed page-flips between the later menu background (128,000 bytes) and the three images with the resized text (3 × 54,720 = 164,160 bytes). Hence, we must have already loaded all four of them into the heap before this animation starts. While the menu's main text is rendered on the text layer, its shadow is part of the graphics layer and must be unblitted when switching to the option menu and back, thus requiring this image in at least some portion of memory.
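Just adding up the numbers shows why ZUN's limit can't cover this animation; a quick sanity check, with the constants copied from the paragraph above:

```cpp
// Byte counts from the post: the later menu background plus the three
// resized-text images that the animation page-flips between.
enum {
	MENU_BACKGROUND = 128000,
	RESIZED_TEXT_IMAGE = 54720,
	TH02_OP_HEAP_LIMIT = 256000,
};

// All four images need to be resident at the same time…
long animation_heap_bytes(void)
{
	return (MENU_BACKGROUND + (3L * RESIZED_TEXT_IMAGE)); // = 292,160
}

// …which alone exceeds the 256,000-byte limit, before we even count
// any of OP.EXE's other heap allocations.
bool fits_into_original_heap(void)
{
	return (animation_heap_bytes() <= TH02_OP_HEAP_LIMIT);
}
```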
Since VRAM page 1 always shows the unmodified menu background image, we could cheat, use PiLoad's direct-to-VRAM code path, and then do a second load of the image to conventional RAM on frame 19, before the white-in palette effect. 📝 Frame-perfection rule #2 would also allow us to do that. But adding a second load time is lame, especially because the white-in effect wouldn't even hide that many expensive calls in the debloated build anymore. After replacing the original final graph_pack_put_8() call for the main menu image with a faster planar blit, it only hides the draw calls for the ©ZUN text and some minor file and port I/O to load and apply the menu's final 48-byte palette from OP.RGB.
There's really no reason against just increasing the size of the master.lib heap to incorporate all of these four images at the same time. But how much additional memory do we actually need? Obviously, these four images are not the only allocations on the heap, which also needs to fit at least the following buffers:
- The 8,192-byte snapshot of the gaiji loaded before the game, which are restored upon quitting the game… as well as when switching between game binaries. Yup – since the master.lib heap is not retained across binary switches, every one of the three binaries restores the system's previous gaiji before being switched out, before the new binary reads those same gaiji from the character generator back onto the master.lib heap. It would have been much smarter to keep them in a separate persistent allocation on the DOS heap instead.
- The 16,384-byte .PI load buffer. This one has to be allocated before we allocate any .PI, so it will necessarily fragment away that memory.
- The 2,560 bytes of High Score menu sprites loaded from op_h.bft. We might have shown the High Score menu if we came from a demo, and ZUN wants to keep these sprites loaded across the lifetime of the process.
- The 9,216-byte super_buffer that master.lib pre-allocates upon loading the first BFNT sprite, just in case you later want to call super_convert_tiny().
We can bypass the super_buffer allocation through a dumb trick. But the single worst aspect hides between all these individual allocations:
Fighting heap fragmentation
Let's look at TH05's OP.EXE, whose heap limit of 336,000 bytes is much more lenient. This limit should be more than enough to fit the new additional 128,000-byte buffer for the background image in addition to the original heap contents on every menu screen, and we can indeed enter the main menu without any issue. But then, we're still greeted with an out-of-memory crash after entering and leaving the Music Room? Let's take a look into the master.lib heap, with the retained background images we'd like to have:
| Step | Heap layout | Total remaining | Largest free block | Fragmentation loss |
|---|---|---|---|---|
| Entering the main menu | op1.pi (128,000), sft*.cd2 + car.cd2 (7,360), sl*.cdg (90,240), scnum.bft + hi_m.bft (7,680) | 76,352 | 66,240 | 10,112 |
| Leaving the main menu | scnum.bft + hi_m.bft (7,680) | 301,984 | 234,608 | 67,376 |
| Entering the Music Room | music.pi (128,000), 📝 nopoly_B (32,000), scnum.bft + hi_m.bft (7,680) | 139,536 | 66,240 | 73,296 |
| Leaving the Music Room | music.pi (!) (128,000), scnum.bft + hi_m.bft (7,680) | 165,552 | 81,920 | 83,632 |
| Re-entering the main menu | .PI load buffer (16,384), sft*.cd2 + car.cd2 (7,360), sl*.cdg (90,240), scnum.bft + hi_m.bft (7,680) | 179,760 | 115,648 | 64,112 |
Something gradually shreds our heap into tiny pieces that ultimately prevent us from allocating the main menu background image a second time. We can surely blame this on ZUN's suboptimal order of load calls that doesn't prioritize larger images over smaller ones, or on the 16 KiB .PI load buffer that we maybe should have allocated statically. But the biggest hidden offender turns out to be… master.lib's packfile implementation?!
Yup. Every time you load a file out of an archive, master.lib heap-allocates a 31-byte state structure and a file read buffer that master.lib originally dimensioned at 520 bytes. TH03's MAIN.EXE and MAINL.EXE then increased the size of that buffer to 4,104 bytes, before TH04 and TH05 went up to 8,200 bytes.
Ultimately though, it's not the size that's the problem here, but the fact that we repeatedly allocate any memory that could have been allocated once when setting up the INT 21h handlers. You'd think that master.lib went for dynamic allocations in order to support the fact that the INT 21h file API lets you open multiple file handles simultaneously, which would point to different RLE-compressed files within the archive. But no, master.lib doesn't even support this case, and even incorrectly returns FileNotFound if you attempt to open a second file from an archive before closing the first one!
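To see how such interleaved allocations shred a heap, here's a toy first-fit allocator that replays the pattern: a freed 128,000-byte image, a long-lived 7,680-byte sprite block, and one packfile allocation (31-byte state + 8,200-byte buffer) landing in the worst possible spot. All names and the 260,000-byte arena size are made up for this demonstration:

```cpp
#include <vector>

struct Range { long offset, size; };

struct ToyHeap {
	std::vector<Range> free_list;

	explicit ToyHeap(long arena_size) {
		free_list.push_back({ 0, arena_size });
	}

	// Address-ordered first fit. Returns the offset of the new block,
	// or -1 on failure.
	long alloc(long size) {
		Range* best = nullptr;
		for(auto& r : free_list) {
			if((r.size >= size) && (!best || (r.offset < best->offset))) {
				best = &r;
			}
		}
		if(!best) {
			return -1;
		}
		long ret = best->offset;
		best->offset += size;
		best->size -= size;
		return ret;
	}

	// Frees [offset, offset+size). No coalescing with neighbors, for
	// brevity – which only makes this toy *under*estimate fragmentation.
	void free_block(long offset, long size) {
		free_list.push_back({ offset, size });
	}

	long largest_free() const {
		long ret = 0;
		for(const auto& r : free_list) {
			if(r.size > ret) {
				ret = r.size;
			}
		}
		return ret;
	}
};

// Returns the largest free block after the failed re-allocation
// attempt, or -1 if that re-allocation unexpectedly succeeded.
long demo_fragmentation() {
	ToyHeap heap(260000);
	long a = heap.alloc(128000); // menu background image
	heap.alloc(7680);            // long-lived sprites
	heap.free_block(a, 128000);  // image blitted to VRAM and freed
	heap.alloc(8231);            // packfile state + read buffer
	if(heap.alloc(128000) != -1) { // try to reload the image…
		return -1;
	}
	return heap.largest_free();
}
```

The final allocation fails even though 244,089 bytes are still free in total: the packfile allocation split the image-sized hole, leaving 119,769 and 124,320 bytes – both short of the 128,000 we need.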
After I identified this whole issue, I immediately wanted to replace master.lib's packfile code with the much saner and more explicit C++ implementation from TH01. Sooner or later, we'll have to do this anyway because we can't just hook file syscalls on other operating systems the way we can hook them on DOS.
However, TH01's implementation would quickly turn out to have its own share of heap fragmentation issues. Every time the game loads an RLE-compressed file from 東方靈異.伝, the archived file is completely decompressed into a newly allocated temporary buffer, from where the game then copies out parts into the actual game structures. The resulting fragmentation is at least easily fixable though, and that's what the TH01 part of the very first push assigned to part 1 went to. Switching to a zero-copy architecture basically only required persisting the RLE state and brought a significant improvement: 15,776 bytes of heap memory during Stage 1 freed up by that switch alone, as reported by the coreleft() output seen in debug mode?! That much for just removing temporary allocations that the game was freeing anyway?
Let's check the Borland C++ DOS Reference for how this value is actually calculated. Turns out that it is simply intended to be a measure of unused RAM, and sure enough:
In the large data models, coreleft returns the amount of memory between the highest allocated block and the end of memory.
That's a reasonably meaningful measurement that can be determined in constant time, compared to the 𝑂(𝑛) operation of finding the true total size of available heap memory by walking over every node.
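A sketch of the difference between the two measurements, over a made-up free list:

```cpp
#include <vector>

struct FreeNode { long offset, size; };

// O(n): walking every free node yields the true total amount of
// allocatable heap memory.
long heap_walk_free(const std::vector<FreeNode>& free_list) {
	long total = 0;
	for(const auto& node : free_list) {
		total += node.size;
	}
	return total;
}

// O(1): what the Borland docs describe for the large data models – the
// distance between the end of the highest allocated block and the end
// of memory. Both addresses are hypothetical here.
long coreleft_like(long highest_allocated_end, long end_of_memory) {
	return (end_of_memory - highest_allocated_end);
}
```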
In the end though, rolling out this C++ implementation to the other four games was way premature and would have pushed this delivery way above 12 pushes. After all, both ZUN's and master.lib's code is still full of INT 21h file syscalls that would all need to be replaced. Conditionally, even, given the two binaries that are not yet position-independent…
Fortunately, .PI loading itself is just as much of an issue and can be worked around in the much simpler way I already spoiled when explaining 📝 the API changes I made to PiLoad: We simply hold on to the menu background pixel buffer for as long as possible. Ideally, we only allocate these 128 KiB once, decode every new menu background into that same buffer, and only explicitly free it when we really need to. That's why the platform layer logic requires full control over .PI buffer allocation.
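In sketch form, with hypothetical names – the actual implementation lives in the platform layer:

```cpp
#include <cstdlib>

// One persistent pixel buffer for every menu background.
enum { PI_BUFFER_SIZE = 128000 };

static void *pi_buffer = nullptr;
static int allocation_count = 0;

// Allocated on first use, then reused: decode every new .PI into the
// same memory.
void *pi_buffer_acquire(void) {
	if(!pi_buffer) {
		pi_buffer = std::malloc(PI_BUFFER_SIZE);
		allocation_count++;
	}
	return pi_buffer;
}

// Only called when we really need the memory for something else.
void pi_buffer_release(void) {
	std::free(pi_buffer);
	pi_buffer = nullptr;
}

// Loading three menu backgrounds in a row only ever allocates once.
int demo_background_loads(void) {
	for(int i = 0; i < 3; i++) {
		pi_buffer_acquire();
	}
	pi_buffer_release();
	return allocation_count;
}
```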
This was enough to keep the required amount of additional heap memory to a more than acceptable level:
- None of the MAINE.EXE/MAINL.EXE binaries needed a larger memory limit.
- TH04's OP.EXE also got to keep its original 336,000-byte heap.
- As did TH03's OP.EXE. That one seems particularly surprising if you remember 📝 its 255,216-byte character selection screen. But since ZUN only preloads 186,096 bytes before or during the main menu, we can nicely fit the title screen image into the original 352,000-byte heap.
- TH05's OP.EXE needed a slight increase by 4,768 bytes to a new limit of 340,768 bytes, for the mere sake of accommodating the heap fragmentation caused by entering and leaving the Music Room and then entering its character selection screen. An increase at that level would have been fine even if it wasn't temporary, as we're still 52,832 bytes short of reaching the amount of memory required by MAIN.EXE:

  | | Static | Heap | Total |
  |---|---|---|---|
  | OP.EXE (original) | 88,064 | 336,000 | 424,064 |
  | MAIN.EXE | 190,464 | 291,200 | 481,664 |

- That only left TH02's OP.EXE, which did require a whopping additional 80,416 bytes up to a very similar new limit of 336,416 bytes. But again, an increase at that level is also fine for this game when compared against its MAIN.EXE, which would allow OP.EXE to go up to 383,232 bytes of heap without increasing the original memory requirements:

  | | Static | Heap | Total |
  |---|---|---|---|
  | OP.EXE (original) | 69,632 | 256,000 | 325,632 |
  | MAIN.EXE | 164,864 | 288,000 | 452,864 |

Not to mention that our final merged DEBLOAT.EXE will only require 63,488 bytes of static memory, and that's despite TH02 being the one game that received the quickest and sloppiest merge job with a lot of corners cut for budget reasons.
And after a few rewritten function calls, we've indeed removed every single EGC-powered inter-page copy from all menus and cutscenes of TH02-TH05! On to the next goal…
Resolving screen tearing landmines
…which requires individual solutions for every case that merely follow a common pattern. If things are already done close to a VSync wait loop and just in a slightly wrong order, the solution is easy and we just have to shift around a few function calls.
But what can we do if ZUN mutates visible pixels at some place far away from the last VSync wait loop? After all, a lot of these landmines result from confusing the current CRT beam position across multiple functions. Often, it's impossible to see at a glance where these menu-specific subfunctions are called within a frame without tracing execution back to the last VSync delay loop at the call site. For starters, it would be nice to clearly formalize that a specific section of code must be run in VBLANK.
master.lib's vsync_Proc function pointer already gets us most of the way there. Its VSync subsystem automatically calls any non-nullptr vsync_Proc shortly after the VSync interrupt fires, and our task function would then set vsync_Proc back to nullptr to ensure the intended one-shot behavior.
However, this approach can at best defer a task to the next VBLANK interval, which might leave us one frame behind the original game and hurt our frame-perfection goals. What we actually want is a conditional approach for timing-sensitive tasks, as a common operation that only requires a single line of code:
- If we're within VBLANK, run the task right now.
- If we aren't, set up a VSync proc that runs the task immediately after the next VSync and then removes the proc.
Now we're only missing that crucial one bit of information, which is delivered by Bit 5 of the graphics GDC's status register at I/O port 0xA0. In fact, ZUN uses the same bit in all hand-written VSync wait code throughout TH01 and in the bouncing-ball ZUN Soft logo:
void vsync_wait_via_gdc_polling(void)
{
	// Bit 5 of the graphics GDC's status register indicates VBLANK. If it's
	// currently set, wait until the CRT has started drawing the next frame.
	// I have no idea why you would ever want to throw away all your precious
	// vertical retrace time, but ZUN does this all throughout TH01.
	while((inportb(0xA0) & 0x20) != 0) {
	}
	// Then, wait until this bit is set again, i.e., until the next VBLANK
	// interval begins.
	while((inportb(0xA0) & 0x20) == 0) {
	}
}
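The resulting run-now-or-defer dispatch can be sketched like this – with the status-register read mocked out so the logic is testable; on the PC-98, in_vblank() would be the ((inportb(0xA0) & 0x20) != 0) check from above, and vsync_Proc would be master.lib's actual function pointer:

```cpp
// Hypothetical stand-ins for master.lib's vsync_Proc and the GDC
// status read.
typedef void (*vsync_proc_t)(void);
static vsync_proc_t vsync_Proc = nullptr;
static bool mock_in_vblank = false;

static bool in_vblank(void) {
	return mock_in_vblank;
}

static vsync_proc_t deferred_task = nullptr;

static void run_deferred(void) {
	vsync_Proc = nullptr; // One-shot: remove ourselves first.
	deferred_task();
}

// Runs [task] immediately if the CRT is currently within VBLANK, or
// defers it to immediately after the next VSync interrupt otherwise.
void vblank_run(vsync_proc_t task) {
	if(in_vblank()) {
		task();
	} else {
		deferred_task = task;
		vsync_Proc = run_deferred;
	}
}

static int runs = 0;
static void count_task(void) { runs++; }

// Exercises both paths; returns the number of times the task ran.
int demo_vblank_run(void) {
	mock_in_vblank = true;
	vblank_run(count_task);  // runs immediately
	mock_in_vblank = false;
	vblank_run(count_task);  // deferred…
	if(vsync_Proc) {
		vsync_Proc();        // …until the VSync handler calls the proc
	}
	return runs;
}
```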
Of course, this only solves the problem in theory, as the tasks themselves don't come with any real-time guarantees. It's entirely possible for the resulting vblank_run() function to get called near the end of VBLANK, start the task immediately, and return long after the CRT beam has started drawing again. Heck, if the system is slow enough, the task might not even complete within the VBLANK interval if we run it immediately after VSync. But this is a much more complex problem to solve, requiring upfront measurements of both the VBLANK interval and the execution time for each potential task, which can then be factored into the run-now-or-defer decision. We definitely don't need to go there as long as we're mainly targeting emulated 66 MHz systems.
In easy cases, vblank_run() can then resolve screen tearing landmines completely by itself. Towards the end of PC-98 Touhou, ZUN's menus made more and more use of master.lib's blocking palette fading functions, which delay themselves to the next VSync signal and thus avoid any tearing issues. Hence, TH04's and TH05's screen tearing landmines are limited to the very few sudden palette changes that remained in these games:
void return_from_other_screen_to_main_menu(void)
{
// Loads the .PI image into our persistent menu image background buffer,
// and overwrites master.lib's 8-bit palette. Takes a few frames and
// probably won't return during VBLANK.
GrpSurface_LoadPI(bgimage, &Palettes, "op1.pi");
graph_accesspage(0);
bgimage.write(0, 0); // Planar blit
PaletteTone = 100; // Use original brightness in palette_show()
- // ZUN landmine: Updating the hardware palette right now will most likely
- // cause screen tearing.
- palette_show();
+ vblank_run(palette_show);
[…]
}

The fixes for the landmines in TH03 and TH02, however, require much more thought and care to stay as close to ZUN's defined logical frame sequence as possible. TH03's character selection screen, which prompted this whole subproject in the first place, houses one of the harder groups of landmines:
💣 Landmine #1 is caused by loading the Selection BGM and the character name sprites before clearing VRAM, without a frame delay in between. The fix is obvious.
💣 Landmine #2 already got mentioned in passing in the corresponding blog post from last year: 📝 If a frame took longer than 3 VSync interrupts to render, ZUN flips the VRAM pages immediately without waiting for the next VSync interrupt. This always applies to the very first frame 📝 because the game thinks it took ≥30 VSync interrupts to render.
In both versions, we enter the loop as vsync_Count1 turns 30. But while the original game page-flips as soon as it finishes rendering in the middle of the screen frame, the debloated build instead waits for VSync during that 30th frame and only page-flips and resets the counter after VSync. This way, we still guarantee ZUN's originally defined 30 black frames preceding this menu, despite vsync_Count1 being one frame ahead in the debloated version of that delay.

💣 Landmine #3 looks conceptually identical to landmine #2, being another mid-screen-frame page flip caused by vsync_Count1 being ≥3 by the time execution reaches the frame-rate-dropping delay loop. However, this one is caused by the game's response to a selection-confirming input and therefore needs a dedicated fix. Here's what's going on:

- The game renders the next frame to the invisible VRAM page, as it usually does. We won't see this frame for a while.
- The game checks for input and sees the Shot key. It then immediately runs the palette flash effect, using master.lib's blocking palette_white_in() while still displaying the previously rendered frame.
- palette_white_in() itself avoids screen tearing issues by running its own VSync busy-waiting loops using the (inportb(0xA0) & 0x20) check, without mutating vsync_Count1. This is why the game immediately page-flips once execution is back to the main loop.
- However, this is not the cause of this landmine. palette_white_in() also stops within VBLANK, so you'd also expect the immediate page flip to not cause any screen tearing.
- Except that ZUN then (re-)loads the .CDG portrait for the selected or automatically assigned palette variant of the confirmed character, immediately after palette_white_in(), and only then drops back into the main loop, without any further delay. Hence, we have file I/O in our logical frame, and thus can't guarantee anything.
On the hypothetical infinitely fast PC-98, the .CDG load call completes instantly and turns this into a non-issue. On real systems, however, we would need some way of hiding this load call to stick to ZUN's defined logical frames:
- Leaving it after palette_white_in() is completely wrong because it messes with the defined sequence of logical frames. Even just maintaining the original number of frames requires inserting an additional delay frame and compensating for that by cutting that one frame from the next iteration of the loop. This was my original solution, and the realization of how wrong it was certainly delayed this blog post by about a day…
- Moving it in front of palette_white_in() might work since the effect starts with a VSync wait, but it might also insert an additional screen frame of delay. Keep in mind that we're still on the same logical frame that rendered the very expensive curve effect.
That only leaves one answer: Running both the white-in effect and the .CDG load concurrently. 💡 Using vsync_Proc, we can implement a non-blocking version of palette_white_in() that runs one iteration of its palette-manipulating loop during VBLANK. Meanwhile, the "main thread" gets 16 frames to load a single 22.5 KiB character portrait and then simply waits for the white-in effect to complete. And since our VSync proc also always signals completion during VBLANK, we then get to immediately page-flip and retain ZUN's intended 3-frame timing. With this solution, we don't even have to optimize away the .CDG load in the usual case where the game just reloads the character's regular palette variant.

💣 Landmine #4 is the palette tearing issue that got an entire section in the post from last year. After moving the palette-mutating branch from before VSync to immediately after VSync, we also have to adjust the calculation of the palette brightness value to match ZUN's original values.
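The non-blocking white-in behind landmine #3's fix can be sketched as a per-VBLANK tick function. All names and the structure are assumptions for this sketch; master.lib's real palette_white_in() blocks instead:

```cpp
// The 16-step fade length matches the 16 frames mentioned above.
enum { WHITE_IN_FRAMES = 16 };

static int white_in_step = 0;

// Called once per VBLANK from the VSync proc. Advances the fade by one
// step and returns true once the palette has fully faded to white.
bool white_in_tick(void) {
	if(white_in_step < WHITE_IN_FRAMES) {
		white_in_step++;
		// (Here, the real version would brighten the hardware palette
		// by one step, during VBLANK.)
	}
	return (white_in_step >= WHITE_IN_FRAMES);
}

// Main-thread view: each tick corresponds to one frame that can be
// spent on the .CDG load. Returns the total number of ticks.
int demo_white_in(void) {
	int ticks = 0;
	bool done = false;
	while(!done) {
		done = white_in_tick(); // in reality, signaled by the VSync proc
		ticks++;
	}
	return ticks;
}
```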
Very finicky work, where every single branch has the potential to introduce an off-by-one-frame error, and vblank_run() doesn't help at all.
And then you reach TH02, which asks for way too much to happen within a single frame, in plain sight, and with no palette tricks to hide it. The screen transitions into and out of its HiScore screen are by far the worst example:
These screen transitions exhibit no less than 6 landmines and 2 bugs:
💣 Frame 1 shows how TRAM (containing the actual menu text as gaiji) gets cleared immediately, but VRAM (containing the shadow) remains untouched as ZUN decides to load HUUHI.DAT first.

💣 While the following VRAM clear appears to produce a well-defined black frame 2, it's anything but well-defined, as the load operation only happens to conclude within VBLANK by sheer chance in this recording.
💣 Frame 4 is wild. First of all, the code still hasn't waited for a single VBLANK signal ever since entering the menu, and therefore shouldn't be writing to TRAM to begin with. But even then, you wouldn't expect to see only the name and nothing else on the scanlines of a score record in such a partial rendering. How can TRAM operations possibly be that slow? This almost seemed as if I was missing some crucial timing-related detail about the hardware. But in the end, what we're seeing here is simply Neko Project not actually using scanline rendering for the text layer. If you write to a TRAM cell, Neko Project just marks the entire 16-pixel row to be redrawn during the next screen refresh event.

🐞 Frame 5, then, is the first well-defined frame that actually renders the way it's defined in the code. The green 東方封魔録 logo is indeed only meant to be visible from the next frame onwards. This certainly meets all criteria for a bug, but the debloated build isn't allowed to fix those. In fact, it needs a dedicated conditional branch to preserve this bug.
Once you leave the menu, you'll first have to sit through a stylistic and non-productive 20-frame delay, before… the screen switches back to the last frame rendered before the delay on frame 77?
💣 By that point, we're technically already back to the main menu, where the first thing ZUN does is to switch from double-buffering back to single-buffering with VRAM page 0 shown. If you happened to leave the menu by hitting a key on the 50% of frames where VRAM page 1 is shown, the screen will therefore flip back to the frame rendered before the 20-frame delay, and keep it visible while master.lib decodes the title screen image.

💣 This decoding process finishes after ~20.4 frames in this recording, near the middle of frame 98. Clearly, we then have to immediately switch the hardware palette to the one we just loaded. Let's completely disregard that we're probably not in VBLANK, or that the screen is still showing the last High Score menu frame…
Then, we need to get the image onto both VRAM pages. 📝 As we found out in Part 2, a low-clocked 386 is pretty much the most suboptimal system for master.lib's packed→planar conversion code, and 12 frames exactly match the performance we would expect from Neko Project at 33 MHz.
💣 But that only rendered the image to the invisible VRAM page 1. We could now temporarily show page 1 after the next VSync signal to hide the pretty much guaranteed multi-frame VRAM writes… but nah, who cares except for some researcher 28 years later. By leaving VRAM page 0 on screen, ZUN doesn't even attempt to hide the jank that is about to occur. Once again, he reaches for master.lib's graph_copy_page(), 📝 whose slowness I already talked about in Part 2. At 33 MHz, Neko Project takes 3 frames to copy one page after another, leaving us with two frames of mixed pixels. This can be even worse on real hardware: On 📝 spaztron64's K6-2-upgraded and southbridge-bottlenecked PC-9821V166 model, this copy took 100 ms. I was able to watch every single bitplane getting individually copied in the recording. Unpacking the .PI image a second time would have been faster on that machine.
🐞 Also, ZUN should have definitely cleared TRAM before the page copy instead of deferring this responsibility to the main menu rendering code. Since we then return to the main menu's VSync-timed loop and regularly wait for VSync while the scoreboard remains on screen and part of the current logical frame, this is not a landmine.
Compare this with the debloated version:
The first three landmines are fixed by running the common "set palette to black and clear TRAM" operation in VBLANK, and deferring both the palette update and the scoreboard rendering to the VBLANK interval preceding frame 5.
Everything between frame 77 and frame 113 inclusive is defined to happen on a single logical frame. Since this screen doesn't allocate its own 640×400 background, we get to keep the title screen image in memory and actually turn this logical frame into a real one. Then, we can use ZUN's defined 20-frame delay constructively:
- First, we render the last frame to the other VRAM page to defuse landmine #5. Yes, render – a GRCG-accelerated 385×209-pixel flood fill followed by eight transparent 16-color 128×32 sprites is much faster than copying VRAM pages.
- Then, we can unconditionally switch to showing VRAM page 1 and accessing VRAM page 0 on the next VSync, without affecting what's shown on screen.
- Then, we have all the time in the world to blit the planar title screen image from memory to VRAM page 0, the only one we still need to touch.
On the VSync that precedes frame 77, we then simply 🫰 flip VRAM pages and the hardware palette to produce exactly the well-defined image that an infinitely fast PC-98 would have produced for ZUN's original code.
Then, I did that 13 more times for the other screen tearing landmines fixed in this build. And no, these new builds don't even fix every instance of this issue…
Intermission: Handling unused code on fork branches
Given that all of these improvements are taking place on the debloated branch, it's time to decide on how to handle the biggest unneeded obstacle in the way of our portability efforts, after 📝 I procrastinated on this question 2½ years ago.
In Shuusou Gyoku, I've been trying to retain every single line of unused code in a dedicated directory, not least because that game has 📝 some very wild effects that should be reasonably preserved. The problem with this approach is that all this unused code quickly stopped compiling as I started to refactor the game into its current cross-platform state. For discoverability, this is still better than outright deleting the code and expecting people to read pbg's original codebase, but it's not all too practical either.
In the ReC98 codebase, we have a different situation: All the unused code doesn't just exist at some old commit that maybe won't even compile going forward, but is an integral part of the master branch. Therefore, removing this code from fork branches is not only in line with their goals, but also completely non-destructive, since its compilable form on master keeps getting maintained for a handful of building platforms.
Then again, I like the added overview and discoverability of the Shuusou Gyoku approach. So let's meet in the middle: From now on, the debloated branch will only keep unused code in the form of its declarations and some short explanatory comments, in files within the unused/ directory whose names point to the actual implementations on the master branch.
Funnily enough, unused code wasn't even the main reason why TH01's ANNIV.EXE lost 10,834 bytes between the previous and current builds. Although TH01 is the one game with by far the most unused engine code, that code only made up 3,728 bytes of that difference. The rest came from the work surrounding the zero-copy unpacker and the few portability features that already made sense to be rolled out for this game. Yes, TH01 really is that bloated.
Merging TH02-TH05's OP.EXE and MAINE.EXE/MAINL.EXE
Onto the second most exciting feature, 📝 as motivated by the blog post from May! A true single-executable build 📝 never looked that viable for TH04 and TH05 to begin with, so let's just go for the one viable partial merge that makes sense for all of the four games. With all of MAINE.EXE/MAINL.EXE being position-independent, the remaining bunch of ASM code there isn't much of an obstacle either.
And once again, this merge means that we have to resolve all 📝 binary-specific inconsistencies at once. While ZUN thankfully eliminated most of them by the end of the PC-98 era, the scorefile code remained inconsistent until the very end, 📝 as Part 3 already mentioned. Hopefully, this is the second-to-last time I have to mention these formats…
Funnily enough, all of their most noteworthy inconsistencies are found in how these formats deal with corrupted files:
- The TH03 inconsistency I 📝 teased in part 3 is almost not worth mentioning. If the game ends up recreating YUME.NEM while loading the high scores for name registration after a 1CC, the clear flag is written for all difficulties, not just the one you've actually cleared. Our exact definition of observable bugs comes in doubly handy here:
  - To get a 1CC in the first place, you must have gone through character selection, which also (re-)creates YUME.NEM if necessary. Therefore, MAINL.EXE would only ever recreate YUME.NEM in this "1CC mode" if something outside the game deleted or tampered with the file while the game was running.
  - TH03 offers no benefits for a 1CC on specific difficulties, and doesn't even visually indicate this flag, unlike the three other games. 1CC'ing any difficulty is all that matters for unlocking Chiyuri and Yumemi.

  With no way to observe this per-difficulty state, this is one of the rare landmines where we get total freedom for the fix. Thus, we can just do the right thing and set the clear flag for only the current difficulty, reflecting your actual achievements and paving the way for a future feature that can highlight this per-difficulty clear state in the UI.
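A sketch of the fixed recreation logic (the enum and structure are illustrative, not ReC98's actual YUME.NEM layout):

```cpp
// Hypothetical per-difficulty clear flags, standing in for TH03's actual
// YUME.NEM layout. (All names here are illustrative, not ReC98's.)
enum { EASY, NORMAL, HARD, LUNATIC, DIFFICULTY_COUNT };

struct ClearFlags {
	bool cleared[DIFFICULTY_COUNT];
};

// ZUN's landmine: recreating the file in "1CC mode" set *every* flag.
ClearFlags recreate_original() {
	ClearFlags ret;
	for(int i = 0; i < DIFFICULTY_COUNT; i++) {
		ret.cleared[i] = true;
	}
	return ret;
}

// The fix: only the difficulty that was actually 1CC'd gets its flag set.
ClearFlags recreate_fixed(int difficulty_cleared) {
	ClearFlags ret;
	for(int i = 0; i < DIFFICULTY_COUNT; i++) {
		ret.cleared[i] = (i == difficulty_cleared);
	}
	return ret;
}
```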
- TH04's OP.EXE simultaneously loads both Reimu's and Marisa's scores for the currently selected difficulty into two separate structures. This alone is a great source of unnecessary inconsistencies, but it gets even worse when either of the two sections is found to be corrupted during decryption. In that case, the game doesn't decrypt Marisa's section and leaves its encrypted state in the respective structure. However, the High Score viewer still assumes that both sections were decrypted. While Reimu's section will always contain either valid or recreated default data, you probably won't see that under all the garbage sprite data rendered for the still encrypted Marisa:
Corruption with random bytes will look slightly more varied than the zeroed-out example from the previous post.
The original games would recreate the full GENSOU.SCR with its default data if even just one character×difficulty-specific section of the file was found to be corrupted. The debloated build now only resets individual corrupted sections to their default state, preserving as much of the file as possible. This also went hand in hand with removing that separate Marisa score structure in TH04, giving us identical and glitchless corruption repair behavior in both games and saving me from having to mention TH04's corruption behavior in the release notes. Efficiency!
As an added consistency bonus, the debloated builds no longer fully re-encrypt GENSOU.SCR after entering a score after a cutscene. This was dumb for many reasons.
Also, they still preserve 📝 the inconsistent stage numbers upon recreation. I couldn't bring myself to fix this 😩
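The section-wise repair can be sketched like this; the section layout, checksum scheme, and names are illustrative stand-ins rather than the actual GENSOU.SCR format:

```cpp
#include <stddef.h>

// Illustrative stand-in for one character×difficulty section of GENSOU.SCR;
// the real format and its validation differ, this only shows the repair idea.
struct ScoreSection {
	unsigned char data[16];
	unsigned char checksum; // assumed here: byte sum of data[]
};

unsigned char checksum_of(const ScoreSection& s) {
	unsigned char sum = 0;
	for(size_t i = 0; i < sizeof(s.data); i++) {
		sum += s.data[i];
	}
	return sum;
}

// Originally, a single bad section reset the whole file to its defaults.
// The debloated behavior only resets the sections that fail validation.
void repair(ScoreSection* sections, int count, const ScoreSection& defaults) {
	for(int i = 0; i < count; i++) {
		if(checksum_of(sections[i]) != sections[i].checksum) {
			sections[i] = defaults;
		}
	}
}
```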
The actual merge then indeed delivers what we were hoping for: In three of the four games, the added unique code from OP.EXE and MAINE.EXE/MAINL.EXE comes in at far below the 20,512 bytes we freed by removing 📝 Borland's C++ exception handler, both in the binaries themselves and in their loaded in-memory state.
But it's TH05 where both OP.EXE's expanded Music Room and MAINE.EXE's Staff Roll and All Cast sequence add so much unique data that the initial merge ended up slightly larger than the size of the original MAINE.EXE. Getting the binary and run-time size of the new DEBLOAT.EXE below that point required every trick in the book and then some. The more critical tricks were good ideas in their own right:
- Heap-allocating 📝 the scrollable verdict bitmap shown after the Staff Roll frees up 28,160 bytes of statically allocated memory. The fact that you can just have such large arrays of static data seemed like a great benefit of this binary splitting model 5 years ago, but it really doesn't hold up against just writing the two lines to allocate and free that memory from the heap. MAINE.EXE's 320,000-byte heap memory limit is more than enough to fit that bitmap in addition to all the simultaneously loaded Staff Roll sprites.
- Heap-allocating 📝 TH04's and TH05's cutscene script buffer not only does the same at the smaller scale of 8,192 bytes, but also practically saves over half of that memory, as TH05's largest actual script (Reimu's Good Ending, stored in _ED10.TXT) is just 3,152 bytes. And not just that: It also removes the original 8 KiB limit on cutscene scripts in those games, allowing mods to use up to 64 KiB just like TH03.
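As a sketch, using plain malloc()/free() and made-up names in place of Borland's far-heap functions:

```cpp
#include <stdlib.h>
#include <string.h>

// Before: the worst-case buffer was always part of the data segment,
// and also hard-capped scripts at 8 KiB:
//	static unsigned char script[8192];
//
// After (sketch): size the buffer to the actual script and free it once the
// cutscene is done. The real code uses Borland's far heap rather than the
// plain malloc()/free() shown here, and these names are made up.
unsigned char* script_load(const unsigned char* file_data, size_t file_size) {
	unsigned char *const buf = (unsigned char *)malloc(file_size);
	if(buf != NULL) {
		memcpy(buf, file_data, file_size);
	}
	return buf;
}

void script_free(unsigned char* buf) {
	free(buf);
}
```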
But the rest of them definitely crossed over into silly micro-optimization territory:
- The single biggest reduction came from turning the various statically allocated far pointers to hardcoded strings into near ones. ZUN used the Large memory model for every .EXE binary, where every statically initialized C pointer variable not only gets turned into this 4-byte segment+offset form, but also receives a 2-byte relocation in the MZ header that allows the DOS .EXE loader to adjust the relative segment part to the correct absolute value in conventional RAM. These relocations don't remain in memory after a process has started, but they do have quite an impact on a binary's size if it uses lots of hardcoded strings.
  The correct high-level solution is to simply switch to the Medium memory model, which restricts a program to just 64 KiB of statically allocated data and reduces all data pointers to offset-only near pointers by default. Sadly, switching memory models is one of those wide-ranging architectural changes that we absolutely can not realistically do with that much undecompiled and undecompilable ASM left in the codebase:
  - All ZUN-written ASM code came out of the disassembler in a Large-exclusive form and would have to be manually adapted to work for the Medium model as well.
  - Due to all the code sharing between the games, we'd pretty much have to flip the Medium model switch for all games at the same time. A gradual transition would take even more effort.

  Hence, this will only make sense at that far point in the future when we've even translated the majority of undecompilable ASM back to C++. In the meantime, we're left with manually declaring all such pointers as near. With a total of 471 pointers to hardcoded strings in the merged TH05 executable, this brought the binary size down by 1,884 bytes. 1,356 of those bytes came from the Music Room and its hardcoded track titles and BGM filenames, but we've also got 300 bytes in 📝 the All Cast sequence, 156 bytes in the main menu, and 72 bytes in the sound setup menu.
- At startup, Borland's libc must correctly set up buffering for C's stdin and stdout streams. Section 7.21.3/7 of the C standard mandates how this setup must behave in case any of these streams are redirected away from the terminal, but even the "implementation-defined" terminal case must at least set up line buffering for stdin to make scanf() and similar functions behave as expected, just in case you ever want to use these functions. TH01 uses scanf() for the stage selection feature in Debug mode, but other games thankfully stay far away from C's standard I/O functions and use master.lib's text layer functions instead. Disabling this I/O setup in the same way we disable Borland's forced C++ exception handler saves 1,722 bytes in TH02-TH05.
  At least C doesn't even pretend to make you not pay for things you don't use in the way C++ does. It just unconditionally throws all the trash your way…
- Removing trailing whitespace from the hardcoded Music Room track titles and sound setup menu help texts saved another 862 bytes. Hex-editing translators might disapprove, but come on, we have C++ code now. If you commit and push your edits somewhere, there's at least a chance that we can keep them working into the future.
- The explosion sprite structure in the ZUN Soft logo has an unused 2-byte structure field that wastes 512 statically allocated bytes in the game's data segment. That array would have been another prime candidate for heap allocation, but that would have only been feasible with a decompilation, and 📝 someone insisted on keeping this particular animation in ASM for the time being…
- Removing unnecessary inlining from game startup saved 64 bytes.
- Data-driving the Demo Play characters and stages saved 54 bytes.
- The original MAINE.EXE contains a second copy of the スローモードでのプレイでは、スコアは記録されません ("scores won't be recorded when playing in Slow mode") string because ZUN didn't use a single optimal set of compiler flags for the entire game. Removing that second copy gives us our final 51 bytes.
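The 1,884 bytes from the first item check out: each far→near conversion shrinks the pointer variable by 2 bytes and drops its 2-byte MZ-header relocation entry:

```cpp
// Bytes saved by declaring statically initialized string pointers as near:
// the variable shrinks from segment+offset to offset-only, and its MZ-header
// relocation entry disappears along with the segment part.
int near_conversion_savings(int pointer_count) {
	const int pointer_shrink = (4 - 2); // far (segment+offset) -> near (offset)
	const int relocation_entry = 2;     // per-pointer entry in the MZ header
	return (pointer_count * (pointer_shrink + relocation_entry));
}
```

This also matches the per-binary numbers: the 300 bytes in the All Cast sequence correspond to exactly 75 converted pointers.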
Alright, another idealistic bonus goal reached! That means we're only missing a single aspect to reach feature parity with the debloated TH01 build:
Replicating TH02-TH05's GAME.BAT in C++
In TH03, this is slightly more involved. We not only need to launch PMD using this technique, but also apply it to the INT vector set program and SPRITE16. 📝 You know the way this goes:
Unfortunately though, the fixed position of all these TSRs would still prevent the game allocation from being replaced with a binary that asks for more memory than the one this block was initially allocated for. In TH01, this would have been a minor issue because it only applied to hot-reloading the single DEBLOAT.EXE or ANNIV.EXE that contains all game code. For the other four games, however, we still keep the larger MAIN.EXE as a separate binary, and most likely will do so for the foreseeable future. And we're surely not getting into the business of moving already allocated TSRs…
So we're back to the technique from two years ago after all. Let's precalculate the size of each TSR, push that TSR to the top of conventional RAM by temporarily claiming all free memory minus its expected size, and then we get…
…a ZUN.COM spawn failure from DOS as we try to start the ZUNINIT sub-binary.
Yup. Thanks to ZUN's fantastic idea of bundling these small utility tools and TSRs into a single binary that's larger than each individual TSR, we can't just reuse the strategy that worked for TH01. DOS must load the entirety of ZUN.COM into conventional RAM before the bundling code gets a chance to shift the selected sub-binary to the top of the program's memory block and then reduce the size of that block.
So how are we going to solve this?
- We could ship the individual small binaries bundled in ZUN.COM. But that would defeat the whole point of reducing clutter in the game directory, being even worse than the batch file we're trying to eliminate.
- We could reserve the entire required size of ZUN.COM instead of just the size we expect for each TSR. But that would leave the difference between ZUN.COM and the TSR as an unallocated block we can't do anything with, fragmenting the DOS heap as a result:
But if we can't get rid of ZUN.COM's high load-time memory requirements, how about using that memory more productively? Is there a way we could maybe spawn the other TSRs into the hole left by ZUN.COM after it went resident?
Let's take a step back from individual TSRs and instead look at the full picture of spawning a bundle of TSRs in a defined order. First, we determine both the binary size (file size of the .COM binary + Program Segment Prefix + 256 bytes of stack) and the resident size (the size of its memory block after it goes resident) of each TSR. With these metrics, we can calculate a minimum and resident size for the full bundle by simulating the TSR spawns in order:
```cpp
uint32_t bundle_size_min = 0;
uint32_t bundle_size_resident = 0;
for(const auto& tsr : tsrs) {
	// Since DOS has freed all excess binary memory before we get to spawn a new TSR, the new
	// one will end up next to the previous resident allocations. We only need to consider
	// the previous minimum size because it might be larger than the one we calculate here.
	bundle_size_min = std::max((bundle_size_resident + tsr.size_binary), bundle_size_min);
	bundle_size_resident += tsr.size_resident;
}
```

Let's step through the bundle construction for TH03:
| TSR | Binary | Resident | Bundle minimum | Bundle resident | Naive |
|---|---|---|---|---|---|
| ZUNINIT (ZUN.COM) | 23,276 | 1,056 | 23,276 | 1,056 | 23,276 |
| SPRITE16 (ZUN.COM) | 23,276 | 36,528 | 24,332 | 37,584 | 59,804 |
| PMD86.COM | 29,295 | 30,144 | 66,879 | 67,728 | 89,948 |
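Plugging TH03's numbers into that loop reproduces the table above. Here's a self-contained version of the same simulation:

```cpp
#include <stdint.h>
#include <algorithm>
#include <vector>

struct TSR {
	uint32_t size_binary;   // .COM file + PSP + 256 bytes of stack
	uint32_t size_resident; // memory block size after going resident
};

struct BundleSize {
	uint32_t min;
	uint32_t resident;
};

// Simulates the in-order TSR spawns from the loop in the post.
BundleSize bundle_size(const std::vector<TSR>& tsrs) {
	BundleSize ret = { 0, 0 };
	for(const auto& tsr : tsrs) {
		ret.min = std::max((ret.resident + tsr.size_binary), ret.min);
		ret.resident += tsr.size_resident;
	}
	return ret;
}
```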
Then, we only need to resize our main memory block a single time to leave a gap at the top of conventional RAM whose size matches the larger of the minimum or resident bundle sizes. If we then spawn the TSRs into this gap, we indeed save 22,220 bytes over the naive approach! Let's visualize the resulting memory layout with TH02 because there's a nice detail with MMD and PMD:
However, there's one crucial detail in all of this that would prove to be more complicated:
Calculating correct resident sizes
In TH01, this was no big deal. MDRV98 was the only TSR we had to care about, and there was no reason not to just replicate its simple resident size calculation within the code. After all, people would either run the version bundled with the game or the smaller previous version if they played on a real-hardware CanBe model. No one really cares about MDRV98 beyond that level; the driver is almost universally disliked for just not being PMD, which managed to attract a sizable community, documentation, and even new developments over the years. A PMD port of TH01 has been one of the most common mod requests as well.
The TSRs in later games, however, are much more flexible. We compile both ZUNINIT and SPRITE16 from source and should therefore expect people to mod them, but these two in particular might just be considered uninteresting and static enough to justify hardcoding their sizes. But this approach utterly breaks with PMD, whose chip-specific variants come in multiple versions depending on the game:
| Driver | | Versions | |
|---|---|---|---|
| PMD.COM | 4.8l (1996-12-28), 14,336 bytes | | |
| PMDB2.COM (ADPCM) | 4.8l (1996-12-28), 18,496 bytes | 4.8o (1997-06-19), 18,592 bytes | |
| PMD86.COM (86PCM) | 4.8l (1996-12-28), 19,904 bytes | 4.8o (1997-06-19), 19,984 bytes | |
| PMDPPZ.COM (PPZ8/CanBe) | 4.8l (1996-12-28), 20,768 bytes | 4.8o (1997-06-19), 21,024 bytes | |
In theory, nothing stops us from hardcoding these sizes for each game as well. But these physical details about specific PMD versions are even less of a property of the game. There's no reason why modders shouldn't be able to replace any of the hardware-specific driver versions with any other – and given the sizable PMD composer and arranger community, this is a much more likely kind of mod to happen. SSG-EG, anyone?
But how could we figure out the required resident size of arbitrary PMD versions without hardcoding anything? From the outside, we can only really know for sure by running the driver and seeing how much memory it keeps resident…
…so that's exactly what we need to do. The merged binaries spawn each driver three times during setup – once to figure out its size, a second time to remove this test TSR, and a third time to respawn the TSR at its designated place at the top of conventional memory. And if we have such a system in place, nothing stops us from applying it to all other TSRs as well, removing the need to precalculate or hardcode any size… well, except for SPRITE16, which still needs a hack to factor in its extra two blocks on the DOS heap. In TH03, these 2×3 additional processes do slow down startup by about 6 frames on our target 66 MHz Neko Project configuration when compared to the batch file, which should still be tolerable relative to the .PI load times we removed by switching to PiLoad.
The whole feature has a few other nice properties as well:
- Since this entire GAME.BAT replica should be optional, we need a reliable way of detecting whether we were started from GAME.BAT. Checking whether all of a game's TSRs are already resident is the obvious choice here. But then, we can even do one better and only start the specific TSRs that aren't resident by the time our merged binary is started. Of course, removing any non-ZUN.COM TSR from the bundle will invariably leave gaps in the DOS heap, but we do gain an extra bit of resilience since the game at least starts in case of a messed-up batch file.
- If we do see all TSRs in memory though, we also skip TH02's and TH03's bouncing-ball ZUN Soft logo as well as TH05's gaiji upload, 📝 matching the behavior I ended up with in TH01. After all, we can't validate whether those were already run or not. If you remove the zun -g line from an edited version of TH05's GAME.BAT that launches DEBLOAT.EXE instead of OP.EXE, you'd therefore get the same gaiji- and HUD-less game that you'd get with ZUN's original binaries.
- We also don't spawn TH04's and TH05's memory checks from C++ for a similar reason. Their hardcoded memory values assume that the checks are run from GAME.BAT before the game gets loaded, which would obviously cause them to fail if all menu and cutscene code is already loaded into conventional RAM. After merging that code into a single binary, there's not much of a point to such an upfront check either:
  - If there wasn't enough memory to launch DEBLOAT.EXE/ANNIV.EXE in the first place, you'd immediately get to know.
  - If the single DOS-heap-allocating call to mem_assign_dos() failed, we should probably adopt ZUN's original errors to tell you about it in detail, but the game would also refuse to start immediately. This must necessarily be one of the first function calls made by each binary.
  - If either of these two issues occurred for just a game's MAIN.EXE, it would be somewhat inconvenient to always go through the title screen animation and the main menu to test any new memory setup, but it wouldn't be a big deal either.
  - The original games did have the theoretical issue that their MAINE.EXE/MAINL.EXE could have required more memory than either OP.EXE or MAIN.EXE. Without an upfront check for the expected size of MAINE.EXE/MAINL.EXE, a lack of memory could have meant losing a run to an out-of-memory crash upon switching to MAINE.EXE/MAINL.EXE, where scores and clear flags get written to disk. In practice, none of the games actually have this issue, and merging the two binaries avoids it entirely.
- These merged binaries also integrate PMDPPZ/CanBe support via the -c or --canbe option. It is quite silly how the community refers to the combination of PMDPPZ.COM and GAMECB.BAT as a "CanBe patch", since this is a strict surface-level addition and doesn't modify anything. Now that my package integrates at least one of the two required parts, can we maybe stop calling it like that? You even get a nice error message in case PMDPPZ.COM is still missing from your game directory!
And then you test with the actual ZUN.COM and notice that you're still not done:
- The INT vector set program sets up handlers for INT 5 and INT 6, which collide with Turbo C++ 4.0J's implementation of signal(2). If your program only consists of its main process and the TSR you launch from it, this is no problem as long as you shut down the TSR before your process. However, we want to launch DEBLOATM.EXE/ANNIVM.EXE via execl() from the same process that launched the TSR. You'd think that Borland's signal() implementation would then install an atexit() handler to restore the specific hooked interrupt vector at shutdown. But no: execl() unconditionally resets all interrupts that signal() can possibly hook to their original handlers during libc initialization, even if your program never calls signal(). Hence, execl() would not only remove ZUN's INT 5 and INT 6 handlers if they were set up by a C++-spawned ZUNINIT process, but also leak said process: ZUNINIT's -r command locates the resident process via the segment part of the system-wide INT 6 handler, which obviously no longer works after Borland overwrote that handler.
  Thankfully, Borland's function pointers for the original handlers must come with public symbols to remain accessible from two different places in the standard library. Overwriting these pointers after spawning and removing the ZUNINIT TSR is therefore enough to work around this dumb issue.
- Bundling all these small utility programs into ZUN.COM was apparently not enough for ZUN, and so he additionally compressed TH03's and TH04's ZUN.COM using Diet. This means that these binaries also have to first decompress themselves before they can unbundle and actually launch the requested sub-binary.
  Any compressed binary necessarily decompresses into a process larger than the size of its binary file, and the .COM format has no way of expressing that larger size. Dynamically resizing the program's DOS memory block at startup could work, but Diet made the much more reliable choice of turning such .COM binaries into .EXE binaries, which can declaratively request more memory. Although it certainly is questionable how these binaries retain their original .COM extension…
Thus, our TSR size calculation code also needs to support .EXE binaries. The implementation is not complicated at all; you read the MZ header and adapt the single expression for calculating the minimum size from DOSBox-X's source code. But then, we're up for a major disappointment once we see how Diet requests almost one full 64 KiB segment to fit both its compressed and decompressed payload. This doesn't matter for TH03, where SPRITE16 allocates an extra 32 KB for alpha channels that would be placed into that extra bit of memory allocated for Diet before. But TH04 doesn't have a similarly sized third TSR, which leaves us with an unsightly 34,944-byte hole at the top of the DOS heap:
| TSR | Binary | Resident | Bundle minimum | Bundle resident |
|---|---|---|---|---|
| ZUNINIT (ZUN.COM) | 13,394 | 784 | 13,394 | 784 |
| PMD86.COM | 29,383 | 30,224 | 30,167 | 31,008 |

| TSR | Binary | Resident | Bundle minimum | Bundle resident |
|---|---|---|---|---|
| ZUNINIT (ZUN.COM) | 65,968 | 784 | 65,968 | 784 |
| PMD86.COM | 29,383 | 30,224 | 66,752 | 31,008 |
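The minimum-size expression mentioned above only needs a few fields from the MZ header. This is a sketch of the usual DOS loader formula rather than DOSBox-X's exact code; the field names are the standard MZ ones, and the sample values in the test are made up:

```cpp
#include <stdint.h>

// The MZ header fields that go into the minimum-allocation calculation.
struct MZHeader {
	uint16_t e_cblp;     // bytes used in the last 512-byte page (0 = all 512)
	uint16_t e_cp;       // total 512-byte pages in the file, last one included
	uint16_t e_cparhdr;  // header size, in 16-byte paragraphs
	uint16_t e_minalloc; // extra paragraphs required beyond the load image
};

// Minimum conventional-RAM block for loading this .EXE, in bytes, including
// the 256-byte Program Segment Prefix. Sketch of the usual DOS loader
// formula, not copied verbatim from DOSBox-X.
uint32_t exe_size_min(const MZHeader& h) {
	uint32_t file_size = ((uint32_t)h.e_cp * 512);
	if(h.e_cblp != 0) {
		file_size -= (512 - h.e_cblp);
	}
	const uint32_t image_size = (file_size - ((uint32_t)h.e_cparhdr * 16));
	return (256 + image_size + ((uint32_t)h.e_minalloc * 16));
}
```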
It's this TH04 issue that raises the question of whether this whole TSR bundling solution was even worthwhile in the first place. It sure was an interesting problem to solve, but it'd be much simpler and less bloated to just integrate the INT vector set program into every binary. For TH03, we could similarly integrate all SPRITE16 functionality directly into DEBLOATM.EXE/ANNIVM.EXE and still end up with a smaller-than-original binary after removing Borland's C++ exception handler. That would leave PMD and MMD as the only TSRs we'd need to spawn from C++, and those do have good reasons to be separate from game code.
Oh well, gotta get TH03's MAIN.EXE position-independent first…
Also, the usual caveats from two years ago still apply. This whole trick of pushing TSRs to the top of conventional RAM still relies on witchcraft that may not work on certain DOS kernels. For developers, tinkerers, and people who know what they're doing, it does succeed at nicely decluttering the game directory. But for… ahem, distributors, I still recommend shipping the modified version of GAME.BAT and GAMECB.BAT in the package below to defend against any potential stability issues.
Finally, if the performance improvements aren't enough of a reason to upgrade to these new builds, how about an actual new feature? TH03's Anniversary Edition now lets you quit out of the VS Start menu via either ESC or a new menu item, without going through the Select screen. 🙌
☪ The Phantasmagoria of Dim.Dream on the other side seemed like the least bad option here. That outline is indeed created by rendering every line 9 times…
And with that, I'm finally done with 2025's most indulgent subproject! Let's quickly check the overall impact on the codebase:
```shell
$ git diff --stat debloated~193 debloated -- . ":(exclude)Tupfile.lua" ":(exclude)build_dumb.bat" ":(exclude)unused/"
[…]
 259 files changed, 4145 insertions(+), 8099 deletions(-)
```
That's almost 4,000 lines of ad-hoc PC-98-native graphics code, bloat, landmines, bloat- and landmine-documenting comments, and binary-specific inconsistencies removed from game code, in exchange for…
```shell
$ git diff --stat master~203 master -- platform/
[…]
 28 files changed, 2213 insertions(+), 258 deletions(-)
```
…not even half that many additional lines in the platform layer. And here's what all of this compiles to:
ReC98 (version P0323)
2025-09-29-ReC98.zip
After the Shuusou Gyoku debacle and the many last-minute fixes that cropped up while I was writing this post, I'm not particularly confident in these builds, despite the weeks of testing that went into them. Still, we've got to start somewhere. At least for TH03, we're bound to quickly find any issues that slipped through the cracks while I'm implementing netplay into the Anniversary Edition.
Next up: The very quick round of 📝 Shuusou Gyoku maintenance and forward compatibility I announced in April, to clear out the backlog a bit. This whole series also really stretched the concept of what 11 pushes should be, so I'll charge 2 pushes for that maintenance round to compensate. In exchange, I'll also incorporate a small bit of new Windows 98 feature work, since it fits nicely with the cleanup work.