⮜ Blog

⮜ List of tags

Showing all posts tagged
and

📝 Posted:
🚚 Summary of:
P0264, P0265
Commits:
46cd6e7...78728f6, 78728f6...ff19bed
💰 Funded by:
Blue Bolt, [Anonymous], iruleatgames
🏷 Tags:

Oh, it's 2024 already and I didn't even have a delivery for December or January? Yeah… I can only repeat what I said at the end of November, although the finish line is actually in sight now. With 10 pushes across 4 repositories and a blog post that has already reached a word count of 9,240, the Shuusou Gyoku SC-88Pro BGM release is going to break 📝 both the push record set by TH01 Sariel two years ago, and 📝 the blog post length record set by the last Shuusou Gyoku delivery. Until that's done though, let's clear some more PC-98 Touhou pushes out of the backlog, and continue the preparation work for the non-ASCII translation project starting later this year.

But first, we got another free bugfix according to my policy! 📝 Back in April 2022 when I researched the Divide Error crash that can occur in TH04's Stage 4 Marisa fight, I proposed and implemented four possible workarounds and let the community pick one of them for the generally recommended small bugfix mod. I still pushed the others onto individual branches in case the gameplay community ever wants to look more closely into them and maybe pick a different one… except that I accidentally pushed the wrong code for the warp workaround, probably because I got confused with the second warp variant I developed later on.
Fortunately, I still had the intended code for both variants lying around, and used the occasion to merge the current master branch into all of these mod branches. Thanks to wyatt8740 for spotting and reporting this oversight!

  1. The Music Room background masking effect
  2. The GRCG's plane disabling flags
  3. Text color restrictions
  4. The entire messy rest of the Music Room code
  5. TH04's partially consistent congratulation picture on Easy Mode
  6. TH02's boss position and damage variables

As the final piece of code shared in largely identical form between 4 of the 5 games, the Music Rooms were the biggest remaining piece of low-hanging fruit that guaranteed big finalization% gains for comparatively little effort. They seemed to be especially easy because I already decompiled TH02's Music Room together with the rest of that game's OP.EXE back in early 2015, when this project focused on just raw decompilation with little to no research. 9 years of increased standards later though, it turns out that I missed a lot of details, and ended up renaming most variables and functions. Combined with larger-than-expected changes in later games and the usual quality level of ZUN's menu code, this ended up taking noticeably longer than the single push I expected.

The undoubtedly most interesting part about this screen is the animation in the background, with the spinning and falling polygons cutting into a single-color background to reveal a spacey image below. However, the only background image loaded in the Music Room is OP3.PI (TH02/TH03) or MUSIC3.PI (TH04/TH05), which looks like this in a .PI viewer or when converted into another image format with the usual tools:

TH02's Music Room background in its on-disk state TH03's Music Room background in its on-disk state TH04's Music Room background in its on-disk state TH05's Music Room background in its on-disk state
Let's call this "the blank image".

That is definitely the color that appears on top of the polygons, but where is the spacey background? If there is no other .PI file where it could come from, it has to be somewhere in that same file, right? :thonk:
And indeed: This effect is another bitplane/color palette trick, exactly like the 📝 three falling stars in the background of TH04's Stage 5. If we set every bit on the first bitplane and thus change any of the resulting even hardware palette color indices to odd ones, we reveal a full second 8-color sub-image hiding in the same .PI file:

TH02's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH03's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH04's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom TH05's Music Room background, with all bits in the first bitplane set to reveal the spacey background image, and the full color palette at the bottom
The spacey sub-image. Never before seen!1!! …OK, touhou-memories beat me by a month. Let's add each image's full 16-color palette to deliver some additional value.

On a high level, the first bitplane therefore acts as a stencil buffer that selects between the blank and spacey sub-image for every pixel. The important part here, however, is that the first bitplane of the blank sub-images does not consist entirely of 0 bits, but does have 1 bits at the pixels that represent the caption that's supposed to be overlaid on top of the animation. Since there now are some pixels that should always be taken from the spacey sub-image regardless of whether they're covered by a polygon, the game can no longer just clear the first bitplane at the start of every frame. Instead, it has to keep a separate copy of the first bitplane's original state (called nopoly_B in the code), captured right after it blitted the .PI image to VRAM. Turns out that this copy also comes in quite handy with the text, but more on that later.


Then, the game simply draws polygons onto only the reblitted first bitplane to conditionally set the respective bits. ZUN used master.lib's grcg_polygon_c() function for this, which means that we can entirely thank the uncredited master.lib developers for this iconic animation – if they hadn't included such a function, the Music Rooms would most certainly look completely different.
This is where we get to complete the series on the PC-98 GRCG chip with the last remaining four bits of its mode register. So far, we only needed the highest bit (0x80) to either activate or deactivate it, and the bit below (0x40) to choose between the 📝 RMW and 📝 TCR/📝 TDW modes. But you can also use the lowest four bits to restrict the GRCG's operations to any subset of the four bitplanes, leaving the other ones untouched:

// Enable the GRCG (0x80) in regular RMW mode (0x40). All bitplanes are
// enabled and written according to the contents of the tile register.
outportb(0x7C, 0xC0);

// The same, but limiting writes to the first bitplane by disabling the
// second (0x02), third (0x04), and fourth (0x08) one, as done in the
// PC-98 Touhou Music Rooms.
outportb(0x7C, 0xCE);

// Regular GRCG blitting code to any VRAM segment…
pokeb(0xA8000, offset, …);

// We're done, turn off the GRCG.
outportb(0x7C, 0x00);

This could be used for some unusual effects when writing to two or three of the four planes, but it seems rather pointless for this specific case at first. If we only want to write to a single plane, why not just do so directly, without the GRCG? Using that chip only involves more hardware and is therefore slower by definition, and the blitting code would be the same, right?
This is another one of these questions that would be interesting to benchmark one day, but in this case, the reason is purely practical: All of master.lib's polygon drawing functions expect the GRCG to be running in RMW mode. They write their pixels as bitmasks where 1 and 0 represent pixels that should or should not change, and leave it to the GRCG to combine these masks with its tile register and OR the result into the bitplanes instead of doing so themselves. Since GRCG writes are done via MOV instructions, not using the GRCG would turn these bitmasks into actual dot patterns, overwriting any previous contents of each VRAM byte that gets modified.
Technically, you'd only have to replace a few MOV instructions with OR to build a non-GRCG version of such a function, but why would you do that if you haven't measured polygon drawing to be an actual bottleneck.

Three overlapping Music Room polygons rendered using master.lib's grcg_polygon_c() function with a disabled GRCGThree overlapping Music Room polygons rendered as in the original game, with the GRCG enabled
An example with three polygons drawn from top to bottom. Without the GRCG, edges of later polygons overwrite any previously drawn pixels within the same VRAM byte. Note how treating bitmasks as dot patterns corrupts even those areas where the background image had nonzero bits in its first bitplane.

As far as complexity is concerned though, the worst part is the implicit logic that allows all this text to show up on top of the polygons in the first place. If every single piece of text is only rendered a single time, how can it appear on top of the polygons if those are drawn every frame?
Depending on the game (because of course it's game-specific), the answer involves either the individual bits of the text color index or the actual contents of the palette:

The contents of nopoly_B with each game's first track selected.

Finally, here's a list of all the smaller details that turn the Music Rooms into such a mess:

And that's all the Music Rooms! The OP.EXE binaries of TH04 and especially TH05 are now very close to being 100% RE'd, with only the respective High Score menus and TH04's title animation still missing. As for actual completion though, the finalization% metric is more relevant as it also includes the ZUN Soft logo, which I RE'd on paper but haven't decompiled. I'm 📝 still hoping that this will be the final piece of code I decompile for these two games, and that no one pays to get it done earlier… :onricdennat:


For the rest of the second push, there was a specific goal I wanted to reach for the remaining anything budget, which was blocked by a few functions at the beginning of TH04's and TH05's MAINE.EXE. In another anticlimactic development, this involved yet another way too early decompilation of a main() function…
Generally, this main() function just calls the top-level functions of all other ending-related screens in sequence, but it also handles the TH04-exclusive congratulating All Clear images within itself. After a 1CC, these are an additional reward on top of the Good Ending, showing the player character wearing a different outfit depending on the selected difficulty. On Easy Mode, however, the Good Ending is unattainable because the game always ends after Stage 5 with a Bad Ending, but ZUN still chose to show the EASY ALL CLEAR!! image in this case, regardless of how many continues you used.
While this might seem inconsistent with the other difficulties, it is consistent within Easy Mode itself, as the enforced Bad Ending after Stage 5 also doesn't distinguish between the number of continues. Also, Try to Normal Rank!! could very well be ZUN's roundabout way of implying "because this is how you avoid the Bad Ending".

With that out of the way, I was finally able to separate the VRAM text renderer of TH04 and TH05 into its own assembly unit, 📝 finishing the technical debt repayment project that I couldn't complete in 2021 due to assembly-time code segment label arithmetic in the data segment. This now allows me to translate this undecompilable self-modifying mess of ASM into C++ for the non-ASCII translation project, and thus unify the text renderers of all games and enhance them with support for Unicode characters loaded from a bitmap font. As the final finalized function in the SHARED segment, it also allowed me to remove 143 lines of particularly ugly segmentation workarounds 🙌


The remaining 1/6th of the second push provided the perfect occasion for some light TH02 PI work. The global boss position and damage variables represented some equally low-hanging fruit, being easily identified global variables that aren't part of a larger structure in this game. In an interesting twist, TH02 is the only game that uses an increasing damage value to track boss health rather than decreasing HP, and also doesn't internally distinguish between bosses and midbosses as far as these variables are concerned. Obviously, there's quite a bit of state left to be RE'd, not least because Marisa is doing her own thing with a bunch of redundant copies of her position, but that was too complex to figure out right now.

Also doing their own thing are the Five Magic Stones, which need five positions rather than a single one. Since they don't move, the game doesn't have to keep 📝 separate position variables for both VRAM pages, and can handle their positions in a much simpler way that made for a nice final commit.
And for the first time in a long while, I quite like what ZUN did there! Not only are their positions stored in an array that is indexed with a consistent ID for every stone, but these IDs also follow the order you fight the stones in: The two inner ones use 0 and 1, the two outer ones use 2 and 3, and the one in the center uses 4. This might look like an odd choice at first because it doesn't match their horizontal order on the playfield. But then you notice that ZUN uses this property in the respective phase control functions to iterate over only the subrange of active stones, and you realize how brilliant it actually is.

Screenshot of TH02's Five Magic Stones, with the first two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the second two (both internally and in the order you fight them in) alive and activated Screenshot of TH02's Five Magic Stones, with the last one (both internally and in the order you fight them in) alive and activated

This seems like a really basic thing to get excited about, especially since the rest of their data layout sure isn't perfect. Splitting each piece of state and even the individual X and Y coordinates into separate 5-element arrays is still counter-productive because the game ends up paying more memory and CPU cycles to recalculate the element offsets over and over again than this would have ever saved in cache misses on a 486. But that's a minor issue that could be fixed with a few regex replacements, not a misdesigned architecture that would require a full rewrite to clean it up. Compared to the hardcoded and bloated mess that was 📝 YuugenMagan's five eyes, this is definitely an improvement worthy of the good-code tag. The first actual one in two years, and a welcome change after the Music Room!

These three pieces of data alone yielded a whopping 5% of overall TH02 PI in just 1/6th of a push, bringing that game comfortably over the 60% PI mark. MAINE.EXE is guaranteed to reach 100% PI before I start working on the non-ASCII translations, but at this rate, it might even be realistic to go for 100% PI on MAIN.EXE as well? Or at least technical position independence, without the false positives.

Next up: Shuusou Gyoku SC-88Pro BGM. It's going to be wild.

📝 Posted:
🚚 Summary of:
P0229, P0230, P0231, P0232, P0233, P0234
Commits:
6370f96...d535d87, d535d87...ca523b4, ca523b4...05a49b9, f7ef7f8...abeaf85, abeaf85...dbc5b51, dd2265c...12f29c6
💰 Funded by:
Ember2528, [Anonymous]
🏷 Tags:

128 commits! Who would have thought that the ideal first release of the TH01 Anniversary Edition would involve so much maintenance, and raise so many research questions? It's almost as if the real work only starts after the 100% finalization mark… Once again, I had to steal some funding from the reserved JIS trail word pushes to cover everything I liked to research, which means that the next towards the anything goal will repay this debt. Luckily, this doesn't affect any immediate plans, as I'll be spending March with tasks that are already fully funded.

So, how did this end up so massive? The list of things I originally set out to do was pretty short:

  1. Build entire game into single executable
  2. Fix rendering issues in the one or two most important parts of the game for a good initial impression

But even the first point already started with tons of little cleanup commits. A part of them can definitely be blamed on the rush to hit the 100% decompilation mark before the 25th anniversary last August. However, all the structural changes that I can't commit to master reveal how much of a mess the TH01 codebase actually is.
Merging the executables is mainly difficult because of all the inconsistencies between REIIDEN.EXE and FUUIN.EXE. The worst parts can be found in the REYHI*.DAT format code and the High Score menu, but the little things are just as annoying, like how the current score is an unsigned variable in REIIDEN.EXE, but a signed one in FUUIN.EXE. :zunpet: If it takes me this long and this many commits just to sort out all of these issues, it's no wonder that the only thing I've seen being done with this codebase since TH01's 100% decompilation was a single porting attempt that ended in a rather quick ragequit.
So why are we merging the executables in preparation for the Anniversary Edition, and not waiting with it until we start doing ports?

The game actually is so bloated that the combined binary ended up smaller than the original REIIDEN.EXE. If all you see are the file sizes of the original three executables, this might look like a pretty impressive feat. Like, how can we possibly get 407,812 bytes into less than 238,612 bytes, without using compression?
If you've ever looked at the linker map though, it's not at all surprising. Excluding the aforementioned inconsistencies that are hard to quantify, OP.EXE and FUUIN.EXE only feature 5,767 and 6,475 bytes of unique code and data, respectively. All other code in these binaries is already part of REIIDEN.EXE, with more than half of the size coming from the Borland C++ runtime. The single worst offender here is the C++ exception handler that Borland forces onto every non-.COM binary by default, which alone adds 20,512 bytes even if your binary doesn't use C++ exceptions.
On a more hilarious note, this single line is responsible for pulling another unnecessary 14,242 bytes into OP.EXE and FUUIN.EXE. This floating-point multiplication is completely unnecessary in this context because all possible parameters are integers, but it's enough for Turbo C++ and TLINK to pull in the entire x87 FPU emulation machinery. These two binaries don't even draw lines, but since this function is part of the general graphics code translation unit and contains other functions that these binaries do need, TLINK links in the entire thing. Maybe, multiple executables aren't the best choice either if you use a linker that can't do dead code elimination…

Since the 📝 Orb's physics do turn the entire precision of a double variable into gameplay effects, it's not feasible to ever get rid of all FPU code in TH01. The exception handler, however, can be removed, which easily brings the combined binary below the size of the original REIIDEN.EXE. Compiling all code with a single set of compiler optimization flags, including the more x86-friendly pascal calling convention, then gets us a few more KB on top. As does, of course, removing unused code: The only remaining purpose of features such as 📝 resident palettes is to potentially make porting more difficult for anyone who doesn't immediately realize that nothing in the game uses these functions.
Technically, all unused code would be bloat, but for now, I'm keeping the parts that may tell stories about the game's development history (such as unused effects or the 📝 mouse cursor), or that might help with debugging. Even with that in mind, I've only scratched the surface when it comes to bloat removal, and the binary is only going to get smaller from here. A lot smaller.

If only we now could start MDRV98 from this new combined binary, we wouldn't need a second batch file either…


Which brings us to the first big research question of this delivery. Using the C spawn() function works fine on this compiler, so spawn("MDRV98.COM") would be all we need to do, right? Except that the game crashes very soon after that subprocess returned. :thonk:
So it's not going to be that easy if the spawned process is a TSR. But why should this be a problem? Let's take a look at the DOS heap, and how DOS lays out processes in conventional memory if we launch the game regularly through GAME.BAT:

The rough layout of the DOS heap when launching TH01 from GAME.BAT.

The batch file starts MDRV98 first, which will therefore end up below the game in conventional memory. This is perfect for a TSR: The program can resize itself arbitrarily before returning to DOS, and the rest of memory will be left over for the game. If we assume such a layout, a DOS program can implement a custom memory allocator in a very simple way, as it only has to search for free memory in one direction – and this is exactly how Borland implemented the C heap for functions like malloc() and free(), and the C++ new and delete operators.
But if we spawn MDRV98 after starting TH01, well…

MDRV98 will spawn in the next free memory location, allocate itself, return to TH01… which suddenly finds its C heap blocked from growing. As a result, the next big allocation will immediately fail with a rather misleading "out of memory" error.

So, what can we do about this? Still in a bloat removal mindset, my gut reaction was to just throw out Borland's C heap implementation, and replace it with a very thin wrapper around the DOS heap as managed by INT 21h, AH=48h/49h/4Ah. Like, why did these DOS compilers even bother with a custom allocator in the first place if DOS already comes with a perfectly fine native one? Using the native allocator would completely erase the distinction between TSR memory and game memory, and inherently allow the game to allocate beyond MDRV98.
I did in fact implement this, and noticed even more benefits:

Ultimately though, the drawbacks became too significant. Most of them are related to the PC-98 Touhou games only ever creating a single DOS process, even though they contain multiple executables. Switching executables is done via exec(), which resizes a program's main allocation to match the new binary and then overwrites the old program image with the new one. If you've ever wondered why DOSBox-X only ever shows OP as the active process name in the title bar, you now know why. As far as DOS is concerned, it's still the same OP.EXE process rooted at the same segment, and exec() doesn't bother rewriting the name either. Most importantly though, this is how REIIDEN.EXE can launch into another REIIDEN.EXE process even if there are less than 238,612 bytes free when exec() is called, and without consuming more memory for every successive binary.
For now, ANNIV.EXE still re-exec()s itself at every point where the original game did, as ZUN's original code really depends on being reinitialized at boss and scene boundaries. The resulting accidental semi-hot reloading is also a useful property to retain during development.
So why is the DOS heap a bad idea for regular game allocation after all?

I could release this DOS heap wrapper in unused form for another push if anyone's interested, but for now, I'm pretty happy with not actually using it in the games. Instead, let's stay with the Borland C heap, and find a way to push MDRV98 to the very top of conventional RAM. Like this:

Which is much easier said than done. It would be nice if we could just use the last fit allocation strategy here, but .COM executables always receive all free memory by default anyway, which eliminates any difference between the strategies.
But we can still change memory itself. So let's temporarily claim all remaining free memory, minus the exact amount we need for MDRV98, for our process. Then, the only remaining free space to spawn MDRV98 is at the exact place where we want it to be:

Obviously, we release all the additional memory after spawning MDRV98.

Now we only need to know how much memory to not temporarily allocate. First, we need to replicate the assumption that MDRV98's -M7 command-line parameter corresponds to a resident size of 23,552 bytes. This is not as bad as it seems, because the -M parameter explicitly has a KiB unit, and we can nicely abstract it away for the API.
The (env.) block though? Its minimum size equals the combined length of all environment variables passed to the process, but its maximum size is… not limited at all?! As in, DOS implementations can add and have historically added more free space because some programs insisted on storing their own new environment variables in this exact segment. DOSBox and DOSBox-X follow this tradition by providing a configuration option for the additional amount of environment space, with the latter adding 1024 additional bytes by default, y'know, just in case someone wants to compile FreeDOS on a slow emulator. It's not even worth sending a bug report for this specific case, because it's only a symptom of the fact that unexpectedly large program environment blocks can and will happen, and are to be expected in DOS land.
So thanks to this cruel joke, it's technically impossible to achieve what we want to do there. Hooray! The only thing we can kind of do here is an educated guess: Sum up the length of all environment variables in our environment block, compare that length against the allocated size of the block, and assume that the MDRV98 process will get as much additional memory as our process got. 🤷

The remaining hurdles came courtesy of some Borland C runtime implementation details. You would think that the temporary reallocation could even be done in pure C using the sbrk(), coreleft(), and brk() functions, but all values passed to or returned from these functions are inaccurate because they don't factor in the aforementioned KiB padding to the underlying DOS memory block. So we have to directly use the DOS syscalls after all. Which at least means that learning about them wasn't completely useless…
The final issue is caused inside Borland's spawn() implementation. The environment block for the child process is built out of all the strings reachable from C's environ pointer, which is what that FreeDOS build process should have used. Coalescing them into a single buffer involves yet another C heap allocation… and since we didn't report our DOS memory block manipulation back to the C heap, the malloc() call might think it needs to request more memory from DOS. This resets the DOS memory block back to its intended level, undoing our manipulation right before the actual INT 21h, AH=4Bh EXEC syscall. Or in short:

Manipulate DOS heap ➜ spawn() call ➜ _LoadProg() ➜ allocate and prepare environment block ➜ _spawn() ➜ DOS EXEC syscall

The obvious solution: Replace _LoadProg(), implement the coalescing ourselves, and do it before the heap manipulation. Fortunately, Borland's internal low-level _spawn() function is not static, so we can call it ourselves whenever we want to:

Allocate and prepare environment block ➜ manipulate DOS heap ➜ _spawn() call ➜ EXEC syscall

So yes, launching MDRV98 from C can be done, but it involves advanced witchcraft and is completely ridiculous. :tannedcirno: Launching external sound drivers from a batch file is the right way of doing things.
Fortunately, you don't have to rely on this auto-launching feature. You can still launch DEBLOAT.EXE or ANNIV.EXE from a batch file that launched MDRV98.COM before, and the binaries will detect this case and skip the attempt of launching MDRV98 from C. It's unlikely that my heuristic will ever break, but I definitely recommend replicating GAME.BAT just to be completely sure – especially for user-friendly repacks that don't want to include the original game anyway.
This is also why ANNIV.EXE doesn't launch ZUNSOFT.COM: The "correct" and stable way to launch ANNIV.EXE still involves a batch file, and I would say that expecting people to remove ZUNSOFT.COM from that file is worse than not playing the animation. It's certainly a debate we can have, though.


This deep dive into memory allocation revealed another previously undocumented bug in the original game. The RLE decompression code for the 東方靈異.伝 packfile contains two heap overflows, which are actually triggered by SinGyoku's BOSS1_3.BOS and Konngara's BOSS8_1.BOS. They only do not immediately crash the game when loading these bosses thanks to two implementation details of Borland's C heap. :zunpet:
Obviously, this is a bug we should fix, but according to the definition of bugs, that fix would be exclusive to the anniversary branch. Isn't that too restrictive for something this critical? This code is guaranteed to blow up with a different heap implementation, if only in a Debug build. :thonk: And besides, nobody would notice a fix just by looking at the game's rendered output…

Looks like we have to introduce a fourth category of weird code, in addition to the previous bloat, bug, and quirk categories, for invisible internal issues like these. Let's call it landmine, and fix them on the debloated branch as well. Thanks to Clerish for the naming inspiration!
With this new category, the full definitions for all categories have become quite extensive. Thus, they now live in CONTRIBUTING.md inside the ReC98 repository.

With the new discoveries and the new landmine category, TH01 is now at 67 bugs and 20 landmines. And the solution for the landmine in question? Simplifying the 61 lines of the original code down to 16. And yes, I'm including comments in these numbers – if the interactions of the code are complex enough to require multi-paragraph comments, these are a necessary and valid part of the code.


While we're on the topic of weird code and its visible or invisible effects, there's one thing you might be concerned about. With all the rearchitecting and data shifting we're doing on the debloated branch, what will happen to the 📝 negative glitch stages? These are the result of a clearly observable bug that, by definition, must not be fixed on the debloated branch. But given that the observable layout of the glitch stages is defined by the memory surrounding the scene stage variable, won't the debloated branch inherently alter their appearance (= ⚠️ fanfiction ⚠️), or even remove them completely?

Well, yes, it will. But we can still preserve their layout by hardcoding the exact original data that the game would originally read, and even emulate the original segment relocations and other pieces of global data.
Doing this is feasible thanks to the fact that there are only 4 glitch stages. Unfortunately, the same can't be said for the timer values, which are determined by an array lookup with the un-modulo'd stage ID. If we wanted to preserve those as well, we'd have to bundle an exact copy of the original REIIDEN.EXE data segment to preserve the values of all 32,768 negative stages you could possibly enter, together with a map of all relocations in this segment. 😵 Which I've decided against for now, since this has been going on for far too long already. Let's first see if anyone ever actually complains about details like this…


Alright, time to start the anniversary branch by rendering everything at its correct internal unaligned X position? Eh… maybe not quite yet. If we just hacked all the necessary bit-shifting code into all the format-specific blitting functions, we'd still retain all this largely redundant, bad, and slow code, and would make no progress in terms of portability. It'd be much better to first write a single generic blitter that's decently optimized, but supports all kinds of sprites to make this optimization actually worth something.
So, next research question: How would such a blitter look like? After I learned during my 📝 first foray into cycle counting that port I/O is slow on 486 CPUs, it became clear that TH04's 📝 GRCG batching for pellets was one of the more useful optimizations that probably contributed a big deal towards achieving the high bullet counts of that game. This leads to two conclusions:

Maybe we should also start by not even doing these unaligned bit shifts ourselves, and instead expect the call site to 📝 always deliver a byte-aligned sprite that is correctly preshifted, if necessary? Some day, we definitely should measure how slow runtime shifting would really be…

What we should do, however, are some further general optimizations that I would have expected from master.lib: Unrolling the vertical loop, and baking a single function for every sprite width to eliminate the horizontal loop. We can then use the widest possible x86 MOV instruction for the lowest possible number of cycles per row – for example, we'd blit a 56-wide sprite with three MOVs (32-bit + 16-bit + 8-bit), and a 64-wide one with two 32-bit MOVs.
Or maybe not? There's a lot of blitting code in both master.lib and PC-98 Touhou that checks for empty bytes within sprites to skip needlessly writing them to VRAM:

uint8_t left_half = ((uint8_t *)(sprite))[0];
uint8_t right_half = ((uint8_t *)(sprite))[1];
if(right_half != 0x00) {
	pokeb(VRAM_SEGMENT, (vram_offset + 0), left_half);
}
if(right_half != 0x00) {
	pokeb(VRAM_SEGMENT, (vram_offset + 1), right_half);
}

Which goes against everything you seem to know about computers. We aren't running on an 8-bit CPU here, so wouldn't it be faster to always write both halves of a sprite in a single operation?

uint16_t both_halves = ((uint16_t *)(sprite))[0];
pokew(VRAM_SEGMENT, vram_offset, both_halves);

That's a single CPU instruction, compared to two instructions and two branches. The only possible explanation for this would be that VRAM writes are so slow on PC-98 that you'd want to avoid them at all costs, even if that means additional branching on the CPU to do so. Or maybe that was something you would want to do on certain models with slow VRAM, but not on others?

So I wrote a benchmark to answer all these questions, and to compare my new blitter against typical TH01 blitting code:

A not really representative run on DOSBox-X. Since the master.lib sprite functions are also unbatched, I expect them to not be much faster than the naive C implementation.

2023-03-05-blitperf.zip And here are the real-hardware results I've got from the PC-9800 Central Discord server:

PC-286LS PC-9801ES PC-9821Cb/Cx PC-9821Ap3 PC-9821An PC-9821Nw133 PC-9821Ra20
80286, 12 MHz i386SX, 16 MHz 486SX, 33 MHz 486DX4, 100 MHz Pentium, 90 MHz Pentium, 133 MHz Pentium Pro, 200 MHz
1987 1989 1994 1994 1994 1997 1996
Unchecked C GRCG 36,85 38,42 26,02 26,87 3,98 4,13 2,08 2,16 1,81 1,87 0,86 0,89 1,25 1,25
MOVS GRCG 15,22 16,87 9,33 10,19 1,22 1,37 0,44 0,44
MOV GRCG 15,42 17,08 9,65 10,53 1,15 1,3 0,44 0,44
4-plane 37,23 43,97 29,2 32,96 4,44 5,01 4,39 4,67 5,11 5,32 5,61 5,74 6,63 6,64
Checking first GRCG 17,49 19,15 10,84 11,72 1,27 1,44 1,04 1,07 0,54 0,54
4-plane 46,49 53,36 35,01 38,79 5,66 6,26 5,43 5,74 6,56 6,8 8,08 8,29 10,25 10,29
Checking second GRCG 16,47 18,12 10,77 11,65 1,25 1,39 1,02 0,51 0,51
4-plane 43,41 50,26 33,79 37,82 5,22 5,81 5,14 5,43 6,18 6,4 7,57 7,77 9,58 9,62
Checking both GRCG 16,14 18,03 10,84 11,71 1,33 1,49 1,01 0,49 0,49
4-plane 43,61 50,45 34,11 37,87 5,39 5,99 4,92 5,23 5,88 6,11 7,19 7,43 9,1 9,13
Amount of frames required to render 2000 16×8 pellet sprites on a variety of PC-98 models, using the new generic blitter. Both preshifted (first column) and runtime-shifted (second column) sprites were tested; empty columns correspond to times faster than a single frame. Thanks to cuba200611, Shoutmon, cybermind, and Digmac for running the tests!

The key takeaways:

Since this won't be the only piece of game-independent and explicitly PC-98-specific custom code involved in this delivery, it makes sense to start a dedicated PC-98 platform layer. This code will gradually eliminate the dependency on master.lib and replace it with better optimized and more readable C++ code. The blitting benchmark, for example, is already implemented completely without master.lib.
While this platform layer is mainly written to generate optimal code within Turbo C++ 4.0J, it can also serve as general PC-98 documentation for everyone who prefers code over machine-translating old Japanese books. Not to mention the immediacy of having all actual relevant information in one place, which might otherwise be pretty well hidden in these books, or some obscure old text file. For example, did you know that uploading gaiji via INT 18h might end up disabling the VSync interrupt trigger, deadlocking the process on the next frame delay loop? This nuisance is not replicated by any emulators, and it's quite frustrating to encounter it when trying to run your code on real hardware. master.lib works around it by simply hooking INT 18h and unconditionally reenabling the VSync interrupt trigger after the original handler returns, and so does our platform layer.


So, with the pellet draw calls batched and routed through the new renderer, we should have gained enough free CPU cycles to disable 📝 interlaced pellet rendering without any impact on frame rates?

Well, kinda. We do get 56.4 FPS, but only together with noticeable and reproducible tearing in the top part of the playfield, suggesting exactly why ZUN interlaced the rendering in the first place. 😕 So have we already reached the limit of single-buffered PC-98 games here, or can we still do something about it?
As it turns out, the main bottleneck actually lies in the pellet unblitting code. Every EGC-"accelerated" unblitting call in TH01 is as unbatched as the pellet blitting calls were, spending an additional 17 I/O port writes per call to completely set up and shut down the EGC, every time. And since this is TH01, the two-instruction operation of changing the active PC-98 VRAM page isn't inlined either, but instead done via a function call to a faraway segment. On the 486, that's:

This sums up to

And this calculation even ignores the lack of small micro-optimizations that could further optimize the blitting loop. Multiply that by the game's pellet cap of 100, and we get a 6-digit number of wasted CPU cycles. On paper, that's roughly 1/6 of the time we have for each of our target 56.423 FPS on the game's target 33 MHz systems. Might not sound all too critical, but the single-buffered nature of the game means that we're effectively racing the beam on every frame. In turn, we have to be even more serious about performance.

So, time to also add a batched EGC API to our PC-98 platform layer? Writing our own EGC code presents a nice opportunity to finally look deeper into all its registers and configuration options, and see what exactly we can do about ZUN's enforced 16-pixel alignment.
To nobody's surprise, this alignment is completely unnecessary, and only displays a lack of knowledge about the chip. While it is true that the EGC wants VRAM to be exclusively addressed in 16-bit chunks at 16-bit-aligned addresses, it specifically provides

And it gets even better: After ⌈bitlength ÷ 16⌉ write instructions, the EGC's internal shifter state automatically reinitializes itself in preparation for blitting another row of pixels with the same initially configured bit addresses and length. This is perfect for blitting rectangles, as two I/O port writes before the start of your blitting loop are enough to define your entire rectangle.

The manual nature of reading and writing in 16-pixel chunks does come with a slight pitfall though. If the source bit address is larger than the destination bit address, the first 16-bit read won't fill the EGC's internal shift register with all pixels that should appear in the first 16-pixel destination chunk. In this case, the EGC simply won't write anything and leave the first chunk unchanged. In a 📝 regular blitting loop, however, you expect that memory to be written and immediately move on to the next chunks within the row. As a result, the actual blitting process for such a rectangle will no longer be aligned to the configured address and bit length. The first row of the rectangle will appear 16 pixels to the right of the destination address, and the second one will start at bit offset 0 with pixels from the rightmost byte of the first line, which weren't blitted and remained in the tile register.
There is an easy solution though: Before the horizontal loop on each line of the rectangle, simply read one additional 16-pixel chunk from the source location to prefill the shift register. Thankfully, it's large enough to also fit the second read of the then full 16 pixels, without dropping any pixels along the way.

And that's how we get arbitrarily unaligned rectangle copies with the EGC! Except for a small register allocation trick to use two-register addressing, there's not much use in further optimizations, as the runtime of these inter-page blit operations is dominated by the VRAM page switches anyway.

Except that T98-Next seems to disagree about the register prefilling issue:

Glitched blitting results on T98-Next when trying EGC copies where the source bit address is larger than the destination bit address

Every other emulator agrees with real hardware in this regard, so we can safely assume this to be a bug in T98-Next. Just in case this old emulator with its last release from June 2010 still has any fans left nowadays… For now though, even they can still enjoy the TH01 Anniversary Edition: The only EGC copy algorithm that TH01 actually needs is the left one during the single-buffered tests, which even that emulator gets right.
That only leaves 📝 my old offer of documenting the EGC raster ops, and we've got the EGC figured out completely!


And that did in fact remove tearing from the pellet rendering function! For the first time, we can now fight Elis, Kikuri, Sariel, and Konngara with a doubled pellet frame rate:

Switchable videos like these can nicely provide evidence that these changes have no effect on gameplay, making it easy to see that the Orb still collides with all pellets on the same frames. Also, check out the difference in remaining conventional memory (coreleft)…

With only pellets and no other animation on screen, this exact pattern presents the optimal demonstration case for the new unblitter. But as you can already tell from the invincibility sprites, we'd also need to route every other kind of sprite through the same new code. This isn't all too trivial: Most sprites are still rendered at byte-aligned positions, and their blitting APIs hide that fact by taking a pixel position regardless. This is why we can't just replace ZUN's original 16-pixel-aligned EGC unblitting function with ours, and always have to replace both the blitter and the unblitter on a per-sprite basis.
To completely remove all flickering, we'd also like to get rid of all the sprite-specific unblit ➜ update ➜ render sequences, and instead gather all unblitting code to the beginning of the game loop, before any update and rendering calls. So yeah, it will take a long time to completely get rid of all flickering. Until we're there, I recommend any backer to tell me their favorite boss, so that I can focus on getting that one rendered without any flickering. Remember that here at ReC98, we can have a Touhou character popularity contest at any time during the year, whenever the store is open! :tannedcirno:

In the meantime, the consistent use of 8×8 rectangles during pellet unblitting does significantly reduce flickering across the entire game, and shrinks certain holes that pellets tend to rip into lazily reblitted sprites:

TH01 SinGyoku's crossing pellet pattern in the Anniversary Edition, demonstrating smaller unblitting artifactsThe same frame in the original game, featuring much more giant holes ripped into the sphere sprite
SinGyoku's "crossing pellets" pattern, shortly before completing the transformation back to the sphere.

To round out the first release, I added all the other bug fixes to achieve parity with my previously released patched REIIDEN.EXE builds:

So here it is, the first build of TH01's Anniversary Edition: 2023-03-05-th01-anniv.zip Edit (2023-03-12): If you're playing on Neko Project and seeing more flickering than in the original game, make sure you've checked the Screen → Disp vsync option.

Next up: The long overdue extended trip through the depths of TH02's low-level code. From what I've seen of it so far, the work on this project is finally going to become a bit more relaxing. Which is quite welcome after, what, 6 months of stressful research-heavy work?

📝 Posted:
🚚 Summary of:
P0190, P0191, P0192
Commits:
5734815...293e16a, 293e16a...71cb7b5, 71cb7b5...e1f3f9f
💰 Funded by:
nrook, -Tom-, [Anonymous]
🏷 Tags:

The important things first:

So, Shinki! As far as final boss code is concerned, she's surprisingly economical, with 📝 her background animations making up more than ⅓ of her entire code. Going straight from TH01's 📝 final 📝 bosses to TH05's final boss definitely showed how much ZUN had streamlined danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there is still room for improvement: TH05 not only 📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month, but also uses them 4× as often, and even for midbosses. Most importantly though, defining danmaku patterns using a single global instance of the group template structure is just bad no matter how you look at it:

Declaring a separate structure instance with the static data for every pattern would be both safer and more space-efficient, and there's more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow. The "devil" pattern is significantly more complex than the others, but still far from TH01's final bosses at their worst. I especially like the clear architectural separation between "one-shot pattern" functions that return true once they're done, and "looping pattern" functions that run as long as they're being called from a boss's main function. Not many all too interesting things in these pattern functions for the most part, except for two pieces of evidence that Shinki was coded after Yumeko:


Speaking about that wing sprite: If you look at ST05.BB2 (or any other file with a large sprite, for that matter), you notice a rather weird file layout:

Raw file layout of TH05's ST05.BB2, demonstrating master.lib's supposed BFNT width limit of 64 pixels
A large sprite split into multiple smaller ones with a width of 64 pixels each? What's this, hardware sprite limitations? On my PC-98?!

And it's not a limitation of the sprite width field in the BFNT+ header either. Instead, it's master.lib's BFNT functions which are limited to sprite widths up to 64 pixels… or at least that's what MASTER.MAN claims. Whatever the restriction was, it seems to be completely nonexistent as of master.lib version 0.23, and none of the master.lib functions used by the games have any issues with larger sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the game that expects Shinki's winged form to consist of 4 physical sprites, not just 1. Any conversion from another, more logical sprite sheet layout back into BFNT+ must therefore replicate the original number of sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly loaded sprite no longer match ZUN's hardcoded IDs, causing the game to crash. This is exactly what used to happen with -Tom-'s MysticTK automation scripts, which combined these exact sprites into a single large one. This issue has now been fixed – just in case there are some underground modders out there who used these scripts and wonder why their game crashed as soon as the Shinki fight started.


And then the code quality takes a nosedive with Shinki's main function. :onricdennat: Even in TH05, these boss and midboss update functions are still very imperative:

The biggest WTF in there, however, goes to using one of the 16 state bytes as a "relative phase" variable for differentiating between boss phases that share the same branch within the switch(boss.phase) statement. While it's commendable that ZUN tried to reduce code duplication for once, he could have just branched depending on the actual boss.phase variable? The same state byte is then reused in the "devil" pattern to track the activity state of the big jerky lasers in the second half of the pattern. If you somehow managed to end the phase after the first few bullets of the pattern, but before these lasers are up, Shinki's update function would think that you're still in the phase before the "devil" pattern. The main function then sequence-breaks right to the defeat phase, skipping the final pattern with the burning Makai background. Luckily, the HP boundaries are far away enough to make this impossible in practice.
The takeaway here: If you want to use the state bytes for your custom boss script mods, alias them to your own 16-byte structure, and limit each of the bytes to a clearly defined meaning across your entire boss script.

One final discovery that doesn't seem to be documented anywhere yet: Shinki actually has a hidden bomb shield during her two purple-wing phases. uth05win got this part slightly wrong though: It's not a complete shield, and hitting Shinki will still deal 1 point of chip damage per frame. For comparison, the first phase lasts for 3,000 HP, and the "devil" pattern phase lasts for 5,800 HP.

And there we go, 3rd PC-98 Touhou boss script* decompiled, 28 to go! 🎉 In case you were expecting a fix for the Shinki death glitch: That one is more appropriately fixed as part of the Mai & Yuki script. It also requires new code, should ideally look a bit prettier than just removing cheetos between one frame and the next, and I'd still like it to fit within the original position-dependent code layout… Let's do that some other time.
Not much to say about the Stage 1 midboss, or midbosses in general even, except that their update functions have to imperatively handle even more subsystems, due to the relative lack of helper functions.


The remaining ¾ of the third push went to a bunch of smaller RE and finalization work that would have hardly got any attention otherwise, to help secure that 50% RE mark. The nicest piece of code in there shows off what looks like the optimal way of setting up the 📝 GRCG tile register for monochrome blitting in a variable color:

mov ah, palette_index ; Any other non-AL 8-bit register works too.
                      ; (x86 only supports AL as the source operand for OUTs.)

rept 4                ; For all 4 bitplanes…
    shr ah,  1        ; Shift the next color bit into the x86 carry flag
    sbb al,  al       ; Extend the carry flag to a full byte
                      ; (CF=0 → 0x00, CF=1 → 0xFF)
    out 7Eh, al       ; Write AL to the GRCG tile register
endm

Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles into a surprisingly nice one-liner. What a beautiful micro-optimization, at a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there, becoming increasingly dumb and undecompilable. Was it really necessary to save 4 x86 instructions in the highly unlikely case of a new spark sprite being spawned outside the playfield? That one 2D polar→Cartesian conversion function then pointed out Turbo C++ 4.0J's woefully limited support for 32-bit micro-optimizations. The code generation for 32-bit 📝 pseudo-registers is so bad that they almost aren't worth using for arithmetic operations, and the inline assembler just flat out doesn't support anything 32-bit. No use in decompiling a function that you'd have to entirely spell out in machine code, especially if the same function already exists in multiple other, more idiomatic C++ variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC replay file reading code, which should finally prove that nothing about the game's original replay system could serve as even just the foundation for community-usable replays. Just in case anyone was still thinking that.


Next up: Back to TH01, with the Elis fight! Got a bit of room left in the cap again, and there are a lot of things that would make a lot of sense now: