⮜ Blog

⮜ List of tags

Showing all posts tagged
,
,
and

📝 Posted:
🚚 Summary of:
P0198, P0199, P0200
Commits:
48db0b7...440637e, 440637e...5af2048, 5af2048...67e46b5
💰 Funded by:
Ember2528, Lmocinemod, Yanga
🏷 Tags:

What's this? A simple, straightforward, easy-to-decompile TH01 boss with just a few minor quirks and only two rendering-related ZUN bugs? Yup, 2½ pushes, and Kikuri was done. Let's get right into the overview:

So yeah, there's your new timeout challenge. :godzun:


The few issues in this fight all relate to hitboxes, starting with the main one of Kikuri against the Orb. The coordinates in the code clearly describe a hitbox in the upper center of the disc, but then ZUN wrote a < sign instead of a > sign, resulting in an in-game hitbox that's not quite where it was intended to be…

Kikuri's actual hitbox. Since the Orb sprite doesn't change its shape, we can visualize the hitbox in a pixel-perfect way here. The Orb must be completely within the red area for a hit to be registered.
TODO TH01 Kikuri's intended hitboxTH01 Kikuri's actual hitbox

Much worse, however, are the teardrop ripples. It already starts with their rendering routine, which places the sprites from TAMAYEN.PTN at byte-aligned VRAM positions in the ultimate piece of if(…) {…} else if(…) {…} else if(…) {…} meme code. Rather than tracking the position of each of the five ripple sprites, ZUN suddenly went purely functional and manually hardcoded the exact rendering and collision detection calls for each frame of the animation, based on nothing but its total frame counter. :zunpet:
Each of the (up to) 5 columns is also unblitted and blitted individually before moving to the next column, starting at the center and then symmetrically moving out to the left and right edges. This wouldn't be a problem if ZUN's EGC-powered unblitting function didn't word-align its X coordinates to a 16×1 grid. If the ripple sprites happen to start at an odd VRAM byte position, their unblitting coordinates get rounded both down and up to the nearest 16 pixels, thus touching the adjacent 8 pixels of the previously blitted columns and leaving the well-known black vertical bars in their place. :tannedcirno:

OK, so where's the hitbox issue here? If you just look at the raw calculation, it's a slightly confusingly expressed, but perfectly logical 17 pixels. But this is where byte-aligned blitting has a direct effect on gameplay: These ripples can be spawned at any arbitrary, non-byte-aligned VRAM position, and collisions are calculated relative to this internal position. Therefore, the actual hitbox is shifted up to 7 pixels to the right, compared to where you would expect it from a ripple sprite's on-screen position:

Due to the deterministic nature of this part of the fight, it's always 5 pixels for this first set of ripples. These visualizations are obviously not pixel-perfect due to the different potential shapes of Reimu's sprite, so they instead relate to her 32×32 bounding box, which needs to be entirely inside the red area.

We've previously seen the same issue with the 📝 shot hitbox of Elis' bat form, where pixel-perfect collision detection against a byte-aligned sprite was merely a sidenote compared to the more serious X=Y coordinate bug. So why do I elevate it to bug status here? Because it directly affects dodging: Reimu's regular movement speed is 4 pixels per frame, and with the internal position of an on-screen ripple sprite varying by up to 7 pixels, any micrododging (or "grazing") attempt turns into a coin flip. It's sort of mitigated by the fact that Reimu is also only ever rendered at byte-aligned VRAM positions, but I wouldn't say that these two bugs cancel out each other.
Oh well, another set of rendering issues to be fixed in the hypothetical Anniversary Edition – obviously, the hitboxes should remain unchanged. Until then, you can always memorize the exact internal positions. The sequence of teardrop spawn points is completely deterministic and only controlled by the fixed per-difficulty spawn interval.


Aside from more minor coordinate inaccuracies, there's not much of interest in the rest of the pattern code. In another parallel to Elis though, the first soul pattern in phase 4 is aimed on every difficulty except Lunatic, where the pellets are once again statically fired downwards. This time, however, the pattern's difficulty is much more appropriately distributed across the four levels, with the simultaneous spinning circle pellets adding a constant aimed component to every difficulty level.

Kikuri's phase 4 patterns, on every difficulty.


That brings us to 5 fully decompiled PC-98 Touhou bosses, with 26 remaining… and another ½ of a push going to the cutscene code in FUUIN.EXE.
You wouldn't expect something as mundane as the boss slideshow code to contain anything interesting, but there is in fact a slight bit of speculation fuel there. The text typing functions take explicit string lengths, which precisely match the corresponding strings… for the most part. For the "Gatekeeper 'SinGyoku'" string though, ZUN passed 23 characters, not 22. Could that have been the "h" from the Hepburn romanization of 神玉?!
Also, come on, if this text is already blitted to VRAM for no reason, you could have gone for perfect centering at unaligned byte positions; the rendering function would have perfectly supported it. Instead, the X coordinates are still rounded up to the nearest byte.

The hardcoded ending cutscene functions should be even less interesting – don't they just show a bunch of images followed by frame delays? Until they don't, and we reach the 地獄/Jigoku Bad Ending with its special shake/"boom" effect, and this picture:

Picture #2 from ED2A.GRP.

Which is rendered by the following code:

for(int i = 0; i <= boom_duration; i++) { // (yes, off-by-one)
	if((i & 3) == 0) {
		graph_scrollup(8);
	} else {
		graph_scrollup(0);
	}

	end_pic_show(1); // ← different picture is rendered
	frame_delay(2);  // ← blocks until 2 VSync interrupts have occurred

	if(i & 1) {
		end_pic_show(2); // ← picture above is rendered
	} else {
		end_pic_show(1);
	}
}

Notice something? You should never see this picture because it's immediately overwritten before the frame is supposed to end. And yet it's clearly flickering up for about one frame with common emulation settings as well as on my real PC-9821 Nw133, clocked at 133 MHz. master.lib's graph_scrollup() doesn't block until VSync either, and removing these calls doesn't change anything about the blitted images. end_pic_show() uses the EGC to blit the given 320×200 quarter of VRAM from page 1 to the visible page 0, so the bottleneck shouldn't be there either…

…or should it? After setting it up via a few I/O port writes, the common method of EGC-powered blitting works like this:

  1. Read 16 bits from the source VRAM position on any single bitplane. This fills the EGC's 4 16-bit tile registers with the VRAM contents at that specific position on every bitplane. You do not care about the value the CPU returns from the read – in optimized code, you would make sure to just read into a register to avoid useless additional stores into local variables.
  2. Write any 16 bits to the target VRAM position on any single bitplane. This copies the contents of the EGC's tile registers to that specific position on every bitplane.

To transfer pixels from one VRAM page to another, you insert an additional write to I/O port 0xA6 before 1) and 2) to set your source and destination page… and that's where we find the bottleneck. Taking a look at the i486 CPU and its cycle counts, a single one of these page switches costs 17 cycles – 1 for MOVing the page number into AL, and 16 for the OUT instruction itself. Therefore, the 8,000 page switches required for EGC-copying a 320×200-pixel image require 136,000 cycles in total.

And that's the optimal case of using only those two instructions. 📝 As I implied last time, TH01 uses a function call for VRAM page switches, complete with creating and destroying a useless stack frame and unnecessarily updating a global variable in main memory. I tried optimizing ZUN's code by throwing out unnecessary code and using 📝 pseudo-registers to generate probably optimal assembly code, and that did speed up the blitting to almost exactly 50% of the original version's run time. However, it did little about the flickering itself. Here's a comparison of the first loop with boom_duration = 16, recorded in DOSBox-X with cputype=auto and cycles=max, and with i overlaid using the text chip. Caution, flashing lights:

The original animation, completing in 50 frames instead of the expected 34, thanks to slow blitting. Combined with the lack of double-buffering, this results in noticeable tearing as the screen refreshes while blitting is still in progress. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic #1.)
This optimized version completes in the expected 34 frames. No tearing happens to be visible in this recording, but the ドカーン image is still visible on every second loop iteration. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic #1.)

I pushed the optimized code to the th01_end_pic_optimize branch, to also serve as an example of how to get close to optimal code out of Turbo C++ 4.0J without writing a single ASM instruction.
And if you really want to use the EGC for this, that's the best you can do. It really sucks that it merely expanded the GRCG's 4×8-bit tile register to 4×16 bits. With 32 bits, ≥386 CPUs could have taken advantage of their wider registers and instructions to double the blitting performance. Instead, we now know the reason why 📝 Promisence Soft's EGC-powered sprite driver that ZUN later stole for TH03 is called SPRITE16 and not SPRITE32. What a massive disappointment.

But what's perhaps a bigger surprise: Blitting planar images from main memory is much faster than EGC-powered inter-page VRAM copies, despite the required manual access to all 4 bitplanes. In fact, the blitting functions for the .CDG/.CD2 format, used from TH03 onwards, would later demonstrate the optimal method of using REP MOVSD for blitting every line in 32-pixel chunks. If that was also used for these ending images, the core blitting operation would have taken ((12 + (3 × (320 / 32))) × 200 × 4) = 33,600 cycles, with not much more overhead for the surrounding row and bitplane loops. Sure, this doesn't factor in the whole infamous issue of VRAM being slow on PC-98, but the aforementioned 136,000 cycles don't even include any actual blitting either. And as you move up to later PC-98 models with Pentium CPUs, the gap between OUT and REP MOVSD only becomes larger. (Note that the page I linked above has a typo in the cycle count of REP MOVSD on Pentium CPUs: According to the original Intel Architecture and Programming Manual, it's 13+𝑛, not 3+𝑛.)
This difference explains why later games rarely use EGC-"accelerated" inter-page VRAM copies, and keep all of their larger images in main memory. It especially explains why TH04 and TH05 can get away with naively redrawing boss backdrop images on every frame.

In the end, the whole fact that ZUN did not define how long this image should be visible is enough for me to increment the game's overall bug counter. Who would have thought that looking at endings of all things would teach us a PC-98 performance lesson… Sure, optimizing TH01 already seemed promising just by looking at its bloated code, but I had no idea that its performance issues extended so far past that level.

That only leaves the common beginning part of all endings and a short main() function before we're done with FUUIN.EXE, and 98 functions until all of TH01 is decompiled! Next up: SinGyoku, who not only is the quickest boss to defeat in-game, but also comes with the least amount of code. See you very soon!

📝 Posted:
🚚 Summary of:
P0190, P0191, P0192
Commits:
5734815...293e16a, 293e16a...71cb7b5, 71cb7b5...e1f3f9f
💰 Funded by:
nrook, -Tom-, [Anonymous]
🏷 Tags:

The important things first:

So, Shinki! As far as final boss code is concerned, she's surprisingly economical, with 📝 her background animations making up more than ⅓ of her entire code. Going straight from TH01's 📝 final 📝 bosses to TH05's final boss definitely showed how much ZUN had streamlined danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there is still room for improvement: TH05 not only 📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month, but also uses them 4× as often, and even for midbosses. Most importantly though, defining danmaku patterns using a single global instance of the group template structure is just bad no matter how you look at it:

Declaring a separate structure instance with the static data for every pattern would be both safer and more space-efficient, and there's more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow. The "devil" pattern is significantly more complex than the others, but still far from TH01's final bosses at their worst. I especially like the clear architectural separation between "one-shot pattern" functions that return true once they're done, and "looping pattern" functions that run as long as they're being called from a boss's main function. Not many all too interesting things in these pattern functions for the most part, except for two pieces of evidence that Shinki was coded after Yumeko:


Speaking about that wing sprite: If you look at ST05.BB2 (or any other file with a large sprite, for that matter), you notice a rather weird file layout:

Raw file layout of TH05's ST05.BB2, demonstrating master.lib's supposed BFNT width limit of 64 pixels
A large sprite split into multiple smaller ones with a width of 64 pixels each? What's this, hardware sprite limitations? On my PC-98?!

And it's not a limitation of the sprite width field in the BFNT+ header either. Instead, it's master.lib's BFNT functions which are limited to sprite widths up to 64 pixels… or at least that's what MASTER.MAN claims. Whatever the restriction was, it seems to be completely nonexistent as of master.lib version 0.23, and none of the master.lib functions used by the games have any issues with larger sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the game that expects Shinki's winged form to consist of 4 physical sprites, not just 1. Any conversion from another, more logical sprite sheet layout back into BFNT+ must therefore replicate the original number of sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly loaded sprite no longer match ZUN's hardcoded IDs, causing the game to crash. This is exactly what used to happen with -Tom-'s MysticTK automation scripts, which combined these exact sprites into a single large one. This issue has now been fixed – just in case there are some underground modders out there who used these scripts and wonder why their game crashed as soon as the Shinki fight started.


And then the code quality takes a nosedive with Shinki's main function. :onricdennat: Even in TH05, these boss and midboss update functions are still very imperative:

The biggest WTF in there, however, goes to using one of the 16 state bytes as a "relative phase" variable for differentiating between boss phases that share the same branch within the switch(boss.phase) statement. While it's commendable that ZUN tried to reduce code duplication for once, he could have just branched depending on the actual boss.phase variable? The same state byte is then reused in the "devil" pattern to track the activity state of the big jerky lasers in the second half of the pattern. If you somehow managed to end the phase after the first few bullets of the pattern, but before these lasers are up, Shinki's update function would think that you're still in the phase before the "devil" pattern. The main function then sequence-breaks right to the defeat phase, skipping the final pattern with the burning Makai background. Luckily, the HP boundaries are far away enough to make this impossible in practice.
The takeaway here: If you want to use the state bytes for your custom boss script mods, alias them to your own 16-byte structure, and limit each of the bytes to a clearly defined meaning across your entire boss script.

One final discovery that doesn't seem to be documented anywhere yet: Shinki actually has a hidden bomb shield during her two purple-wing phases. uth05win got this part slightly wrong though: It's not a complete shield, and hitting Shinki will still deal 1 point of chip damage per frame. For comparison, the first phase lasts for 3,000 HP, and the "devil" pattern phase lasts for 5,800 HP.

And there we go, 3rd PC-98 Touhou boss script* decompiled, 28 to go! 🎉 In case you were expecting a fix for the Shinki death glitch: That one is more appropriately fixed as part of the Mai & Yuki script. It also requires new code, should ideally look a bit prettier than just removing cheetos between one frame and the next, and I'd still like it to fit within the original position-dependent code layout… Let's do that some other time.
Not much to say about the Stage 1 midboss, or midbosses in general even, except that their update functions have to imperatively handle even more subsystems, due to the relative lack of helper functions.


The remaining ¾ of the third push went to a bunch of smaller RE and finalization work that would have hardly got any attention otherwise, to help secure that 50% RE mark. The nicest piece of code in there shows off what looks like the optimal way of setting up the 📝 GRCG tile register for monochrome blitting in a variable color:

mov ah, palette_index ; Any other non-AL 8-bit register works too.
                      ; (x86 only supports AL as the source operand for OUTs.)

rept 4                ; For all 4 bitplanes…
    shr ah,  1        ; Shift the next color bit into the x86 carry flag
    sbb al,  al       ; Extend the carry flag to a full byte
                      ; (CF=0 → 0x00, CF=1 → 0xFF)
    out 7Eh, al       ; Write AL to the GRCG tile register
endm

Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles into a surprisingly nice one-liner. What a beautiful micro-optimization, at a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there, becoming increasingly dumb and undecompilable. Was it really necessary to save 4 x86 instructions in the highly unlikely case of a new spark sprite being spawned outside the playfield? That one 2D polar→Cartesian conversion function then pointed out Turbo C++ 4.0J's woefully limited support for 32-bit micro-optimizations. The code generation for 32-bit 📝 pseudo-registers is so bad that they almost aren't worth using for arithmetic operations, and the inline assembler just flat out doesn't support anything 32-bit. No use in decompiling a function that you'd have to entirely spell out in machine code, especially if the same function already exists in multiple other, more idiomatic C++ variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC replay file reading code, which should finally prove that nothing about the game's original replay system could serve as even just the foundation for community-usable replays. Just in case anyone was still thinking that.


Next up: Back to TH01, with the Elis fight! Got a bit of room left in the cap again, and there are a lot of things that would make a lot of sense now:

📝 Posted:
🚚 Summary of:
P0182, P0183
Commits:
313450f...1e2c7ad, 1e2c7ad...f9d983e
💰 Funded by:
Lmocinemod, [Anonymous], Yanga
🏷 Tags:

Been 📝 a while since we last looked at any of TH03's game code! But before that, we need to talk about Y coordinates.

During TH03's MAIN.EXE, the PC-98 graphics GDC runs in its line-doubled 640×200 resolution, which gives the in-game portion its distinctive stretched low-res look. This lower resolution is a consequence of using 📝 Promisence Soft's SPRITE16 driver: Its performance simply stems from the fact that it expects sprites to be stored in the bottom half of VRAM, which allows them to be blitted using the same EGC-accelerated VRAM-to-VRAM copies we've seen again and again in all other games. Reducing the visible resolution also means that the sprites can be stored on both VRAM pages, allowing the game to still be double-buffered. If you force the graphics chip to run at 640×400, you can see them:

The full VRAM contents during TH03's in-game portion, as seen when forcing the system into a 640×400 resolution.
TH03's VRAM at regular line-doubled 640×200 resolutionTH03's VRAM at full 640×400 resolution, including the SPRITE16 sprite areaTH03's text layer during an in-game round.

Note that the text chip still displays its overlaid contents at 640×400, which means that TH03's in-game portion technically runs at two resolutions at the same time.

But that means that any mention of a Y coordinate is ambiguous: Does it refer to undoubled VRAM pixels, or on-screen stretched pixels? Especially people who have known about the line doubling for years might almost expect technical blog posts on this game to use undoubled VRAM coordinates. So, let's introduce a new formatting convention for both on-screen 640×400 and undoubled 640×200 coordinates, and always write out both to minimize the confusion.


Alright, now what's the thing gonna be? The enemy structure is highly overloaded, being used for enemies, fireballs, and explosions with seemingly different semantics for each. Maybe a bit too much to be figured out in what should ideally be a single push, especially with all the functions that would need to be decompiled? Bullet code would be easier, but not exactly single-push material either. As it turns out though, there's something more fundamental left to be done first, which both of these subsystems depend on: collision detection!

And it's implemented exactly how I always naively imagined collision detection to be implemented in a fixed-resolution 2D bullet hell game with small hitboxes: By keeping a separate 1bpp bitmap of both playfields in memory, drawing in the collidable regions of all entities on every frame, and then checking whether any pixels at the current location of the player's hitbox are set to 1. It's probably not done in the other games because their single data segment was already too packed for the necessary 17,664 bytes to store such a bitmap at pixel resolution, and 282,624 bytes for a bitmap at Q12.4 subpixel resolution would have been prohibitively expensive in 16-bit Real Mode DOS anyway. In TH03, on the other hand, this bitmap is doubly useful, as the AI also uses it to elegantly learn what's on the playfield. By halving the resolution and only tracking tiles of 2×2 / 2×1 pixels, TH03 only requires an adequate total of 6,624 bytes of memory for the collision bitmaps of both playfields.

So how did the implementation not earn the good-code tag this time? Because the code for drawing into these bitmaps is undecompilable hand-written x86 assembly. :zunpet: And not just your usual ASM that was basically compiled from C and then edited to maybe optimize register allocation and maybe replace a bunch of local variables with self-modifying code, oh no. This code is full of overly clever bit twiddling, abusing the fact that the 16-bit AX, BX, CX, and DX registers can also be accessed as two 8-bit registers, calculations that change the semantic meaning behind the value of a register, or just straight-up reassignments of different values to the same small set of registers. Sure, in some way it is impressive, and it all does work and correctly covers every edge case, but come on. This could have all been a lot more readable in exchange for just a few CPU cycles.

What's most interesting though are the actual shapes that these functions draw into the collision bitmap. On the surface, we have:

  1. vertical slopes at any angle across the whole playfield; exclusively used for Chiyuri's diagonal laser EX attack
  2. straight vertical lines, with a width of 1 tile; exclusively used for the 2×2 / 2×1 hitboxes of bullets
  3. rectangles at arbitrary sizes

But only 2) actually draws a full solid line. 1) and 3) are only ever drawn as horizontal stripes, with a hardcoded distance of 2 vertical tiles between every stripe of a slope, and 4 vertical tiles between every stripe of a rectangle. That's 66-75% of each rectangular entity's intended hitbox not actually taking part in collision detection. Now, if player hitboxes were ≤ 6 / 3 pixels, we'd have one possible explanation of how the AI can "cheat", because it could just precisely move through those blank regions at TAS speeds. So, let's make this two pushes after all and tell the complete story, since this is one of the more interesting aspects to still be documented in this game.


And the code only gets worse. :godzun: While the player collision detection function is decompilable, it might as well not have been, because it's just more of the same "optimized", hard-to-follow assembly. With the four splittable 16-bit registers having a total of 20 different meanings in this function, I would have almost preferred self-modifying code…

In fact, it was so bad that it prompted some maintenance work on my inline assembly coding standards as a whole. Turns out that the _asm keyword is not only still supported in modern Visual Studio compilers, but also in Clang with the -fms-extensions flag, and compiles fine there even for 64-bit targets. While that might sound like amazing news at first ("awesome, no need to rewrite this stuff for my x86_64 Linux port!"), you quickly realize that almost all inline assembly in this codebase assumes either PC-98 hardware, segmented 16-bit memory addressing, or is a temporary hack that will be removed with further RE progress.
That's mainly because most of the raw arithmetic code uses Turbo C++'s register pseudovariables where possible. While they certainly have their drawbacks, being a non-standard extension that's not supported in other x86-targeting C compilers, their advantages are quite significant: They allow this code to stay in the same language, and provide slightly more immediate portability to any other architecture, together with 📝 readability and maintainability improvements that can get quite significant when combined with inlining:

// This one line compiles to five ASM instructions, which would need to be
// spelled out in any C compiler that doesn't support register pseudovariables.
// By adding typed aliases for these registers via `#define`, this code can be
// both made even more readable, and be prepared for an easier transformation
// into more portable local variables.
_ES = (((_AX * 4) + _BX) + SEG_PLANE_B);

However, register pseudovariables might cause potential portability issues as soon as they are mixed with inline assembly instructions that rely on their state. The lazy way of "supporting pseudo-registers" in other compilers would involve declaring the full set as global variables, which would immediately break every one of those instances:

_DI = 0;
_AX = 0xFFFF;

// Special x86 instruction doing the equivalent of
//
// 	*reinterpret_cast(MK_FP(_ES, _DI)) = _AX;
// 	_DI += sizeof(uint16_t);
//
// Only generated by Turbo C++ in very specific cases, and therefore only
// reliably available through inline assembly.
asm { movsw; }

What's also not all too standardized, though, are certain variants of the asm keyword. That's why I've now introduced a distinction between the _asm keyword for "decently sane" inline assembly, and the slightly less standard asm keyword for inline assembly that relies on the contents of pseudo-registers, and should break on compilers that don't support them.
So yeah, have some minor portability work in exchange for these two pushes not having all that much in RE'd content.

With that out of the way and the function deciphered, we can confirm the player hitboxes to be a constant 8×8 / 8×4 pixels, and prove that the hit stripes are nothing but an adequate optimization that doesn't affect gameplay in any way.


And what's the obvious thing to immediately do if you have both the collision bitmap and the player hitbox? Writing a "real hitbox" mod, of course:

  1. Reorder the calls to rendering functions so that player and shot sprites are rendered after bullets
  2. Blank out all player sprite pixels outside an 8×8 / 8×4 box around the center point
  3. After the bullet rendering function, turn on the GRCG in RMW mode and set the tile register set to the background color
  4. Stretch the negated contents of collision bitmap onto each playfield, leaving only collidable pixels untouched
  5. Do the same with the actual, non-negated contents and a white color, for extra contrast against the background. This also makes sure to show any collidable areas whose sprite pixels are transparent, such as with the moon enemy. (Yeah, how unfair.) Doing that also loses a lot of information about the playfield, such as enemy HP indicated by their color, but what can you do:
A decently busy TH03 in-game frame.The underlying content of the collision bitmap, showing off all three different shapes together with the player hitboxes.
A decently busy TH03 in-game frame and its underlying collision bitmap, showing off all three different collision shapes together with the player hitboxes.

2022-02-18-TH03-real-hitbox.zip The secret for writing such mods before having reached a sufficient level of position independence? Put your new code segment into DGROUP, past the end of the uninitialized data section. That's why this modded MAIN.EXE is a lot larger than you would expect from the raw amount of new code: The file now actually needs to store all these uninitialized 0 bytes between the end of the data segment and the first instruction of the mod code – normally, this number is simply a part of the MZ EXE header, and doesn't need to be redundantly stored on disk. Check the th03_real_hitbox branch for the code.

And now we know why so many "real hitbox" mods for the Windows Touhou games are inaccurate: The games would simply be unplayable otherwise – or can you dodge rapidly moving 2×2 / 2×1 blocks as an 8×8 / 8×4 rectangle that is smaller than your shot sprites, especially without focused movement? I can't. :tannedcirno: Maybe it will feel more playable after making explosions visible, but that would need more RE groundwork first.
It's also interesting how adding two full GRCG-accelerated redraws of both playfields per frame doesn't significantly drop the game's frame rate – so why did the drawing functions have to be micro-optimized again? It would be possible in one pass by using the GRCG's TDW mode, which should theoretically be 8× faster, but I have to stop somewhere. :onricdennat:

Next up: The final missing piece of TH04's and TH05's bullet-moving code, which will include a certain other type of projectile as well.

📝 Posted:
🚚 Summary of:
P0174, P0175, P0176, P0177, P0178, P0179, P0180, P0181
Commits:
27f901c...a0fe812, a0fe812...40ac9a7, 40ac9a7...c5dc45b, c5dc45b...5f0cabc, 5f0cabc...60621f8, 60621f8...9e5b344, 9e5b344...091f19f, 091f19f...313450f
💰 Funded by:
Ember2528, Yanga
🏷 Tags:

Here we go, TH01 Sariel! This is the single biggest boss fight in all of PC-98 Touhou: If we include all custom effect code we previously decompiled, it amounts to a total of 10.31% of all code in TH01 (and 3.14% overall). These 8 pushes cover the final 8.10% (or 2.47% overall), and are likely to be the single biggest delivery this project will ever see. Considering that I only managed to decompile 6.00% across all games in 2021, 2022 is already off to a much better start!

So, how can Sariel's code be that large? Well, we've got:

In total, it's just under 3,000 lines of C++ code, containing a total of 8 definite ZUN bugs, 3 of them being subpixel/pixel confusions. That might not look all too bad if you compare it to the 📝 player control function's 8 bugs in 900 lines of code, but given that Konngara had 0… (Edit (2022-07-17): Konngara contains two bugs after all: A 📝 possible heap corruption in test or debug mode, and the infamous 📝 temporary green discoloration.) And no, the code doesn't make it obvious whether ZUN coded Konngara or Sariel first; there's just as much evidence for either.

Some terminology before we start: Sariel's first form is separated into four phases, indicated by different background images, that cycle until Sariel's HP reach 0 and the second, single-phase form starts. The danmaku patterns within each phase are also on a cycle, and the game picks a random but limited number of patterns per phase before transitioning to the next one. The fight always starts at pattern 1 of phase 1 (the random purple lasers), and each new phase also starts at its respective first pattern.


Sariel's bugs already start at the graphics asset level, before any code gets to run. Some of the patterns include a wand raise animation, which is stored in BOSS6_2.BOS:

TH01 BOSS6_2.BOS
Umm… OK? The same sprite twice, just with slightly different colors? So how is the wand lowered again?

The "lowered wand" sprite is missing in this file simply because it's captured from the regular background image in VRAM, at the beginning of the fight and after every background transition. What I previously thought to be 📝 background storage code has therefore a different meaning in Sariel's case. Since this captured sprite is fully opaque, it will reset the entire 128×128 wand area… wait, 128×128, rather than 96×96? Yup, this lowered sprite is larger than necessary, wasting 1,967 bytes of conventional memory.
That still doesn't quite explain the second sprite in BOSS6_2.BOS though. Turns out that the black part is indeed meant to unblit the purple reflection (?) in the first sprite. But… that's not how you would correctly unblit that?

VRAM after blitting the first sprite of TH01's BOSS6_2.BOS VRAM after blitting the second sprite of TH01's BOSS6_2.BOS

The first sprite already eats up part of the red HUD line, and the second one additionally fails to recover the seal pixels underneath, leaving a nice little black hole and some stray purple pixels until the next background transition. :tannedcirno: Quite ironic given that both sprites do include the right part of the seal, which isn't even part of the animation.


Just like Konngara, Sariel continues the approach of using a single function per danmaku pattern or custom entity. While I appreciate that this allows all pattern- and entity-specific state to be scoped locally to that one function, it quickly gets ugly as soon as such a function has to do more than one thing.
The "bird function" is particularly awful here: It's just one if(…) {…} else if(…) {…} else if(…) {…} chain with different branches for the subfunction parameter, with zero shared code between any of these branches. It also uses 64-bit floating-point double as its subpixel type… and since it also takes four of those as parameters (y'know, just in case the "spawn new bird" subfunction is called), every call site has to also push four double values onto the stack. Thanks to Turbo C++ even using the FPU for pushing a 0.0 constant, we have already reached maximum floating-point decadence before even having seen a single danmaku pattern. Why decadence? Every possible spawn position and velocity in both bird patterns just uses pixel resolution, with no fractional component in sight. And there goes another 720 bytes of conventional memory.

Speaking about bird patterns, the red-bird one is where we find the first code-level ZUN bug: The spawn cross circle sprite suddenly disappears after it finished spawning all the bird eggs. How can we tell it's a bug? Because there is code to smoothly fly this sprite off the playfield, that code just suddenly forgets that the sprite's position is stored in Q12.4 subpixels, and treats it as raw screen pixels instead. :zunpet: As a result, the well-intentioned 640×400 screen-space clipping rectangle effectively shrinks to 38×23 pixels in the top-left corner of the screen. Which the sprite is always outside of, and thus never rendered again.
The intended animation is easily restored though:

Sariel's third pattern, and the first to spawn birds, in its original and fixed versions. Note that I somewhat fixed the bird hatch animation as well: ZUN's code never unblits any frame of animation there, and simply blits every new one on top of the previous one.

Also, did you know that birds actually have a quite unfair 14×38-pixel hitbox? Not that you'd ever collide with them in any of the patterns…

Another 3 of the 8 bugs can be found in the symmetric, interlaced spawn rays used in three of the patterns, and the 32×32 debris "sprites" shown at their endpoint, at the edge of the screen. You kinda have to commend ZUN's attention to detail here, and how he wrote a lot of code for those few rapidly animated pixels that you most likely don't even notice, especially with all the other wrong pixels resulting from rendering glitches. One of the bugs in the very final pattern of phase 4 even turns them into the vortex sprites from the second pattern in phase 1 during the first 5 frames of the first time the pattern is active, and I had to single-step the blitting calls to verify it.
It certainly was annoying how much time I spent making sense of these bugs, and all weird blitting offsets, for just a few pixels… Let's look at something more wholesome, shall we?


So far, we've only seen the PC-98 GRCG being used in RMW (read-modify-write) mode, which I previously 📝 explained in the context of TH01's red-white HP pattern. The second of its three modes, TCR (Tile Compare Read), affects VRAM reads rather than writes, and performs "color extraction" across all 4 bitplanes: Instead of returning raw 1bpp data from one plane, a VRAM read will instead return a bitmask, with a 1 bit at every pixel whose full 4-bit color exactly matches the color at that offset in the GRCG's tile register, and 0 everywhere else. Sariel uses this mode to make sure that the 2×2 particles and the wind effect are only blitted on top of "air color" pixels, with other parts of the background behaving like a mask. The algorithm:

  1. Set the GRCG to TCR mode, and all 8 tile register dots to the air color
  2. Read N bits from the target VRAM position to obtain an N-bit mask where all 1 bits indicate air color pixels at the respective position
  3. AND that mask with the alpha plane of the sprite to be drawn, shifted to the correct start bit within the 8-pixel VRAM byte
  4. Set the GRCG to RMW mode, and all 8 tile register dots to the color that should be drawn
  5. Write the previously obtained bitmask to the same position in VRAM

Quite clever how the extracted colors double as a secondary alpha plane, making for another well-earned good-code tag. The wind effect really doesn't deserve it, though:

As far as I can tell, ZUN didn't use TCR mode anywhere else in PC-98 Touhou. Tune in again later during a TH04 or TH05 push to learn about TDW, the final GRCG mode!


Speaking about the 2×2 particle systems, why do we need three of them? Their only observable difference lies in the way they move their particles:

  1. Up or down in a straight line (used in phases 4 and 2, respectively)
  2. Left or right in a straight line (used in the second form)
  3. Left and right in a sinusoidal motion (used in phase 3, the "dark orange" one)

Out of all possible formats ZUN could have used for storing the positions and velocities of individual particles, he chose a) 64-bit / double-precision floating-point, and b) raw screen pixels. Want to take a guess at which data type is used for which particle system?

If you picked double for 1) and 2), and raw screen pixels for 3), you are of course correct! :godzun: Not that I'm implying that it should have been the other way round – screen pixels would have perfectly fit all three systems use cases, as all 16-bit coordinates are extended to 32 bits for trigonometric calculations anyway. That's what, another 1.080 bytes of wasted conventional memory? And that's even calculated while keeping the current architecture, which allocates space for 3×30 particles as part of the game's global data, although only one of the three particle systems is active at any given time.

That's it for the first form, time to put on "Civilization of Magic"! Or "死なばもろとも"? Or "Theme of 地獄めくり"? Or whatever SYUGEN is supposed to mean…


… and the code of these final patterns comes out roughly as exciting as their in-game impact. With the big exception of the very final "swaying leaves" pattern: After 📝 Q4.4, 📝 Q28.4, 📝 Q24.8, and double variables, this pattern uses… decimal subpixels? Like, multiplying the number by 10, and using the decimal one's digit to represent the fractional part? Well, sure, if you really insist on moving the leaves in cleanly represented integer multiples of ⅒, which is infamously impossible in IEEE 754. Aside from aesthetic reasons, it only really combines less precision (10 possible fractions rather than the usual 16) with the inferior performance of having to use integer divisions and multiplications rather than simple bit shifts. And it's surely not because the leaf sprites needed an extended integer value range of [-3276, +3276], compared to Q12.4's [-2047, +2048]: They are clipped to 640×400 screen space anyway, and are removed as soon as they leave this area.

This pattern also contains the second bug in the "subpixel/pixel confusion hiding an entire animation" category, causing all of BOSS6GR4.GRC to effectively become unused:

The "swaying leaves" pattern. ZUN intended a splash animation to be shown once each leaf "spark" reaches the top of the playfield, which is never displayed in the original game.

At least their hitboxes are what you would expect, exactly covering the 30×30 pixels of Reimu's sprite. Both animation fixes are available on the th01_sariel_fixes branch.

After all that, Sariel's main function turned out fairly unspectacular, just putting everything together and adding some shake, transition, and color pulse effects with a bunch of unnecessary hardware palette changes. There is one reference to a missing BOSS6.GRP file during the first→second form transition, suggesting that Sariel originally had a separate "first form defeat" graphic, before it was replaced with just the shaking effect in the final game.
Speaking about the transition code, it is kind of funny how the… um, imperative and concrete nature of TH01 leads to these 2×24 lines of straight-line code. They kind of look like ZUN rattling off a laundry list of subsystems and raw variables to be reinitialized, making damn sure to not forget anything.


Whew! Second PC-98 Touhou boss completely decompiled, 29 to go, and they'll only get easier from here! 🎉 The next one in line, Elis, is somewhere between Konngara and Sariel as far as x86 instruction count is concerned, so that'll need to wait for some additional funding. Next up, therefore: Looking at a thing in TH03's main game code – really, I have little idea what it will be!

Now that the store is open again, also check out the 📝 updated RE progress overview I've posted together with this one. In addition to more RE, you can now also directly order a variety of mods; all of these are further explained in the order form itself.

📝 Posted:
🚚 Summary of:
P0096, P0097, P0098
Commits:
8ddb778...8283c5e, 8283c5e...600f036, 600f036...ad06748
💰 Funded by:
Ember2528, Yanga
🏷 Tags:

So, let's finally look at some TH01 gameplay structures! The obvious choices here are player shots and pellets, which are conveniently located in the last code segment. Covering these would therefore also help in transferring some first bits of data in REIIDEN.EXE from ASM land to C land. (Splitting the data segment would still be quite annoying.) Player shots are immediately at the beginning…

…but wait, these are drawn as transparent sprites loaded from .PTN files. Guess we first have to spend a push on 📝 Part 2 of this format.
Hm, 4 functions for alpha-masked blitting and unblitting of both 16×16 and 32×32 .PTN sprites that align the X coordinate to a multiple of 8 (remember, the PC-98 uses a planar VRAM memory layout, where 8 pixels correspond to a byte), but only one function that supports unaligned blitting to any X coordinate, and only for 16×16 sprites? Which is only called twice? And doesn't come with a corresponding unblitting function? :thonk:

Yeah, "unblitting". TH01 isn't double-buffered, and uses the PC-98's second VRAM page exclusively to store a stage's background and static sprites. Since the PC-98 has no hardware sprites, all you can do is write pixels into VRAM, and any animated sprite needs to be manually removed from VRAM at the beginning of each frame. Not using double-buffering theoretically allows TH01 to simply copy back all 128 KB of VRAM once per frame to do this. :tannedcirno: But that would be pretty wasteful, so TH01 just looks at all animated sprites, and selectively copies only their occupied pixels from the second to the first VRAM page.


Alright, player shot class methods… oh, wait, the collision functions directly act on the Yin-Yang Orb, so we first have to spend a push on that one. And that's where the impression we got from the .PTN functions is confirmed: The orb is, in fact, only ever displayed at byte-aligned X coordinates, divisible by 8. It's only thanks to the constant spinning that its movement appears at least somewhat smooth.
This is purely a rendering issue; internally, its position is tracked at pixel precision. Sadly, smooth orb rendering at any unaligned X coordinate wouldn't be that trivial of a mod, because well, the necessary functions for unaligned blitting and unblitting of 32×32 sprites don't exist in TH01's code. Then again, there's so much potential for optimization in this code, so it might be very possible to squeeze those additional two functions into the same C++ translation unit, even without position independence…

More importantly though, this was the right time to decompile the core functions controlling the orb physics – probably the highlight in these three pushes for most people.
Well, "physics". The X velocity is restricted to the 5 discrete states of -8, -4, 0, 4, and 8, and gravity is applied by simply adding 1 to the Y velocity every 5 frames :zunpet: No wonder that this can easily lead to situations in which the orb infinitely bounces from the ground.
At least fangame authors now have a reference of how ZUN did it originally, because really, this bad approximation of physics had to have been written that way on purpose. But hey, it uses 64-bit floating-point variables! :onricdennat:

…sometimes at least, and quite randomly. This was also where I had to learn about Turbo C++'s floating-point code generation, and how rigorously it defines the order of instructions when mixing double and float variables in arithmetic or conditional expressions. This meant that I could only get ZUN's original instruction order by using literal constants instead of variables, which is impossible right now without somehow splitting the data segment. In the end, I had to resort to spelling out ⅔ of one function, and one conditional branch of another, in inline ASM. 😕 If ZUN had just written 16.0 instead of 16.0f there, I would have saved quite some hours of my life trying to decompile this correctly…

To sort of make up for the slowdown in progress, here's the TH01 orb physics debug mod I made to properly understand them. Edit (2022-07-12): This mod is outdated, 📝 the current version is here! 2020-06-13-TH01OrbPhysicsDebug.zip To use it, simply replace REIIDEN.EXE, and run the game in debug mode, via game d on the DOS prompt.
Its code might also serve as an example of how to achieve this sort of thing without position independence.

Screenshot of the TH01 orb physics debug mod

Alright, now it's time for player shots though. Yeah, sure, they don't move horizontally, so it's not too bad that those are also always rendered at byte-aligned positions. But, uh… why does this code only use the 16×16 alpha-masked unblitting function for decaying shots, and just sloppily unblits an entire 16×16 square everywhere else?

The worst part though: Unblitting, moving, and rendering player shots is done in a single function, in that order. And that's exactly where TH01's sprite flickering comes from. Since different types of sprites are free to overlap each other, you'd have to first unblit all types, then move all types, and then render all types, as done in later PC-98 Touhou games. If you do these three steps per-type instead, you will unblit sprites of other types that have been rendered before… and therefore end up with flicker.
Oh, and finally, ZUN also added an additional sloppy 16×16 square unblit call if a shot collides with a pellet or a boss, for some guaranteed flicker. Sigh.


And that's ⅓ of all ZUN code in TH01 decompiled! Next up: Pellets!