- Funded by: Lmocinemod, [Anonymous], Yanga
Been a while since we last looked at any of TH03's game code! But before that, we need to talk about Y coordinates.
During TH03's MAIN.EXE, the PC-98 graphics GDC runs in its
line-doubled 640×200 resolution, which gives the in-game portion its
distinctive stretched low-res look. This lower resolution is a consequence
of using Promisence Soft's SPRITE16 driver:
Its performance simply stems from the fact that it expects sprites to be
stored in the bottom half of VRAM, which allows them to be blitted using the
same EGC-accelerated VRAM-to-VRAM copies we've seen again and again in all
other games. Reducing the visible resolution also means that the sprites can
be stored on both VRAM pages, allowing the game to still be double-buffered.
If you force the graphics chip to run at 640×400, you can see them:
Note that the text chip still displays its overlaid contents at 640×400, which means that TH03's in-game portion technically runs at two resolutions at the same time.
But that means that any mention of a Y coordinate is ambiguous: Does it refer to undoubled VRAM pixels, or on-screen stretched pixels? Especially people who have known about the line doubling for years might almost expect technical blog posts on this game to use undoubled VRAM coordinates. So, let's introduce a new formatting convention for both on-screen 640×400 and undoubled 640×200 coordinates, and always write out both to minimize the confusion.
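To make the two coordinate spaces harder to mix up in code as well, one option is to give each its own type. The following is a minimal sketch under that assumption; `ScreenY`, `VramY`, `to_vram()`, and `to_screen()` are hypothetical names for illustration and not taken from the actual codebase.

```cpp
#include <cassert>

// Hypothetical wrapper types, only meant to keep the two spaces apart.
struct ScreenY { int v; }; // on-screen pixels, 0 to 399
struct VramY   { int v; }; // undoubled VRAM lines, 0 to 199

// Every VRAM line is displayed twice, so two adjacent on-screen lines map to
// the same VRAM line.
inline VramY to_vram(ScreenY y) {
    assert(y.v >= 0 && y.v < 400);
    return VramY{ y.v / 2 };
}

// Returns the first of the two on-screen lines that show the given VRAM line.
inline ScreenY to_screen(VramY y) {
    assert(y.v >= 0 && y.v < 200);
    return ScreenY{ y.v * 2 };
}
```

With this, an on-screen height of 8 pixels becomes 4 VRAM lines, which is the same relation the "8×8 / 8×4" notation expresses.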
Alright, now what's the thing gonna be? The enemy structure is highly overloaded, being used for enemies, fireballs, and explosions with seemingly different semantics for each. Maybe a bit too much to be figured out in what should ideally be a single push, especially with all the functions that would need to be decompiled? Bullet code would be easier, but not exactly single-push material either. As it turns out though, there's something more fundamental left to be done first, which both of these subsystems depend on: collision detection!
And it's implemented exactly how I always naively imagined collision detection to be implemented in a fixed-resolution 2D bullet hell game with small hitboxes: By keeping a separate 1bpp bitmap of both playfields in memory, drawing in the collidable regions of all entities on every frame, and then checking whether any pixels at the current location of the player's hitbox are set to 1. It's probably not done in the other games because their single data segment was already too packed for the necessary 17,664 bytes to store such a bitmap at pixel resolution, and 282,624 bytes for a bitmap at Q12.4 subpixel resolution would have been prohibitively expensive in 16-bit Real Mode DOS anyway. In TH03, on the other hand, this bitmap is doubly useful, as the AI also uses it to elegantly learn what's on the playfield. By halving the resolution and only tracking tiles of 2×2 / 2×1 pixels, TH03 only requires an adequate total of 6,624 bytes of memory for the collision bitmaps of both playfields.
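The memory figures and the basic idea can be sketched in a few lines of portable C++. Everything here is an assumption made for illustration: the 288×368 and 384×368 on-screen playfield sizes are simply the ones that reproduce the byte counts quoted above, and the `CollisionBitmap` type with its `set()`, `test()`, and `any()` methods is hypothetical, not the game's actual layout or code.

```cpp
#include <cassert>
#include <cstdint>

// Assumed on-screen playfield size, chosen to match the quoted 6,624 bytes.
constexpr int PLAYFIELD_W = 288;
constexpr int PLAYFIELD_H = 368;
constexpr int TILE_W = 2; // on-screen pixels per tile
constexpr int TILE_H = 2;
constexpr int TILES_X = (PLAYFIELD_W / TILE_W); // 144
constexpr int TILES_Y = (PLAYFIELD_H / TILE_H); // 184
constexpr int ROW_BYTES = (TILES_X / 8);        // 18

// Hypothetical 1bpp bitmap for a single playfield, one bit per tile.
struct CollisionBitmap {
    uint8_t bits[TILES_Y][ROW_BYTES] = {};

    void set(int tx, int ty) {
        assert(tx >= 0 && tx < TILES_X && ty >= 0 && ty < TILES_Y);
        bits[ty][tx / 8] |= static_cast<uint8_t>(0x80 >> (tx % 8));
    }

    bool test(int tx, int ty) const {
        assert(tx >= 0 && tx < TILES_X && ty >= 0 && ty < TILES_Y);
        return ((bits[ty][tx / 8] & (0x80 >> (tx % 8))) != 0);
    }

    // True if any tile inside the given rectangle of tiles is set. This is
    // the naive form of the player hitbox check.
    bool any(int left, int top, int w, int h) const {
        for (int y = top; y < (top + h); y++) {
            for (int x = left; x < (left + w); x++) {
                if (test(x, y)) {
                    return true;
                }
            }
        }
        return false;
    }
};

// Two playfields at tile resolution.
static_assert((sizeof(CollisionBitmap) * 2) == 6624, "TH03 total");
// One assumed 384×368 playfield at pixel resolution.
static_assert(((384 * 368) / 8) == 17664, "pixel-resolution bitmap");
```

An 8×8 on-screen hitbox then corresponds to a call like `any(tx, ty, 4, 4)` on the bitmap of the respective playfield.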
So how did the implementation not earn the good-code tag this time? Because the code for drawing into these bitmaps is undecompilable hand-written x86 assembly. And not just your usual ASM that was basically compiled from C and then edited to maybe optimize register allocation and maybe replace a bunch of local variables with self-modifying code, oh no. This code is full of overly clever bit twiddling, abusing the fact that the 16-bit AX, BX, CX, and DX registers can also be
accessed as two 8-bit registers, calculations that change the semantic
meaning behind the value of a register, or just straight-up reassignments of
different values to the same small set of registers. Sure, in some way it is
impressive, and it all does work and correctly covers every edge
case, but come on. This could have all been a lot more readable in
exchange for just a few CPU cycles.
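For readers who haven't written 16-bit x86 code: each of these four registers aliases two 8-bit halves, e.g. AX consists of AH (upper byte) and AL (lower byte). A portable model of that aliasing could look like the sketch below, where `Reg16` and its methods are hypothetical names made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one splittable register such as AX = AH:AL.
struct Reg16 {
    uint16_t x;

    uint8_t lo() const { return static_cast<uint8_t>(x & 0xFF); }
    uint8_t hi() const { return static_cast<uint8_t>(x >> 8); }

    // Writing one half leaves the other half untouched.
    void set_lo(uint8_t v) {
        x = static_cast<uint16_t>((x & 0xFF00) | v);
    }
    void set_hi(uint8_t v) {
        x = static_cast<uint16_t>((x & 0x00FF) | (v << 8));
    }
};

// Usage: the same 16 bits, read either as one value or as two bytes.
inline void reg16_example() {
    Reg16 ax{0x1234};
    assert(ax.hi() == 0x12);
    assert(ax.lo() == 0x34);
    ax.set_lo(0xFF);
    assert(ax.x == 0x12FF);
}
```

This is what makes it possible to keep two unrelated 8-bit values in one register and then suddenly treat them as a single 16-bit value, which is exactly the kind of trick that makes such code hard to follow.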
What's most interesting though are the actual shapes that these functions draw into the collision bitmap. On the surface, we have:
1) vertical slopes at any angle across the whole playfield; exclusively used for Chiyuri's diagonal laser EX attack
2) straight vertical lines, with a width of 1 tile; exclusively used for the 2×2 / 2×1 hitboxes of bullets
3) rectangles at arbitrary sizes
But only 2) actually draws a full solid line. 1) and 3) are only ever drawn as horizontal stripes, with a hardcoded distance of 2 vertical tiles between every stripe of a slope, and 4 vertical tiles between every stripe of a rectangle. That's 66-75% of each rectangular entity's intended hitbox not actually taking part in collision detection. Now, if player hitboxes were ≤ 6 / 3 pixels, we'd have one possible explanation of how the AI can "cheat", because it could just precisely move through those blank regions at TAS speeds. So, let's make this two pushes after all and tell the complete story, since this is one of the more interesting aspects to still be documented in this game.
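To get a feeling for how much of a rectangle such stripes leave blank, here is a small sketch of striped drawing into a generic tile grid. It assumes that a stripe is drawn on every `period`-th tile row, counted from the top row of the rectangle; how exactly the distances quoted above map to such a period is my assumption, and `TileGrid`, `draw_striped_rect()`, and `count_set()` are hypothetical names rather than anything from the game.

```cpp
#include <cassert>
#include <vector>

using TileGrid = std::vector<std::vector<bool>>; // [row][column]

// Marks only every `period`-th row of the rectangle, starting at its top row.
void draw_striped_rect(
    TileGrid& grid, int left, int top, int w, int h, int period
) {
    assert(period >= 1);
    for (int y = top; y < (top + h); y += period) {
        for (int x = left; x < (left + w); x++) {
            grid[y][x] = true;
        }
    }
}

// Counts the set tiles inside the given rectangle.
int count_set(const TileGrid& grid, int left, int top, int w, int h) {
    int ret = 0;
    for (int y = top; y < (top + h); y++) {
        for (int x = left; x < (left + w); x++) {
            ret += (grid[y][x] ? 1 : 0);
        }
    }
    return ret;
}
```

For a 4×12-tile rectangle, a period of 4 sets 12 of the 48 tiles and leaves 75% blank, while a period of 3 sets 16 tiles and leaves roughly 66% blank.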
And the code only gets worse. While the player collision detection function is decompilable, it might as well not have been, because it's just more of the same "optimized", hard-to-follow assembly. With the four splittable 16-bit registers having a total of 20 different meanings in this function, I would have almost preferred self-modifying code…
In fact, it was so bad that it prompted some maintenance work on my inline
assembly coding standards as a whole. Turns out that the _asm keyword is not
only still supported in modern Visual Studio compilers, but also in Clang with
the -fms-extensions flag, and compiles fine
there even for 64-bit targets. While that might sound like amazing news at
first ("awesome, no need to rewrite this stuff for my x86_64 Linux
port!"), you quickly realize that almost all inline assembly in this
codebase assumes either PC-98 hardware, segmented 16-bit memory addressing,
or is a temporary hack that will be removed with further RE progress.
That's mainly because most of the raw arithmetic code uses Turbo C++'s
register pseudovariables where possible. While they certainly have their
drawbacks, being a non-standard extension that's not supported in other
x86-targeting C compilers, their advantages are quite significant: They
allow this code to stay in the same language, and provide slightly more
immediate portability to any other architecture, together with
readability and maintainability improvements that become even more noticeable when combined with inlining:
    // This one line compiles to five ASM instructions, which would need to be
    // spelled out in any C compiler that doesn't support register pseudovariables.
    // By adding typed aliases for these registers via `#define`, this code can be
    // both made even more readable, and be prepared for an easier transformation
    // into more portable local variables.
    _ES = (((_AX * 4) + _BX) + SEG_PLANE_B);
However, register pseudovariables become a portability issue as soon as they are mixed with inline assembly instructions that rely on their state. The lazy way of "supporting pseudo-registers" in other compilers would involve declaring the full set as global variables, which would immediately break every one of those instances:
    _DI = 0;
    _AX = 0xFFFF;

    // Special x86 instruction doing the equivalent of
    //
    //     *reinterpret_cast<uint16_t far *>(MK_FP(_ES, _DI)) = _AX;
    //     _DI += sizeof(uint16_t);
    //
    // Only generated by Turbo C++ in very specific cases, and therefore only
    // reliably available through inline assembly.
    asm { stosw; }
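For comparison, the store-and-advance behavior described in that comment (which is what x86's STOSW does while the direction flag is cleared) could be written without any pseudo-registers. This is only a sketch of one possible portable shape: `StoreCursor` and `store_word_and_advance()` are hypothetical names, and a `std::vector` stands in for the memory segment that ES would point to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the ES:DI register pair.
struct StoreCursor {
    std::vector<uint8_t>& segment; // memory that ES would address
    uint16_t di;                   // offset that DI would hold
};

// Writes a 16-bit value in little-endian byte order at the cursor, then
// advances the cursor by two bytes.
void store_word_and_advance(StoreCursor& c, uint16_t ax) {
    assert((static_cast<std::size_t>(c.di) + 1) < c.segment.size());
    c.segment[c.di + 0] = static_cast<uint8_t>(ax & 0xFF);
    c.segment[c.di + 1] = static_cast<uint8_t>(ax >> 8);
    c.di = static_cast<uint16_t>(c.di + sizeof(uint16_t));
}
```

Since the state now lives in explicit variables, nothing breaks silently if a compiler has no idea what `_DI` or `_AX` are supposed to be.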
What's also not all too standardized, though, are certain variants of the asm
keyword. That's why I've now introduced a distinction between the _asm keyword
for "decently sane" inline assembly, and the slightly less standard asm keyword
for inline assembly that relies on the contents of pseudo-registers, and should
break on compilers that don't support them.
So yeah, have some minor
portability work in exchange for these two pushes not having all that much
in RE'd content.
With that out of the way and the function deciphered, we can confirm the player hitboxes to be a constant 8×8 / 8×4 pixels, and prove that the hit stripes are nothing but an adequate optimization that doesn't affect gameplay in any way.
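The underlying argument is simple arithmetic, and can be sketched as follows. The sketch assumes stripes on every `period`-th tile row, with rows counted relative to the first stripe, and a hitbox that spans at least 4 tile rows (8 on-screen pixels at 2 pixels per tile). `window_hits_stripe()` is a hypothetical helper for this illustration, not a function from the game.

```cpp
#include <cassert>

// True if a window of `hitbox_h` tile rows starting at row `top` contains at
// least one row that is a multiple of `period`, i.e., a stripe.
bool window_hits_stripe(int top, int hitbox_h, int period) {
    assert(top >= 0 && hitbox_h >= 1 && period >= 1);
    // First stripe row at or below `top`.
    const int first_stripe = (((top + period - 1) / period) * period);
    return (first_stripe < (top + hitbox_h));
}
```

With a period of 4, a window of 4 rows hits a stripe at every possible position. A window of only 3 rows, matching the hypothetical ≤ 6 / 3 pixel hitbox from earlier, can fit in between: `window_hits_stripe(1, 3, 4)` is false.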
And what's the obvious thing to immediately do if you have both the collision bitmap and the player hitbox? Writing a "real hitbox" mod, of course:
- Reorder the calls to rendering functions so that player and shot sprites are rendered after bullets
- Blank out all player sprite pixels outside an 8×8 / 8×4 box around the center point
- After the bullet rendering function, turn on the GRCG in RMW mode and set the tile register set to the background color
- Stretch the negated contents of the collision bitmap onto each playfield, leaving only collidable pixels untouched
- Do the same with the actual, non-negated contents and a white color, for extra contrast against the background. This also makes sure to show any collidable areas whose sprite pixels are transparent, such as with the moon enemy. (Yeah, how unfair.)
Doing that also loses a lot of information about the playfield, such as enemy HP indicated by their color, but what can you do:
2022-02-18-TH03-real-hitbox.zip
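The "stretching" step of the mod boils down to expanding each collision bit into the two horizontal pixels its tile covers. Below is a rough sketch of that expansion for one byte of the bitmap; `stretch_2x()` and `background_mask()` are hypothetical names and not taken from the mod's branch, and the sketch assumes that set bits in the resulting mask select the pixels that receive the GRCG tile color.

```cpp
#include <cassert>
#include <cstdint>

// Doubles every bit of `tiles` horizontally: bit 7 becomes bits 15 and 14,
// bit 6 becomes bits 13 and 12, and so on down to bit 0.
uint16_t stretch_2x(uint8_t tiles) {
    uint16_t pixels = 0;
    for (int bit = 0; bit < 8; bit++) {
        if (tiles & (1 << bit)) {
            pixels |= static_cast<uint16_t>(0x3 << (bit * 2));
        }
    }
    return pixels;
}

// Negated variant: set bits mark the non-collidable pixels that would be
// overwritten with the background color.
uint16_t background_mask(uint8_t tiles) {
    return static_cast<uint16_t>(~stretch_2x(tiles));
}
```

No vertical stretching should be needed on top of that, since one tile row of 2×1 undoubled pixels corresponds to a single VRAM line that the hardware displays twice.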
The secret for writing such mods before having reached a sufficient level of
position independence? Put your new code segment into DGROUP, past the end of
the uninitialized data section. That's why this modded MAIN.EXE is a lot
larger than you would expect from the raw amount of new code: The file now
actually needs to store all these uninitialized 0 bytes between the end of
the data segment and the first instruction of the mod code. Normally, the
size of this region is simply a part of the MZ EXE header, and doesn't need
to be redundantly stored on disk. Check the th03_real_hitbox branch for the
code.
And now we know why so many "real hitbox" mods for the Windows Touhou games
are inaccurate: The games would simply be unplayable otherwise. Or can you
dodge rapidly moving 2×2 / 2×1 blocks as an 8×8 / 8×4 rectangle that is
smaller than your shot sprites, especially without focused movement? I can't.
Maybe it will feel more playable after making explosions visible, but that
would need more RE groundwork first.
It's also interesting how adding two full GRCG-accelerated redraws of both
playfields per frame doesn't significantly drop the game's frame rate. So
why did the drawing functions have to be micro-optimized again? The two
redraws could even be done in one pass by using the GRCG's TDW mode, which
should theoretically be 8× faster, but I have to stop somewhere.
Next up: The final missing piece of TH04's and TH05's bullet-moving code, which will include a certain other type of projectile as well.