⮜ Blog

⮜ List of tags

Showing all posts tagged
,
,
and

📝 Posted:
🚚 Summary of:
P0190, P0191, P0192
Commits:
5734815...293e16a, 293e16a...71cb7b5, 71cb7b5...e1f3f9f
💰 Funded by:
nrook, -Tom-, [Anonymous]
🏷 Tags:

The important things first:

So, Shinki! As far as final boss code is concerned, she's surprisingly economical, with 📝 her background animations making up more than ⅓ of her entire code. Going straight from TH01's 📝 final 📝 bosses to TH05's final boss definitely showed how much ZUN had streamlined danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there is still room for improvement: TH05 not only 📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month, but also uses them 4× as often, and even for midbosses. Most importantly though, defining danmaku patterns using a single global instance of the group template structure is just bad no matter how you look at it:

Declaring a separate structure instance with the static data for every pattern would be both safer and more space-efficient, and there's more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow. The "devil" pattern is significantly more complex than the others, but still far from TH01's final bosses at their worst. I especially like the clear architectural separation between "one-shot pattern" functions that return true once they're done, and "looping pattern" functions that run as long as they're being called from a boss's main function. Not many all too interesting things in these pattern functions for the most part, except for two pieces of evidence that Shinki was coded after Yumeko:


Speaking about that wing sprite: If you look at ST05.BB2 (or any other file with a large sprite, for that matter), you notice a rather weird file layout:

Raw file layout of TH05's ST05.BB2, demonstrating master.lib's supposed BFNT width limit of 64 pixels
A large sprite split into multiple smaller ones with a width of 64 pixels each? What's this, hardware sprite limitations? On my PC-98?!

And it's not a limitation of the sprite width field in the BFNT+ header either. Instead, it's master.lib's BFNT functions which are limited to sprite widths up to 64 pixels… or at least that's what MASTER.MAN claims. Whatever the restriction was, it seems to be completely nonexistent as of master.lib version 0.23, and none of the master.lib functions used by the games have any issues with larger sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the game that expects Shinki's winged form to consist of 4 physical sprites, not just 1. Any conversion from another, more logical sprite sheet layout back into BFNT+ must therefore replicate the original number of sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly loaded sprite no longer match ZUN's hardcoded IDs, causing the game to crash. This is exactly what used to happen with -Tom-'s MysticTK automation scripts, which combined these exact sprites into a single large one. This issue has now been fixed – just in case there are some underground modders out there who used these scripts and wonder why their game crashed as soon as the Shinki fight started.


And then the code quality takes a nosedive with Shinki's main function. :onricdennat: Even in TH05, these boss and midboss update functions are still very imperative:

The biggest WTF in there, however, goes to using one of the 16 state bytes as a "relative phase" variable for differentiating between boss phases that share the same branch within the switch(boss.phase) statement. While it's commendable that ZUN tried to reduce code duplication for once, he could have just branched depending on the actual boss.phase variable? The same state byte is then reused in the "devil" pattern to track the activity state of the big jerky lasers in the second half of the pattern. If you somehow managed to end the phase after the first few bullets of the pattern, but before these lasers are up, Shinki's update function would think that you're still in the phase before the "devil" pattern. The main function then sequence-breaks right to the defeat phase, skipping the final pattern with the burning Makai background. Luckily, the HP boundaries are far away enough to make this impossible in practice.
The takeaway here: If you want to use the state bytes for your custom boss script mods, alias them to your own 16-byte structure, and limit each of the bytes to a clearly defined meaning across your entire boss script.

One final discovery that doesn't seem to be documented anywhere yet: Shinki actually has a hidden bomb shield during her two purple-wing phases. uth05win got this part slightly wrong though: It's not a complete shield, and hitting Shinki will still deal 1 point of chip damage per frame. For comparison, the first phase lasts for 3,000 HP, and the "devil" pattern phase lasts for 5,800 HP.

And there we go, 3rd PC-98 Touhou boss script* decompiled, 28 to go! 🎉 In case you were expecting a fix for the Shinki death glitch: That one is more appropriately fixed as part of the Mai & Yuki script. It also requires new code, should ideally look a bit prettier than just removing cheetos between one frame and the next, and I'd still like it to fit within the original position-dependent code layout… Let's do that some other time.
Not much to say about the Stage 1 midboss, or midbosses in general even, except that their update functions have to imperatively handle even more subsystems, due to the relative lack of helper functions.


The remaining ¾ of the third push went to a bunch of smaller RE and finalization work that would have hardly got any attention otherwise, to help secure that 50% RE mark. The nicest piece of code in there shows off what looks like the optimal way of setting up the 📝 GRCG tile register for monochrome blitting in a variable color:

mov ah, palette_index ; Any other non-AL 8-bit register works too.
                      ; (x86 only supports AL as the source operand for OUTs.)

rept 4                ; For all 4 bitplanes…
    shr ah,  1        ; Shift the next color bit into the x86 carry flag
    sbb al,  al       ; Extend the carry flag to a full byte
                      ; (CF=0 → 0x00, CF=1 → 0xFF)
    out 7Eh, al       ; Write AL to the GRCG tile register
endm

Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles into a surprisingly nice one-liner. What a beautiful micro-optimization, at a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there, becoming increasingly dumb and undecompilable. Was it really necessary to save 4 x86 instructions in the highly unlikely case of a new spark sprite being spawned outside the playfield? That one 2D polar→Cartesian conversion function then pointed out Turbo C++ 4.0J's woefully limited support for 32-bit micro-optimizations. The code generation for 32-bit 📝 pseudo-registers is so bad that they almost aren't worth using for arithmetic operations, and the inline assembler just flat out doesn't support anything 32-bit. No use in decompiling a function that you'd have to entirely spell out in machine code, especially if the same function already exists in multiple other, more idiomatic C++ variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC replay file reading code, which should finally prove that nothing about the game's original replay system could serve as even just the foundation for community-usable replays. Just in case anyone was still thinking that.


Next up: Back to TH01, with the Elis fight! Got a bit of room left in the cap again, and there are a lot of things that would make a lot of sense now:

📝 Posted:
🚚 Summary of:
P0182, P0183
Commits:
313450f...1e2c7ad, 1e2c7ad...f9d983e
💰 Funded by:
Lmocinemod, [Anonymous], Yanga
🏷 Tags:

Been 📝 a while since we last looked at any of TH03's game code! But before that, we need to talk about Y coordinates.

During TH03's MAIN.EXE, the PC-98 graphics GDC runs in its line-doubled 640×200 resolution, which gives the in-game portion its distinctive stretched low-res look. This lower resolution is a consequence of using 📝 Promisence Soft's SPRITE16 driver: Its performance simply stems from the fact that it expects sprites to be stored in the bottom half of VRAM, which allows them to be blitted using the same EGC-accelerated VRAM-to-VRAM copies we've seen again and again in all other games. Reducing the visible resolution also means that the sprites can be stored on both VRAM pages, allowing the game to still be double-buffered. If you force the graphics chip to run at 640×400, you can see them:

The full VRAM contents during TH03's in-game portion, as seen when forcing the system into a 640×400 resolution.
TH03's VRAM at regular line-doubled 640×200 resolutionTH03's VRAM at full 640×400 resolution, including the SPRITE16 sprite areaTH03's text layer during an in-game round.

Note that the text chip still displays its overlaid contents at 640×400, which means that TH03's in-game portion technically runs at two resolutions at the same time.

But that means that any mention of a Y coordinate is ambiguous: Does it refer to undoubled VRAM pixels, or on-screen stretched pixels? Especially people who have known about the line doubling for years might almost expect technical blog posts on this game to use undoubled VRAM coordinates. So, let's introduce a new formatting convention for both on-screen 640×400 and undoubled 640×200 coordinates, and always write out both to minimize the confusion.


Alright, now what's the thing gonna be? The enemy structure is highly overloaded, being used for enemies, fireballs, and explosions with seemingly different semantics for each. Maybe a bit too much to be figured out in what should ideally be a single push, especially with all the functions that would need to be decompiled? Bullet code would be easier, but not exactly single-push material either. As it turns out though, there's something more fundamental left to be done first, which both of these subsystems depend on: collision detection!

And it's implemented exactly how I always naively imagined collision detection to be implemented in a fixed-resolution 2D bullet hell game with small hitboxes: By keeping a separate 1bpp bitmap of both playfields in memory, drawing in the collidable regions of all entities on every frame, and then checking whether any pixels at the current location of the player's hitbox are set to 1. It's probably not done in the other games because their single data segment was already too packed for the necessary 17,664 bytes to store such a bitmap at pixel resolution, and 282,624 bytes for a bitmap at Q12.4 subpixel resolution would have been prohibitively expensive in 16-bit Real Mode DOS anyway. In TH03, on the other hand, this bitmap is doubly useful, as the AI also uses it to elegantly learn what's on the playfield. By halving the resolution and only tracking tiles of 2×2 / 2×1 pixels, TH03 only requires an adequate total of 6,624 bytes of memory for the collision bitmaps of both playfields.

So how did the implementation not earn the good-code tag this time? Because the code for drawing into these bitmaps is undecompilable hand-written x86 assembly. :zunpet: And not just your usual ASM that was basically compiled from C and then edited to maybe optimize register allocation and maybe replace a bunch of local variables with self-modifying code, oh no. This code is full of overly clever bit twiddling, abusing the fact that the 16-bit AX, BX, CX, and DX registers can also be accessed as two 8-bit registers, calculations that change the semantic meaning behind the value of a register, or just straight-up reassignments of different values to the same small set of registers. Sure, in some way it is impressive, and it all does work and correctly covers every edge case, but come on. This could have all been a lot more readable in exchange for just a few CPU cycles.

What's most interesting though are the actual shapes that these functions draw into the collision bitmap. On the surface, we have:

  1. vertical slopes at any angle across the whole playfield; exclusively used for Chiyuri's diagonal laser EX attack
  2. straight vertical lines, with a width of 1 tile; exclusively used for the 2×2 / 2×1 hitboxes of bullets
  3. rectangles at arbitrary sizes

But only 2) actually draws a full solid line. 1) and 3) are only ever drawn as horizontal stripes, with a hardcoded distance of 2 vertical tiles between every stripe of a slope, and 4 vertical tiles between every stripe of a rectangle. That's 66-75% of each rectangular entity's intended hitbox not actually taking part in collision detection. Now, if player hitboxes were ≤ 6 / 3 pixels, we'd have one possible explanation of how the AI can "cheat", because it could just precisely move through those blank regions at TAS speeds. So, let's make this two pushes after all and tell the complete story, since this is one of the more interesting aspects to still be documented in this game.


And the code only gets worse. :godzun: While the player collision detection function is decompilable, it might as well not have been, because it's just more of the same "optimized", hard-to-follow assembly. With the four splittable 16-bit registers having a total of 20 different meanings in this function, I would have almost preferred self-modifying code…

In fact, it was so bad that it prompted some maintenance work on my inline assembly coding standards as a whole. Turns out that the _asm keyword is not only still supported in modern Visual Studio compilers, but also in Clang with the -fms-extensions flag, and compiles fine there even for 64-bit targets. While that might sound like amazing news at first ("awesome, no need to rewrite this stuff for my x86_64 Linux port!"), you quickly realize that almost all inline assembly in this codebase assumes either PC-98 hardware, segmented 16-bit memory addressing, or is a temporary hack that will be removed with further RE progress.
That's mainly because most of the raw arithmetic code uses Turbo C++'s register pseudovariables where possible. While they certainly have their drawbacks, being a non-standard extension that's not supported in other x86-targeting C compilers, their advantages are quite significant: They allow this code to stay in the same language, and provide slightly more immediate portability to any other architecture, together with 📝 readability and maintainability improvements that can get quite significant when combined with inlining:

// This one line compiles to five ASM instructions, which would need to be
// spelled out in any C compiler that doesn't support register pseudovariables.
// By adding typed aliases for these registers via `#define`, this code can be
// both made even more readable, and be prepared for an easier transformation
// into more portable local variables.
_ES = (((_AX * 4) + _BX) + SEG_PLANE_B);

However, register pseudovariables might cause potential portability issues as soon as they are mixed with inline assembly instructions that rely on their state. The lazy way of "supporting pseudo-registers" in other compilers would involve declaring the full set as global variables, which would immediately break every one of those instances:

_DI = 0;
_AX = 0xFFFF;

// Special x86 instruction doing the equivalent of
//
// 	*reinterpret_cast(MK_FP(_ES, _DI)) = _AX;
// 	_DI += sizeof(uint16_t);
//
// Only generated by Turbo C++ in very specific cases, and therefore only
// reliably available through inline assembly.
asm { movsw; }

What's also not all too standardized, though, are certain variants of the asm keyword. That's why I've now introduced a distinction between the _asm keyword for "decently sane" inline assembly, and the slightly less standard asm keyword for inline assembly that relies on the contents of pseudo-registers, and should break on compilers that don't support them.
So yeah, have some minor portability work in exchange for these two pushes not having all that much in RE'd content.

With that out of the way and the function deciphered, we can confirm the player hitboxes to be a constant 8×8 / 8×4 pixels, and prove that the hit stripes are nothing but an adequate optimization that doesn't affect gameplay in any way.


And what's the obvious thing to immediately do if you have both the collision bitmap and the player hitbox? Writing a "real hitbox" mod, of course:

  1. Reorder the calls to rendering functions so that player and shot sprites are rendered after bullets
  2. Blank out all player sprite pixels outside an 8×8 / 8×4 box around the center point
  3. After the bullet rendering function, turn on the GRCG in RMW mode and set the tile register set to the background color
  4. Stretch the negated contents of collision bitmap onto each playfield, leaving only collidable pixels untouched
  5. Do the same with the actual, non-negated contents and a white color, for extra contrast against the background. This also makes sure to show any collidable areas whose sprite pixels are transparent, such as with the moon enemy. (Yeah, how unfair.) Doing that also loses a lot of information about the playfield, such as enemy HP indicated by their color, but what can you do:
A decently busy TH03 in-game frame.The underlying content of the collision bitmap, showing off all three different shapes together with the player hitboxes.
A decently busy TH03 in-game frame and its underlying collision bitmap, showing off all three different collision shapes together with the player hitboxes.

2022-02-18-TH03-real-hitbox.zip The secret for writing such mods before having reached a sufficient level of position independence? Put your new code segment into DGROUP, past the end of the uninitialized data section. That's why this modded MAIN.EXE is a lot larger than you would expect from the raw amount of new code: The file now actually needs to store all these uninitialized 0 bytes between the end of the data segment and the first instruction of the mod code – normally, this number is simply a part of the MZ EXE header, and doesn't need to be redundantly stored on disk. Check the th03_real_hitbox branch for the code.

And now we know why so many "real hitbox" mods for the Windows Touhou games are inaccurate: The games would simply be unplayable otherwise – or can you dodge rapidly moving 2×2 / 2×1 blocks as an 8×8 / 8×4 rectangle that is smaller than your shot sprites, especially without focused movement? I can't. :tannedcirno: Maybe it will feel more playable after making explosions visible, but that would need more RE groundwork first.
It's also interesting how adding two full GRCG-accelerated redraws of both playfields per frame doesn't significantly drop the game's frame rate – so why did the drawing functions have to be micro-optimized again? It would be possible in one pass by using the GRCG's TDW mode, which should theoretically be 8× faster, but I have to stop somewhere. :onricdennat:

Next up: The final missing piece of TH04's and TH05's bullet-moving code, which will include a certain other type of projectile as well.

📝 Posted:
🚚 Summary of:
P0126, P0127
Commits:
6c22af7...8b01657, 8b01657...dc65b59
💰 Funded by:
Blue Bolt, [Anonymous]
🏷 Tags:

Alright, back to continuing the master.hpp transition started in P0124, and repaying technical debt. The last blog post already announced some ridiculous decompilations… and in fact, not a single one of the functions in these two pushes was decompilable into idiomatic C/C++ code.

As usual, that didn't keep me from trying though. The TH04 and TH05 version of the infamous 16-pixel-aligned, EGC-accelerated rectangle blitting function from page 1 to page 0 was fairly average as far as unreasonable decompilations are concerned.
The big blocker in TH03's MAIN.EXE, however, turned out to be the .MRS functions, used to render the gauge attack portraits and bomb backgrounds. The blitting code there uses the additional FS and GS segment registers provided by the Intel 386… which

  1. are not supported by Turbo C++'s inline assembler, and
  2. can't be turned into pointers, due to a compiler bug in Turbo C++ that generates wrong segment prefix opcodes for the _FS and _GS pseudo-registers.

Apparently I'm the first one to even try doing that with this compiler? I haven't found any other mention of this bug…
Compiling via assembly (#pragma inline) would work around this bug and generate the correct instructions. But that would incur yet another dependency on a 16-bit TASM, for something honestly quite insignificant.

What we can always do, however, is using __emit__() to simply output x86 opcodes anywhere in a function. Unlike spelled-out inline assembly, that can even be used in helper functions that are supposed to inline… which does in fact allow us to fully abstract away this compiler bug. Regular if() comparisons with pseudo-registers wouldn't inline, but "converting" them into C++ template function specializations does. All that's left is some C preprocessor abuse to turn the pseudo-registers into types, and then we do retain a normal-looking poke() call in the blitting functions in the end. 🤯

Yeah… the result is batshit insane. I may have gone too far in a few places…


One might certainly argue that all these ridiculous decompilations actually hurt the preservation angle of this project. "Clearly, ZUN couldn't have possibly written such unreasonable C++ code. So why pretend he did, and not just keep it all in its more natural ASM form?" Well, there are several reasons:

Unfortunately, these pushes also demonstrated a second disadvantage in trying to decompile everything possible: Since Turbo C++ lacks TASM's fine-grained ability to enforce code alignment on certain multiples of bytes, it might actually be unfeasible to link in a C-compiled object file at its intended original position in some of the .EXE files it's used in. Which… you're only going to notice once you encounter such a case. Due to the slightly jumbled order of functions in the 📝 second, shared code segment, that might be long after you decompiled and successfully linked in the function everywhere else.

And then you'll have to throw away that decompilation after all 😕 Oh well. In this specific case (the lookup table generator for horizontally flipping images), that decompilation was a mess anyway, and probably helped nobody. I could have added a dummy .OBJ that does nothing but enforce the needed 2-byte alignment before the function if I really insisted on keeping the C version, but it really wasn't worth it.


Now that I've also described yet another meta-issue, maybe there'll really be nothing to say about the next technical debt pushes? :onricdennat: Next up though: Back to actual progress again, with TH01. Which maybe even ends up pushing that game over the 50% RE mark?