⮜ Blog

⮜ List of tags

Showing all posts tagged
,
,
and

📝 Posted:
🚚 Summary of:
P0235, P0236, P0237
Commits:
e7a9262...62c4b7f, 62c4b7f...7fa9038, 7fa9038...c5e51e6
💰 Funded by:
Ember2528, Yanga
🏷 Tags:

So, TH02! Being the only game whose main binary hadn't seen any dedicated attention ever, we get to start the TH02-related blog posts at the very beginning with the most foundational pieces of code. The stage tile system is the best place to start here: It not only blocks every entity that is rendered on top of these tiles, but is curiously placed right next to master.lib code in TH02, and would need to be separated out into its own translation unit before we can do the same with all the master.lib functions.

In late 2018, I already RE'd 📝 TH04's and TH05's stage tile implementation, but haven't properly documented it on this blog yet, so this post is also going to include the details that are unique to those games. On a high level, the stage tile system works identically in all three games:

The differences between the three games can best be summarized in a table:

:th02: TH02 :th04: TH04 :th05: TH05
Tile image file extension .MPN
Tile section format .MAP
Tile section order defined as part of .DT1 .STD
Tile section index format 0-based ID 0-based ID × 2
Tile image index format Index between 0 and 100, 1 byte VRAM offset in tile source area, 2 bytes
Scroll speed control Hardcoded Part of the .STD format, defined per referenced tile section
Redraw granularity Full tiles (16×16) Half tiles (16×8)
Rows per tile section 8 5
Maximum number of tile sections 16 32
Lowest number of tile sections used 5 (Stage 3 / Extra) 8 (Stage 6) 11 (Stage 2 / 4)
Highest number of tile sections used 13 (Stage 4) 19 (Extra) 24 (Stage 3)
Maximum length of a map 320 sections (static buffer) 256 sections (format limitation)
Shortest map 14 sections (Stage 5) 20 sections (Stage 5) 15 sections (Stage 2)
Longest map 143 sections (Stage 4) 95 sections (Stage 4) 40 sections (Stage 1 / 4 / Extra)

The most interesting part about stage tiles is probably the fact that some of the .MAP files contain unused tile sections. 👀 Many of these are empty, duplicates, or don't really make sense, but a few are unique, fit naturally into their respective stage, and might have been part of the map during development. In TH02, we can find three unused sections in Stage 5:

Section 0 of TH02's STAGE4.MAPSection 1 of TH02's STAGE4.MAPSection 2 of TH02's STAGE4.MAPSection 3 of TH02's STAGE4.MAPSection 4 of TH02's STAGE4.MAPSection 5 of TH02's STAGE4.MAPSection 6 of TH02's STAGE4.MAPSection 7 of TH02's STAGE4.MAP
The non-empty tile sections defined in TH02's STAGE4.MAP, showing off three unused ones.
These unused tile sections are much more common in the later games though, where we can find them in TH04's Stage 3, 4, and 5, and TH05's Stage 1, 2, and 4. I'll document those once I get to finalize the tile rendering code of these games, to leave some more content for that blog post. TH04/TH05 tile code would be quite an effective investment of your money in general, as most of it is identical across both games. Or how about going for a full-on PC-98 Touhou map viewer and editor GUI?


Compared to TH04 and TH05, TH02's stage tile code definitely feels like ZUN was just starting to understand how to pull off smooth vertical scrolling on a PC-98. As such, it comes with a few inefficiencies and suboptimal implementation choices:

Even though this was ZUN's first attempt at scrolling tiles, he already saw it fit to write most of the code in assembly. This was probably a reaction to all of TH01's performance issues, and the frame rate reduction workarounds he implemented to keep the game from slowing down too much in busy places. "If TH01 was all C++ and slow, TH02 better contain more ASM code, and then it will be fast, right?" :zunpet:
Another reason for going with ASM might be found in the kind of documentation that may have been available to ZUN. Last year, the PC-98 community discovered and scanned two new game programming tutorial books from 1991 (1, 2). Their example code is not only entirely written in assembly, but restricts itself to the bare minimum of x86 instructions that were available on the 8086 CPU used by the original PC-9801 model 9 years earlier. Such code is not only suboptimal on the 486, but can often be actually worse than what your C++ compiler would generate. TH02 is where the trend of bad hand-written ASM code started, and it 📝 only intensified in ZUN's later games. So, don't copy code from these books unless you absolutely want to target the earlier 8086 and 286 models. Which, 📝 as we've gathered from the recent blitting benchmark results, are not all too common among current real-hardware owners.
That said, all that ASM code really only impacts readability and maintainability. Apart from the aforementioned issues, the algorithms themselves are mostly fine – especially since most EGC and GRCG operations are decently batched this time around, in contrast to TH01.


Luckily, the tile functions merely use inline assembly within a typical C function and can therefore be at least part of a C++ source file, even if the result is pretty ugly. This time, we can actually be sure that they weren't written directly in a .ASM file, because they feature x86 instruction encodings that can only be generated with Turbo C++ 4.0J's inline assembler, not with TASM. The same can't unfortunately be said about the following function in the same segment, which marks the tiles covered by the spark sprites for redrawing. In this one, it took just one dumb hand-written ASM inconsistency in the function's epilog to make the entire function undecompilable.
The standard x86 instruction sequence to set up a stack frame in a function prolog looks like this:

PUSH	BP
MOV 	BP, SP
SUB 	SP, ?? ; if the function needs the stack for local variables
When compiling without optimizations, Turbo C++ 4.0J will replace this sequence with a single ENTER instruction. That one is two bytes smaller, but much slower on every x86 CPU except for the 80186 where it was introduced.
In functions without local variables, BP and SP remain identical, and a single POP BP is all that's needed in the epilog to tear down such a stack frame before returning from the function. Otherwise, the function needs an additional MOV SP, BP instruction to pop all local variables. With x86 being the helpful CISC architecture that it is, the 80186 also introduced the LEAVE instruction to perform both tasks. Unlike ENTER, this single instruction is faster than the raw two instructions on a lot of x86 CPUs (and even current ones!), and it's always smaller, taking up just 1 byte instead of 3.
So what if you use LEAVE even if your function doesn't use local variables? :thonk: The fact that the instruction first does the equivalent of MOV SP, BP doesn't matter if these registers are identical, and who cares about the additional CPU cycles of LEAVE compared to just POP BP, right? So that's definitely something you could theoretically do, but not something that any compiler would ever generate.

And so, TH02 MAIN.EXE decompilation already hits the first brick wall after two pushes. Awesome! :godzun: Theoretically, we could slowly mash through this wall using the 📝 code generator. But having such an inconsistency in the function epilog would mean that we'd have to keep Turbo C++ 4.0J from emitting any epilog or prolog code so that we can write our own. This means that we'd once again have to hide any use of the SI and DI registers from the compiler… and doing that requires code generation macros for 22 of the 49 instructions of the function in question, almost none of which we currently have. So, this gets quite silly quite fast, especially if we only need to do it for one single byte.

Instead, wouldn't it be much better if we had a separate build step between compile and link time that allowed us to replicate mistakes like these by just patching the compiled .OBJ files? These files still contain the names of exported functions for linking, which would allow us to look up the code of a function in a robust manner, navigate to specific instructions using a disassembler, replace them, and write the modified .OBJ back to disk before linking. Such a system could then naturally expand to cover all other decompilation issues, culminating in a full-on optimizer that could even recreate ZUN's self-modifying code. At that point, we would have sealed away all of ZUN's ugly ASM code within a separate build step, and could finally decompile everything into readable C++.

Pulling that off would require a significant tooling investment though. Patching that one byte in TH02's spark invalidation function could be done within 1 or 2 pushes, but that's just one issue, and we currently have 32 other .ASM files with undecompilable code. Also, note that this is fundamentally different from what we're doing with the debloated branch and the Anniversary Editions. Mistake patching would purely be about having readable code on master that compiles into ZUN's exact binaries, without fixing weird code. The Anniversary Editions go much further and rewrite such code in a much more fundamental way, improving it further than mistake patching ever could.
Right now, the Anniversary Editions seem much more popular, which suggests that people just want 100% RE as fast as possible so that I can start working on them. In that case, why bother with such undecompilable functions, and not just leave them in raw and unreadable x86 opcode form if necessary… :tannedcirno: But let's first see how much backer support there actually is for mistake patching before falling back on that.

The best part though: Once we've made a decision and then covered TH02's spark and particle systems, that was it, and we will have already RE'd all ZUN-written PC-98-specific blitting code in this game. Every further sprite or shape is rendered via master.lib, and is thus decently abstracted. Guess I'll need to update 📝 the assessment of which PC-98 Touhou game is the easiest to port, because it sure isn't TH01, as we've seen with all the work required for the first Anniversary Edition build.


Until then, there are still enough parts of the game that don't use any of the remaining few functions in the _TEXT segment. Previously, I mentioned in the 📝 status overview blog post that TH02 had a seemingly weird sprite system, but the spark and point popup (〇一二三四五六七八九十×÷) structures showed that the game just stores the current and previous position of its entities in a slightly different way compared to the rest of PC-98 Touhou. Instead of having dedicated structure fields, TH02 uses two-element arrays indexed with the active VRAM page. Same thing, and such a pattern even helps during RE since it's easy to spot once you know what to look for.
There's not much to criticize about the point popup system, except for maybe a landmine that causes sprite glitches when trying to display more than 99,990 points. Sadly, the final push in this delivery was rounded out by yet another piece of code at the opposite end of the quality spectrum. The particle and smear effects for Reimu's bomb animations consist almost entirely of assembly bloat, which would just be replaced with generic calls to the generic blitter in this game's future Anniversary Edition.

If I continue to decompile TH02 while avoiding the brick wall, items would be next, but they probably require two pushes. Next up, therefore: Integrating Stripe as an alternative payment provider into the order form. There have been at least three people who reported issues with PayPal, and Stripe has been working much better in tests. In the meantime, here's a temporary Stripe order link for everyone. This one is not connected to the cap yet, so please make sure to stay within whatever value is currently shown on the front page – I will treat any excess money as donations. :onricdennat: If there's some time left afterward, I might also add some small improvements to the TH01 Anniversary Edition.

📝 Posted:
🚚 Summary of:
P0190, P0191, P0192
Commits:
5734815...293e16a, 293e16a...71cb7b5, 71cb7b5...e1f3f9f
💰 Funded by:
nrook, -Tom-, [Anonymous]
🏷 Tags:

The important things first:

So, Shinki! As far as final boss code is concerned, she's surprisingly economical, with 📝 her background animations making up more than ⅓ of her entire code. Going straight from TH01's 📝 final 📝 bosses to TH05's final boss definitely showed how much ZUN had streamlined danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there is still room for improvement: TH05 not only 📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month, but also uses them 4× as often, and even for midbosses. Most importantly though, defining danmaku patterns using a single global instance of the group template structure is just bad no matter how you look at it:

Declaring a separate structure instance with the static data for every pattern would be both safer and more space-efficient, and there's more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow. The "devil" pattern is significantly more complex than the others, but still far from TH01's final bosses at their worst. I especially like the clear architectural separation between "one-shot pattern" functions that return true once they're done, and "looping pattern" functions that run as long as they're being called from a boss's main function. Not many all too interesting things in these pattern functions for the most part, except for two pieces of evidence that Shinki was coded after Yumeko:


Speaking about that wing sprite: If you look at ST05.BB2 (or any other file with a large sprite, for that matter), you notice a rather weird file layout:

Raw file layout of TH05's ST05.BB2, demonstrating master.lib's supposed BFNT width limit of 64 pixels
A large sprite split into multiple smaller ones with a width of 64 pixels each? What's this, hardware sprite limitations? On my PC-98?!

And it's not a limitation of the sprite width field in the BFNT+ header either. Instead, it's master.lib's BFNT functions which are limited to sprite widths up to 64 pixels… or at least that's what MASTER.MAN claims. Whatever the restriction was, it seems to be completely nonexistent as of master.lib version 0.23, and none of the master.lib functions used by the games have any issues with larger sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the game that expects Shinki's winged form to consist of 4 physical sprites, not just 1. Any conversion from another, more logical sprite sheet layout back into BFNT+ must therefore replicate the original number of sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly loaded sprite no longer match ZUN's hardcoded IDs, causing the game to crash. This is exactly what used to happen with -Tom-'s MysticTK automation scripts, which combined these exact sprites into a single large one. This issue has now been fixed – just in case there are some underground modders out there who used these scripts and wonder why their game crashed as soon as the Shinki fight started.


And then the code quality takes a nosedive with Shinki's main function. :onricdennat: Even in TH05, these boss and midboss update functions are still very imperative:

The biggest WTF in there, however, goes to using one of the 16 state bytes as a "relative phase" variable for differentiating between boss phases that share the same branch within the switch(boss.phase) statement. While it's commendable that ZUN tried to reduce code duplication for once, he could have just branched depending on the actual boss.phase variable? The same state byte is then reused in the "devil" pattern to track the activity state of the big jerky lasers in the second half of the pattern. If you somehow managed to end the phase after the first few bullets of the pattern, but before these lasers are up, Shinki's update function would think that you're still in the phase before the "devil" pattern. The main function then sequence-breaks right to the defeat phase, skipping the final pattern with the burning Makai background. Luckily, the HP boundaries are far away enough to make this impossible in practice.
The takeaway here: If you want to use the state bytes for your custom boss script mods, alias them to your own 16-byte structure, and limit each of the bytes to a clearly defined meaning across your entire boss script.

One final discovery that doesn't seem to be documented anywhere yet: Shinki actually has a hidden bomb shield during her two purple-wing phases. uth05win got this part slightly wrong though: It's not a complete shield, and hitting Shinki will still deal 1 point of chip damage per frame. For comparison, the first phase lasts for 3,000 HP, and the "devil" pattern phase lasts for 5,800 HP.

And there we go, 3rd PC-98 Touhou boss script* decompiled, 28 to go! 🎉 In case you were expecting a fix for the Shinki death glitch: That one is more appropriately fixed as part of the Mai & Yuki script. It also requires new code, should ideally look a bit prettier than just removing cheetos between one frame and the next, and I'd still like it to fit within the original position-dependent code layout… Let's do that some other time.
Not much to say about the Stage 1 midboss, or midbosses in general even, except that their update functions have to imperatively handle even more subsystems, due to the relative lack of helper functions.


The remaining ¾ of the third push went to a bunch of smaller RE and finalization work that would have hardly got any attention otherwise, to help secure that 50% RE mark. The nicest piece of code in there shows off what looks like the optimal way of setting up the 📝 GRCG tile register for monochrome blitting in a variable color:

mov ah, palette_index ; Any other non-AL 8-bit register works too.
                      ; (x86 only supports AL as the source operand for OUTs.)

rept 4                ; For all 4 bitplanes…
    shr ah,  1        ; Shift the next color bit into the x86 carry flag
    sbb al,  al       ; Extend the carry flag to a full byte
                      ; (CF=0 → 0x00, CF=1 → 0xFF)
    out 7Eh, al       ; Write AL to the GRCG tile register
endm

Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles into a surprisingly nice one-liner. What a beautiful micro-optimization, at a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there, becoming increasingly dumb and undecompilable. Was it really necessary to save 4 x86 instructions in the highly unlikely case of a new spark sprite being spawned outside the playfield? That one 2D polar→Cartesian conversion function then pointed out Turbo C++ 4.0J's woefully limited support for 32-bit micro-optimizations. The code generation for 32-bit 📝 pseudo-registers is so bad that they almost aren't worth using for arithmetic operations, and the inline assembler just flat out doesn't support anything 32-bit. No use in decompiling a function that you'd have to entirely spell out in machine code, especially if the same function already exists in multiple other, more idiomatic C++ variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC replay file reading code, which should finally prove that nothing about the game's original replay system could serve as even just the foundation for community-usable replays. Just in case anyone was still thinking that.


Next up: Back to TH01, with the Elis fight! Got a bit of room left in the cap again, and there are a lot of things that would make a lot of sense now:

📝 Posted:
🚚 Summary of:
P0135, P0136
Commits:
a6eed55...252c13d, 252c13d...07bfcf2
💰 Funded by:
[Anonymous]
🏷 Tags:

Alright, no more big code maintenance tasks that absolutely need to be done right now. Time to really focus on parts 6 and 7 of repaying technical debt, right? Except that we don't get to speed up just yet, as TH05's barely decompilable PMD file loading function is rather… complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98 Touhou, I first consult the disassembly of Wolfenstein 3D. That game was originally compiled with the quite similar Borland C++ 3.0, so it's quite helpful to compare its ASM to the officially released source code. If I find the instructions in question, they mostly come from that game's ASM code, leading to the amusing realization that "even John Carmack was unable to get these instructions out of this compiler" :onricdennat: This time though, Wolfenstein 3D did point me to Borland's intrinsics for common C functions like memcpy() and strchr(), available via #pragma intrinsic. Bu~t those unfortunately still generate worse code than what ZUN micro-optimized here. Commenting how these sequences of instructions should look in C is unfortunately all I could do here.
The conditional branches in this function did compile quite nicely though, clarifying the control flow, and clearly exposing a ZUN bug: TH05's snd_load() will hang in an infinite loop when trying to load a non-existing -86 BGM file (with a .M2 extension) if the corresponding -26 BGM file (with a .M extension) doesn't exist either.

Unsurprisingly, the PMD channel monitoring code in TH05's Music Room remains undecompilable outside the two most "high-level" initialization and rendering functions. And it's not because there's data in the middle of the code segment – that would have actually been possible with some #pragmas to ensure that the data and code segments have the same name. As soon as the SI and DI registers are referenced anywhere, Turbo C++ insists on emitting prolog code to save these on the stack at the beginning of the function, and epilog code to restore them from there before returning. Found that out in September 2019, and confirmed that there's no way around it. All the small helper functions here are quite simply too optimized, throwing away any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate that I do try.


Within that same 6th push though, we've finally reached the one function in TH05 that was blocking further progress in TH04, allowing that game to finally catch up with the others in terms of separated translation units. Feels good to finally delete more of those .ASM files we've decompiled a while ago… finally!

But since that was just getting started, the most satisfying development in both of these pushes actually came from some more experiments with macros and inline functions for near-ASM code. By adding "unused" dummy parameters for all relevant registers, the exact input registers are made more explicit, which might help future port authors who then maybe wouldn't have to look them up in an x86 instruction reference quite as often. At its best, this even allows us to declare certain functions with the __fastcall convention and express their parameter lists as regular C, with no additional pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even more amazing than previously thought when it comes to returning pseudo-registers from inline functions. A nice example for how this can improve readability can be found in this piece of TH02 code for polling the PC-98 keyboard state using a BIOS interrupt:

inline uint8_t keygroup_sense(uint8_t group) {
	_AL = group;
	_AH = 0x04;
	geninterrupt(0x18);
	// This turns the output register of this BIOS call into the return value
	// of this function. Surprisingly enough, this does *not* naively generate
	// the `MOV AL, AH` instruction you might expect here!
	return _AH;
}

void input_sense(void)
{
	// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
	// never emits as such, giving us only the three instructions we need.
	_AH = keygroup_sense(8);

	// Whereas this one gives us the one additional `MOV BH, AH` instruction
	// we'd expect, and nothing more.
	_BH = keygroup_sense(7);

	// And now it's obvious what both of these registers contain, from just
	// the assignments above.
	if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
		key_det |= INPUT_UP;
	}
	// […]
}

I love it. No inline assembly, as close to idiomatic C code as something like this is going to get, yet still compiling into the minimum possible number of x86 instructions on even a 1994 compiler. This is how I keep this project interesting for myself during chores like these. :tannedcirno: We might have even reached peak inline already?

And that's 65% of technical debt in the SHARED segment repaid so far. Next up: Two more of these, which might already complete that segment? Finally!

📝 Posted:
🚚 Summary of:
P0133
Commits:
045450c...1d5db71
💰 Funded by:
[Anonymous]
🏷 Tags:

Wow, 31 commits in a single push? Well, what the last push had in progress, this one had in maintenance. The 📝 master.lib header transition absolutely had to be completed in this one, for my own sanity. And indeed, it reduced the build time for the entirety of ReC98 to about 27 seconds on my system, just as expected in the original announcement. Looking forward to even faster build times with the upcoming #include improvements I've got up my sleeve! The port authors of the future are going to appreciate those quite a bit.

As for the new translation units, the funniest one is probably TH05's function for blitting the 1-color .CDG images used for the main menu options. Which is so optimized that it becomes decompilable again, by ditching the self-modifying code of its TH04 counterpart in favor of simply making better use of CPU registers. The resulting C code is still a mess, but what can you do. :tannedcirno:
This was followed by even more TH05 functions that clearly weren't compiled from C, as evidenced by their padding bytes. It's about time I've documented my lack of ideas of how to get those out of Turbo C++. :onricdennat:

And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…

In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?

📝 Posted:
🚚 Summary of:
P0031, P0032, P0033
Commits:
dea40ad...9f764fa, 9f764fa...e6294c2, e6294c2...6cdd229
💰 Funded by:
zorg
🏷 Tags:

The glacial pace continues, with TH05's unnecessarily, inappropriately micro-optimized, and hence, un-decompilable code for rendering the current and high score, as well as the enemy health / dream / power bars. While the latter might still pass as well-written ASM, the former goes to such ridiculous levels that it ends up being technically buggy. If you enjoy quality ZUN code, it's definitely worth a read.

In TH05, this all still is at the end of code segment #1, but in TH04, the same code lies all over the same segment. And since I really wanted to move that code into its final form now, I finally did the research into decompiling from anywhere else in a segment.

Turns out we actually can! It's kinda annoying, though: After splitting the segment after the function we want to decompile, we then need to group the two new segments back together into one "virtual segment" matching the original one. But since all ASM in ReC98 heavily relies on being assembled in MASM mode, we then start to suffer from MASM's group addressing quirk. Which then forces us to manually prefix every single function call

with the group name. It's stupidly boring busywork, because of all the function calls you mustn't prefix. Special tooling might make this easier, but I don't have it, and I'm not getting crowdfunded for it.

So while you now definitely can request any specific thing in any of the 5 games to be decompiled right now, it will take slightly longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)

Only one function away from the TH05 shot type control functions now!