⮜ Blog

⮜ List of tags

Showing all posts tagged
,
and

📝 Posted:
🚚 Summary of:
P0190, P0191, P0192
Commits:
5734815...293e16a, 293e16a...71cb7b5, 71cb7b5...e1f3f9f
💰 Funded by:
nrook, -Tom-, [Anonymous]
🏷 Tags:

The important things first:

So, Shinki! As far as final boss code is concerned, she's surprisingly economical, with 📝 her background animations making up more than ⅓ of her entire code. Going straight from TH01's 📝 final 📝 bosses to TH05's final boss definitely showed how much ZUN had streamlined danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there is still room for improvement: TH05 not only 📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month, but also uses them 4× as often, and even for midbosses. Most importantly though, defining danmaku patterns using a single global instance of the group template structure is just bad no matter how you look at it:

Declaring a separate structure instance with the static data for every pattern would be both safer and more space-efficient, and there's more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow. The "devil" pattern is significantly more complex than the others, but still far from TH01's final bosses at their worst. I especially like the clear architectural separation between "one-shot pattern" functions that return true once they're done, and "looping pattern" functions that run as long as they're being called from a boss's main function. Not many all too interesting things in these pattern functions for the most part, except for two pieces of evidence that Shinki was coded after Yumeko:


Speaking about that wing sprite: If you look at ST05.BB2 (or any other file with a large sprite, for that matter), you notice a rather weird file layout:

Raw file layout of TH05's ST05.BB2, demonstrating master.lib's supposed BFNT width limit of 64 pixels
A large sprite split into multiple smaller ones with a width of 64 pixels each? What's this, hardware sprite limitations? On my PC-98?!

And it's not a limitation of the sprite width field in the BFNT+ header either. Instead, it's master.lib's BFNT functions which are limited to sprite widths up to 64 pixels… or at least that's what MASTER.MAN claims. Whatever the restriction was, it seems to be completely nonexistent as of master.lib version 0.23, and none of the master.lib functions used by the games have any issues with larger sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the game that expects Shinki's winged form to consist of 4 physical sprites, not just 1. Any conversion from another, more logical sprite sheet layout back into BFNT+ must therefore replicate the original number of sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly loaded sprite no longer match ZUN's hardcoded IDs, causing the game to crash. This is exactly what used to happen with -Tom-'s MysticTK automation scripts, which combined these exact sprites into a single large one. This issue has now been fixed – just in case there are some underground modders out there who used these scripts and wonder why their game crashed as soon as the Shinki fight started.


And then the code quality takes a nosedive with Shinki's main function. :onricdennat: Even in TH05, these boss and midboss update functions are still very imperative:

The biggest WTF in there, however, goes to using one of the 16 state bytes as a "relative phase" variable for differentiating between boss phases that share the same branch within the switch(boss.phase) statement. While it's commendable that ZUN tried to reduce code duplication for once, he could have just branched depending on the actual boss.phase variable? The same state byte is then reused in the "devil" pattern to track the activity state of the big jerky lasers in the second half of the pattern. If you somehow managed to end the phase after the first few bullets of the pattern, but before these lasers are up, Shinki's update function would think that you're still in the phase before the "devil" pattern. The main function then sequence-breaks right to the defeat phase, skipping the final pattern with the burning Makai background. Luckily, the HP boundaries are far away enough to make this impossible in practice.
The takeaway here: If you want to use the state bytes for your custom boss script mods, alias them to your own 16-byte structure, and limit each of the bytes to a clearly defined meaning across your entire boss script.

One final discovery that doesn't seem to be documented anywhere yet: Shinki actually has a hidden bomb shield during her two purple-wing phases. uth05win got this part slightly wrong though: It's not a complete shield, and hitting Shinki will still deal 1 point of chip damage per frame. For comparison, the first phase lasts for 3,000 HP, and the "devil" pattern phase lasts for 5,800 HP.

And there we go, 3rd PC-98 Touhou boss script* decompiled, 28 to go! 🎉 In case you were expecting a fix for the Shinki death glitch: That one is more appropriately fixed as part of the Mai & Yuki script. It also requires new code, should ideally look a bit prettier than just removing cheetos between one frame and the next, and I'd still like it to fit within the original position-dependent code layout… Let's do that some other time.
Not much to say about the Stage 1 midboss, or midbosses in general even, except that their update functions have to imperatively handle even more subsystems, due to the relative lack of helper functions.


The remaining ¾ of the third push went to a bunch of smaller RE and finalization work that would have hardly got any attention otherwise, to help secure that 50% RE mark. The nicest piece of code in there shows off what looks like the optimal way of setting up the 📝 GRCG tile register for monochrome blitting in a variable color:

mov ah, palette_index ; Any other non-AL 8-bit register works too.
                      ; (x86 only supports AL as the source operand for OUTs.)

rept 4                ; For all 4 bitplanes…
    shr ah,  1        ; Shift the next color bit into the x86 carry flag
    sbb al,  al       ; Extend the carry flag to a full byte
                      ; (CF=0 → 0x00, CF=1 → 0xFF)
    out 7Eh, al       ; Write AL to the GRCG tile register
endm

Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles into a surprisingly nice one-liner. What a beautiful micro-optimization, at a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there, becoming increasingly dumb and undecompilable. Was it really necessary to save 4 x86 instructions in the highly unlikely case of a new spark sprite being spawned outside the playfield? That one 2D polar→Cartesian conversion function then pointed out Turbo C++ 4.0J's woefully limited support for 32-bit micro-optimizations. The code generation for 32-bit 📝 pseudo-registers is so bad that they almost aren't worth using for arithmetic operations, and the inline assembler just flat out doesn't support anything 32-bit. No use in decompiling a function that you'd have to entirely spell out in machine code, especially if the same function already exists in multiple other, more idiomatic C++ variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC replay file reading code, which should finally prove that nothing about the game's original replay system could serve as even just the foundation for community-usable replays. Just in case anyone was still thinking that.


Next up: Back to TH01, with the Elis fight! Got a bit of room left in the cap again, and there are a lot of things that would make a lot of sense now:

📝 Posted:
🚚 Summary of:
P0137
Commits:
07bfcf2...8d953dc
💰 Funded by:
[Anonymous]
🏷 Tags:

Whoops, the build was broken again? Since P0127 from mid-November 2020, on TASM32 version 5.3, which also happens to be the one in the DevKit… That version changed the alignment for the default segments of certain memory models when requesting .386 support. And since redefining segment alignment apparently is highly illegal and absolutely has to be a build error, some of the stand-alone .ASM translation units didn't assemble anymore on this version. I've only spotted this on my own because I casually compiled ReC98 somewhere else – on my development system, I happened to have TASM32 version 5.0 in the PATH during all this time.
At least this was a good occasion to get rid of some weird segment alignment workarounds from 2015, and replace them with the superior convention of using the USE16 modifier for the .MODEL directive.

ReC98 would highly benefit from a build server – both in order to immediately spot issues like this one, and as a service for modders. Even more so than the usual open-source project of its size, I would say. But that might be exactly because it doesn't seem like something you can trivially outsource to one of the big CI providers for open-source projects, and quickly set it up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular 64-bit Windows 10 and DOSBox running the exact build tools from the DevKit. Ideally, though, such a server should really run the optimal configuration of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step to run natively, which already is something that no popular CI service out there offers. Then, we'd optimally expand to Linux, every other Windows version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd be a lot. An experimental project all on its own, with additional hosting costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest there is once the store reopens (which will be at the beginning of May, at the latest). That aside, it would 📝 also be a great project for outside contributors!


So, technical debt, part 8… and right away, we're faced with TH03's low-level input function, which 📝 once 📝 again 📝 insists on being word-aligned in a way we can't fake without duplicating translation units. Being undecompilable isn't exactly the best property for a function that has been interesting to modders in the past: In 2018, spaztron64 created an ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human multiplayer mode: 2021-04-04-TH03-WASD-2player.zip However, this remapping attempt remained quite limited, since we hadn't (and still haven't) reached full position independence for TH03 yet. There's quite some potential for size optimizations in this function, which would allow more BIOS key groups to already be used right now, but it's not all that obvious to modders who aren't intimately familiar with x86 ASM. Therefore, I really wouldn't want to keep such a long and important function in ASM if we don't absolutely have to…

… and apparently, that's all the motivation I needed? So I took the risk, and spent the first half of this push on reverse-engineering TCC.EXE, to hopefully find a way to get word-aligned code segments out of Turbo C++ after all.

And there is! The -WX option, used for creating DPMI applications, messes up all sorts of code generation aspects in weird ways, but does in fact mark the code segment as word-aligned. We can consider ourselves quite lucky that we get to use Turbo C++ 4.0, because this feature isn't available in any previous version of Borland's C++ compilers.
That allowed us to restore all the decompilations I previously threw away… well, two of the three, that lookup table generator was too much of a mess in C. :tannedcirno: But what an abuse this is. The subtly different code generation has basically required one creative workaround per usage of -WX. For example, enabling that option causes the regular PUSH BP and POP BP prolog and epilog instructions to be wrapped with INC BP and DEC BP, for some reason:

a_function_compiled_with_wx proc
	inc 	bp    	; ???
	push	bp
	mov 	bp, sp
	    	      	; [… function code …]
	pop 	bp
	dec 	bp    	; ???
	ret
a_function_compiled_with_wx endp

Luckily again, all the functions that currently require -WX don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty close: snd_se_reset(void) is one of the functions that require word alignment. Previously, it shared a translation unit with the immediately following snd_se_play(int new_se), which does take a parameter, and therefore would have had its prolog and epilog code messed up by -WX. Since the latter function has a consistent (and thus, fakeable) alignment, I simply split that code segment into two, with a new -WX translation unit for just snd_se_reset(void). Problem solved – after all, two C++ translation units are still better than one ASM translation unit. :onricdennat: Especially with all the previous #include improvements.

The rest was more of the usual, getting us 74% done with repaying the technical debt in the SHARED segment. A lot of the remaining 26% is TH04 needing to catch up with TH03 and TH05, which takes comparatively little time. With some good luck, we might get this done within the next push… that is, if we aren't confronted with all too many more disgusting decompilations, like the two functions that ended this push. If we are, we might be needing 10 pushes to complete this after all, but that piece of research was definitely worth the delay. Next up: One more of these.

📝 Posted:
🚚 Summary of:
P0135, P0136
Commits:
a6eed55...252c13d, 252c13d...07bfcf2
💰 Funded by:
[Anonymous]
🏷 Tags:

Alright, no more big code maintenance tasks that absolutely need to be done right now. Time to really focus on parts 6 and 7 of repaying technical debt, right? Except that we don't get to speed up just yet, as TH05's barely decompilable PMD file loading function is rather… complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98 Touhou, I first consult the disassembly of Wolfenstein 3D. That game was originally compiled with the quite similar Borland C++ 3.0, so it's quite helpful to compare its ASM to the officially released source code. If I find the instructions in question, they mostly come from that game's ASM code, leading to the amusing realization that "even John Carmack was unable to get these instructions out of this compiler" :onricdennat: This time though, Wolfenstein 3D did point me to Borland's intrinsics for common C functions like memcpy() and strchr(), available via #pragma intrinsic. Bu~t those unfortunately still generate worse code than what ZUN micro-optimized here. Commenting how these sequences of instructions should look in C is unfortunately all I could do here.
The conditional branches in this function did compile quite nicely though, clarifying the control flow, and clearly exposing a ZUN bug: TH05's snd_load() will hang in an infinite loop when trying to load a non-existing -86 BGM file (with a .M2 extension) if the corresponding -26 BGM file (with a .M extension) doesn't exist either.

Unsurprisingly, the PMD channel monitoring code in TH05's Music Room remains undecompilable outside the two most "high-level" initialization and rendering functions. And it's not because there's data in the middle of the code segment – that would have actually been possible with some #pragmas to ensure that the data and code segments have the same name. As soon as the SI and DI registers are referenced anywhere, Turbo C++ insists on emitting prolog code to save these on the stack at the beginning of the function, and epilog code to restore them from there before returning. Found that out in September 2019, and confirmed that there's no way around it. All the small helper functions here are quite simply too optimized, throwing away any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate that I do try.


Within that same 6th push though, we've finally reached the one function in TH05 that was blocking further progress in TH04, allowing that game to finally catch up with the others in terms of separated translation units. Feels good to finally delete more of those .ASM files we've decompiled a while ago… finally!

But since that was just getting started, the most satisfying development in both of these pushes actually came from some more experiments with macros and inline functions for near-ASM code. By adding "unused" dummy parameters for all relevant registers, the exact input registers are made more explicit, which might help future port authors who then maybe wouldn't have to look them up in an x86 instruction reference quite as often. At its best, this even allows us to declare certain functions with the __fastcall convention and express their parameter lists as regular C, with no additional pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even more amazing than previously thought when it comes to returning pseudo-registers from inline functions. A nice example for how this can improve readability can be found in this piece of TH02 code for polling the PC-98 keyboard state using a BIOS interrupt:

inline uint8_t keygroup_sense(uint8_t group) {
	_AL = group;
	_AH = 0x04;
	geninterrupt(0x18);
	// This turns the output register of this BIOS call into the return value
	// of this function. Surprisingly enough, this does *not* naively generate
	// the `MOV AL, AH` instruction you might expect here!
	return _AH;
}

void input_sense(void)
{
	// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
	// never emits as such, giving us only the three instructions we need.
	_AH = keygroup_sense(8);

	// Whereas this one gives us the one additional `MOV BH, AH` instruction
	// we'd expect, and nothing more.
	_BH = keygroup_sense(7);

	// And now it's obvious what both of these registers contain, from just
	// the assignments above.
	if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
		key_det |= INPUT_UP;
	}
	// […]
}

I love it. No inline assembly, as close to idiomatic C code as something like this is going to get, yet still compiling into the minimum possible number of x86 instructions on even a 1994 compiler. This is how I keep this project interesting for myself during chores like these. :tannedcirno: We might have even reached peak inline already?

And that's 65% of technical debt in the SHARED segment repaid so far. Next up: Two more of these, which might already complete that segment? Finally!

📝 Posted:
🚚 Summary of:
P0133
Commits:
045450c...1d5db71
💰 Funded by:
[Anonymous]
🏷 Tags:

Wow, 31 commits in a single push? Well, what the last push had in progress, this one had in maintenance. The 📝 master.lib header transition absolutely had to be completed in this one, for my own sanity. And indeed, it reduced the build time for the entirety of ReC98 to about 27 seconds on my system, just as expected in the original announcement. Looking forward to even faster build times with the upcoming #include improvements I've got up my sleeve! The port authors of the future are going to appreciate those quite a bit.

As for the new translation units, the funniest one is probably TH05's function for blitting the 1-color .CDG images used for the main menu options. Which is so optimized that it becomes decompilable again, by ditching the self-modifying code of its TH04 counterpart in favor of simply making better use of CPU registers. The resulting C code is still a mess, but what can you do. :tannedcirno:
This was followed by even more TH05 functions that clearly weren't compiled from C, as evidenced by their padding bytes. It's about time I've documented my lack of ideas of how to get those out of Turbo C++. :onricdennat:

And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…

In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?

📝 Posted:
🚚 Summary of:
P0126, P0127
Commits:
6c22af7...8b01657, 8b01657...dc65b59
💰 Funded by:
Blue Bolt, [Anonymous]
🏷 Tags:

Alright, back to continuing the master.hpp transition started in P0124, and repaying technical debt. The last blog post already announced some ridiculous decompilations… and in fact, not a single one of the functions in these two pushes was decompilable into idiomatic C/C++ code.

As usual, that didn't keep me from trying though. The TH04 and TH05 version of the infamous 16-pixel-aligned, EGC-accelerated rectangle blitting function from page 1 to page 0 was fairly average as far as unreasonable decompilations are concerned.
The big blocker in TH03's MAIN.EXE, however, turned out to be the .MRS functions, used to render the gauge attack portraits and bomb backgrounds. The blitting code there uses the additional FS and GS segment registers provided by the Intel 386… which

  1. are not supported by Turbo C++'s inline assembler, and
  2. can't be turned into pointers, due to a compiler bug in Turbo C++ that generates wrong segment prefix opcodes for the _FS and _GS pseudo-registers.

Apparently I'm the first one to even try doing that with this compiler? I haven't found any other mention of this bug…
Compiling via assembly (#pragma inline) would work around this bug and generate the correct instructions. But that would incur yet another dependency on a 16-bit TASM, for something honestly quite insignificant.

What we can always do, however, is using __emit__() to simply output x86 opcodes anywhere in a function. Unlike spelled-out inline assembly, that can even be used in helper functions that are supposed to inline… which does in fact allow us to fully abstract away this compiler bug. Regular if() comparisons with pseudo-registers wouldn't inline, but "converting" them into C++ template function specializations does. All that's left is some C preprocessor abuse to turn the pseudo-registers into types, and then we do retain a normal-looking poke() call in the blitting functions in the end. 🤯

Yeah… the result is batshit insane. I may have gone too far in a few places…


One might certainly argue that all these ridiculous decompilations actually hurt the preservation angle of this project. "Clearly, ZUN couldn't have possibly written such unreasonable C++ code. So why pretend he did, and not just keep it all in its more natural ASM form?" Well, there are several reasons:

Unfortunately, these pushes also demonstrated a second disadvantage in trying to decompile everything possible: Since Turbo C++ lacks TASM's fine-grained ability to enforce code alignment on certain multiples of bytes, it might actually be unfeasible to link in a C-compiled object file at its intended original position in some of the .EXE files it's used in. Which… you're only going to notice once you encounter such a case. Due to the slightly jumbled order of functions in the 📝 second, shared code segment, that might be long after you decompiled and successfully linked in the function everywhere else.

And then you'll have to throw away that decompilation after all 😕 Oh well. In this specific case (the lookup table generator for horizontally flipping images), that decompilation was a mess anyway, and probably helped nobody. I could have added a dummy .OBJ that does nothing but enforce the needed 2-byte alignment before the function if I really insisted on keeping the C version, but it really wasn't worth it.


Now that I've also described yet another meta-issue, maybe there'll really be nothing to say about the next technical debt pushes? :onricdennat: Next up though: Back to actual progress again, with TH01. Which maybe even ends up pushing that game over the 50% RE mark?

📝 Posted:
🚚 Summary of:
P0110
Commits:
2c7d86b...8b5c146
💰 Funded by:
[Anonymous], Blue Bolt
🏷 Tags:

… and just as I explained 📝 in the last post how decompilation is typically more sensible and efficient than ASM-level reverse-engineering, we have this push demonstrating a counter-example. The reason why the background particles and lines in the Shinki and EX-Alice battles contributed so much to position dependence was simply because they're accessed in a relatively large amount of functions, one for each different animation. Too many to spend the remaining precious crowdfunded time on reverse-engineering or even decompiling them all, especially now that everyone anticipates 100% PI for TH05's MAIN.EXE.

Therefore, I only decompiled the two functions of the line structure that also demonstrate best how it works, which in turn also helped with RE. Sadly, this revealed that we actually can't 📝 overload operator =() to get that nice assignment syntax for 12.4 fixed-point values, because one of those new functions relies on Turbo C++'s built-in optimizations for trivially copyable structures. Still, impressive that this abstraction caused no other issues for almost one year.

As for the structures themselves… nope, nothing to criticize this time! Sure, one good particle system would have been awesome, instead of having separate structures for the Stage 2 "starfield" particles and the one used in Shinki's battle, with hardcoded animations for both. But given the game's short development time, that was quite an acceptable compromise, I'd say.
And as for the lines, there just has to be a reason why the game reserves 20 lines per set, but only renders lines #0, #6, #12, and #18. We'll probably see once we get to look at those animation functions more closely.

This was quite a 📝 TH03-style RE push, which yielded way more PI% than RE%. But now that that's done, I can finally not get distracted by all that stuff when looking at the list of remaining memory references. Next up: The last few missing structures in TH05's MAIN.EXE!

📝 Posted:
🚚 Summary of:
P0076, P0077
Commits:
222fc99...9ae9754, 9ae9754...f4eb7a8
💰 Funded by:
[Anonymous], -Tom-, Splashman
🏷 Tags:

Well, that took twice as long as I thought, with the two pushes containing a lot more maintenance than actual new research. Spending some time improving both field names and types in 32th System's TH03 resident structure finally gives us all of those structures. Which means that we can now cover all the remaining decompilable ZUN.COM parts at once…

Oh wait, their main() functions have stayed largely identical since TH02? Time to clean up and separate that first, then… and combine two recent code generation observations into the solution to a decompilation puzzle from 4½ years ago. Alright, time to decomp-

Oh wait, we'd kinda like to properly RE all the code in TH03-TH05 that deals with loading and saving .CFG files. Almost every outside contributor wanted to grab this supposedly low-hanging fruit a lot earlier, but (of course) always just for a single game, while missing how the format evolved.

So, ZUN.COM. For some reason, people seem to consider it particularly important, even though it contains neither any game logic nor any code specific to PC-98 hardware… All that this decompilable part does is to initialize a game's .CFG file, allocate an empty resident structure using master.lib functions, release it after you quit the game, error-check all that, and print some playful messages~ (OK, TH05's also directly fills the resident structure with all data from MIKO.CFG, which all the other games do in OP.EXE.) At least modders can now freely change and extend all the resident structures, as well as the .CFG files? And translators can translate those messages that you won't see on a decently fast emulator anyway? Have fun, I guess 🤷‍

And you can in fact do this right now – even for TH04 and TH05, whose ZUN.COM currently isn't rebuilt by ReC98. There is actually a rather involved reason for this:

So yeah, no meaningful RE and PI progress at any of these levels. Heck, even as a modder, you can just replace the zun zun_res (TH02), zun -5 (TH03), or zun -s (TH04/TH05) calls in GAME.BAT with a direct call to your modified *RES*.COM. And with the alternative being "manually typing 0 and 1 bits into a text file", editing the sprites in TH05's GJINIT.COM is way more comfortable in a binary sprite editor anyway.

For me though, the best part in all of this was that it finally made sense to throw out the old Borland C++ run-time assembly slices 🗑 This giant waste of time became obvious 5 years ago, but any ASM dump of a .COM file would have needed rather ugly workarounds without those slices. Now that all .COM binaries that were originally written in C are compiled from C, we can all enjoy slightly faster grepping over the entire repository, which now has 229 fewer files. Productivity will skyrocket! :tannedcirno:

Next up: Three weeks of almost full-time ReC98 work! Two more PI-focused pushes to finish this TH05 stretch first, before switching priorities to TH01 again.