⮜ Blog

⮜ List of tags

Showing all posts tagged tasm- and th05-

📝 Posted:
🚚 Summary of:
P0137
Commits:
07bfcf2...8d953dc
💰 Funded by:
[Anonymous]
🏷 Tags:
rec98+ th02+ th03+ th04+ th05- build-process+ meta+ contribution-ideas+ mod+ tasm- tcc+

Whoops, the build was broken again? Since P0127 from mid-November 2020, on TASM32 version 5.3, which also happens to be the one in the DevKit… That version changed the alignment for the default segments of certain memory models when requesting .386 support. And since redefining segment alignment apparently is highly illegal and absolutely has to be a build error, some of the stand-alone .ASM translation units didn't assemble anymore on this version. I've only spotted this on my own because I casually compiled ReC98 somewhere else – on my development system, I happened to have TASM32 version 5.0 in the PATH during all this time.
At least this was a good occasion to get rid of some weird segment alignment workarounds from 2015, and replace them with the superior convention of using the USE16 modifier for the .MODEL directive.

ReC98 would highly benefit from a build server – both in order to immediately spot issues like this one, and as a service for modders. Even more so than the usual open-source project of its size, I would say. But that might be exactly because it doesn't seem like something you can trivially outsource to one of the big CI providers for open-source projects, and quickly set it up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular 64-bit Windows 10 and DOSBox running the exact build tools from the DevKit. Ideally, though, such a server should really run the optimal configuration of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step to run natively, which already is something that no popular CI service out there offers. Then, we'd optimally expand to Linux, every other Windows version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd be a lot. An experimental project all on its own, with additional hosting costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest there is once the store reopens (which will be at the beginning of May, at the latest). That aside, it would 📝 also be a great project for outside contributors!


So, technical debt, part 8… and right away, we're faced with TH03's low-level input function, which 📝 once 📝 again 📝 insists on being word-aligned in a way we can't fake without duplicating translation units. Being undecompilable isn't exactly the best property for a function that has been interesting to modders in the past: In 2018, spaztron64 created an ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human multiplayer mode: 2021-04-04-TH03-WASD-2player.zip However, this remapping attempt remained quite limited, since we hadn't (and still haven't) reached full position independence for TH03 yet. There's quite some potential for size optimizations in this function, which would allow more BIOS key groups to already be used right now, but it's not all that obvious to modders who aren't intimately familiar with x86 ASM. Therefore, I really wouldn't want to keep such a long and important function in ASM if we don't absolutely have to…

… and apparently, that's all the motivation I needed? So I took the risk, and spent the first half of this push on reverse-engineering TCC.EXE, to hopefully find a way to get word-aligned code segments out of Turbo C++ after all.

And there is! The -WX option, used for creating DPMI applications, messes up all sorts of code generation aspects in weird ways, but does in fact mark the code segment as word-aligned. We can consider ourselves quite lucky that we get to use Turbo C++ 4.0, because this feature isn't available in any previous version of Borland's C++ compilers.
That allowed us to restore all the decompilations I previously threw away… well, two of the three, that lookup table generator was too much of a mess in C. :tannedcirno: But what an abuse this is. The subtly different code generation has basically required one creative workaround per usage of -WX. For example, enabling that option causes the regular PUSH BP and POP BP prolog and epilog instructions to be wrapped with INC BP and DEC BP, for some reason:

a_function_compiled_with_wx proc
	inc 	bp    	; ???
	push	bp
	mov 	bp, sp
	    	      	; [… function code …]
	pop 	bp
	dec 	bp    	; ???
	ret
a_function_compiled_with_wx endp

Luckily again, all the functions that currently require -WX don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty close: snd_se_reset(void) is one of the functions that require word alignment. Previously, it shared a translation unit with the immediately following snd_se_play(int new_se), which does take a parameter, and therefore would have had its prolog and epilog code messed up by -WX. Since the latter function has a consistent (and thus, fakeable) alignment, I simply split that code segment into two, with a new -WX translation unit for just snd_se_reset(void). Problem solved – after all, two C++ translation units are still better than one ASM translation unit. :onricdennat: Especially with all the previous #include improvements.

The rest was more of the usual, getting us 74% done with repaying the technical debt in the SHARED segment. A lot of the remaining 26% is TH04 needing to catch up with TH03 and TH05, which takes comparatively little time. With some good luck, we might get this done within the next push… that is, if we aren't confronted with all too many more disgusting decompilations, like the two functions that ended this push. If we are, we might be needing 10 pushes to complete this after all, but that piece of research was definitely worth the delay. Next up: One more of these.

📝 Posted:
🚚 Summary of:
P0134
Commits:
1d5db71...a6eed55
💰 Funded by:
[Anonymous]
🏷 Tags:
rec98+ th05- blitting+ portability+ micro-optimization+ jank+ tasm- tcc+

Technical debt, part 5… and we only got TH05's stupidly optimized .PI functions this time?

As far as actual progress is concerned, that is. In maintenance news though, I was really hyped for the #include improvements I've mentioned in 📝 the last post. The result: A new x86real.h file, bundling all the declarations specific to the 16-bit x86 Real Mode in a smaller file than Turbo C++'s own DOS.H. After all, DOS is something else than the underlying CPU. And while it didn't speed up build times quite as much as I had hoped, it now clearly indicates the x86-specific parts of PC-98 Touhou code to future port authors.

After another couple of improvements to parameter declaration in ASM land, we get to TH05's .PI functions… and really, why did ZUN write all of them in ASM? Why (re)declare all the necessary structures and data in ASM land, when all these functions are merely one layer of abstraction above master.lib, which does all the actual work?
I get that ZUN might have wanted masked blitting to be faster, which is used for the fade-in effect seen during TH05's main menu animation and the ending artwork. But, uh… he knew how to modify master.lib. In fact, he did already modify the graph_pack_put_8() function used for rendering a single .PI image row, to ignore master.lib's VRAM clipping region. For this effect though, he first blits each row regularly to the invisible 400th row of VRAM, and then does an EGC-accelerated VRAM-to-VRAM blit of that row to its actual target position with the mask enabled. It would have been way more efficient to add another version of this function that takes a mask pattern. No amount of REP MOVSW is going to change the fact that two VRAM writes per line are slower than a single one. Not to mention that it doesn't justify writing every other .PI function in ASM to go along with it…
This is where we also find the most hilarious aspect about this: For most of ZUN's pointless micro-optimizations, you could have maybe made the argument that they do save some CPU cycles here and there, and therefore did something positive to the final, PC-98-exclusive result. But some of the hand-written ASM here doesn't even constitute a micro-optimization, because it's worse than what you would have got out of even Turbo C++ 4.0J with its 80386 optimization flags! :zunpet:

At least it was possible to "decompile" 6 out of the 10 functions here, making them easy to clean up for future modders and port authors. Could have been 7 functions if I also decided to "decompile" pi_free(), but all the C++ code is already surrounded by ASM, resulting in 2 ASM translation units and 2 C++ translation units. pi_free() would have needed a single translation unit by itself, which wasn't worth it, given that I would have had to spell out every single ASM instruction anyway.

void pascal pi_free(int slot)
{
	if(pi_buffers[slot]) {
		graph_pi_free(&pi_headers[slot], &pi_buffers[slot]);
		pi_buffers[slot] = NULL;
	}
}

There you go. What about this needed to be written in ASM?!?

The function calls between these small translation units even seemed to glitch out TASM and the linker in the end, leading to one CALL offset being weirdly shifted by 32 bytes. Usually, TLINK reports a fixup overflow error when this happens, but this time it didn't, for some reason? Mirroring the segment grouping in the affected translation unit did solve the problem, and I already knew this, but only thought of it after spending quite some RTFM time… during which I discovered the -lE switch, which enables TLINK to use the expanded dictionaries in Borland's .OBJ and .LIB files to speed up linking. That shaved off roughly another second from the build time of the complete ReC98 repository. The more you know… Binary blobs compiled with non-Borland tools would be the only reason not to use this flag.

So, even more slowdown with this 5th dedicated push, since we've still only repaid 41% of the technical debt in the SHARED segment so far. Next up: Part 6, which hopefully manages to decompile the FM and SSG channel animations in TH05's Music Room, and hopefully ends up being the final one of the slow ones.

📝 Posted:
🚚 Summary of:
P0126, P0127
Commits:
6c22af7...8b01657, 8b01657...dc65b59
💰 Funded by:
Blue Bolt, [Anonymous]
🏷 Tags:
rec98+ th03+ th04+ th05- pc98+ micro-optimization+ tcc+ tasm- meta+

Alright, back to continuing the master.hpp transition started in P0124, and repaying technical debt. The last blog post already announced some ridiculous decompilations… and in fact, not a single one of the functions in these two pushes was decompilable into idiomatic C/C++ code.

As usual, that didn't keep me from trying though. The TH04 and TH05 version of the infamous 16-pixel-aligned, EGC-accelerated rectangle blitting function from page 1 to page 0 was fairly average as far as unreasonable decompilations are concerned.
The big blocker in TH03's MAIN.EXE, however, turned out to be the .MRS functions, used to render the gauge attack portraits and bomb backgrounds. The blitting code there uses the additional FS and GS segment registers provided by the Intel 386… which

  1. are not supported by Turbo C++'s inline assembler, and
  2. can't be turned into pointers, due to a compiler bug in Turbo C++ that generates wrong segment prefix opcodes for the _FS and _GS pseudo-registers.

Apparently I'm the first one to even try doing that with this compiler? I haven't found any other mention of this bug…
Compiling via assembly (#pragma inline) would work around this bug and generate the correct instructions. But that would incur yet another dependency on a 16-bit TASM, for something honestly quite insignificant.

What we can always do, however, is using __emit__() to simply output x86 opcodes anywhere in a function. Unlike spelled-out inline assembly, that can even be used in helper functions that are supposed to inline… which does in fact allow us to fully abstract away this compiler bug. Regular if() comparisons with pseudo-registers wouldn't inline, but "converting" them into C++ template function specializations does. All that's left is some C preprocessor abuse to turn the pseudo-registers into types, and then we do retain a normal-looking poke() call in the blitting functions in the end. 🤯

Yeah… the result is batshit insane. I may have gone too far in a few places…


One might certainly argue that all these ridiculous decompilations actually hurt the preservation angle of this project. "Clearly, ZUN couldn't have possibly written such unreasonable C++ code. So why pretend he did, and not just keep it all in its more natural ASM form?" Well, there are several reasons:

Unfortunately, these pushes also demonstrated a second disadvantage in trying to decompile everything possible: Since Turbo C++ lacks TASM's fine-grained ability to enforce code alignment on certain multiples of bytes, it might actually be unfeasible to link in a C-compiled object file at its intended original position in some of the .EXE files it's used in. Which… you're only going to notice once you encounter such a case. Due to the slightly jumbled order of functions in the 📝 second, shared code segment, that might be long after you decompiled and successfully linked in the function everywhere else.

And then you'll have to throw away that decompilation after all 😕 Oh well. In this specific case (the lookup table generator for horizontally flipping images), that decompilation was a mess anyway, and probably helped nobody. I could have added a dummy .OBJ that does nothing but enforce the needed 2-byte alignment before the function if I really insisted on keeping the C version, but it really wasn't worth it.


Now that I've also described yet another meta-issue, maybe there'll really be nothing to say about the next technical debt pushes? :onricdennat: Next up though: Back to actual progress again, with TH01. Which maybe even ends up pushing that game over the 50% RE mark?

📝 Posted:
🚚 Summary of:
P0109
Commits:
dcf4e2c...2c7d86b
💰 Funded by:
[Anonymous], Blue Bolt
🏷 Tags:
rec98+ th04+ th05- gameplay+ bullet+ micro-optimization+ glitch+ uth05win+ tasm-

Back to TH05! Thanks to the good funding situation, I can strike a nice balance between getting TH05 position-independent as quickly as possible, and properly reverse-engineering some missing important parts of the game. Once 100% PI will get the attention of modders, the code will then be in better shape, and a bit more usable than if I just rushed that goal.

By now, I'm apparently also pretty spoiled by TH01's immediate decompilability, after having worked on that game for so long. Reverse-engineering in ASM land is pretty annoying, after all, since it basically boils down to meticulously editing a piece of ASM into something I can confidently call "reverse-engineered". Most of the time, simply decompiling that piece of code would take just a little bit longer, but be massively more useful. So, I immediately tried decompiling with TH05… and it just worked, at every place I tried!? Whatever the issue was that made 📝 segment splitting so annoying at my first attempt, I seem to have completely solved it in the meantime. 🤷 So yeah, backers can now request pretty much any part of TH04 and TH05 to be decompiled immediately, with no additional segment splitting cost.

(Protip for everyone interested in starting their own ReC project: Just declare one segment per function, right from the start, then group them together to restore the original code segmentation…)


Except that TH05 then just throws more of its infamous micro-optimized and undecompilable ASM at you. 🙄 This push covered the function that adjusts the bullet group template based on rank and the selected difficulty, called every time such a group is configured. Which, just like pretty much all of TH05's bullet spawning code, is one of those undecompilable functions. If C allowed labels of other functions as goto targets, it might have been decompilable into something useful to modders… maybe. But like this, there's no point in even trying.

This is such a terrible idea from a software architecture point of view, I can't even. Because now, you suddenly have to mirror your C++ declarations in ASM land, and keep them in sync with each other. I'm always happy when I get to delete an ASM declaration from the codebase once I've decompiled all the instances where it was referenced. But for TH05, we now have to keep those declarations around forever. 😕 And all that for a performance increase you probably couldn't even measure. Oh well, pulling off Galaxy Brain-level ASM optimizations is kind of fun if you don't have portability plans… I guess?

If I started a full fangame mod of a PC-98 Touhou game, I'd base it on TH04 rather than TH05, and backport selected features from TH05 as needed. Just because it was released later doesn't make it better, and this is by far not the only one of ZUN's micro-optimizations that just went way too far.

Dropping down to ASM also makes it easier to introduce weird quirks. Decompiled, one of TH05's tuning conditions for stack groups on Easy Mode would look something like:

case BP_STACK:
	// […]
	if(spread_angle_delta >= 2) {
		stack_bullet_count--;
	}

The fields of the bullet group template aren't typically reset when setting up a new group. So, spread_angle_delta in the context of a stack group effectively refers to "the delta angle of the last spread group that was fired before this stack – whenever that was". uth05win also spotted this quirk, considered it a bug, and wrote fanfiction by changing spread_angle_delta to stack_bullet_count.
As usual for functions that occur in more than one game, I also decompiled the TH04 bullet group tuning function, and it's perfectly sane, with no such quirks.


In the more PI-focused parts of this push, we got the TH05-exclusive smooth boss movement functions, for flying randomly or towards a given point. Pretty unspectacular for the most part, but we've got yet another uth05win inconsistency in the latter one. Once the Y coordinate gets close enough to the target point, it actually speeds up twice as much as the X coordinate would, whereas uth05win used the same speedup factors for both. This might make uth05win a couple of frames slower in all boss fights from Stage 3 on. Hard to measure though – and boss movement partly depends on RNG anyway.


Next up: Shinki's background animations – which are actually the single biggest source of position dependence left in TH05.

📝 Posted:
🚚 Summary of:
P0031, P0032, P0033
Commits:
dea40ad...9f764fa, 9f764fa...e6294c2, e6294c2...6cdd229
💰 Funded by:
zorg
🏷 Tags:
rec98+ th02+ th04+ th05- file-format+ hud+ score+ tasm- tcc+ micro-optimization+ jank+

The glacial pace continues, with TH05's unnecessarily, inappropriately micro-optimized, and hence, un-decompilable code for rendering the current and high score, as well as the enemy health / dream / power bars. While the latter might still pass as well-written ASM, the former goes to such ridiculous levels that it ends up being technically buggy. If you enjoy quality ZUN code, it's definitely worth a read.

In TH05, this all still is at the end of code segment #1, but in TH04, the same code lies all over the same segment. And since I really wanted to move that code into its final form now, I finally did the research into decompiling from anywhere else in a segment.

Turns out we actually can! It's kinda annoying, though: After splitting the segment after the function we want to decompile, we then need to group the two new segments back together into one "virtual segment" matching the original one. But since all ASM in ReC98 heavily relies on being assembled in MASM mode, we then start to suffer from MASM's group addressing quirk. Which then forces us to manually prefix every single function call

with the group name. It's stupidly boring busywork, because of all the function calls you mustn't prefix. Special tooling might make this easier, but I don't have it, and I'm not getting crowdfunded for it.

So while you now definitely can request any specific thing in any of the 5 games to be decompiled right now, it will take slightly longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)

Only one function away from the TH05 shot type control functions now!