Showing all posts tagged tasm- and micro-optimization-

📝 Posted:
🚚 Summary of:
P0134
Commits:
1d5db71...a6eed55
💰 Funded by:
[Anonymous]
🏷 Tags:
rec98+ th05+ blitting+ portability+ micro-optimization- jank+ tasm- tcc+

Technical debt, part 5… and we only got TH05's stupidly optimized .PI functions this time?

As far as actual progress is concerned, that is. In maintenance news though, I was really hyped for the #include improvements I mentioned in 📝 the last post. The result: A new x86real.h file, bundling all the declarations specific to the 16-bit x86 Real Mode in a smaller file than Turbo C++'s own DOS.H. After all, DOS is something other than the underlying CPU. And while it didn't speed up build times quite as much as I had hoped, it now clearly indicates the x86-specific parts of PC-98 Touhou code to future port authors.

After another couple of improvements to parameter declaration in ASM land, we get to TH05's .PI functions… and really, why did ZUN write all of them in ASM? Why (re)declare all the necessary structures and data in ASM land, when all these functions are merely one layer of abstraction above master.lib, which does all the actual work?
I get that ZUN might have wanted masked blitting to be faster, which is used for the fade-in effect seen during TH05's main menu animation and the ending artwork. But, uh… he knew how to modify master.lib. In fact, he did already modify the graph_pack_put_8() function used for rendering a single .PI image row, to ignore master.lib's VRAM clipping region. For this effect though, he first blits each row regularly to the invisible 400th row of VRAM, and then does an EGC-accelerated VRAM-to-VRAM blit of that row to its actual target position with the mask enabled. It would have been way more efficient to add another version of this function that takes a mask pattern. No amount of REP MOVSW is going to change the fact that two VRAM writes per line are slower than a single one. Not to mention that it doesn't justify writing every other .PI function in ASM to go along with it…
This is where we also find the most hilarious aspect about this: For most of ZUN's pointless micro-optimizations, you could have maybe made the argument that they do save some CPU cycles here and there, and therefore did something positive to the final, PC-98-exclusive result. But some of the hand-written ASM here doesn't even constitute a micro-optimization, because it's worse than what you would have got out of even Turbo C++ 4.0J with its 80386 optimization flags! :zunpet:
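
To put the "two VRAM writes per line" argument into numbers, here is a minimal portable C++ sketch. It is not master.lib or ZUN's code: the `Vram` struct, the 40-word row width, the use of row index 400 as the scratch row, and the function names are all assumptions made for illustration, and the masking that the EGC would do in hardware is modeled as a plain read-modify-write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using u16 = std::uint16_t;

// Hypothetical stand-in for a single VRAM bitplane; not master.lib's layout.
constexpr std::size_t ROW_WORDS = 40;     // 640 pixels / 16 pixels per word
constexpr std::size_t ROWS = 401;         // 400 visible rows + 1 scratch row
constexpr std::size_t SCRATCH_ROW = 400;  // assumed index of the invisible row

struct Vram {
	std::vector<u16> words = std::vector<u16>(ROW_WORDS * ROWS, 0);
	std::size_t writes = 0; // number of 16-bit writes performed so far

	void write(std::size_t row, std::size_t x, u16 value) {
		assert(row < ROWS && x < ROW_WORDS);
		words[(row * ROW_WORDS) + x] = value;
		writes++;
	}
	u16 read(std::size_t row, std::size_t x) const {
		return words[(row * ROW_WORDS) + x];
	}
};

// Merges `src` into `old`, only replacing the bits that are set in `mask`.
inline u16 apply_mask(u16 old, u16 src, u16 mask) {
	return static_cast<u16>((old & ~mask) | (src & mask));
}

// Two passes: unmasked blit to the scratch row, then a masked copy from there.
void put_row_two_pass(Vram& vram, std::size_t row, const u16* src, u16 mask) {
	for(std::size_t x = 0; x < ROW_WORDS; x++) {
		vram.write(SCRATCH_ROW, x, src[x]);
	}
	for(std::size_t x = 0; x < ROW_WORDS; x++) {
		const u16 scratch = vram.read(SCRATCH_ROW, x);
		vram.write(row, x, apply_mask(vram.read(row, x), scratch, mask));
	}
}

// Single pass: the mask is applied while blitting directly to the target row.
void put_row_single_pass(Vram& vram, std::size_t row, const u16* src, u16 mask) {
	for(std::size_t x = 0; x < ROW_WORDS; x++) {
		vram.write(row, x, apply_mask(vram.read(row, x), src[x], mask));
	}
}
```

Both functions leave the target row in the same state, but the two-pass variant performs `2 * ROW_WORDS` writes per row where the single-pass variant performs `ROW_WORDS`. The sketch only counts writes; it says nothing about actual PC-98 timings.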

At least it was possible to "decompile" 6 out of the 10 functions here, making them easy to clean up for future modders and port authors. Could have been 7 functions if I also decided to "decompile" pi_free(), but all the C++ code is already surrounded by ASM, resulting in 2 ASM translation units and 2 C++ translation units. pi_free() would have needed a single translation unit by itself, which wasn't worth it, given that I would have had to spell out every single ASM instruction anyway.

void pascal pi_free(int slot)
{
	if(pi_buffers[slot]) {
		graph_pi_free(&pi_headers[slot], &pi_buffers[slot]);
		pi_buffers[slot] = NULL;
	}
}

There you go. What about this needed to be written in ASM?!?

The function calls between these small translation units even seemed to glitch out TASM and the linker in the end, leading to one CALL offset being weirdly shifted by 32 bytes. Usually, TLINK reports a fixup overflow error when this happens, but this time it didn't, for some reason? Mirroring the segment grouping in the affected translation unit did solve the problem, and I already knew this, but only thought of it after spending quite some RTFM time… during which I discovered the -lE switch, which enables TLINK to use the expanded dictionaries in Borland's .OBJ and .LIB files to speed up linking. That shaved off roughly another second from the build time of the complete ReC98 repository. The more you know… Binary blobs compiled with non-Borland tools would be the only reason not to use this flag.

So, even more slowdown with this 5th dedicated push, since we've still only repaid 41% of the technical debt in the SHARED segment so far. Next up: Part 6, which hopefully manages to decompile the FM and SSG channel animations in TH05's Music Room, and hopefully ends up being the final one of the slow ones.

📝 Posted:
🚚 Summary of:
P0126, P0127
Commits:
6c22af7...8b01657, 8b01657...dc65b59
💰 Funded by:
Blue Bolt, [Anonymous]
🏷 Tags:
rec98+ th03+ th04+ th05+ pc98+ micro-optimization- tcc+ tasm- meta+

Alright, back to continuing the master.hpp transition started in P0124, and repaying technical debt. The last blog post already announced some ridiculous decompilations… and in fact, not a single one of the functions in these two pushes was decompilable into idiomatic C/C++ code.

As usual, that didn't keep me from trying though. The TH04 and TH05 version of the infamous 16-pixel-aligned, EGC-accelerated rectangle blitting function from page 1 to page 0 was fairly average as far as unreasonable decompilations are concerned.
The big blocker in TH03's MAIN.EXE, however, turned out to be the .MRS functions, used to render the gauge attack portraits and bomb backgrounds. The blitting code there uses the additional FS and GS segment registers provided by the Intel 386… which

  1. are not supported by Turbo C++'s inline assembler, and
  2. can't be turned into pointers, due to a compiler bug in Turbo C++ that generates wrong segment prefix opcodes for the _FS and _GS pseudo-registers.

Apparently I'm the first one to even try doing that with this compiler? I haven't found any other mention of this bug…
Compiling via assembly (#pragma inline) would work around this bug and generate the correct instructions. But that would incur yet another dependency on a 16-bit TASM, for something honestly quite insignificant.

What we can always do, however, is use __emit__() to simply output x86 opcodes anywhere in a function. Unlike spelled-out inline assembly, that can even be used in helper functions that are supposed to inline… which does in fact allow us to fully abstract away this compiler bug. Regular if() comparisons with pseudo-registers wouldn't inline, but "converting" them into C++ template function specializations does. All that's left is some C preprocessor abuse to turn the pseudo-registers into types, and then we do retain a normal-looking poke() call in the blitting functions in the end. 🤯

Yeah… the result is batshit insane. I may have gone too far in a few places…
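
For a rough idea of the technique, and explicitly not the actual ReC98 code, here is a portable C++ sketch of selecting a segment override prefix through template function specializations. The tag types `SegFS` and `SegGS`, the `seg_prefix()` and `poke_bx_ax()` helpers, and the `CodeBuffer` that stands in for `__emit__()` are all hypothetical. The byte values are the standard x86 encodings: 0x64 and 0x65 are the FS and GS override prefixes, and `89 07` encodes `MOV [BX], AX` in 16-bit code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical tag types standing in for the _FS and _GS pseudo-registers.
struct SegFS {};
struct SegGS {};

// The primary template is only declared. Each supported segment register gets
// its own specialization, so the choice is resolved at compile time.
template <class Seg> std::uint8_t seg_prefix();
template <> inline std::uint8_t seg_prefix<SegFS>() { return 0x64; }
template <> inline std::uint8_t seg_prefix<SegGS>() { return 0x65; }

// Stand-in for __emit__(): collects the bytes in a buffer instead of writing
// them into the instruction stream of the surrounding function.
using CodeBuffer = std::vector<std::uint8_t>;

// Appends the encoding of `MOV seg:[BX], AX` for 16-bit code.
template <class Seg> void poke_bx_ax(CodeBuffer& out) {
	const std::uint8_t prefix = seg_prefix<Seg>();
	assert((prefix == 0x64) || (prefix == 0x65)); // only FS and GS are covered
	out.push_back(prefix); // segment override prefix
	out.push_back(0x89);   // MOV r/m16, r16
	out.push_back(0x07);   // ModR/M: mod=00, reg=AX, r/m=[BX]
}
```

Calling `poke_bx_ax<SegFS>(code)` appends `64 89 07` and `poke_bx_ax<SegGS>(code)` appends `65 89 07`. Instantiating it with any other type fails to link, which is the point of leaving the primary template undefined.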


One might certainly argue that all these ridiculous decompilations actually hurt the preservation angle of this project. "Clearly, ZUN couldn't have possibly written such unreasonable C++ code. So why pretend he did, and not just keep it all in its more natural ASM form?" Well, there are several reasons:

Unfortunately, these pushes also demonstrated a second disadvantage in trying to decompile everything possible: Since Turbo C++ lacks TASM's fine-grained ability to enforce code alignment on certain multiples of bytes, it might actually be unfeasible to link in a C-compiled object file at its intended original position in some of the .EXE files it's used in. Which… you're only going to notice once you encounter such a case. Due to the slightly jumbled order of functions in the 📝 second, shared code segment, that might be long after you decompiled and successfully linked in the function everywhere else.

And then you'll have to throw away that decompilation after all 😕 Oh well. In this specific case (the lookup table generator for horizontally flipping images), that decompilation was a mess anyway, and probably helped nobody. I could have added a dummy .OBJ that does nothing but enforce the needed 2-byte alignment before the function if I really insisted on keeping the C version, but it really wasn't worth it.
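
The arithmetic behind such a padding .OBJ is simple enough to sketch in C++. The helper names below are hypothetical and the offsets are made up for illustration; nothing here is taken from the actual binaries.

```cpp
#include <cassert>
#include <cstddef>

// Number of padding bytes needed to bring `offset` up to the next multiple of
// `alignment`. Returns 0 if `offset` is already aligned.
inline std::size_t padding_for(std::size_t offset, std::size_t alignment) {
	assert(alignment != 0);
	return (alignment - (offset % alignment)) % alignment;
}

// Offset at which a function would start if it had to be aligned to
// `alignment` bytes and the preceding code ended at `prev_end`.
inline std::size_t aligned_start(std::size_t prev_end, std::size_t alignment) {
	return prev_end + padding_for(prev_end, alignment);
}
```

With a 2-byte alignment, code that ends at an odd offset needs exactly one byte of padding, while code that ends at an even offset needs none. Whether that byte is needed therefore depends on what precedes the function in each individual .EXE.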


Now that I've also described yet another meta-issue, maybe there'll really be nothing to say about the next technical debt pushes? :onricdennat: Next up though: Back to actual progress again, with TH01. Which maybe even ends up pushing that game over the 50% RE mark?

📝 Posted:
🚚 Summary of:
P0109
Commits:
dcf4e2c...2c7d86b
💰 Funded by:
[Anonymous], Blue Bolt
🏷 Tags:
rec98+ th04+ th05+ gameplay+ bullet+ micro-optimization- glitch+ uth05win+ tasm-

Back to TH05! Thanks to the good funding situation, I can strike a nice balance between getting TH05 position-independent as quickly as possible, and properly reverse-engineering some missing important parts of the game. Once 100% PI gets the attention of modders, the code will then be in better shape, and a bit more usable than if I just rushed that goal.

By now, I'm apparently also pretty spoiled by TH01's immediate decompilability, after having worked on that game for so long. Reverse-engineering in ASM land is pretty annoying, after all, since it basically boils down to meticulously editing a piece of ASM into something I can confidently call "reverse-engineered". Most of the time, simply decompiling that piece of code would take just a little bit longer, but be massively more useful. So, I immediately tried decompiling with TH05… and it just worked, at every place I tried!? Whatever the issue was that made 📝 segment splitting so annoying at my first attempt, I seem to have completely solved it in the meantime. 🤷 So yeah, backers can now request pretty much any part of TH04 and TH05 to be decompiled immediately, with no additional segment splitting cost.

(Protip for everyone interested in starting their own ReC project: Just declare one segment per function, right from the start, then group them together to restore the original code segmentation…)


Except that TH05 then just throws more of its infamous micro-optimized and undecompilable ASM at you. 🙄 This push covered the function that adjusts the bullet group template based on rank and the selected difficulty, called every time such a group is configured. Which, just like pretty much all of TH05's bullet spawning code, is one of those undecompilable functions. If C allowed labels of other functions as goto targets, it might have been decompilable into something useful to modders… maybe. But like this, there's no point in even trying.

This is such a terrible idea from a software architecture point of view, I can't even. Because now, you suddenly have to mirror your C++ declarations in ASM land, and keep them in sync with each other. I'm always happy when I get to delete an ASM declaration from the codebase once I've decompiled all the instances where it was referenced. But for TH05, we now have to keep those declarations around forever. 😕 And all that for a performance increase you probably couldn't even measure. Oh well, pulling off Galaxy Brain-level ASM optimizations is kind of fun if you don't have portability plans… I guess?

If I started a full fangame mod of a PC-98 Touhou game, I'd base it on TH04 rather than TH05, and backport selected features from TH05 as needed. Just because it was released later doesn't make it better, and this is by far not the only one of ZUN's micro-optimizations that just went way too far.

Dropping down to ASM also makes it easier to introduce weird quirks. Decompiled, one of TH05's tuning conditions for stack groups on Easy Mode would look something like:

case BP_STACK:
	// […]
	if(spread_angle_delta >= 2) {
		stack_bullet_count--;
	}

The fields of the bullet group template aren't typically reset when setting up a new group. So, spread_angle_delta in the context of a stack group effectively refers to "the delta angle of the last spread group that was fired before this stack – whenever that was". uth05win also spotted this quirk, considered it a bug, and wrote fanfiction by changing spread_angle_delta to stack_bullet_count.
As usual for functions that occur in more than one game, I also decompiled the TH04 bullet group tuning function, and it's perfectly sane, with no such quirks.
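
The stale-field behavior can be reproduced in a minimal C++ sketch. Only `BP_STACK`, `spread_angle_delta`, and `stack_bullet_count` come from the snippet above; the struct layout, `BP_SPREAD`, `tune_for_easy()`, and `setup_stack()` are assumptions for illustration, and the tuning is reduced to the one quoted condition.

```cpp
#include <cassert>

enum bullet_pattern_t { BP_SPREAD, BP_STACK };

// Hypothetical reduction of the bullet group template to the relevant fields.
struct bullet_template_t {
	bullet_pattern_t pattern;
	int spread_angle_delta; // only meaningful for spread groups
	int stack_bullet_count; // only meaningful for stack groups
};

// Easy Mode tuning, reduced to the single condition quoted above.
void tune_for_easy(bullet_template_t& tmpl) {
	switch(tmpl.pattern) {
	case BP_STACK:
		// Reads a field that setup_stack() never wrote.
		if(tmpl.spread_angle_delta >= 2) {
			tmpl.stack_bullet_count--;
		}
		break;
	default:
		break;
	}
}

// Configures a stack group without resetting the fields of other group types.
void setup_stack(bullet_template_t& tmpl, int count) {
	assert(count > 0);
	tmpl.pattern = BP_STACK;
	tmpl.stack_bullet_count = count;
}
```

Setting up the same 3-bullet stack twice then yields 3 bullets if the previously configured spread group had a delta below 2, and 2 bullets if it had a delta of 2 or more, even though nothing about the stack group itself changed.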


In the more PI-focused parts of this push, we got the TH05-exclusive smooth boss movement functions, for flying randomly or towards a given point. Pretty unspectacular for the most part, but we've got yet another uth05win inconsistency in the latter one. Once the Y coordinate gets close enough to the target point, it actually speeds up twice as much as the X coordinate would, whereas uth05win used the same speedup factors for both. This might make uth05win a couple of frames slower in all boss fights from Stage 3 on. Hard to measure though – and boss movement partly depends on RNG anyway.


Next up: Shinki's background animations – which are actually the single biggest source of position dependence left in TH05.

📝 Posted:
🚚 Summary of:
P0031, P0032, P0033
Commits:
dea40ad...9f764fa, 9f764fa...e6294c2, e6294c2...6cdd229
💰 Funded by:
zorg
🏷 Tags:
rec98+ th02+ th04+ th05+ file-format+ hud+ score+ tasm- tcc+ micro-optimization- jank+

The glacial pace continues, with TH05's unnecessarily, inappropriately micro-optimized, and hence, un-decompilable code for rendering the current and high score, as well as the enemy health / dream / power bars. While the latter might still pass as well-written ASM, the former goes to such ridiculous levels that it ends up being technically buggy. If you enjoy quality ZUN code, it's definitely worth a read.

In TH05, this all still is at the end of code segment #1, but in TH04, the same code lies all over the same segment. And since I really wanted to move that code into its final form now, I finally did the research into decompiling from anywhere else in a segment.

Turns out we actually can! It's kinda annoying, though: After splitting the segment after the function we want to decompile, we then need to group the two new segments back together into one "virtual segment" matching the original one. But since all ASM in ReC98 heavily relies on being assembled in MASM mode, we then start to suffer from MASM's group addressing quirk. Which then forces us to manually prefix every single function call

with the group name. It's stupidly boring busywork, because of all the function calls you mustn't prefix. Special tooling might make this easier, but I don't have it, and I'm not getting crowdfunded for it.

So while you now definitely can request any specific thing in any of the 5 games to be decompiled right now, it will take slightly longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)

Only one function away from the TH05 shot type control functions now!