⮜ Blog

⮜ List of tags

Showing all posts tagged
and

📝 Posted:
🚚 Summary of:
P0229, P0230, P0231, P0232, P0233, P0234
Commits:
6370f96...d535d87, d535d87...ca523b4, ca523b4...05a49b9, f7ef7f8...abeaf85, abeaf85...dbc5b51, dd2265c...12f29c6
💰 Funded by:
Ember2528, [Anonymous]
🏷 Tags:

128 commits! Who would have thought that the ideal first release of the TH01 Anniversary Edition would involve so much maintenance, and raise so many research questions? It's almost as if the real work only starts after the 100% finalization mark… Once again, I had to steal some funding from the reserved JIS trail word pushes to cover everything I liked to research, which means that the next towards the anything goal will repay this debt. Luckily, this doesn't affect any immediate plans, as I'll be spending March with tasks that are already fully funded.

So, how did this end up so massive? The list of things I originally set out to do was pretty short:

  1. Build entire game into single executable
  2. Fix rendering issues in the one or two most important parts of the game for a good initial impression

But even the first point already started with tons of little cleanup commits. A part of them can definitely be blamed on the rush to hit the 100% decompilation mark before the 25th anniversary last August. However, all the structural changes that I can't commit to master reveal how much of a mess the TH01 codebase actually is.
Merging the executables is mainly difficult because of all the inconsistencies between REIIDEN.EXE and FUUIN.EXE. The worst parts can be found in the REYHI*.DAT format code and the High Score menu, but the little things are just as annoying, like how the current score is an unsigned variable in REIIDEN.EXE, but a signed one in FUUIN.EXE. :zunpet: If it takes me this long and this many commits just to sort out all of these issues, it's no wonder that the only thing I've seen being done with this codebase since TH01's 100% decompilation was a single porting attempt that ended in a rather quick ragequit.
So why are we merging the executables in preparation for the Anniversary Edition, and not waiting with it until we start doing ports?

The game actually is so bloated that the combined binary ended up smaller than the original REIIDEN.EXE. If all you see are the file sizes of the original three executables, this might look like a pretty impressive feat. Like, how can we possibly get 407,812 bytes into less than 238,612 bytes, without using compression?
If you've ever looked at the linker map though, it's not at all surprising. Excluding the aforementioned inconsistencies that are hard to quantify, OP.EXE and FUUIN.EXE only feature 5,767 and 6,475 bytes of unique code and data, respectively. All other code in these binaries is already part of REIIDEN.EXE, with more than half of the size coming from the Borland C++ runtime. The single worst offender here is the C++ exception handler that Borland forces onto every non-.COM binary by default, which alone adds 20,512 bytes even if your binary doesn't use C++ exceptions.
On a more hilarious note, this single line is responsible for pulling another unnecessary 14,242 bytes into OP.EXE and FUUIN.EXE. This floating-point multiplication is completely unnecessary in this context because all possible parameters are integers, but it's enough for Turbo C++ and TLINK to pull in the entire x87 FPU emulation machinery. These two binaries don't even draw lines, but since this function is part of the general graphics code translation unit and contains other functions that these binaries do need, TLINK links in the entire thing. Maybe, multiple executables aren't the best choice either if you use a linker that can't do dead code elimination…

Since the 📝 Orb's physics do turn the entire precision of a double variable into gameplay effects, it's not feasible to ever get rid of all FPU code in TH01. The exception handler, however, can be removed, which easily brings the combined binary below the size of the original REIIDEN.EXE. Compiling all code with a single set of compiler optimization flags, including the more x86-friendly pascal calling convention, then gets us a few more KB on top. As does, of course, removing unused code: The only remaining purpose of features such as 📝 resident palettes is to potentially make porting more difficult for anyone who doesn't immediately realize that nothing in the game uses these functions.
Technically, all unused code would be bloat, but for now, I'm keeping the parts that may tell stories about the game's development history (such as unused effects or the 📝 mouse cursor), or that might help with debugging. Even with that in mind, I've only scratched the surface when it comes to bloat removal, and the binary is only going to get smaller from here. A lot smaller.

If only we now could start MDRV98 from this new combined binary, we wouldn't need a second batch file either…


Which brings us to the first big research question of this delivery. Using the C spawn() function works fine on this compiler, so spawn("MDRV98.COM") would be all we need to do, right? Except that the game crashes very soon after that subprocess returned. :thonk:
So it's not going to be that easy if the spawned process is a TSR. But why should this be a problem? Let's take a look at the DOS heap, and how DOS lays out processes in conventional memory if we launch the game regularly through GAME.BAT:

The rough layout of the DOS heap when launching TH01 from GAME.BAT.

The batch file starts MDRV98 first, which will therefore end up below the game in conventional memory. This is perfect for a TSR: The program can resize itself arbitrarily before returning to DOS, and the rest of memory will be left over for the game. If we assume such a layout, a DOS program can implement a custom memory allocator in a very simple way, as it only has to search for free memory in one direction – and this is exactly how Borland implemented the C heap for functions like malloc() and free(), and the C++ new and delete operators.
But if we spawn MDRV98 after starting TH01, well…

MDRV98 will spawn in the next free memory location, allocate itself, return to TH01… which suddenly finds its C heap blocked from growing. As a result, the next big allocation will immediately fail with a rather misleading "out of memory" error.

So, what can we do about this? Still in a bloat removal mindset, my gut reaction was to just throw out Borland's C heap implementation, and replace it with a very thin wrapper around the DOS heap as managed by INT 21h, AH=48h/49h/4Ah. Like, why did these DOS compilers even bother with a custom allocator in the first place if DOS already comes with a perfectly fine native one? Using the native allocator would completely erase the distinction between TSR memory and game memory, and inherently allow the game to allocate beyond MDRV98.
I did in fact implement this, and noticed even more benefits:

Ultimately though, the drawbacks became too significant. Most of them are related to the PC-98 Touhou games only ever creating a single DOS process, even though they contain multiple executables. Switching executables is done via exec(), which resizes a program's main allocation to match the new binary and then overwrites the old program image with the new one. If you've ever wondered why DOSBox-X only ever shows OP as the active process name in the title bar, you now know why. As far as DOS is concerned, it's still the same OP.EXE process rooted at the same segment, and exec() doesn't bother rewriting the name either. Most importantly though, this is how REIIDEN.EXE can launch into another REIIDEN.EXE process even if there are less than 238,612 bytes free when exec() is called, and without consuming more memory for every successive binary.
For now, ANNIV.EXE still re-exec()s itself at every point where the original game did, as ZUN's original code really depends on being reinitialized at boss and scene boundaries. The resulting accidental semi-hot reloading is also a useful property to retain during development.
So why is the DOS heap a bad idea for regular game allocation after all?

I could release this DOS heap wrapper in unused form for another push if anyone's interested, but for now, I'm pretty happy with not actually using it in the games. Instead, let's stay with the Borland C heap, and find a way to push MDRV98 to the very top of conventional RAM. Like this:

Which is much easier said than done. It would be nice if we could just use the last fit allocation strategy here, but .COM executables always receive all free memory by default anyway, which eliminates any difference between the strategies.
But we can still change memory itself. So let's temporarily claim all remaining free memory, minus the exact amount we need for MDRV98, for our process. Then, the only remaining free space to spawn MDRV98 is at the exact place where we want it to be:

Obviously, we release all the additional memory after spawning MDRV98.

Now we only need to know how much memory to not temporarily allocate. First, we need to replicate the assumption that MDRV98's -M7 command-line parameter corresponds to a resident size of 23,552 bytes. This is not as bad as it seems, because the -M parameter explicitly has a KiB unit, and we can nicely abstract it away for the API.
The (env.) block though? Its minimum size equals the combined length of all environment variables passed to the process, but its maximum size is… not limited at all?! As in, DOS implementations can add and have historically added more free space because some programs insisted on storing their own new environment variables in this exact segment. DOSBox and DOSBox-X follow this tradition by providing a configuration option for the additional amount of environment space, with the latter adding 1024 additional bytes by default, y'know, just in case someone wants to compile FreeDOS on a slow emulator. It's not even worth sending a bug report for this specific case, because it's only a symptom of the fact that unexpectedly large program environment blocks can and will happen, and are to be expected in DOS land.
So thanks to this cruel joke, it's technically impossible to achieve what we want to do there. Hooray! The only thing we can kind of do here is an educated guess: Sum up the length of all environment variables in our environment block, compare that length against the allocated size of the block, and assume that the MDRV98 process will get as much additional memory as our process got. 🤷

The remaining hurdles came courtesy of some Borland C runtime implementation details. You would think that the temporary reallocation could even be done in pure C using the sbrk(), coreleft(), and brk() functions, but all values passed to or returned from these functions are inaccurate because they don't factor in the aforementioned KiB padding to the underlying DOS memory block. So we have to directly use the DOS syscalls after all. Which at least means that learning about them wasn't completely useless…
The final issue is caused inside Borland's spawn() implementation. The environment block for the child process is built out of all the strings reachable from C's environ pointer, which is what that FreeDOS build process should have used. Coalescing them into a single buffer involves yet another C heap allocation… and since we didn't report our DOS memory block manipulation back to the C heap, the malloc() call might think it needs to request more memory from DOS. This resets the DOS memory block back to its intended level, undoing our manipulation right before the actual INT 21h, AH=4Bh EXEC syscall. Or in short:

Manipulate DOS heap ➜ spawn() call ➜ _LoadProg() ➜ allocate and prepare environment block ➜ _spawn() ➜ DOS EXEC syscall

The obvious solution: Replace _LoadProg(), implement the coalescing ourselves, and do it before the heap manipulation. Fortunately, Borland's internal low-level _spawn() function is not static, so we can call it ourselves whenever we want to:

Allocate and prepare environment block ➜ manipulate DOS heap ➜ _spawn() call ➜ EXEC syscall

So yes, launching MDRV98 from C can be done, but it involves advanced witchcraft and is completely ridiculous. :tannedcirno: Launching external sound drivers from a batch file is the right way of doing things.
Fortunately, you don't have to rely on this auto-launching feature. You can still launch DEBLOAT.EXE or ANNIV.EXE from a batch file that launched MDRV98.COM before, and the binaries will detect this case and skip the attempt of launching MDRV98 from C. It's unlikely that my heuristic will ever break, but I definitely recommend replicating GAME.BAT just to be completely sure – especially for user-friendly repacks that don't want to include the original game anyway.
This is also why ANNIV.EXE doesn't launch ZUNSOFT.COM: The "correct" and stable way to launch ANNIV.EXE still involves a batch file, and I would say that expecting people to remove ZUNSOFT.COM from that file is worse than not playing the animation. It's certainly a debate we can have, though.


This deep dive into memory allocation revealed another previously undocumented bug in the original game. The RLE decompression code for the 東方靈異.伝 packfile contains two heap overflows, which are actually triggered by SinGyoku's BOSS1_3.BOS and Konngara's BOSS8_1.BOS. They only do not immediately crash the game when loading these bosses thanks to two implementation details of Borland's C heap. :zunpet:
Obviously, this is a bug we should fix, but according to the definition of bugs, that fix would be exclusive to the anniversary branch. Isn't that too restrictive for something this critical? This code is guaranteed to blow up with a different heap implementation, if only in a Debug build. :thonk: And besides, nobody would notice a fix just by looking at the game's rendered output…

Looks like we have to introduce a fourth category of weird code, in addition to the previous bloat, bug, and quirk categories, for invisible internal issues like these. Let's call it landmine, and fix them on the debloated branch as well. Thanks to Clerish for the naming inspiration!
With this new category, the full definitions for all categories have become quite extensive. Thus, they now live in CONTRIBUTING.md inside the ReC98 repository.

With the new discoveries and the new landmine category, TH01 is now at 67 bugs and 20 landmines. And the solution for the landmine in question? Simplifying the 61 lines of the original code down to 16. And yes, I'm including comments in these numbers – if the interactions of the code are complex enough to require multi-paragraph comments, these are a necessary and valid part of the code.


While we're on the topic of weird code and its visible or invisible effects, there's one thing you might be concerned about. With all the rearchitecting and data shifting we're doing on the debloated branch, what will happen to the 📝 negative glitch stages? These are the result of a clearly observable bug that, by definition, must not be fixed on the debloated branch. But given that the observable layout of the glitch stages is defined by the memory surrounding the scene stage variable, won't the debloated branch inherently alter their appearance (= ⚠️ fanfiction ⚠️), or even remove them completely?

Well, yes, it will. But we can still preserve their layout by hardcoding the exact original data that the game would originally read, and even emulate the original segment relocations and other pieces of global data.
Doing this is feasible thanks to the fact that there are only 4 glitch stages. Unfortunately, the same can't be said for the timer values, which are determined by an array lookup with the un-modulo'd stage ID. If we wanted to preserve those as well, we'd have to bundle an exact copy of the original REIIDEN.EXE data segment to preserve the values of all 32,768 negative stages you could possibly enter, together with a map of all relocations in this segment. 😵 Which I've decided against for now, since this has been going on for far too long already. Let's first see if anyone ever actually complains about details like this…


Alright, time to start the anniversary branch by rendering everything at its correct internal unaligned X position? Eh… maybe not quite yet. If we just hacked all the necessary bit-shifting code into all the format-specific blitting functions, we'd still retain all this largely redundant, bad, and slow code, and would make no progress in terms of portability. It'd be much better to first write a single generic blitter that's decently optimized, but supports all kinds of sprites to make this optimization actually worth something.
So, next research question: How would such a blitter look like? After I learned during my 📝 first foray into cycle counting that port I/O is slow on 486 CPUs, it became clear that TH04's 📝 GRCG batching for pellets was one of the more useful optimizations that probably contributed a big deal towards achieving the high bullet counts of that game. This leads to two conclusions:

Maybe we should also start by not even doing these unaligned bit shifts ourselves, and instead expect the call site to 📝 always deliver a byte-aligned sprite that is correctly preshifted, if necessary? Some day, we definitely should measure how slow runtime shifting would really be…

What we should do, however, are some further general optimizations that I would have expected from master.lib: Unrolling the vertical loop, and baking a single function for every sprite width to eliminate the horizontal loop. We can then use the widest possible x86 MOV instruction for the lowest possible number of cycles per row – for example, we'd blit a 56-wide sprite with three MOVs (32-bit + 16-bit + 8-bit), and a 64-wide one with two 32-bit MOVs.
Or maybe not? There's a lot of blitting code in both master.lib and PC-98 Touhou that checks for empty bytes within sprites to skip needlessly writing them to VRAM:

uint8_t left_half = ((uint8_t *)(sprite))[0];
uint8_t right_half = ((uint8_t *)(sprite))[1];
if(right_half != 0x00) {
	pokeb(VRAM_SEGMENT, (vram_offset + 0), left_half);
}
if(right_half != 0x00) {
	pokeb(VRAM_SEGMENT, (vram_offset + 1), right_half);
}

Which goes against everything you seem to know about computers. We aren't running on an 8-bit CPU here, so wouldn't it be faster to always write both halves of a sprite in a single operation?

uint16_t both_halves = ((uint16_t *)(sprite))[0];
pokew(VRAM_SEGMENT, vram_offset, both_halves);

That's a single CPU instruction, compared to two instructions and two branches. The only possible explanation for this would be that VRAM writes are so slow on PC-98 that you'd want to avoid them at all costs, even if that means additional branching on the CPU to do so. Or maybe that was something you would want to do on certain models with slow VRAM, but not on others?

So I wrote a benchmark to answer all these questions, and to compare my new blitter against typical TH01 blitting code:

A not really representative run on DOSBox-X. Since the master.lib sprite functions are also unbatched, I expect them to not be much faster than the naive C implementation.

2023-03-05-blitperf.zip And here are the real-hardware results I've got from the PC-9800 Central Discord server:

PC-286LS PC-9801ES PC-9821Cb/Cx PC-9821Ap3 PC-9821An PC-9821Nw133 PC-9821Ra20
80286, 12 MHz i386SX, 16 MHz 486SX, 33 MHz 486DX4, 100 MHz Pentium, 90 MHz Pentium, 133 MHz Pentium Pro, 200 MHz
1987 1989 1994 1994 1994 1997 1996
Unchecked C GRCG 36,85 38,42 26,02 26,87 3,98 4,13 2,08 2,16 1,81 1,87 0,86 0,89 1,25 1,25
MOVS GRCG 15,22 16,87 9,33 10,19 1,22 1,37 0,44 0,44
MOV GRCG 15,42 17,08 9,65 10,53 1,15 1,3 0,44 0,44
4-plane 37,23 43,97 29,2 32,96 4,44 5,01 4,39 4,67 5,11 5,32 5,61 5,74 6,63 6,64
Checking first GRCG 17,49 19,15 10,84 11,72 1,27 1,44 1,04 1,07 0,54 0,54
4-plane 46,49 53,36 35,01 38,79 5,66 6,26 5,43 5,74 6,56 6,8 8,08 8,29 10,25 10,29
Checking second GRCG 16,47 18,12 10,77 11,65 1,25 1,39 1,02 0,51 0,51
4-plane 43,41 50,26 33,79 37,82 5,22 5,81 5,14 5,43 6,18 6,4 7,57 7,77 9,58 9,62
Checking both GRCG 16,14 18,03 10,84 11,71 1,33 1,49 1,01 0,49 0,49
4-plane 43,61 50,45 34,11 37,87 5,39 5,99 4,92 5,23 5,88 6,11 7,19 7,43 9,1 9,13
Amount of frames required to render 2000 16×8 pellet sprites on a variety of PC-98 models, using the new generic blitter. Both preshifted (first column) and runtime-shifted (second column) sprites were tested; empty columns correspond to times faster than a single frame. Thanks to cuba200611, Shoutmon, cybermind, and Digmac for running the tests!

The key takeaways:

Since this won't be the only piece of game-independent and explicitly PC-98-specific custom code involved in this delivery, it makes sense to start a dedicated PC-98 platform layer. This code will gradually eliminate the dependency on master.lib and replace it with better optimized and more readable C++ code. The blitting benchmark, for example, is already implemented completely without master.lib.
While this platform layer is mainly written to generate optimal code within Turbo C++ 4.0J, it can also serve as general PC-98 documentation for everyone who prefers code over machine-translating old Japanese books. Not to mention the immediacy of having all actual relevant information in one place, which might otherwise be pretty well hidden in these books, or some obscure old text file. For example, did you know that uploading gaiji via INT 18h might end up disabling the VSync interrupt trigger, deadlocking the process on the next frame delay loop? This nuisance is not replicated by any emulators, and it's quite frustrating to encounter it when trying to run your code on real hardware. master.lib works around it by simply hooking INT 18h and unconditionally reenabling the VSync interrupt trigger after the original handler returns, and so does our platform layer.


So, with the pellet draw calls batched and routed through the new renderer, we should have gained enough free CPU cycles to disable 📝 interlaced pellet rendering without any impact on frame rates?

Well, kinda. We do get 56.4 FPS, but only together with noticeable and reproducible tearing in the top part of the playfield, suggesting exactly why ZUN interlaced the rendering in the first place. 😕 So have we already reached the limit of single-buffered PC-98 games here, or can we still do something about it?
As it turns out, the main bottleneck actually lies in the pellet unblitting code. Every EGC-"accelerated" unblitting call in TH01 is as unbatched as the pellet blitting calls were, spending an additional 17 I/O port writes per call to completely set up and shut down the EGC, every time. And since this is TH01, the two-instruction operation of changing the active PC-98 VRAM page isn't inlined either, but instead done via a function call to a faraway segment. On the 486, that's:

This sums up to

And this calculation even ignores the lack of small micro-optimizations that could further optimize the blitting loop. Multiply that by the game's pellet cap of 100, and we get a 6-digit number of wasted CPU cycles. On paper, that's roughly 1/6 of the time we have for each of our target 56.423 FPS on the game's target 33 MHz systems. Might not sound all too critical, but the single-buffered nature of the game means that we're effectively racing the beam on every frame. In turn, we have to be even more serious about performance.

So, time to also add a batched EGC API to our PC-98 platform layer? Writing our own EGC code presents a nice opportunity to finally look deeper into all its registers and configuration options, and see what exactly we can do about ZUN's enforced 16-pixel alignment.
To nobody's surprise, this alignment is completely unnecessary, and only displays a lack of knowledge about the chip. While it is true that the EGC wants VRAM to be exclusively addressed in 16-bit chunks at 16-bit-aligned addresses, it specifically provides

And it gets even better: After ⌈bitlength ÷ 16⌉ write instructions, the EGC's internal shifter state automatically reinitializes itself in preparation for blitting another row of pixels with the same initially configured bit addresses and length. This is perfect for blitting rectangles, as two I/O port writes before the start of your blitting loop are enough to define your entire rectangle.

The manual nature of reading and writing in 16-pixel chunks does come with a slight pitfall though. If the source bit address is larger than the destination bit address, the first 16-bit read won't fill the EGC's internal shift register with all pixels that should appear in the first 16-pixel destination chunk. In this case, the EGC simply won't write anything and leave the first chunk unchanged. In a 📝 regular blitting loop, however, you expect that memory to be written and immediately move on to the next chunks within the row. As a result, the actual blitting process for such a rectangle will no longer be aligned to the configured address and bit length. The first row of the rectangle will appear 16 pixels to the right of the destination address, and the second one will start at bit offset 0 with pixels from the rightmost byte of the first line, which weren't blitted and remained in the tile register.
There is an easy solution though: Before the horizontal loop on each line of the rectangle, simply read one additional 16-pixel chunk from the source location to prefill the shift register. Thankfully, it's large enough to also fit the second read of the then full 16 pixels, without dropping any pixels along the way.

And that's how we get arbitrarily unaligned rectangle copies with the EGC! Except for a small register allocation trick to use two-register addressing, there's not much use in further optimizations, as the runtime of these inter-page blit operations is dominated by the VRAM page switches anyway.

Except that T98-Next seems to disagree about the register prefilling issue:

Glitched blitting results on T98-Next when trying EGC copies where the source bit address is larger than the destination bit address

Every other emulator agrees with real hardware in this regard, so we can safely assume this to be a bug in T98-Next. Just in case this old emulator with its last release from June 2010 still has any fans left nowadays… For now though, even they can still enjoy the TH01 Anniversary Edition: The only EGC copy algorithm that TH01 actually needs is the left one during the single-buffered tests, which even that emulator gets right.
That only leaves 📝 my old offer of documenting the EGC raster ops, and we've got the EGC figured out completely!


And that did in fact remove tearing from the pellet rendering function! For the first time, we can now fight Elis, Kikuri, Sariel, and Konngara with a doubled pellet frame rate:

Switchable videos like these can nicely provide evidence that these changes have no effect on gameplay, making it easy to see that the Orb still collides with all pellets on the same frames. Also, check out the difference in remaining conventional memory (coreleft)…

With only pellets and no other animation on screen, this exact pattern presents the optimal demonstration case for the new unblitter. But as you can already tell from the invincibility sprites, we'd also need to route every other kind of sprite through the same new code. This isn't all too trivial: Most sprites are still rendered at byte-aligned positions, and their blitting APIs hide that fact by taking a pixel position regardless. This is why we can't just replace ZUN's original 16-pixel-aligned EGC unblitting function with ours, and always have to replace both the blitter and the unblitter on a per-sprite basis.
To completely remove all flickering, we'd also like to get rid of all the sprite-specific unblit ➜ update ➜ render sequences, and instead gather all unblitting code to the beginning of the game loop, before any update and rendering calls. So yeah, it will take a long time to completely get rid of all flickering. Until we're there, I recommend any backer to tell me their favorite boss, so that I can focus on getting that one rendered without any flickering. Remember that here at ReC98, we can have a Touhou character popularity contest at any time during the year, whenever the store is open! :tannedcirno:

In the meantime, the consistent use of 8×8 rectangles during pellet unblitting does significantly reduce flickering across the entire game, and shrinks certain holes that pellets tend to rip into lazily reblitted sprites:

TH01 SinGyoku's crossing pellet pattern in the Anniversary Edition, demonstrating smaller unblitting artifactsThe same frame in the original game, featuring much more giant holes ripped into the sphere sprite
SinGyoku's "crossing pellets" pattern, shortly before completing the transformation back to the sphere.

To round out the first release, I added all the other bug fixes to achieve parity with my previously released patched REIIDEN.EXE builds:

So here it is, the first build of TH01's Anniversary Edition: 2023-03-05-th01-anniv.zip Edit (2023-03-12): If you're playing on Neko Project and seeing more flickering than in the original game, make sure you've checked the Screen → Disp vsync option.

Next up: The long overdue extended trip through the depths of TH02's low-level code. From what I've seen of it so far, the work on this project is finally going to become a bit more relaxing. Which is quite welcome after, what, 6 months of stressful research-heavy work?

📝 Posted:
🚚 Summary of:
P0165, P0166, P0167
Commits:
7a0e5d8...f2bca01, f2bca01...e697907, e697907...c2de6ab
💰 Funded by:
Ember2528
🏷 Tags:

OK, TH01 missile bullets. Can we maybe have a well-behaved entity type, without any weirdness? Just once?

Ehh, kinda. Apart from another 150 bytes wasted on unused structure members, this code is indeed more on the low end in terms of overall jank. It does become very obvious why dodging these missiles in the YuugenMagan, Mima, and Elis fights feels so awful though: An unfair 46×46 pixel hitbox around Reimu's center pixel, combined with the comeback of 📝 interlaced rendering, this time in every stage. ZUN probably did this because missiles are the only 16×16 sprite in TH01 that is blitted to unaligned X positions, which effectively ends up touching a 32×16 area of VRAM per sprite.
But even if we assume VRAM writes to be the bottleneck here, it would have been totally possible to render every missile in every frame at roughly the same amount of CPU time that the original game uses for interlaced rendering:

That's an optimization that would have significantly benefitted the game, in contrast to all of the fake ones introduced in later games. Then again, this optimization is actually something that the later games do, and it might have in fact been necessary to achieve their higher bullet counts without significant slowdown.

Unfortunately, it was only worth decompiling half of the missile code right now, thanks to gratuitous FPU usage in the other half, where 📝 double variables are compared to float literals. That one will have to wait 📝 until after SinGyoku.


After some effectively unused Mima sprite effect code that is so broken that it's impossible to make sense out of it, we get to the final feature I wanted to cover for all bosses in parallel before returning to Sariel: The separate sprite background storage for moving or animated boss sprites in the Mima, Elis, and Sariel fights. But, uh… why is this necessary to begin with? Doesn't TH01 already reserve the other VRAM page for backgrounds?
Well, these sprites are quite big, and ZUN didn't want to blit them from main memory on every frame. After all, TH01 and TH02 had a minimum required clock speed of 33 MHz, half of the speed required for the later three games. So, he simply blitted these boss sprites to both VRAM pages, leading the usual unblitting calls to only remove the other sprites on top of the boss. However, these bosses themselves want to move across the screen… and this makes it necessary to save the stage background behind them in some other way.

Enter .PTN, and its functions to capture a 16×16 or 32×32 square from VRAM into a sprite slot. No problem with that approach in theory, as the size of all these bigger sprites is a multiple of 32×32; splitting a larger sprite into these smaller 32×32 chunks makes the code look just a little bit clumsy (and, of course, slower).
But somewhere during the development of Mima's fight, ZUN apparently forgot that those sprite backgrounds existed. And once Mima's 🚫 casting sprite is blitted on top of her regular sprite, using just regular sprite transparency, she ends up with her infamous third arm:

TH01 Mima's third arm

Ironically, there's an unused code path in Mima's unblit function where ZUN assumes a height of 48 pixels for Mima's animation sprites rather than the actual 64. This leads to even clumsier .PTN function calls for the bottom 128×16 pixels… Failing to unblit the bottom 16 pixels would have also yielded that third arm, although it wouldn't have looked as natural. Still wouldn't say that it was intentional; maybe this casting sprite was just added pretty late in the game's development?


So, mission accomplished, Sariel unblocked… at 2¼ pushes. :thonk: That's quite some time left for some smaller stage initialization code, which bundles a bunch of random function calls in places where they logically really don't belong. The stage opening animation then adds a bunch of VRAM inter-page copies that are not only redundant but can't even be understood without knowing the hidden internal state of the last VRAM page accessed by previous ZUN code…
In better news though: Turbo C++ 4.0 really doesn't seem to have any complexity limit on inlining arithmetic expressions, as long as they only operate on compile-time constants. That's how we get macro-free, compile-time Shift-JIS to JIS X 0208 conversion of the individual code points in the 東方★靈異伝 string, in a compiler from 1994. As long as you don't store any intermediate results in variables, that is… :tannedcirno:

But wait, there's more! With still ¼ of a push left, I also went for the boss defeat animation, which includes the route selection after the SinGyoku fight.
As in all other instances, the 2× scaled font is accomplished by first rendering the text at regular 1× resolution to the other, invisible VRAM page, and then scaled from there to the visible one. However, the route selection is unique in that its scaled text is both drawn transparently on top of the stage background (not onto a black one), and can also change colors depending on the selection. It would have been no problem to unblit and reblit the text by rendering the 1× version to a position on the invisible VRAM page that isn't covered by the 2× version on the visible one, but ZUN (needlessly) clears the invisible page before rendering any text. :zunpet: Instead, he assigned a separate VRAM color for both the 魔界 and 地獄 options, and only changed the palette value for these colors to white or gray, depending on the correct selection. This is another one of the 📝 rare cases where TH01 demonstrates good use of PC-98 hardware, as the 魔界へ and 地獄へ strings don't need to be reblitted during the selection process, only the Orb "cursor" does.

Then, why does this still not count as good-code? When changing palette colors, you kinda need to be aware of everything else that can possibly be on screen, which colors are used there, and which aren't and can therefore be used for such an effect without affecting other sprites. In this case, well… hover over the image below, and notice how Reimu's hair and the bomb sprites in the HUD light up when Makai is selected:

Demonstration of palette changes in TH01's route selection

This push did end on a high note though, with the generic, non-SinGyoku version of the defeat animation being an easily parametrizable copy. And that's how you decompile another 2.58% of TH01 in just slightly over three pushes.


Now, we're not only ready to decompile Sariel, but also Kikuri, Elis, and SinGyoku without needing any more detours into non-boss code. Thanks to the current TH01 funding subscriptions, I can plan to cover most, if not all, of Sariel in a single push series, but the currently 3 pending pushes probably won't suffice for Sariel's 8.10% of all remaining code in TH01. We've got quite a lot of not specifically TH01-related funds in the backlog to pass the time though.

Due to recent developments, it actually makes quite a lot of sense to take a break from TH01: spaztron64 has managed what every Touhou download site so far has failed to do: Bundling all 5 game onto a single .HDI together with pre-configured PC-98 emulators and a nice boot menu, and hosting the resulting package on a proper website. While this first release is already quite good (and much better than my attempt from 2014), there is still a bit of room for improvement to be gained from specific ReC98 research. Next up, therefore:

📝 Posted:
🚚 Summary of:
P0160, P0161
Commits:
e491cd7...42ba4a5, 42ba4a5...81dd96e
💰 Funded by:
Yanga, [Anonymous]
🏷 Tags:

Nothing really noteworthy in TH01's stage timer code, just yet another HUD element that is needlessly drawn into VRAM. Sure, ZUN applies his custom boldfacing effect on top of the glyphs retrieved from font ROM, but he could have easily installed those modified glyphs as gaiji.
Well, OK, halfwidth gaiji aren't exactly well documented, and sometimes not even correctly emulated 📝 due to the same PC-98 hardware oddity I was researching last month. I've reserved two of the pending anonymous "anything" pushes for the conclusion of this research, just in case you were wondering why the outstanding workload is now lower after the two delivered here.

And since it doesn't seem to be clearly documented elsewhere: Every 2 ticks on the stage timer correspond to 4 frames.


So, TH01 rank pellet speed. The resident pellet speed value is a factor ranging from a minimum of -0.375 up to a maximum of 0.5 (pixels per frame), multiplied with the difficulty-adjusted base speed for each pellet and added on top of that same speed. This multiplier is modified

Apparently, ZUN noted that these deltas couldn't be losslessly stored in an IEEE 754 floating-point variable, and therefore didn't store the pellet speed factor exactly in a way that would correspond to its gameplay effect. Instead, it's stored similar to Q12.4 subpixels: as a simple integer, pre-multiplied by 40. This results in a raw range of -15 to 20, which is what the undecompiled ASM calls still use. When spawning a new pellet, its base speed is first multiplied by that factor, and then divided by 40 again. This is actually quite smart: The calculation doesn't need to be aware of either Q12.4 or the 40× format, as ((Q12.4 * factor×40) / factor×40) still comes out as a Q12.4 subpixel even if all numbers are integers. The only limiting issue here would be the potential overflow of the 16-bit multiplication at unadjusted base speeds of more than 50 pixels per frame, but that'd be seriously unplayable.
So yeah, pellet speed modifications are indeed gradual, and don't just fall into the coarse three "high, normal, and low" categories.


That's ⅝ of P0160 done, and the continue and pause menus would make good candidates to fill up the remaining ⅜… except that it seemed impossible to figure out the correct compiler options for this code?
The issues centered around the two effects of Turbo C++ 4.0J's -O switch:

  1. Optimizing jump instructions: merging duplicate successive jumps into a single one, and merging duplicated instructions at the end of conditional branches into a single place under a single branch, which the other branches then jump to
  2. Compressing ADD SP and POP CX stack-clearing instructions after multiple successive CALLs to __cdecl functions into a single ADD SP with the combined parameter stack size of all function calls

But how can the ASM for these functions exhibit #1 but not #2? How can it be seemingly optimized and unoptimized at the same time? The only option that gets somewhat close would be -O- -y, which emits line number information into the .OBJ files for debugging. This combination provides its own kind of #1, but these functions clearly need the real deal.

The research into this issue ended up consuming a full push on its own. In the end, this solution turned out to be completely unrelated to compiler options, and instead came from the effects of a compiler bug in a totally different place. Initializing a local structure instance or array like

const uint4_t flash_colors[3] = { 3, 4, 5 };

always emits the { 3, 4, 5 } array into the program's data segment, and then generates a call to the internal SCOPY@ function which copies this data array to the local variable on the stack. And as soon as this SCOPY@ call is emitted, the -O optimization #1 is disabled for the entire rest of the translation unit?!
So, any code segment with an SCOPY@ call followed by __cdecl functions must strictly be decompiled from top to bottom, mirroring the original layout of translation units. That means no TH01 continue and pause menus before we haven't decompiled the bomb animation, which contains such an SCOPY@ call. 😕
Luckily, TH01 is the only game where this bug leads to significant restrictions in decompilation order, as later games predominantly use the pascal calling convention, in which each function itself clears its stack as part of its RET instruction.


What now, then? With 51% of REIIDEN.EXE decompiled, we're slowly running out of small features that can be decompiled within ⅜ of a push. Good that I haven't been looking a lot into OP.EXE and FUUIN.EXE, which pretty much only got easy pieces of code left to do. Maybe I'll end up finishing their decompilations entirely within these smaller gaps?
I still ended up finding one more small piece in REIIDEN.EXE though: The particle system, seen in the Mima fight.

I like how everything about this animation is contained within a single function that is called once per frame, but ZUN could have really consolidated the spawning code for new particles a bit. In Mima's fight, particles are only spawned from the top and right edges of the screen, but the function in fact contains unused code for all other 7 possible directions, written in quite a bloated manner. This wouldn't feel quite as unused if ZUN had used an angle parameter instead… :thonk: Also, why unnecessarily waste another 40 bytes of the BSS segment?

But wait, what's going on with the very first spawned particle that just stops near the bottom edge of the screen in the video above? Well, even in such a simple and self-contained function, ZUN managed to include an off-by-one error. This one then results in an out-of-bounds array access on the 80th frame, where the code attempts to spawn a 41st particle. If the first particle was unlucky to be both slow enough and spawned away far enough from the bottom and right edges, the spawning code will then kill it off before its unblitting code gets to run, leaving its pixel on the screen until something else overlaps it and causes it to be unblitted.
Which, during regular gameplay, will quickly happen with the Orb, all the pellets flying around, and your own player movement. Also, the RNG can easily spawn this particle at a position and velocity that causes it to leave the screen more quickly. Kind of impressive how ZUN laid out the structure of arrays in a way that ensured practically no effect of this bug on the game; this glitch could have easily happened every 80 frames instead. He almost got close to all bugs canceling out each other here! :godzun:

Next up: The player control functions, including the second-biggest function in all of PC-98 Touhou.

📝 Posted:
🚚 Summary of:
P0149, P0150, P0151, P0152
Commits:
e1a26bb...05e4c4a, 05e4c4a...768251d, 768251d...4d24ca5, 4d24ca5...81fc861
💰 Funded by:
Blue Bolt, Ember2528, -Tom-, [Anonymous]
🏷 Tags:

…or maybe not that soon, as it would have only wasted time to untangle the bullet update commits from the rest of the progress. So, here's all the bullet spawning code in TH04 and TH05 instead. I hope you're ready for this, there's a lot to talk about!

(For the sake of readability, "bullets" in this blog post refers to the white 8×8 pellets and all 16×16 bullets loaded from MIKO16.BFT, nothing else.)


But first, what was going on 📝 in 2020? Spent 4 pushes on the basic types and constants back then, still ended up confusing a couple of things, and even getting some wrong. Like how TH05's "bullet slowdown" flag actually always prevents slowdown and fires bullets at a constant speed instead. :tannedcirno: Or how "random spread" is not the best term to describe that unused bullet group type in TH04.
Or that there are two distinct ways of clearing all bullets on screen, which deserve different names:

Mechanic #1: Clearing bullets for a custom amount of time, awarding 1000 points for all bullets alive on the first frame, and 100 points for all bullets spawned during the clear time.
Mechanic #2: Zapping bullets for a fixed 16 frames, awarding a semi-exponential and loudly announced Bonus!! for all bullets alive on the first frame, and preventing new bullets from being spawned during those 16 frames. In TH04 at least; thanks to a ZUN bug, zapping got reduced to 1 frame and no animation in TH05…

Bullets are zapped at the end of most midboss and boss phases, and cleared everywhere else – most notably, during bombs, when losing a life, or as rewards for extends or a maximized Dream bonus. The Bonus!! points awarded for zapping bullets are calculated iteratively, so it's not trivial to give an exact formula for these. For a small number 𝑛 of bullets, it would exactly be 5𝑛³ - 10𝑛² + 15𝑛 points – or, using uth05win's (correct) recursive definition, Bonus(𝑛) = Bonus(𝑛-1) + 15𝑛² - 5𝑛 + 10. However, one of the internal step variables is capped at a different number of points for each difficulty (and game), after which the points only increase linearly. Hence, "semi-exponential".


On to TH04's bullet spawn code then, because that one can at least be decompiled. And immediately, we have to deal with a pointless distinction between regular bullets, with either a decelerating or constant velocity, and special bullets, with preset velocity changes during their lifetime. That preset has to be set somewhere, so why have separate functions? In TH04, this separation continues even down to the lowest level of functions, where values are written into the global bullet array. TH05 merges those two functions into one, but then goes too far and uses self-modifying code to save a grand total of two local variables… Luckily, the rest of its actual code is identical to TH04.

Most of the complexity in bullet spawning comes from the (thankfully shared) helper function that calculates the velocities of the individual bullets within a group. Both games handle each group type via a large switch statement, which is where TH04 shows off another Turbo C++ 4.0 optimization: If the range of case values is too sparse to be meaningfully expressed in a jump table, it usually generates a linear search through a second value table. But with the -G command-line option, it instead generates branching code for a binary search through the set of cases. 𝑂(log 𝑛) as the worst case for a switch statement in a C++ compiler from 1994… that's so cool. But still, why are the values in TH04's group type enum all over the place to begin with? :onricdennat:
Unfortunately, this optimization is pretty rare in PC-98 Touhou. It only shows up here and in a few places in TH02, compared to at least 50 switch value tables.

In all of its micro-optimized pointlessness, TH05's undecompilable version at least fixes some of TH04's redundancy. While it's still not even optimal, it's at least a decently written piece of ASM… if you take the time to understand what's going on there, because it certainly took quite a bit of that to verify that all of the things which looked like bugs or quirks were in fact correct. And that's how the code for this function ended up with 35% comments and blank lines before I could confidently call it "reverse-engineered"…
Oh well, at least it finally fixes a correctness issue from TH01 and TH04, where an invalid bullet group type would fill all remaining slots in the bullet array with identical versions of the first bullet.

Something that both games also share in these functions is an over-reliance on globals for return values or other local state. The most ridiculous example here: Tuning the speed of a bullet based on rank actually mutates the global bullet template… which ZUN then works around by adding a wrapper function around both regular and special bullet spawning, which saves the base speed before executing that function, and restores it afterward. :zunpet: Add another set of wrappers to bypass that exact tuning, and you've expanded your nice 1-function interface to 4 functions. Oh, and did I mention that TH04 pointlessly duplicates the first set of wrapper functions for 3 of the 4 difficulties, which can't even be explained with "debugging reasons"? That's 10 functions then… and probably explains why I've procrastinated this feature for so long.

At this point, I also finally stopped decompiling ZUN's original ASM just for the sake of it. All these small TH05 functions would look horribly unidiomatic, are identical to their decompiled TH04 counterparts anyway, except for some unique constant… and, in the case of TH05's rank-based speed tuning function, actually become undecompilable as soon as we want to return a C++ class to preserve the semantic meaning of the return value. Mainly, this is because Turbo C++ does not allow register pseudo-variables like _AX or _AL to be cast into class types, even if their size matches. Decompiling that function would have therefore lowered the quality of the rest of the decompiled code, in exchange for the additional maintenance and compile-time cost of another translation unit. Not worth it – and for a TH05 port, you'd already have to decompile all the rest of the bullet spawning code anyway!


The only thing in there that was still somewhat worth being decompiled was the pre-spawn clipping and collision detection function. Due to what's probably a micro-optimization mistake, the TH05 version continues to spawn a bullet even if it was spawned on top of the player. This might sound like it has a different effect on gameplay… until you realize that the player got hit in this case and will either lose a life or deathbomb, both of which will cause all on-screen bullets to be cleared anyway. So it's at most a visual glitch.

But while we're at it, can we please stop talking about hitboxes? At least in the context of TH04 and TH05 bullets. The actual collision detection is described way better as a kill delta of 8×8 pixels between the center points of the player and a bullet. You can distribute these pixels to any combination of bullet and player "hitboxes" that make up 8×8. 4×4 around both the player and bullets? 1×1 for bullets, and 8×8 for the player? All equally valid… or perhaps none of them, once you keep in mind that other entity types might have different kill deltas. With that in mind, the concept of a "hitbox" turns into just a confusing abstraction.

The same is true for the 36×44 graze box delta. For some reason, this one is not exactly around the center of a bullet, but shifted to the right by 2 pixels. So, a bullet can be grazed up to 20 pixels right of the player, but only up to 16 pixels left of the player. uth05win also spotted this… and rotated the deltas clockwise by 90°?!


Which brings us to the bullet updates… for which I still had to research a decompilation workaround, because 📝 P0148 turned out to not help at all? Instead, the solution was to lie to the compiler about the true segment distance of the popup function and declare its signature far rather than near. This allowed ZUN to save that ridiculous overhead of 1 additional far function call/return per frame, and those precious 2 bytes in the BSS segment that he didn't have to spend on a segment value. 📝 Another function that didn't have just a single declaration in a common header file… really, 📝 how were these games even built???

The function itself is among the longer ones in both games. It especially stands out in the indentation department, with 7 levels at its most indented point – and that's the minimum of what's possible without goto. Only two more notable discoveries there:

  1. Bullets are the only entity affected by Slow Mode. If the number of bullets on screen is ≥ (24 + (difficulty * 8) + rank) in TH04, or (42 + (difficulty * 8)) in TH05, Slow Mode reduces the frame rate by 33%, by waiting for one additional VSync event every two frames.
    The code also reveals a second tier, with 50% slowdown for a slightly higher number of bullets, but that conditional branch can never be executed :zunpet:
  2. Bullets must have been grazed in a previous frame before they can be collided with. (Note how this does not apply to bullets that spawned on top of the player, as explained earlier!)

Whew… When did ReC98 turn into a full-on code review?! 😅 And after all this, we're still not done with TH04 and TH05 bullets, with all the special movement types still missing. That should be less than one push though, once we get to it. Next up: Back to TH01 and Konngara! Now have fun rewriting the Touhou Wiki Gameplay pages 😛