⮜ Blog

⮜ List of tags

Showing all posts tagged yuuka-5-

📝 Posted:
🚚 Summary of:
P0240, P0241
be69ab6...40c900f, 40c900f...08352a5
💰 Funded by:
JonathKane, Blue Bolt, [Anonymous]
🏷 Tags:
rec98+ th03+ th04+ th05+ meta+ seihou+ sh01+ bullet+ boss+ kurumi+ reimu-4+ marisa-4+ yuuka-5- yuuka-6+ gengetsu+ bug+ animation+ pc98+

Well, well. My original plan was to ship the first step of Shuusou Gyoku OpenGL support on the next day after this delivery. But unfortunately, the complications just kept piling up, to a point where the required solutions definitely blow the current budget for that goal. I'm currently sitting on over 70 commits that would take at least 5 pushes to deliver as a meaningful release, and all of that is just rearchitecting work, preparing the game for a not too Windows-specific OpenGL backend in the first place. I haven't even written a single line of OpenGL yet… 🥲
This shifts the intended Big Release Month™ to June after all. Now I know that the next round of Shuusou Gyoku features should better start with the SC-88Pro recordings, which are much more likely to get done within their current budget. At least I've already completed the configuration versioning system required for that goal, which leaves only the actual audio part.

So, TH04 position independence. Thanks to a bit of funding for stage dialogue RE, non-ASCII translations will soon become viable, which finally presents a reason to push TH04 to 100% position independence after 📝 TH05 had been there for almost 3 years. I haven't heard back from Touhou Patch Center about how much they want to be involved in funding this goal, if at all, but maybe other backers are interested as well.
And sure, it would be entirely possible to implement non-ASCII translations in a way that retains the layout of the original binaries and can be easily compared at a binary level, in case we consider translations to be a critical piece of infrastructure. This wouldn't even just be an exercise in needless perfectionism, and we only have to look to Shuusou Gyoku to realize why: Players expected that my builds were compatible with existing SpoilerAL SSG files, which was something I hadn't even considered the need for. I mean, the game is open-source 📝 and I made it easy to build. You can just fork the code, implement all the practice features you want in a much more efficient way, and I'd probably even merge your code into my builds then?
But I get it – recompiling the game yields just yet another build that can't be easily compared to the original release. A cheat table is much more trustworthy in giving players the confidence that they're still practicing the same original game. And given the current priorities of my backers, it'll still take a while for me to implement proof by replay validation, which will ultimately free every part of the community from depending on the original builds of both Seihou and PC-98 Touhou.

However, such an implementation within the original binary layout would significantly drive up the budget of non-ASCII translations, and I sure don't want to constantly maintain this layout during development. So, let's chase TH04 position independence like it's 2020, and quickly cover a larger amount of PI-relevant structures and functions at a shallow level. The only parts I decompiled for now contain calculations whose intent can't be clearly communicated in ASM. Hitbox visualizations or other more in-depth research would have to wait until I get to the proper decompilation of these features.
But even this shallow work left us with a large amount of TH04-exclusive code that had its worst parts RE'd and could be decompiled fairly quickly. If you want to see big TH04 finalization% gains, general TH04 progress would be a very good investment.

The first push went to the often-mentioned stage-specific custom entities that share a single statically allocated buffer. Back in 2020, I 📝 wrongly claimed that these were a TH05 innovation, but the system actually originated in TH04. Both games use a 26-byte structure, but TH04 only allocates a 32-element array rather than TH05's 64-element one. The conclusions from back then still apply, but I also kept wondering why these games used a static array for these entities to begin with. You know what they call an area of memory that you can cleanly repurpose for things? That's right, a heap! :tannedcirno: And absolutely no one would mind one additional heap allocation at the start of a stage, next to the ones for all the sprites and portraits.
However, we are still running in Real Mode with segmented memory. Accessing anything outside a common data segment involves modifying segment registers, which has a nonzero CPU cycle cost, and Turbo C++ 4.0J is terrible at optimizing away the respective instructions. Does this matter? Probably not, but you don't take "risks" like these if you're in a permanent micro-optimization mindset… :godzun:

In TH04, this system is used for:

  1. Kurumi's symmetric bullet spawn rays, fired from her hands towards the left and right edges of the playfield. These are rather infamous for being the last thing you see before 📝 the Divide Error crash that can happen in ZUN's original build. Capped to 6 entities.

  2. The 4 📝 bits used in Marisa's Stage 4 boss fight. Coincidentally also related to the rare Divide Error crash in that fight.

  3. Stage 4 Reimu's spinning orbs. Note how the game uses two different sets of sprites just to have two different outline colors. This was probably better than messing with the palette, which can easily cause unintended effects if you only have 16 colors to work with. Heck, I have an entire blog post tag just to highlight these cases. Capped to the full 32 entities.

  4. The chasing cross bullets, seen in Phase 14 of the same Stage 6 Yuuka fight. Featuring some smart sprite work, making use of point symmetry to achieve a fluid animation in just 4 frames. This is good-code in sprite form. Capped to 31 entities, because the 32nd custom entity during this fight is defined to be…

  5. The single purple pulsating and shrinking safety circle, seen in Phase 4 of the same fight. The most interesting aspect here is actually still related to the cross bullets, whose spawn function is wrongly limited to 32 entities and could theoretically overwrite this circle. :zunpet: This is strictly landmine territory though:

    • Yuuka never uses these bullets and the safety circle simultaneously
    • She never spawns more than 24 cross bullets
    • All cross bullets are fast enough to have left the screen by the time Yuuka restarts the corresponding subpattern
    • The cross bullets spawn at Yuuka's center position, and assign its Q12.4 coordinates to structure fields that the safety circle interprets as raw pixels. The game does try to render the circle afterward, but since Yuuka's static position during this phase is nowhere near a valid pixel coordinate, it is immediately clipped.

  6. The flashing lines seen in Phase 5 of the Gengetsu fight, telegraphing the slightly random bullet columns.

    The spawn column lines in the TH05 Gengetsu fight, in the first of their two flashing colors.The spawn column lines in the TH05 Gengetsu fight, in the second of their two flashing colors.

These structures only took 1 push to reverse-engineer rather than the 2 I needed for their TH05 counterparts because they are much simpler in this game. The "structure" for Gengetsu's lines literally uses just a single X position, with the remaining 24 bytes being basically padding. The only minor bug I found on this shallow level concerns Marisa's bits, which are clipped at the right and bottom edges of the playfield 16 pixels earlier than you would expect:

The remaining push went to a bunch of smaller structures and functions:

To top off the second push, we've got the vertically scrolling checkerboard background during the Stage 6 Yuuka fight, made up of 32×32 squares. This one deserves a special highlight just because of its needless complexity. You'd think that even a performant implementation would be pretty simple:

  1. Set the GRCG to TDW mode
  2. Set the GRCG tile to one of the two square colors
  3. Start with Y as the current scroll offset, and X as some indicator of which color is currently shown at the start of each row of squares
  4. Iterate over all lines of the playfield, filling in all pixels that should be displayed in the current color, skipping over the other ones
  5. Count down Y for each line drawn
  6. If Y reaches 0, reset it to 32 and flip X
  7. At the bottom of the playfield, change the GRCG tile to the other color, and repeat with the initial value of X flipped

The most important aspect of this algorithm is how it reduces GRCG state changes to a minimum, avoiding the costly port I/O that we've identified time and time again as one of the main bottlenecks in TH01. With just 2 state variables and 3 loops, the resulting code isn't that complex either. A naive implementation that just drew the squares from top to bottom in a single pass would barely be simpler, but much slower: By changing the GRCG tile on every color, such an implementation would burn a low 5-digit number of CPU cycles per frame for the 12×11.5-square checkerboard used in the game.
And indeed, ZUN retained all important aspects of this algorithm… but still implemented it all in ASM, with a ridiculous layer of x86 segment arithmetic on top? :zunpet: Which blows up the complexity to 4 state variables, 5 nested loops, and a bunch of constants in unusual units. I'm not sure what this code is supposed to optimize for, especially with that rather questionable register allocation that nevertheless leaves one of the general-purpose registers unused. :onricdennat: Fortunately, the function was still decompilable without too many code generation hacks, and retains the 5 nested loops in all their goto-connected glory. If you want to add a checkerboard to your next PC-98 demo, just stick to the algorithm I gave above.
(Using a single XOR for flipping the starting X offset between 32 and 64 pixels is pretty nice though, I have to give him that.)

This makes for a good occasion to talk about the third and final GRCG mode, completing the series I started with my previous coverage of the 📝 RMW and 📝 TCR modes. The TDW (Tile Data Write) mode is the simplest of the three and just writes the 8×1 GRCG tile into VRAM as-is, without applying any alpha bitmask. This makes it perfect for clearing rectangular areas of pixels – or even all of VRAM by doing a single memset():

// Set up the GRCG in TDW mode.
outportb(0x7C, 0x80);

// Fill the tile register with color #7 (0111 in binary).
outportb(0x7E, 0xFF); // Plane 0: (B): (********)
outportb(0x7E, 0xFF); // Plane 1: (R): (********)
outportb(0x7E, 0xFF); // Plane 2: (G): (********)
outportb(0x7E, 0x00); // Plane 3: (E): (        )

// Set the 32 pixels at the top-left corner of VRAM to the exact contents of
// the tile register, effectively repeating the tile 4 times. In TDW mode, the
// GRCG ignores the CPU-supplied operand, so we might as well just pass the
// contents of a register with the intended width. This eliminates useless load
// instructions in the compiled assembly, and even sort of signals to readers
// of this code that we do not care about the source value.
*reinterpret_cast<uint32_t far *>(MK_FP(0xA800, 0)) = _EAX;

// Fill the entirety of VRAM with the GRCG tile. A simple C one-liner that will
// probably compile into a single `REP STOS` instruction. Unfortunately, Turbo
// C++ 4.0J only ever generates the 16-bit `REP STOSW` here, even when using
// the `__memset__` intrinsic and when compiling in 386 mode. When targeting
// that CPU and above, you'd ideally want `REP STOSD` for twice the speed.
memset(MK_FP(0xA800, 0), _AL, ((640 / 8) * 400));

However, this might make you wonder why TDW mode is even necessary. If it's functionally equivalent to RMW mode with a CPU-supplied bitmask made up entirely of 1 bits (i.e., 0xFF, 0xFFFF, or 0xFFFFFFFF), what's the point? The difference lies in the hardware implementation: If all you need to do is write tile data to VRAM, you don't need the read and modify parts of RMW mode which require additional processing time. The PC-9801 Programmers' Bible claims a speedup of almost 2× when using TDW mode over equivalent operations in RMW mode.
And that's the only performance claim I found, because none of these old PC-98 hardware and programming books did any benchmarks. Then again, it's not too interesting of a question to benchmark either, as the byte-aligned nature of TDW blitting severely limits its use in a game engine anyway. Sure, maybe it makes sense to temporarily switch from RMW to TDW mode if you've identified a large rectangular and byte-aligned section within a sprite that could be blitted without a bitmask? But the necessary identification work likely nullifies the performance gained from TDW mode, I'd say. In any case, that's pretty deep micro-optimization territory. Just use TDW mode for the few cases it's good at, and stick to RMW mode for the rest.

So is this all that can be said about the GRCG? Not quite, because there are 4 bits I haven't talked about yet…

And now we're just 5.37% away from 100% position independence for TH04! From this point, another 2 pushes should be enough to reach this goal. It might not look like we're that close based on the current estimate, but a big chunk of the remaining numbers are false positives from the player shot control functions. Since we've got a very special deadline to hit, I'm going to cobble these two pushes together from the two current general subscriptions and the rest of the backlog. But you can, of course, still invest in this goal to allow the existing contributions to go to something else.
… Well, if the store was actually open. :thonk: So I'd better continue with a quick task to free up some capacity sooner rather than later. Next up, therefore: Back to TH02, and its item and player systems. Shouldn't take that long, I'm not expecting any surprises there. (Yeah, I know, famous last words…)

📝 Posted:
🚚 Summary of:
P0168, P0169
c2de6ab...8b046da, 8b046da...479b766
💰 Funded by:
rosenrose, Blue Bolt
🏷 Tags:
rec98+ th04+ th05+ boss+ yuuka-5- blitting+ bug+ master.lib+ waste+ mod+

EMS memory! The infamous stopgap measure between the 640 KiB ("ought to be enough for everyone") of conventional memory offered by DOS from the very beginning, and the later XMS standard for accessing all the rest of memory up to 4 GiB in the x86 Protected Mode. With an optionally active EMS driver, TH04 and TH05 will make use of EMS memory to preload a bunch of situational .CDG images at the beginning of MAIN.EXE:

  1. The "eye catch" game title image, shown while stages are loaded
  2. The character-specific background image, shown while bombing
  3. The player character dialog portraits
  4. TH05 additionally stores the boss portraits there, preloading them at the beginning of each stage. (TH04 instead keeps them in conventional memory during the entire stage.)

Once these images are needed, they can then be copied into conventional memory and accessed as usual.

Uh… wait, copied? It certainly would have been possible to map EMS memory to a regular 16-bit Real Mode segment for direct access, bank-switching out rarely used system or peripheral memory in exchange for the EMS data. However, master.lib doesn't expose this functionality, and only provides functions for copying data from EMS to regular memory and vice versa.
But even that still makes EMS an excellent fit for the large image files it's used for, as it's possible to directly copy their pixel data from EMS to VRAM. (Yes, I tried!) Well… would, because ZUN doesn't do that either, and always naively copies the images to newly allocated conventional memory first. In essence, this dumbs down EMS into just another layer of the memory hierarchy, inserted between conventional memory and disk: Not quite as slow as disk, but still requiring that memcpy() to retrieve the data. Most importantly though: Using EMS in this way does not increase the total amount of memory simultaneously accessible to the game. After all, some other data will have to be freed from conventional memory to make room for the newly loaded data.

The most idiomatic way to define the game-specific layout of the EMS area would be either a struct or an enum. Unfortunately, the total size of all these images exceeds the range of a 16-bit value, and Turbo C++ 4.0J supports neither 32-bit enums (which are silently degraded to 16-bit) nor 32-bit structs (which simply don't compile). That still leaves raw compile-time constants though, you only have to manually define the offset to each image in terms of the size of its predecessor. But instead of doing that, ZUN just placed each image at a nice round decimal offset, each slightly larger than the actual memory required by the previous image, just to make sure that everything fits. :tannedcirno: This results not only in quite a bit of unnecessary padding, but also in technically the single biggest amount of "wasted" memory in PC-98 Touhou: Out of the 180,000 (TH04) and 320,000 (TH05) EMS bytes requested, the game only uses 135,552 (TH04) and 175,904 (TH05) bytes. But hey, it's EMS, so who cares, right? Out of all the opportunities to take shortcuts during development, this is among the most acceptable ones. Any actual PC-98 model that could run these two games comes with plenty of memory for this to not turn into an actual issue.

On to the EMS-using functions themselves, which are the definition of "cross-cutting concerns". Most of these have a fallback path for the non-EMS case, and keep the loaded .CDG images in memory if they are immediately needed. Which totally makes sense, but also makes it difficult to find names that reflect all the global state changed by these functions. Every one of these is also just called from a single place, so inlining them would have saved me a lot of naming and documentation trouble there.
The TH04 version of the EMS allocation code was actually displayed on ZUN's monitor in the 2010 MAG・ネット documentary; WindowsTiger already transcribed the low-quality video image in 2019. By 2015 ReC98 standards, I would have just run with that, but the current project goal is to write better code than ZUN, so I didn't. 😛 We sure ain't going to use magic numbers for EMS offsets.

The dialog init and exit code then is completely different in both games, yet equally cross-cutting. TH05 goes even further in saving conventional memory, loading each individual player or boss portrait into a single .CDG slot immediately before blitting it to VRAM and freeing the pixel data again. People who play TH05 without an active EMS driver are surely going to enjoy the hard drive access lag between each portrait change… :godzun: TH04, on the other hand, also abuses the dialog exit function to preload the Mugetsu defeat / Gengetsu entrance and Gengetsu defeat portraits, using a static variable to track how often the function has been called during the Extra Stage… who needs function parameters anyway, right? :zunpet:

This is also the function in which TH04 infamously crashes after the Stage 5 pre-boss dialog when playing with Reimu and without any active EMS driver. That crash is what motivated this look into the games' EMS usage… but the code looks perfectly fine? Oh well, guess the crash is not related to EMS then. Next u–

OK, of course I can't leave it like that. Everyone is expecting a fix now, and I still got half of a push left over after decompiling the regular EMS code. Also, I've now RE'd every function that could possibly be involved in the crash, and this is very likely to be the last time I'll be looking at them.

Turns out that the bug has little to do with EMS, and everything to do with ZUN limiting the amount of conventional RAM that TH04's MAIN.EXE is allowed to use, and then slightly miscalculating this upper limit. Playing Stage 5 with Reimu is the most asset-intensive configuration in this game, due to the combination of

The star image used in TH04's Stage 5.
The star image used in TH04's Stage 5.

Remove any single one of the above points, and this crash would have never occurred. But with all of them combined, the total amount of memory consumed by TH04's MAIN.EXE just barely exceeds ZUN's limit of 320,000 bytes, by no more than 3,840 bytes, the size of the star image.

But wait: As we established earlier, EMS does nothing to reduce the amount of conventional memory used by the game. In fact, if you disabled TH04's EMS handling, you'd still get this crash even if you are running an EMS driver and loaded DOS into the High Memory Area to free up as much conventional RAM as possible. How can EMS then prevent this crash in the first place?

The answer: It's only because ZUN's usage of EMS bypasses the need to load the cached images back out of the XOR-encrypted 東方幻想.郷 packfile. Leaving aside the general stupidity of any game data file encryption*, master.lib's decryption implementation is also quite wasteful: It uses a separate buffer that receives fixed-size chunks of the file, before decrypting every individual byte and copying it to its intended destination buffer. That really resembles the typical slowness of a C fread() implementation more than it does the highly optimized ASM code that master.lib purports to be… And how large is this well-hidden decryption buffer? 4 KiB. :onricdennat:

So, looking back at the game, here is what happens once the Stage 5 pre-battle dialog ends:

  1. Reimu's bomb background image, which was previously freed to make space for her dialog portraits, has to be loaded back into conventional memory from disk
  2. BB0.CDG is found inside the 東方幻想.郷 packfile
  3. file_ropen() ends up allocating a 4 KiB buffer for the encrypted packfile data, getting us the decisive ~4 KiB closer to the memory limit
  4. The .CDG loader tries to allocate 52 608 contiguous bytes for the pixel data of Reimu's bomb image
  5. This would exceed the memory limit, so hmem_allocbyte() fails and returns a nullptr
  6. ZUN doesn't check for this case (as usual)
  7. The pixel data is loaded to address 0000:0000, overwriting the Interrupt Vector Table and whatever comes after
  8. The game crashes
The final frame rendered before the TH04 Stage 5 Reimu No-EMS crash
The final frame rendered by a crashing TH04.

The 4 KiB encryption buffer would only be freed by the corresponding file_close() call, which of course never happens because the game crashes before it gets there. At one point, I really did suspect the cause to be some kind of memory leak or fragmentation inside master.lib, which would have been quite delightful to fix.
Instead, the most straightforward fix here is to bump up that memory limit by at least 4 KiB. Certainly easier than squeezing in a cdg_free() call for the star image before the pre-boss dialog without breaking position dependence.

Or, even better, let's nuke all these memory limits from orbit because they make little sense to begin with, and fix every other potential out-of-memory crash that modders would encounter when adding enough data to any of the 4 games that impose such limits on themselves. Unless you want to launch other binaries (which need to do their own memory allocations) after launching the game, there's really no reason to restrict the amount of memory available to a DOS process. Heck, whenever DOS creates a new one, it assigns all remaining free memory by default anyway.
Removing the memory limits also removes one of ZUN's few error checks, which end up quitting the game if there isn't at least a given maximum amount of conventional RAM available. While it might be tempting to reserve enough memory at the beginning of execution and then never check any allocation for a potential failure, that's exactly where something like TH04's crash comes from.
This game is also still running on DOS, where such an initial allocation failure is very unlikely to happen – no one fills close to half of conventional RAM with TSRs and then tries running one of these games. It might have been useful to detect systems with less than 640 KiB of actual, physical RAM, but none of the PC-98 models with that little amount of memory are fast enough to run these games to begin with. How ironic… a place where ZUN actually added an error check, and then it's mostly pointless.

Here's an archive that contains both fix variants, just in case. These were compiled from the th04_noems_crash_fix and mem_assign_all branches, and contain as little code changes as possible.
Edit (2022-04-18): For TH04, you probably want to download the 📝 community choice fix package instead, which contains this fix along with other workarounds for the Divide error crashes. 2021-11-29-Memory-limit-fixes.zip

So yeah, quite a complex bug, leaving no time for the TH03 scorefile format research after all. Next up: Raising prices.