ReC98

📝 Posted:: 2023-03-30 13:23 UTC
🚚 Summary of:: P0235, P0236, P0237
⌨ Commits:: e7a9262...62c4b7f, 62c4b7f...7fa9038, 7fa9038...c5e51e6
💰 Funded by:: Ember2528, Yanga
🏷 Tags:

So, TH02! Being the only game whose main binary hadn't seen any dedicated attention ever, we get to start the TH02-related blog posts at the very beginning with the most foundational pieces of code. The stage tile system is the best place to start here: It not only blocks every entity that is rendered on top of these tiles, but is curiously placed right next to master.lib code in TH02, and would need to be separated out into its own translation unit before we can do the same with all the master.lib functions.

In late 2018, I already RE'd 📝 TH04's and TH05's stage tile implementation, but haven't properly documented it on this blog yet, so this post is also going to include the details that are unique to those games. On a high level, the stage tile system works identically in all three games:

The tiles themselves are 16×16 pixels large, and a stage can use 100 of them at the same time.
The optimal way of blitting tiles would involve VRAM-to-VRAM copies within the same page using the EGC, and that's exactly what the games do. All tiles are stored on both VRAM pages within the rightmost 64×400 pixels of the screen just right next to the HUD, and you only don't see them because the games cover the same area in text RAM with black cells:

The initial screen of TH02's Stage 1, with the tile source area uncovered by filling the same area in text RAM with transparent cells instead of black ones. In TH02, this also reveals how the tile area ends with a bunch of glitch tiles, tinted blue in the image. These are the result of ZUN unconditionally blitting 100 tile images every time, regardless of how many are actually contained in an .MPN file.
These glitch tiles are another good example of a ZUN landmine. Their appearance is the result of reading heap memory outside allocated boundaries, which can easily cause segmentation faults when porting the game to a system with virtual memory. Therefore, these would not just be removed in this game's Anniversary Edition, but on the more conservative debloated branch as well. Since the game never uses these tiles and you can't observe them unless you manipulate text RAM from outside the confines of the game, it's not a bug according to our definition.
To reduce the memory required for a map, tiles are arranged into fixed vertical sections of a game-specific constant size.
The 6 24×8-tile sections defined in TH02's STAGE0.MAP, in reverse order compared to how they're defined in the file. Note the duplicated row at the top of the final section: The boss fight starts once the game scrolled the last full row of tiles onto the top of the screen, not the playfield. But since the PC-98 text chip covers the top tile row of the screen with black cells, this final row is never visible, which effectively reduces a map's final tile section to 7 rows rather than 8.
The actual stage map then is simply a list of these tile sections, ordered from the start/bottom to the top/end.
Any manipulation of specific tiles within the fixed tile sections has to be hardcoded. An example can be found right in Stage 1, where the Shrine Tank leaves track marks on the tiles it appears to drive over:

This video also shows off the two issues with Touhou's first-ever midboss: The replaced tiles are rendered below the midboss during their first 4 frames, and maybe ZUN should have stopped the tile replacements one row before the timeout. The first one is clearly a bug, but it's not so clear-cut with the second one. I'd need to look at the code to tell for sure whether it's a quirk or a bug.

The differences between the three games can best be summarized in a table:

	TH02	TH04	TH05
Tile image file extension	.MPN
Tile section format	.MAP
Tile section order defined as part of	.DT1	.STD
Tile section index format	0-based ID		0-based ID × 2
Tile image index format	Index between 0 and 100, 1 byte	VRAM offset in tile source area, 2 bytes
Scroll speed control	Hardcoded	Part of the .STD format, defined per referenced tile section
Redraw granularity	Full tiles (16×16)	Half tiles (16×8)
Rows per tile section	8	5
Maximum number of tile sections	16	32
Lowest number of tile sections used	5 (Stage 3 / Extra)	8 (Stage 6)	11 (Stage 2 / 4)
Highest number of tile sections used	13 (Stage 4)	19 (Extra)	24 (Stage 3)
Maximum length of a map	320 sections (static buffer)	256 sections (format limitation)
Shortest map	14 sections (Stage 5)	20 sections (Stage 5)	15 sections (Stage 2)
Longest map	143 sections (Stage 4)	95 sections (Stage 4)	40 sections (Stage 1 / 4 / Extra)

The most interesting part about stage tiles is probably the fact that some of the .MAP files contain unused tile sections. 👀 Many of these are empty, duplicates, or don't really make sense, but a few are unique, fit naturally into their respective stage, and might have been part of the map during development. In TH02, we can find three unused sections in Stage 5:

Section 0 of TH02's STAGE4.MAP — The non-empty tile sections defined in TH02's `STAGE4.MAP`, showing off three unused ones.

Section 1 of TH02's STAGE4.MAP — The non-empty tile sections defined in TH02's `STAGE4.MAP`, showing off three unused ones.

These unused tile sections are much more common in the later games though, where we can find them in TH04's Stage 3, 4, and 5, and TH05's Stage 1, 2, and 4. I'll document those once I get to finalize the tile rendering code of these games, to leave some more content for that blog post. TH04/TH05 tile code would be quite an effective investment of your money in general, as most of it is identical across both games. Or how about going for a full-on PC-98 Touhou map viewer and editor GUI?

Compared to TH04 and TH05, TH02's stage tile code definitely feels like ZUN was just starting to understand how to pull off smooth vertical scrolling on a PC-98. As such, it comes with a few inefficiencies and suboptimal implementation choices:

The redraw flag for each tile is stored in a 24×25 bool array that does nothing with 7 of the 8 bits.
During bombs and the Stage 4, 5, and Extra bosses, the game disables the tile system to render more elaborate backgrounds, which require the playfield to be flood-filled with a single color on every frame. ZUN uses the GRCG's RMW mode rather than TDW mode for this, leaving almost half of the potential performance on the table for no reason. Literally, changing modes only involves changing a single constant.
The scroll speed could theoretically be changed at any time. However, the function that scrolls in new stage tiles can only ever blit part of a single tile row during every call, so it's up to the caller to ensure that scrolling always ends up on an exact 16-pixel boundary. TH02 avoids this problem by keeping the scroll speed constant across a stage, using 2 pixels for Stage 4 and 1 pixel everywhere else.
Since the scroll speed is given in pixels, the slowest speed would be 1 pixel per frame. To allow the even slower speeds seen in the final game, TH02 adds a separate scroll interval variable that only runs the scroll function every 𝑛th frame, effectively adding a prescaler to the scroll speed. In TH04 and TH05, the speed is specified as a Q12.4 value instead, allowing true fractional speeds at any multiple of ¹/₁₆ pixels. This also necessitated a fixed algorithm that correctly blits tile lines from two rows.
Finally, we've got a few inconsistencies in the way the code handles the two VRAM pages, which cause a few unnecessary tiles to be rendered to just one of the two pages. Mentioning that just in case someone tries to play this game with a fully cleared text RAM and wonders where the flickering tiles come from.

Even though this was ZUN's first attempt at scrolling tiles, he already saw it fit to write most of the code in assembly. This was probably a reaction to all of TH01's performance issues, and the frame rate reduction workarounds he implemented to keep the game from slowing down too much in busy places. "If TH01 was all C++ and slow, TH02 better contain more ASM code, and then it will be fast, right?"
Another reason for going with ASM might be found in the kind of documentation that may have been available to ZUN. Last year, the PC-98 community discovered and scanned two new game programming tutorial books from 1991 (1, 2). Their example code is not only entirely written in assembly, but restricts itself to the bare minimum of x86 instructions that were available on the 8086 CPU used by the original PC-9801 model 9 years earlier. Such code is not only suboptimal on the 486, but can often be actually worse than what your C++ compiler would generate. TH02 is where the trend of bad hand-written ASM code started, and it 📝 only intensified in ZUN's later games. So, don't copy code from these books unless you absolutely want to target the earlier 8086 and 286 models. Which, 📝 as we've gathered from the recent blitting benchmark results, are not all too common among current real-hardware owners.
That said, all that ASM code really only impacts readability and maintainability. Apart from the aforementioned issues, the algorithms themselves are mostly fine – especially since most EGC and GRCG operations are decently batched this time around, in contrast to TH01.

Luckily, the tile functions merely use inline assembly within a typical C function and can therefore be at least part of a C++ source file, even if the result is pretty ugly. This time, we can actually be sure that they weren't written directly in a .ASM file, because they feature x86 instruction encodings that can only be generated with Turbo C++ 4.0J's inline assembler, not with TASM. The same can't unfortunately be said about the following function in the same segment, which marks the tiles covered by the spark sprites for redrawing. In this one, it took just one dumb hand-written ASM inconsistency in the function's epilog to make the entire function undecompilable.
The standard x86 instruction sequence to set up a stack frame in a function prolog looks like this:

PUSH	BP
MOV 	BP, SP
SUB 	SP, ?? ; if the function needs the stack for local variables

When compiling without optimizations, Turbo C++ 4.0J will replace this sequence with a single ENTER instruction. That one is two bytes smaller, but much slower on every x86 CPU except for the 80186 where it was introduced.

In functions without local variables, BP and SP remain identical, and a single POP BP is all that's needed in the epilog to tear down such a stack frame before returning from the function. Otherwise, the function needs an additional

MOV SP,
	BP

instruction to pop all local variables. With x86 being the helpful CISC architecture that it is, the 80186 also introduced the LEAVE instruction to perform both tasks. Unlike ENTER, this single instruction is faster than the raw two instructions on a lot of x86 CPUs (and even current ones!), and it's always smaller, taking up just 1 byte instead of 3.
So what if you use LEAVE even if your function doesn't use local variables?

The fact that the instruction first does the equivalent of MOV SP, BP doesn't matter if these registers are identical, and who cares about the additional CPU cycles of LEAVE compared to just POP BP, right? So that's definitely something you could theoretically do, but not something that any compiler would ever generate.

And so, TH02 MAIN.EXE decompilation already hits the first brick wall after two pushes. Awesome! Theoretically, we could slowly mash through this wall using the 📝 code generator. But having such an inconsistency in the function epilog would mean that we'd have to keep Turbo C++ 4.0J from emitting any epilog or prolog code so that we can write our own. This means that we'd once again have to hide any use of the SI and DI registers from the compiler… and doing that requires code generation macros for 22 of the 49 instructions of the function in question, almost none of which we currently have. So, this gets quite silly quite fast, especially if we only need to do it for one single byte.

Instead, wouldn't it be much better if we had a separate build step between compile and link time that allowed us to replicate mistakes like these by just patching the compiled .OBJ files? These files still contain the names of exported functions for linking, which would allow us to look up the code of a function in a robust manner, navigate to specific instructions using a disassembler, replace them, and write the modified .OBJ back to disk before linking. Such a system could then naturally expand to cover all other decompilation issues, culminating in a full-on optimizer that could even recreate ZUN's self-modifying code. At that point, we would have sealed away all of ZUN's ugly ASM code within a separate build step, and could finally decompile everything into readable C++.

Pulling that off would require a significant tooling investment though. Patching that one byte in TH02's spark invalidation function could be done within 1 or 2 pushes, but that's just one issue, and we currently have 32 other .ASM files with undecompilable code. Also, note that this is fundamentally different from what we're doing with the debloated branch and the Anniversary Editions. Mistake patching would purely be about having readable code on master that compiles into ZUN's exact binaries, without fixing weird code. The Anniversary Editions go much further and rewrite such code in a much more fundamental way, improving it further than mistake patching ever could.
Right now, the Anniversary Editions seem much more popular, which suggests that people just want 100% RE as fast as possible so that I can start working on them. In that case, why bother with such undecompilable functions, and not just leave them in raw and unreadable x86 opcode form if necessary… But let's first see how much backer support there actually is for mistake patching before falling back on that.

The best part though: Once we've made a decision and then covered TH02's spark and particle systems, that was it, and we will have already RE'd all ZUN-written PC-98-specific blitting code in this game. Every further sprite or shape is rendered via master.lib, and is thus decently abstracted. Guess I'll need to update 📝 the assessment of which PC-98 Touhou game is the easiest to port, because it sure isn't TH01, as we've seen with all the work required for the first Anniversary Edition build.

Until then, there are still enough parts of the game that don't use any of the remaining few functions in the _TEXT segment. Previously, I mentioned in the 📝 status overview blog post that TH02 had a seemingly weird sprite system, but the spark and point popup ( 〇一二三四五六七八九十×÷ ) structures showed that the game just stores the current and previous position of its entities in a slightly different way compared to the rest of PC-98 Touhou. Instead of having dedicated structure fields, TH02 uses two-element arrays indexed with the active VRAM page. Same thing, and such a pattern even helps during RE since it's easy to spot once you know what to look for.
There's not much to criticize about the point popup system, except for maybe a landmine that causes sprite glitches when trying to display more than 99,990 points. Sadly, the final push in this delivery was rounded out by yet another piece of code at the opposite end of the quality spectrum. The particle and smear effects for Reimu's bomb animations consist almost entirely of assembly bloat, which would just be replaced with generic calls to the generic blitter in this game's future Anniversary Edition.

If I continue to decompile TH02 while avoiding the brick wall, items would be next, but they probably require two pushes. Next up, therefore: Integrating Stripe as an alternative payment provider into the order form. There have been at least three people who reported issues with PayPal, and Stripe has been working much better in tests. In the meantime, here's a temporary Stripe order link for everyone. This one is not connected to the cap yet, so please make sure to stay within whatever value is currently shown on the front page – I will treat any excess money as donations. If there's some time left afterward, I might also add some small improvements to the TH01 Anniversary Edition.

⮜ Blog