- 📝 Posted:
- 🚚 Summary of:
- P0235, P0236, P0237
- ⌨ Commits:
- 💰 Funded by:
- Ember2528, Yanga
- 🏷 Tags:
- rec98+ th02+ th04+ th05+ stage- blitting+ micro-optimization- unused+ tcc+
So, TH02! Being the only game whose main binary hadn't seen any dedicated attention ever, we get to start the TH02-related blog posts at the very beginning with the most foundational pieces of code. The stage tile system is the best place to start here: It not only blocks every entity that is rendered on top of these tiles, but is curiously placed right next to master.lib code in TH02, and would need to be separated out into its own translation unit before we can do the same with all the master.lib functions.
In late 2018, I already RE'd 📝 TH04's and TH05's stage tile implementation, but haven't properly documented it on this blog yet, so this post is also going to include the details that are unique to those games. On a high level, the stage tile system works identically in all three games:
- The tiles themselves are 16×16 pixels large, and a stage can use 100 of them at the same time.
- The optimal way of blitting tiles would involve VRAM-to-VRAM copies within the same page using the EGC, and that's exactly what the games do. All tiles are stored on both VRAM pages within the rightmost 64×400 pixels of the screen just right next to the HUD, and you only don't see them because the games cover the same area in text RAM with black cells:
- To reduce the memory required for a map, tiles are arranged into fixed vertical sections of a game-specific constant size.
- The actual stage map then is simply a list of these tile sections, ordered from the start/bottom to the top/end.
- Any manipulation of specific tiles within the fixed tile sections has to be hardcoded. An example can be found right in Stage 1, where the Shrine Tank leaves track marks on the tiles it appears to drive over:
The differences between the three games can best be summarized in a table:
|Tile image file extension||.MPN|
|Tile section format||.MAP|
|Tile section order defined as part of||.DT1||.STD|
|Tile section index format||0-based ID||0-based ID × 2||Tile image index format||Index between 0 and 100, 1 byte||VRAM offset in tile source area, 2 bytes|
|Scroll speed control||Hardcoded||Part of the .STD format, defined per referenced tile section|
|Redraw granularity||Full tiles (16×16)||Half tiles (16×8)|
|Rows per tile section||8||5|
|Maximum number of tile sections||16||32|
|Lowest number of tile sections used||5 (Stage 3 / Extra)||8 (Stage 6)||11 (Stage 2 / 4)|
|Highest number of tile sections used||13 (Stage 4)||19 (Extra)||24 (Stage 3)|
|Maximum length of a map||320 sections (static buffer)||256 sections (format limitation)|
|Shortest map||14 sections (Stage 5)||20 sections (Stage 5)||15 sections (Stage 2)|
|Longest map||143 sections (Stage 4)||95 sections (Stage 4)||40 sections (Stage 1 / 4 / Extra)|
The most interesting part about stage tiles is probably the fact that some of the .MAP files contain unused tile sections. 👀 Many of these are empty, duplicates, or don't really make sense, but a few are unique, fit naturally into their respective stage, and might have been part of the map during development. In TH02, we can find three unused sections in Stage 5:These unused tile sections are much more common in the later games though, where we can find them in TH04's Stage 3, 4, and 5, and TH05's Stage 1, 2, and 4. I'll document those once I get to finalize the tile rendering code of these games, to leave some more content for that blog post. TH04/TH05 tile code would be quite an effective investment of your money in general, as most of it is identical across both games. Or how about going for a full-on PC-98 Touhou map viewer and editor GUI?
Compared to TH04 and TH05, TH02's stage tile code definitely feels like ZUN was just starting to understand how to pull off smooth vertical scrolling on a PC-98. As such, it comes with a few inefficiencies and suboptimal implementation choices:
- The redraw flag for each tile is stored in a 24×25
boolarray that does nothing with 7 of the 8 bits.
- During bombs and the Stage 4, 5, and Extra bosses, the game disables the tile system to render more elaborate backgrounds, which require the playfield to be flood-filled with a single color on every frame. ZUN uses the GRCG's RMW mode rather than TDW mode for this, leaving almost half of the potential performance on the table for no reason. Literally, changing modes only involves changing a single constant.
- The scroll speed could theoretically be changed at any time. However, the function that scrolls in new stage tiles can only ever blit part of a single tile row during every call, so it's up to the caller to ensure that scrolling always ends up on an exact 16-pixel boundary. TH02 avoids this problem by keeping the scroll speed constant across a stage, using 2 pixels for Stage 4 and 1 pixel everywhere else.
- Since the scroll speed is given in pixels, the slowest speed would be 1 pixel per frame. To allow the even slower speeds seen in the final game, TH02 adds a separate scroll interval variable that only runs the scroll function every 𝑛th frame, effectively adding a prescaler to the scroll speed. In TH04 and TH05, the speed is specified as a Q12.4 value instead, allowing true fractional speeds at any multiple of 1/16 pixels. This also necessitated a fixed algorithm that correctly blits tile lines from two rows.
- Finally, we've got a few inconsistencies in the way the code handles the two VRAM pages, which cause a few unnecessary tiles to be rendered to just one of the two pages. Mentioning that just in case someone tries to play this game with a fully cleared text RAM and wonders where the flickering tiles come from.
Even though this was ZUN's first attempt at scrolling tiles, he already saw
it fit to write most of the code in assembly. This was probably a reaction
to all of TH01's performance issues, and the frame rate reduction
workarounds he implemented to keep the game from slowing down too much in
busy places. "If TH01 was all C++ and slow, TH02 better contain more ASM
code, and then it will be fast, right?"
Another reason for going with ASM might be found in the kind of documentation that may have been available to ZUN. Last year, the PC-98 community discovered and scanned two new game programming tutorial books from 1991 (1, 2). Their example code is not only entirely written in assembly, but restricts itself to the bare minimum of x86 instructions that were available on the 8086 CPU used by the original PC-9801 model 9 years earlier. Such code is not only suboptimal on the 486, but can often be actually worse than what your C++ compiler would generate. TH02 is where the trend of bad hand-written ASM code started, and it 📝 only intensified in ZUN's later games. So, don't copy code from these books unless you absolutely want to target the earlier 8086 and 286 models. Which, 📝 as we've gathered from the recent blitting benchmark results, are not all too common among current real-hardware owners.
That said, all that ASM code really only impacts readability and maintainability. Apart from the aforementioned issues, the algorithms themselves are mostly fine – especially since most EGC and GRCG operations are decently batched this time around, in contrast to TH01.
Luckily, the tile functions merely use inline assembly within a
typical C function and can therefore be at least part of a C++ source file,
even if the result is pretty ugly. This time, we can actually be sure that
they weren't written directly in a .ASM file, because they feature x86
instruction encodings that can only be generated with Turbo C++ 4.0J's
inline assembler, not with TASM. The same can't unfortunately be said about
the following function in the same segment, which marks the tiles covered by
the spark sprites for redrawing. In this one, it took just one dumb hand-written ASM
inconsistency in the function's epilog to make the entire function
The standard x86 instruction sequence to set up a stack frame in a function prolog looks like this: In functions without local variables,
remain identical, and a single
POP BP is all that's needed in
the epilog to tear down such a stack frame before returning from the
function. Otherwise, the function needs an additional
BP instruction to pop all local variables. With x86 being the helpful
CISC architecture that it is, the 80186 also introduced the
LEAVE instruction to perform both tasks. Unlike
ENTER, this single instruction
is faster than the raw two instructions on a lot of x86 CPUs (and
even current ones!), and it's always smaller, taking up just 1 byte instead
So what if you use
LEAVE even if your function
doesn't use local variables? The fact that the
instruction first does the equivalent of
MOV SP, BP doesn't
matter if these registers are identical, and who cares about the additional
CPU cycles of
LEAVE compared to just
right? So that's definitely something you could theoretically do, but
not something that any compiler would ever generate.
And so, TH02
MAIN.EXE decompilation already hits the first
brick wall after two pushes. Awesome! Theoretically,
we could slowly mash through this wall using the 📝 code generator. But having such an inconsistency in the
function epilog would mean that we'd have to keep Turbo C++ 4.0J from
emitting any epilog or prolog code so that we can write our
own. This means that we'd once again have to hide any use of the
DI registers from the compiler… and doing
that requires code generation macros for 22 of the 49 instructions of
the function in question, almost none of which we currently have. So, this
gets quite silly quite fast, especially if we only need to do it
for one single byte.
Instead, wouldn't it be much better if we had a separate build step between compile and link time that allowed us to replicate mistakes like these by just patching the compiled .OBJ files? These files still contain the names of exported functions for linking, which would allow us to look up the code of a function in a robust manner, navigate to specific instructions using a disassembler, replace them, and write the modified .OBJ back to disk before linking. Such a system could then naturally expand to cover all other decompilation issues, culminating in a full-on optimizer that could even recreate ZUN's self-modifying code. At that point, we would have sealed away all of ZUN's ugly ASM code within a separate build step, and could finally decompile everything into readable C++.
Pulling that off would require a significant tooling investment though.
Patching that one byte in TH02's spark invalidation function could be done
within 1 or 2 pushes, but that's just one issue, and we currently have 32
other .ASM files with undecompilable code. Also, note that this is
fundamentally different from what we're doing with the
debloated branch and the Anniversary Editions. Mistake patching
would purely be about having readable code on
compiles into ZUN's exact binaries, without fixing weird
code. The Anniversary Editions go much further and rewrite such code in
a much more fundamental way, improving it further than mistake patching ever
Right now, the Anniversary Editions seem much more popular, which suggests that people just want 100% RE as fast as possible so that I can start working on them. In that case, why bother with such undecompilable functions, and not just leave them in raw and unreadable x86 opcode form if necessary… But let's first see how much backer support there actually is for mistake patching before falling back on that.
The best part though: Once we've made a decision and then covered TH02's spark and particle systems, that was it, and we will have already RE'd all ZUN-written PC-98-specific blitting code in this game. Every further sprite or shape is rendered via master.lib, and is thus decently abstracted. Guess I'll need to update 📝 the assessment of which PC-98 Touhou game is the easiest to port, because it sure isn't TH01, as we've seen with all the work required for the first Anniversary Edition build.
Until then, there are still enough parts of the game that don't use any of
the remaining few functions in the
_TEXT segment. Previously, I
mentioned in the 📝 status overview blog post
that TH02 had a seemingly weird sprite system, but the spark and point popup
() structures showed that the game just
stores the current and previous position of its entities in a slightly
different way compared to the rest of PC-98 Touhou. Instead of having
dedicated structure fields, TH02 uses two-element arrays indexed with the
active VRAM page. Same thing, and such a pattern even helps during RE since
it's easy to spot once you know what to look for.
There's not much to criticize about the point popup system, except for maybe a landmine that causes sprite glitches when trying to display more than 99,990 points. Sadly, the final push in this delivery was rounded out by yet another piece of code at the opposite end of the quality spectrum. The particle and smear effects for Reimu's bomb animations consist almost entirely of assembly bloat, which would just be replaced with generic calls to the generic blitter in this game's future Anniversary Edition.
If I continue to decompile TH02 while avoiding the brick wall, items would be next, but they probably require two pushes. Next up, therefore: Integrating Stripe as an alternative payment provider into the order form. There have been at least three people who reported issues with PayPal, and Stripe has been working much better in tests. In the meantime, here's a temporary Stripe order link for everyone. This one is not connected to the cap yet, so please make sure to stay within whatever value is currently shown on the front page – I will treat any excess money as donations. If there's some time left afterward, I might also add some small improvements to the TH01 Anniversary Edition.