- 📝 Posted:
- 🚚 Summary of:
- P0235, P0236, P0237
- ⌨ Commits:
e7a9262...62c4b7f
,62c4b7f...7fa9038
,7fa9038...c5e51e6
- 💰 Funded by:
- Ember2528, Yanga
- 🏷 Tags:
So, TH02! Being the only game whose main binary hadn't seen any dedicated attention ever, we get to start the TH02-related blog posts at the very beginning with the most foundational pieces of code. The stage tile system is the best place to start here: It not only blocks every entity that is rendered on top of these tiles, but is curiously placed right next to master.lib code in TH02, and would need to be separated out into its own translation unit before we can do the same with all the master.lib functions.
In late 2018, I already RE'd 📝 TH04's and TH05's stage tile implementation, but haven't properly documented it on this blog yet, so this post is also going to include the details that are unique to those games. On a high level, the stage tile system works identically in all three games:
- The tiles themselves are 16×16 pixels large, and a stage can use 100 of them at the same time.
- The optimal way of blitting tiles would involve VRAM-to-VRAM copies within the same page using the EGC, and that's exactly what the games do. All tiles are stored on both VRAM pages within the rightmost 64×400 pixels of the screen just right next to the HUD, and you only don't see them because the games cover the same area in text RAM with black cells:
- To reduce the memory required for a map, tiles are arranged into fixed vertical sections of a game-specific constant size.
- The actual stage map then is simply a list of these tile sections, ordered from the start/bottom to the top/end.
- Any manipulation of specific tiles within the fixed tile sections has to
be hardcoded. An example can be found right in Stage 1, where the Shrine
Tank leaves track marks on the tiles it appears to drive over:
This video also shows off the two issues with Touhou's first-ever midboss: The replaced tiles are rendered below the midboss during their first 4 frames, and maybe ZUN should have stopped the tile replacements one row before the timeout. The first one is clearly a bug, but it's not so clear-cut with the second one. I'd need to look at the code to tell for sure whether it's a quirk or a bug.
The differences between the three games can best be summarized in a table:
TH02 | TH04 | TH05 | |
---|---|---|---|
Tile image file extension | .MPN | ||
Tile section format | .MAP | ||
Tile section order defined as part of | .DT1 | .STD | |
Tile section index format | 0-based ID | 0-based ID × 2 | Tile image index format | Index between 0 and 100, 1 byte | VRAM offset in tile source area, 2 bytes |
Scroll speed control | Hardcoded | Part of the .STD format, defined per referenced tile section | |
Redraw granularity | Full tiles (16×16) | Half tiles (16×8) | |
Rows per tile section | 8 | 5 | |
Maximum number of tile sections | 16 | 32 | |
Lowest number of tile sections used | 5 (Stage 3 / Extra) | 8 (Stage 6) | 11 (Stage 2 / 4) |
Highest number of tile sections used | 13 (Stage 4) | 19 (Extra) | 24 (Stage 3) |
Maximum length of a map | 320 sections (static buffer) | 256 sections (format limitation) | |
Shortest map | 14 sections (Stage 5) | 20 sections (Stage 5) | 15 sections (Stage 2) |
Longest map | 143 sections (Stage 4) | 95 sections (Stage 4) | 40 sections (Stage 1 / 4 / Extra) |
The most interesting part about stage tiles is probably the fact that some of the .MAP files contain unused tile sections. 👀 Many of these are empty, duplicates, or don't really make sense, but a few are unique, fit naturally into their respective stage, and might have been part of the map during development. In TH02, we can find three unused sections in Stage 5:
STAGE4.MAP
,
showing off three unused ones.
Compared to TH04 and TH05, TH02's stage tile code definitely feels like ZUN was just starting to understand how to pull off smooth vertical scrolling on a PC-98. As such, it comes with a few inefficiencies and suboptimal implementation choices:
- The redraw flag for each tile is stored in a 24×25
bool
array that does nothing with 7 of the 8 bits. - During bombs and the Stage 4, 5, and Extra bosses, the game disables the tile system to render more elaborate backgrounds, which require the playfield to be flood-filled with a single color on every frame. ZUN uses the GRCG's RMW mode rather than TDW mode for this, leaving almost half of the potential performance on the table for no reason. Literally, changing modes only involves changing a single constant.
- The scroll speed could theoretically be changed at any time. However, the function that scrolls in new stage tiles can only ever blit part of a single tile row during every call, so it's up to the caller to ensure that scrolling always ends up on an exact 16-pixel boundary. TH02 avoids this problem by keeping the scroll speed constant across a stage, using 2 pixels for Stage 4 and 1 pixel everywhere else.
- Since the scroll speed is given in pixels, the slowest speed would be 1 pixel per frame. To allow the even slower speeds seen in the final game, TH02 adds a separate scroll interval variable that only runs the scroll function every 𝑛th frame, effectively adding a prescaler to the scroll speed. In TH04 and TH05, the speed is specified as a Q12.4 value instead, allowing true fractional speeds at any multiple of 1/16 pixels. This also necessitated a fixed algorithm that correctly blits tile lines from two rows.
- Finally, we've got a few inconsistencies in the way the code handles the two VRAM pages, which cause a few unnecessary tiles to be rendered to just one of the two pages. Mentioning that just in case someone tries to play this game with a fully cleared text RAM and wonders where the flickering tiles come from.
Even though this was ZUN's first attempt at scrolling tiles, he already saw
it fit to write most of the code in assembly. This was probably a reaction
to all of TH01's performance issues, and the frame rate reduction
workarounds he implemented to keep the game from slowing down too much in
busy places. "If TH01 was all C++ and slow, TH02 better contain more ASM
code, and then it will be fast, right?"
Another reason for going with ASM might be found in the kind of
documentation that may have been available to ZUN. Last year, the PC-98
community discovered and scanned two new game programming tutorial books
from 1991 (1, 2).
Their example code is not only entirely written in assembly, but restricts
itself to the bare minimum of x86 instructions that were available on the
8086 CPU used by the original PC-9801 model 9 years earlier. Such code is
not only suboptimal
on the 486, but can often be actually worse than what your C++
compiler would generate. TH02 is where the trend of bad hand-written ASM
code started, and it
📝 only intensified in ZUN's later games. So,
don't copy code from these books unless you absolutely want to target the
earlier 8086 and 286 models. Which,
📝 as we've gathered from the recent blitting benchmark results,
are not all too common among current real-hardware owners.
That said, all that ASM code really only impacts readability and
maintainability. Apart from the aforementioned issues, the algorithms
themselves are mostly fine – especially since most EGC and GRCG operations
are decently batched this time around, in contrast to TH01.
Luckily, the tile functions merely use inline assembly within a
typical C function and can therefore be at least part of a C++ source file,
even if the result is pretty ugly. This time, we can actually be sure that
they weren't written directly in a .ASM file, because they feature x86
instruction encodings that can only be generated with Turbo C++ 4.0J's
inline assembler, not with TASM. The same can't unfortunately be said about
the following function in the same segment, which marks the tiles covered by
the spark sprites for redrawing. In this one, it took just one dumb hand-written ASM
inconsistency in the function's epilog to make the entire function
undecompilable.
The standard x86 instruction sequence to set up a stack frame in a function prolog looks like this:PUSH BP
MOV BP, SP
SUB SP, ?? ; if the function needs the stack for local variables
ENTER
instruction. That one
is two bytes smaller, but much slower on every x86 CPU except for the 80186
where it was introduced.BP
and SP
remain identical, and a single POP BP
is all that's needed in
the epilog to tear down such a stack frame before returning from the
function. Otherwise, the function needs an additional MOV SP,
BP
instruction to pop all local variables. With x86 being the helpful
CISC architecture that it is, the 80186 also introduced the
LEAVE
instruction to perform both tasks. Unlike
ENTER
, this single instruction
is faster than the raw two instructions on a lot of x86 CPUs (and
even current ones!), and it's always smaller, taking up just 1 byte instead
of 3.
So what if you use LEAVE
even if your function
doesn't use local variables? The fact that the
instruction first does the equivalent of MOV SP, BP
doesn't
matter if these registers are identical, and who cares about the additional
CPU cycles of LEAVE
compared to just POP BP
,
right? So that's definitely something you could theoretically do, but
not something that any compiler would ever generate.
And so, TH02 MAIN.EXE
decompilation already hits the first
brick wall after two pushes. Awesome! Theoretically,
we could slowly mash through this wall using the 📝 code generator. But having such an inconsistency in the
function epilog would mean that we'd have to keep Turbo C++ 4.0J from
emitting any epilog or prolog code so that we can write our
own. This means that we'd once again have to hide any use of the
SI
and DI
registers from the compiler… and doing
that requires code generation macros for 22 of the 49 instructions of
the function in question, almost none of which we currently have. So, this
gets quite silly quite fast, especially if we only need to do it
for one single byte.
Instead, wouldn't it be much better if we had a separate build step between compile and link time that allowed us to replicate mistakes like these by just patching the compiled .OBJ files? These files still contain the names of exported functions for linking, which would allow us to look up the code of a function in a robust manner, navigate to specific instructions using a disassembler, replace them, and write the modified .OBJ back to disk before linking. Such a system could then naturally expand to cover all other decompilation issues, culminating in a full-on optimizer that could even recreate ZUN's self-modifying code. At that point, we would have sealed away all of ZUN's ugly ASM code within a separate build step, and could finally decompile everything into readable C++.
Pulling that off would require a significant tooling investment though.
Patching that one byte in TH02's spark invalidation function could be done
within 1 or 2 pushes, but that's just one issue, and we currently have 32
other .ASM files with undecompilable code. Also, note that this is
fundamentally different from what we're doing with the
debloated
branch and the Anniversary Editions. Mistake patching
would purely be about having readable code on master
that
compiles into ZUN's exact binaries, without fixing weird
code. The Anniversary Editions go much further and rewrite such code in
a much more fundamental way, improving it further than mistake patching ever
could.
Right now, the Anniversary Editions seem much more
popular, which suggests that people just want 100% RE as fast as
possible so that I can start working on them. In that case, why bother with
such undecompilable functions, and not just leave them in raw and unreadable
x86 opcode form if necessary… But let's first
see how much backer support there actually is for mistake patching before
falling back on that.
The best part though: Once we've made a decision and then covered TH02's spark and particle systems, that was it, and we will have already RE'd all ZUN-written PC-98-specific blitting code in this game. Every further sprite or shape is rendered via master.lib, and is thus decently abstracted. Guess I'll need to update 📝 the assessment of which PC-98 Touhou game is the easiest to port, because it sure isn't TH01, as we've seen with all the work required for the first Anniversary Edition build.
Until then, there are still enough parts of the game that don't use any of
the remaining few functions in the _TEXT
segment. Previously, I
mentioned in the 📝 status overview blog post
that TH02 had a seemingly weird sprite system, but the spark and point popup
() structures showed that the game just
stores the current and previous position of its entities in a slightly
different way compared to the rest of PC-98 Touhou. Instead of having
dedicated structure fields, TH02 uses two-element arrays indexed with the
active VRAM page. Same thing, and such a pattern even helps during RE since
it's easy to spot once you know what to look for.
There's not much to criticize about the point popup system, except for maybe
a landmine that causes sprite glitches when trying to display more than
99,990 points. Sadly, the final push in this delivery was rounded out by yet
another piece of code at the opposite end of the quality spectrum. The
particle and smear effects for Reimu's bomb animations consist almost
entirely of assembly bloat, which would just be replaced with generic calls
to the generic blitter in this game's future Anniversary Edition.
If I continue to decompile TH02 while avoiding the brick wall, items would be next, but they probably require two pushes. Next up, therefore: Integrating Stripe as an alternative payment provider into the order form. There have been at least three people who reported issues with PayPal, and Stripe has been working much better in tests. In the meantime, here's a temporary Stripe order link for everyone. This one is not connected to the cap yet, so please make sure to stay within whatever value is currently shown on the front page – I will treat any excess money as donations. If there's some time left afterward, I might also add some small improvements to the TH01 Anniversary Edition.