So, TH02! Being the only game whose main binary hadn't seen any dedicated
attention ever, we get to start the TH02-related blog posts at the very
beginning with the most foundational pieces of code. The stage tile system
is the best place to start here: It not only blocks every entity that is
rendered on top of these tiles, but is curiously placed right next to
master.lib code in TH02, and would need to be separated out into its own
translation unit before we can do the same with all the master.lib
functions.
In late 2018, I had already RE'd
📝 TH04's and TH05's stage tile implementation, but hadn't properly documented it on this
blog yet, so this post is also going to include the details that are unique
to those games. On a high level, the stage tile system works identically in
all three games:
The tiles themselves are 16×16 pixels large, and a stage can use up to 100 of
them at the same time.
The optimal way of blitting tiles would involve VRAM-to-VRAM copies
within the same page using the EGC, and that's exactly what the games do.
All tiles are stored on both VRAM pages within the rightmost 64×400 pixels
of the screen, right next to the HUD, and the only reason you don't see them
is that the games cover the same area in text RAM with black cells:
The initial screen of TH02's Stage 1, with the tile source
area uncovered by filling the same area in text RAM with transparent
cells instead of black ones. In TH02, this also reveals how the tile
area ends with a bunch of glitch tiles, tinted blue in the image. These
are the result of ZUN unconditionally blitting 100 tile images every
time, regardless of how many are actually contained in an
.MPN file.
These glitch tiles are another good example of a ZUN
landmine. Their appearance is the result of reading heap memory
outside allocated boundaries, which can easily cause segmentation faults
when porting the game to a system with virtual memory. Therefore, these
would not just be removed in this game's Anniversary Edition, but on the
more conservative debloated branch as well. Since the game
never uses these tiles and you can't observe them unless you manipulate
text RAM from outside the confines of the game, it's not a bug
according to our definition.
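In a sketch with hypothetical names (only the fixed loop count reflects the actual code), the loading code effectively does this:
for(int i = 0; i < 100; i++) {
    // The .MPN buffer only holds as many 16×16 tile images as the file
    // declares. The last iterations of this loop therefore read whatever
    // heap memory happens to follow that allocation.
    tile_put_to_source_area(i, &mpn_buffer[i * TILE_IMAGE_BYTES]);
}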
To reduce the memory required for a map, tiles are arranged into fixed
vertical sections of a game-specific constant size.
The 6 24×8-tile sections defined in TH02's STAGE0.MAP, shown in
reverse order compared to how they're stored in the file. Note the
duplicated row at the top of the final section: The boss fight starts
once the game has scrolled the last full row of tiles onto the top of the
screen, not the playfield. But since the PC-98 text chip
covers the top tile row of the screen with black cells, this final row
is never visible, which effectively reduces a map's final tile section
to 7 rows rather than 8.
The actual stage map then is simply a list of these tile sections,
ordered from the start/bottom to the top/end.
Any manipulation of specific tiles within the fixed tile sections has to
be hardcoded. An example can be found right in Stage 1, where the Shrine
Tank leaves track marks on the tiles it appears to drive over:
This video also shows off the two issues with Touhou's first-ever
midboss: The replaced tiles are rendered below the midboss
during their first 4 frames, and maybe ZUN should have stopped the
tile replacements one row before the timeout. The first one is
clearly a bug, but it's not so clear-cut with the second one. I'd
need to look at the code to tell for sure whether it's a quirk or a
bug.
The differences between the three games can best be summarized in a table:
Tile image file extension: .MPN (all three games)
Tile section format: .MAP (all three games)
Tile section order defined as part of: .DT1 (TH02); .STD (TH04/TH05)
Tile section index format: 0-based ID (TH02); 0-based ID × 2 (TH04/TH05)
Tile image index format: index between 0 and 100, 1 byte (TH02); VRAM offset in tile source area, 2 bytes (TH04/TH05)
Scroll speed control: hardcoded (TH02); part of the .STD format, defined per referenced tile section (TH04/TH05)
Redraw granularity: full tiles, 16×16 (TH02); half tiles, 16×8 (TH04/TH05)
Rows per tile section: 8 (TH02); 5 (TH04/TH05)
Maximum number of tile sections: 16 (TH02); 32 (TH04/TH05)
Lowest number of tile sections used: 5 (TH02, Stage 3 / Extra); 8 (TH04, Stage 6); 11 (TH05, Stage 2 / 4)
Highest number of tile sections used: 13 (TH02, Stage 4); 19 (TH04, Extra); 24 (TH05, Stage 3)
Maximum length of a map: 320 sections (TH02, static buffer); 256 sections (TH04/TH05, format limitation)
Shortest map: 14 sections (TH02, Stage 5); 20 sections (TH04, Stage 5); 15 sections (TH05, Stage 2)
Longest map: 143 sections (TH02, Stage 4); 95 sections (TH04, Stage 4); 40 sections (TH05, Stage 1 / 4 / Extra)
The most interesting part about stage tiles is probably the fact that some
of the .MAP files contain unused tile sections. 👀 Many
of these are empty, duplicates, or don't really make sense, but a few
are unique, fit naturally into their respective stage, and might have
been part of the map during development. In TH02, we can find three unused
sections in Stage 5:
The non-empty tile sections defined in TH02's STAGE4.MAP,
showing off three unused ones.
These unused tile sections are much more common in the later games though,
where we can find them in TH04's Stage 3, 4, and 5, and TH05's Stage 1, 2,
and 4. I'll document those once I get to finalize the tile rendering code of
these games, to leave some more content for that blog post. TH04/TH05 tile
code would be quite an effective investment of your money in general, as
most of it is identical across both games. Or how about going for a full-on
PC-98 Touhou map viewer and editor GUI?
Compared to TH04 and TH05, TH02's stage tile code definitely feels like ZUN
was just starting to understand how to pull off smooth vertical scrolling on
a PC-98. As such, it comes with a few inefficiencies and suboptimal
implementation choices:
The redraw flag for each tile is stored in a 24×25 bool
array that does nothing with 7 of the 8 bits.
During bombs and the Stage 4, 5, and Extra bosses, the game disables the
tile system to render more elaborate backgrounds, which require the
playfield to be flood-filled with a single color on every frame. ZUN uses
the GRCG's RMW mode rather than TDW mode for this, leaving almost half of
the potential performance on the table for no reason. Literally,
changing modes only involves changing a single constant.
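On the raw I/O port level (regardless of which macro or function the game actually routes this through), that single constant is the mode byte written to the GRCG's mode register:
outportb(0x7C, 0xC0);  // GRCG on, RMW mode: what the game uses
outportb(0x7C, 0x80);  // GRCG on, TDW mode: write-only, same result for a
                       // full-byte solid fill, and almost twice as fast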
The scroll speed could theoretically be changed at any time. However,
the function that scrolls in new stage tiles can only ever blit part of a
single tile row during every call, so it's up to the caller to ensure
that scrolling always ends up on an exact 16-pixel boundary. TH02 avoids
this problem by keeping the scroll speed constant across a stage, using 2
pixels for Stage 4 and 1 pixel everywhere else.
Since the scroll speed is given in pixels, the slowest speed would be 1
pixel per frame. To allow the even slower speeds seen in the final game,
TH02 adds a separate scroll interval variable that only runs the
scroll function every 𝑛th frame, effectively adding a prescaler to the
scroll speed. In TH04 and TH05, the speed is specified as a Q12.4 value
instead, allowing true fractional speeds at any multiple of
1/16 pixels. This also necessitated a fixed algorithm
that correctly blits tile lines from two rows.
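Roughly, and with hypothetical variable and function names, the difference looks like this:
// TH02: whole-pixel speed, slowed down further by a separate prescaler
if((frame % scroll_interval) == 0) {
    scroll_y += scroll_speed;             // 1 pixel, or 2 in Stage 4
}

// TH04/TH05: Q12.4 speed, accumulated every frame
scroll_y_q12_4 += scroll_speed_q12_4;     // e.g. 0x0018 = 1.5 pixels
hardware_scroll_to(scroll_y_q12_4 >> 4);  // whole-pixel part (hypothetical helper)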
Finally, we've got a few inconsistencies in the way the code handles the
two VRAM pages, which cause a few unnecessary tiles to be rendered to just
one of the two pages. I'm mentioning this just in case someone tries to play
this game with a fully cleared text RAM and wonders where the flickering
tiles come from.
Even though this was ZUN's first attempt at scrolling tiles, he already saw
it fit to write most of the code in assembly. This was probably a reaction
to all of TH01's performance issues, and the frame rate reduction
workarounds he implemented to keep the game from slowing down too much in
busy places. "If TH01 was all C++ and slow, TH02 better contain more ASM
code, and then it will be fast, right?"
Another reason for going with ASM might be found in the kind of
documentation that may have been available to ZUN. Last year, the PC-98
community discovered and scanned two new game programming tutorial books
from 1991 (1, 2).
Their example code is not only entirely written in assembly, but restricts
itself to the bare minimum of x86 instructions that were available on the
8086 CPU used by the original PC-9801 model 9 years earlier. Such code is
not only suboptimal
on the 486, but can often be actually worse than what your C++
compiler would generate. TH02 is where the trend of bad hand-written ASM
code started, and it
📝 only intensified in ZUN's later games. So,
don't copy code from these books unless you absolutely want to target the
earlier 8086 and 286 models. Which,
📝 as we've gathered from the recent blitting benchmark results,
are not all too common among current real-hardware owners.
That said, all that ASM code really only impacts readability and
maintainability. Apart from the aforementioned issues, the algorithms
themselves are mostly fine – especially since most EGC and GRCG operations
are decently batched this time around, in contrast to TH01.
Luckily, the tile functions merely use inline assembly within a
typical C function and can therefore be at least part of a C++ source file,
even if the result is pretty ugly. This time, we can actually be sure that
they weren't written directly in a .ASM file, because they feature x86
instruction encodings that can only be generated with Turbo C++ 4.0J's
inline assembler, not with TASM. Unfortunately, the same can't be said about
the following function in the same segment, which marks the tiles covered by
the spark sprites for redrawing. In this one, it took just one dumb hand-written ASM
inconsistency in the function's epilog to make the entire function
undecompilable.
The standard x86 instruction sequence to set up a stack frame in a function prolog looks like this:
PUSH BP
MOV BP, SP
SUB SP, ?? ; if the function needs the stack for local variables
When compiling without optimizations, Turbo C++ 4.0J will
replace this sequence with a single ENTER instruction. That one
is two bytes smaller, but much slower on every x86 CPU except for the 80186
where it was introduced.
In functions without local variables, BP and SP
remain identical, and a single POP BP is all that's needed in
the epilog to tear down such a stack frame before returning from the
function. Otherwise, the function needs an additional MOV SP,
BP instruction to pop all local variables. With x86 being the helpful
CISC architecture that it is, the 80186 also introduced the
LEAVE instruction to perform both tasks. Unlike
ENTER, this single instruction
is faster than the raw two instructions on a lot of x86 CPUs (and
even current ones!), and it's always smaller, taking up just 1 byte instead
of 3. So what if you use LEAVE even if your function
doesn't use local variables? The fact that the
instruction first does the equivalent of MOV SP, BP doesn't
matter if these registers are identical, and who cares about the additional
CPU cycles of LEAVE compared to just POP BP,
right? So that's definitely something you could theoretically do, but
not something that any compiler would ever generate.
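For reference, these are the epilog variants in question, together with their encoded sizes:
POP BP          ; 1 byte; epilog of a function without local variables
RET

MOV BP, SP      ; oops, wrong direction; see below
MOV SP, BP      ; 2 bytes
POP BP          ; 1 byte; 3 bytes in total, for a function with local variables
RET

LEAVE           ; 1 byte; the 80186+ replacement for the MOV+POP pair above
RET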
And so, TH02 MAIN.EXE decompilation already hits the first
brick wall after two pushes. Awesome! Theoretically,
we could slowly mash through this wall using the 📝 code generator. But having such an inconsistency in the
function epilog would mean that we'd have to keep Turbo C++ 4.0J from
emitting any epilog or prolog code so that we can write our
own. This means that we'd once again have to hide any use of the
SI and DI registers from the compiler… and doing
that requires code generation macros for 22 of the 49 instructions of
the function in question, almost none of which we currently have. So, this
gets quite silly quite fast, especially if we only need to do it
for one single byte.
Instead, wouldn't it be much better if we had a separate build step between
compile and link time that allowed us to replicate mistakes like these by
just patching the compiled .OBJ files? These files still contain the names
of exported functions for linking, which would allow us to look up the code
of a function in a robust manner, navigate to specific instructions using a
disassembler, replace them, and write the modified .OBJ back to disk before
linking. Such a system could then naturally expand to cover all other
decompilation issues, culminating in a full-on optimizer that could even
recreate ZUN's self-modifying code. At that point, we would have sealed away
all of ZUN's ugly ASM code within a separate build step, and could finally
decompile everything into readable C++.
Pulling that off would require a significant tooling investment though.
Patching that one byte in TH02's spark invalidation function could be done
within 1 or 2 pushes, but that's just one issue, and we currently have 32
other .ASM files with undecompilable code. Also, note that this is
fundamentally different from what we're doing with the
debloated branch and the Anniversary Editions. Mistake patching
would purely be about having readable code on master that
compiles into ZUN's exact binaries, without fixing weird
code. The Anniversary Editions go much further and rewrite such code in
a much more fundamental way, improving it further than mistake patching ever
could.
Right now, the Anniversary Editions seem much more
popular, which suggests that people just want 100% RE as fast as
possible so that I can start working on them. In that case, why bother with
such undecompilable functions, and not just leave them in raw and unreadable
x86 opcode form if necessary… But let's first
see how much backer support there actually is for mistake patching before
falling back on that.
The best part though: Once we've made a decision and then covered TH02's
spark and particle systems, that will be it: we will have RE'd
all ZUN-written, PC-98-specific blitting code in this game. Every other
sprite or shape is rendered via master.lib, and is thus decently abstracted.
Guess I'll need to update
📝 the assessment of which PC-98 Touhou game is the easiest to port,
because it sure isn't TH01, as we've seen with all the work required for the first Anniversary Edition build.
Until then, there are still enough parts of the game that don't use any of
the remaining few functions in the _TEXT segment. Previously, I
mentioned in the 📝 status overview blog post
that TH02 had a seemingly weird sprite system, but the spark and point popup
structures showed that the game just
stores the current and previous position of its entities in a slightly
different way compared to the rest of PC-98 Touhou. Instead of having
dedicated structure fields, TH02 uses two-element arrays indexed with the
active VRAM page. Same thing, and such a pattern even helps during RE since
it's easy to spot once you know what to look for.
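In other words, and with hypothetical names, the difference is merely:
// TH04/TH05: dedicated fields for the current and previous position
struct { int cur_x, cur_y, prev_x, prev_y; } a;

// TH02: two-element arrays, indexed with the currently active VRAM page
struct { int x[2], y[2]; } b;
// b.x[vram_page] is this frame's position; b.x[1 - vram_page] is the one
// that's still shown on the other page, presumably kept around for unblitting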
There's not much to criticize about the point popup system, except for maybe
a landmine that causes sprite glitches when trying to display more than
99,990 points. Sadly, the final push in this delivery was rounded out by yet
another piece of code at the opposite end of the quality spectrum. The
particle and smear effects for Reimu's bomb animations consist almost
entirely of assembly bloat, which would simply be replaced with calls to the
generic blitter in this game's future Anniversary Edition.
If I continue to decompile TH02 while avoiding the brick wall, items would
be next, but they probably require two pushes. Next up, therefore:
Integrating Stripe as an alternative payment provider into the order form.
There have been at least three people who reported issues with PayPal, and
Stripe has been working much better in tests. In the meantime, here's a temporary Stripe
order link for everyone. This one is not connected to the cap yet, so
please make sure to stay within whatever value is currently shown on the
front page – I will treat any excess money as donations.
If there's some time left afterward, I might
also add some small improvements to the TH01 Anniversary Edition.
128 commits! Who would have thought that the ideal first release of the TH01
Anniversary Edition would involve so much maintenance, and raise so many
research questions? It's almost as if the real work only starts after
the 100% finalization mark… Once again, I had to steal some funding from the
reserved JIS trail word pushes to cover everything I wanted to research,
which means that the next pushes towards the
anything goal will repay this debt. Luckily, this doesn't affect any
immediate plans, as I'll be spending March with tasks that are already fully
funded.
So, how did this end up so massive? The list of things I originally set out
to do was pretty short:
Build entire game into single executable
Fix rendering issues in the one or two most important parts of the game
for a good initial impression
But even the first point already started with tons of little cleanup
commits. A part of them can definitely be blamed on the rush to hit the 100%
decompilation mark before the 25th anniversary last August.
However, all the structural changes that I can't commit to
master reveal how much of a mess the TH01 codebase actually
is.
Merging the executables is mainly difficult because of all the
inconsistencies between REIIDEN.EXE and FUUIN.EXE.
The worst parts can be found in the REYHI*.DAT format code and
the High Score menu, but the little things are just as annoying, like how
the current score is an unsigned variable in
REIIDEN.EXE, but a signed one in FUUIN.EXE.
If it takes me this long and this many
commits just to sort out all of these issues, it's no wonder that the only
thing I've seen being done with this codebase since TH01's 100%
decompilation was a single porting attempt that ended in a rather quick
ragequit.
So why are we merging the executables in preparation for the Anniversary
Edition, and not waiting with it until we start doing ports?
Distributing and updating one executable is cleaner than doing the same
with three, especially as long as installation will still involve manually
dropping the new binary into the game directory.
The Anniversary Edition won't be the only fork binary. We are already
going to start out with a separate DEBLOAT.EXE that contains
only the bloat removal changes without any bug fixes, and spaztron64
will probably redo his seizure-less edition. We don't want to clutter
the game directory with three binaries for each of these fork builds, and we
especially don't want to remember things like "oh, but this fork
only modifies REIIDEN.EXE"…
All forks should run side-by-side with the original game. During the
time I was maintaining thcrap, I've had countless bug reports of people
assuming that thcrap was
responsible for bugs that were present in the original game, and the
same is certain to happen with the Anniversary Edition. Separate binaries
will make it easier for everyone to check where these bugs came from.
Also, I'd like to make a point about how bloated the original
three-executable structure really is, since I've heard people defending it
as neat software architecture. Really, even in Real Mode where you typically
want to use as little of the 640 KiB of conventional memory as possible, you
don't want to split your game up like this.
The game actually is so bloated that the combined binary ended up
smaller than the original REIIDEN.EXE. If all you see are the
file sizes of the original three executables, this might look like a
pretty impressive feat. Like, how can we possibly get 407,812
bytes into less than 238,612 bytes, without using compression?
If you've ever looked at the linker map though, it's not at all surprising.
Excluding the aforementioned inconsistencies that are hard to quantify,
OP.EXE and FUUIN.EXE only feature 5,767 and 6,475
bytes of unique code and data, respectively. All other code in these
binaries is already part of REIIDEN.EXE, with more than half of
the size coming from the Borland C++ runtime. The single worst offender here
is the C++ exception handler that Borland forces
onto every non-.COM binary by default, which alone adds 20,512 bytes
even if your binary doesn't use C++ exceptions.
On a more hilarious note, this
single line is responsible for pulling another unnecessary 14,242 bytes
into OP.EXE and FUUIN.EXE. This floating-point
multiplication is completely unnecessary in this context because all
possible parameters are integers, but it's enough for Turbo C++ and TLINK to
pull in the entire x87 FPU emulation machinery. These two binaries don't
even draw lines, but since this function is part of the general
graphics code translation unit and contains other functions that these
binaries do need, TLINK links in the entire thing. Maybe multiple
executables aren't the best choice either if you use a linker that can't do
dead code elimination…
Since the 📝 Orb's physics do turn the entire
precision of a double variable into gameplay effects, it's not
feasible to ever get rid of all FPU code in TH01. The exception handler,
however, can
be removed, which easily brings the combined binary below the size of
the original REIIDEN.EXE. Compiling all code with a single set
of compiler optimization flags, including the more x86-friendly
pascal calling convention, then gets us a few more KB on top.
As does, of course, removing unused code: The only remaining purpose of
features such as 📝 resident palettes is to
potentially make porting more difficult for anyone who doesn't immediately
realize that nothing in the game uses these functions.
Technically, all unused code would be bloat, but for now, I'm keeping
the parts that may tell stories about the game's development history (such
as unused effects or the 📝 mouse cursor), or
that might help with debugging. Even with that in mind, I've only scratched
the surface when it comes to bloat removal, and the binary is only going to
get smaller from here. A lot smaller.
If only we could now start MDRV98 from this new combined binary, we wouldn't
need a second batch file either…
Which brings us to the first big research question of this delivery. Using
the C spawn() function works fine on this compiler, so
spawn("MDRV98.COM") would be all we need to do, right? Except
that the game crashes very soon after that subprocess returns.
So it's not going to be that easy if the spawned process is a TSR.
But why should this be a problem? Let's take a look at the DOS heap, and how
DOS lays out processes in conventional memory if we launch the game
regularly through GAME.BAT:
The rough layout of the DOS heap when launching TH01 from
GAME.BAT.
The batch file starts MDRV98 first, which will therefore end up below
the game in conventional memory. This is perfect for a TSR: The program can
resize itself arbitrarily before returning to DOS, and the rest of memory
will be left over for the game. If we assume such a layout, a DOS program
can implement a custom memory allocator in a very simple way, as it only has
to search for free memory in one direction – and this is exactly how Borland
implemented the C heap for functions like malloc() and
free(), and the C++ new and delete
operators.
But if we spawn MDRV98 after starting TH01, well…
MDRV98 will spawn in the next free memory location, allocate itself, return
to TH01… which suddenly finds its C heap blocked from growing. As a result,
the next big allocation will immediately fail with a rather misleading "out
of memory" error.
So, what can we do about this? Still in a bloat removal mindset, my gut
reaction was to just throw out Borland's C heap implementation, and replace
it with a very thin wrapper around the DOS heap as managed by INT 21h,
AH=48h/49h/4Ah. Like, why
did these DOS compilers even bother with a custom allocator in the first
place if DOS already comes with a perfectly fine native one? Using the
native allocator would completely erase the distinction between TSR memory
and game memory, and inherently allow the game to allocate beyond
MDRV98.
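Such a wrapper really can be as thin as it sounds. Here's a rough sketch using the C wrappers that Borland's <dos.h> already provides for these three syscalls (not necessarily the interface I actually ended up writing, and with proper error handling left out):
#include <dos.h>

// INT 21h, AH=48h: allocate. allocmem() takes a paragraph count and returns
// -1 on success; any other value means failure.
void far *dos_new(unsigned long bytes)
{
    unsigned seg;
    if(allocmem((unsigned)((bytes + 15) >> 4), &seg) != -1) {
        return 0;
    }
    return MK_FP(seg, 0);
}

// INT 21h, AH=49h: free
void dos_delete(void far *p)
{
    freemem(FP_SEG(p));
}

// INT 21h, AH=4Ah: resize in place; returns nonzero on success
int dos_resize(void far *p, unsigned long bytes)
{
    return (setblock(FP_SEG(p), (unsigned)((bytes + 15) >> 4)) == -1);
}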
I did in fact implement this, and noticed even more benefits:
While DOS uses 16 bytes rather than Borland's 4 bytes for the control
structure of each memory block, this larger size automatically aligns all
allocations to 16-byte boundaries. Therefore, all allocation addresses would
fit into 16-bit segment-only pointers rather than needing 32-bit
far ones. On the Borland heap, the 4-byte header further limits
regular far pointers to 65,532 bytes, forcing you into
expensive huge pointers for bigger allocations.
Debuggers in DOS emulators typically have features to show and manage
the DOS heap. No need for custom debugging code.
You can change the memory placement
strategy to allocate from the top of conventional memory down to the
bottom. This is how the games allocate their resident structures.
Ultimately though, the drawbacks became too significant. Most of them are
related to the PC-98 Touhou games only ever creating a single DOS
process, even though they contain multiple executables.
Switching executables is done via exec(), which resizes a
program's main allocation to match the new binary and then overwrites the
old program image with the new one. If you've ever wondered why DOSBox-X
only ever shows OP as the active process name in the title bar,
you now know why. As far as DOS is concerned, it's still the same
OP.EXE process rooted at the same segment, and
exec() doesn't bother rewriting the name either. Most
importantly though, this is how REIIDEN.EXE can launch into
another REIIDEN.EXE process even if there are less than 238,612
bytes free when exec() is called, and without consuming more
memory for every successive binary.
For now, ANNIV.EXE still re-exec()s itself at
every point where the original game did, as ZUN's original code really
depends on being reinitialized at boss and scene boundaries. The resulting
accidental semi-hot reloading is also a useful property to retain
during development.
So why is the DOS heap a bad idea for regular game allocation after all?
Sure, DOS automatically releases all memory associated with a process
during its termination. But since we keep running the same process until the
player quits out of the main menu, we lose the C heap's implicit cleanup on
exec(), and have to manually free all memory ourselves.
Since the binary can be larger after hot reloading, we in fact have
to allocate all regular memory using the last fit strategy.
Otherwise, exec() fails to resize the program's main block for
the same reason that crashed the game on our initial attempt to
spawn("MDRV98.COM").
Just like Borland's heap implementation, the DOS heap stores its control
structures immediately before each allocation, forming a singly linked list.
But since the entire OS shares this single list, corruptions from heap
overflows also affect the whole system, and become much more disastrous.
Theoretically, it might be possible to recover from them by forcibly
releasing all blocks after the last correct one, or even by doing a
brute-force search for valid memory
control blocks, but in reality, DOS will likely just throw error code #7
(ERROR_ARENA_TRASHED) on the next memory management syscall,
forcing a reboot.
With a custom allocator, small corruptions remain isolated to the process.
They can be even further limited if the process adds some padding between
its last internal allocation and the end of the allocated DOS memory block;
Borland's heap sort of does this as well by always rounding up the DOS block
to a full KiB. All this might not make a difference in today's emulated and
single-tasked usage, but would have back then when software was still
developed inside IDEs running on the same system.
TH01's debug mode uses heapcheck() and
heapchecknode(), and reimplementing these on top of the DOS
heap is not trivial. On the contrary, it would be the most complicated part
of such a wrapper, by far.
I could release this DOS heap wrapper in unused form for another push if
anyone's interested, but for now, I'm pretty happy with not actually using
it in the games. Instead, let's stay with the Borland C heap, and find a way
to push MDRV98 to the very top of conventional RAM. Like this:
Which is much easier said than done. It would be nice if we could just use
the last fit allocation strategy here, but .COM executables always
receive all free memory by default anyway, which eliminates any difference
between the strategies.
But we can still shape the memory layout itself. So let's temporarily claim all
remaining free memory, minus the exact amount we need for MDRV98, for our
process. Then, the only remaining free space to spawn MDRV98 is at the exact
place where we want it to be:
Obviously, we release all the additional memory after spawning MDRV98.
Now we only need to know how much memory to not temporarily allocate. First,
we need to replicate the assumption that MDRV98's -M7
command-line parameter corresponds to a resident size of 23,552 bytes. This
is not as bad as it seems, because the -M parameter explicitly
has a KiB unit, and we can nicely abstract it away for the API.
The (env.) block though? Its minimum size equals the combined length
of all environment variables passed to the process, but its maximum size is…
not limited at all?! As in, DOS implementations can add and have
historically added more free space because some programs insisted on storing
their own new environment variables in this exact segment. DOSBox and
DOSBox-X follow this tradition by providing a configuration option for the
additional amount of environment space, with the latter adding 1024
additional bytes by default, y'know, just in case someone wants to compile
FreeDOS on a slow emulator. It's not even worth sending a bug report for
this specific case, because it's only a symptom of the fact that
unexpectedly large program environment blocks can and will happen, and are
to be expected in DOS land.
So thanks to this cruel joke, it's technically impossible to achieve what we
want to do there. Hooray! The only thing we can kind of do here is an
educated guess: Sum up the length of all environment variables in our
environment block, compare that length against the allocated size of the
block, and assume that the MDRV98 process will get as much additional memory
as our process got. 🤷
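As a sketch (not necessarily the shape of the actual implementation): DOS stores the environment segment at offset 2Ch of the PSP, and the memory control block right below that segment holds its allocated size, so the guess could look like this:
#include <dos.h>

// Paragraphs of slack that DOS added on top of the environment variables in
// our own environment block. (Ignores the program name that follows the list.)
unsigned env_slack_paragraphs(void)
{
    unsigned env_seg = *(unsigned far *)MK_FP(_psp, 0x2C);       // PSP:2Ch
    unsigned allocated = *(unsigned far *)MK_FP(env_seg - 1, 3); // MCB size field
    const char far *var = (const char far *)MK_FP(env_seg, 0);
    unsigned used = 1; // the extra NUL terminating the whole variable list
    while(*var != '\0') {
        while(*(var++) != '\0') {
            used++;
        }
        used++; // each variable's own NUL terminator
    }
    return allocated - ((used + 15) >> 4);
}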
The remaining hurdles came courtesy of some Borland C runtime implementation
details. You would think that the temporary reallocation could even be done
in pure C using the sbrk(), coreleft(), and
brk() functions, but all values passed to or returned from
these functions are inaccurate because they don't factor in the
aforementioned KiB padding to the underlying DOS memory block. So we have to
directly use the DOS syscalls after all. Which at least means that learning
about them wasn't completely useless…
The final issue is caused inside Borland's
spawn() implementation. The environment block for the
child process is built out of all the strings reachable from C's
environ pointer, which is what that FreeDOS build process
should have used. Coalescing them into a single buffer involves yet
another C heap allocation… and since we didn't report our DOS memory block
manipulation back to the C heap, the malloc() call might think
it needs to request more memory from DOS. This resets the DOS memory block
back to its intended level, undoing our manipulation right before the actual
INT 21h, AH=4Bh
EXEC syscall. Or in short:
Manipulate DOS heap ➜ spawn() call ➜ _LoadProg() ➜ allocate and prepare environment block ➜ _spawn() ➜ DOS EXEC syscall
The obvious solution: Replace _LoadProg(), implement the
coalescing ourselves, and do it before the heap manipulation. Fortunately,
Borland's internal low-level _spawn() function is not
static, so we can call it ourselves whenever we want to:
Allocate and prepare environment block ➜ manipulate DOS heap ➜ _spawn() call ➜ EXEC syscall
So yes, launching MDRV98 from C can be done, but it involves advanced
witchcraft and is completely ridiculous.
Launching external sound drivers from a batch file is the right way
of doing things.
Fortunately, you don't have to rely on this auto-launching feature. You can
still launch DEBLOAT.EXE or ANNIV.EXE from a batch
file that launched MDRV98.COM before, and the binaries will
detect this case and skip the attempt of launching MDRV98 from C. It's
unlikely that my heuristic will ever break, but I definitely recommend
replicating GAME.BAT just to be completely sure – especially
for user-friendly repacks that don't want to include the original game
anyway.
This is also why ANNIV.EXE doesn't launch
ZUNSOFT.COM: The "correct" and stable way to launch
ANNIV.EXE still involves a batch file, and I would say that
expecting people to remove ZUNSOFT.COM from that file is worse
than not playing the animation. It's certainly a debate we can have, though.
This deep dive into memory allocation revealed another previously
undocumented bug in the original game. The RLE decompression code for the
東方靈異.伝 packfile contains two heap overflows, which are
actually triggered by SinGyoku's BOSS1_3.BOS and Konngara's
BOSS8_1.BOS. They don't immediately crash the game when
loading these bosses only thanks to two implementation details of Borland's C
heap.
Obviously, this is a bug we should fix, but according to the definition of
bugs, that fix would be exclusive to the anniversary branch.
Isn't that too restrictive for something this critical? This code is
guaranteed to blow up with a different heap implementation, if only in a
Debug build. And besides, nobody would notice a fix
just by looking at the game's rendered output…
Looks like we have to introduce a fourth category of weird code, in addition
to the previous bloat, bug, and quirk categories, for
invisible internal issues like these. Let's call it landmine, and fix
them on the debloated branch as well. Thanks to
Clerish for the naming inspiration!
With this new category, the full definitions for all categories have become
quite extensive. Thus, they now live in CONTRIBUTING.md
inside the ReC98 repository.
With the new discoveries and the new landmine category, TH01 is now at 67
bugs and 20 landmines. And the solution for the landmine in question? Simplifying
the 61 lines of the original code down to 16. And yes, I'm including
comments in these numbers – if the interactions of the code are complex
enough to require multi-paragraph comments, these are a necessary and
valid part of the code.
While we're on the topic of weird code and its visible or invisible effects,
there's one thing you might be concerned about. With all the rearchitecting
and data shifting we're doing on the debloated branch, what
will happen to the 📝 negative glitch stages?
These are the result of a clearly observable bug that, by definition, must
not be fixed on the debloated branch. But given that the
observable layout of the glitch stages is defined by the memory
surrounding the scene stage variable, won't the
debloated branch inherently alter their appearance (= ⚠️
fanfiction ⚠️), or even remove them completely?
Well, yes, it will. But we can still preserve their layout by
hardcoding
the exact original data that the game would originally read, and even emulate
the original segment relocations and other pieces of global data.
Doing this is feasible thanks to the fact that there are only 4 glitch
stages. Unfortunately, the same can't be said for the timer values, which
are determined by an array lookup with the un-modulo'd stage ID. If we
wanted to preserve those as well, we'd have to bundle an exact copy of the
original REIIDEN.EXE data segment to preserve the values of all
32,768 negative stages you could possibly enter, together with a map
of all relocations in this segment. 😵 Which I've decided against for now,
since this has been going on for far too long already. Let's first see if
anyone ever actually complains about details like this…
Alright, time to start the anniversary branch by rendering
everything at its correct internal unaligned X position? Eh… maybe not quite
yet. If we just hacked all the necessary bit-shifting code into all the
format-specific blitting functions, we'd still retain all this largely
redundant, bad, and slow code, and would make no progress in terms of
portability. It'd be much better to first write a single generic blitter
that's decently optimized, but supports all kinds of sprites to make this
optimization actually worth something.
So, next research question: What would such a blitter look like? After I
learned during my
📝 first foray into cycle counting that port
I/O is slow on 486 CPUs, it became clear that TH04's
📝 GRCG batching for pellets was one of the
more useful optimizations that probably contributed a big deal towards
achieving the high bullet counts of that game. This leads to two
conclusions:
master.lib's super_*() sprite functions are slow, and not
worth looking at for inspiration. Even the 📝 tiny format reinitializes the GRCG on every color change, wasting 80
cycles.
Hence, our low-level blitting API should not even care about colors. It
should only concern itself with blitting a given 1bpp sprite to a single
VRAM segment. This way, it can work for both 4-plane sprites and
single-plane sprites, and just assume that the GRCG is active.
Maybe we should also start by not even doing these unaligned bit shifts
ourselves, and instead expect the call site to
📝 always deliver a byte-aligned sprite that is correctly preshifted,
if necessary? Some day, we definitely should measure how slow runtime
shifting would really be…
What we should do, however, are some further general optimizations that I
would have expected from master.lib: Unrolling the vertical
loop, and baking a single function for every sprite width to eliminate
the horizontal loop. We can then use the widest possible x86
MOV instruction for the lowest possible number of cycles per
row – for example, we'd blit a 56-wide sprite with three MOVs
(32-bit + 16-bit + 8-bit), and a 64-wide one with two 32-bit
MOVs.
Or maybe not? There's a lot of blitting code in both master.lib and PC-98
Touhou that checks for empty bytes within sprites to skip needlessly writing
them to VRAM.
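The pattern boils down to something like this sketch, with hypothetical names:
if(sprite_left != 0) {
    vram_plane[vram_offset + 0] = sprite_left;   // left 8 pixels of a 16-pixel row
}
if(sprite_right != 0) {
    vram_plane[vram_offset + 1] = sprite_right;  // right 8 pixels
}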
Which goes against everything you seem to know about computers. We aren't
running on an 8-bit CPU here, so wouldn't it be faster to always write both
halves of a sprite in a single operation?
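That is, something like this, with sprite_word holding the same two bytes:
*(unsigned int far *)(vram_plane + vram_offset) = sprite_word;  // one 16-bit write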
That's a single CPU instruction, compared to two instructions and two
branches. The only possible explanation for this would be that VRAM writes
are so slow on PC-98 that you'd want to avoid them at all costs, even
if that means additional branching on the CPU to do so. Or maybe that was
something you would want to do on certain models with slow VRAM, but not on
others?
So I wrote a benchmark to answer all these questions, and to compare my new
blitter against typical TH01 blitting code:
A not really representative run on DOSBox-X. Since the master.lib sprite
functions are also unbatched, I expect them to not be much faster than
the naive C implementation.
2023-03-05-blitperf.zip
And here are the real-hardware results I've got from the PC-9800
Central Discord server:
The tested models, their CPUs, and their release years, in the order the measurements below are listed:
PC-286LS: 80286, 12 MHz (1987)
PC-9801ES: i386SX, 16 MHz (1989)
PC-9821Cb/Cx: 486SX, 33 MHz (1994)
PC-9821Ap3: 486DX4, 100 MHz (1994)
PC-9821An: Pentium, 90 MHz (1994)
PC-9821Nw133: Pentium, 133 MHz (1997)
PC-9821Ra20: Pentium Pro, 200 MHz (1996)
Unchecked:
  C, GRCG: 36,85 · 38,42 · 26,02 · 26,87 · 3,98 · 4,13 · 2,08 · 2,16 · 1,81 · 1,87 · 0,86 · 0,89 · 1,25 · 1,25
  MOVS, GRCG: 15,22 · 16,87 · 9,33 · 10,19 · 1,22 · 1,37 · 0,44 · 0,44
  MOV, GRCG: 15,42 · 17,08 · 9,65 · 10,53 · 1,15 · 1,3 · 0,44 · 0,44
  4-plane: 37,23 · 43,97 · 29,2 · 32,96 · 4,44 · 5,01 · 4,39 · 4,67 · 5,11 · 5,32 · 5,61 · 5,74 · 6,63 · 6,64
Checking first:
  GRCG: 17,49 · 19,15 · 10,84 · 11,72 · 1,27 · 1,44 · 1,04 · 1,07 · 0,54 · 0,54
  4-plane: 46,49 · 53,36 · 35,01 · 38,79 · 5,66 · 6,26 · 5,43 · 5,74 · 6,56 · 6,8 · 8,08 · 8,29 · 10,25 · 10,29
Checking second:
  GRCG: 16,47 · 18,12 · 10,77 · 11,65 · 1,25 · 1,39 · 1,02 · 0,51 · 0,51
  4-plane: 43,41 · 50,26 · 33,79 · 37,82 · 5,22 · 5,81 · 5,14 · 5,43 · 6,18 · 6,4 · 7,57 · 7,77 · 9,58 · 9,62
Checking both:
  GRCG: 16,14 · 18,03 · 10,84 · 11,71 · 1,33 · 1,49 · 1,01 · 0,49 · 0,49
  4-plane: 43,61 · 50,45 · 34,11 · 37,87 · 5,39 · 5,99 · 4,92 · 5,23 · 5,88 · 6,11 · 7,19 · 7,43 · 9,1 · 9,13
Amount of frames required to render 2000 16×8 pellet sprites on a variety of
PC-98 models, using the new generic blitter. Each model was tested with both
preshifted and runtime-shifted sprites, in that order, and the values above
follow the model order given at the top. Times faster than a single frame were
not recorded, which is why some rows list fewer values. Thanks to cuba200611,
Shoutmon, cybermind, and Digmac for running the tests!
The key takeaways:
Checking for empty bytes has never been a good idea.
Preshifting sprites made a slight difference on the 286. Starting with
the 386 though, that difference got smaller and smaller, until it completely
vanished on Pentium models. The memory tradeoff is especially not worth it
for 4-plane sprites, given that you would have to preshift each of the 4
planes and possibly even a fifth alpha plane. Ironically, ZUN only ever
preshifted monochrome single-bitplane sprites with a width of 8 pixels.
That's the smallest amount of memory a sprite can possibly take,
and where preshifting consequently has the smallest effect on performance.
Shifting 8-wide sprites on the fly literally takes a single ROL
or ROR instruction per row.
You might want to use MOVS instead of MOV when
targeting the 286 and 386, but the performance gains are barely worth the
resulting mess you would make out of your blitting code. On Pentium models,
there is no difference.
Use the GRCG whenever you have to render lots of things that share a
static 8×1 pattern.
These are the PC-98 models that the people who are willing to test your
newly written PC-98 code actually use.
Since this won't be the only piece of game-independent and explicitly
PC-98-specific custom code involved in this delivery, it makes sense to
start a
dedicated PC-98 platform layer. This code will gradually eliminate the
dependency on master.lib and replace it with better optimized and more
readable C++ code. The blitting benchmark, for example, is already
implemented completely without master.lib.
While this platform layer is mainly written to generate optimal code within
Turbo C++ 4.0J, it can also serve as general PC-98 documentation for
everyone who prefers code over machine-translating old Japanese books. Not
to mention the immediacy of having all actual relevant information in
one place, which might otherwise be pretty well hidden in these books, or
some obscure old text file. For example, did you know that uploading gaiji
via INT 18h might end up disabling the VSync interrupt trigger,
deadlocking the process on the next frame delay loop? This nuisance is not
replicated by any emulators, and it's quite frustrating to encounter it when
trying to run your code on real hardware. master.lib works around it by
simply hooking INT 18h and unconditionally reenabling the VSync
interrupt trigger after the original handler returns, and so does our
platform layer.
So, with the pellet draw calls batched and routed through the new renderer,
we should have gained enough free CPU cycles to disable
📝 interlaced pellet rendering without any
impact on frame rates?
Well, kinda. We do get 56.4 FPS, but only together with noticeable and
reproducible tearing in the top part of the playfield, suggesting exactly
why ZUN interlaced the rendering in the first place. 😕 So have we
already reached the limit of single-buffered PC-98 games here, or can we
still do something about it?
As it turns out, the main bottleneck actually lies in the pellet
unblitting code. Every EGC-"accelerated" unblitting call in TH01 is
as unbatched as the pellet blitting calls were, spending an additional 17
I/O port writes per call to completely set up and shut down the EGC, every
time. And since this is TH01, the two-instruction operation of changing the
active PC-98 VRAM page isn't inlined either, but instead done via a function
call to a faraway segment. On the 486, that's:
>341 cycles for EGC setup and teardown, plus
>72 cycles for each 16-pixel chunk to be unblitted.
This sums up to
>917 cycles of completely unnecessary work for every active pellet,
in the optimal 50% of cases where it lies on an even VRAM byte,
or
>1493 cycles if it lies on an odd VRAM byte, because ZUN's code
extends the unblitted rectangle to a gargantuan 32×8 pixels in this case
And this calculation even ignores the small micro-optimizations that are still
missing from the blitting loop itself. Multiply that by the game's pellet
cap of 100, and we get a 6-digit number of wasted CPU cycles. On
paper, that's roughly 1/6 of the time we have for each
of our target 56.423 FPS on the game's target 33 MHz systems. Might not
sound all too critical, but the single-buffered nature of the game means
that we're effectively racing the beam on every frame. In turn, we have to
be even more serious about performance.
So, time to also add a batched EGC API to our PC-98 platform layer? Writing
our own EGC code presents a nice opportunity to finally look deeper into all
its registers and configuration options, and see what exactly we can do
about ZUN's enforced 16-pixel alignment.
To nobody's surprise, this alignment is completely unnecessary, and only
displays a lack of knowledge about the chip. While it is true that
the EGC wants VRAM to be exclusively addressed in 16-bit chunks at
16-bit-aligned addresses, it specifically provides
an address register (0x4AC) for shifting the horizontal
start offsets of the source and destination to any pixel within the
16 pixels of such a chunk, and
a bit length register (0x4AE) for specifying the total
width of pixels to be transferred, which also implies the correct end
offsets.
And it gets even better: After ⌈bitlength ÷ 16⌉ write
instructions, the EGC's internal shifter state automatically reinitializes
itself in preparation for blitting another row of pixels with the same
initially configured bit addresses and length. This is perfect for blitting
rectangles, as two I/O port writes before the start of your blitting loop
are enough to define your entire rectangle.
The manual nature of reading and writing in 16-pixel chunks does come with a
slight pitfall though. If the source bit address is larger than the
destination bit address, the first 16-bit read won't fill the EGC's internal
shift register with all pixels that should appear in the first 16-pixel
destination chunk. In this case, the EGC simply won't write anything and
leave the first chunk unchanged. In a
📝 regular blitting loop, however, you expect
that memory to be written and immediately move on to the next chunks within
the row. As a result, the actual blitting process for such a rectangle will
no longer be aligned to the configured address and bit length. The first row
of the rectangle will appear 16 pixels to the right of the destination
address, and the second one will start at bit offset 0 with pixels from the
rightmost byte of the first line, which weren't blitted and remained in the
tile register.
There is an easy solution though: Before the horizontal loop on each line of
the rectangle, simply read one additional 16-pixel chunk from the source
location to prefill the shift register. Thankfully, it's large enough to
also fit the second read of the then full 16 pixels, without dropping any
pixels along the way.
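In C-like pseudocode, with src and dst as 16-bit far pointers into VRAM and the bit address and bit length registers already configured for the whole rectangle, one row of the copy then comes down to this:
if(src_bit_address > dst_bit_address) {
    // Dummy read of the first source chunk, purely to prefill the EGC's
    // internal shift register; the value that reaches the CPU is irrelevant.
    (void)*(volatile const unsigned int far *)src;
}
for(int i = 0; i < chunks_per_row; i++) {
    dst[i] = src[i];  // read 16 pixels into the EGC, write out the shifted ones
}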
And that's how we get arbitrarily unaligned rectangle copies with the EGC!
Except for a small register allocation trick to use two-register addressing,
there's not much use in further optimizations, as the runtime of these
inter-page blit operations is dominated by the VRAM page switches anyway.
Except that T98-Next seems to disagree about the register prefilling issue:
Every other emulator agrees with real hardware in this regard, so we can
safely assume this to be a bug in T98-Next. Just in case this old emulator
with its last release from June 2010 still has any fans left nowadays… For
now though, even they can still enjoy the TH01 Anniversary Edition: The only
EGC copy algorithm that TH01 actually needs is the left one during the
single-buffered tests, which even that emulator gets right.
That only leaves
📝 my old offer of documenting the EGC raster ops,
and we've got the EGC figured out completely!
And that did in fact remove tearing from the pellet rendering function! For
the first time, we can now fight Elis, Kikuri, Sariel, and Konngara with a
doubled pellet frame rate:
Switchable videos like these can nicely provide evidence that these
changes have no effect on gameplay, making it easy to see that the Orb
still collides with all pellets on the same frames. Also, check out the
difference in remaining conventional memory (coreleft)…
With only pellets and no other animation on screen, this exact pattern
presents the optimal demonstration case for the new unblitter. But as you
can already tell from the invincibility sprites, we'd also need to route
every other kind of sprite through the same new code. This isn't all too
trivial: Most sprites are still rendered at byte-aligned positions, and
their blitting APIs hide that fact by taking a pixel position regardless.
This is why we can't just replace ZUN's original 16-pixel-aligned EGC
unblitting function with ours, and always have to replace both the blitter
and the unblitter on a per-sprite basis.
To completely remove all flickering, we'd also like to get rid of all the
sprite-specific unblit ➜ update ➜ render sequences, and instead
gather all unblitting code to the beginning of the game loop, before any
update and rendering calls. So yeah, it will take a long time to completely
get rid of all flickering. Until we're there, I recommend any backer to tell
me their favorite boss, so that I can focus on getting that one
rendered without any flickering. Remember that here at ReC98, we can have a
Touhou character popularity contest at any time during the year, whenever
the store is open!
In the meantime, the consistent use of 8×8 rectangles during pellet
unblitting does significantly reduce flickering across the entire game,
and shrinks certain holes that pellets tend to rip into lazily reblitted
sprites:
SinGyoku's "crossing pellets" pattern, shortly before completing
the transformation back to the sphere.
To round out the first release, I added all the other bug fixes to achieve
parity with my previously released patched REIIDEN.EXE builds:
I removed the 📝 shootout laser crash by
simply leaving the lasers on screen if a boss is defeated,
prevented the HP bar heap corruption bug in test or debug mode by not
letting it display negative HP in the first place, and
So here it is, the first build of TH01's Anniversary Edition:
2023-03-05-th01-anniv.zip
Edit (2023-03-12): If you're playing on Neko Project and seeing more
flickering than in the original game, make sure you've checked the Screen
→ Disp vsync option.
Next up: The long overdue extended trip through the depths of TH02's
low-level code. From what I've seen of it so far, the work on this project
is finally going to become a bit more relaxing. Which is quite welcome
after, what, 6 months of stressful research-heavy work?
Starting the year with a delivery that wasn't delayed until the last
day of the month for once, nice! Still, "very soon" and
"high-maintenance" did not go well together…
It definitely wasn't Sara's fault though. As you would expect from a Stage 1
Boss, her code was no challenge at all. Most of the TH02, TH04, and TH05
bosses follow the same overall structure, so let's introduce a new table to
replace most of the boilerplate overview text:
Entrance: HP boundary 4,650; timeout after 288 frames
Phase 2: 4 patterns; HP boundary 2,550; timeout after 2,568 frames (= 32 patterns)
Phase 3: 4 patterns; HP boundary 450; timeout after 5,296 frames (= 24 patterns)
Phase 4: 1 pattern; HP boundary 0; timeout after 1,300 frames
Total: 9 patterns; 9,452 frames
In Phases 2 and 3, Sara cycles between waiting, moving randomly for a
fixed 28 frames, and firing a random pattern among the 4 phase-specific
ones. The pattern selection makes sure to never
pick any pattern twice in a row. Both phases contain spiral patterns that
only differ in the clockwise or counterclockwise turning direction of the
spawner; these directions are treated as individual unrelated patterns, so
it's possible for the "same" pattern to be fired multiple times in a row
with a flipped direction.
The two phases also differ in the wait and pattern durations:
In Phase 2, the wait time starts at 64 frames and decreases by 12
frames after each of the first 5 patterns, ending on a minimum of 4 frames.
In Phase 3, it's a constant 16 frames instead.
All Phase 2 patterns are fired for 28 frames, after a 16-frame
gather animation. The Phase 3 pattern time starts at 80 frames and
increases by 24 frames for the first 6 patterns, ending at 200 frames
for all later ones.
Phase 4 consists of the single laser corridor pattern with additional
random bullets every 16 frames.
And that's all the gameplay-relevant detail that ZUN put
into Sara's code. It doesn't even make sense to describe the remaining
patterns in depth, as their groups can significantly change between
difficulties and rank values. The
📝 general code structure of TH05 bosses
won't ever make for good-code, but Sara's code is just a
lesser example of what I already documented for Shinki.
So, no bugs, no unused content, only inconsequential bloat to be found here,
and less than 1 push to get it done… That makes 9 PC-98 Touhou bosses
decompiled, with 22 to go, and gets us over the sweet 50% overall
finalization mark! 🎉 And sure, it might be possible to pass through the
lasers in Sara's final pattern, but the boss script just controls the
origin, angle, and activity of lasers, so any quirk there would be part of
the laser code… wait, you can do what?!?
TH05 expands TH04's one-off code for Yuuka's Master and Double Sparks into a
more featureful laser system, and Sara is the first boss to show it off.
Thus, it made sense to look at it again in more detail and finalize the code
I had purportedly
📝 reverse-engineered over 4 years ago.
That very short delivery notice already hinted at a very time-consuming
future finalization of this code, and that prediction certainly came true.
On the surface, all of the low-level laser ray rendering and
collision detection code is undecompilable: It uses the SI and
DI registers without Turbo C++'s safety backups on the stack,
and its helper functions take their input and output parameters from
convenient registers, completely ignoring common calling conventions. And
just to raise the confusion even further, the code doesn't just set
these registers for the helper function calls and then restore their
original values, but permanently shifts them via additions and
subtractions. Unfortunately, these convenient registers also include the
BP base pointer to the stack frame of a function… and shifting
that register throws any intuition behind accessed local variables right out
of the window for a good part of the function, requiring a correctly shifted
view of the stack frame just to make sense of it again.
How could such code even have been written?! This
goes well beyond the already wrong assumption that using more stack space is
somehow bad, and straight into the territory of self-inflicted pain.
So while it's not a lot of instructions, it's quite dense and really hard to
follow. This code would really benefit from a decompilation that
anchors all this madness as much as possible in existing C++ structures… so
let's decompile it anyway?
Doing so would involve emitting lots of raw machine code bytes to hide the
SI and DI registers from the compiler, but I
already had a certain
📝 batshit insane compiler bug workaround abstraction
lying around that could make such code more readable. Hilariously, it only
took this one additional use case for that abstraction to reveal itself as
premature and way too complicated. Expanding
the core idea into a full-on x86 instruction generator ended up simplifying
the code structure a lot. All we really want there is a way to set all
potential parameters to e.g. a specific form of the MOV
instruction, which can all be expressed as the parameters to a force-inlined
__emit__() function. Type safety can help by providing
overloads for different operand widths here, but there really is no need for
classes, templates, or explicit specialization of templates based on
classes. We only need a couple of enums with opcode, register,
and prefix constants from the x86 reference documentation, and a set of
associated macros that token-paste pseudoregisters onto the prefixes of
these enum constants.
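A minimal sketch of the idea (not the exact macros that ended up in the repository):
// x86 register numbers, as used in short-form opcodes and the ModR/M byte
enum reg16_t { _AX = 0, _CX, _DX, _BX, _SP, _BP, _SI, _DI };

// MOV r16, imm16 is encoded as (0xB8 + register number) followed by the
// 16-bit immediate. The macro token-pastes the pseudoregister onto the
// `_` prefix to turn it into the matching enum constant.
#define MOV_RI16(reg, imm) \
    __emit__((unsigned char)(0xB8 + _##reg), (unsigned int)(imm))

// MOV_RI16(DI, 0x1234); emits 0xBF 0x34 0x12, i.e. MOV DI, 1234h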
And that's how you get a custom compile-time assembler in a 1994 C++
compiler and expand the limits of decompilability even further. What's even
truly left now? Self-modifying code, layout tricks that can't be replicated
with regularly structured control flow… and that's it. That leaves quite a
few functions I previously considered undecompilable to be revisited once I
get to work on making this game more portable.
With that, we've turned the low-level laser code into the expected horrible
monstrosity that exposes all the hidden complexity in those few ASM
instructions. The high-level part should be no big deal now… except that
we're immediately bombarded with Fixup overflow errors at link
time? Oh well, time to finally learn the true way of fixing this highly
annoying issue in a second new piece of decompilation tech – and one
that might actually be useful for other x86 Real Mode retro developers at
that.
Earlier in the RE history of TH04 and TH05, I often wrote about the need to
split the two original code segments into multiple segments within two
groups, which makes it possible to slot in code from different
translation units at arbitrary places within the original segment. If we
don't want to define a unique segment name for each of these slotted-in
translation units, we need a way to set custom segment and group names in C
land. Turbo C++ offers two #pragmas for that:
#pragma option -zCsegment -zPgroup – preferred in most
cases as it's equivalent to setting the default segment and group via the
command line, but can only be used at the beginning of a translation unit,
before the first non-preprocessor and non-comment C language token
#pragma codeseg segment <group> – necessary if a
translation unit needs to emit code into two or more segments
For the most part, these #pragmas work well, but they seemed to
not help much when it came to calling near functions declared
in different segments within the same group. It took a bit of trial and
error to figure out what was actually going on in that case, but there
is a clear logic to it:
Symbols are allocated to the segment and group that's active during
their first appearance, no matter whether that appearance is a declaration
or definition. Any later appearance of the function in a different segment
is ignored.
The linker calculates the 16-bit offsets of such references relative to
the symbol's declared segment, not its actual one. Turbo C++ does
not show an error or warning if the declared and actual segments are
different, as referencing the same symbol from multiple segments is a valid
use case. The linker merely throws the Fixup overflow error if
the calculated distance exceeds 64 KiB and thus couldn't possibly fit
within a near reference. With a wrong segment declaration
though, your code can be incorrect long before a fixup hits that limit.
Summarized in code:
#pragma option -zCfoo_TEXT -zPfoo

void bar(void);
void near qux(void); // defined somewhere else, maybe in a different segment

#pragma codeseg baz_TEXT baz

// Despite the segment change in the line above, this function will still be
// put into `foo_TEXT`, the active segment during the first appearance of the
// function name.
void bar(void) {
}

// This function hasn't been declared yet, so it will go into `baz_TEXT` as
// expected.
void baz(void) {
    // This `near` function pointer will be calculated by subtracting the
    // flat/linear address of qux() inside the binary from the base address
    // of qux()'s declared segment, i.e., `foo_TEXT`.
    void (near *ptr_to_qux)(void) = qux;
}
So yeah, you might have to put #pragma codeseg into your
headers to tell the linker about the correct segment of a
near function in advance. 🤯 This is an important insight for
everyone using this compiler, and I'm shocked that none of the Borland C++
books documented the interaction of code segment definitions and
near references at least at this level of clarity. The TASM
manuals did have a few pages on the topic of groups, but that syntax
obviously doesn't apply to a C compiler. Fixup overflows in particular are
such a common error and really deserved better than the unhelpful 🤷
of an explanation that ended up in the User's Guide. Maybe this whole
technique of custom code segment names was considered arcane even by 1993,
judging from the mere three sentences that #pragma codeseg was
documented with? Still, it must have been common knowledge among Amusement
Makers, because they couldn't have built these exact binaries without
knowing about these details. This is the true solution to
📝 any issues involving references to near functions,
and I'm glad to see that ZUN did not in fact lie to the compiler. 👍
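Concretely, the fix for the example above would be a header along these lines – with qux_TEXT being a made-up stand-in for whatever segment qux() is actually defined in:
// qux.h
// Declaring qux() while its real segment is active ensures that every
// translation unit #including this header resolves near calls to qux()
// relative to the correct segment.
#pragma codeseg qux_TEXT foo
void near qux(void);
// Note that the #pragma stays in effect, so the including translation unit
// has to re-issue its own #pragma codeseg afterwards if it wants its code to
// go somewhere else.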
OK, but now the remaining laser code compiles, and we get to write
C++ code to draw some hitboxes during the two collision-detected states of
each laser. These confirm what the low-level code from earlier already
uncovered: Collision detection against lasers is done by testing a
12×12-pixel box at every 16 pixels along the length of a laser, which leaves
obvious 4-pixel gaps at regular intervals that the player can just pass
through. This adds
📝 yet 📝 another 📝 quirk to the growing list of quirks that
were either intentional or must have been deliberately left in the game
after their initial discovery. This is what constants were invented for, and
there really is no excuse for not using them – especially during
intoxicated coding, and/or if you don't have a compile-time abstraction for
Q12.4 literals.
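As a sketch in plain pixels – the parameter layout and the centering of each box on its sample point are assumptions, not ZUN's actual code:
struct point_t { int x, y; };

// Steps along the laser in 16-pixel increments ((velocity_x, velocity_y) is
// the laser's direction scaled to a length of 16 pixels) and tests the
// player's center point against a 12×12-pixel box around each sample point,
// leaving the 4-pixel gaps mentioned above.
int laser_hits_player(
    point_t origin, int velocity_x, int velocity_y, int segment_count,
    point_t player_center
)
{
    point_t box = origin;
    for(int i = 0; i < segment_count; i++) {
        if(
            (player_center.x >= (box.x - 6)) && (player_center.x < (box.x + 6)) &&
            (player_center.y >= (box.y - 6)) && (player_center.y < (box.y + 6))
        ) {
            return 1;
        }
        box.x += velocity_x;
        box.y += velocity_y;
    }
    return 0;
}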
When detecting laser collisions, the game checks the player's single
center coordinate against any of the aforementioned 12×12-pixel boxes.
Therefore, it's correct to split these 12×12 pixels into two 6×6-pixel
boxes and assign the other half to the player for a more natural
visualization. Always remember that hitbox visualizations need to keep
all colliding entities in mind –
📝 assigning a constant-sized hitbox to "the player" and "the bullets" will be wrong in most other cases.
Using subpixel coordinates in collision detection also introduces a slight
inaccuracy into any hitbox visualization recorded in-engine on a 16-color
PC-98. Since we have to render discrete pixels, we cannot exactly place a
Q12.4 coordinate in the 93.75% of cases where the fractional part is
non-zero. This is why pretty much every laser segment hitbox in the video
above shows up as 7×7 rather than 6×6: The actual W×H area of each box is 13
pixels smaller, but since the hitbox lies between these pixels, we
cannot indicate where it lies exactly, and have to err on the
side of caution. It's also why Reimu's box slightly changes size as she
moves: Her non-diagonal movement speed is 3.5 pixels per frame, and the
constant focused movement in the video above halves that to 1.75 pixels,
making her end up on an exact pixel every 4 frames. Looking forward to the
glorious future of displays that will allow us to scale up the playfield to
16× its original pixel size, thus rendering the game at its exact internal
resolution of 6144×5888 pixels. Such a port would definitely add a lot of
value to the game…
The remaining high-level laser code is rather unremarkable for the most
part, but raises one final interesting question: With no explicitly defined
limit, how wide can a laser be? Looking at the laser structure's 1-byte
width field and the unsigned comparisons all throughout the update and
rendering code, the answer seems to be an obvious 255 pixels. However, the
laser system also contains an automated shrinking state, which can be most
notably seen in Mai's wheel pattern. This state shrinks a laser by 2 pixels
every 2 frames until it reaches a width of 0. This presents a problem with
odd widths, which would fall below 0 and overflow back to 255 due to the
unsigned nature of this variable. So rather than, I don't know, treating
width values of 0 as invalid and stopping at a width of 1, or even adding a
condition for that specific case, the code just performs a signed
comparison, effectively limiting the width of a shrinkable laser to a
maximum of 127 pixels. This small signedness
inconsistency now forces the distinction between shrinkable and
non-shrinkable lasers onto every single piece of code that uses lasers. Yet
another instance where
📝 aiming for a cinematic 30 FPS look
made the resulting code much more complicated than if ZUN had just evenly
spread out the subtraction across 2 frames. 🤷
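A sketch of that comparison, with made-up names and the 2-frame gating left out:
unsigned char width; // stands in for the laser structure's 1-byte width field

void laser_shrink_step(void)
{
    width -= 2;

    // An unsigned (width == 0) check would never trigger for odd widths,
    // which jump straight from 1 to 255, so the value is compared as a
    // signed 8-bit integer instead:
    if(((signed char)width) <= 0) {
        width = 0; // (presumably; the exact end-of-shrink handling may differ)
    }
    // As a side effect, any width from 128 to 255 already counts as
    // "negative", capping shrinkable lasers at 127 pixels.
}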
Oh well, it's not as if any of the fixed lasers in the original scripts came
close to any of these limits. Moving lasers are much more streamlined and
limited to begin with: Since they're hardcoded to 6 pixels, the game can
safely assume that they're always thinner than the 28 pixels they get
gradually widened to during their decay animation.
Finally, in case you were missing a mention of hitboxes in the previous
paragraph: Yes, the game always uses the aforementioned 12×12 boxes,
regardless of a laser's width.
This video also showcases the 127-pixel limit because I wanted
to include the shrink animation for a seamless loop.
That was what, 50% of this blog post just being about complications that
made lasers difficult for no reason? Next up: The first TH01 Anniversary
Edition build, where I finally get to reap the rewards of having a 100%
decompiled game and write some good code for once.
Whew, TH01's boss code just had to end with another beast of a boss, taking
way longer than it should have and leaving uncomfortably little time for the
rest of the game. Let's get right into the overview of YuugenMagan, the most
sequential and scripted battle in this game:
The fight consists of 14 phases, numbered (of course) from 0 to 13.
Unlike all other bosses, the "entrance phase" 0 is a proper gameplay-enabled
part of the fight itself, which is why I also count it here.
YuugenMagan starts with 16 HP, second only to Sariel's 18+6. The HP bar
visualizes the HP threshold for the end of phases 3 (white part) and 7
(red-white part), respectively.
All even-numbered phases change the color of the 邪 kanji in the stage
background, and don't check for collisions between the Orb and any eye.
Almost all of them consequently don't feature an attack, except for phase
0's 1-pixel lasers, spawning symmetrically from the left and right edges of
the playfield towards the center. Which means that yes, YuugenMagan is in
fact invincible during this first attack.
All other attacks are part of the odd-numbered phases:
Phase 1: Slow pellets from the lateral eyes. Ends
at 15 HP.
Phase 3: Missiles from the southern eyes, whose
angles first shift away from Reimu's tracked position and then towards
it. Ends at 12 HP.
Phase 5: Circular pellets sprayed from the lateral
eyes. Ends at 10 HP.
Phase 7: Another missile pattern, but this time
with both eyes shifting their missile angles by the same
(counter-)clockwise delta angles. Ends at 8 HP.
Phase 9: The 3-pixel 3-laser sequence from the
northern eye. Ends at 2 HP.
Phase 11: Spawns the pentagram with one corner out
of every eye, then gradually shrinks and moves it towards the center of
the playfield. Not really an "attack" (surprise) as the pentagram can't
reach the player during this phase, but collision detection is
technically already active here. Ends at 0 HP, marking the earliest
point where the fight itself can possibly end.
Phase 13: Runs through the parallel "pentagram
attack phases". The first five consist of the pentagram alternating its
spinning direction between clockwise and counterclockwise while firing
pellets from each of the five star corners. After that, the pentagram
slams itself into the player, before YuugenMagan loops back to phase
10 to spawn a new pentagram. On the next run through phase 13, the
pentagram grows larger and immediately slams itself into the player,
before starting a new pentagram attack phase cycle with another loop
back to phase 10.
Since the HP bar fills up in a phase with no collision detection,
YuugenMagan is immune to
📝 test/debug mode heap corruption. It's
generally impossible to get YuugenMagan's HP into negative numbers, with
collision detection being disabled every other phase, and all odd-numbered
phases ending immediately upon reaching their HP threshold.
All phases until the very last one have a timeout condition, independent
from YuugenMagan's current HP:
Phase 0: 331 frames
Phase 1: 1101 frames
Phases 2, 4, 6, 8, 10, and 12: 70 frames each
Phases 3 and 7: 5 iterations of the pattern, or
1845 frames each
Phase 5: 5 iterations of the pattern, or 2230
frames
Phase 9: The full duration of the sequence, or 491
frames
Phase 11: Until the pentagram has reached its target
position, or 221 frames
This makes it possible to reach phase 13 without dealing a single point of
damage to YuugenMagan, after almost exactly 2½ minutes on any difficulty.
Your actual time will certainly be higher though, as you will have to
HARRY UP at least once during the attempt.
And let's be real, you're very likely to subsequently lose a
life.
At a pixel-perfect 81×61 pixels, the Orb hitboxes are laid out rather
generously this time, reaching quite a bit outside the 64×48 eye sprites:
And that's about the only positive thing I can say about a position
calculation in this fight. Phase 0 already starts with the lasers being off
by 1 pixel from the center of the iris. Sure, 28 may be a nicer number to
add than 29, but the result won't be byte-aligned either way? This is
followed by the eastern laser's hitbox somehow being 24 pixels larger than
the others, stretching a rather unexpected 70 pixels compared to the 46 of
every other laser.
On a more hilarious note, the eye closing keyframe contains the following
(pseudo-)code, comprising the only real accidentally "unused" danmaku
subpattern in TH01:
// Did you mean ">= RANK_HARD"?
if(rank == RANK_HARD) {
    eye_north.fire_aimed_wide_5_spread();
    eye_southeast.fire_aimed_wide_5_spread();
    eye_southwest.fire_aimed_wide_5_spread();

    // Because this condition can never be true otherwise.
    // As a result, no pellets will be spawned on Lunatic mode.
    // (There is another Lunatic-exclusive subpattern later, though.)
    if(rank == RANK_LUNATIC) {
        eye_west.fire_aimed_wide_5_spread();
        eye_east.fire_aimed_wide_5_spread();
    }
}
Featuring the weirdly extended hitbox for the eastern laser, as well as
an initial Reimu position that points out the disparity between
byte-aligned rendering and the internal coordinates one final time.
After a few utility functions that look more like a quickly abandoned
refactoring attempt, we get to the main attraction: YuugenMagan
combines the entire boss script and most of the pattern code into a single
2,634-instruction function, totaling 9,677 bytes inside
REIIDEN.EXE. For comparison, ReC98's version of this code
consists of at least 49 functions, excluding those I had to add to work
around ZUN's little inconsistencies, or the ones I added for stylistic
reasons.
In fact, this function is so large that Turbo C++ 4.0J refuses to generate
assembly output for it via the -S command-line option, aborting
with a Compiler table limit exceeded in function error.
Contrary to what the Borland C++ 4.0 User Guide suggests, this
instance of the error is not at all related to the number of function bodies
or any metric of algorithmic complexity, but is simply a result of the
compiler's internal text representation for a single function overflowing a
64 KiB memory segment. Merely shortening the names of enough identifiers
within the function can help to get that representation down below 64 KiB.
If you encounter this error during regular software development, you might
interpret it as the compiler's roundabout way of telling you that it inlined
way more function calls than you probably wanted to have inlined. Because
you definitely won't explicitly spell out such a long function
in newly-written code, right?
At least it wasn't the worst copy-pasting job in this
game; that trophy still goes to 📝 Elis. And
while the tracking code for adjusting an eye's sprite according to the
player's relative position is one of the main causes behind all the bloat,
it's also 100% consistent, and might have been an inlined class method in
ZUN's original code as well.
The clear highlight in this fight though? Almost no coordinate is
precisely calculated where you'd expect it to be. In particular, all
bullet spawn positions completely ignore the direction the eyes are facing:
Combining the bottom of the pupil with the exact horizontal center of the sprite as a whole might sound like a good idea, but looks especially wrong if the eye is facing right.
Here it's the other way round: OK for a right-facing eye, really wrong for a left-facing one.
Dude, the eye is even supposed to track the laser in this one!
Hint: That's not the center of the playfield. At least the pellets spawned from the corners are sort of correct, but with the corner coordinates precomputed, you could only get them wrong on purpose.
Due to their effect on gameplay, these inaccuracies can't even be called
"bugs", and made me devise a new "quirk" category instead. More on that in
the TH01 100% blog post, though.
While we did see an accidentally unused bullet pattern earlier, I can
now say with certainty that there are no truly unused danmaku
patterns in TH01, i.e., pattern code that exists but is never called.
However, the code for YuugenMagan's phase 5 reveals another small piece of
danmaku design intention that never shows up within the parameters of
the original game.
By default, pellets are clipped when they fly past the top of the playfield,
which we can clearly observe for the first few pellets of this pattern.
Interestingly though, the second subpattern actually configures its pellets
to fall straight down from the top of the playfield instead. You never see
this happening in-game because ZUN limited that subpattern to a downwards
angle range of 0x73 or 162°, resulting in none of its pellets
ever getting close to the top of the playfield. If we extend that range to a
full 360° though, we can see how ZUN might have originally planned the
pattern to end:
YuugenMagan's phase 5 patterns on every difficulty, with the
second subpattern extended to reveal the different pellet behavior that
remained in the final game code. In the original game, the eyes would stop
spawning bullets on the marked frame.
If we also disregard everything else about YuugenMagan that fits the
upcoming definition of quirk, we're left with 6 "fixable" bugs, all
of which are a symptom of general blitting and unblitting laziness. Funnily
enough, they can all be demonstrated within a short 9-second part of the
fight, from the end of phase 9 up until the pentagram starts spinning in
phase 13:
General flickering whenever any sprite overlaps an eye. This is caused
by only reblitting each eye every 3 frames, and is an issue all throughout
the fight. You might have already spotted it in the videos above.
Each of the two lasers is unblitted and blitted individually instead of
each operation being done for both lasers together. Remember how
📝 ZUN unblits 32 horizontal pixels for every row of a line regardless of its width?
That's why the top part of the left, right-moving laser is never visible,
because it's blitted before the other laser is unblitted.
ZUN forgot to unblit the lasers when phase 9 ends. This footage was
recorded by pressing ↵ Return in test mode (game t or
game d), and it's probably impossible to achieve this during
actual gameplay without TAS techniques. You would have to deal the required
6 points of damage within 491 frames, with the eye being invincible during
240 of them. Simply shooting up an Orb with a horizontal velocity of 0 would
also only work a single time, as boss entities always repel the Orb with a
horizontal velocity of ±4.
The shrinking pentagram is unblitted after the eyes were blitted,
adding another guaranteed frame of flicker on top of the ones in 1). Like in
2), the blockiness of the holes is another result of unblitting 32 pixels
per row at a time.
Another missing unblitting call in a phase transition, as the pentagram
switches from its not quite correctly interpolated shrunk form to a regular
star polygon with a radius of 64 pixels. Indirectly caused by the massively
bloated coordinate calculation for the shrink animation being done
separately for the unblitting and blitting calls. Instead of, y'know, just
doing it once and storing the result in variables that can later be
reused.
The pentagram is not reblitted at all during the first 100 frames of
phase 13. During that rather long time, it's easily possible to remove
it from VRAM completely by covering its area with player shots. Or HARRY UP pellets.
Definitely an appropriate end for this game's entity blitting code.
I'm really looking forward to writing a
proper sprite system for the Anniversary Edition…
And just in case you were wondering about the hitboxes of these pentagrams
as they slam themselves into Reimu:
62 pixels on the X axis, centered around each corner point of the star, 16
pixels below, and extending infinitely far up. The latter part becomes
especially devious because the game always collision-detects
all 5 corners, regardless of whether they've already clipped through
the bottom of the playfield. The simultaneously occurring shape distortions
are simply a result of the line drawing function's rather poor
re-interpolation of any line that runs past the 640×400 VRAM boundaries;
📝 I described that in detail back when I debugged the shootout laser crash.
Ironically, using fixed-size hitboxes for a variable-sized pentagram means
that the larger one is easier to dodge.
The final puzzle in TH01's boss code comes
📝 once again in the form of weird hardware
palette changes. The 邪 kanji on the background
image goes through various colors throughout the fight, which ZUN
implemented by gradually incrementing and decrementing either a single one
or none of the color's three 4-bit components at the beginning of each
even-numbered phase. The resulting color sequence, however, doesn't
quite seem to follow these simple rules:
Phase 0: #DD5邪
Phase 2: #0DF邪
Phase 4: #F0F邪
Phase 6: #00F邪, but at the
end of the phase?!
Phase 8: #0FF邪, at the start
of the phase, #0F5邪, at the end!?
Phase 10: #FF5邪, at the start of
the phase, #F05邪, at the end
Second repetition of phase 12: #005邪
shortly after the start of the phase?!
Adding some debug output sheds light on what's going on there:
Since each iteration of phase 12 adds 63 to the red component, integer
overflow will cause the color to infinitely alternate between dark-blue
and red colors on every 2.03 iterations of the pentagram phase loop. The
65th iteration will therefore be the first one with a dark-blue color
for a third iteration in a row – just in case you manage to stall the
fight for that long.
Yup, ZUN had so much trust in the color clamping done by his hardware
palette functions that he did not clamp the increment operation on the
stage_palette itself. Therefore, the 邪
colors and even the timing of their changes from Phase 6 onwards are
"defined" by wildly incrementing color components beyond their intended
domain, so much that even the underlying signed 8-bit integer ends up
overflowing. Given that the decrement operation on the
stage_palette is clamped though, this might be another
one of those accidents that ZUN deliberately left in the game,
📝 similar to the conclusion I reached with infinite bumper loops.
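Modeled in made-up code, the phase 12 increment therefore boils down to this:
signed char stage_palette_red; // one component of the stage palette

// Hypothetical stand-in for ZUN's hardware palette function, which clamps its
// input to the 0–15 range of a 4-bit color component before writing it out.
void stage_palette_set_clamped(signed char component);

void pentagram_loop_color_change(void)
{
    // Not clamped: keeps growing past 15 and eventually overflows from +127
    // to -128, flipping the displayed color between red and dark blue roughly
    // every 2 iterations.
    stage_palette_red += 63;
    stage_palette_set_clamped(stage_palette_red);
}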
But guess what, that's also the last time we're going to encounter this type
of palette component domain quirk! Later games use master.lib's 8-bit
palette system, which keeps the comfort of using a single byte per
component, but shifts the actual hardware color into the top 4 bits, leaving
the bottom 4 bits for added precision during fades.
OK, but now we're done with TH01's bosses! 🎉 That was the
8th PC-98 Touhou boss in total, leaving 23 to go.
With all the necessary research into these quirks going well into a fifth
push, I spent the remaining time in that one transferring most of the
data between YuugenMagan and the upcoming rest of REIIDEN.EXE
into C land. This included the one piece of technical debt in TH01 we've
been carrying around since March 2015, as well as the final piece of the
ending sequence in FUUIN.EXE. Decompiling that executable's
main() function in a meaningful way requires pretty much all
remaining data from REIIDEN.EXE to also be moved into C land,
just in case you were wondering why we're stuck at 99.46% there.
On a more disappointing note, the static initialization code for the
📝 5 boss entity slots ultimately revealed why
YuugenMagan's code is as bloated and redundant as it is: The 5 slots really
are 5 distinct variables rather than a single 5-element array. That's why
ZUN explicitly spells out all 5 eyes every time, because the array he could
have just looped over simply didn't exist. 😕 And while these slot variables
are stored in a contiguous area of memory that I could just have
taken the address of and then indexed it as if it were an array, I
didn't want to annoy future port authors with what would technically be
out-of-bounds array accesses for purely stylistic reasons. At least it
wasn't that big of a deal to rewrite all boss code to use these distinct
variables, although I certainly had to get a bit creative with Elis.
Next up: Finding out how many points we got in totle, and hoping that ZUN
didn't hide more unexpected complexities in the remaining 45 functions of
this game. If you have some money to spare, there are two ways
in which it would help right now:
I'm expecting another subscription transaction
from Yanga before the 15th, which would leave enough to
round out one final TH01 RE push. With that, there'd be a total of 5 left in
the backlog, which should be enough to get the rest of this game done.
I really need to address the performance and usability issues
with all the small videos in this blog. Just look at the video immediately
above, where I disabled the controls because they would cover the debug text
at the bottom… Edit (2022-10-31): …which is no longer an
issue with our 📝 custom video player.
I already reserved this month's anonymous contribution for this work, so it would take another contribution to turn it into a full push.
What's this? A simple, straightforward, easy-to-decompile TH01 boss with
just a few minor quirks and only two rendering-related ZUN bugs? Yup, 2½
pushes, and Kikuri was done. Let's get right into the overview:
Just like 📝 Elis, Kikuri's fight consists
of 5 phases, excluding the entrance animation. For some reason though, they
are numbered from 2 to 6 this time, skipping phase 1? For consistency, I'll
use the original phase numbers from the source code in this blog post.
The main phases (2, 5, and 6) also share Elis' HP boundaries of 10, 6,
and 0, respectively, and are once again indicated by different colors in the
HP bar. They immediately end upon reaching the given number of HP, making
Kikuri immune to the
📝 heap corruption in test or debug mode that can happen with Elis and Konngara.
Phase 2 solely consists of the infamous big symmetric spiral
pattern.
Phase 3 fades Kikuri's ball of light from its default bluish color to bronze over 100 frames. Collision detection is deactivated
during this phase.
In Phase 4, Kikuri activates her two souls while shooting the spinning
8-pellet circles from the previously activated ball. The phase ends shortly
after the souls fired their third spread pellet group.
Note that this is a timed phase without an HP boundary, which makes
it possible to reduce Kikuri's HP below the boundaries of the next
phases, effectively skipping them. Take this video for example,
where Kikuri has 6 HP by the end of Phase 4, and therefore directly
starts Phase 6.
(Obviously, Kikuri's HP can also be reduced to 0 or below, which will
end the fight immediately after this phase.)
Phase 5 combines the teardrop/ripple "pattern" from the souls with the
"two crossed eye laser" pattern, on independent cycles.
Finally, Kikuri cycles through her remaining 4 patterns in Phase 6,
while the souls contribute single aimed pellets every 200 frames.
Interestingly, all HP-bounded phases come with an additional hidden
timeout condition:
Phase 2 automatically ends after 6 cycles of the spiral pattern, or
5,400 frames in total.
Phase 5 ends after 1,600 frames, or the first frame of the
7th cycle of the two crossed red lasers.
If you manage to keep Kikuri alive for 29 of her Phase 6 patterns,
her HP are automatically set to 1. The HP bar isn't redrawn when this
happens, so there is no visual indication of this timeout condition even
existing – apart from the next Orb hit ending the fight regardless of
the displayed HP. Due to the deterministic order of patterns, this
always happens on the 8th cycle of the "symmetric gravity
pellet lines from both souls" pattern, or 11,800 frames. If dodging and
avoiding orb hits for 3½ minutes sounds tiring, you can always watch the
byte at DS:0x1376 in your emulator's memory viewer. Once
it's at 0x1E, you've reached this timeout.
So yeah, there's your new timeout challenge.
The few issues in this fight all relate to hitboxes, starting with the main
one of Kikuri against the Orb. The coordinates in the code clearly describe
a hitbox in the upper center of the disc, but then ZUN wrote a < sign
instead of a > sign, resulting in an in-game hitbox that's not
quite where it was intended to be…
Kikuri's actual hitbox.
Since the Orb sprite doesn't change its shape, we can visualize the
hitbox in a pixel-perfect way here. The Orb must be completely within
the red area for a hit to be registered.
Much worse, however, are the teardrop ripples. It already starts with their
rendering routine, which places the sprites from TAMAYEN.PTN at byte-aligned VRAM positions in the ultimate piece of if(…) {…}
else if(…) {…} else if(…) {…} meme code. Rather than
tracking the position of each of the five ripple sprites, ZUN suddenly went
purely functional and manually hardcoded the exact rendering and collision
detection calls for each frame of the animation, based on nothing but its
total frame counter.
Each of the (up to) 5 columns is also unblitted and blitted individually
before moving to the next column, starting at the center and then
symmetrically moving out to the left and right edges. This wouldn't be a
problem if ZUN's EGC-powered unblitting function didn't word-align its X
coordinates to a 16×1 grid. If the ripple sprites happen to start at an
odd VRAM byte position, their unblitting coordinates get rounded both down
and up to the nearest 16 pixels, thus touching the adjacent 8 pixels of the
previously blitted columns and leaving the well-known black vertical bars in
their place.
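With made-up numbers for one such column, the rounding works out like this:
void unblit_ripple_column_example(void)
{
    int left  = 328;        // 328 / 8 = byte 41, i.e., an odd VRAM byte position
    int right = (328 + 32); // assuming a 32-pixel-wide sprite column

    int unblit_left  = ((left / 16) * 16);         // rounded down → 320
    int unblit_right = (((right + 15) / 16) * 16); // rounded up   → 368

    // The resulting 48-pixel-wide rectangle erases 8 pixels of the previously
    // blitted column on either side, which stay black until those columns are
    // blitted again.
}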
OK, so where's the hitbox issue here? If you just look at the raw
calculation, it's a slightly confusingly expressed, but perfectly logical 17
pixels. But this is where byte-aligned blitting has a direct effect on
gameplay: These ripples can be spawned at any arbitrary, non-byte-aligned
VRAM position, and collisions are calculated relative to this internal
position. Therefore, the actual hitbox is shifted up to 7 pixels to the
right, compared to where you would expect it from a ripple sprite's
on-screen position:
Due to the deterministic nature of this part of the fight, it's
always 5 pixels for this first set of ripples. These visualizations are
obviously not pixel-perfect due to the different potential shapes of
Reimu's sprite, so they instead relate to her 32×32 bounding box, which
needs to be entirely inside the red
area.
We've previously seen the same issue with the
📝 shot hitbox of Elis' bat form, where
pixel-perfect collision detection against a byte-aligned sprite was merely a
sidenote compared to the more serious X=Y coordinate bug. So why do I
elevate it to bug status here? Because it directly affects dodging: Reimu's
regular movement speed is 4 pixels per frame, and with the internal position
of an on-screen ripple sprite varying by up to 7 pixels, any micrododging
(or "grazing") attempt turns into a coin flip. It's sort of mitigated
by the fact that Reimu is also only ever rendered at byte-aligned
VRAM positions, but I wouldn't say that these two bugs cancel out each
other.
Oh well, another set of rendering issues to be fixed in the hypothetical
Anniversary Edition – obviously, the hitboxes should remain unchanged. Until
then, you can always memorize the exact internal positions. The sequence of
teardrop spawn points is completely deterministic and only controlled by the
fixed per-difficulty spawn interval.
Aside from more minor coordinate inaccuracies, there's not much of interest
in the rest of the pattern code. In another parallel to Elis though, the
first soul pattern in phase 4 is aimed on every difficulty except
Lunatic, where the pellets are once again statically fired downwards. This
time, however, the pattern's difficulty is much more appropriately
distributed across the four levels, with the simultaneous spinning circle
pellets adding a constant aimed component to every difficulty level.
Kikuri's phase 4 patterns, on every difficulty.
That brings us to 5 fully decompiled PC-98 Touhou bosses, with 26 remaining…
and another ½ of a push going to the cutscene code in
FUUIN.EXE.
You wouldn't expect something as mundane as the boss slideshow code to
contain anything interesting, but there is in fact a slight bit of
speculation fuel there. The text typing functions take explicit string
lengths, which precisely match the corresponding strings… for the most part.
For the "Gatekeeper 'SinGyoku'" string though, ZUN passed 23
characters, not 22. Could that have been the "h" from the Hepburn
romanization of 神玉?!
Also, come on, if this text is already blitted to VRAM for no reason,
you could have gone for perfect centering at unaligned byte positions; the
rendering function would have perfectly supported it. Instead, the X
coordinates are still rounded up to the nearest byte.
The hardcoded ending cutscene functions should be even less interesting –
don't they just show a bunch of images followed by frame delays? Until they
don't, and we reach the 地獄/Jigoku Bad Ending with
its special shake/"boom" effect, and this picture:
Picture #2 from ED2A.GRP.
Which is rendered by the following code:
for(int i = 0; i <= boom_duration; i++) { // (yes, off-by-one)
    if((i & 3) == 0) {
        graph_scrollup(8);
    } else {
        graph_scrollup(0);
    }
    end_pic_show(1); // ← different picture is rendered
    frame_delay(2);  // ← blocks until 2 VSync interrupts have occurred
    if(i & 1) {
        end_pic_show(2); // ← picture above is rendered
    } else {
        end_pic_show(1);
    }
}
Notice something? You should never see this picture because it's
immediately overwritten before the frame is supposed to end. And yet
it's clearly flickering up for about one frame with common emulation
settings as well as on my real PC-9821 Nw133, clocked at 133 MHz.
master.lib's graph_scrollup() doesn't block until VSync either,
and removing these calls doesn't change anything about the blitted images.
end_pic_show() uses the EGC to blit the given 320×200 quarter
of VRAM from page 1 to the visible page 0, so the bottleneck shouldn't be
there either…
…or should it? After setting it up via a few I/O port writes, the common
method of EGC-powered blitting works like this:
Read 16 bits from the source VRAM position on any single
bitplane. This fills the EGC's 4 16-bit tile registers with the VRAM
contents at that specific position on every bitplane. You do not care
about the value the CPU returns from the read – in optimized code, you would
make sure to just read into a register to avoid useless additional stores
into local variables.
Write any 16 bits
to the target VRAM position on any single bitplane. This copies the
contents of the EGC's tile registers to that specific position on
every bitplane.
To transfer pixels from one VRAM page to another, you insert an additional
write to I/O port 0xA6 before 1) and 2) to set your source and
destination page… and that's where we find the bottleneck. Taking a look at
the i486 CPU and its cycle
counts, a single one of these page switches costs 17 cycles – 1 for
MOVing the page number into AL, and 16 for the
OUT instruction itself. Therefore, the 8,000 page switches
required for EGC-copying a 320×200-pixel image require 136,000 cycles in
total.
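Expressed as hypothetical Turbo C++ code, the per-word inner loop of such a page-1 → page-0 copy therefore looks like this, with the EGC itself already set up via its other I/O ports:
#include <dos.h>

void egc_copy_word_1_to_0(unsigned vram_offset)
{
    volatile unsigned far *plane = (unsigned far *)MK_FP(0xA800, vram_offset);
    unsigned dummy;

    outportb(0xA6, 1); // switch CPU access to VRAM page 1
                       // (1 cycle for loading the page number + 16 for OUT)
    dummy = *plane;    // read: fills the EGC's tile registers from all 4 planes
    outportb(0xA6, 0); // switch back to the visible page 0 (another 17 cycles)
    *plane = dummy;    // write: the value is ignored, the tile registers land
                       // on all 4 planes
}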
And that's the optimal case of using only those two
instructions. 📝 As I implied last time, TH01
uses a function call for VRAM page switches, complete with creating
and destroying a useless stack frame and unnecessarily updating a global
variable in main memory. I tried optimizing ZUN's code by throwing out
unnecessary code and using 📝 pseudo-registers
to generate probably optimal assembly code, and that did speed up the
blitting to almost exactly 50% of the original version's run time. However,
it did little about the flickering itself. Here's a comparison of the first
loop with boom_duration = 16, recorded in DOSBox-X with
cputype=auto and cycles=max, and with
i overlaid using the text chip. Caution, flashing lights:
The original animation, completing in 50 frames instead of the expected
34, thanks to slow blitting. Combined with the lack of
double-buffering, this results in noticeable tearing as the screen
refreshes while blitting is still in progress.
(Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic
#1.)
This optimized version completes in the expected 34 frames. No tearing
happens to be visible in this recording, but the ドカーン image is still visible on every
second loop iteration. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic
#1.)
I pushed the optimized code to the th01_end_pic_optimize
branch, to also serve as an example of how to get close to optimal code out
of Turbo C++ 4.0J without writing a single ASM instruction.
And if you really want to use the EGC for this, that's the best you can do.
It really sucks that it merely expanded the GRCG's 4×8-bit tile register to
4×16 bits. With 32 bits, ≥386 CPUs could have taken advantage of their wider
registers and instructions to double the blitting performance. Instead, we
now know the reason why
📝 Promisence Soft's EGC-powered sprite driver that ZUN later stole for TH03
is called SPRITE16 and not SPRITE32. What a massive disappointment.
But what's perhaps a bigger surprise: Blitting planar
images from main memory is much faster than EGC-powered inter-page
VRAM copies, despite the required manual access to all 4 bitplanes. In
fact, the blitting functions for the .CDG/.CD2 format, used from TH03
onwards, would later demonstrate the optimal method of using REP
MOVSD for blitting every line in 32-pixel chunks. If that was also
used for these ending images, the core blitting operation would have taken
((12 + (3 × (320 / 32))) × 200 × 4) =
33,600 cycles, with not much more overhead for the surrounding row
and bitplane loops. Sure, this doesn't factor in the whole infamous issue of
VRAM being slow on PC-98, but the aforementioned 136,000 cycles don't even
include any actual blitting either. And as you move up to later PC-98
models with Pentium CPUs, the gap between OUT and REP
MOVSD only becomes larger. (Note that the page I linked above has a
typo in the cycle count of REP MOVSD on Pentium CPUs: According
to the original Intel Architecture and Programming Manual, it's
13+𝑛, not 3+𝑛.)
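For comparison, here's a made-up sketch of such a main-memory blit. Turbo C++ itself will at best turn the copy into REP MOVSW, so the true 32-bit inner loop would still have to be written in ASM, but the surrounding structure really is all there is to it:
#include <dos.h>
#include <mem.h>

static const unsigned PLANE_SEGMENTS[4] = { 0xA800, 0xB000, 0xB800, 0xE000 };

// Copies a 320×200 image from main memory to the top-left corner of the four
// PC-98 bitplanes, assuming the image is stored plane by plane, row by row,
// with no padding.
void blit_320x200_from_main_memory(const unsigned char far *image)
{
    for(int plane = 0; plane < 4; plane++) {
        unsigned char far *vram =
            (unsigned char far *)MK_FP(PLANE_SEGMENTS[plane], 0);
        for(int row = 0; row < 200; row++) {
            _fmemcpy(vram, image, (320 / 8)); // 40 bytes = 10 32-bit chunks
            vram  += (640 / 8); // VRAM rows are 80 bytes apart
            image += (320 / 8);
        }
    }
}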
This difference explains why later games rarely use EGC-"accelerated"
inter-page VRAM copies, and keep all of their larger images in main memory.
It especially explains why TH04 and TH05 can get away with naively redrawing
boss backdrop images on every frame.
In the end, the whole fact that ZUN did not define how long this image
should be visible is enough for me to increment the game's overall bug
counter. Who would have thought that looking at endings of all things
would teach us a PC-98 performance lesson… Sure, optimizing TH01 already
seemed promising just by looking at its bloated code, but I had no idea that
its performance issues extended so far past that level.
That only leaves the common beginning part of all endings and a short
main() function before we're done with FUUIN.EXE,
and 98 functions until all of TH01 is decompiled! Next up: SinGyoku, who not
only is the quickest boss to defeat in-game, but also comes with the least
amount of code. See you very soon!
TH05 has passed the 50% RE mark, with both MAIN.EXE and the
game as a whole! With that, we've also reached what -Tom-
wanted out of the project, so he's suspending his discount offer for a
bit.
Curve bullets are now officially called cheetos! 76.7% of
fans prefer this term, and it fits into the 8.3 DOS filename scheme much
better than homing lasers (as they're called in
OMAKE.TXT) or Taito
lasers (which would indeed have made sense as well).
…oh, and I managed to decompile Shinki within 2 pushes after all. That
left enough budget to also add the Stage 1 midboss on top.
So, Shinki! As far as final boss code is concerned, she's surprisingly
economical, with 📝 her background animations
making up more than ⅓ of her entire code. Going straight from TH01's
📝 final 📝 bosses
to TH05's final boss definitely showed how much ZUN had streamlined
danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there
is still room for improvement: TH05 not only
📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month,
but also uses them 4× as often, and even for midbosses. Most importantly
though, defining danmaku patterns using a single global instance of the
group template structure is just bad no matter how you look at it:
The script code ends up rather bloated, with a single MOV
instruction for setting one of the fields taking up 5 bytes. By comparison,
the entire structure for regular bullets is 14 bytes large, while the
template structure for Shinki's 32×32 ball bullets could have easily been
reduced to 8 bytes.
Since it's also one piece of global state, you can easily forget to set
one of the required fields for a group type. The resulting danmaku group
then reuses these values from the last time they were set… which might have
been as far back as another boss fight from a previous stage.
And of course, I wouldn't point this out if it
didn't actually happen in Shinki's pattern code. Twice.
Declaring a separate structure instance with the static data for every
pattern would be both safer and more space-efficient, and there's
more than enough space left for that in the game's data segment.
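In heavily simplified, made-up code – the real template structure has more fields than this – that suggestion would look something like the following:
struct ball_template_t {
    unsigned char group_type;
    unsigned char bullet_count;
    unsigned char speed;
    unsigned char angle;
};

void balls_add_group(const ball_template_t *group); // hypothetical spawn call

// One const instance per pattern, assembled at compile time in the data
// segment… (placeholder values)
static const ball_template_t DEVIL_PATTERN_RING = { 0, 16, 0x20, 0x00 };

void devil_pattern(void)
{
    // …and passed as a whole, instead of setting each field of a single
    // global template via its own 5-byte MOV and hoping that no field from a
    // previous pattern leaks through.
    balls_add_group(&DEVIL_PATTERN_RING);
}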
But all in all, the pattern functions are short, sweet, and easy to follow.
The "devil"
pattern is significantly more complex than the others, but still
far from TH01's final bosses at their worst. I especially like the clear
architectural separation between "one-shot pattern" functions that return
true once they're done, and "looping pattern" functions that
run as long as they're being called from a boss's main function. Not many
all too interesting things in these pattern functions for the most part,
except for two pieces of evidence that Shinki was coded after Yumeko:
The gather animation function in the first two phases contains a bullet
group configuration that looks like it's part of an unused danmaku
pattern. It quickly turns out to just be copy-pasted from a similar function
in Yumeko's fight though, where it is turned into actual
bullets.
As one of the two places where ZUN forgot to set a template field, the
lasers at the end of the white wing preparation pattern reuse the 6-pixel
width of Yumeko's final laser pattern. This actually has an effect on
gameplay: Since these lasers are active for the first 8 frames after
Shinki's wings appear on screen, the player can get hit by them in the last
2 frames after they grew to their final width.
Of course, there are more than enough safespots between the lasers.
Speaking about that wing sprite: If you look at ST05.BB2 (or
any other file with a large sprite, for that matter), you notice a rather
weird file layout:
A large sprite split into multiple smaller ones with a width of
64 pixels each? What's this, hardware sprite limitations? On my
PC-98?!
And it's not a limitation of the sprite width field in the BFNT+ header
either. Instead, it's master.lib's BFNT functions which are limited to
sprite widths up to 64 pixels… or at least that's what
MASTER.MAN claims. Whatever the restriction was, it seems to be
completely nonexistent as of master.lib version 0.23, and none of the
master.lib functions used by the games have any issues with larger
sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the
game that expects Shinki's winged form to consist of 4 physical
sprites, not just 1. Any conversion from another, more logical sprite sheet
layout back into BFNT+ must therefore replicate the original number of
sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly
loaded sprite no longer match ZUN's hardcoded IDs, causing the game to
crash. This is exactly what used to happen with -Tom-'s
MysticTK automation scripts,
which combined these exact sprites into a single large one. This issue has
now been fixed – just in case there are some underground modders out there
who used these scripts and wonder why their game crashed as soon as the
Shinki fight started.
And then the code quality takes a nosedive with Shinki's main function.
Even in TH05, these boss and midboss update
functions are still very imperative:
The origin point of all bullet types used by a boss must be manually set
to the current boss/midboss position; there is no concept of a bullet type
tracking a certain entity.
The same is true for the target point of a player's homing shots…
… and updating the HP bar. At least the initial fill animation is
abstracted away rather decently.
Incrementing the phase frame variable also must be done manually. TH05
even "innovates" here by giving the boss update function exclusive ownership
of that variable, in contrast to TH04 where that ownership is given out to
the player shot collision detection (?!) and boss defeat helper
functions.
Speaking about collision detection: That is done by calling different
functions depending on whether the boss is supposed to be invincible or
not.
Timeout conditions? No standard way either, and all done with manual
if statements. In combination with the regular phase end
condition of lowering (mid)boss HP to a certain value, this leads to quite a
convoluted control flow.
The manual calls to the score bonus functions for cleared phases at least provide some sense of orientation.
One potentially nice aspect of all this imperative freedom is that
phases can end outside of HP boundaries… by manually incrementing the
phase variable and resetting the phase frame variable to 0.
The biggest WTF in there, however, goes to using one of the 16 state bytes
as a "relative phase" variable for differentiating between boss phases that
share the same branch within the switch(boss.phase)
statement. While it's commendable that ZUN tried to reduce code duplication
for once, he could have just branched depending on the actual
boss.phase variable? The same state byte is then reused in the
"devil" pattern to track the activity state of the big jerky lasers in the
second half of the pattern. If you somehow managed to end the phase after
the first few bullets of the pattern, but before these lasers are up,
Shinki's update function would think that you're still in the phase
before the "devil" pattern. The main function then sequence-breaks
right to the defeat phase, skipping the final pattern with the burning Makai
background. Luckily, the HP boundaries are far away enough to make this
impossible in practice.
The takeaway here: If you want to use the state bytes for your custom
boss script mods, alias them to your own 16-byte structure, and limit each
of the bytes to a clearly defined meaning across your entire boss script.
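Sketched out with made-up names, that could be as simple as:
extern unsigned char boss_state_bytes[16]; // wherever the 16 bytes live

// Give every byte exactly one meaning for the entire script…
struct my_boss_state_t {
    unsigned char relative_phase;
    unsigned char lasers_active;
    unsigned char pattern_cycle;
    unsigned char reserved[13]; // keep the total at 16 bytes
};

// …and only ever access them through this alias.
#define MY_BOSS_STATE (*(struct my_boss_state_t *)boss_state_bytes)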
One final discovery that doesn't seem to be documented anywhere yet: Shinki
actually has a hidden bomb shield during her two purple-wing phases.
uth05win got this part slightly wrong though: It's not a complete
shield, and hitting Shinki will still deal 1 point of chip damage per
frame. For comparison, the first phase lasts for 3,000 HP, and the "devil"
pattern phase lasts for 5,800 HP.
And there we go, 3rd PC-98 Touhou boss
script* decompiled, 28 to go! 🎉 In case you were expecting a fix for
the Shinki death glitch: That one
is more appropriately fixed as part of the Mai & Yuki script. It also
requires new code, should ideally look a bit prettier than just removing
cheetos between one frame and the next, and I'd still like it to fit within
the original position-dependent code layout… Let's do that some other
time.
Not much to say about the Stage 1 midboss, or midbosses in general even,
except that their update functions have to imperatively handle even more
subsystems, due to the relative lack of helper functions.
The remaining ¾ of the third push went to a bunch of smaller RE and
finalization work that would have hardly got any attention otherwise, to
help secure that 50% RE mark. The nicest piece of code in there shows off
what looks like the optimal way of setting up the
📝 GRCG tile register for monochrome blitting
in a variable color:
mov ah, palette_index ; Any other non-AL 8-bit register works too.
; (x86 only supports AL as the source operand for OUTs.)
rept 4 ; For all 4 bitplanes…
shr ah, 1 ; Shift the next color bit into the x86 carry flag
sbb al, al ; Extend the carry flag to a full byte
; (CF=0 → 0x00, CF=1 → 0xFF)
out 7Eh, al ; Write AL to the GRCG tile register
endm
Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles
into a surprisingly nice one-liner. What a beautiful micro-optimization, at
a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there,
becoming increasingly dumb and undecompilable. Was it really necessary to
save 4 x86 instructions in the highly unlikely case of a new spark sprite
being spawned outside the playfield? That one 2D polar→Cartesian
conversion function then pointed out Turbo C++ 4.0J's woefully limited
support for 32-bit micro-optimizations. The code generation for 32-bit
📝 pseudo-registers is so bad that they almost
aren't worth using for arithmetic operations, and the inline assembler just
flat out doesn't support anything 32-bit. No use in decompiling a function
that you'd have to entirely spell out in machine code, especially if the
same function already exists in multiple other, more idiomatic C++
variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC
replay file reading code, which should finally prove that nothing about the
game's original replay system could serve as even just the foundation for
community-usable replays. Just in case anyone was still thinking that.
Next up: Back to TH01, with the Elis fight! Got a bit of room left in the
cap again, and there are a lot of things that would make a lot of
sense now:
TH04 would really enjoy a large number of dedicated pushes to catch up
with TH05. This would greatly support the finalization of both games.
Continuing with TH05's bosses and midbosses has shown to be good value
for your money. Shinki would have taken even less than 2 pushes if she
hadn't been the first boss I looked at.
Oh, and I also added Seihou as a selectable goal, for the two people out
there who genuinely like it. If I ever want to quit my day job, I need to
branch out into safer territory that isn't threatened by takedowns, after
all.
Did you know that moving on top of a boss sprite doesn't kill the player in
TH04, only in TH05?
Yup, Reimu is not getting hit… yet.
That's the first of only three interesting discoveries in these 3 pushes,
all of which concern TH04. But yeah, 3 for something as seemingly simple as
these shared boss functions… that's still not quite the speed-up I had hoped
for. While most of this can be blamed, again, on TH04 and all of its
hardcoded complexities, there still was a lot of work to be done on the
maintenance front as well. These functions reference a bunch of code I RE'd
years ago and that still had to be brought up to current standards, with the
dependencies reaching from 📝 boss explosions
over 📝 text RAM overlay functionality up to
in-game dialog loading.
The latter provides a good opportunity to talk a bit about x86 memory
segmentation. Many aspiring PC-98 developers these days are very scared
of it, with some even going as far as to rather mess with Protected Mode and
DOS extenders just so that they don't have to deal with it. I wonder where
that fear comes from… Could it be because every modern programming language
I know of assumes memory to be flat, and lacks any standard language-level
features to even express something like segments and offsets? That's why
compilers have a hard time targeting 16-bit x86 these days: Doing anything
interesting on the architecture requires giving the programmer full
control over segmentation, which always comes down to adding the
typical non-standard language extensions of compilers from back in the day.
And as soon as DOS stopped being used, these extensions no longer made sense
and were subsequently removed from newer tools. A good example for this can
be found in an old version of the
NASM manual: The project started as an attempt to make x86 assemblers
simple again by throwing out most of the segmentation features from
MASM-style assemblers, which made complete sense in 1996 when 16-bit DOS and
Windows were already on their way out. But there was a point to all
those features, and that's why ReC98 still has to use the supposedly
inferior TASM.
Not that this fear of segmentation is completely unfounded: All the
segmentation-related keywords, directives, and #pragmas
provided by Borland C++ and TASM absolutely can be the cause of many
weird runtime bugs. Even if the compiler or linker catches them, you are
often left with confusing error messages that aged just as poorly as memory
segmentation itself.
However, embracing the concept does provide quite the opportunity for
optimizations. While it definitely was a very crazy idea, there is a small
bit of brilliance to be gained from making proper use of all these
segmentation features. Case in point: The buffer for the in-game dialog
scripts in TH04 and TH05.
// Thanks to the semantics of `far` pointers, we only need a single 32-bit
// pointer variable for the following code.
extern unsigned char far *dialog_p;
// This master.lib function returns a `void __seg *`, which is a 16-bit
// segment-only pointer. Converting to a `far *` yields a full segment:offset
// pointer to offset 0000h of that segment.
dialog_p = (unsigned char far *)hmem_allocbyte(/* … */);
// Running the dialog script involves pointer arithmetic. On a far pointer,
// this only affects the 16-bit offset part, complete with overflow at 64 KiB,
// from FFFFh back to 0000h.
dialog_p += /* … */;
dialog_p += /* … */;
dialog_p += /* … */;
// Since the segment part of the pointer is still identical to the one we
// allocated above, we can later correctly free the buffer by pulling the
// segment back out of the pointer.
hmem_free((void __seg *)dialog_p);
If dialog_p was a huge pointer, any pointer
arithmetic would have also adjusted the segment part, requiring a second
pointer to store the base address for the hmem_free call. Doing
that will also be necessary for any port to a flat memory model. Depending
on how you look at it, this compression of two logical pointers into a
single variable is either quite nice, or really, really dumb in its
reliance on the precise memory model of one single architecture.
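For comparison, handling the same allocation through a huge pointer – reusing the master.lib calls from above – would immediately require that second variable:
void __seg *dialog_buf = hmem_allocbyte(/* … */); // must be remembered now
unsigned char huge *dialog_p = (unsigned char huge *)dialog_buf;

// Huge pointer arithmetic carries into the segment part instead of wrapping
// the offset at 64 KiB…
dialog_p += /* … */;
dialog_p += /* … */;

// …so the original allocation can no longer be recovered from dialog_p alone.
hmem_free(dialog_buf);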
Why look at dialog loading though, wasn't this supposed to be all about
shared boss functions? Well, TH04 unnecessarily puts certain stage-specific
code into the boss defeat function, such as loading the alternate Stage 5
Yuuka defeat dialog before a Bad Ending, or initializing Gengetsu after
Mugetsu's defeat in the Extra Stage.
That's TH04's second core function with an explicit conditional branch for
Gengetsu, after the
📝 dialog exit code we found last year during EMS research.
And I've heard people say that Shinki was the most hardcoded fight in PC-98
Touhou… Really, Shinki is a perfectly regular boss, who makes proper use of
all internal mechanics in the way they were intended, and doesn't blast
holes into the architecture of the game. Even within TH05, it's Mai and Yuki
who rely on hacks and duplicated code, not Shinki.
The worst part about this though? How the function distinguishes Mugetsu
from Gengetsu. Once again, it uses its own global variable to track whether
it is called the first or the second time within TH04's Extra Stage,
unrelated to the same variable used in the dialog exit function. But this
time, it's not just any newly created, single-use variable, oh no. In a
misguided attempt to micro-optimize away a few bytes of conventional memory,
TH04 reserves 16 bytes of "generic boss state", which can (and are) freely
used for anything a boss doesn't want to store in a more dedicated
variable.
It might have been worth it if the bosses actually used most of these
16 bytes, but the majority just use (the same) two, with only Stage 4 Reimu
using a whopping seven different ones. To reverse-engineer the various uses
of these variables, I pretty much had to map out which of the undecompiled
danmaku-pattern functions corresponds to which boss
fight. In the end, I assigned 29 different variable names for each of the
semantically different use cases, which made up another full push on its
own.
Now, 16 bytes of wildly shared state, isn't that the perfect recipe for
bugs? At least during this cursory look, I haven't found any obvious ones
yet. If they do exist, it's more likely that they involve reused state from
earlier bosses – just like how the Shinki death glitch in
TH05 is caused by reusing cheeto data from way back in Stage 4 – and
hence require much more boss-specific progress.
And yes, it might have been way too early to look into all these tiny
details of specific boss scripts… but then, this happened:
Looks similar to another
screenshot of a crash in the same fight that was reported in December,
doesn't it? I was too much in a hurry to figure it out exactly, but notice
how both crashes happen right as the last of Marisa's four bits is destroyed.
KirbyComment has suspected
this to be the cause for a while, and now I can pretty much confirm it
to be an unguarded division by the number of on-screen bits in
Marisa-specific pattern code. But what's the cause for Kurumi then?
As for fixing it, I can go for either a fast or a slow option:
Superficially fixing only this crash will probably just take a fraction
of a push.
But I could also go for a deeper understanding by looking at TH04's
version of the 📝 custom entity structure. It
not only stores the data of Marisa's bits, but is also very likely to be
involved in Kurumi's crash, and would get TH04 a lot closer to 100%
PI. Taking that look will probably need at least 2 pushes, and might require
another 3-4 to completely decompile Marisa's fight, and 2-3 to decompile
Kurumi's.
OK, now that that's out of the way, time to finish the boss defeat function…
but not without stumbling over the third of TH04's quirks, relating to the
Clear Bonus for the main game or the Extra Stage:
To achieve the incremental addition effect for the in-game score display
in the HUD, all new points are first added to a score_delta
variable, which is then added to the actual score at a maximum rate of
61,110 points per frame.
There are a fixed 416 frames between showing the score tally and
launching into MAINE.EXE.
As a result, TH04's Clear Bonus is effectively limited to
(416 × 61,110) = 25,421,760 points.
Only TH05 makes sure to commit the entirety of the
score_delta to the actual score before switching binaries,
which fixes this issue.
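In code, the per-frame transfer described above boils down to something like this sketch – function and variable names are mine, not ZUN's:
#define SCORE_DELTA_PER_FRAME 61110L
long score = 0;       // The score shown in the HUD
long score_delta = 0; // All newly gained points are added here first
// Called once per frame: move at most 61,110 points from the delta into the
// displayed score. Whatever is still left in score_delta when MAINE.EXE takes
// over is simply lost – which is exactly the TH04 quirk described above.
void score_commit_chunk(void)
{
	long chunk = (score_delta < SCORE_DELTA_PER_FRAME)
		? score_delta
		: SCORE_DELTA_PER_FRAME;
	score += chunk;
	score_delta -= chunk;
}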
And after another few collision-related functions, we're now truly,
finally ready to decompile bosses in both TH04 and TH05! Just as the
anything funds were running out… The
remaining ¼ of the third push then went to Shinki's 32×32 ball bullets,
rounding out this delivery with a small self-contained piece of the first
TH05 boss we're probably going to look at.
Next up, though: I'm not sure, actually. Both Shinki and Elis seem just a
little bit larger than the 2¼ or 4 pushes purchased so far, respectively.
Now that there's a bunch of room left in the cap again, I'll just let the
next contribution decide – with a preference for Shinki in case of a tie.
And if it will take longer than usual for the store to sell out again this
time (heh), there's still the
📝 PC-98 text RAM JIS trail word rendering research
waiting to be documented.
Been 📝 a while since we last looked at any of
TH03's game code! But before that, we need to talk about Y coordinates.
During TH03's MAIN.EXE, the PC-98 graphics GDC runs in its
line-doubled 640×200 resolution, which gives the in-game portion its
distinctive stretched low-res look. This lower resolution is a consequence
of using 📝 Promisence Soft's SPRITE16 driver:
Its performance simply stems from the fact that it expects sprites to be
stored in the bottom half of VRAM, which allows them to be blitted using the
same EGC-accelerated VRAM-to-VRAM copies we've seen again and again in all
other games. Reducing the visible resolution also means that the sprites can
be stored on both VRAM pages, allowing the game to still be double-buffered.
If you force the graphics chip to run at 640×400, you can see them:
The full VRAM contents during TH03's in-game portion, as seen when forcing the system into a 640×400 resolution.
Note that the text chip still displays its overlaid contents at 640×400,
which means that TH03's in-game portion technically runs at two
resolutions at the same time.
But that means that any mention of a Y coordinate is ambiguous: Does it
refer to undoubled VRAM pixels, or on-screen stretched pixels? Especially
people who have known about the line doubling for years might almost expect
technical blog posts on this game to use undoubled VRAM coordinates. So,
let's introduce a new formatting convention for both on-screen
640×400 and undoubled 640×200 coordinates,
and always write out both to minimize the confusion.
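As a quick illustration of that convention (with made-up variable names), any on-screen 640×400 row index simply maps to half its value in undoubled 640×200 VRAM space:
int y_screen = 120;          // 640×400 on-screen pixels
int y_vram = (y_screen / 2); // = 60, in undoubled 640×200 VRAM pixels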
Alright, now what's the thing gonna be? The enemy structure is highly
overloaded, being used for enemies, fireballs, and explosions with seemingly
different semantics for each. Maybe a bit too much to be figured out in what
should ideally be a single push, especially with all the functions that
would need to be decompiled? Bullet code would be easier, but not exactly
single-push material either. As it turns out though, there's something more
fundamental left to be done first, which both of these subsystems depend on:
collision detection!
And it's implemented exactly how I always naively imagined collision
detection to be implemented in a fixed-resolution 2D bullet hell game with
small hitboxes: By keeping a separate 1bpp bitmap of both playfields in
memory, drawing in the collidable regions of all entities on every frame,
and then checking whether any pixels at the current location of the player's
hitbox are set to 1. It's probably not done in the other games because their
single data segment was already too packed for the necessary 17,664 bytes to
store such a bitmap at pixel resolution, and 282,624 bytes for a bitmap at
Q12.4 subpixel resolution would have been prohibitively expensive in 16-bit
Real Mode DOS anyway. In TH03, on the other hand, this bitmap is doubly
useful, as the AI also uses it to elegantly learn what's on the playfield.
By halving the resolution and only tracking tiles of 2×2 / 2×1 pixels, TH03 only requires an adequate total
of 6,624 bytes of memory for the collision bitmaps of both playfields.
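The player-side check then conceptually reduces to scanning a handful of bits. Here's a hedged sketch of that idea – the 1bpp row-major layout, the bitmap width, and all names are my own assumptions rather than ReC98 code:
#define COL_BITMAP_W 96 // Example width of one collision bitmap, in 2×2 / 2×1 tiles
// Returns 1 if any collidable bit is set within the player's hitbox rectangle,
// given in tile coordinates.
int player_is_hit(const unsigned char *col_bitmap, int left, int top, int w, int h)
{
	for(int y = top; y < (top + h); y++) {
		for(int x = left; x < (left + w); x++) {
			if(col_bitmap[(y * (COL_BITMAP_W / 8)) + (x / 8)] & (0x80 >> (x % 8))) {
				return 1;
			}
		}
	}
	return 0;
}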
So how did the implementation not earn the good-code tag
this time? Because the code for drawing into these bitmaps is undecompilable
hand-written x86 assembly. And not just your usual
ASM that was basically compiled from C and then edited to maybe optimize
register allocation and maybe replace a bunch of local variables with
self-modifying code, oh no. This code is full of overly clever bit
twiddling, abusing the fact that the 16-bit AX,
BX, CX, and DX registers can also be
accessed as two 8-bit registers, calculations that change the semantic
meaning behind the value of a register, or just straight-up reassignments of
different values to the same small set of registers. Sure, in some way it is
impressive, and it all does work and correctly covers every edge
case, but come on. This could have all been a lot more readable in
exchange for just a few CPU cycles.
What's most interesting though are the actual shapes that these functions
draw into the collision bitmap. On the surface, we have:
1) vertical slopes at any angle across the whole playfield; exclusively
used for Chiyuri's diagonal laser EX attack
2) straight vertical lines, with a width of 1 tile; exclusively used for
the 2×2 / 2×1 hitboxes of bullets
3) rectangles at arbitrary sizes
But only 2) actually draws a full solid line. 1) and 3) are only ever drawn
as horizontal stripes, with a hardcoded distance of 2 vertical tiles
between every stripe of a slope, and 4 vertical tiles between every stripe
of a rectangle. That's 66-75% of each rectangular entity's intended hitbox
not actually taking part in collision detection. Now, if player hitboxes
were ≤ 6 / 3 pixels, we'd have one
possible explanation of how the AI can "cheat", because it could just
precisely move through those blank regions at TAS speeds. So, let's make
this two pushes after all and tell the complete story, since this is one of
the more interesting aspects to still be documented in this game.
And the code only gets worse. While the player
collision detection function is decompilable, it might as well not
have been, because it's just more of the same "optimized", hard-to-follow
assembly. With the four splittable 16-bit registers having a total of 20
different meanings in this function, I would have almost preferred
self-modifying code…
In fact, it was so bad that it prompted some maintenance work on my inline
assembly coding standards as a whole. Turns out that the _asm
keyword is not only still supported in modern Visual Studio compilers, but
also in Clang with the -fms-extensions flag, and compiles fine
there even for 64-bit targets. While that might sound like amazing news at
first ("awesome, no need to rewrite this stuff for my x86_64 Linux
port!"), you quickly realize that almost all inline assembly in this
codebase either assumes PC-98 hardware or segmented 16-bit memory addressing,
or is a temporary hack that will be removed with further RE progress.
That's mainly because most of the raw arithmetic code uses Turbo C++'s
register pseudovariables where possible. While they certainly have their
drawbacks, being a non-standard extension that's not supported in other
x86-targeting C compilers, their advantages are quite significant: They
allow this code to stay in the same language, and provide slightly more
immediate portability to any other architecture, together with
📝 readability and maintainability improvements that can get quite significant when combined with inlining:
// This one line compiles to five ASM instructions, which would need to be
// spelled out in any C compiler that doesn't support register pseudovariables.
// By adding typed aliases for these registers via `#define`, this code can be
// both made even more readable, and be prepared for an easier transformation
// into more portable local variables.
_ES = (((_AX * 4) + _BX) + SEG_PLANE_B);
However, register pseudovariables might cause potential portability issues
as soon as they are mixed with inline assembly instructions that rely on
their state. The lazy way of "supporting pseudo-registers" in other
compilers would involve declaring the full set as global variables, which
would immediately break every one of those instances:
_DI = 0;
_AX = 0xFFFF;
// Special x86 instruction doing the equivalent of
//
// *reinterpret_cast<uint16_t far *>(MK_FP(_ES, _DI)) = _AX;
// _DI += sizeof(uint16_t);
//
// Only generated by Turbo C++ in very specific cases, and therefore only
// reliably available through inline assembly.
asm { stosw; }
What's also not all too standardized, though, are certain variants of
the asm keyword. That's why I've now introduced a distinction
between the _asm keyword for "decently sane" inline assembly,
and the slightly less standard asm keyword for inline assembly
that relies on the contents of pseudo-registers, and should break on
compilers that don't support them. So yeah, have some minor
portability work in exchange for these two pushes not having all that much
in RE'd content.
With that out of the way and the function deciphered, we can confirm the
player hitboxes to be a constant 8×8 /
8×4 pixels, and prove that the hit stripes are nothing but
an adequate optimization that doesn't affect gameplay in any way.
And what's the obvious thing to immediately do if you have both the
collision bitmap and the player hitbox? Writing a "real hitbox" mod, of
course:
Reorder the calls to rendering functions so that player and shot sprites
are rendered after bullets
Blank out all player sprite pixels outside an
8×8 / 8×4 box around the center
point
After the bullet rendering function, turn on the GRCG in RMW mode and
set the tile register set to the background color
Stretch the negated contents of the collision bitmap onto each playfield,
leaving only collidable pixels untouched
Do the same with the actual, non-negated contents and a white color, for
extra contrast against the background. This also makes sure to show any
collidable areas whose sprite pixels are transparent, such as with the moon
enemy. (Yeah, how unfair.) Doing that also loses a lot of information about
the playfield, such as enemy HP indicated by their color, but what can you
do:
A decently busy TH03 in-game frame and its underlying collision bitmap,
showing off all three different collision shapes together with the
player hitboxes.
2022-02-18-TH03-real-hitbox.zip
The secret for writing such mods before having reached a sufficient level of
position independence? Put your new code segment into DGROUP,
past the end of the uninitialized data section. That's why this modded
MAIN.EXE is a lot larger than you would expect from the raw amount of new code: The file now actually needs to store all these
uninitialized 0 bytes between the end of the data segment and the first
instruction of the mod code – normally, this number is simply a part of the
MZ EXE header, and doesn't need to be redundantly stored on disk. Check the
th03_real_hitbox
branch for the code.
And now we know why so many "real hitbox" mods for the Windows Touhou games
are inaccurate: The games would simply be unplayable otherwise – or can
you dodge rapidly moving 2×2 /
2×1 blocks as an 8×8 /
8×4 rectangle that is smaller than your shot sprites,
especially without focused movement? I can't.
Maybe it will feel more playable after making explosions visible, but that
would need more RE groundwork first.
It's also interesting how adding two full GRCG-accelerated redraws of both
playfields per frame doesn't significantly drop the game's frame rate – so
why did the drawing functions have to be micro-optimized again? It
would be possible in one pass by using the GRCG's TDW mode, which
should theoretically be 8× faster, but I have to stop somewhere.
Next up: The final missing piece of TH04's and TH05's
bullet-moving code, which will include a certain other
type of projectile as well.
Here we go, TH01 Sariel! This is the single biggest boss fight in all of
PC-98 Touhou: If we include all custom effect code we previously decompiled,
it amounts to a total of 10.31% of all code in TH01 (and 3.14%
overall). These 8 pushes cover the final 8.10% (or 2.47% overall),
and are likely to be the single biggest delivery this project will ever see.
Considering that I only managed to decompile 6.00% across all games in 2021,
2022 is already off to a much better start!
So, how can Sariel's code be that large? Well, we've got:
16 danmaku patterns; including the one snowflake detonating into a giant
94×32 hitbox
Gratuitous usage of floating-point variables, bloating the binary thanks
to Turbo C++ 4.0J's particularly horrid code generation
The hatching birds that shoot pellets
3 separate particle systems, sharing the general idea, overall code
structure, and blitting algorithm, but differing in every little detail
The "gust of wind" background transition animation
5 sets of custom monochrome sprite animations, loaded from
BOSS6GR?.GRC
A further 3 hardcoded monochrome 8×8 sprites for the "swaying leaves"
pattern during the second form
In total, it's just under 3,000 lines of C++ code, containing a total of 8
definite ZUN bugs, 3 of them being subpixel/pixel confusions. That might not
look all too bad if you compare it to the
📝 player control function's 8 bugs in 900 lines of code,
but given that Konngara had 0… (Edit (2022-07-17):
Konngara contains two bugs after all: A
📝 possible heap corruption in test or debug mode,
and the infamous
📝 temporary green discoloration.)
And no, the code doesn't make it obvious whether ZUN coded Konngara or
Sariel first; there's just as much evidence for either.
Some terminology before we start: Sariel's first form is separated
into four phases, indicated by different background images, that
cycle until Sariel's HP reach 0 and the second, single-phase form
starts. The danmaku patterns within each phase are also on a cycle,
and the game picks a random but limited number of patterns per phase before
transitioning to the next one. The fight always starts at pattern 1 of phase
1 (the random purple lasers), and each new phase also starts at its
respective first pattern.
Sariel's bugs already start at the graphics asset level, before any code
gets to run. Some of the patterns include a wand raise animation, which is
stored in BOSS6_2.BOS:
Umm… OK? The same sprite twice, just with slightly different
colors? So how is the wand lowered again?
The "lowered wand" sprite is missing in this file simply because it's
captured from the regular background image in VRAM, at the beginning of the
fight and after every background transition. What I previously thought to be
📝 background storage code has therefore a
different meaning in Sariel's case. Since this captured sprite is fully
opaque, it will reset the entire 128×128 wand area… wait, 128×128, rather
than 96×96? Yup, this lowered sprite is larger than necessary, wasting 1,967
bytes of conventional memory. That still doesn't quite explain the
second sprite in BOSS6_2.BOS though. Turns out that the black
part is indeed meant to unblit the purple reflection (?) in the first
sprite. But… that's not how you would correctly unblit that?
The first sprite already eats up part of the red HUD line, and the second
one additionally fails to recover the seal pixels underneath, leaving a nice
little black hole and some stray purple pixels until the next background
transition. Quite ironic given that both
sprites do include the right part of the seal, which isn't even part of the
animation.
Just like Konngara, Sariel continues the approach of using a single function
per danmaku pattern or custom entity. While I appreciate that this allows
all pattern- and entity-specific state to be scoped locally to that one
function, it quickly gets ugly as soon as such a function has to do more than one thing.
The "bird function" is particularly awful here: It's just one if(…)
{…} else if(…) {…} else if(…) {…} chain with different
branches for the subfunction parameter, with zero shared code between any of
these branches. It also uses 64-bit floating-point double as
its subpixel type… and since it also takes four of those as parameters
(y'know, just in case the "spawn new bird" subfunction is called), every
call site has to also push four double values onto the stack.
Thanks to Turbo C++ even using the FPU for pushing a 0.0 constant, we
have already reached maximum floating-point decadence before even having
seen a single danmaku pattern. Why decadence? Every possible spawn position
and velocity in both bird patterns just uses pixel resolution, with no
fractional component in sight. And there goes another 720 bytes of
conventional memory.
Speaking about bird patterns, the red-bird one is where we find the first
code-level ZUN bug: The spawn cross circle sprite suddenly disappears after
it finished spawning all the bird eggs. How can we tell it's a bug? Because
there is code to smoothly fly this sprite off the playfield, that
code just suddenly forgets that the sprite's position is stored in Q12.4
subpixels, and treats it as raw screen pixels instead.
As a result, the well-intentioned 640×400
screen-space clipping rectangle effectively shrinks to 38×23 pixels in the
top-left corner of the screen. Which the sprite is always outside of, and
thus never rendered again.
The intended animation is easily restored though:
Sariel's third pattern, and the first to spawn birds, in its original
and fixed versions. Note that I somewhat fixed the bird hatch animation
as well: ZUN's code never unblits any frame of animation there, and
simply blits every new one on top of the previous one.
Also, did you know that birds actually have a quite unfair 14×38-pixel
hitbox? Not that you'd ever collide with them in any of the patterns…
Another 3 of the 8 bugs can be found in the symmetric, interlaced spawn rays
used in three of the patterns, and the 32×32 debris "sprites" shown at their endpoint, at
the edge of the screen. You kinda have to commend ZUN's attention to detail
here, and how he wrote a lot of code for those few rapidly animated pixels
that you most likely don't
even notice, especially with all the other wrong pixels
resulting from rendering glitches. One of the bugs in the very final pattern
of phase 4 even turns them into the vortex sprites from the second pattern
in phase 1 during the first 5 frames of
the first time the pattern is active, and I had to single-step the blitting
calls to verify it.
It certainly was annoying how much time I spent making sense of these bugs,
and all weird blitting offsets, for just a few pixels… Let's look at
something more wholesome, shall we?
So far, we've only seen the PC-98 GRCG being used in RMW (read-modify-write)
mode, which I previously
📝 explained in the context of TH01's red-white HP pattern.
The second of its three modes, TCR (Tile Compare Read), affects VRAM reads
rather than writes, and performs "color extraction" across all 4 bitplanes:
Instead of returning raw 1bpp data from one plane, a VRAM read will instead
return a bitmask, with a 1 bit at every pixel whose full 4-bit color exactly
matches the color at that offset in the GRCG's tile register, and 0
everywhere else. Sariel uses this mode to make sure that the 2×2 particles
and the wind effect are only blitted on top of "air color" pixels, with
other parts of the background behaving like a mask. The algorithm:
Set the GRCG to TCR mode, and all 8 tile register dots to the air
color
Read N bits from the target VRAM position to obtain an N-bit mask where
all 1 bits indicate air color pixels at the respective position
AND that mask with the alpha plane of the sprite to be drawn, shifted to
the correct start bit within the 8-pixel VRAM byte
Set the GRCG to RMW mode, and all 8 tile register dots to the color that
should be drawn
Write the previously obtained bitmask to the same position in VRAM
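Expressed as code, the five steps above might look roughly like this sketch for a byte-aligned, 8-pixel-wide sprite column – the grcg_set_tcr(), grcg_set_rmw(), and grcg_off() helpers as well as the AIR_COLOR and DRAW_COLOR constants are placeholder names of mine:
#define ROW_SIZE 80 // Bytes per VRAM row at 640 pixels
void blit_on_air(unsigned char far *vram, const unsigned char *alpha, int h)
{
	for(int y = 0; y < h; y++) {
		grcg_set_tcr(AIR_COLOR);          // 1) TCR mode, tile = air color
		unsigned char air_mask = vram[0]; // 2) 1 bits = air-colored pixels
		unsigned char pixels = (air_mask & alpha[y]); // 3) mask the sprite row
		grcg_set_rmw(DRAW_COLOR);         // 4) RMW mode, tile = draw color
		vram[0] = pixels;                 // 5) write the combined mask back
		vram += ROW_SIZE;
	}
	grcg_off();
}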
Quite clever how the extracted colors double as a secondary alpha plane,
making for another well-earned good-code tag. The wind effect really doesn't deserve it, though:
ZUN calculates every intermediate result inside this function
over and over and over again… Together with some ugly
pointer arithmetic, this function turned into one of the most tedious
decompilations in a long while.
This gradual effect is blitted exclusively to the front page of VRAM,
since parts of it need to be unblitted to create the illusion of a gust of
wind. Then again, anything that moves on top of air-colored background –
most likely the Orb – will also unblit whatever it covered of the effect…
As far as I can tell, ZUN didn't use TCR mode anywhere else in PC-98 Touhou.
Tune in again later during a TH04 or TH05 push to learn about TDW, the final
GRCG mode!
Speaking about the 2×2 particle systems, why do we need three of them? Their
only observable difference lies in the way they move their particles:
1) Up or down in a straight line (used in phases 4 and 2,
respectively)
2) Left or right in a straight line (used in the second form)
3) Left and right in a sinusoidal motion (used in phase 3, the "dark
orange" one)
Out of all possible formats ZUN could have used for storing the positions
and velocities of individual particles, he chose a) 64-bit /
double-precision floating-point, and b) raw screen pixels. Want to take a
guess at which data type is used for which particle system?
If you picked double for 1) and 2), and raw screen pixels for
3), you are of course correct! Not that I'm implying
that it should have been the other way round – screen pixels would have
perfectly fit all three systems' use cases, as all 16-bit coordinates
are extended to 32 bits for trigonometric calculations anyway. That's what,
another 1,080 bytes of wasted conventional memory? And that's even
calculated while keeping the current architecture, which allocates
space for 3×30 particles as part of the game's global data, although only
one of the three particle systems is active at any given time.
That's it for the first form, time to put on "Civilization
of Magic"! Or "死なばもろとも"? Or "Theme of 地獄めくり"? Or whatever SYUGEN is
supposed to mean…
… and the code of these final patterns comes out roughly as exciting as
their in-game impact. With the big exception of the very final "swaying
leaves" pattern: After 📝 Q4.4,
📝 Q28.4,
📝 Q24.8, and double variables,
this pattern uses… decimal subpixels? Like, multiplying the number by
10, and using the decimal one's digit to represent the fractional part?
Well, sure, if you really insist on moving the leaves in cleanly
represented integer multiples of ⅒, which is infamously impossible in IEEE
754. Aside from aesthetic reasons, it only really combines less precision
(10 possible fractions rather than the usual 16) with the inferior
performance of having to use integer divisions and multiplications rather
than simple bit shifts. And it's surely not because the leaf sprites needed
an extended integer value range of [-3276, +3276], compared to
Q12.4's [-2048, +2047]: They are clipped to 640×400 screen space
anyway, and are removed as soon as they leave this area.
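To make the difference concrete, here's a small made-up example of both formats side by side – moving by exactly ⅒ of a pixel only works losslessly in the ×10 representation:
int leaf_y_x10 = (100 * 10);           // 100.0 pixels, stored as ×10
leaf_y_x10 += 1;                       // += 0.1 pixels, exactly
int leaf_y_screen = (leaf_y_x10 / 10); // Integer division recovers the pixel part
int y_q124 = (100 << 4); // 100.0 pixels in Q12.4
y_q124 += 2;             // += 0.125 pixels, the closest Q12.4 step to 0.1
int y_screen = (y_q124 >> 4);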
This pattern also contains the second bug in the "subpixel/pixel confusion
hiding an entire animation" category, causing all of
BOSS6GR4.GRC to effectively become unused:
The "swaying leaves" pattern. ZUN intended a splash animation to be
shown once each leaf "spark" reaches the top of the playfield, which is
never displayed in the original game.
At least their hitboxes are what you would expect, exactly covering the
30×30 pixels of Reimu's sprite. Both animation fixes are available on the th01_sariel_fixes
branch.
After all that, Sariel's main function turned out fairly unspectacular, just
putting everything together and adding some shake, transition, and color
pulse effects with a bunch of unnecessary hardware palette changes. There is
one reference to a missing BOSS6.GRP file during the
first→second form transition, suggesting that Sariel originally had a
separate "first form defeat" graphic, before it was replaced with just the
shaking effect in the final game.
Speaking about the transition code, it is kind of funny how the… um,
imperative and concrete nature of TH01 leads to these 2×24
lines of straight-line code. They kind of look like ZUN rattling off a
laundry list of subsystems and raw variables to be reinitialized, making
damn sure to not forget anything.
Whew! Second PC-98 Touhou boss completely decompiled, 29 to go, and they'll
only get easier from here! 🎉 The next one in line, Elis, is somewhere
between Konngara and Sariel as far as x86 instruction count is concerned, so
that'll need to wait for some additional funding. Next up, therefore:
Looking at a thing in TH03's main game code – really, I have little
idea what it will be!
Now that the store is open again, also check out the
📝 updated RE progress overview I've posted
together with this one. In addition to more RE, you can now also directly
order a variety of mods; all of these are further explained in the order
form itself.
OK, TH01 missile bullets. Can we maybe have a well-behaved entity type,
without any weirdness? Just once?
Ehh, kinda. Apart from another 150 bytes wasted on unused structure members,
this code is indeed more on the low end in terms of overall jank. It does
become very obvious why dodging these missiles in the YuugenMagan, Mima, and
Elis fights feels so awful though: An unfair 46×46 pixel hitbox around
Reimu's center pixel, combined with the comeback of
📝 interlaced rendering, this time in every
stage. ZUN probably did this because missiles are the only 16×16 sprite in
TH01 that is blitted to unaligned X positions, which effectively ends up
touching a 32×16 area of VRAM per sprite.
But even if we assume VRAM writes to be the bottleneck here, it would
have been totally possible to render every missile in every frame at roughly
the same amount of CPU time that the original game uses for interlaced
rendering:
Note that all missile sprites only use two colors, white and green.
Instead of naively going with the usual four bitplanes, extract the
pixels drawn in each of the two used colors into their own bitplanes.
master.lib calls this the "tiny format".
Use the GRCG to draw these two bitplanes in the intended white and green
colors, halving the amount of VRAM writes compared to the original
function.
(Not using the .PTN format would have also avoided the inconsistency of
storing the missile sprites in boss-specific sprite slots.)
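A hedged sketch of what such a two-plane blitter could have looked like – the grcg_set_rmw()/grcg_off() helpers, the color constants, and the byte-aligned 16-pixel-wide layout are all assumptions of mine, not code from the games:
#define ROW_SIZE 80 // Bytes per VRAM row at 640 pixels
void missile_put(unsigned char far *vram,
	const unsigned char *white, const unsigned char *green, int h)
{
	grcg_set_rmw(COL_WHITE); // Every written 1 bit becomes a white pixel
	for(int y = 0; y < h; y++) {
		vram[(y * ROW_SIZE) + 0] = white[(y * 2) + 0];
		vram[(y * ROW_SIZE) + 1] = white[(y * 2) + 1];
	}
	grcg_set_rmw(COL_GREEN); // Same for the green pixels
	for(int y = 0; y < h; y++) {
		vram[(y * ROW_SIZE) + 0] = green[(y * 2) + 0];
		vram[(y * ROW_SIZE) + 1] = green[(y * 2) + 1];
	}
	grcg_off();
}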
That's an optimization that would have significantly benefitted the game, in
contrast to all of the fake ones
introduced in later games. Then again, this optimization is
actually something that the later games do, and it might have in fact been
necessary to achieve their higher bullet counts without significant
slowdown.
After some effectively unused Mima sprite effect code that is so broken that
it's impossible to make sense out of it, we get to the final feature I
wanted to cover for all bosses in parallel before returning to Sariel: The
separate sprite background storage for moving or animated boss sprites in
the Mima, Elis, and Sariel fights. But, uh… why is this necessary to begin
with? Doesn't TH01 already reserve the other VRAM page for backgrounds?
Well, these sprites are quite big, and ZUN didn't want to blit them from
main memory on every frame. After all, TH01 and TH02 had a minimum required
clock speed of 33 MHz, half of the speed required for the later three games.
So, he simply blitted these boss sprites to both VRAM pages, leading
the usual unblitting calls to only remove the other sprites on top of the
boss. However, these bosses themselves want to move across the screen…
and this makes it necessary to save the stage background behind them
in some other way.
Enter .PTN, and its functions to capture a 16×16 or 32×32 square from VRAM
into a sprite slot. No problem with that approach in theory, as the size of
all these bigger sprites is a multiple of 32×32; splitting a larger sprite
into these smaller 32×32 chunks makes the code look just a little bit clumsy
(and, of course, slower).
But somewhere during the development of Mima's fight, ZUN apparently forgot
that those sprite backgrounds existed. And once Mima's 🚫 casting sprite is
blitted on top of her regular sprite, using just regular sprite
transparency, she ends up with her infamous third arm:
Ironically, there's an unused code path in Mima's unblit function where ZUN
assumes a height of 48 pixels for Mima's animation sprites rather than the
actual 64. This leads to even clumsier .PTN function calls for the bottom
128×16 pixels… Failing to unblit the bottom 16 pixels would have also
yielded that third arm, although it wouldn't have looked as natural. Still
wouldn't say that it was intentional; maybe this casting sprite was just
added pretty late in the game's development?
So, mission accomplished, Sariel unblocked… at 2¼ pushes. That's quite some time left for some smaller stage initialization
code, which bundles a bunch of random function calls in places where they
logically really don't belong. The stage opening animation then adds a bunch
of VRAM inter-page copies that are not only redundant but can't even be
understood without knowing the hidden internal state of the last VRAM page
accessed by previous ZUN code…
In better news though: Turbo C++ 4.0 really doesn't seem to have any
complexity limit on inlining arithmetic expressions, as long as they only
operate on compile-time constants. That's how we get macro-free,
compile-time Shift-JIS to JIS X 0208 conversion of the individual code
points in the 東方★靈異伝 string, in a compiler from 1994. As long as you
don't store any intermediate results in variables, that is…
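For reference, the standard Shift-JIS → JIS X 0208 arithmetic that such a compile-time expression boils down to looks like this for 東 (0x93 0x8C in Shift-JIS, 0x45 0x6C in JIS X 0208) – this is just an illustration of the conversion rule, not ZUN's exact expression:
static const unsigned int tou_jis = (
	((((0x93 - ((0x93 <= 0x9F) ? 0x70 : 0xB0)) * 2) - ((0x8C <= 0x9E) ? 1 : 0)) << 8) |
	(0x8C - ((0x8C <= 0x9E) ? ((0x8C >= 0x80) ? 0x20 : 0x1F) : 0x7E))
); // Evaluates to 0x456C at compile time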
But wait, there's more! With still ¼ of a push left, I also went for the
boss defeat animation, which includes the route selection after the SinGyoku
fight.
As in all other instances, the 2× scaled font is accomplished by first
rendering the text at regular 1× resolution to the other, invisible VRAM
page, and then scaled from there to the visible one. However, the route
selection is unique in that its scaled text is both drawn transparently on
top of the stage background (not onto a black one), and can also change
colors depending on the selection. It would have been no problem to unblit
and reblit the text by rendering the 1× version to a position on the
invisible VRAM page that isn't covered by the 2× version on the visible one,
but ZUN (needlessly) clears the invisible page before rendering any text.
Instead, he assigned a separate VRAM color for both
the 魔界 and 地獄 options, and only changed the palette value for
these colors to white or gray, depending on the correct selection. This is
another one of the
📝 rare cases where TH01 demonstrates good use of PC-98 hardware,
as the 魔界へ and 地獄へ strings don't need to be reblitted during the selection process, only the Orb "cursor" does.
Then, why does this still not count as good-code? When
changing palette colors, you kinda need to be aware of everything
else that can possibly be on screen, which colors are used there, and which
aren't and can therefore be used for such an effect without affecting other
sprites. In this case, well… hover over the image below, and notice how
Reimu's hair and the bomb sprites in the HUD light up when Makai is
selected:
This push did end on a high note though, with the generic, non-SinGyoku
version of the defeat animation being an easily parametrizable copy. And
that's how you decompile another 2.58% of TH01 in just slightly over three
pushes.
Now, we're not only ready to decompile Sariel, but also Kikuri, Elis, and
SinGyoku without needing any more detours into non-boss code. Thanks to the
current TH01 funding subscriptions, I can plan to cover most, if not all, of
Sariel in a single push series, but the currently 3 pending pushes probably
won't suffice for Sariel's 8.10% of all remaining code in TH01. We've got
quite a lot of not specifically TH01-related funds in the backlog to pass
the time though.
Due to recent developments, it actually makes quite a lot of sense to take a
break from TH01: spaztron64 has
managed what every Touhou download site so far has failed to do: Bundling
all 5 games onto a single .HDI together with pre-configured PC-98
emulators and a nice boot menu, and hosting the resulting package on a
proper website. While this first release is already quite good (and much
better than my attempt from 2014), there is still a bit of room for
improvement to be gained from specific ReC98 research. Next up,
therefore:
Researching how TH04 and TH05 use EMS memory, together with the cause
behind TH04's crash in Stage 5 when playing as Reimu without an EMS driver
loaded, and
reverse-engineering TH03's score data file format
(YUME.NEM), which hopefully also comes with a way of building a
file that unlocks all characters without any high scores.
Nothing really noteworthy in TH01's stage timer code, just yet another HUD
element that is needlessly drawn into VRAM. Sure, ZUN applies his custom
boldfacing effect on top of the glyphs retrieved from font ROM, but he could
have easily installed those modified glyphs as gaiji.
Well, OK, halfwidth gaiji aren't exactly well documented, and sometimes not
even correctly emulated
📝 due to the same PC-98 hardware oddity I was researching last month.
I've reserved two of the pending anonymous "anything" pushes for the
conclusion of this research, just in case you were wondering why the
outstanding workload is now lower after the two delivered here.
And since it doesn't seem to be clearly documented elsewhere: Every 2 ticks
on the stage timer correspond to 4 frames.
So, TH01 rank pellet speed. The resident pellet speed value is a
factor ranging from a minimum of -0.375 up to a maximum of 0.5 (pixels per
frame), multiplied with the difficulty-adjusted base speed for each pellet
and added on top of that same speed. This multiplier is modified
every time the stage timer reaches 0 and
HARRY UP is shown (+0.05)
for every score-based extra life granted below the maximum number of
lives (+0.025)
every time a bomb is used (+0.025)
on every frame in which the rand value (shown in debug
mode) is evenly divisible by
(1800 - (lives × 200) - (bombs × 50)) (+0.025)
every time Reimu got hit (set to 0 if higher, then -0.05)
when using a continue (set to -0.05 if higher, then -0.125)
Apparently, ZUN noted that these deltas couldn't be losslessly stored in an
IEEE 754 floating-point variable, and therefore didn't store the pellet
speed factor exactly in a way that would correspond to its gameplay effect.
Instead, it's stored similar to Q12.4 subpixels: as a simple integer,
pre-multiplied by 40. This results in a raw range of -15 to 20, which is
what the undecompiled ASM calls still use. When spawning a new pellet, its
base speed is first multiplied by that factor, and then divided by 40 again.
This is actually quite smart: The calculation doesn't need to be aware of
either Q12.4 or the 40× format, as
((Q12.4 * factor×40) / 40) still comes out as a
Q12.4 subpixel even if all numbers are integers. The only limiting issue
here would be the potential overflow of the 16-bit multiplication at
unadjusted base speeds of more than 50 pixels per frame, but that'd be
seriously unplayable.
So yeah, pellet speed modifications are indeed gradual, and don't just fall
into the coarse three "high, normal, and low" categories.
That's ⅝ of P0160 done, and the continue and pause menus would make good
candidates to fill up the remaining ⅜… except that it seemed impossible to
figure out the correct compiler options for this code?
The issues centered around the two effects of Turbo C++ 4.0J's
-O switch:
1) Optimizing jump instructions: merging duplicate successive jumps into a
single one, and merging duplicated instructions at the end of conditional
branches into a single place under a single branch, which the other branches
then jump to
2) Compressing ADD SP and POP CX
stack-clearing instructions after multiple successive CALLs to
__cdecl functions into a single ADD SP with the
combined parameter stack size of all function calls
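Effect #2 is easiest to see with a few hypothetical __cdecl calls in a row:
void __cdecl func_a(int);
void __cdecl func_b(int, int);
void __cdecl func_c(int);
void caller(void)
{
	func_a(1);    // Without -O: CALL, then POP CX (or ADD SP, 2)
	func_b(2, 3); // Without -O: CALL, then ADD SP, 4
	func_c(4);    // Without -O: CALL, then POP CX
	// With -O: the three cleanups collapse into a single ADD SP, 8
	// after the last CALL.
}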
But how can the ASM for these functions exhibit #1 but not #2? How
can it be seemingly optimized and unoptimized at the same time? The
only option that gets somewhat close would be -O- -y, which
emits line number information into the .OBJ files for debugging. This
combination provides its own kind of #1, but these functions clearly need
the real deal.
The research into this issue ended up consuming a full push on its own.
In the end, this solution turned out to be completely unrelated to compiler
options, and instead came from the effects of a compiler bug in a totally
different place. Initializing a local structure instance or array like
const uint4_t flash_colors[3] = { 3, 4, 5 };
always emits the { 3, 4, 5 } array into the program's data
segment, and then generates a call to the internal SCOPY@
function which copies this data array to the local variable on the stack.
And as soon as this SCOPY@ call is emitted, the -O
optimization #1 is disabled for the entire rest of the translation
unit?!
So, any code segment with an SCOPY@ call followed by
__cdecl functions must strictly be decompiled from top to
bottom, mirroring the original layout of translation units. That means no
TH01 continue and pause menus before we haven't decompiled the bomb
animation, which contains such an SCOPY@ call. 😕
Luckily, TH01 is the only game where this bug leads to significant
restrictions in decompilation order, as later games predominantly use the
pascal calling convention, in which each function itself clears
its stack as part of its RET instruction.
What now, then? With 51% of REIIDEN.EXE decompiled, we're
slowly running out of small features that can be decompiled within ⅜ of a
push. Good that I haven't been looking a lot into OP.EXE and
FUUIN.EXE, which pretty much only got easy pieces of
code left to do. Maybe I'll end up finishing their decompilations entirely
within these smaller gaps? I still ended up finding one more small
piece in REIIDEN.EXE though: The particle system, seen in the
Mima fight.
I like how everything about this animation is contained within a single
function that is called once per frame, but ZUN could have really
consolidated the spawning code for new particles a bit. In Mima's fight,
particles are only spawned from the top and right edges of the screen, but
the function in fact contains unused code for all other 7 possible
directions, written in quite a bloated manner. This wouldn't feel quite as
unused if ZUN had used an angle parameter instead…
Also, why unnecessarily waste another 40 bytes of
the BSS segment?
But wait, what's going on with the very first spawned particle that just
stops near the bottom edge of the screen in the video above? Well, even in
such a simple and self-contained function, ZUN managed to include an
off-by-one error. This one then results in an out-of-bounds array access on
the 80th frame, where the code attempts to spawn a 41st
particle. If the first particle was unlucky to be both slow enough and
spawned away far enough from the bottom and right edges, the spawning code
will then kill it off before its unblitting code gets to run, leaving its
pixel on the screen until something else overlaps it and causes it to be
unblitted.
Which, during regular gameplay, will quickly happen with the Orb, all the
pellets flying around, and your own player movement. Also, the RNG can
easily spawn this particle at a position and velocity that causes it to
leave the screen more quickly. Kind of impressive how ZUN laid out the
structure
of arrays in a way that ensured practically no effect of this bug on the
game; this glitch could have easily happened every 80 frames instead.
He came pretty close to having all of these bugs cancel each other out here!
Next up: The player control functions, including the second-biggest function
in all of PC-98 Touhou.
…or maybe not that soon, as it would have only wasted time to
untangle the bullet update commits from the rest of the progress. So,
here's all the bullet spawning code in TH04 and TH05 instead. I hope
you're ready for this, there's a lot to talk about!
(For the sake of readability, "bullets" in this blog post refers to the
white 8×8 pellets
and all 16×16 bullets loaded from MIKO16.BFT, nothing else.)
But first, what was going on 📝 in 2020? Spent 4 pushes on the basic types
and constants back then, still ended up confusing a couple of things, and
even getting some wrong. Like how TH05's "bullet slowdown" flag actually
always prevents slowdown and fires bullets at a constant speed
instead. Or how "random spread" is not the
best term to describe that unused bullet group type in TH04.
Or that there are two distinct ways of clearing all bullets on screen,
which deserve different names:
Mechanic #1: Clearing bullets for a custom amount of
time, awarding 1000 points for all bullets alive on the first frame,
and 100 points for all bullets spawned during the clear time.
Mechanic #2: Zapping bullets for a fixed 16 frames,
awarding a semi-exponential and loudly announced Bonus!! for all
bullets alive on the first frame, and preventing new bullets from being
spawned during those 16 frames. In TH04 at least; thanks to a ZUN bug,
zapping got reduced to 1 frame and no animation in TH05…
Bullets are zapped at the end of most midboss and boss phases, and
cleared everywhere else – most notably, during bombs, when losing a
life, or as rewards for extends or a maximized Dream bonus. The
Bonus!! points awarded for zapping bullets are calculated iteratively,
so it's not trivial to give an exact formula for these. For a small number
𝑛 of bullets, it would exactly be 5𝑛³ - 10𝑛² + 15𝑛
points – or, using uth05win's (correct) recursive definition,
Bonus(𝑛) = Bonus(𝑛-1) + 15(𝑛-1)² - 5(𝑛-1) + 10.
However, one of the internal step variables is capped at a different number
of points for each difficulty (and game), after which the points only
increase linearly. Hence, "semi-exponential".
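In case you want to play with it, here's a sketch of an iterative calculation that matches the formula above – variable names and the exact shape of the cap are mine, with STEP_CAP standing in for the per-game, per-difficulty constant and bullets_alive for the number of bullets alive on the first frame:
long bonus = 0;
long step = 10;       // Points for the next bullet: 15k² - 5k + 10 for bullet k
long step_delta = 10; // Growth of `step`, which itself grows by 30 per bullet
for(int k = 0; k < bullets_alive; k++) {
	bonus += step;
	step += step_delta;
	if(step > STEP_CAP) {
		step = STEP_CAP; // From here on, the total only increases linearly
	}
	step_delta += 30;
}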
On to TH04's bullet spawn code then, because that one can at least be
decompiled. And immediately, we have to deal with a pointless distinction
between regular bullets, with either a decelerating or constant
velocity, and special bullets, with preset velocity changes during
their lifetime. That preset has to be set somewhere, so why have
separate functions? In TH04, this separation continues even down to the
lowest level of functions, where values are written into the global bullet
array. TH05 merges those two functions into one, but then goes too far and
uses self-modifying code to save a grand total of two local variables…
Luckily, the rest of its actual code is identical to TH04.
Most of the complexity in bullet spawning comes from the (thankfully
shared) helper function that calculates the velocities of the individual
bullets within a group. Both games handle each group type via a large
switch statement, which is where TH04 shows off another Turbo
C++ 4.0 optimization: If the range of case values is too
sparse to be meaningfully expressed in a jump table, it usually generates a
linear search through a second value table. But with the -G
command-line option, it instead generates branching code for a binary
search through the set of cases. 𝑂(log 𝑛) as the worst case for a
switch statement in a C++ compiler from 1994… that's so cool.
But still, why are the values in TH04's group type enum all
over the place to begin with?
Unfortunately, this optimization is pretty rare in PC-98 Touhou. It only
shows up here and in a few places in TH02, compared to at least 50
switch value tables.
In all of its micro-optimized pointlessness, TH05's undecompilable version
at least fixes some of TH04's redundancy. While it's still not even
optimal, it's at least a decently written piece of ASM…
if you take the time to understand what's going on there, because it
certainly took quite a bit of that to verify that all of the things which
looked like bugs or quirks were in fact correct. And that's how the code
for this function ended up with 35% comments and blank lines before I could
confidently call it "reverse-engineered"…
Oh well, at least it finally fixes a correctness issue from TH01 and TH04,
where an invalid bullet group type would fill all remaining slots in the
bullet array with identical versions of the first bullet.
Something that both games also share in these functions is an over-reliance
on globals for return values or other local state. The most ridiculous
example here: Tuning the speed of a bullet based on rank actually mutates
the global bullet template… which ZUN then works around by adding a wrapper
function around both regular and special bullet spawning, which saves the
base speed before executing that function, and restores it afterward.
Add another set of wrappers to bypass that exact
tuning, and you've expanded your nice 1-function interface to 4 functions.
Oh, and did I mention that TH04 pointlessly duplicates the first set of
wrapper functions for 3 of the 4 difficulties, which can't even be
explained with "debugging reasons"? That's 10 functions then… and probably
explains why I've procrastinated this feature for so long.
At this point, I also finally stopped decompiling ZUN's original ASM just
for the sake of it. All these small TH05 functions would look horribly
unidiomatic, are identical to their decompiled TH04 counterparts anyway,
except for some unique constant… and, in the case of TH05's rank-based
speed tuning function, actually become undecompilable as soon as we
want to return a C++ class to preserve the semantic meaning of the return
value. Mainly, this is because Turbo C++ does not allow register
pseudo-variables like _AX or _AL to be cast into
class types, even if their size matches. Decompiling that function would
have therefore lowered the quality of the rest of the decompiled code, in
exchange for the additional maintenance and compile-time cost of another
translation unit. Not worth it – and for a TH05 port, you'd already have to
decompile all the rest of the bullet spawning code anyway!
The only thing in there that was still somewhat worth being
decompiled was the pre-spawn clipping and collision detection function. Due
to what's probably a micro-optimization mistake, the TH05 version continues
to spawn a bullet even if it was spawned on top of the player. This might
sound like it has a different effect on gameplay… until you realize that
the player got hit in this case and will either lose a life or deathbomb,
both of which will cause all on-screen bullets to be cleared anyway.
So it's at most a visual glitch.
But while we're at it, can we please stop talking about hitboxes? At least
in the context of TH04 and TH05 bullets. The actual collision detection is
described way better as a kill delta of 8×8 pixels between the
center points of the player and a bullet. You can distribute these pixels
to any combination of bullet and player "hitboxes" that make up 8×8. 4×4
around both the player and bullets? 1×1 for bullets, and 8×8 for the
player? All equally valid… or perhaps none of them, once you keep in mind
that other entity types might have different kill deltas. With that in
mind, the concept of a "hitbox" turns into just a confusing abstraction.
The same is true for the 36×44 graze box delta. For some reason,
this one is not exactly around the center of a bullet, but shifted to the
right by 2 pixels. So, a bullet can be grazed up to 20 pixels right of the
player, but only up to 16 pixels left of the player. uth05win also spotted
this… and rotated the deltas clockwise by 90°?!
Which brings us to the bullet updates… for which I still had to
research a decompilation workaround, because
📝 P0148 turned out to not help at all?
Instead, the solution was to lie to the compiler about the true segment
distance of the popup function and declare its signature near
rather than far. This allowed ZUN to save that ridiculous overhead of 1 additional far function
call/return per frame, and those precious 2 bytes in the BSS segment
that he didn't have to spend on a segment value.
📝 Another function that didn't have just a
single declaration in a common header file… really,
📝 how were these games even built???
The function itself is among the longer ones in both games. It especially
stands out in the indentation department, with 7 levels at its most
indented point – and that's the minimum of what's possible without
goto. Only two more notable discoveries there:
Bullets are the only entity affected by Slow Mode. If the number of
bullets on screen is ≥ (24 + (difficulty * 8) + rank) in TH04,
or (42 + (difficulty * 8)) in TH05, Slow Mode reduces the frame
rate by 33%, by waiting for one additional VSync event every two frames.
The code also reveals a second tier, with 50% slowdown for a slightly
higher number of bullets, but that conditional branch can never be executed
Bullets must have been grazed in a previous frame before they can
be collided with. (Note how this does not apply to bullets that spawned
on top of the player, as explained earlier!)
Whew… When did ReC98 turn into a full-on code review?! 😅 And after all
this, we're still not done with TH04 and TH05 bullets, with all the
special movement types still missing. That should be less than one push
though, once we get to it. Next up: Back to TH01 and Konngara! Now have fun
rewriting the Touhou Wiki Gameplay pages 😛
Y'know, I kinda prefer the pending crowdfunded workload to stay more near
the middle of the cap, rather than being sold out all the time. So to reach
this point more quickly, let's do the most relaxing thing that can be
easily done in TH05 right now: The boss backgrounds, starting with Shinki's,
📝 now that we've got the time to look at it in detail.
… Oh come on, more things that are borderline undecompilable, and
require new workarounds to be developed? Yup, Borland C++ always optimizes
any comparison of a register with a literal 0 to OR reg, reg,
no matter how many calculations and inlined function calls you replace the
0 with. Shinki's background particle rendering function contains a
CMP AX, 0 instruction though… so yeah,
📝 yet another piece of custom ASM that's worse
than what Turbo C++ 4.0J would have generated if ZUN had just written
readable C. This was probably motivated by ZUN insisting that his modified
master.lib function for blitting particles takes its X and Y parameters as
registers. If he had just used the __fastcall convention, he
also would have got the sprite ID passed as a register. 🤷
So, we really don't want to be forced into inline assembly just
because of the third comparison in the otherwise perfectly decompilable
four-comparison if() expression that prevents invisible
particles from being drawn. The workaround: Comparing to a pointer
instead, which only the linker gets to resolve to the actual value of 0.
This way, the compiler has to make room for
any 16-bit literal, and can't optimize anything.
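As a hedged sketch (with a hypothetical symbol name, and particle_x standing in for whatever is actually compared), the workaround looks like this – the symbol would be defined on the assembly side so that its offset resolves to 0:
extern const char LINKER_RESOLVED_ZERO[];
// The compiler only sees an unresolved address here, so it has to emit
// CMP AX, imm16 with a relocation, and can't shrink it to OR AX, AX.
if(particle_x > (int)LINKER_RESOLVED_ZERO) {
	// … draw the particle …
}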
And then we go straight from micro-optimization to
waste, with all the duplication in the code that
animates all those particles together with the zooming and spinning lines.
This push decompiled 1.31% of all code in TH05, and thanks to alignment,
we're still missing Shinki's high-level background rendering function that
calls all the subfunctions I decompiled here.
With all the manipulated state involved here, it's not at all trivial to
see how this code produces what you see in-game. Like:
1) If all lines have the same Y velocity, how do the other three lines in
background type B get pushed down into this vertical formation while the
top one stays still? (Answer: This velocity is only applied to the top
line, the other lines are only pushed based on some delta.)
2) How can this delta be calculated based on the distance between the top line
and its supposed target point around Shinki's wings? (Answer: The velocity
is never set to 0, so the top line overshoots this target point in every
frame. After calculating the delta, the top line itself is pushed down as
well, canceling out the movement.)
3) Why don't they get pushed down infinitely, but stop eventually?
(Answer: We only see four lines out of 20, at indices #0, #6, #12, and
#18. In each frame, lines [0..17] are copied to lines [1..18], before
anything gets moved. The invisible lines are pushed down based on the delta
as well, which defines a distance between the visible lines of (velocity *
array gap). And since the velocity is capped at -14 pixels per frame, this
also means a maximum distance of 84 pixels between the midpoints of each
line.)
4) And why are the lines moving back up when switching to background type
C, before moving down? (Answer: Because type C increases the
velocity rather than decreasing it. Therefore, it relies on the previous
velocity state from type B to show a gapless animation.)
So yeah, it's a nice-looking effect, just very hard to understand. 😵
With the amount of effort I'm putting into this project, I typically
gravitate towards more descriptive function names. Here, however,
uth05win's simple and seemingly tiny-brained "background type A/B/C/D" was
quite a smart choice. It clearly defines the sequence in which these
animations are intended to be shown, and as we've seen with point 4
from the list above, that does indeed matter.
Next up: At least EX-Alice's background animations, and probably also the
high-level parts of the background rendering for all the other TH05 bosses.
Whoops, the build was broken again? Since
P0127 from
mid-November 2020, on TASM32 version 5.3, which also happens to be the
one in the DevKit… That version changed the alignment for the default
segments of certain memory models when requesting .386
support. And since redefining segment alignment apparently is highly
illegal and absolutely has to be a build error, some of the stand-alone
.ASM translation units didn't assemble anymore on this version. I've only
spotted this on my own because I casually compiled ReC98 somewhere else –
on my development system, I happened to have TASM32 version 5.0 in the
PATH during all this time.
At least this was a good occasion to
get rid of some
weird segment alignment workarounds from 2015, and replace them with the
superior convention of using the USE16 modifier for the
.MODEL directive.
ReC98 would highly benefit from a build server – both in order to
immediately spot issues like this one, and as a service for modders.
Even more so than the usual open-source project of its size, I would say.
But that need might exist exactly because it doesn't seem like something
you can trivially outsource to one of the big CI providers for open-source
projects and quickly set up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular
64-bit Windows 10 and DOSBox running the exact build tools from the DevKit.
Ideally, though, such a server should really run the optimal configuration
of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step
to run natively, which already is something that no popular CI service out
there offers. Then, we'd optimally expand to Linux, every other Windows
version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd
be a lot. An experimental project all on its own, with additional hosting
costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest
there is once the store reopens (which will be at the beginning of May, at
the latest). That aside, it would 📝 also be
a great project for outside contributors!
So, technical debt, part 8… and right away, we're faced with TH03's
low-level input function, which
📝 once 📝 again 📝 insists on being word-aligned in a way we
can't fake without duplicating translation units.
Being undecompilable isn't exactly the best property for a function that
has been interesting to modders in the past: In 2018,
spaztron64 created an
ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human
multiplayer mode: 2021-04-04-TH03-WASD-2player.zip
However, this remapping attempt remained quite limited, since we hadn't
(and still haven't) reached full position independence for TH03 yet.
There's quite some potential for size optimizations in this function, which
would allow more BIOS key groups to already be used right now, but it's not
all that obvious to modders who aren't intimately familiar with x86 ASM.
Therefore, I really wouldn't want to keep such a long and important
function in ASM if we don't absolutely have to…
… and apparently, that's all the motivation I needed? So I took the risk,
and spent the first half of this push on reverse-engineering
TCC.EXE, to hopefully find a way to get word-aligned code
segments out of Turbo C++ after all.
And there is! The -WX option, used for creating
DPMI
applications, messes up all sorts of code generation aspects in weird
ways, but does in fact mark the code segment as word-aligned. We can
consider ourselves quite lucky that we get to use Turbo C++ 4.0, because
this feature isn't available in any previous version of Borland's C++
compilers.
That allowed us to restore all the decompilations I previously threw away…
well, two of the three – that lookup table generator was too much of a mess
in C. But what an abuse this is. The
subtly different code generation has basically required one creative
workaround per usage of -WX. For example, enabling that option
causes the regular PUSH BP and POP BP prolog and
epilog instructions to be wrapped with INC BP and
DEC BP, for some reason:
a_function_compiled_with_wx proc
inc bp ; ???
push bp
mov bp, sp
; [… function code …]
pop bp
dec bp ; ???
ret
a_function_compiled_with_wx endp
Luckily again, all the functions that currently require -WX
don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty
close: snd_se_reset(void) is one of the functions that require
word alignment. Previously, it shared a translation unit with the
immediately following snd_se_play(int new_se), which does take
a parameter, and therefore would have had its prolog and epilog code messed
up by -WX.
Since the latter function has a consistent (and thus, fakeable) alignment,
I simply split that code segment into two, with a new -WX
translation unit for just snd_se_reset(void). Problem solved –
after all, two C++ translation units are still better than one ASM
translation unit. Especially with all the
previous #include improvements.
The rest was more of the usual, getting us 74% done with repaying the
technical debt in the SHARED segment. A lot of the remaining
26% is TH04 needing to catch up with TH03 and TH05, which takes
comparatively little time. With some good luck, we might get this
done within the next push… that is, if we aren't confronted with all too
many more disgusting decompilations, like the two functions that ended this
push.
If we are, we might need 10 pushes to complete this after all, but
that piece of research was definitely worth the delay. Next up: One more of
these.
Alright, no more big code maintenance tasks that absolutely need to be
done right now. Time to really focus on parts 6 and 7 of repaying
technical debt, right? Except that we don't get to speed up just yet, as
TH05's barely decompilable PMD file loading function is rather…
complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98
Touhou, I first consult the disassembly of Wolfenstein 3D. That game was
originally compiled with the quite similar Borland C++ 3.0, so it's quite
helpful to compare its ASM to the
officially released source
code. If I find the instructions in question, they mostly come from
that game's ASM code, leading to the amusing realization that "even John
Carmack was unable to get these instructions out of this compiler".
This time though, Wolfenstein 3D did point me
to Borland's intrinsics for common C functions like memcpy()
and strchr(), available via #pragma intrinsic.
Bu~t those unfortunately still generate worse code than what ZUN
micro-optimized here. Adding comments that describe how these instruction
sequences would look in C was unfortunately all I could do here.
The conditional branches in this function did compile quite nicely
though, clarifying the control flow, and clearly exposing a ZUN
bug: TH05's snd_load() will hang in an infinite loop when
trying to load a non-existing -86 BGM file (with a .M2
extension) if the corresponding -26 BGM file (with a .M
extension) doesn't exist either.
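To illustrate the hang, here's one plausible shape of that retry loop –
hypothetical names and structure, purely for illustration and not ZUN's
actual control flow; the point is merely that neither branch can ever
terminate once both files are missing:

// Hypothetical sketch, not the original code.
int snd_bgm_is_86(void);                              // hypothetical
int file_exists(const char *fn, const char *ext);     // hypothetical
void snd_bgm_select_26(void);                         // hypothetical
void snd_bgm_select_86(void);                         // hypothetical

void snd_load(const char *fn /* base name without extension */)
{
	while(1) {
		if(snd_bgm_is_86()) {
			if(file_exists(fn, ".M2")) {
				break; // found the -86 version, load it
			}
			snd_bgm_select_26(); // fall back to the -26 version…
		} else {
			if(file_exists(fn, ".M")) {
				break; // found the -26 version, load it
			}
			snd_bgm_select_86(); // …and back again → infinite loop
		}
	}
	// […] actual loading
}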
Unsurprisingly, the PMD channel monitoring code in TH05's Music Room
remains undecompilable outside the two most "high-level" initialization
and rendering functions. And it's not because there's data in the
middle of the code segment – that would have actually been possible with
some #pragmas to ensure that the data and code segments have
the same name. As soon as the SI and DI registers are referenced
anywhere, Turbo C++ insists on emitting prolog code to save these
on the stack at the beginning of the function, and epilog code to restore
them from there before returning.
Found that out in
September 2019, and confirmed that there's no way around it. All the
small helper functions here are quite simply too optimized, throwing away
any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate
that I do try.
Within that same 6th push though, we've finally reached the one function
in TH05 that was blocking further progress in TH04, allowing that game
to finally catch up with the others in terms of separated translation
units. Feels good to finally delete more of those .ASM files whose
functions we had decompiled a while ago!
But since that was just getting started, the most satisfying development
in both of these pushes actually came from some more experiments with
macros and inline functions for near-ASM code. Adding "unused" dummy
parameters for all relevant registers makes the exact input registers more
explicit, which might help future port authors who then maybe wouldn't have
to look them up in an x86 instruction reference quite as often. At its
best, this even allows us to
declare certain functions with the __fastcall convention and
express their parameter lists as regular C, with no additional
pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even
more amazing than previously thought when it comes to returning
pseudo-registers from inline functions. A nice example for
how this can improve readability can be found in this piece of TH02 code
for polling the PC-98 keyboard state using a BIOS interrupt:
inline uint8_t keygroup_sense(uint8_t group) {
_AL = group;
_AH = 0x04;
geninterrupt(0x18);
// This turns the output register of this BIOS call into the return value
// of this function. Surprisingly enough, this does *not* naively generate
// the `MOV AL, AH` instruction you might expect here!
return _AH;
}
void input_sense(void)
{
// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
// never emits as such, giving us only the three instructions we need.
_AH = keygroup_sense(8);
// Whereas this one gives us the one additional `MOV BH, AH` instruction
// we'd expect, and nothing more.
_BH = keygroup_sense(7);
// And now it's obvious what both of these registers contain, from just
// the assignments above.
if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
key_det |= INPUT_UP;
}
// […]
}
I love it. No inline assembly, as close to idiomatic C code as something
like this is going to get, yet still compiling into the minimum possible
number of x86 instructions on even a 1994 compiler. This is how I keep
this project interesting for myself during chores like these.
We might have even reached peak
inline already?
And that's 65% of technical debt in the SHARED segment repaid
so far. Next up: Two more of these, which might already complete that
segment? Finally!
Technical debt, part 5… and we only got TH05's stupidly optimized
.PI functions this time?
As far as actual progress is concerned, that is. In maintenance news
though, I was really hyped for the #include improvements I've
mentioned in 📝 the last post. The result: A
new x86real.h file, bundling all the declarations specific to
the 16-bit x86 Real Mode in a smaller file than Turbo C++'s own
DOS.H. After all, DOS is not the same thing as the underlying
CPU. And while it didn't speed up build times quite as much as I had hoped,
it now clearly indicates the x86-specific parts of PC-98 Touhou code to
future port authors.
After another couple of improvements to parameter declaration in ASM land,
we get to TH05's .PI functions… and really, why did ZUN write all of
them in ASM? Why (re)declare all the necessary structures and data in
ASM land, when all these functions are merely one layer of abstraction
above master.lib, which does all the actual work?
I get that ZUN might have wanted masked blitting to be faster, which is
used for the fade-in effect seen during TH05's main menu animation and the
ending artwork. But, uh… he knew how to modify master.lib. In fact, he
did already modify the graph_pack_put_8() function
used for rendering a single .PI image row, to ignore master.lib's VRAM
clipping region. For this effect though, he first blits each row regularly
to the invisible 400th row of VRAM, and then does an EGC-accelerated
VRAM-to-VRAM blit of that row to its actual target position with the mask
enabled. It would have been way more efficient to add another version of
this function that takes a mask pattern. No amount of REP
MOVSW is going to change the fact that two VRAM writes per line are
slower than a single one. Not to mention that it doesn't justify writing
every other .PI function in ASM to go along with it…
This is where we also find the most hilarious aspect about this: For most
of ZUN's pointless micro-optimizations, you could have maybe made the
argument that they do save some CPU cycles here and there, and
therefore did something positive to the final, PC-98-exclusive result. But
some of the hand-written ASM here doesn't even constitute a
micro-optimization, because it's worse than what you would have got
out of even Turbo C++ 4.0J with its 80386 optimization flags!
At least it was possible to "decompile" 6 out of the 10 functions
here, making them easy to clean up for future modders and port authors.
Could have been 7 functions if I also decided to "decompile"
pi_free(), but all the C++ code is already surrounded by ASM,
resulting in 2 ASM translation units and 2 C++ translation units.
pi_free() would have needed a single translation unit by
itself, which wasn't worth it, given that I would have had to spell out
every single ASM instruction anyway.
There you go. What about this needed to be written in ASM?!?
The function calls between these small translation units even seemed to
glitch out TASM and the linker in the end, leading to one CALL
offset being weirdly shifted by 32 bytes. Usually, TLINK reports a fixup
overflow error when this happens, but this time it didn't, for some reason?
Mirroring the segment grouping in the affected translation unit did solve
the problem, and I already knew this, but only thought of it after spending
quite some RTFM time… during which I discovered the -lE
switch, which enables TLINK to use the expanded dictionaries in
Borland's .OBJ and .LIB files to speed up linking. That shaved off roughly
another second from the build time of the complete ReC98 repository. The
more you know… Binary blobs compiled with non-Borland tools would be the
only reason not to use this flag.
So, even more slowdown with this 5th dedicated push, since we've still only
repaid 41% of the technical debt in the SHARED segment so far.
Next up: Part 6, which hopefully manages to decompile the FM and SSG
channel animations in TH05's Music Room, and hopefully ends up being the
final one of the slow ones.
Wow, 31 commits in a single push? Well, what the last push had in
progress, this one had in maintenance. The
📝 master.lib header transition absolutely
had to be completed in this one, for my own sanity. And indeed,
it reduced the build time for the entirety of ReC98 to about 27 seconds on
my system, just as expected in the original announcement. Looking forward
to even faster build times with the upcoming #include
improvements I've got up my sleeve! The port authors of the future are
going to appreciate those quite a bit.
As for the new translation units, the funniest one is probably TH05's
function for blitting the 1-color .CDG images used for the main menu
options. Which is so optimized that it becomes decompilable again,
by ditching the self-modifying code of its TH04 counterpart in favor of
simply making better use of CPU registers. The resulting C code is still a
mess, but what can you do.
This was followed by even more TH05 functions that clearly weren't
compiled from C, as evidenced by their padding
bytes. It's about time I documented my lack of ideas for how to get
those out of Turbo C++.
And just like in the previous push, I also had to 📝 throw away a decompiled
TH02 function purely due to alignment issues. Couldn't have been a better
one though, no one's going to miss a residency check for the MMD driver
that is largely identical to the corresponding (and indeed decompilable)
function for the PMD driver. Both of those should have been merged into a
single function anyway, given how they also mutate the game's sound
configuration flags…
In the end, I've slightly slowed down with this one, with only 37% of
technical debt done after this 4th dedicated push. Next up: One more of
these, centered around TH05's stupidly optimized .PI functions. Maybe also
with some more reverse-engineering, after not having done any for 1½
months?
Alright, back to continuing the master.hpp transition started
in P0124, and repaying technical debt. The last blog post already
announced some ridiculous decompilations… and in fact, not a single
one of the functions in these two pushes was decompilable into
idiomatic C/C++ code.
As usual, that didn't keep me from trying though. The TH04 and TH05
version of the infamous 16-pixel-aligned, EGC-accelerated rectangle
blitting function from page 1 to page 0 was fairly average as far as
unreasonable decompilations are concerned.
The big blocker in TH03's MAIN.EXE, however, turned out to be
the .MRS functions, used to render the gauge attack portraits and bomb
backgrounds. The blitting code there uses the additional FS and GS segment
registers provided by the Intel 386… which
are not supported by Turbo C++'s inline assembler, and
can't be turned into pointers, due to a compiler bug in Turbo C++ that
generates wrong segment prefix opcodes for the _FS and
_GS pseudo-registers.
Apparently I'm the first one to even try doing that with this compiler? I
haven't found any other mention of this bug…
Compiling via assembly (#pragma inline) would work around
this bug and generate the correct instructions. But that would incur yet
another dependency on a 16-bit TASM, for something honestly quite
insignificant.
What we can always do, however, is using __emit__() to simply
output x86 opcodes anywhere in a function. Unlike spelled-out inline
assembly, that can even be used in helper functions that are supposed to
inline… which does in fact allow us to fully abstract away this compiler
bug. Regular if() comparisons with pseudo-registers
wouldn't inline, but "converting" them into C++ template function
specializations does. All that's left is some C preprocessor abuse
to turn the pseudo-registers into types, and then we do retain a
normal-looking poke() call in the blitting functions in the
end. 🤯
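As a plain, hypothetical illustration of just the __emit__() part (the
actual ReC98 version hides this behind the template machinery described
above, and the helper name here is made up):

// Hypothetical sketch, not the ReC98 helper. Writes a 16-bit value through
// FS by emitting the 386's FS segment override prefix ourselves, since
// Turbo C++ would generate the wrong prefix opcode for _FS.
inline void poke_fs_word(unsigned offset, unsigned value)
{
	_BX = offset;
	_AX = value;
	__emit__(0x64, 0x89, 0x07); // FS: / MOV [BX], AX
	// (Pseudo-register assignments like these are fragile, which is
	// exactly why the real code wraps them into something that reads
	// like a normal poke() call.)
}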
Yeah… the result is
batshit insane.
I may have gone too far in a few places…
One might certainly argue that all these ridiculous decompilations
actually hurt the preservation angle of this project. "Clearly, ZUN
couldn't have possibly written such unreasonable C++ code.
So why pretend he did, and not just keep it all in its more natural ASM
form?" Well, there are several reasons:
Future port authors will merely have to translate all the
pseudo-registers and inline assembly to C++. For the former, this is
typically as easy as replacing them with newly declared local variables. No
need to bother with function prolog and epilog code, calling conventions, or
the build system.
No duplication of constants and structures in ASM land.
As a more expressive language, C++ can document the code much better.
Meticulous documentation seems to have become the main attraction of ReC98
these days – I've seen it appreciated quite a number of times, and the
continued financial support of all the backers speaks volumes. Mods, on the
other hand, are still a rather rare sight.
Having as few .ASM files in the source tree as possible looks better to
casual visitors who just look at GitHub's repo language breakdown. This way,
ReC98 will also turn from an "Assembly project" to its rightful state
of "C++ project" much sooner.
And finally, it's not like the ASM versions are
gone – they're still part of the Git history.
Unfortunately, these pushes also demonstrated a second disadvantage in
trying to decompile everything possible: Since Turbo C++ lacks TASM's
fine-grained ability to enforce code alignment on certain multiples of
bytes, it might actually be unfeasible to link in a C-compiled object file
at its intended original position in some of the .EXE files it's used in.
Which… you're only going to notice once you encounter such a case. Due to
the slightly jumbled order of functions in the
📝 second, shared code segment, that might
be long after you decompiled and successfully linked in the function
everywhere else.
And then you'll have to throw away that decompilation after all 😕 Oh
well. In this specific case (the lookup table generator for horizontally
flipping images), that decompilation was a mess anyway, and probably
helped nobody. I could have added a dummy .OBJ that does nothing but
enforce the needed 2-byte alignment before the function if I
really insisted on keeping the C version, but it really wasn't
worth it.
Now that I've also described yet another meta-issue, maybe there'll
really be nothing to say about the next technical debt pushes?
Next up though: Back to actual progress
again, with TH01. Which maybe even ends up pushing that game over the 50%
RE mark?
(tl;dr: ReC98 has switched to Tup for
the 32-bit build. You probably want to get
💾 this build of Tup, and put it somewhere in your
PATH. It's optional, and always will be, but highly
recommended.)
P0001! Reserved for the delivery of the very first financial contribution
I've ever received for ReC98, back in January 2018. GhostPhanom
requested the exact opposite of immediate results, which motivated me to
go on quite a passionate quest for the perfect ReC98 build system. A quest
that went way beyond the crowdfunding…
Makefiles are a decent idea in theory: Specify the targets to generate,
the source files these targets depend on and are generated from, and the
rules to do the generating, with some helpful shorthand syntax. Then, you
have a build dependency graph, and your make tool of choice
can provide minimal rebuilds of only the targets whose sources changed
since the last make call. But, uh… wait, this is C/C++ we're
talking about, and doesn't pretty much every source file come with a
second set of dependent source files, namely, every single
#include in the source file itself? Do we really
have to duplicate all these inside the Makefile, and keep it in sync with the source file? 🙄
This fact alone means that Makefiles are inherently unsuited for
any language with an #include feature… that is, pretty
much every language out there. Not to mention other aspects like changes
to the compilation command lines, or the build rules themselves, all of
which require metadata of the previous build to be persistently stored in
some way. I have no idea why such a trash technology is even touted as a
viable build tool for code.
So, I decided to just
write my own build system, tailor-made for the needs of ReC98's 16-bit
build process, and combining a number of experimental ideas. Which is
still not quite bug-free and ready for public use, given that the
entire past year has kept me busy with actual tangible RE and PI progress.
What did finally become ready, however, is the improvement for the
32-bit build part, and that's what we've got here.
💭 Now, if only there was a build system that would perfectly track
dependencies of any compiler it calls, by injecting code and
hooking file opening syscalls. It'd be completely unrealistic for it to
also run on DOS (and we probably don't want to traverse a graph database
in a cycle-limited DOSBox), but it would be perfect for our 32-bit build
part, as long as that one still exists.
Sure, it might seem really minor to worry about not unconditionally
rebuilding all 32-bit .asm files, which just takes a couple
of seconds anyway. But minimal rebuilds in the 32-bit part also provide
the foundation for minimal rebuilds in the 16-bit part – and those
TLINK invocations do take quite some time after all.
Using Tup for ReC98 was an idea that dated back to January 2017. Back
then, I already opened
the pull request with a fix to allow Tup to work together with 32-bit
TASM. As much as I love Tup though, the fact that it only worked on
64-bit Windows ≥Vista would have meant that we had to exchange perfect
dependency tracking for the ability to build on 32-bit and older Windows
versions at all. For a project that relies on DOS compilers, this
would have been exactly the wrong trade-off to make.
What's worse though: TLINK fails to run on modern 32-bit
Windows with Loader error (0000) : Unrecognized Error.
Therefore, the set of systems that Tup runs on, and the set of systems
that can actually compile ReC98's 16-bit build part natively, would have
been exactly disjoint, with no OS getting to use both at the same time.
So I've kept using Tup for only my own development, but indefinitely
shelved the idea of making it the official build system, due to those
drawbacks. Recently though, it all came together:
The tup generate sub-command can generate a
.bat file that does a full dumb rebuild of everything, which
can serve as a fallback option for systems that can't run Tup. All we have
to do is to commit that .bat file to the ReC98 Git repository
as well, and tell build32b.bat to fall back on that if Tup
can't be run. That alone would have given us the benefits of Tup without
being worse than the current dumb build process.
In the meantime, other contributors improved Tup's own build process to
the point where 32-bit builds were simple enough to accomplish from the
comfort of a WSL terminal.
Two commits of mine
later, and 32-bit Windows Tup was fully functional. Another one later,
and 32-bit Windows Tup even gained one potential advantage over its 64-bit
counterpart. Since it only has to support DLL injection into 32-bit
programs, it doesn't need a separate 32-bit binary for retrieving function
pointers to the 32-bit version of Windows' DLL loading syscalls. Weirdly
enough, Windows Defender on current Windows 10 falsely flags that binary as
malware, despite it doing nothing but printing those pointer values to
stdout. 🤷
I've also added it to the DevKit, for any newcomers to ReC98.
After the switch to Tup and the fallback option, I extensively tested
building ReC98 on all operating systems I had lying around. And holy cow,
so much in that build was broken beyond belief. In the end, the solution
involved just fully rebuilding the entire 16-bit part by default.
Which, of course, nullifies any of the
advantages we might have gotten from a Makefile in the first place, due to
just how unreliable they are. If you had problems building ReC98 in the
past, try again now!
And sure, it would certainly be possible to also get Tup working on
Windows ≤XP, or 9x even. But I leave that to all those tinkerers out there
who are actually motivated to keep those OSes alive. My work here is
done – we now have a build process that is optimal on 32-bit
Windows ≥Vista, and still functional and reliable on 64-bit
Windows, Linux, and everything down to Windows 98 SE, and therefore also
real PC-98 hardware. Pretty good, I'd say.
(If it weren't for that weird crash of the 16-bit TASM.EXE in
that Windows 95 command prompt I've tried it in, it would also work on
that OS. Probably just a misconfiguration on my part?)
Now, it might look like a waste of time to improve a 32-bit build part
that won't even exist anymore once this project is done. However, a fully
16-bit DOS build will only make sense after
master.lib has been turned into a proper library, linked in by
TLINK rather than #included in the big .ASM
files.
This affects all games. If master.lib's data was consistently placed at
the beginning or end of each data segment, this would be no big deal, but
it's placed somewhere else in every binary.
So, this will only make sense sometime around 90% overall PI, and maybe
~50% RE in each game. Which is not the same as 50% overall –
especially since it includes TH02, the objectively worst Touhou game,
which hasn't received any dedicated funding ever.
Then, it will probably still require a couple of dedicated pushes to
move all the remaining data to C land.
Oh, and my 16-bit build system project also needs to be done before,
because, again, Makefiles are trash and we shouldn't rely on them even
more.
And who knows whether this project will get funded for that long. So yeah,
the 32-bit build part will stay with us for quite some more time, and for
all upcoming PI milestones. And with the current build process, it's
pretty much the most minor among all the minor issues I can think of.
Let's all enjoy the performance of a 32-bit build while we can 🙂
Next up: Paying some technical debt while keeping the RE% and PI% in place.
… and just as I explained 📝 in the last post
how decompilation is typically more sensible and efficient than ASM-level
reverse-engineering, we have this push demonstrating a counter-example.
The reason why the background particles and lines in the Shinki and
EX-Alice battles contributed so much to position dependence was simply
because they're accessed in a relatively large amount of functions, one
for each different animation. Too many to spend the remaining precious
crowdfunded time on reverse-engineering or even decompiling them all,
especially now that everyone anticipates 100% PI for TH05's
MAIN.EXE.
Therefore, I only decompiled the two functions of the line structure that
also demonstrate best how it works, which in turn also helped with RE.
Sadly, this revealed that we actually can't 📝 overload operator =() to get
that nice assignment syntax for 12.4 fixed-point values, because one of
those new functions relies on Turbo C++'s built-in optimizations for
trivially copyable structures. Still, impressive that this abstraction
caused no other issues for almost one year.
As for the structures themselves… nope, nothing to criticize this time!
Sure, one good particle system would have been awesome, instead of having
separate structures for the Stage 2 "starfield" particles and the one used
in Shinki's battle, with hardcoded animations for both. But given the
game's short development time, that was quite an acceptable compromise,
I'd say.
And as for the lines, there just has to be a reason why the game
reserves 20 lines per set, but only renders lines #0, #6, #12, and #18.
We'll probably see once we get to look at those animation functions more
closely.
This was quite a 📝 TH03-style RE push,
which yielded way more PI% than RE%. But now that that's done, I can
finally not get distracted by all that stuff when looking at the
list of remaining memory references. Next up: The last few missing
structures in TH05's MAIN.EXE!
So, let's finally look at some TH01 gameplay structures! The obvious
choices here are player shots and pellets, which are conveniently located
in the last code segment. Covering these would therefore also help in
transferring some first bits of data in REIIDEN.EXE from ASM
land to C land. (Splitting the data segment would still be quite
annoying.) Player shots are immediately at the beginning…
…but wait, these are drawn as transparent sprites loaded from .PTN files.
Guess we first have to spend a push on
📝 Part 2 of this format.
Hm, 4 functions for alpha-masked blitting and unblitting of both 16×16 and
32×32 .PTN sprites that align the X coordinate to a multiple of 8
(remember, the PC-98 uses a
planar
VRAM memory layout, where 8 pixels correspond to a byte), but only one
function that supports unaligned blitting to any X coordinate, and only
for 16×16 sprites? Which is only called twice? And doesn't come with a
corresponding unblitting function?
Yeah, "unblitting". TH01 isn't
double-buffered,
and uses the PC-98's second VRAM page exclusively to store a stage's
background and static sprites. Since the PC-98 has no hardware sprites,
all you can do is write pixels into VRAM, and any animated sprite needs to
be manually removed from VRAM at the beginning of each frame. Not using
double-buffering theoretically allows TH01 to simply copy back all 128 KB
of VRAM once per frame to do this. But that
would be pretty wasteful, so TH01 just looks at all animated sprites, and
selectively copies only their occupied pixels from the second to the first
VRAM page.
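Conceptually, that per-sprite unblitting boils down to something like the
following sketch, with a hypothetical vram_copy_rect_back_to_front()
helper standing in for the actual plane-wise VRAM copy:

// Sketch of the concept only, with hypothetical names – not TH01's code.
void vram_copy_rect_back_to_front(int left, int top, int w, int h); // hypothetical

struct UnblitSprite {
	int left, top, w, h; // occupied VRAM rectangle, in pixels
	int alive;
};

void sprites_unblit(UnblitSprite sprites[], int count)
{
	for(int i = 0; i < count; i++) {
		if(!sprites[i].alive) {
			continue;
		}
		// Restore only the pixels this sprite occupies, by copying that
		// rectangle from the static background page back to the displayed
		// page, instead of copying back all 128 KB of VRAM.
		vram_copy_rect_back_to_front(
			sprites[i].left, sprites[i].top, sprites[i].w, sprites[i].h
		);
	}
}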
Alright, player shot class methods… oh, wait, the collision functions
directly act on the Yin-Yang Orb, so we first have to spend a push on
that one. And that's where the impression we got from the .PTN
functions is confirmed: The orb is, in fact, only ever displayed at
byte-aligned X coordinates, divisible by 8. It's only thanks to the
constant spinning that its movement appears at least somewhat
smooth.
This is purely a rendering issue; internally, its position is
tracked at pixel precision. Sadly, smooth orb rendering at any unaligned X
coordinate wouldn't be that trivial of a mod, because well, the
necessary functions for unaligned blitting and unblitting of 32×32 sprites
don't exist in TH01's code. Then again, there's so much potential for
optimization in this code, so it might be very possible to squeeze those
additional two functions into the same C++ translation unit, even without
position independence…
More importantly though, this was the right time to decompile the core
functions controlling the orb physics – probably the highlight in these
three pushes for most people.
Well, "physics". The X velocity is restricted to the 5 discrete states of
-8, -4, 0, 4, and 8, and gravity is applied by simply adding 1 to the Y
velocity every 5 frames. No wonder that this can
easily lead to situations in which the orb infinitely bounces from the
ground.
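Expressed as code, the described update amounts to roughly this –
hypothetical names, and a rough sketch rather than a decompilation:

// Rough sketch of the described behavior, not ZUN's actual code.
static const int ORB_X_SPEEDS[5] = { -8, -4, 0, 4, 8 };

struct Orb {
	int x, y;          // position in pixels
	int x_speed;       // always one of ORB_X_SPEEDS
	double y_velocity; // pixels per frame; positive = falling
};

void orb_update(Orb &orb, int frame)
{
	if((frame % 5) == 0) {
		orb.y_velocity += 1.0; // "gravity", applied once every 5 frames
	}
	orb.x += orb.x_speed;
	orb.y += (int)orb.y_velocity;
	// (A bounce would flip y_velocity here; with this coarse quantization,
	// that can easily settle into the infinite bouncing mentioned above.)
}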
At least fangame authors now have a reference for how ZUN did it
originally, because really, this bad
approximation of physics had to have been written that way on purpose. But
hey, it uses 64-bit floating-point variables!
…sometimes at least, and quite randomly. This was also where I had to
learn about Turbo C++'s floating-point code generation, and how rigorously
it defines the order of instructions when mixing double and
float variables in arithmetic or conditional expressions.
This meant that I could only get ZUN's original instruction order by using
literal constants instead of variables, which is impossible right now
without somehow splitting the data segment. In the end, I had to resort to
spelling out ⅔ of one function, and one conditional branch of another, in
inline ASM. 😕 If ZUN had just written 16.0 instead of
16.0f there, I would have saved quite some hours of my life
trying to decompile this correctly…
To sort of make up for the slowdown in progress, here's the TH01 orb
physics debug mod I made to properly understand them. Edit
(2022-07-12): This mod is outdated,
📝 the current version is here! 2020-06-13-TH01OrbPhysicsDebug.zip
To use it, simply replace REIIDEN.EXE, and run the game
in debug mode, via game d on the DOS prompt.
Its code might also serve as an example of how to achieve this sort of
thing without position independence.
Alright, now it's time for player shots though. Yeah, sure, they
don't move horizontally, so it's not too bad that those are also
always rendered at byte-aligned positions. But, uh… why does this code
only use the 16×16 alpha-masked unblitting function for decaying shots,
and just sloppily unblits an entire 16×16 square everywhere else?
The worst part though: Unblitting, moving, and rendering player shots
is done in a single function, in that order. And that's exactly where
TH01's sprite flickering comes from. Since different types of sprites are
free to overlap each other, you'd have to first unblit all types, then
move all types, and then render all types, as done in later
PC-98 Touhou games. If you do these three steps per-type instead, you
will unblit sprites of other types that have been rendered before… and
therefore end up with flicker.
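In pseudo-structure (all function names here are hypothetical stand-ins),
the difference is simply:

// TH01's per-type order – unblitting one type can erase pixels of another
// type that was already rendered this frame:
void shots_update(void)
{
	shots_unblit();
	shots_move();
	shots_render();
}

// Flicker-free order of the later games – finish each stage for *all*
// sprite types before starting the next one:
void frame(void)
{
	shots_unblit();  pellets_unblit();  /* … */
	shots_move();    pellets_move();    /* … */
	shots_render();  pellets_render();  /* … */
}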
Oh, and finally, ZUN also added an additional sloppy 16×16 square unblit
call if a shot collides with a pellet or a boss, for some
guaranteed flicker. Sigh.
And that's ⅓ of all ZUN code in TH01 decompiled! Next up: Pellets!
Well, that took twice as long as I thought, with the two pushes containing
a lot more maintenance than actual new research. Spending some time
improving both field names and types in
32th System's
TH03 resident structure finally gives us all of those
structures. Which means that we can now cover all the remaining
decompilable ZUN.COM parts at once…
Oh wait, their main() functions have stayed largely identical
since TH02? Time to clean up and separate that first, then… and combine
two recent code generation observations into the solution to a
decompilation puzzle from 4½ years ago. Alright, time to decomp-
Oh wait, we'd kinda like to properly RE all the code in TH03-TH05
that deals with loading and saving .CFG files. Almost every outside
contributor wanted to grab this supposedly low-hanging fruit a lot
earlier, but (of course) always just for a single game, while missing how
the format evolved.
So, ZUN.COM. For some reason, people seem to consider it
particularly important, even though it contains neither any game logic nor
any code specific to PC-98 hardware… All that this decompilable part does
is to initialize a game's .CFG file, allocate an empty resident structure
using master.lib functions, release it after you quit the game,
error-check all that, and print some playful messages~ (OK, TH05's also
directly fills the resident structure with all data from
MIKO.CFG, which all the other games do in OP.EXE.)
At least modders can now freely change and extend all the resident
structures, as well as the .CFG files? And translators can translate those
messages that you won't see on a decently fast emulator anyway? Have fun,
I guess 🤷
And you can in fact do this right now – even for TH04 and TH05,
whose ZUN.COM currently isn't rebuilt by ReC98. There is
actually a rather involved reason for this:
One of the missing files is TH05's GJINIT.COM.
Which contains all of TH05's gaiji characters in hardcoded 1bpp form,
together with a bit of ASM for writing them to the PC-98's hardware gaiji
RAM
Which means we'd ideally first like to have a sprite compiler, for
all the hardcoded 1bpp sprites
Which must compile to an ASM slice in the meantime, but should also
output directly to an OMF .OBJ file (for performance now), as well as to C
code (for portability later)
Which I won't put in as long as the backlog contains actual
progress to drive up the percentages on the front page.
So yeah, no meaningful RE and PI progress at any of these levels. Heck,
even as a modder, you can just replace the zun zun_res
(TH02), zun -5 (TH03), or zun -s (TH04/TH05)
calls in GAME.BAT with a direct call to your modified
*RES*.COM. And with the alternative being "manually typing 0 and 1
bits into a text file", editing the sprites in TH05's
GJINIT.COM is way more comfortable in a binary sprite editor
anyway.
For me though, the best part in all of this was that it finally made sense
to throw out the old Borland C++ run-time assembly slices 🗑 This giant
waste of time
became obvious 5 years ago, but any ASM dump of a .COM
file would have needed rather ugly workarounds without those slices. Now
that all .COM binaries that were originally written in C are
compiled from C, we can all enjoy slightly faster grepping over the entire
repository, which now has 229 fewer files. Productivity will skyrocket!
Next up: Three weeks of almost full-time ReC98 work! Two more PI-focused
pushes to finish this TH05 stretch first, before switching priorities to
TH01 again.
So, the thing that made me so excited about TH01 were all those bulky C
reimplementations of master.lib functions. Identical copies in all three
executables, trivial to figure out and decompile, removing tons of
instructions, and providing a foundation for large parts of the game
later. The first set of functions near the end of that shared code segment
deals with color palette handling, and master.lib's resident palette
structure in particular. (No relation to the game's
resident structure.) Which directly starts us out with pretty much
all the decompilation difficulties imaginable:
iteration over internal DOS structures via segment pointers – Turbo
C++ doesn't support a lot of arithmetic on those, requiring tons of casts
to make it work
calls to a far function near the beginning of a segment
from a function near the end of a segment – these are undecompilable until
we've decompiled both functions (and thus, the majority of the segment),
and need to be spelled out in ASM for the time being. And if the caller
then stores some of the involved variables in registers, there's no
way around the ugliest of workarounds, spelling out opcode bytes…
surprising color format inconsistencies – apparently, GRB (rather than
RGB) is some sort of wider standard in PC-98 inter-process communication,
because it matches the order of the hardware's palette register ports
(0AAh = green, 0ACh = red, 0AEh = blue)? Yet the
game's actual palette still uses RGB…
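For reference, setting a single hardware color in that register order would
look roughly like the sketch below. Port 0A8h as the color index register
isn't mentioned above, so treat that detail as my assumption:

#include <dos.h>

// Hedged sketch of the GRB register order, not code from the game.
void palette_set(int index, int red, int green, int blue)
{
	outportb(0xA8, index); // color index (assumed)
	outportb(0xAA, green);
	outportb(0xAC, red);
	outportb(0xAE, blue);
}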
And as it turns out, the game doesn't even use the resident palette
feature. Which adds yet another set of functions to the, uh, learning
experience that ZUN must have chosen this game to be. I wouldn't be
surprised if we manage to uncover actual scrapped beta game content later
on, among all the unused code that's bound to still be in there.
At least decompilation should get easier for the next few TH01 pushes now…
right?
And just in time for zorg's last outstanding pushes, the
TH05 shot type control functions made the speedup happen!
TH05 as a whole is now 20% reverse-engineered, and 50% position
independent,
TH05's MAIN.EXE is now even below TH02's in terms of not
yet RE'd instructions,
and all price estimates have now fallen significantly.
It would have been really nice to also include Reimu's shot
control functions in this last push, but figuring out this entire system,
with its weird bitflags and switch statement
micro-optimizations, was once again taking way longer than it should
have. Especially with my new-found insistence on turning this obvious
copy-pasta into something somewhat readable and terse…
But with such a rather tabular visual structure, things should now be
moddable in a hopefully consistent way. Of course, since we're
only at 54% position independence for MAIN.EXE,
this isn't possible yet without
crashing the game, but modifying damage would already work.
The glacial pace continues, with TH05's unnecessarily, inappropriately
micro-optimized, and hence, un-decompilable code for rendering the current
and high score, as well as the enemy health / dream / power bars. While
the latter might still pass as well-written ASM, the former goes to such
ridiculous levels that it ends up being technically buggy. If you
enjoy quality ZUN code, it's
definitely worth a read.
In TH05, this all still is at the end of code segment #1, but in TH04,
the same code lies all over the same segment. And since I really
wanted to move that code into its final form now, I finally did the
research into decompiling from anywhere else in a segment.
Turns out we actually can! It's kinda annoying, though: After splitting
the segment after the function we want to decompile, we then need to group
the two new segments back together into one "virtual segment" matching the
original one. But since all ASM in ReC98 heavily relies on being
assembled in MASM mode, we then start to suffer from MASM's group
addressing quirk. Which then forces us to manually prefix every single
function call
from inside the group
to anywhere else within the newly created segment
with the group name. It's stupidly boring busywork, because of all the
function calls you mustn't prefix. Special tooling might make this
easier, but I don't have it, and I'm not getting crowdfunded for it.
So while you now definitely can request any specific thing in any
of the 5 games to be decompiled right now, it will take slightly
longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)
Only one function away from the TH05 shot type control functions now!
Here we go, new C code! …eh, it will still take a bit to really get
decompilation going at the speeds I was hoping for. Especially with the
sheer amount of stuff that is set in the first few significant
functions we actually can decompile, which now all has to be
correctly declared in the C world. Turns out I spent the last 2 years
screwing up the case of exported functions, and even some of their names,
so that it didn't actually reflect their calling convention… yup. That's
just the stuff you tend to forget while it doesn't matter.
To make up for that, I decided to research whether we can make use of some
C++ features to improve code readability after all. Previously, it seemed
that TH01 was the only game that included any C++ code, whereas TH02 and
later seemed to be 100% C and ASM. However, during the development of the
soon to be released new build system, I noticed that even this old
compiler from the mid-90's, infamous for prioritizing compile speeds over
all but the most trivial optimizations, was capable of quite surprising
levels of automatic inlining with class methods…
…leading the research to culminate in the mindblow that is
9d121c7 – yes, we can use C++ class methods
and operator overloading to make the code more readable, while still
generating the same code as if we had just used C and preprocessor
macros.
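As a generic illustration of the principle (not the code from that commit):
methods defined inside a class body are implicitly inline, so both of these
calls should compile to the same instructions, per the observation above:

struct Point {
	int x, y;
	void move(int dx, int dy) {
		x += dx;
		y += dy;
	}
};

// The C / preprocessor equivalent:
#define POINT_MOVE(p, dx, dy) ((p).x += (dx), (p).y += (dy))

void demo(Point &p)
{
	p.move(8, -8);        // inlined method call
	POINT_MOVE(p, 8, -8); // macro
}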
Looks like there's now the potential for a few pull requests from outside
devs that apply C++ features to improve the legibility of previously
decompiled and terribly macro-ridden code. So, if anyone wants to help
without spending money…