Technical debt, part 10… in which two of the PMD-related functions came
with such complex ramifications that they required one full push after
all, leaving no room for the additional decompilations I wanted to do. At
least, this did end up being the final one, completing all
SHARED segments for the time being.
The first one of these functions determines the BGM and sound effect
modes, combining the resident type of the PMD driver with the Option menu
setting. The TH04 and TH05 version is apparently coded quite smartly, as
PC-98 Touhou only needs to distinguish "OPN- /
PC-9801-26K-compatible sound sources handled by PMD.COM"
from "everything else", since all other PMD varieties are
OPNA- / PC-9801-86-compatible.
Therefore, I only documented those two results returned from PMD's
AH=09h function. I'll leave a comprehensive, fully documented
enum to interested contributors, since that would involve research into
basically the entire history of the PC-9800 series, and even the clearly
out-of-scope PC-88VA. After all, distinguishing between more versions of
the PMD driver in the Option menu (and adding new sprites for them!) is
strictly mod territory.
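To illustrate the "-26K vs. everything else" distinction, here's a minimal sketch of how such a mode selection could look. All names are hypothetical and not taken from the actual codebase. The sketch assumes that the driver query has already been reduced to the two documented results, and that the Option menu offers an off, a -26K, and a -86 setting:

```cpp
#include <cstdint>

// Hypothetical names; only the two-way split mirrors the text above.
enum class PmdBoard : std::uint8_t {
    None,   // no resident PMD driver found
    Opn26K, // OPN- / PC-9801-26K-compatible, handled by PMD.COM
    Opna86, // "everything else": OPNA- / PC-9801-86-compatible
};

enum class BgmOption : std::uint8_t { Off, Fm26, Fm86 };
enum class BgmMode : std::uint8_t { None, Fm26, Fm86 };

// Combines the resident driver type with the Option menu setting.
// Assumption: a -26K driver can't play the -86 version of a song, so such a
// request is downgraded instead of being rejected.
constexpr BgmMode bgm_mode_determine(PmdBoard board, BgmOption option)
{
    if((board == PmdBoard::None) || (option == BgmOption::Off)) {
        return BgmMode::None;
    }
    if(board == PmdBoard::Opn26K) {
        return BgmMode::Fm26;
    }
    return ((option == BgmOption::Fm86) ? BgmMode::Fm86 : BgmMode::Fm26);
}
```

A comprehensive enum would only have to extend `PmdBoard`; the function itself would still collapse everything that isn't `Opn26K` into the -86 path.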
The honor of being the final decompiled function in any SHARED
segment went to TH04's snd_load(). TH04 contains by far the
sanest version of this function: Readable C code, no new ZUN bugs (and
still missing file I/O error handling, of course)… but wait, what about
that actual file read syscall, using the INT 21h, AH=3Fh DOS
file read API? Reading up to a hardcoded number of bytes into PMD's or
MMD's song or sound effect buffer, 20 KiB in TH02-TH04, 64 KiB in
TH05… that's kind of weird. About time we looked closer into this.
Turns out that no, KAJA's driver doesn't give you the full 64 KiB of one
memory segment for these, as especially TH05's code might suggest to
anyone unfamiliar with these drivers. Instead,
you can customize the size of these buffers on its command line. In
GAME.BAT, ZUN allocates 8 KiB for FM songs, 2 KiB for sound
effects, and 12 KiB for MMD files in TH02… which means that the hardcoded
sizes in snd_load() are completely wrong, no matter how you
look at them. Consequently, this read syscall
will overflow PMD's or MMD's song or sound effect buffer if the
given file is larger than the respective buffer size.
Now, ZUN could have simply hardcoded the sizes from GAME.BAT
instead, and it would have been fine. As it also turns out though,
PMD has an API function (AH=22h) to retrieve the actual
buffer sizes, provided for exactly that purpose. There is little excuse
not to use it, as it also gives you PMD's default sizes if you don't
specify any yourself.
(Unless your build process enumerates all PMD files that are part of the
game, and bakes the largest size into both snd_load() and
GAME.BAT. That would even work with MMD, which doesn't have
an equivalent for AH=22h.)
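For reference, a bounds-checked load could look roughly like the following sketch. It is portable C++ rather than real-mode DOS code, and every name in it is hypothetical: `DriverBuffer` merely stands in for the resident song or sound effect buffer, and its `capacity` would have to come from PMD's AH=22h function (or from a build-time constant) on real hardware.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum class SndLoadResult { Ok, TooLarge };

// Hypothetical stand-in for the driver's resident buffer.
struct DriverBuffer {
    std::uint8_t* data;
    std::size_t capacity; // in bytes; assumed to be queried from the driver
};

// Copies `file` into `buf`, but refuses anything that wouldn't fit, instead
// of reading a hardcoded number of bytes and hoping for the best.
SndLoadResult snd_load_checked(
    DriverBuffer buf, const std::vector<std::uint8_t>& file
)
{
    if(file.size() > buf.capacity) {
        return SndLoadResult::TooLarge;
    }
    if(!file.empty()) {
        std::memcpy(buf.data, file.data(), file.size());
    }
    return SndLoadResult::Ok;
}
```

The important design choice is that the check happens before a single byte is written, so a failed load leaves the previously loaded song intact.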
What'd be the consequence of loading a larger file then? Well, since we
don't get a full segment, let's look at the theoretical limit first.
PMD prefers to keep both its driver code and the data buffers in a single
memory segment. As a result, the limit for the combined size of the song,
instrument, and sound effect buffer is determined by the amount of
code in the driver itself. In PMD86 version 4.8o (bundled with TH04
and TH05) for example, the remaining size for these buffers is exactly
45,555 bytes. Being an actually good programmer who doesn't blindly trust
user input, KAJA thankfully validates the sizes given via the
/M, /V, and /E command-line options
before letting the driver reside in memory, and shuts down with an error
message if they exceed 40 KiB. Would have been even better if he calculated
the exact size – even in the current
PMD version 4.8s from
January 2020, it's still a hardcoded value (see line 8581).
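The difference between the hardcoded and the exact limit is easy to put into numbers. The following sketch is not KAJA's code; it assumes that the three sizes are given in KiB and checked as a sum, and only the two constants are taken from the paragraph above:

```cpp
#include <cstdint>

constexpr std::uint32_t KIB = 1024;
constexpr std::uint32_t HARDCODED_LIMIT_KIB = 40;     // from the text
constexpr std::uint32_t EXACT_FREE_BYTES_48O = 45555; // PMD86 4.8o, from the text

// Assumption: the /M, /V, and /E sizes are validated as a combined size.
constexpr bool sizes_accepted(
    std::uint32_t m_kib, std::uint32_t v_kib, std::uint32_t e_kib
)
{
    return ((m_kib + v_kib + e_kib) <= HARDCODED_LIMIT_KIB);
}

// What a check against the exact remaining size could look like instead.
constexpr bool sizes_fit_exactly(
    std::uint32_t m_kib,
    std::uint32_t v_kib,
    std::uint32_t e_kib,
    std::uint32_t free_bytes
)
{
    return (((m_kib + v_kib + e_kib) * KIB) <= free_bytes);
}

// Bytes that the hardcoded limit leaves unused in this driver version.
constexpr std::uint32_t UNUSED_HEADROOM =
    (EXACT_FREE_BYTES_48O - (HARDCODED_LIMIT_KIB * KIB));
```

Under these assumptions, a 44 KiB configuration would still fit into the 4.8o segment, but gets rejected by the hardcoded value.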
Either way: If the file is larger than this maximum, the concrete effect
is down to the INT 21h, AH=3Fh implementation in the
underlying DOS version. DOS 3.3 treats the destination address as linear
and reads past the end of the segment,
DOS 5.0 and DOSBox-X truncate the number of bytes to not exceed the remaining
space in the segment, and maybe there's even a DOS that wraps around
and ends up overwriting the PMD driver code. In any case: You will
overwrite what's after the driver in memory – typically, the game .EXE and
its master.lib functions.
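The three behaviors boil down to simple segment arithmetic. Here's a small sketch that models them; it doesn't call into any DOS, and the function names are made up for this example:

```cpp
#include <cstdint>

constexpr std::uint32_t SEGMENT_SIZE = 0x10000;

// Bytes left between `offset` and the end of its 64 KiB segment.
constexpr std::uint32_t bytes_until_segment_end(std::uint16_t offset)
{
    return (SEGMENT_SIZE - offset);
}

// Truncating behavior: never read more than what fits into the segment.
constexpr std::uint32_t read_count_truncated(
    std::uint16_t offset, std::uint16_t requested
)
{
    return ((requested < bytes_until_segment_end(offset))
        ? requested
        : bytes_until_segment_end(offset));
}

// Linear behavior: the read ends at this linear address, which can lie
// beyond the end of the segment.
constexpr std::uint32_t linear_read_end(
    std::uint16_t segment, std::uint16_t offset, std::uint16_t count
)
{
    return ((static_cast<std::uint32_t>(segment) << 4) + offset + count);
}

// Linear address of the first byte after the given segment.
constexpr std::uint32_t linear_segment_end(std::uint16_t segment)
{
    return ((static_cast<std::uint32_t>(segment) << 4) + SEGMENT_SIZE);
}

// Wrapping behavior: offset that the i-th byte of the read ends up at.
constexpr std::uint16_t wrapped_offset(std::uint16_t offset, std::uint32_t i)
{
    return static_cast<std::uint16_t>((offset + i) & 0xFFFF);
}
```

For a buffer at offset 0xF000 and a read of 0x2000 bytes, the truncating variant stops after 0x1000 bytes, the linear variant writes 0x1000 bytes past the segment, and the wrapping variant would continue at offset 0 of the same segment.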
It almost feels like a happy accident that this doesn't cause issues in
the original games. The largest PMD file in any of the 4 games, the -86
version of 幽夢 ～ Inanimate Dream, takes up 8,099 bytes,
just under the 8,192 byte limit for BGM. For modders, I'd really recommend
implementing this properly, with PMD's AH=22h function and
error handling, once position independence has been reached.
Whew, didn't think I'd be doing more research into KAJA's drivers during
regular ReC98 development! That's probably been the final time though, as
all involved functions are now decompiled, and I'm unlikely to iterate
over them again.
And that's it! Repaid the biggest chunk of technical debt, time for some
actual progress again. Next up: Reopening the store tomorrow, and waiting
for new priorities. If we got nothing by Sunday, I'm going to put the
pending [Anonymous] pushes towards some work on the website.
ReC98 would highly benefit from a build server – both in order to
immediately spot issues like this one, and as a service for modders.
Even more so than the usual open-source project of its size, I would say.
But that might be exactly
because it doesn't seem like something you can trivially outsource
to one of the big CI providers for open-source projects, and quickly set
it up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular
64-bit Windows 10 and DOSBox running the exact build tools from the DevKit.
Ideally, though, such a server should really run the optimal configuration
of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step
to run natively, which already is something that no popular CI service out
there offers. Then, we'd optimally expand to Linux, every other Windows
version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd
be a lot. An experimental project all on its own, with additional hosting
costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest
there is once the store reopens (which will be at the beginning of May, at
the latest). That aside, it would 📝 also be
a great project for outside contributors!
So, technical debt, part 8… and right away, we're faced with TH03's
low-level input function, which
📝 once 📝 again 📝 insists on being word-aligned in a way we
can't fake without duplicating translation units.
Being undecompilable isn't exactly the best property for a function that
has been interesting to modders in the past: In 2018,
spaztron64 created an
ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human
multiplayer mode: 2021-04-04-TH03-WASD-2player.zip
However, this remapping attempt remained quite limited, since we hadn't
(and still haven't) reached full position independence for TH03 yet.
There's quite some potential for size optimizations in this function, which
would allow more BIOS key groups to already be used right now, but it's not
all that obvious to modders who aren't intimately familiar with x86 ASM.
Therefore, I really wouldn't want to keep such a long and important
function in ASM if we don't absolutely have to…
… and apparently, that's all the motivation I needed? So I took the risk,
and spent the first half of this push on reverse-engineering
TCC.EXE, to hopefully find a way to get word-aligned code
segments out of Turbo C++ after all.
And there is! The -WX option, used for creating DPMI
applications, messes up all sorts of code generation aspects in weird
ways, but does in fact mark the code segment as word-aligned. We can
consider ourselves quite lucky that we get to use Turbo C++ 4.0, because
this feature isn't available in any previous version of Borland's C++ compilers.
That allowed us to restore all the decompilations I previously threw away…
well, two of the three, that lookup table generator was too much of a mess
in C. But what an abuse this is. The
subtly different code generation has basically required one creative
workaround per usage of -WX. For example, enabling that option
causes the regular PUSH BP and POP BP prolog and
epilog instructions to be wrapped with INC BP and
DEC BP, for some reason:
inc bp ; ???
push bp
mov bp, sp
; [… function code …]
pop bp
dec bp ; ???
Luckily again, all the functions that currently require -WX
don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty
close: snd_se_reset(void) is one of the functions that require
word alignment. Previously, it shared a translation unit with the
immediately following snd_se_play(int new_se), which does take
a parameter, and therefore would have had its prolog and epilog code messed
up by -WX.
Since the latter function has a consistent (and thus, fakeable) alignment,
I simply split that code segment into two, with a new -WX
translation unit for just snd_se_reset(void). Problem solved –
after all, two C++ translation units are still better than one ASM
translation unit. Especially with all the
previous #include improvements.
The rest was more of the usual, getting us 74% done with repaying the
technical debt in the SHARED segment. A lot of the remaining
26% is TH04 needing to catch up with TH03 and TH05, which takes
comparatively little time. With some good luck, we might get this
done within the next push… that is, if we aren't confronted with all too
many more disgusting decompilations, like the two functions that ended this push.
If we are, we might be needing 10 pushes to complete this after all, but
that piece of research was definitely worth the delay. Next up: One more of
Well, make that three days. Trying to figure out all the details behind
the sprite flickering was absolutely dreadful…
It started out easy enough, though. Unsurprisingly, TH01 had a quite
limited pellet system compared to TH04 and TH05:
The cap is 100, rather than 240 in TH04 or 180 in TH05.
Only 6 special motion functions (with one of them broken and unused)
instead of 10. This is where you find the code that generates SinGyoku's
chase pellets, Kikuri's small spinning multi-pellet circles, and
Konngara's rain pellets that bounce down from the top of the playfield.
A tiny selection of preconfigured multi-pellet groups. Rather than
TH04's and TH05's freely configurable n-way spreads, stacks, and rings,
TH01 only provides abstractions for 2-, 3-, 4-, and 5- way spreads (yup,
no 6-way or beyond), with a fixed narrow or wide angle between the
individual pellets. The resulting pellets are also hardcoded to linear
motion, and can't use the special motion functions. Maybe not the best
code, but still kind of cute, since the generated groups do follow a
As expected from TH01, the code comes with its fair share of smaller,
insignificant ZUN bugs and oversights. As you would also expect
though, the sprite flickering points to the biggest and most consequential
flaw in all of this.
Apparently, it started with ZUN getting the impression that it's only
possible to use the PC-98 EGC for fast blitting of all 4 bitplanes in one
CPU instruction if you blit 16 horizontal pixels (= 2 bytes) at a time.
Consequently, he only wrote one function for EGC-accelerated sprite
unblitting, which can only operate on a "grid" of 16×1 tiles in VRAM. But
wait, pellets are not only just 8×8, but can also be placed at any
unaligned X position…
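To put numbers on that mismatch: any unblitter that can only work on whole 16-pixel tiles has to round a sprite's horizontal extent outwards to tile boundaries. The following sketch is not ZUN's function, just the generic arithmetic with made-up names, assuming non-negative coordinates:

```cpp
constexpr int TILE_W = 16;

// Half-open horizontal pixel range: [left, right).
struct Span {
    int left;
    int right;
};

// Smallest run of 16-pixel tiles that fully covers [x, x + w).
constexpr Span tile_span(int x, int w)
{
    return Span{
        ((x / TILE_W) * TILE_W),
        (((x + w + (TILE_W - 1)) / TILE_W) * TILE_W)
    };
}

// Number of pixels per row that get unblitted without belonging to the
// sprite itself.
constexpr int overdraw(int x, int w)
{
    return ((tile_span(x, w).right - tile_span(x, w).left) - w);
}
```

Even in the best case of a tile-aligned 8-pixel pellet, half of every unblitted row lies outside the sprite. A pellet at X = 12 straddles two tiles, which would require 32 pixels per row for a correct tile-based unblit.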
… yet the game still insists on using this 16-dot-aligned function to
unblit pellets, forcing itself into using a super sloppy 16×8 rectangle
for the job. 🤦 ZUN then tried to mitigate the resulting flickering in two
hilarious ways that just make it worse:
An… "interlaced rendering" mode? This one's activated for all Stage 15
and 20 fights, and separates pellets into two halves that are rendered on
alternating frames. Collision detection with the Yin-Yang Orb and the
player is only done for the visible half, but collision detection with
player shots is still done for all pellets every frame, as are
motion updates – so that pellets don't end up moving half as fast as they
So yeah, your eyes weren't deceiving you. The game does effectively
drop its perceived frame rate in the Elis, Kikuri, Sariel, and Konngara
fights, and it does so deliberately.
📝 Just like player shots, pellets
are also unblitted, moved, and rendered in a single function.
Thanks to the 16×8 rectangle, there's now the (completely unnecessary)
possibility of accidentally unblitting parts of a sprite that was
previously drawn into the 8 pixels right of a pellet. And this
is where ZUN went "oh, I
know, let's test the entire 16 pixels, and in case we got an entity
there, we simply make the pellet invisible for this frame! Then
we don't even have to unblit it later!"
Except that this is only done for the first 3 elements of the player
shot array…?! Which don't even necessarily have to contain the 3 shots
fired last. It's not done for the player sprite, the Orb, or, heck,
other pellets that come earlier in the pellet array. (At least
we avoided going 𝑂(𝑛²) there?)
Actually, and I'm only realizing this now as I type this blog post:
This test is done even if the shots at those array elements aren't
active. So, pellets tend to be made invisible based on comparisons
with garbage data.
And then you notice that the player shot
unblit/move/render function is actually only ever called from the
pellet unblit/move/render function on the one global instance
of the player shot manager class, after pellets were unblitted. So, every
shot is unblitted, moved, and rendered again after all pellets have been
unblitted, which means that we can't ever unblit a previously rendered shot
with a pellet. Sure, as terrible as this one function call is from
a software architecture perspective, it was enough to fix this issue.
Yet we don't even get the intended positive effect, and walk away with
pellets that are made temporarily invisible for no reason at all. So,
uh, maybe it all just was an attempt at increasing the
frame rate on lower spec PC-98 models?
Yup, that's it, we've found the most stupid piece of code in this game,
period. It'll be hard to top this.
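For anyone who wants to reason about the interlaced mode without reading the decompilation, here's a heavily simplified model of it. This is a sketch with hypothetical names, not the original code. In particular, I'm assuming that the two halves are split by even and odd array index, and the counters only exist to make the effect observable:

```cpp
#include <cstddef>
#include <vector>

struct Pellet {
    int x = 0;
    int velocity_x = 0;
    int times_moved = 0;
    int times_rendered = 0;
};

// Assumption: pellets are split into halves by even/odd index.
inline bool visible_on_frame(std::size_t index, unsigned frame, bool interlaced)
{
    return (!interlaced || ((index % 2) == (frame % 2)));
}

void pellets_update(std::vector<Pellet>& pellets, unsigned frame, bool interlaced)
{
    for(std::size_t i = 0; i < pellets.size(); i++) {
        Pellet& p = pellets[i];

        // Motion (and collision with player shots) happens for every pellet
        // on every frame, so the movement speed stays the same.
        p.x += p.velocity_x;
        p.times_moved++;

        // Rendering (and collision with the player and the Orb) only
        // happens for the half that is visible on this frame.
        if(visible_on_frame(i, frame, interlaced)) {
            p.times_rendered++;
        }
    }
}
```

After 10 frames in interlaced mode, every pellet in this model has moved 10 times but was only drawn 5 times, which is exactly the halved perceived frame rate.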
I'm confident that it's possible to turn TH01 into a well-written, fluid
PC-98 game, with no flickering, and no perceived lag, once it's
position-independent. With some more in-depth knowledge and documentation
on the EGC (remember, there's still
📝 this one TH03 push waiting to be funded),
you might even be able to continue using that piece of blitter hardware.
And no, you certainly won't need ASM micro-optimizations – just a bit of
knowledge about which optimizations Turbo C++ does on its own, and what
you'd have to improve in your own code. It'd be very hard to write
worse code than what you find in TH01 itself.
(Godbolt for Turbo C++ 4.0J when?
Seriously though, that would 📝 also be a
great project for outside contributors!)
Oh well. In contrast to TH04 and TH05, where 4 pushes only covered all the
involved data types, they were enough to completely cover all of
the pellet code in TH01. Everything's already decompiled, and we never
have to look at it again. 😌 And with that, TH01 has also gone from by far
the least RE'd to the most RE'd game within ReC98, in just half a year! 🎉
Still, that was enough TH01 game logic for a while.
Next up: Making up for the delay with some
more relaxing and easy pieces of TH01 code, that hopefully make just a
bit more sense than all this garbage. More image formats, mainly.
Sadly, we've already reached the end of fast triple-speed TH01 progress
with 📝 the last push, which decompiled the
last segment shared by all three of TH01's executables. There's still a
bit of double-speed progress left though, with a small number of code
segments that are shared between just two of the three executables.
At the end of the first one of these, we've got all the code for the .GRZ
format – which is yet another run-length encoded image format, but this
time storing up to 16 full 640×400 16-color images with an alpha bit. This
one is exclusively used to wastefully store Konngara's sword slash and
animations. Due to… suboptimal code organization, the code for the format
is also present in OP.EXE, despite not being used there. But
hey, that brings TH01 to over 20% in RE!
Decoupling the RLE command stream from the pixel data sounds like a nice
idea at first, allowing the format to efficiently encode a variety of
animation frames displayed all over the screen… if ZUN actually made
use of it. The RLE stream also has quite some ridiculous overhead,
starting with 1 byte to store the 1-bit command (putting a single 8×1
pixel block, or entering a run of N such blocks). Run commands then store
another 1-byte run length, which has to be followed by another
command byte to identify the run as putting N blocks, or skipping N blocks.
And the pixel data is just a sequence of these blocks for all 4 bitplanes,
in uncompressed form…
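The overhead becomes obvious once you walk through such a command stream. The following sketch only follows the structure described above; the concrete command values are hypothetical placeholders and not the ones used in the actual files, and the function merely counts blocks instead of blitting anything:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical command values, chosen for this example only.
constexpr std::uint8_t CMD_PUT_SINGLE = 0; // put one 8×1 block
constexpr std::uint8_t CMD_RUN = 1;        // enter a run of N blocks
constexpr std::uint8_t RUN_PUT = 0;        // run puts N blocks
constexpr std::uint8_t RUN_SKIP = 1;       // run skips N blocks

// One 8×1 block takes 1 byte per bitplane, for all 4 bitplanes.
constexpr std::size_t BYTES_PER_BLOCK = 4;

struct BlockCounts {
    std::size_t put;
    std::size_t skipped;
    bool ok; // false if the stream was malformed or truncated
};

BlockCounts rle_count_blocks(const std::vector<std::uint8_t>& stream)
{
    BlockCounts ret = { 0, 0, true };
    std::size_t i = 0;
    while(i < stream.size()) {
        const std::uint8_t cmd = stream[i++];
        if(cmd == CMD_PUT_SINGLE) {
            ret.put++;
            continue;
        }
        // A run needs two more bytes: the length and the run type.
        if((cmd != CMD_RUN) || ((stream.size() - i) < 2)) {
            ret.ok = false;
            break;
        }
        const std::uint8_t length = stream[i++];
        const std::uint8_t run_type = stream[i++];
        if(run_type == RUN_PUT) {
            ret.put += length;
        } else if(run_type == RUN_SKIP) {
            ret.skipped += length;
        } else {
            ret.ok = false;
            break;
        }
    }
    return ret;
}

// Amount of uncompressed pixel data consumed by the put blocks.
constexpr std::size_t pixel_bytes(std::size_t blocks_put)
{
    return (blocks_put * BYTES_PER_BLOCK);
}
```

Every run costs 3 bytes in the command stream regardless of its type, and every put block still pulls its full 4 bytes of uncompressed pixel data from the other stream.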
Also, have some rips of all the images this format is used for:
To make these, I just wrote a small viewer, calling the same decompiled
TH01 code: 2020-03-07-grzview.zip
Obviously, this means that it not only must be run on a PC-98, but also
discards the alpha information.
If any backers are really interested in having a proper converter
to and from PNG, I can implement that in an upcoming push… although that
would be the perfect thing for outside contributors to do.
Next up, we got some code for the PI format… oh, wait, the actual files
are called "GRP" in TH01.