OK, TH01 missile bullets. Can we maybe have a well-behaved entity type,
without any weirdness? Just once?
Ehh, kinda. Apart from another 150 bytes wasted on unused structure members,
this code is indeed more on the low end in terms of overall jank. It does
become very obvious why dodging these missiles in the YuugenMagan, Mima, and
Elis fights feels so awful though: An unfair 46×46 pixel hitbox around
Reimu's center pixel, combined with the comeback of
📝 interlaced rendering, this time in every
stage. ZUN probably did this because missiles are the only 16×16 sprite in
TH01 that is blitted to unaligned X positions, which effectively ends up
touching a 32×16 area of VRAM per sprite.
But even if we assume VRAM writes to be the bottleneck here, it would
have been totally possible to render every missile in every frame at roughly
the same amount of CPU time that the original game uses for interlaced
rendering:
Note that all missile sprites only use two colors, white and green.
Instead of naively going with the usual four bitplanes, extract the
pixels drawn in each of the two used colors into their own bitplanes.
master.lib calls this the "tiny format".
Use the GRCG to draw these two bitplanes in the intended white and green
colors, halving the amount of VRAM writes compared to the original
function.
(Not using the .PTN format would have also avoided the inconsistency of
storing the missile sprites in boss-specific sprite slots.)
That's an optimization that would have significantly benefitted the game, in
contrast to all of the fake ones
introduced in later games. Then again, this optimization is
actually something that the later games do, and it might have in fact been
necessary to achieve their higher bullet counts without significant
slowdown.
After some effectively unused Mima sprite effect code that is so broken that
it's impossible to make sense out of it, we get to the final feature I
wanted to cover for all bosses in parallel before returning to Sariel: The
separate sprite background storage for moving or animated boss sprites in
the Mima, Elis, and Sariel fights. But, uh… why is this necessary to begin
with? Doesn't TH01 already reserve the other VRAM page for backgrounds?
Well, these sprites are quite big, and ZUN didn't want to blit them from
main memory on every frame. After all, TH01 and TH02 had a minimum required
clock speed of 33 MHz, half of the speed required for the later three games.
So, he simply blitted these boss sprites to both VRAM pages, leading
the usual unblitting calls to only remove the other sprites on top of the
boss. However, these bosses themselves want to move across the screen…
and this makes it necessary to save the stage background behind them
in some other way.
Enter .PTN, and its functions to capture a 16×16 or 32×32 square from VRAM
into a sprite slot. No problem with that approach in theory, as the size of
all these bigger sprites is a multiple of 32×32; splitting a larger sprite
into these smaller 32×32 chunks makes the code look just a little bit clumsy
(and, of course, slower).
But somewhere during the development of Mima's fight, ZUN apparently forgot
that those sprite backgrounds existed. And once Mima's 🚫 casting sprite is
blitted on top of her regular sprite, using just regular sprite
transparency, she ends up with her infamous third arm:
Ironically, there's an unused code path in Mima's unblit function where ZUN
assumes a height of 48 pixels for Mima's animation sprites rather than the
actual 64. This leads to even clumsier .PTN function calls for the bottom
128×16 pixels… Failing to unblit the bottom 16 pixels would have also
yielded that third arm, although it wouldn't have looked as natural. Still
wouldn't say that it was intentional; maybe this casting sprite was just
added pretty late in the game's development?
So, mission accomplished, Sariel unblocked… at 2¼ pushes. That's quite some time left for some smaller stage initialization
code, which bundles a bunch of random function calls in places where they
logically really don't belong. The stage opening animation then adds a bunch
of VRAM inter-page copies that are not only redundant but can't even be
understood without knowing the hidden internal state of the last VRAM page
accessed by previous ZUN code…
In better news though: Turbo C++ 4.0 really doesn't seem to have any
complexity limit on inlining arithmetic expressions, as long as they only
operate on compile-time constants. That's how we get macro-free,
compile-time Shift-JIS to JIS X 0208 conversion of the individual code
points in the 東方★靈異伝 string, in a compiler from 1994. As long as you
don't store any intermediate results in variables, that is…
But wait, there's more! With still ¼ of a push left, I also went for the
boss defeat animation, which includes the route selection after the SinGyoku
fight.
As in all other instances, the 2× scaled font is accomplished by first
rendering the text at regular 1× resolution to the other, invisible VRAM
page, and then scaled from there to the visible one. However, the route
selection is unique in that its scaled text is both drawn transparently on
top of the stage background (not onto a black one), and can also change
colors depending on the selection. It would have been no problem to unblit
and reblit the text by rendering the 1× version to a position on the
invisible VRAM page that isn't covered by the 2× version on the visible one,
but ZUN (needlessly) clears the invisible page before rendering any text.
Instead, he assigned a separate VRAM color for both
the 魔界 and 地獄 options, and only changed the palette value for
these colors to white or gray, depending on the correct selection. This is
another one of the
📝 rare cases where TH01 demonstrates good use of PC-98 hardware,
as the 魔界へ and 地獄へ strings don't need to be reblitted during the selection process, only the Orb "cursor" does.
Then, why does this still not count as good-code? When
changing palette colors, you kinda need to be aware of everything
else that can possibly be on screen, which colors are used there, and which
aren't and can therefore be used for such an effect without affecting other
sprites. In this case, well… hover over the image below, and notice how
Reimu's hair and the bomb sprites in the HUD light up when Makai is
selected:
This push did end on a high note though, with the generic, non-SinGyoku
version of the defeat animation being an easily parametrizable copy. And
that's how you decompile another 2.58% of TH01 in just slightly over three
pushes.
Now, we're not only ready to decompile Sariel, but also Kikuri, Elis, and
SinGyoku without needing any more detours into non-boss code. Thanks to the
current TH01 funding subscriptions, I can plan to cover most, if not all, of
Sariel in a single push series, but the currently 3 pending pushes probably
won't suffice for Sariel's 8.10% of all remaining code in TH01. We've got
quite a lot of not specifically TH01-related funds in the backlog to pass
the time though.
Due to recent developments, it actually makes quite a lot of sense to take a
break from TH01: spaztron64 has
managed what every Touhou download site so far has failed to do: Bundling
all 5 game onto a single .HDI together with pre-configured PC-98
emulators and a nice boot menu, and hosting the resulting package on a
proper website. While this first release is already quite good (and much
better than my attempt from 2014), there is still a bit of room for
improvement to be gained from specific ReC98 research. Next up,
therefore:
Researching how TH04 and TH05 use EMS memory, together with the cause
behind TH04's crash in Stage 5 when playing as Reimu without an EMS driver
loaded, and
reverse-engineering TH03's score data file format
(YUME.NEM), which hopefully also comes with a way of building a
file that unlocks all characters without any high scores.
Nothing really noteworthy in TH01's stage timer code, just yet another HUD
element that is needlessly drawn into VRAM. Sure, ZUN applies his custom
boldfacing effect on top of the glyphs retrieved from font ROM, but he could
have easily installed those modified glyphs as gaiji.
Well, OK, halfwidth gaiji aren't exactly well documented, and sometimes not
even correctly emulated
📝 due to the same PC-98 hardware oddity I was researching last month.
I've reserved two of the pending anonymous "anything" pushes for the
conclusion of this research, just in case you were wondering why the
outstanding workload is now lower after the two delivered here.
And since it doesn't seem to be clearly documented elsewhere: Every 2 ticks
on the stage timer correspond to 4 frames.
So, TH01 rank pellet speed. The resident pellet speed value is a
factor ranging from a minimum of -0.375 up to a maximum of 0.5 (pixels per
frame), multiplied with the difficulty-adjusted base speed for each pellet
and added on top of that same speed. This multiplier is modified
every time the stage timer reaches 0 and
HARRY UP is shown (+0.05)
for every score-based extra life granted below the maximum number of
lives (+0.025)
every time a bomb is used (+0.025)
on every frame in which the rand value (shown in debug
mode) is evenly divisible by
(1800 - (lives × 200) - (bombs × 50)) (+0.025)
every time Reimu got hit (set to 0 if higher, then -0.05)
when using a continue (set to -0.05 if higher, then -0.125)
Apparently, ZUN noted that these deltas couldn't be losslessly stored in an
IEEE 754 floating-point variable, and therefore didn't store the pellet
speed factor exactly in a way that would correspond to its gameplay effect.
Instead, it's stored similar to Q12.4 subpixels: as a simple integer,
pre-multiplied by 40. This results in a raw range of -15 to 20, which is
what the undecompiled ASM calls still use. When spawning a new pellet, its
base speed is first multiplied by that factor, and then divided by 40 again.
This is actually quite smart: The calculation doesn't need to be aware of
either Q12.4 or the 40× format, as
((Q12.4 * factor×40) / factor×40) still comes out as a
Q12.4 subpixel even if all numbers are integers. The only limiting issue
here would be the potential overflow of the 16-bit multiplication at
unadjusted base speeds of more than 50 pixels per frame, but that'd be
seriously unplayable.
So yeah, pellet speed modifications are indeed gradual, and don't just fall
into the coarse three "high, normal, and low" categories.
That's ⅝ of P0160 done, and the continue and pause menus would make good
candidates to fill up the remaining ⅜… except that it seemed impossible to
figure out the correct compiler options for this code?
The issues centered around the two effects of Turbo C++ 4.0J's
-O switch:
Optimizing jump instructions: merging duplicate successive jumps into a
single one, and merging duplicated instructions at the end of conditional
branches into a single place under a single branch, which the other branches
then jump to
Compressing ADD SP and POP CX
stack-clearing instructions after multiple successive CALLs to
__cdecl functions into a single ADD SP with the
combined parameter stack size of all function calls
But how can the ASM for these functions exhibit #1 but not #2? How
can it be seemingly optimized and unoptimized at the same time? The
only option that gets somewhat close would be -O- -y, which
emits line number information into the .OBJ files for debugging. This
combination provides its own kind of #1, but these functions clearly need
the real deal.
The research into this issue ended up consuming a full push on its own.
In the end, this solution turned out to be completely unrelated to compiler
options, and instead came from the effects of a compiler bug in a totally
different place. Initializing a local structure instance or array like
const uint4_t flash_colors[3] = { 3, 4, 5 };
always emits the { 3, 4, 5 } array into the program's data
segment, and then generates a call to the internal SCOPY@
function which copies this data array to the local variable on the stack.
And as soon as this SCOPY@ call is emitted, the -O
optimization #1 is disabled for the entire rest of the translation
unit?!
So, any code segment with an SCOPY@ call followed by
__cdecl functions must strictly be decompiled from top to
bottom, mirroring the original layout of translation units. That means no
TH01 continue and pause menus before we haven't decompiled the bomb
animation, which contains such an SCOPY@ call. 😕
Luckily, TH01 is the only game where this bug leads to significant
restrictions in decompilation order, as later games predominantly use the
pascal calling convention, in which each function itself clears
its stack as part of its RET instruction.
What now, then? With 51% of REIIDEN.EXE decompiled, we're
slowly running out of small features that can be decompiled within ⅜ of a
push. Good that I haven't been looking a lot into OP.EXE and
FUUIN.EXE, which pretty much only got easy pieces of
code left to do. Maybe I'll end up finishing their decompilations entirely
within these smaller gaps? I still ended up finding one more small
piece in REIIDEN.EXE though: The particle system, seen in the
Mima fight.
I like how everything about this animation is contained within a single
function that is called once per frame, but ZUN could have really
consolidated the spawning code for new particles a bit. In Mima's fight,
particles are only spawned from the top and right edges of the screen, but
the function in fact contains unused code for all other 7 possible
directions, written in quite a bloated manner. This wouldn't feel quite as
unused if ZUN had used an angle parameter instead…
Also, why unnecessarily waste another 40 bytes of
the BSS segment?
But wait, what's going on with the very first spawned particle that just
stops near the bottom edge of the screen in the video above? Well, even in
such a simple and self-contained function, ZUN managed to include an
off-by-one error. This one then results in an out-of-bounds array access on
the 80th frame, where the code attempts to spawn a 41st
particle. If the first particle was unlucky to be both slow enough and
spawned away far enough from the bottom and right edges, the spawning code
will then kill it off before its unblitting code gets to run, leaving its
pixel on the screen until something else overlaps it and causes it to be
unblitted.
Which, during regular gameplay, will quickly happen with the Orb, all the
pellets flying around, and your own player movement. Also, the RNG can
easily spawn this particle at a position and velocity that causes it to
leave the screen more quickly. Kind of impressive how ZUN laid out the
structure
of arrays in a way that ensured practically no effect of this bug on the
game; this glitch could have easily happened every 80 frames instead.
He almost got close to all bugs canceling out each other here!
Next up: The player control functions, including the second-biggest function
in all of PC-98 Touhou.
Back to TH01, and its boss sprite format… with a separate class for
storing animations that only differs minutely from the
📝 regular boss entity class I covered last time?
Decompiling this class was almost free, and the main reason why the first
of these pushes ended up looking pretty huge.
Next up were the remaining shape drawing functions from the code segment
that started with the .GRC functions. P0105 already started these with the
(surprisingly sanely implemented) 8×8 diamond, star, and… uh, snowflake
(?) sprites
,
prominently seen in the Konngara, Elis, and Sariel fights, respectively.
Now, we've also got:
ellipse arcs with a customizable angle distance between the individual
dots – mostly just used for drawing full circles, though
line loops – which are only used for the rotating white squares around
Mima, meaning that the white star in the YuugenMagan fight got a completely
redundant reimplementation
and the surprisingly weirdest one, drawing the red invincibility
sprites.
The weirdness becomes obvious with just a single screenshot:
First, we've got the obvious issue of the sprites not being clipped at the
right edge of VRAM, with the rightmost pixels in each row of the sprite
extending to the beginning of the next row. Well, that's just what you get
if you insist on writing unique low-level blitting code for the majority
of the individual sprites in the game… 🤷
More importantly though, the sprite sheet looks like this:
So how do we even get these fully filled red diamonds?
Well, turns out that the sprites are never consistently unblitted during
their 8 frames of animation. There is a function that looks
like it unblits the sprite… except that it starts with by enabling the
GRCG and… reading from the first bitplane on the background page?
If this was the EGC, such a read would fill some internal registers with
the contents of all 4 bitplanes, which can then subsequently be blitted to
all 4 bitplanes of any VRAM page with a single memory write. But with the
GRCG in RMW mode, reads do nothing special, and simply copy the memory
contents of one bitplane to the read destination. Maybe ZUN thought
that setting the RMW color to red
also sets some internal 4-plane mask register to match that color?
Instead, the rather random pixels read from the first bitplane are then
used as a mask for a second blit of the same red sprite.
Effectively, this only really "unblits" the invincibility pixels that are
drawn on top of Reimu's sprite. Since Reimu is drawn first, the
invincibility sprites are overwritten anyway. But due to the palette color
layout of Reimu's sprite, its pixels end up fully masking away any
invincibility sprite pixels in that second blit, leaving VRAM untouched as
a result. Anywhere else though, this animation quickly turns into the
union of all animation frames.
Then again, if that 16-dot-aligned rectangular unblitting function is all
you know about the EGC, and you can't be bothered to write a perfect
unblitter for 8×8 sprites, it becomes obvious why you wouldn't want to use
it:
Because Reimu would barely be visible under all that flicker. In
comparison, those fully filled diamonds actually look pretty good.
After all that, the remaining time wouldn't have been enough for the next
few essential classes, so I closed out the push with three more VRAM
effects instead:
Single-bitplane pixel inversion inside a 32×32 square – the main effect
behind the discoloration seen in the bomb animation, as well as the
expanding squares at the end of Kikuri's and Sariel's entrance
animation
EGC-accelerated VRAM row copies – the second half of smooth and fully
hardware-accelerated scrolling for backgrounds that are twice the size of
VRAM
And finally, the VRAM page content transition function using meshed 8×8
squares, used for the blocky transition to Sariel's first and second phases.
Which is quite ridiculous in just how needlessly bloated it is. I'm positive
that this sort of thing could have also been accelerated using the PC-98's
EGC… although simply writing better C would have already gone a long way.
The function also comes with three unused mesh patterns.
And with that, ReC98, as a whole, is not only ⅓ done, but I've also fully
caught up with the feature backlog for the first time in the history of
this crowdfunding! Time to go into maintenance mode then, while we wait
for the next pushes to be funded. Got a huge backlog of tiny maintenance
issues to address at a leisurely pace, and of course there's also the
📝 16-bit build system waiting to be
finished.