Whew, TH01's boss code just had to end with another beast of a boss, taking
way longer than it should have and leaving uncomfortably little time for the
rest of the game. Let's get right into the overview of YuugenMagan, the most
sequential and scripted battle in this game:
The fight consists of 14 phases, numbered (of course) from 0 to 13.
Unlike all other bosses, the "entrance phase" 0 is a proper gameplay-enabled
part of the fight itself, which is why I also count it here.
YuugenMagan starts with 16 HP, second only to Sariel's 18+6. The HP bar
visualizes the HP threshold for the end of phases 3 (white part) and 7
(red-white part), respectively.
All even-numbered phases change the color of the 邪 kanji in the stage
background, and don't check for collisions between the Orb and any eye.
Almost all of them consequently don't feature an attack, except for phase
0's 1-pixel lasers, spawning symmetrically from the left and right edges of
the playfield towards the center. Which means that yes, YuugenMagan is in
fact invincible during this first attack.
All other attacks are part of the odd-numbered phases:
Phase 1: Slow pellets from the lateral eyes. Ends
at 15 HP.
Phase 3: Missiles from the southern eyes, whose
angles first shift away from Reimu's tracked position and then towards
it. Ends at 12 HP.
Phase 5: Circular pellets sprayed from the lateral
eyes. Ends at 10 HP.
Phase 7: Another missile pattern, but this time
with both eyes shifting their missile angles by the same
(counter-)clockwise delta angles. Ends at 8 HP.
Phase 9: The 3-pixel 3-laser sequence from the
northern eye. Ends at 2 HP.
Phase 11: Spawns the pentagram with one corner out
of every eye, then gradually shrinks and moves it towards the center of
the playfield. Not really an "attack" (surprise) as the pentagram can't
reach the player during this phase, but collision detection is
technically already active here. Ends at 0 HP, marking the earliest
point where the fight itself can possibly end.
Phase 13: Runs through the parallel "pentagram
attack phases". The first five consist of the pentagram alternating its
spinning direction between clockwise and counterclockwise while firing
pellets from each of the five star corners. After that, the pentagram
slams itself into the player, before YuugenMagan loops back to phase
10 to spawn a new pentagram. On the next run through phase 13, the
pentagram grows larger and immediately slams itself into the player,
before starting a new pentagram attack phase cycle with another loop
back to phase 10.
Since the HP bar fills up in a phase with no collision detection,
YuugenMagan is immune to
📝 test/debug mode heap corruption. It's
generally impossible to get YuugenMagan's HP into negative numbers, with
collision detection being disabled every other phase, and all odd-numbered
phases ending immediately upon reaching their HP threshold.
All phases until the very last one have a timeout condition, independent
from YuugenMagan's current HP:
Phase 0: 331 frames
Phase 1: 1101 frames
Phases 2, 4, 6, 8, 10, and 12: 70 frames each
Phases 3 and 7: 5 iterations of the pattern, or
1845 frames each
Phase 5: 5 iterations of the pattern, or 2230
frames
Phase 9: The full duration of the sequence, or 491
frames
Phase 11: Until the pentagram reached its target
position, or 221 frames
This makes it possible to reach phase 13 without dealing a single point of
damage to YuugenMagan, after almost exactly 2½ minutes on any difficulty.
Your actual time will certainly be higher though, as you will have to
HARRY UP at least once during the attempt.
And let's be real, you're very likely to subsequently lose a
life.
At a pixel-perfect 81×61 pixels, the Orb hitboxes are laid out rather
generously this time, reaching quite a bit outside the 64×48 eye sprites:
And that's about the only positive thing I can say about a position
calculation in this fight. Phase 0 already starts with the lasers being off
by 1 pixel from the center of the iris. Sure, 28 may be a nicer number to
add than 29, but the result won't be byte-aligned either way? This is
followed by the eastern laser's hitbox somehow being 24 pixels larger than
the others, stretching a rather unexpected 70 pixels compared to the 46 of
every other laser.
On a more hilarious note, the eye closing keyframe contains the following
(pseudo-)code, comprising the only real accidentally "unused" danmaku
subpattern in TH01:
// Did you mean ">= RANK_HARD"?
if(rank == RANK_HARD) {
eye_north.fire_aimed_wide_5_spread();
eye_southeast.fire_aimed_wide_5_spread();
eye_southwest.fire_aimed_wide_5_spread();
// Because this condition can never be true otherwise.
// As a result, no pellets will be spawned on Lunatic mode.
// (There is another Lunatic-exclusive subpattern later, though.)
if(rank == RANK_LUNATIC) {
eye_west.fire_aimed_wide_5_spread();
eye_east.fire_aimed_wide_5_spread();
}
}
After a few utility functions that look more like a quickly abandoned
refactoring attempt, we quickly get to the main attraction: YuugenMagan
combines the entire boss script and most of the pattern code into a single
2,634-instruction function, totaling 9,677 bytes inside
REIIDEN.EXE. For comparison, ReC98's version of this code
consists of at least 49 functions, excluding those I had to add to work
around ZUN's little inconsistencies, or the ones I added for stylistic
reasons.
In fact, this function is so large that Turbo C++ 4.0J refuses to generate
assembly output for it via the -S command-line option, aborting
with a Compiler table limit exceeded in function error.
Contrary to what the Borland C++ 4.0 User Guide suggests, this
instance of the error is not at all related to the number of function bodies
or any metric of algorithmic complexity, but is simply a result of the
compiler's internal text representation for a single function overflowing a
64 KiB memory segment. Merely shortening the names of enough identifiers
within the function can help to get that representation down below 64 KiB.
If you encounter this error during regular software development, you might
interpret it as the compiler's roundabout way of telling you that it inlined
way more function calls than you probably wanted to have inlined. Because
you definitely won't explicitly spell out such a long function
in newly-written code, right?
At least it wasn't the worst copy-pasting job in this
game; that trophy still goes to 📝 Elis. And
while the tracking code for adjusting an eye's sprite according to the
player's relative position is one of the main causes behind all the bloat,
it's also 100% consistent, and might have been an inlined class method in
ZUN's original code as well.
The clear highlight in this fight though? Almost no coordinate is
precisely calculated where you'd expect it to be. In particular, all
bullet spawn positions completely ignore the direction the eyes are facing
to:
Due to their effect on gameplay, these inaccuracies can't even be called
"bugs", and made me devise a new "quirk" category instead. More on that in
the TH01 100% blog post, though.
While we did see an accidentally unused bullet pattern earlier, I can
now say with certainty that there are no truly unused danmaku
patterns in TH01, i.e., pattern code that exists but is never called.
However, the code for YuugenMagan's phase 5 reveals another small piece of
danmaku design intention that never shows up within the parameters of
the original game.
By default, pellets are clipped when they fly past the top of the playfield,
which we can clearly observe for the first few pellets of this pattern.
Interestingly though, the second subpattern actually configures its pellets
to fall straight down from the top of the playfield instead. You never see
this happening in-game because ZUN limited that subpattern to a downwards
angle range of 0x73 or 162°, resulting in none of its pellets
ever getting close to the top of the playfield. If we extend that range to a
full 360° though, we can see how ZUN might have originally planned the
pattern to end:
If we also disregard everything else about YuugenMagan that fits the
upcoming definition of quirk, we're left with 6 "fixable" bugs, all
of which are a symptom of general blitting and unblitting laziness. Funnily
enough, they can all be demonstrated within a short 9-second part of the
fight, from the end of phase 9 up until the pentagram starts spinning in
phase 13:
General flickering whenever any sprite overlaps an eye. This is caused
by only reblitting each eye every 3 frames, and is an issue all throughout
the fight. You might have already spotted it in the videos above.
Each of the two lasers is unblitted and blitted individually instead of
each operation being done for both lasers together. Remember how
📝 ZUN unblits 32 horizontal pixels for every row of a line regardless of its width?
That's why the top part of the left, right-moving laser is never visible,
because it's blitted before the other laser is unblitted.
ZUN forgot to unblit the lasers when phase 9 ends. This footage was
recorded by pressing ↵ Return in test mode (game t or
game d), and it's probably impossible to achieve this during
actual gameplay without TAS techniques. You would have to deal the required
6 points of damage within 491 frames, with the eye being invincible during
240 of them. Simply shooting up an Orb with a horizontal velocity of 0 would
also only work a single time, as boss entities always repel the Orb with a
horizontal velocity of ±4.
The shrinking pentagram is unblitted after the eyes were blitted,
adding another guaranteed frame of flicker on top of the ones in 1). Like in
2), the blockiness of the holes is another result of unblitting 32 pixels
per row at a time.
Another missing unblitting call in a phase transition, as the pentagram
switches from its not quite correctly interpolated shrunk form to a regular
star polygon with a radius of 64 pixels. Indirectly caused by the massively
bloated coordinate calculation for the shrink animation being done
separately for the unblitting and blitting calls. Instead of, y'know, just
doing it once and storing the result in variables that can later be
reused.
The pentagram is not reblitted at all during the first 100 frames of
phase 13. During that rather long time, it's easily possible to remove
it from VRAM completely by covering its area with player shots. Or HARRY UP pellets.
Definitely an appropriate end for this game's entity blitting code.
I'm really looking forward to writing a
proper sprite system for the Anniversary Edition…
And just in case you were wondering about the hitboxes of these pentagrams
as they slam themselves into Reimu:
62 pixels on the X axis, centered around each corner point of the star, 16
pixels below, and extending infinitely far up. The latter part becomes
especially devious because the game always collision-detects
all 5 corners, regardless of whether they've already clipped through
the bottom of the playfield. The simultaneously occurring shape distortions
are simply a result of the line drawing function's rather poor
re-interpolation of any line that runs past the 640×400 VRAM boundaries;
📝 I described that in detail back when I debugged the shootout laser crash.
Ironically, using fixed-size hitboxes for a variable-sized pentagram means
that the larger one is easier to dodge.
The final puzzle in TH01's boss code comes
📝 once again in the form of weird hardware
palette changes. The 邪 kanji on the background
image goes through various colors throughout the fight, which ZUN
implemented by gradually incrementing and decrementing either a single one
or none of the color's three 4-bit components at the beginning of each
even-numbered phase. The resulting color sequence, however, doesn't
quite seem to follow these simple rules:
Phase 0: #DD5邪
Phase 2: #0DF邪
Phase 4: #F0F邪
Phase 6: #00F邪, but at the
end of the phase?!
Phase 8: #0FF邪, at the start
of the phase, #0F5邪, at the end!?
Phase 10: #FF5邪, at the start of
the phase, #F05邪, at the end
Second repetition of phase 12: #005邪
shortly after the start of the phase?!
Adding some debug output sheds light on what's going on there:
Yup, ZUN had so much trust in the color clamping done by his hardware
palette functions that he did not clamp the increment operation on the
stage_palette itself. Therefore, the 邪
colors and even the timing of their changes from Phase 6 onwards are
"defined" by wildly incrementing color components beyond their intended
domain, so much that even the underlying signed 8-bit integer ends up
overflowing. Given that the decrement operation on the
stage_paletteis clamped though, this might be another
one of those accidents that ZUN deliberately left in the game,
📝 similar to the conclusion I reached with infinite bumper loops.
But guess what, that's also the last time we're going to encounter this type
of palette component domain quirk! Later games use master.lib's 8-bit
palette system, which keeps the comfort of using a single byte per
component, but shifts the actual hardware color into the top 4 bits, leaving
the bottom 4 bits for added precision during fades.
OK, but now we're done with TH01's bosses! 🎉That was the
8th PC-98 Touhou boss in total, leaving 23 to go.
With all the necessary research into these quirks going well into a fifth
push, I spent the remaining time in that one with transferring most of the
data between YuugenMagan and the upcoming rest of REIIDEN.EXE
into C land. This included the one piece of technical debt in TH01 we've
been carrying around since March 2015, as well as the final piece of the
ending sequence in FUUIN.EXE. Decompiling that executable's
main() function in a meaningful way requires pretty much all
remaining data from REIIDEN.EXE to also be moved into C land,
just in case you were wondering why we're stuck at 99.46% there.
On a more disappointing note, the static initialization code for the
📝 5 boss entity slots ultimately revealed why
YuugenMagan's code is as bloated and redundant as it is: The 5 slots really
are 5 distinct variables rather than a single 5-element array. That's why
ZUN explicitly spells out all 5 eyes every time, because the array he could
have just looped over simply didn't exist. 😕 And while these slot variables
are stored in a contiguous area of memory that I could just have
taken the address of and then indexed it as if it were an array, I
didn't want to annoy future port authors with what would technically be
out-of-bounds array accesses for purely stylistic reasons. At least it
wasn't that big of a deal to rewrite all boss code to use these distinct
variables, although I certainly had to get a bit creative with Elis.
Next up: Finding out how many points we got in totle, and hoping that ZUN
didn't hide more unexpected complexities in the remaining 45 functions of
this game. If you have to spare, there are two ways
in which that amount of money would help right now:
I'm expecting another subscription transaction
from Yanga before the 15th, which would leave to
round out one final TH01 RE push. With that, there'd be a total of 5 left in
the backlog, which should be enough to get the rest of this game done.
I really need to address the performance and usability issues
with all the small videos in this blog. Just look at the video immediately
above, where I disabled the controls because they would cover the debug text
at the bottom… Edit (2022-10-31):… which no longer is an
issue with our 📝 custom video player.
I already reserved this month's anonymous contribution for this work, so it would take another to be turned into a full push.
What's this? A simple, straightforward, easy-to-decompile TH01 boss with
just a few minor quirks and only two rendering-related ZUN bugs? Yup, 2½
pushes, and Kikuri was done. Let's get right into the overview:
Just like 📝 Elis, Kikuri's fight consists
of 5 phases, excluding the entrance animation. For some reason though, they
are numbered from 2 to 6 this time, skipping phase 1? For consistency, I'll
use the original phase numbers from the source code in this blog post.
The main phases (2, 5, and 6) also share Elis' HP boundaries of 10, 6,
and 0, respectively, and are once again indicated by different colors in the
HP bar. They immediately end upon reaching the given number of HP, making
Kikuri immune to the
📝 heap corruption in test or debug mode that can happen with Elis and Konngara.
Phase 2 solely consists of the infamous big symmetric spiral
pattern.
Phase 3 fades Kikuri's ball of light from its default bluish color to bronze over 100 frames. Collision detection is deactivated
during this phase.
In Phase 4, Kikuri activates her two souls while shooting the spinning
8-pellet circles from the previously activated ball. The phase ends shortly
after the souls fired their third spread pellet group.
Note that this is a timed phase without an HP boundary, which makes
it possible to reduce Kikuri's HP below the boundaries of the next
phases, effectively skipping them. Take this video for example,
where Kikuri has 6 HP by the end of Phase 4, and therefore directly
starts Phase 6.
(Obviously, Kikuri's HP can also be reduced to 0 or below, which will
end the fight immediately after this phase.)
Phase 5 combines the teardrop/ripple "pattern" from the souls with the
"two crossed eye laser" pattern, on independent cycles.
Finally, Kikuri cycles through her remaining 4 patterns in Phase 6,
while the souls contribute single aimed pellets every 200 frames.
Interestingly, all HP-bounded phases come with an additional hidden
timeout condition:
Phase 2 automatically ends after 6 cycles of the spiral pattern, or
5,400 frames in total.
Phase 5 ends after 1,600 frames, or the first frame of the
7th cycle of the two crossed red lasers.
If you manage to keep Kikuri alive for 29 of her Phase 6 patterns,
her HP are automatically set to 1. The HP bar isn't redrawn when this
happens, so there is no visual indication of this timeout condition even
existing – apart from the next Orb hit ending the fight regardless of
the displayed HP. Due to the deterministic order of patterns, this
always happens on the 8th cycle of the "symmetric gravity
pellet lines from both souls" pattern, or 11,800 frames. If dodging and
avoiding orb hits for 3½ minutes sounds tiring, you can always watch the
byte at DS:0x1376 in your emulator's memory viewer. Once
it's at 0x1E, you've reached this timeout.
So yeah, there's your new timeout challenge.
The few issues in this fight all relate to hitboxes, starting with the main
one of Kikuri against the Orb. The coordinates in the code clearly describe
a hitbox in the upper center of the disc, but then ZUN wrote a < sign
instead of a > sign, resulting in an in-game hitbox that's not
quite where it was intended to be…
Much worse, however, are the teardrop ripples. It already starts with their
rendering routine, which places the sprites from TAMAYEN.PTN
at byte-aligned VRAM positions in the ultimate piece of if(…) {…}
else if(…) {…} else if(…) {…} meme code. Rather than
tracking the position of each of the five ripple sprites, ZUN suddenly went
purely functional and manually hardcoded the exact rendering and collision
detection calls for each frame of the animation, based on nothing but its
total frame counter.
Each of the (up to) 5 columns is also unblitted and blitted individually
before moving to the next column, starting at the center and then
symmetrically moving out to the left and right edges. This wouldn't be a
problem if ZUN's EGC-powered unblitting function didn't word-align its X
coordinates to a 16×1 grid. If the ripple sprites happen to start at an
odd VRAM byte position, their unblitting coordinates get rounded both down
and up to the nearest 16 pixels, thus touching the adjacent 8 pixels of the
previously blitted columns and leaving the well-known black vertical bars in
their place.
OK, so where's the hitbox issue here? If you just look at the raw
calculation, it's a slightly confusingly expressed, but perfectly logical 17
pixels. But this is where byte-aligned blitting has a direct effect on
gameplay: These ripples can be spawned at any arbitrary, non-byte-aligned
VRAM position, and collisions are calculated relative to this internal
position. Therefore, the actual hitbox is shifted up to 7 pixels to the
right, compared to where you would expect it from a ripple sprite's
on-screen position:
We've previously seen the same issue with the
📝 shot hitbox of Elis' bat form, where
pixel-perfect collision detection against a byte-aligned sprite was merely a
sidenote compared to the more serious X=Y coordinate bug. So why do I
elevate it to bug status here? Because it directly affects dodging: Reimu's
regular movement speed is 4 pixels per frame, and with the internal position
of an on-screen ripple sprite varying by up to 7 pixels, any micrododging
(or "grazing") attempt turns into a coin flip. It's sort of mitigated
by the fact that Reimu is also only ever rendered at byte-aligned
VRAM positions, but I wouldn't say that these two bugs cancel out each
other.
Oh well, another set of rendering issues to be fixed in the hypothetical
Anniversary Edition – obviously, the hitboxes should remain unchanged. Until
then, you can always memorize the exact internal positions. The sequence of
teardrop spawn points is completely deterministic and only controlled by the
fixed per-difficulty spawn interval.
Aside from more minor coordinate inaccuracies, there's not much of interest
in the rest of the pattern code. In another parallel to Elis though, the
first soul pattern in phase 4 is aimed on every difficulty except
Lunatic, where the pellets are once again statically fired downwards. This
time, however, the pattern's difficulty is much more appropriately
distributed across the four levels, with the simultaneous spinning circle
pellets adding a constant aimed component to every difficulty level.
That brings us to 5 fully decompiled PC-98 Touhou bosses, with 26 remaining…
and another ½ of a push going to the cutscene code in
FUUIN.EXE.
You wouldn't expect something as mundane as the boss slideshow code to
contain anything interesting, but there is in fact a slight bit of
speculation fuel there. The text typing functions take explicit string
lengths, which precisely match the corresponding strings… for the most part.
For the "Gatekeeper 'SinGyoku'" string though, ZUN passed 23
characters, not 22. Could that have been the "h" from the Hepburn
romanization of 神玉?!
Also, come on, if this text is already blitted to VRAM for no reason,
you could have gone for perfect centering at unaligned byte positions; the
rendering function would have perfectly supported it. Instead, the X
coordinates are still rounded up to the nearest byte.
The hardcoded ending cutscene functions should be even less interesting –
don't they just show a bunch of images followed by frame delays? Until they
don't, and we reach the 地獄/Jigoku Bad Ending with
its special shake/"boom" effect, and this picture:
Which is rendered by the following code:
for(int i = 0; i <= boom_duration; i++) { // (yes, off-by-one)
if((i & 3) == 0) {
graph_scrollup(8);
} else {
graph_scrollup(0);
}
end_pic_show(1); // ← different picture is rendered
frame_delay(2); // ← blocks until 2 VSync interrupts have occurred
if(i & 1) {
end_pic_show(2); // ← picture above is rendered
} else {
end_pic_show(1);
}
}
Notice something? You should never see this picture because it's
immediately overwritten before the frame is supposed to end. And yet
it's clearly flickering up for about one frame with common emulation
settings as well as on my real PC-9821 Nw133, clocked at 133 MHz.
master.lib's graph_scrollup() doesn't block until VSync either,
and removing these calls doesn't change anything about the blitted images.
end_pic_show() uses the EGC to blit the given 320×200 quarter
of VRAM from page 1 to the visible page 0, so the bottleneck shouldn't be
there either…
…or should it? After setting it up via a few I/O port writes, the common
method of EGC-powered blitting works like this:
Read 16 bits from the source VRAM position on any single
bitplane. This fills the EGC's 4 16-bit tile registers with the VRAM
contents at that specific position on every bitplane. You do not care
about the value the CPU returns from the read – in optimized code, you would
make sure to just read into a register to avoid useless additional stores
into local variables.
Write any 16 bits
to the target VRAM position on any single bitplane. This copies the
contents of the EGC's tile registers to that specific position on
every bitplane.
To transfer pixels from one VRAM page to another, you insert an additional
write to I/O port 0xA6 before 1) and 2) to set your source and
destination page… and that's where we find the bottleneck. Taking a look at
the i486 CPU and its cycle
counts, a single one of these page switches costs 17 cycles – 1 for
MOVing the page number into AL, and 16 for the
OUT instruction itself. Therefore, the 8,000 page switches
required for EGC-copying a 320×200-pixel image require 136,000 cycles in
total.
And that's the optimal case of using only those two
instructions. 📝 As I implied last time, TH01
uses a function call for VRAM page switches, complete with creating
and destroying a useless stack frame and unnecessarily updating a global
variable in main memory. I tried optimizing ZUN's code by throwing out
unnecessary code and using 📝 pseudo-registers
to generate probably optimal assembly code, and that did speed up the
blitting to almost exactly 50% of the original version's run time. However,
it did little about the flickering itself. Here's a comparison of the first
loop with boom_duration = 16, recorded in DOSBox-X with
cputype=auto and cycles=max, and with
i overlaid using the text chip. Caution, flashing lights:
I pushed the optimized code to the th01_end_pic_optimize
branch, to also serve as an example of how to get close to optimal code out
of Turbo C++ 4.0J without writing a single ASM instruction.
And if you really want to use the EGC for this, that's the best you can do.
It really sucks that it merely expanded the GRCG's 4×8-bit tile register to
4×16 bits. With 32 bits, ≥386 CPUs could have taken advantage of their wider
registers and instructions to double the blitting performance. Instead, we
now know the reason why
📝 Promisence Soft's EGC-powered sprite driver that ZUN later stole for TH03
is called SPRITE16 and not SPRITE32. What a massive disappointment.
But what's perhaps a bigger surprise: Blitting planar
images from main memory is much faster than EGC-powered inter-page
VRAM copies, despite the required manual access to all 4 bitplanes. In
fact, the blitting functions for the .CDG/.CD2 format, used from TH03
onwards, would later demonstrate the optimal method of using REP
MOVSD for blitting every line in 32-pixel chunks. If that was also
used for these ending images, the core blitting operation would have taken
((12 + (3 × (320 / 32))) × 200 × 4) =
33,600 cycles, with not much more overhead for the surrounding row
and bitplane loops. Sure, this doesn't factor in the whole infamous issue of
VRAM being slow on PC-98, but the aforementioned 136,000 cycles don't even
include any actual blitting either. And as you move up to later PC-98
models with Pentium CPUs, the gap between OUT and REP
MOVSD only becomes larger. (Note that the page I linked above has a
typo in the cycle count of REP MOVSD on Pentium CPUs: According
to the original Intel Architecture and Programming Manual, it's
13+𝑛, not 3+𝑛.)
This difference explains why later games rarely use EGC-"accelerated"
inter-page VRAM copies, and keep all of their larger images in main memory.
It especially explains why TH04 and TH05 can get away with naively redrawing
boss backdrop images on every frame.
In the end, the whole fact that ZUN did not define how long this image
should be visible is enough for me to increment the game's overall bug
counter. Who would have thought that looking at endings of all things
would teach us a PC-98 performance lesson… Sure, optimizing TH01 already
seemed promising just by looking at its bloated code, but I had no idea that
its performance issues extended so far past that level.
That only leaves the common beginning part of all endings and a short
main() function before we're done with FUUIN.EXE,
and 98 functions until all of TH01 is decompiled! Next up: SinGyoku, who not
only is the quickest boss to defeat in-game, but also comes with the least
amount of code. See you very soon!
Here we go, TH01 Sariel! This is the single biggest boss fight in all of
PC-98 Touhou: If we include all custom effect code we previously decompiled,
it amounts to a total of 10.31% of all code in TH01 (and 3.14%
overall). These 8 pushes cover the final 8.10% (or 2.47% overall),
and are likely to be the single biggest delivery this project will ever see.
Considering that I only managed to decompile 6.00% across all games in 2021,
2022 is already off to a much better start!
So, how can Sariel's code be that large? Well, we've got:
16 danmaku patterns; including the one snowflake detonating into a giant
94×32 hitbox
Gratuitous usage of floating-point variables, bloating the binary thanks
to Turbo C++ 4.0J's particularly horrid code generation
The hatching birds that shoot pellets
3 separate particle systems, sharing the general idea, overall code
structure, and blitting algorithm, but differing in every little detail
The "gust of wind" background transition animation
5 sets of custom monochrome sprite animations, loaded from
BOSS6GR?.GRC
A further 3 hardcoded monochrome 8×8 sprites for the "swaying leaves"
pattern during the second form
In total, it's just under 3,000 lines of C++ code, containing a total of 8
definite ZUN bugs, 3 of them being subpixel/pixel confusions. That might not
look all too bad if you compare it to the
📝 player control function's 8 bugs in 900 lines of code,
but given that Konngara had 0… (Edit (2022-07-17):
Konngara contains two bugs after all: A
📝 possible heap corruption in test or debug mode,
and the infamous
📝 temporary green discoloration.)
And no, the code doesn't make it obvious whether ZUN coded Konngara or
Sariel first; there's just as much evidence for either.
Some terminology before we start: Sariel's first form is separated
into four phases, indicated by different background images, that
cycle until Sariel's HP reach 0 and the second, single-phase form
starts. The danmaku patterns within each phase are also on a cycle,
and the game picks a random but limited number of patterns per phase before
transitioning to the next one. The fight always starts at pattern 1 of phase
1 (the random purple lasers), and each new phase also starts at its
respective first pattern.
Sariel's bugs already start at the graphics asset level, before any code
gets to run. Some of the patterns include a wand raise animation, which is
stored in BOSS6_2.BOS:
The "lowered wand" sprite is missing in this file simply because it's
captured from the regular background image in VRAM, at the beginning of the
fight and after every background transition. What I previously thought to be
📝 background storage code has therefore a
different meaning in Sariel's case. Since this captured sprite is fully
opaque, it will reset the entire 128×128 wand area… wait, 128×128, rather
than 96×96? Yup, this lowered sprite is larger than necessary, wasting 1,967
bytes of conventional memory. That still doesn't quite explain the
second sprite in BOSS6_2.BOS though. Turns out that the black
part is indeed meant to unblit the purple reflection (?) in the first
sprite. But… that's not how you would correctly unblit that?
The first sprite already eats up part of the red HUD line, and the second
one additionally fails to recover the seal pixels underneath, leaving a nice
little black hole and some stray purple pixels until the next background
transition. Quite ironic given that both
sprites do include the right part of the seal, which isn't even part of the
animation.
Just like Konngara, Sariel continues the approach of using a single function
per danmaku pattern or custom entity. While I appreciate that this allows
all pattern- and entity-specific state to be scoped locally to that one
function, it quickly gets ugly as soon as such a function has to do more than one thing.
The "bird function" is particularly awful here: It's just one if(…)
{…} else if(…) {…} else if(…) {…} chain with different
branches for the subfunction parameter, with zero shared code between any of
these branches. It also uses 64-bit floating-point double as
its subpixel type… and since it also takes four of those as parameters
(y'know, just in case the "spawn new bird" subfunction is called), every
call site has to also push four double values onto the stack.
Thanks to Turbo C++ even using the FPU for pushing a 0.0 constant, we
have already reached maximum floating-point decadence before even having
seen a single danmaku pattern. Why decadence? Every possible spawn position
and velocity in both bird patterns just uses pixel resolution, with no
fractional component in sight. And there goes another 720 bytes of
conventional memory.
Speaking about bird patterns, the red-bird one is where we find the first
code-level ZUN bug: The spawn cross circle sprite suddenly disappears after
it finished spawning all the bird eggs. How can we tell it's a bug? Because
there is code to smoothly fly this sprite off the playfield, that
code just suddenly forgets that the sprite's position is stored in Q12.4
subpixels, and treats it as raw screen pixels instead.
As a result, the well-intentioned 640×400
screen-space clipping rectangle effectively shrinks to 38×23 pixels in the
top-left corner of the screen. Which the sprite is always outside of, and
thus never rendered again.
The intended animation is easily restored though:
Also, did you know that birds actually have a quite unfair 14×38-pixel
hitbox? Not that you'd ever collide with them in any of the patterns…
Another 3 of the 8 bugs can be found in the symmetric, interlaced spawn rays
used in three of the patterns, and the 32×32 debris "sprites" shown at their endpoint, at
the edge of the screen. You kinda have to commend ZUN's attention to detail
here, and how he wrote a lot of code for those few rapidly animated pixels
that you most likely don't
even notice, especially with all the other wrong pixels
resulting from rendering glitches. One of the bugs in the very final pattern
of phase 4 even turns them into the vortex sprites from the second pattern
in phase 1 during the first 5 frames of
the first time the pattern is active, and I had to single-step the blitting
calls to verify it.
It certainly was annoying how much time I spent making sense of these bugs,
and all weird blitting offsets, for just a few pixels… Let's look at
something more wholesome, shall we?
So far, we've only seen the PC-98 GRCG being used in RMW (read-modify-write)
mode, which I previously
📝 explained in the context of TH01's red-white HP pattern.
The second of its three modes, TCR (Tile Compare Read), affects VRAM reads
rather than writes, and performs "color extraction" across all 4 bitplanes:
Instead of returning raw 1bpp data from one plane, a VRAM read will instead
return a bitmask, with a 1 bit at every pixel whose full 4-bit color exactly
matches the color at that offset in the GRCG's tile register, and 0
everywhere else. Sariel uses this mode to make sure that the 2×2 particles
and the wind effect are only blitted on top of "air color" pixels, with
other parts of the background behaving like a mask. The algorithm:
Set the GRCG to TCR mode, and all 8 tile register dots to the air
color
Read N bits from the target VRAM position to obtain an N-bit mask where
all 1 bits indicate air color pixels at the respective position
AND that mask with the alpha plane of the sprite to be drawn, shifted to
the correct start bit within the 8-pixel VRAM byte
Set the GRCG to RMW mode, and all 8 tile register dots to the color that
should be drawn
Write the previously obtained bitmask to the same position in VRAM
Quite clever how the extracted colors double as a secondary alpha plane,
making for another well-earned good-code tag. The wind effect really doesn't deserve it, though:
ZUN calculates every intermediate result inside this function
over and over and over again… Together with some ugly
pointer arithmetic, this function turned into one of the most tedious
decompilations in a long while.
This gradual effect is blitted exclusively to the front page of VRAM,
since parts of it need to be unblitted to create the illusion of a gust of
wind. Then again, anything that moves on top of air-colored background –
most likely the Orb – will also unblit whatever it covered of the effect…
As far as I can tell, ZUN didn't use TCR mode anywhere else in PC-98 Touhou.
Tune in again later during a TH04 or TH05 push to learn about TDW, the final
GRCG mode!
Speaking about the 2×2 particle systems, why do we need three of them? Their
only observable difference lies in the way they move their particles:
Up or down in a straight line (used in phases 4 and 2,
respectively)
Left or right in a straight line (used in the second form)
Left and right in a sinusoidal motion (used in phase 3, the "dark
orange" one)
Out of all possible formats ZUN could have used for storing the positions
and velocities of individual particles, he chose a) 64-bit /
double-precision floating-point, and b) raw screen pixels. Want to take a
guess at which data type is used for which particle system?
If you picked double for 1) and 2), and raw screen pixels for
3), you are of course correct! Not that I'm implying
that it should have been the other way round – screen pixels would have
perfectly fit all three systems use cases, as all 16-bit coordinates
are extended to 32 bits for trigonometric calculations anyway. That's what,
another 1.080 bytes of wasted conventional memory? And that's even
calculated while keeping the current architecture, which allocates
space for 3×30 particles as part of the game's global data, although only
one of the three particle systems is active at any given time.
That's it for the first form, time to put on "Civilization
of Magic"! Or "死なばもろとも"? Or "Theme of 地獄めくり"? Or whatever SYUGEN is
supposed to mean…
… and the code of these final patterns comes out roughly as exciting as
their in-game impact. With the big exception of the very final "swaying
leaves" pattern: After 📝 Q4.4,
📝 Q28.4,
📝 Q24.8, and double variables,
this pattern uses… decimal subpixels? Like, multiplying the number by
10, and using the decimal one's digit to represent the fractional part?
Well, sure, if you really insist on moving the leaves in cleanly
represented integer multiples of ⅒, which is infamously impossible in IEEE
754. Aside from aesthetic reasons, it only really combines less precision
(10 possible fractions rather than the usual 16) with the inferior
performance of having to use integer divisions and multiplications rather
than simple bit shifts. And it's surely not because the leaf sprites needed
an extended integer value range of [-3276, +3276], compared to
Q12.4's [-2047, +2048]: They are clipped to 640×400 screen space
anyway, and are removed as soon as they leave this area.
This pattern also contains the second bug in the "subpixel/pixel confusion
hiding an entire animation" category, causing all of
BOSS6GR4.GRC to effectively become unused:
At least their hitboxes are what you would expect, exactly covering the
30×30 pixels of Reimu's sprite. Both animation fixes are available on the th01_sariel_fixes
branch.
After all that, Sariel's main function turned out fairly unspectacular, just
putting everything together and adding some shake, transition, and color
pulse effects with a bunch of unnecessary hardware palette changes. There is
one reference to a missing BOSS6.GRP file during the
first→second form transition, suggesting that Sariel originally had a
separate "first form defeat" graphic, before it was replaced with just the
shaking effect in the final game.
Speaking about the transition code, it is kind of funny how the… um,
imperative and concrete nature of TH01 leads to these 2×24
lines of straight-line code. They kind of look like ZUN rattling off a
laundry list of subsystems and raw variables to be reinitialized, making
damn sure to not forget anything.
Whew! Second PC-98 Touhou boss completely decompiled, 29 to go, and they'll
only get easier from here! 🎉 The next one in line, Elis, is somewhere
between Konngara and Sariel as far as x86 instruction count is concerned, so
that'll need to wait for some additional funding. Next up, therefore:
Looking at a thing in TH03's main game code – really, I have little
idea what it will be!
Now that the store is open again, also check out the
📝 updated RE progress overview I've posted
together with this one. In addition to more RE, you can now also directly
order a variety of mods; all of these are further explained in the order
form itself.
OK, TH01 missile bullets. Can we maybe have a well-behaved entity type,
without any weirdness? Just once?
Ehh, kinda. Apart from another 150 bytes wasted on unused structure members,
this code is indeed more on the low end in terms of overall jank. It does
become very obvious why dodging these missiles in the YuugenMagan, Mima, and
Elis fights feels so awful though: An unfair 46×46 pixel hitbox around
Reimu's center pixel, combined with the comeback of
📝 interlaced rendering, this time in every
stage. ZUN probably did this because missiles are the only 16×16 sprite in
TH01 that is blitted to unaligned X positions, which effectively ends up
touching a 32×16 area of VRAM per sprite.
But even if we assume VRAM writes to be the bottleneck here, it would
have been totally possible to render every missile in every frame at roughly
the same amount of CPU time that the original game uses for interlaced
rendering:
Note that all missile sprites only use two colors, white and green.
Instead of naively going with the usual four bitplanes, extract the
pixels drawn in each of the two used colors into their own bitplanes.
master.lib calls this the "tiny format".
Use the GRCG to draw these two bitplanes in the intended white and green
colors, halving the amount of VRAM writes compared to the original
function.
(Not using the .PTN format would have also avoided the inconsistency of
storing the missile sprites in boss-specific sprite slots.)
That's an optimization that would have significantly benefitted the game, in
contrast to all of the fake ones
introduced in later games. Then again, this optimization is
actually something that the later games do, and it might have in fact been
necessary to achieve their higher bullet counts without significant
slowdown.
After some effectively unused Mima sprite effect code that is so broken that
it's impossible to make sense out of it, we get to the final feature I
wanted to cover for all bosses in parallel before returning to Sariel: The
separate sprite background storage for moving or animated boss sprites in
the Mima, Elis, and Sariel fights. But, uh… why is this necessary to begin
with? Doesn't TH01 already reserve the other VRAM page for backgrounds?
Well, these sprites are quite big, and ZUN didn't want to blit them from
main memory on every frame. After all, TH01 and TH02 had a minimum required
clock speed of 33 MHz, half of the speed required for the later three games.
So, he simply blitted these boss sprites to both VRAM pages, leading
the usual unblitting calls to only remove the other sprites on top of the
boss. However, these bosses themselves want to move across the screen…
and this makes it necessary to save the stage background behind them
in some other way.
Enter .PTN, and its functions to capture a 16×16 or 32×32 square from VRAM
into a sprite slot. No problem with that approach in theory, as the size of
all these bigger sprites is a multiple of 32×32; splitting a larger sprite
into these smaller 32×32 chunks makes the code look just a little bit clumsy
(and, of course, slower).
But somewhere during the development of Mima's fight, ZUN apparently forgot
that those sprite backgrounds existed. And once Mima's 🚫 casting sprite is
blitted on top of her regular sprite, using just regular sprite
transparency, she ends up with her infamous third arm:
Ironically, there's an unused code path in Mima's unblit function where ZUN
assumes a height of 48 pixels for Mima's animation sprites rather than the
actual 64. This leads to even clumsier .PTN function calls for the bottom
128×16 pixels… Failing to unblit the bottom 16 pixels would have also
yielded that third arm, although it wouldn't have looked as natural. Still
wouldn't say that it was intentional; maybe this casting sprite was just
added pretty late in the game's development?
So, mission accomplished, Sariel unblocked… at 2¼ pushes. That's quite some time left for some smaller stage initialization
code, which bundles a bunch of random function calls in places where they
logically really don't belong. The stage opening animation then adds a bunch
of VRAM inter-page copies that are not only redundant but can't even be
understood without knowing the hidden internal state of the last VRAM page
accessed by previous ZUN code…
In better news though: Turbo C++ 4.0 really doesn't seem to have any
complexity limit on inlining arithmetic expressions, as long as they only
operate on compile-time constants. That's how we get macro-free,
compile-time Shift-JIS to JIS X 0208 conversion of the individual code
points in the 東方★靈異伝 string, in a compiler from 1994. As long as you
don't store any intermediate results in variables, that is…
But wait, there's more! With still ¼ of a push left, I also went for the
boss defeat animation, which includes the route selection after the SinGyoku
fight.
As in all other instances, the 2× scaled font is accomplished by first
rendering the text at regular 1× resolution to the other, invisible VRAM
page, and then scaled from there to the visible one. However, the route
selection is unique in that its scaled text is both drawn transparently on
top of the stage background (not onto a black one), and can also change
colors depending on the selection. It would have been no problem to unblit
and reblit the text by rendering the 1× version to a position on the
invisible VRAM page that isn't covered by the 2× version on the visible one,
but ZUN (needlessly) clears the invisible page before rendering any text.
Instead, he assigned a separate VRAM color for both
the 魔界 and 地獄 options, and only changed the palette value for
these colors to white or gray, depending on the correct selection. This is
another one of the
📝 rare cases where TH01 demonstrates good use of PC-98 hardware,
as the 魔界へ and 地獄へ strings don't need to be reblitted during the selection process, only the Orb "cursor" does.
Then, why does this still not count as good-code? When
changing palette colors, you kinda need to be aware of everything
else that can possibly be on screen, which colors are used there, and which
aren't and can therefore be used for such an effect without affecting other
sprites. In this case, well… hover over the image below, and notice how
Reimu's hair and the bomb sprites in the HUD light up when Makai is
selected:
This push did end on a high note though, with the generic, non-SinGyoku
version of the defeat animation being an easily parametrizable copy. And
that's how you decompile another 2.58% of TH01 in just slightly over three
pushes.
Now, we're not only ready to decompile Sariel, but also Kikuri, Elis, and
SinGyoku without needing any more detours into non-boss code. Thanks to the
current TH01 funding subscriptions, I can plan to cover most, if not all, of
Sariel in a single push series, but the currently 3 pending pushes probably
won't suffice for Sariel's 8.10% of all remaining code in TH01. We've got
quite a lot of not specifically TH01-related funds in the backlog to pass
the time though.
Due to recent developments, it actually makes quite a lot of sense to take a
break from TH01: spaztron64 has
managed what every Touhou download site so far has failed to do: Bundling
all 5 game onto a single .HDI together with pre-configured PC-98
emulators and a nice boot menu, and hosting the resulting package on a
proper website. While this first release is already quite good (and much
better than my attempt from 2014), there is still a bit of room for
improvement to be gained from specific ReC98 research. Next up,
therefore:
Researching how TH04 and TH05 use EMS memory, together with the cause
behind TH04's crash in Stage 5 when playing as Reimu without an EMS driver
loaded, and
reverse-engineering TH03's score data file format
(YUME.NEM), which hopefully also comes with a way of building a
file that unlocks all characters without any high scores.
So, let's finally look at some TH01 gameplay structures! The obvious
choices here are player shots and pellets, which are conveniently located
in the last code segment. Covering these would therefore also help in
transferring some first bits of data in REIIDEN.EXE from ASM
land to C land. (Splitting the data segment would still be quite
annoying.) Player shots are immediately at the beginning…
…but wait, these are drawn as transparent sprites loaded from .PTN files.
Guess we first have to spend a push on
📝 Part 2 of this format.
Hm, 4 functions for alpha-masked blitting and unblitting of both 16×16 and
32×32 .PTN sprites that align the X coordinate to a multiple of 8
(remember, the PC-98 uses a
planar
VRAM memory layout, where 8 pixels correspond to a byte), but only one
function that supports unaligned blitting to any X coordinate, and only
for 16×16 sprites? Which is only called twice? And doesn't come with a
corresponding unblitting function?
Yeah, "unblitting". TH01 isn't
double-buffered,
and uses the PC-98's second VRAM page exclusively to store a stage's
background and static sprites. Since the PC-98 has no hardware sprites,
all you can do is write pixels into VRAM, and any animated sprite needs to
be manually removed from VRAM at the beginning of each frame. Not using
double-buffering theoretically allows TH01 to simply copy back all 128 KB
of VRAM once per frame to do this. But that
would be pretty wasteful, so TH01 just looks at all animated sprites, and
selectively copies only their occupied pixels from the second to the first
VRAM page.
Alright, player shot class methods… oh, wait, the collision functions
directly act on the Yin-Yang Orb, so we first have to spend a push on
that one. And that's where the impression we got from the .PTN
functions is confirmed: The orb is, in fact, only ever displayed at
byte-aligned X coordinates, divisible by 8. It's only thanks to the
constant spinning that its movement appears at least somewhat
smooth.
This is purely a rendering issue; internally, its position is
tracked at pixel precision. Sadly, smooth orb rendering at any unaligned X
coordinate wouldn't be that trivial of a mod, because well, the
necessary functions for unaligned blitting and unblitting of 32×32 sprites
don't exist in TH01's code. Then again, there's so much potential for
optimization in this code, so it might be very possible to squeeze those
additional two functions into the same C++ translation unit, even without
position independence…
More importantly though, this was the right time to decompile the core
functions controlling the orb physics – probably the highlight in these
three pushes for most people.
Well, "physics". The X velocity is restricted to the 5 discrete states of
-8, -4, 0, 4, and 8, and gravity is applied by simply adding 1 to the Y
velocity every 5 frames No wonder that this can
easily lead to situations in which the orb infinitely bounces from the
ground.
At least fangame authors now have
a
reference of how ZUN did it originally, because really, this bad
approximation of physics had to have been written that way on purpose. But
hey, it uses 64-bit floating-point variables!
…sometimes at least, and quite randomly. This was also where I had to
learn about Turbo C++'s floating-point code generation, and how rigorously
it defines the order of instructions when mixing double and
float variables in arithmetic or conditional expressions.
This meant that I could only get ZUN's original instruction order by using
literal constants instead of variables, which is impossible right now
without somehow splitting the data segment. In the end, I had to resort to
spelling out ⅔ of one function, and one conditional branch of another, in
inline ASM. 😕 If ZUN had just written 16.0 instead of
16.0f there, I would have saved quite some hours of my life
trying to decompile this correctly…
To sort of make up for the slowdown in progress, here's the TH01 orb
physics debug mod I made to properly understand them. Edit
(2022-07-12): This mod is outdated,
📝 the current version is here!2020-06-13-TH01OrbPhysicsDebug.zip
To use it, simply replace REIIDEN.EXE, and run the game
in debug mode, via game d on the DOS prompt.
Its code might also serve as an example of how to achieve this sort of
thing without position independence.
Alright, now it's time for player shots though. Yeah, sure, they
don't move horizontally, so it's not too bad that those are also
always rendered at byte-aligned positions. But, uh… why does this code
only use the 16×16 alpha-masked unblitting function for decaying shots,
and just sloppily unblits an entire 16×16 square everywhere else?
The worst part though: Unblitting, moving, and rendering player shots
is done in a single function, in that order. And that's exactly where
TH01's sprite flickering comes from. Since different types of sprites are
free to overlap each other, you'd have to first unblit all types, then
move all types, and then render all types, as done in later
PC-98 Touhou games. If you do these three steps per-type instead, you
will unblit sprites of other types that have been rendered before… and
therefore end up with flicker.
Oh, and finally, ZUN also added an additional sloppy 16×16 square unblit
call if a shot collides with a pellet or a boss, for some
guaranteed flicker. Sigh.
And that's ⅓ of all ZUN code in TH01 decompiled! Next up: Pellets!
Here we go, new C code! …eh, it will still take a bit to really get
decompilation going at the speeds I was hoping for. Especially with the
sheer amount of stuff that is set in the first few significant
functions we actually can decompile, which now all has to be
correctly declared in the C world. Turns out I spent the last 2 years
screwing up the case of exported functions, and even some of their names,
so that it didn't actually reflect their calling convention… yup. That's
just the stuff you tend to forget while it doesn't matter.
To make up for that, I decided to research whether we can make use of some
C++ features to improve code readability after all. Previously, it seemed
that TH01 was the only game that included any C++ code, whereas TH02 and
later seemed to be 100% C and ASM. However, during the development of the
soon to be released new build system, I noticed that even this old
compiler from the mid-90's, infamous for prioritizing compile speeds over
all but the most trivial optimizations, was capable of quite surprising
levels of automatic inlining with class methods…
…leading the research to culminate in the mindblow that is
9d121c7 – yes, we can use C++ class methods
and operator overloading to make the code more readable, while still
generating the same code than if we had just used C and preprocessor
macros.
Looks like there's now the potential for a few pull requests from outside
devs that apply C++ features to improve the legibility of previously
decompiled and terribly macro-ridden code. So, if anyone wants to help
without spending money…