Oh look, it's another rather short and straightforward boss with a rather
small number of bugs and quirks. Yup, contrary to the character's
popularity, Mima's premiere is really not all that special in terms of code,
and continues the trend established with
📝 Kikuri and
📝 SinGyoku. I've already covered
📝 the initial sprite-related bugs last November,
so this post focuses on the main code of the fight itself. The overview:
The TH01 Mima fight consists of 3 phases, with phases 1 and 3 each
corresponding to one half of the 12-HP bar.
📝 Just like with SinGyoku, the distinction
between the red-white and red parts is purely visual once again, and doesn't
reflect anything about the boss script. As usual, all of the phases have to
be completed in order.
Phases 1 and 3 cycle through 4 danmaku patterns each, for a total of 8.
The cycles always start on a fixed pattern.
3 of the patterns in each phase feature rotating white squares, thus
introducing a new sprite in need of being unblitted.
Phase 1 additionally features the "hop pattern" as the last one in its
cycle. This is the only pattern where Mima leaves the seal in the center of
the playfield to hop from one edge of the playfield towards the other, while
also moving slightly higher up on the Y axis, and staying on the final
position for the next pattern cycle. For the first time, Mima selects a
random starting edge, which is then alternated on successive cycles.
Since the square entities are local to the respective pattern function,
Phase 1 can only end once the current pattern is done, even if Mima's HP are
already below 6. This makes Mima susceptible to the
📝 debug mode HP bar heap corruption bug.
Phase 2 simply consists of a spread-in teleport back to Mima's initial
position in the center of the playfield. This would only have been strictly
necessary if phase 1 ended on the hop pattern, but is done regardless of the
previous pattern, and does provide a nice visual separation between the two
main phases.
That's it – nothing special in Phase 3.
And there aren't even any weird hitboxes this time. What is maybe
special about Mima, however, is how there's something to cover about all of
her patterns. Since this is TH01, it's won't surprise anyone that the
rotating square patterns are one giant copy-pasta of unblitting, updating,
and rendering code. At least ZUN placed the core polar→Cartesian
transformation in a separate function for creating regular polygons
with an arbitrary number of sides, which might hint toward some more varied
shapes having been planned at one point?
5 of the 6 patterns even follow the exact same steps during square update
frames:
Calculate square corner coordinates
Unblit the square
Update the square angle and radius
Use the square corner coordinates for spawning pellets or missiles
Recalculate square corner coordinates
Render the square
Notice something? Bullets are spawned before the corner coordinates
are updated. That's why their initial positions seem to be a bit off – they
are spawned exactly in the corners of the square, it's just that it's
the square from 8 frames ago.
Mima's first pattern on Normal difficulty.
Once ZUN reached the final laser pattern though, he must have noticed that
there's something wrong there… or maybe he just wanted to fire those
lasers independently from the square unblit/update/render timer for a
change. Spending an additional 16 bytes of the data segment for conveniently
remembering the square corner coordinates across frames was definitely a
decent investment.
When Mima isn't shooting bullets from the corners of a square or hopping
across the playfield, she's raising flame pillars from the bottom of the playfield within very specifically calculated
random ranges… which are then rendered at byte-aligned VRAM positions, while
collision detection still uses their actual pixel position. Since I don't
want to sound like a broken record all too much, I'll just direct you to
📝 Kikuri, where we've seen the exact same issue with the teardrop ripple sprites.
The conclusions are identical as well.
Mima's flame pillar pattern. This video was recorded on a particularly
unlucky seed that resulted in great disparities between a pillar's
internal X coordinate and its byte-aligned on-screen appearance, leading
to lots of right-shifted hitboxes.
Also note how the change from the meteor animation to the three-arm 🚫
casting sprite doesn't unblit the meteor, and leaves that job to
any sprite that happens to fly over those pixels.
However, I'd say that the saddest part about this pattern is how choppy it
is, with the circle/pillar entities updating and rendering at a meager 7
FPS. Why go that low on purpose when you can just make the game render ✨
smoothly ✨ instead?
So smooth it's almost uncanny.
The reason quickly becomes obvious: With TH01's lack of optimization, going
for the full 56.4 FPS would have significantly slowed down the game on its
intended 33 MHz CPUs, requiring more than cheap surface-level ASM
optimization for a stable frame rate. That might very well have been ZUN's
reason for only ever rendering one circle per frame to VRAM, and designing
the pattern with these time offsets in mind. It's always been typical for
PC-98 developers to target the lowest-spec models that could possibly still
run a game, and implementing dynamic frame rates into such an engine-less
game is nothing I would wish on anybody. And it's not like TH01 is
particularly unique in its choppiness anyway; low frame rates are actually a
rather typical part of the PC-98 game aesthetic.
The final piece of weirdness in this fight can be found in phase 1's hop
pattern, and specifically its palette manipulation. Just from looking at the
pattern code itself, each of the 4 hops is supposed to darken the hardware
palette by subtracting #444 from every color. At the last hop,
every color should have therefore been reduced to a pitch-black
#000, leaving the player completely blind to the movement of
the chasing pellets for 30 frames and making the pattern quite ghostly
indeed. However, that's not what we see in the actual game:
Nothing in the pattern's code would cause the hardware palette to get
brighter before the end of the pattern, and yet…
The expected version doesn't look all too unfair, even on Lunatic…
well, at least at the default rank pellet speed shown in this
video. At maximum pellet speed, it is in fact rather brutal.
Looking at the frame counter, it appears that something outside the
pattern resets the palette every 40 frames. The only known constant with a
value of 40 would be the invincibility frames after hitting a boss with the
Orb, but we're not hitting Mima here…
But as it turns out, that's exactly where the palette reset comes from: The
hop animation darkens the hardware palette directly, while the
📝 infamous 12-parameter boss collision handler function
unconditionally resets the hardware palette to the "default boss palette"
every 40 frames, regardless of whether the boss was hit or not. I'd classify
this as a bug: That function has no business doing periodic hardware palette
resets outside the invincibility flash effect, and it completely defies
common sense that it does.
That explains one unexpected palette change, but could this function
possibly also explain the other infamous one, namely, the temporary green
discoloration in the Konngara fight? That glitch comes down to how the game
actually uses two global "default" palettes: a default boss
palette for undoing the invincibility flash effect, and a default
stage palette for returning the colors back to normal at the end of
the bomb animation or when leaving the Pause menu. And sure enough, the
stage palette is the one with the green color, while the boss
palette contains the intended colors used throughout the fight. Sending the
latter palette to the graphics chip every 40 frames is what corrects
the discoloration, which would otherwise be permanent.
The green color comes from BOSS7_D1.GRP, the scrolling
background of the entrance animation. That's what turns this into a clear
bug: The stage palette is only set a single time in the entire fight,
at the beginning of the entrance animation, to the palette of this image.
Apart from consistency reasons, it doesn't even make sense to set the stage
palette there, as you can't enter the Pause menu or bomb during a blocking
animation function.
And just 3 lines of code later, ZUN loads BOSS8_A1.GRP, the
main background image of the fight. Moving the stage palette assignment
there would have easily prevented the discoloration.
But yeah, as you can tell, palette manipulation is complete jank in this
game. Why differentiate between a stage and a boss palette to begin with?
The blocking Pause menu function could have easily copied the original
palette to a local variable before darkening it, and then restored it after
closing the menu. It's not so easy for bombs as the intended palette could
change between the start and end of the animation, but the code could have
still been simplified a lot if there was just one global "default palette"
variable instead of two. Heck, even the other bosses who manipulate their
palettes correctly only do so because they manually synchronize the two
after every change. The proper defense against bugs that result from wild
mutation of global state is to get rid of global state, and not to put up
safety nets hidden in the middle of existing effect code.
The easiest way of reproducing the green discoloration bug in
the TH01 Konngara fight, timed to show the maximum amount of time the
discoloration can possibly last.
In any case, that's Mima done! 7th PC-98 Touhou boss fully
decompiled, 24 bosses remaining, and 59 functions left in all of TH01.
In other thrilling news, my call for secondary funding priorities in new
TH01 contributions has given us three different priorities so far. This
raises an interesting question though: Which of these contributions should I
now put towards TH01 immediately, and which ones should I leave in the
backlog for the time being? Since I've never liked deciding on priorities,
let's turn this into a popularity contest instead: The contributions with
the least popular secondary priorities will go towards TH01 first, giving
the most popular priorities a higher chance to still be left over after TH01
is done. As of this delivery, we'd have the following popularity order:
TH05 (1.67 pushes), from T0182
Seihou (1 push), from T0184
TH03 (0.67 pushes), from T0146
Which means that T0146 will be consumed for TH01 next, followed by T0184 and
then T0182. I only assign transactions immediately before a delivery though,
so you all still have the chance to change up these priorities before the
next one.
Next up: The final boss of TH01 decompilation, YuugenMagan… if the current
or newly incoming TH01 funds happen to be enough to cover the entire fight.
If they don't turn out to be, I will have to pass the time with some Seihou
work instead, missing the TH01 anniversary deadline as a result.Edit (2022-07-18): Thanks to Yanga for
securing the funding for YuugenMagan after all! That fight will feature
slightly more than half of all remaining code in TH01's
REIIDEN.EXE and the single biggest function in all of PC-98
Touhou, let's go!
With Elis, we've not only reached the midway point in TH01's boss code, but
also a bunch of other milestones: Both REIIDEN.EXE and TH01 as
a whole have crossed the 75% RE mark, and overall position independence has
also finally cracked 80%!
And it got done in 4 pushes again? Yup, we're back to
📝 Konngara levels of redundancy and
copy-pasta. This time, it didn't even stop at the big copy-pasted code
blocks for the rift sprite and 256-pixel circle animations, with the words
"redundant" and "unnecessary" ending up a total of 18 times in my source
code comments.
But damn is this fight broken. As usual with TH01 bosses, let's start with a
high-level overview:
The Elis fight consists of 5 phases (excluding the entrance animation),
which must be completed in order.
In all odd-numbered phases, Elis uses a random one-shot danmaku pattern
from an exclusive per-phase pool before teleporting to a random
position.
There are 3 exclusive girl-form patterns per phase, plus 4
additional bat-form patterns in phase 5, for a total of 13.
Due to a quirk in the selection algorithm in phases 1 and 3, there
is a 25% chance of Elis skipping an attack cycle and just teleporting
again.
In contrast to Konngara, Elis can freely select the same pattern
multiple times in a row. There's nothing in the code to prevent that
from happening.
This pattern+teleport cycle is repeated until Elis' HP reach a certain
threshold value. The odd-numbered phases correspond to the white (phase 1),
red-white (phase 3), and red (phase 5) sections of the health bar. However,
the next phase can only start at the end of each cycle, after a
teleport.
Phase 2 simply teleports Elis back to her starting screen position of
(320, 144) and then advances to phase 3.
Phase 4 does the same as phase 2, but adds the initial bat form
transformation before advancing to phase 5.
Phase 5 replaces the teleport with a transformation to the bat form.
Rather than teleporting instantly to the target position, the bat gradually
flies there, firing a randomly selected looping pattern from the 4-pattern
bat pool on the way, before transforming back to the girl form.
This puts the earliest possible end of the fight at the first frame of phase
5. However, nothing prevents Elis' HP from reaching 0 before that point. You
can nicely see this in 📝 debug mode: Wait
until the HP bar has filled up to avoid heap corruption, hold ↵ Return
to reduce her HP to 0, and watch how Elis still goes through a total of
two patterns* and four
teleport animations before accepting defeat.
But wait, heap corruption? Yup, there's a bug in the HP bar that already
affected Konngara as well, and it isn't even just about the graphical
glitches generated by negative HP:
The initial fill-up animation is drawn to both VRAM pages at a rate of 1
HP per frame… by passing the current frame number as the
current_hp number.
The target_hp is indicated by simply passing the current
HP…
… which, however, can be reduced in debug mode at an equal rate of up to
1 HP per frame.
The completion condition only checks if
((target_hp - 1) == current_hp). With the
right timing, both numbers can run past each other though.
In that case, the function is repeatedly called on every frame, backing
up the original VRAM contents for the current HP point before blitting
it…
… until frame ((96 / 2) + 1), where the
.PTN slot pointer overflows the heap buffer and overwrites whatever comes
after. 📝 Sounds familiar, right?
Since Elis starts with 14 HP, which is an even number, this corruption is
trivial to cause: Simply hold ↵ Return from the beginning of the
fight, and the completion condition will never be true, as the
HP and frame numbers run past the off-by-one meeting point.
Regular gameplay, however, entirely prevents this due to the fixed start
positions of Reimu and the Orb, the Orb's fixed initial trajectory, and the
50 frames of delay until a bomb deals damage to a boss. These aspects make
it impossible to hit Elis within the first 14 frames of phase 1, and ensure
that her HP bar is always filled up completely. So ultimately, this bug ends
up comparable in seriousness to the
📝 recursion / stack overflow bug in the memory info screen.
These wavy teleport animations point to a quite frustrating architectural
issue in this fight. It's not even the fact that unblitting the yellow star
sprites rips temporary holes into Elis' sprite; that's almost expected from
TH01 at this point. Instead, it's all because of this unused frame of the
animation:
With this sprite still being part of BOSS5.BOS, Girl-Elis has a
total of 9 animation frames, 1 more than the
📝 8 per-entity sprites allowed by ZUN's architecture.
The quick and easy solution would have been to simply bump the sprite array
size by 1, but… nah, this would have added another 20 bytes to all 6 of the
.BOS image slots. Instead, ZUN wrote the manual
position synchronization code I mentioned in that 2020 blog post.
Ironically, he then copy-pasted this snippet of code often enough that it
ended up taking up more than 120 bytes in the Elis fight alone – with, you
guessed it, some of those copies being redundant. Not to mention that just
going from 8 to 9 sprites would have allowed ZUN to go down from 6 .BOS
image slots to 3. That would have actually saved 420 bytes in
addition to the manual synchronization trouble. Looking forward to SinGyoku,
that's going to be fun again…
As for the fight itself, it doesn't take long until we reach its most janky
danmaku pattern, right in phase 1:
The "pellets along circle" pattern on Lunatic, in its original version
and with fanfiction fixes for everything that can potentially be
interpreted as a bug.
For whatever reason, the lower-right quarter of the circle isn't
animated? This animation works by only drawing the new dots added with every
subsequent animation frame, expressed as a tiny arc of a dotted circle. This
arc starts at the animation's current 8-bit angle and ends on the sum of
that angle and a hardcoded constant. In every other (copy-pasted, and
correct) instance of this animation, ZUN uses 0x02 as the
constant, but this one uses… 0.05 for the lower-right quarter?
As in, a 64-bit double constant that truncates to 0 when added
to an 8-bit integer, thus leading to the start and end angles being
identical and the game not drawing anything.
On Easy and Normal, the pattern then spawns 32 bullets along the outline
of the circle, no problem there. On Lunatic though, every one of these
bullets is instead turned into a narrow-angled 5-spread, resulting in 160
pellets… in a game with a pellet cap of 100.
Now, if Elis teleported herself to a position near the top of the playfield,
most of the capped pellets would have been clipped at that top edge anyway,
since the bullets are spawned in clockwise order starting at Elis' right
side with an angle of 0x00. On lower positions though, you can
definitely see a difference if the cap were high enough to allow all coded
pellets to actually be spawned.
The Hard version gets dangerously close to the cap by spawning a total of 96
pellets. Since this is the only pattern in phase 1 that fires pellets
though, you are guaranteed to see all of the unclipped ones.
The pellets also aren't spawned exactly on the telegraphed circle, but 4 pixels to the left.
Then again, it might very well be that all of this was intended, or, most
likely, just left in the game as a happy accident. The latter interpretation
would explain why ZUN didn't just delete the rendering calls for the
lower-right quarter of the circle, because seriously, how would you not spot
that? The phase 3 patterns continue with more minor graphical glitches that
aren't even worth talking about anymore.
And then Elis transforms into her bat form at the beginning of Phase 5,
which displays some rather unique hitboxes. The one against the Orb is fine,
but the one against player shots…
… uses the bat's X coordinate for both X and Y dimensions.
In regular gameplay, it's not too bad as most
of the bat patterns fire aimed pellets which typically don't allow you to
move below her sprite to begin with. But if you ever tried destroying these
pellets while standing near the middle of the playfield, now you know why
that didn't work. This video also nicely points out how the bat, like any
boss sprite, is only ever blitted at positions on the 8×1-pixel VRAM byte
grid, while collision detection uses the actual pixel position.
The bat form patterns are all relatively simple, with little variation
depending on the difficulty level, except for the "slow pellet spreads"
pattern. This one is almost easiest to dodge on Lunatic, where the 5-spreads
are not only always fired downwards, but also at the hardcoded narrow delta
angle, leaving plenty of room for the player to move out of the way:
The "slow pellet spreads" pattern of Elis' bat form, on every
difficulty. Which version do you think is the easiest one?
Finally, we've got another potential timesave in the girl form's "safety
circle" pattern:
After the circle spawned completely, you lose a life by moving outside it,
but doing that immediately advances the pattern past the circle part. This
part takes 200 frames, but the defeat animation only takes 82 frames, so
you can save up to 118 frames there.
Final funny tidbit: As with all dynamic entities, this circle is only
blitted to VRAM page 0 to allow easy unblitting. However, it's also kind of
static, and there needs to be some way to keep the Orb, the player shots,
and the pellets from ripping holes into it. So, ZUN just re-blits the circle
every… 4 frames?! 🤪 The same is true for the Star of David and its
surrounding circle, but there you at least get a flash animation to justify
it. All the overlap is actually quite a good reason for not even attempting
to 📝 mess with the hardware color palette instead.
Reproducing the crash was the whole challenge here. Even after moving Elis
and Reimu to the exact positions seen in Pearl's video and setting Elis' HP
to 0 on the exact same frame, everything ran fine for me. It's definitely no
division by 0 this time, the function perfectly guards against that
possibility. The line specified in the function's parameters is always
clipped to the VRAM region as well, so we can also rule out illegal memory
accesses here…
… or can we? Stepping through it all reminded me of how this function brings
unblitting sloppiness to the next level: For each VRAM byte touched, ZUN
actually unblits the 4 surrounding bytes, adding one byte to the left
and two bytes to the right, and using a single 32-bit read and write per
bitplane. So what happens if the function tries to unblit the topmost byte
of VRAM, covering the pixel positions from (0, 0) to (7, 0)
inclusive? The VRAM offset of 0x0000 is decremented to
0xFFFF to cover the one byte to the left, 4 bytes are written
to this address, the CPU's internal offset overflows… and as it turns out,
that is illegal even in Real Mode as of the 80286, and will raise a General Protection
Fault. Which is… ignored by DOSBox-X,
every Neko Project II version in common use, the CSCP
emulators, SL9821, and T98-Next. Only Anex86 accurately emulates the
behavior of real hardware here.
OK, but no laser fired by Elis ever reaches the top-left corner of the
screen. How can such a fault even happen in practice? That's where the
broken laser reset+unblit function comes in: Not only does it just flat out pass the wrong
parameters to the line unblitting function – describing the line
already traveled by the laser and stopping where the laser begins –
but it also passes them
wrongly, in the form of raw 32-bit fixed-point Q24.8 values, with no
conversion other than a truncation to the signed 16-bit pixels expected by
the function. What then follows is an attempt at interpolation and clipping
to find a line segment between those garbage coordinates that actually falls
within the boundaries of VRAM:
right/bottom correspond to a laser's origin position, and
left/top to the leftmost pixel of its moved-out top line. The
bug therefore only occurs with lasers that stopped growing and have started
moving.
Moreover, it will only happen if either (left % 256) or
(right % 256) is ≤ 127 and the other one of the two is ≥ 128.
The typecast to signed 16-bit integers then turns the former into a large
positive value and the latter into a large negative value, triggering the
function's clipping code.
The function then follows Bresenham's
algorithm: left is ensured to be smaller than right
by swapping the two values if necessary. If that happened, top
and bottom are also swapped, regardless of their value – the
algorithm does not care about their order.
The slope in the X dimension is calculated using an integer division of
((bottom - top) /
(right - left)). Both subtractions are done on signed
16-bit integers, and overflow accordingly.
(-left × slope_x) is added to top,
and left is set to 0.
If both top and bottom are < 0 or
≥ 640, there's nothing to be unblitted. Otherwise, the final
coordinates are clipped to the VRAM range of [(0, 0),
(639, 399)].
If the function got this far, the line to be unblitted is now very
likely to reach from
the top-left to the bottom-right corner, starting out at
(0, 0) right away, or
from the bottom-left corner to the top-right corner. In this case,
you'd expect unblitting to end at (639, 0), but thanks to an
off-by-one error,
it actually ends at (640, -1), which is equivalent to
(0, 0). Why add clipping to VRAM offset calculations when
everything else is clipped already, right?
Possible laser states that will cause the fault, with some debug
output to help understand the cause, and any pellets removed for better
readability. This can happen for all bosses that can potentially have
shootout lasers on screen when being defeated, so it also applies to Mima.
Fixing this is easier than understanding why it happens, but since y'all
love reading this stuff…
tl;dr: TH01 has a high chance of freezing at a boss defeat sequence if there
are diagonally moving lasers on screen, and if your PC-98 system
raises a General Protection Fault on a 4-byte write to offset
0xFFFF, and if you don't run a TSR with an INT
0Dh handler that might handle this fault differently.
The easiest fix option would be to just remove the attempted laser
unblitting entirely, but that would also have an impact on this game's…
distinctive visual glitches, in addition to touching a whole lot of
code bytes. If I ever get funded to work on a hypothetical TH01 Anniversary
Edition that completely rearchitects the game to fix all these glitches, it
would be appropriate there, but not for something that purports to be the
original game.
(Sidenote to further hype up this Anniversary Edition idea for PC-98
hardware owners: With the amount of performance left on the table at every
corner of this game, I'm pretty confident that we can get it to work
decently on PC-98 models with just an 80286 CPU.)
Since we're in critical infrastructure territory once again, I went for the
most conservative fix with the least impact on the binary: Simply changing
any VRAM offsets >= 0xFFFD to 0x0000 to avoid
the GPF, and leaving all other bugs in place. Sure, it's rather lazy and
"incorrect"; the function still unblits a 32-pixel block there, but adding a
special case for blitting 24 pixels would add way too much code. And
seriously, it's not like anything happens in the 8 pixels between
(24, 0) and (31, 0) inclusive during gameplay to begin with.
To balance out the additional per-row if() branch, I inlined
the VRAM page change I/O, saving two function calls and one memory write per
unblitted row.
That means it's time for a new community_choice_fixes
build, containing the new definitive bugfixed versions of these games:
2022-05-31-community-choice-fixes.zip
Check the th01_critical_fixes
branch for the modified TH01 code. It also contains a fix for the HP bar
heap corruption in debug mode – simply changing the ==
comparison to <= is enough to avoid it, and negative HP will
still create aesthetic glitch art.
Once again, I then was left with ½ of a push, which I finally filled with
some FUUIN.EXE code, specifically the verdict screen. The most
interesting part here is the player title calculation, which is quite
sneaky: There are only 6 skill levels, but three groups of
titles for each level, and the title you'll see is picked from a random
group. It looks like this is the first time anyone has documented the
calculation?
As for the levels, ZUN definitely didn't expect players to do particularly
well. With a 1cc being the standard goal for completing a Touhou game, it's
especially funny how TH01 expects you to continue a lot: The code has
branches for up to 21 continues, and the on-screen table explicitly leaves
room for 3 digits worth of continues per 5-stage scene. Heck, these
counts are even stored in 32-bit long variables.
Next up: 📝 Finally finishing the long
overdue Touhou Patch Center MediaWiki update work, while continuing with
Kikuri in the meantime. Originally I wasn't sure about what to do between
Elis and Seihou,
but with Ember2528's surprise
contribution last week, y'all have
demonstrated more than enough interest in the idea of getting TH01 done
sooner rather than later. And I agree – after all, we've got the 25th
anniversary of its first public release coming up on August 15, and I might
still manage to completely decompile this game by that point…
Here we go, TH01 Sariel! This is the single biggest boss fight in all of
PC-98 Touhou: If we include all custom effect code we previously decompiled,
it amounts to a total of 10.31% of all code in TH01 (and 3.14%
overall). These 8 pushes cover the final 8.10% (or 2.47% overall),
and are likely to be the single biggest delivery this project will ever see.
Considering that I only managed to decompile 6.00% across all games in 2021,
2022 is already off to a much better start!
So, how can Sariel's code be that large? Well, we've got:
16 danmaku patterns; including the one snowflake detonating into a giant
94×32 hitbox
Gratuitous usage of floating-point variables, bloating the binary thanks
to Turbo C++ 4.0J's particularly horrid code generation
The hatching birds that shoot pellets
3 separate particle systems, sharing the general idea, overall code
structure, and blitting algorithm, but differing in every little detail
The "gust of wind" background transition animation
5 sets of custom monochrome sprite animations, loaded from
BOSS6GR?.GRC
A further 3 hardcoded monochrome 8×8 sprites for the "swaying leaves"
pattern during the second form
In total, it's just under 3,000 lines of C++ code, containing a total of 8
definite ZUN bugs, 3 of them being subpixel/pixel confusions. That might not
look all too bad if you compare it to the
📝 player control function's 8 bugs in 900 lines of code,
but given that Konngara had 0… (Edit (2022-07-17):
Konngara contains two bugs after all: A
📝 possible heap corruption in debug mode,
and the infamous
📝 temporary green discoloration.)
And no, the code doesn't make it obvious whether ZUN coded Konngara or
Sariel first; there's just as much evidence for either.
Some terminology before we start: Sariel's first form is separated
into four phases, indicated by different background images, that
cycle until Sariel's HP reach 0 and the second, single-phase form
starts. The danmaku patterns within each phase are also on a cycle,
and the game picks a random but limited number of patterns per phase before
transitioning to the next one. The fight always starts at pattern 1 of phase
1 (the random purple lasers), and each new phase also starts at its
respective first pattern.
Sariel's bugs already start at the graphics asset level, before any code
gets to run. Some of the patterns include a wand raise animation, which is
stored in BOSS6_2.BOS:
Umm… OK? The same sprite twice, just with slightly different
colors? So how is the wand lowered again?
The "lowered wand" sprite is missing in this file simply because it's
captured from the regular background image in VRAM, at the beginning of the
fight and after every background transition. What I previously thought to be
📝 background storage code has therefore a
different meaning in Sariel's case. Since this captured sprite is fully
opaque, it will reset the entire 128×128 wand area… wait, 128×128, rather
than 96×96? Yup, this lowered sprite is larger than necessary, wasting 1,967
bytes of conventional memory. That still doesn't quite explain the
second sprite in BOSS6_2.BOS though. Turns out that the black
part is indeed meant to unblit the purple reflection (?) in the first
sprite. But… that's not how you would correctly unblit that?
The first sprite already eats up part of the red HUD line, and the second
one additionally fails to recover the seal pixels underneath, leaving a nice
little black hole and some stray purple pixels until the next background
transition. Quite ironic given that both
sprites do include the right part of the seal, which isn't even part of the
animation.
Just like Konngara, Sariel continues the approach of using a single function
per danmaku pattern or custom entity. While I appreciate that this allows
all pattern- and entity-specific state to be scoped locally to that one
function, it quickly gets ugly as soon as such a function has to do more than one thing.
The "bird function" is particularly awful here: It's just one if(…)
{…} else if(…) {…} else if(…) {…} chain with different
branches for the subfunction parameter, with zero shared code between any of
these branches. It also uses 64-bit floating-point double as
its subpixel type… and since it also takes four of those as parameters
(y'know, just in case the "spawn new bird" subfunction is called), every
call site has to also push four double values onto the stack.
Thanks to Turbo C++ even using the FPU for pushing a 0.0 constant, we
have already reached maximum floating-point decadence before even having
seen a single danmaku pattern. Why decadence? Every possible spawn position
and velocity in both bird patterns just uses pixel resolution, with no
fractional component in sight. And there goes another 720 bytes of
conventional memory.
Speaking about bird patterns, the red-bird one is where we find the first
code-level ZUN bug: The spawn cross circle sprite suddenly disappears after
it finished spawning all the bird eggs. How can we tell it's a bug? Because
there is code to smoothly fly this sprite off the playfield, that
code just suddenly forgets that the sprite's position is stored in Q12.4
subpixels, and treats it as raw screen pixels instead.
As a result, the well-intentioned 640×400
screen-space clipping rectangle effectively shrinks to 38×23 pixels in the
top-left corner of the screen. Which the sprite is always outside of, and
thus never rendered again.
The intended animation is easily restored though:
Sariel's third pattern, and the first to spawn birds, in its original
and fixed versions. Note that I somewhat fixed the bird hatch animation
as well: ZUN's code never unblits any frame of animation there, and
simply blits every new one on top of the previous one.
Also, did you know that birds actually have a quite unfair 14×38-pixel
hitbox? Not that you'd ever collide with them in any of the patterns…
Another 3 of the 8 bugs can be found in the symmetric, interlaced spawn rays
used in three of the patterns, and the 32×32 debris "sprites" shown at their endpoint, at
the edge of the screen. You kinda have to commend ZUN's attention to detail
here, and how he wrote a lot of code for those few rapidly animated pixels
that you most likely don't
even notice, especially with all the other wrong pixels
resulting from rendering glitches. One of the bugs in the very final pattern
of phase 4 even turns them into the vortex sprites from the second pattern
in phase 1 during the first 5 frames of
the first time the pattern is active, and I had to single-step the blitting
calls to verify it.
It certainly was annoying how much time I spent making sense of these bugs,
and all weird blitting offsets, for just a few pixels… Let's look at
something more wholesome, shall we?
So far, we've only seen the PC-98 GRCG being used in RMW (read-modify-write)
mode, which I previously
📝 explained in the context of TH01's red-white HP pattern.
The second of its three modes, TCR (Tile Compare Read), affects VRAM reads
rather than writes, and performs "color extraction" across all 4 bitplanes:
Instead of returning raw 1bpp data from one plane, a VRAM read will instead
return a bitmask, with a 1 bit at every pixel whose full 4-bit color exactly
matches the color at that offset in the GRCG's tile register, and 0
everywhere else. Sariel uses this mode to make sure that the 2×2 particles
and the wind effect are only blitted on top of "air color" pixels, with
other parts of the background behaving like a mask. The algorithm:
Set the GRCG to TCR mode, and all 8 tile register dots to the air
color
Read N bits from the target VRAM position to obtain an N-bit mask where
all 1 bits indicate air color pixels at the respective position
AND that mask with the alpha plane of the sprite to be drawn, shifted to
the correct start bit within the 8-pixel VRAM byte
Set the GRCG to RMW mode, and all 8 tile register dots to the color that
should be drawn
Write the previously obtained bitmask to the same position in VRAM
Quite clever how the extracted colors double as a secondary alpha plane,
making for another well-earned good-code tag. The wind effect really doesn't deserve it, though:
ZUN calculates every intermediate result inside this function
over and over and over again… Together with some ugly
pointer arithmetic, this function turned into one of the most tedious
decompilations in a long while.
This gradual effect is blitted exclusively to the front page of VRAM,
since parts of it need to be unblitted to create the illusion of a gust of
wind. Then again, anything that moves on top of air-colored background –
most likely the Orb – will also unblit whatever it covered of the effect…
As far as I can tell, ZUN didn't use TCR mode anywhere else in PC-98 Touhou.
Tune in again later during a TH04 or TH05 push to learn about TDW, the final
GRCG mode!
Speaking about the 2×2 particle systems, why do we need three of them? Their
only observable difference lies in the way they move their particles:
Up or down in a straight line (used in phases 4 and 2,
respectively)
Left or right in a straight line (used in the second form)
Left and right in a sinusoidal motion (used in phase 3, the "dark
orange" one)
Out of all possible formats ZUN could have used for storing the positions
and velocities of individual particles, he chose a) 64-bit /
double-precision floating-point, and b) raw screen pixels. Want to take a
guess at which data type is used for which particle system?
If you picked double for 1) and 2), and raw screen pixels for
3), you are of course correct! Not that I'm implying
that it should have been the other way round – screen pixels would have
perfectly fit all three systems use cases, as all 16-bit coordinates
are extended to 32 bits for trigonometric calculations anyway. That's what,
another 1.080 bytes of wasted conventional memory? And that's even
calculated while keeping the current architecture, which allocates
space for 3×30 particles as part of the game's global data, although only
one of the three particle systems is active at any given time.
That's it for the first form, time to put on "Civilization
of Magic"! Or "死なばもろとも"? Or "Theme of 地獄めくり"? Or whatever SYUGEN is
supposed to mean…
… and the code of these final patterns comes out roughly as exciting as
their in-game impact. With the big exception of the very final "swaying
leaves" pattern: After 📝 Q4.4,
📝 Q28.4,
📝 Q24.8, and double variables,
this pattern uses… decimal subpixels? Like, multiplying the number by
10, and using the decimal one's digit to represent the fractional part?
Well, sure, if you really insist on moving the leaves in cleanly
represented integer multiples of ⅒, which is infamously impossible in IEEE
754. Aside from aesthetic reasons, it only really combines less precision
(10 possible fractions rather than the usual 16) with the inferior
performance of having to use integer divisions and multiplications rather
than simple bit shifts. And it's surely not because the leaf sprites needed
an extended integer value range of [-3276, +3276], compared to
Q12.4's [-2047, +2048]: They are clipped to 640×400 screen space
anyway, and are removed as soon as they leave this area.
This pattern also contains the second bug in the "subpixel/pixel confusion
hiding an entire animation" category, causing all of
BOSS6GR4.GRC to effectively become unused:
The "swaying leaves" pattern. ZUN intended a splash animation to be
shown once each leaf "spark" reaches the top of the playfield, which is
never displayed in the original game.
At least their hitboxes are what you would expect, exactly covering the
30×30 pixels of Reimu's sprite. Both animation fixes are available on the th01_sariel_fixes
branch.
After all that, Sariel's main function turned out fairly unspectacular, just
putting everything together and adding some shake, transition, and color
pulse effects with a bunch of unnecessary hardware palette changes. There is
one reference to a missing BOSS6.GRP file during the
first→second form transition, suggesting that Sariel originally had a
separate "first form defeat" graphic, before it was replaced with just the
shaking effect in the final game.
Speaking about the transition code, it is kind of funny how the… um,
imperative and concrete nature of TH01 leads to these 2×24
lines of straight-line code. They kind of look like ZUN rattling off a
laundry list of subsystems and raw variables to be reinitialized, making
damn sure to not forget anything.
Whew! Second PC-98 Touhou boss completely decompiled, 29 to go, and they'll
only get easier from here! 🎉 The next one in line, Elis, is somewhere
between Konngara and Sariel as far as x86 instruction count is concerned, so
that'll need to wait for some additional funding. Next up, therefore:
Looking at a thing in TH03's main game code – really, I have little
idea what it will be!
Now that the store is open again, also check out the
📝 updated RE progress overview I've posted
together with this one. In addition to more RE, you can now also directly
order a variety of mods; all of these are further explained in the order
form itself.
No technical obstacles for once! Just pure overcomplicated ZUN code. Unlike
📝 Konngara's main function, the main TH01
player function was every bit as difficult to decompile as you would expect
from its size.
With TH01 using both separate left- and right-facing sprites for all of
Reimu's moves and separate classes for Reimu's 32×32 and 48×*
sprites, we're already off to a bad start. Sure, sprite mirroring is
minimally more involved on PC-98, as the planar
nature of VRAM requires the bits within an 8-pixel byte to also be
mirrored, in addition to writing the sprite bytes from right to left. TH03
uses a 256-byte lookup table for this, generated at runtime by an infamous
micro-optimized and undecompilable ASM algorithm. With TH01's existing
architecture, ZUN would have then needed to write 3 additional blitting
functions. But instead, he chose to waste a total of 26,112 bytes of memory
on pre-mirrored sprites…
Alright, but surely selecting those sprites from code is no big deal? Just
store the direction Reimu is facing in, and then add some branches to the
rendering code. And there is in fact a variable for Reimu's direction…
during regular arrow-key movement, and another one while shooting and
sliding, and a third as part of the special attack types,
launched out of a slide.
Well, OK, technically, the last two are the same variable. But that's even
worse, because it means that ZUN stores two distinct enums at
the same place in memory: Shooting and sliding uses 1 for left,
2 for right, and 3 for the "invalid" direction of
holding both, while the special attack types indicate the direction in their
lowest bit, with 0 for right and 1 for left. I
decompiled the latter as bitflags, but in ZUN's code, each of the 8
permutations is handled as a distinct type, with copy-pasted and adapted
code… The interpretation of this
two-enum "sub-mode" union variable is controlled
by yet another "mode" variable… and unsurprisingly, two of the bugs in this
function relate to the sub-mode variable being interpreted incorrectly.
Also, "rendering code"? This one big function basically consists of separate
unblit→update→render code snippets for every state and direction Reimu can
be in (moving, shooting, swinging, sliding, special-attacking, and bombing),
pasted together into a tangled mess of nested if(…) statements.
While a lot of the code is copy-pasted, there are still a number of
inconsistencies that defeat the point of my usual refactoring treatment.
After all, with a total of 85 conditional branches, anything more than I did
would have just obscured the control flow too badly, making it even harder
to understand what's going on.
In the end, I spotted a total of 8 bugs in this function, all of which leave
Reimu invisible for one or more frames:
2 frames after all special attacks
2 frames after swing attacks, and
4 frames before swing attacks
Thanks to the last one, Reimu's first swing animation frame is never
actually rendered. So whenever someone complains about TH01 sprite
flickering on an emulator: That emulator is accurate, it's the game that's
poorly written.
And guess what, this function doesn't even contain everything you'd
associate with per-frame player behavior. While it does
handle Yin-Yang Orb repulsion as part of slides and special attacks, it does
not handle the actual player/Orb collision that results in lives being lost.
The funny thing about this: These two things are done in the same function…
Therefore, the life loss animation is also part of another function. This is
where we find the final glitch in this 3-push series: Before the 16-frame
shake, this function only unblits a 32×32 area around Reimu's center point,
even though it's possible to lose a life during the non-deflecting part of a
48×48-pixel animation. In that case, the extra pixels will just stay on
screen during the shake. They are unblitted afterwards though, which
suggests that ZUN was at least somewhat aware of the issue?
Finally, the chance to see the alternate life loss sprite is exactly ⅛.
As for any new insights into game mechanics… you know what? I'm just not
going to write anything, and leave you with this flowchart instead. Here's
the definitive guide on how to control Reimu in TH01 we've been waiting for
24 years:
Pellets are deflected during all gray
states. Not shown is the obvious "double-tap Z and X" transition from
all non-(#1) states to the Bomb state, but that would have made this
diagram even more unwieldy than it turned out. And yes, you can shoot
twice as fast while moving left or right.
While I'm at it, here are two more animations from MIKO.PTN
which aren't referenced by any code:
With that monster of a function taken care of, we've only got boss sprite animation as the final blocker of uninterrupted Sariel progress. Due to some unfavorable code layout in the Mima segment though, I'll need to spend a bit more time with some of the features used there. Next up: The missile bullets used in the Mima and YuugenMagan fights.
Of course, Sariel's potentially bloated and copy-pasted code is blocked by
even more definitely bloated and copy-pasted code. It's TH01, what did you
expect?
But even then, TH01's item code is on a new level of software architecture
ridiculousness. First, ZUN uses distinct arrays for both types of items,
with their own caps of 4 for bomb items, and 10 for point items. Since that
obviously makes any type-related switch statement redundant,
he also used distinct functions for both types, with copy-pasted
boilerplate code. The main per-item update and render function is
shared though… and takes every single accessed member of the item
structure as its own reference parameter. Like, why, you have a
structure, right there?! That's one way to really practice the C++ language
concept of passing arbitrary structure fields by mutable reference…
To complete the unwarranted grand generic design of this function, it calls
back into per-type collision detection, drop, and collect functions with
another three reference parameters. Yeah, why use C++ virtual methods when
you can also implement the effectively same polymorphism functionality by
hand? Oh, and the coordinate clamping code in one of these callbacks could
only possibly have come from nested min() and
max() preprocessor macros. And that's how you extend such
dead-simple functionality to 1¼ pushes…
Amidst all this jank, we've at least got a sensible item↔player hitbox this
time, with 24 pixels around Reimu's center point to the left and right, and
extending from 24 pixels above Reimu down to the bottom of the playfield.
It absolutely didn't look like that from the initial naive decompilation
though. Changing entity coordinates from left/top to center was one of the
better lessons from TH01 that ZUN implemented in later games, it really
makes collision detection code much more intuitive to grasp.
The card flip code is where we find out some slightly more interesting
aspects about item drops in this game, and how they're controlled by a
hidden cycle variable:
At the beginning of every 5-stage scene, this variable is set to a
random value in the [0..59] range
Point items are dropped at every multiple of 10
Every card flip adds 1 to its value after this mod 10
check
At a value of 140, the point item is replaced with a bomb item, but only
if no damaging bomb is active. In any case, its value is then reset to
1.
Then again, score players largely ignore point items anyway, as card
combos simply have a much bigger effect on the score. With this, I should
have RE'd all information necessary to construct a tool-assisted score run,
though? Edit: Turns out that 1) point items are becoming
increasingly important in score runs, and 2) Pearl already did a TAS some
months ago. Thanks to
spaztron64 for the info!
The Orb↔card hitbox also makes perfect sense, with 24 pixels around
the center point of a card in every direction.
The rest of the code confirms the
card
flip score formula documented on Touhou Wiki, as well as the way cards
are flipped by bombs: During every of the 90 "damaging" frames of the
140-frame bomb animation, there is a 75% chance to flip the card at the
[bomb_frame % total_card_count_in_stage] array index. Since
stages can only have up to 50 cards
📝 thanks to a bug, even a 75% chance is high
enough to typically flip most cards during a bomb. Each of these flips
still only removes a single card HP, just like after a regular collision
with the Orb.
Also, why are the card score popups rendered before the cards
themselves? That's two needless frames of flicker during that 25-frame
animation. Not all too noticeable, but still.
And that's over 50% of REIIDEN.EXE decompiled as well! Next
up: More HUD update and rendering code… with a direct dependency on
rank pellet speed modifications?
…or maybe not that soon, as it would have only wasted time to
untangle the bullet update commits from the rest of the progress. So,
here's all the bullet spawning code in TH04 and TH05 instead. I hope
you're ready for this, there's a lot to talk about!
(For the sake of readability, "bullets" in this blog post refers to the
white 8×8 pellets
and all 16×16 bullets loaded from MIKO16.BFT, nothing else.)
But first, what was going on📝 in 2020? Spent 4 pushes on the basic types
and constants back then, still ended up confusing a couple of things, and
even getting some wrong. Like how TH05's "bullet slowdown" flag actually
always prevents slowdown and fires bullets at a constant speed
instead. Or how "random spread" is not the
best term to describe that unused bullet group type in TH04.
Or that there are two distinct ways of clearing all bullets on screen,
which deserve different names:
Mechanic #1: Clearing bullets for a custom amount of
time, awarding 1000 points for all bullets alive on the first frame,
and 100 points for all bullets spawned during the clear time.
Mechanic #2: Zapping bullets for a fixed 16 frames,
awarding a semi-exponential and loudly announced Bonus!! for all
bullets alive on the first frame, and preventing new bullets from being
spawned during those 16 frames. In TH04 at least; thanks to a ZUN bug,
zapping got reduced to 1 frame and no animation in TH05…
Bullets are zapped at the end of most midboss and boss phases, and
cleared everywhere else – most notably, during bombs, when losing a
life, or as rewards for extends or a maximized Dream bonus. The
Bonus!! points awarded for zapping bullets are calculated iteratively,
so it's not trivial to give an exact formula for these. For a small number
𝑛 of bullets, it would exactly be 5𝑛³ - 10𝑛² + 15𝑛
points – or, using uth05win's (correct) recursive definition,
Bonus(𝑛) = Bonus(𝑛-1) + 15𝑛² - 5𝑛 + 10.
However, one of the internal step variables is capped at a different number
of points for each difficulty (and game), after which the points only
increase linearly. Hence, "semi-exponential".
On to TH04's bullet spawn code then, because that one can at least be
decompiled. And immediately, we have to deal with a pointless distinction
between regular bullets, with either a decelerating or constant
velocity, and special bullets, with preset velocity changes during
their lifetime. That preset has to be set somewhere, so why have
separate functions? In TH04, this separation continues even down to the
lowest level of functions, where values are written into the global bullet
array. TH05 merges those two functions into one, but then goes too far and
uses self-modifying code to save a grand total of two local variables…
Luckily, the rest of its actual code is identical to TH04.
Most of the complexity in bullet spawning comes from the (thankfully
shared) helper function that calculates the velocities of the individual
bullets within a group. Both games handle each group type via a large
switch statement, which is where TH04 shows off another Turbo
C++ 4.0 optimization: If the range of case values is too
sparse to be meaningfully expressed in a jump table, it usually generates a
linear search through a second value table. But with the -G
command-line option, it instead generates branching code for a binary
search through the set of cases. 𝑂(log 𝑛) as the worst case for a
switch statement in a C++ compiler from 1994… that's so cool.
But still, why are the values in TH04's group type enum all
over the place to begin with?
Unfortunately, this optimization is pretty rare in PC-98 Touhou. It only
shows up here and in a few places in TH02, compared to at least 50
switch value tables.
In all of its micro-optimized pointlessness, TH05's undecompilable version
at least fixes some of TH04's redundancy. While it's still not even
optimal, it's at least a decently written piece of ASM…
if you take the time to understand what's going on there, because it
certainly took quite a bit of that to verify that all of the things which
looked like bugs or quirks were in fact correct. And that's how the code
for this function ended up with 35% comments and blank lines before I could
confidently call it "reverse-engineered"…
Oh well, at least it finally fixes a correctness issue from TH01 and TH04,
where an invalid bullet group type would fill all remaining slots in the
bullet array with identical versions of the first bullet.
Something that both games also share in these functions is an over-reliance
on globals for return values or other local state. The most ridiculous
example here: Tuning the speed of a bullet based on rank actually mutates
the global bullet template… which ZUN then works around by adding a wrapper
function around both regular and special bullet spawning, which saves the
base speed before executing that function, and restores it afterward.
Add another set of wrappers to bypass that exact
tuning, and you've expanded your nice 1-function interface to 4 functions.
Oh, and did I mention that TH04 pointlessly duplicates the first set of
wrapper functions for 3 of the 4 difficulties, which can't even be
explained with "debugging reasons"? That's 10 functions then… and probably
explains why I've procrastinated this feature for so long.
At this point, I also finally stopped decompiling ZUN's original ASM just
for the sake of it. All these small TH05 functions would look horribly
unidiomatic, are identical to their decompiled TH04 counterparts anyway,
except for some unique constant… and, in the case of TH05's rank-based
speed tuning function, actually become undecompilable as soon as we
want to return a C++ class to preserve the semantic meaning of the return
value. Mainly, this is because Turbo C++ does not allow register
pseudo-variables like _AX or _AL to be cast into
class types, even if their size matches. Decompiling that function would
have therefore lowered the quality of the rest of the decompiled code, in
exchange for the additional maintenance and compile-time cost of another
translation unit. Not worth it – and for a TH05 port, you'd already have to
decompile all the rest of the bullet spawning code anyway!
The only thing in there that was still somewhat worth being
decompiled was the pre-spawn clipping and collision detection function. Due
to what's probably a micro-optimization mistake, the TH05 version continues
to spawn a bullet even if it was spawned on top of the player. This might
sound like it has a different effect on gameplay… until you realize that
the player got hit in this case and will either lose a life or deathbomb,
both of which will cause all on-screen bullets to be cleared anyway.
So it's at most a visual glitch.
But while we're at it, can we please stop talking about hitboxes? At least
in the context of TH04 and TH05 bullets. The actual collision detection is
described way better as a kill delta of 8×8 pixels between the
center points of the player and a bullet. You can distribute these pixels
to any combination of bullet and player "hitboxes" that make up 8×8. 4×4
around both the player and bullets? 1×1 for bullets, and 8×8 for the
player? All equally valid… or perhaps none of them, once you keep in mind
that other entity types might have different kill deltas. With that in
mind, the concept of a "hitbox" turns into just a confusing abstraction.
The same is true for the 36×44 graze box delta. For some reason,
this one is not exactly around the center of a bullet, but shifted to the
right by 2 pixels. So, a bullet can be grazed up to 20 pixels right of the
player, but only up to 16 pixels left of the player. uth05win also spotted
this… and rotated the deltas clockwise by 90°?!
Which brings us to the bullet updates… for which I still had to
research a decompilation workaround, because
📝 P0148 turned out to not help at all?
Instead, the solution was to lie to the compiler about the true segment
distance of the popup function and declare its signature far
rather than near. This allowed ZUN to save that ridiculous overhead of 1 additional far function
call/return per frame, and those precious 2 bytes in the BSS segment
that he didn't have to spend on a segment value.
📝 Another function that didn't have just a
single declaration in a common header file… really,
📝 how were these games even built???
The function itself is among the longer ones in both games. It especially
stands out in the indentation department, with 7 levels at its most
indented point – and that's the minimum of what's possible without
goto. Only two more notable discoveries there:
Bullets are the only entity affected by Slow Mode. If the number of
bullets on screen is ≥ (24 + (difficulty * 8) + rank) in TH04,
or (42 + (difficulty * 8)) in TH05, Slow Mode reduces the frame
rate by 33%, by waiting for one additional VSync event every two frames.
The code also reveals a second tier, with 50% slowdown for a slightly
higher number of bullets, but that conditional branch can never be executed
Bullets must have been grazed in a previous frame before they can
be collided with. (Note how this does not apply to bullets that spawned
on top of the player, as explained earlier!)
Whew… When did ReC98 turn into a full-on code review?! 😅 And after all
this, we're still not done with TH04 and TH05 bullets, with all the
special movement types still missing. That should be less than one push
though, once we get to it. Next up: Back to TH01 and Konngara! Now have fun
rewriting the Touhou Wiki Gameplay pages 😛
Technical debt, part 10… in which two of the PMD-related functions came
with such complex ramifications that they required one full push after
all, leaving no room for the additional decompilations I wanted to do. At
least, this did end up being the final one, completing all
SHARED segments for the time being.
The first one of these functions determines the BGM and sound effect
modes, combining the resident type of the PMD driver with the Option menu
setting. The TH04 and TH05 version is apparently coded quite smartly, as
PC-98 Touhou only needs to distinguish "OPN- /
PC-9801-26K-compatible sound sources handled by PMD.COM"
from "everything else", since all other PMD varieties are
OPNA- / PC-9801-86-compatible.
Therefore, I only documented those two results returned from PMD's
AH=09h function. I'll leave a comprehensive, fully documented
enum to interested contributors, since that would involve research into
basically the entire history of the PC-9800 series, and even the clearly
out-of-scope PC-88VA. After all, distinguishing between more versions of
the PMD driver in the Option menu (and adding new sprites for them!) is
strictly mod territory.
The honor of being the final decompiled function in any SHARED
segment went to TH04's snd_load(). TH04 contains by far the
sanest version of this function: Readable C code, no new ZUN bugs (and
still missing file I/O error handling, of course)… but wait, what about
that actual file read syscall, using the INT 21h, AH=3Fh DOS
file read API? Reading up to a hardcoded number of bytes into PMD's or
MMD's song or sound effect buffer, 20 KiB in TH02-TH04, 64 KiB in
TH05… that's kind of weird. About time we looked closer into this.
Turns out that no, KAJA's driver doesn't give you the full 64 KiB of one
memory segment for these, as especially TH05's code might suggest to
anyone unfamiliar with these drivers. Instead,
you can customize the size of these buffers on its command line. In
GAME.BAT, ZUN allocates 8 KiB for FM songs, 2 KiB for sound
effects, and 12 KiB for MMD files in TH02… which means that the hardcoded
sizes in snd_load() are completely wrong, no matter how you
look at them. Consequently, this read syscall
will overflow PMD's or MMD's song or sound effect buffer if the
given file is larger than the respective buffer size.
Now, ZUN could have simply hardcoded the sizes from GAME.BAT
instead, and it would have been fine. As it also turns out though,
PMD has an API function (AH=22h) to retrieve the actual
buffer sizes, provided for exactly that purpose. There is little excuse
not to use it, as it also gives you PMD's default sizes if you don't
specify any yourself.
(Unless your build process enumerates all PMD files that are part of the
game, and bakes the largest size into both snd_load() and
GAME.BAT. That would even work with MMD, which doesn't have
an equivalent for AH=22h.)
What'd be the consequence of loading a larger file then? Well, since we
don't get a full segment, let's look at the theoretical limit first.
PMD prefers to keep both its driver code and the data buffers in a single
memory segment. As a result, the limit for the combined size of the song,
instrument, and sound effect buffer is determined by the amount of
code in the driver itself. In PMD86 version 4.8o (bundled with TH04
and TH05) for example, the remaining size for these buffers is exactly
45,555 bytes. Being an actually good programmer who doesn't blindly trust
user input, KAJA thankfully validates the sizes given via the
/M, /V, and /E command-line options
before letting the driver reside in memory, and shuts down with an error
message if they exceed 40 KiB. Would have been even better if he calculated
the exact size – even in the current
PMD version 4.8s from
January 2020, it's still a hardcoded value (see line 8581).
Either way: If the file is larger than this maximum, the concrete effect
is down to the INT 21h, AH=3Fh implementation in the
underlying DOS version. DOS 3.3 treats the destination address as linear
and reads past the end of the segment,
DOS
5.0 and DOSBox-X truncate the number of bytes to not exceed the remaining
space in the segment, and maybe there's even a DOS that wraps around
and ends up overwriting the PMD driver code. In any case: You will
overwrite what's after the driver in memory – typically, the game .EXE and
its master.lib functions.
It almost feels like a happy accident that this doesn't cause issues in
the original games. The largest PMD file in any of the 4 games, the -86
version of 幽夢 ~ Inanimate Dream, takes up 8,099 bytes,
just under the 8,192 byte limit for BGM. For modders, I'd really recommend
implementing this properly, with PMD's AH=22h function and
error handling, once position independence has been reached.
Whew, didn't think I'd be doing more research into KAJA's drivers during
regular ReC98 development! That's probably been the final time though, as
all involved functions are now decompiled, and I'm unlikely to iterate
over them again.
And that's it! Repaid the biggest chunk of technical debt, time for some
actual progress again. Next up: Reopening the store tomorrow, and waiting
for new priorities. If we got nothing by Sunday, I'm going to put the
pending [Anonymous] pushes towards some work on the website.
Technical debt, part 5… and we only got TH05's stupidly optimized
.PI functions this time?
As far as actual progress is concerned, that is. In maintenance news
though, I was really hyped for the #include improvements I've
mentioned in 📝 the last post. The result: A
new x86real.h file, bundling all the declarations specific to
the 16-bit x86 Real Mode in a smaller file than Turbo C++'s own
DOS.H. After all, DOS is something else than the underlying
CPU. And while it didn't speed up build times quite as much as I had hoped,
it now clearly indicates the x86-specific parts of PC-98 Touhou code to
future port authors.
After another couple of improvements to parameter declaration in ASM land,
we get to TH05's .PI functions… and really, why did ZUN write all of
them in ASM? Why (re)declare all the necessary structures and data in
ASM land, when all these functions are merely one layer of abstraction
above master.lib, which does all the actual work?
I get that ZUN might have wanted masked blitting to be faster, which is
used for the fade-in effect seen during TH05's main menu animation and the
ending artwork. But, uh… he knew how to modify master.lib. In fact, he
did already modify the graph_pack_put_8() function
used for rendering a single .PI image row, to ignore master.lib's VRAM
clipping region. For this effect though, he first blits each row regularly
to the invisible 400th row of VRAM, and then does an EGC-accelerated
VRAM-to-VRAM blit of that row to its actual target position with the mask
enabled. It would have been way more efficient to add another version of
this function that takes a mask pattern. No amount of REP
MOVSW is going to change the fact that two VRAM writes per line are
slower than a single one. Not to mention that it doesn't justify writing
every other .PI function in ASM to go along with it…
This is where we also find the most hilarious aspect about this: For most
of ZUN's pointless micro-optimizations, you could have maybe made the
argument that they do save some CPU cycles here and there, and
therefore did something positive to the final, PC-98-exclusive result. But
some of the hand-written ASM here doesn't even constitute a
micro-optimization, because it's worse than what you would have got
out of even Turbo C++ 4.0J with its 80386 optimization flags!
At least it was possible to "decompile" 6 out of the 10 functions
here, making them easy to clean up for future modders and port authors.
Could have been 7 functions if I also decided to "decompile"
pi_free(), but all the C++ code is already surrounded by ASM,
resulting in 2 ASM translation units and 2 C++ translation units.
pi_free() would have needed a single translation unit by
itself, which wasn't worth it, given that I would have had to spell out
every single ASM instruction anyway.
There you go. What about this needed to be written in ASM?!?
The function calls between these small translation units even seemed to
glitch out TASM and the linker in the end, leading to one CALL
offset being weirdly shifted by 32 bytes. Usually, TLINK reports a fixup
overflow error when this happens, but this time it didn't, for some reason?
Mirroring the segment grouping in the affected translation unit did solve
the problem, and I already knew this, but only thought of it after spending
quite some RTFM time… during which I discovered the -lE
switch, which enables TLINK to use the expanded dictionaries in
Borland's .OBJ and .LIB files to speed up linking. That shaved off roughly
another second from the build time of the complete ReC98 repository. The
more you know… Binary blobs compiled with non-Borland tools would be the
only reason not to use this flag.
So, even more slowdown with this 5th dedicated push, since we've still only
repaid 41% of the technical debt in the SHARED segment so far.
Next up: Part 6, which hopefully manages to decompile the FM and SSG
channel animations in TH05's Music Room, and hopefully ends up being the
final one of the slow ones.
50% hype! 🎉 But as usual for TH01, even that final set of functions
shared between all bosses had to consume two pushes rather than one…
First up, in the ongoing series "Things that TH01 draws to the PC-98
graphics layer that really should have been drawn to the text layer
instead": The boss HP bar. Oh well, using the graphics layer at least made
it possible to have this half-red, half-white pattern
for the middle section.
This one pattern is drawn by making surprisingly good use of the GRCG. So
far, we've only seen it used for fast monochrome drawing:
// Setting up fast drawing using color #9 (1001 in binary)
grcg_setmode(GC_RMW);
outportb(0x7E, 0xFF); // Plane 0: (B): (********)
outportb(0x7E, 0x00); // Plane 1: (R): ( )
outportb(0x7E, 0x00); // Plane 2: (G): ( )
outportb(0x7E, 0xFF); // Plane 3: (E): (********)
// Write a checkerboard pattern (* * * * ) in color #9 to the top-left corner,
// with transparent blanks. Requires only 1 VRAM write to a single bitplane:
// The GRCG automatically writes to the correct bitplanes, as specified above
*(uint8_t *)(MK_FP(0xA800, 0)) = 0xAA;
But since this is actually an 8-pixel tile register, we can set any
8-pixel pattern for any bitplane. This way, we can get different colors
for every one of the 8 pixels, with still just a single VRAM write of the
alpha mask to a single bitplane:
And I thought TH01 only suffered the drawbacks of PC-98 hardware, making
so little use of its actual features that it's perhaps not fair to even
call it "a PC-98 game"… Still, I'd say that "bad PC-98 port of an idea"
describes it best.
However, after that tiny flash of brilliance, the surrounding HP rendering
code goes right back to being the typical sort of confusing TH01 jank.
There's only a single function for the three distinct jobs of
incrementing HP during the boss entrance animation,
decrementing HP if hit by the Orb, and
redrawing the entire bar, because it's still all in VRAM, and Sariel
wants different backgrounds,
with magic numbers to select between all of these.
VRAM of course also means that the backgrounds behind the individual hit
points have to be stored, so that they can be unblitted later as the boss
is losing HP. That's no big deal though, right? Just allocate some memory,
copy what's initially in VRAM, then blit it back later using your
foundational set of blitting funct– oh, wait, TH01 doesn't have this sort
of thing, right The closest thing,
📝 once again, are the .PTN functions. And
so, the game ends up handling these 8×16 background sprites with 16×16
wrappers around functions for 32×32 sprites.
That's quite the recipe for confusion, especially since ZUN
preferred copy-pasting the necessary ridiculous arithmetic expressions for
calculating positions, .PTN sprite IDs, and the ID of the 16×16 quarter
inside the 32×32 sprite, instead of just writing simple helper functions.
He did manage to make the result mostly bug-free this time
around, though! (Edit (2022-05-31): Nope, there's a
📝 potential heap corruption after all, which can be triggered in some fights in debug mode.)
There's one minor hit point discoloration bug if the red-white or white
sections start at an odd number of hit points, but that's never the case for
any of the original 7 bosses.
The remaining sloppiness is ultimately inconsequential as well: The game
always backs up twice the number of hit point backgrounds, and thus
uses twice the amount of memory actually required. Also, this
self-restriction of only unblitting 16×16 pixels at a time requires any
remaining odd hit point at the last position to, of course, be rendered
again
After stumbling over the weakest imaginable random number
generator, we finally arrive at the shared boss↔orb collision
handling function, the final blocker among the final blockers. This
function takes a whopping 12 parameters, 3 of them being references to
int values, some of which are duplicated for every one of the
7 bosses, with no generic boss struct anywhere.
📝 Previously, I speculated that YuugenMagan might have been the first boss to be programmed for TH01.
With all these variables though, there is some new evidence that SinGyoku
might have been the first one after all: It's the only boss to use its own
HP and phase frame variables, with the other bosses sharing the same two
globals.
While this function only handles the response to a boss↔orb
collision, it still does way too much to describe it briefly. Took me
quite a while to frame it in terms of invincibility (which is the
main impact of all of this that can be observed in gameplay code). That
made at least some sort of sense, considering the other usages of
the variables passed as references to that function. Turns out that
YuugenMagan, Kikuri, and Elis abuse what's meant to be the "invincibility
frame" variable as a frame counter for some of their animations 🙄
Oh well, the game at least doesn't call the collision handling function
during those, so "invincibility frame" is technically still a
correct variable name there.
And that's it! We're finally ready to start with Konngara, in 2021. I've
been waiting quite a while for this, as all this high-level boss code is
very likely to speed up TH01 progress quite a bit. Next up though: Closing
out 2020 with more of the technical debt in the other games.
Done with the .BOS format, at last! While there's still quite a bunch of
undecompiled non-format blitting code left, this was in fact the final
piece of graphics format loading code in TH01.
📝 Continuing the trend from three pushes ago,
we've got yet another class, this time for the 48×48 and 48×32 sprites
used in Reimu's gohei, slide, and kick animations. The only reason these
had to use the .BOS format at all is simply because Reimu's regular
sprites are 32×32, and are therefore loaded from
📝 .PTN files.
Yes, this makes no sense, because why would you split animations for
the same character across two file formats and two APIs, just because
of a sprite size difference?
This necessity for switching blitting APIs might also explain why Reimu
vanishes for a few frames at the beginning and the end of the gohei swing
animation, but more on that once we get to the high-level rendering code.
Now that we've decompiled all the .BOS implementations in TH01, here's an
overview of all of them, together with .PTN to show that there really was
no reason for not using the .BOS API for all of Reimu's sprites:
CBossEntity
CBossAnim
CPlayerAnim
ptn_* (32×32)
Format
.BOS
.BOS
.BOS
.PTN
Hitbox
✔
✘
✘
✘
Byte-aligned blitting
✔
✔
✔
✔
Byte-aligned unblitting
✔
✘
✔
✔
Unaligned blitting
Single-line and wave only
✘
✘
✘
Precise unblitting
✔
✘
✔
✔
Per-file sprite limit
8
8
32
64
Pixels blitted at once
16
16
8
32
And even that last property could simply be handled by branching based on
the sprite width, and wouldn't be a reason for switching formats. But
well, it just wouldn't be TH01 without all that redundant bloat though,
would it?
The basic loading, freeing, and blitting code was yet another variation
on the other .BOS code we've seen before. So this should have caused just
as little trouble as the CBossAnim code… except that
CPlayerAnimdid add one slightly difficult function to
the mix, which led to it requiring almost a full push after all.
Similar to 📝 the unblitting code for moving lasers we've seen in the last push,
ZUN tries to minimize the amount of VRAM writes when unblitting Reimu's
slide animations. Technically, it's only necessary to restore the pixels
that Reimu traveled by, plus the ones that wouldn't be redrawn by
the new animation frame at the new X position.
The theoretically arbitrary distance between the two sprites is, of
course, modeled by a fixed-size buffer on the stack
, coming with the further assumption that the
sprite surely hasn't moved by more than 1 horizontal VRAM byte compared to
the last frame. Which, of course, results in glitches if that's not the
case, leaving little Reimu parts in VRAM if the slide speed ever exceeded
8 pixels per frame. (Which it never does,
being hardcoded to 6 pixels, but still.). As it also turns out, all those
bit masking operations easily lead to incredibly sloppy C code.
Which compiles into incredibly terrible ASM, which in turn might end up
wasting way more CPU time than the final VRAM write optimization would
have gained? Then again, in-depth profiling is way beyond the scope of
this project at this point.
Next up: The TH04 main menu, and some more technical debt.
This time around, laser is 📝 actually not
difficult, with TH01's shootout laser class being simple enough to nicely
fit into a single push. All other stationary lasers (as used by
YuugenMagan, for example) don't even use a class, and are simply treated
as regular lines with collision detection.
But of course, the shootout lasers also come with the typical share of
TH01 jank we've all come to expect by now. This time, it already starts
with the hardcoded sprite data:
A shootout laser can have a width from 1 to 8 pixels, so ZUN stored a
separate 16×1 sprite with a line for each possible width (left-to-right).
Then, he shifted all of these sprites 1 pixel to the right for all of the
8 possible start positions within a planar VRAM byte (top-to-bottom).
Because… doing that bit shift programmatically is way too
expensive, so let's pre-shift at compile time, and use 16× the memory per
sprite?
Since a bunch of other sprite sheets need to be pre-shifted as well (this
is the 5th one we've found so far), our sprite converter has a feature to
automatically generate those pre-shifted variations. This way, we can
abstract away that implementation detail and leave modders with .BMP files
that still only contain a single version of each sprite. But, uh…, wait,
in this sprite sheet, the second row for 1-pixel lasers is accidentally
shifted right by one more pixel that it should have been?! Which means
that
we can't use the auto-preshift feature here, and have to store this
weird-looking (and quite frankly, completely unnecessary) sprite sheet in
its entirety
ZUN did, at least during TH01's development, not have a sprite
converter, and directly hardcoded these dot patterns in the C++ code
The waste continues with the class itself. 69 bytes, with 22 bytes
outright unused, and 11 not really necessary. As for actual innovations
though, we've got
📝 another 32-bit fixed-point type, this
time actually using 8 bits for the fractional part. Therefore, the
ray position is tracked to the 1/256th of a pixel, using the full
precision of master.lib's 8-bit sin() and cos() lookup
tables.
Unblitting is also remarkably efficient: It's only done once the laser
stopped extending and started moving, and only for the exact pixels at the
start of the ray that the laser traveled by in a single frame. If only the
ray part was also rendered as efficiently – it's fully blitted every frame,
right next to the collision detection for each row of the ray.
With a public interface of two functions (spawn, and update / collide /
unblit / render), that's superficially all there is to lasers in this
game. There's another (apparently inlined) function though, to both reset
and, uh, "fully unblit" all lasers at the end of every boss fight… except
that it fails hilariously at doing the latter, and ends up effectively
unblitting random 32-pixel line segments, due to ZUN confusing both the
coordinates and the parameter types for the line unblitting function.
A while ago, I was asked about
this crash that tends to
happen when defeating Elis. And while you can clearly see the random
unblitted line segments that are missing from the sprites, I don't
quite think we've found the cause for the crash, since the
📝 line unblitting function used theredoes clip its coordinates to the VRAM range.
Next up: The final piece of image format code in TH01, covering Reimu's
sprites!
Back to TH01, and its boss sprite format… with a separate class for
storing animations that only differs minutely from the
📝 regular boss entity class I covered last time?
Decompiling this class was almost free, and the main reason why the first
of these pushes ended up looking pretty huge.
Next up were the remaining shape drawing functions from the code segment
that started with the .GRC functions. P0105 already started these with the
(surprisingly sanely implemented) 8×8 diamond, star, and… uh, snowflake
(?) sprites
,
prominently seen in the Konngara, Elis, and Sariel fights, respectively.
Now, we've also got:
ellipse arcs with a customizable angle distance between the individual
dots – mostly just used for drawing full circles, though
line loops – which are only used for the rotating white squares around
Mima, meaning that the white star in the YuugenMagan fight got a completely
redundant reimplementation
and the surprisingly weirdest one, drawing the red invincibility
sprites.
The weirdness becomes obvious with just a single screenshot:
First, we've got the obvious issue of the sprites not being clipped at the
right edge of VRAM, with the rightmost pixels in each row of the sprite
extending to the beginning of the next row. Well, that's just what you get
if you insist on writing unique low-level blitting code for the majority
of the individual sprites in the game… 🤷
More importantly though, the sprite sheet looks like this:
So how do we even get these fully filled red diamonds?
Well, turns out that the sprites are never consistently unblitted during
their 8 frames of animation. There is a function that looks
like it unblits the sprite… except that it starts with by enabling the
GRCG and… reading from the first bitplane on the background page?
If this was the EGC, such a read would fill some internal registers with
the contents of all 4 bitplanes, which can then subsequently be blitted to
all 4 bitplanes of any VRAM page with a single memory write. But with the
GRCG in RMW mode, reads do nothing special, and simply copy the memory
contents of one bitplane to the read destination. Maybe ZUN thought
that setting the RMW color to red
also sets some internal 4-plane mask register to match that color?
Instead, the rather random pixels read from the first bitplane are then
used as a mask for a second blit of the same red sprite.
Effectively, this only really "unblits" the invincibility pixels that are
drawn on top of Reimu's sprite. Since Reimu is drawn first, the
invincibility sprites are overwritten anyway. But due to the palette color
layout of Reimu's sprite, its pixels end up fully masking away any
invincibility sprite pixels in that second blit, leaving VRAM untouched as
a result. Anywhere else though, this animation quickly turns into the
union of all animation frames.
Then again, if that 16-dot-aligned rectangular unblitting function is all
you know about the EGC, and you can't be bothered to write a perfect
unblitter for 8×8 sprites, it becomes obvious why you wouldn't want to use
it:
Because Reimu would barely be visible under all that flicker. In
comparison, those fully filled diamonds actually look pretty good.
After all that, the remaining time wouldn't have been enough for the next
few essential classes, so I closed out the push with three more VRAM
effects instead:
Single-bitplane pixel inversion inside a 32×32 square – the main effect
behind the discoloration seen in the bomb animation, as well as the
expanding squares at the end of Kikuri's and Sariel's entrance
animation
EGC-accelerated VRAM row copies – the second half of smooth and fully
hardware-accelerated scrolling for backgrounds that are twice the size of
VRAM
And finally, the VRAM page content transition function using meshed 8×8
squares, used for the blocky transition to Sariel's first and second phases.
Which is quite ridiculous in just how needlessly bloated it is. I'm positive
that this sort of thing could have also been accelerated using the PC-98's
EGC… although simply writing better C would have already gone a long way.
The function also comes with three unused mesh patterns.
And with that, ReC98, as a whole, is not only ⅓ done, but I've also fully
caught up with the feature backlog for the first time in the history of
this crowdfunding! Time to go into maintenance mode then, while we wait
for the next pushes to be funded. Got a huge backlog of tiny maintenance
issues to address at a leisurely pace, and of course there's also the
📝 16-bit build system waiting to be
finished.
Only one newly ordered push since I've reopened the store? Great, that's
all the justification I needed for the extended maintenance delay that was
part of these two pushes 😛
Having to write comments to explain whether coordinates are relative to
the top-left corner of the screen or the top-left corner of the playfield
has finally become old. So, I introduced
distinct
types for all the coordinate systems we typically encounter, applying
them to all code decompiled so far. Note how the planar nature of PC-98
VRAM meant that X and Y coordinates also had to be different from each
other. On the X side, there's mainly the distinction between the
[0; 640] screen space and the corresponding [0; 80] VRAM byte
space. On the Y side, we also have the [0; 400] screen space, but
the visible area of VRAM might be limited to [0; 200] when running in
the PC-98's line-doubled 640×200 mode. A VRAM Y coordinate also always
implies an added offset for vertical scrolling.
During all of the code reconstruction, these types can only have a
documenting purpose. Turning them into anything more than just
typedefs to int, in order to define conversion
operators between them, simply won't recompile into identical binaries.
Modding and porting projects, however, now have a nice foundation for
doing just that, and can entirely lift coordinate system transformations
into the type system, without having to proofread all the meaningless
int declarations themselves.
So, what was left in terms of memory references? EX-Alice's fire waves
were our final unknown entity that can collide with the player. Decently
implemented, with little to say about them.
That left the bomb animation structures as the one big remaining PI
blocker. They started out nice and simple in TH04, with a small 6-byte
star animation structure used for both Reimu and Marisa. TH05, however,
gave each character her own animation… and what the hell is going
on with Reimu's blue stars there? Nope, not going to figure this out on
ASM level.
A decompilation first required some more bomb-related variables to be
named though. Since this was part of a generic RE push, it made sense to
do this in all 5 games… which then led to nice PI gains in anything
but TH05. Most notably, we now got the
"pulling all items to player" flag in TH04 and TH05, which is
actually separate from bombing. The obvious cheat mod is left as an
exercise to the reader.
So, TH05 bomb animations. Just like the
📝 custom entity types of this game, all 4
characters share the same memory, with the superficially same 10-byte
structure.
But let's just look at the very first field. Seen from a low level, it's a
simple struct { int x, y; } pos, storing the current position
of the character-specific bomb animation entity. But all 4 characters use
this field differently:
For Reimu's blue stars, it's the top-left position of each star, in the
12.4 fixed-point format. But unlike the vast majority of these values in
TH04 and TH05, it's relative to the top-left corner of the
screen, not the playfield. Much better represented as
struct { Subpixel screen_x, screen_y; } topleft.
For Marisa's lasers, it's the center of each circle, as a regular 12.4
fixed-point coordinate, relative to the top-left corner of the playfield.
Much better represented as
struct { Subpixel x, y; } center.
For Mima's shrinking circles, it's the center of each circle in regular
pixel coordinates. Much better represented as
struct { screen_x_t x; screen_y_t y; } center.
For Yuuka's spinning heart, it's the top-left corner in regular pixel
coordinates. Much better represented as
struct { screen_x_t x; screen_y_t y; } topleft.
And yes, singular. The game is actually smart enough to only store a single
heart, and then create the rest of the circle on the fly. (If it were even
smarter, it wouldn't even use this structure member, but oh well.)
Therefore, I decompiled it as 4 separate structures once again, bundled
into an union of arrays.
As for Reimu… yup, that's some pointer arithmetic straight out of
Jigoku* for setting and updating the positions of the falling star
trails. While that certainly required several
comments to wrap my head around the current array positions, the one "bug"
in all this arithmetic luckily has no effect on the game.
There is a small glitch with the growing circles, though. They are
spawned at the end of the loop, with their position taken from the star
pointer… but after that pointer has already been incremented. On
the last loop iteration, this leads to an out-of-bounds structure access,
with the position taken from some unknown EX-Alice data, which is 0 during
most of the game. If you look at the animation, you can easily spot these
bugged circles, consistently growing from the top-left corner (0, 0)
of the playfield:
After all that, there was barely enough remaining time to filter out and
label the final few memory references. But now, TH05's
MAIN.EXE is technically position-independent! 🎉
-Tom- is going to work on a pretty extensive demo of this
unprecedented level of efficient Touhou game modding. For a more impactful
effect of both the 100% PI mark and that demo, I'll be delaying the push
covering the remaining false positives in that binary until that demo is
done. I've accumulated a pretty huge backlog of minor maintenance issues
by now…
Next up though: The first part of the long-awaited build system
improvements. I've finally come up with a way of sanely accelerating the
32-bit build part on most setups you could possibly want to build ReC98
on, without making the building experience worse for the other few setups.
It's vacation time! Which, for ReC98, means "relaxing by looking at
something boring and uninteresting that we'll ultimately have to cover
anyway"… like the TH01 HUD.
📝 As noted earlier, all the score, card
combo, stage, and time numbers are drawn into VRAM. Which turns TH01's HUD
rendering from the trivial, gaiji-assisted text RAM writes we see in later
games to something that, once again, requires blitting and unblitting
steps. For some reason though, everything on there is blitted to both
VRAM pages? And that's why the HUD chose to allocate a bunch of .PTN
sprite slots to store the background behind all "animated" elements at the
beginning of a 4-stage scene or boss battle… separately for every
affected 16×16 area. (Looking forward to the completely unnecessary
code in the Sariel fight that updates these slots after the backgrounds
were animated!) And without any separation into helper functions, we end
up with the same blitting calls separately copy-pasted for every single
HUD element. That's why something as seemingly trivial as this isn't even
done after 2 pushes, as we're still missing the stage timer.
Thankfully, the .PTN function signatures come with none of ZUN's little
inconsistencies, so I was able to mostly reduce this copy-pasta to a bunch
of small inline functions and macros. Those interfaces still remain a bit
annoying, though. As a 32×32 format, .PTN merely supports 16×16 sprites
with a separate bunch of functions that take an additional
quarter parameter from 0 to 3, to select one of the 4 16×16
quarters in a such a sprite…
For life and bomb counts, there was no way around VRAM though, since ZUN
wanted to use more than a single color for those. This is where we find at
least somewhat of a mildly interesting quirk in all of this: Any life
counts greater than the intended 6 will wrap into new rows, with the bombs
in the second row overlapping those excess lives. With the way the rest of
the HUD rendering works, that wrapping code code had to be explicitly
written… which means that ZUN did in fact accomodate (his own?) cheating
there.
Now, I promised image formats, and in the middle of this copy-pasta, we
did get one… sort of. MASK.GRF, the red HUD
background, is entirely handled with two small bespoke functions… and
that's all the code we have for this format. Basically, it's a variation
on the 📝 .GRZ format we've seen earlier. It
uses the exact same RLE algorithm, but only has a single byte stream for
both RLE commands and pixel data… as you would expect from an RLE format.
.GRF actually stores 4 separately encoded RLE streams, which suggests that
it was intended for full 16-color images. Unfortunately,
MASK.GRF only contains 4 copies of the same HUD background
, so no unused beta data for us there. The only
thing we could derive from 4 identical bitplanes would be that the
background was originally meant to be drawn using color #15, rather than
the red seen in the final game. Color
#15 is a stage-specific background color that would have made the
HUD blend in quite nicely – in the YuugenMagan fight, it's the changing
color of the 邪 in the background, for example. But
really, with no generic implementation of this format, that's all just
speculation.
Oh, and in case you were looking for a rip of that image:
So yeah, more of the usual TH01 code, with the usual small quirks, but
nothing all too horrible – as expected. Next up: The image formats that
didn't make it into this push.
Well, make that three days. Trying to figure out all the details behind
the sprite flickering was absolutely dreadful…
It started out easy enough, though. Unsurprisingly, TH01 had a quite
limited pellet system compared to TH04 and TH05:
The cap is 100, rather than 240 in TH04 or 180 in TH05.
Only 6 special motion functions (with one of them broken and unused)
instead of 10. This is where you find the code that generates SinGyoku's
chase pellets, Kikuri's small spinning multi-pellet circles, and
Konngara's rain pellets that bounce down from the top of the playfield.
A tiny selection of preconfigured multi-pellet groups. Rather than
TH04's and TH05's freely configurable n-way spreads, stacks, and rings,
TH01 only provides abstractions for 2-, 3-, 4-, and 5- way spreads (yup,
no 6-way or beyond), with a fixed narrow or wide angle between the
individual pellets. The resulting pellets are also hardcoded to linear
motion, and can't use the special motion functions. Maybe not the best
code, but still kind of cute, since the generated groups do follow a
clear logic.
As expected from TH01, the code comes with its fair share of smaller,
insignificant ZUN bugs and oversights. As you would also expect
though, the sprite flickering points to the biggest and most consequential
flaw in all of this.
Apparently, it started with ZUN getting the impression that it's only
possible to use the PC-98 EGC for fast blitting of all 4 bitplanes in one
CPU instruction if you blit 16 horizontal pixels (= 2 bytes) at a time.
Consequently, he only wrote one function for EGC-accelerated sprite
unblitting, which can only operate on a "grid" of 16×1 tiles in VRAM. But
wait, pellets are not only just 8×8, but can also be placed at any
unaligned X position…
… yet the game still insists on using this 16-dot-aligned function to
unblit pellets, forcing itself into using a super sloppy 16×8 rectangle
for the job. 🤦 ZUN then tried to mitigate the resulting flickering in two
hilarious ways that just make it worse:
An… "interlaced rendering" mode? This one's activated for all Stage 15
and 20 fights, and separates pellets into two halves that are rendered on
alternating frames. Collision detection with the Yin-Yang Orb and the
player is only done for the visible half, but collision detection with
player shots is still done for all pellets every frame, as are
motion updates – so that pellets don't end up moving half as fast as they
should.
So yeah, your eyes weren't deceiving you. The game does effectively
drop its perceived frame rate in the Elis, Kikuri, Sariel, and Konngara
fights, and it does so deliberately.
📝 Just like player shots, pellets
are also unblitted, moved, and rendered in a single function.
Thanks to the 16×8 rectangle, there's now the (completely unnecessary)
possibility of accidentally unblitting parts of a sprite that was
previously drawn into the 8 pixels right of a pellet. And this
is where ZUN went full and went "oh, I
know, let's test the entire 16 pixels, and in case we got an entity
there, we simply make the pellet invisible for this frame! Then
we don't even have to unblit it later!"
Except that this is only done for the first 3 elements of the player
shot array…?! Which don't even necessarily have to contain the 3 shots
fired last. It's not done for the player sprite, the Orb, or, heck,
other pellets that come earlier in the pellet array. (At least
we avoided going 𝑂(𝑛²) there?)
Actually, and I'm only realizing this now as I type this blog post:
This test is done even if the shots at those array elements aren't
active. So, pellets tend to be made invisible based on comparisons
with garbage data.
And then you notice that the player shot
unblit/move/render function is actually only ever called from the
pellet unblit/move/render function on the one global instance
of the player shot manager class, after pellets were unblitted. So, we
end up with a sequence of
which means that we can't ever unblit a previously rendered shot
with a pellet. Sure, as terrible as this one function call is from
a software architecture perspective, it was enough to fix this issue.
Yet we don't even get the intended positive effect, and walk away with
pellets that are made temporarily invisible for no reason at all. So,
uh, maybe it all just was an attempt at increasing the
ramerate on lower spec PC-98 models?
Yup, that's it, we've found the most stupid piece of code in this game,
period. It'll be hard to top this.
I'm confident that it's possible to turn TH01 into a well-written, fluid
PC-98 game, with no flickering, and no perceived lag, once it's
position-independent. With some more in-depth knowledge and documentation
on the EGC (remember, there's still
📝 this one TH03 push waiting to be funded),
you might even be able to continue using that piece of blitter hardware.
And no, you certainly won't need ASM micro-optimizations – just a bit of
knowledge about which optimizations Turbo C++ does on its own, and what
you'd have to improve in your own code. It'd be very hard to write
worse code than what you find in TH01 itself.
(Godbolt for Turbo C++ 4.0J when?
Seriously though, that would 📝 also be a
great project for outside contributors!)
Oh well. In contrast to TH04 and TH05, where 4 pushes only covered all the
involved data types, they were enough to completely cover all of
the pellet code in TH01. Everything's already decompiled, and we never
have to look at it again. 😌 And with that, TH01 has also gone from by far
the least RE'd to the most RE'd game within ReC98, in just half a year! 🎉
Still, that was enough TH01 game logic for a while.
Next up: Making up for the delay with some
more relaxing and easy pieces of TH01 code, that hopefully make just a
bit more sense than all this garbage. More image formats, mainly.
So, let's finally look at some TH01 gameplay structures! The obvious
choices here are player shots and pellets, which are conveniently located
in the last code segment. Covering these would therefore also help in
transferring some first bits of data in REIIDEN.EXE from ASM
land to C land. (Splitting the data segment would still be quite
annoying.) Player shots are immediately at the beginning…
…but wait, these are drawn as transparent sprites loaded from .PTN files.
Guess we first have to spend a push on
📝 Part 2 of this format.
Hm, 4 functions for alpha-masked blitting and unblitting of both 16×16 and
32×32 .PTN sprites that align the X coordinate to a multiple of 8
(remember, the PC-98 uses a
planar
VRAM memory layout, where 8 pixels correspond to a byte), but only one
function that supports unaligned blitting to any X coordinate, and only
for 16×16 sprites? Which is only called twice? And doesn't come with a
corresponding unblitting function?
Yeah, "unblitting". TH01 isn't
double-buffered,
and uses the PC-98's second VRAM page exclusively to store a stage's
background and static sprites. Since the PC-98 has no hardware sprites,
all you can do is write pixels into VRAM, and any animated sprite needs to
be manually removed from VRAM at the beginning of each frame. Not using
double-buffering theoretically allows TH01 to simply copy back all 128 KB
of VRAM once per frame to do this. But that
would be pretty wasteful, so TH01 just looks at all animated sprites, and
selectively copies only their occupied pixels from the second to the first
VRAM page.
Alright, player shot class methods… oh, wait, the collision functions
directly act on the Yin-Yang Orb, so we first have to spend a push on
that one. And that's where the impression we got from the .PTN
functions is confirmed: The orb is, in fact, only ever displayed at
byte-aligned X coordinates, divisible by 8. It's only thanks to the
constant spinning that its movement appears at least somewhat
smooth.
This is purely a rendering issue; internally, its position is
tracked at pixel precision. Sadly, smooth orb rendering at any unaligned X
coordinate wouldn't be that trivial of a mod, because well, the
necessary functions for unaligned blitting and unblitting of 32×32 sprites
don't exist in TH01's code. Then again, there's so much potential for
optimization in this code, so it might be very possible to squeeze those
additional two functions into the same C++ translation unit, even without
position independence…
More importantly though, this was the right time to decompile the core
functions controlling the orb physics – probably the highlight in these
three pushes for most people.
Well, "physics". The X velocity is restricted to the 5 discrete states of
-8, -4, 0, 4, and 8, and gravity is applied by simply adding 1 to the Y
velocity every 5 frames No wonder that this can
easily lead to situations in which the orb infinitely bounces from the
ground.
At least fangame authors now have
a
reference of how ZUN did it originally, because really, this bad
approximation of physics had to have been written that way on purpose. But
hey, it uses 64-bit floating-point variables!
…sometimes at least, and quite randomly. This was also where I had to
learn about Turbo C++'s floating-point code generation, and how rigorously
it defines the order of instructions when mixing double and
float variables in arithmetic or conditional expressions.
This meant that I could only get ZUN's original instruction order by using
literal constants instead of variables, which is impossible right now
without somehow splitting the data segment. In the end, I had to resort to
spelling out ⅔ of one function, and one conditional branch of another, in
inline ASM. 😕 If ZUN had just written 16.0 instead of
16.0f there, I would have saved quite some hours of my life
trying to decompile this correctly…
To sort of make up for the slowdown in progress, here's the TH01 orb
physics debug mod I made to properly understand them. Edit
(2022-07-12): This mod is outdated,
📝 the current version is here!2020-06-13-TH01OrbPhysicsDebug.zip
To use it, simply replace REIIDEN.EXE, and run the game
in debug mode, via game d on the DOS prompt.
Its code might also serve as an example of how to achieve this sort of
thing without position independence.
Alright, now it's time for player shots though. Yeah, sure, they
don't move horizontally, so it's not too bad that those are also
always rendered at byte-aligned positions. But, uh… why does this code
only use the 16×16 alpha-masked unblitting function for decaying shots,
and just sloppily unblits an entire 16×16 square everywhere else?
The worst part though: Unblitting, moving, and rendering player shots
is done in a single function, in that order. And that's exactly where
TH01's sprite flickering comes from. Since different types of sprites are
free to overlap each other, you'd have to first unblit all types, then
move all types, and then render all types, as done in later
PC-98 Touhou games. If you do these three steps per-type instead, you
will unblit sprites of other types that have been rendered before… and
therefore end up with flicker.
Oh, and finally, ZUN also added an additional sloppy 16×16 square unblit
call if a shot collides with a pellet or a boss, for some
guaranteed flicker. Sigh.
And that's ⅓ of all ZUN code in TH01 decompiled! Next up: Pellets!
Three pushes to decompile the TH01 high score menu… because it's
completely terrible, and needlessly complicated in pretty much every
aspect:
Another, final set of differences between the REIIDEN.EXE
and FUUIN.EXE versions of the code. Which are so
insignificant that it must mean that ZUN kept this code in two
separate, manually and imperfectly synced files. The REIIDEN.EXE
version, only shown when game-overing, automatically jumps to the
enter/終 button after the 8th character was entered,
and also has a completely invisible timeout that force-enters a high score
name after 1000… key presses? Not frames? Why. Like, how do you
even realistically such a number. (Best guess: It's a hidden easter egg to
amuse players who place drinking glasses on cursor keys. Or beer bottles.)
That's all the differences that are maybe visible if you squint
hard enough. On top of that though, we got a bunch of further, minor code
organization differences that serve no purpose other than to waste
decompilation time, and certainly did their part in stretching this out to
3 pushes instead of 2.
Entered names are restricted to a set of 16-bit, full-width Shift-JIS
codepoints, yet are still accessed as 8-bit byte arrays everywhere. This
bloats both the C++ and generated ASM code with needless byte splits,
swaps, and bit shifts. Same for the route kanji. You have this 16-, heck,
even 32-bit CPU, why not use it?! (Fun fact: FUUIN.EXE is
explicitly compiled for a 80186, for the most part – unlike
REIIDEN.EXE, which does use Turbo C++'s 80386 mode.)
The sensible way of storing the current position of the alphabet
cursor would simply be two variables, indicating the logical row and
column inside the character map. When rendering, you'd then transform
these into screen space. This can keep the on-screen position constants in
a single place of code.
TH01 does the opposite: The selected character is stored directly in terms
of its on-screen position, which is then mapped back to a character
index for every processed input and the subsequent screen update. There's
no notion of a logical row or column anywhere, and consequently, the
position constants are vomited all over the code.
Which might not be as bad if the character map had a uniform
grid structure, with no gaps. But the one in TH01 looks like this:
And with no sense of abstraction anywhere, both input handling and
rendering end up with a separate if branch for at least 4 of
the 6 rows.
In the end, I just gave up with my usual redundancy reduction efforts for
this one. Anyone wanting to change TH01's high score name entering code
would be better off just rewriting the entire thing properly.
And that's all of the shared code in TH01! Both OP.EXE and
FUUIN.EXE are now only missing the actual main menu and
ending code, respectively. Next up, though: The long awaited TH01 PI push.
Which will not only deliver 100% PI for OP.EXE and
FUUIN.EXE, but also probably quite some gains in
REIIDEN.EXE. With now over 30% of the game decompiled, it's about
time we get to look at some gameplay code!
Just like most of the time, it was more sensible to cover
GENSOU.SCR, the last structure missing in TH05's
OP.EXE,
everywhere it's used, rather than just rushing out OP.EXE
position independence. I did have to look into all of the functions to
fully RE it after all, and to find out whether the unused fields actually
are unused. The only thing that kept this push from yielding even
more above-average progress was the sheer inconsistency in how the games
implemented the operations on this PC-98 equivalent of score*.dat:
OP.EXE declares two structure instances, for simultaneous
access to both Reimu and Marisa scores. TH05 with its 4 playable
characters instead uses a single one, and overwrites it successively for
each character when drawing the high score menu – meaning, you'd only see
Yuuka's scores when looking at the structure inside the rendered high
score menu. However, it still declares the TH04 "Marisa" structure as a
leftover… and also decodes it and verifies its checksum, despite
nothing being ever loaded into it
MAIN.EXE uses a separate ASM implementation of the decoding
and encoding functions
TH05's MAIN.EXE also reimplements the basic loading
functions
in ASM – without the code to regenerate GENSOU.SCR with
default data if the file is missing or corrupted. That actually makes
sense, since any regeneration is already done in OP.EXE, which
always has to load that file anyway to check how much has been cleared
However, there is a regeneration function in TH05's
MAINE.EXE… which actually generates different default
data: OP.EXE consistently sets Extra Stage records to Stage 1,
while MAINE.EXE uses the same place-based stage numbering that
both versions use for the regular ranks
Technically though, TH05's OP.EXEis
position-independent now, and the rest are (should be?
) merely false positives. However, TH04's is
still missing another structure, in addition to its false
positives. So, let's wait with the big announcement until the next push…
which will also come with a demo video of what will be possible then.
The glacial pace continues, with TH05's unnecessarily, inappropriately
micro-optimized, and hence, un-decompilable code for rendering the current
and high score, as well as the enemy health / dream / power bars. While
the latter might still pass as well-written ASM, the former goes to such
ridiculous levels that it ends up being technically buggy. If you
enjoy quality ZUN code, it's
definitely worth a read.
In TH05, this all still is at the end of code segment #1, but in TH04,
the same code lies all over the same segment. And since I really
wanted to move that code into its final form now, I finally did the
research into decompiling from anywhere else in a segment.
Turns out we actually can! It's kinda annoying, though: After splitting
the segment after the function we want to decompile, we then need to group
the two new segments back together into one "virtual segment" matching the
original one. But since all ASM in ReC98 heavily relies on being
assembled in MASM mode, we then start to suffer from MASM's group
addressing quirk. Which then forces us to manually prefix every single
function call
from inside the group
to anywhere else within the newly created segment
with the group name. It's stupidly boring busywork, because of all the
function calls you mustn't prefix. Special tooling might make this
easier, but I don't have it, and I'm not getting crowdfunded for it.
So while you now definitely can request any specific thing in any
of the 5 games to be decompiled right now, it will take slightly
longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)
Only one function away from the TH05 shot type control functions now!