Well, well. My original plan was to ship the first step of Shuusou Gyoku
OpenGL support on the next day after this delivery. But unfortunately, the
complications just kept piling up, to a point where the required solutions
definitely blow the current budget for that goal. I'm currently sitting on
over 70 commits that would take at least 5 pushes to deliver as a meaningful
release, and all of that is just rearchitecting work, preparing the
game for a not too Windows-specific OpenGL backend in the first place. I
haven't even written a single line of OpenGL yet… 🥲
This shifts the intended Big Release Month™ to June after all. Now I know
that the next round of Shuusou Gyoku features should better start with the
SC-88Pro recordings, which are much more likely to get done within their
current budget. At least I've already completed the configuration versioning
system required for that goal, which leaves only the actual audio part.
So, TH04 position independence. Thanks to a bit of funding for stage
dialogue RE, non-ASCII translations will soon become viable, which finally
presents a reason to push TH04 to 100% position independence after
📝 TH05 had been there for almost 3 years. I
haven't heard back from Touhou Patch Center about how much they want to be
involved in funding this goal, if at all, but maybe other backers are
interested as well.
And sure, it would be entirely possible to implement non-ASCII translations
in a way that retains the layout of the original binaries and can be easily
compared at a binary level, in case we consider translations to be a
critical piece of infrastructure. This wouldn't even just be an exercise in
needless perfectionism, and we only have to look to Shuusou Gyoku to realize
why: Players expected
that my builds were compatible with existing SpoilerAL SSG files, which
was something I hadn't even considered the need for. I mean, the game is
open-source 📝 and I made it easy to build.
You can just fork the code, implement all the practice features you want in
a much more efficient way, and I'd probably even merge your code into my
builds then?
But I get it – recompiling the game yields just yet another build that can't
be easily compared to the original release. A cheat table is much more
trustworthy in giving players the confidence that they're still practicing
the same original game. And given the current priorities of my backers,
it'll still take a while for me to implement proof by replay validation,
which will ultimately free every part of the community from depending on the
original builds of both Seihou and PC-98 Touhou.
However, such an implementation within the original binary layout would
significantly drive up the budget of non-ASCII translations, and I sure
don't want to constantly maintain this layout during development. So, let's
chase TH04 position independence like it's 2020, and quickly cover a larger
amount of PI-relevant structures and functions at a shallow level. The only
parts I decompiled for now contain calculations whose intent can't be
clearly communicated in ASM. Hitbox visualizations or other more in-depth
research would have to wait until I get to the proper decompilation of these
features.
But even this shallow work left us with a large amount of TH04-exclusive
code that had its worst parts RE'd and could be decompiled fairly quickly.
If you want to see big TH04 finalization% gains, general TH04 progress would
be a very good investment.
The first push went to the often-mentioned stage-specific custom entities
that share a single statically allocated buffer. Back in 2020, I
📝 wrongly claimed that these were a TH05 innovation,
but the system actually originated in TH04. Both games use a 26-byte
structure, but TH04 only allocates a 32-element array rather than TH05's
64-element one. The conclusions from back then still apply, but I also kept
wondering why these games used a static array for these entities to begin
with. You know what they call an area of memory that you can cleanly
repurpose for things? That's right, a heap!
And absolutely no one would mind one additional heap allocation at the start
of a stage, next to the ones for all the sprites and portraits.
However, we are still running in Real Mode with segmented memory. Accessing
anything outside a common data segment involves modifying segment registers,
which has a nonzero CPU cycle cost, and Turbo C++ 4.0J is terrible at
optimizing away the respective instructions. Does this matter? Probably not,
but you don't take "risks" like these if you're in a permanent
micro-optimization mindset…
In TH04, this system is used for:
Kurumi's symmetric bullet spawn rays, fired from her hands towards the left
and right edges of the playfield. These are rather infamous for being the
last thing you see before
📝 the Divide Error crash that can happen in ZUN's original build.
Capped to 6 entities.
The 4 📝 bits used in Marisa's Stage 4 boss
fight. Coincidentally also related to the rare Divide Error
crash in that fight.
Stage 4 Reimu's spinning orbs. Note how the game uses two different sets
of sprites just to have two different outline colors. This was probably
better than messing with the palette, which can easily cause unintended
effects if you only have 16 colors to work with. Heck, I have an entire blog post tag just to highlight
these cases. Capped to the full 32 entities.
The chasing cross bullets, seen in Phase 14 of the same Stage 6 Yuuka
fight. Featuring some smart sprite work, making use of point symmetry to
achieve a fluid animation in just 4 frames. This is
good-code in sprite form. Capped to 31 entities, because
the 32nd custom entity during this fight is defined to be…
The single purple pulsating and shrinking safety circle, seen in Phase 4 of
the same fight. The most interesting aspect here is actually still related
to the cross bullets, whose spawn function is wrongly limited to 32 entities
and could theoretically overwrite this circle. This
is strictly landmine territory though:
Yuuka never uses these bullets and the safety circle
simultaneously
She never spawns more than 24 cross bullets
All cross bullets are fast enough to have left the screen by the
time Yuuka restarts the corresponding subpattern
The cross bullets spawn at Yuuka's center position, and assign its
Q12.4 coordinates to structure fields that the safety circle interprets
as raw pixels. The game does try to render the circle afterward, but
since Yuuka's static position during this phase is nowhere near a valid
pixel coordinate, it is immediately clipped.
The flashing lines seen in Phase 5 of the Gengetsu fight,
telegraphing the slightly random bullet columns.
These structures only took 1 push to reverse-engineer rather than the 2 I
needed for their TH05 counterparts because they are much simpler in this
game. The "structure" for Gengetsu's lines literally uses just a single X
position, with the remaining 24 bytes being basically padding. The only
minor bug I found on this shallow level concerns Marisa's bits, which are
clipped at the right and bottom edges of the playfield 16 pixels earlier
than you would expect:
The remaining push went to a bunch of smaller structures and functions:
The structure for the up to 2 "thick" (a.k.a. "Master Spark") lasers. Much
saner than the
📝 madness of TH05's laser system while being
equally customizable in width and duration.
The structure for the various monochrome 16×16 shapes in the background of
the Stage 6 Yuuka fight, drawn on top of the checkerboard.
The rendering code for the three falling stars in the background of Stage 5.
The effect here is entirely palette-related: After blitting the stage tiles,
the 📝 1bpp star image is ORed
into only the 4th VRAM plane, which is equivalent to setting the
highest bit in the palette color index of every pixel within the star-shaped
region. This of course raises the question of how the stage would look like
if it was fully illuminated:
The full tile map of TH04's Stage 5, in both dark and fully
illuminated views. Since the illumination effect depends on two
matching sets of palette colors that are distinguished by a single
bit, the illuminated view is limited to only 8 of the 16 colors. The
dark view, on the other hand, can freely use colors from the
illuminated set, since those are unaffected by the OR
operation.
Most code that modifies a stage's tile map, and directly specifies tiles via
their top-left offset in VRAM.
Thanks to code alignment reasons, this forced a much longer detour into the
.STD format loader. Nothing all too noteworthy there since we're still
missing the enemy script and spawn structures before we can call .STD
"reverse-engineered", but maybe still helpful if you're looking for an
overview of the format. Also features a buffer overflow landmine if a .STD
file happens to contain more than 32 enemy scripts… you know, the usual
stuff.
To top off the second push, we've got the vertically scrolling checkerboard
background during the Stage 6 Yuuka fight, made up of 32×32 squares. This
one deserves a special highlight just because of its needless complexity.
You'd think that even a performant implementation would be pretty simple:
Set the GRCG to TDW mode
Set the GRCG tile to one of the two square colors
Start with Y as the current scroll offset, and X
as some indicator of which color is currently shown at the start of each row
of squares
Iterate over all lines of the playfield, filling in all pixels that
should be displayed in the current color, skipping over the other ones
Count down Y for each line drawn
If Y reaches 0, reset it to 32 and flip X
At the bottom of the playfield, change the GRCG tile to the other color,
and repeat with the initial value of X flipped
The most important aspect of this algorithm is how it reduces GRCG state
changes to a minimum, avoiding the costly port I/O that we've identified
time and time again as one of the main bottlenecks in TH01. With just 2
state variables and 3 loops, the resulting code isn't that complex either. A
naive implementation that just drew the squares from top to bottom in a
single pass would barely be simpler, but much slower: By changing the GRCG
tile on every color, such an implementation would burn a low 5-digit number
of CPU cycles per frame for the 12×11.5-square checkerboard used in the
game.
And indeed, ZUN retained all important aspects of this algorithm… but still
implemented it all in ASM, with a ridiculous layer of x86 segment arithmetic
on top? Which blows up the complexity to 4 state
variables, 5 nested loops, and a bunch of constants in unusual units. I'm
not sure what this code is supposed to optimize for, especially with that
rather questionable register allocation that nevertheless leaves one of the
general-purpose registers unused. Fortunately,
the function was still decompilable without too many code generation hacks,
and retains the 5 nested loops in all their goto-connected
glory. If you want to add a checkerboard to your next PC-98
demo, just stick to the algorithm I gave above.
(Using a single XOR for flipping the starting X offset between 32 and 64
pixels is pretty nice though, I have to give him that.)
This makes for a good occasion to talk about the third and final GRCG mode,
completing the series I started with my previous coverage of the
📝 RMW and
📝 TCR modes. The TDW (Tile Data Write) mode
is the simplest of the three and just writes the 8×1 GRCG tile into VRAM
as-is, without applying any alpha bitmask. This makes it perfect for
clearing rectangular areas of pixels – or even all of VRAM by doing a single
memset():
// Set up the GRCG in TDW mode.
outportb(0x7C, 0x80);
// Fill the tile register with color #7 (0111 in binary).
outportb(0x7E, 0xFF); // Plane 0: (B): (********)
outportb(0x7E, 0xFF); // Plane 1: (R): (********)
outportb(0x7E, 0xFF); // Plane 2: (G): (********)
outportb(0x7E, 0x00); // Plane 3: (E): ( )
// Set the 32 pixels at the top-left corner of VRAM to the exact contents of
// the tile register, effectively repeating the tile 4 times. In TDW mode, the
// GRCG ignores the CPU-supplied operand, so we might as well just pass the
// contents of a register with the intended width. This eliminates useless load
// instructions in the compiled assembly, and even sort of signals to readers
// of this code that we do not care about the source value.
*reinterpret_cast<uint32_t far *>(MK_FP(0xA800, 0)) = _EAX;
// Fill the entirety of VRAM with the GRCG tile. A simple C one-liner that will
// probably compile into a single `REP STOS` instruction. Unfortunately, Turbo
// C++ 4.0J only ever generates the 16-bit `REP STOSW` here, even when using
// the `__memset__` intrinsic and when compiling in 386 mode. When targeting
// that CPU and above, you'd ideally want `REP STOSD` for twice the speed.
memset(MK_FP(0xA800, 0), _AL, ((640 / 8) * 400));
However, this might make you wonder why TDW mode is even necessary. If it's
functionally equivalent to RMW mode with a CPU-supplied bitmask made up
entirely of 1 bits (i.e., 0xFF, 0xFFFF, or
0xFFFFFFFF), what's the point? The difference lies in the
hardware implementation: If all you need to do is write tile data to
VRAM, you don't need the read and modify parts of RMW mode
which require additional processing time. The PC-9801 Programmers'
Bible claims a speedup of almost 2× when using TDW mode over equivalent
operations in RMW mode.
And that's the only performance claim I found, because none of these old
PC-98 hardware and programming books did any benchmarks. Then again, it's
not too interesting of a question to benchmark either, as the byte-aligned
nature of TDW blitting severely limits its use in a game engine anyway.
Sure, maybe it makes sense to temporarily switch from RMW to TDW mode
if you've identified a large rectangular and byte-aligned section within a
sprite that could be blitted without a bitmask? But the necessary
identification work likely nullifies the performance gained from TDW mode,
I'd say. In any case, that's pretty deep
micro-optimization territory. Just use TDW mode for the
few cases it's good at, and stick to RMW mode for the rest.
So is this all that can be said about the GRCG? Not quite, because there are
4 bits I haven't talked about yet…
And now we're just 5.37% away from 100% position independence for TH04! From
this point, another 2 pushes should be enough to reach this goal. It might
not look like we're that close based on the current estimate, but a
big chunk of the remaining numbers are false positives from the player shot
control functions. Since we've got a very special deadline to hit, I'm going
to cobble these two pushes together from the two current general
subscriptions and the rest of the backlog. But you can, of course, still
invest in this goal to allow the existing contributions to go to something
else.
… Well, if the store was actually open. So I'd better
continue with a quick task to free up some capacity sooner rather than
later. Next up, therefore: Back to TH02, and its item and player systems.
Shouldn't take that long, I'm not expecting any surprises there. (Yeah, I
know, famous last words…)
Two years after
📝 the first look at TH04's and TH05's bullets,
we finally get to finish their logic code by looking at the special motion
types. Bullets as a whole still aren't completely finished as the
rendering code is still waiting to be RE'd, but now we've got everything
about them that's required for decompiling the midboss and boss fights of
these games.
Just like the motion types of TH01's pellets, the ones we've got here really
are special enough to warrant an enum, despite all the
overlap in the "slow down and turn" and "bounce at certain edges of the
playfield" types. Sure, including them in the bitfield I proposed two years
ago would have allowed greater variety, but it wouldn't have saved any
memory. On the contrary: These types use a single global state variable for
the maximum turn count and delta speed, which a proper customizable
architecture would have to integrate into the bullet structure. Maybe it is
possible to stuff everything into the same amount of bytes, but not without
first completely rearchitecting the bullet structure and removing every
single piece of redundancy in there. Simply extending the system by adding a
new enum value for a new motion type would be way more
straightforward for modders.
Speaking about memory, TH05 already extends the bullet structure by 6 bytes
for the "exact linear movement" type exclusive to that game. This type is
particularly interesting for all the prospective PC-98 game developers out
there, as it nicely points out the precision limits of Q12.4 subpixels.
Regular bullet movement works by adding a Q12.4 velocity to a Q12.4 position
every frame, with the velocity typically being calculated only once on spawn
time from an 8-bit angle and a Q12.4 speed. Quantization errors from this
initial calculation can quickly compound over all the frames a bullet spends
moving across the playfield. If a bullet is only supposed to move on a
straight line though, there is a more precise way of calculating its
position: By storing the origin point, movement angle, and total distance
traveled, you can perform a full polar→Cartesian transformation every frame.
Out of the 10 danmaku patterns in TH05 that use this motion type, the
difference to regular bullet movement can be best seen in Louise's final
pattern:
Louise's final pattern in its original form, demonstrating
exact linear bullet movement. Note how each bullet spawns slightly
behind the delay cloud: ZUN simply forgot to shift the fixed origin
point along with it.The same pattern with standard bullet movement, corrupting
its intended appearance. No delay cloud-related oversights here though,
at least.
Not far away from the regular bullet code, we've also got the movement
function for the infamous curve / "cheeto" bullets. I would have almost
called them "cheetos" in the code as well, which surely fits more nicely
into 8.3 filenames than "curve bullets" does, but eh, trademarks…
As for hitboxes, we got a 16×16 one on the head node, and a 12×12 one on the
16 trail nodes. The latter simply store the position of the head node during
the last 16 frames, Snake style. But what you're all here for is probably
the turning and homing algorithm, right? Boiled down to its essence, it
works like this:
// [head] points to the controlled "head" part of a curve bullet entity.
// Angles are stored with 8 bits representing a full circle, providing free
// normalization on arithmetic overflow.
// The directions are ordered as you would expect:
// • 0x00: right (sin(0x00) = 0, cos(0x00) = +1)
// • 0x40: down (sin(0x40) = +1, cos(0x40) = 0)
// • 0x80: left (sin(0x80) = 0, cos(0x80) = -1)
// • 0xC0: up (sin(0xC0) = -1, cos(0xC0) = 0)
uint8_t angle_delta = (head->angle - player_angle_from(
head->pos.cur.x, head->pos.cur.y
));
// Stop turning if the player is 1/128ths of a circle away from this bullet
const uint8_t SNAP = 0x02;
// Else, turn either clockwise or counterclockwise by 1/256th of a circle,
// depending on what would reach the player the fastest.
if((angle_delta > SNAP) && (angle_delta < static_cast<uint8_t>(-SNAP))) {
angle_delta = (angle_delta >= 0x80) ? -0x01 : +0x01;
}
head_p->angle -= angle_delta;
5 lines of code, and not all too difficult to follow once you are familiar
with 8-bit angles… unlike what ZUN actually wrote. Which is 26 lines,
and includes an unused "friction" variable that is never set to any value
that makes a difference in the formula. uth05win
correctly saw through that all and simplified this code to something
equivalent to my explanation. Redoing that work certainly wasted a bit of my
time, and means that I now definitely need to spend another push on RE'ing
all the shared boss functions before I can start with Shinki.
So while a curve bullet's speed does get faster over time, its
angular velocity is always limited to 1/256th of a
circle per frame. This reveals the optimal strategy for dodging them:
Maximize this delta angle by staying as close to 180° away from their
current direction as possible, and let their acceleration do the rest.
At least that's the theory for dodging a single one. As a danmaku
designer, you can now of course place other bullets at these technically
optimal places to prevent a curve bullet pattern from being cheesed like
that. I certainly didn't record the video above in a single take either…
After another bunch of boring entity spawn and update functions, the
playfield shaking feature turned out as the most notable (and tricky) one to
round out these two pushes. It's actually implemented quite well in how it
simply "un-shakes" the screen by just marking every stage tile to be
redrawn. In the context of all the other tile invalidation that can take
place during a frame, that's definitely more performant than
📝 doing another EGC-accelerated memmove().
Due to these two games being double-buffered via page flipping, this
invalidation only really needs to happen for the frame after the next
one though. The immediately next frame will show the regular, un-shaken
playfield on the other VRAM page first, except during the multi-frame
shake animation when defeating a midboss, where it will also appear shifted
in a different direction… 😵 Yeah, no wonder why ZUN just always invalidates
all stage tiles for the next two frames after every shaking animation, which
is guaranteed to handle both sporadic single-frame shakes and continuous
ones. So close to good-code here.
Finally, this delivery was delayed a bit because -Tom-
requested his round-up amount to be limited to the cap in the future. Since
that makes it kind of hard to explain on a static page how much money he
will exactly provide, I now properly modeled these discounts in the website
code. The exact round-up amount is now included in both the pre-purchase
breakdown, as well as the cap bar on the main page.
With that in place, the system is now also set up for round-up offers from
other patrons. If you'd also like to support certain goals in this way, with
any amount of money, now's the time for getting in touch with me about that.
Known contributors only, though! 😛
Next up: The final bunch of shared boring boss functions. Which certainly
will give me a break from all the maintenance and research work, and speed
up delivery progress again… right?
…or maybe not that soon, as it would have only wasted time to
untangle the bullet update commits from the rest of the progress. So,
here's all the bullet spawning code in TH04 and TH05 instead. I hope
you're ready for this, there's a lot to talk about!
(For the sake of readability, "bullets" in this blog post refers to the
white 8×8 pellets
and all 16×16 bullets loaded from MIKO16.BFT, nothing else.)
But first, what was going on📝 in 2020? Spent 4 pushes on the basic types
and constants back then, still ended up confusing a couple of things, and
even getting some wrong. Like how TH05's "bullet slowdown" flag actually
always prevents slowdown and fires bullets at a constant speed
instead. Or how "random spread" is not the
best term to describe that unused bullet group type in TH04.
Or that there are two distinct ways of clearing all bullets on screen,
which deserve different names:
Mechanic #1: Clearing bullets for a custom amount of
time, awarding 1000 points for all bullets alive on the first frame,
and 100 points for all bullets spawned during the clear time.
Mechanic #2: Zapping bullets for a fixed 16 frames,
awarding a semi-exponential and loudly announced Bonus!! for all
bullets alive on the first frame, and preventing new bullets from being
spawned during those 16 frames. In TH04 at least; thanks to a ZUN bug,
zapping got reduced to 1 frame and no animation in TH05…
Bullets are zapped at the end of most midboss and boss phases, and
cleared everywhere else – most notably, during bombs, when losing a
life, or as rewards for extends or a maximized Dream bonus. The
Bonus!! points awarded for zapping bullets are calculated iteratively,
so it's not trivial to give an exact formula for these. For a small number
𝑛 of bullets, it would exactly be 5𝑛³ - 10𝑛² + 15𝑛
points – or, using uth05win's (correct) recursive definition,
Bonus(𝑛) = Bonus(𝑛-1) + 15𝑛² - 5𝑛 + 10.
However, one of the internal step variables is capped at a different number
of points for each difficulty (and game), after which the points only
increase linearly. Hence, "semi-exponential".
On to TH04's bullet spawn code then, because that one can at least be
decompiled. And immediately, we have to deal with a pointless distinction
between regular bullets, with either a decelerating or constant
velocity, and special bullets, with preset velocity changes during
their lifetime. That preset has to be set somewhere, so why have
separate functions? In TH04, this separation continues even down to the
lowest level of functions, where values are written into the global bullet
array. TH05 merges those two functions into one, but then goes too far and
uses self-modifying code to save a grand total of two local variables…
Luckily, the rest of its actual code is identical to TH04.
Most of the complexity in bullet spawning comes from the (thankfully
shared) helper function that calculates the velocities of the individual
bullets within a group. Both games handle each group type via a large
switch statement, which is where TH04 shows off another Turbo
C++ 4.0 optimization: If the range of case values is too
sparse to be meaningfully expressed in a jump table, it usually generates a
linear search through a second value table. But with the -G
command-line option, it instead generates branching code for a binary
search through the set of cases. 𝑂(log 𝑛) as the worst case for a
switch statement in a C++ compiler from 1994… that's so cool.
But still, why are the values in TH04's group type enum all
over the place to begin with?
Unfortunately, this optimization is pretty rare in PC-98 Touhou. It only
shows up here and in a few places in TH02, compared to at least 50
switch value tables.
In all of its micro-optimized pointlessness, TH05's undecompilable version
at least fixes some of TH04's redundancy. While it's still not even
optimal, it's at least a decently written piece of ASM…
if you take the time to understand what's going on there, because it
certainly took quite a bit of that to verify that all of the things which
looked like bugs or quirks were in fact correct. And that's how the code
for this function ended up with 35% comments and blank lines before I could
confidently call it "reverse-engineered"…
Oh well, at least it finally fixes a correctness issue from TH01 and TH04,
where an invalid bullet group type would fill all remaining slots in the
bullet array with identical versions of the first bullet.
Something that both games also share in these functions is an over-reliance
on globals for return values or other local state. The most ridiculous
example here: Tuning the speed of a bullet based on rank actually mutates
the global bullet template… which ZUN then works around by adding a wrapper
function around both regular and special bullet spawning, which saves the
base speed before executing that function, and restores it afterward.
Add another set of wrappers to bypass that exact
tuning, and you've expanded your nice 1-function interface to 4 functions.
Oh, and did I mention that TH04 pointlessly duplicates the first set of
wrapper functions for 3 of the 4 difficulties, which can't even be
explained with "debugging reasons"? That's 10 functions then… and probably
explains why I've procrastinated this feature for so long.
At this point, I also finally stopped decompiling ZUN's original ASM just
for the sake of it. All these small TH05 functions would look horribly
unidiomatic, are identical to their decompiled TH04 counterparts anyway,
except for some unique constant… and, in the case of TH05's rank-based
speed tuning function, actually become undecompilable as soon as we
want to return a C++ class to preserve the semantic meaning of the return
value. Mainly, this is because Turbo C++ does not allow register
pseudo-variables like _AX or _AL to be cast into
class types, even if their size matches. Decompiling that function would
have therefore lowered the quality of the rest of the decompiled code, in
exchange for the additional maintenance and compile-time cost of another
translation unit. Not worth it – and for a TH05 port, you'd already have to
decompile all the rest of the bullet spawning code anyway!
The only thing in there that was still somewhat worth being
decompiled was the pre-spawn clipping and collision detection function. Due
to what's probably a micro-optimization mistake, the TH05 version continues
to spawn a bullet even if it was spawned on top of the player. This might
sound like it has a different effect on gameplay… until you realize that
the player got hit in this case and will either lose a life or deathbomb,
both of which will cause all on-screen bullets to be cleared anyway.
So it's at most a visual glitch.
But while we're at it, can we please stop talking about hitboxes? At least
in the context of TH04 and TH05 bullets. The actual collision detection is
described way better as a kill delta of 8×8 pixels between the
center points of the player and a bullet. You can distribute these pixels
to any combination of bullet and player "hitboxes" that make up 8×8. 4×4
around both the player and bullets? 1×1 for bullets, and 8×8 for the
player? All equally valid… or perhaps none of them, once you keep in mind
that other entity types might have different kill deltas. With that in
mind, the concept of a "hitbox" turns into just a confusing abstraction.
The same is true for the 36×44 graze box delta. For some reason,
this one is not exactly around the center of a bullet, but shifted to the
right by 2 pixels. So, a bullet can be grazed up to 20 pixels right of the
player, but only up to 16 pixels left of the player. uth05win also spotted
this… and rotated the deltas clockwise by 90°?!
Which brings us to the bullet updates… for which I still had to
research a decompilation workaround, because
📝 P0148 turned out to not help at all?
Instead, the solution was to lie to the compiler about the true segment
distance of the popup function and declare its signature far
rather than near. This allowed ZUN to save that ridiculous overhead of 1 additional far function
call/return per frame, and those precious 2 bytes in the BSS segment
that he didn't have to spend on a segment value.
📝 Another function that didn't have just a
single declaration in a common header file… really,
📝 how were these games even built???
The function itself is among the longer ones in both games. It especially
stands out in the indentation department, with 7 levels at its most
indented point – and that's the minimum of what's possible without
goto. Only two more notable discoveries there:
Bullets are the only entity affected by Slow Mode. If the number of
bullets on screen is ≥ (24 + (difficulty * 8) + rank) in TH04,
or (42 + (difficulty * 8)) in TH05, Slow Mode reduces the frame
rate by 33%, by waiting for one additional VSync event every two frames.
The code also reveals a second tier, with 50% slowdown for a slightly
higher number of bullets, but that conditional branch can never be executed
Bullets must have been grazed in a previous frame before they can
be collided with. (Note how this does not apply to bullets that spawned
on top of the player, as explained earlier!)
Whew… When did ReC98 turn into a full-on code review?! 😅 And after all
this, we're still not done with TH04 and TH05 bullets, with all the
special movement types still missing. That should be less than one push
though, once we get to it. Next up: Back to TH01 and Konngara! Now have fun
rewriting the Touhou Wiki Gameplay pages 😛
Back to TH05! Thanks to the good funding situation, I can strike a nice
balance between getting TH05 position-independent as quickly as possible,
and properly reverse-engineering some missing important parts of the game.
Once 100% PI will get the attention of modders, the code will then be in
better shape, and a bit more usable than if I just rushed that goal.
By now, I'm apparently also pretty spoiled by TH01's immediate
decompilability, after having worked on that game for so long.
Reverse-engineering in ASM land is pretty annoying, after all,
since it basically boils down to meticulously editing a piece of ASM into
something I can confidently call "reverse-engineered". Most of the
time, simply decompiling that piece of code would take just a little bit
longer, but be massively more useful. So, I immediately tried decompiling
with TH05… and it just worked, at every place I tried!? Whatever the issue
was that made 📝 segment splitting so
annoying at my first attempt, I seem to have completely solved it in the
meantime. 🤷 So yeah, backers can now request pretty much any part of TH04
and TH05 to be decompiled immediately, with no additional segment
splitting cost.
(Protip for everyone interested in starting their own ReC project: Just
declare one segment per function, right from the start, then group them
together to restore the original code segmentation…)
Except that TH05 then just throws more of its infamous micro-optimized and
undecompilable ASM at you. 🙄 This push covered the function that adjusts
the bullet group template based on rank and the selected difficulty,
called every time such a group is configured. Which, just like pretty
much all of TH05's bullet spawning code, is one of those undecompilable
functions. If C allowed labels of other functions as goto
targets, it might have been decompilable into something useful to
modders… maybe. But like this, there's no point in even trying.
This is such a terrible idea from a software architecture point of view, I
can't even. Because now, you suddenly have to mirror your C++
declarations in ASM land, and keep them in sync with each other. I'm
always happy when I get to delete an ASM declaration from the codebase
once I've decompiled all the instances where it was referenced. But for
TH05, we now have to keep those declarations around forever. 😕 And all
that for a performance increase you probably couldn't even measure. Oh
well, pulling off Galaxy Brain-level ASM optimizations is kind of
fun if you don't have portability plans… I guess?
If I started a full fangame mod of a PC-98 Touhou game, I'd base it on
TH04 rather than TH05, and backport selected features from TH05 as
needed. Just because it was released later doesn't make it better, and
this is by far not the only one of ZUN's micro-optimizations that just
went way too far.
Dropping down to ASM also makes it easier to introduce weird quirks.
Decompiled, one of TH05's tuning conditions for
stack
groups on Easy Mode would look something like:
case BP_STACK:
// […]
if(spread_angle_delta >= 2) {
stack_bullet_count--;
}
The fields of the bullet group template aren't typically reset when
setting up a new group. So, spread_angle_delta in the context
of a stack group effectively refers to "the delta angle of the last
spread group that was fired before this stack – whenever that was".
uth05win also spotted this quirk, considered it a bug, and wrote
fanfiction by changing spread_angle_delta to
stack_bullet_count.
As usual for functions that occur in more than one game, I also decompiled
the TH04 bullet group tuning function, and it's perfectly sane, with no
such quirks.
In the more PI-focused parts of this push, we got the TH05-exclusive
smooth boss movement functions, for flying randomly or towards a given
point. Pretty unspectacular for the most part, but we've got yet another
uth05win inconsistency in the latter one. Once the Y coordinate gets close
enough to the target point, it actually speeds up twice as much as the
X coordinate would, whereas uth05win used the same speedup factors for
both. This might make uth05win a couple of frames slower in all boss
fights from Stage 3 on. Hard to measure though – and boss movement partly
depends on RNG anyway.
Next up: Shinki's background animations – which are actually the single
biggest source of position dependence left in TH05.
To finish this TH05 stretch, we've got a feature that's exclusive to TH05
for once! As the final memory management innovation in PC-98 Touhou, TH05
provides a single static (64 * 26)-byte array for storing up to 64
entities of a custom type, specific to a stage or boss portion.
(Edit (2023-05-29): This system actually debuted in
📝 TH04, where it was used for much simpler
entities.)
TH05 uses this array for
the Stage 2 star particles,
Alice's puppets,
the tip of curve ("jello") bullets,
Mai's snowballs and Yuki's fireballs,
Yumeko's swords,
and Shinki's 32×32 bullets,
which makes sense, given that only one of those will be active at any
given time.
On the surface, they all appear to share the same 26-byte structure, with
consistently sized fields, merely using its 5 generic fields for different
purposes. Looking closer though, there actually are differences in
the signedness of certain fields across the six types. uth05win chose to
declare them as entirely separate structures, and given all the semantic
differences (pixels vs. subpixels, regular vs. tiny master.lib sprites,
…), it made sense to do the same in ReC98. It quickly turned out to be the
only solution to meet my own standards of code readability.
Which blew this one up to two pushes once again… But now, modders can
trivially resize any of those structures without affecting the other types
within the original (64 * 26)-byte boundary, even without full position
independence. While you'd still have to reduce the type-specific
number of distinct entities if you made any structure larger, you
could also have more entities with fewer structure members.
As for the types themselves, they're full of redundancy once again – as
you might have already expected from seeing #4, #5, and #6 listed as
unrelated to each other. Those could have indeed been merged into a single
32×32 bullet type, supporting all the unique properties of #4
(destructible, with optional revenge bullets), #5 (optional number of
twirl animation frames before they begin to move) and #6 (delay clouds).
The *_add(), *_update(), and *_render()
functions of #5 and #6 could even already be completely
reverse-engineered from just applying the structure onto the ASM, with the
ones of #3 and #4 only needing one more RE push.
But perhaps the most interesting discovery here is in the curve bullets:
TH05 only renders every second one of the 17 nodes in a curve
bullet, yet hit-tests every single one of them. In practice, this is an
acceptable optimization though – you only start to notice jagged edges and
gaps between the fragments once their speed exceeds roughly 11 pixels per
second:
And that brings us to the last 20% of TH05 position independence! But
first, we'll have more cheap and fast TH01 progress.
Long time no see! And this is exactly why I've been procrastinating
bullets while there was still meaningful progress to be had in other parts
of TH04 and TH05: There was bound to be quite some complexity in this most
central piece of game logic, and so I couldn't possibly get to a
satisfying understanding in just one push.
Or in two, because their rendering involves another bunch of
micro-optimized functions adapted from master.lib.
Or in three, because we'd like to actually name all the bullet sprites,
since there are a number of sprite ID-related conditional branches. And
so, I was refining things I supposedly RE'd in the the commits from the
first push until the very end of the fourth.
When we talk about "bullets" in TH04 and TH05, we mean just two things:
the white 8×8 pellets, with a cap of 240 in TH04 and 180 in TH05, and any
16×16 sprites from MIKO16.BFT, with a cap of 200 in TH04 and
220 in TH05. These are by far the most common types of… err, "things the
player can collide with", and so ZUN provides a whole bunch of pre-made
motion, animation, and
n-way spread / ring / stack group options for those, which can be
selected by simply setting a few fields in the bullet template. All the
other "non-bullets" have to be fired and controlled individually.
Which is nothing new, since uth05win covered this part pretty accurately –
I don't think anyone could just make up these structure member
overloads. The interesting insights here all come from applying this
research to TH04, and figuring out its differences compared to TH05. The
most notable one there is in the default groups: TH05 allows you to add
a stack
to any single bullet, n-way spread or ring, but TH04 only lets you create
stacks separately from n-way spreads and rings, and thus gets by with
fewer fields in its bullet template structure. On the other hand, TH04 has
a separate "n-way spread with random angles, yet still aimed at the
player" group? Which seems to be unused, at least as far as
midbosses and bosses are concerned; can't say anything about stage enemies
yet.
In fact, TH05's larger bullet template structure illustrates that these
distinct group types actually are a rather redundant piece of
over-engineering. You can perfectly indicate any permutation of the basic
groups through just the stack bullet count (1 = no stack), spread bullet
count (1 = no spread), and spread delta angle (0 = ring instead of
spread). Add a 4-flag bitfield to cover the rest (aim to player, randomize
angle, randomize speed, force single bullet regardless of difficulty or
rank), and the result would be less redundant and even slightly
more capable.
Even those 4 pushes didn't quite finish all of the bullet-related types,
stopping just shy of the most trivial and consistent enum that defines
special movement. This also left us in a
📝 TH03-like situation, in which we're still
a bit away from actually converting all this research into actual RE%. Oh
well, at least this got us way past 50% in overall position independence.
On to the second half! 🎉
For the next push though, we'll first have a quick detour to the remaining
C code of all the ZUN.COM binaries. Now that the
📝 TH04 and TH05 resident structures no
longer block those, -Tom- has requested TH05's
RES_KSO.COM to be covered in one of his outstanding pushes.
And since 32th System
recently RE'd TH03's resident structure, it makes sense to also review and
merge that, before decompiling all three remaining RES_*.COM
binaries in hopefully a single push. It might even get done faster than
that, in which case I'll then review and merge some more of
WindowsTiger's
research.