Slight change of plans, because we got instructions for
reliably reproducing the TH04 Kurumi Divide Error crash! Major thanks to
Colin Douglas Howell. With those, it also made sense to immediately look at
the crash in the Stage 4 Marisa fight as well. This way, I could release
both of the obligatory bugfix mods at the same time.
Especially since it turned out that I was wrong: Both crashes are entirely
unrelated to the custom entity structure that would have required PI-centric
progress. They are completely specific to Kurumi's and Marisa's
danmaku-pattern code, and really are two separate bugs
with no connection to each other. All of the necessary research nicely fit
into Arandui's 0.5 pushes, with no further deep understanding
required here.
But why were there still three weeks between Colin's message and this blog
post? DMCA distractions aside: There are no easy fixes this time, unlike
📝 back when I looked at the Stage 5 Yuuka crash.
Just like how division by zero is undefined in mathematics, it's also,
literally, undefined what should happen instead of these two
Divide error crashes. This means that any possible "fix" can
only ever be a fanfiction interpretation of the intentions behind ZUN's
code. The gameplay community should be aware of this, and
might decide to handle these cases differently. And if we
have to go into fanfiction territory to work around crashes in the
canon games, we'd better document what exactly we're fixing here and how, as
comprehensible as possible.
With that out of the way, let's look at Kurumi's crash first, since it's way
easier to grasp. This one is known to primarily happen to new players, and
it's easy to see why:
In one of the patterns in her third phase, Kurumi fires a series of 3
aimed rings from both edges of the playfield. By default (that is, on Normal
and with regular rank), these are 6-way rings.
6 happens to be quite a peculiar number here, due to how rings are
(manually) tuned based on the current "rank" value (playperf)
before being fired. The code, abbreviated for clarity:
Let's look at the range of possible playperf values per
difficulty level:
Easy
Normal
Hard
Lunatic
Extra
playperf_min
4
11
20
22
16
playperf_max
16
24
32
34
20
Edit (2022-05-24): This blog post initially had
26 instead of 16 for playperf_min for the Extra Stage. Thanks
to Popfan for pointing out that typo!
Reducing rank to its minimum on Easy mode will therefore result in a
0-ring after tuning.
To calculate the individual angles of each bullet in a ring, ZUN divides
360° (or, more correctly,
📝 0x100) by the total number of
bullets…
Boom, division by zero.
So, what should the workaround look like? Obviously, we want to modify
neither the default number of ring bullets nor the tuning algorithm – that
would change all other non-crashing variations of this pattern on other
difficulties and ranks, creating a fork of the original gameplay. Instead, I
came up with four possible workarounds that all seemed somewhat logical to
me:
Firing no bullet, i.e., interpreting 0-ring literally. This would
create the only constellation in which a call to the bullet group spawn
functions would not spawn at least one new bullet.
Firing a "1-ring", i.e., a single bullet. This would be consistent with
how the bullet spawn functions behave for "0-way" stack and spread
groups.
Firing a "∞-ring", i.e., 200 bullets, which is as much as the game's cap
on 16×16 bullets would allow. This would poke fun at the whole "division by
zero" idea… but given that we're still talking about Easy Mode (and
especially new players) here, it might be a tad too cruel. Certainly the
most trollish interpretation.
Triggering an immediate Game Over, exchanging the hard crash for a
softer and more controlled shutdown. Certainly the option that would be
closest to the behavior of the original games, and perhaps the only one to
be accepted in Serious, High-Level Play™.
As I was writing this post, it felt increasingly wrong for me to make this
decision. So I once again went to Twitter, where 56.3%
voted in favor of the 1-bullet option. Good that I asked! I myself was
more leaning towards the 0-bullet interpretation, which only got 28.7% of
the vote. Also interesting are the 2.3% in favor of the Game Over option but
I get it, low-rank Easy Mode isn't exactly the most competitive mode of
playing TH04.
There are reports of Kurumi crashing on higher difficulties as well, but I
could verify none of them. If they aren't fixed by this workaround, they're
caused by an entirely different bug that we have yet to discover.
Onto the Stage 4 Marisa crash then, which does in fact apply to all
difficulty levels. I was also wrong on this one – it's a hell of a lot more
intricate than being just a division by the number of on-screen bits.
Without having decompiled the entire fight, I can't give a completely
accurate picture of what happens there yet, but here's the rough idea:
Marisa uses different patterns, depending on whether at least one of her
bits is still alive, or all of them have been destroyed.
Destroying the last bit will immediately switch to the bit-less
counterpart of the current pattern.
The bits won't respawn before the pattern ended, which ensures that the
bit-less version is always shown in its entirety after being started or
switched into.
In two of the bit-less patterns, Marisa gradually moves to the point
reflection of her position at the start of the pattern across the playfield
coordinate of (192, 112), or (224, 128) on screen.
The velocity of this movement is determined by both her distance to that
point and the total amount of frames that this instance of the bit-less
pattern will last.
Since this frame amount is directly tied to the frame the player
destroyed the last bit on, it becomes a user-controlled variable. I think
you can see where this is going…
The last 12 frames of this duration, however, are always reserved for a
"braking phase", where Marisa's velocity is halved on each frame.
This part of the code only runs every 4 frames though. This expands the
time window for this crash to 4 frames, rather than just the two frames you
would expect from looking at the division itself.
Both of the broken patterns run for a maximum of 160 frames. Therefore,
the crash will occur when Marisa's last bit is destroyed between frame 152
and 155 inclusive. On these frames, the
last_frame_with_bits_alive variable is set to 148, which is the
crucial 12 duration frames away from the maximum of 160.
Interestingly enough, the calculated velocity is also only
applied every 4 frames, with Marisa actually staying still for the 3 frames
inbetween. As a result, she either moves
too slowly to ever actually reach the yellow point if the last bit
was destroyed early in the pattern (see destruction frames 68 or
112),
or way too quickly, and almost in a jerky, teleporting way (see
destruction frames 144 or 148).
Finally, as you may have already gathered from the formula: Destroying
the last bit between frame 156 and 160 inclusive results in
duration values of 8 or 4. These actually push Marisa
away from the intended point, as the divisor becomes negative.
tl;dr: "Game crashes if last bit destroyed within 4-frame window near end of
two patterns". For an informed decision on a new movement behavior for these
last 8 frames, we definitely need to know all the details behind the crash
though. Here's what I would interpret into the code:
Not moving at all, i.e., interpreting 0 as the middle ground between
positive and negative movement. This would also make sense because a
12-frame duration implies 100% of the movement to consist of
the braking phase – and Marisa wasn't moving before, after all.
Move at maximum speed, i.e., dividing by 1 rather than 0. Since the
movement duration is still 12 in this case, Marisa will immediately start
braking. In total, she will move exactly ¾ of the way from her initial
position to (192, 112) within the 8 frames before the pattern
ends.
Directly warping to (192, 112) on frame 0, and to the
point-reflected target on 4, respectively. This "emulates" the division by
zero by moving Marisa at infinite speed to the exact two points indicated by
the velocity formula. It also fits nicely into the 8 frames we have to fill
here. Sure, Marisa can't reach these points at any other duration, but why
shouldn't she be able to, with infinite speed? Then again, if Marisa
is far away enough from (192, 112), this workaround would warp her
across the entire playfield. Can Marisa teleport according to lore? I
have no idea…
Triggering an immediate Game O– hell no, this is the Stage 4 boss,
people already hate losing runs to this bug!
Asking Twitter worked great for the Kurumi workaround, so let's do it again!
Gotta attach a screenshot of an earlier draft of this blog post though,
since this stuff is impossible to explain in tweets…
…and it went
through the roof, becoming the most successful ReC98 tweet so far?!
Apparently, y'all really like to just look at descriptions of overly complex
bugs that I'd consider way beyond the typical attention span that can be
expected from Twitter. Unfortunately, all those tweet impressions didn't
quite translate into poll turnout. The results
were pretty evenly split between 1) and 2), with option 1) just coming out
slightly ahead at 49.1%, compared to 41.5% of option 2).
(And yes, I only noticed after creating the poll that warping to both the
green and yellow points made more sense than warping to just one of the two.
Let's hope that this additional variant wouldn't have shifted the results
too much. Both warp options only got 9.4% of the vote after all, and no one
else came up with the idea either. In the end,
you can always merge together your preferred combination of workarounds from
the Git branches linked below.)
So here you go: The new definitive version of TH04, containing not only the
community-chosen Kurumi and Stage 4 Marisa workaround variant, but also the
📝 No-EMS bugfix from last year.
Edit (2022-05-31): This package is outdated, 📝 the current version is here!2022-04-18-community-choice-fixes.zip
Oh, and let's also add spaztron64's TH03 GDC clock fix
from 2019 because why not. This binary was built from the community_choice_fixes
branch, and you can find the code for all the individual workarounds on
these branches:
Again, because it can't be stated often enough: These fixes are
fanfiction. The gameplay community should be aware of
this, and might decide to handle these cases differently.
With all of that taking way more time to evaluate and document, this
research really had to become part of a proper push, instead of just being
covered in the quick non-push blog post I initially intended. With ½ of a
push left at the end, TH05's Stage 1-5 boss background rendering functions
fit in perfectly there. If you wonder how these static backdrop images even
need any boss-specific code to begin with, you're right – it's basically the
same function copy-pasted 4 times, differing only in the backdrop image
coordinates and some other inconsequential details.
Only Sara receives a nice variation of the typical
📝 blocky entrance animation: The usually
opaque bitmap data from ST00.BB is instead used as a transition
mask from stage tiles to the backdrop image, by making clever use of the
tile invalidation system:
TH04 uses the same effect a bit more frequently, for its first three bosses.
Next up: Shinki, for real this time! I've already managed to decompile 10 of
her 11 danmaku patterns within a little more than one push – and yes,
that one is included in there. Looks like I've slightly
overestimated the amount of work required for TH04's and TH05's bosses…
EMS memory! The
infamous stopgap measure between the 640 KiB ("ought to be enough for
everyone") of conventional
memory offered by DOS from the very beginning, and the later XMS standard for
accessing all the rest of memory up to 4 GiB in the x86 Protected Mode. With
an optionally active EMS driver, TH04 and TH05 will make use of EMS memory
to preload a bunch of situational .CDG images at the beginning of
MAIN.EXE:
The "eye catch" game title image, shown while stages are loaded
The character-specific background image, shown while bombing
The player character dialog portraits
TH05 additionally stores the boss portraits there, preloading them
at the beginning of each stage. (TH04 instead keeps them in conventional
memory during the entire stage.)
Once these images are needed, they can then be copied into conventional
memory and accessed as usual.
Uh… wait, copied? It certainly would have been possible to map EMS
memory to a regular 16-bit Real Mode segment for direct access,
bank-switching out rarely used system or peripheral memory in exchange for
the EMS data. However, master.lib doesn't expose this functionality, and
only provides functions for copying data from EMS to regular memory and vice
versa.
But even that still makes EMS an excellent fit for the large image files
it's used for, as it's possible to directly copy their pixel data from EMS
to VRAM. (Yes, I tried!) Well… would, because ZUN doesn't do
that either, and always naively copies the images to newly allocated
conventional memory first. In essence, this dumbs down EMS into just another
layer of the memory hierarchy, inserted between conventional memory and
disk: Not quite as slow as disk, but still requiring that
memcpy() to retrieve the data. Most importantly though: Using
EMS in this way does not increase the total amount of memory
simultaneously accessible to the game. After all, some other data will have
to be freed from conventional memory to make room for the newly loaded data.
The most idiomatic way to define the game-specific layout of the EMS area
would be either a struct or an enum.
Unfortunately, the total size of all these images exceeds the range of a
16-bit value, and Turbo C++ 4.0J supports neither 32-bit enums
(which are silently degraded to 16-bit) nor 32-bit structs
(which simply don't compile). That still leaves raw compile-time constants
though, you only have to manually define the offset to each image in terms
of the size of its predecessor. But instead of doing that, ZUN just placed
each image at a nice round decimal offset, each slightly larger than the
actual memory required by the previous image, just to make sure that
everything fits. This results not only in quite
a bit of unnecessary padding, but also in technically the single
biggest amount of "wasted" memory in PC-98 Touhou: Out of the 180,000 (TH04)
and 320,000 (TH05) EMS bytes requested, the game only uses 135,552 (TH04)
and 175,904 (TH05) bytes. But hey, it's EMS, so who cares, right? Out of all
the opportunities to take shortcuts during development, this is among the
most acceptable ones. Any actual PC-98 model that could run these two games
comes with plenty of memory for this to not turn into an actual issue.
On to the EMS-using functions themselves, which are the definition of
"cross-cutting concerns". Most of these have a fallback path for the non-EMS
case, and keep the loaded .CDG images in memory if they are immediately
needed. Which totally makes sense, but also makes it difficult to find names
that reflect all the global state changed by these functions. Every one of
these is also just called from a single place, so inlining
them would have saved me a lot of naming and documentation trouble
there.
The TH04 version of the EMS allocation code was actually displayed on ZUN's monitor in the
2010 MAG・ネット documentary; WindowsTiger already transcribed the low-quality video image
in 2019. By 2015 ReC98 standards, I would have just run with that, but
the current project goal is to write better code than ZUN, so I didn't. 😛
We sure ain't going to use magic numbers for EMS offsets.
The dialog init and exit code then is completely different in both games,
yet equally cross-cutting. TH05 goes even further in saving conventional
memory, loading each individual player or boss portrait into a single .CDG
slot immediately before blitting it to VRAM and freeing the pixel data
again. People who play TH05 without an active EMS driver are surely going to
enjoy the hard drive access lag between each portrait change…
TH04, on the other hand, also abuses the dialog
exit function to preload the Mugetsu defeat / Gengetsu entrance and
Gengetsu defeat portraits, using a static variable to track how often the
function has been called during the Extra Stage… who needs function
parameters anyway, right?
This is also the function in which TH04 infamously crashes after the Stage 5
pre-boss dialog when playing with Reimu and without any active EMS driver.
That crash is what motivated this look into the games' EMS usage… but the
code looks perfectly fine? Oh well, guess the crash is not related to EMS
then. Next u–
OK, of course I can't leave it like that. Everyone is expecting a fix now,
and I still got half of a push left over after decompiling the regular EMS
code. Also, I've now RE'd every function that could possibly be involved in
the crash, and this is very likely to be the last time I'll be looking at
them.
Turns out that the bug has little to do with EMS, and everything to do with
ZUN limiting the amount of conventional RAM that TH04's
MAIN.EXE is allowed to use, and then slightly miscalculating
this upper limit. Playing Stage 5 with Reimu is the most asset-intensive
configuration in this game, due to the combination of
6 player portraits (Marisa has only 5), at 128×128 pixels each
a 288×256 background for the boss fight, tied in size only with the
ones in the Extra Stage
the additional 96×80 image for the vertically scrolling stars during
the stage, wastefully stored as 4 bitplanes rather than a single one.
This image is never freed, not even at the end of the stage.
Remove any single one of the above points, and this crash would have never
occurred. But with all of them combined, the total amount of memory consumed
by TH04's MAIN.EXE just barely exceeds ZUN's limit of 320,000
bytes, by no more than 3,840 bytes, the size of the star image.
But wait: As we established earlier, EMS does nothing to reduce the amount
of conventional memory used by the game. In fact, if you disabled TH04's EMS
handling, you'd still get this crash even if you are running an EMS
driver and loaded DOS into the High Memory Area to free up as much
conventional RAM as possible. How can EMS then prevent this crash in the
first place?
The answer: It's only because ZUN's usage of EMS bypasses the need to load
the cached images back out of the XOR-encrypted 東方幻想.郷
packfile. Leaving aside the general
stupidity of any game data file encryption*, master.lib's decryption
implementation is also quite wasteful: It uses a separate buffer that
receives fixed-size chunks of the file, before decrypting every individual
byte and copying it to its intended destination buffer. That really
resembles the typical slowness of a C fread() implementation
more than it does the highly optimized ASM code that master.lib purports to
be… And how large is this well-hidden decryption buffer? 4 KiB.
So, looking back at the game, here is what happens once the Stage 5
pre-battle dialog ends:
Reimu's bomb background image, which was previously freed to make space
for her dialog portraits, has to be loaded back into conventional memory
from disk
BB0.CDG is found inside the 東方幻想.郷
packfile
file_ropen() ends up allocating a 4 KiB buffer for the
encrypted packfile data, getting us the decisive ~4 KiB closer to the memory
limit
The .CDG loader tries to allocate 52 608 contiguous bytes for the
pixel data of Reimu's bomb image
This would exceed the memory limit, so hmem_allocbyte()
fails and returns a nullptr
ZUN doesn't check for this case (as usual)
The pixel data is loaded to address 0000:0000,
overwriting the Interrupt Vector Table and whatever comes after
The game crashes
The 4 KiB encryption buffer would only be freed by the corresponding
file_close() call, which of course never happens because the
game crashes before it gets there. At one point, I really did suspect the
cause to be some kind of memory leak or fragmentation inside master.lib,
which would have been quite delightful to fix.
Instead, the most straightforward fix here is to bump up that memory limit
by at least 4 KiB. Certainly easier than squeezing in a
cdg_free() call for the star image before the pre-boss dialog
without breaking position dependence.
Or, even better, let's nuke all these memory limits from orbit
because they make little sense to begin with, and fix every other potential
out-of-memory crash that modders would encounter when adding enough data to
any of the 4 games that impose such limits on themselves. Unless you want to
launch other binaries (which need to do their own memory allocations) after
launching the game, there's really no reason to restrict the amount of
memory available to a DOS process. Heck, whenever DOS creates a new one, it
assigns all remaining free memory by default anyway.
Removing the memory limits also removes one of ZUN's few error checks, which
end up quitting the game if there isn't at least a given maximum amount of
conventional RAM available. While it might be tempting to reserve enough
memory at the beginning of execution and then never check any allocation for
a potential failure, that's exactly where something like TH04's crash
comes from.
This game is also still running on DOS, where such an initial allocation
failure is very unlikely to happen – no one fills close to half of
conventional RAM with TSRs and then tries running one of these games. It
might have been useful to detect systems with less than 640 KiB of
actual, physical RAM, but none of the PC-98 models with that little amount
of memory are fast enough to run these games to begin with. How ironic… a
place where ZUN actually added an error check, and then it's mostly
pointless.
Here's an archive that contains both fix variants, just in case. These were
compiled from the th04_noems_crash_fix
and mem_assign_all
branches, and contain as little code changes as possible. Edit (2022-04-18): For TH04, you probably want to download
the 📝 community choice fix package instead,
which contains this fix along with other workarounds for the Divide
error crashes.
2021-11-29-Memory-limit-fixes.zip
So yeah, quite a complex bug, leaving no time for the TH03 scorefile format
research after all. Next up: Raising prices.
Alright, the score popup numbers shown when collecting items or defeating
(mid)bosses. The second-to-last remaining big entity type in TH05… with
quite some PI false positives in the memory range occupied by its data.
Good thing I still got some outstanding generic RE pushes that haven't
been claimed for anything more specific in over a month! These
conveniently allowed me to RE most of these functions right away, the
right way.
Most of the false positives were boss HP values, passed to a "boss phase
end" function which sets the HP value at which the next phase should end.
Stage 6 Yuuka, Mugetsu, and EX-Alice have their own copies of this
function, in which they also reset certain boss-specific global variables.
Since I always like to cover all varieties of such duplicated functions at
once, it made sense to reverse-engineer all the involved variables while I
was at it… and that's why this was exactly the right time to cover the
implementation details of Stage 6 Yuuka's parasol and vanishing animations
in TH04.
With still a bit of time left in that RE push afterwards, I could also
start looking into some of the smaller functions that didn't quite fit
into other pushes. The most notable one there was a simple function that
aims from any point to the current player position. Which actually only
became a separate function in TH05, probably since it's called 27 times in
total. That's 27 places no longer being blocked from further RE progress.
WindowsTiger already
did most of the work for the score popup numbers in January, which meant
that I only had to review it and bring it up to ReC98's current coding
styles and standards. This one turned out to be one of those rare features
whose TH05 implementation is significantly less insane than the
TH04 one. Both games lazily redraw only the tiles of the stage background
that were drawn over in the previous frame, and try their best to minimize
the amount of tiles to be redrawn in this way. For these popup numbers,
this involves calculating the on-screen width, based on the exact number
of digits in the point value. TH04 calculates this width every frame
during the rendering function, and even resorts to setting that field
through the digit iteration pointer via self-modifying code… yup. TH05, on
the other hand, simply calculates the width once when spawning a new popup
number, during the conversion of the point value to
binary-coded
decimal. The "×2" multiplier suffix being removed in TH05 certainly
also helped in simplifying that feature in this game.
And that's ⅓ of TH05 reverse-engineered! Next up, one more TH05 PI push,
in which the stage enemies hopefully finish all the big entity types.
Maybe it will also be accompanied by another RE push? In any case, that
will be the last piece of TH05 progress for quite some time. The next TH01
stretch will consist of 6 pushes at the very least, and I currently have
no idea of how much time I can spend on ReC98 a month from now…
Here we go, new C code! …eh, it will still take a bit to really get
decompilation going at the speeds I was hoping for. Especially with the
sheer amount of stuff that is set in the first few significant
functions we actually can decompile, which now all has to be
correctly declared in the C world. Turns out I spent the last 2 years
screwing up the case of exported functions, and even some of their names,
so that it didn't actually reflect their calling convention… yup. That's
just the stuff you tend to forget while it doesn't matter.
To make up for that, I decided to research whether we can make use of some
C++ features to improve code readability after all. Previously, it seemed
that TH01 was the only game that included any C++ code, whereas TH02 and
later seemed to be 100% C and ASM. However, during the development of the
soon to be released new build system, I noticed that even this old
compiler from the mid-90's, infamous for prioritizing compile speeds over
all but the most trivial optimizations, was capable of quite surprising
levels of automatic inlining with class methods…
…leading the research to culminate in the mindblow that is
9d121c7 – yes, we can use C++ class methods
and operator overloading to make the code more readable, while still
generating the same code than if we had just used C and preprocessor
macros.
Looks like there's now the potential for a few pull requests from outside
devs that apply C++ features to improve the legibility of previously
decompiled and terribly macro-ridden code. So, if anyone wants to help
without spending money…