And then, the supposed boilerplate code revealed yet another confusing issue
that quickly forced me back to serial work, leading to no parallel progress
made with Shuusou Gyoku after all. 🥲 The list of functions I put together
for the first ½ of this push seemed so boring at first, and I was so sure
that there was almost nothing I could possibly talk about:
TH02's gaiji animations at the start and end of each stage, resembling
opening and closing window blind slats. ZUN should have maybe not defined
the regular whitespace gaiji as what's technically the last frame of the
closing animation, but that's a minor nitpick. Nothing special there
otherwise.
The remaining spawn functions for TH04's and TH05's gather circles. The
only dumb antic there is the way ZUN initializes the template for bullets
fired at the end of the animation, featuring ASM instructions that are
equivalent to what Turbo C++ 4.0J generates for the __memcpy__
intrinsic, but show up in a different order. Which means that they must have
been handwritten. I already figured that out in 2022
though, so this was just more of the same.
EX-Alice's override for the game's main 16×16 sprite sheet, loaded
during her dialog script. More of a naming and consistency challenge, if
anything.
The regular version of TH05's big 16×16 sprite sheet.
EX-Alice's variant of TH05's big 16×16 sprite sheet.
The rendering function for TH04's Stage 4 midboss, which seems to
feature the same premature clipping quirk we've seen for
📝 TH05's Stage 5 midboss, 7 months ago?
The rendering function for the big 48×48 explosion sprite, which also
features the same clipping quirk?
That's three instances of ZUN removing sprites way earlier than you'd want
to, intentionally deciding against those sprites flying smoothly in and out
of the playfield. Clearly, there has to be a system and a reason behind it.
Turns out that it can be almost completely blamed on master.lib. None of the
super_*() sprite blitting functions can clip the rendered
sprite to the edges of VRAM, and much less to the custom playfield rectangle
we would actually want here. This is exactly the wrong choice to make for a
game engine: Not only is the game developer now stuck with either rendering
the sprite in full or not at all, but they're also left with the burden of
manually calculating when not to display a sprite.
However, strictly limiting the top-left screen-space coordinate to
(0, 0) and the bottom-right one to (640, 400) would actually
stop rendering some of the sprites much earlier than the clipping conditions
we encounter in these games. So what's going on there?
The answer is a combination of playfield borders, hardware scrolling, and
master.lib needing to provide at least some help to support the
latter. Hardware scrolling on PC-98 works by dividing VRAM into two vertical
partitions along the Y-axis and telling the GDC to display one of them at
the top of the screen and the other one below. The contents of VRAM remain
unmodified throughout, which raises the interesting question of how to deal
with sprites that reach the vertical edges of VRAM. If the top VRAM row that
starts at offset 0x0000 ends up being displayed below
the bottom row of VRAM that starts at offset 0x7CB0 for 399 of
the 400 possible scrolling positions, wouldn't we then need to vertically
wrap most of the rendered sprites?
For this reason, master.lib provides the super_roll_*()
functions, which unconditionally perform exactly this vertical wrapping. But
this creates a new problem: If these functions still can't clip, and don't
even know which VRAM rows currently correspond to the top and bottom row of
the screen (since master.lib's graph_scrollup() function
doesn't retain this information), won't we also see sprites wrapping around
the actual edges of the screen? That's something we certainly
wouldn't want in a vertically scrolling game…
The answer is yes, and master.lib offers no solution for this issue. But
this is where the playfield borders come in, and helpfully cover 16 pixels
at the top and 16 pixels at the bottom of the screen. As a result, they can
hide at least 32 pixels of potentially wrapped sprite pixels below them:
•
The earliest possible frame that TH05 can start rendering the Stage 5
midboss on. Hiding the text layer reveals how master.lib did in fact
"blindly" render the top part of her sprite to the bottom of the
playfield. That's where her sprite starts before it is correctly
wrapped around to the top of VRAM.
If we scrolled VRAM by another 200 pixels (and faked an equally shifted
TRAM for demonstration purposes), we get an equally valid game scene
that points out why a vertically scrolling PC-98 game must wrap all sprites
at the vertical edges of VRAM to begin with.
Also, note how the HP bar has filled up quite a bit before the midboss can
actually appear on screen.
And that's how the lowest possible top Y coordinate for sprites blitted
using the master.lib super_roll_*() functions during the
scrolling portions of TH02, TH04, and TH05 is not 0, but -16. Any lower, and
you would actually see some of the sprite's upper pixels at the
bottom of the playfield, as there are no more opaque black text cells to
cover them. Theoretically, you could lower this number for
some animation frames that start with multiple rows of transparent
pixels, but I thankfully haven't found any instance of ZUN using such a
hack. So far, at least…
Visualized like that, it all looks quite simple and logical, but for days, I
did not realize that these sprites were rendered to a scrolling VRAM.
This led to a much more complicated initial explanation involving the
invisible extra space of VRAM between offsets 0x7D00 and
0x7FFF that effectively grant a hidden additional 9.6 lines
below the playfield. Or even above, since PC-98 hardware ignores the highest
bit of any offset into a VRAM bitplane segment
(& 0x7FFF), which prevents blitting operations from
accidentally reaching into a different bitplane. Together with the
aforementioned rows of transparent pixels at the top of these midboss
sprites, the math would have almost worked out exactly.
The need for manual clipping also applies to the X-axis. Due to the lack of
scrolling in this dimension, the boundaries there are much more
straightforward though. The minimum left coordinate of a sprite can't fall
below 0 because any smaller coordinate would wrap around into the
📝 tile source area and overwrite some of the
pixels there, which we obviously don't want to re-blit every frame.
Similarly, the right coordinate must not extend into the HUD, which starts
at 448 pixels.
The last part might be surprising if you aren't familiar with the PC-98 text
chip. Contrary to the CGA and VGA text modes of IBM-compatibles, PC-98 text
cells can only use a single color for either their foreground or
background, with the other pixels being transparent and always revealing the
pixels in VRAM below. If you look closely at the HUD in the images above,
you can see how the background of cells with gaiji glyphs is slightly
brighter (◼ #100) than the opaque black
cells (◼ #000) surrounding them. This
rather custom color clearly implies that those pixels must have been
rendered by the graphics GDC. If any other sprite was rendered below the
HUD, you would equally see it below the glyphs.
So in the end, I did find the clear and logical system I was looking for,
and managed to reduce the new clipping conditions down to a
set of basic rules for each edge. Unfortunately, we also need a second
macro for each edge to differentiate between sprites that are smaller or
larger than the playfield border, which is treated as either 32×32 (for
super_roll_*()) or 32×16 (for non-"rolling"
super_*() functions). Since smaller sprites can be fully
contained within this border, the games can stop rendering them as soon as
their bottom-right coordinate is no longer seen within the playfield, by
comparing against the clipping boundaries with <= and
>=. For example, a 16×16 sprite would be completely
invisible once it reaches (16, 0), so it would still be rendered at
(17, 1). A larger sprite during the scrolling part of a stage, like,
say, the 64×64 midbosses, would still be rendered if their top-left
coordinate was (0, -16), so ZUN used < and
> comparisons to at least get an additional pixel before
having to stop rendering such a sprite. Turbo C++ 4.0J sadly can't
constant-fold away such a difference in comparison operators.
And for the most part, ZUN did follow this system consistently. Except for,
of course, the typical mistakes you make when faced with such manual
decisions, like how he treated TH04's Stage 4 midboss as a "small" sprite
below 32×32 pixels (it's 64×64), losing that precious one extra pixel. Or
how the entire rendering code for the 48×48 boss explosion sprite pretends
that it's actually 64×64 pixels large, which causes even the initial
transformation into screen space to be misaligned from the get-go.
But these are additional bugs on top of the single
one that led to all this research.
Because that's what this is, a bug. 🐞 Every resulting pixel boundary is a
systematic result of master.lib's unfortunate lack of clipping. It's as much
of a bug as TH01's byte-aligned rendering of entities whose internal
position is not byte-aligned. In both cases, the entities are alive,
simulated, and partake in collision detection, but their rendered appearance
doesn't accurately reflect their internal position.
Initially, I classified
📝 the sudden pop-in of TH05's Stage 5 midboss
as a quirk because we had no conclusive evidence that this wasn't
intentional, but now we do. There have been multiple explanations for why
ZUN put borders around the playfield, but master.lib's lack of sprite
clipping might be the biggest reason.
And just like byte-aligned rendering, the clipping conditions can easily be
removed when porting the game away from PC-98 hardware. That's also what
uth05win chose to do: By using OpenGL and not having to rely on hardware
scrolling, it can simply place every sprite as a textured quad at its exact
position in screen space, and then draw the black playfield borders on top
in the end to clip everything in a single draw call. This way, the Stage 5
midboss can smoothly fly into the playfield, just as defined by its movement
code:
The entire smooth Stage 5 midboss entrance animation as shown in
uth05win. If the simultaneous appearance of the Enemy!! label
doesn't lend further proof to this having been ZUN's actual intention, I
don't know what will.
Meanwhile, I designed the interface of the 📝 generic blitter used in the TH01 Anniversary Edition entirely around
clipping the blitted sprite at any explicit combination of VRAM edges. This
was nothing I tacked on in the end, but a core aspect that informed the
architecture of the code from the very beginning. You really want to
have one and only one place where sprite clipping is done right – and
only once per sprite, regardless of how many bitplanes you want to write to.
Which brings us to the goal for the final ¼ of this push went. I thought I
was going to start cleaning up the
📝 player movement and rendering code, but
that turned out too complicated for that amount of time – especially if you
want to start with just cleanup, preserving all original bugs for the
time being.
Fixing and smoothening player and Orb movement would be the next big task in
Anniversary Edition development, needing about 3 pushes. It would start with
more performance research into runtime-shifting of larger sprites, followed
by extending my generic blitter according to the results, writing new
optimized loaders for the original image formats, and finally rewriting all
rendering code accordingly. With that code in place, we can then start
cleaning up and fixing the unique code for each boss, one by one.
Until that's funded, the code still contains a few smaller and easier pieces
of code that are equally related to rendering bugs, but could be dealt with
in a more incremental way. Line rendering is one of those, and first needs
some refactoring of every call site, including
📝 the rotating squares around Mima and
📝 YuugenMagan's pentagram. So far, I managed
to remove another 1,360 bytes from the binary within this final ¼ of a push,
but there's still quite a bit to do in that regard.
This is the perfect kind of feature for smaller (micro-)transactions. Which
means that we've now got meaningful TH01 code cleanup and Anniversary
Edition subtasks at every price range, no matter whether you want to invest
a lot or just a little into this goal.
If you can, because Ember2528 revealed the plan behind
his Shuusou Gyoku contributions: A full-on Linux port of the game, which
will be receiving all the funding it needs to happen. 🐧 Next up, therefore:
Turning this into my main project within ReC98 for the next couple of
months, and getting started by shipping the long-awaited first step towards
that goal.
I've raised the cap to avoid the potential of rounding errors, which might
prevent the last needed Shuusou Gyoku push from being correctly funded. I
already had to pick the larger one of the two pending TH02 transactions for
this push, because we would have mathematically ended up
1/25500 short of a full push with the smaller
transaction. And if I'm already at it, I might
as well free up enough capacity to potentially ship the complete OpenGL
backend in a single delivery, which is currently estimated to cost 7 pushes
in total.
OK, let's decompile TH02's HUD code first, gain a solid understanding of how
increasing the score works, and then look at the item system of this game.
Should be no big deal, no surprises expected, let's go!
…Yeah, right, that's never how things end up in ReC98 land.
And so, we get the usual host of newly discovered
oddities in addition to the expected insights into the item mechanics. Let's
start with the latter:
Some regular stage enemies appear to randomly drop either or items. In reality, there is
very little randomness at play here: These items are picked from a
hardcoded, repeating ring of 10 items
(𝄆 𝄇), and the only source of
randomness is the initial position within this ring, which changes at
the beginning of every stage. ZUN further increased the illusion of
randomness by only dropping such a semi-random item for every
4th defeated enemy that is coded to drop one, and also having
enemies that drop fixed, non-random items. I'd say it's a decent way of
ensuring both randomness and balance.
There's a 1/512 chance for such a semi-random
item drop to turn into a item instead –
which translates to 1/2048 enemies due to the
fixed drop rate.
Edit (2023-06-11): These are the only ways that items can randomly drop in this game. All other drops, including
any items, are scripted and deterministic.
After using a continue (both after a Game Over, or after manually
choosing to do so through the Pause menu for whatever reason), the
next
(Stage number + 1) semi-random item
drops are turned into items instead.
Items can contribute up to 25 points to the skill value and subsequent
rating (あなたの腕前) on the final verdict
screen. Doing well at item collection first increases a separate
collect_skill value:
Item
Collection condition
collect_skill change
below max power
+1
at or above max power
+2
value == 51,200
+8
value ≥20,000 and <51,200
+4
value ≥10,000 and <20,000
+2
value <10,000
+1
with 5 bombs in stock
+16
Note, again, the lack of anything involving
items. At the maximum of 5 lives, the item spawn function transforms
them into bomb items anyway. It is possible though to gain
the 5th life by reaching one of the extend scores while a
item is still on screen; in that case,
collecting the 1-up has no effect at all.
Every 32 collect_skill points will then raise the
item_skill by 1, whereas every 16 dropped items will lower
it by 1. Before launching into the ending sequence,
item_skill is clamped to the [0; 25] range and
added to the other skill-relevant metrics we're going to look at in
future pushes.
When losing a life, the game will drop a single
and 4 randomly picked or items in a random order
around Reimu's position. Contrary to an
unsourced Touhou Wiki edit from 2009, each of the 4 does have an
equal and independent chance of being either a
or item.
Finally, and perhaps most
interestingly, item values! These are
determined by the top Y coordinate of an item during the frame it is
collected on. The maximum value of 51,200 points applies to the top 48
pixels of the playfield, and drops off as soon as an item falls below
that line. For the rest of the playfield, point items then use a formula
of (28,000 - (top Y coordinate of item in
screen space × 70)):
Point items and their collection value in TH02. The numbers
correspond to items that are collected while their top Y coordinate
matches the line they are directly placed on. The upper
item in the image would therefore give
23,450 points if the player collected it at that specific
position.
Reimu collects any item whose 16×16 bounding box lies fully within
the red 48×40 hitbox. Note that
the box isn't cut off in this specific case: At Reimu's lowest
possible position on the playfield, the lowest 8 pixels of her
sprite are clipped, but the item hitbox still happens to end exactly
at the bottom of the playfield. Since an item's Y velocity
accelerates on every frame, it's entirely possible to collect a
point item at the lowest value of 2,240 points, on the exact frame
before it falls below the collection hitbox.
Onto score tracking then, which only took a single commit to raise another
big research question. It's widely known that TH02 grants extra lives upon
reaching a score of 1, 2, 3, 5, or 8 million points. But what hasn't been
documented is the fact that the game does not stop at the end of the
hardcoded extend score array. ZUN merely ends it with a sentinel value of
999,999,990 points, but if the score ever increased beyond this value, the
game will interpret adjacent memory as signed 32-bit score values and
continue giving out extra lives based on whatever thresholds it ends up
finding there. Since the following bytes happen to turn into a negative
number, the next extra life would be awarded right after gaining another 10
points at exactly 1,000,000,000 points, and the threshold after that would
be 11,114,905,600 points. Without an explicit counterstop, the number of
score-based extra lives is theoretically unlimited, and would even continue
after the signed 32-bit value overflowed into the negative range. Although
we certainly have bigger problems once scores ever reach that point…
That said, it seems impossible that any of this could ever happen
legitimately. The current high scores of 42,942,800 points on
Lunatic and 42,603,800 points on
Extra don't even reach 1/20 of ZUN's sentinel
value. Without either a graze or a bullet cancel system, the scoring
potential in this game is fairly limited, making it unlikely for high scores
to ever increase by that additional order of magnitude to end up anywhere
near the 1 billion mark.
But can we really be sure? Is this a landmine because it's impossible
to ever reach such high scores, or is it a quirk because these extends
could be observed under rare conditions, perhaps as the result of
other quirks? And if it's the latter, how many of these adjacent bytes do we
need to preserve in cleaned-up versions and ports? We'd pretty much need to
know the upper bound of high scores within the original stage and boss
scripts to tell. This value should be rather easy to calculate in a
game with such a simple scoring system, but doing that only makes sense
after we RE'd all scoring-related code and could efficiently run such
simulations. It's definitely something we'd need to look at before working
on this game's debloated version in the far future, which is
when the difference between quirks and landmines will become relevant.
Still, all that uncertainty just because ZUN didn't restrict a loop to the
size of the extend threshold array…
TH02 marks a pivotal point in how the PC-98 Touhou games handle the current
score. It's the last game to use a 32-bit variable before the later games
would regrettably start using arrays of binary-coded
decimals. More importantly though, TH02 is also the first game to
introduce the delayed score counting animation, where the displayed score
intentionally lags behind and gradually counts towards the real one over
multiple frames. This could be implemented in one of two ways:
Keep the displayed score as a separate variable inside the presentation
layer, and let it gradually count up to the real score value passed in from
the logic layer
Burden the game logic with this presentation detail, and split the score
into two variables: One for the displayed score, and another for the
delta between that score and the actual one. Newly gained points are
first added to the delta variable, and then gradually subtracted from there
and added to the real score before being displayed.
And by now, we can all tell which option ZUN picked for the rest of the
PC-98 games, even if you don't remember
📝 me mentioning this system last year.
📝 Once again, TH02 immortalized ZUN's initial
attempt at the concept, which lacks the abstraction boundaries you'd want
for managing this one piece of state across two variables, and messes up the
abstractions it does have. In addition to the regular score
transfer/render function, the codebase therefore has
a function that transfers the current delta to the score immediately,
but does not re-render the HUD, and
a function that adds the delta to the score and re-renders the HUD, but
does not reset the delta.
And – you guessed it – I wouldn't have mentioned any of this if it didn't
result in one bug and one quirk in TH02. The bug resulting from 1) is pretty
minor: The function is called when losing a life, and simply stops any
active score-counting animation at the value rendered on the frame where the
player got hit. This one is only a rendering issue – no points are lost, and
you just need to gain 10 more for the rendered value to jump back up to its
actual value. You'll probably never notice this one because you're likely
busy collecting the single spawned around Reimu
when losing a life, which always awards at least 10 points.
The quirk resulting from 2) is more intriguing though. Without a separate
reset of the score delta, the function effectively awards the current delta
value as a one-time point bonus, since the same delta will still be
regularly transferred to the score on further game frames.
This function is called at the start of every dialog sequence. However, TH02
stops running the regular game loop between the post-boss dialog and the
next stage where the delta is reset, so we can only observe this quirk for
the pre-boss sequences and the dialog before Mima's form change.
Unfortunately, it's not all too exploitable in either case: Each of the
pre-boss dialog sequences is preceded by an ungrazeable pellet pattern and
followed by multiple seconds of flying over an empty playfield with zero
scoring opportunities. By the time the sequence starts, the game will have
long transferred any big score delta from max-valued point items. It's
slightly better with Mima since you can at least shoot her and use a bomb to
keep the delta at a nonzero value, but without a health bar, there is little
indication of when the dialog starts, and it'd be long after Mima
gave out her last bonus items in any case.
But two of the bosses – that is, Rika, and the Five Magic Stones – are
scrolled onto the playfield as part of the stage script, and can also be hit
with player shots and bombs for a few seconds before their dialog starts.
While I'll only get to cover shot types and bomb damage within the next few
TH02 pushes, there is an obvious initial strategy for maximizing the effect
of this quirk: Spreading out the A-Type / Wide / High Mobility shot to land
as many hits as possible on all Five Magic Stones, while firing off a bomb.
Turns out that the infamous button-mashing mechanics of the
player shot are also more complicated than simply pressing and releasing the
Shot key at alternating frames. Even this result took way too many
takes.
Wow, a grand total of 1,750 extra points! Totally worth wasting a bomb for…
yeah, probably not. But at the very least, it's
something that a TAS score run would want to keep in mind. And all that just
because ZUN "forgot" a single score_delta = 0; assignment at
the end of one function…
And that brings TH02 over the 30% RE mark! Next up: 100% position
independence for TH04. If anyone wants to grab the
that have now been freed up in the cap: Any small Touhou-related task would
be perfect to round out that upcoming TH04 PI delivery.
So, TH02! Being the only game whose main binary hadn't seen any dedicated
attention ever, we get to start the TH02-related blog posts at the very
beginning with the most foundational pieces of code. The stage tile system
is the best place to start here: It not only blocks every entity that is
rendered on top of these tiles, but is curiously placed right next to
master.lib code in TH02, and would need to be separated out into its own
translation unit before we can do the same with all the master.lib
functions.
In late 2018, I already RE'd
📝 TH04's and TH05's stage tile implementation, but haven't properly documented it on this
blog yet, so this post is also going to include the details that are unique
to those games. On a high level, the stage tile system works identically in
all three games:
The tiles themselves are 16×16 pixels large, and a stage can use 100 of
them at the same time.
The optimal way of blitting tiles would involve VRAM-to-VRAM copies
within the same page using the EGC, and that's exactly what the games do.
All tiles are stored on both VRAM pages within the rightmost 64×400 pixels
of the screen just right next to the HUD, and you only don't see them
because the games cover the same area in text RAM with black cells:
The initial screen of TH02's Stage 1, with the tile source
area uncovered by filling the same area in text RAM with transparent
cells instead of black ones. In TH02, this also reveals how the tile
area ends with a bunch of glitch tiles, tinted blue in the image. These
are the result of ZUN unconditionally blitting 100 tile images every
time, regardless of how many are actually contained in an
.MPN file.
These glitch tiles are another good example of a ZUN
landmine. Their appearance is the result of reading heap memory
outside allocated boundaries, which can easily cause segmentation faults
when porting the game to a system with virtual memory. Therefore, these
would not just be removed in this game's Anniversary Edition, but on the
more conservative debloated branch as well. Since the game
never uses these tiles and you can't observe them unless you manipulate
text RAM from outside the confines of the game, it's not a bug
according to our definition.
To reduce the memory required for a map, tiles are arranged into fixed
vertical sections of a game-specific constant size.
The 6 24×8-tile sections defined in TH02's STAGE0.MAP, in
reverse order compared to how they're defined in the file. Note the
duplicated row at the top of the final section: The boss fight starts
once the game scrolled the last full row of tiles onto the top of the
screen, not the playfield. But since the PC-98 text chip
covers the top tile row of the screen with black cells, this final row
is never visible, which effectively reduces a map's final tile section
to 7 rows rather than 8.
The actual stage map then is simply a list of these tile sections,
ordered from the start/bottom to the top/end.
Any manipulation of specific tiles within the fixed tile sections has to
be hardcoded. An example can be found right in Stage 1, where the Shrine
Tank leaves track marks on the tiles it appears to drive over:
This video also shows off the two issues with Touhou's first-ever
midboss: The replaced tiles are rendered below the midboss
during their first 4 frames, and maybe ZUN should have stopped the
tile replacements one row before the timeout. The first one is
clearly a bug, but it's not so clear-cut with the second one. I'd
need to look at the code to tell for sure whether it's a quirk or a
bug.
The differences between the three games can best be summarized in a table:
TH02
TH04
TH05
Tile image file extension
.MPN
Tile section format
.MAP
Tile section order defined as part of
.DT1
.STD
Tile section index format
0-based ID
0-based ID × 2
Tile image index format
Index between 0 and 100, 1 byte
VRAM offset in tile source area, 2 bytes
Scroll speed control
Hardcoded
Part of the .STD format, defined per referenced tile
section
Redraw granularity
Full tiles (16×16)
Half tiles (16×8)
Rows per tile section
8
5
Maximum number of tile sections
16
32
Lowest number of tile sections used
5 (Stage 3 / Extra)
8 (Stage 6)
11 (Stage 2 / 4)
Highest number of tile sections used
13 (Stage 4)
19 (Extra)
24 (Stage 3)
Maximum length of a map
320 sections (static buffer)
256 sections (format limitation)
Shortest map
14 sections (Stage 5)
20 sections (Stage 5)
15 sections (Stage 2)
Longest map
143 sections (Stage 4)
95 sections (Stage 4)
40 sections (Stage 1 / 4 / Extra)
The most interesting part about stage tiles is probably the fact that some
of the .MAP files contain unused tile sections. 👀 Many
of these are empty, duplicates, or don't really make sense, but a few
are unique, fit naturally into their respective stage, and might have
been part of the map during development. In TH02, we can find three unused
sections in Stage 5:
The non-empty tile sections defined in TH02's STAGE4.MAP,
showing off three unused ones.
These unused tile sections are much more common in the later games though,
where we can find them in TH04's Stage 3, 4, and 5, and TH05's Stage 1, 2,
and 4. I'll document those once I get to finalize the tile rendering code of
these games, to leave some more content for that blog post. TH04/TH05 tile
code would be quite an effective investment of your money in general, as
most of it is identical across both games. Or how about going for a full-on
PC-98 Touhou map viewer and editor GUI?
Compared to TH04 and TH05, TH02's stage tile code definitely feels like ZUN
was just starting to understand how to pull off smooth vertical scrolling on
a PC-98. As such, it comes with a few inefficiencies and suboptimal
implementation choices:
The redraw flag for each tile is stored in a 24×25 bool
array that does nothing with 7 of the 8 bits.
During bombs and the Stage 4, 5, and Extra bosses, the game disables the
tile system to render more elaborate backgrounds, which require the
playfield to be flood-filled with a single color on every frame. ZUN uses
the GRCG's RMW mode rather than TDW mode for this, leaving almost half of
the potential performance on the table for no reason. Literally,
changing modes only involves changing a single constant.
The scroll speed could theoretically be changed at any time. However,
the function that scrolls in new stage tiles can only ever blit part of a
single tile row during every call, so it's up to the caller to ensure
that scrolling always ends up on an exact 16-pixel boundary. TH02 avoids
this problem by keeping the scroll speed constant across a stage, using 2
pixels for Stage 4 and 1 pixel everywhere else.
Since the scroll speed is given in pixels, the slowest speed would be 1
pixel per frame. To allow the even slower speeds seen in the final game,
TH02 adds a separate scroll interval variable that only runs the
scroll function every 𝑛th frame, effectively adding a prescaler to the
scroll speed. In TH04 and TH05, the speed is specified as a Q12.4 value
instead, allowing true fractional speeds at any multiple of
1/16 pixels. This also necessitated a fixed algorithm
that correctly blits tile lines from two rows.
Finally, we've got a few inconsistencies in the way the code handles the
two VRAM pages, which cause a few unnecessary tiles to be rendered to just
one of the two pages. Mentioning that just in case someone tries to play
this game with a fully cleared text RAM and wonders where the flickering
tiles come from.
Even though this was ZUN's first attempt at scrolling tiles, he already saw
it fit to write most of the code in assembly. This was probably a reaction
to all of TH01's performance issues, and the frame rate reduction
workarounds he implemented to keep the game from slowing down too much in
busy places. "If TH01 was all C++ and slow, TH02 better contain more ASM
code, and then it will be fast, right?"
Another reason for going with ASM might be found in the kind of
documentation that may have been available to ZUN. Last year, the PC-98
community discovered and scanned two new game programming tutorial books
from 1991 (1, 2).
Their example code is not only entirely written in assembly, but restricts
itself to the bare minimum of x86 instructions that were available on the
8086 CPU used by the original PC-9801 model 9 years earlier. Such code is
not only suboptimal
on the 486, but can often be actually worse than what your C++
compiler would generate. TH02 is where the trend of bad hand-written ASM
code started, and it
📝 only intensified in ZUN's later games. So,
don't copy code from these books unless you absolutely want to target the
earlier 8086 and 286 models. Which,
📝 as we've gathered from the recent blitting benchmark results,
are not all too common among current real-hardware owners.
That said, all that ASM code really only impacts readability and
maintainability. Apart from the aforementioned issues, the algorithms
themselves are mostly fine – especially since most EGC and GRCG operations
are decently batched this time around, in contrast to TH01.
Luckily, the tile functions merely use inline assembly within a
typical C function and can therefore be at least part of a C++ source file,
even if the result is pretty ugly. This time, we can actually be sure that
they weren't written directly in a .ASM file, because they feature x86
instruction encodings that can only be generated with Turbo C++ 4.0J's
inline assembler, not with TASM. The same can't unfortunately be said about
the following function in the same segment, which marks the tiles covered by
the spark sprites for redrawing. In this one, it took just one dumb hand-written ASM
inconsistency in the function's epilog to make the entire function
undecompilable.
The standard x86 instruction sequence to set up a stack frame in a function prolog looks like this:
PUSH BP
MOV BP, SP
SUB SP, ?? ; if the function needs the stack for local variables
When compiling without optimizations, Turbo C++ 4.0J will
replace this sequence with a single ENTER instruction. That one
is two bytes smaller, but much slower on every x86 CPU except for the 80186
where it was introduced.
In functions without local variables, BP and SP
remain identical, and a single POP BP is all that's needed in
the epilog to tear down such a stack frame before returning from the
function. Otherwise, the function needs an additional MOV SP,
BP instruction to pop all local variables. With x86 being the helpful
CISC architecture that it is, the 80186 also introduced the
LEAVE instruction to perform both tasks. Unlike
ENTER, this single instruction
is faster than the raw two instructions on a lot of x86 CPUs (and
even current ones!), and it's always smaller, taking up just 1 byte instead
of 3. So what if you use LEAVE even if your function
doesn't use local variables? The fact that the
instruction first does the equivalent of MOV SP, BP doesn't
matter if these registers are identical, and who cares about the additional
CPU cycles of LEAVE compared to just POP BP,
right? So that's definitely something you could theoretically do, but
not something that any compiler would ever generate.
And so, TH02 MAIN.EXE decompilation already hits the first
brick wall after two pushes. Awesome! Theoretically,
we could slowly mash through this wall using the 📝 code generator. But having such an inconsistency in the
function epilog would mean that we'd have to keep Turbo C++ 4.0J from
emitting any epilog or prolog code so that we can write our
own. This means that we'd once again have to hide any use of the
SI and DI registers from the compiler… and doing
that requires code generation macros for 22 of the 49 instructions of
the function in question, almost none of which we currently have. So, this
gets quite silly quite fast, especially if we only need to do it
for one single byte.
Instead, wouldn't it be much better if we had a separate build step between
compile and link time that allowed us to replicate mistakes like these by
just patching the compiled .OBJ files? These files still contain the names
of exported functions for linking, which would allow us to look up the code
of a function in a robust manner, navigate to specific instructions using a
disassembler, replace them, and write the modified .OBJ back to disk before
linking. Such a system could then naturally expand to cover all other
decompilation issues, culminating in a full-on optimizer that could even
recreate ZUN's self-modifying code. At that point, we would have sealed away
all of ZUN's ugly ASM code within a separate build step, and could finally
decompile everything into readable C++.
Pulling that off would require a significant tooling investment though.
Patching that one byte in TH02's spark invalidation function could be done
within 1 or 2 pushes, but that's just one issue, and we currently have 32
other .ASM files with undecompilable code. Also, note that this is
fundamentally different from what we're doing with the
debloated branch and the Anniversary Editions. Mistake patching
would purely be about having readable code on master that
compiles into ZUN's exact binaries, without fixing weird
code. The Anniversary Editions go much further and rewrite such code in
a much more fundamental way, improving it further than mistake patching ever
could.
Right now, the Anniversary Editions seem much more
popular, which suggests that people just want 100% RE as fast as
possible so that I can start working on them. In that case, why bother with
such undecompilable functions, and not just leave them in raw and unreadable
x86 opcode form if necessary… But let's first
see how much backer support there actually is for mistake patching before
falling back on that.
The best part though: Once we've made a decision and then covered TH02's
spark and particle systems, that was it, and we will have already RE'd
all ZUN-written PC-98-specific blitting code in this game. Every further
sprite or shape is rendered via master.lib, and is thus decently abstracted.
Guess I'll need to update
📝 the assessment of which PC-98 Touhou game is the easiest to port,
because it sure isn't TH01, as we've seen with all the work required for the first Anniversary Edition build.
Until then, there are still enough parts of the game that don't use any of
the remaining few functions in the _TEXT segment. Previously, I
mentioned in the 📝 status overview blog post
that TH02 had a seemingly weird sprite system, but the spark and point popup
() structures showed that the game just
stores the current and previous position of its entities in a slightly
different way compared to the rest of PC-98 Touhou. Instead of having
dedicated structure fields, TH02 uses two-element arrays indexed with the
active VRAM page. Same thing, and such a pattern even helps during RE since
it's easy to spot once you know what to look for.
There's not much to criticize about the point popup system, except for maybe
a landmine that causes sprite glitches when trying to display more than
99,990 points. Sadly, the final push in this delivery was rounded out by yet
another piece of code at the opposite end of the quality spectrum. The
particle and smear effects for Reimu's bomb animations consist almost
entirely of assembly bloat, which would just be replaced with generic calls
to the generic blitter in this game's future Anniversary Edition.
If I continue to decompile TH02 while avoiding the brick wall, items would
be next, but they probably require two pushes. Next up, therefore:
Integrating Stripe as an alternative payment provider into the order form.
There have been at least three people who reported issues with PayPal, and
Stripe has been working much better in tests. In the meantime, here's a temporary Stripe
order link for everyone. This one is not connected to the cap yet, so
please make sure to stay within whatever value is currently shown on the
front page – I will treat any excess money as donations.
If there's some time left afterward, I might
also add some small improvements to the TH01 Anniversary Edition.
More than three months without any reverse-engineering progress! It's been
way too long. Coincidentally, we're at least back with a surprising 1.25% of
overall RE, achieved within just 3 pushes. The ending script system is not
only more or less the same in TH04 and TH05, but actually originated in
TH03, where it's also used for the cutscenes before stages 8 and 9. This
means that it was one of the final pieces of code shared between three of
the four remaining games, which I got to decompile at roughly 3× the usual
speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The
Music Room is largely equivalent in all three remaining games as well, and
the sound device selection, ZUN Soft logo screens, and main/option menus are
the same in TH04 and TH05. A lot of that code is in the "technically RE'd
but not yet decompiled" ASM form though, so it would shift Finalized% more
significantly than RE%. Therefore, make sure to order the new
Finalization option rather than Reverse-engineering if you
want to make number go up.
So, cutscenes. On the surface, the .TXT files look simple enough: You
directly write the text that should appear on the screen into the file
without any special markup, and add commands to define visuals, music, and
other effects at any place within the script. Let's start with the basics of
how text is rendered, which are the same in all three games:
First off, the text area has a size of 480×64 pixels. This means that it
does not correspond to the tiled area painted into TH05's
EDBK?.PI images:
The yellow area is designated for character names.
Since the font weight can be customized, all text is rendered to VRAM.
This also includes gaiji, despite them ignoring the font weight
setting.
The system supports automatic line breaks on a per-glyph basis, which
move the text cursor to the beginning of the red text area. This might seem like a piece of long-forgotten
ancient wisdom at first, considering the absence of automatic line breaks in
Windows Touhou. However, ZUN probably implemented it more out of pure
necessity: Text in VRAM needs to be unblitted when starting a new box, which
is way more straightforward and performant if you only need to worry
about a fixed area.
The system also automatically starts a new (key press-separated) text
box after the end of the 4th line. However, the text cursor is
also unconditionally moved to the top-left corner of the yellow name
area when this happens, which is almost certainly not what you expect, given
that automatic line breaks stay within the red area. A script author might
as well add the necessary text box change commands manually, if you're
forced to anticipate the automatic ones anyway…
Due to ZUN forgetting an unblitting call during the TH05 refactoring of the
box background buffer, this feature is even completely broken in that game,
as any new text will simply be blitted on top of the old one:
Wait, why are we already talking about game-specific differences after
all? Also, note how the ⏎ animation appears one line below where you'd
expect it.
Overall, the system is geared toward exclusively full-width text. As
exemplified by the 2014 static English patches and the screenshots in this
blog post, half-width text is possible, but comes with a lot of
asterisks attached:
Each loop of the script interpreter starts by looking at the next
byte to distinguish commands from text. However, this step also skips
over every ASCII space and control character, i.e., every byte
≤ 32. If you only intend to display full-width glyphs anyway, this
sort of makes sense: You gain complete freedom when it comes to the
physical layout of these script files, and it especially allows commands
to be freely separated with spaces and line breaks for improved
readability. Still, enforcing commands to be separated exclusively by
line breaks might have been even better for readability, and would have
freed up ASCII spaces for regular text…
Non-command text is blindly processed and rendered two bytes at a
time. The rendering function interprets these bytes as a Shift-JIS
string, so you can use half-width characters here. While the
second byte can even be an ASCII 0x20 space due to the
parser's blindness, all half-width characters must still occur in pairs
that can't be interrupted by commands:
As a workaround for at least the ASCII space issue, you can replace
them with any of the unassigned
Shift-JIS lead bytes – 0x80, 0xA0, or
anything between 0xF0 and 0xFF inclusive.
That's what you see in all screenshots of this post that display
half-width spaces.
Finally, did you know that you can hold ESC to fast-forward
through these cutscenes, which skips most frame delays and reduces the rest?
Due to the blocking nature of all commands, the ESC key state is
only updated between commands or 2-byte text groups though, so it can't
interrupt an ongoing delay.
Superficially, the list of game-specific differences doesn't look too long,
and can be summarized in a rather short table:
It's when you get into the implementation that the combined three systems
reveal themselves as a giant mess, with more like 56 differences between the
games. Every single new weird line of code opened up
another can of worms, which ultimately made all of this end up with 24
pieces of bloat and 14 bugs. The worst of these should be quite interesting
for the general PC-98 homebrew developers among my audience:
The final official 0.23 release of master.lib has a bug in
graph_gaiji_put*(). To calculate the JIS X 0208 code point for
a gaiji, it is enough to ADD 5680h onto the gaiji ID. However,
these functions accidentally use ADC instead, which incorrectly
adds the x86 carry flag on top, causing weird off-by-one errors based on the
previous program state. ZUN did fix this bug directly inside master.lib for
TH04 and TH05, but still needed to work around it in TH03 by subtracting 1
from the intended gaiji ID. Anyone up for maintaining a bug-fixed master.lib
repository?
The worst piece of bloat comes from TH03 and TH04 needlessly
switching the visibility of VRAM pages while blitting a new 320×200 picture.
This makes it much harder to understand the code, as the mere existence of
these page switches is enough to suggest a more complex interplay between
the two VRAM pages which doesn't actually exist. Outside this visibility
switch, page 0 is always supposed to be shown, and page 1 is always used
for temporarily storing pixels that are later crossfaded onto page 0. This
is also the only reason why TH03 has to render text and gaiji onto both VRAM
pages to begin with… and because TH04 doesn't, changing the picture in the
middle of a string of text is technically bugged in that game, even though
you only get to temporarily see the new text on very underclocked PC-98
systems.
These performance implications made me wonder why cutscenes even bother with
writing to the second VRAM page anyway, before copying each crossfade step
to the visible one.
📝 We learned in June how costly EGC-"accelerated" inter-page copies are;
shouldn't it be faster to just blit the image once rather than twice?
Well, master.lib decodes .PI images into a packed-pixel format, and
unpacking such a representation into bitplanes on the fly is just about the
worst way of blitting you could possibly imagine on a PC-98. EGC inter-page
copies are already fairly disappointing at 42 cycles for every 16 pixels, if
we look at the i486 and ignore VRAM latencies. But under the same
conditions, packed-pixel unpacking comes in at 81 cycles for every 8
pixels, or almost 4× slower. On lower-end systems, that can easily sum up to
more than one frame for a 320×200 image. While I'd argue that the resulting
tearing could have been an acceptable part of the transition between two
images, it's understandable why you'd want to avoid it in favor of the
pure effect on a slower framerate.
Really makes me wonder why master.lib didn't just directly decode .PI images
into bitplanes. The performance impact on load times should have been
negligible? It's such a good format for
the often dithered 16-color artwork you typically see on PC-98, and
deserves better than master.lib's implementation which is both slow to
decode and slow to blit.
That brings us to the individual script commands… and yes, I'm going to
document every single one of them. Some of their interactions and edge cases
are not clear at all from just looking at the code.
Almost all commands are preceded by… well, a 0x5C lead byte.
Which raises the question of whether we should
document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded
¥ yen sign. From a gaijin perspective, it seems obvious that it's a
backslash, as it's consistently displayed as one in most of the editors you
would actually use nowadays. But interestingly, iconv
-f shift-jis -t utf-8 does convert any 0x5C
lead bytes to actual ¥ U+00A5 YEN SIGN code points
.
Ultimately, the distinction comes down to the font. There are fonts
that still render 0x5C as ¥, but mainly do so out
of an obvious concern about backward compatibility to JIS X 0201, where this
mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho,
the old Japanese fonts from Windows 3.1, but even Meiryo and Yu
Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much
every other modern font, and freely licensed ones in particular, render this
code point as \, even if you set your editor to Shift-JIS. And
while ZUN most definitely saw it as a ¥, documenting this code
point as \ is less ambiguous in the long run. It can only
possibly correspond to one specific code point in either Shift-JIS or UTF-8,
and will remain correct even if we later mod the cutscene system to support
full-blown Unicode.
Now we've only got to clarify the parameter syntax, and then we can look at
the big table of commands:
Numeric parameters are read as sequences of up to 3 ASCII digits. This
limits them to a range from 0 to 999 inclusive, with 000 and
0 being equivalent. Because there's no further sentinel
character, any further digit from the 4th one onwards is
interpreted as regular text.
Filename parameters must be terminated with a space or newline and are
limited to 12 characters, which translates to 8.3 basenames without any
directory component. Any further characters are ignored and displayed as
text as well.
Each .PI image can contain up to four 320×200 pictures ("quarters") for
the cutscene picture area. In the script commands, they are numbered like
this:
0
1
2
3
\@
Clears both VRAM pages by filling them with VRAM color 0. 🐞
In TH03 and TH04, this command does not update the internal text area
background used for unblitting. This bug effectively restricts usage of
this command to either the beginning of a script (before the first
background image is shown) or its end (after no more new text boxes are
started). See the image below for an
example of using it anywhere else.
\b2
Sets the font weight to a value between 0 (raw font ROM glyphs) to 3
(very thicc). Specifying any other value has no effect.
🐞 In TH04 and TH05, \b3 leads to glitched pixels when
rendering half-width glyphs due to a bug in the newly micro-optimized
ASM version of
📝 graph_putsa_fx(); see the image below for an example.
In these games, the parameter also directly corresponds to the
graph_putsa_fx() effect function, removing the sanity check
that was present in TH03. In exchange, you can also access the four
dissolve masks for the bold font (\b2) by specifying a
parameter between 4 (fewest pixels) to 7 (most
pixels). Demo video below.
\c15
Changes the text color to VRAM color 15.
\c=字,15
Adds a color map entry: If 字 is the first code point
inside the name area on a new line, the text color is automatically set
to 15. Up to 8 such entries can be registered
before overflowing the statically allocated buffer.
🐞 The comma is assumed to be present even if the color parameter is omitted.
\e0
Plays the sound effect with the given ID.
\f
(no-op)
\fi1
\fo1
Calls master.lib's palette_black_in() or
palette_black_out() to play a hardware palette fade
animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
\fm1
Fades out BGM volume via PMD's AH=02h interrupt call,
in a non-blocking way. The fade speed can range from 1 (slowest) to 127 (fastest).
Values from 128 to 255 technically correspond to
AH=02h's fade-in feature, which can't be used from cutscene
scripts because it requires BGM volume to first be lowered via
AH=19h, and there is no command to do that.
\g8
Plays a blocking 8-frame screen shake
animation.
\ga0
Shows the gaiji with the given ID from 0 to 255
at the current cursor position. Even in TH03, gaiji always ignore the
text delay interval configured with \v.
@3
TH05's replacement for the \ga command from TH03 and
TH04. The default ID of 3 corresponds to the
gaiji. Not to be confused with \@, which starts with a backslash,
unlike this command.
@h
Shows the gaiji.
@t
Shows the gaiji.
@!
Shows the gaiji.
@?
Shows the gaiji.
@!!
Shows the gaiji.
@!?
Shows the gaiji.
\k0
Waits 0 frames (0 = forever) for an advance key to be pressed before
continuing script execution. Before waiting, TH05 crossfades in any new
text that was previously rendered to the invisible VRAM page…
🐞 …but TH04 doesn't, leaving the text invisible during the wait time.
As a workaround, \vp1 can be
used before \k to immediately display that text without a
fade-in animation.
\m$
Stops the currently playing BGM.
\m*
Restarts playback of the currently loaded BGM from the
beginning.
\m,filename
Stops the currently playing BGM, loads a new one from the given
file, and starts playback.
\n
Starts a new line at the leftmost X coordinate of the box, i.e., the
start of the name area. This is how scripts can "change" the name of the
currently speaking character, or use the entire 480×64 pixels without
being restricted to the non-name area.
Note that automatic line breaks already move the cursor into a new line.
Using this command at the "end" of a line with the maximum number of 30
full-width glyphs would therefore start a second new line and leave the
previously started line empty.
If this command moved the cursor into the 5th line of a box,
\s is executed afterward, with
any of \n's parameters passed to \s.
\p
(no-op)
\p-
Deallocates the loaded .PI image.
\p,filename
Loads the .PI image with the given file into the single .PI slot
available to cutscenes. TH04 and TH05 automatically deallocate any
previous image, 🐞 TH03 would leak memory without a manual prior call to
\p-.
\pp
Sets the hardware palette to the one of the loaded .PI image.
\p@
Sets the loaded .PI image as the full-screen 640×400 background
image and overwrites both VRAM pages with its pixels, retaining the
current hardware palette.
\p=
Runs \pp followed by \p@.
\s0
\s-
Ends a text box and starts a new one. Fades in any text rendered to
the invisible VRAM page, then waits 0 frames
(0 = forever) for an advance key to be
pressed. Afterward, the new text box is started with the cursor moved to
the top-left corner of the name area. \s- skips the wait time and starts the new box
immediately.
\t100
Sets palette brightness via master.lib's
palette_settone() to any value from 0 (fully black) to 200
(fully white). 100 corresponds to the palette's original colors.
Preceded by a 1-frame delay unless ESC is held.
\v1
Sets the number of frames to wait between every 2 bytes of rendered
text.
Sets the number of frames to spend on each of the 4 fade
steps when crossfading between old and new text. The game-specific
default value is also used before the first use of this command.
\v2
\vp0
Shows VRAM page 0. Completely useless in
TH03 (this game always synchronizes both VRAM pages at a command
boundary), only of dubious use in TH04 (for working around a bug in \k), and the games always return to
their intended shown page before every blitting operation anyway. A
debloated mod of this game would just remove this command, as it exposes
an implementation detail that script authors should not need to worry
about. None of the original scripts use it anyway.
\w64
\w and \wk wait for the given number
of frames
\wm and \wmk wait until PMD has played
back the current BGM for the total number of measures, including
loops, given in the first parameter, and fall back on calling
\w and \wk with the second parameter as
the frame number if BGM is disabled.
🐞 Neither PMD nor MMD reset the internal measure when stopping
playback. If no BGM is playing and the previous BGM hasn't been
played back for at least the given number of measures, this command
will deadlock.
Since both TH04 and TH05 fade in any new text from the invisible VRAM
page, these commands can be used to simulate TH03's typing effect in
those games. Demo video below.
Contrary to \k and \s, specifying 0 frames would
simply remove any frame delay instead of waiting forever.
The TH03-exclusive k variants allow the delay to be
interrupted if ⏎ Return or Shot are held down.
TH04 and TH05 recognize the k as well, but removed its
functionality.
All of these commands have no effect if ESC is held.
\wm64,64
\wk64
\wmk64,64
\wi1
\wo1
Calls master.lib's palette_white_in() or
palette_white_out() to play a hardware palette fade
animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
\=4
Immediately displays the given quarter of the loaded .PI image in
the picture area, with no fade effect. Any value ≥ 4 resets the picture area to black.
\==4,1
Crossfades the picture area between its current content and quarter
#4 of the loaded .PI image, spending 1 frame on each of the 4 fade steps unless
ESC is held. Any value ≥ 4 is
replaced with quarter #0.
\$
Stops script execution. Must be called at the end of each file;
otherwise, execution continues into whatever lies after the script
buffer in memory.
TH05 automatically deallocates the loaded .PI image, TH03 and TH04
require a separate manual call to \p- to not leak its memory.
Bold values signify the default if the parameter
is omitted; \c is therefore
equivalent to \c15.
The \@ bug. Yes, the ¥ is fake. It
was easier to GIMP it than to reword the sentences so that the backslashes
landed on the second byte of a 2-byte half-width character pair.
The font weights and effects available through \b, including the glitch with
\b3 in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if
you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels
at the left side of each glyph, which would extend into the previous
VRAM byte. Ironically, the TH04/TH05 version is more correct in
this regard: For half-width glyphs, it preserves any further pixel
columns generated by the weight functions in the high byte of the 16-dot
glyph variable. Unlike TH03, which still cuts them off when rendering
text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them
towards their correct place (4️⃣). It's only at byte-aligned X positions
(2️⃣) where they remain at their internally calculated place, and appear
on screen as these glitched pixel columns, 15 pixels away from the glyph
they belong to. It's easy to blame bugs like these on micro-optimized
ASM code, but in this instance, you really can't argue against it if the
original C++ version was equally incorrect.
Combining \b and s- into a partial dissolve
animation. The speed can be controlled with \v.
Simulating TH03's typing effect in TH04 and TH05 via \w. Even prettier in TH05 where we
also get an additional fade animation
after the box ends.
So yeah, that's the cutscene system. I'm dreading the moment I will have to
deal with the other command interpreter in these games, i.e., the
stage enemy system. Luckily, that one is completely disconnected from any
other system, so I won't have to deal with it until we're close to finishing
MAIN.EXE… that is, unless someone requests it before. And it
won't involve text encodings or unblitting…
The cutscene system got me thinking in greater detail about how I would
implement translations, being one of the main dependencies behind them. This
goal has been on the order form for a while and could soon be implemented
for these cutscenes, with 100% PI being right around the corner for the TH03
and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching
for Latin-script languages could be implemented fairly quickly:
Establish basic UTF-8 parsing for less painful manual editing of the
source files
Procedurally generate glyphs for the few required additional letters
based on existing font ROM glyphs. For example, we'd generate ä
by painting two short lines on top of the font ROM's a glyph,
or generate ¿ by vertically flipping the question mark. This
way, the text retains a consistent look regardless of whether the translated
game is run with an NEC or EPSON font ROM, or the that Neko Project II auto-generates if you
don't provide either.
(Optional) Change automatic line breaks to work on a per-word
basis, rather than per-glyph
That's it – script editing and distribution would be handled by your local
translation group. It might seem as if this would also work for Greek and
Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not
sure if I want to attempt procedurally shrinking these glyphs from 16×16 to
8×16… For any more thorough solution, we'd need to go for a more "Chad" kind
of full-blown translation support:
Implement text subdivisions at a sensible granularity while retaining
automatic line and box breaks
Compile translatable text into a Japanese→target language dictionary
(I'm too old to develop any further translation systems that would overwrite
modded source text with translations of the original text)
Implement a custom Unicode font system (glyphs would be taken from GNU
Unifont unless translators provide a different 8×16 font for their
language)
Combine the text compiler with the font compiler to only store needed
glyphs as part of the translation's font file (dealing with a multi-MB font
file would be rather ugly in a Real Mode game)
Write a simple install/update/patch stacking tool that supports both
.HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to
warrant a separate tool – each patch stack would be statically compiled into
a single package file in the game's directory)
Add a nice language selection option to the main menu
(Optional) Support proportional fonts
Which sounds more like a separate project to be commissioned from
Touhou Patch Center's Open Collective funds, separate from the ReC98 cap.
This way, we can make sure that the feature is completely implemented, and I
can talk with every interested translator to make sure that their language
works.
It's still cheaper overall to do this on PC-98 than to first port the games
to a modern system and then translate them. On the other hand, most
of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with
the difficulty of getting arbitrary Unicode characters to work natively in a
PC-98 DOS game at all, and would be either unnecessary or trivial if we had
already ported the game. Depending on where the patrons' interests lie, it
may not be worth it. So let's see what all of you think about which
way we should go, or whether it's worth doing at all. (Edit
(2022-12-01): With Splashman's
order towards the stage dialogue system, we've pretty much confirmed that it
is.) Maybe we want to meet in the middle – using e.g. procedural glyph
generation for dynamic translations to keep text rendering consistent with
the rest of the PC-98 system, and just not support non-Latin-script
languages in the beginning? In any case, I've added both options to the
order form. Edit (2023-07-28):Touhou Patch Center has agreed to fund
a basic feature set somewhere between the Virgin and Chad level. Check the
📝 dedicated announcement blog post for more
details and ideas, and to find out how you can support this goal!
Surprisingly, there was still a bit of RE work left in the third push after
all of this, which I filled with some small rendering boilerplate. Since I
also wanted to include TH02's playfield overlay functions,
1/15 of that last push went towards getting a
TH02-exclusive function out of the way, which also ended up including that
game in this delivery.
The other small function pointed out how TH05's Stage 5 midboss pops into
the playfield quite suddenly, since its clipping test thinks it's only 32
pixels tall rather than 64:
Good chance that the pop-in might have been intended. Edit (2023-06-30): Actually, it's a
📝 systematic consequence of ZUN having to work around the lack of clipping in master.lib's sprite functions.
There's even another quirk here: The white flash during its first frame
is actually carried over from the previous midboss, which the
game still considers as actively getting hit by the player shot that
defeated it. It's the regular boilerplate code for rendering a
midboss that resets the responsible damage variable, and that code
doesn't run during the defeat explosion animation.
Next up: Staying with TH05 and looking at more of the pattern code of its
boss fights. Given the remaining TH05 budget, it makes the most sense to
continue in in-game order, with Sara and the Stage 2 midboss. If more money
comes in towards this goal, I could alternatively go for the Mai & Yuki
fight and immediately develop a pretty fix for the cheeto storage
glitch. Also, there's a rather intricate
pull request for direct ZMBV decoding on the website that I've still got
to review…
TH05 has passed the 50% RE mark, with both MAIN.EXE and the
game as a whole! With that, we've also reached what -Tom-
wanted out of the project, so he's suspending his discount offer for a
bit.
Curve bullets are now officially called cheetos! 76.7% of
fans prefer this term, and it fits into the 8.3 DOS filename scheme much
better than homing lasers (as they're called in
OMAKE.TXT) or Taito
lasers (which would indeed have made sense as well).
…oh, and I managed to decompile Shinki within 2 pushes after all. That
left enough budget to also add the Stage 1 midboss on top.
So, Shinki! As far as final boss code is concerned, she's surprisingly
economical, with 📝 her background animations
making up more than ⅓ of her entire code. Going straight from TH01's
📝 final📝 bosses
to TH05's final boss definitely showed how much ZUN had streamlined
danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there
is still room for improvement: TH05 not only
📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month,
but also uses them 4× as often, and even for midbosses. Most importantly
though, defining danmaku patterns using a single global instance of the
group template structure is just bad no matter how you look at it:
The script code ends up rather bloated, with a single MOV
instruction for setting one of the fields taking up 5 bytes. By comparison,
the entire structure for regular bullets is 14 bytes large, while the
template structure for Shinki's 32×32 ball bullets could have easily been
reduced to 8 bytes.
Since it's also one piece of global state, you can easily forget to set
one of the required fields for a group type. The resulting danmaku group
then reuses these values from the last time they were set… which might have
been as far back as another boss fight from a previous stage.
And of course, I wouldn't point this out if it
didn't actually happen in Shinki's pattern code. Twice.
Declaring a separate structure instance with the static data for every
pattern would be both safer and more space-efficient, and there's
more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow.
The "devil"
patternis significantly more complex than the others, but still
far from TH01's final bosses at their worst. I especially like the clear
architectural separation between "one-shot pattern" functions that return
true once they're done, and "looping pattern" functions that
run as long as they're being called from a boss's main function. Not many
all too interesting things in these pattern functions for the most part,
except for two pieces of evidence that Shinki was coded after Yumeko:
The gather animation function in the first two phases contains a bullet
group configuration that looks like it's part of an unused danmaku
pattern. It quickly turns out to just be copy-pasted from a similar function
in Yumeko's fight though, where it is turned into actual
bullets.
As one of the two places where ZUN forgot to set a template field, the
lasers at the end of the white wing preparation pattern reuse the 6-pixel
width of Yumeko's final laser pattern. This actually has an effect on
gameplay: Since these lasers are active for the first 8 frames after
Shinki's wings appear on screen, the player can get hit by them in the last
2 frames after they grew to their final width.
Of course, there are more than enough safespots between the lasers.
Speaking about that wing sprite: If you look at ST05.BB2 (or
any other file with a large sprite, for that matter), you notice a rather
weird file layout:
A large sprite split into multiple smaller ones with a width of
64 pixels each? What's this, hardware sprite limitations? On my
PC-98?!
And it's not a limitation of the sprite width field in the BFNT+ header
either. Instead, it's master.lib's BFNT functions which are limited to
sprite widths up to 64 pixels… or at least that's what
MASTER.MAN claims. Whatever the restriction was, it seems to be
completely nonexistent as of master.lib version 0.23, and none of the
master.lib functions used by the games have any issues with larger
sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the
game that expects Shinki's winged form to consist of 4 physical
sprites, not just 1. Any conversion from another, more logical sprite sheet
layout back into BFNT+ must therefore replicate the original number of
sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly
loaded sprite no longer match ZUN's hardcoded IDs, causing the game to
crash. This is exactly what used to happen with -Tom-'s
MysticTK automation scripts,
which combined these exact sprites into a single large one. This issue has
now been fixed – just in case there are some underground modders out there
who used these scripts and wonder why their game crashed as soon as the
Shinki fight started.
And then the code quality takes a nosedive with Shinki's main function.
Even in TH05, these boss and midboss update
functions are still very imperative:
The origin point of all bullet types used by a boss must be manually set
to the current boss/midboss position; there is no concept of a bullet type
tracking a certain entity.
The same is true for the target point of a player's homing shots…
… and updating the HP bar. At least the initial fill animation is
abstracted away rather decently.
Incrementing the phase frame variable also must be done manually. TH05
even "innovates" here by giving the boss update function exclusive ownership
of that variable, in contrast to TH04 where that ownership is given out to
the player shot collision detection (?!) and boss defeat helper
functions.
Speaking about collision detection: That is done by calling different
functions depending on whether the boss is supposed to be invincible or
not.
Timeout conditions? No standard way either, and all done with manual
if statements. In combination with the regular phase end
condition of lowering (mid)boss HP to a certain value, this leads to quite a
convoluted control flow.
The manual calls to the score bonus functions for cleared phases at least provide some sense of orientation.
One potentially nice aspect of all this imperative freedom is that
phases can end outside of HP boundaries… by manually incrementing the
phase variable and resetting the phase frame variable to 0.
The biggest WTF in there, however, goes to using one of the 16 state bytes
as a "relative phase" variable for differentiating between boss phases that
share the same branch within the switch(boss.phase)
statement. While it's commendable that ZUN tried to reduce code duplication
for once, he could have just branched depending on the actual
boss.phase variable? The same state byte is then reused in the
"devil" pattern to track the activity state of the big jerky lasers in the
second half of the pattern. If you somehow managed to end the phase after
the first few bullets of the pattern, but before these lasers are up,
Shinki's update function would think that you're still in the phase
before the "devil" pattern. The main function then sequence-breaks
right to the defeat phase, skipping the final pattern with the burning Makai
background. Luckily, the HP boundaries are far away enough to make this
impossible in practice.
The takeaway here: If you want to use the state bytes for your custom
boss script mods, alias them to your own 16-byte structure, and limit each
of the bytes to a clearly defined meaning across your entire boss script.
One final discovery that doesn't seem to be documented anywhere yet: Shinki
actually has a hidden bomb shield during her two purple-wing phases.
uth05win got this part slightly wrong though: It's not a complete
shield, and hitting Shinki will still deal 1 point of chip damage per
frame. For comparison, the first phase lasts for 3,000 HP, and the "devil"
pattern phase lasts for 5,800 HP.
And there we go, 3rd PC-98 Touhou boss
script* decompiled, 28 to go! 🎉 In case you were expecting a fix for
the Shinki death glitch: That one
is more appropriately fixed as part of the Mai & Yuki script. It also
requires new code, should ideally look a bit prettier than just removing
cheetos between one frame and the next, and I'd still like it to fit within
the original position-dependent code layout… Let's do that some other
time.
Not much to say about the Stage 1 midboss, or midbosses in general even,
except that their update functions have to imperatively handle even more
subsystems, due to the relative lack of helper functions.
The remaining ¾ of the third push went to a bunch of smaller RE and
finalization work that would have hardly got any attention otherwise, to
help secure that 50% RE mark. The nicest piece of code in there shows off
what looks like the optimal way of setting up the
📝 GRCG tile register for monochrome blitting
in a variable color:
mov ah, palette_index ; Any other non-AL 8-bit register works too.
; (x86 only supports AL as the source operand for OUTs.)
rept 4 ; For all 4 bitplanes…
shr ah, 1 ; Shift the next color bit into the x86 carry flag
sbb al, al ; Extend the carry flag to a full byte
; (CF=0 → 0x00, CF=1 → 0xFF)
out 7Eh, al ; Write AL to the GRCG tile register
endm
Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles
into a surprisingly nice one-liner. What a beautiful micro-optimization, at
a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there,
becoming increasingly dumb and undecompilable. Was it really necessary to
save 4 x86 instructions in the highly unlikely case of a new spark sprite
being spawned outside the playfield? That one 2D polar→Cartesian
conversion function then pointed out Turbo C++ 4.0J's woefully limited
support for 32-bit micro-optimizations. The code generation for 32-bit
📝 pseudo-registers is so bad that they almost
aren't worth using for arithmetic operations, and the inline assembler just
flat out doesn't support anything 32-bit. No use in decompiling a function
that you'd have to entirely spell out in machine code, especially if the
same function already exists in multiple other, more idiomatic C++
variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC
replay file reading code, which should finally prove that nothing about the
game's original replay system could serve as even just the foundation for
community-usable replays. Just in case anyone was still thinking that.
Next up: Back to TH01, with the Elis fight! Got a bit of room left in the
cap again, and there are a lot of things that would make a lot of
sense now:
TH04 would really enjoy a large number of dedicated pushes to catch up
with TH05. This would greatly support the finalization of both games.
Continuing with TH05's bosses and midbosses has shown to be good value
for your money. Shinki would have taken even less than 2 pushes if she
hadn't been the first boss I looked at.
Oh, and I also added Seihou as a selectable goal, for the two people out
there who genuinely like it. If I ever want to quit my day job, I need to
branch out into safer territory that isn't threatened by takedowns, after
all.
Did you know that moving on top of a boss sprite doesn't kill the player in
TH04, only in TH05?
Yup, Reimu is not getting hit… yet.
That's the first of only three interesting discoveries in these 3 pushes,
all of which concern TH04. But yeah, 3 for something as seemingly simple as
these shared boss functions… that's still not quite the speed-up I had hoped
for. While most of this can be blamed, again, on TH04 and all of its
hardcoded complexities, there still was a lot of work to be done on the
maintenance front as well. These functions reference a bunch of code I RE'd
years ago and that still had to be brought up to current standards, with the
dependencies reaching from 📝 boss explosions
over 📝 text RAM overlay functionality up to
in-game dialog loading.
The latter provides a good opportunity to talk a bit about x86 memory
segmentation. Many aspiring PC-98 developers these days are very scared
of it, with some even going as far as to rather mess with Protected Mode and
DOS extenders just so that they don't have to deal with it. I wonder where
that fear comes from… Could it be because every modern programming language
I know of assumes memory to be flat, and lacks any standard language-level
features to even express something like segments and offsets? That's why
compilers have a hard time targeting 16-bit x86 these days: Doing anything
interesting on the architecture requires giving the programmer full
control over segmentation, which always comes down to adding the
typical non-standard language extensions of compilers from back in the day.
And as soon as DOS stopped being used, these extensions no longer made sense
and were subsequently removed from newer tools. A good example for this can
be found in an old version of the
NASM manual: The project started as an attempt to make x86 assemblers
simple again by throwing out most of the segmentation features from
MASM-style assemblers, which made complete sense in 1996 when 16-bit DOS and
Windows were already on their way out. But there was a point to all
those features, and that's why ReC98 still has to use the supposedly
inferior TASM.
Not that this fear of segmentation is completely unfounded: All the
segmentation-related keywords, directives, and #pragmas
provided by Borland C++ and TASM absolutely can be the cause of many
weird runtime bugs. Even if the compiler or linker catches them, you are
often left with confusing error messages that aged just as poorly as memory
segmentation itself.
However, embracing the concept does provide quite the opportunity for
optimizations. While it definitely was a very crazy idea, there is a small
bit of brilliance to be gained from making proper use of all these
segmentation features. Case in point: The buffer for the in-game dialog
scripts in TH04 and TH05.
// Thanks to the semantics of `far` pointers, we only need a single 32-bit
// pointer variable for the following code.
extern unsigned char far *dialog_p;
// This master.lib function returns a `void __seg *`, which is a 16-bit
// segment-only pointer. Converting to a `far *` yields a full segment:offset
// pointer to offset 0000h of that segment.
dialog_p = (unsigned char far *)hmem_allocbyte(/* … */);
// Running the dialog script involves pointer arithmetic. On a far pointer,
// this only affects the 16-bit offset part, complete with overflow at 64 KiB,
// from FFFFh back to 0000h.
dialog_p += /* … */;
dialog_p += /* … */;
dialog_p += /* … */;
// Since the segment part of the pointer is still identical to the one we
// allocated above, we can later correctly free the buffer by pulling the
// segment back out of the pointer.
hmem_free((void __seg *)dialog_p);
If dialog_p was a huge pointer, any pointer
arithmetic would have also adjusted the segment part, requiring a second
pointer to store the base address for the hmem_free call. Doing
that will also be necessary for any port to a flat memory model. Depending
on how you look at it, this compression of two logical pointers into a
single variable is either quite nice, or really, really dumb in its
reliance on the precise memory model of one single architecture.
Why look at dialog loading though, wasn't this supposed to be all about
shared boss functions? Well, TH04 unnecessarily puts certain stage-specific
code into the boss defeat function, such as loading the alternate Stage 5
Yuuka defeat dialog before a Bad Ending, or initializing Gengetsu after
Mugetsu's defeat in the Extra Stage.
That's TH04's second core function with an explicit conditional branch for
Gengetsu, after the
📝 dialog exit code we found last year during EMS research.
And I've heard people say that Shinki was the most hardcoded fight in PC-98
Touhou… Really, Shinki is a perfectly regular boss, who makes proper use of
all internal mechanics in the way they were intended, and doesn't blast
holes into the architecture of the game. Even within TH05, it's Mai and Yuki
who rely on hacks and duplicated code, not Shinki.
The worst part about this though? How the function distinguishes Mugetsu
from Gengetsu. Once again, it uses its own global variable to track whether
it is called the first or the second time within TH04's Extra Stage,
unrelated to the same variable used in the dialog exit function. But this
time, it's not just any newly created, single-use variable, oh no. In a
misguided attempt to micro-optimize away a few bytes of conventional memory,
TH04 reserves 16 bytes of "generic boss state", which can (and are) freely
used for anything a boss doesn't want to store in a more dedicated
variable.
It might have been worth it if the bosses actually used most of these
16 bytes, but the majority just use (the same) two, with only Stage 4 Reimu
using a whopping seven different ones. To reverse-engineer the various uses
of these variables, I pretty much had to map out which of the undecompiled
danmaku-pattern functions corresponds to which boss
fight. In the end, I assigned 29 different variable names for each of the
semantically different use cases, which made up another full push on its
own.
Now, 16 bytes of wildly shared state, isn't that the perfect recipe for
bugs? At least during this cursory look, I haven't found any obvious ones
yet. If they do exist, it's more likely that they involve reused state from
earlier bosses – just how the Shinki death glitch in
TH05 is caused by reusing cheeto data from way back in Stage 4 – and
hence require much more boss-specific progress.
And yes, it might have been way too early to look into all these tiny
details of specific boss scripts… but then, this happened:
Looks similar to another
screenshot of a crash in the same fight that was reported in December,
doesn't it? I was too much in a hurry to figure it out exactly, but notice
how both crashes happen right as the last of Marisa's four bits is destroyed.
KirbyComment has suspected
this to be the cause for a while, and now I can pretty much confirm it
to be an unguarded division by the number of on-screen bits in
Marisa-specific pattern code. But what's the cause for Kurumi then?
As for fixing it, I can go for either a fast or a slow option:
Superficially fixing only this crash will probably just take a fraction
of a push.
But I could also go for a deeper understanding by looking at TH04's
version of the 📝 custom entity structure. It
not only stores the data of Marisa's bits, but is also very likely to be
involved in Kurumi's crash, and would get TH04 a lot closer to 100%
PI. Taking that look will probably need at least 2 pushes, and might require
another 3-4 to completely decompile Marisa's fight, and 2-3 to decompile
Kurumi's.
OK, now that that's out of the way, time to finish the boss defeat function…
but not without stumbling over the third of TH04's quirks, relating to the
Clear Bonus for the main game or the Extra Stage:
To achieve the incremental addition effect for the in-game score display
in the HUD, all new points are first added to a score_delta
variable, which is then added to the actual score at a maximum rate of
61,110 points per frame.
There are a fixed 416 frames between showing the score tally and
launching into MAINE.EXE.
As a result, TH04's Clear Bonus is effectively limited to
(416 × 61,110) = 25,421,760 points.
Only TH05 makes sure to commit the entirety of the
score_delta to the actual score before switching binaries,
which fixes this issue.
And after another few collision-related functions, we're now truly,
finally ready to decompile bosses in both TH04 and TH05! Just as the
anything funds were running out… The
remaining ¼ of the third push then went to Shinki's 32×32 ball bullets,
rounding out this delivery with a small self-contained piece of the first
TH05 boss we're probably going to look at.
Next up, though: I'm not sure, actually. Both Shinki and Elis seem just a
little bit larger than the 2¼ or 4 pushes purchased so far, respectively.
Now that there's a bunch of room left in the cap again, I'll just let the
next contribution decide – with a preference for Shinki in case of a tie.
And if it will take longer than usual for the store to sell out again this
time (heh), there's still the
📝 PC-98 text RAM JIS trail word rendering research
waiting to be documented.
Technical debt, part 10… in which two of the PMD-related functions came
with such complex ramifications that they required one full push after
all, leaving no room for the additional decompilations I wanted to do. At
least, this did end up being the final one, completing all
SHARED segments for the time being.
The first one of these functions determines the BGM and sound effect
modes, combining the resident type of the PMD driver with the Option menu
setting. The TH04 and TH05 version is apparently coded quite smartly, as
PC-98 Touhou only needs to distinguish "OPN- /
PC-9801-26K-compatible sound sources handled by PMD.COM"
from "everything else", since all other PMD varieties are
OPNA- / PC-9801-86-compatible.
Therefore, I only documented those two results returned from PMD's
AH=09h function. I'll leave a comprehensive, fully documented
enum to interested contributors, since that would involve research into
basically the entire history of the PC-9800 series, and even the clearly
out-of-scope PC-88VA. After all, distinguishing between more versions of
the PMD driver in the Option menu (and adding new sprites for them!) is
strictly mod territory.
The honor of being the final decompiled function in any SHARED
segment went to TH04's snd_load(). TH04 contains by far the
sanest version of this function: Readable C code, no new ZUN bugs (and
still missing file I/O error handling, of course)… but wait, what about
that actual file read syscall, using the INT 21h, AH=3Fh DOS
file read API? Reading up to a hardcoded number of bytes into PMD's or
MMD's song or sound effect buffer, 20 KiB in TH02-TH04, 64 KiB in
TH05… that's kind of weird. About time we looked closer into this.
Turns out that no, KAJA's driver doesn't give you the full 64 KiB of one
memory segment for these, as especially TH05's code might suggest to
anyone unfamiliar with these drivers. Instead,
you can customize the size of these buffers on its command line. In
GAME.BAT, ZUN allocates 8 KiB for FM songs, 2 KiB for sound
effects, and 12 KiB for MMD files in TH02… which means that the hardcoded
sizes in snd_load() are completely wrong, no matter how you
look at them. Consequently, this read syscall
will overflow PMD's or MMD's song or sound effect buffer if the
given file is larger than the respective buffer size.
Now, ZUN could have simply hardcoded the sizes from GAME.BAT
instead, and it would have been fine. As it also turns out though,
PMD has an API function (AH=22h) to retrieve the actual
buffer sizes, provided for exactly that purpose. There is little excuse
not to use it, as it also gives you PMD's default sizes if you don't
specify any yourself.
(Unless your build process enumerates all PMD files that are part of the
game, and bakes the largest size into both snd_load() and
GAME.BAT. That would even work with MMD, which doesn't have
an equivalent for AH=22h.)
What'd be the consequence of loading a larger file then? Well, since we
don't get a full segment, let's look at the theoretical limit first.
PMD prefers to keep both its driver code and the data buffers in a single
memory segment. As a result, the limit for the combined size of the song,
instrument, and sound effect buffer is determined by the amount of
code in the driver itself. In PMD86 version 4.8o (bundled with TH04
and TH05) for example, the remaining size for these buffers is exactly
45,555 bytes. Being an actually good programmer who doesn't blindly trust
user input, KAJA thankfully validates the sizes given via the
/M, /V, and /E command-line options
before letting the driver reside in memory, and shuts down with an error
message if they exceed 40 KiB. Would have been even better if he calculated
the exact size – even in the current
PMD version 4.8s from
January 2020, it's still a hardcoded value (see line 8581).
Either way: If the file is larger than this maximum, the concrete effect
is down to the INT 21h, AH=3Fh implementation in the
underlying DOS version. DOS 3.3 treats the destination address as linear
and reads past the end of the segment,
DOS
5.0 and DOSBox-X truncate the number of bytes to not exceed the remaining
space in the segment, and maybe there's even a DOS that wraps around
and ends up overwriting the PMD driver code. In any case: You will
overwrite what's after the driver in memory – typically, the game .EXE and
its master.lib functions.
It almost feels like a happy accident that this doesn't cause issues in
the original games. The largest PMD file in any of the 4 games, the -86
version of 幽夢 ~ Inanimate Dream, takes up 8,099 bytes,
just under the 8,192 byte limit for BGM. For modders, I'd really recommend
implementing this properly, with PMD's AH=22h function and
error handling, once position independence has been reached.
Whew, didn't think I'd be doing more research into KAJA's drivers during
regular ReC98 development! That's probably been the final time though, as
all involved functions are now decompiled, and I'm unlikely to iterate
over them again.
And that's it! Repaid the biggest chunk of technical debt, time for some
actual progress again. Next up: Reopening the store tomorrow, and waiting
for new priorities. If we got nothing by Sunday, I'm going to put the
pending [Anonymous] pushes towards some work on the website.
Technical debt, part 9… and as it turns out, it's highly impractical to
repay 100% of it at this point in development. 😕
The reason: graph_putsa_fx(), ZUN's function for rendering
optionally boldfaced text to VRAM using the font ROM glyphs, in its
ridiculously micro-optimized TH04 and TH05 version. This one sets the
"callback function" for applying the boldface effect by self-modifying
the target of two CALL rel16 instructions… because
there really wasn't any free register left for an indirect
CALL, eh? The necessary distance, from the call site to the
function itself, has to be calculated at assembly time, by subtracting the
target function label from the call site label.
This usually wouldn't be a problem… if ZUN didn't store the resulting
lookup tables in the .DATA segment. With code segments, we
can easily split them at pretty much any point between functions because
there are multiple of them. But there's only a single .DATA
segment, with all ZUN and master.lib data sandwiched between Borland C++'s
crt0 at the
top, and Borland C++'s library functions at the bottom of the segment.
Adding another split point would require all data after that point to be
moved to its own translation unit, which in turn requires
EXTERN references in the big .ASM file to all that moved
data… in short, it would turn the codebase into an even greater
mess.
Declaring the labels as EXTERN wouldn't work either, since
the linker can't do fancy arithmetic and is limited to simply replacing
address placeholders with one single address. So, we're now stuck with
this function at the bottom of the SHARED segment, for the
foreseeable future.
We can still continue to separate functions off the top of that segment,
though. Pretty much the only thing noteworthy there, so far: TH04's code
for loading stage tile images from .MPN files, which we hadn't
reverse-engineered so far, and which nicely fit into one of
Blue Bolt's pending ⅓ RE contributions. Yup, we finally moved
the RE% bars again! If only for a tiny bit.
Both TH02 and TH05 simply store one pointer to one dynamically allocated
memory block for all tile images, as well as the number of images, in the
data segment. TH04, on the other hand, reserves memory for 8 .MPN slots,
complete with their color palettes, even though it only ever uses the
first one of these. There goes another 458 bytes of conventional RAM… I
should start summing up all the waste we've seen so far. Let's put the
next website contribution towards a tagging system for these blog posts.
At 86% of technical debt in the SHARED segment repaid, we
aren't quite done yet, but the rest is mostly just TH04 needing to catch
up with functions we've already separated. Next up: Getting to that
practical 98.5% point. Since this is very likely to not require a full
push, I'll also decompile some more actual TH04 and TH05 game code I
previously reverse-engineered – and after that, reopen the store!
Whoops, the build was broken again? Since
P0127 from
mid-November 2020, on TASM32 version 5.3, which also happens to be the
one in the DevKit… That version changed the alignment for the default
segments of certain memory models when requesting .386
support. And since redefining segment alignment apparently is highly
illegal and absolutely has to be a build error, some of the stand-alone
.ASM translation units didn't assemble anymore on this version. I've only
spotted this on my own because I casually compiled ReC98 somewhere else –
on my development system, I happened to have TASM32 version 5.0 in the
PATH during all this time.
At least this was a good occasion to
get rid of some
weird segment alignment workarounds from 2015, and replace them with the
superior convention of using the USE16 modifier for the
.MODEL directive.
ReC98 would highly benefit from a build server – both in order to
immediately spot issues like this one, and as a service for modders.
Even more so than the usual open-source project of its size, I would say.
But that might be exactly
because it doesn't seem like something you can trivially outsource
to one of the big CI providers for open-source projects, and quickly set
it up with a few lines of YAML.
That might still work in the beginning, and we might get by with a regular
64-bit Windows 10 and DOSBox running the exact build tools from the DevKit.
Ideally, though, such a server should really run the optimal configuration
of a 32-bit Windows 10, allowing both the 32-bit and the 16-bit build step
to run natively, which already is something that no popular CI service out
there offers. Then, we'd optimally expand to Linux, every other Windows
version down to 95, emulated PC-98 systems, other TASM versions… yeah, it'd
be a lot. An experimental project all on its own, with additional hosting
costs and probably diminishing returns, the more it expands…
I've added it as a category to the order form, let's see how much interest
there is once the store reopens (which will be at the beginning of May, at
the latest). That aside, it would 📝 also be
a great project for outside contributors!
So, technical debt, part 8… and right away, we're faced with TH03's
low-level input function, which
📝 once📝 again📝 insists on being word-aligned in a way we
can't fake without duplicating translation units.
Being undecompilable isn't exactly the best property for a function that
has been interesting to modders in the past: In 2018,
spaztron64 created an
ASM-level mod that hardcoded more ergonomic key bindings for human-vs-human
multiplayer mode: 2021-04-04-TH03-WASD-2player.zip
However, this remapping attempt remained quite limited, since we hadn't
(and still haven't) reached full position independence for TH03 yet.
There's quite some potential for size optimizations in this function, which
would allow more BIOS key groups to already be used right now, but it's not
all that obvious to modders who aren't intimately familiar with x86 ASM.
Therefore, I really wouldn't want to keep such a long and important
function in ASM if we don't absolutely have to…
… and apparently, that's all the motivation I needed? So I took the risk,
and spent the first half of this push on reverse-engineering
TCC.EXE, to hopefully find a way to get word-aligned code
segments out of Turbo C++ after all.
And there is! The -WX option, used for creating
DPMI
applications, messes up all sorts of code generation aspects in weird
ways, but does in fact mark the code segment as word-aligned. We can
consider ourselves quite lucky that we get to use Turbo C++ 4.0, because
this feature isn't available in any previous version of Borland's C++
compilers.
That allowed us to restore all the decompilations I previously threw away…
well, two of the three, that lookup table generator was too much of a mess
in C. But what an abuse this is. The
subtly different code generation has basically required one creative
workaround per usage of -WX. For example, enabling that option
causes the regular PUSH BP and POP BP prolog and
epilog instructions to be wrapped with INC BP and
DEC BP, for some reason:
a_function_compiled_with_wx proc
inc bp ; ???
push bp
mov bp, sp
; [… function code …]
pop bp
dec bp ; ???
ret
a_function_compiled_with_wx endp
Luckily again, all the functions that currently require -WX
don't set up a stack frame and don't take any parameters.
While this hasn't directly been an issue so far, it's been pretty
close: snd_se_reset(void) is one of the functions that require
word alignment. Previously, it shared a translation unit with the
immediately following snd_se_play(int new_se), which does take
a parameter, and therefore would have had its prolog and epilog code messed
up by -WX.
Since the latter function has a consistent (and thus, fakeable) alignment,
I simply split that code segment into two, with a new -WX
translation unit for just snd_se_reset(void). Problem solved –
after all, two C++ translation units are still better than one ASM
translation unit. Especially with all the
previous #include improvements.
The rest was more of the usual, getting us 74% done with repaying the
technical debt in the SHARED segment. A lot of the remaining
26% is TH04 needing to catch up with TH03 and TH05, which takes
comparatively little time. With some good luck, we might get this
done within the next push… that is, if we aren't confronted with all too
many more disgusting decompilations, like the two functions that ended this
push.
If we are, we might be needing 10 pushes to complete this after all, but
that piece of research was definitely worth the delay. Next up: One more of
these.
Alright, no more big code maintenance tasks that absolutely need to be
done right now. Time to really focus on parts 6 and 7 of repaying
technical debt, right? Except that we don't get to speed up just yet, as
TH05's barely decompilable PMD file loading function is rather…
complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98
Touhou, I first consult the disassembly of Wolfenstein 3D. That game was
originally compiled with the quite similar Borland C++ 3.0, so it's quite
helpful to compare its ASM to the
officially released source
code. If I find the instructions in question, they mostly come from
that game's ASM code, leading to the amusing realization that "even John
Carmack was unable to get these instructions out of this compiler"
This time though, Wolfenstein 3D did point me
to Borland's intrinsics for common C functions like memcpy()
and strchr(), available via #pragma intrinsic.
Bu~t those unfortunately still generate worse code than what ZUN
micro-optimized here. Commenting how these sequences of instructions
should look in C is unfortunately all I could do here.
The conditional branches in this function did compile quite nicely
though, clarifying the control flow, and clearly exposing a ZUN
bug: TH05's snd_load() will hang in an infinite loop when
trying to load a non-existing -86 BGM file (with a .M2
extension) if the corresponding -26 BGM file (with a .M
extension) doesn't exist either.
Unsurprisingly, the PMD channel monitoring code in TH05's Music Room
remains undecompilable outside the two most "high-level" initialization
and rendering functions. And it's not because there's data in the
middle of the code segment – that would have actually been possible with
some #pragmas to ensure that the data and code segments have
the same name. As soon as the SI and DI registers are referenced
anywhere, Turbo C++ insists on emitting prolog code to save these
on the stack at the beginning of the function, and epilog code to restore
them from there before returning.
Found that out in
September 2019, and confirmed that there's no way around it. All the
small helper functions here are quite simply too optimized, throwing away
any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate
that I do try.
Within that same 6th push though, we've finally reached the one function
in TH05 that was blocking further progress in TH04, allowing that game
to finally catch up with the others in terms of separated translation
units. Feels good to finally delete more of those .ASM files we've
decompiled a while ago… finally!
But since that was just getting started, the most satisfying development
in both of these pushes actually came from some more experiments with
macros and inline functions for near-ASM code. By adding
"unused" dummy parameters for all relevant registers, the exact input
registers are made more explicit, which might help future port authors who
then maybe wouldn't have to look them up in an x86 instruction
reference quite as often. At its best, this even allows us to
declare certain functions with the __fastcall convention and
express their parameter lists as regular C, with no additional
pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even
more amazing than previously thought when it comes to returning
pseudo-registers from inline functions. A nice example for
how this can improve readability can be found in this piece of TH02 code
for polling the PC-98 keyboard state using a BIOS interrupt:
inline uint8_t keygroup_sense(uint8_t group) {
_AL = group;
_AH = 0x04;
geninterrupt(0x18);
// This turns the output register of this BIOS call into the return value
// of this function. Surprisingly enough, this does *not* naively generate
// the `MOV AL, AH` instruction you might expect here!
return _AH;
}
void input_sense(void)
{
// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
// never emits as such, giving us only the three instructions we need.
_AH = keygroup_sense(8);
// Whereas this one gives us the one additional `MOV BH, AH` instruction
// we'd expect, and nothing more.
_BH = keygroup_sense(7);
// And now it's obvious what both of these registers contain, from just
// the assignments above.
if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
key_det |= INPUT_UP;
}
// […]
}
I love it. No inline assembly, as close to idiomatic C code as something
like this is going to get, yet still compiling into the minimum possible
number of x86 instructions on even a 1994 compiler. This is how I keep
this project interesting for myself during chores like these.
We might have even reached peak
inline already?
And that's 65% of technical debt in the SHARED segment repaid
so far. Next up: Two more of these, which might already complete that
segment? Finally!
Wow, 31 commits in a single push? Well, what the last push had in
progress, this one had in maintenance. The
📝 master.lib header transition absolutely
had to be completed in this one, for my own sanity. And indeed,
it reduced the build time for the entirety of ReC98 to about 27 seconds on
my system, just as expected in the original announcement. Looking forward
to even faster build times with the upcoming #include
improvements I've got up my sleeve! The port authors of the future are
going to appreciate those quite a bit.
As for the new translation units, the funniest one is probably TH05's
function for blitting the 1-color .CDG images used for the main menu
options. Which is so optimized that it becomes decompilable again,
by ditching the self-modifying code of its TH04 counterpart in favor of
simply making better use of CPU registers. The resulting C code is still a
mess, but what can you do.
This was followed by even more TH05 functions that clearly weren't
compiled from C, as evidenced by their padding
bytes. It's about time I've documented my lack of ideas of how to get
those out of Turbo C++.
And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…
In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?
Now that's the amount of translation unit separation progress I was
looking for! Too bad that RL is keeping me more and more occupied these
days, and ended up delaying this push until 2021. Now that
Touhou Patch Center is also commissioning me to update their
infrastructure, it's going to take a while for ReC98 to return to full
speed, and for the store to be reopened. Should happen by April at the
latest, though!
With everything related to this separation of translation units explained
earlier, we've really got a push with nothing to talk about, this
time. Except, maybe, for the realization that
📝 this current approach might not be the
best fit for TH02 after all: Not only did it force us to
📝 throw away the previous decompilation of
the sound effect playback functions, but OP.EXE also contains
obviously copy-pasted code in addition to the common, shared set of
library functions. How was that game even built, originally??? No
way around compiling that one instance of the "delay until given BGM
measure" function separately then, if it insists on using its own
instance of the VSync delay function…
Oh well, this separated layout still works better for the later games, and
consistency is good. Smooth sailing with all of the other functions, at
least.
Next up: One more of these, which might even end up completing the
📝 transition to our own master.lib header file.
In terms of the total number of ASM code left in the SHARED
code segments, we're now 30% done after 3 dedicated pushes. It really
shouldn't require 7 more pushes, though!
Turns out that TH04's player selection menu is exactly three times as
complicated as TH05's. Two screens for character and shot type rather than
one, and a way more intricate implementation for saving and restoring the
background behind the raised top and left edges of a character picture
when moving the cursor between Reimu and Marisa. TH04 decides to backup
precisely only the two 256×8 (top) and 8×244 (left) strips behind the
edges, indicated in red in the picture
below.
These take up just 4 KB of heap memory… but require custom blitting
functions, and expanding this explicitly hardcoded approach to TH05's 4
characters would have been pretty annoying. So, rather than, uh, not
explicitly hardcoding it all, ZUN decided to just be lazy with the backup
area in TH05, saving the entire 640×400 screen, and thus spending 128 KB
of heap memory on this rather simple selection shadow effect.
So, this really wasn't something to quickly get done during the first half
of a push, even after already having done TH05's equivalent of this menu.
But since life is very busy right now, I also used the occasion to start
addressing another code organization annoyance: master.lib's single master.h header file.
Now that ReC98 is trying to develop (or at least mimic) a more
type-safe C++ foundation to model the PC-98 hardware, a pure C header
(with counter-productive C++ extensions) is becoming increasingly
unidiomatic. By moving some of the original assumptions about function
parameters into the type system, we can also reduce the reliance on its
Japanese-only documentation without having to translate it
It's quite bloated, with at least 2800 lines of code that
currently are #included into the vast majority of files, not
counting master.h's recursively included C standard library
headers. PC-98 Touhou only makes direct use of a rather small fraction of
its contents.
And finally, all the DOS/V compatibility definitions are especially
useless in the context of ReC98. As I've noted
📝 time and
📝 time again, porting PC-98 Touhou to
IBM-compatible DOS won't be easy, and MASTER_DOSV won't be
helping much. Therefore, my upstream version of ReC98 will never include
all of master.lib. There's no point in lengthening compile times for
everyone by default, and those will be getting quite noticeable
after moving to a full 16-bit build process.
(Actually, what retro system ports should rather be doing: Get rid
of master.lib's original ASM code, replace it with
readable, modern
C++, and then simply convert the optimized assembly output of modern
compilers to your ISA of choice. Improving the landscape of such
assembly or object file converters would benefit everyone!)
So, time to start a new master.hpp header that would contain
just the declarations from master.h that PC-98 Touhou
actually needs, plus some semantic (yes, semantic) sugar. Comparing just
the old master.h to just the new master.hpp
after roughly 60% of the transition has been completed, we get median
build times of 319 ms for master.h, and 144 ms for
master.hpp on my (admittedly rather slow) DOSBox setup.
Nice!
As of this push, ReC98 consists of 107 translation units that have to be
compiled with Turbo C++ 4.0J. Fully rebuilding all of these currently
takes roughly 37.5 seconds in DOSBox. After the transition to
master.hpp is done, we could therefore shave some 10 to 15
seconds off this time, simply by switching header files. And that's just
the beginning, as this will also pave the way for further
#include optimizations. Life in this codebase will be great!
Unfortunately, there wasn't enough time to repay some of the actual
technical debt I was looking forward to, after all of this. Oh well, at
least we now also have nice identifiers for the three different boldface
options that are used when rendering text to VRAM, after procrastinating
that issue for almost 11 months. Next up, assuming the existing
subscriptions: More ridiculous decompilations of things that definitely
weren't originally written in C, and a big blocker in TH03's
MAIN.EXE.
🎉 TH05 is finally fully position-independent! 🎉 To celebrate this
milestone, -Tom- coded a little demo, which we recorded on
both an emulator and on real PC-98 hardware:
You can now freely add or remove both data and code anywhere in TH05, by
editing the ReC98 codebase, writing your mod in ASM or C/C++, and
recompiling the code. Since all absolute memory addresses have now been
converted to labels, this will work without causing any instability. See
the position independence section in the FAQ
for a more thorough explanation about why this was a problem.
By extension, this also means that it's now theoretically possible
to use a different compiler on the source code. But:
What does this not mean?
The original ZUN code hasn't been completely reverse-engineered yet, let
alone decompiled. As the final PC-98 Touhou game, TH05 also happens to
have the largest amount of actual ZUN-written ASM that can't ever
be decompiled within ReC98's constraints of a legit source code
reconstruction. But a lot of the originally-in-C code is also still in
ASM, which might make modding a bit inconvenient right now. And while I
have decompiled a bunch of functions, I selected them largely
because they would help with PI (as requested by the backers), and not
because they are particularly relevant to typical modding interests.
As a result, the code might also be a bit confusingly organized. There's
quite a conflict between various goals there: On the one hand, I'd like to
only have a single instance of every function shared with earlier games,
as well as reduce ZUN's code duplication within a single game. On the
other hand, this leads to quite a lot of code being scattered all over the
place and then #include-pasted back together, except for the
places where
📝 this doesn't work, and you'd have to use multiple translation units anyway…
I'm only beginning to figure out the best structure here, and some more
reverse-engineering attention surely won't hurt.
Also, keep in mind that the code still targets x86 Real Mode. To work
effectively in this codebase, you'd need some familiarity with
memory
segmentation, and how to express it all in code. This tends to make
even regular C++ development about an order of magnitude harder,
especially once you want to interface with the remaining ASM code. That
part made -Tom- struggle quite a bit with implementing his
custom scripting language for the demo above. For now, he built that demo
on quite a limited foundation – which is why he also chose to release
neither the build nor the source publically for the time being.
So yeah, you're definitely going to need the TASM and Borland C++ manuals
there.
tl;dr: We now know everything about this game's data, but not quite
as much about this game's code.
So, how long until source ports become a realistic project?
You probably want to wait for 100% RE, which is when everything
that can be decompiled has been decompiled.
Unless your target system is 16-bit Windows, in which case you could
theoretically start right away. 📝 Again,
this would be the ideal first system to port PC-98 Touhou to: It would
require all the generic portability work to remove the dependency on PC-98
hardware, thus paving the way for a subsequent port to modern systems,
yet you could still just drop in any undecompiled ASM.
Porting to IBM-compatible DOS would only be a harder and less universally
useful version of that. You'd then simply exchange one architecture, with
its idiosyncrasies and limits, for another, with its own set of
idiosyncrasies and limits. (Unless, of course, you already happen to be
intimately familiar with that architecture.) The fact that master.lib
provides DOS/V support would have only mattered if ZUN consistently used
it to abstract away PC-98 hardware at every single place in the code,
which is definitely not the case.
The list of actually interesting findings in this push is,
📝 again, very short. Probably the most
notable discovery: The low-level part of the code that renders Marisa's
laser from her TH04 Illusion Laser shot type is still present in
TH05. Insert wild mass guessing about potential beta version shot types…
Oh, and did you know that the order of background images in the Extra
Stage staff roll differs by character?
Next up: Finally driving up the RE% bar again, by decompiling some TH05
main menu code.
Alright, tooling and technical debt. Shouldn't be really much to talk
about… oh, wait, this is still ReC98
For the tooling part, I finished up the remaining ergonomics and error
handling for the
📝 sprite converter that Jonathan Campbell contributed two months ago.
While I familiarized myself with the tool, I've actually ran into some
unreported errors myself, so this was sort of important to me. Still got
no command-line help in there, but the error messages can now do that job
probably even better, since we would have had to write them anyway.
So, what's up with the technical debt then? Well, by now we've accumulated
quite a number of 📝 ASM code slices that
need to be either decompiled or clearly marked as undecompilable. Since we
define those slices as "already reverse-engineered", that decision won't
affect the numbers on the front page at all. But for a complete
decompilation, we'd still have to do this someday. So, rather than
incorporating this work into pushes that were purchased with the
expectation of measurable progress in a certain area, let's take the
"anything goes" pushes, and focus entirely on that during them.
The second code segment seemed like the best place to start with this,
since it affects the largest number of games simultaneously. Starting with
TH02, this segment contains a set of random "core" functions needed by the
binary. Image formats, sounds, input, math, it's all there in some
capacity. You could maybe call it all "libzun" or something like
that? But for the time being, I simply went with the obvious name,
seg2. Maybe I'll come up with something more convincing in
the future.
Oh, but wait, why were we assembling all the previous undecompilable ASM
translation units in the 16-bit build part? By moving those to the 32-bit
part, we don't even need a 16-bit TASM in our list of dependencies, as
long as our build process is not fully 16-bit.
And with that, ReC98 now also builds on Windows 95, and thus, every 32-bit
Windows version. 🎉 Which is certainly the most user-visible improvement
in all of these two pushes.
Back in 2015, I already decompiled all of TH02's seg2
functions. As suggested by the Borland compiler, I tried to follow a "one
translation unit per segment" layout, bundling the binary-specific
contents via #include. In the end, it required two
translation units – and that was even after manually inserting the
original padding bytes via #pragma codestring… yuck. But it
worked, compiled, and kept the linker's job (and, by extension,
segmentation worries) to a minimum. And as long as it all matched the
original binaries, it still counted as a valid reconstruction of ZUN's
code.
However, that idea ultimately falls apart once TH03 starts mixing
undecompilable ASM code inbetween C functions. Now, we officially have no
choice but to use multiple C and ASM translation units, with maybe only
just one or two #includes in them…
…or we finally start reconstructing the actual seg2 library,
turning every sequence of related functions into its own translation unit.
This way, we can simply reuse the once-compiled .OBJ files for all the
binaries those functions appear in, without requiring that additional
layer of translation units mirroring the original segmentation.
The best example for this is
TH03's
almost undecompilable function that generates a lookup table for
horizontally flipping 8 1bpp pixels. It's part of every binary since
TH03, but only used in that game. With the previous approach, we would
have had to add 9 C translation units, which would all have just
#included that one file. Now, we simply put the .OBJ file
into the correct place on the linker command line, as soon as we can.
💡 And suddenly, the linker just inserts the correct padding bytes itself.
The most immediate gains there also happened to come from TH03. Which is
also where we did get some tiny RE% and PI% gains out of this after
all, by reverse-engineering some of its sprite blitting setup code. Sure,
I should have done even more RE here, to also cover those 5 functions at
the end of code segment #2 in TH03's MAIN.EXE that were in
front of a number of library functions I already covered in this push. But
let's leave that to an actual RE push 😛
All in all though, I was just getting started with this; the real
gains in terms of removed ASM files are still to come. But in the
meantime, the funding situation has become even better in terms of
allowing me to focus on things nobody asked for. 🙂 So here's a slightly
better idea: Instead of spending two more pushes on this, let's shoot for
TH05 MAINE.EXE position independence next. If I manage to get
it done, we'll have a 100% position-independent TH05 by the time
-Tom- finishes his MAIN.EXE PI demo, rather
than the 94% we'd get from just MAIN.EXE. That's bound to
make a much better impression on all the people who will then
(re-)discover the project.
Only one newly ordered push since I've reopened the store? Great, that's
all the justification I needed for the extended maintenance delay that was
part of these two pushes 😛
Having to write comments to explain whether coordinates are relative to
the top-left corner of the screen or the top-left corner of the playfield
has finally become old. So, I introduced
distinct
types for all the coordinate systems we typically encounter, applying
them to all code decompiled so far. Note how the planar nature of PC-98
VRAM meant that X and Y coordinates also had to be different from each
other. On the X side, there's mainly the distinction between the
[0; 640] screen space and the corresponding [0; 80] VRAM byte
space. On the Y side, we also have the [0; 400] screen space, but
the visible area of VRAM might be limited to [0; 200] when running in
the PC-98's line-doubled 640×200 mode. A VRAM Y coordinate also always
implies an added offset for vertical scrolling.
During all of the code reconstruction, these types can only have a
documenting purpose. Turning them into anything more than just
typedefs to int, in order to define conversion
operators between them, simply won't recompile into identical binaries.
Modding and porting projects, however, now have a nice foundation for
doing just that, and can entirely lift coordinate system transformations
into the type system, without having to proofread all the meaningless
int declarations themselves.
So, what was left in terms of memory references? EX-Alice's fire waves
were our final unknown entity that can collide with the player. Decently
implemented, with little to say about them.
That left the bomb animation structures as the one big remaining PI
blocker. They started out nice and simple in TH04, with a small 6-byte
star animation structure used for both Reimu and Marisa. TH05, however,
gave each character her own animation… and what the hell is going
on with Reimu's blue stars there? Nope, not going to figure this out on
ASM level.
A decompilation first required some more bomb-related variables to be
named though. Since this was part of a generic RE push, it made sense to
do this in all 5 games… which then led to nice PI gains in anything
but TH05. Most notably, we now got the
"pulling all items to player" flag in TH04 and TH05, which is
actually separate from bombing. The obvious cheat mod is left as an
exercise to the reader.
So, TH05 bomb animations. Just like the
📝 custom entity types of this game, all 4
characters share the same memory, with the superficially same 10-byte
structure.
But let's just look at the very first field. Seen from a low level, it's a
simple struct { int x, y; } pos, storing the current position
of the character-specific bomb animation entity. But all 4 characters use
this field differently:
For Reimu's blue stars, it's the top-left position of each star, in the
12.4 fixed-point format. But unlike the vast majority of these values in
TH04 and TH05, it's relative to the top-left corner of the
screen, not the playfield. Much better represented as
struct { Subpixel screen_x, screen_y; } topleft.
For Marisa's lasers, it's the center of each circle, as a regular 12.4
fixed-point coordinate, relative to the top-left corner of the playfield.
Much better represented as
struct { Subpixel x, y; } center.
For Mima's shrinking circles, it's the center of each circle in regular
pixel coordinates. Much better represented as
struct { screen_x_t x; screen_y_t y; } center.
For Yuuka's spinning heart, it's the top-left corner in regular pixel
coordinates. Much better represented as
struct { screen_x_t x; screen_y_t y; } topleft.
And yes, singular. The game is actually smart enough to only store a single
heart, and then create the rest of the circle on the fly. (If it were even
smarter, it wouldn't even use this structure member, but oh well.)
Therefore, I decompiled it as 4 separate structures once again, bundled
into an union of arrays.
As for Reimu… yup, that's some pointer arithmetic straight out of
Jigoku* for setting and updating the positions of the falling star
trails. While that certainly required several
comments to wrap my head around the current array positions, the one "bug"
in all this arithmetic luckily has no effect on the game.
There is a small glitch with the growing circles, though. They are
spawned at the end of the loop, with their position taken from the star
pointer… but after that pointer has already been incremented. On
the last loop iteration, this leads to an out-of-bounds structure access,
with the position taken from some unknown EX-Alice data, which is 0 during
most of the game. If you look at the animation, you can easily spot these
bugged circles, consistently growing from the top-left corner (0, 0)
of the playfield:
After all that, there was barely enough remaining time to filter out and
label the final few memory references. But now, TH05's
MAIN.EXE is technically position-independent! 🎉
-Tom- is going to work on a pretty extensive demo of this
unprecedented level of efficient Touhou game modding. For a more impactful
effect of both the 100% PI mark and that demo, I'll be delaying the push
covering the remaining false positives in that binary until that demo is
done. I've accumulated a pretty huge backlog of minor maintenance issues
by now…
Next up though: The first part of the long-awaited build system
improvements. I've finally come up with a way of sanely accelerating the
32-bit build part on most setups you could possibly want to build ReC98
on, without making the building experience worse for the other few setups.
… and just as I explained 📝 in the last post
how decompilation is typically more sensible and efficient than ASM-level
reverse-engineering, we have this push demonstrating a counter-example.
The reason why the background particles and lines in the Shinki and
EX-Alice battles contributed so much to position dependence was simply
because they're accessed in a relatively large amount of functions, one
for each different animation. Too many to spend the remaining precious
crowdfunded time on reverse-engineering or even decompiling them all,
especially now that everyone anticipates 100% PI for TH05's
MAIN.EXE.
Therefore, I only decompiled the two functions of the line structure that
also demonstrate best how it works, which in turn also helped with RE.
Sadly, this revealed that we actually can't📝 overload operator =() to get
that nice assignment syntax for 12.4 fixed-point values, because one of
those new functions relies on Turbo C++'s built-in optimizations for
trivially copyable structures. Still, impressive that this abstraction
caused no other issues for almost one year.
As for the structures themselves… nope, nothing to criticize this time!
Sure, one good particle system would have been awesome, instead of having
separate structures for the Stage 2 "starfield" particles and the one used
in Shinki's battle, with hardcoded animations for both. But given the
game's short development time, that was quite an acceptable compromise,
I'd say.
And as for the lines, there just has to be a reason why the game
reserves 20 lines per set, but only renders lines #0, #6, #12, and #18.
We'll probably see once we get to look at those animation functions more
closely.
This was quite a 📝 TH03-style RE push,
which yielded way more PI% than RE%. But now that that's done, I can
finally not get distracted by all that stuff when looking at the
list of remaining memory references. Next up: The last few missing
structures in TH05's MAIN.EXE!
As expected, we've now got the TH04 and TH05 stage enemy structure,
finishing position independence for all big entity types. This one was
quite straightfoward, as the .STD scripting system is pretty simple.
Its most interesting aspect can be found in the way timing is handled. In
Windows Touhou, all .ECL script instructions come with a frame field that
defines when they are executed. In TH04's and TH05's .STD scripts, on the
other hand, it's up to each individual instruction to add a frame time
parameter, anywhere in its parameter list. This frame time defines for how
long this instruction should be repeatedly executed, before it manually
advances the instruction pointer to the next one. From what I've seen so
far, these instruction typically apply their effect on the first frame
they run on, and then do nothing for the remaining frames.
Oh, and you can't nest the LOOP instruction, since the enemy
structure only stores one single counter for the current loop iteration.
Just from the structure, the only innovation introduced by TH05 seems to
have been enemy subtypes. These can be used to parametrize scripts via
conditional jumps based on this value, as a first attempt at cutting down
the need to duplicate entire scripts for similar enemy behavior. And
thanks to TH05's favorable segment layout, this game's version of the
.STD enemy script interpreter is even immediately ready for decompilation,
in one single future push.
As far as I can tell, that now only leaves
.MPN file loading
player bomb animations
some structures specific to the Shinki and EX-Alice battles
plus some smaller things I've missed over the years
until TH05's MAIN.EXE is completely position-independent.
Which, however, won't be all it needs for that 100% PI rating on the front
page. And with that many false positives, it's quite easy to get lost with
immediately reverse-engineering everything around them. This time, the
rendering of the text dissolve circles, used for the stage and BGM title
popups, caught my eye… and since the high-level code to handle all of
that was near the end of a segment in both TH04 and TH05, I just decided
to immediately decompile it all. Like, how hard could it possibly be?
Sure, it needed another segment split, which was a bit harder due
to all the existing ASM referencing code in that segment, but certainly
not impossible…
Oh wait, this code depends on 9 other sets of identifiers that haven't
been declared in C land before, some of which require vast reorganizations
to bring them up to current consistency standards. Whoops! Good thing that
this is the part of the project I'm still offering for free…
Among the referenced functions was tiles_invalidate_around(),
which marks the stage background tiles within a rectangular area to be
redrawn this frame. And this one must have had the hardest function
signature to figure out in all of PC-98 Touhou, because it actually
seems impossible. Looking at all the ways the game passes the center
coordinate to this function, we have
X and Y as 16-bit integer literals, merged into a single
PUSH of a 32-bit immediate
X and Y calculated and pushed independently from each other
by-value copies of entire Point instances
Any single declaration would only lead to at most two of the three cases
generating the original instructions. No way around separately declaring
the function in every translation unit then, with the correct parameter
list for the respective calls. That's how ZUN must have also written it.
Oh well, we would have needed to do all of this some time. At least
there were quite a bit of insights to be gained from the actual
decompilation, where using const references actually made it
possible to turn quite a number of potentially ugly macros into wholesome
inline functions.
But still, TH04 and TH05 will come out of ReC98's decompilation as one big
mess. A lot of further manual decompilation and refactoring, beyond the
limits of the original binary, would be needed to make these games
portable to any non-PC-98, non-x86 architecture.
And yes, that includes IBM-compatible DOS – which, for some reason, a
number of people see as the obvious choice for a first system to port
PC-98 Touhou to. This will barely be easier. Sure, you'll save the effort
of decompiling all the remaining original ASM. But even with
master.lib's MASTER_DOSV setting, these games still very much
rely on PC-98 hardware, with corresponding assumptions all over ZUN's
code. You will need to provide abstractions for the PC-98's
superimposed text mode, the gaiji, and planar 4-bit color access in
general, exchanging the use of the PC-98's GRCG and EGC blitter chips with
something else. At that point, you might as well port the game to one
generic 640×400 framebuffer and away from the constraints of DOS,
resulting in that Doom source code-like situation which made that
game easily portable to every architecture to begin with. But ZUN just
wasn't a John Carmack, sorry.
Or what do I know. I've never programmed for IBM-compatible DOS, but maybe
ReC98's audience does include someone who is intimately familiar
with IBM-compatible DOS so that the constraints aren't much of an issue
for them? But even then, 16-bit Windows would make much more sense
as a first porting target if you don't want to bother with that
undecompilable ASM.
At least I won't have to look at TH04 and TH05 for quite a while now.
The delivery delays have made it obvious that
my life has become pretty busy again, probably until September. With a
total of 9 TH01 pushes from monthly subscriptions now waiting in the
backlog, the shop will stay closed until I've caught up with most of
these. Which I'm quite hyped for!
Alright, the score popup numbers shown when collecting items or defeating
(mid)bosses. The second-to-last remaining big entity type in TH05… with
quite some PI false positives in the memory range occupied by its data.
Good thing I still got some outstanding generic RE pushes that haven't
been claimed for anything more specific in over a month! These
conveniently allowed me to RE most of these functions right away, the
right way.
Most of the false positives were boss HP values, passed to a "boss phase
end" function which sets the HP value at which the next phase should end.
Stage 6 Yuuka, Mugetsu, and EX-Alice have their own copies of this
function, in which they also reset certain boss-specific global variables.
Since I always like to cover all varieties of such duplicated functions at
once, it made sense to reverse-engineer all the involved variables while I
was at it… and that's why this was exactly the right time to cover the
implementation details of Stage 6 Yuuka's parasol and vanishing animations
in TH04.
With still a bit of time left in that RE push afterwards, I could also
start looking into some of the smaller functions that didn't quite fit
into other pushes. The most notable one there was a simple function that
aims from any point to the current player position. Which actually only
became a separate function in TH05, probably since it's called 27 times in
total. That's 27 places no longer being blocked from further RE progress.
WindowsTiger already
did most of the work for the score popup numbers in January, which meant
that I only had to review it and bring it up to ReC98's current coding
styles and standards. This one turned out to be one of those rare features
whose TH05 implementation is significantly less insane than the
TH04 one. Both games lazily redraw only the tiles of the stage background
that were drawn over in the previous frame, and try their best to minimize
the amount of tiles to be redrawn in this way. For these popup numbers,
this involves calculating the on-screen width, based on the exact number
of digits in the point value. TH04 calculates this width every frame
during the rendering function, and even resorts to setting that field
through the digit iteration pointer via self-modifying code… yup. TH05, on
the other hand, simply calculates the width once when spawning a new popup
number, during the conversion of the point value to
binary-coded
decimal. The "×2" multiplier suffix being removed in TH05 certainly
also helped in simplifying that feature in this game.
And that's ⅓ of TH05 reverse-engineered! Next up, one more TH05 PI push,
in which the stage enemies hopefully finish all the big entity types.
Maybe it will also be accompanied by another RE push? In any case, that
will be the last piece of TH05 progress for quite some time. The next TH01
stretch will consist of 6 pushes at the very least, and I currently have
no idea of how much time I can spend on ReC98 a month from now…
Wait, PI for FUUIN.EXE is mainly blocked by the high score
menu? That one should really be properly decompiled in a separate
RE push, since it's also present in largely identical form in
REIIDEN.EXE… but I currently lack the explicit funding to do
that.
And as it turns out, I shouldn't really capture any of the existing generic
RE contributions for it either. Back in 2018 when I ran the crowdfunding
on the Touhou Patch Center Discord server, I said that generic RE
contributions would never go towards TH01. No one was interested in that
game back then, and as it's significantly different from all the other
games, it made sense to only cover it if explicitly requested.
As Touhou Patch Center still remains one of the biggest supporters and
advertisers for ReC98, someone recently believed that this rule was still
in effect, despite not being mentioned anywhere on this website.
Fast forward to today, and TH01 has become the single most supported game
lately, with plenty of incomplete pushes still open to be completed.
Reverse-engineering it has proven to be quite efficient, yielding lots of
completion percentage points per push. This, I suppose, is exactly what
backers that don't give any specific priorities are mainly interested in.
Therefore, I will allocate future partial
contributions to TH01, whenever it makes sense.
So, instead of rushing TH01 PI, let's wait for Ember2528's
April subscription, and get the 25% total RE milestone with some TH05 PI
progress instead. This one primarily focused on the gather circles
(spirals…?), the third-last missing entity type in TH05. These are
rendered using the same 8×8 pellet sprite introduced in TH02… except that
the actual pellets received a darkened bottom part in TH04
.
Which, in turn, is actually rendered quite efficiently – the games first
render the top white part of all pellets, followed by the bottom gray part
of all pellets. The PC-98 GRCG is used throughout the process, doing its
typical job of accelerating monochrome blitting, and by arranging the
rendering like this, only two GRCG color changes are required to draw any
number of pellets. I guess that makes it quite a worthwhile
optimization? Don't ask me for specific performance numbers or even saved
cycles, though
To finish this TH05 stretch, we've got a feature that's exclusive to TH05
for once! As the final memory management innovation in PC-98 Touhou, TH05
provides a single static (64 * 26)-byte array for storing up to 64
entities of a custom type, specific to a stage or boss portion.
(Edit (2023-05-29): This system actually debuted in
📝 TH04, where it was used for much simpler
entities.)
TH05 uses this array for
the Stage 2 star particles,
Alice's puppets,
the tip of curve ("jello") bullets,
Mai's snowballs and Yuki's fireballs,
Yumeko's swords,
and Shinki's 32×32 bullets,
which makes sense, given that only one of those will be active at any
given time.
On the surface, they all appear to share the same 26-byte structure, with
consistently sized fields, merely using its 5 generic fields for different
purposes. Looking closer though, there actually are differences in
the signedness of certain fields across the six types. uth05win chose to
declare them as entirely separate structures, and given all the semantic
differences (pixels vs. subpixels, regular vs. tiny master.lib sprites,
…), it made sense to do the same in ReC98. It quickly turned out to be the
only solution to meet my own standards of code readability.
Which blew this one up to two pushes once again… But now, modders can
trivially resize any of those structures without affecting the other types
within the original (64 * 26)-byte boundary, even without full position
independence. While you'd still have to reduce the type-specific
number of distinct entities if you made any structure larger, you
could also have more entities with fewer structure members.
As for the types themselves, they're full of redundancy once again – as
you might have already expected from seeing #4, #5, and #6 listed as
unrelated to each other. Those could have indeed been merged into a single
32×32 bullet type, supporting all the unique properties of #4
(destructible, with optional revenge bullets), #5 (optional number of
twirl animation frames before they begin to move) and #6 (delay clouds).
The *_add(), *_update(), and *_render()
functions of #5 and #6 could even already be completely
reverse-engineered from just applying the structure onto the ASM, with the
ones of #3 and #4 only needing one more RE push.
But perhaps the most interesting discovery here is in the curve bullets:
TH05 only renders every second one of the 17 nodes in a curve
bullet, yet hit-tests every single one of them. In practice, this is an
acceptable optimization though – you only start to notice jagged edges and
gaps between the fragments once their speed exceeds roughly 11 pixels per
second:
And that brings us to the last 20% of TH05 position independence! But
first, we'll have more cheap and fast TH01 progress.
Well, that took twice as long as I thought, with the two pushes containing
a lot more maintenance than actual new research. Spending some time
improving both field names and types in
32th System's
TH03 resident structure finally gives us all of those
structures. Which means that we can now cover all the remaining
decompilable ZUN.COM parts at once…
Oh wait, their main() functions have stayed largely identical
since TH02? Time to clean up and separate that first, then… and combine
two recent code generation observations into the solution to a
decompilation puzzle from 4½ years ago. Alright, time to decomp-
Oh wait, we'd kinda like to properly RE all the code in TH03-TH05
that deals with loading and saving .CFG files. Almost every outside
contributor wanted to grab this supposedly low-hanging fruit a lot
earlier, but (of course) always just for a single game, while missing how
the format evolved.
So, ZUN.COM. For some reason, people seem to consider it
particularly important, even though it contains neither any game logic nor
any code specific to PC-98 hardware… All that this decompilable part does
is to initialize a game's .CFG file, allocate an empty resident structure
using master.lib functions, release it after you quit the game,
error-check all that, and print some playful messages~ (OK, TH05's also
directly fills the resident structure with all data from
MIKO.CFG, which all the other games do in OP.EXE.)
At least modders can now freely change and extend all the resident
structures, as well as the .CFG files? And translators can translate those
messages that you won't see on a decently fast emulator anyway? Have fun,
I guess 🤷
And you can in fact do this right now – even for TH04 and TH05,
whose ZUN.COM currently isn't rebuilt by ReC98. There is
actually a rather involved reason for this:
One of the missing files is TH05's GJINIT.COM.
Which contains all of TH05's gaiji characters in hardcoded 1bpp form,
together with a bit of ASM for writing them to the PC-98's hardware gaiji
RAM
Which means we'd ideally first like to have a sprite compiler, for
all the hardcoded 1bpp sprites
Which must compile to an ASM slice in the meantime, but should also
output directly to an OMF .OBJ file (for performance now), as well as to C
code (for portability later)
Which I won't put in as long as the backlog contains actual
progress to drive up the percentages on the front page.
So yeah, no meaningful RE and PI progress at any of these levels. Heck,
even as a modder, you can just replace the zun zun_res
(TH02), zun -5 (TH03), or zun -s (TH04/TH05)
calls in GAME.BAT with a direct call to your modified
*RES*.COM. And with the alternative being "manually typing 0 and 1
bits into a text file", editing the sprites in TH05's
GJINIT.COM is way more comfortable in a binary sprite editor
anyway.
For me though, the best part in all of this was that it finally made sense
to throw out the old Borland C++ run-time assembly slices 🗑 This giant
waste of time
became obvious 5 years ago, but any ASM dump of a .COM
file would have needed rather ugly workarounds without those slices. Now
that all .COM binaries that were originally written in C are
compiled from C, we can all enjoy slightly faster grepping over the entire
repository, which now has 229 fewer files. Productivity will skyrocket!
Next up: Three weeks of almost full-time ReC98 work! Two more PI-focused
pushes to finish this TH05 stretch first, before switching priorities to
TH01 again.
🎉 TH04's and TH05's OP.EXE are now fully
position-independent! 🎉
What does this mean?
You can now add any data or code to the main menus of the two games, by
simply editing the ReC98 source, writing your mod in ASM or C/C++, and
recompiling the code. Since all absolute memory addresses have now been
converted to labels, this will work without causing any instability. See
the position independence section in the FAQ
for a more thorough explanation about why this was a problem.
What does this not mean?
The original ZUN code hasn't been completely reverse-engineered yet, let
alone decompiled. Pretty much all of that is still ASM, which might make
modding a bit inconvenient right now.
Since this push was otherwise pretty unremarkable, I made a video
demonstrating a few basic things you can do with this:
Now, what to do for the last outstanding Touhou Patch Center push?
Bullets, or resident structures?
So, where to start? Well, TH04 bullets are hard, so let's
procrastinate start with TH03 instead
The 📝 sprite display functions are the
obvious blocker for any structure describing a sprite, and therefore most
meaningful PI gains in that game… and I actually did manage to fit a
decompilation of those three functions into exactly the amount of time
that the Touhou Patch Center community votes alloted to TH03
reverse-engineering!
And a pretty amazing one at that. The original code was so obviously
written in ASM and was just barely decompilable by exclusively using
register pseudovariables and a bit of goto, but I was able to
abstract most of that away, not least thanks to a few helpful optimization
properties of Turbo C++… seriously, I can't stop marveling at this ancient
compiler. The end result is both readable, clear, and dare I say
portable?! To anyone interested in porting TH03,
take a look. How painful would it be to port that away from 16-bit
x86?
However, this push is also a typical example that the RE/PI priorities can
only control what I look at, and the outcome can actually differ
greatly. Even though the priorities were 65% RE and 35% PI, the progress
outcome was +0.13% RE and +1.35% PI. But hey, we've got one more push with
a focus on TH03 PI, so maybe that one will include more RE than
PI, and then everything will end up just as ordered?
Deathbombs confirmed, in both TH04 and TH05! On the surface, it's the same
8-frame window as in
most Windows games, but due to the slightly lower PC-98 frame rate of
56.4 Hz, it's actually slightly more lenient in TH04 and TH05.
The last function in front of the TH05 shot type control functions marks
the player's previous position in VRAM to be redrawn. But as it turns out,
"player" not only means "the player's option satellites on shot levels ≥
2", but also "the explosion animation if you lose a life", which required
reverse-engineering both things, ultimately leading to the confirmation of
deathbombs.
It actually was kind of surprising that we then had reverse-engineered
everything related to rendering all three things mentioned above,
and could also cover the player rendering function right now. Luckily,
TH05 didn't decide to also micro-optimize that function into
un-decompilability; in fact, it wasn't changed at all from TH04. Unlike
the one invalidation function whose decompilation would have
actually been the goal here…
But now, we've finally gotten to where we wanted to… and only got 2
outstanding decompilation pushes left. Time to get the website ready for
hosting an actual crowdfunding campaign, I'd say – It'll make a better
impression if people can still see things being delivered after the big
announcement.
The glacial pace continues, with TH05's unnecessarily, inappropriately
micro-optimized, and hence, un-decompilable code for rendering the current
and high score, as well as the enemy health / dream / power bars. While
the latter might still pass as well-written ASM, the former goes to such
ridiculous levels that it ends up being technically buggy. If you
enjoy quality ZUN code, it's
definitely worth a read.
In TH05, this all still is at the end of code segment #1, but in TH04,
the same code lies all over the same segment. And since I really
wanted to move that code into its final form now, I finally did the
research into decompiling from anywhere else in a segment.
Turns out we actually can! It's kinda annoying, though: After splitting
the segment after the function we want to decompile, we then need to group
the two new segments back together into one "virtual segment" matching the
original one. But since all ASM in ReC98 heavily relies on being
assembled in MASM mode, we then start to suffer from MASM's group
addressing quirk. Which then forces us to manually prefix every single
function call
from inside the group
to anywhere else within the newly created segment
with the group name. It's stupidly boring busywork, because of all the
function calls you mustn't prefix. Special tooling might make this
easier, but I don't have it, and I'm not getting crowdfunded for it.
So while you now definitely can request any specific thing in any
of the 5 games to be decompiled right now, it will take slightly
longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)
Only one function away from the TH05 shot type control functions now!
Here we go, new C code! …eh, it will still take a bit to really get
decompilation going at the speeds I was hoping for. Especially with the
sheer amount of stuff that is set in the first few significant
functions we actually can decompile, which now all has to be
correctly declared in the C world. Turns out I spent the last 2 years
screwing up the case of exported functions, and even some of their names,
so that it didn't actually reflect their calling convention… yup. That's
just the stuff you tend to forget while it doesn't matter.
To make up for that, I decided to research whether we can make use of some
C++ features to improve code readability after all. Previously, it seemed
that TH01 was the only game that included any C++ code, whereas TH02 and
later seemed to be 100% C and ASM. However, during the development of the
soon to be released new build system, I noticed that even this old
compiler from the mid-90's, infamous for prioritizing compile speeds over
all but the most trivial optimizations, was capable of quite surprising
levels of automatic inlining with class methods…
…leading the research to culminate in the mindblow that is
9d121c7 – yes, we can use C++ class methods
and operator overloading to make the code more readable, while still
generating the same code than if we had just used C and preprocessor
macros.
Looks like there's now the potential for a few pull requests from outside
devs that apply C++ features to improve the legibility of previously
decompiled and terribly macro-ridden code. So, if anyone wants to help
without spending money…
So, let's continue with player shots! …eh, or maybe not directly, since they involve two other structure types in TH05, which we'd have to cover first. One of them is a different sort of sprite, and since I like me some context in my reverse-engineering, let's disable every other sprite type first to figure out what it is.
One of those other sprite types were the little sparks flying away from killed stage enemies, midbosses, and grazed bullets; easy enough to also RE right now. Turns out they use the same 8 hardcoded 8×8 sprites in TH02, TH04, and TH05. Except that it's actually 64 16×8 sprites, because ZUN wanted to pre-shift them for all 8 possible start pixels within a planar VRAM byte (rather than, like, just writing a few instructions to shift them programmatically), leading to them taking up 1,024 bytes rather than just 64.
Oh, and the thing I wanted to RE *actually* was the decay animation whenever a shot hits something. Not too complex either, especially since it's exclusive to TH05.
And since there was some time left and I actually have to pick some of the next RE places strategically to best prepare for the upcoming 17 decompilation pushes, here's two more function pointers for good measure.
Turns out I had only been about half done with the drawing routines. The rest was all related to redrawing the scrolling stage backgrounds after other sprites were drawn on top. Since the PC-98 does have hardware-accelerated scrolling, but no hardware-accelerated sprites, everything that draws animated sprites into a scrolling VRAM must then also make sure that the background tiles covered by the sprite are redrawn in the next frame, which required a bit of ZUN code. And that are the functions that have been in the way of the expected rapid reverse-engineering progress that uth05win was supposed to bring. So, looks like everything's going to go really fast now?
… yeah, no, we won't get very far without figuring out these drawing routines.
Which process data that comes from the .STD files.
Which has various arrays related to the background… including one to specify the scrolling speed. And wait, setting that to 0 actually is what starts a boss battle?
So, have a TH05 Boss Rush patch: 2018-12-26-TH05BossRush.zip
Theoretically, this should have also worked for TH04, but for some reason,
the Stage 3 boss gets stuck on the first phase if we do this?
Actually, I lied, and lasers ended up coming with everything that makes reverse-engineering ZUN code so difficult: weirdly reused variables, unexpected structures within structures, and those TH05-specific nasty, premature ASM micro-optimizations that will waste a lot of time during decompilation, since the majority of the code actually was C, except for where it wasn't.
What do you do if the TH06 text image feature for thcrap should have been done 3 days™ ago, but keeps getting more and more complex, and you have a ton of other pushes to deliver anyway? Get some distraction with some light ReC98 reverse-engineering work. This is where it becomes very obvious how much uth05win helps us with all the games, not just TH05.
5a5c347 is the most important one in there, this was the missing substructure that now makes every other sprite-like structure trivial to figure out.