More than three months without any reverse-engineering progress! It's been
way too long. Coincidentally, we're at least back with a surprising 1.25% of
overall RE, achieved within just 3 pushes. The ending script system is not
only more or less the same in TH04 and TH05, but actually originated in
TH03, where it's also used for the cutscenes before stages 8 and 9. This
means that it was one of the final pieces of code shared between three of
the four remaining games, which I got to decompile at roughly 3× the usual
speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The
Music Room is largely equivalent in all three remaining games as well, and
the sound device selection, ZUN Soft logo screens, and main/option menus are
the same in TH04 and TH05. A lot of that code is in the "technically RE'd
but not yet decompiled" ASM form though, so it would shift Finalized% more
significantly than RE%. Therefore, make sure to order the new
Finalization option rather than Reverse-engineering if you
want to make number go up.
So, cutscenes. On the surface, the .TXT files look simple enough: You
directly write the text that should appear on the screen into the file
without any special markup, and add commands to define visuals, music, and
other effects at any place within the script. Let's start with the basics of
how text is rendered, which are the same in all three games:
First off, the text area has a size of 480×64 pixels. This means that it
does not correspond to the tiled area painted into TH05's
Since the font weight can be customized, all text is rendered to VRAM.
This also includes gaiji, despite them ignoring the font weight
The system supports automatic line breaks on a per-glyph basis, which
move the text cursor to the beginning of the red text area. This might seem like a piece of long-forgotten
ancient wisdom at first, considering the absence of automatic line breaks in
Windows Touhou. However, ZUN probably implemented it more out of pure
necessity: Text in VRAM needs to be unblitted when starting a new box, which
is way more straightforward and performant if you only need to worry
about a fixed area.
The system also automatically starts a new (key press-separated) text
box after the end of the 4th line. However, the text cursor is
also unconditionally moved to the top-left corner of the yellow name
area when this happens, which is almost certainly not what you expect, given
that automatic line breaks stay within the red area. A script author might
as well add the necessary text box change commands manually, if you're
forced to anticipate the automatic ones anyway…
Due to ZUN forgetting an unblitting call during the TH05 refactoring of the
box background buffer, this feature is even completely broken in that game,
as any new text will simply be blitted on top of the old one:
Overall, the system is geared toward exclusively full-width text. As
exemplified by the 2014 static English patches and the screenshots in this
blog post, half-width text is possible, but comes with a lot of
Each loop of the script interpreter starts by looking at the next
byte to distinguish commands from text. However, this step also skips
over every ASCII space and control character, i.e., every byte
≤ 32. If you only intend to display full-width glyphs anyway, this
sort of makes sense: You gain complete freedom when it comes to the
physical layout of these script files, and it especially allows commands
to be freely separated with spaces and line breaks for improved
readability. Still, enforcing commands to be separated exclusively by
line breaks might have been even better for readability, and would have
freed up ASCII spaces for regular text…
Non-command text is blindly processed and rendered two bytes at a
time. The rendering function interprets these bytes as a Shift-JIS
string, so you can use half-width characters here. While the
second byte can even be an ASCII 0x20 space due to the
parser's blindness, all half-width characters must still occur in pairs
that can't be interrupted by commands:
As a workaround for at least the ASCII space issue, you can replace
them with any of the unassigned
Shift-JIS lead bytes – 0x80, 0xA0, or
anything between 0xF0 and 0xFF inclusive.
That's what you see in all screenshots of this post that display
Finally, did you know that you can hold ESC to fast-forward
through these cutscenes, which skips most frame delays and reduces the rest?
Due to the blocking nature of all commands, the ESC key state is
only updated between commands or 2-byte text groups though, so it can't
interrupt an ongoing delay.
Superficially, the list of game-specific differences doesn't look too long,
and can be summarized in a rather short table:
It's when you get into the implementation that the combined three systems
reveal themselves as a giant mess, with more like 56 differences between the
games. Every single new weird line of code opened up
another can of worms, which ultimately made all of this end up with 24
pieces of bloat and 14 bugs. The worst of these should be quite interesting
for the general PC-98 homebrew developers among my audience:
The final official 0.23 release of master.lib has a bug in
graph_gaiji_put*(). To calculate the JIS X 0208 code point for
a gaiji, it is enough to ADD 5680h onto the gaiji ID. However,
these functions accidentally use ADC instead, which incorrectly
adds the x86 carry flag on top, causing weird off-by-one errors based on the
previous program state. ZUN did fix this bug directly inside master.lib for
TH04 and TH05, but still needed to work around it in TH03 by subtracting 1
from the intended gaiji ID. Anyone up for maintaining a bug-fixed master.lib
The worst piece of bloat comes from TH03 and TH04 needlessly
switching the visibility of VRAM pages while blitting a new 320×200 picture.
This makes it much harder to understand the code, as the mere existence of
these page switches is enough to suggest a more complex interplay between
the two VRAM pages which doesn't actually exist. Outside this visibility
switch, page 0 is always supposed to be shown, and page 1 is always used
for temporarily storing pixels that are later crossfaded onto page 0. This
is also the only reason why TH03 has to render text and gaiji onto both VRAM
pages to begin with… and because TH04 doesn't, changing the picture in the
middle of a string of text is technically bugged in that game, even though
you only get to temporarily see the new text on very underclocked PC-98
These performance implications made me wonder why cutscenes even bother with
writing to the second VRAM page anyway, before copying each crossfade step
to the visible one.
📝 We learned in June how costly EGC-"accelerated" inter-page copies are;
shouldn't it be faster to just blit the image once rather than twice?
Well, master.lib decodes .PI images into a packed-pixel format, and
unpacking such a representation into bitplanes on the fly is just about the
worst way of blitting you could possibly imagine on a PC-98. EGC inter-page
copies are already fairly disappointing at 42 cycles for every 16 pixels, if
we look at the i486 and ignore VRAM latencies. But under the same
conditions, packed-pixel unpacking comes in at 81 cycles for every 8
pixels, or almost 4× slower. On lower-end systems, that can easily sum up to
more than one frame for a 320×200 image. While I'd argue that the resulting
tearing could have been an acceptable part of the transition between two
images, it's understandable why you'd want to avoid it in favor of the
pure effect on a slower framerate.
Really makes me wonder why master.lib didn't just directly decode .PI images
into bitplanes. The performance impact on load times should have been
negligible? It's such a good format for
the often dithered 16-color artwork you typically see on PC-98, and
deserves better than master.lib's implementation which is both slow to
decode and slow to blit.
That brings us to the individual script commands… and yes, I'm going to
document every single one of them. Some of their interactions and edge cases
are not clear at all from just looking at the code.
Almost all commands are preceded by… well, a 0x5C lead byte.
Which raises the question of whether we should
document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded
¥ yen sign. From a gaijin perspective, it seems obvious that it's a
backslash, as it's consistently displayed as one in most of the editors you
would actually use nowadays. But interestingly, iconv
-f shift-jis -t utf-8 does convert any 0x5C
lead bytes to actual ¥ U+00A5 YEN SIGN code points
Ultimately, the distinction comes down to the font. There are fonts
that still render 0x5C as ¥, but mainly do so out
of an obvious concern about backward compatibility to JIS X 0201, where this
mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho,
the old Japanese fonts from Windows 3.1, but even Meiryo and Yu
Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much
every other modern font, and freely licensed ones in particular, render this
code point as \, even if you set your editor to Shift-JIS. And
while ZUN most definitely saw it as a ¥, documenting this code
point as \ is less ambiguous in the long run. It can only
possibly correspond to one specific code point in either Shift-JIS or UTF-8,
and will remain correct even if we later mod the cutscene system to support
Now we've only got to clarify the parameter syntax, and then we can look at
the big table of commands:
Numeric parameters are read as sequences of up to 3 ASCII digits. This
limits them to a range from 0 to 999 inclusive, with 000 and
0 being equivalent. Because there's no further sentinel
character, any further digit from the 4th one onwards is
interpreted as regular text.
Filename parameters must be terminated with a space or newline and are
limited to 12 characters, which translates to 8.3 basenames without any
directory component. Any further characters are ignored and displayed as
text as well.
Each .PI image can contain up to four 320×200 pictures ("quarters") for
the cutscene picture area. In the script commands, they are numbered like
Clears both VRAM pages by filling them with VRAM color 0. 🐞
In TH03 and TH04, this command does not update the internal text area
background used for unblitting. This bug effectively restricts usage of
this command to either the beginning of a script (before the first
background image is shown) or its end (after no more new text boxes are
started). See the image below for an
example of using it anywhere else.
Sets the font weight to a value between 0 (raw font ROM glyphs) to 3
(very thicc). Specifying any other value has no effect.
🐞 In TH04 and TH05, \b3 leads to glitched pixels when
rendering half-width glyphs due to a bug in the newly micro-optimized
ASM version of
📝 graph_putsa_fx(); see the image below for an example.
In these games, the parameter also directly corresponds to the
graph_putsa_fx() effect function, removing the sanity check
that was present in TH03. In exchange, you can also access the four
dissolve masks for the bold font (\b2) by specifying a
parameter between 4 (fewest pixels) to 7 (most
pixels). Demo video below.
Changes the text color to VRAM color 15.
Adds a color map entry: If 字 is the first code point
inside the name area on a new line, the text color is automatically set
to 15. Up to 8 such entries can be registered
before overflowing the statically allocated buffer.
🐞 The comma is assumed to be present even if the color parameter is omitted.
Plays the sound effect with the given ID.
Calls master.lib's palette_black_in() or
palette_black_out() to play a hardware palette fade
animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
Fades out BGM volume via PMD's AH=02h interrupt call,
in a non-blocking way. The fade speed can range from 1 (slowest) to 127 (fastest).
Values from 128 to 255 technically correspond to
AH=02h's fade-in feature, which can't be used from cutscene
scripts because it requires BGM volume to first be lowered via
AH=19h, and there is no command to do that.
Plays a blocking 8-frame screen shake
Shows the gaiji with the given ID from 0 to 255
at the current cursor position. Even in TH03, gaiji always ignore the
text delay interval configured with \v.
TH05's replacement for the \ga command from TH03 and
TH04. The default ID of 3 corresponds to the
gaiji. Not to be confused with \@, which starts with a backslash,
unlike this command.
Shows the gaiji.
Shows the gaiji.
Shows the gaiji.
Shows the gaiji.
Shows the gaiji.
Shows the gaiji.
Waits 0 frames (0 = forever) for an advance key to be pressed before
continuing script execution. Before waiting, TH05 crossfades in any new
text that was previously rendered to the invisible VRAM page…
🐞 …but TH04 doesn't, leaving the text invisible during the wait time.
As a workaround, \vp1 can be
used before \k to immediately display that text without a
Stops the currently playing BGM.
Restarts playback of the currently loaded BGM from the
Stops the currently playing BGM, loads a new one from the given
file, and starts playback.
Starts a new line at the leftmost X coordinate of the box, i.e., the
start of the name area. This is how scripts can "change" the name of the
currently speaking character, or use the entire 480×64 pixels without
being restricted to the non-name area.
Note that automatic line breaks already move the cursor into a new line.
Using this command at the "end" of a line with the maximum number of 30
full-width glyphs would therefore start a second new line and leave the
previously started line empty.
If this command moved the cursor into the 5th line of a box,
\s is executed afterward, with
any of \n's parameters passed to \s.
Deallocates the loaded .PI image.
Loads the .PI image with the given file into the single .PI slot
available to cutscenes. TH04 and TH05 automatically deallocate any
previous image, 🐞 TH03 would leak memory without a manual prior call to
Sets the hardware palette to the one of the loaded .PI image.
Sets the loaded .PI image as the full-screen 640×400 background
image and overwrites both VRAM pages with its pixels, retaining the
current hardware palette.
Runs \pp followed by \p@.
Ends a text box and starts a new one. Fades in any text rendered to
the invisible VRAM page, then waits 0 frames
(0 = forever) for an advance key to be
pressed. Afterward, the new text box is started with the cursor moved to
the top-left corner of the name area. \s- skips the wait time and starts the new box
Sets palette brightness via master.lib's
palette_settone() to any value from 0 (fully black) to 200
(fully white). 100 corresponds to the palette's original colors.
Preceded by a 1-frame delay unless ESC is held.
Sets the number of frames to wait between every 2 bytes of rendered
Sets the number of frames to spend on each of the 4 fade
steps when crossfading between old and new text. The game-specific
default value is also used before the first use of this command.
Shows VRAM page 0. Completely useless in
TH03 (this game always synchronizes both VRAM pages at a command
boundary), only of dubious use in TH04 (for working around a bug in \k), and the games always return to
their intended shown page before every blitting operation anyway. A
debloated mod of this game would just remove this command, as it exposes
an implementation detail that script authors should not need to worry
about. None of the original scripts use it anyway.
\w and \wk wait for the given number
\wm and \wmk wait until PMD has played
back the current BGM for the total number of measures, including
loops, given in the first parameter, and fall back on calling
\w and \wk with the second parameter as
the frame number if BGM is disabled.
🐞 Neither PMD nor MMD reset the internal measure when stopping
playback. If no BGM is playing and the previous BGM hasn't been
played back for at least the given number of measures, this command
Since both TH04 and TH05 fade in any new text from the invisible VRAM
page, these commands can be used to simulate TH03's typing effect in
those games. Demo video below.
Contrary to \k and \s, specifying 0 frames would
simply remove any frame delay instead of waiting forever.
The TH03-exclusive k variants allow the delay to be
interrupted if ⏎ Return or Shot are held down.
TH04 and TH05 recognize the k as well, but removed its
All of these commands have no effect if ESC is held.
Calls master.lib's palette_white_in() or
palette_white_out() to play a hardware palette fade
animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
Immediately displays the given quarter of the loaded .PI image in
the picture area, with no fade effect. Any value ≥ 4 resets the picture area to black.
Crossfades the picture area between its current content and quarter
#4 of the loaded .PI image, spending 1 frame on each of the 4 fade steps unless
ESC is held. Any value ≥ 4 is
replaced with quarter #0.
Stops script execution. Must be called at the end of each file;
otherwise, execution continues into whatever lies after the script
buffer in memory.
TH05 automatically deallocates the loaded .PI image, TH03 and TH04
require a separate manual call to \p- to not leak its memory.
Bold values signify the default if the parameter
is omitted; \c is therefore
equivalent to \c15.
So yeah, that's the cutscene system. I'm dreading the moment I will have to
deal with the other command interpreter in these games, i.e., the
stage enemy system. Luckily, that one is completely disconnected from any
other system, so I won't have to deal with it until we're close to finishing
MAIN.EXE… that is, unless someone requests it before. And it
won't involve text encodings or unblitting…
The cutscene system got me thinking in greater detail about how I would
implement translations, being one of the main dependencies behind them. This
goal has been on the order form for a while and could soon be implemented
for these cutscenes, with 100% PI being right around the corner for the TH03
and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching
for Latin-script languages could be implemented fairly quickly:
Establish basic UTF-8 parsing for less painful manual editing of the
Procedurally generate glyphs for the few required additional letters
based on existing font ROM glyphs. For example, we'd generate ä
by painting two short lines on top of the font ROM's a glyph,
or generate ¿ by vertically flipping the question mark. This
way, the text retains a consistent look regardless of whether the translated
game is run with an NEC or EPSON font ROM, or the that Neko Project II auto-generates if you
don't provide either.
(Optional) Change automatic line breaks to work on a per-word
basis, rather than per-glyph
That's it – script editing and distribution would be handled by your local
translation group. It might seem as if this would also work for Greek and
Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not
sure if I want to attempt procedurally shrinking these glyphs from 16×16 to
8×16… For any more thorough solution, we'd need to go for a more "Chad" kind
of full-blown translation support:
Implement text subdivisions at a sensible granularity while retaining
automatic line and box breaks
Compile translatable text into a Japanese→target language dictionary
(I'm too old to develop any further translation systems that would overwrite
modded source text with translations of the original text)
Implement a custom Unicode font system (glyphs would be taken from GNU
Unifont unless translators provide a different 8×16 font for their
Combine the text compiler with the font compiler to only store needed
glyphs as part of the translation's font file (dealing with a multi-MB font
file would be rather ugly in a Real Mode game)
Write a simple install/update/patch stacking tool that supports both
.HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to
warrant a separate tool – each patch stack would be statically compiled into
a single package file in the game's directory)
Add a nice language selection option to the main menu
(Optional) Support proportional fonts
Which sounds more like a separate project to be commissioned from
Touhou Patch Center's Open Collective funds, separate from the ReC98 cap.
This way, we can make sure that the feature is completely implemented, and I
can talk with every interested translator to make sure that their language
It's still cheaper overall to do this on PC-98 than to first port the games
to a modern system and then translate them. On the other hand, most
of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with
the difficulty of getting arbitrary Unicode characters to work natively in a
PC-98 DOS game at all, and would be either unnecessary or trivial if we had
already ported the game. Depending on where the patrons' interests lie, it
may not be worth it. So let's see what all of you think about which
way we should go, or whether it's worth doing at all. (Edit
(2022-12-01): With Splashman's
order towards the stage dialogue system, we've pretty much confirmed that it
is.) Maybe we want to meet in the middle – using e.g. procedural glyph
generation for dynamic translations to keep text rendering consistent with
the rest of the PC-98 system, and just not support non-Latin-script
languages in the beginning? In any case, I've added both options to the
Surprisingly, there was still a bit of RE work left in the third push after
all of this, which I filled with some small rendering boilerplate. Since I
also wanted to include TH02's playfield overlay functions,
1/15 of that last push went towards getting a
TH02-exclusive function out of the way, which also ended up including that
game in this delivery.
The other small function pointed out how TH05's Stage 5 midboss pops into
the playfield quite suddenly, since its clipping test thinks it's only 32
pixels tall rather than 64:
Next up: Staying with TH05 and looking at more of the pattern code of its
boss fights. Given the remaining TH05 budget, it makes the most sense to
continue in in-game order, with Sara and the Stage 2 midboss. If more money
comes in towards this goal, I could alternatively go for the Mai & Yuki
fight and immediately develop a pretty fix for the cheeto storage
glitch. Also, there's a rather intricate
pull request for direct ZMBV decoding on the website that I've still got
TH05 has passed the 50% RE mark, with both MAIN.EXE and the
game as a whole! With that, we've also reached what -Tom-
wanted out of the project, so he's suspending his discount offer for a
Curve bullets are now officially called cheetos! 76.7% of
fans prefer this term, and it fits into the 8.3 DOS filename scheme much
better than homing lasers (as they're called in
OMAKE.TXT) or Taito
lasers (which would indeed have made sense as well).
…oh, and I managed to decompile Shinki within 2 pushes after all. That
left enough budget to also add the Stage 1 midboss on top.
So, Shinki! As far as final boss code is concerned, she's surprisingly
economical, with 📝 her background animations
making up more than ⅓ of her entire code. Going straight from TH01's
📝 final📝 bosses
to TH05's final boss definitely showed how much ZUN had streamlined
danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there
is still room for improvement: TH05 not only
📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month,
but also uses them 4× as often, and even for midbosses. Most importantly
though, defining danmaku patterns using a single global instance of the
group template structure is just bad no matter how you look at it:
The script code ends up rather bloated, with a single MOV
instruction for setting one of the fields taking up 5 bytes. By comparison,
the entire structure for regular bullets is 14 bytes large, while the
template structure for Shinki's 32×32 ball bullets could have easily been
reduced to 8 bytes.
Since it's also one piece of global state, you can easily forget to set
one of the required fields for a group type. The resulting danmaku group
then reuses these values from the last time they were set… which might have
been as far back as another boss fight from a previous stage.
And of course, I wouldn't point this out if it
didn't actually happen in Shinki's pattern code. Twice.
Declaring a separate structure instance with the static data for every
pattern would be both safer and more space-efficient, and there's
more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow.
patternis significantly more complex than the others, but still
far from TH01's final bosses at their worst. I especially like the clear
architectural separation between "one-shot pattern" functions that return
true once they're done, and "looping pattern" functions that
run as long as they're being called from a boss's main function. Not many
all too interesting things in these pattern functions for the most part,
except for two pieces of evidence that Shinki was coded after Yumeko:
The gather animation function in the first two phases contains a bullet
group configuration that looks like it's part of an unused danmaku
pattern. It quickly turns out to just be copy-pasted from a similar function
in Yumeko's fight though, where it is turned into actual
As one of the two places where ZUN forgot to set a template field, the
lasers at the end of the white wing preparation pattern reuse the 6-pixel
width of Yumeko's final laser pattern. This actually has an effect on
gameplay: Since these lasers are active for the first 8 frames after
Shinki's wings appear on screen, the player can get hit by them in the last
2 frames after they grew to their final width.
Speaking about that wing sprite: If you look at ST05.BB2 (or
any other file with a large sprite, for that matter), you notice a rather
weird file layout:
And it's not a limitation of the sprite width field in the BFNT+ header
either. Instead, it's master.lib's BFNT functions which are limited to
sprite widths up to 64 pixels… or at least that's what
MASTER.MAN claims. Whatever the restriction was, it seems to be
completely nonexistent as of master.lib version 0.23, and none of the
master.lib functions used by the games have any issues with larger
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the
game that expects Shinki's winged form to consist of 4 physical
sprites, not just 1. Any conversion from another, more logical sprite sheet
layout back into BFNT+ must therefore replicate the original number of
sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly
loaded sprite no longer match ZUN's hardcoded IDs, causing the game to
crash. This is exactly what used to happen with -Tom-'s
MysticTK automation scripts,
which combined these exact sprites into a single large one. This issue has
now been fixed – just in case there are some underground modders out there
who used these scripts and wonder why their game crashed as soon as the
Shinki fight started.
And then the code quality takes a nosedive with Shinki's main function.
Even in TH05, these boss and midboss update
functions are still very imperative:
The origin point of all bullet types used by a boss must be manually set
to the current boss/midboss position; there is no concept of a bullet type
tracking a certain entity.
The same is true for the target point of a player's homing shots…
… and updating the HP bar. At least the initial fill animation is
abstracted away rather decently.
Incrementing the phase frame variable also must be done manually. TH05
even "innovates" here by giving the boss update function exclusive ownership
of that variable, in contrast to TH04 where that ownership is given out to
the player shot collision detection (?!) and boss defeat helper
Speaking about collision detection: That is done by calling different
functions depending on whether the boss is supposed to be invincible or
Timeout conditions? No standard way either, and all done with manual
if statements. In combination with the regular phase end
condition of lowering (mid)boss HP to a certain value, this leads to quite a
convoluted control flow.
The manual calls to the score bonus functions for cleared phases at least provide some sense of orientation.
One potentially nice aspect of all this imperative freedom is that
phases can end outside of HP boundaries… by manually incrementing the
phase variable and resetting the phase frame variable to 0.
The biggest WTF in there, however, goes to using one of the 16 state bytes
as a "relative phase" variable for differentiating between boss phases that
share the same branch within the switch(boss.phase)
statement. While it's commendable that ZUN tried to reduce code duplication
for once, he could have just branched depending on the actual
boss.phase variable? The same state byte is then reused in the
"devil" pattern to track the activity state of the big jerky lasers in the
second half of the pattern. If you somehow managed to end the phase after
the first few bullets of the pattern, but before these lasers are up,
Shinki's update function would think that you're still in the phase
before the "devil" pattern. The main function then sequence-breaks
right to the defeat phase, skipping the final pattern with the burning Makai
background. Luckily, the HP boundaries are far away enough to make this
impossible in practice.
The takeaway here: If you want to use the state bytes for your custom
boss script mods, alias them to your own 16-byte structure, and limit each
of the bytes to a clearly defined meaning across your entire boss script.
One final discovery that doesn't seem to be documented anywhere yet: Shinki
actually has a hidden bomb shield during her two purple-wing phases.
uth05win got this part slightly wrong though: It's not a complete
shield, and hitting Shinki will still deal 1 point of chip damage per
frame. For comparison, the first phase lasts for 3,000 HP, and the "devil"
pattern phase lasts for 5,800 HP.
And there we go, 3rd PC-98 Touhou boss
script* decompiled, 28 to go! 🎉 In case you were expecting a fix for
the Shinki death glitch: That one
is more appropriately fixed as part of the Mai & Yuki script. It also
requires new code, should ideally look a bit prettier than just removing
cheetos between one frame and the next, and I'd still like it to fit within
the original position-dependent code layout… Let's do that some other
Not much to say about the Stage 1 midboss, or midbosses in general even,
except that their update functions have to imperatively handle even more
subsystems, due to the relative lack of helper functions.
The remaining ¾ of the third push went to a bunch of smaller RE and
finalization work that would have hardly got any attention otherwise, to
help secure that 50% RE mark. The nicest piece of code in there shows off
what looks like the optimal way of setting up the
📝 GRCG tile register for monochrome blitting
in a variable color:
mov ah, palette_index ; Any other non-AL 8-bit register works too.
; (x86 only supports AL as the source operand for OUTs.)
rept 4 ; For all 4 bitplanes…
shr ah, 1 ; Shift the next color bit into the x86 carry flag
sbb al, al ; Extend the carry flag to a full byte
; (CF=0 → 0x00, CF=1 → 0xFF)
out 7Eh, al ; Write AL to the GRCG tile register
Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles
into a surprisingly nice one-liner. What a beautiful micro-optimization, at
a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there,
becoming increasingly dumb and undecompilable. Was it really necessary to
save 4 x86 instructions in the highly unlikely case of a new spark sprite
being spawned outside the playfield? That one 2D polar→Cartesian
conversion function then pointed out Turbo C++ 4.0J's woefully limited
support for 32-bit micro-optimizations. The code generation for 32-bit
📝 pseudo-registers is so bad that they almost
aren't worth using for arithmetic operations, and the inline assembler just
flat out doesn't support anything 32-bit. No use in decompiling a function
that you'd have to entirely spell out in machine code, especially if the
same function already exists in multiple other, more idiomatic C++
Rounding out the third push, we got the TH04/TH05 DEMO?.REC
replay file reading code, which should finally prove that nothing about the
game's original replay system could serve as even just the foundation for
community-usable replays. Just in case anyone was still thinking that.
Next up: Back to TH01, with the Elis fight! Got a bit of room left in the
cap again, and there are a lot of things that would make a lot of
TH04 would really enjoy a large number of dedicated pushes to catch up
with TH05. This would greatly support the finalization of both games.
Continuing with TH05's bosses and midbosses has shown to be good value
for your money. Shinki would have taken even less than 2 pushes if she
hadn't been the first boss I looked at.
Oh, and I also added Seihou as a selectable goal, for the two people out
there who genuinely like it. If I ever want to quit my day job, I need to
branch out into safer territory that isn't threatened by takedowns, after
Did you know that moving on top of a boss sprite doesn't kill the player in
TH04, only in TH05?
That's the first of only three interesting discoveries in these 3 pushes,
all of which concern TH04. But yeah, 3 for something as seemingly simple as
these shared boss functions… that's still not quite the speed-up I had hoped
for. While most of this can be blamed, again, on TH04 and all of its
hardcoded complexities, there still was a lot of work to be done on the
maintenance front as well. These functions reference a bunch of code I RE'd
years ago and that still had to be brought up to current standards, with the
dependencies reaching from 📝 boss explosions
over 📝 text RAM overlay functionality up to
in-game dialog loading.
The latter provides a good opportunity to talk a bit about x86 memory
segmentation. Many aspiring PC-98 developers these days are very scared
of it, with some even going as far as to rather mess with Protected Mode and
DOS extenders just so that they don't have to deal with it. I wonder where
that fear comes from… Could it be because every modern programming language
I know of assumes memory to be flat, and lacks any standard language-level
features to even express something like segments and offsets? That's why
compilers have a hard time targeting 16-bit x86 these days: Doing anything
interesting on the architecture requires giving the programmer full
control over segmentation, which always comes down to adding the
typical non-standard language extensions of compilers from back in the day.
And as soon as DOS stopped being used, these extensions no longer made sense
and were subsequently removed from newer tools. A good example for this can
be found in an old version of the
NASM manual: The project started as an attempt to make x86 assemblers
simple again by throwing out most of the segmentation features from
MASM-style assemblers, which made complete sense in 1996 when 16-bit DOS and
Windows were already on their way out. But there was a point to all
those features, and that's why ReC98 still has to use the supposedly
Not that this fear of segmentation is completely unfounded: All the
segmentation-related keywords, directives, and #pragmas
provided by Borland C++ and TASM absolutely can be the cause of many
weird runtime bugs. Even if the compiler or linker catches them, you are
often left with confusing error messages that aged just as poorly as memory
However, embracing the concept does provide quite the opportunity for
optimizations. While it definitely was a very crazy idea, there is a small
bit of brilliance to be gained from making proper use of all these
segmentation features. Case in point: The buffer for the in-game dialog
scripts in TH04 and TH05.
// Thanks to the semantics of `far` pointers, we only need a single 32-bit
// pointer variable for the following code.
extern unsigned char far *dialog_p;
// This master.lib function returns a `void __seg *`, which is a 16-bit
// segment-only pointer. Converting to a `far *` yields a full segment:offset
// pointer to offset 0000h of that segment.
dialog_p = (unsigned char far *)hmem_allocbyte(/* … */);
// Running the dialog script involves pointer arithmetic. On a far pointer,
// this only affects the 16-bit offset part, complete with overflow at 64 KiB,
// from FFFFh back to 0000h.
dialog_p += /* … */;
dialog_p += /* … */;
dialog_p += /* … */;
// Since the segment part of the pointer is still identical to the one we
// allocated above, we can later correctly free the buffer by pulling the
// segment back out of the pointer.
hmem_free((void __seg *)dialog_p);
If dialog_p was a huge pointer, any pointer
arithmetic would have also adjusted the segment part, requiring a second
pointer to store the base address for the hmem_free call. Doing
that will also be necessary for any port to a flat memory model. Depending
on how you look at it, this compression of two logical pointers into a
single variable is either quite nice, or really, really dumb in its
reliance on the precise memory model of one single architecture.
Why look at dialog loading though, wasn't this supposed to be all about
shared boss functions? Well, TH04 unnecessarily puts certain stage-specific
code into the boss defeat function, such as loading the alternate Stage 5
Yuuka defeat dialog before a Bad Ending, or initializing Gengetsu after
Mugetsu's defeat in the Extra Stage.
That's TH04's second core function with an explicit conditional branch for
Gengetsu, after the
📝 dialog exit code we found last year during EMS research.
And I've heard people say that Shinki was the most hardcoded fight in PC-98
Touhou… Really, Shinki is a perfectly regular boss, who makes proper use of
all internal mechanics in the way they were intended, and doesn't blast
holes into the architecture of the game. Even within TH05, it's Mai and Yuki
who rely on hacks and duplicated code, not Shinki.
The worst part about this though? How the function distinguishes Mugetsu
from Gengetsu. Once again, it uses its own global variable to track whether
it is called the first or the second time within TH04's Extra Stage,
unrelated to the same variable used in the dialog exit function. But this
time, it's not just any newly created, single-use variable, oh no. In a
misguided attempt to micro-optimize away a few bytes of conventional memory,
TH04 reserves 16 bytes of "generic boss state", which can (and are) freely
used for anything a boss doesn't want to store in a more dedicated
It might have been worth it if the bosses actually used most of these
16 bytes, but the majority just use (the same) two, with only Stage 4 Reimu
using a whopping seven different ones. To reverse-engineer the various uses
of these variables, I pretty much had to map out which of the undecompiled
danmaku-pattern functions corresponds to which boss
fight. In the end, I assigned 29 different variable names for each of the
semantically different use cases, which made up another full push on its
Now, 16 bytes of wildly shared state, isn't that the perfect recipe for
bugs? At least during this cursory look, I haven't found any obvious ones
yet. If they do exist, it's more likely that they involve reused state from
earlier bosses – just how the Shinki death glitch in
TH05 is caused by reusing cheeto data from way back in Stage 4 – and
hence require much more boss-specific progress.
And yes, it might have been way too early to look into all these tiny
details of specific boss scripts… but then, this happened:
Superficially fixing only this crash will probably just take a fraction
of a push.
But I could also go for a deeper understanding by looking at TH04's
version of the 📝 custom entity structure. It
not only stores the data of Marisa's bits, but is also very likely to be
involved in Kurumi's crash, and would get TH04 a lot closer to 100%
PI. Taking that look will probably need at least 2 pushes, and might require
another 3-4 to completely decompile Marisa's fight, and 2-3 to decompile
OK, now that that's out of the way, time to finish the boss defeat function…
but not without stumbling over the third of TH04's quirks, relating to the
Clear Bonus for the main game or the Extra Stage:
To achieve the incremental addition effect for the in-game score display
in the HUD, all new points are first added to a score_delta
variable, which is then added to the actual score at a maximum rate of
61,110 points per frame.
There are a fixed 416 frames between showing the score tally and
launching into MAINE.EXE.
As a result, TH04's Clear Bonus is effectively limited to
(416 × 61,110) = 25,421,760 points.
Only TH05 makes sure to commit the entirety of the
score_delta to the actual score before switching binaries,
which fixes this issue.
And after another few collision-related functions, we're now truly,
finally ready to decompile bosses in both TH04 and TH05! Just as the
anything funds were running out… The
remaining ¼ of the third push then went to Shinki's 32×32 ball bullets,
rounding out this delivery with a small self-contained piece of the first
TH05 boss we're probably going to look at.
Next up, though: I'm not sure, actually. Both Shinki and Elis seem just a
little bit larger than the 2¼ or 4 pushes purchased so far, respectively.
Now that there's a bunch of room left in the cap again, I'll just let the
next contribution decide – with a preference for Shinki in case of a tie.
And if it will take longer than usual for the store to sell out again this
time (heh), there's still the
📝 PC-98 text RAM JIS trail word rendering research
waiting to be documented.
EMS memory! The
infamous stopgap measure between the 640 KiB ("ought to be enough for
everyone") of conventional
memory offered by DOS from the very beginning, and the later XMS standard for
accessing all the rest of memory up to 4 GiB in the x86 Protected Mode. With
an optionally active EMS driver, TH04 and TH05 will make use of EMS memory
to preload a bunch of situational .CDG images at the beginning of
The "eye catch" game title image, shown while stages are loaded
The character-specific background image, shown while bombing
The player character dialog portraits
TH05 additionally stores the boss portraits there, preloading them
at the beginning of each stage. (TH04 instead keeps them in conventional
memory during the entire stage.)
Once these images are needed, they can then be copied into conventional
memory and accessed as usual.
Uh… wait, copied? It certainly would have been possible to map EMS
memory to a regular 16-bit Real Mode segment for direct access,
bank-switching out rarely used system or peripheral memory in exchange for
the EMS data. However, master.lib doesn't expose this functionality, and
only provides functions for copying data from EMS to regular memory and vice
But even that still makes EMS an excellent fit for the large image files
it's used for, as it's possible to directly copy their pixel data from EMS
to VRAM. (Yes, I tried!) Well… would, because ZUN doesn't do
that either, and always naively copies the images to newly allocated
conventional memory first. In essence, this dumbs down EMS into just another
layer of the memory hierarchy, inserted between conventional memory and
disk: Not quite as slow as disk, but still requiring that
memcpy() to retrieve the data. Most importantly though: Using
EMS in this way does not increase the total amount of memory
simultaneously accessible to the game. After all, some other data will have
to be freed from conventional memory to make room for the newly loaded data.
The most idiomatic way to define the game-specific layout of the EMS area
would be either a struct or an enum.
Unfortunately, the total size of all these images exceeds the range of a
16-bit value, and Turbo C++ 4.0J supports neither 32-bit enums
(which are silently degraded to 16-bit) nor 32-bit structs
(which simply don't compile). That still leaves raw compile-time constants
though, you only have to manually define the offset to each image in terms
of the size of its predecessor. But instead of doing that, ZUN just placed
each image at a nice round decimal offset, each slightly larger than the
actual memory required by the previous image, just to make sure that
everything fits. This results not only in quite
a bit of unnecessary padding, but also in technically the single
biggest amount of "wasted" memory in PC-98 Touhou: Out of the 180,000 (TH04)
and 320,000 (TH05) EMS bytes requested, the game only uses 135,552 (TH04)
and 175,904 (TH05) bytes. But hey, it's EMS, so who cares, right? Out of all
the opportunities to take shortcuts during development, this is among the
most acceptable ones. Any actual PC-98 model that could run these two games
comes with plenty of memory for this to not turn into an actual issue.
On to the EMS-using functions themselves, which are the definition of
"cross-cutting concerns". Most of these have a fallback path for the non-EMS
case, and keep the loaded .CDG images in memory if they are immediately
needed. Which totally makes sense, but also makes it difficult to find names
that reflect all the global state changed by these functions. Every one of
these is also just called from a single place, so inlining
them would have saved me a lot of naming and documentation trouble
The TH04 version of the EMS allocation code was actually displayed on ZUN's monitor in the
2010 MAG・ネット documentary; WindowsTiger already transcribed the low-quality video image
in 2019. By 2015 ReC98 standards, I would have just run with that, but
the current project goal is to write better code than ZUN, so I didn't. 😛
We sure ain't going to use magic numbers for EMS offsets.
The dialog init and exit code then is completely different in both games,
yet equally cross-cutting. TH05 goes even further in saving conventional
memory, loading each individual player or boss portrait into a single .CDG
slot immediately before blitting it to VRAM and freeing the pixel data
again. People who play TH05 without an active EMS driver are surely going to
enjoy the hard drive access lag between each portrait change…
TH04, on the other hand, also abuses the dialog
exit function to preload the Mugetsu defeat / Gengetsu entrance and
Gengetsu defeat portraits, using a static variable to track how often the
function has been called during the Extra Stage… who needs function
parameters anyway, right?
This is also the function in which TH04 infamously crashes after the Stage 5
pre-boss dialog when playing with Reimu and without any active EMS driver.
That crash is what motivated this look into the games' EMS usage… but the
code looks perfectly fine? Oh well, guess the crash is not related to EMS
then. Next u–
OK, of course I can't leave it like that. Everyone is expecting a fix now,
and I still got half of a push left over after decompiling the regular EMS
code. Also, I've now RE'd every function that could possibly be involved in
the crash, and this is very likely to be the last time I'll be looking at
Turns out that the bug has little to do with EMS, and everything to do with
ZUN limiting the amount of conventional RAM that TH04's
MAIN.EXE is allowed to use, and then slightly miscalculating
this upper limit. Playing Stage 5 with Reimu is the most asset-intensive
configuration in this game, due to the combination of
6 player portraits (Marisa has only 5), at 128×128 pixels each
a 288×256 background for the boss fight, tied in size only with the
ones in the Extra Stage
the additional 96×80 image for the vertically scrolling stars during
the stage, wastefully stored as 4 bitplanes rather than a single one.
This image is never freed, not even at the end of the stage.
Remove any single one of the above points, and this crash would have never
occurred. But with all of them combined, the total amount of memory consumed
by TH04's MAIN.EXE just barely exceeds ZUN's limit of 320,000
bytes, by no more than 3,840 bytes, the size of the star image.
But wait: As we established earlier, EMS does nothing to reduce the amount
of conventional memory used by the game. In fact, if you disabled TH04's EMS
handling, you'd still get this crash even if you are running an EMS
driver and loaded DOS into the High Memory Area to free up as much
conventional RAM as possible. How can EMS then prevent this crash in the
The answer: It's only because ZUN's usage of EMS bypasses the need to load
the cached images back out of the XOR-encrypted 東方幻想.郷
packfile. Leaving aside the general
stupidity of any game data file encryption*, master.lib's decryption
implementation is also quite wasteful: It uses a separate buffer that
receives fixed-size chunks of the file, before decrypting every individual
byte and copying it to its intended destination buffer. That really
resembles the typical slowness of a C fread() implementation
more than it does the highly optimized ASM code that master.lib purports to
be… And how large is this well-hidden decryption buffer? 4 KiB.
So, looking back at the game, here is what happens once the Stage 5
pre-battle dialog ends:
Reimu's bomb background image, which was previously freed to make space
for her dialog portraits, has to be loaded back into conventional memory
BB0.CDG is found inside the 東方幻想.郷
file_ropen() ends up allocating a 4 KiB buffer for the
encrypted packfile data, getting us the decisive ~4 KiB closer to the memory
The .CDG loader tries to allocate 52 608 contiguous bytes for the
pixel data of Reimu's bomb image
This would exceed the memory limit, so hmem_allocbyte()
fails and returns a nullptr
ZUN doesn't check for this case (as usual)
The pixel data is loaded to address 0000:0000,
overwriting the Interrupt Vector Table and whatever comes after
The game crashes
The 4 KiB encryption buffer would only be freed by the corresponding
file_close() call, which of course never happens because the
game crashes before it gets there. At one point, I really did suspect the
cause to be some kind of memory leak or fragmentation inside master.lib,
which would have been quite delightful to fix.
Instead, the most straightforward fix here is to bump up that memory limit
by at least 4 KiB. Certainly easier than squeezing in a
cdg_free() call for the star image before the pre-boss dialog
without breaking position dependence.
Or, even better, let's nuke all these memory limits from orbit
because they make little sense to begin with, and fix every other potential
out-of-memory crash that modders would encounter when adding enough data to
any of the 4 games that impose such limits on themselves. Unless you want to
launch other binaries (which need to do their own memory allocations) after
launching the game, there's really no reason to restrict the amount of
memory available to a DOS process. Heck, whenever DOS creates a new one, it
assigns all remaining free memory by default anyway.
Removing the memory limits also removes one of ZUN's few error checks, which
end up quitting the game if there isn't at least a given maximum amount of
conventional RAM available. While it might be tempting to reserve enough
memory at the beginning of execution and then never check any allocation for
a potential failure, that's exactly where something like TH04's crash
This game is also still running on DOS, where such an initial allocation
failure is very unlikely to happen – no one fills close to half of
conventional RAM with TSRs and then tries running one of these games. It
might have been useful to detect systems with less than 640 KiB of
actual, physical RAM, but none of the PC-98 models with that little amount
of memory are fast enough to run these games to begin with. How ironic… a
place where ZUN actually added an error check, and then it's mostly
Wow, 31 commits in a single push? Well, what the last push had in
progress, this one had in maintenance. The
📝 master.lib header transition absolutely
had to be completed in this one, for my own sanity. And indeed,
it reduced the build time for the entirety of ReC98 to about 27 seconds on
my system, just as expected in the original announcement. Looking forward
to even faster build times with the upcoming #include
improvements I've got up my sleeve! The port authors of the future are
going to appreciate those quite a bit.
As for the new translation units, the funniest one is probably TH05's
function for blitting the 1-color .CDG images used for the main menu
options. Which is so optimized that it becomes decompilable again,
by ditching the self-modifying code of its TH04 counterpart in favor of
simply making better use of CPU registers. The resulting C code is still a
mess, but what can you do.
This was followed by even more TH05 functions that clearly weren't
compiled from C, as evidenced by their padding
bytes. It's about time I've documented my lack of ideas of how to get
those out of Turbo C++.
And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…
In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?
Turns out that TH04's player selection menu is exactly three times as
complicated as TH05's. Two screens for character and shot type rather than
one, and a way more intricate implementation for saving and restoring the
background behind the raised top and left edges of a character picture
when moving the cursor between Reimu and Marisa. TH04 decides to backup
precisely only the two 256×8 (top) and 8×244 (left) strips behind the
edges, indicated in red in the picture
These take up just 4 KB of heap memory… but require custom blitting
functions, and expanding this explicitly hardcoded approach to TH05's 4
characters would have been pretty annoying. So, rather than, uh, not
explicitly hardcoding it all, ZUN decided to just be lazy with the backup
area in TH05, saving the entire 640×400 screen, and thus spending 128 KB
of heap memory on this rather simple selection shadow effect.
So, this really wasn't something to quickly get done during the first half
of a push, even after already having done TH05's equivalent of this menu.
But since life is very busy right now, I also used the occasion to start
addressing another code organization annoyance: master.lib's single master.h header file.
Now that ReC98 is trying to develop (or at least mimic) a more
type-safe C++ foundation to model the PC-98 hardware, a pure C header
(with counter-productive C++ extensions) is becoming increasingly
unidiomatic. By moving some of the original assumptions about function
parameters into the type system, we can also reduce the reliance on its
Japanese-only documentation without having to translate it
It's quite bloated, with at least 2800 lines of code that
currently are #included into the vast majority of files, not
counting master.h's recursively included C standard library
headers. PC-98 Touhou only makes direct use of a rather small fraction of
And finally, all the DOS/V compatibility definitions are especially
useless in the context of ReC98. As I've noted
📝 time and
📝 time again, porting PC-98 Touhou to
IBM-compatible DOS won't be easy, and MASTER_DOSV won't be
helping much. Therefore, my upstream version of ReC98 will never include
all of master.lib. There's no point in lengthening compile times for
everyone by default, and those will be getting quite noticeable
after moving to a full 16-bit build process.
(Actually, what retro system ports should rather be doing: Get rid
of master.lib's original ASM code, replace it with
C++, and then simply convert the optimized assembly output of modern
compilers to your ISA of choice. Improving the landscape of such
assembly or object file converters would benefit everyone!)
So, time to start a new master.hpp header that would contain
just the declarations from master.h that PC-98 Touhou
actually needs, plus some semantic (yes, semantic) sugar. Comparing just
the old master.h to just the new master.hpp
after roughly 60% of the transition has been completed, we get median
build times of 319 ms for master.h, and 144 ms for
master.hpp on my (admittedly rather slow) DOSBox setup.
As of this push, ReC98 consists of 107 translation units that have to be
compiled with Turbo C++ 4.0J. Fully rebuilding all of these currently
takes roughly 37.5 seconds in DOSBox. After the transition to
master.hpp is done, we could therefore shave some 10 to 15
seconds off this time, simply by switching header files. And that's just
the beginning, as this will also pave the way for further
#include optimizations. Life in this codebase will be great!
Unfortunately, there wasn't enough time to repay some of the actual
technical debt I was looking forward to, after all of this. Oh well, at
least we now also have nice identifiers for the three different boldface
options that are used when rendering text to VRAM, after procrastinating
that issue for almost 11 months. Next up, assuming the existing
subscriptions: More ridiculous decompilations of things that definitely
weren't originally written in C, and a big blocker in TH03's