And once again, the Shuusou Gyoku task was too complex to be satisfyingly solved within a single month. Even just finding provably correct loop sections in both the original and arranged MIDI files required some rather involved detection algorithms. I could have just defined what sounded like correct loops, but the results of these algorithms were quite surprising indeed. Turns out that not even Seihou is safe from ZUN quirks, and some tracks technically loop much later than you'd think they do, or don't loop at all. And since I then wanted to put these MIDI loops back into the game to ensure perfect synchronization between the recordings and MIDI versions, I ended up rewriting basically all the MIDI code in a cross-platform way. This rewrite also uncovered a pbg bug that has traveled from Shuusou Gyoku into Windows Touhou, where it survived until ZUN ultimately removed all MIDI code in TH11 (!)…
Fortunately, the backlog still had enough general PC-98 Touhou funds that I could spend on picking some soon-important low-hanging fruit, giving me something to deliver for the end of the month after all. TH04 and TH05 use almost identical code for their main/option menus, so decompiling it would make number go up quite significantly and the associated blog post won't be that long…
Wait, what's this, a bug report from touhou-memories concerning the website?
Tab switchers tended to break on certain Firefox versions, and
video playback didn't work on Microsoft Edge at all?
Those are definitely some high-priority bugs that demand immediate attention.
The tab switcher issue was easily fixed by replacing the previous z-index trickery with a more robust solution involving the hidden attribute. The second one, however, is much more aggravating, because video playback on Edge has been broken ever since I 📝 switched the preferred video codec to AV1.
This goes so far beyond not supporting a specific codec. Usually, unsupported codecs aren't supposed to be an issue: As soon as you start using the HTML <video> tag, you'll learn that not every browser supports all codecs. And so you set up an encoding pipeline to serve each video in a mix of new and ancient formats, put the <source> tag of the most preferred codec first, and rest assured that browsers will fall back on the best-supported option as necessary. Except that Edge doesn't even try, and insists on staying on a non-playing AV1 video. 🙄
The codecs parameter for the <source> type attribute was the first potential solution I came across. Specifying the video codec down to the finest encoding details right in the HTML markup sounds like a good idea, similar to specifying sizes of images and videos to prevent layout reflows on long pages during the initial page load. So why was this the first time I heard of this feature? The fact that there isn't a simple ffprobe -show_html_codecs_string command to retrieve this string might already give a clue about how useful it is in practice. Instead, you have to manually piece the string together by grepping your way through all of a video's metadata…
…and then it still doesn't change anything about Edge's behavior, even when also specifying the string for the VP9 and VP8 sources. Calling the infamously ridiculous HTMLMediaElement.canPlayType() method with a representative parameter of "video/webm; codecs=av01.1.04M.08.0.000.01.13.00.0" explains why: Both the AV1-supporting Chrome and Edge return "probably", but only the former can actually play this format. 🤦
Not to mention that the favorite video app part is obviously a lie as a lot of widely preferred Windows video apps are bundled with their own codecs, and have probably long supported AV1.
In the end, there's no way around the utter desperation move of removing the AV1 <source> for Edge users. Serving each video in two other formats means that we can at least do something here – try visiting the GitHub release page of the P0234-1 TH01 Anniversary Edition build in Edge and you also don't get to see anything, because that video uses AV1 and GitHub understandably doesn't re-encode every uploaded video into a variety of old formats.
Just for comparison, I tried both that page and the ReC98 blog on an old Android 6 phone from 2014, and even that phone picked and played the AV1 videos with the latest available Chrome and Firefox versions. This was the phone whose available Firefox version didn't support VP9 in 2019, which was my initial reason for adding the VP8 versions. Looks like it's finally time to drop those… 🤔 Maybe in the far future once I start running out of space on this server.
Removing the <source> tags can be done in one of two places:
server-side, detecting Edge via the User-Agent header, or
I went with 2) because more dynamic server-side code would only move us further away from static site generation, which would make a lot of sense as the next evolutionary step in the architecture of this website. The client-side solution is much simpler too, and we can defer the deletion until a user actually hovers over a specific video.
And while we're at it, let's also add a popup complaining about this whole state of affairs. Edge is heavily marketed inside Windows as "the modern browser recommended by Microsoft", and you sure wouldn't expect low-quality chroma-subsampled VP9 from such a tagline. With such a level of anti-support for AV1, Edge users deserve to know exactly what's going on, especially since this post also explains what they will encounter on other websites.
Alright, where was I? For TH01, the main menu was the last thing I decompiled before the 100% finalization mark, so it's rather anticlimactic to already cover the TH04/TH05 one now, with both of the games still being very far away from 100%, just because people will soon want to translate the description text in the bottom-right corner of the screen. But then again, the ZUN Soft logo animation would make for an even nicer final piece of decompiled code, especially since the bouncing-ball logo from TH01, TH02, and TH03 was the very first decompilation I did, all the way back in 2015.
The code quality of ZUN's VRAM-based menus has barely increased between TH01 and TH05. Both the top-level and option menu still need to know the bounding rectangle of the other one to unblit the right pixels when switching between the two. And since ZUN sure loved hardcoded and copy-pasted numbers in the PC-98 days, the coordinates both tend to be excessively large, and excessively wrong. Luckily, each menu item comes with its own correct unblitting rectangle, which avoids any graphical glitches that would otherwise occur.
As for actual observable quirks and bugs, these menus only contain one of each, and both are exclusive to TH04:
Quitting out of the Music Room moves the cursor to the Start option. In TH05, it stays on Music Room.
Changing the S.E. mode seems to do nothing within TH04's menus, and would only take effect if you also change the Music mode afterward, or launch into the game.
Now that 100% finalization of their OP.EXE binaries is within reach, all this bloat made me think about the viability of a 📝 single-executable build for TH04's and TH05's debloated and anniversary versions. It would be really nice to have such a build ready before I start working on the non-ASCII translations – not just because they will be based on the anniversary branch by default, but also because it would significantly help their development if there are 4 fewer executables to worry about.
However, it's not as simple for these games as it was for TH01. The unique code in their OP.EXE and MAINE.EXE binaries is much larger than Borland's easily removed C++ exception handler, so I'd have to remove a lot more bloat to keep the resulting single binary at or below the size of the original MAIN.EXE. But I'm sure going to try.
Speaking of code that can be debloated for great effect: The second push of this delivery focused on the first-launch sound setup menu, whose BGM and sound effect submenus are almost complete code duplicates of each other. The debloated branch could easily remove more than half of the code in there, yielding another ≈800 bytes in case we need them.
If hex-editing MIKO.CFG is more convenient for you than deleting that file, you can set its first byte to FF to re-trigger this menu. Decompiling this screen was not only relevant now because it contains text rendered with font ROM glyphs and it would help dig our way towards more important strings in the data segment, but also because of its visual style. I can imagine many potential mods that might want to use the same backgrounds and box graphics for their menus.
With the two submenus being shown in a fixed sequence, there's not a lot of room for the code to do anything wrong, and it's even more identical between the two games than the main menu already was. Thankfully, ZUN just reblits the respective options in the new color when moving the cursor, with no 📝 palette tricks. TH04's background image only uses 7 colors, so he could have easily reserved 3 colors for that. In exchange, the TH05 image gets to use the full 16 colors with no change to the code.
Rounding out this delivery, we also got TH05's rolling Yin-Yang Orb animation before the title screen… and it's just more bloat and landmines on a smaller scale that might be noticeable on slower PC-98 models. In total, there are three unnecessary inter-page copies of the entire VRAM that can easily insert lag frames, and two minor page-switching landmines that can potentially lead to tearing on the first frame of the roll or fade animation. Clearly, ZUN did not have smoothness or code quality in mind there, as evidenced by the fact that this animation simply displays 8 .PI files in sequence. But hey, a short animation like this is 📝 another perfectly appropriate place for a quick-and-dirty solution if you develop with a deadline.
And that's 1.30% of all PC-98 Touhou code finalized in two pushes! We're slowly running out of these big shared pieces of ASM code…
I've been neglecting TH03's OP.EXE quite a bit since it simply doesn't contain any translatable plaintext outside the Music Room. All menu labels are gaiji, and even the character selection menu displays its monochrome character names using the 4-plane sprites from CHNAME.BFT. Splitting off half of its data into a separate .ASM file was more akin to getting out a jackhammer to free up the room in front of the third remaining Music Room, but now we're there, and I can decompile all three of them in a natural way, with all referenced data.
Next up, therefore: Doing just that, securing another important piece of text for the upcoming non-ASCII translations and delivering another big piece of easily finalized code. I'm going to work full-time on ReC98 for almost all of December, and delivering that and the Shuusou Gyoku SC-88Pro recording BGM back-to-back should free up about half of the slightly higher cap for this month.
And we're back to PC-98 Touhou for a brief interruption of the ongoing Shuusou Gyoku Linux port.
Let's clear some of the Touhou-related progress from the backlog, and use
the unconstrained nature of these contributions to prepare the
📝 upcoming non-ASCII translations commissioned by Touhou Patch Center.
The current budget won't cover all of my ambitions, but it would at least be
nice if all text in these games was feasibly translatable by the time I
officially start working on that project.
At a little over 3 pushes, it might be surprising to see that this took
longer than the
📝 TH03/TH04/TH05 cutscene system. It's
obvious that TH02 started out with a different system for in-game dialog,
but while TH04 and TH05 look identical on the surface, they only
actually share 30% of their dialog code. So this felt more like decompiling
2.4 distinct systems, as opposed to one identical base with tons of
game-specific differences on top.
The table of contents was pretty popular last time around, so let's have
Let's start with the ones from TH04 and TH05, since they are not that
broken. For TH04, ZUN started out by copy-pasting the cutscene system,
causing the result to inherit many of the caveats I already described in the
cutscene blog post:
It's still a plaintext format geared exclusively toward full-width
The parser still ignores all whitespace, forcing ASCII text into hacks
with unassigned Shift-JIS lead bytes outside the second byte of a 2-byte
Commands are still preceded by a 0x5C byte, which renders
as either a \ or a ¥ depending on your font and
interpretation of Shift-JIS.
Command parameters are parsed in exactly the same way, with all the same
A lot of the same script commands are identical, including 7 of them
that were not used in TH04's original dialog scripts.
Then, however, he greatly simplified the system. Mainly, this was done by
moving text rendering from the PC-98 graphics chip to the text chip, which
avoids the need for any text-related unblitting code, but ZUN also added a
bunch of smaller changes:
The player must advance through every dialog box by releasing any held
keys and then pressing any key mapped to a game action. There are no
The delay for every 2 bytes of text was doubled to 2 frames, and can't
Instead of holding ESC to fast-forward, pressing any key
will immediately print the entire rest of a text box.
Dialogs run in their own single-buffered frame loop, interrupting the
rest of the game. The other VRAM page keeps the background pixels required
for unblitting the face images.
All script commands that affect the graphics layer are preceded by a
1-frame delay. ZUN most likely did this because of the single-buffered
nature, as it prevents tearing on the first frame by waiting for the CRT
beam to return to the top-left corner before changing any pixels.
Both boxes are intended to contain up to 30 half-width characters on
each of their up to 3 lines, but nothing in the code enforces these limits.
There is no support for automatic line breaks or starting new boxes.
TH05 then moved from TH04's plaintext scripts to the binary
.TX2 format while removing all the unused commands copy-pasted
from the cutscene system. Except for a
single additional command intended to clear a text box, TH05's dialog
system only supports a strict subset of the features of TH04's system.
This change also introduced the following differences compared to TH04:
The game now stores the dialog of all 4 playable characters in the same
file, with a (4 + 1)-word header that indicates the byte offset
and length of each character's script. This way, it can load only the one
script for the currently played character.
Since there is no need for whitespace in a binary format, you can now
use ASCII 0x20 spaces even as the first byte of a 2-byte text
When starting a new dialog box, any existing text in the other box is
now colored blue.
Thanks to ZUN messing up the return values of the command-interpreting
switch function, you can effectively use only line break and gaiji commands in the middle of text. All other
commands do execute, but the interpreter then also treats their command byte
as a Shift-JIS lead byte and places it in text RAM together with whatever
other byte follows in the script.
This is why TH04 can and does put its \= commandsinto the boxes
started with the 0 or 1 commands, but TH05 has to
put its 0x02 commands before the equivalent 0x0D.
For modding these files, you probably want to use TXDEF from
-Tom-'s MysticTK. It decodes these
files into a text representation, and its encoder then takes care of the
character-specific byte offsets in the 10-byte header. This text
representation simplifies the format a lot by avoiding all corner cases and
landmines you'd experience during hex-editing – most notably by interpreting
the box-starting 0x0D as a
command to show text that takes a string parameter, avoiding the broken
calls to script commands in the middle of text. However, you'd still have to
manually ensure an even number of bytes on every line of text.
In the entry function of TH05's dialog loop, we also encounter the hack that
is responsible for properly handling
📝 ZUN's hidden Extra Stage replay. Since the
dialog loop doesn't access the replay inputs but still requires key presses
to advance through the boxes, ZUN chose to just skip the dialog altogether in the
specific case of the Extra Stage replay being active, and replicated all
sprite management commands from the dialog script by just hardcoding
And you know what? Not only do I not mind this hack, but I would have
preferred it over the actual dialog system! The aforementioned sprite
management commands effectively boil down to manual memory management,
deallocating all stage enemy and midboss sprites and thus ensuring that the
boss sprites end up at specific master.lib sprite IDs (patnums). The
hardcoded boss rendering function then expects these sprites to be available
at these exact IDs… which means that the otherwise hardcoded bosses can't
render properly without the dialog script running before them.
There is absolutely no excuse for the game to burden dialog scripts with
this functionality. Sure, delayed deallocation would allow them to blit
stage-specific sprites, but the original games don't do that; probably
because none of the two games feature an unblitting command. And even if
they did, it would have still been cleaner to expose the boss-specific
sprite setup as a single script command that can then also be called from
game code if the script didn't do so. Commands like these just are a recipe
for crashes, especially with parsers that expect fullwidth Shift-JIS
text and where misaligned ASCII text can easily cause these commands to be
But then again, it does make for funny screenshot material if you
accidentally the deallocation and then see bosses being turned into stage
With all the general details out of the way, here's the command reference:
Selects either the player character (0) or the boss (1) as the
currently speaking character, and moves the cursor to the beginning of
the text box. In TH04, this command also directly starts the new dialog
box, which is probably why it's not prefixed with a \ as it
only makes sense outside of text. TH05 requires a separate 0x0D command to do the
Replaces the face portrait of the currently active speaking
character with image #1 within her .CD2
Removes the face portrait from the currently active text box.
0x03 filename 0x00
Calls master.lib's super_entry_bfnt() function, which
loads sprites from a BFNT file to consecutive IDs starting at the
current patnum write cursor.
Deallocates all face portrait images.
The game automatically does this at the end of each dialog sequence.
However, ZUN wanted to load Stage 6 Yuuka's 76 KiB of additional
animations inside the script via \l, and would have once again
run up against the master.lib heap size limit without that extra free
0x05 filename 0x00
Stops the currently playing BGM, loads a new one from the given
file, and starts playback.
0x05 $ 0x00
Stops the currently playing BGM.
Note that TH05 interprets $ as a null-terminated filename as
Restarts playback of the currently loaded BGM from the
Blits the master.lib patnum with the ID indicated by the third
parameter to the current VRAM page at the top-left screen position
indicated by the first two parameters.
Plays the sound effect with the given ID.
Sets palette brightness via master.lib's
palette_settone() to any value from 0 (fully black) to 200
(fully white). 100 corresponds to the palette's original colors.
Calls master.lib's palette_black_out() or
palette_black_in() to play a hardware palette fade
animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
Calls master.lib's palette_white_out() or
palette_white_in() to play a hardware palette fade
animation from or to white, spending roughly 1 frame on each of the 16 fade steps. The
TH05 version of 0x09 also clears the text in both boxes
before the animation.
Starts a new line by resetting the X coordinate of the TRAM cursor
to the left edge of the text area and incrementing the Y coordinate.
The new line will always be the next one below the last one that was
properly started, regardless of whether the text previously wrapped to
the next TRAM row at the edge of the screen.
Plays a blocking 8-frame screen shake
animation. Copy-pasted from the cutscene parser, but actually used right
at the end of the dialog shown before TH04's Bad Ending.
Shows the gaiji with the given ID from 0 to 255
at the current cursor position, ignoring the per-glyph delay.
Waits 0 frames (0 = forever) for any key
to be pressed before continuing script execution.
Takes the current dialog cursor as the top-left corner of a
240×48-pixel rectangle, and replaces all text RAM characters within that
rectangle with whitespace.
This is only used to clear the player character's text box before
Shinki's final いくよ‼ box. Shinki has two
consecutive text boxes in all 4 scripts here, and ZUN probably wanted to
clear the otherwise blue text to imply a dramatic pause before Shinki's
final sentence. Nice touch.
(You could, however, also use it after a
box-ending 0xFF command to mess with text RAM in
Quits the currently running loop. This returns from either the text
loop to the command loop, or it ends the dialog sequence by returning
from the command loop back to gameplay. If this stage of the game later
starts another dialog sequence, it will start at the next script
Like \#, but first waits for any key to be
Behaves like TH04's \$ in the text loop, and like
\# in the command loop. Hence, it's not possible in TH05 to
automatically end a text box and advance to the next one without waiting
for a key press.
Unused commands are in gray.
At the end of the day, you might criticize the system for how its landmines
make it annoying to mod in ASCII text, but it all works and does what it's
supposed to. ZUN could have written the cleanest single and central
Shift-JIS iterator that properly chunks a byte buffer into halfwidth and
fullwidth codepoints, and I'd still be throwing it out for the upcoming
non-ASCII translations in favor of something that either also supports UTF-8
or performs dictionary lookups with a full box of text.
The only actual bug can be found in the input detection, which once
again doesn't correctly handle the infamous key
up/key down scancode quirk of PC-98 keyboards. All it takes
is one wrongly placed input polling call, and suddenly you have to think
about how the update cycle behind the PC-98 keyboard state bytes
might cause the game to run the regular 2-frame delay for a single
2-byte chunk of text before it shows the full text of a box is shown after
all… But even this bug is highly theoretical and could probably only be
observed very, very rarely, and exclusively on real hardware.
The same can't be said about TH02 though, but more on that later. Let's
first take a look at its data, which started out much simpler in that game.
The STAGE?.TXT files contain just raw Shift-JIS text with no
trace of commands or structure. Turning on the whitespace display feature in
your editor reveals how the dialog system even assumes a fixed byte
length for each box: 36 bytes per line which will appear on screen, followed
by 4 bytes of padding, which the original files conveniently use to visually
split the lines via a CR/LF newline sequence. Make sure to disable trimming
of trailing whitespace in your editor to not ruin the file when modding the
Consequently, everything else is hardcoded – every effect shown between text
boxes, the face portrait shown for each box, and even how many boxes are
part of each dialog sequence. Which means that the source code now contains
long hardcoded list of face IDs for most of the text boxes in the game,
with the rest being part of the
dedicated hardcoded dialog scripts for 2/3 of the
Without the restriction to a fixed set of scripting commands, TH02 naturally
gravitated to having the most varied dialog sequences of all PC-98 Touhou
games. This flexibility certainly facilitated Mima's grand entrance
animation in Stage 4, or the different lines in Stage 4 and 5 depending on
whether you already used a continue or not. Marisa's post-boss dialog even
inserts the number of continues into the text itself – by, you guessed it,
writing to hardcoded byte offsets inside the dialog text before printing it
to the screen. But once again, I have nothing to
criticize here – not even the fact that the alternate dialog scripts have to
mutate the "box cursor" to jump to the intended boxes within the file. I
know that some people in my audience like VMs, but I would have considered
it more bloated if ZUN had implemented a full-blown scripting
language just to handle all these special cases.
Another unique aspect of TH02 is the way it stores its face portraits, which
are infamous for how hard they are to find in the original data files. These
sprites are actually map tiles, stored in MIKO_K.MPN,
and drawn using the same functions used to blit the regular map tiles to the
📝 tile source area in VRAM. We can only guess
why ZUN chose this one out of the three graphics formats he used in TH02:
BFNT supports transparency, but sacrifices one of the 16 colors to do
so. ZUN only used 15 colors for the face portraits, but might have wanted to
keep open the option to use that 16th color. The detailed
backgrounds also suggest that these images were never supposed to be
transparent to begin with.
PI is used for all bigger and non-transparent images, but ZUN would have
had to write a separate small function to blit a 48×48 subsection of such an
image. That certainly wouldn't have stopped him in the TH01 days, but he
probably was already past that point by this game.
That only leaves .MPN. Sure, he did have to slice each face into 9
separate 16×16 "map" tiles to use this format, but that's a small price to
pay in exchange for not having to write any new low-level blitting code,
especially since he must have already had an asset pipeline to generate
And since you're certainly wondering about all these black tiles at the
edges: Yes, these are not only part of the file and pad it from the required
240×192 pixels to 256×256, but also kept in memory during a stage, wasting
9.5 KiB of conventional RAM. That's 172 seconds of potential input
replay data, just for those people who might still think that we need EMS
Alright, we've got the text, we've got the faces, let's slide in the box and
display it all on screen. Apparently though, we also have to blit the player
and option sprites using raw, low-level master.lib function calls in the
process? This can't be right, especially because ZUN
always blits the option sprite associated with the Reimu-A shot type,
regardless of which one the player actually selected. And if you keep moving
above the box area before the dialog starts, you get to see exactly how
wrong this is:
Let's look closer at Reimu's sprite during the slide-in animation, and in
the two frames before:
This one image shows off no less than 4 bugs:
ZUN blits the stationary player sprite here, regardless of whether the
player was previously moving left or right. This is a nice way of indicating
that Reimu stops moving once the dialog starts, but maybe ZUN should
have unblitted the old sprite so that the new one wouldn't have appeared on
top. The game only unblits the 384×64 pixels covered by the dialog box on
every frame of the slide-in animation, so Reimu would only appear correctly
if her sprite happened to be entirely located within that area.
All sprites are shifted up by 1 pixel in frame 2️⃣. This one is not a
bug in the dialog system, but in the main game loop. The game runs the
relevant actions in the following order:
Invalidate any map tiles covered by entities
Redraw invalidated tiles
Decrement the Y coordinate at the top of VRAM according to the
Update and render all game entities
Scroll in new tiles as necessary according to the scroll speed, and
report whether the game has scrolled one pixel past the end of the
If that happened, pretend it didn't by incrementing the value
calculated in #3 for all further frames and skipping to
Issue a GDC SCROLL command to reflect the line
calculated in #3 on the display
Wait for VSync
Flip VRAM pages
Start boss if we're past the end of the map
The problem here: Once the dialog starts, the game has already rendered
an entire new frame, with all sprites being offset by a new Y scroll
offset, without adjusting the graphics GDC's scroll registers to
compensate. Hence, the Y position in 3️⃣ is the correct one, and the
whole existence of frame 2️⃣ is a bug in itself. (Well… OK, probably a
quirk because speedrunning exists, and it would be pretty annoying to
synchronize any video regression tests of the future TH02 Anniversary
Edition if it renders one fewer frame in the middle of a stage.)
ZUN blits the option sprites to their position from frame 1️⃣. This
brings us back to
📝 TH02's special way of retaining the previous and current position in a two-element array, indexed with a VRAM page ID.
Normally, this would be equivalent to using dedicated prev and
cur structure fields and you'd just index it with the back page
for every rendering call. But if you then decide to go single-buffered for
dialogs and render them onto the front page instead…
Note that fixing bug #2 would not cancel out this one – the sprites would
then simply be rendered to their position in the frame before 1️⃣.
And of course, the fixed option sprite ID also counts as a bug.
As for the boxes themselves, it's yet another loop that prints 2-byte chunks
of Shift-JIS text at an even slower fixed interval of 3 frames. In an
interesting quirk though, ZUN assumes that every box starts with the name of
the speaking character in its first two fullwidth Shift-JIS characters,
followed by a fullwidth colon. These 6 bytes are displayed immediately at
the start of every box, without the usual delay. The resulting alignment
looks rather janky with Genjii, whose single right-padded 亀
kanji looks quite awkward with the fullwidth space between the name
and the colon. Kind of makes you wonder why ZUN just didn't spell out his
proper name, 玄爺, instead, but I get the stylistic
In Stage 4, the two-kanji assumption then breaks with Marisa's three-kanji
name, which causes the full-width colon to be printed as the first delayed
character in each of her boxes:
That's all the issues and quirks in the system itself. The scripts
themselves don't leave much room for bugs as they basically just loop over
the hardcoded face ID array at this level… until we reach the end of the
game. Previously, the slide-in animation could simply use the tile
invalidation and re-rendering system to unblit the box on each frame, which
also explained why Reimu had to be separately rendered on top. But this no
longer works with a custom-rendered boss background, and so the game just
chooses to flood-fill the area with graphics chip color #0:
For Mima's final defeat dialog though, ZUN chose to not even show the box.
He might have realized the issue by that point, or simply preferred the more
dramatic effect this had on the lines. The resulting issues, however, might
even have ramifications for such un-technical things as lore and
character dynamics. As it turns out, the code
for this dialog sequence does in fact render Mima's smiling face for all
boxes?! You only don't see it in the original game because it's rendered to
the other VRAM page that remains invisible during the dialog sequence:
However, Mima's smiling face jars with the words she says here. Adding
the face would deviate more significantly from the original game than
removing the player shot, item, bullet, or spark sprites would. It's
imaginable that ZUN just forgot about the dedicated code that
re-rendered just Mima to the back page, but the faces add
something to the dialog, and ZUN would have clearly noticed and
fixed it if their absence wasn't intended. Heck, ZUN might have just put
something related to Mima into the code because TH02's dialog system has
no way of not drawing a face for a dialog box. Filling the face
area with graphics chip color #0, as seen in the first and third boxes
of the Extra Stage pre-boss dialog, would have been an alternative, but
that would have been equally wrong with regard to the background.
Hence, the invisible face portrait from the original game is a ZUN
So, the future TH02 Anniversary Edition will fix the bug by showing
the back page, but retain the quirk by rewriting the dialog code to
not blit the face.
And with that, we've secured all in-game dialog for the upcoming non-ASCII
translations! The remaining 2/3 of the last push made
for a good occasion to also decompile the small amount of code related to
TH03's win messages, stored in the @0?TX.TXT files. Similar to
TH02's dialog format, these files are also split into fixed-size blocks of
3×60 bytes. But this time, TH03 loads all 60 bytes of a line, including the
CR/LF line breaking codepoints in the original files, into the statically
allocated buffer that it renders from. These control characters are then
only filtered to whitespace by ZUN's graph_putsa_fx() function.
If you remove the line breaks, you get to use the full 60 bytes on every
The final commits went to the MIKO.CFG loading and saving
functions used in TH04's and TH05's OP.EXE, as well as TH04's
game startup code to finally catch up with
📝 TH05's counterpart from over 3 years ago.
This brought us right in front of the main menu rendering code in both TH04
and TH05, which is identical in both games and will be tackled in the next
PC-98 Touhou delivery.
Next up, though: Returning to Shuusou Gyoku, and adding support for SC-88Pro
recordings as BGM. Which may or may not come with a slight controversy…
128 commits! Who would have thought that the ideal first release of the TH01
Anniversary Edition would involve so much maintenance, and raise so many
research questions? It's almost as if the real work only starts after
the 100% finalization mark… Once again, I had to steal some funding from the
reserved JIS trail word pushes to cover everything I liked to research,
which means that the next towards the
anything goal will repay this debt. Luckily, this doesn't affect any
immediate plans, as I'll be spending March with tasks that are already fully
So, how did this end up so massive? The list of things I originally set out
to do was pretty short:
Build entire game into single executable
Fix rendering issues in the one or two most important parts of the game
for a good initial impression
But even the first point already started with tons of little cleanup
commits. A part of them can definitely be blamed on the rush to hit the 100%
decompilation mark before the 25th anniversary last August.
However, all the structural changes that I can't commit to
master reveal how much of a mess the TH01 codebase actually
Merging the executables is mainly difficult because of all the
inconsistencies between REIIDEN.EXE and FUUIN.EXE.
The worst parts can be found in the REYHI*.DAT format code and
the High Score menu, but the little things are just as annoying, like how
the current score is an unsigned variable in
REIIDEN.EXE, but a signed one in FUUIN.EXE.
If it takes me this long and this many
commits just to sort out all of these issues, it's no wonder that the only
thing I've seen being done with this codebase since TH01's 100%
decompilation was a single porting attempt that ended in a rather quick
So why are we merging the executables in preparation for the Anniversary
Edition, and not waiting with it until we start doing ports?
Distributing and updating one executable is cleaner than doing the same
with three, especially as long as installation will still involve manually
dropping the new binary into the game directory.
The Anniversary Edition won't be the only fork binary. We are already
going to start out with a separate DEBLOAT.EXE that contains
only the bloat removal changes without any bug fixes, and spaztron64
will probably redo his seizure-less edition. We don't want to clutter
the game directory with three binaries for each of these fork builds, and we
especially don't want to remember things like oh, but this fork
only modifies REIIDEN.EXE…
All forks should run side-by-side with the original game. During the
time I was maintaining thcrap, I've had countless bug reports of people
assuming that thcrap was
responsible for bugs that were present in the original game, and the
same is certain to happen with the Anniversary Edition. Separate binaries
will make it easier for everyone to check where these bugs came from.
Also, I'd like to make a point about how bloated the original
three-executable structure really is, since I've heard people defending it
as neat software architecture. Really, even in Real Mode where you typically
want to use as little of the 640 KiB of conventional memory as possible, you
don't want to split your game up like this.
The game actually is so bloated that the combined binary ended up
smaller than the original REIIDEN.EXE. If all you see are the
file sizes of the original three executables, this might look like a
pretty impressive feat. Like, how can we possibly get 407,812
bytes into less than 238,612 bytes, without using compression?
If you've ever looked at the linker map though, it's not at all surprising.
Excluding the aforementioned inconsistencies that are hard to quantify,
OP.EXE and FUUIN.EXE only feature 5,767 and 6,475
bytes of unique code and data, respectively. All other code in these
binaries is already part of REIIDEN.EXE, with more than half of
the size coming from the Borland C++ runtime. The single worst offender here
is the C++ exception handler that Borland forces
onto every non-.COM binary by default, which alone adds 20,512 bytes
even if your binary doesn't use C++ exceptions.
On a more hilarious note, this
single line is responsible for pulling another unnecessary 14,242 bytes
into OP.EXE and FUUIN.EXE. This floating-point
multiplication is completely unnecessary in this context because all
possible parameters are integers, but it's enough for Turbo C++ and TLINK to
pull in the entire x87 FPU emulation machinery. These two binaries don't
even draw lines, but since this function is part of the general
graphics code translation unit and contains other functions that these
binaries do need, TLINK links in the entire thing. Maybe, multiple
executables aren't the best choice either if you use a linker that can't do
dead code elimination…
Since the 📝 Orb's physics do turn the entire
precision of a double variable into gameplay effects, it's not
feasible to ever get rid of all FPU code in TH01. The exception handler,
be removed, which easily brings the combined binary below the size of
the original REIIDEN.EXE. Compiling all code with a single set
of compiler optimization flags, including the more x86-friendly
pascal calling convention, then gets us a few more KB on top.
As does, of course, removing unused code: The only remaining purpose of
features such as 📝 resident palettes is to
potentially make porting more difficult for anyone who doesn't immediately
realize that nothing in the game uses these functions.
Technically, all unused code would be bloat, but for now, I'm keeping
the parts that may tell stories about the game's development history (such
as unused effects or the 📝 mouse cursor), or
that might help with debugging. Even with that in mind, I've only scratched
the surface when it comes to bloat removal, and the binary is only going to
get smaller from here. A lot smaller.
If only we now could start MDRV98 from this new combined binary, we wouldn't
need a second batch file either…
Which brings us to the first big research question of this delivery. Using
the C spawn() function works fine on this compiler, so
spawn("MDRV98.COM") would be all we need to do, right? Except
that the game crashes very soon after that subprocess returned.
So it's not going to be that easy if the spawned process is a TSR.
But why should this be a problem? Let's take a look at the DOS heap, and how
DOS lays out processes in conventional memory if we launch the game
regularly through GAME.BAT:
The batch file starts MDRV98 first, which will therefore end up below
the game in conventional memory. This is perfect for a TSR: The program can
resize itself arbitrarily before returning to DOS, and the rest of memory
will be left over for the game. If we assume such a layout, a DOS program
can implement a custom memory allocator in a very simple way, as it only has
to search for free memory in one direction – and this is exactly how Borland
implemented the C heap for functions like malloc() and
free(), and the C++ new and delete
But if we spawn MDRV98 after starting TH01, well…
MDRV98 will spawn in the next free memory location, allocate itself, return
to TH01… which suddenly finds its C heap blocked from growing. As a result,
the next big allocation will immediately fail with a rather misleading "out
of memory" error.
While DOS uses 16 bytes rather than Borland's 4 bytes for the control
structure of each memory block, this larger size automatically aligns all
allocations to 16-byte boundaries. Therefore, all allocation addresses would
fit into 16-bit segment-only pointers rather than needing 32-bit
far ones. On the Borland heap, the 4-byte header further limits
regular far pointers to 65,532 bytes, forcing you into
expensive huge pointers for bigger allocations.
Debuggers in DOS emulators typically have features to show and manage
the DOS heap. No need for custom debugging code.
You can change the memory placement
strategy to allocate from the top of conventional memory down to the
bottom. This is how the games allocate their resident structures.
Ultimately though, the drawbacks became too significant. Most of them are
related to the PC-98 Touhou games only ever creating a single DOS
process, even though they contain multiple executables.
Switching executables is done via exec(), which resizes a
program's main allocation to match the new binary and then overwrites the
old program image with the new one. If you've ever wondered why DOSBox-X
only ever shows OP as the active process name in the title bar,
you now know why. As far as DOS is concerned, it's still the same
OP.EXE process rooted at the same segment, and
exec() doesn't bother rewriting the name either. Most
importantly though, this is how REIIDEN.EXE can launch into
another REIIDEN.EXE process even if there are less than 238,612
bytes free when exec() is called, and without consuming more
memory for every successive binary.
For now, ANNIV.EXE still re-exec()s itself at
every point where the original game did, as ZUN's original code really
depends on being reinitialized at boss and scene boundaries. The resulting
accidental semi-hot reloading is also a useful property to retain
So why is the DOS heap a bad idea for regular game allocation after all?
Even DOS automatically releases all memory associated with a process
during its termination. But since we keep running the same process until the
player quits out of the main menu, we lose the C heap's implicit cleanup on
exec(), and have to manually free all memory ourselves.
Since the binary can be larger after hot reloading, we in fact have
to allocate all regular memory using the last fit strategy.
Otherwise, exec() fails to resize the program's main block for
the same reason that crashed the game on our initial attempt to
Just like Borland's heap implementation, the DOS heap stores its control
structures immediately before each allocation, forming a singly linked list.
But since the entire OS shares this single list, corruptions from heap
overflows also affect the whole system, and become much more disastrous.
Theoretically, it might be possible to recover from them by forcibly
releasing all blocks after the last correct one, or even by doing a
brute-force search for valid memory
control blocks, but in reality, DOS will likely just throw error code #7
(ERROR_ARENA_TRASHED) on the next memory management syscall,
forcing a reboot.
With a custom allocator, small corruptions remain isolated to the process.
They can be even further limited if the process adds some padding between
its last internal allocation and the end of the allocated DOS memory block;
Borland's heap sort of does this as well by always rounding up the DOS block
to a full KiB. All this might not make a difference in today's emulated and
single-tasked usage, but would have back then when software was still
developed inside IDEs running on the same system.
TH01's debug mode uses heapcheck() and
heapchecknode(), and reimplementing these on top of the DOS
heap is not trivial. On the contrary, it would be the most complicated part
of such a wrapper, by far.
I could release this DOS heap wrapper in unused form for another push if
anyone's interested, but for now, I'm pretty happy with not actually using
it in the games. Instead, let's stay with the Borland C heap, and find a way
to push MDRV98 to the very top of conventional RAM. Like this:
Which is much easier said than done. It would be nice if we could just use
the last fit allocation strategy here, but .COM executables always
receive all free memory by default anyway, which eliminates any difference
between the strategies.
But we can still change memory itself. So let's temporarily claim all
remaining free memory, minus the exact amount we need for MDRV98, for our
process. Then, the only remaining free space to spawn MDRV98 is at the exact
place where we want it to be:
Now we only need to know how much memory to not temporarily allocate. First,
we need to replicate the assumption that MDRV98's -M7
command-line parameter corresponds to a resident size of 23,552 bytes. This
is not as bad as it seems, because the -M parameter explicitly
has a KiB unit, and we can nicely abstract it away for the API.
The (env.) block though? Its minimum size equals the combined length
of all environment variables passed to the process, but its maximum size is…
not limited at all?! As in, DOS implementations can add and have
historically added more free space because some programs insisted on storing
their own new environment variables in this exact segment. DOSBox and
DOSBox-X follow this tradition by providing a configuration option for the
additional amount of environment space, with the latter adding 1024
additional bytes by default, y'know, just in case someone wants to compile
FreeDOS on a slow emulator. It's not even worth sending a bug report for
this specific case, because it's only a symptom of the fact that
unexpectedly large program environment blocks can and will happen, and are
to be expected in DOS land.
So thanks to this cruel joke, it's technically impossible to achieve what we
want to do there. Hooray! The only thing we can kind of do here is an
educated guess: Sum up the length of all environment variables in our
environment block, compare that length against the allocated size of the
block, and assume that the MDRV98 process will get as much additional memory
as our process got. 🤷
The remaining hurdles came courtesy of some Borland C runtime implementation
details. You would think that the temporary reallocation could even be done
in pure C using the sbrk(), coreleft(), and
brk() functions, but all values passed to or returned from
these functions are inaccurate because they don't factor in the
aforementioned KiB padding to the underlying DOS memory block. So we have to
directly use the DOS syscalls after all. Which at least means that learning
about them wasn't completely useless…
The final issue is caused inside Borland's
spawn() implementation. The environment block for the
child process is built out of all the strings reachable from C's
environ pointer, which is what that FreeDOS build process
should have used. Coalescing them into a single buffer involves yet
another C heap allocation… and since we didn't report our DOS memory block
manipulation back to the C heap, the malloc() call might think
it needs to request more memory from DOS. This resets the DOS memory block
back to its intended level, undoing our manipulation right before the actual
INT 21h, AH=4Bh
EXEC syscall. Or in short:
Manipulate DOS heap ➜ spawn() call ➜_LoadProg() ➜ allocate and prepare environment block ➜ _spawn() ➜ DOS EXEC syscall
The obvious solution: Replace _LoadProg(), implement the
coalescing ourselves, and do it before the heap manipulation. Fortunately,
Borland's internal low-level _spawn() function is not
static, so we can call it ourselves whenever we want to:
Allocate and prepare environment block ➜ manipulate DOS heap ➜ _spawn() call ➜EXEC syscall
So yes, launching MDRV98 from C can be done, but it involves advanced
witchcraft and is completely ridiculous.
Launching external sound drivers from a batch file is the right way
of doing things.
Fortunately, you don't have to rely on this auto-launching feature. You can
still launch DEBLOAT.EXE or ANNIV.EXE from a batch
file that launched MDRV98.COM before, and the binaries will
detect this case and skip the attempt of launching MDRV98 from C. It's
unlikely that my heuristic will ever break, but I definitely recommend
replicating GAME.BAT just to be completely sure – especially
for user-friendly repacks that don't want to include the original game
This is also why ANNIV.EXE doesn't launch
ZUNSOFT.COM: The "correct" and stable way to launch
ANNIV.EXE still involves a batch file, and I would say that
expecting people to remove ZUNSOFT.COM from that file is worse
than not playing the animation. It's certainly a debate we can have, though.
This deep dive into memory allocation revealed another previously
undocumented bug in the original game. The RLE decompression code for the
東方靈異.伝 packfile contains two heap overflows, which are
actually triggered by SinGyoku's BOSS1_3.BOS and Konngara's
BOSS8_1.BOS. They only do not immediately crash the game when
loading these bosses thanks to two implementation details of Borland's C
Obviously, this is a bug we should fix, but according to the definition of
bugs, that fix would be exclusive to the anniversary branch.
Isn't that too restrictive for something this critical? This code is
guaranteed to blow up with a different heap implementation, if only in a
Debug build. And besides, nobody would notice a fix
just by looking at the game's rendered output…
Looks like we have to introduce a fourth category of weird code, in addition
to the previous bloat, bug, and quirk categories, for
invisible internal issues like these. Let's call it landmine, and fix
them on the debloated branch as well. Thanks to
Clerish for the naming inspiration!
With this new category, the full definitions for all categories have become
quite extensive. Thus, they now live in CONTRIBUTING.md
inside the ReC98 repository.
With the new discoveries and the new landmine category, TH01 is now at 67
bugs and 20 landmines. And the solution for the landmine in question? Simplifying
the 61 lines of the original code down to 16. And yes, I'm including
comments in these numbers – if the interactions of the code are complex
enough to require multi-paragraph comments, these are a necessary and
valid part of the code.
While we're on the topic of weird code and its visible or invisible effects,
there's one thing you might be concerned about. With all the rearchitecting
and data shifting we're doing on the debloated branch, what
will happen to the 📝 negative glitch stages?
These are the result of a clearly observable bug that, by definition, must
not be fixed on the debloated branch. But given that the
observable layout of the glitch stages is defined by the memory
surrounding the scene stage variable, won't the
debloated branch inherently alter their appearance (= ⚠️
fanfiction ⚠️), or even remove them completely?
Well, yes, it will. But we can still preserve their layout by
the exact original data that the game would originally read, and even emulate
the original segment relocations and other pieces of global data.
Doing this is feasible thanks to the fact that there are only 4 glitch
stages. Unfortunately, the same can't be said for the timer values, which
are determined by an array lookup with the un-modulo'd stage ID. If we
wanted to preserve those as well, we'd have to bundle an exact copy of the
original REIIDEN.EXE data segment to preserve the values of all
32,768 negative stages you could possibly enter, together with a map
of all relocations in this segment. 😵 Which I've decided against for now,
since this has been going on for far too long already. Let's first see if
anyone ever actually complains about details like this…
Alright, time to start the anniversary branch by rendering
everything at its correct internal unaligned X position? Eh… maybe not quite
yet. If we just hacked all the necessary bit-shifting code into all the
format-specific blitting functions, we'd still retain all this largely
redundant, bad, and slow code, and would make no progress in terms of
portability. It'd be much better to first write a single generic blitter
that's decently optimized, but supports all kinds of sprites to make this
optimization actually worth something.
So, next research question: How would such a blitter look like? After I
learned during my
📝 first foray into cycle counting that port
I/O is slow on 486 CPUs, it became clear that TH04's
📝 GRCG batching for pellets was one of the
more useful optimizations that probably contributed a big deal towards
achieving the high bullet counts of that game. This leads to two
master.lib's super_*() sprite functions are slow, and not
worth looking at for inspiration. Even the 📝 tiny format reinitializes the GRCG on every color change, wasting 80
Hence, our low-level blitting API should not even care about colors. It
should only concern itself with blitting a given 1bpp sprite to a single
VRAM segment. This way, it can work for both 4-plane sprites and
single-plane sprites, and just assume that the GRCG is active.
What we should do, however, are some further general optimizations that I
would have expected from master.lib: Unrolling the vertical
loop, and baking a single function for every sprite width to eliminate
the horizontal loop. We can then use the widest possible x86
MOV instruction for the lowest possible number of cycles per
row – for example, we'd blit a 56-wide sprite with three MOVs
(32-bit + 16-bit + 8-bit), and a 64-wide one with two 32-bit
Or maybe not? There's a lot of blitting code in both master.lib and PC-98
Touhou that checks for empty bytes within sprites to skip needlessly writing
them to VRAM:
That's a single CPU instruction, compared to two instructions and two
branches. The only possible explanation for this would be that VRAM writes
are so slow on PC-98 that you'd want to avoid them at all costs, even
if that means additional branching on the CPU to do so. Or maybe that was
something you would want to do on certain models with slow VRAM, but not on
So I wrote a benchmark to answer all these questions, and to compare my new
blitter against typical TH01 blitting code:
Amount of frames required to render 2000 16×8 pellet sprites on a variety of
PC-98 models, using the new generic blitter. Both preshifted (first column)
and runtime-shifted (second column) sprites were tested; empty columns
correspond to times faster than a single frame. Thanks to cuba200611,
Shoutmon, cybermind, and Digmac for running the tests!
The key takeaways:
Checking for empty bytes has never been a good idea.
Preshifting sprites made a slight difference on the 286. Starting with
the 386 though, that difference got smaller and smaller, until it completely
vanished on Pentium models. The memory tradeoff is especially not worth it
for 4-plane sprites, given that you would have to preshift each of the 4
planes and possibly even a fifth alpha plane. Ironically, ZUN only ever
preshifted monochrome single-bitplane sprites with a width of 8 pixels.
That's the smallest possible amount of memory a sprite can possibly take,
and where preshifting consequently has the smallest effect on performance.
Shifting 8-wide sprites on the fly literally takes a single ROL
or ROR instruction per row.
You might want to use MOVS instead of MOV when
targeting the 286 and 386, but the performance gains are barely worth the
resulting mess you would make out of your blitting code. On Pentium models,
there is no difference.
Use the GRCG whenever you have to render lots of things that share a
static 8×1 pattern.
These are the PC-98 models that the people who are willing to test your
newly written PC-98 code actually use.
Since this won't be the only piece of game-independent and explicitly
PC-98-specific custom code involved in this delivery, it makes sense to
dedicated PC-98 platform layer. This code will gradually eliminate the
dependency on master.lib and replace it with better optimized and more
readable C++ code. The blitting benchmark, for example, is already
implemented completely without master.lib.
While this platform layer is mainly written to generate optimal code within
Turbo C++ 4.0J, it can also serve as general PC-98 documentation for
everyone who prefers code over machine-translating old Japanese books. Not
to mention the immediacy of having all actual relevant information in
one place, which might otherwise be pretty well hidden in these books, or
some obscure old text file. For example, did you know that uploading gaiji
via INT 18h might end up disabling the VSync interrupt trigger,
deadlocking the process on the next frame delay loop? This nuisance is not
replicated by any emulators, and it's quite frustrating to encounter it when
trying to run your code on real hardware. master.lib works around it by
simply hooking INT 18h and unconditionally reenabling the VSync
interrupt trigger after the original handler returns, and so does our
So, with the pellet draw calls batched and routed through the new renderer,
we should have gained enough free CPU cycles to disable
📝 interlaced pellet rendering without any
impact on frame rates?
Well, kinda. We do get 56.4 FPS, but only together with noticeable and
reproducible tearing in the top part of the playfield, suggesting exactly
why ZUN interlaced the rendering in the first place. 😕 So have we
already reached the limit of single-buffered PC-98 games here, or can we
still do something about it?
As it turns out, the main bottleneck actually lies in the pellet
unblitting code. Every EGC-"accelerated" unblitting call in TH01 is
as unbatched as the pellet blitting calls were, spending an additional 17
I/O port writes per call to completely set up and shut down the EGC, every
time. And since this is TH01, the two-instruction operation of changing the
active PC-98 VRAM page isn't inlined either, but instead done via a function
call to a faraway segment. On the 486, that's:
>341 cycles for EGC setup and teardown, plus
>72 cycles for each 16-pixel chunk to be unblitted.
This sums up to
>917 cycles of completely unnecessary work for every active pellet,
in the optimal 50% of cases where it lies on an even VRAM byte,
>1493 cycles if it lies on an odd VRAM byte, because ZUN's code
extends the unblitted rectangle to a gargantuan 32×8 pixels in this case
And this calculation even ignores the lack of small micro-optimizations that
could further optimize the blitting loop. Multiply that by the game's pellet
cap of 100, and we get a 6-digit number of wasted CPU cycles. On
paper, that's roughly 1/6 of the time we have for each
of our target 56.423 FPS on the game's target 33 MHz systems. Might not
sound all too critical, but the single-buffered nature of the game means
that we're effectively racing the beam on every frame. In turn, we have to
be even more serious about performance.
So, time to also add a batched EGC API to our PC-98 platform layer? Writing
our own EGC code presents a nice opportunity to finally look deeper into all
its registers and configuration options, and see what exactly we can do
about ZUN's enforced 16-pixel alignment.
To nobody's surprise, this alignment is completely unnecessary, and only
displays a lack of knowledge about the chip. While it is true that
the EGC wants VRAM to be exclusively addressed in 16-bit chunks at
16-bit-aligned addresses, it specifically provides
an address register (0x4AC) for shifting the horizontal
start offsets of the source and destination to any pixel within the
16 pixels of such a chunk, and
a bit length register (0x4AE) for specifying the total
width of pixels to be transferred, which also implies the correct end
And it gets even better: After ⌈bitlength ÷ 16⌉ write
instructions, the EGC's internal shifter state automatically reinitializes
itself in preparation for blitting another row of pixels with the same
initially configured bit addresses and length. This is perfect for blitting
rectangles, as two I/O port writes before the start of your blitting loop
are enough to define your entire rectangle.
The manual nature of reading and writing in 16-pixel chunks does come with a
slight pitfall though. If the source bit address is larger than the
destination bit address, the first 16-bit read won't fill the EGC's internal
shift register with all pixels that should appear in the first 16-pixel
destination chunk. In this case, the EGC simply won't write anything and
leave the first chunk unchanged. In a
📝 regular blitting loop, however, you expect
that memory to be written and immediately move on to the next chunks within
the row. As a result, the actual blitting process for such a rectangle will
no longer be aligned to the configured address and bit length. The first row
of the rectangle will appear 16 pixels to the right of the destination
address, and the second one will start at bit offset 0 with pixels from the
rightmost byte of the first line, which weren't blitted and remained in the
There is an easy solution though: Before the horizontal loop on each line of
the rectangle, simply read one additional 16-pixel chunk from the source
location to prefill the shift register. Thankfully, it's large enough to
also fit the second read of the then full 16 pixels, without dropping any
pixels along the way.
And that's how we get arbitrarily unaligned rectangle copies with the EGC!
Except for a small register allocation trick to use two-register addressing,
there's not much use in further optimizations, as the runtime of these
inter-page blit operations is dominated by the VRAM page switches anyway.
Except that T98-Next seems to disagree about the register prefilling issue:
Every other emulator agrees with real hardware in this regard, so we can
safely assume this to be a bug in T98-Next. Just in case this old emulator
with its last release from June 2010 still has any fans left nowadays… For
now though, even they can still enjoy the TH01 Anniversary Edition: The only
EGC copy algorithm that TH01 actually needs is the left one during the
single-buffered tests, which even that emulator gets right.
That only leaves
📝 my old offer of documenting the EGC raster ops,
and we've got the EGC figured out completely!
And that did in fact remove tearing from the pellet rendering function! For
the first time, we can now fight Elis, Kikuri, Sariel, and Konngara with a
doubled pellet frame rate:
With only pellets and no other animation on screen, this exact pattern
presents the optimal demonstration case for the new unblitter. But as you
can already tell from the invincibility sprites, we'd also need to route
every other kind of sprite through the same new code. This isn't all too
trivial: Most sprites are still rendered at byte-aligned positions, and
their blitting APIs hide that fact by taking a pixel position regardless.
This is why we can't just replace ZUN's original 16-pixel-aligned EGC
unblitting function with ours, and always have to replace both the blitter
and the unblitter on a per-sprite basis.
To completely remove all flickering, we'd also like to get rid of all the
sprite-specific unblit ➜ update ➜ render sequences, and instead
gather all unblitting code to the beginning of the game loop, before any
update and rendering calls. So yeah, it will take a long time to completely
get rid of all flickering. Until we're there, I recommend any backer to tell
me their favorite boss, so that I can focus on getting that one
rendered without any flickering. Remember that here at ReC98, we can have a
Touhou character popularity contest at any time during the year, whenever
the store is open!
In the meantime, the consistent use of 8×8 rectangles during pellet
unblitting does significantly reduce flickering across the entire game,
and shrinks certain holes that pellets tend to rip into lazily reblitted
To round out the first release, I added all the other bug fixes to achieve
parity with my previously released patched REIIDEN.EXE builds:
So here it is, the first build of TH01's Anniversary Edition:
2023-03-05-th01-anniv.zip Edit (2023-03-12): If you're playing on Neko Project and seeing more
flickering than in the original game, make sure you've checked the Screen
→ Disp vsync option.
Next up: The long overdue extended trip through the depths of TH02's
low-level code. From what I've seen of it so far, the work on this project
is finally going to become a bit more relaxing. Which is quite welcome
after, what, 6 months of stressful research-heavy work?
TH05 has passed the 50% RE mark, with both MAIN.EXE and the
game as a whole! With that, we've also reached what -Tom-
wanted out of the project, so he's suspending his discount offer for a
Curve bullets are now officially called cheetos! 76.7% of
fans prefer this term, and it fits into the 8.3 DOS filename scheme much
better than homing lasers (as they're called in
OMAKE.TXT) or Taito
lasers (which would indeed have made sense as well).
…oh, and I managed to decompile Shinki within 2 pushes after all. That
left enough budget to also add the Stage 1 midboss on top.
So, Shinki! As far as final boss code is concerned, she's surprisingly
economical, with 📝 her background animations
making up more than ⅓ of her entire code. Going straight from TH01's
📝 final📝 bosses
to TH05's final boss definitely showed how much ZUN had streamlined
danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there
is still room for improvement: TH05 not only
📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month,
but also uses them 4× as often, and even for midbosses. Most importantly
though, defining danmaku patterns using a single global instance of the
group template structure is just bad no matter how you look at it:
The script code ends up rather bloated, with a single MOV
instruction for setting one of the fields taking up 5 bytes. By comparison,
the entire structure for regular bullets is 14 bytes large, while the
template structure for Shinki's 32×32 ball bullets could have easily been
reduced to 8 bytes.
Since it's also one piece of global state, you can easily forget to set
one of the required fields for a group type. The resulting danmaku group
then reuses these values from the last time they were set… which might have
been as far back as another boss fight from a previous stage.
And of course, I wouldn't point this out if it
didn't actually happen in Shinki's pattern code. Twice.
Declaring a separate structure instance with the static data for every
pattern would be both safer and more space-efficient, and there's
more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow.
patternis significantly more complex than the others, but still
far from TH01's final bosses at their worst. I especially like the clear
architectural separation between "one-shot pattern" functions that return
true once they're done, and "looping pattern" functions that
run as long as they're being called from a boss's main function. Not many
all too interesting things in these pattern functions for the most part,
except for two pieces of evidence that Shinki was coded after Yumeko:
The gather animation function in the first two phases contains a bullet
group configuration that looks like it's part of an unused danmaku
pattern. It quickly turns out to just be copy-pasted from a similar function
in Yumeko's fight though, where it is turned into actual
As one of the two places where ZUN forgot to set a template field, the
lasers at the end of the white wing preparation pattern reuse the 6-pixel
width of Yumeko's final laser pattern. This actually has an effect on
gameplay: Since these lasers are active for the first 8 frames after
Shinki's wings appear on screen, the player can get hit by them in the last
2 frames after they grew to their final width.
Speaking about that wing sprite: If you look at ST05.BB2 (or
any other file with a large sprite, for that matter), you notice a rather
weird file layout:
And it's not a limitation of the sprite width field in the BFNT+ header
either. Instead, it's master.lib's BFNT functions which are limited to
sprite widths up to 64 pixels… or at least that's what
MASTER.MAN claims. Whatever the restriction was, it seems to be
completely nonexistent as of master.lib version 0.23, and none of the
master.lib functions used by the games have any issues with larger
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the
game that expects Shinki's winged form to consist of 4 physical
sprites, not just 1. Any conversion from another, more logical sprite sheet
layout back into BFNT+ must therefore replicate the original number of
sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly
loaded sprite no longer match ZUN's hardcoded IDs, causing the game to
crash. This is exactly what used to happen with -Tom-'s
MysticTK automation scripts,
which combined these exact sprites into a single large one. This issue has
now been fixed – just in case there are some underground modders out there
who used these scripts and wonder why their game crashed as soon as the
Shinki fight started.
And then the code quality takes a nosedive with Shinki's main function.
Even in TH05, these boss and midboss update
functions are still very imperative:
The origin point of all bullet types used by a boss must be manually set
to the current boss/midboss position; there is no concept of a bullet type
tracking a certain entity.
The same is true for the target point of a player's homing shots…
… and updating the HP bar. At least the initial fill animation is
abstracted away rather decently.
Incrementing the phase frame variable also must be done manually. TH05
even "innovates" here by giving the boss update function exclusive ownership
of that variable, in contrast to TH04 where that ownership is given out to
the player shot collision detection (?!) and boss defeat helper
Speaking about collision detection: That is done by calling different
functions depending on whether the boss is supposed to be invincible or
Timeout conditions? No standard way either, and all done with manual
if statements. In combination with the regular phase end
condition of lowering (mid)boss HP to a certain value, this leads to quite a
convoluted control flow.
The manual calls to the score bonus functions for cleared phases at least provide some sense of orientation.
One potentially nice aspect of all this imperative freedom is that
phases can end outside of HP boundaries… by manually incrementing the
phase variable and resetting the phase frame variable to 0.
The biggest WTF in there, however, goes to using one of the 16 state bytes
as a "relative phase" variable for differentiating between boss phases that
share the same branch within the switch(boss.phase)
statement. While it's commendable that ZUN tried to reduce code duplication
for once, he could have just branched depending on the actual
boss.phase variable? The same state byte is then reused in the
"devil" pattern to track the activity state of the big jerky lasers in the
second half of the pattern. If you somehow managed to end the phase after
the first few bullets of the pattern, but before these lasers are up,
Shinki's update function would think that you're still in the phase
before the "devil" pattern. The main function then sequence-breaks
right to the defeat phase, skipping the final pattern with the burning Makai
background. Luckily, the HP boundaries are far away enough to make this
impossible in practice.
The takeaway here: If you want to use the state bytes for your custom
boss script mods, alias them to your own 16-byte structure, and limit each
of the bytes to a clearly defined meaning across your entire boss script.
One final discovery that doesn't seem to be documented anywhere yet: Shinki
actually has a hidden bomb shield during her two purple-wing phases.
uth05win got this part slightly wrong though: It's not a complete
shield, and hitting Shinki will still deal 1 point of chip damage per
frame. For comparison, the first phase lasts for 3,000 HP, and the "devil"
pattern phase lasts for 5,800 HP.
And there we go, 3rd PC-98 Touhou boss
script* decompiled, 28 to go! 🎉 In case you were expecting a fix for
the Shinki death glitch: That one
is more appropriately fixed as part of the Mai & Yuki script. It also
requires new code, should ideally look a bit prettier than just removing
cheetos between one frame and the next, and I'd still like it to fit within
the original position-dependent code layout… Let's do that some other
Not much to say about the Stage 1 midboss, or midbosses in general even,
except that their update functions have to imperatively handle even more
subsystems, due to the relative lack of helper functions.
The remaining ¾ of the third push went to a bunch of smaller RE and
finalization work that would have hardly got any attention otherwise, to
help secure that 50% RE mark. The nicest piece of code in there shows off
what looks like the optimal way of setting up the
📝 GRCG tile register for monochrome blitting
in a variable color:
mov ah, palette_index ; Any other non-AL 8-bit register works too.
; (x86 only supports AL as the source operand for OUTs.)
rept 4 ; For all 4 bitplanes…
shr ah, 1 ; Shift the next color bit into the x86 carry flag
sbb al, al ; Extend the carry flag to a full byte
; (CF=0 → 0x00, CF=1 → 0xFF)
out 7Eh, al ; Write AL to the GRCG tile register
Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles
into a surprisingly nice one-liner. What a beautiful micro-optimization, at
a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there,
becoming increasingly dumb and undecompilable. Was it really necessary to
save 4 x86 instructions in the highly unlikely case of a new spark sprite
being spawned outside the playfield? That one 2D polar→Cartesian
conversion function then pointed out Turbo C++ 4.0J's woefully limited
support for 32-bit micro-optimizations. The code generation for 32-bit
📝 pseudo-registers is so bad that they almost
aren't worth using for arithmetic operations, and the inline assembler just
flat out doesn't support anything 32-bit. No use in decompiling a function
that you'd have to entirely spell out in machine code, especially if the
same function already exists in multiple other, more idiomatic C++
Rounding out the third push, we got the TH04/TH05 DEMO?.REC
replay file reading code, which should finally prove that nothing about the
game's original replay system could serve as even just the foundation for
community-usable replays. Just in case anyone was still thinking that.
Next up: Back to TH01, with the Elis fight! Got a bit of room left in the
cap again, and there are a lot of things that would make a lot of
TH04 would really enjoy a large number of dedicated pushes to catch up
with TH05. This would greatly support the finalization of both games.
Continuing with TH05's bosses and midbosses has shown to be good value
for your money. Shinki would have taken even less than 2 pushes if she
hadn't been the first boss I looked at.
Oh, and I also added Seihou as a selectable goal, for the two people out
there who genuinely like it. If I ever want to quit my day job, I need to
branch out into safer territory that isn't threatened by takedowns, after
TH03 finally passed 20% RE, and the newly decompiled code contains no
serious ZUN bugs! What a nice way to end the year.
There's only a single unlockable feature in TH03: Chiyuri and Yumemi as
playable characters, unlocked after a 1CC on any difficulty. Just like the
Extra Stages in TH04 and TH05, YUME.NEM contains a single
designated variable for this unlocked feature, making it trivial to craft a
fully unlocked score file without recording any high scores that others
would have to compete against. So, we can now put together a complete set
for all PC-98 Touhou games: 2021-12-27-Fully-unlocked-clean-score-files.zip
It would have been cool to set the randomly generated encryption keys in
these files to a fixed value so that they cancel out and end up not actually
encrypting the file. Too bad that TH03 also started feeding each encrypted
byte back into its stream cipher, which makes this impossible.
The main loading and saving code turned out to be the second-cleanest
implementation of a score file format in PC-98 Touhou, just behind TH02.
Only two of the YUME.NEM functions come with nonsensical
differences between OP.EXE and MAINL.EXE, rather
than 📝 all of them, as in TH01 or
📝 too many of them, as in TH04 and TH05. As
for the rest of the per-difficulty structure though… well, it quickly
becomes clear why this was the final score file format to be RE'd. The name,
score, and stage fields are directly stored in terms of the internal
REGI*.BFT sprite IDs used on the high score screen. TH03 also
stores 10 score digits for each place rather than the 9 possible ones, keeps
any leading 0 digits, and stores the letters of entered names in reverse
order… yeah, let's decompile the high score screen as well, for a full
understanding of why ZUN might have done all that. (Answer: For no reason at
And wow, what a breath of fresh air. It's surely not
good-code: The overlapping shadows resulting from using
a 24-pixel letterspacing with 32-pixel glyphs in the name column led ZUN to
do quite a lot of unnecessary and slightly confusing rendering work when
moving the cursor back and forth, and he even forgot about the EGC there.
But it's nowhere close to the level of jank we saw in
📝 TH01's high score menu last year. Good to
see that ZUN had learned a thing or two by his third game – especially when
it comes to storing the character map cursor in terms of a character ID,
and improving the layout of the character map:
That's almost a nicely regular grid there. With the question mark and the
double-wide SP, BS, and END options, the cursor
movement code only comes with a reasonable two exceptions, which are easily
handled. And while I didn't get this screen completely decompiled,
one additional push was enough to cover all important code there.
The only potential glitch on this screen is a result of ZUN's continued use
decimal digits without any bounds check or cap. Like the in-game HUD
score display in TH04 and TH05, TH03's high score screen simply uses the
next glyph in the character set for the most significant digit of any score
above 1,000,000,000 points – in this case, the period. Still, it only
really gets bad at 8,000,000,000 points: Once the glyphs are
exhausted, the blitting function ends up accessing garbage data and filling
the entire screen with garbage pixels. For comparison though, the current world record
is 133,650,710 points, so good luck getting 8 billion in the first
Next up: Starting 2022 with the long-awaited decompilation of TH01's Sariel
fight! Due to the 📝 recent price increase,
we now got a window in the cap that
is going to remain open until tomorrow, providing an early opportunity to
set a new priority after Sariel is done.
Didn't quite get to cover background rendering for TH05's Stage 1-5
bosses in this one, as I had to reverse-engineer two more fundamental parts
involved in boss background rendering before.
First, we got the those blocky transitions from stage tiles to bomb and
boss backgrounds, loaded from BB*.BB and ST*.BB,
respectively. These files store 16 frames of animation, with every bit
corresponding to a 16×16 tile on the playfield. With 384×368 pixels to be
covered, that would require 69 bytes per frame. But since that's a very odd
number to work with in micro-optimized ASM, ZUN instead stores 512×512
pixels worth of bits, ending up with a frame size of 128 bytes, and a
per-frame waste of 59 bytes. At least it was
possible to decompile the core blitting function as __fastcall
But wait, TH05 comes with, and loads, a bomb .BB file for every character,
not just for the Reimu and Yuuka bomb transitions you see in-game… 🤔
Restoring those unused stage tile → bomb image transition
animations for Mima and Marisa isn't that trivial without having decompiled
their actual bomb animation functions before, so stay tuned!
Interestingly though, the code leaves out what would look like the most
obvious optimization: All stage tiles are unconditionally redrawn
each frame before they're erased again with the 16×16 blocks, no matter if
they weren't covered by such a block in the previous frame, or are
going to be covered by such a block in this frame. The same is true
for the static bomb and boss background images, where ZUN simply didn't
write a .CDG blitting function that takes the dirty tile array into
account. If VRAM writes on PC-98 really were as slow as the games'
README.TXT files claim them to be, shouldn't all the
optimization work have gone towards minimizing them?
Oh well, it's not like I have any idea what I'm talking about here. I'd
better stop talking about anything relating to VRAM performance on PC-98…
Second, it finally was time to solve the long-standing confusion about all
those callbacks that are supposed to render the playfield background. Given
the aforementioned static bomb background images, ZUN chose to make this
needlessly complicated. And so, we have two callback function
pointers: One during bomb animations, one outside of bomb
animations, and each boss update function is responsible for keeping the
former in sync with the latter.
Other than that, this was one of the smoothest pushes we've had in a while;
the hardest parts of boss background rendering all were part of
📝 the last push. Once you figured out that
ZUN does indeed dynamically change hardware color #0 based on the current
boss phase, the remaining one function for Shinki, and all of EX-Alice's
background rendering becomes very straightforward and understandable.
Meanwhile, -Tom- told me about his plans to publicly
release 📝 his TH05 scripting toolkit once
TH05's MAIN.EXE would hit around 50% RE! That pretty much
defines what the next bunch of generic TH05 pushes will go towards:
bullets, shared boss code, and one
full, concrete boss script to demonstrate how it's all combined. Next up,
therefore: TH04's bullet firing code…? Yes, TH04's. I want to see what I'm
doing before I tackle the undecompilable mess that is TH05's bullet firing
code, and you all probably want readable code for that feature as
well. Turns out it's also the perfect place for Blue Bolt's
Technical debt, part 9… and as it turns out, it's highly impractical to
repay 100% of it at this point in development. 😕
The reason: graph_putsa_fx(), ZUN's function for rendering
optionally boldfaced text to VRAM using the font ROM glyphs, in its
ridiculously micro-optimized TH04 and TH05 version. This one sets the
"callback function" for applying the boldface effect by self-modifying
the target of two CALL rel16 instructions… because
there really wasn't any free register left for an indirect
CALL, eh? The necessary distance, from the call site to the
function itself, has to be calculated at assembly time, by subtracting the
target function label from the call site label.
This usually wouldn't be a problem… if ZUN didn't store the resulting
lookup tables in the .DATA segment. With code segments, we
can easily split them at pretty much any point between functions because
there are multiple of them. But there's only a single .DATA
segment, with all ZUN and master.lib data sandwiched between Borland C++'s
crt0 at the
top, and Borland C++'s library functions at the bottom of the segment.
Adding another split point would require all data after that point to be
moved to its own translation unit, which in turn requires
EXTERN references in the big .ASM file to all that moved
data… in short, it would turn the codebase into an even greater
Declaring the labels as EXTERN wouldn't work either, since
the linker can't do fancy arithmetic and is limited to simply replacing
address placeholders with one single address. So, we're now stuck with
this function at the bottom of the SHARED segment, for the
We can still continue to separate functions off the top of that segment,
though. Pretty much the only thing noteworthy there, so far: TH04's code
for loading stage tile images from .MPN files, which we hadn't
reverse-engineered so far, and which nicely fit into one of
Blue Bolt's pending ⅓ RE contributions. Yup, we finally moved
the RE% bars again! If only for a tiny bit.
Both TH02 and TH05 simply store one pointer to one dynamically allocated
memory block for all tile images, as well as the number of images, in the
data segment. TH04, on the other hand, reserves memory for 8 .MPN slots,
complete with their color palettes, even though it only ever uses the
first one of these. There goes another 458 bytes of conventional RAM… I
should start summing up all the waste we've seen so far. Let's put the
next website contribution towards a tagging system for these blog posts.
At 86% of technical debt in the SHARED segment repaid, we
aren't quite done yet, but the rest is mostly just TH04 needing to catch
up with functions we've already separated. Next up: Getting to that
practical 98.5% point. Since this is very likely to not require a full
push, I'll also decompile some more actual TH04 and TH05 game code I
previously reverse-engineered – and after that, reopen the store!
Now that's the amount of translation unit separation progress I was
looking for! Too bad that RL is keeping me more and more occupied these
days, and ended up delaying this push until 2021. Now that
Touhou Patch Center is also commissioning me to update their
infrastructure, it's going to take a while for ReC98 to return to full
speed, and for the store to be reopened. Should happen by April at the
With everything related to this separation of translation units explained
earlier, we've really got a push with nothing to talk about, this
time. Except, maybe, for the realization that
📝 this current approach might not be the
best fit for TH02 after all: Not only did it force us to
📝 throw away the previous decompilation of
the sound effect playback functions, but OP.EXE also contains
obviously copy-pasted code in addition to the common, shared set of
library functions. How was that game even built, originally??? No
way around compiling that one instance of the "delay until given BGM
measure" function separately then, if it insists on using its own
instance of the VSync delay function…
Oh well, this separated layout still works better for the later games, and
consistency is good. Smooth sailing with all of the other functions, at
Next up: One more of these, which might even end up completing the
📝 transition to our own master.lib header file.
In terms of the total number of ASM code left in the SHARED
code segments, we're now 30% done after 3 dedicated pushes. It really
shouldn't require 7 more pushes, though!
So, only one card-flipping function missing, and then we can start
decompiling TH01's two final bosses? Unfortunately, that had to be the one
big function that initializes and renders all gameplay objects. #17 on the
list of longest functions in all of PC-98 Touhou, requiring two pushes to
fully understand what's going on there… and then it immediately returns
for all "boss" stages whose number is divisible by 5, yet is still called
during Sariel's and Konngara's initialization 🤦
Oh well. This also involved the final file format we hadn't looked at
yet – the STAGE?.DAT files that describe the layout for all
stages within a single 5-stage scene. Which, for a change is a very
well-designed form– no, of course it's completely weird, what did you
expect? Development must have looked somewhat like this:
Weirdness #1: "Hm, the stage format should
include the file names for the background graphics and music… or should
it?" And so, the 22-byte header still references some music and
background files that aren't part of the final game. The game doesn't use
anything from there, and instead derives those file names from the
That's probably nothing new to anyone who has ever looked at TH01's data
files. In a slightly more interesting discovery though, seeing the
📝 .GRF extension, in some of the file names
that are short enough to not cut it off, confirms that .GRF was initially
used for background images. Probably before ZUN learned about .PI, and how
it achieves better compression than his own per-bitplane RLE approach?
Weirdness #2: "Hm, I might want to put
obstacles on top of cards?" You'd probably expect this format to
contain one single array for every stage, describing which object to place
on every 32×32 tile, if any. Well, the real format uses two arrays:
One for the cards, and a combined one for all "obstacles" – bumpers, bumper
bars, turrets, and portals. However, none of the card-flipping stages in
the final game come with any such overlaps. That's quite unfortunate, as it
would have made for some quite interesting level designs:
As you can see, the final version of the blitting code was not written
with such overlaps in mind either, blitting the cards on top of all
the obstacles, and not the other way round.
Weirdness #3: "In contrast to obstacles, of
which there are multiple types, cards only really need 1 bit. Time for some
bit twiddling!" Not the worst idea, given that the 640×336 playfield
can fit 20×10 cards, which would fit exactly into 25 bytes if you use a
single bit to indicate card or no card. But for whatever
reason, ZUN only stored 4 card bits per byte, leaving the other 4 bits
unused, and needlessly blowing up that array to 50 bytes. 🤷
Oh, and did I mention that the contents of the STAGE?.DAT files are
loaded into the main data segment, even though the game immediately parses
them into something more conveniently accessible? That's another 1250 bytes
of memory wasted for no reason…
Weirdness #4: "Hm, how about requiring the
player to flip some of the cards multiple times? But I've already written
all this bit twiddling code to store 4 cards in 1 byte. And if cards should
need anywhere from 1 to 4 flips, that would need at least 2 more bits,
which won't fit into the unused 4 bits either…" This feature
must have come later, because the final game uses 3 "obstacle" type
IDs to act as a flip count modifier for a card at the same relative array
position. Complete with lookup code to find the actual card index these
modifiers belong to, and ridiculous switch statements to not include
those non-obstacles in the game's internal obstacle array.
With all that, it's almost not worth mentioning how there are 12 turret
types, which only differ in which hardcoded pellet group they fire at a
hardcoded interval of either 100 or 200 frames, and that they're all
explicitly spelled out in every single switch statement. Or
how the layout of the internal card and obstacle SoA classes is quite
disjointed. So here's the new ZUN bugs you've probably already been
Cards and obstacles are blitted to both VRAM pages. This way, any other
entities moving on top of them can simply be unblitted by restoring pixels
from VRAM page 1, without requiring the stationary objects to be redrawn
from main memory. Obviously, the backgrounds behind the cards have to be
stored somewhere, since the player can remove them. For faster transitions
between stages of a scene, ZUN chose to store the backgrounds behind
obstacles as well. This way, the background image really only needs to be
blitted for the first stage in a scene.
All that memory for the object backgrounds adds up quite a bit though. ZUN
actually made the correct choice here and picked a memory allocation
function that can return more than the 64 KiB of a single x86 Real Mode
segment. He then accesses the individual backgrounds via regular array
subscripts… and that's where the bug lies, because he stores the returned
address in a regular far pointer rather than a
huge one. This way, the game still can only display a
total of 102 objects (i. e., cards and obstacles combined) per stage,
without any unblitting glitches.
What a shame, that limit could have been 127 if ZUN didn't needlessly
allocate memory for alpha planes when backing up VRAM content.
And since array subscripts on far pointers wrap around after
64 KiB, trying to save the background of the 103rd object is guaranteed to
corrupt the memory block header at the beginning of the returned segment.
When TH01 runs in debug mode, it
correctly reports a corrupted heap in this case.
After detecting such a corruption, the game loudly reports it by playing the
"player hit" sound effect and locking up, freezing any further gameplay or
rendering. The locking loop can be left by pressing ↵ Return, but the
game will simply re-enter it if the corruption is still present during the
next heapcheck(), in the next frame. And since heap
corruptions don't tend to repair themselves, you'd have to constantly hold
↵ Return to resume gameplay. Doing that could actually get you
safely to the next boss, since the game doesn't allocate or free any further
heap memory during a 5-stage card-flipping scene, and
just throws away its C heap when restarting the process for a boss. But then
again, holding ↵ Return will also auto-flip all cards on the way there…
Finally, some unused content! Upon discovering TH01's stage selection debug
feature, probably everyone tried to access Stage 21,
just to see what happens, and indeed landed in an actual stage, with a
black background and a weird color palette. Turns out that ZUN did
ship an unused scene in SCENE7.DAT, which is exactly what's
However, it's easy to believe that this is just garbage data (as I
initially did): At the beginning of "Stage 22", the game seems to enter an
infinite loop somewhere during the flip-in animation.
Well, we've had a heap overflow above, and the cause here is nothing but a
stack buffer overflow – a perhaps more modern kind of classic C bug,
given its prevalence in the Windows Touhou games. Explained in a few lines
int card_animation_frames; // even though there can be up to 200?!
int total_frames = 0;
(code that would end up resetting total_frames if it ever tried to reset
The number of cards in "Stage 22"? 76. There you have it.
But of course, it's trivial to disable this animation and fix these stage
transitions. So here they are, Stages 21 to 24, as shipped with the game
Wow, what a mess. All that was just a bit too much to be covered in two
pushes… Next up, assuming the current subscriptions: Taking a vacation with
one smaller TH01 push, covering some smaller functions here and there to
ensure some uninterrupted Konngara progress later on.
Done with the .BOS format, at last! While there's still quite a bunch of
undecompiled non-format blitting code left, this was in fact the final
piece of graphics format loading code in TH01.
📝 Continuing the trend from three pushes ago,
we've got yet another class, this time for the 48×48 and 48×32 sprites
used in Reimu's gohei, slide, and kick animations. The only reason these
had to use the .BOS format at all is simply because Reimu's regular
sprites are 32×32, and are therefore loaded from
📝 .PTN files.
Yes, this makes no sense, because why would you split animations for
the same character across two file formats and two APIs, just because
of a sprite size difference?
This necessity for switching blitting APIs might also explain why Reimu
vanishes for a few frames at the beginning and the end of the gohei swing
animation, but more on that once we get to the high-level rendering code.
Now that we've decompiled all the .BOS implementations in TH01, here's an
overview of all of them, together with .PTN to show that there really was
no reason for not using the .BOS API for all of Reimu's sprites:
Single-line and wave only
Per-file sprite limit
Pixels blitted at once
And even that last property could simply be handled by branching based on
the sprite width, and wouldn't be a reason for switching formats. But
well, it just wouldn't be TH01 without all that redundant bloat though,
The basic loading, freeing, and blitting code was yet another variation
on the other .BOS code we've seen before. So this should have caused just
as little trouble as the CBossAnim code… except that
CPlayerAnimdid add one slightly difficult function to
the mix, which led to it requiring almost a full push after all.
Similar to 📝 the unblitting code for moving lasers we've seen in the last push,
ZUN tries to minimize the amount of VRAM writes when unblitting Reimu's
slide animations. Technically, it's only necessary to restore the pixels
that Reimu traveled by, plus the ones that wouldn't be redrawn by
the new animation frame at the new X position.
The theoretically arbitrary distance between the two sprites is, of
course, modeled by a fixed-size buffer on the stack
, coming with the further assumption that the
sprite surely hasn't moved by more than 1 horizontal VRAM byte compared to
the last frame. Which, of course, results in glitches if that's not the
case, leaving little Reimu parts in VRAM if the slide speed ever exceeded
8 pixels per frame. (Which it never does,
being hardcoded to 6 pixels, but still.). As it also turns out, all those
bit masking operations easily lead to incredibly sloppy C code.
Which compiles into incredibly terrible ASM, which in turn might end up
wasting way more CPU time than the final VRAM write optimization would
have gained? Then again, in-depth profiling is way beyond the scope of
this project at this point.
Next up: The TH04 main menu, and some more technical debt.
So, TH05 OP.EXE. The first half of this push started out
nicely, with an easy decompilation of the entire player character
selection menu. Typical ZUN quality, with not much to say about it. While
the overall function structure is identical to its TH04 counterpart, the
two games only really share small snippets inside these functions, and do
need to be RE'd separately.
The high score viewing (not registration) menu would have been next.
Unfortunately, it calls one of the GENSOU.SCR loading
functions… which are all a complete mess that still needed to be sorted
out first. 5 distinct functions in 6 binaries, and of course TH05 also
micro-optimized its MAIN.EXE version to directly use the DOS
INT 21h file loading API instead of master.lib's wrappers.
Could have all been avoided with a single method on the score data
structure, taking a player character ID and a difficulty level as
So, no score menu in this push then. Looking at the other end of the ASM
code though, we find the starting functions for the main game, the Extra
Stage, and the demo replays, which did fit perfectly to round out
Which is where we find an easter egg! 🥚 If you've ever looked into
怪綺談2.DAT, you might have noticed 6 .REC files
with replays for the Demo Play mode. However, the game only ever seems to
cycle between 4 replays. So what's in the other two, and why are they
40 KB instead of just 10 KB like the others? Turns out that they
combine into a full Extra Stage Clear replay with Mima, with 3 bombs and 1
death, obviously recorded by ZUN himself. The split into two files for the
stage (DEMO4.REC) and boss (DEMO5.REC) portion is
merely an attempt to limit the amount of simultaneously allocated heap
To watch this replay without modding the game, unlock the Extra Stage with
all 4 characters, then hold both the ⬅️ left and ➡️ right arrow keys in the
main menu while waiting for the usual demo replay.
I can't possibly be the first one to discover this, but I couldn't find
any other mention of it. Edit (2021-03-15): ZUN did in fact document this replay
in Section 6 of TH05's OMAKE.TXT, along with the exact method
to view it.
to Popfan for the discovery!
Here's a recording of the whole replay:
Note how the boss dialogue is skipped. MAIN.EXE actually
contains no less than 6 if() branches just to distinguish
this overly long replay from the regular ones.
I'd really like to do the TH04 and TH05 main menus in parallel, since we
can expect a bit more shared code after all the initial differences.
Therefore, I'm going to put the next "anything" push towards covering the
TH04 version of those functions. Next up though, it's back to TH01, with
more redundant image format code…
Finally, after a long while, we've got two pushes with barely anything to
talk about! Continuing the road towards 100% PI for TH05, these were
exactly the two pushes that TH05 MAINE.EXE PI was estimated
to additionally cost, relative to TH04's. Consequently, they mostly went
to TH05's unique data structures in the ending cutscenes, the score name
registration menu, and the
A unique feature in there is TH05's support for automatic text color
changes in its ending scripts, based on the first full-width Shift-JIS
codepoint in a line. The \c=codepoint,color
commands at the top of the _ED??.TXT set up exactly this
codepoint→color mapping. As far as I can tell, TH05 is the only Touhou
game with a feature like this – even the Windows Touhou games went back to
manually spelling out each color change.
The orb particles in TH05's staff roll also try to be a bit unique by
using 32-bit X and Y subpixel variables for their current position. With
still just 4 fractional bits, I can't really tell yet whether the extended
range was actually necessary. Maybe due to how the "camera scrolling"
through "space" was implemented? All other entities were pretty much the
usual fare, though.
12.4, 4.4, and now a 28.4 fixed-point format… yup,
📝 C++ templates were
definitely the right choice.
At the end of its staff roll, TH05 not only displays
the usual performance
verdict, but then scrolls in the scores at the end of each stage
before switching to the high score menu. The simplest way to smoothly
scroll between two full screens on a PC-98 involves a separate bitmap…
which is exactly what TH05 does here, reserving 28,160 bytes of its global
data segment for just one overly large monochrome 320×704 bitmap where
both the screens are rendered to. That's… one benefit of splitting your
game into multiple executables, I guess?
Not sure if it's common knowledge that you can actually scroll back and
forth between the two screens with the Up and Down keys before moving to
the score menu. I surely didn't know that before. But it makes sense –
might as well get the most out of that memory.
The necessary groundwork for all of this may have actually made
TH04's (yes, TH04's) MAINE.EXE technically
position-independent. Didn't quite reach the same goal for TH05's – but
what we did reach is ⅔ of all PC-98 Touhou code now being
position-independent! Next up: Celebrating even more milestones, as
-Tom- is about to finish development on his TH05
MAIN.EXE PI demo…
Alright, tooling and technical debt. Shouldn't be really much to talk
about… oh, wait, this is still ReC98
For the tooling part, I finished up the remaining ergonomics and error
handling for the
📝 sprite converter that Jonathan Campbell contributed two months ago.
While I familiarized myself with the tool, I've actually ran into some
unreported errors myself, so this was sort of important to me. Still got
no command-line help in there, but the error messages can now do that job
probably even better, since we would have had to write them anyway.
So, what's up with the technical debt then? Well, by now we've accumulated
quite a number of 📝 ASM code slices that
need to be either decompiled or clearly marked as undecompilable. Since we
define those slices as "already reverse-engineered", that decision won't
affect the numbers on the front page at all. But for a complete
decompilation, we'd still have to do this someday. So, rather than
incorporating this work into pushes that were purchased with the
expectation of measurable progress in a certain area, let's take the
"anything goes" pushes, and focus entirely on that during them.
The second code segment seemed like the best place to start with this,
since it affects the largest number of games simultaneously. Starting with
TH02, this segment contains a set of random "core" functions needed by the
binary. Image formats, sounds, input, math, it's all there in some
capacity. You could maybe call it all "libzun" or something like
that? But for the time being, I simply went with the obvious name,
seg2. Maybe I'll come up with something more convincing in
Oh, but wait, why were we assembling all the previous undecompilable ASM
translation units in the 16-bit build part? By moving those to the 32-bit
part, we don't even need a 16-bit TASM in our list of dependencies, as
long as our build process is not fully 16-bit.
And with that, ReC98 now also builds on Windows 95, and thus, every 32-bit
Windows version. 🎉 Which is certainly the most user-visible improvement
in all of these two pushes.
Back in 2015, I already decompiled all of TH02's seg2
functions. As suggested by the Borland compiler, I tried to follow a "one
translation unit per segment" layout, bundling the binary-specific
contents via #include. In the end, it required two
translation units – and that was even after manually inserting the
original padding bytes via #pragma codestring… yuck. But it
worked, compiled, and kept the linker's job (and, by extension,
segmentation worries) to a minimum. And as long as it all matched the
original binaries, it still counted as a valid reconstruction of ZUN's
However, that idea ultimately falls apart once TH03 starts mixing
undecompilable ASM code inbetween C functions. Now, we officially have no
choice but to use multiple C and ASM translation units, with maybe only
just one or two #includes in them…
…or we finally start reconstructing the actual seg2 library,
turning every sequence of related functions into its own translation unit.
This way, we can simply reuse the once-compiled .OBJ files for all the
binaries those functions appear in, without requiring that additional
layer of translation units mirroring the original segmentation.
The best example for this is
almost undecompilable function that generates a lookup table for
horizontally flipping 8 1bpp pixels. It's part of every binary since
TH03, but only used in that game. With the previous approach, we would
have had to add 9 C translation units, which would all have just
#included that one file. Now, we simply put the .OBJ file
into the correct place on the linker command line, as soon as we can.
💡 And suddenly, the linker just inserts the correct padding bytes itself.
The most immediate gains there also happened to come from TH03. Which is
also where we did get some tiny RE% and PI% gains out of this after
all, by reverse-engineering some of its sprite blitting setup code. Sure,
I should have done even more RE here, to also cover those 5 functions at
the end of code segment #2 in TH03's MAIN.EXE that were in
front of a number of library functions I already covered in this push. But
let's leave that to an actual RE push 😛
All in all though, I was just getting started with this; the real
gains in terms of removed ASM files are still to come. But in the
meantime, the funding situation has become even better in terms of
allowing me to focus on things nobody asked for. 🙂 So here's a slightly
better idea: Instead of spending two more pushes on this, let's shoot for
TH05 MAINE.EXE position independence next. If I manage to get
it done, we'll have a 100% position-independent TH05 by the time
-Tom- finishes his MAIN.EXE PI demo, rather
than the 94% we'd get from just MAIN.EXE. That's bound to
make a much better impression on all the people who will then
(re-)discover the project.
And indeed, I got to end my vacation with a lot of image format and
blitting code, covering the final two formats, .GRC and .BOS. .GRC was
nothing noteworthy – one function for loading, one function for
byte-aligned blitting, and one function for freeing memory. That's it –
not even a unblitting function for this one. .BOS, on the other hand…
…has no generic (read: single/sane) implementation, and is only
implemented as methods of some boss entity class. And then again for
Sariel's dress and wand animations, and then again for Reimu's
animations, both of which weren't even part of these 4 pushes. Looking
forward to decompiling essentially the same algorithms all over again… And
that's how TH01 became the largest and most bloated PC-98 Touhou game. So
yeah, still not done with image formats, even at 44% RE.
This means I also had to reverse-engineer that "boss entity" class… yeah,
what else to call something a boss can have multiple of, that may or may
not be part of a larger boss sprite, may or may not be animated, and that
may or may not have an orb hitbox?
All bosses except for Kikuri share the same 5 global instances of this
class. Since renaming all these variables in ASM land is tedious anyway, I
went the extra mile and directly defined separate, meaningful names for
the entities of all bosses. These also now document the natural order in
which the bosses will ultimately be decompiled. So, unless a backer
requests anything else, this order will be:
(code for regular card-flipping stages)
As everyone kind of expects from TH01 by now, this class reveals yet
another… um, unique and quirky piece of code architecture. In
addition to the position and hitbox members you'd expect from a class like
this, the game also stores the .BOS metadata – width, height, animation
frame count, and 📝 bitplane pointer slot
number – inside the same class. But if each of those still corresponds to
one individual on-screen sprite, how can YuugenMagan have 5 eye sprites,
or Kikuri have more than one soul and tear sprite? By duplicating that
metadata, of course! And copying it from one entity to another
At this point, I feel like I even have to congratulate the game for not
actually loading YuugenMagan's eye sprites 5 times. But then again, 53,760
bytes of waste would have definitely been noticeable in the DOS days.
Makes much more sense to waste that amount of space on an unused C++
exception handler, and a bunch of redundant, unoptimized blitting
(Thinking about it, YuugenMagan fits this entire system perfectly. And
together with its position in the game's code – last to be decompiled
means first on the linker command line – we might speculate that
YuugenMagan was the first boss to be programmed for TH01?)
So if a boss wants to use sprites with different sizes, there's no way
around using another entity. And that's why Girl-Elis and Bat-Elis are two
distinct entities internally, and have to manually sync their position.
Except that there's also a third one for Attacking-Girl-Elis,
because Girl-Elis has 9 frames of animation in total, and the global .BOS
bitplane pointers are divided into 4 slots of only 8 images each.
Same for SinGyoku, who is split into a sphere entity, a
person entity, and a… white flash entity for all three forms,
all at the same resolution. Or Konngara's facial expressions, which also
require two entities just for themselves.
And once you decompile all this code, you notice just how much of it the
game didn't even use. 13 of the 50 bytes of the boss entity class are
outright unused, and 10 bytes are used for a movement clamping and lock
system that would have been nice if ZUN also used it outside of
Kikuri's soul sprites. Instead, all other bosses ignore this system
completely, and just
the X/Y coordinates of the boss entities directly.
As for the rendering functions, 5 out of 10 are unused. And while those
definitely make up less than half of the code, I still must have
spent at least 1 of those 4 pushes on effectively unused functionality.
Only one of these functions lends itself to some speculation. For Elis'
entrance animation, the class provides functions for wavy blitting and
unblitting, which use a separate X coordinate for every line of the
sprite. But there's also an unused and sort of broken one for unblitting
two overlapping wavy sprites, located at the same Y coordinate. This might
indicate that Elis could originally split herself into two sprites,
similar to TH04 Stage 6 Yuuka? Or it might just have been some other kind
of animation effect, who knows.
After over 3 months of TH01 progress though, it's finally time to look at
other games, to cover the rest of the crowdfunding backlog. Next up: Going
back to TH05, and getting rid of those last PI false positives. And since
I can potentially spend the next 7 weeks on almost full-time ReC98 work,
I've also re-opened the store until October!
It's vacation time! Which, for ReC98, means "relaxing by looking at
something boring and uninteresting that we'll ultimately have to cover
anyway"… like the TH01 HUD.
📝 As noted earlier, all the score, card
combo, stage, and time numbers are drawn into VRAM. Which turns TH01's HUD
rendering from the trivial, gaiji-assisted text RAM writes we see in later
games to something that, once again, requires blitting and unblitting
steps. For some reason though, everything on there is blitted to both
VRAM pages? And that's why the HUD chose to allocate a bunch of .PTN
sprite slots to store the background behind all "animated" elements at the
beginning of a 4-stage scene or boss battle… separately for every
affected 16×16 area. (Looking forward to the completely unnecessary
code in the Sariel fight that updates these slots after the backgrounds
were animated!) And without any separation into helper functions, we end
up with the same blitting calls separately copy-pasted for every single
HUD element. That's why something as seemingly trivial as this isn't even
done after 2 pushes, as we're still missing the stage timer.
Thankfully, the .PTN function signatures come with none of ZUN's little
inconsistencies, so I was able to mostly reduce this copy-pasta to a bunch
of small inline functions and macros. Those interfaces still remain a bit
annoying, though. As a 32×32 format, .PTN merely supports 16×16 sprites
with a separate bunch of functions that take an additional
quarter parameter from 0 to 3, to select one of the 4 16×16
quarters in a such a sprite…
For life and bomb counts, there was no way around VRAM though, since ZUN
wanted to use more than a single color for those. This is where we find at
least somewhat of a mildly interesting quirk in all of this: Any life
counts greater than the intended 6 will wrap into new rows, with the bombs
in the second row overlapping those excess lives. With the way the rest of
the HUD rendering works, that wrapping code code had to be explicitly
written… which means that ZUN did in fact accomodate (his own?) cheating
Now, I promised image formats, and in the middle of this copy-pasta, we
did get one… sort of. MASK.GRF, the red HUD
background, is entirely handled with two small bespoke functions… and
that's all the code we have for this format. Basically, it's a variation
on the 📝 .GRZ format we've seen earlier. It
uses the exact same RLE algorithm, but only has a single byte stream for
both RLE commands and pixel data… as you would expect from an RLE format.
.GRF actually stores 4 separately encoded RLE streams, which suggests that
it was intended for full 16-color images. Unfortunately,
MASK.GRF only contains 4 copies of the same HUD background
, so no unused beta data for us there. The only
thing we could derive from 4 identical bitplanes would be that the
background was originally meant to be drawn using color #15, rather than
the red seen in the final game. Color
#15 is a stage-specific background color that would have made the
HUD blend in quite nicely – in the YuugenMagan fight, it's the changing
color of the 邪 in the background, for example. But
really, with no generic implementation of this format, that's all just
Oh, and in case you were looking for a rip of that image:
So yeah, more of the usual TH01 code, with the usual small quirks, but
nothing all too horrible – as expected. Next up: The image formats that
didn't make it into this push.
So, let's finally look at some TH01 gameplay structures! The obvious
choices here are player shots and pellets, which are conveniently located
in the last code segment. Covering these would therefore also help in
transferring some first bits of data in REIIDEN.EXE from ASM
land to C land. (Splitting the data segment would still be quite
annoying.) Player shots are immediately at the beginning…
…but wait, these are drawn as transparent sprites loaded from .PTN files.
Guess we first have to spend a push on
📝 Part 2 of this format.
Hm, 4 functions for alpha-masked blitting and unblitting of both 16×16 and
32×32 .PTN sprites that align the X coordinate to a multiple of 8
(remember, the PC-98 uses a
VRAM memory layout, where 8 pixels correspond to a byte), but only one
function that supports unaligned blitting to any X coordinate, and only
for 16×16 sprites? Which is only called twice? And doesn't come with a
corresponding unblitting function?
Yeah, "unblitting". TH01 isn't
and uses the PC-98's second VRAM page exclusively to store a stage's
background and static sprites. Since the PC-98 has no hardware sprites,
all you can do is write pixels into VRAM, and any animated sprite needs to
be manually removed from VRAM at the beginning of each frame. Not using
double-buffering theoretically allows TH01 to simply copy back all 128 KB
of VRAM once per frame to do this. But that
would be pretty wasteful, so TH01 just looks at all animated sprites, and
selectively copies only their occupied pixels from the second to the first
Alright, player shot class methods… oh, wait, the collision functions
directly act on the Yin-Yang Orb, so we first have to spend a push on
that one. And that's where the impression we got from the .PTN
functions is confirmed: The orb is, in fact, only ever displayed at
byte-aligned X coordinates, divisible by 8. It's only thanks to the
constant spinning that its movement appears at least somewhat
This is purely a rendering issue; internally, its position is
tracked at pixel precision. Sadly, smooth orb rendering at any unaligned X
coordinate wouldn't be that trivial of a mod, because well, the
necessary functions for unaligned blitting and unblitting of 32×32 sprites
don't exist in TH01's code. Then again, there's so much potential for
optimization in this code, so it might be very possible to squeeze those
additional two functions into the same C++ translation unit, even without
More importantly though, this was the right time to decompile the core
functions controlling the orb physics – probably the highlight in these
three pushes for most people.
Well, "physics". The X velocity is restricted to the 5 discrete states of
-8, -4, 0, 4, and 8, and gravity is applied by simply adding 1 to the Y
velocity every 5 frames No wonder that this can
easily lead to situations in which the orb infinitely bounces from the
At least fangame authors now have
reference of how ZUN did it originally, because really, this bad
approximation of physics had to have been written that way on purpose. But
hey, it uses 64-bit floating-point variables!
…sometimes at least, and quite randomly. This was also where I had to
learn about Turbo C++'s floating-point code generation, and how rigorously
it defines the order of instructions when mixing double and
float variables in arithmetic or conditional expressions.
This meant that I could only get ZUN's original instruction order by using
literal constants instead of variables, which is impossible right now
without somehow splitting the data segment. In the end, I had to resort to
spelling out ⅔ of one function, and one conditional branch of another, in
inline ASM. 😕 If ZUN had just written 16.0 instead of
16.0f there, I would have saved quite some hours of my life
trying to decompile this correctly…
To sort of make up for the slowdown in progress, here's the TH01 orb
physics debug mod I made to properly understand them. Edit
(2022-07-12): This mod is outdated,
📝 the current version is here!2020-06-13-TH01OrbPhysicsDebug.zip
To use it, simply replace REIIDEN.EXE, and run the game
in debug mode, via game d on the DOS prompt.
Its code might also serve as an example of how to achieve this sort of
thing without position independence.
Alright, now it's time for player shots though. Yeah, sure, they
don't move horizontally, so it's not too bad that those are also
always rendered at byte-aligned positions. But, uh… why does this code
only use the 16×16 alpha-masked unblitting function for decaying shots,
and just sloppily unblits an entire 16×16 square everywhere else?
The worst part though: Unblitting, moving, and rendering player shots
is done in a single function, in that order. And that's exactly where
TH01's sprite flickering comes from. Since different types of sprites are
free to overlap each other, you'd have to first unblit all types, then
move all types, and then render all types, as done in later
PC-98 Touhou games. If you do these three steps per-type instead, you
will unblit sprites of other types that have been rendered before… and
therefore end up with flicker.
Oh, and finally, ZUN also added an additional sloppy 16×16 square unblit
call if a shot collides with a pellet or a boss, for some
guaranteed flicker. Sigh.
And that's ⅓ of all ZUN code in TH01 decompiled! Next up: Pellets!
Three pushes to decompile the TH01 high score menu… because it's
completely terrible, and needlessly complicated in pretty much every
Another, final set of differences between the REIIDEN.EXE
and FUUIN.EXE versions of the code. Which are so
insignificant that it must mean that ZUN kept this code in two
separate, manually and imperfectly synced files. The REIIDEN.EXE
version, only shown when game-overing, automatically jumps to the
enter/終 button after the 8th character was entered,
and also has a completely invisible timeout that force-enters a high score
name after 1000… key presses? Not frames? Why. Like, how do you
even realistically such a number. (Best guess: It's a hidden easter egg to
amuse players who place drinking glasses on cursor keys. Or beer bottles.)
That's all the differences that are maybe visible if you squint
hard enough. On top of that though, we got a bunch of further, minor code
organization differences that serve no purpose other than to waste
decompilation time, and certainly did their part in stretching this out to
3 pushes instead of 2.
Entered names are restricted to a set of 16-bit, full-width Shift-JIS
codepoints, yet are still accessed as 8-bit byte arrays everywhere. This
bloats both the C++ and generated ASM code with needless byte splits,
swaps, and bit shifts. Same for the route kanji. You have this 16-, heck,
even 32-bit CPU, why not use it?! (Fun fact: FUUIN.EXE is
explicitly compiled for a 80186, for the most part – unlike
REIIDEN.EXE, which does use Turbo C++'s 80386 mode.)
The sensible way of storing the current position of the alphabet
cursor would simply be two variables, indicating the logical row and
column inside the character map. When rendering, you'd then transform
these into screen space. This can keep the on-screen position constants in
a single place of code.
TH01 does the opposite: The selected character is stored directly in terms
of its on-screen position, which is then mapped back to a character
index for every processed input and the subsequent screen update. There's
no notion of a logical row or column anywhere, and consequently, the
position constants are vomited all over the code.
Which might not be as bad if the character map had a uniform
grid structure, with no gaps. But the one in TH01 looks like this:
And with no sense of abstraction anywhere, both input handling and
rendering end up with a separate if branch for at least 4 of
the 6 rows.
In the end, I just gave up with my usual redundancy reduction efforts for
this one. Anyone wanting to change TH01's high score name entering code
would be better off just rewriting the entire thing properly.
And that's all of the shared code in TH01! Both OP.EXE and
FUUIN.EXE are now only missing the actual main menu and
ending code, respectively. Next up, though: The long awaited TH01 PI push.
Which will not only deliver 100% PI for OP.EXE and
FUUIN.EXE, but also probably quite some gains in
REIIDEN.EXE. With now over 30% of the game decompiled, it's about
time we get to look at some gameplay code!
Back to TH01, and its high score menu… oh, wait, that one will eventually
involve keyboard input. And thanks to the generous TH01 funding situation,
there's really no reason not to cover that right now. After all,
TH01 is the last game where input still hadn't been RE'd.
But first, let's also cover that one unused blitting function, together
with REIIDEN.CFG loading and saving, which are in front of
the input function in OP.EXE… (By now, we all know about
the hidden start bomb configuration, right?)
Unsurprisingly, the earliest game also implements input in the messiest
way, with a different function for each of the three executables. "Because
they all react differently to keyboard inputs ",
apparently? OP.EXE even has two functions for it, one for the
START / CONTINUE / OPTION / QUIT main
menu, and one for both Option and Music Test menus, both of which directly
perform the ring arithmetic on the menu cursor variable. A consistent
separation of keyboard polling from input processing apparently wasn't all
too obvious of a thought, since it's only truly done from TH02 on.
Which means that this screen is actually re-entered for every 3 frames
that the PgUp key is being held. And yes, you can, of course, also
crash the system via a stack overflow this way by holding down PgUp for a
few seconds, if that's your thing. Edit (2020-09-17): Here's a video from
spaztron64, showing off this
exact stack overflow crash while running under the
memory manager, which displays additional information about these
sorts of crashes:
What makes this even funnier is that the code actually tracks the last
state of every polled key, to prevent exactly that sort of bug. But the
copy-pasted assignment of the last input state is only done aftertest_mem() already returned, making it effectively pointless
for PgUp. It does work as intended for PgDown… and that's why you
have to actually press and release this key once for every call to
test_mem() in order to actually get back into the game. Even
though a single call to PgDown will already show the game screen
In maybe more relevant news though, this function also came with what can
be considered the first piece of actual gameplay logic! Bombing via
double-tapping the Z and X keys is also handled here, and now we know that
both keys simply have to be tapped twice within a window of 20 frames.
They are tracked independently from each other, so you don't necessarily
have to press them simultaneously.
In debug mode, the bomb count tracks precisely this window of
time. That's why it only resets back to 0 when pressing Z or X if it's
Sure, TH01's code is expectedly terrible and messy. But compared to the
micro-optimizations of TH04 and TH05, it's an absolute joy to work on, and
opening all these ZUN bug loot boxes is just the icing on the cake.
Looking forward to more of the high score menu in the next pushes!
Final TH01 RE push for the time being, and as expected, we've got the
superficially final piece of shared code between the TH01 executables.
However, just having a single implementation for loading and recreating
the REYHI*.DAT score files would have been way above ZUN's
standards of consistency. So ZUN had the unique idea to mix up the file
I/O APIs, using master.lib functions in REIIDEN.EXE, and
POSIX functions (along with error messages and disabled interrupts) in
FUUIN.EXE… Could have been worse
though, as it was possible to abstract that away quite nicely.
That code wasn't quite in the natural way of decompilation either. As it
turns out though, 📝 segment splitting isn't
so painful after all if one of the new segments only has a few functions.
Definitely going to do that more often from now on, since it allows a much
larger number of functions to be immediately decompiled. Which is always
superior to somehow transforming a function's ASM into a form that I can
confidently call "reverse-engineered", only to revisit it again later for
And while I unfortunately missed 25% of total RE by a bit, this push
reached two other and perhaps even more significant milestones:
After (finally) compressing all unknown parts of the BSS segments
using arrays, the number of remaining lines in the
REIIDEN.EXE ASM dump has fallen below TASM's limit of 65,535. Which
means that we no longer need that annoying th01_reiiden_2.inc
file that everyone has forgotten about at least once.
Nope, RL has given me plenty of things to do from home after all,
so the current cap still remains an accurate representation of my free
For now though, we've got one more TH01 file format push, covering the
core functions for loading and displaying the 32×32 and 16×16 sprites from
the .PTN files, as announced – and probably one of the last ones for quite
a while to yield both RE and PI progress way above average. But what is
this, error return values in a ZUN game?! And actually good code
for deriving the alpha channel from the 16th color in the hardware
palette?! Sure, the rest of the code could still be improved a lot, but
that was quite a surprise, especially after the spaghetti code of
📝 the last push. That makes up for two of
the .PTN structure fields (one of them always 0, and one of them always 1)
remaining unused, and therefore unknown.
ZUN also uses the .PTN image slots to store the background of frequently
updated VRAM sections, in order to be able to repeatedly draw on top of
them – like for example the HUD area where the score and time numbers are
drawn. Future games would simply use the text RAM and gaiji for those
numbers. This would have worked just fine for TH01 too – especially since
all the functions decompiled so far align the VRAM X coordinate to the
8-pixel byte grid, which is the simplest way of accessing VRAM given the
memory layout. Looks as if ZUN simply wasn't aware of gaiji during the
development of TH01.
This won't be the last time I cover the .PTN format, since all the
blitting functions that actually use alpha are exclusive to
REIIDEN.EXE, and currently out of decompilation reach. But after
some more long overdue cleaning work, TH01 has now passed both TH02 and
even TH04 to become the second-most reverse-engineered game in
all of ReC98, in terms of absolute numbers! 🎉
Also, PI for TH01's OP.EXE is imminent. Next up though, we've
first got the probably final double-speed push for TH01, covering the last
set of duplicated functions between the three binaries – quite fitting for
the currently last fully funded, outstanding TH01 RE push. Then, we also
might get FUUIN.EXE PI within the same push
afterwards? After that, TH01 progress will be slowing down, since
I'd then have to cover either the main menu or in-game code
or the cutscenes, depending on what the backers request. (By
default, it's going to be in-game code, of course.)
Last of the 3 weeks of almost full-time ReC98 work, supposedly the least
stressful one, and then things still get delayed thanks to illness 😕 In
better news though, it looks like I'll be able to extend these 3 weeks to
8, as my RL is shutting down for coronavirus reasons. I'm going to
wait a bit for the dust to settle before raising the crowdfunding cap
though, since RL might give me more to do from home after all. I may or
may not also get commissioned for a non-Touhou translation patch project
to be worked on in that time…
The .GRP file functions turned out to, of course, also be present in
FUUIN.EXE. In fact, that binary had the largest share of
progress in this push, since it's the only one to include another
reimplementation of master.lib-style hardware palette fading. As a typical
little ZUN inconsistency, the FUUIN.EXE version of one .GRP
palette function directly calls one of these functions.
As for the functions themselves, they basically wrap the single-function
Pi load and
display library by 電脳科学研究所/BERO in a bowl of global state
spaghetti. 🍝 At least the function names now clearly encode important
side effects like, y'know, a changed hardware palette. The reason ZUN used
this separate library over master.lib's PI loading functions was probably
its support for defining a color as transparent. This feature is used for
the red box in the main menu, and the large cyan Siddhaṃ seed syllables in
(again) the Konngara fight.
Sadly, we've already reached the end of fast triple-speed TH01 progress
with 📝 the last push, which decompiled the
last segment shared by all three of TH01's executables. There's still a
bit of double-speed progress left though, with a small number of code
segments that are shared between just two of the three executables.
At the end of the first one of these, we've got all the code for the .GRZ
format – which is yet another run-length encoded image format, but this
time storing up to 16 full 640×400 16-color images with an alpha bit. This
one is exclusively used to wastefully store Konngara's sword slash and
animations. Due to… suboptimal code organization, the code for the format
is also present in OP.EXE, despite not being used there. But
hey, that brings TH01 to over 20% in RE!
Decoupling the RLE command stream from the pixel data sounds like a nice
idea at first, allowing the format to efficiently encode a variety of
animation frames displayed all over the screen… if ZUN actually made
use of it. The RLE stream also has quite some ridiculous overhead,
starting with 1 byte to store the 1-bit command (putting a single 8×1
pixel block, or entering a run of N such blocks). Run commands then store
another 1-byte run length, which has to be followed by another
command byte to identify the run as putting N blocks, or skipping N blocks.
And the pixel data is just a sequence of these blocks for all 4 bitplanes,
in uncompressed form…
Also, have some rips of all the images this format is used for:
To make these, I just wrote a small viewer, calling the same decompiled
TH01 code: 2020-03-07-grzview.zip
Obviously, this means that it not only must to be run on a PC-98, but also
discards the alpha information.
If any backers are really interested in having a proper converter
to and from PNG, I can implement that in an upcoming push… although that
would be the perfect thing for outside contributors to do.
Next up, we got some code for the PI format… oh, wait, the actual files
are called "GRP" in TH01.
Just like most of the time, it was more sensible to cover
GENSOU.SCR, the last structure missing in TH05's
everywhere it's used, rather than just rushing out OP.EXE
position independence. I did have to look into all of the functions to
fully RE it after all, and to find out whether the unused fields actually
are unused. The only thing that kept this push from yielding even
more above-average progress was the sheer inconsistency in how the games
implemented the operations on this PC-98 equivalent of score*.dat:
OP.EXE declares two structure instances, for simultaneous
access to both Reimu and Marisa scores. TH05 with its 4 playable
characters instead uses a single one, and overwrites it successively for
each character when drawing the high score menu – meaning, you'd only see
Yuuka's scores when looking at the structure inside the rendered high
score menu. However, it still declares the TH04 "Marisa" structure as a
leftover… and also decodes it and verifies its checksum, despite
nothing being ever loaded into it
MAIN.EXE uses a separate ASM implementation of the decoding
and encoding functions
TH05's MAIN.EXE also reimplements the basic loading
in ASM – without the code to regenerate GENSOU.SCR with
default data if the file is missing or corrupted. That actually makes
sense, since any regeneration is already done in OP.EXE, which
always has to load that file anyway to check how much has been cleared
However, there is a regeneration function in TH05's
MAINE.EXE… which actually generates different default
data: OP.EXE consistently sets Extra Stage records to Stage 1,
while MAINE.EXE uses the same place-based stage numbering that
both versions use for the regular ranks
Technically though, TH05's OP.EXEis
position-independent now, and the rest are (should be?
) merely false positives. However, TH04's is
still missing another structure, in addition to its false
positives. So, let's wait with the big announcement until the next push…
which will also come with a demo video of what will be possible then.
The glacial pace continues, with TH05's unnecessarily, inappropriately
micro-optimized, and hence, un-decompilable code for rendering the current
and high score, as well as the enemy health / dream / power bars. While
the latter might still pass as well-written ASM, the former goes to such
ridiculous levels that it ends up being technically buggy. If you
enjoy quality ZUN code, it's
definitely worth a read.
In TH05, this all still is at the end of code segment #1, but in TH04,
the same code lies all over the same segment. And since I really
wanted to move that code into its final form now, I finally did the
research into decompiling from anywhere else in a segment.
Turns out we actually can! It's kinda annoying, though: After splitting
the segment after the function we want to decompile, we then need to group
the two new segments back together into one "virtual segment" matching the
original one. But since all ASM in ReC98 heavily relies on being
assembled in MASM mode, we then start to suffer from MASM's group
addressing quirk. Which then forces us to manually prefix every single
from inside the group
to anywhere else within the newly created segment
with the group name. It's stupidly boring busywork, because of all the
function calls you mustn't prefix. Special tooling might make this
easier, but I don't have it, and I'm not getting crowdfunded for it.
So while you now definitely can request any specific thing in any
of the 5 games to be decompiled right now, it will take slightly
longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)
Only one function away from the TH05 shot type control functions now!
Sometimes, "strategically picking things to reverse-engineer" unfortunately also means "having to move seemingly random and utterly uninteresting stuff, which will only make sense later, out of the way". Really, this was so boring. Gonna get a lot more exciting in the next ones though.
… yeah, no, we won't get very far without figuring out these drawing routines.
Which process data that comes from the .STD files.
Which has various arrays related to the background… including one to specify the scrolling speed. And wait, setting that to 0 actually is what starts a boss battle?
So, have a TH05 Boss Rush patch: 2018-12-26-TH05BossRush.zip
Theoretically, this should have also worked for TH04, but for some reason,
the Stage 3 boss gets stuck on the first phase if we do this?
While we're waiting for Bruno to release the next thcrap build with ANM header patching, here are the resulting commits of the ReC98 CDG/CD2 special offer purchased by DTM, reverse-engineering all code that covers these formats.