TH03 gameplay! 📝 It's been over two years. People have been investing some decent money with the intention of eventually getting netplay, so let's cover some more foundations around player movement… and quickly notice that there's almost no overlap between gameplay RE and netplay preparations? That makes for a fitting opportunity to think about what TH03 netplay would look like:
You'd want UDP rather than TCP for both its low latency and its NAT hole-punching ability
However, raw UDP does not guarantee that the packets arrive in order, or that they even arrive at all
WebRTC implements these reliability guarantees on top of UDP in a modern package, providing the best of both worlds
I'm not too deep into networking to argue here, and it clearly works for Ju.N.Owen. If we do explore other options, it would mainly be because I can't easily get something as modern as WebRTC to natively run on Windows 9x or DOS, if we decide to go for that route.
Matchmaking: I like Ju.N.Owen's initial way of copy-pasting signaling codes into chat clients to establish a peer-to-peer connection without a dedicated matchmaking server. progre eventually implemented rooms on the AWS cloud, but signaling codes are still used for spectating and the Pure P2P mode. We'll probably copy the same evolution, with a slight preference for Pure P2P – if only because you would have to check a GDPR consent box before I can put the combination of your room name and IP address into a database. Server costs shouldn't be an issue at the scale I expect this to have.
Rollback: In emulators, rollback netcode can be and has been implemented by keeping savestates of the last few frames together with the local player's inputs and then replaying the emulation with updated inputs of the remote player if a prediction turned out to be incorrect. This technique is a great fit for TH03 for two reasons:
All game state is contained within a relatively small bit of memory. The only heap allocations done in MAIN.EXE are the 📝 .MRS images for gauge attack portraits and bomb backgrounds, and the enemy scripts and formations, both of which remain constant throughout a round. All other state is statically allocated, which can reduce per-frame snapshots from the naive 640 KiB of conventional DOS memory to just the 37 KiB of MAIN.EXE's data segment. And that's the upper bound – this number is only going to go down as we move towards 100% PI, figure out how TH03 uses all its static data, and get to consolidate all mutated data into an even smaller block of memory.
For input prediction, we could even let the game's existing AI play the remote player until the actual inputs come in, guaranteeing perfect play until the remote inputs prove otherwise. Then again… probably only while the remote player is not moving, because the chance for a human to replicate the AI's infamous erratic dodging is fairly low.
The only issue with rollback in specifically a PC-98 emulator is its implications for performance. Rendering is way more computationally expensive on PC-98 than it is on consoles with hardware sprites, involving lots of memory writes to the disjointed 4 bitplane segments that make up the 128 KB framebuffer, and equally as many reads and bitshift operations on sprite data. TH03 lessens the impact somewhat thanks to most of its rendering being EGC-accelerated and thus running inside the emulator as optimized native code, but we'd still be emulating all the x86 code surrounding the EGC accesses – from the emulator's point of view, it looks no different than game logic. Let's take my aging i5 system for example:
With the Screen → No wait option, Neko Project 21/W can emulate TH03 gameplay at 260 FPS, or 4.6× its regular speed.
This leaves room for each frame to contain 3.6 frames of rollback in addition to the frame that's supposed to be displayed,
which results in a maximum safe ping time of ≈63 ms. According to this site, that's enough for a smooth connection from Germany to any other place in Europe. At this ping, my system could still run the game without slowdown even if every single frame required a rollback, which is highly unlikely.
Any higher ping, however, could occasionally lead to a rollback queue that's too large for my system to process within a single frame at the intended 56.4 FPS rate. As a result, playing anyone in the US is highly likely to involve at least occasional slowdowns. Delaying inputs on purpose is the usual workaround, but isn't Touhou that kind of game series where people use vpatch to get rid of even the default input delay in the Windows games?
So we'd ideally want to put TH03 into an update-only mode that skips all rendering calls during re-simulation of rolled-back frames. Ironically, this means that netplay-focused RE would actually focus on the game's rendering code and ensure that it doesn't mutate any statically allocated data, allowing it to be freely skipped without affecting the game. Imagine palette-based flashing animations that are implemented by gradually mutating statically allocated values – these would cause wrong colors for the rest of the game if the animation doesn't run on every frame.
Implementing all of this into TH03 can be done in one, a few, or all of the following five ways, depending on what the backers prefer. Sorted from the most generic to the most specialized solution (and, coincidentally, from least to most total effort required):
Generic PC-98 netcode for one or more emulators
This is the most basic and puristic variant that implements generic netplay for PC-98 games in general by effectively providing remote control of the emulated keyboard and joypad. The emulator will be unaware of the game, and the game will be unaware of being netplayed, which makes this solution particularly interesting for the non-Touhou PC-98 scene, or competitive players who absolutely insist on using ZUN's original binaries and won't trust any of my modded game builds.
Applied to TH03, this means that players would select the regular hot-seat 1P vs 2P mode and then initiate a match through a new menu in the emulator UI. The same UI must then provide an option to manually remap incoming key and button presses to the 2P controls (newly introducing remapping to the emulator if necessary), as well as blocking any non-2P keys. The host then sends an initial savestate to the guest to ensure an identical starting state, and starts synchronizing and rolling back inputs at VSync boundaries.
This generic nature means that we don't get to include any of the TH03-specific rollback optimizations mentioned above, leading to the highest CPU and memory requirements out of all the variants. It sure is the easiest to implement though, as we get to freely use modern C++ WebRTC libraries that are designed to work with the network stack of the underlying OS.
I can try to build this netcode as a generic library that can work with any PC-98 emulator, but it would ultimately be up to the respective upstream developers to integrate it into official releases. Therefore, expect this variant to require separate funding and custom builds for each individual emulator codebase that we'd like to support.
Emulator-level netcode with optional game integration
Takes the generic netcode developed in 1) and adds the possibility for the game to control it via a special interrupt API. This enables several improvements:
Online matches could be initiated through new options in TH03's main menu rather than the emulator's UI.
The game could communicate the memory region that should be backed up every frame, cutting down memory usage as described above.
The exchanged input data could use the game's internal format instead of keyboard or joypad inputs. This removes the need for key remapping at the emulator level and naturally prevents the inherent issue of remote control where players could mess with each other's controls.
The game could be aware of the rollbacks, allowing it to jump over its rendering code while processing the queue of remote inputs and thus gain some performance as explained above.
The game could add synchronization points that block gameplay until both players have reached them, preventing the rollback queue from growing infinitely. This solves the issue of 1) not having any inherent way of working around desyncs and the resulting growth of the rollback queue. As an example, if one of the two emulators in 1) took, say, 2 seconds longer to load the game due to a random CPU spike caused by some bloatware on their system, the two players would be out of sync by 2 seconds for the rest of the session, forcing the faster system to render 113 frames every time an input prediction turned out to be incorrect.
Good places for synchronization points include the beginning of each round, the WARNING!! You are forced to evade / Your life is in peril popups that pause the game for a few frames anyway, and whenever the game is paused via the ESC key.
During such pauses, the game could then also block the resuming ESC key of the player who didn't pause the game.
Native PC-98 Windows 9x netcode
Equivalent in features to 2), but pulls the netcode from the emulator down into the PC-98 system itself. Just like Ju.N.Owen or Adonis, it would run as a separate 32-bit or 16-bit Windows application that somehow communicates with the game running in a DOS window. This removes the need for emulator-specific integration code and provides real-hardware support for that handful of real-hardware owners who have actually equipped their PC-98 with a network card such as the LGY-98.
This specific card is also low-level-emulated by the 21/W fork of Neko Project, which would allow this type of netplay to be used with official builds of that specific emulator. DOSBox-X currently emulates the apparently hardware-compatible NE2000 card, but disables its emulation in PC-98 mode, most likely because its I/O ports clash with the typical peripherals of a PC-98 system.
As this is just low-level emulation of a network device though, setting up this kind of networking in Neko Project 21/W requires you to install and configure both a TAP driver on the host system, and the whole Windows 9x network stack on the emulated PC-98. Thankfully, the setup is well-documented, and I did manage to get a working Internet connection inside an emulated Windows 95, but it's definitely not foolproof. A more specific guide or setup program would greatly help here.
And that's not the end of the drawbacks:
Netplay would depend on the PC-98 versions of Windows 9x, requiring players to acquire and install another OS on the emulated system.
Porting libdatachannel (and especially the required transport encryption) to Windows 95 will probably involve a bit of effort as well.
As would actually finding a way to access V86 mode memory from a 32-bit or 16-bit Windows process.
Correctly setting up TH03 to run within Windows 95 to begin with can be rather tricky. The GDC clock speed check needs to be either patched out or overridden using mode-setting tools, Windows needs to be blocked from accessing the FM chip, and even then, MAIN.EXE might still immediately crash during the first frame and leave all of VRAM corrupted:
A matchmaking server would be much more of a requirement than in any of the emulator variants. Players are unlikely to run their favorite chat client on the same PC-98 system, and the signaling codes are way too unwieldy to type them in manually. (Then again, IRC is always an option, and the people who would fund this variant are probably the exact same people who are already running IRC clients on their PC-98.)
Native PC-98 DOS netcode
Conceptually the same as 3), but going yet another level deeper, replacing the Windows 9x network stack with a DOS-based one. This might look even more intimidating and error-prone, but after I got pingand even Telnet working, I was pleasantly surprised at it's much simpler it is when compared to the Windows variant. The whole stack consists of just one LGY-98 hardware information tool, a LGY-98 packet driver TSR, and a TSR that implements TCP/IP/UDP/DNS/ICMP and is configured with a plaintext file. I don't have any deep experience with these protocols, so I was quite surprised that you can implement all of them in a single 40 KiB binary. Installed as TSRs, the entire stack takes up an acceptable 82 KiB of conventional memory, leaving more than enough space for the game itself. And since both of the TSRs are open-source, we can even legally bundle them with the future modified game binaries.
However, all the emulation and matchmaking issues from the Windows 9x approach remain, along with the following issues:
Porting libdatachannel and the required transport encryption to the TEEN stack seems even more time-consuming than a Windows 95 port.
The TEEN stack has no UI for specifying the system's or gateway's IP addresses outside of its plaintext configuration file. This provides a nice opportunity for adding a new Internet settings menu with great error feedback to the game itself. Great for UX, but it's another thing I'd have to write.
Due to the memory requirements of rollback, the netcode should better run as a TSR in Protected Mode, adding yet another technology I'd have to learn about.
The LGY-98 is not the only network card for the PC-98. Others might have more complicated DOS drivers that might not work as seamlessly with the TEEN stack, or have no preserved DOS drivers at all. Heck, the most time-consuming part of the DOS setup was finding the correct download link for the LGY-98 packet driver, as the one link that appears in a lot of places only throws an access denied error these days. If you're interested in funding this variant and are using a non-LGY-98 card on real hardware, make sure you get general Internet working on DOS first.
Porting the game first
As always, this is the premium option. If the entire game already runs as a standalone executable on a modern system, we can just put all the netcode into the same binary and have the most seamless integration possible.
That leaves us with these prerequisites:
1), by definition, needs nothing from ReC98, and I could theoretically start implementing it right now. If you're interested in funding it, just tell me via the usual Twitter or Discord channels.
2), 3), and 4) require at least 100% RE of TH03's OP.EXE to facilitate the new menu code. Reverse-engineering all rendering-related code in MAIN.EXE would be nice for performance, but we don't strictly need all of it before we start. Re-simulated frames can just skip over the few pieces of rendering code we do know, and we can gradually increase the skipped area of code in future pushes.
100% PI won't be a requirement either, as I expect the MAIN.EXE part of the interfacing netcode layer to be thin enough that it can easily fit within the original game's code layout.
5), obviously, requires all of TH03 to be RE'd, decompiled, cleaned up, and ported to modern systems. Currently, TH03 appears to be the second-easiest game to port behind TH02:
Although TH03 already has more needlessly micro-optimized ASM code than TH02 and there's even more to come, it still appears to have way less than TH04 or TH05.
Its game logic and rendering code seem to be somewhat neatly separated from each other, unlike TH01 which deeply intertwines them.
Its graphics seem free of obvious bugs, unlike – again — the flicker-fest that is TH01.
But still, it's the game with the least amount of RE%. Decompilation might get easier once I've worked myself up to the higher levels of game code, and even more so if we're lucky and all of the 9 characters are coded in a similar way, but I can't promise anything at this point.
Once we've reached any of these prerequisites, I'll set up a separate campaign funding method that runs parallel to the cap. As netplay is one of those big features where incremental progress makes little sense and we can expect wide community support for the idea, I'll go for a more classic crowdfunding model with a fixed goal for the minimum feature set and stretch goals for optional quality-of-life features. Since I've still got two other big projects waiting to be finished, I'd like to at least complete the Shuusou Gyoku Linux port before I start working on TH03 netplay, even if we manage to hit any of the funding goals before that.
For the first time in a long while, the actual content of this push can be listed fairly quickly. I've now RE'd:
conversions from playfield-relative coordinates to screen coordinates and back (a first in PC-98 Touhou; even TH02 uses screen space for every coordinate I've seen so far),
the low-level code that moves the player entity across the screen,
a copy of the per-round frame counter that, for some reason, resets to 0 at the start of the Win/Lose animation, resetting a bunch of animations with it,
a global hitbox with one variable that sometimes stores the center of an entity, and sometimes its top-left corner,
and the 48×48 hit circles from EN2.PI.
It's also the third TH03 gameplay push in a row that features inappropriate ASM code in places that really, really didn't need any. As usual, the code is worse than what Turbo C++ 4.0J would generate for idiomatic C code, and the surrounding code remains full of untapped and quick optimization opportunities anyway. This time, the biggest joke is the sprite offset calculation in the hit circle rendering code:
But while we've all come to expect the usual share of ZUN bloat by now, this is also the first push without either a ZUN bug or a landmine since I started using these terms! 🎉 It does contain a single ZUN quirk though, which can also be found in the hit circles. This animation comes in two types with different caps: 12 animation slots across both playfields for the enemy circles shown in alternating bright/dark yellow colors, whereas the white animation for the player characters has a cap of… 1? P2 takes precedence over P1 because its update code always runs last, which explains what happens when both players get hit within the 16 frames of the animation:
SPRITE16 uses the PC-98's EGC to draw these single-color sprites. If the EGC is already set up, it can be set into a GRCG-equivalent RMW mode using the pattern/read plane register (0x4A2) and foreground color register (0x4A6), together with setting the mode register (0x4A4) to 0x0CAC. Unlike the typical blitting operations that involve its 16-dot pattern register, the EGC even supports 8- or 32-bit writes in this mode, just like the GRCG. 📝 As expected for EGC features beyond the most ordinary ones though, T98-Next simply sets every written pixel to black on a 32-bit write. Comparing the actual performance of such writes to the GRCG would be 📝 yet another interesting question to benchmark.
Next up: I think it's time for ReC98's build system to reach its final form.
For almost 5 years, I've been using an unreleased sane build system on a parallel private branch that was just missing some final polish and bugfixes. Meanwhile, the public repo is still using the project's initial Makefile that, 📝 as typical for Makefiles, is so unreliable that BUILD16B.BAT force-rebuilds everything by default anyway. While my build system has scaled decently over the years, something even better happened in the meantime: MS-DOS Player, a DOS emulator exclusively meant for seamless integration of CLI programs into the Windows console, has been forked and enhanced enough to finally run Turbo C++ 4.0J at an acceptable speed. So let's remove DOSBox from the equation, merge the 32-bit and 16-bit build steps into a single 32-bit one, set all of this up in a user-friendly way, and maybe squeeze even more performance out of MS-DOS Player specifically for this use case.
More than three months without any reverse-engineering progress! It's been
way too long. Coincidentally, we're at least back with a surprising 1.25% of
overall RE, achieved within just 3 pushes. The ending script system is not
only more or less the same in TH04 and TH05, but actually originated in
TH03, where it's also used for the cutscenes before stages 8 and 9. This
means that it was one of the final pieces of code shared between three of
the four remaining games, which I got to decompile at roughly 3× the usual
speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The
Music Room is largely equivalent in all three remaining games as well, and
the sound device selection, ZUN Soft logo screens, and main/option menus are
the same in TH04 and TH05. A lot of that code is in the "technically RE'd
but not yet decompiled" ASM form though, so it would shift Finalized% more
significantly than RE%. Therefore, make sure to order the new
Finalization option rather than Reverse-engineering if you
want to make number go up.
So, cutscenes. On the surface, the .TXT files look simple enough: You
directly write the text that should appear on the screen into the file
without any special markup, and add commands to define visuals, music, and
other effects at any place within the script. Let's start with the basics of
how text is rendered, which are the same in all three games:
First off, the text area has a size of 480×64 pixels. This means that it
does not correspond to the tiled area painted into TH05's
EDBK?.PI images:
Since the font weight can be customized, all text is rendered to VRAM.
This also includes gaiji, despite them ignoring the font weight
setting.
The system supports automatic line breaks on a per-glyph basis, which
move the text cursor to the beginning of the red text area. This might seem like a piece of long-forgotten
ancient wisdom at first, considering the absence of automatic line breaks in
Windows Touhou. However, ZUN probably implemented it more out of pure
necessity: Text in VRAM needs to be unblitted when starting a new box, which
is way more straightforward and performant if you only need to worry
about a fixed area.
The system also automatically starts a new (key press-separated) text
box after the end of the 4th line. However, the text cursor is
also unconditionally moved to the top-left corner of the yellow name
area when this happens, which is almost certainly not what you expect, given
that automatic line breaks stay within the red area. A script author might
as well add the necessary text box change commands manually, if you're
forced to anticipate the automatic ones anyway…
Due to ZUN forgetting an unblitting call during the TH05 refactoring of the
box background buffer, this feature is even completely broken in that game,
as any new text will simply be blitted on top of the old one:
Overall, the system is geared toward exclusively full-width text. As
exemplified by the 2014 static English patches and the screenshots in this
blog post, half-width text is possible, but comes with a lot of
asterisks attached:
Each loop of the script interpreter starts by looking at the next
byte to distinguish commands from text. However, this step also skips
over every ASCII space and control character, i.e., every byte
≤ 32. If you only intend to display full-width glyphs anyway, this
sort of makes sense: You gain complete freedom when it comes to the
physical layout of these script files, and it especially allows commands
to be freely separated with spaces and line breaks for improved
readability. Still, enforcing commands to be separated exclusively by
line breaks might have been even better for readability, and would have
freed up ASCII spaces for regular text…
Non-command text is blindly processed and rendered two bytes at a
time. The rendering function interprets these bytes as a Shift-JIS
string, so you can use half-width characters here. While the
second byte can even be an ASCII 0x20 space due to the
parser's blindness, all half-width characters must still occur in pairs
that can't be interrupted by commands:
As a workaround for at least the ASCII space issue, you can replace
them with any of the unassigned
Shift-JIS lead bytes – 0x80, 0xA0, or
anything between 0xF0 and 0xFF inclusive.
That's what you see in all screenshots of this post that display
half-width spaces.
Finally, did you know that you can hold ESC to fast-forward
through these cutscenes, which skips most frame delays and reduces the rest?
Due to the blocking nature of all commands, the ESC key state is
only updated between commands or 2-byte text groups though, so it can't
interrupt an ongoing delay.
Superficially, the list of game-specific differences doesn't look too long,
and can be summarized in a rather short table:
It's when you get into the implementation that the combined three systems
reveal themselves as a giant mess, with more like 56 differences between the
games. Every single new weird line of code opened up
another can of worms, which ultimately made all of this end up with 24
pieces of bloat and 14 bugs. The worst of these should be quite interesting
for the general PC-98 homebrew developers among my audience:
The final official 0.23 release of master.lib has a bug in
graph_gaiji_put*(). To calculate the JIS X 0208 code point for
a gaiji, it is enough to ADD 5680h onto the gaiji ID. However,
these functions accidentally use ADC instead, which incorrectly
adds the x86 carry flag on top, causing weird off-by-one errors based on the
previous program state. ZUN did fix this bug directly inside master.lib for
TH04 and TH05, but still needed to work around it in TH03 by subtracting 1
from the intended gaiji ID. Anyone up for maintaining a bug-fixed master.lib
repository?
The worst piece of bloat comes from TH03 and TH04 needlessly
switching the visibility of VRAM pages while blitting a new 320×200 picture.
This makes it much harder to understand the code, as the mere existence of
these page switches is enough to suggest a more complex interplay between
the two VRAM pages which doesn't actually exist. Outside this visibility
switch, page 0 is always supposed to be shown, and page 1 is always used
for temporarily storing pixels that are later crossfaded onto page 0. This
is also the only reason why TH03 has to render text and gaiji onto both VRAM
pages to begin with… and because TH04 doesn't, changing the picture in the
middle of a string of text is technically bugged in that game, even though
you only get to temporarily see the new text on very underclocked PC-98
systems.
These performance implications made me wonder why cutscenes even bother with
writing to the second VRAM page anyway, before copying each crossfade step
to the visible one.
📝 We learned in June how costly EGC-"accelerated" inter-page copies are;
shouldn't it be faster to just blit the image once rather than twice?
Well, master.lib decodes .PI images into a packed-pixel format, and
unpacking such a representation into bitplanes on the fly is just about the
worst way of blitting you could possibly imagine on a PC-98. EGC inter-page
copies are already fairly disappointing at 42 cycles for every 16 pixels, if
we look at the i486 and ignore VRAM latencies. But under the same
conditions, packed-pixel unpacking comes in at 81 cycles for every 8
pixels, or almost 4× slower. On lower-end systems, that can easily sum up to
more than one frame for a 320×200 image. While I'd argue that the resulting
tearing could have been an acceptable part of the transition between two
images, it's understandable why you'd want to avoid it in favor of the
pure effect on a slower framerate.
Really makes me wonder why master.lib didn't just directly decode .PI images
into bitplanes. The performance impact on load times should have been
negligible? It's such a good format for
the often dithered 16-color artwork you typically see on PC-98, and
deserves better than master.lib's implementation which is both slow to
decode and slow to blit.
That brings us to the individual script commands… and yes, I'm going to
document every single one of them. Some of their interactions and edge cases
are not clear at all from just looking at the code.
Almost all commands are preceded by… well, a 0x5C lead byte.
Which raises the question of whether we should
document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded
¥ yen sign. From a gaijin perspective, it seems obvious that it's a
backslash, as it's consistently displayed as one in most of the editors you
would actually use nowadays. But interestingly, iconv
-f shift-jis -t utf-8 does convert any 0x5C
lead bytes to actual ¥ U+00A5 YEN SIGN code points
.
Ultimately, the distinction comes down to the font. There are fonts
that still render 0x5C as ¥, but mainly do so out
of an obvious concern about backward compatibility to JIS X 0201, where this
mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho,
the old Japanese fonts from Windows 3.1, but even Meiryo and Yu
Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much
every other modern font, and freely licensed ones in particular, render this
code point as \, even if you set your editor to Shift-JIS. And
while ZUN most definitely saw it as a ¥, documenting this code
point as \ is less ambiguous in the long run. It can only
possibly correspond to one specific code point in either Shift-JIS or UTF-8,
and will remain correct even if we later mod the cutscene system to support
full-blown Unicode.
Now we've only got to clarify the parameter syntax, and then we can look at
the big table of commands:
Numeric parameters are read as sequences of up to 3 ASCII digits. This
limits them to a range from 0 to 999 inclusive, with 000 and
0 being equivalent. Because there's no further sentinel
character, any further digit from the 4th one onwards is
interpreted as regular text.
Filename parameters must be terminated with a space or newline and are
limited to 12 characters, which translates to 8.3 basenames without any
directory component. Any further characters are ignored and displayed as
text as well.
Each .PI image can contain up to four 320×200 pictures ("quarters") for
the cutscene picture area. In the script commands, they are numbered like
this:
0
1
2
3
\@
Clears both VRAM pages by filling them with VRAM color 0. 🐞
In TH03 and TH04, this command does not update the internal text area
background used for unblitting. This bug effectively restricts usage of
this command to either the beginning of a script (before the first
background image is shown) or its end (after no more new text boxes are
started). See the image below for an
example of using it anywhere else.
\b2
Sets the font weight to a value between 0 (raw font ROM glyphs) to 3
(very thicc). Specifying any other value has no effect.
🐞 In TH04 and TH05, \b3 leads to glitched pixels when
rendering half-width glyphs due to a bug in the newly micro-optimized
ASM version of
📝 graph_putsa_fx(); see the image below for an example.
In these games, the parameter also directly corresponds to the
graph_putsa_fx() effect function, removing the sanity check
that was present in TH03. In exchange, you can also access the four
dissolve masks for the bold font (\b2) by specifying a
parameter between 4 (fewest pixels) to 7 (most
pixels). Demo video below.
\c15
Changes the text color to VRAM color 15.
\c=字,15
Adds a color map entry: If 字 is the first code point
inside the name area on a new line, the text color is automatically set
to 15. Up to 8 such entries can be registered
before overflowing the statically allocated buffer.
🐞 The comma is assumed to be present even if the color parameter is omitted.
\e0
Plays the sound effect with the given ID.
\f
(no-op)
\fi1
\fo1
Calls master.lib's palette_black_in() or
palette_black_out() to play a hardware palette fade
animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
\fm1
Fades out BGM volume via PMD's AH=02h interrupt call,
in a non-blocking way. The fade speed can range from 1 (slowest) to 127 (fastest).
Values from 128 to 255 technically correspond to
AH=02h's fade-in feature, which can't be used from cutscene
scripts because it requires BGM volume to first be lowered via
AH=19h, and there is no command to do that.
\g8
Plays a blocking 8-frame screen shake
animation.
\ga0
Shows the gaiji with the given ID from 0 to 255
at the current cursor position. Even in TH03, gaiji always ignore the
text delay interval configured with \v.
@3
TH05's replacement for the \ga command from TH03 and
TH04. The default ID of 3 corresponds to the
gaiji. Not to be confused with \@, which starts with a backslash,
unlike this command.
@h
Shows the gaiji.
@t
Shows the gaiji.
@!
Shows the gaiji.
@?
Shows the gaiji.
@!!
Shows the gaiji.
@!?
Shows the gaiji.
\k0
Waits 0 frames (0 = forever) for an advance key to be pressed before
continuing script execution. Before waiting, TH05 crossfades in any new
text that was previously rendered to the invisible VRAM page…
🐞 …but TH04 doesn't, leaving the text invisible during the wait time.
As a workaround, \vp1 can be
used before \k to immediately display that text without a
fade-in animation.
\m$
Stops the currently playing BGM.
\m*
Restarts playback of the currently loaded BGM from the
beginning.
\m,filename
Stops the currently playing BGM, loads a new one from the given
file, and starts playback.
\n
Starts a new line at the leftmost X coordinate of the box, i.e., the
start of the name area. This is how scripts can "change" the name of the
currently speaking character, or use the entire 480×64 pixels without
being restricted to the non-name area.
Note that automatic line breaks already move the cursor into a new line.
Using this command at the "end" of a line with the maximum number of 30
full-width glyphs would therefore start a second new line and leave the
previously started line empty.
If this command moved the cursor into the 5th line of a box,
\s is executed afterward, with
any of \n's parameters passed to \s.
\p
(no-op)
\p-
Deallocates the loaded .PI image.
\p,filename
Loads the .PI image with the given file into the single .PI slot
available to cutscenes. TH04 and TH05 automatically deallocate any
previous image, 🐞 TH03 would leak memory without a manual prior call to
\p-.
\pp
Sets the hardware palette to the one of the loaded .PI image.
\p@
Sets the loaded .PI image as the full-screen 640×400 background
image and overwrites both VRAM pages with its pixels, retaining the
current hardware palette.
\p=
Runs \pp followed by \p@.
\s0
\s-
Ends a text box and starts a new one. Fades in any text rendered to
the invisible VRAM page, then waits 0 frames
(0 = forever) for an advance key to be
pressed. Afterward, the new text box is started with the cursor moved to
the top-left corner of the name area. \s- skips the wait time and starts the new box
immediately.
\t100
Sets palette brightness via master.lib's
palette_settone() to any value from 0 (fully black) to 200
(fully white). 100 corresponds to the palette's original colors.
Preceded by a 1-frame delay unless ESC is held.
\v1
Sets the number of frames to wait between every 2 bytes of rendered
text.
Sets the number of frames to spend on each of the 4 fade
steps when crossfading between old and new text. The game-specific
default value is also used before the first use of this command.
\v2
\vp0
Shows VRAM page 0. Completely useless in
TH03 (this game always synchronizes both VRAM pages at a command
boundary), only of dubious use in TH04 (for working around a bug in \k), and the games always return to
their intended shown page before every blitting operation anyway. A
debloated mod of this game would just remove this command, as it exposes
an implementation detail that script authors should not need to worry
about. None of the original scripts use it anyway.
\w64
\w and \wk wait for the given number
of frames
\wm and \wmk wait until PMD has played
back the current BGM for the total number of measures, including
loops, given in the first parameter, and fall back on calling
\w and \wk with the second parameter as
the frame number if BGM is disabled.
🐞 Neither PMD nor MMD reset the internal measure when stopping
playback. If no BGM is playing and the previous BGM hasn't been
played back for at least the given number of measures, this command
will deadlock.
Since both TH04 and TH05 fade in any new text from the invisible VRAM
page, these commands can be used to simulate TH03's typing effect in
those games. Demo video below.
Contrary to \k and \s, specifying 0 frames would
simply remove any frame delay instead of waiting forever.
The TH03-exclusive k variants allow the delay to be
interrupted if ⏎ Return or Shot are held down.
TH04 and TH05 recognize the k as well, but removed its
functionality.
All of these commands have no effect if ESC is held.
\wm64,64
\wk64
\wmk64,64
\wi1
\wo1
Calls master.lib's palette_white_in() or
palette_white_out() to play a hardware palette fade
animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
\=4
Immediately displays the given quarter of the loaded .PI image in
the picture area, with no fade effect. Any value ≥ 4 resets the picture area to black.
\==4,1
Crossfades the picture area between its current content and quarter
#4 of the loaded .PI image, spending 1 frame on each of the 4 fade steps unless
ESC is held. Any value ≥ 4 is
replaced with quarter #0.
\$
Stops script execution. Must be called at the end of each file;
otherwise, execution continues into whatever lies after the script
buffer in memory.
TH05 automatically deallocates the loaded .PI image, TH03 and TH04
require a separate manual call to \p- to not leak its memory.
Bold values signify the default if the parameter
is omitted; \c is therefore
equivalent to \c15.
So yeah, that's the cutscene system. I'm dreading the moment I will have to
deal with the other command interpreter in these games, i.e., the
stage enemy system. Luckily, that one is completely disconnected from any
other system, so I won't have to deal with it until we're close to finishing
MAIN.EXE… that is, unless someone requests it before. And it
won't involve text encodings or unblitting…
The cutscene system got me thinking in greater detail about how I would
implement translations, being one of the main dependencies behind them. This
goal has been on the order form for a while and could soon be implemented
for these cutscenes, with 100% PI being right around the corner for the TH03
and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching
for Latin-script languages could be implemented fairly quickly:
Establish basic UTF-8 parsing for less painful manual editing of the
source files
Procedurally generate glyphs for the few required additional letters
based on existing font ROM glyphs. For example, we'd generate ä
by painting two short lines on top of the font ROM's a glyph,
or generate ¿ by vertically flipping the question mark. This
way, the text retains a consistent look regardless of whether the translated
game is run with an NEC or EPSON font ROM, or the that Neko Project II auto-generates if you
don't provide either.
(Optional) Change automatic line breaks to work on a per-word
basis, rather than per-glyph
That's it – script editing and distribution would be handled by your local
translation group. It might seem as if this would also work for Greek and
Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not
sure if I want to attempt procedurally shrinking these glyphs from 16×16 to
8×16… For any more thorough solution, we'd need to go for a more "Chad" kind
of full-blown translation support:
Implement text subdivisions at a sensible granularity while retaining
automatic line and box breaks
Compile translatable text into a Japanese→target language dictionary
(I'm too old to develop any further translation systems that would overwrite
modded source text with translations of the original text)
Implement a custom Unicode font system (glyphs would be taken from GNU
Unifont unless translators provide a different 8×16 font for their
language)
Combine the text compiler with the font compiler to only store needed
glyphs as part of the translation's font file (dealing with a multi-MB font
file would be rather ugly in a Real Mode game)
Write a simple install/update/patch stacking tool that supports both
.HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to
warrant a separate tool – each patch stack would be statically compiled into
a single package file in the game's directory)
Add a nice language selection option to the main menu
(Optional) Support proportional fonts
Which sounds more like a separate project to be commissioned from
Touhou Patch Center's Open Collective funds, separate from the ReC98 cap.
This way, we can make sure that the feature is completely implemented, and I
can talk with every interested translator to make sure that their language
works.
It's still cheaper overall to do this on PC-98 than to first port the games
to a modern system and then translate them. On the other hand, most
of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with
the difficulty of getting arbitrary Unicode characters to work natively in a
PC-98 DOS game at all, and would be either unnecessary or trivial if we had
already ported the game. Depending on where the patrons' interests lie, it
may not be worth it. So let's see what all of you think about which
way we should go, or whether it's worth doing at all. (Edit
(2022-12-01): With Splashman's
order towards the stage dialogue system, we've pretty much confirmed that it
is.) Maybe we want to meet in the middle – using e.g. procedural glyph
generation for dynamic translations to keep text rendering consistent with
the rest of the PC-98 system, and just not support non-Latin-script
languages in the beginning? In any case, I've added both options to the
order form. Edit (2023-07-28):Touhou Patch Center has agreed to fund
a basic feature set somewhere between the Virgin and Chad level. Check the
📝 dedicated announcement blog post for more
details and ideas, and to find out how you can support this goal!
Surprisingly, there was still a bit of RE work left in the third push after
all of this, which I filled with some small rendering boilerplate. Since I
also wanted to include TH02's playfield overlay functions,
1/15 of that last push went towards getting a
TH02-exclusive function out of the way, which also ended up including that
game in this delivery.
The other small function pointed out how TH05's Stage 5 midboss pops into
the playfield quite suddenly, since its clipping test thinks it's only 32
pixels tall rather than 64:
Next up: Staying with TH05 and looking at more of the pattern code of its
boss fights. Given the remaining TH05 budget, it makes the most sense to
continue in in-game order, with Sara and the Stage 2 midboss. If more money
comes in towards this goal, I could alternatively go for the Mai & Yuki
fight and immediately develop a pretty fix for the cheeto storage
glitch. Also, there's a rather intricate
pull request for direct ZMBV decoding on the website that I've still got
to review…
TH05 has passed the 50% RE mark, with both MAIN.EXE and the
game as a whole! With that, we've also reached what -Tom-
wanted out of the project, so he's suspending his discount offer for a
bit.
Curve bullets are now officially called cheetos! 76.7% of
fans prefer this term, and it fits into the 8.3 DOS filename scheme much
better than homing lasers (as they're called in
OMAKE.TXT) or Taito
lasers (which would indeed have made sense as well).
…oh, and I managed to decompile Shinki within 2 pushes after all. That
left enough budget to also add the Stage 1 midboss on top.
So, Shinki! As far as final boss code is concerned, she's surprisingly
economical, with 📝 her background animations
making up more than ⅓ of her entire code. Going straight from TH01's
📝 final📝 bosses
to TH05's final boss definitely showed how much ZUN had streamlined
danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there
is still room for improvement: TH05 not only
📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month,
but also uses them 4× as often, and even for midbosses. Most importantly
though, defining danmaku patterns using a single global instance of the
group template structure is just bad no matter how you look at it:
The script code ends up rather bloated, with a single MOV
instruction for setting one of the fields taking up 5 bytes. By comparison,
the entire structure for regular bullets is 14 bytes large, while the
template structure for Shinki's 32×32 ball bullets could have easily been
reduced to 8 bytes.
Since it's also one piece of global state, you can easily forget to set
one of the required fields for a group type. The resulting danmaku group
then reuses these values from the last time they were set… which might have
been as far back as another boss fight from a previous stage.
And of course, I wouldn't point this out if it
didn't actually happen in Shinki's pattern code. Twice.
Declaring a separate structure instance with the static data for every
pattern would be both safer and more space-efficient, and there's
more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow.
The "devil"
patternis significantly more complex than the others, but still
far from TH01's final bosses at their worst. I especially like the clear
architectural separation between "one-shot pattern" functions that return
true once they're done, and "looping pattern" functions that
run as long as they're being called from a boss's main function. Not many
all too interesting things in these pattern functions for the most part,
except for two pieces of evidence that Shinki was coded after Yumeko:
The gather animation function in the first two phases contains a bullet
group configuration that looks like it's part of an unused danmaku
pattern. It quickly turns out to just be copy-pasted from a similar function
in Yumeko's fight though, where it is turned into actual
bullets.
As one of the two places where ZUN forgot to set a template field, the
lasers at the end of the white wing preparation pattern reuse the 6-pixel
width of Yumeko's final laser pattern. This actually has an effect on
gameplay: Since these lasers are active for the first 8 frames after
Shinki's wings appear on screen, the player can get hit by them in the last
2 frames after they grew to their final width.
Speaking about that wing sprite: If you look at ST05.BB2 (or
any other file with a large sprite, for that matter), you notice a rather
weird file layout:
And it's not a limitation of the sprite width field in the BFNT+ header
either. Instead, it's master.lib's BFNT functions which are limited to
sprite widths up to 64 pixels… or at least that's what
MASTER.MAN claims. Whatever the restriction was, it seems to be
completely nonexistent as of master.lib version 0.23, and none of the
master.lib functions used by the games have any issues with larger
sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the
game that expects Shinki's winged form to consist of 4 physical
sprites, not just 1. Any conversion from another, more logical sprite sheet
layout back into BFNT+ must therefore replicate the original number of
sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly
loaded sprite no longer match ZUN's hardcoded IDs, causing the game to
crash. This is exactly what used to happen with -Tom-'s
MysticTK automation scripts,
which combined these exact sprites into a single large one. This issue has
now been fixed – just in case there are some underground modders out there
who used these scripts and wonder why their game crashed as soon as the
Shinki fight started.
And then the code quality takes a nosedive with Shinki's main function.
Even in TH05, these boss and midboss update
functions are still very imperative:
The origin point of all bullet types used by a boss must be manually set
to the current boss/midboss position; there is no concept of a bullet type
tracking a certain entity.
The same is true for the target point of a player's homing shots…
… and updating the HP bar. At least the initial fill animation is
abstracted away rather decently.
Incrementing the phase frame variable also must be done manually. TH05
even "innovates" here by giving the boss update function exclusive ownership
of that variable, in contrast to TH04 where that ownership is given out to
the player shot collision detection (?!) and boss defeat helper
functions.
Speaking about collision detection: That is done by calling different
functions depending on whether the boss is supposed to be invincible or
not.
Timeout conditions? No standard way either, and all done with manual
if statements. In combination with the regular phase end
condition of lowering (mid)boss HP to a certain value, this leads to quite a
convoluted control flow.
The manual calls to the score bonus functions for cleared phases at least provide some sense of orientation.
One potentially nice aspect of all this imperative freedom is that
phases can end outside of HP boundaries… by manually incrementing the
phase variable and resetting the phase frame variable to 0.
The biggest WTF in there, however, goes to using one of the 16 state bytes
as a "relative phase" variable for differentiating between boss phases that
share the same branch within the switch(boss.phase)
statement. While it's commendable that ZUN tried to reduce code duplication
for once, he could have just branched depending on the actual
boss.phase variable? The same state byte is then reused in the
"devil" pattern to track the activity state of the big jerky lasers in the
second half of the pattern. If you somehow managed to end the phase after
the first few bullets of the pattern, but before these lasers are up,
Shinki's update function would think that you're still in the phase
before the "devil" pattern. The main function then sequence-breaks
right to the defeat phase, skipping the final pattern with the burning Makai
background. Luckily, the HP boundaries are far away enough to make this
impossible in practice.
The takeaway here: If you want to use the state bytes for your custom
boss script mods, alias them to your own 16-byte structure, and limit each
of the bytes to a clearly defined meaning across your entire boss script.
One final discovery that doesn't seem to be documented anywhere yet: Shinki
actually has a hidden bomb shield during her two purple-wing phases.
uth05win got this part slightly wrong though: It's not a complete
shield, and hitting Shinki will still deal 1 point of chip damage per
frame. For comparison, the first phase lasts for 3,000 HP, and the "devil"
pattern phase lasts for 5,800 HP.
And there we go, 3rd PC-98 Touhou boss
script* decompiled, 28 to go! 🎉 In case you were expecting a fix for
the Shinki death glitch: That one
is more appropriately fixed as part of the Mai & Yuki script. It also
requires new code, should ideally look a bit prettier than just removing
cheetos between one frame and the next, and I'd still like it to fit within
the original position-dependent code layout… Let's do that some other
time.
Not much to say about the Stage 1 midboss, or midbosses in general even,
except that their update functions have to imperatively handle even more
subsystems, due to the relative lack of helper functions.
The remaining ¾ of the third push went to a bunch of smaller RE and
finalization work that would have hardly got any attention otherwise, to
help secure that 50% RE mark. The nicest piece of code in there shows off
what looks like the optimal way of setting up the
📝 GRCG tile register for monochrome blitting
in a variable color:
mov ah, palette_index ; Any other non-AL 8-bit register works too.
; (x86 only supports AL as the source operand for OUTs.)
rept 4 ; For all 4 bitplanes…
shr ah, 1 ; Shift the next color bit into the x86 carry flag
sbb al, al ; Extend the carry flag to a full byte
; (CF=0 → 0x00, CF=1 → 0xFF)
out 7Eh, al ; Write AL to the GRCG tile register
endm
Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles
into a surprisingly nice one-liner. What a beautiful micro-optimization, at
a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there,
becoming increasingly dumb and undecompilable. Was it really necessary to
save 4 x86 instructions in the highly unlikely case of a new spark sprite
being spawned outside the playfield? That one 2D polar→Cartesian
conversion function then pointed out Turbo C++ 4.0J's woefully limited
support for 32-bit micro-optimizations. The code generation for 32-bit
📝 pseudo-registers is so bad that they almost
aren't worth using for arithmetic operations, and the inline assembler just
flat out doesn't support anything 32-bit. No use in decompiling a function
that you'd have to entirely spell out in machine code, especially if the
same function already exists in multiple other, more idiomatic C++
variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC
replay file reading code, which should finally prove that nothing about the
game's original replay system could serve as even just the foundation for
community-usable replays. Just in case anyone was still thinking that.
Next up: Back to TH01, with the Elis fight! Got a bit of room left in the
cap again, and there are a lot of things that would make a lot of
sense now:
TH04 would really enjoy a large number of dedicated pushes to catch up
with TH05. This would greatly support the finalization of both games.
Continuing with TH05's bosses and midbosses has shown to be good value
for your money. Shinki would have taken even less than 2 pushes if she
hadn't been the first boss I looked at.
Oh, and I also added Seihou as a selectable goal, for the two people out
there who genuinely like it. If I ever want to quit my day job, I need to
branch out into safer territory that isn't threatened by takedowns, after
all.
Been 📝 a while since we last looked at any of
TH03's game code! But before that, we need to talk about Y coordinates.
During TH03's MAIN.EXE, the PC-98 graphics GDC runs in its
line-doubled 640×200 resolution, which gives the in-game portion its
distinctive stretched low-res look. This lower resolution is a consequence
of using 📝 Promisence Soft's SPRITE16 driver:
Its performance simply stems from the fact that it expects sprites to be
stored in the bottom half of VRAM, which allows them to be blitted using the
same EGC-accelerated VRAM-to-VRAM copies we've seen again and again in all
other games. Reducing the visible resolution also means that the sprites can
be stored on both VRAM pages, allowing the game to still be double-buffered.
If you force the graphics chip to run at 640×400, you can see them:
Note that the text chip still displays its overlaid contents at 640×400,
which means that TH03's in-game portion technically runs at two
resolutions at the same time.
But that means that any mention of a Y coordinate is ambiguous: Does it
refer to undoubled VRAM pixels, or on-screen stretched pixels? Especially
people who have known about the line doubling for years might almost expect
technical blog posts on this game to use undoubled VRAM coordinates. So,
let's introduce a new formatting convention for both on-screen
640×400 and undoubled 640×200 coordinates,
and always write out both to minimize the confusion.
Alright, now what's the thing gonna be? The enemy structure is highly
overloaded, being used for enemies, fireballs, and explosions with seemingly
different semantics for each. Maybe a bit too much to be figured out in what
should ideally be a single push, especially with all the functions that
would need to be decompiled? Bullet code would be easier, but not exactly
single-push material either. As it turns out though, there's something more
fundamental left to be done first, which both of these subsystems depend on:
collision detection!
And it's implemented exactly how I always naively imagined collision
detection to be implemented in a fixed-resolution 2D bullet hell game with
small hitboxes: By keeping a separate 1bpp bitmap of both playfields in
memory, drawing in the collidable regions of all entities on every frame,
and then checking whether any pixels at the current location of the player's
hitbox are set to 1. It's probably not done in the other games because their
single data segment was already too packed for the necessary 17,664 bytes to
store such a bitmap at pixel resolution, and 282,624 bytes for a bitmap at
Q12.4 subpixel resolution would have been prohibitively expensive in 16-bit
Real Mode DOS anyway. In TH03, on the other hand, this bitmap is doubly
useful, as the AI also uses it to elegantly learn what's on the playfield.
By halving the resolution and only tracking tiles of 2×2 / 2×1 pixels, TH03 only requires an adequate total
of 6,624 bytes of memory for the collision bitmaps of both playfields.
So how did the implementation not earn the good-code tag this time? Because the code for drawing into these bitmaps is undecompilable hand-written x86 assembly. And not just your usual ASM that was basically compiled from C and then edited to maybe optimize register allocation and maybe replace a bunch of local variables with self-modifying code, oh no. This code is full of overly clever bit twiddling, abusing the fact that the 16-bit AX,
BX, CX, and DX registers can also be
accessed as two 8-bit registers, calculations that change the semantic
meaning behind the value of a register, or just straight-up reassignments of
different values to the same small set of registers. Sure, in some way it is
impressive, and it all does work and correctly covers every edge
case, but come on. This could have all been a lot more readable in
exchange for just a few CPU cycles.
What's most interesting though are the actual shapes that these functions
draw into the collision bitmap. On the surface, we have:
vertical slopes at any angle across the whole playfield; exclusively
used for Chiyuri's diagonal laser EX attack
straight vertical lines, with a width of 1 tile; exclusively used for
the 2×2 / 2×1 hitboxes of bullets
rectangles at arbitrary sizes
But only 2) actually draws a full solid line. 1) and 3) are only ever drawn
as horizontal stripes, with a hardcoded distance of 2 vertical tiles
between every stripe of a slope, and 4 vertical tiles between every stripe
of a rectangle. That's 66-75% of each rectangular entity's intended hitbox
not actually taking part in collision detection. Now, if player hitboxes
were ≤ 6 / 3 pixels, we'd have one
possible explanation of how the AI can "cheat", because it could just
precisely move through those blank regions at TAS speeds. So, let's make
this two pushes after all and tell the complete story, since this is one of
the more interesting aspects to still be documented in this game.
And the code only gets worse. While the player
collision detection function is decompilable, it might as well not
have been, because it's just more of the same "optimized", hard-to-follow
assembly. With the four splittable 16-bit registers having a total of 20
different meanings in this function, I would have almost preferred
self-modifying code…
In fact, it was so bad that it prompted some maintenance work on my inline
assembly coding standards as a whole. Turns out that the _asm
keyword is not only still supported in modern Visual Studio compilers, but
also in Clang with the -fms-extensions flag, and compiles fine
there even for 64-bit targets. While that might sound like amazing news at
first ("awesome, no need to rewrite this stuff for my x86_64 Linux
port!"), you quickly realize that almost all inline assembly in this
codebase assumes either PC-98 hardware, segmented 16-bit memory addressing,
or is a temporary hack that will be removed with further RE progress.
That's mainly because most of the raw arithmetic code uses Turbo C++'s
register pseudovariables where possible. While they certainly have their
drawbacks, being a non-standard extension that's not supported in other
x86-targeting C compilers, their advantages are quite significant: They
allow this code to stay in the same language, and provide slightly more
immediate portability to any other architecture, together with
📝 readability and maintainability improvements that can get quite significant when combined with inlining:
// This one line compiles to five ASM instructions, which would need to be
// spelled out in any C compiler that doesn't support register pseudovariables.
// By adding typed aliases for these registers via `#define`, this code can be
// both made even more readable, and be prepared for an easier transformation
// into more portable local variables.
_ES = (((_AX * 4) + _BX) + SEG_PLANE_B);
However, register pseudovariables might cause potential portability issues
as soon as they are mixed with inline assembly instructions that rely on
their state. The lazy way of "supporting pseudo-registers" in other
compilers would involve declaring the full set as global variables, which
would immediately break every one of those instances:
_DI = 0;
_AX = 0xFFFF;
// Special x86 instruction doing the equivalent of
//
// *reinterpret_cast(MK_FP(_ES, _DI)) = _AX;
// _DI += sizeof(uint16_t);
//
// Only generated by Turbo C++ in very specific cases, and therefore only
// reliably available through inline assembly.
asm { movsw; }
What's also not all too standardized, though, are certain variants of
the asm keyword. That's why I've now introduced a distinction
between the _asm keyword for "decently sane" inline assembly,
and the slightly less standard asm keyword for inline assembly
that relies on the contents of pseudo-registers, and should break on
compilers that don't support them. So yeah, have some minor
portability work in exchange for these two pushes not having all that much
in RE'd content.
With that out of the way and the function deciphered, we can confirm the
player hitboxes to be a constant 8×8 /
8×4 pixels, and prove that the hit stripes are nothing but
an adequate optimization that doesn't affect gameplay in any way.
And what's the obvious thing to immediately do if you have both the
collision bitmap and the player hitbox? Writing a "real hitbox" mod, of
course:
Reorder the calls to rendering functions so that player and shot sprites
are rendered after bullets
Blank out all player sprite pixels outside an
8×8 / 8×4 box around the center
point
After the bullet rendering function, turn on the GRCG in RMW mode and
set the tile register set to the background color
Stretch the negated contents of collision bitmap onto each playfield,
leaving only collidable pixels untouched
Do the same with the actual, non-negated contents and a white color, for
extra contrast against the background. This also makes sure to show any
collidable areas whose sprite pixels are transparent, such as with the moon
enemy. (Yeah, how unfair.) Doing that also loses a lot of information about
the playfield, such as enemy HP indicated by their color, but what can you
do:
2022-02-18-TH03-real-hitbox.zip
The secret for writing such mods before having reached a sufficient level of
position independence? Put your new code segment into DGROUP,
past the end of the uninitialized data section. That's why this modded
MAIN.EXE is a lot larger than you would expect from the raw amount of new code: The file now actually needs to store all these
uninitialized 0 bytes between the end of the data segment and the first
instruction of the mod code – normally, this number is simply a part of the
MZ EXE header, and doesn't need to be redundantly stored on disk. Check the
th03_real_hitbox
branch for the code.
And now we know why so many "real hitbox" mods for the Windows Touhou games
are inaccurate: The games would simply be unplayable otherwise – or can
you dodge rapidly moving 2×2 /
2×1 blocks as an 8×8 /
8×4 rectangle that is smaller than your shot sprites,
especially without focused movement? I can't.
Maybe it will feel more playable after making explosions visible, but that
would need more RE groundwork first.
It's also interesting how adding two full GRCG-accelerated redraws of both
playfields per frame doesn't significantly drop the game's frame rate – so
why did the drawing functions have to be micro-optimized again? It
would be possible in one pass by using the GRCG's TDW mode, which
should theoretically be 8× faster, but I have to stop somewhere.
Next up: The final missing piece of TH04's and TH05's
bullet-moving code, which will include a certain other
type of projectile as well.
Technical debt, part 9… and as it turns out, it's highly impractical to
repay 100% of it at this point in development. 😕
The reason: graph_putsa_fx(), ZUN's function for rendering
optionally boldfaced text to VRAM using the font ROM glyphs, in its
ridiculously micro-optimized TH04 and TH05 version. This one sets the
"callback function" for applying the boldface effect by self-modifying
the target of two CALL rel16 instructions… because
there really wasn't any free register left for an indirect
CALL, eh? The necessary distance, from the call site to the
function itself, has to be calculated at assembly time, by subtracting the
target function label from the call site label.
This usually wouldn't be a problem… if ZUN didn't store the resulting
lookup tables in the .DATA segment. With code segments, we
can easily split them at pretty much any point between functions because
there are multiple of them. But there's only a single .DATA
segment, with all ZUN and master.lib data sandwiched between Borland C++'s
crt0 at the
top, and Borland C++'s library functions at the bottom of the segment.
Adding another split point would require all data after that point to be
moved to its own translation unit, which in turn requires
EXTERN references in the big .ASM file to all that moved
data… in short, it would turn the codebase into an even greater
mess.
Declaring the labels as EXTERN wouldn't work either, since
the linker can't do fancy arithmetic and is limited to simply replacing
address placeholders with one single address. So, we're now stuck with
this function at the bottom of the SHARED segment, for the
foreseeable future.
We can still continue to separate functions off the top of that segment,
though. Pretty much the only thing noteworthy there, so far: TH04's code
for loading stage tile images from .MPN files, which we hadn't
reverse-engineered so far, and which nicely fit into one of
Blue Bolt's pending ⅓ RE contributions. Yup, we finally moved
the RE% bars again! If only for a tiny bit.
Both TH02 and TH05 simply store one pointer to one dynamically allocated
memory block for all tile images, as well as the number of images, in the
data segment. TH04, on the other hand, reserves memory for 8 .MPN slots,
complete with their color palettes, even though it only ever uses the
first one of these. There goes another 458 bytes of conventional RAM… I
should start summing up all the waste we've seen so far. Let's put the
next website contribution towards a tagging system for these blog posts.
At 86% of technical debt in the SHARED segment repaid, we
aren't quite done yet, but the rest is mostly just TH04 needing to catch
up with functions we've already separated. Next up: Getting to that
practical 98.5% point. Since this is very likely to not require a full
push, I'll also decompile some more actual TH04 and TH05 game code I
previously reverse-engineered – and after that, reopen the store!
Alright, no more big code maintenance tasks that absolutely need to be
done right now. Time to really focus on parts 6 and 7 of repaying
technical debt, right? Except that we don't get to speed up just yet, as
TH05's barely decompilable PMD file loading function is rather…
complicated.
Fun fact: Whenever I see an unusual sequence of x86 instructions in PC-98
Touhou, I first consult the disassembly of Wolfenstein 3D. That game was
originally compiled with the quite similar Borland C++ 3.0, so it's quite
helpful to compare its ASM to the
officially released source
code. If I find the instructions in question, they mostly come from
that game's ASM code, leading to the amusing realization that "even John
Carmack was unable to get these instructions out of this compiler"
This time though, Wolfenstein 3D did point me
to Borland's intrinsics for common C functions like memcpy()
and strchr(), available via #pragma intrinsic.
Bu~t those unfortunately still generate worse code than what ZUN
micro-optimized here. Commenting how these sequences of instructions
should look in C is unfortunately all I could do here.
The conditional branches in this function did compile quite nicely
though, clarifying the control flow, and clearly exposing a ZUN
bug: TH05's snd_load() will hang in an infinite loop when
trying to load a non-existing -86 BGM file (with a .M2
extension) if the corresponding -26 BGM file (with a .M
extension) doesn't exist either.
Unsurprisingly, the PMD channel monitoring code in TH05's Music Room
remains undecompilable outside the two most "high-level" initialization
and rendering functions. And it's not because there's data in the
middle of the code segment – that would have actually been possible with
some #pragmas to ensure that the data and code segments have
the same name. As soon as the SI and DI registers are referenced
anywhere, Turbo C++ insists on emitting prolog code to save these
on the stack at the beginning of the function, and epilog code to restore
them from there before returning.
Found that out in
September 2019, and confirmed that there's no way around it. All the
small helper functions here are quite simply too optimized, throwing away
any concern for such safety measures. 🤷
Oh well, the two functions that were decompilable at least indicate
that I do try.
Within that same 6th push though, we've finally reached the one function
in TH05 that was blocking further progress in TH04, allowing that game
to finally catch up with the others in terms of separated translation
units. Feels good to finally delete more of those .ASM files we've
decompiled a while ago… finally!
But since that was just getting started, the most satisfying development
in both of these pushes actually came from some more experiments with
macros and inline functions for near-ASM code. By adding
"unused" dummy parameters for all relevant registers, the exact input
registers are made more explicit, which might help future port authors who
then maybe wouldn't have to look them up in an x86 instruction
reference quite as often. At its best, this even allows us to
declare certain functions with the __fastcall convention and
express their parameter lists as regular C, with no additional
pseudo-registers or macros required.
As for output registers, Turbo C++'s code generation turns out to be even
more amazing than previously thought when it comes to returning
pseudo-registers from inline functions. A nice example for
how this can improve readability can be found in this piece of TH02 code
for polling the PC-98 keyboard state using a BIOS interrupt:
inline uint8_t keygroup_sense(uint8_t group) {
_AL = group;
_AH = 0x04;
geninterrupt(0x18);
// This turns the output register of this BIOS call into the return value
// of this function. Surprisingly enough, this does *not* naively generate
// the `MOV AL, AH` instruction you might expect here!
return _AH;
}
void input_sense(void)
{
// As a result, this assignment becomes `_AH = _AH`, which Turbo C++
// never emits as such, giving us only the three instructions we need.
_AH = keygroup_sense(8);
// Whereas this one gives us the one additional `MOV BH, AH` instruction
// we'd expect, and nothing more.
_BH = keygroup_sense(7);
// And now it's obvious what both of these registers contain, from just
// the assignments above.
if(_BH & K7_ARROW_UP || _AH & K8_NUM_8) {
key_det |= INPUT_UP;
}
// […]
}
I love it. No inline assembly, as close to idiomatic C code as something
like this is going to get, yet still compiling into the minimum possible
number of x86 instructions on even a 1994 compiler. This is how I keep
this project interesting for myself during chores like these.
We might have even reached peak
inline already?
And that's 65% of technical debt in the SHARED segment repaid
so far. Next up: Two more of these, which might already complete that
segment? Finally!
Wow, 31 commits in a single push? Well, what the last push had in
progress, this one had in maintenance. The
📝 master.lib header transition absolutely
had to be completed in this one, for my own sanity. And indeed,
it reduced the build time for the entirety of ReC98 to about 27 seconds on
my system, just as expected in the original announcement. Looking forward
to even faster build times with the upcoming #include
improvements I've got up my sleeve! The port authors of the future are
going to appreciate those quite a bit.
As for the new translation units, the funniest one is probably TH05's
function for blitting the 1-color .CDG images used for the main menu
options. Which is so optimized that it becomes decompilable again,
by ditching the self-modifying code of its TH04 counterpart in favor of
simply making better use of CPU registers. The resulting C code is still a
mess, but what can you do.
This was followed by even more TH05 functions that clearly weren't
compiled from C, as evidenced by their padding
bytes. It's about time I've documented my lack of ideas of how to get
those out of Turbo C++.
And just like in the previous push, I also had to 📝 throw away a decompiled TH02 function purely due to alignment issues. Couldn't have been a better one though, no one's going to miss a residency check for the MMD driver that is largely identical to the corresponding (and indeed decompilable) function for the PMD driver. Both of those should have been merged into a single function anyway, given how they also mutate the game's sound configuration flags…
In the end, I've slightly slowed down with this one, with only 37% of technical debt done after this 4th dedicated push. Next up: One more of these, centered around TH05's stupidly optimized .PI functions. Maybe also with some more reverse-engineering, after not having done any for 1½ months?
Alright, back to continuing the master.hpp transition started
in P0124, and repaying technical debt. The last blog post already
announced some ridiculous decompilations… and in fact, not a single
one of the functions in these two pushes was decompilable into
idiomatic C/C++ code.
As usual, that didn't keep me from trying though. The TH04 and TH05
version of the infamous 16-pixel-aligned, EGC-accelerated rectangle
blitting function from page 1 to page 0 was fairly average as far as
unreasonable decompilations are concerned.
The big blocker in TH03's MAIN.EXE, however, turned out to be
the .MRS functions, used to render the gauge attack portraits and bomb
backgrounds. The blitting code there uses the additional FS and GS segment
registers provided by the Intel 386… which
are not supported by Turbo C++'s inline assembler, and
can't be turned into pointers, due to a compiler bug in Turbo C++ that
generates wrong segment prefix opcodes for the _FS and
_GS pseudo-registers.
Apparently I'm the first one to even try doing that with this compiler? I
haven't found any other mention of this bug…
Compiling via assembly (#pragma inline) would work around
this bug and generate the correct instructions. But that would incur yet
another dependency on a 16-bit TASM, for something honestly quite
insignificant.
What we can always do, however, is using __emit__() to simply
output x86 opcodes anywhere in a function. Unlike spelled-out inline
assembly, that can even be used in helper functions that are supposed to
inline… which does in fact allow us to fully abstract away this compiler
bug. Regular if() comparisons with pseudo-registers
wouldn't inline, but "converting" them into C++ template function
specializations does. All that's left is some C preprocessor abuse
to turn the pseudo-registers into types, and then we do retain a
normal-looking poke() call in the blitting functions in the
end. 🤯
Yeah… the result is
batshitinsane.
I may have gone too far in a few places…
One might certainly argue that all these ridiculous decompilations
actually hurt the preservation angle of this project. "Clearly, ZUN
couldn't have possibly written such unreasonable C++ code.
So why pretend he did, and not just keep it all in its more natural ASM
form?" Well, there are several reasons:
Future port authors will merely have to translate all the
pseudo-registers and inline assembly to C++. For the former, this is
typically as easy as replacing them with newly declared local variables. No
need to bother with function prolog and epilog code, calling conventions, or
the build system.
No duplication of constants and structures in ASM land.
As a more expressive language, C++ can document the code much better.
Meticulous documentation seems to have become the main attraction of ReC98
these days – I've seen it appreciated quite a number of times, and the
continued financial support of all the backers speaks volumes. Mods, on the
other hand, are still a rather rare sight.
Having as few .ASM files in the source tree as possible looks better to
casual visitors who just look at GitHub's repo language breakdown. This way,
ReC98 will also turn from an "Assembly project" to its rightful state
of "C++ project" much sooner.
And finally, it's not like the ASM versions are
gone – they're still part of the Git history.
Unfortunately, these pushes also demonstrated a second disadvantage in
trying to decompile everything possible: Since Turbo C++ lacks TASM's
fine-grained ability to enforce code alignment on certain multiples of
bytes, it might actually be unfeasible to link in a C-compiled object file
at its intended original position in some of the .EXE files it's used in.
Which… you're only going to notice once you encounter such a case. Due to
the slightly jumbled order of functions in the
📝 second, shared code segment, that might
be long after you decompiled and successfully linked in the function
everywhere else.
And then you'll have to throw away that decompilation after all 😕 Oh
well. In this specific case (the lookup table generator for horizontally
flipping images), that decompilation was a mess anyway, and probably
helped nobody. I could have added a dummy .OBJ that does nothing but
enforce the needed 2-byte alignment before the function if I
really insisted on keeping the C version, but it really wasn't
worth it.
Now that I've also described yet another meta-issue, maybe there'll
really be nothing to say about the next technical debt pushes?
Next up though: Back to actual progress
again, with TH01. Which maybe even ends up pushing that game over the 50%
RE mark?
So, where to start? Well, TH04 bullets are hard, so let's
procrastinate start with TH03 instead
The 📝 sprite display functions are the
obvious blocker for any structure describing a sprite, and therefore most
meaningful PI gains in that game… and I actually did manage to fit a
decompilation of those three functions into exactly the amount of time
that the Touhou Patch Center community votes alloted to TH03
reverse-engineering!
And a pretty amazing one at that. The original code was so obviously
written in ASM and was just barely decompilable by exclusively using
register pseudovariables and a bit of goto, but I was able to
abstract most of that away, not least thanks to a few helpful optimization
properties of Turbo C++… seriously, I can't stop marveling at this ancient
compiler. The end result is both readable, clear, and dare I say
portable?! To anyone interested in porting TH03,
take a look. How painful would it be to port that away from 16-bit
x86?
However, this push is also a typical example that the RE/PI priorities can
only control what I look at, and the outcome can actually differ
greatly. Even though the priorities were 65% RE and 35% PI, the progress
outcome was +0.13% RE and +1.35% PI. But hey, we've got one more push with
a focus on TH03 PI, so maybe that one will include more RE than
PI, and then everything will end up just as ordered?
Boss explosions! And… urgh, I really also had to wade through that overly complicated HUD rendering code. Even though I had to pick -Tom-'s 7th push here as well, the worst of that is still to come. TH04 and TH05 exclusively store the current and high score internally as unpacked little-endian BCD, with some pretty dense ASM code involving the venerable x86 BCD instructions to update it.
So, what's actually the goal here. Since I was given no priorities , I still haven't had to (potentially) waste time researching whether we really can decompile from anywhere else inside a segment other than backwards from the end. So, the most efficient place for decompilation right now still is the end of TH05's main_01_TEXT segment. With maybe 1 or 2 more reverse-engineering commits, we'd have everything for an efficient decompilation up to sub_123AD. And that mass of code just happens to include all the shot type control functions, and makes up 3,007 instructions in total, or 12% of the entire remaining unknown code in MAIN.EXE.
So, the most reasonable thing would be to actually put some of the upcoming decompilation pushes towards reverse-engineering that missing part. I don't think that's a bad deal since it will allow us to mod TH05 shot types in C sooner, but zorg and qp might disagree
Next up: thcrap TL notes, followed by finally finishing GhostPhanom's old ReC98 future-proofing pushes. I really don't want to decompile without a proper build system.