Stripe is now
properly integrated into this website as an alternative to PayPal! Now, you
can also financially support the project if PayPal doesn't work for you, or
if you prefer using a
provider out of Stripe's greater variety. It's unfortunate that I had to
ship this integration while the store is still sold out, but the Shuusou
Gyoku OpenGL backend has turned out way too complicated to be finished next
to these two pushes within a month. It will take quite a while until the
store reopens and you all can start using Stripe, so I'll just link back to
this blog post when it happens.
Integrating Stripe wasn't the simplest task in the world either. At first,
the Checkout API
seems pretty friendly to developers: The entire payment flow is handled on
the backend, in the server language of your choice, and requires no frontend
JavaScript except for the UI feedback code you choose to write. Your
backend API endpoint initiates the Stripe Checkout session, answers with a
redirect to Stripe, and Stripe then sends a redirect back to your server if
the customer completed the payment. Superficially, this server-based
approach seems much more GDPR-friendly than PayPal, because there are no
remote scripts to obtain consent for. In reality though, Stripe shares
much more potential personal data about your credit card or bank
account with a merchant, compared to PayPal's almost bare minimum of
necessary data.
It's also rather annoying how the backend has to persist the order form
information throughout the entire Checkout session, because it would
otherwise be lost if the server restarts while a customer is still busy
entering data into Stripe's Checkout form. Compare that to the PayPal
JavaScript SDK, which only POSTs back to your server after the
customer completed a payment. In Stripe's case, more JavaScript actually
only makes the integration harder: If you trigger the initial payment
HTTP request from JavaScript, you will have
to improvise a bit to avoid the CORS error when redirecting away to a
different domain.
But sure, it's all not too bad… for regular orders at least. With
subscriptions, however, things get much worse. Unlike PayPal, Stripe
kind of wants to stay out of the way of the payment process as much as
possible, and just be a wrapper around its supported payment methods. So if
customers aren't really meant to register with Stripe, how would they cancel
their subscriptions?
Answer: Through
the… merchant? Which I quite dislike in principle, because why should
you have to trust me to actually cancel your subscription after you
requested it? It also means that I probably should add some sort of UI for
self-canceling a Stripe subscription, ideally without adding full-blown user
accounts. Not that this solves the underlying trust issue, but it's more
convenient than contacting me via email or, worse, going through your bank
somehow. Here is how my solution works:
When setting up a Stripe subscription, the server will generate a random
ID for authentication. This ID is then used as a salt for a hash
of the Stripe subscription ID, linking the two without storing the latter on
my server.
The thank you page, which is parameterized with the Stripe
Checkout session ID, will use that ID to retrieve the subscription
ID via an API call to Stripe, and display it together with the above
salt. This works indefinitely – contrary to what the expiry field in the
Checkout session object suggests, Stripe sessions are indeed stored
forever. After all, Stripe also displays this session information in a
merchant's transaction log with an excessive amount of detail. It might have
been better to add my own expiration system to these pages, but this had
been taking long enough already. For now, be aware that sharing the link to
a Stripe thank you page is equivalent to sharing your subscription
cancellation password.
The salt is then used as the key for a subscription management page. To
cancel, you visit this page and enter the Stripe subscription ID to confirm.
The server then checks whether the salt and subscription ID pair belong to
each other, and sends the actual cancellation
request back to Stripe if they do.
I might have gone a bit overboard with the crypto there, but I liked the
idea of not storing any of the Stripe session IDs in the server database.
It's not like that makes the system more complex anyway, and it's nice to
have a separate confirmation step before canceling a subscription.
But even that wasn't everything I had to keep in mind here. Once you
switch from test to production mode for the final tests, you'll notice that
certain SEPA-based
payment providers take their sweet time to process and activate new
subscriptions. The Checkout session object even informs you about that, by
including a payment status field. Which initially seems just like
another field that could indicate hacking attempts, but treating it as such
and rejecting any unpaid session can also reject perfectly valid
subscriptions. I don't want all this control… 🥲
Instead, all I can do in this case is to tell you about it. In my test, the
Stripe dashboard said that it might take days or even weeks for the initial
subscription transaction to be confirmed. In such a case, the respective
fraction of the cap will unfortunately need to remain red for that entire time.
And that was 1½ pushes just to replicate the basic functionality of a simple
PayPal integration with the simplest type of Stripe integration. On the
architectural site, all the necessary refactoring work made me finally
upgrade my frontend code to TypeScript at least, using the amazing esbuild to handle transpilation inside
the server binary. Let's see how long it will now take for me to upgrade to
SCSS…
With the new payment options, it makes sense to go for another slight price
increase, from up to per push.
The amount of taxes I have to pay on this income is slowly becoming
significant, and the store has been selling out almost immediately for the
last few months anyway. If demand remains at the current level or even
increases, I plan to gradually go up to by the end
of the year. 📝 As📝 usual,
I'm going to deliver existing orders in the backlog at the value they were
originally purchased at. Due to the way the cap has to be calculated, these
contributions now appear to have increased in value by a rather awkward
13.33%.
This left ½ of a push for some more work on the TH01 Anniversary Edition.
Unfortunately, this was too little time for the grand issue of removing
byte-aligned rendering of bigger sprites, which will need some additional
blitting performance research. Instead, I went for a bunch of smaller
bugfixes:
ANNIV.EXE now launches ZUNSOFT.COM if
MDRV98 wasn't resident before. In hindsight, it's completely obvious
why this is the right thing to do: Either you start
ANNIV.EXE directly, in which case there's no resident
MDRV98 and you haven't seen the ZUN Soft logo, or you have
made a single-line edit to GAME.BAT and replaced
op with anniv, in which case MDRV98 is
resident and you have seen the logo. These are the two
reasonable cases to support out of the box. If you are doing
anything else, it shouldn't be that hard to adjust though?
You might be wondering why I didn't just include all code of
ZUNSOFT.COM inside ANNIV.EXE together with
the rest of the game. The reason: ZUNSOFT.COM has
almost nothing in common with regular TH01 code. While the rest of
TH01 uses the custom image formats and bad rendering code I
documented again and again during its RE process,
ZUNSOFT.COM fully relies on master.lib for everything
about the bouncing-ball logo animation. Its code is much closer to
TH02 in that respect, which suggests that ZUN did in fact write this
animation for TH02, and just included the binary in TH01 for
consistency when he first sold both games together at Comiket 52.
Unlike the 📝 various bad reasons for splitting the PC-98 Touhou games into three main executables,
it's still a good idea to split off animations that use a completely
different set of rendering and file format functions. Combined with
all the BFNT and shape rendering code, ZUNSOFT.COM
actually contains even more unique code than OP.EXE,
and only slightly less than FUUIN.EXE.
The optional AUTOEXEC.BAT is now correctly encoded in
Shift-JIS instead of accidentally being UTF-8, fixing the previous
mojibake in its final ECHO line.
The command-line option that just adds a stage selection without
other debug features (anniv s) now works reliably.
This one's quite interesting because it only ever worked
because of a ZUN bug. From a superficial look at the code, it
shouldn't: While the presence of an 's' branch proves
that ZUN had such a mode during development, he nevertheless forgot
to initialize the debug flag inside the resident structure within
this branch. This mode only ever worked because master.lib's
resdata_create() function doesn't clear the resident
structure after allocation. If anything on the system previously
happened to write something other than 0x00,
0x01, or 0x03 to the specific byte that
then gets repurposed as the debug mode flag, this lack of
initialization does in fact result in a distinct non-test and
non-debug stage selection mode.
This is what happens on a certain widely circulated .HDI copy of
TH01 that boots MS-DOS 3.30C. On this system, the memory that
master.lib will allocate to the TH01 resident structure was
previously used by DOS as stack for its kernel, which left the
future resident debug flag byte at address 9FF6:0012 at
a value of 0x12. This might be the entire reason why
game s is even widely documented to trigger a stage
selection to begin with – on the widely circulated TH04 .HDI that
boots MS-DOS 6.20, or on DOSBox-X, the s parameter
doesn't work because both DOS systems leave the resident debug flag
byte at 0x00. And since ANNIV.EXE pushes
MDRV98 into that area of conventional DOS RAM, anniv s
previously didn't work even on MS-DOS 3.30C.
Both bugs in the
📝 1×1 particle system during the Mima fight
have been fixed. These include the off-by-one error that killed off the
very first particle on the 80th
frame and left it in VRAM, and, just like every other entity type, a
replacement of ZUN's EGC unblitter with the new pixel-perfect and fast
one. Until I've rearchitected unblitting as a whole, the particles will
now merely rip barely visible 1×1 holes into the sprites they overlap.
The bomb value shown in the lowest line of the in-game
debug mode output is now right-aligned together with the rest of the
values. This ensures that the game always writes a consistent number
of characters to TRAM, regardless of the magnitude of the
bomb value, preventing the seemingly wrong
timer values that appeared in the original game
whenever the value of the bomb variable changed to a
lower number of digits:
Finally, I've streamlined VRAM page access changes, which allowed me to
consistently replace ZUN's expensive function call with the optimal two
inlined x86 instructions. Interestingly, this change alone removed
2 KiB from the binary size, which is almost all of the difference
between 📝 the P0234-1 release and this
one. Let's see how much longer we can make each new release of
ANNIV.EXE smaller than the previous one.
The final point, however, raised the question of what we're now going to do
about
📝 a certain issue in the 地獄/Jigoku Bad Ending.
ZUN's original expensive way of switching the accessed VRAM page was the
main reason behind the lag frames on slower PC-98 systems, and
search-replacing the respective function calls would immediately get us to
the optimized version shown in that blog post. But is this something we
actually want? If we wanted to retain the lag, we could surely preserve that
function just for this one instance… The discovery of this issue
predates the clear distinction between bloat, quirks, and bugs, so it makes
sense to first classify what this issue even is. The distinction comes all
down to observability, which I defined as changes to rendered frames
between explicitly defined frame boundaries. That alone would be enough to
categorize any cause behind lag frames as bloat, but it can't hurt to be
more explicit here.
Therefore, I now officially judge observability in terms of an infinitely
fast PC-98 that can instantly render everything between two explicitly
defined frames, and will never add additional lag frames. If we plan to port
the games to faster architectures that aren't bottlenecked by disappointing
blitter chips, this is the only reasonable assumption to make, in my
opinion: The minimum system requirements in the games' README files are
minimums, after all, not recommendations. Chasing the exact frame
drop behavior that ZUN must have experienced during the time he developed
these games can only be a guessing game at best, because how can we know
which PC-98 model ZUN actually developed the games on? There might even be
more than one model, especially when it comes to TH01 which had been in
development for at least two years before ZUN first sold it. It's also not
like any current PC-98 emulator even claims to emulate the specific timing
of any existing model, and I sure hope that nobody expects me to import a
bunch of bulky obsolete hardware just to count dropped frames.
That leaves the tearing, where it's much more obvious how it's a bug. On an
infinitely fast PC-98, the ドカーン
frame would never be visible, and thus falls into the same category as the
📝 two unused animations in the Sariel fight.
With only a single unconditional 2-frame delay inside the animation loop, it
becomes clear that ZUN intended both frames of the animation to be displayed
for 2 frames each:
No tearing, and 34 frames in total for the first of the two
instances of this animation.
Next up: Taking the oldest still undelivered push and working towards TH04
position independence in preparation for multilingual translations. The
Shuusou Gyoku OpenGL backend shouldn't take that much longer either,
so I should have lots of stuff coming up in May afterward.
> "OK, TH03/TH04/TH05 cutscenes done, let's quickly finish the Touhou Patch Center MediaWiki upgrade. Just some scripting and verification left, it will be done so quickly that I don't even have to mention it on this blog"
> Still not done after 3 weeks
> Blocked by one final critical bug that really should be fixed upstream
> Code reviewers are probably on vacation
And so, the year unfortunately ended with yet another slow month. During the
MediaWiki upgrade, I was slowly decompiling the TH05 Sara fight on the side,
but stumbled over one interesting but high-maintenance detail there that
would really enhance her blog post. TH02 would need a lot of attention for
the basic rendering calls as well…
…so let's end the year with Shuusou Gyoku instead, looking at its most
critical issue in particular. As if that were the easy option here…
The game does not run properly on modern Windows systems due to its usage of
the ancient DirectDraw APIs, with issues ranging from unbearable slowdown to
glitched colors to the game not even starting at all. Thankfully, Shuusou
Gyoku is not the only ancient Windows game affected by these issues, and
people have developed a variety of generic DirectDraw wrappers and patches
for playing such games on modern systems. Out of all these, DDrawCompat is one of the
simpler solutions for Shuusou Gyoku in particular: Just drop its
ddraw proxy DLL into the game directory, and the game will run
as it's supposed to.
So let's just bundle that DLL with all my future Shuusou Gyoku releases
then? That would have been the quick and dirty option, coming with
several drawbacks:
Linux users might be annoyed by the potential need to configure a native
DLL override for ddraw.dll. It's not too much of an issue as we
could simply rename the DLL and replace the import with the new name.
However, doing that reproducibly would already involve changes to either the
DDrawCompat or Shuusou Gyoku build process.
Win32 API hooking is another potential point of failure in general,
requiring continual maintenance for new Windows versions. This is not even a
hypothetical concern: DDrawCompat does rely on particularly volatile Win32
API details, to the point that the recent Windows 11 22H2 update completely
broke it, causing a hang at startup that required a workaround.
But sure, it's still just a single third-party component. Keeping it up to
date doesn't sound too bad by itself…
…if DDrawCompat weren't evolving way beyond what we need to keep Shuusou
Gyoku running. Being a typical DirectDraw wrapper, it has always aimed to
solve all sorts of issues in old DirectDraw games. However, the latest
version, 0.4.0, has gone above and beyond in this regard, adding lots of
configuration options with default settings that actually
break Shuusou Gyoku.
To get a glimpse of how this is likely to play out, we only have to look at
the more mature DxWnd
project. In its expert mode, DxWnd features three rows of tabs, each packed
with checkboxes that toggle individual hacks, and most of these are
related to something that Shuusou Gyoku could be affected by. Imagine
checking a precise permutation of a three-digit number of checkboxes just to
keep an old game running at full speed on modern systems…
Finally, aesthetic and bloat considerations. If
📝 C++ fstreams were already too embarrassing
with the ~100 KB of bloat they add to the binary, a 565 KiB DLL is
even worse. And that's the old version 0.3.2 – version 0.4.0 comes in
at 2.43 MiB.
Fortunately, I had the budget to dig a bit deeper and figure out what
exactly DDrawCompat does to make Shuusou Gyoku work properly. Turns
out that among all the hooks and patches, the game only needs the most
central one: Enforcing a 32-bit display mode regardless of whatever lower
bit depth the game requests natively, combined with converting the game's
pixel buffer to 32-bit on the fly.
So does this mean that adding 32-bit to the game's list of supported bit
depths is everything we have to do?
Interestingly, Shuusou Gyoku already saved the DirectDraw enumeration flag
that indicates support for 32-bit display modes. The official version just
did nothing with it.
Well, almost everything. Initially, this surprised me as well: With
all the if statements checking for precise bit depths, you
would think that supporting one more bit depth would be way harder in this
code base. As it turned out though, these conditional branches are not
really about 8-bit or 16-bit color for the most part, but instead
differentiate between two very distinct rendering approaches:
"8-bit" is a pure 2D mode with palettized colors,
while "16-bit" is a hybrid 2D/3D mode that uses Direct3D 2 on top of DirectDraw, with
3-channel RGB colors.
Consequently, most of these branches deal with differences between these two
approaches that couldn't be nicely abstracted away in pbg's renderer
interface: Specific palette changes that are exclusive to "8-bit" mode, or
certain entities and effects whose Direct3D draw calls in "16-bit" mode
require tailor-made approximations for the "8-bit" mode. Since our new
32-bit mode is equivalent to the 16-bit mode in all of these branches, I
only needed to replace the raw number comparisons with more meaningful
method calls.
That only left a very small number of 2D raster effects that directly write
to or read from DirectDraw surface memory, and therefore do need to know the
bit size of each pixel. Thanks to std::variant and
std::visit(), adding 32-bit support becomes trivial here: By
rewriting the code in a generic manner that derives all offsets from the
template type, you only have to say hey,
I'd like to have 32-bit as well, and C++ will automatically
instantiate correct 32-bit variants of all bit depth-dependent code
snippets.
There are only three features in the entire game that access pixel buffers
this way: a color key retrieval function, the lens ball animation on the
logo screen, and… the ending staff roll? Sure, the text sprites fade in and
out, but so does the picture next to it, using Direct3D alpha blending or
palette color ramping depending on the current rendering mode. Instead, the
only reason why these sprites directly access their pixel buffer is… an
unused and pretty wild spiral effect. 😮 It's still part of the code, and
only doesn't show up because the
parameters that control its timing were commented out before release:
They probably considered it too wild for the mood of this
ending.
The main ending text was the only remaining issue of mojibake present in my
previous Shuusou Gyoku builds, and is now fixed as well. Windows can
render Shift-JIS text via GDI even outside Japanese locale, but only when
explicitly selecting a font that supports the SHIFTJIS_CHARSET,
and the game simply didn't select any font for rendering this text.
Thus, GDI fell back onto its default font, which obviously is only
guaranteed to support the SHIFTJIS_CHARSET if your system
locale is set to Japanese. This is why the font in the original game might
lookdifferent between systems.
For my build, I chose the font that would appear on a clean Windows
installation – a basic 400-weighted MS Gothic at font size 16, which is
already used all throughout the game.
Alright, 32-bit mode complete, let's set it as the default if possible… and
break compatibility to the original 秋霜CFG.DAT format in the
process? When validating this file, the original game only allows the
originally supported 8-bit or 16-bit modes. Setting the
BitDepth field to any other value causes the entire file
to be reset to its defaults, re-locking the Extra Stage in the process.
Introducing a backward-compatible version
system for 秋霜CFG.DAT was beyond the scope of this push.
Changing the validation to a per-field approach was a good small first step
to take though. The new build no longer validates the BitDepth
field against a fixed list, but against the actually supported bit depths on
your system, picking a different supported one if necessary. With the
original approach, this would have caused your entire configuration to fail
the validation check. Instead, you can now safely update to the new build
without losing your option settings, or your previously unlocked access to
the Extra Stage.
Side note: The validation limit for starting bombs is off by one, and the
one for starting lives check is off by two. By modifying
秋霜CFG.DAT, you could theoretically get new games to start with
7 lives and 3 bombs… if you then calculate a correct checksum for your
hacked config file, that is. 🧑💻
Interestingly, DirectDraw doesn't even indicate support for 8-bit or 16-bit
color on systems that are affected by the initially mentioned issues.
Therefore, these issues are not the fault of DirectDraw, but of
Shuusou Gyoku, as the original release requested a bit depth that it has
even verified to be unsupported. Unfortunately, Windows sides with
Sim City Shuusou Gyoku here: If you previously experimented with the
Windows app compatibility settings, you might have ended up with the
DWM8And16BitMitigation flag assigned to the full file path of
your Shuusou Gyoku executable in either
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Layers, or
As the term mitigation suggests, these modes are (poorly) emulated,
which is exactly what causes the issues with this game in the first place.
Sure, this might be the lesser evil from the point of view of an operating
system: If you don't have the budget for a full-blown DDrawCompat-style
DirectDraw wrapper, you might consider it better for users to have the game
run poorly than have it fail at startup due to incorrect API usage.
Controlling this with a flag that sticks around for future runs of a binary
is definitely suboptimal though, especially given how hard it
is to programmatically remove this flag within the binary itself. It
only adds additional complexity to the ideal clean upgrade path.
So, make sure to check your registry and manually remove these flags for the
time being. Without them, the new Config → Graphic menu will
correctly prevent you from selecting anything else but 32-bit on modern
Windows.
After all that, there was just enough time left in this push to implement
basic locale independence, as requested by the Seihou development
Discord group, without looking into automatic fixes for previous mojibake
filenames yet. Combining std::filesystem::path with the native
Win32 API should be straightforward and bloat-free, especially with all the
abstractions I've been building, right?
Well, turns out that std::filesystem::path does not
actually meet my expectations. At least as long as it's not
constexpr-enabled, because you still get the unfortunate
conversion from narrow to wide encoding at runtime, even for globals with
static storage duration. That brings us back to writing our path abstraction
in terms of the regular std::string and
std::wstring containers, which at least allow us to enforce the
respective encoding at compile time. Even std::string_view only
adds to the complexity here, as its strings are never inherently
null-terminated, which is required by both the POSIX and Win32 APIs. Not to
mention dynamic filenames: C++20's std::format() would be the
obvious idiomatic choice here, but using it almost doubles the size
of the compiled binary… 🤮
In the end, the most bloat-free way of implementing C++ file I/O in 2023 is
still the same as it was 30 years ago: Call system APIs, roll a custom
abstraction that conditionally uses the L prefix, and pass
around raw pointers. And if you need a dynamic filename, just write the
dynamic characters into arrays at fixed positions. Just as PC-98 Touhou used
to do…
Oh, and the game's window also uses a Unicode title bar now.
And that's it for this push! Make sure to rename your configuration
(秋霜CFG.DAT), score (秋霜SC.DAT), and replay
(秋霜りぷ*.DAT) filenames if you were previously running the
game on a non-Japanese locale, and then grab the new build:
Next up: Starting the new year with all my plans hopefully working out for
once. TH05 Sara very soon, ZMBV code review afterward, low-hanging fruit of
the TH01 Anniversary Edition after that, and then kicking off TH02 with a
bunch of low-level blitting code.
More than three months without any reverse-engineering progress! It's been
way too long. Coincidentally, we're at least back with a surprising 1.25% of
overall RE, achieved within just 3 pushes. The ending script system is not
only more or less the same in TH04 and TH05, but actually originated in
TH03, where it's also used for the cutscenes before stages 8 and 9. This
means that it was one of the final pieces of code shared between three of
the four remaining games, which I got to decompile at roughly 3× the usual
speed, or ⅓ of the price.
The only other bargains of this nature remain in OP.EXE. The
Music Room is largely equivalent in all three remaining games as well, and
the sound device selection, ZUN Soft logo screens, and main/option menus are
the same in TH04 and TH05. A lot of that code is in the "technically RE'd
but not yet decompiled" ASM form though, so it would shift Finalized% more
significantly than RE%. Therefore, make sure to order the new
Finalization option rather than Reverse-engineering if you
want to make number go up.
So, cutscenes. On the surface, the .TXT files look simple enough: You
directly write the text that should appear on the screen into the file
without any special markup, and add commands to define visuals, music, and
other effects at any place within the script. Let's start with the basics of
how text is rendered, which are the same in all three games:
First off, the text area has a size of 480×64 pixels. This means that it
does not correspond to the tiled area painted into TH05's
EDBK?.PI images:
The yellow area is designated for character names.
Since the font weight can be customized, all text is rendered to VRAM.
This also includes gaiji, despite them ignoring the font weight
setting.
The system supports automatic line breaks on a per-glyph basis, which
move the text cursor to the beginning of the red text area. This might seem like a piece of long-forgotten
ancient wisdom at first, considering the absence of automatic line breaks in
Windows Touhou. However, ZUN probably implemented it more out of pure
necessity: Text in VRAM needs to be unblitted when starting a new box, which
is way more straightforward and performant if you only need to worry
about a fixed area.
The system also automatically starts a new (key press-separated) text
box after the end of the 4th line. However, the text cursor is
also unconditionally moved to the top-left corner of the yellow name
area when this happens, which is almost certainly not what you expect, given
that automatic line breaks stay within the red area. A script author might
as well add the necessary text box change commands manually, if you're
forced to anticipate the automatic ones anyway…
Due to ZUN forgetting an unblitting call during the TH05 refactoring of the
box background buffer, this feature is even completely broken in that game,
as any new text will simply be blitted on top of the old one:
Wait, why are we already talking about game-specific differences after
all? Also, note how the ⏎ animation appears one line below where you'd
expect it.
Overall, the system is geared toward exclusively full-width text. As
exemplified by the 2014 static English patches and the screenshots in this
blog post, half-width text is possible, but comes with a lot of
asterisks attached:
Each loop of the script interpreter starts by looking at the next
byte to distinguish commands from text. However, this step also skips
over every ASCII space and control character, i.e., every byte
≤ 32. If you only intend to display full-width glyphs anyway, this
sort of makes sense: You gain complete freedom when it comes to the
physical layout of these script files, and it especially allows commands
to be freely separated with spaces and line breaks for improved
readability. Still, enforcing commands to be separated exclusively by
line breaks might have been even better for readability, and would have
freed up ASCII spaces for regular text…
Non-command text is blindly processed and rendered two bytes at a
time. The rendering function interprets these bytes as a Shift-JIS
string, so you can use half-width characters here. While the
second byte can even be an ASCII 0x20 space due to the
parser's blindness, all half-width characters must still occur in pairs
that can't be interrupted by commands:
As a workaround for at least the ASCII space issue, you can replace
them with any of the unassigned
Shift-JIS lead bytes – 0x80, 0xA0, or
anything between 0xF0 and 0xFF inclusive.
That's what you see in all screenshots of this post that display
half-width spaces.
Finally, did you know that you can hold ESC to fast-forward
through these cutscenes, which skips most frame delays and reduces the rest?
Due to the blocking nature of all commands, the ESC key state is
only updated between commands or 2-byte text groups though, so it can't
interrupt an ongoing delay.
Superficially, the list of game-specific differences doesn't look too long,
and can be summarized in a rather short table:
It's when you get into the implementation that the combined three systems
reveal themselves as a giant mess, with more like 56 differences between the
games. Every single new weird line of code opened up
another can of worms, which ultimately made all of this end up with 24
pieces of bloat and 14 bugs. The worst of these should be quite interesting
for the general PC-98 homebrew developers among my audience:
The final official 0.23 release of master.lib has a bug in
graph_gaiji_put*(). To calculate the JIS X 0208 code point for
a gaiji, it is enough to ADD 5680h onto the gaiji ID. However,
these functions accidentally use ADC instead, which incorrectly
adds the x86 carry flag on top, causing weird off-by-one errors based on the
previous program state. ZUN did fix this bug directly inside master.lib for
TH04 and TH05, but still needed to work around it in TH03 by subtracting 1
from the intended gaiji ID. Anyone up for maintaining a bug-fixed master.lib
repository?
The worst piece of bloat comes from TH03 and TH04 needlessly
switching the visibility of VRAM pages while blitting a new 320×200 picture.
This makes it much harder to understand the code, as the mere existence of
these page switches is enough to suggest a more complex interplay between
the two VRAM pages which doesn't actually exist. Outside this visibility
switch, page 0 is always supposed to be shown, and page 1 is always used
for temporarily storing pixels that are later crossfaded onto page 0. This
is also the only reason why TH03 has to render text and gaiji onto both VRAM
pages to begin with… and because TH04 doesn't, changing the picture in the
middle of a string of text is technically bugged in that game, even though
you only get to temporarily see the new text on very underclocked PC-98
systems.
These performance implications made me wonder why cutscenes even bother with
writing to the second VRAM page anyway, before copying each crossfade step
to the visible one.
📝 We learned in June how costly EGC-"accelerated" inter-page copies are;
shouldn't it be faster to just blit the image once rather than twice?
Well, master.lib decodes .PI images into a packed-pixel format, and
unpacking such a representation into bitplanes on the fly is just about the
worst way of blitting you could possibly imagine on a PC-98. EGC inter-page
copies are already fairly disappointing at 42 cycles for every 16 pixels, if
we look at the i486 and ignore VRAM latencies. But under the same
conditions, packed-pixel unpacking comes in at 81 cycles for every 8
pixels, or almost 4× slower. On lower-end systems, that can easily sum up to
more than one frame for a 320×200 image. While I'd argue that the resulting
tearing could have been an acceptable part of the transition between two
images, it's understandable why you'd want to avoid it in favor of the
pure effect on a slower framerate.
Really makes me wonder why master.lib didn't just directly decode .PI images
into bitplanes. The performance impact on load times should have been
negligible? It's such a good format for
the often dithered 16-color artwork you typically see on PC-98, and
deserves better than master.lib's implementation which is both slow to
decode and slow to blit.
That brings us to the individual script commands… and yes, I'm going to
document every single one of them. Some of their interactions and edge cases
are not clear at all from just looking at the code.
Almost all commands are preceded by… well, a 0x5C lead byte.
Which raises the question of whether we should
document it as an ASCII-encoded \ backslash, or a Shift-JIS-encoded
¥ yen sign. From a gaijin perspective, it seems obvious that it's a
backslash, as it's consistently displayed as one in most of the editors you
would actually use nowadays. But interestingly, iconv
-f shift-jis -t utf-8 does convert any 0x5C
lead bytes to actual ¥ U+00A5 YEN SIGN code points
.
Ultimately, the distinction comes down to the font. There are fonts
that still render 0x5C as ¥, but mainly do so out
of an obvious concern about backward compatibility to JIS X 0201, where this
mapping originated. Unsurprisingly, this group includes MS Gothic/Mincho,
the old Japanese fonts from Windows 3.1, but even Meiryo and Yu
Gothic/Mincho, Microsoft's modern Japanese fonts. Meanwhile, pretty much
every other modern font, and freely licensed ones in particular, render this
code point as \, even if you set your editor to Shift-JIS. And
while ZUN most definitely saw it as a ¥, documenting this code
point as \ is less ambiguous in the long run. It can only
possibly correspond to one specific code point in either Shift-JIS or UTF-8,
and will remain correct even if we later mod the cutscene system to support
full-blown Unicode.
Now we've only got to clarify the parameter syntax, and then we can look at
the big table of commands:
Numeric parameters are read as sequences of up to 3 ASCII digits. This
limits them to a range from 0 to 999 inclusive, with 000 and
0 being equivalent. Because there's no further sentinel
character, any further digit from the 4th one onwards is
interpreted as regular text.
Filename parameters must be terminated with a space or newline and are
limited to 12 characters, which translates to 8.3 basenames without any
directory component. Any further characters are ignored and displayed as
text as well.
Each .PI image can contain up to four 320×200 pictures ("quarters") for
the cutscene picture area. In the script commands, they are numbered like
this:
0
1
2
3
\@
Clears both VRAM pages by filling them with VRAM color 0. 🐞
In TH03 and TH04, this command does not update the internal text area
background used for unblitting. This bug effectively restricts usage of
this command to either the beginning of a script (before the first
background image is shown) or its end (after no more new text boxes are
started). See the image below for an
example of using it anywhere else.
\b2
Sets the font weight to a value between 0 (raw font ROM glyphs) to 3
(very thicc). Specifying any other value has no effect.
🐞 In TH04 and TH05, \b3 leads to glitched pixels when
rendering half-width glyphs due to a bug in the newly micro-optimized
ASM version of
📝 graph_putsa_fx(); see the image below for an example.
In these games, the parameter also directly corresponds to the
graph_putsa_fx() effect function, removing the sanity check
that was present in TH03. In exchange, you can also access the four
dissolve masks for the bold font (\b2) by specifying a
parameter between 4 (fewest pixels) to 7 (most
pixels). Demo video below.
\c15
Changes the text color to VRAM color 15.
\c=字,15
Adds a color map entry: If 字 is the first code point
inside the name area on a new line, the text color is automatically set
to 15. Up to 8 such entries can be registered
before overflowing the statically allocated buffer.
🐞 The comma is assumed to be present even if the color parameter is omitted.
\e0
Plays the sound effect with the given ID.
\f
(no-op)
\fi1
\fo1
Calls master.lib's palette_black_in() or
palette_black_out() to play a hardware palette fade
animation from or to black, spending roughly 1 frame on each of the 16 fade steps.
\fm1
Fades out BGM volume via PMD's AH=02h interrupt call,
in a non-blocking way. The fade speed can range from 1 (slowest) to 127 (fastest).
Values from 128 to 255 technically correspond to
AH=02h's fade-in feature, which can't be used from cutscene
scripts because it requires BGM volume to first be lowered via
AH=19h, and there is no command to do that.
\g8
Plays a blocking 8-frame screen shake
animation.
\ga0
Shows the gaiji with the given ID from 0 to 255
at the current cursor position. Even in TH03, gaiji always ignore the
text delay interval configured with \v.
@3
TH05's replacement for the \ga command from TH03 and
TH04. The default ID of 3 corresponds to the
gaiji. Not to be confused with \@, which starts with a backslash,
unlike this command.
@h
Shows the gaiji.
@t
Shows the gaiji.
@!
Shows the gaiji.
@?
Shows the gaiji.
@!!
Shows the gaiji.
@!?
Shows the gaiji.
\k0
Waits 0 frames (0 = forever) for an advance key to be pressed before
continuing script execution. Before waiting, TH05 crossfades in any new
text that was previously rendered to the invisible VRAM page…
🐞 …but TH04 doesn't, leaving the text invisible during the wait time.
As a workaround, \vp1 can be
used before \k to immediately display that text without a
fade-in animation.
\m$
Stops the currently playing BGM.
\m*
Restarts playback of the currently loaded BGM from the
beginning.
\m,filename
Stops the currently playing BGM, loads a new one from the given
file, and starts playback.
\n
Starts a new line at the leftmost X coordinate of the box, i.e., the
start of the name area. This is how scripts can "change" the name of the
currently speaking character, or use the entire 480×64 pixels without
being restricted to the non-name area.
Note that automatic line breaks already move the cursor into a new line.
Using this command at the "end" of a line with the maximum number of 30
full-width glyphs would therefore start a second new line and leave the
previously started line empty.
If this command moved the cursor into the 5th line of a box,
\s is executed afterward, with
any of \n's parameters passed to \s.
\p
(no-op)
\p-
Deallocates the loaded .PI image.
\p,filename
Loads the .PI image with the given file into the single .PI slot
available to cutscenes. TH04 and TH05 automatically deallocate any
previous image, 🐞 TH03 would leak memory without a manual prior call to
\p-.
\pp
Sets the hardware palette to the one of the loaded .PI image.
\p@
Sets the loaded .PI image as the full-screen 640×400 background
image and overwrites both VRAM pages with its pixels, retaining the
current hardware palette.
\p=
Runs \pp followed by \p@.
\s0
\s-
Ends a text box and starts a new one. Fades in any text rendered to
the invisible VRAM page, then waits 0 frames
(0 = forever) for an advance key to be
pressed. Afterward, the new text box is started with the cursor moved to
the top-left corner of the name area. \s- skips the wait time and starts the new box
immediately.
\t100
Sets palette brightness via master.lib's
palette_settone() to any value from 0 (fully black) to 200
(fully white). 100 corresponds to the palette's original colors.
Preceded by a 1-frame delay unless ESC is held.
\v1
Sets the number of frames to wait between every 2 bytes of rendered
text.
Sets the number of frames to spend on each of the 4 fade
steps when crossfading between old and new text. The game-specific
default value is also used before the first use of this command.
\v2
\vp0
Shows VRAM page 0. Completely useless in
TH03 (this game always synchronizes both VRAM pages at a command
boundary), only of dubious use in TH04 (for working around a bug in \k), and the games always return to
their intended shown page before every blitting operation anyway. A
debloated mod of this game would just remove this command, as it exposes
an implementation detail that script authors should not need to worry
about. None of the original scripts use it anyway.
\w64
\w and \wk wait for the given number
of frames
\wm and \wmk wait until PMD has played
back the current BGM for the total number of measures, including
loops, given in the first parameter, and fall back on calling
\w and \wk with the second parameter as
the frame number if BGM is disabled.
🐞 Neither PMD nor MMD reset the internal measure when stopping
playback. If no BGM is playing and the previous BGM hasn't been
played back for at least the given number of measures, this command
will deadlock.
Since both TH04 and TH05 fade in any new text from the invisible VRAM
page, these commands can be used to simulate TH03's typing effect in
those games. Demo video below.
Contrary to \k and \s, specifying 0 frames would
simply remove any frame delay instead of waiting forever.
The TH03-exclusive k variants allow the delay to be
interrupted if ⏎ Return or Shot are held down.
TH04 and TH05 recognize the k as well, but removed its
functionality.
All of these commands have no effect if ESC is held.
\wm64,64
\wk64
\wmk64,64
\wi1
\wo1
Calls master.lib's palette_white_in() or
palette_white_out() to play a hardware palette fade
animation from or to white, spending roughly 1 frame on each of the 16 fade steps.
\=4
Immediately displays the given quarter of the loaded .PI image in
the picture area, with no fade effect. Any value ≥ 4 resets the picture area to black.
\==4,1
Crossfades the picture area between its current content and quarter
#4 of the loaded .PI image, spending 1 frame on each of the 4 fade steps unless
ESC is held. Any value ≥ 4 is
replaced with quarter #0.
\$
Stops script execution. Must be called at the end of each file;
otherwise, execution continues into whatever lies after the script
buffer in memory.
TH05 automatically deallocates the loaded .PI image, TH03 and TH04
require a separate manual call to \p- to not leak its memory.
Bold values signify the default if the parameter
is omitted; \c is therefore
equivalent to \c15.
The \@ bug. Yes, the ¥ is fake. It
was easier to GIMP it than to reword the sentences so that the backslashes
landed on the second byte of a 2-byte half-width character pair.
The font weights and effects available through \b, including the glitch with
\b3 in TH04 and TH05.
Font weight 3 is technically not rendered correctly in TH03 either; if
you compare 1️⃣ with 4️⃣, you notice a single missing column of pixels
at the left side of each glyph, which would extend into the previous
VRAM byte. Ironically, the TH04/TH05 version is more correct in
this regard: For half-width glyphs, it preserves any further pixel
columns generated by the weight functions in the high byte of the 16-dot
glyph variable. Unlike TH03, which still cuts them off when rendering
text to unaligned X positions (3️⃣), TH04 and TH05 do bit-rotate them
towards their correct place (4️⃣). It's only at byte-aligned X positions
(2️⃣) where they remain at their internally calculated place, and appear
on screen as these glitched pixel columns, 15 pixels away from the glyph
they belong to. It's easy to blame bugs like these on micro-optimized
ASM code, but in this instance, you really can't argue against it if the
original C++ version was equally incorrect.
Combining \b and s- into a partial dissolve
animation. The speed can be controlled with \v.
Simulating TH03's typing effect in TH04 and TH05 via \w. Even prettier in TH05 where we
also get an additional fade animation
after the box ends.
So yeah, that's the cutscene system. I'm dreading the moment I will have to
deal with the other command interpreter in these games, i.e., the
stage enemy system. Luckily, that one is completely disconnected from any
other system, so I won't have to deal with it until we're close to finishing
MAIN.EXE… that is, unless someone requests it before. And it
won't involve text encodings or unblitting…
The cutscene system got me thinking in greater detail about how I would
implement translations, being one of the main dependencies behind them. This
goal has been on the order form for a while and could soon be implemented
for these cutscenes, with 100% PI being right around the corner for the TH03
and TH04 cutscene executables.
Once we're there, the "Virgin" old-school way of static translation patching
for Latin-script languages could be implemented fairly quickly:
Establish basic UTF-8 parsing for less painful manual editing of the
source files
Procedurally generate glyphs for the few required additional letters
based on existing font ROM glyphs. For example, we'd generate ä
by painting two short lines on top of the font ROM's a glyph,
or generate ¿ by vertically flipping the question mark. This
way, the text retains a consistent look regardless of whether the translated
game is run with an NEC or EPSON font ROM, or the that Neko Project II auto-generates if you
don't provide either.
(Optional) Change automatic line breaks to work on a per-word
basis, rather than per-glyph
That's it – script editing and distribution would be handled by your local
translation group. It might seem as if this would also work for Greek and
Cyrillic scripts due to their presence in the PC-98 font ROM, but I'm not
sure if I want to attempt procedurally shrinking these glyphs from 16×16 to
8×16… For any more thorough solution, we'd need to go for a more "Chad" kind
of full-blown translation support:
Implement text subdivisions at a sensible granularity while retaining
automatic line and box breaks
Compile translatable text into a Japanese→target language dictionary
(I'm too old to develop any further translation systems that would overwrite
modded source text with translations of the original text)
Implement a custom Unicode font system (glyphs would be taken from GNU
Unifont unless translators provide a different 8×16 font for their
language)
Combine the text compiler with the font compiler to only store needed
glyphs as part of the translation's font file (dealing with a multi-MB font
file would be rather ugly in a Real Mode game)
Write a simple install/update/patch stacking tool that supports both
.HDI and raw-file DOSBox-X scenarios (it's different enough from thcrap to
warrant a separate tool – each patch stack would be statically compiled into
a single package file in the game's directory)
Add a nice language selection option to the main menu
(Optional) Support proportional fonts
Which sounds more like a separate project to be commissioned from
Touhou Patch Center's Open Collective funds, separate from the ReC98 cap.
This way, we can make sure that the feature is completely implemented, and I
can talk with every interested translator to make sure that their language
works.
It's still cheaper overall to do this on PC-98 than to first port the games
to a modern system and then translate them. On the other hand, most
of the tasks in the Chad variant (3, 4, 5, and half of 2) purely deal with
the difficulty of getting arbitrary Unicode characters to work natively in a
PC-98 DOS game at all, and would be either unnecessary or trivial if we had
already ported the game. Depending on where the patrons' interests lie, it
may not be worth it. So let's see what all of you think about which
way we should go, or whether it's worth doing at all. (Edit
(2022-12-01): With Splashman's
order towards the stage dialogue system, we've pretty much confirmed that it
is.) Maybe we want to meet in the middle – using e.g. procedural glyph
generation for dynamic translations to keep text rendering consistent with
the rest of the PC-98 system, and just not support non-Latin-script
languages in the beginning? In any case, I've added both options to the
order form. Edit (2023-07-28):Touhou Patch Center has agreed to fund
a basic feature set somewhere between the Virgin and Chad level. Check the
📝 dedicated announcement blog post for more
details and ideas, and to find out how you can support this goal!
Surprisingly, there was still a bit of RE work left in the third push after
all of this, which I filled with some small rendering boilerplate. Since I
also wanted to include TH02's playfield overlay functions,
1/15 of that last push went towards getting a
TH02-exclusive function out of the way, which also ended up including that
game in this delivery.
The other small function pointed out how TH05's Stage 5 midboss pops into
the playfield quite suddenly, since its clipping test thinks it's only 32
pixels tall rather than 64:
Good chance that the pop-in might have been intended. Edit (2023-06-30): Actually, it's a
📝 systematic consequence of ZUN having to work around the lack of clipping in master.lib's sprite functions.
There's even another quirk here: The white flash during its first frame
is actually carried over from the previous midboss, which the
game still considers as actively getting hit by the player shot that
defeated it. It's the regular boilerplate code for rendering a
midboss that resets the responsible damage variable, and that code
doesn't run during the defeat explosion animation.
Next up: Staying with TH05 and looking at more of the pattern code of its
boss fights. Given the remaining TH05 budget, it makes the most sense to
continue in in-game order, with Sara and the Stage 2 midboss. If more money
comes in towards this goal, I could alternatively go for the Mai & Yuki
fight and immediately develop a pretty fix for the cheeto storage
glitch. Also, there's a rather intricate
pull request for direct ZMBV decoding on the website that I've still got
to review…
Wow, it's been 3 days and I'm already back with an unexpectedly long post
about TH01's bonus point screens? 3 days used to take much longer in my
previous projects…
Before I talk about graphics for the rest of this post, let's start with the
exact calculations for both bonuses. Touhou Wiki already got these right,
but it still makes sense to provide them here, in a format that allows you
to cross-reference them with the source code more easily. For the
card-flipping stage bonus:
Time
min((Stage timer * 3), 6553)
Continuous
min((Highest card combo * 100), 6553)
Bomb&Player
min(((Lives * 200) + (Bombs * 100)), 6553)
STAGE
min(((Stage number - 1) * 200), 6553)
BONUS Point
Sum of all above values * 10
The boss stage bonus is calculated from the exact same metrics, despite half
of them being labeled differently. The only actual differences are in the
higher multipliers and in the cap for the stage number bonus. Why remove it
if raising it high enough also effectively disables it?
Time
min((Stage timer * 5), 6553)
Continuous
min((Highest card combo * 200), 6553)
MIKOsan
min(((Lives * 500) + (Bombs * 200)), 6553)
Clear
min((Stage number * 1000), 65530)
TOTLE
Sum of all above values * 10
The transition between the gameplay and TOTLE screens is one of the more
impressive effects showcased in this game, especially due to how wavy it
often tends to look. Aside from the palette interpolation (which is, by the
way, the first time ZUN wrote a correct interpolation algorithm between two
4-bit palettes), the core of the effect is quite simple. With the TOTLE
image blitted to VRAM page 1:
Shift the contents of a line on VRAM page 0 by 32 pixels, alternating
the shift direction between right edge → left edge (even Y
values) and the other way round (odd Y values)
Keep a cursor for the destination pixels on VRAM page 1 for every line,
starting at the respective opposite edge
Blit the 32 pixels at the VRAM page 1 cursor to the newly freed 32
pixels on VRAM page 0, and advance the cursor towards the other edge
Successive line shifts will then include these newly blitted 32 pixels
as well
Repeat (640 / 32) = 20 times, after which all new pixels
will be in their intended place
So it's really more like two interlaced shift effects with opposite
directions, starting on different scanlines. No trigonometry involved at
all.
Horizontally scrolling pixels on a single VRAM page remains one of the few
📝 appropriate uses of the EGC in a fullscreen 640×400 PC-98 game,
regardless of the copied block size. The few inter-page copies in this
effect are also reasonable: With 8 new lines starting on each effect frame,
up to (8 × 20) = 160 lines are transferred at any given time, resulting
in a maximum of (160 × 2 × 2) = 640 VRAM page switches per frame for the newly
transferred pixels. Not that frame rate matters in this situation to begin
with though, as the game is doing nothing else while playing this effect.
What does sort of matter: Why 32 pixels every 2 frames, instead of 16
pixels on every frame? There's no performance difference between doing one
half of the work in one frame, or two halves of the work in two frames. It's
not like the overhead of another loop has a serious impact here,
especially with the PC-98 VRAM being said to have rather high
latencies. 32 pixels over 2 frames is also harder to code, so ZUN
must have done it on purpose. Guess he really wanted to go for that 📽
cinematic 30 FPS look 📽 here…
Removing the palette interpolation and transitioning from a black screen
to CLEAR3.GRP makes it a lot clearer how the effect works.
Once all the metrics have been calculated, ZUN animates each value with a
rather fancy left-to-right typing effect. As 16×16 images that use a single
bright-red color, these numbers would be
perfect candidates for gaiji… except that ZUN wanted to render them at the
more natural Y positions of the labels inside CLEAR3.GRP that
are far from aligned to the 8×16 text RAM grid. Not having been in the mood
for hardcoding another set of monochrome sprites as C arrays that day, ZUN
made the still reasonable choice of storing the image data for these numbers
in the single-color .GRC form– yeah, no, of course he once again
chose the .PTN hammer, and its
📝 16×16 "quarter" wrapper functions around nominal 32×32 sprites.
The three 32×32 TOTLE metric digit sprites inside
NUMB.PTN.
Why do I bring up such a detail? What's actually going on there is that ZUN
loops through and blits each digit from 0 to 9, and then continues the loop
with "digit" numbers from 10 to 19, stopping before the number whose ones
digit equals the one that should stay on screen. No problem with that in
theory, and the .PTN sprite selection is correct… but the .PTN
quarter selection isn't, as ZUN wrote (digit % 4)
instead of the correct ((digit % 10) % 4).
Since .PTN quarters are indexed in a row-major
way, the 10-19 part of the loop thus ends up blitting
2 →
3 →
0 →
1 →
6 →
7 →
4 →
5 →
(nothing):
This footage was slowed down to show one sprite blitting operation per
frame. The actual game waits a hardcoded 4 milliseconds between each
sprite, so even theoretically, you would only see roughly every
4th digit. And yes, we can also observe the empty quarter
here, only blitted if one of the digits is a 9.
Seriously though? If the deadline is looming and you've got to rush
some part of your game, a standalone screen that doesn't affect
anything is the best place to pick. At 4 milliseconds per digit, the
animation goes by so fast that this quirk might even add to its
perceived fanciness. It's exactly the reason why I've always been rather
careful with labeling such quirks as "bugs". And in the end, the code does
perform one more blitting call after the loop to make sure that the correct
digit remains on screen.
The remaining ¾ of the second push went towards transferring the final data
definitions from ASM to C land. Most of the details there paint a rather
depressing picture about ZUN's original code layout and the bloat that came
with it, but it did end on a real highlight. There was some unused data
between ZUN's non-master.lib VSync and text RAM code that I just moved away
in September 2015 without taking a closer look at it. Those bytes kind of
look like another hardcoded 1bpp image though… wait, what?!
Lovely! With no mouse-related code left in the game otherwise, this cursor
sprite provides some great fuel for wild fan theories about TH01's
development history:
Could ZUN have 📝 stolen the basic PC-98
VSync or text RAM function code from a source that also implemented mouse
support?
Or was this game actually meant to have mouse-controllable portions at
some point during development? Even if it would have just been the
menus.
… Actually, you know what, with all shared data moved to C land, I might as
well finish FUUIN.EXE right now. The last secret hidden in its
main() function: Just like GAME.BAT supports
launching the game in various debug modes from the DOS command line,
FUUIN.EXE can directly launch one of the game's endings. As
long as the MDRV2 driver is installed, you can enter
fuuin t1 for the 魔界/Makai Good Ending, or
fuuin t for 地獄/Jigoku Good Ending.
Unfortunately, the command-line parameter can only control the route.
Choosing between a Good or Bad Ending is still done exclusively through
TH01's resident structure, and the continues_per_scene array in
particular. But if you pre-allocate that structure somehow and set one of
the members to a nonzero value, it would work. Trainers, anyone?
Alright, gotta get back to the code if I want to have any chance of
finishing this game before the 15th… Next up: The final 17
functions in REIIDEN.EXE that tie everything together and add
some more debug features on top.
What's this? A simple, straightforward, easy-to-decompile TH01 boss with
just a few minor quirks and only two rendering-related ZUN bugs? Yup, 2½
pushes, and Kikuri was done. Let's get right into the overview:
Just like 📝 Elis, Kikuri's fight consists
of 5 phases, excluding the entrance animation. For some reason though, they
are numbered from 2 to 6 this time, skipping phase 1? For consistency, I'll
use the original phase numbers from the source code in this blog post.
The main phases (2, 5, and 6) also share Elis' HP boundaries of 10, 6,
and 0, respectively, and are once again indicated by different colors in the
HP bar. They immediately end upon reaching the given number of HP, making
Kikuri immune to the
📝 heap corruption in test or debug mode that can happen with Elis and Konngara.
Phase 2 solely consists of the infamous big symmetric spiral
pattern.
Phase 3 fades Kikuri's ball of light from its default bluish color to bronze over 100 frames. Collision detection is deactivated
during this phase.
In Phase 4, Kikuri activates her two souls while shooting the spinning
8-pellet circles from the previously activated ball. The phase ends shortly
after the souls fired their third spread pellet group.
Note that this is a timed phase without an HP boundary, which makes
it possible to reduce Kikuri's HP below the boundaries of the next
phases, effectively skipping them. Take this video for example,
where Kikuri has 6 HP by the end of Phase 4, and therefore directly
starts Phase 6.
(Obviously, Kikuri's HP can also be reduced to 0 or below, which will
end the fight immediately after this phase.)
Phase 5 combines the teardrop/ripple "pattern" from the souls with the
"two crossed eye laser" pattern, on independent cycles.
Finally, Kikuri cycles through her remaining 4 patterns in Phase 6,
while the souls contribute single aimed pellets every 200 frames.
Interestingly, all HP-bounded phases come with an additional hidden
timeout condition:
Phase 2 automatically ends after 6 cycles of the spiral pattern, or
5,400 frames in total.
Phase 5 ends after 1,600 frames, or the first frame of the
7th cycle of the two crossed red lasers.
If you manage to keep Kikuri alive for 29 of her Phase 6 patterns,
her HP are automatically set to 1. The HP bar isn't redrawn when this
happens, so there is no visual indication of this timeout condition even
existing – apart from the next Orb hit ending the fight regardless of
the displayed HP. Due to the deterministic order of patterns, this
always happens on the 8th cycle of the "symmetric gravity
pellet lines from both souls" pattern, or 11,800 frames. If dodging and
avoiding orb hits for 3½ minutes sounds tiring, you can always watch the
byte at DS:0x1376 in your emulator's memory viewer. Once
it's at 0x1E, you've reached this timeout.
So yeah, there's your new timeout challenge.
The few issues in this fight all relate to hitboxes, starting with the main
one of Kikuri against the Orb. The coordinates in the code clearly describe
a hitbox in the upper center of the disc, but then ZUN wrote a < sign
instead of a > sign, resulting in an in-game hitbox that's not
quite where it was intended to be…
Kikuri's actual hitbox.
Since the Orb sprite doesn't change its shape, we can visualize the
hitbox in a pixel-perfect way here. The Orb must be completely within
the red area for a hit to be registered.
Much worse, however, are the teardrop ripples. It already starts with their
rendering routine, which places the sprites from TAMAYEN.PTN at byte-aligned VRAM positions in the ultimate piece of if(…) {…}
else if(…) {…} else if(…) {…} meme code. Rather than
tracking the position of each of the five ripple sprites, ZUN suddenly went
purely functional and manually hardcoded the exact rendering and collision
detection calls for each frame of the animation, based on nothing but its
total frame counter.
Each of the (up to) 5 columns is also unblitted and blitted individually
before moving to the next column, starting at the center and then
symmetrically moving out to the left and right edges. This wouldn't be a
problem if ZUN's EGC-powered unblitting function didn't word-align its X
coordinates to a 16×1 grid. If the ripple sprites happen to start at an
odd VRAM byte position, their unblitting coordinates get rounded both down
and up to the nearest 16 pixels, thus touching the adjacent 8 pixels of the
previously blitted columns and leaving the well-known black vertical bars in
their place.
OK, so where's the hitbox issue here? If you just look at the raw
calculation, it's a slightly confusingly expressed, but perfectly logical 17
pixels. But this is where byte-aligned blitting has a direct effect on
gameplay: These ripples can be spawned at any arbitrary, non-byte-aligned
VRAM position, and collisions are calculated relative to this internal
position. Therefore, the actual hitbox is shifted up to 7 pixels to the
right, compared to where you would expect it from a ripple sprite's
on-screen position:
Due to the deterministic nature of this part of the fight, it's
always 5 pixels for this first set of ripples. These visualizations are
obviously not pixel-perfect due to the different potential shapes of
Reimu's sprite, so they instead relate to her 32×32 bounding box, which
needs to be entirely inside the red
area.
We've previously seen the same issue with the
📝 shot hitbox of Elis' bat form, where
pixel-perfect collision detection against a byte-aligned sprite was merely a
sidenote compared to the more serious X=Y coordinate bug. So why do I
elevate it to bug status here? Because it directly affects dodging: Reimu's
regular movement speed is 4 pixels per frame, and with the internal position
of an on-screen ripple sprite varying by up to 7 pixels, any micrododging
(or "grazing") attempt turns into a coin flip. It's sort of mitigated
by the fact that Reimu is also only ever rendered at byte-aligned
VRAM positions, but I wouldn't say that these two bugs cancel out each
other.
Oh well, another set of rendering issues to be fixed in the hypothetical
Anniversary Edition – obviously, the hitboxes should remain unchanged. Until
then, you can always memorize the exact internal positions. The sequence of
teardrop spawn points is completely deterministic and only controlled by the
fixed per-difficulty spawn interval.
Aside from more minor coordinate inaccuracies, there's not much of interest
in the rest of the pattern code. In another parallel to Elis though, the
first soul pattern in phase 4 is aimed on every difficulty except
Lunatic, where the pellets are once again statically fired downwards. This
time, however, the pattern's difficulty is much more appropriately
distributed across the four levels, with the simultaneous spinning circle
pellets adding a constant aimed component to every difficulty level.
Kikuri's phase 4 patterns, on every difficulty.
That brings us to 5 fully decompiled PC-98 Touhou bosses, with 26 remaining…
and another ½ of a push going to the cutscene code in
FUUIN.EXE.
You wouldn't expect something as mundane as the boss slideshow code to
contain anything interesting, but there is in fact a slight bit of
speculation fuel there. The text typing functions take explicit string
lengths, which precisely match the corresponding strings… for the most part.
For the "Gatekeeper 'SinGyoku'" string though, ZUN passed 23
characters, not 22. Could that have been the "h" from the Hepburn
romanization of 神玉?!
Also, come on, if this text is already blitted to VRAM for no reason,
you could have gone for perfect centering at unaligned byte positions; the
rendering function would have perfectly supported it. Instead, the X
coordinates are still rounded up to the nearest byte.
The hardcoded ending cutscene functions should be even less interesting –
don't they just show a bunch of images followed by frame delays? Until they
don't, and we reach the 地獄/Jigoku Bad Ending with
its special shake/"boom" effect, and this picture:
Picture #2 from ED2A.GRP.
Which is rendered by the following code:
for(int i = 0; i <= boom_duration; i++) { // (yes, off-by-one)
if((i & 3) == 0) {
graph_scrollup(8);
} else {
graph_scrollup(0);
}
end_pic_show(1); // ← different picture is rendered
frame_delay(2); // ← blocks until 2 VSync interrupts have occurred
if(i & 1) {
end_pic_show(2); // ← picture above is rendered
} else {
end_pic_show(1);
}
}
Notice something? You should never see this picture because it's
immediately overwritten before the frame is supposed to end. And yet
it's clearly flickering up for about one frame with common emulation
settings as well as on my real PC-9821 Nw133, clocked at 133 MHz.
master.lib's graph_scrollup() doesn't block until VSync either,
and removing these calls doesn't change anything about the blitted images.
end_pic_show() uses the EGC to blit the given 320×200 quarter
of VRAM from page 1 to the visible page 0, so the bottleneck shouldn't be
there either…
…or should it? After setting it up via a few I/O port writes, the common
method of EGC-powered blitting works like this:
Read 16 bits from the source VRAM position on any single
bitplane. This fills the EGC's 4 16-bit tile registers with the VRAM
contents at that specific position on every bitplane. You do not care
about the value the CPU returns from the read – in optimized code, you would
make sure to just read into a register to avoid useless additional stores
into local variables.
Write any 16 bits
to the target VRAM position on any single bitplane. This copies the
contents of the EGC's tile registers to that specific position on
every bitplane.
To transfer pixels from one VRAM page to another, you insert an additional
write to I/O port 0xA6 before 1) and 2) to set your source and
destination page… and that's where we find the bottleneck. Taking a look at
the i486 CPU and its cycle
counts, a single one of these page switches costs 17 cycles – 1 for
MOVing the page number into AL, and 16 for the
OUT instruction itself. Therefore, the 8,000 page switches
required for EGC-copying a 320×200-pixel image require 136,000 cycles in
total.
And that's the optimal case of using only those two
instructions. 📝 As I implied last time, TH01
uses a function call for VRAM page switches, complete with creating
and destroying a useless stack frame and unnecessarily updating a global
variable in main memory. I tried optimizing ZUN's code by throwing out
unnecessary code and using 📝 pseudo-registers
to generate probably optimal assembly code, and that did speed up the
blitting to almost exactly 50% of the original version's run time. However,
it did little about the flickering itself. Here's a comparison of the first
loop with boom_duration = 16, recorded in DOSBox-X with
cputype=auto and cycles=max, and with
i overlaid using the text chip. Caution, flashing lights:
The original animation, completing in 50 frames instead of the expected
34, thanks to slow blitting. Combined with the lack of
double-buffering, this results in noticeable tearing as the screen
refreshes while blitting is still in progress.
(Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic
#1.)
This optimized version completes in the expected 34 frames. No tearing
happens to be visible in this recording, but the ドカーン image is still visible on every
second loop iteration. (Note how the background of the ドカーン image is shifted 1 pixel to the left compared to pic
#1.)
I pushed the optimized code to the th01_end_pic_optimize
branch, to also serve as an example of how to get close to optimal code out
of Turbo C++ 4.0J without writing a single ASM instruction.
And if you really want to use the EGC for this, that's the best you can do.
It really sucks that it merely expanded the GRCG's 4×8-bit tile register to
4×16 bits. With 32 bits, ≥386 CPUs could have taken advantage of their wider
registers and instructions to double the blitting performance. Instead, we
now know the reason why
📝 Promisence Soft's EGC-powered sprite driver that ZUN later stole for TH03
is called SPRITE16 and not SPRITE32. What a massive disappointment.
But what's perhaps a bigger surprise: Blitting planar
images from main memory is much faster than EGC-powered inter-page
VRAM copies, despite the required manual access to all 4 bitplanes. In
fact, the blitting functions for the .CDG/.CD2 format, used from TH03
onwards, would later demonstrate the optimal method of using REP
MOVSD for blitting every line in 32-pixel chunks. If that was also
used for these ending images, the core blitting operation would have taken
((12 + (3 × (320 / 32))) × 200 × 4) =
33,600 cycles, with not much more overhead for the surrounding row
and bitplane loops. Sure, this doesn't factor in the whole infamous issue of
VRAM being slow on PC-98, but the aforementioned 136,000 cycles don't even
include any actual blitting either. And as you move up to later PC-98
models with Pentium CPUs, the gap between OUT and REP
MOVSD only becomes larger. (Note that the page I linked above has a
typo in the cycle count of REP MOVSD on Pentium CPUs: According
to the original Intel Architecture and Programming Manual, it's
13+𝑛, not 3+𝑛.)
This difference explains why later games rarely use EGC-"accelerated"
inter-page VRAM copies, and keep all of their larger images in main memory.
It especially explains why TH04 and TH05 can get away with naively redrawing
boss backdrop images on every frame.
In the end, the whole fact that ZUN did not define how long this image
should be visible is enough for me to increment the game's overall bug
counter. Who would have thought that looking at endings of all things
would teach us a PC-98 performance lesson… Sure, optimizing TH01 already
seemed promising just by looking at its bloated code, but I had no idea that
its performance issues extended so far past that level.
That only leaves the common beginning part of all endings and a short
main() function before we're done with FUUIN.EXE,
and 98 functions until all of TH01 is decompiled! Next up: SinGyoku, who not
only is the quickest boss to defeat in-game, but also comes with the least
amount of code. See you very soon!
Finally, after a long while, we've got two pushes with barely anything to
talk about! Continuing the road towards 100% PI for TH05, these were
exactly the two pushes that TH05 MAINE.EXE PI was estimated
to additionally cost, relative to TH04's. Consequently, they mostly went
to TH05's unique data structures in the ending cutscenes, the score name
registration menu, and the
staff roll.
A unique feature in there is TH05's support for automatic text color
changes in its ending scripts, based on the first full-width Shift-JIS
codepoint in a line. The \c=codepoint,color
commands at the top of the _ED??.TXT set up exactly this
codepoint→color mapping. As far as I can tell, TH05 is the only Touhou
game with a feature like this – even the Windows Touhou games went back to
manually spelling out each color change.
The orb particles in TH05's staff roll also try to be a bit unique by
using 32-bit X and Y subpixel variables for their current position. With
still just 4 fractional bits, I can't really tell yet whether the extended
range was actually necessary. Maybe due to how the "camera scrolling"
through "space" was implemented? All other entities were pretty much the
usual fare, though.
12.4, 4.4, and now a 28.4 fixed-point format… yup,
📝 C++ templates were
definitely the right choice.
At the end of its staff roll, TH05 not only displays
the usual performance
verdict, but then scrolls in the scores at the end of each stage
before switching to the high score menu. The simplest way to smoothly
scroll between two full screens on a PC-98 involves a separate bitmap…
which is exactly what TH05 does here, reserving 28,160 bytes of its global
data segment for just one overly large monochrome 320×704 bitmap where
both the screens are rendered to. That's… one benefit of splitting your
game into multiple executables, I guess?
Not sure if it's common knowledge that you can actually scroll back and
forth between the two screens with the Up and Down keys before moving to
the score menu. I surely didn't know that before. But it makes sense –
might as well get the most out of that memory.
The necessary groundwork for all of this may have actually made
TH04's (yes, TH04's) MAINE.EXE technically
position-independent. Didn't quite reach the same goal for TH05's – but
what we did reach is ⅔ of all PC-98 Touhou code now being
position-independent! Next up: Celebrating even more milestones, as
-Tom- is about to finish development on his TH05
MAIN.EXE PI demo…