128 commits! Who would have thought that the ideal first release of the TH01
Anniversary Edition would involve so much maintenance, and raise so many
research questions? It's almost as if the real work only starts after
the 100% finalization mark… Once again, I had to steal some funding from the
reserved JIS trail word pushes to cover everything I liked to research,
which means that the next towards the
anything goal will repay this debt. Luckily, this doesn't affect any
immediate plans, as I'll be spending March with tasks that are already fully
funded.
So, how did this end up so massive? The list of things I originally set out
to do was pretty short:
Build entire game into single executable
Fix rendering issues in the one or two most important parts of the game
for a good initial impression
But even the first point already started with tons of little cleanup
commits. A part of them can definitely be blamed on the rush to hit the 100%
decompilation mark before the 25th anniversary last August.
However, all the structural changes that I can't commit to
master reveal how much of a mess the TH01 codebase actually
is.
Merging the executables is mainly difficult because of all the
inconsistencies between REIIDEN.EXE and FUUIN.EXE.
The worst parts can be found in the REYHI*.DAT format code and
the High Score menu, but the little things are just as annoying, like how
the current score is an unsigned variable in
REIIDEN.EXE, but a signed one in FUUIN.EXE.
If it takes me this long and this many
commits just to sort out all of these issues, it's no wonder that the only
thing I've seen being done with this codebase since TH01's 100%
decompilation was a single porting attempt that ended in a rather quick
ragequit.
So why are we merging the executables in preparation for the Anniversary
Edition, and not waiting with it until we start doing ports?
Distributing and updating one executable is cleaner than doing the same
with three, especially as long as installation will still involve manually
dropping the new binary into the game directory.
The Anniversary Edition won't be the only fork binary. We are already
going to start out with a separate DEBLOAT.EXE that contains
only the bloat removal changes without any bug fixes, and spaztron64
will probably redo his seizure-less edition. We don't want to clutter
the game directory with three binaries for each of these fork builds, and we
especially don't want to remember things like oh, but this fork
only modifies REIIDEN.EXE…
All forks should run side-by-side with the original game. During the
time I was maintaining thcrap, I've had countless bug reports of people
assuming that thcrap was
responsible for bugs that were present in the original game, and the
same is certain to happen with the Anniversary Edition. Separate binaries
will make it easier for everyone to check where these bugs came from.
Also, I'd like to make a point about how bloated the original
three-executable structure really is, since I've heard people defending it
as neat software architecture. Really, even in Real Mode where you typically
want to use as little of the 640 KiB of conventional memory as possible, you
don't want to split your game up like this.
The game actually is so bloated that the combined binary ended up
smaller than the original REIIDEN.EXE. If all you see are the
file sizes of the original three executables, this might look like a
pretty impressive feat. Like, how can we possibly get 407,812
bytes into less than 238,612 bytes, without using compression?
If you've ever looked at the linker map though, it's not at all surprising.
Excluding the aforementioned inconsistencies that are hard to quantify,
OP.EXE and FUUIN.EXE only feature 5,767 and 6,475
bytes of unique code and data, respectively. All other code in these
binaries is already part of REIIDEN.EXE, with more than half of
the size coming from the Borland C++ runtime. The single worst offender here
is the C++ exception handler that Borland forces
onto every non-.COM binary by default, which alone adds 20,512 bytes
even if your binary doesn't use C++ exceptions.
On a more hilarious note, this
single line is responsible for pulling another unnecessary 14,242 bytes
into OP.EXE and FUUIN.EXE. This floating-point
multiplication is completely unnecessary in this context because all
possible parameters are integers, but it's enough for Turbo C++ and TLINK to
pull in the entire x87 FPU emulation machinery. These two binaries don't
even draw lines, but since this function is part of the general
graphics code translation unit and contains other functions that these
binaries do need, TLINK links in the entire thing. Maybe, multiple
executables aren't the best choice either if you use a linker that can't do
dead code elimination…
Since the 📝 Orb's physics do turn the entire
precision of a double variable into gameplay effects, it's not
feasible to ever get rid of all FPU code in TH01. The exception handler,
however, can
be removed, which easily brings the combined binary below the size of
the original REIIDEN.EXE. Compiling all code with a single set
of compiler optimization flags, including the more x86-friendly
pascal calling convention, then gets us a few more KB on top.
As does, of course, removing unused code: The only remaining purpose of
features such as 📝 resident palettes is to
potentially make porting more difficult for anyone who doesn't immediately
realize that nothing in the game uses these functions.
Technically, all unused code would be bloat, but for now, I'm keeping
the parts that may tell stories about the game's development history (such
as unused effects or the 📝 mouse cursor), or
that might help with debugging. Even with that in mind, I've only scratched
the surface when it comes to bloat removal, and the binary is only going to
get smaller from here. A lot smaller.
If only we now could start MDRV98 from this new combined binary, we wouldn't
need a second batch file either…
Which brings us to the first big research question of this delivery. Using
the C spawn() function works fine on this compiler, so
spawn("MDRV98.COM") would be all we need to do, right? Except
that the game crashes very soon after that subprocess returned.
So it's not going to be that easy if the spawned process is a TSR.
But why should this be a problem? Let's take a look at the DOS heap, and how
DOS lays out processes in conventional memory if we launch the game
regularly through GAME.BAT:
The rough layout of the DOS heap when launching TH01 from
GAME.BAT.
The batch file starts MDRV98 first, which will therefore end up below
the game in conventional memory. This is perfect for a TSR: The program can
resize itself arbitrarily before returning to DOS, and the rest of memory
will be left over for the game. If we assume such a layout, a DOS program
can implement a custom memory allocator in a very simple way, as it only has
to search for free memory in one direction – and this is exactly how Borland
implemented the C heap for functions like malloc() and
free(), and the C++ new and delete
operators.
But if we spawn MDRV98 after starting TH01, well…
MDRV98 will spawn in the next free memory location, allocate itself, return
to TH01… which suddenly finds its C heap blocked from growing. As a result,
the next big allocation will immediately fail with a rather misleading "out
of memory" error.
So, what can we do about this? Still in a bloat removal mindset, my gut
reaction was to just throw out Borland's C heap implementation, and replace
it with a very thin wrapper around the DOS heap as managed by INT 21h,
AH=48h/49h/4Ah. Like, why
did these DOS compilers even bother with a custom allocator in the first
place if DOS already comes with a perfectly fine native one? Using the
native allocator would completely erase the distinction between TSR memory
and game memory, and inherently allow the game to allocate beyond
MDRV98.
I did in fact implement this, and noticed even more benefits:
While DOS uses 16 bytes rather than Borland's 4 bytes for the control
structure of each memory block, this larger size automatically aligns all
allocations to 16-byte boundaries. Therefore, all allocation addresses would
fit into 16-bit segment-only pointers rather than needing 32-bit
far ones. On the Borland heap, the 4-byte header further limits
regular far pointers to 65,532 bytes, forcing you into
expensive huge pointers for bigger allocations.
Debuggers in DOS emulators typically have features to show and manage
the DOS heap. No need for custom debugging code.
You can change the memory placement
strategy to allocate from the top of conventional memory down to the
bottom. This is how the games allocate their resident structures.
Ultimately though, the drawbacks became too significant. Most of them are
related to the PC-98 Touhou games only ever creating a single DOS
process, even though they contain multiple executables.
Switching executables is done via exec(), which resizes a
program's main allocation to match the new binary and then overwrites the
old program image with the new one. If you've ever wondered why DOSBox-X
only ever shows OP as the active process name in the title bar,
you now know why. As far as DOS is concerned, it's still the same
OP.EXE process rooted at the same segment, and
exec() doesn't bother rewriting the name either. Most
importantly though, this is how REIIDEN.EXE can launch into
another REIIDEN.EXE process even if there are less than 238,612
bytes free when exec() is called, and without consuming more
memory for every successive binary.
For now, ANNIV.EXE still re-exec()s itself at
every point where the original game did, as ZUN's original code really
depends on being reinitialized at boss and scene boundaries. The resulting
accidental semi-hot reloading is also a useful property to retain
during development.
So why is the DOS heap a bad idea for regular game allocation after all?
Even DOS automatically releases all memory associated with a process
during its termination. But since we keep running the same process until the
player quits out of the main menu, we lose the C heap's implicit cleanup on
exec(), and have to manually free all memory ourselves.
Since the binary can be larger after hot reloading, we in fact have
to allocate all regular memory using the last fit strategy.
Otherwise, exec() fails to resize the program's main block for
the same reason that crashed the game on our initial attempt to
spawn("MDRV98.COM").
Just like Borland's heap implementation, the DOS heap stores its control
structures immediately before each allocation, forming a singly linked list.
But since the entire OS shares this single list, corruptions from heap
overflows also affect the whole system, and become much more disastrous.
Theoretically, it might be possible to recover from them by forcibly
releasing all blocks after the last correct one, or even by doing a
brute-force search for valid memory
control blocks, but in reality, DOS will likely just throw error code #7
(ERROR_ARENA_TRASHED) on the next memory management syscall,
forcing a reboot.
With a custom allocator, small corruptions remain isolated to the process.
They can be even further limited if the process adds some padding between
its last internal allocation and the end of the allocated DOS memory block;
Borland's heap sort of does this as well by always rounding up the DOS block
to a full KiB. All this might not make a difference in today's emulated and
single-tasked usage, but would have back then when software was still
developed inside IDEs running on the same system.
TH01's debug mode uses heapcheck() and
heapchecknode(), and reimplementing these on top of the DOS
heap is not trivial. On the contrary, it would be the most complicated part
of such a wrapper, by far.
I could release this DOS heap wrapper in unused form for another push if
anyone's interested, but for now, I'm pretty happy with not actually using
it in the games. Instead, let's stay with the Borland C heap, and find a way
to push MDRV98 to the very top of conventional RAM. Like this:
Which is much easier said than done. It would be nice if we could just use
the last fit allocation strategy here, but .COM executables always
receive all free memory by default anyway, which eliminates any difference
between the strategies.
But we can still change memory itself. So let's temporarily claim all
remaining free memory, minus the exact amount we need for MDRV98, for our
process. Then, the only remaining free space to spawn MDRV98 is at the exact
place where we want it to be:
Obviously, we release all the additional memory after spawning MDRV98.
Now we only need to know how much memory to not temporarily allocate. First,
we need to replicate the assumption that MDRV98's -M7
command-line parameter corresponds to a resident size of 23,552 bytes. This
is not as bad as it seems, because the -M parameter explicitly
has a KiB unit, and we can nicely abstract it away for the API.
The (env.) block though? Its minimum size equals the combined length
of all environment variables passed to the process, but its maximum size is…
not limited at all?! As in, DOS implementations can add and have
historically added more free space because some programs insisted on storing
their own new environment variables in this exact segment. DOSBox and
DOSBox-X follow this tradition by providing a configuration option for the
additional amount of environment space, with the latter adding 1024
additional bytes by default, y'know, just in case someone wants to compile
FreeDOS on a slow emulator. It's not even worth sending a bug report for
this specific case, because it's only a symptom of the fact that
unexpectedly large program environment blocks can and will happen, and are
to be expected in DOS land.
So thanks to this cruel joke, it's technically impossible to achieve what we
want to do there. Hooray! The only thing we can kind of do here is an
educated guess: Sum up the length of all environment variables in our
environment block, compare that length against the allocated size of the
block, and assume that the MDRV98 process will get as much additional memory
as our process got. 🤷
The remaining hurdles came courtesy of some Borland C runtime implementation
details. You would think that the temporary reallocation could even be done
in pure C using the sbrk(), coreleft(), and
brk() functions, but all values passed to or returned from
these functions are inaccurate because they don't factor in the
aforementioned KiB padding to the underlying DOS memory block. So we have to
directly use the DOS syscalls after all. Which at least means that learning
about them wasn't completely useless…
The final issue is caused inside Borland's
spawn() implementation. The environment block for the
child process is built out of all the strings reachable from C's
environ pointer, which is what that FreeDOS build process
should have used. Coalescing them into a single buffer involves yet
another C heap allocation… and since we didn't report our DOS memory block
manipulation back to the C heap, the malloc() call might think
it needs to request more memory from DOS. This resets the DOS memory block
back to its intended level, undoing our manipulation right before the actual
INT 21h, AH=4Bh
EXEC syscall. Or in short:
Manipulate DOS heap ➜ spawn() call ➜_LoadProg() ➜ allocate and prepare environment block ➜ _spawn() ➜ DOS EXEC syscall
The obvious solution: Replace _LoadProg(), implement the
coalescing ourselves, and do it before the heap manipulation. Fortunately,
Borland's internal low-level _spawn() function is not
static, so we can call it ourselves whenever we want to:
Allocate and prepare environment block ➜ manipulate DOS heap ➜ _spawn() call ➜EXEC syscall
So yes, launching MDRV98 from C can be done, but it involves advanced
witchcraft and is completely ridiculous.
Launching external sound drivers from a batch file is the right way
of doing things.
Fortunately, you don't have to rely on this auto-launching feature. You can
still launch DEBLOAT.EXE or ANNIV.EXE from a batch
file that launched MDRV98.COM before, and the binaries will
detect this case and skip the attempt of launching MDRV98 from C. It's
unlikely that my heuristic will ever break, but I definitely recommend
replicating GAME.BAT just to be completely sure – especially
for user-friendly repacks that don't want to include the original game
anyway.
This is also why ANNIV.EXE doesn't launch
ZUNSOFT.COM: The "correct" and stable way to launch
ANNIV.EXE still involves a batch file, and I would say that
expecting people to remove ZUNSOFT.COM from that file is worse
than not playing the animation. It's certainly a debate we can have, though.
This deep dive into memory allocation revealed another previously
undocumented bug in the original game. The RLE decompression code for the
東方靈異.伝 packfile contains two heap overflows, which are
actually triggered by SinGyoku's BOSS1_3.BOS and Konngara's
BOSS8_1.BOS. They only do not immediately crash the game when
loading these bosses thanks to two implementation details of Borland's C
heap.
Obviously, this is a bug we should fix, but according to the definition of
bugs, that fix would be exclusive to the anniversary branch.
Isn't that too restrictive for something this critical? This code is
guaranteed to blow up with a different heap implementation, if only in a
Debug build. And besides, nobody would notice a fix
just by looking at the game's rendered output…
Looks like we have to introduce a fourth category of weird code, in addition
to the previous bloat, bug, and quirk categories, for
invisible internal issues like these. Let's call it landmine, and fix
them on the debloated branch as well. Thanks to
Clerish for the naming inspiration!
With this new category, the full definitions for all categories have become
quite extensive. Thus, they now live in CONTRIBUTING.md
inside the ReC98 repository.
With the new discoveries and the new landmine category, TH01 is now at 67
bugs and 20 landmines. And the solution for the landmine in question? Simplifying
the 61 lines of the original code down to 16. And yes, I'm including
comments in these numbers – if the interactions of the code are complex
enough to require multi-paragraph comments, these are a necessary and
valid part of the code.
While we're on the topic of weird code and its visible or invisible effects,
there's one thing you might be concerned about. With all the rearchitecting
and data shifting we're doing on the debloated branch, what
will happen to the 📝 negative glitch stages?
These are the result of a clearly observable bug that, by definition, must
not be fixed on the debloated branch. But given that the
observable layout of the glitch stages is defined by the memory
surrounding the scene stage variable, won't the
debloated branch inherently alter their appearance (= ⚠️
fanfiction ⚠️), or even remove them completely?
Well, yes, it will. But we can still preserve their layout by
hardcoding
the exact original data that the game would originally read, and even emulate
the original segment relocations and other pieces of global data.
Doing this is feasible thanks to the fact that there are only 4 glitch
stages. Unfortunately, the same can't be said for the timer values, which
are determined by an array lookup with the un-modulo'd stage ID. If we
wanted to preserve those as well, we'd have to bundle an exact copy of the
original REIIDEN.EXE data segment to preserve the values of all
32,768 negative stages you could possibly enter, together with a map
of all relocations in this segment. 😵 Which I've decided against for now,
since this has been going on for far too long already. Let's first see if
anyone ever actually complains about details like this…
Alright, time to start the anniversary branch by rendering
everything at its correct internal unaligned X position? Eh… maybe not quite
yet. If we just hacked all the necessary bit-shifting code into all the
format-specific blitting functions, we'd still retain all this largely
redundant, bad, and slow code, and would make no progress in terms of
portability. It'd be much better to first write a single generic blitter
that's decently optimized, but supports all kinds of sprites to make this
optimization actually worth something.
So, next research question: How would such a blitter look like? After I
learned during my
📝 first foray into cycle counting that port
I/O is slow on 486 CPUs, it became clear that TH04's
📝 GRCG batching for pellets was one of the
more useful optimizations that probably contributed a big deal towards
achieving the high bullet counts of that game. This leads to two
conclusions:
master.lib's super_*() sprite functions are slow, and not
worth looking at for inspiration. Even the 📝 tiny format reinitializes the GRCG on every color change, wasting 80
cycles.
Hence, our low-level blitting API should not even care about colors. It
should only concern itself with blitting a given 1bpp sprite to a single
VRAM segment. This way, it can work for both 4-plane sprites and
single-plane sprites, and just assume that the GRCG is active.
Maybe we should also start by not even doing these unaligned bit shifts
ourselves, and instead expect the call site to
📝 always deliver a byte-aligned sprite that is correctly preshifted,
if necessary? Some day, we definitely should measure how slow runtime
shifting would really be…
What we should do, however, are some further general optimizations that I
would have expected from master.lib: Unrolling the vertical
loop, and baking a single function for every sprite width to eliminate
the horizontal loop. We can then use the widest possible x86
MOV instruction for the lowest possible number of cycles per
row – for example, we'd blit a 56-wide sprite with three MOVs
(32-bit + 16-bit + 8-bit), and a 64-wide one with two 32-bit
MOVs.
Or maybe not? There's a lot of blitting code in both master.lib and PC-98
Touhou that checks for empty bytes within sprites to skip needlessly writing
them to VRAM:
Which goes against everything you seem to know about computers. We aren't
running on an 8-bit CPU here, so wouldn't it be faster to always write both
halves of a sprite in a single operation?
That's a single CPU instruction, compared to two instructions and two
branches. The only possible explanation for this would be that VRAM writes
are so slow on PC-98 that you'd want to avoid them at all costs, even
if that means additional branching on the CPU to do so. Or maybe that was
something you would want to do on certain models with slow VRAM, but not on
others?
So I wrote a benchmark to answer all these questions, and to compare my new
blitter against typical TH01 blitting code:
A not really representative run on DOSBox-X. Since the master.lib sprite
functions are also unbatched, I expect them to not be much faster than the
naive C implementation.
2023-03-05-blitperf.zip
And here are the real-hardware results I've got from the PC-9800
Central Discord server:
PC-286LS
PC-9801ES
PC-9821Cb/Cx
PC-9821Ap3
PC-9821An
PC-9821Nw133
PC-9821Ra20
80286, 12 MHz
i386SX, 16 MHz
486SX, 33 MHz
486DX4, 100 MHz
Pentium, 90 MHz
Pentium, 133 MHz
Pentium Pro, 200 MHz
1987
1989
1994
1994
1994
1997
1996
Unchecked
C
GRCG
36,85
38,42
26,02
26,87
3,98
4,13
2,08
2,16
1,81
1,87
0,86
0,89
1,25
1,25
MOVS
GRCG
15,22
16,87
9,33
10,19
1,22
1,37
0,44
0,44
MOV
GRCG
15,42
17,08
9,65
10,53
1,15
1,3
0,44
0,44
4-plane
37,23
43,97
29,2
32,96
4,44
5,01
4,39
4,67
5,11
5,32
5,61
5,74
6,63
6,64
Checking first
GRCG
17,49
19,15
10,84
11,72
1,27
1,44
1,04
1,07
0,54
0,54
4-plane
46,49
53,36
35,01
38,79
5,66
6,26
5,43
5,74
6,56
6,8
8,08
8,29
10,25
10,29
Checking second
GRCG
16,47
18,12
10,77
11,65
1,25
1,39
1,02
0,51
0,51
4-plane
43,41
50,26
33,79
37,82
5,22
5,81
5,14
5,43
6,18
6,4
7,57
7,77
9,58
9,62
Checking both
GRCG
16,14
18,03
10,84
11,71
1,33
1,49
1,01
0,49
0,49
4-plane
43,61
50,45
34,11
37,87
5,39
5,99
4,92
5,23
5,88
6,11
7,19
7,43
9,1
9,13
Amount of frames required to render 2000 16×8 pellet sprites on a variety of
PC-98 models, using the new generic blitter. Both preshifted (first column)
and runtime-shifted (second column) sprites were tested; empty columns
correspond to times faster than a single frame. Thanks to cuba200611,
Shoutmon, cybermind, and Digmac for running the tests!
The key takeaways:
Checking for empty bytes has never been a good idea.
Preshifting sprites made a slight difference on the 286. Starting with
the 386 though, that difference got smaller and smaller, until it completely
vanished on Pentium models. The memory tradeoff is especially not worth it
for 4-plane sprites, given that you would have to preshift each of the 4
planes and possibly even a fifth alpha plane. Ironically, ZUN only ever
preshifted monochrome single-bitplane sprites with a width of 8 pixels.
That's the smallest possible amount of memory a sprite can possibly take,
and where preshifting consequently has the smallest effect on performance.
Shifting 8-wide sprites on the fly literally takes a single ROL
or ROR instruction per row.
You might want to use MOVS instead of MOV when
targeting the 286 and 386, but the performance gains are barely worth the
resulting mess you would make out of your blitting code. On Pentium models,
there is no difference.
Use the GRCG whenever you have to render lots of things that share a
static 8×1 pattern.
These are the PC-98 models that the people who are willing to test your
newly written PC-98 code actually use.
Since this won't be the only piece of game-independent and explicitly
PC-98-specific custom code involved in this delivery, it makes sense to
start a
dedicated PC-98 platform layer. This code will gradually eliminate the
dependency on master.lib and replace it with better optimized and more
readable C++ code. The blitting benchmark, for example, is already
implemented completely without master.lib.
While this platform layer is mainly written to generate optimal code within
Turbo C++ 4.0J, it can also serve as general PC-98 documentation for
everyone who prefers code over machine-translating old Japanese books. Not
to mention the immediacy of having all actual relevant information in
one place, which might otherwise be pretty well hidden in these books, or
some obscure old text file. For example, did you know that uploading gaiji
via INT 18h might end up disabling the VSync interrupt trigger,
deadlocking the process on the next frame delay loop? This nuisance is not
replicated by any emulators, and it's quite frustrating to encounter it when
trying to run your code on real hardware. master.lib works around it by
simply hooking INT 18h and unconditionally reenabling the VSync
interrupt trigger after the original handler returns, and so does our
platform layer.
So, with the pellet draw calls batched and routed through the new renderer,
we should have gained enough free CPU cycles to disable
📝 interlaced pellet rendering without any
impact on frame rates?
Well, kinda. We do get 56.4 FPS, but only together with noticeable and
reproducible tearing in the top part of the playfield, suggesting exactly
why ZUN interlaced the rendering in the first place. 😕 So have we
already reached the limit of single-buffered PC-98 games here, or can we
still do something about it?
As it turns out, the main bottleneck actually lies in the pellet
unblitting code. Every EGC-"accelerated" unblitting call in TH01 is
as unbatched as the pellet blitting calls were, spending an additional 17
I/O port writes per call to completely set up and shut down the EGC, every
time. And since this is TH01, the two-instruction operation of changing the
active PC-98 VRAM page isn't inlined either, but instead done via a function
call to a faraway segment. On the 486, that's:
>341 cycles for EGC setup and teardown, plus
>72 cycles for each 16-pixel chunk to be unblitted. This sums up
to
>917 cycles of completely unnecessary work for every active pellet,
in the optimal 50% of cases where it lies on an even VRAM byte,
or
>1493 cycles if it lies on an odd VRAM byte, because ZUN's code
extends the unblitted rectangle to a gargantuan 32×8 pixels in this case
And this calculation even ignores the lack of small micro-optimizations that
could further optimize the blitting loop. Multiply that by the game's pellet
cap of 100, and we get a 6-digit number of wasted CPU cycles. On
paper, that's roughly 1/6 of the time we have for each
of our target 56.423 FPS on the game's target 33 MHz systems. Might not
sound all too critical, but the single-buffered nature of the game means
that we're effectively racing the beam on every frame. In turn, we have to
be even more serious about performance.
So, time to also add a batched EGC API to our PC-98 platform layer? Writing
our own EGC code presents a nice opportunity to finally look deeper into all
its registers and configuration options, and see what exactly we can do
about ZUN's enforced 16-pixel alignment.
To nobody's surprise, this alignment is completely unnecessary, and only
displays a lack of knowledge about the chip. While it is true that
the EGC wants VRAM to be exclusively addressed in 16-bit chunks at
16-bit-aligned addresses, it specifically provides
an address register (0x4AC) for shifting the horizontal
start offsets of the source and destination to any pixel within the
16 pixels of such a chunk, and
a bit length register (0x4AE) for specifying the total
width of pixels to be transferred, which also implies the correct end
offsets.
And it gets even better: After ⌈bitlength ÷ 16⌉ write
instructions, the EGC's internal shifter state automatically reinitializes
itself in preparation for blitting another row of pixels with the same
initially configured bit addresses and length. This is perfect for blitting
rectangles, as two I/O port writes before the start of your blitting loop
are enough to define your entire rectangle.
The manual nature of reading and writing in 16-pixel chunks does come with a
slight pitfall though. If the source bit address is larger than the
destination bit address, the first 16-bit read won't fill the EGC's internal
shift register with all pixels that should appear in the first 16-pixel
destination chunk. In this case, the EGC simply won't write anything and
leave the first chunk unchanged. In a
📝 regular blitting loop, however, you expect
that memory to be written and immediately move on to the next chunks within
the row. As a result, the actual blitting process for such a rectangle will
no longer be aligned to the configured address and bit length. The first row
of the rectangle will appear 16 pixels to the right of the destination
address, and the second one will start at bit offset 0 with pixels from the
rightmost byte of the first line, which weren't blitted and remained in the
tile register.
There is an easy solution though: Before the horizontal loop on each line of
the rectangle, simply read one additional 16-pixel chunk from the source
location to prefill the shift register. Thankfully, it's large enough to
also fit the second read of the then full 16 pixels, without dropping any
pixels along the way.
And that's how we get arbitrarily unaligned rectangle copies with the EGC!
Except for a small register allocation trick to use two-register addressing,
there's not much use in further optimizations, as the runtime of these
inter-page blit operations is dominated by the VRAM page switches anyway.
Except that T98-Next seems to disagree about the register prefilling issue:
Every other emulator agrees with real hardware in this regard, so we can
safely assume this to be a bug in T98-Next. Just in case this old emulator
with its last release from June 2010 still has any fans left nowadays… For
now though, even they can still enjoy the TH01 Anniversary Edition: The only
EGC copy algorithm that TH01 actually needs is the left one during the
single-buffered tests, which even that emulator gets right.
That only leaves
📝 my old offer of documenting the EGC raster ops,
and we've got the EGC figured out completely!
And that did in fact remove tearing from the pellet rendering function! For
the first time, we can now fight Elis, Kikuri, Sariel, and Konngara with a
doubled pellet frame rate:
Switchable videos like these can nicely provide evidence that these
changes have no effect on gameplay, making it easy to see that the Orb
still collides with all pellets on the same frames. Also, check out the
difference in remaining conventional memory (coreleft)…
With only pellets and no other animation on screen, this exact pattern
presents the optimal demonstration case for the new unblitter. But as you
can already tell from the invincibility sprites, we'd also need to route
every other kind of sprite through the same new code. This isn't all too
trivial: Most sprites are still rendered at byte-aligned positions, and
their blitting APIs hide that fact by taking a pixel position regardless.
This is why we can't just replace ZUN's original 16-pixel-aligned EGC
unblitting function with ours, and always have to replace both the blitter
and the unblitter on a per-sprite basis.
To completely remove all flickering, we'd also like to get rid of all the
sprite-specific unblit ➜ update ➜ render sequences, and instead
gather all unblitting code to the beginning of the game loop, before any
update and rendering calls. So yeah, it will take a long time to completely
get rid of all flickering. Until we're there, I recommend any backer to tell
me their favorite boss, so that I can focus on getting that one
rendered without any flickering. Remember that here at ReC98, we can have a
Touhou character popularity contest at any time during the year, whenever
the store is open!
In the meantime, the consistent use of 8×8 rectangles during pellet
unblitting does significantly reduce flickering across the entire game,
and shrinks certain holes that pellets tend to rip into lazily reblitted
sprites:
SinGyoku's "crossing pellets" pattern, shortly before completing
the transformation back to the sphere.
To round out the first release, I added all the other bug fixes to achieve
parity with my previously released patched REIIDEN.EXE builds:
I removed the 📝 shootout laser crash by
simply leaving the lasers on screen if a boss is defeated,
prevented the HP bar heap corruption bug in test or debug mode by not
letting it display negative HP in the first place, and
So here it is, the first build of TH01's Anniversary Edition:
2023-03-05-th01-anniv.zip Edit (2023-03-12): If you're playing on Neko Project and seeing more
flickering than in the original game, make sure you've checked the Screen
→ Disp vsync option.
Next up: The long overdue extended trip through the depths of TH02's
low-level code. From what I've seen of it so far, the work on this project
is finally going to become a bit more relaxing. Which is quite welcome
after, what, 6 months of stressful research-heavy work?
> "OK, TH03/TH04/TH05 cutscenes done, let's quickly finish the Touhou Patch Center MediaWiki upgrade. Just some scripting and verification left, it will be done so quickly that I don't even have to mention it on this blog"
> Still not done after 3 weeks
> Blocked by one final critical bug that really should be fixed upstream
> Code reviewers are probably on vacation
And so, the year unfortunately ended with yet another slow month. During the
MediaWiki upgrade, I was slowly decompiling the TH05 Sara fight on the side,
but stumbled over one interesting but high-maintenance detail there that
would really enhance her blog post. TH02 would need a lot of attention for
the basic rendering calls as well…
…so let's end the year with Shuusou Gyoku instead, looking at its most
critical issue in particular. As if that were the easy option here…
The game does not run properly on modern Windows systems due to its usage of
the ancient DirectDraw APIs, with issues ranging from unbearable slowdown to
glitched colors to the game not even starting at all. Thankfully, Shuusou
Gyoku is not the only ancient Windows game affected by these issues, and
people have developed a variety of generic DirectDraw wrappers and patches
for playing such games on modern systems. Out of all these, DDrawCompat is one of the
simpler solutions for Shuusou Gyoku in particular: Just drop its
ddraw proxy DLL into the game directory, and the game will run
as it's supposed to.
So let's just bundle that DLL with all my future Shuusou Gyoku releases
then? That would have been the quick and dirty option, coming with
several drawbacks:
Linux users might be annoyed by the potential need to configure a native
DLL override for ddraw.dll. It's not too much of an issue as we
could simply rename the DLL and replace the import with the new name.
However, doing that reproducibly would already involve changes to either the
DDrawCompat or Shuusou Gyoku build process.
Win32 API hooking is another potential point of failure in general,
requiring continual maintenance for new Windows versions. This is not even a
hypothetical concern: DDrawCompat does rely on particularly volatile Win32
API details, to the point that the recent Windows 11 22H2 update completely
broke it, causing a hang at startup that required a workaround.
But sure, it's still just a single third-party component. Keeping it up to
date doesn't sound too bad by itself…
…if DDrawCompat weren't evolving way beyond what we need to keep Shuusou
Gyoku running. Being a typical DirectDraw wrapper, it has always aimed to
solve all sorts of issues in old DirectDraw games. However, the latest
version, 0.4.0, has gone above and beyond in this regard, adding lots of
configuration options with default settings that actually
break Shuusou Gyoku.
To get a glimpse of how this is likely to play out, we only have to look at
the more mature DxWnd
project. In its expert mode, DxWnd features three rows of tabs, each packed
with checkboxes that toggle individual hacks, and most of these are
related to something that Shuusou Gyoku could be affected by. Imagine
checking a precise permutation of a three-digit number of checkboxes just to
keep an old game running at full speed on modern systems…
Finally, aesthetic and bloat considerations. If
📝 C++ fstreams were already too embarrassing
with the ~100 KB of bloat they add to the binary, a 565 KiB DLL is
even worse. And that's the old version 0.3.2 – version 0.4.0 comes in
at 2.43 MiB.
Fortunately, I had the budget to dig a bit deeper and figure out what
exactly DDrawCompat does to make Shuusou Gyoku work properly. Turns
out that among all the hooks and patches, the game only needs the most
central one: Enforcing a 32-bit display mode regardless of whatever lower
bit depth the game requests natively, combined with converting the game's
pixel buffer to 32-bit on the fly.
So does this mean that adding 32-bit to the game's list of supported bit
depths is everything we have to do?
Interestingly, Shuusou Gyoku already saved the DirectDraw enumeration flag
that indicates support for 32-bit display modes. The official version just
did nothing with it.
Well, almost everything. Initially, this surprised me as well: With
all the if statements checking for precise bit depths, you
would think that supporting one more bit depth would be way harder in this
code base. As it turned out though, these conditional branches are not
really about 8-bit or 16-bit color for the most part, but instead
differentiate between two very distinct rendering approaches:
"8-bit" is a pure 2D mode with palettized colors,
while "16-bit" is a hybrid 2D/3D mode that uses Direct3D 2 on top of DirectDraw, with
3-channel RGB colors.
Consequently, most of these branches deal with differences between these two
approaches that couldn't be nicely abstracted away in pbg's renderer
interface: Specific palette changes that are exclusive to "8-bit" mode, or
certain entities and effects whose Direct3D draw calls in "16-bit" mode
require tailor-made approximations for the "8-bit" mode. Since our new
32-bit mode is equivalent to the 16-bit mode in all of these branches, I
only needed to replace the raw number comparisons with more meaningful
method calls.
That only left a very small number of 2D raster effects that directly write
to or read from DirectDraw surface memory, and therefore do need to know the
bit size of each pixel. Thanks to std::variant and
std::visit(), adding 32-bit support becomes trivial here: By
rewriting the code in a generic manner that derives all offsets from the
template type, you only have to say hey,
I'd like to have 32-bit as well, and C++ will automatically
instantiate correct 32-bit variants of all bit depth-dependent code
snippets.
There are only three features in the entire game that access pixel buffers
this way: a color key retrieval function, the lens ball animation on the
logo screen, and… the ending staff roll? Sure, the text sprites fade in and
out, but so does the picture next to it, using Direct3D alpha blending or
palette color ramping depending on the current rendering mode. Instead, the
only reason why these sprites directly access their pixel buffer is… an
unused and pretty wild spiral effect. 😮 It's still part of the code, and
only doesn't show up because the
parameters that control its timing were commented out before release:
They probably considered it too wild for the mood of this
ending.
The main ending text was the only remaining issue of mojibake present in my
previous Shuusou Gyoku builds, and is now fixed as well. Windows can
render Shift-JIS text via GDI even outside Japanese locale, but only when
explicitly selecting a font that supports the SHIFTJIS_CHARSET,
and the game simply didn't select any font for rendering this text.
Thus, GDI fell back onto its default font, which obviously is only
guaranteed to support the SHIFTJIS_CHARSET if your system
locale is set to Japanese. This is why the font in the original game might
lookdifferent between systems.
For my build, I chose the font that would appear on a clean Windows
installation – a basic 400-weighted MS Gothic at font size 16, which is
already used all throughout the game.
Alright, 32-bit mode complete, let's set it as the default if possible… and
break compatibility to the original 秋霜CFG.DAT format in the
process? When validating this file, the original game only allows the
originally supported 8-bit or 16-bit modes. Setting the
BitDepth field to any other value causes the entire file
to be reset to its defaults, re-locking the Extra Stage in the process.
Introducing a backward-compatible version
system for 秋霜CFG.DAT was beyond the scope of this push.
Changing the validation to a per-field approach was a good small first step
to take though. The new build no longer validates the BitDepth
field against a fixed list, but against the actually supported bit depths on
your system, picking a different supported one if necessary. With the
original approach, this would have caused your entire configuration to fail
the validation check. Instead, you can now safely update to the new build
without losing your option settings, or your previously unlocked access to
the Extra Stage.
Side note: The validation limit for starting bombs is off by one, and the
one for starting lives check is off by two. By modifying
秋霜CFG.DAT, you could theoretically get new games to start with
7 lives and 3 bombs… if you then calculate a correct checksum for your
hacked config file, that is. 🧑💻
Interestingly, DirectDraw doesn't even indicate support for 8-bit or 16-bit
color on systems that are affected by the initially mentioned issues.
Therefore, these issues are not the fault of DirectDraw, but of
Shuusou Gyoku, as the original release requested a bit depth that it has
even verified to be unsupported. Unfortunately, Windows sides with
Sim City Shuusou Gyoku here: If you previously experimented with the
Windows app compatibility settings, you might have ended up with the
DWM8And16BitMitigation flag assigned to the full file path of
your Shuusou Gyoku executable in either
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Layers, or
As the term mitigation suggests, these modes are (poorly) emulated,
which is exactly what causes the issues with this game in the first place.
Sure, this might be the lesser evil from the point of view of an operating
system: If you don't have the budget for a full-blown DDrawCompat-style
DirectDraw wrapper, you might consider it better for users to have the game
run poorly than have it fail at startup due to incorrect API usage.
Controlling this with a flag that sticks around for future runs of a binary
is definitely suboptimal though, especially given how hard it
is to programmatically remove this flag within the binary itself. It
only adds additional complexity to the ideal clean upgrade path.
So, make sure to check your registry and manually remove these flags for the
time being. Without them, the new Config → Graphic menu will
correctly prevent you from selecting anything else but 32-bit on modern
Windows.
After all that, there was just enough time left in this push to implement
basic locale independence, as requested by the Seihou development
Discord group, without looking into automatic fixes for previous mojibake
filenames yet. Combining std::filesystem::path with the native
Win32 API should be straightforward and bloat-free, especially with all the
abstractions I've been building, right?
Well, turns out that std::filesystem::path does not
actually meet my expectations. At least as long as it's not
constexpr-enabled, because you still get the unfortunate
conversion from narrow to wide encoding at runtime, even for globals with
static storage duration. That brings us back to writing our path abstraction
in terms of the regular std::string and
std::wstring containers, which at least allow us to enforce the
respective encoding at compile time. Even std::string_view only
adds to the complexity here, as its strings are never inherently
null-terminated, which is required by both the POSIX and Win32 APIs. Not to
mention dynamic filenames: C++20's std::format() would be the
obvious idiomatic choice here, but using it almost doubles the size
of the compiled binary… 🤮
In the end, the most bloat-free way of implementing C++ file I/O in 2023 is
still the same as it was 30 years ago: Call system APIs, roll a custom
abstraction that conditionally uses the L prefix, and pass
around raw pointers. And if you need a dynamic filename, just write the
dynamic characters into arrays at fixed positions. Just as PC-98 Touhou used
to do…
Oh, and the game's window also uses a Unicode title bar now.
And that's it for this push! Make sure to rename your configuration
(秋霜CFG.DAT), score (秋霜SC.DAT), and replay
(秋霜りぷ*.DAT) filenames if you were previously running the
game on a non-Japanese locale, and then grab the new build:
Next up: Starting the new year with all my plans hopefully working out for
once. TH05 Sara very soon, ZMBV code review afterward, low-hanging fruit of
the TH01 Anniversary Edition after that, and then kicking off TH02 with a
bunch of low-level blitting code.
TH05 has passed the 50% RE mark, with both MAIN.EXE and the
game as a whole! With that, we've also reached what -Tom-
wanted out of the project, so he's suspending his discount offer for a
bit.
Curve bullets are now officially called cheetos! 76.7% of
fans prefer this term, and it fits into the 8.3 DOS filename scheme much
better than homing lasers (as they're called in
OMAKE.TXT) or Taito
lasers (which would indeed have made sense as well).
…oh, and I managed to decompile Shinki within 2 pushes after all. That
left enough budget to also add the Stage 1 midboss on top.
So, Shinki! As far as final boss code is concerned, she's surprisingly
economical, with 📝 her background animations
making up more than ⅓ of her entire code. Going straight from TH01's
📝 final📝 bosses
to TH05's final boss definitely showed how much ZUN had streamlined
danmaku pattern code by the end of PC-98 Touhou. Don't get me wrong, there
is still room for improvement: TH05 not only
📝 reuses the same 16 bytes of generic boss state we saw in TH04 last month,
but also uses them 4× as often, and even for midbosses. Most importantly
though, defining danmaku patterns using a single global instance of the
group template structure is just bad no matter how you look at it:
The script code ends up rather bloated, with a single MOV
instruction for setting one of the fields taking up 5 bytes. By comparison,
the entire structure for regular bullets is 14 bytes large, while the
template structure for Shinki's 32×32 ball bullets could have easily been
reduced to 8 bytes.
Since it's also one piece of global state, you can easily forget to set
one of the required fields for a group type. The resulting danmaku group
then reuses these values from the last time they were set… which might have
been as far back as another boss fight from a previous stage.
And of course, I wouldn't point this out if it
didn't actually happen in Shinki's pattern code. Twice.
Declaring a separate structure instance with the static data for every
pattern would be both safer and more space-efficient, and there's
more than enough space left for that in the game's data segment.
But all in all, the pattern functions are short, sweet, and easy to follow.
The "devil"
patternis significantly more complex than the others, but still
far from TH01's final bosses at their worst. I especially like the clear
architectural separation between "one-shot pattern" functions that return
true once they're done, and "looping pattern" functions that
run as long as they're being called from a boss's main function. Not many
all too interesting things in these pattern functions for the most part,
except for two pieces of evidence that Shinki was coded after Yumeko:
The gather animation function in the first two phases contains a bullet
group configuration that looks like it's part of an unused danmaku
pattern. It quickly turns out to just be copy-pasted from a similar function
in Yumeko's fight though, where it is turned into actual
bullets.
As one of the two places where ZUN forgot to set a template field, the
lasers at the end of the white wing preparation pattern reuse the 6-pixel
width of Yumeko's final laser pattern. This actually has an effect on
gameplay: Since these lasers are active for the first 8 frames after
Shinki's wings appear on screen, the player can get hit by them in the last
2 frames after they grew to their final width.
Of course, there are more than enough safespots between the lasers.
Speaking about that wing sprite: If you look at ST05.BB2 (or
any other file with a large sprite, for that matter), you notice a rather
weird file layout:
A large sprite split into multiple smaller ones with a width of
64 pixels each? What's this, hardware sprite limitations? On my
PC-98?!
And it's not a limitation of the sprite width field in the BFNT+ header
either. Instead, it's master.lib's BFNT functions which are limited to
sprite widths up to 64 pixels… or at least that's what
MASTER.MAN claims. Whatever the restriction was, it seems to be
completely nonexistent as of master.lib version 0.23, and none of the
master.lib functions used by the games have any issues with larger
sprites.
Since ZUN stuck to the supposed 64-pixel width limit though, it's now the
game that expects Shinki's winged form to consist of 4 physical
sprites, not just 1. Any conversion from another, more logical sprite sheet
layout back into BFNT+ must therefore replicate the original number of
sprites. Otherwise, the sequential IDs ("patnums") assigned to every newly
loaded sprite no longer match ZUN's hardcoded IDs, causing the game to
crash. This is exactly what used to happen with -Tom-'s
MysticTK automation scripts,
which combined these exact sprites into a single large one. This issue has
now been fixed – just in case there are some underground modders out there
who used these scripts and wonder why their game crashed as soon as the
Shinki fight started.
And then the code quality takes a nosedive with Shinki's main function.
Even in TH05, these boss and midboss update
functions are still very imperative:
The origin point of all bullet types used by a boss must be manually set
to the current boss/midboss position; there is no concept of a bullet type
tracking a certain entity.
The same is true for the target point of a player's homing shots…
… and updating the HP bar. At least the initial fill animation is
abstracted away rather decently.
Incrementing the phase frame variable also must be done manually. TH05
even "innovates" here by giving the boss update function exclusive ownership
of that variable, in contrast to TH04 where that ownership is given out to
the player shot collision detection (?!) and boss defeat helper
functions.
Speaking about collision detection: That is done by calling different
functions depending on whether the boss is supposed to be invincible or
not.
Timeout conditions? No standard way either, and all done with manual
if statements. In combination with the regular phase end
condition of lowering (mid)boss HP to a certain value, this leads to quite a
convoluted control flow.
The manual calls to the score bonus functions for cleared phases at least provide some sense of orientation.
One potentially nice aspect of all this imperative freedom is that
phases can end outside of HP boundaries… by manually incrementing the
phase variable and resetting the phase frame variable to 0.
The biggest WTF in there, however, goes to using one of the 16 state bytes
as a "relative phase" variable for differentiating between boss phases that
share the same branch within the switch(boss.phase)
statement. While it's commendable that ZUN tried to reduce code duplication
for once, he could have just branched depending on the actual
boss.phase variable? The same state byte is then reused in the
"devil" pattern to track the activity state of the big jerky lasers in the
second half of the pattern. If you somehow managed to end the phase after
the first few bullets of the pattern, but before these lasers are up,
Shinki's update function would think that you're still in the phase
before the "devil" pattern. The main function then sequence-breaks
right to the defeat phase, skipping the final pattern with the burning Makai
background. Luckily, the HP boundaries are far away enough to make this
impossible in practice.
The takeaway here: If you want to use the state bytes for your custom
boss script mods, alias them to your own 16-byte structure, and limit each
of the bytes to a clearly defined meaning across your entire boss script.
One final discovery that doesn't seem to be documented anywhere yet: Shinki
actually has a hidden bomb shield during her two purple-wing phases.
uth05win got this part slightly wrong though: It's not a complete
shield, and hitting Shinki will still deal 1 point of chip damage per
frame. For comparison, the first phase lasts for 3,000 HP, and the "devil"
pattern phase lasts for 5,800 HP.
And there we go, 3rd PC-98 Touhou boss
script* decompiled, 28 to go! 🎉 In case you were expecting a fix for
the Shinki death glitch: That one
is more appropriately fixed as part of the Mai & Yuki script. It also
requires new code, should ideally look a bit prettier than just removing
cheetos between one frame and the next, and I'd still like it to fit within
the original position-dependent code layout… Let's do that some other
time.
Not much to say about the Stage 1 midboss, or midbosses in general even,
except that their update functions have to imperatively handle even more
subsystems, due to the relative lack of helper functions.
The remaining ¾ of the third push went to a bunch of smaller RE and
finalization work that would have hardly got any attention otherwise, to
help secure that 50% RE mark. The nicest piece of code in there shows off
what looks like the optimal way of setting up the
📝 GRCG tile register for monochrome blitting
in a variable color:
mov ah, palette_index ; Any other non-AL 8-bit register works too.
; (x86 only supports AL as the source operand for OUTs.)
rept 4 ; For all 4 bitplanes…
shr ah, 1 ; Shift the next color bit into the x86 carry flag
sbb al, al ; Extend the carry flag to a full byte
; (CF=0 → 0x00, CF=1 → 0xFF)
out 7Eh, al ; Write AL to the GRCG tile register
endm
Thanks to Turbo C++'s inlining capabilities, the loop body even decompiles
into a surprisingly nice one-liner. What a beautiful micro-optimization, at
a place where micro-optimization doesn't hurt and is almost expected.
Unfortunately, the micro-optimizations went all downhill from there,
becoming increasingly dumb and undecompilable. Was it really necessary to
save 4 x86 instructions in the highly unlikely case of a new spark sprite
being spawned outside the playfield? That one 2D polar→Cartesian
conversion function then pointed out Turbo C++ 4.0J's woefully limited
support for 32-bit micro-optimizations. The code generation for 32-bit
📝 pseudo-registers is so bad that they almost
aren't worth using for arithmetic operations, and the inline assembler just
flat out doesn't support anything 32-bit. No use in decompiling a function
that you'd have to entirely spell out in machine code, especially if the
same function already exists in multiple other, more idiomatic C++
variations.
Rounding out the third push, we got the TH04/TH05 DEMO?.REC
replay file reading code, which should finally prove that nothing about the
game's original replay system could serve as even just the foundation for
community-usable replays. Just in case anyone was still thinking that.
Next up: Back to TH01, with the Elis fight! Got a bit of room left in the
cap again, and there are a lot of things that would make a lot of
sense now:
TH04 would really enjoy a large number of dedicated pushes to catch up
with TH05. This would greatly support the finalization of both games.
Continuing with TH05's bosses and midbosses has shown to be good value
for your money. Shinki would have taken even less than 2 pushes if she
hadn't been the first boss I looked at.
Oh, and I also added Seihou as a selectable goal, for the two people out
there who genuinely like it. If I ever want to quit my day job, I need to
branch out into safer territory that isn't threatened by takedowns, after
all.
TH03 finally passed 20% RE, and the newly decompiled code contains no
serious ZUN bugs! What a nice way to end the year.
There's only a single unlockable feature in TH03: Chiyuri and Yumemi as
playable characters, unlocked after a 1CC on any difficulty. Just like the
Extra Stages in TH04 and TH05, YUME.NEM contains a single
designated variable for this unlocked feature, making it trivial to craft a
fully unlocked score file without recording any high scores that others
would have to compete against. So, we can now put together a complete set
for all PC-98 Touhou games: 2021-12-27-Fully-unlocked-clean-score-files.zip
It would have been cool to set the randomly generated encryption keys in
these files to a fixed value so that they cancel out and end up not actually
encrypting the file. Too bad that TH03 also started feeding each encrypted
byte back into its stream cipher, which makes this impossible.
The main loading and saving code turned out to be the second-cleanest
implementation of a score file format in PC-98 Touhou, just behind TH02.
Only two of the YUME.NEM functions come with nonsensical
differences between OP.EXE and MAINL.EXE, rather
than 📝 all of them, as in TH01 or
📝 too many of them, as in TH04 and TH05. As
for the rest of the per-difficulty structure though… well, it quickly
becomes clear why this was the final score file format to be RE'd. The name,
score, and stage fields are directly stored in terms of the internal
REGI*.BFT sprite IDs used on the high score screen. TH03 also
stores 10 score digits for each place rather than the 9 possible ones, keeps
any leading 0 digits, and stores the letters of entered names in reverse
order… yeah, let's decompile the high score screen as well, for a full
understanding of why ZUN might have done all that. (Answer: For no reason at
all. )
And wow, what a breath of fresh air. It's surely not
good-code: The overlapping shadows resulting from using
a 24-pixel letterspacing with 32-pixel glyphs in the name column led ZUN to
do quite a lot of unnecessary and slightly confusing rendering work when
moving the cursor back and forth, and he even forgot about the EGC there.
But it's nowhere close to the level of jank we saw in
📝 TH01's high score menu last year. Good to
see that ZUN had learned a thing or two by his third game – especially when
it comes to storing the character map cursor in terms of a character ID,
and improving the layout of the character map:
That's almost a nicely regular grid there. With the question mark and the
double-wide SP, BS, and END options, the cursor
movement code only comes with a reasonable two exceptions, which are easily
handled. And while I didn't get this screen completely decompiled,
one additional push was enough to cover all important code there.
The only potential glitch on this screen is a result of ZUN's continued use
of binary-coded
decimal digits without any bounds check or cap. Like the in-game HUD
score display in TH04 and TH05, TH03's high score screen simply uses the
next glyph in the character set for the most significant digit of any score
above 1,000,000,000 points – in this case, the period. Still, it only
really gets bad at 8,000,000,000 points: Once the glyphs are
exhausted, the blitting function ends up accessing garbage data and filling
the entire screen with garbage pixels. For comparison though, the current world record
is 133,650,710 points, so good luck getting 8 billion in the first
place.
Next up: Starting 2022 with the long-awaited decompilation of TH01's Sariel
fight! Due to the 📝 recent price increase,
we now got a window in the cap that
is going to remain open until tomorrow, providing an early opportunity to
set a new priority after Sariel is done.
Didn't quite get to cover background rendering for TH05's Stage 1-5
bosses in this one, as I had to reverse-engineer two more fundamental parts
involved in boss background rendering before.
First, we got the those blocky transitions from stage tiles to bomb and
boss backgrounds, loaded from BB*.BB and ST*.BB,
respectively. These files store 16 frames of animation, with every bit
corresponding to a 16×16 tile on the playfield. With 384×368 pixels to be
covered, that would require 69 bytes per frame. But since that's a very odd
number to work with in micro-optimized ASM, ZUN instead stores 512×512
pixels worth of bits, ending up with a frame size of 128 bytes, and a
per-frame waste of 59 bytes. At least it was
possible to decompile the core blitting function as __fastcall
for once.
But wait, TH05 comes with, and loads, a bomb .BB file for every character,
not just for the Reimu and Yuuka bomb transitions you see in-game… 🤔
Restoring those unused stage tile → bomb image transition
animations for Mima and Marisa isn't that trivial without having decompiled
their actual bomb animation functions before, so stay tuned!
Interestingly though, the code leaves out what would look like the most
obvious optimization: All stage tiles are unconditionally redrawn
each frame before they're erased again with the 16×16 blocks, no matter if
they weren't covered by such a block in the previous frame, or are
going to be covered by such a block in this frame. The same is true
for the static bomb and boss background images, where ZUN simply didn't
write a .CDG blitting function that takes the dirty tile array into
account. If VRAM writes on PC-98 really were as slow as the games'
README.TXT files claim them to be, shouldn't all the
optimization work have gone towards minimizing them?
Oh well, it's not like I have any idea what I'm talking about here. I'd
better stop talking about anything relating to VRAM performance on PC-98…
Second, it finally was time to solve the long-standing confusion about all
those callbacks that are supposed to render the playfield background. Given
the aforementioned static bomb background images, ZUN chose to make this
needlessly complicated. And so, we have two callback function
pointers: One during bomb animations, one outside of bomb
animations, and each boss update function is responsible for keeping the
former in sync with the latter.
Other than that, this was one of the smoothest pushes we've had in a while;
the hardest parts of boss background rendering all were part of
📝 the last push. Once you figured out that
ZUN does indeed dynamically change hardware color #0 based on the current
boss phase, the remaining one function for Shinki, and all of EX-Alice's
background rendering becomes very straightforward and understandable.
Meanwhile, -Tom- told me about his plans to publicly
release 📝 his TH05 scripting toolkit once
TH05's MAIN.EXE would hit around 50% RE! That pretty much
defines what the next bunch of generic TH05 pushes will go towards:
bullets, shared boss code, and one
full, concrete boss script to demonstrate how it's all combined. Next up,
therefore: TH04's bullet firing code…? Yes, TH04's. I want to see what I'm
doing before I tackle the undecompilable mess that is TH05's bullet firing
code, and you all probably want readable code for that feature as
well. Turns out it's also the perfect place for Blue Bolt's
pending contributions.
Technical debt, part 9… and as it turns out, it's highly impractical to
repay 100% of it at this point in development. 😕
The reason: graph_putsa_fx(), ZUN's function for rendering
optionally boldfaced text to VRAM using the font ROM glyphs, in its
ridiculously micro-optimized TH04 and TH05 version. This one sets the
"callback function" for applying the boldface effect by self-modifying
the target of two CALL rel16 instructions… because
there really wasn't any free register left for an indirect
CALL, eh? The necessary distance, from the call site to the
function itself, has to be calculated at assembly time, by subtracting the
target function label from the call site label.
This usually wouldn't be a problem… if ZUN didn't store the resulting
lookup tables in the .DATA segment. With code segments, we
can easily split them at pretty much any point between functions because
there are multiple of them. But there's only a single .DATA
segment, with all ZUN and master.lib data sandwiched between Borland C++'s
crt0 at the
top, and Borland C++'s library functions at the bottom of the segment.
Adding another split point would require all data after that point to be
moved to its own translation unit, which in turn requires
EXTERN references in the big .ASM file to all that moved
data… in short, it would turn the codebase into an even greater
mess.
Declaring the labels as EXTERN wouldn't work either, since
the linker can't do fancy arithmetic and is limited to simply replacing
address placeholders with one single address. So, we're now stuck with
this function at the bottom of the SHARED segment, for the
foreseeable future.
We can still continue to separate functions off the top of that segment,
though. Pretty much the only thing noteworthy there, so far: TH04's code
for loading stage tile images from .MPN files, which we hadn't
reverse-engineered so far, and which nicely fit into one of
Blue Bolt's pending ⅓ RE contributions. Yup, we finally moved
the RE% bars again! If only for a tiny bit.
Both TH02 and TH05 simply store one pointer to one dynamically allocated
memory block for all tile images, as well as the number of images, in the
data segment. TH04, on the other hand, reserves memory for 8 .MPN slots,
complete with their color palettes, even though it only ever uses the
first one of these. There goes another 458 bytes of conventional RAM… I
should start summing up all the waste we've seen so far. Let's put the
next website contribution towards a tagging system for these blog posts.
At 86% of technical debt in the SHARED segment repaid, we
aren't quite done yet, but the rest is mostly just TH04 needing to catch
up with functions we've already separated. Next up: Getting to that
practical 98.5% point. Since this is very likely to not require a full
push, I'll also decompile some more actual TH04 and TH05 game code I
previously reverse-engineered – and after that, reopen the store!
Now that's the amount of translation unit separation progress I was
looking for! Too bad that RL is keeping me more and more occupied these
days, and ended up delaying this push until 2021. Now that
Touhou Patch Center is also commissioning me to update their
infrastructure, it's going to take a while for ReC98 to return to full
speed, and for the store to be reopened. Should happen by April at the
latest, though!
With everything related to this separation of translation units explained
earlier, we've really got a push with nothing to talk about, this
time. Except, maybe, for the realization that
📝 this current approach might not be the
best fit for TH02 after all: Not only did it force us to
📝 throw away the previous decompilation of
the sound effect playback functions, but OP.EXE also contains
obviously copy-pasted code in addition to the common, shared set of
library functions. How was that game even built, originally??? No
way around compiling that one instance of the "delay until given BGM
measure" function separately then, if it insists on using its own
instance of the VSync delay function…
Oh well, this separated layout still works better for the later games, and
consistency is good. Smooth sailing with all of the other functions, at
least.
Next up: One more of these, which might even end up completing the
📝 transition to our own master.lib header file.
In terms of the total number of ASM code left in the SHARED
code segments, we're now 30% done after 3 dedicated pushes. It really
shouldn't require 7 more pushes, though!
So, only one card-flipping function missing, and then we can start
decompiling TH01's two final bosses? Unfortunately, that had to be the one
big function that initializes and renders all gameplay objects. #17 on the
list of longest functions in all of PC-98 Touhou, requiring two pushes to
fully understand what's going on there… and then it immediately returns
for all "boss" stages whose number is divisible by 5, yet is still called
during Sariel's and Konngara's initialization 🤦
Oh well. This also involved the final file format we hadn't looked at
yet – the STAGE?.DAT files that describe the layout for all
stages within a single 5-stage scene. Which, for a change is a very
well-designed form– no, of course it's completely weird, what did you
expect? Development must have looked somewhat like this:
Weirdness #1: "Hm, the stage format should
include the file names for the background graphics and music… or should
it?" And so, the 22-byte header still references some music and
background files that aren't part of the final game. The game doesn't use
anything from there, and instead derives those file names from the
scene ID.
That's probably nothing new to anyone who has ever looked at TH01's data
files. In a slightly more interesting discovery though, seeing the
📝 .GRF extension, in some of the file names
that are short enough to not cut it off, confirms that .GRF was initially
used for background images. Probably before ZUN learned about .PI, and how
it achieves better compression than his own per-bitplane RLE approach?
Weirdness #2: "Hm, I might want to put
obstacles on top of cards?" You'd probably expect this format to
contain one single array for every stage, describing which object to place
on every 32×32 tile, if any. Well, the real format uses two arrays:
One for the cards, and a combined one for all "obstacles" – bumpers, bumper
bars, turrets, and portals. However, none of the card-flipping stages in
the final game come with any such overlaps. That's quite unfortunate, as it
would have made for some quite interesting level designs:
As you can see, the final version of the blitting code was not written
with such overlaps in mind either, blitting the cards on top of all
the obstacles, and not the other way round.
Weirdness #3: "In contrast to obstacles, of
which there are multiple types, cards only really need 1 bit. Time for some
bit twiddling!" Not the worst idea, given that the 640×336 playfield
can fit 20×10 cards, which would fit exactly into 25 bytes if you use a
single bit to indicate card or no card. But for whatever
reason, ZUN only stored 4 card bits per byte, leaving the other 4 bits
unused, and needlessly blowing up that array to 50 bytes. 🤷
Oh, and did I mention that the contents of the STAGE?.DAT files are
loaded into the main data segment, even though the game immediately parses
them into something more conveniently accessible? That's another 1250 bytes
of memory wasted for no reason…
Weirdness #4: "Hm, how about requiring the
player to flip some of the cards multiple times? But I've already written
all this bit twiddling code to store 4 cards in 1 byte. And if cards should
need anywhere from 1 to 4 flips, that would need at least 2 more bits,
which won't fit into the unused 4 bits either…" This feature
must have come later, because the final game uses 3 "obstacle" type
IDs to act as a flip count modifier for a card at the same relative array
position. Complete with lookup code to find the actual card index these
modifiers belong to, and ridiculous switch statements to not include
those non-obstacles in the game's internal obstacle array.
With all that, it's almost not worth mentioning how there are 12 turret
types, which only differ in which hardcoded pellet group they fire at a
hardcoded interval of either 100 or 200 frames, and that they're all
explicitly spelled out in every single switch statement. Or
how the layout of the internal card and obstacle SoA classes is quite
disjointed. So here's the new ZUN bugs you've probably already been
expecting!
Cards and obstacles are blitted to both VRAM pages. This way, any other
entities moving on top of them can simply be unblitted by restoring pixels
from VRAM page 1, without requiring the stationary objects to be redrawn
from main memory. Obviously, the backgrounds behind the cards have to be
stored somewhere, since the player can remove them. For faster transitions
between stages of a scene, ZUN chose to store the backgrounds behind
obstacles as well. This way, the background image really only needs to be
blitted for the first stage in a scene.
All that memory for the object backgrounds adds up quite a bit though. ZUN
actually made the correct choice here and picked a memory allocation
function that can return more than the 64 KiB of a single x86 Real Mode
segment. He then accesses the individual backgrounds via regular array
subscripts… and that's where the bug lies, because he stores the returned
address in a regular far pointer rather than a
huge one. This way, the game still can only display a
total of 102 objects (i. e., cards and obstacles combined) per stage,
without any unblitting glitches.
What a shame, that limit could have been 127 if ZUN didn't needlessly
allocate memory for alpha planes when backing up VRAM content.
And since array subscripts on far pointers wrap around after
64 KiB, trying to save the background of the 103rd object is guaranteed to
corrupt the memory block header at the beginning of the returned segment.
When TH01 runs in debug mode, it
correctly reports a corrupted heap in this case.
After detecting such a corruption, the game loudly reports it by playing the
"player hit" sound effect and locking up, freezing any further gameplay or
rendering. The locking loop can be left by pressing ↵ Return, but the
game will simply re-enter it if the corruption is still present during the
next heapcheck(), in the next frame. And since heap
corruptions don't tend to repair themselves, you'd have to constantly hold
↵ Return to resume gameplay. Doing that could actually get you
safely to the next boss, since the game doesn't allocate or free any further
heap memory during a 5-stage card-flipping scene, and
just throws away its C heap when restarting the process for a boss. But then
again, holding ↵ Return will also auto-flip all cards on the way there…
🤨
Finally, some unused content! Upon discovering TH01's stage selection debug
feature, probably everyone tried to access Stage 21,
just to see what happens, and indeed landed in an actual stage, with a
black background and a weird color palette. Turns out that ZUN did
ship an unused scene in SCENE7.DAT, which is exactly what's
loaded there.
However, it's easy to believe that this is just garbage data (as I
initially did): At the beginning of "Stage 22", the game seems to enter an
infinite loop somewhere during the flip-in animation.
Well, we've had a heap overflow above, and the cause here is nothing but a
stack buffer overflow – a perhaps more modern kind of classic C bug,
given its prevalence in the Windows Touhou games. Explained in a few lines
of code:
void stageobjs_init_and_render()
{
int card_animation_frames[50]; // even though there can be up to 200?!
int total_frames = 0;
(code that would end up resetting total_frames if it ever tried to reset
card_animation_frames[50]…)
}
The number of cards in "Stage 22"? 76. There you have it.
But of course, it's trivial to disable this animation and fix these stage
transitions. So here they are, Stages 21 to 24, as shipped with the game
in STAGE7.DAT:
Wow, what a mess. All that was just a bit too much to be covered in two
pushes… Next up, assuming the current subscriptions: Taking a vacation with
one smaller TH01 push, covering some smaller functions here and there to
ensure some uninterrupted Konngara progress later on.
Done with the .BOS format, at last! While there's still quite a bunch of
undecompiled non-format blitting code left, this was in fact the final
piece of graphics format loading code in TH01.
📝 Continuing the trend from three pushes ago,
we've got yet another class, this time for the 48×48 and 48×32 sprites
used in Reimu's gohei, slide, and kick animations. The only reason these
had to use the .BOS format at all is simply because Reimu's regular
sprites are 32×32, and are therefore loaded from
📝 .PTN files.
Yes, this makes no sense, because why would you split animations for
the same character across two file formats and two APIs, just because
of a sprite size difference?
This necessity for switching blitting APIs might also explain why Reimu
vanishes for a few frames at the beginning and the end of the gohei swing
animation, but more on that once we get to the high-level rendering code.
Now that we've decompiled all the .BOS implementations in TH01, here's an
overview of all of them, together with .PTN to show that there really was
no reason for not using the .BOS API for all of Reimu's sprites:
CBossEntity
CBossAnim
CPlayerAnim
ptn_* (32×32)
Format
.BOS
.BOS
.BOS
.PTN
Hitbox
✔
✘
✘
✘
Byte-aligned blitting
✔
✔
✔
✔
Byte-aligned unblitting
✔
✘
✔
✔
Unaligned blitting
Single-line and wave only
✘
✘
✘
Precise unblitting
✔
✘
✔
✔
Per-file sprite limit
8
8
32
64
Pixels blitted at once
16
16
8
32
And even that last property could simply be handled by branching based on
the sprite width, and wouldn't be a reason for switching formats. But
well, it just wouldn't be TH01 without all that redundant bloat though,
would it?
The basic loading, freeing, and blitting code was yet another variation
on the other .BOS code we've seen before. So this should have caused just
as little trouble as the CBossAnim code… except that
CPlayerAnimdid add one slightly difficult function to
the mix, which led to it requiring almost a full push after all.
Similar to 📝 the unblitting code for moving lasers we've seen in the last push,
ZUN tries to minimize the amount of VRAM writes when unblitting Reimu's
slide animations. Technically, it's only necessary to restore the pixels
that Reimu traveled by, plus the ones that wouldn't be redrawn by
the new animation frame at the new X position.
The theoretically arbitrary distance between the two sprites is, of
course, modeled by a fixed-size buffer on the stack
, coming with the further assumption that the
sprite surely hasn't moved by more than 1 horizontal VRAM byte compared to
the last frame. Which, of course, results in glitches if that's not the
case, leaving little Reimu parts in VRAM if the slide speed ever exceeded
8 pixels per frame. (Which it never does,
being hardcoded to 6 pixels, but still.). As it also turns out, all those
bit masking operations easily lead to incredibly sloppy C code.
Which compiles into incredibly terrible ASM, which in turn might end up
wasting way more CPU time than the final VRAM write optimization would
have gained? Then again, in-depth profiling is way beyond the scope of
this project at this point.
Next up: The TH04 main menu, and some more technical debt.
So, TH05 OP.EXE. The first half of this push started out
nicely, with an easy decompilation of the entire player character
selection menu. Typical ZUN quality, with not much to say about it. While
the overall function structure is identical to its TH04 counterpart, the
two games only really share small snippets inside these functions, and do
need to be RE'd separately.
The high score viewing (not registration) menu would have been next.
Unfortunately, it calls one of the GENSOU.SCR loading
functions… which are all a complete mess that still needed to be sorted
out first. 5 distinct functions in 6 binaries, and of course TH05 also
micro-optimized its MAIN.EXE version to directly use the DOS
INT 21h file loading API instead of master.lib's wrappers.
Could have all been avoided with a single method on the score data
structure, taking a player character ID and a difficulty level as
parameters…
So, no score menu in this push then. Looking at the other end of the ASM
code though, we find the starting functions for the main game, the Extra
Stage, and the demo replays, which did fit perfectly to round out
this push.
Which is where we find an easter egg! 🥚 If you've ever looked into
怪綺談2.DAT, you might have noticed 6 .REC files
with replays for the Demo Play mode. However, the game only ever seems to
cycle between 4 replays. So what's in the other two, and why are they
40 KB instead of just 10 KB like the others? Turns out that they
combine into a full Extra Stage Clear replay with Mima, with 3 bombs and 1
death, obviously recorded by ZUN himself. The split into two files for the
stage (DEMO4.REC) and boss (DEMO5.REC) portion is
merely an attempt to limit the amount of simultaneously allocated heap
memory.
To watch this replay without modding the game, unlock the Extra Stage with
all 4 characters, then hold both the ⬅️ left and ➡️ right arrow keys in the
main menu while waiting for the usual demo replay.
I can't possibly be the first one to discover this, but I couldn't find
any other mention of it. Edit (2021-03-15): ZUN did in fact document this replay
in Section 6 of TH05's OMAKE.TXT, along with the exact method
to view it.
Thanks
to Popfan for the discovery!
Here's a recording of the whole replay:
Note how the boss dialogue is skipped. MAIN.EXE actually
contains no less than 6 if() branches just to distinguish
this overly long replay from the regular ones.
I'd really like to do the TH04 and TH05 main menus in parallel, since we
can expect a bit more shared code after all the initial differences.
Therefore, I'm going to put the next "anything" push towards covering the
TH04 version of those functions. Next up though, it's back to TH01, with
more redundant image format code…
Finally, after a long while, we've got two pushes with barely anything to
talk about! Continuing the road towards 100% PI for TH05, these were
exactly the two pushes that TH05 MAINE.EXE PI was estimated
to additionally cost, relative to TH04's. Consequently, they mostly went
to TH05's unique data structures in the ending cutscenes, the score name
registration menu, and the
staff roll.
A unique feature in there is TH05's support for automatic text color
changes in its ending scripts, based on the first full-width Shift-JIS
codepoint in a line. The \c=codepoint,color
commands at the top of the _ED??.TXT set up exactly this
codepoint→color mapping. As far as I can tell, TH05 is the only Touhou
game with a feature like this – even the Windows Touhou games went back to
manually spelling out each color change.
The orb particles in TH05's staff roll also try to be a bit unique by
using 32-bit X and Y subpixel variables for their current position. With
still just 4 fractional bits, I can't really tell yet whether the extended
range was actually necessary. Maybe due to how the "camera scrolling"
through "space" was implemented? All other entities were pretty much the
usual fare, though.
12.4, 4.4, and now a 28.4 fixed-point format… yup,
📝 C++ templates were
definitely the right choice.
At the end of its staff roll, TH05 not only displays
the usual performance
verdict, but then scrolls in the scores at the end of each stage
before switching to the high score menu. The simplest way to smoothly
scroll between two full screens on a PC-98 involves a separate bitmap…
which is exactly what TH05 does here, reserving 28,160 bytes of its global
data segment for just one overly large monochrome 320×704 bitmap where
both the screens are rendered to. That's… one benefit of splitting your
game into multiple executables, I guess?
Not sure if it's common knowledge that you can actually scroll back and
forth between the two screens with the Up and Down keys before moving to
the score menu. I surely didn't know that before. But it makes sense –
might as well get the most out of that memory.
The necessary groundwork for all of this may have actually made
TH04's (yes, TH04's) MAINE.EXE technically
position-independent. Didn't quite reach the same goal for TH05's – but
what we did reach is ⅔ of all PC-98 Touhou code now being
position-independent! Next up: Celebrating even more milestones, as
-Tom- is about to finish development on his TH05
MAIN.EXE PI demo…
Alright, tooling and technical debt. Shouldn't be really much to talk
about… oh, wait, this is still ReC98
For the tooling part, I finished up the remaining ergonomics and error
handling for the
📝 sprite converter that Jonathan Campbell contributed two months ago.
While I familiarized myself with the tool, I've actually ran into some
unreported errors myself, so this was sort of important to me. Still got
no command-line help in there, but the error messages can now do that job
probably even better, since we would have had to write them anyway.
So, what's up with the technical debt then? Well, by now we've accumulated
quite a number of 📝 ASM code slices that
need to be either decompiled or clearly marked as undecompilable. Since we
define those slices as "already reverse-engineered", that decision won't
affect the numbers on the front page at all. But for a complete
decompilation, we'd still have to do this someday. So, rather than
incorporating this work into pushes that were purchased with the
expectation of measurable progress in a certain area, let's take the
"anything goes" pushes, and focus entirely on that during them.
The second code segment seemed like the best place to start with this,
since it affects the largest number of games simultaneously. Starting with
TH02, this segment contains a set of random "core" functions needed by the
binary. Image formats, sounds, input, math, it's all there in some
capacity. You could maybe call it all "libzun" or something like
that? But for the time being, I simply went with the obvious name,
seg2. Maybe I'll come up with something more convincing in
the future.
Oh, but wait, why were we assembling all the previous undecompilable ASM
translation units in the 16-bit build part? By moving those to the 32-bit
part, we don't even need a 16-bit TASM in our list of dependencies, as
long as our build process is not fully 16-bit.
And with that, ReC98 now also builds on Windows 95, and thus, every 32-bit
Windows version. 🎉 Which is certainly the most user-visible improvement
in all of these two pushes.
Back in 2015, I already decompiled all of TH02's seg2
functions. As suggested by the Borland compiler, I tried to follow a "one
translation unit per segment" layout, bundling the binary-specific
contents via #include. In the end, it required two
translation units – and that was even after manually inserting the
original padding bytes via #pragma codestring… yuck. But it
worked, compiled, and kept the linker's job (and, by extension,
segmentation worries) to a minimum. And as long as it all matched the
original binaries, it still counted as a valid reconstruction of ZUN's
code.
However, that idea ultimately falls apart once TH03 starts mixing
undecompilable ASM code inbetween C functions. Now, we officially have no
choice but to use multiple C and ASM translation units, with maybe only
just one or two #includes in them…
…or we finally start reconstructing the actual seg2 library,
turning every sequence of related functions into its own translation unit.
This way, we can simply reuse the once-compiled .OBJ files for all the
binaries those functions appear in, without requiring that additional
layer of translation units mirroring the original segmentation.
The best example for this is
TH03's
almost undecompilable function that generates a lookup table for
horizontally flipping 8 1bpp pixels. It's part of every binary since
TH03, but only used in that game. With the previous approach, we would
have had to add 9 C translation units, which would all have just
#included that one file. Now, we simply put the .OBJ file
into the correct place on the linker command line, as soon as we can.
💡 And suddenly, the linker just inserts the correct padding bytes itself.
The most immediate gains there also happened to come from TH03. Which is
also where we did get some tiny RE% and PI% gains out of this after
all, by reverse-engineering some of its sprite blitting setup code. Sure,
I should have done even more RE here, to also cover those 5 functions at
the end of code segment #2 in TH03's MAIN.EXE that were in
front of a number of library functions I already covered in this push. But
let's leave that to an actual RE push 😛
All in all though, I was just getting started with this; the real
gains in terms of removed ASM files are still to come. But in the
meantime, the funding situation has become even better in terms of
allowing me to focus on things nobody asked for. 🙂 So here's a slightly
better idea: Instead of spending two more pushes on this, let's shoot for
TH05 MAINE.EXE position independence next. If I manage to get
it done, we'll have a 100% position-independent TH05 by the time
-Tom- finishes his MAIN.EXE PI demo, rather
than the 94% we'd get from just MAIN.EXE. That's bound to
make a much better impression on all the people who will then
(re-)discover the project.
And indeed, I got to end my vacation with a lot of image format and
blitting code, covering the final two formats, .GRC and .BOS. .GRC was
nothing noteworthy – one function for loading, one function for
byte-aligned blitting, and one function for freeing memory. That's it –
not even a unblitting function for this one. .BOS, on the other hand…
…has no generic (read: single/sane) implementation, and is only
implemented as methods of some boss entity class. And then again for
Sariel's dress and wand animations, and then again for Reimu's
animations, both of which weren't even part of these 4 pushes. Looking
forward to decompiling essentially the same algorithms all over again… And
that's how TH01 became the largest and most bloated PC-98 Touhou game. So
yeah, still not done with image formats, even at 44% RE.
This means I also had to reverse-engineer that "boss entity" class… yeah,
what else to call something a boss can have multiple of, that may or may
not be part of a larger boss sprite, may or may not be animated, and that
may or may not have an orb hitbox?
All bosses except for Kikuri share the same 5 global instances of this
class. Since renaming all these variables in ASM land is tedious anyway, I
went the extra mile and directly defined separate, meaningful names for
the entities of all bosses. These also now document the natural order in
which the bosses will ultimately be decompiled. So, unless a backer
requests anything else, this order will be:
Konngara
Sariel
Elis
Kikuri
SinGyoku
(code for regular card-flipping stages)
Mima
YuugenMagan
As everyone kind of expects from TH01 by now, this class reveals yet
another… um, unique and quirky piece of code architecture. In
addition to the position and hitbox members you'd expect from a class like
this, the game also stores the .BOS metadata – width, height, animation
frame count, and 📝 bitplane pointer slot
number – inside the same class. But if each of those still corresponds to
one individual on-screen sprite, how can YuugenMagan have 5 eye sprites,
or Kikuri have more than one soul and tear sprite? By duplicating that
metadata, of course! And copying it from one entity to another
At this point, I feel like I even have to congratulate the game for not
actually loading YuugenMagan's eye sprites 5 times. But then again, 53,760
bytes of waste would have definitely been noticeable in the DOS days.
Makes much more sense to waste that amount of space on an unused C++
exception handler, and a bunch of redundant, unoptimized blitting
functions
(Thinking about it, YuugenMagan fits this entire system perfectly. And
together with its position in the game's code – last to be decompiled
means first on the linker command line – we might speculate that
YuugenMagan was the first boss to be programmed for TH01?)
So if a boss wants to use sprites with different sizes, there's no way
around using another entity. And that's why Girl-Elis and Bat-Elis are two
distinct entities internally, and have to manually sync their position.
Except that there's also a third one for Attacking-Girl-Elis,
because Girl-Elis has 9 frames of animation in total, and the global .BOS
bitplane pointers are divided into 4 slots of only 8 images each.
Same for SinGyoku, who is split into a sphere entity, a
person entity, and a… white flash entity for all three forms,
all at the same resolution. Or Konngara's facial expressions, which also
require two entities just for themselves.
And once you decompile all this code, you notice just how much of it the
game didn't even use. 13 of the 50 bytes of the boss entity class are
outright unused, and 10 bytes are used for a movement clamping and lock
system that would have been nice if ZUN also used it outside of
Kikuri's soul sprites. Instead, all other bosses ignore this system
completely, and just
party on
the X/Y coordinates of the boss entities directly.
As for the rendering functions, 5 out of 10 are unused. And while those
definitely make up less than half of the code, I still must have
spent at least 1 of those 4 pushes on effectively unused functionality.
Only one of these functions lends itself to some speculation. For Elis'
entrance animation, the class provides functions for wavy blitting and
unblitting, which use a separate X coordinate for every line of the
sprite. But there's also an unused and sort of broken one for unblitting
two overlapping wavy sprites, located at the same Y coordinate. This might
indicate that Elis could originally split herself into two sprites,
similar to TH04 Stage 6 Yuuka? Or it might just have been some other kind
of animation effect, who knows.
After over 3 months of TH01 progress though, it's finally time to look at
other games, to cover the rest of the crowdfunding backlog. Next up: Going
back to TH05, and getting rid of those last PI false positives. And since
I can potentially spend the next 7 weeks on almost full-time ReC98 work,
I've also re-opened the store until October!
It's vacation time! Which, for ReC98, means "relaxing by looking at
something boring and uninteresting that we'll ultimately have to cover
anyway"… like the TH01 HUD.
📝 As noted earlier, all the score, card
combo, stage, and time numbers are drawn into VRAM. Which turns TH01's HUD
rendering from the trivial, gaiji-assisted text RAM writes we see in later
games to something that, once again, requires blitting and unblitting
steps. For some reason though, everything on there is blitted to both
VRAM pages? And that's why the HUD chose to allocate a bunch of .PTN
sprite slots to store the background behind all "animated" elements at the
beginning of a 4-stage scene or boss battle… separately for every
affected 16×16 area. (Looking forward to the completely unnecessary
code in the Sariel fight that updates these slots after the backgrounds
were animated!) And without any separation into helper functions, we end
up with the same blitting calls separately copy-pasted for every single
HUD element. That's why something as seemingly trivial as this isn't even
done after 2 pushes, as we're still missing the stage timer.
Thankfully, the .PTN function signatures come with none of ZUN's little
inconsistencies, so I was able to mostly reduce this copy-pasta to a bunch
of small inline functions and macros. Those interfaces still remain a bit
annoying, though. As a 32×32 format, .PTN merely supports 16×16 sprites
with a separate bunch of functions that take an additional
quarter parameter from 0 to 3, to select one of the 4 16×16
quarters in a such a sprite…
For life and bomb counts, there was no way around VRAM though, since ZUN
wanted to use more than a single color for those. This is where we find at
least somewhat of a mildly interesting quirk in all of this: Any life
counts greater than the intended 6 will wrap into new rows, with the bombs
in the second row overlapping those excess lives. With the way the rest of
the HUD rendering works, that wrapping code code had to be explicitly
written… which means that ZUN did in fact accomodate (his own?) cheating
there.
Now, I promised image formats, and in the middle of this copy-pasta, we
did get one… sort of. MASK.GRF, the red HUD
background, is entirely handled with two small bespoke functions… and
that's all the code we have for this format. Basically, it's a variation
on the 📝 .GRZ format we've seen earlier. It
uses the exact same RLE algorithm, but only has a single byte stream for
both RLE commands and pixel data… as you would expect from an RLE format.
.GRF actually stores 4 separately encoded RLE streams, which suggests that
it was intended for full 16-color images. Unfortunately,
MASK.GRF only contains 4 copies of the same HUD background
, so no unused beta data for us there. The only
thing we could derive from 4 identical bitplanes would be that the
background was originally meant to be drawn using color #15, rather than
the red seen in the final game. Color
#15 is a stage-specific background color that would have made the
HUD blend in quite nicely – in the YuugenMagan fight, it's the changing
color of the 邪 in the background, for example. But
really, with no generic implementation of this format, that's all just
speculation.
Oh, and in case you were looking for a rip of that image:
So yeah, more of the usual TH01 code, with the usual small quirks, but
nothing all too horrible – as expected. Next up: The image formats that
didn't make it into this push.
So, let's finally look at some TH01 gameplay structures! The obvious
choices here are player shots and pellets, which are conveniently located
in the last code segment. Covering these would therefore also help in
transferring some first bits of data in REIIDEN.EXE from ASM
land to C land. (Splitting the data segment would still be quite
annoying.) Player shots are immediately at the beginning…
…but wait, these are drawn as transparent sprites loaded from .PTN files.
Guess we first have to spend a push on
📝 Part 2 of this format.
Hm, 4 functions for alpha-masked blitting and unblitting of both 16×16 and
32×32 .PTN sprites that align the X coordinate to a multiple of 8
(remember, the PC-98 uses a
planar
VRAM memory layout, where 8 pixels correspond to a byte), but only one
function that supports unaligned blitting to any X coordinate, and only
for 16×16 sprites? Which is only called twice? And doesn't come with a
corresponding unblitting function?
Yeah, "unblitting". TH01 isn't
double-buffered,
and uses the PC-98's second VRAM page exclusively to store a stage's
background and static sprites. Since the PC-98 has no hardware sprites,
all you can do is write pixels into VRAM, and any animated sprite needs to
be manually removed from VRAM at the beginning of each frame. Not using
double-buffering theoretically allows TH01 to simply copy back all 128 KB
of VRAM once per frame to do this. But that
would be pretty wasteful, so TH01 just looks at all animated sprites, and
selectively copies only their occupied pixels from the second to the first
VRAM page.
Alright, player shot class methods… oh, wait, the collision functions
directly act on the Yin-Yang Orb, so we first have to spend a push on
that one. And that's where the impression we got from the .PTN
functions is confirmed: The orb is, in fact, only ever displayed at
byte-aligned X coordinates, divisible by 8. It's only thanks to the
constant spinning that its movement appears at least somewhat
smooth.
This is purely a rendering issue; internally, its position is
tracked at pixel precision. Sadly, smooth orb rendering at any unaligned X
coordinate wouldn't be that trivial of a mod, because well, the
necessary functions for unaligned blitting and unblitting of 32×32 sprites
don't exist in TH01's code. Then again, there's so much potential for
optimization in this code, so it might be very possible to squeeze those
additional two functions into the same C++ translation unit, even without
position independence…
More importantly though, this was the right time to decompile the core
functions controlling the orb physics – probably the highlight in these
three pushes for most people.
Well, "physics". The X velocity is restricted to the 5 discrete states of
-8, -4, 0, 4, and 8, and gravity is applied by simply adding 1 to the Y
velocity every 5 frames No wonder that this can
easily lead to situations in which the orb infinitely bounces from the
ground.
At least fangame authors now have
a
reference of how ZUN did it originally, because really, this bad
approximation of physics had to have been written that way on purpose. But
hey, it uses 64-bit floating-point variables!
…sometimes at least, and quite randomly. This was also where I had to
learn about Turbo C++'s floating-point code generation, and how rigorously
it defines the order of instructions when mixing double and
float variables in arithmetic or conditional expressions.
This meant that I could only get ZUN's original instruction order by using
literal constants instead of variables, which is impossible right now
without somehow splitting the data segment. In the end, I had to resort to
spelling out ⅔ of one function, and one conditional branch of another, in
inline ASM. 😕 If ZUN had just written 16.0 instead of
16.0f there, I would have saved quite some hours of my life
trying to decompile this correctly…
To sort of make up for the slowdown in progress, here's the TH01 orb
physics debug mod I made to properly understand them. Edit
(2022-07-12): This mod is outdated,
📝 the current version is here!2020-06-13-TH01OrbPhysicsDebug.zip
To use it, simply replace REIIDEN.EXE, and run the game
in debug mode, via game d on the DOS prompt.
Its code might also serve as an example of how to achieve this sort of
thing without position independence.
Alright, now it's time for player shots though. Yeah, sure, they
don't move horizontally, so it's not too bad that those are also
always rendered at byte-aligned positions. But, uh… why does this code
only use the 16×16 alpha-masked unblitting function for decaying shots,
and just sloppily unblits an entire 16×16 square everywhere else?
The worst part though: Unblitting, moving, and rendering player shots
is done in a single function, in that order. And that's exactly where
TH01's sprite flickering comes from. Since different types of sprites are
free to overlap each other, you'd have to first unblit all types, then
move all types, and then render all types, as done in later
PC-98 Touhou games. If you do these three steps per-type instead, you
will unblit sprites of other types that have been rendered before… and
therefore end up with flicker.
Oh, and finally, ZUN also added an additional sloppy 16×16 square unblit
call if a shot collides with a pellet or a boss, for some
guaranteed flicker. Sigh.
And that's ⅓ of all ZUN code in TH01 decompiled! Next up: Pellets!
Three pushes to decompile the TH01 high score menu… because it's
completely terrible, and needlessly complicated in pretty much every
aspect:
Another, final set of differences between the REIIDEN.EXE
and FUUIN.EXE versions of the code. Which are so
insignificant that it must mean that ZUN kept this code in two
separate, manually and imperfectly synced files. The REIIDEN.EXE
version, only shown when game-overing, automatically jumps to the
enter/終 button after the 8th character was entered,
and also has a completely invisible timeout that force-enters a high score
name after 1000… key presses? Not frames? Why. Like, how do you
even realistically such a number. (Best guess: It's a hidden easter egg to
amuse players who place drinking glasses on cursor keys. Or beer bottles.)
That's all the differences that are maybe visible if you squint
hard enough. On top of that though, we got a bunch of further, minor code
organization differences that serve no purpose other than to waste
decompilation time, and certainly did their part in stretching this out to
3 pushes instead of 2.
Entered names are restricted to a set of 16-bit, full-width Shift-JIS
codepoints, yet are still accessed as 8-bit byte arrays everywhere. This
bloats both the C++ and generated ASM code with needless byte splits,
swaps, and bit shifts. Same for the route kanji. You have this 16-, heck,
even 32-bit CPU, why not use it?! (Fun fact: FUUIN.EXE is
explicitly compiled for a 80186, for the most part – unlike
REIIDEN.EXE, which does use Turbo C++'s 80386 mode.)
The sensible way of storing the current position of the alphabet
cursor would simply be two variables, indicating the logical row and
column inside the character map. When rendering, you'd then transform
these into screen space. This can keep the on-screen position constants in
a single place of code.
TH01 does the opposite: The selected character is stored directly in terms
of its on-screen position, which is then mapped back to a character
index for every processed input and the subsequent screen update. There's
no notion of a logical row or column anywhere, and consequently, the
position constants are vomited all over the code.
Which might not be as bad if the character map had a uniform
grid structure, with no gaps. But the one in TH01 looks like this:
And with no sense of abstraction anywhere, both input handling and
rendering end up with a separate if branch for at least 4 of
the 6 rows.
In the end, I just gave up with my usual redundancy reduction efforts for
this one. Anyone wanting to change TH01's high score name entering code
would be better off just rewriting the entire thing properly.
And that's all of the shared code in TH01! Both OP.EXE and
FUUIN.EXE are now only missing the actual main menu and
ending code, respectively. Next up, though: The long awaited TH01 PI push.
Which will not only deliver 100% PI for OP.EXE and
FUUIN.EXE, but also probably quite some gains in
REIIDEN.EXE. With now over 30% of the game decompiled, it's about
time we get to look at some gameplay code!
Back to TH01, and its high score menu… oh, wait, that one will eventually
involve keyboard input. And thanks to the generous TH01 funding situation,
there's really no reason not to cover that right now. After all,
TH01 is the last game where input still hadn't been RE'd.
But first, let's also cover that one unused blitting function, together
with REIIDEN.CFG loading and saving, which are in front of
the input function in OP.EXE… (By now, we all know about
the hidden start bomb configuration, right?)
Unsurprisingly, the earliest game also implements input in the messiest
way, with a different function for each of the three executables. "Because
they all react differently to keyboard inputs ",
apparently? OP.EXE even has two functions for it, one for the
START / CONTINUE / OPTION / QUIT main
menu, and one for both Option and Music Test menus, both of which directly
perform the ring arithmetic on the menu cursor variable. A consistent
separation of keyboard polling from input processing apparently wasn't all
too obvious of a thought, since it's only truly done from TH02 on.
This lack of proper architecture becomes actually hilarious once you
notice that it did in fact facilitate a recursion bug!
In case you've been living under a rock for the past 8 years, TH01 shipped
with debugging features, which you can enter by running the game via
game d from the DOS prompt. These features include a
memory info screen, shown when pressing PgUp, implemented as one blocking
function (test_mem()) called directly in response to the
pressed key inside the polling function. test_mem() only
returns once that screen is left by pressing PgDown. And in order to poll
input… it directly calls back into the same polling function that called
it in the first place, after a 3-frame delay.
Which means that this screen is actually re-entered for every 3 frames
that the PgUp key is being held. And yes, you can, of course, also
crash the system via a stack overflow this way by holding down PgUp for a
few seconds, if that's your thing. Edit (2020-09-17): Here's a video from
spaztron64, showing off this
exact stack overflow crash while running under the
VEM486
memory manager, which displays additional information about these
sorts of crashes:
What makes this even funnier is that the code actually tracks the last
state of every polled key, to prevent exactly that sort of bug. But the
copy-pasted assignment of the last input state is only done aftertest_mem() already returned, making it effectively pointless
for PgUp. It does work as intended for PgDown… and that's why you
have to actually press and release this key once for every call to
test_mem() in order to actually get back into the game. Even
though a single call to PgDown will already show the game screen
again.
In maybe more relevant news though, this function also came with what can
be considered the first piece of actual gameplay logic! Bombing via
double-tapping the Z and X keys is also handled here, and now we know that
both keys simply have to be tapped twice within a window of 20 frames.
They are tracked independently from each other, so you don't necessarily
have to press them simultaneously.
In debug mode, the bomb count tracks precisely this window of
time. That's why it only resets back to 0 when pressing Z or X if it's
≥20.
Sure, TH01's code is expectedly terrible and messy. But compared to the
micro-optimizations of TH04 and TH05, it's an absolute joy to work on, and
opening all these ZUN bug loot boxes is just the icing on the cake.
Looking forward to more of the high score menu in the next pushes!
Final TH01 RE push for the time being, and as expected, we've got the
superficially final piece of shared code between the TH01 executables.
However, just having a single implementation for loading and recreating
the REYHI*.DAT score files would have been way above ZUN's
standards of consistency. So ZUN had the unique idea to mix up the file
I/O APIs, using master.lib functions in REIIDEN.EXE, and
POSIX functions (along with error messages and disabled interrupts) in
FUUIN.EXE… Could have been worse
though, as it was possible to abstract that away quite nicely.
That code wasn't quite in the natural way of decompilation either. As it
turns out though, 📝 segment splitting isn't
so painful after all if one of the new segments only has a few functions.
Definitely going to do that more often from now on, since it allows a much
larger number of functions to be immediately decompiled. Which is always
superior to somehow transforming a function's ASM into a form that I can
confidently call "reverse-engineered", only to revisit it again later for
its decompilation.
And while I unfortunately missed 25% of total RE by a bit, this push
reached two other and perhaps even more significant milestones:
After (finally) compressing all unknown parts of the BSS segments
using arrays, the number of remaining lines in the
REIIDEN.EXE ASM dump has fallen below TASM's limit of 65,535. Which
means that we no longer need that annoying th01_reiiden_2.inc
file that everyone has forgotten about at least once.
Nope, RL has given me plenty of things to do from home after all,
so the current cap still remains an accurate representation of my free
time. 😕
For now though, we've got one more TH01 file format push, covering the
core functions for loading and displaying the 32×32 and 16×16 sprites from
the .PTN files, as announced – and probably one of the last ones for quite
a while to yield both RE and PI progress way above average. But what is
this, error return values in a ZUN game?! And actually good code
for deriving the alpha channel from the 16th color in the hardware
palette?! Sure, the rest of the code could still be improved a lot, but
that was quite a surprise, especially after the spaghetti code of
📝 the last push. That makes up for two of
the .PTN structure fields (one of them always 0, and one of them always 1)
remaining unused, and therefore unknown.
ZUN also uses the .PTN image slots to store the background of frequently
updated VRAM sections, in order to be able to repeatedly draw on top of
them – like for example the HUD area where the score and time numbers are
drawn. Future games would simply use the text RAM and gaiji for those
numbers. This would have worked just fine for TH01 too – especially since
all the functions decompiled so far align the VRAM X coordinate to the
8-pixel byte grid, which is the simplest way of accessing VRAM given the
PC-98's
planar
memory layout. Looks as if ZUN simply wasn't aware of gaiji during the
development of TH01.
This won't be the last time I cover the .PTN format, since all the
blitting functions that actually use alpha are exclusive to
REIIDEN.EXE, and currently out of decompilation reach. But after
some more long overdue cleaning work, TH01 has now passed both TH02 and
even TH04 to become the second-most reverse-engineered game in
all of ReC98, in terms of absolute numbers! 🎉
Also, PI for TH01's OP.EXE is imminent. Next up though, we've
first got the probably final double-speed push for TH01, covering the last
set of duplicated functions between the three binaries – quite fitting for
the currently last fully funded, outstanding TH01 RE push. Then, we also
might get FUUIN.EXE PI within the same push
afterwards? After that, TH01 progress will be slowing down, since
I'd then have to cover either the main menu or in-game code
or the cutscenes, depending on what the backers request. (By
default, it's going to be in-game code, of course.)
Last of the 3 weeks of almost full-time ReC98 work, supposedly the least
stressful one, and then things still get delayed thanks to illness 😕 In
better news though, it looks like I'll be able to extend these 3 weeks to
8, as my RL is shutting down for coronavirus reasons. I'm going to
wait a bit for the dust to settle before raising the crowdfunding cap
though, since RL might give me more to do from home after all. I may or
may not also get commissioned for a non-Touhou translation patch project
to be worked on in that time…
The .GRP file functions turned out to, of course, also be present in
FUUIN.EXE. In fact, that binary had the largest share of
progress in this push, since it's the only one to include another
reimplementation of master.lib-style hardware palette fading. As a typical
little ZUN inconsistency, the FUUIN.EXE version of one .GRP
palette function directly calls one of these functions.
As for the functions themselves, they basically wrap the single-function
Pi load and
display library by 電脳科学研究所/BERO in a bowl of global state
spaghetti. 🍝 At least the function names now clearly encode important
side effects like, y'know, a changed hardware palette. The reason ZUN used
this separate library over master.lib's PI loading functions was probably
its support for defining a color as transparent. This feature is used for
the red box in the main menu, and the large cyan Siddhaṃ seed syllables in
(again) the Konngara fight.
Sadly, we've already reached the end of fast triple-speed TH01 progress
with 📝 the last push, which decompiled the
last segment shared by all three of TH01's executables. There's still a
bit of double-speed progress left though, with a small number of code
segments that are shared between just two of the three executables.
At the end of the first one of these, we've got all the code for the .GRZ
format – which is yet another run-length encoded image format, but this
time storing up to 16 full 640×400 16-color images with an alpha bit. This
one is exclusively used to wastefully store Konngara's sword slash and
kuji-in kill
animations. Due to… suboptimal code organization, the code for the format
is also present in OP.EXE, despite not being used there. But
hey, that brings TH01 to over 20% in RE!
Decoupling the RLE command stream from the pixel data sounds like a nice
idea at first, allowing the format to efficiently encode a variety of
animation frames displayed all over the screen… if ZUN actually made
use of it. The RLE stream also has quite some ridiculous overhead,
starting with 1 byte to store the 1-bit command (putting a single 8×1
pixel block, or entering a run of N such blocks). Run commands then store
another 1-byte run length, which has to be followed by another
command byte to identify the run as putting N blocks, or skipping N blocks.
And the pixel data is just a sequence of these blocks for all 4 bitplanes,
in uncompressed form…
Also, have some rips of all the images this format is used for:
To make these, I just wrote a small viewer, calling the same decompiled
TH01 code: 2020-03-07-grzview.zip
Obviously, this means that it not only must to be run on a PC-98, but also
discards the alpha information.
If any backers are really interested in having a proper converter
to and from PNG, I can implement that in an upcoming push… although that
would be the perfect thing for outside contributors to do.
Next up, we got some code for the PI format… oh, wait, the actual files
are called "GRP" in TH01.
Just like most of the time, it was more sensible to cover
GENSOU.SCR, the last structure missing in TH05's
OP.EXE,
everywhere it's used, rather than just rushing out OP.EXE
position independence. I did have to look into all of the functions to
fully RE it after all, and to find out whether the unused fields actually
are unused. The only thing that kept this push from yielding even
more above-average progress was the sheer inconsistency in how the games
implemented the operations on this PC-98 equivalent of score*.dat:
OP.EXE declares two structure instances, for simultaneous
access to both Reimu and Marisa scores. TH05 with its 4 playable
characters instead uses a single one, and overwrites it successively for
each character when drawing the high score menu – meaning, you'd only see
Yuuka's scores when looking at the structure inside the rendered high
score menu. However, it still declares the TH04 "Marisa" structure as a
leftover… and also decodes it and verifies its checksum, despite
nothing being ever loaded into it
MAIN.EXE uses a separate ASM implementation of the decoding
and encoding functions
TH05's MAIN.EXE also reimplements the basic loading
functions
in ASM – without the code to regenerate GENSOU.SCR with
default data if the file is missing or corrupted. That actually makes
sense, since any regeneration is already done in OP.EXE, which
always has to load that file anyway to check how much has been cleared
However, there is a regeneration function in TH05's
MAINE.EXE… which actually generates different default
data: OP.EXE consistently sets Extra Stage records to Stage 1,
while MAINE.EXE uses the same place-based stage numbering that
both versions use for the regular ranks
Technically though, TH05's OP.EXEis
position-independent now, and the rest are (should be?
) merely false positives. However, TH04's is
still missing another structure, in addition to its false
positives. So, let's wait with the big announcement until the next push…
which will also come with a demo video of what will be possible then.
The glacial pace continues, with TH05's unnecessarily, inappropriately
micro-optimized, and hence, un-decompilable code for rendering the current
and high score, as well as the enemy health / dream / power bars. While
the latter might still pass as well-written ASM, the former goes to such
ridiculous levels that it ends up being technically buggy. If you
enjoy quality ZUN code, it's
definitely worth a read.
In TH05, this all still is at the end of code segment #1, but in TH04,
the same code lies all over the same segment. And since I really
wanted to move that code into its final form now, I finally did the
research into decompiling from anywhere else in a segment.
Turns out we actually can! It's kinda annoying, though: After splitting
the segment after the function we want to decompile, we then need to group
the two new segments back together into one "virtual segment" matching the
original one. But since all ASM in ReC98 heavily relies on being
assembled in MASM mode, we then start to suffer from MASM's group
addressing quirk. Which then forces us to manually prefix every single
function call
from inside the group
to anywhere else within the newly created segment
with the group name. It's stupidly boring busywork, because of all the
function calls you mustn't prefix. Special tooling might make this
easier, but I don't have it, and I'm not getting crowdfunded for it.
So while you now definitely can request any specific thing in any
of the 5 games to be decompiled right now, it will take slightly
longer, and cost slightly more.
(Except for that one big segment in TH04, of course.)
Only one function away from the TH05 shot type control functions now!
Sometimes, "strategically picking things to reverse-engineer" unfortunately also means "having to move seemingly random and utterly uninteresting stuff, which will only make sense later, out of the way". Really, this was so boring. Gonna get a lot more exciting in the next ones though.
… yeah, no, we won't get very far without figuring out these drawing routines.
Which process data that comes from the .STD files.
Which has various arrays related to the background… including one to specify the scrolling speed. And wait, setting that to 0 actually is what starts a boss battle?
So, have a TH05 Boss Rush patch: 2018-12-26-TH05BossRush.zip
Theoretically, this should have also worked for TH04, but for some reason,
the Stage 3 boss gets stuck on the first phase if we do this?
While we're waiting for Bruno to release the next thcrap build with ANM header patching, here are the resulting commits of the ReC98 CDG/CD2 special offer purchased by DTM, reverse-engineering all code that covers these formats.