Technical debt, part 5… and we only got TH05's stupidly optimized
.PI functions this time?
As far as actual progress is concerned, that is. In maintenance news
though, I was really hyped for the #include improvements I've
mentioned in 📝 the last post. The result: A
new x86real.h file, bundling all the declarations specific to
the 16-bit x86 Real Mode in a smaller file than Turbo C++'s own
DOS.H. After all, DOS is something else than the underlying
CPU. And while it didn't speed up build times quite as much as I had hoped,
it now clearly indicates the x86-specific parts of PC-98 Touhou code to
future port authors.
After another couple of improvements to parameter declaration in ASM land,
we get to TH05's .PI functions… and really, why did ZUN write all of
them in ASM? Why (re)declare all the necessary structures and data in
ASM land, when all these functions are merely one layer of abstraction
above master.lib, which does all the actual work?
I get that ZUN might have wanted masked blitting to be faster, which is
used for the fade-in effect seen during TH05's main menu animation and the
ending artwork. But, uh… he knew how to modify master.lib. In fact, he
did already modify the graph_pack_put_8() function
used for rendering a single .PI image row, to ignore master.lib's VRAM
clipping region. For this effect though, he first blits each row regularly
to the invisible 400th row of VRAM, and then does an EGC-accelerated
VRAM-to-VRAM blit of that row to its actual target position with the mask
enabled. It would have been way more efficient to add another version of
this function that takes a mask pattern. No amount of REP
MOVSW is going to change the fact that two VRAM writes per line are
slower than a single one. Not to mention that it doesn't justify writing
every other .PI function in ASM to go along with it…
This is where we also find the most hilarious aspect about this: For most
of ZUN's pointless micro-optimizations, you could have maybe made the
argument that they do save some CPU cycles here and there, and
therefore did something positive to the final, PC-98-exclusive result. But
some of the hand-written ASM here doesn't even constitute a
micro-optimization, because it's worse than what you would have got
out of even Turbo C++ 4.0J with its 80386 optimization flags!
At least it was possible to "decompile" 6 out of the 10 functions
here, making them easy to clean up for future modders and port authors.
Could have been 7 functions if I also decided to "decompile"
pi_free(), but all the C++ code is already surrounded by ASM,
resulting in 2 ASM translation units and 2 C++ translation units.
pi_free() would have needed a single translation unit by
itself, which wasn't worth it, given that I would have had to spell out
every single ASM instruction anyway.
There you go. What about this needed to be written in ASM?!?
The function calls between these small translation units even seemed to
glitch out TASM and the linker in the end, leading to one CALL
offset being weirdly shifted by 32 bytes. Usually, TLINK reports a fixup
overflow error when this happens, but this time it didn't, for some reason?
Mirroring the segment grouping in the affected translation unit did solve
the problem, and I already knew this, but only thought of it after spending
quite some RTFM time… during which I discovered the -lE
switch, which enables TLINK to use the expanded dictionaries in
Borland's .OBJ and .LIB files to speed up linking. That shaved off roughly
another second from the build time of the complete ReC98 repository. The
more you know… Binary blobs compiled with non-Borland tools would be the
only reason not to use this flag.
So, even more slowdown with this 5th dedicated push, since we've still only
repaid 41% of the technical debt in the SHARED segment so far.
Next up: Part 6, which hopefully manages to decompile the FM and SSG
channel animations in TH05's Music Room, and hopefully ends up being the
final one of the slow ones.
Alright, tooling and technical debt. Shouldn't be really much to talk
about… oh, wait, this is still ReC98
For the tooling part, I finished up the remaining ergonomics and error
handling for the
📝 sprite converter that Jonathan Campbell contributed two months ago.
While I familiarized myself with the tool, I've actually ran into some
unreported errors myself, so this was sort of important to me. Still got
no command-line help in there, but the error messages can now do that job
probably even better, since we would have had to write them anyway.
So, what's up with the technical debt then? Well, by now we've accumulated
quite a number of 📝 ASM code slices that
need to be either decompiled or clearly marked as undecompilable. Since we
define those slices as "already reverse-engineered", that decision won't
affect the numbers on the front page at all. But for a complete
decompilation, we'd still have to do this someday. So, rather than
incorporating this work into pushes that were purchased with the
expectation of measurable progress in a certain area, let's take the
"anything goes" pushes, and focus entirely on that during them.
The second code segment seemed like the best place to start with this,
since it affects the largest number of games simultaneously. Starting with
TH02, this segment contains a set of random "core" functions needed by the
binary. Image formats, sounds, input, math, it's all there in some
capacity. You could maybe call it all "libzun" or something like
that? But for the time being, I simply went with the obvious name,
seg2. Maybe I'll come up with something more convincing in
the future.
Oh, but wait, why were we assembling all the previous undecompilable ASM
translation units in the 16-bit build part? By moving those to the 32-bit
part, we don't even need a 16-bit TASM in our list of dependencies, as
long as our build process is not fully 16-bit.
And with that, ReC98 now also builds on Windows 95, and thus, every 32-bit
Windows version. 🎉 Which is certainly the most user-visible improvement
in all of these two pushes.
Back in 2015, I already decompiled all of TH02's seg2
functions. As suggested by the Borland compiler, I tried to follow a "one
translation unit per segment" layout, bundling the binary-specific
contents via #include. In the end, it required two
translation units – and that was even after manually inserting the
original padding bytes via #pragma codestring… yuck. But it
worked, compiled, and kept the linker's job (and, by extension,
segmentation worries) to a minimum. And as long as it all matched the
original binaries, it still counted as a valid reconstruction of ZUN's
code.
However, that idea ultimately falls apart once TH03 starts mixing
undecompilable ASM code inbetween C functions. Now, we officially have no
choice but to use multiple C and ASM translation units, with maybe only
just one or two #includes in them…
…or we finally start reconstructing the actual seg2 library,
turning every sequence of related functions into its own translation unit.
This way, we can simply reuse the once-compiled .OBJ files for all the
binaries those functions appear in, without requiring that additional
layer of translation units mirroring the original segmentation.
The best example for this is
TH03's
almost undecompilable function that generates a lookup table for
horizontally flipping 8 1bpp pixels. It's part of every binary since
TH03, but only used in that game. With the previous approach, we would
have had to add 9 C translation units, which would all have just
#included that one file. Now, we simply put the .OBJ file
into the correct place on the linker command line, as soon as we can.
💡 And suddenly, the linker just inserts the correct padding bytes itself.
The most immediate gains there also happened to come from TH03. Which is
also where we did get some tiny RE% and PI% gains out of this after
all, by reverse-engineering some of its sprite blitting setup code. Sure,
I should have done even more RE here, to also cover those 5 functions at
the end of code segment #2 in TH03's MAIN.EXE that were in
front of a number of library functions I already covered in this push. But
let's leave that to an actual RE push 😛
All in all though, I was just getting started with this; the real
gains in terms of removed ASM files are still to come. But in the
meantime, the funding situation has become even better in terms of
allowing me to focus on things nobody asked for. 🙂 So here's a slightly
better idea: Instead of spending two more pushes on this, let's shoot for
TH05 MAINE.EXE position independence next. If I manage to get
it done, we'll have a 100% position-independent TH05 by the time
-Tom- finishes his MAIN.EXE PI demo, rather
than the 94% we'd get from just MAIN.EXE. That's bound to
make a much better impression on all the people who will then
(re-)discover the project.