- 📝 Posted:
- 🚚 Summary of:
- P0002, P0003, P0004, P0281, P0282, P0283, P0284, P0285
- ⌨ Commits:
87eed57...f131878
,f131878...d60fb3e
,b86a2b1...62bd5b1
, (MS-DOS Player)07d1088...P0281
,d60fb3e...056085b
,056085b...b86a2b1
,62bd5b1...18b9cd7
,18b9cd7...23fc9e7
- 💰 Funded by:
- GhostPhanom, [Anonymous], Blue Bolt, Yanga
- 🏷 Tags:
I'm 13 days late, but 🎉 ReC98 is now 10 years old! 🎉 On June 26, 2014, I first tried exporting IDA's disassembly of TH05's OP.EXE
and reassembling and linking the resulting file back into a binary, and was amazed that it actually yielded an identical binary. Now, this doesn't actually mean that I've spent 10 years working on this project; priorities have been shifting and continue to shift, and time-consuming mistakes were certainly made. Still, it's a good occasion to finally fully realize the good future for ReC98 that GhostPhanom invested in with the very first financial contribution back in 2018, deliver the last three of the first four reserved pushes, cross another piece of time-consuming maintenance off the list, and prepare the build process for hopefully the next 10 years.
But why did it take 8 pushes and over two months to restore feature parity with the old system? 🥲
- The previous build system(s)
- Migrating the 16-bit build part to Tup
- Optimizing MS-DOS Player
- Continued support for building on 32-bit Windows
- The new tier list of supported build platforms
- Cleaning up
#include
lists - TH02's High Score menu
The original plan for ReC98's good future was quite different from what I ended up shipping here. Before I started writing the code for this website in August 2019, I focused on feature-completing the experimental 16-bit DOS build system for Borland compilers that I'd been developing since 2018, and which would form the foundation of my internal development work in the following years. Eventually, I wanted to polish and publicly release this system as soon as people stopped throwing money at me. But as of November 2019, just one month after launch, the store kept selling out with everyone investing into all the flashier goals, so that release never happened.
The main idea behind the system still has its charm: Your build script is a regular C++ program that #include
s the build system as a static library and passes fixed structures with names of source files and build flags. By employing static constructors, even a 1994 Turbo C++ would let you define the whole build at compile time, although this certainly requires some dank preprocessor magic to remain anywhere near readable at ReC98 scale. 🪄 While this system does require a bootstrapping process, the resulting binary can then use the same dependency-checking mechanisms to recompile and overwrite itself if you change the C++ build code later. Since DOS just simply loads an entire binary into RAM before executing it, there is no lock to worry about, and overwriting the originating binary is something you can just do.
Later on, the system also made use of batched compilation: By passing more than one source file to TCC.EXE
, you get to avoid TCC's quite noticeable startup times, thus speeding up the build proportional to the number of translation units in each batch. Of course, this requires that every passed source file is supposed to be compiled with the same set of command-line flags, but that's a generally good complexity-reducing guideline to follow in a build script. I went even further and enforced this guideline in the system itself, thus truly making per-file compiler command line switches considered harmful. Thanks to Turbo C++'s #pragma option
, changing the command line isn't even necessary for the few unfortunate cases where parts of ZUN's code were compiled with inconsistent flags.
I combined all these ideas with a general approach of "targeting DOSBox": By maximizing DOS syscalls and minimizing algorithms and data structures, we spend as much time as possible in DOSBox's native-code DOS implementation, which should give us a performance advantage over DOS-native implementations of MAKE that typically follow the opposite approach.
Of course, all this only matters if the system is correct and reliable at its core. Tup teaches us that it's fundamentally impossible to have a reliable generic build system without
- augmenting the build graph with all actual files read and written by each invoked build tool, which involves tracing all file-related syscalls, and
- persistently serializing the full build graph every time the system runs, allowing later runs to detect every possible kind of change in the build script and rebuild or clean up accordingly.
Unfortunately, the design limitations of my system only allowed half-baked attempts at solving both of these prerequisites:
- If your build system is not supposed to be generic and only intended to work with specific tools that emit reliable dependency information, you can replace syscall tracing with a parser for those specific formats. This is what my build system was doing, reading dependency information out of each .OBJ file's OMF COMENT record.
- Since DOS command lines are limited to 127 bytes, DOS compilers support reading additional arguments from response files, typically indicated with an
@
next to their path on the command line. If we now put every parameter passed to TCC or TLINK into a response file and leave these files on disk afterward, we've effectively serialized all command-line arguments of the entire build into a makeshift database. In later builds, the system can then detect changed command-line arguments by comparing the existing response files from the previous run with the new contents it would write based on the current build structures. This way, we still only recompile the parts of the codebase that are affected by the changed arguments, which is fundamentally impossible with Makefiles.
But this strategy only covers changes within each binary's compile or link arguments, and ignores the required deletions in "the database" when removing binaries between build runs. This is a non-issue as long as we keep decompiling on master
, but as soon as we switch between master
and similarly old commits on the debloated
/anniversary
branches, we can get very confusing errors:
Apparently, there's also such a thing as "too much batching", because TCC would suddenly stop applying certain compiler optimizations at very specific places if too many files were compiled within a single process? At least you quickly remember which source files you then need to manually touch and recompile to make the binaries match ZUN's original ones again…
But the final nail in the coffin was something I'd notice on every single build: 5 years down the line, even the performance argument wasn't convincing anymore. The strategy of minimizing emulated code still left me with an 𝑂(𝑛) algorithm, and with this entire thing still being single-threaded, there was no force to counteract the dependency check times as they grew linearly with the number of source files.
At P0280, each build run would perform a total of 28,130 file-related DOS syscalls to figure out which source files have changed and need to be rebuilt. At some point, this was bound to become noticeable even despite these syscalls being native, not to mention that they're still surrounded by emulator code that must convert their parameters and results to and from the DOS ABI. And with the increasing delays before TCC would do its actual work, the entire thing started feeling increasingly jankier.
While this system was waiting to be eventually finished, the public master
branch kept using the Makefile that dates back to early 2015. Back then, it didn't take long for me to abandon raw dumb batch files because Make was simply the most straightforward way of ensuring that the build process would abort on the first compile error.
The following years also proved that Makefile syntax is quite well-suited for expressing the build rules of a codebase at this scale. The built-in support for automatically turning long commands into response files was especially helpful because of how naturally it works together with batched compilation. Both of these advantages culminate in this wonderfully arcane incantation of ASCII special characters and syntactically significant linebreaks:
Which translates to "take the filenames of all dependents of this explicit rule, write them into a temporary file with an autogenerated name, insert this filename into the tcc … @
command line, and delete the file after the command finished executing". The @
is part of TCC's command-line interface, the rest is all MAKE syntax.
But 📝 as we all know by now, these surface-level niceties change nothing about Makefiles inherently being unreliable trash due to implementing none of the aforementioned two essential properties of a generic build system. Borland got so close to a correct and reliable implementation of autodependencies, but that would have just covered one of the two properties. Due to this unreliability, the old build16b.bat
called Borland's MAKER.EXE
with the -B
flag, recompiling everything all the time. Not only did this leave modders with a much worse build process than I was using internally, but it also eventually got old for me to merge my internal branch onto master
before every delivery. Let's finally rectify that and work towards a single good build process for everyone.
As you would expect by now, I've once again migrated to Tup's Lua syntax. Rewriting it all makes you realize once again how complex the PC-98 Touhou build process is: It has to cover 2 programming languages, 2 pipeline steps, and 3 third-party libraries, and currently generates a total of 39 executables, including the small programs I wrote for research. The final Lua code comprises over 1,300 lines – but then again, if I had written it in 📝 Zig, it would certainly be as long or even longer due to manual memory management. The Tup building blocks I constructed for Shuusou Gyoku quickly turned out to be the wrong abstraction for a project that has no debug builds, but their 📝 basic idea of a branching tree of command-line options remained at the foundation of this script as well.
This rewrite also provided an excellent opportunity for finally dumping all the intermediate compilation outputs into a separate dedicated obj/
subdirectory, finally leaving bin/
nice and clean with only the final executables. I've also merged this new system into most of the public branches of the GitHub repo.
As soon as I first tried to build it all though, I was greeted with a particularly nasty Tup bug. Due to how DOS specified file metadata mutation, MS-DOS Player has to open every file in a way that current Tup treats as a write access… but since unannotated file writes introduce the risk of a malformed build graph if these files are read by another build command later on, Tup providently deletes these files after the command finished executing. And by these files, I mean TCC.EXE
as well as every one of its C library header files opened during compilation.
Due to a minor unsolved question about a failing test case, my fix has not been merged yet. But even if it was, we're now faced with a problem: If you previously chose to set up Tup for ReC98 or 📝 Shuusou Gyoku and are maybe still running 📝 my 32-bit build from September 2020, running the new build.bat
would in fact delete the most important files of your Turbo C++ 4.0J installation, forcing you to reinstall it or restore it from a backup. So what do we do?
- Should my custom build get a special version number so that the surrounding batch file can fail if the version number of your installed Tup is lower?
- Or do I just put a message somewhere, which some people invariably won't read?
The easiest solution, however, is to just put a fixed Tup binary directly into the ReC98 repo. This not only allows me to make Tup mandatory for 64-bit builds, but also cuts out one step in the build environment setup that at least one person previously complained about. *nix users might not like this idea all too much (or do they?), but then again, TASM32 and the Windows-exclusive MS-DOS Player require Wine anyway. Running Tup through Wine as well means that there's only one PATH
to worry about, and you get to take advantage of the tool checks in the surrounding batch file.
If you're one of those people who doesn't trust binaries in Git repos, the repo also links to instructions for building this binary yourself. Replicating this specific optimized binary is slightly more involved than the classic ./configure && make && make install trinity, so having these instructions is a good idea regardless of the fact that Tup's GPL license requires it.
One particularly interesting aspect of the Lua code is the way it handles sprite dependencies:
If build commands read from files that were created by other build commands, Tup requires these input dependencies to be spelled out so that it can arrange the build graph and parallelize the build correctly. We could simply put every sprite into a single array and automatically pass that as an extra input to every source file, but that would effectively split the build into a "sprite convert" and "code compile" phase. Spelling out every individual dependency allows such source files to be compiled as soon as possible, before (and in parallel to) the rest of the sprites they don't depend on. Similarly, code files without sprite dependencies can compile before the first sprite got converted, or even before the sprite converter itself got compiled and linked, maximizing the throughput of the overall build process.
Running a 30-year-old DOS toolchain in a parallel build system also introduces new issues, though. The easiest and recommended way of compiling and linking a program in Turbo C++ is a single tcc
invocation:
This performs a batched compilation of main.cpp
and utils.cpp
within a single TCC process, and then launches TLINK to link the resulting .obj
files into main.exe
, together with the C++ runtime library and any needed objects from master.lib
. The linking step works by TCC generating a TLINK command line and writing it into a response file with the fixed name turboc.$ln
… which obviously can't work in a parallel build where multiple TCC processes will want to link different executables via the same response file.
Therefore, we have to launch TLINK with a custom response file ourselves. This file is echo
'd as a separate parallel build rule, and the Lua code that constructs its contents has to replicate TCC's logic for picking the correct C++ runtime .lib
file for the selected memory model.
While this does add more string formatting logic, not relying on TCC to launch TLINK actually removes the one possible PATH
-related error case I previously documented in the README. Back in 2021 when I first stumbled over the issue, it took a few hours of RE to figure this out. I don't like these hours to go to waste, so here's a Gist, and here's the text replicated for SEO reasons:
Issue: TCC compiles, but fails to link, with Unable to execute command 'tlink.exe'
Cause: This happens when invoking TCC as a compiler+linker, without the
-c
flag. To locate TLINK, TCC needlessly copies thePATH
environment variable into a statically allocated 128-byte buffer. It then constructs absolutetlink.exe
filenames for each of the semicolon- or\0
-terminated paths, writing these into a buffer that immediately follows the 128-bytePATH
buffer in memory. The search is finished as soon as TCC finds an existing file, which gives precedence to earlier paths in thePATH
. If the search didn't complete until a potential "final" path that runs past the 128 bytes, the final attempted filename will consist of the part that still managed to fit into the buffer, followed by the previously attempted path.Workaround: Make sure that the
BIN\
path to Turbo C++ is fully contained within the first 127 bytes of thePATH
inside your DOS system. (The 128th byte must either be a separating;
or the terminating\0
of thePATH
string.)
Now that DOS emulation is an integral component of the single-part build process, it even makes sense to compile our pipeline tools as 16-bit DOS executables and then emulate them as part of the build. Sure, it's technically slower, but realistically it doesn't matter: Our only current pipeline tools are 📝 the converter for hardcoded sprites and the 📝 ZUN.COM
generators, both of which involve very little code and are rarely run during regular development after the initial full build. In return, we get to drop that awkward dependency on the separate Borland C++ 5.5 compiler for Windows and yet another additional manual setup step. 🗑️ Once PC-98 Touhou becomes portable, we're probably going to require a modern compiler anyway, so you can now delete that one as well.
That gives us perfect dependency tracking and minimal parallel rebuilds across the whole codebase! While MS-DOS Player is noticeably slower than DOSBox-X, it's not going to matter all too much; unless you change one of the more central header files, you're rarely if ever going to cause a full rebuild. Then again, given that I'm going to use this setup for at least a couple of years, it's worth taking a closer look at why exactly the compilation performance is so underwhelming …
On the surface, MS-DOS Player seems like the right tool for our job, with a lot of advantages over DOSBox:
- It doesn't spawn a window that boots an entire emulated PC, but is instead
- perfectly integrated into the Windows console. Using it in a modern developer console would allow you to click on a compile error and have your editor immediately open the relevant file and jump to that specific line! With DOSBox, this basic comfort feature was previously unthinkable.
- Heck, Takeda Toshiya originally developed it to run the equally vintage LSI C-86 compiler on 64-bit Windows. Fixing any potential issues we'd run into would be well within the scope of the project.
- It consists of just a single comparatively small binary that we could just drop into the ReC98 repo. No manual setup steps required.
But once I began integrating it, I quickly noticed two glaring flaws:
Back in 2009, Takeda Toshiya chose to start the project by writing a custom DOS implementation from scratch. He was aware of DOSBox, but only adapted small tricky parts of its source code rather than starting with the DOSBox codebase and ripping out everything he didn't need. This matches the more research-oriented nature that all of his projects appear to follow, where the primary goal of writing the code is a personal understanding of the problem domain rather than a widely usable piece of software. MS-DOS Player is even the outlier in this regard, with Takeda Toshiya describing it as
珍しく実用的かもしれません
. I am definitely sympathetic to this mindset; heck, my old internal build system falls under this category too, being so specialized and narrow that it made little sense to use it outside of ReC98. But when you apply it to emulators for niche systems, you end up with exactly the current PC-98 emulation scene, where there's no single universally good emulator because all of them have some inaccuracy somewhere. This scene is too small for you not to eventually become part of someone else's supply chain… 🥲
Emulating DOS is a particularly poor fit for a research/NIH project because it's Hyrum's Law incarnate. With the lack of memory protection in Real Mode, programs could freely access internal DOS (and even BIOS) data structures if they only knew where to look, and frequently did. It might look as if "DOS command-line tools" just equals x86 plusINT 21h
, but soon you'll also be emulating the BIOS, PIC, PIT, EMS, XMS, and probably a few more things, all with their individual quirks that some application out there relies on. DOSBox simply had much more time to grow and mature and figure out all of these details by trial and error. If you start a DOS emulator from scratch, you're bound to duplicate all this research as people want to use your emulator to run more and more programs, until you've ended up with what's effectively a clone of DOSBox's exact logic. Unless, of course, if you draw a line somewhere and limit the scope of the DOS and BIOS emulation. But given how many people have wanted to use MS-DOS Player for running DOS TUIs in arbitrarily sized terminal windows with arbitrary fonts, that's not what happened. I guess it made sense for this use case before DOSBox-X gained a TTF output mode in late 2020?
As usual, I wouldn't mention this if I didn't run into two bugs when combining MS-DOS Player with Turbo C++ and Tup. Both of these originated from workarounds for inaccuracies in the DOS emulation that date back to MS-DOS Player's initial release and were thankfully no longer necessary with the accuracy improvements implemented in the years since.For CPU emulation, MS-DOS Player can use either MAME's or Neko Project 21/W's x86 core, both of which are interpreters and won't win any performance contests. The NP21/W core is significantly better optimized and runs ≈41% faster, but still pales in comparison to DOSBox-X's dynamic recompiler. Running the same sequential commands that the P0280 Makefile would execute, the upstream 2024-03-02 NP21/W core build of MS-DOS Player would take to compile the entire ReC98 codebase on my system, whereas DOSBox-X's dynamic core manages the same in , or 94% faster.
Granted, even the DOSBox-X performance is much slower than we would like it to be. Most of it can be blamed on the awkward time in the early-to-mid-90s when Turbo C++ 4.0J came out. This was the time when DOS applications had long grown past the limitations of the x86 Real Mode and required DOS extenders or even sillier hacks to actually use all the RAM in a typical system of that period, but Win32 didn't exist yet to put developers out of this misery. As such, this compiler not only requires at least a 386 CPU, but also brings its own DOS extender (DPMI16BI.OVL
) plus a loader for said extender (RTM.EXE
), both of which need to be emulated alongside the compiler, to the great annoyance of emulator maintainers 30 years later. Even MS-DOS Player's README file notes how Protected Mode adds a lot of complexity and slowdown:
8086 binaries are much faster than 80286/80386/80486/Pentium4/IA32 binaries. If you don't need the protected mode or new mnemonics added after 80286, I recommend i86_x86 or i86_x64 binary.
The immediate reaction to these performance numbers is obvious: Let's just put DOSBox-X's dynamic recompiler into MS-DOS Player, right?!
🙌 Except that once you look at DOSBox-X, you immediately get why Takeda Toshiya might have preferred to start from scratch. Its codebase is a historically grown tangled mess, requiring intimate familiarity and a significant engineering effort to isolate the dynamic core in the first place. I did spend a few days trying to untangle and copy it all over into MS-DOS Player… only to be greeted with an infinite loop as soon as everything compiled for the first time. 😶 Yeah, no, that's bound to turn into a budget-exceeding maintenance nightmare.
Instead, let's look at squeezing at least some additional performance out of what we already have. A generic emulator for the entire CISCy instruction set of the 80386, with complete support for Protected Mode, but it's only supposed to run the subset of instructions and features used by a specific compiler and linker as fast as possible… wait a moment, that sounds like a use case for profile-guided optimization! This is the first time I've encountered a situation that would justify the required 2-phase build process and lengthy profile collection – after all, writing into some sort of database for every function call does slow down MS-DOS Player by roughly 15×. However, profiling just the compilation of our most complex translation unit (📝 TH01 YuugenMagan) and the linking of our largest executable (TH01's REIIDEN.EXE
) should be representative enough.
I'll get to the performance numbers later, but even the build output is quite intriguing. Based on this profile, Visual Studio chooses to optimize only 104 out of MS-DOS Player's 1976 functions for speed and the rest for size, shaving off a nice 109 KiB from the binary. Presumably, keeping rare code small is also considered kind of fast these days because it takes up less space in your CPU's instruction cache once it does get executed?
With PGO as our foundation, let's run a performance profile and see if there are any further code-level optimizations worth trying out:
- Removing redundant
memset()
calls: MS-DOS Player is written in a very C-like style of C++, and initializes a bunch of its statically allocated data bymemset()
ing it with00
bytes at startup. This is strictly redundant even in C; Section 6.7.9/10 of the C standard mandates that all static data is zero-initialized by default. In turn, the program loaders of modern operating systems employ all sorts of paging tricks to reduce the CPU cost (and actual RAM usage!) of this initialization as much as possible. If you manuallymemset()
afterward, you throw all these advantages out of the window.
Of course, these calls would only ever show up among the top CPU consumers in a performance profile if a program uses a large amount of static data, but the hardcoded 32 MiB of emulated RAM in ≥i386-supporting builds definitely qualifies. Zeroing 32.8 MiB of memory makes up a significant chunk of the runtime of some of the shorter build steps and quickly adds up; a full rebuild of the ReC98 codebase currently spawns a total of 361 MS-DOS Player instances, totaling 11.5 GiB of needless memory writes. - Limiting the emulated instruction set: NP21/W's x86 core emulates everything up to the SSE3 extension from 2004, but Turbo C++ 4.0J's x86 instruction set usage doesn't stretch past the 386. It doesn't even need the x87 FPU for compiling code that involves floating-point constants. Disabling all these unneeded extensions speeds up x86's infamously annoying instruction decoding, and also reduces the size of the MS-DOS Player binary by another 149.5 KiB. The source code already had macros for this purpose, and only needed a slight fix for the code to compile with these macros disabled.
- Removing x86 paging: Borland's DOS extender uses segmented memory addressing even in Protected Mode. This allows us to remove the MMU emulation and the corresponding "are we paging" check for every memory access.
- Removing cycle counting: When emulating a whole system, counting the cycles of each instruction is important for accurately synchronizing the CPU with other pieces of hardware. As hinted above, MS-DOS Player does emulate and periodically update a few pieces of hardware outside the CPU, but we need none of them for a build tool.
- Testing Takeda Toshiya's optimizations: In a nice turn of events, Takeda Toshiya merged every single one of my bugfixes and optimization flags into his upstream codebase. He even agreed with my
memset()
and cycle counting removal optimizations, which are now part of all upstream builds as of 2024-06-24. For the 2024-06-27 build, he claims to have gone even further than my more minimal optimization, so let's see how these additional changes affect our build process. - Further risky optimizations: A lot of the remaining slowness of x86 emulation comes from the segmentation and protection fault checks required for every memory access. If we assume that the emulator only ever executes correct code, we can remove these checks and implement further shortcuts based on their absence.
TheL[DEFGS]S
group of instructions that load a segment and offset register from a 32-bitfar
pointer, for example, are both frequently used in Turbo C++ 4.0J code and particularly expensive to emulate. Intel specified their Real Mode operation as loading the segment and offset part in two separate 16-bit reads. But if we assume that neither of those reads can fault, we can compress them into a single 32-bit read and thus only perform the costly address translation once rather than twice. Emulator authors are probably rolling their eyes at this gross violation of Intel documentation now, but it's at least worth a try to see just how much performance we could get out of it.
So, what do we get?
MS-DOS Player build | Full build (Pipeline + 5 games + research code) | Median translation unit + median link | 📝 YuugenMagan compile + link | |||
---|---|---|---|---|---|---|
Generic | PGO | Generic | PGO | Generic | PGO | |
MAME x86 core | 46.522s / 50.854s | 32.162s / 34.885s | 1.346s / 1.429s | 0.966s / 0.963s | 6.975s / 7.155s | 4.024s / 3.981s |
NP21/W core, before optimizations |
34.620s / 36.151s | 30.218s / 31.318s | 1.031s / 1.065s | 0.885s / 0.916s | 5.294s / 5.330s | 4.260s / 4.299s |
No initial memset() |
31.886s / 34.398s | 27.151s / 29.184s | 0.945s / 1.009s | 0.802s / 0.852s | 5.094s / 5.266s | 4.104s / 4.190s |
Limited instructions | 32.404s / 34.276s | 26.602s / 27.833s | 0.963s / 1.001s | 0.783s / 0.819s | 5.086s / 5.182s | 3.886s / 3.987s |
No paging | 29.836s / 31.646s | 25.124s / 26.356s | 0.865s / 0.918s | 0.748s / 0.769s | 4.611s / 4.717s | 3.500s / 3.572s |
No cycle counting | 25.407s / 26.691s | 21.461s / 22.599s | 0.735s / 0.752s | 0.617s / 0.625s | 3.747s / 3.868s | 2.873s / 2.979s |
2024-06-27 build | 26.297s / 27.629s | 21.014s / 22.143s | 0.771s / 0.779s | 0.612s / 0.632s | 4.372s / 4.506s | 3.253s / 3.272s |
Risky optimizations | 23.168s / 24.193s | 20.711s / 21.782s | 0.658s / 0.663s | 0.582s / 0.603s | 3.269s / 3.414s | 2.823s / 2.805s |
#include
cleanup explained below, and the second one corresponds to this commit. All builds are 64-bit, 32-bit builds were ≈5% slower across the board. I kept the fastest run within three attempts; as Tup parallelizes the build process across all CPU cores, it's common for the long-running full build to take up to a few seconds longer depending on what else is running on your system. Tup's standard output is also redirected to a file here; its regular terminal output and nice progress bar will add more slowdown on top.
The key takeaways:
- By merely disabling certain x86 features from MS-DOS Player and retaining the accuracy of the remaining emulation, we get speedups of ≈60% (full build), ≈70% (median TU), and ≈80% (largest TU).
- ≈25% (full build), ≈29% (median TU), and ≈41% (largest TU) of this speedup came from Visual Studio's profile-guided optimization, with no changes to the MS-DOS Player codebase.
- The effects of removing cycle counting are the biggest surprise. Between ≈17% and ≈23%, just for removing one subtraction per emulated instruction? Turns out that in the absence of a "target cycle amount" setting, the x86 emulation loop previously ran for only a single cycle. This caused the PIC check to run after every instruction, followed by PIT, serial I/O, keyboard, mouse, and CRTC update code every millisecond. Without cycle counting, the x86 loop actually keeps running until a CPU exception is raised or the emulated process terminates, skipping the hardware code during the vast majority of the program's execution time.
- While Takeda Toshiya's changes in the 2024-06-27 build completely throw out the cycle counter and clean up process termination, they also reintroduce the hardware updates that made up the majority of the cycle removal speedup. This explains the results we're getting: The small speedup for full rebuilds is too insignificant to bother with and might even fall within a statistical margin of error, but the build slows down more and more the longer the emulated process runs. Compiling and linking YuugenMagan takes a whole 14% longer on generic builds, and ≈9-12% longer on PGO builds. I did another in-between test that just removed the x86 loop from the cycle removal version, and got exactly the same numbers. This just goes to show how much removing two writes to a fixed memory address per emulated instruction actually matters. Let's not merge back this one, and stay on top of 2024-06-24 for the time being.
- The risky optimizations of ignoring segment limits and speeding up 32-bit segment+offset pointer load instructions could yield a further speedup. However, most of these changes boil down to removing branches that would never be taken when emulating correct x86 code. Consequently, these branches get recorded as unlikely during PGO training, which then causes the profile-guided rebuild to rearrange the instructions on these branches in a way that favors the common case, leaving the rest of their effective removal to your CPU's branch predictor. As such, the 10%-15% speedup we can observe in generic builds collapses down to 2%-6% in PGO builds. At this rate and with these absolute durations, it's not worth it to maintain what's strictly a more inaccurate fork of Neko Project 21/W's x86 core.
- The redundant header inclusions afforded by
#include
guards do in fact have a measurable performance cost on Turbo C++ 4.0J, slowing down compile times by 5%.
But how does this compare to DOSBox-X's dynamic core? Dynamic recompilers need some kind of cache to ensure that every block of original ASM gets recompiled only once, which gives them an advantage in long-running processes after the initial warmup. As a result, DOSBox-X compiles and links YuugenMagan in , ≈92% faster than even our optimized MS-DOS Player build. That percentage resembles the slowdown we were initially getting when comparing full rebuilds between DOSBox-X and MS-DOS Player, as if we hadn't optimized anything.
On paper, this would mean that DOSBox-X barely lost any of its huge advantage when it comes to single-threaded compile+link performance. In practice, though, this metric is supposed to measure a typical decompilation or modding workflow that focuses on repeatedly editing a single file. Thus, a more appropriate comparison would also have to add the aforementioned constant 28,130 syscalls that my old build system required to detect that this is the one file/binary that needs to be recompiled/relinked. The video at the top of this blog post happens to capture the best time () I got for the detection process on DOSBox-X. This is almost as slow as the compilation and linking itself, and would have only gotten slower as we continue decompiling the rest of the games. Tup, on the other hand, performs its filesystem scan in a near-constant , matching the claim in Section 4.7 of its paper, and thus shrinking the performance difference to ≈14% after all. Sure, merging the dynamic core would have been even better (contribution-ideas, anyone?), but this is good enough for now.
Just like with Tup, I've also placed this optimized binary directly into the ReC98 repo and added the specific build instructions to the GitHub release page.
I do have more far-reaching ideas for further optimizing Neko Project 21/W's x86 core for this specific case of repeated switches between Real Mode and Protected Mode while still retaining the interpreted nature of this core, but these already strained the budget enough.
The perhaps more important remaining bottleneck, however, is hiding in the actual DOS emulation. Right now, a Tup-driven full rebuild spawns a total of 361 MS-DOS Player processes, which means that we're booting an emulated DOS 361 times. This isn't as bad as it sounds, as "booting DOS" basically just involves initializing a bunch of internal DOS structures in conventional memory to meaningful values. However, these structures also include a few environment variables like PATH
, APPEND
, or TEMP
/TMP
, which MS-DOS Player seamlessly integrates by translating them from their value on the Windows host system to the DOS 8.3 format. This could be one of the main reasons why MS-DOS Player is a native Windows program rather than being cross-platform:
- On Windows, this path translation is as simple as calling
GetShortPathNameA()
, which returns a unique 8.3 name for every component along the path. - Also, drive letters are an integral part of the DOS
INT 21h
API, and Windows still uses them as well.
However, the NT kernel doesn't actually use drive letters either, and views them as just a legacy abstraction over its reality of volume GUIDs. Converting paths back and forth between these two views therefore requires it to communicate with a
mount point manager service, which can coincidentally also be observed in debug builds of Tup.
As a result, calling any path-retrieving API is a surprisingly expensive operation on modern Windows. When running a small sprite through our 📝 sprite converter, MS-DOS Player's boot process makes up 56% of the runtime, with 64% of that boot time (or 36% of the entire runtime) being spent on path translation. The actual x86 emulation to run the program only takes up 6.5% of the runtime, with the remaining 37.5% spent on initializing the multithreaded C++ runtime.
But then again, the truly optimal solution would not involve MS-DOS Player at all. If you followed general video game hacking news in May, you'll probably remember the N64 community putting the concept of statically recompiled game ports on the map. In case you're wondering where this seemingly sudden innovation came from and whether a reverse-engineered decompilation project like ReC98 is obsolete now, I wrote a new FAQ entry about why this hype, although justified, is at least in part misguided. tl;dr: None of this can be meaningfully applied to PC-98 games at the moment.
On the other hand, recompiling our compiler would not only be a reasonable thing to attempt, but exactly the kind of problem that recompilation solves best. A 16-bit command-line tool has none of the pesky hardware factors that drag down the usefulness of recompilations when it comes to game ports, and a recompiled port could run even faster than it would on 32-bit Windows. Sure, it's not as flashy as a recompiled game, but if we got a few generous backers, it would still be a great investment into improving the state of static x86 recompilation by simply having another open-source project in that space. Not to mention that it would be a great foundation for improving Turbo C++ 4.0J's code generation and optimizations, which would allow us to simplify lots of awkward pieces of ZUN code… 🤩
That takes care of building ReC98 on 64-bit platforms, but what about the 32-bit ones we used to support? The previous split of the build process into a Tup-driven 32-bit part and a Makefile-driven 16-bit part sure was awkward and I'm glad it's gone, but it did give you the choice between 1) emulating the 16-bit part or 2) running both parts natively on 32-bit Windows. While Tup's upstream Windows builds are 64-bit-only, it made sense to 📝 compile a custom 32-bit version and thus turn any 32-bit Windows ≥Vista into the perfect build platform for ReC98. Older Windows versions that can't run Tup had to build the 32-bit part using a separately maintained dumb batch script created by tup generate
, but again, due to Make being trash, they were fully rebuilding the entire codebase every time anyway.
Driving the entire build via Tup changes all of that. Now, it makes little sense to continue using 32-bit Tup:
- We need to DLL-inject into a 64-bit MS-DOS Player. Sure, we could compile a 32-bit build of MS-DOS Player, but why would we? If we look at current market shares, nobody runs 32-bit Windows anymore, not even by accident. If you run 32-bit Windows in 2024, it's because you know what you're doing and made a conscious choice for the niche use case of natively running DOS programs. Emulating them defeats the whole point of setting up this environment to begin with.
- It would make sense if Tup could inject into DOS programs, but it can't.
- Also, as we're going to see later, requiring Windows ≥Vista goes in the opposite direction of what we want for a 32-bit build. The earlier the Windows version, the better it is at running native DOS tools.
This means that we could now only support 32-bit Windows via an even larger tup generate
d batch file. We'd have to move the MS-DOS Player prefix of the respective command lines into an environment variable to make Tup use the same rules for both itself and the batch file, but the result seems to work…
…but it's really slow, especially on Windows 9x. 🐌 If we look back at the theory behind my previous custom build system, we can already tell why: Efficiently building ReC98 requires a completely different approach depending on whether you're running a typical modern multi-core 64-bit system or a vintage single-core 32-bit system. On the former, you'd want to parallelize the slow emulation as much as you can, so you maximize the amount of TCC processes to keep all CPU cores as busy as possible. But on the latter, you'd want the exact opposite – there, the biggest annoyance is the repeated startup and shutdown of the VDM, TCC, and its DOS extender, so you want to continue batching translation units into as few TCC processes as possible.
CMake fans will probably feel vindicated now, thinking "that sounds exactly like you need a meta build system 🤪". Leaving aside the fact that the output vomited by all of CMake's Makefile generators is a disgusting monstrosity that's far removed from addressing any performance concerns, we sure could solve this problem by adding another layer of abstraction. But then, I'd have to rewrite my working Lua script into either C++ or (heaven forbid) Batch, which are the only options we'd have for bootstrapping without adding any further dependencies, and I really wouldn't want to do that. Alternatively, we could fork Tup and modify tup generate
to rewrite the low-level build rules that end up in Tup's database.
But why should we go for any of these if the Lua script already describes the build in a high-level declarative way? The most appropriate place for transforming the build rules is the Lua script itself…
… if there wasn't the slight problem of Tup forbidding file writes from Lua. 🥲 Presumably, this limitation exists because there is no way of replicating these writes in a tup generate
d dumb shell script, and it does make sense from that point of view.
But wait, printing to stdout
or stderr
works, and we always invoke Tup from a batch file anyway. You can now tell where this is going. Hey, exfiltrating commands from a build script to the build system via standard I/O streams works for Rust's Cargo too!
Just like Cargo, we want to add a sufficiently unique prefix to every line of the generated batch script to distinguish it from Tup's other output. Since Tup only reruns the Lua script – and would therefore print the batch file – if the script changed between the previous and current build run, we only want to overwrite the batch file if we got one or more lines. Getting all of this to work wasn't all too easy; we're once again entering the more awful parts of Batch syntax here, which apparently are so terrible that Wine doesn't even bother to correctly implement parts of it. 😩
Most importantly, we don't really want to redirect any of Tup's standard I/O streams. Redirecting stdout
disables console output coloring and the pretty progress bar at the bottom, and looping over stderr
instead of stdout
in Batch is incredibly awkward. Ideally, we'd run a second Tup process with a sub-command that would just evaluate the Lua script if it changed - and fortunately, tup parse
does exactly that. 😌
In the end, the optimally fast and ERRORLEVEL
-preserving solution involves two temporary files. But since creating files between two Tup runs causes it to reparse the Lua code, which would print the batch file to the unfiltered stdout
, we have to hide these temporary files from Tup by placing them into its .tup/
database directory. 🤪
On a more positive note, programmatically generating batches from single-file TCC rules turned out to be a great idea. Since the Lua code maps command-line flags to arrays of input files, it can also batch across binaries, surpassing my old system in this regard. This works especially well on the debloated
and anniversary
branches, which replace ZUN's little command-line flag inconsistencies with a single set of good optimization flags that every translation unit is compiled with.
Time to fire up some VMs then… only to see the build failing on Windows 9x with multiple unhelpful Bad command or file name errors. Clearly, the long echo
lines that write our response files run up against some length limit in command.com
and need to be split into multiple ones. Windows 9x's limit is larger than the 127 characters of DOS, that's for sure, and the exact number should just be one search away…
…except that it's not the 1024 characters recounted in a surviving newsgroup post. Sure, lines are truncated to 1023 bytes and that off-by-one error is no big deal in this context, but that's not the whole story:
Wait, what, something about /
being the SWITCHAR
? And not even just that…
And what's perhaps the worst example:
My complete set of test cases: 2024-07-09-Win9x-batch-tokenizer-tests.bat
So, time to load command.com
into DOSBox-X's debugger and step through some code. 🤷 The earliest NT-based Windows versions were ported to a variety of CPUs and therefore received the then-all-new cmd.exe
shell written in C, whereas Windows 9x's command.com
was still built on top of the dense hand-written ASM code that originated in the very first DOS versions. Fortunately though, Microsoft open-sourced one of the later DOS versions in April. This made it somewhat easier to cross-reference the disassembly even though the Windows 9x version significantly diverged in the parts we're interested in.
And indeed: After truncating to 1023 bytes and parsing out any redirectors, each line is split into tokens around whitespace and =
signs and before every occurrence of the SWITCHAR
. These tokens are written into a statically allocated 64-element array, and once the code tries to write the 65th element, we get the Bad command or file name error instead.
#
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
String
echo
-DA
1
2
3
a
/B
/C
/D
/1
a
/B
/C
/D
/2
Switch flag
🚩
🚩
🚩
🚩
🚩
🚩
🚩
🚩
command.com
's internal argument array after calling the Windows 9x equivalent of parseline
with my initial example string. Note how all the "switches" got capitalized and annotated with a flag, whereas the =
sign no longer appears in either string or flag form.
Needless to say, this makes no sense. Both DOS and Windows pass command lines as a single string to newly created processes, and since this tokenization is lossy, command.com
will just have to pass the original string anyway. If your shell wants to handle tokenization at a central place, it should happen after it decided that the command matches a builtin that can actually make use of a pointer to the resulting token array – or better yet, as the first call of each builtin's code. Doing it before is patently ridiculous.
I don't know what's worse – the fact that Windows 9x blindly grinds each batch line through this tokenizer, or the fact that no documentation of this behavior has survived on today's Internet, if any even ever existed. The closest thing I found was this page that doesn't exist anymore, and it also just contains a mere hint rather than a clear description of the issue. Even the usual Batch experts who document everything else seem to have a blind spot when it comes to this specific issue. As do emulators: DOSBox and FreeDOS only reimplement the sane DOS versions of command.com
, and Wine only reimplements cmd.exe
.
Oh well. 71 lines of Lua later, the resulting batch file does in fact work everywhere:
But wait, there's more! The codebase now compiles on all 32-bit Windows systems I've tested, and yields binaries that are equivalent to ZUN's… except on 32-bit Windows 10. 🙄 Suddenly, we're facing the exact same batched compilation bug from my custom build system again, with REIIDEN.EXE
being 16 bytes larger than it's supposed to be.
Looks like I have to look into that issue after all, but figuring out the exact cause by debugging TCC would take ages again. Thankfully, trial and error quickly revealed a functioning workaround: Separating translation unit filenames in the response file with two spaces rather than one. Really, I couldn't make this up. This is the most ridiculous workaround for a bug I've encountered in a long time.
Hopefully, you've now got the impression that supporting any kind of 32-bit Windows build is way more of a liability than an asset these days, at least for this specific project. "Real hardware", "motivating a TCC recompilation", and "not dropping previous features" really were the only reasons for putting up with the sheer jank and testing effort I had to go through. And I wouldn't even be surprised if real-hardware developers told me that the first reason doesn't actually hold up because compiling ReC98 on actual PC-98 hardware is slow enough that they'd rather compile it on their main machine and then transfer the binaries over some kind of network connection.
I guess it also made for some mildly interesting blog content, but this was definitely the last time I bothered with such a wide variety of Windows versions without being explicitly funded to do so. If I ever get to recompile TCC, it will be 64-bit only by default as well.
Instead, let's have a tier list of supported build platforms that clearly defines what I am maintaining, with just the most convincing 32-bit Windows version in Tier 1. Initially, that was supposed to be Windows 98 SE due to its superior performance, but that's just unreasonable if key parts of the OS remain undocumented and make no sense. So, XP it is.
*nix fans will probably once again be disappointed to see their preferred OS in Tier 2. But at least, all we'd need for that to move up to Tier 1 is a CI configuration, contributed either via funding me or sending a PR. (Look, even more contribution-ideas!)
Getting rid of the Wine requirement for a fully cross-platform build process wouldn't be too unrealistic either, but would require us to make a few quality decisions, as usual:
- Do we run the DOS tools by creating a cross-platform MS-DOS Player fork, or do we statically recompile them?
- Do we replace 32-bit Windows TASM with the 16-bit DOS
TASM.EXE
orTASMX.EXE
, which we then either run through our forked MS-DOS Player or recompile? This would further slow down the build and require us to get rid of these nice long non-8.3 filenames… 😕 I'd only recommend this after the looming librarization of ZUN's master.lib fork is completed. - Or do we try migrating to JWasm again? As an open-source assembler that aims for MASM compatibility, it's the closest we can get to TASM, but it's not a drop-in replacement by any means. I already tried in late 2014, but encountered too many issues and quickly abandoned the idea. Maybe it works better now that we have less ASM? In any case, this migration would only get easier the less ASM code we have remaining in the codebase as we get closer to the 100% finalization mark.
Y'know what I think would be the best idea for right now, though? Savoring this new build system and spending an extended amount of time doing actual decompilation or modding for a change.
Now that even full rebuilds are decently fast, let's make use of that productivity boost by doing some urgent and far-reaching code cleanup that touches almost every single C++ source file. The most immediately annoying quirk of this codebase was the silly way each translation unit #included the headers it needed. Many years ago, I measured that repeatedly including the same header did significantly impact Turbo C++ 4.0J's compilation times, regardless of any include guards inside. As a consequence of this discovery, I slightly overreacted and decided to just not use any include guards, ever. After all, this emulated build process is slow enough, and we don't want it to needlessly slow down even more! This way, redundantly including any file that adds more than just a few #define
macros won't even compile, throwing lots of Multiple definition errors.
Consequently, the headers themselves #included almost nothing. Starting a new translation unit therefore always involved figuring and spelling out the transitive dependencies of the headers the new unit actually wants to use, in a short trial-and-error process. While not too bad by itself, this was bound to become quite counterproductive once we get closer to porting these games: If some inlined function in a header needed access to, let's say, PC-98-specific I/O ports as an implementation detail, the header would have externalized this dependency to the top-level translation unit, which in turn made that that unit appear to contain PC-98-native code even if the unit's code itself was perfectly portable.
But once we start making some of these implicit transitive dependencies optional, it all stops being justifiable. Sometimes, a.hpp
declared things that required declarations from b.hpp
but these things are used so rarely that it didn't justify adding #include "b.hpp"
to all translation units that #include "a.hpp"
. So how about conditionally declaring these things based on previously #included headers?
Now that we've measured that the sane alternative of include guards comes with a performance cost of just 5% and we've further reduced its effective impact by parallelizing the build, it's worth it to take that cost in exchange for a tidy codebase without such surprises. From now on, every header file will #include
its own dependencies and be a valid translation unit that must compile on its own without errors. In turn, this allows us to remove at least 1,000 #include
of transitive dependencies from .cpp
files. 🗑️
However, that 5% number was only measured after I reduced these redundant #include
s to their absolute minimum. So it still makes sense to only add include guards where they are absolutely necessary – i.e., transitively dependent headers included from more than one other file – and continue to (ab)use the Multiple definition compiler errors as a way of communicating "you're probably #including too many headers, try removing a few". Certainly a less annoying error than Undefined symbol.
Since all of this went way over the 7-push mark, we've got some small bits of RE and PI work to round it all out. The .REC
loader in TH04 and TH05 is completely unremarkable, but I've got at least a bit to say about TH02's High Score menu. I already decompiled MAINE.EXE
's post-Staff Roll variant in 2015, so we were only missing the almost identical MAIN.EXE
variant shown after a Game Over or when quitting out of the game. The two variants are similar enough that it mostly needed just a small bit of work to bring my old 2015 code up to current standards, and allowed me to quickly push TH02 over the 40% RE mark.
Functionally, the two variants only differ in two assignments, but ZUN once again chose to copy-paste the entire code to handle them. This was one of ZUN's better copy-pasting jobs though – and honestly, I can't even imagine how you would mess up a menu that's entirely rendered on the PC-98's text RAM. It almost makes you wonder whether ZUN actually used the same #if ENDING
preprocessor branching that my decompilation uses… until the visual inconsistencies in the alignment of the place numbers and the and labels clearly give it away as copy-pasted:
Next up: Starting the big Seihou summer! Fortunately, waiting two more months was worth it: In mid-June, Microsoft released a preview version of Visual Studio that, in response to my bug report, finally, finally makes C++ standard library modules fully usable. Let's clean up that codebase for real, and put this game into a window.