📝 The first part of this blitting performance series dealt with single-color transparent 8-wide and 16-wide sprites blitted to non-byte-aligned X positions, whose performance was the primary goal for the first release of the TH01 Anniversary Edition. Now that we're all about menus and cutscenes, we're dealing with the exact opposite kind of image: No transparency, always displayed at byte-aligned X positions, and much wider. The PC-98 doesn't offer any hardware support here, so this is purely about finding the fastest way to memcpy() from RAM to VRAM. We could just take Intel's CPU reference manuals, count cycles, pick the fastest method, and call it a day, but those manuals can't know about the PC-98's infamous hefty VRAM latencies or any potentially relevant differences between real hardware and emulators.
This time, I'll also provide the ASM code for each technique. All of the following code snippets show a single row of the dedicated blitter for 64-pixel-wide sprites, with the following register allocation:
DS:SI points to the source sprite data in conventional RAM
ES:DI points into VRAM
CX contains the stride of the source image
Displaced
Let's start with the method that is supposedly optimal on i486 CPUs, at least according to Intel's reference manual. x86's addressing modes have always supported displacements for shifting the pointed-to address by a constant value, at the natural cost of a single code byte. This is the typical way you'd address structure fields from a pointer to an instance. In the context of blitting, this technique analogously treats the pixel chunks as fields within a "VRAM row" structure, and only updates the pointers at the end of each row to jump to the next one. Since all kinds of memory-accessing MOV instructions take just one cycle on an i486, this approach at least shouldn't add any CPU overhead on top of the unavoidable VRAM latencies.
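Following that description and the register allocation above, a row of such a blitter boils down to this:

; Row loop body
mov eax, [si]      ; Pixels 0-31
mov es:[di], eax
mov eax, [si+4]    ; Pixels 32-63
mov es:[di+4], eax
; Move to next row
add si, cx
add di, (640 / 8)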
Marching
These displacements sure are a nice feature of x86, but how big is their effect on performance really? Asked probably no one ever, considering that the alternative of constantly moving both pointers would add two more instructions that cost 6 bytes per 32-bit chunk, as well as >4 more CPU cycles on ≤386, not including instruction fetching costs as usual. I still had a column to fill in my benchmark output though, so why not fill it with a really bad approach that can demonstrate the impact of unoptimized code. Or lack thereof 👀
; We march by 64 pixels in every row.
sub cx, (64 / 8)
; Row loop body
mov eax, [si] ; Pixels 0-31
mov es:[di], eax
add si, 4
add di, 4
mov eax, [si] ; Pixels 32-63
mov es:[di], eax
add si, 4
add di, 4
; Move to next row
add si, cx
add di, ((640 - 64) / 8)
MOVS
Maybe, however, the marching approach isn't all that bad if we can express it more succinctly. In fact, the four-instruction MOV/MOV/ADD/ADD sequence does the same as the single MOVS instruction, and MOVS doesn't even require a temporary register! At one byte per 16 pixels, this would by far be the most compact method for shorter sprites. Our 2023 benchmark showed MOVS to be the preferred approach for 8- and 16-wide sprites on 286 and 386 CPUs, which matches Intel's documentation. Usually, complex instructions like these are discouraged on later microarchitectures as they tend to get slower and slower, but maybe VRAM is even slower and it doesn't actually matter?
; We march by 64 pixels in every row.
sub cx, (64 / 8)
cld ; Don't go backwards!
; Row loop body
movsd ; Pixels 0-31
movsd ; Pixels 32-63
add si, cx ; Move to next row
add di, ((640 - 64) / 8)
REP MOVS
Thanks to the REP prefix, we can compress such a series of MOVS instructions down even further and let the CPU handle the repetition. This is the preferred way you'd implement a memcpy() in hand-written x86 ASM if you're going for code clarity, but does it also perform well? Given the startup cost of REP, this will probably only ever be faster than MOVS past a certain sprite width. But it's almost certain to be the optimal approach for 640-wide sprites: Once the distance between the end of the previous row and the start of a new row becomes zero, we no longer need a vertical loop either, and can multiply the height into the repetition count to blit a whole bitplane in a single instruction.
; We might be blitting a smaller area out of a larger sprite.
mov ax, cx ; The REP prefix needs CX itself
sub ax, (64 / 8)
cld ; Don't go backwards!
; Row loop body
mov cx, (64 / 32) ; 64 pixels = 2 DWORDs
rep movsd
; Move to next row
add si, ax
add di, ((640 - 64) / 8)
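And for the 640-pixel case mentioned above, the row loop disappears entirely. Assuming the image height arrives in DX, a whole bitplane could be blitted with something like:

mov ax, (640 / 32) ; DWORDs per row
mul dx             ; DX:AX = DWORDs for DX rows (clobbers DX)
mov cx, ax         ; 640×400 pixels fit comfortably into 16-bit CX
cld                ; Don't go backwards!
rep movsd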
Since we've thoroughly tested the GRCG against 4-plane blitting performance 📝 last time, we no longer need to check both. Instead, we can use the screen space for testing all possible image widths at 32-pixel intervals, using a self-made 640-wide dithered gradient:
This recording in Neko Project 21/W at 2.4576 × 27 = 66.3552 MHz might even be representative?
But of course, the only relevant results are those from real hardware:
| Width | Displaced | Marching | MOVS | REP MOVS |
| ----- | --------- | -------- | ---- | -------- |
| 32 | 1.47 | 1.56 | 1.34 | 1.35 |
| 64 | 2.28 | 2.49 | 2.04 | 2.09 |
| 96 | 3.16 | 3.51 | 2.84 | 2.84 |
| 128 | 4.01 | 4.46 | 3.58 | 3.54 |
| 160 | 4.92 | 5.52 | 4.39 | 4.34 |
| 192 | 5.74 | 6.49 | 5.10 | 5.04 |
| 224 | 6.72 | 7.58 | 5.78 | 5.80 |
| 256 | 7.61 | 8.50 | 6.73 | 6.46 |
| 288 | 8.35 | 9.55 | 7.44 | 7.18 |
| 320 | 9.36 | 10.51 | 8.18 | 7.89 |
| 352 | 10.26 | 11.45 | 8.93 | 8.62 |
| 384 | 11.14 | 12.48 | 9.68 | 9.30 |
| 416 | 12.03 | 13.43 | 10.17 | 10.01 |
| 448 | 12.85 | 14.34 | 11.13 | 10.72 |
| 480 | 13.71 | 15.36 | 11.83 | 11.43 |
| 512 | 14.53 | 16.23 | 12.62 | 12.13 |
| 544 | 15.51 | 17.23 | 13.31 | 12.84 |
| 576 | 16.24 | 18.25 | 14.08 | 13.54 |
| 608 | 17.10 | 19.19 | 14.57 | 14.29 |
| 640 | 18.09 | 20.04 | 15.44 | 14.67 |
| Width | Displaced | Marching | MOVS | REP MOVS |
| ----- | --------- | -------- | ---- | -------- |
| 32 | 0.74 | 0.76 | 0.68 | 0.67 |
| 64 | 1.06 | 1.21 | 0.98 | 1.08 |
| 96 | 1.44 | 1.64 | 1.28 | 1.41 |
| 128 | 1.78 | 2.06 | 1.58 | 1.71 |
| 160 | 2.19 | 2.51 | 1.91 | 2.03 |
| 192 | 2.52 | 2.96 | 2.22 | 2.33 |
| 224 | 3.00 | 3.47 | 2.58 | 2.74 |
| 256 | 3.31 | 3.88 | 2.88 | 3.00 |
| 288 | 3.67 | 4.29 | 3.18 | 3.31 |
| 320 | 4.00 | 4.75 | 3.50 | 3.60 |
| 352 | 4.38 | 5.14 | 3.75 | 3.89 |
| 384 | 4.74 | 5.56 | 4.05 | 4.20 |
| 416 | 5.07 | 6.00 | 4.35 | 4.49 |
| 448 | 5.43 | 6.41 | 4.64 | 4.76 |
| 480 | 5.74 | 6.83 | 4.94 | 5.07 |
| 512 | 6.13 | 7.25 | 5.25 | 5.38 |
| 544 | 6.50 | 7.69 | 5.50 | 5.66 |
| 576 | 6.82 | 8.10 | 5.81 | 5.96 |
| 608 | 7.18 | 8.50 | 6.11 | 6.25 |
| 640 | 7.50 | 8.88 | 6.32 | 6.16 |
| Width | Displaced | Marching | MOVS | REP MOVS |
| ----- | --------- | -------- | ---- | -------- |
| 32 | 0.16 | 0.18 | 0.15 | 0.15 |
| 64 | 0.25 | 0.28 | 0.22 | 0.25 |
| 96 | 0.33 | 0.38 | 0.30 | 0.33 |
| 128 | 0.42 | 0.49 | 0.37 | 0.40 |
| 160 | 0.51 | 0.59 | 0.45 | 0.48 |
| 192 | 0.60 | 0.70 | 0.52 | 0.55 |
| 224 | 0.70 | 0.82 | 0.61 | 0.64 |
| 256 | 0.78 | 0.92 | 0.68 | 0.71 |
| 288 | 0.87 | 1.02 | 0.75 | 0.78 |
| 320 | 0.95 | 1.13 | 0.82 | 0.85 |
| 352 | 1.04 | 1.22 | 0.89 | 0.92 |
| 384 | 1.13 | 1.33 | 0.96 | 1.00 |
| 416 | 1.20 | 1.43 | 1.03 | 1.06 |
| 448 | 1.29 | 1.53 | 1.10 | 1.13 |
| 480 | 1.38 | 1.63 | 1.17 | 1.21 |
| 512 | 1.46 | 1.73 | 1.25 | 1.28 |
| 544 | 1.54 | 1.83 | 1.31 | 1.34 |
| 576 | 1.63 | 1.94 | 1.38 | 1.42 |
| 608 | 1.71 | 2.03 | 1.46 | 1.49 |
| 640 | 1.79 | 2.13 | 1.52 | 1.47 |
Number of frames required to blit Width×5120 pixels of a 1bpp bitmap to a byte-aligned position in VRAM, using the GRCG. The plots are relative to the respective REP MOVS result. Thanks to 1️⃣ オップナー2608 for the real-hardware test.
Wait, only one run from a real system that's way below the minimum requirements of PC-98 Touhou? Furball is currently waiting for 286 and 486 hardware and will test those models once the shipment arrives. But all other test results I received came from Pentium-era or later models, and had identical performance for every blitting technique, including the terrible marching one! Turns out that the superscalar microarchitecture of these CPUs erases any difference between blitting methods, and does so independently of clock speed. My testers and I replicated this behavior on
a PC-9821Na7 (clocked at 75 MHz),
my own PC-9821Nw133 (clocked at 133 MHz),
an AMD K6-III (clocked at 400 MHz), and
an AMD K6-2 (clocked at 533 MHz).
Thus, if your minimum target is a Pentium and you've got any kind of large image to blit to a byte-aligned position: Go with REP MOVS or even just memcpy(), enjoy having one generic blitter for every sprite width, and don't waste any further thoughts on the issue.
But since we have to make a choice, we might as well optimize for the one target with stable performance characteristics that would be used by the most people: the i386 emulated by Neko Project, which prefers MOVS for all widths except 640 pixels.
Deploying these blitting functions outside of a benchmark quickly runs into a practical problem. MOVS requires us to stay with the original plan of generating a dedicated blitter function for every possible byte-aligned sprite width. But this would mean that we'd unconditionally generate (640/8) = 80 blitter functions and link them into every game. Most of these will not only never be used and just bloat the binaries, but also needlessly increase compile times due to how they are currently stitched together out of an unholy mess of macros and force-inlined recursive functions. I'll probably fully drop down to ASM and one translation unit per blitter in the next iteration of this system; Turbo C++ 4.0J still generates several unnecessary instructions around Duff's Device in particular. Now that I've constructed all these test cases, I really get how one could lose themselves in these tiny micro-optimizations. But I doubt that ZUN had the tools to measure the actual impact of what he did.
For now though, an opt-in model would be the next obvious step: Start with an empty blit function array, and let each game request and generate blitters for all the sizes it needs. But that just replaces the bloat with manual tedium: For deduplication reasons, this instantiation must happen in a dedicated (and obviously PC-98-exclusive) translation unit, far away from the actual blitting call in the (future cross-platform) translation unit that requires each size. Maybe this is practicable for a single game, but it gets really dumb if you have five games whose blitting widths are kind of similar except for the places where they aren't.
Also, the games will now crash if they ever try to blit at a width we didn't generate. Which in turn either requires a bloat-free way of reporting this error, or living with the annoyance of these mysterious crashes during development, because all the text areas in menus sure come in a wide variety of widths…
In the end, we'd like to have some kind of generic and width-independent blitter after all. Execution speed doesn't matter all too much for these menu use cases, which only call their blitters whenever any pixels are supposed to change. REP MOVSD would be the preferred generic method in such a blitter, and this choice is even well-informed now that the benchmarks have confirmed it to be at least the second-fastest option everywhere. Then, we only need two more conditional branches per line to cover the remaining pixels in case the width doesn't cleanly divide by 32:
; BH: Sprite width in bytes
; BL: 32-pixel chunks per row (BH / 4)
; AX: Bytes between end of blitted width
; and start of next row (= stride - BH)
xor cx, cx
; Row loop body
mov cl, bl
rep movsd
@@missing_16_or_24_pixels?:
test bh, 2
jz @@missing_8_or_24_pixels?
movsw
@@missing_8_or_24_pixels?:
test bh, 1
jz @@next_row
movsb
@@next_row:
add si, ax
mov cl, bh ; Roll back DI to
sub di, cx ; start of row, and
add di, (640 / 8) ; jump to the next.
The generic byte-aligned blitter, supporting any byte-aligned width from 8 to 2,040 pixels (the limit of the 8-bit BH register).
And since we still have this whole on-demand blitter generation framework for specific widths, games can still opt into faster blitters to accelerate commonly used sprite widths. TH01, for example, definitely gets back its fast 8- and 16-wide blitters for byte-aligned and pre-shifted pellets.
If that was the only kind of image blitting in menus and cutscenes, we'd be done with benchmarking now. But PC-98 Touhou also has one special effect we need to look at, which also happens to hold the most interesting unresolved performance question…
Masked crossfading
This effect made its debut in TH03, where ZUN used it for a small number of the 320×200 picture transitions in the Stage 8/9 cutscenes and endings. In TH04 and TH05, it then became the default for most of these picture transitions. Its most prominent appearance, though, can be found at the end of TH05's title screen animation, where ZUN applies this effect to a full 640×400 image. Here's how this looks at 66 MHz on Neko Project 21/W, in all of its sluggish glory:
As you might have guessed, this effect works by only blitting certain pixels of the target image, as selected by a mask, on top of the current contents of VRAM. Each of these fade animations iterates over four fixed 4×4 masks that gradually show more pixels of the target image, before blitting it without a mask:
The white/1 bits represent the new pixels of the destination image.
On a system with packed 8-bit colors and a fast CPU, you could naively implement this effect by dropping down to drawing one pixel at a time and simply skipping over any pixel whose corresponding mask bit is not set. In a planar environment, however, you'll always have to perform a read-modify-write operation on a larger chunk of pixels. Wikipedia has a nice visualization of the general algorithm, which translates naturally to the planar VRAM of the PC-98:
Read the source pixels from a VRAM bitplane and the target pixels from an image buffer in RAM
AND the VRAM bits with the negated bits from the current mask row to erase the bits that we want to overwrite
AND the RAM bits with the bits from the current mask row to erase the bits we don't want to overwrite
OR the two pixel chunks together and write the result back to VRAM
Once again, there are four ways in which we could implement this on a PC-98. The code examples assume 64-wide sprites once again:
CPU only
Let's start with the most naive approach that ignores all of the PC-98's blitter chips. Since we've decided on targeting Neko Project, we'd like to use x86 string instructions where possible, but we unfortunately can't do a direct copy due to the mask we have to apply to the source and destination pixels. But we can at least use LODSD for getting the source pixels into EAX while advancing the source pointer, and use displaced instructions to avoid marching the destination register.
If we go for the least amount of instructions, we end up with the following algorithm:
; EAX: Source pixels
; EBX: Source mask (1 = blit pixel)
; ECX: Destination mask (0 = blit pixel, i.e. ~EBX)
; [si_skip]: (image stride - (64 / 8))
; Row loop body
lodsd ; EAX = next 32 pixels at DS:[SI], then SI += 4
and eax, ebx ; Clear out source bits we don't want to change
and es:[di], ecx ; Clear out destination bits we want to change
or es:[di], eax ; Pixels 0-31
lodsd ; Repeat with a destination displacement for
and eax, ebx ; pixels 32-63
and es:[di+4], ecx
or es:[di+4], eax
; Move to the next row
add si, [bp-si_skip]
add di, (640 / 8)
DX is already used for the image height, in case you're wondering about that variable on the stack.
However, this specific algorithm would turn out to be the worst thing we could possibly do. On paper, these memory-mutating AND and OR instructions might only cost 3 cycles on a 486, but both of them must transfer bits from VRAM to the CPU and back. That's two loads and two stores in a situation where we only really need one of each.
We could go for minimal VRAM accesses instead of blindly keeping instruction counts low, but then we'd end up with twice as many instructions:
; EAX: Source pixels
; EBX: Source mask (1 = blit pixel)
; ECX: VRAM pixels
; [si_skip]: (image stride - (64 / 8))
; Row loop body
lodsd ; EAX = next 32 pixels at DS:[SI], then SI += 4
and eax, ebx ; Clear out source bits we don't want to change
not ebx ; Turn EBX into the destination mask
mov ecx, es:[di] ; VRAM read
and ecx, ebx ; Clear out destination bits we want to change
or eax, ecx ; Combine source and VRAM
stosd ; VRAM write; DI += 4
not ebx ; Turn EBX back into the source mask
lodsd ; Repeat for pixels 32-63
and eax, ebx
not ebx
mov ecx, es:[di]
and ecx, ebx
or eax, ecx
stosd
not ebx
; Move to the next row
add si, [bp-si_skip]
add di, ((640 - 64) / 8)
But sure enough, despite the doubled instruction count, this version runs almost twice as fast as my initial attempt above, across all systems we've tried!
I still left a few micro-optimizations on the table with this one, though. Storing both masks on the stack might be faster than flipping EBX back and forth, and we should probably read from the per-row mask array through the FS or GS segment registers if we compile for ≥386 CPUs. But that shouldn't matter too much in view of all the more promising methods we've still got to try.
GRCG + CPU
The simpler one of the PC-98's blitter chips is best used as part of a two-pass approach: First, we use the GRCG's RMW mode and a tile register filled with 0 bits to perform the erase step across the entire area of the image. The erase loop uses a simple REP STOS per image line and probably doesn't need any further optimizing attention if the REP MOVS result above is any indication. The second pass then only needs a mere three instructions per 32 pixels to perform the missing OR of the masked pixels:
; EAX: Source pixels
; EBX: Source mask (1 = blit pixel)
; CX: Free for mask retrieval!
; [si_skip]: (image stride - (64 / 8))
; Row loop body
lodsd ; EAX = next 32 pixels at DS:[SI], then SI += 4
and eax, ebx ; Clear out source bits we don't want to change
or es:[di], eax ; Pixels 0-31
lodsd ; Repeat with a destination displacement for
and eax, ebx ; pixels 32-63
or es:[di+4], eax
; Move to the next row
add si, [bp-si_skip]
add di, (640 / 8)
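For completeness, the erase pass that precedes this loop could be sketched as follows. The GRCG's mode register at port 0x7C and tile register at port 0x7E are standard PC-98 fare, but take the details as an illustration rather than the exact code:

; One-time setup: activate the GRCG in RMW mode,
; with all tile registers set to 0
mov al, 0C0h       ; GRCG on + RMW mode
out 7Ch, al
xor al, al
out 7Eh, al        ; Tile register, plane 0
out 7Eh, al        ; Plane 1
out 7Eh, al        ; Plane 2
out 7Eh, al        ; Plane 3
; Row loop body: AX holds this row's repeated destination mask
; (1 = erase). Every written 1 bit is replaced with the 0 bits
; from the tile registers, on all 4 bitplanes at once.
mov cx, (64 / 16)  ; Words per row
rep stosw
add di, ((640 - 64) / 8)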
But that looks eerily similar to the initial CPU attempt we scrapped earlier. The combination of a GRCG run in Read-Modify-Write mode and the OR instruction means that we're back to two read-modify-write operations per pixel, one more than we actually need. Therefore, this approach can only outperform our fast CPU algorithm if the GRCG run manages to be faster than the 5×4 extra instructions we need to perform the erase step on the CPU.
And that's indeed the best use we can get out of the GRCG for this effect. The GRCG is hardwired for blitting a rather static 8-pixel tile with a more variable mask provided by the CPU, but this effect calls for the exact opposite, a static mask and variable pixels. If we wanted to draw the entire effect using the GRCG, we'd have to send our source pixels through slow I/O ports 8 pixels at a time, which already wastes all the added throughput we could get from a 32-bit and even 16-bit CPU. Meanwhile, the mask register on the CPU – the thing we can easily change – only changes on every line of pixels.
The initial blit of the Siddhaṃ seed syllables at the beginning of TH01's Konngara fight roughly demonstrates how slow such an approach would be. BERO's PiLoad library implements transparent blitting in a similar way, sending each decoded byte to the GRCG and then blitting it with a mask:
All of these images would decompress within 6 frames on Neko Project 21/W at 66 MHz, but PiLoad's transparent blitting algorithm easily doubles that. That's what, ~36,750 cycles for every single line of the image?
The resulting rolling effect is another great example of CPU-speed-dependent lag. Everything here is supposed to happen within a single logical frame, and only doesn't because the system is too slow. A faster PC-98 would blit the image more quickly and thus spend fewer frames, and ports will display all four syllables instantly.
Fortunately, the PC-98 also has a second blitter chip that is much more suited to what we're trying to do here, …
EGC (ZUN's Version)
…and it also happens to be the one ZUN used in his original code. But isn't the EGC infamous for only supporting block copies if the source pixels are already in VRAM? How would we get them there in a setting where we show a full-screen 640×400 image and don't hide anything using 📝 black cells on the text layer?
The answer can be found by looking at the PC-98's memory map. Each bitplane requires the physical existence of (640/8 × 400) = 32,000 bytes of memory. But that's a rather odd number for a memory map, which typically prefers sizes that are powers of 2. And so, NEC not only rounded up the area in the memory map, but also the size of physically available VRAM, giving us 32,768 bytes per bitplane and page. That's an extra 768×4 bytes of memory that isn't shown on screen but still counts as VRAM and can be accessed by the EGC.
This allowed ZUN to go for a line-by-line approach using the EGC's mask register (0x4A8):
Regularly blit the source pixels of the current line to the beginning of each bitplane's offscreen section
Turn on the EGC, configure it for a regular VRAM-to-VRAM copy, and set the mask register depending on the current row
EGC-blit the line to its intended position
Turn off the EGC
Repeat for every line of the image
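With the mask register at 0x4A8 mentioned above, the mask-related part of each iteration amounts to no more than this – row_mask being a hypothetical variable holding the current row's repeated 4×4 pattern:

mov dx, 4A8h
mov ax, [row_mask] ; Hypothetical: this row's repeated mask pattern
out dx, ax         ; Set bits select the pixels the EGC-blit may touch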
From all we've learned about PC-98 hardware so far, this should still be pretty slow. While we're back to a single read-modify-write operation for the masked blit itself, the initial blit of the source pixels to the offscreen area still adds one extra VRAM write per 32 pixels, on an architecture that's already notorious for having slow VRAM. Worse, however, is the fact that configuring the EGC still involves these slow I/O port writes I've been writing about time and time again. EGC-accelerated blitting would have to be really fast to be worth the added I/O cost of this approach. 📝 I sure couldn't believe that this was actually faster when I first saw this piece of ZUN code in early 2021, but he probably was onto something here.
EGC (optimized)
But wait. Blitting a single line in each EGC run utilizes only (640 / 8) = 80 of the 768 available offscreen bytes on each bitplane. How much faster could we go if we used the entire offscreen area during each run?
After all, nothing stops us from treating invisible VRAM as regular linear memory for 4bpp pixel data, ignoring the spatial layout of the source image. For TH05's 640×400 title screen image, we can fit ⌊768 / (640/8)⌋ = 9 lines of pixels into the offscreen area, which reduces the number of EGC runs from 400 to just 45. For the crossfaded 320×200 cutscene pictures, we get ⌊768 / (320/8)⌋ = 19 lines and 11 EGC runs, wasting even less VRAM.
A visualization of the 11 EGC runs required for masked blitting of a 320×200 cutscene picture. The offscreen area of VRAM is highlighted in pink on the first frame; the white stripe on its last line represents the nonexistent memory past 32,768 bytes, since 768 bytes do not cleanly divide into 640-pixel rows.
For 320-wide images, this approach only leaves 64 pixels (or 8 bytes) of VRAM unused. Micro-optimizing those last few bytes won't matter – not only because we'd still need 11 EGC runs even if we blitted 19.2 lines in each run instead of 19, but also because the mask patterns force us to iterate over discrete rows anyway.
Aside from this major algorithmic optimization, this approach also offers a vast potential for micro-optimizations:
The EGC retains its registers across activations, so we only need to set them up once at the beginning of the image. This saves an additional 5 I/O port writes per EGC run over ZUN's version.
Using REP MOVSW for the EGC copy not only follows from our previous results, but also makes sense because we can use a segment override prefix to change MOVSW's source segment and thus make it copy entirely within the ES segment. Since we still access the pixel masks for each line through DS, the ES override removes the need for the awkward DS register juggling that ZUN had to do for each line of the image.
The linear way we access offscreen VRAM necessitates a new set of blitters that doesn't move DI to the next VRAM row at the end of an image row. However, we only actually need these blitters for their ability to skip to the next line of the source image in case we only blit a subregion whose width is smaller than the width of the image. If we blit an image's entire width, we can go down an even faster code path that blasts all lines of an EGC run to VRAM with a single REP MOVS per bitplane. As we've seen for the 640-pixel case above, this is always the optimal choice.
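The segment-override copy from the second point, for reference – MOVS always writes to ES:DI, and the override only redirects its source:

; ES:SI → cached pixels in the offscreen section
; ES:DI → their intended position on screen
; CX: Words to copy in this EGC run
rep movs word ptr es:[di], word ptr es:[si]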
So, let's construct another benchmark, do some preliminary tests on Neko Project 21/W, and…
…whoa, ZUN's EGC algorithm is bad on emulators. This benchmark even paints it somewhat favorably; for code simplicity reasons, I merely simulate it within my optimized implementation by reducing the number of blitted image rows per EGC run to 1.
Let's see how fast it all runs on real hardware. This time, the results for ≥Pentium models are interesting as well:
| Width | EGC (optimized) | EGC (ZUN) | GRCG + CPU | CPU only |
| ----- | --------------- | --------- | ---------- | -------- |
| 32 | 0.66 | 2.50 | 1.15 | 1.11 |
| 64 | 1.02 | 2.90 | 1.65 | 1.64 |
| 96 | 1.36 | 3.26 | 2.14 | 2.14 |
| 128 | 1.72 | 3.56 | 2.60 | 2.64 |
| 160 | 2.07 | 3.79 | 3.07 | 3.14 |
| 192 | 2.42 | 4.15 | 3.58 | 3.65 |
| 224 | 2.81 | 4.34 | 4.06 | 4.15 |
| 256 | 3.15 | 4.68 | 4.52 | 4.65 |
| 288 | 3.49 | 5.02 | 5.01 | 5.15 |
| 320 | 3.82 | 5.35 | 5.52 | 5.66 |
| 352 | 4.20 | 5.70 | 5.99 | 6.17 |
| 384 | 4.52 | 6.03 | 6.44 | 6.66 |
| 416 | 4.88 | 6.36 | 6.92 | 7.16 |
| 448 | 5.20 | 6.71 | 7.43 | 7.68 |
| 480 | 5.58 | 7.04 | 7.93 | 8.17 |
| 512 | 5.91 | 7.38 | 8.36 | 8.67 |
| 544 | 6.26 | 7.72 | 8.85 | 9.18 |
| 576 | 6.59 | 8.05 | 9.39 | 9.68 |
| 608 | 6.95 | 8.40 | 9.84 | 10.18 |
| 640 | 7.30 | 8.65 | 10.39 | 10.66 |
| Width | EGC (optimized) | EGC (ZUN) | GRCG + CPU | CPU only |
| ----- | --------------- | --------- | ---------- | -------- |
| 32 | 0.17 | 0.41 | 0.23 | 0.23 |
| 64 | 0.27 | 0.51 | 0.40 | 0.39 |
| 96 | 0.38 | 0.66 | 0.58 | 0.56 |
| 128 | 0.48 | 0.72 | 0.76 | 0.73 |
| 160 | 0.60 | 0.79 | 0.95 | 0.91 |
| 192 | 0.70 | 0.89 | 1.13 | 1.08 |
| 224 | 0.80 | 0.98 | 1.30 | 1.25 |
| 256 | 0.90 | 1.09 | 1.47 | 1.42 |
| 288 | 1.02 | 1.21 | 1.67 | 1.60 |
| 320 | 1.13 | 1.31 | 1.84 | 1.77 |
| 352 | 1.23 | 1.41 | 2.02 | 1.93 |
| 384 | 1.33 | 1.51 | 2.19 | 2.10 |
| 416 | 1.45 | 1.63 | 2.39 | 2.29 |
| 448 | 1.56 | 1.74 | 2.57 | 2.45 |
| 480 | 1.66 | 1.83 | 2.74 | 2.62 |
| 512 | 1.76 | 1.93 | 2.91 | 2.78 |
| 544 | 1.86 | 2.04 | 3.09 | 2.95 |
| 576 | 1.97 | 2.14 | 3.27 | 3.12 |
| 608 | 2.07 | 2.24 | 3.44 | 3.28 |
| 640 | 2.17 | 2.34 | 3.61 | 3.45 |
| Width | EGC (optimized) | EGC (ZUN) | GRCG + CPU | CPU only |
| ----- | --------------- | --------- | ---------- | -------- |
| 32 | 0.09 | 0.21 | 0.15 | 0.14 |
| 64 | 0.17 | 0.28 | 0.30 | 0.27 |
| 96 | 0.25 | 0.36 | 0.44 | 0.41 |
| 128 | 0.33 | 0.43 | 0.59 | 0.55 |
| 160 | 0.41 | 0.52 | 0.74 | 0.68 |
| 192 | 0.49 | 0.59 | 0.89 | 0.82 |
| 224 | 0.58 | 0.66 | 1.03 | 0.95 |
| 256 | 0.66 | 0.75 | 1.18 | 1.09 |
| 288 | 0.74 | 0.84 | 1.33 | 1.22 |
| 320 | 0.82 | 0.91 | 1.48 | 1.36 |
| 352 | 0.90 | 0.99 | 1.62 | 1.49 |
| 384 | 0.99 | 1.08 | 1.77 | 1.63 |
| 416 | 1.07 | 1.17 | 1.92 | 1.76 |
| 448 | 1.15 | 1.24 | 2.06 | 1.90 |
| 480 | 1.23 | 1.32 | 2.21 | 2.03 |
| 512 | 1.31 | 1.40 | 2.35 | 2.17 |
| 544 | 1.40 | 1.48 | 2.50 | 2.30 |
| 576 | 1.48 | 1.56 | 2.65 | 2.44 |
| 608 | 1.56 | 1.65 | 2.80 | 2.57 |
| 640 | 1.64 | 1.73 | 2.94 | 2.71 |
| Width | EGC (optimized) | EGC (ZUN) | GRCG + CPU | CPU only |
| ----- | --------------- | --------- | ---------- | -------- |
| 32 | 0.08 | 0.16 | 0.14 | 0.13 |
| 64 | 0.16 | 0.23 | 0.29 | 0.26 |
| 96 | 0.24 | 0.31 | 0.43 | 0.39 |
| 128 | 0.32 | 0.39 | 0.57 | 0.52 |
| 160 | 0.40 | 0.46 | 0.72 | 0.65 |
| 192 | 0.48 | 0.54 | 0.86 | 0.78 |
| 224 | 0.56 | 0.62 | 1.00 | 0.92 |
| 256 | 0.64 | 0.70 | 1.15 | 1.05 |
| 288 | 0.71 | 0.78 | 1.29 | 1.18 |
| 320 | 0.79 | 0.86 | 1.43 | 1.31 |
| 352 | 0.87 | 0.93 | 1.58 | 1.44 |
| 384 | 0.95 | 1.02 | 1.72 | 1.57 |
| 416 | 1.03 | 1.09 | 1.86 | 1.70 |
| 448 | 1.11 | 1.17 | 2.01 | 1.83 |
| 480 | 1.19 | 1.25 | 2.15 | 1.96 |
| 512 | 1.27 | 1.33 | 2.29 | 2.09 |
| 544 | 1.35 | 1.41 | 2.44 | 2.22 |
| 576 | 1.43 | 1.49 | 2.58 | 2.35 |
| 608 | 1.50 | 1.56 | 2.72 | 2.48 |
| 640 | 1.58 | 1.64 | 2.87 | 2.61 |
| Width | EGC (optimized) | EGC (ZUN) | GRCG + CPU | CPU only |
| ----- | --------------- | --------- | ---------- | -------- |
| 32 | 0.10 | 0.24 | 0.17 | 0.16 |
| 64 | 0.19 | 0.33 | 0.34 | 0.31 |
| 96 | 0.29 | 0.40 | 0.50 | 0.46 |
| 128 | 0.38 | 0.48 | 0.67 | 0.61 |
| 160 | 0.47 | 0.57 | 0.83 | 0.77 |
| 192 | 0.56 | 0.66 | 1.00 | 0.92 |
| 224 | 0.65 | 0.75 | 1.16 | 1.07 |
| 256 | 0.75 | 0.84 | 1.33 | 1.22 |
| 288 | 0.84 | 0.93 | 1.50 | 1.38 |
| 320 | 0.93 | 1.02 | 1.66 | 1.53 |
| 352 | 1.02 | 1.11 | 1.83 | 1.69 |
| 384 | 1.11 | 1.21 | 1.99 | 1.83 |
| 416 | 1.21 | 1.30 | 2.16 | 1.99 |
| 448 | 1.30 | 1.39 | 2.33 | 2.14 |
| 480 | 1.39 | 1.48 | 2.49 | 2.29 |
| 512 | 1.48 | 1.57 | 2.66 | 2.45 |
| 544 | 1.57 | 1.66 | 2.82 | 2.60 |
| 576 | 1.67 | 1.76 | 2.99 | 2.75 |
| 608 | 1.76 | 1.84 | 3.15 | 2.90 |
| 640 | 1.85 | 1.94 | 3.32 | 3.06 |
| Width | EGC (optimized) | EGC (ZUN) | GRCG + CPU | CPU only |
| ----- | --------------- | --------- | ---------- | -------- |
| 32 | 0.08 | 0.44 | 0.20 | 0.18 |
| 64 | 0.11 | 0.47 | 0.25 | 0.24 |
| 96 | 0.15 | 0.51 | 0.30 | 0.30 |
| 128 | 0.18 | 0.54 | 0.36 | 0.36 |
| 160 | 0.22 | 0.56 | 0.41 | 0.42 |
| 192 | 0.25 | 0.59 | 0.46 | 0.48 |
| 224 | 0.30 | 0.60 | 0.52 | 0.54 |
| 256 | 0.33 | 0.64 | 0.57 | 0.60 |
| 288 | 0.37 | 0.67 | 0.62 | 0.66 |
| 320 | 0.40 | 0.71 | 0.68 | 0.72 |
| 352 | 0.44 | 0.74 | 0.73 | 0.78 |
| 384 | 0.47 | 0.77 | 0.78 | 0.84 |
| 416 | 0.51 | 0.81 | 0.83 | 0.90 |
| 448 | 0.54 | 0.84 | 0.89 | 0.96 |
| 480 | 0.57 | 0.88 | 0.94 | 1.02 |
| 512 | 0.61 | 0.91 | 0.99 | 1.08 |
| 544 | 0.64 | 0.94 | 1.04 | 1.14 |
| 576 | 0.68 | 0.98 | 1.09 | 1.20 |
| 608 | 0.71 | 1.01 | 1.15 | 1.26 |
| 640 | 0.74 | 1.03 | 1.20 | 1.32 |
Number of frames required to crossfade Width×400 pixels onto all four planes of VRAM. Thanks to 1️⃣ オップナー2608, 2️⃣ Furball, 4️⃣ Will.Broke.It, and 5️⃣ spaztron64 for the real-hardware tests.
Now that's what we want to see! A clear winner across all relevant generations of PC-98 hardware. Activating the EGC really is only costly on paper; it's worth it even if you just need to perform a single tall RMW blit across all bitplanes.
Most importantly, though: This optimization was crucial for emulators, and absolutely worth it. At Neko Project's emulated 66 MHz, ZUN's code took consistently longer than one frame to crossfade 640×400 pixels onto VRAM. Hence, TH05's title screen animation always effectively rounded up that logical frame to 2 displayed frames. While this is still valid as per 📝 frame-perfection rule #2, it still counts as unintentional slowdown that we've now removed.
In return, ZUN also made the right call by not bothering about this particular optimization. As image widths and CPU speeds increase, minimizing the number of EGC runs has increasingly diminishing returns.
These results might even suggest that we should just formalize a 2-frame delay because clearly, no real-hardware model could ever overcome the VRAM latencies and actually crossfade a single 640×400 image within less than one frame. But again, such a 2-frame delay would be no more or less correct than the 3-frame delay on the PC-9821Na7, or the probably even higher delays on a 486 model. When in doubt, we just go by what ZUN wrote. My allegiance lies with code, not with Japan's e-waste.
Finally, we can also see how upgrading the CPU of your real-hardware PC-98 does nothing to work around VRAM latencies – i.e., the one bottleneck that actually matters for PC-98 Touhou. You'd basically just burn all those additional CPU cycles on waiting for VRAM. Such a CPU upgrade might even be counter-productive, as we can see with the 533 MHz CPU getting outperformed by not only the 400 MHz AMD K6-III, but even a 133 MHz Pentium, due to what spaztron64 suspects to be a southbridge issue.
Future work
But let's think ahead for a moment about what this means for the games as a whole. If the EGC is that good at this task compared to the CPU, how good could it possibly be for 16-color sprites with an alpha plane? master.lib's super_*() functions blit these using GRCG + CPU techniques, which decisively lost this benchmark on ≥Pentium hardware and weren't that good on Neko Project either. In contrast, the EGC has a lot going for it, and not just because of these benchmark numbers:
Since we never cut any of these super_*() sprites into subregions, we can always blast sprite data into VRAM with a single REP MOVSD instruction per bitplane. The VRAM representation as a contiguous 1D strip of bytes matches the way sprites are stored in RAM.
If we treat the 768×4 offscreen bytes as a more sophisticated kind of sprite cache, we might be able to further reduce those initial VRAM writes.
(Probably not, though: There's a good chance that this would actually slow down the game due to the overhead of managing these very few cache bytes. At first, this idea might seem perfect for directional bullets: Since these typically come in multiples and at different angles, couldn't we just blit the 512×4 bytes for 📝 all 16 directional variants to the offscreen area once and then blit every bullet of that type at every possible direction without refilling the cache? However, the original game always renders 16×16 bullets in array order, without grouping them by type first. We can't add grouping ourselves because it would violate our 📝 frame-perfection rules: If the game fires different 16×16 bullet types simultaneously, such as in these three patterns in the Mai & Yuki fight, we might change the rendering order and thus end up with a visible difference if two bullets of different types overlap. And since there's no bug here, I can't even justify changing this for the Anniversary Editions.)
If this frees up enough performance, it might not just help with achieving 0% slowdown on real or emulated PC-98 systems. Maybe the games would then even run decently well on real 386 models, which might even shift the entire marketplace for PC-98 hardware in response? Or maybe, we could even have a high-res mod of TH03 that removes 📝 SPRITE16 and its sprite area to run the in-game portion at twice its original vertical resolution with more detailed sprites, but without increasing real-hardware system requirements all too much? Looking forward to writing and running that benchmark in another 2 years or so…
For now though, that's enough research into blitting performance. Now that we know the optimal way of getting big images from RAM to VRAM in any situation, how do we get them from disk to RAM in the first place? Also, wasn't that animation much slower than the 2 frames per fade step that we would expect from the benchmark results? What's going on there?
Optimizing .PI image handling
Uncompressed 640×400 images are rather large, common hard disk capacities were just starting to hit single-digit GB ranges in the late 90s, and you wouldn't want to haul even more floppy disks to Comiket. So it made sense for ZUN to store these images in compressed form. Enter PI, the second-most common losslessly compressed image format within the PC-98 scene of the 90s. Technically, it's not that complex, but it still achieves great compression ratios for ZUN's images:
                       TH02       TH03       TH04       TH05       Total
Uncompressed      2,195,904  4,874,240  4,815,360  5,905,920  17,791,424
.PI                 290,921    807,690    765,909    822,993   2,687,513
oxipng -o max -Z    254,074    660,020    635,537    617,115   2,166,746
                       TH02      TH03      TH04      TH05      Total
Uncompressed       100.00 %  100.00 %  100.00 %  100.00 %  100.00 %
.PI                 13.25 %   16.57 %   15.91 %   13.94 %   14.91 %
oxipng -o max -Z    11.57 %   13.54 %   13.20 %   10.45 %   12.19 %
However, these compression ratios come at an annoying price for the PC-98 in particular. The algorithm naturally decompresses images into a packed representation with color values from 0 to 15 inclusive, not into the planar 4×1bpp format required by the PC-98's 16-color graphics mode. It makes a lot of sense that PI is designed around packed pixels:
PIC, PI's bigger sibling, originated on the Sharp X68000, whose graphics RAM uses a packed format as well. (By the way, that platform would also be the technically most interesting porting target for PC-98 Touhou, but that's a story for another day.)
If a bitplane-centric compressor only looks at one bitplane at a time and doesn't correlate its bits with the other three planes, it's bound to compress the same shapes and image features up to 4 times, depending on how many palette index bits are involved in creating them. Optimally compressing individual bitplanes therefore turns into an exercise of isolating these features within as few bitplanes as possible, which doesn't always work for every image. This could work for standalone images where the encoder is free to optimize the palette and image bits. But game assets must typically conform to a predefined palette because of other sprites or even game code that already references certain colors within that palette.
By focusing on packed colors, PI gets to implement a move-to-front transform that easily manages a similar kind of compression regardless of specific bit values. If the next color to be encoded has been recently seen next to the previously encoded color, that next color can then be encoded in just 2 bits rather than 4. Combine this delta encoding with a way of copying previously encoded pixel strips of arbitrary length, and you've got a great match for the often heavily dithered images you'd want to use on a PC-98.
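The move-to-front idea can be sketched in a few lines of C. This is a simplified model, not PI's actual bitstream format: the table initialization and the mapping of ranks to bit lengths are hypothetical, but the core mechanism of each color keeping a recency-ordered list of its successors is the same.

```c
#include <stdint.h>

// Each of the 16 colors keeps its own list of the colors most recently
// seen after it, most recent first. Encoding a color as its position
// ("rank") in that list gives frequent neighbor pairs very short codes.
typedef struct {
    uint8_t next[16][16];
} ColorTable;

static void table_init(ColorTable *t)
{
    // Hypothetical initial ordering; the real format defines its own.
    for (int prev = 0; prev < 16; prev++)
        for (int rank = 0; rank < 16; rank++)
            t->next[prev][rank] = (uint8_t)((prev + rank + 1) & 15);
}

// Returns the rank of [color] in [prev]'s list, then moves it to the
// front so that an immediate repeat encodes as rank 0 (= shortest code).
static int table_encode(ColorTable *t, int prev, uint8_t color)
{
    int rank = 0;
    while (t->next[prev][rank] != color)
        rank++;
    for (int i = rank; i > 0; i--)
        t->next[prev][i] = t->next[prev][i - 1];
    t->next[prev][0] = color;
    return rank;
}
```

A decoder runs the same table in reverse: read a rank, look up the color, move it to the front.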
But why, then, does master.lib only provide graph_pi_load_pack(), which returns the loaded image in the same packed format that the decompressor naturally produces? I've been wondering that ever since I 📝 decompiled the TH03 cutscene system, where the performance issues of this packed format became obvious. This approach only makes sense for the rare use case of using 16-color images as an input for some algorithm and not actually displaying them on screen. But since people do want to display .PI images, master.lib had to provide a second function for actually rendering these packed pixel buffers anyway. Just look at the sheer amount of work that graph_pack_put_8() has to do to unpack just 8 pixels into the 4×1 bytes for every bitplane, every time you want to write them to VRAM:
mov CL, 2
mov BL, [SI]
mov BH, 0
shl BX, CL
mov AX, word ptr CS:RotTbl[BX]
mov DX, word ptr CS:RotTbl[BX + 2]
inc SI
shl AX, CL
shl DX, CL
mov BL, [SI]
mov BH, 0
shl BX, CL
or AX, word ptr CS:RotTbl[BX]
or DX, word ptr CS:RotTbl[BX + 2]
inc SI
shl AX, CL
shl DX, CL
mov BL, [SI]
mov BH, 0
shl BX, CL
or AX, word ptr CS:RotTbl[BX]
or DX, word ptr CS:RotTbl[BX + 2]
inc SI
shl AX, CL
shl DX, CL
mov BL, [SI]
mov BH, 0
shl BX, CL
or AX, word ptr CS:RotTbl[BX]
or DX, word ptr CS:RotTbl[BX + 2]
inc SI
mov ES:[DI], AL ; 0a800h
mov BX, ES
mov ES:[DI+8000h], AH ; 0b000h
add BH, 10h
mov ES, BX
mov ES:[DI], DL ; 0b800h
add BH, 28h
mov ES, BX
mov ES:[DI], DH ; 0e000h
sub BH, 38h
mov ES, BX
inc DI
Don't try to understand this code, just look at the sheer number of instructions. Ideally, we want to replace four of these loop bodies with one MOVSD per bitplane, or just blit an entire bitplane with a single REP MOVSD for 640-wide images.
This algorithm costs >81 cycles on a 486 and >141 cycles on a 386, for every 8 pixels. Multiply this by the 32,000 8-pixel groups that make up the 256,000 pixels of a single 640×400 image, and you get up to 5 frames on Neko Project's 66 MHz CPU for blitting any full-screen background image.
There is an undocumented graph_pi_load_unpack() function, but that one "unpacks" a .PI into an utterly wasteful and unhelpful packed 8-bit format, and so is probably undocumented for a reason.
OK, but if performance is bad, there must be some really good other arguments in favor of this packed representation. Right?
PI's sequence repetition mechanism requires access to previously decoded pixels. Copying them around within a single packed buffer is just way faster than masking and reassembling bits from four different buffers.
Keeping both packed and planar versions of the same image in heap memory during decoding becomes increasingly prohibitive in Real Mode as image sizes increase.
Also, we can support images whose widths are multiples of 2 rather than 8!
Err, nope. If you look at PI's location codes, you'll see that they can only reach up to two rows back from the current cursor position. Hence, we only ever need to keep a three-row window of packed pixels – the current row, and the two previous ones – which would slide down as the image gets decoded. At the end of each row, we then convert the third row to planar pixels, write them to our result buffer, and slide down the window by shifting up the bottom two packed rows, making room for the next row. Sure, this means that we end up shifting ((width × height/2) × 2) more bytes around in memory. But copies are fast, and the first blit of a resulting 640×400 planar image would still be faster than >5 frames even with these copies factored in.
Also, the width restriction is not really an argument if you consider that graph_pack_put_8() already had the same limitation. Consequently, all of ZUN's .PI images already fulfill that requirement.
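The sliding window described above can be sketched in C. This is a stub of the idea only: the packed→planar conversion is replaced by a plain copy, `WIDTH` is an arbitrary example value, and all names are made up.

```c
#include <stdint.h>
#include <string.h>

enum { WIDTH = 16 }; // packed bytes per row (2 pixels/byte), example value

// PI's location codes reach at most two rows back from the cursor, so
// only three rows ever need to stay in packed form: the current one and
// the two above it.
typedef struct {
    uint8_t rows[3][WIDTH]; // [0] = two rows back … [2] = current row
} Window;

// Called at the end of each row: the oldest row can no longer be
// referenced, so convert it to planar pixels (stubbed out here as a
// plain copy), then shift the bottom two rows up to make room.
static void window_slide(Window *w, uint8_t *planar_out)
{
    memcpy(planar_out, w->rows[0], WIDTH); // stand-in for packed→planar
    memmove(w->rows[0], w->rows[1], 2 * WIDTH);
}
```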
Still, 23 frames to decode a single 640×400 image, using a function that restricts itself to 80186-level instructions? Surely we can do better if we target ≥386 CPUs anyway? I've also been on a 📝 mission to remove master.lib and any ASM translation units in the debloated/portable branch, so let's write a new .PI loader in C++!
Unfortunately, it quickly becomes obvious why you want to write this sort of unpacker in ASM. If we look back at the current record holder for the most unoptimized ZUN-written code, 📝 the curve animation in TH03's character selection screen primarily wastes CPU cycles because the sheer number of 4,096 points per logical frame magnifies the impact of every suboptimal code generation detail. Now increase that to the roughly 100,000 decoding operations required by a moderately complex 640×400 .PI file, and your C++ code just can't possibly compete with any well-written ASM implementation. There isn't much use for 32-bit instructions in PI decoding either, since it mostly comes down to making decisions based on individual bits. You'd much rather use the 4 main general-purpose registers (AX, BX, CX, and DX) as 8 8-bit registers than as 4 32-bit ones. And not only are C compilers of that era bad at allocating registers, but we would also have to constantly fight C's 📝 integer 📝 promotion 📝 rules that would prevent us from using 8-bit registers at every opportunity.
And so, my initial attempt at a C++ version was 3.3× slower than the combination of master.lib's graph_pi_load_pack() and graph_pack_put_8(). Even after deploying every inline ASM trick I came up with, I could only bring down that number to 1.7×. There goes the dream of using 100% C++ and no ASM in the codebase for PC-98 Touhou. Might as well continue using all the good parts of master.lib then…
On to plan B then, forking graph_pi_load_pack() to immediately unpack the image into four bitplanes. Integrating the sliding-window algorithm into plain ASM code was slightly tricky, but not the worst thing in the world.
It did reveal another slight drawback with the general approach of decoding a PI image into a single output buffer, though. PI supports backreferences to the two previous rows even within the first two rows of an image, where you'd expect them to point outside of the image and thus be invalid. In this case, the algorithm expects the two rows "above" the image to be flood-filled with the 2×1-pixel pair in the top-left corner of the image. This also means that every PI bitstream must start by explicitly specifying these colors in PI's delta encoding. master.lib implements this requirement by always allocating the image buffer with two additional rows at the top, which remain allocated once that buffer is returned. I probably don't even have to explain how the sliding-window algorithm naturally solves this issue as well. Thus, we get to reduce the size of the returned buffer to exactly the size of the image and save a few bytes on the heap.
As expected, this change merges the execution time of graph_pack_put_8() into that of graph_pi_load_pack(), and the combined time even ends up slightly shorter than the sum of both functions, despite all the added memory copies.
But now that this works, it would be interesting to compare the performance against the .PI library that ZUN used for TH01:
PiLoad
It's obvious why ZUN moved away from this library for the later four games. Despite the Load in its name, PiLoad exclusively decodes directly to VRAM, without ever writing the full planar image to conventional memory. This prevents any of the more dynamic use cases that ZUN had in mind for TH02: The multiple and quickly alternating 東方封魔録 images in the title screen animation, the darkening/highlighting effect in the character selection, and the in-game bomb backgrounds all require decompressed image buffers in RAM to run at anywhere close to acceptable speeds at 66 MHz.
Besides, PiLoad also restricts itself to 80186-level instructions, but looks like a giant mess. Self-modifying code? A packed→planar conversion that doesn't even use lookup tables, which master.lib probably uses by default for good reasons? Sure seems as if PiLoad is heavily micro-optimized for the very first generations of PC-98 models that typically wouldn't even have enough RAM for such an image…
…except that all this micro-optimization makes PiLoad more than 2× as fast as master.lib's PI functions, even on Neko Project's emulated i386?!
Yup. If you look closer, PiLoad is nothing short of a masterclass in x86 optimization, applied to a use case where every cycle actually matters:
PiLoad internally decompresses into an 8-bit format that keeps the color in the top 4 bits and leaves the bottom 4 bits zeroed. While this wastes quite a bit of buffer space on the surface, it allows a few key optimizations:
Since every pixel is now byte-aligned, PiLoad can handle every possible type of sequence repetition with a simple REP MOVSB instruction. This is especially handy for the 110 and 111 location codes, which refer to a source block of pixels that starts either (width - 1) or (width + 1) pixels before the current write cursor. With a 4bpp in-memory format, you additionally have to handle the slower case you'd run into 50% of the time, where the source block starts in the middle of a byte and forces you to manually extract and recombine nibbles.
PiLoad can then optimally place the color table for the move-to-front transform in the "zero page" of the decode buffer's segment, from offsets 0x00 to 0xFF inclusive. Then, each 8-bit color value doubles as the address to its corresponding table row, removing any need for further address calculations. 🤯
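A small C sketch makes the trick visible. The segment layout here is hypothetical and simplified; the point is only that with colors stored as `color << 4`, the 8-bit color value is already the byte offset of its own 16-entry table row, so no shift or multiply is needed to find it.

```c
#include <stdint.h>
#include <string.h>

// The segment's first 256 bytes hold 16 table rows of 16 entries each.
// The row for color c starts at offset (c << 4) - which is exactly how
// the color itself is stored in PiLoad's 8-bit format.
typedef struct {
    uint8_t zero_page[256];
} ColorSegment;

// Look up the [rank]-th predicted color following [prev], where [prev]
// is in the high-nibble format, e.g. 0x30 for color 3. The color value
// doubles as the row address; only the rank is added.
static uint8_t predicted_color(const ColorSegment *seg, uint8_t prev,
                               int rank)
{
    return seg->zero_page[prev + rank];
}
```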
Bit reading is incredibly efficient, making use of the key insight that early x86 favors non-taken jumps over taken ones. If you consumed all bits in your current byte-sized input data shift register, you must advance to the next byte in your file buffer and potentially reload bytes from disk after you reach the end of the buffer. Optimally, you'd want to optimize for the happy path you will hit in 7 out of 8 cases: decrement the bit counter (1 instruction), jump away to buffer refill code if necessary (1 instruction), read the bit into the carry flag (1 instruction), and then immediately process the value of the carry flag in the next instruction. But how do you conditionally "jump away" and then return to where you came from? x86 only has conditional jumps and no conditional calls, and even if it did, calls would add even more cycles of latency. This means that you can't just define a single bit reader function and then inline it, because such a function would always look something like this:
dec bits_remaining
jnz @@buffer_still_contains_bits
@@refill:
; Load [bits] from buffer or disk…
; …
mov bits_remaining, 8
@@buffer_still_contains_bits:
shl bits, 1 ; Shift next bit into carry flag
The @@refill code necessarily has to be part of the function, but awkwardly sits between the check and the call-site usage code. Execution would have to jump over it in the majority of cases, which is the opposite of what we want. This is what graph_pi_load_pack() does, and it wastes >14 cycles per compressed .PI input byte on a 486, >35 cycles on a 386 or 286, >63 cycles on a 186, and >84 cycles on an 8086.
Instead, PiLoad instantiates a separate copy of its @@refill code for every one of the 22 instances where the decoder needs to read a bit. This way, each bit reader instance can directly jump back to the happy path of its usage code:
dec bits_remaining
jz refill_instance_0
happy_path_0:
add bit_reg, bit_reg ; Shift next bit into carry flag
; Use the carry flag…
; …
refill_instance_0:
; Load [bit_reg] from buffer or disk…
; …
mov bits_remaining, 8
jmp happy_path_0
This kind of optimization even reaches into the physical layout of the code. On the 80186, conditional jumps were still limited to a signed 8-bit displacement of −128 to +127 bytes from the current position. Therefore, PiLoad can't just bunch all 22 instances into a single place, but must place each instance as close as possible to its usage code, at a place where regular execution jumps over a large part of code anyway.
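Functionally, both ASM layouts compute the same thing. Here's a C model of that bit reader (names made up, disk refilling omitted) – the entire point of the layout tricks above is making the refill branch, which C can't express this precisely, as cheap as possible for the 7-in-8 case where it isn't taken:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint8_t *buf;     // in-memory file buffer
    size_t pos;
    uint8_t shift_reg;      // current input byte, consumed MSB first
    int bits_remaining;
} BitReader;

static int read_bit(BitReader *r)
{
    if (r->bits_remaining == 0) {        // unhappy path: 1 in 8 calls
        r->shift_reg = r->buf[r->pos++]; // (a real reader would refill
        r->bits_remaining = 8;           //  from disk here if needed)
    }
    r->bits_remaining--;
    int bit = (r->shift_reg & 0x80) != 0; // "shift next bit into carry"
    r->shift_reg <<= 1;
    return bit;
}
```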
Finally, the packed→planar conversion is done by… alternating ADD and ADC instructions?!
; Each iteration of this loop consumes two pixels and thus writes two output bits per bitplane.
; Thus, looping 4 times gives us a full byte for each bitplane.
rept 4
; Note how processing 2 pixels / 16 bits is the most natural solution here, even on a 32-bit
; system. The algorithm relies on each decoded packed pixel byte being independently
; accessible, and x86 only gives you this access to the first and second byte of its AX, BX,
; CX, and DX registers. Hence, this instruction loads the next two pixels into AL (left) and AH
; (right), due to x86 being little-endian.
; On 32-bit CPUs, `LODSD` followed by a `SHR EAX, 16` further down would be faster on paper,
; but the actual impact is something we'd have to benchmark.
lodsw
; Shift out each successive color bit into the carry flag, and from there into one of our four
; target 8-bit registers (DH, DL, BH, BL) that represent the next 8 pixels on each bitplane.
; Since we have to look at the upper 4 bits of AL and AH, left-shifting via addition gives us
; the color bits in order from most to least significant, so we also have to shift them into
; the bitplane registers in this order.
; Another neat little detail: We don't even have to clear the target registers before this
; unrolled loop! After 4 iterations, we will have rotated 8 new bits into each of our four
; 8-bit target registers, which will have automatically shifted out any of their previous bits.
add al, al ; Left
adc bl, bl ; Plane 3
add ah, ah ; Right
adc bl, bl ; Plane 3
add al, al ; Left
adc bh, bh ; Plane 2
add ah, ah ; Right
adc bh, bh ; Plane 2
add al, al ; Left
adc dl, dl ; Plane 1
add ah, ah ; Right
adc dl, dl ; Plane 1
add al, al ; Left
adc dh, dh ; Plane 0
add ah, ah ; Right
adc dh, dh ; Plane 0
endm
A lot simpler than what you'd apparently have to do on an Amiga, as revealed by a Google search for chunky to planar.
It seems like BERO was just as surprised about this equivalence between ADD/ADC reg, reg and SHL/RCL reg, 1 as I was, and probably defined the shl1 and rcl1 macros used by the actual code to improve clarity.
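To see the bit flow without the register ballet, here's a C model of one full `rept 4` expansion (function name made up). It consumes 8 pixels in PiLoad's high-nibble format and produces one byte per bitplane, with `planes[3]` receiving the most significant color bit, like BL in the ASM:

```c
#include <stdint.h>

static void chunky_to_planar_8(const uint8_t pixels[8], uint8_t planes[4])
{
    for (int i = 0; i < 8; i++) {
        for (int p = 0; p < 4; p++) {
            // `add al, al` / `adc bl, bl`: shift the next color bit out
            // of the pixel's high nibble and append it to the plane byte.
            int bit = (pixels[i] >> (4 + p)) & 1;
            planes[p] = (uint8_t)((planes[p] << 1) | bit);
        }
    }
    // Like in the ASM, any previous contents of planes[] have been
    // shifted out entirely after 8 pixels - no clearing required.
}
```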
Comparing the cycle counts of the two variants then reveals quite a surprise:
                     8086  80186  80286  80386  80486
ADD/ADC reg, reg        3      3      2      2      1
SHL/ROL/RCL reg, 1      2      2      2      3      3
No Pentium numbers, since we've seen above that Pentium systems are bottlenecked by the GDC, not the CPU.
Yup. PiLoad's choice of instructions tells us that it was indeed deliberately optimized for ≥386 CPUs, the exact ones you'd want to optimize PC-98 Touhou code for, despite formally restricting itself to the 80186 ISA. We don't know whether Koizuka deliberately wanted to target ≤286 CPUs with graph_pack_put_8(), but the fact remains that this choice slowed down PC-98 Touhou on its target hardware. And yes, this includes the table-less version: While that one uses the exact same algorithm as PiLoad, it spells ADD and ADC as ROL 1 and RCL 1 and thus only runs minimally faster than the table-driven version on Neko Project's emulated i386.
Despite all the inlining, PiLoad still ends up 207 bytes smaller than the combination of master.lib's graph_pi_load_pack() and the default version of graph_pack_put_8() with its 1,024-byte lookup table.
So, let's add decoding into memory to PiLoad instead! Turns out that PiLoad was the simpler library to extend all along, since it already used the same sliding-window technique for unpacking pixels into VRAM. All it needed was a width check for a multiple of 8, a calculation of the bitplane size, and a new and much simpler unpacking function that doesn't need to support unaligned X positions or GRCG-powered transparency.
However, 📝 a certain unfortunate issue that will crop up in part 4 will require more outside control over the loading process. As a result, the API of my forked version looks quite different:
I separated the loading process into a header loading and bitplane decoding step, giving the caller full control over how and where to allocate the final image buffer.
I replaced all of PiLoad's own INT 21h file reading code with a single read callback, shifting the responsibility for opening and closing .PI files to the caller. master.lib's INT 21h-hooking packfile feature would later turn out to be one of the most poorly-implemented and overly complex features I've seen so far within that library, perhaps even worse than packed .PI loading. Hooking INT 21h was the obvious choice for easily supporting optional packfiles in both master.lib's own code and any third-party libraries, but the cost this comes at…
In exchange for the new features, I decided to drop a few features that we either don't need in this codebase or that are better implemented at another level.
This might not be ideal for other PC-98 developers who might want to use my fork in their projects, but you can always revert these commits in your own version.
Before we can entirely replace master.lib's PI functions with PiLoad in TH02-TH05 though, we have to address one particular silly detail:
TH03's silly line-doubled sprite sheets
As we all know by now, 📝 TH03 renders its in-game graphics at line-doubled 640×200 to free up half of VRAM for sprite data. This means that it can store all sprites at half of their displayed vertical resolution, saving some disk space and RAM in the process. However, ZUN only actually did this for the uncompressed sprites loaded from the .BF2 and 📝 .MRS files. All in-game .PI sprite sheets are, for some reason, stored in the same line-doubled form that would be shown on screen:
Since PI's location codes are designed around vertical repetition, line doubling has a negligible impact on compressed file sizes. Such an image still decodes into twice as many bytes as needed, though…
This is very silly because your PI implementation now needs a feature to undo this doubling by skipping over every second line at either load or render time. ZUN chose to place this feature at render time because that's the simplest solution when you use master.lib's PI functions. Since graph_pack_put_8() unpacks and blits only one line at a time, it already expects the caller to loop over all lines of a PI image and to do all the x86 segment arithmetic required for handling images larger than 64 KiB. From there, it's only natural to add a copy of that function that skips every second line.
But such a render-time feature would strain our generic blitter:
At worst, we'd have a "are we line-skipping" check and branch within the low-level blitter functions, at the end of every row. Such a branch would add new code for the line-skipping case that would be jumped over in the optimal case, slowing down the common case in the one place where we want to go fast.
We could generate separate line-skipping versions for all low-level blitters, but that adds code bloat, blitter management bloat, and API surface area bloat.
We could adapt master.lib's solution and have the high-level draw calls only run the blitter for each 1-pixel row, which allows them to jump over every second line just like you would do with graph_pack_put_8(). But conceptually, this just feels so wrong. I designed the entire blitter to be fast for sprites with arbitrary height, even going so far as to unroll the usual vertical loop for images shorter than 192 pixels on the X axis, and then we're slamming the brakes just to work around these silly assets. Plus, the added API surface is still a burden for regular draw calls.
Moving the skip to load time, on the other hand, has several clear advantages:
The loaded images actually only require half the memory.
They also decode slightly faster since we can skip the packed→planar conversion for every second line.
We'd still have to add a flag to the load call. But that's a one-time API surface cost, not a multiplicative one, and thus doesn't hurt at all.
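The load-time skip amounts to a one-line check in the per-row loop. A sketch under assumed names (the decode step is stubbed out; only the storing logic is shown):

```c
#include <stdint.h>

// Store one decoded row into the output image. When the source sheet is
// line-doubled, odd rows are duplicates: we still have to decode them to
// keep the bitstream in sync, but we neither convert nor store them, so
// the loaded image only needs half the memory.
static int store_row(uint8_t *dst, const uint8_t *decoded_row,
                     int row, int row_size, int skip_doubled_lines)
{
    if (skip_doubled_lines && (row & 1))
        return 0; // duplicate row: decoded, but not stored
    int dst_row = skip_doubled_lines ? (row / 2) : row;
    for (int i = 0; i < row_size; i++)
        dst[dst_row * row_size + i] = decoded_row[i];
    return row_size;
}
```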
And if we now put all of this together, we indeed get TH05's crossfaded title screen animation running at its intended speed on a 66 MHz Neko Project. Mission accomplished!
Interestingly though, running this animation at its denoted speed now reveals a page-flipping bug in ZUN's code. Each crossfading step is supposed to be visible for 4 logical frames, but that's not what we get:
For some reason, ZUN goes single-buffered before the first logical frame of the first mask pattern, by both showing and accessing page 0. The recording from the original game demonstrates how ZUN's code hilariously fails to race the beam here, as the combination of graph_pack_put_8() and unoptimized crossfaded blitting stretches this single high-level draw call to 7 frames.
ZUN then continues double-buffering after this first frame, accessing page 1 and showing page 0 for the second frame of the first mask pattern. However, we've just seen page 0 on the frame before, so we're now spending a second logical frame on the same page.
The rest of the animation is properly double-buffered. Each fade step is (wastefully) rendered twice, or once for each page, providing an example of 📝 the hardware detail leak I mentioned in part 1. But since the very first frame got duplicated, we've shifted the entire timing of the animation back by one logical frame.
After the code rendered the 4th logical frame of the 4th mask pattern, ZUN immediately returns to single-buffering on page 0 for the main menu, as the original game needs to use page 1 to store the raw background image for cursor and text unblitting purposes. This creates a more unusual landmine:
The original game does first blit the final unmasked image to page 1, but then copies it from page 1 to page 0 using master.lib's notoriously slow graph_copy_page() function while showing page 0. graph_copy_page() fully copies each bitplane before moving on to the next one, which explains the intermediate result on frame 97: Some colors of the title screen image already appear unmasked because their palette indices only use the lowest two bits whose planes have already been fully copied, the other colors get a discolored version of the fourth mask pattern, and we get a screen tearing line at the 371st row of VRAM which reveals that master.lib only managed to copy 2.9275 out of the 4 bitplanes within this physical frame.
Hence, ZUN wasn't effectively double-buffering at all, despite the page flip.
Since the debloated build retains the background image in RAM, it doesn't need to blit anything to page 1. Therefore, it can go fully single-buffered and start blasting planar pixels from RAM to VRAM at some point near the start of the vertical blanking interval. Mysteriously, we manage to race the beam on every 66 MHz configuration, despite also fully blitting each 640×400 bitplane in order. 😲
In both cases, the sudden flip to single-buffering cuts the last double-buffered frame from the crossfade animation that we were still supposed to show. Thus, the code defines a 5-4-4-3 sequence of logical frames for each fade step instead of the intended 4-4-4-4 one.
Another minor reason: Some of the code I wanted to debloat or make more portable within these 11 pushes was still in ASM, and it would have been way more annoying to keep and work around that code than to just decompile it. Next up: Taking a very quick look at these few decompilations.