
💰 Funded by:
[Anonymous], Congrio, Ember2528

PC-98 blitting performance research, part 2! As well as part 2 in 📝 the 4-post series about the big 2025 PC-98 Touhou portability subproject. This one gets pretty technical and is primarily interesting for anyone else wanting to write code for the PC-98. If you're just here for funny details about PC-98 Touhou, you can 📝 skip to the last two bullet points.

  1. Benchmarking wide and opaque sprites
  2. Reworking the blitter
  3. Benchmarking masked crossfading
  4. Optimizing .PI image handling
  5. Meeting our performance target (and uncovering ZUN bugs along with it)

📝 The first part of this blitting performance series dealt with single-color transparent 8-wide and 16-wide sprites blitted to non-byte-aligned X positions, whose performance was the primary goal for the first release of the TH01 Anniversary Edition. Now that we're all about menus and cutscenes, we're dealing with the exact opposite kind of image: No transparency, always displayed at byte-aligned X positions, and much wider. The PC-98 doesn't offer any hardware support here, so this is purely about finding the fastest way to memcpy() from RAM to VRAM. We could just take Intel's CPU reference manuals, count cycles, pick the fastest method, and call it a day, but those manuals can't know about the PC-98's infamous hefty VRAM latencies or any potentially relevant differences between real hardware and emulators.
This time, I'll also provide the ASM code for each technique. All of the following code snippets show a single row of the dedicated blitter for 64-pixel-wide sprites, with SI pointing into the source image, DI pointing into the target bitplane in VRAM, and CX holding the source stride:

  1. Displaced
    Let's start with the method that is supposedly optimal on i486 CPUs, at least according to Intel's reference manual. x86's addressing modes have always supported displacements for shifting the pointed-to address by a constant value, at the natural cost of a single code byte. This is the typical way you'd address structure fields from a pointer to an instance. In the context of blitting, this technique analogously treats the pixel chunks as fields within a "VRAM row" structure, and only updates the pointers at the end of each row to jump to the next one. Since all kinds of memory-accessing MOV instructions take just one cycle on an i486, this approach at least shouldn't add any CPU overhead on top of the unavoidable VRAM latencies.
    ; Row loop
    mov  	eax, [si]     	; Pixels  0-31
    mov  	es:[di], eax
    mov  	eax, [si+4]   	; Pixels 32-63
    mov  	es:[di+4], eax
    
    ; Move to next row
    add  	si, cx
    add  	di, (640 / 8)
  2. Marching
    These displacements sure are a nice feature of x86, but how big is their effect on performance really? Asked probably no one ever, considering that the alternative of constantly moving both pointers would add two more instructions that cost 6 bytes per 32-bit chunk, as well as >4 more CPU cycles on ≤386, not including instruction fetching costs as usual. I still had a column to fill in my benchmark output though, so why not fill it with a really bad approach that can demonstrate the impact of unoptimized code. Or lack thereof 👀
    ; We march by 64 pixels in every row.
    sub  	cx, (64 / 8)
    
    ; Row loop body
    mov  	eax, [si]   	; Pixels  0-31
    mov  	es:[di], eax
    add  	si, 4
    add  	di, 4
    mov  	eax, [si]   	; Pixels 32-63
    mov  	es:[di], eax
    add  	si, 4
    add  	di, 4
    
    ; Move to next row
    add  	si, cx
    add  	di, ((640 - 64) / 8)
  3. MOVS
Maybe, however, the marching approach isn't all that bad if we can express it more succinctly. In fact, the four-instruction MOV/MOV/ADD/ADD sequence does the same as a single MOVS instruction, and MOVS doesn't even require a temporary register! At one byte per 16 pixels, this would be by far the most compact method for shorter sprites. Our 2023 benchmark showed MOVS to be the preferred approach for 8- and 16-wide sprites on 286 and 386 CPUs, which matches Intel's documentation. Usually, complex instructions like these are discouraged on later microarchitectures as they tend to get slower and slower, but maybe VRAM is even slower and it doesn't actually matter?
    ; We march by 64 pixels in every row.
    sub  	cx, (64 / 8)
    cld  	                    	; Don't go backwards!
    
    ; Row loop body
    movsd	                    	; Pixels  0-31
    movsd	                    	; Pixels 32-63
    add  	si, cx              	; Move to next row
    add  	di, ((640 - 64) / 8)
  4. REP MOVS
    Thanks to the REP prefix, we can compress such a series of MOVS instructions down even further and let the CPU handle the repetition. This is the preferred way you'd implement a memcpy() in hand-written x86 ASM if you're going for code clarity, but does it also perform well? Given the startup cost of REP, this will probably only ever be faster than MOVS past a certain sprite width. But it's almost certain to be the optimal approach for 640-wide sprites: Once the distance between the end of the previous row and the start of a new row becomes zero, we no longer need a vertical loop either, and can multiply the height into the repetition count to blit a whole bitplane in a single instruction.
    ; We might be blitting a smaller area out of a larger sprite.
    mov  	ax, cx              	; The REP prefix needs CX itself
    sub  	ax, (64 / 8)
    cld  	                    	; Don't go backwards!
    
    ; Row loop body
    mov  	cx, (64 / 32)       	; 64 pixels = 2 DWORDs
    rep movsd
    
    ; Move to next row
    add  	si, ax
    add  	di, ((640 - 64) / 8)
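Before benchmarking, it can help to pin down what all four variants actually compute. Here's a C model of the REP MOVS technique against a simulated bitplane in plain memory, including the 640-wide special case where the row loop collapses into a single copy. `PLANE_STRIDE` and `blit_plane_rep_movs` are names made up for this sketch, not anything from the actual codebase:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simulated single bitplane: (640 / 8) bytes per row. */
#define PLANE_STRIDE (640 / 8)

/* C equivalent of the REP MOVS technique: one copy per row, or a single
   copy for the whole area if the blit covers the full 640-pixel width. */
static void blit_plane_rep_movs(
	uint8_t *vram, const uint8_t *src, int width_bytes, int height)
{
	if (width_bytes == PLANE_STRIDE) {
		/* Distance between rows is zero: fold the height into the
		   repetition count, like a single REP MOVS per bitplane. */
		memcpy(vram, src, (size_t)(width_bytes * height));
		return;
	}
	for (int y = 0; y < height; y++) {
		memcpy(&vram[y * PLANE_STRIDE], &src[y * width_bytes],
		       (size_t)width_bytes);
	}
}
```

The ASM variants only differ in how they move SI and DI between these copies; the bytes that end up in VRAM are identical.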

Since we've thoroughly tested the GRCG against 4-plane blitting performance 📝 last time, we no longer need to check both. Instead, we can use the screen space for testing all possible image widths at 32-pixel intervals, using a self-made 640-wide dithered gradient:

This recording in Neko Project 21/W at 2.4576 × 27 = 66.3552 MHz might even be representative?

But of course, the only relevant results are those from real hardware:

Width  Displaced  Marching   MOVS  REP MOVS
   32       1.47      1.56   1.34      1.35
   64       2.28      2.49   2.04      2.09
   96       3.16      3.51   2.84      2.84
  128       4.01      4.46   3.58      3.54
  160       4.92      5.52   4.39      4.34
  192       5.74      6.49   5.10      5.04
  224       6.72      7.58   5.78      5.80
  256       7.61      8.50   6.73      6.46
  288       8.35      9.55   7.44      7.18
  320       9.36     10.51   8.18      7.89
  352      10.26     11.45   8.93      8.62
  384      11.14     12.48   9.68      9.30
  416      12.03     13.43  10.17     10.01
  448      12.85     14.34  11.13     10.72
  480      13.71     15.36  11.83     11.43
  512      14.53     16.23  12.62     12.13
  544      15.51     17.23  13.31     12.84
  576      16.24     18.25  14.08     13.54
  608      17.10     19.19  14.57     14.29
  640      18.09     20.04  15.44     14.67

Width  Displaced  Marching   MOVS  REP MOVS
   32       0.74      0.76   0.68      0.67
   64       1.06      1.21   0.98      1.08
   96       1.44      1.64   1.28      1.41
  128       1.78      2.06   1.58      1.71
  160       2.19      2.51   1.91      2.03
  192       2.52      2.96   2.22      2.33
  224       3.00      3.47   2.58      2.74
  256       3.31      3.88   2.88      3.00
  288       3.67      4.29   3.18      3.31
  320       4.00      4.75   3.50      3.60
  352       4.38      5.14   3.75      3.89
  384       4.74      5.56   4.05      4.20
  416       5.07      6.00   4.35      4.49
  448       5.43      6.41   4.64      4.76
  480       5.74      6.83   4.94      5.07
  512       6.13      7.25   5.25      5.38
  544       6.50      7.69   5.50      5.66
  576       6.82      8.10   5.81      5.96
  608       7.18      8.50   6.11      6.25
  640       7.50      8.88   6.32      6.16

Width  Displaced  Marching   MOVS  REP MOVS
   32       0.16      0.18   0.15      0.15
   64       0.25      0.28   0.22      0.25
   96       0.33      0.38   0.30      0.33
  128       0.42      0.49   0.37      0.40
  160       0.51      0.59   0.45      0.48
  192       0.60      0.70   0.52      0.55
  224       0.70      0.82   0.61      0.64
  256       0.78      0.92   0.68      0.71
  288       0.87      1.02   0.75      0.78
  320       0.95      1.13   0.82      0.85
  352       1.04      1.22   0.89      0.92
  384       1.13      1.33   0.96      1.00
  416       1.20      1.43   1.03      1.06
  448       1.29      1.53   1.10      1.13
  480       1.38      1.63   1.17      1.21
  512       1.46      1.73   1.25      1.28
  544       1.54      1.83   1.31      1.34
  576       1.63      1.94   1.38      1.42
  608       1.71      2.03   1.46      1.49
  640       1.79      2.13   1.52      1.47
Number of frames required to blit Width×5120 pixels of a 1bpp bitmap to a byte-aligned position in VRAM, using the GRCG. The plots are relative to the respective REP MOVS result. Thanks to 1️⃣ オップナー2608 for the real-hardware test.

Wait, only one run from a real system that's way below the minimum requirements of PC-98 Touhou? Furball is currently waiting for 286 and 486 hardware and will test those models once the shipment arrives. But all other test results I received came from Pentium-era or later models, and had identical performance for every blitting technique, including the terrible marching one! Turns out that the superscalar microarchitecture of these CPUs erases any difference between blitting methods, and does so independently of clock speed. My testers and I replicated this behavior on:

Thus, if your minimum target is a Pentium and you've got any kind of large image to blit to a byte-aligned position: Go with REP MOVS or even just memcpy(), enjoy having one generic blitter for every sprite width, and don't waste any further thoughts on the issue.
But since we have to make a choice, we might as well optimize for the one target with stable performance characteristics that would be used by the most people: the i386 emulated by Neko Project, which prefers MOVS for all widths except 640 pixels.


Deploying these blitting functions outside of a benchmark quickly runs into a practical problem. MOVS requires us to stay with the original plan of generating a dedicated blitter function for every possible byte-aligned sprite width. But this would mean that we'd unconditionally generate (640/8) = 80 blitter functions and link them into every game. Most of these will not only never be used and just bloat the binaries, but also needlessly increase compile times due to how they are currently stitched together out of an unholy mess of macros and force-inlined recursive functions. I'll probably fully drop down to ASM and one translation unit per blitter in the next iteration of this system; Turbo C++ 4.0J still generates several unnecessary instructions around Duff's Device in particular. Now that I've constructed all these test cases, I really get how one could lose themselves in these tiny micro-optimizations. But I doubt that ZUN had the tools to measure the actual impact of what he did. :onricdennat:

For now though, an opt-in model would be the next obvious step: Start with an empty blit function array, and let each game request and generate blitters for all the sizes it needs. But that just replaces the bloat with manual tedium: For deduplication reasons, this instantiation must happen in a dedicated (and obviously PC-98-exclusive) translation unit, far away from the actual blitting call in the (future cross-platform) translation unit that requires each size. Maybe this is practicable for a single game, but it gets really dumb if you have five games whose blitting widths are kind of similar except for the places where they aren't.
Also, the games will now crash if they ever try to blit at a width we didn't generate. Which in turn either requires a bloat-free way of reporting this error, or living with the annoyance of these mysterious crashes during development, because all the text areas in menus sure come in a wide variety of widths…

In the end, we'd like to have some kind of generic and width-independent blitter after all. Execution speed doesn't matter all too much for these menu use cases, which only call their blitters whenever any pixels are supposed to change. REP MOVSD would be the preferred generic method in such a blitter, and this choice is even well-informed now that the benchmarks have confirmed it to be at least the second-fastest option everywhere. Then, we only need two more conditional branches per line to cover the remaining pixels in case the width doesn't cleanly divide by 32:

; BH: Sprite width in bytes
; BL: 32-pixel chunks per row (BH / 4)
; AX: Bytes between end of blitted width
;     and start of next row (= stride - BH)

	xor  	cx, cx

; Row loop body
	mov  	cl, bl
	rep movsd

@@missing_16_or_24_pixels?:
	test 	bh, 2
	jz   	@@missing_8_or_24_pixels?
	movsw

@@missing_8_or_24_pixels?:
	test 	bh, 1
	jz   	@@next_row
	movsb

@@next_row:
	add  	si, ax
	mov  	cl, bh       	; Roll back DI to
	sub  	di, cx       	; start of row, and
	add  	di, (640 / 8)	; jump to the next.
The generic byte-aligned blitter, supporting any byte-aligned width that fits into the 8-bit BH register, i.e., 8 to 2,040 pixels.
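The row body of this generic blitter can also be modeled in C, assuming a plain byte buffer in place of VRAM. `generic_blit_row` is a name made up for this sketch; the three stages map onto REP MOVSD, the TEST BH, 2 branch, and the TEST BH, 1 branch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* C model of the generic blitter's row body: copy (width / 4) DWORDs,
   then mop up a remaining word and/or byte for widths that don't cleanly
   divide by 32 pixels. */
static void generic_blit_row(uint8_t *dst, const uint8_t *src, int width_bytes)
{
	int i = 0;
	for (; i + 4 <= width_bytes; i += 4)  /* REP MOVSD */
		memcpy(&dst[i], &src[i], 4);
	if (width_bytes & 2) {                /* TEST BH, 2 → MOVSW */
		memcpy(&dst[i], &src[i], 2);
		i += 2;
	}
	if (width_bytes & 1)                  /* TEST BH, 1 → MOVSB */
		dst[i] = src[i];
}
```

Since `width_bytes & 3` can only be 0, 1, 2, or 3, the word and byte branches together cover every possible remainder.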

And since we still have this whole on-demand blitter generation framework for specific widths, games can still opt into faster blitters to accelerate commonly used sprite widths. TH01, for example, definitely gets back its fast 8- and 16-wide blitters for byte-aligned and pre-shifted pellets, that's for sure.
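A minimal sketch of that opt-in dispatch, assuming a function-pointer table indexed by byte width with the generic blitter as the fallback. All names here (`blitter_func`, `blitters`, `blitter_for`, `blit_generic`, `blit_16_fast`) are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

typedef void (*blitter_func)(uint8_t *dst, const uint8_t *src);

/* Call counters so the test below can observe the dispatch. */
static int generic_calls = 0, fast_calls = 0;

static void blit_generic(uint8_t *dst, const uint8_t *src)
{ (void)dst; (void)src; generic_calls++; }

static void blit_16_fast(uint8_t *dst, const uint8_t *src)
{ (void)dst; (void)src; fast_calls++; }

/* One slot per byte-aligned width up to 640 pixels; games register only
   the specialized blitters they actually need. */
static blitter_func blitters[(640 / 8) + 1];

static blitter_func blitter_for(int width_bytes)
{
	blitter_func f = blitters[width_bytes];
	return f ? f : blit_generic;  /* never crash on an unregistered width */
}
```

The fallback removes the "mysterious crashes during development" problem: an unregistered width is merely slower, not fatal.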


If that was the only kind of image blitting in menus and cutscenes, we'd be done with benchmarking now. But PC-98 Touhou also has one special effect we need to look at, which also happens to hold the most interesting unresolved performance question…

Masked crossfading

This effect made its debut in TH03, where ZUN used it for a small number of the 320×200 picture transitions in the Stage 8/9 cutscenes and endings. In TH04 and TH05, it then became the default for most of these picture transitions. Its most prominent appearance, though, can be found at the end of TH05's title screen animation, where ZUN applies this effect to a full 640×400 image. Here's how this looks at 66 MHz on Neko Project 21/W, in all of its sluggish glory:

As you might have guessed, this effect works by only blitting certain pixels of the target image, as selected by a mask, on top of the current contents of VRAM. Each of these fade animations iterates over four fixed 4×4 masks that gradually show more pixels of the target image, before blitting it without a mask:

[PC-98 Touhou .PI crossfade masks #1, #2, #3, and #4]
The white/1 bits represent the new pixels of the destination image.

On a system with packed 8-bit colors and a fast CPU, you could naively implement this effect by dropping down to drawing one pixel at a time and simply skipping over any pixel whose corresponding mask bit is not set. In a planar environment, however, you'll always have to perform a read-modify-write operation on a larger chunk of pixels. Wikipedia has a nice visualization of the general algorithm, which translates naturally to the planar VRAM of the PC-98:
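Per 32-pixel chunk of a single bitplane, that read-modify-write boils down to two ANDs and an OR. A minimal C sketch, with `masked_blit_chunk` being a name made up for this post:

```c
#include <assert.h>
#include <stdint.h>

/* The planar read-modify-write at the heart of the effect: keep the VRAM
   bits where the mask is 0, take the source bits where it is 1. */
static uint32_t masked_blit_chunk(uint32_t vram, uint32_t src, uint32_t mask)
{
	return (vram & ~mask) | (src & mask);
}
```

Every implementation below – CPU-only, GRCG-assisted, or EGC-accelerated – is just a different way of evaluating this one expression for every 32-pixel chunk of every bitplane.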

Once again, there are four ways in which we could implement this on a PC-98. The code examples assume 64-pixel-wide sprites as before:

  1. CPU only
    Let's start with the most naive approach that ignores all of the PC-98's blitter chips. Since we've decided on targeting Neko Project, we'd like to use x86 string instructions where possible, but we unfortunately can't do a direct copy due to the mask we have to apply to the source and destination pixels. But we can at least use LODSD for getting the source pixels into EAX while advancing the source pointer, and use displaced instructions to avoid marching the destination register.
    If we go for the least amount of instructions, we end up with the following algorithm:

    ; EAX: Source pixels
    ; EBX: Source mask (1 = blit pixel)
    ; ECX: Destination mask (1 = erase pixel)
    ; [si_skip]: (image stride - (64 / 8))
    
    ; Row loop body
    lodsd	                	; EAX = next 32 pixels at DS:[SI], then SI += 4
    and  	eax, ebx        	; Clear out source bits we don't want to change
    and   	es:[di], ecx    	; Clear out destination bits we want to change
    or     	es:[di], eax    	; Pixels 0-31
    lodsd	                	; Repeat with a destination displacement for
    and  	eax, ebx        	; pixels 32-63
    and   	es:[di+4], ecx
    or     	es:[di+4], eax
    
    ; Move to the next row
    add  	si, [bp-si_skip]
    add  	di, (640 / 8)
    DX is already used for the image height, in case you're wondering about that variable on the stack.

    However, this specific algorithm would turn out to be the worst thing we could possibly do. On paper, these memory-mutating AND and OR instructions might only cost 3 cycles on a 486, but both of them must transfer bits from VRAM to the CPU and back. That's two loads and two stores in a situation where we only really need one of each.
    We could go for minimal VRAM accesses instead of blindly keeping instruction counts low, but then we'd end up with twice as many instructions:

    ; EAX: Source pixels
    ; EBX: Source mask (1 = blit pixel)
    ; ECX: VRAM pixels
    ; [si_skip]: (image stride - (64 / 8))
    
    ; Row loop body
    lodsd	                	; EAX = next 32 pixels at DS:[SI], then SI += 4
    and  	eax, ebx        	; Clear out source bits we don't want to change
    not  	ebx             	; Turn EBX into the destination mask
    mov  	ecx, es:[di]    	; VRAM read
    and  	ecx, ebx        	; Clear out destination bits we want to change
    or   	eax, ecx        	; Combine source and VRAM
    stosd	                	; VRAM write; DI += 4
    not  	ebx             	; Turn EBX back into the source mask
    lodsd	                	; Repeat for pixels 32-63
    and  	eax, ebx
    not  	ebx
    mov  	ecx, es:[di]
    and  	ecx, ebx
    or   	eax, ecx
    stosd
    not  	ebx
    
    ; Move to the next row
    add  	si, [bp-si_skip]
    add  	di, ((640 - 64) / 8)

But sure enough, despite the doubled instruction count, this version runs almost twice as fast as my initial attempt above, across all systems we've tried!
    I still left a few micro-optimizations on the table with this one, though. Storing both masks on the stack might be faster than flipping EBX back and forth, and we should probably read from the per-row mask array through the FS or GS segment registers if we compile for ≥386 CPUs. But that shouldn't matter too much in view of all the more promising methods we've still got to try.

  2. GRCG + CPU
    The simpler one of the PC-98's blitter chips is best used as part of a two-pass approach: First, we use the GRCG's RMW mode and a tile register filled with 0 bits to perform the erase step across the entire area of the image. The erase loop uses a simple REP STOS per image line and probably doesn't need any further optimizing attention if the REP MOVS result above is any indication. The second pass then only needs a mere three instructions per 32 pixels to perform the missing OR of the masked pixels:
    ; EAX: Source pixels
    ; EBX: Source mask (1 = blit pixel)
    ; CX: Free for mask retrieval!
    ; [si_skip]: (image stride - (64 / 8))
    
    ; Row loop body
    lodsd	                	; EAX = next 32 pixels at DS:[SI], then SI += 4
    and  	eax, ebx        	; Clear out source bits we don't want to change
    or     	es:[di], eax    	; Pixels 0-31
    lodsd	                	; Repeat with a destination displacement for
    and  	eax, ebx        	; pixels 32-63
    or     	es:[di+4], eax
    
    ; Move to the next row
    add  	si, [bp-si_skip]
    add  	di, (640 / 8)
    But that looks eerily similar to the initial CPU attempt we scrapped earlier. The combination of a GRCG run in Read-Modify-Write mode and the OR instruction means that we're back to two read-modify-write operations per pixel, one more than we actually need. Therefore, this approach can only outperform our fast CPU algorithm if the GRCG run manages to be faster than the 5×4 extra instructions we need to perform the erase step on the CPU.
    And that's indeed the best use we can get out of the GRCG for this effect. The GRCG is hardwired for blitting a rather static 8-pixel tile with a more variable mask provided by the CPU, but this effect calls for the exact opposite, a static mask and variable pixels. If we wanted to draw the entire effect using the GRCG, we'd have to send our source pixels through slow I/O ports 8 pixels at a time, which already wastes all the added throughput we could get from a 32-bit and even 16-bit CPU. Meanwhile, the mask register on the CPU – the thing we can easily change – only changes on every line of pixels.
    The initial blit of the Siddhaṃ seed syllables at the beginning of TH01's Konngara fight roughly demonstrates how slow such an approach would be. BERO's PiLoad library implements transparent blitting in a similar way, sending each decoded byte to the GRCG and then blitting it with a mask:
    All of these images would decompress within 6 frames on Neko Project 21/W at 66 MHz, but PiLoad's transparent blitting algorithm easily doubles that. That's what, ~36,750 cycles for every single line of the image?
    The resulting rolling effect is another great example of CPU-speed-dependent lag. Everything here is supposed to happen within a single logical frame, and only doesn't because the system is too slow. A faster PC-98 would blit the image more quickly and thus spend fewer frames, and ports will display all four syllables instantly.

    Fortunately, the PC-98 also has a second blitter chip that is much more suited to what we're trying to do here, …

  3. EGC (ZUN's Version)
    …and it also happens to be the one ZUN used in his original code. But isn't the EGC infamous for only supporting block copies if the source pixels are already in VRAM? How would we get them there in a setting where we show a full-screen 640×400 image and don't hide anything using 📝 black cells on the text layer?
    The answer can be found by looking at the PC-98's memory map. Each bitplane requires the physical existence of (640/8 × 400) = 32,000 bytes of memory. But that's a rather odd number for a memory map, which typically prefers sizes that are powers of 2. And so, NEC not only rounded up the area in the memory map, but also the size of physically available VRAM, giving us 32,768 bytes per bitplane and page. That's an extra 768×4 bytes of memory that isn't shown on screen but still counts as VRAM and can be accessed by the EGC.
    This allowed ZUN to go for a line-by-line approach using the EGC's mask register (0x4A8):

    1. Regularly blit the source pixels of the current line to the beginning of each bitplane's offscreen section
    2. Turn on the EGC, configure it for a regular VRAM-to-VRAM copy, and set the mask register depending on the current row
    3. EGC-blit the line to its intended position
    4. Turn off the EGC
    5. Repeat for every line of the image

    From all we've learned about PC-98 hardware so far, this should still be pretty slow. While we're back to a single read-modify-write operation for the masked blit itself, the initial blit of the source pixels to the offscreen area still adds one extra VRAM write per 32 pixels, on an architecture that's already notorious for having slow VRAM. Worse, however, is the fact that configuring the EGC still involves these slow I/O port writes I've been writing about time and time again. EGC-accelerated blitting would have to be really fast to be worth the added I/O cost of this approach. 📝 I sure couldn't believe that this was actually faster when I first saw this piece of ZUN code in early 2021, but he probably was onto something here.

  4. EGC (optimized)
But wait: Blitting a single line in each EGC run utilizes only (640 / 8) = 80 of the 768 available offscreen bytes we get on each bitplane. How much faster could we go if we used the entire offscreen area during each run?
    After all, nothing stops us from treating invisible VRAM as regular linear memory for 4bpp pixel data, ignoring the spatial layout of the source image. For TH05's 640×400 title screen image, we can fit ⌊768 / (640/8)⌋ = 9 lines of pixels into the offscreen area, which reduces the number of EGC runs from 400 to just 45. For the crossfaded 320×200 cutscene pictures, we get ⌊768 / (320/8)⌋ = 19 lines and 11 EGC runs, wasting even less VRAM.
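This batching arithmetic is easy to double-check in C. `OFFSCREEN_BYTES`, `lines_per_run`, and `egc_runs` are names made up for this sketch:

```c
#include <assert.h>

/* Each bitplane physically has 32,768 bytes; the visible 640×400 area
   uses 32,000 of them, leaving 768 bytes of offscreen staging space. */
enum { OFFSCREEN_BYTES = 32768 - ((640 / 8) * 400) };

/* How many full image lines fit into the offscreen area at once. */
static int lines_per_run(int width_px)
{
	return OFFSCREEN_BYTES / (width_px / 8);
}

/* How many EGC runs it takes to cover the whole image. */
static int egc_runs(int image_height, int width_px)
{
	int lpr = lines_per_run(width_px);
	return (image_height + lpr - 1) / lpr;  /* ceiling division */
}
```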

    A visualization of the 11 EGC runs required for masked blitting of a 320×200 cutscene picture. The offscreen area of VRAM is highlighted in pink on the first frame; the white stripe on its last line represents the nonexistent memory past 32,768 bytes, since 768 bytes do not cleanly divide into 640-pixel rows.
    For 320-wide images, this approach only leaves 64 pixels (or 8 bytes) of VRAM unused. Micro-optimizing those last few bytes won't matter – not only because we'd still need 11 EGC runs even if we blitted 19.2 lines per run instead of 19, but also because the mask patterns force us to iterate over discrete rows anyway.
    Aside from this major algorithmic optimization, this approach also offers a vast potential for micro-optimizations:
    • The EGC retains its registers across activations, so we only need to set them up once at the beginning of the image. This saves an additional 5 I/O port writes per EGC run over ZUN's version.
    • Using REP MOVSW for the EGC copy not only follows from our previous results, but also makes sense because we can use a segment override prefix on MOVSW's source operand and thus make it copy entirely within the ES segment. Since we still access the pixel masks for each line through DS, using an ES override removes the need for the awkward DS register juggling that ZUN had to do for each line of the image.
    • The linear way we access offscreen VRAM necessitates a new set of blitters that doesn't move DI to the next VRAM row at the end of an image row. However, we only actually need these blitters for their ability to skip to the next line of the source image in case we blit a subregion that is narrower than the full image. If we blit an image's entire width, we can go down an even faster code path that blasts all lines of an EGC run to VRAM with a single REP MOVS per bitplane. As we've seen for the 640-pixel case above, this is always the optimal choice.
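One sanity check before benchmarking: whatever the technique, the blitted pixels must come out identical. Here's a small C sketch confirming that the two-pass erase-then-OR structure of the GRCG approach matches the single masked read-modify-write of the other techniques; both function names are made up for this post:

```c
#include <assert.h>
#include <stdint.h>

/* Single-pass masked read-modify-write per 32-pixel chunk. */
static uint32_t crossfade_one_pass(uint32_t vram, uint32_t src, uint32_t mask)
{
	return (vram & ~mask) | (src & mask);
}

/* Two-pass structure of the GRCG approach. */
static uint32_t crossfade_two_pass(uint32_t vram, uint32_t src, uint32_t mask)
{
	vram &= ~mask;         /* pass 1: GRCG RMW erase, tile register = 0 */
	vram |= (src & mask);  /* pass 2: CPU OR of the masked source */
	return vram;
}
```

Since both compute the same bits, the benchmark below really does measure nothing but the cost of getting those bits into VRAM.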

So, let's construct another benchmark, do some preliminary tests on Neko Project 21/W, and…

…whoa, ZUN's EGC algorithm is bad on emulators. This benchmark even paints it somewhat favorably; for code simplicity reasons, I merely simulate it within my optimized implementation by reducing the number of blitted image rows per EGC run to 1.
Let's see how fast it all runs on real hardware. This time, the results for ≥Pentium models are interesting as well:

Width  EGC (optimized)  EGC (ZUN)  GRCG + CPU  CPU only
   32             0.66       2.50        1.15      1.11
   64             1.02       2.90        1.65      1.64
   96             1.36       3.26        2.14      2.14
  128             1.72       3.56        2.60      2.64
  160             2.07       3.79        3.07      3.14
  192             2.42       4.15        3.58      3.65
  224             2.81       4.34        4.06      4.15
  256             3.15       4.68        4.52      4.65
  288             3.49       5.02        5.01      5.15
  320             3.82       5.35        5.52      5.66
  352             4.20       5.70        5.99      6.17
  384             4.52       6.03        6.44      6.66
  416             4.88       6.36        6.92      7.16
  448             5.20       6.71        7.43      7.68
  480             5.58       7.04        7.93      8.17
  512             5.91       7.38        8.36      8.67
  544             6.26       7.72        8.85      9.18
  576             6.59       8.05        9.39      9.68
  608             6.95       8.40        9.84     10.18
  640             7.30       8.65       10.39     10.66

Width  EGC (optimized)  EGC (ZUN)  GRCG + CPU  CPU only
   32             0.17       0.41        0.23      0.23
   64             0.27       0.51        0.40      0.39
   96             0.38       0.66        0.58      0.56
  128             0.48       0.72        0.76      0.73
  160             0.60       0.79        0.95      0.91
  192             0.70       0.89        1.13      1.08
  224             0.80       0.98        1.30      1.25
  256             0.90       1.09        1.47      1.42
  288             1.02       1.21        1.67      1.60
  320             1.13       1.31        1.84      1.77
  352             1.23       1.41        2.02      1.93
  384             1.33       1.51        2.19      2.10
  416             1.45       1.63        2.39      2.29
  448             1.56       1.74        2.57      2.45
  480             1.66       1.83        2.74      2.62
  512             1.76       1.93        2.91      2.78
  544             1.86       2.04        3.09      2.95
  576             1.97       2.14        3.27      3.12
  608             2.07       2.24        3.44      3.28
  640             2.17       2.34        3.61      3.45

Width  EGC (optimized)  EGC (ZUN)  GRCG + CPU  CPU only
   32             0.09       0.21        0.15      0.14
   64             0.17       0.28        0.30      0.27
   96             0.25       0.36        0.44      0.41
  128             0.33       0.43        0.59      0.55
  160             0.41       0.52        0.74      0.68
  192             0.49       0.59        0.89      0.82
  224             0.58       0.66        1.03      0.95
  256             0.66       0.75        1.18      1.09
  288             0.74       0.84        1.33      1.22
  320             0.82       0.91        1.48      1.36
  352             0.90       0.99        1.62      1.49
  384             0.99       1.08        1.77      1.63
  416             1.07       1.17        1.92      1.76
  448             1.15       1.24        2.06      1.90
  480             1.23       1.32        2.21      2.03
  512             1.31       1.40        2.35      2.17
  544             1.40       1.48        2.50      2.30
  576             1.48       1.56        2.65      2.44
  608             1.56       1.65        2.80      2.57
  640             1.64       1.73        2.94      2.71

Width  EGC (optimized)  EGC (ZUN)  GRCG + CPU  CPU only
   32             0.08       0.16        0.14      0.13
   64             0.16       0.23        0.29      0.26
   96             0.24       0.31        0.43      0.39
  128             0.32       0.39        0.57      0.52
  160             0.40       0.46        0.72      0.65
  192             0.48       0.54        0.86      0.78
  224             0.56       0.62        1.00      0.92
  256             0.64       0.70        1.15      1.05
  288             0.71       0.78        1.29      1.18
  320             0.79       0.86        1.43      1.31
  352             0.87       0.93        1.58      1.44
  384             0.95       1.02        1.72      1.57
  416             1.03       1.09        1.86      1.70
  448             1.11       1.17        2.01      1.83
  480             1.19       1.25        2.15      1.96
  512             1.27       1.33        2.29      2.09
  544             1.35       1.41        2.44      2.22
  576             1.43       1.49        2.58      2.35
  608             1.50       1.56        2.72      2.48
  640             1.58       1.64        2.87      2.61

Width  EGC (optimized)  EGC (ZUN)  GRCG + CPU  CPU only
   32             0.10       0.24        0.17      0.16
   64             0.19       0.33        0.34      0.31
   96             0.29       0.40        0.50      0.46
  128             0.38       0.48        0.67      0.61
  160             0.47       0.57        0.83      0.77
  192             0.56       0.66        1.00      0.92
  224             0.65       0.75        1.16      1.07
  256             0.75       0.84        1.33      1.22
  288             0.84       0.93        1.50      1.38
  320             0.93       1.02        1.66      1.53
  352             1.02       1.11        1.83      1.69
  384             1.11       1.21        1.99      1.83
  416             1.21       1.30        2.16      1.99
  448             1.30       1.39        2.33      2.14
  480             1.39       1.48        2.49      2.29
  512             1.48       1.57        2.66      2.45
  544             1.57       1.66        2.82      2.60
  576             1.67       1.76        2.99      2.75
  608             1.76       1.84        3.15      2.90
  640             1.85       1.94        3.32      3.06

Width  EGC (optimized)  EGC (ZUN)  GRCG + CPU  CPU only
   32             0.08       0.44        0.20      0.18
   64             0.11       0.47        0.25      0.24
   96             0.15       0.51        0.30      0.30
  128             0.18       0.54        0.36      0.36
  160             0.22       0.56        0.41      0.42
  192             0.25       0.59        0.46      0.48
  224             0.30       0.60        0.52      0.54
  256             0.33       0.64        0.57      0.60
  288             0.37       0.67        0.62      0.66
  320             0.40       0.71        0.68      0.72
  352             0.44       0.74        0.73      0.78
  384             0.47       0.77        0.78      0.84
  416             0.51       0.81        0.83      0.90
  448             0.54       0.84        0.89      0.96
  480             0.57       0.88        0.94      1.02
  512             0.61       0.91        0.99      1.08
  544             0.64       0.94        1.04      1.14
  576             0.68       0.98        1.09      1.20
  608             0.71       1.01        1.15      1.26
  640             0.74       1.03        1.20      1.32
Number of frames required to crossfade Width×400 pixels onto all four planes of VRAM. Thanks to 1️⃣ オップナー2608, 2️⃣ Furball, 4️⃣ Will.Broke.It, and 5️⃣ spaztron64 for the real-hardware tests.

Now that's what we want to see! A clear winner across all relevant generations of PC-98 hardware. Activating the EGC really is only costly on paper; it's worth it even if you just need to perform a single tall RMW blit across all bitplanes.
Most importantly, though: This optimization was crucial for emulators, and absolutely worth it. At Neko Project's emulated 66 MHz, ZUN's code took consistently longer than one frame to crossfade 640×400 pixels onto VRAM. Hence, TH05's title screen animation always effectively rounded up that logical frame to 2 displayed frames. While this is still valid as per 📝 frame-perfection rule #2, it still counts as unintentional slowdown that we've now removed.

In return, ZUN also made the right call by not bothering about this particular optimization. As image widths and CPU speeds increase, minimizing the number of EGC runs has increasingly diminishing returns.
These results might even suggest that we should just formalize a 2-frame delay because clearly, no real-hardware model could ever overcome the VRAM latencies and actually crossfade a single 640×400 image within less than one frame. But again, such a 2-frame delay would be no more or less correct than the 3-frame delay on the PC-9821Na7, or the probably even higher delays on a 486 model. When in doubt, we just go by what ZUN wrote. My allegiance lies with code, not with Japan's e-waste.

Finally, we can also see how upgrading the CPU of your real-hardware PC-98 does nothing to work around VRAM latencies – i.e., the one bottleneck that actually matters for PC-98 Touhou. You'd basically just burn all those additional CPU cycles on waiting for VRAM. Such a CPU upgrade might even be counter-productive, as we can see with the 533 MHz CPU getting outperformed by not only the 400 MHz AMD K6-III, but even a 133 MHz Pentium, due to what spaztron64 suspects to be a southbridge issue.

Future work

But let's think ahead for a moment about what this means for the games as a whole. If the EGC is that good at this task compared to the CPU, how good could it possibly be for 16-color sprites with an alpha plane? master.lib's super_*() functions blit these using GRCG + CPU techniques, which decisively lost this benchmark on ≥Pentium hardware and weren't that good on Neko Project either. In contrast, the EGC has a lot going for it, and not just because of these benchmark numbers:

If this frees up enough performance, it might not just help with achieving 0% slowdown on real or emulated PC-98 systems. Maybe the games would then even run decently well on real 386 models, which might even shift the entire marketplace for PC-98 hardware in response? :tannedcirno: Or maybe, we could even have a high-res mod of TH03 that removes 📝 SPRITE16 and its sprite area to run the in-game portion at twice its original vertical resolution with more detailed sprites, but without increasing real-hardware system requirements all too much? Looking forward to writing and running that benchmark in another 2 years or so…


For now though, that's enough research into blitting performance. Now that we know the optimal way of getting big images from RAM to VRAM in any situation, how do we get them from disk to RAM in the first place? Also, wasn't that animation much slower than the 2 frames per fade step that we would expect from the benchmark results? What's going on there?

Optimizing .PI image handling

Uncompressed 640×400 images are rather large, common hard disk capacities were just starting to hit single-digit GB ranges in the late 90s, and you wouldn't want to haul even more floppy disks to Comiket. So it made sense why ZUN wanted to store these images in compressed form. Enter PI, the second-most common losslessly compressed image format within the PC-98 scene of the 90s. Technically, it's not that complex, but it still achieves great compression ratios for ZUN's images:

Size in bytes	:th02: TH02	:th03: TH03	:th04: TH04	:th05: TH05	Total
Uncompressed	2,195,904	4,874,240	4,815,360	5,905,920	17,791,424
.PI	290,921	807,690	765,909	822,993	2,687,513
oxipng -o max -Z	254,074	660,020	635,537	617,115	2,166,746

Relative size	:th02: TH02	:th03: TH03	:th04: TH04	:th05: TH05	Total
Uncompressed	100.00 %	100.00 %	100.00 %	100.00 %	100.00 %
.PI	13.25 %	16.57 %	15.91 %	13.94 %	14.91 %
oxipng -o max -Z	11.57 %	13.54 %	13.20 %	10.45 %	12.19 %

However, these compression ratios come at an annoying price for the PC-98 in particular. The algorithm naturally decompresses images into a packed representation with color values from 0 to 15 inclusive, not into the planar 4×1bpp format required by the PC-98's 16-color graphics mode. It makes a lot of sense why PI is designed around packed pixels:

But why, then, does master.lib only provide graph_pi_load_pack(), which returns the loaded image in the same packed format that the decompressor naturally produces? I've been wondering that ever since I 📝 decompiled the TH03 cutscene system, where the performance issues of this packed format became obvious. This approach only makes sense for the rare use case of using 16-color images as an input for some algorithm and not actually displaying them on screen. But since people do want to display .PI images, master.lib had to provide a second function for actually rendering these packed pixel buffers anyway. Just look at the sheer amount of work that graph_pack_put_8() has to do to unpack just 8 pixels into the 4×1 bytes for every bitplane, every time you want to write them to VRAM:

mov	CL, 2

mov	BL, [SI]
mov	BH, 0
shl	BX, CL
mov	AX, word ptr CS:RotTbl[BX]
mov	DX, word ptr CS:RotTbl[BX + 2]
inc	SI
shl	AX, CL
shl	DX, CL
mov	BL, [SI]
mov	BH, 0
shl	BX, CL
or 	AX, word ptr CS:RotTbl[BX]
or 	DX, word ptr CS:RotTbl[BX + 2]
inc	SI
shl	AX, CL
shl	DX, CL
mov	BL, [SI]
mov	BH, 0
shl	BX, CL
or 	AX, word ptr CS:RotTbl[BX]
or 	DX, word ptr CS:RotTbl[BX + 2]
inc	SI
shl	AX, CL
shl	DX, CL
mov	BL, [SI]
mov	BH, 0
shl	BX, CL
or 	AX, word ptr CS:RotTbl[BX]
or 	DX, word ptr CS:RotTbl[BX + 2]
inc	SI

mov	ES:[DI], AL      	; 0a800h
mov	BX, ES
mov	ES:[DI+8000h], AH	; 0b000h
add	BH, 10h
mov	ES, BX
mov	ES:[DI], DL       	; 0b800h
add	BH, 28h
mov	ES, BX
mov	ES:[DI], DH       	; 0e000h
sub	BH, 38h
mov	ES, BX
inc	DI

Don't try to understand this code, just look at the sheer number of instructions. Ideally, we want to replace four of these loop bodies with one MOVSD per bitplane, or just blit an entire bitplane with a single REP MOVSD for 640-wide images.

This algorithm costs >81 cycles on a 486 and >141 cycles on a 386, for every 8 pixels. Multiply this by the 256,000 pixels of a single 640×400 image, and you get up to 5 frames on Neko Project's 66 MHz CPU for blitting any full-screen background image.
There is an undocumented graph_pi_load_unpack() function, but that one "unpacks" a .PI into an utterly wasteful and unhelpful packed 8-bit format, and so is probably undocumented for a reason.
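For reference, here's a hypothetical C++ rendition of what a single one of those loop bodies computes. It spells out the bit shuffling that master.lib's RotTbl lookup appears to precompute; only the logic is the point here, not the (much slower) way a compiler would translate it:

```cpp
#include <array>
#include <cstdint>

// 4 packed bytes (2 pixels each, 4 bits per color) become 1 byte for
// each of the 4 bitplanes, where plane N receives bit N of each color.
std::array<uint8_t, 4> unpack_8_pixels(const uint8_t packed[4])
{
	std::array<uint8_t, 4> planes = { 0, 0, 0, 0 };
	for(int byte = 0; byte < 4; byte++) {
		const uint8_t left  = (packed[byte] >> 4); // Leftmost pixel first
		const uint8_t right = (packed[byte] & 0x0F);
		for(int plane = 0; plane < 4; plane++) {
			// Shift in 2 new bits per plane, with the leftmost pixel
			// in the higher bit
			planes[plane] = ((planes[plane] << 2) |
				(((left >> plane) & 1) << 1) |
				((right >> plane) & 1)
			);
		}
	}
	return planes; // One byte per bitplane, ready for the 4 VRAM segments
}
```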

OK, but if performance is bad, there must be some really good other arguments in favor of this packed representation. Right?

Err, nope. If you look at PI's location codes, you'll see that they can only reach up to two rows back from the current cursor position. Hence, we only ever need to keep a three-row window of packed pixels – the current row, and the two previous ones – which would slide down as the image gets decoded. At the end of each row, we then convert the third row to planar pixels, write them to our result buffer, and slide down the window by shifting up the bottom two packed rows, making room for the next row. Sure, this means that we end up shifting ((width × height/2) × 2) more bytes around in memory. But copies are fast, and the first blit of a resulting 640×400 planar image would still be faster than >5 frames even with these copies factored in.
Also, the resulting restriction of image widths to multiples of 8 pixels is not really an argument against planar decoding if you consider that graph_pack_put_8() already had the same limitation. Consequently, all of ZUN's .PI images already fulfill that requirement.
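In pseudo-C++, the sliding-window loop could look like this. decode_row() and unpack_row() are hypothetical stand-ins for the actual PI decoder and the packed→planar conversion; only the window management matters here:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// `width_bytes` is the packed size of one row (2 pixels per byte).
template <class DecodeRow, class UnpackRow> void decode_pi(
	int width_bytes, int height, DecodeRow decode_row, UnpackRow unpack_row
)
{
	// Rows [0] and [1] hold the two rows that location codes can reach
	// back to; row [2] receives the row that is currently being decoded.
	std::vector<uint8_t> window(3 * width_bytes);
	for(int y = 0; y < height; y++) {
		decode_row(&window[2 * width_bytes], &window[0], y);

		// Convert the packed pixels of the finished row and write them
		// to the final planar buffer…
		unpack_row(&window[2 * width_bytes], y);

		// …and slide the window down by one row. The packed copy of the
		// finished row stays around for the next two iterations.
		std::memmove(&window[0], &window[width_bytes], (2 * width_bytes));
	}
}
```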


Still, 23 frames to decode a single 640×400 image, using a function that restricts itself to 80186-level instructions? Surely we can do better if we target ≥386 CPUs anyway? I've also been on a 📝 mission to remove master.lib and any ASM translation units in the debloated/portable branch, so let's write a new .PI loader in C++!
Unfortunately, it quickly becomes obvious why you want to write this sort of unpacker in ASM. If we look back at the current record holder for the most unoptimized ZUN-written code, 📝 the curve animation in TH03's character selection screen primarily wastes CPU cycles because the sheer number of 4,096 points per logical frame magnifies the impact of every suboptimal code generation detail. Now increase that to the roughly 100,000 decoding operations required by a moderately complex 640×400 .PI file, and your C++ code just can't possibly compete with any well-written ASM implementation. There isn't much use for 32-bit instructions in PI decoding either, since it mostly comes down to making decisions based on individual bits. You'd rather use the 4 main general-purpose registers (AX, BX, CX, and DX) as 8 8-bit registers than as 4 32-bit ones. And not only were C compilers of that era bad at allocating registers, but we would also have to constantly fight C's 📝 integer 📝 promotion 📝 rules that prevent us from using 8-bit registers at every opportunity.
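As a small illustration of the promotion problem (my example, not actual decoder code): even a purely 8-bit addition is defined to happen in int.

```cpp
#include <cstdint>

// Both operands of `+` are promoted to int, so a 16-bit compiler is
// nudged towards a 16-bit ADD whose result is only truncated back to
// 8 bits by the return. Hand-written ASM would use a single 8-bit ADD
// on, say, AL and BL instead.
uint8_t add_pixels(uint8_t a, uint8_t b)
{
	return static_cast<uint8_t>(a + b); // (200 + 100) = 300 → 44
}
```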
And so, my initial attempt at a C++ version was 3.3× slower than the combination of master.lib's graph_pi_load_pack() and graph_pack_put_8(). Even after deploying every inline ASM trick I came up with, I could only bring down that number to 1.7×. There goes the dream of using 100% C++ and no ASM in the codebase for PC-98 Touhou. Might as well continue using all the good parts of master.lib then…

On to plan B then, forking graph_pi_load_pack() to immediately unpack the image into four bitplanes. Integrating the sliding-window algorithm into plain ASM code was slightly tricky, but not the worst thing in the world.
It did reveal another slight drawback with the general approach of decoding a PI image into a single output buffer, though. PI supports backreferences to the two previous rows even within the first two rows of an image, where you'd expect them to point outside of the image and thus be invalid. In this case, the algorithm expects the two rows "above" the image to be flood-filled with the 2×1-pixel pair in the top-left corner of the image. This also means that every PI bitstream must start by explicitly specifying these colors in PI's delta encoding. master.lib implements this requirement by always allocating the image buffer with two additional rows at the top, which remain allocated once that buffer is returned. I probably don't even have to explain how the sliding-window algorithm naturally solves this issue as well. Thus, we get to reduce the size of the returned buffer to exactly the size of the image and save a few bytes on the heap.
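The pre-fill itself then amounts to a single memset() over the window's two top rows, assuming one packed byte per 2×1-pixel pair (function and parameter names are mine, not master.lib's or PiLoad's):

```cpp
#include <cstdint>
#include <cstring>

// Fills the two window rows "above" the image with the top-left
// 2×1-pixel pair, so that backreferences within the first two real
// rows resolve to the colors that the format defines for them.
void fill_rows_above_image(
	uint8_t* window, int width_bytes, uint8_t topleft_pair
)
{
	std::memset(window, topleft_pair, (2 * width_bytes));
}
```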

As expected, this change merges the execution time of graph_pack_put_8() into that of graph_pi_load_pack(), and the combined time even ends up slightly shorter despite all the added memory copies.
But now that this works, it would be interesting to compare the performance against the .PI library that ZUN used for TH01:

PiLoad

It's obvious why ZUN moved away from this library for the later four games. Despite the Load in its name, PiLoad exclusively decodes directly to VRAM, without ever writing the full planar image to conventional memory. This prevents any of the more dynamic use cases that ZUN had in mind for TH02: The multiple and quickly alternating 東方封魔録 images in the title screen animation, the darkening/highlighting effect in the character selection, and the in-game bomb backgrounds all require decompressed image buffers in RAM to run at anywhere close to acceptable speeds at 66 MHz.
Besides, PiLoad also restricts itself to 80186-level instructions, but looks like a giant mess. Self-modifying code? A packed→planar conversion that doesn't even use lookup tables, which master.lib probably uses by default for good reasons? Sure seems as if PiLoad is heavily micro-optimized for the very first generations of PC-98 models that typically wouldn't even have enough RAM for such an image…

…except that all this micro-optimization makes PiLoad more than 2× as fast as master.lib's PI functions, even on Neko Project's emulated i386?!
Yup. If you look closer, PiLoad is nothing short of a masterclass in x86 optimization, applied to a use case where every cycle actually matters:

So, let's add decoding into memory to PiLoad instead! Turns out that PiLoad was the simpler library to extend all along, since it already used the same sliding-window technique for unpacking pixels into VRAM. All it needed was a width check for a multiple of 8, a calculation of the bitplane size, and a new and much simpler unpacking function that doesn't need to support unaligned X positions or GRCG-powered transparency.
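Both of those additions are tiny. In C++ terms, they boil down to something like this (names are mine, not PiLoad's):

```cpp
#include <cstdint>

// The newly required width check: planar blitting works on whole
// bytes, i.e., groups of 8 horizontal pixels.
bool width_is_planar_blittable(unsigned int width_in_pixels)
{
	return ((width_in_pixels % 8) == 0);
}

// Size of a single bitplane of the decoded image, in bytes.
uint32_t bitplane_size(unsigned int width_in_pixels, unsigned int height)
{
	return ((width_in_pixels / 8) * static_cast<uint32_t>(height));
}
```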
However, 📝 a certain unfortunate issue that will crop up in part 4 will require more outside control over the loading process. As a result, the API of my forked version looks quite different:

In exchange for the new features, I decided to drop a few features that we either don't need in this codebase or that are better implemented at another level. This list includes

This might not be ideal for other PC-98 developers who might want to use my fork in their projects, but you can always revert these commits in your own version.


Before we can entirely replace master.lib's PI functions with PiLoad in TH02-TH05 though, we have to address one particular silly detail:

:zunpet: TH03's silly line-doubled sprite sheets :zunpet:

As we all know by now, 📝 TH03 renders its in-game graphics at line-doubled 640×200 to free up half of VRAM for sprite data. This means that it can store all sprites at half of their displayed vertical resolution, saving some disk space and RAM in the process. However, ZUN only actually did this for the uncompressed sprites loaded from the .BF2 and 📝 .MRS files. All in-game .PI sprite sheets are, for some reason, stored in the same line-doubled form that would be shown on screen:

TH03's ENEMY01.PI, showing off the unnecessary line doubling of TH03's in-game sprites loaded from .PI files
Since PI's location codes are designed around vertical repetition, line doubling has a negligible impact on compressed file sizes. Such an image still decodes into twice as many bytes as needed, though…

This is very silly because your PI implementation now needs a feature to undo this doubling by skipping over every second line at either load or render time. ZUN chose to place this feature at render time because that's the simplest solution when you use master.lib's PI functions. Since graph_pack_put_8() unpacks and blits only one line at a time, it already expects the caller to loop over all lines of a PI image and to do all the x86 segment arithmetic required for handling images larger than 64 KiB. From there, it's only natural to add a copy of that function that skips every second line.
But such a render-time feature would strain our generic blitter:

Moving the skip to load time, on the other hand, has several clear advantages:

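In its simplest form, such a load-time skip is little more than a compacting copy over each decoded bitplane. The following sketch assumes the de-doubling happens after a full decode; a real implementation would rather skip the rows while unpacking (names are mine):

```cpp
#include <cstdint>
#include <cstring>

// Compacts a line-doubled bitplane in place by keeping only every
// second row. Row 0 already is where it needs to be.
void drop_doubled_lines(uint8_t* plane, int row_bytes, int doubled_height)
{
	for(int y = 1; y < (doubled_height / 2); y++) {
		std::memmove(
			&plane[y * row_bytes], &plane[(y * 2) * row_bytes], row_bytes
		);
	}
}
```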

And if we now put all of this together, we indeed get TH05's crossfaded title screen animation running at its intended speed on a 66 MHz Neko Project. Mission accomplished!

Interestingly though, running this animation at its intended speed now reveals a page-flipping bug in ZUN's code. Each crossfading step is supposed to be visible for 4 logical frames, but that's not what we get:

See? And that's just one example of 📝 the bad code quality you invite by having page flips in high-level menu code. Now multiply that by the number of screen tearing landmines, and you'll get why this has taken so long.

Another minor reason: Some of the code I wanted to debloat or make more portable within these 11 pushes was still in ASM, and it would have been way more annoying to keep and work around that code than to just decompile it. Next up: Taking a very quick look at these few decompilations.

Oh, and just in case you want to run these benchmarks yourself: 2025-09-10-benchmarks.zip