- 📝 Posted:
- 🚚 Summary of:
- P0218, P0219, P0220, P0221, P0222
- ⌨ Commits:
- 💰 Funded by:
- [Anonymous], Yanga, Ember2528
- 🏷 Tags:
Yes, I'm still alive. This delivery was just plagued by all of the worst luck: Data loss, physical hard drive failure, exploding phone batteries, minor illness… and after taking 4 weeks to recover from all of that, I had to face this beast of a task. 😵
Turns out that neither part of improving video performance and usability on
this blog was particularly easy. Decently encoding the videos into all
web-supported formats required unexpected trade-offs even for the low-res,
low-color material we are working with, and writing custom video player
controls added the timing precision resistance of HTML
<video> on top of the inherent complexity of frontend web
200 lines of commented CSS, and consume almost more than 5 pushes?!
Apparently, the latest price increase also seemed to have raised the minimum
level of acceptable polish in my work, since that's more than the maximum of
3.67 pushes it should have taken. To fund the rest, I stole some of the
reserved JIS trail word rendering research pushes, which means that the next
towards anything will go back towards that goal.
The codec situation is especially sad because it seems like so much of a
solved problem. ZMBV, the lossless capture codec introduced by DOSBox, is
both very well suited for retro game footage and remarkably simple too:
DOSBox-X's implementation of both an encoder and decoder comes in at under
650 lines of C++, excluding the Deflate implementation. Heck, the AVI
container around the codec is more complicated to write than the
compressed video data itself, and AVI is already the easiest choice you have
for a widely supported video container format.
Currently, this blog contains 9:02 minutes of video across 86 files, with a total frame count of 24,515. In case this post attracts a general video encoding audience that isn't familiar with what I'm encoding here: The maximum resolution is 640×400, and most of the video uses 16 colors, with some parts occasionally using more. With ZMBV, the lossless source files take up 43.8 MiB, and that's even with AVI's infamously bad overhead. While you can always spend more time on any compression task and precisely tune your algorithm to match your source data even better, 43.8 MiB looks like a more than reasonable amount for this type of content.
Especially compared with what I actually have to ship here, because sadly, ZMBV is not supported by browsers. 😔 Writing a WebAssembly player for ZMBV would have certainly been interesting, but it already took 5 pushes to get to what we have now. So, let's instead shell out to ffmpeg and build a pipeline to convert ZMBV to the ill-suited codecs supported by web browsers, replacing the previously committed VP9 and VP8 files. From that point, we can then look into AV1, the latest and greatest web-supported video codec, to save some additional bandwidth.
But first, we've got to gather all the ZMBV source files. While I was
working on the 📝 2022-07-10 blog post, I
noticed some weirdly washed-out colors in the converted videos, leading to
the shocking realization that my previous, historically grown conversion
script didn't actually encode in a lossless way. 😢 By extension,
this meant that every video before that post could have had minor
discolorations as well.
For the majority of videos, I still had the original ZMBV capture files straight out of DOSBox-X, and reproducing the final videos wasn't too big of a deal. For the few cases where I didn't, I went the extra mile, took the VP9 files, and manually fixed up all the minor color errors based on reference videos from the same gameplay stage. There might be a huge ffmpeg command line with a complicated filter graph to do the job, but for such a small 4-digit number of frames, it is much more straightforward to just dump each frame as an image and perform the color replacement with ImageMagick's
So, time to encode our new definite collection of source files into AV1, and
what the hell, how slow is this codec? With ffmpeg's
libaom-av1, fully encoding all 86 videos takes almost 9
hours on my mid-range
development system, regardless of the quality selected.
But sure, the encoded videos are managed by a cache, and this obviously only needs to be done once. If the results are amazing, they might even justify these glacial encoding speeds. Unfortunately, they don't: In its lossless
-crf 0 mode, AV1 performs even worse than VP9, taking up
222 MiB rather than 182 MiB. It might not sound bad now,
but as we're later going to find out, we want to have a lot of
keyframes in these videos, which will blow up video sizes even further.
So, time to go lossy and maybe take a deep dive into AV1 tuning? Turns out that it only gets worse from there:
- The alternative libsvtav1 encoder is fast and creates small files… but even on the highest-quality settings, -qp 0, the video quality resembled the terrible x264 YUV420P format that Twitter enforces on uploaded videos.
- I don't remember the librav1e results, but they sure weren't convincing either.
- The -usage realtime option is a complete joke. 771 MiB for all videos, and it doesn't even compress in real time on my system, more like 2.5× real-time. For comparison, a certain stone-age technology by the name of "animated GIF" would take 54.3 MiB, encode in sub-realtime (0.47×), and the only tuning you need is an easily googled palette generation and usage filter. Why can't I just use those in a <video> tag?! These results have clearly proven the top-voted "just use modern video codecs" Stack Overflow answers wrong.
- What you're actually supposed to do is to drop -cpu-used to maybe 2 or 3, and then selectively add back prediction filters that suit your type of content. In our case, these are
Because that's what all this tuning ended up being: a complete waste of
time. No matter which tuning options I tried, all they did was cut down
encoding time in exchange for slightly larger files on average. If there is
a magic tuning option that would suddenly cause AV1 to maybe even beat ZMBV,
I haven't found it. Heck, at particularly low settings,
-enable-intrabc even caused blocky glitches with certain pellet
patterns that looked like the internal frame block hashes were colliding all
over the place. Unfortunately, I didn't save the video where it happened.
So yeah, if you've already invested the computation time and encoded your
content by just specifying a
-crf value and keeping the
remaining settings at their time-consuming defaults, any further tuning will
make no difference. Which is… an interesting choice from a usability
perspective. I would have expected the exact
opposite: default to a reasonably fast and efficient profile, and leave the
vast selection of tuning options for those people to explore who do
want to wait 5× as long for their encoder for that additional 5% of
compression efficiency. On the other hand, that surely is one way to get
people to extensively study your glorious engineering efforts, I guess? You
know what would maybe even motivate people to intrinsically do that?
Good documentation, with examples of the intent behind every option and its
optimal use case. Nobody needs long help strings that just spell out all of
the abbreviations that occur in the name of the option…
But hey, that at least means there's no reason to use anything but ZMBV for storing and archiving the lossless source files. Best compression efficiency, encodes in real-time, and the files are much easier to edit.
OK, end of rant. To understand why anyone could be hyped about AV1 to begin
with, we just have to compare it to VP9, not to ZMBV. In that light, AV1
is pretty impressive even at
-crf 1, compressing all 86
videos to 68.9 MiB, and even preserving 22.3% of frames completely
losslessly. The remaining frames exhibit the exact kind of quality loss
you'd want for retro game footage: Minor discoloration in individual pixels,
so minuscule that subtracting the encoded image from the source yields an
almost completely black image. Even after highlighting the errors by
normalizing such a difference image, they are barely visible even if you
know where to look. If "compressed PNG size of the normalized difference
between ZMBV and AV1
-crf 1" is a useful metric, this would be
its median frame among the 77.7% of non-lossless frames:
For comparison, here's the 13th worst one. The codec only resorts to color bleeding with particularly heavy effects, exactly where it doesn't matter:
Whether you can actually spot the difference is pretty much down to the
glass between the physical pixels and your eyes. In any case, it's very
hard, even if you know where to look. As far as I'm concerned, I can
confidently call this "visually lossless", and it's definitely good enough
for regular watching and even single-frame stepping on this blog.
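The difference-and-normalize check described above can be sketched in a few lines. This is only a minimal illustration that assumes frames are already decoded into rows of (R, G, B) tuples; the function names are hypothetical, and it says nothing about the tooling actually used for the images in this post.

```python
from typing import List, Tuple

Pixel = Tuple[int, ...]       # (R, G, B), 0-255 each
Frame = List[List[Pixel]]     # rows of pixels


def difference(source: Frame, encoded: Frame) -> Frame:
    """Per-channel absolute difference between two equally sized frames."""
    return [
        [tuple(abs(s - e) for s, e in zip(sp, ep)) for sp, ep in zip(srow, erow)]
        for srow, erow in zip(source, encoded)
    ]


def normalize(diff: Frame) -> Frame:
    """Stretch the largest error to 255 so that tiny errors become visible."""
    peak = max((c for row in diff for px in row for c in px), default=0)
    if peak == 0:
        return diff  # losslessly encoded frame: stays completely black
    return [[tuple(c * 255 // peak for c in px) for px in row] for row in diff]


def is_lossless(source: Frame, encoded: Frame) -> bool:
    """True if the encoded frame matches the source in every channel."""
    return all(c == 0 for row in difference(source, encoded) for px in row for c in px)
```

A frame with a maximum error of 2 per channel turns into one where that error is shown at full brightness, which is why the normalized images look much worse than the actual video.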
Since the appeal of the original lossless files is undeniable though, I also made those more easily available. You can directly download the one for the currently active video with the ⍗ button in the new video player – or directly get all of them from the Git repository if you don't like clicking.
Unfortunately, even that only made up for half of the complexity in this
pipeline. As impressive as the AV1
-crf 1 result may be, it
does in fact come with the drawback of also being impressively heavy to
decode within today's browsers. Seeking is dog slow, with even the latencies
for single-frame stepping being way beyond what I'd consider
tolerable. To compensate, we have to invest another 78 MiB into turning
every 10th frame into a keyframe until single-stepping through an
entire video becomes as fast as it could be on my system.
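To see why the keyframe interval matters so much for stepping, here is a deliberately simplified cost model. It assumes keyframes at fixed positions 0, g, 2g, … and that every other frame only depends on its predecessor, which ignores how real AV1 or VP9 streams reference frames; the function names are made up for this sketch.

```python
def frames_to_decode(target_frame: int, keyframe_interval: int) -> int:
    """Frames a decoder has to process to display `target_frame`.

    Simplified model: keyframes sit at 0, g, 2g, ... and every other frame
    depends only on the frame before it.
    """
    if keyframe_interval < 1:
        raise ValueError("keyframe interval must be at least 1")
    return target_frame % keyframe_interval + 1


def average_cost(frame_count: int, keyframe_interval: int) -> float:
    """Mean decode cost over all possible seek targets in a video."""
    total = sum(frames_to_decode(f, keyframe_interval) for f in range(frame_count))
    return total / frame_count
```

Under this model, an interval of 10 caps the worst case at 10 decoded frames per seek, and stepping backwards by one frame from a keyframe is the most expensive case.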
But fine, 146 MiB, that's still less than the 178 MiB that the old committed VP9 files used to take up. However, we still want to support VP9 for older browsers, older hardware, and people who use Safari. And it's this codec where keyframes are so bad that there is no clear best solution, only compromises. The main issue: The lower you turn VP9's
-crf value, the slower the
seeking performance with the same number of keyframes. Conversely,
this means that raising quality also requires more keyframes for the same
seeking performance – and at these file sizes, you really don't want to
raise either. We're talking 1.2 GiB for all 86 videos at
-crf 10 and
-g 5, and even on that configuration,
seeking takes 1.3× as long as it would in the optimal case.
Thankfully, a full VP9 encode of all 86 videos only takes some 30 minutes as opposed to 9 hours. At that speed, it made sense to try a larger number of encoding settings during the ongoing development of the player. Here's a table with all the trials I've kept:
| Codec | -crf | -g | Other parameters | Total size | Seek time |
|---|---|---|---|---|---|
| VP9 | 32 | 20 | -vf format=yuv420p | 111 MiB | 32 s |
| VP8 | 10 | 30 | -qmin 10 -qmax 10 -b:v 1G | 120 MiB | 32 s |
| VP8 | 7 | 30 | -qmin 7 -qmax 7 -b:v 1G | 140 MiB | 32 s |
| AV1 | 1 | 10 | | 146 MiB | 32 s |
| VP8 | 10 | 20 | -qmin 10 -qmax 10 -b:v 1G | 147 MiB | 32 s |
| VP8 | 6 | 30 | -qmin 6 -qmax 6 -b:v 1G | 149 MiB | 32 s |
| VP8 | 15 | 10 | -qmin 15 -qmax 15 -b:v 1G | 177 MiB | 32 s |
| VP8 | 10 | 10 | -qmin 10 -qmax 10 -b:v 1G | 225 MiB | 32 s |
| VP9 | 32 | 10 | -vf format=yuv422p | 329 MiB | 32 s |
| VP8 | 0-4 | 10 | -qmin 0 -qmax 4 -b:v 1G | 376 MiB | 32 s |
| VP8 | 5 | 30 | -qmin 5 -qmax 5 -b:v 1G | 169 MiB | 33 s |
| VP9 | 63 | 40 | | 47 MiB | 34 s |
| VP9 | 32 | 20 | -vf format=yuv422p | 146 MiB | 34 s |
| VP8 | 4 | 30 | -qmin 0 -qmax 4 -b:v 1G | 192 MiB | 34 s |
| VP8 | 4 | 40 | -qmin 4 -qmax 4 -b:v 1G | 168 MiB | 35 s |
| VP9 | 25 | 20 | -vf format=yuv422p | 173 MiB | 36 s |
| VP9 | 15 | 15 | -vf format=yuv422p | 252 MiB | 36 s |
| VP9 | 32 | 25 | -vf format=yuv422p | 118 MiB | 37 s |
| VP9 | 20 | 20 | -vf format=yuv422p | 190 MiB | 37 s |
| VP9 | 19 | 21 | -vf format=yuv422p | 187 MiB | 38 s |
| VP9 | 32 | 10 | | 553 MiB | 38 s |
| VP9 | 32 | 10 | -tune-content screen | 553 MiB | |
| VP9 | 32 | 10 | -tile-columns 6 -tile-rows 2 | 553 MiB | |
| VP9 | 15 | 20 | -vf format=yuv422p | 207 MiB | 39 s |
| VP9 | 10 | 5 | | 1210 MiB | 43 s |
| VP9 | 32 | 20 | | 264 MiB | 45 s |
| VP9 | 32 | 20 | -vf format=yuv444p | 215 MiB | 46 s |
| VP9 | 32 | 20 | -vf format=gbrp10le | 272 MiB | 49 s |
| VP9 | 63 | | | 24 MiB | 67 s |
| VP8 | 0-4 | | -qmin 0 -qmax 4 -b:v 1G | 119 MiB | 76 s |
| VP9 | 32 | | | 107 MiB | 170 s |
Yup, the compromise ended up including a chroma subsampling conversion to YUV422P. That's the one thing you don't want to do for retro pixel graphics, as it's the exact cause behind washed-out colors and red fringing around edges:
But there simply was no satisfying solution around the ~200 MiB mark
with RGB colors, and even this compromise is still a disappointment in both
size and seeking speed. Let's hope that Safari
users do get AV1 support soon… Heck, even VP8, with its exclusive
support for YUV420P, performs much better here, with the impact of
-crf on seeking speed being much less pronounced. Encoding VP8
also just takes 3 minutes for all 86 videos, so I could have experimented
much more. Too bad that it only matters for really ancient systems…
Two final takeaways about VP9:
- -tune-content screen and the tile options make no difference at all.
- All results used two-pass encoding. VP9 is the only codec where two passes made a noticeable difference, cutting down the final encoded size from 224 MiB to 207 MiB. For AV1, compression even seems to be slightly worse with two passes, yielding 154,201,892 bytes rather than the 153,643,316 bytes we get with a single pass. But that's a difference of 0.36%, and hardly significant.
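For readers who haven't set up two-pass encoding before, the two ffmpeg invocations could be assembled roughly like this. It is a sketch, not the actual pipeline code: the function name, file names, and the -b:v 0 constant-quality setting are assumptions on my part, and only -crf, -g, and the format filter come from the table above.

```python
import os
from typing import List


def vp9_two_pass(src: str, dst: str, crf: int, g: int,
                 pix_fmt: str, passlog: str) -> List[List[str]]:
    """Return the two ffmpeg invocations of a two-pass libvpx-vp9 encode."""
    common = [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libvpx-vp9",
        "-crf", str(crf), "-b:v", "0",   # assumption: constant-quality mode
        "-g", str(g),                    # maximum keyframe interval
        "-vf", f"format={pix_fmt}",
        "-passlogfile", passlog,         # statistics shared between passes
    ]
    # Pass 1 only gathers statistics, so its video output is thrown away.
    first = common + ["-pass", "1", "-f", "null", os.devnull]
    second = common + ["-pass", "2", dst]
    return [first, second]
```

Each returned list can be handed to subprocess.run(cmd, check=True) in order, for example with the -crf 15, -g 20, yuv422p values from one of the table rows.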
Alright, now we're done with codecs and get to finish the work on the
pipeline with perhaps its biggest advantage. With an ffmpeg conversion
infrastructure in place, we can also easily output a video's first frame as
a poster image to be passed into the
If this image is kept at the exact resolution of the video, the browser
doesn't need to wait for an indeterminate amount of "video metadata" to be
loaded, and can reserve the necessary space in the page layout much faster
and without any of these dreaded loading spinners. For the big
/blog page, this cuts down the minimum amount of required
resources from 69.5 MB to 3.6 MB, finally making it usable again without
waiting an eternity for the page to fully load. It's become pretty bad, so I
really had to prioritize this task before adding any more blog posts on top.
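Extracting such a first-frame image is a one-liner in ffmpeg. The helper below is a hypothetical sketch with made-up file names, assuming the result ends up in the standard poster attribute of the video element; the output image format follows from the file extension of the destination.

```python
from typing import List


def poster_command(src: str, dst: str) -> List[str]:
    """ffmpeg invocation that writes only the first video frame of `src` to `dst`."""
    return ["ffmpeg", "-y", "-i", src, "-frames:v", "1", dst]
```

Since no scaling filter is involved, the image keeps the exact resolution of the video, which is the property the layout trick above depends on.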
That leaves the player itself, which is basically a sum of lots of little
implementation challenges. Single-frame stepping and seeking to discrete
frames is the biggest one of them, as it's technically
not possible within the
<video> tag, which only
returns the current time as a continuous value in seconds. It only sort
of works for us because the backend can pass the necessary FPS and frame
count values to the frontend. These allow us to place a discrete grid of
frame "frets" at regular intervals, and thus establish a consistent mapping
from frames to seconds and back. The only drawback here is a noticeably
weird jump back by one frame when pausing a video within the second half of
a frame, caused by snapping the continuous time in seconds back onto the
frame grid in order to maintain a consistent frame counter. But the whole
feature of frame-based seeking more than makes up for that.
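The frame grid boils down to two small conversions. The sketch below is written in Python for illustration only (the real player runs in the browser) and assumes that snapping rounds down to the start of the current frame; the function names and the small epsilon against floating-point round-trip errors are my own additions.

```python
import math

EPSILON = 1e-6  # assumption: guards against float error in frame -> seconds -> frame


def frame_to_seconds(frame: int, fps: float) -> float:
    """Start time of a frame on the grid."""
    return frame / fps


def seconds_to_frame(seconds: float, fps: float, frame_count: int) -> int:
    """Index of the frame that contains the given time, clamped to the video."""
    frame = math.floor(seconds * fps + EPSILON)
    return max(0, min(frame, frame_count - 1))


def snap(seconds: float, fps: float, frame_count: int) -> float:
    """Move a continuous time back onto the frame grid."""
    return frame_to_seconds(seconds_to_frame(seconds, fps, frame_count), fps)
```

With snapping defined like this, pausing anywhere inside a frame moves the time back to that frame's start, and converting any frame number to seconds and back always yields the same frame number.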
The new scrubbable timeline might be even nicer to use with a mouse or a finger than just letting a video play regularly. With all the tuning work I put into keyframes, seeking is buttery smooth, and much better than the built-in
<video> UI of either Chrome or Firefox.
Unfortunately, it still costs a whole lot of CPU, but I'd say it's worth it.
Finally, the new player also has a few features that might not be immediately obvious:
- Keybindings for almost everything you might want them for, indicated by hovering on top of each button. The tab switchers additionally support the ↑ Up and ↓ Down keys to cycle through all tabs, or the number keys to jump to a specific tab. Couldn't find a way to indicate these mappings in the UI yet.
- Per-video captions now reserve the maximum height of any caption in the
layout. This prevents layout reflows when switching through such videos,
which previously caused quite annoying lag on the big
- Useful fullscreen modes on both desktop and mobile, including all
markers and the video caption. Firefox made this harder than it needed to
be, and if it weren't for
display: contents, the implementation would have been even worse. In the end though, we didn't even need any video pixel sizes from the backend – just as it should be…
- … and supporting Firefox was definitely worth it, as it's the only browser to support nearest-neighbor interpolation on videos.
- As some of the Unicode codepoints on the buttons aren't covered by the
default fonts of some operating systems, I've taken them from the Catrinity font, licensed under the SIL
Open Font License. With all
the edits I did on this font, that license definitely was necessary. I
hope I applied it correctly though; it's not straightforward at all how to
properly license a
Modified Version of an original font with a
Reserved Font Name.
And with that, development hell is over, and I finally get to return to the core business! Just more than one month late. Next up: Shipping the oldest still pending order, covering the TH04/TH05 ending script format. Meanwhile, the Seihou community also wants to keep investing in Shuusou Gyoku, so we're also going to see more of that on the side.