Peerless Whisper
Published 1 year, 8 months past
What happened was, I was hanging out in an online chatter channel when a little birdy named Bruce chirped about OpenAI’s Whisper and how he was using it to transcribe audio. And I thought, Hey, I have audio that needs to be transcribed. Brucie Bird also mentioned it would output text, SRT, and WebVTT formats, and I thought, Hey, I have videos I’ll need to upload with transcription to YouTube! And then he said you could run it from the command line, and I thought, Hey, I have a command line!
So off I went to install it and try it out, and immediately ran smack into some hurdles I thought I’d document here in case someone else has similar problems. All of this took place on my M2 MacBook Pro, though I believe most of the below should be relevant to anyone trying to do this at the command line.
The first thing I did was what the GitHub repository’s README recommended, which is:
$ pip install -U openai-whisper
That failed because I didn’t have pip installed. Okay, fair enough. I figured out how to install that, setting up an alias of python for python3 along the way, and then tried again. This time, the install started and then bombed out:
Collecting openai-whisper
Using cached openai-whisper-20230314.tar.gz (792 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting numba
Using cached numba-0.56.4.tar.gz (2.4 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
…followed by some stack trace stuff, none of which was really useful until ten or so lines down, where I found:
RuntimeError: Cannot install on Python version 3.11.2; only versions >=3.7,<3.11 are supported.
In other words, the version of Python I have installed is too modern to run AI. What a world.
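For anyone who would rather check before installing, here is a minimal sketch of that version test. It assumes the ">=3.7,<3.11" range quoted in the error above, which belongs to that particular numba release and may differ for later ones; `python_version_supported` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if a Python version string falls inside
# the ">=3.7,<3.11" range quoted in the error message (an assumption that
# only holds for the numba release named there).
python_version_supported() {
    major=${1%%.*}      # "3.11.2" -> "3"
    rest=${1#*.}        # "3.11.2" -> "11.2"
    minor=${rest%%.*}   # "11.2"   -> "11"
    [ "$major" -eq 3 ] && [ "$minor" -ge 7 ] && [ "$minor" -lt 11 ]
}

# Example: check whatever python3 is first on the PATH, if there is one.
if command -v python3 >/dev/null 2>&1; then
    current=$(python3 -c 'import platform; print(platform.python_version())')
    if python_version_supported "$current"; then
        echo "Python $current is inside the supported range"
    else
        echo "Python $current is outside the supported range"
    fi
fi
```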
I DuckDucked around a bit and hit upon pyenv, which is I guess a way of installing and running older versions of Python without having to overwrite whatever version(s) you already have. I’ll skip over the error part of my trial-and-error process and give you the commands that made it all work:
$ brew install pyenv
$ pyenv install 3.10
$ PATH="$HOME/.pyenv/shims:${PATH}"
$ pyenv local 3.10
$ pip install -U openai-whisper
That got Whisper to install. It didn’t take very long.
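One wrinkle worth noting: a PATH assignment typed at the prompt only lasts for the current shell session. Below is a minimal sketch of a version that could go in a shell startup file, assuming pyenv's default root of ~/.pyenv. `prepend_to_path` is a hypothetical helper name, not something pyenv provides.

```shell
# Hypothetical helper: put a directory first on PATH unless it is already
# listed, so sourcing this file repeatedly does not pile up duplicates.
prepend_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, leave PATH alone
        *) PATH="$1:$PATH" ;;
    esac
}

# Assumption: pyenv lives in its default location unless PYENV_ROOT says
# otherwise. $HOME is used because a ~ inside double quotes is not
# expanded when the variable is assigned.
prepend_to_path "${PYENV_ROOT:-$HOME/.pyenv}/shims"
export PATH
```

pyenv's own documentation suggests `eval "$(pyenv init -)"` for shell setup, which is probably the better long-term choice; the sketch only shows what the manual PATH edit is doing.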
At that point, I wondered what I’d have to configure to transcribe something, and the answer turned out to be precisely zilch. Once the install was done, I dropped into the directory containing my MP4 video, and typed this:
$ whisper wpe-mse-eme-v2.mp4
Here’s what I got back. I’ve marked the very few errors.
[00:00.000 --> 00:07.000] In this video, we'll show you several demos showcasing multi-media capabilities in WPE WebKit,
[00:07.000 --> 00:11.000] the official port of the WebKit engine for embedded devices.
[00:11.000 --> 00:18.000] Each of these demos are running on the low-powered Raspberry Pi 3 seen in the lower right-hand side of the screen here.
[00:18.000 --> 00:25.000] Infotainment systems and media players often need to consume digital rights-managed videos.
[00:25.000 --> 00:32.000] They tell me, is Michael coming out? Affirmative, Mike's coming out.
[00:32.000 --> 00:45.000] Here you can see just that, smooth streaming playback using encrypted media extensions, or EME, with PlayReady 4.
[00:45.000 --> 00:52.000] Media source extensions, or MSE, are used by many players for greater control over playback.
[00:52.000 --> 01:00.000] YouTube TV has a whole conformance test suite for this, which WPE has been passing since 2021.
[01:00.000 --> 01:09.000] The loan exceptions here are those tests requiring hardware support not available on the Raspberry Pi 4, but available for other platforms.
[01:09.000 --> 01:16.000] YouTube TV has a conformance test for EME, which WPE WebKit passes with flying colors.
[01:22.000 --> 01:40.000] Music
[01:40.000 --> 01:45.000] Finally, perhaps most impressively, we can put all these things together.
[01:45.000 --> 01:56.000] Here is the dash.js player using MSE, running in a page, and using Widevine DRM to decrypt and play rights-managed video with EME all fluidly.
[01:56.000 --> 02:04.000] Music
[02:04.000 --> 02:09.000] Remember, all of this is being played back on the same low-powered Raspberry Pi 3.
[02:27.000 --> 02:34.000] For more about WPE WebKit, please visit WPE WebKit.com.
[02:34.000 --> 02:42.000] For more information about EGALIA, or to find out how we can help with your embedded device needs, please visit us at EGALIA.com.
I am, frankly, astonished. This has no business being as accurate as it is, for all kinds of reasons. There’s a lot of jargon and very specific terminology in there, and Whisper nailed pretty much every last bit of it, first time in, no special configuration, nothing. I didn’t even bump up the model size from the default of small. I felt a little like that Froyo guy in the animated Hunchback of Notre Dame meme yelling about sorcery or whatever.
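For reference, the defaults can also be spelled out on the command line. This is a sketch, assuming the `--model`, `--language`, `--output_format`, and `--output_dir` options listed in Whisper's command-line help; `transcribe` and the `transcripts` directory are hypothetical names chosen for illustration.

```shell
# Hypothetical wrapper: the same transcription with the choices made
# explicit. "small" is the default model named in the post; "srt" asks
# for subtitle output only (txt and vtt are among the other formats).
transcribe() {
    whisper "$1" \
        --model small \
        --language en \
        --output_format srt \
        --output_dir "${2:-.}"
}

# Only run when whisper is installed and the video is actually here.
if command -v whisper >/dev/null 2>&1 && [ -f wpe-mse-eme-v2.mp4 ]; then
    transcribe wpe-mse-eme-v2.mp4 transcripts
fi
```

Naming the language up front skips Whisper's language detection step, and picking a larger model is a matter of changing the `--model` value, at the cost of a bigger download and slower runs.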
True, the output isn’t absolutely perfect. Let’s review the glitches in reverse order. The last two errors, turning “Igalia” into “EGALIA”, seem fair enough given I didn’t specify that there would be languages other than English involved. I routinely have to spell it for my fellow Americans, so no reason to think a codebase could do any better.
The space inserted into “WPEWebKit” (which happens throughout) is similarly understandable. I’m impressed it understood “WebKit” at all, never mind that it was properly capitalized and not-spaced.
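One thing that might help with names like these is Whisper's `--initial_prompt` option, which supplies text as a prompt for the first window of audio. The sketch below is untested against these videos and is only an assumption about what could nudge the spelling; `transcribe_with_hints` is a hypothetical name.

```shell
# Hypothetical wrapper: pass known names as an initial prompt so the model
# has seen their spelling before it starts. This is a hint, not a
# guarantee, and it may not change the output at all.
transcribe_with_hints() {
    whisper "$1" --language en --initial_prompt "$2"
}

# Only run when whisper is installed and the video is actually here.
if command -v whisper >/dev/null 2>&1 && [ -f wpe-mse-eme-v2.mp4 ]; then
    transcribe_with_hints wpe-mse-eme-v2.mp4 "Igalia, WPE WebKit, wpewebkit.com"
fi
```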
The place where it says Music and I marked it as an error: This is essentially an echoing countdown and then a white-noise roar from rocket engines. There’s a “music today is just noise” joke in here somewhere, but I’m too hip to find it.
Whisper turning “lone” into “loan” doesn’t particularly faze me, given the difficulty of handling soundalike words. Hell, just yesterday, I was scribing a conference call and mistakenly recorded “gamut” as “gamma”, and those aren’t even technically homophones. They just sound like they are.
Rounding out the glitch tour, “Hey” got turned into “They”, which (given the audio quality of that particular part of the video) is still pretty good.
There is one other error I couldn’t mark because there’s nothing to mark, but if you scrutinize the timeline, you’ll see a gap between 02:09.000 and 02:27.000. In there, a short clip from a movie plays, and there’s a brief dialogue between two characters in not-very-Dutch-accented English. It’s definitely louder and more clear than the 00:25.000 --> 00:32.000 bit, so I’m not sure why Whisper just skipped over it. Manually transcribing that part isn’t a big deal, but it’s odd to see it perform so flawlessly on every other piece of speech and then drop this completely on the floor.
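Silent drops like this are easy to miss by eye, so here is a small sketch that scans a saved transcript for holes in the timeline. It assumes one `[MM:SS.mmm --> MM:SS.mmm] text` entry per line, as in the console output, and ignores anything else (including timestamps with an hours field). `find_gaps` and `transcript.log` are hypothetical names.

```shell
# Hypothetical helper: read Whisper-style timestamped lines on stdin and
# print every gap between consecutive entries longer than $1 seconds
# (default 5).
find_gaps() {
    awk -v min="${1:-5}" '
        # Convert "MM:SS.mmm" to seconds.
        function secs(t,   p) { split(t, p, ":"); return p[1] * 60 + p[2] }

        /^\[[0-9]+:[0-9.]+ --> [0-9]+:[0-9.]+\]/ {
            start_stamp = substr($1, 2)             # drop the leading "["
            end_stamp = $3
            sub(/\]$/, "", end_stamp)               # drop the trailing "]"
            start = secs(start_stamp)
            if (seen && start - prev_end > min)
                printf "gap: %s --> %s (%.0f s)\n", prev_stamp, start_stamp, start - prev_end
            prev_end = secs(end_stamp)
            prev_stamp = end_stamp
            seen = 1
        }
    '
}

# Example: report gaps longer than 10 seconds in a saved transcript.
if [ -f transcript.log ]; then
    find_gaps 10 < transcript.log
fi
```

Run against the first transcript with a 10-second threshold, this should flag the hole between 02:09.000 and 02:27.000. A lower threshold would also catch shorter pauses, which are usually just the video being quiet.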
Before posting, I decided to give Whisper another go, this time on a different video:
$ whisper wpe-gamepad-support-v3.mp4
This was the result, with the one actual error marked:
[00:00.000 --> 00:13.760] In this video, we demonstrate WPE WebKit's support for the W3C's GamePad API.
[00:13.760 --> 00:20.080] Here we're running WPE WebKit on a Raspberry Pi 4, but any device that will run WPE WebKit
[00:20.080 --> 00:22.960] can benefit from this support.
[00:22.960 --> 00:28.560] The GamePad API provides a JavaScript interface that makes it possible for developers to access
[00:28.560 --> 00:35.600] and respond to signals from GamePads and other game controllers in a simple, consistent way.
[00:35.600 --> 00:40.320] Having connected a standard Xbox controller, we boot up the Raspberry Pi with a customized
[00:40.320 --> 00:43.040] build route image.
[00:43.040 --> 00:48.560] Once the device is booted, we run cog, which is a small, single window launcher made specifically
[00:48.560 --> 00:51.080] for WPE WebKit.
[00:51.080 --> 00:57.360] The window cog creates can be full screen, which is what we're doing here.
[00:57.360 --> 01:01.800] The game is loaded from a website that hosts a version of the classic video arcade game
[01:01.800 --> 01:05.480] Asteroids.
[01:05.480 --> 01:11.240] Once the game has loaded, the Xbox controller is used to start the game and control the spaceship.
[01:11.240 --> 01:17.040] All the GamePad inputs are handled by the JavaScript GamePad API.
[01:17.040 --> 01:22.560] This GamePad support is now possible thanks to work done by Igalia in 2022 and is available
[01:22.560 --> 01:27.160] to anyone who uses WPE WebKit on their embedded device.
[01:27.160 --> 01:32.000] For more about WPE WebKit, please visit wpewebkit.com.
[01:32.000 --> 01:35.840] For more information about Igalia, or to find out how we can help with your embedded device
[01:35.840 --> 01:39.000] needs, please visit us at Igalia.com.
That should have been “buildroot”. Again, an entirely reasonable error. I’ve made at least an order of magnitude more typos writing this post than Whisper has in transcribing these videos. And this time, it got the spelling of Igalia correct. I didn’t make any changes between the two runs. It just… figured it out.
I don’t have a lot to say about this other than, wow. Just WOW. This is some real Clarke’s Third Law stuff right here, and the technovertigo is Marianas deep.