Peerless Whisper

Published 11 months, 4 days past

What happened was, I was hanging out in an online chatter channel when a little birdy named Bruce chirped about OpenAI’s Whisper and how he was using it to transcribe audio.  And I thought, Hey, I have audio that needs to be transcribed.  Brucie Bird also mentioned it would output text, SRT, and WebVTT formats, and I thought, Hey, I have videos I’ll need to upload with transcription to YouTube!  And then he said you could run it from the command line, and I thought, Hey, I have a command line!

So off I went to install it and try it out, and immediately ran smack into some hurdles I thought I’d document here in case someone else has similar problems.  All of this took place on my M2 MacBook Pro, though I believe most of the below should be relevant to anyone trying to do this at the command line.

The first thing I did was what the GitHub repository’s README recommended, which is:

$ pip install -U openai-whisper

That failed because I didn’t have pip installed.  Okay, fair enough.  I figured out how to install that, setting up an alias of python for python3 along the way, and then tried again.  This time, the install started and then bombed out:

Collecting openai-whisper
  Using cached openai-whisper-20230314.tar.gz (792 kB)
  Installing build dependencies ...  done
  Getting requirements to build wheel ...  done
  Preparing metadata (pyproject.toml) ...  done
Collecting numba
  Using cached numba-0.56.4.tar.gz (2.4 MB)
  Preparing metadata (setup.py) ...  error
  error: subprocess-exited-with-error

…followed by some stack trace stuff, none of which was really useful until ten or so lines down, where I found:

RuntimeError: Cannot install on Python version 3.11.2; only versions >=3.7,<3.11 are supported.

In other words, the version of Python I have installed is too modern to run AI.  What a world.

I DuckDucked around a bit and hit upon pyenv, which is I guess a way of installing and running older versions of Python without having to overwrite whatever version(s) you already have.  I’ll skip over the error part of my trial-and-error process and give you the commands that made it all work:

$ brew install pyenv

$ pyenv install 3.10

$ PATH="~/.pyenv/shims:${PATH}"

$ pyenv local 3.10

$ pip install -U openai-whisper

That got Whisper to install.  It didn’t take very long.

At that point, I wondered what I’d have to configure to transcribe something, and the answer turned out to be precisely zilch.  Once the install was done, I dropped into the directory containing my MP4 video, and typed this:

$ whisper wpe-mse-eme-v2.mp4

Here’s what I got back.  I’ve marked the very few errors.

[00:00.000 --> 00:07.000]  In this video, we'll show you several demos showcasing multi-media capabilities in WPE WebKit,
[00:07.000 --> 00:11.000]  the official port of the WebKit engine for embedded devices.
[00:11.000 --> 00:18.000]  Each of these demos are running on the low-powered Raspberry Pi 3 seen in the lower right-hand side of the screen here.
[00:18.000 --> 00:25.000]  Infotainment systems and media players often need to consume digital rights-managed videos.
[00:25.000 --> 00:32.000]  They tell me, is Michael coming out?  Affirmative, Mike's coming out.
[00:32.000 --> 00:45.000]  Here you can see just that, smooth streaming playback using encrypted media extensions, or EME, with PlayReady 4.
[00:45.000 --> 00:52.000]  Media source extensions, or MSE, are used by many players for greater control over playback.
[00:52.000 --> 01:00.000]  YouTube TV has a whole conformance test suite for this, which WPE has been passing since 2021.
[01:00.000 --> 01:09.000]  The loan exceptions here are those tests requiring hardware support not available on the Raspberry Pi 4, but available for other platforms.
[01:09.000 --> 01:16.000]  YouTube TV has a conformance test for EME, which WPE WebKit passes with flying colors.
[01:22.000 --> 01:40.000]  Music
[01:40.000 --> 01:45.000]  Finally, perhaps most impressively, we can put all these things together.
[01:45.000 --> 01:56.000]  Here is the dash.js player using MSE, running in a page, and using Widevine DRM to decrypt and play rights-managed video with EME all fluidly.
[01:56.000 --> 02:04.000]  Music
[02:04.000 --> 02:09.000]  Remember, all of this is being played back on the same low-powered Raspberry Pi 3.
[02:27.000 --> 02:34.000]  For more about WPE WebKit, please visit WPE WebKit.com.
[02:34.000 --> 02:42.000]  For more information about EGALIA, or to find out how we can help with your embedded device needs, please visit us at EGALIA.com.  

I am, frankly, astonished.  This has no business being as accurate as it is, for all kinds of reasons.  There’s a lot of jargon and very specific terminology in there, and Whisper nailed pretty much every last bit of it, first time in, no special configuration, nothing.  I didn’t even bump up the model size from the default of small.  I felt a little like that Froyo guy in the animated Hunchback of Notre Dame meme yelling about sorcery or whatever.

True, the output isn’t absolutely perfect.  Let’s review the glitches in reverse order.  The last two errors, turning “Igalia” into “EGALIA”, seems fair enough given I didn’t specify that there would be languages other than English involved.  I routinely have to spell it for my fellow Americans, so no reason to think a codebase could do any better.

The space inserted into “WPEWebKit” (which happens throughout) is similarly understandable.  I’m impressed it understood “WebKit” at all, never mind that it was properly capitalized and not-spaced.

The place where it says Music and I marked it as an error: This is essentially an echoing countdown and then a white-noise roar from rocket engines.  There’s a “music today is just noise” joke in here somewhere, but I’m too hip to find it.

Whisper turning “lone” into “loan” doesn’t particularly faze me, given the difficulty of handling soundalike words.  Hell, just yesterday, I was scribing a conference call and mistakenly recorded “gamut” as “gamma”, and those aren’t even technically homophones.  They just sound like they are.

Rounding out the glitch tour, “Hey” got turned into “They”, which (given the audio quality of that particular part of the video) is still pretty good.

There is one other error I couldn’t mark because there’s nothing to mark, but if you scrutinize the timeline, you’ll see a gap from 02:09.000 and 02:27.000.  In there, a short clip from a movie plays, and there’s a brief dialogue between two characters in not-very-Dutch-accented English there.  It’s definitely louder and more clear than the 00:25.000 –> 00:32.000 bit, so I’m not sure why Whisper just skipped over it.  Manually transcribing that part isn’t a big deal, but it’s odd to see it perform so flawlessly on every other piece of speech and then drop this completely on the floor.

Before posting, I decided to give Whisper another go, this time on a different video:

$ whisper wpe-gamepad-support-v3.mp4

This was the result, with the one actual error marked:

[00:00.000 --> 00:13.760]  In this video, we demonstrate WPE WebKit's support for the W3C's GamePad API.
[00:13.760 --> 00:20.080]  Here we're running WPE WebKit on a Raspberry Pi 4, but any device that will run WPE WebKit
[00:20.080 --> 00:22.960]  can benefit from this support.
[00:22.960 --> 00:28.560]  The GamePad API provides a JavaScript interface that makes it possible for developers to access
[00:28.560 --> 00:35.600]  and respond to signals from GamePads and other game controllers in a simple, consistent way.
[00:35.600 --> 00:40.320]  Having connected a standard Xbox controller, we boot up the Raspberry Pi with a customized
[00:40.320 --> 00:43.040]  build route image.
[00:43.040 --> 00:48.560]  Once the device is booted, we run cog, which is a small, single window launcher made specifically
[00:48.560 --> 00:51.080]  for WPE WebKit.
[00:51.080 --> 00:57.360]  The window cog creates can be full screen, which is what we're doing here.
[00:57.360 --> 01:01.800]  The game is loaded from a website that hosts a version of the classic video arcade game
[01:01.800 --> 01:05.480]  Asteroids.
[01:05.480 --> 01:11.240]  Once the game has loaded, the Xbox controller is used to start the game and control the spaceship.
[01:11.240 --> 01:17.040]  All the GamePad inputs are handled by the JavaScript GamePad API.
[01:17.040 --> 01:22.560]  This GamePad support is now possible thanks to work done by Igalia in 2022 and is available
[01:22.560 --> 01:27.160]  to anyone who uses WPE WebKit on their embedded device.
[01:27.160 --> 01:32.000]  For more about WPE WebKit, please visit wpewebkit.com.
[01:32.000 --> 01:35.840]  For more information about Igalia, or to find out how we can help with your embedded device
[01:35.840 --> 01:39.000]  needs, please visit us at Igalia.com.  

That should have been “buildroot”.  Again, an entirely reasonable error.  I’ve made at least an order of magnitude more typos writing this post than Whisper has in transcribing these videos.  And this time, it got the spelling of Igalia correct.  I didn’t make any changes between the two runs.  It just… figured it out.

I don’t have a lot to say about this other than, wow.  Just WOW.  This is some real Clarke’s Third Law stuff right here, and the technovertigo is Marianas deep.


Comments (5)

  1. I was just trying the same thing: brew install openai-whisper was a handy shortcut. My results weren’t as unequivocally awesome but some of the files I was working with are old field recordings which even human listeners have trouble with.

    The major drawback I found was specific to Apple Silicon, namely that the GPU acceleration is still unstable:

    https://github.com/openai/whisper/pull/382

  2. For what it’s worth, pipx does for pip what pyenv does for entire Python installations; it installs and manages them separately from each other so you don’t have to worry about version conflicts with dependencies. It’s really useful for tools and programs you just want to run because it can save headaches down the road, and if you stop using something it can remove the whole dependency tree at once without a bunch of packages left on your system.

  3. Was the title of this entry a George Michael allusion? 🙂

  4. To play around with AI tools like this but avoid the command line hijinks you went through, I tend to look for a free native app that’s wrapped it. For Whisper, I found Buzz and MacWhisper, the latter having free and paid versions (it looks like paid versions have a mix of unlocking options inherent to Whisper and actual added features).

  5. Chris, David, Curits: thank you all for your contributions! The alternate ways to install/use Whisper are much appreciated.

    I should’ve known better than to cheat a head, Aristotle.

Add Your Thoughts

Meyerweb dot com reserves the right to edit or remove any comment, especially when abusive or irrelevant to the topic at hand.

HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <em> <i> <q cite=""> <s> <strong> <pre class=""> <kbd>


if you’re satisfied with it.

Comment Preview