The two videos I was using Whisper on have been published, so you can see for yourself how the captioning worked out. Designed as trade-show booth reel pieces, they’re below three minutes each, so watching both should take less than ten minutes, even with pauses to scrutinize specific bits of captioning.
As I noted in my previous post about this, I only had to make one text correction to the second video, plus a quick find-and-replace to turn “WPE WebKit” into “WPEWebKit”. For the first video, I did make a couple of edits beyond fixing transcription errors; specifically, I added the dashes and line breaking in this part of the final SubRip Subtitle (SRT) file uploaded to YouTube:
00:00:25,000 --> 00:00:32,000
- Hey tell me, is Michael coming out?
- Affirmative, Mike's coming out.
This small snippet actually embodies the two things where Whisper falls down a bit: multiple voices, and caption line lengths.
Right now, Whisper doesn’t even try to distinguish between different voices, the technical term for which is “speaker diarisation”. This makes Whisper ideal for transcribing, say, a conference talk or a single-narrator video. It’s a lot less useful for things like podcasts, because while it will probably get (nearly) all the words right, it won’t even throw in a marker that the voice changed, let alone try to tell which bits belong to a given voice. You have to go into the output and add those yourself, which for an hourlong podcast could be… quite the task.
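As a rough sketch of that manual fix-up: if you already know which captions begin a new voice, a few lines of Python can prefix them with the dialogue dash I used in the snippet above. The function name, cue list, and indices here are all illustrative, not anything Whisper provides.

```python
# Sketch of the manual workaround: prefix a dialogue dash ("- ") to
# the captions where a new speaker starts. The cue texts and indices
# are illustrative; a real pass would also carry the SRT timestamps.
def mark_speaker_changes(cues, change_indices):
    """cues: list of caption strings;
    change_indices: 0-based positions where a new voice begins."""
    return [("- " + text) if i in change_indices else text
            for i, text in enumerate(cues)]

cues = ["Hey tell me, is Michael coming out?",
        "Affirmative, Mike's coming out."]
for line in mark_speaker_changes(cues, {0, 1}):
    print(line)
```

The tedious part, of course, is not the prefixing but deciding where the voice changes, which is exactly the listening-and-marking work diarisation would automate.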
There are requests to add this scattered through Whisper’s GitHub discussions, but I didn’t see any open pull requests or mention of it in the README, so I don’t know if it’s coming or not. If you know, please leave a comment!
As for the length of captions, I agree with J David Eisenberg: Whisper too frequently errs on the side of “too long”. For example, here’s one of the bits Whisper output:
00:01:45,000 --> 00:01:56,000
Here is the dash.js player using MSE, running in a page, and using Widevine DRM to decrypt and play rights-managed video with EME, all fluidly.
That’s eleven seconds of static subtitling, with a single line 143 characters long. The BBC recommends line lengths at or below 37 characters, and Netflix suggests a limit of 42 characters, with actual hard limits for a few languages. You can throw in line breaks to reduce line length, but should never have more than three lines per caption, and even three lines at 42 characters tops out at 126, so 143 won’t fit. But let’s be real, that 11-second caption really should be split in twain, at the absolute minimum.
Whisper does not, as of yet, have a way to request limiting caption lengths, either in time or in text. There is a fairly detailed discussion of this over on Whisper’s repository, with some code graciously shared by people working to address this, but it would be a lot better if Whisper accepted an argument to limit the length of any given bit of output. And also if it threw in line breaks on its own, say around 40 characters in English, even when not requested.
The last thing I’d like to see improved is speed. It’s not terribly slow as is, to be clear. Using the default model size (small), which is what I used for the videos I wrote about, Whisper worked at about 2:1 speed: a two-minute video took about a minute to process. I tried the next size up, the medium model, and it worked at roughly 1:1.5 speed, taking about an hour fifteen to process a 46-minute video.
The thing is, all that is running solely on the CPU, which in my case is a 12-core M2. According to this pull request, problems in one of Whisper’s dependencies, PyTorch, mean GPU utilization is essentially unavailable on the hardware I have. (Thanks to Chris Adams for the pointer.) I expect that will be cleared up sooner or later, so the limitation feels minor.
Overall, it’s a powerful tool, with accuracy I still find astounding, only coming up short in quality-of-life features that aren’t critical in some applications (transcribing a talk) or relatively easily worked around in others (hand-correcting caption length in short videos; using a small script to insert line breaks in longer videos). The lack of speaker diarisation is the real letdown for me, and definitely the hardest to work around, so I hope it gets addressed soon.