Posts in the Technovertigo Category

Peerless Whisper

Published 1 year, 9 months past

What happened was, I was hanging out in an online chatter channel when a little birdy named Bruce chirped about OpenAI’s Whisper and how he was using it to transcribe audio.  And I thought, Hey, I have audio that needs to be transcribed.  Brucie Bird also mentioned it would output text, SRT, and WebVTT formats, and I thought, Hey, I have videos I’ll need to upload with transcription to YouTube!  And then he said you could run it from the command line, and I thought, Hey, I have a command line!

So off I went to install it and try it out, and immediately ran smack into some hurdles I thought I’d document here in case someone else has similar problems.  All of this took place on my M2 MacBook Pro, though I believe most of the below should be relevant to anyone trying to do this at the command line.

The first thing I did was what the GitHub repository’s README recommended, which is:

$ pip install -U openai-whisper

That failed because I didn’t have pip installed.  Okay, fair enough.  I figured out how to install that, setting up an alias of python for python3 along the way, and then tried again.  This time, the install started and then bombed out:

Collecting openai-whisper
  Using cached openai-whisper-20230314.tar.gz (792 kB)
  Installing build dependencies ...  done
  Getting requirements to build wheel ...  done
  Preparing metadata (pyproject.toml) ...  done
Collecting numba
  Using cached numba-0.56.4.tar.gz (2.4 MB)
  Preparing metadata (setup.py) ...  error
  error: subprocess-exited-with-error

…followed by some stack trace stuff, none of which was really useful until ten or so lines down, where I found:

RuntimeError: Cannot install on Python version 3.11.2; only versions >=3.7,<3.11 are supported.

In other words, the version of Python I have installed is too modern to run AI.  What a world.
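That constraint reads as a half-open range: 3.7 is allowed, but 3.11 and anything above it is not.  As a rough sketch of the comparison (in Python; this is not numba's actual check, and the function names are made up for illustration), treating versions as tuples of integers is enough to show why 3.11.2 gets rejected:

```python
def parse_version(text):
    """Turn a dotted version string such as '3.11.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_supported(version, minimum="3.7", ceiling="3.11"):
    """Return True when minimum <= version < ceiling.

    Simplified on purpose: it only understands plain dotted numbers,
    not the full range of version specifiers that pip accepts.
    """
    candidate = parse_version(version)
    return parse_version(minimum) <= candidate < parse_version(ceiling)


print(is_supported("3.11.2"))  # False: (3, 11, 2) is not below (3, 11)
print(is_supported("3.10.9"))  # True
```

Tuples compare element by element, so (3, 11, 2) sorts after (3, 11) and falls outside the range, while any 3.10 release sits inside it.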

I DuckDucked around a bit and hit upon pyenv, which is I guess a way of installing and running older versions of Python without having to overwrite whatever version(s) you already have.  I’ll skip over the error part of my trial-and-error process and give you the commands that made it all work:

$ brew install pyenv

$ pyenv install 3.10

$ PATH="$HOME/.pyenv/shims:${PATH}"

$ pyenv local 3.10

$ pip install -U openai-whisper

That got Whisper to install.  It didn’t take very long.

At that point, I wondered what I’d have to configure to transcribe something, and the answer turned out to be precisely zilch.  Once the install was done, I dropped into the directory containing my MP4 video, and typed this:

$ whisper wpe-mse-eme-v2.mp4

Here’s what I got back.  I’ve marked the very few errors.

[00:00.000 --> 00:07.000]  In this video, we'll show you several demos showcasing multi-media capabilities in WPE WebKit,
[00:07.000 --> 00:11.000]  the official port of the WebKit engine for embedded devices.
[00:11.000 --> 00:18.000]  Each of these demos are running on the low-powered Raspberry Pi 3 seen in the lower right-hand side of the screen here.
[00:18.000 --> 00:25.000]  Infotainment systems and media players often need to consume digital rights-managed videos.
[00:25.000 --> 00:32.000]  They tell me, is Michael coming out?  Affirmative, Mike's coming out.
[00:32.000 --> 00:45.000]  Here you can see just that, smooth streaming playback using encrypted media extensions, or EME, with PlayReady 4.
[00:45.000 --> 00:52.000]  Media source extensions, or MSE, are used by many players for greater control over playback.
[00:52.000 --> 01:00.000]  YouTube TV has a whole conformance test suite for this, which WPE has been passing since 2021.
[01:00.000 --> 01:09.000]  The loan exceptions here are those tests requiring hardware support not available on the Raspberry Pi 4, but available for other platforms.
[01:09.000 --> 01:16.000]  YouTube TV has a conformance test for EME, which WPE WebKit passes with flying colors.
[01:22.000 --> 01:40.000]  Music
[01:40.000 --> 01:45.000]  Finally, perhaps most impressively, we can put all these things together.
[01:45.000 --> 01:56.000]  Here is the dash.js player using MSE, running in a page, and using Widevine DRM to decrypt and play rights-managed video with EME all fluidly.
[01:56.000 --> 02:04.000]  Music
[02:04.000 --> 02:09.000]  Remember, all of this is being played back on the same low-powered Raspberry Pi 3.
[02:27.000 --> 02:34.000]  For more about WPE WebKit, please visit WPE WebKit.com.
[02:34.000 --> 02:42.000]  For more information about EGALIA, or to find out how we can help with your embedded device needs, please visit us at EGALIA.com.  

I am, frankly, astonished.  This has no business being as accurate as it is, for all kinds of reasons.  There’s a lot of jargon and very specific terminology in there, and Whisper nailed pretty much every last bit of it, first time in, no special configuration, nothing.  I didn’t even bump up the model size from the default of small.  I felt a little like that Froyo guy in the animated Hunchback of Notre Dame meme yelling about sorcery or whatever.

True, the output isn’t absolutely perfect.  Let’s review the glitches in reverse order.  The last two errors, turning “Igalia” into “EGALIA”, seem fair enough given I didn’t specify that there would be languages other than English involved.  I routinely have to spell it for my fellow Americans, so no reason to think a codebase could do any better.

The space inserted into “WPEWebKit” (which happens throughout) is similarly understandable.  I’m impressed it understood “WebKit” at all, never mind that it was properly capitalized and not-spaced.

The place where it says Music and I marked it as an error: This is essentially an echoing countdown and then a white-noise roar from rocket engines.  There’s a “music today is just noise” joke in here somewhere, but I’m too hip to find it.

Whisper turning “lone” into “loan” doesn’t particularly faze me, given the difficulty of handling soundalike words.  Hell, just yesterday, I was scribing a conference call and mistakenly recorded “gamut” as “gamma”, and those aren’t even technically homophones.  They just sound like they are.

Rounding out the glitch tour, “Hey” got turned into “They”, which (given the audio quality of that particular part of the video) is still pretty good.

There is one other error I couldn’t mark because there’s nothing to mark, but if you scrutinize the timeline, you’ll see a gap from 02:09.000 to 02:27.000.  In there, a short clip from a movie plays, with a brief dialogue between two characters in not-very-Dutch-accented English.  It’s definitely louder and more clear than the 00:25.000 --> 00:32.000 bit, so I’m not sure why Whisper just skipped over it.  Manually transcribing that part isn’t a big deal, but it’s odd to see it perform so flawlessly on every other piece of speech and then drop this completely on the floor.
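Spotting a hole like that by eye gets tedious on longer videos.  Here is a minimal sketch, in Python, of scanning the console output for stretches that no segment covers.  The function names and the five-second threshold are assumptions made for illustration, not anything Whisper provides, and the pattern only handles the MM:SS.mmm timestamps shown above:

```python
import re

# Matches console lines such as "[02:04.000 --> 02:09.000]  Remember, ..."
LINE = re.compile(r"\[(\d+):(\d{2})\.(\d{3}) --> (\d+):(\d{2})\.(\d{3})\]")


def to_ms(minutes, seconds, millis):
    """Convert the captured timestamp pieces to integer milliseconds."""
    return (int(minutes) * 60 + int(seconds)) * 1000 + int(millis)


def find_gaps(transcript, threshold_ms=5000):
    """Return (gap_start_ms, gap_end_ms) pairs between consecutive segments."""
    gaps = []
    previous_end = None
    for line in transcript.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped segment
        parts = match.groups()
        start, end = to_ms(*parts[:3]), to_ms(*parts[3:])
        if previous_end is not None and start - previous_end >= threshold_ms:
            gaps.append((previous_end, start))
        previous_end = end
    return gaps


sample = """\
[01:56.000 --> 02:04.000]  Music
[02:04.000 --> 02:09.000]  Remember, all of this is being played back on the same low-powered Raspberry Pi 3.
[02:27.000 --> 02:34.000]  For more about WPE WebKit, please visit WPE WebKit.com.
"""

print(find_gaps(sample))  # [(129000, 147000)]
```

The one reported pair is the 02:09 to 02:27 stretch, expressed in milliseconds.  A gap only says that nothing was transcribed there, not whether anything was actually said.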

Before posting, I decided to give Whisper another go, this time on a different video:

$ whisper wpe-gamepad-support-v3.mp4

This was the result, with the one actual error marked:

[00:00.000 --> 00:13.760]  In this video, we demonstrate WPE WebKit's support for the W3C's GamePad API.
[00:13.760 --> 00:20.080]  Here we're running WPE WebKit on a Raspberry Pi 4, but any device that will run WPE WebKit
[00:20.080 --> 00:22.960]  can benefit from this support.
[00:22.960 --> 00:28.560]  The GamePad API provides a JavaScript interface that makes it possible for developers to access
[00:28.560 --> 00:35.600]  and respond to signals from GamePads and other game controllers in a simple, consistent way.
[00:35.600 --> 00:40.320]  Having connected a standard Xbox controller, we boot up the Raspberry Pi with a customized
[00:40.320 --> 00:43.040]  build route image.
[00:43.040 --> 00:48.560]  Once the device is booted, we run cog, which is a small, single window launcher made specifically
[00:48.560 --> 00:51.080]  for WPE WebKit.
[00:51.080 --> 00:57.360]  The window cog creates can be full screen, which is what we're doing here.
[00:57.360 --> 01:01.800]  The game is loaded from a website that hosts a version of the classic video arcade game
[01:01.800 --> 01:05.480]  Asteroids.
[01:05.480 --> 01:11.240]  Once the game has loaded, the Xbox controller is used to start the game and control the spaceship.
[01:11.240 --> 01:17.040]  All the GamePad inputs are handled by the JavaScript GamePad API.
[01:17.040 --> 01:22.560]  This GamePad support is now possible thanks to work done by Igalia in 2022 and is available
[01:22.560 --> 01:27.160]  to anyone who uses WPE WebKit on their embedded device.
[01:27.160 --> 01:32.000]  For more about WPE WebKit, please visit wpewebkit.com.
[01:32.000 --> 01:35.840]  For more information about Igalia, or to find out how we can help with your embedded device
[01:35.840 --> 01:39.000]  needs, please visit us at Igalia.com.  

That should have been “buildroot”.  Again, an entirely reasonable error.  I’ve made at least an order of magnitude more typos writing this post than Whisper has in transcribing these videos.  And this time, it got the spelling of Igalia correct.  I didn’t make any changes between the two runs.  It just… figured it out.

I don’t have a lot to say about this other than, wow.  Just WOW.  This is some real Clarke’s Third Law stuff right here, and the technovertigo is Marianas deep.


Glasshouse

Published 11 years, 9 months past

Our youngest tends to wake up fairly early in the morning, at least as compared to his sisters, and since I need less sleep than Kat I’m usually the one who gets up with him.  This morning, he put away a box he’d just emptied of toys and I told him, “Well done!”  He turned to me, stuck his hand up in the air, and said with glee, “Hive!”

I gave him the requested high-five, of course, and then another for being proactive.  It was the first time he’d ever asked for one.  He could not have looked more pleased with himself.

And I suddenly realized that I wanted to be able to say to my glasses, “Okay, dump the last 30 seconds of livestream to permanent storage.”

There have been concerns raised about the impending crowdsourced panopticon that Google Glass represents.  I share those concerns, though I also wonder if the pairing of constant individual surveillance with cloud-based storage mediated through wearable CPUs will prove out an old if slightly recapitalized adage: that an ARMed society is a polite society.  Will it?  We’ll see — pun unintentional but unavoidable, very much like the future itself.

And yet.  You think that you’ll remember all those precious milestones, that there is no way on Earth you could ever forget your child’s first word, or the time they took their first steps, or the time they suddenly put on an impromptu comedy show that had you on the floor laughing.  But you do forget.  Time piles up and you forget most of everything that ever happened to you.  A few shining moments stay preserved, and the rest fade into the indistinct fog of your former existence.

I’m not going to hold up my iPhone or Android or any other piece of hardware all the time, hoping that I’ll manage to catch a few moments to save.  That solution doesn’t scale at all, but I still want to save those moments.  If my glasses (or some other device) were always capturing a video buffer that could be dumped to permanent storage at any time, I could capture all of those truly important things.  I could go back and see that word, that step, that comedy show.  I would do that.  I wanted to do it, sitting on the floor of my child’s room this morning.
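The “always capturing, dump on request” idea is essentially a rolling buffer: keep the most recent stretch of frames and let older ones fall away.  A minimal sketch in Python, assuming frames arrive one at a time at a fixed rate; the class and method names are hypothetical and have nothing to do with any real Glass API:

```python
from collections import deque


class RollingBuffer:
    """Hold only the most recent `seconds` worth of frames."""

    def __init__(self, seconds, frames_per_second):
        # A deque with maxlen silently discards the oldest item when full.
        self.frames = deque(maxlen=seconds * frames_per_second)

    def capture(self, frame):
        """Record one frame, pushing out the oldest if the buffer is full."""
        self.frames.append(frame)

    def dump(self):
        """Return a copy of what is currently buffered, oldest first."""
        return list(self.frames)


# Stand-in "frames": integers at one frame per second for 100 seconds.
buffer = RollingBuffer(seconds=30, frames_per_second=1)
for frame in range(100):
    buffer.capture(frame)

saved = buffer.dump()
print(len(saved), saved[0], saved[-1])  # 30 70 99
```

Memory use stays fixed no matter how long the capture runs, which is what makes “the last 30 seconds” cheap to keep around until someone asks for it.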

That was when I realized that Glass is inevitable.  We’re going to observe each other because we want to preserve our own lives — not every last second, but the parts that really matter to us.  There will be a whole host of side effects, some of which we can predict but most of which will surprise us.  I just don’t believe that we can avoid it.  Even if Google fails with Glass, someone else will succeed with a very similar project, and sooner than we expect.  I’ve started thinking about how to cope with that outcome.  Have you?


Connected

Published 19 years, 7 months past

Last fall, Tantek and I presented a poster at HT04.  To get it to the conference in one piece and to avoid having to lug it across the country, I created a PDF of the poster and sent it off to the Kinko’s web site.  It was printed for me by the Kinko’s closest to the conference.  All I had to do was send them a digital file, and 2,150 miles later I retrieved the physical output.

As I did so, I thought: This is really amazing.  This is what’s so great about being connected.

A few months later, Kat upgraded her car, and the new one came with XM digital radio.  We started receiving music from geosynchronous orbit, a digital signal broadcast from 22,600 miles above the equator and deciphered by the short, stubby antenna on the car’s roof.  On a drive to visit relatives, we listened to the same station for the entire four-hour drive there, and again for the return drive.

As we did so, I thought: This is incredible.  This is a great example of the benefits of connecting everything.

I was wrong in both cases.

This morning, I stood in a hotel room in Chiba, Japan and saw my wife and daughter on the television.  Back home in Cleveland, they saw me on a computer monitor.  We talked to each other, waved hello, got caught up on recent events.  I watched as Carolyn ran around my office, heard her say “mama”, and agreed with her when she signed “telephone” while she watched my image.  I stuck my tongue out to make a silly face, and six thousand miles away, my daughter laughed with delight at my antics.

A few minutes after we’d finished the chat, with the glow of home and family still warm upon me, I thought: This is why we connected everything in the first place.


Wakka Wakka Doo Doo Yeah!

Published 20 years, 7 months past

I spotted a link to PacManhattan over at SimpleBits, and was immediately stunned.  I mean, sure, it’s like an episode of “When Geeks Go Crazy,” what with the use of cellular and WiFi communications to update player positions, and the Web-based arcade view of a live game in progress, but think about it.  These people are running around entire blocks of New York City just to play a live-action version of a 1980s video game.  They’re actually getting exercise.  They won’t just be toning their wrists; this is a total-body workout.  That’s so not geeky.

Can’t you just hear the guy playing PacMan trying to cross the street?  “Hey!  I’m wakka wakka wakkin’ here!”


Concerts… On A Steeck

Published 20 years, 7 months past

This proceeds past cool, tears through ultra-cool, and lands somewhere to the west of übercool:  taking home a recording of a live show on a USB memory stick the same night you heard the show.  And it’s legal!  One wonders how much money the band gets for sales of their show.  If I were a club owner, I’d split my take with the band 50/50, but then that’s just me.

Maybe the recording industry could stop whining about piracy and bootlegging long enough to examine some of these new approaches to helping fans get the music they want, and spend some time thinking about how they could do the same kind of thing.  Nahhh… that would make too much sense.


Spooky!

Published 21 years, 4 months past

Remember I mentioned the “ZARGON” license plate?  Gail Cohen wrote me from Miami, Florida to tell me whose car that was.  His name’s Rex.  I’ve talked about moments of technological vertigo (technovertigo? technologigo?  technigo?) in the past.  This is another such moment.

So apparently Gail and Rex are both members of the International Association of Haunted Attractions, and his hobby is being an interactive actor in haunted houses.  You know the guys who jump out at you with goalie masks and chain saws?  He’s one of them.  Oh, heck, take a look for yourself.  So it turns out that terror really is his business, at least as a hobby, and I suppose it does take guts.  Just not the kind I meant.


The Nature of Progress

Published 21 years, 10 months past

A redesigned Netscape DevEdge has been launched.  Look, ma, no tables.  Well, hardly any, and none in the basic design.  I was a primary project manager for this one, and the design is a from-scratch effort.  It’s nothing visually groundbreaking, and of course using positioning for a major site has been done, but we’ve gone a step further into using positioning to make the design come together.  The site didn’t quite validate at launch thanks to some deeply stupid oversights on my part, but hopefully they’ll have been fixed by the time you read this entry.

As for the design approach we took… that’s a subject for another day, and also the subject of an article I wrote.  I predict that we’ll draw fire for using HTML 4.01 Transitional, for not validating when we launched, for our font sizing approach, and for our dropdown menus.  On the other hand, we’ll probably draw praise for making the markup accessible (once one of my stupid mistakes is fixed), for using CSS in a sophisticated manner, for pushing the envelope in reasonable ways, and for our dropdown menus.  For myself, I’m very much satisfied with and proud of the result, and very grateful for all the effort and help I got from the other members of the team.

On a less important but possibly more amusing front, yesterday I hacked together a color-blending tool after Matt Haughey asked on Webdesign-L how to calculate the midpoint between two colors, and Steve Champeon explained how to do it in some detail.  The JavaScript is no doubt inefficient and clumsy, the tool may not work in your browser, and for all I know it will lock up your computer.  It was just a quick hack.  Well, not quick, actually; I’m not very skilled at JavaScript.  Enjoy it, or don’t, as you like.  Just don’t expect me to fix or add anything unless you mail me the code needed to do whatever you want the tool to do.
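For anyone wondering what a midpoint calculation might look like, here is a minimal sketch.  It is written in Python rather than JavaScript, it is not the code from the tool, and it assumes the simplest reading of “midpoint”: average each of the red, green, and blue channels of two six-digit hex colors.  The function names are made up for illustration.

```python
def parse_hex(color):
    """Split a color such as '#ff8800' into (red, green, blue) integers."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def midpoint(first, second):
    """Average two colors channel by channel, rounding down."""
    mixed = [(a + b) // 2 for a, b in zip(parse_hex(first), parse_hex(second))]
    return "#{:02x}{:02x}{:02x}".format(*mixed)


print(midpoint("#000000", "#ffffff"))  # #7f7f7f
print(midpoint("#ff0000", "#0000ff"))  # #7f007f
```

Integer division rounds each channel down, so black and white meet at 7f rather than 80.  Three-digit shorthand such as #fff is not handled.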

Lucas Gonze over at the O’Reilly Network mentioned a fascinating paper on “cascade attacks” and how they can be used to take down a distributed network.  So the Internet can suffer cascade failure, eh?  I wonder how much effort would be required to take down the Internet’s starboard power coupling.  Or, worse yet, trigger a coolant leak.

It’s been revealed that the blurry, grainy image of the Space Shuttle Columbia wasn’t taken using any advanced telescopes or military systems after all, but by three engineers who used some off-the-shelf parts to put together a personal experiment.  CNN says: ‘Hi-tech’ shuttle pic really low-tech.  Let’s think about that for a second.  Three guys took an eleven-year-old Macintosh, hooked it up to a telescope that probably cost no more than a couple hundred dollars, and took a picture of an object almost 40 miles away moving 18 times the speed of sound.  That’s low-tech?  The fact that you can even recognize the object they imaged is astounding.  Hell, the fact that they imaged anything at all is astounding.  No criticism of the three men intended; I’m sure they’re brilliant guys who know what they’re doing.  But think about it!

I refer to moments like this as “technological vertigo.”  They’re those points where you suddenly come to a dead halt while you realize the incredible complexity of the world, and just how much we take for granted.  For that one moment, you stop taking it for granted.  Here’s an example: a couple of years ago, I was driving south through suburban Columbus.  In the back yard of a house just off the interstate, I spotted an old satellite dish lying on its side, obviously no longer in use.  Then it hit me: whoever lived there once had the ability to receive information from orbit, and decided to throw it away.  Their garbage was so much more advanced than anything their parents had ever even envisioned that the gap was barely comprehensible.  Any general in the Second World War would have given anything, including men’s lives, to have the kind of communication capability that now lay discarded in somebody’s back yard.

The even more remarkable thing about this trashed satellite dish is that there was nothing remarkable about it.  So somebody threw out an old satellite dish—so what?  They can always get another one, and one that’s a lot smaller, better, and more capable than the piece of junk they tossed, right?

And that is perhaps the most incredible part of it all.


Thursday, 30 May 2002

Published 22 years, 6 months past

I experienced a touch of techno-frisson this evening.  The phone rang, and when I answered it, it turned out to be a sales call offering to refinance my mortgage.  Just as the words “we’re calling to offer competitive interest rates on mortgage refinancing” left the guy’s mouth and grated across my eardrum, e-mail dropped into my Inbox with the subject line “current mortgage interest rate”.

I had no idea I seemed so desperate for a new mortgage.  (Which I’m not, thanks.)

[Image: a message reading “THIS IS SPAM”]

Spam continues to stay in the forefront of my (mostly negative) thinking.  I do have to give major honesty points to a message I received a few weeks back.  When I opened it up (I still don’t know why I did) I found what’s depicted in the accompanying graphic.  They may be the scum of humanity, but at least they’re up front about what they do.  I have to respect that.  I admit I laughed out loud when I saw it, then took a screenshot and deleted the message.

The other thing I wanted to mention is from the “this is funny but I’m laughing as much at the audacity as the humor” department:  The Onion managed this week to put a surreal perspective on current events.  You know, it almost does make sense…

Brief correction: apparently the painting I liked so much isn’t called “Deception” any more.  Now it’s called “Ear Drops”.  Personally, I think the original title worked better.

