
Google Meet Transcriptions

I launched a new online campaign this week and, with the consent of the players, I recorded the session for later reference. (One of them wrote a great summary, but it’s still nice to have the recording.) My original plan was to use Whisper to get a transcription, but it turns out the built-in Google Meet captioning system is plenty good enough. I gave Whisper a shot anyhow, and its quality was higher, but Google Meet adds speaker information to its transcriptions, which makes a huge difference.

Google One will cost you ten bucks a month, which gets you Google Meet sessions longer than an hour and transcripts, among other benefits. Worth it to me since I can afford it and I don’t like using my work Zoom for personal stuff, but YMMV.

Example Meet Transcription

So what you get out of the box is a VTT subtitle file that looks like this:

00:48:24.000 --> 00:48:28.000
(Bryant)
And you are there with Representative Ledger who speaks for 

00:48:28.000 --> 00:48:32.000
(Bryant)
the hogs. He is, in fact, the voice of the hogs TM 

00:48:32.000 --> 00:48:36.000
(Bryant)
registered trademark, etc, etc. Um he is like 

00:48:36.000 --> 00:48:40.000
(Bryant)
a super skinny guy with like elaborate 

00:48:40.000 --> 00:48:44.000
(Bryant)
wire thing on his head and like some antenna sticking out of it. 

00:48:44.000 --> 00:48:48.000
(Bryant)
Um and like every now and then there's like little sparks coming off of it. 

00:48:48.000 --> 00:48:52.000
(Bryant)
Like you have to keep them from setting things on fire because he's 

00:48:52.000 --> 00:48:56.000
(M.)
You say have to. 
-
 
(Bryant)
too important to do that himself. Um and 

00:48:56.000 --> 00:49:00.000
(Bryant)
um you know it depends on whether or not you want things near you to be on 

00:49:00.000 --> 00:49:04.000
(Bryant)
fire, I'm not forcing you to you know, maybe there's some 

00:49:04.000 --> 00:49:08.000
(Bryant)
things that would be better off if they were on fire.

Which is cool, but not as readable as I want it to be, so I wrote a Python script to turn that into this:

Bryant: And you are there with Representative Ledger who
    speaks for the hogs. He is, in fact, the voice of the hogs TM 
    registered trademark, etc, etc. Um he is like a super skinny 
    guy with like elaborate wire thing on his head and like some 
    antenna sticking out of it. Um and like every now and then 
    there's like little sparks coming off of it. Like you have to 
    keep them from setting things on fire because he's --

M.: You say have to.

Bryant: -- too important to do that himself. Um and um you
    know it depends on whether or not you want things near you to be
    on fire, I'm not forcing you to you know, maybe there's some
    things that would be better off if they were on fire.

It’s mostly formatting, but I also did some processing to add dashes in appropriate places. I could probably screen out the ums and uhhs, but that starts to get fancy, and this is all good enough to read.

You can grab the script here if you like. No warranty available.
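
For a rough idea of what this kind of conversion involves, here is a minimal sketch. It is not the linked script: the function names (parse_turns, mark_interruptions, format_transcript) are made up for illustration, and the dash logic is a guess at a reasonable heuristic (a turn that does not end in sentence punctuation is treated as cut off). It also assumes that any line consisting only of a parenthesized name is a speaker label, which matches the sample above but may misfire on captions like "(laughs)".

```python
import re
import textwrap

# A cue timing line such as "00:48:24.000 --> 00:48:28.000".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")
# A speaker label such as "(Bryant)" on a line by itself (assumed format).
SPEAKER = re.compile(r"^\((.+)\)$")


def ends_sentence(text):
    """True if the text ends with sentence-final punctuation."""
    return text.rstrip().endswith((".", "?", "!"))


def parse_turns(vtt_text):
    """Return [speaker, text] turns, merging consecutive cues by one speaker."""
    turns = []
    speaker = None
    for raw in vtt_text.splitlines():
        line = raw.strip()
        # Skip blanks, the lone "-" separator, the header, and timing lines.
        if not line or line == "-" or line == "WEBVTT" or TIMESTAMP.match(line):
            continue
        label = SPEAKER.match(line)
        if label:
            speaker = label.group(1)
            continue
        if speaker is None:
            continue  # text before any speaker label is dropped in this sketch
        if turns and turns[-1][0] == speaker:
            turns[-1][1] += " " + line
        else:
            turns.append([speaker, line])
    return turns


def mark_interruptions(turns):
    """Add ' --' to turns cut off mid-sentence and '-- ' where they resume."""
    marked = []
    for i, (speaker, text) in enumerate(turns):
        # Resuming after a single interjection: the same speaker was cut off
        # two turns ago.
        if i >= 2 and turns[i - 2][0] == speaker and not ends_sentence(turns[i - 2][1]):
            text = "-- " + text
        # Cut off: another turn follows and this one stops mid-sentence.
        if i < len(turns) - 1 and not ends_sentence(turns[i][1]):
            text = text + " --"
        marked.append((speaker, text))
    return marked


def format_transcript(vtt_text, width=70):
    """Render the turns as 'Speaker: text' paragraphs with a hanging indent."""
    blocks = []
    for speaker, text in mark_interruptions(parse_turns(vtt_text)):
        blocks.append(
            textwrap.fill(f"{speaker}: {text}", width=width, subsequent_indent="    ")
        )
    return "\n\n".join(blocks)
```

Reading the downloaded caption file into a string and passing it to format_transcript should give output in roughly the shape shown above, though line breaks will depend on the width you pick.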

Make It Whisper

For comparison, the Whisper transcript looks like this:

00:48:24.600 --> 00:48:27.520
And you are there with Representative Ledger

00:48:27.520 --> 00:48:29.240
who speaks for the hogs.

00:48:29.240 --> 00:48:31.520
He is in fact the voice of the hogs,

00:48:31.520 --> 00:48:34.360
TM, registered trademark, et cetera, et cetera.

00:48:34.360 --> 00:48:40.280
He is like a super skinny guy with like a elaborate

00:48:40.280 --> 00:48:43.320
wire thing on his head and like some antenna sticking out

00:48:43.320 --> 00:48:49.120
it. And like every now and then there's like little sparks coming off of it. Like you

00:48:49.120 --> 00:48:53.160
have to keep them from setting things on fire because he's too important to do that

00:48:53.160 --> 00:48:54.160
himself.

00:48:54.160 --> 00:49:02.160
You have to. You know, it depends on whether or not you want things near you to be on fire.

00:49:02.160 --> 00:49:05.400
I'm not forcing you to. You know, maybe there's some things that will be better off if they

00:49:05.400 --> 00:49:11.520
were on fire.

That’s actually a noticeably better transcript, but it doesn’t identify the speakers; in fact, it doesn’t even notice that M. and I have different voices. The script will not work on this file. I guess if I was feeling really spicy I could try to use the timestamps to interpolate the better text from that file into the original Google Meet transcript, but this is starting to sound like work.
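
For what it’s worth, the timestamp idea might look something like the sketch below: give each Whisper cue the speaker whose Google Meet cues overlap it the most. This is untested against real recordings, the function names (meet_spans, whisper_cues, guess_speaker) are hypothetical, and it assumes both files use HH:MM:SS.mmm timestamps and that Meet speaker labels are lines like "(Bryant)", as in the samples above.

```python
import re

# Matches a cue timing line and captures both timestamps' components.
CUE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert timestamp components (as strings) to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def meet_spans(vtt_text):
    """Return (start, end, speaker) for every speaker label in a Meet VTT."""
    spans = []
    start = end = None
    for raw in vtt_text.splitlines():
        line = raw.strip()
        timing = CUE.match(line)
        if timing:
            parts = timing.groups()
            start, end = to_seconds(*parts[:4]), to_seconds(*parts[4:])
            continue
        label = re.fullmatch(r"\((.+)\)", line)
        if label and start is not None:
            spans.append((start, end, label.group(1)))
    return spans


def whisper_cues(vtt_text):
    """Return (start, end, text) for each cue in a speaker-less VTT."""
    cues = []
    current = None
    for raw in vtt_text.splitlines():
        line = raw.strip()
        timing = CUE.match(line)
        if timing:
            parts = timing.groups()
            current = [to_seconds(*parts[:4]), to_seconds(*parts[4:]), ""]
            cues.append(current)
        elif line and current is not None:
            current[2] = (current[2] + " " + line).strip()
    return [tuple(cue) for cue in cues]


def guess_speaker(start, end, spans):
    """Pick the speaker whose Meet cues overlap [start, end] the most."""
    totals = {}
    for span_start, span_end, speaker in spans:
        overlap = min(end, span_end) - max(start, span_start)
        if overlap > 0:
            totals[speaker] = totals.get(speaker, 0.0) + overlap
    return max(totals, key=totals.get) if totals else None
```

The catch is visible in the sample: M.’s interjection shares a four-second Meet cue with me, and Whisper folds it into a longer cue that is mostly mine, so a cue-level overlap like this would hand the whole thing to Bryant.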

There’s a pull request to add word-level timestamps to Whisper output, so if that goes through I think I could merge the two transcripts effectively. Work for another day, then.

How To

If you’ve come this far, the least I can give you is a walkthrough. Google Meet drops transcripts and recordings into a Google Drive folder called Meet Recordings. Go in there, find your recording, and select it. Then click on the little three-dot menu and select Manage caption tracks.

You’ll get a new window; find your caption track on the right (probably English – 1 unless you recorded in another language), open its three-dot menu, and select Download. Easy as pie.
