I launched a new online campaign this week and, with the consent of the players, recorded the session for later reference. (One of them wrote a great summary, but it’s still nice to have the recording.) My original plan was to use Whisper to get a transcription, but it turns out the built-in Google Meet captioning system is plenty good enough. I gave Whisper a shot anyhow, and its quality was higher, but Google Meet adds speaker information to its transcriptions, which makes a huge difference.
Google One will cost you ten bucks a month, which gets you Google Meet sessions longer than an hour and transcripts, among other benefits. Worth it to me since I can afford it and I don’t like using my work Zoom for personal stuff, but YMMV.
Example Meet Transcription
So what you get out of the box is a VTT subtitle file that looks like this:
00:48:24.000 --> 00:48:28.000
(Bryant)
And you are there with Representative Ledger who speaks for
00:48:28.000 --> 00:48:32.000
(Bryant)
the hogs. He is, in fact, the voice of the hogs TM
00:48:32.000 --> 00:48:36.000
(Bryant)
registered trademark, etc, etc. Um he is like
00:48:36.000 --> 00:48:40.000
(Bryant)
a super skinny guy with like elaborate
00:48:40.000 --> 00:48:44.000
(Bryant)
wire thing on his head and like some antenna sticking out of it.
00:48:44.000 --> 00:48:48.000
(Bryant)
Um and like every now and then there's like little sparks coming off of it.
00:48:48.000 --> 00:48:52.000
(Bryant)
Like you have to keep them from setting things on fire because he's
00:48:52.000 --> 00:48:56.000
(M.)
You say have to.
-
(Bryant)
too important to do that himself. Um and
00:48:56.000 --> 00:49:00.000
(Bryant)
um you know it depends on whether or not you want things near you to be on
00:49:00.000 --> 00:49:04.000
(Bryant)
fire, I'm not forcing you to you know, maybe there's some
00:49:04.000 --> 00:49:08.000
(Bryant)
things that would be better off if they were on fire.
Which is cool, but not as readable as I want it to be, so I wrote a Python script to turn that into this:
Bryant: And you are there with Representative Ledger who
speaks for the hogs. He is, in fact, the voice of the hogs TM
registered trademark, etc, etc. Um he is like a super skinny
guy with like elaborate wire thing on his head and like some
antenna sticking out of it. Um and like every now and then
there's like little sparks coming off of it. Like you have to
keep them from setting things on fire because he's --
M.: You say have to.
Bryant: -- too important to do that himself. Um and um you
know it depends on whether or not you want things near you to be
on fire, I'm not forcing you to you know, maybe there's some
things that would be better off if they were on fire.
It’s mostly formatting, but I also did some processing to add dashes where one speaker cuts in on another. I could probably screen out the ums and uhhs, but that starts to get fancy, and this is good enough to read.
You can grab the script here if you like. No warranty available.
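For a rough idea of what that kind of processing involves, here is a minimal sketch. It is not the script linked above; it is an illustration that assumes the layout shown in the sample (a timing line, a speaker name in parentheses, a line of text, and a lone `-` between speakers who share a cue). The function names and the "no sentence-ending punctuation means the speaker was cut off" rule are my own assumptions.

```python
import re
import textwrap

# Matches cue timing lines such as "00:48:24.000 --> 00:48:28.000".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")
# Matches speaker lines such as "(Bryant)".
SPEAKER = re.compile(r"^\((.+)\)$")
SENTENCE_END = (".", "?", "!")


def parse_segments(vtt_text):
    """Yield (speaker, text) pairs in file order, ignoring timing lines."""
    speaker = None
    for raw in vtt_text.splitlines():
        line = raw.strip()
        if not line or line == "-" or TIMING.match(line):
            continue
        match = SPEAKER.match(line)
        if match:
            speaker = match.group(1)
        elif speaker is not None:  # skips any header before the first speaker
            yield speaker, line


def merge_turns(segments):
    """Join consecutive segments from the same speaker into one turn."""
    turns = []
    for speaker, text in segments:
        if turns and turns[-1][0] == speaker:
            turns[-1][1] += " " + text
        else:
            turns.append([speaker, text])
    return turns


def mark_interruptions(turns):
    """Add '--' where a turn stops mid-sentence and where it later resumes."""
    def unfinished(text):
        return not text.endswith(SENTENCE_END)

    marked = []
    for i, (speaker, text) in enumerate(turns):
        # The same speaker was cut off two turns ago and is now continuing.
        resumes = i >= 2 and turns[i - 2][0] == speaker and unfinished(turns[i - 2][1])
        # Someone else speaks next while this sentence is still open.
        cut_off = i < len(turns) - 1 and unfinished(text)
        marked.append((speaker, ("-- " if resumes else "") + text + (" --" if cut_off else "")))
    return marked


def format_transcript(vtt_text, width=65):
    """Return the transcript as wrapped 'Speaker: text' paragraphs."""
    turns = mark_interruptions(merge_turns(parse_segments(vtt_text)))
    return "\n".join(textwrap.fill(f"{s}: {t}", width=width) for s, t in turns)
```

To try it, read the downloaded file into a string and print `format_transcript(text)`. The punctuation heuristic will misfire whenever the captions drop a period, so treat the dashes as a hint rather than gospel.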
Make It Whisper
For comparison, the Whisper transcript looks like this:
00:48:24.600 --> 00:48:27.520
And you are there with Representative Ledger
00:48:27.520 --> 00:48:29.240
who speaks for the hogs.
00:48:29.240 --> 00:48:31.520
He is in fact the voice of the hogs,
00:48:31.520 --> 00:48:34.360
TM, registered trademark, et cetera, et cetera.
00:48:34.360 --> 00:48:40.280
He is like a super skinny guy with like a elaborate
00:48:40.280 --> 00:48:43.320
wire thing on his head and like some antenna sticking out
00:48:43.320 --> 00:48:49.120
it. And like every now and then there's like little sparks coming off of it. Like you
00:48:49.120 --> 00:48:53.160
have to keep them from setting things on fire because he's too important to do that
00:48:53.160 --> 00:48:54.160
himself.
00:48:54.160 --> 00:49:02.160
You have to. You know, it depends on whether or not you want things near you to be on fire.
00:49:02.160 --> 00:49:05.400
I'm not forcing you to. You know, maybe there's some things that will be better off if they
00:49:05.400 --> 00:49:11.520
were on fire.
That’s actually a noticeably better transcript, but it does not have the voices identified; in fact, it doesn’t even notice that M. and I have different voices. The script will not work on this file. I guess if I was feeling really spicy I could try to use the timestamps to interpolate the better text from that file into the original Google Meet transcript, but this is starting to sound like work.
There’s a pull request to add word-level timestamps to Whisper output, so if that goes through I think I could merge the two transcripts effectively. Work for another day, then.
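For the curious, the segment-level version of that timestamp idea might look something like the sketch below. It is untested against real recordings and assumes both files have already been parsed into plain tuples; `to_seconds` and `assign_speakers` are hypothetical helper names, not part of Whisper or Google Meet. Each Whisper cue simply gets the Meet speaker whose cues overlap it the longest.

```python
from collections import defaultdict


def to_seconds(stamp):
    """Convert a VTT timestamp like '00:48:24.600' to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def assign_speakers(meet_cues, whisper_cues):
    """Label each Whisper cue with the Meet speaker it overlaps most.

    meet_cues:    list of (start, end, speaker), times in seconds
    whisper_cues: list of (start, end, text), times in seconds
    Returns a list of (speaker, text); speaker is None if nothing overlaps.
    """
    labeled = []
    for w_start, w_end, text in whisper_cues:
        overlap = defaultdict(float)
        for m_start, m_end, speaker in meet_cues:
            shared = min(w_end, m_end) - max(w_start, m_start)
            if shared > 0:
                overlap[speaker] += shared
        speaker = max(overlap, key=overlap.get) if overlap else None
        labeled.append((speaker, text))
    return labeled
```

The catch is visible in the sample above: the Whisper cue running from 00:48:54.160 to 00:49:02.160 contains both M.'s line and my reply, so a per-cue label is going to get one of us wrong. That is exactly the gap word-level timestamps would close.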
How To
If you’ve come this far, the least I can give you is a walkthrough. Google Meet drops transcripts and recordings into a Google Drive folder called Meet Recordings. Go in there, find your recording, and select it. Then click on the little three-dot menu and select Manage caption tracks.
You’ll get a new window; find your caption track on the right (probably English – 1 unless you recorded in another language), open its three-dot menu, and choose Download. Easy as pie.