Our Project

The Problem

The main motivation for our project is to provide a tool for transcribing interviews, podcasts, and any recording in which two or more people are talking and an auto-generated transcript would be useful. This is a tool that journalists could use to streamline their workflow and concentrate their efforts on other aspects of their job, rather than on the menial and often time-consuming task of transcription. At the very least, this tool aims to reduce the amount of time a journalist must spend on transcription.


How Does It Work?

Users submit a WAV audio file containing a conversation between two or more people. The user then selects a portion of the audio in which only one person is talking, and does this for each speaker in the conversation. A profile is created for each speaker from their clip, which is later used for speaker identification. The program then uses a self-similarity measure to identify points in the audio at which one speaker stops speaking and another starts, and splits the original audio file into multiple files at those points. Each file is then analyzed by the speaker identification API, the speaker is identified, and the text of the audio is transcribed. The results are written in the format of a transcript, where each speaker's name precedes the words they said.
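
Below is a minimal sketch of the segmentation step, based on the self-similarity measure and log spectrogram features described here and in the conclusion. The window sizes, checkerboard-kernel width, and peak picking are illustrative choices, not our exact implementation:

    # Sketch: locate speaker-change points with a self-similarity
    # (novelty) score over the log spectrogram. Assumes a 16 kHz mono
    # WAV; window sizes and the peak picker are illustrative defaults.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram, find_peaks

    def change_points(wav_path, kernel_sec=1.0):
        rate, samples = wavfile.read(wav_path)
        # Log spectrogram: one feature vector per time frame.
        _, times, spec = spectrogram(samples.astype(float), fs=rate,
                                     nperseg=2048, noverlap=1024)
        feats = np.log(spec + 1e-10).T            # (frames, freq_bins)

        # Cosine self-similarity between every pair of frames.
        unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = unit @ unit.T

        # Checkerboard kernel slid along the diagonal: novelty is high
        # where the audio is similar to itself on either side of a frame
        # but dissimilar across it, i.e. at a likely speaker change.
        half = max(1, int(kernel_sec * len(times) / times[-1] / 2))
        side = np.concatenate([-np.ones(half), np.ones(half)])
        kernel = np.outer(side, side)             # +1 same-side, -1 cross
        novelty = np.array([
            np.sum(kernel * sim[i - half:i + half, i - half:i + half])
            for i in range(half, len(times) - half)
        ])

        peaks, _ = find_peaks(novelty, distance=half)
        return times[peaks + half]                # boundary times in seconds

The original file can then be split at the returned timestamps, producing one file per single-speaker stretch.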


Methodology

We used Python 3 for all of the backend processing: segmenting and transcribing the audio files. For speaker recognition and transcription, we used Microsoft's Bing Speaker Identification and Speech to Text APIs. We built an HTML/CSS/JavaScript frontend, using the wavesurfer.js library to manipulate waveforms and display them to the user.
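
As a concrete illustration, a single transcription call from the backend looked roughly like the sketch below. The endpoint, headers, and response field follow the preview-era Bing Speech REST documentation; both APIs have since been folded into Azure Cognitive Services, so treat the URL as historical rather than current:

    # Sketch: transcribe one single-speaker segment with the Bing Speech
    # REST API (endpoint and headers per the preview-era docs, now retired).
    import requests

    BING_STT_URL = ("https://speech.platform.bing.com/speech/recognition/"
                    "conversation/cognitiveservices/v1"
                    "?language=en-US&format=simple")

    def transcribe_segment(wav_path, api_key):
        with open(wav_path, "rb") as f:
            audio = f.read()
        resp = requests.post(
            BING_STT_URL,
            headers={
                "Ocp-Apim-Subscription-Key": api_key,
                "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
            },
            data=audio,
        )
        resp.raise_for_status()
        # The "simple" response format returns the best hypothesis
        # as DisplayText.
        return resp.json().get("DisplayText", "")

The speaker identification call works analogously: each segment is POSTed against the enrolled speaker profiles, and the returned profile ID is mapped back to the speaker's name for the transcript line.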

We tested the performance of our transcription software against two major criteria: (1) how well it segmented the audio into smaller files in which only one speaker was talking, and (2) how accurate the final transcription was (covering both the speaker identification and the speech-to-text conversion). For both criteria we used the F-score as our measure of success.
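
For reference, the F-score is the harmonic mean of precision and recall, F = 2PR / (P + R). A minimal helper for scoring the segmentation criterion might look like this; the boundary-matching tolerance is an illustrative choice, not our exact evaluation code:

    # Sketch: F-score for segmentation boundaries. A predicted boundary
    # counts as a true positive if it falls within `tol` seconds of a
    # still-unmatched reference boundary (tolerance is illustrative).
    def boundary_f_score(predicted, reference, tol=0.5):
        unmatched = list(reference)
        tp = 0
        for p in sorted(predicted):
            hit = next((r for r in unmatched if abs(p - r) <= tol), None)
            if hit is not None:
                unmatched.remove(hit)
                tp += 1
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(reference) if reference else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)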

For testing data, we primarily used two files: a clip of an NPR interview, in which an interviewer asks President Obama a question and he begins his response, and a recording of the three of us reading different sections from a Wikipedia article on sports in Latvia. Because we used the Bing APIs, our speech-to-text and speaker identification F-scores were primarily determined by how well our audio was segmented.


Project Demo


Conclusion

Speech-to-text transcription is a very difficult task. Due to the complexities of speech, even a slight amount of background noise renders Bing's API helpless, and even with high-quality audio, transcription is not perfect. We were, however, able to consistently get accurate splitting of the audio using the log spectrogram as our feature vector.

Overall, we feel that this would be a very useful tool. To make it ready for professional use, however, it would need further optimization to improve its accuracy and reliability. Once that is achieved, we believe our project could greatly reduce the time it takes to transcribe audio.