A couple of months ago, our CTO, David Evans, was browsing Twitter and happened to see a tweet from YouTube celebrity and SciShow host Hank Green about wanting to collaborate with someone on a speech-to-text Watson API project. And, as Evans does, he started thinking about what we could do to make it work. It started with a team of two and expanded from there.
Now, a couple of months later, we’ve come a long way and have something usable and available to everyone. For more information, check out the thoughts our team shared below.
What is it?
First and foremost, Levar is not a product. There tends to be an assumption in technology that in order to work on a thing, you need to either want to sell it or have it do something funny.
This is, instead, a step toward a practical answer to the question, “How can I make editing videos a little easier?” We chose to answer it by creating a tool that makes videos visually beautiful while enhancing user experience and viewer functionality.
When we first began the project, we wanted to assess what kinds of features were available from various speech-to-text frameworks and compare the utility each could offer. We looked primarily at IBM’s Watson API and the Google Voice API, and found that Watson had a slight advantage over Google in that it provided timestamps for each of the user’s spoken words. This meant we could match the text to the audio at exactly the moment each word was spoken.
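To give a feel for those per-word timestamps: when you ask Watson Speech to Text for timestamps, the response includes a `[word, start, end]` triple for every recognized word. A minimal sketch of pulling those out, using a made-up two-word response in Watson's documented shape:

```javascript
// Hypothetical Watson Speech to Text response (requested with
// `timestamps: true`). The sample words and times are invented.
const response = {
  results: [
    {
      alternatives: [
        {
          transcript: 'hello world ',
          timestamps: [
            ['hello', 0.0, 0.45],
            ['world', 0.45, 0.9],
          ],
        },
      ],
    },
  ],
};

// Flatten every result into one [word, start, end] list.
function wordTimings(watsonResponse) {
  return watsonResponse.results.flatMap(
    (r) => r.alternatives[0].timestamps
  );
}

console.log(wordTimings(response));
// → [['hello', 0, 0.45], ['world', 0.45, 0.9]]
```

That flat list of word timings is all you need to know exactly when each word is spoken.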
What we created is a visual way to use the Watson API: it makes videos with text on them based on what you say.
Levar was created for anyone interested in customizing their tutorials or directional videos, but it could easily become a valuable tool for educators, managers, or bloggers.
How does it work?
The input is a regular old video file. The output is a super sweet new video in which the transcript is overlaid on the visuals and highlighted as the words are spoken.
After uploading a video, the user chooses a background—either a solid color, an image, or the original video. Then the user can choose the spoken and unspoken text color, as well as select from a few different fonts. This provides the ultimate personalization suite for the user’s media.
While the user is customizing the text and video background, the backend is hard at work. It was built with Express, using the Watson API for speech recognition and FFmpeg for video manipulation. The backend listens to the words in the video and creates a transcript of what was heard. Of course, nothing is perfect, so we built in a feature that lets the user review and edit any words that may have been transcribed incorrectly.
By using the Watson API, we were able to lean on its machine-learning speech models. Watson returns a timestamp for each word, meaning we can auto-align the highlight as words are spoken.
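One way that auto-alignment can work (a sketch, not necessarily Levar's exact implementation) is to convert the word timings into an ASS subtitle "karaoke" line, where each `{\k<centiseconds>}` tag tells the renderer when to flip a word from the unspoken color to the spoken color:

```javascript
// Turn Watson-style [word, start, end] timings into an ASS karaoke
// line. {\k<N>} holds a word for N centiseconds before the renderer
// switches it to the "spoken" style. The timings below are illustrative.
function toKaraokeLine(timings) {
  return timings
    .map(([word, start, end]) => {
      const centiseconds = Math.round((end - start) * 100);
      return `{\\k${centiseconds}}${word}`;
    })
    .join(' ');
}

console.log(toKaraokeLine([['hello', 0, 0.45], ['world', 0.45, 0.9]]));
// → {\k45}hello {\k45}world
```

Because the durations come straight from Watson's timestamps, the highlight stays in sync with the audio without any manual alignment.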
The frontend was made with React, which does a lovely job of routing and maintaining state.
After the video is processed and personalized, the user is able to download the customized video and share it to any of their social platforms.
What was the thinking behind the design?
Because Levar is fairly simple, we wanted the experience to be simple as well, without too many steps to customize it. So for things like color, we paired and highlighted each option with matching text colors to make choosing easier for the user. The idea is that they don’t need to think too deeply about it.
This also shaped our iterations on Levar. In the beginning we had many more steps than in the final version, but that didn’t really align with the intent of the experience, so over time we made tweaks that brought it back to that overall vision.
How did this project come to be?
It all started with a Twitter conversation between Evans and Hank Green.
Hank wanted to work with “smart people on software that builds video using the Web Speech API.” We knew we had to step up.
After a couple of tweets and a series of DMs, the project was underway. Developers pieced together a simple proof-of-concept for a program that could take in a video, send it to the Watson API, then produce a captioned video.
Our development team then came together to flesh out the video-building functionality and developed a web app to host the service.
The next steps were to provide the user with fun customization options and to develop a friendly interface. It was at this phase that the team began to expand to other departments, allowing the project to draw on the expertise of our colleagues to develop, design, and manage Levar.
Where is the project now?
Many unfamiliar technologies went into making Levar happen, and our team depended heavily on the open-source community to help us push forward with React, Express, and FFmpeg. So you could say we have some technical debt to pay off, which is why we chose to make Levar an open-source project.