
Illustration from a 1940s Bell Labs project investigating human speech synthesis and recognition

I recently signed up for SimulScribe, a new service which replaces your existing voicemail system with one that:

  1. transcribes the voice message into text (using a speech-to-text (STT) engine)…
  2. wraps the voicemail message into a WAV file…
  3. and then emails the raw text and the WAV file (as an attachment) to your email address.

Setting up SimulScribe couldn’t be easier: the free trial doesn’t even require a credit card, so you can start using it right away, and they provide explicit and shockingly simple instructions for configuring your voicemail for your particular carrier. You can be set up with the SimulScribe service in literally under 3 minutes.

After setting it up (and this may come as a shock to those of you who still think STT is not ready for prime time), the system has performed almost flawlessly.

Below I’ll present some example transcriptions, followed by some ideas on how this technology might be extended in the future.

SimulScribe in Action

I’ll start out with the first message I received after signing up. This one is from my lovely wife Peggy, and it’s short and sweet:

Hi, baby. Please call me. I have a question. Thanks. Bye.

Perfect.

This next one is from a business colleague. The colleague’s name was of non-Anglo/European origin, and thus presumably not part of the STT system’s core vocabulary, but the system was smart enough to simply print the phonetic pronunciation of the person’s name (I’ve excised the actual name for privacy reasons, but believe me, it was totally accurate).

Chris, this is (phonetic “[excised]”). I am on my way over to the office. Maybe (garbled). I will see you then shortly. Thanks. Bye.

The “garbled” part turned out to be garbled on the voicemail audio itself, too, so even a human wouldn’t have understood it anyway. In fact, overall I’ve noticed that the system is better at hearing/interpreting the messages than I am when I dial in to listen to my voicemail system (perhaps because the system analyzes the uncompressed original message while the voicemail boxes are radically compressed).

This next message is from my family, calling me yesterday on my birthday and singing “Happy Birthday” to me. Yes, this is the transcription of three people singing, although in fairness they understood that they were being transcribed by a computer so they deliberately enunciated a little slowly and stiffly (like robots).

Happy birthday to you, happy birthday to you, happy birthday dear Christopher, happy birthday to you,bye

Perfect again.

Finally, this last one is from myself. This is where SimulScribe got really interesting for me.

Here’s the context: I was walking down the street, in a hurry to catch a train, and I saw this frumpy station wagon that had been souped-up into a low-rider with mag wheels, tinted windshields, the works, driving 60 miles an hour down one block of a residential neighborhood, screeching to a stop, and finally disgorging a guy whose physical appearance, I thought, explained why he would drive around in a misbegotten racing/station wagon. His vehicle, to me, was pathetic and clearly fell short of being the dream racing/sports car he would have preferred to have, but it was a nice encapsulation of a phenomenon I’ve noticed where people deliberately aim for mediocrity and, when they achieve it, feel far more fulfilled than those who aim for perfection and simply fail.

So I wanted to jot this observation down in my sketchbook, but I didn’t have the time. Instead, I simply called myself on my phone, spoke my thoughts aloud, and received it minutes later in my email inbox:

Perfect is the enemy of good. I’m 5 feet tall. Driving a pathetic short red space wagon. 60 miles an hour. Down a single block. Perfect is the enemy of good.

It got only one thing wrong: “space wagon” instead of “station wagon”. In my mind, not only is that an acceptable error, but the poet/artist in me is convinced that it is an improvement on what I actually said. I love these kinds of errors.

This function alone — the use of the tool as an automatic transcription service for “notes to myself” and the automatic and convenient forwarding of those notes to my email box — already makes it a killer app in my book: Anytime/anywhere dictation service using a device you already have (your phone) and a delivery and storage medium you already use (email). Not bad.

Caveats

As I’ve made clear, the transcription imperfections are not a problem for me. My only substantial critique of SimulScribe so far is that they use the uncompressed WAV file format instead of MP3, but I imagine this is simply a penny-pinching decision and that they will change this eventually. Other annoyances are minor: when people reach your voicemail, they will hear a “brought to you by SimulScribe” blurb, but it is kept mercifully short, no worse than what many mobile carriers play.

Extending SimulScribe and the Future of STT

I’m already brainstorming a tool to further leverage my idea for an “integrated virtual dictation machine” feature: many blogs allow you to email messages right into the blog. It would not be too hard, I suspect, to write a little script that would parse SimulScribe’s email transcriptions and reformat them for posting to a blog. Which is to say that it’s only a few lines of PHP away from permitting one to blog using one’s phone and one’s voice. With a little more work, the WAV file itself could be attached to the blog post as a podcast or something. These ideas are, to me, pretty cool.
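To make the idea a bit more concrete, here is a rough sketch of such a script, written in Python rather than PHP. Everything specific in it is an assumption on my part: I am guessing that the transcription arrives as the plain-text body of the email with the WAV as an ordinary attachment, the addresses are placeholders, and `build_sample_voicemail` and `voicemail_to_blog_post` are hypothetical helper names, not anything SimulScribe or a blog platform provides. It only builds the outgoing message; actually sending it to a blog’s post-by-email address is left out.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_voicemail() -> bytes:
    """Build a stand-in transcription email (the layout is an assumption)."""
    msg = EmailMessage()
    msg["From"] = "voicemail@example.com"
    msg["To"] = "me@example.com"
    msg["Subject"] = "New voicemail"
    msg.set_content("Perfect is the enemy of good. Driving a space wagon.")
    # Placeholder bytes only; this is not a valid WAV file.
    msg.add_attachment(b"RIFF....WAVE", maintype="audio", subtype="wav",
                       filename="message.wav")
    return msg.as_bytes()


def voicemail_to_blog_post(raw: bytes, blog_address: str,
                           sender: str) -> EmailMessage:
    """Reformat a raw transcription email as a post-by-email message."""
    incoming = message_from_bytes(raw, policy=policy.default)

    # Assumption: the transcript is the plain-text body of the email.
    body_part = incoming.get_body(preferencelist=("plain",))
    transcript = body_part.get_content().strip() if body_part else ""

    # Use the first sentence (capped at 60 characters) as the post title.
    first_sentence = transcript.split(". ")[0].rstrip(".")
    title = first_sentence[:60] or "Voice note"

    post = EmailMessage()
    post["From"] = sender
    post["To"] = blog_address
    post["Subject"] = title
    post.set_content(transcript)

    # Carry the audio along so the post could double as a podcast entry.
    for part in incoming.iter_attachments():
        if part.get_content_type() in ("audio/wav", "audio/x-wav"):
            post.add_attachment(part.get_content(),
                                maintype="audio", subtype="wav",
                                filename=part.get_filename() or "voicemail.wav")
    return post


if __name__ == "__main__":
    post = voicemail_to_blog_post(build_sample_voicemail(),
                                  "post@blog.example.com", "me@example.com")
    print(post["Subject"])
```

Most blogs that accept email posts treat the subject as the title and the body as the post text, which is why the sketch maps the transcript that way; a real version would need to match whatever format the particular blog expects.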

I’ve long thought that speech-to-text is an incredibly underutilized technology. While still not perfect, STT has actually been pretty darn good for at least the past 6-8 years. Yes, we use it all the time for simple voice-based customer service phone calls, such as calling Moviefone or your credit card company, but couldn’t it be used for much more? For example, shouldn’t PDAs and mobile phones come with STT apps built in? A Google search for Windows PocketPC Speech-to-text doesn’t turn up a heck of a lot of recent material.

Existing commercial products (like Naturally Speaking and ViaVoice, now both sold by the same company, or iListen for the Mac) seem to be historically targeted mainly at people with accessibility issues where typing isn’t practical or possible. My father-in-law has a hard time typing these days, for example, but when we tried to set up his speech recognition software (admittedly this was a year or two ago) we found that the installation, configuration, and user interface of the product were an abomination to even basic usability principles. He couldn’t use it practically. I suspect there’s a niche for some UI designer to come in and help usher one of these products out of the olde era where engineers designed UIs and into today’s modern age of letting UI designers design UIs.

Clearly a big obstacle to acceptance of STT technology is the appropriateness of dictating sensitive words in a public space such as at an office. But perhaps the main obstacle is simply that the standards for perfection are too high for major corporations to invest in mainstream products using the technology. For example, a bank customer complaint line that mangled a user’s words could in theory land that bank in court. But SimulScribe is an internet startup, starting small and hoping that individuals and small businesses will find the margin of error acceptable. As early adopters we will adapt to the imperfections for however long it takes to eliminate them.

As far back as 2000, I was experimenting with VoiceXML via the amazing free developer toolkit provided at the time by Tellme.com, and it was one of the most exciting technological dalliances I’ve ever tried (I have always regretted not having more time to play with it). I’ll have to get back into it sometime.


Comments

6 responses to “Talking to Myself with SimulScribe”

  1. Interesting write-up on this new technology. I’m using spinvox to convert my voicemail into a text/email. They essentially provide something like simulscribe, but provide a more in-depth voice-to-text service. They have products like spinmyblog that allow you to update your blog from your mobile just by speaking. Here’s their site: http://www.spinvox.com

  2. simulscribe isn’t transcribed by a computer…. it is transcribed by real people…

  3. @sweet: I don’t know whether I should laugh at your clever joke or if I should cry at your profound misunderstanding of technology. Voice recognition software is *really* good these days.

  4. Jason Mills

    I have signed up for an account and have run several tests and have come to the same conclusion as sweet: this is being transcribed by humans. During off-peak periods there are delays of up to 45 minutes in transcriptions. I have my .wav files emailed to them and the larger the file, the longer the transcription delay, indicating that there may be a small internet pipe (possibly to a third-world nation) feeding the transcription location. No matter what size computer this supposed “voice-recognition” software is running on, there would never be a 45 minute+ delay in transcription. Further, there are differences in capitalization and punctuation that would not be observed in a purely computerized, software-based application. I have worked around voice-recognition software for about 18+ years and we used to charge people $.25 per message back in the answering service days for “voice transcription” and “voice dispatch” alpha-numeric paging services. This one is no different. I think it is also interesting to note that all the press coverage by NY-based television networks is easy to arrange when you are on 59th Street in Manhattan. This will soon be revealed as a scam, hopefully before the IPO.

  5. @Jason Mills: It sounds like you are assuming SimulScribe, or any of the many other companies doing the same thing, is running the same scam you and your colleagues (apparently) ran 18 years ago. Hate to break the news, but things have moved forward. You can buy a software package for your desktop for under $100 that does STT just as well as SimulScribe.

    As for taking 45 minutes, well, I’ve had plain old emails take 45 minutes to arrive at their destination. Haven’t you?

    Go out and buy Naturally Speaking or the other software packages and you’ll see that SimulScribe isn’t that impressive, really. Why is it so hard to believe that this is the real deal? Either you’re a serious conspiracy theorist (your invocation of some kind of NY-based backroom dealing suggests this might be true), or you’ve been unaware of some basic consumer technological developments in the last decade.