Google's AI finds its voice ... and it's surprisingly human

Google has figured out how to use artificial intelligence to make robots sound more human, according to a new paper. Using its "WaveNet" model, DeepMind, Google's AI company in the UK, claims to have created natural machine-to-human speech that halves "the gap with human performance". Machine babble often sounds emotionally flat and robotic because it's difficult to capture the natural nuances of human speech.

  1. Anonymous Coward

    The WaveNet samples are good, but the enunciation feels too evenly spaced. Possibly a real person would be taking small breaths throughout, which would vary the delivery.

    1. Leeroy

      I agree that the spacing is the main issue. What really threw me, though, was the word 'film'; it just didn't sound right for some reason.

    2. John Brown (no body) Silver badge

      On the other hand, it's still pretty impressive, especially compared to what we normally hear out in the real world from fairly poor TTS systems. I think I could quite comfortably listen to documents or books being read to me by that WaveNet voice.

      I've tried converting ebooks to audiobooks for the car, and it's listenable, but only just. You have to spend time looking for "odd" words and adding phonetic corrections, especially people's names or place names. Those corrections do build up a library, though, so I suppose if I convert more, eventually only new names will be a problem. (Yes, it's mainly SF, so often "alien" names which I'm not even sure myself how they ought to be pronounced anyway :-))
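
      Something like this rough Python sketch is all that corrections pass amounts to; the substitution table and helper name are made-up examples, not any particular converter's feature:

      # Hypothetical phonetic-corrections pass: respell awkward names
      # before handing the text to the TTS engine.
      PHONETIC_FIXES = {
          "Cthulhu": "kuh-THOO-loo",
          "Siobhan": "shiv-AWN",
          "Hermione": "her-MY-oh-nee",
      }

      def apply_phonetic_fixes(text, fixes=PHONETIC_FIXES):
          """Replace each listed name with its phonetic respelling."""
          for name, spoken in fixes.items():
              text = text.replace(name, spoken)
          return text

      print(apply_phonetic_fixes("Siobhan read Cthulhu aloud."))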

  2. a_yank_lurker

    Very Difficult

    The one problem with this is that the human mouth and tongue are not positioned in exactly the same way for what is the "same" sound in two different words, so there is a subtle shading of the sound. Also, as the AC noted, the cadence varies in human speech.

  3. Anonymous Coward

    They're going the wrong way

    If i woz gonna talk to an AI then i wood much prefer it if it sounded like a Cylon from the original Battlestar Galactica tv show rather than Robert Vaughn's Proteus IV in Demon Seed. An AI should sound like an AI and there should be laws to make these fuckers sound like Daleks and not Peter Mandelson. Its one thing to be talked down two bye a human being, its another to bee talked down too by a fuckin glorified 'Speak n Spell'.

    1. Rich 11

      Re: They're going the wrong way

      there should be laws to make these fuckers sound like Daleks and not Peter Mandelson.

      But which of those two is the more dangerous?

      1. Indolent Wretch

        Re: They're going the wrong way

        Luckily Peter Mandelson can't climb stairs.

  4. Anonymous Coward

    Gibberish

    There is something quite beautiful about the generated speech where the model makes up what it has to say. Somehow that is more impressive to me, as I imagine the AIs of the future creating new languages; though WaveNet is still a noteworthy improvement on current TTS.

    1. This post has been deleted by its author

  5. M7S
    Terminator

    The computer is your friend...

    "Wolfie's fine Honey. Wolfie's just fine. Where are you?"

    Or maybe not

    1. Gruezi

      Re: The computer is your friend...

      So it wasn't just me then...

  6. Mage Silver badge

    This comment isn't true!

    "Machine babble often sounds emotionally flat and robotic because it’s difficult to capture the natural nuances of human speech."

    Absolute nonsense. The issue is parsing and then understanding the text so as to decide how to nuance the speech; punctuation is only a weak clue to speed, pauses and pitch.

    We have been able to create natural-sounding speech for maybe more than 40 years, IF there is specially marked-up text with meta tags for pitch, loudness, speed and pauses. Ordinary text has to be parsed and understood first.
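
    For illustration, here's a minimal Python sketch of that kind of mark-up, borrowing tag names from SSML (the W3C Speech Synthesis Markup Language); older systems had their own proprietary tag sets, so treat the syntax as an example rather than what any 40-year-old system actually accepted:

    # Wrap plain text in prosody hints plus an explicit pause.
    def to_ssml(text, pitch="medium", rate="medium", volume="medium",
                pause_ms=300):
        """Return text annotated with pitch/loudness/speed/pause tags."""
        return (
            '<speak>'
            f'<prosody pitch="{pitch}" rate="{rate}" volume="{volume}">'
            f'{text}'
            '</prosody>'
            f'<break time="{pause_ms}ms"/>'
            '</speak>'
        )

    print(to_ssml("Absolute nonsense!", pitch="high", rate="fast"))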

    Even actors etc. read text better if they are familiar with it and understand it. Otherwise even humans reading unfamiliar text can sound rubbish.

    Google Translate is very "successful" in some respects, but it killed computer speech/language/translation development because it's not "intelligent" at all; it's a brute-force "Rosetta Stone" / code-breaker approach rather than translation by parsing an entire section, understanding it and paraphrasing.

    1. Anonymous Coward
      Childcatcher

      Re: This comment isn't true!

      "Even actors etc, read text better if they are familiar with it and understand it. Otherwise even humans reading unfamiliar text can sound rubbish."

      I'm no actor, but I can read an unfamiliar piece and make it enjoyable for a listener. It is a bit of a skill; I am not the best, but I'm quite good. I read ahead of my speech and moderate my tone accordingly. I will make a few mistakes, but the overall effect is pretty good (if I do say so myself!). It's called story-telling. I can even manage to do it on the fly, without a script.

      One day Google et al. will get good at this stuff, but not yet. I remember when CGI in films was frankly a bit wank, but nowadays it is getting to the point where it is really hard to spot the seams.

      OTOH my router pings 8.8.8.8 and 8.8.4.4 on a regular basis to determine connectivity. It sends tiny ICMP packets out every few seconds. Each packet is perfectly formed and contains a shitload of parameters (src/dst etc.). It does this without complaining about its dodgy back, day in, day out. I can't do that.
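
      For the curious, that probe amounts to something like this Python sketch (assuming a Linux-style ping binary on the PATH; the function name is invented):

      import subprocess

      def host_is_up(addr, timeout_s=2):
          """Send one ICMP echo request; True if anything answers."""
          result = subprocess.run(
              ["ping", "-c", "1", "-W", str(timeout_s), addr],
              stdout=subprocess.DEVNULL,
              stderr=subprocess.DEVNULL,
          )
          return result.returncode == 0

      online = any(host_is_up(a) for a in ("8.8.8.8", "8.8.4.4"))
      print("online" if online else "offline")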

      Attempting to make silicon emulate a tiny function that the billions of neurons in my bonce can manage is bloody stupid. My head is also much smaller than any server-class system. They can fart around with quantum stuff if they like, but that ain't going to go much further than, say, prime factoring, I suspect. You'll be needing a new technology to do something that makes me feel really inadequate.

      1. allthecoolshortnamesweretaken

        Re: This comment isn't true!

        "You'll be needing a new technology to do something that makes me feel really inadequate."

        Well, it seems like sexbots are already under development.

        http://www.theregister.co.uk/2016/09/06/should_humanity_hump_robots_serious_question_feature/

        http://forums.theregister.co.uk/forum/1/2016/09/06/should_humanity_hump_robots_serious_question_feature/

  7. Anonymous Coward

    I don't want machines to sound human

    Just like if I had an android I wouldn't want it to look indistinguishable from a human. Even if it were possible, I'd want it oddly colored like Data from Star Trek, or that Trump robot running for president, so you know the difference.

    I'm sure Google wants machines that sound human, so they can get companies to outsource their call centers to Google's server farms - more income for Google, too bad about all the people who lose their jobs. They could even lie and say you have to wait for the next available operator to take your call, and sell ad space for that downtime instead of playing Muzak. I'll bet if you do a patent search, you'll find they've already patented that!

    1. Pascal Monett Silver badge

      I don't even want them to look human. Robots should not look human, nor sound human. They are machines, they are tools.

      Let's not add more confusion for the weak-minded. They have enough trouble already with all those conspiracy theories.

      1. LaeMing
        Boffin

        The makers of Red Dwarf's Kryten would agree with you, Pascal, that being the explicit reason his head looks like a novelty condom while an earlier model android in his line looks much more human.

      2. John Brown (no body) Silver badge

        "Robots should not look human, nor sound human. They are machines, they are tools."

        I know a few people who are right tools. Unfortunately they look human too!

  8. Dan 55 Silver badge
    Meh

    Not that impressive

    The CereProc apps (follow the link for examples; try the Jess voice) have got a load of different languages and accents and still sound more natural, despite Google throwing stupid money and big data at the problem (as usual).

    1. Anonymous Coward

      Re: Not that impressive

      The CereProc samples do sound more natural. However, they seem to have distracting traces of the "burble" you get on DAB radio when the signal is imperfect.

    2. John Brown (no body) Silver badge

      Re: Not that impressive

      The timing sounds a little more natural, but the sounds themselves are poor and stilted. Maybe they need to get together with the WaveNet people and take the best from both systems?

  9. Long John Brass

    Our AI overlords

    I always thought they would have a female English RP accent.

    1. Rich 11

      Re: Our AI overlords

      Presumably because we'd be more comfortable being ordered around by Penelope Keith.

  10. Black Rat

    My voice was my passport

    Fascinating yet troubling technology, with enormous potential for both hilarity and abuse. Elvis could have another comeback. The real question, though, is could it fool a biometric lock or a human being?

  11. Pascal Monett Silver badge

    "a system modelled on how the human brain works"

    No. A system modelled on how we think the human brain works.

    We still don't understand everything, and it will take at least a few more decades before we do.

  12. herman

    Muzak

    OK, so musicians are now obsolete too

    1. Darryl

      Re: Muzak

      Well, to be fair, Muzak has very little to do with real musicians

    2. This post has been deleted by its author

  13. ben kendim

    They have Majel Barrett Roddenberry's voice digitized...

    ... that's what I want to hear from my droid! https://twitter.com/roddenberry/status/772493204121944066

  14. Kaltern

    Personally, I prefer R2-D2 type audio output. Universally understandable, and easy to learn.

  15. LionelB Silver badge

    Kon'nichiwa, konpyūta

    I recently heard some Japanese-like speech samples generated by a deep learning algorithm. In fact, a Japanese colleague played me two samples, one of real Japanese from the training set and one of generated burble. I really couldn't tell which one was real speech. (This may have been partly because the training set was drawn from those weird manga cartoon voices.)

  16. Jez Burns

    Wow...

    That is really impressive, especially given the use of neural networks, which will give it more room to evolve. As other commenters have pointed out, the rhythms and timing are still a bit off; I guess that's because these things are largely context-sensitive in natural human speech. Still, there are rules that could be followed (and perhaps they already try to) involving the cadences and timings of different types of words following each other (nouns/verbs etc.), along with sentence/paragraph structure and the amount of breath available before needing to pause.
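
    As a toy illustration of such rules, here's a Python sketch that inserts pauses from punctuation alone; the pause lengths and markers are invented for the example, and a real system would condition on far more context (word class, sentence structure, breath):

    import re

    # Hypothetical pause lengths (in milliseconds) keyed on punctuation.
    PAUSE_MS = {",": 150, ";": 250, ":": 250, ".": 400, "?": 400, "!": 400}

    def add_pauses(sentence):
        """Annotate text with pause markers derived from punctuation."""
        tokens = re.findall(r"\w+|[,;:.?!]", sentence)
        out = [f"<pause {PAUSE_MS[t]}ms>" if t in PAUSE_MS else t
               for t in tokens]
        return " ".join(out)

    print(add_pauses("Still, there are rules that can be followed."))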
