
ByteDance Unveils StreamVoice: AI-Powered Live Voice Conversion Raises Deepfake Concerns and Misinformation Risks

ByteDance, the Chinese technology firm behind the popular TikTok platform, has unveiled a new tool called StreamVoice. Leveraging generative-AI technology, it enables users to seamlessly alter their voices to mimic others.


As of now, StreamVoice remains inaccessible to the general public, yet its introduction underscores the noteworthy progress in AI development. The tool facilitates the effortless creation of audio and visual impersonations of public figures, commonly referred to as "deepfakes." Notable instances include the use of AI to emulate the voices of President Joe Biden and Taylor Swift, a phenomenon particularly prevalent as the 2024 election looms.

The tool is the work of researchers from ByteDance and Northwestern Polytechnical University in China. Note that Northwestern Polytechnical University, known for its collaborations with the Chinese military, should not be confused with Northwestern University in the United States.

In a recently published paper, the researchers highlight StreamVoice's capacity for "real-time conversion" of a user's voice into any desired alternative, requiring only a single sample of speech from the target voice. The output unfolds at livestreaming speed, with a mere 124 milliseconds of latency—a significant achievement given that AI voice conversion technologies have traditionally been effective only in offline scenarios.
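To put that latency figure in context, here is a back-of-the-envelope sketch of how chunk-based streaming systems accumulate delay. The frame duration, lookahead count, and per-chunk compute time below are illustrative assumptions, not numbers from the paper:

```python
# Illustrative latency budget for a chunked streaming voice converter.
# All parameter values are hypothetical; StreamVoice's actual
# configuration may differ.

FRAME_MS = 20          # duration of one audio frame fed to the model
LOOKAHEAD_FRAMES = 4   # future frames buffered before output is emitted
MODEL_MS = 24          # per-chunk inference time on the target hardware

def streaming_latency_ms(frame_ms: float, lookahead: int, model_ms: float) -> float:
    """End-to-end latency = time spent buffering the current frame plus
    any lookahead frames, plus the model's per-chunk compute time."""
    buffering = frame_ms * (1 + lookahead)
    return buffering + model_ms

total = streaming_latency_ms(FRAME_MS, LOOKAHEAD_FRAMES, MODEL_MS)
print(f"approximate end-to-end latency: {total:.0f} ms")  # 124 ms under these assumptions
```

The point of the sketch is that a ~124 ms figure implies both very small audio chunks and very fast per-chunk inference, which is what separates streaming conversion from the offline systems that preceded it.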

The researchers attribute StreamVoice's success to recent advancements in language models, enabling the creation of a tool that performs live voice conversion with high speaker similarity for both familiar and unfamiliar voices. Experiments, as detailed in the paper, emphasize the tool's efficacy in streaming speech conversion while maintaining performance comparable to non-streaming voice conversion systems.

The paper notes that StreamVoice is built on the "LLaMA architecture," referring to Meta's Llama family of large language models, a prominent presence in the AI landscape. The researchers also incorporated open-source code from Meta's AudioDec, which Meta describes as a versatile "plug-and-play benchmark for audio codec applications." The tool was trained primarily on Mandarin speech datasets, along with a multilingual set featuring English, Finnish, and German.
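The general shape of such a system—a neural codec that turns audio into discrete tokens, a language model that rewrites those tokens in the target voice, and a codec decoder that turns tokens back into audio—can be sketched conceptually. Every component below is a toy stand-in for illustration, not ByteDance's or Meta's code:

```python
# Conceptual sketch of a codec-plus-language-model voice conversion
# pipeline, in the general style the paper describes (a LLaMA-style LM
# over AudioDec-style codec tokens). All functions are stand-ins.

from typing import Iterator, List

def codec_encode(pcm: List[float]) -> List[int]:
    # Stand-in for a neural codec encoder (e.g. AudioDec): maps a chunk
    # of waveform samples to a short sequence of discrete tokens.
    return [int(abs(s) * 255) % 256 for s in pcm]

def lm_convert(src_tokens: List[int], speaker_prompt: List[int]) -> List[int]:
    # Stand-in for the language model: conditioned on a prompt of
    # target-speaker tokens, a real system would autoregressively rewrite
    # source tokens into the target voice. Here we just offset tokens so
    # the sketch stays runnable.
    shift = sum(speaker_prompt) % 7
    return [(t + shift) % 256 for t in src_tokens]

def codec_decode(tokens: List[int]) -> List[float]:
    # Stand-in for the codec decoder: tokens back to waveform samples.
    return [t / 255.0 for t in tokens]

def stream_convert(chunks: List[List[float]],
                   speaker_prompt: List[int]) -> Iterator[List[float]]:
    # Chunk-by-chunk processing is what makes the pipeline "streaming":
    # each small chunk flows through encode -> LM -> decode on its own,
    # so output begins before the full utterance has been spoken.
    for chunk in chunks:
        yield codec_decode(lm_convert(codec_encode(chunk), speaker_prompt))

prompt = codec_encode([0.1, 0.5, 0.9])   # one short sample of the target voice
mic_chunks = [[0.2, 0.4], [0.6, 0.8]]    # simulated incoming audio chunks
for out in stream_convert(mic_chunks, prompt):
    print(out)
```

Note how the speaker prompt is derived from a single short sample of the target voice, mirroring the paper's claim that only one instance of target speech is required.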

Although the researchers refrain from prescribing specific use cases for StreamVoice, they acknowledge potential risks, such as the dissemination of misinformation or phone fraud. Users are encouraged to report instances of illegal voice conversion to appropriate authorities.

AI experts, cognizant of advancing technology, have long cautioned against the escalating prevalence of deepfakes. A recent incident involved a robocall deploying a deepfake of President Biden, urging people not to vote in the New Hampshire primary. Authorities are currently investigating this deceptive robocall, underscoring the urgent need for vigilance in the face of evolving AI capabilities.

Content generated using AI and reviewed by humans. Photo: DIW - AIGen
