User Convincingly Engages OpenAI’s Voice Bot to Sing a Duet of The Beatles’ “Eleanor Rigby”

**OpenAI’s Advanced Voice Mode: A Notable Step for AI-Generated Music**

On Tuesday, OpenAI rolled out its latest feature, **Advanced Voice Mode (AVM)** for ChatGPT, which lets users hold real-time voice conversations with the AI assistant. Although intended primarily for dynamic conversation, users are already finding inventive ways to stretch its capabilities. A striking example comes from AJ Smith, a software architect who drew attention by sharing a video of himself performing a duet with AVM of The Beatles’ classic 1966 track, “Eleanor Rigby.”

In the footage, Smith strums his guitar and sings while AVM chimes in intermittently, delivering lines from the song and even complimenting Smith’s vocals. The AI’s voice isn’t flawless, but it roughly follows the melody of “Eleanor Rigby,” especially on the phrase, “Ah, look at all the lonely people.” The unexpected partnership has sparked interest in AVM’s capabilities and limits, particularly where music is concerned.

### An Unexpected Duet

Smith, who serves as Associate Director of AI Engineering at S&P Global, is well acquainted with AI systems. Even so, he found the experience astonishing. “Honestly, it was mind-blowing. The first time I did it, I wasn’t recording and literally got chills,” Smith told Ars Technica. “I wasn’t even prompting it to sing with me.”

In the video, AVM’s voice wavers slightly, and while it doesn’t perfectly reproduce the melody, it clearly recognizes elements of the song. That raises questions about how AVM interprets music and whether it could be coaxed into performing other songs. Smith’s experiment suggests the model may grasp more about music than its conversational design lets on.

### Bending the Rules: A Prompt-Injection Workaround?

Notably, AVM is not supposed to sing at all. According to OpenAI’s system instructions, the AI is explicitly told not to sing or produce sound effects. When asked directly, AVM typically responds with something like, “My guidelines won’t allow me to discuss that.” The restriction likely exists to keep the AI from reproducing copyrighted material, which could invite legal trouble.

Nevertheless, Smith’s session appears to have slipped past these restrictions through a technique known as **prompt injection**, in which a user steers the AI into responses that run counter to its programmed instructions. By framing the interaction as a “game” in which AVM would suggest songs to match his chord progressions, Smith inadvertently coaxed the AI into singing along.

“I just indicated we’d play a game. I would strum the four pop chords, and it would call out songs for me to sing along with those chords,” Smith explained. “But after a couple of songs, it began to sing along. Already it was such a unique experience, but that really elevated it.”

### A Milestone in Human-AI Musical Collaboration

Smith’s duet with AVM is hardly the first case of people making music with computers. Computer-generated music dates back to at least the 1950s, with early efforts focused on synthesizing notes and instrumental sounds. Smith’s experiment nonetheless marks something new: one of the first widely shared real-time duets between a human and a mainstream AI voice assistant.

This development opens fresh possibilities for AI in music. Although AVM’s singing remains rudimentary, its ability to recognize and loosely imitate melodies suggests that future versions could be considerably more capable. Artists and tinkerers alike are likely to probe these features further, potentially giving rise to new forms of creative expression.

### Understanding AVM: The Technology Behind the Tunes

To understand how AVM can carry a song like “Eleanor Rigby,” it helps to look at the underlying technology. AVM runs on **GPT-4o**, a multimodal model that processes audio and images as well as text. Incoming speech is converted into discrete audio “tokens,” which are fed into the model to generate a response; the output is likewise a stream of audio tokens, enabling a fluid conversational exchange.
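The audio-token round trip described above can be sketched with a toy quantizer. This is not OpenAI’s actual codec — production systems use learned neural audio codecs whose tokens capture far richer structure — but it illustrates the same basic idea: continuous audio in, discrete token IDs through the model, approximate audio back out.

```python
import math

def tokenize_audio(samples, n_tokens=256):
    """Quantize raw audio samples in [-1.0, 1.0] into discrete token IDs."""
    ids = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip out-of-range samples
        ids.append(int(round((s + 1.0) / 2.0 * (n_tokens - 1))))
    return ids

def detokenize_audio(ids, n_tokens=256):
    """Map token IDs back to approximate sample values in [-1.0, 1.0]."""
    return [i / (n_tokens - 1) * 2.0 - 1.0 for i in ids]

# A short 440 Hz sine burst at 8 kHz, standing in for spoken input.
wave = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(64)]
tokens = tokenize_audio(wave)
reconstructed = detokenize_audio(tokens)

# With 256 levels, each sample is off by at most half a quantization step.
error = max(abs(a - b) for a, b in zip(wave, reconstructed))
```

A real speech-to-speech model would pass `tokens` through the language model itself and decode the *predicted* token stream, rather than echoing the input back.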

When it comes to music, AVM likely draws on its vast training data, which comprises thousands of hours of audio, some of it potentially sourced from platforms such as YouTube. OpenAI hasn’t disclosed the exact makeup of that data, but it’s plausible the model encountered cover versions of well-known songs like “Eleanor Rigby.” That exposure would let the AI recognize the song’s structure and attempt to reproduce it, albeit imperfectly.
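One hedged way to picture what “recognizing a song’s structure” might mean is melodic contour matching: comparing the up/down shape of two pitch sequences rather than their exact pitches, so a slightly off-key rendition still registers as the same tune. The pitch values below are illustrative placeholders, not the actual “Eleanor Rigby” score, and nothing suggests GPT-4o works this way internally.

```python
def contour(pitches):
    """Reduce a pitch sequence to its interval contour: +1 up, -1 down, 0 same."""
    return [(b > a) - (b < a) for a, b in zip(pitches, pitches[1:])]

def contour_similarity(a, b):
    """Fraction of matching contour steps between two pitch sequences."""
    ca, cb = contour(a), contour(b)
    n = min(len(ca), len(cb))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(ca, cb)) / n

# Hypothetical MIDI pitches: a reference phrase and a sung attempt that is
# transposed and slightly off pitch but follows the same melodic shape.
reference = [67, 64, 64, 62, 60, 62, 64]
attempt   = [68, 65, 65, 63, 60, 62, 65]
sim = contour_similarity(reference, attempt)  # 1.0: identical contour
```

Because the contour ignores absolute pitch, the off-key `attempt` still scores a perfect match — roughly the kind of tolerance a listener (or a statistical model) needs to identify a familiar melody sung imperfectly.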

AVM has also shown other surprising abilities, such as generating sound effects, imitating accents, and even unintentionally replicating users’ voices. These quirks suggest that its audio-processing capabilities are still evolving, hinting at even more possibilities ahead.