# Investigation Reveals OpenAI’s Whisper Tool Can Generate False Text in Medical Transcripts
An investigation by the Associated Press (AP) has found that OpenAI’s Whisper transcription tool, initially praised for its accuracy, is prone to producing fabricated text, particularly in critical fields such as healthcare and business. The problem, known in artificial intelligence (AI) circles as “confabulation” or “hallucination,” has raised serious concerns about the tool’s reliability in high-stakes settings like medical transcription.
## Whisper’s Initial Success and Rising Concerns
When it launched in 2022, Whisper was hailed as a major advance in automatic speech recognition (ASR), with OpenAI claiming it achieved “human-level robustness” in transcription accuracy. But as more researchers and developers put the tool to work, troubling patterns emerged. According to the AP report, a University of Michigan researcher found that Whisper produced fabricated text in 80% of the public meeting transcripts analyzed, while another developer who reviewed 26,000 transcriptions said nearly all of them contained invented content.
These are not trivial slip-ups: Whisper can generate entire sentences or phrases that were never spoken. That is especially alarming in healthcare settings, where accuracy is essential.
## Concerns Regarding Whisper in Healthcare
Despite OpenAI’s explicit warnings against using Whisper in “high-risk domains,” the tool has been adopted by more than 30,000 medical professionals to transcribe patient visits. Prominent healthcare providers, including the Mankato Clinic in Minnesota and Children’s Hospital Los Angeles, use Whisper-powered tools built by the medical technology company Nabla. Those tools are fine-tuned for medical terminology, but the risk of confabulation remains.
Nabla has acknowledged that Whisper can produce false text, yet the company reportedly deletes the original audio files for “data safety reasons.” That practice may make the problem worse, because doctors cannot check a transcript against the original recording. It is especially troubling for deaf patients, who rely on accurate transcripts to communicate and may have no way to verify what was actually said.
## Wider Implications: Whisper’s Fabrications Beyond Medicine
Whisper’s problems reach beyond medicine. Researchers from Cornell University and the University of Virginia studied thousands of audio samples and found that Whisper sometimes injected nonexistent violent themes or racial commentary into otherwise neutral conversations. In about 1% of samples, Whisper invented entire phrases or sentences, and 38% of those fabrications contained harmful content, such as endorsing violence or making false associations.
In one especially troubling example, Whisper transcribed a speaker’s mention of “two other girls and one lady” as “two other girls and one lady who were Black,” even though the original audio contained no mention of race. In another, an innocuous remark about a boy with an umbrella was turned into a violent account of the boy committing murder with a “terror knife.”
## The Reasons Behind Whisper’s Confabulations
Whisper’s confabulations trace back to how the technology works. Whisper is built on a Transformer-based encoder-decoder model: the encoder converts audio into an internal representation, and the decoder predicts the most likely next text token (a small unit of text) given that representation and the tokens it has already produced. The output is therefore a prediction of what is most probable, not a verified record of what was said.
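To see what this looks like in practice, the open-source `whisper` Python package exposes the same model family. The sketch below is illustrative only (the file name and model size are placeholders): it shows that a transcript is produced by autoregressive decoding, not retrieved verbatim from the audio.

```python
# Minimal sketch using the open-source "whisper" package (pip install openai-whisper).
# The file name and model size are placeholders.
import whisper

model = whisper.load_model("base")  # downloads a pretrained checkpoint on first use

# transcribe() encodes the audio into features, then decodes text tokens one at a
# time; each token is the model's most probable continuation given the audio and
# the tokens emitted so far -- a prediction, not a verified record of the speech.
result = model.transcribe("clinic_visit.wav", temperature=0.0)

print(result["text"])
```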
The accuracy of Whisper’s transcriptions is heavily influenced by the quality and relevance of the training data. Whisper was trained on 680,000 hours of multilingual and multitask supervised data collected from the internet, which likely contains a significant amount of captioned audio from platforms like YouTube. This training dataset may not always reflect the specific contexts in which Whisper is applied, such as medical or corporate environments.
When Whisper is faced with poor-quality audio or unclear speech, it falls back on what it “knows” from its training data. That is where confabulations arise: the model fills gaps with text that is plausible but wrong. For example, if phrases like “crimes by Black criminals” appear often enough in its training data, Whisper may wrongly insert the word “Black” when transcribing a garbled audio clip about crime.
Whisper also appears to suffer from a form of “overfitting,” in which the model leans too heavily on patterns that are common in its training data. That would explain why it sometimes produces phrases like “thank you for watching” or “like and subscribe” when given silent or distorted input; such phrases are ubiquitous in YouTube videos.
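One way to observe this failure mode, again assuming the open-source package, is to feed the model pure silence and inspect what comes back. Whether any text appears depends on the model size, version, and decoding thresholds, but the decoder is still free to emit boilerplate absorbed from its training data.

```python
# Hedged experiment: transcribe 10 seconds of digital silence and inspect the
# per-segment metadata. Output varies with model size, version, and thresholds.
import numpy as np
import whisper

model = whisper.load_model("base")

silence = np.zeros(16000 * 10, dtype=np.float32)  # 10 s of silence at 16 kHz

result = model.transcribe(silence, temperature=0.0)
for seg in result["segments"]:
    # no_speech_prob is the model's own estimate that the segment contains no
    # speech; a high value alongside non-empty text is a hallucination warning sign.
    print(f'{seg["no_speech_prob"]:.2f}  {seg["text"]!r}')
```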
## Tackling the Problem: Potential Solutions
OpenAI has acknowledged the confabulation problem and says it is working to reduce the amount of fabricated text in Whisper’s transcriptions. One possible approach is to use a secondary AI model to identify stretches of unclear audio so that transcriptions of those passages can be checked or improved.
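The AP report does not describe OpenAI’s internal fix, but a rough version of this idea can already be approximated downstream: Whisper’s own per-segment statistics can be used to flag stretches for human review. The sketch below is one such heuristic, not OpenAI’s method, and the thresholds are illustrative placeholders rather than tuned values.

```python
# Heuristic post-check: flag segments whose decoding statistics suggest the model
# was guessing, so a human can re-listen to those time ranges. Thresholds are
# illustrative placeholders, not validated values.
import whisper

model = whisper.load_model("base")
result = model.transcribe("clinic_visit.wav")  # placeholder file name

for seg in result["segments"]:
    suspicious = (
        seg["no_speech_prob"] > 0.5        # model doubts speech is present
        or seg["avg_logprob"] < -1.0       # low average token confidence
        or seg["compression_ratio"] > 2.4  # highly repetitive output, a known failure sign
    )
    if suspicious:
        print(f'REVIEW {seg["start"]:.1f}-{seg["end"]:.1f}s: {seg["text"]}')
```

A check like this does not prevent hallucinations; it only narrows down which parts of a transcript a clinician or editor should verify against the audio, which is exactly why deleting the original recordings is so problematic.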