Australian Government Research Shows AI Falls Short When Compared to Humans in Summarization Tasks

### The Difficulties of AI-Produced Summaries: Takeaways from ASIC’s Proof-of-Concept Research

As large language models (LLMs) gain traction, their potential applications are being examined across diverse sectors. One of the most touted capabilities of LLMs is swiftly condensing lengthy documents, making information easier to access and digest. Nonetheless, a recent analysis by the Australian Securities and Investments Commission (ASIC) has highlighted significant limitations in the current capabilities of these models, especially when compared with human-generated summaries.

#### ASIC’s Research: An In-Depth Investigation

In a proof-of-concept examination carried out in partnership with Amazon Web Services (AWS), ASIC scrutinized the summarization abilities of various LLMs, including the Llama2-70B model. The objective was to evaluate how effectively these models could summarize public submissions to an external Parliamentary Joint Committee inquiry focusing on audit and consultancy firms. A strong summary, according to ASIC, should underline references to ASIC, recommendations for mitigating conflicts of interest, and requests for enhanced regulation, while also providing page citations and brief contextual clarifications.
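ASIC did not publish the exact prompts used in the study, but its rubric implies the shape of the task. The sketch below is a hypothetical prompt template built from that rubric only; the template text, the `{submission_text}` placeholder, and the `build_prompt` helper are illustrative assumptions, not the study's actual materials.

```python
# Hypothetical prompt template reconstructed from ASIC's stated rubric
# (mentions of ASIC, conflict-of-interest recommendations, calls for
# stronger regulation, page citations, brief context). Illustrative only.
SUMMARY_PROMPT = """You are summarising a public submission to a Parliamentary \
Joint Committee inquiry into audit and consultancy firms.

Submission text:
{submission_text}

Produce a summary that:
1. Highlights every mention of ASIC, with the page number where it appears.
2. Lists any recommendations for mitigating conflicts of interest.
3. Lists any calls for enhanced regulation.
4. Adds a brief note of context for each point.
"""


def build_prompt(submission_text: str) -> str:
    """Fill the template with one submission's text."""
    return SUMMARY_PROMPT.format(submission_text=submission_text)
```

A rubric-as-prompt structure like this is one plausible way to steer a model toward the specific elements ASIC's human assessors were scoring for.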

The study’s results were not particularly encouraging for advocates of AI-generated summaries. The Llama2-70B model, despite being among the larger and more capable models available at the time, produced summaries judged significantly inferior to those created by human specialists. The AI-generated summaries were criticized as “redundant and irrelevant,” often merely echoing the original submissions without adding meaningful context or analysis.

#### Main Observations: Where AI Lacks

The ASIC research unveiled multiple pivotal shortcomings in AI-generated summaries:

1. **Deficiency of Nuance and Contextual Grasp**: A prominent challenge was the AI’s poor capacity to dissect and summarize intricate content requiring a profound understanding of context, subtlety, or implied meanings. This shortcoming resulted in summaries that felt generic and failed to reflect the specific mentions of ASIC within the submissions.

2. **Errors and Irrelevance**: The AI-generated summaries also contained inaccuracies, missed pertinent details, and emphasized inconsequential points. The problem was compounded by AI hallucinations: passages in which the model produced grammatically fluent but factually incorrect text.

3. **Increased Demand on Human Resources**: Instead of streamlining the summarization effort, the AI-generated outputs were likely to burden human reviewers with more work. The necessity to verify the AI’s summaries and the observation that the original material often conveyed information more effectively indicated that the AI’s participation could be counterproductive.

#### The Impact of Model Size and Prompt Design

While focusing primarily on the Llama2-70B model, ASIC also evaluated smaller models such as Mistral-7B and MistralLite during initial stages. The comparison supported the industry consensus that larger models generally yield better results. However, ASIC’s findings also highlighted the critical role of prompt design, that is, carefully crafting the instructions and tasks given to the model, in obtaining the best output. Moreover, behind-the-scenes adjustments to model parameters, such as temperature and top-k sampling, were required to refine the results.
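The two parameters mentioned above have standard meanings in LLM decoding: top-k restricts sampling to the k highest-scoring tokens, and temperature sharpens or flattens the resulting distribution. A minimal stdlib-only sketch of that mechanism (not ASIC's or AWS's actual decoding code, whose details were not published):

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, rng=None):
    """Pick one token from raw scores using top-k filtering and temperature.

    logits: dict mapping token -> raw (unnormalized) score.
    """
    rng = rng or random.Random()
    # Top-k: keep only the k highest-scoring candidate tokens.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature: values < 1 sharpen the distribution, > 1 flatten it.
    scaled = [score / temperature for _, score in top]
    # Softmax over the filtered, scaled scores (shifted for stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token according to the resulting distribution.
    r = rng.random()
    cum = 0.0
    for (token, _), p in zip(top, probs):
        cum += p
        if r < cum:
            return token
    return top[-1][0]  # fallback for floating-point rounding
```

Lowering temperature and top-k makes output more deterministic (useful for factual summarization); raising them increases diversity at the cost of more erratic text, which is one reason such tuning mattered in the study.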

#### Constraints and Future Outlook

It’s important to note that ASIC’s study had several limitations that caution against applying the findings broadly to all LLMs or use cases. For instance, the researchers had only one week to optimize their model, and they believe that more time spent on this phase could have yielded better outcomes. Additionally, the research centered on the Llama2-70B model, which has since been surpassed by more advanced models such as GPT-4o, Claude 3.5 Sonnet, and Llama3.1-405B. These newer models might perform better, particularly on tasks requiring a deep grasp of context and nuance.

Despite the unsatisfactory findings, ASIC holds a hopeful perspective on the future of generative AI. The organization recognizes that advancements in this domain are rapid, and upcoming models are expected to enhance both efficacy and precision. Nevertheless, the study acts as a cautionary reminder for large enterprises contemplating the integration of LLMs into their operations. It underscores the significance of comprehensive evaluation and optimization prior to depending on AI-generated outputs, especially in tasks demanding high levels of accuracy and context understanding.

#### Conclusion

ASIC’s proof-of-concept study offers essential insights into the present limitations of AI-generated summaries. Although LLMs such as Llama2-70B display potential, they still struggle in areas where nuanced understanding and accurate contextual analysis are required. As the technology progresses, it is anticipated that future models will rectify some of these deficiencies. Yet, at this juncture, organizations should approach the implementation of AI-generated summaries with caution, ensuring that human oversight is an integral aspect of the procedure.