The Difficulty of Assessing AI Models with Ankur Goyal

Evaluations are essential in determining the quality, performance, and effectiveness of software under development. Commonly used evaluation techniques such as code reviews and automated testing help detect bugs, ensure requirements compliance, and assess software reliability.

However, evaluating LLMs is uniquely challenging because of their complexity, versatility, and unpredictability.

Ankur Goyal, CEO and Founder of Braintrust Data, an AI application development platform focused on robust and iterative LLM development, joins the show. Previously, Ankur founded Impira, which was later acquired by Figma, where he managed the AI team. He discusses Braintrust and the unique challenges of developing evaluations in a non-deterministic context.

Sean, with a background as an academic, startup founder, and Googler, has published works on AI and quantum computing. Currently, as an AI Entrepreneur in Residence at Confluent, Sean focuses on AI strategy and thought leadership. Connect with Sean on LinkedIn.
