How Many AIs Are Needed to Read a PDF?


One of the most basic and widespread file formats is confounding the world’s top models. Last November, the House Oversight Committee unveiled 20,000 pages of Jeffrey Epstein’s documents, prompting Luke Igel and his friends to sift through messy email threads in a subpar PDF viewer. Soon after, the Department of Justice released over three million more files, again all in PDF form.

This posed a challenge. While the Department of Justice applied optical character recognition to the text, it was poorly done, making the files largely unsearchable. “The government didn’t provide any interface to summarize flights, calendar events, or messages. There was no proper index; you had to hope the document ID contained what you wanted,” said Igel, cofounder of the AI video editing firm Kino. This led to the idea of building a Gmail-like tool to view and search the correspondence more efficiently.

Extracting information from PDFs is far more complex than it seems. Even with AI advancements in building complex software and solving physics problems, PDFs remain a grand challenge. Edwin Chen from the data company Surge describes them as AI’s “unsexy failures.” State-of-the-art models tasked with extracting PDF information often end up summarizing, confusing footnotes with body text, or hallucinating content. Pierre-Carl Langlais humorously suggests “PDF parsing is solved!” as a milestone just before reaching Artificial General Intelligence.

In the initial phase, Igel’s colleague, Riley Walz, used his remaining Google Gemini credits, but the model only worked on the cleanest scans and was too costly to run on millions of documents. Igel then contacted Adit Abraham, a former MIT classmate, who ran a PDF-parsing AI company called Reducto.

Reducto, one of many companies tackling PDFs, successfully extracted data from ambiguous email threads, redacted call logs, and poor-quality handwritten flight manifests. Once the data was in a usable format, Igel and Walz developed an extensive Epstein-themed app ecosystem: Jmail for Epstein’s inbox, Jflights for interactive flight path viewing, Jamazon for searching Epstein’s Amazon purchases, and Jikipedia for exploring businesses and individuals linked to the files.

“That’s when I realized the potential of PDF information extraction,” said Igel. “It’s going to revolutionize many jobs.”

PDFs are challenging for machines because they were designed for human readability. Adobe pioneered the format in the early ’90s to maintain visual consistency between printed and rendered documents. Unlike HTML, which stores text in logical reading order, a PDF uses character codes and coordinates to visually paint a page.

Optical character recognition (OCR) attempts to translate these images into usable computer text but struggles with complex layouts like multi-column academic papers, leading to jumbled output. Tables, images, diagrams, and headers create further complications. Consequently, AI tools may fail outright, cycle through various methods, or burn excessive time and compute only to misinterpret the document and deliver subpar results.
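The coordinate-painting problem above can be sketched in a few lines. This is a toy model, not a real PDF parser: each entry below stands in for a content-stream instruction that paints a string at an (x, y) position, with no logical order attached. A naive extractor that sorts glyphs top-to-bottom interleaves the columns of a two-column page, producing exactly the jumbled output described above.

```python
# Toy model of a PDF content stream: each tuple paints a string at an
# (x, y) coordinate, with y measured from the bottom of the page.
# The file stores no reading order, so an extractor must guess one.
runs = [
    (300, 700, "COLUMN TWO LINE 1"),
    (72, 700, "column one line 1"),
    (72, 680, "column one line 2"),
    (300, 680, "COLUMN TWO LINE 2"),
]

# Naive strategy: sort top-to-bottom, then left-to-right.
# On a two-column page this interleaves the columns -- the classic
# "jumbled output" failure mode.
naive = [text for x, y, text in sorted(runs, key=lambda r: (-r[1], r[0]))]
print(naive)
# ['column one line 1', 'COLUMN TWO LINE 1',
#  'column one line 2', 'COLUMN TWO LINE 2']
```

A correct extractor would need to detect the column boundary first and read each column fully before moving on, which is precisely the layout analysis that trips up OCR pipelines.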

“The main challenge is their inability to recognize editorial structures,” explained Langlais. PDFs feature tables and forms whose layouts require a cultural understanding of how text is organized.

A further issue is that models rarely train on PDFs, although this is changing as developers hunt for high-quality data, of which PDFs hold a great deal. Government reports, textbooks, and academic papers are all distributed as PDFs. Researchers at the Allen Institute for AI describe PDFs as a large, largely untapped source of novel data for model training.

According to Duff Johnson, CEO of the PDF Association, the first PDF was likely an IRS 1040 form. By 1994, the IRS used PDFs on CDs to distribute consistent forms, eliminating the need for printed documents. PDF quickly became integral to digital work, with consistent appearance regardless of the recipient’s device or software.

“There’s no technology solving the problem PDF addresses,” Johnson stated. PDFs remain uniform across platforms, unlike web pages, which are transient and render differently depending on the browser, or Word documents, which are open to edits and changes.

Specialized PDF-parsing models have gained traction, said Luca Soldaini from the Allen Institute for AI, who worked on olmOCR, a vision language model trained on around 100,000 PDFs. The model has been popular because it is optimized for specific problem areas, such as keeping table rows and columns from getting mixed up.

“Large text is identified as headers,” explained Soldaini. While generalized models draw attention, it’s the PDF reader AIs that see widespread practical use.

Meanwhile, researchers at the AI platform Hugging Face discovered 1.3 billion untouched PDFs in Common Crawl, a potential trove of high-quality data for training future models. But extracting text from them is difficult, said Hynek Kydlíček of Hugging Face. The team built a system that classified PDFs as easy or hard to parse, routing the hard ones to olmOCR. After filtering out a significant amount of irrelevant data, like horse racing results, they said they had freed three trillion valuable tokens for model training.
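The routing step described above can be sketched roughly as follows. This is a hedged, minimal illustration, not Hugging Face’s actual pipeline: the function names and the character-count heuristic are assumptions. The idea is that born-digital PDFs with a usable embedded text layer can be parsed directly, while image-only scans get queued for a vision model such as olmOCR.

```python
# Hypothetical sketch of an easy/hard PDF routing step.
# Heuristic assumption: if direct extraction yields enough characters,
# the PDF likely has a clean embedded text layer ("easy"); otherwise
# it is probably a scan needing an OCR/vision model ("hard").

def has_text_layer(extracted_text: str, min_chars: int = 200) -> bool:
    """Crude check: enough directly extractable characters suggests
    a usable embedded text layer."""
    return len(extracted_text.strip()) >= min_chars

def route(pdfs: dict[str, str]) -> dict[str, list[str]]:
    """Split documents into 'easy' (parse directly) and 'hard'
    (send to an OCR/vision model) buckets."""
    buckets: dict[str, list[str]] = {"easy": [], "hard": []}
    for name, text in pdfs.items():
        buckets["easy" if has_text_layer(text) else "hard"].append(name)
    return buckets

sample = {
    "gov_report.pdf": "A" * 500,   # born-digital, embedded text present
    "scanned_manifest.pdf": "",    # image-only scan, no text layer
}
print(route(sample))
# {'easy': ['gov_report.pdf'], 'hard': ['scanned_manifest.pdf']}
```

Real classifiers weigh more signals than raw character count, but the two-bucket design matters: it reserves the expensive vision model for the minority of documents that actually need it.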

Yet achieving perfect accuracy in parsing PDFs remains a distinct problem from training AI models.
