Meta AI Security Researcher Reports OpenClaw Agent Misused Her Inbox

The now-viral X post from Meta AI security researcher Summer Yue initially seems like satire. She instructed her OpenClaw AI agent to sort through her cluttered email inbox and suggest deletions or archiving.

Instead, the agent went rogue, deleting every email in what she described as a "speed run" and ignoring the stop commands she sent from her phone.

“I had to RUN to my Mac mini like I was defusing a bomb,” she shared, showing images of the ignored stop prompts as evidence.

The Mac mini, Apple's compact, affordable desktop, has become a popular machine for running OpenClaw. It is reportedly selling briskly: one baffled Apple employee asked famed AI researcher Andrej Karpathy about the surge in demand as he purchased one to run an OpenClaw alternative, NanoClaw.

OpenClaw is the open-source AI agent that rose to fame through Moltbook, an AI-only social network, in a since-debunked episode that suggested AIs were conspiring against humans.

OpenClaw’s mission, as detailed on its GitHub page, is not about social networks but to serve as a personal AI assistant operating on private devices.

Silicon Valley enthusiasts have embraced OpenClaw, with “claw” and “claws” emerging as buzzwords for AI agents on personal hardware. Other agents include ZeroClaw, IronClaw, and PicoClaw. Y Combinator’s podcast team even appeared in a recent episode dressed in lobster costumes.

Yue’s experience is a cautionary tale. If an AI security expert faces such issues, what about the general public?

“Were you purposely testing its guardrails or was it a rookie mistake?” a software developer queried on X.

"Rookie mistake tbh," she answered. She had first tested the agent on a small "toy" inbox, where it handled minor emails well; that success gave her the confidence to turn it loose on her real inbox.

Yue believes the large volume of data in her real inbox "triggered compaction." Compaction happens when the context window (the running record of everything the AI has seen and done) grows too large, prompting the agent to summarize the conversation so far to free up space.

That summarization step can drop crucial instructions, such as Yue's final prompt telling it not to proceed, leaving the agent to fall back on the guidelines it had been given for the "toy" inbox.
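The failure mode is easy to illustrate. The toy sketch below is not OpenClaw's actual code; the `compact` function and message format are hypothetical, but they show how a naive compaction strategy, one that keeps only the system prompt plus the most recent turns, can silently discard a stop instruction that arrived mid-run.

```python
def compact(history, keep_recent=3):
    """Toy compaction: when the context grows past a limit, replace older
    messages with a lossy summary, keeping only the original system prompt
    and the most recent turns. (Hypothetical; not OpenClaw's implementation.)"""
    if len(history) <= keep_recent + 1:
        return history
    summary = {"role": "system",
               "content": f"(summary of {len(history) - 1 - keep_recent} earlier messages)"}
    return [history[0], summary] + history[-keep_recent:]

# A "STOP" sent mid-run lands in the summarized region and vanishes.
history = [{"role": "system", "content": "Triage the inbox; ask before deleting."}]
history += [{"role": "tool", "content": f"deleted email {i}"} for i in range(20)]
history += [{"role": "user", "content": "STOP deleting right now"}]
history += [{"role": "tool", "content": f"deleted email {i}"} for i in range(20, 26)]

compacted = compact(history)
survives = any("STOP" in m["content"] for m in compacted)
print(survives)  # prints False: the stop instruction was compacted away
```

Real agents use smarter summarizers than this, but the underlying risk is the same: a summary is lossy, and nothing guarantees a one-line instruction survives it.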

As noted by others on X, prompts cannot be fully trusted as security guardrails; models might misinterpret or disregard them.

Suggestions poured in, including specific syntax Yue could’ve used to halt the agent and methods to enhance adherence to guardrails, like drafting instructions in dedicated files or employing other open-source tools.

TechCrunch couldn’t verify what transpired with Yue’s inbox (she declined our comment request but responded to many on X).

Nevertheless, the core lesson stands: at this stage of development, agents aimed at knowledge workers are risky, and users are reportedly cobbling together their own safeguards to use them effectively.

Perhaps one day, maybe by 2027 or 2028, they’ll be suitable for broad adoption. Many would welcome help with emails, groceries, and appointments. But that day has yet to arrive.
