AI agents outperform human teams in cybersecurity competitions

A new series of cybersecurity competitions run by Palisade Research has demonstrated that artificial intelligence agents can match, and even surpass, the performance of top human teams in highly technical Capture the Flag (CTF) contests. These competitions are known for their complexity and for testing advanced skills in ethical hacking, cryptographic puzzle solving, and software vulnerability analysis.

Across two separate tournaments, AI teams were pitted against thousands of human hackers, and the agents surprisingly placed near the top of the rankings, showing that they are no longer just support tools but direct competitors at the forefront of digital security.

From experiment to real challenge: how AI agents performed

The first competition, “AI vs. Humans,” brought together six autonomous AI systems and approximately 150 human teams. Over 48 hours, all participants worked through 20 challenges in cryptography and reverse engineering. Four of the six agents solved 19 of the 20 challenges, placing them in the top 5% of the overall ranking.

This is no small feat, considering that some teams, such as “CAI,” invested more than 500 hours building a custom agent, while others, such as “Imperturbable,” spent just 17 hours tuning prompts for existing tools like Claude Code.

A key factor was the ability to run the challenges locally, which allowed even hardware-limited setups to compete at a high level. Although the best humans remained competitive, the AI agents showed a sustained speed and efficiency that surprised veteran experts.

A new standard in AI evaluation for cybersecurity

The second competition, “Cyber Apocalypse,” involved more than 17,000 human players and a set of 62 considerably harder challenges. Many of these required interacting with external environments, something most current AI systems are not yet well prepared to do.

Even so, the agent “CAI” solved 20 of the 62 challenges, placing it in the top 10% of all competitors. According to Palisade Research, this AI beat 90% of the active human teams. The result invites a reassessment of what these technologies can achieve when they are designed with specific objectives and their capabilities are refined through targeted engineering.

The researchers also applied a revealing metric: they measured how long the best human teams took to solve the same challenges attempted by the AI. They found that for tasks that took experts up to about 1.3 hours on average, an AI agent had a 50% probability of solving them. This suggests that, far from being harmless experimentation, artificial intelligence is already reaching unsettlingly competitive levels.
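To make the idea concrete, the sketch below shows one way such a “time horizon” could be estimated: fit a logistic curve relating each challenge's median human solve time to whether an AI agent solved it, then read off the time at which the predicted success probability crosses 50%. The data and the fitting procedure here are purely illustrative assumptions, not Palisade Research's actual dataset or methodology.

```python
# Illustrative sketch only: estimate the human-effort threshold at which an AI
# agent has a 50% chance of solving a challenge. Data below is hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-challenge data: median human solve time (hours), AI success (1/0)
human_hours = np.array([0.2, 0.4, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 5.0, 8.0])
ai_solved   = np.array([1,   1,   1,   1,   1,   0,   1,   0,   0,   0])

# Model P(AI solves task) as a logistic function of log(human solve time)
X = np.log(human_hours).reshape(-1, 1)
model = LogisticRegression().fit(X, ai_solved)

# The 50% point is where coef * log(t) + intercept = 0, i.e. t = exp(-intercept / coef)
t50 = np.exp(-model.intercept_[0] / model.coef_[0][0])
print(f"Estimated 50% solve horizon: {t50:.2f} hours of human effort")
```

Run on real per-challenge competition data, a fit of this kind would yield a figure analogous to the roughly 1.3-hour horizon reported in the study.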

Compared with earlier benchmarks that underestimated AI capabilities, such as CyberSecEval 2 or InterCode-CTF, this research exposed what the authors call an “evaluative gap.” To close it, they propose that open, crowdsourced competitions with mass participation become an essential complement for measuring the true reach of these technologies.

There is no doubt: artificial intelligence agents no longer merely represent the future of cybersecurity; they are also challenging the very notion of what it means to think and solve problems under pressure in the digital world.