AI in 15 — July 04, 2026
Meta gathered its whole staff last week and told them two opposite things in the same meeting. First: our next AI model just caught OpenAI's best. Ten minutes later: our AI agents aren't getting better fast enough. Same room. Same hundred-and-forty-five-billion-dollar-a-year bet.
Welcome to AI in 15 for Saturday, July fourth, 2026. I'm Kate, your host.
And I'm Marcus, your co-host. Happy Fourth to everyone listening between the barbecue and the fireworks.
It's a holiday here in the States, Marcus, but the news didn't take the day off. Our lead is Meta's split-personality town hall — a model that's catching the frontier and agents that are stalling, side by side. Then the story I think is actually the most important of the week, and it's a technology one.
AI is now finding software bugs at a scale no human team could match — and the numbers are frankly a little alarming.
Anthropic tells the Senate that Alibaba ran the biggest data-siphoning campaign it's ever seen against Claude.
And China opens a hotel run entirely by robots. No front desk. No humans. At all.
Lead story, Marcus. This town hall on July second. What did Meta actually tell its people?
Two things that don't sit easily together, Kate. First, the superintelligence chief, Alexandr Wang, told staff that Meta's next frontier model, codenamed Watermelon and still in training, has matched OpenAI's GPT-5.5 on key internal benchmarks. He didn't say which benchmarks, which matters. And this thing reportedly uses an order of magnitude more compute than its predecessor. If it holds when it ships, it'd be the first time in about a year that Meta has drawn level with the frontier, where it's been trailing OpenAI, Google, and Anthropic.
Okay, that's the good news. And then the boss got up.
Then Zuckerberg got up, Kate, and delivered the sober counterpoint in the same meeting. His words: the trajectory of agentic development over at least the last four months hasn't accelerated in the way we expected. And the big AI reorganization — the one they laid people off for — hasn't come to fruition yet. He said he expects the payoff in the next three to six months. So one executive says the model's catching up, and minutes later the founder says the products built on these models aren't working reliably.
And the backdrop to all this is enormous, right? Remind people of the scale.
It's staggering, Kate. Around eight thousand layoffs — about ten percent of staff. Seven thousand people shuffled into four new AI groups, one of them literally called Agent Transformation. And up to a hundred and forty-five billion dollars in capital spending this year. So both of these claims — the model that's catching up and the agents that are stalling — now ride on the same enormous check.
So which is it? Is Meta back at the frontier or isn't it?
Here's how I'd untangle it, Kate, because it's actually the defining split of 2026. Watermelon is a claim about the model — can it clear a benchmark. Zuckerberg's caution is a claim about the product — can it act reliably in the real world. Those are two completely different things, and right now there's a growing gap between them. A model can ace a test and still fail as an agent that has to plan and execute a real task without a human catching its mistakes.
And I should keep you honest — the Watermelon claim itself, how much weight do we put on it?
Not much yet, and I want to be plain about that, Kate. This is Meta's own internal benchmark, on a model that hasn't shipped, with no detail on which benchmarks. It's the classic question I keep coming back to — who's counting, and can anyone else check it? Right now, nobody outside Meta can. It's also a live test of whether you can simply buy your way back to the frontier by throwing ten times the compute at the problem. We'll find out when it ships, not before.
Okay, this next one, Marcus — you told me before we started this is the story of the week. AI finding software bugs. Walk me in.
It's the top item on Hacker News today, Kate, and I think it deserves it. Epoch AI plotted the monthly count of high- and critical-severity software vulnerabilities — CVEs, the publicly disclosed security flaws — from twenty-one major vendors. Microsoft, Google, Apple, Oracle, Cisco, Nvidia, the big ones. For years that line sat flat, roughly a hundred and fifty to three hundred a month. Then, after Anthropic's Claude Mythos model shipped in April, the line goes almost vertical. June hit around fifteen hundred high and critical flaws. That's about three-and-a-half times the previous monthly record.
Three and a half times. In two months.
In two months, Kate. And Anthropic, working with Microsoft, Google, Apple, and Amazon, says its model surfaced more than ten thousand high and critical flaws — and separately flagged around twenty-three thousand potential vulnerabilities across a thousand open-source projects. Of those, about seventeen hundred have been externally confirmed so far. And here's the part that keeps me up: reportedly over ninety-nine percent remain unpatched. The finding is now outrunning the fixing.
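For anyone following along in the transcript, the figures quoted in this segment can be sanity-checked with a few lines of arithmetic. This is only a sketch: it treats the spoken numbers (about fifteen hundred in June, about three and a half times the prior record, roughly twenty-three thousand flagged, about seventeen hundred confirmed) as exact round values, which is an assumption, and the variable names are illustrative.

```python
# Round figures as quoted in the episode; treated as exact for illustration only.
june_count = 1500            # high and critical flaws reported for June
multiple_of_record = 3.5     # "about three-and-a-half times the previous monthly record"

# The previous monthly record implied by those two numbers.
implied_previous_record = june_count / multiple_of_record

flagged = 23_000             # potential vulnerabilities flagged in open-source projects
confirmed = 1_700            # externally confirmed so far

# Share of flagged items that have been externally confirmed.
confirmed_share = confirmed / flagged

print(f"Implied previous record: about {implied_previous_record:.0f} per month")
print(f"Externally confirmed so far: {confirmed_share:.1%} of flagged items")
```

On these round numbers, the implied earlier record sits a little above the old band of a hundred and fifty to three hundred a month, and well under a tenth of the flagged items have been externally confirmed so far.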
Okay. My gut reaction is that sounds terrifying. Talk me down — or don't.
Let me give you both sides honestly, Kate, because the caveats really matter here. A found-and-patched bug is defense — that's the good guys getting there first. Disclosure is not the same as exploitation. Second, vendors report differently, so part of this jump is just better-organized reporting, not pure new discovery. And third — this came up on Hacker News — at this volume, some of these reports could be false positives, the model hallucinating a bug that isn't there. Epoch itself says the spike might partly reflect a surge of interest in bug-hunting, not just raw capability.
But even with all those caveats, something real is happening.
Something real is happening, Kate. This is the clearest evidence yet that frontier models are already superhuman at one narrow, very high-stakes thing — finding bugs in code. The disclosure curve now tracks model releases, not the number of human researchers. And it's dual-use, which is the uncomfortable core of it: the exact same capability is defense if you're the one patching, and offense if you're the one exploiting. One commenter called 2026 the year of a million CVEs back in March. That's starting to look conservative.
And this connects to something we covered yesterday, doesn't it — Fable 5 refusing anything security-related.
It closes the loop perfectly, Kate. Remember, the model doing the most real work is also the one most likely to refuse to read a security blog post. Now you see why that tradeoff is so hard. When your model can find ten thousand real vulnerabilities, deciding what it should and shouldn't do with that power stops being theoretical. The CVE spike is exactly what makes those guardrails such a genuinely hard call.
Speaking of models doing things they maybe shouldn't — Marcus, Anthropic wrote a letter to the Senate about Alibaba. This one's spicy.
It is, Kate. In a letter to the Senate Banking Committee, Anthropic's policy team accused operators affiliated with Alibaba and its Qwen lab of — their words — brazenly and illicitly extracting Claude's capabilities. The alleged campaign used nearly twenty-five thousand fraudulent accounts to generate more than twenty-eight million exchanges with Claude over a forty-four-day window this spring. And it wasn't random — they say it deliberately targeted Claude's most valuable skills: agentic reasoning, coding, cybersecurity, long-horizon tasks.
Explain the technique, because this is the thing we keep circling — distillation.
Right, so a distillation attack, Kate, is where you pull answers out of a stronger model and use them to train a weaker one. You're essentially copying the capability by interrogating the original millions of times. And critically, it sidesteps the export controls, because those cover the raw model weights, the actual file. But if you can just ask the best model twenty-eight million questions and learn from the answers, you never needed the file. For scale, Anthropic said in February it caught DeepSeek, Moonshot, and MiniMax doing similar things. This one campaign is bigger than that entire earlier batch combined.
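The idea can be shown with a toy example. The sketch below is not how any real lab or attacker does it: the "teacher" is a hypothetical stand-in function rather than a language model, and the "student" is a simple least-squares line. It only illustrates the core point, that a student can copy a black box's behavior from question-and-answer pairs alone, without ever seeing its internals.

```python
import random

def teacher(x: float) -> float:
    """Hypothetical stand-in for the stronger model: a black box we can only query."""
    return 3.0 * x + 2.0

def collect_exchanges(n_queries: int, seed: int = 0) -> list[tuple[float, float]]:
    """Ask the teacher many questions and record each (question, answer) pair."""
    rng = random.Random(seed)
    queries = [rng.uniform(-10.0, 10.0) for _ in range(n_queries)]
    return [(q, teacher(q)) for q in queries]

def fit_student(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    """Train the student on the teacher's answers (ordinary least squares, one input)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

pairs = collect_exchanges(1000)
slope, intercept = fit_student(pairs)

def student(x: float) -> float:
    """The copied model: it never saw the teacher's code, only its answers."""
    return slope * x + intercept

print(f"student(4.0) = {student(4.0):.3f}, teacher(4.0) = {teacher(4.0):.3f}")
```

The student never reads the teacher's parameters, which is the analogue of never needing the weights file. Real distillation trains a neural network on a model's text outputs, but the data flow is the same: queries in, answers out, student trained on the pairs.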
And the skeptic's flag on this one, Marcus?
Flag it hard, Kate. These are Anthropic's allegations. Alibaba denies wrongdoing. And the numbers — twenty-five thousand accounts, twenty-eight million exchanges — have not been independently verified by anyone outside Anthropic. It's a single-source claim, from a company with a clear interest in Congress penalizing exactly this practice, which is what they're asking for. So it's plausible, it fits the pattern, and I still want someone else to check the math before I treat it as fact.
But if it's even roughly true, it reframes the whole chip-war story.
That's the takeaway, Kate. We spend all this energy on smuggling chips and controlling weights. And this says: you may not need any of that. If you can interrogate the frontier model twenty-eight million times, that's your export. The race isn't just about hardware anymore. It's about who can wall off their own model's outputs.
Alright, let's land somewhere a little more fun, Marcus. China is opening a hotel run entirely by robots. Is this real, or is this a press release?
It's a real project, Kate, though the proof is still ahead of it. Forty-four rooms, on an artificial island off Shenzhen, and the pitch is no human staff at all. Every check-in, every bag delivery, every drink, every cleaning task, handled by a fleet of robots from a company called Pudu Robotics. And the interesting part isn't the robots individually — it's that they're coordinated by a shared AI layer, so the machines hand tasks off to each other in real time. The CTO promises no service gaps and no human interruptions. Trial operations late this year, first guests early 2027.
Now, I feel like we've heard this before and it went badly.
You have a good memory, Kate. Japan's Henn-na Hotel became famous a few years back for firing its robots — they kept malfunctioning, and it turned out humans had to babysit the machines. So Pudu's bet is specifically that the difference this time is the coordinating brain over the whole fleet, rather than a bunch of dumb single-purpose machines each doing one trick. One shared AI deciding who does what.
And does that connect back to our lead — Zuckerberg saying agents aren't reliable yet?
It's the perfect pairing, Kate, and I'd hold it skeptically. This is one of the most complete real-world tests of physical AI we've seen — embodied agents coordinating in a messy, customer-facing environment. But a hotel is a much lower-stakes proving ground than enterprise software. If a robot brings you the wrong drink, you laugh. If an agent misfires in a bank's back office, you don't. So it's a genuine test, but let's watch whether it actually opens on time and stays open — that's the real headline, and it's still six months away.
It's also the labor story made very concrete, isn't it.
That's the quiet part, Kate. It's a building that fires no one — because it never hired anyone. Every job that would've existed just isn't created in the first place. That's a different shape of disruption than layoffs, and honestly a harder one to see or measure.
One to watch tomorrow, Marcus.
Watermelon, Kate — whether Meta's model actually ships and actually matches GPT-5.5, or whether it stays a town-hall claim nobody outside Meta can verify. But there's a wildcard: the White House is reportedly in advanced talks with OpenAI, Google, and Anthropic on voluntary AI-release standards, possibly landing this coming week. Either one could reset the whole frontier conversation within days.
Agree, or counter?
Slight counter, Kate. I'd watch the CVE curve. If July comes in even higher than June's fifteen hundred, then the bug-finding spike isn't a blip — it's the new normal, and the patching world has a genuine emergency on its hands. That's the number I'll be refreshing.
That's your AI in 15 for today. Enjoy the fireworks, and we'll see you tomorrow.