When Ang Li, co-founder of agent software biz Simular, started working at Google DeepMind in 2017, software engineers at the search giant were skeptical about the usefulness of machine learning, or artificial intelligence (AI) as it has come to be called.
As Li explained to The Register in an interview, Google’s production engineers between 2017 and 2019 would often say, “machine learning never works in production.”
“That is kind of interesting because we have lots of papers also hyping AI,” he said.
At one point, Li said, the Google Ads team asked the DeepMind crew to apply its AlphaGo system – the one that conquered the game Go – to improve Google’s ad revenue.
“I think some people tried it, but it actually dropped the revenue,” said Li. “That’s the funny part because the real world system is very complex.”
Machine learning methods are based on statistics, said Li, and they assume a static dataset.
“But in the real world, this assumption doesn’t hold,” he explained. “In the real world, for example, on YouTube, you have videos being uploaded every day. In ads, you have search queries coming every day. And this distribution of data keeps changing. That’s actually the core reason why machine learning doesn’t work in production.”
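The drift Li describes can be sketched in a few lines. This is an illustrative toy (not Google or Simular code): a classifier fit once on a static dataset, whose frozen decision rule degrades when the data distribution later shifts.

```python
# Toy illustration of distribution drift: a model fit on static data
# degrades when the production distribution changes underneath it.
import random

random.seed(0)

def sample(mean, n):
    """Draw n points from a unimodal distribution centered on `mean`."""
    return [random.gauss(mean, 1.0) for _ in range(n)]

# Training time: class 0 centered at 0.0, class 1 centered at 3.0.
train = [(x, 0) for x in sample(0.0, 500)] + [(x, 1) for x in sample(3.0, 500)]

# "Model": a threshold halfway between the class means seen at training time.
threshold = 1.5
accuracy = lambda data: sum((x > threshold) == bool(y) for x, y in data) / len(data)

print(f"accuracy on static data: {accuracy(train):.2f}")

# Production time: class 1 has drifted to center on 1.0, so the frozen
# threshold now misclassifies most of it.
drifted = [(x, 0) for x in sample(0.0, 500)] + [(x, 1) for x in sample(1.0, 500)]
print(f"accuracy after drift:    {accuracy(drifted):.2f}")
```

The static-dataset assumption holds in the first print and fails in the second; nothing about the model changed, only the data.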
That was all before OpenAI released ChatGPT on November 30, 2022. Nearly three years into the generative AI hype cycle, and after many billions of dollars in capital expenditures, machine learning still doesn’t work all that well. But investors have been bedazzled.
As we noted last month, AI agents – AI models using tools in a loop – complete office tasks successfully only about 30 percent of the time.
The success rate depends, however, on which benchmark you’re using and when you’re measuring. The OSWorld benchmark, which assesses how well agent software can handle real-world computer tasks, was established in April 2024. Benchmark tasks consist of directives like: “Please update my bookkeeping sheet with the recent transactions from the provided folder, detailing my expenses over the past few days.”
At the time, the top-performing AI agent, GPT-4 (with vision), managed an overall success rate of 12.24 percent.
As of about a week ago, the top performer was GUI Test-time Scaling Agent, or GTA1, which when paired with OpenAI’s o3 model scored a 45.2 percent task success rate on the OSWorld benchmark. GTA1 reflects the work of researchers from Salesforce AI, the Australian National University, and the University of Hong Kong.
That’s a marked improvement from the state of the art last year, but even the best agent still fails at office automation tasks more than half the time. Human workers can manage a task completion score of 72.36 percent.
In 2023, when Li co-founded Simular with Jiachen Yang, he said he told people the company was building agents. But people didn’t understand, and tried to convince him to call them assistants. Now everyone is building agents.
“Our definition for agents is a system that can interact with the environment and keep improving itself,” he said.
Simular’s S2 agent framework, presently ranked fourth on OSWorld and sixth on the AndroidWorld benchmark, reflects the company’s vision for autonomous computing.
“Basically for now we need to carry computers every day with us, but in the future we don’t have to,” said Li. “Meaning the computer becomes a human-like thing which can…book tickets for you, reserve tables, go shopping.”
This agent would also have knowledge of the user’s habits and preferences, stored locally on the user’s computer, said Li. “This is the vision we’re pushing for.”
A recent manifestation of that vision is Simular Pro, a $500/month computer use agent for macOS (Apple silicon) that’s designed to automate desktop tasks. That’s not priced for casual use; rather, Li anticipates adoption in industries like insurance and healthcare that have a lot of repetitive computer work involving filling out forms.
“Usually this happens in an industry we call an API-deficient industry, meaning they don’t have APIs [for programmatic access to data],” Li explained.
“Insurance, healthcare, finance, they have no API for developers or business to automate their workflow. They are pretty painful. They have to hire people around the world to sit in on the computers. They say if you can automate this, it’s going to be a huge productivity boost for them. Most of the customers are actually in these categories.”
Attracting organizational interest in this sort of office task automation is likely to require getting things right at least as often as human employees. But Li contends that the industry has lost its way.
“We believe everyone else is doing the wrong thing,” said Li. “It’s not really the wrong thing. It’s like they are not going in the right direction. Everyone says agents are based on LLMs. We believe this type of technology is only one part of the reinforcement learning framework.”
Li draws a distinction between exploration – having an LLM try out various possible paths to find a solution – and exploitation – executing a known solution without regard for other options.
Other companies, he said, are too focused on the exploration part and don’t spend enough time on the exploitation portion. Simular’s S2 agent framework starts with using the LLM for exploration, but once it finds a solution, it converts the action into symbolic code, similar to JavaScript, so that tasks can be executed predictably and programmatically – until the code breaks and the LLM has to rewrite it.
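The explore-then-exploit loop described above can be sketched as follows. This is an illustrative toy, not Simular’s actual S2 API: an expensive explorer (standing in for the LLM) searches for a working action sequence once, the result is cached as a plain replayable plan, and the plan is replayed deterministically until it breaks, at which point exploration runs again.

```python
# Toy sketch of explore-then-exploit: LLM-style search finds a plan once;
# the cached "symbolic" plan is replayed cheaply until the UI changes.

class ToyEnvironment:
    """Stand-in for a desktop UI: one button label works per task."""
    def __init__(self):
        self.labels = {"submit_form": "Submit"}
        self.explore_calls = 0

    def explore(self, task):
        """Expensive search (the LLM's role): returns a symbolic plan."""
        self.explore_calls += 1
        return ["open_form", f"click:{self.labels[task]}"]

    def execute(self, plan):
        """Deterministic replay of a symbolic plan."""
        for step in plan:
            if step.startswith("click:") and step.split(":")[1] != self.labels["submit_form"]:
                raise RuntimeError("UI changed: button not found")
        return "done"

class Agent:
    def __init__(self, env):
        self.env = env
        self.cache = {}  # task -> known-good symbolic plan (exploitation path)

    def run(self, task):
        if task not in self.cache:
            self.cache[task] = self.env.explore(task)  # exploration
        try:
            return self.env.execute(self.cache[task])  # cheap, predictable replay
        except RuntimeError:
            self.cache[task] = self.env.explore(task)  # plan broke: re-explore
            return self.env.execute(self.cache[task])

env = ToyEnvironment()
agent = Agent(env)
agent.run("submit_form")            # explores once
agent.run("submit_form")            # replays cached plan, no new exploration
env.labels["submit_form"] = "Send"  # the UI changes under the agent
agent.run("submit_form")            # cached plan fails; agent re-explores
print("exploration calls:", env.explore_calls)  # prints 2, not 3
```

The design point is that the expensive component runs only when the cached plan stops working, which is what makes the exploitation path predictable and cheap.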
Li sees Simular as a technical infrastructure company rather than a maker of agent products. The goal, as he describes it, is to develop a neuro-symbolic continual reinforcement learning framework for building agents.
Continual learning, he said, is one of the hardest problems for AI researchers. The issue is that if you keep training a neural net with new data “it will gradually, catastrophically forget what you learned ten days ago,” he explained. And then there’s the matter of cost – eventually, it just becomes unaffordable to keep adding knowledge to a static model and retraining it.
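The forgetting effect Li describes can be shown with toy numbers (a single weight fit by gradient descent, not a real neural net): sequentially fitting the same parameters to new data erases what was learned from the old data.

```python
# Toy illustration of catastrophic forgetting: a single weight w fit to
# y = w * x by gradient descent, trained on task A, then only on task B.

def train(w, data, steps=200, lr=0.01):
    for _ in range(steps):
        for x, y in data:
            w -= lr * 2 * x * (w * x - y)  # gradient of squared error
    return w

def error(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

task_a = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]  # old data: y = 2x
task_b = [(x, 5.0 * x) for x in (1.0, 2.0, 3.0)]  # new data: y = 5x

w = train(0.0, task_a)
print(f"error on task A after learning A: {error(w, task_a):.4f}")  # near zero

w = train(w, task_b)  # keep training the same weight, but only on task B
print(f"error on task A after learning B: {error(w, task_a):.4f}")  # large
```

Fitting task B drags the shared parameter away from the task A solution entirely; nothing in plain gradient descent protects the old knowledge, which is why continual learning remains an open research problem.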
Li believes that to get to what the industry calls AGI or Artificial General Intelligence – the point at which AI models handle most tasks as well as a human – the way forward will require continual learning. ®