Ever since ChatGPT exploded onto the tech scene last November, it has been helping people write all kinds of material, generate code, and find information. It and other large language models (LLMs) have taken on tasks from handling customer service calls to taking fast food orders. Given how useful LLMs have been to humans in the short time they’ve been around, how might a ChatGPT for robots affect robots’ ability to learn and do new things? Researchers at Google DeepMind decided to find out and shared their findings last week in a blog post and accompanying paper.
They call their system RT-2, short for Robotic Transformer 2; it is the sequel to RT-1, which the company released at the end of last year. RT-1 was built on a smaller vision-and-language model and trained specifically to perform many tasks. The software ran on Alphabet X’s Everyday Robots, enabling them to perform over 700 different tasks with a 97 percent success rate. But when asked to perform new tasks they weren’t trained for, robots running RT-1 succeeded only 32 percent of the time.
RT-2 nearly doubles that success rate, completing new tasks 62 percent of the time it’s asked. The researchers call RT-2 a vision-language-action (VLA) model. It uses text and images found on the web to learn new skills. That is not as simple as it sounds: the software first has to “understand” a concept, then apply that understanding to a command or set of instructions, and then carry out actions that fulfill those instructions.
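In control terms, that loop can be pictured as a single model that takes the current camera image plus a text instruction and emits a low-level command, queried again after every step until the task is done. The sketch below is only an illustration of that idea, not RT-2’s actual interface; the `DummyVLAPolicy` stub, the `Action` fields, and the function names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Action:
    dx: float        # desired end-effector motion in meters (hypothetical fields)
    dy: float
    dz: float
    gripper: float   # 0.0 = open, 1.0 = closed
    done: bool       # the policy signals that the task is complete


class VLAPolicy(Protocol):
    def __call__(self, image: bytes, instruction: str) -> Action: ...


class DummyVLAPolicy:
    """Stand-in for a vision-language-action model; a real one would run a neural network."""
    def __call__(self, image: bytes, instruction: str) -> Action:
        return Action(0.0, 0.0, 0.0, 0.0, done=True)  # pretend the task finished immediately


def run_task(policy: VLAPolicy, get_image: Callable[[], bytes],
             instruction: str, max_steps: int = 100) -> None:
    """Closed-loop control: re-query the model after each executed action."""
    for _ in range(max_steps):
        action = policy(get_image(), instruction)
        if action.done:
            break
        # a real robot would execute `action` here before grabbing the next camera frame


if __name__ == "__main__":
    run_task(DummyVLAPolicy(), get_image=lambda: b"", instruction="throw away the banana peel")
```

The point of the picture is that understanding, interpretation, and action all happen inside a single model call, rather than in separate hand-built modules.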
An example the paper’s authors give is waste disposal. In previous models, the robot’s software first had to be trained to identify waste. If there were a peeled banana on a table with the peel next to it, the robot would be shown that the peel is trash while the banana is not. It would then be taught how to pick up the peel, move it to a bin, and deposit it there.
RT-2 works differently. Because the model has been trained on large amounts of data from the Internet, it already has a general sense of what garbage is; even though it was never explicitly trained to throw garbage away, it can put together the steps to complete the task.
The models the researchers used to build RT-2 are PaLI-X (a vision-and-language model with 55 billion parameters) and PaLM-E (what Google calls an embodied multimodal language model, developed specifically for robots, with 12 billion parameters). A “parameter” is a value a machine-learning model learns from its training data. In an LLM, the parameters encode the relationships between words in a sentence, weighing how likely it is that a given word will be preceded or followed by another.
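For intuition about what those parameters do, here is a toy stand-in: a bigram model whose only “parameters” are counts of which word follows which, learned from a tiny made-up corpus. Real LLMs learn billions of weights by gradient descent rather than by counting, so this is an analogy for the word-prediction idea, not how PaLI-X or PaLM-E actually work.

```python
from collections import Counter, defaultdict

corpus = "the robot picks up the peel and puts the peel in the bin".split()

# "Training": count how often each word follows each other word.
# These counts play the role of the model's parameters.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1


def next_word_probs(word: str) -> dict[str, float]:
    """Probability of each word appearing right after `word`."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


print(next_word_probs("the"))
# {'robot': 0.25, 'peel': 0.5, 'bin': 0.25}
```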
By finding relationships and patterns between words across a huge data set, the models gradually work out how different concepts relate to each other and how to read context. In RT-2’s case, that knowledge is translated into generalized instructions for robot actions.
Those actions are represented to the robot as tokens, the same kind of unit usually used to represent natural-language text as word fragments. Here, each token stands for part of an action, and the software strings multiple tokens together to carry out a complete action. This structure also lets the software perform chain-of-thought reasoning, meaning it can respond to questions or prompts that require some degree of reasoning before acting.
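To make the token idea concrete, the sketch below rounds each continuous action value into one of a fixed number of bins, so a command for the arm becomes a short string of integers that a language model can emit like any other text. The value ranges and field names are made up for this example, and the 256-bin granularity is an assumption of the sketch, not a statement of RT-2’s published encoding.

```python
N_BINS = 256  # assumed granularity for this sketch: each action dimension gets 256 bins


def encode(value: float, low: float, high: float) -> int:
    """Map a continuous action value onto an integer token in [0, N_BINS - 1]."""
    value = min(max(value, low), high)  # clamp to the valid range
    return round((value - low) / (high - low) * (N_BINS - 1))


def decode(token: int, low: float, high: float) -> float:
    """Invert the binning, recovering an approximation of the original value."""
    return low + token / (N_BINS - 1) * (high - low)


# A hypothetical action: move the gripper 3 cm along x, -1 cm along z, and close it.
action = {"dx": 0.03, "dz": -0.01, "grip": 1.0}
ranges = {"dx": (-0.1, 0.1), "dz": (-0.1, 0.1), "grip": (0.0, 1.0)}  # made-up limits

tokens = [encode(action[k], *ranges[k]) for k in action]
print(tokens)  # [166, 115, 255] -- the "words" the model writes out

recovered = [round(decode(t, *ranges[k]), 3) for t, k in zip(tokens, action)]
print(recovered)  # [0.03, -0.01, 1.0] -- what the robot controller executes
```

Because actions and words share the same output format, the model can also put words before the action, which is where the chain-of-thought examples below come in.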
Examples the team gives include choosing an object to use as a hammer when no hammer is available (the robot chooses a rock) and choosing the best drink for a tired person (the robot chooses an energy drink).
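In the paper, that kind of reasoning shows up in the model’s own output: it writes a short natural-language plan before emitting the action tokens. The exact response format below is invented for illustration, as is the helper that separates the plan from the tokens the robot would execute.

```python
import re

# A hypothetical model response in the "plan, then act" style described above.
response = "Plan: use the rock as a hammer. Action: 1 143 201 55 98 12 187 255"


def split_plan_and_action(text: str) -> tuple[str, list[int]]:
    """Separate the reasoning step from the action tokens that the robot will execute."""
    match = re.match(r"Plan:\s*(.*?)\s*Action:\s*([\d\s]+)$", text)
    if match is None:
        raise ValueError("response is not in the expected Plan/Action format")
    plan, token_str = match.groups()
    return plan, [int(t) for t in token_str.split()]


plan, action_tokens = split_plan_and_action(response)
print(plan)           # use the rock as a hammer.
print(action_tokens)  # [1, 143, 201, 55, 98, 12, 187, 255]
```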
“RT-2 shows improved generalization abilities and semantic and visual comprehension beyond the robot data it was exposed to,” the researchers wrote in a Google blog post. “This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.”
The dream of general-purpose robots that can help humans with whatever may arise, whether in a home, commercial, or industrial setting, will not be achievable until robots can learn on the fly. What seems like basic instinct to us is, for a robot, a complex combination of understanding context, reasoning through it, and taking actions to solve problems it was never told to expect. It is impossible to program robots in advance for every unplanned scenario, so they must be able to generalize and learn from experience, just as humans do.
RT-2 is a step in that direction. However, the researchers acknowledge that while RT-2 can generalize semantic and visual concepts, it is not yet capable of learning new actions on its own; rather, it applies the actions it already knows to new scenarios. Perhaps RT-3 or RT-4 will take these skills to the next level. In the meantime, as the team concludes in their blog post, “While there is still a huge amount of work to be done to enable helpful robots in human-centric environments, RT-2 shows us an exciting future for robotics is just within reach.”
Image credit: Google DeepMind