This research summary is based on the paper 'Jump-Start Reinforcement Learning'.
In the field of artificial intelligence, reinforcement learning (RL) is a type of machine learning that rewards desirable behaviors and penalizes undesirable ones. An agent perceives its environment and learns how to act through trial and error, much like getting feedback on what works. However, learning a policy from scratch in settings with hard exploration problems is a major challenge in RL. Consider a task that only rewards the agent once a door has been opened: because the agent receives no intermediate reward, it cannot tell how close it is to achieving the goal. As a result, it must explore the space randomly until the door happens to open. Given the length of the task and the level of precision required, this is highly unlikely.
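To put a rough number on "highly unlikely", here is a back-of-the-envelope sketch. It assumes a hypothetical task (not taken from the paper) in which exactly one sequence of `horizon` actions reaches the goal and the agent picks uniformly among `num_actions` actions at every step. The function name and parameters are illustrative only.

```python
def random_success_probability(num_actions: int, horizon: int) -> float:
    """Chance that a uniformly random policy emits the single successful
    action sequence of length `horizon` (an illustrative assumption)."""
    return (1.0 / num_actions) ** horizon


if __name__ == "__main__":
    # The probability shrinks exponentially as the task gets longer.
    for horizon in (5, 10, 20):
        p = random_success_probability(num_actions=4, horizon=horizon)
        print(f"horizon={horizon:2d}  P(success per episode)={p:.3e}")
```

Real tasks usually admit more than one successful trajectory, so this is a simplification, but the exponential dependence on task length is the point.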
With prior information, the agent should be able to avoid exploring the state space at random. This prior knowledge helps the agent determine which states of the environment are promising and should be investigated further. Offline data collected from human demonstrations, scripted policies, or other RL agents could be used to train a policy, which is then used to initialize a new RL policy. When policies are represented by neural networks, this means copying the pre-trained policy's network into the new RL policy, so that the new RL policy starts out behaving like the pre-trained one. However, naive initialization of a new RL policy in this way frequently fails, especially for value-based RL approaches.
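Mechanically, the naive initialization described above amounts to copying parameters. The following is a minimal sketch that uses a toy linear policy as a stand-in for a neural network. The class `LinearPolicy` and the helper `initialize_from` are hypothetical names chosen for illustration and do not come from the paper.

```python
import copy
import random
from typing import List


class LinearPolicy:
    """Toy stand-in for a neural-network policy: one weight row per action."""

    def __init__(self, num_features: int, num_actions: int, seed: int = 0):
        rng = random.Random(seed)
        self.weights: List[List[float]] = [
            [rng.uniform(-1.0, 1.0) for _ in range(num_features)]
            for _ in range(num_actions)
        ]

    def act(self, features: List[float]) -> int:
        # Score every action and pick the highest-scoring one (greedy).
        scores = [sum(w * x for w, x in zip(row, features)) for row in self.weights]
        return max(range(len(scores)), key=scores.__getitem__)


def initialize_from(pretrained: LinearPolicy) -> LinearPolicy:
    """Naive warm start: the new policy begins as an independent copy of the
    pre-trained parameters and can then be updated without touching the original."""
    return copy.deepcopy(pretrained)


if __name__ == "__main__":
    pretrained = LinearPolicy(num_features=3, num_actions=2, seed=1)
    new_policy = initialize_from(pretrained)
    observation = [1.0, 0.5, -0.2]
    # Before any online update, both policies choose the same action.
    print(new_policy.act(observation) == pretrained.act(observation))
```

The copy only guarantees that the new policy starts with the same behavior. It says nothing about how that behavior survives the first online updates, which is where the text reports that this approach frequently fails.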
Google AI researchers have developed a meta-algorithm that leverages a pre-existing policy to initialize any RL algorithm. Jump-Start Reinforcement Learning (JSRL) uses two policies to learn tasks: a guide policy and an exploration policy. The exploration policy is an RL policy trained online from the agent's new experiences in the environment. In contrast, the guide policy is any pre-existing policy, and it is not changed during online training. JSRL produces a learning curriculum by first rolling in the guide policy and then handing control to the self-improving exploration policy, yielding results comparable to or better than competing IL+RL (imitation learning plus reinforcement learning) approaches.
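The guide-then-explore structure can be sketched as a single mixed rollout. This is a simplified illustration under stated assumptions, not the authors' implementation: the toy environment `ChainEnv`, the function `jump_start_rollout`, and the `guide_steps` parameter are all hypothetical names, and the RL update for the exploration policy is left out.

```python
import random
from typing import Callable, List, Tuple

Policy = Callable[[int], int]  # maps a state to an action
Transition = Tuple[int, int, float, int, bool]


class ChainEnv:
    """Toy sparse-reward chain: start at 0, reward 1 only on reaching `goal`."""

    def __init__(self, goal: int = 10, max_steps: int = 20):
        self.goal = goal
        self.max_steps = max_steps
        self.state = 0
        self.t = 0

    def reset(self) -> int:
        self.state, self.t = 0, 0
        return self.state

    def step(self, action: int) -> Tuple[int, float, bool]:
        # Action 1 moves right; any other action moves left (floored at 0).
        self.state = max(0, self.state + (1 if action == 1 else -1))
        self.t += 1
        reached = self.state >= self.goal
        done = reached or self.t >= self.max_steps
        return self.state, (1.0 if reached else 0.0), done


def jump_start_rollout(
    env: ChainEnv, guide: Policy, explore: Policy, guide_steps: int
) -> Tuple[List[Transition], float]:
    """Run one episode: the frozen guide policy acts for the first
    `guide_steps` steps, then the exploration policy takes over."""
    state = env.reset()
    transitions: List[Transition] = []
    total, done, t = 0.0, False, 0
    while not done:
        policy = guide if t < guide_steps else explore
        action = policy(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state, total, t = next_state, total + reward, t + 1
    return transitions, total


if __name__ == "__main__":
    rng = random.Random(0)
    env = ChainEnv()
    guide: Policy = lambda s: 1                      # scripted: always move right
    explore: Policy = lambda s: rng.choice([0, 1])   # untrained: acts randomly
    for guide_steps in (9, 5, 0):
        returns = [jump_start_rollout(env, guide, explore, guide_steps)[1]
                   for _ in range(1000)]
        print(f"guide_steps={guide_steps}  mean return={sum(returns) / len(returns):.3f}")
```

A full training loop would feed `transitions` to whatever RL algorithm updates the exploration policy and would shrink `guide_steps` as that policy improves. Both pieces are omitted here to keep the sketch short.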