One of the weaknesses of vanilla deep reinforcement learning is that policies and values learned are typically limited to a single environment, the one the agent was trained on. In other words, it is hard to transfer policies from one setting to another. This is in sharp contrast to how humans learn to do stuff. We draw heavily on past experiences and quickly learn what combination of skills we already have that works well in a new environment.

A canonical environment in RL is the taxi cab domain. Let’s say I train an agent to pick up and drop off a single passenger. Now I add another passenger to the mix. Instead of learning that it simply needs to execute the two skills, pick up passenger and drop off passenger, twice, the agent will move towards a brand new local minimum for the two passenger problem. Once it has converged to that, it will forget how to do the single passenger case optimally. So, for every new problem, we end up initializing a new agent with random weights and training it to solve that problem only.


Retraining from scratch for every new task is ridiculously inefficient. It is like building a very expensive rocket booster and only using it for a single flight. Who does that?

Not these guys

We need agents that learn to build on top of existing skills quickly in new, unseen environments. These agents should have very modular architectures, so one can plug in different skills and create complex hierarchies.

Quick note: I am certainly not the first one to think of this idea. There has been lots of work around this, notably in hierarchical and multi-task RL [1,2,3,4,5,6]. For a detailed description of related work and how this one fits in, please take a look at our arXiv paper. Here, I will say that our focus is on composing already learned skills in a variety of interesting ways, instead of decomposing tasks into a sequence of sub-policies, which is more common in the literature.

Ok, so how do we go about obtaining such reusability of basic skills?

Let’s say we have pre-trained policies for skill 1, $\pi_1$, and skill 2, $\pi_2$, and we want a policy for a task that requires composition of both skills, $\pi_c$. Let’s take an example of a gridworld with three objects, red, green and blue. $\pi_1$ can be “collect the red object” and $\pi_2$ can be “evade the green enemy” (an enemy object chases the agent at every step). The composed policy, $\pi_c$, would make progress towards the red object while evading the green enemy. Let’s also say that $\pi_c$ can be represented as a function $f$ of $\pi_1$ and $\pi_2$.

$$\pi_c(s) = f\big(\pi_1(s),\ \pi_2(s)\big)$$

In regular hierarchical RL, $f$ is a switch that picks one of the two skills depending on the state. It acts as a meta-controller. Run away if the enemy is too close, or run to the food if not. One could also hand-design a function that blends the two policies together depending on the task. But that requires effort from an expert who understands each task and how the policies work.
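To make the difference concrete, here is a minimal sketch of both hand-built options. Everything in it is an assumption for illustration: the action ordering, the distance threshold, the exponential weighting and the function names are mine, not something from our experiments.

```python
import numpy as np

def switch_policy(pi_collect, pi_evade, enemy_distance, threshold=2.0):
    """Meta-controller: hand over control to exactly one skill."""
    return pi_evade if enemy_distance < threshold else pi_collect

def blend_policy(pi_collect, pi_evade, enemy_distance, scale=2.0):
    """Hand-designed blend: lean on evasion more as the enemy gets closer."""
    w = np.exp(-enemy_distance / scale)   # weight on evasion, in (0, 1]
    mixed = w * pi_evade + (1.0 - w) * pi_collect
    return mixed / mixed.sum()            # renormalize to guard against rounding

# Hypothetical action order: up, down, left, right.
pi_collect = np.array([0.7, 0.1, 0.1, 0.1])  # red object is above the agent
pi_evade = np.array([0.1, 0.1, 0.7, 0.1])    # enemy is to the right

near = switch_policy(pi_collect, pi_evade, enemy_distance=1.0)
far = switch_policy(pi_collect, pi_evade, enemy_distance=5.0)
blended = blend_policy(pi_collect, pi_evade, enemy_distance=1.0)
```

The switch returns one skill's distribution unchanged, while the blend puts probability mass on both "up" and "left". Note that someone had to pick the threshold and the weighting by hand, which is exactly the expert effort mentioned above.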

Ok, so what if we learned the function instead? How would we do that?

If $f$ is differentiable, we could pass the policy gradient from the final policy through it and adjust its parameters. But what about $\pi_1$ and $\pi_2$? They are also typically represented as neural networks with their own sets of parameters in deep RL. Those we don’t want to modify, as we want to keep reusing them in the future for other composed tasks. Let’s write our equation down again, this time with the parameters.

$$\pi_c(s) = f\big(\pi_1(s; \theta_1),\ \pi_2(s; \theta_2);\ \theta_c\big)$$

$\theta_1$ and $\theta_2$ are parameters of the individual skill policies, and $\theta_c$ are additional parameters that do the composition.

But the final probability distribution over actions from each skill, its policy recommendation, may not contain enough information to blend the skills together. These are recommendations from each skill individually, and do not say anything about the current state of the combined task. We would instead like to embed information about the state and the skill policy into a single layer of a network, and provide the embedding to $f$ to learn a composition as the task requires. So, the final form of our composition function is

$$\pi_c(s) = f\big(e_1(s),\ e_2(s);\ \theta_c\big)$$

where $e_1$ is a function that generates embeddings for skill 1, and similarly $e_2$ for skill 2.

Fantastic! Now we can learn the function for any kind of composition, such as collect red object while evading green enemy, or evade green enemy and blue enemy, etc.

But wait, where do the embeddings come from?

Phase 1

We break down the learning of our skill policies as follows,

$$\pi_k(s) = P\big(e_k(s);\ \theta_P\big)$$

In other words, there are two networks: one that learns a state embedding $e_k$ for each skill $k$, and a policy layer $P$ that outputs a policy given an embedding. The parameters for the latter, $\theta_P$, are shared across all skills. In practice, this can be achieved using the following architecture.

Illustrative example of the forward and backward passes for a single skill

This gif shows the training procedure for a single skill and the policy and skill modules. The black arrows represent the forward pass of generating an embedding and then the policy, and the red arrows denote the gradient.

Multiple skills are trained in parallel in separate environments.

Each skill has its own network that generates embeddings given a state. The policy layer takes any embedding and converts it to a distribution over actions which is executed in the environment. Each skill is running in its own separate environment, with its own reward function (+1 for successfully completing the skill, -1 for failing, with a small step cost). Since the policy layer is shared, policy gradients from all skills are applied to it. But the embedding networks are trained using only gradients from the appropriate skill.
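Here is a rough NumPy sketch of that gradient routing. It is not our actual training code: I am assuming single-layer tanh embedders, a linear softmax policy layer, and a plain REINFORCE-style gradient, and all sizes and names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, EMB_DIM, N_ACTIONS = 6, 4, 5
SKILLS = ["collect_red", "evade_green"]

# One embedding network per skill (a single tanh layer here), one shared policy layer.
embedders = {k: rng.normal(0, 0.1, (EMB_DIM, STATE_DIM)) for k in SKILLS}
policy_layer = rng.normal(0, 0.1, (N_ACTIONS, EMB_DIM))

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def forward(skill, state):
    emb = np.tanh(embedders[skill] @ state)
    return emb, softmax(policy_layer @ emb)

def reinforce_grads(skill, state, action, ret):
    """Gradients of ret * log pi(action | state) for a single sample."""
    emb, probs = forward(skill, state)
    dlogits = -probs
    dlogits[action] += 1.0                 # d log pi(action) / d logits
    dlogits *= ret
    g_policy = np.outer(dlogits, emb)      # gradient for the shared policy layer
    demb = policy_layer.T @ dlogits        # backprop into the embedding
    g_embed = np.outer(demb * (1.0 - emb ** 2), state)
    return g_policy, g_embed

def train_step(samples, lr=0.1):
    """samples: {skill: (state, action, return)}, one per separate environment."""
    g_shared = np.zeros_like(policy_layer)
    g_embs = {}
    for skill, (state, action, ret) in samples.items():
        g_policy, g_embed = reinforce_grads(skill, state, action, ret)
        g_shared += g_policy               # shared layer sums gradients from all skills
        g_embs[skill] = g_embed
    for skill, g in g_embs.items():
        embedders[skill] += lr * g         # each embedder only sees its own skill
    policy_layer[...] += lr * g_shared

# One update with a sample from each skill's environment.
s_red, s_green = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
train_step({"collect_red": (s_red, 2, 1.0), "evade_green": (s_green, 0, -1.0)})
```

The part to look at is `train_step`: the policy layer accumulates gradients across skills, while each embedder is updated only with the gradient computed from its own environment.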

Phase 2

This gives us a nice modular structure that allows us to do stuff like,

$$\pi_c(s) = P\big(e_c;\ \theta_P\big), \qquad e_c = C\big(e_1(s),\ e_2(s);\ \theta_c\big)$$

Or, the policy layer outputs a policy for the composed task given a composed embedding $e_c$, which is obtained by combining embeddings from skill 1 and skill 2. Note that the composition embedding parameters, $\theta_c$, are the only parameters that need to be learned now, as the skill embedding and policy layer parameters have already been learned in phase 1 and are kept fixed. This equation can be realized in practice using the following neural network architecture.

Only the composition module needs to be trained for a new task

We call this architecture ComposeNet.
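A small sketch of what phase 2 looks like in code, under the same simplifying assumptions as before (single-layer tanh modules, softmax policy layer, REINFORCE-style gradient). The skill embedders and policy layer below are random stand-ins for the phase 1 results, and the composition module is assumed to act on the concatenated skill embeddings; the real networks may well differ.

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, EMB_DIM, N_ACTIONS = 6, 4, 5

# Stand-ins for what phase 1 would produce. They are never updated below.
E_RED = rng.normal(0, 0.5, (EMB_DIM, STATE_DIM))    # "collect red" embedder
E_GREEN = rng.normal(0, 0.5, (EMB_DIM, STATE_DIM))  # "evade green" embedder
POLICY = rng.normal(0, 0.5, (N_ACTIONS, EMB_DIM))   # shared policy layer

# The only trainable parameters in phase 2: the composition module.
W_c = rng.normal(0, 0.1, (EMB_DIM, 2 * EMB_DIM))

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def composed_forward(state, W):
    skill_embs = np.concatenate([np.tanh(E_RED @ state), np.tanh(E_GREEN @ state)])
    comp_emb = np.tanh(W @ skill_embs)               # composed embedding
    return skill_embs, comp_emb, softmax(POLICY @ comp_emb)

def composition_grad(state, action, ret, W):
    """Gradient of ret * log pi_c(action | state) with respect to W only."""
    skill_embs, comp_emb, probs = composed_forward(state, W)
    dlogits = -probs
    dlogits[action] += 1.0
    dcomp = POLICY.T @ (ret * dlogits)               # flows through the frozen policy layer
    return np.outer(dcomp * (1.0 - comp_emb ** 2), skill_embs)

state = rng.normal(size=STATE_DIM)
action, ret = 2, 1.0
frozen = (E_RED.copy(), E_GREEN.copy(), POLICY.copy())
before = composed_forward(state, W_c)[2][action]
for _ in range(100):
    W_c += 0.05 * composition_grad(state, action, ret, W_c)
after = composed_forward(state, W_c)[2][action]
```

The gradient still passes through the policy layer, but only `W_c` is ever written to, so the same skill modules stay available for the next composed task.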

What’s nice about ComposeNet’s modular structure is that we can construct arbitrary trees, or hierarchies, of skills and their compositions, which can be learned very quickly. For example, here is the composition for the task “collect the red object while evading the green enemy and the blue enemy”,

and the network,

Only the composition modules are trained
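The reason trees work is that a composition module outputs an embedding of the same size as its inputs, so its output can feed another composition module. Below is a forward-pass-only sketch of one possible tree for this task. The particular nesting, the random weights and the helper names are assumptions for illustration, not the exact arrangement in the figure.

```python
import numpy as np

rng = np.random.default_rng(2)
STATE_DIM, EMB_DIM, N_ACTIONS = 6, 4, 5

def make_skill():
    """Leaf node: a (pre-trained, frozen) skill embedder. Random weights here."""
    W = rng.normal(0, 0.5, (EMB_DIM, STATE_DIM))
    return lambda state: np.tanh(W @ state)

def make_composition(left, right):
    """Internal node: maps two child embeddings to one embedding of the same size."""
    W = rng.normal(0, 0.5, (EMB_DIM, 2 * EMB_DIM))
    return lambda state: np.tanh(W @ np.concatenate([left(state), right(state)]))

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

collect_red, evade_green, evade_blue = make_skill(), make_skill(), make_skill()

# One hypothetical tree: (collect red while evading green) while evading blue.
root = make_composition(make_composition(collect_red, evade_green), evade_blue)

POLICY = rng.normal(0, 0.5, (N_ACTIONS, EMB_DIM))  # shared policy layer stand-in
state = rng.normal(size=STATE_DIM)
root_embedding = root(state)
action_probs = softmax(POLICY @ root_embedding)
```

In training, only the weights inside the `make_composition` nodes would be updated; the leaves and the policy layer would stay fixed.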


So how well does this architecture do? Here are graphs for some composed tasks. On the x-axis is number of training steps. On the y-axis is the average episodic reward over 50 evaluation runs.

Collect blue while evading green (0.45)
Collect red then green (0.53)
Collect red while evading green and blue (0.01)

In the brackets is the zero-shot reward for the transfer setting (explained in the next subsection).
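For anyone wondering how the y-axis number is computed, here is a toy sketch of "average episodic reward over 50 evaluation runs". The corridor environment is purely hypothetical and stands in for the gridworld; its reward scheme (+1 for success, -1 for failure, a small step cost) mirrors the one described earlier, with the step cost value of 0.01 being my own assumption.

```python
class Corridor:
    """Toy 1-D world: start at cell 0, succeed by reaching the last cell."""

    def __init__(self, length=5, max_steps=20, step_cost=0.01):
        self.length, self.max_steps, self.step_cost = length, max_steps, step_cost
        self.pos, self.t = 0, 0

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        """action is -1 (left) or +1 (right). Returns (state, reward, done)."""
        self.t += 1
        self.pos = min(max(self.pos + action, 0), self.length - 1)
        reward, done = -self.step_cost, False
        if self.pos == self.length - 1:
            reward, done = reward + 1.0, True   # success
        elif self.t >= self.max_steps:
            reward, done = reward - 1.0, True   # failure by timeout
        return self.pos, reward, done

def episode_return(env, policy):
    state, total, done = env.reset(), 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total

def average_episodic_reward(make_env, policy, n_runs=50):
    """Mean undiscounted return over n_runs fresh evaluation episodes."""
    return sum(episode_return(make_env(), policy) for _ in range(n_runs)) / n_runs

always_right = lambda state: +1
always_left = lambda state: -1
good = average_episodic_reward(Corridor, always_right)
bad = average_episodic_reward(Corridor, always_left)
```

A policy that walks straight to the goal scores just under 1 because of the step cost, and one that never gets there scores below -1, which is why the curves in the graphs live roughly in that range.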

The orange line (ComposeNet) is our method. It always learns the task to near optimality. Hence, it is possible to learn to compose already learned skills in this modular way. Note that the same skills are reused for many different tasks without modification. It is more efficient to do so than learn the task from scratch (solid blue line) every time.

ComposeNet only needs to train the composition layer to map two or more skill embeddings to the composed task rather than learn everything from scratch. This big advantage is afforded to it by its modular structure and the conditional independence of policies from the state, given the embeddings. In the following section, we provide further evidence of the importance of composition by showing that slightly modifying single skills does not work as well as composing the correct ones. So the advantage of the architecture is not just in being able to pre-train embeddings and use them over and over again, but also in being able to compose many of them together to solve unseen tasks quickly.

Also shown here is what happens if the pre-trained skills are provided as actions to the agent (metacontroller). Initially we see a quick jump in average reward, but convergence to the optimal policy is slow. To understand this, let’s take the example of the “collect blue object while evading the green enemy” task. Provided an option to get to the blue object, the agent quickly learns that some of the time it can get a large reward by making a beeline to it. So it learns to spam a single action. But it takes longer to learn the more complicated control policy of alternating between running away from the enemy and moving towards the goal. Moreover, this is not even the optimal policy. The optimal behavior is blending the two skills together to make progress towards the goal while moving away from the enemy at the same time.

For more results, check out our arXiv paper!

Zero Shot Compositions

Can these functions learn specific compositions and apply them to unseen settings? For example, if I train a layer to do the “while” composition on all but one composed task, and test on the held-out task, will it generalize to it? The green line shows this composition “transfer” case. It shows good zero-shot transfer (high rate of success with 0 training steps) and quick adaptation to the optimal policy for the new task.
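The hold-out protocol itself is simple enough to sketch. The task set below (every "collect X while evading Y" pair over three colors) is an assumption for illustration and may not match the exact set of tasks we trained on.

```python
from itertools import permutations

COLORS = ["red", "green", "blue"]

def while_tasks(colors):
    """All (collect, evade) pairs with two different colors."""
    return list(permutations(colors, 2))

def leave_one_out(tasks, held_out):
    """Train on every task except held_out, then test zero-shot on held_out."""
    if held_out not in tasks:
        raise ValueError(f"unknown task: {held_out}")
    return [t for t in tasks if t != held_out], held_out

tasks = while_tasks(COLORS)
train_tasks, test_task = leave_one_out(tasks, ("blue", "green"))
```

The composition layer would be trained on `train_tasks` only. The zero-shot number is then the average reward on `test_task` before any gradient step is taken on it.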

Are the skills really that important?

The advantage of the modular architecture of ComposeNet is that for any new task, one only needs to train the composition layer to map embeddings from skills to the composed task. But what if we gave the network the wrong skills? Do the embeddings really learn skill specific representations or will the wrong embeddings work just as well? Also, what if we slightly modify one of the skills, let’s say by re-initializing a new policy layer, and try to train that to the composed task? Does the composition really give us an advantage, or will any embedding do just as well with a fresh layer on top? In other words, did we do some ablations!?

Collect red while evade green task with the incorrect skills.

In the graph above, legend entries written as a composition function applied to skills denote ComposeNet using those skills, e.g. the red and green skills. The lines that have just a skill marking are the case where only a single skill embedder is used, and a fresh policy layer is trained on top of the embeddings it produces to try to control the composed task. The task is to “collect the red object while evading the green enemy”.

This shows us a few things. Firstly, using any other skills than the correct ones does not perform as well.

Next is the case of using the skills “collect green object” and “evade red enemy”, i.e. the opposite of what the task requires. This performs nearly as well as the correct skills because you can invert the policy for collect to evade and almost invert the policy for evade to collect. Given two completely irrelevant skills, “collect blue object” and “evade blue enemy”, the task is not learned at all. Meaning, compositions of relevant skills perform better than compositions of irrelevant skills.

Finally, retraining a fresh policy layer on top of the “collect red object” skill embedding also gets close but not all the way to the best average reward. Retraining a policy layer on top of “evade green enemy” skill does not solve the task, likely because the embedding is not informative on the red object, which is necessary to carry out this task. This shows that the skill embeddings are encoding useful information about the objects they are concerned with, and that composition of all the correct skills gives the best result.


Our framework ComposeNet allows an agent to compose simple skills into a hierarchy to solve complicated tasks. The skills are learned once and can be reused for multiple compositions. Key in the framework are skill-state embeddings and a trainable composition function. Moreover, when testing on composed tasks it has never seen before, ComposeNet shows some zero-shot generalization capability, and quickly converges with few environment samples. Stay tuned for results on more complicated domains, such as Minecraft!


[1] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res. (JAIR), 13(1):227–303, Nov. 2000.

[2] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.

[3] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683, 2016.

[4] H. van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang. Hybrid reward architecture for reinforcement learning. arXiv preprint arXiv:1706.04208, 2017.

[5] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.

[6] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.