Reinforcement learning algorithms for complex decision-making and planning

Kajal Singh

Project management and project planning involve making complex decisions under uncertainty. Algorithms have long assisted in the planning process. With advances in artificial intelligence, reinforcement learning (RL) algorithms could play an essential role in project planning in the presence of unknowns.


In this blog post, we consider an example of reinforcement learning applied to planning for real-life projects in the healthcare sector. We first give an overview of reinforcement learning and a simplified taxonomy of RL algorithms. We then show how these ideas apply to decision-making in healthcare. Finally, we discuss considerations for using RL in complex planning scenarios.

Overview of Reinforcement Learning

RL is a goal-oriented approach in which an agent (the decision maker) interacts with an environment and learns a policy that maximizes a long-term reward. At each step, the agent receives feedback on the action it took, which it uses to improve subsequent actions. This sequential decision-making process is formalized as a Markov Decision Process (MDP). Four components define an MDP:

  • A state space S: the state s_t at each time t belongs to S.

  • An action space A: the action a_t at each time t belongs to A.

  • A transition function P(s_{t+1} | s_t, a_t), which gives the probability of the next state given the current state and action.

  • A reward function r(s_t, a_t), which gives the observed feedback for a state-action pair.

Combining deep learning (deep neural networks) with reinforcement learning has given rise to deep reinforcement learning (DRL). The power of DRL has been demonstrated in game playing, for example by AlphaGo. DRL could also be applied to complex planning problems in project management.
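To make the four MDP components concrete, here is a minimal Python sketch. The two states, two actions, probabilities, rewards, and the discount factor gamma are all hypothetical values chosen for illustration; none of them come from a real application.

```python
import random

# Hypothetical toy MDP: every name and number below is an illustrative assumption.
STATES = ["stable", "unstable"]      # state space S
ACTIONS = ["wait", "intervene"]      # action space A

# Transition function: P[(s, a)] maps each next state to its probability.
P = {
    ("stable", "wait"): {"stable": 0.9, "unstable": 0.1},
    ("stable", "intervene"): {"stable": 0.95, "unstable": 0.05},
    ("unstable", "wait"): {"stable": 0.2, "unstable": 0.8},
    ("unstable", "intervene"): {"stable": 0.6, "unstable": 0.4},
}

# Reward function r(s, a): feedback for each state-action pair.
R = {
    ("stable", "wait"): 1.0,
    ("stable", "intervene"): 0.5,
    ("unstable", "wait"): -1.0,
    ("unstable", "intervene"): -0.5,
}


def step(state, action, rng):
    """Sample the next state from P and return it with the reward r(s, a)."""
    dist = P[(state, action)]
    next_state = rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
    return next_state, R[(state, action)]


def rollout(policy, start, horizon, rng):
    """Follow a fixed policy (a dict from state to action) and collect rewards."""
    state, rewards = start, []
    for _ in range(horizon):
        state, reward = step(state, policy[state], rng)
        rewards.append(reward)
    return rewards


def discounted_return(rewards, gamma=0.9):
    """Long-term reward: sum of gamma**t * r_t (gamma is an assumed discount)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    policy = {"stable": "wait", "unstable": "intervene"}
    print(discounted_return(rollout(policy, "stable", 10, rng)))
```

An RL agent would search for the policy that makes this discounted return as large as possible on average, rather than following a fixed policy as the sketch does.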

Let us next consider a simplified taxonomy of RL algorithms. At the top level, RL algorithms can be classified as model-free or model-based. Here, the term ‘model’ refers to a model of the environment in which the agent operates. In the model-based approach, you first obtain a model of the environment (either it is provided to you or you build it). In the model-free approach, you do not know the environment in advance and derive your policy from the experience you collect.

Model-free RL algorithms are further divided into policy-based and value-based methods. In policy-based methods, you explicitly build a representation of the policy and keep it in memory during learning. In value-based methods, you store only a value function, not an explicit policy; the policy is implicit and can be derived directly from the value function. Q-learning is an example of a value-based method, and actor-critic methods combine the two approaches.

You can also classify RL algorithms as on-policy or off-policy. This division depends on which next action the learning update assumes: the best available action in the next state, regardless of what the agent actually does (off-policy, e.g. Q-learning), or the action that the current policy actually selects (on-policy, e.g. SARSA).
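The off-policy versus on-policy distinction is easiest to see by putting the two tabular update rules side by side. The sketch below uses the standard Q-learning and SARSA updates; the learning rate alpha, the discount gamma, and the dictionary-based Q table are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best action available in s_next."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next the current policy chose."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])


def greedy_action(Q, s, actions):
    """Value-based methods keep no explicit policy; it is read off from Q."""
    return max(actions, key=lambda b: Q[(s, b)])


if __name__ == "__main__":
    actions = ["x", "y"]
    Q = defaultdict(float)  # unseen state-action pairs default to 0.0
    Q[("s1", "x")], Q[("s1", "y")] = 1.0, 2.0
    q_learning_update(Q, "s0", "x", 1.0, "s1", actions)
    sarsa_update(Q, "s0", "y", 1.0, "s1", "x")
    print(Q[("s0", "x")], Q[("s0", "y")], greedy_action(Q, "s1", actions))
```

The only difference between the two updates is the bootstrap term: Q-learning takes a maximum over next actions, while SARSA uses the value of the next action that was actually selected.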

Decision-making in Healthcare

With this background, let us now explore deployment in real-life (non-game) scenarios, e.g. healthcare. In the healthcare domain, the clinical process is dynamic, and the underlying 'rules' for making clinical decisions are usually unclear. The traditional way to formulate rules and best practices in healthcare is to conduct clinical trials, but that is not always feasible. The alternative is to learn from observational data collected as care progresses.

Consider the problem of weaning patients in the intensive care unit (ICU) from mechanical ventilation (MV). Here, an RL algorithm can be used to determine ICU strategies for administering MV. The problem is formulated as a decision support tool that alerts clinicians when a patient is ready to be weaned off MV and recommends a personalized treatment protocol.

The researchers used an existing database (MIMIC-III) to train the RL agent. The state was a 32-dimensional feature vector, and the action was a 2-dimensional vector (MV on/off and the sedation dosage level). (Example adapted from Deep Reinforcement Learning for Clinical Decision Support: A Brief Survey.)


In this setting, we can consider the following:

  • The agent is a clinician.

  • The state is the wellbeing/condition of a patient.

  • An action is a treatment that the clinician gives to the patient.

  • The transition function P(s_{t+1} | s_t, a_t) can be viewed as the patient's biological system.

  • The reward represents an improvement or decline in wellbeing.
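As a rough illustration of how the MV weaning example maps onto these components, the sketch below encodes the 32-dimensional state and the 2-dimensional action described above. The class names, the helper functions, and the choice of four discrete sedation levels are hypothetical; the text above does not specify the actual features or dose levels.

```python
from dataclasses import dataclass
from typing import List, Sequence

STATE_DIM = 32  # from the text: the state is a 32-dimensional feature vector


@dataclass(frozen=True)
class Action:
    """2-dimensional action: MV on/off plus a sedation dosage level."""
    ventilation_on: bool
    sedation_level: int  # index into an assumed set of discrete dose levels


@dataclass(frozen=True)
class Transition:
    """One observed step (s_t, a_t, r_t, s_{t+1}) from retrospective records."""
    state: List[float]
    action: Action
    reward: float
    next_state: List[float]


def make_transition(state: Sequence[float], action: Action,
                    reward: float, next_state: Sequence[float]) -> Transition:
    """Validate the state dimension before storing a transition."""
    if len(state) != STATE_DIM or len(next_state) != STATE_DIM:
        raise ValueError(f"states must have {STATE_DIM} features")
    return Transition(list(state), action, reward, list(next_state))


def action_space(num_sedation_levels: int = 4) -> List[Action]:
    """Enumerate all actions; four sedation levels is an assumed value."""
    return [Action(on, level)
            for on in (True, False)
            for level in range(num_sedation_levels)]


if __name__ == "__main__":
    zeros = [0.0] * STATE_DIM
    t = make_transition(zeros, Action(True, 2), reward=0.0, next_state=zeros)
    print(len(action_space()), t.action)
```

A batch of such transitions, extracted from patient records, is the kind of input a model-free algorithm could learn from without interacting with patients directly.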

Insights for complex decision-making and project planning

What insights can we draw when considering AI for complex decision-making and planning, such as in healthcare?

  • Learning from limited observational data: Relative to games, there is very little data. You cannot ‘play’ a scenario in trial-and-error mode in a clinical setting.

  • State, action, and reward definition: Defining the state, action, and reward spaces for clinical applications is a challenge.

  • Performance benchmarking: There are no established performance benchmarks because there are not yet many successful applications.

  • Exploration/exploitation: An agent learns by balancing exploration and exploitation, and striking that balance is hard when actions cannot be tried freely.

  • Data deficiency and Data Quality.
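To illustrate the exploration/exploitation point, here is a minimal sketch of epsilon-greedy action selection, one common way to trade the two off. The action names and values are hypothetical. Because a scenario cannot be 'played' by trial and error in a clinical setting, this kind of exploration only makes sense in a simulator or a game.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-valued one.

    q_values: dict mapping each action to its current value estimate.
    epsilon:  exploration rate in [0, 1]; an assumed tuning parameter.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)     # exploit


if __name__ == "__main__":
    rng = random.Random(0)
    q = {"wait": 0.2, "intervene": 0.7}  # hypothetical value estimates
    picks = [epsilon_greedy(q, 0.1, rng) for _ in range(1000)]
    print(picks.count("intervene") / len(picks))
```

A higher epsilon gathers more information about rarely tried actions at the cost of more poor choices along the way.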


Reinforcement learning is a relatively new but powerful approach. RL has many applications, especially planning under uncertainty, even in complex (non-game) scenarios. However, as noted above, additional considerations may be necessary.

Related: Agile for Data science: is there an impedance mismatch, and what are the implications? 


Deep Reinforcement Learning for Clinical Decision Support: A Brief Survey
