
Agile for Data Science: is there an impedance mismatch, and what are the implications?

Kajal Singh

In 2009, Netflix awarded a $1 million prize to a team of developers for improving its recommendation system's accuracy by 10%. But Netflix never used the winning solution in production.

The engineering complexity and the cost of deploying the winning solution in production were too high, as Netflix explained on its blog. The accuracy gains from the winning improvements "did not seem to justify the engineering effort needed to bring them into a production environment."

This situation highlights a broader problem – one we see increasingly as the worlds of research and enterprise interact.

Agile has become the de facto standard for data science and AI in the enterprise world. Its strength lies in its adaptability to changing project requirements. The methodology calls for frequent engagement with stakeholders and for sharing early results, so that feedback arrives in parallel with development. Acquiring early feedback secures end-user buy-in from the outset of a project.

But while a customer-driven approach may seem logical in theory, it may not be ideal in the world of research. Many data scientists come to industry from academia and research backgrounds. The research mindset tends toward more in-depth exploration, extensive experimentation, structured methodologies, and the consideration of alternative hypotheses. Such detailed work takes longer and is, by definition, not 'agile'. Hence, balancing stakeholder demands in agile against research-centered development can lead to an impedance mismatch.

Agile for research – the differences

Today, most tech companies use Agile methodologies. Methods such as Scrum and Kanban are flexible enough for short-term development phases, but they are not designed for long-term, research-oriented development. Hence, to apply agile to long-term research problems in data science, we need a framework that captures the best ideas from agile and adapts them for research-centered development.

Broadly, there are three types of research projects:

  • Long-term (6 months and beyond)
  • Medium-term (2 to 6 months)
  • Short-term (1 to 2 months)

Most companies adopt a mindset of short-term project development (even if the project itself spans a long duration). For example, in most interviews you are asked how well you can manage tight deadlines, fast-paced project delivery, and so on. Rarely are you asked about incorporating contradictory perspectives (an essential requirement of a literature review in research).

How can agile practices be adapted into data science research and vice versa?

Some agile practices can be applied to data science. Here are some examples: 

  • Notebooks, e.g. Jupyter notebooks, already bring a mindset of frequent experimentation to software development. A traditional IDE would require data to be read, loaded, and processed on every execution. In a notebook, the data stays persistent in memory for the entire active session, allowing self-contained changes in each cell.

  • Start development with basic, traditional ML algorithms that are easy to model and implement. This creates a baseline. Once you have a baseline, it is easier to experiment with hyperparameters and more complex models, because every change can be compared against it and you get a clearer sense of which solution best fits your problem (see the first sketch after this list).

  • Adopt DevOps practices. Keeping development within DevOps best practices helps you deploy your project to production with the least effort and the shortest timeline (the second sketch below illustrates one such check).
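
First, a minimal sketch of the baseline-first idea. It assumes a generic tabular classification task; the dataset (scikit-learn's bundled breast-cancer data), the two models, and the metric are illustrative stand-ins rather than a prescribed setup.

    # Baseline-first sketch: fit a simple model, then judge a more complex one against it.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # 1. Simple, easy-to-explain baseline
    baseline = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

    # 2. More complex candidate, judged against the baseline rather than in isolation
    candidate = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
    candidate_acc = accuracy_score(y_test, candidate.predict(X_test))

    print(f"baseline accuracy:  {baseline_acc:.3f}")
    print(f"candidate accuracy: {candidate_acc:.3f}")

The point is not the specific models: it is that every later experiment has a fixed reference to beat.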
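
Second, a hypothetical sketch of the DevOps point: a training step that serializes the model, plus a CI-style quality check that must pass before anything is promoted. The file name, accuracy threshold, and dataset are assumptions made purely for illustration.

    # CI-style check sketch: train, serialize the model, and assert a quality bar.
    import joblib
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def train_and_save(path="model.joblib"):
        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
        joblib.dump(model, path)  # artefact picked up by a later deployment step
        return accuracy_score(y_test, model.predict(X_test))

    def test_model_meets_quality_bar():
        # would typically run in CI (e.g. under pytest) on every commit
        assert train_and_save() >= 0.90

    if __name__ == "__main__":
        test_model_meets_quality_bar()
        print("model trained, saved, and quality check passed")

Run automatically on every commit, a check like this stops a regressed model from ever reaching production, which is exactly the kind of discipline the DevOps mindset brings.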

Conclusion

For software development today, Agile is the norm. We take it for granted. But sometimes we need to remind ourselves that other worlds exist beyond Agile methodologies. Agile suits environments that need frequent communication with the team and clients, faster delivery of product value, and a choice of the most relevant features from the outset. To achieve these goals, agile follows a set of disciplined, well-defined planning and execution steps. In contrast, data science demands heavy research, which means spending time on approaches that fail or turn out to be sub-optimal.

In this sense, there is an impedance mismatch between agile and research processes.

As we have seen above, there is already a cross-pollination of ideas between the research world and the enterprise ecosystem for building AI applications. 

It is tempting to say that 'research should embrace agile and should be customer-driven'.

But that could be too simplistic.

The hardest thing for enterprises to adopt could be the tolerance for failure. 

Research embraces failure. 

In the words of Thomas Edison – “I have not failed, I have just found 10,000 things that do not work."

How many of us would be able to tell that to our customers?  

Much as agile is the norm today, it is not well suited for long-term research projects. As the worlds of research and enterprise converge through data science, we expect new methodologies to evolve to cater to these requirements.

 

[Image: Thomas Edison quote. Source: https://www.geckoandfly.com/25857/quotes-thomas-edison/]


About Me (Kajal Singh)

Kajal Singh is a Data Scientist and a Tutor at the Artificial Intelligence – Cloud and Edge Implementations course at the University of Oxford. She is also the co-author of the book "Applications of Reinforcement Learning to Real-World Data: An educational introduction to the fundamentals of Reinforcement Learning with practical examples on real data" (2021).