Data Science: From Algorithms to Production

Almost every company has a data science team. As a team lead, it is your responsibility to make the team effective, that is, to ship new things that add business value to production and to keep improving over time.

Having data science algorithms in production is the end goal. Getting to production does not happen on its own though. It is entirely possible to have a situation where a team of talented people is working hard on mathematically complex algorithms in Jupyter notebooks that never quite manage to make it into the finished product.

This article covers best practices for working past the roadblocks a data science team meets on the way to production: making sure to work on the right problems, building a robust data processing system, and advocating for your team within the organization.

Working on the right problems, incrementally

Let’s say your team is tasked with building product recommendations for an online shop that sells sports equipment. The shop has never displayed product recommendations on its website before. Your data scientists are excited about implementing the latest collaborative filtering algorithm they’ve been reading about and want your input. What should you say?

In this case, it makes sense to redirect the enthusiasm of your team members from “what is most academically interesting to implement” to “how can we, as a team, make a quick and solid contribution towards the business goals of the company, namely increasing sales and user engagement?”

This might mean putting the machine learning papers aside for a moment and implementing a simple algorithm that displays product recommendations based on each user’s most-searched-for category. It might also mean monitoring how often these recommendations are clicked. This can take a few days or weeks of work and, once integrated into the website and monitored over time, can inform further decisions. Further improvements, which might include a collaborative filtering approach, can then be compared against this initial working baseline.
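As a rough illustration of how small such a first version can be, here is a minimal sketch in Python. The shape of the search log (pairs of user ID and category), the catalog dictionary, and the function names are all assumptions made for this example, not a description of any particular shop’s system.

```python
from collections import Counter


def most_searched_category(search_log, user_id):
    """Return the category this user searched for most often, or None.

    search_log is assumed to be an iterable of (user_id, category) pairs.
    """
    counts = Counter(category for uid, category in search_log if uid == user_id)
    if not counts:
        return None  # the user has no search history yet
    return counts.most_common(1)[0][0]


def recommend(search_log, catalog, user_id, limit=3):
    """Recommend up to `limit` products from the user's top category.

    catalog is assumed to map a category name to a list of product names.
    Users without history, or with an unknown category, get an empty list.
    """
    category = most_searched_category(search_log, user_id)
    return catalog.get(category, [])[:limit]


# Hypothetical data for illustration only.
search_log = [("u1", "running"), ("u1", "cycling"), ("u1", "running"), ("u2", "yoga")]
catalog = {"running": ["shoes", "socks", "watch", "cap"], "yoga": ["mat"]}
print(recommend(search_log, catalog, "u1", limit=2))
```

A baseline like this is deliberately unsophisticated. Its value is that it can be shipped quickly and gives later, more complex approaches something concrete to beat.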

It can be quite difficult to motivate smart people with PhDs to work on “low-hanging fruit”, such as finding each user’s most-searched-for product category. Some ways to convince them are:

Point out that a simple algorithm can always be made more complex incrementally, whereas simplifying a complex one is much harder. Building simple things first means the algorithm is more likely to survive in some form in the long term.
Allow, and even encourage, data scientists to pursue side projects, take part in Kaggle competitions, and read papers during working hours. Maybe even organize internal hackathons where data scientists can go wild with data without caring about integrating the end result into the product. These activities can keep them sharp and inspired.
Arrange for data scientists to have bi-weekly or monthly one-on-one meetings with business stakeholders. This way they stay connected to the company vision.

It is important to ask your team to tell you regularly what they’re working on and why, so that you understand what they’re hoping to achieve. There are several ways to keep track of tasks, such as Kanban boards or JIRA. It is also useful to keep a history of the ideas people have had and whether they worked out, not to judge individuals but to develop a collective intuition of what is likely to work and what has already been tried.

As for yourself, it makes sense to have weekly (or more frequent) meetings with the business side of things. This is to make sure that the team’s to-do items correspond to actual business needs, and are updated as these needs change. To make the meetings productive, ask a lot of questions to discern the actual requirements from buzzwords, and clarify expectations. As you are the one with the most knowledge about data, make suggestions about what may be possible with the data you have. Here are some points you can raise:

We don’t have enough data to apply [deep learning algorithms].
When [we have integrated machine learning] in our recommendations, what value do you expect it to bring to the business?
The customer service team spends [x hours per month] classifying the severity of support tickets. We can build a tool to automatically classify them to free up time. What do you think?

Once you’ve clarified the direction the data science team should be going, it’s time to make it as easy as possible to get there.

Building a robust system

Robustness refers to how well a data processing system responds to unexpected input and loads, how quickly issues become apparent and get resolved, and whether there is a mechanism for integrating outside feedback into the system. You want to keep a close eye on how the system behaves with respect to technical errors, but also with respect to business outcomes.

Carrying on with the example of generating recommendations, in order to avoid errors, you’ll need to keep track of what happens when the user’s favorite category gets deleted or re-organized. More importantly, you’ll also need to monitor how business metrics are affected by the changes, such as the click-through ratio on the recommended links.
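To make these two concerns concrete, the sketch below pairs a defensive fallback for a missing category with a basic click-through ratio calculation. The fallback name "bestsellers", the set of active categories, and both function names are assumptions chosen for illustration.

```python
def safe_category(favorite, active_categories, fallback="bestsellers"):
    """Return the user's favorite category if it still exists, else a fallback.

    This guards against categories that were deleted or re-organized after
    the favorite was computed.
    """
    return favorite if favorite in active_categories else fallback


def click_through_ratio(impressions, clicks):
    """Clicks divided by impressions; 0.0 when nothing has been shown yet."""
    if impressions == 0:
        return 0.0
    return clicks / impressions


# Hypothetical values for illustration only.
active = {"running", "cycling", "bestsellers"}
print(safe_category("skiing", active))   # "skiing" no longer exists
print(click_through_ratio(200, 10))
```

The first function covers the technical failure mode, the second the business metric. Both are cheap to add early and give you something to alert on before the algorithm itself grows more complex.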

Investing effort into making the recommendation generation more complex (for example, taking into account the user’s gender or browser language) without a system in place to keep track of the affected business metrics is a misuse of resources. The complexity of the algorithms and the complexity of the system that ensures quality and provides feedback need to grow at a similar pace.

As a team lead, it is your job to set up the infrastructure that enables continuous integration, delivery, and evaluation of the data science algorithms. You will need to lay some groundwork yourself and advocate for hiring key people to support this effort, namely data engineers. These are the people who will write comprehensive regression tests on large datasets, set up and monitor data pipelines, and provision machines with the necessary dependencies for data scientists to work on. They will also build custom tools that combine business outcome data with data about the algorithm versions that produced those outcomes. This combined data is precisely the feedback that needs to be integrated before work continues.
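In its simplest form, combining outcome data with algorithm versions could look like the following sketch. It assumes each recommendation shown is logged as a pair of algorithm version and whether it was clicked. The event format and the function name are hypothetical, and a real pipeline would read from a data store rather than an in-memory list.

```python
from collections import defaultdict


def ctr_by_version(events):
    """Compute the click-through ratio for each algorithm version.

    events is assumed to be an iterable of (algorithm_version, was_clicked)
    pairs, one per recommendation shown to a user.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for version, was_clicked in events:
        shown[version] += 1
        if was_clicked:
            clicked[version] += 1
    # Every version in `shown` has at least one impression, so no division by zero.
    return {version: clicked[version] / shown[version] for version in shown}


# Hypothetical log for illustration only.
events = [
    ("v1", True), ("v1", False), ("v1", False), ("v1", False),
    ("v2", True), ("v2", True),
]
print(ctr_by_version(events))
```

With so few events the numbers mean nothing statistically. The point is only that every outcome is tagged with the version that produced it, so versions can be compared at all.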

It may be of interest to watch this talk by Jesse T. Anderson. He goes into depth about the importance of having data engineers on your team, and of having them in the right numbers, which means having more of them than actual data scientists.

Building relationships within the organization

Even if the data science team is diligently working on value creation, backed by data engineers and solid processes, external roadblocks can still arise. These roadblocks can take the form of a different team changing shared systems, such as APIs, without consulting the data science team first. It can also simply be that the rest of the organization remains unaware of the value the data science team provides.

It is never too late to build strong relationships with other teams, but it is most effective to do so proactively, before serious misunderstandings occur. If your company has periodic internal presentations, make sure to present what you’re working on. If there’s a system in place to keep track of which algorithms caused which business outcomes, as described in the previous section, put together a presentation about the value the data science team provides or, perhaps, about what was tried and didn’t quite work.

Invite people from other departments to come to you with questions and ideas. There are some departments that have a naturally close relationship with the data science team, such as Marketing & Analytics. Make sure to understand how they work and how they can provide you with data on business metrics, such as click-through rates and conversion rates.

For collaborating with other technical teams in particular, there are well-known patterns you can adopt. Let’s assume a micro-service architecture, where recommendations are served by a lightweight web app or a Lambda. To test the interaction between the recommendation service and other services when updates to either of them are deployed, one can use consumer-driven contract tests and add those tests to the Continuous Integration process. You can read this article by Ian Robinson for a thorough analysis of this testing pattern. The consumer-driven contract tests will fail when there’s been a breaking change (for example, an API changes and its clients still expect the old kind of data). This will be an immediate alert that one or more services need to be adjusted before deploying to production, and will ensure that different teams remain in sync.
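The idea behind a consumer-driven contract can be sketched without any particular framework: the consuming service declares the fields and types it relies on, and the provider’s build checks its responses against that declaration. The contract format, the field names, and the function below are assumptions made for this example. Dedicated contract-testing tools offer much more than this.

```python
# The consumer (for example, the website front end) declares what it needs
# from the recommendation service. Field names here are hypothetical.
CONSUMER_CONTRACT = {"user_id": str, "product_ids": list}


def contract_violations(response, contract):
    """Return a list of human-readable problems; empty means the contract holds."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems


# A response that satisfies the consumer, and one after a breaking change.
ok_response = {"user_id": "u1", "product_ids": ["p1", "p2"], "extra": 1}
broken_response = {"user": "u1", "product_ids": "p1,p2"}

print(contract_violations(ok_response, CONSUMER_CONTRACT))
print(contract_violations(broken_response, CONSUMER_CONTRACT))
```

Note that extra fields in the response do not count as violations, since the consumer only asserts what it actually uses. Run as part of Continuous Integration, a non-empty result would fail the build before the breaking change reaches production.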

Conclusion

As a data science team lead, you may be overwhelmed with decisions and responsibilities. The single most important thing is to make sure you and your team are moving in the right direction by solving tangible business problems. This is a constant process of managing communication with stakeholders as priorities change. Building a sustainable development process and cultivating strong relationships within the organization can accelerate your progress.

Without a basis of providing business value and being able to measure it, technical excellence will be of limited use and will not be enough to build trust with other teams. The goal is to incrementally provide enough value to inspire the trust that, as the business changes and evolves, the data science team will be equally ready to adapt and make good use of existing data and opportunities.