What we just learned about data science, and what's next

2020 could be called The Year Data Science Grew Up. Organizations of all kinds significantly ramped up their adoption of data-oriented applications and turned to data science to solve their problems, with varying degrees of success. In the process, data science was increasingly called upon to show its maturity and prove its real value by demonstrating that it actually worked in production.

The emergence of a deadly global pandemic threw a wrench into designs, not all of them good, that had accumulated over years and become difficult to maintain, adjust, or improve upon today. COVID-19 required the rapid analysis and sharing of massive amounts of data. Predictive models were run and updated with new urgency amid constantly changing conditions, with all the world judging their accuracy and integrity.


The past 12 months have revealed how valuable data science can be, while also exposing its limitations. In 2020, there were numerous challenges to data science's credibility, adaptability, and ultimate usefulness that will need to be addressed in 2021.

Let's look at the key levers.

Data science in 2020

The proliferation of data science, while exciting, falsely suggested that the field is now somehow settled. On the contrary, data science remains very much a "new" field, innovating at a rapid clip.

If one followed the hype cycle, data science appeared to go mainstream in 2020, with vendors across the landscape co-opting AI. Every product or service seemed to have artificial intelligence somehow attached, no matter how loosely. As a result, expectations rose to impossible heights, with companies expecting smart data solutions to solve all of their problems. Data science just doesn't work that way.

Fortunately, people are now moving beyond the hype and asking the right questions in order to understand what data science can and can't accomplish. As a result, data science is now judged on its quality and on the return on investment it can deliver when it is constructed the right way.

Adaptability challenges

One of the fundamental challenges of data science has always been finding a way to repeatedly and reliably take a model from creation into production. Failing to do so can significantly hinder the realization of ROI, which was certainly the case after the onslaught of COVID-19. Consider all the behaviors that changed throughout the pandemic. Machine learning models built prior to COVID-19 needed at least an update, if not an entire redesign and retraining, to account for these changes.

Depending on the problem domain and what the models were asked to solve for, the new reality might look radically different from the pre-COVID world, so much so that the millions of data points relied upon for insights break down because the old base assumptions no longer hold. Models needed to be updated to incorporate new data and adjust to the new reality, and the entire process from model creation to production had to be revisited.
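To make the idea concrete, here is a minimal sketch, assuming hypothetical feature data and standard Python libraries (SciPy and scikit-learn), of how a team might check whether pre-pandemic training data still reflects current behavior before deciding to retrain. The feature values, threshold, and labels are illustrative placeholders, not a prescribed workflow.

```python
# A minimal sketch of a drift check that could trigger retraining.
# Feature values, threshold, and labels below are hypothetical placeholders.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression

def distribution_shifted(reference: np.ndarray, recent: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift when recent feature values no longer match the training-time distribution."""
    _, p_value = ks_2samp(reference, recent)
    return p_value < alpha

rng = np.random.default_rng(42)
pre_covid_feature = rng.normal(loc=50, scale=10, size=5_000)  # e.g., a behavioral metric before 2020
current_feature = rng.normal(loc=80, scale=25, size=5_000)    # the same metric under changed conditions

if distribution_shifted(pre_covid_feature, current_feature):
    # Base assumptions no longer hold: retrain on recent data instead of the stale historical set.
    X_recent = current_feature.reshape(-1, 1)
    y_recent = (current_feature > 70).astype(int)             # placeholder labels for illustration
    refreshed_model = LogisticRegression().fit(X_recent, y_recent)
```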

Because this has traditionally been quite difficult to do, and because companies were suddenly forced to revise models rapidly, the rigor and frequency with which models were tested slipped. Models were instead created in a rush, without verification. This harmed the credibility of data science to some extent.


2020 highlighted the gap between the creation of sound, tested data science models and the deployment of production-ready models that can subsequently be modified as needed without reinventing the wheel. Fortunately, as the year winds down, we are beginning to see new approaches that eliminate this gap.

Bias in AI models

Another issue that struck at the heart of the credibility and usefulness of data science was bias. Social justice moved to the forefront in 2020, and the natural reaction was to try to eliminate bias wherever possible. Because every company had become an AI company, there was a push to remove bias from AI models, a task that is inherently problematic.

Often, when we remove bias from data science models to make them "non-discriminatory," we weaken the results and ultimately the value of the models. There is also the danger that when one component is removed from a model, something else creeps in, so that bias is not eliminated altogether but merely replaced by a different kind of bias.

Mitigating AI model bias is an important issue: as data science is increasingly relied upon to help drive decisions, we don't want those decisions to be prejudiced or unfair. How can we create and deploy data science in an ethical way? A model must be understandable, provable, and verifiable. This is undoubtedly an area that will be explored in greater depth in the months and years to come.
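As a small illustration of what "verifiable" might mean in practice, the sketch below, which assumes hypothetical predictions and a made-up protected attribute, measures one narrow notion of bias, the gap in positive-prediction rates between two groups, so that any mitigation can be checked rather than assumed.

```python
# A minimal sketch of measuring one narrow notion of bias in model output.
# Predictions and group labels are hypothetical placeholders.
import numpy as np

def demographic_parity_gap(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Absolute difference in positive-prediction rates between two groups (0 means parity)."""
    rate_a = y_pred[group == "A"].mean()
    rate_b = y_pred[group == "B"].mean()
    return abs(rate_a - rate_b)

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])                  # 1 = favorable decision
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])   # protected attribute

print(f"Demographic parity gap: {demographic_parity_gap(y_pred, group):.2f}")
# A large gap warrants investigation; closing it blindly can introduce other kinds of bias.
```

A metric like this captures only one definition of fairness; the broader point is that bias needs to be measured and reasoned about, not silently removed.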

Data science in 2021 and beyond

Significant strides were made in the past year to surface the issues holding back data science. As the hype cycle surrounding data science winds down, the field can become more serious and focused on innovation and problem solving.

Production breakthroughs

Perhaps the most exciting opportunity for data science is the momentum behind an integrated deployment approach. With the widespread availability of technology that closes the gap between creation and production, data scientists will no longer have to translate between several different technologies. This will be game-changing, saving time and frustration while yielding more accurate outcomes.

As it becomes much easier and faster to move models from testing to production, data science will deliver a far greater return on investment to multiple stakeholders, not just data scientists. Organizations will benefit by enabling different groups to consume and understand data insights.
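To illustrate the kind of hand-off an integrated approach smooths over, here is a minimal sketch, assuming scikit-learn and joblib with a hypothetical file name and synthetic data, of persisting a trained model once and reloading the same artifact unchanged in a serving environment.

```python
# A minimal sketch of the creation-to-production hand-off: train once, persist the
# artifact, and reload it unchanged where predictions are served.
# The file name and synthetic data are hypothetical placeholders.
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# --- model creation side ---
X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
dump(model, "model_v1.joblib")          # versioned artifact handed to production

# --- production side (a separate process or environment) ---
production_model = load("model_v1.joblib")
print(production_model.predict(X[:3]))  # same feature layout as at training time
```

The value of a single shared artifact and environment is that the model running in production is provably the one that was tested, rather than a reimplementation in a different technology.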


Second-generation collaboration

Expect to see different groups get involved in the creation and development of data science moving forward. Business analysts and engineers need to work with data scientists, all collaborating to get it right. Each group brings a different perspective to the table, which makes data science more insightful, impactful, and useful for business purposes.

The advanced collaboration required for data science will take the form of combining collaboration models at various levels to meet different needs. By sharing components, organizations will be able to wrap up a certain piece of expertise, whether data blending, machine optimization, or even a reporting module, and share it across the organization. Such functional and purposeful collaboration, combined with the appropriate amount of automation, will characterize the next phase of data science.

Flexible environments

One consequence of COVID-19 has been an acceleration of digital transformation initiatives, and cloud and hybrid environments have become much more prevalent. This trend will continue throughout 2021.

Organizations are not locking themselves into one cloud, or even moving all of their data into the cloud. Many on-premises environments remain, and companies will want to include their data center infrastructure in the mix without purchasing huge computational resources that will be used only every so often.

Instead, they will look for elasticity and the ability to scale hybrid environments up and down to meet the resource requirements of specific workloads. As such, it is essential that data science can be conducted in a variety of environments and shared across the data center and cloud in order to maximize effectiveness. Outstanding options are emerging to enable data science adoption to expand in new ways.

Closing thoughts

Data science maturity is all over the map today. The gap between organizations that are just getting on board and those that have been in the trenches for a while may narrow somewhat in 2021, but the gulf will persist for a good while longer.

The reason? Organizations that have implemented data science successfully, and that understand its capabilities and limitations, will continue to experiment, using open source technologies to try things out. If something works, they can make it available for broader use. They will feel free to play and push the envelope without draining IT budgets on a hunch, and this is where the greatest innovation will happen.


At the same time, data science will become more accessible. Low-code capabilities are beginning to reach more users across the enterprise, creating greater opportunities. With more people understanding data science and using it to solve problems faster than ever before, the benefits of data science will be democratized and new possibilities will be unlocked.

Data science came a long way in 2020, despite hitting some bumps with the pandemic. Because we're being forced to confront key data science challenges, very exciting advances are occurring. 2021 will be the year data science gets real and shows its return on investment in deep and meaningful ways.

Michael Berthold is CEO and co-founder at KNIME, an open source data analytics company. He has more than 25 years of experience in data science, working in academia, most recently as a full professor at Konstanz University (Germany) and previously at the University of California, Berkeley, and Carnegie Mellon, and in industry at Intel's Neural Network Group, Utopy, and Tripos. Michael has published extensively on data analytics, machine learning, and artificial intelligence. Follow Michael on Twitter, LinkedIn, and the KNIME blog.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to [email protected].