Nearly every company that has gone through a digital transformation has struggled to make the best use of the vast amounts of data it collects. Indeed, we estimate that for most companies, 85%-95% of data is never fully utilized and is therefore wasted.
A data lifecycle has many stages: acquiring the data, engineering it into data sets that impart meaning to the raw data, storing it in bulk for further use and analysis, creating databases for exploring it, and finally applying advanced analytics and/or machine learning to extract insights that mere reporting cannot provide, all while maintaining data security and full regulatory compliance. The challenge for many organizations is how best to put together such a system while keeping costs reasonable and minimizing time to deployment and operation. A further challenge is presenting the data in a meaningful way so that people can actually gain insights from it.
What’s needed is a way to deal with the entire lifecycle of data, from acquisition to analysis for insights, while maintaining the advantages of open source and the ability to run on-prem, in a hybrid environment, or cloud native. Data warehouses have been available for some time and can handle storage and delivery, but they don’t provide a complete solution. Many organizations have implemented data clouds, whether through pure open source (e.g., Apache Hadoop) or commercial products (e.g., Talend, Informatica, Amazon Redshift, IBM, SAP, Oracle), but this does not solve the entire data lifecycle challenge, and it often forces the use of many disparate add-on products that may not be easily integrated.
While open source software and systems seem very attractive, especially from a cost perspective, the “roll your own” approach to implementing a functional solution is often fraught with challenges, and “free” is not really “free”. Choosing a complete solution substantially reduces time to full operation, as well as the complexity of ongoing operations and support, and can save enterprise deployments tens of millions of dollars over the long term. We estimate that complexity and integration challenges result in as many as 50%-65% of all enterprise systems not meeting expectations or failing altogether. Further, the ongoing maintenance costs of non-optimized systems have a major impact on operating budgets; we estimate they can be 2X-5X the cost of fully integrated and packaged solutions.
The problem with all of this, aside from cost and the need to have multiple kinds of technical expertise and resources available, is that reaching the ultimate desired result, insight, takes longer, and that result may never be fully achieved. This delayed time to insight is very costly. It’s much more effective to find a solution that is based on open source but already includes all of the integrations necessary to build out a complete system that can be implemented easily and quickly and, ultimately, supported efficiently.
As an example of a more complete data lifecycle solution, Cloudera has created an integrated approach with its Cloudera Data Platform (CDP). It covers not only data acquisition and storage but also machine learning, reducing the time to insight while including a profile-driven, layered approach to data security. CDP integrates data acquisition, data flow, data engineering, data warehousing, database, and machine learning (ML) within one framework that is extensible and allows additional capabilities to be integrated as needed from an expanding partner ecosystem. It works on-prem, in a hybrid cloud, or in a public cloud. When deployed as a cloud implementation, it can virtually eliminate the delays associated with deploying individual components, potentially saving months in time to data insight.
This is critical in many businesses where delays can be costly and/or damaging. For example, delaying fraud detection by minutes or hours can lead to massive losses over the long term. According to the American Bankers Association’s 2019 Deposit Account Fraud Survey report, America’s banks prevented $22.3 billion in attempted fraud against deposit accounts in 2018, while total attempted fraud was $25.1 billion. Even with this high level of prevention, it’s likely that a more proactive and time-sensitive analysis could have stopped much of the remaining $2.8 billion in fraud. And while financial fraud analysis often gets highlighted as a primary candidate for such data analysis systems, it’s just the tip of the iceberg.
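The arithmetic behind these fraud figures can be checked with a short Python sketch. The function name `fraud_summary` and its parameters are illustrative assumptions, not part of the ABA report; the two dollar amounts are the ones quoted above, in billions.

```python
def fraud_summary(attempted_bn: float, prevented_bn: float) -> dict:
    """Summarize fraud figures given in billions of dollars.

    Hypothetical helper for illustration only.
    """
    remaining_bn = attempted_bn - prevented_bn   # fraud that was not stopped
    prevention_rate = prevented_bn / attempted_bn  # share of attempts prevented
    return {
        "remaining_bn": round(remaining_bn, 1),
        "prevention_rate": round(prevention_rate, 3),
    }


# 2018 deposit account figures quoted in the text (billions of dollars)
summary = fraud_summary(attempted_bn=25.1, prevented_bn=22.3)
print(summary)
```

On these inputs, roughly 89% of attempted fraud was prevented, leaving the $2.8 billion gap that faster analysis would target.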
Delayed analysis of health data and trends can create an opening for a disease to spread undetected and infect many more individuals, as we’ve seen in the current pandemic crisis. It can also create challenges through lack of proper diagnosis and subsequent treatment. As we move to increased use of remote telehealth sessions, greater reliance on remote sensor monitoring, and more automated health analysis, accurately collected data is vitally important, as any misdiagnosis due to faulty data can take a heavy toll on both people and delivery systems.
Various estimates put the cost of misdiagnosis at up to 30% of total healthcare costs. In 2018, the United States spent about $3.6 trillion on healthcare, which averages to about $11,000 per person. Moving to a more inclusive role for remote health systems necessitates a much more robust data lifecycle capability than is currently available within many institutions, so as to eliminate or at least substantially reduce misdiagnosis and its associated problems. Further, organizations need a way to share personal data confidentially across institutions so as to better assess trends and provide larger populations for analysis. This is another reason why an enhanced data lifecycle management process, one that protects confidentiality and meets all pertinent regulatory compliance requirements, is critical. Other industries, such as retail, manufacturing, pharmaceuticals, and transportation, would all benefit from such a data lifecycle management approach.
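To give a rough sense of scale, the healthcare figures above can be combined in a few lines of Python. This is only a back-of-the-envelope sketch: it treats the 30% estimate as an upper bound applied to the 2018 spending total, and the function name `misdiagnosis_upper_bound` is a hypothetical label chosen for illustration.

```python
def misdiagnosis_upper_bound(total_spend: float, per_person: float,
                             max_share: float = 0.30) -> dict:
    """Apply an upper-bound share to total and per-person spending.

    Hypothetical helper; max_share=0.30 is the "up to 30%" estimate
    cited in the text, treated here as a ceiling, not a measurement.
    """
    return {
        "total_upper_bound": total_spend * max_share,
        "per_person_upper_bound": per_person * max_share,
        # Population implied by the two quoted spending figures
        "implied_population": total_spend / per_person,
    }


# 2018 U.S. figures quoted in the text: $3.6 trillion total, $11,000 per person
bound = misdiagnosis_upper_bound(total_spend=3.6e12, per_person=11_000)
print(f"Upper bound, total:      ${bound['total_upper_bound'] / 1e12:.2f} trillion")
print(f"Upper bound, per person: ${bound['per_person_upper_bound']:,.0f}")
print(f"Implied population:      {bound['implied_population'] / 1e6:.0f} million")
```

Under these assumptions, the ceiling works out to a little over $1 trillion a year, which is why even a modest reduction in misdiagnosis would be significant.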
A more inclusive platform for full data lifecycle management is imperative as we move to a more data-driven and digitally transformed world. In many businesses, data is perishable, as any lack of timely insights can do significant financial or physical damage. Enterprises should adopt a platform approach to data lifecycle management that requires neither extensive in-house integration nor an extended deployment cycle, whether for major cross-enterprise projects or for quickly stood-up individual or small group projects. To achieve this result, an integrated data lifecycle platform solution is critical.