Achievement 3: The (talpa) Data Science Process Lifecycle
For the successful implementation of a data science project, different methodologies have evolved over time; they are commonly displayed as flowcharts to improve comprehensibility and increase the success rate of projects. In addition to the historical approach, the Cross Industry Standard Process for Data Mining (CRISP-DM; Wirth and Hipp, 2000), more modern methods such as the Team Data Science Process (TDSP) and the Analytics Solutions Unified Method (ASUM) have been adopted as industry standards. smartHUB’s analytical approach is based on the TDSP method. Here, user acceptance is usually achieved through iterative feedback sessions or live demonstrations of the foreseen workflow (user interface). The corresponding user feedback is implemented, resulting in subsequent updates of features and functionalities. The applied and improved TDSP consists of five core modules, which are visualized in the process workflow displayed below:
The main objectives to be achieved by TDSP are:
- Business understanding is important to gain a clear and common understanding of the business potential and of the results to be achieved. The generated business understanding has to be documented to outline the framework of the project and to measure its success at a later stage.
- Data acquisition and data labeling, a very important aspect for the success of the project, means identifying the relevant data sources the consortium has access to and determining which data are needed to generate the desired results. At the same time, data pipelines and the required back-end services might be set up to establish the baseline for continuous analysis of the generated data (see the ingestion sketch after this list).
- Modelling happens once the framework and the accessible data are defined. Here, data sets are explored to determine how the desired results and insights can be generated. Suitable techniques range from basic statistical methods to more advanced data science approaches such as decision trees or neural networks. To apply these methods and enable machine learning algorithms, a basic normalization of the data might be necessary to provide scalable solutions (see the modelling sketch after this list).
- Deployment (& feedback iterations) describes the process of applying the defined analysis to the data and providing insights to the user in an online environment (web application); a minimal endpoint sketch follows this list. Having so far been involved only in the definition of the project framework and the desired results, the user here potentially comes into contact with actual work results for the first time. The customer feedback is crucial to improve the developed solution further and to provide a suitable, user-friendly and intuitive application.
- User acceptance is usually achieved after one or two feedback iterations. Once the feedback has been implemented and the user accepts the solution as practical and supportive of its purpose, the desired insights can be provided in an automated and timely manner.
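To make the data acquisition and labeling step concrete, the following is a minimal sketch of an ingestion routine. It assumes hypothetical CSV telemetry with timestamp, payload and speed columns; the column names and the coarse labeling rule are illustrative assumptions, not part of the smartHUB implementation.

```python
import pandas as pd

def ingest_telemetry(path: str) -> pd.DataFrame:
    """Load raw machine telemetry and attach coarse activity labels.

    The columns (timestamp, payload_t, speed_kmh) are hypothetical
    placeholders for whatever signals a machine actually provides.
    """
    df = pd.read_csv(path, parse_dates=["timestamp"])
    df = df.sort_values("timestamp").dropna(subset=["payload_t", "speed_kmh"])

    # Illustrative labeling rule: a loaded, moving truck is "hauling";
    # a loaded, standing truck is "loading/dumping"; everything else "idle".
    df["label"] = "idle"
    df.loc[(df["payload_t"] > 0) & (df["speed_kmh"] > 1), "label"] = "hauling"
    df.loc[(df["payload_t"] > 0) & (df["speed_kmh"] <= 1), "label"] = "loading/dumping"
    return df
```

In a production pipeline such a routine would be triggered continuously by the mentioned back-end services rather than run on single files.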
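For the modelling step, the snippet below sketches the combination of normalization and a decision tree named above, using scikit-learn as one possible toolkit. It assumes a labeled table like the one produced by the ingestion sketch; it is an illustration under those assumptions, not the project’s actual model.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def train_activity_model(df):
    """Fit a decision tree on normalized signals (feature names are hypothetical)."""
    X = df[["payload_t", "speed_kmh"]]
    y = df["label"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Normalization gives the learner scale-free inputs, as noted above.
    model = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=5))
    model.fit(X_train, y_train)
    print(f"Hold-out accuracy: {model.score(X_test, y_test):.2f}")
    return model
```

The hold-out score stands in here for whatever success criterion was documented during business understanding.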
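Deployment in an online environment can likewise be sketched. The example below wraps the trained model in a single prediction endpoint, using FastAPI as one possible web framework; the route, payload schema and artifact path are illustrative assumptions.

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# Hypothetical artifact path: the pipeline from the modelling sketch,
# persisted beforehand with joblib.dump(model, "activity_model.joblib").
model = joblib.load("activity_model.joblib")

class SignalSample(BaseModel):
    payload_t: float  # hypothetical signal names, matching the earlier sketches
    speed_kmh: float

@app.post("/predict")
def predict(sample: SignalSample):
    """Return the inferred machine activity for one telemetry sample."""
    label = model.predict([[sample.payload_t, sample.speed_kmh]])[0]
    return {"activity": label}
```

A web application front end would call this endpoint and render the results, which is the stage at which the feedback iterations described above begin.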
During the described workflow, it is highly likely that additional insights and deeper analytical potential will be identified. Synergies with correlated data sets or with data from adjacent process steps might also be found, offering potential to improve the solution further. Such unforeseen results are common and can lead to future projects and products.
Mine operators are looking for a solution that can determine the processes and process steps served by the machines providing the data and deduce the effectiveness of those processes. A mine can be compared to a logistics system in which material has to be moved from one location to another. Such a system depends on a steady workflow, so disturbances of the logistic process chain must be reduced to a minimum. The aim of data analytics is to identify sub-processes for the various machine types and process steps on the basis of existing signals, replacing initial rule-based systems with a digital solution; a sketch of such a segmentation is given below. The described Data Science Process Lifecycle can be seen as a tool to achieve this in varying scenarios with ever-changing baseline data and process details.
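To illustrate what such sub-process identification could look like on top of the activity labels produced earlier, the sketch below collapses a per-sample label sequence into contiguous segments from which cycle times and disturbances could be derived. All names are illustrative; real process detection would be driven by the full set of machine signals.

```python
from itertools import groupby

def segment_subprocesses(labels):
    """Collapse a per-sample label sequence into (label, run_length) segments.

    A long "idle" run between "loading/dumping" and "hauling" segments would
    flag a disturbance in the logistic process chain described above.
    """
    return [(label, sum(1 for _ in run)) for label, run in groupby(labels)]

# Example:
# segment_subprocesses(["idle", "hauling", "hauling", "idle"])
# -> [("idle", 1), ("hauling", 2), ("idle", 1)]
```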