Data Science has emerged as a contemporary and exceptionally broad scientific domain that crosses disciplinary borders rather than focusing on a limited set of problems and applications, with data as its common denominator. The essential value of applying Artificial Intelligence and Machine Learning within our society is to empower people to make better and more critical decisions using data within each respective domain (e.g., medicine, engineering, manufacturing, finance, healthcare, governance, etc.). Scalable software platforms are the main entry point and the standard means of declaring and executing computation on large volumes of data in order to derive well-rounded representations of the world, extract accurate knowledge and use this knowledge to make good decisions. However, despite known research advancements, software has been a limiting factor in democratizing data science and bringing it to the core of decision making in our modern society, for two main reasons.

First, existing technology is fragmented and requires both broad and rare technical expertise to develop and compose meaningful and optimized data-driven pipelines that combine several underlying components and special hardware (e.g., GPUs, FPGAs, TPUs) to achieve their goal. Such expertise is most likely inaccessible within diverse domains such as medicine and retail manufacturing. Second, existing systems regard data as a finite resource rather than a continuous stream of information, thus restricting data models and the corresponding decisions to retrospective rather than real-time views of our world, which can heavily impact critical decision making.

Recent research developments in data management platforms have shown that continuous computing architectures, such as data stream processing, can combine the scalability and reliability of retrospective computing platforms with low-latency processing, granting the speed required for critical decision making. Furthermore, advances in compiler and hardware technology have showcased the prospect of providing purely declarative and user-friendly programming models and languages for AI programming, while shifting the complexity of the associated compilation, hardware optimization and maintenance to the underlying system.

Project Goal: Within the scope of the DataScience MCP, we aim to work on a radical software platform architecture able to close the gap between recording data and acting on it. We plan to focus on ease of use to further allow for mass adoption within diverse domains such as ICT, IoT, retail, medicine, healthcare, finance and smart cities. Our work will unify the latest advances in compiler [7, 3, 4] and database technology [8, 5] with hardware-accelerated machine learning [2] and distributed stream processing [6]. At the top, user-facing level, we aim to provide declarative programming support for continuous dynamic operations on live data streams, such as incremental model training and machine learning, dynamic graph representations for building knowledge bases, and support for integrating inductive reasoning within AI pipelines. We further plan to invest effort in the design and implementation of an intermediate language able to capture and optimize computation across all these diverse data types and representations and to execute it on existing or future hardware such as CPUs, GPUs and TPUs.
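
To make the notion of incremental model training on a live stream concrete, the following is a minimal sketch in plain Python. It is not the platform or programming model proposed here; the names OnlineLinearModel, sensor_stream and train_on_stream are hypothetical, and the synthetic stream stands in for a real data source. The model is updated one record at a time with stochastic gradient descent, so a usable model exists at every point in the stream instead of only after a retrospective batch job has finished.

```python
from typing import Iterable, Iterator, Tuple


class OnlineLinearModel:
    """Hypothetical model y ~ w * x + b, updated one record at a time."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate
        self.seen = 0  # number of records consumed so far

    def predict(self, x: float) -> float:
        return self.w * x + self.b

    def update(self, x: float, y: float) -> None:
        # One stochastic gradient step on the squared error of this record.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error
        self.seen += 1


def sensor_stream(n: int) -> Iterator[Tuple[float, float]]:
    """Assumed stand-in for a live stream: noise-free records with y = 2x + 1."""
    for i in range(n):
        x = (i % 100) / 100.0
        yield x, 2.0 * x + 1.0


def train_on_stream(
    model: OnlineLinearModel, stream: Iterable[Tuple[float, float]]
) -> OnlineLinearModel:
    # The model never sees the full dataset; it only reacts to arriving records.
    for x, y in stream:
        model.update(x, y)
    return model


if __name__ == "__main__":
    model = train_on_stream(OnlineLinearModel(), sensor_stream(20000))
    print(f"w={model.w:.3f} b={model.b:.3f} after {model.seen} records")
```

In this sketch the model can be queried with predict at any time while the stream is still being consumed, which is the property that separates continuous from retrospective processing. A declarative front end of the kind envisioned above would hide the explicit update loop from the user.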