Automatic workflow, data collection, and development of open-data infrastructure

The DCMD activities in data exploration and visualization depend on how data are generated, collected, stored, and organized. We address this through research on automatic workflows, data collection for materials data, and development of an open-data infrastructure. All of these activities provide the insights and software needed to perform complex simulations in materials physics and molecular chemistry. However, DCMD requires these simulations to be automated into workflows that can run unsupervised at large scale. Such automation is highly non-trivial; in previous cases it has required both a re-examination of the underlying physics and integration with, and extension of, research-front software for high-throughput computation.

The rapidly moving field of high-throughput materials simulation has yielded a large number of partially overlapping frameworks, e.g., AFLOW, AiiDA, the Materials Project, qmpy, and many more. At present there is no clear frontrunner, nor a single framework that covers all needs, and DCMD will explore these options. Our own innovation in this field has so far been carried out in an in-house code, the high-throughput toolkit (httk), a public code that has been in active use for published high-throughput data since 2011. That work has led to our active participation in the OPTIMaDe network, an ongoing collaborative effort between key actors behind all the larger frameworks to integrate capabilities and facilitate open data exchange. Continued participation in this network is central to the goals of DCMD. OPTIMaDe has so far held two successful CECAM workshops (2016, 2018), with Armiento as a co-organizer of the second and with plans to co-organize another workshop in 2019.

The data generated by high-throughput simulations with automatic workflows will be collected, stored, and organized in a way that facilitates active, ongoing analysis and aids data exchange with other researchers as open data. We will engage LiU, NSC, and PDC to arrange facilities for open data sharing. At the European level, development of data infrastructure is currently very active. To achieve high impact within open data sharing, we will work in synergy with the developments within the EOSC-hub (where PDC is already a collaborator), the European Cloud Initiative (ECI), and EUDAT, and take advantage of these where appropriate.

In DCMD we will explore assigning semantics to materials science concepts to aid data integration. Techniques to explore include text mining and the application of ontology debugging, alignment, and completion techniques to existing ontologies. The project consists of three parts: categorization and semantic integration through ontologies; data storage and engineering, covering standardization and descriptors; and enhancing knowledge generation for materials design. At present, a PhD student supervised by Prof. Lambrix and co-supervised by Dr. Armiento will be supported by the MCP to address the tasks within the project.