Disclaimer
This document is an open, informal collective commentary letter prepared by members of the Decentralised PID Working Group for the RDA Trusted Research Environments for Sensitive or Confidential Data: FAIRness for Controlled Data and Processes WG.
Authors
This comment comes from a small, informal working group that develops globally sustainable technologies and practices for the persistent identification and sharing of data and metadata. The following paragraphs of ideation and criticism were prepared by:
- Francis P. Crawley, chairman of the International Data Policy Committee, CODATA, Belgium
- Bence Lukacs, Institute for Applied Blockchain (IABC) Berlin / Nuremberg Institute of Technology, Germany
- Sergio Santamarina, Biblioteca Central, Universidad Nacional de José C. Paz, Argentina
- Andrey Vukolov, Elettra Sincrotrone Trieste, Italy
Scientific Practice vs. Philosophical Framing as the Starting Point
By Bence Lukacs
Before embarking on a new initiative, I believe it is paramount to first reach consensus on the target level of one’s work. Looking at previous initiatives (e.g. the Budapest Open Access Initiative, the Berlin Declaration on Open Access, the Cape Town Open Education Declaration, the Barcelona Declaration on Open Research Information), the practicality of the concepts for researchers and scientists, together with a proper definition and understanding of the concept at hand, must be in the foreground. When trying to solve particular issues or propose solutions, these elements become necessary for the further education and development of the social layer (i.e. the human scientists and researchers involved in the scientific process).
I would therefore suggest a quite narrow starting point for initiatives around data: the distinction between Open Science (the concept, philosophy, and principles formulated by UNESCO) and FAIR data practices has to be clear. Without a proper frame it becomes difficult to discuss and work towards a shared goal. One’s definition of “open” data has a huge influence on the approach one takes: it can be argued that to be truly “open” (as in “Open Science”), all data must be made available and usable for scientists and researchers. If, on the other hand, one assumes that certain types of data (e.g. “sensitive” data, while perhaps not being clear about who defines what counts as “sensitive”) will not be available to scientists and researchers, then there is a flaw in the “Open Science” concept and principles.
I recommend two initial questions before starting an initiative:
- What is our target level? (Scientists and researchers? Infrastructure? (Meta)Data Standards? Technology in use? etc.)
- What are our core beliefs and philosophy? (True openness in science? UNESCO Open Science? FAIR principles? CARE principles? Decentralised Science?)
Categorization Bloating in the Context of the “Sensitivity” of Data: Why It Is Better to Beware
By Andrey Vukolov, Sergio Santamarina, Francis P. Crawley
When speaking of “sensitive” data, we should keep in mind that it is a Pandora’s box: the categories and definitions of sensitive data tend to expand once established. In particular, there is a constant threat that the existence of data declared sensitive drags all the linked or derived data and metadata under the umbrella of “sensitivity” as well. This situation is captured by the well-known rhetorical “question about Auschwitz”: “If you say you were not looking at the smoke from the Auschwitz crematorium, then did you not know the direction in which you were not supposed to look?”
After studying the Case Statement document, we highlight two large areas of discussion:
- The meaning of the terms “sensitive data”, “classified data”, and “confidential data”, and the related social and legal declaration problems.
- The problems related to Trusted Research Environments (TRE), their setups, maintenance, and applications, as well as the currently emerging problem of centralisation within the facilities.
Regarding the first group of problems, it is necessary to focus not only on openly available guidelines, common concepts, and architectural and design patterns, but also on the well-known problem of sensitivity redeclaration over time (the so-called “problem of the blade’s edge”). Everyone who has worked with classified data is familiar with this problem, yet it is not explicitly documented in the discussion; in our opinion, it should be. Accordingly, the main question to pose here is: “Does knowing that the given data is declared sensitive, or even could be declared sensitive, lead a person to take on the responsibility of keeping this knowledge undisclosed?” In our case this question must be interpreted broadly, because, in the presence of sensitivity redeclaration, we face a situation in which metadata mentioning the sensitivity status of given data will be considered sensitive together with it, but without any explicit declaration. Consequently, performing any operation over this metadata, linked to the sensitive data, will be treated as an act of taking legal responsibility for working under the same sensitivity status. Practice has already shown that open legal declarations do not protect researchers from such cases, as the authorities may prefer to hide the rules they act under behind a sensitivity-status cloak, changing the rules in place. Thus, from the legal actor’s point of view, the described points lead to the following:
- They implicitly involve the personnel (legal actors) who work on the TRE in the responsibility chain, simply because those actors know that sensitive data is treated in the given TRE, and how it is treated.
- They also implicitly enforce the isolation of the legal actors who set up the TRE from the actors who maintain the treatment of the data. This creates a grey zone that, in any jurisdiction, sits outside both the principles of Open Science and the current classification legislation. In our opinion, it may create a situation where some legal actors parasitically consume the resources of communities doing open science without providing any positive outcome except the metadata declaring the sensitivity status of the given data.
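The implicit spread of sensitivity described above can be sketched as a simple graph traversal. The following toy model is our own illustration (the record names and link structure are hypothetical, not taken from any standard or TRE): once one dataset is declared sensitive, every record that links to it or derives from it, including metadata that merely mentions it, ends up under the same umbrella.

```python
# Toy model (our own illustration) of sensitivity "bloating": a flag on
# one dataset implicitly taints every linked or derived record.

def propagate_sensitivity(records, links, seed):
    """Return the set of record IDs that become implicitly sensitive.

    records: iterable of record IDs
    links:   dict mapping a record ID to the IDs it links to or derives from
    seed:    ID of the record explicitly declared sensitive
    """
    tainted = {seed}
    frontier = [seed]
    while frontier:
        current = frontier.pop()
        for rec in records:
            # Any record referencing a tainted one is dragged in, even if
            # it only holds metadata about the sensitive data.
            if current in links.get(rec, set()) and rec not in tainted:
                tainted.add(rec)
                frontier.append(rec)
    return tainted

# Hypothetical example: dataset D1 is declared sensitive; M1 is its
# metadata record, D2 is derived from D1, and M2 describes D2.
links = {"M1": {"D1"}, "D2": {"D1"}, "M2": {"D2"}}
tainted = propagate_sensitivity(["D1", "M1", "D2", "M2"], links, "D1")
print(sorted(tainted))  # all four records fall under the umbrella
```

The point of the sketch is that no record other than D1 was ever explicitly declared sensitive; the taint travels along the link structure alone, which is exactly the redeclaration-without-declaration problem described above.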
The second group of problems reduces, given the above, to the single problem of maintenance. The main conflict is that the equipment and data-treatment pipelines established for facilities working under Open Science practices are neither secure nor enclosed, so it is impossible to hide the presence of data of a given (sensitive) nature. The second point is that the computational infrastructure installed at a large-scale facility is often distributed (even across facilities), so the data-treatment space is not managed as a centralised entity. This is where many conflicts arise. First of all, concepts like Zero-Knowledge Proofs (ZKP), which declare that a treatment station does not know what data was put into it, implicitly require all treatment stages to be stateless: the data must be provably removed from every station whenever that station is not directly operating on it. Otherwise it becomes impossible to hide the presence of sensitive data even from the maintenance personnel. On the other side, statelessness makes it impossible to effectively adjust the treatment pipeline, since pipelines are almost always customised due to the innovative nature of the facilities; to fix errors or to implement (and especially to test!) the required methods, the treatment stations must be stateful. The situation is zero-tolerance: the presence of a single stateful station makes the whole infrastructure stateful. Thus, according to the Five Safes principles, the presence of sensitive data forces the TRE either to guarantee stateless treatment explicitly, or to accept direct responsibility, in line with what we said above about the social aspect of sensitive data. In the first case, assuming it is possible at all, every station in the data-treatment pipeline becomes a black box.
This can, to some extent, make the decentralised infrastructure itself trusted, but it also leads to results that cannot be called trusted, because they are not generally reproducible. The only way to cope with this is to supervise the entire distributed/decentralised TRE from a single point, potentially duplicating it and creating an additional high-level structure to obtain reproducible results; the concept of decentralisation then becomes irrelevant in favour of the principal investigator’s interest. This is, however, the only imaginable case that prevents the maintainers from implicitly taking responsibility for the sensitive data. The other way requires enclosing the [potentially] stateful TRE in an isolated, high-level, centralised environment in which the enforcement of the ZKP concept is not guaranteed. In that case, all maintainers of the TRE implicitly take responsibility for the sensitive data once they know what data they are treating and how it must be handled to obtain reproducible, trusted results. This is what the Secure Research Environment (SRE) concept proposes, but it may not be compatible with the unrestricted access policies enforced in most modern laboratories and research facilities.
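The zero-tolerance nature of the statelessness requirement can be sketched in a few lines. The stage names and the `purges_state` flag below are our own hypothetical illustration, not an existing TRE API: a pipeline is only as stateless as its least stateless station, so one stage that retains data (for example, a cache kept for debugging or pipeline tuning) makes the whole chain stateful.

```python
# Toy sketch (our own illustration) of a treatment pipeline in which each
# stage is expected to purge its internal state after processing. A single
# stage that keeps state renders the whole pipeline stateful, so the
# presence of sensitive data can no longer be hidden from maintainers.

class Stage:
    def __init__(self, name, purges_state=True):
        self.name = name
        self.purges_state = purges_state
        self.retained = None  # whatever survives after processing

    def process(self, data):
        result = f"{self.name}({data})"
        # A stateless stage provably removes the data after use; a stateful
        # one retains it, e.g. for error fixing and method testing.
        self.retained = None if self.purges_state else data
        return result

def pipeline_is_stateless(stages):
    # Zero tolerance: one stateful station makes the infrastructure stateful.
    return all(s.purges_state for s in stages)

stages = [Stage("ingest"), Stage("calibrate", purges_state=False), Stage("reduce")]
data = "sample"
for s in stages:
    data = s.process(data)

print(pipeline_is_stateless(stages))  # False: "calibrate" retains data
print([s.retained for s in stages])   # the leak point is visible to maintainers
```

This is the conflict in miniature: making `calibrate` stateless would restore the ZKP-style guarantee, but would also remove exactly the retained state needed to adjust and test the customised pipeline.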
All the described factors can, however, be partially avoided when commercially sensitive data is managed by the industrial stakeholder itself, since the relations between the TRE maintainers and the principal investigators are regulated as is usual between businesses, for example through NDAs. But where state authorities take over governance, practice demonstrates that all the described issues immediately become essential.
To summarise, the concepts of Open Science, TREs, decentralisation, and sensitive data do not look compatible in reality. Generally speaking, the presence of data declared sensitive in any way “poisons” either the regular research infrastructure or the TRE/SRE, unless it is an isolated, classified one under governmental management; it implicitly forces the maintainers, and the legal actors behind them, to take responsibility for the sensitive data as soon as they learn of its existence. Moreover, the concepts of TREs/SREs do not look compatible with the FAIR principles from the start, for the following reasons:
- Of the four FAIR concepts, only Findability is applicable here in the context of Open Science.
- Implementing Findability may require disclosing the sensitive nature of the underlying data by mentioning it in the metadata, which in turn may lead to the problem of sensitivity redeclaration over time.
- The results obtained from the treatment of sensitive data should be considered neither Reproducible nor Interoperable, because the sensitivity decree enforced over the pipeline inserts a hidden part into the provenance chain, with a single holder of the ground truth. This leads to the singular-trusted-agent case that the FAIR principles aim to bypass.
In general, treating any kind of government-managed sensitive data on the same infrastructure as open data, and disclosing the metadata in an attempt to comply with FAIR, should, in our opinion, be considered a dangerous practice in terms of its potential to make open research controllable. In the case of sensitive data managed by third-party commercial authorities, security rules should be established and enforced through documented bi-directional agreements.