» What tools or infrastructure are needed for data processing and analysis?
» Will data processing create additional metadata?
» Do data formats need to be converted?
» How will versions of files be managed?
About this chapter
Data processing and analysis is a central element of the research process and part of the data lineage/provenance. This chapter collects information about the tools and infrastructure used and asks you to clarify steps that are important for preparing high-quality FAIR data.
Question-specific guidance
Do you need a shared space with your collaborators to work on data analysis?
It might be worth considering a shared work environment with collaborators to avoid duplication of data, reduce data transfer issues and provide the same software to all project participants. If you are working with large datasets, bandwidth to compute infrastructure or data sources might become an issue. Also consider how critical access to the workspace is for your project, and whether you can tolerate data loss and downtime. It is often best to rely on professionals to operate the infrastructure, also to avoid straining project resources with infrastructure maintenance. Also consider how data enters and leaves the environment; sometimes provisioning of the environment requires an active application or will take time. For the workspace you should follow similar considerations as for storage with respect to security.
Shared workspace examples (non-exhaustive):
- NIRD provided by Sigma2
- NeLS provided by ELIXIR-NO
- educloud provided by UiO
- TSD provided by UiO
- HUNT Cloud provided by NTNU
- SAFE provided by UiB
Do you need to plan compute solutions and capacity?
Specific to data/compute heavy projects
If you require a large number of CPU hours, high I/O bandwidth or a lot of memory, it is best to estimate this in advance and to choose a computing infrastructure upfront. The infrastructure might have restrictions, require an application and/or payment, and might only be able to run certain software or workflow systems.
If you will use federated computing in your project, this should be considered early on.
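As a rough illustration, a back-of-the-envelope estimate of the CPU hours to request could look like the sketch below; all numbers are hypothetical placeholders that you would replace with figures from a pilot run.

```python
# Back-of-the-envelope sketch for estimating compute needs before choosing an
# infrastructure. All numbers are hypothetical placeholders, not recommendations.
n_samples = 500                 # datasets/jobs to process
cpu_hours_per_sample = 12       # measured in a small pilot run
safety_factor = 1.5             # reruns, failed jobs, parameter tuning

total_cpu_hours = n_samples * cpu_hours_per_sample * safety_factor
print(f"CPU-hour allocation to request: {total_cpu_hours:.0f}")
```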
Will the data be converted to other file format(s) before archiving?
In some cases data has to be converted to different formats late in the project for archiving. It is important to ensure that there is sufficient computing time and software for this task.
Archiving and working with data have different requirements. You want archived data to be in a persistent file format that others can open and read without depending on proprietary software, even years from now. When working actively with the data, you need to process and analyse it efficiently, and software-specific formats can be appropriate. If the two differ, you need to plan for conversions. Complicated (binary) file formats tend to change over time, and software may not stay compatible with older versions. Some formats (e.g. DOC, XLS) also hamper long-term usability through patents or restrictive licensing. Be aware that storing data in a persistent format not only facilitates reuse by others, but also ensures future internal use (e.g. if you lose access to commercial software).
Ideally a format should be simple, text only, completely described, not restricted by copyrights, and implemented in different software packages.
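If a conversion is needed, it can often be scripted. The sketch below assumes pandas (with openpyxl) is available and uses hypothetical file names to convert a spreadsheet from a proprietary working format to an open, text-based format for archiving.

```python
# Minimal sketch: convert a proprietary spreadsheet to an open, text-based
# format before archiving. Assumes pandas with openpyxl is installed; the
# file names are hypothetical examples.
import pandas as pd

# Read the working format (here an Excel workbook) ...
df = pd.read_excel("measurements.xlsx", sheet_name=0)

# ... and write a persistent, software-independent copy for the archive.
df.to_csv("measurements.csv", index=False, encoding="utf-8")
```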
Resources on file formats and conversions:
- CESSDA DMEG: File formats and data conversions
- DataverseNO: Prepare your data - Preferred file formats
Will data processing or analysis alter metadata or produce additional metadata?
If your processing and analysis software and workflow engines produce metadata about these steps, it is good practice to consider upfront how this metadata can be captured and archived.
Generally speaking, script-based data cleaning and analysis is easier to reproduce than manual steps for analysis of quantitative data. For analysis of qualitative data, code categories should be exclusive, consistent and documented.
Examples of information to be added to metadata (non-exhaustive):
- Parameters or data coding used during data analysis
- Using a given version of a reference dataset in data analysis
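As an illustration, such information could be written to a small sidecar file next to the results; the field names and values in the sketch below are hypothetical examples, not a prescribed schema.

```python
# Minimal sketch: record analysis parameters and reference-data versions as a
# metadata file alongside the results. Field names and values are hypothetical.
import json
from datetime import datetime, timezone

analysis_metadata = {
    "script": "normalise_counts.py",                      # hypothetical analysis script
    "parameters": {"threshold": 0.05, "normalisation": "TMM"},
    "reference_dataset": {"name": "GRCh38", "version": "p14"},
    "run_at": datetime.now(timezone.utc).isoformat(),
}

with open("analysis_metadata.json", "w", encoding="utf-8") as fh:
    json.dump(analysis_metadata, fh, indent=2)
```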
Will data processing affect information security?
In some cases data processing will affect e.g. privacy and thus data security. For example, it might no longer be possible to identify individuals from aggregated data. Conversely, in some cases the processed data, but not the raw data, might allow easy access to sensitive information.
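To illustrate the first case, the sketch below aggregates hypothetical individual-level records into group counts; note that proper anonymisation requires a careful risk assessment and is not covered by this simple example.

```python
# Minimal sketch: aggregation as one processing step that changes the privacy
# profile of the data. Column and file names are hypothetical; this is not a
# substitute for a proper anonymisation procedure.
import pandas as pd

individuals = pd.read_csv("participants.csv")   # e.g. columns: municipality, age_group, diagnosis

aggregated = (
    individuals
    .groupby(["municipality", "age_group"])
    .size()
    .reset_index(name="n_participants")
)
aggregated.to_csv("participants_aggregated.csv", index=False)
```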
Resources on data anonymisation:
- CESSDA DMEG: Anonymisation
- Norwegian Data Protection Authority: A guide to the anonymisation of personal data (2015)
- Amnesia anonymisation tool by OpenAIRE
Will you handle different versions of files or documents?
Being able to track versions of files or documents helps to understand the history of changes and why something was done in a specific way. This contributes to making data processing and analysis reproducible. Being able to review and possibly restore previous versions is another aspect, particularly in collaborative projects with simultaneous changes. Approaches to version control range from establishing routines e.g. for file naming to using version control systems such as git.
Version history and different sets of data might be especially important if you are training AI models with different datasets. For larger datasets you might want to consider git-annex, git-lfs or more specialised systems for this purpose.
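If a full version control system is not an option, even a simple, scripted file-naming routine helps. The sketch below is one hypothetical way to do this; the naming scheme and paths are examples, not a standard.

```python
# Minimal sketch of a file-naming routine for manual versioning, as a
# lightweight alternative to a version control system. The naming scheme
# (ISO date plus running number) and paths are hypothetical examples.
import shutil
from datetime import date
from pathlib import Path

def save_version(path: str, version_dir: str = "versions") -> Path:
    """Copy a file into a versions/ folder with a date-stamped, numbered name."""
    src = Path(path)
    target_dir = src.parent / version_dir
    target_dir.mkdir(exist_ok=True)
    stamp = date.today().isoformat()
    n = sum(1 for _ in target_dir.glob(f"{src.stem}_{stamp}_*{src.suffix}")) + 1
    dst = target_dir / f"{src.stem}_{stamp}_v{n:02d}{src.suffix}"
    shutil.copy2(src, dst)
    return dst

# Example: save_version("interview_codes.csv") -> versions/interview_codes_2024-05-01_v01.csv
```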
Resources on version control:
- RDMkit: Data organisation - How do you manage file versioning?
- The Turing Way: Version Control
- The Turing Way: Version Control for Data
- Software Carpentry: Version Control with Git
Will you monitor data integrity once it has been collected?
It is important to ensure that the data in your project is not corrupted through transfer problems, data degradation (bit rot), vandalism or manipulation. One common procedure to detect data changes is to calculate checksums (e.g. SHA-256) that can be stored and compared after transfers and over time. Another procedure might be to repeat measurements on the same samples/objects.
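A minimal sketch of such a checksum routine is shown below; the directory and file patterns are hypothetical examples.

```python
# Minimal sketch: compute and verify SHA-256 checksums to detect corruption
# after a transfer or over time. Directory and file names are hypothetical.
import hashlib
from pathlib import Path

def sha256sum(path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record checksums once, e.g. right after data collection ...
recorded = {p.name: sha256sum(p) for p in Path("raw_data").glob("*.fastq.gz")}

# ... and compare them again after a transfer or at a later point in time.
for name, old_digest in recorded.items():
    if sha256sum(Path("raw_data") / name) != old_digest:
        print(f"Checksum mismatch: {name}")
```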
Will you be integrating or linking data from different origins or different types of data?
If you are integrating data from different origins, it might be necessary to unify the structure of the data. The most common structures for knowledge representation are: flat files (e.g. tabular formats such as TSV and CSV), relational databases (e.g. SQL) and linked data. The complexity of relations and knowledge that can be captured in these structures increases in this order, but so do the skills and effort required.
If possible, it is recommended to use a common ontology to integrate data from different sources.
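As a small illustration, the sketch below harmonises column names from two hypothetical sources to a shared vocabulary before merging; in a real project the shared terms would ideally come from a common ontology rather than an ad-hoc mapping.

```python
# Minimal sketch: map source-specific column names to shared terms and merge.
# File and column names are hypothetical examples.
import pandas as pd

# Each source uses its own column names for the same concepts.
site_a = pd.read_csv("site_a.csv").rename(
    columns={"subj_id": "subject_id", "temp_C": "temperature_celsius"}
)
site_b = pd.read_csv("site_b.csv").rename(
    columns={"participant": "subject_id", "temperature": "temperature_celsius"}
)

# After renaming, the tables share a structure and can be combined.
combined = pd.concat([site_a, site_b], ignore_index=True)
combined.to_csv("combined.csv", index=False)
```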
Does your computational approach require validation of results?
Specific to data/compute heavy projects
In some cases the results of computational steps cannot be considered deterministic, for instance due to randomness in the calculation, human input and possible errors, or differences in execution across infrastructures. To reduce false findings, the computational steps should be validated in these cases.
There are surprisingly many complications that can cause (slight) inconsistencies between results when workflows are run on different compute infrastructures. A good way to catch this is to run a subset of all jobs on each of the infrastructures and check that the results are consistent.
Validation of results without a gold standard is very hard. One way of doing it is to develop two solutions for a problem (two independent workflows or two independently developed tools) and check whether the results are identical or comparable.
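A minimal sketch of such a comparison is shown below; the result files and tolerances are hypothetical examples, and what counts as "comparable" has to be decided per project.

```python
# Minimal sketch: compare the outputs of two independently developed
# implementations of the same analysis step within a numerical tolerance.
# File names and tolerances are hypothetical examples.
import numpy as np

results_a = np.loadtxt("workflow_a_results.tsv")
results_b = np.loadtxt("workflow_b_results.tsv")

if results_a.shape != results_b.shape:
    print("Shapes differ - results are not comparable")
elif np.allclose(results_a, results_b, rtol=1e-6, atol=1e-9):
    print("Results agree within tolerance")
else:
    max_diff = np.max(np.abs(results_a - results_b))
    print(f"Results differ (max absolute difference: {max_diff})")
```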
Surrounding all tools in your data processing and analysis workflows with the ‘boilerplate’ code necessary on the computer system you are using is tedious and error-prone, especially if you are using the same tools in multiple different workflows and/or on multiple different computer architectures. Automated instrumentation, e.g. by using a workflow management system, can prevent many mistakes.
Running a small subset of the data repeatedly can be useful to catch unexpected problems that would otherwise be very hard to detect.