Reusing or collecting data

Tip: Why is this topic important
» Defining what data will be included and analysed in your project is a core element of designing your research project. Identifying reusable data is often a very effective way to start planning a project, so you should always check whether existing data are relevant to your project.
» Defining how project data will be collected will also help when defining data management roles and delegating data management tasks in the project.
» Defining how data will be created, possibly reused, and flow both in and out of your project, will contribute to Open Data and Open Science practices.

About this chapter

This chapter collects information about the type and origin of the data that will be included in the project. This could be datasets or records originating outside the project, as well as novel data collected, captured, generated or created by yourself or other project members.

Different disciplines may use different terms to describe the data or records underlying a research project. In this chapter, an inclusive approach is used with ‘data’ functioning as a broad and overarching term. This includes observational and experimental data, surveys, registry data, simulations, as well as data in digital museum archives and other data records.

Depending on the discipline and specifics of a project, a data unit may already resemble what will be archived or published as a dataset, or it may undergo significant processing on the way to becoming a dataset.

Question-specific guidance

The project will (re)use datasets or records available in a repository/registry/archive

Many research projects use existing digital sources to produce knowledge. For example, existing data can be used as a reference or combined with other data, and datasets can be approached with novel questions. Often, pre-existing data will be combined with novel data collected or produced in a project.

As many of the same considerations apply, access to materials that are commonly referred to as ‘records’ or ‘sources’ rather than ‘data’ should also be included here, such as material in public archives, media archives, legal resources, or large amounts of digital literature.
If digital sources are not persistent, you should additionally consider versioning and preserving a copy.

Not all data from research in your field will be open and unrestricted. Very often, however, even relevant restricted or closed datasets can be found through references in publications or through searchable metadata in registries and similar services.

There can be a number of reasons to discard the idea of reusing data in your project, such as existing data not being relevant. Taking the time to identify such reasons when designing your project will most often make clear, for yourself, project members and collaborators, why new data need to be collected. If the idea of reusing data is discarded, funder guidelines ask for a brief description of why building on existing sources is not relevant or applicable for the project.

You can consult the chapters on data discovery and existing data in e.g. the RDMkit for life sciences, the CESSDA Data Management Guide, or The Turing Way handbook for inspiration.

There are a number of different strategies that can be applied to identify existing datasets or records:

Scientific datasets
Useful approaches to identify scientific datasets:

Data underlying a research article or described in a data publication
- Use literature search engines to track relevant datasets via research articles. Data access should be described in a data availability statement.
- Data journals are an emerging journal genre.
- Some search engines have filter functions to search in data availability statements, e.g. EuropePMC.
Datasets in relevant discipline-specific data archives
- Where do researchers in the same field publish their data? Check data availability statements in publications or publication guidelines of relevant journals.
- Curated research data registries that allow filtering of discipline: re3data.org, fairsharing.org
Metasearch services for datasets. Different search engines have different advantages and disadvantages, so it can be useful to compare results. These services are often the best choice for identifying data in institutional and generic archives.
- Non-commercial: DataCite Commons, BASE (Bielefeld Academic Search Engine), OpenAIRE explore
- Commercial: Google Dataset Search, Mendeley Data, WOS Data Citation Index, Dimensions

Data from the public sector including registry data
Useful resources (non-exhaustive):

Norwegian public data are available from data.norge.no
data.europa.eu is the official portal for European data
Google Public Data search
Data by international organisations, e.g. WHO Data collections
microdata.no and sikt.no/surveybanken are Sikt services to access register data and survey data, respectively
helsedata.no gives access to both open health data and datasets with restricted access
Portals for national studies such as HUNT Cloud Data
National knowledge bases such as artsdatabanken.no for biodiversity or NVE map services (Norwegian only) for geographical data

Data in digital archives and collections
Useful resources (non-exhaustive):

Does the data/record/source have a license assigned or clear reuse conditions?

Most of the material found online is either copyright protected or protected through related rights, without being assigned an explicit license that allows further use. This can be confusing for those who want to reuse the material, and it is one of the reasons you are advised to add a license to your own work, explicitly stating what others are and are not allowed to do. Knowledge of the Norwegian copyright act (in Norwegian) can be helpful when using copyrighted material in research. There is no fair use clause in Norwegian copyright law, and no general right to republish copyrighted material as research data.

Work in the public domain
If you work with historical sources where the copyright holder(s) has been dead for 70 years or more, the material is in the public domain. There are exceptions for some material, including databases (15 years) and photos not considered to meet the threshold of originality (50 years). It will then be possible to share the material as research data in an open archive, using a public domain mark.

Norwegian digital sources
If you plan to collect online material from Norwegian sources, it might be of value to have a dialogue with the Web Archive about scraping and preserving the material. Nettarkivet, the Norwegian Web Archive (in Norwegian) at the National Library of Norway, aims to preserve websites under the .no domain according to the Legal Deposit Act (Pliktavleveringslova, in Norwegian). Websites under foreign domains can also be harvested if their content is related to Norway or is written in Norwegian.

Contacting the rights holder
If you plan to collect digital content from a limited number of content producers, consider contacting the rights holders and asking for permission to share the material as research data and as part of your research. This is not always possible, but without permission, sharing the data openly at a later stage in the research will be difficult.

The project will collect data through observations, questionnaires or interviews

Describe the types of data to be collected through observations, questionnaires or interviews, and how data quality will be ensured. If you are collecting data about persons, make sure to provide all necessary information in the chapter ‘Legal and ethical aspects’.

The project will capture data using measurement equipment

Describe how data will be collected by using measurement equipment or laboratory instruments, how experimental parameters or other relevant information are documented, and how data quality will be ensured.

How will you be keeping track of the “provenance” of the data?

To make data understandable and accurate, and to make the results reliable through transparency or reproducibility, it is crucial to document the data origin and relevant parameters as well as all processing and filtering steps. Re-users of the data also need this information to assess the data quality and decide whether the data can be used for their purpose.

If traditional lab notebooks are used, the notes should be made available in electronic form along with your data.
For qualitative data, consider how to enhance transparency regarding data coding and analysis through documentation.
For historical research, consider how notes and annotations can be made available along with the (data)sources.
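
One way to keep such a record in machine-readable form is sketched below in Python. The `ProvenanceStep` class, its field names and the example file names are illustrative assumptions, not a prescribed standard; an established provenance model may be a better fit for your discipline.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ProvenanceStep:
    """One processing or filtering step applied to the data (hypothetical schema)."""
    description: str
    inputs: list
    outputs: list
    parameters: dict = field(default_factory=dict)
    timestamp: str = field(default_factory=_utc_now)


def provenance_to_json(steps: list) -> str:
    """Serialise the recorded steps so they can be stored next to the data."""
    return json.dumps([asdict(step) for step in steps], indent=2)


if __name__ == "__main__":
    # Invented example: two steps from raw measurements to an analysis-ready file.
    log = [
        ProvenanceStep(
            description="Remove rows with missing values",
            inputs=["raw_measurements.csv"],
            outputs=["clean_measurements.csv"],
            parameters={"columns_checked": ["temperature_c"]},
        ),
        ProvenanceStep(
            description="Average repeated measurements per sample",
            inputs=["clean_measurements.csv"],
            outputs=["sample_means.csv"],
        ),
    ]
    print(provenance_to_json(log))
```

Storing the resulting JSON file alongside the dataset gives re-users the origin, parameters and order of each step without having to read the analysis code.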

Further reading:

The Turing Way: Electronic Lab Notebooks

List data that you will acquire using measurement equipment

The description of the acquired data also includes information about the instrumentation used, as this is a critical part of the data lineage and thus needs to be included in metadata and data documentation.

If using non-standard equipment or if the technology is very much under development, you may want to come back later to understand exactly how the measurements have been made. Make sure to keep copies of any documentation and for example take pictures of the instruments for documentation.

Ownership: If measurements are not carried out by project members but at an institutional or national core facility or by an external party, make sure that formal ownership of the data has been established. Agreements should include who will take responsibility for keeping raw data safe, and who will deal with data publication. Remember to include information about relevant contracts in the chapter ‘Legal and ethical aspects’ and, if applicable, specify distribution of rights in the chapter ‘Data documentation during the project’.

Is special care needed to get the raw data ready for processing?

Be aware that information security considerations also apply for file transfer. Often, data is not processed at the measurement location but needs to be transferred or ingested to allow further processing. Aspects to consider include the data format to be used, whether data will be transferred via network connection or on a physical medium, and how data integrity and security will be ensured.

For larger data volumes, it is essential to investigate network bandwidth and the transfer protocol to be used. If a physical medium is used to transfer data, writing and reading capacity and speed are important. Consider calculating checksums to ensure that data integrity is maintained.
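
As a minimal sketch of such a check, the Python example below computes a SHA-256 checksum before and after a transfer and compares the two. The function names and the temporary files are assumptions made for illustration; many transfer tools offer built-in verification that may be preferable.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that large files do not have to fit into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source: Path, destination: Path) -> bool:
    """Return True if both files have identical SHA-256 checksums."""
    return sha256_of_file(source) == sha256_of_file(destination)


if __name__ == "__main__":
    # Simulate a transfer with a local copy inside a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "raw_data.bin"
        destination = Path(tmp) / "transferred_data.bin"
        source.write_bytes(b"example measurement data")
        shutil.copyfile(source, destination)
        print("Transfer intact:", verify_transfer(source, destination))
```

Recording the checksum of each raw file at the measurement location also makes it possible to re-verify the files later, for example after moving them to long-term storage.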

Which quality processes will be applied?

Which measures will be taken to ensure data quality? This may include calibrating instruments or measurements, including control samples, repeating measurements, standardizing data capture and documentation, data entry validation, or data peer review. In most research projects, several choices will apply.
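
Data entry validation, one of the measures mentioned above, can be sketched in Python as follows. The field names (`sample_id`, `temperature_c`, `measured_on`) and the plausible temperature range are hypothetical and would need to be adapted to the data actually collected.

```python
from datetime import date


def validate_record(record: dict) -> list:
    """Return a list of problems found in one data record.
    An empty list means the record passed all checks."""
    problems = []

    # A record must carry a non-empty sample identifier.
    if not str(record.get("sample_id", "")).strip():
        problems.append("sample_id is missing")

    # The temperature must be numeric and within an assumed plausible range.
    temperature = record.get("temperature_c")
    if isinstance(temperature, bool) or not isinstance(temperature, (int, float)):
        problems.append("temperature_c is not a number")
    elif not -90.0 <= temperature <= 60.0:
        problems.append("temperature_c is outside the plausible range")

    # The measurement date must be an ISO date such as 2024-03-01.
    try:
        date.fromisoformat(str(record.get("measured_on", "")))
    except ValueError:
        problems.append("measured_on is not a valid ISO date")

    return problems


if __name__ == "__main__":
    good = {"sample_id": "S-001", "temperature_c": 12.5, "measured_on": "2024-03-01"}
    bad = {"sample_id": "", "temperature_c": 250, "measured_on": "yesterday"}
    print(validate_record(good))
    print(validate_record(bad))
```

Running such checks at the point of data entry, rather than during analysis, makes it easier to correct errors while the original observation is still available.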

Further reading:

RDMkit: Data quality

The project will collect physical samples

Describe how cross-referencing between physical samples and digital data will be achieved. If physical samples will be preserved in a public repository (e.g. museum archive, biobank), describe where the samples will be deposited, which identifier system will be used and any access conditions.
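
One simple way to cross-reference is a manifest table that maps each physical sample identifier to the digital files derived from it. The Python sketch below assumes a hypothetical CSV manifest with the columns `sample_id` and `data_file`; the identifiers and file names are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def load_sample_index(manifest_text: str) -> dict:
    """Build a lookup from sample identifier to the list of data files
    recorded for that sample in the manifest."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(manifest_text)):
        index[row["sample_id"]].append(row["data_file"])
    return dict(index)


if __name__ == "__main__":
    # Invented manifest content; in practice this would be read from a file.
    manifest = (
        "sample_id,data_file\n"
        "SAMPLE-0001,sequencing/run01_sample0001.fastq\n"
        "SAMPLE-0001,images/sample0001_front.tif\n"
        "SAMPLE-0002,sequencing/run01_sample0002.fastq\n"
    )
    index = load_sample_index(manifest)
    print(index["SAMPLE-0001"])
```

If the repository holding the samples assigns its own identifiers, using those in the manifest keeps the link intact after deposit.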

The project will generate research software, code, computational models or simulations

Research software, computational models and simulations are, like research data, a research output, yet they are more dynamic. Describe what will be generated during the research project. This may be everything from a few lines of code for data analysis to a more complex project or software package.

When writing code, use a version control system, and remember to document and refer to the version of the code used for the research results. By depositing the version of the code you use in a repository, you make it easy to refer to the correct version of the code in your publication(s) using a DOI. Code uses different licenses than data and publications. If you are working on and modifying code which has already been shared by others, best practice is to assign the same license when you share.

If writing software/models is a significant part of the research project, you can consider writing a Software Management Plan (SMP) to supplement this DMP.

Elixir Software Management Plan

The FAIR for Research Software (FAIR4RS) RDA working group defines research software as “source code files, algorithms, scripts, computational workflows and executables that were created during the research process or for a research purpose” (Gruenpeter et al. 2021, doi: 10.5281/zenodo.5504016). The FAIR for Research Software (FAIR4RS) principles have been adapted from the FAIR principles to fit the characteristics of software/code (Barker et al. 2022, doi: 10.1038/s41597-022-01710-x, Lamprecht et al. 2020, doi: 10.3233/DS-190026).

Other types of data the project will utilize or produce

What other types of files or digital documents will the project utilize or produce, which have not been captured by the previous categories? Think of everything that would cause problems if lost due to technical or human error. What data type/format will be used? Will specific software be needed? What documentation or metadata is needed to understand the data in the future?
For some projects that handle little data, this may be the only applicable category.

Examples:

Manuscripts/research articles (including all their versions)
Illustrations/figures
Literature databases
Notetaking documents

Who else could be interested in using data from this project?

Thinking through who could be interested in the data that is used or produced in a research project contributes to the project’s impact and can motivate good data management practices.