Good enough practices for Research Data Management

Tanja Milotic
September 26, 2022

Transcript

  1. Good enough practices for research data management

    Tanja Milotić, @oscibio
    Empowering biodiversity research II, September 27th, 2022, Brussels
  2. Why should you even bother? Help your future self

    • higher data quality, fewer mistakes
    • increased research efficiency
    • minimized risk of data loss, less frustration
    • saved time & money
    • no collecting of duplicate data
  3. Why should you even bother? Motivation for the near future

    • required by funders & publishers
    • increase your visibility (citations!)
    • easy data sharing
    • new collaboration opportunities

    “Research data management is part of good research practice”
  4. How to start with research data management?

    Research data management (RDM) concerns the organization of data, from its entry to the research cycle to the dissemination and archiving of valuable results.

    Data management plan (DMP)
  5. Research data?

    Any information collected/created for the purpose of analysis to verify scientific claims

    Digital(ized) data: observations, measurements, pictures, audio, models,...
    Physical data: samples (soil, water, tissue,...), collections,...
  6. Folder structure

    • Use a well-defined folder structure

    |- README.md       <- The top-level README describing the general layout of the project
    |- data            <- Research data
    |  |- raw          <- The original, read-only acquired raw data
    |  |- interim      <- Intermediate data that has been transformed
    |  |- processed    <- Final data products, used in the report/paper/graphs
    |  |_ external     <- Additional third-party data resources used (e.g. vector maps)
    |- reports         <- Reported outcome of the analysis as LaTeX, Word, Markdown,...
    |  |_ figures      <- Generated graphics and figures to be used in reporting
    |- src             <- Set of analysis scripts used in the analysis

    “Keep raw data raw!” (Hart et al, 2016)
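A minimal sketch of how the folder layout above could be created in one go. The folder names come from the slide; the function name `create_project`, the project root and the README text are placeholders chosen for illustration.

```python
from pathlib import Path

# Sub-folders taken from the layout on the slide.
LAYOUT = [
    "data/raw",
    "data/interim",
    "data/processed",
    "data/external",
    "reports/figures",
    "src",
]


def create_project(root):
    """Create the folder skeleton and a top-level README.md (hypothetical helper)."""
    root = Path(root)
    for sub in LAYOUT:
        # parents=True also creates "data" and "reports"; exist_ok avoids errors on re-runs.
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        # Placeholder text: replace with a real description of the project layout.
        readme.write_text("# Project title\n\nDescribe the project layout here.\n", encoding="utf-8")
    return root
```

Calling `create_project("my_project")` a second time leaves existing files untouched, so it is safe to re-run on a project that is already in use.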
  7. File names

    • Unique file names
    • Comprehensible
    • No special characters
    • Use _ instead of spaces
    • YYYY-MM-DD dates
    • Include initials, project, location, variable, content

    Example: 2022-09-27_EBR_RDM_practices.pdf
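The naming rules above can be sketched as a small helper. This is an illustration only: `build_filename` is a hypothetical function, and replacing every run of characters other than letters, digits and hyphens with a single underscore is one possible reading of "no special characters".

```python
import re
from datetime import date


def build_filename(day, parts, extension):
    """Join an ISO date and descriptive parts into one file name (hypothetical helper)."""
    cleaned = []
    for part in parts:
        # Replace spaces and special characters with "_" and trim stray underscores.
        safe = re.sub(r"[^A-Za-z0-9-]+", "_", part.strip()).strip("_")
        if safe:
            cleaned.append(safe)
    # date.isoformat() gives YYYY-MM-DD, so file names sort chronologically.
    return f"{day.isoformat()}_{'_'.join(cleaned)}.{extension}"


name = build_filename(date(2022, 9, 27), ["EBR", "RDM", "practices"], "pdf")
print(name)
```

With the inputs shown, the result matches the example file name on the slide.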
  8. Keep track of changes

    In general
    • Back up changes ASAP
    • Keep changes small
    • Share changes frequently

    Manually track changes
    • Add a CHANGELOG.txt
    • Copy entire project after large changes
    • File naming conventions: file_v1, file_v2

    Version control system
    • Git
    • GitHub, Bitbucket, GitLab,...

    Use a version control system (recommended)
  9. Data file formats

    • Non-proprietary (open source) formats
    • Easily reusable
    • Commonly used

    Some preferred file formats
    Tabular data:  .csv (comma separated values), HDF5, netcdf, rdf
    Text:          .txt, html, xml, odt, rtf
    Still images:  .tif, jpeg2000, png, pdf, gif, bmp, svg

    “Love your data, and help others love it, too” (Goodman et al, 2014)
  10. Data quality

    • Standardize data collection
    • Check data entry
    • Edit, clean, verify and validate raw data
    • Peer review
    • Documentation
    • Scripting

    “Data should be structured for analysis” (Hart et al, 2016)
  11. Tidy data

    • 80% of data analysis is spent on cleaning and preparing data
    • Tidy data: structuring datasets to facilitate analysis
    • Tidy data from the start of the project

    (Wickham, 2014)
  12. From messy to tidy: Make it a rectangle

    • Only rows and columns, no additional structure
    • One column for each type of information
    • One row for each observation (data point)

    Messy:
    Plot  SpeciesA  SpeciesB
    1     3         1
    2     2         4

    Source: Data carpentry for biologists
  13. From messy to tidy: Make it a rectangle

    • Only rows and columns, no additional structure
    • One column for each type of information
    • One row for each observation (data point)

    Messy:
    Plot  SpeciesA  SpeciesB
    1     3         1
    2     2         4

    Tidy:
    Plot  Species  Abundance
    1     A        3
    1     B        1
    2     A        2
    2     B        4

    Source: Data carpentry for biologists
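The messy-to-tidy step on this slide is a wide-to-long reshape. Below is a minimal sketch in plain Python using the values from the slide; `wide_to_long` and its parameter names are hypothetical, and data-frame libraries offer equivalent reshaping functions.

```python
def wide_to_long(rows, id_column, prefix, name_column, value_column):
    """Turn one column per species into one row per observation (hypothetical helper)."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix):
                tidy.append({
                    id_column: row[id_column],
                    # "SpeciesA" becomes "A": the prefix moves into the column name.
                    name_column: column[len(prefix):],
                    value_column: value,
                })
    return tidy


# Messy table from the slide: one abundance column per species.
messy = [
    {"Plot": 1, "SpeciesA": 3, "SpeciesB": 1},
    {"Plot": 2, "SpeciesA": 2, "SpeciesB": 4},
]

tidy = wide_to_long(messy, "Plot", "Species", "Species", "Abundance")
for record in tidy:
    print(record)
```

Each of the four printed records corresponds to one row of the tidy table: one plot, one species, one abundance.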
  14. From messy to tidy: One cell, one value

    • Every cell contains 1 piece of information

    Messy:
    Mass
    26g
    0.2kg

    Source: Data carpentry for biologists
  15. From messy to tidy: One cell, one value

    • Every cell contains 1 piece of information

    Messy:
    Mass
    26g
    0.2kg

    Tidy:
    Mass  Unit
    26    g
    0.2   kg

    Source: Data carpentry for biologists
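Splitting a combined cell such as "26g" into a value and a unit can be scripted. The sketch below assumes every cell is a plain decimal number followed by a unit made of letters; `split_mass` and the pattern are illustrative, not a general-purpose unit parser.

```python
import re

# Assumed cell layout: optional spaces, a decimal number, optional spaces, a letter-only unit.
MASS_PATTERN = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]+)\s*$")


def split_mass(text):
    """Return (value, unit) for a cell such as '26g' (hypothetical helper)."""
    match = MASS_PATTERN.match(text)
    if match is None:
        # Fail loudly instead of guessing, so odd cells get checked by hand.
        raise ValueError(f"cannot split value and unit in {text!r}")
    return float(match.group(1)), match.group(2)


for cell in ["26g", "0.2kg"]:
    value, unit = split_mass(cell)
    print(value, unit)
```

Raising an error for cells that do not fit the pattern keeps unexpected entries visible rather than silently turning them into wrong numbers.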
  16. From messy to tidy: Don’t mess with the computer

    • Don’t use visual markings (colors, italics, fonts,...)
    • Avoid spaces in names, use ‘_’ or CamelCase for multiple words
    • Avoid special characters (*, @, ^,...)

    Messy:
    Min temp
    5
    4.5
    3.1*

    Source: Data carpentry for biologists
  17. From messy to tidy: Don’t mess with the computer

    • Don’t use visual markings (colors, italics, fonts,...)
    • Avoid spaces in names, use ‘_’ or CamelCase for multiple words
    • Avoid special characters (*, @, ^,...)

    Messy:
    Min temp
    5
    4.5
    3.1*

    Tidy:
    min_temp  calibration_error
    5         0
    4.5       0
    3.1       1

    Source: Data carpentry for biologists
  18. From messy to tidy: Be clear and consistent

    • Use short meaningful names
    • Use consistent names, abbreviations, and capitalizations
    • Use good null values (blanks, NA,...). Do not use numbers (0, -999)
    • Write dates as YYYY-MM-DD or use separate Year, Month, and Day columns

    Messy:
    d             s     a
    26/02/2022    dior  9
    26/02/2022    disp  1
    May 24, 2022  DIor  -999
    May 24, 2022  DISP  Missing
  19. From messy to tidy: Be clear and consistent

    • Use short meaningful names
    • Use consistent names, abbreviations, and capitalizations
    • Use good null values (blanks, NA,...). Do not use numbers (0, -999)
    • Write dates as YYYY-MM-DD or use separate Year, Month, and Day columns

    Messy:
    d             s     a
    26/02/2022    dior  9
    26/02/2022    disp  1
    May 24, 2022  DIor  -999
    May 24, 2022  DISP  Missing

    Tidy:
    Date        Species  Abundance
    2022-02-26  dior     9
    2022-02-26  disp     1
    2022-05-24  dior     NA
    2022-05-24  disp     NA
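The clean-up shown on this slide can be sketched as a few small functions. Several things here are assumptions made for the example: slash-separated dates are read day first, "-999" and "Missing" are the only null codes in use, and month names are in English (`%B` depends on the locale). The function names are hypothetical.

```python
from datetime import datetime

# Assumed input formats: ISO, day/month/year, and "Month day, year" (English locale).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")
# Assumed codes that stand for a missing value in the messy table.
NULL_CODES = {"", "na", "missing", "-999"}


def to_iso_date(text):
    """Rewrite a date in one of the assumed formats as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean_abundance(text):
    """Return an integer count, or None when the cell holds a null code."""
    value = str(text).strip()
    if value.lower() in NULL_CODES:
        return None
    return int(value)


def clean_row(d, s, a):
    """Map the messy columns d, s, a to Date, Species, Abundance."""
    return {
        "Date": to_iso_date(d),
        "Species": s.strip().lower(),  # one consistent capitalization
        "Abundance": clean_abundance(a),
    }


messy = [
    ("26/02/2022", "dior", "9"),
    ("26/02/2022", "disp", "1"),
    ("May 24, 2022", "DIor", "-999"),
    ("May 24, 2022", "DISP", "Missing"),
]
tidy = [clean_row(*row) for row in messy]
```

`None` plays the role of NA inside Python; when writing the table back to a file it would be exported as a blank or as "NA", in line with the null-value advice on the slide.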
  20. Documentation & metadata

    Why?
    • Long term usability
    • Avoid misinterpretation
    • Collaboration & staff changes
    • Save your memory for other stuff…

    @TDXDigLibrary
  21. Project documentation

    Add a README.txt to your project folder
    • Context
    • People
    • Sponsor(s)
    • Data collection methods
    • File organization
    • Known problems, limitations, gaps
    • Licenses
    • How to cite
  22. Data documentation

    README.txt for each dataset
    • Variable names, labels, data type, description
    • Explain codes & abbreviations
    • Code & reason for missing values
    • Code used for derived data
    • File format
    • Software
    • Data standards
  23. Backup guidelines

    • 3-2-1 rule: 3 copies, 2 different types of media, 1 offsite
    • Apply a backup schedule
    • Test file restores
    • Do not use CDs or DVDs
    • Use reliable backup media
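One way to test a file restore is to compare checksums of the original and the restored copy. This is a sketch of that idea, not a backup tool: `sha256_of` and `restore_matches` are hypothetical helpers, and the choice of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original, restored):
    """True when the restored file has the same content as the original."""
    return sha256_of(original) == sha256_of(restored)
```

Running such a check on a sample of files after each scheduled backup gives early warning when a backup medium starts to fail.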
  24. Open data

    • funders’ demands
    • publishers’ rules
    • innovation & valorization
    • higher citation rates
    • more visibility for your work
    • collaboration
    • application of your findings
    • reduced costs
    • public access to your findings
  25. FAIR data

    Make your data
    • Findable
    • Accessible
    • Interoperable
    • Reusable

    Findable
    • Persistent identifiers (DOI)
    • Metadata
    • Naming conventions
    • Keywords
    • Versioning

    Accessible
    • Choice of datasets
    • Data repository
    • Software, documentation
    • Access status
    • Retrievable data
    • Metadata access

    Interoperable
    • Standards
    • Vocabulary
    • Methodology
    • References

    Reusable
    • Licensing
    • Provenance
    • Community standards
  26. Where to deposit data?

    • Depends on data type
    • Domain specific repositories
      ◦ GBIF for species occurrences
      ◦ Movebank for movement data
      ◦ GenBank for genomic data
      ◦ ...
    • General repositories (Zenodo, dataDryad,...)
    • ORCID integration
    • DOI for citing datasets