December 06, 2016

Defining dataset specifications to communicate data quality characteristics

Talk at the TDWG 2016 annual conference in Santa Clara de San Carlos, Costa Rica - December 6, 2016.

Recording: https://vimeo.com/showcase/4308386/video/196431347

The Darwin Core standard provides a list of community-ratified terms for sharing biodiversity information. Although some terms have strict definitions, most leave users some freedom in how to interpret them. This freedom has enabled a wide range of biodiversity data to be mapped to Darwin Core, but it complicates automated data aggregation and processing. One way to resolve this is to write community-specific guidelines describing how data should be mapped, but few have been created or adopted. Moreover, such guidelines are intended for human readers only.

Inspired by existing data validation specifications in other fields, we propose the use of a specification file that describes the constraints with which the data should comply. Its syntax is both human- and machine-readable, so it can be used to communicate expected data quality and conformity and to validate data automatically. The scope of a set of rules can be specific to a dataset, publisher or community, which allows both bottom-up and top-down adoption.

In this talk, we will present a prototype format for these specifications, where the rules are defined on the level of individual terms and expressed as a YAML file. We also present prototype software to validate data with these specifications. We hope it will trigger a discussion on how to express data specifications and mapping guidelines.
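
To make the idea of term-level rules concrete, here is a minimal sketch in Python. It is not the syntax or the code of the prototype: the rule names (`allowed`, `type`, `min`, `max`), the functions `check_value` and `validate`, and the sample data are all assumptions chosen for illustration. The specification is written as a nested mapping, which is the structure a YAML file with one entry per term would load into.

```python
import csv
import io

# Hypothetical specification: one entry per Darwin Core term, each listing
# the constraints its values should satisfy. The rule names are illustrative
# assumptions, not the prototype's actual vocabulary.
SPECIFICATION = {
    "basisOfRecord": {"allowed": ["HumanObservation", "PreservedSpecimen"]},
    "individualCount": {"type": "integer", "min": 1},
    "decimalLatitude": {"type": "float", "min": -90, "max": 90},
}

# Maps a rule's type name to the Python function used to parse the value.
CASTS = {"integer": int, "float": float}


def check_value(value, rules):
    """Return one message for every rule the value violates."""
    errors = []
    if "allowed" in rules and value not in rules["allowed"]:
        errors.append(f"'{value}' is not one of {rules['allowed']}")
    if "type" in rules:
        try:
            number = CASTS[rules["type"]](value)
        except ValueError:
            errors.append(f"'{value}' is not a valid {rules['type']}")
            return errors  # range checks need a parsed number
        if "min" in rules and number < rules["min"]:
            errors.append(f"{number} is below the minimum {rules['min']}")
        if "max" in rules and number > rules["max"]:
            errors.append(f"{number} is above the maximum {rules['max']}")
    return errors


def validate(rows, specification):
    """Check every row; return a list of (row number, term, message)."""
    report = []
    for row_number, row in enumerate(rows, start=1):
        for term, rules in specification.items():
            if term not in row:
                report.append((row_number, term, "term is missing"))
                continue
            for message in check_value(row[term], rules):
                report.append((row_number, term, message))
    return report


# Made-up sample data: the second record breaks all three rules.
DATA = """basisOfRecord,individualCount,decimalLatitude
HumanObservation,2,51.05
Observation,0,95.2
"""

rows = list(csv.DictReader(io.StringIO(DATA)))
report = validate(rows, SPECIFICATION)
for row_number, term, message in report:
    print(f"row {row_number}, {term}: {message}")
```

Because the rules are plain data, the same file can serve as documentation for people and as input for a validator, and a publisher or community could share one specification across many datasets.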

Peter Desmet

Transcript

Defining dataset specifications to communicate data quality
Peter Desmet, Stijn Van Hoey, Dimitri Brosens
Darwin Core offers a lot of (necessary) freedom
But how do you express more rigorous requirements?
We need documentation
Does my dataset comply?
We need machine-readable documentation
YAML Human & machine-readable
Demo
Dataset
Run data-validator
Report
Improved dataset
Rerun data-validator
Specifications for datasets
Specifications for data publishers
Specifications for data users
Specifications for communities
Integration in data publication workflows
Proof of concept: github.com/inbo/data-validator
Examples used in this presentation: bit.ly/2h352c8
Thanks! @peterdesmet @stijnvanhoey @dimibro bit.ly/2h0cDLU
Desmet P, Van Hoey S & Brosens D (2016) Defining dataset specifications to communicate data quality. http://bit.ly/2h0cDLU