Open Science workshop: Create and use GBIF occurrence cubes
Hands-on workshop to learn what GBIF species occurrence cubes are, how to download them and how to use them. This workshop and the species occurrence cubes are outputs of the Horizon Europe project B-Cubed. License: CC-BY.
• What are species occurrence cubes?
• GBIF SQL download API
• Hands-on: download species occurrence cubes
  ◦ GBIF web interface
  ◦ rgbif
• Break with 🎂
• Hands-on: use species occurrence cubes
A species occurrence cube is a tab-separated CSV file containing species occurrence measures (e.g. a count) summarised by taxonomic, temporal and/or spatial dimensions (e.g. a given year, a specific taxonomic rank). GBIF officially launched the service in March 2025.
• Aggregated GBIF occurrence data
• You choose the grouping variables
• Data are delivered as a GBIF Download
• Same delivery method as for occurrences
• Findable: DOI
• Accessible: GBIF infrastructure
• Interoperable: tab-separated CSV file
• Reproducible: metadata (with query) available
A typical cube aggregates GBIF occurrences
• taxonomically, e.g. species
• spatially, e.g. EEA grid 1x1km
• temporally, e.g. year
Presented at TDWG2020 (see slides, abstract). Preprint (PDF) used in the B-Cubed project proposal. Used for calculating emerging trends indicators.
Oldoni D, Groom Q, Desmet P (2020) https://speakerdeck.com/damianooldoni/occurrence-cubes
Aggregate: number of occurrences of a specific taxon in a specific cell and in a specific time interval.
year  eea_cell_code  speciesKey  n    min_coord_uncertainty
2014  1kmE3886N3121  2889173     51   10
2014  1kmE3886N3122  2889173     109  10
...   ...            ...         ...  ...
2018  1kmE4047N3067  2889088     1    2828
Derived from Oldoni D, Groom Q, Desmet P (2020) https://speakerdeck.com/damianooldoni/occurrence-cubes
Extra measures: what is the minimum coordinate uncertainty among all the occurrences for that specific year/cell/species?
year  eea_cell_code  speciesKey  n    min_coord_uncertainty
2014  1kmE3886N3121  2889173     51   10
2014  1kmE3886N3122  2889173     109  10
...   ...            ...         ...  ...
2018  1kmE4047N3067  2889088     1    2828
Derived from Oldoni D, Groom Q, Desmet P (2020) https://speakerdeck.com/damianooldoni/occurrence-cubes
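Because a cube is delivered as a plain tab-separated file, it can be read with standard tooling. The sketch below parses the three rows shown in the table above from an in-memory string and sums `n` per year. The helper names `read_cube` and `occurrences_per_year` are made up for this illustration, and a real cube would be read from the file inside the GBIF download rather than from a string.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the example table above. A real cube would come from the
# tab-separated file inside the GBIF download archive.
CUBE_TSV = (
    "year\teea_cell_code\tspeciesKey\tn\tmin_coord_uncertainty\n"
    "2014\t1kmE3886N3121\t2889173\t51\t10\n"
    "2014\t1kmE3886N3122\t2889173\t109\t10\n"
    "2018\t1kmE4047N3067\t2889088\t1\t2828\n"
)


def read_cube(handle):
    """Parse a tab-separated cube into a list of dicts with typed measures."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["year"] = int(row["year"])
        row["n"] = int(row["n"])
        row["min_coord_uncertainty"] = float(row["min_coord_uncertainty"])
        rows.append(row)
    return rows


def occurrences_per_year(rows):
    """Sum the occurrence count n over all cells and species, per year."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["year"]] += row["n"]
    return dict(totals)


cube = read_cube(io.StringIO(CUBE_TSV))
per_year = occurrences_per_year(cube)
print(per_year)
```

Dimension columns such as `eea_cell_code` and `speciesKey` are left as strings here, since they act as identifiers rather than quantities.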
Extra measures: what is the number of occurrences for the same year/cell combination at class level?
year  eea_cell_code  speciesKey  n    classKey  n_class
2014  1kmE3886N3121  2889173     51   220       4890
2014  1kmE3886N3122  2889173     109  220       2901
...   ...            ...         ...  ...       ...
2018  1kmE4047N3067  2889088     1    220       510
Derived from Oldoni D, Groom Q, Desmet P (2020) https://speakerdeck.com/damianooldoni/occurrence-cubes
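To make the measures in these tables concrete, here is a minimal sketch of how `n`, `min_coord_uncertainty` and `n_class` could be derived from individual occurrence records. The records are invented for illustration (they are not GBIF data), `build_cube` is a hypothetical helper, and GBIF computes its cubes in SQL rather than like this.

```python
from collections import defaultdict

# Hypothetical occurrence records, invented for illustration:
# (year, eea_cell_code, speciesKey, classKey, coordinateUncertaintyInMeters)
occurrences = [
    (2014, "1kmE3886N3121", 2889173, 220, 10.0),
    (2014, "1kmE3886N3121", 2889173, 220, 250.0),
    (2014, "1kmE3886N3121", 2889088, 220, 30.0),
    (2018, "1kmE4047N3067", 2889088, 220, 2828.0),
]


def build_cube(records):
    """Aggregate records by year/cell/species and add the extra measures."""
    n = defaultdict(int)        # occurrences per (year, cell, species)
    min_unc = {}                # smallest uncertainty per (year, cell, species)
    n_class = defaultdict(int)  # occurrences per (year, cell, class)
    class_of = {}               # class of each (year, cell, species) group
    for year, cell, species_key, class_key, uncertainty in records:
        key = (year, cell, species_key)
        n[key] += 1
        min_unc[key] = min(uncertainty, min_unc.get(key, uncertainty))
        n_class[(year, cell, class_key)] += 1
        class_of[key] = class_key
    cube = []
    for key in sorted(n):
        year, cell, species_key = key
        class_key = class_of[key]
        cube.append({
            "year": year,
            "eea_cell_code": cell,
            "speciesKey": species_key,
            "n": n[key],
            "min_coord_uncertainty": min_unc[key],
            "classKey": class_key,
            "n_class": n_class[(year, cell, class_key)],
        })
    return cube


toy_cube = build_cube(occurrences)
for row in toy_cube:
    print(row)
```

Note that `n_class` counts every occurrence of the class in that year and cell, so it is shared by all species of the class within the same year/cell combination.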
The SQL Download API allows users:
• to query GBIF occurrences using SQL (Structured Query Language)
• to select the columns of interest*
• to generate summary views of GBIF data*
*Not possible with the “standard” Predicate Download API.
See GBIF documentation.
Select the columns of interest = return a flat occurrence table
Let’s SELECT columns FROM occurrence WHERE conditions
Let’s select the unique datasets and publishers with occurrences recorded in Belgium this year:
SELECT DISTINCT datasetKey, publishingOrgKey
FROM occurrence
WHERE countryCode = 'BE' AND "year" = 2026
Result: https://doi.org/10.15468/dl.3z4fr2
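Outside the web interface, a query like the one above is submitted as a JSON request. The sketch below only builds that request body and sends nothing. The endpoint URL, the field names and the `SQL_TSV_ZIP` format value reflect my reading of the GBIF API documentation and should be checked there before use. The helper name and the e-mail address are placeholders, and a real request also needs GBIF credentials, which are not shown.

```python
import json

# Assumption: endpoint as described in the GBIF API documentation (verify there).
GBIF_DOWNLOAD_REQUEST_URL = "https://api.gbif.org/v1/occurrence/download/request"


def build_sql_download_request(sql, email):
    """Build the JSON body for a SQL download request (nothing is sent)."""
    return {
        "sendNotification": True,
        "notificationAddresses": [email],
        "format": "SQL_TSV_ZIP",
        # Collapse runs of whitespace so the query fits on one line. This is
        # only safe because the query has no string literal containing
        # consecutive spaces or newlines.
        "sql": " ".join(sql.split()),
    }


query = """
SELECT DISTINCT datasetKey, publishingOrgKey
FROM occurrence
WHERE countryCode = 'BE' AND "year" = 2026
"""

body = build_sql_download_request(query, "user@example.org")
print(json.dumps(body, indent=2))
```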
Select the columns of interest = return a flat occurrence table
Let’s SELECT columns FROM occurrence WHERE conditions
Real-world example: biologging data from a network.
• All organisms for a single year
• occurrenceID
• organismID
• Taxonomic information
• Spatial information
• eventID, parentEventID
From workshop: Hip to be cubed: using the new GBIF SQL Download API (Part 1). Huybrechts P, Breugelmans L, Trekels M, Rodrigues A, Blissett M. https://bit.ly/4lSM75n
SELECT
  occurrenceid, organismid, scientificname, taxonkey, eventdate,
  decimallatitude, decimallongitude, eventid, parenteventid,
  datasetkey, publisher
FROM occurrence
WHERE GBIF_STRINGARRAYCONTAINS(occurrence.networkkey, 'ab013f3a-3c00-42cb-9fdb-cb5f4ba20a4b', FALSE)
  AND occurrence."year" = 2020
  AND occurrence.occurrencestatus = 'PRESENT'
  AND occurrence.basisofrecord = 'MACHINE_OBSERVATION'
From workshop: Hip to be cubed: using the new GBIF SQL Download API (Part 1). Huybrechts P, Breugelmans L, Trekels M, Rodrigues A, Blissett M. https://bit.ly/4lSM75n
Select the columns of interest AND aggregate
Let’s SELECT columns FROM occurrence WHERE conditions GROUP BY variables
Let’s count the number of occurrences recorded this year in Belgium for each dataset and publisher:
SELECT datasetKey, publishingOrgKey, COUNT(*)
FROM occurrence
WHERE countryCode = 'BE' AND "year" = 2026
GROUP BY datasetKey, publishingOrgKey
Result: https://doi.org/10.15468/dl.czvvdp
Select the columns of interest AND aggregate
Let’s SELECT columns FROM occurrence WHERE conditions GROUP BY dimensions
Let’s count the number of occurrences recorded this month in Flanders for each species and day. Only presences (no absences).
SELECT species, speciesKey, eventDate, COUNT(*) AS n
FROM occurrence
WHERE countryCode = 'BE'
  AND level1gid = 'BEL.2_1'
  AND "year" = 2026
  AND "month" = 3
  AND occurrenceStatus = 'PRESENT'
GROUP BY species, speciesKey, eventDate
Result: https://doi.org/10.15468/dl.cerftr
Result: we have just created our first species occurrence cube 🫨
A cube with two dimensions:
- taxonomic
- temporal
OK, we created a square ⃞
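The same species-by-day aggregation can be tried locally on toy data. This sketch uses Python's built-in `sqlite3` with a hypothetical, much reduced `occurrence` table: the species names, keys and rows are invented, and the country, region, year and month conditions are left out for brevity. SQLite's SQL dialect is not the one used by GBIF, so this only illustrates the GROUP BY idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrence ("
    "species TEXT, speciesKey INTEGER, eventDate TEXT, occurrenceStatus TEXT)"
)

# Invented toy records: two species, two days, one absence.
conn.executemany(
    "INSERT INTO occurrence VALUES (?, ?, ?, ?)",
    [
        ("Species A", 1, "2026-03-01", "PRESENT"),
        ("Species A", 1, "2026-03-01", "PRESENT"),
        ("Species A", 1, "2026-03-02", "PRESENT"),
        ("Species B", 2, "2026-03-01", "PRESENT"),
        ("Species B", 2, "2026-03-01", "ABSENT"),
    ],
)

# Two dimensions (taxonomic and temporal) and one measure (n).
square = conn.execute(
    "SELECT species, speciesKey, eventDate, COUNT(*) AS n "
    "FROM occurrence "
    "WHERE occurrenceStatus = 'PRESENT' "
    "GROUP BY species, speciesKey, eventDate "
    "ORDER BY species, eventDate"
).fetchall()
conn.close()

for row in square:
    print(row)
```

Each returned row is one cell of the "square": a species, a day and the number of presences recorded for that combination. The absence record is excluded by the WHERE clause.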
Download a species occurrence cube from GBIF using the web interface.
• Go to the GBIF occurrence search
• occurrence_status=present (already selected by default)
• year=2010,2025
• country=BE
• Show “All filters” to select the Flanders region: gadm_gid=BEL.2_1
• taxon_key=6 (scientificName: Plantae)
• coordinate_uncertainty_in_meters=0,1000 (quite precisely georeferenced data)
• URL occurrence search
• Download
Download a species occurrence cube from GBIF using the web interface.
• Taxonomic dimension: Species
• Temporal dimension: Year
• Spatial dimension: EEA reference grid (Europe only)
• Spatial resolution: 1km
• Randomize points within uncertainty circle: yes
Randomize points within uncertainty circle: why?
Directly assigning centroid coordinates to a grid can lead to huge spatial bias.
Oldoni D, Groom Q, Desmet P (2020) https://speakerdeck.com/damianooldoni/occurrence-cubes
How to assign occurrences to grids? How to apply randomization?
Via special grid functions, e.g. GBIF_EEARGCode:
STRING GBIF_EEARGCode(INTEGER gridSize, DOUBLE latitude, DOUBLE longitude, DOUBLE coordinateUncertaintyInMeters)
Set coordinateUncertaintyInMeters to 0 to disable randomization.
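The idea behind randomizing within the uncertainty circle can be sketched in a few lines. This is not GBIF's implementation of GBIF_EEARGCode. It assumes the coordinates are already projected to ETRS89-LAEA (EPSG:3035) metres, whereas the GBIF function takes latitude and longitude, and the function names are made up for this illustration.

```python
import math
import random
from collections import Counter


def randomize_within_circle(x, y, radius_m, rng):
    """Return a point drawn uniformly from the circle around (x, y).

    A radius of 0 (or less) disables randomization.
    """
    if radius_m <= 0:
        return x, y
    # The square root makes the draw uniform over the area of the circle;
    # without it, points would cluster near the centre.
    r = radius_m * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return x + r * math.cos(theta), y + r * math.sin(theta)


def eea_cell_code(x, y, grid_size_m=1000):
    """Build an EEA-style cell code such as '1kmE3886N3121'.

    Assumes x, y are metres in EPSG:3035 and the grid size is a whole
    number of kilometres.
    """
    if grid_size_m <= 0 or grid_size_m % 1000 != 0:
        raise ValueError("grid_size_m must be a positive multiple of 1000")
    easting = math.floor(x / grid_size_m)
    northing = math.floor(y / grid_size_m)
    return f"{grid_size_m // 1000}kmE{easting}N{northing}"


rng = random.Random(42)  # fixed seed so the sketch is repeatable
x, y = 3886990.0, 3121500.0  # a point 10 m from the eastern edge of its cell

print(eea_cell_code(x, y))  # no randomization: always the same cell

# With a 500 m uncertainty, the occurrence can land in a neighbouring cell.
cells = Counter(
    eea_cell_code(*randomize_within_circle(x, y, 500.0, rng))
    for _ in range(1000)
)
print(cells)
```

Repeating the draw shows why randomization matters: a record whose uncertainty circle straddles a cell border is spread over the cells it may belong to, instead of always being assigned to the cell that contains its reported coordinates.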
Download a species occurrence cube from GBIF using the web interface.
• Occurrence count at higher taxonomic level: from kingdom down to genus. Useful to assess sampling bias.
• Include minimum coordinate uncertainty: Yes. Useful to assess the spatial precision of the data.
• Include minimum temporal uncertainty: Yes. Useful to assess the temporal precision of the data.
Download a species occurrence cube from GBIF using the web interface.
Make sure the following filters are checked ✅
• Remove records with geospatial issues
• Remove records not confidently matched to a taxon
• Remove records at country centroids
• Remove records of fossils and living specimens, e.g. those from botanical and zoological gardens
Can we download now? NO: let’s Edit as SQL first. Why? Because “for complex queries and aggregations, the SQL editor provides more freedom.”
Goal: remove unvalidated records, based on identificationVerificationStatus.
Download a species occurrence cube from GBIF using the web interface.
Remove unvalidated records, based on identificationVerificationStatus.
How to filter in SQL? Add a condition to the WHERE clause of the SQL query:
WHERE ...
  AND (
    LOWER(identificationVerificationStatus) NOT IN (
      'unverified',
      'unvalidated',
      'not validated',
      'under validation',
      'not able to validate',
      'control could not be conclusive due to insufficient knowledge',
      'uncertain',
      'unconfirmed',
      'unconfirmed - not reviewed',
      'validation requested'
    )
    OR identificationVerificationStatus IS NULL
  )
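The logic of this condition can be mirrored outside SQL, which helps to see which records survive the filter. A minimal sketch follows: the function name `keep_record` is hypothetical and `None` stands in for SQL NULL.

```python
# Lower-cased status values that mark a record as unvalidated, copied from the
# SQL condition above.
UNVALIDATED_STATUSES = frozenset({
    "unverified",
    "unvalidated",
    "not validated",
    "under validation",
    "not able to validate",
    "control could not be conclusive due to insufficient knowledge",
    "uncertain",
    "unconfirmed",
    "unconfirmed - not reviewed",
    "validation requested",
})


def keep_record(identification_verification_status):
    """Return True if a record passes the validation filter.

    Records without a status (None, i.e. SQL NULL) are kept, matching the
    'OR identificationVerificationStatus IS NULL' branch of the SQL condition.
    """
    if identification_verification_status is None:
        return True
    return identification_verification_status.lower() not in UNVALIDATED_STATUSES


statuses = ["Unverified", None, "verified", "UNCERTAIN", "approved by expert"]
kept = [status for status in statuses if keep_record(status)]
print(kept)
```

The explicit NULL branch matters in SQL: without it, records lacking a status would be dropped, because `NULL NOT IN (...)` does not evaluate to true.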
Download a species occurrence cube from GBIF using the web interface.
Is your GBIF download still being processed? No worries! Have a look at this download and its SQL query.
Download a species occurrence cube from GBIF using the dedicated SQL web interface.
Are you a SQL expert, or do you have a good template to reuse? Just start writing SQL directly in the GBIF SQL interface!
Download a species occurrence cube from GBIF using rgbif.
You can use {rgbif} to interface with the GBIF SQL download API:
• Use the function occ_download_sql().
• Have a look at the “GBIF SQL Downloads” vignette.
q <- "this is my SQL query, so it won't work"
occ_download_sql(q)
Do you want to know more about creating/importing cubes with rgbif? Have a look at the presentation from slide 66 (pdf, Google Slides): “The b3verse: an R package suite to process cubes and calculate indicators”, Langeraert W, Dove S, Hillaert J, 2026.
“The b3verse: an R package suite to process cubes and calculate indicators”, Langeraert W, Dove S, Hillaert J, 2026. Slide 93: General indicators.
Image: Centaurea cyanus (c) Kai-Philipp Schablewski, CC-BY-NC
B-Cubed receives funding from the European Union’s Horizon Europe Research and Innovation Programme (ID No 101059592). Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the EU nor the EC can be held responsible for them.
Thank you!
Damiano Oldoni, Ward Langeraert & Jasmijn Hillaert
Slides: Google Slides, PDF
B-Cubed Project: b-cubed.eu