
University_of_Amsterdam-Data_Governance___AI__T...

Marketing OGZ
September 19, 2025

 University_of_Amsterdam-Data_Governance___AI__The_Positive_Feedback_Loop.pdf


Transcript

  1. Data Governance & AI: A positive feedback loop

    Data Expo, Utrecht, September 11, 2025
    Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
    Thanks to Prof. George Fletcher, Dr. Juan Sequeda, Dr. Katleen Gregory, Dr. Laura Koesten, the INDElab team
  2. 2

  3. Research Topics at INDE lab

    Design systems to support people in working with data from diverse sources
    Address problems related to the preparation, management, and integration of data
    • Automated Knowledge Graph Construction
      (e.g. building KGs from multiple modalities; architectures for integrating KGs and LLMs)
    • Context Aware Data Systems
      (e.g. rule learning & digital twins; human-data interaction; human-AI workflows)
    • Data Management for Machine Learning
      (e.g. data quality assessment; data handling impact on ML models; data search)


  4. “Data Governance is a discipline which provides the necessary policies, processes, standards, roles and responsibilities needed to ensure that data is managed as an asset.”

    Source: https://www.lightsondata.com/what-is-data-governance/
  5. EU Data Act

    • Ease user access to data generated by them
    • New data sharing contracts for SMEs
    • Cloud switching
    • Public sector agencies can access data from businesses (in emergencies)
  6. EU AI Act

    • Data and data governance
    • Transparency for Users
    • Human oversight
    • Accuracy, Robustness and Cybersecurity
    • Traceability and Auditability
    Lilian Edwards. (2022). The EU AI Act proposal. Ada Lovelace Institute. Available at: https://www.adalovelaceinstitute.org/resource/eu-ai-act-explainer/
    https://www.lawfareblog.com/artificial-intelligence-act-what-european-approach-ai
  7. “Finding digital truth, that is, identifying and combining data that accurately represent reality, is becoming more difficult and more important. More difficult because data and their sources are multiplying. And more important because firms need to get their data house in order to benefit from AI, which they must to stay competitive.”

    -- The Economist, February 2020
  8. AI in Production is a team sport

    [Diagram labels]
    • Graph analytics; Self-service analytics; AI / ML models
    • Data Consumers: People with business questions, Data Analysts, Data Scientist
    • Sources: Data warehouse, data lakes and app-specific DBs; Cloud services and APIs; Files and shared files; Analytics platforms
    • Data Producers: Data Engineers, Data Stewards
  9. New architectures

    Source: The Future of Work With AI - Microsoft March 2023 Event
    https://www.youtube.com/watch?v=Bf-dbS9CcRU&ab_channel=Microsoft
  10. Why is Data Governance increasingly important?

    • The amount of data
    • More people have access to data
    • More ways to collect data
    • More kinds of data
    • Uses have expanded
    • New regulations
    • Ethical concerns
    Eryurek, Evren, et al. Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness. First edition, O’Reilly Media, Inc, 2021.
  11. Data Lifecycle

    Eryurek, Evren, et al. Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness. First edition, O’Reilly Media, Inc, 2021.
  12. Governance of a data life cycle

    Eryurek, Evren, et al. Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness. First edition, O’Reilly Media, Inc, 2021.
  13. Implications for Data Governance

    • Premise: Improving ability to use expertise
      Consequence: Expertise is a critical resource
    • Premise: Improving ability to use more and different signals
      Consequence: Signal capture becomes imperative
    • Premise: Multiple content sources buttress each other
      Consequence: Understand and use the entire data estate
    • Premise: Machine learning SOTA is accessible
      Consequence: Problem formulation is fundamental
  14. Modern Data Stack

    • Cloud-first
    • Built around cloud data warehouse/lake
    • Focus on solving one problem
    • Offered as SaaS or open-core
    • Low entry barrier
    • Actively supported by communities
    https://atlan.com/modern-data-stack-101/
  15. AI throughout Data Governance Tech

    • Data Catalog
      • AI: recommendations, prioritising curation
    • Semantic Layers
      • AI: automatically building knowledge graphs and vocabularies
    • Data workspaces
      • AI: synthetic data generation, making sure data is properly used
    • Monitoring and reporting
      • AI: governance advice, understanding an estate
  16. The Data Catalog as starting point

    • The data catalog is not only a place to find data but also a place to understand data demands and employ AI
    • Including:
      • which datasets are used
      • how data is used (fields)
      • who uses a dataset
      • who the people are to talk to in order to figure out the data
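
The usage signals listed on this slide could be derived from an access log. The following is a minimal sketch under stated assumptions, not a description of any particular catalog product: the log format, the `ACCESS_LOG` records, and the `summarize_usage` helper are hypothetical, and treating the most frequent user as the "person to talk to" is only an illustrative heuristic.

```python
from collections import Counter, defaultdict

# Hypothetical access-log records; a real catalog would derive these from query logs.
ACCESS_LOG = [
    {"user": "ana", "dataset": "sales", "fields": ["region", "revenue"]},
    {"user": "ben", "dataset": "sales", "fields": ["revenue"]},
    {"user": "ana", "dataset": "customers", "fields": ["segment"]},
    {"user": "ana", "dataset": "sales", "fields": ["revenue", "date"]},
]


def summarize_usage(log):
    """Aggregate per-dataset usage: access count, most-used fields, users, and a contact."""
    dataset_counts = Counter()
    field_counts = defaultdict(Counter)
    user_counts = defaultdict(Counter)
    for event in log:
        name = event["dataset"]
        dataset_counts[name] += 1                  # which datasets are used
        field_counts[name].update(event["fields"])  # how data is used (fields)
        user_counts[name][event["user"]] += 1       # who uses a dataset
    return {
        name: {
            "accesses": count,
            "top_fields": [f for f, _ in field_counts[name].most_common(3)],
            "users": sorted(user_counts[name]),
            # Assumption: the heaviest user is a reasonable first contact.
            "contact": user_counts[name].most_common(1)[0][0],
        }
        for name, count in dataset_counts.items()
    }


summary = summarize_usage(ACCESS_LOG)
```

A summary like this is the kind of input that recommendation or curation-prioritisation features could build on, for example by curating the most-accessed datasets first.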
  17. Article: Dataset Reuse: Toward Translating Principles to Practice

    Laura Koesten,1,* Pavlos Vougiouklis,2 Elena Simperl,1 and Paul Groth3,4,*
    1 King’s College London, London WC2B 4BG, UK
    2 Huawei Technologies, Edinburgh EH9 3BF, UK
    3 University of Amsterdam, Amsterdam 1090 GH, the Netherlands
    4 Lead Contact
    * Correspondence: [email protected] (L.K.), [email protected] (P.G.)
    https://doi.org/10.1016/j.patter.2020.100136

    SUMMARY
    The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes a dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub’s engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset’s reusability. This demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse.

    1 INTRODUCTION
    There has been a gradual shift in the last years from viewing datasets as byproducts of (digital) work to critical assets, whose value increases the more they are used [1,2]. However, our understanding of how this value emerges, and of the factors that demonstrably affect the reusability of a dataset is still limited.
    Using a dataset beyond the context where it originated remains challenging for a variety of socio-technical reasons, which have been discussed in the literature [3,4]; the bottom line is that simply making data available, even when complying with existing guidance and best practices, does not mean it can be easily used by others [5]. At the same time, making data reusable to a diverse audience, in terms of domain, skill sets, and purposes, is an important way to realize its potential value (and recover some of the, sometimes considerable, resources invested in policy and infrastructure support). This is one of the reasons why scientific journals and research-funding organizations are increasingly calling for further data sharing [6] or why industry bodies, such as the International Data Spaces Association (IDSA) (https://www.internationaldataspaces.org/) are investing in reference architectures to smooth data flows from one business to another.
    There is plenty of advice on how to make data easier to reuse, including technical standards, legal frameworks, and guidelines. Much work places focus on machine readability

    THE BIGGER PICTURE
    The web provides access to millions of datasets. These data can have additional impact when it is used beyond the context for which it was originally created. We have little empirical insight into what makes a dataset more reusable than others, and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub’s engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset’s reusability.
    This work demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse.
    Proof-of-Concept: Data science output has been formulated, implemented, and tested for one domain/problem

    Patterns 1, 100136, November 13, 2020. © 2020 The Author(s). Open Access.

    Lots of good advice for metadata
    • Maybe a bit too much…
    • Currently, 140 policies on fairsharing.org as of April 5, 2021
    • We reviewed 40 papers
    • Cataloged 39 different features of datasets that enable data reuse
  18. Getting some data

    • Used GitHub as a case study
    • ~1.4 million datasets (e.g. CSV, Excel) from ~65K repos
    • Use engagement metrics as proxies for data reuse
    • Map literature features to both dataset and repository features
    • Train a predictive model to see which features are good predictors
  19. Dataset Features

    • Missing values
    • Size
    • Columns + Rows
    • Readme features
    • Issue features
    • Age
    • Description
    • Parsable
  20. Where to start?

    • Some ideas from this study if you’re publishing data with GitHub:
      • provide an informative short textual summary of the dataset
      • provide a comprehensive README file in a structured form and links to further information
      • datasets should not exceed standard processable file sizes
      • datasets should be possible to open with a standard configuration of a common library (such as Pandas)
    Trained a Recurrent Neural Network. There might be better models, but it is useful for handling text. Not the greatest predictor (good at classifying non-reuse), but still useful for helping us tease out features.
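
The dataset features and publishing ideas from the GitHub study can be turned into a simple pre-publication check. The sketch below is illustrative only and is not the model or pipeline used in the study: it uses Python's standard `csv` module as a stand-in for "a common library (such as Pandas)", and the function names, the 50 MB size limit, the 350-character summary limit, and the README heuristics are all assumptions, since the slides give no concrete thresholds.

```python
import csv
import io

# Hypothetical limits: the slides name no concrete numbers.
MAX_BYTES = 50 * 1024 * 1024
MAX_SUMMARY_CHARS = 350


def dataset_features(csv_text: str) -> dict:
    """Compute a few of the listed dataset features for one CSV file given as text."""
    features = {"size_bytes": len(csv_text.encode("utf-8")), "parsable": True}
    try:
        # strict=True makes the parser raise csv.Error on malformed input.
        rows = list(csv.reader(io.StringIO(csv_text), strict=True))
    except csv.Error:
        features["parsable"] = False
        return features
    header, body = (rows[0], rows[1:]) if rows else ([], [])
    features["columns"] = len(header)
    features["rows"] = len(body)
    # Count empty cells as missing values.
    features["missing_values"] = sum(
        1 for row in body for cell in row if cell.strip() == ""
    )
    return features


def publishing_checks(features: dict, description: str, readme: str) -> dict:
    """Map the four publishing ideas to rough, assumption-laden boolean checks."""
    return {
        "has_short_summary": 0 < len(description.strip()) <= MAX_SUMMARY_CHARS,
        "readme_is_structured": any(
            line.startswith("#") for line in readme.splitlines()
        ),
        "readme_has_links": "http://" in readme or "https://" in readme,
        "size_ok": features["size_bytes"] <= MAX_BYTES,
        "parsable": features["parsable"],
    }


sample = "id,name,score\n1,Ada,9.5\n2,,7.0\n3,Grace,\n"
features = dataset_features(sample)
checks = publishing_checks(
    features,
    description="Toy scores table.",
    readme="# Scores\nSee https://example.org/docs\n",
)
```

Checks like these could run in a repository's continuous-integration step, so that guidance about summaries, READMEs, size, and parsability is enforced as part of the publishing process rather than reviewed by hand.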
  21. AI as a Governance Advisor

    Daly, E. M., Rooney, S., Tirupathi, S., Garces-Erice, L., Vejsbjerg, I., Bagehorn, F., Salwala, D., Giblin, C., Wolf-Bauwens, M. L., Giurgiu, I., Hind, M., & Urbanetz, P. (2025). Usage governance advisor: From intent to AI governance (arXiv:2412.01957). arXiv. https://doi.org/10.48550/arXiv.2412.01957
  22. Building governance into the data lifecycle enables AI

    • Build standards into your existing process and implement them as engineering solutions.
    • Engineering enables AI
    • AI improves governance
    • Better governance means better data for AI systems
    • People and processes are just as important as tools and infrastructure
    https://www.microsoft.com/insidetrack/blog/driving-effective-data-governance-for-improved-quality-and-analytics/
  23. Conclusion

    • Data Governance is even more important as:
      • data landscapes expand; and
      • AI requires high-quality governance.
    • Governance needs to be present throughout the data lifecycle
    • AI can help governance across the data lifecycle
    • Better Governance 🔁 Better AI
    Paul Groth | [email protected] | @pgroth | pgroth.com | indelab.org