Data Engineering in the Large Language Models era
The free lunch is over: we 'really' have to deal with unstructured data!
When engineers think about unstructured data, the first thing that comes to mind is those pesky legacy files we need to transform and extract into some 'good old' structured table somewhere. But recent improvements in Machine Learning and the growing popularity of Large Language Models (LLMs) have opened a Pandora's box of interest and requirements for Data Engineers. Users want to access and analyze data from unstructured sources using natural language processing, and we are now expected to maintain those unstructured sources as well.
In this talk we will go into detail about what we need to do to get up to speed with the recent developments. We will talk about processing audio, images and text, using vector embeddings, and the requirements for unstructured data pipelines, as well as how we can meet those requirements by relying on Microsoft Fabric, AI services and open-source technologies like Apache Spark and SynapseML.
About Ismaël:
Ismaël Mejía is a Senior Cloud Advocate at Microsoft working on the Azure Data team. He has more than a decade of experience architecting systems for startups and financial companies. He focuses on distributed data and data engineering, and is a contributor to Apache Beam, Apache Avro and many other open-source projects. He is also a member of the Apache Software Foundation (ASF).
LinkedIn: https://www.linkedin.com/in/iemejia/
Twitter: https://twitter.com/iemejia