Securing AI: Data Access Control for RAG

Description from stream:
If you're trying to get an LLM to accurately answer questions about your own documents, you need RAG: Retrieval Augmented Generation. With a RAG approach, the app first searches a knowledge base for relevant matches to a user's query, then sends the results to the LLM along with the original question. What if you have documents that should only be accessed by a subset of your users, like a group or a single user? Then you need data access controls to ensure that document visibility is respected during the RAG flow. In this session, we'll show an approach using Azure AI Search with data access controls to search only the documents that the logged-in user can see. We'll also demonstrate a feature for user-uploaded documents that uses data access controls along with Azure Data Lake Storage Gen2.

YouTube:
https://www.youtube.com/watch?v=4PKug91OSw8

Pamela Fox

July 17, 2024
Transcript

  1. Securing AI Apps on Azure: Data Access Control for AI RAG Apps on Azure

    Matthew Gotteiner, Azure AI Search
    Pamela Fox, Python Cloud Advocacy, github.com/pamelafox
    aka.ms/securing-acl-slides
  2. RAG: Retrieval Augmented Generation

    User Question: Do my company perks cover underwater activities?

    Search: PerksPlus.pdf#page=2: Some of the lessons covered under PerksPlus include: · Skiing and snowboarding lessons · Scuba diving lessons · Surfing lessons · Horseback riding lessons These lessons provide employees with the opportunity to try new things, challenge themselves, and improve their physical skills.…

    Large Language Model: Yes, your company perks cover underwater activities such as scuba diving lessons [1]
  3. Data prep: Local data ingestion

    See prepdocs.py for code that ingests documents with these steps:

    • Upload documents (Azure Blob Storage): An online version of each document is necessary for clickable citations.
    • Extract data from documents (Azure Document Intelligence): Supports PDF, HTML, docx, pptx, xlsx, images, plus can OCR when needed. Local parsers also available for PDF, HTML, JSON, txt.
    • Split data into chunks (Python): Split text based on sentence boundaries and token lengths. Langchain splitters could also be used here.
    • Vectorize chunks (Azure OpenAI): Compute embeddings using OpenAI embedding model of your choosing.
    • Indexing (Azure AI Search): Document index, chunk index, or both.
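
The "split data into chunks" step can be sketched in a few lines. This is a minimal illustration, not the splitter used in prepdocs.py: the function name `split_into_chunks` is hypothetical, and tokens are approximated by whitespace-separated words, which a real tokenizer would count differently.

```python
import re


def split_into_chunks(text: str, max_tokens: int = 50) -> list[str]:
    """Group whole sentences into chunks of at most max_tokens "tokens".

    Assumption: a token is approximated by a whitespace-separated word.
    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine."
sample_chunks = split_into_chunks(sample, max_tokens=6)
```

Keeping sentences whole means a chunk never ends mid-thought, at the cost of uneven chunk sizes.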
  4. RAG solution: Code (Simplified)

    aka.ms/ragchat

        from azure.search.documents.models import QueryType, VectorizedQuery

        # Simplified: the chat completion calls omit the required model argument.

        # STEP 1: Query rewriting with AI Search
        query_text = openai_client.chat.completions.create(
            messages=query_messages,
            temperature=0.0,
            max_tokens=100,
        ).choices[0].message.content

        # STEP 2: Retrieval (hybrid: keyword + vector, with semantic ranking)
        search_vector = openai_client.embeddings.create(
            model=self.embedding_model,
            input=query_text,
        ).data[0].embedding
        results = search_client.search(
            search_text=query_text,
            top=3,
            vector_queries=[VectorizedQuery(vector=search_vector, fields="embedding")],
            query_type=QueryType.SEMANTIC,
            semantic_configuration_name="default",
            semantic_query=query_text,
        )

        # STEP 3: Question answering
        answer = openai_client.chat.completions.create(
            messages=messages,
            temperature=0.3,
            max_tokens=1024,
        ).choices[0].message
  5. Understanding token claims

    Access tokens use the JSON Web Tokens (JWT) format. Claims, or key-value pairs, establish facts about the subject the token was issued for. Try decoding a token yourself at https://jwt.ms.

        {
          "aud": "https://management.core.windows.net/", // Token Audience (Resource Server)
          "iss": "https://sts.windows.net/f6a799a2-eb93-4e7f-9515-19e4a2e7af04/", // Token Issuer
          "iat": 1714775919, // Issued at time
          "nbf": 1714775919, // Do not process token before this time
          "exp": 1714780517, // Expiry time
          "name": "Matt G", // Display name of the user
          "oid": "8d5a813e-af85-47f1-b076-0b88e9cf8443", // Object identifier of the user
          "groups": ["b415f9c9-4f20-45b4-87a1-0ac9a142f0c5"], // Identifiers of user groups
          "scp": "user_impersonation" // OAuth 2.0 scopes that have been consented to
        }
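
To make the claim structure concrete, here is a minimal sketch that decodes the payload segment of a JWT with only the standard library. It deliberately does not verify the signature, so it is suitable for inspection only; a real app should validate the signature, issuer, audience, and expiry with a proper library before trusting any claim. The helper names and the unsigned sample token are made up for illustration.

```python
import base64
import json


def decode_claims(token: str) -> dict:
    """Return the claims of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore the padding first.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def _b64url(data: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment (demo helper)."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical, unsigned token built from the claims shown on the slide.
sample_claims = {
    "oid": "8d5a813e-af85-47f1-b076-0b88e9cf8443",
    "groups": ["b415f9c9-4f20-45b4-87a1-0ac9a142f0c5"],
}
sample_token = ".".join([_b64url({"alg": "none"}), _b64url(sample_claims), "sig"])

claims = decode_claims(sample_token)
```

The `oid` and `groups` values recovered here are the identifiers that the later slides map onto index fields.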
  6. Representing access control in AI Search indexes

    Search supports string collection fields. Directly map object and group identifiers from token claims to documents in the index.

        {
          "name": "index-with-access-control",
          "fields": [
            { "name": "key", "type": "Edm.String", "key": true },
            { "name": "oids", "type": "Collection(Edm.String)", "filterable": true },
            { "name": "groups", "type": "Collection(Edm.String)", "filterable": true }
          ]
        }
  7. Access control using filtering

    AI Search supports filtering documents in addition to normal searches. Efficiently search thousands of unique identifiers using "search.in".

        {
          "search": "document text",
          "filter": "oids/any(oid: search.in(oid, '3fd9a875-2e3d-4b97-8301-eb7b7e6a109e, a11be098-87b6-4c68-af19-79e44d927c4d, ...')) or groups/any(group: search.in(group, 'e432e4cd-8e1c-4a5e-9c0a-6e1fa3a6bb8d, 6f091fd9-5871-4d1b-8fd5-3dbef48b52a9, ...'))"
        }
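
A filter of this shape can be built from the signed-in user's token claims. The sketch below is one possible approach, not the sample app's code: `build_security_filter` is a hypothetical name, and it assumes the index uses the `oids` and `groups` fields from the earlier slide. Identifiers are validated as GUIDs before being interpolated so that a malformed claim cannot alter the filter expression.

```python
import uuid


def _as_guid(value: str) -> str:
    """Normalize a GUID string; raises ValueError if it is not a GUID."""
    return str(uuid.UUID(value))


def build_security_filter(claims: dict) -> str:
    """Build an OData filter matching documents visible to this user."""
    clauses = []
    oid = claims.get("oid")
    if oid:
        clauses.append(f"oids/any(oid: search.in(oid, '{_as_guid(oid)}'))")
    groups = [_as_guid(g) for g in claims.get("groups", [])]
    if groups:
        group_list = ", ".join(groups)
        clauses.append(f"groups/any(group: search.in(group, '{group_list}'))")
    if not clauses:
        # Fail closed: searching with no filter would expose every document.
        raise ValueError("no user or group identifiers in claims")
    return " or ".join(clauses)


security_filter = build_security_filter({
    "oid": "8d5a813e-af85-47f1-b076-0b88e9cf8443",
    "groups": ["b415f9c9-4f20-45b4-87a1-0ac9a142f0c5"],
})
```

The resulting string would be passed as the filter of the search request, alongside the search text.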
  8. Updating access control associated with a document

    AI Search supports incremental updates to individual records. Include document key, "merge" action, and fields to update.

        {
          "value": [
            {
              "@search.action": "merge",
              "key": "my-document-key",
              "oids": [
                "c0f84485-7814-49b2-9128-9b3a5369c423",
                "7dc3d6e8-8d6b-4ae4-b288-8d50d605df55"
              ],
              "groups": [
                "f2b17199-8ec8-41b0-b0d7-1a6ad597f96e",
                "e5e0b705-993b-4880-81c8-3b0a3f7345f7"
              ]
            }
          ]
        }
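
A payload like the one above can be generated for many documents at once. This is a minimal sketch under stated assumptions: the function name `build_acl_merge_batch` is hypothetical, and the key field is assumed to be named `key` as in the index definition on the earlier slide. Only the fields supplied for a document are included, because a merge replaces the whole collection for any field it names rather than appending to it.

```python
import json


def build_acl_merge_batch(updates: dict) -> dict:
    """Build a "merge" batch that updates ACL fields for existing documents.

    updates maps a document key to a dict with optional "oids" and/or
    "groups" lists. Fields that are not supplied are left untouched.
    """
    actions = []
    for key, acl in updates.items():
        action = {"@search.action": "merge", "key": key}
        for field in ("oids", "groups"):
            if field in acl:
                action[field] = list(acl[field])
        actions.append(action)
    return {"value": actions}


batch = build_acl_merge_batch({
    "my-document-key": {"oids": ["c0f84485-7814-49b2-9128-9b3a5369c423"]},
})
batch_json = json.dumps(batch, indent=2)
```

The resulting JSON is the request body for the index's document update operation.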
  9. Combining AI Search and Data Lake Gen2 Storage

    Data Lake Gen2 Storage allows associating access control information with files and folders.
  10. Fetching access control information

    • Fetch access control list from files or directories
    • Need to parse string to find exact group and object ids

        from azure.storage.filedatalake import DataLakeServiceClient
        from azure.identity import DefaultAzureCredential

        service_client = DataLakeServiceClient(
            account_url="https://account.dfs.core.windows.net",
            credential=DefaultAzureCredential(),
        )
        file_system_client = service_client.get_file_system_client("container")
        file_client = file_system_client.get_file_client("My Documents/notes.txt")

        # Request ACLs as GUIDs by setting user principal name to false
        access_control = file_client.get_access_control(upn=False)
        acl = access_control["acl"]

        # ACL Format: user::rwx,group::r-x,other::r--,user:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:r--
        acl_list = acl.split(",")
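
The string parsing mentioned above can be sketched as follows. This is an illustrative parser, not the sample app's implementation: `parse_acl` is a hypothetical name, it keeps only named user and group entries that grant read permission, and it ignores the mask entry and any `default:` entries, which a stricter implementation may need to take into account.

```python
def parse_acl(acl: str) -> dict:
    """Extract object ids and group ids with read access from an ACL string.

    Expected entry format: "<kind>:<identifier>:<permissions>", for example
    "user:8d5a813e-af85-47f1-b076-0b88e9cf8443:r--".
    """
    oids: list[str] = []
    groups: list[str] = []
    for entry in acl.split(","):
        parts = entry.strip().split(":")
        if len(parts) != 3:
            continue  # skips "default:..." entries (4 parts) and malformed ones
        kind, identifier, permissions = parts
        # Entries with an empty identifier refer to the owning user or group.
        if not identifier or "r" not in permissions:
            continue
        if kind == "user":
            oids.append(identifier)
        elif kind == "group":
            groups.append(identifier)
    return {"oids": oids, "groups": groups}


sample_acl = (
    "user::rwx,group::r-x,other::r--,"
    "user:8d5a813e-af85-47f1-b076-0b88e9cf8443:r--,"
    "group:b415f9c9-4f20-45b4-87a1-0ac9a142f0c5:r-x,"
    "user:11111111-1111-1111-1111-111111111111:-wx"
)
parsed = parse_acl(sample_acl)
```

The owning user and group have empty identifiers in the ACL string, so their ids have to come from elsewhere in the access control response rather than from this parser.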
  11. Lifecycle of data with access control information

    Option 1: Ingest the documents from a data source with access control information.
  12. Lifecycle of data with access control information

    Option 2: Ingest the documents from a data source without access control and join them with access control information.
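
The join in Option 2 can be sketched as a lookup by a shared identifier. Everything here is an assumption made for illustration: the record shapes, the `path` join key, and the function name `join_with_acls` are hypothetical. Documents with no matching access control record get empty lists, so a filter on `oids` and `groups` hides them from everyone rather than exposing them.

```python
def join_with_acls(documents: list, acl_records: list) -> list:
    """Attach "oids" and "groups" to each document by matching on "path"."""
    acl_by_path = {record["path"]: record for record in acl_records}
    joined = []
    for doc in documents:
        acl = acl_by_path.get(doc["path"], {})
        joined.append({
            **doc,
            "oids": list(acl.get("oids", [])),
            "groups": list(acl.get("groups", [])),
        })
    return joined


documents = [
    {"path": "hr/perks.pdf", "content": "PerksPlus covers lessons."},
    {"path": "hr/draft.pdf", "content": "Unreleased policy."},
]
acl_records = [
    {"path": "hr/perks.pdf", "oids": [], "groups": ["b415f9c9-4f20-45b4-87a1-0ac9a142f0c5"]},
]
indexed_docs = join_with_acls(documents, acl_records)
```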
  13. Why not just use an index per user?

    • AI Search limitations mean you have a finite number of indexes per search service
    • S3 HD index max size is 100GB

    Tier      Maximum indexes
    Free      3
    Basic     15
    S1        50
    S2        200
    S3        200
    S3 HD     1000 per partition, max 3000 per service
    L1        10
    L2        10
  14. User upload steps

    aka.ms/ragchat/upload

    • Store: newparentchecklist.docx is saved to Azure Data Lake Storage Gen2 (ADLS2) in the directory 8c131152-9117-45b6-8221-b263f160d553, which contains newparentchecklist.docx and azd-guidelines.md
    • Extract: Azure Document Intelligence
    • Split
    • Embed: Azure OpenAI
    • Index: Azure AI Search

    field        value
    oids         8c131152-9117-45b6-8221-b263f160d553
    sourcefile   newparentchecklist.docx
    storageurl   https://userst24ccg67vinxu2.dfs.core.windows.net/user-content/8c131152-9117-45b6-8221-b263f160d553%2Fnewparentchecklist.docx
    content      5 days per week for 4 weeks or half-days every day for 4 weeks), contact AskHR or call (425) 706-8853 for assistance prior to your return.
    embedding    [0.0060971826, 0.0015850986 ...]
  15. User upload code

    aka.ms/ragchat/upload

        import io

        from azure.core.exceptions import ResourceNotFoundError

        user_oid = auth_claims["oid"]
        file = request_files.getlist("file")[0]

        # Each user gets a directory named after their object id.
        user_directory_client = user_blob_container_client.get_directory_client(user_oid)
        try:
            user_directory_client.get_directory_properties()
        except ResourceNotFoundError:
            # First upload for this user: create the directory and make them its owner.
            user_directory_client.create_directory()
        user_directory_client.set_access_control(owner=user_oid)

        file_client = user_directory_client.get_file_client(file.filename)
        file_io = io.BufferedReader(file)
        file_client.upload_data(file, overwrite=True, metadata={"UploadedBy": user_oid})

        # Rewind so the ingester reads the file from the start.
        file_io.seek(0)
        ingester.add_file(File(content=file_io, acls={"oids": [user_oid]}, url=file_client.url))
  16. Try our samples and learn more!

    • Azure OpenAI + AI Search + Entra + MSAL + App Service Built-in Auth: aka.ms/ragchat
    • Find more samples at: aka.ms/azai (Java, JavaScript, Python, .NET, OpenAI Assistants, Fine-tuning ...and more!)
    • Blog: Access Control in Generative AI applications with AI Search: aka.ms/rag-access-control
    • Microsoft Entra developer center: aka.ms/dev/ms-entra
  17. Securing AI Apps on Azure

    Date                  Topic                                                Speakers
    July 2, 5-6PM UTC     Using Keyless Auth with Azure AI Services            Marlene Mhangami, Pamela Fox
    July 8, 5-6PM UTC     Add User Login to AI Apps using Built-in Auth        James Casey, Pamela Fox
    July 9, 7-8PM UTC     Add User Login to AI Apps using MSAL SDK             Ray Luo, Pamela Fox
    July 10, 7-8PM UTC    Handling User Auth for a SPA App on Azure            Matt Gotteiner
    July 17, 7-8PM UTC    Data Access Control for AI RAG Apps on Azure         Matt Gotteiner, Pamela Fox
    July 25, 11PM-12PM    Deploying an AI App to a Private Network on Azure    Matt Gotteiner, Anthony Shaw

    https://aka.ms/S-1355