Basics of Azure AI Foundry

A self-sufficient guide to Azure AI Foundry — what it is, how the hub/project/deployment model works, how to ship a grounded agent end-to-end with SDK + Bicep, and the security/eval/cost levers you cannot skip.

Series: Deep Dives · 4 parts

Azure AI Foundry — Deep Dive

What you'll leave with: an accurate mental model of Foundry (vs Azure OpenAI, vs AI Studio, vs "the portal"), a clear picture of hubs/projects/connections/deployments, and a working plan to ship a grounded, evaluated, observable agent on Foundry using the SDKs — not clicks.

Useful starting points (this page is designed to stand alone):

  • Official learning path: https://learn.microsoft.com/training/paths/create-custom-copilots-ai-studio/
  • Foundry Agent Intro video: https://youtu.be/ltt7JNd30Ag?si=TUHrwTrdDgAu-2eN


1. What Foundry actually is (naming cleaned up)

There has been churn in the brand; the names you'll see in the wild:

  • Azure AI Studio — the earlier name for the multi-model portal.
  • Azure AI Foundry portal (ai.azure.com) — the current portal.
  • Azure AI Foundry — the Azure service family that sits on top of Azure OpenAI, Azure AI services, and Azure Machine Learning: model catalog, deployments, prompt flow, evaluation, agents, safety, monitoring.
  • Azure OpenAI (AOAI) — the underlying resource that hosts OpenAI-family models (gpt-4o, o1, text-embedding-3-*, DALL·E, Whisper). Foundry uses AOAI deployments.
  • Azure AI Services (multi-service) — the "Cognitive Services" bundle (Speech, Document Intelligence, AI Search, Content Safety, Translator, Vision). Foundry orchestrates these.

The single clearest mental model:

Foundry is the integrated control plane and SDK surface for building LLM applications on Azure. Azure OpenAI + Azure AI Services + Azure ML are the planes underneath it.

If you've been using plain Azure OpenAI, Foundry doesn't replace it — it adds projects, shared connections, grounding stores, evaluation, tracing, and a first-class agent runtime on top.


2. The resource model — hubs, projects, connections, deployments

flowchart TB
    RG[Resource group]
    RG --> H[Hub<br/>Foundry workspace]
    H --> P1[Project A]
    H --> P2[Project B]
    subgraph Shared[Shared at the hub]
      S[Storage account]
      KV[Key Vault]
      ACR[Container registry]
      AI[App Insights]
    end
    H --- Shared
    H --> CX1[Connection<br/>Azure OpenAI]
    H --> CX2[Connection<br/>AI Search]
    H --> CX3[Connection<br/>AI Services]
    P1 --> D1[Model deployment<br/>gpt-4o]
    P1 --> IDX[Indexed data /<br/>vector index]
    P1 --> AG[Agent]

Key terms and what they buy you:

  • Hub — a team-scoped container with shared infra (Storage/KV/ACR/AppInsights), governance, quota, network settings. Create one per team or business unit, not one per app.
  • Project — an isolation boundary for a single application: its own data, agents, deployments, RBAC. One project per product / deployable unit.
  • Connections — references to external resources (AOAI, AI Search, Azure Blob, Microsoft Fabric, Bing Grounding, etc.). Defined once at the hub, consumed by many projects.
  • Deployments — the served instance of a model (gpt-4o @ eastus2 @ 100k TPM). Deployments live at the project level in Foundry.
  • Compute — serverless endpoints for most models; managed compute for models that need dedicated GPUs (e.g., Llama-3-70B).
  • Evaluations / Prompt flow / Agents / Monitoring — features operating within a project.

Tip: hubs are like ML "workspaces"; projects are like "folders with permissions". Don't proliferate hubs.
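
You can sanity-check this model from code: a project client sees every connection its hub shares with it. A minimal sketch using the Projects SDK (the connection string placeholder is yours to fill in):

```python
# pip install azure-ai-projects azure-identity
# Enumerate the hub-level connections visible from one project.
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import ConnectionType

project = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<project connection string>",
)

# Connections are defined once at the hub and consumed here.
for conn in project.connections.list():
    print(conn.name, conn.connection_type)

# The default Azure OpenAI connection, resolved for you.
aoai = project.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI)
print(aoai.endpoint_url)
```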


3. Picking a model — practical rubric

Foundry exposes a model catalog that spans OpenAI, Meta (Llama), Mistral, Cohere, Microsoft Phi, Black Forest Labs (FLUX), NVIDIA NIMs, and more. You do not need to memorise it. Decide along four axes:

| Axis | Lean toward |
| --- | --- |
| Cost / latency per token | Smaller models (gpt-4o-mini, phi-3-medium, llama-3.1-8b) |
| Reasoning ability | o1, o3-mini, gpt-4o, gpt-4.1 (claude-* is not hosted on Azure; partner platforms only) |
| Data sovereignty | Models deployable in your chosen Azure region; confirm regional availability |
| Open-weights / on-prem portability | Llama / Mistral / Phi via a managed compute endpoint |

Default starter recipe:

  • Chat / reasoning: gpt-4o-mini for most traffic; promote to gpt-4o / o3-mini for hard queries.
  • Embeddings: text-embedding-3-small (1536d) unless multilingual → text-embedding-3-large (3072d) or bge-m3 via managed compute.
  • Vision: gpt-4o (text + image in one call).
  • Speech: Azure AI Speech (fast-transcription / real-time streaming) as a tool, not a chat model.

Never pick a model until you know the TPM/RPM quota you need. Quota is regional and per-deployment; request increases early.
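
Whichever model wins, note that your code addresses a deployment name, not a model family. A minimal call sketch via the Projects SDK's inference helper, assuming a deployment named gpt-4o-mini:

```python
# pip install azure-ai-projects azure-ai-inference azure-identity
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

project = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<project connection string>",
)

# Chat client bound to the project's default model connection.
chat = project.inference.get_chat_completions_client()
response = chat.complete(
    model="gpt-4o-mini",  # the deployment name you chose, not the SKU
    messages=[{"role": "user", "content": "Summarise hybrid search in one line."}],
)
print(response.choices[0].message.content)
```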


4. Grounding — where the real product lives

Chat completions alone ≠ a product. Grounding = connecting the model to your data/tools. Foundry offers three layered options:

  1. Data → Index → Use as tool (RAG, the 80% case):
     a. Point Foundry at a Blob container / SharePoint / Fabric OneLake.
     b. Foundry builds an Azure AI Search index (chunking, embeddings, hybrid search with semantic ranker).
     c. Your agent uses the index as a tool ("search the docs").
  2. Connected/enterprise tools — Bing Grounding, Microsoft Graph, Azure Functions, REST API, OpenAPI spec. Each becomes a callable tool for the agent.
  3. Custom code tools — register any Python callable as a tool through the Agents SDK.

Azure AI Search specifics that matter:

  • Use hybrid (BM25 + vector + semantic ranker) by default; it materially beats pure vector.
  • Configure semantic configurations so the ranker knows the title/content/keywords fields.
  • Turn on scoring profiles to boost recent docs or specific tags.
  • Set security filters (securityGroups/any(...)) if you need per-user document visibility — a query sketch follows this list.
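
For the direct (non-File-Search) path, a minimal hybrid-query sketch with the azure-search-documents SDK; the index name, field names, filter, and the embed() helper are placeholders for your own setup:

```python
# pip install azure-search-documents azure-identity
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="docs-index",                  # placeholder
    credential=DefaultAzureCredential(),
)

query = "parental leave policy"
vector = embed(query)  # placeholder: your embedding call (e.g. text-embedding-3-small)

results = search.search(
    search_text=query,                        # BM25 keyword leg
    vector_queries=[VectorizedQuery(
        vector=vector,
        k_nearest_neighbors=50,
        fields="contentVector",               # placeholder vector field
    )],
    query_type="semantic",                    # semantic ranker on top
    semantic_configuration_name="default",
    filter="securityGroups/any(g: g eq 'hr-staff')",  # per-user visibility
    top=5,
)
for doc in results:
    print(doc["@search.rerankerScore"], doc["title"])
```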

5. Agents — what the Foundry agent runtime gives you

The Azure AI Agent Service (formerly Assistants on Azure) is a stateful runtime layered on your model deployments. You get:

  • Threads — persistent, server-managed conversation state (not just in-memory).
  • Tools — file search, code interpreter, function calling, OpenAPI, and connected services.
  • Runs — a single turn; the service orchestrates model calls, tool calls, and retries, and emits streaming events.
  • Tracing — automatic OpenTelemetry spans for every model/tool call, ingested into your linked Application Insights.

A minimal working agent using the Azure AI Projects SDK (Python):

# pip install azure-ai-projects azure-identity
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import FileSearchTool

project = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<copy from Foundry portal > Project > Overview>",
)

# 1) Upload & index a file for grounding
file = project.agents.upload_file_and_poll(file_path="handbook.pdf", purpose="assistants")
store = project.agents.create_vector_store_and_poll(name="kb", file_ids=[file.id])
fs = FileSearchTool(vector_store_ids=[store.id])

# 2) Create the agent
agent = project.agents.create_agent(
    model="gpt-4o-mini",
    name="hr-bot",
    instructions=(
        "You are an HR assistant. Answer ONLY using the provided file search results. "
        "Cite sources. If unsure, say you don't know."
    ),
    tools=fs.definitions,
    tool_resources=fs.resources,
)

# 3) Run a conversation
thread = project.agents.create_thread()
project.agents.create_message(thread_id=thread.id, role="user",
                              content="What is the parental leave policy for salaried employees?")
run = project.agents.create_and_process_run(thread_id=thread.id, assistant_id=agent.id)
msgs = project.agents.list_messages(thread_id=thread.id)
for m in msgs.data[::-1]:
    print(m.role, ":", m.content[0].text.value)
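
# 4) Clean up — threads, agents, and vector stores persist (and bill)
#    until deleted. These delete helpers follow the Assistants-style
#    surface; verify the names against your SDK version.
project.agents.delete_thread(thread.id)
project.agents.delete_agent(agent.id)
project.agents.delete_vector_store(store.id)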

Things that surprise people the first time:

  • Threads persist server-side. Delete them, or you keep paying for storage.
  • File search indexes are vector stores (ADA-style embeddings under the hood); you pay per MB stored + tokens embedded.
  • Tool calls and model calls each cost tokens and time; budget your loop (max_completion_tokens, tool timeouts).
  • The SDK is async-friendly; use it — a sketch follows this list.
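
The aio variants mirror the sync client surface; a minimal async sketch, assuming the same connection string as the example above:

```python
# pip install azure-ai-projects azure-identity aiohttp
import asyncio
from azure.identity.aio import DefaultAzureCredential
from azure.ai.projects.aio import AIProjectClient

async def main() -> None:
    async with DefaultAzureCredential() as cred:
        async with AIProjectClient.from_connection_string(
            credential=cred, conn_str="<project connection string>"
        ) as project:
            thread = await project.agents.create_thread()
            await project.agents.create_message(
                thread_id=thread.id, role="user", content="ping"
            )
            # ...create_and_process_run / list_messages as in the sync example
            await project.agents.delete_thread(thread.id)

asyncio.run(main())
```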

For code-generation, multi-agent orchestration, or local-first dev, also evaluate Semantic Kernel and the Microsoft Agent Framework — they compose agents and can target Foundry as the model provider.


6. Evaluation — not optional

Before you ship, set up two evaluation loops:

  • Offline eval on a golden set (30–200 prompts): faithfulness, groundedness, relevance, safety. Use Foundry's built-in evaluators or azure-ai-evaluation:

```python
# pip install azure-ai-evaluation
from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator, evaluate

# model_config points the LLM judge at one of your AOAI deployments, e.g.:
# {"azure_endpoint": "...", "azure_deployment": "gpt-4o", "api_version": "..."}

result = evaluate(
    data="eval.jsonl",  # each row: {query, response, context}
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "relevance": RelevanceEvaluator(model_config),
    },
)
```

  • Online eval — sample a % of prod traffic, score with the same evaluators, alert on drift.

Set hard accept thresholds in CI and fail the PR if a change regresses. Note that the built-in quality evaluators score on a 1–5 scale, so thresholds look like groundedness ≥ 4.5 and relevance ≥ 4, not 0.9/0.85.
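
A sketch of the CI gate, continuing from the evaluate() snippet above; the aggregate metric key names vary by SDK version, so inspect result["metrics"] once and pin what you actually see:

```python
import sys

# Built-in quality evaluators score 1-5; the keys below are illustrative.
THRESHOLDS = {
    "groundedness.gpt_groundedness": 4.5,
    "relevance.gpt_relevance": 4.0,
}

metrics = result["metrics"]  # 'result' from the evaluate() call above
failed = {
    name: (metrics.get(name), floor)
    for name, floor in THRESHOLDS.items()
    if metrics.get(name) is None or metrics[name] < floor
}
if failed:
    print(f"Eval gate failed: {failed}", file=sys.stderr)
    sys.exit(1)  # fail the PR
print("Eval gate passed:", {k: metrics[k] for k in THRESHOLDS})
```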


7. Safety, governance, responsible AI

  • Azure AI Content Safety — gate both the prompt (jailbreak / prompt-injection detection, profanity, self-harm, violence) and the response. Easier to wire via Foundry's built-in content filters than to bolt on later; a minimal pre-flight sketch follows this list.
  • Prompt Shields — specifically detect prompt injection from retrieved documents or user input.
  • PII detection / redaction — Language service or Purview integration.
  • RBAC — scope at hub vs project; developers usually need Azure AI Developer at project scope, not hub.
  • Private networking — managed vnet, private endpoints for AOAI / AI Search / Storage. Disable public network access on prod.
  • Customer-managed keys (CMK) for the hub storage and Key Vault if your policies require it.
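
If you also want an explicit gate in your own code path (say, before an expensive tool call), a minimal pre-flight sketch with the azure-ai-contentsafety SDK; the endpoint, key, and severity ceiling are illustrative:

```python
# pip install azure-ai-contentsafety
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

safety = ContentSafetyClient(
    endpoint="https://<your-contentsafety>.cognitiveservices.azure.com",
    credential=AzureKeyCredential("<key>"),
)

def is_allowed(text: str, max_severity: int = 2) -> bool:
    """Reject if any harm category exceeds the ceiling (severity scale 0-7)."""
    analysis = safety.analyze_text(AnalyzeTextOptions(text=text))
    return all(c.severity <= max_severity for c in analysis.categories_analysis)

if not is_allowed("<user prompt here>"):
    raise ValueError("blocked by the content safety gate")
```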

8. Cost model — what you actually pay for

Token cost is only one line item. You also pay for:

  • Hub shared infra — Storage, Key Vault, Container Registry, App Insights (cheap-to-moderate).
  • Provisioned Throughput Units (PTUs) vs pay-as-you-go for AOAI. PTUs give predictable latency + rate capacity; break-even is usually ≥ ~25% of a deployment's capacity being utilised.
  • AI Search — tier + replicas + partitions; semantic-ranker usage is metered separately.
  • Agent runtime storage — threads, files, vector stores.
  • Egress — cross-region calls add up.

Levers:

  • Route easy turns to *-mini; escalate only when needed (a toy sketch follows this list).
  • Cache embeddings; avoid re-embedding unchanged documents.
  • Delete stale threads and vector stores (they silently accumulate).
  • Batch eval offline; don't run full evaluators in prod.
  • Use Matryoshka dimensions on text-embedding-3-large (e.g., 256d/512d) for smaller index footprint.
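
A toy version of the routing lever, calling two deployments through the openai package's AzureOpenAI client; the marker-based escalation gate is deliberately naive and purely illustrative (use a classifier, logprobs, or a self-check prompt in practice):

```python
# pip install openai
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-aoai>.openai.azure.com",
    api_key="<key>",                  # or an AAD token provider
    api_version="2024-10-21",
)

ESCALATION_MARKERS = ("i'm not sure", "i don't know", "cannot determine")

def answer(question: str) -> str:
    # Cheap first pass on the mini deployment.
    draft = client.chat.completions.create(
        model="gpt-4o-mini",          # deployment name
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # Escalate to the big model only when the cheap gate trips.
    if any(marker in draft.lower() for marker in ESCALATION_MARKERS):
        return client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content
    return draft
```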

9. Deployment as code (you should not click this into existence)

Minimum you want in a repo:

  • Bicep / Terraform for: hub, project, connections (to AOAI, AI Search, Storage), model deployments, RBAC assignments.
  • Agent definitions as code (model, instructions, tools, tool resources).
  • CI/CD that: lints, runs the offline evals, bumps model deployment version via swap (staging → prod).
  • azd up target if you want one command to reproduce the whole project.

A fragment you'll recognise in the wild:

resource hub 'Microsoft.MachineLearningServices/workspaces@2024-10-01' = {
  name: 'foundry-hub-${env}'
  location: location
  kind: 'Hub'
  identity: { type: 'SystemAssigned' }
  properties: {
    storageAccount: storage.id
    keyVault: kv.id
    applicationInsights: ai.id
    containerRegistry: acr.id
    publicNetworkAccess: 'Disabled'
  }
}

resource project 'Microsoft.MachineLearningServices/workspaces@2024-10-01' = {
  name: 'foundry-proj-${env}'
  location: location
  kind: 'Project'
  identity: { type: 'SystemAssigned' }
  properties: { hubResourceId: hub.id }
}

10. Corrections to common misconceptions

  • ❌ "Foundry replaces Azure OpenAI." → It sits on top of AOAI; you still own an AOAI resource and deployments.
  • ❌ "Hubs and projects are the same thing." → Hubs are shared infra; projects are where apps live. One hub, many projects.
  • ❌ "File Search = RAG." → File Search is a managed, opinionated RAG with Azure AI Search under the hood; great for many cases, but you lose control over chunking/ranking. For control, wire AI Search directly.
  • ❌ "Agent threads are in-memory." → They are persisted; you're responsible for lifecycle.
  • ❌ "I'll fine-tune instead of RAG." → Fine-tuning doesn't inject fresh facts, and it's rarely the right first move. Ground + prompt + eval first.

11. TODO — self-sufficient action list

Everything below is doable using only this page + an Azure subscription with AOAI enabled.

Provisioning (as code)

  • [ ] Write Bicep for: resource group, hub, project, KV, Storage, ACR, App Insights, AOAI, AI Search, connections. Deploy with az deployment sub create.
  • [ ] Assign yourself Azure AI Developer at project scope; use system-assigned MI for the project to reach AOAI and AI Search.
  • [ ] Disable public network access on the hub and create private endpoints. Prove connectivity from a jump VM or from a dev container with VNet integration.

First agent

  • [ ] Deploy gpt-4o-mini and text-embedding-3-small on the project's AOAI connection. Record the TPM quota.
  • [ ] Using the Projects SDK, upload one PDF, build a vector store, create an agent, and have a 3-turn grounded conversation with citations. Commit the script.
  • [ ] Replace the built-in File Search with a direct Azure AI Search integration (hybrid + semantic ranker). Compare answer quality on 10 questions.

Evaluate

  • [ ] Assemble a golden set of 30 questions with ideal answers. Run groundedness + relevance + safety evaluators offline; commit a JSON report.
  • [ ] Turn on online monitoring in the project; sample 5% of requests; alert when groundedness drops below your threshold (e.g., < 4 on the evaluators' 1–5 scale).

Safety + governance

  • [ ] Enable Content Safety with high severity blocking on self-harm/violent/sexual categories, and Prompt Shields against injection.
  • [ ] Add a test prompt that tries prompt injection via an indexed doc; confirm the agent refuses.
  • [ ] Implement per-user document filtering via AI Search security filters.

Operability

  • [ ] Confirm traces appear in App Insights (model call, tool call, thread). Build one workbook: latency p50/p95, tokens per turn, tool success rate.
  • [ ] Parameterise the model name in deployment; perform a blue/green swap from gpt-4o-mini to gpt-4o and compare the eval report.

Cost controls

  • [ ] Put quotas and alerts per deployment (TPM, RPM, daily spend).
  • [ ] Implement a routing layer: mini by default, escalate to 4o only when a cheap confidence gate triggers.
  • [ ] Write a nightly cleanup job that deletes threads older than 30 days.

Stretch

  • [ ] Add a second agent (e.g., "planner") and orchestrate them with Semantic Kernel or the Microsoft Agent Framework; measure end-to-end latency.
  • [ ] Replace pay-as-you-go with a small PTU reservation for the hot path; document the break-even.
  • [ ] Package the whole app as azd up so a teammate can reproduce it in < 30 minutes.

When all checks above are green, flip this post to status: published with the final eval numbers embedded.