Azure AI Foundry — Deep Dive
What you'll leave with: an accurate mental model of Foundry (vs Azure OpenAI, vs AI Studio, vs "the portal"), a clear picture of hubs/projects/connections/deployments, and a working plan to ship a grounded, evaluated, observable agent on Foundry using the SDKs — not clicks.
Useful starting points (this page is designed to stand alone):

- Official learning path: https://learn.microsoft.com/training/paths/create-custom-copilots-ai-studio/
- Foundry Agent Intro video: https://youtu.be/ltt7JNd30Ag?si=TUHrwTrdDgAu-2eN
1. What Foundry actually is (naming cleaned up)
There has been churn in the brand; the names you'll see in the wild:
- Azure AI Studio — the earlier portal for multi-model AI.
- Azure AI Foundry portal (`ai.azure.com`) — the current portal.
- Azure AI Foundry — the Azure service family that sits on top of Azure OpenAI, Azure AI services, and Azure Machine Learning: model catalog, deployments, prompt flow, evaluation, agents, safety, monitoring.
- Azure OpenAI (AOAI) — the underlying resource that hosts OpenAI-family models (`gpt-4o`, `o1`, `text-embedding-3-*`, DALL·E, Whisper). Foundry uses AOAI deployments.
- Azure AI Services (multi-service) — the "Cognitive Services" bundle (Speech, Document Intelligence, AI Search, Content Safety, Translator, Vision). Foundry orchestrates these.
The single clearest mental model:
Foundry is the integrated control plane and SDK surface for building LLM applications on Azure. Azure OpenAI + Azure AI Services + Azure ML are the planes underneath it.
If you've been using plain Azure OpenAI, Foundry doesn't replace it — it adds projects, shared connections, grounding stores, evaluation, tracing, and a first-class agent runtime on top.
2. The resource model — hubs, projects, connections, deployments
```mermaid
flowchart TB
    RG[Resource group]
    RG --> H[Hub<br/>Foundry workspace]
    H --> P1[Project A]
    H --> P2[Project B]
    subgraph Shared[Shared at the hub]
        S[Storage account]
        KV[Key Vault]
        ACR[Container registry]
        AI[App Insights]
    end
    H --- Shared
    H --> CX1[Connection<br/>Azure OpenAI]
    H --> CX2[Connection<br/>AI Search]
    H --> CX3[Connection<br/>AI Services]
    P1 --> D1[Model deployment<br/>gpt-4o]
    P1 --> IDX[Indexed data /<br/>vector index]
    P1 --> AG[Agent]
```
Key terms and what they buy you:
- Hub — a team-scoped container with shared infra (Storage/KV/ACR/AppInsights), governance, quota, network settings. Create one per team or business unit, not one per app.
- Project — an isolation boundary for a single application: its own data, agents, deployments, RBAC. One project per product / deployable unit.
- Connections — references to external resources (AOAI, AI Search, Azure Blob, Microsoft Fabric, Bing Grounding, etc.). Defined once at the hub, consumed by many projects.
- Deployments — the served instance of a model (`gpt-4o` @ `eastus2` @ 100k TPM). Deployments live at the project level in Foundry.
- Compute — serverless endpoints for most models; managed compute for models that need dedicated GPUs (e.g., `Llama-3-70B`).
- Evaluations / Prompt flow / Agents / Monitoring — features operating within a project.
Tip: hubs are like ML "workspaces"; projects are like "folders with permissions". Don't proliferate hubs.
3. Picking a model — practical rubric
Foundry exposes a model catalog that spans OpenAI, Meta (Llama), Mistral, Cohere, Microsoft Phi, Black Forest (Flux), NVIDIA NIMs, and more. You do not need to memorise it. Decide along four axes:
| Axis | Lean toward |
|---|---|
| Cost/latency per token | Smaller (gpt-4o-mini, phi-3-medium, llama-3.1-8b) |
| Reasoning ability | o1, o3-mini, gpt-4o, gpt-4.1 |
| Data sovereignty | Models deployable in your chosen Azure region; confirm regional availability |
| Open-weights / on-prem portability | Llama / Mistral / Phi via "managed compute" endpoint |
Default starter recipe:
- Chat / reasoning: `gpt-4o-mini` for most traffic; promote to `gpt-4o` / `o3-mini` for hard queries.
- Embeddings: `text-embedding-3-small` (1536d) unless multilingual → `text-embedding-3-large` (3072d) or `bge-m3` via managed compute.
- Vision: `gpt-4o` (text + image in one call).
- Speech: Azure AI Speech (fast transcription / real-time streaming) as a tool, not a chat model.
Never pick a model until you know the TPM/RPM quota you need. Quota is regional and per-deployment; request increases early.
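That promote-on-demand advice is easy to wire as a tiny router in front of your deployments. A minimal sketch — the deployment names and the `naive_gate` heuristic are placeholders; in production you'd use a small classifier or a cheap self-assessment call as the gate:

```python
from typing import Callable

# Hypothetical deployment names -- substitute your own.
CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4o"

def pick_model(question: str,
               difficulty: Callable[[str], float],
               threshold: float = 0.6) -> str:
    """Route a turn: cheap model by default, escalate when the
    difficulty gate (0..1, higher = harder) fires."""
    return STRONG_MODEL if difficulty(question) >= threshold else CHEAP_MODEL

def naive_gate(question: str) -> float:
    """Trivially cheap gate for illustration: long, multi-part
    questions are treated as 'hard'."""
    signals = [
        len(question) > 300,
        "step by step" in question.lower(),
        question.count("?") > 1,
    ]
    return sum(signals) / len(signals)
```

The chosen name then selects which deployment your AOAI client calls for that turn; the gate runs before any expensive model call, so routing itself costs nothing.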
4. Grounding — where the real product lives
Chat completions alone ≠ a product. Grounding = connecting the model to your data/tools. Foundry offers three layered options:
- Data → Index → Use as tool (RAG, the 80% case)
- Point Foundry at a Blob container / SharePoint / Fabric OneLake.
- Foundry builds an Azure AI Search index (chunking, embeddings, hybrid search with semantic ranker).
- Your agent uses the index as a tool ("search the docs").
- Connected/enterprise tools — Bing Grounding, Microsoft Graph, Azure Functions, REST API, OpenAPI spec. Each becomes a callable tool for the agent.
- Custom code tools — register any Python callable as a tool through the Agents SDK.
Azure AI Search specifics that matter:
- Use hybrid (BM25 + vector + semantic ranker) by default; it materially beats pure vector.
- Configure semantic configurations so the ranker knows the title/content/keywords fields.
- Turn on scoring profiles to boost recent docs or specific tags.
- Set security filters (`securityGroups/any(...)`) if you need per-user document visibility.
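To make the hybrid-plus-security-trim advice concrete, here's a sketch of the request body for AI Search's `POST /indexes/{index}/docs/search` REST call. The field names (`contentVector`), the `default` semantic configuration, and the exact body shape are assumptions against recent API versions — check yours:

```python
def hybrid_search_body(query: str, query_vector: list[float],
                       user_groups: list[str]) -> dict:
    """Build a search request combining BM25, vector, and semantic
    ranking, plus a per-user OData security trim. Field and config
    names are illustrative."""
    group_list = ",".join(user_groups)
    return {
        "search": query,                     # BM25 keyword leg
        "vectorQueries": [{                  # vector leg
            "kind": "vector",
            "vector": query_vector,
            "fields": "contentVector",       # hypothetical vector field
            "k": 10,
        }],
        "queryType": "semantic",             # reranker on top
        "semanticConfiguration": "default",  # must exist on the index
        # Security trim: only docs tagged with one of the user's groups.
        "filter": f"securityGroups/any(g: search.in(g, '{group_list}'))",
        "top": 5,
    }
```

The same structure maps onto the `azure-search-documents` SDK if you prefer typed clients; the point is that all three legs plus the filter travel in one request.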
5. Agents — what the Foundry agent runtime gives you
The Azure AI Agent Service (formerly Assistants on Azure) is a stateful runtime layered on your model deployments. You get:
- Threads — persistent, server-managed conversation state (not just in-memory).
- Tools — file search, code interpreter, function calling, OpenAPI, and connected services.
- Runs — a single turn; the service orchestrates model calls, tool calls, and retries, and emits streaming events.
- Tracing — automatic OpenTelemetry spans for every model/tool call, ingested into your linked Application Insights.
A minimal working agent using the Azure AI Projects SDK (Python):
```python
# pip install azure-ai-projects azure-identity
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import FileSearchTool

project = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<copy from Foundry portal > Project > Overview>",
)

# 1) Upload & index a file for grounding
file = project.agents.upload_file_and_poll(file_path="handbook.pdf", purpose="assistants")
store = project.agents.create_vector_store_and_poll(name="kb", file_ids=[file.id])
fs = FileSearchTool(vector_store_ids=[store.id])

# 2) Create the agent
agent = project.agents.create_agent(
    model="gpt-4o-mini",
    name="hr-bot",
    instructions=(
        "You are an HR assistant. Answer ONLY using the provided file search results. "
        "Cite sources. If unsure, say you don't know."
    ),
    tools=fs.definitions,
    tool_resources=fs.resources,
)

# 3) Run a conversation
thread = project.agents.create_thread()
project.agents.create_message(
    thread_id=thread.id,
    role="user",
    content="What is the parental leave policy for salaried employees?",
)
run = project.agents.create_and_process_run(thread_id=thread.id, assistant_id=agent.id)

msgs = project.agents.list_messages(thread_id=thread.id)
for m in msgs.data[::-1]:
    print(m.role, ":", m.content[0].text.value)
```
Things that surprise people the first time:
- Threads persist server-side. Delete them, or you keep paying for storage.
- File search indexes are vector stores (ADA-style embeddings under the hood); you pay per MB stored + tokens embedded.
- Tool calls and model calls each cost tokens and time; budget your loop (`max_completion_tokens`, tool timeouts).
- The SDK is async-friendly; use it.
For code-generation, multi-agent orchestration, or local-first dev, also evaluate Semantic Kernel and the Microsoft Agent Framework — they compose agents and can target Foundry as the model provider.
6. Evaluation — not optional
Before you ship, set up two evaluation loops:
- Offline eval on a golden set (30–200 prompts): faithfulness, groundedness, relevance, safety. Use Foundry's built-in evaluators or `azure-ai-evaluation`:

  ```python
  from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator, evaluate
  from azure.identity import DefaultAzureCredential

  # model_config: the judge-model deployment used to score answers
  result = evaluate(
      data="eval.jsonl",  # each row: {question, answer, context}
      evaluators={
          "groundedness": GroundednessEvaluator(model_config),
          "relevance": RelevanceEvaluator(model_config),
      },
  )
  ```
- Online eval — sample a % of prod traffic, score with the same evaluators, alert on drift.
Set hard accept thresholds (e.g., groundedness ≥ 0.9, relevance ≥ 0.85) in CI; fail the PR if a change regresses.
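The CI gate itself is a few lines. A sketch — the metric names and scores are illustrative; wire `gate()` to your real eval report and fail the build when it returns anything:

```python
# Hypothetical aggregate scores from an offline eval run; metric names
# assumed to mirror the evaluators' output keys.
THRESHOLDS = {"groundedness": 0.90, "relevance": 0.85}

def gate(metrics: dict) -> list:
    """Return (metric, score, floor) for every metric below its floor;
    an empty list means the change may merge."""
    return [(name, metrics.get(name, 0.0), floor)
            for name, floor in THRESHOLDS.items()
            if metrics.get(name, 0.0) < floor]

# In CI: sys.exit(1) when gate(...) is non-empty, so the PR fails.
failures = gate({"groundedness": 0.93, "relevance": 0.81})
# failures == [("relevance", 0.81, 0.85)]
```

Missing metrics score 0.0 and therefore fail closed, which is the behaviour you want when an evaluator silently breaks.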
7. Safety, governance, responsible AI
- Azure AI Content Safety — gate both prompt (jailbreak / prompt-injection detection, profanity, self-harm, violence) and response. Easier to wire via Foundry's built-in content filters than to bolt on later.
- Prompt Shields — specifically detect prompt injection from retrieved documents or user input.
- PII detection / redaction — Language service or Purview integration.
- RBAC — scope at hub vs project; developers usually need `Azure AI Developer` at project scope, not hub.
- Private networking — managed VNet, private endpoints for AOAI / AI Search / Storage. Disable public network access on prod.
- Customer-managed keys (CMK) for the hub storage and Key Vault if your policies require it.
8. Cost model — what you actually pay for
Token cost is only one line item. You also pay for:
- Hub shared infra — Storage, Key Vault, Container Registry, App Insights (cheap-to-moderate).
- Provisioned Throughput Units (PTUs) vs pay-as-you-go for AOAI. PTUs give predictable latency + rate capacity; break-even is usually ≥ ~25% of a deployment's capacity being utilised.
- AI Search — tier + replicas + partitions; semantic-ranker usage is metered separately.
- Agent runtime storage — threads, files, vector stores.
- Egress — cross-region calls add up.
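The PTU break-even quoted above is just arithmetic. A sketch with made-up numbers — pull real prices and capacities from your Azure price sheet:

```python
def breakeven_utilisation(ptu_monthly_cost: float,
                          paygo_cost_per_1m_tokens: float,
                          ptu_capacity_tokens_per_month: float) -> float:
    """Fraction of PTU capacity you must actually use before the
    reservation beats pay-as-you-go. All inputs are placeholders."""
    paygo_cost_at_full_capacity = (
        ptu_capacity_tokens_per_month / 1_000_000 * paygo_cost_per_1m_tokens
    )
    return ptu_monthly_cost / paygo_cost_at_full_capacity

# Example: a $2,000/month PTU block serving up to 800M tokens/month,
# vs $10 per 1M tokens pay-as-you-go.
u = breakeven_utilisation(2000, 10, 800_000_000)
```

With these illustrative numbers `u` comes out at 0.25 — i.e., below 25% utilisation, pay-as-you-go is cheaper, which matches the rule of thumb above.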
Levers:
- Route easy turns to `*-mini`; escalate only when needed.
- Cache embeddings; avoid re-embedding unchanged documents.
- Delete stale threads and vector stores (they silently accumulate).
- Batch eval offline; don't run full evaluators in prod.
- Use Matryoshka dimensions on `text-embedding-3-large` (e.g., 256d/512d) for a smaller index footprint.
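The Matryoshka lever is essentially a slice-and-renormalise: keep the first N components, then rescale to unit length so cosine similarity still behaves. (With `text-embedding-3-*` models you can also just pass `dimensions=256` on the embeddings call and let the service do it.) A sketch:

```python
import math

def truncate_embedding(vec: list[float], dims: int = 256) -> list[float]:
    """Matryoshka-style shortening: keep the first `dims` components
    of an embedding vector, then re-normalise to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

A 3072d → 256d cut shrinks index storage roughly 12x; measure the retrieval-quality hit on your own eval set before committing.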
9. Deployment as code (you should not click this into existence)
Minimum you want in a repo:
- Bicep / Terraform for: hub, project, connections (to AOAI, AI Search, Storage), model deployments, RBAC assignments.
- Agent definitions as code (model, instructions, tools, tool resources).
- CI/CD that: lints, runs the offline evals, bumps model deployment version via swap (staging → prod).
- An `azd up` target if you want one command to reproduce the whole project.
A fragment you'll recognise in the wild:
```bicep
resource hub 'Microsoft.MachineLearningServices/workspaces@2024-10-01' = {
  name: 'foundry-hub-${env}'
  location: location
  kind: 'Hub'
  identity: { type: 'SystemAssigned' }
  properties: {
    storageAccount: storage.id
    keyVault: kv.id
    applicationInsights: ai.id
    containerRegistry: acr.id
    publicNetworkAccess: 'Disabled'
  }
}

resource project 'Microsoft.MachineLearningServices/workspaces@2024-10-01' = {
  name: 'foundry-proj-${env}'
  location: location
  kind: 'Project'
  identity: { type: 'SystemAssigned' }
  properties: { hubResourceId: hub.id }
}
```
10. Corrections to common misconceptions
- ❌ "Foundry replaces Azure OpenAI." → It sits on top of AOAI; you still own an AOAI resource and deployments.
- ❌ "Hubs and projects are the same thing." → Hubs are shared infra; projects are where apps live. One hub, many projects.
- ❌ "File Search = RAG." → File Search is a managed, opinionated RAG with Azure AI Search under the hood; great for many cases, but you lose control over chunking/ranking. For control, wire AI Search directly.
- ❌ "Agent threads are in-memory." → They are persisted; you're responsible for lifecycle.
- ❌ "I'll fine-tune instead of RAG." → Fine-tuning doesn't inject fresh facts, and it's rarely the right first move. Ground + prompt + eval first.
11. TODO — self-sufficient action list
Everything below is doable using only this page + an Azure subscription with AOAI enabled.
Provisioning (as code)
- [ ] Write Bicep for: resource group, hub, project, KV, Storage, ACR, App Insights, AOAI, AI Search, connections. Deploy with `az deployment sub create`.
- [ ] Assign yourself Azure AI Developer at project scope; use system-assigned MI for the project to reach AOAI and AI Search.
- [ ] Disable public network access on hub and create private endpoints. Prove connectivity from a jump VM or from a dev container with VNET integration.
First agent
- [ ] Deploy `gpt-4o-mini` and `text-embedding-3-small` on the project's AOAI connection. Record the TPM quota.
- [ ] Using the Projects SDK, upload one PDF, build a vector store, create an agent, and have a 3-turn grounded conversation with citations. Commit the script.
- [ ] Replace the built-in File Search with a direct Azure AI Search integration (hybrid + semantic ranker). Compare answer quality on 10 questions.
Evaluate
- [ ] Assemble a golden set of 30 questions with ideal answers. Run groundedness + relevance + safety evaluators offline; commit a JSON report.
- [ ] Turn on online monitoring in the project; sample 5% of requests; alert on groundedness < 0.8.
Safety + governance
- [ ] Enable Content Safety with high severity blocking on self-harm/violent/sexual categories, and Prompt Shields against injection.
- [ ] Add a test prompt that tries prompt injection via an indexed doc; confirm the agent refuses.
- [ ] Implement per-user document filtering via AI Search security filters.
Operability
- [ ] Confirm traces appear in App Insights (model call, tool call, thread). Build one workbook: latency p50/p95, tokens per turn, tool success rate.
- [ ] Parameterise the model name in deployment; perform a blue/green swap from `gpt-4o-mini` to `gpt-4o` and compare the eval report.
Cost controls
- [ ] Put quotas and alerts per deployment (TPM, RPM, daily spend).
- [ ] Implement a routing layer: `mini` by default, escalate to `4o` only when a cheap confidence gate triggers.
- [ ] Write a nightly cleanup job that deletes threads older than 30 days.
Stretch
- [ ] Add a second agent (e.g., "planner") and orchestrate them with Semantic Kernel or the Microsoft Agent Framework; measure end-to-end latency.
- [ ] Replace pay-as-you-go with a small PTU reservation for the hot path; document the break-even.
- [ ] Package the whole app as an `azd up` target so a teammate can reproduce it in < 30 minutes.
When all checks above are green, flip this post to `status: published` with the final eval numbers embedded.