Azure AI Foundry — Deep Dive
What you'll leave with: an accurate mental model of Foundry (vs Azure OpenAI, vs AI Studio, vs "the portal"), a clear picture of hubs/projects/connections/deployments, and a working plan to ship a grounded, evaluated, observable agent on Foundry using the SDKs — not clicks.
Useful starting points (this page is designed to stand alone):

- Official learning path: https://learn.microsoft.com/training/paths/create-custom-copilots-ai-studio/
- Foundry Agent Intro video: https://youtu.be/ltt7JNd30Ag?si=TUHrwTrdDgAu-2eN
1. What Foundry actually is (naming cleaned up)
There has been churn in the brand; the names you'll see in the wild:
- Azure AI Studio — the earlier portal for multi-model AI.
- Azure AI Foundry portal (`ai.azure.com`) — the current portal.
- Azure AI Foundry — the Azure service family that sits on top of Azure OpenAI, Azure AI services, and Azure Machine Learning: model catalog, deployments, prompt flow, evaluation, agents, safety, monitoring.
- Azure OpenAI (AOAI) — the underlying resource that hosts OpenAI-family models (`gpt-4o`, `o1`, `text-embedding-3-*`, DALL·E, Whisper). Foundry uses AOAI deployments.
- Azure AI Services (multi-service) — the "Cognitive Services" bundle (Speech, Document Intelligence, AI Search, Content Safety, Translator, Vision). Foundry orchestrates these.
The single clearest mental model:
Foundry is the integrated control plane and SDK surface for building LLM applications on Azure. Azure OpenAI + Azure AI Services + Azure ML are the planes underneath it.
If you've been using plain Azure OpenAI, Foundry doesn't replace it — it adds projects, shared connections, grounding stores, evaluation, tracing, and a first-class agent runtime on top.
2. The resource model — hubs, projects, connections, deployments
```mermaid
flowchart TB
    RG[Resource group]
    RG --> H[Hub<br/>Foundry workspace]
    H --> P1[Project A]
    H --> P2[Project B]
    subgraph Shared[Shared at the hub]
        S[Storage account]
        KV[Key Vault]
        ACR[Container registry]
        AI[App Insights]
    end
    H --- Shared
    H --> CX1[Connection<br/>Azure OpenAI]
    H --> CX2[Connection<br/>AI Search]
    H --> CX3[Connection<br/>AI Services]
    P1 --> D1[Model deployment<br/>gpt-4o]
    P1 --> IDX[Indexed data /<br/>vector index]
    P1 --> AG[Agent]
```
Key terms and what they buy you:
- Hub — a team-scoped container with shared infra (Storage/KV/ACR/AppInsights), governance, quota, network settings. Create one per team or business unit, not one per app.
- Project — an isolation boundary for a single application: its own data, agents, deployments, RBAC. One project per product / deployable unit.
- Connections — references to external resources (AOAI, AI Search, Azure Blob, Microsoft Fabric, Bing Grounding, etc.). Defined once at the hub, consumed by many projects.
- Deployments — the served instance of a model (`gpt-4o` @ `eastus2` @ 100k TPM). Deployments live at the project level in Foundry.
- Compute — serverless endpoints for most models; managed compute for models that need dedicated GPUs (e.g., `Llama-3-70B`).
- Evaluations / Prompt flow / Agents / Monitoring — features operating within a project.
Tip: hubs are like ML "workspaces"; projects are like "folders with permissions". Don't proliferate hubs.
3. Picking a model — practical rubric
Foundry exposes a model catalog that spans OpenAI, Meta (Llama), Mistral, Cohere, Microsoft Phi, Black Forest (Flux), NVIDIA NIMs, and more. You do not need to memorise it. Decide along four axes:
| Axis | Lean toward |
|---|---|
| Cost/latency per token | Smaller (gpt-4o-mini, phi-3-medium, llama-3.1-8b) |
| Reasoning ability | o1, o3-mini, gpt-4o, gpt-4.1 |
| Data sovereignty | Models deployable in your chosen Azure region; confirm regional availability |
| Open-weights / on-prem portability | Llama / Mistral / Phi via "managed compute" endpoint |
Default starter recipe:
- Chat / reasoning: `gpt-4o-mini` for most traffic; promote to `gpt-4o` / `o3-mini` for hard queries.
- Embeddings: `text-embedding-3-small` (1536d) unless multilingual → `text-embedding-3-large` (3072d) or `bge-m3` via managed compute.
- Vision: `gpt-4o` (text + image in one call).
- Speech: Azure AI Speech (fast transcription / real-time streaming) as a tool, not a chat model.
Never pick a model until you know the TPM/RPM quota you need. Quota is regional and per-deployment; request increases early.
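That promote-on-demand advice is easy to wire as a tiny router in front of your deployments. A minimal sketch — the deployment names and the `naive_gate` heuristic are placeholders; in production you'd use a small classifier or a cheap self-assessment call as the gate:

```python
from typing import Callable

# Hypothetical deployment names -- substitute your own.
CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4o"

def pick_model(question: str,
               difficulty: Callable[[str], float],
               threshold: float = 0.6) -> str:
    """Route a turn: cheap model by default, escalate when the
    difficulty gate (0..1, higher = harder) fires."""
    return STRONG_MODEL if difficulty(question) >= threshold else CHEAP_MODEL

def naive_gate(question: str) -> float:
    """Trivially cheap gate for illustration: long, multi-part
    questions are treated as 'hard'."""
    signals = [
        len(question) > 300,
        "step by step" in question.lower(),
        question.count("?") > 1,
    ]
    return sum(signals) / len(signals)
```

The chosen name then selects which deployment your AOAI client calls for that turn; the gate runs before any expensive model call, so routing itself costs nothing.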
4. Grounding — where the real product lives
Chat completions alone ≠ a product. Grounding = connecting the model to your data/tools. Foundry offers three layered options:
- Data → Index → Use as tool (RAG, the 80% case)
- Point Foundry at a Blob container / SharePoint / Fabric OneLake.
- Foundry builds an Azure AI Search index (chunking, embeddings, hybrid search with semantic ranker).
- Your agent uses the index as a tool ("search the docs").
- Connected/enterprise tools — Bing Grounding, Microsoft Graph, Azure Functions, REST API, OpenAPI spec. Each becomes a callable tool for the agent.
- Custom code tools — register any Python callable as a tool through the Agents SDK.
Azure AI Search specifics that matter:
- Use hybrid (BM25 + vector + semantic ranker) by default; it materially beats pure vector.
- Configure semantic configurations so the ranker knows the title/content/keywords fields.
- Turn on scoring profiles to boost recent docs or specific tags.
- Set security filters (`securityGroups/any(...)`) if you need per-user document visibility.
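To make the hybrid-plus-security-trim advice concrete, here's a sketch of the request body for AI Search's `POST /indexes/{index}/docs/search` REST call. The field names (`contentVector`), the `default` semantic configuration, and the exact body shape are assumptions against recent API versions — check yours:

```python
def hybrid_search_body(query: str, query_vector: list[float],
                       user_groups: list[str]) -> dict:
    """Build a search request combining BM25, vector, and semantic
    ranking, plus a per-user OData security trim. Field and config
    names are illustrative."""
    group_list = ",".join(user_groups)
    return {
        "search": query,                     # BM25 keyword leg
        "vectorQueries": [{                  # vector leg
            "kind": "vector",
            "vector": query_vector,
            "fields": "contentVector",       # hypothetical vector field
            "k": 10,
        }],
        "queryType": "semantic",             # reranker on top
        "semanticConfiguration": "default",  # must exist on the index
        # Security trim: only docs tagged with one of the user's groups.
        "filter": f"securityGroups/any(g: search.in(g, '{group_list}'))",
        "top": 5,
    }
```

The same structure maps onto the `azure-search-documents` SDK if you prefer typed clients; the point is that all three legs plus the filter travel in one request.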
5. Agents — what the Foundry agent runtime gives you
The Azure AI Agent Service (formerly Assistants on Azure) is a stateful runtime layered on your model deployments. You get:
- Threads — persistent, server-managed conversation state (not just in-memory).
- Tools — file search, code interpreter, function calling, OpenAPI, and connected services.
- Runs — a single turn; the service orchestrates model calls, tool calls, and retries, and emits streaming events.
- Tracing — automatic OpenTelemetry spans for every model/tool call, ingested into your linked Application Insights.
A minimal working agent using the Azure AI Projects SDK (Python):
```python
# pip install azure-ai-projects azure-identity
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import FileSearchTool

project = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<copy from Foundry portal > Project > Overview>",
)

# 1) Upload & index a file for grounding
file = project.agents.upload_file_and_poll(file_path="handbook.pdf", purpose="assistants")
store = project.agents.create_vector_store_and_poll(name="kb", file_ids=[file.id])
fs = FileSearchTool(vector_store_ids=[store.id])

# 2) Create the agent
agent = project.agents.create_agent(
    model="gpt-4o-mini",
    name="hr-bot",
    instructions=(
        "You are an HR assistant. Answer ONLY using the provided file search results. "
        "Cite sources. If unsure, say you don't know."
    ),
    tools=fs.definitions,
    tool_resources=fs.resources,
)

# 3) Run a conversation
thread = project.agents.create_thread()
project.agents.create_message(
    thread_id=thread.id,
    role="user",
    content="What is the parental leave policy for salaried employees?",
)
run = project.agents.create_and_process_run(thread_id=thread.id, assistant_id=agent.id)

msgs = project.agents.list_messages(thread_id=thread.id)
for m in msgs.data[::-1]:
    print(m.role, ":", m.content[0].text.value)
```
Things that surprise people the first time:
- Threads persist server-side. Delete them, or you keep paying for storage.
- File search indexes are vector stores (ADA-style embeddings under the hood); you pay per MB stored + tokens embedded.
- Tool calls and model calls each cost tokens and time; budget your loop (`max_completion_tokens`, tool timeouts).
- The SDK is async-friendly; use it.
For code-generation, multi-agent orchestration, or local-first dev, also evaluate Semantic Kernel and the Microsoft Agent Framework — they compose agents and can target Foundry as the model provider.
6. Evaluation — not optional
Before you ship, set up two evaluation loops:
- Offline eval on a golden set (30–200 prompts): faithfulness, groundedness, relevance, safety. Use Foundry's built-in evaluators or `azure-ai-evaluation`:

  ```python
  from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator, evaluate
  from azure.identity import DefaultAzureCredential

  # model_config: the judge-model deployment used to score answers
  result = evaluate(
      data="eval.jsonl",  # each row: {question, answer, context}
      evaluators={
          "groundedness": GroundednessEvaluator(model_config),
          "relevance": RelevanceEvaluator(model_config),
      },
  )
  ```
- Online eval — sample a % of prod traffic, score with the same evaluators, alert on drift.
Set hard accept thresholds (e.g., groundedness ≥ 0.9, relevance ≥ 0.85) in CI; fail the PR if a change regresses.
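The CI gate itself is a few lines. A sketch — the metric names and scores are illustrative; wire `gate()` to your real eval report and fail the build when it returns anything:

```python
# Hypothetical aggregate scores from an offline eval run; metric names
# assumed to mirror the evaluators' output keys.
THRESHOLDS = {"groundedness": 0.90, "relevance": 0.85}

def gate(metrics: dict) -> list:
    """Return (metric, score, floor) for every metric below its floor;
    an empty list means the change may merge."""
    return [(name, metrics.get(name, 0.0), floor)
            for name, floor in THRESHOLDS.items()
            if metrics.get(name, 0.0) < floor]

# In CI: sys.exit(1) when gate(...) is non-empty, so the PR fails.
failures = gate({"groundedness": 0.93, "relevance": 0.81})
# failures == [("relevance", 0.81, 0.85)]
```

Missing metrics score 0.0 and therefore fail closed, which is the behaviour you want when an evaluator silently breaks.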
7. Safety, governance, responsible AI
- Azure AI Content Safety — gate both prompt (jailbreak / prompt-injection detection, profanity, self-harm, violence) and response. Easier to wire via Foundry's built-in content filters than to bolt on later.
- Prompt Shields — specifically detect prompt injection from retrieved documents or user input.
- PII detection / redaction — Language service or Purview integration.
- RBAC — scope at hub vs project; developers usually need `Azure AI Developer` at project scope, not hub.
- Private networking — managed VNet, private endpoints for AOAI / AI Search / Storage. Disable public network access on prod.
- Customer-managed keys (CMK) for the hub storage and Key Vault if your policies require it.
8. Cost model — what you actually pay for
Token cost is only one line item. You also pay for:
- Hub shared infra — Storage, Key Vault, Container Registry, App Insights (cheap-to-moderate).
- Provisioned Throughput Units (PTUs) vs pay-as-you-go for AOAI. PTUs give predictable latency + rate capacity; break-even is usually ≥ ~25% of a deployment's capacity being utilised.
- AI Search — tier + replicas + partitions; semantic-ranker usage is metered separately.
- Agent runtime storage — threads, files, vector stores.
- Egress — cross-region calls add up.
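The PTU break-even quoted above is just arithmetic. A sketch with made-up numbers — pull real prices and capacities from your Azure price sheet:

```python
def breakeven_utilisation(ptu_monthly_cost: float,
                          paygo_cost_per_1m_tokens: float,
                          ptu_capacity_tokens_per_month: float) -> float:
    """Fraction of PTU capacity you must actually use before the
    reservation beats pay-as-you-go. All inputs are placeholders."""
    paygo_cost_at_full_capacity = (
        ptu_capacity_tokens_per_month / 1_000_000 * paygo_cost_per_1m_tokens
    )
    return ptu_monthly_cost / paygo_cost_at_full_capacity

# Example: a $2,000/month PTU block serving up to 800M tokens/month,
# vs $10 per 1M tokens pay-as-you-go.
u = breakeven_utilisation(2000, 10, 800_000_000)
```

With these illustrative numbers `u` comes out at 0.25 — i.e., below 25% utilisation, pay-as-you-go is cheaper, which matches the rule of thumb above.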
Levers:
- Route easy turns to `*-mini`; escalate only when needed.
- Cache embeddings; avoid re-embedding unchanged documents.
- Delete stale threads and vector stores (they silently accumulate).
- Batch eval offline; don't run full evaluators in prod.
- Use Matryoshka dimensions on `text-embedding-3-large` (e.g., 256d/512d) for a smaller index footprint.
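The Matryoshka lever is essentially a slice-and-renormalise: keep the first N components, then rescale to unit length so cosine similarity still behaves. (With `text-embedding-3-*` models you can also just pass `dimensions=256` on the embeddings call and let the service do it.) A sketch:

```python
import math

def truncate_embedding(vec: list[float], dims: int = 256) -> list[float]:
    """Matryoshka-style shortening: keep the first `dims` components
    of an embedding vector, then re-normalise to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

A 3072d → 256d cut shrinks index storage roughly 12x; measure the retrieval-quality hit on your own eval set before committing.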
9. Deployment as code (you should not click this into existence)
Minimum you want in a repo:
- Bicep / Terraform for: hub, project, connections (to AOAI, AI Search, Storage), model deployments, RBAC assignments.
- Agent definitions as code (model, instructions, tools, tool resources).
- CI/CD that: lints, runs the offline evals, bumps model deployment version via swap (staging → prod).
- An `azd up` target if you want one command to reproduce the whole project.
A fragment you'll recognise in the wild:
```bicep
resource hub 'Microsoft.MachineLearningServices/workspaces@2024-10-01' = {
  name: 'foundry-hub-${env}'
  location: location
  kind: 'Hub'
  identity: { type: 'SystemAssigned' }
  properties: {
    storageAccount: storage.id
    keyVault: kv.id
    applicationInsights: ai.id
    containerRegistry: acr.id
    publicNetworkAccess: 'Disabled'
  }
}

resource project 'Microsoft.MachineLearningServices/workspaces@2024-10-01' = {
  name: 'foundry-proj-${env}'
  location: location
  kind: 'Project'
  identity: { type: 'SystemAssigned' }
  properties: { hubResourceId: hub.id }
}
```
10. Corrections to common misconceptions
- ❌ "Foundry replaces Azure OpenAI." → It sits on top of AOAI; you still own an AOAI resource and deployments.
- ❌ "Hubs and projects are the same thing." → Hubs are shared infra; projects are where apps live. One hub, many projects.
- ❌ "File Search = RAG." → File Search is a managed, opinionated RAG with Azure AI Search under the hood; great for many cases, but you lose control over chunking/ranking. For control, wire AI Search directly.
- ❌ "Agent threads are in-memory." → They are persisted; you're responsible for lifecycle.
- ❌ "I'll fine-tune instead of RAG." → Fine-tuning doesn't inject fresh facts, and it's rarely the right first move. Ground + prompt + eval first.
11. TODO — self-sufficient action list
Everything below is doable using only this page + an Azure subscription with AOAI enabled.
Provisioning (as code)
- [ ] Write Bicep for: resource group, hub, project, KV, Storage, ACR, App Insights, AOAI, AI Search, connections. Deploy with `az deployment sub create`.
- [ ] Assign yourself Azure AI Developer at project scope; use system-assigned MI for the project to reach AOAI and AI Search.
- [ ] Disable public network access on hub and create private endpoints. Prove connectivity from a jump VM or from a dev container with VNET integration.
First agent
- [ ] Deploy `gpt-4o-mini` and `text-embedding-3-small` on the project's AOAI connection. Record the TPM quota.
- [ ] Using the Projects SDK, upload one PDF, build a vector store, create an agent, and have a 3-turn grounded conversation with citations. Commit the script.
- [ ] Replace the built-in File Search with a direct Azure AI Search integration (hybrid + semantic ranker). Compare answer quality on 10 questions.
Evaluate
- [ ] Assemble a golden set of 30 questions with ideal answers. Run groundedness + relevance + safety evaluators offline; commit a JSON report.
- [ ] Turn on online monitoring in the project; sample 5% of requests; alert on groundedness < 0.8.
Safety + governance
- [ ] Enable Content Safety with high severity blocking on self-harm/violent/sexual categories, and Prompt Shields against injection.
- [ ] Add a test prompt that tries prompt injection via an indexed doc; confirm the agent refuses.
- [ ] Implement per-user document filtering via AI Search security filters.
Operability
- [ ] Confirm traces appear in App Insights (model call, tool call, thread). Build one workbook: latency p50/p95, tokens per turn, tool success rate.
- [ ] Parameterise the model name in deployment; perform a blue/green swap from `gpt-4o-mini` to `gpt-4o` and compare the eval report.
Cost controls
- [ ] Put quotas and alerts per deployment (TPM, RPM, daily spend).
- [ ] Implement a routing layer: `mini` by default, escalate to `4o` only when a cheap confidence gate triggers.
- [ ] Write a nightly cleanup job that deletes threads older than 30 days.
Stretch
- [ ] Add a second agent (e.g., "planner") and orchestrate them with Semantic Kernel or the Microsoft Agent Framework; measure end-to-end latency.
- [ ] Replace pay-as-you-go with a small PTU reservation for the hot path; document the break-even.
- [ ] Package the whole app as an `azd up` target so a teammate can reproduce it in < 30 minutes.
When all checks above are green, flip this post to `status: published` with the final eval numbers embedded.