Deploying a Python Script as an API on Azure

Goal: Take a local PDF metadata extraction script and deploy it as a production-ready REST API on Azure.

📋 What This Blog Covers

flowchart LR
    A["📄 Local Python Script"] --> B["🌐 REST API<br/>(FastAPI/Flask)"]
    B --> C["🐳 Docker Container"]
    C --> D["☁️ Azure Deployment"]
    D --> E["🔒 Auth + Monitoring"]

Sample local script to extract content from a PDF
Available options to deploy the script as an API
Designing the deployment architecture
Implementation and testing

Script resources: GitHub Repo

🔧 Background & Prerequisites

1. PDF Content Extraction — The Script

Library	Strengths	Best For
PyMuPDF (fitz)	Fastest, handles complex layouts	General text extraction
pdfplumber	Excellent table extraction	Tabular data
PyPDF2 (deprecated; continued as pypdf)	Lightweight, basic extraction	Simple PDFs
Tesseract OCR	Open-source OCR for scanned PDFs	Image-based PDFs
Azure AI Document Intelligence	Cloud-based, layout analysis	Enterprise extraction

Key extraction targets:

📝 Text — Full-text extraction from each page
📊 Tables — Structured table data
🏷️ Metadata — Title, author, creation date, page count (PdfReader(file).metadata)
🔍 OCR fallback — Detect if PDF has extractable text; if not, route to OCR
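
The OCR fallback decision in the last bullet can be sketched as a simple heuristic. This is a minimal sketch under stated assumptions, not the script's actual logic: `needs_ocr` and the `min_chars_per_page` threshold are hypothetical, and the per-page strings are assumed to come from something like PyMuPDF's `page.get_text()`.

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Return True when a PDF looks scanned (little or no extractable text).

    page_texts: one string per page, e.g. collected from page.get_text().
    min_chars_per_page: hypothetical threshold; tune it on real documents.
    """
    if not page_texts:
        return True  # no pages with text at all
    # Count non-whitespace characters so blank lines and padding do not count as text
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


if __name__ == "__main__":
    print(needs_ocr(["", "  \n"]))  # scanned-looking input
    print(needs_ocr(["Invoice 2024-001 total due: 1,250.00 USD"]))  # text PDF
```

An average over all pages keeps the check cheap, but a mixed document (some scanned pages, some digital) may need a per-page decision instead.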

2. API Design

sequenceDiagram
    participant Client
    participant API as FastAPI Server
    participant Extractor as PDF Extractor

    Client->>API: POST /api/extract<br/>(multipart/form-data)
    API->>API: Validate file type & size
    API->>Extractor: Extract text + metadata
    Extractor-->>API: Structured result
    API-->>Client: 200 JSON response<br/>{text, metadata, pages, time_ms}

Aspect	Design Decision
Endpoint	`POST /api/extract` — upload PDF, get extracted data
Input	`multipart/form-data` file upload (or URL download)
Output	JSON: `{text, metadata, page_count, tables, processing_time_ms}`
Validation	File type check, size limits, malware considerations
Status Codes	`200` success, `400` bad request, `413` too large, `500` error
Docs	OpenAPI/Swagger auto-generated (built into FastAPI)
Auth	API key in header or Azure Entra ID token
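
The validation and status-code rows can be illustrated with a small, framework-independent check. This is a sketch only: `validate_upload`, `ValidationResult`, and the 25 MB limit are illustrative assumptions, and checking for the `%PDF-` marker confirms the file claims to be a PDF, not that it is safe or well-formed.

```python
from dataclasses import dataclass

PDF_MAGIC = b"%PDF-"
MAX_BYTES = 25 * 1024 * 1024  # assumed upload limit (25 MB)


@dataclass(frozen=True)
class ValidationResult:
    status_code: int  # HTTP status the API would return
    detail: str


def validate_upload(data: bytes, max_bytes: int = MAX_BYTES) -> ValidationResult:
    """Map an uploaded payload to the status codes listed in the design table."""
    if not data:
        return ValidationResult(400, "Empty upload")
    if len(data) > max_bytes:
        return ValidationResult(413, "File too large")
    # Some otherwise valid PDFs carry a few junk bytes before the header,
    # so look for the marker in a small prefix rather than at offset 0 only
    if PDF_MAGIC not in data[:1024]:
        return ValidationResult(400, "Not a PDF")
    return ValidationResult(200, "OK")
```

Checking the bytes rather than the client-supplied content type matters because the `Content-Type` of a multipart part is set by the caller and is easy to spoof.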

3. Framework Choice — Flask vs FastAPI

Feature	Flask	FastAPI
Style	WSGI (sync)	ASGI (async)
Auto docs	Extension needed	Built-in Swagger UI
Validation	Manual	Pydantic models
Performance	Good	Excellent
Learning curve	Minimal	Minimal
Verdict	✅ Use if integrating into existing Flask app	✅ Recommended for new APIs

4. Containerization with Docker

# Example multi-stage Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY . .
EXPOSE 8000
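# python:3.12-slim ships without curl, so probe /health with the standard library
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)" || exit 1
# Run uvicorn as a module: only site-packages is copied from the builder, not the /usr/local/bin launcher scripts
CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]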

💡 Tips: Use multi-stage builds to reduce image size. Install system deps like poppler-utils for pdftotext. Pass secrets via environment variables — never hardcode.
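
The advice on secrets can be sketched as a small settings loader. Everything named here is an assumption for illustration: the `API_KEY` and `MAX_UPLOAD_MB` variable names, the `Settings` class, and the default of 25 MB. In Azure these values would typically be injected as container environment variables or secrets.

```python
import hmac
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    max_upload_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from environment variables; fail fast if the secret is missing."""
    api_key = env.get("API_KEY", "")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("API_KEY environment variable is not set")
    return Settings(api_key=api_key, max_upload_mb=int(env.get("MAX_UPLOAD_MB", "25")))


def is_authorized(presented_key: str, settings: Settings) -> bool:
    """Compare keys in constant time so timing does not reveal how much matched."""
    return hmac.compare_digest(presented_key.encode(), settings.api_key.encode())
```

Failing at startup when the key is absent is a deliberate choice: a container that boots without its secret would otherwise serve unauthenticated traffic or reject every request silently.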

5. Azure Deployment Options

graph TD
    Script["🐍 Python Script"] --> Container["🐳 Docker Image"]
    Container --> ACR["📦 Azure Container Registry"]
    ACR --> ACA["⭐ Container Apps<br/>(Recommended)"]
    ACR --> AppService["🌐 App Service"]
    ACR --> Functions["⚡ Azure Functions"]
    ACR --> ACI["📦 Container Instances"]
    ACR --> AKS["☸️ AKS"]

Option	Scale to Zero	Complexity	Best For	Monthly Cost
Container Apps ⭐	✅	Low	Variable-traffic APIs	Pay per use
App Service	❌	Low	Steady-traffic APIs	~$13+ (B1)
Azure Functions	✅	Low	Infrequent calls	Free tier available
Container Instances	❌	Minimal	Testing/one-off jobs	Pay per second
AKS	❌	High	Multi-service architectures	Pay for node VMs

⭐ Recommendation: Azure Container Apps, which offers the best balance of simplicity, cost, and scale-to-zero capability.

6. CI/CD Pipeline

flowchart LR
    Push["📤 Git Push"] --> Build["🏗️ GitHub Actions"]
    Build --> Image["🐳 Build Docker Image"]
    Image --> ACR["📦 Push to ACR"]
    ACR --> Deploy["🚀 Deploy to<br/>Container Apps"]

Use GitHub Actions for automated build → push → deploy
Store Azure credentials and API keys in GitHub Secrets
Maintain separate staging and production environments

✅ TODO — Remaining Work

#	Task	Priority
1	Write PDF extraction script (PyMuPDF: metadata + text + tables)	🔴 High
2	Wrap in FastAPI with endpoints, validation, error handling	🔴 High
3	Add OpenAPI/Swagger documentation	🔴 High
4	Write Dockerfile and test locally	🔴 High
5	Push image to Azure Container Registry	🟡 Medium
6	Deploy to Azure Container Apps	🟡 Medium
7	Set up GitHub Actions CI/CD pipeline	🟡 Medium
8	Add authentication (API key or Azure Entra ID)	🟡 Medium
9	Load test with sample PDFs and document performance	🟢 Low
10	Create full architecture diagram with all components	🟢 Low

🧩 Reference Implementation — FastAPI Service

A minimal, production-leaning implementation of the PDF extraction API:

# main.py
# Requires: fastapi, uvicorn, python-multipart (for file uploads), PyMuPDF
import os
import time

import fitz  # PyMuPDF
from fastapi import FastAPI, File, HTTPException, UploadFile

MAX_BYTES = 25 * 1024 * 1024  # 25 MB upload limit
app = FastAPI(title="PDF Extract API", version="1.0.0")

@app.get("/health")
def health():
    return {"status": "ok", "version": os.getenv("APP_VERSION", "dev")}

@app.post("/api/extract")
async def extract(file: UploadFile = File(...)):
    if file.content_type not in ("application/pdf", "application/octet-stream"):
        raise HTTPException(status_code=400, detail="Only PDF accepted")
    # Read one byte past the limit so oversized uploads are rejected
    # without loading the whole file into memory
    data = await file.read(MAX_BYTES + 1)
    if len(data) > MAX_BYTES:
        raise HTTPException(status_code=413, detail="File too large")

    t0 = time.perf_counter()
    try:
        doc = fitz.open(stream=data, filetype="pdf")
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Invalid PDF: {e}") from e

    # The context manager closes the document even if extraction raises.
    # Note: extraction is synchronous and blocks the event loop while it runs.
    with doc:
        pages = [p.get_text() for p in doc]
        meta = doc.metadata or {}
        return {
            "text": "\n\n".join(pages),
            "page_count": doc.page_count,
            "metadata": {
                "title": meta.get("title"),
                "author": meta.get("author"),
                "creation_date": meta.get("creationDate"),
            },
            "processing_time_ms": int((time.perf_counter() - t0) * 1000),
        }
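
The handler returns `creation_date` exactly as PyMuPDF reports it, which is typically a raw PDF date string such as `D:20240115103000+01'00'`. If API clients would rather receive ISO 8601, a conversion along these lines could be applied before building the response. This is a hedged sketch: `pdf_date_to_iso` is a hypothetical helper, and it returns `None` for anything it cannot parse rather than guessing.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYYMMDDHHmmSS followed by an optional offset such as +01'00' or Z
_PDF_DATE = re.compile(
    r"^D:(?P<y>\d{4})(?P<mo>\d{2})?(?P<d>\d{2})?"
    r"(?P<h>\d{2})?(?P<mi>\d{2})?(?P<s>\d{2})?"
    r"(?P<tz>[Zz+\-])?(?:(?P<th>\d{2})'?(?P<tm>\d{2})?'?)?$"
)


def pdf_date_to_iso(value: Optional[str]) -> Optional[str]:
    """Convert a PDF date string to ISO 8601, or return None if it does not parse."""
    if not value:
        return None
    match = _PDF_DATE.match(value.strip())
    if not match:
        return None
    g = match.groupdict()
    try:
        # Missing trailing fields default to the start of the period
        dt = datetime(int(g["y"]), int(g["mo"] or 1), int(g["d"] or 1),
                      int(g["h"] or 0), int(g["mi"] or 0), int(g["s"] or 0))
        if g["tz"] in ("+", "-"):
            offset = timedelta(hours=int(g["th"] or 0), minutes=int(g["tm"] or 0))
            dt = dt.replace(tzinfo=timezone(offset if g["tz"] == "+" else -offset))
        elif g["tz"] in ("Z", "z"):
            dt = dt.replace(tzinfo=timezone.utc)
    except ValueError:
        return None  # out-of-range field, e.g. month 13
    return dt.isoformat()
```

Dates without an offset are returned as naive timestamps, since the PDF gives no basis for assuming a time zone.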

Deploying to Azure Container Apps

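# 1. Create the resource group and registry (skip if they already exist), then build and push
ACR=myblogacr
az group create -n rg-pdf -l eastus
az acr create -g rg-pdf -n $ACR --sku Basic
az acr build -r $ACR -t pdf-extract:v1 .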

# 2. Create environment and deploy
az containerapp env create -g rg-pdf -n cae-pdf -l eastus
az containerapp create \
  -g rg-pdf -n ca-pdf-extract \
  --environment cae-pdf \
  --image $ACR.azurecr.io/pdf-extract:v1 \
  --registry-server $ACR.azurecr.io \
  --registry-identity system \
  --target-port 8000 --ingress external \
  --min-replicas 0 --max-replicas 5 \
  --cpu 0.5 --memory 1Gi

Testing

curl -sS -X POST -F "file=@sample.pdf" https://<app-url>/api/extract | jq '.page_count'
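
The same request can be issued from Python using only the standard library, which is convenient for scripted smoke tests. This is a sketch: `build_multipart` and `extract_pdf` are hypothetical helper names, and `base_url` stands in for the deployed app URL.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, Content-Type header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_pdf(base_url: str, pdf_path: str, timeout: float = 60.0) -> dict:
    """POST a local PDF to /api/extract and return the decoded JSON response."""
    with open(pdf_path, "rb") as fh:
        body, header = build_multipart("file", os.path.basename(pdf_path), fh.read())
    req = urllib.request.Request(
        f"{base_url}/api/extract", data=body,
        headers={"Content-Type": header}, method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

A call such as `extract_pdf("https://<app-url>", "sample.pdf")["page_count"]` would mirror the curl command, with the placeholder replaced by the real hostname.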

When all TODO items above are ticked and the /api/extract endpoint handles 100 concurrent uploads without errors, flip status: workinprogress → status: published.
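
The "100 concurrent uploads" criterion can be checked with a small harness like the one below. It is a sketch under assumptions: `run_load_test` and `passed` are hypothetical names, and `send` is a caller-supplied function that performs one upload and returns the HTTP status code, so the harness itself makes no network calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_load_test(send: Callable[[int], int], total: int = 100, workers: int = 100) -> dict:
    """Call send(i) for i in range(total) concurrently and tally the outcomes.

    Any exception raised by send is counted under the key "error".
    """
    def safe_send(i: int):
        try:
            return send(i)
        except Exception:
            return "error"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(safe_send, range(total)))

    counts: dict = {}
    for outcome in outcomes:
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts


def passed(counts: dict, total: int = 100) -> bool:
    """The run passes only if every request returned HTTP 200."""
    return counts.get(200, 0) == total
```

With scale-to-zero enabled, the first requests of a run may include cold-start latency, so it is worth recording timings separately for a warm and a cold service.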