Deploying scripts as an API in Azure

Deploying local Python scripts by converting them into an API.

Deploying a Python Script as an API on Azure

Goal: Take a local PDF metadata extraction script and deploy it as a production-ready REST API on Azure.


📋 What This Blog Covers

flowchart LR
    A["📄 Local Python Script"] --> B["🌐 REST API<br/>(FastAPI/Flask)"]
    B --> C["🐳 Docker Container"]
    C --> D["☁️ Azure Deployment"]
    D --> E["🔒 Auth + Monitoring"]

  1. Sample local script to extract content from PDF
  2. Available options to deploy the script as an API
  3. Designing the deployment architecture
  4. Implementation and testing

Script resources: GitHub Repo


🔧 Background & Prerequisites

1. PDF Content Extraction — The Script

| Library | Strengths | Best For |
| --- | --- | --- |
| PyMuPDF (fitz) | Fastest, handles complex layouts | General text extraction |
| pdfplumber | Excellent table extraction | Tabular data |
| PyPDF2 (now maintained as pypdf) | Lightweight, basic extraction | Simple PDFs |
| Tesseract OCR | Open-source OCR for scanned PDFs | Image-based PDFs |
| Azure Doc Intelligence | Cloud-based, layout analysis | Enterprise extraction |

Key extraction targets:

  • 📝 Text — Full-text extraction from each page
  • 📊 Tables — Structured table data
  • 🏷️ Metadata — Title, author, creation date, page count (PdfReader(file).metadata)
  • 🔍 OCR fallback — Detect if PDF has extractable text; if not, route to OCR
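
The OCR fallback decision can be made from the text layer alone. Here is a minimal sketch, assuming the per-page strings come from something like PyMuPDF's `page.get_text()`. The function name `needs_ocr` and both thresholds are illustrative assumptions, not values taken from the script:

```python
def needs_ocr(page_texts, min_chars_per_page=25, min_text_page_ratio=0.5):
    """Return True when too few pages carry a usable text layer.

    page_texts: one extracted string per page (hypothetical input, e.g. from
    PyMuPDF's page.get_text()). Both thresholds are illustrative assumptions.
    """
    if not page_texts:
        return True  # nothing extractable at all, so send it to OCR
    # Count pages whose non-whitespace character count clears the threshold
    pages_with_text = sum(
        1 for text in page_texts
        if len("".join(text.split())) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) < min_text_page_ratio


if __name__ == "__main__":
    scanned = ["", " \n", ""]
    digital = [
        "Quarterly report for the board of directors.",
        "Revenue grew during the period.",
    ]
    print(needs_ocr(scanned))  # True: no page has a text layer
    print(needs_ocr(digital))  # False: every page has enough text
```

Counting pages rather than total characters keeps a single text-heavy cover page from hiding a stack of scanned pages behind it.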

2. API Design

sequenceDiagram
    participant Client
    participant API as FastAPI Server
    participant Extractor as PDF Extractor

    Client->>API: POST /api/extract<br/>(multipart/form-data)
    API->>API: Validate file type & size
    API->>Extractor: Extract text + metadata
    Extractor-->>API: Structured result
    API-->>Client: 200 JSON response<br/>{text, metadata, page_count, tables, processing_time_ms}

| Aspect | Design Decision |
| --- | --- |
| Endpoint | POST /api/extract: upload PDF, get extracted data |
| Input | multipart/form-data file upload (or URL download) |
| Output | JSON: {text, metadata, page_count, tables, processing_time_ms} |
| Validation | File type check, size limits, malware considerations |
| Status Codes | 200 success, 400 bad request, 413 too large, 500 error |
| Docs | OpenAPI/Swagger auto-generated (built into FastAPI) |
| Auth | API key in header or Azure Entra ID token |
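
The validation step and its status codes can be sketched as a pure function that runs before any parsing. This is a hedged sketch: `validate_upload` is a hypothetical helper, the 25 MB limit is an assumed default, and the magic-byte check is an extra safeguard rather than something the design above requires:

```python
PDF_MAGIC = b"%PDF-"
MAX_BYTES = 25 * 1024 * 1024  # assumed limit; tune to your workload


def validate_upload(data: bytes, content_type: str, max_bytes: int = MAX_BYTES):
    """Return (status_code, message) for an uploaded payload.

    Uses the status codes listed above: 400 for a bad request,
    413 for an oversized file, 200 when the upload may proceed.
    """
    if content_type not in ("application/pdf", "application/octet-stream"):
        return 400, "Only PDF uploads are accepted"
    if len(data) > max_bytes:
        return 413, "File too large"
    # PDF files start with a "%PDF-" header; scanning the first 1024 bytes
    # tolerates a small amount of leading junk
    if PDF_MAGIC not in data[:1024]:
        return 400, "File does not look like a PDF"
    return 200, "OK"


if __name__ == "__main__":
    print(validate_upload(b"%PDF-1.7 ...", "application/pdf"))
    print(validate_upload(b"hello", "text/plain"))
```

Checking the content itself matters because the `Content-Type` header is supplied by the client and cannot be trusted on its own.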

3. Framework Choice — Flask vs FastAPI

| Feature | Flask | FastAPI |
| --- | --- | --- |
| Style | WSGI (sync) | ASGI (async) |
| Auto docs | Extension needed | Built-in Swagger UI |
| Validation | Manual | Pydantic models |
| Performance | Good | Excellent |
| Learning curve | Minimal | Minimal |
| Verdict | ✅ Use if integrating into existing Flask app | Recommended for new APIs |

4. Containerization with Docker

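# Example multi-stage Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
# Install into an isolated prefix so packages and their console scripts copy over as one unit
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
# Brings over site-packages plus entry points such as the uvicorn executable
COPY --from=builder /install /usr/local
COPY . .
EXPOSE 8000
# python:3.12-slim does not ship curl, so probe /health with the standard library
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)" || exit 1
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]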

💡 Tips: Use multi-stage builds to reduce image size. Install system deps like poppler-utils for pdftotext. Pass secrets via environment variables — never hardcode.
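
To make the secrets tip concrete, here is a minimal sketch of reading an API key from the environment and checking it in constant time. The variable name `PDF_API_KEY` and both helper names are assumptions for illustration; the post does not prescribe them:

```python
import hmac
import os


def load_api_key(env_var: str = "PDF_API_KEY") -> str:
    """Read the expected key from the environment; fail fast if it is missing."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


def is_authorized(presented_key, expected_key: str) -> bool:
    """Compare keys in constant time so response timing does not leak the key."""
    if not presented_key:
        return False
    return hmac.compare_digest(presented_key.encode(), expected_key.encode())
```

In the API, the presented key would typically come from a request header (for example `X-API-Key`, also an assumed name), and the container would receive the real value as an environment variable or secret at deploy time.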


5. Azure Deployment Options

graph TD
    Script["🐍 Python Script"] --> Container["🐳 Docker Image"]
    Container --> ACR["📦 Azure Container Registry"]
    ACR --> ACA["⭐ Container Apps<br/>(Recommended)"]
    ACR --> AppService["🌐 App Service"]
    ACR --> Functions["⚡ Azure Functions"]
    ACR --> ACI["📦 Container Instances"]
    ACR --> AKS["☸️ AKS"]

| Option | Scale to Zero | Complexity | Best For | Monthly Cost |
| --- | --- | --- | --- | --- |
| Container Apps | ✅ | Low | Variable-traffic APIs | Pay per use |
| App Service | ❌ | Low | Steady-traffic APIs | ~$13+ (B1) |
| Azure Functions | ✅ (Consumption plan) | Low | Infrequent calls | Free tier available |
| Container Instances | ❌ | Minimal | Testing/one-off jobs | Pay per second |
| AKS | ❌ | High | Multi-service architectures | |


Recommendation: Azure Container Apps, for the best balance of simplicity, cost, and scale-to-zero capability.


6. CI/CD Pipeline

flowchart LR
    Push["📤 Git Push"] --> Build["🏗️ GitHub Actions"]
    Build --> Image["🐳 Build Docker Image"]
    Image --> ACR["📦 Push to ACR"]
    ACR --> Deploy["🚀 Deploy to<br/>Container Apps"]

  • Use GitHub Actions for automated build → push → deploy
  • Store Azure credentials and API keys in GitHub Secrets
  • Maintain separate staging and production environments
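
One way to use the staging environment is to gate promotion on a smoke test. The sketch below assumes the service exposes a `/health` endpoint that returns JSON with `"status": "ok"`. The function names and the injectable `fetch` parameter are illustrative choices, not part of any pipeline tooling:

```python
import json
import urllib.request


def fetch_json(url: str, timeout: float = 5.0):
    """GET a URL and decode the JSON body (raises on HTTP or network errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def smoke_test(base_url: str, fetch=fetch_json) -> bool:
    """Return True when GET {base_url}/health reports status "ok"."""
    try:
        payload = fetch(base_url.rstrip("/") + "/health")
    except Exception:
        return False  # unreachable or unhealthy deployment
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

A pipeline step could call `smoke_test` with the staging URL and exit non-zero on `False`, so a broken image never reaches production. Passing `fetch` as a parameter keeps the check testable without a network.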

✅ TODO — Remaining Work

| # | Task | Priority |
| --- | --- | --- |
| 1 | Write PDF extraction script (PyMuPDF: metadata + text + tables) | 🔴 High |
| 2 | Wrap in FastAPI with endpoints, validation, error handling | 🔴 High |
| 3 | Add OpenAPI/Swagger documentation | 🔴 High |
| 4 | Write Dockerfile and test locally | 🔴 High |
| 5 | Push image to Azure Container Registry | 🟡 Medium |
| 6 | Deploy to Azure Container Apps | 🟡 Medium |
| 7 | Set up GitHub Actions CI/CD pipeline | 🟡 Medium |
| 8 | Add authentication (API key or Azure Entra ID) | 🟡 Medium |
| 9 | Load test with sample PDFs and document performance | 🟢 Low |
| 10 | Create full architecture diagram with all components | 🟢 Low |

🧩 Reference Implementation — FastAPI Service

A minimal, production-leaning implementation of the PDF extraction API:

# main.py
import os
import time

import fitz  # PyMuPDF
from fastapi import FastAPI, File, HTTPException, UploadFile
from fastapi.responses import JSONResponse

MAX_BYTES = 25 * 1024 * 1024  # 25 MB upload limit
app = FastAPI(title="PDF Extract API", version="1.0.0")


@app.get("/health")
def health():
    return {"status": "ok", "version": os.getenv("APP_VERSION", "dev")}


# File uploads require the python-multipart package to be installed.
@app.post("/api/extract")
async def extract(file: UploadFile = File(...)):
    if file.content_type not in ("application/pdf", "application/octet-stream"):
        raise HTTPException(status_code=400, detail="Only PDF accepted")
    # Read one byte past the limit so oversized uploads are rejected
    # without loading the whole file into memory
    data = await file.read(MAX_BYTES + 1)
    if len(data) > MAX_BYTES:
        raise HTTPException(status_code=413, detail="File too large")

    t0 = time.perf_counter()
    try:
        doc = fitz.open(stream=data, filetype="pdf")
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Invalid PDF: {e}")

    # The context manager closes the document even if extraction raises
    with doc:
        pages = [p.get_text() for p in doc]
        meta = doc.metadata or {}
        result = {
            "text": "\n\n".join(pages),
            "page_count": doc.page_count,
            "metadata": {
                "title": meta.get("title"),
                "author": meta.get("author"),
                "creation_date": meta.get("creationDate"),
            },
            "processing_time_ms": int((time.perf_counter() - t0) * 1000),
        }
    return JSONResponse(result)

Deploying to Azure Container Apps

# 0. One-time setup (skip if these already exist); ACR names must be globally unique
ACR=myblogacr
az group create -n rg-pdf -l eastus
az acr create -g rg-pdf -n $ACR --sku Basic

# 1. Build the image in ACR and push it
az acr build -r $ACR -t pdf-extract:v1 .

# 2. Create environment and deploy
az containerapp env create -g rg-pdf -n cae-pdf -l eastus
az containerapp create \
  -g rg-pdf -n ca-pdf-extract \
  --environment cae-pdf \
  --image $ACR.azurecr.io/pdf-extract:v1 \
  --registry-server $ACR.azurecr.io \
  --registry-identity system \
  --target-port 8000 --ingress external \
  --min-replicas 0 --max-replicas 5 \
  --cpu 0.5 --memory 1Gi

Testing

curl -X POST -F "file=@sample.pdf" https://<app-url>/api/extract | jq '.page_count'
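
The same request can be made from Python with only the standard library, which is handy for scripted tests. This is a sketch under assumptions: `build_multipart` and `post_pdf` are hypothetical helpers, and the form field name `file` matches the curl call above:

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/pdf"):
    """Encode one file as multipart/form-data; returns (body, header_value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_pdf(url: str, pdf_path: str, timeout: float = 60.0):
    """Upload a local PDF to the extract endpoint and return the parsed JSON."""
    with open(pdf_path, "rb") as fh:
        body, header = build_multipart("file", os.path.basename(pdf_path), fh.read())
    req = urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": header}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `post_pdf` with the app's `/api/extract` URL and `"sample.pdf"` should return the same JSON that the curl command pipes into `jq`.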

When all TODO items above are ticked and the /api/extract endpoint handles 100 concurrent uploads without errors, flip status: workinprogress to status: published.
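
The 100-concurrent-upload check can be approximated with a small thread-pool harness. This is a rough sketch, not a substitute for a real load-testing tool: `run_concurrent` is a hypothetical helper, and `upload` stands for whatever client function you supply that performs one request and returns the HTTP status code:

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrent(upload, n_requests: int = 100, max_workers: int = 100) -> dict:
    """Call upload(i) n_requests times in parallel and tally the outcomes.

    upload is any callable that performs one request and returns the HTTP
    status code (hypothetical; plug in your own client function).
    """
    def attempt(i):
        try:
            return upload(i)
        except Exception:
            return None  # network errors and timeouts count as failures

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(attempt, range(n_requests)))
    ok = sum(1 for status in statuses if status == 200)
    return {"total": n_requests, "ok": ok, "failed": n_requests - ok}
```

The publishing condition holds when `failed` is 0 for a run of 100 requests. With `--min-replicas 0`, the first requests may include a cold start, so it is worth looking at latency as well as the error count.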