

Every few years, enterprise software gets a new interaction model. GUIs replaced CLIs. Mobile reshaped expectations. APIs let machines talk to each other. Each time, platforms that had the right abstraction layer adapted. Platforms that had poured their intelligence into the previous layer scrambled.
AI agents are the next operator in media workflows. The platforms that will serve them well are the ones that have the intelligence layer that agents can use.
When I started FileSpin in 2017, the bet was straightforward: build media infrastructure as an API layer first. Not an application with an API bolted on, but an API surface that happened to also have a UI. Operations like upload, tag, transform, watermark, transcode, share, and search were composable, programmable endpoints. The UI was a convenience. The API was the product.
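What "composable endpoints" means in practice can be sketched in a few lines. The function names and record shape below are illustrative stand-ins, not FileSpin's actual API: the point is that when every operation takes and returns the same asset record, any ordering the caller chooses is valid.

```python
# A minimal sketch of API-first composition: each media operation is a
# function over an asset record, so a workflow is just chained calls.
# Operation names and fields here are hypothetical, not FileSpin's API.

def upload(asset_id: str) -> dict:
    """Register an asset; returns a record downstream operations extend."""
    return {"id": asset_id, "tags": [], "variants": []}

def tag(asset: dict, *labels: str) -> dict:
    """Append labels without mutating the input record."""
    return {**asset, "tags": asset["tags"] + list(labels)}

def transform(asset: dict, variant: str) -> dict:
    """Record a derived rendition alongside the original."""
    return {**asset, "variants": asset["variants"] + [variant]}

# Because every operation shares one record shape, a human script and an
# AI agent compose them the same way.
record = transform(tag(upload("sku-1001"), "summer", "dress"), "thumb_400")
print(record)
```

A UI-first platform buries this chaining inside click handlers; an API-first one exposes it directly, which is what an agent needs.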
That architectural decision is why, when we shipped our MCP-native stack earlier this month, it wasn't a rebuild. It was a natural extension.
Most DAM platforms spent the last decade investing in UIs. Drag-and-drop workflow builders. Visual asset browsers with filters and facets. Impressive in a demo. But an agent doesn't click buttons or drag thumbnails (well, it can now, but why should it when it can do better?). It reasons about operations and calls tools. If your platform's intelligence lives in the UI layer, if that's where the business logic, the validation, and the workflow orchestration reside, it does not transfer to an agent. You end up rebuilding from scratch, or shipping a thin MCP wrapper around a few basic endpoints and calling it "agentic."
When marketing says "agentic," the reality is too often a handful of CRUD endpoints behind an MCP facade, with no domain knowledge, no schema awareness, and no ability to compose multi-step operations.
The entire FileSpin media stack now operates natively through the Model Context Protocol: On-Demand Imaging, AI auto-tagging, background removal, face recognition, dynamic watermarking, video transcoding, and branded share pages. AI agents from Claude, ChatGPT, and Mistral can execute media operations through natural language.
This isn't a connector sitting on top of existing endpoints. The MCP tool layer carries domain knowledge with 37 tools (24 read-only, 13 write operations), each mapping to a composable API primitive with full schema context. When an agent receives a multi-step instruction, it understands the dependency chain: a tag update happens after the transformation succeeds, and a share page gets assembled with the right permissions without being told how.
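Concretely, "full schema context" means each tool declares a JSON Schema for its inputs, so an agent can validate arguments and plan dependency chains before calling anything. The tool name and fields below are hypothetical, not FileSpin's real tool catalog; the required-field check stands in for full JSON Schema validation.

```python
# Illustrative MCP-style tool definition: the schema is what lets an
# agent reason about the call before making it. Names are hypothetical.

REMOVE_BACKGROUND = {
    "name": "remove_background",
    "description": "Create a background-removed variant; originals stay intact.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "asset_id": {"type": "string"},
            "output_format": {"type": "string", "enum": ["png", "webp"]},
        },
        "required": ["asset_id"],
    },
    "readOnly": False,  # write tools are marked, read-only tools separated
}

def validate_call(tool: dict, args: dict) -> bool:
    """Minimal required-field check; a real server runs full JSON Schema."""
    required = tool["inputSchema"].get("required", [])
    return all(key in args for key in required)

print(validate_call(REMOVE_BACKGROUND, {"asset_id": "sku-1001"}))  # True
print(validate_call(REMOVE_BACKGROUND, {}))                        # False
```

A thin wrapper that exposes endpoints without schemas leaves the agent guessing at arguments; schema-carrying tools let it compose chains correctly on the first attempt.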
That same infrastructure connects to Make, Zapier, and n8n for repeatable workflows. Engineering teams still have the full REST API. Three paths in. Same platform. Same governance.
Here's a workflow I ran recently. Three eCommerce product images — a summer dress, a running shoe, a leather tote bag — sitting in FileSpin with our E-Commerce & Retail Asset Schema applied. Each carries structured metadata: product name, brand, category, SKU, colour, season, shot type, status. All three marked status: approved.
I opened Claude Desktop, connected to both the FileSpin MCP Server and the Airtable MCP Server, and typed one sentence:
"Use FileSpin and find all approved images from last week, remove background and sync asset and metadata with Airtable FileSpin assets table."
The agent figured out the rest. It searched FileSpin for assets with status: approved from the past week. Found all three. Removed backgrounds from all three in parallel — generating new _bg_removed.png variants stored alongside the originals. Then it discovered the Airtable "Approvals" base, located the "FileSpin Assets" table, read its schema, confirmed the field mapping, checked for existing records to avoid duplicates, and wrote three rows with the full metadata and background-removal status.
Two MCP servers. Five distinct operations. One sentence. Under two minutes.
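Written out as explicit orchestration, the agent's plan looks roughly like this. Every function is a stand-in for an MCP tool call on one of the two servers; the names are hypothetical and the data is stubbed, because the control flow (search, transform, discover, dedupe, write) is the point.

```python
# The agent's plan as explicit steps. Functions stand in for MCP tool
# calls; names and data are illustrative, not real FileSpin/Airtable APIs.

def search_assets(status: str, days: int) -> list[dict]:
    """Stand-in for a FileSpin search tool call."""
    return [{"id": f"sku-{n}", "status": status} for n in (1, 2, 3)]

def remove_background(asset: dict) -> dict:
    """Stand-in for background removal; a new variant, original untouched."""
    return {**asset, "variant": f"{asset['id']}_bg_removed.png"}

def upsert_airtable_row(existing: set, asset: dict) -> bool:
    """Stand-in for an Airtable write; True only if a new row was created."""
    if asset["id"] in existing:
        return False  # the dedupe check the agent performed
    existing.add(asset["id"])
    return True

approved = search_assets("approved", days=7)
processed = [remove_background(a) for a in approved]  # parallel in practice
existing_rows: set = set()
created = sum(upsert_airtable_row(existing_rows, a) for a in processed)
print(created)  # 3 rows written; a re-run would write 0
```

The value of the agent is that nobody had to write this script; the value of the platform is that the operations were legible enough for the agent to derive it from one sentence.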
What made this possible isn't the MCP wrapper; it's that the agent could read FileSpin's metadata schemas, understand the data model, and compose a cross-platform workflow without any pre-built integration between FileSpin and Airtable. The schema was legible. The operations were composable. The agent did the orchestration.
Equally important is what agents cannot do. The MCP primitives are deliberately constrained. Destructive operations simply don't exist as tools the agent can call. Every metadata write goes through schema validation. Agents operate under the same RBAC (Role Based Access Control) permissions as the user who invoked them. In the demo above, the background removal created new assets while the originals stayed intact, with full version lineage.
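A minimal sketch of what "deliberately constrained primitives" can look like: the tool list an agent sees is filtered by the invoking user's role before the agent ever plans, and destructive operations are simply absent from the catalog. Roles and tool names here are illustrative, not FileSpin's actual RBAC model.

```python
# Guardrails as infrastructure: agents only ever see tools their invoking
# user is permitted to call. Tool names and roles are hypothetical.

TOOLS = {
    "search_assets":     {"write": False},
    "remove_background": {"write": True},
    "update_metadata":   {"write": True},
    # Note what is NOT here: no "delete_asset" tool exists at all, so no
    # prompt, however adversarial, can make an agent call it.
}

ROLE_CAN_WRITE = {"viewer": False, "editor": True}

def visible_tools(role: str) -> list[str]:
    """Agents inherit the caller's RBAC: read-only roles never see writes."""
    can_write = ROLE_CAN_WRITE[role]
    return [name for name, tool in TOOLS.items() if can_write or not tool["write"]]

print(visible_tools("viewer"))
print(visible_tools("editor"))
```

Filtering at tool-listing time, rather than rejecting calls after the fact, means the constraint shapes the agent's plan instead of surfacing as a runtime error.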
The guardrails live in the infrastructure itself, not in a system prompt that politely asks the agent to behave. I wrote about why this matters and how we built it.
The gap between agent ambition and agent governance is where things break. FileSpin's position is that governance isn't a feature you add after adoption; it's the reason adoption happens at all.
FileSpin's intelligence has always lived in the API layer. The domain knowledge (how transformations chain, how metadata schemas validate, how share permissions compose) was already expressed as programmable operations. We have a UI, and we've recently overhauled it, but the depth is in the API surface.
When MCP emerged, we didn't need to figure out how to expose our platform to agents. We needed to make agents effective operators of a media stack they'd never seen before. That was possible because the domain intelligence already existed in the right layer.
The Airtable workflow is a good illustration.
The agent didn't just call FileSpin endpoints; it read the E-Commerce & Retail Asset Schema, understood which fields existed (brand, SKU, season, shot type), matched them against Airtable's table structure, and mapped the data across without any pre-built connector between the two systems.
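The cross-system mapping step can be sketched as a small helper: given two schemas as field lists, align them by normalized name. The field names come from the article's example; the matching logic is illustrative, a simplification of what an agent does when it reads both schemas.

```python
# Schema-driven field mapping between two systems. Field names follow the
# article's example; the normalization rule is an illustrative stand-in
# for an agent's schema-matching reasoning.

def normalize(name: str) -> str:
    """Canonical form so 'Shot Type', 'shot_type', 'shot-type' all match."""
    return name.lower().replace(" ", "_").replace("-", "_")

def map_fields(source_fields: list[str], target_fields: list[str]) -> dict:
    """Return source -> target field pairs that share a normalized name."""
    targets = {normalize(f): f for f in target_fields}
    return {f: targets[normalize(f)] for f in source_fields
            if normalize(f) in targets}

filespin_fields = ["brand", "SKU", "season", "shot_type"]
airtable_fields = ["Brand", "Sku", "Season", "Shot Type", "Notes"]
print(map_fields(filespin_fields, airtable_fields))
```

With no schema to read on either side, there is nothing to normalize against, which is why this step is impossible for a wrapper around opaque endpoints.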
A thin MCP wrapper can't do this. Schema awareness, structured metadata, and composable operations are preconditions for cross-platform agent workflows — and they come from building API-first.
MCP-native is the latest step, not the only one. Earlier this year we launched Teleport — our hybrid cloud-edge runtime that processes media locally for venues and field operations where connectivity isn't guaranteed. The conviction behind it is the same: media infrastructure should work where and how your operations actually run, not where the network happens to cooperate.
The platforms that matter next will share three qualities. If you're evaluating media infrastructure — ours or anyone else's — these are the questions worth asking:
Is the API surface deep enough for agents to reason about? Not a handful of CRUD endpoints wrapped in MCP. Full domain knowledge — transformations, metadata schemas, permissions, workflows — expressed as tools an agent can compose. If the agent can't chain operations intelligently without hand-holding, the integration is cosmetic.
Does governance make autonomous operations trustworthy? RBAC, schema validation, audit trails, version control — baked into the infrastructure, not bolted on as an afterthought. If you have to trust the agent's system prompt to enforce your rules, you don't have governance. You have hope.
Does it work where your operations actually run? Cloud-only is a single point of failure. Event venues, retail floors, remote locations — media operations don't pause when the connection drops. Hybrid processing isn't a nice-to-have; for many industries, it's the baseline.
We built FileSpin around these principles before they were trending. The architecture didn't change because the interaction model did. It just became more obvious why it was built this way.