DataOS’ Data Movement: Built for Context and Speed

November 24, 2025

Dashboards can work with stale snapshots and raw tables. AI can't. AI needs context: where data came from, how schemas evolved, what transformations were applied. Without that, models become black boxes no one can trust or explain. Readiness must start at ingestion, where data enters your systems.

Traditional ingestion tools weren't built for this. They move rows efficiently but strip away the operational context AI needs to function, and they don't capture the metadata AI requires by default, so teams spend weeks reconstructing it manually. That gap is what keeps AI initiatives bottlenecked: not model capability, but data readiness.

Introducing the DataOS Data Movement Engine

Today we’re announcing the Data Movement Engine in DataOS, the foundation for AI readiness that starts at ingestion. Unlike tools that simply copy rows from A to B, DataOS’ Data Movement Engine moves data together with its metadata and schema history, so analytics, agents, and applications can act with confidence. It connects to databases (batch or CDC), warehouses, lakes, APIs, and streams, and writes to tables and warehouses.

Context on Arrival: The Data Movement Engine lands datasets with their context intact. It writes schema versions, row counts, run IDs, timestamps, and audit metadata alongside the data. AI models and analytics start with explainable inputs, and provenance is recorded automatically so downstream systems know exactly what they are using and where it came from.
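
To make that concrete, here is a minimal sketch of the kind of per-run provenance record that could land alongside a dataset. The field names and values below are illustrative assumptions, not the engine's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical per-run provenance record; field names are
# illustrative, not the engine's actual schema.
@dataclass
class RunProvenance:
    run_id: str             # unique identifier for the ingestion run
    source: str             # where the data came from
    schema_version: int     # source schema version at extraction time
    row_count: int          # rows written in this run
    extracted_at: datetime  # when the data left the source
    landed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

# Example: the record a downstream model or agent could inspect.
record = RunProvenance(
    run_id="run-2025-11-24-001",
    source="postgres://analytics-replica/orders",
    schema_version=7,
    row_count=1_360_000,
    extracted_at=datetime(2025, 11, 24, 6, 0, tzinfo=timezone.utc),
)
```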

Operational Visibility Built In: The engine emits Prometheus metrics for throughput, duration, error counts, CPU and memory usage, and CDC lag and offsets. These feed the DataOS metrics service and appear in Grafana dashboards and alerts. It exposes run status, offsets, and health through structured per-run logs. Workflow and configuration changes are fully audited, capturing who changed what and when.
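
As a rough illustration of that pattern (not the engine's internal code), the same signals can be emitted with the standard prometheus_client Python library. The metric names here are assumptions:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Illustrative metric definitions mirroring the signals above;
# the metric names are assumptions, not DataOS' actual series.
ROWS_MOVED = Counter("dm_rows_moved_total", "Rows moved", ["pipeline"])
RUN_DURATION = Histogram("dm_run_duration_seconds", "Run duration", ["pipeline"])
CDC_LAG = Gauge("dm_cdc_lag_seconds", "CDC replication lag", ["pipeline"])

start_http_server(9108)  # expose /metrics for Prometheus to scrape

# Record one hypothetical run of an "orders" pipeline.
with RUN_DURATION.labels(pipeline="orders").time():
    ROWS_MOVED.labels(pipeline="orders").inc(152_000)
CDC_LAG.labels(pipeline="orders").set(0.4)  # sub-second lag, in seconds
```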

Open and Extensible: The Data Movement Engine ships with a connector library and a lightweight Python framework for custom integrations. Teams can add new sources and destinations in days without disrupting existing pipelines.
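
The framework's actual interface isn't shown in this post, but a custom source in a framework like this typically reduces to a small read-in-batches contract. A minimal sketch, with hypothetical class and method names:

```python
from typing import Iterator
import csv

# Hypothetical shape of a custom source connector; the class and
# method names are illustrative, not the actual DataOS framework API.
class CsvSource:
    """Reads a CSV file and yields rows in fixed-size batches."""

    def __init__(self, path: str, batch_size: int = 10_000):
        self.path = path
        self.batch_size = batch_size

    def read_batches(self) -> Iterator[list[dict]]:
        with open(self.path, newline="") as f:
            batch: list[dict] = []
            for row in csv.DictReader(f):
                batch.append(row)
                if len(batch) >= self.batch_size:
                    yield batch
                    batch = []
            if batch:  # flush the final partial batch
                yield batch
```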

Unified Path for Batch and CDC: Batch extracts and CDC run through the same engine with similar configuration, retries, and recovery. This means fewer moving parts and no duplicate pipelines. Observability is unified across both modes, with consistent metrics, run history, and offsets for simpler monitoring and audits. SLAs and failure semantics remain consistent across all ingestion types.

Declarative by Default: Pipelines are declared in YAML, making every source connection, incremental strategy, and destination explicit and version-controlled. This removes hidden logic and undocumented scripts and replaces them with specifications teams can read, review, and modify with confidence.
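
To give a flavor of what such a spec can look like, here is a hypothetical example; the keys below are illustrative, not the actual DataOS manifest schema:

```yaml
# Illustrative pipeline spec; keys are hypothetical, not the
# actual DataOS manifest schema.
pipeline: orders-to-lakehouse
source:
  type: postgres
  connection: postgres://analytics-replica/orders
  mode: cdc                 # or "batch"; both run through the same engine
incremental:
  strategy: log-based       # resume from recorded offsets after failure
destination:
  type: lakehouse           # Apache Iceberg tables
  table: sales.orders
```

Because the spec lives in version control, changes to sources, incremental strategies, or destinations show up in review like any other diff.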

Built for Performance: The Data Movement Engine combines high throughput with AI-ready data from day one. Benchmark testing with synthetic TPC-H datasets showed strong performance across configurations:

  • PostgreSQL batch: 136K records/sec average, peaking at 152K records/sec
  • GCP-backed DataOS Lakehouse: 104K records/sec average
  • Azure-backed DataOS Lakehouse: 98K records/sec average

For real-time workloads, CDC configurations process thousands of events per second with sub-second lag, maintaining predictable performance even under load.

The Value Unlocked: Faster, Simpler, AI-Ready

The shift from traditional ingestion to the Data Movement Engine changes the economics and speed of delivering AI-ready data.

Lower infrastructure costs: Infrastructure spins up only when data moves, then shuts down. Costs scale with usage instead of capacity planning, eliminating idle spend and unpredictable pricing.

AI-ready from ingestion: Provenance and schema history are captured as data lands, not stitched together later. Datasets arrive explainable and audit-ready, so AI models can trust their inputs immediately.

Accelerated source onboarding: New sources integrate in days instead of quarters. The platform keeps pace with business needs instead of holding them back.

The Data Movement Engine is available now as part of DataOS. It supports destinations including DataOS Lakehouse (Apache Iceberg), PostgreSQL, Microsoft SQL Server, Snowflake, BigQuery, and Redshift, with the flexibility to add custom destinations.

The foundation for AI-ready data isn't built downstream. It's built at ingestion, where context and meaning travel with every table.

Ready to make your data AI-ready from the first mile? Contact us to schedule a demo.
