Multimodal AI workloads are breaking Spark and Ray. See how Daft’s streaming model runs 7× faster and more reliably across audio, video, and image pipelines.

Why Multimodal AI Broke the Data Pipeline — And How Daft Is Beating Ray and Spark to Fix It

Multimodal AI workloads break traditional data engines. They need to embed documents, classify images, and transcribe audio, not just run aggregations and joins. These workloads are tough: memory usage balloons mid-pipeline, processing requires both CPU and GPU, and a single machine can't handle the data volume.

This post provides a comprehensive comparison of Daft and Ray Data for multimodal data processing, examining their architectures and performance. Benchmarks across large-scale audio, video, document, and image workloads found Daft ran 2-7x faster than Ray Data and 4-18x faster than Spark, while finishing jobs reliably.

The Multimodal Data Challenge

Multimodal data processing presents unique challenges:

  1. Memory Explosions: A compressed image like a JPEG inflates 20x in memory once decoded (see the quick arithmetic after this list). A single video file can be decoded into thousands of frames, each several megabytes in size.
  2. Heterogeneous Compute: These workloads stress CPU, GPU, and network simultaneously. Processing steps include resampling, feature extraction, transcription, downloading, decoding, resizing, normalizing, and classification.
  3. Data Volume: The benchmarked workloads included 113,800 audio files from Common Voice 17, 10,000 PDFs from Common Crawl, 803,580 images from ImageNet, and 1,000 videos from Hollywood2.
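
To make the inflation concrete, here is the back-of-the-envelope arithmetic for a single 12-megapixel photo (the compressed size is an assumed typical value, not from the benchmarks):

```python
# Decoded size of an RGB image is width * height * channels bytes.
width, height, channels = 4000, 3000, 3
decoded_bytes = width * height * channels   # 36,000,000 bytes (~34 MiB)
jpeg_bytes = 1_800_000                      # assumed ~1.8 MB compressed JPEG
print(decoded_bytes / jpeg_bytes)           # ~20x inflation, matching the claim above
```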

Introducing the Contenders

Daft

Daft is designed to handle petabyte-scale workloads with multimodal data (audio, video, images, text, embeddings) as first-class citizens.

Key features include:

  • Native multimodal operations: Built-in image decoding/encoding/cropping/resizing, text and image embedding/classification APIs, LLM APIs, text tokenization, cosine similarity, URL downloads/uploads, reading video to image frames
  • Declarative DataFrame/SQL API: With schema validation and a query optimizer that automatically handles projection pushdowns, filter pushdowns, and join reordering, optimizations users get "for free" without manual tuning (see the sketch after this list)
  • Comprehensive I/O support: Native readers and writers for Parquet, CSV, JSON, Lance, Iceberg, Delta Lake, and WARC formats, tightly integrated with the streaming execution model
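
As a minimal sketch of what this looks like in practice (the bucket path and column name are hypothetical), downloading, decoding, and resizing images are plain expressions the optimizer can rearrange and pipeline:

```python
import daft

# Hypothetical Parquet file with an "image_url" column.
df = daft.read_parquet("s3://my-bucket/images.parquet")

# Native expressions: no Python UDFs, so the engine controls memory and batching.
df = (
    df.with_column("image", daft.col("image_url").url.download().image.decode())
      .with_column("thumbnail", daft.col("image").image.resize(224, 224))
)

df.show()  # triggers streaming execution of the pipeline
```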

Ray Data

Ray Data is a data processing library built on top of Ray, a framework for building distributed Python applications.

Key features include:

  • Low-level operators: Provides operations like map_batches that work directly on PyArrow record batches or pandas DataFrames (a sketch follows this list)
  • Ray ecosystem integration: Tight integration with Ray Train for distributed training and Ray Serve for model serving
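
For comparison, a minimal Ray Data sketch (the dataset path and column are hypothetical); the transformation logic lives in a user-written function applied via map_batches:

```python
import ray
import pandas as pd

# Hypothetical Parquet dataset with an "image_url" column.
ds = ray.data.read_parquet("s3://my-bucket/images.parquet")

def add_url_length(batch: pd.DataFrame) -> pd.DataFrame:
    # User code operates directly on a pandas batch.
    batch["url_len"] = batch["image_url"].str.len()
    return batch

ds = ds.map_batches(add_url_length, batch_format="pandas")
ds.show(5)
```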

Architecture Deep Dive

Daft's Streaming Execution Model

Daft's architecture revolves around its Swordfish streaming execution engine. Data is always "in flight": batches flow through the pipeline as soon as they are ready. For a partition of 100k images, the first 1000 can be fed into model inference while the next 1000 are being downloaded or decoded. The entire partition never has to be fully materialized in an intermediate buffer.
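
A minimal sketch of such a pipeline, assuming a hypothetical run_model helper standing in for GPU inference; under Swordfish, early batches reach the UDF while later ones are still downloading:

```python
import daft

@daft.udf(return_dtype=daft.DataType.string())
def classify(images):
    # run_model is a hypothetical stand-in for a real inference call.
    return [run_model(img) for img in images.to_pylist()]

df = daft.read_parquet("s3://my-bucket/images.parquet")
df = (
    df.with_column("image", daft.col("image_url").url.download().image.decode())
      .with_column("label", classify(daft.col("image")))
)
df.collect()  # batches stream download -> decode -> inference end to end
```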

Backpressure mechanism: If GPU inference becomes the bottleneck, the upstream steps automatically slow down so memory usage remains bounded.

Adaptive batch sizing: Daft shrinks batch sizes on memory-heavy operations like url_download or image_decode, keeping throughput high without ballooning memory usage.

Flotilla distributed engine: Daft's distributed runner deploys one Swordfish worker per node, enabling the same streaming execution model to scale across clusters.
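
Switching from local execution to Flotilla is a one-line change, sketched here against a hypothetical Ray cluster address; the DataFrame code itself is unchanged:

```python
import daft

# Connect to an existing Ray cluster; Flotilla places one Swordfish
# worker per node and streams between them. The address is hypothetical.
daft.context.set_runner_ray(address="ray://head-node:10001")

df = daft.read_parquet("s3://my-bucket/images.parquet")
df = df.with_column("image", daft.col("image_url").url.download().image.decode())
df.collect()
```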

Ray Data's Execution Model

Ray Data streams data between heterogeneous operations (e.g., CPU → GPU) that users define via classes or resource requests. Within homogeneous stages, it fuses sequential operations into the same task and executes them sequentially, which can cause memory issues without careful tuning of block sizes. You can work around this by using classes instead of functions in map/map_batches (sketched below), but doing so materializes intermediates in Ray's object store, adding serialization and memory-copy overhead. By default, the object store is capped at 30% of machine memory, which can lead to excessive disk spilling.
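
A sketch of that class-based workaround (load_model is a hypothetical helper, and the exact actor-pool arguments vary by Ray version):

```python
import ray
import numpy as np

class Classifier:
    def __init__(self):
        # Expensive setup (e.g., loading a model onto the GPU) runs once
        # per long-lived actor instead of once per task.
        self.model = load_model()  # hypothetical helper

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        batch["label"] = self.model.predict(batch["image"])
        return batch

ds = ray.data.read_parquet("s3://my-bucket/images.parquet")

# Passing a class instead of a function keeps a pool of actors alive,
# but batches now round-trip through Ray's object store between stages.
ds = ds.map_batches(Classifier, concurrency=8, num_gpus=1)
```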

Performance Benchmarks

Based on recent benchmarks conducted on identical AWS clusters (8 x g6.xlarge instances with NVIDIA L4 GPUs, each with 4 vCPUs, 16 GB memory, and 100 GB EBS volume), here's how the two frameworks compare:

| Workload | Daft | Ray Data | Spark |
|----|----|----|----|
| Audio Transcription (113,800 files) | 6m 22s | 29m 20s (4.6x slower) | 25m 46s (4.0x slower) |
| Document Embedding (10,000 PDFs) | 1m 54s | 14m 32s (7.6x slower) | 8m 4s (4.2x slower) |
| Image Classification (803,580 images) | 4m 23s | 23m 30s (5.4x slower) | 45m 7s (10.3x slower) |
| Video Object Detection (1,000 videos) | 11m 46s | 25m 54s (2.2x slower) | 3h 36m (18.4x slower) |

Why Such Large Performance Differences?

Several architectural decisions contribute to Daft's performance advantages:

  1. Native Operations vs Python UDFs: Daft ships native multimodal expressions, including image decoding/encoding/cropping/resizing, text and image embedding/classification APIs, LLM APIs, text tokenization, cosine similarity, URL downloads/uploads, and reading video into image frames, all highly optimized. In Ray Data you must write your own Python UDFs on top of external dependencies like Pillow, NumPy, spaCy, and Hugging Face, which costs extra data movement because each library has its own in-memory format.
  2. Memory Management (Streaming vs Materialization): Daft pushes data through network, CPU, and GPU as a continuous stream without materializing entire partitions. Ray Data fuses sequential operations, which can cause memory issues; the class-based workaround materializes intermediates in the object store, adding serialization and memory-copy overhead.
  3. Resource Utilization: Daft pipelines everything inside a single Swordfish worker, which controls all of the machine's resources. Data streams asynchronously from cloud storage into the CPUs for pre-processing, then into GPU memory for inference, and back out so results can be uploaded, keeping CPUs, GPUs, and the network saturated together. Ray Data, by contrast, reserves a full CPU core by default for I/O-heavy operations like downloading large videos, leaving that core unavailable for CPU-bound work unless you manually tune fractional CPU requests (sketched after this list).
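
A sketch of that fractional-CPU tuning (fetch is a hypothetical download helper):

```python
import ray

def download_videos(batch):
    # I/O-bound: mostly waiting on the network, not burning CPU.
    batch["video_bytes"] = [fetch(url) for url in batch["url"]]  # fetch is hypothetical
    return batch

ds = ray.data.read_parquet("s3://my-bucket/videos.parquet")

# num_cpus=0.25 tells the scheduler this task barely uses a core, so four
# of them can share one vCPU alongside CPU-bound stages.
ds = ds.map_batches(download_videos, num_cpus=0.25)
```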

When to Choose Which?

Based on the benchmark results and architectural differences:

Daft shows significant advantages for:

  • Multimodal data processing (images, documents, video, audio)
  • Workloads requiring reliable execution without extensive tuning
  • Complex queries with joins, aggregations, and multiple transformations
  • Teams preferring DataFrame/SQL semantics

Ray Data may be preferred when:

  • You have tight integration needs with the Ray ecosystem (Ray Train, Ray Serve)
  • You need fine-grained control over CPU/GPU allocation per operation

What Practitioners Are Saying

Is Daft battle-tested enough for production?

When Tim Romanski of Essential AI set out to taxonomize 23.6 billion web documents from Common Crawl (24 trillion tokens), his team pushed Daft to its limits - scaling from local development to 32,000 requests per second per VM. As he shared in a panel discussion: "We pushed Daft to the limit and it's battle tested… If we had to do the same thing in Spark, we would have to have the JVM installed, go through all of its nuts and bolts just to get something running. So the time to get something running in the first place was a lot shorter. And then once we got it running locally, we just scaled up to multiple machines."

What gap does Daft fill in the Ray ecosystem?

CloudKitchens rebuilt their entire ML infrastructure around what they call the "DREAM stack" (Daft, Ray, poEtry, Argo, Metaflow). When selecting their data processing layer, they identified specific limitations with Ray Data and chose Daft to complement Ray's compute capabilities. As their infrastructure team explained, "one issue with the Ray library for data processing, Ray Data, is that it doesn't cover the full range of DataFrame/ETL functions and its performance could be improved." They chose Daft because "it fills the gap of Ray Data by providing amazing DataFrame APIs" and noted that "in our tests, it's faster than Spark and uses fewer resources."

How does Daft perform on even larger datasets?

A data engineer from ByteDance commented on Daft's 300K image processing demonstration, sharing his own experience with an even larger image classification workload: "Not just 300,000 images - we ran image classification evaluations on the ImageNet dataset with approximately 1.28 million images, and Daft was about 20% faster than Ray Data." Additionally, in a separate technical analysis of Daft's architecture, he praised its "excellent execution performance and resource efficiency" and highlighted how it "effortlessly enables streaming processing of large-scale image datasets."

Resources

  • Benchmarks for Multimodal AI Workloads - Primary source for performance data and architectural comparisons
  • Benchmark Code Repository - Open-source code to reproduce all benchmarks
  • Distributed Data Community Slack - Join the community to discuss with Daft developers and users
