Explores MaGGIe's architecture, featuring mask guidance embeddings, progressive refinement (PRM), and bidirectional matte fusion for consistent video results.Explores MaGGIe's architecture, featuring mask guidance embeddings, progressive refinement (PRM), and bidirectional matte fusion for consistent video results.

MaGGIe Architecture Deep Dive: Mask Guidance and Sparse Refinement

2025/12/20 02:15
3 min di lettura
Per feedback o dubbi su questo contenuto, contattateci all'indirizzo crypto.news@mexc.com.

Abstract and 1. Introduction

  1. Related Works

  2. MaGGIe

    3.1. Efficient Masked Guided Instance Matting

    3.2. Feature-Matte Temporal Consistency

  3. Instance Matting Datasets

    4.1. Image Instance Matting and 4.2. Video Instance Matting

  4. Experiments

    5.1. Pre-training on image data

    5.2. Training on video data

  5. Discussion and References

\ Supplementary Material

  1. Architecture details

  2. Image matting

    8.1. Dataset generation and preparation

    8.2. Training details

    8.3. Quantitative details

    8.4. More qualitative results on natural images

  3. Video matting

    9.1. Dataset generation

    9.2. Training details

    9.3. Quantitative details

    9.4. More qualitative results

7. Architecture details

This section delves into the architectural nuances of our framework, providing a more detailed exposition of components briefly mentioned in the main paper. These insights are crucial for a comprehensive understanding of the underlying mechanisms of our approach.

7.1. Mask guidance identity embedding

7.2. Feature extractor

\ Figure 7. Converting Dense-Image to Sparse-Instance Features. We transform the dense image features into sparse, instance-specific features with the help of instance tokens.

7.3. Dense-image to sparse-instance features

7.4. Detail aggregation

This process, akin to a U-net decoder, aggregates features from different scales, as detailed in Fig. 8. It involves upscaling sparse features and merging them with corresponding higher-scale features. However, this requires precomputed downscale indices from dummy sparse convolutions on the full input image.

7.5. Sparse matte head

Our matte head design, inspired by MGM [56], comprises two sparse convolutions with intermediate normalization and activation (Leaky ReLU) layers. The final output undergoes sigmoid activation for the final prediction. Non-refined locations in the dense prediction are assigned a value of zero.

7.6. Sparse progressive refinement

The PRM module progressively refines A8 → A4 → A1 to have A. We assume that all predictions are rescaled to the largest size and perform refinement between intermediate predictions and uncertainty indices U:

\

7.7. Attention loss and loss weight

\ Figure 8. Detail Aggregation Module merges sparse features across scales. This module equalizes spatial scales of sparse features using inverse sparse convolution, facilitating their combination.

\ Figure 9. Temporal Sparsity Between Two Consecutive Frames. The top row displays a pair of successive frames. Below, the second row illustrates the predicted differences by two distinct frameworks, with areas of discrepancy emphasized in white. In contrast to SparseMat’s output, which appears cluttered and noisy, our module generates a more refined sparsity map. This map effectively accentuates the foreground regions that undergo notable changes between the frames, providing a clearer and more focused representation of temporal sparsity. (Best viewed in color).

7.8. Temporal sparsity prediction

A key aspect of our approach is the prediction of temporal sparsity to maintain consistency between frames. This module contrasts the feature maps of consecutive frames to predict their absolute differences. Comprising three convolution layers with batch normalization and ReLU activation, this module processes the concatenated feature maps from two adjacent frames and predicts the binary differences between them.

\ Unlike SparseMat [50], which relies on manual threshold selection for frame differences, our method offers a more robust and domain-independent approach to determining frame sparsity. This is particularly effective in handling variations in movement, resolution, and domain between frames, as demonstrated in Fig. 9

7.9. Forward and backward matte fusion

\ This fusion enhances temporal consistency and minimizes error propagation.

\

:::info Authors:

(1) Chuong Huynh, University of Maryland, College Park (chuonghm@cs.umd.edu);

(2) Seoung Wug Oh, Adobe Research (seoh,jolee@adobe.com);

(3) Abhinav Shrivastava, University of Maryland, College Park (abhinav@cs.umd.edu);

(4) Joon-Young Lee, Adobe Research (jolee@adobe.com).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

\

Opportunità di mercato
Logo DeepBook
Valore DeepBook (DEEP)
$0.026573
$0.026573$0.026573
-4.36%
USD
Grafico dei prezzi in tempo reale di DeepBook (DEEP)
Disclaimer: gli articoli ripubblicati su questo sito provengono da piattaforme pubbliche e sono forniti esclusivamente a scopo informativo. Non riflettono necessariamente le opinioni di MEXC. Tutti i diritti rimangono agli autori originali. Se ritieni che un contenuto violi i diritti di terze parti, contatta crypto.news@mexc.com per la rimozione. MEXC non fornisce alcuna garanzia in merito all'accuratezza, completezza o tempestività del contenuto e non è responsabile per eventuali azioni intraprese sulla base delle informazioni fornite. Il contenuto non costituisce consulenza finanziaria, legale o professionale di altro tipo, né deve essere considerato una raccomandazione o un'approvazione da parte di MEXC.

Trading GOLD per 1,000,000 USDT

Trading GOLD per 1,000,000 USDTTrading GOLD per 1,000,000 USDT

0 commissioni, leva fino 1,000x, liquidità profonda