Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

2025/11/20 00:00
Okuma süresi: 2 dk
Bu içerikle ilgili geri bildirim veya endişeleriniz için lütfen crypto.news@mexc.com üzerinden bizimle iletişime geçin.

Abstract and 1 Introduction

  1. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  2. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  3. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  4. Conclusion and References

\ Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

\ Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, QFormer is initialized as a BERT-based model[8] comprising a total of L = 12 layers. In contrast to typical BERT models that process textual inputs, QFormer takes R = 32 learnable query embeddings as inputs. These embeddings are utilized to extract visual information from the input visual data during Stage-1 pretraining in BLIP2[22]. Subsequently, they serve as visual prompt embeddings for the LLM inputs after projection.

\ Inside the QFormer, each layer includes a self-attention module composed of a Multi-Head Attention component and a Forward module (consisting of Linear, LayerNorm, and Residual Connection). The cross-attention module, initialized with random values, is inserted every G layers, where learnable query embeddings interact with visual embeddings. In the main paper, for the sake of conciseness, we condensed the representation of the multi-head attention and forward modules into self(cross) attention modules. Furthermore, we exclusively illustrated the modifications made to the cross-attention module in MIVPG, as the self-attention modules remain unchanged. The final QFormer output is represented by the last layer’s query embeddings.

\ For a more comprehensive understanding, readers are encouraged to refer to [22].

\

:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

\

Piyasa Fırsatı
Prompt Logosu
Prompt Fiyatı(PROMPT)
$0.03915
$0.03915$0.03915
-4.23%
USD
Prompt (PROMPT) Canlı Fiyat Grafiği
Sorumluluk Reddi: Bu sitede yeniden yayınlanan makaleler, halka açık platformlardan alınmıştır ve yalnızca bilgilendirme amaçlıdır. MEXC'nin görüşlerini yansıtmayabilir. Tüm hakları telif sahiplerine aittir. Herhangi bir içeriğin üçüncü taraf haklarını ihlal ettiğini düşünüyorsanız, kaldırılması için lütfen crypto.news@mexc.com ile iletişime geçin. MEXC, içeriğin doğruluğu, eksiksizliği veya güncelliği konusunda hiçbir garanti vermez ve sağlanan bilgilere dayalı olarak alınan herhangi bir eylemden sorumlu değildir. İçerik, finansal, yasal veya diğer profesyonel tavsiye niteliğinde değildir ve MEXC tarafından bir tavsiye veya onay olarak değerlendirilmemelidir.

Ayrıca Şunları da Beğenebilirsiniz

OpenClaw AI Agent Takes China by Storm: Understanding the Viral Phenomenon

OpenClaw AI Agent Takes China by Storm: Understanding the Viral Phenomenon

OpenClaw AI agent dominates China with Baidu and Tencent hosting public events, but security warnings and rising token costs present challenges. The post OpenClaw
Paylaş
Blockonomi2026/03/19 20:07
UK FCA Plans to Waive Some Rules for Crypto Companies: FT

UK FCA Plans to Waive Some Rules for Crypto Companies: FT

The post UK FCA Plans to Waive Some Rules for Crypto Companies: FT appeared on BitcoinEthereumNews.com. The U.K.’s Financial Conduct Authority (FCA) has plans to waive some of its rules for cryptocurrency companies, according to a Financial Times (FT) report on Wednesday. However, in another areas the FCA intends to tighten the rules where they pertain to industry-specific risks, such as cyber attacks. The financial watchdog wishes to adapt its existing rules for financial service companies to the unique nature of cryptoassets, the FT reported, citing a consultation paper published Wednesday. “You have to recognize that some of these things are very different,” David Geale, the FCA’s executive director for payments and digital finance, said in an interview, according to the report, adding that a “lift and drop” of existing traditional finance rules would not be effective with crypto. One such area that may be handled differently is the stipulation that a firm “must conduct its business with integrity” and “pay due regard to the interest of its customers and treat them fairly.” Crypto companies would be given less strict requirements than banks or investment platforms on rules concerning senior managers, systems and controls, as cryptocurrency firms “do not typically pose the same level of systemic risk,” the FCA said. Firms would also not have to offer customers a cooling off period due to the voltatile nature of crypto prices, nor would technology be classed as an outsourcing arrangement requiring extra risk management. This is because blockchain technology is often permissionless, meaning anyone can participate without the input of an intermediary. Other areas of crypto regulation remain undecided. The FCA has plans to fully integrate cryptocurrency into its regulatory framework from 2026. Source: https://www.coindesk.com/policy/2025/09/17/uk-fca-plans-to-waive-some-rules-for-crypto-companies-ft
Paylaş
BitcoinEthereumNews2025/09/18 04:15
Sweet Niblets! Official Trailer Drops For ‘Hannah Montana 20th Anniversary Special’

Sweet Niblets! Official Trailer Drops For ‘Hannah Montana 20th Anniversary Special’

Disney+ and Hulu dropped the official trailer for the highly anticipated “Hannah Montana 20th Anniversary Special.” “Hannah Montana 20th Anniversary Special” will
Paylaş
TechFinancials2026/03/19 19:57