PyJuice dramatically accelerates the training and inference of probabilistic circuits (PCs), outperforming established frameworks like SPFlow, EiNet, and Juice.jl across multiple benchmarks. Leveraging GPU efficiency, PyJuice handles billion-edge models without exhausting memory, scales sparse and block-sparse architectures, and enables fine-tuning of large HMMs and image models that were previously impractical. The results show PyJuice not only reduces runtime by orders of magnitude but also advances the state-of-the-art in PC modeling at scale.

PyJuice Pushes HMMs and Image Models Beyond State-of-the-Art

Abstract and 1. Introduction

  2. Preliminaries and Related Work

  3. Key Bottlenecks in PC Parallelization

  4. Harnessing Block-Based PC Parallelization

    4.1. Fully Connected Sum Layers

    4.2. Generalizing To Practical Sum Layers

    4.3. Efficient Implementations by Compiling PC Layers

    4.4. Analysis: IO and Computation Overhead

  5. Optimizing Backpropagation with PC Flows

  6. Experiments

    6.1. Faster Models with PyJuice

    6.2. Better PCs At Scale

    6.3. Benchmarking Existing PCs

  7. Conclusion, Acknowledgements, Impact Statement, and References

A. Algorithm Details

B. Additional Technical Details

C. Experimental Details

D. Additional Experiments


6.1. Faster Models with PyJuice

We first benchmark the runtime of PyJuice on four commonly used PC structures: PD (Poon & Domingos, 2011), RAT-SPN (Peharz et al., 2020b), HCLT (Liu & Van den Broeck, 2021), and HMM (Rabiner & Juang, 1986). For all models, we record the time to process 60,000 samples (including the forward pass, the backward pass, and mini-batch EM updates). We vary their structural hyperparameters and create five PCs for every structure, with sizes (i.e., numbers of edges) ranging from 500K to 2B. We compare against four baselines: SPFlow (Molina et al., 2019), EiNet (Peharz et al., 2020a), Juice.jl (Dang et al., 2021), and Dynamax (Murphy et al., 2023). Dynamax is dedicated to state space models, so it is only used to run HMMs; SPFlow and EiNet are excluded from the HMM results because we are unable to construct homogeneous HMMs with these frameworks, owing to the need to share the transition and emission parameters across time steps. We describe how PyJuice implements PCs with tied parameters in Appendix A.3. All experiments in this subsection are carried out on an RTX 4090 GPU with 24GB of memory.
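The homogeneous HMMs benchmarked here share a single transition table and a single emission table across all time steps, and each timed forward pass computes sequence likelihoods. As a point of reference for what that pass computes, below is a minimal pure-Python forward algorithm with rescaling; the function name and rescaling scheme are ours for illustration, not part of PyJuice:

```python
import math

def hmm_log_likelihood(obs, init, trans, emit):
    """Forward algorithm: log p(obs) for a homogeneous HMM.

    init[i]     -- initial state probabilities
    trans[i][j] -- P(state j at t+1 | state i at t), shared across time steps
    emit[i][o]  -- P(observation o | state i), shared across time steps
    """
    n = len(init)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for o in obs[1:]:
        # Rescale to avoid underflow on long sequences; accumulate the
        # log of each scale factor so log p(obs) is recovered exactly.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return log_lik + math.log(sum(alpha))
```

GPU frameworks run this same recursion batched over thousands of sequences at once; the scaled form is what makes 32-token sequences with thousands of hidden states numerically stable.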

Table 1 reports the runtime in seconds per epoch with mini-batch EM. PyJuice is orders of magnitude faster than all baselines on both small and large PCs. Further, we observe that most baselines exhaust the 24GB of GPU memory on larger PCs (indicated by “OOM” in the table), while PyJuice can still train these models efficiently. Additionally, in Appendix D.1, we show the efficiency of the compilation process: for example, it takes only ∼8.7s to compile an HCLT with 159M edges. Note that we compile a PC only once and then reuse the compiled structure for both training and inference.

Figure 6. Comparison of memory efficiency. We take two PCs (i.e., an HCLT with 159M edges and an HMM with 130M edges) and record GPU memory usage under different batch sizes.[7]

In Figure 6, we take two PCs and show their GPU memory consumption under different batch sizes. The results demonstrate that PyJuice is more memory-efficient than the baselines, especially at large batch sizes (note that a constant-size buffer is always required to store the parameters).

We move on to benchmark PyJuice on block-sparse PCs. We create a sum layer with 209M edges (see Appx. C.1 for details) and partition its sum nodes and input product nodes into blocks of 32 nodes each. We then randomly discard blocks of 32×32 edges, yielding block-sparse layers. As shown in Figure 7, as the fraction of removed edge blocks increases, the runtime of both the forward and the backward pass decreases significantly. This motivates future work on PC modeling to focus on designing effective block-sparse PCs.
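To make the block-sparse computation pattern concrete, the sketch below multiplies a vector by a matrix stored only as its kept blocks; dropped blocks are never touched, which is where the runtime savings in Figure 7 come from. This is an illustrative CPU sketch with names of our choosing, not PyJuice's GPU kernels:

```python
def blocksparse_matvec(blocks, x, n_rows, bs):
    """y = A @ x, where A is represented only by its kept blocks.

    blocks: dict mapping (block_row, block_col) -> bs x bs list of lists.
    Dropped blocks are implicitly zero and are skipped entirely, so the
    cost scales with the fraction of kept edge blocks.
    """
    y = [0.0] * n_rows
    for (br, bc), blk in blocks.items():
        for i in range(bs):
            acc = 0.0
            for j in range(bs):
                acc += blk[i][j] * x[bc * bs + j]
            y[br * bs + i] += acc
    return y
```

On a GPU, each kept block maps naturally to one thread block operating on a dense 32×32 tile, so block sparsity preserves coalesced memory access while skipping work, unlike unstructured sparsity.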

Figure 7. Runtime of a block-sparse sum layer as a function of the fraction of kept (non-dropped) edge blocks. The error bars represent standard deviations over 5 runs.

Finally, we evaluate the runtime of sparse PCs. We adopt the PC pruning algorithm proposed by Dang et al. (2022) to prune two HCLTs with 10M and 40M edges, respectively. We compare only against Juice.jl, since none of the other implementations support sparse PCs. As shown in Figure 8, PyJuice is consistently faster than Juice.jl, although the gap narrows when over 90% of the edges are pruned. Note that with sparse PCs, PyJuice cannot fully benefit from the block-based parallelization strategy described in Section 4, yet it still outperforms the baseline.
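Dang et al. (2022) prune edges based on circuit flows; purely to illustrate the irregular sparsity that pruning induces, the sketch below applies a simpler magnitude criterion to a single sum node (function name and criterion are ours, not the paper's algorithm):

```python
def prune_sum_node(weights, keep_frac):
    """Keep the top `keep_frac` fraction of a sum node's edges by weight
    and renormalize the survivors so they still sum to one.

    Returns a dict {child_index: new_weight}.
    """
    n_keep = max(1, round(len(weights) * keep_frac))
    kept = sorted(range(len(weights)), key=lambda i: -weights[i])[:n_keep]
    total = sum(weights[i] for i in kept)
    return {i: weights[i] / total for i in sorted(kept)}
```

The surviving child indices form an irregular set that differs per sum node, so kernels can no longer process edges in dense 32×32 tiles; this is why the speedup over Juice.jl shrinks at high pruning ratios.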

Table 2. Perplexity of HMM language models trained on the CommonGen benchmark (Lin et al., 2020).

Figure 8. Runtime per epoch (with 60K samples) of two sparse HCLTs with different fractions of pruned edges. The error bars represent standard deviations over 5 runs.

6.2. Better PCs At Scale

This section demonstrates how PyJuice can improve the state of the art simply by training larger PCs for more epochs, thanks to its speed and memory efficiency. Specifically, we take the HMM language model proposed by Zhang et al. (2023) and the image model introduced by Liu et al. (2023c) as two examples.

HMM language models. Zhang et al. (2023) use the Latent Variable Distillation (LVD) technique (Liu et al., 2023a) to train an HMM with 4096 hidden states on sequences of 32 word tokens. Specifically, LVD uses deep generative models to obtain a set of “good” initial parameters for the HMM. The HMM language model is then fine-tuned on the CommonGen dataset (Lin et al., 2020) and subsequently used to control the generation process of (large) language models in constrained generation tasks. Following the same procedure, we use PyJuice to fine-tune two HMMs with hidden sizes 4096 and 8192, respectively.

As shown in Table 2, with the same HMM of 4096 hidden states, PyJuice improves the perplexity by ∼1.0, simply by running many more epochs in less time than the original implementation. Training a larger HMM with 8192 hidden states improves the perplexity by a further 0.16. We refer the reader to Appendix C.2 for more details.
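Perplexity, the metric in Table 2, is a direct transform of the average per-token log-likelihood the HMM assigns to held-out text. For reference (function name is ours):

```python
import math

def perplexity(total_log_lik, n_tokens):
    """Perplexity = exp of the negative average per-token log-likelihood (nats)."""
    return math.exp(-total_log_lik / n_tokens)
```

For instance, a model that assigns every token probability 1/4 has perplexity 4, so a ∼1.0 drop in perplexity corresponds to a uniformly higher per-token likelihood on the test set.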

Sparse image models. Liu et al. (2023c) design a PC learning algorithm for image data that separately trains two sets of PCs: a set of sparse patch-level PCs (e.g., over 4×4 patches) and a top-level PC that aggregates the outputs of the patch-level PCs. In the final training step, the PCs are meant to be assembled and jointly fine-tuned. However, due to the huge memory consumption of the assembled PC (over 10M nodes), only the top-level model was fine-tuned in the original paper. With PyJuice, we can fit the entire model in 24GB of memory and fine-tune all of it. For the PC trained on the ImageNet32 dataset (Deng et al., 2009), this fine-tuning step improves the result from 4.06 to 4.04 bits-per-dimension. See Appendix C.3 for more details.

Table 3. Density estimation performance of PCs on three natural image datasets. Reported numbers are test-set bits-per-dimension.
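Bits-per-dimension is the test-set negative log-likelihood in bits, averaged over the dimensions of an image (32×32×3 = 3072 for ImageNet32). The conversion from nats is (function name is ours):

```python
import math

def bits_per_dim(nll_nats, n_dims):
    """Convert a total negative log-likelihood in nats to bits-per-dimension."""
    return nll_nats / (n_dims * math.log(2))
```

By this measure, the 4.06 → 4.04 improvement amounts to roughly 0.02 × 3072 ≈ 61 bits saved per ImageNet32 image.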


:::info Authors:

(1) Anji Liu, Department of Computer Science, University of California, Los Angeles, USA (liuanji@cs.ucla.edu);

(2) Kareem Ahmed, Department of Computer Science, University of California, Los Angeles, USA;

(3) Guy Van den Broeck, Department of Computer Science, University of California, Los Angeles, USA;

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[7] In the adopted HMM, running Dynamax with batch size ≥128 leads to internal errors, and thus the results are not reported.
