new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jun 8

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.

  • 9 authors
·
Jun 9, 2024

BiPer: Binary Neural Networks using a Periodic Function

Quantized neural networks employ reduced precision representations for both weights and activations. This quantization process significantly reduces the memory requirements and computational complexity of the network. Binary Neural Networks (BNNs) are the extreme quantization case, representing values with just one bit. Since the sign function is typically used to map real values to binary values, smooth approximations are introduced to mimic the gradients during error backpropagation. Thus, the mismatch between the forward and backward models corrupts the direction of the gradient, causing training inconsistency problems and performance degradation. In contrast to current BNN approaches, we propose to employ a binary periodic (BiPer) function during binarization. Specifically, we use a square wave for the forward pass to obtain the binary values and employ the trigonometric sine function with the same period of the square wave as a differentiable surrogate during the backward pass. We demonstrate that this approach can control the quantization error by using the frequency of the periodic function and improves network performance. Extensive experiments validate the effectiveness of BiPer in benchmark datasets and network architectures, with improvements of up to 1% and 0.69% with respect to state-of-the-art methods in the classification task over CIFAR-10 and ImageNet, respectively. Our code is publicly available at https://github.com/edmav4/BiPer.

  • 4 authors
·
Apr 1, 2024

Lattice-Based Pruning in Recurrent Neural Networks via Poset Modeling

Recurrent neural networks (RNNs) are central to sequence modeling tasks, yet their high computational complexity poses challenges for scalability and real-time deployment. Traditional pruning techniques, predominantly based on weight magnitudes, often overlook the intrinsic structural properties of these networks. We introduce a novel framework that models RNNs as partially ordered sets (posets) and constructs corresponding dependency lattices. By identifying meet irreducible neurons, our lattice-based pruning algorithm selectively retains critical connections while eliminating redundant ones. The method is implemented using both binary and continuous-valued adjacency matrices to capture different aspects of network connectivity. Evaluated on the MNIST dataset, our approach exhibits a clear trade-off between sparsity and classification accuracy. Moderate pruning maintains accuracy above 98%, while aggressive pruning achieves higher sparsity with only a modest performance decline. Unlike conventional magnitude-based pruning, our method leverages the structural organization of RNNs, resulting in more effective preservation of functional connectivity and improved efficiency in multilayer networks with top-down feedback. The proposed lattice-based pruning framework offers a rigorous and scalable approach for reducing RNN complexity while sustaining robust performance, paving the way for more efficient hierarchical models in both machine learning and computational neuroscience.

  • 1 authors
·
Feb 23, 2025

Orthogonal Matrices for MBAT Vector Symbolic Architectures, and a "Soft" VSA Representation for JSON

Vector Symbolic Architectures (VSAs) give a way to represent a complex object as a single fixed-length vector, so that similar objects have similar vector representations. These vector representations then become easy to use for machine learning or nearest-neighbor search. We review a previously proposed VSA method, MBAT (Matrix Binding of Additive Terms), which uses multiplication by random matrices for binding related terms. However, multiplying by such matrices introduces instabilities which can harm performance. Making the random matrices be orthogonal matrices provably fixes this problem. With respect to larger scale applications, we see how to apply MBAT vector representations for any data expressed in JSON. JSON is used in numerous programming languages to express complex data, but its native format appears highly unsuited for machine learning. Expressing JSON as a fixed-length vector makes it readily usable for machine learning and nearest-neighbor search. Creating such JSON vectors also shows that a VSA needs to employ binding operations that are non-commutative. VSAs are now ready to try with full-scale practical applications, including healthcare, pharmaceuticals, and genomics. Keywords: MBAT (Matrix Binding of Additive Terms), VSA (Vector Symbolic Architecture), HDC (Hyperdimensional Computing), Distributed Representations, Binding, Orthogonal Matrices, Recurrent Connections, Machine Learning, Search, JSON, VSA Applications

  • 1 authors
·
Feb 8, 2022

BinaryDM: Towards Accurate Binarization of Diffusion Model

With the advancement of diffusion models (DMs) and the substantially increased computational requirements, quantization emerges as a practical solution to obtain compact and efficient low-bit DMs. However, the highly discrete representation leads to severe accuracy degradation, hindering the quantization of diffusion models to ultra-low bit-widths. In this paper, we propose BinaryDM, a novel accurate quantization-aware training approach to push the weights of diffusion models towards the limit of 1-bit. Firstly, we present a Learnable Multi-basis Binarizer (LMB) to recover the representations generated by the binarized DM, which improves the information in details of representations crucial to the DM. Secondly, a Low-rank Representation Mimicking (LRM) is applied to enhance the binarization-aware optimization of the DM, alleviating the optimization direction ambiguity caused by fine-grained alignment. Moreover, a progressive initialization strategy is applied to training DMs to avoid convergence difficulties. Comprehensive experiments demonstrate that BinaryDM achieves significant accuracy and efficiency gains compared to SOTA quantization methods of DMs under ultra-low bit-widths. As the first binarization method for diffusion models, BinaryDM achieves impressive 16.0 times FLOPs and 27.1 times storage savings with 1-bit weight and 4-bit activation, showcasing its substantial advantages and potential for deploying DMs on resource-limited scenarios.

  • 9 authors
·
Apr 8, 2024

BiBench: Benchmarking and Analyzing Network Binarization

Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization algorithms to diverse tasks, architectures, and hardware in realistic scenarios is still not straightforward. Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in the actual production and define evaluation tracks and metrics for a comprehensive and fair investigation. Then, we evaluate and analyze a series of milestone binarization algorithms that function at the operator level and with extensive influence. Our benchmark reveals that 1) the binarized operator has a crucial impact on the performance and deployability of binarized networks; 2) the accuracy of binarization varies significantly across different learning tasks and neural architectures; 3) binarization has demonstrated promising efficiency potential on edge devices despite the limited hardware support. The results and analysis also lead to a promising paradigm for accurate and efficient binarization. We believe that BiBench will contribute to the broader adoption of binarization and serve as a foundation for future research. The code for our BiBench is released https://github.com/htqin/BiBench .

  • 8 authors
·
Jan 26, 2023

Cylindric plane partitions, Lambda determinants, Commutants in semicircular systems

This thesis is divided into three parts. The first part deals with cylindric plane partitions. The second with lambda-determinants and the third with commutators in semi-circular systems. For more detailed abstract please see inside. Cylindric plane partitions may be thought of as a natural generalization of reverse plane partitions. A generating series for the enumeration of cylindric plane partitions was recently given by Borodin. The first result of section one is a new bijective proof of Borodin's identity which makes use of Fomin's growth diagram framework for generalized RSK correspondences. The second result is a (q,t)-analog of Borodin's identity which extends previous work by Okada in the reverse plane partition case. The third result is an explicit combinatorial interpretation of the Macdonald weight occurring in the (q,t)-analog using the non-intersecting lattice path model for cylindric plane partitions. Alternating sign matrices were discovered by Robbins and Rumsey whilst studying λ-determinants. In the second part of this thesis we prove a multi-parameter generalization of the λ-determinant, generalizing a recent result by di Francesco. Like the original λ-determinant, our formula exhibits the Laurent phenomenon. Semicircular systems were first introduced by Voiculescu as a part of his study of von Neumann algebras. In the third part of this thesis we study certain commutator subalgebras of the semicircular system. We find a projection matrix with an interesting self-similar structure. Making use of our projection formula we given an alternative, elementary proof that the semicircular system is a factor.

  • 1 authors
·
Oct 25, 2021

EcoFormer: Energy-Saving Attention with Linear Complexity

Transformer is a transformative framework that models sequential data and has achieved remarkable performance on a wide range of tasks, but with high computational and energy cost. To improve its efficiency, a popular choice is to compress the models via binarization which constrains the floating-point values into binary ones to save resource consumption owing to cheap bitwise operations significantly. However, existing binarization methods only aim at minimizing the information loss for the input distribution statistically, while ignoring the pairwise similarity modeling at the core of the attention. To this end, we propose a new binarization paradigm customized to high-dimensional softmax attention via kernelized hashing, called EcoFormer, to map the original queries and keys into low-dimensional binary codes in Hamming space. The kernelized hash functions are learned to match the ground-truth similarity relations extracted from the attention map in a self-supervised way. Based on the equivalence between the inner product of binary codes and the Hamming distance as well as the associative property of matrix multiplication, we can approximate the attention in linear complexity by expressing it as a dot-product of binary codes. Moreover, the compact binary representations of queries and keys enable us to replace most of the expensive multiply-accumulate operations in attention with simple accumulations to save considerable on-chip energy footprint on edge devices. Extensive experiments on both vision and language tasks show that EcoFormer consistently achieves comparable performance with standard attentions while consuming much fewer resources. For example, based on PVTv2-B0 and ImageNet-1K, Ecoformer achieves a 73% on-chip energy footprint reduction with only a 0.33% performance drop compared to the standard attention. Code is available at https://github.com/ziplab/EcoFormer.

  • 5 authors
·
Sep 19, 2022

The Base Dependent Behavior of Kaprekar's Routine: A Theoretical and Computational Study Revealing New Regularities

Consider the following process: Take any four-digit number which has at least two distinct digits. Then, rearrange the digits of the original number in ascending and descending order, take these two numbers, and find the difference between the two. Finally, repeat this routine using the difference as the new four-digit number. In 1949, D. R. Kaprekar became the first to discover that this process, known as the Kaprekar Routine, would always yield 6174 within 7 iterations. Since this number remains unchanged after an application of the Kaprekar Routine, it became known as Kaprekar's Constant. Previous works have shown that the only base 10 Kaprekar's Constants are 495 and 6174, the 3-digit and 4-digit case. However, little attention has been given to other bases or determining which digit cases and which bases have a Kaprekar's Constant. This paper analyzes the behavior of the Kaprekar Routine in the 3-digit case, deriving an expression for all 3-digit Kaprekar Constants. In addition, the author developed a series of C++ programs to analyze the paths integers followed to their respective Kaprekar's Constant. Surprisingly, it was determined from this program that the most commonly required number of iterations required to reach Kaprekar's Constant for 3-digit integers was consistently 3, regardless of base. When loaded as a matrix, the iteration requirement data demonstrates a precise recurring relationship reminiscent of Pascal's Triangle.

  • 1 authors
·
Oct 16, 2017

MST-compression: Compressing and Accelerating Binary Neural Networks with Minimum Spanning Tree

Binary neural networks (BNNs) have been widely adopted to reduce the computational cost and memory storage on edge-computing devices by using one-bit representation for activations and weights. However, as neural networks become wider/deeper to improve accuracy and meet practical requirements, the computational burden remains a significant challenge even on the binary version. To address these issues, this paper proposes a novel method called Minimum Spanning Tree (MST) compression that learns to compress and accelerate BNNs. The proposed architecture leverages an observation from previous works that an output channel in a binary convolution can be computed using another output channel and XNOR operations with weights that differ from the weights of the reused channel. We first construct a fully connected graph with vertices corresponding to output channels, where the distance between two vertices is the number of different values between the weight sets used for these outputs. Then, the MST of the graph with the minimum depth is proposed to reorder output calculations, aiming to reduce computational cost and latency. Moreover, we propose a new learning algorithm to reduce the total MST distance during training. Experimental results on benchmark models demonstrate that our method achieves significant compression ratios with negligible accuracy drops, making it a promising approach for resource-constrained edge-computing devices.

  • 5 authors
·
Aug 25, 2023

QuIVer: Rethinking ANN Graph Topology via Training-Free Binary Quantization

Approximate nearest neighbor (ANN) graph indices such as HNSW and Vamana construct their edge topology in full-precision or high-fidelity quantized metric spaces, relegating binary quantization (BQ) to a post-hoc distance estimator during search. This paper asks a different question: Can binary quantization define the graph topology itself -- and if so, under what conditions? We study this question through QuIVer (Quantized Index for Vector Retrieval), a training-free ANN graph index that performs Vamana edge selection, diversity pruning, and beam-search navigation entirely within a 2-bit Sign-Magnitude BQ metric space, accessing float32 vectors only for final reranking. Systematic evaluation on twelve million-scale datasets reveals a sharp applicability boundary: BQ-native topology is highly effective on cosine-native contrastive-learning embeddings (>=88% Recall@10 at ef=64 across five datasets, 384--3072 dimensions), moderately effective on multimodal CLIP data (71--78%), and empirically unsuitable for Euclidean-native or structureless distributions (<15%). Our results suggest an empirical "impossible triangle" between aggressive compression, high throughput, and universal data compatibility. The central contribution is not merely the system, but the boundary it reveals: falsifiable criteria for when industrial vector search systems can safely trade metric fidelity for compact BQ-native navigation. On compatible workloads, the system benefits are substantial: QuIVer's BQ-native hot path (<1.3 GB for 1M vectors) yields 2.5--5.5x higher multi-threaded throughput than DiskANN Rust and HNSW variants at matched recall, with 4.7x less hot memory and no codebook or rotation training (unlike PQ/OPQ/RaBitQ).

  • 3 authors
·
May 16

Binary Embedding-based Retrieval at Tencent

Large-scale embedding-based retrieval (EBR) is the cornerstone of search-related industrial applications. Given a user query, the system of EBR aims to identify relevant information from a large corpus of documents that may be tens or hundreds of billions in size. The storage and computation turn out to be expensive and inefficient with massive documents and high concurrent queries, making it difficult to further scale up. To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension. Specifically, we compress the full-precision query and document embeddings, formulated as float vectors in general, into a composition of multiple binary vectors using a lightweight transformation model with residual multilayer perception (MLP) blocks. We can therefore tailor the number of bits for different applications to trade off accuracy loss and cost savings. Importantly, we enable task-agnostic efficient training of the binarization model using a new embedding-to-embedding strategy. We also exploit the compatible training of binary embeddings so that the BEBR engine can support indexing among multiple embedding versions within a unified system. To further realize efficient search, we propose Symmetric Distance Calculation (SDC) to achieve lower response time than Hamming codes. We successfully employed the introduced BEBR to Tencent products, including Sogou, Tencent Video, QQ World, etc. The binarization algorithm can be seamlessly generalized to various tasks with multiple modalities. Extensive experiments on offline benchmarks and online A/B tests demonstrate the efficiency and effectiveness of our method, significantly saving 30%~50% index costs with almost no loss of accuracy at the system level.

  • 10 authors
·
Feb 17, 2023

Binary and Ternary Natural Language Generation

Ternary and binary neural networks enable multiplication-free computation and promise multiple orders of magnitude efficiency gains over full-precision networks if implemented on specialized hardware. However, since both the parameter and the output space are highly discretized, such networks have proven very difficult to optimize. The difficulties are compounded for the class of transformer text generation models due to the sensitivity of the attention operation to quantization and the noise-compounding effects of autoregressive decoding in the high-cardinality output space. We approach the problem with a mix of statistics-based quantization for the weights and elastic quantization of the activations and demonstrate the first ternary and binary transformer models on the downstream tasks of summarization and machine translation. Our ternary BART base achieves an R1 score of 41 on the CNN/DailyMail benchmark, which is merely 3.9 points behind the full model while being 16x more efficient. Our binary model, while less accurate, achieves a highly non-trivial score of 35.6. For machine translation, we achieved BLEU scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full precision mBART model score of 26.8. We also compare our approach in the 8-bit activation setting, where our ternary and even binary weight models can match or outperform the best existing 8-bit weight models in the literature. Our code and models are available at: https://github.com/facebookresearch/Ternary_Binary_Transformer

  • 5 authors
·
Jun 2, 2023

Finding Kissing Numbers with Game-theoretic Reinforcement Learning

Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a fundamental challenge. This problem is the local analogue of Hilbert's 18th problem, bridging geometry, number theory, and information theory. Although significant progress has been made through lattices and codes, the irregularities of high-dimensional geometry, dimensional structure variability, and combinatorial explosion beyond Go limit the scalability and generality of existing methods. Here we model the problem as a two-player matrix completion game and train the reinforcement learning system, PackingStar, to play the games. The matrix entries represent pairwise cosines of sphere center vectors. One player fills entries while another corrects suboptimal ones to improve exploration quality, cooperatively maximizing the matrix size, corresponding to the kissing number. These matrices are decomposed into representative substructures, providing diverse bases and structural constraints that steer subsequent games and make extremely large spaces tractable. PackingStar surpasses records from dimensions 25 to 31 and sets new lower bounds for generalized kissing numbers under various angular constraints. It achieves the first breakthrough beyond rational structures from 1971 in 13 dimensions and discovers over 6000 new structures in other dimensions. Notably, some configurations challenge long-held antipodal paradigms, revealing algebraic correspondences with finite simple groups as well as geometric relationships across dimensions. Inspired by these patterns, humans devised further improved constructions. These results demonstrate AI's power to explore high-dimensional spaces beyond human intuition via extreme-scale reinforcement learning and open new pathways for the Kissing Number Problem and broader geometry research.

  • 9 authors
·
Feb 10

Learning to Normalize on the SPD Manifold under Bures-Wasserstein Geometry

Covariance matrices have proven highly effective across many scientific fields. Since these matrices lie within the Symmetric Positive Definite (SPD) manifold - a Riemannian space with intrinsic non-Euclidean geometry, the primary challenge in representation learning is to respect this underlying geometric structure. Drawing inspiration from the success of Euclidean deep learning, researchers have developed neural networks on the SPD manifolds for more faithful covariance embedding learning. A notable advancement in this area is the implementation of Riemannian batch normalization (RBN), which has been shown to improve the performance of SPD network models. Nonetheless, the Riemannian metric beneath the existing RBN might fail to effectively deal with the ill-conditioned SPD matrices (ICSM), undermining the effectiveness of RBN. In contrast, the Bures-Wasserstein metric (BWM) demonstrates superior performance for ill-conditioning. In addition, the recently introduced Generalized BWM (GBWM) parameterizes the vanilla BWM via an SPD matrix, allowing for a more nuanced representation of vibrant geometries of the SPD manifold. Therefore, we propose a novel RBN algorithm based on the GBW geometry, incorporating a learnable metric parameter. Moreover, the deformation of GBWM by matrix power is also introduced to further enhance the representational capacity of GBWM-based RBN. Experimental results on different datasets validate the effectiveness of our proposed method.

  • 5 authors
·
Apr 1, 2025

Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrix-vector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent hand-crafting these algorithms and implementations is necessary, what structural priors they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the O(N log N) Cooley-Tukey FFT algorithm to machine precision, for dimensions N up to 1024. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hidden-layer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR-10 by 3.9 points -- the first time a structured approach has done so -- with 4X faster inference speed and 40X fewer parameters.

  • 5 authors
·
Dec 28, 2020

Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training

Binarization of neural networks is a dominant paradigm in neural networks compression. The pioneering work BinaryConnect uses Straight Through Estimator (STE) to mimic the gradients of the sign function, but it also causes the crucial inconsistency problem. Most of the previous methods design different estimators instead of STE to mitigate it. However, they ignore the fact that when reducing the estimating error, the gradient stability will decrease concomitantly. These highly divergent gradients will harm the model training and increase the risk of gradient vanishing and gradient exploding. To fully take the gradient stability into consideration, we present a new perspective to the BNNs training, regarding it as the equilibrium between the estimating error and the gradient stability. In this view, we firstly design two indicators to quantitatively demonstrate the equilibrium phenomenon. In addition, in order to balance the estimating error and the gradient stability well, we revise the original straight through estimator and propose a power function based estimator, Rectified Straight Through Estimator (ReSTE for short). Comparing to other estimators, ReSTE is rational and capable of flexibly balancing the estimating error with the gradient stability. Extensive experiments on CIFAR-10 and ImageNet datasets show that ReSTE has excellent performance and surpasses the state-of-the-art methods without any auxiliary modules or losses.

  • 4 authors
·
Aug 13, 2023

Concatenated Matrix SVD: Compression Bounds, Incremental Approximation, and Error-Constrained Clustering

Large collections of matrices arise throughout modern machine learning, signal processing, and scientific computing, where they are commonly compressed by concatenation followed by truncated singular value decomposition (SVD). This strategy enables parameter sharing and efficient reconstruction and has been widely adopted across domains ranging from multi-view learning and signal processing to neural network compression. However, it leaves a fundamental question unanswered: which matrices can be safely concatenated and compressed together under explicit reconstruction error constraints? Existing approaches rely on heuristic or architecture-specific grouping and provide no principled guarantees on the resulting SVD approximation error. In the present work, we introduce a theory-driven framework for compression-aware clustering of matrices under SVD compression constraints. Our analysis establishes new spectral bounds for horizontally concatenated matrices, deriving global upper bounds on the optimal rank-r SVD reconstruction error from lower bounds on singular value growth. The first bound follows from Weyl-type monotonicity under blockwise extensions, while the second leverages singular values of incremental residuals to yield tighter, per-block guarantees. We further develop an efficient approximate estimator based on incremental truncated SVD that tracks dominant singular values without forming the full concatenated matrix. Therefore, we propose three clustering algorithms that merge matrices only when their predicted joint SVD compression error remains below a user-specified threshold. The algorithms span a trade-off between speed, provable accuracy, and scalability, enabling compression-aware clustering with explicit error control. Code is available online.

  • 1 authors
·
Jan 12

Assemblage: Automatic Binary Dataset Construction for Machine Learning

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpuses of malicious binaries, obtaining high-quality corpuses of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpuses (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpuses of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage can be downloaded from https://assemblage-dataset.net

  • 8 authors
·
May 7, 2024

A new sample of massive B-type contact binary candidates from the OGLE survey of the Magellanic Clouds

Massive contact binaries (CBs) are key to understanding close-binary evolution and stellar mergers, yet their study has been limited by the scarcity of observed systems, particularly of B-type binaries expected to dominate this class. We bridge this gap by mining a large sample of massive CB candidates from the OGLE-IV database, increasing their known numbers in the Magellanic Clouds by nearly an order of magnitude. Using main-sequence colour-magnitude limits, an observationally informed period-luminosity-colour relation for CBs, and a high morph-parameter cut (cgeq0.7), we identified 68 O- and B-type binaries that exhibit smooth, sinusoidal light curves with nearly equal eclipse depths. We then isolated a bona fide sample of 37 CB candidates (28 in the LMC and 9 in the SMC) that match theoretical colour-magnitude and period distributions derived from an extensive grid of MESA binary models. The bona fide sample, dominated by B-type systems with Papprox0.6-1 d, agrees with the predicted population and may contain many qapprox1 binaries, as expected from models showing mass equalization preceding temperature equalization during nuclear-timescale contact. Synthetic PHOEBE light curves of contact and near-contact phases of MESA models reveal a degeneracy between these configurations, suggesting possible misidentifications among these systems. Spectroscopic follow-up is required to test these predictions and refine the evolutionary framework of massive CBs.

  • 5 authors
·
Oct 21, 2024

Exact Learning of Permutations for Nonzero Binary Inputs with Logarithmic Training Size and Quadratic Ensemble Complexity

The ability of an architecture to realize permutations is quite fundamental. For example, Large Language Models need to be able to correctly copy (and perhaps rearrange) parts of the input prompt into the output. Classical universal approximation theorems guarantee the existence of parameter configurations that solve this task but offer no insights into whether gradient-based algorithms can find them. In this paper, we address this gap by focusing on two-layer fully connected feed-forward neural networks and the task of learning permutations on nonzero binary inputs. We show that in the infinite width Neural Tangent Kernel (NTK) regime, an ensemble of such networks independently trained with gradient descent on only the k standard basis vectors out of 2^k - 1 possible inputs successfully learns any fixed permutation of length k with arbitrarily high probability. By analyzing the exact training dynamics, we prove that the network's output converges to a Gaussian process whose mean captures the ground truth permutation via sign-based features. We then demonstrate how averaging these runs (an "ensemble" method) and applying a simple rounding step yields an arbitrarily accurate prediction on any possible input unseen during training. Notably, the number of models needed to achieve exact learning with high probability (which we refer to as ensemble complexity) exhibits a linearithmic dependence on the input size k for a single test input and a quadratic dependence when considering all test inputs simultaneously.

  • 3 authors
·
Feb 23, 2025

Polynomial-Time Optimal Group Selection via the Double-Commutator Eigenvalue Problem

The algebraic diversity framework generalizes temporal averaging over multiple observations to algebraic group action on a single observation for second-order statistical estimation. The central open problem in this framework is group selection: given an M-dimensional observation with unknown covariance structure, find the finite group whose spectral decomposition best matches the covariance. Naive enumeration of all subgroups of the symmetric group S_M requires exponential time in M. We prove that this combinatorial problem reduces to a generalized eigenvalue problem derived from the double commutator of the covariance matrix, yielding a polynomial-time algorithm with complexity O(d^2M^2 + d^3), where d is the dimension of a generator basis. The minimum eigenvector of the double-commutator matrix directly constructs the optimal group generator in closed form, with no iterative optimization. The reduction is exact: the double-commutator minimum eigenvalue is zero if and only if the optimal generator lies in the span of the basis, and its magnitude provides a certifiable optimality gap when it does not. This problem does not appear in the standard catalogs of computational complexity (Garey and Johnson, 1979) and represents a new class linking group theory, matrix analysis, and statistical estimation. We establish connections to independent component analysis (JADE), structured matrix nearness problems, and simultaneous matrix diagonalization, and we show that the double-commutator formulation is the unique approach that is simultaneously polynomial-time, closed-form, and certifiable. We extend the framework to non-Abelian symmetry recovery via a Sequential GEVP with deflation, and add two identifiability theorems characterizing the commutant-lattice ambiguity and the dichotomy on whether Aut(R) recovers a generative subgroup or only a supergroup.

  • 1 authors
·
May 7

Cross-Scale Context Extracted Hashing for Fine-Grained Image Binary Encoding

Deep hashing has been widely applied to large-scale image retrieval tasks owing to efficient computation and low storage cost by encoding high-dimensional image data into binary codes. Since binary codes do not contain as much information as float features, the essence of binary encoding is preserving the main context to guarantee retrieval quality. However, the existing hashing methods have great limitations on suppressing redundant background information and accurately encoding from Euclidean space to Hamming space by a simple sign function. In order to solve these problems, a Cross-Scale Context Extracted Hashing Network (CSCE-Net) is proposed in this paper. Firstly, we design a two-branch framework to capture fine-grained local information while maintaining high-level global semantic information. Besides, Attention guided Information Extraction module (AIE) is introduced between two branches, which suppresses areas of low context information cooperated with global sliding windows. Unlike previous methods, our CSCE-Net learns a content-related Dynamic Sign Function (DSF) to replace the original simple sign function. Therefore, the proposed CSCE-Net is context-sensitive and able to perform well on accurate image binary encoding. We further demonstrate that our CSCE-Net is superior to the existing hashing methods, which improves retrieval performance on standard benchmarks.

  • 5 authors
·
Oct 14, 2022

A Nearly-Optimal Bound for Fast Regression with ell_infty Guarantee

Given a matrix Ain R^{ntimes d} and a vector bin R^n, we consider the regression problem with ell_infty guarantees: finding a vector x'in R^d such that |x'-x^*|_infty leq epsilon{d}cdot |Ax^*-b|_2cdot |A^dagger| where x^*=argmin_{xin R^d}|Ax-b|_2. One popular approach for solving such ell_2 regression problem is via sketching: picking a structured random matrix Sin R^{mtimes n} with mll n and SA can be quickly computed, solve the ``sketched'' regression problem argmin_{xin R^d} |SAx-Sb|_2. In this paper, we show that in order to obtain such ell_infty guarantee for ell_2 regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first user case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with m=epsilon^{-2}dlog^3(n/delta) such that solving the sketched regression problem gives the ell_infty guarantee, with probability at least 1-delta. Moreover, the matrix SA can be computed in time O(ndlog n). Our row count is nearly-optimal up to logarithmic factors, and significantly improves the result in [Price, Song and Woodruff, ICALP'17], in which a super-linear in d rows, m=Omega(epsilon^{-2}d^{1+gamma}) for gamma=Theta(frac{loglog n{log d}}) is required. We also develop a novel analytical framework for ell_infty guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [Song and Yu, ICML'21]. Our analysis is arguably much simpler and more general than [Price, Song and Woodruff, ICALP'17], and it extends to dense sketches for tensor product of vectors.

  • 4 authors
·
Feb 1, 2023

Learning with Boolean threshold functions

We develop a method for training neural networks on Boolean data in which the values at all nodes are strictly pm 1, and the resulting models are typically equivalent to networks whose nonzero weights are also pm 1. The method replaces loss minimization with a nonconvex constraint formulation. Each node implements a Boolean threshold function (BTF), and training is expressed through a divide-and-concur decomposition into two complementary constraints: one enforces local BTF consistency between inputs, weights, and output; the other imposes architectural concurrence, equating neuron outputs with downstream inputs and enforcing weight equality across training-data instantiations of the network. The reflect-reflect-relax (RRR) projection algorithm is used to reconcile these constraints. Each BTF constraint includes a lower bound on the margin. When this bound is sufficiently large, the learned representations are provably sparse and equivalent to networks composed of simple logical gates with pm 1 weights. Across a range of tasks -- including multiplier-circuit discovery, binary autoencoding, logic-network inference, and cellular automata learning -- the method achieves exact solutions or strong generalization in regimes where standard gradient-based methods struggle. These results demonstrate that projection-based constraint satisfaction provides a viable and conceptually distinct foundation for learning in discrete neural systems, with implications for interpretability and efficient inference.

  • 2 authors
·
Feb 19

BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers

Transformers have achieved widespread and remarkable success, while the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit qk-attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit vision and diffusion transformers. The codes and models can be found at https://github.com/EdwardChasel/BinaryAttention.

  • 3 authors
·
Mar 10

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.

Sparse Linear Regression is Easy on Random Supports

Sparse linear regression is one of the most basic questions in machine learning and statistics. Here, we are given as input a design matrix X in R^{N times d} and measurements or labels {y} in R^N where {y} = {X} {w}^* + {xi}, and {xi} is the noise in the measurements. Importantly, we have the additional constraint that the unknown signal vector {w}^* is sparse: it has k non-zero entries where k is much smaller than the ambient dimension. Our goal is to output a prediction vector {w} that has small prediction error: 1{N}cdot |{X} {w}^* - {X} {w}|^2_2. Information-theoretically, we know what is best possible in terms of measurements: under most natural noise distributions, we can get prediction error at most epsilon with roughly N = O(k log d/epsilon) samples. Computationally, this currently needs d^{Omega(k)} run-time. Alternately, with N = O(d), we can get polynomial-time. Thus, there is an exponential gap (in the dependence on d) between the two and we do not know if it is possible to get d^{o(k)} run-time and o(d) samples. We give the first generic positive result for worst-case design matrices {X}: For any {X}, we show that if the support of {w}^* is chosen at random, we can get prediction error epsilon with N = poly(k, log d, 1/epsilon) samples and run-time poly(d,N). This run-time holds for any design matrix {X} with condition number up to 2^{poly(d)}. Previously, such results were known for worst-case {w}^*, but only for random design matrices from well-behaved families, matrices that have a very low condition number (poly(log d); e.g., as studied in compressed sensing), or those with special structural properties.

  • 3 authors
·
Nov 8, 2025

Learning Mesh Representations via Binary Space Partitioning Tree Networks

Polygonal meshes are ubiquitous, but have only played a relatively minor role in the deep learning revolution. State-of-the-art neural generative models for 3D shapes learn implicit functions and generate meshes via expensive iso-surfacing. We overcome these challenges by employing a classical spatial data structure from computer graphics, Binary Space Partitioning (BSP), to facilitate 3D learning. The core operation of BSP involves recursive subdivision of 3D space to obtain convex sets. By exploiting this property, we devise BSP-Net, a network that learns to represent a 3D shape via convex decomposition without supervision. The network is trained to reconstruct a shape using a set of convexes obtained from a BSP-tree built over a set of planes, where the planes and convexes are both defined by learned network weights. BSP-Net directly outputs polygonal meshes from the inferred convexes. The generated meshes are watertight, compact (i.e., low-poly), and well suited to represent sharp geometry. We show that the reconstruction quality by BSP-Net is competitive with those from state-of-the-art methods while using much fewer primitives. We also explore variations to BSP-Net including using a more generic decoder for reconstruction, more general primitives than planes, as well as training a generative model with variational auto-encoders. Code is available at https://github.com/czq142857/BSP-NET-original.

  • 3 authors
·
Jun 27, 2021

Beyond the Birkhoff Polytope: Spectral-Sphere-Constrained Hyper-Connections

Hyper-Connections (HC) generalize residual connections into multiple streams, employing residual matrices for cross-stream feature mixing to enrich model expressivity. However, unconstrained mixing disrupts the identity mapping property intrinsic to the residual connection, causing unstable training. To address this, Manifold-Constrained Hyper-Connections (mHC) and its variant restrict these matrices to the Birkhoff polytope (doubly stochastic matrices) via Sinkhorn iterations or permutation-based parameterizations. We reveal three limitations of this polytope constraint: (1) identity degeneration, where learned matrices collapse around the identity and diminish cross-stream interactions, (2) an expressivity bottleneck, as the non-negativity constraint prevents subtractive feature disentanglement, and (3) parameterization inefficiencies, manifesting as unstable Sinkhorn iterations or the factorial-scaling overhead of permutation-based parameterizations. To overcome these flaws, we propose Spectral-Sphere-Constrained Hyper-Connections (sHC). By geometrically shifting the feasible set from a rigid polytope to a spectral norm sphere, sHC allows negative entries, unlocking subtractive interactions for selective feature diversification. This shift eliminates unstable Sinkhorn projections and factorial parameterization, enabling expressive, non-degenerate residual matrices while preserving training stability.

  • 3 authors
·
Mar 21

Low Rank Matrix Completion via Robust Alternating Minimization in Nearly Linear Time

Given a matrix Min R^{mtimes n}, the low rank matrix completion problem asks us to find a rank-k approximation of M as UV^top for Uin R^{mtimes k} and Vin R^{ntimes k} by only observing a few entries specified by a set of entries Omegasubseteq [m]times [n]. In particular, we examine an approach that is widely used in practice -- the alternating minimization framework. Jain, Netrapalli and Sanghavi~jns13 showed that if M has incoherent rows and columns, then alternating minimization provably recovers the matrix M by observing a nearly linear in n number of entries. While the sample complexity has been subsequently improved~glz17, alternating minimization steps are required to be computed exactly. This hinders the development of more efficient algorithms and fails to depict the practical implementation of alternating minimization, where the updates are usually performed approximately in favor of efficiency. In this paper, we take a major step towards a more efficient and error-robust alternating minimization framework. To this end, we develop an analytical framework for alternating minimization that can tolerate moderate amount of errors caused by approximate updates. Moreover, our algorithm runs in time widetilde O(|Omega| k), which is nearly linear in the time to verify the solution while preserving the sample complexity. This improves upon all prior known alternating minimization approaches which require widetilde O(|Omega| k^2) time.

  • 4 authors
·
Feb 21, 2023

mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations

Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek's Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn--Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff--von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc-lite.

  • 2 authors
·
Jan 9