New work from Zhu Jun's team at Tsinghua: training Transformers with 4-bit integers to accelerate the arrival of AGI!
Reprinted from Xinzhiyuan
Editors: Aeneas, Run
Quantizing activations, weights, and gradients to 4 bits is expected to accelerate neural network training.
However, existing 4-bit training methods require custom number formats that modern hardware does not support.
Recently, Zhu Jun's team at Tsinghua proposed a Transformer training method that implements all matrix multiplications with INT4 arithmetic.
Training at ultra-low INT4 precision is very challenging. To achieve this goal, the researchers carefully analyzed the specific structures of activations and gradients in Transformers and proposed dedicated quantizers for them.
For forward propagation, they identified outliers as the main challenge and proposed a Hadamard quantizer to suppress them.
For backward propagation, they exploit the structural sparsity of gradients by proposing bit splitting, and use leverage score sampling to quantize gradients accurately.
The new algorithm achieves competitive accuracy on a wide range of tasks, including natural language understanding, machine translation, and image classification.
The prototype linear operator runs up to 2.2 times faster than its FP16 counterpart, and training speed improves by up to 35.1%.
Paper address:
https://arxiv.org/abs/2306.11987
Code address:
https://github.com/xijiu9/Train_Transformers_with_INT4
New INT4 training algorithm
Training neural networks is computationally demanding. Training with low-precision arithmetic (fully quantized training, FQT) is expected to improve computational and memory efficiency.
FQT methods add quantizers and dequantizers to the original full-precision computational graph, replacing expensive floating-point operations with cheap low-precision ones.
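The quantizer and dequantizer pair described above can be illustrated with a minimal sketch. It assumes a symmetric signed 4-bit range of [-8, 7] and a single shared scale per tensor; the function names and the example values are hypothetical and are not taken from the paper's code.

```python
def quantize_int4(values, scale):
    """Quantizer: map floats to signed 4-bit integer codes in [-8, 7]."""
    return [max(-8, min(7, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Dequantizer: map integer codes back to approximate floats."""
    return [c * scale for c in codes]


def int4_dot(a_codes, b_codes, a_scale, b_scale):
    """Dot product accumulated in integers, rescaled once at the end."""
    acc = sum(a * b for a, b in zip(a_codes, b_codes))
    return acc * a_scale * b_scale


x = [0.8, -0.3, 0.1, -1.75]   # hypothetical activations
w = [0.5, 0.5, -0.25, 0.25]   # hypothetical weights
scale = 0.25                  # assumed scale: the largest magnitude 1.75 maps to code -7

codes = quantize_int4(x, scale)
x_hat = dequantize(codes, scale)
approx = int4_dot(codes, quantize_int4(w, scale), scale, scale)
exact = sum(a * b for a, b in zip(x, w))
print(codes, x_hat, approx, exact)
```

The integer product only approximates the floating-point result. The quantizers discussed in the rest of the article are aimed at keeping that approximation error small.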
Research on FQT aims to reduce the numerical precision of training without sacrificing much convergence speed or accuracy.
The required numerical precision has been reduced from FP16 to FP8, INT32+INT8, and INT8+INT5.
FP8 training is implemented in the Nvidia H100 GPU with the Transformer Engine, accelerating the training of large-scale Transformers. Recently, the numerical precision of training has been pushed down to 4 bits.
However, these 4-bit training methods cannot be used directly for acceleration, as they require custom number formats that modern hardware does not support.
Training at such a low 4-bit level poses two optimization challenges. First, the non-differentiable quantizers in forward propagation make the loss landscape rugged, and gradient-based optimizers can easily get stuck in local optima.
Second, gradients are only computed approximately at low precision. Such imprecise gradients can slow down training and even cause it to become unstable or diverge.
In this work, the researchers propose a novel INT4 training algorithm for Transformers.
All costly linear operations in Transformer training can be written as matrix multiplications (MM).
This MM form allows a more flexible quantizer design. By exploiting the specific structures of activations, weights, and gradients in Transformers, the quantizers can better approximate FP32 matrix multiplication.
These quantizers build on advances in randomized numerical linear algebra (RandNLA).
For forward propagation, the researchers found that activation outliers are the main cause of accuracy degradation.
To suppress outliers, they proposed the Hadamard quantizer, which quantizes a transformed version of the activation matrix. The transformation is a block-diagonal Hadamard matrix, which spreads the information carried by outliers to nearby entries of the matrix, thereby reducing the numerical range of the outliers.
For backward propagation, they exploit the structural sparsity of activation gradients. The researchers found that a few tokens have very large gradients, while the gradients of most other tokens are very small, even smaller than the quantization residuals of the larger gradients.
Therefore, rather than computing these small gradients, it is better to spend the saved computation on the residuals of the larger gradients.
To exploit this sparsity, the researchers proposed bit splitting, which splits the gradient of each token into its higher 4 bits and lower 4 bits.
Then, leverage score sampling, an importance sampling technique from RandNLA, is used to select the most informative gradients.
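The bit splitting idea can be sketched in a few lines. The sketch assumes the gradient has already been quantized to signed 8-bit integers, and it encodes the high half as a signed nibble and the low half as an unsigned nibble. That encoding, the function names, and the example values are assumptions made for illustration; the paper's exact representation may differ.

```python
def split_bits(q):
    """Split a signed 8-bit integer into a signed high nibble and an unsigned low nibble."""
    high = q >> 4   # arithmetic shift, equal to floor(q / 16), in [-8, 7]
    low = q & 0xF   # lowest 4 bits, in [0, 15]
    return high, low


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


grad_row = [-37, 100, 5, -128]  # hypothetical 8-bit gradient codes for one token
weights = [3, -2, 7, 1]         # hypothetical integer weight codes

highs, lows = zip(*(split_bits(q) for q in grad_row))

# One 8-bit product becomes two 4-bit products that are recombined linearly.
recombined = 16 * dot(highs, weights) + dot(lows, weights)
print(recombined, dot(grad_row, weights))
```

Because the split is linear, an 8-bit gradient matrix can be multiplied using two 4-bit operands, at the cost of doubling the number of rows to process. Sampling then reduces that row count again.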
Combining the quantization techniques for forward and backward propagation, the researchers proposed an algorithm that uses INT4 MMs for all linear operations in Transformers, and evaluated it on a variety of tasks, including natural language understanding, question answering, machine translation, and image classification.
Compared with existing 4-bit training algorithms, their algorithm achieves competitive or higher accuracy.
In addition, the algorithm is compatible with contemporary hardware such as GPUs, since it does not require custom number formats such as FP4 or logarithmic formats.
The prototype implementation of the quantization + INT4 MM operator is up to 2.2 times faster than the FP16 MM baseline, and improves training speed by up to 35.1%.
Related work
Fully quantized training (FQT)
Research on FQT has designed novel numerical formats and quantization algorithms that better approximate full-precision tensors.
The current research frontier is 4-bit FQT. FQT is challenging because of the large numerical range of gradients and the optimization difficulty of training quantized networks from scratch.
Because of these challenges, existing 4-bit FQT algorithms still lose 1-2.5% accuracy on certain tasks and cannot run on contemporary hardware.
Other efficient training methods
Mixture of experts increases model capacity without increasing the training budget.
Structural dropout exploits computationally efficient ways to regularize the model. Efficient attention reduces the quadratic time complexity of computing attention.
Distributed training systems reduce training time by utilizing more computing resources.
The researchers' work on reducing numerical precision is orthogonal to these directions.
Forward propagation
Neural network training is an iterative optimization procedure that computes stochastic gradients through forward and backward propagation.
The research team uses 4-bit integer (INT4) arithmetic to accelerate both forward and backward propagation.
Forward propagation can be expressed as a composition of linear and nonlinear (GeLU, normalization, softmax, etc.) operators.
During training, all linear operators are accelerated with INT4 arithmetic, while the computationally cheaper nonlinear operators are kept in 16-bit floating point (FP16).
All linear operations in Transformers can be written as matrix multiplications (MM).
For convenience, the paper considers accelerating the following simple matrix multiplication:
Y = XW^T, where X is the N x D input activation matrix, W is the C x D weight matrix, and Y is the N x C output.
The main use case for this type of MM is the fully connected layer.
Consider a Transformer with input of shape (batch size S, sequence length T, dimension D).
The fully connected layer can then be written as the formula above, where X holds the activations of the N = ST tokens and W is the weight matrix.
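To make the shapes concrete, here is a small pure-Python sketch that flattens an (S, T, D) input into an (N = ST, D) matrix and applies a fully connected layer as one matrix multiplication. It assumes the layer computes the product of X with the transpose of a weight matrix stored as (C, D). The helper names, the sizes, and the toy weights are hypothetical.

```python
def flatten_tokens(batch):
    """Reshape a nested (S, T, D) list into an (S*T, D) matrix, one row per token."""
    return [token for sequence in batch for token in sequence]


def matmul_xwt(x, w):
    """Compute X W^T for X of shape (N, D) and W of shape (C, D)."""
    return [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w] for x_row in x]


S, T, D, C = 2, 3, 4, 5  # hypothetical sizes
batch = [[[float(s + t + d) for d in range(D)] for t in range(T)] for s in range(S)]

# Toy weights: output channel c copies input feature c; the last channel is all zeros.
W = [[1.0 if c == d else 0.0 for d in range(D)] for c in range(C)]

X = flatten_tokens(batch)  # shape (N, D) with N = S * T = 6
Y = matmul_xwt(X, W)       # shape (N, C)
print(len(X), len(X[0]), len(Y), len(Y[0]))
```

Once the batch and sequence axes are merged, every token is one row of X, so a quantizer only has to handle a single two-dimensional product.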
For attention layers, batched matrix multiplications (BMMs) may be required.
The proposed techniques can also be applied to BMMs.
Learned step size quantization
To accelerate forward propagation, the researchers use the learned step size quantizer (LSQ).
LSQ is a static quantization method whose quantization scale does not depend on the input, so it is cheaper than dynamic quantization methods, which must compute the quantization scale dynamically at each iteration.
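The difference between a static and a dynamic quantization scale can be sketched as follows. The static quantizer stores a fixed scale (in LSQ this would be a learned parameter; here it is simply a constant), while the dynamic one recomputes the scale from each input. The class name, function names, and values are hypothetical.

```python
def quantize(values, scale):
    """Round to signed 4-bit codes, clipping to [-8, 7]."""
    return [max(-8, min(7, round(v / scale))) for v in values]


class StaticQuantizer:
    """Static quantization: the scale is stored and does not depend on the input."""

    def __init__(self, scale):
        self.scale = scale  # with LSQ this value would be learned during training

    def __call__(self, values):
        return quantize(values, self.scale), self.scale


def dynamic_quantize(values):
    """Dynamic quantization: one extra pass over the input to find the scale."""
    scale = max(abs(v) for v in values) / 7
    return quantize(values, scale), scale


activations = [0.5, -1.0, 6.0]  # hypothetical input containing one outlier
static_codes, static_scale = StaticQuantizer(0.25)(activations)
dynamic_codes, dynamic_scale = dynamic_quantize(activations)
print(static_codes, dynamic_codes)
```

In this toy input the static quantizer clips the outlier 6.0 to code 7, which dequantizes to 1.75, while the dynamic quantizer keeps the outlier but represents the small entries coarsely. Both effects are symptoms of outliers.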
Activation outliers
Simply applying LSQ to FQT with 4-bit activations and weights degrades accuracy because of activation outliers.
As shown in the figure above, the activations contain some outlier entries whose magnitude is much larger than that of the other entries.
Unfortunately, Transformers tend to store information in these outliers, and truncating them to fit the quantization range can seriously compromise accuracy.
The outlier problem is particularly pronounced when the training task is to fine-tune a pre-trained model on a new downstream task, because the pre-trained model contains more outliers than a randomly initialized one.
Hadamard quantization
The researchers propose Hadamard quantization (HQ) to address the outlier problem.
The main idea is to quantize the matrices in another linear space that has fewer outliers.
The outliers in the activation matrix form a feature-wise structure.
They are usually concentrated along a few dimensions, that is, only a few columns of X are significantly larger than the others.
The Hadamard transform is a linear transformation that can spread the outliers across other entries.
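A small numerical sketch shows both properties that make the Hadamard transform useful here: it shrinks the largest entry when outliers sit in one column, and, because a normalized Hadamard matrix is orthogonal, transforming both operands leaves the matrix product unchanged. The sketch uses a single 4 x 4 block built with the Sylvester construction rather than the block-diagonal matrix described in the article, and all matrices are hypothetical.

```python
import math


def hadamard(k):
    """Sylvester construction of a normalized 2^k x 2^k Hadamard matrix."""
    h = [[1.0]]
    for _ in range(k):
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    n = len(h)
    return [[v / math.sqrt(n) for v in row] for row in h]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def max_abs(m):
    return max(abs(v) for row in m for v in row)


H = hadamard(2)
X = [[8.0, 0.1, -0.2, 0.1], [-6.0, 0.2, 0.1, -0.1]]  # the first column holds the outliers
W = [[0.5, -0.3, 0.2, 0.1], [0.1, 0.4, -0.2, 0.3]]

XH = matmul(X, H)  # the outlier is spread over the whole row
WH = matmul(W, H)
Y_ref = matmul(X, transpose(W))
Y_had = matmul(XH, transpose(WH))  # (XH)(WH)^T = X H H^T W^T = X W^T

print(max_abs(X), max_abs(XH))
```

A smaller numerical range lets a 4-bit quantizer use a finer step, so quantizing XH and WH loses less information than quantizing X and W directly.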
Backward propagation
Next, consider accelerating the backward propagation of linear layers with INT4 operations.
This section discusses the computation of the activation gradient and the weight gradient.
Structural sparsity of gradients
The researchers observed that the gradient matrix is often very sparse during training.
The sparsity has the following structure:
a few rows (that is, tokens) have large entries, while most other rows are close to the all-zero vector.
This structural sparsity stems from the heavy overparameterization of modern neural networks.
During almost the entire training process, the network operates in the overparameterized regime and fits most of the training data well, except for a few hard examples.
Therefore, the (activation) gradient is close to zero for well-fitted data points.
The researchers found that for pre-training tasks, structural sparsity emerges quickly, after only a few training epochs.
For fine-tuning tasks, the gradient is sparse throughout the entire training process.
Bit splitting and leverage score sampling
How can a gradient quantizer be designed to exploit structural sparsity and compute the MMs in backpropagation accurately?
The high-level idea is that many rows of the gradient are so small that they have little impact on the parameter gradient, yet they waste a lot of computation.
On the other hand, the large rows cannot be represented accurately in INT4.
The approach is therefore to drop some of the small rows and use the saved computation to represent the large rows more accurately.
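The row-dropping idea can be sketched with norm-proportional sampling, a simplified stand-in for leverage score sampling. Each gradient row is kept with a probability proportional to its norm and rescaled by the inverse of that probability, so the sampled matrix equals the original in expectation. The scoring rule, the `budget` parameter, and the example gradient are assumptions made for illustration, and the sketch assumes at least one nonzero row.

```python
import random


def row_norm(row):
    return sum(v * v for v in row) ** 0.5


def keep_probabilities(grad_rows, budget):
    """Keep probability per row, proportional to its norm and capped at 1.

    `budget` is a hypothetical knob: roughly the number of rows to keep.
    """
    norms = [row_norm(r) for r in grad_rows]
    total = sum(norms)  # assumed to be nonzero
    return [min(1.0, budget * n / total) for n in norms]


def sample_rows(grad_rows, probs, rng):
    """Keep row i with probability p_i and rescale it by 1 / p_i (unbiased estimate)."""
    sampled = []
    for row, p in zip(grad_rows, probs):
        if p > 0 and rng.random() < p:
            sampled.append([v / p for v in row])
        else:
            sampled.append([0.0] * len(row))  # dropped row: no computation needed
    return sampled


# Hypothetical structurally sparse gradient: one large row, the rest near zero.
grad = [[4.0, -3.0], [0.01, 0.0], [0.0, 0.02], [0.0, 0.0]]
probs = keep_probabilities(grad, budget=2)
sampled = sample_rows(grad, probs, random.Random(0))
print(probs)
```

The large row is always kept, while the near-zero rows are almost always dropped, which is where the computation is saved.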
Experiment
The researchers evaluated their INT4 training algorithm on a variety of tasks, including language model fine-tuning, machine translation, and image classification.
The researchers implemented their proposed HQ-MM and LSS-MM algorithms with CUDA and CUTLASS.
They replaced all floating-point linear operators with the INT4 implementation, with two exceptions: embedding layers simply use LSQ, and the last classifier layer is kept in full precision.
Finally, the researchers used the default architecture, optimizer, scheduler, and hyperparameters for all evaluated models.
Converged model accuracy
The researchers compared the accuracy of the converged models on various tasks in the table below.
The compared methods are full-precision training (FP), INT8 training (INT8), FP4 training ("Ultra-low"), 4-bit logarithmic quantization with LSQ for activations and weights (LSQ+LUQ), and the proposed algorithm, which uses HQ for forward propagation and LSS for backward propagation (HQ+LSS).
There is no publicly available implementation of "Ultra-low", so only its performance on the machine translation task, as reported in its original paper, is listed.
Except for the large machine translation task and the large vision Transformer tasks, each run is repeated three times and the standard deviation is reported as a subscript in the table.
The researchers did not perform any type of knowledge distillation or data augmentation.
Ablation experiment
To study the forward-propagation quantizers on their own, backward propagation is kept in FP16.
The results are shown in the following figure.
Computing and memory efficiency
Finally, the researchers demonstrated the potential of their method to accelerate neural network training by evaluating their prototype implementation.
Their implementation has not been fully optimized yet.
The researchers also did not fuse the linear operators with nonlinearities and normalization.
Therefore, the results do not fully reflect the potential of the INT4 training algorithm.
A fully optimized implementation requires substantial engineering, which is beyond the scope of the paper.
Conclusion
The researchers have proposed a hardware-friendly INT4 training method for Transformers.
By analyzing the properties of MMs in Transformers, the researchers proposed the HQ and LSS methods to quantize activations and gradients while preserving accuracy.
On several important tasks, the method performs comparably to, or even better than, existing INT4 methods.
The work could be extended to other MM-only architectures beyond Transformers, such as MLP-Mixer, graph neural networks, and recurrent neural networks.
This is their future research direction.
Broader impact: The algorithm can improve efficiency and reduce the energy consumption of training neural networks, which helps reduce the carbon emissions caused by deep learning.
However, efficient training algorithms may also facilitate the development of large language models and malicious AI applications that pose safety risks to humans, for example, models and applications that can be used to generate false content.
Limitations: The main limitation of this work is that it can only accelerate models dominated by large matrix multiplications (linear layers), and cannot accelerate convolutional layers.
Moreover, the proposed method does not yet work well for extremely large models such as OPT-175B.
To the researchers' knowledge, even INT8 training remains an open problem for such very large models.
References:
https://arxiv.org/abs/2306.11987