With the rapid development of AI technology, generative AI, and large language models in particular, is drawing the attention of countless users. The Intel® Xeon® Scalable processor, with its large number of physical cores and its high memory capacity and bandwidth, has become an ideal choice for these workloads. The processor is not only powerful but also offers excellent stability and reliability, providing long-term, stable support for enterprise applications and cloud services.
Intel® has further improved the efficiency of large language models on its processors with the open-source Intel® LLM Library for PyTorch (IPEX-LLM). IPEX-LLM is specifically optimized for the AMX instruction set in fourth-generation Intel® Xeon® Scalable processors and incorporates a series of low-bit optimization techniques. This enables popular large language models built on Hugging Face Transformers, LangChain, LlamaIndex, and other frameworks to run efficiently on the processor, highlighting the outstanding performance and cost-effectiveness of Intel® Xeon® Scalable processors for large language model inference.
IPEX-LLM not only enhances the inference performance of large language models on fourth-generation Intel® Xeon® Scalable processors, it is also remarkably easy to use. Users can install the IPEX-LLM library and quickly set up a large language model inference environment: by following the quick installation guide, they can conveniently install IPEX-LLM on the processor and build up large language model inference capability with little effort.
Furthermore, when developing large language model inference applications based on IPEX-LLM, users only need to make minimal code adjustments, such as changing the relevant import statements and setting the load_in_4bit=True parameter in the model loading (from_pretrained) function to enable low-bit optimization, while continuing to use familiar APIs such as those of Hugging Face Transformers.
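The following is a minimal sketch of these adjustments. The import path and keyword arguments follow the current IPEX-LLM examples (the library can typically be installed with a single pip command, e.g. pip install --pre --upgrade ipex-llm[all]); the model ID is only illustrative, so adjust both to your installed version and target model.

```python
import torch
from transformers import AutoTokenizer

# IPEX-LLM provides drop-in replacements for the Hugging Face model classes.
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # illustrative: any HF model ID or local path

# load_in_4bit=True converts the weights to a low-bit (INT4) format at load time.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

with torch.inference_mode():
    inputs = tokenizer("What can the AMX instruction set accelerate?", return_tensors="pt")
    output = model.generate(inputs.input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Apart from the import and the extra parameter, the code is identical to a standard Hugging Face Transformers workflow, which is what keeps the migration cost low.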
When using the IPEX-LLM library on Intel® Xeon® Scalable processors, a small, fast draft model can also be used to accelerate a large language model so that its inference efficiency is maximized. In this approach the draft model is obtained by converting the original model to a low-bit version and running it on the same hardware, and hardware- and software-level optimizations are combined to push the model's inference performance further. For example, IPEX-LLM provides a feature called BF16 Self-Speculative Decoding, which users can activate by setting just a few parameters.
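A sketch of how this feature is typically enabled is shown below. The keyword arguments follow the IPEX-LLM Self-Speculative Decoding samples as published on the project's GitHub page; they should be treated as indicative, and the examples shipped with the installed version should be consulted for the exact API.

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # illustrative model choice

# speculative=True enables self-speculative decoding: a low-bit draft of the
# model proposes tokens that the BF16 model then verifies.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    optimize_model=True,
    torch_dtype=torch.bfloat16,
    load_in_low_bit="bf16",
    speculative=True,
    use_cache=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

with torch.inference_mode():
    input_ids = tokenizer("Explain speculative decoding briefly.", return_tensors="pt").input_ids
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```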
To run large language model performance tests on fourth-generation Intel® Xeon® Scalable processors, users can prepare the hardware and software environment by following the performance testing quick guide provided by IPEX-LLM, and should adjust the test scripts to fit their specific testing scenarios. Before starting the performance tests, it is recommended to run the IPEX-LLM environment checking tool to confirm that the installation and runtime environment are set up correctly.
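As a rough illustration of what such a test measures, the snippet below times greedy-search generation for a fixed prompt length and output length, using a model loaded as in the earlier examples. It is a simplified stand-in for the IPEX-LLM benchmark scripts, not the scripts themselves; the random prompt is only a placeholder for a real 1024-token input.

```python
import time
import torch

# `model` is assumed to be loaded with IPEX-LLM as shown above; batch size is 1.
prompt_ids = torch.randint(low=10, high=1000, size=(1, 1024))  # placeholder 1024-token prompt

with torch.inference_mode():
    # Warm-up run so one-time initialization cost is excluded from the measurement.
    model.generate(prompt_ids, max_new_tokens=128, do_sample=False)

    start = time.perf_counter()
    output = model.generate(prompt_ids, max_new_tokens=128, do_sample=False)
    elapsed = time.perf_counter() - start

generated = output.shape[1] - prompt_ids.shape[1]
print(f"{generated} tokens generated in {elapsed:.2f} s "
      f"({elapsed / generated * 1000:.1f} ms per token on average)")
```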
This article has introduced how to run large language model inference with IPEX-LLM on fourth-generation Intel® Xeon® Scalable processors, along with performance data for low-bit INT4 and BF16 Self-Speculative Decoding. Users can find the latest technical information on large language models, as well as sample programs, by visiting the IPEX-LLM GitHub page.
The performance test data was collected in March 2024. The tested hardware configuration is a two-socket system with Intel® Xeon® Platinum 8468 processors (48 cores per socket, hyper-threading and Turbo Boost enabled) and a total of 1024 GB of memory (16x 64 GB DDR5 at 4800 MT/s). The system configuration is BIOS 05.02.01, microcode version 0x2b0004d0, Ubuntu 22.04.3 LTS with kernel 6.2.0-37-generic. The software configuration includes bigdl-llm 2.5.0b20240313 (the version prior to the migration to ipex-llm), PyTorch 2.3.0.dev20240128+cpu, intel-extension-for-pytorch 2.3.0+git004cd72, and transformers 4.36.2.
The recorded performance data was obtained on a single processor using greedy search decoding, with an input of 1024 tokens, an output of 128 tokens, and a batch size of 1. Actual performance may vary depending on usage, configuration, and other factors. More details and performance-related information are available on Intel®'s performance index web page.
The performance test results are based on the configurations listed above and on tests conducted as of the dates shown, and may not reflect all publicly available security updates. See the configuration disclosure for details. No product or component can be absolutely secure.
The specific costs and benefits of Intel technologies may vary. Enabling these technologies may require activation of hardware, software, or services. Intel makes no warranties of any kind with respect to its products and services, whether express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
Intel does not control or audit third-party data. Users should carefully review such content, consult other sources of information, and confirm whether the referenced data is accurate. Intel Corporation retains the copyright in its content; Intel's trademarks, as well as other companies' trademarks, are registered trademarks in the United States and/or other countries. Other names and brands mentioned may be the property of their respective owners.