
AVX-512, Intel SIMD Instructions for AI and Multimedia

AVX instructions were first implemented in Intel CPUs, replacing the older SSE instructions. Since then they have become the standard SIMD instructions for x86 CPUs in their two variants, 128-bit and 256-bit, and have also been adopted by AMD. The AVX-512 instructions are a different story, however: they are used only in Intel CPUs.

What is a SIMD unit?

A SIMD unit is a type of execution unit designed to execute the same instruction on several pieces of data at the same time. Its vector registers are therefore wider than those used by a scalar instruction, since they have to hold all the data items that the single instruction operates on.

SIMD units have traditionally been used to speed up so-called multimedia workloads, in which many data items have to be manipulated under the same instructions. SIMD units make it possible to parallelize those parts of a program and shorten its execution time.

To separate the SIMD execution units from the traditional ones, every processor gives them their own subset of instructions, which is normally a mirror of the scalar instruction set applied to vector operands. There are also operations, such as shuffles that rearrange data across vector lanes, that a scalar unit cannot perform and that are exclusive to SIMD units.
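
To make the difference concrete, here is a minimal sketch in C (not taken from any particular program) that adds two arrays of floats first with a scalar loop and then with 256-bit AVX intrinsics, where one instruction adds eight 32-bit values at once (256 / 32 = 8). It assumes GCC or Clang with the immintrin.h header and is compiled with -mavx:

    /* Minimal sketch: the same addition as a scalar loop and as a SIMD loop. */
    #include <immintrin.h>
    #include <stdio.h>

    #define N 16  /* multiple of 8, so the vector loop needs no tail handling */

    /* Scalar: one addition per instruction. */
    static void add_scalar(const float *a, const float *b, float *out) {
        for (int i = 0; i < N; i++)
            out[i] = a[i] + b[i];
    }

    /* SIMD: eight 32-bit additions per instruction. */
    static void add_avx(const float *a, const float *b, float *out) {
        for (int i = 0; i < N; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);  /* load 8 floats */
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
        }
    }

    int main(void) {
        float a[N], b[N], out[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
        add_scalar(a, b, out);
        printf("scalar: out[15] = %.1f\n", out[15]);
        add_avx(a, b, out);
        printf("avx:    out[15] = %.1f\n", out[15]);  /* both print 45.0 */
        return 0;
    }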

The history of the AVX-512

The AVX (Advanced Vector eXtensions) instructions have been inside Intel processors for years, but the origin of the AVX-512 instructions is different from the rest. The reason? They come from the Intel Larrabee project, an attempt by Intel in the late 2000s to create a GPU, which eventually became the Xeon Phi accelerators, a series of processors intended for high-performance computing that Intel released a few years ago.

The Xeon Phi / Larrabee architecture included a special version of the AVX instructions with 512-bit vector registers, which means a single instruction can operate on up to sixteen 32-bit values (512 / 32 = 16). That width is no accident: the typical operations-per-texel ratio for a GPU is around 16:1, a legacy of the instructions’ origin in the failed Larrabee project, from which they were carried over to the Xeon Phi.

Today the Xeon Phi no longer exists, because the same work can be done with a conventional GPU used for computing. That is what led Intel to move these instructions to its main line of CPUs.

The gibberish that is the AVX-512 instruction set

The AVX-512 instructions are not a homogeneous block that is implemented 100% everywhere; rather, they are split into various extensions that, depending on the type of processor, may or may not be present. The common base that every AVX-512 CPU supports is called AVX-512F (Foundation), but on top of it there are additional instructions that are not part of the original instruction set and that Intel has added over time.
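
Because support is fragmented this way, software has to check for each extension individually at runtime. Here is a minimal sketch, assuming GCC or Clang, whose __builtin_cpu_supports built-in exposes the CPUID bit of each extension:

    /* Minimal sketch: each AVX-512 extension has its own CPUID feature bit. */
    #include <stdio.h>

    int main(void) {
        __builtin_cpu_init();  /* make sure the CPU feature data is loaded */
        printf("avx512f    %s\n", __builtin_cpu_supports("avx512f")    ? "yes" : "no");
        printf("avx512cd   %s\n", __builtin_cpu_supports("avx512cd")   ? "yes" : "no");
        printf("avx512bw   %s\n", __builtin_cpu_supports("avx512bw")   ? "yes" : "no");
        printf("avx512dq   %s\n", __builtin_cpu_supports("avx512dq")   ? "yes" : "no");
        printf("avx512vl   %s\n", __builtin_cpu_supports("avx512vl")   ? "yes" : "no");
        printf("avx512vnni %s\n", __builtin_cpu_supports("avx512vnni") ? "yes" : "no");
        return 0;
    }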

The AVX-512 extensions are as follows:

  • AVX-512-CD: Conflict Detection, which detects memory conflicts between loop iterations and therefore allows more loops to be vectorized. It was first added in Skylake-X / Skylake-SP.
  • AVX-512-ER: Reciprocal and exponential instructions, designed for the implementation of transcendental operations. They were added in the Xeon Phi range called Knights Landing.
  • AVX-512-PF: Another inclusion in Knights Landing, this time to improve the prefetch capabilities of the instructions.
  • AVX-512-BW: Byte-level (8-bit) and word-level (16-bit) instructions, which allow the vector units to work with 8-bit and 16-bit data.
  • AVX-512-DQ: Adds new instructions for 32-bit (doubleword) and 64-bit (quadword) data.
  • AVX-512-VL: Allows the AVX-512 instructions to operate on the XMM (128-bit) and YMM (256-bit) vector registers.
  • AVX-512-IFMA: Integer Fused Multiply-Add, which is colloquially an (A × B) + C instruction, with 52-bit integer precision.
  • AVX-512-VBMI: Byte-level vector manipulation instructions, an extension of AVX-512-BW.
  • AVX-512-VNNI: The Vector Neural Network Instructions, a series of instructions added to accelerate deep learning algorithms used in artificial-intelligence applications; a minimal sketch follows this list.
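
As an illustration of VNNI, here is a minimal sketch, not taken from any real library: the _mm512_dpbusd_epi32 intrinsic multiplies 64 unsigned 8-bit values by 64 signed 8-bit values and accumulates the products into sixteen 32-bit sums in a single instruction, which is exactly the inner loop of an INT8 neural-network layer. It needs a CPU that reports avx512vnni and compiler flags such as -mavx512f -mavx512vnni:

    /* Minimal sketch of an INT8 dot product using AVX-512 VNNI. */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* 64 activations (unsigned 8-bit) and 64 weights (signed 8-bit). */
        unsigned char act[64];
        signed char   wgt[64];
        for (int i = 0; i < 64; i++) { act[i] = 1; wgt[i] = 2; }

        __m512i va  = _mm512_loadu_si512(act);
        __m512i vw  = _mm512_loadu_si512(wgt);
        __m512i acc = _mm512_setzero_si512();

        /* Each 32-bit lane accumulates four u8 x s8 products: 4 * (1 * 2) = 8. */
        acc = _mm512_dpbusd_epi32(acc, va, vw);

        /* Horizontal sum of the 16 lanes: 16 * 8 = 128. */
        printf("dot product = %d\n", _mm512_reduce_add_epi32(acc));
        return 0;
    }

On CPUs without VNNI, the same operation takes a three-instruction multiply-widen-add sequence, which is the saving this extension provides to deep learning workloads.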

Why hasn’t AMD implemented it in its CPUs yet?

The reason is very simple: AMD is committed to the combined use of its CPUs and GPUs when it comes to accelerating certain types of applications. Let’s not forget that AVX-512 originated in a failed Intel GPU; thanks to its Radeon GPUs, AMD has no need for the AVX-512 instructions.

That is why the AVX-512 instructions are exclusive to Intel processors: not because of any formal exclusivity, but because AMD has no interest in using this type of instruction in its CPUs, since its intention is to sell its GPUs, especially the newly launched AMD Instinct accelerators for high-performance computing, built on the CDNA architecture.

Do the AVX-512 instructions have a future?

Honestly, we do not know; it depends on the success of the Intel Xe, especially the Xe-HPC, which will give Intel a GPU architecture on the level of AMD and NVIDIA. That sets up a conflict inside Intel, with the Intel Xe GPUs and the AVX-512 instructions competing to solve the same problems.

The problem with AVX-512 is that activating the part of the CPU that executes it forces the clock speed down, reducing it by about 25%, even in a program that uses these instructions only at specific moments. In addition, these instructions are aimed at high-performance computing and AI applications that matter little in a home CPU, and the appearance of specialized units makes them a waste of transistors and die space.

In reality, accelerators and domain-specific processors are slowly replacing SIMD units in CPUs, since they can do the same work while taking up less space and consuming far less power in comparison.