If any workload pushes current hardware to its limits, it is the deep neural network. AI computation is largely a brute-force affair, which means that each new CPU or GPU generation delivers near-perfect performance scaling. So far the sector has turned to graphics cards for their greater ability to parallelize work, since almost all existing algorithms are built on matrix multiplication. But what if that were no longer true and the tables were turned?
AVX512 and AVX512_BF16, main drivers of CPU performance
At Rice University's Brown School of Engineering, Professor Anshumali Shrivastava and his team have presented a new algorithm for deep neural networks (DNNs) that exploits the latest-generation AVX512 instructions and their AVX512_BF16 extension.
Shrivastava made some rather striking statements:
Businesses are spending millions of dollars a week just to train and tune their AI workloads. The entire industry is obsessed with one type of improvement: faster matrix multiplications. Everyone is looking for specialized hardware and architectures to drive matrix multiplication. People now even talk about having specialized hardware and software stacks for specific types of deep learning. Instead of taking a [computationally] expensive algorithm as given, I'm saying, 'Let's revisit the algorithm.'
To do this, the scientists took an OpenMP-based C++ engine called SLIDE, adapted it to Intel's AVX512 and AVX512_BF16 instructions, and the results were remarkable.
With SLIDE, matrix multiplication might not be the way forward
The engine builds on locality-sensitive hashing (LSH), replacing much of the dense matrix multiplication with accelerated hash-table lookups that select only the relevant neurons, which according to study co-author Shabnam Daghaghi is why a CPU can outperform a GPU.
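To make the idea concrete, here is a minimal, hypothetical sketch of SimHash-style LSH bucketing, one common LSH family; the paper's actual hash functions and table layout may differ. Neuron weight vectors are hashed by the signs of their dot products with random hyperplanes, and at run time only the neurons whose bucket matches the input's bucket are activated, skipping the full matrix product:

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Hypothetical illustration: SimHash-style LSH table for sparse neuron
// selection. Each hyperplane contributes one sign bit to the bucket key.
struct LshTable {
    int num_planes;
    std::vector<std::vector<float>> planes;                  // random hyperplanes
    std::unordered_map<uint64_t, std::vector<int>> buckets;  // key -> neuron ids

    LshTable(int dim, int k, unsigned seed = 42) : num_planes(k) {
        std::mt19937 rng(seed);
        std::normal_distribution<float> gauss(0.f, 1.f);
        for (int p = 0; p < k; ++p) {
            std::vector<float> plane(dim);
            for (float& v : plane) v = gauss(rng);
            planes.push_back(std::move(plane));
        }
    }

    // Bucket key: one sign bit per hyperplane.
    uint64_t key(const std::vector<float>& v) const {
        uint64_t h = 0;
        for (int p = 0; p < num_planes; ++p) {
            float dot = 0.f;
            for (size_t i = 0; i < v.size(); ++i) dot += planes[p][i] * v[i];
            h = (h << 1) | (dot >= 0.f ? 1u : 0u);
        }
        return h;
    }

    void insert(int neuron_id, const std::vector<float>& weights) {
        buckets[key(weights)].push_back(neuron_id);
    }

    // Candidate neurons for this input: those sharing its bucket. Similar
    // vectors land in the same bucket with high probability.
    const std::vector<int>& query(const std::vector<float>& input) {
        return buckets[key(input)];
    }
};
```

Because only the handful of neurons returned by `query` are evaluated, the per-sample work becomes a few cheap dot products plus a hash lookup, which suits a CPU's large caches far better than a dense GEMM suits it.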
To achieve faster hashing, the researchers vectorized and quantized the algorithm for these instructions, aware of their potential, and improved the memory allocation at several points in the algorithm.
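The bfloat16 format that AVX512_BF16 operates on is simply the top 16 bits of an IEEE-754 float32 (same 8-bit exponent, mantissa cut to 7 bits). As a hedged, hardware-independent sketch of what quantizing to bf16 means, the conversion can be written in portable C++ as a round-and-truncate of the bit pattern (the actual engine would use the hardware instructions, not this scalar code):

```cpp
#include <cstdint>
#include <cstring>

// Sketch: convert float32 -> bfloat16 by keeping the high 16 bits,
// with round-to-nearest-even on the discarded low mantissa bits.
uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t rounding = 0x7FFF + ((bits >> 16) & 1);  // nearest-even bias
    return static_cast<uint16_t>((bits + rounding) >> 16);
}

// Widening bf16 -> float32 is exact: shift the 16 bits back up.
float bf16_to_float(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

Halving the storage per value doubles the effective memory bandwidth and SIMD width, which is exactly where a bandwidth-bound hashing workload gains the most.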
«We take advantage of CPU innovations [AVX512 and AVX512_BF16] to take SLIDE even further, proving that if you are not obsessed with matrix multiplications, you can harness the power of modern CPUs and train AI models four to 15 times faster than the best alternative specialized hardware.»
Turning to the comparative performance data: an Intel Cooper Lake processor outperforms an entire NVIDIA Tesla V100 by 7.8 times on Amazon-670K, 5.2 times on WikiLSHTC-325K, and nearly 15.5 times on Text8.
Even a Cascade Lake CPU manages more than double the performance of the NVIDIA V100. It remains to be seen how an A100 would fare against an Intel Cooper Lake CPU running SLIDE, but it would quite likely still trail behind. Given that processors are in every machine, we could be facing a paradigm shift in the AI sector, a change in trend that would hurt NVIDIA and AMD in favor (for now) of Intel.