Show HN: Lightweight Llama3 Inference Engine – CUDA C

Hey, I recently took inspiration from llama.cpp, ollama, and many other similar tools that enable local inference of LLMs, and I just finished building a Llama inference engine for the 8B model in CUDA C. I wanted to explore my newfound interest in CUDA programming along with my passion for machine learning.

The project only uses the native CUDA runtime API and cuda_fp16. Inference runs in fp16, so it requires around 17-18GB of VRAM (8B parameters at 2 bytes each is roughly 16GB for the model weights, plus some more for intermediate caches). It doesn't use cuBLAS or any similar libraries, since I wanted to be exposed to as little abstraction as possible. Hence, it isn't as optimized as a cuBLAS implementation or as other inference engines like the ones that inspired the project.

## *A brief overview of the implementation*

The engine is written in CUDA C and reads a .safetensors file of the model that you can pull from HuggingFace. The kernels for normalization, skip connections, RoPE, and the activation function (SiLU) are fairly straightforward; a sketch of what such an elementwise kernel can look like is at the end of this post. For GEMM, I got as far as implementing tiled matrix multiplication with vectorized retrieval for each thread. The GEMM kernel is also written so that the second matrix does not need to be pre-transposed while still achieving coalesced memory access to HBM (see the second sketch below).

Feel free to have a look at the project repo and try it out if you're interested. If you like what you see, feel free to star the repo too! I highly appreciate any feedback, positive or constructive.

https://ift.tt/XsUfCzF
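
As a small addendum, here is a minimal sketch of what one of the straightforward elementwise kernels can look like: SiLU applied to an fp16 tensor, with the math done in fp32 for stability. The kernel name, signature, and launch configuration here are illustrative, not necessarily what the repo uses.

```c
#include <cuda_fp16.h>

// Elementwise SiLU: y = x * sigmoid(x).
// Values are stored as fp16 but computed in fp32.
// Hypothetical sketch, not the repo's exact kernel.
__global__ void silu_kernel(const __half* __restrict__ in,
                            __half* __restrict__ out,
                            int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = __half2float(in[i]);
        float y = x / (1.0f + expf(-x));   // x * sigmoid(x)
        out[i] = __float2half(y);
    }
}

// Example launch: one thread per element.
// silu_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```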
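
And here is a simplified sketch of the tiled-GEMM idea of consuming the second matrix without pre-transposing it while keeping global loads coalesced, assuming the weights are stored row-major as [N x K] (the usual layout for LLM weight matrices) so the kernel effectively computes C = A * B^T. It leaves out the per-thread vectorized loads mentioned above, and the names and tile size are my own; the repo's kernel may differ.

```c
#include <cuda_fp16.h>

#define TILE 16

// Tiled GEMM computing C[M x N] = A[M x K] * B^T, where B is stored
// row-major as [N x K], so B never needs to be transposed beforehand.
// Both tiles are loaded with consecutive threads reading consecutive
// addresses along the contiguous K dimension, keeping global-memory
// accesses coalesced. fp16 in, fp32 accumulate. Illustrative sketch.
__global__ void gemm_abT(const __half* __restrict__ A,   // [M x K]
                         const __half* __restrict__ B,   // [N x K]
                         __half* __restrict__ C,         // [M x N]
                         int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE + 1];   // +1 pad avoids bank conflicts

    int row = blockIdx.y * TILE + threadIdx.y;   // index into M
    int col = blockIdx.x * TILE + threadIdx.x;   // index into N

    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int k = t * TILE + threadIdx.x;          // K offset for this thread

        // A tile: consecutive threadIdx.x reads consecutive A addresses.
        As[threadIdx.y][threadIdx.x] =
            (row < M && k < K) ? __half2float(A[row * K + k]) : 0.0f;

        // B tile: row of B selected by threadIdx.y, so consecutive
        // threadIdx.x again walks B's contiguous K dimension.
        int brow = blockIdx.x * TILE + threadIdx.y;
        Bs[threadIdx.y][threadIdx.x] =
            (brow < N && k < K) ? __half2float(B[brow * K + k]) : 0.0f;

        __syncthreads();

        // Dot product over this K tile; Bs is read "transposed" here,
        // which is what lets us skip transposing B in global memory.
        for (int kk = 0; kk < TILE; ++kk)
            acc += As[threadIdx.y][kk] * Bs[threadIdx.x][kk];

        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = __float2half(acc);
}
```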