Show HN: Lightweight Llama3 Inference Engine – CUDA C

Hey, I recently took inspiration from llama.cpp, ollama, and many other similar tools that enable local inference of LLMs, and I just finished building a Llama inference engine for the 8B model in CUDA C. I wanted to explore my newfound interest in CUDA programming along with my passion for machine learning.

The project only uses the native CUDA runtime API and cuda_fp16. Inference runs in fp16, so it requires around 17-18GB of VRAM (8B parameters at 2 bytes each is roughly 16GB for the model weights, plus some more for intermediate caches). It doesn't use cuBLAS or any similar libraries, since I wanted to be exposed to as little abstraction as possible. Hence, it isn't as optimized as a cuBLAS implementation or as other inference engines like the ones that inspired the project.

## *A brief overview of the implementation*

The engine is written in CUDA C and reads a .safetensors file of the model that you can pull from HuggingFace. The kernels for normalization, skip connections, RoPE, and the activation function (SiLU) are fairly straightforward; a sketch of what such an elementwise kernel can look like is at the end of this post. For GEMM, I got as far as implementing tiled matrix multiplication with vectorized retrieval for each thread. The GEMM kernel is also written so that the second matrix does not need to be pre-transposed while still achieving coalesced memory access to HBM (see the second sketch below).

Feel free to have a look at the project repo and try it out if you're interested. If you like what you see, feel free to star the repo too! I highly appreciate any feedback, positive or constructive.

https://ift.tt/XsUfCzF
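
As a small addendum, here is a minimal sketch of what one of the straightforward elementwise kernels can look like: SiLU applied to an fp16 tensor, with the math done in fp32 for stability. The kernel name, signature, and launch configuration here are illustrative, not necessarily what the repo uses.

```c
#include <cuda_fp16.h>

// Elementwise SiLU: y = x * sigmoid(x).
// Values are stored as fp16 but computed in fp32.
// Hypothetical sketch, not the repo's exact kernel.
__global__ void silu_kernel(const __half* __restrict__ in,
                            __half* __restrict__ out,
                            int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = __half2float(in[i]);
        float y = x / (1.0f + expf(-x));   // x * sigmoid(x)
        out[i] = __float2half(y);
    }
}

// Example launch: one thread per element.
// silu_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```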
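
And here is a simplified sketch of the tiled-GEMM idea of consuming the second matrix without pre-transposing it while keeping global loads coalesced, assuming the weights are stored row-major as [N x K] (the usual layout for LLM weight matrices) so the kernel effectively computes C = A * B^T. It leaves out the per-thread vectorized loads mentioned above, and the names and tile size are my own; the repo's kernel may differ.

```c
#include <cuda_fp16.h>

#define TILE 16

// Tiled GEMM computing C[M x N] = A[M x K] * B^T, where B is stored
// row-major as [N x K], so B never needs to be transposed beforehand.
// Both tiles are loaded with consecutive threads reading consecutive
// addresses along the contiguous K dimension, keeping global-memory
// accesses coalesced. fp16 in, fp32 accumulate. Illustrative sketch.
__global__ void gemm_abT(const __half* __restrict__ A,   // [M x K]
                         const __half* __restrict__ B,   // [N x K]
                         __half* __restrict__ C,         // [M x N]
                         int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE + 1];   // +1 pad avoids bank conflicts

    int row = blockIdx.y * TILE + threadIdx.y;   // index into M
    int col = blockIdx.x * TILE + threadIdx.x;   // index into N

    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int k = t * TILE + threadIdx.x;          // K offset for this thread

        // A tile: consecutive threadIdx.x reads consecutive A addresses.
        As[threadIdx.y][threadIdx.x] =
            (row < M && k < K) ? __half2float(A[row * K + k]) : 0.0f;

        // B tile: row of B selected by threadIdx.y, so consecutive
        // threadIdx.x again walks B's contiguous K dimension.
        int brow = blockIdx.x * TILE + threadIdx.y;
        Bs[threadIdx.y][threadIdx.x] =
            (brow < N && k < K) ? __half2float(B[brow * K + k]) : 0.0f;

        __syncthreads();

        // Dot product over this K tile; Bs is read "transposed" here,
        // which is what lets us skip transposing B in global memory.
        for (int kk = 0; kk < TILE; ++kk)
            acc += As[threadIdx.y][kk] * Bs[threadIdx.x][kk];

        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = __float2half(acc);
}
```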