Alkemet News

brucethemoose2 | 2 years ago
You can essentially already do this with llama.cpp and SSD swap, and much more quickly. Technically it only does the prompt processing and a few layers on the GPU (if that), but honestly that is better, since it avoids all the transfers over the GPU bus.
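As a rough sketch of what this looks like in practice (the model path is hypothetical; `-m`, `-ngl`, and `-p` are real llama.cpp CLI flags, and weights are mmap'd from disk by default, so an SSD effectively backs whatever doesn't fit in RAM):

```shell
# Hypothetical model file; llama.cpp mmaps the GGUF from disk, so layers
# that don't fit in RAM are paged in from the SSD on demand.
MODEL="models/llama-2-70b.Q4_K_M.gguf"

# -ngl (--n-gpu-layers) offloads only a handful of layers to the GPU,
# keeping the rest on CPU/SSD and minimizing PCIe transfers.
CMD="./llama-cli -m $MODEL -ngl 8 -p 'Hello'"
echo "$CMD"
```

The exact layer count to offload depends on available VRAM; the point is that partial offload plus mmap gives you the same "bigger than RAM" behavior without shuttling every layer over the bus.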

And you can do it in MLC on your iGPU, if you have enough CPU RAM to fit the model.

Running a 70B very slowly is nothing new. To be blunt, this strategy is a bad idea compared to newer implementations.