A lot of the optimisation at this level is getting data into the right place at the right time, without killing the network.
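
To make that concrete, here's a toy double-buffered prefetcher (names made up, a thread and a sleep standing in for the real transport): a bounded queue keeps the next batch in flight while compute consumes the current one, and the queue depth is the backpressure that stops you flooding the network.

    import queue
    import threading
    import time

    def prefetcher(fetch_batch, num_batches, depth=2):
        # At most `depth` batches in flight: transfers overlap compute
        # without saturating the network.
        q = queue.Queue(maxsize=depth)

        def worker():
            for i in range(num_batches):
                q.put(fetch_batch(i))  # blocks once `depth` batches are queued
            q.put(None)  # sentinel: no more data

        threading.Thread(target=worker, daemon=True).start()
        while (batch := q.get()) is not None:
            yield batch

    def fetch_batch(i):  # stand-in for a network/storage read
        time.sleep(0.05)
        return [i] * 4

    for batch in prefetcher(fetch_batch, num_batches=3):
        print("compute on", batch)  # runs while the next fetch is in flight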

It's also a group effort to provide simple-to-use primitives that "normal" ML people can use, even if they've never used hyperscale clusters before.

So you need a good scheduler that understands dependencies (no, the k8s schedulers are shit for this, and they won't scale past 1k nodes without eating all of your network bandwidth), then you need a dataloader that can provide the dataset access, then you need the IPC that allows sharing/joining of GPUs together.
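
The dependency part, as a stdlib-only toy (job names invented): a job gets released only once everything it depends on has finished, which is the ordering the k8s scheduler won't give you out of the box.

    from graphlib import TopologicalSorter

    # Toy job graph: each value is the set of jobs the key depends on.
    jobs = {
        "shard_dataset": set(),
        "warm_cache":    {"shard_dataset"},
        "train":         {"warm_cache"},
        "eval":          {"train"},
    }

    ts = TopologicalSorter(jobs)
    ts.prepare()
    while ts.is_active():
        for job in ts.get_ready():    # dependencies all satisfied
            print("launching", job)   # a real scheduler places this on nodes
            ts.done(job)              # here we pretend it finishes instantly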

All of that needs to be wrapped up into a Python interface that's fairly simple to use.
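
Roughly what that user-facing side could look like (purely hypothetical API, not any real library): a couple of calls that hide the scheduler, dataloader and GPU IPC behind them.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Cluster:
        # Hypothetical facade over the scheduler/dataloader/IPC layers.
        nodes: int
        gpus_per_node: int

        def run(self, train_step: Callable[[int, list], float], dataset: list) -> None:
            # A real implementation would shard the dataset across nodes and
            # wire up GPU IPC; this just shows the shape the user sees.
            for step, batch in enumerate(dataset):
                loss = train_step(step, batch)
                print(f"step {step}: loss={loss:.3f}")

    # The "normal" ML person only ever writes this part:
    def train_step(step: int, batch: list) -> float:
        return 1.0 / (step + 1)  # stand-in for a forward/backward pass

    Cluster(nodes=128, gpus_per_node=8).run(train_step, dataset=[[0], [1], [2]])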

Oh, and it needs to be secure, pass an FCC audit (i.e. you need to prove that no user data is being used), and have high utilisation efficiency and uptime.

The model stuff is the cherry on top.