It's expected that demand for AI compute will grow by 750x over the next 5 years, while hardware performance will grow by only 12x over the same period, a shortfall of more than 60x. The gap is huge, and it's no longer theoretical: AI developers can't get GPUs from AWS, Azure, or Google Cloud anymore.
How did it start?
Since OpenAI released ChatGPT, demand for GPUs has grown exponentially. Everybody now wants to build a product on top of the famous language model and become the next unicorn. And let's not forget computer vision, video, audio, GANs, and other AI workloads that require not just one GPU, but many of them.
The second reason is the pandemic. COVID-19 gave AI development a boost, but it also disrupted the global supply chain, causing delays and shortages across many industries, including microchip production.
The last reason is the low barrier to entry for becoming a data scientist. It's now very easy to become an AI developer and make good money: take an online course, grab a ready-to-go model and dataset, and run it on a GPU in the cloud. The clouds are open to everyone.
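To see just how low that barrier is, here's a minimal sketch of the workflow using a stock pretrained model from torchvision. The model choice and random input are illustrative stand-ins, not part of any specific course or product:

```python
# Pull a pretrained model and run it on a cloud GPU: the whole
# "instant AI developer" workflow in a dozen lines.
import torch
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).to(device).eval()

x = torch.randn(1, 3, 224, 224, device=device)  # stand-in for a real image batch
with torch.no_grad():
    probs = model(x).softmax(dim=1)
print(probs.argmax(dim=1))  # predicted class index
```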
So what should you do if you're hitting quota limits and can't get enough GPUs?
1. Obviously, buy your own hardware!
Sounds logical, but it's not that easy. It's expensive, and you take on the headache of building and managing your own hardware infrastructure. GPUs are also in short supply and aren't easily available from Nvidia or its resellers.
2. Use multiple clouds
That sounds better. Create multiple cloud accounts and get a few GPUs here and there: 4 GPUs in AWS, 1 in GCP, 2 in Azure, and now you have 7 GPUs, which is definitely better than 1. The catch: managing one cloud is hard enough, and doing it across three at the same time can drive any DevOps engineer insane.
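To give a feel for that overhead, here's a hedged sketch of just the AWS leg of such a setup. The region, AMI id, and instance type below are placeholders, not recommendations, and GCP and Azure would each need an entirely different SDK, credential model, and quota system:

```python
# Launch one GPU instance on AWS with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: a Deep Learning AMI id
    InstanceType="p3.8xlarge",        # 4x V100 GPUs
    MinCount=1,
    MaxCount=1,
)
print(resp["Instances"][0]["InstanceId"])

# For GCP you would switch to google-cloud-compute (InstancesClient.insert),
# for Azure to azure-mgmt-compute (begin_create_or_update): three different
# clients, auth flows, and machine-type naming schemes to keep in sync.
```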
3. Make your existing GPUs work like it's 100s of them
What if you could take your 4 available GPUs in AWS and make them work like hundreds of them? Sounds insane, but that's what the Scaletorch Deep Learning Offload Processor does: it can speed up AI model training by 10x-200x on the same number of GPUs. Say one epoch of your experiment currently takes 4 hours; with Scaletorch it's only 13 minutes, roughly an 18x speedup. That keeps you productive within the quota you already have.
On top of the speedup and automation, Scaletorch also supports multi-cloud automation. As mentioned above, multi-cloud can bring you more GPUs, and Scaletorch will not only automate running your jobs across multiple clouds simultaneously, but also speed them up by applying trial parallelization and distributed training. Put all of these techniques together, and your 7 GPUs across three clouds can work like 100.
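For context on the distributed-training piece, here's a minimal PyTorch DistributedDataParallel sketch of the kind of multi-GPU training loop such a platform automates. The model and data are trivial stand-ins, and this is plain PyTorch, not Scaletorch's own API:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = torch.nn.Linear(512, 10).to(device)  # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 512, device=device)          # stand-in batch
        y = torch.randint(0, 10, (32,), device=device)   # stand-in labels
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced across GPUs automatically
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=4 train.py`, each process drives one GPU and gradients are averaged across all of them, which is the hand-written version of what the platform sets up for you.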
You probably don't believe it, so why not try the platform yourself?