The excellent performance of deep learning in many fields opens new possibilities for business, which in turn raises the bar for computing power. Single-machine training can no longer meet the computing requirements of large-scale sample data and training models: training on large datasets often takes several days, which greatly limits the delivery efficiency of AI applications. A multi-machine distributed strategy can effectively accelerate training tasks and shorten training time. Taking the characteristics of both AI training and Kubernetes into consideration, AISpike optimizes the submission and scheduling strategies for distributed training tasks.
• Design and develop operators for TensorFlow, PyTorch, MXNet, and Caffe, so that distributed training can be launched quickly on Kubernetes (see the first sketch after this list).
• Optimize the gang-scheduling strategy in Kubernetes to ensure fast and accurate allocation of training resources (second sketch below).
• Provide an urgent-task privilege strategy and a user-group polling strategy for queue management (third sketch below).
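
To make the operator-based submission concrete, the following sketch submits a distributed PyTorch job through the Kubernetes Python client. AISpike's own operator CRDs are not specified in this text, so the community Kubeflow PyTorchJob CRD (group kubeflow.org, plural pytorchjobs) is used as a stand-in; the namespace, job name, and container image are hypothetical.

```python
# Minimal sketch: submitting a distributed training job to a framework
# operator. Assumes the Kubeflow PyTorchJob CRD as a stand-in for
# AISpike's operators; namespace and image are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

replica = {
    "template": {
        "spec": {
            "containers": [{
                "name": "pytorch",
                "image": "example/resnet50:latest",  # hypothetical image
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }]
        }
    }
}

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "resnet50-ddp", "namespace": "training"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {"replicas": 1, **replica},
            "Worker": {"replicas": 3, **replica},
        }
    },
}

# The operator watches PyTorchJob objects and creates the master/worker
# pods, services, and environment needed for distributed training.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="training",
    plural="pytorchjobs", body=pytorch_job,
)
```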
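
The gang-scheduling strategy can be illustrated with a minimal all-or-nothing placement check: a distributed job's pods are bound only if every pod in the gang fits, so a task never starts with a subset of its workers and holds GPUs in a deadlock. Node names, capacities, and the Job type below are illustrative, not AISpike's actual scheduler code.

```python
# Minimal sketch of gang scheduling: place ALL pods of a job or none.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    pods: int          # number of workers in the gang
    gpus_per_pod: int

def try_gang_schedule(job: Job, free_gpus: dict[str, int]) -> dict[str, int] | None:
    """Return a {node: pod_count} placement covering ALL pods, or None."""
    placement: dict[str, int] = {}
    remaining = job.pods
    # Greedily fill the emptiest nodes first (illustrative policy only).
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        fit = min(remaining, free // job.gpus_per_pod)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement
    return None  # the whole gang cannot be placed: schedule nothing

free = {"node-a": 4, "node-b": 2, "node-c": 1}
print(try_gang_schedule(Job("resnet50-ddp", pods=4, gpus_per_pod=1), free))
print(try_gang_schedule(Job("big-job", pods=16, gpus_per_pod=1), free))  # None
```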
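
The urgent-task privilege and user-group polling strategies can similarly be sketched as a two-level queue, under assumed semantics: urgent tasks jump ahead of all regular work, and otherwise user groups are polled in round-robin order so no single group monopolizes the cluster. Group names and task identifiers are hypothetical.

```python
# Minimal sketch: urgent-task privilege plus round-robin polling across
# user groups. Semantics are assumed; names are hypothetical.
from collections import deque
from itertools import cycle

class GroupPollingQueue:
    def __init__(self, groups: list[str]):
        self.queues = {g: deque() for g in groups}
        self.urgent: deque = deque()
        self._poll = cycle(groups)

    def submit(self, group: str, task: str, urgent: bool = False) -> None:
        (self.urgent if urgent else self.queues[group]).append(task)

    def next_task(self) -> str | None:
        # Privilege strategy: urgent tasks always dispatch first.
        if self.urgent:
            return self.urgent.popleft()
        # Polling strategy: visit each group once, round-robin.
        for _ in range(len(self.queues)):
            group = next(self._poll)
            if self.queues[group]:
                return self.queues[group].popleft()
        return None

q = GroupPollingQueue(["vision", "nlp"])
q.submit("vision", "train-resnet")
q.submit("vision", "train-vit")
q.submit("nlp", "train-bert")
q.submit("nlp", "hotfix-eval", urgent=True)
print([q.next_task() for _ in range(4)])
# ['hotfix-eval', 'train-resnet', 'train-bert', 'train-vit']
```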
Deploying distributed training through the AISpike platform lets algorithm engineers focus on models and hyperparameter tuning, while improving cluster resource utilization and deep learning training performance.