A Large Model (LM) is a deep neural network with millions or billions of parameters that has undergone a specialized training process, enabling it to perform complex processing and tasks on large-scale data.
Large Models consume substantial resources, including compute, storage, time, and power, for both training and deployment. In contrast, Small Models are deep neural networks with far fewer parameters. They run faster, are more lightweight, and are suited to devices or scenarios with limited compute and storage, such as mobile or embedded devices.
In practice, the choice between a large and a small model depends on the problem to be solved and the resources available. Large models typically perform well in natural language processing, computer vision, recommender systems, and similar domains, but they usually require high-performance computing resources such as GPU servers or cloud clusters.
Small models are suited to simple, small-scale problems such as credit card fraud detection; they offer faster inference and can run on low-power devices such as smartphones or IoT devices.
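To make the scale difference concrete, here is a minimal sketch of the kind of small model that could handle a tabular task like fraud detection. The 30-feature input and the layer widths are illustrative assumptions, not a reference design:

```python
import torch
import torch.nn as nn

# A tiny fraud-detection classifier: ~30 tabular transaction features in,
# fraud / not-fraud logits out. Feature count and layer widths are
# illustrative assumptions, not tuned values.
small_model = nn.Sequential(
    nn.Linear(30, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
)

print(sum(p.numel() for p in small_model.parameters()))  # ~530 parameters

# Scoring a single (dummy) transaction runs comfortably on a CPU.
transaction = torch.randn(1, 30)
logits = small_model(transaction)
```

A model of a few hundred parameters like this is many orders of magnitude smaller than a billion-parameter LM, which is why it fits on low-power hardware.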
Problems Large Models Can Solve
Large-scale pre-training efficiently captures knowledge from vast amounts of labeled and unlabeled data, storing it in a huge number of parameters; fine-tuning those parameters for a specific task then greatly extends the model's ability to generalize. Instead of training from scratch, only a small number of samples are needed to fine-tune the model for each new scenario.
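This pattern is easy to see with the Hugging Face transformers library. The sketch below fine-tunes a pre-trained checkpoint on a handful of labeled examples; the bert-base-uncased checkpoint, the two-label setup, and the toy data are assumptions for illustration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load pre-trained weights instead of starting from scratch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # adds a fresh classification head
)

# A tiny fine-tuning set; a real task would use more samples,
# but still far fewer than pre-training did.
texts = ["great movie", "terrible plot"]  # illustrative toy data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # returns a loss when labels are given
outputs.loss.backward()
optimizer.step()
```

Only the small classification head starts from random initialization; everything else begins from the knowledge stored during pre-training.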
For example, suppose BERT has been pre-trained and we want to use it for a downstream task, say sentiment analysis of a sentence. A special classification token ([CLS]) is prepended to BERT's input tokens, the same idea later adopted by ViT. The encoder's output vector for the [CLS] token is passed through a linear transformation and a softmax, and the loss is computed against the ground-truth label. Because this step can initialize directly from BERT's pre-trained parameters and then fine-tune, the results are better: convergence is fast and the final loss is low.
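In code, the [CLS]-based head described above might look like the following PyTorch sketch; the bert-base-uncased checkpoint, the two sentiment labels, and the example sentence are illustrative assumptions:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertSentimentClassifier(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        # Initialize the encoder from BERT's pre-trained parameters.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = outputs.last_hidden_state[:, 0]  # vector at the [CLS] position
        return self.classifier(cls_vec)            # logits for each class

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertSentimentClassifier()

# The tokenizer prepends [CLS] automatically.
batch = tokenizer(["the film was wonderful"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1]))  # ground-truth label
```

Note that cross-entropy loss applies the softmax internally, which is why the head returns raw logits; indexing last_hidden_state at position 0 selects the [CLS] token's output vector.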