In the early days of training speech models, we needed a label for every frame of the training audio, and those frame-level labels were typically produced by traditional HMM/GMM systems. The labeled data was then used to train the neural model. The end-to-end solution is to remove this non-neural preprocessing stage: using CTC with an RNN, we can train the speech model directly on data that has not been aligned to the frame level, without relying on other models (HMM, GMM) to bootstrap the neural network.

In traditional speech recognition pipelines, we often have to strictly align the text with the speech before training. Although there are already mature open-source forced-alignment tools available, as deep learning became more and more popular, people naturally began to ask: can we let the network learn the alignment by itself? That is why CTC was born.

Think about it: why doesn't CTC need the speech and text to be aligned? Because CTC allows the neural network to predict a label at any time step, with only one requirement: the labels in the output sequence must appear in the correct order. So we do not need to strictly align the text with the speech, and since CTC's output is already the entire label sequence, we do not need any frame-level post-processing either. An example of using CTC on a piece of audio and aligning it with text is shown below:
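Concretely, the rule CTC uses to turn a frame-level output path into a label sequence is: merge consecutive repeated symbols, then remove blanks. A minimal sketch of that collapsing function (using `-` as an assumed notation for the CTC blank symbol):

```python
BLANK = "-"  # assumed notation for the CTC blank symbol

def ctc_collapse(path):
    """Map a frame-level CTC path to its label sequence:
    1) merge consecutive repeated symbols, 2) drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:  # new symbol that is not a blank
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_collapse("hh-e-ll-ll-oo"))  # -> "hello"
print(ctc_collapse("hheelloo"))       # -> "helo" (repeats need a blank between them)
```

Note that the blank is what lets CTC emit genuinely repeated labels: `ll-ll` collapses to `ll`, while `llll` collapses to a single `l`.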