

For the loss function, we evaluate the model extensively with three classification objectives and three metric learning objectives. An ideal embedding compresses the frame-level features into a compact utterance-level representation that maximizes the inter-class distance and minimizes the intra-class distance.
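
To make the embedding objective concrete, the following is a minimal sketch on toy data, not the paper's model. It assumes simple mean pooling as the aggregation step and cosine distance as the metric; the speaker labels `spk_a` and `spk_b` and the two-dimensional features are hypothetical.

```python
import math

def mean_pool(frames):
    """Average frame-level feature vectors into one utterance-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Hypothetical frame-level features: two utterances for each of two speakers.
utterances = {
    "spk_a": [[[1.0, 0.1], [0.9, 0.0]], [[1.0, 0.0], [0.8, 0.2]]],
    "spk_b": [[[0.0, 1.0], [0.1, 0.9]], [[0.2, 1.0], [0.0, 0.8]]],
}
embeddings = {
    spk: [mean_pool(utt) for utt in utts] for spk, utts in utterances.items()
}

# Intra-class: same speaker, different utterances.
intra = cosine_distance(embeddings["spk_a"][0], embeddings["spk_a"][1])
# Inter-class: different speakers.
inter = cosine_distance(embeddings["spk_a"][0], embeddings["spk_b"][0])
print(f"intra-class distance: {intra:.4f}, inter-class distance: {inter:.4f}")
```

A well-trained embedding should show the same pattern on real data: the intra-class distance stays small while the inter-class distance stays large.
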
“Open-set” speaker verification is essentially a metric learning problem, because the speakers encountered at test time are not seen during training. The network model is based on ResNet-34 and ResNet-50, both of which we fine-tune. Building on the original research, this paper uses the mainstream end-to-end approach for the speaker verification part. Few studies combine speaker recognition and speech recognition for Tibetan; those that exist mainly use non-end-to-end methods, and their model performance is not ideal.
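
The metric learning view of open-set verification can be sketched as a comparison between an enrolled speaker model and a test embedding. The sketch below assumes cosine scoring against an averaged enrollment embedding; the function names `enroll` and `verify`, the threshold of 0.7, and the three-dimensional vectors are illustrative assumptions, standing in for embeddings that a trained network would produce.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def enroll(embeddings):
    """Average length-normalized enrollment embeddings into one speaker model."""
    normed = [l2_normalize(e) for e in embeddings]
    mean = [sum(col) / len(normed) for col in zip(*normed)]
    return l2_normalize(mean)

def verify(speaker_model, test_embedding, threshold=0.7):
    """Accept if cosine similarity to the enrolled model reaches the threshold."""
    test = l2_normalize(test_embedding)
    score = sum(a * b for a, b in zip(speaker_model, test))
    return score >= threshold, score

# Hypothetical embeddings; the speaker needs no class label from training.
model = enroll([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]])
accepted_same, score_same = verify(model, [0.8, 0.1, 0.1])
accepted_other, score_other = verify(model, [0.0, 0.2, 0.9])
```

Because the decision depends only on a distance between embeddings, a new speaker can be enrolled without retraining the network. In practice the threshold would be tuned on a development set.
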

In practical applications, combining speech recognition with speaker recognition achieves a double verification effect, which effectively improves security. In text-prompted speaker recognition the text content is known, so the semantic information and the speaker characteristics in the speech signal can be used for speech recognition and speaker verification, respectively. This addresses the problem of forged recordings in the text-dependent setting.
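
The double verification logic can be sketched as a conjunction of two checks. This is an illustrative sketch only: `double_verify` is a hypothetical helper, the recognized text is assumed to come from a separate speech recognizer, the speaker score from a separate verification model, and the exact string match and the threshold of 0.7 are simplifying assumptions.

```python
def normalize_text(text):
    """Collapse whitespace so trivial spacing differences do not cause a mismatch."""
    return " ".join(text.split())

def double_verify(prompt_text, recognized_text, speaker_score, threshold=0.7):
    """Accept only if the spoken content matches the prompt AND the voice matches."""
    text_ok = normalize_text(prompt_text) == normalize_text(recognized_text)
    speaker_ok = speaker_score >= threshold
    return text_ok and speaker_ok

prompt = "three seven nine"  # hypothetical prompt chosen at random per session

# Genuine speaker reads the prompt correctly.
genuine = double_verify(prompt, "three  seven nine", speaker_score=0.91)
# Replayed recording of the genuine speaker saying something else.
replay = double_verify(prompt, "one two four", speaker_score=0.93)
# Impostor reads the correct prompt.
impostor = double_verify(prompt, "three seven nine", speaker_score=0.22)
```

Since the prompt changes between sessions, a previously captured recording of the target speaker fails the text check even though its speaker score is high.
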
