Large Language Models: Enhancing Capabilities with Audio Encoder

Jul 19, 2023

Large Language Models (LLMs) have become increasingly popular since the introduction of OpenAI’s ChatGPT. These models excel at tasks such as answering questions, summarizing text, and translating between languages, building on sub-fields of Artificial Intelligence such as Natural Language Processing and Natural Language Understanding.

LLMs are trained by predicting the next word across vast amounts of text data. This training lets them encode a significant amount of knowledge about the world within their neural networks, which makes them useful for a wide range of tasks.
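As a rough illustration of this training objective (not the specific recipe used in the research), next-word prediction is typically implemented as a cross-entropy loss in which the token at each position is predicted from everything that precedes it. The PyTorch sketch below is illustrative only.

```python
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Illustrative next-word prediction loss.

    logits:    (batch, seq_len, vocab_size) model outputs
    token_ids: (batch, seq_len) input token ids
    """
    # Each position is scored against the token that follows it.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```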

Recent research takes LLM capabilities a step further by attaching an audio encoder to the model, enabling it to perform automatic speech recognition (ASR), that is, to transcribe speech into text. The audio encoder’s outputs are fed directly into the LLM alongside the existing text token embeddings, so the model gains speech recognition abilities on top of its text-based ones.
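The article does not include the model code, but the basic idea can be sketched as follows: project the audio encoder’s output into the LLM’s embedding dimension and prepend it to the text token embeddings as a sequence of "soft" tokens. The class and attribute names here are hypothetical, and the `inputs_embeds` call assumes a Hugging Face-style LLM interface.

```python
import torch
import torch.nn as nn

class AudioPrefixLLM(nn.Module):
    """Hypothetical wrapper: audio embeddings are prepended to text embeddings."""

    def __init__(self, audio_encoder, llm, audio_dim, llm_dim):
        super().__init__()
        self.audio_encoder = audio_encoder            # produces (batch, frames, audio_dim)
        self.llm = llm                                # a decoder-only LLM such as LLaMA-7B
        self.project = nn.Linear(audio_dim, llm_dim)  # map audio features into the LLM's embedding space

    def forward(self, audio_features, text_token_ids):
        # Encode the audio and project it to the LLM's embedding size.
        audio_emb = self.project(self.audio_encoder(audio_features))
        # Look up the ordinary text token embeddings.
        text_emb = self.llm.get_input_embeddings()(text_token_ids)
        # Concatenate audio "soft tokens" in front of the text and run the LLM on embeddings.
        inputs = torch.cat([audio_emb, text_emb], dim=1)
        return self.llm(inputs_embeds=inputs)
```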

The research team demonstrated the effectiveness of this approach by analyzing the outputs of the audio encoder and confirming that the audio embeddings align with the corresponding text tokens. Evaluating on the Multilingual LibriSpeech (MLS) dataset, the team found that the augmented model, built on LLaMA-7B, outperformed monolingual baselines by 18% on speech recognition.

In addition to this performance evaluation, the research explored other aspects of the augmented LLM. Ablation trials showed that the model still performs well on multilingual ASR even when the LLM is kept frozen during training, that is, without updating its parameters.
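In practice, freezing the LLM usually means disabling gradients for its weights so that only the audio encoder and the projection layer are updated. The snippet below is a hedged sketch that reuses the hypothetical AudioPrefixLLM from the earlier example.

```python
import torch

def freeze_llm(model):
    """Disable gradient updates for the LLM; the audio encoder and projection still train."""
    for param in model.llm.parameters():
        param.requires_grad = False

# Usage with the hypothetical AudioPrefixLLM sketched above:
#   freeze_llm(model)
#   trainable = [p for p in model.parameters() if p.requires_grad]
#   optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```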

The team also investigated the effects of scaling up the audio encoder and of adjusting how the audio is split into frames, with the aim of making the ASR system more efficient and effective. The results showed that LLMs can process long-form audio inputs even with larger audio encoders or longer frame strides.
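One common way to lengthen the stride is to stack consecutive encoder frames into a single wider frame, which shortens the sequence the LLM has to attend over. The helper below is only an illustrative sketch of that idea, not the paper’s implementation.

```python
import torch

def stack_frames(features, stride):
    """Stack every `stride` consecutive audio frames into one wider frame.

    features: (batch, num_frames, dim) audio encoder outputs
    Returns a tensor of shape (batch, num_frames // stride, dim * stride).
    """
    batch, num_frames, dim = features.shape
    usable = (num_frames // stride) * stride   # drop any trailing remainder
    features = features[:, :usable, :]
    return features.reshape(batch, usable // stride, dim * stride)

# Example: a stride of 4 turns 1,200 frames of size 512 into 300 frames of size 2,048.
x = torch.randn(2, 1200, 512)
print(stack_frames(x, 4).shape)  # torch.Size([2, 300, 2048])
```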

Overall, the research demonstrates the promise of pairing LLMs with audio encoders to enhance multilingual ASR capabilities. With these advances in audio processing, LLMs have the potential to handle a wide range of audio-based tasks effectively and efficiently.