Speech Recognition with SaraKIT
SaraKIT is equipped with three microphones and a dedicated audio processor that cleans up the voice signal, making speech recognition on the Raspberry Pi practical and enabling offline, cloud-independent voice commands. Many speech recognition tools work online, and cloud services such as Google Speech to Text are among the best and most accurate (I cover them in another guide). This article, however, focuses on offline speech recognition, with no internet connection required.
In my search for the most capable and easiest-to-configure tool, I found one that is widely recommended for offline speech recognition: the Vosk API.
Vosk Speech Recognition Toolkit
Vosk is an open-source offline speech recognition toolkit that supports more than 20 languages and dialects, including English, German, French, and Spanish. Its models are compact (about 50 MB) yet support continuous large-vocabulary transcription, zero-latency responses through a streaming API, a reconfigurable vocabulary, and speaker identification. Vosk suits a range of applications, from chatbots and smart home devices to virtual assistants and subtitle creation, and scales from small devices like the Raspberry Pi or Android smartphones up to large clusters.
Vosk Homepage: https://alphacephei.com/vosk/
GitHub Vosk: https://github.com/alphacep/vosk-api
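As an example of the reconfigurable vocabulary mentioned above, a recognizer can be restricted to a small grammar, which noticeably improves accuracy for short command sets. The snippet below is only a minimal sketch; the model path and the command phrases are placeholders you would adapt to your own project:

import json
from vosk import Model, KaldiRecognizer

model = Model("models/vosk-model-small-en-us-0.15/")
# Restrict recognition to a few phrases; "[unk]" absorbs anything else.
grammar = json.dumps(["turn on the light", "turn off the light", "stop", "[unk]"])
recognizer = KaldiRecognizer(model, 16000, grammar)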
Installing on SaraKIT:
Assuming the basic SaraKIT drivers are already installed (https://sarakit.saraai.com/getting-started/software), follow these steps:
sudo apt-get install pip
sudo apt-get install -y python3-pyaudio
sudo pip3 install vosk
git clone https://github.com/SaraEye/SaraKIT-Speech-Recognition-Vosk-Raspberry-Pi SpeechRecognition
cd SpeechRecognition
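Before running the recognizer, you may want to confirm that PyAudio can see the SaraKIT microphones. The short sketch below lists all available input devices; the device names and indexes will vary on your system:

import pyaudio

p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:
        print(i, info["name"], int(info["defaultSampleRate"]))
p.terminate()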
To use a language other than English, download the required language model from https://alphacephei.com/vosk/models and place it in the `models` directory.
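After unpacking the model, point the script's model_path at the new folder. For example (the folder name depends on the model you downloaded; the German model here is only an illustration):

model_path = "models/vosk-model-small-de-0.15/"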
Start speech recognition by running:
python SpeechRecognition.py
Below is a script for speech recognition in your chosen language, available at
https://github.com/SaraEye/SaraKIT-Speech-Recognition-Vosk-Raspberry-Pi:
import os
import sys
import json
import contextlib
import pyaudio
import io
from vosk import Model, KaldiRecognizer

# Specify the Vosk model path
model_path = "models/vosk-model-small-en-us-0.15/"
if not os.path.exists(model_path):
    print(f"Model '{model_path}' was not found. Please check the path.")
    exit(1)

model = Model(model_path)

# PyAudio settings
sample_rate = 16000
chunk_size = 8192
format = pyaudio.paInt16
channels = 1

# Initialize PyAudio and the recognizer
p = pyaudio.PyAudio()
stream = p.open(format=format, channels=channels, rate=sample_rate, input=True, frames_per_buffer=chunk_size)
recognizer = KaldiRecognizer(model, sample_rate)

print("\nSpeak now...")

while True:
    data = stream.read(chunk_size)
    if recognizer.AcceptWaveform(data):
        # Final result for the completed utterance
        result_json = json.loads(recognizer.Result())
        text = result_json.get('text', '')
        if text:
            print("\r" + text, end='\n')
    else:
        # Intermediate (partial) result while speech is still being recognized
        partial_json = json.loads(recognizer.PartialResult())
        partial = partial_json.get('partial', '')
        sys.stdout.write('\r' + partial)
        sys.stdout.flush()
The results of this simple but powerful script can be seen in the video below:
If you are already using the full power of the Raspberry Pi, for instance for image analysis, you may find yourself short of processing power for speech recognition. In that case, you can offload recognition to a more powerful machine: either set up your own server and keep using Vosk, or switch to another tool such as Google Speech to Text.
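If you go the self-hosted route, Vosk also provides a WebSocket server (https://github.com/alphacep/vosk-server), which by default listens on port 2700. The sketch below shows roughly what a client could look like; the server address is a placeholder, and the audio here comes from a WAV file rather than the live microphone:

import asyncio
import json
import wave
import websockets

async def recognize(wav_path, server_url="ws://192.168.1.50:2700"):  # placeholder address of your vosk-server
    wf = wave.open(wav_path, "rb")
    async with websockets.connect(server_url) as ws:
        # Tell the server the sample rate of the audio we are about to stream
        await ws.send(json.dumps({"config": {"sample_rate": wf.getframerate()}}))
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            await ws.send(data)
            print(await ws.recv())    # intermediate results as JSON
        await ws.send('{"eof" : 1}')  # signal end of stream
        print(await ws.recv())        # final result

asyncio.run(recognize("test.wav"))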