My project is a voice-controlled email website. The user speaks commands through the browser microphone. The input audio should not be stored as a file; instead, it should be streamed directly to Hugging Face's Whisper model via the Inference API. The model converts the speech to text so that further processing can be done. I'll provide the Inference API JavaScript code below, but it expects a file to read rather than a stream, so I need help modifying this code as well:
const fs = require("fs"); // Node-only; this is the part I need to replace for the browser

// Reads the audio file and POSTs its raw bytes to the Whisper model,
// returning the transcription JSON.
async function query(filename) {
  const data = fs.readFileSync(filename);
  const response = await fetch(
    "https://api-inference.huggingface.co/models/openai/whisper-medium",
    {
      headers: { Authorization: "Bearer ...." }, // token elided
      method: "POST",
      body: data,
    }
  );
  const result = await response.json();
  return result;
}

query("sample1.flac").then((response) => {
  console.log(JSON.stringify(response));
});
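From what I can tell, the endpoint simply reads the raw audio bytes from the POST body, so I imagine a browser-side version could take a Blob instead of a filename. Here is my rough sketch (queryBlob and HF_TOKEN are my own placeholder names, and I haven't verified this against the API):

  // Browser-side sketch: send recorded audio bytes (a Blob) instead of reading a file.
  // HF_TOKEN is a placeholder for my actual API token.
  const HF_TOKEN = "hf_....";

  async function queryBlob(audioBlob) {
    const response = await fetch(
      "https://api-inference.huggingface.co/models/openai/whisper-medium",
      {
        headers: { Authorization: `Bearer ${HF_TOKEN}` },
        method: "POST",
        body: audioBlob, // fetch sends the Blob's bytes as the request body
      }
    );
    return await response.json();
  }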
So, keeping in mind that the audio is to be streamed rather than saved, how do I record the user's input in the browser and stream it to Hugging Face?
So far, the most likely solution I have found is this article: Building a client-side web app which streams audio from a browser microphone to a server (Part II). However, that article has the client send the audio to a separately built intermediate server, which then makes the API calls to Dialogflow.
I need the same functionality, but without the intermediate server: the audio should go from the browser directly to Hugging Face's existing servers via their Inference API.
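For reference, this is the rough client-side recording flow I have in mind. It is a sketch only: it assumes the Inference API can decode the browser's default audio/webm recording format, and it reuses the hypothetical queryBlob function from above.

  // Sketch: capture mic audio with MediaRecorder and send it to the Inference API.
  // Assumes the API can decode audio/webm (the browser default); queryBlob is from above.
  async function recordAndTranscribe(durationMs = 5000) {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const recorder = new MediaRecorder(stream);
    const chunks = [];

    recorder.ondataavailable = (event) => chunks.push(event.data);

    const done = new Promise((resolve) => {
      recorder.onstop = async () => {
        // Combine the recorded chunks into one Blob and send it off.
        const audioBlob = new Blob(chunks, { type: recorder.mimeType });
        resolve(await queryBlob(audioBlob));
      };
    });

    recorder.start();
    setTimeout(() => {
      recorder.stop();
      stream.getTracks().forEach((track) => track.stop()); // release the mic
    }, durationMs);

    return done;
  }

  recordAndTranscribe().then((result) => console.log(result));

I realize this buffers the whole recording in memory before sending it rather than streaming chunk by chunk; nothing is written to a file, but I'd like to know whether true streaming to this endpoint is possible at all.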