Why does every Python application on GitHub have its own ad-hoc, informally specified, bug ridden, slow implementation of half of setuptools?
Why does TensorRT distribute the most essential part of what it does in an "examples" directory?
huggingface_cli... man, I already have a way to download something by a name: it's a zip file. In fact, why not make a PyPI index that acts as a facade for these models? We already have so many ways to install and cache read-only binary blobs...
1. Interruption - I need to be able to say "hang on" and have the LLM pause.
2. Wait for a specific cue before responding. I like "What do you think?"
That + low latency are crucial. It needs to feel like talking to another person.
With the implementation of tools for GPT, I could see a way to have the model check whether it thinks it received a complete thought and, if it didn't, send back a signal to keep appending to the buffer until the next long pause. A longer "pregnant pause" timeout could then have the model check in to see if you're done talking or whatever.
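Roughly what I mean, as a sketch only: the tool name, model, and the calling logic below are placeholders, and the real thing would obviously need tuning.

    # Sketch: ask the model, via a forced tool call, whether the buffered
    # transcript is a complete thought. Tool name and model are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "report_thought_complete",
            "description": "Report whether the transcript so far is a complete thought.",
            "parameters": {
                "type": "object",
                "properties": {"complete": {"type": "boolean"}},
                "required": ["complete"],
            },
        },
    }]

    def thought_is_complete(transcript: str) -> bool:
        """Return True if the model thinks the buffered utterance is finished."""
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder
            messages=[
                {"role": "system", "content": "Decide if the user's utterance is a complete thought."},
                {"role": "user", "content": transcript},
            ],
            tools=TOOLS,
            tool_choice={"type": "function", "function": {"name": "report_thought_complete"}},
        )
        call = resp.choices[0].message.tool_calls[0]
        return json.loads(call.function.arguments)["complete"]

    # Caller: on each short pause, only respond if thought_is_complete(buffer);
    # otherwise keep appending audio until the next long ("pregnant") pause.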
Asking because 8x7B Q4_K_M (25GB, GGUF) doesn't seem to be "ultra-low latency" on my 12GB VRAM + RAM. Like, at all. I can imagine running a 7-13GB model with that latency (because I did, but... it's a small model), or using 2x P40 or something. I'm not sure what assumptions they make in the README. Am I missing something? Can you try it without the TTS part?
It would be a live conversation and it can see whatever I’m doing on my screen.
We’re gradually getting closer.
WhisperFusion, WhisperLive, WhisperSpeech, those are very interesting projects.
I'm curious about the latency (of all three systems individually, and also the LLM) and the WER numbers of WhisperLive. I did not really find any numbers on that? This is a bit strange, as those are the most crucial figures for such models? Maybe I just looked in the wrong places (the GitHub repos).
like an engaged (if not the most polite) person does
The reason I ask is that I'm building something that does both TTS and STT using OpenAI, but I do not want to be sending a never-ending stream of audio to OpenAI just for it to listen for a single command I will eventually give it.
If I can do all of this local and use Mistral instead, then I’d give it a go too.
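One way around the never-ending stream: run a cheap voice-activity detector locally and only forward audio while it actually detects speech. A minimal sketch with webrtcvad; the sample rate, frame length, and aggressiveness here are arbitrary choices.

    # Local VAD gate: only yield frames that contain speech, so only those
    # ever get sent upstream. Assumes 16 kHz, 16-bit mono PCM.
    import webrtcvad

    SAMPLE_RATE = 16000
    FRAME_MS = 30
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per sample

    vad = webrtcvad.Vad(2)  # 0 = least aggressive, 3 = most aggressive

    def speech_frames(pcm_stream):
        """Yield only the 30 ms frames the VAD classifies as speech."""
        buf = b""
        for chunk in pcm_stream:
            buf += chunk
            while len(buf) >= FRAME_BYTES:
                frame, buf = buf[:FRAME_BYTES], buf[FRAME_BYTES:]
                if vad.is_speech(frame, SAMPLE_RATE):
                    yield frame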
I've been toying around with something similar myself, only I want push-to-talk from my phone. There's a route there with a WebRTC SPA, and it feels like it should be doable just by stringing together the right bits of various tech demos, but working out how to connect everything is more effort than it should be if you're not familiar with the tech.
What's really annoying is Whisper's latency. It's not really designed for this sort of streaming use case; they're only masking its unsuitability here by throwing (comparatively) ludicrous compute at it.
Discussion on them here from 10 months ago: https://news.ycombinator.com/item?id=35358873
I tried the demo back then and was very impressed. Anyone using it in dev or production?
"Over."
Some of the problems:
- Voice systems today (including the ChatGPT mobile app) cut you off at times when a human would not, based purely on how long you pause. If you say, "I think I'm going to... [3-second pause]", LLMs stop you, but a human would wait (see the sketch after this list)
- No ability to interrupt them with voice only
- Natural conversationalists tend to match one another's speed, but these systems' speed is fixed
- Lots of custom instructions needed to change from what works in written text to what works in speech (no bullet points, no long formulas)
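A toy illustration of the first point: instead of a fixed silence timeout, the endpointer could wait longer whenever the transcript clearly trails off. The word list and timeouts below are invented for illustration.

    # Toy endpointing heuristic: stretch the silence timeout when the
    # transcript suggests the speaker isn't done. All values are made up.
    TRAILING_WORDS = {"to", "and", "but", "because", "so", "the", "a"}

    def silence_timeout(transcript: str) -> float:
        """How long (seconds) to wait in silence before responding."""
        text = transcript.strip()
        words = text.lower().rstrip(".?!").split()
        if not words:
            return 1.0
        if text.endswith(("...", ",")) or words[-1] in TRAILING_WORDS:
            return 6.0   # "I think I'm going to..." -> keep waiting
        return 1.5       # sounds finished -> respond soon

    assert silence_timeout("I think I'm going to") == 6.0
    assert silence_timeout("Turn off the lights.") == 1.5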
On the other side of this problem is a super smart friend you can call on your phone. That would be world changing.
Interruption is something that is already in the pipeline and we are working on it. You should see an update soon.
I hope the new improved Siri and Google assistant will be able to chain actions as well. “Ok Google, turn off the lights. Ok Google, stop music.” Feels a bit cumbersome.
Well, today is your lucky day: https://persona-webapp-beta.vercel.app/ and the demo https://smarterchild.chat/
Has anybody fine-tuned Phi-2? I haven't found any good resources for that yet.
Have been tempted to try and build something out myself, there are tons of IP cameras around with 2-way audio. If the mic was reasonable enough quality, the potential for a multimodal LLM to comment contextually on the scene as well as respond through the speaker in a ceiling-mounted camera appeals to me a lot. "Computer, WTF is this old stray component I found lying under the sink?"
I like to call it "Artificial Attention".
I think about the cue as kind of being like "Hey Siri/Alexa/Cortana" but in reverse.
When I'm listening to someone else talk, I'm already formulating responses or at least an outline of responses in my head. If the LLM could do a progressive summarization of the conversation in real-time as part of its context this would be super cool as well. It could also interrupt you if the LLM self-reflects on the summary and realizes that now would be a good time to interrupt.
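Something like the loop below is what I imagine, pointed at a local OpenAI-compatible server (llama.cpp, vLLM, etc.); the endpoint, model name, and prompts are all placeholders.

    # Sketch of a rolling conversation summary plus an "interject now?" self-check.
    # Endpoint, model name, and prompts are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
    MODEL = "mistral-7b-instruct"  # placeholder

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=MODEL, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content

    summary = ""

    def on_new_segment(segment: str) -> None:
        """Fold each transcribed segment into the running summary, then
        self-reflect on whether now would be a good moment to interject."""
        global summary
        summary = ask(
            f"Current summary:\n{summary}\n\nNew speech:\n{segment}\n\n"
            "Update the summary in under 100 words."
        )
        verdict = ask(
            f"Summary of the conversation so far:\n{summary}\n\n"
            "Answer YES or NO: would an engaged listener interject right now?"
        )
        if verdict.strip().upper().startswith("YES"):
            print("(model wants to interject)")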
It would indeed be great to get something like this integrated with Whisper, an LLM, and TTS
- WhisperLive for the transcription - https://github.com/collabora/WhisperLive
- WhisperSpeech for the text-to-speech - https://github.com/collabora/WhisperSpeech
and an LLM (phi-2, Mistral, etc.) in between
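Gluing those three together is conceptually just a loop; the classes below are hypothetical stand-ins, not the real WhisperLive / WhisperSpeech client APIs (those live in the respective repos).

    # Conceptual pipeline only: transcribe -> LLM -> synthesize.
    # The three classes are hypothetical stand-ins for the real clients.

    class StreamingTranscriber:      # stand-in for a WhisperLive client
        def segments(self):
            """Yield finished transcript segments as they are recognized."""
            raise NotImplementedError

    class LocalLLM:                  # stand-in for phi-2 / Mistral behind any server
        def reply(self, text: str) -> str:
            raise NotImplementedError

    class Synthesizer:               # stand-in for a WhisperSpeech wrapper
        def speak(self, text: str) -> None:
            raise NotImplementedError

    def run(asr: StreamingTranscriber, llm: LocalLLM, tts: Synthesizer) -> None:
        # Each completed segment flows through the LLM and straight out as audio.
        for segment in asr.segments():
            tts.speak(llm.reply(segment))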
We tested it on a 3090 and a 4090; it works as expected.
[1] : https://paul.mou.dev/posts/2023-12-31-listening-with-llm/
I’m not sure why the demand never materialized for other highly personal services like search, photos, medical, etc.
But I just have this hunch we all really want it for AI.
https://developer.apple.com/library/archive/samplecode/UIEle...
> docker run --gpus all --shm-size 64G -p 80:80 -it ghcr.io/collabora/whisperfusion:latest
instead of:
> docker run --gpus all --shm-size 64G -p 6006:6006 -p 8888:8888 -it ghcr.io/collabora/whisperfusion:latest
> cd examples/chatbot/html
> python -m http.server
I do like the interface though.
I like it too!
And I can't help but think that getting into the habit of saying it would help us get along much better with other people in our lives.
Saying "Go" to indicate it's the bot's turn would work for me. (Or maybe pressing a button.) The bot should always stop wherever I start speaking.
How does WhisperLive actually work? Did you reduce the chunk size from 30 sec to something lower? To what? Is this fixed, or can it be configured by the user? Where can I find information on those details, or even a broad overview of how WhisperLive works?
As to why that might matter: my single 4090 is occupied with most of a Mixtral instance, and I don't especially want to take any compute away from that.
That doesn't sound plausible. How can the LLM part know which speech recognition service is being used?
"... simultaneously on the line with a thousand other robots."
:)
We (we in the larger sense of computer users as a whole, not just the small subset of "power users") should care more about privacy and security and such, but most people think of computers and networks the same way they think of a toaster or a hammer. To them it's a tool that does stuff when they push the right "magic button", and they couldn't care less what's inside, or how it could harm them if misused, until it actually does harm them (or comes close enough that they can no longer ignore it).
That really bums me out, and used to make me lose steam too. My current approach is, "I'm going to do it no matter what, and if you want to join that's cool too."
--
That's the main advantage of GPT for me... not infinite wisdom, but infinite willingness to listen, and infinite enthusiasm!
That is in very short supply among humans. Which is probably why it costs $200/hr in human form, heh.
Unconditional positive regard.
Pi.ai is also surprisingly good for that, better than GPT in some aspects (talking out "soft problems" — not as good for technical stuff).
*Edit*
Ah, when you write faster_whisper, you actually mean https://github.com/SYSTRAN/faster-whisper?
And for streaming, you use https://github.com/ufal/whisper_streaming? So, the model as described in http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main...?
There, for example in Table 1, you have exactly that: latency vs. WER. But the latency is huge (2.85 sec at the lowest). Usually, streaming speech recognition systems have latency well below 1 sec.
But anyway, is this actually what you use in WhisperLive / WhisperFusion? I think it would be good to give a bit more details on that.
For streaming, we continuously stream fixed-size chunks of audio bytes to the server and send the completed segments back to the client while incrementing the timestamp_offset.
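If I read that right, the server side is roughly the loop below. This is a sketch of the idea, not the actual WhisperLive code; transcribe() is a stand-in for whatever Whisper backend is used, and segments are assumed to carry start/end times relative to the current buffer.

    # Rough sketch of the described server loop (not the actual WhisperLive code).
    SAMPLE_RATE = 16000
    BYTES_PER_SAMPLE = 2  # 16-bit PCM

    audio_buffer = b""
    timestamp_offset = 0.0  # seconds of audio already finalized and reported

    def transcribe(audio: bytes):
        """Placeholder for whatever Whisper backend actually runs."""
        raise NotImplementedError

    def on_chunk(chunk: bytes, send_to_client) -> None:
        """Handle one fixed-size chunk of audio bytes from the client."""
        global audio_buffer, timestamp_offset
        audio_buffer += chunk
        segments = transcribe(audio_buffer)
        finals = [s for s in segments if s.is_final]
        for seg in finals:
            # Report with absolute timestamps by adding the running offset.
            send_to_client(timestamp_offset + seg.start,
                           timestamp_offset + seg.end,
                           seg.text)
        if finals:
            # Drop the audio covered by the last final segment and advance
            # the offset so future segments line up.
            cut = int(finals[-1].end * SAMPLE_RATE) * BYTES_PER_SAMPLE
            audio_buffer = audio_buffer[cut:]
            timestamp_offset += finals[-1].end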
[1] https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...
But I'm happy to be proven wrong. That's why I would like to see some actual numbers. Maybe it's still okish enough, maybe it's actually really bad. I'm curious. But I don't just want to see a demo or a sloppy statement like "it's working ok".
Note that this is a highly non-trivial problem, to make a streamable speech recognition system with low latency and still good performance. There is a big research community working on just this problem.
I actually have worked on this problem myself. E.g. see our work "Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition" (https://arxiv.org/abs/2309.08436), which will be presented at ICASSP 2024. E.g. for a median latency of 1.11 sec, we get a WER of 7.5% on TEDLIUM-v2 dev, which is almost as good as the offline model with 7.4% WER. This is a very good result (only a very minor WER degradation). Or with a latency of 0.78 sec, we get 7.7% WER. Our model currently does not work too well when we go to even lower latencies (or the computational overhead becomes impractical).
Or see Emformer (https://arxiv.org/abs/2010.10759) as another popular model.
I was impressed by Kaldi's models for streaming ASR: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index... ; I suspect that the Nvidia/Suno Parakeet models will also be pretty good for streaming https://huggingface.co/nvidia/parakeet-ctc-0.6b
This is TensorFlow-based. But I also have another PyTorch-based implementation already, also public (inside our other repo, i6_experiments). It's not so easy currently to set this up, but I'm working on a simpler pipeline in PyTorch.
We don't have the models online yet, but we can upload them later. But I'm not sure how useful they are outside of research, as they are specifically for those research tasks (Librispeech, Tedlium), and probably don't perform too well on other data.