One thing that’s always bothered me is that almost all conversational systems still reduce everything to transcripts and throw away a ton of signal that should be used downstream. Some existing emotion-understanding models try to classify what they see into small sets of arbitrary boxes, but they’re either not fast enough or not rich enough to do this convincingly in real time.
So I built a multimodal perception system that encodes visual and audio conversational signals and translates them into natural language by aligning a small LLM on those signals. The agent can effectively "see" and "hear" you, and you can interface with it through an OpenAI-compatible tool schema in a live conversation.
It outputs short natural-language descriptions of what’s going on in the interaction: things like uncertainty building, sarcasm, disengagement, or a shift in attention within a single turn of the conversation.
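To make the "OpenAI-compatible tool schema" part concrete, here’s a minimal sketch of what wiring the perception output in as a tool could look like. The base_url, model name, and tool fields are placeholders I made up for illustration, not the actual API:

```python
# Hypothetical sketch: expose the perception stream as a tool on an
# OpenAI-compatible endpoint so the agent can pull a natural-language
# read of the user's state mid-conversation.
from openai import OpenAI

client = OpenAI(
    base_url="https://example.invalid/v1",  # placeholder endpoint
    api_key="YOUR_KEY",
)

perception_tool = {
    "type": "function",
    "function": {
        "name": "get_perception_state",  # placeholder tool name
        "description": (
            "Return a short natural-language description of the user's "
            "visual/audio state, e.g. 'uncertainty building', "
            "'sarcastic tone', 'attention drifting'."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "window_seconds": {
                    "type": "number",
                    "description": "How much recent video/audio to summarize.",
                }
            },
            "required": [],
        },
    },
}

resp = client.chat.completions.create(
    model="conversational-agent",  # placeholder model name
    messages=[{"role": "user", "content": "Does that plan make sense?"}],
    tools=[perception_tool],
)
print(resp.choices[0].message)
```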
Some quick specs:
- Runs in real time, per conversation
- Processes video at ~15 fps plus overlapping audio alongside the conversation
- Handles nuanced emotions, whispers vs. shouts
- Trained on synthetic + internal convo data
Happy to answer questions or go deeper on architecture and tradeoffs.
More details here: https://www.tavus.io/post/raven-1-bringing-emotional-intelli...