One thing that’s always bothered me is that almost all conversational systems still reduce everything to transcripts and throw away a ton of signal that should be used downstream. Some existing emotion-understanding models try to classify what they see into small sets of arbitrary boxes, but they’re either not fast enough or not rich enough to do this convincingly in real time.
So I built a multimodal perception system that encodes visual and audio conversational signals and translates them into natural language by aligning a small LLM on those signals. The agent can effectively "see" and "hear" you, and you can interface with it through an OpenAI-compatible tool schema in a live conversation.
It outputs short natural-language descriptions of what’s going on in the interaction: things like uncertainty building, sarcasm, disengagement, or a shift in attention within a single turn of the conversation.
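To make the "OpenAI-compatible tool schema" part concrete, here’s a minimal sketch of what wiring the perception output in as a tool could look like. The base_url, model name, and tool fields are placeholders I made up for illustration, not the actual API:

```python
# Hypothetical sketch: expose the perception stream as a tool on an
# OpenAI-compatible endpoint so the agent can pull a natural-language
# read of the user's state mid-conversation.
from openai import OpenAI

client = OpenAI(
    base_url="https://example.invalid/v1",  # placeholder endpoint
    api_key="YOUR_KEY",
)

perception_tool = {
    "type": "function",
    "function": {
        "name": "get_perception_state",  # placeholder tool name
        "description": (
            "Return a short natural-language description of the user's "
            "visual/audio state, e.g. 'uncertainty building', "
            "'sarcastic tone', 'attention drifting'."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "window_seconds": {
                    "type": "number",
                    "description": "How much recent video/audio to summarize.",
                }
            },
            "required": [],
        },
    },
}

resp = client.chat.completions.create(
    model="conversational-agent",  # placeholder model name
    messages=[{"role": "user", "content": "Does that plan make sense?"}],
    tools=[perception_tool],
)
print(resp.choices[0].message)
```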
Some quick specs:
- Runs in real time, per conversation
- Processes video at ~15 fps plus overlapping audio alongside the conversation
- Handles nuanced emotions, whispers vs. shouts
- Trained on synthetic + internal convo data
Happy to answer questions or go deeper on architecture and tradeoffs.
More details here: https://www.tavus.io/post/raven-1-bringing-emotional-intelli...