
TONELENS

Google Translate tells you the words. ToneLens tells you the truth.

Role

Solo Developer

Stack
Python · FastAPI · Gemini Live API · Google ADK · Vertex AI · Firestore · Cloud Run · WebSockets
Status

Submitted — Gemini Live Agent Challenge 2026

Overview

ToneLens is a real-time emotional intelligence agent that watches conversations through your camera, listens through your microphone, and streams back translation, emotional subtext, cultural context, and tactical suggestions — all simultaneously via the Gemini Live API.

The core insight behind ToneLens is that words carry only 30% of meaning. The remaining 70% lives in tone, hesitation, confidence, and cultural context. Existing tools like Google Translate decode language. ToneLens decodes intent.

Built in 7 days for the Gemini Live Agent Challenge 2026, ToneLens runs four distinct agent modes — Travel, Meeting, Present, and Negotiate — each with a specialized system prompt, real-time emotional scoring, and autonomous agent actions triggered by keyword detection across the conversation stream.
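The four modes described above can be sketched as a simple mapping from mode name to specialized system prompt. This is an illustrative reconstruction, not the actual ToneLens configuration; the prompt text and function names are assumptions:

```python
# Hypothetical sketch of ToneLens's four agent modes, each backed by a
# specialized system prompt. Prompt wording here is illustrative only.
MODES = {
    "travel": "You are a live interpreter. Translate speech and add cultural context.",
    "meeting": "You are a meeting assistant. Track decisions, deadlines, and commitments.",
    "present": "You are a delivery coach. Flag filler words and pacing issues.",
    "negotiate": "You are a negotiation coach. Track power balance and flag bluffing.",
}

def system_prompt(mode: str) -> str:
    """Return the specialized system prompt for a mode, defaulting to travel."""
    return MODES.get(mode, MODES["travel"])
```

Keeping mode behavior in prompt data rather than branching code makes it cheap to add a fifth mode later.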

Screenshots

Key Features

Live translation: Detects and translates any language in real time with cultural context
Emotion engine: 8 emotion types with confidence scoring streamed per utterance
Negotiation coach: Tracks power balance, detects bluffing, whispers tactical strategy
Presentation coach: Filler word detection, pace analysis, real-time delivery coaching
Meeting intelligence: Auto-saves decisions, deadlines, and commitments to Firestore
Agent actions: Autonomous triggers for cultural tips, emergency detection, and stress reports
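The per-utterance emotion scoring in the list above can be sketched as a small parser. The specific emotion labels and the `EMOTION: label (score)` line format are assumptions for illustration, not the actual ToneLens protocol:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed set of 8 emotion labels; the real ToneLens taxonomy may differ.
EMOTIONS = {"neutral", "happy", "frustrated", "anxious",
            "confident", "hesitant", "angry", "sad"}

@dataclass
class EmotionScore:
    label: str
    confidence: float  # 0.0 to 1.0

def parse_emotion(line: str) -> Optional[EmotionScore]:
    """Parse a streamed line like 'EMOTION: frustrated (0.82)' into a score."""
    m = re.match(r"EMOTION:\s*(\w+)\s*\(([\d.]+)\)", line)
    if not m or m.group(1) not in EMOTIONS:
        return None  # reject malformed lines and unknown labels
    return EmotionScore(m.group(1), float(m.group(2)))
```

Validating against a closed label set keeps downstream UI code from having to handle free-form model output.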

Tech Deep Dive

ToneLens uses a two-model pipeline to work around a critical constraint: the Gemini Live native audio model does not support function declarations or text modality simultaneously.

Audio streams bidirectionally via WebSockets to a FastAPI backend on Cloud Run. The Gemini Live session handles real-time multimodal understanding and responds in audio only. A second Vertex AI call to gemini-2.0-flash then reformats the transcribed response into a strict four-line structured output: TRANSLATION, EMOTION, SUBTEXT, and SUGGEST.

Agent actions are triggered via keyword detection in Python rather than function calling, keeping the Live session config clean. Session state and conversation history are persisted in Firestore. The entire pipeline runs end-to-end in under two seconds.
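The two post-processing steps above can be sketched in a few lines. Both the strict four-line parsing and the trigger keywords are illustrative assumptions; the actual field separators and keyword lists in ToneLens may differ:

```python
from typing import Optional

# The reformatter is asked for exactly these four labeled lines.
REQUIRED_KEYS = ("TRANSLATION", "EMOTION", "SUBTEXT", "SUGGEST")

def parse_structured(text: str) -> Optional[dict]:
    """Split the reformatter's four-line output into its labeled fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        if key.strip() in REQUIRED_KEYS:
            fields[key.strip()] = value.strip()
    # Reject partial output so the client never renders a half-filled card.
    return fields if len(fields) == len(REQUIRED_KEYS) else None

# Keyword detection stands in for function calling: scanning the transcript
# in Python keeps the Live session config free of tool declarations.
# These keyword lists are hypothetical examples.
TRIGGERS = {
    "cultural_tip": ("custom", "etiquette", "tradition"),
    "emergency": ("help", "police", "ambulance"),
    "stress_report": ("deadline", "pressure", "overwhelmed"),
}

def detect_actions(transcript: str) -> list:
    """Return the agent actions whose keywords appear in the transcript."""
    lowered = transcript.lower()
    return [action for action, words in TRIGGERS.items()
            if any(w in lowered for w in words)]
```

Because the keyword scan runs outside the model session, adding a new autonomous action is a one-line change that never touches the Live API configuration.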
