Hello-Chat: Towards Realistic Social Audio Interactions

Computational Intelligence Dept, HelloGroup Inc.
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China

Abstract

We introduce Hello-Chat, an end-to-end Large Audio Language Model (LALM) tailored for real-world conversational scenarios. The model achieves state-of-the-art performance on specific understanding benchmarks and significantly outperforms existing open-source systems in prosodic naturalness, emotional accuracy, and interaction fluency. By explicitly modeling fine-grained acoustic perception and cross-modal alignment, Hello-Chat enables realistic, context-aware spoken interaction between users and AI.

Contents
System Overview
Model Architecture

Figure 1: The overall architecture of Hello-Chat.

Audio Samples

Single-turn audio synthesis grouped by reference speaker.


Multi-Turn Conversation

Context-aware human-like conversation with unified text and speech generation (Tester ↔ Bot, Zero-shot).
