To achieve high throughput, the core count of compute accelerators such as General-Purpose Graphics Processing Units (GPGPUs) keeps increasing. The communication demand of these cores calls for a low-latency packet-switched network. Because packet latency consists mainly of per-hop latency, contention latency, and serialization latency, a favorable Network-on-Chip (NoC) design should reduce all three contributors to meet the communication demand while keeping hardware cost low. In this paper, we first make two observations about how NoCs differ between chip multiprocessors (CMPs) and GPGPUs, and then design a Heterogeneous Ring-Chain network (HRCnet) for the GPGPU reply network. HRCnet eliminates conflicts in the network by adopting a ring-like topology, using a novel node placement, and introducing unidirectional channels. Eliminating conflicts reduces the per-hop latency and removes the contention latency, while exploiting the ring-like topology reduces the serialization latency. Experimental results show the benefits of this low-cost, low-latency design. At the same bisection bandwidth as the baseline mesh, HRCnet yields a 45% performance improvement while reducing area by 42% and energy consumption by 60%. Compared to two state-of-the-art GPGPU NoCs, BENoC and DA2mesh, HRCnet achieves more than 42% performance gain at reduced hardware cost. Our work also achieves the highest power and area efficiency among the designs.
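
The three latency contributors named above can be illustrated with a minimal sketch. It assumes the common first-order model in which per-hop latency is the hop count times a fixed per-hop delay, and serialization latency is the packet size divided by the channel width, rounded up. The names `PacketLatency` and `packet_latency`, their parameters, and all numbers in the usage lines are hypothetical and chosen for illustration only; they are not taken from HRCnet or its evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketLatency:
    """Cycle counts for the three latency contributors (illustrative model)."""
    per_hop: int        # hop count times the per-hop router/link delay
    contention: int     # cycles spent waiting for busy channels
    serialization: int  # cycles to push the whole packet through one channel

    @property
    def total(self) -> int:
        return self.per_hop + self.contention + self.serialization


def packet_latency(hops: int, cycles_per_hop: int, contention_cycles: int,
                   packet_bits: int, channel_bits: int) -> PacketLatency:
    """Split a packet's latency into its three contributors, in cycles."""
    if channel_bits <= 0:
        raise ValueError("channel_bits must be positive")
    # Ceiling division: a partially filled flit still occupies a full cycle.
    serialization = -(-packet_bits // channel_bits)
    return PacketLatency(hops * cycles_per_hop, contention_cycles, serialization)


# Hypothetical numbers, for illustration only.
contended = packet_latency(hops=4, cycles_per_hop=3, contention_cycles=5,
                           packet_bits=512, channel_bits=128)
conflict_free = packet_latency(hops=4, cycles_per_hop=1, contention_cycles=0,
                               packet_bits=512, channel_bits=256)
print(contended.total, conflict_free.total)
```

In this model, removing conflicts sets the contention term to zero and allows a smaller per-hop delay, while a wider channel shortens the serialization term. The abstract attributes its latency reduction to these same three terms, although the actual magnitudes depend on the design and workload.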