Recent many-core processors such as Intel's Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is that full utilization of all available cores also translates into the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth, and show that undersubscription, or not utilizing all cores, often yields significant increases in both performance and energy efficiency. Based on a detailed shared working set analysis, we make the case for clustered cache architectures as an efficient design point for exploiting both data sharing and undersubscription, while providing low latency and ease of implementation in many-core processors. We propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST), which dynamically matches an application's working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the optimum performance under the cores-versus-cache area trade-off towards design points with more cores and less cache.
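
To make the matching idea concrete, the following is a minimal sketch of undersubscription on a clustered cache design. It is not the CRUST algorithm itself: the function name, its parameters, the split of the working set into a shared and a per-thread private part, and the simple "largest thread count that fits" rule are all illustrative assumptions.

```python
def threads_per_cluster(cores_per_cluster, num_clusters, cluster_cache_bytes,
                        shared_ws_bytes, private_ws_bytes,
                        offchip_bw, per_thread_bw):
    """Pick how many threads to run in each cluster (hypothetical rule).

    Assumed model: threads in a cluster share one cache that must hold the
    shared working set plus one private working set per thread, and all
    threads on the chip compete for the same off-chip bandwidth.
    """
    # Try full subscription first, then drop one thread per cluster at a time.
    for t in range(cores_per_cluster, 0, -1):
        fits_cache = shared_ws_bytes + t * private_ws_bytes <= cluster_cache_bytes
        fits_bw = num_clusters * t * per_thread_bw <= offchip_bw
        if fits_cache and fits_bw:
            return t
    # Nothing fits: fall back to one thread per cluster (an assumption).
    return 1


KIB = 1024
# 8 clusters of 4 cores, 2 MiB cache per cluster, 512 KiB shared and
# 512 KiB private working set per thread, bandwidth in arbitrary units.
chosen = threads_per_cluster(4, 8, 2048 * KIB, 512 * KIB, 512 * KIB, 100.0, 2.0)
print(chosen)
```

In this example the cache, not the bandwidth, is the limiting resource, so the sketch leaves one core per cluster idle instead of letting four threads thrash the shared cache.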