While the chip multiprocessor (CMP) has quickly become the predominant processor architecture, its continuing success largely depends on the parallelizability of complex programs. We present a framework that extracts coarse-grain, function-level parallelism that can exploit the parallel resources of the CMP. The framework uses a profile-driven control and data dependence analysis between large code regions. We target coarse-grain parallelism by finding do-across parallelism in the outer loops of a program, which can be exploited in a pipelined fashion. The identification of parallelism reduces the overall loop structure to a preset template by merging inter-dependent code regions, and the actual parallelization is guided by this template. The extracted parallelism yields significant speedups, by a factor of 5 to 12, on an 8-core Sun UltraSPARC T1 processor.
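To make the idea of pipelined do-across execution concrete, the following is a minimal sketch, not the framework's actual implementation. It assumes the outer-loop body has already been split into an ordered list of stages (standing in for merged code regions), each run by its own thread and connected by FIFO queues, so that iteration i+1 can enter the first stage while iteration i is still in a later one. The names `run_stage`, `pipeline`, and `SENTINEL` are hypothetical, and Python threads are used only to show the structure; they do not by themselves give CPU-bound speedup.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker


def run_stage(func, inbox, outbox):
    """Apply func to each item from inbox, in order, and forward the result."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def pipeline(stages, inputs):
    """Run each stage in its own thread, linked by FIFO queues.

    Each stage processes iterations in order, so dependences carried
    from one stage to the next within an iteration are respected while
    different iterations overlap across stages.
    """
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(f, queues[k], queues[k + 1]))
        for k, f in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for x in inputs:           # feed the outer-loop iterations
        queues[0].put(x)
    queues[0].put(SENTINEL)

    results = []
    while True:                # drain the final queue in iteration order
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

For example, `pipeline([lambda x: x + 1, lambda x: x * 2], range(4))` runs a two-stage loop body over four iterations and returns the results in iteration order.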