The deblocking filter in the MPEG-4 AVC/H.264 standard is computationally complex because of its high content adaptivity, which results in a significant number of data dependencies. These data dependencies hinder parallel filtering of multiple macroblocks (MBs) on massively parallel architectures. In this letter, we introduce a novel MB partitioning scheme for concurrent deblocking in the MPEG-4 AVC/H.264 standard, based on our idea of deblocking filter independency, a corrected version of the limited error propagation effect proposed in the literature. The proposed scheme enables concurrent MB deblocking of luma samples with limited synchronization effort, independently of slice configuration, and is compliant with the MPEG-4 AVC/H.264 standard. We implemented the method on the massively parallel architecture of the graphics processing unit (GPU). Experimental results show that our GPU implementation achieves faster-than-real-time deblocking at 1309 frames per second for 1080p video pictures. At the same resolution, it outperforms software-based deblocking filters and state-of-the-art GPU-enabled algorithms in speed by factors of up to 10.2 and 19.5, respectively.