5.1.2. Relaxing Loop-Carried Dependency

Intel® FPGA SDK for OpenCL™: Best Practices Guide

ID 683521

Date 12/08/2017

Version 17.1

Public

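Based on the feedback from the optimization report, you can relax a loop-carried dependency by increasing the dependence distance. To increase the dependence distance, increase the number of loop iterations that occur between the generation of a loop-carried value and its use.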
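To make "dependence distance" concrete, the following plain C sketch contrasts a recurrence with distance 1 against one with distance d. It is not taken from the guide; the function names and the array-based formulation are illustrative assumptions. In the second loop, each iteration reads a value produced d iterations earlier, so d-1 other iterations separate the producer from its consumer.

```c
#include <stddef.h>

/* Distance 1: iteration i consumes the value produced by iteration i-1. */
void scan_distance_1(float *x, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] = x[i] * x[i - 1];
}

/* Distance d: iteration i consumes the value produced by iteration i-d,
 * so d-1 unrelated iterations sit between the producer and its consumer. */
void scan_distance_d(float *x, size_t n, size_t d)
{
    for (size_t i = d; i < n; i++)
        x[i] = x[i] * x[i - d];
}
```

The two functions compute different results; the point is only the shape of the dependency, which is what determines how soon the next iteration can start.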

Consider the following code example:


 1 #define N 128
 2 
 3 __kernel void unoptimized (__global float * restrict A,
 4                            __global float * restrict result)
 5 {
 6   float mul = 1.0f;
 7 
 8   for (unsigned i = 0; i < N; i++)
 9     mul *= A[i];
10 
11   * result = mul;
12 }

===================================================================================
Kernel: unoptimized
===================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

 + Loop "Block1" (file unoptimized.cl line 8)
   Pipelined with successive iterations launched every 6 cycles due to:

       Data dependency on variable mul  (file unoptimized.cl line 9)
       Largest Critical Path Contributor:
           100%: Fmul Operation  (file unoptimized.cl line 9)


===================================================================================

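The optimization report above shows that the Intel® FPGA SDK for OpenCL™ Offline Compiler successfully infers pipelined execution for the loop. However, the loop-carried dependency on the variable mul causes successive loop iterations to launch only every six cycles. In this case, the floating-point multiplication operation on line 9 (that is, mul *= A[i]) contributes the largest delay to the computation of the variable mul.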
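As a rough illustration of what a six-cycle launch interval costs, the sketch below computes the cycle at which the final iteration enters the pipeline. This is a simplified model and not something the compiler reports: `last_launch_cycle` is a hypothetical helper, and the model ignores pipeline depth and memory stalls.

```c
/* Cycle at which the last of n iterations is launched when a new
 * iteration starts every ii cycles (the first launches at cycle 0).
 * Hypothetical helper for illustration only. */
unsigned long last_launch_cycle(unsigned long n, unsigned long ii)
{
    return n == 0 ? 0 : (n - 1) * ii;
}
```

Under this model, with N = 128 the last iteration launches at cycle 762 when the interval is 6, versus cycle 127 if iterations could launch every cycle.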

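To relax the loop-carried data dependency, do not use a single variable to store the multiplication results. Instead, operate on M copies of the variable and use each copy once every M iterations: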

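1. Declare multiple copies of the variable mul (for example, in an array called mul_copies).
2. Initialize all the copies in mul_copies.
3. Use the last copy in the array in the multiplication operation.
4. Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
5. Reduce all the copies to mul and write the final value to result.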

Below is the restructured kernel:


 1 #define N 128
 2 #define M 8
 3 
 4 __kernel void optimized (__global float * restrict A,
 5                          __global float * restrict result)
 6 {
 7   float mul = 1.0f;
 8 
 9   // Step 1: Declare multiple copies of variable mul
10   float mul_copies[M];
11 
12   // Step 2: Initialize all copies
13   for (unsigned i = 0; i < M; i++)
14     mul_copies[i] = 1.0f;
15 
16   for (unsigned i = 0; i < N; i++) {
17     // Step 3: Perform multiplication on the last copy
18     float cur = mul_copies[M-1] * A[i];
19 
20     // Step 4a: Shift copies
21     #pragma unroll 
22     for (unsigned j = M-1; j > 0; j--)
23       mul_copies[j] = mul_copies[j-1];
24 
25     // Step 4b: Insert updated copy at the beginning
26     mul_copies[0] = cur;
27   }
28 
29   // Step 5: Perform reduction on copies
30   #pragma unroll 
31   for (unsigned i = 0; i < M; i++)
32     mul *= mul_copies[i];
33 
34   * result = mul;
35 }

An optimization report similar to the one below indicates that the loop-carried dependency on the variable mul has been relaxed successfully:

===================================================================================
Kernel: optimized
===================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

 + Fully unrolled loop (file optimized2.cl line 13)
   Loop was automatically and fully unrolled.
   Add "#pragma unroll 1" to prevent automatic unrolling.


 + Loop "Block1" (file optimized2.cl line 16)
 | Pipelined well. Successive iterations are launched every cycle.
 |
 |
 |-+ Fully unrolled loop (file optimized2.cl line 22)
     Loop was fully unrolled due to "#pragma unroll" annotation.


 + Fully unrolled loop (file optimized2.cl line 31)
   Loop was fully unrolled due to "#pragma unroll" annotation.
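The restructuring changes the order in which the floating-point multiplications are applied, so the result can differ from the single-accumulator version in the last bits. A minimal host-side C sketch for comparing the two orderings on the CPU follows. It is not part of the guide: the function names and the relative-tolerance comparison are assumptions made for illustration.

```c
#include <stddef.h>

#define M 8  /* number of copies, matching the restructured kernel */

/* Reference: single accumulator, same order as the unoptimized kernel. */
float product_single(const float *a, size_t n)
{
    float mul = 1.0f;
    for (size_t i = 0; i < n; i++)
        mul *= a[i];
    return mul;
}

/* Same structure as the restructured kernel: M copies in a shift register. */
float product_shift_register(const float *a, size_t n)
{
    float mul_copies[M];
    for (size_t j = 0; j < M; j++)
        mul_copies[j] = 1.0f;

    for (size_t i = 0; i < n; i++) {
        /* Multiply into the oldest copy, produced M iterations ago. */
        float cur = mul_copies[M - 1] * a[i];
        for (size_t j = M - 1; j > 0; j--)
            mul_copies[j] = mul_copies[j - 1];
        mul_copies[0] = cur;
    }

    /* Reduce the M partial products into one value. */
    float mul = 1.0f;
    for (size_t j = 0; j < M; j++)
        mul *= mul_copies[j];
    return mul;
}

/* Relative comparison; the caller picks the tolerance (an assumption here). */
int nearly_equal(float x, float y, float rel_tol)
{
    float diff = x > y ? x - y : y - x;
    float ax = x < 0.0f ? -x : x;
    float ay = y < 0.0f ? -y : y;
    float scale = ax > ay ? ax : ay;
    return diff <= rel_tol * scale;
}
```

Comparing with a tolerance rather than with `==` is the safer choice here, because each of the M copies accumulates a different subset of the inputs before the final reduction.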
