RustでなぜかTLEするコードを調べた

次回: 続: RustでなぜかTLEするコードを調べた - harryのブログ

「なぜか」とは？？？

二次元平面の BFS / ダイクストラ法で遅かったり TLE するコードがあって謎だった
枝狩りがバグってるかと思ったらそうでもない（バグってなかったとは言っていない）

tl;dr

二次元平面で移動先(上下左右)の座標を Vec<(usize, usize)> で返す関数に分けていた
その Vec の生成処理の呼び出し回数と処理時間がボトルネックになっていた

コードの比較

対象の問題は、競プロ典型90問の「043 - Maze Challenge with Lack of Sleep（★4）」
コードが違う部分は、特定の座標から上下左右の座標を返す箇所

TLEするコード

上下左右の座標を計算して、Tuple の Vec として返す closure を定義する

const DIRECTIONS: [(isize, isize); 4] = [(1, 0), (-1, 0), (0, 1), (0, -1)];

#[fastout]
fn main() {
    input! {
        (h, w): (usize, usize),
        (rs, cs): (usize, usize),
        (re, ce): (usize, usize),
        s: [Chars; h]
    }

    let next = |py: usize, px: usize| -> Vec<(usize, usize, usize)> {
        (0..4).filter_map(|d| {
            let (dy, dx) = DIRECTIONS[d];
            let py = (py as isize + dy) as usize;
            let px = (px as isize + dx) as usize;
            if 1 <= py && py <= h && 1 <= px && px <= w {
                Some((py, px, d))
            } else {
                None
            }
        }).collect::<Vec<_>>()
    };

    // ...

    while let Some((y, x, d)) = q.pop_front() {
        let c = costs[y][x][d];
        for (y, x, nd) in next(y, x) {
            if s[y-1][x-1] == '#' { continue; }
 
            let nc = c + i32::from(d != nd);
            if costs[y][x][nd] <= nc { continue; }
            costs[y][x][nd] = nc;
            q.push_back((y, x, nd));
        }
    }

提出 #42971757

TLEしないコード

上下左右の座標を計算して、それを処理する関数に渡す

    let next = |py: usize, px: usize, f: &mut(dyn FnMut(usize, usize, usize))| {
        for d in 0..4 {
            let (dy, dx) = DIRECTIONS[d];
            let py = (py as isize + dy) as usize;
            let px = (px as isize + dx) as usize;
            if 1 <= py && py <= h && 1 <= px && px <= w {
                f(py, px, d)
            }
        }
    };

    // ...

    while let Some((y, x, d)) = q.pop_front() {
        let c = costs[y][x][d];
        next(y, x, &mut |y, x, nd| {
            if s[y-1][x-1] == '#' { return; }
 
            let nc = c + i32::from(d != nd);
            if costs[y][x][nd] <= nc { return; }
            costs[y][x][nd] = nc;
            q.push_back((y, x, nd));
        });
    }

提出 #42971689

原因調査

Tuple の Vec を大量に生成してるのが遅い…？
推測するな、計測しろ
- はい
- なお推測で TLE しないコードに直した模様
Rust でボトルネック計測したことがなかったのでやってみることに
- cargo-profilerを試す | κeenのHappy Hacκing Blog
- Valgrindでコード解析してみる - CADDi Tech Blog
~~そもそもこの closure、1か所でしか使ってないから必要なくない？？？~~*1

計測結果

一番遅いテストケースは in16.txt
競プロ典型のテストケースは Dropbox で公開されてるので、ダウンロードしておく
- https://github.com/E869120/kyopro_educational_90/tree/main
ファイルから入力を読み込むようにコードを書き換えておく

コードを release ビルドして、計測を実行。

$ cargo build --release
$ valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes ./target/release/typical90_043
==1987== Callgrind, a call-graph generating cache profiler
==1987== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==1987== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1987== Command: ./target/release/typical90_043
==1987==
--1987-- warning: L3 cache found, using its data for the LL simulation.
==1987== For interactive control, run 'callgrind_control -h'.
==1987== brk segment overflow in thread #1: can't grow to 0x483e000
==1987== (see section Limitations in user manual)
==1987== NOTE: further instances of this message will not be shown
26
==1987==
==1987== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim
==1987== Collected : 9998410846 3258009404 1361703256 2027 45180801 7140193 1986 1864508 957964 1553965173 56423372 71452800 464
==1987==
==1987== I   refs:      9,998,410,846
==1987== I1  misses:            2,027
==1987== LLi misses:            1,986
==1987== I1  miss rate:          0.00%
==1987== LLi miss rate:          0.00%
==1987==
==1987== D   refs:      4,619,712,660  (3,258,009,404 rd + 1,361,703,256 wr)
==1987== D1  misses:       52,320,994  (   45,180,801 rd +     7,140,193 wr)
==1987== LLd misses:        2,822,472  (    1,864,508 rd +       957,964 wr)
==1987== D1  miss rate:           1.1% (          1.4%   +           0.5%  )
==1987== LLd miss rate:           0.1% (          0.1%   +           0.1%  )
==1987==
==1987== LL refs:          52,323,021  (   45,182,828 rd +     7,140,193 wr)
==1987== LL misses:         2,824,458  (    1,866,494 rd +       957,964 wr)
==1987== LL miss rate:            0.0% (          0.0%   +           0.1%  )
==1987==
==1987== Branches:      1,625,417,973  (1,553,965,173 cond +    71,452,800 ind)
==1987== Mispredicts:      56,423,836  (   56,423,372 cond +           464 ind)
==1987== Mispred rate:            3.5% (          3.6%     +           0.0%   )

とりあえず I refs (Instruction cache references) と D refs (data references)が大変なことになっているのは分かる
この結果を kcachegrind で表示すると、以下のような感じ

$ kcachegrind callgrind.out.1987

TLEするコードのcallgrind.outをKCachegrindで表示した結果

移動先の候補の数だけ Vec の生成処理が呼び出されている
その呼び出し回数と処理時間が支配的 (メモリの確保と解放も含めて)

短時間で大量に呼び出される処理で Vec を生成 (collect() 等) すると、無視できない処理時間が取られる (あたりまえ体操)

ACするコードとの差分の時点でほぼ原因は分かってたが、計測結果からも確認ができた

ちなみにACするコードの計測結果は以下の通り。

$ valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes ./target/release/typical90_043_after
==1993== Callgrind, a call-graph generating cache profiler
==1993== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==1993== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1993== Command: ./target/release/typical90_043_after
==1993==
--1993-- warning: L3 cache found, using its data for the LL simulation.
==1993== For interactive control, run 'callgrind_control -h'.
==1993== brk segment overflow in thread #1: can't grow to 0x483e000
==1993== (see section Limitations in user manual)
==1993== NOTE: further instances of this message will not be shown
26
==1993==
==1993== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim
==1993== Collected : 5870926705 1891773639 697368040 2005 44929619 7138100 1964 1864475 957962 823442004 22987314 5034254 460
==1993==
==1993== I   refs:      5,870,926,705
==1993== I1  misses:            2,005
==1993== LLi misses:            1,964
==1993== I1  miss rate:          0.00%
==1993== LLi miss rate:          0.00%
==1993==
==1993== D   refs:      2,589,141,679  (1,891,773,639 rd + 697,368,040 wr)
==1993== D1  misses:       52,067,719  (   44,929,619 rd +   7,138,100 wr)
==1993== LLd misses:        2,822,437  (    1,864,475 rd +     957,962 wr)
==1993== D1  miss rate:           2.0% (          2.4%   +         1.0%  )
==1993== LLd miss rate:           0.1% (          0.1%   +         0.1%  )
==1993==
==1993== LL refs:          52,069,724  (   44,931,624 rd +   7,138,100 wr)
==1993== LL misses:         2,824,401  (    1,866,439 rd +     957,962 wr)
==1993== LL miss rate:            0.0% (          0.0%   +         0.1%  )
==1993==
==1993== Branches:        828,476,258  (  823,442,004 cond +   5,034,254 ind)
==1993== Mispredicts:      22,987,774  (   22,987,314 cond +         460 ind)
==1993== Mispred rate:            2.8% (          2.8%     +         0.0%   )

ACするコードのcallgrind.outをKCachegrindで表示した結果

当たり前ですが、メインの処理を行っている closure の呼び出しが支配的
メモリの確保と解放の処理時間は無視できるレベル
I refs と D refs はほぼ半減

所感

ネイティブコードのパワーに甘えすぎるのも良くない
Rust の局所最適化、かなり難しく感じてる
- 「043 - Maze Challenge with Lack of Sleep（★4）」で実行時間 400～600ms の提出コード見て「こっちの方が早いんか～…」ってなる（なった）
- ルールには「実行時間は実時間とCPU時間の大きい方で計測されます。」と書かれていて難しい(?)
WSL2 で GUI アプリが起動してすごかった (小並感)

Appendix

Windows 11 (WSL2) での KCachegrind 環境構築

Android エミュレーターの為にHyper-Vとか仮想環境を無効にしてたのを有効にしたり色々した。要らない設定もありそう。

[コントロールパネル] の [プログラムと機能] にある [Windows の機能の有効化または無効化] から以下の機能を有効にする
- Hyper-V
- Linux 用 Windows サブシステム
- 仮想マシンプラットフォーム
再起動ついでに BIOS で仮想化機能を有効にする
wsl を最新版に更新後、 Ubuntu-22.04 をインストールして WSLg 環境を構築
- WSL で Linux GUI アプリを実行する | Microsoft Learn
- WSL2の新機能WSLgを使ってX Window SystemのGUIアプリを動作させてみる - uepon日々の備忘録
- Ubuntu 自体の更新も忘れずにやる
  - sudo apt update
  - sudo apt upgrade
cargo と GCC をインストール
- Installation - The Cargo Book
- sudo apt install build-essential
Valgrind をインストール
- sudo apt-get install valgrind
KCachegrind をインストール
- sudo apt install kcachegrind massif-visualizer
- KCachegrind 起動時にエラーが出たので、エラーメッセージの通りにコマンドを実行した
  - kf.dbusaddons: DBus session bus not found. To circumvent this problem try the following command (with bash): export $(dbus-launch)

ハマリどころとか

Windows から Ubuntu にプロジェクト一式コピーしたら、ビルド時に permission エラーで死ぬ
- コピーしたディレクトリとファイルの所有者が root になっていた
- chown で変更した
Windows から Ubuntu にファイルをコピーすると Zone.Identifier ファイルもコピーされる(?)
- 解決される気配がない
  - Zone.Identifier Files when copying from Windows to WSL filestructure · Issue #4609 · microsoft/WSL · GitHub
  - Zone.Identifier Files when downloading from Windows to WSL file structure · Issue #7456 · microsoft/WSL · GitHub
- みんな頑張って削除したり、.gitignore で除外したりしてるっぽい
  - find . -name "*:Zone.Identifier" -type f -delete
cargo-profiler がエラーを吐く
- どうも解決される気配がない
- 一応エラーが出ても callgrind.out ファイルは出力される
- 今回は普通に valgrind コマンド叩いて計測した
Windows版の WinCacheGrind で callgrind.out ファイルを開くとエラーが出て見れない
- 窓から投げ捨てよう
- 何か別にインストールが必要なのあるかも…？

*1:提出 #42946421