Improve state machine codegen #258

Open
1 of 6 tasks
nikomatsakis opened this issue Feb 18, 2025 · 6 comments

@nikomatsakis (Contributor) commented Feb 18, 2025

Metadata
Point of contact: @folkertdev
Team(s): compiler, lang
Goal document: 2025h1/improve-rustc-codegen

Summary

We want to improve rustc codegen, based on this initiative by the Trifecta Tech Foundation. The work focuses on improving state machine code generation, and on finding (and hopefully fixing) cases where clang produces better code than rustc for roughly equivalent input.
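
For context, the pattern in question is a loop whose body dispatches on a state value. The sketch below is illustrative (the State variants and the run function are made up, not taken from the goal document): on stable Rust today, every transition writes the new state and control goes back through the central match, which re-dispatches at runtime instead of jumping directly to the code for the next state.

enum State {
    Start,
    Middle,
    Done,
}

fn run() {
    let mut state = State::Start;
    loop {
        match state {
            // Each arm picks the next state; control then returns to the
            // `match`, which dispatches on `state` again at runtime.
            State::Start => state = State::Middle,
            State::Middle => state = State::Done,
            State::Done => break,
        }
    }
}

fn main() {
    run();
}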

Tasks and status

@nikomatsakis (Contributor, Author)

This issue is intended for status updates only.

For general questions or comments, please contact the owner(s) directly.

@traviscross (Contributor)

This is, I believe, mostly waiting on us on the lang team to have a look, probably in a design meeting, to feel out what's in the realm of possibility for us to accept.

@folkertdev (Contributor)

@traviscross how would we make progress on that? So far we've mostly been talking to @joshtriplett, under the assumption that a #[loop_match] attribute on loops combined with a #[const_continue] attribute on "jumps to the next iteration" will be acceptable as a language experiment.

Our current implementation handles the following:

#![feature(loop_match)]

enum State {
    A,
    B,
}

fn main() {
    let mut state = State::A;
    #[loop_match]
    'outer: loop {
        state = 'blk: {
            match state {
                State::A => {
                    #[const_continue]
                    break 'blk State::B
                }
                State::B => break 'outer,
            }
        }
    }
}

Crucially, this does not add any new syntax: only the attributes, plus internal logic in MIR lowering that performs the pattern match statically to pick the right branch to jump to.

The main challenge is then to implement this in the compiler itself, which we've been working on (I'll post our tl;dr update shortly).

@folkertdev (Contributor)

Some benchmarks (as of March 18th)

A benchmark of https://github.com/bjorn3/comrak/blob/loop_match_attr/autolink_email.rs, basically a big state machine that is a perfect fit for loop match

Benchmark 1: ./autolink_email
  Time (mean ± σ):      1.126 s ±  0.012 s    [User: 1.126 s, System: 0.000 s]
  Range (min … max):    1.105 s …  1.141 s    10 runs
 
Benchmark 2: ./autolink_email_llvm_dfa
  Time (mean ± σ):     583.9 ms ±   6.9 ms    [User: 581.8 ms, System: 2.0 ms]
  Range (min … max):   575.4 ms … 591.3 ms    10 runs
 
Benchmark 3: ./autolink_email_loop_match
  Time (mean ± σ):     411.4 ms ±   8.8 ms    [User: 410.1 ms, System: 1.3 ms]
  Range (min … max):   403.2 ms … 430.4 ms    10 runs
 
Summary
  ./autolink_email_loop_match ran
    1.42 ± 0.03 times faster than ./autolink_email_llvm_dfa
    2.74 ± 0.07 times faster than ./autolink_email

#[loop_match] not only beats the status quo, it also beats the LLVM flag by a large margin.


A benchmark of zlib decompression with chunks of 16 bytes (this makes the impact of loop_match more visible)

Benchmark 1 (65 runs): target/release/examples/uncompress-baseline rs-chunked 4
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          77.7ms ± 3.04ms    74.6ms … 88.9ms          9 (14%)        0%
  peak_rss           24.1MB ± 64.6KB    24.0MB … 24.2MB          0 ( 0%)        0%
  cpu_cycles          303M  ± 11.8M      293M  …  348M           9 (14%)        0%
  instructions        833M  ±  266       833M  …  833M           0 ( 0%)        0%
  cache_references   3.62M  ±  310K     3.19M  … 4.93M           1 ( 2%)        0%
  cache_misses        209K  ± 34.2K      143K  …  325K           1 ( 2%)        0%
  branch_misses      4.09M  ± 10.0K     4.08M  … 4.13M           5 ( 8%)        0%
Benchmark 2 (68 runs): target/release/examples/uncompress-llvm-dfa rs-chunked 4
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          74.0ms ± 3.24ms    70.6ms … 85.0ms          4 ( 6%)        🚀-  4.8% ±  1.4%
  peak_rss           24.1MB ± 27.1KB    24.0MB … 24.1MB          3 ( 4%)          -  0.1% ±  0.1%
  cpu_cycles          287M  ± 12.7M      277M  …  330M           4 ( 6%)        🚀-  5.4% ±  1.4%
  instructions        797M  ±  235       797M  …  797M           0 ( 0%)        🚀-  4.3% ±  0.0%
  cache_references   3.56M  ±  439K     3.08M  … 5.93M           2 ( 3%)          -  1.8% ±  3.6%
  cache_misses        144K  ± 32.5K     83.7K  …  249K           2 ( 3%)        🚀- 31.2% ±  5.4%
  branch_misses      4.09M  ± 9.62K     4.07M  … 4.12M           1 ( 1%)          -  0.1% ±  0.1%
Benchmark 3 (70 runs): target/release/examples/uncompress-loop-match rs-chunked 4
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          71.6ms ± 2.43ms    69.3ms … 78.8ms          6 ( 9%)        🚀-  7.8% ±  1.2%
  peak_rss           24.1MB ± 72.8KB    23.9MB … 24.2MB         20 (29%)          -  0.0% ±  0.1%
  cpu_cycles          278M  ± 9.59M      270M  …  305M           7 (10%)        🚀-  8.5% ±  1.2%
  instructions        779M  ±  277       779M  …  779M           0 ( 0%)        🚀-  6.6% ±  0.0%
  cache_references   3.49M  ±  270K     3.15M  … 4.17M           4 ( 6%)        🚀-  3.8% ±  2.7%
  cache_misses        142K  ± 25.6K     86.0K  …  197K           0 ( 0%)        🚀- 32.0% ±  4.8%
  branch_misses      4.09M  ± 7.83K     4.08M  … 4.12M           1 ( 1%)          +  0.0% ±  0.1%
Benchmark 4 (69 runs): target/release/examples/uncompress-llvm-dfa-loop-match rs-chunked 4
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          72.8ms ± 2.57ms    69.7ms … 80.0ms          7 (10%)        🚀-  6.3% ±  1.2%
  peak_rss           24.1MB ± 35.1KB    23.9MB … 24.1MB          2 ( 3%)          -  0.1% ±  0.1%
  cpu_cycles          281M  ± 10.1M      269M  …  312M           5 ( 7%)        🚀-  7.5% ±  1.2%
  instructions        778M  ±  243       778M  …  778M           0 ( 0%)        🚀-  6.7% ±  0.0%
  cache_references   3.45M  ±  277K     2.95M  … 4.14M           0 ( 0%)        🚀-  4.7% ±  2.7%
  cache_misses        176K  ± 43.4K      106K  …  301K           0 ( 0%)        🚀- 15.8% ±  6.3%
  branch_misses      4.16M  ± 96.0K     4.08M  … 4.37M           0 ( 0%)        💩+  1.7% ±  0.6%

The important points: loop-match is faster than llvm-dfa, and combining the two performs worse than using loop-match on its own.

@folkertdev (Contributor)

TL;DR:

We've started work on implementing #[loop_match] on this branch. For the time being, integer and enum patterns are supported. The benchmarks are extremely encouraging, showing large improvements over the status quo and significant improvements versus -Cllvm-args=-enable-dfa-jump-thread.
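
As a rough illustration of the integer-pattern case, here is a hedged sketch mirroring the enum example earlier in this thread; the concrete state values (0, 1, 2) are made up for illustration and are not taken from the branch:

#![feature(loop_match)]

fn main() {
    let mut state = 0u32;
    #[loop_match]
    'outer: loop {
        state = 'blk: {
            match state {
                0 => {
                    // The target state is a constant, so the lowering can
                    // jump straight to the `1` arm instead of re-dispatching.
                    #[const_continue]
                    break 'blk 1
                }
                1 => {
                    #[const_continue]
                    break 'blk 2
                }
                _ => break 'outer,
            }
        }
    }
}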

Our next steps can be found in the todo file and focus mostly on improving code quality and robustness.

@traviscross (Contributor)

Thanks for that update. Have reached out separately.
