1. 14 Aug, 2024 5 commits
  2. 13 Aug, 2024 4 commits
  3. 12 Aug, 2024 13 commits
  4. 09 Aug, 2024 10 commits
    • Daniël de Kok's avatar
      8dcc7d3f
    • drbh's avatar
      feat: add guideline to chat request and template (#2391) · 0d06aed0
      drbh authored
      * feat: add guideline to chat request and template
      
      * fix: add template test and update docs
      0d06aed0
    • Nicolas Patry's avatar
      Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385) · 7a48a847
      Nicolas Patry authored
      * Using an enum for flash backends (paged/flashdecoding/flashinfer)
      
      * Early exit on server too.
      
      * Clippy.
      
      * Fix clippy and fmt.
      7a48a847
    • Daniël de Kok's avatar
      flake: use rust-overlay (#2390) · 6e127dcc
      Daniël de Kok authored
      6e127dcc
    • Vaibhav Srivastav's avatar
      Update documentation for Supported models (#2386) · b2b9c427
      Vaibhav Srivastav authored
      * Minor doc fixes
      
      * up.
      
      * Other minor updates.
      b2b9c427
    • Daniël de Kok's avatar
      flake: add fmt and clippy (#2389) · 977534bc
      Daniël de Kok authored
      977534bc
    • Nicolas Patry's avatar
    • Daniël de Kok's avatar
      Add experimental flake (#2384) · c6d5039c
      Daniël de Kok authored
      Add flake.nix
      c6d5039c
    • Daniël de Kok's avatar
      Add FlashInfer support (#2354) · 7830de15
      Daniël de Kok authored
      This change adds support for FlashInfer. FlashInfer can be enabled using
      `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
      Since this functionality is currently only for testing, FlashInfer is
      not installed anywhere yet.
      
      The FlashInfer API is quite different from FlashAttention/vLLM in that
      it requires more global bookkeeping:
      
      * A wrapper class needs to be constructed (which we just call *state*).
        Since this is fairly expensive (due to pinned host memory allocation),
        we only do this once in a FlashCausalLM instance or for each CUDA
        Graph size.
      * Each model forward call needs to be wrapped in `begin_forward` and
        `end_forward`. This sets up data structures that can be reused for all
        calls to attention for that forward call.
      
      When calling attention, we need access to the state object. To avoid
      passing an argument down the call chain (which would require changes to
      all models), we use a context variable.
      
      Each model forward call is wrapped using a context manager that does all
      the bookkeeping for such a call:
      
      * Set the context variable to the forward call's state.
      * Call `begin_forward` on the state.
      * Yield.
      * Call `end_forward` on the state.
      * Reset the context variable.
      
      We cannot use a single shared global variable for this, since e.g. CUDA
      Graphs of different sizes each have their own state.
      7830de15
    • drbh's avatar
      Pr 2352 ci branch (#2382) · 6d06473c
      drbh authored
      
      * Fix unsigned integer underflow
      
      Passing --max-batch-size to the launcher actually had no effect
      because after a few requests the max_size passed to State::next_batch
      would underflow, becoming a large positive number.
      
      In the scheduler, as soon as the cached batch size reaches
      max_batch_size, the max_size passed to next_batch becomes 0.
      Since the only check in that function is
      ```
      if Some(batch_requests.len()) == max_size {
          break;
      }
      ```
      and it runs only after `batch_requests.len()` has already
      become 1, it does nothing to prevent more than 0
      requests from being batched.
      
      Now we have a cached batch in the server that is larger than
      max_batch_size, and `max_size - batch_size as usize`
      underflows.
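
The failure mode described above can be modelled in a few lines. This is a simplified Python model of the Rust logic, not the scheduler's code: `next_batch`, `wrapping_sub`, and `saturating_sub` are hypothetical helpers, `usize` is assumed to be 64 bits wide, and the wrap-around models builds where integer overflow checks are disabled. Saturating subtraction is shown only as one possible guard and is not necessarily what the fix in this commit does.

```python
from typing import List, Optional

USIZE_MAX = 2**64 - 1  # assumption: 64-bit target


def wrapping_sub(a: int, b: int) -> int:
    """Model unsigned subtraction that wraps around instead of going negative."""
    return (a - b) & USIZE_MAX


def saturating_sub(a: int, b: int) -> int:
    """One possible guard: clamp the result at zero."""
    return max(a - b, 0)


def next_batch(queue: List[int], max_size: Optional[int]) -> List[int]:
    """Simplified model of the batching loop: the size check runs after the push."""
    batch: List[int] = []
    while queue:
        batch.append(queue.pop(0))
        # With max_size == 0 this is never true, because len(batch) is already >= 1.
        if max_size is not None and len(batch) == max_size:
            break
    return batch


max_batch_size = 4
cached = 4                                        # cached batch is exactly full
max_size = wrapping_sub(max_batch_size, cached)   # 0
cached += len(next_batch([10, 11, 12], max_size)) # all three requests are batched
overflowed = wrapping_sub(max_batch_size, cached) # wraps to a huge value
```

In this model `cached` ends at 7 with `max_batch_size` at 4, so the next call computes a limit close to `USIZE_MAX` and the configured maximum no longer constrains anything, whereas `saturating_sub(max_batch_size, cached)` would stay at 0.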
      Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
      
      * fix: update v3 scheduler and ensure max_batch_size > 0
      
      ---------
      Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
      Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
      6d06473c
  5. 08 Aug, 2024 7 commits
  6. 07 Aug, 2024 1 commit