    feat(server): Rework model loading (#344) · abd58ff8
    Nicolas Patry authored
    # What does this PR do?
    
    Reworked the loading logic. The idea is to use cleaner loading code:
    
    - Remove need for `no_init_weights`
    - Remove all of the weird `bnb_linear`, `load_weights`, and
    `post_load_weights` code.
    
    New code layout:
    
    - New class `Weights` in charge of loading the weights from
    multiple files into the appropriate tensors (potentially sharded)
    - TP layers are now "shells": they contain the code that knows what kind of
    sharding we need, plus a possible `all_reduce`. They do not inherit from
    a linear layer; they contain some kind of linear layer instead
    - The contained linear can be either FastLinear or BnbLinear, with a GPTQ
    linear coming next.
    - All modeling code is explicitly written for sharding; the process group is
    just a no-op for non-sharded code (removes a lot of test cases)
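
A minimal sketch of the layout described above, in plain Python with nested lists standing in for tensors. This is not the code from this PR: `FakeProcessGroup`, `get_tensor`, `get_sharded`, `TensorParallelRowLinear`, and `load` are hypothetical names chosen for illustration, and the `FastLinear` here is a toy that only mimics a linear layer without bias.

```python
from typing import Dict, List

Matrix = List[List[float]]


class FakeProcessGroup:
    """Hypothetical stand-in for a process group (size=1 means non-sharded)."""

    def __init__(self, rank: int = 0, size: int = 1) -> None:
        self._rank, self._size = rank, size

    def rank(self) -> int:
        return self._rank

    def size(self) -> int:
        return self._size

    def all_reduce(self, values: List[float]) -> List[float]:
        # A real group would sum `values` across ranks. This stand-in has
        # no peers, so the call is a no-op.
        return values


class Weights:
    """Toy loader: finds a tensor by name across several 'files'."""

    def __init__(self, files: List[Dict[str, Matrix]], process_group) -> None:
        # Map each tensor name to the file (here: a dict) that holds it.
        self.routing = {name: f for f in files for name in f}
        self.process_group = process_group

    def get_tensor(self, name: str) -> Matrix:
        return self.routing[name][name]

    def get_sharded(self, name: str, dim: int) -> Matrix:
        """Return this rank's slice along `dim` (0 = rows, 1 = columns)."""
        tensor = self.get_tensor(name)
        rank, world = self.process_group.rank(), self.process_group.size()
        length = len(tensor) if dim == 0 else len(tensor[0])
        assert length % world == 0, "dimension must divide evenly across ranks"
        block = length // world
        start, stop = rank * block, (rank + 1) * block
        if dim == 0:
            return tensor[start:stop]
        return [row[start:stop] for row in tensor]


class FastLinear:
    """Toy linear layer; weight is laid out as [out_features][in_features]."""

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight

    def forward(self, x: List[float]) -> List[float]:
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


class TensorParallelRowLinear:
    """Shell: owns the sharding choice and the all_reduce, wraps a linear."""

    def __init__(self, linear, process_group) -> None:
        self.linear = linear  # contained, not inherited
        self.process_group = process_group

    @classmethod
    def load(cls, weights: Weights, name: str) -> "TensorParallelRowLinear":
        # Row-parallel: each rank keeps a slice of the input dimension.
        shard = weights.get_sharded(name, dim=1)
        return cls(FastLinear(shard), weights.process_group)

    def forward(self, x_shard: List[float]) -> List[float]:
        partial = self.linear.forward(x_shard)
        return self.process_group.all_reduce(partial)


files = [
    {"proj.weight": [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]},
    {"other.weight": [[1.0]]},
]
x = [1.0, 1.0, 1.0, 1.0]

# Non-sharded: same modeling code, the process group does nothing.
full = TensorParallelRowLinear.load(
    Weights(files, FakeProcessGroup()), "proj.weight"
).forward(x)

# Simulate two ranks by hand; summing the partial outputs plays the role
# that all_reduce would play in a real distributed run.
partials = []
for rank in range(2):
    layer = TensorParallelRowLinear.load(
        Weights(files, FakeProcessGroup(rank, 2)), "proj.weight"
    )
    partials.append(layer.forward(x[rank * 2:(rank + 1) * 2]))
summed = [a + b for a, b in zip(*partials)]
```

In this toy setup `summed` equals `full`, which is the property that lets one code path serve both the sharded and non-sharded cases.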
    
    ![Screenshot from 2023-05-19
    23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)
    
    ---------
    
    Co-authored-by: Ubuntu <ubuntu@ip-1...