With Llama kicking things off, development in the self-hosted text model space has been ridiculously fast. The hardware requirements are getting better, but they’re still fairly steep: you can either put up with painfully slow CPU generation, or, if you have 24GB+ of VRAM, the GPU options really open up.

7B models can sorta run in 12GB, but they’re not great. You really want at least 13B, which needs 24GB of VRAM… or run it on the CPU. Some of them are getting close to ChatGPT quality, definitely not something to sleep on, and I feel like the fediverse would appreciate the idea of self-hosting its own chat bots. Some of these models also have huge context memory, so they actually remember what you’re talking about remarkably well.

A good starting point is this rentry: https://rentry.org/local_LLM_guide

I’m admittedly not great with these yet (and my GPU is only 12GB), but I’m fascinated by the tech and hope there can be some good discussions around it here.

  • @Nazrin
    English
    2
    1 year ago

    What do these #b designations mean? Is it better to have a bigger number?

    • @awoo
      English
      2
      1 year ago

      “B” refers to the number of parameters in the model, in billions, but people have started referring to them as “bits”.

      The bigger the number, the smarter the model will be, but the download size and the RAM/VRAM requirements rise accordingly.

      • LongerDonger
        English
        1
        8 months ago

        “B” refers to the number of parameters in the model, in billions, but people have started referring to them as “bits”.

        This is not entirely correct. B does stand for “Billion” parameters, but bits are a different thing. You can have, for example, a 4-bit 13B model or an 8-bit 3B model. They don’t correlate at all.
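
        A quick back-of-the-envelope way to see how the two interact (a hypothetical helper, rough numbers only; real usage is higher because of context and runtime overhead):

        ```python
        # Illustration only: parameter count (the "B") and bit width are independent
        # knobs, and the weight footprint scales with both.
        def approx_weight_size_gb(params_billion: float, bits_per_weight: int) -> float:
            return params_billion * 1e9 * bits_per_weight / 8 / 1e9

        print(approx_weight_size_gb(13, 16))  # ~26 GB  -> 16-bit 13B is out of reach for most consumer cards
        print(approx_weight_size_gb(13, 8))   # ~13 GB  -> 8-bit 13B roughly fits a 24GB card with room for context
        print(approx_weight_size_gb(13, 4))   # ~6.5 GB -> 4-bit 13B fits a 12GB card
        print(approx_weight_size_gb(3, 8))    # ~3 GB   -> 8-bit 3B is tiny
        ```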

  • @Stargazer6343
    English
    2
    1 year ago

    Same here. Something I’m going to experiment with is using system memory as swap space for VRAM. It seems like it’s possible, and while it would be slow, it would let me use my 4 GB card for this stuff without needing to fit the whole model on the GPU.

    • @awoo
      English
      3
      1 year ago

      It’s possible. Last year I had to put part of the model into system memory since I only had a 1060. You can technically split the model across three tiers: GPU (VRAM), CPU (system RAM), and disk. Both KoboldAI and oobabooga can do this.
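
      If you want to see what that split looks like outside those UIs, here’s a rough sketch of the same idea using Hugging Face transformers + accelerate (not the exact code path KoboldAI or oobabooga use); the model name and memory caps are just placeholder examples:

      ```python
      # Sketch only: let accelerate spread a model across GPU, system RAM, and disk.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_id = "openlm-research/open_llama_7b"  # example model, swap in whatever you run

      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          torch_dtype=torch.float16,
          device_map="auto",                       # place layers on GPU first, then CPU, then disk
          max_memory={0: "4GiB", "cpu": "16GiB"},  # cap GPU 0 at 4GiB, spill the rest into system RAM
          offload_folder="offload",                # anything that still doesn't fit gets paged to disk here
      )

      prompt = "Offloading model layers to system RAM is"
      inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
      output = model.generate(**inputs, max_new_tokens=50)
      print(tokenizer.decode(output[0], skip_special_tokens=True))
      ```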

  • @awoo
    English
    2
    edit-2
    1 year ago

    Glad to see some interest in LLMs here. A small note about this part:

    You really want at least 13B, which needs 24GB of VRAM

    This is sorta outdated information. Most of the Llama-based remixes are using 4-bit quantization now, which makes a 13B model usable with only 12GB of VRAM. Here is one of them if you want to try it out (rough loading sketch below): https://huggingface.co/TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ

    I would say if you have 12GB of VRAM, a local LLM is very doable for you as an entertaining tool. It’s not as smart as ChatGPT, but it’s surprisingly good and much better than what we had at the beginning of the year.
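
    For anyone who wants to try that exact model, this is roughly how the auto-gptq examples of the time load it; treat the argument names and the prompt template as things to double-check against the model card rather than as gospel:

    ```python
    # Rough sketch of loading a 4-bit GPTQ quant with auto-gptq; details can shift
    # between library versions.
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM

    model_id = "TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ"

    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(
        model_id,
        device="cuda:0",      # 4-bit 13B weights land somewhere around 7-8GB of VRAM
        use_safetensors=True,
    )

    # Prompt template is model-specific; this assumes the Vicuna-style format.
    prompt = "USER: Why do self-hosted language models need so much VRAM?\nASSISTANT:"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
    ```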

  • DisaA
    English
    1
    1 year ago

    The tech of AI text models is great, especially when said text models don’t have crippling and often exceptionally restrictive guardrails placed on them. I haven’t heard much about Llama. Is it relatively limitless as far as the type of content it’s allowed to generate, or does it have “protections” built in?

    • @BurgerA
      English
      1
      1 year ago

      I “think” it doesn’t have any restrictions built in, but I’m not sure. It’s a model that was leaked from Facebook.

      • @soulnull (OP)
        English
        2
        edit-2
        1 year ago

        Yeah, the model leaked from Facebook, and it didn’t have any guardrails on it when it leaked. There have been iterations on it with guardrails added, but the benefit of these being local is that people can mostly take them off (I say mostly because it’s sorta like a lobotomy… it stops saying the preprogrammed crap, but it also starts acting weird with context).

        There are others being iterated on without guardrails, and in my limited experience testing local models, the 13B model I played with was able to keep up with the weird rules I placed upon it (talking dirty in uwu-speak… I swear it was purely for testing… no seriously, it was… wait, why don’t you believe me? I swear, I just wanted to see if it would…). By the way, the 7B model couldn’t figure out the uwu-speak at all; there was basically no narrative whatsoever, it just blindly agreed with everything.