Releases: LostRuins/koboldcpp

koboldcpp-1.78

16 Nov 02:15

  • NEW: Added support for Flux and Stable Diffusion 3.5 models: Image generation has been updated with new architecture support (thanks to stable-diffusion.cpp) and additional enhancements. You can use either fp16 or fp8 safetensors models, or GGUF models. Supports all-in-one models (bundled T5XXL, Clip-L/G, VAE) or loading the components individually.
  • Debug mode prints penalties for XTC
  • Added a new flag --nofastforward, which forces full prompt reprocessing on every request. It can potentially give more repeatable/reliable/consistent results in some cases.
  • CLBlast support is still retained, but has been further downgraded to "compatibility mode" and is no longer recommended (use Vulkan instead). CLBlast GPU offload must now maintain a duplicate copy of the layers in RAM as well, as it now piggybacks off the CPU backend.
  • Added the common identity provider endpoint /.well-known/serviceinfo (Haidra-Org/AI-Horde#466, PygmalionAI/aphrodite-engine#807, theroyallab/tabbyAPI#232)
  • Reverted some changes that reduced speed in HIPBLAS.
  • Fixed a bug where bad logprobs JSON was output when logits were -Infinity
  • Updated Kobold Lite, multiple fixes and improvements
    • Added support for custom CSS styles
    • Added support for generating larger images (select BigSquare in image gen settings)
    • Fixed some streaming issues when connecting to Tabby backend
    • Better world info length limiting (capped at 50% of max context before appending to memory)
    • Added support for Clip Skip for local image generation.
  • Merged fixes and improvements from upstream
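As a rough sketch, once an image model is loaded you can request images through KoboldCpp's A1111-compatible API; the endpoint path and field names below follow the usual A1111-style schema and should be checked against your build:

```python
import json

# Minimal sketch of an image generation request, assuming KoboldCpp's
# A1111-compatible /sdapi/v1/txt2img endpoint; exact field support may
# vary by build, so check the API docs for your version.
payload = {
    "prompt": "a lighthouse at dusk, oil painting",
    "negative_prompt": "blurry, low quality",
    "width": 512,
    "height": 512,
    "steps": 20,
    "cfg_scale": 5.0,
}
body = json.dumps(payload)
# POST `body` to http://localhost:5001/sdapi/v1/txt2img;
# the response contains base64-encoded image data.
```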

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI. Once the model is loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).

For more information, run the program from the command line with the --help flag.
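Once connected, a minimal client call against the KoboldAI generate endpoint might look like this (the host and prompt are placeholders; "results" is the standard KoboldAI response envelope):

```python
import json
from urllib import request

# Minimal sketch of calling the KoboldAI generate endpoint once a model
# is loaded; only the basic payload fields are shown here.
payload = {"prompt": "Once upon a time,", "max_length": 64}

def generate(host="http://localhost:5001"):
    req = request.Request(
        f"{host}/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["results"][0]["text"]

# print(generate())  # requires a running KoboldCpp instance
```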

koboldcpp-1.77

01 Nov 16:32

the road not taken edition

  • NEW: Token Probabilities (logprobs) are now available over the API! Currently they are only supplied over the sync (non-streaming) API, but a dedicated /api/extra/last_logprobs endpoint is also provided. KoboldAI Lite will work and provide a link to view alternate token probabilities for both streaming and non-streaming requests if "logprobs" is enabled in its settings. It will also work in SillyTavern when streaming is disabled, once the latest build is out.
  • Response prompt_tokens, completion_tokens and total_tokens are now accurate values instead of placeholders.
  • Enabled CUDA graphs for the cuda12 build, which can improve performance on some cards.
  • Fixed a bug where .wav audio files uploaded directly to the /v1/audio/transcriptions endpoint got fragmented and cut off early. Audio sent as base64 within JSON payloads is unaffected.
  • Fixed a bug where Whisper transcription blocked generation in non-multiuser mode.
  • Fixed a bug where trim_stop did not remove a stop sequence that was divided across multiple tokens in some cases.
  • Significantly increased the maximum limits for stop sequences, anti-slop token bans, logit biases and DRY sequence breakers (thanks to @mayaeary for the PR, which changes the way some parameters are passed to the CPP side)
  • Added link to help page if user fails to select a model.
  • The Flash Attention toggle in the GUI quick launcher is now hidden by default if Vulkan is selected (it usually reduces performance there).
  • Updated Kobold Lite, multiple fixes and improvements
    • NEW: Experimental ComfyUI Support Added!: ComfyUI can now be used as an image generation backend API from within KoboldAI Lite. No workflow customization is necessary. Note: ComfyUI must be launched with the flags --listen --enable-cors-header '*' to enable API access. Then you may use it normally like any other Image Gen backend.
    • Clarified the option for selecting A1111/Forge/KoboldCpp as an image gen backend, since Forge is gradually superseding A1111. This option is compatible with all 3 of the above.
    • You are now able to generate images from instruct mode via natural language, similar to ChatGPT (e.g. "Please generate an image of a bag of sand"). This requires having an image model loaded; it uses regex and is enabled by default, and can be disabled in settings.
    • Added support for Tavern "V3" character cards: V3 is not actually a distinct format; it's an augmented V2 card used by Risu that adds additional metadata chunks. These chunks are not supported in Lite, but the base V2 card functionality will work.
    • Added new scenario "Interactive Storywriter": This is similar to story writing mode, but allows you to secretly steer the story with hidden instruction prompts.
    • Added Token Probability Viewer - You can now see a table of alternative token probabilities in responses. Disabled by default, enable in advanced settings.
    • Fixed JSON file selection problems in some mobile browsers.
    • Fixed Aetherroom importer.
    • Minor Corpo UI layout tweaks by @Ace-Lite
  • Merged fixes and improvements from upstream
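A hedged sketch of using the new logprobs facility follows; note that the "logprobs" payload flag name is an assumption here, so verify it against your build's API documentation:

```python
import json

# Sketch of enabling token probabilities on a generate request and then
# fetching them from the dedicated endpoint added in this release.
gen_payload = {
    "prompt": "The capital of France is",
    "max_length": 16,
    "logprobs": True,  # assumed flag name for enabling logprobs
}
body = json.dumps(gen_payload)
# 1. POST `body` to /api/v1/generate (sync, non-streaming)
# 2. POST to /api/extra/last_logprobs to retrieve the alternate token
#    probabilities for that request.
```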

koboldcpp-1.76

11 Oct 13:06

shivers down your spine edition

  • NEW: Added Anti-Slop Sampling (Phrase Banning) - You can now provide a list of words or phrases that are prevented from being generated: when one appears, the sampler backtracks and regenerates the response without it. This capability has been merged into the existing token banning feature, and is also aliased to the banned_strings field.
    • Note: When using Anti-Slop phrase banning, streaming outputs are slightly delayed - this is to allow space for the AI to backtrack a response if necessary. This delay is proportional to the length of the longest banned slop phrase.
    • Up to 48 phrase banning sequences can be used; they are not case sensitive.
  • The /api/extra/perf/ endpoint now includes whether the instance was launched in quiet mode (terminal outputs). Note that this is not foolproof - instances can be running modified versions of KoboldCpp.
  • Added timestamp information when each request starts.
  • Increased some limits for number of stop sequences, logit biases, and banned phrases.
  • Fixed a GUI launcher bug when a changed backend dropdown was overridden by a CLI flag.
  • Updated Kobold Lite, multiple fixes and improvements
    • NEW: Added a new scenario - Roleplay Character Creator. This Kobold Lite scenario presents users with an easy-to-use wizard for creating their own roleplay bots with the Aesthetic UI. Simply fill in the requested fields and you're good to go. The character can always be edited subsequently from the 'Context' menu. Alternatively, you can also load a pre-existing Tavern Character Card.
    • Updated token banning settings to include Phrase Banning (Anti-Slop).
    • Minor fixes and tweaks
  • Merged fixes and improvements from upstream
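A minimal sketch of using phrase banning from the API, via the banned_strings field named above; the prompt and phrases are placeholders:

```python
import json

# Sketch of a generate request using anti-slop phrase banning; phrases
# are case-insensitive and up to 48 sequences are allowed.
payload = {
    "prompt": "She looked at him and felt",
    "max_length": 120,
    "banned_strings": ["shivers down your spine", "barely above a whisper"],
}
assert len(payload["banned_strings"]) <= 48
body = json.dumps(payload)
```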

koboldcpp-1.75.2

21 Sep 08:01

Nothing lasts forever edition

  • Important: When running from the command line, if no backend was explicitly selected (--use...), a GPU backend is now auto-selected by default if available. This can be overridden by picking a specific backend (e.g. --usecpu, --usevulkan, --usecublas). As a result, dragging and dropping a gguf model onto the koboldcpp.exe executable will launch it with GPU and gpulayers auto-configured.
  • Important: The OpenBLAS backend has been removed and unified with the NoBLAS backend to form a single Use CPU option. This utilizes the sgemm functionality that llamafile upstreamed, so processing speeds should still be comparable. The --noblas flag is also deprecated; CPU mode can instead be enabled with the --usecpu flag.
  • Added support for RWKV v6 models (context shifting not supported)
  • Added a new flag --showgui that allows the GUI to be shown even when command line flags are used. Instead of launching directly, command line flags are imported into the GUI itself, allowing them to be modified. This also works with .kcpps config files.
  • Added a warning display when loading legacy GGML models
  • Fix for DRY sampler occasionally segfaulting on bad unicode input.
  • Embedded Horde workers now work with password protected instances.
  • Updated Kobold Lite, multiple fixes and improvements
    • Added first-start welcome screen, to pick a starting UI Theme
    • Added support for OpenAI-Compatible TTS endpoints
    • Added a preview option for alternate greetings within a V2 Tavern character card.
    • Now works with Kobold API backends with gated model lists e.g. Tabby
    • Added display-only regex replacement, allowing you to hide or replace displayed text while keeping the original used with the AI in context.
    • Added a new Instruct scenario to mimic CoT Reflection (Thinking)
    • Sampler presets now reset seed, but no longer reset generation amount setting.
    • Markdown parser fixes
    • Added system role for Metharme instruct format
    • Added a toggle for chat name format matching, allowing matching any name or only predefined names.
    • Fixed markdown image scaling
  • Merged fixes and improvements from upstream

Hotfix 1.75.1: Auto backend selection and clblast fixes
Hotfix 1.75.2: Fixed RWKV, modified mistral templates

koboldcpp-1.74

31 Aug 03:41

Kobo's all grown up now

  • NEW: Added the XTC (Exclude Top Choices) sampler, a brand new creative writing sampler designed by the author of DRY (@p-e-w). To use it, increase xtc_probability above 0 (recommended starting values: xtc_threshold=0.15, xtc_probability=0.5)
  • Added automatic image resizing and letterboxing for llava/minicpm images, this should improve handling of oddly-sized images.
  • Added a new flag --nomodel which allows launching the Lite WebUI without loading any model at all. You can then select an external api provider like Horde, Gemini or OpenAI
  • MacOS defaults to full offload when -1 gpulayers selected
  • Minor tweaks to context shifting thresholds
  • The Horde worker now has a 5-minute timeout for each request, which should reduce the likelihood of getting stuck (e.g. on internet issues). Also, the horde worker now supports connecting to SSL-secured KoboldCpp instances (remember to enable --nocertify if using self-signed certs)
  • Updated Kobold Lite, multiple fixes and improvements
  • Merged fixes and improvements from upstream (plus Llama-3.1-Minitron-4B-Width support)
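A minimal sketch of a generate request using the recommended XTC starting values from the notes above:

```python
import json

# Sketch of enabling the XTC sampler; any xtc_probability above 0
# activates it, with xtc_threshold controlling the cutoff.
payload = {
    "prompt": "Write the opening line of a mystery novel.",
    "max_length": 80,
    "xtc_threshold": 0.15,
    "xtc_probability": 0.5,
}
body = json.dumps(payload)
```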

koboldcpp-1.73.1

19 Aug 08:45

  • NEW: Added dual-stack (IPv6) network support. KoboldCpp now properly runs on IPv6 networks, the same instance can serve both IPv4 and IPv6 addresses automatically on the same port. This should also fix problems with resolving localhost on some systems. Please report any issues you face.
  • NEW: Added official MacOS pyinstaller binary builds! Modern MacOS (M1, M2, M3) users can now use KoboldCpp without having to self-compile, simply download and run koboldcpp-mac-arm64. Special thanks to @henk717 for setting this up.
  • NEW: Pure CLI Mode - Added --prompt, allowing KoboldCpp to be used entirely from command-line alone. When running with --prompt, all other console outputs are suppressed, except for that prompt's response which is piped directly to stdout. You can control the output length with --promptlimit. These 2 flags can also be combined with --benchmark, allowing benchmarking with a custom prompt and returning the response. Note that this mode is only intended for quick testing and simple usage, no sampler settings will be configurable.
  • Changed the default benchmark prompt to prevent stack overflow on old bpe tokenizer.
  • Pre-filter to the top 5000 token candidates before sampling; this greatly improves sampling speed on models with massive vocab sizes, with negligible response changes.
  • Moved chat completions adapter selection to Model Files tab.
  • Improve GPU layer estimation by accounting for in-use VRAM.
  • --multiuser now defaults to true. Set --multiuser 0 to disable it.
  • Updated Kobold Lite, multiple fixes and improvements
  • Merged fixes and improvements from upstream, including Minitron and MiniCPM features (note: there are some broken minitron models floating around - if stuck, try this one first!)

Hotfix 1.73.1 - Fixed the broken DRY sampler, fixed sporadic streaming issues, and added a letterboxing mode for images in Lite. The previous v1.73 release was buggy, so you are strongly advised to upgrade to this patch release.
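A sketch of driving the new pure CLI mode from a script; the binary name and model path are placeholders:

```python
import subprocess

# Sketch of pure CLI mode: --prompt suppresses all other console output
# and pipes only the response to stdout; --promptlimit caps its length.
args = [
    "koboldcpp", "--model", "model.gguf",
    "--prompt", "Summarize the plot of Hamlet in one sentence.",
    "--promptlimit", "100",
]
# result = subprocess.run(args, capture_output=True, text=True)
# print(result.stdout)  # only the prompt's response is printed
```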

koboldcpp-1.72

02 Aug 10:29

  • NEW: GPU accelerated Stable Diffusion Image Generation is now possible on Vulkan, huge thanks to @0cc4m
  • Fixed an issue with mismatched CUDA device ID order.
  • Incomplete SSE response for short sequences fixed (thanks @pi6am)
  • SSE streaming fix for unicode heavy languages, which should hopefully mitigate characters going missing due to failed decoding.
  • GPU layers now default to -1 when running in GUI mode, instead of overwriting the existing layer count. The predicted layer count is now shown as an overlay label instead, allowing you to see the total layers as well as estimation changes when you adjust launcher settings.
  • Auto GPU Layer estimation takes into account loading image and whisper models.
  • Updated Kobold Lite: Now supports SSE streaming over OpenAI API as well, should you choose to use a different backend.
  • Merged fixes and improvements from upstream, including Gemma2 2B support.

koboldcpp-1.71.1

25 Jul 06:09

oh boy, another extra 30MB just for me? you shouldn't have!

  • Updated Kobold Lite:
    • Corpo UI Theme is now available for chat mode as well.
    • More accessibility labels for screen readers.
    • Enabling inject chatnames in the Corpo UI now also replaces the AI's displayed name.
    • Added setting for TTS narration speed.
    • Allow selecting the greeting message in Character Cards with multiple greetings
  • NEW: Automatic GPU layer selection has been improved, thanks to the efforts of @henk717 and @Pyroserenus. You can now also set --gpulayers to -1 to have KoboldCpp guess how many layers to use. Note that this is still experimental and the estimation may not be fully accurate, so you may still get better results by manually selecting the GPU layer count.
  • NEW: Added KoboldCpp Launch Templates. These are sharable .kcppt files that contain the setup necessary for other users to easily load and use your models. You can embed everything necessary to use a model within one file, including URLs to the desired model files, a preloaded story, and a chatcompletions adapter. Then anyone using that template can immediately get a properly configured model setup, with correct backend, threads, GPU layers, and formats ready to use on their own machine.
    • For a demo, to run Llama3.1-8B, try this koboldcpp.exe --config https://huggingface.co/koboldcpp/kcppt/resolve/main/Llama-3.1-8B.kcppt , everything needed will be automatically downloaded and configured.
  • Fixed a crash when running a model with llava and debug mode enabled.
  • iq4_nl format support in Vulkan by @0cc4m
  • Updated embedded winclinfo for windows, other minor fixes
  • --unpack now does not include .pyd files as they were causing version conflicts.
  • Merged fixes and improvements from upstream, including Mistral Nemo support.

Hotfix 1.71.1 - Fix for llama3 rope_factors, fixed loading older Phi3 models without SWA, other minor fixes.

koboldcpp-1.70.1

15 Jul 02:15

mom: we have ChatGPT at home edition

  • Updated Kobold Lite:
    • Introducing Corpo Mode: a new beginner-friendly UI theme that aims to closely emulate the ChatGPT look and feel, providing a clean, simple and minimalistic interface. It has a limited feature set compared to other UI themes, but should feel very familiar and intuitive for new users. Now available for instruct mode!
    • Settings Menu Rework: The settings menu has also been completely overhauled into 4 distinct panels, and should feel a lot less cramped now, especially on desktop.
    • Sampler Presets and Instruct Presets have been updated and modernized.
    • Added support for importing character cards from aicharactercards.com
    • Added copy for code blocks
    • Added support for dedicated System Tag and System Prompt (you are still encouraged to use the Memory feature instead)
    • Improved accessibility, keyboard tab navigation and screen reader support
  • NEW: Official releases now provide Windows binaries with AVX1 CUDA support included; download koboldcpp_oldcpu.exe
  • NEW: DRY dynamic N-gram anti-repetition sampler support has been added (credits @pi6am)
  • Added --unpack, a new self-extraction feature that allows KoboldCpp binary releases to be unpacked into an empty directory. This allows easy modification and access to the files and contents embedded inside the PyInstaller. Can also be used in the GUI launcher.
  • Fix for a Vulkan regression in Q4_K_S mistral models when offloading to GPU (thanks @0cc4m).
  • Experimental support for OpenAI tools and function calling API (credits @teddybear082)
  • Added a workaround for Deepseek crashing due to unicode decoding issues.
  • --chatcompletionsadapter can now be selected on included pre-bundled templates by filename, e.g. Llama-3.json, pre-bundled templates have also been updated for correctness (thanks @xzuyn).
  • Default --contextsize is finally increased to 4096, default Chat Completions API output length is also increased.
  • Merged fixes and improvements from upstream, including multiple Gemma fixes.
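As an illustrative sketch, a generate request enabling DRY might look like the following; the dry_* parameter names here are assumptions based on common DRY implementations, not confirmed by these notes, so verify them against your build's API documentation:

```python
import json

# Sketch of enabling the DRY anti-repetition sampler in a generate
# request; a multiplier above 0 typically activates it.
payload = {
    "prompt": "The rain kept falling and",
    "max_length": 100,
    "dry_multiplier": 0.8,     # assumed name: overall penalty strength
    "dry_base": 1.75,          # assumed name: exponential penalty base
    "dry_allowed_length": 2,   # assumed name: n-gram grace length
}
body = json.dumps(payload)
```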

1.70.1: Fixed a bug with --unpack not including the py files, with the oldcpu binary missing some options, and swapped the cu11 linux binary to not use avx2 for best compatibility. The cu12 linux binary still uses avx2 for max performance.

koboldcpp-1.69.1

01 Jul 06:44

  • Fixed an issue when selecting ubatch, which should now correctly match the blasbatchsize
  • Added separator tokens when selecting multiple images with LLaVA. Unfortunately, the model still tends to get mixed up and confused when working with multiple images in the same request.
  • Added a set of premade Chat Completions adapters selectable in the GUI launcher (thanks @henk717) which provide easy instruct templates for various models and formats, should you want to use third-party OpenAI-based (chat completion) frontends with KoboldCpp. This can help you override the instruct format even if the frontend does not directly support it. For more information on --chatcompletionsadapter see the wiki.
  • Allow inserting an extra added forced positive or forced negative prompt for stable diffusion (set add_sd_prompt and add_sd_negative_prompt in a loaded adapter).
  • Switched the KoboldCpp Colab over to precompiled Linux binaries; it starts and runs much faster now. The Huggingface Tiefighter Space example has also been updated likewise (thanks @henk717). Lastly, added information about using KoboldCpp on RunPod at https://koboldai.org/runpodcpp/
  • Fixed some utf decode errors.
  • Added tensor split GUI launcher input field for Vulkan.
  • Merged fixes and improvements from upstream, including the improved mmq with int8 tensor core support and Gemma 2 features.
  • Updated the Kobold Lite chatnames stopper for instruct mode. Also, Kobold Lite can now fall back to an alternative API or endpoint URL if the connection fails: you may attempt to reconnect using the OpenAI API instead, or use a different URL.
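A sketch of an adapter file using the forced stable diffusion prompt fields described above; the prompt strings themselves are illustrative placeholders:

```python
import json
import os
import tempfile

# Sketch of a chat completions adapter that also injects forced
# positive/negative prompts for stable diffusion.
adapter = {
    "add_sd_prompt": "masterpiece, best quality, ",
    "add_sd_negative_prompt": "lowres, watermark, ",
}
path = os.path.join(tempfile.gettempdir(), "my_adapter.json")
with open(path, "w") as f:
    json.dump(adapter, f, indent=2)
# load it with: koboldcpp --chatcompletionsadapter my_adapter.json
```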

1.69.1 - Merged the fixes for gemma 2 and IQ mmvq
