Current State and Future of "Integer-Only" LLM Inference (Non-Floating Point)

For now, here’s a rundown of the major options available:

Of the options below, I’ve actually tried BitNet myself in the past. It’s very compact, fast, and produces decent results. However, and this isn’t limited to BitNet, floating-point calculations are simpler and faster when GPU hardware support is available, so native integer calculations aren’t widely adopted in mainstream frameworks. That…

Read more →
Avoid Re-encoding Reference Images in Vision-LLM When Comparison Criteria Are User-Defined

Are there ways to cache or pre-load reference images in llama.cpp / Hugging Face pipelines to avoid repeated encoding?

This is an area where things can change significantly based on a single factor, such as the backend or the model repository metadata (for example, even when using the same GGUF with llama.cpp and Ollama, the behaviour of the vision component differs…), so I think it will be quite…

Read more →
Technical Note: VRAM Thermal Saturation during Flux.1 / SDXL Inference on Laptops

It’s probably a limiter designed to prevent hardware damage, so it’s essentially designed to be hard to bypass, but well, there are times when we want to bypass it…


The next step is to make this falsifiable.

Right now the idea is technically coherent. NVIDIA exposes separate GPU temp, memory temp, power limits, and thermal/power throttle reasons, so the basic…

Read more →
Technical Note: VRAM Thermal Saturation during Flux.1 / SDXL Inference on Laptops

Some of this phenomenon seems to be backed up by NVIDIA itself:


The phenomenon described is technically plausible.

What is probably happening

This is best understood as a steady-state inference problem, not just a “GPU core temperature” problem. On supported NVIDIA devices, nvidia-smi exposes GPU Current Temp, Memory Current Temp, GPU Max Operating Temp, and…

Read more →
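To make the readings mentioned above concrete, here is a minimal sketch of polling them from Python. It assumes `nvidia-smi` is on the PATH and that the driver accepts the query field names listed in `FIELDS` (check `nvidia-smi --help-query-gpu`; newer drivers may prefer `clocks_event_reasons.active`). The helper names `parse_sample` and `read_gpu` are made up for this example, and memory temperature in particular often comes back as N/A on laptop GPUs.

```python
import csv
import io
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu` (assumed to be
# available on this driver). Unsupported fields are reported as N/A.
FIELDS = [
    "temperature.gpu",
    "temperature.memory",
    "power.draw",
    "power.limit",
    "clocks_throttle_reasons.active",
]


def parse_sample(line):
    """Parse one CSV line of nvidia-smi output into a dict (None = unsupported)."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    sample = {}
    for name, raw in zip(FIELDS, values):
        raw = raw.strip()
        if "N/A" in raw or "Not Supported" in raw:
            sample[name] = None
        elif name.startswith("clocks_"):
            # The throttle-reason bitmask is printed as hex, e.g. 0x0000000000000004.
            sample[name] = int(raw, 16)
        else:
            sample[name] = float(raw)
    return sample


def read_gpu(index=0):
    """Run nvidia-smi once for the given GPU index and return the parsed sample."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={index}",
            f"--query-gpu={','.join(FIELDS)}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])
```

Calling `read_gpu()` in a loop during a long generation run and logging the samples with timestamps should show whether clocks drop while the core temperature is still moderate, which is the kind of evidence the hypothesis needs.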
KV Cache precision compatibility in Spatial Disaggregation (Prefill-Decode) setups with AWQ/GPTQ models

Hmm… It probably depends on the backend, but there’s usually no need to match the KV cache format to the model weight format.


No. In the usual AWQ and GPTQ deployments, the KV cache does not need to be quantized into the same format as the model weights. The important distinction is that AWQ/GPTQ are primarily weight-quantization schemes, while KV-cache precision is a separate…

Read more →
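As a rough illustration of why the two settings are independent, the usual back-of-the-envelope formula for KV cache size has no term for the weight format at all. The sketch below assumes standard attention with one K and one V tensor per layer, and ignores paging overhead and any per-block quantization scales. The function name and the model shape are hypothetical, chosen only for the arithmetic.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem, batch_size=1):
    """Approximate KV cache size in bytes for one batch of sequences."""
    # Factor 2: one K tensor and one V tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical model shape; the weight format (AWQ, GPTQ, FP16)
# does not appear anywhere in the formula.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

fp16_cache = kv_cache_bytes(**shape, bytes_per_elem=2)  # 16-bit KV cache
fp8_cache = kv_cache_bytes(**shape, bytes_per_elem=1)   # 8-bit KV cache

print(fp16_cache // 2**20, "MiB with a 16-bit cache")  # 512 MiB
print(fp8_cache // 2**20, "MiB with an 8-bit cache")   # 256 MiB
```

Only `bytes_per_elem` changes between the two results, which is why a 4-bit AWQ or GPTQ model can run with a 16-bit cache. In a prefill/decode split, the cache dtype presumably still has to match between the two sides, since one writes the blocks the other reads, but that is a property of the backend rather than of the weight quantization.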
How to make my own single LLM file?

When you can code, the procedure is simple.

“(Since the models used in Ollama are pre-converted to GGUF files) Find the corresponding Hugging Face Transformers format repository (one repository or folder per model as a rule) or convert and generate it from GGUF → Fine-tune the Hugging Face Transformers format model using QLoRA → Merge QLoRA into the base model → Convert it into a single…

Read more →