Microsoft’s New MAI Models: A Technical Analysis

At Build 2026, Microsoft significantly expanded its in-house MAI (Microsoft AI) model family. While much of the public attention focused on Microsoft's ongoing relationship with OpenAI, the more interesting technical story is that Microsoft is increasingly developing its own foundation models across reasoning, coding, image generation, speech synthesis, and transcription.

The latest…

Read more →
Thinking model recommendation for Core Ultra 5 135U

I’m not very familiar with the stack for Intel GPUs/NPUs…

As a safe bet for LLMs in general, the latest model families—such as Gemma 4 and Qwen 3.5—are worth recommending, but they’re so new that software support might still be lacking.

The previous Qwen 3 family, which has good software support, also includes a small but well-known Thinking model.


For a Core Ultra 5 135U with 32 GB…

Read more →
Me and Copilot are done

You’re probably using a different Copilot than the one you intended to use, and if search is turned off, generative AI often can’t access the latest information.

A practical guide to Copilot, AI consistency, and what to use instead

Copilot does not feel inconsistent because you are imagining it. It feels inconsistent because it is not always looking at the same information, using the…

Read more →
How to build custom key-value extraction (similar to Azure Document Intelligence)?

Currently, one of the challenges with using open-source solutions for OCR and related tasks is that they often do not exist in the same form as commercial services, which provide a comprehensive, all-in-one package.

In many cases, while there are plenty of suitable models and libraries available as open-source software for specific tasks, you still need to build the pipeline yourself, and the…

Read more →
Invoice Data Recognition

Hmm… While commercial OCR services may include such features, standalone OCR models are often not very good at properly interpreting multi-page data. This is because, in most cases, the models are primarily trained on pairs of a single page and the information to be extracted…

The most straightforward workaround is to split the document into individual pages before feeding them to the OCR…
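
As a rough sketch of that page-splitting workaround: the snippet below assumes the document has already been rendered into one image per page, and `ocr_page` is a hypothetical stand-in for whatever single-page OCR model you use (neither name comes from a real library). It only shows the loop that keeps each result tied to its page number.

```python
from typing import Any, Callable, Dict, List


def ocr_document(
    pages: List[bytes],
    ocr_page: Callable[[bytes], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Run a single-page OCR callable on every page, one page at a time.

    `ocr_page` is a placeholder (an assumption for this sketch); it receives
    the raw bytes of one page image and returns the extracted fields.
    """
    results = []
    for number, page in enumerate(pages, start=1):
        # Keep the page number so the per-page results can be merged later.
        results.append({"page": number, "fields": ocr_page(page)})
    return results
```

How the per-page results are merged afterwards (for example, line items that continue on the next page) depends on the document and is not covered here.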

Read more →
Technical Note: VRAM Thermal Saturation during Flux.1 / SDXL Inference on Laptops

It’s probably a limiter designed to prevent hardware damage, so it’s essentially designed to be hard to bypass, but well, there are times when we want to bypass it…


The next step is to make this falsifiable.

Right now the idea is technically coherent. NVIDIA exposes separate GPU temp, memory temp, power limits, and thermal/power throttle reasons, so the basic…
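
A minimal sketch of how those readings could be logged, assuming `nvidia-smi` is on the PATH. The query field names below are the ones I know from `nvidia-smi --help-query-gpu`, but availability depends on the driver and GPU; in particular, `temperature.memory` may simply report `N/A` on some cards. The helper names are made up for this example.

```python
import csv
import io
import subprocess
from typing import Dict, List

# Query fields; check `nvidia-smi --help-query-gpu` for your driver version.
FIELDS = [
    "timestamp",
    "temperature.gpu",
    "temperature.memory",
    "power.draw",
    "power.limit",
    "clocks_throttle_reasons.active",
]


def parse_samples(csv_text: str) -> List[Dict[str, str]]:
    """Turn headerless CSV output into one dict per GPU, keyed by field name."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in reader if row]


def sample_gpu() -> List[Dict[str, str]]:
    """Take one reading from every visible GPU (requires nvidia-smi)."""
    completed = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_samples(completed.stdout)
```

Calling `sample_gpu()` in a loop during inference and writing the rows to a file gives a time series in which a memory-temperature plateau and the throttle-reason bitmask can be compared directly.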

Read more →
Best Practices for Handling User Identity in Custom Model Serving (MCP)

Is there no clean solution at the moment…?


This is a real gap, and other teams are running into the same thing.

The short version is:

MCP now has a much clearer standard for authenticating the client to the MCP server. It does not yet have one fully settled, built-in standard for end-user identity propagation through tool execution and downstream APIs. The safe pattern today is to…

Read more →
Invoice Data Recognition

While there are plenty of good existing OCR models, you shouldn’t expect a single model to work well on its own when dealing with extremely messy invoices. It’s better to use them in combination.

How heavy the OCR model or other models need to be depends on just how messy the invoices are…


Build it as a document understanding pipeline, not as a plain NER model.

That is the main…

Read more →
Need help getting started with image generation

If you can get PyTorch installed appropriately once, the rest isn’t too difficult, but that’s the tough part.
For example, how to install WSL2, ComfyUI, and FLUX.1:


Overview: what “FLUX.1 (GGUF) in ComfyUI” actually means

  • ComfyUI is the node-based image generation UI/server. You run it locally and open it in a browser. (ComfyUI Official Document)
  • FLUX.1 GGUF files…
Read more →
Need help getting started with image generation

WSL2 can be installed on most existing Windows 10 systems. However, I don’t think WSL2 is beginner-friendly (unless you’re already familiar with programming or Linux)…
If you do use WSL2, ComfyUI is available, so you can directly apply many online guides. It probably supports the most models among GUI options.
That said, ComfyUI is quite difficult to operate as software, so I don’t think it’s…

Read more →
RAM usage, Model streaming or alternatives

When using large models, not only do the model weights themselves become heavy, but the RAM consumed by context, KV cache, etc. increases even more significantly.
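
To get a feel for the KV cache part, here is a back-of-the-envelope estimate. It assumes a standard transformer with an unquantized cache (one K and one V tensor per layer); the layer count, head count, and head size in the example are hypothetical, not taken from any specific model.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_element: int = 2,  # 2 bytes = f16 cache (assumption)
) -> int:
    """Approximate KV cache size for one sequence.

    The factor 2 accounts for storing both keys and values in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element


# Hypothetical model: 32 layers, 8 KV heads of size 128, 8192-token context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"{size / 2**30:.1f} GiB")  # prints "1.0 GiB"
```

The size grows linearly with context length, so doubling the context doubles this number while the weights stay the same.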


What “fits in memory” actually means (for GGUF/llama.cpp-style inference)

When you run a local LLM, you’re budgeting for more than just the model file size:

  • Model weights (largest chunk; roughly the GGUF “file…
Read more →
Gemma 3 12B: 4-bit Quantization failing/ignored in Transformers v5.1.0 (Gemma3ForConditionalGeneration)

Hmm… For plain Gemma 3 12B alone, using pre-quantized weights is the fastest and most reliable approach, but it can’t be applied to other cases…

It seems environment variables can sometimes reduce spikes:


Why this still happens in Transformers v5 (and why max_memory doesn’t save you)

1) v5 “dynamic weight loading” can legitimately spike peak memory

Transformers v5 loads…

Read more →