What is Jlama and why does it matter?

Pure Java inference engine for LLMs (no external runtime dependencies)
Jlama avoids cross-runtime fragmentation by keeping inference entirely within the Java ecosystem, eliminating the need for Python services or native model servers. This allows LLM features to be integrated directly into existing Java applications while retaining a single language, deployment surface, and operational toolchain.
Supports major model families: LLaMA 1/2/3, Mistral, Qwen, Gemma, GPT-2, BERT, and others.

Java Vector API and Enabling Preview Features

One implementation detail of Jlama is its use of the Java Vector API (part of Project Panama) for SIMD (Single Instruction, Multiple Data) acceleration. SIMD enables parallel vector math such as dot products, multiply-adds, and tensor operations by executing them across wide CPU lanes in a single instruction, rather than looping element by element.
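
To make the lane idea concrete, here is a minimal sketch in plain Java that mimics how a SIMD dot product is structured: process a fixed-size chunk per step, keep one partial sum per lane, then reduce. It is not Jlama's code and it does not use the Vector API itself (so it runs without any extra JVM flags). The class name and the lane count of 8 are assumptions chosen for illustration. With the real Vector API, the inner loop would be replaced by operations on a FloatVector from jdk.incubator.vector.

```java
final class LaneWiseDotProduct {

    // Hypothetical lane count for illustration; e.g. a 256-bit register holds 8 floats.
    static final int LANES = 8;

    /** Baseline: one multiply-add per loop iteration. */
    static float scalarDot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Same result computed in chunks of LANES elements, mirroring the shape of SIMD code. */
    static float laneWiseDot(float[] a, float[] b) {
        float[] acc = new float[LANES];              // one running sum per lane
        int upper = a.length - (a.length % LANES);   // largest multiple of LANES that fits
        int i = 0;
        for (; i < upper; i += LANES) {
            // On SIMD hardware this inner loop corresponds to one wide multiply-add.
            for (int lane = 0; lane < LANES; lane++) {
                acc[lane] += a[i + lane] * b[i + lane];
            }
        }
        float sum = 0f;
        for (float laneSum : acc) {                  // horizontal reduction across lanes
            sum += laneSum;
        }
        for (; i < a.length; i++) {                  // scalar tail for leftover elements
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        float[] b = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
        System.out.println(scalarDot(a, b));    // prints 110.0
        System.out.println(laneWiseDot(a, b));  // prints 110.0
    }
}
```

The emulated version is no faster than the scalar loop. It only shows the chunk, accumulate, reduce, and tail structure that vectorized kernels follow.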

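Jlama depends on the Java Vector API. As of Java 25, the Vector API is still an incubator module (jdk.incubator.vector), so it has to be added explicitly with --add-modules. Incubator modules are technically distinct from preview features, but the setup used here passes --enable-preview alongside it, so both flags appear in every configuration below. According to the documentation regarding preview features: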

To use preview language features in your programs, you must explicitly enable them in the compiler and the runtime system. If not, you'll receive an error message that states that your code is using a preview feature and preview features are disabled by default.

Enabling Preview Features via Maven at Compile Time


<properties>
    <java.version>25</java.version>
</properties>


<build>
    <plugins>
        <!-- https://mvnrepository.com/artifact/org.springframework.boot/spring-boot-maven-plugin -->
        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
            <configuration>
                <jvmArguments>--enable-preview --add-modules=jdk.incubator.vector</jvmArguments>
            </configuration>
        </plugin>
        <!-- https://mvnrepository.com/artifact/org.apache.maven.plugins/maven-compiler-plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <!-- release alone covers source level, target level, and the platform API -->
                <release>${java.version}</release>
                <compilerArgs>
                    <arg>--enable-preview</arg>
                    <arg>--add-modules=jdk.incubator.vector</arg>
                </compilerArgs>
            </configuration>
        </plugin>
        <!-- https://mvnrepository.com/artifact/org.apache.maven.plugins/maven-surefire-plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <configuration>
                <argLine>--enable-preview --add-modules=jdk.incubator.vector</argLine>
            </configuration>
        </plugin>
        <!-- https://mvnrepository.com/artifact/org.apache.maven.plugins/maven-failsafe-plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-failsafe-plugin</artifactId>
            <configuration>
                <argLine>--enable-preview --add-modules=jdk.incubator.vector</argLine>
            </configuration>
        </plugin>
    </plugins>
</build>

Enabling Preview Features at Runtime

#!/usr/bin/env zsh

java --enable-preview --add-modules=jdk.incubator.vector -jar path/to/your/app.jar

Alternatively, you can specify JVM arguments using the JDK_JAVA_OPTIONS environment variable:

#!/usr/bin/env zsh

export JDK_JAVA_OPTIONS='--enable-preview --add-modules=jdk.incubator.vector'
java -jar path/to/your/app.jar
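
If you want to confirm that the flags actually reached the JVM, a small diagnostic like the following can help. This is a sketch, not part of Jlama: the class and method names are hypothetical. It relies on two standard JDK calls, ModuleLayer.boot().findModule(...) to see whether jdk.incubator.vector was resolved, and RuntimeMXBean.getInputArguments() to inspect the JVM arguments. Whether --enable-preview shows up in that list may depend on how the JVM was launched, so treat that second check as a hint rather than proof.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class JvmFlagCheck {

    /** True when the incubator module was added, e.g. via --add-modules=jdk.incubator.vector. */
    static boolean vectorModuleResolved() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    /** Exact-match lookup of a flag in a list of JVM arguments. */
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.stream().anyMatch(arg -> arg.equals(flag));
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("jdk.incubator.vector resolved: " + vectorModuleResolved());
        System.out.println("--enable-preview passed: " + hasFlag(jvmArgs, "--enable-preview"));
    }
}
```

Running it once with and once without the flags should flip the first line from false to true.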

Spring Boot Maven Plugin (Optional)

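If you use the mvn spring-boot:run command to run your application, you can specify JVM arguments as follows (this is the same spring-boot-maven-plugin entry that appears in the full build configuration above):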

<!-- https://mvnrepository.com/artifact/org.springframework.boot/spring-boot-maven-plugin -->
<plugin>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-maven-plugin</artifactId>
    <configuration>
        <jvmArguments>--enable-preview --add-modules=jdk.incubator.vector</jvmArguments>
    </configuration>
</plugin>

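Adding Jlama Java API Dependencies

The ${jlama.version} property used below is assumed to be defined in the <properties> section of your pom.xml with the Jlama release you want to use.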


<dependencies>
    <!-- https://mvnrepository.com/artifact/com.github.tjake/jlama-core -->
    <dependency>
        <groupId>com.github.tjake</groupId>
        <artifactId>jlama-core</artifactId>
        <version>${jlama.version}</version>
    </dependency>
</dependencies>

Native SIMD (Optional)

By default, Jlama uses the Vector API as the backend for SIMD operations. If you only add the jlama-core dependency, you'll see logs similar to the following when you run inference:

WARN 53015 --- [  restartedMain] c.g.t.j.t.o.TensorOperationsProvider     : Native operations not available. Consider adding 'com.github.tjake:jlama-native' to the classpath
INFO 53015 --- [  restartedMain] c.g.t.j.t.o.TensorOperationsProvider     : Using Panama Vector Operations (OffHeap)

As suggested by the logs, we can also use the native backend, which, according to the documentation, contains:

Platform-specific native libraries (C/C++) providing optimized SIMD operations for x86_64 and ARM64

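We'll have to specify a classifier to pull the correct native library for our platform. The OS Maven Plugin detects the operating system and CPU architecture at build time and exposes them as the os.detected.name and os.detected.arch properties (e.g. linux-x86_64, or osx-aarch_64 on Apple Silicon, since the plugin reports macOS as osx).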


<build>
    <extensions>
        <!-- https://github.com/trustin/os-maven-plugin -->
        <extension>
            <groupId>kr.motd.maven</groupId>
            <artifactId>os-maven-plugin</artifactId>
            <version>1.7.1</version>
        </extension>
    </extensions>
</build>

And the actual dependency itself:


<dependencies>
    <!-- https://mvnrepository.com/artifact/com.github.tjake/jlama-native -->
    <dependency>
        <groupId>com.github.tjake</groupId>
        <artifactId>jlama-native</artifactId>
        <classifier>${os.detected.name}-${os.detected.arch}</classifier>
        <version>${jlama.version}</version>
    </dependency>
</dependencies>
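
If the build fails to resolve jlama-native, it helps to know which classifier your machine produces. The following is a simplified approximation of the normalization os-maven-plugin applies to the os.name and os.arch system properties. It is a sketch covering only the common cases (the plugin handles many more), and the class and method names are hypothetical.

```java
import java.util.Locale;

final class NativeClassifierSketch {

    /** Lowercase and strip everything except letters and digits, e.g. "Mac OS X" -> "macosx". */
    static String normalize(String value) {
        return value == null ? "" : value.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "");
    }

    /** Approximates os.detected.name for the three most common desktop/server platforms. */
    static String osName(String rawOsName) {
        String os = normalize(rawOsName);
        if (os.startsWith("linux")) return "linux";
        if (os.startsWith("macosx") || os.startsWith("osx")) return "osx";
        if (os.startsWith("windows")) return "windows";
        return "unknown";
    }

    /** Approximates os.detected.arch for 64-bit x86 and ARM. */
    static String osArch(String rawArch) {
        String arch = normalize(rawArch);
        if (arch.matches("x8664|amd64|ia32e|em64t|x64")) return "x86_64";
        if (arch.equals("aarch64")) return "aarch_64";
        return "unknown";
    }

    static String classifier(String rawOsName, String rawArch) {
        return osName(rawOsName) + "-" + osArch(rawArch);
    }

    public static void main(String[] args) {
        // Prints the classifier for the machine running this code.
        System.out.println(classifier(System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

For the authoritative values, check the os.detected.* lines that os-maven-plugin logs at the start of a Maven build.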

With native SIMD enabled, you'll see logs similar to the following when you run inference:

INFO com.github.tjake.jlama.tensor.operations.TensorOperationsProvider -- Using Native SIMD Operations (OffHeap)
INFO com.github.tjake.jlama.model.AbstractModel -- Model type = Q4, Working memory type = F32, Quantized memory type = I8

Code Example

import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.model.functions.Generator;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.prompt.PromptContext;
import com.github.tjake.jlama.util.Downloader;

import java.io.File;
import java.util.UUID;


/// # Core Java JLama Example
///
/// Minimal end-to-end demonstration of JLama:
/// - Model download from Hugging Face
/// - SafeTensors & quantized model load
/// - One-shot text generation in pure Java
///
public class CoreJavaJLamaExample {

    /// ## Local Model Store
    ///
    /// Path on disk where the model is cached after the first download.
    /// Prevents repeated network pulls and speeds up startup.
    private final File localModelPath;

    public CoreJavaJLamaExample() {
        this("tjake/Llama-3.2-1B-Instruct-JQ4", System.getProperty("user.home") + "/.models");
    }


    public CoreJavaJLamaExample(final String model, final String workingDirectory) {
        this.localModelPath = downloadModel(model, workingDirectory);
    }

    private static File downloadModel(final String model, final String workingDirectory) {
        try {
            return new Downloader(workingDirectory, model).huggingFaceModel();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public Generator.Response chat(String query) {
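        // DType.F32: workingMemoryType => F32 represents a 32-bit floating point type.
        // DType.I8: workingQuantizationType => I8 represents a signed byte (8-bit) type.
        // Note: the model is loaded and closed on every call. That keeps the example short,
        // but a long-lived application would typically load the model once and reuse it.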
        try (final AbstractModel model =
                     ModelSupport.loadModel(localModelPath, DType.F32, DType.I8)) {
            final var promptContext = PromptContext.of(query);
            return model.generateBuilder()
                    .session(UUID.randomUUID())              // KV-cache session key
                    .promptContext(promptContext)
                    .ntokens(256)                            // max response length
                    .temperature(0.0f)                       // 0.0 = deterministic
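                    .onTokenWithTimings((token, timing) -> {
                        // Streaming callback hook (optional), invoked as tokens are generated.
                        // Assumption: token is the decoded text (String) and timing is a
                        // per-token timing value (Float); check the Generator API to confirm.
                    })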
                    .generate();
        }
    }
}
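
The temperature(0.0f) setting above deserves a short illustration. The sketch below shows the general idea behind temperature-based sampling: at temperature 0 the highest-scoring token is always picked, so the output is repeatable, while higher temperatures sample from a softmax over the scores. This is a generic textbook version with hypothetical names, not Jlama's actual sampler.

```java
import java.util.Random;

final class TemperatureSamplingSketch {

    /** Index of the largest logit (greedy choice). */
    static int argmax(float[] logits) {
        int best = 0;
        for (int i = 1; i < logits.length; i++) {
            if (logits[i] > logits[best]) best = i;
        }
        return best;
    }

    /** Picks a token index: greedy when temperature <= 0, otherwise softmax sampling. */
    static int sample(float[] logits, float temperature, Random rng) {
        if (temperature <= 0f) {
            return argmax(logits);                       // deterministic path
        }
        float max = logits[argmax(logits)];              // subtract max for numerical stability
        double[] weights = new double[logits.length];
        double total = 0.0;
        for (int i = 0; i < logits.length; i++) {
            weights[i] = Math.exp((logits[i] - max) / temperature);
            total += weights[i];
        }
        double threshold = rng.nextDouble() * total;     // point on the cumulative distribution
        double cumulative = 0.0;
        for (int i = 0; i < logits.length; i++) {
            cumulative += weights[i];
            if (threshold < cumulative) return i;
        }
        return logits.length - 1;                        // guard against rounding at the edge
    }

    public static void main(String[] args) {
        float[] logits = {0.1f, 2.5f, -1.0f};
        System.out.println(sample(logits, 0.0f, new Random()));  // prints 1
    }
}
```

Dividing by a small temperature sharpens the distribution toward the top token, while a larger one flattens it and makes less likely tokens more probable.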

References

Jlama GitHub Project
DeepWiki Documentation for Jlama | Jlama Getting Started
os-maven-plugin (OS-specific build properties)
Vector API Support
JVM Preview Features

Running LLMs in Pure Java with Jlama: Vector API & Native SIMD Setup

Run LLMs natively in Java with Jlama using the Vector API and optional SIMD acceleration - no Python, no external model servers.