MARCH 31, 2026

machine-learning, computer-vision, unsloth, colab, lfm2

I Fine-Tuned a Vision Model for Crime Detection on a Free GPU

From a rushed hackathon project to a fine-tuned surveillance model trained on a free T4. This is the story of what broke, what I tried, and what finally worked.

What started as a rushed hackathon project turned into a months-long obsession: building a local, fast, open-source AI that could watch a CCTV feed and tell you if something bad was happening. This is the story of how I got there, the wrong turns, the walls I hit, and the scrappy hack that finally worked.

CamX: The Hackathon

It started at a hackathon. My team built CamX, an AI-powered security surveillance system. The pitch was genuinely solid: real-time threat detection with a 0-100% threat level, face recognition against a pre-loaded database, smart alerts with SMS notifications, video analysis that could do frame-by-frame breakdowns. The kind of thing that sounds impressive in a three-minute demo.

For the smart parts, we used a tiny local detection model and face-api.js for face recognition. But for the contextual intelligence, the "what is actually happening here" layer, we plugged in Google Gemini. Every camera frame, every alert, every analysis routed through Gemini's free tier API.

You can probably guess where this is going.

Free tier API limits are not designed for continuous surveillance. The quota was gone within minutes of running. We also had a UI that my teammate, bless him, had a clanker build: the kind of interface where you can tell a language model was given a vague prompt and nobody reviewed the output. Functional, barely. It looked like a React component that had never seen a designer in its life.

The hackathon was a good exercise. But CamX was a demo, not a product.

Going Local with CamX2

Months later I started CamX2 with a clear constraint: no cloud AI in the core detection loop. Everything had to run locally, permanently, without hitting someone's API.

For the real-time object detection layer on camera feeds, I swapped in YOLOv11x. Fast, local, no rate limits, does what it says. That solved one problem.

The gap that remained was contextual understanding. YOLO can tell you "person detected." It cannot tell you what that person is doing. Is that a bag being set down, or being stolen? Is that a confrontation, or just two people standing close? For that you need something that can reason about a scene, not just classify objects in it.

I needed a vision-language model that could understand surveillance imagery specifically, run fast enough to be useful, work on my Snapdragon X Plus (Lenovo IdeaPad Slim 5) without an NVIDIA GPU or a cloud subscription, and output structured data I could parse in code. I couldn't find anything that fit all of those criteria. So I decided to make one.

Why LFM2.5-VL-1.6B

LiquidAI's LFM2.5-VL-1.6B stood out because it is genuinely small and built for edge hardware. LiquidAI's architecture is designed for efficient on-device inference, and the 1.6B parameter count meant it had a real shot at running on a Snapdragon CPU. It handles images natively at up to 512x512 and has a tiling strategy for larger inputs without needing retraining.

The target output I wanted was dead simple. A JSON blob the application could parse:

{
  "isHarm": true,
  "descriptionIfHarm": "The image depicts a physical altercation between two individuals."
}

Or, for clean footage, just { "isHarm": false }. No essays, no explanations, just a flag and a reason.
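Parsing that blob defensively matters, because even a fine-tuned model can occasionally emit malformed JSON. A minimal sketch of what the application-side parsing could look like; the helper name and the fail-safe policy (treat unparseable output as suspicious rather than safe) are my assumptions, not CamX2's actual code:

```python
import json

def parse_detection(raw: str) -> dict:
    """Parse the model's JSON reply, failing safe on malformed output."""
    try:
        result = json.loads(raw.strip())
    except json.JSONDecodeError:
        # Unparseable output gets flagged for review rather than ignored.
        return {"isHarm": True, "descriptionIfHarm": "unparseable model output"}
    return {
        "isHarm": bool(result.get("isHarm", False)),
        "descriptionIfHarm": result.get("descriptionIfHarm", ""),
    }
```

So `parse_detection('{"isHarm": false}')` yields a harmless verdict, while anything the model garbles still surfaces as an alert.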

The Dataset

The UCF Crime dataset is the standard benchmark for this problem: real surveillance footage, 13 anomaly categories plus normal video, sourced from actual CCTV cameras in real environments. The full dataset has 1.2 million images across 1,900+ videos.

The HuggingFace version I found, tanzzpatil/ucf-crime-small, is a reduced version at around 600k images. Even that was more than a free Colab T4 could handle for a complete training run without it taking weeks.

So I carved out my own subset: 26k images for training, made up of 1,000 images per crime class (Abuse, Arrest, Arson, Assault, Burglary, Explosion, Fighting, Robbery, Shooting, Shoplifting, Stealing, Vandalism, and Road Accident) plus an equal 13,000 Normal samples, for 14 categories total. On top of that, a 5.6k image test set held out completely from training. Balanced, real surveillance footage, small enough to actually train on in a reasonable amount of time.
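The carving itself is just capped per-class sampling. A rough sketch of the idea, assuming the samples arrive as (item, label) pairs; the real pipeline worked against the HuggingFace dataset rather than an in-memory list:

```python
from collections import defaultdict
import random

def balanced_subset(samples, per_class, seed=42):
    """Draw up to `per_class` items for each label from (item, label) pairs."""
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label, items in sorted(by_class.items()):
        rng.shuffle(items)
        subset.extend((item, label) for item in items[:per_class])
    return subset
```

With 13 crime classes at `per_class=1000` plus an equal pool of Normal frames, this lands at the 26k figure above.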

Trying the Official Way

LiquidAI has a fine-tuning cookbook for their vision models, using a car identification example as the template. The methodology is genuinely good: establish a baseline, add structured output constraints with Outlines, then fine-tune with LoRA only if needed. One of the authors, Pau Labarta Bajo, pointed me to it after seeing my early results.

The problem was that the pipeline is built around Modal for compute. It is not designed for Colab. Every attempt to adapt it ran into pydantic validation errors when I changed the dataset config or output schema, wrong key names between YAML files, and secrets it expected to find that weren't there. I also tried running it in WSL Debian on my Snapdragon machine and hit ARM compatibility issues with half the pre-built wheels. After a while I gave up on the official path entirely. Honestly, it could just be my incompetence.

Finding the Actual Path

The breakthrough was Unsloth's LFM2.5-VL guide. They had a free Colab notebook specifically for LFM2.5-VL-1.6B fine-tuning, and crucially, Unsloth's notebook was made to run fine on free Colab.

I used that notebook as the foundation and rewrote the dataset loading, the prompt format, the output schema, and the training config for my specific use case. Freeze the vision layers (the model already knows what things look like, no need to retrain SigLIP), fine-tune the language layers (teach it surveillance context and JSON output), LoRA rank 16.

from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    model_name="LiquidAI/LFM2.5-VL-1.6B",
    max_seq_length=2048,
    load_in_4bit=False,
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
)
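For training, every image/label pair has to become a chat-style record. Roughly what that mapping could look like, assuming the message layout Unsloth's vision notebooks use; the prompt wording and field names here are illustrative, not the exact ones from my notebook:

```python
import json

PROMPT = (
    "Analyze this surveillance frame. Reply with JSON only: "
    '{"isHarm": bool, "descriptionIfHarm": str}.'
)  # Hypothetical wording; the real prompt is not shown in the post.

def to_conversation(image, label: str, description: str) -> dict:
    """Map one dataset sample to a chat-style training record."""
    target = {"isHarm": label != "Normal"}
    if target["isHarm"]:
        target["descriptionIfHarm"] = description
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": PROMPT},
            ]},
            {"role": "assistant", "content": [
                # The target completion is exactly the JSON blob, nothing else.
                {"type": "text", "text": json.dumps(target)},
            ]},
        ]
    }
```

The key design choice is that the assistant turn contains only the JSON string, so the model is never rewarded for adding essays around it.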

Training Kickoff

The Colab free tier only gives ~4.5 hours of T4 GPU time per 24 hours, but training needed ~10 hours. To handle this, I added checkpoints every 100 iterations, mounted Google Drive, and saved checkpoints there. I also built a resume mechanism that checked Drive for the latest checkpoint and continued training from there without restarting. I spent the whole night monitoring it, copying checkpoints and the notebook across multiple Google accounts to rotate T4 quotas. With these tricks, I finally completed the full training.
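The resume mechanism boils down to scanning the Drive folder for the highest-numbered checkpoint directory. A sketch, assuming HuggingFace-style `checkpoint-N` folder names; the result would be handed to `trainer.train(resume_from_checkpoint=...)`:

```python
import os
import re

def latest_checkpoint(drive_dir: str):
    """Return the path of the highest-numbered checkpoint-N dir, or None."""
    if not os.path.isdir(drive_dir):
        return None  # fresh run, nothing to resume from
    best, best_step = None, -1
    for name in os.listdir(drive_dir):
        m = re.fullmatch(r"checkpoint-(\d+)", name)
        if m and int(m.group(1)) > best_step:
            best_step = int(m.group(1))
            best = os.path.join(drive_dir, name)
    return best
```

On each new Colab session the script just mounts Drive, calls this, and either resumes or starts from step zero.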

LoRA rank 16 on language layers only, vision layers frozen: just 0.57% of the parameters are trainable, which turns out to be plenty for a structured output task like this without causing forgetting.

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 26,000 | Num Epochs = 1 | Total steps = 3,250
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 9,142,272 of 1,605,768,176 (0.57% trained)

Initially, I'd just screenshotted the training log to send to my friend, showing him "it's training bruh!", and he immediately shot back: "Dude, you need at least 10% of the parameters for decent results." That got me worried. So I dug into what LoRA actually does. Turns out my friend was thinking about full fine-tuning, which would mean updating 10% of 1.6 billion parameters, risking catastrophic forgetting while taking forever on a T4. LoRA is completely different. It's a tiny adapter layer that nudges the model's behavior in a specific direction without overwriting what it already knows. I asked around more, did some research, and realized that for a structured output task like this, 0.57% hitting the right layers is actually plenty. The Unsloth docs backed this up too. Training ended up running for about 10 hours.

Running Eval at Scale

Training was the straightforward part. Evaluating the model properly on 5,600 images while a free Colab session could die at any point and two large models needed VRAM at different stages was where things got genuinely hard.

The first thing I ran into was a RAM leak. Storing every prediction in a Python list means system RAM grows linearly with the dataset. Around 10,000 samples in, Colab OOM-crashed and I lost everything. The fix was switching to streaming IO: write each result to disk immediately using csv.DictWriter with f.flush() after every row. RAM stays flat regardless of how many images you process.
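In sketch form, with hypothetical column names, the streaming write looks like this:

```python
import csv

def stream_results(rows, out_path):
    """Append eval results to a CSV one row at a time, flushing each row."""
    fieldnames = ["sample_id", "prediction", "ground_truth"]
    with open(out_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if f.tell() == 0:
            writer.writeheader()  # only on a brand-new file
        for row in rows:
            writer.writerow(row)
            f.flush()  # survive an abrupt Colab disconnect with <=1 lost row
```

RAM usage is constant whether you process 100 images or 100,000, and a crash loses at most the row in flight.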

VRAM was a different problem. Even with no quantization, running inference tens of thousands of times causes slow fragmentation: PyTorch holds onto temporary tensors from model.generate() and does not always release them. I added a cleanup cycle every 50 iterations: del the input and output tensors explicitly, then gc.collect() and torch.cuda.empty_cache(). That kept GPU memory stable over long runs.
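A minimal sketch of that cleanup cycle, written to degrade gracefully when no GPU (or no torch install) is present; the every-50-iterations cadence is the one from my eval loop, and in the real loop the input/output tensors were del-ed before calling this:

```python
import gc

try:
    import torch
    _CUDA = torch.cuda.is_available()
except ImportError:  # running the sketch off-GPU
    _CUDA = False

def maybe_cleanup(step: int, every: int = 50) -> bool:
    """Run the fragmentation cleanup cycle every `every` iterations."""
    if step == 0 or step % every != 0:
        return False
    gc.collect()                    # drop Python-side references
    if _CUDA:
        torch.cuda.empty_cache()    # return cached blocks to the allocator
    return True
```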

The bigger issue was that I needed two models for proper evaluation: the fine-tuned LFM-UCF to generate predictions, and a Qwen-2.5-3B acting as an automated judge to grade whether those predictions were actually correct. There is no way to keep both in VRAM simultaneously on a T4. The solution was a stage-gate pipeline: run all vision inference to completion, fully unload the vision model from memory, then load the judge LLM and run the grading pass. Sequential, not simultaneous. It meant two Colab sessions instead of one, but it worked.
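The stage-gate pattern can be captured as a tiny helper that guarantees one model is fully released before the next stage starts. A sketch with hypothetical stand-in callables, not the actual Unsloth/transformers loading code:

```python
import gc

def stage_gate(load_model, run_stage, items):
    """Run one eval stage end-to-end, then free its model before returning."""
    model = load_model()
    try:
        return [run_stage(model, x) for x in items]
    finally:
        del model
        gc.collect()  # on GPU, also torch.cuda.empty_cache() here
```

Stage one loads the fine-tuned vision model and writes predictions; only after it returns does stage two load the judge LLM over the same items.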

Finally, because Colab runtimes are ephemeral and the eval loop took around 3 hours, I moved all output files to a Google Drive folder and added resume logic. On startup, the script checks the existing CSV, identifies the last processed row, and picks up exactly there. A disconnection at sample 4,000 means restarting from 4,001, not from zero. Since everything is just a CSV, every iteration effectively doubles as a checkpoint.
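The resume check itself is small. A sketch, assuming an in-order, 0-based sample_id column (the column name is illustrative):

```python
import csv
import os

def resume_index(csv_path: str) -> int:
    """Return the index of the next sample to process after a restart."""
    if not os.path.exists(csv_path):
        return 0  # nothing written yet, start from scratch
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0
    # Last completed sample id, plus one.
    return int(rows[-1]["sample_id"]) + 1
```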

The Results

Model Accuracy Comparison on UCF Crime (5,200 samples)

Base model (LFM-2.5-VL-1.6B): 35.2%
My fine-tuned model (LFM-2.5-UCF-VL-1.6B): 44.8%

A +9.6 percentage point improvement from a 26k training set, one epoch, on a free T4. The base model had never encountered a surveillance camera in its training data. The fine-tuned version has (excluding the test subset), and the numbers show it.

It is not a finished product but good enough for my project CamX2. One epoch on 26k images is a starting point, not a ceiling. The real limiting factors are data volume and training time, both of which are solvable. More training data from the full 600k dataset, more epochs, and a better eval methodology that does not rely on a second LLM as a judge would all push these numbers higher. That is the next thing.

But the model runs locally, it outputs structured JSON when the system prompt asks for it, and it meaningfully understands what is happening in a surveillance scene in a way the base model does not. That was the goal.


If you want to reproduce this or adapt it for your own dataset, the training notebook is publicly available on Colab. No paid GPU, no cloud account setup, no Modal. Just open it and run.

The weights are on HuggingFace: LoRA adapters and GGUF for llama.cpp/Ollama.