Introducing Photo-Mate-v2, based on FLUX.1-Kontext-dev, for advanced image manipulation tasks. It supports transforming scenes into top-down/bottom-up perspectives, CAM right/left views (and their reverses), and general kontext-specified object removal. Below is the list of demos and adapters. 🔥🤗
A week ago, I shared a post about a test implementation of DeepSeek-OCR compatibility with the latest transformers (https://tinyurl.com/ykc4mm66). Now I'm releasing the most compatible version of it, supporting the model on the latest transformers. 🤗🔥
✅ Supports the latest transformers, v4.57.1
✅ torch: 2.6.0+cu124 or the latest version (i.e., torch 2.9.0)
✅ CUDA version: 12.4
✅ Users can also opt out of specific attention implementations if desired.
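As a quick sanity check, something like this confirms your stack matches the versions listed above (illustrative only; it doesn't load the model):

```python
# Verify the stack matches the versions listed above.
import torch
import transformers

print("transformers:", transformers.__version__)   # expecting 4.57.1 or newer
print("torch:", torch.__version__)                  # 2.6.0+cu124 or newer (e.g. 2.9.0)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)          # expecting 12.4
```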
I've added an 11th tool to the Nymbo/Tools MCP server: Obsidian_Vault, for your Obsidian vault. I'd argue it's far more context-efficient than any other Obsidian MCP I've seen, and it doesn't require any plugins. There are also some big improvements to the Web_Search and Web_Fetch tools.
# Obsidian_Vault Tool
It's basically a read-only version of the File_System tool, but it works remarkably well for navigating Obsidian without unnecessary context. It supports recursive (full-text) search across the entire vault, and an offset parameter lets the agent "scroll" through a document without re-consuming tokens.
Run the server locally and set the OBSIDIAN_VAULT_ROOT environment variable to your vault's root path. If you don't use Obsidian, this is perfectly usable as simply a read-only filesystem.
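To illustrate the offset idea, here's a toy sketch (not the tool's actual code; the vault path and note name are made up):

```python
from pathlib import Path

VAULT_ROOT = Path("/path/to/your/vault")  # what OBSIDIAN_VAULT_ROOT would point at

def read_note(relative_path: str, offset: int = 0, limit: int = 2000) -> str:
    """Return one slice of a note so an agent can 'scroll' without re-reading earlier text."""
    text = (VAULT_ROOT / relative_path).read_text(encoding="utf-8")
    return text[offset : offset + limit]

# The first call reads the top of the note; the next call picks up where it left off.
chunk_1 = read_note("Projects/mcp-notes.md", offset=0)
chunk_2 = read_note("Projects/mcp-notes.md", offset=2000)
```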
# Web_Search Improvements
The Web_Search tool previously used only DuckDuckGo as a backend search engine, but now it also supports Bing, Brave, Yahoo, and Wikipedia. The default engine is auto, which returns results from all backends in the recommended order. Web_Search still doesn't require any kind of API key or auth.
There's also a new date filter to limit results to those created in the past day, week, month, or year. Oh, and uhh, SafeSearch is now off by default :)
# Web_Fetch Improvements
As context-efficient as the Markdown mode is for web browsing, sometimes it does lose important context in the conversion from HTML to Markdown. So I've added a new HTML mode to the Web_Fetch tool that basically executes a cURL request on the URL, returning the full HTML page if necessary.
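Conceptually, the HTML mode behaves roughly like this (a sketch assuming curl is on PATH, not the tool's actual implementation):

```python
import subprocess

def fetch_html(url: str, max_chars: int = 20000) -> str:
    """cURL-style fetch that returns the raw page instead of a Markdown conversion."""
    result = subprocess.run(["curl", "-sL", url], capture_output=True, text=True, timeout=30)
    return result.stdout[:max_chars]

print(fetch_html("https://example.com")[:500])
```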
# A Note on Claude Skills
I've been having fun with the new File_System and Shell_Command tools. Claude Skills don't currently work in the public HF space because of environment restrictions, but they work perfectly well when running locally.
It discusses the latest trends in OCR models, the multilingual support offered by modern OCR systems, their unique capabilities, OCR benchmark model comparisons, transformer-based implementations, and strategies for streamlining transformers compatibility.
I've implemented DeepSeek-OCR support for the latest transformers on the strangervisionhf page. The page includes the model weights and a corrected configuration, which fix the issues and allow transformers inference to run smoothly. 🤗🔥
✅ Supports the latest transformers
✅ You can also opt out of the attention implementation if needed
✅ Supports torch 2.6.0 or higher
✅ CUDA version: 12.4
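A minimal loading sketch, assuming the usual trust_remote_code pattern (the repo ID below is illustrative, not the actual checkpoint name):

```python
from transformers import AutoModel, AutoTokenizer

repo_id = "strangervisionhf/deepseek-ocr"  # illustrative repo ID; substitute the real one
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
    # Opting out of a specific attention implementation: fall back to "eager"
    # instead of e.g. flash-attention if that isn't installed.
    attn_implementation="eager",
).eval().to("cuda")
```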
If you're interested in experimenting with new things and streamlining compatibility, the strangervisionhf organization is open to you; feel free to join the community.
Introducing Gliese-OCR-7B-Post2.0-final, a document content-structure retrieval VLM designed for content extraction (OCR), summarization, and document visual question answering. This is the fourth and final model in the Camel Doc OCR VLM series, following Gliese-OCR-7B-Post1.0. The model delivers superior accuracy across a wide range of document types, including scanned PDFs, handwritten pages, structured forms, and analytical reports.🚀🤗
Two new tools have been added to the Nymbo/Tools MCP server: File_System and Shell_Exec. You can theoretically do basically anything with these two, and they should enable support for many Claude Skills.
GPT-5-Codex proves that for many cases, shell commands really are all you need, and Claude Skills seem to lean into this. The thing is, nothing about the design of Claude Skills actually restricts them to proprietary models!
# File_System
There's a new directory inside the repo called Filesystem; that's the agent's "root". It can perform the following actions: list, read, write, append, mkdir, move, copy, delete, info, help. It keeps all of this within the scope of one tool call by making the Action field required and all other fields optional. Using a filesystem shouldn't require 15 different tools.
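In spirit, the single-tool design looks something like this (a toy sketch, not the server's actual code):

```python
from pathlib import Path

ROOT = Path("Filesystem")  # the agent's sandbox root inside the repo

def file_system(action: str, path: str = "", content: str = "", destination: str = "") -> str | list:
    """One required Action, everything else optional: a single tool instead of 15."""
    target = ROOT / path
    if action == "list":
        return sorted(p.name for p in target.iterdir())
    if action == "read":
        return target.read_text(encoding="utf-8")
    if action == "write":
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        return f"wrote {path}"
    if action == "mkdir":
        target.mkdir(parents=True, exist_ok=True)
        return f"created {path}"
    if action == "move":
        target.rename(ROOT / destination)
        return f"moved {path} -> {destination}"
    return f"unsupported action: {action}"
```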
Files created in the public HF space live in the space's running container and get cleared when the space is restarted. When running the server locally, files are actually stored on disk.
# Shell_Exec
What good is a filesystem if you can't execute commands in it? This tool automatically detects whether the server is running on Windows or Linux and suggests the appropriate shell (PowerShell/Bash). Both of these new tools require the agent to use relative paths rather than absolute paths. I could be convinced to backpedal on this.
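The shell-picking logic is roughly this shape (illustrative sketch only):

```python
import platform
import subprocess

def shell_exec(command: str, timeout: int = 60) -> str:
    """Pick PowerShell on Windows and Bash elsewhere, then run the command."""
    if platform.system() == "Windows":
        args = ["powershell", "-NoProfile", "-Command", command]
    else:
        args = ["bash", "-lc", command]
    result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr

print(shell_exec("ls Filesystem"))  # relative path, per the tools' convention
```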
# Closing Thoughts
The File_System and Shell_Exec tools aren't super polished yet; I'll continue to improve the agent's instructions and the UX of using them. Most of my testing was done with gpt-oss-20b, and when it messes up, it gets the gist after one failed tool call. It should work perfectly fine for the GPU poor.
Now you can try all the latest state-of-the-art multimodal vision-language models from the Qwen3-VL series in demos on Hugging Face Spaces, including the 4B, 8B, and 30B Instruct variants and 4B-Thinking. I've also uploaded the weights for the Abliterated variants of these models, up to 30B parameters. Check out the Spaces and model links below! 🤗🔥
Note: This is version 1.0 of the Abliteration of the Qwen3-VL series of models. It may perform sub-optimally in some cases. If you encounter any issues, please open a discussion.
We’re proud to release AIRealNet — a binary image classifier built to detect whether an image is AI-generated or a real human photograph. Based on SwinV2 and fine-tuned on the AI-vs-Real dataset, this model is optimized for high-accuracy classification across diverse visual domains.
If you care about synthetic media detection or want to explore the frontier of AI vs human realism, we’d love your support. Please like the model and try it out. Every download helps us improve and expand future versions.
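Trying it out should be a one-liner with the transformers pipeline (the repo ID and label names below are placeholders, not the published ones):

```python
from transformers import pipeline

# Placeholder repo ID; point this at the actual AIRealNet checkpoint.
classifier = pipeline("image-classification", model="your-org/AIRealNet")
print(classifier("photo.jpg"))
# e.g. [{'label': 'real', 'score': 0.98}, {'label': 'ai-generated', 'score': 0.02}]
```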
Introducing Image-Guard-2.0, an experimental, lightweight vision-language encoder model with a size of 0.1B (<100M parameters), built on SigLIP2 (siglip2-base-patch16-224). Designed for multi-label image classification tasks, this model functions as an image safety system, serving as an image guard or moderator across a wide range of categories, from anime to realistic imagery.
It also performs strict moderation and filtering of artificially synthesized content, demonstrating strong detection and handling of explicit images. Image-Guard-2.0 delivers robust performance in streamlined scenarios, ensuring reliable and effective classification across diverse visual inputs.
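Since it's a multi-label classifier, scoring each category independently with a sigmoid is the natural way to read its outputs. A sketch (the repo ID and 0.5 threshold are assumptions):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

repo_id = "your-org/Image-Guard-2.0"  # placeholder repo ID
processor = AutoImageProcessor.from_pretrained(repo_id)
model = AutoModelForImageClassification.from_pretrained(repo_id)

image = Image.open("input.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Multi-label: each category gets its own sigmoid score instead of a softmax.
scores = torch.sigmoid(logits)[0]
flagged = {model.config.id2label[i]: float(s) for i, s in enumerate(scores) if s > 0.5}
print(flagged)
```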
I've made some improvements to my custom Deep_Research tool in the Nymbo/Tools MCP server. I've added a second LLM process and it still takes less than 1 minute to complete!
The original version of my Deep_Research tool would basically dump up to 50 fetched webpages onto the Researcher model (Qwen3-235B), with only a little bit of context shown from each page.
# New "Filterer" Process
The new process includes another LLM call before the researcher process. The Filterer (also Qwen3-235B) gets the query summary and the original 50 pages with low context, and decides which pages are most relevant to the research topic. The Filterer then outputs the URLs to the relevant pages, which are then re-fetched (with more context) and sent to the Researcher.
# Researcher Context
The Researcher now gets only the relevant webpages, then begins writing the report. When testing with 50 initial results, the Researcher would often end up with 10-20 pages' worth of relevant context.
Thanks entirely to Cerebras inference, the whole thing still finishes in under a minute: about 35-45 seconds from the moment the tool is run.
It's also worth noting that both the Filterer and the Researcher are now given the current date/time before they see the content, reducing hallucinations caused by knowledge cutoffs.
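Put together, the flow is roughly the following (a sketch assuming an OpenAI-compatible Cerebras endpoint; the base URL, model slug, and helper functions are illustrative, not the tool's actual code):

```python
from datetime import datetime, timezone

import requests
from openai import OpenAI

client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_KEY")
MODEL = "qwen-3-235b-a22b-instruct-2507"  # illustrative slug for Qwen3-235B
NOW = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def fetch(url: str, max_chars: int) -> str:
    return requests.get(url, timeout=30).text[:max_chars]

def deep_research(topic: str, candidate_urls: list[str]) -> str:
    # Stage 1 (Filterer): the query plus ~50 low-context snippets in, relevant URLs out.
    snippets = "\n".join(f"{u}: {fetch(u, 300)}" for u in candidate_urls)
    keep = ask(f"Current date/time: {NOW}. Return only the URLs relevant to the topic, one per line.",
               f"Topic: {topic}\n\n{snippets}").split()

    # Stage 2 (Researcher): only the kept pages, re-fetched with more context.
    sources = "\n\n".join(fetch(u, 5000) for u in keep if u in candidate_urls)
    return ask(f"Current date/time: {NOW}. Write a research report grounded in the sources.",
               f"Topic: {topic}\n\nSources:\n{sources}")
```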
The demo of Qwen3-VL-30B-A3B-Instruct is live. This next-generation, powerful vision-language model in the Qwen series delivers comprehensive upgrades across the board, including superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. 🤗🔥
New Technique to Deeply Poison AI on Images and Prove Creative Provenance
I've developed a new method to protect creative work from unauthorized AI training. My Poisonous Shield for Images algorithm embeds a deep, removal-resistant poison into the mathematical structure of your images. It's designed to be toxic to machine learning models, achieving 20-348% disruption in AI training convergence in benchmark tests.
Unlike traditional watermarks, this protection survives compression and resizing and is not removed by standard tools. The technique also embeds cryptographic proof of provenance directly into the image, verifying ownership and detecting tampering.
You can see examples and learn more about how and WHY it works better than current methods:
If you are interested in using this technology to protect your work from AI training and unauthorized use, please reach out to me. It is currently in the prototype phase but fully functioning and effective. I'm still working on expanding it into a production-grade, usable app.
This is not intended as a pure self-promotion post. I genuinely want to help creators and to gauge interest from different communities. I've spent the past year and a half building this from scratch, with new math and code, to try to solve this massive problem.
Introducing the next-gen version of DeepCaption-VLA (v2.0) — an advanced, multimodal model based on Qwen2.5-VL, specialized for Image Captioning and Vision Language Attribution (VLA). This enhanced release focuses on generating precise, attribute-rich captions that capture visual properties, object attributes, and scene details across diverse image types and aspect ratios. Version 2.0 introduces significant improvements in multilingual inference, delivering higher captioning quality and attribution accuracy in languages including Chinese (Zh), Thai (Th), and more.
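Since it's built on Qwen2.5-VL, inference should follow the standard Qwen2.5-VL pattern in transformers. A sketch (the repo ID and prompt are placeholders):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

repo_id = "your-org/DeepCaption-VLA-2.0"  # placeholder repo ID; substitute the real one
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo_id)

image = Image.open("photo.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Write a precise, attribute-rich caption for this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```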
I’ve ported the BDH ( https://github.com/pathwaycom/bdh ) model to MLX for Apple Silicon. It’s a faithful conversion of the PyTorch version: same math, same architecture (byte-level vocab, shared weights across layers, ReLU sparsity, RoPE attention with Q=K), with MLX-friendly APIs and a detailed README explaining the few API-level differences and why results are equivalent.
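This isn't the BDH port itself, but as a toy MLX illustration of the tied Q=K RoPE attention idea:

```python
import mlx.core as mx
import mlx.nn as nn

class TiedQKAttention(nn.Module):
    """Toy illustration: RoPE attention where queries and keys share one projection (Q = K)."""

    def __init__(self, dims: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dims // num_heads
        self.qk_proj = nn.Linear(dims, dims, bias=False)  # single projection reused for Q and K
        self.v_proj = nn.Linear(dims, dims, bias=False)
        self.out_proj = nn.Linear(dims, dims, bias=False)
        self.rope = nn.RoPE(self.head_dim)

    def __call__(self, x: mx.array) -> mx.array:
        B, L, D = x.shape
        qk = self.qk_proj(x).reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        v = self.v_proj(x).reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        q = k = self.rope(qk)  # queries and keys come from the same projection
        scores = (q @ k.transpose(0, 1, 3, 2)) * self.head_dim ** -0.5
        weights = mx.softmax(scores, axis=-1)
        out = (weights @ v).transpose(0, 2, 1, 3).reshape(B, L, D)
        return self.out_proj(out)

# Example: a batch of 2 sequences, length 16, model width 64, 4 heads.
x = mx.random.normal((2, 16, 64))
print(TiedQKAttention(64, 4)(x).shape)  # (2, 16, 64)
```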
Code, docs, and training script are ready to use. You may need to adjust the training script a bit to fit your own custom dataset. Only tested on M4 so far, but it should work perfectly for any M1/M2/M3 users out there.
I'm currently training this MLX build on my Internal Knowledge Map (IKM) dataset (Severian/Internal-Knowledge-Map). Training's underway; expect a day or so before I publish weights. When it's done, I'll upload the checkpoint to Hugging Face for anyone to test.
I've built the new Image Studio with the Gemini image-generation models for multiple tasks: the imagen-4.0-fast-generate-001 model for Image Generation (Text-to-Image) and Multi-Image Editing (Image-to-Image), and Draw-to-Image powered by gemini-2.5-flash-image (aka Nano Banana).
To proceed, you need to add your Gemini API key. Your API key is stored only for the duration of your session and will be lost when you reload or exit the page. It will not be shared or exposed anywhere.
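For reference, a text-to-image call against imagen-4.0-fast-generate-001 looks roughly like this with the google-genai SDK (treat the exact config fields as assumptions and check the SDK docs):

```python
from google import genai
from google.genai import types

# The key is only needed for the session, mirroring how the Studio handles it.
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

resp = client.models.generate_images(
    model="imagen-4.0-fast-generate-001",
    prompt="a cozy reading nook with soft morning light",
    config=types.GenerateImagesConfig(number_of_images=1),
)
with open("nook.png", "wb") as f:
    f.write(resp.generated_images[0].image.image_bytes)
```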
Do you think domain-specific embedding fine-tuners are needed? I've been working with embeddings for marketing use cases and noticed something: most embeddings don't get marketing concepts very well. They're trained in general-purpose ways.
# The Issue I'm Seeing
When I search marketing content with general embeddings:
# My Question
Do you think domain-specific embeddings are needed for marketing? Some thoughts:
- Marketing has its own vocabulary and concept relationships
- General models trained on Wikipedia/web crawl miss these nuances
- But is fine-tuning worth the effort vs just using more retrieval tricks?
# Quick Example
I fine-tuned all-mpnet-base-v2 on ~1000 marketing concept pairs and saw 15-20% better retrieval accuracy. But I'm curious:
- Has anyone else tried this for marketing or other domains?
- When do you think domain-specific embeddings are actually necessary vs overkill?
- Are there better approaches I'm missing?
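For anyone curious, the quick-example setup above can be reproduced roughly like this with sentence-transformers (the pairs shown are made-up stand-ins for the ~1000 real ones):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Made-up marketing concept pairs; the real training set had ~1000 of these.
pairs = [
    ("top of funnel content", "awareness-stage marketing"),
    ("CAC", "customer acquisition cost"),
    ("churn reduction", "customer retention strategy"),
]

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
train_examples = [InputExample(texts=[a, b]) for a, b in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# In-batch negatives: every other pair in the batch acts as a negative for this one.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("all-mpnet-base-v2-marketing")
```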