open-acc (open/ acc)

Nymbo

posted an update 9 days ago

Post

1733

🚨 New tool for the Nymbo/Tools MCP server: The new Agent_Skills tool provides full support for Agent Skills (Claude Skills but open-source).

How it works: The tool exposes the standard discover/info/resources/validate actions. Skills live in /Skills under the same File_System root, and any bundled scripts run through Shell_Command, no new infrastructure required.

Agent_Skills(action="discover")  # List all available skills
Agent_Skills(action="info", skill_name="music-downloader")  # Full SKILL.md
Agent_Skills(action="resources", skill_name="music-downloader")  # Scripts, refs, assets

I've included a music-downloader skill as a working demo, it wraps yt-dlp for YouTube/SoundCloud audio extraction.

Caveat: On HF Spaces, Shell_Command works for most tasks, but some operations (like YouTube downloads) are restricted due to the container environment. For full functionality, run the server locally on your machine.

Try it out ~ https://www.nymbo.net/nymbot

flozi00

posted an update 19 days ago

Post

271

We have covered Tensor Parallelism for slicing matrices and Pipeline Parallelism for stacking layers. But what if your model isn't just deep or wide—it's a sprawling Mixture-of-Experts (MoE) architecture like Mixtral or DeepSeek, with trillions of parameters that are mostly idle per token?

Replicating those experts wastes VRAM. Slicing them with TP wastes bandwidth. The solution is Expert Parallelism (EP), which distributes the experts themselves across GPUs and routes tokens to wherever their "chosen" expert lives.

The hardware catch? It is not matrix splitting or pipeline bubbles—it's the "Router's Dilemma." You must shuffle massive volumes of tokens across the cluster using All-to-All communication, and any imbalance can leave expensive GPUs idle.

My latest guide dives into the mechanics of EP and why the interconnect becomes the ultimate bottleneck.

In this breakdown, we explore:

The Token Routing Lifecycle
A four-step hardware flow: Local routing to pick experts, Dispatch (All-to-All shuffle), Expert computation on the "home" GPU, and Combine (another All-to-All to return results).

The All-to-All Primitive
Unlike the ring-based syncs in TP, All-to-All creates a dense mesh of personalized data transfers. We compare it to All-Reduce and show why uneven token distribution (load imbalance) causes network congestion and compute skew.

Load Balancing: The Hardware Nightmare
If one expert gets 90% of the tokens, its GPU bottlenecks while others stall. We discuss mitigation strategies like token dropping and auxiliary losses to keep utilization high.

The article includes a raw PyTorch implementation of an EP layer using torch.distributed.all_to_all_single to reveal exactly how the data shuffles and where the stalls happen.

Read the full hardware-centric guide here:
https://flozi.net/en/guides/ai/scaling/expert_parallel

1 reply

·

flozi00

posted an update 26 days ago

Post

3103

We recently discussed how Tensor Parallelism slices matrices to reduce latency within a single node. But what happens when you need to scale beyond that, where the bandwidth drops?

That is where Pipeline Parallelism (PP) takes over.

Instead of slicing the operation, PP slices the model depth. It turns your GPU cluster into an assembly line: GPU 0 handles layers 1-12, GPU 1 handles 13-24, and so on.

The hardware challenge here isn't the interconnect speed—it is the "Pipeline Bubble." In a naive setup, expensive H100s sit idle for most of the cycle waiting for data to flow through the chain.

My latest guide breaks down the scheduling strategies used to minimize this idle silicon time.

In this deep dive, we cover:

The Hardware Mechanics: Vertical Slicing
Unlike TP which requires "chatty" All-Reduce operations, PP relies on lightweight Point-to-Point (Send/Recv) communication. This makes it the only viable strategy for crossing node boundaries over Ethernet or InfiniBand.

Fighting the Bubble: 1F1B vs. GPipe
We analyze the scheduling algorithms that keep the GPUs fed:

GPipe: The "flush and fill" approach. Simple, but memory-intensive.
1F1B (One-Forward-One-Backward): The industry standard. By interleaving forward and backward passes, we aggressively free up memory and reduce the bubble size.
The Math of Efficiency
The "Bubble" is a mathematical inevitability. We look at the efficiency formula
M+N−1
M

to understand why you need massive global batch sizes to make PP worth the effort.

The article includes a conceptual PyTorch implementation of the 1F1B state machine to illustrate exactly how the data is handed off between stages.

Read the full breakdown here:
https://flozi.net/en/guides/ai/scaling/pipeline_parallel

Nymbo

posted an update about 1 month ago

Post

5015

🚀 I've just shipped a major update to the Nymbo/Tools MCP server: the Agent_Terminal, a single "master tool" that cuts token usage by over 90%!

Anthropic found 98.7% context savings using code execution with MCP, Cloudflare published similar findings. This is my open-source implementation of the same idea.

# The Problem

Traditional MCP exposes every tool definition directly to the model. With 12 tools, that's thousands of tokens consumed *before the conversation even starts*. Each tool call also passes intermediate results through the context window — a 10,000-row spreadsheet? That's all going into context just to sum a column.

# The Solution: One Tool to Rule Them All

Agent_Terminal wraps all 12 tools (Web_Search, Web_Fetch, File_System, Generate_Image, Generate_Speech, Generate_Video, Deep_Research, Memory_Manager, Obsidian_Vault, Shell_Command, Code_Interpreter) into a single Python code execution gateway.

Instead of the model making individual tool calls, it writes Python code that orchestrates the tools directly:

# Search for Bitcoin price
result = Web_Search("current price of bitcoin", max_results=3)
print(result)

Don't know what tools are available? The agent can discover them at runtime:

print(search_tools('image'))  # Find tools by keyword
print(usage('Generate_Image'))  # Get full docs for a specific tool

The individual direct tool calls are all still there, but they can be disabled if using the Agent_Terminal. Try it now - https://www.nymbo.net/nymbot

1 reply

·

flozi00

posted an update about 1 month ago

Post

3576

When models get too large for a single GPU, simply stacking layers vertically (Pipeline Parallelism) isn't always the answer. Sometimes, you need to slice the matrices themselves.

My latest guide breaks down the hardware mechanics of Tensor Parallelism (TP). We look at how to shard individual operations across devices to make a cluster function as one massive accelerator.

This isn't high-level theory—it is a look at the bare metal implementation.

Here is what is covered in the deep dive:

The Strategies: Column vs. Row Parallelism
We analyze how to split weight matrices (W) and inputs (X).

Column-Linear: Splits weights by columns. Requires an All-Gather to reconstruct the output.
Row-Linear: Splits weights by rows. Requires an All-Reduce to sum partial results.
The "Megatron-LM" Optimization
Efficiency comes from minimizing communication. By sandwiching the non-linearity (GeLU) between a Column-Parallel layer and a Row-Parallel layer, we can skip synchronization entirely during the activation phase. This cuts communication events by 50% per block.

The Hardware Reality: The Bandwidth Wall
In TP, the dist.all_reduce operation sits on the critical path. The CUDA cores effectively stall while waiting for the ring-reduce to finish.

Intra-Node: Works well because NVLink provides enough bandwidth to hide this latency.
Inter-Node: Fails at scale. Standard networking (Ethernet/InfiniBand) is too slow for the high-frequency syncs required by TP.
The article includes a raw PyTorch implementation using torch.distributed primitives to show exactly where the data moves and where the bottlenecks sit.

Read the full hardware-centric guide here:
https://flozi.net/en/guides/ai/scaling/tensor_parallel

Steveeeeeeen

authored 2 papers about 1 month ago

Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement

Paper • 2510.23141 • Published Oct 27 • 4

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Paper • 2510.06961 • Published Oct 8 • 10

flozi00

posted an update about 1 month ago

Post

2766

Running large language models efficiently is more than just raw GPU power. The latest guide breaks down the essential math to determine if your LLM workload is compute-bound or memory-bound.

We apply these principles to a real-world example: Qwen's 32B parameter model on the new NVIDIA RTX PRO 6000 Blackwell Edition.

In this guide, you will learn how to:

Calculate your GPU's operational intensity (Ops:Byte Ratio)
Determine your model's arithmetic intensity
Identify whether your workload is memory-bound or compute-bound

Read the full guide here: https://flozi.net/en/guides/ai/llm-inference-math

flozi00

posted an update about 2 months ago

Post

301

Struggling with NVIDIA drivers on Ubuntu 24.04?
Can't use your GPUs with CUDA installed, or only half of them work?
Black screen after startup or nvidia-smi fails?

The nokaslr boot option might be the cause—and the solution.
Find out why disabling KASLR can fix these GPU issues until a permanent driver update is available.

https://flozi.net/en/guides/linux/solving-nvidia-driver-issues-on-ubuntu-24-04-with-nokaslr

flozi00

posted an update about 2 months ago

Post

1952

I just got asked about the differences between Blackwell systems and Grace Blackwell systems. What's the difference and how much of a performance gap is there between them?

https://flozi.net/en/hardware/nvidia/benchmarks/b200-vs-gb200-efficiency-comparison

Here's a summary of the key points from the article:

GB200 (Grace Blackwell) is a Superchip: It integrates a Grace CPU and two Blackwell GPUs into a single package.
B200 is a GPU-only module: It's designed to be paired with x86 or ARM CPUs in more traditional server setups.

Performance and Efficiency:

Based on MLPerf Training v5.0 benchmarks, the article concludes:

GB200 systems are approximately 42% more efficient than B200 systems on average. This is especially true in large-scale deployments (100+ GPUs), where the GB200's integrated design and high-speed NVLink interconnect provide a significant advantage.

In smaller, single-node systems (e.g., 8 GPUs), the performance difference is much smaller, around 10-15%.

Use Cases:

Choose GB200 for large-scale AI clusters, training massive models, and when maximum efficiency is the top priority.

Choose B200 for smaller deployments, when you need the flexibility to choose your own CPU, or for mixed AI and HPC workloads.

flozi00

posted an update about 2 months ago

Post

3155

Some weeks ago, i've just decide its time to leave LinkedIn for me.
It got silent around my open source activities the last year, so i thought something has to change.

That's why my focus will move to share experiences and insights about hardware, drivers, kernels and linux. I won't post about how to use models, built agents or do prompting. I want to share about some deeper layers the actual hypes are built on.

I will start posting summarizations of my articles here on the hub.

English version:
https://flozi.net/en

German translated version:
https://flozi.net/de

Feel free to reach me if you want to read something specific.

2 replies

·

Nymbo

posted an update about 2 months ago

Post

1103

I've added an 11th tool to the Nymbo/Tools MCP server, it's for your Obsidian_Vault. I'd argue it's far more context-efficient than any other Obsidian MCP I've seen, and doesn't require any plugins. Also some big improvements to the Web_Search and Web_Fetch tools.

# Obsidian_Vault Tool

It's basically a read-only version of the File_System tool, but it works so well for navigating Obsidian without unnecessary context. It supports recursive (full-text) search across the entire vault, and supports offset so the agent can "scroll" through a document without re-consuming tokens.

Run the server locally and set the OBSIDIAN_VAULT_ROOT environment variable to your vault's root path. If you don't use Obsidian, this is perfectly usable as simply a read-only filesystem.

# Web_Search Improvements

The Web_Search tool previously just used DuckDuckGo as a backend search engine, but now it also supports Bing, Brave, Yahoo, and Wikipedia. Default engine is auto which provides results from all backends in recommended order. Still doesn't require any kind of API or auth for Web_Search.

There's also a new date filter to limit results to those created in the past day, week, month, or year. Oh, and uhh, SafeSearch is now off by default :)

# Web_Fetch Improvements

As context-efficient as the Markdown mode is for web browsing, sometimes it does lose important context in the conversion from HTML to Markdown. So I've added a new HTML mode to the Web_Fetch tool that basically executes a cURL request on the URL, returning the full HTML page if necessary.

# A Note on Claude Skills

I've been having fun with the new File_System and Shell_Command tools. Using Claude Skills doesn't currently work in the public HF space because of environment restrictions, but using Skills works perfectly well running locally.

Happy building ~

SelmaNajih001

posted an update 2 months ago

Post

2851

How Financial News Can Be Used to Train Good Financial Models 📰
Numbers tell you what happened, but news tells you why.
I’ve written an article explaining how news can be used to train AI models for sentiment analysis and better forecasting. Hope you find it interesting!

Read it here: https://huggingface.co/blog/SelmaNajih001/llms-applied-to-finance

I would love to read your opinions! I’m open to suggestions on how to improve the methodology and the training

1 reply

·

SelmaNajih001

posted an update 2 months ago

Post

3031

Which is the best model to use as a signal for investment?
Here who is gaining the most:
SelmaNajih001/InvestmentStrategyBasedOnSentiment

The Space uses titles from this dataset:
📊 SelmaNajih001/Cnbc_MultiCompany

Given a news title, it calculates a sentiment score : if the score crosses a certain threshold, the strategy decides to buy or sell.
Each trade lasts one day, and the strategy then computes the daily return.
For Tesla the best model seems to be the regression 👀
Just a quick note: the model uses the closing price as the buy price, meaning it already reflects the impact of the news.

ariG23498

authored a paper 2 months ago

FineVision: Open Data Is All You Need

Paper • 2510.17269 • Published Oct 20 • 72

Nymbo

posted an update 2 months ago

Post

1999

Two new tools added to the Nymbo/Tools MCP server, File_System and Shell_Exec. You can theoretically do basically anything with these two tools, and it should enable support for many Claude Skills.

GPT-5-Codex proves that for many cases, shell commands really are all you need, and Claude Skills seem to lean into this. The thing is, nothing about the design of Claude Skills actually restricts them to proprietary models!

# File_System

There's a new directory inside the repo called Filesystem, that's the agent's "root". It can perform the following actions : list, read, write, append, mkdir, move, copy, delete, info, help. It's able to keep this all within the scope of one tool call by making the Action field required and all other fields optional. Using a filesystem shouldn't require 15 different tools.

Files created in the public HF space live in the space's running container, and gets cleared when the space is restarted. When running the server locally, files are actually stored on disk.

# Shell_Exec

What good is a filesystem if you can't execute commands in that filesystem? This tool automatically detects if the server is running on Windows or Linux, and suggests using the appropriate shell (PowerShell/Bash). Both of these new tools require that the agent uses relative paths, rather than absolute paths. I could be convinced to back pedal on this.

# Closing Thoughts

The File_System and Shell_Exec tools aren't super polished yet, I'll continue to improve the agent's instructions and UX of using the new tools. Most of my testing was done with gpt-oss-20b and if it messes up, it gets the gist after one failed tool call. It should work perfectly fine for the GPU poor.

1 reply

·

SelmaNajih001

posted an update 2 months ago

Post

685

How Financial News Can Be Used to Train Good Financial Models 📰
Numbers tell you what happened, but news tells you why.
I’ve written an article explaining how news can be used to train AI models for sentiment analysis and better forecasting. Hope you find it interesting!

Read it here: https://huggingface.co/blog/SelmaNajih001/llms-applied-to-finance

I would love to read your opinions! I’m open to suggestions on how to improve the methodology and the training

1 reply

·

thomwolf

authored a paper 2 months ago

Robot Learning: A Tutorial

Paper • 2510.12403 • Published Oct 14 • 118

Nymbo

posted an update 2 months ago

Post

1860

I've made some improvements to my custom Deep_Research tool in the Nymbo/Tools MCP server. I've added a second LLM process and it still takes less than 1 minute to complete!

The original version of my Deep_Research tool would basically dump up to 50 fetched webpages onto the Researcher model (Qwen3-235B), with only a little bit of context shown from each page.

# New "Filterer" Process

The new process includes another LLM call before the researcher process. The Filterer (also Qwen3-235B) gets the query summary and the original 50 pages with low context, and decides which pages are most relevant to the research topic. The Filterer then outputs the URLs to the relevant pages, which are then re-fetched (with more context) and sent to the Researcher.

# Researcher Context

The Researcher now gets only the relevant webpages, then begins writing the report. When testing with 50 initial results, the researcher would often end up with 10-20 results of relevant context.

It still takes less than a minute to accomplish everything, thanks entirely to Cerebras inference. It now takes about 35-45 seconds to complete once the tool is run.

It's also worth noting that both the Filterer and Researcher now are provided the current time/date before they see the content, reducing hallucinations caused by knowledge cutoffs.

SelmaNajih001

posted an update 3 months ago

Post

388

Which is the best model to use as a signal for investment? 🤔
I’ve created a Space where you can compare three models:
-Two available on my profile
- ProsusAI/finbert
You can try it here:
👉 SelmaNajih001/InvestmentStrategyBasedOnSentiment
The Space uses titles from this dataset:
📊 SelmaNajih001/Cnbc_MultiCompany

Given a news title, it calculates a sentiment score : if the score crosses a certain threshold, the strategy decides to buy or sell.
Each trade lasts one day, and the strategy then computes the daily return.

Just a quick note: the model uses the closing price as the buy price, meaning it already reflects the impact of the news.
If I had chosen the opening price, the results would have been less biased but less realistic given the data available.

open/ acc

AI & ML interests

Recent Activity

Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

FineVision: Open Data Is All You Need

Robot Learning: A Tutorial

AI & ML interests

Recent Activity

Team members 213

open-acc's activity