Tremendous quality of life upgrade on the Hugging Face Hub - we now have auto-complete emojis 🤗 🥳
Get ready for lots more very serious analysis on a whole range of topics from yours truly now that we have unlocked this full range of expression 🤗
Are you sure the open-source LLM you just downloaded is safe?
A recent paper on "Privacy Backdoors" reports a new vulnerability: pre-trained models can be poisoned before they are ever fine-tuned. This is a serious challenge for everyone building on open-source AI.
Instead of just pointing out problems, we believe in finding better solutions. To understand this threat, the researchers needed to test their attack on realistic data structures. They needed a dataset that could effectively simulate a high-stakes privacy attack, and we're proud that our Ai4Privacy dataset was used to provide this crucial benchmark. The paper reports that for our complex dataset, the privacy leakage on a non-poisoned model was almost zero. After the backdoor attack, that number reportedly jumped to 87%.
The Ai4Privacy dataset provided a realistic benchmark for this research: composed of synthetic identities, it helped the authors demonstrate how a poisoned model can dramatically amplify privacy leakage.
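For intuition only, here is a minimal sketch of how this kind of leakage can be probed: prompt the model with the prefix of a synthetic record and check whether it regurgitates the secret. This is my own illustration, not the paper's protocol; the checkpoint path and the probe pairs are placeholders.

```python
# Minimal PII-leakage probe, NOT the paper's exact methodology.
# Assumes a causal LM fine-tuned on records like "Name: ... Email: ...".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/finetuned-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# (prefix, secret) pairs built from the synthetic fine-tuning records
probes = [
    ("Name: Jane Doe, Email:", " jane.doe@example.com"),
    ("Name: John Roe, Phone:", " +1 555 0100"),
]

leaked = 0
for prefix, secret in probes:
    inputs = tokenizer(prefix, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    completion = tokenizer.decode(output[0], skip_special_tokens=True)
    if secret.strip() in completion:
        leaked += 1

# Running the same probes against a clean and a poisoned checkpoint
# surfaces the kind of before/after gap described above.
print(f"Leakage rate: {leaked / len(probes):.0%}")
```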
This is why we champion open source: it enables the community to identify these issues and develop better, safer solutions together.
Kudos to the research team behind this study: Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, and Nicholas Carlini (Oregon State University, University of Maryland, Google DeepMind, and ELLIS Institute Tübingen & MPI for Intelligent Systems).
When anonymizing data for LLMs, is replacing a name with XXXXX enough?
A great post by Franklin Cardenoso Fernandez argues that we can do better. While simple masking hides data, it often destroys the context that models need to perform well.
A more robust method is contextual anonymization, where PII is replaced with meaningful labels like [NAME] or [ADDRESS]. This protects privacy while preserving the data's structural integrity.
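As a toy illustration of the difference (the entity spans are hand-written here for brevity; in practice they would come from a PII detection model, not hard-coded rules):

```python
text = "Contact Maria Lopez at 44 Elm Street for the refund."

# Hypothetical detector output: (span, entity type) pairs.
entities = [("Maria Lopez", "NAME"), ("44 Elm Street", "ADDRESS")]

# Simple masking: privacy is protected, but context is destroyed.
masked = text
for span, _ in entities:
    masked = masked.replace(span, "XXXXX")

# Contextual anonymization: privacy is protected and structure survives.
labeled = text
for span, label in entities:
    labeled = labeled.replace(span, f"[{label}]")

print(masked)   # Contact XXXXX at XXXXX for the refund.
print(labeled)  # Contact [NAME] at [ADDRESS] for the refund.
```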
We were pleased to see our Ai4Privacy pii-masking-200k dataset featured in the article as a prime example of this best practice. The dataset is designed to help developers implement this more robust form of anonymization by providing hundreds of thousands of clearly labeled examples.
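If you want to experiment with this pattern, a minimal sketch for pulling examples from the Hub could look like the following; the field names are assumptions on my part, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Load a small slice of the PII-masking dataset from the Hugging Face Hub.
ds = load_dataset("ai4privacy/pii-masking-200k", split="train[:100]")

# Field names below are assumptions; consult the dataset card for the real schema.
example = ds[0]
print(example.get("source_text"))  # original text containing synthetic PII
print(example.get("target_text"))  # same text with PII replaced by [LABEL] placeholders
```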
By enabling models to be trained on data that is both private and context-rich, we can build AI that is both smarter and safer. This is a core part of our mission.
What's your team's preferred method for data anonymization? Let's discuss best practices.
With the release of the EU data transparency template this week, we finally got to see one of the most meaningful artifacts to come out of the AI Act implementation so far (haven't you heard? AI's all about the data!)
The impact of the template will depend on how effectively it establishes a minimum meaningful transparency standard for companies that don't otherwise offer any transparency into their handling of e.g. personal data or (anti?-)competitive practices in commercial licensing - we'll see how those play out as new models are released after August 2nd.
In the meantime, I wanted to see how the template works for a fully open-source + commercially viable model, so I filled it out for SmolLM3 - which my colleagues at Hugging Face released earlier this month 🤗 ICYMI, it's fully open-source with 3B parameters and performance matching the best similar-size models (I've switched all my local apps from Qwen3 to it, you should too)
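If you want to try it locally, a minimal transformers sketch might look like this (the repo id is my assumption - check the SmolLM3 model card on the Hub):

```python
from transformers import pipeline

# Repo id is an assumption; verify it against the SmolLM3 model card.
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM3-3B")

out = generator(
    "Summarize the EU data transparency template in one sentence:",
    max_new_tokens=64,
)
print(out[0]["generated_text"])
```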
Verdict: congrats to the European Commission AI Office for making it so straightforward! Fully open and transparent models remain a cornerstone of informed regulation and governance, but the different organizational needs of their developers aren't always properly accounted for in new regulation. In this case, it took me all of two hours to fill out and publish the template (including reading the guidelines) - so kudos for making it feasible for smaller and distributed organizations. Definitely a step forward for transparency!
🛡️ At Ai4Privacy, our goal is to empower researchers to build a safer AI ecosystem. Today, we're highlighting crucial research that does just that by exposing a new vulnerability.
The paper "Forget to Flourish" details a new model poisoning technique. It's a reminder that as we fine-tune LLMs, our anonymization and privacy strategies must evolve to counter increasingly sophisticated threats.
We're proud that the Ai4Privacy dataset was instrumental in this study. It served two key purposes:
Provided a Realistic Testbed: It gave the researchers access to a diverse set of synthetic and realistic PII samples in a safe, controlled environment.
Enabled Impactful Benchmarking: It allowed them to measure the actual effectiveness of their data extraction attack, proving it could compromise specific, high-value information.
This work reinforces our belief that progress in AI security is a community effort. By providing robust tools for benchmarking, we can collectively identify weaknesses and build stronger, more resilient systems. A huge congratulations to the authors on this important contribution.
In data privacy, 92% accuracy is not an A-grade. Privacy AI needs to be better.
That's the stark takeaway from a recent benchmark by Diego Mouriño (Making Science), who put today's top PII detection methods to the test on call center transcripts using the Ai4Privacy dataset.
They pitted cutting-edge LLMs (like GPT-4 & Gemini) against traditional systems (like Cloud DLPs). The results show that our trust in these tools might be misplaced.
The Hard Numbers:
Even top-tier LLMs peaked at a reported 92% accuracy, leaving a potentially dangerous 8% gap where your customers' data can leak. They particularly struggled with basics like 'last names' and 'street addresses'.
The old guard? Traditional rule-based systems reportedly achieved a shocking 50% accuracy. A coin toss with your customers' privacy.
This tells us that for privacy tasks, off-the-shelf accuracy is a vanity metric. The real metric is the cost of a single failure: one leaked name, one exposed address.
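To make that concrete, here is a minimal sketch (my own illustration, not Diego's methodology) of scoring a detector per entity type, so that misses on last names or street addresses stay visible instead of being averaged away. The annotations are made-up examples.

```python
from collections import defaultdict

# Gold annotations and detector output as sets of (text_span, entity_type) per transcript.
gold = [
    {("Lopez", "LAST_NAME"), ("44 Elm Street", "STREET_ADDRESS")},
    {("Meier", "LAST_NAME")},
]
pred = [
    {("44 Elm Street", "STREET_ADDRESS")},
    {("Meier", "LAST_NAME")},
]

hits, misses = defaultdict(int), defaultdict(int)
for g, p in zip(gold, pred):
    for span, etype in g:
        if (span, etype) in p:
            hits[etype] += 1
        else:
            misses[etype] += 1  # every miss here is a concrete leak, not a rounding error

for etype in sorted(set(hits) | set(misses)):
    recall = hits[etype] / (hits[etype] + misses[etype])
    print(f"{etype}: recall={recall:.0%}, missed={misses[etype]}")
```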
While no tool is perfect, some are better than others. Diego's full analysis breaks down which models offer the best cost-to-accuracy balance in this flawed landscape. It's a must-read for anyone serious about building trustworthy AI.