Presenting FEELTHEFORCE (FTF): a robot learning system that models human tactile behavior to learn force-sensitive manipulation. Using a tactile glove to measure contact forces and a vision-based model to estimate hand pose, the authors train a closed-loop policy that continuously predicts the forces needed for manipulation. This policy is re-targeted to a Franka Panda robot with tactile gripper sensors using shared visual and action representations. At execution, a PD controller modulates gripper closure to track the predicted forces, enabling precise, force-aware control. This approach grounds robust low-level force control in scalable human supervision, achieving a 77% success rate across 5 force-sensitive manipulation tasks.
#research: https://lnkd.in/dXxX7Enw
#github: https://lnkd.in/dQVuYTDJ
#authors: Ademi Adeniji, Zhuoran (Jolia) Chen, Vincent Liu, Venkatesh Pattabiraman, Raunaq Bhirangi, Pieter Abbeel, Lerrel Pinto, Siddhant Haldar
New York University, University of California, Berkeley, NYU Shanghai
Controlling fine-grained forces during manipulation remains a core challenge in robotics. While robot policies learned from robot-collected data or simulation show promise, they struggle to generalize across the diverse range of real-world interactions. Learning directly from humans offers a scalable solution, enabling demonstrators to perform skills in their natural embodiment and in everyday environments. However, visual demonstrations alone lack the information needed to infer precise contact forces.
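To make the force-tracking step concrete, here is a minimal sketch of what a PD loop that modulates gripper closure to follow a predicted force could look like. All object and method names (policy.predict_force, gripper.read_force, gripper.command_delta, camera.latest_frame) and the gains are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of a PD loop that adjusts gripper closure so the measured
# contact force tracks the force predicted by a learned policy. Function names
# and gains are illustrative placeholders, not FTF's actual code.
import time

KP, KD = 0.002, 0.0005   # proportional / derivative gains (illustrative units)
DT = 0.02                # 50 Hz control loop

def track_force(policy, gripper, camera, duration_s=5.0):
    prev_error = 0.0
    t_end = time.time() + duration_s
    while time.time() < t_end:
        target_f = policy.predict_force(camera.latest_frame())  # desired contact force (N)
        measured_f = gripper.read_force()                        # tactile sensor reading (N)
        error = target_f - measured_f
        d_error = (error - prev_error) / DT
        # Positive error -> not enough force -> close the gripper a little more.
        gripper.command_delta(KP * error + KD * d_error)
        prev_error = error
        time.sleep(DT)
```

The point of the closed loop is that the policy only has to predict *how hard* to squeeze; the low-level controller handles *how far* to close the fingers to realize that force on each object.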
Advanced Computer Vision Techniques
-
Big news for the 3D computer vision community! 🙌 ByteDance released Depth Anything 3 on Hugging Face 🔥. This is the world's most powerful model for 3D understanding: it predicts spatially consistent geometry (depth and ray maps) from an arbitrary number of visual inputs, with or without known camera poses. In other words, it allows you to reconstruct a 3D scene just from 2D inputs. DA3 extends monocular depth estimation to any-view scenarios, hence the model can take in single images, multi-view images, and video.
Interestingly, the authors reveal two key insights:
- A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture is required.
- A single depth-ray representation objective is enough. The model does not require complex multi-task training.
Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series. Metric estimation, also called absolute estimation, determines the distance in meters relative to the camera, whereas monocular depth estimation determines the distance relative among the pixels.
The authors also released a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. DA3 sets a new state-of-the-art across all 10 tasks, surpassing the prior SOTA, Meta's VGGT, by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Furthermore, DA3 facilitates SLAM (Simultaneous Localization and Mapping) and 3D Gaussian Splatting by providing a robust and generalizable method for predicting spatially consistent geometry from various visual inputs.
Links:
- Models: https://lnkd.in/eFFHJhJx
- Paper: https://lnkd.in/ewtxy7p6
- Demo: https://lnkd.in/e7Qr3tnG
- Code: https://lnkd.in/e89B6JpR
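The metric-versus-relative distinction is easy to see in code. Below is a minimal monocular sketch using the Hugging Face transformers "depth-estimation" pipeline; the model id is an earlier Depth Anything checkpoint used as a placeholder, since DA3's released checkpoints and multi-view interface may differ.

```python
# Minimal monocular depth sketch via the Hugging Face "depth-estimation" pipeline.
# The model id is a placeholder from an earlier Depth Anything release; DA3's
# checkpoints and any-view interface may expose a different entry point.
from transformers import pipeline
from PIL import Image

depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

image = Image.open("scene.jpg")
result = depth(image)

# result["depth"] is a per-pixel prediction rendered as an image. A relative-depth
# model only orders pixels by distance; a metric model would give meters.
result["depth"].save("scene_depth.png")
```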
-
AI that counts sheep. Not the kind that helps you sleep.
This footage shows AI models counting and tracking sheep with accuracy that would take humans hours to achieve manually. Agriculture is being transformed by computer vision that can detect, count, and monitor livestock at scale. Farmers managing thousands of animals can now get precise counts instantly instead of manual tallies that are always approximate.
But the applications extend far beyond counting. The same technology detects health issues by identifying animals moving differently.
→ Tracks growth rates.
→ Monitors feeding patterns.
→ Identifies animals that need veterinary attention before visible symptoms appear.
This is precision agriculture enabled by AI that can process visual information faster and more consistently than human observation. The technology applies to crops as well.
→ Detecting disease in plants.
→ Identifying optimal harvest timing.
→ Monitoring soil conditions.
→ Tracking equipment across vast properties.
Agriculture has always been about managing biological systems at scale. AI gives farmers tools to observe and respond to those systems with precision that was never possible before. The revolution is giving farmers capabilities to manage complexity that overwhelmed manual observation.
What other industries have observation problems that computer vision could solve at scale?
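To show how simple the core counting step can be, here is a sketch using an off-the-shelf detector. Ultralytics YOLO is an illustrative choice (the post does not name the models used), and the pretrained COCO checkpoint happens to include a generic "sheep" class; real livestock systems would fine-tune and add multi-object tracking.

```python
# Illustrative livestock counting with an off-the-shelf detector. Ultralytics
# YOLO is an assumption for the example; the post does not say which models
# were actually used. The pretrained COCO checkpoint has a "sheep" class.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                       # small pretrained COCO model
results = model("pasture_frame.jpg", conf=0.4)   # run detection on one frame

boxes = results[0].boxes
sheep = [b for b in boxes if model.names[int(b.cls)] == "sheep"]
print(f"Detected {len(sheep)} sheep in this frame")

# For video, model.track(...) assigns persistent IDs so the same animal is not
# counted twice across frames, which is what enables tracking growth, feeding
# patterns, and movement-based health flags over time.
```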
-
#MIT's new "Radial Attention" makes generative video 4.4x cheaper to train and 3.7x faster to run. Here's why:
The problem with current AI video? It's BRUTALLY expensive. Every frame must "pay attention" to every other frame. With thousands of frames, costs explode quadratically. Training one model? $100K+. Running it? Painfully slow.
Massachusetts Institute of Technology, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence just changed the game.
Their breakthrough insight: video attention works like physics.
- Sound gets quieter with distance
- Light dims as it travels
- Heat dissipates over space
Turns out, AI video tokens follow the same rules. Why waste compute power on distant, irrelevant connections?
Enter Radial Attention. Instead of checking EVERY connection:
• Nearby frames → full attention
• Distant frames → sparse attention
• Computation scales as O(n log n), not quadratically
Technical result: O(n log n) vs O(n²). Translation: MASSIVE efficiency gains.
Real-world results on production models:
📊 HunyuanVideo (Tencent):
• 2.78x training speedup
• 2.35x inference speedup
📊 Mochi 1:
• 1.78x training speedup
• 1.63x inference speedup
Quality? Maintained or IMPROVED.
What this unlocks:
- 4x longer videos, same resources
- 4.4x cheaper training costs
- 3.7x faster generation
- Works with existing models (no retraining!)
And MIT open-sourced everything: https://lnkd.in/gETYw8eT
The bigger picture: the internet is transforming.
BEFORE: a place to store videos from the real world.
NOW: a machine that generates synthetic content on demand.
Think about it:
• TikTok filled with AI-generated content
• YouTube creators using AI for entire videos
• Streaming services producing personalized shows
• Educational content generated for each student
This changes everything. Remember when only big tech could afford generative AI?
2020: GPT-3 → Only OpenAI
2022: Stable Diffusion → Everyone
2024: Midjourney everywhere
Video AI is next. Radial Attention probably just accelerated the timeline. The future isn't coming. It's here. And it's more accessible than ever.
Want to ride this wave?
→ Follow me for weekly AI breakthroughs
→ Share if this opened your eyes
→ Try the code: https://lnkd.in/gETYw8eT
What will YOU create when video AI costs 4x less?
#AI #VideoGeneration #MachineLearning #TechInnovation #FutureOfContent
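As a rough illustration of the near/far split (not the paper's actual kernel or mask), the sketch below builds a static attention mask in PyTorch where each token attends densely to a local window and only to exponentially strided positions farther away, so the number of attended keys per query grows roughly logarithmically with sequence length.

```python
# Toy "radial" sparsity pattern: full attention inside a local window,
# exponentially strided attention beyond it. Illustrative only, not the
# paper's implementation.
import torch

def radial_mask(n: int, local_window: int = 8) -> torch.Tensor:
    """Boolean (n, n) mask; True = query i may attend to key j."""
    idx = torch.arange(n)
    dist = (idx[:, None] - idx[None, :]).abs()   # temporal distance |i - j|
    mask = dist <= local_window                   # dense local band
    # Beyond the window, keep only distances that are powers of two, so each
    # query attends to O(log n) distant keys instead of O(n).
    d = dist.clamp(min=1)
    power_of_two = (d & (d - 1)) == 0
    mask |= (dist > local_window) & power_of_two
    return mask

mask = radial_mask(64)
print(f"average keys attended per query: {mask.sum(dim=-1).float().mean():.1f} / 64")

# The boolean mask can be passed to torch.nn.functional.scaled_dot_product_attention
# as attn_mask to zero out the disallowed connections.
```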
-
Happy Friday! This week in #learnwithmz, let’s talk about how AI “sees” the world through Vision Language Models (VLMs). We often treat AI as text-only, but modern models like Gemini, DeepSeek-VL, and GPT-4o blend vision and language, allowing them to describe, reason about, and even “imagine” what they see. An excellent article by Frederik Vom Lehn maps out how information flows inside a VLM, from raw pixels all the way to text predictions.
What’s going on inside a VLM?
- Early layers detect colors and simple patterns.
- Middle layers respond to shapes, edges, and structures.
- Later layers align visual regions with linguistic concepts like “dog,” “street,” or “sky.”
- Vision tokens have large L2 norms, which makes them less sensitive to spatial order (a “bag-of-visual-features” effect).
- The attention mechanism favors text tokens, suggesting that language often dominates reasoning.
- You can even use softmax probabilities to segment images or detect hallucinations in multimodal outputs.
Why it matters: understanding how VLMs allocate attention helps explain why they sometimes hallucinate objects or struggle with spatial reasoning.
PMs & builders: if you’re working with multimodal AI (think copilots, chat with images, or agentic vision), invest time in visual explainability. It’s key to understanding how your AI actually perceives.
Read the full visualization breakdown here: https://lnkd.in/gc2pZnt2
#AI #VisionLanguageModels #LLMs #ProductManagement #learnwithmz #DeepLearning #MultimodalAI
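One of the observations above (large L2 norms on vision tokens) is easy to probe if you can get at a model's hidden states. The sketch below is generic PyTorch: it assumes you already have a hidden-state tensor and a boolean mask marking which positions are image tokens; both names are placeholders, not any specific model's API.

```python
# Probe the "vision tokens have large L2 norms" observation. Assumes you have
# run a VLM with output_hidden_states=True and know which sequence positions
# hold image tokens; `hidden` and `is_vision_token` are placeholders.
import torch

def compare_token_norms(hidden: torch.Tensor, is_vision_token: torch.Tensor):
    """hidden: (seq_len, dim) hidden states; is_vision_token: (seq_len,) bool mask."""
    norms = hidden.norm(dim=-1)                 # per-token L2 norm
    vision_mean = norms[is_vision_token].mean()
    text_mean = norms[~is_vision_token].mean()
    print(f"mean L2 norm  vision: {vision_mean:.1f}   text: {text_mean:.1f}")
    return vision_mean, text_mean

# Toy demonstration with random data (real hidden states would come from the model):
seq_len, dim = 600, 4096
hidden = torch.randn(seq_len, dim)
hidden[:576] *= 5.0                              # pretend the first 576 positions are image patches
mask = torch.zeros(seq_len, dtype=torch.bool)
mask[:576] = True
compare_token_norms(hidden, mask)
```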
-
Lessons from a full day with SAM 2 on satellite imagery.
First off, what is SAM 2? It’s a zero-shot, promptable segmentation model, meaning it can segment unseen objects out of the box, without any training on those classes, using only simple prompts like clicks, boxes, or text descriptions (what I used) to guide the process.
Why apply it to satellite imagery? SAM 2 excels at segmenting environmental features (e.g., roads, buildings, orchards) without retraining.
My top tips?
🛰️ Use high-res imagery (30 cm–1 m/pixel) for crisp segmentation, especially for small objects.
🍃 Adjust prompts for the overhead view (e.g., "green leaves" or "shrubs" instead of "trees"; I even used "grey boxes" to find air conditioning units on top of buildings).
🚗 Small objects are detectable with careful prompting; even counting cars works.
At Wherobots we embed SAM 2 into our raster inference engine. Users write simple SQL/Python prompts with text, inference runs in parallel on tiles, and results are stored as Iceberg tables in S3. From there, you can use the vector objects that are returned just like regular geospatial data, with no special modeling needed.
SAM 2 brings zero-shot segmentation to geospatial data, and when you combine it with prompt tuning, high-res imagery, and distributed inference, you can pull out earth-scale insights in a day.
Would love to hear your experiences with vision models on remote sensing! 🌎
I'm Matt and I talk about modern GIS, geospatial data engineering, and how AI and geospatial are changing.
📬 Want more like this? Join 7k+ others learning from my newsletter → forrest.nyc
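For anyone who wants to try this on a single tile before scaling out, here is a minimal sketch based on the public facebookresearch/sam2 image-predictor interface. The checkpoint path, config name, and box coordinates are placeholders, and text prompts (as used above) require an additional grounding step that turns a phrase into boxes or points, which is not shown.

```python
# Minimal single-image SAM 2 sketch with a box prompt, following the public
# sam2 repo's image-predictor interface. Paths and coordinates are placeholders;
# text prompting needs a separate grounding model to produce boxes/points first.
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(
    build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")
)

tile = np.array(Image.open("satellite_tile.png").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(tile)
    # Prompt with a rough bounding box around one object (x1, y1, x2, y2 in pixels).
    masks, scores, _ = predictor.predict(
        box=np.array([120, 80, 260, 210]), multimask_output=False
    )

print(masks.shape, scores)   # one binary mask over the tile plus its confidence
```

The returned masks can then be polygonized into vector objects and treated like any other geospatial layer, which is essentially what a tiled, parallel inference engine does at scale.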
-
Exciting advancements in Text-to-Video Retrieval (T2VR) from The Johns Hopkins University and the Johns Hopkins Applied Physics Laboratory! Introducing Video-ColBERT → a retrieval method enhancing similarity between language queries and videos. Here's why it's a game-changer:
1. MeanMaxSim (MMS) → Efficiently handles variable query lengths by using the mean instead of the sum for token-wise similarity.
2. Dual-Level Tokenwise Interaction → Independently analyzes static frames and dynamic temporal features for deeper video insights.
3. Query and Visual Expansion → Adds tokens to both queries and videos, capturing richer, more relevant content.
4. Dual Sigmoid Loss Training → Strengthens the spatial and temporal branches individually, boosting retrieval accuracy and robustness.
This isn't just theory. Video-ColBERT achieved state-of-the-art results on major benchmarks like MSR-VTT, MSVD, VATEX, DiDeMo, and ActivityNet. Simply put: better tech = more accurate video retrieval. Massive leap forward from Johns Hopkins and the DEVCOM Army Research Laboratory!
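To make the MeanMaxSim idea concrete: in ColBERT-style late interaction, each query token is matched against its best-matching visual token (max), and those per-token scores are then aggregated. The sketch below contrasts the usual sum-based MaxSim with a mean-based variant in PyTorch; it illustrates the scoring idea only, not the paper's full model.

```python
# ColBERT-style late-interaction scoring: MaxSim (sum over query tokens) versus
# a MeanMaxSim-style variant (mean over query tokens), which keeps scores
# comparable across queries of different lengths. Illustrative only.
import torch
import torch.nn.functional as F

def maxsim(query_tok: torch.Tensor, video_tok: torch.Tensor, reduce: str = "mean"):
    """query_tok: (Nq, d) and video_tok: (Nv, d) token embeddings."""
    q = F.normalize(query_tok, dim=-1)
    v = F.normalize(video_tok, dim=-1)
    sim = q @ v.T                                   # (Nq, Nv) cosine similarities
    best_per_query_token = sim.max(dim=-1).values   # best visual match per query token
    return best_per_query_token.mean() if reduce == "mean" else best_per_query_token.sum()

short_query = torch.randn(4, 128)    # 4 query tokens
long_query = torch.randn(12, 128)    # 12 query tokens
video = torch.randn(64, 128)         # 64 visual tokens (frames/patches)

# Sum-based MaxSim grows with query length; the mean-based variant does not.
print(maxsim(short_query, video, "sum"), maxsim(long_query, video, "sum"))
print(maxsim(short_query, video, "mean"), maxsim(long_query, video, "mean"))
```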
-
Your AI Will See You Now: Unveiling the Visual Capabilities of Large Language Models
The frontier of AI is expanding with major advancements in vision capabilities across Large Language Models (LLMs) such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude. These developments are transforming how AI interacts with the world, combining the power of language with the nuance of vision.
Key Highlights:
• #ChatGPTVision: OpenAI’s GPT-4V introduces image processing, expanding AI’s utility from textual to visual understanding.
• #GeminiAI: Google’s Gemini leverages multimodal integration, enhancing conversational abilities with visual data.
• #ClaudeAI: Anthropic’s Claude incorporates advanced visual processing to deliver context-rich interactions.
Why It Matters: Integrating visual capabilities allows #AI to perform more complex tasks, revolutionizing interactions across various sectors:
• #Robots and Automation: Robots will utilize the vision part of multimodality to navigate and interact more effectively in environments from manufacturing floors to household settings.
• #Security and Identification: At airports, AI-enhanced systems can scan your face as an ID, matching your image against government databases for enhanced security and streamlined processing.
• #Healthcare Applications: In healthcare, visual AI can analyze medical imagery more accurately, aiding in early diagnosis and tailored treatment plans.
These advancements signify a monumental leap towards more intuitive, secure, and efficient AI applications, making everyday tasks easier and safer.
Engage with Us: As we continue to push AI boundaries, your insights and contributions are invaluable. Join us in shaping the future of multimodal AI.
#AIRevolution #VisualAI #TechInnovation #FutureOfAI #DrGPT
🔗 Connect with me for more insights and updates on the latest trends in AI and healthcare.
🔄 Feel free to share this post and help spread the word about the transformative power of visual AI!
-
🚀 Powering the Future: MLLMs 🌐 Seamlessly Bridging Text & Visual Data 🖼️📖
NVIDIA’s 💡 NVLM 1.0 models mark a significant leap in multimodal large language models (LLMs), capable of processing both text and image data. NVLM 1.0 features three models (NVLM-D, NVLM-X, and NVLM-H), each designed to excel in vision-language tasks. With 34 to 72 billion parameters, these models shine in areas like text recognition (OCR), image understanding, and math tasks 🧠.
One of the toughest challenges with such models is maintaining high performance across both domains. Here's why. 🖼 Text and visual data are fundamentally different. Text is structured and sequential, governed by grammar rules, while visuals involve pixel-level detail and spatial relationships. These require vastly different reasoning, making it hard for a single model to master both. Training models for text and visuals often leads to trade-offs. Improving visual performance can degrade language understanding, a phenomenon known as catastrophic forgetting 🧩. Balancing text and visual comprehension without compromising either is a major technical hurdle in multimodal AI.
NVIDIA’s NVLM 1.0 has made progress by optimizing its architecture, excelling in tasks like OCR and complex image understanding while maintaining strong text performance. Notably, NVLM outperforms models like GPT-4 in OCR and rivals Claude 3.5 in math 🎯. Each model has its strengths: NVLM-D is efficient but requires more GPU resources, NVLM-X specializes in high-res images 📸, and NVLM-H blends both capabilities. These models also enhance text-only performance after multimodal training, which is rare in AI development 🔥.
With production-grade multimodal capabilities, NVIDIA is paving the way for businesses and researchers to harness the power of AI 🔧. By emphasizing high-quality, diverse datasets, they’ve created models that not only excel today but are open for future development.
#AI #NVIDIA #LLM #MultimodalAI #DeepLearning #AIInnovation #DataScience #OCR #VisionLanguageModels
-
Check out this Stereo4D paper from DeepMind. It's a pretty clever approach to a persistent problem in computer vision: getting good training data for how things move in 3D.
The key insight is using VR180 videos, those stereo fisheye videos we launched back in 2017 for VR headsets. It was always clear that structured stereo datasets would be valuable for computer vision, and we launched some powerful VR tools with it back in 2017 (link below). But what's the game changer now in 2024 is the scale: they're providing 110K high-quality clips :-) That's the kind of massive, real-world AI dataset that was just a dream back then!
They're using it to train a model called DynaDUSt3R that can predict both 3D structure and motion from video frames. The cool part is it tracks how objects move between frames while also reconstructing their 3D shape. And since we're dealing with real stereoscopic content, the results are notably better than with synthetic data, giving you a faithful rendition of the real world with a diverse set of subject matter.
It's one of those through-lines you see when tackling a timeless mission like mapping the world or spatial computing: VR content created for immersion becoming the foundation for teaching machines to understand how the world moves. Sometimes innovation chains together in unexpected ways.