LLM System Optimization


  • View profile for Andrew Ng

    DeepLearning.AI, AI Fund and AI Aspire

    2,440,479 followers

    Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool Use, Planning, and Multi-agent Collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output.

    Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains.

    You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection.

    Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows:

    "Here's code intended for task X: [previously generated code]. Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it."

    Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions.
    And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases, or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement.

    Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses.

    Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications' results. If you're interested in learning more about Reflection, I recommend:
    - Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023)
    - Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
    - CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)

    [Original text: https://lnkd.in/g4bTuWtU ]
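The generate → critique → rewrite loop described in the post fits in a few lines. A minimal sketch, assuming `call_llm` is a stand-in for whatever chat-completion client you use (the function name and prompts are illustrative, not from the post):

```python
def reflect_and_refine(call_llm, task, rounds=2):
    """Generate an answer, then alternate critique and rewrite rounds."""
    # Initial generation
    draft = call_llm(f"Write code to carry out this task: {task}")
    for _ in range(rounds):
        # Ask the model to criticize its own output
        critique = call_llm(
            f"Here's code intended for task: {task}\n{draft}\n"
            "Check the code carefully for correctness, style, and "
            "efficiency, and give constructive criticism for how to improve it."
        )
        # Rewrite using the previous code plus the feedback as context
        draft = call_llm(
            f"Task: {task}\nPrevious code:\n{draft}\n"
            f"Feedback:\n{critique}\nUse the feedback to rewrite the code."
        )
    return draft
```

Each additional round costs two extra LLM calls (critique + rewrite), so `rounds` is the knob trading latency for quality.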

  • View profile for Steve Nouri

    The largest AI Community 14 Million Members | Advisor @ Fortune 500 | Keynote Speaker

    1,734,515 followers

    🚀 Google just dropped the blueprint for the future of agentic AI: Context Engineering, Sessions & Memory. If prompt engineering was about crafting good questions, context engineering is about building an AI's entire mental workspace. Here's why this paper matters 👇

    What's Context Engineering? LLMs are stateless; they forget everything between calls. Context engineering turns them into stateful systems by dynamically assembling:
    • System instructions (the "personality" of the agent)
    • External knowledge (RAG results, tools, and outputs)
    • Session history (ongoing dialogue)
    • Long-term memory (summaries and facts from past sessions)

    It's not prompt design anymore; it's prompt orchestration.

    Think of sessions as your workbench: messy but active. Sessions manage short-term context and working memory. Think of memory as your filing cabinet: organized, persistent, and searchable. Memories persist facts, preferences, and strategies across time and agents. Together, they make AI personal, consistent, and self-improving.

    My Takeaways:
    • Context is the new compute: your system's intelligence depends on what it sees, not just the model you use.
    • Memory isn't a vector DB; it's an LLM-driven ETL pipeline that extracts, consolidates, and prunes knowledge.
    • Multi-agent systems need shared memory layers, not shared prompts.
    • Procedural memory (the how) is the next frontier: agents learning strategies, not just storing facts.

    Building an "agent" today isn't about chaining APIs together. It's about context architecture that makes models actually think across time. The future of AI won't belong to those who fine-tune models; it'll belong to those who engineer context. "Stateful AI begins with context engineering." This might just be the new foundation of agentic systems.
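The dynamic assembly described above can be sketched as a small function: system instructions first, then long-term memories, then as much recent session history as fits a character budget, then the query. This is an illustrative sketch of the idea, not an API from the paper; the budgeting heuristic is my own simplification:

```python
def assemble_context(system, memories, session, query, budget_chars=2000):
    """Assemble a (role, text) context: instructions, long-term memory,
    recent session turns that fit the budget, and the current query."""
    parts = [("system", system)]
    for fact in memories:                 # filing cabinet: persistent facts
        parts.append(("memory", fact))
    used = sum(len(text) for _, text in parts) + len(query)
    recent = []
    for turn in reversed(session):        # workbench: keep newest turns first
        if used + len(turn) > budget_chars:
            break                         # older turns get dropped/summarized
        recent.append(turn)
        used += len(turn)
    parts.extend(("session", turn) for turn in reversed(recent))
    parts.append(("user", query))
    return parts
```

In a real system the dropped turns would be summarized into long-term memory rather than simply discarded, which is exactly the session-to-memory pipeline the paper describes.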

  • View profile for Zain Hasan

    I build and teach AI | AI/ML @ Together AI | EngSci Ψ/PhD @ UofT | Previously: Vector DBs, Data Scientist, Lecturer & Health Tech Founder | 🇺🇸🇨🇦🇵🇰

    19,192 followers

    You don't need a 2-trillion-parameter model to tell you the capital of France is Paris. Be smart and route between a panel of models according to query difficulty and model specialty!

    A new paper proposes a framework to train a router that sends each query to the appropriate LLM to optimize the trade-off between cost and performance.

    Overview: Model inference cost varies significantly. Per one million output tokens: Llama-3-70b ($1) vs. GPT-4-0613 ($60); Haiku ($1.25) vs. Opus ($75).

    The RouteLLM paper proposes a router-training framework based on human preference data and augmentation techniques, demonstrating over 2x cost savings on widely used benchmarks. They frame the problem as choosing between two classes of models:
    (1) strong models - produce high-quality responses but at a high cost (GPT-4o, Claude 3.5)
    (2) weak models - relatively lower quality and lower cost (Mixtral 8x7B, Llama3-8b)

    A good router requires a deep understanding of the question's complexity as well as the strengths and weaknesses of the available LLMs. They explore different routing approaches:
    - Similarity-weighted (SW) ranking
    - Matrix factorization
    - BERT query classifier
    - Causal LLM query classifier

    Neat ideas to build from:
    - Users can collect a small amount of in-domain data to improve performance for their specific use cases via dataset augmentation.
    - Expand the problem from routing between a strong and a weak LLM to multiclass model routing, where we have specialist models (language-vision model, function-calling model, etc.) in a larger framework controlled by a router. Imagine a system of 15-20 tuned small models, with the router as the (n+1)-th model responsible for picking the LLM that will handle a particular query at inference time.
    - MoA architectures: routing to different architectures of a Mixture of Agents would be a cool idea as well. Depending on the query, you decide how many proposers there should be, how many layers in the mixture, what the aggregate models should be, etc.
    - Route-based caching: if you get redundant queries that are slightly different, route the query + previous answer to a small model for light rewriting instead of regenerating the answer.
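At its core, strong/weak routing is a thresholded decision. A minimal sketch, assuming `difficulty_score` is a learned scorer that predicts how likely the weak model is to fail (RouteLLM trains this from preference data; the model names and threshold here are illustrative):

```python
def route(query, difficulty_score, threshold=0.5,
          strong="gpt-4o", weak="llama-3-8b"):
    """Send the query to the strong model only when predicted difficulty
    (probability the weak model fails) exceeds the threshold."""
    return strong if difficulty_score(query) > threshold else weak
```

Sweeping the threshold traces out the cost/quality curve: at 0 every query goes to the strong model, at 1 everything goes to the weak one, and the operating point is wherever your budget and quality bar meet.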

  • View profile for Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    41,674 followers

    Researchers from Oxford University just achieved a 14% performance boost in mathematical reasoning by making LLMs work together like specialists in a company.

    In their new MALT (Multi-Agent LLM Training) paper, they introduce a novel approach where three specialized LLMs - a generator, a verifier, and a refinement model - collaborate to solve complex problems, similar to how a programmer, a tester, and a supervisor work together.

    The breakthrough lies in their training method:
    (1) Tree-based exploration - generating thousands of reasoning trajectories by having the models interact
    (2) Credit attribution - identifying which model is responsible for successes or failures
    (3) Specialized training - using both correct and incorrect examples to train each model for its specific role

    Using this approach on 8B-parameter models, MALT achieved relative improvements of 14% on the MATH dataset, 9% on CommonsenseQA, and 7% on GSM8K. This represents a significant step toward more efficient and capable AI systems, showing that well-coordinated smaller models can match the performance of much larger ones.

    Paper https://lnkd.in/g6ag9rP4
    —
    Join thousands of world-class researchers and engineers from Google, Stanford, OpenAI, and Meta staying ahead on AI http://aitidbits.ai
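At inference time the three roles chain naturally: the generator drafts, the verifier critiques, and the refinement model produces the final answer from both. A toy sketch with stand-in callables (MALT's actual prompting, tree search, and training loop are in the paper; this only shows the role pipeline):

```python
def malt_style_answer(generate, verify, refine, question):
    """Chain generator → verifier → refiner, the role split MALT trains."""
    draft = generate(question)                 # generator proposes a solution
    critique = verify(question, draft)         # verifier checks the reasoning
    return refine(question, draft, critique)   # refiner fixes it using both
```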

  • View profile for Philipp Schmid

    AI Developer Experience at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

    164,846 followers

    Can LLMs "think harder" in latent space? A new paper demonstrates that letting an LLM iterate in its latent space (like "thinking" multiple times about the same input) improves performance to a level comparable with much larger models.

    Implementation (simplified):
    1️⃣ 3-part architecture:
    - Prelude: transforms input tokens into latent space
    - Recurrent Block: the core "thinking" component that iterates multiple times
    - Coda: converts the final latent state to output tokens
    2️⃣ Train with randomized recurrence steps (log-normal Poisson sampling) and truncated backpropagation through the last 8 iterations.
    3️⃣ Deploy with dynamic recurrence steps (4-64) at inference for compute scaling, with KV-cache sharing and KL-based early stopping for efficiency.

    Insights:
    📈 Achieves 34.8% strict accuracy on GSM8K (5x baseline) with 32 recurrences
    🛠️ "Sandwich" normalization and input reinjection prevent hidden-state collapse
    📚 Performs best on code/math tasks (23% HumanEval) with a data mix containing 31.5% STEM content
    🔄 Shows latent-space "reasoning orbits" that correlate with task difficulty
    ⚡ KV-cache sharing reduces memory usage by 75% during long reasoning chains
    📚 Performance gains vary by task; easier tasks saturate with fewer iterations, while harder tasks benefit from more.

    Paper: https://lnkd.in/e_iRk4Xd
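The prelude → recurrent → coda flow with early stopping can be sketched abstractly. In this toy, the latent state is a single number and a simple convergence check stands in for the paper's KL-based stopping rule over latent vectors; the function names mirror the paper's terminology but the code is purely illustrative:

```python
def depth_recurrent_forward(prelude, recurrent, coda, tokens,
                            max_steps=64, tol=1e-6):
    """Embed input, apply the recurrent 'thinking' block repeatedly,
    stop early once the latent state stops changing, then decode."""
    state = prelude(tokens)                 # prelude: tokens → latent state
    for _ in range(max_steps):
        new_state = recurrent(state)        # one 'thought' iteration
        converged = abs(new_state - state) < tol
        state = new_state
        if converged:                       # stand-in for KL early stopping
            break
    return coda(state)                      # coda: latent state → output
```

The key property the paper exploits is visible even in the toy: harder inputs (states further from a fixed point) simply take more iterations, so compute scales with difficulty at inference time without changing the weights.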

  • View profile for Akshay Pachaar

    Co-Founder DailyDoseOfDS | BITS Pilani | 3 Patents | X (187K+)

    175,498 followers

    System prompts are getting outdated!

    Here's a counterintuitive lesson from building real-world agents: writing giant system prompts doesn't improve an agent's performance; it often makes it worse.

    For example, you add a rule about refund policies. Then one about tone. Then another about when to escalate. Before long, you have a 2,000-word instruction manual. But here's what we've learned: LLMs are extremely poor at handling this. Recent research also confirms what many of us experience. There's a "Curse of Instructions": the more rules you add to a prompt, the worse the model performs at following any single one.

    Here's a better approach: contextually conditional guidelines. Instead of one giant prompt, break your instructions into modular pieces that only load into the LLM when relevant.

    ```
    agent.create_guideline(
        condition="Customer asks about refunds",
        action="Check order status first to see if eligible",
        tools=[check_order_status],
    )
    ```

    Each guideline has two parts:
    - Condition: When does it get loaded?
    - Action: What should the agent do?

    The magic happens behind the scenes. When a query arrives, the system evaluates which guidelines are relevant to the current conversation state. Only those guidelines get loaded into the model's context. This keeps the LLM's cognitive load minimal: instead of juggling 50 rules, it focuses on just the 3-4 that actually matter at that point. This results in dramatically better instruction-following.

    This approach is called Alignment Modeling: structuring guidance contextually so agents stay focused, consistent, and compliant. Instead of waiting for a smarter model, what matters is having an architecture that respects how LLMs fundamentally work.

    This approach is actually implemented in Parlant - a recently trending open-source framework (13k+ stars). You can see the full implementation and try it yourself. But the core insight applies regardless of what tools you use: be methodical about context engineering, and actually explain what you expect the behavior to be in the special cases you care about. Then agents can become truly focused and useful.

    I've shared the repo link in the first comment!
    ___
    Share this with your network if you found this insightful ♻️ Follow me (Akshay Pachaar) for more insights and tutorials on AI and Machine Learning!
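The selective-loading step can be sketched independently of any framework. In this illustrative stand-in (not Parlant's API), `matches` would in practice be an LLM or classifier call that judges whether a condition applies to the conversation:

```python
def active_guidelines(guidelines, conversation, matches):
    """Return only the guidelines whose condition applies to the current
    conversation state, so the prompt stays small and focused."""
    return [g for g in guidelines if matches(g["condition"], conversation)]

def build_prompt(base_instructions, guidelines, conversation, matches):
    """Assemble a compact system prompt: base instructions plus only the
    currently relevant guideline actions."""
    relevant = active_guidelines(guidelines, conversation, matches)
    rules = "\n".join(f"- When {g['condition']}: {g['action']}"
                      for g in relevant)
    return f"{base_instructions}\n{rules}" if rules else base_instructions
```

The agent re-runs this assembly on every turn, so the rule set the model sees tracks the conversation state instead of being fixed up front.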

  • View profile for Vince Lynch

    CEO of IV.AI | The AI Platform and Data Source to Reveal What Matters | We’re hiring

    11,814 followers

    I’m jealous of AI. Because with a model, you can measure confidence. Imagine you could do that as a human - measure how close or far off you are?

    Here's how to measure it, for technical and non-technical teams.

    For business teams:
    - Run a "known answers" test. Give the model questions or tasks where you already know the answer. Think of it like a QA test for logic. If it can't pass here, it's not ready to run wild in your stack.
    - Ask for confidence directly. Prompt it: "How sure are you about that answer on a scale of 1-10?" Then: "Why might this be wrong?" You'll surface uncertainty the model won't reveal unless asked.
    - Check consistency. Phrase the same request five different ways. Is it giving stable answers? If not, revisit the product strategy for the LLM.
    - Force reasoning. Use prompts like "Show step-by-step how you got this result." This lets you audit the logic, not just the output. Great for strategy, legal, and product decisions.

    For technical teams:
    - Use the softmax output to get predicted probabilities. Example: the model says "fraud" with 92% probability.
    - Use entropy to spot uncertainty. High entropy = low confidence. (Shannon entropy: −∑ p log p)
    - For language models, extract token-level log-likelihoods if you have API or model access. These give you the probability of each token generated.
    - Use sequence likelihood to rank alternate responses. Common in RAG and search-ranking setups.

    For uncertainty estimates, try:
    - Monte Carlo Dropout: run the same input multiple times with dropout on, and compare outputs. High variance = low confidence.
    - Ensemble models: aggregate predictions from several models to smooth confidence.
    - Calibration testing: use a reliability diagram to check whether predicted probabilities match actual outcomes, with Expected Calibration Error (ECE) as the metric. A well-calibrated model should show that 80% confident ≈ 80% correct.

    How to improve confidence (and make it trustworthy):
    - Label smoothing during training prevents overconfident predictions and improves generalization.
    - Temperature tuning (post-hoc) adjusts the softmax sharpness to better align confidence and accuracy. Temperature < 1 → sharper, more confident; temperature > 1 → more cautious, less spiky predictions.
    - Fine-tuning on domain-specific data shrinks uncertainty and reduces hedging in model output. Especially effective for LLMs that need to be assertive in narrow domains (legal, medicine, strategy).
    - Use focal loss for noisy or imbalanced datasets. It down-weights easy examples and forces the model to pay attention to harder cases, which tightens confidence on the edge cases.
    - Reinforcement learning from human feedback (RLHF) aligns the model's reward with correct and confident reasoning.

    Bottom line: a confident model isn't just better - it's safer, cheaper, and easier to debug. If you're building workflows or products that rely on AI, but you're not measuring model confidence, you're guessing.

    #AI #ML #LLM #MachineLearning #AIConfidence #RLHF #ModelCalibration
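The softmax-with-temperature, entropy, and ECE quantities above fit in a few lines of stdlib Python (a minimal sketch; the function names are mine, and real pipelines would operate on tensors rather than lists):

```python
import math

def softmax(logits, temperature=1.0):
    """Raw scores → probabilities. Temperature < 1 sharpens (more
    confident); temperature > 1 flattens (more cautious)."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy −∑ p log p; higher means less confident."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average each bin's
    |mean confidence − accuracy| gap, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_conf - accuracy)
    return ece
```

A reliability diagram is just the per-bin (mean_conf, accuracy) pairs plotted against the diagonal; ECE collapses that plot into a single number.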

  • View profile for Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,102 followers

    Most LLM agents stop learning after fine-tuning. They can replay expert demos but can't adapt when the world changes. That's because we train them with imitation learning: they copy human actions without seeing what happens when they fail. It's reward-free but narrow. The next logical step, reinforcement learning, lets agents explore and learn from rewards, yet in real settings (e.g., websites, APIs, operating systems) reliable rewards rarely exist or appear too late. RL becomes unstable and costly, leaving LLMs stuck between a method that can't generalize and one that can't start.

    Researchers from Meta and Ohio State propose a bridge called Early Experience. Instead of waiting for rewards, agents act, observe what happens, and turn those future states into supervision. It's still reward-free but grounded in real consequences. They test two ways to use this data:
    1. Implicit World Modeling: for every state-action pair, predict the next state. The model learns how the world reacts - what actions lead where, what failures look like.
    2. Self-Reflection: sample a few alternative actions, execute them, and ask the model to explain in language why the expert's move was better. These reflections become new training targets, teaching decision principles that transfer across tasks.

    Across eight benchmarks - from home simulations and science labs to APIs, travel planning, and web navigation - both methods beat imitation learning. In WebShop, success jumped from 42% to 60%; in long-horizon planning, gains reached 15 points. When later fine-tuned with RL, these checkpoints reached higher final performance and needed half (or even one-eighth) of the expert data. The gains held from 3B- to 70B-parameter models.

    To use this yourself, here is what you need to do:
    • Log each interaction and store a short summary of the next state - success, error, or side effect.
    • Run a brief next-state-prediction phase before your normal fine-tune so the model learns transitions.
    • Add reflection data: run two to four alternative actions, collect the results, and prompt the model to explain why the expert step was better. Train on those reflections plus the correct action.
    • Keep compute constant - replace part of the imitation learning, don't add more.

    This approach makes agent training cheaper, less dependent on scarce expert data, and more adaptive. As models learn from self-generated experience, the skill barrier for building capable agents drops dramatically. In my opinion, the new challenge is governance: ensuring they don't learn the wrong lessons. That means filtering unsafe traces, constraining environments to safe actions, and auditing reflections before they become training data.

    When rewards are scarce and demonstrations costly, let the agent learn from what it already has: its own experience! That shift turns LLMs from static imitators into dynamic learners and moves us closer to systems that truly improve through interaction, safely and at scale.
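The data-construction steps above can be sketched as one pass over logged expert episodes, producing both kinds of supervision. A minimal sketch, assuming `env_step` executes an action, `propose_alts` samples alternatives, and `explain` is an LLM call that writes the reflection (all stand-ins, not the paper's code):

```python
def build_early_experience(episodes, env_step, propose_alts, explain):
    """Turn reward-free expert interactions into two training sets:
    (1) next-state prediction pairs (implicit world modeling), and
    (2) language reflections on why the expert action was better."""
    world_model_data, reflection_data = [], []
    for state, expert_action in episodes:
        # (1) implicit world modeling: predict what the world does next
        next_state = env_step(state, expert_action)
        world_model_data.append(
            {"input": (state, expert_action), "target": next_state})
        # (2) self-reflection: execute a few alternatives, explain the expert
        alt_outcomes = [(a, env_step(state, a)) for a in propose_alts(state)]
        reflection_data.append(
            {"input": (state, expert_action, alt_outcomes),
             "target": explain(state, expert_action, alt_outcomes)})
    return world_model_data, reflection_data
```

Both sets are then mixed into the existing fine-tune in place of some imitation data, keeping total compute constant as the post recommends.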

  • View profile for Gabriele Berton

    Training Vision Language Models

    12,522 followers

    Did you know that you can
    - speed up any LLM by 4x
    - and reduce its memory footprint by 2x
    - and improve its results
    - without modifying the model at all?

    How??? We just released CompLLM; here is how it works.

    LLMs take tokens as input: split the text into tokens, convert them to token embeddings, and feed them to the LLM. This is suboptimal. Some tokens contain little information (like the token for "the"), yet they require as much memory and processing time as informative tokens.

    But we note that pretrained LLMs can take as input not only Token Embeddings (those ~200k vectors contained in the embedding table), but also other embeddings, which we call Concept Embeddings, unseen during training, and still produce correct output. Concept Embeddings can contain the same amount of information as Token Embeddings, but with a shorter sequence length.

    So how do we obtain Concept Embeddings? Enter CompLLM, a separate LM that, given N Token Embeddings, extracts M Concept Embeddings, with M < N. CompLLM is trained with a distillation loss: the teacher is the LLM that takes as input the Token Embeddings (a standard LLM pipeline), while the student takes the Concept Embeddings. The output of teacher and student should match, and this trains the CompLLM (the LLM is frozen).

    Essentially our pipeline performs two forward passes, one through the CompLLM and one through the LLM. So how can this be faster? CompLLM processes the prompt in chunks (like short sentences). In CompLLM's attention layer, each token only attends to tokens within its chunk. This makes CompLLM's speed and memory linear (not quadratic!) with the context length.

    With a compression rate of 2, we have 1 Concept Embedding for every 2 Token Embeddings. So the generation LLM takes as input a 2x shorter prompt than it normally would. This makes generation 4x faster, and the KV cache 2x smaller!

    To be fair, this is only useful for long prompts: reducing 20 tokens to 10 is not really useful, but reducing 100k tokens to 50k unlocks new context lengths that otherwise might not even fit in GPUs, while being faster!

    Given that CompLLM processes the input in chunks, its output can also be re-used across queries! This is super useful for RAG (compress documents independently to Concept Embeddings, use them multiple times), or for code agents (if you change a line of code in a file, you don't need to re-compress the whole file, only the modified chunk!).

    Finally, given that CompLLM processes the input in chunks, we can train CompLLM on 1k-token-long sequences, and it then works even on 100k-long sequences!

    And the results? At short context lengths we get similar results to not using CompLLM, but at long context lengths the results improve by far! Reducing the number of embeddings reduces attention dilution and makes it easier for LLMs to find relevant information.

    A huge thanks to my collaborators! Jayakrishnan Unnikrishnan Son Tran Mubarak Shah
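The chunk-independence property that enables cache reuse can be sketched without any model at all. Here `compress_chunk` stands in for the CompLLM forward pass, and the cache shows why editing one chunk only re-compresses that chunk (an illustrative sketch, not the released code):

```python
def compress_prompt(token_embeddings, compress_chunk, chunk_size=20,
                    cache=None):
    """Compress a prompt chunk-by-chunk into concept embeddings.
    Chunks are independent, so results are cached and reused across
    queries; only modified chunks ever get re-compressed."""
    cache = {} if cache is None else cache
    concepts = []
    for i in range(0, len(token_embeddings), chunk_size):
        chunk = tuple(token_embeddings[i:i + chunk_size])  # hashable key
        if chunk not in cache:
            cache[chunk] = compress_chunk(chunk)           # one forward pass
        concepts.extend(cache[chunk])
    return concepts
```

Because each chunk attends only within itself, compressing the whole prompt is a batch of independent small calls: linear in context length, and perfectly shareable across RAG queries that reuse the same documents.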

  • View profile for Taha Kass-Hout, MD, MS

    Global Chief Science & Technology Officer, GE HealthCare | Physician and Health AI Leader | Imaging and Diagnostics AI | Former Amazon/AWS VP HealthAI and FDA Chief Health Informatics Officer

    20,400 followers

    LLMs don’t just “answer”; they write, one token at a time. That changes everything about how we serve them in production. Here’s why LLMs need their own “special sauce”:

    1. Variable lengths → continuous batching. Traditional ML is like stamping “cat/not-cat” in one go. LLMs write word-by-word, so every response is a different length. To keep GPUs busy, we run a never-ending conveyor belt: as one reply finishes, a new request hops on without stopping the belt.

    2. Two phases: prefill vs. decode. Think of prefill as the model reading your prompt (compute-heavy), and decode as it writing the reply (memory-bound). We often put these on different hardware lanes so neither becomes the traffic jam.

    3. KV cache = memory Tetris. LLMs “remember” intermediate results so they don’t re-read the whole prompt for each token. Managing that memory without wasting GPU space is tricky; techniques like PagedAttention act like smart paging in an OS to avoid fragmentation.

    4. Smart, prefix-aware routing. Don’t just round-robin. If two users start with the same prompt chunk, send them to the replica that already cached it. Less redo, faster answers.

    5. Sharding & Mixture-of-Experts (MoE). Modern models activate only a few “specialists” per request. We shard those experts across GPUs and route dynamically, so you get big-model quality with small-model efficiency.

    This way of serving LLMs matters because it delivers lower latency and lower cost, translating to better UX and healthier margins, while increasing throughput on the same hardware budget, making performance more predictable under peak load, and enabling faster, cheaper experimentation so more ideas get shipped.
If you’re choosing tooling, names to know: vLLM, sgl-project, TensorRT-LLM, Ray, plus managed options on Amazon Web Services (AWS) like Amazon Bedrock (with Provisioned Throughput for predictable capacity) and Amazon SageMaker (Large Model Inference containers with TensorRT-LLM/Triton, Hugging Face TGI, and support for Inferentia2 via the Neuron SDK), built to handle these exact wrinkles at scale.
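Point 1's "conveyor belt" can be illustrated with a toy scheduler. Real servers like vLLM do this per decode step with far more bookkeeping (paged KV memory, preemption); this sketch only shows seats being refilled the moment a sequence finishes, instead of waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching loop. requests: list of (id, n_tokens).
    Returns (completion order, total decode steps) with the batch kept
    full: finished sequences leave and queued requests join immediately."""
    queue = deque(requests)
    active, done, steps = [], [], 0
    while queue or active:
        while queue and len(active) < max_batch:   # refill empty seats
            rid, n_tokens = queue.popleft()
            active.append([rid, n_tokens])
        steps += 1                                 # one decode step for the batch
        for req in active:
            req[1] -= 1                            # each sequence emits a token
        done += [rid for rid, n in active if n == 0]
        active = [req for req in active if req[1] > 0]
    return done, steps
```

With static batching, the one-token request would sit idle until the three-token request in its batch finished; here it exits after a single step and its seat is reused immediately, which is exactly where the throughput win comes from.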
