$\begingroup$

LLMs pick their answers based on a large training dataset, and relatively speaking they do a very good job of it. I know they output a lot of incorrect information, but they are still impressively good at producing blocks of text about almost any topic.

But they often fail at the strawberry question (how many "r"s are in "strawberry"?). I know LLMs can't really think, and failing would make sense if a model had never seen the question, but I would assume it gets asked that question thousands of times daily and that users often correct it, yet it still makes the same mistake. How has it not "picked it up" yet? I am assuming it would get negative feedback when it makes a mistake and be rewarded when it gets it correct. So why do LLMs do such an impressive job of using training data and feedback to build large blocks of text about anything, with perfect grammar, tailored to each user, yet fail at such a simple task with far fewer variables?

$\endgroup$
  • $\begingroup$ I'm maybe out of date, but why do you think LLMs are self-learning? A simple Google search says "they are not generally considered to be 'self-learning' in the sense of autonomously developing new knowledge and skills without any human guidance or external data". $\endgroup$ Commented Aug 7, 2025 at 19:10
  • $\begingroup$ @adsp42 I believe he meant that through RLHF it would be retrained to correct wrong answers. $\endgroup$ Commented Aug 7, 2025 at 19:43
  • $\begingroup$ Right. The OP didn't exactly spell it out like that :⁠-⁠) but I can see you used it in your answer. $\endgroup$ Commented Aug 7, 2025 at 19:57

1 Answer
$\begingroup$

The failure of the "strawberry" test by LLMs makes sense once you realize that tokenization of the vocabulary prevents the model from directly counting the characters in a word. The three "r"s in "strawberry" are abstracted away into a token or two, at least in earlier models. With byte-level byte-pair encoding, which can fall back to individual bytes, these kinds of problems become more tractable, but they are still tricky.
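To make this concrete, here is a toy sketch of what the model actually receives after tokenization. This is not a real tokenizer; the vocabulary and token IDs are made up for illustration, but the effect is the same: the model operates on ID sequences, not characters.

```python
# Hypothetical subword vocabulary; real BPE vocabularies are learned
# from data, but the model still only sees integer token IDs.
vocab = {"straw": 101, "berry": 102}
id_to_piece = {v: k for k, v in vocab.items()}

token_ids = [vocab["straw"], vocab["berry"]]  # what the model "sees": [101, 102]

# The three "r"s exist only in the decoded text, not in the ID sequence:
decoded = "".join(id_to_piece[t] for t in token_ids)
print(token_ids)           # [101, 102] -- no character-level information here
print(decoded.count("r"))  # 3 -- only recoverable after decoding to text
```

Nothing in `[101, 102]` indicates how many "r"s the underlying word contains, which is why a model reasoning over tokens alone struggles with the question.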

That said, I've noticed that many mainstream LLMs are now able to answer the question. This might be due to RLHF, as you said, or to agentic capabilities incorporated into these models, which let them call functions (tools) that handle the general task of counting characters in a word.
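A character-counting tool of that kind can be tiny. This is a hypothetical sketch of the sort of function an agentic model might be given, not any specific vendor's API:

```python
def count_letter(word: str, letter: str) -> int:
    """Case-insensitive count of how many times `letter` occurs in `word`.

    Hypothetical example of a tool an agentic LLM could call instead of
    guessing from its tokenized view of the text.
    """
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))   # 3
print(count_letter("Mississippi", "s"))  # 4
```

Because the function operates on the raw string, it sidesteps the tokenization problem entirely.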

[screenshot: Claude 4 correctly counting the "r"s in "strawberry"]

This was returned from Claude 4. I also tested it with the number of “s”s in “Mississippi”.

[screenshot: Claude 4 correctly counting the "s"s in "Mississippi"]

Claude answered both questions correctly.

Hope this helps!

$\endgroup$
