Good model, but it is very flawed at recalling input
I ran a few logic tests, like mazes and visual games (10x10 symbol grids), and the model is not able to replicate the layout.
Even in its very early thinking, where it repeats the layout, it makes significant errors.
Temperature 0 didn't help.
I wonder how much damage the linear attention mechanism does to recall precision.
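For anyone who wants to reproduce this, here's a minimal harness sketch; it assumes an OpenAI-compatible endpoint, and the base_url and model ID are placeholders, not official values.

```python
# Minimal repro sketch, assuming an OpenAI-compatible server is hosting the
# model locally; base_url and model are placeholders, not official values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="nemotron-3-nano",  # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # greedy decoding; it didn't help here
    )
    return resp.choices[0].message.content
```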
Example:
Replicate the maze 1:1 but exchange S and E:
▧▧▧▧▧▧▧▧▧▧▧▧
S▣▢▢▧▢▧▢▢▢▢▢
▧▧▧▢▧▢▧▢▧▧▧▧
▧▢▧▢▧▢▧▢▧▢▧▢
▧▢▧▢▧▢▧▢▧▢▧▢
▧▢▧▢▧▢▢▢▢▢▧▢
▧▢▧▢▧▧▧▧▧▢▧▢
▧▢▧▢▢▢▧▢▢▢▧▢
▧▢▧▧▧▢▧▢▧▧▧▢
▧▢▢▢▢▢▧▢▢▢▧▢
▧▢▧▢▧▢▧▧▧▧▧▧
▧▢▧▢▧▢▢▢▢▢▢▢
▧▢▧▧▧▧▧▧▧▧▧▧
▧▢▢▢▢▢▢▢▢▢▢E
▧▧▧▧▧▧▧▧▧▧▧▧
Nemotron 3 Nano responds:
E▣▢▢▧▢▧▢▢▢▢▢
▧▧▧▢▧▢▧▢▧▧▧▧
▧▢▧▢▧▢▧▢▧▢▧▢
▧▢▧▢▧▢▧▢▧▢▧▢
▧▢▧▢▧▧▧▧▧▢▧▢
▧▢▧▢▧▧▧▧▧▢▧▢
▧▢▧▢▧▧▧▧▧▢▧▢
▧▢▧▢▧▧▧▧▧▧▧▧
▧▢▢▢▢▢▧▢▢▢▧▢
▧▢▧▧▧▧▧▧▧▧▧▧
▧▢▧▧▧▧▧▧▧▧▧▧
▧▢▧▧▧▧▧▧▧▧▧▧
▧▢▢▢▢▢▧▢▢▢▧▢
▧▢▧▧▧▧▧▧▧▧▧▧
▧▢▧▧▧▧▧▧▧▧▧▧
▧▢▢▢▢▢▢▢▢▢▢S
▧▧▧▧▧▧▧▧▧▧▧▧
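To make the mismatch measurable rather than eyeballed, here's a minimal checker sketch (plain Python, no assumptions beyond the maze from the prompt above):

```python
# Build the expected answer (the maze from the prompt with S and E swapped)
# and diff it row by row against a model reply.
MAZE = """\
▧▧▧▧▧▧▧▧▧▧▧▧
S▣▢▢▧▢▧▢▢▢▢▢
▧▧▧▢▧▢▧▢▧▧▧▧
▧▢▧▢▧▢▧▢▧▢▧▢
▧▢▧▢▧▢▧▢▧▢▧▢
▧▢▧▢▧▢▢▢▢▢▧▢
▧▢▧▢▧▧▧▧▧▢▧▢
▧▢▧▢▢▢▧▢▢▢▧▢
▧▢▧▧▧▢▧▢▧▧▧▢
▧▢▢▢▢▢▧▢▢▢▧▢
▧▢▧▢▧▢▧▧▧▧▧▧
▧▢▧▢▧▢▢▢▢▢▢▢
▧▢▧▧▧▧▧▧▧▧▧▧
▧▢▢▢▢▢▢▢▢▢▢E
▧▧▧▧▧▧▧▧▧▧▧▧"""

# Swap S and E via a sentinel character.
EXPECTED = MAZE.replace("S", "\0").replace("E", "S").replace("\0", "E")

def diff(reply: str) -> int:
    """Print mismatched rows; return how many rows differ."""
    exp, got = EXPECTED.splitlines(), reply.strip().splitlines()
    bad = 0
    for i in range(max(len(exp), len(got))):
        e = exp[i] if i < len(exp) else "<missing row>"
        g = got[i] if i < len(got) else "<missing row>"
        if e != g:
            bad += 1
            print(f"row {i:2d}: expected {e}  got {g}")
    return bad
```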
I've tested this in various other ways; it's always the same flaw. In many cases it doesn't even include the "E" anymore. Only very small layouts (like 5x5) work.
GPT OSS 20B is flawless at this task.
Hey @cmp-nct
Thank you for your feedback.
This appears to be an issue of tokenization or pretraining on rare tokens rather than a limitation in reasoning or memory. I verified that Nemotron-v3 Nano successfully solves the task both with and without thinking when the rare ▣▢▧ tokens are replaced by more common symbols such as -, x, and o.
I did observe a few minor issues in the non-thinking mode: the model sometimes uses a space (' ') instead of \n as the row separator, and it sometimes omits the single - character. Enabling thinking mode generally eliminates these minor errors.
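For reference, here is a quick way to inspect the tokenization difference yourself (the model ID below is a placeholder; substitute the actual checkpoint):

```python
# Compare how the rare maze glyphs and the common replacement symbols
# tokenize; the model ID is a placeholder for the actual checkpoint.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nvidia/nemotron-3-nano")  # placeholder

for ch in ["▧", "▢", "▣", "-", "x", "o"]:
    ids = tok.encode(ch, add_special_tokens=False)
    print(f"{ch!r}: {len(ids)} token(s) -> {ids}")
```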
It's an interesting insight, thanks.
Though it's the first model I've tested in a long time with those issues.
Tokenization issues usually cause mangled UTF-8 symbols; Nemotron, however, is surprisingly consistent - it doesn't destroy the tokens, it just mistakes their locations.
I tested two of my visual benchmarks with pure ASCII (it looks gruesome without UTF-8), and the model was able to solve the easy one, which it had previously failed.
However, it did not solve the maze and mangled it as well.
Asking it to redraw the maze with more pleasant UTF-8 characters causes the same issue: it places wrong blocks but keeps the layout size correct.
My guess is that the efficient attention mechanism has a problem with this type of task.
It would be awesome if a later update or the next "small" Nemotron release included some evaluation of this sort of "visual" recall.
@cmp-nct
I had the same behavior with a simple "Reverse this string: .DefaultCellStyle"! It just couldn't find the right tokens to assemble among those in its vocab! It was almost a frustrating torture for it :D
To have some fun and understand why it struggles, I like to paste the various options it keeps revisiting into https://huggingface.co/spaces/bartar/tokenizers to see what it sees ;) it's really instructive
It might be related; attention seems to fail on intricate tasks.
Asking it to reverse the string fails - apparently that's a PhD-level task.
Asking it to split it into chars works: ['.', 'D', 'e', 'f', 'a', 'u', 'l', 't', 'C', 'e', 'l', 'l', 's', 't', 'y', 'l', 'e']
But asking it to reverse that (now a simple problem) fails again: ['e', 'l', 'y', 'l', 't', 's', 'l', 'l', 'e', 'C', 't', 'l', 'e', 'u', 'a', 'f', 'e', 'D', '.']
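For reference, the ground truth is trivial to produce, which makes the comparison with the output above (19 items instead of 17) easy:

```python
# Ground truth for the reversal, to compare against the model output above.
s = ".DefaultCellStyle"
print(list(s))        # correct split: 17 items
print(list(s)[::-1])  # correct reversal: starts ['e', 'l', 'y', 't', 'S', ...]

model_out = ['e', 'l', 'y', 'l', 't', 's', 'l', 'l', 'e', 'C',
             't', 'l', 'e', 'u', 'a', 'f', 'e', 'D', '.']
print(len(model_out))  # 19 items - two too many
```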
GPT OSS 20B didn't have a problem reversing it correctly, but the 20B also does not scale context size efficiently.
I really wish it wasn't like that - I don't think the AI can be used for productive tasks. Earth needs open-source models, foundation models, and good training recipes - I'm really glad to see Nvidia starting to give back a little for the trillions it is earning from its unique position.
"I don't think the AI can be used for productive tasks"
I guess such a task isn't actually representative of a real-world use case if the model can call a tool to do it (see the sketch below).
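Just to illustrate (a hypothetical tool, not something the model ships with):

```python
# Hypothetical tool: with function calling, reversal stops being a
# token-level recall test and becomes a trivial delegation.
def reverse_string(s: str) -> str:
    return s[::-1]

print(reverse_string(".DefaultCellStyle"))  # elytSlleCtluafeD.
```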
"20B also does not scale context size efficiently."
I don't really think it's directly linked to the number of parameters.
But I agree with you on the rest.
I mean, look at https://artificialanalysis.ai/leaderboards/models?size_class=tiny%2Csmall%2Cmedium&is_open_weights=open_source
For the AA-LCR benchmark, for example, gpt-oss-20b gets 31% while Apriel-v1.6-15B-Thinker gets 50%.