Why a vision model can read a stove
The model doesn't really "see" the way we do. It shrinks each photo, cuts it into
small patches, and turns every patch into a list of numbers (a vector). Those vectors
live in the same space the model uses for words, so a round metal shape with a handle
ends up near "skillet," and a bright ring in the thermal photo ends up near "hot burner."
Because it was trained on huge piles of labelled food photos, overhead cooking
clips and thermal images, it can even read the little temperature numbers
printed on the IR frame.
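The patch-and-compare idea above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how any real model is built: the function names (split_into_patches, embed, nearest_label), the projection weights and the two-number word vectors are all made up for illustration. A real model learns its weights from data and uses far longer vectors.

```python
import math


def split_into_patches(image, patch):
    """Cut a 2D grayscale image (a list of rows) into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(tile)
    return patches


def embed(patch_pixels, weights):
    """Turn one flattened patch into a short vector with a linear projection.

    `weights` is a hypothetical stand-in for learned parameters:
    one row of weights per output number.
    """
    return [sum(w * p for w, p in zip(row, patch_pixels)) for row in weights]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_label(vector, word_vectors):
    """Return the word whose vector points in the most similar direction."""
    return max(word_vectors, key=lambda word: cosine(vector, word_vectors[word]))


if __name__ == "__main__":
    # Toy 4x4 "photo" and invented word vectors, for illustration only.
    image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    weights = [[1, 0, 0, 0], [0, 0, 0, 1]]
    words = {"skillet": [1.0, 0.0], "hot burner": [0.0, 1.0]}
    for tile in split_into_patches(image, 2):
        print(tile, nearest_label(embed(tile, weights), words))
```

The point of the sketch is only the shape of the pipeline: pixels become patches, patches become vectors, and "reading" the scene amounts to asking which word vector each patch vector lies closest to.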
Where it falls down
The model is matching patterns, not
understanding physics, so a scene unlike anything in its training data can fool it. It's
also biased toward Western food: it nails a plate of salmon or French toast where
everything sits separately, but a wok full of mixed, blurry stir-fry is much
harder. It gets chicken right nearly every time; jackfruit, almost never. Writing
down where and why it fails is a real part of the project, not a
footnote.
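One way to keep that failure log honest is to tally results per food and keep a note with every miss. The sketch below assumes each test result is stored as a (true label, predicted label, note) triple. The function name summarize_failures and the record layout are hypothetical, and any records fed to it here are placeholders, not results from the project.

```python
from collections import defaultdict


def summarize_failures(records):
    """Tally per-label accuracy and collect a note for every miss.

    `records` is an iterable of (true_label, predicted_label, note) tuples;
    this layout is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    failure_notes = defaultdict(list)
    for true_label, predicted, note in records:
        totals[true_label] += 1
        if predicted == true_label:
            correct[true_label] += 1
        else:
            # Keep the "why" next to the "where" so the log explains itself.
            failure_notes[true_label].append(f"predicted {predicted!r}: {note}")
    accuracy = {label: correct[label] / totals[label] for label in totals}
    return accuracy, dict(failure_notes)
```

Calling it on the full set of test photos gives an accuracy figure for each food plus the reasons behind each miss, which makes gaps like the chicken versus jackfruit one visible at a glance.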