Author here. I've been analyzing why RL-based reasoning models like DeepSeek-R1 exhibit specific instability patterns (language mixing, loops) despite their strong reasoning capabilities.

I modeled this as a control theory problem: treating the Chain of Thought as a noise-reduction loop ($\eta \to 0$). The math suggests that while this maximizes gain (reasoning power), operating without a "grounding manifold" ($M_{phys}$) mathematically guarantees divergence.

I included a Python simulation in the Gist to verify the stability difference between grounded and ungrounded systems; a rough sketch of the idea is below. Happy to discuss the math.
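To make the claim concrete, here is a minimal sketch of the kind of grounded vs. ungrounded comparison I mean. It is not the Gist code: the scalar dynamics, the gain value, the clip-based stand-in for $M_{phys}$, and the pull-back strength `k` are all illustrative assumptions.

```python
import numpy as np

def simulate(steps=200, gain=1.05, noise=1e-3, grounded=False, k=0.1, seed=0):
    """Iterate a scalar feedback loop x_{t+1} = gain * x_t + noise.

    With grounded=True, the state is partially pulled back toward a
    toy 'grounding manifold' (here just the interval [-1, 1]).
    All parameters are illustrative, not taken from the post.
    """
    rng = np.random.default_rng(seed)
    x = 0.1
    traj = []
    for _ in range(steps):
        x = gain * x + noise * rng.standard_normal()
        if grounded:
            # Project onto the toy manifold and relax toward it with strength k.
            target = np.clip(x, -1.0, 1.0)
            x = x - k * (x - target)
        traj.append(x)
    return np.array(traj)

ungrounded = simulate(grounded=False)
grounded = simulate(grounded=True)
print(f"ungrounded final |x|: {abs(ungrounded[-1]):.3e}")  # grows without bound (gain > 1)
print(f"grounded   final |x|: {abs(grounded[-1]):.3e}")    # settles near a bounded fixed point
```

With a loop gain above 1, the ungrounded trajectory diverges geometrically, while the pull-back term makes the effective gain outside the manifold less than 1, so the grounded trajectory stays bounded. That is the qualitative contrast the post's simulation is meant to demonstrate.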