Tangential (but topical in that "The threat is comfortable drift toward not understanding what you're doing" is also on the front page):
Is the generated python code in the example wrong?
The prompt
> Develop a Python function that removes any falsey values from a list. Return the modified list without creating a new one.
Is answered with list comprehension, which makes a new list and leaves the original unmodified (never mind that the *args input necessarily can't be a modifiable list?)
def remove_falsey_values(*args): return [val for val in args if val]
Whereas I'd expect something like
def remove_falsey_values(l):
for i in reversed(range(len(l))):
if not l[i]: l.pop(i)
# returned list is linked to input l
return l
a = [1, 0, False, 'foo']
x = remove_falsey_values(a)
x[0] = 2
print(a) # [2,'foo']
It doesn't fit the requirement to modify the list in place, but the prompt itself contradicts the requirements by asking explicitly for the implementation to use *args and a list comprehension.
Ahh I didn't see the full original prompt -- it's overflowing into a horz scroll for me. I thought it was the "critique loop" that injected the *args requirement. I guess garbage in, garbage out. Still unfortunate example to use.
Oh I wouldn't be surprised. This is a sample from one of the OSS code datasets I'd used, which are all generated synthetically using LLMs. Data is indeed the moat.
Absolutely! And the list.pop version is multiple orders of magnitude slower. But I took the prompt to be asking for in-place modification of the existing list. Comprehension does not do that.
This is a great question. You definitely aren't training this to use it, you're training it to understand how things work. It's an educational project, if you're interested in experimenting with things like distributed training techniques in JAX, or preference optimisation, this gives you a minimal and hackable library to build on.
It's also a great base for experimentation. If you have an idea for an architecture improvement you can try it for $36 on the 20 layer nanocode setting, then for another $200 see how it holds up on the "full scale" nanocode
Kaparthy's notes on improving nanochat [1] are one of my favorite blog-like things to read. Really neat to see which features have how much influence, and how the scaling laws evolve as you improve the architecture
There's also modded-nanogpt which turns the same kind of experimentation into a training speedrun (and maybe loses some rigor on the way) [2]
This is a gross simplification of the process - you would typically use order(s) of magnitude more data and compute, and a substantial amount of online reinforcement learning to elicit emergent tool use capabilities.
> This is a library showing you how to train your own Claude Code end-to-end.
What does it even mean?
Claude Code is a so called "harness" - a thing that builds a context for LLMs, calls LLMs, executes tool calls etc. It uses various Anthropic models under the hood.
It can also use other models AFAIK.
It cannot be "trained".
Sorry if this comment sounds nitpicky, I'm just annoyed by the imprecise use of terminology.
I see what you mean, but I disagree. I expect that Claude Code is backed by a separate post-train of Claude base which has been trained using the Claude Code harness and toolset.
Is the generated python code in the example wrong?
The prompt
> Develop a Python function that removes any falsey values from a list. Return the modified list without creating a new one.
Is answered with list comprehension, which makes a new list and leaves the original unmodified (never mind that the *args input necessarily can't be a modifiable list?)
Whereas I'd expect something likeWhy would people want to spend $200 to train a coding model when there are free coding models?
Kaparthy's notes on improving nanochat [1] are one of my favorite blog-like things to read. Really neat to see which features have how much influence, and how the scaling laws evolve as you improve the architecture
There's also modded-nanogpt which turns the same kind of experimentation into a training speedrun (and maybe loses some rigor on the way) [2]
1 https://github.com/karpathy/nanochat/blob/master/dev/LOG.md
2 https://github.com/kellerjordan/modded-nanogpt
Any practitioners can elaborate?
Many recent OSS models have great tech reports where you can learn more about these kind of things: Kimi 2.5 https://github.com/MoonshotAI/Kimi-K2.5/blob/master/tech_rep... GLM 5 https://arxiv.org/abs/2602.15763 DeepSeek R1 https://arxiv.org/pdf/2501.12948
What does it even mean?
Claude Code is a so called "harness" - a thing that builds a context for LLMs, calls LLMs, executes tool calls etc. It uses various Anthropic models under the hood.
It can also use other models AFAIK.
It cannot be "trained".
Sorry if this comment sounds nitpicky, I'm just annoyed by the imprecise use of terminology.
that being said, there are other potential explanations
https://github.com/Nano-Collective/nanocoder