You have probably watched tokens stream in from an LLM before, appearing one at a time as if the model were typing. If you used the Anthropic SDK’s .stream() method, it just worked, and you probably never saw what was on the wire.
This post focuses on how a streamed response works and which bugs the SDK handles for you under the hood.
stop_reason, in a stream
In post 1, stop_reason was right there in the response JSON. In a stream, it takes the same four values (end_turn, max_tokens, tool_use, stop_sequence), but it arrives inside a message_delta event near the end of the stream.
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn",...}}
The same rule from post 1 applies: if you ignore stop_reason, you’ll ship a bug. A max_tokens cutoff in a streamed response looks exactly like a normal end of stream. You won’t know the model was cut off unless you read this event.
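As a rough sketch of that check, assuming you are reading the raw SSE text yourself rather than going through the SDK: the helper below scans each event for a message_delta and returns its stop_reason. The name findStopReason and the sample text are hypothetical, and the sample includes only the fields this post has shown.

```javascript
// Hypothetical helper: pull stop_reason out of raw SSE text.
// Assumes events are separated by a blank line and carry one `data:` line each.
function findStopReason(sseText) {
  for (const block of sseText.split("\n\n")) {
    const dataLine = block.split("\n").find((line) => line.startsWith("data:"));
    if (!dataLine) continue; // empty trailing block or an event with no data
    const payload = JSON.parse(dataLine.slice("data:".length).trim());
    if (payload.type === "message_delta" && payload.delta && payload.delta.stop_reason) {
      return payload.delta.stop_reason;
    }
  }
  return null; // no message_delta with a stop_reason was seen
}

// Illustrative sample, not a captured response.
const sample =
  'event: message_delta\n' +
  'data: {"type":"message_delta","delta":{"stop_reason":"max_tokens"}}\n\n' +
  'event: message_stop\n' +
  'data: {"type":"message_stop"}\n\n';

const stopReason = findStopReason(sample);
if (stopReason === "max_tokens") {
  console.error("Response was cut off at the token limit");
}
```

Returning null instead of defaulting to end_turn is deliberate here: a stream that ends without a stop_reason is itself worth noticing.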
Three things to try before the next post
1. Run the streaming code. Then change "stream": true to false and run it again. Notice how long you wait before seeing anything. That gap is what your users feel.
2. Add console.error(chunk.length) inside the for await loop, before any parsing. Run the code and watch the numbers. You’ll see chunks of wildly different sizes: 8 bytes here, 400 bytes there. The network decides, not the model. Tokens and chunks are not the same thing.
3. Start a stream, then disconnect your Wi-Fi mid-response. Watch what happens. The loop hangs, then eventually throws, and that error is only handled if we have added error handling. This sets up the error handling post later in the series.
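Experiment 2 is the reason a parser cannot assume one chunk equals one event. A minimal sketch of the buffering this implies, assuming events end with a blank line ("\n\n"): append each chunk to a buffer, hand back only the complete events, and keep the remainder for the next chunk. The name createEventSplitter and the sample chunks are hypothetical.

```javascript
// Hypothetical helper: reassemble SSE events from arbitrarily sized chunks.
function createEventSplitter() {
  const decoder = new TextDecoder();
  let buffer = "";
  return function push(chunk) {
    // { stream: true } keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(chunk, { stream: true });
    const parts = buffer.split("\n\n");
    buffer = parts.pop(); // the last piece may be an incomplete event
    return parts.filter((part) => part.length > 0);
  };
}

const encoder = new TextEncoder();
const push = createEventSplitter();

// One event split across two chunks, as the network might deliver it.
const first = push(encoder.encode('event: message_delta\ndata: {"type":"mess'));
const second = push(encoder.encode('age_delta","delta":{"stop_reason":"end_turn"}}\n\n'));

// The first chunk yields no complete event; the second completes it.
console.error(first.length, second.length);
```

Calling JSON.parse on the first chunk alone would throw, which is the kind of bug the SDK hides from you.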
What’s next
TinyAgent can now stream a response. Tokens land as they arrive. stop_reason shows up at the end. It still has no memory, though: every call starts blank.
The upcoming posts in this series will cover more important details. 😁
Happy Coding! 👩‍💻



