WebSockets is a stream, not a message-based protocol...

As I mentioned here, the WebSockets protocol is, at this point, a bit of a mess due to the evolution of the protocol and the fact that it’s being pulled in various directions by various interested parties. I’m just ranting about some of the things that I find annoying…

The first thing to realise about the WebSockets protocol is that it isn’t really message-based at all, despite what the RFC claims.

Clients and servers, after a successful handshake, transfer data back and forth in conceptual units referred to in this specification as “messages”. A message is a complete unit of data at an application level, with the expectation that many or most applications implementing this protocol (such as web user agents) provide APIs in terms of sending and receiving messages. The WebSocket message does not necessarily correspond to a particular network layer framing, as a fragmented message may be coalesced, or vice versa, e.g. by an intermediary.

and…

The WebSocket protocol uses this framing so that specifications that use the WebSocket protocol can expose such connections using an event-based mechanism instead of requiring users of those specifications to implement buffering and piecing together of messages manually.

Together these suggest that a message-based, event-driven design which presents complete messages to the application layer would be sensible. Unfortunately, once you realise exactly how a message is made up, it becomes impossible to provide an interface which ONLY delivers messages as complete units to the application layer.

WebSocket messages consist of one or more frames. A frame can be either a complete frame or a fragmented frame. Messages themselves do not have any length indication built into the protocol, only frames do. Frames can have a payload length of up to 9,223,372,036,854,775,807 bytes (since the protocol allows for a 63-bit length indicator) and finally…
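To make the framing concrete, here’s a minimal sketch (in Python, for brevity) of how a frame header and its variable-width length encoding are laid out. The function name is mine, not from any real library, and masking, reserved bits and extension data are deliberately ignored:

```python
import struct

def parse_frame_header(data: bytes):
    """Parse the start of a WebSocket frame.

    Returns (fin, opcode, payload_len, header_len), or None if more
    bytes are needed. Masking and reserved bits are ignored here.
    """
    if len(data) < 2:
        return None
    fin = bool(data[0] & 0x80)          # FIN bit: final fragment of a message
    opcode = data[0] & 0x0F             # e.g. 0x1 text, 0x2 binary, 0x8 close
    length = data[1] & 0x7F             # 7-bit length, or an escape value
    offset = 2
    if length == 126:                   # next 2 bytes hold a 16-bit length
        if len(data) < 4:
            return None
        length = struct.unpack_from(">H", data, 2)[0]
        offset = 4
    elif length == 127:                 # next 8 bytes hold a 64-bit length,
        if len(data) < 10:              # whose top bit must be 0: hence the
            return None                 # 63-bit maximum the text describes
        length = struct.unpack_from(">Q", data, 2)[0]
        offset = 10
    return fin, opcode, length, offset
```

Feeding it a small unfragmented text frame (`b"\x81\x05Hello"`) yields a FIN-set text frame of length 5; a header using the 127 escape can declare the full 9,223,372,036,854,775,807 byte payload.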

The primary purpose of fragmentation is to allow sending a message that is of unknown size when the message is started without having to buffer that message. If messages couldn’t be fragmented, then an endpoint would have to buffer the entire message so its length could be counted before first byte is sent. With fragmentation, a server or intermediary may choose a reasonable size buffer, and when the buffer is full write a fragment to the network.

So a single WebSocket “message” can consist of an unlimited number of 9,223,372,036,854,775,807 byte fragments. This makes it impossible for a general purpose WebSocket protocol parser to only present complete messages to the application layer in such a way that the application doesn’t need to do some form of “buffering and piecing together of messages manually”. At best a general purpose parser could present WebSocket data as a ‘sequence of streams’, given that each “message” is in fact simply a potentially infinite stream of bytes with a message terminator (the FIN bit in the frame header) at the end. It could do this by passing the application layer an interface that allowed the application to pull data from the WebSocket “message” until it was complete, and that’s less than ideal if you are used to working with asynchronous push APIs, or trying to avoid unnecessary memory copies…
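That ‘sequence of streams’ shape can be sketched as a pull interface. Assuming a lower layer that hands us (fin, payload) pairs per fragment, each “message” becomes an iterator the application drains until the FIN-bearing fragment arrives; the names are illustrative and opcodes, masking and control frames are ignored:

```python
def messages(frames):
    """Present frames as a 'sequence of streams': each yielded item is a
    generator of fragment payloads for one message, ending at the frame
    with the FIN bit set. `frames` yields (fin, payload) pairs."""
    frames = iter(frames)
    for fin, payload in frames:
        def fragments(fin=fin, payload=payload):
            while True:
                yield payload
                if fin:
                    return
                try:
                    fin, payload = next(frames)
                except StopIteration:
                    return              # connection ended mid-message
        message = fragments()
        yield message
        for _ in message:               # drain anything the caller skipped
            pass
```

Note that the caller still has to pull and accumulate fragments itself (`b"".join(message)`), which is exactly the “buffering and piecing together” the RFC claimed applications would be spared, and it copies data out of the parser’s buffers to do it.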

Even if the maximum frame size were reduced, as some propose, the problem would still be present because a single message can consist of an infinite number of fragments. Likewise a protocol parser cannot take the easy route and simply disallow fragmented frames, since the RFC states that…

o Clients and servers MUST support receiving both fragmented and unfragmented messages.

o An intermediary MUST NOT change the fragmentation of a message if any reserved bit values are used and the meaning of these values is not known to the intermediary.

o An intermediary MUST NOT change the fragmentation of any message in the context of a connection where extensions have been negotiated and the intermediary is not aware of the semantics of the negotiated extensions.

Which means that although the application that you’ve written may send and receive WebSocket messages of an application-restricted maximum size, you may still find that you receive fragments, because an intermediary has decided to fragment your frames. Unless, of course, you subvert the protocol’s extension functionality by negotiating an “x-{My own private GUID}” extension between your client and server, which would neatly prevent any intermediaries (except ones that you’d written yourself) from changing the fragmentation of your frames… Then, of course, an intermediary may simply decide to remove the client’s request for your unknown extension from its initial handshake request to prevent it being negotiated. Or, perhaps more likely, close your connection with a 1004 close code.

There’s resistance to proposals to allow the maximum frame size to be negotiated during the handshake phase, yet there’s a standard close code for “frame too big”… Should an application just guess how large it’s allowed to go?

The view of some on the discussion list seems to be that “A server (or client) which exposes the frames as its primary API is doing it wrong.” but it seems to me that to write a flexible, general purpose protocol parser which can be used by both push and pull APIs you have no option but to expose details of the message framing. The reason is that a general purpose parser cannot buffer complete messages, or even complete frames, and so must deliver the data either as a stream of bytes at an application level or as a sequence of partial frames, allowing the application to decide how to accumulate the frames into messages.

By hiding all of the framing from the application, the application developer cannot take advantage of the knowledge that they have of the structure of their particular messages. This knowledge is especially useful with asynchronous APIs where the application might be pushed buffers of data as the data arrives - which is how most of The Server Framework happens to operate and how I/O completion port centric designs tend to work. If an application works in terms of, say, messages that can be at most 4096 bytes, and is dealing with buffers that can contain complete messages, then it could use the details of the data framing to efficiently accumulate the data into a single buffer and then dispatch the complete message for processing when it receives the final frame.

The alternative is to add complexity to the protocol parser by allowing it to accumulate ‘messages’ up to a configurable size and present complete messages via one callback and incomplete messages via another, or to provide only a stream based pull API which requires the application to needlessly copy data from the protocol parser’s buffers into its own.
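The push-style accumulation described above can be sketched as follows; `MessageAssembler` and its callback are my own illustrative names (not part of The Server Framework or any real API), with the parser pushing (fin, payload) fragments as they arrive and complete messages up to an application-chosen limit being dispatched on the final frame:

```python
class MessageAssembler:
    """Accumulate pushed fragments into one buffer; dispatch the complete
    message to `on_message` when the FIN-bearing fragment arrives.
    Anything that would exceed `max_size` is rejected up front."""

    def __init__(self, on_message, max_size=4096):
        self.on_message = on_message
        self.max_size = max_size
        self.buffer = bytearray()

    def on_fragment(self, fin: bool, payload: bytes):
        if len(self.buffer) + len(payload) > self.max_size:
            # a real endpoint would close here with the "frame too big" code
            raise ValueError("message exceeds application maximum")
        self.buffer += payload
        if fin:
            self.on_message(bytes(self.buffer))
            self.buffer.clear()
```

Because the assembler sees the framing (the FIN flag), it can hand the application only complete messages without any per-message pull loop, while still enforcing the application’s own size limit.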

The 63-bit fragment size, and fragmentation in general, appear to come from a requirement for streaming data from one end of the connection to the other, see here where a Unixy design idea of simply telling the application to “read x amount of data from this file handle” seems perfectly sensible… Of course this design fails as soon as you need to send a stream that’s bigger than the 63-bit frame size allows, and it also fails if the frame is fragmented, as then the API becomes “read x amount of data from this file handle, but you’re not done yet, wait and I’ll call you again with more”… At which point, and given the possibility of intermediaries that fragment your large fragment down to, let’s be generous, 1024 byte fragments anyway, you may as well simply limit the maximum size of frames to something more manageable… But I suppose “nobody will ever need more than 63 bits of data length”…

Unfortunately, large frame sizes also open the protocol up to lazy application design. Let’s say we’re sending a file: we open the file, read some of it, send a single frame header giving the total file length and then simply start sending data. Cool, we don’t need to worry about the protocol any more; no need to build messages, just pull the data from disk and send it to the other side. This works fine until you have a read failure, or any other reason to terminate the connection. Since you’re in the middle of sending a single huge frame you can’t send an application level frame that informs the other side of the problem. You can’t even send a WebSocket close frame to shut down the connection cleanly; all you can do is abort the connection…
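The reason you’re stuck is that a close frame is itself just a frame: it only means anything to the peer if its header starts on a frame boundary. A minimal sketch of an unmasked close frame (the function name is mine; opcode 0x8 and the 2-byte status code layout are from the spec):

```python
import struct

def close_frame(code: int, reason: bytes = b"") -> bytes:
    """Build an unmasked close frame: FIN set, opcode 0x8, payload of a
    2-byte big-endian status code plus an optional reason.

    If you have already sent the header of a huge data frame, these
    bytes would simply be consumed as more of that frame's payload
    rather than being parsed as a close frame."""
    payload = struct.pack(">H", code) + reason
    # control frames may not be fragmented and must carry <= 125 bytes
    assert len(payload) <= 125
    return bytes([0x80 | 0x08, len(payload)]) + payload
```

So `close_frame(1000)` produces the four bytes `88 02 03 e8`, but emitting them mid-frame just corrupts (or silently pads) the in-flight payload; the only clean-up left is to drop the TCP connection.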

So, WebSockets presents a sequence of infinitely long byte streams, each with a termination indicator (the FIN bit in the frame header), and not the message-based interface you might initially believe it to be. Given that a general purpose protocol handler can only work in terms of partial frames, we effectively have a stream-based protocol with lots of added complexity to provide the illusion of a message-based protocol that can, in practice, only ever be dealt with as a stream of bytes.