In previous columns, we’ve looked at the Linux file model, and discussed many of the Linux system calls. This month, we’re going to talk a bit more about the semantics of the read() and write() system calls. Remember that read() and write() have very similar prototypes:
ssize_t read(int fd, void * data, size_t count);
ssize_t write(int fd, const void * data, size_t count);
(size_t is a fancy way of saying “unsigned int” on 32-bit machines and “unsigned long” on 64-bit machines, and ssize_t is the signed version of size_t.)
Both of these calls work roughly the same way; they try to read or write the number of bytes requested via the count parameter from or to the open file specified by the descriptor fd. Those bytes are stored in or written from the address pointed to by data. Both calls return the number of bytes which were processed, which is often the same as count.
The “try to” part of that sentence is where things can get tricky, though, since there are times when the kernel will stop trying to read or write bytes and instead will return with the requested operation only partially completed, or not even started. The explanation for these partial reads and writes depends on two things: the type of the file fd refers to and whether or not fd has been specifically selected as non-blocking.
Fast and Slow Files
Linux files fall into two categories: fast and slow. The names are a bit of a misnomer, as the category has nothing to do with the actual speed of reading and writing the files; it has to do with their predictability. Files that respond to read and write requests in a predictable amount of time are fast, and those that could take infinitely long to perform an operation are slow. For example, reading a block from a regular data file is predictable (the kernel knows the data is there, it just has to go get it), so regular files are fast. Reading from a pipe is not predictable, though (if the pipe is empty, the kernel has no way of knowing when, if ever, more data will be put into the pipe), so pipes are slow files. Table 1 lists the file types and tells you whether they are fast or slow.
Table 1: File Types and Predictability
Unfortunately, character devices can go either way. Most (like /dev/dsp) are slow, but a few (/dev/mem for example) act as fast files. The only device we’ll mention here is a slow one, the terminal, which stdin, stdout, and stderr (standard input, output, and error) are normally connected to.
This brings us back to determining when read() and write() will only perform part of your request. These calls will always perform complete operations on fast files, and will return as soon as possible from operations on slow files. The only caveat is that reading from a fast file can return a partial data block, but that will only occur if you are trying to read past the end of the file. Reading 500 bytes from a 100-byte file will only return 100 bytes, for example. In every other case with fast files, you’ll get as many bytes read or written as you requested, or an error returned.
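The end-of-file caveat is easy to check with a small experiment. What follows is only a sketch: the helper name read_past_eof() and the scratch path under /tmp are assumptions made for this illustration, not anything from the kernel or the C library.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create a scratch regular file holding file_size
 * bytes, then ask a single read() for `request` bytes from its start.
 * Returns whatever read() returned, or -1 if the setup failed. */
ssize_t read_past_eof(size_t file_size, size_t request) {
    char path[64];
    char buf[1024];
    int fd;
    ssize_t got;

    if (file_size > sizeof(buf) || request > sizeof(buf))
        return -1;

    /* Assumed scratch location; any writable directory would do. */
    snprintf(path, sizeof(path), "/tmp/short-read-demo.%ld", (long) getpid());
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file goes away when fd is closed */

    memset(buf, 'x', file_size);
    if (write(fd, buf, file_size) != (ssize_t) file_size ||
        lseek(fd, 0, SEEK_SET) < 0) {
        close(fd);
        return -1;
    }

    got = read(fd, buf, request);       /* one read() call, no retry loop */
    close(fd);
    return got;
}
```

Calling read_past_eof(100, 500) should report 100 bytes, matching the example in the text, while a request that fits inside the file comes back complete.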
Slow files work differently. As the kernel doesn’t want to block the process forever (which could happen reading from a pipe), the kernel will return from read() requests as soon as any data is available. If your application is trying to read 50 bytes of data from a pipe, and only 5 bytes are in that pipe, the kernel will return immediately with those 5 bytes and tell your program what happened via the value of the read() call.
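The 5-out-of-50 case is simple to reproduce with an anonymous pipe. This is a minimal sketch; pipe_short_read() is a name invented for the example, and it uses one process for both ends of the pipe purely to keep the demonstration short.

```c
#include <unistd.h>

/* Hypothetical helper: put `available` bytes into a fresh pipe, then ask
 * a single read() for `wanted` bytes. Returns what read() returned, or
 * -1 if the setup failed. */
ssize_t pipe_short_read(size_t available, size_t wanted) {
    int fds[2];
    char buf[64] = { 0 };
    ssize_t got;

    /* available == 0 is refused: a blocking read() on an empty pipe with
     * an open writer would never return. */
    if (available == 0 || available > sizeof(buf) || wanted > sizeof(buf))
        return -1;
    if (pipe(fds) < 0)
        return -1;

    if (write(fds[1], buf, available) != (ssize_t) available) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    got = read(fds[0], buf, wanted);    /* returns as soon as data exists */
    close(fds[0]);
    close(fds[1]);
    return got;
}
```

With 5 bytes in the pipe, pipe_short_read(5, 50) should come back with 5 rather than waiting for the other 45.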
This is almost always the behavior programs want; when programs read a line of text from the terminal, for example, they will instruct the kernel to return that line as soon as it’s ready rather than force the user to enter exactly 80 characters before the program sees any of it. Similarly, most programs which read data from a network socket will want to get that data as soon as possible rather than in discretely sized chunks.
If this isn’t the behavior you need, a simple function can give read() on slow files the same all-or-nothing behavior it has on fast files, as can be seen in Listing One.
Listing One: Duplicating read() on Slow Files
#include <unistd.h>

ssize_t readall(int fd, void * data, size_t count) {
    ssize_t bytesRead;
    char * dataPtr = data;
    size_t total = 0;

    while (count) {
        bytesRead = read(fd, dataPtr, count);
        if (bytesRead < 0) return -1;  /* error; read() has set errno */
        if (bytesRead == 0) break;     /* end of file; return what we have */
        dataPtr += bytesRead;
        count -= bytesRead;
        total += bytesRead;
    }

    return (ssize_t) total;
}
The only remaining question is what read() does when no bytes are available. By default, read() waits until at least one byte is available to return to the application; this default is called “blocking” mode. Alternatively, individual file descriptors can be switched to “non-blocking” mode, which means that a read() on a slow file will return immediately, even if no bytes are available.
The last thing we need to mention for read() is how it behaves on empty pipes. If nobody has the pipe open for writing, read() will always return 0 bytes and not block. If someone does have the pipe open for writing, though, blocking file descriptors will block on read(), and non-blocking ones will return immediately with EAGAIN. This is summarized in Table 2.
Table 2: Reading from Empty Pipes
File Descriptor    Open writer                 No writer
Blocking           read() blocks               read() returns 0
Non-blocking       read() fails with EAGAIN    read() returns 0
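The non-blocking cases for an empty pipe can be checked directly. The sketch below is illustrative only: read_empty_pipe() is a hypothetical helper, and it uses fcntl() to set O_NONBLOCK on the read end. The blocking case with an open writer is left out on purpose, since that read() would simply hang.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: read() one byte from an empty pipe whose read end
 * is non-blocking. If keep_writer is zero, the write end is closed first.
 * Returns read()'s result; when that is -1, *err holds the errno value. */
ssize_t read_empty_pipe(int keep_writer, int *err) {
    int fds[2];
    int flags;
    char c;
    ssize_t got;

    *err = 0;
    if (pipe(fds) < 0) {
        *err = errno;
        return -1;
    }

    /* Switch only the read end to non-blocking mode. */
    flags = fcntl(fds[0], F_GETFL);
    if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) < 0) {
        *err = errno;
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (!keep_writer)
        close(fds[1]);                  /* now nobody can write */

    got = read(fds[0], &c, 1);
    if (got < 0)
        *err = errno;                   /* save before close() can change it */

    close(fds[0]);
    if (keep_writer)
        close(fds[1]);
    return got;
}
```

With the writer kept open the call should fail with EAGAIN; with the writer closed it should return 0, the usual end-of-file indication.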
Now that we’ve discussed how read() works with slow files, let’s move on to write(). Normally, write() will block until it has written all of the data to the file. If you try to write data to a pipe with a full internal buffer, your write() will not return until someone has read enough out of the pipe to make room for all of the data that you’re trying to add. If that particular file descriptor (or file structure) is in non-blocking mode, however, write() will write as much data into the file as it can, and then return. This means that it will store as few as 0 bytes (in which case it fails with EAGAIN), or as many as the full number that was requested.
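To see a non-blocking write() stop short, you can fill a pipe that nobody is draining. This is a rough sketch under a few assumptions: fill_pipe_nonblocking() is a made-up name, the 4096-byte chunk size is arbitrary, and the total it reports depends on the kernel's pipe buffer size, so no particular number should be expected.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write into a pipe through a non-blocking
 * descriptor until the kernel refuses more data. Returns the number of
 * bytes accepted before write() failed with EAGAIN, or -1 on any other
 * outcome. */
long fill_pipe_nonblocking(void) {
    int fds[2];
    int flags;
    int saved = 0;
    char chunk[4096];
    long total = 0;
    ssize_t n;

    memset(chunk, 0, sizeof(chunk));
    if (pipe(fds) < 0)
        return -1;

    flags = fcntl(fds[1], F_GETFL);
    if (flags < 0 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    for (;;) {
        n = write(fds[1], chunk, sizeof(chunk));
        if (n < 0) {
            saved = errno;              /* expected: EAGAIN once full */
            break;
        }
        if (n == 0)
            break;                      /* not expected; avoid spinning */
        total += n;                     /* n may be less than a full chunk */
    }

    close(fds[0]);
    close(fds[1]);
    return saved == EAGAIN ? total : -1;
}
```

A blocking descriptor in the same situation would never return, because the only potential reader is the process that is stuck in write().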
Trying to write into a pipe that has no readers will result in the kernel sending the process a SIGPIPE signal, but since we haven’t talked about signals yet, all you need to know is that, by default, getting a SIGPIPE kills the process. As an aside for readers who already know about signals, if the signal is ignored or caught, the write() will instead return EPIPE.
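For readers who do know signals, the EPIPE case can be observed by ignoring SIGPIPE around the write. The following is a sketch only; write_without_reader() is a hypothetical name, and it uses the simple signal() interface rather than sigaction() to stay short.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper: write one byte into a pipe that has no readers
 * while SIGPIPE is ignored. Returns the errno from the failed write(),
 * 0 if the write unexpectedly succeeded, or -1 if the setup failed. */
int write_without_reader(void) {
    int fds[2];
    int result = 0;
    void (*old_handler)(int);

    /* Without this, the default SIGPIPE action would kill the process. */
    old_handler = signal(SIGPIPE, SIG_IGN);
    if (old_handler == SIG_ERR)
        return -1;

    if (pipe(fds) < 0) {
        signal(SIGPIPE, old_handler);
        return -1;
    }
    close(fds[0]);                      /* remove the only reader */

    if (write(fds[1], "x", 1) < 0)
        result = errno;

    close(fds[1]);
    signal(SIGPIPE, old_handler);       /* restore the previous action */
    return result;
}
```

Strictly speaking, write() returns -1 and the EPIPE code arrives in errno, which is what the helper passes back.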
Switching a File Descriptor
The last thing to discuss is how to acquire a non-blocking file descriptor. As we’ve already mentioned, file descriptors are, by default, blocking. You can open a file descriptor as non-blocking by adding a flag to the open(), and you can change a file descriptor between blocking and non-blocking via the fcntl() call. Unfortunately, fcntl() is a messy system call which has two or three different prototypes; we’re only going to mention the two we care about.
int fcntl(int fd, int cmd);
int fcntl(int fd, int cmd, int arg);
The first argument is the file descriptor which is being modified and the second specifies what actions should be taken. There are two that we care about, F_GETFL (get flags) and F_SETFL (set flags).
The F_GETFL command returns the file flags for fd. These are the set of O_ flags that describe the types of access permitted to the file; typical values are O_RDWR (read/write) or (O_WRONLY (write only) | O_APPEND). The O_NONBLOCK flag is set if the file is to be treated as non-blocking, and is not present if the file is blocking.
The following code snippet evaluates to nonzero if the file descriptor fd is non-blocking, and zero if it is blocking (assuming the call succeeds; if it fails, fcntl() returns -1 and stores an error code in the variable errno).
(fcntl(fd, F_GETFL) & O_NONBLOCK)
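Because a failed fcntl() returns -1, and -1 has every bit set, the bare expression reports a bad descriptor as “non-blocking”. One way to separate the three outcomes is a small wrapper; is_nonblocking() below is a name chosen for this sketch, not a standard function.

```c
#include <fcntl.h>

/* Hypothetical helper: returns 1 if fd is non-blocking, 0 if it is
 * blocking, and -1 if fcntl() failed (errno is left set by fcntl()). */
int is_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;                      /* do not test bits in an error value */
    return (flags & O_NONBLOCK) != 0;
}
```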
The F_SETFL command lets the application modify the O_NONBLOCK flag (and a couple of others, including O_APPEND) for the file. The correct way to change these is to retrieve the old flags, then either set or clear the O_NONBLOCK flag, and finally set the new flags. By following this procedure, you change only the value of O_NONBLOCK rather than having to set the value of every flag. The following shows how to turn on non-blocking mode for a file (without error checking):
fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
Similarly, to turn it off:
fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_NONBLOCK);
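The one-liners skip error checking, and if the inner F_GETFL call fails they pass a flag word derived from -1 to F_SETFL. A version with the checks put back might look like the sketch below; set_nonblocking() and its enable parameter are names invented for this example.

```c
#include <fcntl.h>

/* Hypothetical helper: turn O_NONBLOCK on (enable != 0) or off
 * (enable == 0) for fd, leaving every other file flag untouched.
 * Returns 0 on success, -1 on failure with errno set by fcntl(). */
int set_nonblocking(int fd, int enable) {
    int flags = fcntl(fd, F_GETFL);     /* step 1: fetch the old flags */

    if (flags < 0)
        return -1;

    if (enable)
        flags |= O_NONBLOCK;            /* step 2: change just one bit */
    else
        flags &= ~O_NONBLOCK;

    if (fcntl(fd, F_SETFL, flags) < 0)  /* step 3: store the new flags */
        return -1;
    return 0;
}
```

A caller would use set_nonblocking(fd, 1) before polling a pipe or socket, and set_nonblocking(fd, 0) to go back to the default behavior.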
As I mentioned earlier, a file can be opened in non-blocking mode with the open() system call. You do this by OR-ing O_NONBLOCK with the rest of the file flags used in the open() call, such as O_RDONLY or O_RDWR. We’ll only talk about how opening named pipes changes when O_NONBLOCK is specified, but the semantics of opening device files can also change when O_NONBLOCK is specified. Normally, open() blocks when opening a named pipe for writing until that pipe is opened for reading as well. This is usually the desired behavior, since it reduces the chances of data being written to a pipe and never being read.
If O_NONBLOCK is specified when opening a named pipe for writing, the open() will return immediately. If the pipe is already open for reading, the open() will succeed normally. If no readers are available, the open() will still return immediately, but it will fail with the ENXIO error code.
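Both outcomes can be tried with a scratch named pipe. Treat this as a sketch under stated assumptions: open_fifo_writer() is a hypothetical helper, the path under /tmp is made up for the example, and the reader is simulated by the same process opening the pipe with O_RDONLY | O_NONBLOCK (a non-blocking open for reading succeeds even when no writer exists yet).

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a named pipe, optionally open it for
 * reading, then try a non-blocking open for writing. Returns 0 if that
 * open succeeded, the errno value if it failed, or -1 on setup failure. */
int open_fifo_writer(int with_reader) {
    char path[64];
    int rfd = -1;
    int wfd;
    int result = 0;

    /* Assumed scratch location; any writable directory would do. */
    snprintf(path, sizeof(path), "/tmp/fifo-demo.%ld", (long) getpid());
    if (mkfifo(path, 0600) < 0)
        return -1;

    if (with_reader) {
        rfd = open(path, O_RDONLY | O_NONBLOCK);
        if (rfd < 0) {
            unlink(path);
            return -1;
        }
    }

    wfd = open(path, O_WRONLY | O_NONBLOCK);
    if (wfd < 0)
        result = errno;                 /* ENXIO when nobody is reading */
    else
        close(wfd);

    if (rfd >= 0)
        close(rfd);
    unlink(path);                       /* clean up the scratch FIFO */
    return result;
}
```

Without O_NONBLOCK, the writer's open() in the no-reader case would simply wait until some other process opened the pipe for reading.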
I hope this has helped illuminate some of the more esoteric behaviors of read() and write(), and will help you take advantage of Linux’s flexible I/O semantics in your own programs.