Saturday, April 25, 2009

Command Pipelines

Pipes are easy. The Unix shells provide mechanisms which you can use to build remarkably sophisticated `programs' out of simple components. We call that a pipeline. A pipeline is composed of a data generator, a series of filters, and a data consumer. Often the final stage is as simple as displaying the output on stdout, and sometimes the first stage is as simple as reading from stdin. Every shell I know of uses the "|" character to separate the stages of a pipeline. So:

data-generator | filter | ... | filter | data-consumer
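
For instance, here is a sketch of a real pipeline (the file name input.txt is just a placeholder): tr is the data generator, turning the text into one word per line; sort and uniq -c are filters; and the final sort with head is the data consumer, printing the ten most common words.

tr -cs '[:alpha:]' '\n' < input.txt | sort | uniq -c | sort -rn | head -10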

Each stage of the pipeline runs in parallel, within the limits which the system permits. Look closely, because that last phrase is important. On a uni-processor system only one process actually runs at any instant, but that is nitpicking; the practical limit is that pipes are buffers capable of holding only finite data. A process can write into a pipe until that pipe is full; when the pipe is full, the writing process blocks until some of the data already in the pipe has been read. Similarly, a process can read from a pipe until that pipe is empty; when it is empty, the reading process blocks until some more data has been written into the pipe.
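
You can watch that blocking happen with a little sketch like this (press Ctrl-C to stop it): yes generates lines as fast as it can, while the loop on the other end consumes only one line per second, so yes spends nearly all of its time blocked, waiting for room in the pipe.

yes | while read line; do echo "got: $line"; sleep 1; done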

An interesting effect of pipes, which is not immediately obvious, is that `record boundaries' can be lost in a pipe. What I mean is this: if a program reads from the terminal using the buffered stream libraries, it will be given data one line at a time. Likewise, if it writes to the terminal using the buffered stream libraries, the data will be displayed one line at a time. But if a program writes into a pipe, that data will be sent to the pipe one stream buffer at a time; that's typically a kilobyte or more of data. So if your data generator `emits' a line of data (using the buffered stream library) to a pipe, the data might actually NOT be written immediately, but instead held in an internal buffer (internal to the data generator) until there's enough data to make it worth sending.
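
You can see the difference for yourself with most awk implementations. Run each of these, type a few lines, and finish with Ctrl-D:

# Writing to the terminal: each typed line echoes back immediately.
awk '{ print "seen:", $0 }'

# Writing into a pipe: the same output sits in awk's buffer and only
# reaches cat when the buffer fills or awk reaches end-of-file.
awk '{ print "seen:", $0 }' | cat

(On GNU systems the stdbuf tool can force line buffering on the first command of a pipeline, though whether it takes effect depends on how that program does its own buffering.)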

Similarly, the program reading from a pipe might get a partial line from a read. That can cause unintended effects. Suppose, for example, that the end of your pipeline is reading a list of files and directories to delete, and suppose its read buffer is five characters long. If you write "/user/john" into the pipe, what comes out could be "/user" and then "/john". Curious, yes?
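
You can fake the effect with dd, which lets you pick the read size. With a five-byte block size, a single read from the pipe returns only the first five characters of the `line':

printf '/user/john\n' | dd bs=5 count=1 2>/dev/null

That prints just "/user": the read stopped partway through the line, which is exactly the situation a naive consumer needs to guard against.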

This buffering effect of the stream libraries might sound like a bad thing, but most of the time it buys you performance. If you are writing a program which uses them, you should consider how buffering will affect that program when it sits in a pipeline, but beyond that I wouldn't worry about it. As I said: it's a good thing.

If you are constructing a pipeline (as all true Unix users do every day) you should remember the buffering effects which the stream libraries and the pipes themselves both introduce. If your pipeline starts with something which reads lines from standard input and then writes variations of those lines to standard output, remember that the second stage of the pipeline might not receive any input until you have typed a few lines; and then it might receive all of those lines in one go! Here's an example of what I mean for you to try:

awk '{$2="SURPRISE"; for (i=0; i<100; i++) print }' | grep -n SURPRISE
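
Type a few lines at the terminal. grep stays silent at first, even though awk produces a hundred output lines for every line you type; then, once awk's output buffer fills, grep suddenly prints a whole numbered batch of SURPRISE lines in one go, and another batch each time the buffer fills again (or when you press Ctrl-D and awk flushes whatever is left).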
