Diomidis D. Spinellis
<dds@sena.gr>
When working with slow peripherals such as tape drives and CD-R recorders, careful tuning of the access patterns and of the process architecture can result in substantial increases in performance.
One of the responsibilities of an operating system (OS) is to
optimise the utilisation of a computer system by the appropriate
scheduling of tasks.
Thus, when a process waits for a peripheral to complete an operation, the OS can schedule other processes that are waiting to run.
Although scheduling happens behind the scenes of the normal system
operation and is transparent to the user, in some cases one can
optimise execution time by carefully arranging interdependent tasks.
With processor speeds ever increasing and peripheral access times relatively stagnant, such optimisations can deliver impressive results.
In the following paragraphs I will describe and explain how I reduced the overhead of tape access by almost 30% by simply adding two dd(1) commands to our backup pipeline.
The problem I wanted to solve was to minimise the actual (wall) time
taken by our backup process which currently needs 8 hours for a full backup.
I thought I could achieve this by making our drive stream, and
by increasing the parallelism of the backup process.
Our LAN consists of Windows 95/98/NT and Unix machines.
We perform our backups on a machine running Linux (2.0.36).
At backup time the Linux machine mounts the disks of all Windows hosts
using the SMB file system and runs a tar(1)-based backup process.
The process is responsible for
administering incremental and full backups,
maintaining backup quota,
measuring the backup set size,
comparing the files after backup, and
logging the backup results.
The heart of the process is essentially the following command:
tar czf /dev/tape -b 1000 /smb
The command creates a tar file in /dev/tape from the contents of /smb using gzip compression and a blocking factor of 1000.
The blocking factor is supposed to make tar write to the tape device in chunks of 1000 blocks, i.e. 0.5MB.
Accessing the tape in large blocks is mandatory for some devices and, according to some tar manuals, can be more efficient than access in smaller chunks.
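For completeness, and only to spell out arithmetic that is implied above, tar counts its blocks in 512-byte units, so a blocking factor of 1000 corresponds to:
% expr 1000 \* 512
512000
that is, writes of 512000 bytes, or roughly 0.5MB, per tape operation.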
As tape drives typically read or write data at a fixed rate, if that rate cannot be maintained by the rest of the system, the drive has to read or write some data and then stop and backtrack to a point where the next block can be written.
When this phenomenon occurs, the performance of the drive deteriorates drastically and we say that the drive is not streaming.
Writing in large blocks can help the drive to stream.
In our environment our drive consistently refused to stream.
Increasing the size of the blocking factor had absolutely no effect.
Going through the source code of the tar we are using (GNU tar 1.11.8, with a small local modification to handle the lack of stable inode numbers in SMB filesystems) I noticed that when compression is specified tar ignores the blocking factor.
That explained the lack of streaming and prompted me to examine the matter further.
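One quick way to confirm this kind of behaviour, assuming strace(1) is available on the backup host (this check is my own suggestion, not part of the original investigation), is to look at the size of the write calls that actually reach the output device:
% strace -f -s 0 -e trace=write tar czf /dev/tape -b 1000 /smb 2>&1 | tail
Each trace line ends with the byte count of a write; if tar really does ignore the blocking factor when compressing, the counts will stay well below the 512000 bytes that a blocking factor of 1000 should produce.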
A golden rule of performance optimisation is to profile the process to
be optimised.
I therefore first measured the time it takes for tar and gzip to process a medium-sized directory:
% time sh -c 'tar cf - /usr/src/linux | gzip -c >/dev/null'
375.40user 39.38system 7:02.18elapsed 98%CPU
This measurement establishes our performance bottom line. No matter how much we improve tape access we will never be able to make the process faster than 7 minutes. The next step was to measure the overhead of writing to tape:
% time sh -c 'tar cf - /usr/src/linux | gzip -c >/dev/tape'
389.81user 37.34system 10:19.96elapsed 68%CPU
So we now know that writing to the tape increases the execution
time by about 3 minutes, or 47%.
It is this increase that we want to fight.
As our full backup takes about 8 hours, optimising tape access could
result in significant time savings.
The measurement also tells us that the CPU is idle about 30% of the time.
Our task mix is therefore I/O bound and we have CPU power to spare in order
to optimise it.
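(To spell out the arithmetic behind these figures: 10:19.96 minus 7:02.18 is roughly 198 seconds, a bit over 3 minutes, and 198 extra seconds on a 422-second baseline is about 47%; similarly, 68% CPU utilisation leaves roughly 30% of the CPU idle.)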
As a first try I piped the gzip output through dd(1), specifying a large (0.5MB) output block size.
This causes dd to accumulate data from its input until it has 0.5MB of data available and then write that data in one large chunk to the tape.
% time sh -c 'tar cf - /usr/src/linux | gzip -c | dd obs=512k of=/dev/tape'
27979+1 records in
27+1 records out
392.65user 41.55system 11:28.28elapsed 63%CPU
This time I could hear and see the tape drive stream; a beautiful sound and sight.
Embarrassingly, the execution time increased by more than a minute (11%), while CPU utilisation decreased.
Streaming the tape drive did not appear to help.
An analysis of the situation was now in order.
Apparently, in the system I am using, write commands to the tape device are processed synchronously, i.e. the process issuing the write system call is suspended until the data is written to the device.
In the first case involving the tape drive, while data was being written by gzip to the tape drive (and gzip was waiting), tar could continue to process some files, writing its output to the pipe buffer connecting the two processes.
In the second case, where dd was added, dd took a longer time to write the 0.5MB block to the tape drive.
During that time the pipe buffer between gzip and dd got full, thereby blocking gzip.
That, in turn, made the pipe buffer between tar and gzip fill up, blocking tar as well.
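The stall is easy to reproduce in isolation. The toy pipeline below is my own illustration, not one of the original measurements; the reader deliberately sleeps for ten seconds before draining its input:
% time sh -c 'dd if=/dev/zero bs=1k count=1024 | { sleep 10; cat >/dev/null; }'
Although dd has almost no work to do, the command takes the full ten seconds: once the small pipe buffer fills, dd blocks until the sleeping reader starts consuming data, just as gzip and tar blocked behind the dd that was busy writing to the tape.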
The explanation of the timing anomaly made me realise the problem's root and think about a possible solution.
In order to optimise the backup process I would need a way to prevent the pipeline from stalling while data was being written to the tape drive.
The addition of another dd process after gzip could provide the necessary buffering:
% tar cf - /usr/src/linux | gzip -c | dd obs=512k | dd obs=512k of=/dev/tape
In the pipeline above gzip writes data to the first dd process.
As soon as this process accumulates 0.5MB it writes it to the second dd process using a single write call.
As soon as the second dd process accumulates 0.5MB (i.e. immediately after the first process writes it) it will issue a write call to send the data to the tape drive.
While data is being written to the tape the second dd process is blocked; however, the first one is still running, as it has written its block to the second process and is now again accumulating the next block to write.
Therefore tar and gzip can continue gathering and compressing data at the same time as data is being written to the tape drive.
Thus, the two dd processes effectively implement a double buffering mechanism.
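The idiom is easy to package for reuse. The sketch below is a helper of my own devising (the function name, the default device, and the 512k block size are arbitrary choices), wrapping the two dd processes so that any similar pipeline can be double-buffered in the same way:
tapebuf() {
        # Double-buffered write to a tape-like device: the first dd
        # accumulates 0.5MB blocks, the second writes them to the device
        # while the first is free to refill.
        dd obs=512k | dd obs=512k of="${1:-/dev/tape}"
}
# Example: tar cf - /smb | gzip -c | tapebuf /dev/tape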
The time measurements prove the validity of this approach:
% time sh -c 'tar cf - /usr/src/linux | gzip -c | dd obs=512k | dd obs=512k of=/dev/tape'
27979+1 records in
27+1 records out
27979+1 records in
27+1 records out
364.34user 48.57system 9:23.22elapsed 73%CPU
Our change resulted in a 28% decrease in the tape overhead giving a 10% decrease in overall execution time. Depending on the performance differences between the CPU and the peripheral in question, the execution time can decrease even more.
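(Once more spelling out the arithmetic: the tape overhead fell from about 198 seconds, i.e. 10:19.96 minus 7:02.18, to about 141 seconds, i.e. 9:23.22 minus 7:02.18; the 57 seconds saved are roughly 28% of the overhead and close to 10% of the original 620-second run.)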
As I mentioned in the beginning, one of the responsibilities of an OS is to optimise the utilisation of a computer system by the appropriate scheduling of tasks.
In our initial case, while the system was waiting for the tape drive to write the data, it could indeed run other, unrelated processes such as sendmail or httpd.
However, the operation of the processes in the backup pipeline could not be
parallelised beyond the limits imposed by the small buffer provided
by the implementation of pipes under Unix.
As soon as one process in the pipeline blocked, the whole pipeline stalled.
Adding an additional dd
process before the process writing to the
tape provided extra breathing space to the pipeline and increased the
possibility for parallel operation.
The same approach can be used in all cases where a pipeline consists
of a mixture of I/O and CPU-bound tasks.
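To make the general recipe concrete, here is the shape of the idiom with purely illustrative names; producer and filter stand for whatever CPU-bound stages a given pipeline contains, and /dev/slow for a synchronous, fixed-rate device such as a tape drive or CD-R recorder:
% producer | filter | dd obs=512k | dd obs=512k of=/dev/slow
The block size should of course be matched to what the particular device handles best.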