Storage space management procedures and patterns for Operations teams in High Uptime environments

Stephen D. Cope


or: using df safely

WARNING: Partition / Less than 20% free space on busyserver

(but in red, with flashing)

Extreme panic!

Typical series of steps to resolve:

ssh root@busyserver
cd /
du | sort -n
cd /var/log
ls -sS | head
gzip wtmp

There is only one correct step shown above. Do not use this pattern.

Importance of free disk space

If you run out of disk space you can't write log files, you can't write spool files, you can't create lock files. Everything stops.

Presuming you actually want to write files, which most programs do.

df vs du

df will quickly read a file system and tell you how much space is free.

du will traverse the given directory and every sub-directory, counting up each file within: a lot of disk I/O, and slow. It also misses deleted files that are still held open.

du -x

Use -x to make sure you don't traverse across file systems.

We are interested in this file system.

du -xh | sort -hr

Use -h to make it human-readable.

The -h option for sort is a relative newcomer.

| head because you only want the largest.

Sparse files

Not all of the file's blocks are allocated, so it uses less disk space than its apparent size.

ls -l reports the apparent (sparse) size.

ls -s and du -h report the size in use. (This is what matters.)

Hint: wtmp is a sparse file.
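A quick way to see the difference, assuming GNU coreutils and a throwaway path chosen for this demo:

```shell
# Make a 100 MB sparse file: the size is recorded, but no blocks are written.
truncate -s 100M /tmp/sparse_demo

ls -lh /tmp/sparse_demo    # apparent size: 100M
du -h  /tmp/sparse_demo    # blocks actually allocated: 0
ls -s  /tmp/sparse_demo    # same story, in blocks

rm /tmp/sparse_demo
```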

Deleted files

Deleted files still occupy disk space until they are closed.

Open file handles keep the file alive: restart the process that holds them.
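A minimal demonstration of the effect (the file name is arbitrary); on Linux the still-open file stays reachable under /proc:

```shell
exec 3> /tmp/held_open          # open a file on descriptor 3
head -c 1M /dev/zero >&3        # write 1 MB through it
rm /tmp/held_open               # the name is gone...
stat -L -c %s /proc/$$/fd/3     # ...but the data is still there: 1048576
exec 3>&-                       # close the descriptor; NOW the space is freed
```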


NFS mounts

If your NFS mount has gone AWOL, don't use df or du on it.

Processes in state D are waiting for I/O. Good luck with that.
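Two defensive habits when a mount may be hung, sketched below; the mount point is a placeholder, and timeout is from GNU coreutils:

```shell
# Which processes are already stuck in uninterruptible sleep (state D)?
ps -eo pid,stat,wchan:20,comm | awk '$2 ~ /^D/'

# Give df a deadline instead of letting it hang your shell:
timeout 5 df -k /mnt/suspect-nfs || echo "mount is not responding"
```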

Finding files (A)

You can use lsof and look for open files with "log" in the name.
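One way to do it, assuming lsof is installed; the NAME column is the last field:

```shell
# Every currently open path with "log" in the name, de-duplicated:
lsof -nP 2>/dev/null | awk '$NF ~ /log/ { print $NF }' | sort -u
```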

Finding files (B)

If you roughly know the size of the log files, but don't know where they are, try:

find -xdev -type f -mtime -1 -size +10M

And remember -xdev is so you don't traverse file systems.

Finding files (0)

Always use find -print0 | xargs -r0 when piping filenames through.

Don't choke on files with spaces in the names.
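Putting the pieces together: the on-disk size of everything over 10 MB on this file system, largest first (assumes GNU find, xargs, and sort):

```shell
find . -xdev -type f -size +10M -print0 \
  | xargs -r0 du -h \
  | sort -hr | head
```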

Now what?

You've found the files. What do you do?

Compress them.

Compression with 0 free space

Presume the files may not be deleted due to regulatory requirements.

Use another file system:

  • move some files to the other file system (large disk I/O hit),
  • compress files on both (CPU hit),
  • then move files back.

Which file system should you use?

/dev/shm - limited by RAM, all contents lost if system restarts

/tmp - other users writing there, make sure names don't clash, might also be tmpfs

Don't use a remote file system unless you must. The cost of transferring is roughly equivalent to compressing locally.
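The shuffle, sketched with made-up file names and demo data, assuming /dev/shm has room:

```shell
# Demo setup: pretend these are the logs filling the full file system.
printf 'regulated data\n' > huge-a.log
printf 'more data\n'      > huge-b.log

mv huge-a.log /dev/shm/            # 1: free some space on the full fs
gzip /dev/shm/huge-a.log           # 2a: compress on the borrowed fs
gzip huge-b.log                    # 2b: compress in place now there's room
mv /dev/shm/huge-a.log.gz .        # 3: bring the compressed copy home
```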

Which compression tool?

gzip is fastest, xz compresses smallest, bzip2 is good but superseded (it uses more memory to compress).

You want to compress faster than logs are written, which means, gzip!

Don't impact running tasks: use nice, and if you don't use isolcpus, add taskset.
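For example (the file name is a stand-in, and CPU 0 is an arbitrary choice):

```shell
printf 'demo log line\n' > big.log      # stand-in for a real log file
nice -n 19 taskset -c 0 gzip big.log    # lowest priority, pinned to one CPU
```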

Compressed an open file?

After gzip finishes, it deletes the source file. If a process still had that file open, it is now writing to a deleted file. Oops!

lsof | grep deleted
tail -f /proc/PID/fd/N

Can you send a signal to trigger re-opening of log files?
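Many daemons reopen their logs on SIGHUP (check the daemon's docs first; some treat HUP as "restart" instead). The mechanism, sketched with a toy shell daemon writing to a file name of my choosing:

```shell
# Toy daemon: reopens its log by name whenever it receives HUP.
( trap 'exec >> /tmp/toy.log' HUP      # on HUP: reopen the log
  exec >> /tmp/toy.log                 # initial open
  while :; do echo tick; sleep 1; done ) &
daemon=$!

sleep 1
mv /tmp/toy.log /tmp/toy.log.old       # "rotate": daemon still writes to the old inode
kill -HUP "$daemon"                    # ask it to reopen; a fresh /tmp/toy.log appears
sleep 3
kill "$daemon"
```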

Why leave 80% free?

Legacy thing enshrined in policy and monitoring guidelines.

Maybe it affects fragmentation: is it still a problem where typical file size is much smaller than 1% of a disk's capacity? Flash memory doesn't care about file layout.

What about with massive file systems? Wasted money? Throw away 20% of your storage budget.

Predicting when you'll run out

Sample now, wait, sample again. Congratulations!

List your log directory, wait, check again.

Simple mathematics.

$ df -k . ; sleep 60 ; df -k .
Filesystem      1K-blocks      Used Available Use% Mounted on
/dev/tipene--vg 200620295 104453868  85905744  55% /
Filesystem      1K-blocks      Used Available Use% Mounted on
/dev/tipene--vg 200620295 104978160  85381452  56% /
$ expr 85905744 - 85381452
524292
$ expr 524292 / 60
8738
$ expr 85381452 / 8738
9771

In this example, 9771 seconds: about 3 hours.
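The same arithmetic can be scripted; a sketch assuming the df "Available" figure is field 4 of line 2, with a short 5-second sample interval for the demo:

```shell
interval=5                                 # seconds between samples
a=$(df -k . | awk 'NR==2 {print $4}')      # KB available now
sleep "$interval"
b=$(df -k . | awk 'NR==2 {print $4}')      # KB available later
rate=$(( (a - b) / interval ))             # KB being consumed per second
if [ "$rate" -gt 0 ]; then
    echo "roughly $(( b / rate )) seconds until full"
else
    echo "not filling up right now"
fi
```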

More Predicting

  • See how much you're writing,
  • look at your retention period,
  • adjust the amount you write or the size of your file system to suit.

Using what we've learned

ssh busyserver           # don't use root
cd /
du -xh | sort -hr | head # don't cross filesystems
cd /misplacedlog
ls -ltr | head           # oldest logs first
nice gzip oldlog.log     # be nice

Cheat sheet

Nine largest directories on this file system:

du -xh | sort -hr | head

Shift files into an "archived" sub-directory then compress them.

mkdir -p archived \
&& find -maxdepth 1 -type f -mmin +60 -print0 \
 | xargs -r0 mv -t archived/ \
&& nice gzip archived/*.log

Copyright (c) 2018 Stephen D. Cope