Introducing the spacebot

I'm working on another Bash script: Spacebot. The name is more exciting than the actual script. It's got nothing to do with drones and Raspberry Pis; rather, its job is to find files over a certain size in a directory and/or the largest subdirectories.

Background: options are good

The script is inspired by an existing script we use at the company I work for. When we get a disk space alert for a Linux server, we typically run the script. It finds large files and directories and sends an email to our ticketer. We then look at the output to see if there's anything that can be cleared.

The existing script is very simple, which is both a good and a bad thing. It's nice to have a command that you can just run, without having to remember any options. On the other hand, options are rather useful. For example, the old script always looks for files in / that are larger than 100MB. If you get an alert for, say, a /home partition, you have to edit the script itself (and remember to undo the change after it has finished). Similarly, there's no easy way to change the file size or the email address the output is sent to.

One particular gripe I have with the old script is that the list of large files is sorted by size. I prefer it to be sorted by file name, as there are often lots of large files in a particular directory – when the list is sorted by size you end up looking at the same directory multiple times. The new script has a --sort option that lets you sort the output by size, date (mtime) or file name.
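
For context, here's a minimal sketch of how that kind of option handling might look in Bash. Only --sort (and its default of sorting by size) comes from the script itself; the other flag names, the defaults and the mail option are illustrative, so Spacebot's real interface may well differ:

# Illustrative defaults.
dir="/"
file_size="100M"
output_sort="size"
mail_to="alerts@example.com"

while [ "$#" -gt 0 ]; do
  case "$1" in
    --dir)  dir="${2:?missing value for --dir}";          shift 2 ;;
    --size) file_size="${2:?missing value for --size}";   shift 2 ;;
    --sort) output_sort="${2:?missing value for --sort}"; shift 2 ;;
    --mail) mail_to="${2:?missing value for --mail}";     shift 2 ;;
    *)
      printf '%s\n' "*** Error: unknown option ($1) ***" >&2
      exit 1
    ;;
  esac
done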

Finding large files

For finding files I'm using the code below. $dir is the directory that should be inspected; $file_size is the minimum size, in find's -size syntax (e.g. 100M); and the stat command grabs the size (in bytes), mtime (as a Unix timestamp) and file name of each match. The -prune expressions stop find from descending into /proc, /sys and /run, and the + form of -exec batches the file names so stat is run as few times as possible:

find "$dir" \
-not \( -path "/proc" -prune \) \
-not \( -path "/sys" -prune \) \
-not \( -path "/run" -prune \) \
-type f \
-size +"$file_size" \
-exec stat --printf='%s\t%Y\t%n\n' '{}' \+ \
> "$output_file_tmp"
        
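If you prefer the more common prune idiom, the same exclusions can be written as a single pruned group; a sketch that should behave the same for these paths:

find "$dir" \
\( -path /proc -o -path /sys -o -path /run \) -prune -o \
-type f \
-size +"$file_size" \
-exec stat --printf='%s\t%Y\t%n\n' '{}' + \
> "$output_file_tmp"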

Next, I'm using a bit of awk to convert the file size to mebibytes and the date to a human-readable format (note that strftime() isn't part of POSIX awk; it's provided by GNU awk):

awk -F"\t" '
  {
    bytes = $1 /1024/1024; $2 = strftime("%Y-%m-%d", $2);
    printf "%.0f%s\t%s\t%s\n", bytes, "MB", $2, $3
  }
' "$output_file_tmp" \
| sort_cmd \
| tee -a "$output"
        
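A line of the final output then looks something like this, with the three columns still separated by tabs (the size, date and path below are made up just to show the format):

250MB    2024-03-01    /var/log/some-big-file.log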

Sorting the output

The sort_cmd command in the pipeline above is a shell function. By default the script still sorts the output by size, but this can be changed with the --sort option. The validate_output_sort function validates the argument and defines sort_cmd at the same time:

validate_output_sort() {
  case "$output_sort" in 
    size)
      sort_cmd() { sort -hr; }
    ;;
    date)
      sort_cmd() { sort -rk 2; }
    ;;
    file|name)
      sort_cmd() { sort -k 3; }
    ;;
    *)
      printf '%s\n' "*** Error: invalid sort order ($output_sort) ***"
      exit 1
    ;;
  esac
}
      
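As a design note: instead of defining a function per branch, the sort flags could also be kept in an array and expanded in the pipeline as sort "${sort_args[@]}". Just a sketch of an alternative, not what Spacebot does:

case "$output_sort" in
  size)      sort_args=(-hr)   ;;
  date)      sort_args=(-rk 2) ;;
  file|name) sort_args=(-k 3)  ;;
  *)
    printf '%s\n' "*** Error: invalid sort order ($output_sort) ***" >&2
    exit 1
  ;;
esac

# ...later, in the pipeline:
# ... | sort "${sort_args[@]}" | tee -a "$output"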

Finding large directories

While looking to improve the du command used to find large directories, I found that the utility has --exclude and --max-depth options. I hadn't spotted them in the man page, but both are listed when you run du --help. Excluding directories such as /proc obviously makes sense. The --max-depth option doesn't speed things up, but it's nice to be able to list just the largest subdirectories in $dir (using --max-depth=1).

du -ah "$dir" \
--exclude=/proc \
--exclude=/sys \
--exclude=/run \
--max-depth="$dir_depth" \
| sort -hr \
| head -n "$dir_limit" \
| tee -a "$output"
       
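As a concrete example, listing just the first level under /home and keeping the 20 largest entries would look something like this (the path and limit are illustrative, and the /proc-style excludes never match under /home so they're left out):

du -ah /home --max-depth=1 | sort -hr | head -n 20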

Next up

The main improvement I want to make is adding a command line option to exclude directories in the find and du commands. Different servers have different directories that can safely be ignored – CloudLinux servers, for instance, have a /usr/share/cagefs-skeleton/proc directory. The first thing on the to-do list, though, is finding and squashing the bugs…
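
For what it's worth, one way that exclude option could work is to collect the patterns in an array and splice them into both commands. A rough sketch, not the final implementation (the --exclude flag name is hypothetical):

# Default excludes, extended by a hypothetical --exclude option.
excludes=(/proc /sys /run)

# Build the per-command argument lists.
find_excludes=()
du_excludes=()
for e in "${excludes[@]}"; do
  find_excludes+=( -not \( -path "$e" -prune \) )
  du_excludes+=( --exclude="$e" )
done

find "$dir" "${find_excludes[@]}" \
-type f \
-size +"$file_size" \
-exec stat --printf='%s\t%Y\t%n\n' '{}' + \
> "$output_file_tmp"

du -ah "$dir" "${du_excludes[@]}" \
--max-depth="$dir_depth"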