Set Operations Using Shell Script Commands

On many occasions I've had lists of names or other data stored in text files. The lists were really sets of names since no name was duplicated in the list. I often wanted to quickly and easily compute the intersection, union, relative complement, and symmetric difference of those sets. Because the sets were stored in text files, it seemed to me that the easiest way to compute these set operations was from a command line or shell. So I wrote the following four bash shell scripts. All the scripts use existing Unix commands, such as sort, sed, and comm and therefore assume that the sets are stored in text files with only one element listed on each line. All four scripts write their output to stdout.

Beside computing the set operations, each of the scripts provides an interesting study of the basic shell commands sort, sed, and comm.

Intersection

You can learn much more about computing
set operations from chapter 5 of
Advanced Programming Techniques
#!/bin/sh

# intersect
# Purpose: Compute intersection of sets $1 and $2.
#   A ^ B = {x : x e A and x e B}
# 
# Expected format of sets:
# 1. One element per line.
# 2. Blank lines are allowed.
# 3. Comments start with '#' and continue to end of line.
# 4. Leading and trailing white space will be removed.

TMP1=`mktemp -p intse`
TMP2=`mktemp -p intse`

trap "rm -f $TMP1 $TMP2 ; exit 0" EXIT
trap "rm -f $TMP1 $TMP2 ; exit 1" ABRT BUS HUP INT KILL QUIT SEGV

# Remove comments, leading and trailing white space and
# blank lines, and sort input files.
sed -e 's/#.*$//' \
  -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \
  -e '/^[[:space:]]*$/d' $1 | \
  sort -u > $TMP1
sed -e 's/#.*$//' \
  -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \
  -e '/^[[:space:]]*$/d' $2 | \
  sort -u > $TMP2

comm -12 $TMP1 $TMP2
exit 0

Union

#!/bin/sh

# union
# Purpose:  Compute the union of sets, i.e. merge sets.
#   A U B = {x : x e A or x e B or x e both}
# 
# Expected format of sets:
# 1. One element per line.
# 2. Blank lines are allowed.
# 3. Comments start with '#' and continue to end of line.
# 4. Leading and trailing white space will be removed.

TMP=`mktemp -c -p union`
trap "rm -f $TMP ; exit 0" EXIT
trap "rm -f $TMP ; exit 1" ABRT BUS HUP INT KILL QUIT SEGV

# Remove comments, leading and trailing white space and
# blank lines from the sets and append all sets to $TMP.
for i in "$@"; do
  sed -e 's/#.*$//' \
    -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \
    -e '/^[[:space:]]*$/d' "$i" | \
    sort -u >> $TMP
done

# Merge sets using sort with unique flag.
sort -u $TMP

exit 0

Relative Complement

#!/bin/sh

# relcompl
# Purpose: Compute the complement of set $1 relative to set $2.
#   A \ B = {x : x e A and x ne B}
# 
# Expected format of sets:
# 1. One element per line.
# 2. Blank lines are allowed.
# 3. Comments start with '#' and continue to end of line.
# 4. Leading and trailing white space will be removed.

TMP1=`mktemp -p relc`
TMP2=`mktemp -p relc`
trap "rm -f $TMP1 $TMP2 ; exit 0" EXIT
trap "rm -f $TMP1 $TMP2 ; exit 1" ABRT BUS HUP INT KILL QUIT SEGV

# Remove comments, leading and trailing white space and
# blank lines, and sort input files.
sed -e 's/#.*$//' \
  -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \
  -e '/^[[:space:]]*$/d' $1 | \
  sort -u > $TMP1
sed -e 's/#.*$//' \
  -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \
  -e '/^[[:space:]]*$/d' $2 | \
  sort -u > $TMP2

comm -23 $TMP1 $TMP2
exit 0

Symmetric Difference

#!/bin/sh

# symdiff
# Purpose: Compute the symmetric difference
#   of sets $1 and $2
#   A @ B = {x : x e A or x e B but x ne both}
# 
# Expected format of sets:
# 1. One element per line.
# 2. Blank lines are allowed.
# 3. Comments start with '#' and continue to end of line.
# 4. Leading and trailing white space will be removed.

TMP1=`mktemp -p symdif`
TMP2=`mktemp -p symdif`

trap "rm -f $TMP1 $TMP2 ; exit 0" EXIT
trap "rm -f $TMP1 $TMP2 ; exit 1" ABRT BUS HUP INT KILL QUIT SEGV

# Remove comments, leading and trailing white space and
# blank lines, and sort input files.
sed -e 's/#.*$//' \
  -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \
  -e '/^[[:space:]]*$/d' $1 | \
  sort -u > $TMP1
sed -e 's/#.*$//' \
  -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \
  -e '/^[[:space:]]*$/d' $2 | \
  sort -u > $TMP2

comm -3 $TMP1 $TMP2 | sed 's/\t//g'
exit 0

Copyright © 2010, Maia L.L.C.  All rights reserved.

Maia L.L.C. and its employees have used their best efforts in preparing this article. These efforts include the development, research, and testing of the theories and computer programs in this article to determine their correctness. Maia L.L.C. makes no warranty of any kind, expressed or implied, with regard to these programs or the documentation contained in this article. Maia L.L.C. shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.