sed — stream editor

Sometimes it is better to use regular expressions to manipulate content rather than patching sources. This can be used for small changes, especially those which are likely to create patch conflicts across versions. The canonical way of doing this is via sed:

# This plugin is mapped to the 'h' key by default, which conflicts with some
# other mappings. Change it to use 'H' instead.
sed -i 's/\(noremap <buffer> \)h/\1H/' info.vim \
	|| die 'sed failed'

Another common example is appending a -gentoo-blah version string (some upstreams like us to do this so that they can tell exactly which package they're dealing with). Again, we can use sed. Note that the ${PR} variable will be set to r0 if we don't have a -r component in our version.

# Add in the Gentoo -r number to fluxbox -version output. We need to look
# for the line in version.h.in which contains "__fluxbox_version" and append
# our content to it.
if [[ "${PR}" == "r0" ]] ; then
	suffix="gentoo"
else
	suffix="gentoo-${PR}"
fi
sed -i \
    -e "s~\(__fluxbox_version .@VERSION@\)~\1-${suffix}~" \
    version.h.in || die "version sed failed"

It is also possible to extract content from existing files to create new files this way. Many app-vim ebuilds use this technique to extract documentation from the plugin files and convert it to Vim help format.

# This plugin uses an 'automatic HelpExtractor' variant. This causes
# problems for us during the unmerge. Fortunately, sed can fix this
# for us. First, we extract the documentation:
sed -e '1,/^" HelpExtractorDoc:$/d' \
	"${S}"/plugin/ZoomWin.vim > "${S}"/doc/ZoomWin.txt \
	|| die "help extraction failed"
# Then we remove the help extraction code from the plugin file:
sed -i -e '/^" HelpExtractor:$/,$d' "${S}"/plugin/ZoomWin.vim \
	|| die "help extract remove failed"

A summary of the more common ways of using sed and a description of commonly used address and token patterns follows. Note that some of these constructs are specific to GNU sed 4 — on non-GNU userland archs, the sed command must be aliased to GNU sed. Also note that GNU sed 4 is guaranteed to be installed as part of @system. This was not always the case, which is why some packages, particularly those which use sed -i, have a DEPEND upon >=sys-apps/sed-4.

Basic sed invocation

The basic form of a call is:

sed [ option flags ] \
	-e 'first command' \
	-e 'second command' \
	-e 'and so on' \
	input-file > output-file \
	|| die "Oops, sed didn't work!"

For cases where the input and output files are the same, the in-place option should be used. This is done by passing -i as one of the option flags.

By default, sed prints every line after processing it. To obtain only explicitly printed lines, the -n flag should be used.
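As a minimal sketch of the call form above, the following reads from standard input rather than a file so that the result can be inspected directly. The sample input and the variable names are made up for illustration; in an ebuild, each call would end in || die as shown earlier.

```shell
# Commands given with -e are applied to each line in the order they appear:
# "one" becomes "1", which the second command then turns into "uno".
chained=$(printf 'one\ntwo\n' | sed -e 's/one/1/' -e 's/1/uno/')
printf '%s\n' "${chained}"

# With -n nothing is printed unless a command asks for it. The p flag on the
# substitution prints only the lines that were actually changed.
quiet=$(printf 'one\ntwo\n' | sed -n -e 's/one/1/p')
printf '%s\n' "${quiet}"
```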

Simple text substitution using sed

The most common use of sed is to replace all instances of some text with different content. This is done as follows:

# replace all instances of "some text" with "different content" in
# somefile.txt
sed -i -e 's/some text/different content/g' somefile.txt || \
	die "Sed broke!"

If the pattern or the replacement string contains the forward slash character, it is usually easiest to use a different delimiter. Most punctuation characters are allowed, although backslash and any form of brackets should be avoided. Choose the delimiter with care so that it cannot appear in either the pattern or the replacement. This matters most when user-supplied data such as CFLAGS is substituted in, because such data may contain any character; the colon in particular is a poor choice of delimiter in that case.

# replace all instances of "/usr/local" with "/usr"
sed -i -e 's~/usr/local~/usr~g' somefile.txt || \
	die "sed broke"
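The hazard with substituted data can be sketched as follows. The value and the @P@ placeholder are hypothetical, chosen only to show what happens when the delimiter turns up inside the replacement.

```shell
# Hypothetical user-supplied value that happens to contain a tilde.
value='a~b'

# With ~ as the delimiter, the tilde inside the value ends the replacement
# early, and sed rejects what is left over as an invalid flag.
if printf 'prefix=@P@\n' | sed -e "s~@P@~${value}~" >/dev/null 2>&1 ; then
	tilde_status="ok"
else
	tilde_status="failed"
fi

# A delimiter that does not occur in the value works as intended.
pipe_result=$(printf 'prefix=@P@\n' | sed -e "s|@P@|${value}|")

printf '%s %s\n' "${tilde_status}" "${pipe_result}"
```

Since no single delimiter is safe for arbitrary input, it is worth picking one that is at least implausible for the data in question.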

Patterns can be made to match only at the start or end of a line by using the ^ and $ metacharacters. A ^ means "match at the start of a line only", and $ means "match at the end of a line only". By using both in a single statement, it is possible to match exact lines.

# Replace any "hello"s which occur at the start of a line with "howdy".
sed -i -e 's!^hello!howdy!' data.in || die "sed failed"
# Replace any "bye"s which occur at the end of a line with "cheerio!".
sed -i -e 's,bye$,cheerio!,' data.in || die "sed failed"
# Replace any lines which are exactly "change this line" with "have a
# cookie".
sed -i -e 's-^change this line$-have a cookie-' data.in || die "Oops"

To ignore case in the pattern, add the i flag after the final delimiter. This flag is a GNU extension.

# Replace any "emacs" instances (ignoring case) with "Vim"
sed -i -e 's/emacs/Vim/gi' editors.txt || die "Ouch"

Regular expression substitution using sed

It is also possible to do more complex matches with sed. Some examples could be:

  • Match any three digits
  • Match either "foo" or "bar"
  • Match any of the letters "a", "e", "i", "o" or "u"

These types of pattern can be chained together, leading to things like "match any vowel followed by two digits followed by either foo or bar".

To match any of a set of characters, a character class can be used. These come in three forms.

  • A backslash followed by a letter. \w, for example, matches a single "word" character (a letter, digit or underscore). \s matches a single whitespace character. Note that GNU sed has no \d shortcut for digits; use [0-9] or [[:digit:]] instead. A table of the more useful classes is provided later in this document.
  • A group of characters inside square brackets. [aeiou], for example, matches any one of 'a', 'e', 'i', 'o' or 'u'. Ranges are allowed, such as [0-9A-Fa-fxX], which could be used to match any hexadecimal digit or the characters 'x' and 'X'. Inverted character classes, such as [^aeiou], match any single character except those listed.
  • A POSIX character class is a special named group of characters that are locale-aware. For example, [[:alpha:]] matches any 'alphabet' character in the current locale. A table of the more useful classes is provided later in this document.
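The forms of character class listed above might be exercised like this. It is only a sketch: the input strings are invented, and the \s shortcut assumes GNU sed.

```shell
# Bracket expression with ranges: replace every hexadecimal digit.
hex=$(printf 'ff-zz\n' | sed -e 's/[0-9A-Fa-f]/#/g')

# Inverted class: replace everything that is not a vowel.
consonants=$(printf 'gentoo\n' | sed -e 's/[^aeiou]/-/g')

# POSIX class: strip all alphabetic characters.
letters=$(printf 'abc123\n' | sed -e 's/[[:alpha:]]//g')

# Backslash shortcut (GNU sed): \s matches both the space and the tab.
spaces=$(printf 'a b\tc\n' | sed -e 's/\s/_/g')

printf '%s\n' "${hex}" "${consonants}" "${letters}" "${spaces}"
```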

To match any one of multiple options, alternation can be used. The basic form is first\|second\|third.

To group items to avoid ambiguity, the \(parentheses\) construct may be used. To match "iniquity" or "infinity", one could use in\(iqui\|fini\)ty.

To optionally match an item, add a \? after it. For example, colou\?r matches both "colour" and "color". This can also be applied to character classes and groups in parentheses, for example \(in\)\?finite\(ly\)\?. Further atoms are available for matching "one or more", "zero or more", "at least n", "between n and m" and so on — these are summarised later in this document.
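Alternation, grouping and optional items can be tried out on standard input. The word lists below are invented for the sketch, and the \| and \? atoms assume GNU sed.

```shell
# Grouped alternation: print only "iniquity" and "infinity".
matches=$(printf 'iniquity\ninfinity\ninanity\n' \
	| sed -n -e '/^in\(iqui\|fini\)ty$/p')

# Optional item: both spellings match, the misspelt "colr" does not.
spelling=$(printf 'colour color colr\n' | sed -e 's/colou\?r/X/g')

# Optional groups: the prefix and the suffix may each be present or absent.
optional_groups=$(printf 'finite infinitely\n' \
	| sed -e 's/\(in\)\?finite\(ly\)\?/X/g')

printf '%s\n' "${matches}" "${spelling}" "${optional_groups}"
```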

There are also some special constructs which can be used in the replacement part of a substitution command. To insert the contents of the pattern's first matched bracket group, use \1, for the second use \2 and so on up to \9. An unescaped ampersand & character can be used to insert the entire contents of the match. These and other replace atoms are summarised later in this document.
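A short sketch of these replacement constructs follows, using made-up input. It also shows that an ampersand must be escaped when a literal & is wanted.

```shell
# \1 and \2 refer to the first and second groups, so they can be swapped.
swapped=$(printf 'key=value\n' | sed -e 's/\([a-z]*\)=\([a-z]*\)/\2=\1/')

# An unescaped & inserts the whole match.
wrapped=$(printf 'version 42\n' | sed -e 's/[0-9][0-9]*/<&>/')

# An escaped \& inserts a literal ampersand instead.
literal=$(printf 'fish chips\n' | sed -e 's/ / \& /')

printf '%s\n' "${swapped}" "${wrapped}" "${literal}"
```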

Addresses in sed

Many sed commands can be applied only to a certain line or range of lines. This could be useful if one wishes to operate only on the first ten lines of a document, for example.

The simplest form of address is a single positive integer. This will cause the following command to be applied only to the line in question. Line numbering starts from 1. GNU sed additionally accepts 0 as the start of a range, in the form 0,/pattern/, which allows the pattern to match on the very first line. If the address 100 is used on a 50 line document, the associated command will never be executed.

To match the last line in a document, the $ address may be used.

To match any lines that match a given regular expression, the form /pattern/ is allowed. This can be useful for finding a particular line and then making certain changes to it — sometimes it is simpler to handle this in two stages rather than using one big scary s/// command. When used in ranges, it can be useful for finding all text between two given markers or between a given marker and the end of the document.

To match a range of addresses, addr1,addr2 can be used. Most address constructs are allowed for both the start and the end addresses.

Addresses may be inverted with an exclamation mark. To match all lines except the last, $! may be used.

Finally, if no address is given for a command, the command is applied to every line in the input.
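The address forms described above can be sketched with the p command on a four-line input. The sample function is a helper invented for this example.

```shell
# Hypothetical helper that emits four lines: a, b, c and d.
sample() { printf 'a\nb\nc\nd\n' ; }

second=$(sample | sed -n -e '2p')        # a single line number
last=$(sample | sed -n -e '$p')          # the last line
middle=$(sample | sed -n -e '2,3p')      # a range of line numbers
not_last=$(sample | sed -n -e '$!p')     # an inverted address
from_c=$(sample | sed -n -e '/c/,$p')    # from a pattern to the end
beyond=$(sample | sed -n -e '100p')      # never reached, so no output

printf '%s\n' "${second}" "${last}" "${middle}" "${not_last}" "${from_c}"
```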

Other more complex options involving chaining addresses are available. These are not discussed in this document.

Content deletion using sed

Lines may be deleted from a file using an address followed by the d command. To delete the third line of a file, one could use 3d, and to filter out all lines containing "fred", /fred/d.
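Both deletions can be sketched on standard input. The names in the sample data are invented; note that /fred/d also removes "alfred", since the pattern matches anywhere in the line.

```shell
# Delete every line containing "fred".
no_fred=$(printf 'alice\nfred\nbob\nalfred\n' | sed -e '/fred/d')

# Delete the third line only.
no_third=$(printf 'one\ntwo\nthree\nfour\n' | sed -e '3d')

printf '%s\n' "${no_fred}" "${no_third}"
```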

Content extraction using sed

When the -n option is passed to sed, no output is printed by default. The p command can be used to display content. For example, to print lines containing "infra monkey", the command sed -n -e '/infra monkey/p' could be used. Ranges may also be printed — sed -n -e '/^START$/,/^END$/p' is sometimes useful.
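As a small sketch with invented input, the two commands mentioned above behave as follows. The range form includes both marker lines in its output.

```shell
# Print only the lines containing "infra monkey".
monkeys=$(printf 'infra monkey on call\nnobody home\n' \
	| sed -n -e '/infra monkey/p')

# Print everything from the START marker to the END marker, inclusive.
block=$(printf 'junk\nSTART\nkeep me\nEND\nmore junk\n' \
	| sed -n -e '/^START$/,/^END$/p')

printf '%s\n' "${monkeys}" "${block}"
```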

Inserting content using sed

To insert text with sed, use an address followed by the a or i command. The a command appends the text on the line after the match, while the i command inserts it on the line before the match.

As usual, an address can be either a line number or a regular expression: a line number command will only be executed once and a regular expression insert/append will be executed for each match.

# Add 'Bob' after the 'To:' line:
sed -i -e '/^To: $/a    Bob' data.in || die "Oops"

# Add 'From: Alice' before the 'To:' line:
sed -i -e '/^To: $/i    From: Alice' data.in || die "Oops"

# Note that the spacing between the 'i' or 'a' and 'Bob' or 'From: Alice' is
# simply ignored.

# Add 'From: Alice' indented by two spaces (only the first space needs to be
# escaped):
sed -i -e '/^To: $/i\  From: Alice' data.in || die "Oops"

Note that you should use a pattern match instead of a line number wherever possible. A script that relies on line numbers will break if, for example, a line is added at the beginning of the file.
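To see where the text ends up without editing a file, the a and i commands can be run on standard input. This sketch uses the one-line form of both commands, which is a GNU extension; the sample_mail helper and its contents are invented for the example.

```shell
# Hypothetical helper that emits a two-line message header.
sample_mail() { printf 'Subject: hi\nTo:\n' ; }

# a places the new line after the matching line.
appended=$(sample_mail | sed -e '/^To:$/a Bob')

# i places the new line before the matching line.
inserted=$(sample_mail | sed -e '/^To:$/i From: Alice')

printf '%s\n' "${appended}" "${inserted}"
```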

Regular expression atoms in sed

Basic atoms

Atom                          Purpose
text                          Literal text
\( \)                         Grouping
\|                            Alternation, a or b
* \? \+ \{\}                  Repeats, see below
.                             Any single character
^                             Start of line
$                             End of line
[abc0-9]                      Any one of
[^abc0-9]                     Any one character except
[[:alpha:]]                   POSIX character class, see below
\1 .. \9                      Backreference
\x (any special character)    Match character literally
\x (normal characters)        Shortcut, see below

Character class shortcuts

Atom    Description
\a      "BEL" character
\f      "Form Feed" character
\t      "Tab" character
\w      "Word" (a letter, digit or underscore) character
\W      "Non-word" character
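A few of these shortcuts in use, as a sketch with invented input. All three assume GNU sed.

```shell
# \W matches anything that is not a letter, digit or underscore.
dashed=$(printf 'foo_bar baz!\n' | sed -e 's/\W/-/g')

# \t matches a tab character.
tabs=$(printf 'a\tb\n' | sed -e 's/\t/ -> /')

# \w combined with \+ matches a whole word at a time.
words=$(printf 'foo_bar baz\n' | sed -e 's/\w\+/W/g')

printf '%s\n' "${dashed}" "${tabs}" "${words}"
```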

POSIX character classes

Read the source; it is the only place these are documented properly...

Class           Description
[[:alpha:]]     Alphabetic characters
[[:upper:]]     Uppercase alphabetics
[[:lower:]]     Lowercase alphabetics
[[:digit:]]     Digits
[[:alnum:]]     Alphabetic and numeric characters
[[:xdigit:]]    Digits allowed in a hexadecimal number
[[:space:]]     Whitespace characters
[[:print:]]     Printable characters
[[:punct:]]     Punctuation characters
[[:graph:]]     Visible characters (printable characters other than space)
[[:cntrl:]]     Control characters

Count specifiers

Atom       Description
*          Zero or more (greedy)
\+         One or more (greedy)
\?         Zero or one (greedy)
\{N\}      Exactly N
\{N,M\}    At least N and no more than M (greedy)
\{N,\}     At least N (greedy)
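The count specifiers can be compared side by side on made-up input. In the second case the greedy match takes three of the four digits in "4444", leaving one behind. The \+ atom assumes GNU sed.

```shell
# Exactly three digits; without the g flag only the first match is replaced.
exact=$(printf 'tel 555-1234\n' | sed -e 's/[0-9]\{3\}/###/')

# Between two and three digits, matched greedily.
bounded=$(printf 'a1 b22 c333 d4444\n' | sed -e 's/[0-9]\{2,3\}/N/g')

# At least two digits.
at_least=$(printf 'a1 b22 c333\n' | sed -e 's/[0-9]\{2,\}/N/g')

# One or more spaces collapsed into a single space.
squeezed=$(printf 'a    b  c\n' | sed -e 's/ \+/ /g')

printf '%s\n' "${exact}" "${bounded}" "${at_least}" "${squeezed}"
```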

Replacement atoms in sed

Atom        Description
\1 .. \9    Captured \( \) contents
&           The entire matched text
\L          All subsequent characters are converted to lowercase
\l          The following character is converted to lowercase
\U          All subsequent characters are converted to uppercase
\u          The following character is converted to uppercase
\E          Cancel the most recent \L or \U
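The case conversion atoms are GNU extensions. A minimal sketch of how they combine with & and the numbered groups, using invented input:

```shell
# \u uppercases only the next character, here the first of each word.
capitalised=$(printf 'hello world\n' | sed -e 's/\w\+/\u&/g')

# \U uppercases everything that follows until \E cancels it.
partial=$(printf 'hello world\n' \
	| sed -e 's/\(hello\) \(world\)/\U\1\E \2/')

# \L lowercases everything that follows.
lowered=$(printf 'GENTOO Linux\n' | sed -e 's/.*/\L&/')

printf '%s\n' "${capitalised}" "${partial}" "${lowered}"
```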

Details of sed match mechanics

GNU sed uses a traditional (non-POSIX) nondeterministic finite automaton with extensions to support capturing to do its matching. This means that in all cases, the match with the leftmost starting position will be favoured. Of all the leftmost possible matches, favour will be given to leftmost alternation options. Finally, all other things being equal favour will be given to the longest of the leftmost counting options.

Most of this is in violation of strict POSIX compliance, so it's best not to rely upon it. It is safe to assume that sed will always pick the leftmost match, and that it will match greedily with priority given to items earlier in the pattern.
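The two behaviours described as safe to rely upon, leftmost matching and greedy counting, can be sketched like this. The inputs are invented, and the third command shows a common way to keep a greedy match from running too far.

```shell
# The leftmost match wins, whatever the order of the alternatives: "foo" at
# the start of the line is replaced even though "bar" is listed first.
leftmost=$(printf 'foo bar foo\n' | sed -e 's/bar\|foo/X/')

# .* is greedy, so the match runs to the last > on the line.
greedy=$(printf '<a><b>\n' | sed -e 's/<.*>/T/')

# An inverted character class stops the match at the first > instead.
narrow=$(printf '<a><b>\n' | sed -e 's/<[^>]*>/T/')

printf '%s\n' "${leftmost}" "${greedy}" "${narrow}"
```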

Notes on performance with sed

The author recommends Mastering Regular Expressions by Jeffrey E. F. Friedl for those who wish to learn more about regexes. This text is remarkably devoid of phrases like "let t be a finite contiguous sequence such that t[n] ∈ ∑ ∀ n", and was not written by someone whose pay cheque depended upon them being able to express simple concepts with pages upon pages of mathematical and Greek symbols.