sed — stream editor
Sometimes it is better to use regular expressions to manipulate content rather
than patching sources. This can be used for small changes, especially those
which are likely to create patch conflicts across versions. The canonical way of
doing this is via sed:
# This plugin is mapped to the 'h' key by default, which conflicts with some
# other mappings. Change it to use 'H' instead.
sed -i 's/\(noremap <buffer> \)h/\1H/' info.vim \
|| die 'sed failed'
Another common example is appending a -gentoo-blah version string (some
upstreams like us to do this so that they can tell exactly which package they're
dealing with). Again, we can use sed. Note that the ${PR} variable will be set
to r0 if we don't have a -r component in our version.
# Add in the Gentoo -r number to fluxbox -version output. We need to look
# for the line in version.h.in which contains "__fluxbox_version" and append
# our content to it.
if [[ "${PR}" == "r0" ]] ; then
    suffix="gentoo"
else
    suffix="gentoo-${PR}"
fi
sed -i \
    -e "s~\(__fluxbox_version .@VERSION@\)~\1-${suffix}~" \
    version.h.in || die "version sed failed"
It is also possible to extract content from existing files to create new files
this way. Many app-vim ebuilds use this technique to extract documentation
from the plugin files and convert it to Vim help format.
# This plugin uses an 'automatic HelpExtractor' variant. This causes
# problems for us during the unmerge. Fortunately, sed can fix this
# for us. First, we extract the documentation:
sed -e '1,/^" HelpExtractorDoc:$/d' \
    "${S}"/plugin/ZoomWin.vim > "${S}"/doc/ZoomWin.txt \
    || die "help extraction failed"
# Then we remove the help extraction code from the plugin file:
sed -i -e '/^" HelpExtractor:$/,$d' "${S}"/plugin/ZoomWin.vim \
|| die "help extract remove failed"
A summary of the more common ways of using sed and a description of
commonly used address and token patterns follows. Note that some of these
constructs are specific to GNU sed 4; on non-GNU userland archs, the sed
command must be aliased to GNU sed. Also note that GNU sed 4 is guaranteed to
be installed as part of @system. This was not always the case, which is why
some packages, particularly those which use sed -i, have a DEPEND upon
>=sys-apps/sed-4.
Basic sed invocation
The basic form of a call is:
sed [ option flags ] \
-e 'first command' \
-e 'second command' \
-e 'and so on' \
input-file > output-file \
|| die "Oops, sed didn't work!"
For cases where the input and output files are the same, the in-place option
should be used. This is done by passing -i as one of the option flags.
By default, sed prints every line of the resulting content. To obtain only
explicitly printed lines, the -n flag should be used.
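As a minimal sketch of the invocation form described above (assuming a POSIX shell and made-up input text), the following chains two -e commands and then uses -n so that only explicitly printed lines appear. The variable names are illustrative only.

```shell
# Hypothetical three-line input, fed to sed on stdin instead of from a file.
input=$(printf 'alpha\nbeta\ngamma\n')

# Two commands applied in order to every line; every line is printed.
all_lines=$(printf '%s\n' "$input" | sed -e 's/alpha/ALPHA/' -e 's/gamma/GAMMA/')

# With -n nothing is printed unless a command asks for it, here via 'p'.
only_printed=$(printf '%s\n' "$input" | sed -n -e '/beta/p')

printf '%s\n' "$all_lines"
printf '%s\n' "$only_printed"
```

In an ebuild the same commands would normally read from a file and end with || die, as in the form shown above.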
Simple text substitution using sed
The most common use of sed is to replace all instances of some text with
different content. This is done as follows:
# replace all instances of "some text" with "different content" in
# somefile.txt
sed -i -e 's/some text/different content/g' somefile.txt || \
die "Sed broke!"
The /g flag is required to replace all occurrences. Without this
flag, only the first match on each line is replaced.
Note that this will also replace irksome texting with irkdifferent contenting,
which may not be desired.
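The effect of the g flag, and the substring pitfall, can be seen on a single made-up line. This is only a sketch; the sample sentence and variable names are assumptions chosen for illustration.

```shell
# One line containing the phrase twice, plus a word that merely contains it.
line='some text and some text, irksome texting'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/some text/different content/')

# With g: every match is replaced, including the one inside "irksome texting".
every_match=$(printf '%s\n' "$line" | sed -e 's/some text/different content/g')

printf '%s\n' "$first_only"
printf '%s\n' "$every_match"
```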
If the pattern or the replacement string contains the forward slash character,
it is usually easiest to use a different delimiter. Most punctuation characters
are allowed, although backslash and any form of brackets should be avoided. You
should choose your delimiter with care to ensure it cannot appear in any
strings involved in the subject/replacement. For example, using sed with
CFLAGS is hazardous because it is user-supplied data (so may contain any
character), but one should in particular avoid e.g. the colon here.
# replace all instances of "/usr/local" with "/usr"
sed -i -e 's~/usr/local~/usr~g' somefile.txt || \
die "sed broke"
Patterns can be made to match only at the start or end of a line by using the
^ and $ metacharacters. A ^ means "match at the start of a line only", and $
means "match at the end of a line only". By using both in a single statement,
it is possible to match exact lines.
# Replace any "hello"s which occur at the start of a line with "howdy".
sed -i -e 's!^hello!howdy!' data.in || die "sed failed"
There is no need for a !g suffix here, since a pattern anchored with ^ can
match at most once per line.
# Replace any "bye"s which occur at the end of a line with "cheerio!".
sed -i -e 's,bye$,cheerio!,' data.in || die "sed failed"
# Replace any lines which are exactly "change this line" with "have a
# cookie".
sed -i -e 's-^change this line$-have a cookie-' data.in || die "Oops"
To ignore case in the pattern, add the /i flag.
# Replace any "emacs" instances (ignoring case) with "Vim"
sed -i -e 's/emacs/Vim/gi' editors.txt || die "Ouch"
Regular expression substitution using sed
It is also possible to do more complex matches with sed. Some examples could
be:
- Match any three digits
- Match either "foo" or "bar"
- Match any of the letters "a", "e", "i", "o" or "u"
These types of pattern can be chained together, leading to things like "match any vowel followed by two digits followed by either foo or bar".
To match any of a set of characters, a character class can be used. These come in three forms.
- A backslash followed by a letter. \w, for example, matches a single "word" character (a letter, digit or underscore), and \s matches a single whitespace character. Note that GNU sed does not treat \d as a digit class; use [0-9] or [[:digit:]] to match a single digit. A table of the more useful classes is provided later in this document.
- A group of characters inside square brackets. [aeiou], for example, matches any one of 'a', 'e', 'i', 'o' or 'u'. Ranges are allowed, such as [0-9A-Fa-fxX], which could be used to match any hexadecimal digit or the characters 'x' and 'X'. Inverted character classes, such as [^aeiou], match any single character except those listed.
- A POSIX character class, which is a special named group of characters that is locale-aware. For example, [[:alpha:]] matches any 'alphabet' character in the current locale. A table of the more useful classes is provided later in this document.
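The three forms of character class can be compared side by side. This is a sketch assuming GNU sed (the \w, \s and \+ atoms are GNU extensions) and a C or UTF-8 locale; the input strings are made up.

```shell
# Bracket expression: hex digits plus 'x' and 'X'.
hex=$(printf '> 0x1F <\n' | sed -e 's/[0-9A-Fa-fxX]/#/g')

# Inverted bracket expression: everything that is not a vowel.
non_vowels=$(printf 'gentoo\n' | sed -e 's/[^aeiou]/-/g')

# POSIX class: alphabetic characters in the current locale.
letters=$(printf 'abc 123\n' | sed -e 's/[[:alpha:]]/X/g')

# Backslash shortcuts: runs of word characters, runs of whitespace.
words=$(printf 'foo_1 bar-2\n' | sed -e 's/\w\+/W/g')
blanks=$(printf 'a \tb\n' | sed -e 's/\s\+/_/g')

printf '%s\n' "$hex" "$non_vowels" "$letters" "$words" "$blanks"
```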
Note that a[^b] does not mean "match a, so long as it does not
have a 'b' after it". It means "match a followed by exactly one character which
is not a 'b'". This is important when one considers a line ending in the
character 'a'.
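A short sketch makes the difference visible. The inputs are made up, and & in the replacement stands for the whole match.

```shell
# 'a' followed by a character other than 'b': the pair is matched.
mid=$(printf 'xay\n' | sed -e 's/a[^b]/<&>/')

# Line ends in 'a': there is no following character, so nothing matches.
end=$(printf 'xa\n' | sed -e 's/a[^b]/<&>/')

# 'a' followed by 'b': the inverted class rejects it.
blocked=$(printf 'xab\n' | sed -e 's/a[^b]/<&>/')

printf '%s\n' "$mid" "$end" "$blocked"
```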
Older versions of the sed documentation (man sed and sed.info) do not mention
that POSIX character classes are supported. Consult IEEE Std 1003.1-2017,
section 9.3 for full details of how these should work, and the sed source code
for full details of how these actually work.
To match any one of multiple options, alternation can be used. The basic form
is first\|second\|third.
To group items to avoid ambiguity, the \(parentheses\) construct may be
used. To match "iniquity" or "infinity", one could use in\(iqui\|fini\)ty.
To optionally match an item, add a \? after it. For example, colou\?r
matches both "colour" and "color". This can also be applied to character classes
and groups in parentheses, for example \(in\)\?finite\(ly\)\?. Further atoms
are available for matching "one or more", "zero or more", "at least n", "between
n and m" and so on; these are summarised later in this document.
There are also some special constructs which can be used in the replacement part
of a substitution command. To insert the contents of the pattern's first matched
bracket group, use \1, for the second use \2 and so on up to \9. An
unescaped ampersand character (&) can be used to insert the entire contents of
the match. These and other replace atoms are summarised later in this document.
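The constructs above can be tried on small made-up inputs. This sketch assumes GNU sed, since \|, \? and \+ are GNU extensions to basic regular expressions.

```shell
# Alternation inside a group: matches "iniquity" and "infinity" only.
alt=$(printf 'iniquity infinity inanity\n' | sed -e 's/in\(iqui\|fini\)ty/X/g')

# Optional item: the 'u' may or may not be present.
opt=$(printf 'colour color\n' | sed -e 's/colou\?r/C/g')

# Backreferences: swap the two captured groups around the '='.
swap=$(printf 'key=value\n' | sed -e 's/\([a-z]*\)=\([a-z]*\)/\2=\1/')

# Ampersand: wrap the entire match (one or more 'l') in brackets.
whole=$(printf 'hello\n' | sed -e 's/l\+/[&]/')

printf '%s\n' "$alt" "$opt" "$swap" "$whole"
```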
Addresses in sed
Many sed commands can be applied only to a certain line or range of lines.
This could be useful if one wishes to operate only on the first ten lines of a
document, for example.
The simplest form of address is a single positive integer. This will cause the following command to be applied only to the line in question. Line numbering starts from 1. GNU sed additionally accepts 0 as the first address of a range of the form 0,/pattern/, which allows the pattern to match on the very first line of input. If the address 100 is used on a 50 line document, the associated command will never be executed.
To match the last line in a document, the $ address may be used.
To match any lines that match a given regular expression, the form
/pattern/ is allowed. This can be useful for finding a particular line and
then making certain changes to it; sometimes it is simpler to handle this in
two stages rather than using one big scary s/// command. When used in
ranges, it can be useful for finding all text between two given markers or
between a given marker and the end of the document.
To match a range of addresses, addr1,addr2 can be used. Most address
constructs are allowed for both the start and the end addresses.
Addresses may be inverted with an exclamation mark. To match all lines except
the last, $! may be used.
Finally, if no address is given for a command, the command is applied to every line in the input.
GNU sed does not support the % address forms found in some other
implementations. It also doesn't support /addr/+offset; that's an ex thing.
Other more complex options involving chaining addresses are available. These are not discussed in this document.
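The address forms discussed in this section can be sketched with the p command and -n, so that only addressed lines are shown. The five-line input and the variable names are assumptions for illustration.

```shell
input=$(printf 'one\ntwo\nthree\nfour\nfive\n')

second=$(printf '%s\n' "$input" | sed -n -e '2p')                  # single line number
last=$(printf '%s\n' "$input" | sed -n -e '$p')                    # last line
range=$(printf '%s\n' "$input" | sed -n -e '2,4p')                 # numeric range
regex_range=$(printf '%s\n' "$input" | sed -n -e '/two/,/four/p')  # pattern range
not_last=$(printf '%s\n' "$input" | sed -n -e '$!p')               # inverted address
past_end=$(printf '%s\n' "$input" | sed -n -e '100p')              # never executed

printf '%s\n' "$second" "$last" "$range" "$regex_range" "$not_last"
```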
Content deletion using sed
Lines may be deleted from a file using a command of the form address d. To
delete the third line of a file, one could use 3d, and to filter out all lines
containing "fred", /fred/d.
Note that /fred/d is not the same as s/.*fred.*//; the former
will delete the lines including the newline, whereas the latter will delete the
lines' contents but not the newline.
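The difference between deleting a line and emptying it shows up as a leftover blank line. A minimal sketch with made-up input:

```shell
input=$(printf 'alice\nfred was here\nbob\n')

# The d command removes the whole line, newline included.
deleted=$(printf '%s\n' "$input" | sed -e '/fred/d')

# The substitution empties the line but leaves its newline behind.
blanked=$(printf '%s\n' "$input" | sed -e 's/.*fred.*//')

printf '%s\n' "$deleted"
printf '%s\n' "$blanked"
```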
Content extraction using sed
When the -n option is passed to sed, no output is printed by default.
The p command can be used to display content. For example, to print lines
containing "infra monkey", the command sed -n -e '/infra monkey/p' could be
used. Ranges may also be printed; sed -n -e '/^START$/,/^END$/p' is
sometimes useful.
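As a sketch of range extraction, the following pulls out the block between two marker lines. The markers are included in the output, and the input text is hypothetical.

```shell
input=$(printf 'header\nSTART\npayload 1\npayload 2\nEND\nfooter\n')

# Print only the lines from START through END, inclusive.
block=$(printf '%s\n' "$input" | sed -n -e '/^START$/,/^END$/p')

# Print only the lines matching a single pattern.
single=$(printf '%s\n' "$input" | sed -n -e '/payload 2/p')

printf '%s\n' "$block"
printf '%s\n' "$single"
```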
Inserting content using sed
To insert text with sed, use an address followed by the a or i command. The
a command inserts on the line following the match, while the i command
inserts on the line before the match.
As usual, an address can be either a line number or a regular expression: a line number command will only be executed once and a regular expression insert/append will be executed for each match.
# Add 'Bob' after the 'To:' line:
sed -i -e '/^To: $/a Bob' data.in || die "Oops"
# Add 'From: Alice' before the 'To:' line:
sed -i -e '/^To: $/i From: Alice' data.in || die "Oops"
# Note that the spacing between the 'i' or 'a' and 'Bob' or 'From: Alice' is simply ignored.
# Add 'From: Alice' indented by two spaces: (You only need to escape the first space)
sed -i -e '/^To: $/i\ From: Alice' data.in || die "Oops"
Note that you should use a match instead of a line number wherever possible. This reduces problems if a line is added at the beginning of the file, for example, causing your sed script to break.
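The one-line a and i forms can be tried without touching a file by feeding sed from stdin. This sketch assumes GNU sed, since the one-line form of these commands is a GNU extension, and the message text is made up.

```shell
# Note the trailing space after "To:", which the pattern requires.
mail=$(printf 'Subject: hi\nTo: \nBody\n')

# Insert before and append after the line matching the regular expression.
result=$(printf '%s\n' "$mail" | sed -e '/^To: $/i From: Alice' -e '/^To: $/a Bob')

# A line number address runs the command exactly once.
numbered=$(printf 'x\ny\n' | sed -e '1i top')

printf '%s\n' "$result"
printf '%s\n' "$numbered"
```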
Regular expression atoms in sed
Basic atoms
Atom | Purpose |
---|---|
text | Literal text |
\( \) | Grouping |
\| | Alternation, a or b |
* \? \+ \{\} | Repeats, see below |
. | Any single character |
^ | Start of line |
$ | End of line |
[abc0-9] | Any one of |
[^abc0-9] | Any one character except |
[[:alpha:]] | POSIX character class, see below |
\1 .. \9 | Backreference |
\x (any special character) | Match character literally |
\x (normal characters) | Shortcut, see below |
Character class shortcuts
Atom | Description |
---|---|
\a | "BEL" character |
\f | "Form Feed" character |
\t | "Tab" character |
\w | "Word" (a letter, digit or underscore) character |
\W | "Non-word" character |
POSIX character classes
Read the source, it's the only place these're documented properly...
Class | Description |
---|---|
[[:alpha:]] | Alphabetic characters |
[[:upper:]] | Uppercase alphabetics |
[[:lower:]] | Lowercase alphabetics |
[[:digit:]] | Digits |
[[:alnum:]] | Alphabetic and numeric characters |
[[:xdigit:]] | Digits allowed in a hexadecimal number |
[[:space:]] | Whitespace characters |
[[:print:]] | Printable characters |
[[:punct:]] | Punctuation characters |
[[:graph:]] | Non-blank characters |
[[:cntrl:]] | Control characters |
Replacement atoms in sed
Atom | Description |
---|---|
\1 .. \9 | Captured \( \) contents |
& | The entire matched text |
\L | All subsequent characters are converted to lowercase |
\l | The following character is converted to lowercase |
\U | All subsequent characters are converted to uppercase |
\u | The following character is converted to uppercase |
\E | Cancel the most recent \L or \U |
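The case conversion atoms can be sketched as follows. They are GNU sed extensions, so this assumes GNU sed; the inputs are made up.

```shell
# \U uppercases the rest of the replacement; text outside the match is untouched.
upper=$(printf 'hello world\n' | sed -e 's/hello/\U&/')

# \u uppercases only the next character.
capital=$(printf 'hello world\n' | sed -e 's/\([a-z]*\) \([a-z]*\)/\u\1 \u\2/')

# \E ends the effect of \U, so the second group keeps its case.
mixed=$(printf 'foo bar\n' | sed -e 's/\(foo\) \(bar\)/\U\1\E \2/')

# \L lowercases the rest of the replacement.
lower=$(printf 'LOUD\n' | sed -e 's/.*/\L&/')

printf '%s\n' "$upper" "$capital" "$mixed" "$lower"
```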
Details of sed match mechanics
GNU sed uses a traditional (non-POSIX) nondeterministic finite automaton with
extensions to support capturing to do its matching. This means that in all
cases, the match with the leftmost starting position will be favoured. Of all
the leftmost possible matches, favour will be given to leftmost alternation
options. Finally, all other things being equal favour will be given to the
longest of the leftmost counting options.
Most of this is in violation of strict POSIX compliance, so it's best not
to rely upon it. It is safe to assume that sed will always pick the leftmost
match, and that it will match greedily with priority given to items earlier in
the pattern.
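The two behaviours described as safe to assume, leftmost match and greedy repetition, can be sketched like this. It assumes GNU sed for the \| atom, and the inputs are made up.

```shell
# Leftmost wins: "bar" starts earlier in the line than "foo", so it is the
# one replaced, even though "foo" is listed first in the alternation.
leftmost=$(printf 'bar foo\n' | sed -e 's/foo\|bar/X/')

# Greedy repetition: '.*' extends to the last 'b' it can, not the first.
greedy=$(printf 'a1b2b3\n' | sed -e 's/a.*b/X/')

printf '%s\n' "$leftmost" "$greedy"
```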
Recommended further reading for regular expressions
The author recommends Mastering Regular Expressions by Jeffrey E. F. Friedl for
those who wish to learn more about regexes. This text is remarkably devoid of
phrases like "let t be a finite contiguous sequence such that t[n] ∈ ∑ ∀ n",
and was not written by someone whose pay cheque depended upon them being
able to express simple concepts with pages upon pages of mathematical and Greek
symbols.