linux - Substituting a multi-line pattern in an HTML file
2013-09
I have a series of HTML files that contain two lines like this:
<body>
<h1>Title</h1><p>
<a href="url">Description</a><br>
I want to replace this text with something else using a bash script. I'm trying
sed -i -r 's/<h1>Title.*?$\/^.*?<br>/Replacement text/1' filename.html
but it is not working. I suspect it is getting stuck on the newline, and I don't know how to work around the problem.
Any help appreciated. Feel free to suggest Linux tools other than sed, as long as they work!
I'd use Perl for this:
perl -0pe 's/<h1>Title.*\n.*<br>/replacement/' filename.html
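Here, -0 makes Perl split records on the NUL character instead of reading line-by-line, which is the default when using the -p option. Since an HTML file normally contains no NUL characters, the whole file is read as a single record.
In Perl regular expressions, .* matches any character except a newline, repeated any number of times, so the newline itself has to be matched explicitly with \n.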
Example:
$ echo '<body>
<h1>Title</h1><p>
<a href="url">Description</a><br>' | perl -0pe 's/<h1>Title.*\n.*<br>/replacement/'
<body>
replacement
sed cannot match more than one line directly. When a multiline pattern is needed, reach for a more powerful tool like Perl:
perl -i~ -ne 'if (/^<h1>Title/) {
$n = <>;
if ($n =~ /<br>$/) { print "Replacement\n" }
else { print "$_$n" }
} else { print }' filename.html
This can be done with sed.
sed -nf repl.sed filename.html
where repl.sed contains:
# Must have one line loaded up before branching to rep.
# Processing will start this way.
:rep
# Load extra line into pattern space
N
# Test for title
/<h1>.*<\/h1><p>\n<a href=".*">.*<\/a><br>/{
#Substitute and print
s/<h1>\(.*\)<\/h1><p>\n<a href=".*">.*<\/a><br>/Title: \1/p
#append next line without cycling
N
# everything but the last line
s/.*\n\([.\n]*\)/\1/
#test for last line
${
p
# this will effectively end the program
n
}
b rep
}
${
# will print pattern space (both lines)
p
# this will effectively end the program
n
}
#Print first line in pattern space
P;
#Remove first line in pattern space with newline
s/.*\n\([.\n]*\)/\1/
b rep
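For the exact two-line case in the question, a much shorter sed command may be enough. This is a sketch rather than a drop-in replacement for the script above: it assumes the heading line is always followed by the link line, and it was written with GNU sed in mind (other sed implementations differ in what N does when there is no next line).

```shell
# Sample input matching the layout shown in the question.
input=$(printf '%s\n' '<body>' '<h1>Title</h1><p>' \
    '<a href="url">Description</a><br>' '</body>')

# On the heading line, N appends the next line to the pattern space,
# so the substitution can match across the embedded newline (\n).
result=$(printf '%s\n' "$input" |
    sed -e '/<h1>Title/{' \
        -e 'N' \
        -e 's/<h1>Title.*\n.*<br>/Replacement text/' \
        -e '}')
printf '%s\n' "$result"
```

Adding -i (where your sed supports it) and a file name instead of the pipe would apply the same edit in place.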
Given the following find command: find . | xargs grep 'userTools' -sl
How can I use sed on the results of that command?
output:
./file1.ext
./file2.ext
./file3.ext
I am assuming that you want to perform some sed operation on the contents of each of the files rather than on the list of file names, since you seem to know how to do that already. The answer depends in part on the version of sed you have available. If it supports the -i option (edit files in place), you could use xargs again like this:
find . | xargs grep 'userTools' -sl | xargs -L1 sed -i 's/this/that/g'
If your sed doesn't have the -i option, you could do this instead:
find . | xargs grep 'userTools' -sl | while read file
do
sed 's/this/that/g' "$file" > tmpfile
mv tmpfile "$file"
done
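The loop above breaks on file names containing spaces or backslashes and reuses a fixed tmpfile in the current directory. A slightly hardened sketch of the same idea follows; the sample files and the temporary directory are invented for the demonstration, and names containing newlines would still not be handled.

```shell
# Throwaway directory with one matching and one non-matching file.
workdir=$(mktemp -d)
printf 'userTools: this\n' > "$workdir/has match.ext"
printf 'nothing: this\n' > "$workdir/other.ext"

# List the files containing the pattern, then rewrite each one.
find "$workdir" -type f -exec grep -sl 'userTools' {} + |
while IFS= read -r file
do
    tmp=$(mktemp) || exit 1
    # Write back with cat rather than mv so the original file keeps
    # its permissions and ownership.
    sed 's/this/that/g' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
done

matched=$(cat "$workdir/has match.ext")
untouched=$(cat "$workdir/other.ext")
printf '%s\n%s\n' "$matched" "$untouched"
```

IFS= and -r keep read from trimming whitespace or interpreting backslashes in the file names.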
find . -print0 | xargs -0 grep -slZ 'userTools' | xargs -0 sed -i 's/foo/bar/'
or
find . -print0 | xargs -0 sed -i '/userTools/ s/foo/bar/'
or
ack -l --print0 'userTools' | xargs -0 sed -i 's/foo/bar/'
find \Path_where_files_are -type f -name 'file_type' -exec sed -e 's/"text_to_be_changed"/"text_to_be_changed_to"/' {} +
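Note that without -i the sed in the last command only prints to standard output and leaves the files unchanged. As a sketch of how the -exec form can also do the filtering that grep did in the question, two -exec actions can be chained so that sed only runs on files where grep succeeded. This assumes a sed with a GNU-style -i option; the directory and file names are invented for the demonstration.

```shell
# Throwaway directory with one matching and one non-matching file.
workdir=$(mktemp -d)
printf 'userTools foo\n' > "$workdir/a.ext"
printf 'plain foo\n' > "$workdir/b.ext"

# The first -exec acts as a test: find only continues to the second
# -exec for files where grep -q exits successfully.
find "$workdir" -type f -name '*.ext' \
    -exec grep -q 'userTools' {} \; \
    -exec sed -i 's/foo/bar/' {} +

edited=$(cat "$workdir/a.ext")
skipped=$(cat "$workdir/b.ext")
printf '%s\n%s\n' "$edited" "$skipped"
```

Because find passes the names directly to grep and sed, no xargs or -print0 handling is needed for unusual file names.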