linux - Substituting a multi-line pattern in an HTML file

05
2013-09
  • To Do

    I have a series of HTML files that contain two lines like this:

    <body>
    <h1>Title</h1><p>
    <a href="url">Description</a><br>
    

    I want to replace this text with something else using a bash script. I'm trying

    sed -i -r 's/<h1>Title.*?$\/^.*?<br>/Replacement text/1' filename.html
    

    but it is not working. I suspect it is getting stuck on the newline, and I don't know how to work around the problem.

    Any help appreciated. Feel free to suggest Linux tools other than sed, as long as they work!

  • Answers
  • slhck

    I'd use Perl for this:

    perl -0pe 's/<h1>Title.*\n.*<br>/replacement/' filename.html
    

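    Here, -0 sets Perl's input record separator to the NUL character, so a file that contains no NUL bytes is read as a single record instead of line by line (line-by-line reading is the default with the -p option). The command writes the result to standard output; add -i to edit the file in place.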

    In Perl regular expressions, .* matches any character except a newline, any number of times, so you match the newline explicitly with \n.

    Example:

    $ echo '<body>
    <h1>Title</h1><p>
    <a href="url">Description</a><br>' | perl -0pe 's/<h1>Title.*\n.*<br>/replacement/'
    <body>
    replacement
    
  • choroba

    sed cannot match more than one line directly, since it processes its input line by line. When a multi-line pattern is needed, reach for a more powerful tool like Perl:

    perl -i~ -ne 'if (/^<h1>Title/) {
                      $n = <>;
                      if ($n =~ /<br>$/) { print "Replacement\n" }
                      else { print "$_$n" }
                  } else { print }' filename.html
    
  • Chad Skeeters

    This can be done with sed.

    sed -nf repl.sed filename.html
    

    where repl.sed contains:

    # Must have one line loaded up before branching to rep.
    # Processing will start this way.
    :rep
    # Load extra line into pattern space
    N
    # Test for title
    /<h1>.*<\/h1><p>\n<a href=".*">.*<\/a><br>/{
      #Substitute and print
      s/<h1>\(.*\)<\/h1><p>\n<a href=".*">.*<\/a><br>/Title: \1/p
      #append next line without cycling
      N
      # keep only the newly appended line (drop everything up to the last newline)
      s/.*\n\([.\n]*\)/\1/
      #test for last line
      ${
        p
        # this will effectively end the program
        n
      }
      b rep
    }
    ${
      # will print pattern space (both lines)
      p
      # this will effectively end the program
      n
    }
    #Print first line in pattern space
    P;
    #Remove first line in pattern space with newline
    s/.*\n\([.\n]*\)/\1/
    b rep
    

    See Working with Multiple Lines
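
    For the simple two-line case in the question, a much shorter sed sketch may be enough. This is not the script above but a reduced variant of the same N technique, run here on sample text held in a shell variable (sample and result are names chosen for illustration). It assumes the title line is always followed by the line ending in <br>.

```shell
# Sample input modelled on the question; the closing </body> line is added for illustration.
sample=$(printf '%s\n' '<body>' '<h1>Title</h1><p>' '<a href="url">Description</a><br>' '</body>')

# On a line containing <h1>Title, N appends the next input line to the pattern space,
# so the s command can match across the embedded newline (\n).
result=$(printf '%s\n' "$sample" | sed '/<h1>Title/{
N
s/<h1>Title.*\n.*<br>/Replacement text/
}')
printf '%s\n' "$result"
```

    Add -i (where your sed supports it) to apply the same script to filename.html in place.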


  • Related Question

    editing - Using sed to replace text from a list of files from find
  • Kyle Hayes

    Given the following find command: find . | xargs grep 'userTools' -sl

    How can I use sed on the results of that command?

    output:

    ./file1.ext
    ./file2.ext
    ./file3.ext
    

  • Related Answers
  • garyjohn

    I am assuming that you want to perform some sed operation on the contents of each of the files rather than on the list of file names since you seem to know how to do that already. The answer depends in part on the version of sed you have available. If it supports the -i option (edit files in place), you could use xargs again like this:

    find . | xargs grep 'userTools' -sl | xargs -L1 sed -i 's/this/that/g'
    

    If your sed doesn't have the -i option, you could do this instead:

    find . | xargs grep 'userTools' -sl | while IFS= read -r file
    do
        sed 's/this/that/g' "$file" > tmpfile &&
        mv tmpfile "$file"
    done
    
  • Dennis Williamson
    find . -print0 | xargs -0 grep -slZ 'userTools' | xargs -0 sed -i 's/foo/bar/'
    

    or

    find . -type f -print0 | xargs -0 sed -i '/userTools/ s/foo/bar/'
    

    or

    ack -l --print0 'userTools' | xargs -0 sed -i 's/foo/bar/'
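
    A hedged way to try the first, null-delimited pipeline safely is to run it on a scratch directory. The sketch below assumes GNU find, grep, xargs and sed, and uses two hypothetical files, one of which has a space in its name.

```shell
# Scratch directory with two hypothetical files; only the first contains 'userTools'.
tmpdir=$(mktemp -d)
printf 'userTools foo\n' > "$tmpdir/has space.ext"
printf 'nothing foo\n'   > "$tmpdir/other.ext"

# -print0, -0 and -Z pass NUL-terminated names, so "has space.ext" stays a single argument.
find "$tmpdir" -type f -print0 | xargs -0 grep -slZ 'userTools' | xargs -0 sed -i 's/foo/bar/'

# Only the file that matched the grep should have been edited.
matched=$(cat "$tmpdir/has space.ext")
untouched=$(cat "$tmpdir/other.ext")
printf '%s\n%s\n' "$matched" "$untouched"
```

    With plain newline-delimited output, the file name containing a space would be split into two arguments by xargs.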
    
  • slhck
    find Path_where_files_are -type f -name 'file_type' -exec sed -i -e 's/"text_to_be_changed"/"text_to_be_changed_to"/' {} +