regex - Use perl to do multi-line replacement

08
2014-07
  • tamlok

    I have to make some replacements in many *.c files. I want the replacement to work like this:
    original: printf("This is a string! %d %d\n", 1, 2);
    result: print_record("This is a string! %d %d", 1, 2);
    That is, replace "printf" with "print_record" and remove the trailing "\n".
    At first I used sed for this task. However, there may be cases like this:

    printf("This is a multiple string, that is very long"
     " and be separated into multiple lines. %d %d\n", 1, 2); 
    

    In this case, I can't easily use sed to remove the "\n". I have heard that Perl handles this kind of work well, but I am new to Perl. Can anyone help? How can I accomplish this with Perl?
    Thanks very much!

  • Answers
  • Edward

    What you want to do is not trivial. It requires some parsing to handle balanced delimiters, quoting, and the C rule that adjacent string literals are joined into a single one. Fortunately, the Perl module Text::Balanced, which ships with Perl's standard library, handles a lot of this. The following script should do more or less what you want. It takes one file name as its command-line argument and writes the result to standard output, so you'll have to wrap it in a shell script to process many files. I used the following wrapper to test it:

    #!/bin/bash
    find in/ -name '*.c' -exec sh -c 'in="$1"; out="out/${1#in/}"; perl script.pl "$in" > "$out"' _ {} \;
    colordiff -ru expected/ out/
    

    And here's the Perl script. I wrote some comments, but feel free to ask if you need more explanation.

    use strict;
    use warnings;
    use File::Slurp 'read_file';    # CPAN module, not part of core Perl
    use Text::Balanced 'extract_bracketed', 'extract_delimited';
    
    my $text = read_file(shift);
    
    my $last = 0;
    while ($text =~ /(          # store all matched text in $1
                      \bprintf  # start of literal word 'printf'
                      (\s*)     # optional whitespace, stored in $2
                      (?=\()    # lookahead for literal opening parenthesis
                     )/gx) {
        # after a successful match,
        #   1. pos($text) is on the character right after the match (the opening parenthesis)
        #   2. $1 contains the matched text (whole word 'printf' followed by optional
        #      whitespace, but not the opening parenthesis)
        #   3. $2 contains the (optional) whitespace
    
        # output up to, but not including, 'printf'
        print substr($text, $last, pos($text) - $last - length($1));
        print "print_record$2(";
    
        # extract and process argument
        my ($argument) = extract_bracketed($text, '()');
        process_argument($argument);
    
        # save current position
        $last = pos($text);
    }
    
    # output remainder of text
    print substr($text, $last);
    
    # process_argument() properly handles the situation of a format string
    # consisting of adjacent string literals
    sub process_argument {
        my $argument = shift;
    
        # skip opening parenthesis retained by extract_bracketed()
        $argument =~ /^\(/g;
    
        # scan for quoted strings
        my $saved;
        my $last = 0;
        while (1) {
            # extract quoted string
            my ($string, undef, $whitespace) = extract_delimited($argument, '"');
            last if !$string;       # quit if not found
    
            # as we still have strings remaining, the saved one wasn't the last and should
            # be output verbatim
            print $saved if $saved;
            $saved = $whitespace . $string;
            $last = pos($argument);
        }
        if ($saved) {
            $saved =~ s/\\n"$/"/;   # chop newline character sequence off last string
            print $saved;
        }
    
        # output remainder of argument
        print substr($argument, $last);
    }
    

  • Related Question

    "sed" regex help: Replacing characters
  • powerbar

    I want to change characters in an XML file by using sed. The input looks like this:

    <!-- Input -->
    <root>
      <tree foo="abcd" bar="abccdcd" />
      <dontTouch foo="asd" bar="abc" />
    </root>
    

    Now I want to change every c to X in the bar attribute of the tree element.

    <!-- Output -->
    <root>
      <tree foo="abcd" bar="abXXdXd" />
      <dontTouch foo="asd" bar="abc" />
    </root>
    

    What is the correct sed command? Please note that there can be more than one occurrence of c (next to each other or not) in one tag...

    I tried this myself, but it doesn't change multiple occurrences of c, and it appends an X :(

    sed -i 's/\(<tree.*bar=\".*\)c\(.*\"\/>\)/\1X\2/g' Input.xml
    

    Edit: Some more details ;)

    • This is a once-in-a-lifetime job; after the document is changed, I won't ever touch it again

    • The structure is as simple as shown above. That means I can grab all the lines (this works) with:

      cat input.xml | grep ""

    So assuming I have the correct string extracted, and know where to write it after modification: how do I change 'abcdeccd' to 'abXdeXXd'? This isn't really an XML problem but a regex one, or am I wrong here?


  • Related Answers
  • potong

    This might work for you (GNU sed?):

    sed '/^\s*<tree.*\<bar="/!b;s//&\n/;:a;s/\n\([^c"]\+\)/\1\n/;ta;s/\nc/X\n/;ta;:b;s/\n//' XML
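
    To make the one-liner easier to follow, here is a sketch of the same commands written one per line with comments, wrapped in a shell function. It assumes GNU sed (for \s, \<, \+ and \n in the replacement). The label :b is omitted because nothing branches to it, and the function name replace_c_in_bar and the sample file are made up for illustration.

```shell
# Hypothetical helper: replace every c with X inside the bar="..." value
# of <tree> lines in the file named by $1 (GNU sed assumed).
replace_c_in_bar() {
    sed '
        # Lines without <tree ... bar=" are printed unchanged.
        /^\s*<tree.*\<bar="/!b
        # Reuse the same regex and put a newline marker right after bar=".
        s//&\n/
        :a
        # Move the marker past a run of characters that are neither c nor a quote.
        s/\n\([^c"]\+\)/\1\n/
        ta
        # Turn the c right after the marker into X and step over it.
        s/\nc/X\n/
        ta
        # The marker has reached the closing quote, so remove it.
        s/\n//
    ' "$1"
}

# Sample input taken from the question.
demo_dir=$(mktemp -d)
cat > "$demo_dir/Input.xml" <<'EOF'
<root>
  <tree foo="abcd" bar="abccdcd" />
  <dontTouch foo="asd" bar="abc" />
</root>
EOF

replace_c_in_bar "$demo_dir/Input.xml"
```

    The newline works as a cursor because a line read by sed never contains one, so the marker cannot be confused with the data. The c in foo="abcd" is left alone because the marker is only placed after bar=".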
    
  • Daniel Andersson

    As RedGrittyBrick said, the best way to do it is to use an XML parser: pick out the element, translate the characters, and then write it back using an XML library. This will not give you nasty surprises, it will stand the test of time, etc. It is not only the best way, it is far superior to the alternatives. Other solutions more or less instantly become nightmares to debug, and there will certainly be hidden problems more or less everywhere.

    If it's just a simple task that needs to be done once, and one is very careful, and one checks the result, etc., etc., etc., then it might be less work to do it the bad way. But it will surprise you some day if you make it a habit.

    As an example, here is one of the bad ways that seems to work. It relies not just on valid XML but on more or less the exact syntax you described earlier, which is only a subset of valid XML, so other perfectly valid XML can make the code fail. What if someone adds a '>' sign in one of the tags? Add a special case. What if someone doesn't use quotation marks? Add a special case, and so on. This is the problem with not using a real parser. Some care has been taken below to at least act like a pseudoparser (read the tag, act on it, write it back), but there are ready-made tools for this that have been tested extensively.

    #!/bin/sh
    # IFS= and -r make read return each line verbatim; '%s' stops printf from
    # treating the data as a format string.
    while IFS= read -r i; do
        if printf '%s\n' "${i}" | grep -qE '<tree [^>]+ bar="[^'"${1}"'"]*'"${1}"; then
            ORIGTAG=$(printf '%s\n' "${i}" | sed 's#^.*<tree [^>]\+ bar="\([^"]\+\)".*$#\1#g')
            NEWTAG=$(printf '%s\n' "${ORIGTAG}" | tr "${1}" "${2}")
            printf '%s\n' "${i}" | sed 's#\(^.*<tree [^>]\+ bar="\)'"${ORIGTAG}"'\(".*$\)#\1'"${NEWTAG}"'\2#g'
        else
            printf '%s\n' "${i}"
        fi
    done < "${3}"
    

    Usage: script.sh [character to replace] [replacing character] [filename], e.g.

    script.sh c X myfile
    

    IFS= gives read an empty "internal field separator", which keeps the whitespace at the beginning of each line; -r keeps backslashes in the input literal.

    while read reads the input file (given as argument 3 to the script) line by line.

    grep checks if the specific tag is in the current line AND if the tag contains the character to be translated. If so, go to sed logic; if not, return the line as-is.

    sed picks out the old attribute value, tr runs a character translation on it, and a second sed returns the line with the new value.
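
    The steps above can be tried on a single line. The following is a minimal sketch, assuming GNU grep and GNU sed; the function name translate_bar and its argument order are made up for illustration, and it inherits every weakness of the script it imitates.

```shell
# Hypothetical helper: translate_bar LINE FROM TO
# Prints LINE with FROM replaced by TO inside the bar="..." value of a
# <tree> tag, or prints LINE unchanged if the tag or character is absent.
translate_bar() {
    # Step 1: grep -q only sets the exit status. It succeeds when the line
    # has a <tree ... bar="..."> whose value contains the FROM character.
    if printf '%s\n' "$1" | grep -qE '<tree [^>]+ bar="[^'"$2"'"]*'"$2"; then
        # Step 2: sed extracts the old value of bar.
        old=$(printf '%s\n' "$1" | sed 's#^.*<tree [^>]\+ bar="\([^"]\+\)".*$#\1#')
        # Step 3: tr translates the characters in that value.
        new=$(printf '%s\n' "$old" | tr "$2" "$3")
        # Step 4: a second sed writes the line back with the new value.
        printf '%s\n' "$1" | sed 's#^\(.*<tree [^>]\+ bar="\)'"$old"'\(".*\)$#\1'"$new"'\2#'
    else
        printf '%s\n' "$1"
    fi
}

translate_bar '  <tree foo="abcd" bar="abccdcd" />' c X
translate_bar '  <dontTouch foo="asd" bar="abc" />' c X
```

    Note that the old value is pasted into the second sed expression as a regular expression, so a value containing characters such as "." or "#" would already need another special case.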

    As you can see, no one would like to find this script and have to debug it. If this is anything other than a one-off job, don't do it like this, for the sanity of future observers.