regex - Use perl to do multi-line replacement

2014-07-08

tamlok

I have to make some replacements in many *.c files. I want the replacement to work like this:
original: printf("This is a string! %d %d\n", 1, 2);
result: print_record("This is a string! %d %d", 1, 2);
That is, replace "printf" with "print_record" and remove the trailing "\n".
At first I used sed for this task. However, there may be cases like this:

printf("This is a multiple string, that is very long"
 " and be separated into multiple lines. %d %d\n", 1, 2);

In this case, I can't easily use sed to remove the "\n". I heard that Perl handles this kind of work well, but I am new to Perl. Can anyone help? How can I accomplish this with Perl?
Thanks very much!

Answers

Edward

What you want to do is not trivial. It requires some parsing to take care of balanced delimiters, quoting, and the C rule that adjacent string literals are joined into a single one. Fortunately, the Perl module Text::Balanced handles a lot of this (Text::Balanced is part of the Perl 'standard' library; File::Slurp, which the script also uses, has to be installed from CPAN). The following script should do more or less what you want. It takes one command-line argument, the file to convert, and writes the result to standard output, so to process many files you'll have to wrap it in a shell script. I used the following wrapper to test it:

#!/bin/bash
find in/ -name '*.c' -exec sh -c 'in="$1"; out="out/${1#in/}"; perl script.pl "$in" > "$out"' _ {} \;
colordiff -ru expected/ out/

And here's the Perl script. I wrote some comments, but feel free to ask if you need more explanation.

use strict;
use warnings;
use File::Slurp 'read_file';
use Text::Balanced 'extract_bracketed', 'extract_delimited';

my $text = read_file(shift);

my $last = 0;
while ($text =~ /(          # store all matched text in $1
                  \bprintf  # start of literal word 'printf'
                  (\s*)     # optional whitespace, stored in $2
                  (?=\()    # lookahead for literal opening parenthesis
                 )/gx) {
    # after a successful match,
    #   1. pos($text) is on the character right behind the match (opening parenthesis)
    #   2. $1 contains the matched text (whole word 'printf' followed by optional
    #      whitespace, but not the opening parenthesis)
    #   3. $2 contains the (optional) whitespace

    # output up to, but not including, 'printf'
    print substr($text, $last, pos($text) - $last - length($1));
    print "print_record$2(";

    # extract and process argument
    my ($argument) = extract_bracketed($text, '()');
    process_argument($argument);

    # save current position
    $last = pos($text);
}

# output remainder of text
print substr($text, $last);

# process_argument() properly handles the situation of a format string
# consisting of adjacent string literals
sub process_argument {
    my $argument = shift;

    # skip opening parenthesis retained by extract_bracketed()
    $argument =~ /^\(/g;

    # scan for quoted strings
    my $saved;
    my $last = 0;
    while (1) {
        # extract quoted string
        my ($string, undef, $whitespace) = extract_delimited($argument, '"');
        last if !$string;       # quit if not found

        # as we still have strings remaining, the saved one wasn't the last and should
        # be output verbatim
        print $saved if $saved;
        $saved = $whitespace . $string;
        $last = pos($argument);
    }
    if ($saved) {
        $saved =~ s/\\n"$/"/;   # chop newline character sequence off last string
        print $saved;
    }

    # output remainder of argument
    print substr($argument, $last);
}

Related Answers

potong

This might work for you (GNU sed?):

sed '/^\s*<tree.*\<bar="/!b;s//&\n/;:a;s/\n\([^c"]\+\)/\1\n/;ta;s/\nc/X\n/;ta;:b;s/\n//' XML

Daniel Andersson

As RedGrittyBrick said, the best way to do it is to use an XML parser: pick out the element, translate the characters, and then write it back using an XML library. This will not give you nasty surprises, it will stand the test of time, etc. It is not only the best way, it is far superior to the alternatives. Other solutions more or less instantly become nightmares to debug, and there will certainly be hidden problems more or less everywhere.

If it's just a simple task that needs to be done once, and one is very careful, and one checks the result, etc., etc., etc., then it might be less work to do it the bad way. But it will surprise you some day if you make it a habit.

As an example, here is one of the bad ways that seems to work. It relies not just on valid XML but on more or less the exact syntax you described earlier, which is only a subset of valid XML, so perfectly valid XML can make the code fail. What if someone adds a '>' sign in one of the tags? Add a special case. What if someone doesn't use quotation marks? Add a special case, and so on. This is the problem with not using a real parser. Some care has been taken below to act at least like a pseudoparser (read the tag, act on it, write it back), but there are ready-made tools for this that have been tested extensively.

#!/bin/sh
# IFS= keeps leading whitespace; -r keeps backslashes literal.
while IFS= read -r i; do
    # '%s' keeps the line from being interpreted as a printf format string
    if printf '%s\n' "${i}" | grep -qE '<tree [^>]+ bar="[^'"${1}"'"]*'"${1}"; then
        ORIGTAG=$(printf '%s\n' "${i}" | sed 's#^.*<tree [^>]\+ bar="\([^"]\+\)".*$#\1#g')
        NEWTAG=$(printf '%s\n' "${ORIGTAG}" | tr "${1}" "${2}")
        printf '%s\n' "${i}" | sed 's#\(^.*<tree [^>]\+ bar="\)'"${ORIGTAG}"'\(".*$\)#\1'"${NEWTAG}"'\2#g'
    else
        printf '%s\n' "${i}"
    fi
done < "${3}"

Usage: script.sh [character to replace] [replacing character] [filename], e.g.

script.sh c X myfile

IFS= sets the shell's "internal field separator" to the empty string for read, which keeps whitespace at the beginning of the lines; -r stops read from treating backslashes as escape characters.

while read reads the input file (given as argument 3 to the script) line by line.

grep checks if the specific tag is in the current line AND if the tag contains the character to be translated. If so, go to sed logic; if not, return the line as-is.

sed picks out the old tag, runs a character translation on it and returns the line with the new tag.

As you can see, no one would want to find this script and have to debug it. If this is anything other than a one-off job, don't do it like this, for the sanity of future observers.
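
On the point about keeping leading whitespace when reading line by line, here is a small sketch that compares a plain read with IFS= read -r. The sample line is invented for the demonstration (it only borrows the tree tag and bar attribute from the question); nothing here is part of the script above.

```shell
#!/bin/sh
# Compare how read treats leading whitespace and backslashes.
# The sample line is hypothetical, modelled on the <tree ... bar="..."> syntax.
sample='    <tree foo="a" bar="c\d">'

# Default IFS, no -r: leading blanks are trimmed and the backslash is consumed.
plain=$(printf '%s\n' "$sample" | { read line; printf '%s' "$line"; })

# Empty IFS plus -r: the line comes through unchanged.
exact=$(printf '%s\n' "$sample" | { IFS= read -r line; printf '%s' "$line"; })

printf 'plain=[%s]\n' "$plain"
printf 'exact=[%s]\n' "$exact"
```

The first form silently alters indentation and escapes, which is one more way a line-oriented shell loop can corrupt input that an XML library would pass through intact.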
