"sed" regex help: Replacing characters

25
2013-11
  • powerbar

    I want to change characters in an XML file using sed. The input looks like this:

    <!-- Input -->
    <root>
      <tree foo="abcd" bar="abccdcd" />
      <dontTouch foo="asd" bar="abc" />
    </root>
    

    Now I want to change every c to X in the bar attribute of the tree element.

    <!-- Output -->
    <root>
      <tree foo="abcd" bar="abXXdXd" />
      <dontTouch foo="asd" bar="abc" />
    </root>
    

    What is the correct sed command? Please note that there can be more than one occurrence of c (next to each other or not) in one attribute value...

    I tried this myself, but it doesn't change multiple c characters, and it appends an X :(

    sed -i 's/\(<tree.*bar=\".*\)c\(.*\"\/>\)/\1X\2/g' Input.xml
    

    Edit: Some more details ;)

    • This is a once-in-a-lifetime job; after the document is changed, I won't touch it ever again.

    • The structure is as simple as above. That means I can grab all the relevant lines (this works) with:

      cat input.xml | grep ""

    So assuming I have the correct string extracted and know where to write it after modification: how do I change 'abcdeccd' to 'abXdeXXd'? This isn't really an XML problem but a regex one, or am I wrong here?

  • Answers
  • potong

    This might work for you (GNU sed?):

    sed '/^\s*<tree.*\<bar="/!b;s//&\n/;:a;s/\n\([^c"]\+\)/\1\n/;ta;s/\nc/X\n/;ta;:b;s/\n//' XML
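
The one-liner above is dense, so here is a sketch of the same command spread over several lines with comments. It assumes GNU sed (`\s`, `\<`, `\+` and `\n` in the replacement are GNU extensions), and the wrapper name `replace_c_in_bar` is made up for illustration. The trailing `:b` label of the one-liner is left out because nothing branches to it.

```shell
# replace_c_in_bar is a hypothetical wrapper name; the sed script inside is
# the one-liner from this answer, assuming GNU sed.
replace_c_in_bar() {
    sed '
        # Lines that do not start with <tree and contain bar=" are printed as-is.
        /^\s*<tree.*\<bar="/!b
        # Reuse that regex and put a newline marker right after bar=" .
        s//&\n/
        :a
        # Move the marker past a run of characters that are neither c nor a quote.
        s/\n\([^c"]\+\)/\1\n/
        ta
        # The marker sits in front of a c: write X and step the marker past it.
        s/\nc/X\n/
        ta
        # The marker has reached the closing quote: remove it.
        s/\n//
    ' "$@"
}
```

Called as `replace_c_in_bar Input.xml`, it prints the result to standard output, so the output can be checked before anything is overwritten.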
    
  • Daniel Andersson

    As RedGrittyBrick said, the best way to do it is to use an XML parser: pick out the element, translate the characters, and then write it back using an XML library. This will not give you nasty surprises, it will stand the test of time, etc. It is not only the best way, it is far superior to the alternatives. Other solutions quickly become nightmares to debug, and there will almost certainly be hidden problems throughout.

    If it's just a simple task that needs to be done once, and one is very careful, and one checks the result, etc., etc., etc., then it might be less work to do it the bad way. But it will surprise you some day if you make it a habit.

    As an example, here is one of the bad ways that seems to work. It relies not just on valid XML but on more or less the exact syntax you described earlier, which is only a subset of valid XML, so perfectly valid XML can make the code fail. (What if someone adds a '>' sign inside one of the attribute values? Add a special case. What if someone uses single quotes instead of double quotes? Add a special case, and so on.) This is the problem with not using a real parser. Some care has been taken below to act at least like a pseudoparser, reading the attribute, then acting on it, then writing it back, but there are ready-made tools for this that have been tested extensively.

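    #!/bin/sh
    # Usage: script.sh CHAR REPLACEMENT FILE (the \+ in the sed patterns assumes GNU sed)
    # IFS= keeps leading whitespace; -r keeps backslashes literal.
    while IFS= read -r i; do
        # Act only on lines whose <tree ... bar="..."> value contains CHAR.
        if printf '%s\n' "${i}" | grep -qE '<tree [^>]+ bar="[^'"${1}"'"]*'"${1}"; then
            # Extract the current value of the bar attribute.
            ORIGTAG=$(printf '%s\n' "${i}" | sed 's#^.*<tree [^>]\+ bar="\([^"]\+\)".*$#\1#g')
            # Translate the characters in that value only.
            NEWTAG=$(printf '%s\n' "${ORIGTAG}" | tr "${1}" "${2}")
            # Write the line back with the translated value substituted in.
            printf '%s\n' "${i}" | sed 's#\(^.*<tree [^>]\+ bar="\)'"${ORIGTAG}"'\(".*$\)#\1'"${NEWTAG}"'\2#g'
        else
            printf '%s\n' "${i}"
        fi
    done < "${3}"
    

    Usage: script.sh [character to replace] [replacing character] [filename], e.g.

    script.sh c X myfile
    

    IFS= clears the "internal field separator" for the read command only, so that whitespace at the beginning of each line is kept; -r stops read from interpreting backslashes.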

    while read reads the input file (given as argument 3 to the script) line by line.

    grep checks whether the tree tag is on the current line AND whether its bar attribute contains the character to be translated. If so, go to the sed logic; if not, return the line as-is.

    sed picks out the old attribute value, tr runs a character translation on it, and a second sed returns the line with the new value.
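
The translation step on its own answers the narrower question of turning 'abcdeccd' into 'abXdeXXd'. A minimal sketch, where `translate_chars` is a hypothetical helper name that simply wraps tr:

```shell
# translate_chars FROM TO: hypothetical helper; reads text on standard input
# and replaces every character of FROM with the matching character of TO.
translate_chars() {
    tr "$1" "$2"
}

# The string from the question, with every c replaced by X.
printf '%s\n' 'abcdeccd' | translate_chars c X   # prints abXdeXXd
```

tr works on single characters, so no regex is involved at this step; the hard part is limiting the translation to the right attribute, which is what the surrounding grep and sed calls are for.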

    As you can see, no one would like to find this script and have to debug it. If this is anything other than a one-off job, don't do it like this, for the sanity of future observers.


  • Related Question

    shell - Replace filename with filepath with sed
  • Alex Kahn

    I want to replace the string

    /opt/local/lib/ruby/gems/1.8/gems/cucumber-0.3.99/lib/cucumber.rb

    with the string

    /opt/local/lib/ruby/gems/1.8/gems/cucumber-0.3.99/lib/

    on the command line, probably using sed. I can't for the life of me figure out the replacement regex to pass to sed. Or maybe sed isn't even the right tool for the job. Any help would be appreciated.


  • Related Answers
  • Marcin

    no need for sed:

    dirname /usr/local/bin/program

    will return /usr/local/bin
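
Note that dirname drops the trailing slash, while the question asks for a result ending in `lib/`. A small sketch that adds it back, with `dir_with_slash` as a made-up helper name; it assumes the path contains at least one directory component and is not directly under `/`:

```shell
# dir_with_slash PATH: hypothetical helper that prints the directory part
# of PATH followed by a slash.
dir_with_slash() {
    printf '%s/\n' "$(dirname "$1")"
}

path='/opt/local/lib/ruby/gems/1.8/gems/cucumber-0.3.99/lib/cucumber.rb'
dir_with_slash "$path"

# The same result using only shell parameter expansion:
# strip the shortest suffix matching /* and append the slash again.
printf '%s\n' "${path%/*}/"
```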