regex - Using sed to remove digits and white space from a string

25
2013-11
  • balteo

    I am trying to remove the first occurrence of digit(s), the dot, the second occurrence of digit(s) and the space before the word.

    I have come up with this regex:

    sed 's/^[0-9]\+.[0-9]\+\s//' input.txt > output.txt
    

    Text sample:

    2.14 Italien
    2.15 Japonais
    

    My regex does not work unfortunately. There is a problem with the \s but I can't pinpoint what it is...

    Can anyone help?

    edit: The problem is that I need to remove only the first space, as some of the text contains spaces, as you can see below:

    3.15 Chichewa
    3.16 Chimane
    3.17 Cinghalais
    3.18 Créole de Guinée-Bissau
    
  • Answers
  • slhck

    The command you're using should work as-is with GNU sed. But with BSD sed, which for example comes with OS X, it won't.

    • If you're trying to use Extended Regular Expressions – which support the + metacharacter – you need to explicitly enable them. For BSD sed you do this with sed -E, and for GNU sed with sed -r.

      The \+ alone does work with GNU sed when EREs are not enabled, but this is less portable.

    • You're using the Perl-like \s, which exists in neither Basic nor Extended Regular Expressions, and standard sed doesn't support Perl regular expressions. GNU sed does support \s as an extension, but it'd be more portable to simply put a literal space in your regular expression.

    • Finally, your unescaped . matches any single character in that position, not just a literal dot. Use \. to escape it so that only a dot matches.
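
    To illustrate the last point, here is a minimal sketch that uses only POSIX basic regular expressions, so it should behave the same with GNU and BSD sed. The function names strip_loose and strip_strict and the sample line "2x14 Italien" are made up for this illustration.

```shell
# Hypothetical helpers, named only for this illustration.
# Unescaped dot: any single character is accepted between the digit runs.
strip_loose()  { sed 's/^[0-9][0-9]*.[0-9][0-9]* //'; }
# Escaped dot: only a literal "." is accepted.
strip_strict() { sed 's/^[0-9][0-9]*\.[0-9][0-9]* //'; }

printf '2x14 Italien\n' | strip_loose    # prints "Italien"
printf '2x14 Italien\n' | strip_strict   # prints "2x14 Italien" (no match, line unchanged)
printf '2.14 Italien\n' | strip_strict   # prints "Italien"
```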

    So, a solution would be, for GNU sed:

    $ echo "2.12 blah" | sed -r 's/^[0-9]+\.[0-9]+ //'
    blah
    

    Or for BSD sed:

    $ echo "2.12 blah" | sed -E 's/^[0-9]+\.[0-9]+ //'
    blah
    

    This way you don't need a different regex for different versions of sed. With your example:

    $ cat test
    3.15 Chichewa
    3.16 Chimane
    3.17 Cinghalais
    3.18 Créole de Guinée-Bissau
    
    $ sed -r 's/^[0-9]+\.[0-9]+ //' test
    Chichewa
    Chimane
    Cinghalais
    Créole de Guinée-Bissau
    

    If the real problem is that you want to get the second column of a whitespace-delimited file, then you're going about this the wrong way. Either use awk, like @Srdjan Grubor says, or use cut:

    $ echo "2.12 foo bar baz" | cut -d' ' -f2-
    foo bar baz
    

    The -f2- specifies the second and all following columns, so this will basically take the first space as the separator and output the rest.

  • Srdjan Grubor

    Why not use awk?

    cat  input.txt | awk '{print $2}' > output.txt
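
    One caveat: awk splits each line on runs of whitespace, so $2 is only the second word, and multi-word names from the question such as "Créole de Guinée-Bissau" would be cut short. A minimal sketch of one way around that, deleting the first field with sub() instead of selecting $2 (the function name drop_first_field is hypothetical):

```shell
# Hypothetical helper: delete everything up to and including the first space,
# then print what is left of the line.
drop_first_field() { awk '{ sub(/^[^ ]* /, ""); print }'; }

printf '3.18 Créole de Guinée-Bissau\n' | awk '{print $2}'    # prints "Créole"
printf '3.18 Créole de Guinée-Bissau\n' | drop_first_field    # prints "Créole de Guinée-Bissau"
```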
    
  • vortex7

    If all you need is to drop everything up to and including the first space, then this suffices:

    sed -e 's/[^ ]* //'
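
    Note that this pattern does not look at what precedes the first space, so a line without a numeric prefix loses its first word as well. A small sketch comparing it with a stricter, anchored pattern (POSIX basic regular expressions only; both function names are made up here):

```shell
# Hypothetical helpers, named only for this comparison.
drop_to_first_space() { sed -e 's/[^ ]* //'; }
drop_numeric_prefix() { sed -e 's/^[0-9][0-9]*\.[0-9][0-9]* //'; }

printf '3.18 Créole de Guinée-Bissau\n' | drop_to_first_space   # prints "Créole de Guinée-Bissau"
printf 'Créole de Guinée-Bissau\n' | drop_to_first_space        # prints "de Guinée-Bissau"
printf 'Créole de Guinée-Bissau\n' | drop_numeric_prefix        # prints "Créole de Guinée-Bissau"
```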
    
  • mohit6up

    You could also use grep:

    grep -oP '[a-zA-Z]+$' input.txt > output.txt

  • Scrutinizer

    With any sed:

    sed 's/^[0-9]\{1,\}\.[0-9]\{1,\} //' 
    

    Or perhaps this might suffice:

    sed 's/^[0-9.]\{1,\} //' file
    

  • Related Question

    linux - Removing newlines from an RTF file using sed
  • Spidey

    I have an RTF file which is formatted like so:

        Lorem ipsum dolor sit amet, consectetur adipiscing elit.\par
    Nullam vitae sem porttitor urna pellentesque gravida. Nulla\par
    consequat purus vel est vehicula porttitor.\par
        Maecenas pharetra metus in enim sollicitudin sollicitudin.\par
    Etiam et odio tellus, eget placerat enim. Aliquam sem purus,\par
    gravida sed feugiat eget, consectetur quis nisl.\par
    

    (\par added for brevity)

    As you can see, newlines have been inserted to fit a page's width. The problem arises when I try to read the text on my iPhone, which has a different line length. The lines break and readability is hindered.

    The ideal solution would be one that converts the file to a single line for each paragraph, while keeping the newline and indent for new paragraphs.

    So far I've tried parsing the file with sed but was unable to create a multiline regex. Ideally, I want to replace all "\r\n"s with " ", unless the next line begins with a space.

    Is there a better solution for this? If not, how can I do it using sed?


  • Related Answers
  • Peter Boughton

    This regex will match what you want:

    \r\n(?! )
    


    So to use that with sed:

    sed 's/\r\n(?! )/ /g' filename.rtf
    


    Except, it appears that sed doesn't support negative lookahead, and requires backslashed parens, so you can instead use:

    sed 's/\r\n\([^ ]\)/ \1/g' filename.rtf
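
    One thing to keep in mind: sed reads its input one line at a time and strips the trailing newline, so a \n in the pattern can only match after lines have been joined in the pattern space, for example with the N command. A minimal sketch of that approach, assuming LF line endings (a file with CRLF endings could be passed through tr -d '\r' first); the function name unwrap is hypothetical:

```shell
# Hypothetical helper: join each line onto the previous one unless it
# begins with a space (an indented line starts a new paragraph).
#   :a    label to loop back to
#   $!N   append the next input line, except on the last line
#   s///  if the appended line starts with a non-space, replace the newline with a space
#   ta    if that substitution happened, loop and try the following line too
#   P;D   otherwise print the finished first line and restart with the remainder
unwrap() {
  sed -e ':a' -e '$!N' -e 's/\n\([^ ]\)/ \1/' -e 'ta' -e 'P' -e 'D'
}

printf '    Lorem ipsum\ndolor sit\namet.\n    Maecenas\npharetra.\n' | unwrap
# prints:
#     Lorem ipsum dolor sit amet.
#     Maecenas pharetra.
```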
    
  • Spidey

    The solution lay in a tool I hadn't given serious thought to: awk

    awk 'BEGIN { FS="\\\\par" } ; /^    / {print "\\par" $1} /^[^ ]/ {print " " $1}'
    

    This goes over the file with \par as the field separator. It prints a \par before any line that starts with 4 spaces (which marks the beginning of a new paragraph), and drops the \par (simply by not printing it) from any line that starts with anything but a space.
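
    To make the effect of this step concrete, here is a small sketch that wraps the same awk program in a function and runs it on two shortened sample lines. The function name mark_paragraphs and the sample text are made up for this illustration, and the rules are written one per line only for readability.

```shell
# Hypothetical wrapper around the awk program above.
mark_paragraphs() {
  awk 'BEGIN { FS = "\\\\par" }        # split each line at a literal \par
       /^    / { print "\\par" $1 }    # indented line: new paragraph, \par goes in front
       /^[^ ]/ { print " " $1 }        # continuation line: \par dropped, space in front'
}

printf '    Lorem elit.\\par\nNullam vitae.\\par\n' | mark_paragraphs
# prints:
# \par    Lorem elit.
#  Nullam vitae.
```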

    Now what we have is a file with \par only where legal line breaks should be. The next step would be to remove all newlines altogether, to get rid of rogue line breaks:

    tr -d '\r\n'
    

    And then feed the result to sed to replace \par with \par\r\n, effectively adding a newline wherever there is a \par.

    sed 's/\\par/\\par\r\n/g'
    

    And done.

    The only real issue I've found with this method is that it ruined the RTF header. No problem, I just copied over the header from the original file.

    Another smaller issue was that chapter titles were being printed inline with previous paragraphs. This is because chapter titles do not start with a space yet should be considered a paragraph. In my case, chapters were marked like so:

    CHAPTER THIRTY-TWO
    Chapter's Name

    So a quick sed took care of them:

    sed 's/\s*\(CHAPTER [[:upper:]-]* \)\(.*\\par\)/\\par\r\n\\par\r\n\\par\r\n\1\\par\r\n\2\\par\r\n/'
    

    I now have my book in proper format, which makes it readable on other devices (such as my iPod).