linux - Removing newlines from an RTF file using sed

25
2013-11
  • Spidey

    I have an RTF file which is formatted like so:

        Lorem ipsum dolor sit amet, consectetur adipiscing elit.\par
    Nullam vitae sem porttitor urna pellentesque gravida. Nulla\par
    consequat purus vel est vehicula porttitor.\par
        Maecenas pharetra metus in enim sollicitudin sollicitudin.\par
    Etiam et odio tellus, eget placerat enim. Aliquam sem purus,\par
    gravida sed feugiat eget, consectetur quis nisl.\par
    

    (\par added for clarity)

    As you can see, newlines have been inserted to fit a page's width. The problem arises when I try to read the text on my iPhone, which has a different line length. The lines break and readability is hindered.

    The ideal solution would be one that converts the file to a single line for each paragraph, while keeping the newline and indent for new paragraphs.

    So far I've tried parsing the file with sed but was unable to create a multiline regex. Ideally, I want to replace all "\r\n"s with " ", unless the next line begins with a space.

    Is there a better solution for this? If not, how can I do it using sed?

  • Answers
  • Peter Boughton

    This regex will match what you want:

    \r\n(?! )
    


    So to use that with sed:

    sed 's/\r\n(?! )/ /g' filename.rtf
    


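    Except that sed doesn't support negative lookahead and requires backslashed parens. It also sees only one line at a time, so the lines have to be joined before a newline can be matched. With GNU sed you can instead use:

    sed ':a;N;$!ba;s/\r\n\([^ ]\)/ \1/g' filename.rtf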
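
    A minimal sketch of this approach on made-up sample text, assuming GNU sed and CRLF line endings. The :a;N;$!ba prefix pulls the whole input into the pattern space so that the substitution can see the \r\n pairs; the function name join_wrapped is hypothetical:

    ```shell
    # Join each line onto the previous one unless it starts with a space.
    # Assumes GNU sed (for \r, \n and the one-line loop syntax).
    join_wrapped() {
        sed ':a;N;$!ba;s/\r\n\([^ ]\)/ \1/g' "$@"
    }

    # Two wrapped paragraphs; a leading space marks a new paragraph.
    printf '    Lorem ipsum\r\ndolor sit\r\n    Maecenas\r\nmetus\r\n' | join_wrapped
    ```

    This loads the entire file into memory, which should be fine for a book-sized RTF file.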
    
  • Spidey

    The solution lay in a tool I hadn't given serious thought to: awk

    awk 'BEGIN { FS="\\\\par" } ; /^    / {print "\\par" $1} /^[^ ]/ {print " " $1}'
    

    This will go over the file, with \par as the field separator, and will print a \par before any line that starts with 4 spaces (which marks the beginning of a new paragraph), and remove it (or rather, simply not print it) when the line starts with anything but a space.

    Now what we have is a file with \par only where legal line breaks should be. The next step would be to remove all newlines altogether, to get rid of rogue line breaks:

    tr -d '\r\n'
    

    Then feed the result to sed to replace \par with \par\r\n, effectively adding a newline after each \par.

    sed 's/\\par/\\par\r\n/g'
    

    And done.
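
    Chained together, the three steps above might look like the sketch below. The function name reflow_rtf and the sample text are made up, and the \r and \n escapes in the sed replacement assume GNU sed:

    ```shell
    # Reads RTF text on stdin and writes the reflowed text to stdout.
    # Step 1 (awk): keep \par only in front of lines that open a paragraph.
    # Step 2 (tr):  remove every remaining line break.
    # Step 3 (sed): put a CRLF back after each surviving \par.
    reflow_rtf() {
        awk 'BEGIN { FS="\\\\par" } /^    / {print "\\par" $1} /^[^ ]/ {print " " $1}' |
        tr -d '\r\n' |
        sed 's/\\par/\\par\r\n/g'
    }

    # One paragraph wrapped over two lines.
    printf '    Lorem ipsum\\par\r\ndolor sit amet.\\par\r\n' | reflow_rtf
    ```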

    The only real issue I've found with this method is that it ruined the RTF header. No problem, I just copied over the header from the original file.

    Another smaller issue was that chapter titles were being printed inline with previous paragraphs. This is because chapter titles do not start with a space yet should be considered a paragraph. In my case, chapters were marked like so:

    CHAPTER THIRTY-TWO
    Chapter's Name

    So a quick sed took care of them:

    sed 's/\s*\(CHAPTER [[:upper:]-]* \)\(.*\\par\)/\\par\r\n\\par\r\n\\par\r\n\1\\par\r\n\2\\par\r\n/'
    

    I now have my book in proper format, which makes it readable on other devices (such as my iPod).


  • Related Question

    text - using sed to remove lines in a file
  • eleven81

    I have a file that looks something like this:

    Heading - 
      - Completed foo
        - More information
        - Still more
      * Need to complete bar
      - Did baz (comment blah blah) ***
    
    Another - 
      * Need to complete foo
      - Completed bar (blah comment blah) ***
      - Done baz
    

    I need to run the text file through sed to remove all of the lines that start with spaces (number varies) and a hyphen, and another space.

    What is the regex or pattern I need to use with sed to make the output look like this below?

    Heading - 
      * Need to complete bar
    
    Another - 
      * Need to complete foo
    

  • Related Answers
  • eleven81

    I used Phoshi's answer, assisted by Dennis Williamson, to help me come up with sed /^\s+-\s.*/d which works as expected.
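
    For reference, a quoted form of that command might look like the sketch below. It assumes GNU sed, where -E turns on extended regular expressions (so + means one or more) and \s matches whitespace; the function name drop_done_items is hypothetical:

    ```shell
    # Delete lines made of leading whitespace, a hyphen, then whitespace.
    drop_done_items() {
        sed -E '/^\s+-\s.*/d' "$@"
    }

    printf 'Heading - \n  - Completed foo\n  * Need to complete bar\n' | drop_done_items
    ```

    The quotes keep the shell from stripping the backslashes before sed sees them.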

  • Phoshi

    "s/\s*-\s.*//g" should do it, I think.

    That's \s to match a whitespace character, * to match zero or more of the preceding character (the whitespace), a literal hyphen character, then another whitespace character, then .* to match everything after it.

  • Ryan Thompson

    You should use egrep or grep for this task. sed is a stream editor, whereas grep is built for selecting or rejecting whole lines.

    You need a regex that matches the start of line, whitespace, hyphen, space. Sounds like this would work:

    egrep  -v  '^[ ]+-[ ]' filename
    

    The -v option causes egrep to REMOVE the matching lines -- this is easier than building a regex that rejects the lines.

    Example:

     nobody$ egrep -v  '^[ ]+-[ ]' /tmp/foof
     Heading - 
       * Need to complete bar
    
     Another - 
       * Need to complete foo
     nobody$ cat /tmp/foof
     Heading - 
       - Completed foo
         - More information
         - Still more
       * Need to complete bar
       - Did baz (comment blah blah) ***
    
     Another - 
       * Need to complete foo
       - Completed bar (blah comment blah) ***
       - Done baz
     nobody$ _
    

    Dealing with tab characters only means you need to include them in the bracket expressions, but that's hard to show online.
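
    One way to cover tabs without typing a literal tab is the POSIX class [[:blank:]], which matches a space or a tab. A sketch using grep -E, the standard spelling of egrep; the function name drop_hyphen_items and the sample input are hypothetical:

    ```shell
    # Print only the lines that do NOT start with blanks, a hyphen, and a blank.
    drop_hyphen_items() {
        grep -Ev '^[[:blank:]]+-[[:blank:]]' "$@"
    }

    # The second line is indented with a tab instead of spaces.
    printf 'Heading - \n\t- Completed foo\n  * Need to complete bar\n' | drop_hyphen_items
    ```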