linux - replace nth occurrence of string in each line of a text file

05
2013-09

dnkb

I have large text files with space delimited strings (2-5). The strings can contain "'" or "-". I'd like to replace say the second space with a pipe. What's the best way to go? Using sed I was thinking of this:

sed -r "s/(^[a-z'-]+ [a-z'-]+\b) /\1|/" filename.txt

Any other/better/simpler ideas?

Thank you

Answers

mrucci

You can add a number at the end of the substitute command. For example, the following will substitute the second occurrence of old with the string new on each line of file:

sed 's/old/new/2' file

So, instead of your proposed solution, you can use:

sed 's/ /|/2'

For more information, see e.g. this sed tutorial.
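As a quick illustration of the numeric flag, here is a minimal sketch run against made-up sample lines. The function name replace_second_space is hypothetical and only wraps the command shown above.

```shell
# Replace the 2nd space on each line with a pipe.
# The sample input below is invented for illustration.
replace_second_space() {
    sed 's/ /|/2'
}

printf '%s\n' "o'brien anne-marie smith" "one two" "a b c d" | replace_second_space
# o'brien anne-marie|smith
# one two
# a b|c d
```

Lines with fewer than two spaces pass through unchanged, because there is no second occurrence to substitute.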

petersohn

Did you try your version? Did it work? Because I think it is basically a good idea. I would do it slightly differently, though:

sed -re 's/^([^ ]+ +[^ ]+) /\1|/'

This accepts any non-space character in a word, and allows more than one space between the first two words.
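The difference from a plain occurrence count shows up when the first two words are separated by more than one space. A small sketch, assuming a sed that accepts -E (the portable spelling of GNU sed's -r); the function name and the sample line are made up:

```shell
# Put the pipe after the second word, tolerating repeated spaces
# between word 1 and word 2.
pipe_after_second_word() {
    sed -E 's/^([^ ]+ +[^ ]+) /\1|/'
}

# Invented input with two spaces between the first two words.
printf '%s\n' 'alpha  beta gamma' | pipe_after_second_word   # alpha  beta|gamma
printf '%s\n' 'alpha  beta gamma' | sed 's/ /|/2'            # alpha |beta gamma
```

The word-based pattern counts words, while the numeric flag counts individual space characters, so which one fits depends on whether the input can contain runs of spaces.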

Related Question

linux - Removing newlines from an RTF file using sed

linux regex sed awk rtf

Spidey

I have an RTF file which is formatted like so:

    Lorem ipsum dolor sit amet, consectetur adipiscing elit.\par
Nullam vitae sem porttitor urna pellentesque gravida. Nulla\par
consequat purus vel est vehicula porttitor.\par
    Maecenas pharetra metus in enim sollicitudin sollicitudin.\par
Etiam et odio tellus, eget placerat enim. Aliquam sem purus,\par
gravida sed feugiat eget, consectetur quis nisl.\par

(\par added for brevity)

As you can see, newlines have been inserted to fit a page's width. The problem arises when I try to read the text on my iPhone, which has a different line length. The lines break and readability is hindered.

The ideal solution would be one that converts the file to a single line for each paragraph, while keeping the newline and indent for new paragraphs.

So far I've tried parsing the file with sed but was unable to create a multiline regex. Ideally, I want to replace all "\r\n"s with " ", unless the next line begins with a space.

Is there a better solution for this? If not, how can I do it using sed?

Related Answers

Peter Boughton

This regex will match what you want:

\r\n(?! )

So to use that with sed:

sed 's/\r\n(?! )/ /g' filename.rtf

Except, it appears that sed doesn't support negative lookahead, and requires backslashed parens, so you can instead use:

sed 's/\r\n\([^ ]\)/ \1/g' filename.rtf
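One caveat: sed reads its input one line at a time and strips the trailing newline before applying the expression, so a pattern containing \n has nothing to match unless the lines are first collected into the pattern space. A hedged sketch of one way to do that, using the common N loop to read the whole file and tr to drop carriage returns beforehand (so the output has LF line endings). The function name and sample text are assumptions, and the approach assumes the file fits in memory and has no empty lines:

```shell
# Join each wrapped line onto the previous one unless it starts with a space.
join_wrapped_lines() {
    tr -d '\r' |
        sed -e ':a' -e 'N' -e '$!ba' -e 's/\n\([^ ]\)/ \1/g'
}

# Invented sample: indented lines start a new paragraph.
printf '%s\n' '    Lorem ipsum' 'dolor sit' '    Maecenas' 'pharetra' | join_wrapped_lines
#     Lorem ipsum dolor sit
#     Maecenas pharetra
```

The loop (:a, N, $!ba) appends every following line to the pattern space, and only then does the substitution see the embedded newlines.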

Spidey

The solution lay in a tool I hadn't given serious thought to: awk

awk 'BEGIN { FS="\\\\par" } ; /^    / {print "\\par" $1} /^[^ ]/ {print " " $1}'

This will go over the file, with \par as the field separator, and will print a \par before any line that starts with 4 spaces (which marks the beginning of a new paragraph), and omit it (simply by not printing it) when the line starts with anything but a space.

Now what we have is a file with \par only where legal line breaks should be. The next step would be to remove all newlines altogether, to get rid of rogue line breaks:

tr -d '\r\n'

Then feed the result to sed to replace \par with \par\r\n, effectively adding a newline after each \par.

sed 's/\\par/\\par\r\n/g'

And done.
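Chained together, the three steps might look like the sketch below. It assumes GNU sed (for \r and \n in the replacement text); the function name rewrap_rtf and the sample lines are invented, and the RTF header is not handled here.

```shell
# awk: mark paragraph starts with \par and drop the trailing \par of each line.
# tr:  remove every line break.
# sed: put a CRLF back after each \par (GNU sed escape sequences assumed).
rewrap_rtf() {
    awk 'BEGIN { FS="\\\\par" } ; /^    / {print "\\par" $1} /^[^ ]/ {print " " $1}' |
        tr -d '\r\n' |
        sed 's/\\par/\\par\r\n/g'
}

# Invented sample in the same shape as the book text.
printf '%s\n' '    Lorem ipsum\par' 'dolor sit\par' '    Maecenas\par' 'pharetra\par' | rewrap_rtf
```

Each paragraph ends up on a single line, preceded by a line holding its \par marker.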

The only real issue I've found with this method is that it ruined the RTF header. No problem, I just copied over the header from the original file.

Another smaller issue was that chapter titles were being printed inline with previous paragraphs. This is because chapter titles do not start with a space yet should be considered a paragraph. In my case, chapters were marked like so:

CHAPTER THIRTY-TWO
Chapter's Name

So a quick sed took care of them:

sed 's/\s*\(CHAPTER [[:upper:]-]* \)\(.*\\par\)/\\par\r\n\\par\r\n\\par\r\n\1\\par\r\n\2\\par\r\n/'

I now have my book in proper format, which makes it readable on other devices (such as my iPod).
