linux - Delete first lines from a Unicode html file

05
2013-09

Christopher

I have an HTML file that is in UTF-8 format and I want to remove the first five lines from it.

I've tried using sed but it doesn't work in this case:

sed "1,5d" Result.html > small2

The same command works on other files, but not on this one. I don't think I can use tail, because it works from the end of the file, and the site's content may change later.

This is my file:

    HTTP/1.1 200 OK
    Cache-Control: private
    Content-Length: 176073
    Content-Type: text/html; charset=utf-8
    Server: Microsoft-IIS/7.5
    X-AspNet-Version: 4.0.30319
    Set-Cookie: ASP.NET_SessionId=jaq52r5vsd04zvffokbutu1q; path=/; HttpOnly
    X-Powered-By: ASP.NET
    Date: Thu, 29 Nov 2012 06:41:59 GMT
    Connection: close

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US" xml:lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

The file is at 4shared.com/document/U8yRa19I/Result.html. Here is the output of od -c Result.html:

0000000   H   T   T   P   /   1   .   1       2   0   0       O   K  \r
0000020       C   a   c   h   e   -   C   o   n   t   r   o   l   :    
0000040   p   r   i   v   a   t   e  \r       C   o   n   t   e   n   t
0000060   -   L   e   n   g   t   h   :       1   7   6   0   7   3  \r
0000100       C   o   n   t   e   n   t   -   T   y   p   e   :       t
0000120   e   x   t   /   h   t   m   l   ;       c   h   a   r   s   e
0000140   t   =   u   t   f   -   8  \r       S   e   r   v   e   r   :
0000160       M   i   c   r   o   s   o   f   t   -   I   I   S   /   7
0000200   .   5  \r       X   -   A   s   p   N   e   t   -   V   e   r
0000220   s   i   o   n   :       4   .   0   .   3   0   3   1   9  \r
0000240       S   e   t   -   C   o   o   k   i   e   :       A   S   P
0000260   .   N   E   T   _   S   e   s   s   i   o   n   I   d   =   j
0000300   a   q   5   2   r   5   v   s   d   0   4   z   v   f   f   o
0000320   k   b   u   t   u   1   q   ;       p   a   t   h   =   /   ;
0000340       H   t   t   p   O   n   l   y  \r       X   -   P   o   w
0000360   e   r   e   d   -   B   y   :       A   S   P   .   N   E   T
0000400  \r       D   a   t   e   :       T   h   u   ,       2   9    
0000420   N   o   v       2   0   1   2       0   6   :   4   1   :   5
0000440   9       G   M   T  \r       C   o   n   n   e   c   t   i   o
0000460   n   :       c   l   o   s   e  \r      \r

Answers

terdon

I can't access your file so I can't test this, but one of these should work:

gawk 'NR>5' Result.html>small2
perl -ne 'print if $.>5' Result.html>small2
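
As an aside, tail is not limited to the end of a file: with -n +K it starts printing at line K. A minimal sketch on a throwaway sample (the contents are made up and stand in for Result.html):

```shell
# Build a 7-line sample file (hypothetical content, stands in for Result.html).
sample=$(mktemp)
printf 'h1\nh2\nh3\nh4\nh5\nbody1\nbody2\n' > "$sample"

# "-n +6" means "start output at line 6", i.e. skip the first five lines.
trimmed=$(tail -n +6 "$sample")
printf '%s\n' "$trimmed"

rm -f "$sample"
```

Like the gawk and perl versions, this only helps if the file really uses line feeds as line terminators.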

If they don't work, I doubt the encoding is the problem; more likely some unusual characters are getting in the way. Try passing your file through od to check:

od -c Result.html | more
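
If the od dump is long, counting the two kinds of line terminator gives a quick summary. A rough sketch, using a made-up sample in place of Result.html:

```shell
# Made-up sample whose "lines" end in CR only (no LF anywhere).
sample=$(mktemp)
printf 'HTTP/1.1 200 OK\rCache-Control: private\r\r<html>\r' > "$sample"

# Keep only the CR (or LF) bytes and count them.
cr_count=$(tr -cd '\r' < "$sample" | wc -c)
lf_count=$(tr -cd '\n' < "$sample" | wc -c)

# $(( )) strips any padding that some wc implementations add.
echo "CR: $((cr_count))  LF: $((lf_count))"

rm -f "$sample"
```

A file with many carriage returns and no line feeds looks like a single line to line-oriented tools such as sed.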

UPDATE:

I see in the output of od -c that you have old Mac-style line endings: the lines end with a carriage return (\r) rather than a line feed (\n). To sed the whole file is therefore a single line, and deleting lines 1 to 5 deletes everything. So, try changing the carriage returns to \n and running sed or one of the other commands again:

perl -ne 's/\r/\n/g; print' Result.html | gawk 'NR>5' > small2
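
The difference can be checked on a small sample. This sketch uses tr in place of the perl one-liner for the conversion, and the sample contents are made up:

```shell
# Made-up sample with CR-only line endings, standing in for Result.html.
sample=$(mktemp)
printf 'h1\rh2\rh3\rh4\rh5\rbody1\rbody2\r' > "$sample"

# Without conversion the whole file is one line, so "1,5d" deletes it all.
raw=$(sed '1,5d' "$sample")

# Translate CR to LF first, then drop the first five lines.
fixed=$(tr '\r' '\n' < "$sample" | sed '1,5d')

printf 'without conversion: [%s]\n' "$raw"
printf 'with conversion:\n%s\n' "$fixed"

rm -f "$sample"
```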

Also, please post your file so we can access it and try it ourselves. It will greatly speed up the process. The service you have linked to requires us to get an account.

Related Answers

Peter Boughton

This regex will match what you want:

\r\n(?! )

So to use that with sed:

sed 's/\r\n(?! )/ /g' filename.rtf

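Except that sed doesn't support negative lookahead and requires backslashed parentheses. It also reads its input one line at a time, so the newline never reaches the pattern unless the lines are first joined in the pattern space. With GNU sed you can instead use:

sed -e ':a' -e 'N' -e '$!ba' -e 's/\r\n\([^ ]\)/ \1/g' filename.rtf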
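
To see the effect on a small input, here is a sketch with a made-up three-line sample. It assumes GNU sed (for \r in the pattern) and uses the usual N loop to load the whole file into the pattern space, so that the line breaks are visible to the substitution:

```shell
# Made-up CRLF sample: the second line continues the first,
# the third starts with a space and should stay on its own line.
sample=$(mktemp)
printf 'first line\r\nsecond line\r\n indented line\r\n' > "$sample"

# :a / N / $!ba appends every following line to the pattern space;
# the substitution then replaces CR LF with a space unless a space follows.
joined=$(sed -e ':a' -e 'N' -e '$!ba' -e 's/\r\n\([^ ]\)/ \1/g' "$sample")

# Show the result with control characters made visible.
printf '%s\n' "$joined" | od -c

rm -f "$sample"
```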

Spidey

The solution lay in a tool I hadn't given serious thought to: awk

awk 'BEGIN { FS="\\\\par" } ; /^    / {print "\\par" $1} /^[^ ]/ {print " " $1}'

This goes over the file with \par as the field separator. It prints a \par before any line that starts with four spaces (which marks the beginning of a new paragraph), and drops the \par (simply doesn't print it) when the line starts with anything but a space.

Now what we have is a file with \par only where legal line breaks should be. The next step would be to remove all newlines altogether, to get rid of rogue line breaks:

tr -d '\r\n'

Then feed the result to sed to replace \par with \par\r\n, effectively adding a line break after each \par:

sed 's/\\par/\\par\r\n/g'

And done.
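
Put together, the three steps can be tried on a tiny sample. This is only a sketch: the input fragment is made up, and it assumes GNU sed so that \r and \n in the replacement produce a real CR LF.

```shell
# Made-up RTF-like fragment: paragraphs start with four spaces, and a
# rogue line break (with its own \par) splits the first paragraph.
input=$(mktemp)
printf '    First paragraph starts here\\par\r\ncontinues on a second line\\par\r\n    Second paragraph\\par\r\n' > "$input"

result=$(
  # Step 1: keep \par only in front of lines that open a paragraph.
  awk 'BEGIN { FS = "\\\\par" }
       /^    / { print "\\par" $1 }
       /^[^ ]/ { print " " $1 }' "$input" |
  # Step 2: remove every remaining line break.
  tr -d '\r\n' |
  # Step 3: put a CR LF back after each \par.
  sed 's/\\par/\\par\r\n/g'
)

# Show the result with control characters made visible.
printf '%s\n' "$result" | od -c

rm -f "$input"
```

Note that the \par now sits in front of each paragraph rather than after it, so the last paragraph ends without one.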

The only real issue I found with this method is that it ruined the RTF header. No problem: I just copied the header over from the original file.

Another, smaller issue was that chapter titles were printed inline with the preceding paragraph. This is because chapter titles do not start with a space, yet should be treated as paragraphs. In my case, chapters were marked like so:

CHAPTER THIRTY-TWO
Chapter's Name

So a quick sed took care of them:

sed 's/\s*\(CHAPTER [[:upper:]-]* \)\(.*\\par\)/\\par\r\n\\par\r\n\\par\r\n\1\\par\r\n\2\\par\r\n/'

I now have my book in proper format, which makes it readable on other devices (such as my iPod).
