linux - Delete all <p> lines in an HTML file except the first?

05
2013-09
  • user2435244

    I have to rewrite a lot of HTML files, for example:

    *--file1.html--*
    
    <p>text1</p><br>
    **<p>text2</p><br>
    ...<br>
    <p>text(n)</p>**
    
    *--file2.html--*
    
    <img1...<br>
    <img2...<br>
    <p>text1</p><br>
    **<p>text2</p><br>
    ...<br>
    <p>text(n)</p>**
    
    *--file3.html--*
    
    <blockquote><br>
    <p>text1</p><br>
    **<img...<br>
    <p>text2</p><br>
    ...<br>
    <p>text(n)</p>**
    
    
    *--file(n).html--*
    
    ... - various combinations of tags.
    

    The <p>...</p> tags are on separate lines. I need to delete every 'p' tag except the first (the part to delete is marked from ** to ** above). The result should be:

    *--file1.html--*
    
    <p>text1</p><br>
    
    
    *--file2.html--*
    
    <img1...<br>
    <img2...<br>
    <p>text1</p><br>
    
    *--file3.html--*
    
    <blockquote><br>
    <p>text1</p><br>
    

    I tried this but it does not work:

    sed '/<p>/,</p>/d;1/<p>/!d' file*.html - this deletes all the lines starting with a p tag; I cannot get it to leave a single p-tag line.
    
    sed '1!d' file*.html - this works if the first line is a p tag, but the first line can be an img tag, so it is no good.
    

    How can I keep the first p tag and remove the rest (from the second p tag onward)? What am I doing wrong?

  • Answers
  • user1146332

    You may try this Perl one-liner:

    perl -0777 -ne 'm#(^.*?<p>.*?</p>.*?\n).*</p>.*?\n(.*)$#s; print $1, $2' <file>
    

    For example, if you have the file test with the following content

    <blockquote><br>
    <p>text1</p><br>
    **<img...<br>
    <p>text2</p><br>
    ...<br>
    <p>text(n)</p>**
    appendix
    

    and you process it with the one-liner above, it prints

    <blockquote><br>
    <p>text1</p><br>
    appendix
    

    as a result on the screen.
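
    The same range deletion can also be sketched as a two-pass awk script, for cases where Perl is not available. This is only a sketch under assumptions: each tag sits on its own line, as in the question, and the opening tag is written exactly as <p> with no attributes. The function name keep_first_p is made up for illustration. Lines up to and including the first <p> line are kept, everything through the last </p> line is dropped, and whatever follows is kept.

```shell
# Hypothetical helper: print FILE with every line after the first <p> line,
# up to and including the last </p> line, removed.
# Assumes tags are on separate lines and <p> carries no attributes.
keep_first_p() {
  awk '
    NR == FNR {                            # pass 1: find the boundaries
      if (!first && /<p>/) first = FNR     # first line containing <p>
      if (/<\/p>/) last = FNR              # last line containing </p>
      next
    }
    !first || FNR <= first || FNR > last   # pass 2: print the surviving lines
  ' "$1" "$1"
}

# Example: rewrite each file in place through a temporary copy.
# for f in file*.html; do
#   keep_first_p "$f" > "$f.tmp" && mv "$f.tmp" "$f"
# done
```

    A file with no <p> line at all is printed unchanged, since there is nothing to cut. The file is read twice because the last </p> line is not known until the end of the first pass.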


  • Related Question

    bash - HTML/PDF to DOC(X) in Linux command line?
  • Questioner

    I need to convert PDF or HTML+CSS into DOC or DOCX under Linux. It can be from the command line or with a scripting language.

    Any idea?


  • Related Answers
  • Pekka 웃

    You might be able to do the latter using OpenOffice from the command line. There are also bridges for scripting languages; find out more on OpenOffice's website. There is one for PHP called PUNO, but I have no personal experience with it yet.

  • Colin Pickard

    You can convert HTML into .doc using an OpenOffice macro, see this thread:

    http://www.oooforum.org/forum/viewtopic.phtml?p=44367#44367

    Converting PDF to .doc is much harder, due to the multitude of different content that could be inside a PDF; quite often PDFs are used for things such as scanned text.

  • voyager

    You can use pdftohtml to make an HTML file from a PDF.

    Word can open HTML files directly.

  • jammypeach

    Document Conversion

    Current list of past examples.

    Convert any document type into PDF

    How convert Powerpoint slides to jpeg using openoffice api? (slide splitter)

    List of many past conversion examples

    Filter list

    List of converters available in OOo 2.0 (1.9.x)? Instructions to produce filter list

    Recursive Folder of Html into PDF,Txt,SXW,DOC

    Recursive Folder of SXD to SDC (StarCalc 5)

    Setting Image size for JPEG export

    Xcel to Calc conversion using the API

    A very similar one, converting Xcel to Text

    Batch mode conversion

    Document conversion

    VB: converting Excel files to txt files

    General Visual Basic document conversion of Text...

    Converting Word -> PDF from the command line
    http://www.oooforum.org/forum/viewtopic.php?t=3772 http://www.oooforum.org/forum/viewtopic.php?t=5513 http://www.oooforum.org/forum/viewtopic.php?t=3768

    PyOpenOffice tool to convert SXW to PDF without using OOo

    Convert Word --> Writer from the command line

    Convert Excel -> PDF from the command line

    http://www.oooforum.org/forum/viewtopic.php?t=5596 http://www.oooforum.org/forum/viewtopic.php?p=21050#21050

    Convert SXC to CSV from commandline

    Convert PPT to HTML from command line...

    Convert PPT to HTML short example...

    Convert PPT to PDF short example...

    see tail end of thread...

    Converting SXW -> PDF

    Draw export to PDF

    In Python...

    Thread about converting document to PDF in Java

    Convert SXW to DOC with Java
    http://www.oooforum.org/forum/viewtopic.phtml?p=81846#81846

    I wrote a batch document converter
    http://www.oooforum.org/forum/viewtopic.php?t=3525 http://www.oooforum.org/forum/viewtopic.php?t=2810 http://www.oooforum.org/forum/viewtopic.php?p=10311#10311
    you can get it here
    http://www.ooomacros.org/user.php#95532
    more discussion of it here...
    http://www.oooforum.org/forum/viewtopic.php?t=5708

    Macro to save in three formats
    http://www.oooforum.org/forum/viewtopic.php?t=3612
    Macro to save backups with timestamps
    http://www.oooforum.org/forum/viewtopic.php?t=7674

    Open HTML with Writer not Web in order to export
    http://www.oooforum.org/forum/viewtopic.php?t=3973 http://www.oooforum.org/forum/viewtopic.php?p=44367#44367
    How to convert HTML into OpenOffice File?
    http://www.oooforum.org/forum/viewtopic.php?t=11580
    Page size pblm when converting HTML to PDF
    http://www.oooforum.org/forum/viewtopic.phtml?p=63682#63682

    Discussion that ends in DocConverter utility.
    http://www.oooforum.org/forum/viewtopic.php?t=2668

    Convert DBF into XLS, SXC, PDF and HTML
    http://www.oooforum.org/forum/viewtopic.php?t=5728

    Good Visual Basic code example...converting documents
    http://www.oooforum.org/forum/viewtopic.php?t=7673

    Draw exporting and printing
    http://www.oooforum.org/forum/viewtopic.php?t=3620

    Using OOo's source code to read / convert / write documents in the formats supported by its filters.
    http://www.oooforum.org/forum/viewtopic.php?t=5785