imagemagick - batch split note scans with automatic recognition of parts in page

07
2013-09
  • groovehunter

    I regularly use paper to make notes and sketch concepts for most things, work and otherwise. To be quick when I have an idea, I write notes down one after another. To make sure I can find things later, I use paper with a sidebar where I add some short tags, and I put the current date at the top.

    I am now in the process of making them accessible digitally. The first step was obviously scanning them (stored in one folder per month).

    Now I want to push them into some content system.

    I now need a semi-automatic way to do the following:

    • view page by page
    • the computer checks for horizontal lines (or partial lines) and sets the sections
    • I check and correct
    • the computer recognizes the tags in the sidebar; I correct them if necessary
    • the image is cut into parts
    • the parts are saved by tag (and maybe date; at least keep the month)
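
    The cutting and saving steps above could be sketched in shell roughly as follows. This is only a minimal sketch under assumptions: the y coordinates of the separator lines are taken as already known and sorted top to bottom (nothing here detects them), and the function names and the month/tag/part-NN.png naming scheme are hypothetical.

```shell
# Sketch: turn known separator-line positions into ImageMagick crop geometries.
# Assumption: the y coordinates are already known and sorted ascending.

# section_geometries WIDTH HEIGHT Y1 [Y2 ...]
# Prints one WxH+X+Y geometry per section, from top to bottom.
section_geometries() {
  local width height top y
  width=$1
  height=$2
  shift 2
  top=0
  # Append the page height so the last section runs to the bottom edge.
  for y in "$@" "$height"; do
    if [ "$y" -gt "$top" ]; then
      echo "${width}x$(( y - top ))+0+${top}"
      top=$y
    fi
  done
}

# part_path MONTH TAG INDEX
# Hypothetical naming scheme: one folder per month and tag, numbered parts.
part_path() {
  printf '%s/%s/part-%02d.png\n' "$1" "$2" "$3"
}
```

    Each printed geometry could then be passed to ImageMagick in the form `convert page.png -crop GEOMETRY +repage OUTPUT`, with the output name taken from part_path.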

    I would like to ask for suggestions for tools I could play with; in the simplest case, a kind of GUI for ImageMagick crop. Of course I could code such a tool quite easily, but I thought I would ask here first, since you guys often have awesome ideas.

  • Answers

    Related Question

    software rec - Looking for simple windows scan (multiple pages) to one pdf application?
  • Troggy

    I would like to find some simple scan software for a Windows machine that can scan to PDF, but I would like it to scan batches or multiple pages into one big PDF. I saw a couple of questions on scan-to-PDF software, but did not see anything about scanning to large multi-page PDFs.

    EDIT: I am surprised there are not more options out there. Do many of the scanners/all-in-one devices come with included software that performs this function?

    EDIT 2: I tried Scan2PDF and it locked up on me multiple times in the middle of the scan job and then gave me non-English error messages. Otherwise, I liked how simple the app was: just select the number of pages and hit OK. Any other success stories out there?


  • Related Answers
  • slhck

    This blog had the best documented suggestion of the simple and sweet iCopy.

    I tried all the others and many more, and only this worked flawlessly. I have already used the free CutePDF for a decade; it acts as the "to PDF" converter after you get your pages all scanned in with iCopy.

  • Snark

    Canon scanners come with a tool called CanoScan Toolbox. It can generate multi-page PDF files.

    VueScan is another tool that comes to mind. It's not compatible with all scanners (most of them are supported; compatibility list here; for Windows, WIA scanners are supported). Unfortunately, it is not free, but it has the feature "Scanning to a multi-page PDF file".

    Apparently (I did not try), Scan2PDF is free and can do it.

  • solved

    Thanks. I found the checkbox in the CanoScan software.

    Scan to a file. Press the PDF settings button. Check the multi PDF box.

    When you scan it will ask if there is another page after each scan.

  • 8088

    Try Fast Scan to PDF. It's fast, simple, lightweight and reliable.

  • scls

    In my opinion, the best way to do the job is not a graphical program but a collection of bash scripts (as in a Unix/Linux environment). With some basic programming knowledge, you will be able to do much more than a GUI program can offer.

    First install a minimal Unix-like command-line environment.

    My preference is Cygwin, as it contains a huge number of software packages.

    If you want to extract images from a PDF, also install pdfimages, an open-source command-line utility for extracting images from PDF files. It is freely available as part of poppler-utils and xpdf-utils, and is included by default with many Linux distributions.

    $ pdfimages file.pdf foo
    

    This usage produces a series of numbered images with "foo" as the prefix.

    In practice, first create a temporary folder named jpg inside a temp directory:

    $ mkdir temp
    $ mkdir temp/jpg

    $ pdfimages -j file.pdf temp/jpg/foo
    

    Let's say that you now have several fooXXXX.jpg images in the temp/jpg folder.

    In your case, you already have the fooXXXX.jpg pictures from your scans.

    You can now generate one PDF using convert (a command-line tool from ImageMagick).

    So download ImageMagick from http://www.imagemagick.org/ or install it using the Cygwin package manager.

    Have a look at convert documentation (type "ImageMagick convert" in your favourite search engine)

    So you understand that to convert your pictures to one PDF file you will have to write

    $ convert -compress jpeg temp/jpg/*.jpg my_output_file.pdf
    

    That's all... ;-) but this solution can be extended...

    Let's imagine that the scanned pictures came from a book... 1 file is in fact 2 pages of your book...

    so if you have 10 files... your book had 20 pages... and you would like your PDF to also have 20 pages.

    So you need to split the image contained in each file into 2 files, one per page.

    Let's say that your file is temp/jpg/foo0001.jpg: you will get 2 files, temp2/jpg/foo0001a.jpg (left page) and temp2/jpg/foo0001b.jpg (right page).
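
    To make the split concrete, here is a minimal sketch of the two crop geometries for one scan. The function name half_geometries is hypothetical; the WxH+X+Y form is the geometry syntax that the -crop option of convert expects.

```shell
# half_geometries WIDTH HEIGHT
# Prints the crop geometry of the left page, then of the right page.
half_geometries() {
  local width height half
  width=$1
  height=$2
  # Integer division: with an odd width, the last pixel column is dropped.
  half=$(( width / 2 ))
  echo "${half}x${height}+0+0"        # left page starts at x=0
  echo "${half}x${height}+${half}+0"  # right page starts at the middle
}
```

    For a 3000x2000 scan this prints 1500x2000+0+0 and 1500x2000+1500+0.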

    Create the temp2 directory (where your split files will go)

    $ mkdir temp2
    $ mkdir temp2/jpg
    

    Create a file named split_jpg_minw.sh with the following content, using a text editor (Emacs, vi or, if you prefer a Windows application, Notepad or Notepad++)

    #!/bin/bash
    # Split double-page scans from temp/jpg into single pages in temp2/jpg.
    minimal_width=1500        # images wider than this are split in two
    minimal_width_ignore=10   # images this narrow (or narrower) are skipped

    rm -f temp2/jpg/*.jpg
    for f in temp/jpg/*.jpg
    do
      f2=$(basename "$f")
      read -r width height <<< "$(convert "$f" -format "%w %h" info:)"
      width2=$(( width / 2 ))
      height2=$height
      if [ "$width" -gt "$minimal_width" ]; then
        echo "split $f ${width}x${height} to 2 files ${width2}x${height2}"
        convert "$f" -crop "${width2}x${height2}+0+0" +repage "temp2/jpg/${f2%%.*}a.jpg"
        convert "$f" -crop "${width2}x${height2}+${width2}+0" +repage "temp2/jpg/${f2%%.*}b.jpg"
      elif [ "$width" -gt "$minimal_width_ignore" ]; then
        echo "copy $f ${width}x${height} (not split because width <= $minimal_width)"
        cp "$f" "temp2/jpg/$f2"
      else
        # ignore tiny images (width <= 10px)
        echo "ignore $f ${width}x${height} (width <= $minimal_width_ignore)"
      fi
    done
    

    A width of 1500px is the threshold that decides whether a file is split:

    • a file with a width over 1500px will be split
    • a file with a width of 1500px or less will not be split
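
    The threshold rule can be checked in isolation before running it on real scans. The following is only a sketch: classify_width is a hypothetical helper, and its defaults (1500 and 10) are copied from minimal_width and minimal_width_ignore in the script.

```shell
# classify_width WIDTH [SPLIT_ABOVE] [IGNORE_AT_MOST]
# Prints "split", "copy" or "ignore" for an image of the given pixel width.
classify_width() {
  local width split_above ignore_at_most
  width=$1
  split_above=${2:-1500}     # same value as minimal_width
  ignore_at_most=${3:-10}    # same value as minimal_width_ignore
  if [ "$width" -gt "$split_above" ]; then
    echo split
  elif [ "$width" -gt "$ignore_at_most" ]; then
    echo copy
  else
    echo ignore
  fi
}
```

    Note that a file that is exactly 1500px wide is copied, not split, because the comparison is strictly "greater than".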

    Make this script executable

    $ chmod +x split_jpg_minw.sh
    

    (you can use tab key to autocomplete the name of the file)

    Run the script

    $ ./split_jpg_minw.sh
    

    The split files will be in the temp2/jpg folder

    Generate the new PDF from the split files.

    $ convert -compress jpeg temp2/jpg/*.jpg my_output_file_splitted.pdf
    

    You can add many more steps to your PDF production chain using bash scripting.

    There is no limit... you just have to learn scripting (and code samples are sometimes much more useful than books)

    For example, you can apply filters to your pictures before generating the PDF file (to remove Moiré patterns or to reduce noise, for instance) using command-line tools such as G'MIC

  • eleven81

    If you can find a piece of software that you are happy with that can scan to a Microsoft Office file format such as .DOC, you can use OpenOffice.org (free) to convert the .DOC file into a .PDF file.

  • Christian Davén

    My recommendation:

    PDFill PDF Tools - a free PDF Toolbox to Merge, Split, Reorder, Encrypt, Decrypt, Rotate, Crop, Reformat, Header, Footer, Watermark, Images to PDF, PDF to Images, Form Fields Delete/Flatten/List, PostScript to PDF, PDF Information, Scan to PDF, and Create Transparent Image.

    The interface is a bit cluttered but the software is excellent and a 'must have' if you're working a lot with PDFs.