imagemagick - batch split note scans with automatic recognition of parts in page

07
2013-09

groovehunter

I regularly take notes and sketch concepts on paper for most things, work and otherwise. To be quick when an idea comes up, I write the notes one after another. To make sure I can find things later, I use paper with a sidebar where I add a few short tags, and I write the current date at the top.

I am now in the process of making them accessible digitally. The first step was obviously scanning them (stored in one folder per month).

Now I want to push them into some content system.

What I need now is a semi-automatic workflow:

I view the scans page by page
The computer checks for horizontal lines (or partial lines) and sets the sections
I check and correct them
The computer recognizes the tags in the sidebar; I correct them if necessary
The image is cut into parts
The parts are saved by tag (and maybe date; at least the month should be kept)
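To make the "find lines, then cut" steps a bit more concrete, here is a minimal bash sketch. It assumes you already have a row profile of the page, i.e. one number per pixel row giving the count of dark pixels in that row (how you produce that profile is left open here). The function names find_separator_rows and section_geometries are made up for this illustration, not part of any tool.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read one dark-pixel count per line (row 0 first) and
# print the middle row of every run of rows whose count reaches min_dark.
find_separator_rows() {
  local min_dark=$1
  awk -v min="$min_dark" '
    $1 + 0 >= min + 0 {
      if (!in_run) { start = NR - 1; in_run = 1 }
      last = NR - 1
      next
    }
    in_run { print int((start + last) / 2); in_run = 0 }
    END    { if (in_run) print int((start + last) / 2) }
  '
}

# Hypothetical helper: print one crop geometry (WxH+X+Y) per section.
# Arguments: page width, page height, then the separator rows in ascending order.
section_geometries() {
  local width=$1 height=$2
  shift 2
  local top=0 cut
  for cut in "$@" "$height"; do
    # Skip empty sections, e.g. a separator at the very top of the page.
    if [ "$cut" -gt "$top" ]; then
      echo "${width}x$(( cut - top ))+0+${top}"
    fi
    top=$cut
  done
}

# Toy example: a page 1000 px wide and 10 rows high, with dark rows 2-4 and 7.
cuts=$(printf '%s\n' 0 0 900 950 920 0 0 880 0 0 | find_separator_rows 800)
# $cuts is left unquoted on purpose so each row becomes its own argument.
section_geometries 1000 10 $cuts
# prints:
# 1000x3+0+0
# 1000x4+0+3
# 1000x3+0+7
```

Each printed geometry could then be handed to ImageMagick, for example `convert page.png -crop 1000x4+0+3 +repage part2.png`, after you have checked and corrected the cut positions by hand.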

I would like suggestions for tools I could play with; in the simplest case, a kind of GUI for ImageMagick's crop. Of course I could code such a tool quite easily, but I thought I would ask here first, since you guys often have awesome ideas.

Answers

Related Answers

slhck

This blog had the best documented suggestion of the simple and sweet iCopy.

I tried all the others and many more, and only this one worked flawlessly. I have used the free CutePDF for a decade; it acts as the "to PDF" converter once you have scanned all your pages with iCopy.

Snark

Canon scanners come with a tool called CanoScan Toolbox. It can generate multi-page PDF files.

VueScan is a tool that comes to mind. It's not compatible with all scanners (most of them are supported; compatibility list here; for Windows, WIA scanners are supported). Unfortunately it is not free, but it has a "Scanning to a multi-page PDF file" feature.

Apparently (I did not try), Scan2PDF is free and can do it.

solved

Thanks. I found the checkbox in the CanonScan software.

Scan to a file. Press the PDF settings button. Check the multi PDF box.

When you scan, it will ask after each page whether there is another one.

8088

Try Fast Scan to PDF. It's fast, simple, lightweight and reliable.

scls

In my opinion the best way to do this job is not a graphical program but a collection of bash scripts (as in a Unix/Linux environment). If you have some basic programming knowledge, you will be able to do much more than a GUI program can offer you.

You can first install a minimal Unix-like command-line environment, such as

MinGW
Cygwin http://www.cygwin.com/

My preference is Cygwin, as it offers a huge number of software packages.

If you want to extract images from a PDF, also install pdfimages. pdfimages is an open source command-line utility for extracting images from PDF files. It is freely available as part of poppler-utils and xpdf-utils, and included by default with many Linux distributions.

$ pdfimages file.pdf foo

This usage produces a series of numbered images with "foo" as the prefix.

In practice, first run

$ mkdir temp
$ mkdir temp/jpg

to create a temporary folder named jpg inside a temp directory, then extract the images there:

$ pdfimages -j file.pdf temp/jpg/foo

Let's say that you now have several fooXXXX.jpg images in the temp/jpg folder.

In your case, you already had such fooXXXX.jpg pictures from scanning.

You can now generate one PDF using convert (a command-line tool from ImageMagick).

So download ImageMagick http://www.imagemagick.org/ or install it using the Cygwin package manager.

Have a look at convert documentation (type "ImageMagick convert" in your favourite search engine)

To convert your pictures into one PDF file, you write

$ convert -compress jpeg temp/jpg/*.jpg my_output_file.pdf

That's all... ;-) but this solution can be extended...

Let's imagine that the scanned pictures came from a book... 1 file is in fact 2 pages of your book...

so if you have 10 files... your book had 20 pages... and you would like your PDF to also have 20 pages.

So you need to split the image contained in one file to make 2 files for each page.

Let's say that your file is temp/jpg/foo0001.jpg; you will get 2 files, temp2/jpg/foo0001a.jpg (left page) and temp2/jpg/foo0001b.jpg (right page).

Create the temp2 directory (where your split files will go)

$ mkdir temp2
$ mkdir temp2/jpg

Create a file named split_jpg_minw.sh using a text editor (Emacs, vi, or, if you prefer a Windows application, Notepad or Notepad++)

#!/bin/bash
# Split double-page scans wider than minimal_width into a left and a right half.
minimal_width=1500
minimal_width_ignore=10

# Start from an empty output folder
rm -f temp2/jpg/*.jpg
for f in temp/jpg/*.jpg
do
  f2=$(basename "$f")
  # Read the image width and height in pixels
  read -r width height <<< "$(convert "$f" -format "%w %h" info:)"
  width2=$(( width / 2 ))
  height2=${height}
  if [ "$width" -gt "$minimal_width" ]; then
    echo "split $f ${width}x${height} to 2 files ${width2}x${height2}"
    convert "$f" -crop "${width2}x${height2}+0+0" +repage "temp2/jpg/${f2%%.*}a.jpg"
    convert "$f" -crop "${width2}x${height2}+${width2}+0" +repage "temp2/jpg/${f2%%.*}b.jpg"
  else
    if [ "$width" -gt "$minimal_width_ignore" ]; then # ignore if width <= 10px
      echo "copy $f ${width}x${height} (don't split because width<=$minimal_width)"
      cp "$f" "temp2/jpg/$f2"
    else
      echo "ignore $f ${width}x${height} width=$width<=minimal_width_ignore=$minimal_width_ignore"
    fi
  fi
done

A width of 1500px is the threshold that decides whether a file is split:

a file with a width over 1500px will be split
a file with a width of 1500px or less will not be split
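One detail worth knowing: the script uses integer division for the half width, so for a scan with an odd width both halves get the same rounded-down width and the last pixel column is dropped from the right page. If that matters to you, a small variation could give the right half the remaining columns. This is only a sketch; half_geometries is a made-up helper name, not part of the script above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the crop geometries for the left and right page.
# The right half takes whatever columns remain, so an odd width loses no pixel.
half_geometries() {
  local width=$1 height=$2
  local left=$(( width / 2 ))
  local right=$(( width - left ))
  echo "${left}x${height}+0+0"
  echo "${right}x${height}+${left}+0"
}

# Example: a double-page scan of 2481x1754 pixels.
half_geometries 2481 1754
# prints:
# 1240x1754+0+0
# 1241x1754+1240+0
```

The two printed geometries can be used as the -crop arguments of the two convert calls in the script.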

Make this script executable

$ chmod +x split_jpg_minw.sh

(you can use tab key to autocomplete the name of the file)

Run the script

$ ./split_jpg_minw.sh

The split files will be in the temp2/jpg folder

Generate the new PDF from the split files.

$ convert -compress jpeg temp2/jpg/*.jpg my_output_file_splitted.pdf

You can add many more steps to your PDF production chain using bash scripting.

There is no limit... you just have to learn scripting (but some code samples are sometimes much more useful than books)

For example, you can apply filters to your pictures before generating the PDF file (for example to remove a Moiré pattern or to reduce noise) using command-line tools such as G'MIC
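As a rough idea of what such an extra step could look like, here is a sketch that uses ImageMagick's own -despeckle noise filter as a stand-in for G'MIC, simply because convert is already part of this chain. The function name clean_pages, the DRY_RUN switch and the temp3/jpg folder are assumptions made up for this example.

```shell
#!/usr/bin/env bash

# Hypothetical helper: clean every .jpg of src_dir into dst_dir.
# With DRY_RUN=1 the convert commands are only printed, not executed.
clean_pages() {
  local src_dir=$1 dst_dir=$2 f out
  for f in "$src_dir"/*.jpg; do
    [ -e "$f" ] || continue          # no matching file: nothing to do
    out="$dst_dir/${f##*/}"          # same file name in the target folder
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "convert $f -despeckle $out"
    else
      convert "$f" -despeckle "$out"
    fi
  done
}

# Preview the commands for the split pages (temp3/jpg is a hypothetical folder).
DRY_RUN=1 clean_pages temp2/jpg temp3/jpg
```

Once the printed commands look right, you would create the target folder, run the function without DRY_RUN, and build the PDF from the cleaned files instead.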

eleven81

If you can find a piece of software that you are happy with that can scan to a Microsoft Office file format such as .DOC, you can use OpenOffice.org (free) to convert the .DOC file into a .PDF file.

Christian Davén

My recommendation:

PDFill PDF Tools - a free PDF Toolbox to Merge, Split, Reorder, Encrypt, Decrypt, Rotate, Crop, Reformat, Header, Footer, Watermark, Images to PDF, PDF to Images, Form Fields Delete/Flatten/List, PostScript to PDF, PDF Information, Scan to PDF, and Create Transparent Image.

The interface is a bit cluttered but the software is excellent and a 'must have' if you're working a lot with PDFs.
