I have a notebook photocopied and the photocopy scanned, about 200 pages.
For various reasons I need to print this material. There are large black areas at the sides of each page (beyond where the page itself ends), i.e. "black margins".
The image looks like this:
I would like to remove the black areas while keeping all the text.
* The even and odd pages have the black part at different places.
* Notably, there is a white edge outside the black one, too!
* Most notably, the black areas have no fixed width (I've tried overlaying all the images for even and odd pages separately). Their width varies, so the batch algorithm should be able to detect it.
Is there a way to remove these black-white margins automatically, keeping the text?
I can use Windows XP or Linux.
XnView has a batch processing mode with an auto crop feature:
As you can see, the colour and the tolerance level can be modified as required, so that may help.
IrfanView has a similar feature, although it's a bit more hidden. Under Options > Properties/Settings > Browsing/Editing you can set the tolerance value for auto crop borders:
You can batch auto crop via File > Batch conversion:
If none of these help then you might have to break out the big guns and use something like Photoshop, perhaps with appropriate auto crop plugins.
I would recommend a free utility called Scan Tailor, which removes borders, straightens pages and applies other fixes to scanned images. Below is the result I got with minimal input on your sample file. It is hard to say how it will work for an entire batch, but the preliminary results seem promising.
If you are looking for a true scripting solution to the problem, you might try your hand at ImageMagick, a very powerful command-line utility for working with images. Specifically, I would look at the sections on removing borders and trimming. However, I didn't have much luck getting it to work on your test image. You might want to look in the forums, where others seem to have similar issues.
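For the scripting route, the detection step itself can be sketched in a few lines. The following is only an illustration, not what ImageMagick or Scan Tailor do internally: it assumes the page is already loaded as rows of grayscale values (0 = black, 255 = white), and the names `longest_bright_run`, `page_box` and the cut-off `dark_below` are made up for this example. It averages each column, treats dark columns as margin, and keeps the longest bright run, which skips both the black band and the narrow white edge outside it.

```python
def longest_bright_run(means, dark_below=80):
    """Return (start, end) of the longest run of values >= dark_below; end is exclusive."""
    best = (0, 0)
    start = None
    # The trailing -1 is a sentinel that closes a run reaching the last entry.
    for i, m in enumerate(list(means) + [-1]):
        if m >= dark_below:
            if start is None:
                start = i
        else:
            if start is not None and i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def page_box(pixels, dark_below=80):
    """pixels: list of rows of grayscale values 0..255.

    Returns (left, top, right, bottom) with right/bottom exclusive.
    dark_below is an assumed cut-off and would need tuning on real scans.
    """
    height, width = len(pixels), len(pixels[0])
    col_means = [sum(row[x] for row in pixels) / height for x in range(width)]
    left, right = longest_bright_run(col_means, dark_below)
    if right == left:  # no bright column found at all
        return (0, 0, 0, 0)
    # Measure rows only inside the detected columns, so side margins do not skew them.
    row_means = [sum(row[left:right]) / (right - left) for row in pixels]
    top, bottom = longest_bright_run(row_means, dark_below)
    return (left, top, right, bottom)


# Synthetic page: 2 white columns, 3 black columns, then 7 columns of "paper".
demo = [[255, 255, 0, 0, 0] + [255] * 7 for _ in range(6)]
demo[2][7] = 0  # one dark "text" pixel on the paper
box = page_box(demo)
print(box)
```

Because the box is computed per image, it does not matter that the black band sits on different sides for even and odd pages or that its width changes. Real scans would need an image library to load the pixels and apply the crop, and probably a small safety margin around the box.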
Since I don't have a copier or scanner, I'm using an 8 megapixel camera to copy documents. This works pretty well except they need a lot of processing afterward. I'd like to get from a photo to a bitmap, but using
djpeg -grayscale -pnm photo.jpg |
pgmtopbm -threshold -value XXX
does not work so well, for two reasons:
* It's hard to guess what XXX should be, and XXX is different for different photos.
* Illumination varies, and sometimes a single threshold isn't what's right for the image.
How can I do better? The ideal solution would be a fully automatic command-line program that I can run on Linux. (I have already written a program to remove dark pixels from the edges of images.)
NOTE: I really want a bitmap, that is, just black and white pixels. No grayscale, no dithering.
Converting to grayscale / desaturating will preserve most of the noise too. The GIMP has a Threshold filter (under the Color menu) that eliminates the noise, and works very well for line-art and plain black scanned text.
I'm not too clued up on the batch scripting myself, but it sounds like a good idea to use the Threshold with it.
Edit: Since you have Linux as a tag, have a look at Phatch, a batch photo manipulation tool. It has filters to adjust the contrast and brightness too. It's in the Ubuntu repos (if you use that distro).
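If a plain threshold is used in a batch, the value does not have to be guessed by hand for each photo. One common way to pick it automatically is Otsu's method, which chooses the cut that best separates the dark and bright pixel populations. The sketch below is an illustration only, not what GIMP or Phatch do; it assumes the grayscale values (0..255) are already in a flat list, and `otsu_threshold` is a name made up for this example.

```python
def otsu_threshold(values):
    """Return t in 0..255 that maximizes the between-class variance.

    Pixels with value <= t count as dark (ink), the rest as bright (paper).
    """
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below t
    sum0 = 0  # sum of their values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


# Synthetic photo: 30 dark ink pixels and 70 bright paper pixels.
sample = [30] * 20 + [40] * 10 + [200] * 50 + [220] * 20
t = otsu_threshold(sample)
bits = [1 if v <= t else 0 for v in sample]  # 1 = black, 0 = white
print(t, sum(bits))
```

The returned value plays the role of the hand-picked XXX, after converting it to whatever scale the thresholding tool expects. It is still a single global threshold, so it does not help when the illumination varies across one photo.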
Apparently, GIMP supports some command-line batch processing. You might be able to give that a shot, since desaturating will probably behave like you'd expect with varying brightness in your images.
Check out your camera. Many modern digital cameras have the ability to take B&W photos directly.
The best thing I've found in three years is the mkbitmap program that ships with potrace.
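For readers curious why a single threshold fails under uneven lighting, here is a generic local-threshold sketch. It is not mkbitmap's algorithm, just an illustration of the idea of comparing each pixel with its neighbourhood instead of with one global value. The function name `adaptive_threshold` and the parameters `radius` and `offset` are assumptions made up for this example.

```python
def adaptive_threshold(pixels, radius=2, offset=20):
    """Mark a pixel black (1) when it is more than `offset` darker than
    the mean of the (2*radius+1)-sized window around it, else white (0)."""
    h, w = len(pixels), len(pixels[0])
    # integral[y][x] holds the sum of pixels[0:y][0:x], so any window sum costs O(1).
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += pixels[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    out = []
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        out_row = []
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            area = (y1 - y0) * (x1 - x0)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            # pixel < mean - offset, written without division
            is_ink = pixels[y][x] * area < total - offset * area
            out_row.append(1 if is_ink else 0)
        out.append(out_row)
    return out


# Synthetic photo: paper brightness fades from 200 (left) to 92 (right),
# with two ink pixels that are 80 darker than the paper around them.
photo = [[200 - 12 * x for x in range(10)] for _ in range(5)]
photo[2][1] -= 80
photo[2][8] -= 80

local_bits = adaptive_threshold(photo)
global_bits = [[1 if v < 110 else 0 for v in row] for row in photo]
print(sum(map(sum, local_bits)), sum(map(sum, global_bits)))
```

In this toy case the local version marks only the two ink pixels, while the fixed cut-off of 110 also turns the dim right-hand paper black. On real photos the window would have to be much larger than a pen stroke, and `offset` would need tuning.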