Author: Subject: Book scanning.
katavinyx
Harmless

Posts: 4
Registered: 3-4-2004
Member Is Offline

Mood: No Mood

Book scanning.

Some while back, I had said I'd scan 5 books for the ftp. Then, naturally, the digital camera was stolen when set down for barely a second. Now that I've got a new one, and begun scanning, I've realized this isn't quite as simple as I'd surmised.

Any of you have any tips on easing and speeding up the process?

As it is, to get pictures of good quality, I have to take them in a high resolution mode, which makes each pic half a meg or so apiece. Then I have to go through and flip the pages so that they're not sideways, and then save them all as a smaller file before converting into a pdf. And then go back and redo some pictures that invariably don't capture perfectly, and insert them in at the proper points. Only to find that they've been put in at the wrong place and finding yet more missing pages. Yeesh.
Polverone
Now celebrating 18 years of madness

Posts: 3164
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

Unless you have a much better digicam than I think you do, suspend your book scanning project for now and wait until you can get a flatbed scanner. You don't need a fancy scanner. It should be able to capture 300 (or, better, 600) DPI in 8-bit grayscale or 1-bit lineart. Make sure you look at the *optical* resolution of the scanner, BTW, and ignore interpolated resolution. You should be able to pick up a used scanner that does this for US $20, or $40 for a new one.

Scan at the highest resolution you can afford (in terms of time, disk space, and scanner hardware). If there are very few pages with photographs or color, you can scan the whole book in 1-bit "lineart" or "bitonal" mode. If more than a few pages have photographs or non-lineart illustrations, scan in grayscale and later convert the pages without images into bitonal images (for better PDF compression). Clean up messy edges, blotches, and badly skewed images in Photoshop or another image editor. Run the cleaned-up images through Abbyy FineReader to produce a PDF with OCR text beneath the page images. Run the PDF through Acrobat 6 or the Silx compressor to convert bitonal images to JBIG2 compression.
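
To get a feel for what "afford in terms of disk space" means, the uncompressed size of a page can be estimated from its dimensions, the DPI, and the bit depth. This is only a rough sketch: the 6 x 9 inch page size is an assumption chosen for illustration, and real files will be smaller once any compression is applied.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row is rounded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed page size: 6 x 9 inches (hypothetical; measure your own book).
for dpi in (300, 600):
    for label, bpp in (("1-bit lineart", 1), ("8-bit grayscale", 8)):
        size = raw_scan_bytes(6, 9, dpi, bpp)
        print(f"{dpi} DPI, {label}: {size / 1_000_000:.2f} MB per page")
```

Doubling the DPI quadruples the raw size, and 8-bit grayscale is eight times the size of 1-bit lineart at the same resolution.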

Especially if you end up having a lot of grayscale or color pages (illustrations and whatnot), you might want to consider using DjVu instead of PDF, since it compresses better all-around and does an especially good job of handling pages containing a mixture of text/lineart and continuous-tone images.

PGP Key and corresponding e-mail address
JohnWW
International Hazard

Posts: 2849
Registered: 27-7-2004
Location: New Zealand
Member Is Offline

Mood: No Mood

I agree. The best format, in terms of required space, to store monochrome scanned page images in is PCX, which good graphics viewers and editing programs can use, as either 1-bit line-art or 8-bit grayscale (the latter can use 256 shades of gray). 300 dpi resolution is sufficient for larger type sizes, and 600 dpi or more is advisable where there is particularly small type or fine detail.

John W.
janger
Harmless

Posts: 40
Registered: 20-8-2004
Member Is Offline

Mood: No Mood

 Quote: Run the cleaned-up images through Abbyy FineReader to produce a PDF with OCR text beneath the page images.
Are you saying Abbyy is better than omnipage? The latter has never worked well for me. Was going to download finereader but wanted to hear good news about it first. Is it less of a hassle than omnipage? I have several nice books I'm willing to contribute - "Lob, Lorenz (1898) - Electrolysis and Electrosynthesis of Organic Compounds" to name one.

But it all depends on finding a decent OCR app.

Dave

[Edited on 25-8-2004 by janger]
Polverone
Now celebrating 18 years of madness

Posts: 3164
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

I have never used omnipage, but I would say that FineReader is good. It is easy to use. If you have already scanned and cleaned up your images, use a batch image converter like that in irfanview to convert them to a format FineReader can read (if they are not already in such a format). Uncompressed TIFF is what I use.

You "open and recognize" the directory full of images. Wait for recognition to finish... then export all pages in PDF format, with recognized text below page images and page images downsampled to 300 DPI. Use Silx or Acrobat 6 to apply JBIG2 compression before distributing the PDF.

While you're scanning, make sure your page images are named something like page0001.tiff, page0002.tiff, etc. (always writing any leading zeroes out) so that everything will be in the right order when you load the directory full of images into FineReader. If you omit the leading zeroes you will see orderings like
...
18.tif
19.tif
1.tif
20.tif
21.tif
...
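
If a batch of scans has already been saved without leading zeroes, the names can be repaired after the fact. The following is a minimal sketch, assuming files named with a bare number and an extension such as 18.tif; the "page" prefix and the 4-digit width are illustrative choices, not anything FineReader requires.

```python
import re


def padded_name(filename, width=4, prefix="page"):
    """Return a zero-padded name for a file such as '18.tif'."""
    match = re.fullmatch(r"(\d+)\.(\w+)", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    number, extension = match.groups()
    return f"{prefix}{int(number):0{width}d}.{extension}"


names = ["18.tif", "19.tif", "1.tif", "20.tif", "21.tif"]
renamed = sorted(padded_name(n) for n in names)
print(renamed)
```

To rename the files on disk, each old/new pair can be passed to os.rename once the new names look right.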

PGP Key and corresponding e-mail address
Organikum
resurrected

Posts: 2284
Registered: 12-10-2002
Location: Europe
Member Is Offline

Mood: lonely

MEPHISTO !!

Ask mephisto, he definitely makes the very best pdfs from books - no doubt. He is the only one who has managed to get the page numbering right! I LOVE him!

There's always something
JohnWW
International Hazard

Posts: 2849
Registered: 27-7-2004
Location: New Zealand
Member Is Offline

Mood: No Mood

How would you handle graphics in books, and their positioning among lines of text? Are there other image formats able to be used with Abbyy Finereader besides TIF, e.g. JPG for color and grayscale?

Also, you say that Abbyy Finereader can "export all pages in PDF format" after OCR. Is each page an individual PDF file, and if so, how can they be combined into a single PDF file? What about the page breaks and text formats and fonts - how are they preserved in the OCR process?

John W.
Organikum
resurrected

Posts: 2284
Registered: 12-10-2002
Location: Europe
Member Is Offline

Mood: lonely

The OCR solely makes the book full-text searchable; the plain text is stored separately for this. The book as you see and read it consists of pictures stored in some compressed picture format - the best is JBIG2 afaik. That needs the Silx compressor though, but I am not the specialist.

MEPHISTO!!!!!

There's always something
Mephisto
Chemicus Diabolicus

Posts: 294
Registered: 24-8-2002
Location: Germany
Member Is Offline

Mood: swinging

Organikum: Nice to hear that you like my ebooks! Although I heard something from you that sounded like: "The ammonia gas synthesis is CRAP in this form".

JohnWW: Pages which contain only text should be scanned at a resolution of 600 dpi (greyscale). Finereader converts these greyscale scans to black/white very well. This is necessary because Silx can only compress b/w PDFs (not greyscale!). For example, Silx compressed one of my ebooks with 2149 pages to a size of 32.5 MB. You see, Silx is the most effective compressor, but it can only handle b/w.
If the page contains text and a grey picture, Finereader can't convert it properly to b/w. You can do this yourself with Corel Photo-Paint (for many pages, use a batch process). With Floyd-Steinberg, for example, you get a b/w picture which looks like a greyscale picture.
The alternative is to keep the greyscale (or colour). Personally I convert everything to b/w, because my books contain only text and illustrations. For non-black/white ebooks I am the wrong person to ask.
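
For anyone curious what Floyd-Steinberg actually does to a grey picture, here is a minimal pure-Python sketch of the error-diffusion idea. It is not Corel Photo-Paint's implementation; the image is assumed to be a plain list of rows of 0-255 values, and the threshold of 128 is an illustrative default.

```python
def floyd_steinberg(gray, threshold=128):
    """Dither a greyscale image (rows of 0-255 ints) to pure 0/255 pixels."""
    height = len(gray)
    width = len(gray[0]) if height else 0
    # Work on a float copy so the diffused error can accumulate.
    work = [[float(v) for v in row] for row in gray]
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255 if old >= threshold else 0
            out[y][x] = new
            err = old - new
            # Spread the rounding error over the not-yet-visited neighbours.
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out


# A flat 25% grey patch comes out as a scattering of white dots on black.
patch = [[64] * 16 for _ in range(16)]
dithered = floyd_steinberg(patch)
white = sum(v == 255 for row in dithered for v in row)
print(f"{white} of 256 pixels are white")
```

Because each pixel's rounding error is pushed onto its neighbours, the average brightness of an area is roughly preserved, which is why the result looks grey from a distance even though every pixel is black or white.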

The size of the pages plays an even more important role! Use Corel Photo-Paint (batch process) to cut off the dark edges of the pages. This way the pages will also be constant in height and width. A custom script for a batch process can easily be made in Corel Photo-Paint by recording a manual cropping of one page. For more professional PDFs I recommend the PitStop plugin for Adobe Acrobat Professional (I'll upload it to the ftp soon).

Note: Next month the following books will be available: "Warren, S. - Organische Retrosynthese.pdf" and "Christine L. Willis, Martin Willis - Syntheseplanung in der Organischen Chemie.pdf"

[Edited on 26-8-2004 by Mephisto]

Visit the German synthesis collection LambdaSyn and our new LambdaForum!
janger
Harmless

Posts: 40
Registered: 20-8-2004
Member Is Offline

Mood: No Mood

Just wanted to say I downloaded finereader. Very impressive! Easy to use compared to omnipage (it sux), and a damn lot more accurate.

Dave
Mephisto
Chemicus Diabolicus

Posts: 294
Registered: 24-8-2002
Location: Germany
Member Is Offline

Mood: swinging

Enfocus Pitstop Professional

You can find the PitStop plugin for Adobe Acrobat Professional now on the ftp (/upload/PDF-Tool Enfocus Pitstop Professional - by Mephisto.rar).

Visit the German synthesis collection LambdaSyn and our new LambdaForum!
Hermes_Trismegistus
International Hazard

Posts: 602
Registered: 27-11-2003
Location: Greece, Ancient
Member Is Offline

Mood: conformation:ga

AAAARRRRGGGHHH!!!!

Damn geekspeak.

I've been downloading REAMS of stuff off of the ftp site, and starting to feel a little guilty about my lack of reciprocity.

I see no reason why I shouldn't be scanning some of the neat books I check out at the library, and usually only get a chance to skim through before I have to bring them back.

It has occurred to me, that I could probably scan books with one hand, at the same time I am studying my subjects.

For instance, in calculus, I read the same paragraphs as many as a dozen times before the light comes on (in my head).

Why couldn't I take a moment every several seconds to turn to the next page of a book that was sitting on the scanner and press a button?

Slowly but surely the book would get scanned in right?

However, the "tips" that were placed earlier in this thread lost me pretty quick and left me needing a compu-dictionary. So I am going to document my efforts in case there is anyone else out there as thick as I am. This is my plan.

2. Learn to use scanner to get readable text onto screen
3. figure out how to take the readable text scans and make them into searchable PDF's like all the cool kids are doing.
4. Share the PDF's with anyone who'll have them.

Arguing on the internet is like running in the special olympics; even if you win: you\'re still retarded.
tom haggen
National Hazard

Posts: 488
Registered: 29-11-2003
Location: PNW
Member Is Offline

Mood: a better mood

In order to make a PDF you have to have a full version of acrobat, which I have if you need help getting. My point is that there are a lot of acrobat reader programs out there that allow you to read pdf files but not create your own. Just something to keep in mind.

N/A
Hermes_Trismegistus
International Hazard

Posts: 602
Registered: 27-11-2003
Location: Greece, Ancient
Member Is Offline

Mood: conformation:ga

I didn't know that.

Plan subsections

1. How much can I spend?
2. What kind of scanner fits my needs (copyright violation)
3. What scanner will allow me to scan text in such a way as to make it possible to make a pdf later (wouldn't that be a bitch if my scanner made wonderful pictures but couldn't scan text into a format that would make an ebook!)
4. What is a reasonable speed expectation?
5. Are scanners reliable? Is it safe to buy a used one?
6. what kind of maintenance costs are associated with scanners (are there any consumables within one?)

These questions might seem obvious to an average teenager, but I've only used a scanner once in my life. I was buddy-buddy with the principal in high school and he called me into the office to check out this new-fangled gadget he got the school board to pop for. It was a "scanner" that looked like a phaser off of Star Trek: The Next Generation (which was a newer show on TV at the time), and you dragged it down a page to scan in....say your signature. It took about three minutes to scan a swath 5 inches wide and the length of a page. It had resolution that almost matched the Commodore computer, but it did go well with the principal's brand new 386 with 8 whole megs of memory!!!

Anyway, an update....

I figure I should get an OK scanner if I'm willing to spend no more than 4 million dollars canadian on a used model. That's about 150 bucks US. (canuck joke about the state of our dollar)

Arguing on the internet is like running in the special olympics; even if you win: you\'re still retarded.

I use Omnipage (from the Canon) to straighten bmp pages and save them as tif - this takes about 10 seconds per 100 pages. But that's all that I use it for. I use the built-in OCR of DjVu and Acrobat for OCR because I don't really care and only do it to make searching easier, when I remember to OCR before uploading. I don't use the other software (that came with the scanners) that I haven't mentioned - it is useless for books.

Inorg Syn, Inorganic Preparations, Inorganic Lab Preps, Glassblowing for Lab Techs, Systematic Organic Chemistry, Techniques of Glass Manipulation (not the scanners fault), and Preparation of Organic Intermediates were made by the Canon. Oxidations in Organic Chemistry, Reductions by the Alumino and Borohydrides, Mellor 8, and Vogel 5 are from the Visioneer.

I'd gladly trash both scanners for something that scans faster and brighter, like the copy machines at the library. I only get 2.75 pages per minute, either one. Neither will scan continuously, I have to push a button each time. I am not satisfied with either one.

Of course software that would automatically straighten and crop to a certain page size (centering the text) as the pages are scanned would be very useful, therefore it does not seem to exist.
S.C. Wack
bibliomaster

Posts: 2419
Registered: 7-5-2004
Location: Cornworld, Central USA
Member Is Offline

Mood: Enhanced

This is going to be a long post. I scanned another book today, the Gattermann-Wieland that I've pasted from twice before, but I want to talk about a previous one first.

If you take a look at Zubrick, you'll see that it is better than the others, even the one after it and this latest one. This is because it was scanned in 600 dpi. The whole thing, scanning to uploading, took 4 hours. Can you make room for 4 hours?

I still have the same 2 crappy scanners, though. To recap, one scans 600 very slow, but the other scans 600 as fast as the other scans 300 - but cannot take the weight of a book pressed down on it. Well, Zubrick was a paperback. The covers were torn off and then the glue was scraped and sanded off. This made the whole process easier than usual, and allowed use of the better scanner.

I hesitated to give advice before, and I was led astray by others' advice when I got my scanner. After a lot of trial, I trashed everything that I had heard and figured it out for myself. Worked out well. I still believe in what I said about hand-holding, but then again, no one is scanning anything. So this is my process, FWIW. The straightening, resizing canvas, cropping, resizing canvas again, straightening again, then resizing again that follow are all batch processes that take 2 minutes at the most per 100 pages each. Earlier, I was trying to crank them out as fast as I could, and didn't care if they were far from professional. I deliberately put off doing the better books. I feel bad about doing some of what I've done in only 300 dpi, but can't afford more scanner right now.

The scanner is set to scan in .bmp the minimum length and height possible. The default darkness setting is perfect on the nice scanner, and worthless on the other. Experimentation showed that turning the gamma control all the way down to .5 was best. All of the left hand pages are scanned at the same time, and then the right pages are scanned to another folder. Since all of the scanning is done in the corner of the scanner, this means that the left hand pages are scanned upside down.

It is very important with this process to keep each page near the same position. This gets difficult in certain areas of the book where pages on top of the one you're scanning - or the cover - make it difficult to position the page in the corner properly. If you're careful you could do it outside of the corner, but I digress. You get the point.

Now I have a left folder and a right folder. More folders are created for each side to accommodate the next steps. Every step and every side gets a new folder.

First thing to do is straighten the pages. Maybe your scanner came with OCR software. Finereader straightens best of the 3 I've tried. You don't have to go through the reading process or anything, just File>Open>Ctrl+A and the whole folder is straightened in a flash. Select all, despeckle, select all again, save image as, in new folder, as bmp.

There is some distortion on some pages though, occasionally a line of text has a top half that doesn't quite match the bottom half, on the pages that the OCR program rotated. Unavoidable, except to keep them straight to begin with, which is harder than you think. So it might be better to batch rotate the pages that are upside down in photoshop, then run through OCR, rather than have the OCR program do it and mess with every page.

The upside down pages are now right side up. If there are pages with tables I make sure that the OCR program didn't make any pages horizontal, which is bad in the next step.

I make the pages as straight as possible to begin with, because the straightening process adds to the # of pixels. It adds blank area to make it vertically rectangular again, and you want to add as little as you can.
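
The amount of blank area added by straightening can be estimated from the skew angle: rotating a page and then enclosing it in an upright rectangle grows both dimensions. A rough sketch, assuming a simple rotation about the centre and angles well under 90 degrees; the 3600 x 5400 pixel page (6 x 9 inches at 600 dpi) is an illustrative assumption.

```python
import math


def straightened_canvas(width, height, skew_degrees):
    """Size of the upright rectangle enclosing a page rotated by skew_degrees."""
    theta = math.radians(abs(skew_degrees))
    new_width = width * math.cos(theta) + height * math.sin(theta)
    new_height = width * math.sin(theta) + height * math.cos(theta)
    return math.ceil(new_width), math.ceil(new_height)


# Assumed page: 3600 x 5400 pixels, scanned with various amounts of skew.
for skew in (0.5, 1, 2):
    w, h = straightened_canvas(3600, 5400, skew)
    extra = w * h / (3600 * 5400) - 1
    print(f"{skew} degrees: {w} x {h} pixels ({extra:.1%} more area)")
```

Even a couple of degrees of skew adds a noticeable border, which is why it pays to lay the page down as straight as possible in the first place.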

I've used different versions of Photoshop and Corel for the rest. With Corel, I don't know how to batch process anything when the files are in different sizes, because the script I get is based on the original size of the template. Remember that I am computer illiterate and have no desire to learn Java or Visual Basic. But with Photoshop this is easy and makes the canvas the same size regardless of whether or not the files to be converted are of different sizes. The canvas is resized to the original size of the scan. Border is trimmed off. This takes 2 minutes and no guesswork.

Now that everything is the same size, I can do the rest either in Corel or Photoshop: I find a page that is in between extremes and crop it to the minimum size, recording the script that the program uses to do this. Both programs have a script recorder, though they are not particularly easy to find. That script is used to batch crop every page on that side. With practice, you can do it right the first time.

Any pages that were too far off to one side are deleted and the page is processed manually. These are rare if you do it right.

Now any one of these cropped pages is resized to a certain size, the image is centered onto a larger canvas. Using that script, all of the pages are batch resized and centered. Using Mephisto's pdf compression tool, certain sizes do not compress at all, I've never figured out the rhyme and reason with this despite extensive testing with many sizes.
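
The centering step boils down to computing equal margins and pasting the cropped page into a blank canvas. Here is a minimal sketch of that arithmetic, assuming the image is a plain list of rows of grey values and the canvas is white (255); it is not the script Photoshop or Corel records.

```python
def center_on_canvas(image, canvas_width, canvas_height, background=255):
    """Place a greyscale image (rows of ints) in the middle of a larger canvas."""
    height = len(image)
    width = len(image[0]) if height else 0
    if width > canvas_width or height > canvas_height:
        raise ValueError("canvas is smaller than the image")
    # Integer division puts any odd leftover pixel on the right/bottom edge.
    left = (canvas_width - width) // 2
    top = (canvas_height - height) // 2
    canvas = [[background] * canvas_width for _ in range(canvas_height)]
    for y, row in enumerate(image):
        canvas[top + y][left:left + width] = row
    return canvas


page = [[0, 0], [0, 0]]           # a tiny 2 x 2 black "page"
for row in center_on_canvas(page, 4, 4):
    print(row)
```

Since every page ends up on a canvas of identical size, all pages in the finished PDF share the same dimensions.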

Now I go through the pages in Photoshop, erasing here and there. If a page is a little off-center, all I have to do is drag it while holding Ctrl. Super simple. A 450 page book can be half-ass edited in an hour.

But not all of the pages are straight. No matter how many times you run it through Finereader, it will find more to straighten, just a little bit more. So everything is straightened and the canvas resized (to the size before straightening) as before.

Last, I batch rename the files, putting both sides into the same folder for the first time. The right pages are given a 4 digit serial+a, then the left 4 digit serial+b. This puts the pages in perfect order, as long as there are no skipped pages. I convert bmp to CCITT G4 pdf, and use Mephisto's compression tool to shrink this. The number of pages in the pdf makes a little difference in the amount of compression. I find it best to make 200 page pdfs, compress, then put those pdfs together. Like I said, OCR keeps finding new things to do, so I don't use Acrobat's built-in OCR or Finereader because these will again increase the size of the pages they decide to align, and you can't read without realignment AFAIK.
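
The renaming scheme above (right pages as serial+a, left pages as serial+b) and the 200-page grouping can be sketched in a few lines. This only computes the names and groups; the .bmp extension, the function names, and the example file names are assumptions for illustration, and the actual renaming tool is not specified here.

```python
def interleaved_names(right_files, left_files, extension="bmp"):
    """Map (side, original name) to a new name that sorts into reading order."""
    mapping = {}
    for serial, name in enumerate(right_files, start=1):
        mapping[("right", name)] = f"{serial:04d}a.{extension}"
    for serial, name in enumerate(left_files, start=1):
        mapping[("left", name)] = f"{serial:04d}b.{extension}"
    return mapping


def chunks(items, size=200):
    """Split a list of pages into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical scan names; both folders are assumed to be in scan order.
right = ["scan_r1.bmp", "scan_r2.bmp", "scan_r3.bmp"]
left = ["scan_l1.bmp", "scan_l2.bmp", "scan_l3.bmp"]
ordered = sorted(interleaved_names(right, left).values())
print(ordered)
print([len(group) for group in chunks(list(range(450)))])
```

A plain alphabetical sort then alternates right and left pages, provided neither folder has a skipped page; comparing the lengths of the two lists before renaming is a cheap way to catch that.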

With DjVu, I batch convert the bmp to tif because tif gives smaller djvu.

This is all very easy, the most time consuming part is the scanning itself despite all the steps I've written. Give it a try and you will see. And it really doesn't take long to scan a book. 3 or 4 pages a minute.
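
As a quick sanity check on the time involved, scanning time is just page count divided by pages per minute. A small sketch using the 3 to 4 pages per minute quoted above; the 450-page book length is only an example figure.

```python
def scanning_hours(page_count, pages_per_minute):
    """Hours of pure scanning time, ignoring setup and post-processing."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return page_count / pages_per_minute / 60


for rate in (3, 4):
    print(f"450 pages at {rate} pages/min: {scanning_hours(450, rate):.2f} hours")
```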

It's not like I go to the Library of Congress for my books or anything, I haven't yet used interlibrary loan even. What's in your library?

Anyways...Here is Gattermann. But there are already 2 widely distributed versions available, why a 3rd? Rhadon's lovely (as usual) 43rd ed. is in German, and the 1901 edition is a little, um...

This 1937 translation of the 1935 24th edition has been a treasure to me for a long time and you probably won't feel the same way. I got it from the local used bookstore and found that it had never even been read, many pages were stuck together due to the slightly imperfect edge trimming. It has "complimentary professional copy" stamped on the cover, so I suppose it became a sale reject for that reason. It is very much a product of its time, the perfect time IMHO; it has hydrogenation with Ni, Pd, and Pt alongside extraction of urea and uric acid from urine. Both Organic Syntheses and the Merck Index refer to this edition in some syntheses.
JohnWW
International Hazard

Posts: 2849
Registered: 27-7-2004
Location: New Zealand
Member Is Offline

Mood: No Mood

The software packages you mention - they should be uploaded to a "warez" folder on the FTP.
S.C. Wack
bibliomaster

Posts: 2419
Registered: 7-5-2004
Location: Cornworld, Central USA
Member Is Offline

Mood: Enhanced

[hint]Corel and Finereader are widely available[/hint]

A few thousand people might be abusing Adobe products with emule, not that the management would condone such a thing.
Polverone
Now celebrating 18 years of madness

Posts: 3164
Registered: 19-5-2002
Location: The Sunny Pacific Northwest
Member Is Offline

Mood: Waiting for spring

thank you S.C. Wack

Since your scanned version of Gattermann is old enough not to cause legal trouble, it has been added to the library (http://www.sciencemadness.org/library/). Downloads from the library should be quick and easy.

PGP Key and corresponding e-mail address
S.C. Wack
bibliomaster

Posts: 2419
Registered: 7-5-2004
Location: Cornworld, Central USA
Member Is Offline

Mood: Enhanced

I should add that all of my scanning is b/w, except when there are pictures. These are in greyscale and edited in photoshop by using the black and white points. The covers of Vogel are the only color, also the only image resizing done, it came out pretty good.
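
Setting black and white points amounts to a linear stretch: everything at or below the black point becomes 0, everything at or above the white point becomes 255, and values in between are spread across the full range. A minimal sketch of that idea, not Photoshop's exact algorithm; the image is assumed to be a list of rows of 0-255 values and the points 50 and 200 are illustrative.

```python
def apply_levels(gray, black_point, white_point):
    """Stretch greyscale values so black_point maps to 0 and white_point to 255."""
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    scale = 255 / (white_point - black_point)
    return [
        [min(255, max(0, round((v - black_point) * scale))) for v in row]
        for row in gray
    ]


# Greyish paper (200) becomes pure white, faded ink (50) becomes pure black.
print(apply_levels([[10, 50, 100, 200, 250]], 50, 200))
```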

I tried to break the speed record today, and did this in 3 1/2 hours, scanning, ocr, everything. Oh, OCR. Well-

I was doing housekeeping on my computer and found that I hadn't yet deleted the uncompressed pdfs of Gattermann, Brauer, and Zubrick. I already made OCR'd djvu of all of them and was going to upload them to the FTP when it comes back. But I got to thinking. I don't know what the latest version of DjVu Editor does, just that it must be hard to warez since I couldn't find it anywhere. But the lack of a page renaming function has always pissed me off. I mentioned the duplicate file uploading before, I'd rather upload just 1 version. I decided that the larger file size and different size pages of OCR pdf wasn't enough to warrant continuing to make djvu dupes, especially now with the higher pdf compression. So from now on, I'll only put out uncorrected OCR'd pdf, no more books in djvu.

And since I had those old pdf's handy, I OCR'd them too, and might as well since I won't be uploading the DjVu's anywhere. So all of these rars contain the OCR'd versions. There are probably some tables rotated. I didn't go through them, much less change anything.

I have Shriner's Systematic Identification of Organic Compounds and was going to scan it, but then I did those 2 Vogels. Now there are 3. There is some good stuff in Shriner, but oh well. I have volume 1 of Vogel's Elementary Practical Organic Chemistry (1957) and there probably isn't anything in there that isn't in the 3rd ed we already know. And the qualitative organic analysis that is volume 2 is probably the same as what is in the back of Vogel's 3rd ed.

However, volume 3, Quantitative Organic Analysis (1958), is a little different, and much different than the fifth ed. that I scanned earlier. There are no machine analyses here, but the main thing is that it is a really good read. Highly recommended, lots of good stuff in this 239 page book.

vogel_elementary_quantitative_organic_analysis.rar (2629536 Bytes)

gattermann_ocr.rar (4377527 Bytes)

zubrick_ocr.rar (4475679 Bytes)

brauer_ocr.rar (17584157 Bytes)
Mephisto
Chemicus Diabolicus

Posts: 294
Registered: 24-8-2002
Location: Germany
Member Is Offline

Mood: swinging

Book-Scanner

Funny how every demand for new products, even from minorities like book scanners, is satisfied by industry. I didn't notice it, but at the end of 2004 the firm Plustek brought a 'low-cost' scanner onto the market which is specially designed for scanning books. Sometimes evil capitalism works quite well.

The advantages compared to a common flatbed-scanner are the speed of scanning and technical design that allows one side of a book to lie flat on the scanner glass and a scanning head mechanism that can read right up to the edge where the book spine is placed.
Visit the German synthesis collection LambdaSyn and our new LambdaForum!

S.C. Wack

I slightly curved the frame of the Canon for TS1 by hand, eventually fixing the problem mentioned earlier. This allows me fast 600 dpi scans now, so I thought that I would scan a somewhat modern organic textbook that is actually used (in newer edition) at schools here in the USA. Just so that there is one in the collection, and I liked this one the best even though it doesn't have many preparations in it. It's just an entry-level lab book where everything has to be explained in a simple way to stupid American slacker kids and Koreans-on-student-visas. (The local state college has a requirement that you can only take any particular chemistry class 3 times if your seat can be used by someone whose ability to fail is less proven).

It's a shame that I didn't fix the scanner sooner. It took many hours to bend it the right way. Fortunately the problem could be made worse, which gave clues how to make it better. Who would have guessed that one of those weird determination jags would pay off with success?

BTW, if it isn't obvious yet to those who inquired about TS1 (my last scan before this), HH bought non-prison toilet paper with my donation to his account, yet later on wiped his ass with my mail. Not that this surprises in the least, I'm just saying. It will not be made publicly available by me any time soon, just download TS2 from somewhere and forget it. He'd be happy to hear from women, though.

Do not scan text in color or greyscale! Do not be put off by hideous quality image preview on your scanner software when you do a scan, it probably looks much better with another viewer! Do not even think of using scanner bundled software for anything except running your scanner, download non-crap software for image manipulation instead!
