Finally!

It’s done.

Mission completed.

As you know if you’ve been following me for a while, I’m a big fan of David Allen and his “Getting Things Done” system, even though I don’t necessarily agree with everything he says. One of the things I do agree with, however, is that a project you have thought of but not yet completed holds on to a part of you until you somehow get it out of your system. The easiest way to do that, of course, is to put it on a list with a new action, or just file it as a “someday, maybe” and review it every now and then, postponing it until you’ve got the time to do it. The hardest way to do it (and I’m sure David would beat me with a stick when hearing this) is to just muster up enough frustration until the project becomes your first priority for that very reason, even though there are probably many things that are objectively more important. So, guess what… I went with door number 2!

The Project

So, what on earth could be so frustrating, yet so unimportant, that it would make me make such a bad decision? Paper. Plain and simple. I hate it!

Ever since the first scanner made its appearance, and ever since there’s been talk about the paperless office, I’ve had this nagging itch that it would be great to have everything stored digitally instead of on paper, or at least both ways. As the years go by, more and more paper accumulates in filing cabinets and drawers. Most of it is utterly unimportant and odds are I will never look at it again, yet I feel reluctant to throw it away. Now, you could argue that this is a problem better discussed with my therapist than a technical problem to solve, but using technology to solve life issues is kind of what my life (and this blog as well) is about, so I finally decided to take the bull by the horns.

Project plan

First of all, I needed to get an idea of how much paper I was going to handle in my backlog (that is, all the stuff I kept around that really could get thrown away) to get a feeling for what kind of solution I would need. Going through my cabinets and drawers gave me a rough estimate of about 3000 sheets of paper, which called for some serious equipment if I were ever going to actually go through with the project. Lucky as I am, I’m surrounded by equally geeky friends, so a few text messages later, my hardware of choice came down to the Fujitsu ScanSnap iX500, which was promptly ordered.

Second, I needed to play around with the hardware and software to decide which settings would give me the best results when it came to scanning, performing OCR and cataloguing all the documents. After fiddling with the scanner for a while, I found that keeping the scanning quality at its highest gave the best OCR results, and that having the software downscale the raw image before saving it to a searchable PDF kept the file size within reason.

When it came to cataloguing all the documents though, I sort of drew a blank. My Google skills and patience obviously failed me, since I’m sure there has to be some piece of software around that lets you search through a PDF document, extract important information and move or rename the file depending on what was found, but I was not able to find anything that met my requirements. Now, don’t get me wrong… There is plenty of software out there that performs some part of this task, but not in the workflow I was looking for.

So, what is one to do?

Hacking time

Luckily for me, I know how to code 🙂

I decided that the best thing to do was to create a Windows service that would watch the folder where all the documents from my ScanSnap were dumped, and process each one of them according to a set of rules that I would set up. The rules would be of the type “If a document contains the word ‘Telia’ (my cell phone carrier), then look for the text ‘invoice date: XXXX’, rename the file to Telia_Invoice_XXXX and move it to the Telia folder”.
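To give a rough idea of what such a rule boils down to, here is a minimal sketch in Python. It is not the code from my service, just an illustration of the idea. The `Rule` class, the `apply_rules` function, the regular expression and the `.pdf` file extension are all assumptions made for the example, and it assumes the OCR text of the document has already been extracted into a plain string.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: keyword to look for, pattern to extract, where to file."""
    keyword: str        # word that must appear somewhere in the document
    pattern: str        # regex with exactly one capture group (the XXXX part)
    name_template: str  # new file name, with {value} replaced by the captured text
    folder: str         # destination folder name


# Example rule mirroring the Telia invoice case described above.
RULES: List[Rule] = [
    Rule("Telia", r"invoice date:\s*(\S+)", "Telia_Invoice_{value}", "Telia"),
]


def apply_rules(text: str, rules: List[Rule] = RULES) -> Optional[Tuple[str, str]]:
    """Return (folder, new_file_name) for the first matching rule, else None."""
    lowered = text.lower()
    for rule in rules:
        if rule.keyword.lower() not in lowered:
            continue  # keyword missing, so this rule does not apply
        match = re.search(rule.pattern, text, re.IGNORECASE)
        if match:
            new_name = rule.name_template.format(value=match.group(1)) + ".pdf"
            return rule.folder, new_name
    return None  # no rule matched; leave the document for manual filing
```

Calling `apply_rules` on the text of a Telia invoice that contains "Invoice date: 2014-03-01" would give the folder `Telia` and the file name `Telia_Invoice_2014-03-01.pdf`. A document that matches no rule gives `None`, so nothing gets renamed by accident.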

Setting up a list of such rules then allowed me to easily batch scan all my documents and have them end up roughly in the right places with meaningful names. The best part is that, now that the rules are set up and the service is running in the background, all documents that come in the door from now on will be scanned, then canned.
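The "sweep the scanner folder and file whatever matches" part can be sketched in a few lines as well. Again, this is an illustration in Python under my own assumptions, not the actual Windows service: the `sweep_inbox` function and its `classify` parameter are made up for the example, and `classify` stands in for whatever reads a PDF and decides on a folder and a new name.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# Hypothetical classifier: given a PDF path, return (folder, new_file_name) or None.
Classifier = Callable[[Path], Optional[Tuple[str, str]]]


def sweep_inbox(inbox: Path, archive: Path, classify: Classifier) -> List[Path]:
    """Move every classified PDF from inbox into archive/<folder>/<new name>.

    Returns the list of destination paths for the files that were moved.
    """
    moved: List[Path] = []
    for pdf in sorted(inbox.glob("*.pdf")):
        result = classify(pdf)
        if result is None:
            continue  # no rule matched; leave the file where it is
        folder, new_name = result
        target_dir = archive / folder
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / new_name
        if target.exists():
            continue  # never overwrite a document that is already filed
        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved
```

A background process could simply call `sweep_inbox` every few seconds. Polling like this is the simplest stand-in for a proper file system watcher, and skipping files whose target name already exists means a duplicate scan stays in the inbox instead of silently replacing the filed copy.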

(The whole project can be viewed here on CodeProject for those of you who are interested in the technical aspects of the solution.)

Last words

Now that everything is scanned, and all the paper is thrown in the bin, it’s easy to say that it was worthwhile. All in all, the project has not required that much actual work, even though the scanning part took almost 6 hours of continuous scanning. The IX500 was worth every cent, as it has performed flawlessly and even the OCR of the bundled software is quite satisfactory.

If you want to do something similar and need help changing the program or setting up the rules of the project, let me know in a comment below and I will try to help you out as best I can.