We are frequently asked why a document cannot be found by searching for text that it contains. Answering this question means explaining how documents are made searchable and some of the hurdles involved. How a document is published determines how it is made searchable. Here’s a quick overview of each type of document, how it is made searchable, and common problems.
Scanned, Faxed or Drag-and-Drop Documents
A scanned document is essentially a photograph of a piece of paper, and so is a faxed document. Our search engine reads only text. To convert an image into text, the document must go through a process called optical character recognition, or OCR. The quality of the scan makes a big difference to the OCR results; in the IT world there is an old saying: “Garbage in, garbage out.” Here are some considerations for getting the best results:
- The recommended scanning resolution for OCR accuracy is 300 dpi.
- Brightness settings that are too high or too low can reduce the accuracy of the OCR. A brightness of 50% is recommended.
- Straightness of the initial scan can affect OCR quality. Skewed pages can lead to inaccurate recognition.
- Older and discolored documents must be scanned in RGB mode in order to capture all of the image data.
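The considerations above can be sketched as a simple pre-scan checklist. This is only an illustration, not part of ENet Docs: the `ScanSettings` class, the `check_scan_settings` function, and the tolerance values for brightness and skew are all hypothetical, since the guidance above gives recommended values but no exact limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScanSettings:
    """Hypothetical description of a scan job; not an ENet Docs type."""
    dpi: int
    brightness_percent: int
    color_mode: str                  # e.g. "RGB", "grayscale", "bw"
    skew_degrees: float = 0.0
    discolored_original: bool = False


def check_scan_settings(
    s: ScanSettings,
    brightness_tolerance: int = 10,  # assumed tolerance, not from the article
    max_skew_degrees: float = 1.0,   # assumed limit, not from the article
) -> List[str]:
    """Return a list of warnings based on the scanning recommendations."""
    warnings = []
    if s.dpi != 300:
        warnings.append(f"Resolution is {s.dpi} dpi; 300 dpi is recommended.")
    if abs(s.brightness_percent - 50) > brightness_tolerance:
        warnings.append(
            f"Brightness is {s.brightness_percent}%; 50% is recommended."
        )
    if abs(s.skew_degrees) > max_skew_degrees:
        warnings.append("Page is skewed; straighten it before scanning.")
    if s.discolored_original and s.color_mode.upper() != "RGB":
        warnings.append("Older or discolored documents should be scanned in RGB mode.")
    return warnings


if __name__ == "__main__":
    job = ScanSettings(dpi=150, brightness_percent=80, color_mode="grayscale",
                       discolored_original=True)
    for message in check_scan_settings(job):
        print(message)
```

An empty list means the settings match the recommendations; each warning corresponds to one of the bullets above.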
Documents that you drag and drop into our folders are always passed through OCR to ensure searchability. Some need it and some don’t, but it’s better to be safe.
Documents that are printed using the ENet Docs Print Client (or the older Net.DFM Print Client) are not images and thus are not passed through OCR. Whether or not they are searchable is determined by their contents: if the contents are true text, the resulting document in ENet Docs will be searchable. However, if the contents are images, the resulting document will *not* be searchable. So, how can you tell? If you can select the text that you are trying to print, then *most* of the time the resulting document will be searchable. Some applications do not work well with the Print Client (Google Chrome is one example).
Things get more complicated with emailed documents, except for our cloud customers. For cloud customers, all emailed documents are OCR’d. For on-premise customers, the “to” address determines whether or not a document is OCR’d. Documents sent to “folder…” or “folder3…” are OCR’d. For technical reasons, documents sent to “netdfm…” are not OCR’d.
In short, for on-premise customers, the “to” address determines whether a document emailed to ENet Docs (or Net.DFM) is processed by the OCR engine.
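The email rules can be summarized as a small decision function. This is a minimal sketch, not how ENet Docs actually implements the routing: the function name `email_is_ocrd`, the `deployment` labels, and the idea of matching on the start of the address are assumptions made for illustration. The full addresses are elided in the text above, so the sample address below is made up.

```python
from typing import Optional


def email_is_ocrd(deployment: str, to_address: str) -> Optional[bool]:
    """Return True if an emailed document is OCR'd, False if it is not,
    or None if the rules described above do not cover the address.

    `deployment` is "cloud" or "on-premise" (hypothetical labels).
    """
    if deployment == "cloud":
        # Cloud customers: every emailed document is OCR'd.
        return True
    if deployment != "on-premise":
        raise ValueError(f"Unknown deployment type: {deployment!r}")

    # On-premise: the "to" address decides. Only the start of each address
    # is known from the article, so match on the local part's prefix.
    local_part = to_address.strip().lower().split("@", 1)[0]
    if local_part.startswith("netdfm"):
        return False
    if local_part.startswith("folder"):  # also covers "folder3..."
        return True
    return None  # not covered by the rules above


if __name__ == "__main__":
    # "example.invalid" is a placeholder domain, not a real ENet Docs address.
    print(email_is_ocrd("on-premise", "netdfm@example.invalid"))
```

Returning `None` for unrecognized addresses keeps the sketch from guessing about cases the article does not describe.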