How to enable the OCR (Optical Character Recognition) conversion for embedded images with text in PDF documents.

Article ID: 100058826

Description

The Storage service converts items to HTML or text where possible, and the converted content is then used to index the item. The Enterprise Vault Storage Service uses Outside In® Technology content converters from Oracle® Corporation to convert most file types. To provide Optical Character Recognition (OCR) conversion for image file types, Enterprise Vault uses the Windows TIFF IFilter, an optional Windows feature that the Enterprise Vault installer enables automatically if it is not already enabled.

To enable OCR conversion for images embedded in PDF documents, follow these steps:

  1. In the Enterprise Vault site properties, click the Advanced tab, select “Content conversion”, and change the following two settings:
     - Set “OCR conversion of embedded images” to “ON”.
     - Set “OCR conversion of scanned pages” to “ON”.
  2. Click “Apply” and then “OK”.
  3. Restart all Enterprise Vault services.
  4. Archive a new email with a PDF attachment that contains an image with text in it. Once the email is archived and indexed, the content of the embedded image is searchable.

Note: To enable OCR conversion for MS Word documents, only the “OCR conversion of embedded images” setting needs to be enabled.
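
The service restart in the steps above is normally done from the Windows Services console. As a rough sketch only, it could also be scripted. The example below assumes that all Enterprise Vault services have a display name starting with "Enterprise Vault" and that PowerShell is available on the server; the function names and the display-name pattern are illustrative and are not part of the product. It defaults to a dry run that only prints the command, so the command can be reviewed before anything is restarted.

```python
import subprocess

# Assumption: every Enterprise Vault service has a display name that starts
# with this prefix. Verify the names in the Services console before use.
EV_DISPLAY_NAME_PATTERN = "Enterprise Vault*"


def build_restart_command(pattern: str = EV_DISPLAY_NAME_PATTERN) -> list[str]:
    """Build a PowerShell invocation that restarts services matching `pattern`."""
    # Get-Service and Restart-Service are standard PowerShell cmdlets.
    # -Force lets a service restart even if other services depend on it.
    script = f"Get-Service -DisplayName '{pattern}' | Restart-Service -Force"
    return ["powershell.exe", "-NoProfile", "-Command", script]


def restart_ev_services(dry_run: bool = True) -> int:
    """Print the restart command (dry run) or execute it and return its exit code."""
    command = build_restart_command()
    if dry_run:
        # Only show what would be executed; nothing is restarted.
        print(" ".join(command))
        return 0
    # Requires an elevated (administrator) session on the Enterprise Vault server.
    return subprocess.run(command, check=False).returncode
```

Calling `restart_ev_services()` prints the command for review; `restart_ev_services(dry_run=False)` runs it. Services that are disabled may cause PowerShell to report errors, so check the result in the Services console afterwards.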

Additional Information

JIRA: CFT-5511