How to enable OCR (Optical Character Recognition) conversion for text in multiple languages within images

Article ID: 100055724

Description

To provide Optical Character Recognition (OCR) conversion for image files, Enterprise Vault uses the Windows TIFF IFilter. The Windows TIFF IFilter is an optional Windows feature that the Enterprise Vault installer enables automatically if it is not already enabled. By default, OCR conversion is performed for English-language text. To enable OCR conversion for multiple languages, perform the following steps:

  1. Add the required language in Windows and install the corresponding Windows language pack (for example, Chinese) on the Enterprise Vault server.
  2. Create the following registry value on the Enterprise Vault server:

                 Name: OCRUseLocalServerSettings

                 Location: HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\KVS\Enterprise Vault\Storage

                 Type: REG_DWORD

                Set OCRUseLocalServerSettings to '1' to enable the local server settings and override the 'OCR language' defined in the Enterprise Vault site properties.

  3. Modify the following settings in the local Group Policy Object (GPO) on the Enterprise Vault server to enable multiple OCR languages:

            Computer Configuration | Administrative Templates | Windows Components | Search | OCR

            Enable the following GPO settings:

  • Force TIFF IFilter to perform OCR for every page in a TIFF document.
  • Select OCR language: Select the desired language from the drop-down list. If this setting is enabled, the selected OCR language is used for OCR processing.
  • Select OCR language from a code page: When this setting is enabled, you can select multiple languages for OCR. The selected OCR languages are then used in OCR processing during the indexing of TIFF files.

Note: All selected OCR languages must belong to the same code page. If languages from more than one code page are selected, the entire OCR language selection is ignored and only the default system language is used.

  4. After enabling the above settings, restart the Enterprise Vault server.
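
The code-page constraint described in the note under step 3 can be checked before the policy is changed. The following Python sketch is illustrative only: the language-to-code-page table is an assumption based on common Windows ANSI code pages, and the function name `shared_code_page` is hypothetical. The groupings that Windows shows in the policy setting itself are authoritative.

```python
# Assumed mapping of languages to Windows ANSI code pages, for illustration only.
# Confirm the actual grouping in the GPO setting before relying on it.
LANGUAGE_CODE_PAGE = {
    "English": 1252,
    "French": 1252,
    "German": 1252,
    "Russian": 1251,
    "Japanese": 932,
    "Chinese (Simplified)": 936,
}


def shared_code_page(languages):
    """Return the one code page shared by all languages, or None if they span several.

    Raises KeyError for a language that is missing from the assumed table.
    """
    pages = {LANGUAGE_CODE_PAGE[lang] for lang in languages}
    return pages.pop() if len(pages) == 1 else None


same_page = shared_code_page(["English", "French"])    # both map to one code page
mixed_pages = shared_code_page(["English", "Japanese"])  # different code pages
print(same_page, mixed_pages)
```

A result of `None` corresponds to the case in which the whole OCR language selection would be ignored and only the default system language used.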

Note: Enabling the setting 'Force TIFF IFilter to perform OCR for every page in a TIFF document' can significantly degrade archiving and indexing performance.

Warning: Incorrect use of the Windows registry editor may prevent the operating system from functioning properly. Great care should be taken when making changes to the Windows registry. Registry modifications should only be carried out by persons experienced in the use of the registry editor application. It is recommended that a complete backup of the registry and workstation be made prior to making any registry changes.
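
As an alternative to entering the value from step 2 by hand, the same setting could be prepared as a .reg file and reviewed before it is imported. The Python sketch below only renders the file text and does not touch the registry. The function name `render_reg_file` is hypothetical, and the key path, value name and data are the ones given in step 2.

```python
# Key and value taken from step 2 of this article.
KEY_PATH = r"HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\KVS\Enterprise Vault\Storage"
VALUE_NAME = "OCRUseLocalServerSettings"


def render_reg_file(key_path: str, value_name: str, dword: int) -> str:
    """Return .reg file text that sets a single REG_DWORD value under key_path."""
    if not 0 <= dword <= 0xFFFFFFFF:
        raise ValueError("A REG_DWORD value must fit in 32 bits")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        # .reg files store DWORD data as eight hexadecimal digits.
        f'"{value_name}"=dword:{dword:08x}',
        "",
    ]
    return "\r\n".join(lines)


reg_text = render_reg_file(KEY_PATH, VALUE_NAME, 1)
print(reg_text)
```

The printed text can be saved to a file with a .reg extension, checked against step 2, and imported on the Enterprise Vault server after the registry has been backed up.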

Issue/Introduction

How to enable OCR (Optical Character Recognition) conversion for text in multiple languages within images.

Additional Information

JIRA: CFT-5334