Best Practices for Configuring Classification with Data Insight

Article ID: 100060360

Description

This document helps administrators set up and configure Veritas Information Classifier (VIC) in the Data Insight environment so that data is processed as efficiently and with as few errors as possible.

General

Use recommended system configurations for better throughput.
Use a classification server pool of multiple nodes (3 minimum) to achieve higher throughput for large classification tasks.
Disable smart classification if not required (Disabled by default).
- Smart classification requires significant resources on Indexer and Management Server nodes to automatically generate the list of files to classify.
Raise the default disk safeguard thresholds, especially when classifying PDF files, where uncompressed files can consume up to 40GB of disk space (assuming 16 threads and file sizes of around 2.5 GB). The values below safeguard against disk usage reaching the maximum limit:
- Reset at 50 GB (or higher)
- Stop at 45 GB (or higher)
As a part of classification, Data Insight does text extraction and uses the data directory for storing temporary files.
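
The 40GB figure above follows from multiplying the number of classification threads by the largest file size expected. A minimal sketch of that arithmetic, assuming every thread holds one file of the maximum size at the same moment (the function name is illustrative and is not part of Data Insight):

```python
def worst_case_temp_usage_gb(threads: int, max_file_size_gb: float) -> float:
    """Upper bound on temporary disk usage if every thread holds one maximum-size file."""
    return threads * max_file_size_gb


# Figures from the text: 16 threads and files of around 2.5 GB each.
estimate = worst_case_temp_usage_gb(threads=16, max_file_size_gb=2.5)
print(f"Worst-case temporary usage: {estimate:.0f} GB")
```

Re-running the estimate with the thread count and file sizes of a given environment gives a rough starting point when choosing threshold values.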

Maximum File Size Supported

Data Insight has a default maximum file size of 50MB. This limit can be changed in the Classification Configuration settings page.
Text extraction during classification is bounded by the uncompressed size of a file and this uncompressed size dictates whether files can be successfully classified. All Microsoft Office documents since Office 2007 use Office Open XML format (.docx, .pptx etc.) which introduced compression.
Most Office documents therefore have a degree of compression ranging from 20% to 70%, depending on the mix of text and images, with pure text compressing to around 80%.
Files with a lot of images will compress less as images such as JPEG and PNG are already compressed.
PDFs are not compressed by default unless the 'Optimize PDF' option in Adobe Acrobat or similar PDF authoring applications has been used.
It has been observed that 16 docx files, each 400MB uncompressed, can be classified concurrently without memory exhaustion.
This means that 16 concurrent requests for docx files with a logical size of 100MB-250MB would probably work, given the average compression ratio.
Note that the compression ratio is impossible to predict unless you analyze each file or have some indication of the type of content within the corpus.
These figures do not relate to volume/disk level compression, but the compression that Microsoft Office applies to the content. A .docx file is simply a ZIP container that can be opened in a tool such as 7-Zip to assess the uncompressed size.
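
Because an Office Open XML file is a ZIP container, its uncompressed size can also be read programmatically. The sketch below uses Python's standard `zipfile` module. The helper names are illustrative, and the approach applies only to ZIP-based formats such as .docx, .pptx and .xlsx, not to the older binary .doc, .ppt or .xls formats.

```python
import zipfile


def ooxml_sizes(path):
    """Return (compressed_bytes, uncompressed_bytes) summed over all ZIP members."""
    with zipfile.ZipFile(path) as archive:
        members = archive.infolist()
        compressed = sum(m.compress_size for m in members)
        uncompressed = sum(m.file_size for m in members)
    return compressed, uncompressed


def compression_saving(compressed, uncompressed):
    """Fraction of space saved by compression (0.0 means no saving)."""
    if uncompressed == 0:
        return 0.0
    return 1.0 - compressed / uncompressed
```

For example, `ooxml_sizes("report.docx")` (a hypothetical file name) returns both totals, and the second value is the one to compare against the uncompressed sizes discussed in this section.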

The table below shows the file types and sizes tested with the recommended Classification Server specification:

Recommended maximum file sizes for classification without OCR enabled

Document Type	Extensions	Maximum Compressed File Size Tested	Maximum Uncompressed File Size Tested
Microsoft Word	doc, docx, docm, dotm, dotx	200 MB	450 MB
Microsoft PowerPoint	ppt, pptx, pps, potm, potx, ppsm, ppsx	200 MB	450 MB
Office Tabular	xls, xlsx, xlt, xltx, xlsb, xlam	50 MB	100 MB
Adobe PDF	pdf	1 GB	Compressed PDFs are not yet tested. However, the maximum uncompressed size would mirror the compressed size of 1 GB.
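
The tested limits in the table above can be captured as a lookup for pre-checking a list of candidate files. This is a sketch only: it treats 1 MB as 1024 × 1024 bytes, takes the PDF limit as the 1 GB stated in the final column, and uses illustrative names that are not part of Data Insight or DQL.

```python
import os

MB = 1024 * 1024
GB = 1024 * MB

# Maximum compressed (on-disk) sizes tested, grouped as in the table.
TESTED_LIMITS = {
    ("doc", "docx", "docm", "dotm", "dotx"): 200 * MB,
    ("ppt", "pptx", "pps", "potm", "potx", "ppsm", "ppsx"): 200 * MB,
    ("xls", "xlsx", "xlt", "xltx", "xlsb", "xlam"): 50 * MB,
    ("pdf",): 1 * GB,  # assumption: the 1 GB figure from the final column
}

# Flatten to one entry per extension for quick lookup.
LIMIT_BY_EXTENSION = {
    ext: limit for exts, limit in TESTED_LIMITS.items() for ext in exts
}


def within_tested_limits(filename, size_bytes):
    """True if the extension is listed and the on-disk size is within the tested limit."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    limit = LIMIT_BY_EXTENSION.get(ext)
    return limit is not None and size_bytes <= limit
```

A file outside these limits may still classify successfully, as the Larger File Support section explains; the check only flags files beyond what was tested.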

Server specification used (the recommended Data Insight Classification Server specification):
- 16 Cores, 32GB RAM
- 16 classification threads running in parallel

Using Optical Character Recognition (OCR)
- OCR usually results in higher memory consumption, which in turn reduces classification performance.

Larger File Support

It is possible that files larger than those tested could be classified successfully, but this depends on the size of the other files being classified at the same time. For example, if a 300MB DOCX is 1GB uncompressed, it could still be classified successfully if the other 15 files running in parallel are relatively small, since the total memory used by the classification process would remain within limits.
As there is no way to guarantee that a mix of small and large files is classified at the same time, it is recommended that any DQL reports used to select files to classify are not ordered or segregated by file size. This ensures that files are submitted to VIC as 'randomly' as possible.
- For example, do not classify all 'small DOCX' files first and leave the largest ones until later. Classifying the very largest files together in one classification job increases the risk that the total uncompressed size of 16 large files leads to VIC memory exhaustion. Submitting a mix of file sizes together provides the best chance that large files, including those with a large uncompressed size, are classified successfully.
- If using DQL to generate a report of files to classify, do not order the output of the report by size as that would lead to VIC processing the largest files together, whether they are sorted to appear at the start or end of the report.
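
The effect of ordering can be illustrated with a short sketch. It assumes the report output is available as a list of file sizes that are processed 16 at a time in list order, which simplifies how VIC actually schedules work. The function names are illustrative.

```python
import random


def shuffled_order(sizes, seed=None):
    """Return a randomly ordered copy of the list, leaving the input untouched."""
    rng = random.Random(seed)
    shuffled = list(sizes)
    rng.shuffle(shuffled)
    return shuffled


def max_window_total(sizes, window=16):
    """Largest combined size of any `window` consecutive entries in the list."""
    if len(sizes) <= window:
        return sum(sizes)
    current = sum(sizes[:window])
    best = current
    for i in range(window, len(sizes)):
        # Slide the window forward by one entry.
        current += sizes[i] - sizes[i - window]
        best = max(best, current)
    return best
```

A list sorted by size places the 16 largest files next to each other, so `max_window_total` reaches its highest possible value. A shuffled list can never exceed that value and usually falls well below it.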

Recommendations for Creating Classification Jobs

Use DQL reports to select files according to the recommendations above, and then trigger classification requests from those reports.
- Use size-based file buckets (0-2MB, 2-4MB, 4-10MB, etc.)
- Select specific file types and exclude unsupported files by extension
Enable only required policies in VIC configuration.
- As the number of enabled policies and policy complexity increases (such as using complex regular expressions or hundreds of keywords), the throughput tends to decrease.
OCR is generally memory intensive; disable it if it is not required.
Configure the content fetch pause window to reduce the potential impact on the source devices.
- The content fetch job copies files from the source devices to classify them.
- By default, the job is paused from 7am to 7pm, which matches normal working hours.
- It is recommended to assess the load on the devices during the content fetch, as many customers have found that the load does not disrupt normal activities. If the job can run 24 hours a day, the classification process has a constant feed of files to classify, which can increase throughput.
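
To see how much fetch time a given pause window leaves, the window can be modelled as below. This is an illustration, not the product's configuration format; the names and the treatment of the window boundaries (start inclusive, end exclusive) are assumptions.

```python
from datetime import time

PAUSE_START = time(7, 0)   # default from the text: 7am
PAUSE_END = time(19, 0)    # default from the text: 7pm


def is_paused(now, start=PAUSE_START, end=PAUSE_END):
    """True if `now` falls inside the pause window, including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def fetch_hours_per_day(start=PAUSE_START, end=PAUSE_END):
    """Hours per day during which the content fetch job is allowed to run."""
    paused_minutes = (end.hour * 60 + end.minute) - (start.hour * 60 + start.minute)
    paused_minutes %= 24 * 60
    return 24 - paused_minutes / 60
```

With the default window, the job can fetch content for 12 of every 24 hours, so removing the pause doubles the time available for feeding the classification process.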

Minimum System Requirements for Classification Components

Table: Minimum recommended system requirements for classification components

Component	If classification is enabled	If Smart Classification is enabled
Management Server	32GB RAM, 16 CPUs	128GB RAM (Note: Provision additional 2 MB space per million paths.); 32 CPUs; 200 GB of free disk space for temporary files which are created during the classification process.
Indexer worker node	32GB RAM, 16 CPUs	128GB RAM (Note: Provision additional 2 MB space per million paths.); 32 CPUs; 200 GB of free disk space for temporary files which are created during the classification process.
Classification Server	32GB RAM, 16 CPUs	32GB RAM, 16 CPUs