Document indexing limitations in the eDiscovery Platform

Article ID: 100053981

Description

When sources are processed into the Veritas eDiscovery Platform (eDP), the platform limits the amount of data it indexes, for both resource and performance reasons. Only indexed data is searchable in the Analysis and Review module of a case.

Scenarios

When indexing, Veritas eDiscovery Platform first extracts all the text from the source file. The limit that applies then depends on the item type:

  • Emails: 
    • If the extracted text of an email body exceeds 512 KB, the message is flagged with the informational message Crawler Truncated. This means that the email body exceeded the application's maximum body size of 512 KB and that the crawler truncated the email content.
    • Message content before the 512 KB limit is indexed normally. Message content after the 512 KB limit is not indexed and is therefore not searchable.
       
  •  Loose files and email attachments: 
    • eDP indexes the first 1,000,000 (1 million) tokens of a loose file or attachment.
    • There are a few points to note here:
      • These are not 1 million unique tokens. The count includes duplicates; it is simply the first 1 million tokens encountered.
      • The number of tokens also depends on the characteristics of the data in the document being indexed.
        For example, a date value of 05-23-2022 creates four indexed tokens: 05-23-2022, 05, 23 and 2022.
        Similarly, a time value of 23:45:56 in HH:MM:SS format creates four tokens: 23:45:56, 23, 45 and 56. All four count toward the 1 million limit.
        A hyphenated word such as well-received creates three tokens: well, received and well-received.
      • If the number of tokens in a loose file or attachment exceeds 1 million, the file is flagged with the error message File Partially Indexed. The first 1 million tokens are still indexed and searchable in the application.
      • There is no relation between the size of the native file (or the size of the extracted text) and the File Partially Indexed error message. As long as the token count does not exceed 1 million, the complete file is indexed.
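
The token-counting behavior described above can be illustrated with a small Python sketch. This is not the eDP tokenizer, whose exact rules are not documented here. The functions tokenize and index_tokens are hypothetical, and the splitting rule (emit the whole word, then its parts when it contains "-" or ":") is an assumption chosen only to reproduce the three examples given above.

```python
import re

TOKEN_LIMIT = 1_000_000  # default token limit described in this article


def tokenize(text):
    """Yield each whitespace-delimited word, followed by its parts when the
    word contains '-' or ':' (assumed rule, for illustration only)."""
    for word in text.split():
        word = word.strip(".,;()")
        if not word:
            continue
        yield word
        parts = [p for p in re.split(r"[-:]", word) if p]
        if len(parts) > 1:
            yield from parts


def index_tokens(text, limit=TOKEN_LIMIT):
    """Return (indexed_tokens, partially_indexed).

    Duplicates count toward the limit. The flag is set only when at least
    one token beyond the limit is encountered.
    """
    indexed = []
    for token in tokenize(text):
        if len(indexed) == limit:
            return indexed, True  # analogous to File Partially Indexed
        indexed.append(token)
    return indexed, False


if __name__ == "__main__":
    for sample in ("05-23-2022", "23:45:56", "well-received"):
        print(sample, "->", list(tokenize(sample)))
    # A tiny limit shows that duplicates count and that the rest is dropped.
    print(index_tokens("alpha alpha alpha", limit=2))
```

Because compound values such as dates and times expand into several tokens, a file with many of them can reach the limit sooner than its size alone would suggest.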

Adjustments

The amount of data indexed by Veritas eDiscovery Platform can be adjusted by modifying certain properties in the product.  Adjusting these values may negatively impact product performance due to increased resource utilization.

Emails:  Increasing the indexed email body size.

In the Veritas eDiscovery Platform UI, navigate to System > Support Features and select Property Browser in the drop-down.

  • Under Select the Case (or system), select System.
  • In the Name of property to change field, enter one of the property names listed below:
    esa.indexer.maxbodysize
    esa.crawler.maxbodysize
  • In the New Value (leave blank to remove) field, enter the desired value, for example 1024000.
  • Check the Confirm change box.
  • Click the Submit button.
  • Perform these steps once for each of the two properties listed above.
  • Repeat the above steps for the following properties:
    • For PST source files:
      • esa.crawler.pst.maxbodysize
      • esa.indexer.pst.maxbodysize
    • For NSF source files:
      • esa.crawler.nsf.maxbodysize
      • esa.indexer.nsf.maxbodysize

Loose files and email attachments: Increasing the token limit.

  • On the eDiscovery server, navigate to D:\CW\V##\config\configs
  • Make a backup of the default.properties file.
  • Edit the default.properties file and search for the following token limit property:
    • esa.common.textengine.maxtermsperregion.x64
    • Change the value from 1000000 to the desired token limit.
  • The default.properties file contains several entries for the worker thread property, one for each virtual CPU count:
    • esa.asm.component.apcomponent.property.system.taskqueue.worker.task-threads.p##, where ## is the number of virtual CPUs.
  • Find the .p## entry matching the server's number of virtual CPUs and lower its value to half of the original.
    • The .p## entries for all higher CPU counts must also be lowered to this value.
    • For example: if the server has 32 virtual CPUs and the p32 value is lowered from 20 to 10, the values of the p48 and p64 entries must also be lowered from 20 to 10.
  • Save the edited default.properties file.
  • In the Clearwell Utility, run Option #7 to Build Incremental Configuration Changes.
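
The file edits in the steps above could be scripted along the following lines. This is a minimal sketch and not a Veritas-supplied tool. It assumes default.properties holds one key=value pair per line, and the function names adjust_properties and apply_to_file are hypothetical. The path to the file must be supplied by the caller, and the Clearwell Utility step still has to be run manually afterwards.

```python
import re
import shutil
from pathlib import Path

TOKEN_KEY = "esa.common.textengine.maxtermsperregion.x64"
THREAD_PREFIX = ("esa.asm.component.apcomponent.property.system."
                 "taskqueue.worker.task-threads.p")

TOKEN_RE = re.compile(r"^\s*" + re.escape(TOKEN_KEY) + r"\s*=")
THREAD_RE = re.compile(
    r"^\s*" + re.escape(THREAD_PREFIX) + r"(\d+)\s*=\s*(\d+)\s*$")


def adjust_properties(text, new_token_limit, cpu_count):
    """Return edited properties text (assumes one key=value per line)."""
    lines = text.splitlines()

    # Pass 1: halve the value of the entry matching the server's CPU count.
    halved = None
    for line in lines:
        m = THREAD_RE.match(line)
        if m and int(m.group(1)) == cpu_count:
            halved = max(1, int(m.group(2)) // 2)
    if halved is None:
        raise ValueError(f"no task-threads.p{cpu_count} entry found")

    # Pass 2: set the token limit and lower the matching and higher entries.
    out = []
    for line in lines:
        m = THREAD_RE.match(line)
        if TOKEN_RE.match(line):
            out.append(f"{TOKEN_KEY}={new_token_limit}")
        elif m and int(m.group(1)) >= cpu_count:
            # min() ensures an entry is only ever lowered, never raised.
            value = min(int(m.group(2)), halved)
            out.append(f"{THREAD_PREFIX}{m.group(1)}={value}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def apply_to_file(path, new_token_limit, cpu_count):
    """Back up the properties file, then rewrite it in place."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))  # backup first
    path.write_text(
        adjust_properties(path.read_text(), new_token_limit, cpu_count))
```

For the example above (32 virtual CPUs, p32 lowered from 20 to 10), calling adjust_properties(text, 2000000, 32) would leave lower entries such as p16 untouched and set p32, p48 and p64 to 10. The value 2000000 is only an illustrative token limit.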

Notes:

  • If the email body size exceeds the value set above, the message will still be flagged with Crawler Truncated in Review.
  • If the number of tokens exceeds the value set above, the file will still be marked as File too large (partially indexed) in the Processing Exceptions File Notices list and in Review.
  • The above changes affect all cases and must be made prior to processing source data into a case.
