Steps to resolve Checksum De-duplication issues using the 'Update checksum for emails' feature

Article ID: 100051881

Description

Update Checksum for Emails

Background
De-duplication of documents depends on matching checksums. Microsoft Office and IBM Notes clients have changed the way they calculate checksums between versions. As a result, some email documents indexed under one version may not de-duplicate against copies indexed under another.
For example, in release 10.1, Veritas eDiscovery Platform upgraded its versions of Microsoft Office and IBM Notes. In cases with previously indexed data, emails indexed in version 10.1 or later may not de-duplicate fully against the emails already in the case. In addition to the de-duplication impact, some users may experience retrieval errors after upgrading to 10.1 or later.
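
The effect described above can be illustrated with a minimal Python sketch. It is not the platform's actual checksum logic, which this article does not describe. The two checksum functions, the use of MD5, and the whitespace normalization are all assumptions chosen only to show why a changed calculation breaks matching until stored checksums are recomputed.

```python
import hashlib


def checksum_v1(subject: str, body: str) -> str:
    """Hypothetical 'old' calculation: hashes the raw fields."""
    return hashlib.md5(f"{subject}\n{body}".encode("utf-8")).hexdigest()


def checksum_v2(subject: str, body: str) -> str:
    """Hypothetical 'new' calculation: collapses whitespace before hashing."""
    normalized = " ".join(f"{subject}\n{body}".split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def is_duplicate(index: set, checksum: str) -> bool:
    """A document counts as a duplicate only if its checksum is already indexed."""
    return checksum in index


email = ("Quarterly report", "Please find the report attached.")

# The index built before the upgrade holds checksums from the old calculation.
index = {checksum_v1(*email)}

# The same email indexed after the upgrade gets a different checksum,
# so it is not recognized as a duplicate.
missed = not is_duplicate(index, checksum_v2(*email))

# Recomputing the stored checksum with the new calculation restores matching.
index = {checksum_v2(*email)}
matched = is_duplicate(index, checksum_v2(*email))
```

In this sketch, recomputing the index plays the role that the Update checksum for emails feature plays for an affected case.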

IMPORTANT: After upgrading to 10.1, selecting a case that requires an email checksum update displays a warning message, as shown in Figure 1.

Figure 1

A System Administrator or Group Administrator needs to go to System > Support Features and choose Update checksum for emails. Only the cases that appear in the Select the case field are affected. The System Administrator or Group Administrator should coordinate the timing of running the Update checksum for emails feature against the affected cases.

Who Will Need This Feature

  • Users with cases containing data indexed in eDiscovery Platform before version 10.1.
  • Users who have upgraded to 10.1 or later and restore cases that were archived in an earlier version.

De-duplication Prerequisites

  • The user must have the System Manager or Group Admin role to access the System page with the support features.

Figure 2

Running De-duplication

  1. From the case, go to System > Support Features and choose Update checksum for emails. The cases that appear in the drop-down need to be scanned using this feature. Select one of these cases, or select Upgrade all cases, and click Submit to start a job. To observe progress, select the case and open the job log; otherwise, wait for the job to complete.
  2. When the job is complete, the log states Job Finished, and a log summary and CSV file are generated.
    Job log: shows the steps in progress, the number of files de-duplicated, whether the update completed successfully, and where the CSV file is available.
    Log summary: generated after the job is complete. It restates the last three lines of the job log: the numbers for each type of file to be de-duplicated, the number of successfully de-duplicated items, and a statement that the job is complete.
    CSV “mismatched_items.csv”: a detailed list of updated items identified by DocID, with original and updated crawler checksums.
    CSV “scan_failed_items.csv”: a list of items for which the scan for checksum recalculation failed.
    Note: If items fail to update successfully, the Group Admin for the case should investigate the job log for details.
  3. Once the problem is resolved, scan the case again using the Update checksum for emails feature, as in steps 1 and 2. After the case is updated completely, it stops appearing in the drop-down.
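
For larger cases, the mismatched_items.csv output may be easier to review with a short script. The following Python sketch assumes the file has a header row with the columns DocID, OriginalChecksum, and UpdatedChecksum. These column names are hypothetical, as are the sample rows, so check the header of the real file before adapting it.

```python
import csv
import io

# Hypothetical column names; confirm them against the real file's header row.
DOC_ID, ORIGINAL, UPDATED = "DocID", "OriginalChecksum", "UpdatedChecksum"


def load_mismatches(handle):
    """Return {DocID: (original, updated)} for rows whose checksum changed."""
    changed = {}
    for row in csv.DictReader(handle):
        if row[ORIGINAL] != row[UPDATED]:
            changed[row[DOC_ID]] = (row[ORIGINAL], row[UPDATED])
    return changed


# Made-up sample data standing in for mismatched_items.csv.
sample = io.StringIO(
    "DocID,OriginalChecksum,UpdatedChecksum\n"
    "doc-101,aaa111,bbb222\n"
    "doc-102,ccc333,ddd444\n"
)

mismatches = load_mismatches(sample)
print(f"{len(mismatches)} items had their checksum updated")
```

To read a real file instead of the sample, pass a handle opened with open("mismatched_items.csv", newline="") to load_mismatches.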

FAQs and Troubleshooting
Q: After upgrading to version 10.1, how do I know if there are cases that need to be scanned?
A: From the case, go to System > Support Features and choose Update checksum for emails. Only the cases that need to be scanned by this feature appear in the drop-down.
In addition, a warning message is shown (as in Figure 1) when you select a case that requires its email checksums to be updated.

Q: If I have cases that appear to be affected, when should I use the Update checksum for emails feature?
A: Preferably, run Update checksum for emails on the case as soon as the warning message appears, or at least before processing any new case data or starting review. The installing System Administrator should check for affected cases as in step 1 and consult the Group or Case Administrator about priority and timing.

Q: Which files are scanned by this feature?
A: PST, MSG, and NSF files are scanned by this feature. Attachments and loose files are not affected by the hash mismatch.

Q: What if the job does not complete successfully?
A: In some cases, files are dropped because they are corrupt, are missing from the original location, or are open in another process. The Group Admin for the case should read the job log to determine the cause, act accordingly, and then re-run the feature.

Q: Can I run discovery or processing on a case folder in parallel with the Update checksum for emails feature?
A: We do not recommend running another job in parallel with the Update checksum for emails job. If this does occur, the Update checksum for emails job is queued until the discovery or processing job completes, and vice versa.
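
The queuing behavior in this answer can be pictured with a small Python sketch. It is only an illustration of one job waiting for another on the same case, not the platform's scheduler. The per-case lock, the job names, and the timings are assumptions made for the example.

```python
import threading
import time

# One lock per case: a hypothetical stand-in for the platform's job queue.
case_locks = {"Case A": threading.Lock()}
holding = threading.Event()  # set once some job holds the case
events = []


def run_job(case: str, name: str, seconds: float) -> None:
    """Run a job only while no other job holds the same case."""
    with case_locks[case]:
        events.append(f"{name} start")
        holding.set()
        time.sleep(seconds)  # simulated work
        events.append(f"{name} end")


processing = threading.Thread(target=run_job, args=("Case A", "processing", 0.2))
updater = threading.Thread(target=run_job, args=("Case A", "update checksum", 0.1))

processing.start()
holding.wait()   # the processing job now holds the case
updater.start()  # waits until the processing job releases the case
processing.join()
updater.join()

print(events)
```

The second job starts only after the first one ends. Swapping the start order gives the "vice versa" case, where processing waits for the checksum update.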

Q: Can I review or perform an export on the already processed data while the Update checksum for emails job is in progress?
A: No, it is not recommended.

Q: How many Update checksum for emails jobs can I run at a time per case home node?
A: You can run a maximum of five jobs at a time per case home node. The number of concurrent jobs is configured with the property esa.checksumupdaterjob.execution.throttle. The default value is 3, and the property accepts values from 1 to 5.
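
As a rough illustration of what a throttle of this kind does, the Python sketch below resolves the property value and uses it to cap the number of jobs running at once. Only the property name, the default of 3, and the range of 1 to 5 come from this article. The resolve_throttle and update_checksums helpers, the choice to reject out-of-range values with an error, and the thread pool are assumptions, not a description of how the platform reads or enforces the property.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

PROPERTY = "esa.checksumupdaterjob.execution.throttle"
DEFAULT, MINIMUM, MAXIMUM = 3, 1, 5


def resolve_throttle(properties: dict) -> int:
    """Return the configured throttle, or the default when the property is unset.

    Rejecting out-of-range values is an assumption of this sketch.
    """
    raw = properties.get(PROPERTY)
    if raw is None:
        return DEFAULT
    value = int(raw)
    if not MINIMUM <= value <= MAXIMUM:
        raise ValueError(f"{PROPERTY} must be {MINIMUM}-{MAXIMUM}, got {value}")
    return value


state = {"running": 0, "peak": 0}
state_lock = threading.Lock()


def update_checksums(case: str) -> str:
    """Simulated job that records how many jobs run at the same time."""
    with state_lock:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
    time.sleep(0.05)  # simulated work
    with state_lock:
        state["running"] -= 1
    return case


throttle = resolve_throttle({PROPERTY: "2"})
cases = [f"case-{n}" for n in range(6)]

# The pool never runs more than `throttle` jobs at once; the rest wait.
with ThreadPoolExecutor(max_workers=throttle) as pool:
    finished = list(pool.map(update_checksums, cases))

print(f"throttle={throttle}, peak concurrent jobs={state['peak']}")
```

With six cases submitted and a throttle of 2, all six jobs finish, but the recorded peak never exceeds 2.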

Additional Information

JIRA: ESA-62277, ESA-62587