What are the criteria for de-duplication in eDiscovery Platform?
Article ID: 100027772
Resolution
Clearwell tests every new document against the documents already in the case, using the criteria below. If the new document is determined to be a duplicate, the system does not ingest it. Instead, it adds information to the document that is already in the case: a pointer to the duplicate document's source location and its custodian information.
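The ingest behavior described above can be sketched as follows. This is a minimal illustration, not Clearwell's implementation: the `Case` and `CaseDocument` classes, the `ingest` method, and the idea of a precomputed `dedup_key` are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CaseDocument:
    """One document stored in the case (hypothetical structure)."""
    dedup_key: str
    # (source location, custodian) for the original and every duplicate seen
    sources: list = field(default_factory=list)


class Case:
    """Hypothetical case store keyed by a de-duplication key."""

    def __init__(self):
        self.documents = {}

    def ingest(self, dedup_key, source_location, custodian):
        """Return True if the document was ingested, False if it was a duplicate."""
        existing = self.documents.get(dedup_key)
        if existing is not None:
            # Duplicate: do not ingest; record where the copy came from instead.
            existing.sources.append((source_location, custodian))
            return False
        self.documents[dedup_key] = CaseDocument(
            dedup_key, [(source_location, custodian)]
        )
        return True
```

Under these assumptions, a second copy of the same document leaves the case with one stored document that carries two source and custodian entries.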
Loose file de-duplication
Clearwell uses the following to determine whether two loose files are considered duplicates:
File name
File size
Last modified date
Checksum
File name and last modified date are part of Clearwell's de-duplication algorithm even when two files have an identical MD5 hash. This ensures that no critical file metadata is inadvertently de-duplicated out for otherwise identical files.
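As a rough sketch of the loose file criteria, the four attributes can be combined into a single comparison key. The function name `loose_file_key` is hypothetical, and several details are assumptions: the name is compared case-sensitively, the last modified time is truncated to whole seconds, and the checksum is an MD5 of the file bytes (the article mentions MD5 but does not describe how the checksum is computed).

```python
import hashlib
import os


def loose_file_key(path):
    """Build a (name, size, mtime, checksum) tuple for a file on disk."""
    stat = os.stat(path)
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            md5.update(chunk)
    return (
        os.path.basename(path),  # file name only, not the full path
        stat.st_size,
        int(stat.st_mtime),      # assumption: second-level precision
        md5.hexdigest(),
    )
```

Two files with identical content but different names or modification times produce different keys here, which matches the behavior described above: an identical hash alone is not enough for two loose files to be treated as duplicates.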
Email de-duplication
Clearwell uses the following fields to determine whether two emails are considered duplicates:
Sender email address
"To" list (normalized; in sorted order)
"From" list (normalized; in sorted order)
"Cc" list (normalized; in sorted order)
"Bcc" list (normalized; in sorted order)
Subject (normalized; alphanumeric characters only)
Sent time (normalized to UTC format, hours and minutes only to the millisecond)
Full text of email content (alphanumeric characters only, as extracted by our text extractor)
Count of enclosed emails
Attachment file names
Attachment sizes
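The email fields listed above could be combined into one comparison key along these lines. This is only a sketch under stated assumptions, not Clearwell's algorithm: the function names are hypothetical, "normalized" is taken to mean trimmed and lower-cased, the alphanumeric filtering is case-insensitive, the sent time is kept to millisecond precision, and the fields are hashed with SHA-256 purely to get a compact key.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone


def alnum_only(text):
    """Keep alphanumeric characters only (assumption: case is ignored)."""
    return "".join(ch for ch in text if ch.isalnum()).lower()


def sorted_addresses(addresses):
    """Normalize an address list and return it in sorted order."""
    return sorted(a.strip().lower() for a in addresses)


def email_dedup_key(sender, to, from_, cc, bcc, subject, sent, body,
                    enclosed_count, attachment_names, attachment_sizes):
    """Hash the de-duplication fields; `sent` must be a timezone-aware datetime."""
    fields = [
        sender.strip().lower(),
        sorted_addresses(to),
        sorted_addresses(from_),
        sorted_addresses(cc),
        sorted_addresses(bcc),
        alnum_only(subject),
        # Convert to UTC so the same instant compares equal across time zones.
        sent.astimezone(timezone.utc).isoformat(timespec="milliseconds"),
        alnum_only(body),
        enclosed_count,
        list(attachment_names),
        list(attachment_sizes),
    ]
    payload = json.dumps(fields, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

With these choices, two copies of a message whose recipient lists are ordered differently, or whose sent times are expressed in different time zones, map to the same key, while any change to the extracted body text or the attachments yields a different key.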
Issue/Introduction
Clearwell uses specific criteria when comparing two files for de-duplication.