How to minimize the disk space used by Clearwell on the local appliance.

Article ID: 100014720

Resolution

The following are Clearwell configuration changes that can be made to minimize the disk space used on the local appliance:

  • Store case and appliance backups off the appliance. 
     
  • Store exports off the appliance. See the section "Export to an External Drive" in the "Export and Production Guide".
     
  • Relocate the "convertedFiles" and "extractedFiles" directories off the appliance. See the section "Defining System Settings" in the System Admin Guide.
     
  • In eDiscovery versions 9.5 and below, the IGC imaging cache directories can be moved from their default location on the installation drive. See the section "Moving Cache (On or Off) the Appliance" in the System Admin Guide.
    The following caching directories can be moved off the appliance:
    • D:\CW\vXXX\extdata
      This cache directory stores the proprietary binary files generated by redacted documents and locked productions. These files are included in a case backup.
    • D:\CW\vXXX\scratch\exttemp
      This cache directory stores the proprietary binary files generated when a reviewer uses Native view (or when a "Native" cache job is run). These files are not included in a case backup.
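Before relocating a cache directory, it can help to know how much space it currently occupies. The following is a minimal Python sketch, not a Clearwell tool: the function name `directory_size_bytes` is hypothetical, and the script assumes Python is available on a machine that can read the directory. It only reads file sizes and changes nothing on disk.

```python
import os


def directory_size_bytes(root):
    """Return the total size, in bytes, of all files under `root`.

    Hypothetical helper for illustration only. Files that disappear or
    cannot be read while scanning are skipped rather than aborting the scan.
    A directory that does not exist yields 0.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # Cache files can be deleted while the scan is running.
                continue
    return total
```

Calling `directory_size_bytes` with the path of a cache directory (for example, the `extdata` directory for the installed version) returns its size in bytes; dividing by `1000 ** 3` gives decimal gigabytes.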
 

 

Issue/Introduction

The following is mentioned in the Clearwell Installation Guide:
"The D: drive is used as a temporary cache for many of Clearwell’s processing components, therefore a D: drive of at least 1 TB is recommended."
The Clearwell installation itself uses only about 5 GB of disk space, so most of the disk space is consumed by Clearwell cases and their caches. As a safe working guideline, try to keep around 500 GB of free disk space on the D: drive at all times. This ensures that there is always sufficient temporary disk space for the discovery, processing (indexing), and post-processing stages.
Note: If the Clearwell appliance runs out of disk space during operation, this can occasionally cause database integrity issues that cannot be easily corrected, and a case backup may then need to be restored. For this reason, it is strongly recommended to always perform a case backup before any new data is processed into a case (including running an OCR job), or before post-processing is re-run manually.
The Clearwell Release Notes include the following on this point:
"Check the disk space in your database before you start case processing (26106): If the database runs out of disk space during case processing, increase the disk space and either restart case processing or restore the case from a backup and resume processing."
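The free-space guideline above can be checked with a short script before processing starts. This is a minimal Python sketch under stated assumptions, not part of Clearwell: the function names are hypothetical, the 500 GB default simply mirrors the working guideline in this article, and "GB" is assumed to mean decimal gigabytes. It relies only on the standard library call `shutil.disk_usage`.

```python
import shutil

BYTES_PER_GB = 1000 ** 3  # assumption: decimal gigabytes


def free_space_gb(path):
    """Return the free space, in GB, on the volume that contains `path`."""
    return shutil.disk_usage(path).free / BYTES_PER_GB


def has_enough_free_space(path, required_gb=500):
    """Return True if the volume holding `path` has at least `required_gb` free.

    The 500 GB default reflects the working guideline in this article;
    it is not an official Clearwell threshold.
    """
    return free_space_gb(path) >= required_gb
```

On the appliance, calling `has_enough_free_space("D:\\")` before starting discovery, processing, or an OCR job gives a quick yes/no answer; `free_space_gb("D:\\")` reports the actual figure.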

QUESTION: Is it possible to predict the disk space that Clearwell will use for a certain size of source data?
ANSWER: No. There are no official metrics at this time on the disk space used by Clearwell during the various stages of the EDRM process. It is very difficult to determine such metrics, because the disk space used by Clearwell is dependent on numerous factors. These factors include, but are not limited to, the following:
  • The data types of the source data:
    Different data types (e.g. LEF, OST, PST, ZIP) use different amounts of caching disk space on the appliance.
  • The data import methods used:
    Load File Imports, for example, can generate a large number of non-temporary files. These files stay with the case and are included in case backups.
  • The percentage of de-duplication achieved by processing:
    If the de-duplication percentage is low, more unique documents are ingested into Clearwell.
  • Whether OCR is used:
    OCR generates output text files and also generates temporary files. These temporary files can sometimes be left behind if the OCR fails.
  • Use of "Text" view during review:
    Viewing a document in "Text" view (or running an "HTML" cache job) causes an HTML file to be cached to disk. This cache can be cleared, but it cannot be configured to be stored off the appliance.
  • Use of "Native" view during review:
    Viewing a document in "Native" view (or running a "Native" cache job, or running an image production) causes proprietary binary files to be generated and cached to disk. It is possible to direct this caching off the appliance.
  • The number of processing batches in a case and the age of a case:
    Over time, as more data is added incrementally to a case, the case size can grow larger than if all the data had been processed in a single pass. In certain situations, rebuilding search indexes from scratch can help reduce the disk space used.
This article describes the various configuration changes that can be made to the Clearwell appliance in order to minimize the disk space used.