If a user specifies a keyword filter in the fileshare collection task, a keyword search is performed on each document that is not excluded by the other filter settings in the collection task. However, the keyword search is only performed if the document's file type is not in the "keyword search skip list". The keyword search skip list is a Clearwell property that lists the file id values for file types where text extraction does not make sense at the collection stage. The file types in this list include the following: Email MSG, Email PST, Email NSF, Image files (BMP, TIFF, GIF, JPG, PNG, etc.), Executables, Zip files and Multimedia files (MP3, MPEG2, etc.). Text extraction for these file types can only be performed when the documents are processed (indexed) into Clearwell.
Below is a list of Clearwell file id values for which keyword filtering is skipped during a fileshare collection:
1302,4004,1189,1170,1998,1818,1195,1132,1192,1356,1187,1143,1345,1347,1349,1351,1353,1357,1359,1361,1363,1365,1500-1518,1527,1528,1532-1535,1539,1542,1544-1546,1551,1556,1559,1561,1564,1565,1569-1575,1578,1579,1580,1588-1591,1593,1594,1599,1601-1604,1606,1625,1636,1639,1646,5000,2261,1800,1801,1803,1820,1802,1803,1806,1807,1813,1815,1821,1822,65537,65538,65539,65541,65542,1826,1827,65545,1817,65540,65543,65544,1700-1799
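To check whether a given file id falls in the list above, the comma-separated values and ranges can be expanded into a set. The following is a minimal Python sketch, not a Clearwell utility: the names SKIP_LIST, expand_file_ids and SKIP_IDS are hypothetical, and the string is simply the list above copied as-is.

```python
from typing import Set

# The skip list exactly as shown above (copied from this article, not read from Clearwell).
SKIP_LIST = (
    "1302,4004,1189,1170,1998,1818,1195,1132,1192,1356,1187,1143,1345,1347,1349,"
    "1351,1353,1357,1359,1361,1363,1365,1500-1518,1527,1528,1532-1535,1539,1542,"
    "1544-1546,1551,1556,1559,1561,1564,1565,1569-1575,1578,1579,1580,1588-1591,"
    "1593,1594,1599,1601-1604,1606,1625,1636,1639,1646,5000,2261,1800,1801,1803,"
    "1820,1802,1803,1806,1807,1813,1815,1821,1822,65537,65538,65539,65541,65542,"
    "1826,1827,65545,1817,65540,65543,65544,1700-1799"
)


def expand_file_ids(spec: str) -> Set[int]:
    """Expand a string such as '1500-1518,1527' into a set of integers."""
    ids: Set[int] = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            # An inclusive range such as 1700-1799.
            low, high = token.split("-", 1)
            ids.update(range(int(low), int(high) + 1))
        else:
            ids.add(int(token))
    return ids


SKIP_IDS = expand_file_ids(SKIP_LIST)

print(1195 in SKIP_IDS)  # True: listed individually
print(1750 in SKIP_IDS)  # True: inside the 1700-1799 range
print(2401 in SKIP_IDS)  # False: not in the skip list
```

A file id that is absent from the set is one for which text extraction and the keyword search are attempted during collection.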
The ScanDir log file created in directory D:\CW\VXX\logs\scandir\ indicates whether a document has been excluded from keyword filtering, and also displays the file id of the document, e.g.:
2017-06-29 16:06:39,306 [17820,34180] INFO Filter - Skipping keyword check for the file: \\localhost\d$\fs_test_3\8am call tomorrow.EML with fileid: 1195
2017-06-29 16:06:39,325 [17820,39452] INFO Filter - Skipping keyword check for the file: \\localhost\d$\fs_test_3\Test email 2.msg with fileid: 1143
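When a ScanDir log is large, the skipped files and their file ids can be pulled out with a short script. The sketch below assumes the message format is exactly as in the two sample lines above; the names SKIP_LINE and skipped_files are hypothetical, and the sample data is just those two lines.

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches only the message part of the line, so the timestamp and the
# bracketed numbers at the start of the line are ignored.
SKIP_LINE = re.compile(
    r"Skipping keyword check for the file: (?P<path>.+) with fileid: (?P<fileid>\d+)\s*$"
)


def skipped_files(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (path, file id) for every 'Skipping keyword check' log line."""
    for line in lines:
        match = SKIP_LINE.search(line)
        if match:
            yield match.group("path"), int(match.group("fileid"))


sample = [
    r"2017-06-29 16:06:39,306 [17820,34180] INFO Filter - Skipping keyword check for the file: \\localhost\d$\fs_test_3\8am call tomorrow.EML with fileid: 1195",
    r"2017-06-29 16:06:39,325 [17820,39452] INFO Filter - Skipping keyword check for the file: \\localhost\d$\fs_test_3\Test email 2.msg with fileid: 1143",
]

for path, file_id in skipped_files(sample):
    print(file_id, path)
```

To run this against a real log, pass an open file object to skipped_files instead of the sample list.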
Text extraction must be performed before a keyword search can be run on a document. If text extraction fails, Clearwell includes the document in the collection result rather than excluding a document that could contain the user's search keyword. Including the file type group "Other types" in the fileshare collection task is likely to increase the number of file types for which text extraction is attempted, and therefore the number of potential text extraction failures. As a result, Clearwell can collect more files than the user might expect, even though they have specified a search keyword. Note: The default setting for file type filtering in a fileshare collection task is "Do not filter by file type or file extension", which effectively includes all files in the "Other types" file type group.
A particularly good example of this unexpected over-collecting is the handling of Windows shortcut (.lnk) files, which have a file id value of 2401. File id 2401 is not in the keyword search skip list, so text extraction is attempted, but it tends to fail. Because file id 2401 falls into the "Other types" file type group, and a failed extraction causes the file to be included, these shortcut files are collected by default.
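The keyword filter behaviour described above can be summarised as a small decision function. This is an illustrative model only, not Clearwell code: the function name passes_keyword_filter and its parameters are hypothetical, the plain substring match is a stand-in for Clearwell's actual keyword search, and treating skip-listed files as passing the filter is an assumption based on the keyword check being skipped for them.

```python
from typing import Callable, Optional, Set


def passes_keyword_filter(
    file_id: int,
    skip_ids: Set[int],
    extract_text: Callable[[], Optional[str]],
    keyword: str,
) -> bool:
    """Model of the keyword filter outcome for one document.

    extract_text returns the document text, or None if extraction fails.
    """
    if file_id in skip_ids:
        # Keyword check is skipped for this file type (assumed to pass).
        return True
    text = extract_text()
    if text is None:
        # Extraction failed: the document is included rather than risk
        # excluding one that might contain the keyword.
        return True
    # Stand-in for the real keyword search.
    return keyword.lower() in text.lower()


skip_ids = {1195, 1143}  # small excerpt of the skip list, for illustration

# A Windows shortcut (file id 2401) whose text extraction fails is still collected.
print(passes_keyword_filter(2401, skip_ids, lambda: None, "contract"))  # True

# A document whose text is extracted but does not contain the keyword is not.
print(passes_keyword_filter(9999, skip_ids, lambda: "meeting notes", "contract"))  # False
```

The file id 9999 in the second call is a made-up value used only to represent a file type that is outside the skip list and extracts cleanly.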
Note: A side effect of the keyword search skip list is that Zip files are not extracted during a fileshare collection.
There is no ideal solution that avoids the risk of excluding documents that could contain the user's search keyword. Excluding the "Other types" file group (either explicitly, or implicitly by including one of the other file type groups) will exclude files such as Windows shortcut (.lnk) files, but it will also exclude any zip files unless the user explicitly includes them (e.g. by using a file extension inclusion for "zip", or by selecting "Other Containers(ZIP,RAR,etc.)" from the "Container Files" filter tab).
See the attached document "Clearwell File ID values grouped by File Type.docx" which lists the file id values that belong to each Clearwell file type group.
In conclusion:
If the user has a specific list of file types to collect, these file types should be explicitly selected in the collection task filter settings. If the user does not know all the file types to collect, they should expect the collection to contain more files than the keyword filter alone would suggest, even though a search keyword has been specified.