TOWARDS A SET OF METRICS TO DETECT AND GENERATE SYNTHETIC FILE SYSTEMS


Presented at SRISecCon 2014 by

File system researchers often generate their own synthetic document repositories because of the data privacy and copyright concerns associated with experimenting on real-world corpora. Constructing fake file systems is also important in the field of cyber deception, to bait intruders and fool forensic investigators. In both fields, realism is critical. Unfortunately, once a set of files and folders has been created, there are currently no testing standards that can be applied to validate its authenticity or to reliably automate its detection. This paper reviews the previous 30 years of file system surveys on real-world corpora to identify a set of criteria for detecting and generating synthetic file systems. Statistics such as the size, age and lifetime of files, common file types, compression and duplication ratios, and directory distribution and depth (and its relationship with the numbers of files and sub-directories) are identified and their merits discussed. Additionally, this paper highlights measurements that are notably absent from these surveys and could be beneficial, such as analysing text content distribution, examining file naming habits, and comparing file access times against traditional working hours.
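As an illustration only, a few of the statistics listed above could be gathered from a directory tree roughly as follows. This is a minimal sketch, not the paper's method: the function name `collect_metrics`, the returned field names, and the 09:00 to 17:00 weekday definition of "working hours" are all assumptions made for this example.

```python
import os
from collections import Counter
from datetime import datetime


def collect_metrics(root, work_start=9, work_end=17):
    """Walk `root` and gather a handful of simple file system statistics.

    Hypothetical helper: the name, the output fields and the working-hours
    window (weekdays, work_start <= hour < work_end, local time) are
    assumptions for illustration.
    """
    sizes = []                # file sizes in bytes
    extensions = Counter()    # lower-cased extension -> number of files
    depth_files = Counter()   # directory depth (root = 0) -> number of files
    in_hours = 0              # files last accessed during working hours
    root = os.path.abspath(root)

    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            sizes.append(st.st_size)
            extensions[os.path.splitext(name)[1].lower()] += 1
            depth_files[depth] += 1
            # Access times may be coarse or disabled on some mounts
            # (e.g. noatime), so treat this figure with caution.
            accessed = datetime.fromtimestamp(st.st_atime)
            if accessed.weekday() < 5 and work_start <= accessed.hour < work_end:
                in_hours += 1

    total = len(sizes)
    return {
        "file_count": total,
        "mean_size": sum(sizes) / total if total else 0.0,
        "extensions": dict(extensions),
        "files_per_depth": dict(depth_files),
        "working_hours_fraction": in_hours / total if total else 0.0,
    }
```

Calling `collect_metrics("/path/to/corpus")` on a candidate tree and on a reference tree would give two sets of figures to compare; how large a difference should count as "synthetic" is a question this sketch does not attempt to answer.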