Statistical File Scanner

The statistical-file-scanner uses statistical sampling to estimate data characteristics of large data sets without scanning the full data set.

The characteristic of interest is computed relative to the occupied size (file size) of the data. For example, one may want to estimate the proportion of the overall data volume that a given file type occupies, or how well the data compresses, i.e., what compression ratio would result from compressing 10 petabytes of data with compression scheme X.
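
To make the idea of a size-based estimate concrete, here is a minimal Python sketch. It is not the tool's actual implementation. It assumes a hypothetical file catalogue given as (name, size) pairs and draws files with probability proportional to their size, so that the fraction of sampled files matching a predicate approximates the fraction of bytes those files occupy. The function names, the catalogue, and the `.nc` predicate are illustrative assumptions.

```python
import random


def estimate_size_weighted_proportion(files, predicate, n_samples, seed=0):
    """Estimate the fraction of total bytes held by files matching `predicate`.

    `files` is a list of (name, size_in_bytes) pairs (a hypothetical catalogue).
    Files are drawn with replacement, with probability proportional to size,
    so only the sampled files would need to be inspected.
    """
    rng = random.Random(seed)
    names = [name for name, _ in files]
    sizes = [size for _, size in files]
    sample = rng.choices(names, weights=sizes, k=n_samples)
    hits = sum(1 for name in sample if predicate(name))
    return hits / n_samples


def exact_proportion(files, predicate):
    """Reference value from a full scan, for comparison on small inputs."""
    total = sum(size for _, size in files)
    matching = sum(size for name, size in files if predicate(name))
    return matching / total


if __name__ == "__main__":
    # Hypothetical catalogue: two NetCDF files and one text file.
    catalogue = [("a.nc", 200), ("b.nc", 100), ("c.txt", 100)]

    def is_netcdf(name):
        return name.endswith(".nc")

    print("exact:   ", exact_proportion(catalogue, is_netcdf))
    print("estimate:", estimate_size_weighted_proportion(catalogue, is_netcdf, 1000))
```

With size-proportional sampling, the error of the estimate depends on the number of samples rather than on the number of files, which is what makes this kind of approach attractive for very large data sets. The same scheme could in principle be applied to other per-file measurements, such as a compression ratio measured on each sampled file.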

Contact: Dr. Julian Kunkel
Repository: Public on GitHub

Publications

Talks

  • Statistical File Characterization and Status Update: Monitoring at DKRZ (Dr. Julian Kunkel), BoF: Analyzing Parallel I/O, Supercomputing Conference, Salt Lake City, USA, 2016-11-17 Presentation
  • Analyzing Data Properties Using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features (Dr. Julian Kunkel), HPC-IODC Workshop, Frankfurt, Germany, 2016-06-23 Presentation
  • Analyzing Data Properties Using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features (Dr. Julian Kunkel), ISC High Performance, Frankfurt, Germany, 2016-06-21 Presentation