Statistical File Scanner

The statistical file scanner uses statistical sampling to estimate data characteristics of large data sets without having to scan the full data set.

The characteristic to determine is computed based on the occupied size (file size) of the data. For example, one may want to estimate the proportion a given file type contributes to the overall data, or how well data compresses, i.e., what the compression ratio will be if 10 petabytes of data are compressed with compression scheme X.
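The idea can be sketched in a few lines: instead of reading every file, draw a random sample and extrapolate from the sampled sizes. The snippet below is a minimal illustration of this sampling approach, not the tool's actual implementation; the function names, the in-memory `(name, data)` file representation, and the use of zlib as the compression scheme are assumptions for the example.

```python
import random
import zlib

def estimate_compression_ratio(files, sample_size, seed=0):
    """Estimate the overall compression ratio by compressing only a
    random sample of files. `files` is a list of (name, data) pairs;
    in a real scan the data would be read from disk on demand.
    (Illustrative sketch, not the SFS implementation.)"""
    rng = random.Random(seed)
    sample = rng.sample(files, min(sample_size, len(files)))
    original = sum(len(data) for _, data in sample)
    compressed = sum(len(zlib.compress(data)) for _, data in sample)
    # Ratio < 1.0 means the sampled data compresses.
    return compressed / original

def estimate_type_proportion(files, suffix, sample_size, seed=0):
    """Estimate the share of total capacity occupied by files with a
    given suffix, using only the file sizes of a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(files, min(sample_size, len(files)))
    total = sum(len(data) for _, data in sample)
    matching = sum(len(data) for name, data in sample
                   if name.endswith(suffix))
    return matching / total
```

With a sufficiently large random sample, both estimates converge toward the values a full scan would yield, at a small fraction of the I/O cost.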

Contact: Dr. Julian Kunkel
Repository: Public on GitHub

Publications

  • \myPub{2017}{SFS: A Tool for Large Scale Analysis of Compression Characteristics}{Julian Kunkel}{Research Papers (4), Research Group: Scientific Computing, University of Hamburg}
  • \myPub{2015}{Identifying Relevant Factors in the I/O-Path using Statistical Methods}{Julian Kunkel}{Research Papers (3), Research Group: Scientific Computing, University of Hamburg}

Talks

  • \myPub{2016}{Statistical File Characterization and Status Update: Monitoring at DKRZ}{BoF: Analyzing Parallel I/O, Supercomputing Conference}{Salt Lake City, USA}
  • \myPub{}{Analyzing Data Properties Using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features}{HPC-IODC Workshop}{Frankfurt, Germany}
  • \myPub{}{Analyzing Data Properties Using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features}{ISC High Performance}{Frankfurt, Germany}