Statistical File Scanner

The statistical-file-scanner uses statistics to estimate data characteristics of large data sets without having to scan the full data set.

The characteristic of interest is computed relative to the occupied size (file size) of the data. For example, one may want to estimate the proportion of the overall data volume that a given file type accounts for, or how well the data compresses, i.e., what compression ratio would result from compressing 10 petabytes of data with compression scheme X.
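The idea of estimating a size-based proportion from a sample can be illustrated with a minimal Python sketch. It is not taken from the tool itself: the function `estimate_share_by_size`, the `classify` callback, and the file inventory are all hypothetical, and the sketch assumes that file names and sizes are already known from metadata, so that only the sampled files need to be inspected. Files are drawn with probability proportional to their size, so the fraction of sampled files of a given type estimates that type's share of the occupied bytes. The error margin uses a simple normal approximation.

```python
import math
import random
from collections import Counter


def estimate_share_by_size(files, classify, sample_size, seed=0):
    """Estimate each file type's share of the total occupied size.

    files       -- list of (name, size_in_bytes) pairs (assumed known from metadata)
    classify    -- hypothetical callback mapping a file name to its type;
                   it is called only for the sampled files
    sample_size -- number of draws (with replacement)
    Returns {file_type: (estimated_share, 95% half-width)}.
    """
    rng = random.Random(seed)
    names = [name for name, _ in files]
    sizes = [size for _, size in files]

    # Draw files with probability proportional to their size.
    sampled = rng.choices(names, weights=sizes, k=sample_size)

    # Only the sampled files are classified, not the full data set.
    counts = Counter(classify(name) for name in sampled)

    result = {}
    for file_type, count in counts.items():
        p = count / sample_size
        # Normal-approximation error margin for a proportion.
        half_width = 1.96 * math.sqrt(p * (1 - p) / sample_size)
        result[file_type] = (p, half_width)
    return result


# Hypothetical inventory: the true size shares are 0.4, 0.3, and 0.3.
files = (
    [(f"netcdf_{i}", 800) for i in range(50)]
    + [(f"grib_{i}", 300) for i in range(100)]
    + [(f"text_{i}", 10) for i in range(3000)]
)

est = estimate_share_by_size(
    files, classify=lambda name: name.split("_")[0], sample_size=2000
)
for file_type, (share, margin) in sorted(est.items()):
    print(f"{file_type}: {share:.3f} +/- {margin:.3f}")
```

Under the same size-proportional sampling, averaging the per-file ratio of compressed size to original size over the sampled files would give an estimate of the overall ratio for the whole data set; this is again only an illustration of the principle.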

Key Information

Contact		Dr. Julian Kunkel
Repository		Public on GitHub

Publications

SFS: A Tool for Large Scale Analysis of Compression Characteristics (Julian Kunkel), 2017-05-05 BibTeX PDF
Identifying Relevant Factors in the I/O-Path using Statistical Methods (Julian Kunkel), 2015-03-14 BibTeX PDF

Talks

Statistical File Characterization and Status Update: Monitoring at DKRZ (Dr. Julian Kunkel), BoF: Analyzing Parallel I/O, Supercomputing Conference, Salt Lake City, USA, 2016-11-17 Presentation
Analyzing Data Properties Using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features (Dr. Julian Kunkel), HPC-IODC Workshop, Frankfurt, Germany, 2016-06-23 Presentation
Analyzing Data Properties Using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features (Dr. Julian Kunkel), ISC High Performance, Frankfurt, Germany, 2016-06-21 Presentation