author	 = {Raul Torres},
	title	 = {{Comparison of Compiler’s Intermediate Representations and Input/Output Access Patterns with String Kernels}},
	advisors	 = {Thomas Ludwig and Julian Kunkel and Manuel F. Dolz},
	year	 = {2018},
	month	 = {06},
	school	 = {Universität Hamburg},
	howpublished	 = {{Online \url{{{:research:theses:raul_torres_comparison_of_compiler__s_intermediate_representations_and_input_output_access_patterns_with_string_kernels.pdf|Thesis}}}}},
	type	 = {PhD's Thesis},
	abstract	 = {Kernel methods aim for the detection of stable patterns robustly and efficiently from a finite data sample by embedding data items into a space of higher dimensionality where data points have linear relations. Strings kernels apply this methodology to find relationships between string objects by checking for the number of shared substrings and using this measure as a similarity score. Due to the low number of studies conducted in the area of code comparison and Input/Output (I/O) access pattern recognition using string kernels, the goal of this thesis is to propose a suitable, general representation, as well as the corresponding strategies of comparison based on kernel methods, such that they can be used successfully to determine a reliable similarity measure among a collection of programs. Therefore, we propose different conversion strategies from these original sources to a weighted string representation; the defined representation is a collection of tokens whose weights allow the modulation of the contribution of each token in the calculation of the overall similarity. The resulting strings are compared with a new family of kernel functions, which correspond to the major contribution of this thesis: the kastx spectrum kernel family. In order to create a similarity measure among two strings, these kernels are based on the longest common substrings; the idea behind this approach is to give more relevance to the largest common pieces of code rather than to small and disperse code instructions. The size of the valid matching substrings is controlled by the cut weight, a parameter given by the user, that specifies the minimum weight that those substrings should have. We tested our methodology in two scenarios: i) pattern recognition in I/O traces, and ii) comparison of intermediate representations of a compiler. In the first scenario, the clustering analysis showed that the proposed kernels managed to conform clusters that reflected the similarity of patterns taken from two popular I/O benchmarks. For the second scenario, a set of C functions was organized in four different classes, according to their purpose; clustering analysis here also showed a cluster organization that reflected the affinity among functions of the same class. These new kernels obtained similar, and in some cases, better results when compared to the blended spectrum kernel [SC04], a string kernel with widespread use in cheminformatics problems. The provided kernels will enrich the spectra of available string kernel functions on the literature and might be used in the future in similarity studies, not only in the field of computer science, but also in other areas like cheminformatics, bioinformatics or natural language processing.},