"Machine Learning for Software Maintainability":
Slides for the Talk at the Joint Workshop on Intelligent Methods for Software System Engineering (JIMSE) 2012
its overall life cycle or it progressively becomes less satisfactory” (cit. Lehman’s Law of Software Evolution) • Software Maintenance is one of the most expensive and time consuming phase of the whole life cycle • Anticipating the Maintenance operations reduces the cost • 85%-90% of the total cost are related to the effort necessary to comprehend the system and its source code [Erlikh, 2000]
Data Helpers / Utilities Security Operational Management Communications Business Components Application Facade Buisiness Workflows Messages Interfaces Service Interfaces • Provide models and views representing the relationships among different software artifacts • Clustering of Software Artifacts • Advantages: • To aid the comprehension • To reduce maintenance effort SOFTWARE ARCHITECTURE
software artifacts • Clustering of Software Artifacts • Advantages: • To aid the comprehension • To reduce maintenance effort SOFTWARE ARCHITECTURE External Systems Service Consumers Services Service Interfaces Messages Interfaces Cross Cutting Security Operational Management Communications Data Data Access Components Data Helpers / Utilities Presentation UI Components UI Process Components Business Application Facade Buisiness Workflows Business Components Clusters of Software Artifacts
software artifacts • Clustering of Software Artifacts • Advantages: • To aid the comprehension • To reduce maintenance effort SOFTWARE ARCHITECTURE External Systems Service Consumers Services Service Interfaces Messages Interfaces Cross Cutting Security Operational Management Communications Data Data Access Components Data Helpers / Utilities Presentation UI Components UI Process Components Business Application Facade Buisiness Workflows Business Components Clusters of Software Artifacts
The different levels of abstractions lead to different analysis tasks: • Identification of functional modules and their hierarchical arrangement • i.e., Clustering of Software classes • Identification of Code Clones • i.e., Clustering of Duplicated code fragments (blocks, SOFTWARE ARTIFACTS
the syntactic/lexical information provided in the source code text • Exploit the relational information between artifacts • e.g., Program Dependencies Problem: Definition of a proper similarity measure to apply in the clustering analysis, which is able to exploit the considered representation of software artifacts SOFTWARE ARTIFACTS CLUSTERING
the source code • Combine different kind of information (lexical and structural) • Application of Kernel Methods to software artifacts • Provide flexible and computational effective solutions to analyze large data sets ADVANCED MACHINE LEARNING FOR SOFTWARE MAINTENANCE
the source code • Combine different kind of information (lexical and structural) • Application of Kernel Methods to software artifacts • Provide flexible and computational effective solutions to analyze large data sets Advanced Machine Learning • Learning with syntactic/semantic information (Natural Language Processing) • Learning in relational domains (Structured-output learning, Logic Learning, Statistical Relational Learning) ADVANCED MACHINE LEARNING FOR SOFTWARE MAINTENANCE
function between (arbitrary) pairs of entities • It can be seen as a kind of similarity measure • Based on the idea that structured objects can be described in terms of their constituent parts • Generalize the computation of the dot product to arbitrary domains • Can be easily tailored to specific domains • Tree Kernels • Graph Kernels • ....
• Tree Kernels can be used to measure the similarity between parse trees KERNELS FOR LANGUAGES • Abstract Syntax Trees (AST) represent the syntactic structure of a piece of code • Research on Tree Kernels for NLP carries over to AST (with adjustments) KERNELS FOR SOURCE CODE
Ranking • Unsupervised Learning • Clustering • Anomaly Detection Idea: Any learning algorithm relying on similarity measure can be used KERNEL MACHINES
different characteristics • e.g., Ignore variables names or not • Employ kernel learning approaches which learn a weighted combination of candidate kernels • Useless/harmful kernels will get zero weight and will be discarded in the final model LEARNING SIMILARITIES
software • Training examples are software projects/portions with annotation on existing clones (clustering) • A learning model uses training examples to refine the similarity measure for correctly clustering novel examples STRUCTURED-OUTPUT LEARNING
Advanced Machine learning approaches are promising for exploiting such information • Kernel Methods are natural candidate • e.g., see the analogy between NLP parse trees and AST • Many applications: • architecture recovery, code clone detection, vulnerability detection .... SUMMARY
Copy&Paste programming • Taxonomy of 4 different types of clones • Program Text similarities and Functional similarities • Clones affect the reliability and the maintainability of a software system CODE CLONE DETECTION
syntactic structure of the different instructions of a program (function) • Program Dependencies Graph (PDG) • (Directed) Graph structure representing the relationship among the different statement of a program KERNELS FOR CLONES CODE STRUCTURES Kernels for Structured Data: • The source code could be represented by many different data structures
Nodes (Dashed ones) • e.g., if - for - while - function calls... • Data Nodes • e.g., expressions - parameters... NODES AND EDGES while call-site arg expr
Nodes (Dashed ones) • e.g., if - for - while - function calls... • Data Nodes • e.g., expressions - parameters... • Two Types of Edges (i.e., dependencies) • Control edges (Dashed ones) • Data edges NODES AND EDGES while call-site arg expr
new Kernel for a Structured Object requires the definition of: • Set of features to annotate each part of the object • A Kernel function to measure the similarity on the smallest part of the object • e.g., Nodes of AST and Graphs • A Kernel function to apply the computation on the different (sub)parts of the structured object KERNELS FOR CODE STRUCTURES
4 features • Instruction Class • i.e., LOOP, CONDITIONAL_STATEMENT, CALL • Instruction • i.e., FOR, IF, WHILE, RETURN • Context • i.e., Instruction Class of the closer statement node • Lexemes • Lexical information gathered (recursively) from leaves KERNELS FOR CODE STRUCTURES: AST TREE KERNELS FOR AST FOR FOR-INIT FOR- BODY
4 features • Instruction Class • i.e., LOOP, CONDITIONAL_STATEMENT, CALL • Instruction • i.e., FOR, IF, WHILE, RETURN • Context • i.e., Instruction Class of the closer statement node • Lexemes • Lexical information gathered (recursively) from leaves KERNELS FOR CODE STRUCTURES: AST Instruction Class = LOOP Instruction = FOR Context = (e.g., LOOP) Lexemes = (e.g, name of variables in FOR- INIT..) TREE KERNELS FOR AST FOR FOR-INIT FOR- BODY
blocks to each other • Blocks: Atomic unit for (sub) tree considered KERNELS FOR CODE STRUCTURES: AST TREE KERNELS FOR AST BLOCK = = print x 1.0 y f x x y BLOCK = = print s 0.0 p f s 1.0 p
blocks to each other • Blocks: Atomic unit for (sub) tree considered KERNELS FOR CODE STRUCTURES: AST TREE KERNELS FOR AST BLOCK = = print x 1.0 y f x x y BLOCK = = print s 0.0 p f s 1.0 p
WHILE, CALL-SITE, EXPR, ... • Node Type • i.e., Data Node or Control Node • Features of edges: • Edge Type • i.e., Data Edge or Control Edge KERNELS FOR CODE STRUCTURES: PDG GRAPH KERNELS FOR PDG while call-site arg expr expr
WHILE, CALL-SITE, EXPR, ... • Node Type • i.e., Data Node or Control Node • Features of edges: • Edge Type • i.e., Data Edge or Control Edge KERNELS FOR CODE STRUCTURES: PDG Node Label = WHILE Node Type = Control Node GRAPH KERNELS FOR PDG while call-site arg expr expr Control Edge Data Edge
GRAPH KERNELS FOR PDG • Goal: Identify common subgraphs • Selectors: Compare nodes to each others and explore the subgraphs of only “compatible” nodes (i.e., Nodes of the same type) • Context: The subgraph of a node (with paths whose lengths are at most L to avoid loops) KERNELS FOR CODE STRUCTURES: PDG
GRAPH KERNELS FOR PDG • Goal: Identify common subgraphs • Selectors: Compare nodes to each others and explore the subgraphs of only “compatible” nodes (i.e., Nodes of the same type) • Context: The subgraph of a node (with paths whose lengths are at most L to avoid loops) KERNELS FOR CODE STRUCTURES: PDG
GRAPH KERNELS FOR PDG • Goal: Identify common subgraphs • Selectors: Compare nodes to each others and explore the subgraphs of only “compatible” nodes (i.e., Nodes of the same type) • Context: The subgraph of a node (with paths whose lengths are at most L to avoid loops) KERNELS FOR CODE STRUCTURES: PDG
GRAPH KERNELS FOR PDG • Goal: Identify common subgraphs • Selectors: Compare nodes to each others and explore the subgraphs of only “compatible” nodes (i.e., Nodes of the same type) • Context: The subgraph of a node (with paths whose lengths are at most L to avoid loops) KERNELS FOR CODE STRUCTURES: PDG
detector tools: • AST-based Clone detector • PDG-based Clone Detector • No publicly available clone detection dataset • No unique set of analyzed open source systems EMPIRICAL EVALUATION
detector tools: • AST-based Clone detector • PDG-based Clone Detector • No publicly available clone detection dataset • No unique set of analyzed open source systems • Usually clone results are not available EMPIRICAL EVALUATION
detector tools: • AST-based Clone detector • PDG-based Clone Detector • No publicly available clone detection dataset • No unique set of analyzed open source systems • Usually clone results are not available • Two possible strategies: EMPIRICAL EVALUATION
detector tools: • AST-based Clone detector • PDG-based Clone Detector • No publicly available clone detection dataset • No unique set of analyzed open source systems • Usually clone results are not available • Two possible strategies: • To automatically modify an existing system with randomly generated clones EMPIRICAL EVALUATION
detector tools: • AST-based Clone detector • PDG-based Clone Detector • No publicly available clone detection dataset • No unique set of analyzed open source systems • Usually clone results are not available • Two possible strategies: • To automatically modify an existing system with randomly generated clones • Manual classification of candidate results EMPIRICAL EVALUATION
another (pure) AST-based clone detector • Clone Digger http://clonedigger.sourceforge.net/ • Comparison on a system with randomly seeded clones Results refer to clones where code fragments have been modified by adding/removing or changing code statements EVALUATION TREE KERNELS FOR AST
• Kernel Methods advantages: • flexible solution to be tailored to specific domain • efficient solution easy to parallelize • combinations of multiple kernels • Provide a publicly available data set