
Software Architecture Patterns Detector for Java Applications using BERT model

This is a presentation about my capstone project for the Mastering Neural Networks course at MITxPro. The project applies the deep learning knowledge from the course to source-code analysis in order to detect the architectural patterns of Java applications. I used the BERT model as the basis for training a custom model, using random GitHub repositories as the dataset.


Bruno Tinoco

May 19, 2025


Transcript

  1. Software Architecture Patterns Detector for Java Applications using BERT model

    MODULE 8: CAPSTONE PROJECT - APRIL 2025 - BRUNO TINOCO
  2. Project Summary

     This project intends to build a model, based on the pre-trained BERT variant CodeBERT, that extracts common features from a set of labeled GitHub repositories in order to detect which architecture style any given Java project applies, using its source code as the model input.
     Most software development projects adopt architecture patterns to build solutions for the different types of problems in each industry. These patterns are sometimes combined with others to solve common issues, and many of them share characteristics that can be mapped to patterns found in source code. Some of the best-known architecture patterns or styles are the following:
     Layered
     Monolith
     Clean
     Microservices
     DDD
     Hexagonal
     Event-driven
     These architectures can be combined in a single solution, and asserting whether they are applied to a specific project requires manual inspection of the source code. This model could be an initial step toward automating software architecture reviews or generating diagrams that visualize a project's structure.
  3. Project Solution

     The first step in building this project was to find and label GitHub source-code repositories applying each of the mapped architecture patterns.
     GitHub provides a search mechanism to find candidate repositories by tags. This was the most time-consuming task; online LLMs helped make the search faster.
     The dataset generated from the labeled repositories follows the structure below, identifying the repository, the architecture style used, and a list of key files with their contents, which serve as the input for embedding generation.

     [
       {
         "repo": "spring-petclinic",
         "architecture": "layered",
         "files": [
           {
             "file_path": "src/main/java/org/springframework/OwnerController.java",
             "content": "package org.springframework...\npublic class OwnerController {...}"
           },
           ...
         ]
       }
     ]

     Some pre-processing was applied to normalize the data: assuming that most enterprise Java projects follow common keywords in their class naming conventions (model, controller, service, and repository), we grouped the code contents into these naming-convention categories.
     DATASET SELECTION
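The grouping step described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual code: the `group_by_convention` helper and the exact matching rule (keyword found in the file path) are assumptions.

```python
# Hypothetical sketch: group a repository's files into the naming-convention
# categories (model, controller, service, repository) described on the slide.
CATEGORIES = ("model", "controller", "service", "repository")

def group_by_convention(repo_entry):
    """Map each naming category to the concatenated contents of the
    files whose path contains that keyword (first match wins)."""
    grouped = {cat: [] for cat in CATEGORIES}
    for f in repo_entry["files"]:
        path = f["file_path"].lower()
        for cat in CATEGORIES:
            if cat in path:
                grouped[cat].append(f["content"])
                break
    return {cat: "\n".join(chunks) for cat, chunks in grouped.items()}

repo = {
    "repo": "spring-petclinic",
    "architecture": "layered",
    "files": [
        {"file_path": "src/main/java/OwnerController.java",
         "content": "public class OwnerController {}"},
        {"file_path": "src/main/java/OwnerRepository.java",
         "content": "public interface OwnerRepository {}"},
    ],
}

grouped = group_by_convention(repo)
```

Files that match none of the keywords are simply dropped here; a real pipeline would need a policy for those.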
  4. Project Solution

     The next step was to find a pre-trained model that could support feature extraction from the source code, providing a word tokenizer and the embeddings.
     Since the input dataset is Java programming code, the model needs to process text and understand its semantics, so this problem falls into the Natural Language Processing category.
     BERT was the main model considered for the task, but there is a version of BERT fine-tuned for programming code: CodeBERT, created by researchers at Microsoft. It is a bimodal pre-trained model for natural language and programming languages (NL-PL), including Java.
     Hugging Face provides an open-source Python library that allows us to load the pre-trained CodeBERT model using PyTorch.
     The model provides two main functions:
     Code tokenizer
     Embedding generation
     MODEL DEFINITION
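Loading CodeBERT through the Hugging Face library could look like the sketch below, using the `microsoft/codebert-base` checkpoint listed in the references. Taking the [CLS] token's hidden state as the code's 768-dimensional feature vector is a common choice, assumed here rather than confirmed by the slides.

```python
# Sketch: load pre-trained CodeBERT via Hugging Face transformers and
# produce an embedding for one snippet of Java code.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "public class OwnerController { }"
inputs = tokenizer(code, return_tensors="pt",
                   truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token's hidden state serves as the snippet's feature vector.
embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```

The `truncation=True, max_length=512` arguments reflect BERT-style models' fixed input window; the truncation issue mentioned later in the deck stems from this limit.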
  5. Project Solution

     After normalizing the data, we loaded a customized tensor dataset that combines the embeddings and labels for model training, using the CodeBERT tokenizer.
     We defined a custom classifier with 768 inputs, using dropout and ReLU layers, down to 6 possible output classes, one for each architecture style/pattern.
     The model was trained for 20 epochs with a batch size of 8 for each split (train and validation).
     TRAINING
     INPUT LAYER: 768 OUTPUT LAYER: 6
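A minimal PyTorch sketch of such a classifier head is shown below. Only the 768-dimensional input, the ReLU and dropout layers, and the 6-class output come from the slide; the hidden size (256) and dropout rate (0.3) are assumptions.

```python
# Sketch of the classifier head: 768-d CodeBERT embeddings in,
# one hidden layer with ReLU and dropout, 6 architecture classes out.
import torch
from torch import nn

class ArchitectureClassifier(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=256,
                 num_classes=6, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)

clf = ArchitectureClassifier()
logits = clf(torch.randn(8, 768))  # batch size 8, as on the slide
```

Training would pair these logits with `nn.CrossEntropyLoss` over the 6 single-label classes, matching the multiclass setup the deck describes.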
  6. Model Execution Results

     The model reached 89% accuracy, as demonstrated by the classification report below.
     The confusion matrix helped validate how well each class performed.
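A classification report and confusion matrix like the ones on this slide can be produced with scikit-learn, roughly as below. The labels and predictions here are illustrative stand-ins, not the project's actual results.

```python
# Sketch: evaluate per-class performance with scikit-learn metrics.
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Illustrative ground-truth and predicted class indices (0..5 would map
# to the six architecture styles; only three appear in this toy example).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred)   # per-class P/R/F1 table
cm = confusion_matrix(y_true, y_pred)            # rows: true, cols: predicted

print(report)
print(cm)
```

Each off-diagonal cell of the matrix counts one kind of confusion (here, one class-2 sample predicted as class 1), which is what makes it useful for spotting the weakest classes.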
  7. Project Solution

     We found some misclassifications that could be reduced by increasing the number of samples for each class.
     We did not consider that these classes can be combined in the same project, so the model should be adjusted to produce a multi-label output; for instance, many projects adopt the DDD pattern combined with another architecture type.
     We could also improve the model's accuracy and explore its semantics by capturing more relevant code details: because we truncate the input, we may be missing some features.
     IMPROVEMENTS
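The multi-label adjustment suggested above could be sketched with independent per-class sigmoid outputs and a binary cross-entropy loss, so a project can be tagged with several styles at once (e.g. DDD plus another pattern). The tensors below are illustrative placeholders.

```python
# Sketch: switch from single-label softmax to multi-label sigmoid outputs.
import torch
from torch import nn

num_classes = 6
logits = torch.randn(8, num_classes)       # classifier output, batch of 8
targets = torch.zeros(8, num_classes)      # one row of 0/1 flags per sample
targets[0, [2, 4]] = 1.0                   # sample 0 tagged with two styles

# BCEWithLogitsLoss treats each class as an independent yes/no decision.
loss = nn.BCEWithLogitsLoss()(logits, targets)

# At inference time, threshold each sigmoid output independently.
preds = (torch.sigmoid(logits) > 0.5).int()
```

The only architectural change versus the 6-way classifier is the loss and the thresholding; the 768-to-6 head itself can stay the same.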
  8. References

     CodeBERT: A Pre-Trained Model for Programming and Natural Languages
     https://arxiv.org/abs/2002.08155
     CodeBERT GitHub repository
     https://github.com/microsoft/CodeBERT
     Hugging Face CodeBERT-base model
     https://huggingface.co/microsoft/codebert-base
     List of software architecture styles and patterns
     https://en.wikipedia.org/wiki/List_of_software_architecture_styles_and_patterns
     GitHub source-code repositories
     https://github.com/