
Software Architecture Patterns Detector for Java Applications using BERT model

This is a presentation about my capstone project for the Mastering Neural Networks course at MITxPro. The project applies the deep learning knowledge from the course to source-code analysis in order to detect the architectural patterns of Java applications. I used the BERT model as the basis for training a custom model, using random GitHub repositories as the dataset.


Bruno Tinoco

May 19, 2025


Transcript

  1. Software Architecture Patterns Detector for Java Applications using BERT model

    MODULE 8: CAPSTONE PROJECT - APRIL 2025 - BRUNO TINOCO
  2. Project Summary

     This project intends to build a model, based on the pre-trained BERT variant CodeBERT, that extracts common features from a set of labeled GitHub repositories in order to detect which architecture style any given Java project applies, using its source code as the model input.
     Most software development projects adopt architecture patterns to build solutions for the different types of problems in each industry. These patterns are sometimes combined with others to solve common issues, and many of them share characteristics that can be mapped to patterns found in source code. Some of the best-known architecture patterns or styles are the following:
     Layered
     Monolith
     Clean
     Microservices
     DDD
     Hexagonal
     Event-driven
     These architectures can be combined in a single solution, and asserting whether they are applied to a specific project requires manual inspection of the source code. This model could be an initial step toward automating software architecture reviews or generating diagrams that visualize a project's structure.
  3. Project Solution

     The first step in building this project was to find and label GitHub source-code repositories applying each of the mapped architecture patterns.
     GitHub provides a search mechanism to find candidate repositories by tags. This was the most time-consuming task; online LLMs helped make the search faster.
     The dataset generated from the labeled repositories follows the structure below, identifying the repository, the architecture style used, and a list of key files with their contents, which serve as the input for embedding generation.

     [
       {
         "repo": "spring-petclinic",
         "architecture": "layered",
         "files": [
           {
             "file_path": "src/main/java/org/springframework/OwnerController.java",
             "content": "package org.springframework...\npublic class OwnerController {...}"
           },
           ...
         ]
       }
     ]

     Some pre-processing was applied to normalize the data: assuming that most enterprise Java projects follow common keywords in their class naming conventions (model, controller, service, and repository), we grouped the code contents into these naming-convention categories.
     DATASET SELECTION
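The grouping step described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual code: the `group_by_convention` helper and the exact matching rule (keyword found in the file path) are assumptions.

```python
# Hypothetical sketch: group a repository's files into the naming-convention
# categories (model, controller, service, repository) described on the slide.
CATEGORIES = ("model", "controller", "service", "repository")

def group_by_convention(repo_entry):
    """Map each naming category to the concatenated contents of the
    files whose path contains that keyword (first match wins)."""
    grouped = {cat: [] for cat in CATEGORIES}
    for f in repo_entry["files"]:
        path = f["file_path"].lower()
        for cat in CATEGORIES:
            if cat in path:
                grouped[cat].append(f["content"])
                break
    return {cat: "\n".join(chunks) for cat, chunks in grouped.items()}

repo = {
    "repo": "spring-petclinic",
    "architecture": "layered",
    "files": [
        {"file_path": "src/main/java/OwnerController.java",
         "content": "public class OwnerController {}"},
        {"file_path": "src/main/java/OwnerRepository.java",
         "content": "public interface OwnerRepository {}"},
    ],
}

grouped = group_by_convention(repo)
```

Files that match none of the keywords are simply dropped here; a real pipeline would need a policy for those.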
  4. Project Solution

     The next step was to find a pre-trained model that could support feature extraction from the source code, providing a word tokenizer and the embeddings.
     Since the input dataset is Java programming code, the model needs to process text and understand its semantics, so this problem falls into the Natural Language Processing category.
     BERT was the main model considered for the task, but there is a version of BERT fine-tuned for programming code: CodeBERT, created by researchers at Microsoft. It is a bimodal pre-trained model for natural language and programming languages (NL-PL), including Java.
     Hugging Face provides an open-source Python library that allows us to load the pre-trained CodeBERT model using PyTorch.
     The model provides two main functions:
     Code tokenizer
     Embedding generation
     MODEL DEFINITION
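Loading CodeBERT through the Hugging Face library could look like the sketch below, using the `microsoft/codebert-base` checkpoint listed in the references. Taking the [CLS] token's hidden state as the code's 768-dimensional feature vector is a common choice, assumed here rather than confirmed by the slides.

```python
# Sketch: load pre-trained CodeBERT via Hugging Face transformers and
# produce an embedding for one snippet of Java code.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "public class OwnerController { }"
inputs = tokenizer(code, return_tensors="pt",
                   truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token's hidden state serves as the snippet's feature vector.
embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```

The `truncation=True, max_length=512` arguments reflect BERT-style models' fixed input window; the truncation issue mentioned later in the deck stems from this limit.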
  5. Project Solution

     After normalizing the data, we loaded a customized tensor dataset that combines the embeddings and labels for model training, using the CodeBERT tokenizer.
     We defined a custom classifier with 768 inputs, using dropout and ReLU layers, down to 6 possible output classes, one for each architecture style/pattern.
     The model was trained for 20 epochs with a batch size of 8 for each split (train and validation).
     TRAINING
     INPUT LAYER: 768 OUTPUT LAYER: 6
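A minimal PyTorch sketch of such a classifier head is shown below. Only the 768-dimensional input, the ReLU and dropout layers, and the 6-class output come from the slide; the hidden size (256) and dropout rate (0.3) are assumptions.

```python
# Sketch of the classifier head: 768-d CodeBERT embeddings in,
# one hidden layer with ReLU and dropout, 6 architecture classes out.
import torch
from torch import nn

class ArchitectureClassifier(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=256,
                 num_classes=6, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)

clf = ArchitectureClassifier()
logits = clf(torch.randn(8, 768))  # batch size 8, as on the slide
```

Training would pair these logits with `nn.CrossEntropyLoss` over the 6 single-label classes, matching the multiclass setup the deck describes.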
  6. Model Execution Results

     The model reached 89% accuracy, as demonstrated by the classification report below.
     The confusion matrix helped validate how well each class performed.
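A classification report and confusion matrix like the ones on this slide can be produced with scikit-learn, roughly as below. The labels and predictions here are illustrative stand-ins, not the project's actual results.

```python
# Sketch: evaluate per-class performance with scikit-learn metrics.
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Illustrative ground-truth and predicted class indices (0..5 would map
# to the six architecture styles; only three appear in this toy example).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred)   # per-class P/R/F1 table
cm = confusion_matrix(y_true, y_pred)            # rows: true, cols: predicted

print(report)
print(cm)
```

Each off-diagonal cell of the matrix counts one kind of confusion (here, one class-2 sample predicted as class 1), which is what makes it useful for spotting the weakest classes.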
  7. Project Solution

     We found some misclassifications that could be reduced by increasing the number of samples for each class.
     We did not consider that these classes can be combined in the same project, so the model should be adjusted to produce a multi-label output; for instance, many projects adopt the DDD pattern combined with another architecture type.
     We could also improve the model's accuracy and explore its semantics by capturing more relevant code details: because we truncate the input, we may be missing some features.
     IMPROVEMENTS
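The multi-label adjustment suggested above could be sketched with independent per-class sigmoid outputs and a binary cross-entropy loss, so a project can be tagged with several styles at once (e.g. DDD plus another pattern). The tensors below are illustrative placeholders.

```python
# Sketch: switch from single-label softmax to multi-label sigmoid outputs.
import torch
from torch import nn

num_classes = 6
logits = torch.randn(8, num_classes)       # classifier output, batch of 8
targets = torch.zeros(8, num_classes)      # one row of 0/1 flags per sample
targets[0, [2, 4]] = 1.0                   # sample 0 tagged with two styles

# BCEWithLogitsLoss treats each class as an independent yes/no decision.
loss = nn.BCEWithLogitsLoss()(logits, targets)

# At inference time, threshold each sigmoid output independently.
preds = (torch.sigmoid(logits) > 0.5).int()
```

The only architectural change versus the 6-way classifier is the loss and the thresholding; the 768-to-6 head itself can stay the same.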
  8. References

     CodeBERT: A Pre-Trained Model for Programming and Natural Languages
     https://arxiv.org/abs/2002.08155
     CodeBERT GitHub repository
     https://github.com/microsoft/CodeBERT
     Hugging Face CodeBERT-base model
     https://huggingface.co/microsoft/codebert-base
     List of software architecture styles and patterns
     https://en.wikipedia.org/wiki/List_of_software_architecture_styles_and_patterns
     GitHub source-code repositories
     https://github.com/