Assessing Reliability of Statistical Maximum Coverage Estimators in Fuzzing

Assessing Reliability of Statistical Maximum Coverage Estimators in Fuzzing Danushka
Liyanage, PhD Co-authors: Nelum Attanayake, Zijian(Jack) Luo, and Rahul Gopinath, PhD School of Computer Science, University of Sydney, Australia Registered Report - ICSME 2025 Auckland, New Zealand September 2025

Background 2 In testing, • No matter how long, there
could still be bugs in the program • 100% coverage for most real-world programs is infeasible • Maximum reachable coverage is a key concern • Coverage is a major criteria of testing completeness Source: https://www.fuzzingbook.org/html/Coverage.html

Maximum Reachability in Fuzzing 3 • Modelled fuzzing as a
statistical sampling process (STADS, 2018) [1] • AKA species richness estimators since borrowed from Ecology • Proposed a suite of statistical estimators for maximum reachable coverage[2,3] Maximum reachability estimation in fuzzing Fuzzing [1] Böhme, M. (2018). STADS: Software testing as species discovery. ACM Transactions on Software Engineering and Methodology (TOSEM), 27(2), 1-52. [2] Liyanage, D., Böhme, M., Tantithamthavorn, C., & Lipp, S. (2023, May). Reachable coverage: Estimating saturation in fuzzing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (pp. 371-383). IEEE. [3] Kuznetsov, K., Gambi, A., Dhiddi, S., Hess, J., & Gopinath, R. (2024). Empirical Evaluation of Frequency Based Statistical Models for Estimating Killable Mutants. In 2024 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (pp. 61–71). ACM.

Research Gap 4 • Ground-Truth ( ) is unknown for
real-world programs S Challenges in Estimation and Evaluation [2] Liyanage, D., Böhme, M., Tantithamthavorn, C., & Lipp, S. (2023, May). Reachable coverage: Estimating saturation in fuzzing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (pp. 371-383). IEEE. • No studies on the reliability of effectiveness estimators Small-Programs Real-World Programs Ground Truth Attainable in Fuzzing Unattainable in Fuzzing Generalizability No Yes Real-world Small-Scale

Our Approach - Part 01 5 Generate Synthetic Ground-Truth Benchmarks
• These parsers has complexity comparable to real-world programs • Any control- fl ow can be represented as context-free LL(1) grammar • The reverse of this is also true i.e. Converting LL(1) grammar to recursive descent parser

6 Generate Synthetic Benchmarks • Each nonterminal becomes a procedure
e.g: 〈D〉 def parse_D • Each production rule is implemented in that procedure • Linear recursion can be replaced with a while loop { • Lookahead tokens (next unprocessed input symbols) guide which rule to apply

7 Controlling the Program Complexity and Reachability Variable Controlled #
of non-terminals # of procedures # and length of production rules Branching complexity Depth and type of recursion Direct, indirect, linear recursion Unreachable nonterminals or dead code Partially unreachable programs These parser programs will be our ground-truth subjects • Grammar has reachability property parser procedures are reachable • Expansion rules Reachable inputs

Our Approach - Part 2 8 Reliability of Effectiveness Estimators
• Sampling Unit - All the inputs generated within a time interval (STADS) • Previous research only used constant interval size [2] (i.e. 15 mins) t=0 t=r t=2r t=3r t=4r … Interval Size = r { Estimators should not depend on the sampling unit We evaluate estimator reliability by varying sampling unit size r [2] Liyanage, D., Böhme, M., Tantithamthavorn, C., & Lipp, S. (2023, May). Reachable coverage: Estimating saturation in fuzzing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (pp. 371-383). IEEE.

Experimental Setup 9 Part 01 Part 02 Fuzzer AFL++ Subject
Programs 100 Generated Parsers average 100k LoC 10-20 programs from Fuzzbench Campaign Length # of Trials ~100 trials of 24hr Fuzzing campaigns ~30 trials of 7-day Fuzzing campaigns Initial Seed Corpus Well-formed valid inputs Established seeds from Fuzzbench[4] Evaluation Metrics • Mean-bias • Variance • Con fi dence Intervals • Welch’s t-test or non-parametric test [4] Metzman, J., Szekeres, L., Simon, L., Sprabery, R., & Arya, A. (2021, August). Fuzzbench: an open fuzzer benchmarking platform and service. In Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering (pp. 1393-1403).

Our Paper 10 Preprint

Software Engineering Team @USyd 11 Visit Site

Summary 12

Assessing Reliability of Statistical Maximum Co...

Assessing Reliability of Statistical Maximum Coverage Estimators in Fuzzing

Rahul Gopinath

More Decks by Rahul Gopinath

Other Decks in Research

Featured

Transcript