
Scalability, Practicability (and Promotion) in SE Research

GitHub is the world’s largest collection of open source software, with around 27 million users and 80 million repositories. These numbers make GitHub an invaluable source of data for large-scale empirical software engineering research. In this talk, we describe recent research conducted in our group, based on GitHub data. For example, we are using GitHub to predict the popularity of open source projects, to understand the motivations behind refactoring, to characterize the evolution of APIs, and to reveal key characteristics of open source development teams. In the talk, we also plan to discuss the strategies we are using to make our research results known by practitioners.

ASERG, DCC, UFMG

September 17, 2018

Transcript

  1. Scalability, Practicability (and Promotion) in SE Research. Marco Tulio Valente, ASERG, DCC, UFMG, Brazil (@mtov). SBCARS, September 2018.
  2. This talk's story is about using as much data as possible to shed light on modern software engineering problems.
  3. "Why" surveys • Why do we refactor? • Why do

    we break APIs? • Why do open source projects fail? • Why do we star GitHub projects? 15
  4. Why do we really refactor? • Danilo tracked refactorings (using a tool) in: ◦ 748 Java projects, over 61 days ◦ 1,411 refactorings in 185 projects, by 465 devs ◦ 195 answers (42%), collected right after each refactoring.
  5. Key Finding: Refactoring is driven by the need to add new features and fix bugs, and much less by code smell resolution.
  6. Why do we break APIs? • Aline tracked breaking changes (BCs) (using a tool) in: ◦ 400 Java libraries & frameworks, over 116 days ◦ 282 possible BCs, by 102 developers ◦ 56 answers (55%), collected right after the BCs.
  7. Key Finding: We break APIs to implement new features (32%), to simplify the APIs (29%), and to improve maintainability (24%).
  8. Why do open source projects fail? • Jailton asked this question to the maintainers of 408 projects without commits for one year ◦ 118 answers (29%).
  9. Why do open source projects fail? Reasons, with number of projects: Usurped by a competitor (27), Obsolete (20), Lack of time (18), Lack of interest (18), Outdated technologies (14), Low maintainability (7), Conflicts among developers (3), Legal problems (2), Acquisition (1).
  10. (image-only slide)
  11. (image-only slide)
  12. Why do we star GitHub projects? • Hudson asked this question to 4,370 GitHub users, right after they starred a popular repository ◦ 791 answers (19%).
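The deck does not show how the star data itself was collected. As a rough, hypothetical illustration (the class FetchRepoStars and its argument handling are invented for this sketch, not the study's tooling), star counts are exposed by the public GitHub REST API as the stargazers_count field of the repository resource:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Fetches repository metadata from the public GitHub REST API and extracts
// the stargazers_count field, the popularity proxy used in these studies.
// Note: unauthenticated requests are subject to GitHub's rate limits.
public class FetchRepoStars {
    public static void main(String[] args) throws Exception {
        String repo = args.length > 0 ? args[0] : "torvalds/linux"; // "owner/name"
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.github.com/repos/" + repo))
                .header("Accept", "application/vnd.github+json")
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        String body = response.body();
        // Crude extraction to avoid a JSON dependency: locate the field by name.
        String field = "\"stargazers_count\":";
        int i = body.indexOf(field);
        if (i >= 0) {
            int start = i + field.length();
            int end = body.indexOf(',', start);
            System.out.println(repo + " has " + body.substring(start, end).trim() + " stars");
        }
    }
}

To know who starred a repository and when (needed to contact users right after they star), the API's stargazers listing can additionally return starred_at timestamps when requested with GitHub's star media type; whether the study used that endpoint or another mechanism is not stated in the deck.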
  13. Key Finding #2: 3 out of 4 devs consider stars before contributing to or using GitHub projects.
  14. (1) Surveys require building tools; (2) Surveys are used to evaluate tools; (3) Surveys motivate building tools; (4) Surveys contribute to public datasets.
  15. Refactoring Detection Tools: • RefactoringMiner 1.0 (Tsantalis, CASCON 2013) • RefactoringMiner 1.1 (Tsantalis, Danilo, MT, FSE 2016) • RefDiff 1.0, a new tool (Danilo, MT, MSR 2017) • RefactoringMiner 2.0 (Tsantalis et al., ICSE 2018) • RefDiff 2.0 (??)
  16. Refactoring Detection Tools: • RefactoringMiner 1.0 (Tsantalis, CASCON 2013) • RefactoringMiner 1.1 (Tsantalis, Danilo, MT, FSE 2016) • RefDiff 1.0, a new tool (Danilo, MT, MSR 2017) • RefactoringMiner 2.0 (Tsantalis et al., ICSE 2018) • RefDiff 2.0 (??) ⇒ refactoring-aware tools (code reviews, MSR, etc.) [see Andre, Romain & MT, ICSE 2018]
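For a flavor of how such detectors are used, the sketch below mines refactorings from a project's history with RefactoringMiner. It is written from memory of the library's documented entry points (GitServiceImpl, GitHistoryRefactoringMinerImpl, detectAll, RefactoringHandler); package names and the handler signature differ across the 1.x and 2.x releases, and the repository URL is only an example, so treat this as an approximation to be checked against the release in use.

import java.util.List;

import org.eclipse.jgit.lib.Repository;
import org.refactoringminer.api.GitHistoryRefactoringMiner;
import org.refactoringminer.api.GitService;
import org.refactoringminer.api.Refactoring;
import org.refactoringminer.api.RefactoringHandler;
import org.refactoringminer.rm1.GitHistoryRefactoringMinerImpl;
import org.refactoringminer.util.GitServiceImpl;

public class MineRefactorings {
    public static void main(String[] args) throws Exception {
        GitService gitService = new GitServiceImpl();
        GitHistoryRefactoringMiner miner = new GitHistoryRefactoringMinerImpl();

        // Clone (or reuse) a local copy of the project under study.
        Repository repo = gitService.cloneIfNotExists(
                "tmp/refactoring-toy-example",
                "https://github.com/danilofes/refactoring-toy-example.git");

        // Walk the history of a branch and report detected refactorings per commit.
        miner.detectAll(repo, "master", new RefactoringHandler() {
            @Override
            public void handle(String commitId, List<Refactoring> refactorings) {
                for (Refactoring r : refactorings) {
                    System.out.println(commitId + ": " + r.toString());
                }
            }
        });
    }
}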
  17. Truck (or Bus) Factor: the minimum number of developers that, if hit by a truck (or bus), would put a project at serious risk.
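The group's estimation approach first assigns authors to files using a degree-of-authorship model; that part is omitted here. The sketch below (all names hypothetical) shows only a simplified greedy step: given a precomputed file-to-authors map, developers are removed, most-covering first, until more than half of the files are left with no remaining author, and the number of removed developers is the truck factor estimate.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TruckFactorSketch {

    // Greedy estimation: repeatedly drop the developer who authors the most
    // still-covered files; stop when more than half of the files have lost
    // all of their authors. The number of dropped developers is the estimate.
    static int truckFactor(Map<String, Set<String>> authorsPerFile) {
        Map<String, Set<String>> remaining = new HashMap<>();
        authorsPerFile.forEach((file, authors) -> remaining.put(file, new HashSet<>(authors)));

        int totalFiles = remaining.size();
        int orphanedFiles = 0;
        int removedDevs = 0;

        while (orphanedFiles * 2 <= totalFiles) {
            // Count how many still-covered files each developer authors.
            Map<String, Integer> coverage = new HashMap<>();
            for (Set<String> authors : remaining.values()) {
                for (String dev : authors) {
                    coverage.merge(dev, 1, Integer::sum);
                }
            }
            if (coverage.isEmpty()) {
                break; // every file is already orphaned
            }
            String topDev = coverage.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .get().getKey();

            // Remove the top developer and collect files that become orphaned.
            removedDevs++;
            List<String> nowOrphaned = new ArrayList<>();
            for (Map.Entry<String, Set<String>> e : remaining.entrySet()) {
                e.getValue().remove(topDev);
                if (e.getValue().isEmpty()) {
                    nowOrphaned.add(e.getKey());
                }
            }
            for (String file : nowOrphaned) {
                remaining.remove(file);
                orphanedFiles++;
            }
        }
        return removedDevs;
    }

    public static void main(String[] args) {
        // Toy authorship map: estimate is 2 (losing alice and bob orphans 3 of 4 files).
        Map<String, Set<String>> authorsPerFile = new HashMap<>();
        authorsPerFile.put("core/Parser.java", new HashSet<>(Set.of("alice")));
        authorsPerFile.put("core/Lexer.java", new HashSet<>(Set.of("alice")));
        authorsPerFile.put("ui/Main.java", new HashSet<>(Set.of("bob", "carol")));
        authorsPerFile.put("ui/View.java", new HashSet<>(Set.of("bob")));
        System.out.println("Estimated truck factor: " + truckFactor(authorsPerFile));
    }
}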
  18. "I get emails like this every week ... This problem

    [is] worse than spam, since Google at least filters out spam for me". 59
  19. Based on our experience / lessons learned:
      1. Questions should focus on practical and prevailing problems
      2. Questions should focus on recent events
      3. Questions should be sent by e-mail
      4. Mails should have 2-3 short and clear questions
      5. Avoid sending thousands of mails
      6. Never send two mails to the same person (even in distinct studies)
      7. Never identify the participants (names, e-mails, projects, etc.)
  20. In our experience, there is a strong correlation between practical value and response rate: questions of practical value yield response rates of at least 20%.
  21. % of API elements deprecated with replacement messages (SANER 2016 & JSS 2018, with Gleison and Andre) ⇒ automatic documentation.
  22. [There are other ways to transfer knowledge, e.g., open tools: important, but very hard.]
  23. 78 "Oh, nice to hear from you! I heard a

    lot about (and read) your group's truck factor paper. Cool work!" (answer received in another survey, not related with TFs)