Alex Reads - MS Research - Cohesive and Isolated Development with Branches

Saturday, January 14, 2012

Post-read thoughts -
   This paper seems like it was written by amateurs. Note that I am not a member of the academic community, nor do I write academic papers, so this is more of a comment on their writing style and their ability to defeat my BS filter (i.e. Can you prove that? How exactly do you define 'x'?).
   Having said that, there are some useful ideas and interesting results from their interviews and research with real projects. Here's what I found interesting:
  • Studies show that branch usage greatly increases among new adopters of DVC.
    • Pre-DVC, 1.54 branches/month. With-DVC, 3.67 branches/month (though I worry about methods used to obtain this info)
    • The idea that prior to DVC, branches were created only for releases, not new features.
    • To effectively use DVC branches, create one for each new feature, localized bug fix, or maintenance effort.
  • Studies show that even with DVC, a central repo is still used. (It is important to admit this, IMO)
    • An accessible DVC repo enables anyone to contribute to the project. Developers without commit privileges were reduced to working w/o VC. Accepting changes from unofficial project members has high barriers.
    • Academics advise us to checkpoint code at frequent intervals in a place separate from the 'team repo'. Only tested and stable code should be integrated into the 'team repo'. DVC systems enable and encourage this practice.
  • The term "Semantic conflict" - All VC systems are good at syntactic conflicts, but not semantic conflicts.
  • Awareness of 'distracted commits', which are commits required to resolve merge conflicts.



Link to Microsoft Research paper -
Introduction web page - http://research.microsoft.com/apps/pubs/default.aspx?id=157290
Research paper [PDF] - http://research.microsoft.com/pubs/157290/paper.pdf


Abstract. The adoption of distributed version control (DVC), such as Git and
Mercurial, in open-source software (OSS) projects has been explosive. Why is
this and how are projects using DVC? This new generation of version control supports two important new features: distributed repositories, and history-preserving
branching and merging where branching is easier, faster, and more accurately
recorded. We observe that the vast majority of projects using DVC continue to
use a centralized model of code sharing, while using branching much more extensively than when using CVC. In this study, we examine how branches are
used by over sixty projects adopting DVC in an effort to understand and evaluate
how branches are used and what benefits they provide. Through interviews with
lead developers in OSS projects and a quantitative analysis of mined data from
development histories, we find that projects that have made the transition are
using observable branches more heavily to enable natural collaborative processes:
history-preserving branching allow developers to collaborate on tasks in highly
cohesive branches, while enjoying reduced interference from developers working
on other tasks, even if those tasks are strongly coupled to theirs.



Introduction
  1. Purpose of Version Control
    1. Create isolated workspace from a particular state of the source code.
    2. Can work within one branch without impacting other developers
  2. Purpose of branches
    1. Should be 'cohesive' so that a team can work together on a branch
    2. Keeps new features separate, and allows merging features when complete
  3. Evolution of VC systems
    1. Marked by 'increasing fidelity of the histories they record'
    2. 1st gen - record individual file changes - can roll back individual files (RCS)
    3. 2nd gen - record sets of file changes (transactions) that can be rolled back (CVS)
    4. 3rd gen - records history of files even through branching and merging (DVC)
  4. DVC features
    1. Every copy of a project is a complete repository, complete with history
    2. Can exchange source code changes with other peer repositories
    3. Preserves history through branches and merges
      1. Each child commit tracks its parent commits - across branches and merges
      2. Allows us to quantitatively study branch cohesion and isolation
      3. Allows us to study the relationship between branch usage and defect rates and schedule delays
  5. Why has DVC become so popular?
    1. Developers wanted to use branches, but experienced "merge pain" with CVS
      1. Studies show that branch usage greatly increases among new adopters of DVC
      2. Studies show that even with DVC, a central repo is still used
      3. Can observe that branched history can be linearized into a single 'mainline' branch
  6. RQ2 is "How cohesive are branches?"
    1. 'Cohesivity' is measured by directory distance of files modified in a branch (wha?)
    2. Compare branch cohesion in Linux history against trunk branch cohesion
    3. If branches are not more cohesive, then either a) trunk is more cohesive or b) directory distance is not a good measurement for 'cohesivity' (lol)
    4. Results - branches are far more cohesive than background commit sequences (background?)
  7. RQ3 is "How successfully do DVC branches isolate developers?"
    1. VC is good about flagging syntactic changes between branch-time and merge-time
    2. VC is not good about flagging semantic changes between branch-time and merge-time
      1. Semantic = assumptions made during development (so, API/method changes?)
      2. Branch coupling causes semantic conflict
    3. Semantic conflict is measured as the number of files in a branch that were also modified in trunk since the fork
    4. Measure how often a semantic conflict would interrupt a developer if using no branching
  8. Paper proves three things
    1. Prove that branching, not distribution, has driven popularity in DVC
    2. Define two new measures, branch cohesion and distracted commits
      1. 'Distracted commits' are new commits required to resolve merge conflicts
    3. Show that branches are used to undertake cohesive development tasks
    4. Show that branches effectively protect developers from concurrent development interruptions
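
Aside - to make the 'directory distance' idea from RQ2 concrete for myself, here is a minimal sketch. This is my own guess at the measure, not the paper's definition: I assume the distance between two files is the number of directory steps up from one file's directory to the common ancestor plus the steps down to the other's, and that a branch's score is the mean over all pairs of files it touched (lower = more cohesive). The function names and file paths are made up for illustration.

```python
from __future__ import annotations

from itertools import combinations
from pathlib import PurePosixPath


def directory_distance(a: str, b: str) -> int:
    """Directory steps between two files (assumed definition, see lead-in)."""
    dir_a = PurePosixPath(a).parent.parts
    dir_b = PurePosixPath(b).parent.parts
    # Count how many leading directory components the two paths share.
    common = 0
    for x, y in zip(dir_a, dir_b):
        if x != y:
            break
        common += 1
    # Steps up from a's directory plus steps down into b's directory.
    return (len(dir_a) - common) + (len(dir_b) - common)


def mean_pairwise_distance(files: list[str]) -> float:
    """Average distance over all distinct file pairs; lower means more cohesive."""
    pairs = list(combinations(sorted(set(files)), 2))
    if not pairs:
        return 0.0
    return sum(directory_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical file sets: one focused on a single directory, one scattered.
focused_branch = ["fs/ext4/inode.c", "fs/ext4/super.c"]
scattered_commits = ["fs/ext4/inode.c", "net/ipv4/tcp.c", "README"]
print(mean_pairwise_distance(focused_branch))
print(mean_pairwise_distance(scattered_commits))
```

Under these assumptions the focused set scores lower than the scattered one, which is the shape of result the paper reports for branches versus background commit sequences. It also shows why I question the measure: it only works if related files really do live near each other.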
Theory
  1. History
    1. Git and Mercurial basic history - birth, growth, majority use in Debian
    2. Adopting new VC is very difficult - citing experiences by Gnome, KDE, and Python
  2. RQ1 "Why did projects rapidly adopt DVC?"
    1. Interviews show that main reason is to use branches for better cohesion and isolation
    2. Exactly how cohesive are branches? How well do they isolate feature teams?
    3. If developers use branches to isolate tasks, branches will be cohesive. On the other hand, developers could use branches merely to isolate personal development work, without separating work into tasks
  3. RQ2 "How cohesive are branches?"
    1. Coupling and Interruption
      1. Should checkpoint code at frequent intervals separate from 'team repo' - only tested and stable code should be integrated into 'team repo'
      2. When ready, integration must not be difficult, or the gains of the personal branch are lost
      3. When not using branches, changes are not proven stable, require integration work
      4. Studies show that resuming from interruption takes at least 15 minutes
  4. RQ3 "To what extent do branches protect developers from integration interruptions caused by concurrent work in other branches?"
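
Aside - the semantic conflict measure from my Introduction notes (files in a branch that were also modified in trunk since the fork) boils down to a set intersection. A minimal sketch, assuming each commit is represented simply as a list of the file paths it touched. The second function is my own rough proxy for "how often would a developer have been interrupted without branching" and is not taken from the paper; all names and paths are hypothetical.

```python
def semantic_conflict_files(branch_files, trunk_commits_since_fork):
    """Files touched in the branch that trunk also modified since the fork."""
    trunk_files = set()
    for commit_files in trunk_commits_since_fork:
        trunk_files.update(commit_files)
    return set(branch_files) & trunk_files


def potential_interruptions(branch_files, trunk_commits_since_fork):
    """Count trunk commits that touch at least one file the branch works on.

    Assumption: each such commit would have interrupted the branch's
    developers had everyone been committing to a single shared line.
    """
    branch = set(branch_files)
    return sum(1 for commit_files in trunk_commits_since_fork
               if branch & set(commit_files))


# Hypothetical data: the branch edits two files; trunk saw three commits.
branch = ["net/tcp.c", "net/udp.c"]
trunk = [["net/tcp.c", "README"], ["fs/inode.c"], ["net/tcp.c"]]
print(sorted(semantic_conflict_files(branch, trunk)))
print(potential_interruptions(branch, trunk))
```

Note that this only catches overlap at the file level. It says nothing about whether an API or assumption actually changed, so it is at best a stand-in for real semantic conflicts.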

Methodology
  1. Began with interviews to develop hypotheses regarding motivations for adoption
  2. Evaluated the hypotheses empirically by performing statistical analysis
  3. Semi-structured interviews (sounds like high probability for introduction of non-scientific bias)

Evaluation
  1. Description of linearizing a branched DVC history
    1. Project concurrent sequence of changes onto single timeline
    2. Commits on this timeline represent changes 'across' branches
  2. Rapid DVC adoption
    1. Observe that, contrary to common knowledge, most DVC projects do not make use of distribution
      1. Of 60 projects, all but Linux use centralized model around single public repo
        1. (this doesn't make sense. I think their understanding of 'distributed' is off)
    2. Some branches that grew too different from trunk had to be abandoned
    3. Prior to DVC, branches were created only for releases, not new features
    4. Pre-DVC, 1.54 branches/month. With-DVC, 3.67 branches/month
    5. Developers without commit privileges were reduced to working w/o VC
      1. Accepting changes from unknown devs required huge patch sets
        1. Could not add incremental work
        2. Sometimes included unrelated changes
    6. Therefore, main motivation is branching, not distribution (define "distribution"?)
  3. Cohesion
    1. Large systems structure their files in a modular manner - related files are located nearby (I question this premise)
    2. [Science! Graphs are shown, descriptions and explanations are given]
    3. Results show that branches are relatively cohesive.
      1. Interviews are consistent - branches are created for more than releases (low standard)
      2. DVC branches comprise features, localized bug fixes, and maintenance efforts
      3. Three interviewees indicate that non-trivial changes would have been created offline and then committed in a single mega-commit
  4. Coupling and Interruptions
    1. [Hardcore science! Too difficult to understand. Questionably scientific pictures]
      1. Trying to identify and quantify 'semantic conflicts'
    2. Some disclaimer that git allows 'hidden' history in unpublished commits, hidden by rebasing
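
Aside - the 'linearizing a branched history' step above is easier for me to picture in code. A minimal sketch, assuming each commit records its parent ids (as DVC histories do) plus a timestamp, and that linearizing means a topological order with ties broken by timestamp. I do not know the paper's exact procedure, so treat this as one plausible reading. The Commit class and the sample ids are hypothetical.

```python
from __future__ import annotations

import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    id: str
    timestamp: int
    parents: tuple[str, ...] = ()  # a merge commit has two or more parents


def linearize(commits: list[Commit]) -> list[str]:
    """Project a branched history onto one timeline.

    Parents always come before their children; among commits whose
    parents are all placed, the one with the earliest timestamp goes next.
    """
    by_id = {c.id: c for c in commits}
    children = {c.id: [] for c in commits}
    pending = {}  # number of parents not yet placed on the timeline
    for c in commits:
        known = [p for p in c.parents if p in by_id]
        pending[c.id] = len(known)
        for p in known:
            children[p].append(c.id)

    ready = [(c.timestamp, c.id) for c in commits if pending[c.id] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, cid = heapq.heappop(ready)
        order.append(cid)
        for child in children[cid]:
            pending[child] -= 1
            if pending[child] == 0:
                heapq.heappush(ready, (by_id[child].timestamp, child))
    return order


# Hypothetical history: trunk (root, a1, a2), a branch (b1), and a merge (m).
history = [
    Commit("m", 5, ("a2", "b1")),
    Commit("a2", 4, ("a1",)),
    Commit("b1", 3, ("root",)),
    Commit("a1", 2, ("root",)),
    Commit("root", 1),
]
print(linearize(history))  # ['root', 'a1', 'b1', 'a2', 'm']
```

On the resulting single timeline the branch commit b1 lands between the trunk commits a1 and a2, which is what I take 'changes across branches' to mean.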

Related Work
  1. This paper's main concern is to study history-preserving branching and merging
    1. Some people advocate even finer grained history retention
    2. Some people advocate automating information acquisition, such as static relationships
  2. Some people recommend patterns to use for workflows that effectively use branching
    1. Other people advocate workflows that mitigate branching/merging issues
  3. Somebody argues that current tools and project management are inadequate
