InLineDifferencingThoughts
From WinMerge Development Wiki
As stated in Development 2.9.x, the in-line differencing is a very visible feature in WinMerge but it has lots of known problems. Below is the list of thoughts collected while looking at fixing/refactoring this feature. See also String Differencing for general description of the feature.
Contents |
[edit] Possible implementation starting points
- current implementation
- Geek's implementation, which is too buggy, but maybe easier to fix than our current implementation: #1784310 Geek's version of line differencing
- some other GPL implementation as base (several diff programs have similar feature)
[edit] Expectations of in-line differencing
- Should be able to work in case sensitive or insensitive way, depending on user's settings
- Should be able to group differences in any of the following three ways (depending on user's settings)
- Character level differencing (the easiest one, no grouping)
- Word-level differencing (break at whitespace)
- Word-level differencing (break at whitespace or punctuation)
- As with the three musketeers, there may be a fourth one we could add which would be a generalization of the previous 2: Word-level differencing where user specifies where to break through regular expressions or something similar
- Should be able to honor the whitespace user setting (see manual for meaning)
- Compare
- Ignore change
- Ignore all
- (need to verify this) Detect end-of-line differences?
[edit] Related bugs
Here is a list of bugs that are (or may be) related to in-line differencing and could be used as a source of data for unit or regression tests. Some may be there erroneously, I looked at a lot of bugs in the tracker the day I scanned the list.
- #465783
- #510352
- #830911
- #964068
- #1177592
- #1262229
- #1444655
- #1639453
- #1658984
- #1683061
- #1714088
- #1883409
- #1890614
- #1910870
- #1939279
- #1949397
- #1989811
- #1996269
that is up to 2008-07-08 for the "DIFF Engine" category.
[edit] Diff algorithm theory
Looking for the smallest differences between 2 strings is the equivalent of finding the "Longest Common Subsequence" (LCS for short) of these 2 strings and declaring the rest of the 2 strings as different. There has been a lot of research done on the subject. Here is some links to good documentation I found:
- LCS algorithms explained from simple recursive in O(mn) space and time to Hirschberg's algorithm linear space and O(mn) time
- Some more with links
[edit] New implementation plan
The new in-line diff algorithm should try as much as possible to separate the functionalities into different "policies". The diff-algorithm should not be doing itself the whitespace checks, the case-sensitivity check, the word-breaking and DBCS lead byte verification; it should only perform diff algorithm.
So the overall algorithm should be something like:
- FindDifferences(leftstring, rightstring)
- Foreach string
- Build word array (word meaning depends on line-difference coloring option)
- Remove spaces (depends on whitespace option)
- Select comparison function to pass to diff algorithm (depends on case-sensitivity option and on the definition of word above)
- Perform diff using the comparison function and the new structure/iterators representing the original strings in there modified state
- Convert the returned diff into the right character pointers to be consumed by the caller
- Foreach string
or
- FindDifferences(leftstring, rightstring)
- Select comparison function to pass to diff algorithm (depends on case-sensitivity option)
- Perform diff at character-level without taking whitespace or grouping into account but using the case-(in)sensitivity comparison function
- Modify the returned diff structure to remove unwanted whitespace differences (depends on whitespace option)
- Modify the returned diff structure to move the character pointers at the begin/end of groups (depends on line-difference coloring option)

