Message Board


Message Board > Programming > Text comparison

February 26, 2006, 00:01
Frimkron
Frustrated Megalomaniac
703 posts

Comparing 2 versions of a block of text and highlighting the differences. It sounds like it should be easy but I'm at a loss as to where I should even begin to try and get this working. Does anybody have any ideas or useful links?
____________
#
February 26, 2006, 00:47
PEader
お前はもう死んでいる
1486 posts

It depends how complex but I'd say you mmight need to do several pass throughs.

This code wont work but it should give you some ideas:
Code:
//starting with two strings
String String1 = "I eat peaches but only on a Sunday afternoon";
String String2 = "I eat apples but never on a Sunday morning";

for( int i = 0, j = 0; i < String1.length() && j < String2.length(); j++){
   if( String1[ i ] == String2[ j ] ){
      //store character in new string
      i++;
   }else{
      //they don't match so start highlighting as difference
      
   }

}

____________
I see 57,005 people.
#
February 26, 2006, 01:03
PEader
お前はもう死んでいる
1486 posts

This is somethign I am going to be working on soon so I decided to search for some references of previous work. Starting with the "diff" utility of *nix.

We have the following:

http://en.wikipedia.org/wiki/Diff Wikipedia Article on "diff" with links

A technique for isolating differences between files - seems it was free at one stage but you have to pay for it now

http://ejohn.org/projects/java … diff-algorithm/ a javascript implementation for the previous technique
____________
I see 57,005 people.
#
February 26, 2006, 01:41
PEader
お前はもう死んでいる
1486 posts

http://researchweb.watson.ibm. … story/index.htm ibm link nice information
____________
I see 57,005 people.
#
February 26, 2006, 03:13
PEader
お前はもう死んでいる
1486 posts

I couldn't get the algorithm mentioned in those docs so I made my own one up. It seems to work ok.


Code:
class StringDiff 
{
    public static void main(String[] args) 
    {
        StringDiff.diff( 
                "the rain in spain falls mainly earth on the plain test",  
                "the rain in spain dog falls mainly on the plain cow"
        );
    }

    public static void diff( String s1, String s2 )
    {
        int i, j;
        String [][] words = new String[2][];
        words[0] = s1.split( "\\s" );
        words[1] = s2.split( "\\s" );

        forloop:for( i = 0, j = 0; i < words[0].length && j < words[1].length; ){
            if( words[0][i].compareTo( words[1][j] ) == 0 ){
                System.out.println( ":" + words[0][i] + ":\t\t:" + words[1][j] + ":" );
                i++; j++;
            }else{
                //search ahead and see if [0][i] can be found in [1]
                int x = j + 1;
                int y;
                while( x < words[ 1 ].length ){
                    if( words[0][i].compareTo( words[1][ x ] ) == 0 ){
                        y = j;
                        j = x;
                        //print all the inserted words
                        for( ; y < x; y++ ){
                            System.out.println( ":new:\t\t:" + words[1][y] + ":" );
                        }
                        continue forloop;
                    }
                    x++;
                }   

                //search ahead and see if [1][j] can be found in [0]
                x = i + 1;
                while( x < words[ 0 ].length ){
                    if( words[0][x].compareTo( words[1][ j ] ) == 0 ){
                        y = i;
                        i = x;
                        for( ; y < x; y++ ){
                            System.out.println( ":" + words[0][y] + ":\t\t:deleted : "  );
                        }
                        continue forloop;
                    }
                    x++;
                }
                break forloop;
            }
        }

        //print excess words
        while( i < words[ 0 ].length ){
                System.out.println( ":" + words[0][i] + ":\t\t:deleted: "  );
                i++;
        }

        while( j < words[ 1 ].length ){
                System.out.println( ":new:\t\t:" + words[1][j] + ":"  );
                j++;
        }

    }
}



Sample output:
Code:
:the:           :the:
:rain:          :rain:
:in:            :in:
:spain:         :spain:
:new:           :dog:
:falls:         :falls:
:mainly:                :mainly:
:earth:         :deleted :
:on:            :on:
:the:           :the:
:plain:         :plain:
:test:          :deleted:
:new:           :cow:


for:

        StringDiff.diff( 
                "the rain in spain falls mainly on the plain",  
                "falls mainly on the plain the rain in spain"
        );
:new:           :falls:
:new:           :mainly:
:new:           :on:
:the:           :the:
:new:           :plain:
:new:           :the:
:rain:          :rain:
:in:            :in:
:spain:         :spain:
:falls:         :deleted:
:mainly:                :deleted:
:on:            :deleted:
:the:           :deleted:
:plain:         :deleted:


[Edited on February 26, 2006 by PEader]
____________
I see 57,005 people.
#
February 26, 2006, 03:13
Fiona
games are terrible
-9616558 posts

I remember starting one that split a block of text by paragraph, compare the paragraphs, then compared it word for word.
Explode helps, I think comparing character by character is overkill.
____________
laffo
#
February 26, 2006, 03:19
PEader
お前はもう死んでいる
1486 posts

Quoting Ferret:
I think comparing character by character is overkill.

that all depends on what you are doing. If you needed to compare it character by character it wouldn't be overkill!
____________
I see 57,005 people.
#
February 26, 2006, 12:49
Frimkron
Frustrated Megalomaniac
703 posts

Well you made that look bloody easy didn't you :P I've been trying to work this out for ages now... I was about to resort to evolving a solution with a genetic algorithm!
____________
#
February 26, 2006, 12:53
Woody
HEAD BLACK MAN
722 posts

PEader has been eating his spinach.
____________
boredome is the bitter fruit of too much routine
#
February 26, 2006, 13:42
PEader
お前はもう死んでいる
1486 posts

Quoting Frimkron:
Well you made that look bloody easy didn't you :P I've been trying to work this out for ages now... I was about to resort to evolving a solution with a genetic algorithm!

It's a bit rough though.

Myself and EL Moogster are going to try make a more elegant solution that can tell when you move words around and works out the least number of changes needed to transform one sentence into the other.


Oh yeah, post more programming problems please. This was quite diverting.

[Edited on February 26, 2006 by PEader]
____________
I see 57,005 people.
#
February 26, 2006, 14:15
Frimkron
Frustrated Megalomaniac
703 posts

Yes looking at it again I think its what I came up with at first. The problem is it can match the wrong words and assume massive chunks have been inserted or deleted when they clearly haven't.

When you start thinking about it as a least-number-of-changes problem it starts to get into heuristic search terretory. Its such an annoying problem
____________
#
February 26, 2006, 14:37
PEader
お前はもう死んでいる
1486 posts

Quoting Frimkron:
Yes looking at it again I think its what I came up with at first. The problem is it can match the wrong words and assume massive chunks have been inserted or deleted when they clearly haven't.


Well you see it isn't matching the wrong words it is matching only one case. The strings we used are:

"BAM BOOM BOOM BOOM BOOM"
"BOOM BOOM BOOM BOOM BAM"

How can you tell that bam was deleted and inserted at the end rather then the four booms being deleted and added to the start, what is the correct awnser?

So you need to check all transformation possiblities and return the least expensive match.

But I think the way I have it working is fine. What would be cool is if we could get that article on file comparisons then we could implement that.
____________
I see 57,005 people.
#

Message Board > Programming > Text comparison

Quick reply


You must log in or register to post.
Copyright © 2005 Booleansoup.com
Questions? Comments? Bug reports? Contact us!