Feature request - key column specification
Moderator: SourceGear
-
- Posts: 8
- Joined: Wed Apr 09, 2008 9:35 am
- Location: CA, USA
Feature request - key column specification
It would be useful if there were a way to specify a column range in which an exact match identifies a matching pair of records, and if that range could be re-specified for each new diff section following a manual alignment marker.
I'll try to explain what I mean.
Suppose you have a file containing 2 time-ordered reports. Each report contains a column of data that is the date/time stamp of the line, but that column occupies a different set of character positions in each of the 2 reports. A match of lines is desired only if the date/time stamps match; otherwise, the lines should be considered different lines. What I need to see are the differences in all the other columns of the report besides the key columns, as well as cases where a key value is present in one file but omitted from the other.
For example, the first report might have the date/time stamp in columns 10 thru 25 and the second report in columns 30 thru 45.
I would want to specify the key columns as 10 thru 25 for the first report, then insert a manual alignment marker after it that specifies the key columns as 30 thru 45 for use in the diff of the 2nd report. In actuality there would be more than 2 reports in each file, but I simplified the example.
Now I have a further complication. The time stamps are specified in a DDD HH:MM:SS.SSS format (day, hour, minute, second to the millisecond); however, the times are frequently off by a millisecond or two, and it is still desired to match the lines. Omitting the last digit of the time stamp from the key column specification would probably do the trick for most of the lines, although 10.000 and 09.999 seconds would still not match using that method.
Does anybody else out there have any similar needs or have any suggestions on how to accomplish this?
Ideally, one could specify a maximum delta between the 2 key values, but the syntax for a delta could get unwieldy with time format specifications, and for a general-purpose diff tool not all column matches are numeric, so you start to get into specifying data types as well.
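To make the delta idea concrete, here is a rough Python sketch of the kind of delta-tolerant key comparison I have in mind (the column ranges and the 2 ms tolerance are placeholders, not proposed syntax):

Code:

import re

# Pattern for a DDD HH:MM:SS.SSS timestamp (day, hour, minute,
# second to the millisecond), as used in the reports above.
TS_RE = re.compile(r'(\d{3}) (\d{2}):(\d{2}):(\d{2})\.(\d{3})')

def to_millis(line, start, end):
    """Parse the timestamp in the given column range into milliseconds."""
    m = TS_RE.search(line[start:end])
    if m is None:
        return None
    day, hh, mm, ss, ms = (int(g) for g in m.groups())
    return (((day * 24 + hh) * 60 + mm) * 60 + ss) * 1000 + ms

def keys_match(line_a, line_b, cols_a, cols_b, tol_ms=2):
    """True when both lines carry timestamps within tol_ms of each other."""
    ta = to_millis(line_a, *cols_a)
    tb = to_millis(line_b, *cols_b)
    return ta is not None and tb is not None and abs(ta - tb) <= tol_ms

# e.g. keys_match(a, b, (9, 25), (29, 45)) for key columns 10-25 vs
# 30-45 (0-based slice indices, hence the off-by-one).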
-- Jim
- Jim B
-
- Posts: 534
- Joined: Tue Jun 05, 2007 11:37 am
- Location: SourceGear
- Contact:
I'm not sure DiffMerge is the right tool for this task.
I'm not sure that DiffMerge (or any diff tool) is the right match
for this job. Let me explain. Selecting 2 distinct sets of columns
in the files allows for vertical alignment of the files (and for showing
gaps when one file doesn't have a particular timestamp), but the
rest of the line will be different (each report is reporting something
different, right?) So, even if DiffMerge could vertically sync things
up correctly, it'd still just show the entire file as 1 large change
because of the rest of each line -- this would be sort of like comparing
2 completely different files with no correlation -- probably not very
useful.
Then we add the timestamp complication. I'm not sure how we
could handle this in a general way.
I'm thinking that what you need is a database or spreadsheet
application (or perhaps a perl/python script). Read each file; for
each line, normalize the timestamp (and maybe round it a little)
and output a record with the modified timestamp as the key and the
rest of the line as a data column. Load all of the files that way
(each into its own data column) into 1 db or spreadsheet.
Then you can generate a db report or view the spreadsheet
(sorted by timestamp) and see each column properly matched
up. There'll be gaps in particular columns where a source
file was missing a line for that timestamp.
You might be able to do it with a perl/python script doing
essentially the same thing, using associative arrays keyed
on the modified timestamp. But my perl/python is a little
rusty, so I'll not elaborate too much here in a public forum.
If I've understood your problem, I think either of these
suggestions will do the trick.
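Roughly, though, such a script would have this shape -- an untested
Python sketch, where the timestamp regex and the crude last-digit
truncation are assumptions based on what you've described:

Code:

#!/usr/bin/env python
# Untested sketch: key each line on its normalized timestamp so that
# the union of keys, walked in order, shows a gap wherever one file
# is missing a line for a given timestamp. Assumes one line per
# timestamp; real reports may need (ID, timestamp) keys.
import re
import sys

TS_RE = re.compile(r'\d{3} \d{2}:\d{2}:\d{2}\.\d{3}')

def load(path):
    records = {}
    with open(path) as f:
        for line in f:
            m = TS_RE.search(line)
            if m:
                # Crude normalization: drop the last millisecond digit.
                key = m.group(0)[:-1]
                records[key] = line.rstrip('\n')
    return records

a = load(sys.argv[1])
b = load(sys.argv[2])
for key in sorted(set(a) | set(b)):   # zero-padded keys sort in time order
    print(key)
    print('  A: ' + a.get(key, '<missing>'))
    print('  B: ' + b.get(key, '<missing>'))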
Hope this helps,
jeff
-
- Posts: 8
- Joined: Wed Apr 09, 2008 9:35 am
- Location: CA, USA
Reply to: I'm not sure DiffMerge is the right tool for this
I guess I was unclear in describing my usage of DiffMerge on these sets of reports, and didn't convey the degree of similarity between the 2 reports being compared. Maybe I can better explain why I feel your wonderful DiffMerge is the best tool I have located for this job.
The 2 files are created from the same input data with slightly different versions of the same software. The comparison is being done to see if the code changes to the software base have caused any unforeseen side effects in the output reports. In most cases, the changes in the rest of the report lines are very small -- often no more than 1 or 2 characters different.
The 2 versions of the software I am currently comparing are one written in Fortran and the other converted to C++. Ideally, they should yield the same reports as output; however, the Fortran version contained a mixture of REAL*4 and REAL*8, whereas in the conversion to C++ both of those were replaced with double (we are not using float for what was REAL*4). We are getting a little precision difference, which sometimes shows up as a different last digit in one of the values in a report. The reports usually don't display more than the first 4 or 5 digits of each value, so most values are displayed identically.
We have a test suite of over 300 test cases. For many of the test cases, DiffMerge has worked wonderfully to compare the set of output reports from the 2 versions. Usually, when DiffMerge gets confused in the comparison, it's because something went wrong in the translation from Fortran to C++, and that conversion mistake made a substantial difference in the output reports. The DiffMerge output is being used to help identify where to set a debugger breakpoint to look for the mistake. Each time a bug is corrected and the reports re-run, the number of differences between the 2 sets of reports decreases.
Once the entire test suite is validated, we would continue using DiffMerge to compare the results after a batch of changes to our source code against the results from a set of runs made prior to the batch of changes.
We rerun a test suite validation about once a month using a partial subset of our full test suite, and rerun the full test suite about once a year. Each time, the report files from the current run are compared with the ones from the prior run.
In short, if the time stamps match, then the rest of the line should be quite similar if not exactly identical. In no case would one expect to find more than a dozen characters different in a line averaging about 90 columns wide. Comparing one line to the next line, however, can produce much greater differences, as the objects being measured are moving and are thus reporting different positions, trajectories, etc., which is why matching on the time stamp is important.
I'm not so worried if the timestamps are reported as different when they are nearly identical but not exactly so (the 1 to 2 millisecond difference). For 99% of the lines, just omitting the last digit of the time stamp from the key specification columns would be an adequate solution.
- Jim B
-
- Posts: 534
- Joined: Tue Jun 05, 2007 11:37 am
- Location: SourceGear
- Contact:
Here's another possible workaround
OK, but if the lines are mostly the same why are the timestamps
in different columns in different versions of the files?
I don't have any way to do column matching/filtering/etc to do what
you want, but I do have an idea that might get you a little further
towards your goal. (It's somewhat of a hack, and I apologize in
advance.)
If you ran a little awk/sed script on each report that put a copy of
the timestamp on a line by itself immediately before each report
line (with or without the timestamp removed from the data line),
you could then compare the output of the script as before. The
timestamp lines should cause things to line up as you want, and
changes will show up on the data lines that differ.
You may still need to round off/up the timestamps for optimal matching,
but you should be closer to your goal.
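In Python instead of awk/sed, the transformation might look
something like this (untested; the timestamp pattern is an
assumption based on the format you described):

Code:

#!/usr/bin/env python
# Untested sketch of the workaround: copy each line's timestamp onto
# a line of its own immediately before the data line, so the diff
# can sync on the timestamp-only lines.
import re
import sys

TS_RE = re.compile(r'\d{3} \d{2}:\d{2}:\d{2}\.\d{3}')

for line in sys.stdin:
    m = TS_RE.search(line)
    if m:
        print(m.group(0))       # timestamp line used for alignment
    sys.stdout.write(line)      # original data line, unchanged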
You may want to switch to the "Lines Only" detail level when viewing
the diffs of the script output, as that will turn off the multi-line
intra-line analysis (character highlighting spanning line breaks).
Hope this helps,
jeff
-
- Posts: 8
- Joined: Wed Apr 09, 2008 9:35 am
- Location: CA, USA
Re. another possible workaround
Jeff,
The reason some timestamps are different is bugs (undoubtedly several) in the new version of the source code, which were causing some sections of the code to be skipped. There are about 2 million lines of code that were converted from Fortran to C++, and we are trying to identify the causes of the missing output. Of course, after nearly every bug fix in the conversion, the number of differences decreases.
The character highlighting is one of the nicer and more useful features of your tool. Turning that off by switching to "Lines Only" is not desirable and puts us back where we started with a tool like the DIFF built into Visual Studio.
Your comments have given me an idea for a hack that might work for this type of case: move, or even copy, the time stamp to the beginning of the line. The reports are generated dynamically from a report definition file, so that definition could be altered to specify the time stamp at the front of the line. Last night I located the bug that was skipping the section of code and causing the missing lines, so that problem is gone from the test case I am working with right now. There are still more problems to find (2 of the reports still show too much difference in results) before I move on to the next test case.
I still think it would be a useful enhancement to allow specifying a range of columns to use for line matching, and I would still like to request that such an enhancement at least be considered.
Also, I see some talk in the help about rulesets, but I haven't been able to find enough information on them to determine how they are specified and what one can specify in them. Is there some decent documentation on your rulesets?
Thanks for the suggestions
--- Jim
- Jim B
-
- Posts: 534
- Joined: Tue Jun 05, 2007 11:37 am
- Location: SourceGear
- Contact:
Ruleset info.
That's fine. I didn't know if you were being affected by the multi-line
intra-line analysis or not. If so, you might grab a copy of Vault 4.1
or Fortress 1.1 and get the version of DiffMerge shipped with it. It lets
you do "Lines and Characters" and independently control the multi-line
stuff. (This will be in the next stand-alone DiffMerge release, either
3.2 or 3.1.1, whenever we decide that will be.) But if it's not affecting
you, don't worry about it.
I'll log a feature request for it. To summarize: allow a range of columns
to be used when line matching, but then use the whole line when doing
intra-line (character/word) matching.
There's a whole chapter on Rulesets in the manual. In a nutshell, we
automatically select them by the file suffix (or you can manually
switch after loading a set of files); they control the character encoding
used when importing the files (we do everything internally in Unicode);
they allow you to ignore/respect whitespace and EOL characters when
matching lines; they allow you to exclude from the analysis lines that
match one or more Regular Expressions (such as page headers);
and finally they allow you to declare "contexts" for document content
(such as string literals or comments) and with that specify which contexts
are "important" and which are "unimportant".
You can then use the "Hide Unimportant" menu option to selectively
see/hide changes in the unimportant contexts. (such as hiding the
changes in whitespace within a comment)
The context machinery doesn't affect line matching, just the highlighting.
Based upon what you've said so far, I'm not sure that you'll need to
create a custom ruleset.
But check the manual, there's a full chapter on them.
Hope this helps,
jeff
-
- Posts: 8
- Joined: Wed Apr 09, 2008 9:35 am
- Location: CA, USA
Re: Feature request - key column specification
Just some additional info to help clarify my previous post about the timestamps being in different columns. The output is a collection of about 40 different reports which are written out to a single file, one after the other. The timestamp will always be in the same columns within a particular report, but not from one report to the next. The timestamp is a measurement of time elapsed from the beginning of a run (time == 0), and the format used for a timestamp is
ddd hh:mm:ss.sss
The timestamp will normally be either the 2nd or 3rd column of values, which can put it somewhere between the 8th and 30th character columns. The other values in the first 3 to 4 columns are ID values which identify the item being reported. Those ID values won't change, so they become a constant match. When there are differences in the timestamp caused by the different versions of the software, it is always a small fraction of a second, normally no more than .002 seconds.
We define the 40 different reports in a postprocessor definition file, and those report definitions apply to the output generated by all 400+ of the test cases we run. If no data is created for a report by a particular test case, then that report is omitted from the output file. In summary, we compare about 10,000 reports stored in 400 output files with their prior versions about once every other month. A single report will contain as few as 1 line of output and as many as 5000 lines. Each report repeats its page header every 60 or so lines.
We have tried many different comparison tools and so far, yours is the best one we have found for our purposes. You have a great product.
- Jim B
-
- Posts: 534
- Joined: Tue Jun 05, 2007 11:37 am
- Location: SourceGear
- Contact:
Re: Feature request - key column specification
Thanks for the kind words.
As for the timestamp / column issue, would it be helpful to have another
post-processing step that (strictly for the sake of comparing with DiffMerge)
runs each file through a script such as the following? Then you could
individually diff each post-processed pair and effectively ignore the
clock-skew effects. I know it would still be good to have the feature
in DiffMerge, but this may make things easier for you in the short term.
Code:
#!/bin/sh
## Convert timestamps on lines from STDIN into token "TIMESTAMP" and output to STDOUT.
/bin/sed 's/[0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9][0-9][0-9]/TIMESTAMP/'
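You'd run each report file through the script and then diff the two outputs, e.g. "sh strip_ts.sh < old.rpt > old.flat" (the script and file names here are just placeholders).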
Just a thought,
jeff
-
- Posts: 8
- Joined: Wed Apr 09, 2008 9:35 am
- Location: CA, USA
Re: Feature request - key column specification
I understand that removing the timestamp value and replacing it with the string "TIMESTAMP" would make it compare the rest of the line, but the issue is that a line with a timestamp within about .01 seconds is OK to match, while a larger time difference should be flagged as a difference.
We are simulating fast-moving objects, and a small time difference will make a measurable difference in an object's position, angle, etc., throwing the rest of the data columns off. The timestamp is the critical component in whether or not the lines can be matched. The ideal would be to treat the
ddd hh:mm:ss.ss
portion of the timestamp as required for matching the lines, and the last digit (the millisecond) as reportable as a difference. This would only break down about 1/10 of the time, with seconds values like 14.999 vs. 15.000 where the millisecond difference propagates into the higher digits.
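A pre-pass that rounds, rather than truncates, the seconds field would handle the 14.999 vs. 15.000 case, though it only moves the boundary problem rather than eliminating it. A rough Python sketch of what I mean:

Code:

#!/usr/bin/env python
# Rough sketch: round the seconds field of each timestamp to 10 ms
# before comparing. 14.999 and 15.000 both become 15.00, but a
# boundary case still exists (e.g. 14.994 vs 14.996); it is just
# hit more rarely than with plain truncation.
import re
import sys

TS_RE = re.compile(r'(\d{3} \d{2}:\d{2}:)(\d{2}\.\d{3})')

def round_seconds(m):
    # %05.2f keeps the zero-padded ss.ss width (e.g. 09.99).
    return '%s%05.2f' % (m.group(1), round(float(m.group(2)), 2))

for line in sys.stdin:
    sys.stdout.write(TS_RE.sub(round_seconds, line))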
During the version validation runs we look only at the differences between outputs from the 2 versions, since values that remain the same have not changed. Most of the output reports are close enough that DiffMerge does not get out of sync, but occasionally the timestamp shifts cause large blocks of report lines to be reported as differences, because the comparison shifts into trying to match records that are at substantially different times, making the difference report useless for the remainder of that output report. That is the reason we need a better re-sync capability. Out of the 10,000 reports we review, about 50 get out of sync.