Raw Test Data
The raw test data for Internet Explorer can be found here: Internet Explorer 6 & 7 test data, Internet Explorer 8 Beta 1 test data
Test data was gathered for versions 6, 7, and 8 Beta 1, on multiple platforms. Although the overall test data from Internet Explorer 8 Beta 1 differs from that of versions 6 and 7, the sections from the actual HTML tables were found to be identical across all three versions.
Results
Legend: n = newline; t = tab; s = space
The following patterns were collected from between the cells (containing actual data) on either side of each test. For example, if the test looks at an empty cell at row 2 cell 1, the pattern covering it includes everything between the contents of row 1 cell 5 and row 2 cell 2.
# | Test | Group 1 Patterns | Group 2 Patterns | Group 3 Patterns | Group 4 Patterns |
---|---|---|---|---|---|
1 | Between two normal cells | s | s | ns | ns |
2 | Between two normal rows | sn | sn | nsn | nsn |
3 | Empty cell at start of row | sns | snss | nsnns | nsnsns |
4 | Empty cell at end of row | ssn | sssn | nsnsn | nssnsn |
5 | Two empty cells at start of row | snss | snssss | nsnnsns | nsnsnssns |
6 | Two empty cells at end of row | sssn | sssssn | nsnsnsn | nssnssnsn |
7 | Empty cell in middle of row | ss | sss | nsns | nssns |
8 | Three empty cells in middle of row | ssss | sssssss | nsnsnsns | nssnssnssns |
9 | Empty cell at end of row followed by empty cell at start of next row | ssns | sssnss | nsnsnns | nssnsnsns |
10 | Three empty cells at end of row followed by three empty cells at start of next row | ssssnsss | sssssssnssssss | nsnsnsnsnnsnsns | nssnssnssnsnsnssnssns |
12 | Entire row empty (Five cells) | snsssssn | snssssssssssn | nsnnsnsnsnsnsn | nsnsnssnssnssnssnsn |
13 | Empty cells at beginning of first row (One cell) | (n)s | (n)ss | (n)ns | (n)sns |
14 | Empty cells at beginning of first row (Two cells) | (n)ss | (n)ssss | (n)nsns | (n)snssns |
15 | Empty cells at beginning of first row (Four cells) | (n)ssss | (n)ssssssss | (n)nsnsnsns | (n)snssnssnssns |
16 | Empty cells at end of last row (One cell) | ssn(n) | sssn(n) | nsnsn(n) | nssnsn(n) |
17 | Empty cells at end of last row (Two cells) | sssn(n) | sssssn(n) | nsnsnsn(n) | nssnssnsn(n) |
18 | Empty cells at end of last row (Four cells) | sssssn | sssssssssn | nsnsnsnsnsn | nssnssnssnssnsn |
Note: the parts of patterns shown in brackets were found in Internet Explorer 6 and 7, but not in Internet Explorer 8 Beta 1.
- <thead>, <tbody> and <tfoot> seem to make no difference - good news!
- No difference between <th> and <td> seen - good news!
- A <br /> tag in the middle of the data splits it with a new line
Analysis
It is strange that in Internet Explorer 6 and 7, tests 16 and 17 have an extra n on the end of the patterns, but test 18 doesn't. I did a little additional test in Internet Explorer 7 and removed everything between the tables for tests 16 to 18 in Group 2. I discovered that all of the bracketed n's (those that don't appear in IE8 Beta 1) for tests 13 to 18 were no longer present, so they must have been placed there by the <h4> and <h5> tags alongside the tables. For the purposes of this study you can therefore completely ignore their presence!
The basic building blocks used by Internet Explorer are an s between cells and sn between rows. Using &nbsp; in empty cells (required for cells to display correctly in certain browsers, such as this one!) inserts a single s as cell data.
If you compare the patterns above to those from the Opera browser (use the patterns in the first two tests to break down the patterns), you'll find that with the exception of the final three tests for Group 1 and Group 2, the patterns are actually all the same. The only difference with the last three tests in Group 1 and Group 2 is that Internet Explorer's patterns each have an extra sn on the end (aka row divider).
The effect the <p> tag has on cells is to add a single n after the contents of the cell, whether the cell is completely empty or not.
Spacing out the HTML has the effect of adding a single s after existing cell contents for cells that are not completely empty. It even does this for cells using the <p> tag if enough spacing is added!
Conclusion
The only way to draw a conclusion as to how well these test cases are handled is to see how easy it is to build a parsing algorithm which can convert all of these patterns into a simple format from which the data can then be easily extracted. Let's make a few rules first though:
- Our algorithm will not be told how many columns or rows it has been provided with (you couldn't expect a user to have to provide this info).
- Any number of cells in the table could be using the <p> tag, and any number of "empty" cells in the table could be using &nbsp;.
- Any part of the HTML of the table could be spaced out and therefore introduce additional spaces into the patterns.
- One single algorithm must cover everything.
Processing of all Group 1 patterns is possible with the following steps:
- sn => [newrow]
- s => [newcell]
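As a rough illustration of the two steps above, here is a minimal Python sketch. It works on the legend letters (s and n) rather than on real clipboard text, and the function name tokenize_group1 is made up for this example. It does not handle the bracketed (n) parts from tests 13 to 17.

```python
def tokenize_group1(pattern: str) -> list[str]:
    """Turn a Group 1 pattern (legend letters) into divider tokens."""
    tokens = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("sn", i):
            # Step 1: a space followed by a newline is a row divider.
            tokens.append("[newrow]")
            i += 2
        elif pattern[i] == "s":
            # Step 2: any remaining space is a cell divider.
            tokens.append("[newcell]")
            i += 1
        else:
            # Anything else (e.g. a leading n) is outside Group 1.
            raise ValueError(f"not a Group 1 pattern: {pattern!r}")
    return tokens


print(tokenize_group1("sns"))       # test 3
print(tokenize_group1("snsssssn"))  # test 12
```

Checking for sn before s matters: if the single s were matched first, the n of every row divider would be left stranded.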
However, it is impossible to process any of the patterns from the other groups, or even Group 1 patterns if extra spaces are inserted due to spacing in the HTML.
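One way to see the problem is to look for patterns that appear in more than one group but against different tests. The sketch below does this for tests 3 to 9 of Groups 1 and 2, using the values from the results table; the helper name find_collisions is just for illustration. Since the algorithm is not told which group a table belongs to, any pattern it finds here cannot be decoded reliably.

```python
# Patterns for tests 3 to 9, copied from the results table.
GROUP1 = {3: "sns", 4: "ssn", 5: "snss", 6: "sssn",
          7: "ss", 8: "ssss", 9: "ssns"}
GROUP2 = {3: "snss", 4: "sssn", 5: "snssss", 6: "sssssn",
          7: "sss", 8: "sssssss", 9: "sssnss"}


def find_collisions(group_a: dict[int, str], group_b: dict[int, str]):
    """List (pattern, test in group_a, test in group_b) where the same
    pattern was recorded for two different tests."""
    return [
        (pat_a, test_a, test_b)
        for test_a, pat_a in group_a.items()
        for test_b, pat_b in group_b.items()
        if pat_a == pat_b and test_a != test_b
    ]


for pattern, test_1, test_2 in find_collisions(GROUP1, GROUP2):
    print(f"{pattern}: Group 1 test {test_1} / Group 2 test {test_2}")
```

For example, snss is two empty cells at the start of a row in Group 1 but a single empty cell at the start of a row in Group 2.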
If Microsoft swapped the basic building blocks used in Groups 1 and 2 to those used by Opera and Firefox (n and t), it would be completely possible to process all Group 1 and Group 2 patterns. Group 3 and Group 4 patterns would still be a problem though as I'm having problems with those in Opera and Firefox too.
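To show why n and t would be easier to work with, here is a minimal sketch that assumes a newline between rows and a tab between cells. That reading of "n and t" is my assumption for the example, as is the function name parse_tab_newline. Stripping spaces from each cell is also an assumption; it is there so that a lone space left by &nbsp; or by spaced-out HTML still reads as an empty cell.

```python
def parse_tab_newline(text: str) -> list[list[str]]:
    """Split text into rows on newlines and into cells on tabs."""
    rows = text.split("\n")
    # Assumption: stray spaces around cell data are not significant.
    return [[cell.strip(" ") for cell in row.split("\t")] for row in rows]


table = parse_tab_newline("a\tb\n\t \tc")
print(table)
```

Because the dividers are no longer the same character as the padding, the row and cell counts fall out of the split without the algorithm being told them in advance.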
Ultimately, not a good result Microsoft! The majority of patterns are impossible to process! Although to be fair, if the basic building blocks were changed as I mentioned above, you'd be on par with Opera, which has given me the best results in this study!