Web browser HTML table clipboard tests

Results and analysis of Opera

Note: don't miss the analysis and conclusion further down the page!

Raw Test Data

The raw test data for Opera can be found here: Opera test data
Test data was gathered for both version 9.27 and version 9.50 Final on multiple platforms, and was found to be identical on all of them.

Results

Legend: n = newline; t = tab; s = space

The following patterns were collected from between the cells containing actual data on either side of each test. For example, if the test looks at an empty cell at row 2, cell 1, the pattern covering it includes everything between the contents of row 1, cell 5 and row 2, cell 2.

| # | Test | Group 1 Patterns | Group 2 Patterns | Group 3 Patterns | Group 4 Patterns |
|---|---|---|---|---|---|
| 1 | Between two normal cells | t | t | tn | tn |
| 2 | Between two normal rows | n | n | nn | nn |
| 3 | Empty cell at start of row | nt | nst | nntn | nnstn |
| 4 | Empty cell at end of row | tn | tsn | tnnn | tnsnn |
| 5 | Two empty cells at start of row | ntt | nstst | nntntn | nnstnstn |
| 6 | Two empty cells at end of row | ttn | tstsn | tntnnn | tnstnsnn |
| 7 | Empty cell in middle of row | tt | tst | tntn | tnstn |
| 8 | Three empty cells in middle of row | tttt | tststst | tntntntn | tnstnstnstn |
| 9 | Empty cell at end of row followed by empty cell at start of next row | tnt | tsnst | tnnntn | tnsnnstn |
| 10 | Three empty cells at end of row followed by three empty cells at start of next row | tttnttt | tststsnststst | tntntnnntntntn | tnstnstnsnnstnstnstn |
| 12 | Entire row empty (Five cells) | nttttn | nststststsn | nntntntntnnn | nnstnstnstnstnsnn |
| 13 | Empty cells at beginning of first row (One cell) | t | st | ntn | nstn |
| 14 | Empty cells at beginning of first row (Two cells) | tt | stst | ntntn | nstnstn |
| 15 | Empty cells at beginning of first row (Four cells) | tttt | stststst | ntntntntn | nstnstnstnstn |
| 16 | Empty cells at end of last row (One cell) | t | ts | tnnn | tnsnn |
| 17 | Empty cells at end of last row (Two cells) | tt | tsts | tntnnn | tnstnsnn |
| 18 | Empty cells at end of last row (Four cells) | tttt | tstststs | tntntntnnn | tnstnstnstnsnn |

  • <thead>, <tbody> and <tfoot> seem to make no difference - good news!
  • No difference between <th> and <td> seen - good news!
  • A <br /> tag in the middle of the data splits it with a new line

Analysis

The basic building blocks used by Opera are a t between cells and an n between rows. Using &nbsp; in empty cells (required for cells to display correctly in certain browsers) inserts a single s as cell data. Patterns from Group 1 and Group 2 tests are an exact match to those taken from the Mozilla Firefox browser.

When analysing the patterns from Group 3 and Group 4 tests (which add <p> tags around the contents of cells), if you pretend those basic building blocks have changed from t and n to tn and nn respectively, you'll see that the patterns are actually identical to those from Group 1 and Group 2, with two very simple exceptions:

  • If the cell is the first in the table, an n is inserted before it.
  • If the cell is the last in the table, nn is inserted after it.
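
As a quick sanity check of that claim, here is a minimal Python sketch. The names SUBSTITUTE, expand and SAMPLES are mine, purely for illustration. It swaps t for tn and n for nn in a handful of Group 1 and Group 2 patterns copied from the results table, and compares the result with the Group 3 and Group 4 columns, adding the extra n or nn only for the two exceptions:

```python
SUBSTITUTE = {"t": "tn", "n": "nn"}

def expand(pattern: str) -> str:
    """Swap each t for tn and each n for nn; s (cell data) is left alone."""
    return "".join(SUBSTITUTE.get(ch, ch) for ch in pattern)

# (test number, Group 1, Group 2, Group 3, Group 4), copied from the results table
SAMPLES = [
    (3, "nt", "nst", "nntn", "nnstn"),
    (9, "tnt", "tsnst", "tnnntn", "tnsnnstn"),
    (12, "nttttn", "nststststsn", "nntntntntnnn", "nnstnstnstnstnsnn"),
    (13, "t", "st", "ntn", "nstn"),    # first cell in the table
    (16, "t", "ts", "tnnn", "tnsnn"),  # last cell in the table
]

for number, g1, g2, g3, g4 in SAMPLES:
    before = "n" if number == 13 else ""  # exception 1: n before the first cell
    after = "nn" if number == 16 else ""  # exception 2: nn after the last cell
    assert before + expand(g1) + after == g3
    assert before + expand(g2) + after == g4
    print(number, "matches")
```

Tests 3, 9 and 12 match with the plain substitution; tests 13 and 16 only match once the leading n or trailing nn is added.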

We should ask why the building blocks have changed, though. In actual fact they haven't changed at all: we're just seeing the effect of the <p> tag, which for this browser is neat and consistent across all patterns. The <p> tag simply places a single n before the contents of each cell that uses it (whether the cell is completely empty or not). That explains the first of the two "exceptions" above. As for the nn on the end, I believe one n is due to other page content, and the other seems to be placed after all tables.

Spacing out the HTML has the effect of adding a single s after existing cell contents for cells that are not completely empty. It even does this for cells using the <p> tag if enough spacing is added!

Breakdown of an example pattern

Let's break down an example pattern just to reinforce understanding. I'll use the pattern from Row 12, Group 4 (nnstnstnstnstnsnn).

The first n is simply the row divider between the previous row and this one. You then have a repeating pattern of nst: the n is placed in front of the cell contents by the <p> tag, the s is the cell contents, and the t is the cell divider. The last cell contributes ns, which is again the cell contents s with an n forced in front, this time with no cell divider after it. The nn on the end is first the row separator, and then the n forced in front of the first cell on the next line. That's it, pretty simple really!
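
The breakdown above can be reproduced with a small model of the behaviour described in the Analysis. This is only a sketch under my own assumptions: render and pattern_between are hypothetical helpers, &nbsp; is modelled as a single ordinary space, either every cell uses <p> or none does, and the nn seen after the end of the table is not modelled.

```python
LEGEND = {"\n": "n", "\t": "t", " ": "s"}

def render(rows, use_p):
    """Model of the copied text: t between cells, n between rows,
    plus one n in front of every cell when the cells are wrapped in <p>."""
    prefix = "\n" if use_p else ""
    return "\n".join("\t".join(prefix + cell for cell in row) for row in rows)

def pattern_between(text, left, right):
    """Return the legend pattern found between two pieces of cell data."""
    start = text.index(left) + len(left)
    end = text.index(right, start)
    return "".join(LEGEND[ch] for ch in text[start:end])

# Row 2 is entirely "empty": five cells holding only &nbsp; (a single space).
rows = [
    ["A1", "A2", "A3", "A4", "A5"],
    [" ", " ", " ", " ", " "],
    ["C1", "C2", "C3", "C4", "C5"],
]

print(pattern_between(render(rows, use_p=True), "A5", "C1"))  # nnstnstnstnstnsnn
```

Calling render with use_p=False on the same rows gives nststststsn, the Group 2 pattern for the same test.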

Conclusion

The only way to draw a conclusion as to how well these test cases are handled is to see how easy it is to build a parsing algorithm that can convert all of these patterns into a simple format from which the data can then be easily extracted. Let's make a few rules first though:

  1. Our algorithm will not be told how many columns or rows it has been provided with (you couldn't expect a user to have to provide this info).
  2. Any number of cells in the table could be using the <p> tag, and any number of "empty" cells in the table could be using &nbsp;.
  3. Any part of the HTML of the table could be spaced out and therefore introduce additional spaces into the patterns.
  4. One single algorithm must cover everything.

All Group 1 and Group 2 patterns can be processed with exactly the same simple algorithm I came up with for Mozilla Firefox (the patterns are exactly the same).

Note: * means zero or more of the previous character (or group, if in round brackets); + means one or more; ? means zero or one.

  1. n => [newrow]
  2. t => [newcell]
  3. [newcell]s* => [newcell]
  4. s*[newcell] => [newcell]
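
The four steps above could be sketched in Python roughly as follows. The function names normalise and to_rows are my own, to_rows is an extra convenience step that is not part of the list above, and s is assumed to be an ordinary space character.

```python
import re

def normalise(text):
    """Apply the four rules, in order, to text copied from a table."""
    text = text.replace("\n", "[newrow]")                # rule 1
    text = text.replace("\t", "[newcell]")               # rule 2
    text = re.sub(r"\[newcell\] *", "[newcell]", text)   # rule 3
    text = re.sub(r" *\[newcell\]", "[newcell]", text)   # rule 4
    return text

def to_rows(text):
    """Split the normalised text into rows of cell strings."""
    return [row.split("[newcell]") for row in normalise(text).split("[newrow]")]

# Two rows of three cells; the middle cell of row 1 and the first cell
# of row 2 are "empty" cells containing only &nbsp; (a single space).
copied = "a\t \tc\n \te\tf"
print(to_rows(copied))  # [['a', '', 'c'], ['', 'e', 'f']]
```

Because the spaces are stripped only next to a [newcell] marker, spaces inside real cell data are left alone.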

We need to expand this to cover patterns from Group 3 and Group 4 too, though, which I failed to do for Mozilla Firefox. Since these patterns correlate directly with the Group 1 and Group 2 patterns, you could simply use a variant of the above process in which the first two steps become tn => [newcell] followed by nn => [newrow]. The only issue would be the additional [newrow] you would end up with at the end of the table, but that's not too problematic.
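
A rough Python sketch of that variant, for a table in which every cell uses <p>, might look like this. The name normalise_p is hypothetical, s is assumed to be an ordinary space, and the sample text leaves out the n in front of the very first cell to keep things short.

```python
import re

def normalise_p(text):
    """Variant for <p>-wrapped cells: tn must be replaced before nn."""
    text = text.replace("\t\n", "[newcell]")             # tn => [newcell]
    text = text.replace("\n\n", "[newrow]")              # nn => [newrow]
    text = re.sub(r"\[newcell\] *", "[newcell]", text)   # strip s after a divider
    text = re.sub(r" *\[newcell\]", "[newcell]", text)   # strip s before a divider
    return text

# Rows ["a", &nbsp;] and ["c", "d"], followed by the nn seen after the table.
copied = "a\t\n \n\nc\t\nd\n\n"
print(normalise_p(copied))  # a[newcell][newrow]c[newcell]d[newrow]
```

The order of the first two replacements matters: in a pattern such as tnnn, replacing nn first would consume the n that belongs to the tn.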

The real problem, though, is merging the two sets of processes, because tn also occurs in Group 1 and Group 2 patterns, where it needs to be converted to [newcell][newrow], not [newcell]. One of the rules I laid out above was that the algorithm must be able to cope with tables using a mixture of patterns from all groups. How are we supposed to know what a tn pattern, for example, really represents?
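
To make the ambiguity concrete, here is a small Python sketch with two differently shaped tables that end up as the same clipboard text. It assumes the <p> behaviour described in the Analysis also holds when only some cells use <p>, which the test data doesn't show directly; read_as_plain and read_as_p are hypothetical helpers standing in for the two sets of rules.

```python
def read_as_plain(text):
    """Group 1/2 reading: n separates rows, t separates cells."""
    return [row.split("\t") for row in text.split("\n")]

def read_as_p(text):
    """Group 3/4 reading: nn separates rows, tn separates cells."""
    return [row.split("\t\n") for row in text.split("\n\n")]

# Table A: no <p> anywhere; row 1 is "a" plus an empty cell, row 2 is "b".
table_a = "a" + "\t" + "" + "\n" + "b"
# Table B: a single row; "a" is plain, "b" is wrapped in <p> (n before it).
table_b = "a" + "\t" + "\n" + "b"

assert table_a == table_b      # the copied text is identical
print(read_as_plain(table_a))  # [['a', ''], ['b']]
print(read_as_p(table_b))      # [['a', 'b']]
```

Both readings are valid for the same input, and nothing in the text itself says which one applies.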

I conclude that although it is possible to process all patterns, it is not possible to create one single algorithm that covers all of them.