HTML table clipboard tests

Note: don't miss the analysis and conclusion further down the page!

Raw Test Data

The raw test data for Opera can be found here: Opera test data
Test data was gathered for both version 9.27 and version 9.50 Final on multiple platforms, and was found to be identical on all of them.

Results

Legend: n = newline; t = tab; s = space

The following patterns were collected from between the cells containing actual data on either side of each test. For example, if the test looks at an empty cell at row 2 cell 1, the pattern covering it includes everything between the contents of row 1 cell 5 and the contents of row 2 cell 2.

#	Test	Group 1 Patterns	Group 2 Patterns	Group 3 Patterns	Group 4 Patterns
1	Between two normal cells	t	t	tn	tn
2	Between two normal rows	n	n	nn	nn
3	Empty cell at start of row	nt	nst	nntn	nnstn
4	Empty cell at end of row	tn	tsn	tnnn	tnsnn
5	Two empty cells at start of row	ntt	nstst	nntntn	nnstnstn
6	Two empty cells at end of row	ttn	tstsn	tntnnn	tnstnsnn
7	Empty cell in middle of row	tt	tst	tntn	tnstn
8	Three empty cells in middle of row	tttt	tststst	tntntntn	tnstnstnstn
9	Empty cell at end of row followed by empty cell at start of next row	tnt	tsnst	tnnntn	tnsnnstn
10	Three empty cells at end of row followed by three empty cells at start of next row	tttnttt	tststsnststst	tntntnnntntntn	tnstnstnsnnstnstnstn
12	Entire row empty (Five cells)	nttttn	nststststsn	nntntntntnnn	nnstnstnstnstnsnn
13	Empty cells at beginning of first row (One cell)	t	st	ntn	nstn
14	Empty cells at beginning of first row (Two cells)	tt	stst	ntntn	nstnstn
15	Empty cells at beginning of first row (Four cells)	tttt	stststst	ntntntntn	nstnstnstnstn
16	Empty cells at end of last row (One cell)	t	ts	tnnn	tnsnn
17	Empty cells at end of last row (Two cells)	tt	tsts	tntnnn	tnstnsnn
18	Empty cells at end of last row (Four cells)	tttt	tstststs	tntntntnnn	tnstnstnstnsnn

<thead>, <tbody> and <tfoot> seem to make no difference - good news!
No difference between <th> and <td> seen - good news!
A <br /> tag in the middle of the data splits it with a new line

Analysis

The basic building blocks used by Opera are a t between cells and an n between rows. Using &nbsp; in empty cells (required for cells to display correctly in certain browsers) inserts a single s as cell data. Patterns from Group 1 and Group 2 tests are an exact match to those taken from the Mozilla Firefox browser.

When analysing the patterns from Group 3 and Group 4 tests (which add <p> tags around the contents of cells), if you pretend those basic building blocks have changed from t and n to tn and nn respectively, you'll see that the patterns are actually identical to those from Group 1 and Group 2, with two very simple exceptions:

If the cell is the first in the table, an n is inserted before it.
If the cell is the last in the table, nn is inserted after it.
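To make that comparison concrete, here is a minimal Python sketch of the "pretend the building blocks changed" idea. The function name collapse_p_blocks and its two flags are made up for this illustration, and it works on the legend notation (n, t, s) from the results table, not on real clipboard text.

```python
# Hypothetical helper: maps a Group 3/4 pattern back onto the matching
# Group 1/2 pattern by treating tn as t and nn as n.
def collapse_p_blocks(pattern, first_in_table=False, last_in_table=False):
    # Exception 1: an extra n comes before the first cell in the table.
    if first_in_table and pattern.startswith("n"):
        pattern = pattern[1:]
    # Exception 2: an extra nn comes after the last cell in the table.
    if last_in_table and pattern.endswith("nn"):
        pattern = pattern[:-2]
    # Replace tn before nn, otherwise the n of a tn block could be
    # swallowed by a neighbouring nn. Upper case marks finished blocks.
    return pattern.replace("tn", "T").replace("nn", "N").lower()

print(collapse_p_blocks("tnsnnstn"))                  # test 9, Group 4
print(collapse_p_blocks("ntn", first_in_table=True))  # test 13, Group 3
```

The first call gives tsnst, which is the Group 2 pattern for test 9, and the second gives t, which is the Group 1 pattern for test 13.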

We should still ask why the building blocks appear to have changed. In fact they haven't changed at all: we're just seeing the effect of the <p> tag, which in this browser has a neat and consistent effect on all patterns. The <p> tag simply places a single n before the contents of each cell that uses it (whether the cell is completely empty or not). That explains the first of the two "exceptions" above. As for the nn on the end, I believe one n is due to other page content, and the other seems to be placed after all tables.

Spacing out the HTML has the effect of adding a single s after existing cell contents for cells that are not completely empty. It even does this for cells using the <p> tag if enough spacing is added!

Breakdown of an example pattern

Let's break down an example pattern just to reinforce understanding. I'll use the pattern from test 12, Group 4 (nnstnstnstnstnsnn).

The first n is simply the row divider between the previous row and this one. You then have the pattern nst repeated four times: the n is placed in front of the cell contents by the <p> tag, the s is the cell contents, and the t is the cell divider. The last bit of the pattern starts with ns, which is again the cell contents s with an n forced in front. Of the nn on the end, the first n is the row separator and the second is the n forced in front of the first cell on the next row. That's it, pretty simple really!
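The same breakdown can be reproduced with a small Python sketch. This is only my model of the behaviour described above, not anything taken from Opera itself: the function name serialize and the placeholder D (a cell holding real data) are assumptions for illustration, and the model ignores the extra nn that appears after the end of a table.

```python
# Hypothetical model, in legend notation: t divides cells, n divides rows,
# and every cell wrapped in <p> gets a single n in front of its contents.
def serialize(rows, p_tags=False):
    prefix = "n" if p_tags else ""
    lines = ["t".join(prefix + cell for cell in row) for row in rows]
    return "n".join(lines)

# A data row, a row of five &nbsp; cells (s), then another data row.
table = [["D"] * 5, ["s"] * 5, ["D"] * 5]
text = serialize(table, p_tags=True)

# The test 12, Group 4 pattern should sit between two data cells.
print(("D" + "nnstnstnstnstnsnn" + "D") in text)
```

Calling serialize(table) without p_tags puts nststststsn between the data cells instead, which is the Group 2 pattern for the same test.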

Conclusion

The only way to draw a conclusion as to how well these test cases are handled is to see how easy it is to build a parsing algorithm that converts all of these patterns into a simple format from which the data can then be easily extracted. Let's make a few rules first though:

Our algorithm will not be told how many columns or rows it has been provided with (you couldn't expect a user to have to provide this info).
Any number of cells in the table could be using the <p> tag, and any number of "empty" cells in the table could be using &nbsp;.
Any part of the HTML of the table could be spaced out and therefore introduce additional spaces into the patterns.
One single algorithm must cover everything.

All Group 1 and Group 2 patterns can be processed with exactly the same simple algorithm I came up with for Mozilla Firefox (the patterns are exactly the same):
Note: * means zero or more of the previous character (or group if in round brackets); + means one or more; ? means zero or one.

n => [newrow]
t => [newcell]
[newcell]s* => [newcell]
s*[newcell] => [newcell]
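
As a rough illustration, the four rules above can be sketched in Python. The function name apply_rules is made up for this sketch, and it works on the legend notation (n, t, s) used in the results table rather than on real clipboard text.

```python
import re

# Hypothetical helper, not part of any library: applies the four rules
# to a pattern written in the legend notation (n, t, s).
def apply_rules(pattern: str) -> str:
    # Rules 1 and 2: n => [newrow], t => [newcell]. Done in one pass so
    # letters inside the inserted tokens are never matched again.
    tokens = {"n": "[newrow]", "t": "[newcell]"}
    out = re.sub(r"[nt]", lambda m: tokens[m.group(0)], pattern)
    # Rule 3: [newcell]s* => [newcell]
    out = re.sub(r"\[newcell\]s*", "[newcell]", out)
    # Rule 4: s*[newcell] => [newcell]
    out = re.sub(r"s*\[newcell\]", "[newcell]", out)
    return out

# Test 9 in Group 1 and in Group 2.
print(apply_rules("tnt"))
print(apply_rules("tsnst"))
```

Both calls print [newcell][newrow][newcell], so the &nbsp; spaces in the Group 2 pattern drop out as intended.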

We need to expand this to cover patterns from Group 3 and Group 4 too, which I failed to do for Mozilla Firefox. Since these patterns correlate directly with the Group 1 and Group 2 patterns, you could simply use a variant of the above process in which the first two steps become tn => [newcell] and then nn => [newrow]. The only issue would be the additional [newrow] you would end up with at the end of the table, but that's not too problematic.

The real problem though is merging the two sets of processes, because tn also turns up in Group 1 and Group 2 patterns, where it needs to be converted to [newcell][newrow], not [newcell]. One of the rules I laid out above was that the algorithm must be able to cope with tables using a mixture of patterns from all groups. How are we supposed to know what a tn pattern, for example, really represents?
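
The clash is easy to show with a short Python sketch. The names PLAIN, WITH_P and read are invented for this example. It reads the same legend-notation pattern once with the Group 1/2 building blocks and once with the Group 3/4 building blocks.

```python
import re

# Hypothetical block tables for the two readings of a pattern.
PLAIN = {"t": "[newcell]", "n": "[newrow]"}     # Group 1 and Group 2
WITH_P = {"tn": "[newcell]", "nn": "[newrow]"}  # Group 3 and Group 4

def read(pattern, blocks):
    # Try longer blocks first, then swap each match for its token.
    alternatives = "|".join(sorted(blocks, key=len, reverse=True))
    return re.sub(alternatives, lambda m: blocks[m.group(0)], pattern)

print(read("tn", PLAIN))   # as in test 4, Group 1
print(read("tn", WITH_P))  # as in test 1, Group 3
```

The first reading gives [newcell][newrow] and the second gives [newcell], and nothing in the two characters themselves says which reading is the right one.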

I conclude that although it is possible to process all patterns, it is not possible to create one single algorithm that covers all of them.
