Web browser HTML table clipboard tests

Results and analysis of Mozilla Firefox

Note, don't miss the analysis & conclusion further down the page!

Raw Test Data

The raw test data for Firefox can be found here: Firefox test data
Test data was gathered for both version 2.0.0.14 and version 3.0 Final on multiple platforms, and was found to be identical on all of them!

Results

Legend: n = newline; t = tab; s = space

Each of the following patterns is everything that was found between the two cells containing actual data on either side of the cells being tested! So for example, if the test is looking at an empty cell at row 2 cell 1, the pattern covering that would include everything between the contents of row 1 cell 5 and row 2 cell 2!

| # | Test | Group 1 Patterns | Group 2 Patterns | Group 3 Patterns | Group 4 Patterns |
|---|------|------------------|------------------|------------------|------------------|
| 1 | Between two normal cells | t | t | ntnn | ntnn |
| 2 | Between two normal rows | n | n | nn | nn |
| 3 | Empty cell at start of row | nt | nst | nntnn | nnsntnn |
| 4 | Empty cell at end of row | tn | tsn | ntnn | ntnnsnn |
| 5 | Two empty cells at start of row | ntt | nstst | nntnntnn | nnsntnnsntnn |
| 6 | Two empty cells at end of row | ttn | tstsn | ntnntnn | ntnnsntnnsnn |
| 7 | Empty cell in middle of row | tt | tst | ntnntnn | ntnnsntnn |
| 8 | Three empty cells in middle of row | tttt | tststst | ntnntnntnntnn | ntnnsntnnsntnnsntnn |
| 9 | Empty cell at end of row followed by empty cell at start of next row | tnt | tsnst | ntnntnn | ntnnsnnsntnn |
| 10 | Three empty cells at end of row followed by three empty cells at start of next row | tttnttt | tststsnststst | ntnntnntnntnntnntnn | ntnnsntnnsntnnsnnsntnnsntnnsntnn |
| 12 | Entire row empty (Five cells) | nttttn | nststststsn | nntnntnntnntnn | nnsntnnsntnnsntnnsntnnsnn |
| 13 | Empty cells at beginning of first row (One cell) | t | st | ntnn | nsntnn |
| 14 | Empty cells at beginning of first row (Two cells) | tt | stst | ntnntnn | nsntnnsntnn |
| 15 | Empty cells at beginning of first row (Four cells) | tttt | stststst | ntnntnntnntnn | nsntnnsntnnsntnnsntnn |
| 16 | Empty cells at end of last row (One cell) | t | ts | ntnn | ntnnsn |
| 17 | Empty cells at end of last row (Two cells) | tt | tsts | ntnntnn | ntnnsntnnsn |
| 18 | Empty cells at end of last row (Four cells) | tttt | tstststs | ntnntnntnntnn | ntnnsntnnsntnnsntnnsn |

  • <thead>, <tbody> and <tfoot> seem to make no difference - good news!
  • No difference between <th> and <td> seen - good news!
  • A <br /> tag in the middle of the data splits it with a new line

Analysis

The basic building blocks used by Firefox are a t between cells and an n between rows. Using &nbsp; in empty cells (required for cells to display correctly in certain browsers) inserts a single s as cell data. Patterns from Group 1 and Group 2 tests are an exact match to those taken from the Opera browser.

Analysing the patterns from Group 3 and Group 4 tests (which add <p> tags around the contents of cells) was a lot more difficult, but adding Group 6 Test 3 helped a lot with determining the influence the tag had on the patterns. For most cells, the tag adds nn before the cell contents and n after. However, there are complications and exceptions:

  • If the cell is completely empty, it is not given an n after its contents.
  • A cell at the start of a row is not given nn before its contents (note: an nn does appear at that position in the above patterns, but it comes from other sources - see the example breakdown below!)
  • For the last cell in a row, if completely empty, all you get for the cell is a single n.
  • For the first cell in the table, all you get is a single n if the cell is empty.

Spacing out the HTML adds a single s after the contents of any cell that is not completely empty and does not use the <p> tag.
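
The rules above can be restated as a small model. This is only a sketch of the behaviour described in this analysis, not Firefox's actual code. The names Cell, serialize, pattern_between and data_row are made up for illustration, and a plain space stands in for &nbsp;. The model leaves out the HTML spacing rule and the special cases for the very first and very last cells of the table. How a single-cell row behaves is a guess.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str = ""    # "" = completely empty; " " stands in for &nbsp;
    p: bool = False   # True if the contents are wrapped in <p>


def serialize(rows):
    """Model the clipboard text for interior cells, per the rules above."""
    lines = []
    for row in rows:
        parts = []
        for i, cell in enumerate(row):
            first, last = i == 0, i == len(row) - 1
            if not cell.p:
                parts.append(cell.text)
            elif cell.text == "":
                # Empty <p> cell: never an n after; a lone n if last in row,
                # nothing if first in row, otherwise just the leading nn.
                parts.append("\n" if last else "" if first else "\n\n")
            else:
                # Non-empty <p> cell: nn before (unless first in row), n after.
                parts.append(("" if first else "\n\n") + cell.text + "\n")
        lines.append("\t".join(parts))   # t between cells
    return "\n".join(lines)              # n between rows


def pattern_between(text, left, right):
    """Return what lies between two data values, written in legend letters."""
    start = text.index(left) + len(left)
    end = text.index(right, start)
    legend = {"\n": "n", "\t": "t", " ": "s"}
    return "".join(legend[ch] for ch in text[start:end])


def data_row(names, p):
    """Build a row of non-empty cells (hypothetical helper)."""
    return [Cell(name, p) for name in names]


# Test 12, Group 4: an entire five-cell row of &nbsp; cells inside <p> tags.
table = [
    data_row(["A1", "A2", "A3", "A4", "A5"], p=True),
    [Cell(" ", p=True) for _ in range(5)],
    data_row(["C1", "C2", "C3", "C4", "C5"], p=True),
]
print(pattern_between(serialize(table), "A5", "C1"))
# nnsntnnsntnnsntnnsntnnsnn
```

The printed pattern is the one recorded for Test 12 in Group 4, so the listed rules are at least consistent with that result.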

Breakdown of an example pattern

The above analysis of Group 3 and Group 4 patterns may be a little tough to follow, so let's take the pattern from Test 12, Group 4 (nnsntnnsntnnsntnnsntnnsnn) as an example and break it down; I'll leave the rest to you. Skip over this section if you think you've already got it!

First of all note that tests from Group 4 use <p> tags around the contents of all cells, so that's going to have an influence, and secondly Group 4 takes after Group 2 in that it uses &nbsp; in empty cells rather than leaving them truly empty like Group 1 and Group 3. So what is this test? Well this test looks at the pattern that occurs when an entire row is empty. Note that the table the patterns were collected from had five columns!

In this pattern there are four cell dividers (t), which divide the pattern into five pieces, one for each of the five cells. The chunks for the middle three cells are all the same - nnsn. As per the rules above, the <p> tag puts nn in front of the cells' contents. It also puts a single n after the contents, but only if the cell is not completely empty. Here the cells contain &nbsp;, so they are not completely empty, hence the sn.

Now, although the first chunk of the pattern is exactly the same as the middle three, its origins are actually different! You see, the whole pattern begins right after the data in the last cell of the previous row, and therefore must include the n placed after the contents of that cell by the <p> tag around it, and also the row divider (n). So the nn at the start has nothing to do with the next cell, only the sn after it does, and so we must conclude that for the first cell in a row, the <p> tag does not insert nn before the data!

Finally, let's look at the last chunk. Here the only difference from the middle three chunks is the extra n on the end. This is simply the row divider, and that's where the pattern ends. There is no more to the pattern because next comes the data from the first cell of the next row! (Remember, the <p> tag does not put nn before the contents of the first cell in a row!).

In this test there was no spacing in the HTML, so no s was introduced after cells' contents, but the <p> tags would have negated them anyway!
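
The chunking used in this breakdown is easy to reproduce. A minimal sketch, assuming the pattern is written in the legend letters rather than as real whitespace characters:

```python
pattern = "nnsntnnsntnnsntnnsntnnsnn"  # Test 12, Group 4

# The four t cell dividers split the pattern into five chunks, one per cell.
chunks = pattern.split("t")
print(chunks)
# ['nnsn', 'nnsn', 'nnsn', 'nnsn', 'nnsnn']
```

The first four chunks are identical even though the leading nn of the first one comes from the previous row, which is exactly why the patterns are hard to read at a glance.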

Conclusion

The only way to draw a conclusion as to how well these test cases are handled is to see how easy it is to build a parsing algorithm which can convert all of these patterns into a simple format from which the data can then be easily extracted. Let's make a few rules first though:

  1. Our algorithm will not be told how many columns or rows it has been provided with (you couldn't expect a user to have to provide this info).
  2. Any number of cells in the table could be using the <p> tag, and any number of "empty" cells in the table could be using &nbsp;.
  3. Any part of the HTML of the table could be spaced out and therefore introduce additional spaces into the patterns.
  4. One single algorithm must cover everything.

Well, the obvious place to start is with the basic building blocks of a t between cells, and an n between rows. If we check out all of the patterns above, the use of the t is consistent throughout, so that's easy enough. The use of the n in Group 1 and Group 2 always acts as a row divider. Spaces are added either by &nbsp; or by HTML spacing, but they can all be ignored and don't hinder processing.

Processing Group 1 and Group 2 tables can be done as follows:
Note: * means zero or more of the previous character (or group, if in round brackets); + means one or more; ? means zero or one.

  1. n => [newrow]
  2. t => [newcell]
  3. [newcell]s+ => [newcell]
  4. s+[newcell] => [newcell]

Note that you can't just remove all spaces, because cell data can contain legitimate spaces which must be preserved! With the above process, spaces that are part of the patterns (including either side of cell contents) will be removed, but not spaces within other cell data!
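
The four rules above translate almost directly into regular expressions. The following is a minimal sketch rather than a finished parser: parse_simple is a made-up name, and it assumes the clipboard text uses a bare \n between rows, which may not hold on every platform. It applies the rules literally, so it only strips spaces that sit next to a cell divider.

```python
import re


def parse_simple(clip):
    """Split Group 1 / Group 2 style clipboard text into rows of cells."""
    clip = re.sub(r"\t +", "\t", clip)   # rule 3: [newcell]s+ => [newcell]
    clip = re.sub(r" +\t", "\t", clip)   # rule 4: s+[newcell] => [newcell]
    # Rules 1 and 2: n is the row divider, t is the cell divider.
    return [row.split("\t") for row in clip.split("\n")]


# Test 10 layout from Group 2: &nbsp; cells show up as single spaces.
sample = "A\tB\t \t \t \n \t \t \tC D\tE"
print(parse_simple(sample))
# [['A', 'B', '', '', ''], ['', '', '', 'C D', 'E']]
```

The space inside "C D" survives because it is not adjacent to a cell divider.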

Well, that bit was easy. However, we haven't covered Group 3 and Group 4 yet, so let's try to expand it for those next! First off, we obviously can't just take n to mean row divider now, since the <p> tag casually throws in some extra ones. The t as a cell divider is still consistent throughout. Really, it all boils down to figuring out whether the n characters that sit between the t's in the above patterns are simply garbage added by the <p> tag, or whether they also include a row divider.

It's not as easy as that though, and in fact I've had to give up! Take the pattern for Test 10 from Group 3 (ntnntnntnntnntnntnn); Group 1 shows us exactly what this needs to be boiled down to: tttnttt. How can we do this though? When I approached this pattern I was going to replace tnnt with [newcell]t, which gets rid of the unnecessary nn in the middle. However, one of those identical tnnt sequences contains the row divider, so the replacement gets rid of that too. How are we supposed to know which one the pattern represents?
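
The dead end can be shown in a few lines. This sketch uses a simplified version of the replacement described above (it drops the nn after every t, and the leading n left by the previous cell's <p> tag), with the patterns written in legend letters:

```python
import re

group3 = "ntnntnntnntnntnntnn"   # Test 10, Group 3
wanted = "tttnttt"               # Test 10, Group 1

# Every t in the Group 3 pattern is followed by the same two characters.
print(re.findall(r"t[^t]*", group3))
# ['tnn', 'tnn', 'tnn', 'tnn', 'tnn', 'tnn']

# Treating each of those nn pairs as <p> garbage leaves only cell dividers.
naive = re.sub(r"tnn", "t", group3).lstrip("n")
print(naive)            # tttttt
print(naive == wanted)  # False
```

All six chunks look the same, yet one of them holds the row divider, so nothing in the text itself says where the n in tttnttt should go.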

The patterns for Group 4 vary a little more than Group 3, and it may actually be possible to come up with an algorithm for it, but since I've already failed at producing one for Group 3, what's the point?!

The only potential way I can imagine of processing Group 3 is if the algorithm can somehow determine how many cells there are per row and use that as a basis for telling whether it's looking at garbage or a row divider, but I have my doubts about it.