Raw Test Data
The raw test data for Konqueror can be found here: Konqueror test data
Test data was gathered for version 3.9.5, on Linux.
Results
Legend: n = newline; t = tab; s = space
The following patterns were collected from between the cells (containing actual data) on either side of each test. For example, if the test looks at an empty cell at row 2 cell 1, the pattern covering it includes everything between the contents of row 1 cell 5 and row 2 cell 2.
# | Test | Group 1 Patterns | Group 2 Patterns | Group 3 Patterns | Group 4 Patterns |
---|---|---|---|---|---|
1 | Between two normal cells | ss | ss | nss | nss |
2 | Between two normal rows | nss | nss | nss | nss |
3 | Empty cell at start of row | nss | nsssss | nss | nsssnss |
4 | Empty cell at end of row | nss | sssnss | nss | nsssnss |
5 | Two empty cells at start of row | nss | nssssssss | nss | nsssnsssnss |
6 | Two empty cells at end of row | nss | ssssssnss | nss | nsssnsssnss |
7 | Empty cell in middle of row | ss | sssss | nss | nsssnss |
8 | Three empty cells in middle of row | ss | sssssssssss | nss | nsssnsssnsssnss |
9 | Empty cell at end of row followed by empty cell at start of next row | nss | sssnsssss | nss | nsssnsssnss |
10 | Three empty cells at end of row followed by three empty cells at start of next row | nss | sssssssssnsssssssssss | nss | nsssnsssnsssnsssnsssnsssnss |
12 | Entire row empty (Five cells) | nss | nsssssssssssssssnss | nss | nsssnsssnsssnsssnsssnss |
13 | Empty cells at beginning of first row (One cell) | sss | nss | nsnss | |
14 | Empty cells at beginning of first row (Two cells) | ssssss | nss | nsnsssnss | |
15 | Empty cells at beginning of first row (Four cells) | ssssssssssss | nss | nsnsssnsssnsssnss | |
16 | Empty cells at end of last row (One cell) | n | sssn | n | nsssn |
17 | Empty cells at end of last row (Two cells) | n | ssssssn | n | nsssnsssn |
18 | Empty cells at end of last row (Four cells) | n | ssssssssssssn | n | nsssnsssnsssnsssn |
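To make the shorthand in the table easier to inspect, here is a minimal sketch that expands a pattern into the whitespace it stands for, and back again. It relies only on the legend above; the function names `decode_pattern` and `encode_whitespace` are made up for this illustration and are not part of the original test setup.

```python
# Mapping taken from the legend: n = newline, t = tab, s = space.
LEGEND = {"n": "\n", "t": "\t", "s": " "}


def decode_pattern(pattern: str) -> str:
    """Expand legend shorthand such as 'nss' into real whitespace."""
    return "".join(LEGEND[ch] for ch in pattern)


def encode_whitespace(text: str) -> str:
    """Inverse of decode_pattern: turn whitespace back into shorthand."""
    reverse = {value: key for key, value in LEGEND.items()}
    return "".join(reverse[ch] for ch in text)


if __name__ == "__main__":
    # Test 7, Group 4: an empty cell in the middle of a row.
    print(repr(decode_pattern("nsssnss")))  # '\n   \n  '
```

Both functions raise a `KeyError` on any character outside the legend, which is deliberate here: it flags text that is not pure newline, tab or space.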
- <thead>, <tbody> and <tfoot> seem to make no difference - good news!
- There is an apparent difference between <th> and <td> tags! Cells are separated with an n rather than nss. I am unsure whether there are any other differences.
- A <br /> tag in the middle of the data splits it with a new line.
Analysis
There may be some logic to the construction of the patterns in Groups 2 and 4, but honestly they are a complete mess, so I'm not going to spend time trying to analyse them.
Conclusion
The only way to draw a conclusion as to how well these test cases are handled is to see how easy it is to build a parsing algorithm that converts all of these patterns into a simple format from which the data can then be easily extracted. Let's set a few rules first, though:
- Our algorithm will not be told how many columns or rows it has been provided with (you couldn't expect a user to have to provide this info).
- Any number of cells in the table could be using the <p> tag, and any number of "empty" cells in the table could be using &nbsp;.
- Any part of the HTML of the table could be spaced out and therefore introduce additional spaces into the patterns.
- One single algorithm must cover everything.
Wow, it's simply impossible to do anything with these patterns. Terrible job, Konqueror!
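One way to see the problem is to check how many different tests produce the same pattern. The sketch below does this for the Group 1 column, using the values from tests 1 to 12 in the results table. It is only an illustration, not a parsing algorithm, and the names `GROUP_1` and `ambiguous_patterns` are invented for it.

```python
from collections import defaultdict

# Group 1 patterns for tests 1-12, copied from the results table
# (there is no test 11 in the table).
GROUP_1 = {
    1: "ss", 2: "nss", 3: "nss", 4: "nss", 5: "nss", 6: "nss",
    7: "ss", 8: "ss", 9: "nss", 10: "nss", 12: "nss",
}


def ambiguous_patterns(patterns: dict[int, str]) -> dict[str, list[int]]:
    """Return each pattern shared by more than one test, with those tests."""
    by_pattern: dict[str, list[int]] = defaultdict(list)
    for test_number, pattern in sorted(patterns.items()):
        by_pattern[pattern].append(test_number)
    return {p: tests for p, tests in by_pattern.items() if len(tests) > 1}


if __name__ == "__main__":
    for pattern, tests in ambiguous_patterns(GROUP_1).items():
        print(f"{pattern!r} is produced by tests {tests}")
```

In Group 1, "ss" covers tests 1, 7 and 8, so the same pattern stands for zero, one or three empty cells between two data cells. "nss" covers every other test in that range. A parser that only sees the whitespace has no way to tell these cases apart.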