Class GridTableParser
TableParser --+
              |
             GridTableParser
Parse a grid table using parse().
Here's an example of a grid table:
+------------------------+------------+----------+----------+
| Header row, column 1   | Header 2   | Header 3 | Header 4 |
+========================+============+==========+==========+
| body row 1, column 1   | column 2   | column 3 | column 4 |
+------------------------+------------+----------+----------+
| body row 2             | Cells may span columns.          |
+------------------------+------------+---------------------+
| body row 3             | Cells may  | - Table cells       |
+------------------------+ span rows. | - contain           |
| body row 4             |            | - body elements.    |
+------------------------+------------+---------------------+
Intersections use '+', row separators use '-' (except for one optional
head/body row separator, which uses '='), and column separators use '|'.
Passing the above table to the parse() method will result in the
following data structure:
([24, 12, 10, 10],
 [[(0, 0, 1, ['Header row, column 1']),
   (0, 0, 1, ['Header 2']),
   (0, 0, 1, ['Header 3']),
   (0, 0, 1, ['Header 4'])]],
 [[(0, 0, 3, ['body row 1, column 1']),
   (0, 0, 3, ['column 2']),
   (0, 0, 3, ['column 3']),
   (0, 0, 3, ['column 4'])],
  [(0, 0, 5, ['body row 2']),
   (0, 2, 5, ['Cells may span columns.']),
   None,
   None],
  [(0, 0, 7, ['body row 3']),
   (1, 0, 7, ['Cells may', 'span rows.', '']),
   (1, 1, 7, ['- Table cells', '- contain', '- body elements.']),
   None],
  [(0, 0, 9, ['body row 4']), None, None, None]])
The first item is a list containing column widths (colspecs). The second
item is a list of head rows, and the third is a list of body rows. Each
row contains a list of cells. Each cell is either None (for a cell unused
because of another cell's span), or a tuple. A cell tuple contains four
items: the number of extra rows used by the cell in a vertical span
(morerows); the number of extra columns used by the cell in a horizontal
span (morecols); the line offset of the first line of the cell contents;
and the cell contents, a list of lines of text.
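As an illustration of the layout just described, the following sketch unpacks the structure shown above and lists every real cell with its full row and column span. The data is copied from this page; the helper name `describe` is hypothetical and is not part of the parser.

```python
# The structure from this page, copied verbatim.
table = (
    [24, 12, 10, 10],
    [[(0, 0, 1, ['Header row, column 1']),
      (0, 0, 1, ['Header 2']),
      (0, 0, 1, ['Header 3']),
      (0, 0, 1, ['Header 4'])]],
    [[(0, 0, 3, ['body row 1, column 1']),
      (0, 0, 3, ['column 2']),
      (0, 0, 3, ['column 3']),
      (0, 0, 3, ['column 4'])],
     [(0, 0, 5, ['body row 2']),
      (0, 2, 5, ['Cells may span columns.']),
      None,
      None],
     [(0, 0, 7, ['body row 3']),
      (1, 0, 7, ['Cells may', 'span rows.', '']),
      (1, 1, 7, ['- Table cells', '- contain', '- body elements.']),
      None],
     [(0, 0, 9, ['body row 4']), None, None, None]],
)

colspecs, head_rows, body_rows = table


def describe(rows):
    """Return (row, col, rowspan, colspan, text) for each real cell.

    Hypothetical helper: None entries (slots covered by a span) are skipped,
    and morerows/morecols are converted to full span counts by adding 1.
    """
    found = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell is None:
                continue
            morerows, morecols, _offset, lines = cell
            text = " ".join(lines).strip()
            found.append((r, c, morerows + 1, morecols + 1, text))
    return found


for entry in describe(body_rows):
    print(entry)
```

Every row has one slot per column, so a spanning cell is always followed by None placeholders in the slots it covers.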
Method Summary
check_parse_complete(self)
    Each text column should have been completely seen.
get_cell_block(self, top, left, bottom, right)
    Given the corners, extract the text of a cell.
mark_done(self, top, left, bottom, right)
    For keeping track of how much of each text column has been seen.
parse_table(self)
    Start with a queue of upper-left corners, containing the upper-left
    corner of the table itself.
scan_cell(self, top, left)
    Starting at the top-left corner, start tracing out a cell.
scan_down(self, top, left, right)
    Look for the bottom-right corner of the cell, making note of all row
    boundaries.
scan_left(self, top, left, bottom, right)
    Noting column boundaries, look for the bottom-left corner of the cell.
scan_right(self, top, left)
    Look for the top-right corner of the cell, and make note of all column
    boundaries ('+').
scan_up(self, top, left, bottom, right)
    Noting row boundaries, see if we can return to the starting point.
setup(self, block)
structure_from_cells(self)
    From the data collected by scan_cell(), convert to the final data
    structure.
Inherited from TableParser
find_head_body_sep(self)
    Look for a head/body row separator line; store the line index.
parse(self, block)
    Analyze the text block and return a table data structure.
Method Details
check_parse_complete(self)
    Each text column should have been completely seen.
get_cell_block(self, top, left, bottom, right)
    Given the corners, extract the text of a cell.
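A minimal sketch of the idea, assuming the table is held as a plain list of equal-length strings and that the corners are (line, column) positions of the '+' characters. The function name `cell_text` and the padding-removal step are assumptions made for illustration; this is not the method's actual implementation.

```python
TABLE = [
    "+------------------------+------------+----------+----------+",
    "| Header row, column 1   | Header 2   | Header 3 | Header 4 |",
    "+========================+============+==========+==========+",
    "| body row 1, column 1   | column 2   | column 3 | column 4 |",
    "+------------------------+------------+----------+----------+",
    "| body row 2             | Cells may span columns.          |",
    "+------------------------+------------+---------------------+",
    "| body row 3             | Cells may  | - Table cells       |",
    "+------------------------+ span rows. | - contain           |",
    "| body row 4             |            | - body elements.    |",
    "+------------------------+------------+---------------------+",
]


def cell_text(lines, top, left, bottom, right):
    """Return the text strictly inside the borders at the given corners.

    Hypothetical helper: (top, left) and (bottom, right) are the positions
    of the cell's upper-left and lower-right '+' characters.
    """
    inner = [line[left + 1:right].rstrip() for line in lines[top + 1:bottom]]
    # Drop the indentation shared by all non-blank lines (the cell padding).
    indents = [len(s) - len(s.lstrip()) for s in inner if s]
    pad = min(indents) if indents else 0
    return [s[pad:] for s in inner]


# The cell that spans two rows in the example table.
print(cell_text(TABLE, 6, 25, 10, 38))  # ['Cells may', 'span rows.', '']
```

The trailing empty string comes from the blank third line of the spanning cell, which matches the cell contents shown in the data structure above.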
mark_done(self, top, left, bottom, right)
    For keeping track of how much of each text column has been seen.
parse_table(self)
    Start with a queue of upper-left corners, containing the upper-left
    corner of the table itself. Trace out one rectangular cell, remember
    it, and add its upper-right and lower-left corners to the queue of
    potential upper-left corners of further cells. Process the queue in
    top-to-bottom order, keeping track of how much of each text column has
    been seen.
    We'll end up knowing all the row and column boundaries, cell positions
    and their dimensions.
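The queue-driven procedure described above can be sketched as follows. This is a simplified standalone version under stated assumptions, not the class's own code: the table is a list of equal-length strings, '=' is accepted as a horizontal border so the head/body separator needs no special handling, and the names `closes`, `trace_cell`, and `find_cells` are hypothetical.

```python
TABLE = [
    "+------------------------+------------+----------+----------+",
    "| Header row, column 1   | Header 2   | Header 3 | Header 4 |",
    "+========================+============+==========+==========+",
    "| body row 1, column 1   | column 2   | column 3 | column 4 |",
    "+------------------------+------------+----------+----------+",
    "| body row 2             | Cells may span columns.          |",
    "+------------------------+------------+---------------------+",
    "| body row 3             | Cells may  | - Table cells       |",
    "+------------------------+ span rows. | - contain           |",
    "| body row 4             |            | - body elements.    |",
    "+------------------------+------------+---------------------+",
]


def closes(lines, top, left, bottom, right):
    """Check the bottom border, then the left border, back to the start."""
    if lines[bottom][left] != "+":
        return False
    if any(ch not in "+-=" for ch in lines[bottom][left + 1:right]):
        return False
    return all(lines[row][left] in "+|" for row in range(top + 1, bottom))


def trace_cell(lines, top, left):
    """Return (bottom, right) of the cell whose upper-left corner is given."""
    for right in range(left + 1, len(lines[top])):
        ch = lines[top][right]
        if ch == "+":
            # Candidate top-right corner: walk down the right border.
            for bottom in range(top + 1, len(lines)):
                below = lines[bottom][right]
                if below == "+":
                    if closes(lines, top, left, bottom, right):
                        return bottom, right
                elif below != "|":
                    break
        elif ch not in "-=":
            return None
    return None


def find_cells(lines):
    """Return the traced cell rectangles and whether every column was covered."""
    bottom_edge, right_edge = len(lines) - 1, len(lines[0]) - 1
    done = [-1] * right_edge  # last text line accounted for, per text column
    corners = [(0, 0)]
    cells = []
    while corners:
        top, left = corners.pop(0)
        if top == bottom_edge or left == right_edge or top <= done[left]:
            continue  # on the outer edge, or already inside a traced cell
        found = trace_cell(lines, top, left)
        if found is None:
            continue
        bottom, right = found
        for col in range(left, right):
            done[col] = bottom - 1
        cells.append((top, left, bottom, right))
        corners.extend([(top, right), (bottom, left)])
        corners.sort()  # keep the queue in top-to-bottom order
    complete = all(seen == bottom_edge - 1 for seen in done)
    return cells, complete


cells, complete = find_cells(TABLE)
print(len(cells), complete)  # 14 True
```

Sorting the queue after each cell keeps corners in top-to-bottom, left-to-right order, so a corner that falls inside an already traced cell is rejected by the `done` check instead of being traced again.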
scan_cell(self, top, left)
    Starting at the top-left corner, start tracing out a cell.
scan_down(self, top, left, right)
    Look for the bottom-right corner of the cell, making note of all row
    boundaries.
scan_left(self, top, left, bottom, right)
    Noting column boundaries, look for the bottom-left corner of the cell.
    It must line up with the starting point.
scan_right(self, top, left)
    Look for the top-right corner of the cell, and make note of all column
    boundaries ('+').
scan_up(self, top, left, bottom, right)
    Noting row boundaries, see if we can return to the starting point.
structure_from_cells(self)
    From the data collected by scan_cell(), convert to the final data
    structure.
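One way to picture this conversion is the sketch below, which turns cell rectangles into the (morerows, morecols, offset) entries of the final structure. It rests on several assumptions: the rectangles for the example table are hard-coded in place of the tracing step, row and column boundaries are derived from the cell corners, the cell text is left out for brevity, and `build_grid` and `head_sep` are hypothetical names.

```python
# (top, left, bottom, right) corners of the 14 cells in the example table,
# hard-coded here in place of the tracing step.
CELLS = [
    (0, 0, 2, 25), (0, 25, 2, 38), (0, 38, 2, 49), (0, 49, 2, 60),
    (2, 0, 4, 25), (2, 25, 4, 38), (2, 38, 4, 49), (2, 49, 4, 60),
    (4, 0, 6, 25), (4, 25, 6, 60),
    (6, 0, 8, 25), (6, 25, 10, 38), (6, 38, 10, 60),
    (8, 0, 10, 25),
]


def build_grid(cells, head_sep=None):
    """Return (colspecs, head_rows, body_rows) without the cell text.

    head_sep is the line index of the head/body separator, if any.
    """
    # Boundary line/column positions, mapped to row and column numbers.
    row_bounds = sorted({y for top, _, bottom, _ in cells for y in (top, bottom)})
    col_bounds = sorted({x for _, left, _, right in cells for x in (left, right)})
    row_of = {y: i for i, y in enumerate(row_bounds)}
    col_of = {x: i for i, x in enumerate(col_bounds)}
    # Column width is the distance between boundaries, minus the border itself.
    colspecs = [b - a - 1 for a, b in zip(col_bounds, col_bounds[1:])]
    # Start with an all-None grid; spanned slots simply stay None.
    rows = [[None] * (len(col_bounds) - 1) for _ in range(len(row_bounds) - 1)]
    for top, left, bottom, right in cells:
        r, c = row_of[top], col_of[left]
        morerows = row_of[bottom] - r - 1
        morecols = col_of[right] - c - 1
        rows[r][c] = (morerows, morecols, top + 1)
    n_head = row_of[head_sep] if head_sep is not None else 0
    return colspecs, rows[:n_head], rows[n_head:]


colspecs, head_rows, body_rows = build_grid(CELLS, head_sep=2)
print(colspecs)       # [24, 12, 10, 10]
print(body_rows[2])   # [(0, 0, 7), (1, 0, 7), (1, 1, 7), None]
```

The line offset stored in each entry is top + 1, the first text line below the cell's top border, which agrees with the offsets 1, 3, 5, 7, and 9 in the structure shown earlier.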
head_body_separator_pat
    Type: SRE_Pattern
    Value: