Parsing an HTML table with Python: HTMLParser or lxml

I have an HTML page that consists of a table, and I want to extract all the values from the td and tr elements of that table.
I have tried BeautifulSoup, but now I would like to do this with lxml or with Python's HTMLParser.

I have attached an example.
I want to get the values as lists of tuples, such as:

[
[( value of 2050 jan, value of main subject-part1-sub part1-subject1 ), ( value of 2050 feb, value of main subject-part1-sub part1-subject1 ),... ],
[( value of 2050 jan, value of main subject-part1-sub part1-subject2 ), ( value of 2050 feb, value of main subject-part1-sub part1-subject2 )... ]
]

and so on.

Can anyone tell me how to process this in the most efficient way using lxml or Python's HTMLParser?

Example: test.html

<HTML>
<HEAD>
<TITLE>Title</TITLE>
</HEAD>
<BODY>
<TABLE BORDER>
<TR ALIGN=LEFT>
<TH COLSPAN=38>Main Subject</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>

<TH VALIGN=TOP COLSPAN=18>part1</TH>
<TH VALIGN=TOP COLSPAN=18>part2</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=9>sub-part1</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part2</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part3</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part4</TH>
</TR>

<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=1>subject1</TH>
<TH VALIGN=TOP COLSPAN=1>subject2</TH>

<TH VALIGN=TOP COLSPAN=1>subject10</TH>
<TH VALIGN=TOP COLSPAN=1>subject11</TH>
<TH VALIGN=TOP COLSPAN=1>subject12</TH>
<TH VALIGN=TOP COLSPAN=1>subject13</TH>
<TH VALIGN=TOP COLSPAN=1>subject14</TH>
<TH VALIGN=TOP COLSPAN=1>subject15</TH>
<TH VALIGN=TOP COLSPAN=1>subject16</TH>

<TH VALIGN=TOP COLSPAN=1>subject17</TH>
<TH VALIGN=TOP COLSPAN=1>subject18</TH>
<TH VALIGN=TOP COLSPAN=1>subject19</TH>
<TH VALIGN=TOP COLSPAN=1>subject20</TH>
<TH VALIGN=TOP COLSPAN=1>subject21</TH>
<TH VALIGN=TOP COLSPAN=1>subject22</TH>
<TH VALIGN=TOP COLSPAN=1>subject23</TH>
<TH VALIGN=TOP COLSPAN=1>subject24</TH>
<TH VALIGN=TOP COLSPAN=1>subject25</TH>

<TH VALIGN=TOP COLSPAN=1>subject26</TH>
<TH VALIGN=TOP COLSPAN=1>subject27</TH>
<TH VALIGN=TOP COLSPAN=1>subject28</TH>
<TH VALIGN=TOP COLSPAN=1>subject29</TH>
<TH VALIGN=TOP COLSPAN=1>subject30</TH>
<TH VALIGN=TOP COLSPAN=1>subject31</TH>
<TH VALIGN=TOP COLSPAN=1>subject32</TH>
<TH VALIGN=TOP COLSPAN=1>subject33</TH>
<TH VALIGN=TOP COLSPAN=1>subject34</TH>

<TH VALIGN=TOP COLSPAN=1>subject35</TH>
<TH VALIGN=TOP COLSPAN=1>subject36</TH>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT VALIGN=TOP ROWSPAN=12>2050</TH>
<TH ALIGN=LEFT>January</TH>
<TD>0</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>

<TD>4</TD>
<TD>16</TD>
<TD>0</TD>
<TD>6</TD>
<TD>2</TD>
<TD>2</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>

<TD>3</TD>
<TD>2</TD>
<TD>0</TD>
<TD>26</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>

<TD>5</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>2</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>

<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>February</TH>
<TD>1</TD>
<TD>0</TD>

<TD>8</TD>
<TD>0</TD>
<TD>2</TD>
<TD>4</TD>
<TD>1</TD>
<TD>6</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>

<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>25</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>

<TD>2</TD>
<TD>0</TD>
<TD>4</TD>
<TD>14</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>

<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>March</TH>

<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>4</TD>
<TD>7</TD>
<TD>0</TD>
<TD>9</TD>
<TD>2</TD>

<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>9</TD>
<TD>0</TD>
<TD>45</TD>
<TD>1</TD>

<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>10</TD>
<TD>16</TD>
<TD>0</TD>
<TD>5</TD>
<TD>1</TD>

<TD>1</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>

</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>April</TH>
<TD>1</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>3</TD>
<TD>12</TD>
<TD>1</TD>

<TD>11</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>2</TD>

<TD>34</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>6</TD>
<TD>18</TD>
<TD>1</TD>

<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>

<TD>5</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>May</TH>
<TD>7</TD>
<TD>0</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>

<TD>4</TD>
<TD>1</TD>
<TD>13</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>

<TD>7</TD>
<TD>1</TD>
<TD>30</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>5</TD>

<TD>12</TD>
<TD>0</TD>
<TD>4</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>

<TD>0</TD>
<TD>0</TD>
<TD>6</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>June</TH>
<TD>0</TD>
<TD>1</TD>
<TD>14</TD>

<TD>0</TD>
<TD>7</TD>
<TD>15</TD>
<TD>0</TD>
<TD>17</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>

<TD>0</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>
<TD>24</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>5</TD>

<TD>0</TD>
<TD>6</TD>
<TD>13</TD>
<TD>1</TD>
<TD>9</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>

<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>July</TH>
<TD>0</TD>

<TD>1</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>17</TD>
<TD>1</TD>
<TD>15</TD>
<TD>2</TD>
<TD>1</TD>

<TD>0</TD>
<TD>10</TD>
<TD>0</TD>
<TD>2</TD>
<TD>15</TD>
<TD>2</TD>
<TD>53</TD>
<TD>0</TD>
<TD>3</TD>

<TD>3</TD>
<TD>6</TD>
<TD>0</TD>
<TD>7</TD>
<TD>16</TD>
<TD>0</TD>
<TD>9</TD>
<TD>1</TD>
<TD>1</TD>

<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
</TR>

<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>August</TH>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>8</TD>
<TD>15</TD>
<TD>1</TD>

<TD>17</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>16</TD>
<TD>0</TD>

<TD>33</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>11</TD>
<TD>0</TD>
<TD>2</TD>
<TD>25</TD>
<TD>4</TD>

<TD>8</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>

<TD>3</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>September</TH>
<TD>2</TD>
<TD>0</TD>
<TD>10</TD>
<TD>0</TD>
<TD>16</TD>

<TD>22</TD>
<TD>2</TD>
<TD>19</TD>
<TD>4</TD>
<TD>2</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>

<TD>8</TD>
<TD>0</TD>
<TD>27</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>11</TD>

<TD>31</TD>
<TD>1</TD>
<TD>9</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>

<TD>0</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>October</TH>
<TD>3</TD>
<TD>1</TD>
<TD>8</TD>

<TD>0</TD>
<TD>4</TD>
<TD>28</TD>
<TD>0</TD>
<TD>15</TD>
<TD>2</TD>
<TD>1</TD>
<TD>0</TD>
<TD>1</TD>

<TD>0</TD>
<TD>1</TD>
<TD>6</TD>
<TD>0</TD>
<TD>15</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>3</TD>

<TD>0</TD>
<TD>9</TD>
<TD>26</TD>
<TD>1</TD>
<TD>8</TD>
<TD>4</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>

<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>November</TH>
<TD>0</TD>

<TD>3</TD>
<TD>3</TD>
<TD>0</TD>
<TD>6</TD>
<TD>23</TD>
<TD>1</TD>
<TD>8</TD>
<TD>1</TD>
<TD>2</TD>

<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>3</TD>
<TD>7</TD>
<TD>1</TD>
<TD>20</TD>
<TD>0</TD>
<TD>0</TD>

<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>3</TD>
<TD>18</TD>
<TD>3</TD>
<TD>7</TD>
<TD>0</TD>
<TD>0</TD>

<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
</TR>

<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>December</TH>
<TD>1</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>4</TD>
<TD>13</TD>
<TD>2</TD>

<TD>15</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>

<TD>29</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>3</TD>
<TD>20</TD>
<TD>1</TD>

<TD>13</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>

<TD>3</TD>
<TD>0</TD>
</TR>
</TABLE>
</BODY>
</HTML>
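
For reference, here is a minimal sketch of the kind of result I am after, using only the standard library's `html.parser`. The names `TableRows`, `columns_by_subject` and the shortened inline `sample` are purely illustrative and not part of test.html. It assumes that the last TH in each data row is the month label and that every TD holds an integer. It is probably not the "optimal" way, and an lxml version would look different.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect every <tr> as a list of (tag, text) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being built
        self._cell = None  # [tag, text chunks] of the cell being built

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TR>/<TD> arrive as tr/td.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append((self._cell[0], "".join(self._cell[1]).strip()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def columns_by_subject(html):
    """Return one list per data column, each holding (month, value) tuples."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    columns = {}
    for row in parser.rows:
        values = [int(text) for tag, text in row if tag == "td"]
        if not values:
            continue  # header row: only <th> cells
        # Assumption: the last <th> in a data row is the month label.
        labels = [text for tag, text in row if tag == "th"]
        month = labels[-1] if labels else ""
        for index, value in enumerate(values):
            columns.setdefault(index, []).append((month, value))
    return [columns[index] for index in sorted(columns)]


# Shortened stand-in for test.html (two subjects, two months).
sample = """
<TABLE BORDER>
<TR ALIGN=LEFT><TH COLSPAN=2> </TH><TH>subject1</TH><TH>subject2</TH></TR>
<TR ALIGN=RIGHT><TH ROWSPAN=2>2050</TH><TH ALIGN=LEFT>January</TH><TD>0</TD><TD>1</TD></TR>
<TR ALIGN=RIGHT><TH ALIGN=LEFT>February</TH><TD>1</TD><TD>0</TD></TR>
</TABLE>
"""
result = columns_by_subject(sample)
print(result)
```

With the shortened sample this gives one inner list per subject column, each pairing a month name with that month's value.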
asked by self on 29 March 2012 at 07:28