{"id":2986,"date":"2023-09-26T05:21:15","date_gmt":"2023-09-26T03:21:15","guid":{"rendered":"https:\/\/nyamsprod.com\/blog\/?p=2986"},"modified":"2023-09-26T05:21:15","modified_gmt":"2023-09-26T03:21:15","slug":"extracting-data-from-html-table-in-php","status":"publish","type":"post","link":"https:\/\/nyamsprod.com\/blog\/extracting-data-from-html-table-in-php\/","title":{"rendered":"Extracting data from HTML table in PHP"},"content":{"rendered":"\n<p>Have you ever faced a situation where you wanted to get some data out of an HTML table because you could not find a clean and clear API delivering the same data as JSON, XML or CSV? If so, I am happy to present a simple package to parse HTML tables called <code>bakame\/html-table<\/code>.<\/p>\n\n\n\n<p>The package started as a script built to get some data out of a complex, old, archived HTML page. But the more I worked on it, the more I was able to refine the code and extract a small package with a nice DX.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How does parsing work?<\/h2>\n\n\n\n<p>Simply put, the package consists of one main class, <code>Bakame\\Html\\Table\\Parser<\/code>, that you need to configure in order to parse a specific table. Under the hood, the package relies on the PHP DOM and libxml extensions to correctly parse a single table contained in the submitted HTML text.
It relies on XPath to traverse and detect each section of the table and uses the DOM to iterate over each table row and its cells.<\/p>\n\n\n\n<p>Assume you have the following HTML table:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: xml; title: ; notranslate\" title=\"\">\n&lt;table&gt;\n  &lt;caption&gt;Yearly Sales&lt;\/caption&gt;\n  &lt;thead&gt;\n    &lt;tr&gt;\n      &lt;th&gt;Year&lt;\/th&gt;\n      &lt;th&gt;Product&lt;\/th&gt;\n      &lt;th&gt;Units Sold&lt;\/th&gt;\n    &lt;\/tr&gt;\n  &lt;\/thead&gt;\n  &lt;tbody&gt;\n    &lt;tr&gt;\n      &lt;td&gt;2020&lt;\/td&gt;\n      &lt;td&gt;Widgets&lt;\/td&gt;\n      &lt;td&gt;3200&lt;\/td&gt;\n    &lt;\/tr&gt;\n    &lt;tr&gt;\n      &lt;td&gt;2021&lt;\/td&gt;\n      &lt;td&gt;Widgets&lt;\/td&gt;\n      &lt;td&gt;4800&lt;\/td&gt;\n    &lt;\/tr&gt;\n  &lt;\/tbody&gt;\n  &lt;tfoot&gt;\n    &lt;tr&gt;\n      &lt;td colspan=&quot;2&quot;&gt;Total Sales&lt;\/td&gt;\n      &lt;td&gt;8000&lt;\/td&gt;\n    &lt;\/tr&gt;\n  &lt;\/tfoot&gt;\n&lt;\/table&gt;\n<\/pre><\/div>\n\n\n<p>The parser will convert it into an iterable structure as shown below.
<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: php; title: ; notranslate\" title=\"\">\n&lt;?php\n\nuse Bakame\\Html\\Table\\Parser;\n\n$html = &lt;&lt;&lt;TABLE\nthe table content goes here.\nTABLE;\n\n$table = Parser::new()-&gt;parseHtml($html);\nvar_dump($table);\n\/\/ iterating over $table yields records like these\n\/\/ &#x5B;\n\/\/   &#x5B;\n\/\/     &quot;Year&quot; =&gt; &quot;2020&quot;,\n\/\/     &quot;Product&quot; =&gt; &quot;Widgets&quot;,\n\/\/     &quot;Units Sold&quot; =&gt; &quot;3200&quot;,\n\/\/   ],\n\/\/   &#x5B;\n\/\/     &quot;Year&quot; =&gt; &quot;2021&quot;,\n\/\/     &quot;Product&quot; =&gt; &quot;Widgets&quot;,\n\/\/     &quot;Units Sold&quot; =&gt; &quot;4800&quot;,\n\/\/   ],\n\/\/   &#x5B;\n\/\/     &quot;Year&quot; =&gt; &quot;Total Sales&quot;,\n\/\/     &quot;Product&quot; =&gt; &quot;Total Sales&quot;,\n\/\/     &quot;Units Sold&quot; =&gt; &quot;8000&quot;,\n\/\/   ],\n\/\/ ]\n<\/pre><\/div>\n\n\n<p>As you may have noticed, the package auto-detected the table header and used its content as keys for each record. Cells with the <code>colspan<\/code> and\/or <code>rowspan<\/code> attributes are handled: in both cases the content is duplicated so that each record contains the exact same number of cells.
The <code>caption<\/code> tag, however, is skipped and there&#8217;s currently no way to retrieve its value.<\/p>\n\n\n\n<p>The parser can be configured to handle all sorts of situations regarding how the table was constructed.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: php; title: ; notranslate\" title=\"\">\nuse Bakame\\Html\\Table\\Parser;\nuse Bakame\\Html\\Table\\Section;\n\n$formatter = fn (array $record): array =&gt; array_map(\n    strtoupper(...),\n    $record\n);\n\n$parser = Parser::new()\n    -&gt;excludeTableFooter()\n    -&gt;withFormatter($formatter)\n    -&gt;tableHeaderPosition(Section::tbody, 2)\n    -&gt;tablePosition(&#039;statistics-sells&#039;);\n\n$table = $parser-&gt;parseFile(&#039;https:\/\/www.example.com\/page-with-table.html&#039;);\n<\/pre><\/div>\n\n\n<p>In the example above:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>the table <code>tfoot<\/code> section will be ignored during parsing,<\/li>\n\n\n\n<li>the table rows will be formatted during parsing to convert all their values to uppercase,<\/li>\n\n\n\n<li>the table header is expected in the table <code>tbody<\/code> section as its 3rd row (offset <code>2<\/code>, counted from zero); if found, that row is skipped while parsing the table content to avoid duplicate data,<\/li>\n\n\n\n<li>in the submitted content, only the table with the id <code>statistics-sells<\/code> will be parsed.<\/li>\n<\/ul>\n\n\n\n<p>Once configured, you can either submit an HTML document to the <code>parseHtml<\/code> method or a path\/URL to the HTML page containing the table to convert via <code>parseFile<\/code>. Both methods leverage PHP 8+ union types to accept several input types.
Of note, the <code>parseFile<\/code> method uses <code>file_get_contents<\/code> internally to retrieve the page content.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Manipulating the parsed data<\/h2>\n\n\n\n<p>As anyone would guess, manipulating the parsed table data is also important. But since I did not want to re-invent the wheel, instead of just returning a simple <code>Iterator<\/code>, both parsing methods return a <code>League\\Csv\\TabularDataReader<\/code>. While the package&#8217;s main focus is parsing the table, it uses <code>league\/csv<\/code> under the hood to ease further manipulation of the result. As a quick reminder, this means you can filter and sort the result, then convert it into a CSV file:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: php; title: ; notranslate\" title=\"\">\nuse Bakame\\Html\\Table\\Parser;\nuse League\\Csv\\Writer;\n\n\/\/ assumes the parsed table has &#039;points&#039; and &#039;for&#039; columns\n$table = Parser::new()-&gt;parseHtml($html);\n$result = $table\n    -&gt;filter(fn (array $row) =&gt; $row&#x5B;&#039;points&#039;] &gt;= 10)\n    -&gt;sorted(fn (array $rowA, array $rowB) =&gt; $rowB&#x5B;&#039;for&#039;] &lt;=&gt; $rowA&#x5B;&#039;for&#039;]);\n\n\/\/ write the filtered and sorted records, not the original table\n$csv = Writer::createFromString();\n$csv-&gt;insertOne($result-&gt;getHeader());\n$csv-&gt;insertAll($result);\necho $csv-&gt;toString(), PHP_EOL;\n<\/pre><\/div>\n\n\n<p>But you can also convert it into JSON or even convert it back into an HTML table:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: php; title: ; notranslate\" title=\"\">\nuse Bakame\\Html\\Table\\Parser;\nuse League\\Csv\\HTMLConverter;\n\n$table = Parser::new()-&gt;parseHtml($html);\n$result = $table\n    -&gt;filter(fn (array $row) =&gt; $row&#x5B;&#039;points&#039;] &gt;= 10)\n    -&gt;sorted(fn (array $rowA, array $rowB) =&gt; $rowB&#x5B;&#039;for&#039;] &lt;=&gt; $rowA&#x5B;&#039;for&#039;]);\n\necho json_encode($result, JSON_PRETTY_PRINT); \/\/ the returned object is json serializable\necho HTMLConverter::create()-&gt;convert($result, 
$result-&gt;getHeader()), PHP_EOL;\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p><a href=\"https:\/\/github.com\/bakame-php\/html-table\" target=\"_blank\" rel=\"noopener\" title=\"\">bakame\/html-table<\/a> is a fun little package to handle HTML tables in PHP. It tries hard not to re-invent the wheel while presenting a nice DX.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Last but not least<\/h2>\n\n\n\n<p><a href=\"https:\/\/github.com\/bakame-php\/html-table\" target=\"_blank\" rel=\"noopener\" title=\"\">bakame\/html-table<\/a> is an open source project released under the MIT License, so contributions are more than welcome and will be fully credited. These contributions can be anything from reporting an issue, requesting or adding missing features, or simply improving the documentation website or correcting a typo on it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to parse and manipulate HTML tables in PHP<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[588],"tags":[761,848,253,306,807,412,509,785,581,849],"class_list":["post-2986","post","type-post","status-publish","format-standard","hentry","category-humeurs","tag-csv","tag-domdocuemt","tag-html","tag-json","tag-parser","tag-php","tag-table","tag-thephpleague","tag-xml","tag-xpath"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/nyamsprod.com\/blog\/wp-json\/wp\/v2\/posts\/2986","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nyamsprod.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nyamsprod.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nyamsprod.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"repl
ies":[{"embeddable":true,"href":"https:\/\/nyamsprod.com\/blog\/wp-json\/wp\/v2\/comments?post=2986"}],"version-history":[{"count":5,"href":"https:\/\/nyamsprod.com\/blog\/wp-json\/wp\/v2\/posts\/2986\/revisions"}],"predecessor-version":[{"id":2997,"href":"https:\/\/nyamsprod.com\/blog\/wp-json\/wp\/v2\/posts\/2986\/revisions\/2997"}],"wp:attachment":[{"href":"https:\/\/nyamsprod.com\/blog\/wp-json\/wp\/v2\/media?parent=2986"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nyamsprod.com\/blog\/wp-json\/wp\/v2\/categories?post=2986"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nyamsprod.com\/blog\/wp-json\/wp\/v2\/tags?post=2986"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}