Extracting data from an HTML table in PHP

Have you ever faced a situation where you needed to get data out of an HTML table because you could not find a clean API delivering the same data as JSON, XML or CSV? If so, I am happy to present a simple package to parse HTML tables: bakame/html-table.

The package started as a script built to get some data out of a complex, old, archived HTML page. The more I worked on it, the more I was able to refine the code, until I could extract a small package with a nice DX (developer experience).

How does parsing work?

Simply put, the package consists of one main class, Bakame\Html\Table\Parser, that you configure in order to parse a specific table. Under the hood, the package relies on PHP's DOM and libxml extensions to correctly parse a single table contained in the submitted HTML text. It uses XPath to traverse and detect each section of the table, and the DOM to iterate over each table row and its cells.
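To give a feel for the DOM and XPath combination described above, here is a minimal sketch using only PHP's built-in extensions. It is not the package's actual code: the extractRows function is a hypothetical helper written for illustration, and it ignores sections, headers and spanning cells.

```php
<?php

/**
 * Hypothetical helper, not part of bakame/html-table: returns the text of
 * every cell of the first <table> found in $html, one array per row.
 *
 * @return list<list<string>>
 */
function extractRows(string $html): array
{
    $document = new DOMDocument();
    // Old pages are rarely valid markup: collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $rows = [];
    // XPath locates the rows of the first table, whatever section they live in.
    foreach ($xpath->query('(//table)[1]//tr') as $tr) {
        $cells = [];
        // The DOM is then used to walk over the cells of each row.
        foreach ($tr->childNodes as $cell) {
            if ($cell instanceof DOMElement && in_array($cell->nodeName, ['th', 'td'], true)) {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }

    return $rows;
}

$html = <<<HTML
<table>
  <thead><tr><th>Year</th><th>Units Sold</th></tr></thead>
  <tbody><tr><td>2020</td><td>3200</td></tr></tbody>
</table>
HTML;

$rows = extractRows($html);
```

The sketch returns plain lists of cell values; turning the first row into record keys is left out here.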

Assuming you have the following HTML table:

<table>
  <caption>Yearly Sales</caption>
  <thead>
    <tr>
      <th>Year</th>
      <th>Product</th>
      <th>Units Sold</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2020</td>
      <td>Widgets</td>
      <td>3200</td>
    </tr>
    <tr>
      <td>2021</td>
      <td>Widgets</td>
      <td>4800</td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <td colspan="2">Total Sales</td>
      <td>8000</td>
    </tr>
  </tfoot>
</table>

The parser converts it into an iterable structure, as shown below.

<?php

use Bakame\Html\Table\Parser;

$html = <<<TABLE
the table content goes here.
TABLE;

$table = Parser::new()->parseHtml($html);
// the parsed table is iterable: dump its records as an array
var_dump(iterator_to_array($table, false));
// outputs something like this
// [
//   [
//     "Year" => "2020",
//     "Product" => "Widgets",
//     "Units Sold" => "3200",
//   ],
//   [
//     "Year" => "2021",
//     "Product" => "Widgets",
//     "Units Sold" => "4800",
//   ],
//   [
//     "Year" => "Total Sales",
//     "Product" => "Total Sales",
//     "Units Sold" => "8000",
//   ],
// ]

As you may have noticed, the package auto-detected the table header and used its content as keys for each record. Cells with colspan and/or rowspan attributes are handled: in both cases the content is duplicated so that each record contains the exact same number of cells. The caption tag, however, is skipped, and there is currently no way to retrieve its value.
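The colspan duplication described above can be sketched with plain arrays. This is only an illustration of the idea, assuming each cell has already been reduced to its text and its colspan value; expandColspan is a hypothetical helper, not the package's implementation, and rowspan is left out.

```php
<?php

/**
 * Hypothetical helper: a cell spanning N columns is repeated N times,
 * so that the row ends up with one value per column.
 *
 * @param list<array{text: string, colspan: int}> $cells
 * @return list<string>
 */
function expandColspan(array $cells): array
{
    $row = [];
    foreach ($cells as $cell) {
        // Repeat the cell content once per spanned column (at least once).
        array_push($row, ...array_fill(0, max(1, $cell['colspan']), $cell['text']));
    }

    return $row;
}

$header = ['Year', 'Product', 'Units Sold'];
$footer = expandColspan([
    ['text' => 'Total Sales', 'colspan' => 2],
    ['text' => '8000', 'colspan' => 1],
]);

// The footer row now has as many cells as the header, so both can be combined.
$record = array_combine($header, $footer);
```

With the footer of the sample table, "Total Sales" ends up under both the Year and Product keys, which matches the last record of the output shown earlier.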

The parser can be configured to handle all sorts of situations regarding how the table was constructed.

use Bakame\Html\Table\Parser;
use Bakame\Html\Table\Section;

$formatter = fn (array $record): array => array_map(
    strtoupper(...),
    $record
);

$parser = Parser::new()
    ->excludeTableFooter()
    ->withFormatter($formatter)
    ->tableHeaderPosition(Section::tbody, 2)
    ->tablePosition('statistics-sells');

$table = $parser->parseFile('https://www.example.com/page-with-table.html');

In the example above:

the table tfoot section will be ignored during parsing;
the table rows will be formatted during parsing to convert all their values to uppercase;
the table header is declared to be located in the table tbody section at offset 2, i.e. as the 3rd row; if found, the header is skipped while parsing the table content to avoid duplicate data;
in the submitted content, only the table with the id statistics-sells will be parsed.
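To make the formatter's effect concrete, here is what the same callable does when applied by hand to a single record. The record is taken from the sample table; how and when the parser invokes the formatter internally is not shown here.

```php
<?php

// Same shape as the formatter above: it receives one record and returns one record.
// The strtoupper(...) first-class callable syntax requires PHP 8.1+.
$formatter = fn (array $record): array => array_map(strtoupper(...), $record);

$record = ['Year' => '2021', 'Product' => 'Widgets', 'Units Sold' => '4800'];

// array_map preserves the keys when it is given a single array,
// so only the values are converted to uppercase.
$formatted = $formatter($record);
```

Only the values change: the keys, which come from the table header, are left untouched by this particular formatter.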

Once configured, you can either submit an HTML document to the parseHtml method, or a path/URL to the HTML page containing the table via parseFile. Both methods leverage PHP 8+ union types to accept more input types. Of note, the parseFile method uses file_get_contents internally to retrieve the page content.
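Since file_get_contents returns false on failure, and only reads URLs when the allow_url_fopen setting is enabled, it can be worth guarding the read when you fetch a page yourself. The sketch below assumes a hypothetical readPage helper; it does not describe how parseFile handles errors.

```php
<?php

/**
 * Hypothetical helper: reads a page from a path or URL and throws
 * when the content cannot be retrieved.
 */
function readPage(string $pathOrUrl): string
{
    // file_get_contents returns false (and emits a warning) on failure.
    $content = @file_get_contents($pathOrUrl);
    if (false === $content) {
        throw new RuntimeException("Unable to read `$pathOrUrl`.");
    }

    return $content;
}

// A temporary local file stands in for the remote page.
$path = tempnam(sys_get_temp_dir(), 'table');
file_put_contents($path, '<table><tr><td>2020</td></tr></table>');

$html = readPage($path);
unlink($path);
```

The resulting string could then be handed to parseHtml instead of giving the path to parseFile.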

Manipulating the parsed data

Manipulating the parsed table data is just as important. Since I did not want to re-invent the wheel, instead of returning a simple Iterator, both parsing methods return a League\Csv\TabularDataReader. While the package's main focus is parsing the table, it uses league/csv under the hood to ease further manipulation of the result. As a quick reminder, it means you can filter the result and convert it into a CSV file:

use Bakame\Html\Table\Parser;
use League\Csv\Writer;

// assumes $html holds a table with "points" and "for" columns
$table = Parser::new()->parseHtml($html);
$result = $table
    ->filter(fn (array $row) => $row['points'] >= 10)
    ->sorted(fn (array $rowA, array $rowB) => $rowB['for'] <=> $rowA['for']);

$csv = Writer::createFromString();
$csv->insertOne($result->getHeader());
$csv->insertAll($result); // write the filtered and sorted records, not the full table
echo $csv->toString(), PHP_EOL;

You can also convert it into JSON, or even back into an HTML table:

use Bakame\Html\Table\Parser;
use League\Csv\HTMLConverter;

$table = Parser::new()->parseHtml($html);
$result = $table
    ->filter(fn (array $row) => $row['points'] >= 10)
    ->sorted(fn (array $rowA, array $rowB) => $rowB['for'] <=> $rowA['for']);

echo json_encode($result, JSON_PRETTY_PRINT); // the returned object is json serializable
echo HTMLConverter::create()->convert($result, $result->getHeader()), PHP_EOL;
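For readers unfamiliar with league/csv, the filter and sorted calls above behave roughly like array_filter and usort on plain arrays. The sketch below swaps in those native functions and uses made-up records with the same "points" and "for" columns; it is only an approximation of the semantics, not what the library does internally.

```php
<?php

// Made-up records for illustration; values are strings, as they would be
// after being read from HTML cells.
$records = [
    ['team' => 'A', 'points' => '12', 'for' => '30'],
    ['team' => 'B', 'points' => '8', 'for' => '41'],
    ['team' => 'C', 'points' => '15', 'for' => '35'],
];

// Keep the records with at least 10 points; array_values reindexes the result.
$filtered = array_values(
    array_filter($records, fn (array $row): bool => $row['points'] >= 10)
);

// Sort on the "for" column, highest value first.
usort($filtered, fn (array $rowA, array $rowB): int => $rowB['for'] <=> $rowA['for']);

echo json_encode($filtered, JSON_PRETTY_PRINT), PHP_EOL;
```

Because the values are numeric strings, PHP compares them as numbers in both callbacks.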

Conclusion

bakame/html-table is a fun little package to handle HTML tables in PHP. It tries hard not to re-invent the wheel while offering a nice DX.

Last but not least

bakame/html-table is an open source project released under the MIT License, so contributions are more than welcome and will be fully credited. These contributions can be anything from reporting an issue, requesting or adding missing features, or simply fixing typos on the documentation website.