jq, but for HTML
Guilian 79d6a8f77d
feat: attribute selection
2025-01-20 17:05:29 +01:00
.gitignore chore: gitignore 2025-01-19 14:30:13 +01:00
README.md chore: readme update 2025-01-19 16:42:22 +01:00
css.lua feat: attribute selection 2025-01-20 17:05:29 +01:00
html.lua feat: attribute selection 2025-01-20 17:05:29 +01:00
logging.lua fix: proper nil printing in logger 2025-01-19 14:19:49 +01:00
main.lua feat: get element inner text (--text option) 2025-01-20 17:05:04 +01:00

Htmlq

Overview

In short: jq, but for HTML

This project is a from-scratch implementation of an HTML parser and a CSS selector parser, written entirely in Lua with no external dependencies. It takes HTML and a CSS selector as input and uses the selector to query the Document Object Model (DOM).

Features

There's really only one feature: it takes in HTML and a CSS selector, and returns whatever is matched by that selector in the DOM.

Supported simple selectors:

  • tag name - h1
  • class - .class
  • id - #id

Compound selectors are supported as well (e.g. p.text-center.bold matches every p that has both the text-center and bold classes).
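To make the compound selector idea concrete, here is a minimal sketch of how one could be split into its parts and checked against an element. The element shape ({ tag, id, classes }) and the function names parse_compound and matches are assumptions made for this illustration; they are not taken from css.lua or html.lua.

```lua
-- Hypothetical element shape, for illustration only:
--   { tag = "p", id = "intro", classes = { "text-center", "bold" } }

-- Split a compound selector such as "p.text-center.bold" into its parts.
local function parse_compound(selector)
  local parsed = { tag = nil, id = nil, classes = {} }
  -- An optional tag name comes first, before any "." or "#".
  parsed.tag = selector:match("^([%w-]+)")
  -- Every remaining part is introduced by "." (class) or "#" (id).
  for prefix, name in selector:gmatch("([.#])([%w_-]+)") do
    if prefix == "." then
      table.insert(parsed.classes, name)
    else
      parsed.id = name
    end
  end
  return parsed
end

-- An element matches only if every part of the compound selector matches.
local function matches(element, parsed)
  if parsed.tag and element.tag ~= parsed.tag then return false end
  if parsed.id and element.id ~= parsed.id then return false end
  local has = {}
  for _, class in ipairs(element.classes or {}) do has[class] = true end
  for _, class in ipairs(parsed.classes) do
    if not has[class] then return false end
  end
  return true
end
```

With this sketch, an element carrying extra classes still matches, while one missing either text-center or bold does not.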

Supported combinators are all the "basic" ones:

  • descendant - a b
  • child - a > b
  • next sibling - a + b
  • subsequent sibling - a ~ b

Limitations

  • The column and namespace combinators are not supported
  • Here be dragons: This tool was written by someone who is not especially good at writing parsers; it may break or behave unexpectedly. Don't hesitate to report issues!
  • This tool was not designed with speed in mind, though it seems fast enough for common CLI usage.

TODO

  • --text option to only get the text in the matched elements
  • Universal selector (* to match any element)
  • Attribute selectors ([attr="value"])
  • A way to "group" selectors, e.g. aside {p, footer} to select all ps and footers in asides?
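As a rough illustration of what "only get the text in the matched elements" could mean, the sketch below concatenates the text nodes found under an element. The node shape (text nodes as { text = ... }, elements as { tag = ..., children = { ... } }) and the name inner_text are assumptions for this example, not Htmlq's actual data structures.

```lua
-- Hypothetical node shapes, for illustration only:
--   text node:    { text = "Hello" }
--   element node: { tag = "p", children = { ... } }

-- Recursively collect the text of a node and all of its descendants.
local function inner_text(node)
  if node.text then
    return node.text
  end
  local parts = {}
  for _, child in ipairs(node.children or {}) do
    parts[#parts + 1] = inner_text(child)
  end
  return table.concat(parts)
end
```

Nested markup is flattened in document order, so the text of a bold element inside a paragraph ends up inline with the text around it.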

Usage

Usage: lua main.lua [FLAGS] <html_path_or_minus> <css_selector>
  html_path_or_minus: Path to HTML file or '-' for stdin
  css_selector: CSS selector to search for

  Flags:
  -f, --first-only: return only the first match
  -q, --quiet: Don't print warnings
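For readers curious how the html_path_or_minus argument might be handled, here is a minimal sketch that reads either a file or stdin. The helper name read_input is hypothetical and this is not the code from main.lua; it assumes Lua 5.3 or later, where the "a" read format is available.

```lua
-- Hypothetical helper: return the whole input as a string,
-- or nil plus an error message if the file cannot be opened.
local function read_input(path_or_minus)
  if path_or_minus == "-" then
    -- "-" means: read everything from standard input.
    return io.read("a")
  end
  local file, err = io.open(path_or_minus, "r")
  if not file then
    return nil, err
  end
  local content = file:read("a")
  file:close()
  return content
end
```

Returning nil and an error message, rather than raising, leaves it to the caller to decide whether a missing file is a warning or a fatal error.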

Motivation

I built this for a specific need of mine: I wanted to systematically extract the HTML starting at an element with a certain id, up to its closing tag. While I could probably have hacked something together for this one-time use case, in typical programmer spirit, I decided to create a tool.

This is my first parser, and it was very fun! Writing a parser seems to be a kind of "rite of passage" for programmers, and now I have done it too.

Obviously, this could have been solved with jsdom and like 10 lines of JS.

Plus, it's kinda neat to have a lightweight, dependency-free way to mess with web stuff in Lua.

Installation

Htmlq is written in Lua and requires no external dependencies. To use it, you will need to have Lua installed on your system. You can check if Lua is installed by running lua -v in your terminal. If Lua is not installed, you can install it from your distribution's package manager or from the official Lua website.

Compiling

To compile Htmlq, you will need to use luastatic. You can install luastatic via luarocks by running the following command:

luarocks install luastatic

Once luastatic is installed, you can compile Htmlq by running the following command in your terminal, from the project's root directory:

luastatic main.lua css.lua html.lua logging.lua /usr/lib/liblua5.4.so

Note that all .lua files from the project need to be specified, with main.lua as the first one. Also, the path to liblua may vary according to your system. The example provided is for an installation on EndeavourOS.

Running

Once compiled, you can run Htmlq using the following command:

./htmlq [FLAGS] <html_path_or_minus> <css_selector>

Where:

  • <html_path_or_minus> is the path to the HTML file you want to parse, or - to read from stdin.
  • <css_selector> is the CSS selector you want to use to query the HTML.

Flags

  • -f, --first-only: Return only the first match
  • -q, --quiet: Don't print warnings