Product Launch · XiaoHu Explains

Alibaba Open-Sources Page Agent: Embedding an Agent Directly Into the Webpage, Reading Text Instead of Screenshots to Operate the UI

MIT-licensed, model-agnostic — works with any OpenAI-compatible text model out of the box, though for now it can only operate a single page view

60-Second Overview

An Alibaba team has open-sourced Page Agent, an agent library that runs as plain JavaScript inside the webpage itself. It reads the page's text-based structure (the DOM) to understand and act on the UI, with no reliance on screenshots
The core technique is DOM dehydration: it compresses a page with thousands of nodes down into a lean text map called FlatDomTree that keeps only the interactive elements — so a plain text model can pinpoint elements precisely
Open-sourced under MIT, model-agnostic, and pluggable via any OpenAI-compatible endpoint — text-only, no multi-modal model required. The code inherits its DOM handling and prompt logic from browser-use
Because it runs inside the webpage, it automatically inherits the user's current login state, cookies, and permissions — no separate backend to write, no headless browser needed
The limits are just as clear: safety rules live in the prompt rather than being hard constraints; the core library can only operate a single page, and cross-tab control requires installing a separate Chrome extension

1. Who, and What They Built

Alibaba Open-Sources a Contrarian Browser Agent

An Alibaba team recently open-sourced Page Agent, an agent library that runs as plain JavaScript inside the webpage, understanding and operating the UI by reading the text-based DOM.

It lives inside the webpage like a real user — reading the page's text structure to click buttons and fill in forms — all without launching a headless browser, taking screenshots, or needing a multi-modal model that can "see" images.

⚡

Today's mainstream browser automation tools — Playwright, Selenium, Puppeteer, browser-use — all spin up a separate process outside the browser and control the page remotely via screenshots or a debugging protocol. Page Agent takes the opposite route: the agent logic itself is a piece of JavaScript embedded in the webpage, so it naturally inherits the user's current login state, cookies, and permissions. Integration takes just one script tag or one npm install.

Same webpage, two control paths: the agent living inside it, or a robotic arm reaching in from outside

2. Setting Up the Comparison

Where the Old Approach's Cost Goes

To see what Page Agent actually saves, first look at everything external tools have to carry just to control one webpage. They can't get inside the page — they can only stand outside it, directing through a pane of glass.

A separate process

Playwright, Selenium, and the like all need to run a separate runtime or service outside the browser — a different program from your application

A driver or debugging protocol layer

They read and control the page remotely via WebDriver or CDP (Chrome DevTools Protocol, the browser's official remote debugging protocol) — never actually entering the webpage

Often a multi-modal model too

Many approaches feed the model a screenshot of the page and have a vision-capable model guess where elements are, which drives up inference cost

This externally-driven approach still works well for cross-site scraping and end-to-end testing. What Page Agent is trying to solve is a different headache: when the webpage is already your own product and you control its code, why go the long way around?

3. The Key Technique

Compressing an Entire Webpage Into a Text List

A modern webpage can have thousands of nodes — feeding the raw HTML straight to a model is slow and expensive. Page Agent's approach is to "dehydrate" the page first, keeping only the handful of things that are actually operable.

DOM dehydration

When an instruction comes in, the agent scans the entire DOM (Document Object Model, the element tree the browser parses the webpage into) and finds every interactive element — buttons, links, input fields — tagging each one with an index, a role, and a text label. All the redundant decorative markup gets stripped away, and the whole page is compressed into a lean text map called FlatDomTree. That's what the model reads — not pixels.

An Analogy

It's like stripping all the body text out of a thick book and keeping only the chapter titles and page numbers from the table of contents. The model doesn't need to read the whole book — one glance at this table of contents tells it which page to flip to and which button to press.

How Different What the Model Sees Is, Before vs. After Dehydration

Before Dehydration · Raw DOM

<div class="hdr"> <nav><ul><li><a href…> <span><svg>…</svg></span> <div class="wrap"><div>… <button class="btn primary"…> <input type="text" name…> …thousands of cluttered nodes

After Dehydration · FlatDomTree

[0] link "Home" [1] input Email [2] input Password [3] button "Log in" [4] button "Submit expense"
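To make the dehydration step concrete, here is a minimal sketch of the idea, not the library's actual implementation. It assumes a simplified, hypothetical node shape (`PageNode`) instead of live DOM elements, and the list of interactive tags and the output format are illustrative choices only.

```typescript
// Hypothetical, simplified node shape; the real library walks live DOM elements.
interface PageNode {
  tag: string;
  text?: string;
  attrs?: Record<string, string>;
  children?: PageNode[];
}

interface FlatEntry {
  index: number;
  role: string;
  label: string;
}

// Tags treated as interactive in this sketch (an assumption, not the library's list).
const INTERACTIVE_ROLES: Record<string, string> = {
  a: "link",
  button: "button",
  input: "input",
  select: "select",
  textarea: "input",
};

// Collect all text under a node; used as the element's label.
function collectText(node: PageNode): string {
  const own = node.text ?? "";
  const nested = (node.children ?? []).map(collectText).join(" ");
  return `${own} ${nested}`.replace(/\s+/g, " ").trim();
}

// Depth-first walk: keep interactive elements, drop decorative wrappers.
function dehydrate(root: PageNode): FlatEntry[] {
  const entries: FlatEntry[] = [];
  const visit = (node: PageNode): void => {
    const role = INTERACTIVE_ROLES[node.tag];
    if (role !== undefined) {
      const label =
        node.attrs?.["aria-label"] ?? node.attrs?.["placeholder"] ?? collectText(node);
      entries.push({ index: entries.length, role, label });
      return; // an interactive element becomes a single line in the flat map
    }
    (node.children ?? []).forEach(visit);
  };
  visit(root);
  return entries;
}

// Render the flat map as the text a model would read.
function renderFlatTree(entries: FlatEntry[]): string {
  return entries.map((e) => `[${e.index}] ${e.role} "${e.label}"`).join("\n");
}

const page: PageNode = {
  tag: "div",
  children: [
    { tag: "nav", children: [{ tag: "a", children: [{ tag: "span", text: "Home" }] }] },
    {
      tag: "div",
      children: [
        { tag: "input", attrs: { placeholder: "Email" } },
        { tag: "button", text: "Log in" },
      ],
    },
  ],
};

console.log(renderFlatTree(dehydrate(page)));
```

The wrappers (`div`, `nav`, `span`) never reach the output; only three indexed lines do, which is the whole point of the technique.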

The original demo page puts this loop right on display: a "Dehydrated DOM" panel shows the list the model reads, while an "Action trace" panel next to it updates step by step as the instruction executes — you can watch it click through the steps one at a time.

Dehydrated DOM

[0] link "Home"

[1] input Email

[2] input Password

[3] button "Log in"

[4] button "Submit expense"

Action trace

▸ updateTree · generate list

▸ inputText [1] · fill email

▸ inputText [2] · fill password

▸ clickElement [3] · click login

4. Under the Hood

What Happens Once an Instruction Comes In

From a sentence of natural language to an actual click on the page, there's a fixed loop in between. The dirty work is handed off to a component called PageController.

Natural language instruction (execute()) → Scan the DOM (updateTree) → Generate FlatDomTree (text map) → Model decides (which index to pick) → Execute action (click / input / scroll)

PageController exposes exactly these concrete actions; clicking and typing target an element by its index:

PageController · Core Action Calls

await this.pageController.updateTree()
await this.pageController.clickElement(index)
await this.pageController.inputText(index, text)
await this.pageController.scroll({ down: true, numPages: 1 })
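The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: only the method names `updateTree`, `clickElement`, and `inputText` come from the article, while the `Action` type, the `Decide` callback standing in for the model call, the return type of `updateTree`, and the step limit are all hypothetical.

```typescript
// Hypothetical shapes for this sketch. Only the method names updateTree,
// clickElement, and inputText come from the article; signatures are assumed.
type Action =
  | { kind: "click"; index: number }
  | { kind: "input"; index: number; text: string }
  | { kind: "done" };

interface Controller {
  updateTree(): Promise<string>; // assumed here to return the text map
  clickElement(index: number): Promise<void>;
  inputText(index: number, text: string): Promise<void>;
}

// Stand-in for the model call: picks the next action from the text map.
type Decide = (instruction: string, tree: string, history: Action[]) => Promise<Action>;

async function runLoop(
  instruction: string,
  controller: Controller,
  decide: Decide,
  maxSteps = 10,
): Promise<Action[]> {
  const history: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const tree = await controller.updateTree(); // re-scan: the page may have changed
    const action = await decide(instruction, tree, history);
    if (action.kind === "done") break;
    if (action.kind === "click") {
      await controller.clickElement(action.index);
    } else {
      await controller.inputText(action.index, action.text);
    }
    history.push(action);
  }
  return history;
}

// Fake controller and scripted "model" so the sketch runs without a browser or an LLM.
const trace: string[] = [];
const fakeController: Controller = {
  updateTree: async () => '[1] input "Email"\n[3] button "Log in"',
  clickElement: async (index) => { trace.push(`clickElement [${index}]`); },
  inputText: async (index, text) => { trace.push(`inputText [${index}] ${text}`); },
};
const DONE: Action = { kind: "done" };
const script: Action[] = [
  { kind: "input", index: 1, text: "me@example.com" },
  { kind: "click", index: 3 },
];
const scriptedDecide: Decide = async (_instruction, _tree, history) =>
  script[history.length] ?? DONE;

runLoop("Log in as me@example.com", fakeController, scriptedDecide).then(() => {
  console.log(trace.join("\n"));
});
```

Re-scanning on every iteration matters because each click or keystroke can change which elements exist, so indices from a previous step may no longer be valid.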

The whole monorepo splits responsibilities across three small packages, each owning its own piece:

@page-agent/core: headless agent core logic

page-agent: full entry class with a UI panel

@page-agent/page-controller: handles DOM extraction and element indexing, with an optional SimulatorMask for visual feedback

Three Guardrails Developers Get to Use

Action whitelist

Restrict the agent to only the actions you allow — nothing else can be touched

Data masking

Hide sensitive fields like passwords so they're never sent to the model

Custom knowledge

Inject your own business rules so it follows your domain's conventions
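The article does not show how these three guardrails are configured, so the following is only a conceptual sketch of what each one does. Every name here (`assertAllowed`, `maskSensitive`, `withKnowledge`, the `FieldEntry` shape, and the sensitive-label pattern) is hypothetical and is not Page Agent's API.

```typescript
type ActionName = "clickElement" | "inputText" | "scroll";

interface FieldEntry {
  index: number;
  role: string;
  label: string;
  value?: string;
}

// Action whitelist: an action outside the allowed set never reaches the page.
function assertAllowed(action: ActionName, allowed: ReadonlySet<ActionName>): void {
  if (!allowed.has(action)) {
    throw new Error(`Action "${action}" is not on the whitelist`);
  }
}

// Labels treated as sensitive in this sketch (an assumption; tune for your product).
const SENSITIVE_LABEL = /password|card number|cvv|ssn/i;

// Data masking: replace sensitive values before the text map is sent to the model.
function maskSensitive(entries: FieldEntry[]): FieldEntry[] {
  return entries.map((entry) =>
    entry.value !== undefined && SENSITIVE_LABEL.test(entry.label)
      ? { ...entry, value: "***" }
      : entry,
  );
}

// Custom knowledge: prepend business rules so the model sees them with the task.
function withKnowledge(instruction: string, rules: string[]): string {
  const ruleLines = rules.map((rule) => `- ${rule}`).join("\n");
  return `Business rules:\n${ruleLines}\n\nTask: ${instruction}`;
}

const readOnly: ReadonlySet<ActionName> = new Set<ActionName>(["scroll"]);
assertAllowed("scroll", readOnly); // passes silently

const masked = maskSensitive([
  { index: 1, role: "input", label: "Email", value: "me@example.com" },
  { index: 2, role: "input", label: "Password", value: "hunter2" },
]);
console.log(masked.map((e) => `[${e.index}] ${e.label}=${e.value}`).join("\n"));
console.log(withKnowledge("Submit the expense", ["Expenses over $500 need a manager"]));
```

Note the difference in strength: the whitelist and the mask are enforced in code, while injected knowledge only influences what the model is told.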

5. Head-to-Head Comparison

Against Other Tools, Who Should Use This

The comparison below is about use case, not speed. The four approaches run in different places and read the page in different ways, and each has its own turf.

Approach	Where It Runs	How It Reads the Page	Integration Cost	Best For
Page Agent	Inside the webpage (client-side JS)	Dehydrated text DOM	One script tag or npm	An operational copilot inside your own product
Selenium / Playwright / Puppeteer	External process	Reads DOM via a driver (WebDriver/CDP)	Driver plus a runtime or service	Scripted end-to-end testing
browser-use	External process	DOM plus optional vision	Python plus a browser	Autonomous, multi-site agents
WebMCP	Inside the webpage (tools the site itself exposes)	Structured function calls	Requires broad adoption of the standard	Agent-native tool invocation

What Is WebMCP

WebMCP takes a different path: the webpage wraps its own functionality into structured "tool" functions and exposes them directly for an agent to call through a standardized interface. Page Agent reads DOM text, so it doesn't require restructuring the webpage and can be adopted right away; WebMCP relies on a standard protocol, so it has to wait for that interface standard to gain broad acceptance.

The conclusion comes down to scope: Page Agent fits inside products you can modify and control the code for; for scraping someone else's site or working against a locked-down environment, external-driver tools still win.

6. Real-World Use Cases

What You Can Actually Build With It

Because it lives inside your application, it can actually complete an action for the user rather than just standing beside them explaining how to click. The original piece gives four concrete examples.

🤖

In-product operational copilot

Add an assistant to a SaaS product that can actually operate it for the user. A support bot completes the steps directly instead of just describing them.

📝

Fill a multi-step form with one sentence

Compress a long multi-step form in an ERP or CRM into a single sentence. The user types "submit a $50 expense report for yesterday's lunch", and the agent handles the page-flipping and data entry itself.

🎙️

Voice and accessibility

Pair it with the Web Speech API for voice control — any webpage becomes reachable through natural language, and it can also give screen readers friendlier prompts to announce.

🧰

Add a natural-language entry point to legacy systems

Wrap it around an old internal tool with no API and add a command bar, without touching the original code.

7. Getting Started

The Lowest-Cost Path Is a Single Script Tag

Want to get a feel for it first? One script tag loads Page Agent with a free test model, ready to try right on the page.

For Evaluation · One-Line Integration (Includes Free Test AI)

<script src="https://cdn.jsdelivr.net/npm/page-agent@1.10.0/dist/iife/page-agent.demo.js" crossorigin="true"></script>

MIT: open-source license, TypeScript-first codebase

1.10.0: demo version available to try directly from the jsDelivr CDN

For production use, install the package and swap in your own endpoint:

For Production · npm Install and Configure Your Own Endpoint

import { PageAgent } from 'page-agent'

const agent = new PageAgent({
  model: 'qwen3.5-plus',
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  apiKey: 'YOUR_API_KEY',
  language: 'en-US',
})

await agent.execute('Click the login button')

Both model and baseURL accept any OpenAI-compatible provider, so switching providers is basically just swapping the model name, base URL, and key.
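As a small illustration of how little changes between providers, the sketch below builds the same four options shown in the example above from a lookup table. The `localServer` entry is a hypothetical self-hosted OpenAI-compatible server with a placeholder URL and model name, and `configFor` is an illustrative helper, not part of Page Agent.

```typescript
// Field names mirror the constructor options in the article's example.
interface AgentConfig {
  model: string;
  baseURL: string;
  apiKey: string;
  language: string;
}

interface Provider {
  baseURL: string;
  model: string;
}

const PROVIDERS: Record<"dashscope" | "localServer", Provider> = {
  dashscope: {
    baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
    model: "qwen3.5-plus",
  },
  // Hypothetical self-hosted server; URL and model name are placeholders.
  localServer: { baseURL: "http://localhost:8000/v1", model: "my-local-model" },
};

// Build the options object for a provider; only model, baseURL, and key differ.
function configFor(
  name: keyof typeof PROVIDERS,
  apiKey: string,
  language = "en-US",
): AgentConfig {
  return { ...PROVIDERS[name], apiKey, language };
}

console.log(configFor("localServer", "YOUR_API_KEY"));
```

The resulting object could then be passed to `new PageAgent(...)` in place of the inline literal.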

⚠️

The demo script and its bundled free test endpoint are for technical evaluation only. In addition, an apiKey passed directly to new PageAgent is bundled into your frontend code, so a production setup should proxy model requests through your own backend and never expose the key on the client. The agent also supports showing a confirmation prompt before it executes each critical action.
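The proxy advice can be sketched as a pure function that a backend route would call before forwarding a request. This assumes the agent talks to the standard OpenAI-compatible `/chat/completions` path with a Bearer token, which the article implies but does not spell out; `buildUpstreamRequest` and its types are hypothetical names.

```typescript
interface ProxyInput {
  path: string; // path the browser requested on your proxy
  body: string; // JSON body produced by the agent
}

interface UpstreamRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Only forward the one endpoint the agent is assumed to need.
const ALLOWED_PATH = "/chat/completions";

// Attach the secret key on the server so it never ships in frontend code.
function buildUpstreamRequest(
  input: ProxyInput,
  upstreamBase: string,
  serverKey: string,
): UpstreamRequest {
  if (input.path !== ALLOWED_PATH) {
    throw new Error(`Path ${input.path} is not proxied`);
  }
  return {
    url: `${upstreamBase.replace(/\/+$/, "")}${input.path}`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${serverKey}`,
    },
    body: input.body,
  };
}

const upstream = buildUpstreamRequest(
  { path: "/chat/completions", body: JSON.stringify({ model: "qwen3.5-plus", messages: [] }) },
  "https://dashscope.aliyuncs.com/compatible-mode/v1",
  "SERVER_SIDE_KEY", // in practice, read from an environment variable on the server
);
console.log(upstream.url);
```

With a route like this in place, the browser-side baseURL would point at your own proxy rather than at the provider. A real deployment would also authenticate the user's session and rate-limit the route, which this sketch leaves out.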

8. The Real Limits

What It Can't Do

This "living inside the webpage" approach comes with real, inherent trade-offs, and the official docs are upfront about the limits. These need to be on the table before you use it.

Safety Rules Written Into the Prompt Are Just Suggestions

A rule like "never auto-submit a payment form" is placed in the system prompt. It's a persuasive nudge, not a hard guarantee. For sensitive or destructive actions, server-side validation still has to stay in place — you can't treat prompt instructions as your only line of defense.
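One way to turn a prompt-level rule into a hard constraint is to enforce it in code before the click happens. The sketch below is a hypothetical wrapper, not a Page Agent feature: `guardedClick`, the `Confirm` callback, and the critical-label pattern are all assumptions. It complements server-side validation rather than replacing it.

```typescript
// Asks the human user to approve an action; resolves to true if approved.
type Confirm = (message: string) => Promise<boolean>;

// Labels treated as critical in this sketch (an assumption; extend for your product).
const CRITICAL_LABEL = /\b(pay|purchase|delete|transfer)\b/i;

// Run the click only if the label is harmless or the user explicitly approves.
async function guardedClick(
  label: string,
  click: () => Promise<void>,
  confirm: Confirm,
): Promise<"clicked" | "blocked"> {
  if (CRITICAL_LABEL.test(label)) {
    const approved = await confirm(`Allow the agent to click "${label}"?`);
    if (!approved) return "blocked";
  }
  await click();
  return "clicked";
}

const clicks: string[] = [];
const denyAll: Confirm = async () => false;

guardedClick("Pay now", async () => { clicks.push("Pay now"); }, denyAll)
  .then((result) => console.log(result, clicks.length));
```

Unlike a sentence in the system prompt, this check runs regardless of what the model decides, so a persuasive page or a confused model cannot talk its way past it.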

The Core Library Only Handles a Single Page

The core library targets interaction within a single view — on its own, it can't move between tabs or windows. For cross-page automation, you need the optional Chrome extension, which requires a separate install and authorization. There's also a Beta-stage MCP server that lets external agents like Claude Desktop or Copilot drive it in the other direction.

What each of the three deployment layers solves

The core library runs inside the page and handles single-page operations. The Chrome extension adds cross-tab capability, at the cost of an install and extra permissions. The Beta MCP server turns Page Agent into a tool that external agents such as Claude Desktop and Copilot can call. Each layer covers a different scope, and the further out you go, the higher the setup and authorization cost.

Back to the opening line: Page Agent and mainstream tools take two different paths — one embeds inside the webpage and reads text, the other stands outside and remote-controls via screenshots and protocols. It extends where browser automation can land — from "an external script controlling someone else's webpage" to "a natural-language operation layer built right into the product."

"The agent lives inside the webpage as plain JavaScript. It reads the live DOM as text and acts as the real user. No headless browser, no screenshots, no multi-modal model." (MarkTechPost, 2026-07-02)

This piece is based on and adapted from MarkTechPost reporting; facts and code examples are drawn from the original article. Open-source project at github.com/alibaba/page-agent (MIT license, TypeScript). Compiled by XiaoHu · AI Explainer.