Alibaba Open-Sources Page Agent: Embedding an Agent Directly Into the Webpage, Reading Text Instead of Screenshots to Operate the UI
MIT-licensed, model-agnostic — works with any OpenAI-compatible text model out of the box, though for now it can only operate a single page view
- An Alibaba team has open-sourced Page Agent, an agent library that runs as plain JavaScript inside the webpage itself. It reads the page's text-based structure (the DOM) to understand and act on the UI, with no reliance on screenshots
- The core technique is DOM dehydration: it compresses a page with thousands of nodes down into a lean text map called FlatDomTree that keeps only the interactive elements — so a plain text model can pinpoint elements precisely
- Open-sourced under MIT, model-agnostic, and pluggable via any OpenAI-compatible endpoint — text-only, no multi-modal model required. The code inherits its DOM handling and prompt logic from browser-use
- Because it runs inside the webpage, it automatically inherits the user's current login state, cookies, and permissions — no separate backend to write, no headless browser needed
- The limits are just as clear: safety rules live in the prompt rather than being hard constraints; the core library can only operate a single page, and cross-tab control requires installing a separate Chrome extension
Alibaba Open-Sources a Contrarian Browser Agent
An Alibaba team recently open-sourced Page Agent, an agent library that runs as plain JavaScript inside the webpage, understanding and operating the UI by reading the text-based DOM.
It lives inside the webpage like a real user — reading the page's text structure to click buttons and fill in forms — all without launching a headless browser, taking screenshots, or needing a multi-modal model that can "see" images.
Where the Old Approach's Cost Goes
To see what Page Agent actually saves, first look at everything external tools have to carry just to control one webpage. They can't get inside the page — they can only stand outside it, directing through a pane of glass.
This externally-driven approach still works well for cross-site scraping and end-to-end testing. What Page Agent is trying to solve is a different headache: when the webpage is already your own product and you control its code, why go the long way around?
Compressing an Entire Webpage Into a Text List
A modern webpage can have thousands of nodes — feeding the raw HTML straight to a model is slow and expensive. Page Agent's approach is to "dehydrate" the page first, keeping only the handful of things that are actually operable.
When an instruction comes in, the agent scans the entire DOM (Document Object Model, the element tree the browser parses the webpage into) and finds every interactive element — buttons, links, input fields — tagging each one with an index, a role, and a text label. All the redundant decorative markup gets stripped away, and the whole page is compressed into a lean text map called FlatDomTree. That's what the model reads — not pixels.
It's like stripping all the body text out of a thick book and keeping only the chapter titles and page numbers from the table of contents. The model doesn't need to read the whole book — one glance at this table of contents tells it which page to flip to and which button to press.
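The dehydration idea can be sketched in a few lines of JavaScript. This is a minimal illustration, not Page Agent's actual implementation: the plain-object node shape, the helper names (`dehydrate`, `roleOf`, `textOf`, `toText`), and the `[index] <role> label` output format are all assumptions made for the example, and the real FlatDomTree format may differ. The sketch works on plain objects rather than a live DOM so it can run anywhere.

```javascript
// Minimal sketch of DOM "dehydration" on a plain-object tree.
// The node shape and the output format are assumptions for illustration,
// not Page Agent's real FlatDomTree.

const INTERACTIVE_TAGS = new Set(['a', 'button', 'input', 'select', 'textarea']);

// Map a tag to a coarse role (a hypothetical simplification).
function roleOf(node) {
  if (node.tag === 'a') return 'link';
  if (node.tag === 'button') return 'button';
  return 'input'; // input, select, textarea
}

// Collect all text beneath a node so a nested button label survives.
function textOf(node) {
  const own = node.text || '';
  const kids = (node.children || []).map(textOf).join(' ');
  return `${own} ${kids}`.replace(/\s+/g, ' ').trim();
}

// Walk the tree depth-first and keep only interactive elements.
function dehydrate(root) {
  const flat = [];
  const visit = (node) => {
    if (INTERACTIVE_TAGS.has(node.tag)) {
      const label = textOf(node) || node.placeholder || '';
      flat.push({ index: flat.length, role: roleOf(node), label });
      return; // do not index elements nested inside an interactive one
    }
    (node.children || []).forEach(visit);
  };
  visit(root);
  return flat;
}

// Render the list as the compact text a model would read.
function toText(flat) {
  return flat.map((e) => `[${e.index}] <${e.role}> ${e.label}`).join('\n');
}

const page = {
  tag: 'div',
  children: [
    { tag: 'h1', text: 'Welcome back' },
    { tag: 'div', children: [{ tag: 'img' }, { tag: 'p', text: 'Decorative banner' }] },
    { tag: 'input', placeholder: 'Email' },
    { tag: 'button', children: [{ tag: 'span', text: 'Log in' }] },
    { tag: 'a', text: 'Forgot password?' },
  ],
};

const flat = dehydrate(page);
console.log(toText(flat));
```

For this sample tree the heading and the decorative banner are dropped and three indexed entries remain: the email input, the login button, and the password link. A model can then refer to "element 1" instead of parsing markup.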
What the Model Sees Before vs. After Dehydration
The original demo page puts this loop right on display: a "Dehydrated DOM" panel shows the list the model reads, while an "Action trace" panel next to it updates step by step as the instruction executes — you can watch it click through the steps one at a time.
What Happens Once an Instruction Comes In
From a sentence of natural language to an actual click on the page, there's a fixed loop in between. The dirty work is handed off to a component called PageController.
What PageController exposes are exactly these concrete actions, operating on elements by index:
await this.pageController.updateTree()
await this.pageController.clickElement(index)
await this.pageController.inputText(index, text)
await this.pageController.scroll({ down: true, numPages: 1 })
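The loop around those calls can be sketched roughly as follows. Only the four method names come from the calls shown above; everything else is a hypothetical stand-in: the fake controller, the assumption that `updateTree()` returns the text map, the `{ type, index, text }` action shape, and the scripted function that plays the role of the model. It is a sketch of an observe, decide, act cycle under those assumptions, not the library's internals.

```javascript
// Sketch of an observe -> decide -> act loop. Method names mirror the calls
// shown above; the fake controller, the action shape, and the scripted
// "model" are hypothetical stand-ins.

function makeFakeController() {
  const log = [];
  return {
    log,
    // Assumption: returns the dehydrated text map for the current page.
    async updateTree() {
      log.push('updateTree');
      return '[0] <input> Email\n[1] <button> Log in';
    },
    async clickElement(index) { log.push(`click ${index}`); },
    async inputText(index, text) { log.push(`input ${index} "${text}"`); },
    async scroll(opts) { log.push(`scroll ${JSON.stringify(opts)}`); },
  };
}

// Stand-in for the LLM call: replays a fixed plan, one action per step.
function makeScriptedModel(plan) {
  let step = 0;
  return async (_instruction, _tree) => plan[step++] || { type: 'done' };
}

async function runLoop(controller, decide, instruction, maxSteps = 10) {
  for (let i = 0; i < maxSteps; i++) {
    const tree = await controller.updateTree(); // re-read the page every step
    const action = await decide(instruction, tree);
    if (action.type === 'done') return 'done';
    if (action.type === 'click') await controller.clickElement(action.index);
    else if (action.type === 'input') await controller.inputText(action.index, action.text);
    else if (action.type === 'scroll') await controller.scroll({ down: true, numPages: 1 });
    else throw new Error(`Unknown action: ${action.type}`);
  }
  return 'max-steps'; // cap the loop so a confused model cannot run forever
}

const controller = makeFakeController();
const decide = makeScriptedModel([
  { type: 'input', index: 0, text: 'ada@example.com' },
  { type: 'click', index: 1 },
]);

const finished = runLoop(controller, decide, 'Log in as Ada').then((status) => {
  console.log(status, controller.log);
  return status;
});
```

Re-reading the tree before every decision matters because each click or input can change the page, which would make indices from an earlier snapshot stale.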
The whole monorepo splits responsibilities across three small packages, each owning its own piece:
| Package | Role |
|---|---|
| @page-agent/core | Headless agent core logic |
| page-agent | Full entry class with a UI panel |
| @page-agent/page-controller | Handles DOM extraction and element indexing, with an optional SimulatorMask for visual feedback |
Three Guardrails Developers Get to Use
Against Other Tools, Who Should Use This
What this comparison table is really about is use case, not speed. Four approaches run in different places and read the page in different ways — each has its own turf.
| Approach | Where It Runs | How It Reads the Page | Integration Cost | Best For |
|---|---|---|---|---|
| Page Agent | Inside the webpage (client-side JS) | Dehydrated text DOM | One script tag or npm | An operational copilot inside your own product |
| Selenium / Playwright / Puppeteer | External process | Reads DOM via a driver (WebDriver/CDP) | Driver plus a runtime or service | Scripted end-to-end testing |
| browser-use | External process | DOM plus optional vision | Python plus a browser | Autonomous, multi-site agents |
| WebMCP | Server-side tool | Structured function calls | Requires broad adoption of the standard | Agent-native tool invocation |
WebMCP takes a different path: the webpage wraps its own functionality into structured "tool" functions and exposes them directly for an agent to call through a standardized interface. Page Agent reads DOM text; WebMCP relies on a standard protocol. The former doesn't require modifying the webpage and can be adopted right away, while the latter has to wait for the interface standard to gain broad acceptance.
The conclusion comes down to scope: Page Agent fits inside products you can modify and control the code for; for scraping someone else's site or working against a locked-down environment, external-driver tools still win.
What You Can Actually Build With It
Because it lives inside your application, it can actually complete an action for the user rather than just standing beside them explaining how to click. The original piece gives four concrete examples.
The Lowest-Cost Path Is a Single Script Tag
Want to get a feel for it first? One script tag loads Page Agent with a free test model, ready to try right on the page.
<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/iife/page-agent.demo.js" crossorigin="true"></script>
For production use, install the package and swap in your own endpoint:
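import { PageAgent } from 'page-agent'

const agent = new PageAgent({
  // Any OpenAI-compatible model and endpoint can go here
  model: 'qwen3.5-plus',
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  // Placeholder only: do not ship a real key in client-side code
  apiKey: 'YOUR_API_KEY',
  language: 'en-US',
})

// Top-level await requires an ES module context
await agent.execute('Click the login button')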
Both model and baseURL accept any OpenAI-compatible provider, so switching providers is basically just swapping the model name, base URL, and key.
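To make "OpenAI-compatible" concrete, here is a rough sketch of the HTTP request such an endpoint expects: a POST to `{baseURL}/chat/completions` with a bearer token and a JSON body containing `model` and `messages`. The helper `buildChatRequest` is hypothetical and not part of page-agent, and the second provider (its URL, key, and model name) is a made-up placeholder. The sketch only builds the request object and sends nothing.

```javascript
// Sketch: what "OpenAI-compatible" means at the HTTP level. Only the config
// changes between providers; the request shape stays the same.
// buildChatRequest is a hypothetical helper, not part of page-agent.

function buildChatRequest({ baseURL, apiKey, model }, messages) {
  return {
    // Strip trailing slashes so both ".../v1" and ".../v1/" work.
    url: `${baseURL.replace(/\/+$/, '')}/chat/completions`,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  };
}

const messages = [{ role: 'user', content: 'Click the login button' }];

const dashscope = buildChatRequest({
  baseURL: 'https://dashscope.aliyuncs.com/compatible-mode/v1',
  apiKey: 'YOUR_API_KEY',
  model: 'qwen3.5-plus',
}, messages);

// Placeholder provider: the URL and model name are made up for illustration.
const other = buildChatRequest({
  baseURL: 'https://llm.example.com/v1/',
  apiKey: 'ANOTHER_KEY',
  model: 'example-model',
}, messages);

console.log(dashscope.url);
console.log(other.url);
```

Both requests have the same shape, which is why a model-agnostic client only needs three configuration values to move between providers.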
Calling new PageAgent in frontend code bundles its configuration, API key included, straight into what ships to the browser. A production setup needs to proxy model requests through your own backend and never expose the key on the client. The agent also supports showing a confirmation prompt before executing each critical action.
What It Can't Do
This "living inside the webpage" approach comes with real, inherent trade-offs, and the official docs are upfront about the limits. These need to be on the table before you use it.
Safety Rules Written Into the Prompt Are Just Suggestions
A rule like "never auto-submit a payment form" is placed in the system prompt. It's a persuasive nudge, not a hard guarantee. For sensitive or destructive actions, server-side validation still has to stay in place — you can't treat prompt instructions as your only line of defense.
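The difference between a prompt rule and a hard constraint can be shown with a small guard that runs in code before any click is executed. Everything here is hypothetical: the `{ index, label }` element shape, the blocked patterns, and the `guardedClick` wrapper are illustrative and are not Page Agent APIs. A client-side guard like this can still be bypassed, so it complements server-side validation rather than replacing it.

```javascript
// Sketch of a hard guard that runs in code, outside the prompt.
// The element shape, patterns, and confirm callback are hypothetical.

const BLOCKED_PATTERNS = [/\bpay\b/i, /\bpurchase\b/i, /delete account/i];

function isSensitive(element) {
  return BLOCKED_PATTERNS.some((re) => re.test(element.label));
}

// Wraps a click so sensitive targets need explicit user confirmation.
function guardedClick(element, click, confirm) {
  if (isSensitive(element) && !confirm(element)) {
    return { executed: false, reason: `blocked: "${element.label}"` };
  }
  click(element.index);
  return { executed: true };
}

const clicked = [];
const click = (index) => clicked.push(index);
const denyAll = () => false; // stand-in for a dialog the user rejects

const r1 = guardedClick({ index: 2, label: 'Add to cart' }, click, denyAll);
const r2 = guardedClick({ index: 5, label: 'Pay now' }, click, denyAll);
console.log(r1, r2, clicked);
```

Unlike a sentence in the system prompt, this check does not depend on the model choosing to comply: the "Pay now" click never reaches the page unless the confirmation callback returns true.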
The Core Library Only Handles a Single Page
The core library targets interaction within a single view — on its own, it can't move between tabs or windows. For cross-page automation, you need the optional Chrome extension, which requires a separate install and authorization. There's also a Beta-stage MCP server that lets external agents like Claude Desktop or Copilot drive it in the other direction.
What Each of the Three Deployment Layers Solves
The core library runs inside the page, handling single-page operations; the Chrome extension adds cross-tab capability, at the cost of an install and permissions; the Beta MCP server turns Page Agent into a tool that external agents such as Claude Desktop and Copilot can call. Each of the three layers covers a different scope, and the further out you go, the higher the setup and authorization cost.
Back to the opening line: Page Agent and mainstream tools take two different paths — one embeds inside the webpage and reads text, the other stands outside and remote-controls via screenshots and protocols. It extends where browser automation can land — from "an external script controlling someone else's webpage" to "a natural-language operation layer built right into the product."
"The agent lives inside the webpage as plain JavaScript. It reads the live DOM as text and acts as the real user. No headless browser, no screenshots, no multi-modal model." (MarkTechPost, 2026-07-02)