When GPTBot, ClaudeBot, and AppleBot crawl your content, they’re parsing HTML you wrote for browsers. Most of what they consume is layout markup, navigation chrome, cookie banners, and styling – not the prose they need to ingest. Cloudflare’s Markdown for Agents feature solves this with HTTP content negotiation: agents send Accept: text/markdown, Cloudflare returns a markdown version of the same URL at the edge. Browsers still get HTML. The catch is that it requires a Cloudflare Pro plan or higher – $20/month on annual billing, $240/year, a spend most pre-monetization sites won’t justify.
This is the WordPress implementation I built for squin.org while running Cloudflare Free. Single PHP snippet, no plugin, passes the isitagentready.com compliance check. It also handles the LiteSpeed cache poisoning issue that breaks naive implementations – the part Cloudflare’s docs skip and that none of the existing WordPress tutorials cover.
What Does Markdown for Agents Actually Do?
Content negotiation is an HTTP feature most developers know exists but rarely use directly. The client sends an Accept header listing MIME types it can handle. The server picks the best match and responds. Browsers send something like Accept: text/html,application/xhtml+xml..., and they get HTML. Agents that opt in send Accept: text/markdown, and they get markdown. Same URL, different response, decided server-side based on who’s asking.
Why markdown specifically? Token economics. Cloudflare claims up to 80% token reduction on real pages. I measured this on squin.org’s entity SEO guide and got 77% – close to Cloudflare’s number, slightly under because WordPress sites running lean themes like GeneratePress have relatively little chrome to strip. Sites running Astra, Divi, or Avada with sidebars, mega menus, and related-post widgets would land closer to 80% because there’s more to remove. The savings come from cutting nav menus, cookie banners, JSON-LD blocks the agent would otherwise have to dedupe against the prose, and CSS class soup. What’s left is headings, paragraphs, code, links – the substrate of what your page actually communicates.
Cloudflare’s implementation does this at the edge: requests come in, Cloudflare fetches HTML from your origin, converts on the fly, returns markdown. The protocol itself is just HTTP content negotiation, which any web server can implement. The convention has a defined spec at isitagentready.com – a SKILL.md document defining what compliant content negotiation should look like. Pass that, and you’re playing the same protocol Cloudflare-served sites play.
Markdown for Agents vs. llms.txt: Which Do You Need?
Both, ideally. They solve different parts of the same problem.
llms.txt is a static file at the root of your site listing URLs and short descriptions for AI agents to crawl. It’s a discovery index – the equivalent of a sitemap for LLMs. Markdown for Agents is on-demand markdown delivery via HTTP content negotiation. Different mechanism, different URL, different signal.
The two are complementary. Agents discover content via /llms.txt, then fetch any individual page in markdown via the Accept: text/markdown header on that page’s URL. Run only one and you’ve covered half the agent-readiness surface area. Run both and you cover the full pair: discovery plus delivery.
Rank Math ships an LLMS Txt module that handles /llms.txt automatically – toggle it on, configure post types, done. The snippet in this article handles content negotiation. Together they’re the complete pair, no other plugins or upgrades needed.
Why Build This in WordPress Instead of Upgrading Cloudflare?
Three reasons. Cost: $240/year on annual billing for one Beta feature isn’t a defensible spend on a pre-monetization site. Control: you decide what gets stripped – JSON-LD blocks, inline styles, cookie banner divs – and what front matter the markdown ships with. Performance: WordPress already runs the_content filters on every page load, so adding HTML-to-markdown conversion on top costs roughly 10ms after the first cache hit.
The tradeoff is honest. You don’t get edge-cached markdown variants the way Cloudflare delivers them. Every markdown request hits your origin, runs through PHP, and either fetches from a transient or builds fresh. For squin.org’s traffic that’s irrelevant. At scale where it actually matters, you’re past the point where Cloudflare Pro is the wrong purchase.
A Free Alternative to Cloudflare’s Pro Feature
The snippet in this article is functionally a free alternative to Cloudflare’s Pro feature. Same protocol – HTTP content negotiation with Accept: text/markdown. Same response shape – markdown body with Content-Type: text/markdown; charset=utf-8, Vary: Accept, and X-Markdown-Tokens headers. Same SKILL.md compliance at isitagentready.com. The only difference is where the conversion runs: Cloudflare Pro converts at the edge across their global network. The snippet converts on your origin’s PHP process.
Plugin alternatives exist too. Botkibble is a free WordPress plugin that handles content negotiation with similar mechanics. Different tradeoff – you get a maintained dependency instead of a snippet you control. If you’d rather not maintain PHP yourself, that’s a valid path.
How Does the WordPress Implementation Work?
Hook template_redirect at priority 1, which fires before the theme renders anything but after WordPress has resolved the request to a post. Check the Accept header. If it explicitly includes text/markdown, take over the response: load the post, run the_content to apply WordPress’s standard filters, convert the resulting HTML to markdown, send it back with the right headers, and exit. Browsers – which never send Accept: text/markdown – flow through to the theme’s normal render and get HTML.
The dispatch function:
add_action('template_redirect', 'squin_md_dispatch', 1);
function squin_md_dispatch() {
if (is_admin() || is_feed()) return;
if (!is_singular()) return;
if (post_password_required()) return;
if (empty($_SERVER['HTTP_ACCEPT'])) return;
if (!squin_md_accept_wants_markdown($_SERVER['HTTP_ACCEPT'])) return;
global $post;
if (!$post instanceof WP_Post) return;
$cache_key = 'squin_md_' . $post->ID . '_' . md5($post->post_modified_gmt);
$output = get_transient($cache_key);
if ($output === false) {
$output = squin_md_build($post);
set_transient($cache_key, $output, HOUR_IN_SECONDS);
}
status_header(200);
header('Content-Type: text/markdown; charset=utf-8');
header('Vary: Accept');
header('X-Markdown-Tokens: ' . (int) ceil(strlen($output) / 4));
header('Cache-Control: private, max-age=300');
header('X-Robots-Tag: noindex');
echo $output;
exit;
}
Three design choices in there are worth pulling out. The is_singular() check restricts conversion to posts and pages – archives, search results, and the homepage stay as HTML. Paginated archives and search pages are list-of-excerpts content with no primary body to ingest, which is the wrong shape for an agent fetch. The transient cache key includes post_modified_gmt, so any edit to a post invalidates the cache automatically without manual flushing. And the Accept header parsing is intentionally strict: only an explicit text/markdown triggers conversion. A wildcard */* from a curl command with no headers does not, which prevents browsers from accidentally getting markdown if their client sends a wildcard.
The X-Markdown-Tokens header is part of the SKILL.md spec at isitagentready.com. It announces the approximate token count of the response so agents can budget context window before downloading the full body. The four-characters-per-token heuristic is the approximation defined in the spec – close to OpenAI’s cl100k_base tokenizer, slightly low for Anthropic and Meta tokenizers which trend nearer 3.0–3.5 chars per token. Treat it as a budget signal, not a precise count. On squin.org’s entity SEO guide the snippet calculates 5,139 tokens, in line with what GPT-4-class tokenizers report for the same markdown body.
How Do You Handle the LiteSpeed Cache Poisoning Issue?
This is where I lost an hour. The snippet shipped clean – agent requests with Accept: text/markdown returned markdown, browsers got HTML. Then I ran a second browser-style request to the same URL and got back markdown. LiteSpeed Cache had captured the markdown response on the first agent hit and was now serving it to every subsequent visitor regardless of Accept header. Browsers were getting markdown. The site was effectively broken until the cache was purged.

The cause is structural. Most WordPress page caches don’t vary their cache key by request headers – they key on URL plus user state, full stop. The HTTP Vary: Accept response header is informational at the browser cache level, telling the browser to consider Accept when matching responses. But shared server-side caches like LiteSpeed, WP Rocket, W3 Total Cache, and WP Super Cache ignore Vary entirely. They cache by URL, then serve whatever they stored to anyone who asks for that URL. This is why “varying by header” doesn’t behave the way you’d expect on a shared cache.
The fix is to bypass page caching for markdown requests entirely. WordPress core defines a constant – DONOTCACHEPAGE – that all major caching plugins honor. Set it as early as possible in the request lifecycle, before the cache plugin makes its store-or-serve decision. LiteSpeed also exposes a plugin-specific action (litespeed_control_set_nocache) for the same purpose, used as belt-and-braces:
add_action('plugins_loaded', 'squin_md_bypass_page_cache', 1);
function squin_md_bypass_page_cache() {
if (empty($_SERVER['HTTP_ACCEPT'])) return;
if (!squin_md_accept_wants_markdown($_SERVER['HTTP_ACCEPT'])) return;
if (!defined('DONOTCACHEPAGE')) define('DONOTCACHEPAGE', true);
if (!defined('DONOTCACHEDB')) define('DONOTCACHEDB', true);
if (!defined('DONOTCACHEOBJECT')) define('DONOTCACHEOBJECT', true);
do_action('litespeed_control_set_nocache', 'squin markdown variant');
}
The hook choice matters. plugins_loaded fires before init, which fires before template_redirect. By hooking the bypass to plugins_loaded at priority 1, the cache plugin sees DONOTCACHEPAGE defined before it makes any caching decision for the request. init works for LiteSpeed in practice, but more aggressive caches may have already committed to a store-or-serve decision by then. plugins_loaded is the earliest practical hook in the WordPress request lifecycle and the safe default.
One last detail: after installing the bypass, purge your existing page cache. Any URL that was hit by an agent request before the bypass was in place already has a poisoned cache entry, and that entry will keep going out until it expires or you clear it manually. WP admin → LiteSpeed Cache → Toolbox → Purge All. The Cache-Control: private header on the markdown response is defense in depth – even if a downstream cache somehow ignores both DONOTCACHEPAGE and Vary, the private directive tells it not to store for shared use.
What Goes Into the HTML-to-Markdown Converter?
WordPress’s the_content output is messy in ways that break naive converters. Gutenberg block wrappers, EnlighterJS code blocks with custom attributes, oEmbed iframes, Rank Math FAQ blocks, characters embedded by the visual editor, JSON-LD scripts injected by SEO plugins. A regex-based converter falls over on any of it.
The right tool is DOMDocument. Walk the tree, dispatch on tag name, recurse for unknown elements so nothing gets silently dropped. Headings become #-prefixed lines. Paragraphs become double-newline-separated blocks. <pre> elements check for data-enlighter-language first, then fall back to class="language-xxx" on either the <pre> or its <code> child to determine the fence language. Lists track depth for proper nested indentation. Tables convert to GitHub-flavored markdown with header-row detection.
Three preprocessing steps matter. Strip <script> blocks first – Rank Math’s JSON-LD is a separate signal that doesn’t belong in prose markdown, and agents that parse JSON-LD do it from the HTML version anyway. Strip <style> blocks because some plugins leak inline CSS. Strip HTML comments because Gutenberg’s block delimiters are comments and would otherwise appear in output as visible noise.
Output safety follows from the conversion approach itself. DOMDocument only emits text content from allow-listed elements, and the preprocessing pass removes <script> and <style> blocks before the walk begins. Anything that survives is plain text or markdown formatting – no executable elements, no inline event handlers, no embedded JavaScript that could fire if the markdown is later rendered in an HTML viewer.
Then the whitespace gotcha that breaks most converters: is Unicode U+00A0, and PHP’s \s regex character class doesn’t catch it without the /u flag and an explicit byte sequence. The fix is /[\s\xC2\xA0]+/u, which catches both regular whitespace and the UTF-8 byte sequence for non-breaking space. Without it, you get stray spaces on otherwise-empty lines, broken paragraph spacing, and markdown that looks correct in PHP’s echo but renders weirdly in any markdown viewer. The bug is invisible until you actually look at output and notice the gaps.
How Do You Validate It’s Working?
Three curl requests, in order: markdown, browser, markdown. The middle request is the critical one – if it returns markdown, your cache bypass isn’t firing.
# 1. Markdown request - expect Content-Type: text/markdown; charset=utf-8 curl -H "Accept: text/markdown" https://yoursite.com/your-post/ -I # 2. Browser request - expect Content-Type: text/html; charset=UTF-8 curl https://yoursite.com/your-post/ -I # 3. Markdown request again - expect Content-Type: text/markdown; charset=utf-8 curl -H "Accept: text/markdown" https://yoursite.com/your-post/ -I
If request 2 returns markdown, the cache bypass isn’t firing. Purge your page cache, verify the plugins_loaded hook is registered (Code Snippets active, mu-plugin loaded, or wherever you installed the snippet), and retry. PowerShell users hit a snag here: curl in PowerShell aliases to Invoke-WebRequest, which parses arguments differently and breaks on the -I flag. Use curl.exe explicitly to invoke the bundled real curl.
One implementation note: the curls use -I for HEAD requests, which is faster for inspecting headers. The snippet still runs the full conversion for HEAD because template_redirect doesn’t distinguish HEAD from GET – wasteful but harmless for low-frequency validation. Most agents use GET in real traffic, so an early HEAD short-circuit is a low-priority optimization unless your validator hits frequently.
For full SKILL.md spec compliance, run the isitagentready.com markdown-negotiation skill check against your site. It validates that Content-Type, Vary, and X-Markdown-Tokens headers all conform to the protocol.
What Doesn’t This Snippet Cover?
Three deliberate omissions. No archive or category page conversion – the is_singular() check filters those out. Most agents don’t want a markdown version of a paginated taxonomy or a search results page. If your use case is different, swap the check for is_singular() || is_archive() and accept that you’ll be converting post excerpts on the fly across every archive request.
No llms.txt generation. That’s a different signal at a different URL, and Rank Math’s LLMS Txt module already handles it. The snippet covers content negotiation; Rank Math covers the discovery index. Together they’re the complete pair.
No JSON-LD reproduction in the markdown body. The converter strips <script> blocks before processing, including Rank Math’s structured data output. This is intentional. Agents that parse JSON-LD do it from the HTML version of the page where the schema lives natively. Duplicating structured data in both formats inflates token counts without adding signal – the markdown body is for prose ingestion, not metadata.
Frequently Asked Questions
Do you need Cloudflare Pro for Markdown for Agents?
Not if you implement content negotiation at your origin. Cloudflare’s Pro feature is convenience – it does the conversion at the edge so you don’t have to. The underlying protocol is HTTP content negotiation, which any web server can implement. The WordPress snippet in this article matches what Cloudflare Pro delivers, with the tradeoff that conversion runs on your origin’s PHP process instead of Cloudflare’s edge network.
What’s the difference between llms.txt and content negotiation?
llms.txt is a static file at the root of your site listing URLs and descriptions for AI agents to crawl – a discovery index. Content negotiation lets agents request any individual page in markdown by sending Accept: text/markdown – on-demand delivery. They solve different parts of the problem and work best together: agents discover content via llms.txt, then fetch individual pages in markdown via the Accept header.
Should I use a plugin like Botkibble instead?
If you want a maintained dependency rather than a code snippet, yes. Plugins handle updates and edge cases without your involvement. The snippet in this article gives you full control over preprocessing, front matter, and cache integration – useful if you want to customize what gets stripped or how the markdown is structured. Both approaches deliver compliant content negotiation. Pick based on whether you’d rather maintain PHP or maintain a plugin.
Will serving markdown affect SEO?
The snippet sets X-Robots-Tag: noindex on the markdown response, so search engines that respect that header won’t index the markdown variant as a separate document. The HTML version stays canonical and indexable. Google’s crawlers haven’t been observed requesting Accept: text/markdown, so they receive the HTML version by default. Content negotiation responds to what the client explicitly asks for via the Accept header – distinct from user-agent cloaking, where the server decides what to serve based on who’s asking. The distinction matters for how Google evaluates differential content.
Does this work with caching plugins?
Yes, with the DONOTCACHEPAGE bypass on plugins_loaded. Without that bypass, page caching plugins will store the markdown response and serve it to subsequent browser requests for the same URL – the cache poisoning issue covered above. The bypass is verified against LiteSpeed Cache; the DONOTCACHEPAGE constant is documented for WP Rocket and W3 Total Cache as well.
Where This Fits
Markdown for Agents is one piece of agent-readiness – making your content cleanly consumable by LLMs and AI crawlers. For the broader stack of tools and infrastructure that affect how AI systems ingest your site, see the SEO tools pillar.
Appendix: The Complete Snippet
Below is the complete PHP snippet, ready to paste into Code Snippets, an mu-plugin, or your child theme’s functions.php. It includes the dispatch and cache-bypass functions shown above plus the full HTML-to-markdown converter (the squin_md_* helper functions and the DOMDocument walker). Three filter hooks are exposed for tuning: squin_md_skip for per-post bypass, squin_md_content_html for pre-conversion HTML manipulation, and squin_md_output for post-conversion markdown manipulation.
<?php
/**
* Squin - Markdown for Agents
*
* Serves singular WordPress posts/pages as Markdown when an agent
* requests them with `Accept: text/markdown`. HTML stays the default
* for browsers via standard content negotiation.
*
* Drop-in replacement for Cloudflare's "Markdown for Agents" feature
* (which is Pro-plan only). No Composer, no external dependencies -
* pure PHP using DOMDocument.
*
* Hook: template_redirect (priority 1, before most output).
* Cache: per-post transient keyed on post_modified, 1 hour.
*
* Filters:
* squin_md_skip (bool $skip, WP_Post $post)
* squin_md_content_html (string $html, WP_Post $post) - pre-conversion
* squin_md_output (string $markdown, WP_Post $post) - post-conversion
* squin_md_cache_ttl (int $seconds)
*/
if (!defined('ABSPATH')) exit;
// Run as early as possible so page-cache plugins don't serve a cached
// HTML/markdown response from the wrong variant. plugins_loaded fires
// before init, before any cache plugin's serve-decision logic.
// LiteSpeed/WP Rocket/W3TC/WP Super Cache all honor DONOTCACHEPAGE.
add_action('plugins_loaded', 'squin_md_bypass_page_cache', 1);
add_action('template_redirect', 'squin_md_dispatch', 1);
function squin_md_bypass_page_cache() {
if (!isset($_SERVER['REQUEST_METHOD']) || $_SERVER['REQUEST_METHOD'] !== 'GET') return;
if (empty($_SERVER['HTTP_ACCEPT'])) return;
if (!squin_md_accept_wants_markdown($_SERVER['HTTP_ACCEPT'])) return;
if (!defined('DONOTCACHEPAGE')) define('DONOTCACHEPAGE', true);
if (!defined('DONOTCACHEDB')) define('DONOTCACHEDB', true);
if (!defined('DONOTCACHEOBJECT')) define('DONOTCACHEOBJECT', true);
// LiteSpeed Cache plugin
do_action('litespeed_control_set_nocache', 'squin markdown variant');
}
function squin_md_dispatch() {
if (is_admin()) return;
if (is_feed()) return;
if (defined('REST_REQUEST') && REST_REQUEST) return;
if (defined('DOING_AJAX') && DOING_AJAX) return;
if (defined('DOING_CRON') && DOING_CRON) return;
if (!is_singular()) return;
if (post_password_required()) return;
if (empty($_SERVER['HTTP_ACCEPT'])) return;
if (!squin_md_accept_wants_markdown($_SERVER['HTTP_ACCEPT'])) return;
global $post;
if (!$post instanceof WP_Post) return;
if (apply_filters('squin_md_skip', false, $post)) return;
$ttl = (int) apply_filters('squin_md_cache_ttl', HOUR_IN_SECONDS);
$cache_key = 'squin_md_' . $post->ID . '_' . md5($post->post_modified_gmt);
$output = $ttl > 0 ? get_transient($cache_key) : false;
if ($output === false) {
$output = squin_md_build($post);
if ($ttl > 0) set_transient($cache_key, $output, $ttl);
}
$tokens = (int) ceil(strlen($output) / 4);
status_header(200);
header('Content-Type: text/markdown; charset=utf-8');
header('Content-Length: ' . strlen($output));
header('Vary: Accept');
header('X-Markdown-Tokens: ' . $tokens);
header('Cache-Control: private, max-age=300'); // private: no shared caching
header('X-Robots-Tag: noindex'); // markdown variant should not be indexed
echo $output;
exit;
}
/**
* Parses an Accept header and returns true if text/markdown is explicitly
* preferred or accepted. Wildcards (`*\/*`) alone do NOT trigger conversion -
* we only serve markdown when an agent asks for it by name.
*/
function squin_md_accept_wants_markdown($accept_header) {
$accept_header = strtolower($accept_header);
foreach (explode(',', $accept_header) as $part) {
$part = trim($part);
if ($part === '') continue;
// Strip parameters (q=, etc.)
$type = trim(strtok($part, ';'));
if ($type === 'text/markdown') return true;
}
return false;
}
function squin_md_build(WP_Post $post) {
$title = get_the_title($post);
$url = get_permalink($post);
$modified = get_post_modified_time('c', true, $post);
$author = get_the_author_meta('display_name', $post->post_author);
$excerpt = has_excerpt($post) ? get_the_excerpt($post) : '';
$content_html = apply_filters('the_content', $post->post_content);
$content_html = str_replace(']]>', ']]>', $content_html);
$content_html = apply_filters('squin_md_content_html', $content_html, $post);
$body = squin_md_html_to_markdown($content_html);
$front = "---\n";
$front .= 'title: ' . squin_md_yaml_scalar($title) . "\n";
$front .= 'url: ' . $url . "\n";
$front .= 'updated: ' . $modified . "\n";
if ($author) {
$front .= 'author: ' . squin_md_yaml_scalar($author) . "\n";
}
if ($excerpt) {
$front .= 'description: ' . squin_md_yaml_scalar($excerpt) . "\n";
}
$front .= "---\n\n";
$output = $front . '# ' . $title . "\n\n" . $body;
return apply_filters('squin_md_output', $output, $post);
}
function squin_md_yaml_scalar($str) {
$str = (string) $str;
if (preg_match('/[:#\[\]{},&*!|>\'"%@`\n\r]/', $str) || $str === '' || ctype_space($str)) {
return '"' . str_replace(['\\', '"', "\n", "\r"], ['\\\\', '\\"', '\\n', ''], $str) . '"';
}
return $str;
}
/* -------------------------------------------------------------------------
* HTML → Markdown converter
* ------------------------------------------------------------------------- */
function squin_md_html_to_markdown($html) {
// Strip scripts, styles, comments - JSON-LD and inline CSS are not content.
$html = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
$html = preg_replace('#<style\b[^>]*>.*?</style>#is', '', $html);
$html = preg_replace('/<!--.*?-->/s', '', $html);
if (trim($html) === '') return '';
$wrapped = '<?xml encoding="UTF-8"?><div id="squin-md-root">' . $html . '</div>';
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($wrapped, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
$root = $dom->getElementById('squin-md-root');
if (!$root) {
// Fallback for libxml without ID awareness on bare divs
$root = $dom->getElementsByTagName('div')->item(0);
}
if (!$root) return '';
$md = squin_md_walk($root);
// Normalize: trim trailing spaces, blank-only lines → empty, collapse 3+ newlines
$md = preg_replace("/[ \t\xC2\xA0]+\n/u", "\n", $md);
$md = preg_replace("/^[ \t\xC2\xA0]+$/mu", "", $md);
$md = preg_replace("/\n{3,}/", "\n\n", $md);
return trim($md) . "\n";
}
function squin_md_walk(DOMNode $node, $list_depth = 0) {
$out = '';
foreach ($node->childNodes as $child) {
if ($child->nodeType === XML_TEXT_NODE) {
$out .= $child->nodeValue;
continue;
}
if ($child->nodeType !== XML_ELEMENT_NODE) continue;
$tag = strtolower($child->nodeName);
switch ($tag) {
case 'h1': case 'h2': case 'h3':
case 'h4': case 'h5': case 'h6':
$level = (int) substr($tag, 1);
$text = trim(squin_md_inline($child));
if ($text !== '') {
$out .= "\n\n" . str_repeat('#', $level) . ' ' . $text . "\n\n";
}
break;
case 'p':
$text = squin_md_inline($child);
if (preg_match('/[^\s\xC2\xA0]/u', $text)) {
$out .= "\n\n" . trim($text) . "\n\n";
}
break;
case 'iframe':
$src = $child->getAttribute('src');
if ($src) {
$title = $child->getAttribute('title');
if (!$title) $title = $src;
$out .= "\n\n" . '[' . $title . '](' . $src . ')' . "\n\n";
}
break;
case 'br':
$out .= " \n";
break;
case 'hr':
$out .= "\n\n---\n\n";
break;
case 'pre':
$out .= "\n\n" . squin_md_pre_to_fence($child) . "\n\n";
break;
case 'blockquote':
$inner = squin_md_walk($child, $list_depth);
$inner = preg_replace("/^[ \t\xC2\xA0]+$/mu", "", $inner);
$inner = preg_replace("/\n{3,}/", "\n\n", trim($inner));
if ($inner !== '') {
$lines = explode("\n", $inner);
$quoted = array_map(function ($l) { return rtrim('> ' . $l); }, $lines);
$out .= "\n\n" . implode("\n", $quoted) . "\n\n";
}
break;
case 'ul':
case 'ol':
$out .= "\n\n" . squin_md_list($child, $list_depth, $tag === 'ol') . "\n\n";
break;
case 'table':
$tbl = squin_md_table($child);
if ($tbl !== '') $out .= "\n\n" . $tbl . "\n\n";
break;
case 'figure':
$out .= "\n\n" . trim(squin_md_walk($child, $list_depth)) . "\n\n";
break;
case 'figcaption':
$text = trim(squin_md_inline($child));
if ($text !== '') $out .= "\n\n*" . $text . "*\n\n";
break;
case 'img':
$src = $child->getAttribute('src');
$alt = $child->getAttribute('alt');
if ($src) $out .= "\n\n" . '' . "\n\n";
break;
// Inline-ish elements that may appear at block level - treat inline
case 'a':
case 'strong': case 'b':
case 'em': case 'i':
case 'code':
case 'span':
$out .= squin_md_inline_single($child);
break;
// Block wrappers - recurse
case 'div': case 'section': case 'article':
case 'main': case 'header': case 'footer': case 'aside':
case 'details': case 'summary':
case 'dl': case 'dt': case 'dd':
$out .= squin_md_walk($child, $list_depth);
break;
default:
// Unknown - recurse so we don't drop content
$out .= squin_md_walk($child, $list_depth);
}
}
return $out;
}
function squin_md_inline(DOMNode $node) {
$out = '';
foreach ($node->childNodes as $child) {
if ($child->nodeType === XML_TEXT_NODE) {
$out .= preg_replace('/[\s\xC2\xA0]+/u', ' ', $child->nodeValue);
continue;
}
if ($child->nodeType !== XML_ELEMENT_NODE) continue;
$out .= squin_md_inline_single($child);
}
return $out;
}
function squin_md_inline_single(DOMNode $el) {
$tag = strtolower($el->nodeName);
switch ($tag) {
case 'strong': case 'b':
$t = trim(squin_md_inline($el));
return $t === '' ? '' : '**' . $t . '**';
case 'em': case 'i':
$t = trim(squin_md_inline($el));
return $t === '' ? '' : '*' . $t . '*';
case 'code':
$t = $el->textContent;
return $t === '' ? '' : '`' . $t . '`';
case 'a':
$href = $el->getAttribute('href');
$text = trim(squin_md_inline($el));
if ($text === '') $text = $href;
if (!$href) return $text;
return '[' . $text . '](' . $href . ')';
case 'br':
return " \n";
case 'img':
$src = $el->getAttribute('src');
$alt = $el->getAttribute('alt');
return $src ? '' : '';
case 'span':
return squin_md_inline($el);
default:
// Recurse into anything else inline-ish
return squin_md_inline($el);
}
}
function squin_md_pre_to_fence(DOMNode $pre) {
$lang = '';
// EnlighterJS: data-enlighter-language="php"
$data_lang = $pre->getAttribute('data-enlighter-language');
if ($data_lang) $lang = $data_lang;
// Class hints on <pre>: language-xxx, lang-xxx, brush: xxx
if (!$lang) {
$cls = $pre->getAttribute('class');
if ($cls && preg_match('/(?:language|lang|brush)[-:\s]+([\w+-]+)/i', $cls, $m)) {
$lang = $m[1];
}
}
// Look for a <code> child to get the actual content + possibly language
$code_text = '';
$code_node = null;
foreach ($pre->childNodes as $c) {
if ($c->nodeType === XML_ELEMENT_NODE && strtolower($c->nodeName) === 'code') {
$code_node = $c;
break;
}
}
if ($code_node) {
if (!$lang) {
$cclass = $code_node->getAttribute('class');
if ($cclass && preg_match('/(?:language|lang)[-:\s]+([\w+-]+)/i', $cclass, $m)) {
$lang = $m[1];
}
}
$code_text = $code_node->textContent;
} else {
$code_text = $pre->textContent;
}
$code_text = rtrim($code_text);
return "```" . $lang . "\n" . $code_text . "\n```";
}
function squin_md_list(DOMNode $list, $depth, $ordered) {
$out = '';
$indent = str_repeat(' ', $depth);
$i = 1;
foreach ($list->childNodes as $li) {
if ($li->nodeType !== XML_ELEMENT_NODE) continue;
if (strtolower($li->nodeName) !== 'li') continue;
$marker = $ordered ? ($i . '.') : '-';
$inline_buf = '';
$nested_buf = '';
foreach ($li->childNodes as $c) {
if ($c->nodeType === XML_TEXT_NODE) {
$inline_buf .= $c->nodeValue;
continue;
}
if ($c->nodeType !== XML_ELEMENT_NODE) continue;
$ctag = strtolower($c->nodeName);
if ($ctag === 'ul' || $ctag === 'ol') {
$nested_buf .= squin_md_list($c, $depth + 1, $ctag === 'ol');
} elseif ($ctag === 'p') {
$inline_buf .= squin_md_inline($c) . ' ';
} elseif ($ctag === 'pre') {
// Code blocks inside list items: indent each line
$fence = squin_md_pre_to_fence($c);
$fence_indented = preg_replace('/^/m', $indent . ' ', $fence);
$nested_buf .= "\n\n" . $fence_indented;
} else {
$inline_buf .= squin_md_inline_single($c);
}
}
$inline_buf = trim(preg_replace('/[\s\xC2\xA0]+/u', ' ', $inline_buf));
$line = $indent . $marker . ' ' . $inline_buf;
$out .= $line . "\n";
if ($nested_buf !== '') $out .= rtrim($nested_buf, "\n") . "\n";
$i++;
}
return rtrim($out, "\n");
}
function squin_md_table(DOMNode $table) {
$header = null;
$rows = [];
// Headers from <thead>
foreach ($table->childNodes as $c) {
if ($c->nodeType !== XML_ELEMENT_NODE) continue;
if (strtolower($c->nodeName) !== 'thead') continue;
foreach ($c->getElementsByTagName('tr') as $tr) {
$header = squin_md_table_cells($tr);
break;
}
break;
}
// Body rows
$body_found = false;
foreach ($table->childNodes as $c) {
if ($c->nodeType !== XML_ELEMENT_NODE) continue;
if (strtolower($c->nodeName) !== 'tbody') continue;
$body_found = true;
foreach ($c->getElementsByTagName('tr') as $tr) {
$rows[] = squin_md_table_cells($tr);
}
}
if (!$body_found && !$header) {
// Bare <table><tr>… layout
foreach ($table->getElementsByTagName('tr') as $tr) {
$is_header_row = $tr->getElementsByTagName('th')->length > 0
&& $tr->getElementsByTagName('td')->length === 0;
$cells = squin_md_table_cells($tr);
if ($header === null && $is_header_row) {
$header = $cells;
} else {
$rows[] = $cells;
}
}
}
if (!$header && !empty($rows)) {
$header = array_shift($rows);
}
if (!$header) return '';
$out = '| ' . implode(' | ', $header) . " |\n";
$out .= '|' . str_repeat(' --- |', count($header)) . "\n";
foreach ($rows as $row) {
while (count($row) < count($header)) $row[] = '';
$row = array_slice($row, 0, count($header));
$out .= '| ' . implode(' | ', $row) . " |\n";
}
return $out;
}
function squin_md_table_cells(DOMNode $tr) {
$cells = [];
foreach ($tr->childNodes as $c) {
if ($c->nodeType !== XML_ELEMENT_NODE) continue;
$tag = strtolower($c->nodeName);
if ($tag !== 'td' && $tag !== 'th') continue;
$text = trim(squin_md_inline($c));
$text = str_replace(['|', "\n", "\r"], ['\\|', ' ', ''], $text);
$cells[] = $text;
}
return $cells;
}