feat: extract Markdown frontmatter metadata #11615

Merged
anakin87 merged 2 commits into deepset-ai:main from gyx09212214-prog:codex/markdown-frontmatter-metadata on Jun 17, 2026
Conversation

@gyx09212214-prog

Contributor

Summary

Adds optional YAML frontmatter extraction to MarkdownToDocument.

When extract_frontmatter=True, Markdown files beginning with --- ... --- are parsed with PyYAML. Mapping values are added to Document.meta and the frontmatter block is removed before rendering document content. The default remains unchanged, so existing users keep frontmatter in the converted content unless they opt in.
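To illustrate the opt-in behavior described above, here is a minimal, stdlib-only sketch of splitting a leading `--- ... ---` block from the Markdown body. It is not the code from this PR: `split_frontmatter` is a hypothetical helper, and its flat `key: value` parsing is a stand-in for the PyYAML parsing the converter uses.

```python
import re

# Hypothetical helper for illustration only; the PR itself parses the block with PyYAML.
# Matches a frontmatter block only when it is the very first thing in the file.
_FRONTMATTER_RE = re.compile(
    r"\A---[ \t]*\r?\n(.*?)\r?\n---[ \t]*(?:\r?\n|\Z)", re.DOTALL
)


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Return (metadata, body). If no frontmatter is found, metadata is empty."""
    match = _FRONTMATTER_RE.match(text)
    if match is None:
        return {}, text

    meta: dict[str, str] = {}
    for line in match.group(1).splitlines():
        # Skip blank lines and YAML comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # Not a flat "key: value" line; a real YAML parser handles more.
        meta[key.strip()] = value.strip()

    # Everything after the closing "---" is the content to convert.
    return meta, text[match.end():]


note = "---\ntitle: Q2 report\ndate: 2026-06-12\n---\n# Heading\n\nBody text.\n"
meta, body = split_frontmatter(note)
print(meta)  # {'title': 'Q2 report', 'date': '2026-06-12'}
print(body)  # starts with "# Heading"
```

Because this sketch never interprets values, `date` stays a string. A YAML parser would normally turn `2026-06-12` into a date object, which is why the PR handles date-like scalars explicitly.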

Metadata precedence is ByteStream.meta < frontmatter < run(meta=...), matching the existing behavior where explicit runtime metadata can override source metadata. Date-like YAML scalars are kept as strings so common note fields like date: 2026-06-12 remain JSON-serializable metadata.
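The precedence order can be sketched with plain dictionary unpacking, where later sources win on key collisions. This is an illustration under assumptions, not the converter's actual code: `merge_meta` and the sample keys and values are hypothetical.

```python
import datetime
import json
from typing import Any, Optional


def merge_meta(
    bytestream_meta: dict[str, Any],
    frontmatter: dict[str, Any],
    run_meta: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Merge metadata so that ByteStream.meta < frontmatter < run(meta=...)."""
    # Later unpacked dicts override earlier ones on duplicate keys.
    return {**bytestream_meta, **frontmatter, **(run_meta or {})}


merged = merge_meta(
    {"file_path": "notes/acme.md", "source": "upload"},                  # from the ByteStream
    {"source": "broker-note", "date": "2026-06-12", "ticker": "ACME"},  # from frontmatter
    {"ticker": "ACME.N"},                                                # passed at run time
)
print(merged["source"])  # broker-note (frontmatter overrides ByteStream.meta)
print(merged["ticker"])  # ACME.N (run-time meta overrides frontmatter)

# String dates serialize cleanly; a datetime.date value would not.
json.dumps(merged)
try:
    json.dumps({"date": datetime.date(2026, 6, 12)})
except TypeError as exc:
    print(f"not serializable: {exc}")
```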

This is useful for Markdown/RAG ingestion pipelines where source notes carry fields like ticker, source, report date, author, or document id in frontmatter and downstream retrievers need them as metadata filters or citations.

Tests

  • python -m pytest test/components/converters/test_markdown_to_document.py -q
  • python -m py_compile haystack\components\converters\markdown.py test\components\converters\test_markdown_to_document.py
  • python -m ruff check haystack\components\converters\markdown.py test\components\converters\test_markdown_to_document.py
  • git diff --check

gyx09212214-prog requested a review from a team as a code owner on June 12, 2026, 17:13
gyx09212214-prog requested review from anakin87 and removed the request for a team on June 12, 2026, 17:13
@vercel

vercel Bot commented Jun 12, 2026

@gyx09212214-prog is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLAassistant commented Jun 12, 2026

CLA assistant check
All committers have signed the CLA.

github-actions bot added the topic:tests and type:documentation labels on Jun 12, 2026

@anakin87 anakin87 left a comment

Member

Thank you, it seems like a helpful addition.

I left some comments.
Please also fix the format error (running hatch run fmt should do the trick) and add a release note.

Comment thread haystack/components/converters/markdown.py
Comment thread test/components/converters/test_markdown_to_document.py Outdated
Comment thread test/components/converters/test_markdown_to_document.py
Comment thread haystack/components/converters/markdown.py
@github-actions

Contributor

Coverage report

File: haystack/components/converters/markdown.py
Lines missing coverage: 117, 122-128, 146, 167-172

This report was generated by python-coverage-comment-action

@anakin87 anakin87 left a comment

Member

Looks good!

@anakin87 anakin87 merged commit 3fbcac7 into deepset-ai:main Jun 17, 2026
22 of 23 checks passed