MarkItDown: Complete Summary

⚡ Microsoft Open Source

MarkItDown

A Python library for converting a wide range of file types to Markdown
for use with LLMs and text-analysis pipelines

⭐ 152k Stars · Python 3.10+ · MIT License · v0.1.6

01 · What It Does

Supports more than 10 file formats

📄 PDF: .pdf
📊 PowerPoint: .pptx
📝 Word: .docx
📈 Excel: .xlsx / .xls
🖼️ Images: EXIF metadata + OCR
🎵 Audio: speech transcription
🌐 HTML: .html
📋 Text-based: CSV · JSON · XML
📦 ZIP files: unpacks the archive and converts every file inside
▶️ YouTube: URL → transcript
📚 EPub: .epub
✉️ Outlook: .msg

Why Markdown?

Markdown is a format that LLMs such as GPT-4o "speak" natively, because they were trained on vast amounts of Markdown-formatted text. That makes MarkItDown output well suited for direct use as context in an AI pipeline.

Markdown also typically uses fewer tokens than HTML or XML for the same content.
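As a rough illustration of the token-saving point, the sketch below renders the same small table as HTML and as Markdown and compares their lengths. Character count is only a crude stand-in for token count (real tokenizers behave differently), and the sample rows and helper names are made up for this example.

```python
# Rough size comparison of one table rendered as HTML vs. Markdown.
# Character count is only a crude proxy for token count.
rows = [("Name", "Role"), ("Ada", "Engineer"), ("Linus", "Maintainer")]


def to_html(table):
    """Render rows as a minimal HTML table (first row is the header)."""
    header, *body = table
    parts = ["<table>",
             "<tr>" + "".join(f"<th>{c}</th>" for c in header) + "</tr>"]
    for row in body:
        parts.append("<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)


def to_markdown(table):
    """Render the same rows as a Markdown pipe table."""
    header, *body = table
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


html, md = to_html(rows), to_markdown(rows)
print(f"HTML: {len(html)} chars, Markdown: {len(md)} chars")
```

The gap comes mostly from paired open/close tags, which Markdown replaces with single delimiter characters.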

02 · Installation

Install in 2 steps

Create a virtual environment (recommended)

Python venv

python -m venv .venv
source .venv/bin/activate  # Mac/Linux
.venv\Scripts\activate     # Windows

Or use Anaconda: conda create -n markitdown python=3.12, then conda activate markitdown
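Before installing, it can help to confirm that the active interpreter is the one you expect. The sketch below checks for a venv-style environment and for the Python 3.10+ requirement stated above. The helper names are illustrative, and the `sys.prefix` comparison detects venv/virtualenv only, not conda environments.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv/virtualenv environment."""
    return sys.prefix != sys.base_prefix


def meets_minimum(version=sys.version_info, minimum=(3, 10)):
    """True when `version` is at least `minimum` (major, minor)."""
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("Python version OK:", meets_minimum())
```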

Install MarkItDown

Install all formats (recommended)

pip install 'markitdown[all]'

Install only the formats you need

pip install 'markitdown[pdf, docx, pptx]'

Available optional dependencies:

[all] [pptx] [docx] [xlsx] [xls] [pdf] [outlook] [audio-transcription] [youtube-transcription] [az-doc-intel] [az-content-understanding]
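If installs are scripted, the requirement string can be assembled from the extras listed above. This is a minimal sketch: `requirement_for` is a hypothetical helper, and the set of known extras is copied from this page rather than read from the package metadata.

```python
# Extras named in the list above; treated here as the known set.
KNOWN_EXTRAS = {
    "all", "pptx", "docx", "xlsx", "xls", "pdf", "outlook",
    "audio-transcription", "youtube-transcription",
    "az-doc-intel", "az-content-understanding",
}


def requirement_for(extras):
    """Build a pip requirement string such as 'markitdown[docx,pdf]'."""
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    if not extras:
        return "markitdown"
    return "markitdown[" + ",".join(sorted(set(extras))) + "]"


print(requirement_for(["pdf", "docx", "pptx"]))
```

Rejecting unknown names early turns a typo into a clear error instead of a confusing install failure later.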

🐳 Or use Docker

Docker

docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < file.pdf > output.md

03 · Usage

3 ways to use it

🖥️ Command Line (CLI)

# Convert a file and print the result to stdout
markitdown file.pdf

# Save the result to a file
markitdown file.pdf -o output.md

# Read the input from a pipe
cat file.pdf | markitdown

# List installed plugins
markitdown --list-plugins

# Enable plugins for a conversion
markitdown --use-plugins file.pdf
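The same commands can be driven from a script. The sketch below wraps the CLI with `subprocess`, using only the flags shown above (`-o`, `--use-plugins`). `build_command` and `run_markitdown` are illustrative names, not part of MarkItDown.

```python
import shutil
import subprocess


def build_command(source, output=None, use_plugins=False):
    """Assemble the argument list for the markitdown CLI."""
    cmd = ["markitdown"]
    if use_plugins:
        cmd.append("--use-plugins")
    cmd.append(str(source))
    if output is not None:
        cmd += ["-o", str(output)]
    return cmd


def run_markitdown(source, output=None, use_plugins=False):
    """Run the CLI and return what it prints to stdout."""
    if shutil.which("markitdown") is None:
        raise FileNotFoundError("markitdown CLI not found on PATH")
    done = subprocess.run(
        build_command(source, output, use_plugins),
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return done.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.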

🐍 Python API

from markitdown import MarkItDown

# Basic usage
md = MarkItDown(enable_plugins=False)
result = md.convert("test.xlsx")
print(result.text_content)

# Use an LLM to describe images
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)
result = md.convert("photo.jpg")
print(result.text_content)
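A common next step is converting a whole folder. The sketch below builds on `convert()` and `text_content` as used above. `convert_folder`, the suffix set, and the `<name>.<ext>.md` output naming are assumptions made for illustration.

```python
from pathlib import Path

# Suffixes to pick up; an illustrative subset, not the library's full list.
SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def convert_folder(folder, converter=None, suffixes=SUFFIXES):
    """Convert each matching file in `folder`; write `<name>.<ext>.md` beside it."""
    if converter is None:
        from markitdown import MarkItDown  # imported lazily, only when needed
        converter = MarkItDown(enable_plugins=False)
    written = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            result = converter.convert(str(path))
            # Keep the original suffix so report.pdf and report.docx don't collide.
            target = path.with_name(path.name + ".md")
            target.write_text(result.text_content, encoding="utf-8")
            written.append(target)
    return written
```

Calling `convert_folder("docs")` would leave the sources untouched and return the list of Markdown files it wrote.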

04 · Advanced Features

Integrations & Plugins

Plugin markitdown-ocr: OCR for images in documents

Extracts text from images embedded in PDF, DOCX, PPTX, and XLSX files using LLM vision, with no additional ML libraries to install.

Install + usage

pip install markitdown-ocr openai

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("doc_with_images.pdf")
print(result.text_content)

Azure Document Intelligence

Use Azure Document Intelligence to convert complex PDFs, with high-quality layout analysis and OCR in the cloud.

CLI

markitdown file.pdf -o output.md -d -e "<doc_intelligence_endpoint>"

Python

from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<endpoint>")
result = md.convert("scan.pdf")
print(result.text_content)
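Because Document Intelligence is a billable cloud service, it may be useful to enable it only when an endpoint is configured. The sketch below reads the endpoint from an environment variable and otherwise falls back to the built-in converters. The variable name `DOCINTEL_ENDPOINT` and both helper functions are assumptions for illustration.

```python
import os


def docintel_kwargs(env=os.environ, var="DOCINTEL_ENDPOINT"):
    """Return constructor kwargs; empty when no endpoint is configured."""
    endpoint = env.get(var, "").strip()
    return {"docintel_endpoint": endpoint} if endpoint else {}


def make_converter(env=os.environ):
    """Create a MarkItDown instance, cloud-backed only if configured."""
    from markitdown import MarkItDown  # imported lazily, only when needed
    return MarkItDown(**docintel_kwargs(env))
```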

Azure Content Understanding: Video & Audio Support

A Microsoft cloud service that supports video (which the built-in converters do not), high-quality audio, and extraction of domain-specific fields (invoices, receipts) as YAML front matter.

Capability	Built-in	Doc Intel	Content Understanding
Document conversion	✓ Offline	✓ Cloud	✓ Cloud Multimodal
Video	✗	✗	✓
Structured Fields	✗	✗	✓ YAML front matter
Custom Analyzer	✗	✗	✓
Cost	Local only	Billable API	Billable API

Python — Zero-config

from markitdown import MarkItDown
md = MarkItDown(cu_endpoint="<content_understanding_endpoint>")
result = md.convert("report.pdf")   # → prebuilt-documentSearch
result = md.convert("meeting.mp4")  # → prebuilt-videoSearch
result = md.convert("call.wav")     # → prebuilt-audioSearch

05 · Precautions

Security Considerations

⚠️ MarkItDown runs with the privileges of the current process

Do not pass untrusted input directly; in server or hosted environments, always validate it first
Use convert_local() instead of convert() when you only need to open local files
Use convert_stream() for the tightest control over the input
Restrict file paths, URI schemes, and network destinations in production environments
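The points above can be sketched as a validation step that runs before MarkItDown sees any input. The allowed suffixes, the size limit, and the helper names are example policy choices, not MarkItDown defaults. The only library call assumed is `convert_stream()` on an open binary file.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx"}  # example policy
MAX_BYTES = 20 * 1024 * 1024                            # example limit


def checked_path(user_value, base_dir):
    """Return a resolved path inside base_dir, or raise ValueError."""
    if "://" in user_value:
        raise ValueError("URIs are not accepted")
    base = Path(base_dir).resolve()
    path = (base / user_value).resolve()  # also resolves '..' and symlinks
    if not path.is_relative_to(base):
        raise ValueError("path escapes the allowed directory")
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError("file type not allowed")
    if not path.is_file() or path.stat().st_size > MAX_BYTES:
        raise ValueError("missing file or file too large")
    return path


def convert_untrusted(user_value, base_dir, converter):
    """Validate first, then hand the converter an already-open binary stream."""
    path = checked_path(user_value, base_dir)
    with path.open("rb") as stream:
        return converter.convert_stream(stream).text_content
```

Opening the file yourself means the library only ever receives bytes, so it has no path or URL of its own to follow.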
