HudchewMan's Station: Converting Documents with Pandoc (Part 3)

I have already written Part 1 and Part 2 of this series on converting documents with Pandoc. Since then I found that the Pandoc I was using was an old version (2.7.x) with a few bugs in markdown conversion: ~~ was not converted to strikethrough, and bold and italic were not converted in docx output (although they were in odt). So I updated to the latest version (at the time of writing, 3.3.1, released 29 Jul 2024). A few things changed from the older release, so here are some additional notes.

I use Linux Mint (Ubuntu-based), so I downloaded pandoc-3.3-1-amd64.deb and installed it. On Windows or Mac, pick the installer for your system.

To check which version of Pandoc is installed, type:

pandoc -v or pandoc --version
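If a script needs to decide whether an upgrade is worthwhile, the version number can be read from the first line of the version output, which looks like "pandoc 3.3". The following is only a sketch: the helper names pandoc_version and version_at_least are made up for illustration, the 3.0 threshold is an arbitrary example, and the comparison relies on `sort -V`, which GNU coreutils provides.

```shell
# Read `pandoc --version` output on stdin and print the version number.
# The first line has the form "pandoc 3.3".
pandoc_version() {
  awk 'NR == 1 { print $2; exit }'
}

# Succeed if version $1 is at least version $2.
# Assumption: `sort -V` (version sort) is available, as in GNU coreutils.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v pandoc >/dev/null 2>&1; then
  installed="$(pandoc --version | pandoc_version)"
  # "3.0" is an arbitrary example threshold, not an official cutoff.
  if version_at_least "$installed" "3.0"; then
    echo "pandoc $installed is recent enough"
  else
    echo "pandoc $installed is old; consider upgrading"
  fi
else
  echo "pandoc is not installed"
fi
```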

To list the command-line options, type:

pandoc -h or pandoc --help

If an earlier version is already installed, remove it before installing the new one by typing in a terminal:

sudo apt remove pandoc

To install, you can normally double-click the downloaded file to open the package installer. If that does not work, install from the terminal by typing:

sudo dpkg -i <downloaded .deb file>

In the version I used before, converting to markdown or plain (plain text) wrapped lines automatically. To stop that, add --wrap=none, for example:

pandoc -t markdown-smart --wrap=none file-in.docx > file-out.md

Conversion to html used not to wrap lines, and I had got used to that. After updating to 3.3.1, however, I found that html output is wrapped too, so the --wrap=none option is needed there as well.

Besides none there is also a preserve choice. The manual says:

--wrap=auto|none|preserve

Determine how text is wrapped in the output (the source code, not the rendered version). With auto (the default), pandoc will attempt to wrap lines to the column width specified by --columns (default 72). With none, pandoc will not wrap lines at all. With preserve, pandoc will attempt to preserve the wrapping from the source document (that is, where there are nonsemantic newlines in the source, there will be nonsemantic newlines in the output as well). In ipynb output, this option affects wrapping of the contents of markdown cells.
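To see the three modes side by side, here is a small sketch that feeds the same two-line paragraph to pandoc under each setting. The sample text, the helper name wrap_demo and the narrow --columns=40 are all chosen just for illustration; the script does nothing useful unless pandoc is installed.

```shell
# Sample input: one paragraph written on two source lines.
sample='This is the first source line of a fairly long paragraph,
and this is the second source line of the same paragraph.'

# Print the HTML pandoc emits for the sample under the wrap mode in $1
# (auto, none or preserve). --columns=40 makes auto-wrapping easy to see.
wrap_demo() {
  printf '%s\n' "$sample" | pandoc -f markdown -t html --wrap="$1" --columns=40
}

if command -v pandoc >/dev/null 2>&1; then
  for mode in auto none preserve; do
    echo "== --wrap=$mode =="
    wrap_demo "$mode"
  done
else
  echo "pandoc not found; install it to run this demo" >&2
fi
```

With none the whole paragraph should come out on a single line, with preserve the line break should fall where it does in the source, and with auto the text is re-wrapped to the column width.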

Another thing I learned is that the docx or odt output can be styled from a template document that we create ourselves (fonts, page setup and the like). Pass --reference-doc with the location of the file to use as the template (if it lives in a different folder, give the full path; a URL also works).

For example:

pandoc --reference-doc file-ref.docx -t docx file-in.md > file-out.docx

According to https://pandoc.org/MANUAL.html#option--reference-doc, the reference file should be one generated by pandoc, and a docx cannot be used as the reference for odt output (or the other way round).

When converting a docx that contains images to html, add --extract-media=. if you also want the image files saved; pandoc then creates a media folder as well. For example:
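One way to get a starting template that meets this requirement is to have pandoc write out its built-in default reference document, which the manual describes using --print-default-data-file. A minimal sketch follows; the function name make_reference_doc is hypothetical, and the output name custom-reference.docx (or .odt) simply follows the manual's example.

```shell
# Write pandoc's built-in default reference document for the given format
# ("docx" or "odt") to custom-reference.<format> in the current directory.
make_reference_doc() {
  case "${1:-}" in
    docx|odt) ;;
    *) echo "usage: make_reference_doc docx|odt" >&2; return 2 ;;
  esac
  pandoc -o "custom-reference.$1" --print-default-data-file "reference.$1"
}
```

After running make_reference_doc docx, open custom-reference.docx in a word processor, change the styles, save it, and pass it to --reference-doc.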

pandoc --extract-media=. --wrap=none -t html file-in.docx > file-out.htm

※※※ Linux commands ※※※

Convert docx to plain (with the .md extension)

~~for f in *.docx ; do pandoc "${f}" -f docx -t plain --wrap=none -s -o "${f}.md" ; done~~

for f in *.docx ; do pandoc "${f}" -t plain --wrap=none -o "${f%.docx}.md" ; done
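The difference between the struck-through command and its replacement is the `${f%.docx}` expansion, which removes a trailing .docx from the name before the new extension is added. A short sketch of the idea, using a hypothetical helper called swap_ext:

```shell
# swap_ext FILE OLD NEW: print FILE with a trailing ".OLD" replaced by ".NEW".
# If FILE does not end in ".OLD", the name is kept and ".NEW" is appended.
swap_ext() {
  printf '%s\n' "${1%."$2"}.$3"
}

f="chapter 01.docx"
echo "${f}.md"              # appends only: chapter 01.docx.md
swap_ext "$f" docx md       # strips, then appends: chapter 01.md
```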

Convert md to docx using a reference file

for f in *.md ; do pandoc "${f}" --reference-doc file-ref.docx -f markdown -t docx -s -o "${f%.md}.docx" ; done

Convert md to odt using a reference file

~~for f in *.md ; do pandoc "${f}" --reference-doc file-ref.odt -t odt -s -o "${f}.odt" ; done~~

for f in *.md ; do pandoc "${f}" --reference-doc file-ref.odt -t odt -s -o "${f%.md}.odt" ; done

Convert md to html (leave out -s so that no meta head is added to the output)

~~for f in *.md ; do pandoc -t html --wrap=none "${f}" > "${f}.htm" ; done~~

for f in *.md ; do pandoc "${f}" -t html --wrap=none -o "${f%.md}.htm" ; done

Convert xhtml to docx

for f in *.xhtml ; do pandoc "${f}" -t docx -o "${f%.xhtml}.docx" ; done

Convert docx to html

for f in *.docx ; do pandoc --wrap=none -t html "${f}" > "${f}.htm" ; done

Trim the redundant .docx.htm extension down to .htm

for f in *.docx.htm ; do mv -- "${f}" "${f%.docx.htm}.htm" ; done

Trim the redundant .odt.htm extension down to .htm

for f in *.odt.htm ; do mv -- "${f}" "${f%.odt.htm}.htm" ; done
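The two rename loops above differ only in the inner extension, so they can be folded into one function. This is a sketch with a hypothetical name, strip_inner_ext; it also skips the case where the glob matches nothing (the loop would otherwise try to rename a file literally called `*.docx.htm`) and refuses to overwrite an existing target.

```shell
# strip_inner_ext INNER: rename every NAME.INNER.htm in the current
# directory to NAME.htm, e.g. strip_inner_ext docx.
strip_inner_ext() {
  for f in *."$1".htm; do
    [ -e "$f" ] || continue            # glob matched nothing
    target="${f%."$1".htm}.htm"
    if [ -e "$target" ]; then
      echo "skip: $target already exists" >&2
      continue
    fi
    mv -- "$f" "$target"
  done
}
```

Usage would be strip_inner_ext docx followed by strip_inner_ext odt.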

※※※ Windows commands ※※※

Convert docx to plain (with the .md extension)

for %i in (*.docx) do pandoc -f docx -t plain --wrap=none -o "md\%~ni.md" "%i"

or

for /r "." %i in (*.docx) do pandoc -t plain --wrap=none -o "%~dpni.md" "%~i"

※※※

Pandoc's options are quite varied, and several spellings can do the same job. For example:

Specifying the source format

-f FORMAT, -r FORMAT, --from=FORMAT, --read=FORMAT

Specifying the output format

-t FORMAT, -w FORMAT, --to=FORMAT, --write=FORMAT

Specifying the output file name

-o FILE, --output=FILE

Read more in the Pandoc User’s Guide

※※※

My own workflow

Download the files from Google Drive, which arrive as .docx
Create subfolders md, htm and odt to keep the files sorted by type
Convert the downloaded files to plain text with the .md extension (in the md folder)
Convert the .md files to html, building the h1 title from the chapter number and chapter name (h1+h2) for generating the table of contents in Sigil (ref-html.lua), and adding the class "sigil_not_in_toc" to h2 so that it is left out of the table of contents (in the htm folder)
Convert the .md files to odt using the template file ref-odt.odt (in the odt folder)

mkdir md ; mkdir htm ; mkdir odt

for f in *.docx ; do pandoc "${f}" -t plain --wrap=none -o "md/${f%.docx}.md" ; done

for f in md/*.md ; do b="${f##*/}" ; pandoc "${f}" -t html --wrap=none --lua-filter=ref-html.lua -o "htm/${b%.md}.htm" ; done

for f in md/*.md ; do b="${f##*/}" ; pandoc "${f}" --reference-doc ref-odt.odt -t odt -s -o "odt/${b%.md}.odt" ; done
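The steps above can also be gathered into a single script. What follows is only a sketch under a few assumptions: the helper names dest_path and convert_all are made up, the .md files are read from the md folder created in the first step, and ref-html.lua and ref-odt.odt sit in the current directory as in the commands above.

```shell
# dest_path FILE DIR EXT: map e.g. "md/ch 1.md" to "DIR/ch 1.EXT".
dest_path() {
  base="${1##*/}"                      # drop any leading directory
  printf '%s\n' "$2/${base%.*}.$3"     # swap the last extension
}

# Run the whole pipeline: docx -> md (plain text) -> htm and odt.
convert_all() {
  mkdir -p md htm odt
  for f in *.docx; do
    [ -e "$f" ] || continue            # no .docx files present
    pandoc "$f" -t plain --wrap=none -o "$(dest_path "$f" md md)"
  done
  for f in md/*.md; do
    [ -e "$f" ] || continue            # nothing was converted
    pandoc "$f" -t html --wrap=none --lua-filter=ref-html.lua \
      -o "$(dest_path "$f" htm htm)"
    pandoc "$f" --reference-doc ref-odt.odt -t odt -s \
      -o "$(dest_path "$f" odt odt)"
  done
}
```

Calling convert_all in the folder that holds the downloaded .docx files runs all three conversions in order.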

[ Reference ]

How to specify the font used for word doc exported using pandoc?

How do I add custom formatting to docx files generated in Pandoc?

Pandoc option--reference-doc

[ Keywords ]

document conversion, file conversion


Friday, 30 August 2024
