md2hwp

Convert Markdown to HWP (Hangul Word Processor) format

# md2hwp Improvements Summary

## Overview

This document summarizes all improvements made to the md2hwp library for better HWP output quality.

## 1. Heading Hierarchy Implementation ✅

### What Was Added

- Proper H1-H6 heading support with different font sizes
  - H1: 1400 HWPUNIT (14pt)
  - H2: 1300 HWPUNIT (13pt)
  - H3: 1200 HWPUNIT (12pt)
  - H4: 1100 HWPUNIT (11pt)
  - H5/H6: 1000 HWPUNIT (10pt - normal size)

### Character Properties

Created 6 different character property sets (charPr id="0" through "10") with:

- Bold weight (700) for all headings
- Graduated font sizes
- Proper baseline and spacing calculations

### Benefits

- Clear visual hierarchy in documents
- Professional document appearance
- Proper heading structure for navigation

---

## 2. Line Spacing (줄간격) Improvements ✅

### What Was Added

Multiple paragraph properties with different line spacing:

- **paraPr id="0"**: 140% line spacing (for headings)
- **paraPr id="1"**: 150% line spacing (for lists)
- **paraPr id="20"**: 160% line spacing (for normal paragraphs)
- **paraPr id="21"**: 160% line spacing + top margin (paragraphs after headings)

### Smart Spacing

- Extra gaps (300-400 HWPUNIT) between different content types
- Context-aware paragraph selection based on previous content
- Consistent vertical rhythm throughout document

### Benefits

- Better readability
- Professional document spacing
- Improved visual separation between sections

---

## 3. Line Wrapping Fix (자간 압축 해결) ✅

### The Problem

Long sentences were not wrapping naturally. Instead, HWP was compressing character spacing (자간) to force text onto one line.

### The Solution

**Removed `<hp:linesegarray>` from regular paragraphs and lists** while keeping it for headings and tables.

#### Before (Not Working)

```xml
<hp:p paraPrIDRef="20" ...>
  <hp:run charPrIDRef="0">
    <hp:t>Long sentence...</hp:t>
  </hp:run>
  <hp:linesegarray>
    <hp:lineseg textpos="0" ... flags="393216"/>
  </hp:linesegarray>
</hp:p>
```

#### After (Working!)
```xml
<hp:p paraPrIDRef="20" ...>
  <hp:run charPrIDRef="0">
    <hp:t>Long sentence...</hp:t>
  </hp:run>
</hp:p>
```

### Why It Works

- No layout hints to override
- HWP calculates line breaks from paragraph properties
- Uses `breakNonLatinWord="BREAK_WORD"` for natural wrapping
- Character spacing stays at 0 (no compression)

### Where Applied

**Removed linesegarray from:**

- Normal paragraphs
- List items
- Mixed content paragraphs
- Image placeholders
- Empty paragraphs

**Kept linesegarray for:**

- Headings (precise spacing needed)
- Tables (layout structure needed)
- Table cells (content layout needed)

### Benefits

- ✅ Natural line wrapping for long sentences
- ✅ Consistent character spacing
- ✅ No compression artifacts
- ✅ Proper text flow for Korean and English

---

## 4. Character Spacing Settings ✅

### What Was Changed

Set character spacing to **0** for all scripts:

```xml
<hh:spacing hangul="0" latin="0" hanja="0" japanese="0" other="0" symbol="0" user="0"/>
```

### Why

- Prevents automatic spacing that could trigger compression
- Allows HWP to use natural character metrics
- Works with line wrapping fix

### Benefits

- Natural character appearance
- No unwanted spacing adjustments
- Consistent rendering

---

## 5. Paragraph Break Settings ✅

### What Was Configured

```xml
<!-- breakLatinWord="KEEP_WORD": don't break English words -->
<!-- breakNonLatinWord="BREAK_WORD": allow Korean/CJK wrapping -->
<!-- lineWrap="BREAK": enable line wrapping -->
<hh:breakSetting breakLatinWord="KEEP_WORD"
                 breakNonLatinWord="BREAK_WORD"
                 widowOrphan="0" keepWithNext="0" keepLines="0"
                 pageBreakBefore="0" lineWrap="BREAK"/>
```

### Benefits

- English words stay intact
- Korean text wraps naturally
- Proper line breaking behavior

---

## 6. Bold Text Support (v1.2.4) ✅

### The Problem

Bold text (`**text**`) was not rendering correctly:

- Initially appeared at 14pt instead of 10pt
- Then had correct size but no bold weight
- Attempted solutions with `<hh:fontweight>` didn't work

### The Root Causes Discovered

#### 1. Character Property Reference Issue

HWP's `charPrIDRef` uses **position index**, not the `id` attribute:

```xml
<!-- charPrIDRef="6" looks at position 6, not id="6"! -->
<hh:charProperties itemCnt="7">
  <hh:charPr id="0" .../>  <!-- Position 0 -->
  <hh:charPr id="1" .../>  <!-- Position 1 -->
  <hh:charPr id="6" .../>  <!-- Position 2: the intended target of charPrIDRef="6", which actually resolves to position 6 -->
</hh:charProperties>
```

**Solution:** Made all charPr IDs sequential (0-5), so that each `id` equals its position.

#### 2. Wrong Bold Tag

Using `<hh:fontweight hangul="700" .../>` didn't work.

**Solution:** Analyzed a user-corrected HWP file and found that HWP requires the `<hh:bold/>` tag.

#### 3. Font References

Bold text needs different `fontRef` values for CJK scripts.

### The Correct Implementation

```xml
<!-- Bold text: charPr id="1" at position 1 -->
<hh:charPr id="1" height="1000" borderFillIDRef="1" ...>
  <hh:fontRef hangul="0" latin="0" hanja="1" japanese="1"
              other="1" symbol="1" user="1"/>  <!-- ← Different! -->
  ...
  <hh:bold/>  <!-- ← The key! -->
  ...
</hh:charPr>
```

### Character Property Mapping

| Position | ID | Height | Purpose |
|----------|-----|--------|---------|
| 0 | "0" | 1000 (10pt) | Normal text |
| 1 | "1" | 1000 (10pt) | **Bold text** |
| 2 | "2" | 1400 (14pt) | H1 heading |
| 3 | "3" | 1300 (13pt) | H2 heading |
| 4 | "4" | 1200 (12pt) | H3 heading |
| 5 | "5" | 1100 (11pt) | H4 heading |

### Usage in Paragraphs

```typescript
const cid = child.style?.bold ? '1' : '0'; // Position 1 = bold
return `<hp:run charPrIDRef="${cid}"><hp:t>${text}</hp:t></hp:run>`;
```

### Benefits

- ✅ Bold text renders at correct size (10pt, same as normal)
- ✅ Bold weight properly applied
- ✅ Works for English and Korean text
- ✅ No outline boxes or unwanted borders
- ✅ Headings remain properly sized and bold

### Testing

```bash
npm run build
node test-bold-clear-convert.js
# Open test-bold-clear-output.hwp in Hancom Office
```

---

## 7. Nested Lists with Indentation (v1.2.6) ✅

### The Problem

Markdown supports nested lists, but md2hwp wasn't handling them correctly:

- Nested list items weren't being parsed
- No visual indentation to show hierarchy
- Bold text in list items wasn't working
- Mixed content (bold + normal text) in the same list item wasn't supported

### Example Input

```markdown
- **총 예산**: 35,000,000원
- **주요 항목**:
  - 해외 연사 항공료 및 숙박: ~10,440,000원
  - 연사비: 3,400,000원
```

### The Root Causes Discovered

#### 1. Missing Recursive Parsing

The `marked.js` library puts nested lists in the parent item's `tokens` array:

```javascript
{
  type: 'list_item',
  text: '주요 항목',
  tokens: [
    { type: 'text', text: '주요 항목' },
    { type: 'list', items: [...] }  // ← Nested list here!
  ]
}
```

**Solution:** Recursively parse the `item.tokens` array in `parseList()`

#### 2. Bold Text Not Parsed in Lists

The `parseList()` method was using `extractText()`, which strips formatting.

**Solution:** Use `parseInlineTokens()` to preserve bold/italic formatting

#### 3. No Visual Indentation

Nested lists had no indentation to show hierarchy levels.

**Initial Attempt (v1.2.5):** Used inline `<hp:paraPr>` with left margin - didn't work in HWP

**Correct Solution (v1.2.6):** HWP requires **pre-defined paragraph properties** in the header, not inline paraPr. Created 14 paragraph property definitions with progressive left margin values.

### The Implementation

#### Parser Changes (markdown-parser.ts)

```typescript
private parseList(token: Tokens.List): HwpContent {
  const children: HwpContent[] = [];

  for (const item of token.items) {
    // Check for nested tokens (including nested lists)
    if ('tokens' in item && Array.isArray(item.tokens)) {
      for (const subToken of item.tokens) {
        if (subToken.type === 'text') {
          // Parse inline elements (bold, italic, etc.)
          const inlineElements = this.parseInlineTokens(subToken.text);
          // ...
        } else {
          // Recursively handle nested lists
          const parsed = this.tokenToContent(subToken);
          if (parsed) children.push(parsed);
        }
      }
    } else {
      // Handle inline formatting in simple list items
      const inlineElements = this.parseInlineTokens(item.text);
      // ...
    }
  }

  return { type: 'list', children };
}
```

#### Generator Changes (hwpx-generator.ts)

**Step 1: Create helper method for paragraph properties**

```typescript
private generateParaPr(
  id: string,
  lineSpacing: string,
  leftMargin: string = '0',
  prevMargin: string = '0'
): string {
  return `<hh:paraPr id="${id}" ...>
    <hh:margin>
      <hc:intent value="0" unit="HWPUNIT"/>
      <hc:left value="${leftMargin}" unit="HWPUNIT"/>
      <hc:right value="0" unit="HWPUNIT"/>
      <hc:prev value="${prevMargin}" unit="HWPUNIT"/>
      <hc:next value="0" unit="HWPUNIT"/>
    </hh:margin>
    <hh:lineSpacing type="PERCENT" value="${lineSpacing}" unit="HWPUNIT"/>
    ...
  </hh:paraPr>`;
}
```

**Step 2: Generate paragraph properties in header**

```typescript
private generateHeaderXml(): string {
  return `...
    <hh:paraProperties itemCnt="14">
      ${this.generateParaPr('0', '140', '0', '0')}      // Headings
      ${this.generateParaPr('1', '150', '0', '0')}      // Lists level 0
      ${this.generateParaPr('2', '150', '800', '0')}    // Lists level 1
      ${this.generateParaPr('3', '150', '1600', '0')}   // Lists level 2
      ${this.generateParaPr('4', '150', '2400', '0')}   // Lists level 3
      ...
      ${this.generateParaPr('10', '150', '7200', '0')}  // Lists level 9
      ${this.generateParaPr('20', '160', '0', '0')}     // Normal paragraphs
      ${this.generateParaPr('21', '160', '0', '400')}   // Para after heading
    </hh:paraProperties>
  ...`;
}
```

**Step 3: Use paraPrIDRef in generateList**

```typescript
private generateList(
  content: HwpContent,
  vertPos: number,
  isFirst: boolean,
  previousType: string | null,
  level: number = 0
): { xml: string; nextVertPos: number } {
  // Calculate paragraph property ID based on level
  // level 0 → id="1" (no indent)
  // level 1 → id="2" (800 HWPUNIT indent)
  // level 2 → id="3" (1600 HWPUNIT indent), etc.
  const paraPrId = Math.min(level + 1, 10);

  for (const item of content.children) {
    // Handle nested lists recursively
    if (item.type === 'list') {
      const nestedResult = this.generateList(
        item, currentVertPos, first, 'list', level + 1
      );
      // ...
    }

    // Handle bold text in list items
    if (item.children && item.children.length > 0) {
      const textRuns = item.children.map(child => {
        const t = this.escapeXml(child.content || '');
        const cid = child.style?.bold ? '1' : '0';
        return `<hp:run charPrIDRef="${cid}"><hp:t>${t}</hp:t></hp:run>`;
      }).join('');
      runs = `<hp:run charPrIDRef="0"><hp:t>• </hp:t></hp:run>${textRuns}`;
    }

    // Use pre-defined paragraph property with indentation
    const xml = `<hp:p paraPrIDRef="${paraPrId}" ...>${runs}</hp:p>`;
  }
}
```

### Indentation Calculation

| Level | Indent (HWPUNIT) | Approximate (mm) |
|-------|------------------|------------------|
| 0 | 0 | 0 |
| 1 | 800 | ~2.8 mm |
| 2 | 1600 | ~5.6 mm |
| 3 | 2400 | ~8.5 mm |

Each nesting level adds **800 HWPUNIT** of left margin. At 100 HWPUNIT per point (the scale used for the font sizes above), that is 8pt, or roughly 2.8 mm, per level.

### Key Learning

**HWP does NOT support inline `<hp:paraPr>` for indentation.** While the XML allows inline paragraph properties, HWP ignores the left margin settings when specified this way. The indentation MUST be defined using pre-defined paragraph properties in the `<hh:paraProperties>` section of the header, referenced via `paraPrIDRef`.

This is similar to the bold text issue, where we learned that `charPrIDRef` uses position index, not ID value. HWP's format has many such undocumented requirements that can only be discovered by analyzing working files.
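The two lessons above (list levels map to pre-defined `paraPr` entries, and references resolve by position) can be sketched as a few standalone helpers. This is a minimal illustration, not the library's actual code: the function names `listParaPrId`, `listIndentHwpUnit`, `hwpUnitToMm`, and `findIdPositionMismatches` are hypothetical, and the millimetre conversion assumes 7200 HWPUNIT per inch, which is consistent with the 1000 HWPUNIT = 10pt sizes quoted in this document.

```typescript
// Hypothetical helpers; only the constants (800 HWPUNIT per level,
// list paraPr ids 1-10) are taken from this document.
const INDENT_STEP_HWPUNIT = 800;
const MAX_LIST_PARA_PR_ID = 10;
const HWPUNIT_PER_INCH = 7200; // assumption: 100 HWPUNIT = 1pt = 1/72 inch
const MM_PER_INCH = 25.4;

// level 0 -> paraPr id 1, level 1 -> id 2, ... clamped at id 10.
function listParaPrId(level: number): number {
  return Math.min(level + 1, MAX_LIST_PARA_PR_ID);
}

// Left margin of the paraPr that a given nesting level resolves to.
function listIndentHwpUnit(level: number): number {
  return (listParaPrId(level) - 1) * INDENT_STEP_HWPUNIT;
}

// Convert HWPUNIT to millimetres under the 7200-per-inch assumption.
function hwpUnitToMm(value: number): number {
  return (value / HWPUNIT_PER_INCH) * MM_PER_INCH;
}

// Return every position whose id attribute differs from its index.
// An empty result means id-based and position-based lookups agree.
function findIdPositionMismatches(ids: string[]): number[] {
  const mismatches: number[] = [];
  ids.forEach((id, position) => {
    if (id !== String(position)) mismatches.push(position);
  });
  return mismatches;
}

console.log(listParaPrId(12));                         // 10 (clamped)
console.log(listIndentHwpUnit(3));                     // 2400
console.log(hwpUnitToMm(800).toFixed(1));              // "2.8"
console.log(findIdPositionMismatches(['0', '1', '6'])); // [ 2 ]
```

A check like `findIdPositionMismatches` could run over the generated `id` values before writing the header, so that a non-sequential id is caught at build time rather than by opening the file in Hancom Office.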
### Benefits

- ✅ **Nested list parsing** - recursively handles multi-level structures
- ✅ **Visual indentation** - clear hierarchy with proper margins
- ✅ **Bold text in lists** - supports `**label**: value` pattern
- ✅ **Mixed content** - handles bold + normal text in same item
- ✅ **Deep nesting** - distinct indentation for levels 0-9; deeper levels reuse the level-9 indent
- ✅ **Natural appearance** - indentation matches typical document formatting

### Testing

```bash
npm run build
node test-nested-list-convert.js
# Open test-nested-list-output.hwp in Hancom Office
```

Test files:

- `test-nested-list.md` - Markdown with nested lists and bold text
- `test-nested-list-convert.js` - Conversion script
- `test-nested-list-output.hwp` - Generated HWP file

---

## Testing

All improvements have been tested with:

```bash
npm run build
node test-headings-convert.js      # Test heading hierarchy
node test-spacing-convert.js       # Test line spacing
node test-wrapping-convert.js      # Test line wrapping
node test-bold-clear-convert.js    # Test bold text
node test-nested-list-convert.js   # Test nested lists with indentation
```

Test files included:

- `test-headings.md` / `test-headings-output.hwp`
- `test-spacing.md` / `test-spacing-output.hwp`
- `test-wrapping.md` / `test-wrapping-output.hwp`
- `test-bold-clear.md` / `test-bold-clear-output.hwp`
- `test-nested-list.md` / `test-nested-list-output.hwp`

---

## Documentation

Complete documentation available in:

- `docs/HWP_Document_Data_Records.md` - HWPTAG reference
- `docs/HWP_CharShape_Structure.md` - Character properties
- `docs/Line_Wrapping_Fix.md` - Detailed wrapping fix analysis
- `docs/Bold_Text_Implementation.md` - Bold text implementation journey

---

## Summary

### Before

- Flat heading structure (all same size)
- Tight line spacing
- Line wrapping issues (character compression)
- Basic paragraph formatting

### After

- ✅ Professional heading hierarchy (H1-H6)
- ✅ Comfortable line spacing (140%-160%)
- ✅ Natural line wrapping (no compression)
- ✅ Smart spacing between content types
- ✅ Context-aware paragraph selection
- ✅ Proper character spacing settings
- ✅ **Working bold text support** (`**text**`)

### Result

**Professional, readable HWP documents with natural text flow and proper text formatting!** 🎉