Help Needed: EPUB + DOCX Formatter Script for Termux – Almost working but some parts still broken

argosbcn · April 10, 2025, 6:55pm

Hi everyone,
I’ve been working on a custom Python script for Termux to help me format and organize my literary texts. The idea is to take rough .docx, .pdf, and .txt drafts and automatically convert them into clean, professional EPUB, DOCX, and TXT outputs—justified, structured, and even analyzed.

It’s called MelkorFormatter-Termux, and it lives in this path (Termux with termux-setup-storage enabled):

/storage/emulated/0/Download/Originales_Estandarizar/

The script reads all supported files from there and generates outputs in a subfolder called salida_estandar/ with this structure:

salida_estandar/
├── principales/
│   ├── txt/
│   │   └── archivo1.txt
│   ├── docx/
│   │   └── archivo1.docx
│   ├── epub/
│   │   └── archivo1.epub
│
├── versiones/
│   ├── txt/
│   │   └── archivo1_version2.txt
│   ├── docx/
│   │   └── archivo1_version2.docx
│   ├── epub/
│   │   └── archivo1_version2.epub
│
├── revision_md/
│   ├── log/
│   │   ├── archivo1_REVISION.md
│   │   └── archivo1_version2_REVISION.md
│
├── logs_md/
│   ├── archivo1_LOG.md
│   └── archivo1_version2_LOG.md

What the script is supposed to do

Detect chapters from .docx, .pdf, .txt using heading styles and regex
Generate:
- .txt with --- FIN CAPÍTULO X --- after each chapter
- .docx with Heading 1, full justification, Times New Roman
- .epub with:
  - One XHTML per chapter (capX.xhtml)
  - Valid EPUB 3.0.1 files (mimetype, container.xml, content.opf)
  - TOC (nav.xhtml)
Analyze the text for:
- Lovecraftian word density (uses a lovecraft_excepciones.txt file)
- Paragraph repetitions
- Suggested title
Classify similar texts as versiones/ instead of principales/
Generate a .md log for each file with all stats
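
For the TXT side, this is roughly how regex-based chapter detection and the `--- FIN CAPÍTULO X ---` markers could fit together. It is only a sketch: the pattern `CAP_RE` and the helper names `dividir_capitulos_txt` and `unir_con_marcadores` are placeholders I made up, not functions from the script, and the pattern assumes headings sit on their own line.

```python
import re

# Hypothetical pattern: a line starting with "Capítulo"/"Capitulo"/"Cap." followed
# by an Arabic or Roman numeral. Adjust it to the headings the drafts really use.
CAP_RE = re.compile(
    r"^[ \t]*(?:cap[ií]tulo|cap\.?)\s+(?:\d+|[ivxlcdm]+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)

def dividir_capitulos_txt(texto):
    """Split plain text into chapters at each heading line matched by CAP_RE."""
    inicios = [m.start() for m in CAP_RE.finditer(texto)]
    if not inicios:
        # No headings found: treat the whole text as a single chapter.
        return [texto.strip()] if texto.strip() else []
    capitulos = []
    # Text before the first heading (a preface, for example) becomes its own block.
    if texto[:inicios[0]].strip():
        capitulos.append(texto[:inicios[0]].strip())
    for inicio, fin in zip(inicios, inicios[1:] + [len(texto)]):
        capitulos.append(texto[inicio:fin].strip())
    return capitulos

def unir_con_marcadores(capitulos):
    """Join chapters, appending the '--- FIN CAPÍTULO X ---' marker after each one."""
    return "\n\n".join(
        f"{cap}\n--- FIN CAPÍTULO {i} ---" for i, cap in enumerate(capitulos, start=1)
    )
```

Keeping the chapters as a list and adding the markers only when writing the file means the other writers (DOCX, EPUB) never have to parse the markers back out.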

Major Functions (and their purpose)

leer_lovecraft_excepciones() → loads custom Lovecraft terms from file
normalizar_texto() → standardizes spacing/casing for comparisons
extraer_capitulos_*() → parses .docx, .pdf or .txt into chapter blocks
guardar_docx() → generates justified DOCX with page breaks
crear_epub_valido() → builds structured EPUB3 with TOC and split chapters
guardar_log() → generates markdown log (length, density, rep, etc.)
comparar_archivos() → detects versions by similarity ratio
main() → runs everything on all valid files in the input folder
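
Note that `leer_lovecraft_excepciones()` and `normalizar_texto()` do not appear in the code pasted further down, which only has the hard-coded `EXCEPCIONES_LOV` list. A minimal sketch of what they might look like, assuming the file holds one term per line and that lines starting with `#` are comments (both assumptions); `densidad` is a hypothetical helper added only to show the set being used:

```python
import re
import unicodedata
from pathlib import Path

def normalizar_texto(texto):
    """Lowercase, strip accents and collapse whitespace so comparisons are stable."""
    sin_acentos = "".join(
        c for c in unicodedata.normalize("NFD", texto)
        if unicodedata.category(c) != "Mn"  # drop combining marks (accents, tildes)
    )
    return re.sub(r"\s+", " ", sin_acentos).strip().lower()

def leer_lovecraft_excepciones(ruta, por_defecto=()):
    """Return a set of normalized terms: the defaults plus one term per file line."""
    terminos = {normalizar_texto(t) for t in por_defecto}
    ruta = Path(ruta)
    if ruta.is_file():
        for linea in ruta.read_text(encoding="utf-8").splitlines():
            linea = linea.strip()
            if linea and not linea.startswith("#"):
                terminos.add(normalizar_texto(linea))
    return terminos

def densidad(texto, terminos):
    """Percentage of words in texto that belong to terminos (single words only)."""
    palabras = re.findall(r"\w+", normalizar_texto(texto))
    if not palabras:
        return 0.0
    return round(sum(p in terminos for p in palabras) / len(palabras) * 100, 2)
```

Normalizing both the list and the text the same way avoids misses caused by accents or casing. Multi-word terms would still need a different matching strategy, since this compares word by word.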

What still fails or behaves weird

EPUB doesn’t always split chapters
Even if chapters are detected, only one .xhtml gets created. Might be a loop or overwrite issue.
TXT and PDF chapter detection isn’t reliable
Especially in PDFs or texts without strong headings, it fails to detect Capítulo X headers.
Lovecraftian word list isn’t applied correctly
Some known words in the list are missed in the density stats. Possibly a scoping or redefinition issue.
Repetitions used to show up in logs but now don’t
Even obvious paragraph duplicates no longer appear in the logs.
Classification between ‘main’ and ‘version’ isn’t consistent
Sometimes the shorter version is saved as ‘main’ instead of ‘versiones/’.
Logs sometimes fail to save
Especially for .pdf or .txt, the logs_md folder stays empty or partial.
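
On the repetition point: the script defines `REPETIDO_UMBRAL = 0.9` but `detectar_repeticiones()` only counts exact duplicates, so a paragraph that differs by one character is never reported. A rough sketch of a threshold-based scan with `difflib`; the function name `repeticiones_aproximadas` and its parameters are hypothetical:

```python
import difflib

def repeticiones_aproximadas(texto, umbral=0.9, longitud_minima=30):
    """Return (i, j, ratio) for paragraph pairs whose similarity is >= umbral.

    Indices refer to the list of paragraphs that pass the minimum length filter.
    """
    # Collapse whitespace and lowercase so formatting differences do not matter.
    parrafos = [" ".join(p.split()).lower() for p in texto.split("\n\n")]
    parrafos = [p for p in parrafos if len(p) >= longitud_minima]
    pares = []
    for i in range(len(parrafos)):
        for j in range(i + 1, len(parrafos)):
            ratio = difflib.SequenceMatcher(
                None, parrafos[i], parrafos[j], autojunk=False
            ).ratio()
            if ratio >= umbral:
                pares.append((i, j, round(ratio, 2)))
    return pares
```

This compares every pair, so it is quadratic in the number of paragraphs. That should be fine for short stories but may be slow on a phone for a full novel.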

What I need help with

If you know Python (file parsing, text processing, EPUB creation), I’d really love your help to:

Debug chapter splitting in EPUB
Improve fallback detection in TXT/PDF
Fix Lovecraft list handling and repetition scan
Make classification logic more consistent
Stabilize log saving
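
For the classification, one idea I am considering is to group similar files first and then keep only the longest file of each group as the main one. A sketch under that assumption; `clasificar`, `longitudes` and the tie-break by name are all hypothetical choices, not what the script does today:

```python
def clasificar(longitudes, similares):
    """Mark the longest text of each group of similar files as 'principales'.

    longitudes: dict mapping file name -> text length
    similares:  iterable of (name_a, name_b, ratio) pairs above the threshold
    """
    # Union-find over file names so that chains (A~B, B~C) end up in one group.
    padre = {n: n for n in longitudes}

    def raiz(n):
        while padre[n] != n:
            padre[n] = padre[padre[n]]  # path compression
            n = padre[n]
        return n

    for a, b, _ in similares:
        padre[raiz(a)] = raiz(b)

    grupos = {}
    for n in longitudes:
        grupos.setdefault(raiz(n), []).append(n)

    resultado = {}
    for miembros in grupos.values():
        # Longest first; ties are broken by name so the result is deterministic.
        orden = sorted(miembros, key=lambda n: (-longitudes[n], n))
        for k, n in enumerate(orden):
            resultado[n] = "principales" if k == 0 else "versiones"
    return resultado
```

With this approach two files of exactly the same length can no longer both end up as main, and the outcome does not depend on the order in which the files were read.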

I’ll reply with the full `formateador.py` below

It’s around 300 lines, modular, and uses only standard libs + python-docx, PyMuPDF, and pdfminer as backup.

You’re welcome to fork, test, fix or improve it. My goal is to make a lightweight, offline Termux formatter for authors, and I’m super close—just need help with these edge cases.

Thanks a lot for reading!

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# MelkorFormatter-Termux - BLOCK 1: Configuration, Utilities, COMBINED Extraction

import os
import re
import sys
import zipfile
import hashlib
import difflib
from pathlib import Path
from datetime import datetime
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT

# === GLOBAL CONFIGURATION ===
ENTRADA_DIR = Path.home() / "storage" / "downloads" / "Originales_Estandarizar"
SALIDA_DIR = ENTRADA_DIR / "salida_estandar"
REPETIDO_UMBRAL = 0.9
SIMILITUD_ENTRE_ARCHIVOS = 0.85
LOV_MODE = True
EXCEPCIONES_LOV = ["Cthulhu", "Nyarlathotep", "Innsmouth", "Arkham", "Necronomicon", "Shoggoth"]

# === FOLDER STRUCTURE CREATION ===
def preparar_estructura():
    carpetas = {
        "principales": ["txt", "docx", "epub"],
        "versiones": ["txt", "docx", "epub"],
        "logs_md": [],
        "revision_md/log": []
    }
    for base, subtipos in carpetas.items():
        base_path = SALIDA_DIR / base
        if not subtipos:
            base_path.mkdir(parents=True, exist_ok=True)
        else:
            for sub in subtipos:
                (base_path / sub).mkdir(parents=True, exist_ok=True)

# === UTILITY FUNCTIONS ===
def limpiar_texto(texto):
    return re.sub(r"\s+", " ", texto.strip())

def mostrar_barra(actual, total, nombre_archivo):
    porcentaje = int((actual / total) * 100)
    barra = "#" * int(porcentaje / 4)
    sys.stdout.write(f"\r[{porcentaje:3}%] {nombre_archivo[:35]:<35} |{barra:<25}|")
    sys.stdout.flush()

# === COMBINED DOCX CHAPTER DETECTION ===
def extraer_capitulos_docx(docx_path):
    doc = Document(docx_path)
    caps_por_heading = []
    caps_por_regex = []
    actual = []

    # Mode 1: detect chapters by the Heading 1 style
    for p in doc.paragraphs:
        texto = p.text.strip()
        if not texto:
            continue
        # Exact match: the old test ('"1" in name') also matched "Heading 10", "Heading 11", ...
        if p.style.name.lower() == "heading 1":
            if actual:
                caps_por_heading.append(actual)
            actual = [texto]
        else:
            actual.append(texto)
    if actual:
        caps_por_heading.append(actual)

    if len(caps_por_heading) > 1:
        return ["\n\n".join(parrafos) for parrafos in caps_por_heading]

    # Mode 2: fall back to text that looks like "Capítulo X"
    cap_regex = re.compile(r"^(cap[ií]tulo|cap)\s*\d+.*", re.IGNORECASE)
    actual = []
    caps_por_regex = []
    for p in doc.paragraphs:
        texto = p.text.strip()
        if not texto:
            continue
        if cap_regex.match(texto) and actual:
            caps_por_regex.append(actual)
            actual = [texto]
        else:
            actual.append(texto)
    if actual:
        caps_por_regex.append(actual)

    if len(caps_por_regex) > 1:
        return ["\n\n".join(parrafos) for parrafos in caps_por_regex]

    # If everything fails: return the whole text as a single block
    todo = [p.text.strip() for p in doc.paragraphs if p.text.strip()]
    return ["\n\n".join(todo)]

# === SAVE TXT WITH SEPARATORS BETWEEN CHAPTERS ===
def guardar_txt(nombre, capitulos, clasificacion):
    contenido = ""
    for idx, cap in enumerate(capitulos):
        contenido += cap.strip() + f"\n--- FIN CAPÍTULO {idx+1} ---\n\n"
    out = SALIDA_DIR / clasificacion / "txt" / f"{nombre}_TXT.txt"
    out.write_text(contenido.strip(), encoding="utf-8")
    print(f"[✓] TXT guardado: {out.name}")

# === SAVE DOCX, JUSTIFIED AND WITHOUT INDENT ===
def guardar_docx(nombre, capitulos, clasificacion):
    doc = Document()
    doc.add_heading(nombre, level=0)
    doc.add_page_break()
    for i, cap in enumerate(capitulos):
        doc.add_heading(f"Capítulo {i+1}", level=1)
        for parrafo in cap.split("\n\n"):
            p = doc.add_paragraph()
            run = p.add_run(parrafo.strip())
            run.font.name = 'Times New Roman'
            run.font.size = Pt(12)
            p.alignment = WD_PARAGRAPH_ALIGNMENT.JUSTIFY
            p.paragraph_format.first_line_indent = None
        doc.add_page_break()
    out = SALIDA_DIR / clasificacion / "docx" / f"{nombre}_DOCX.docx"
    doc.save(out)
    print(f"[✓] DOCX generado: {out.name}")

# === EPUB GENERATION WITH CHAPTERS AND RESPONSIVE STYLE ===
def crear_epub_valido(nombre, capitulos, clasificacion):
    base_epub_dir = SALIDA_DIR / clasificacion / "epub"
    base_dir = base_epub_dir / nombre
    oebps = base_dir / "OEBPS"
    meta = base_dir / "META-INF"
    oebps.mkdir(parents=True, exist_ok=True)
    meta.mkdir(parents=True, exist_ok=True)

    # mimetype file (must be stored uncompressed in the final archive)
    (base_dir / "mimetype").write_text("application/epub+zip", encoding="utf-8")

    # container.xml
    container = '''<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles><rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/></rootfiles>
</container>'''
    (meta / "container.xml").write_text(container, encoding="utf-8")

    # Local import keeps this fix self-contained; it can move to the top of the file.
    from xml.sax.saxutils import escape

    manifest_items, spine_items, toc_items = [], [], []
    for i, cap in enumerate(capitulos):
        cap_id = f"cap{i+1}"  # renamed from id, which shadows the built-in
        file_name = f"{cap_id}.xhtml"
        title = f"Capítulo {i+1}"
        # Escape &, < and > so chapter text cannot break the XHTML, and build the
        # paragraphs outside the f-string: a backslash inside an f-string
        # expression is a SyntaxError before Python 3.12.
        cuerpo = "</p><p>".join(escape(p.strip()) for p in cap.split("\n\n") if p.strip())
        html = f"""<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{title}</title><meta charset="utf-8"/>
<style>
body {{
  max-width: 40em; width: 90%; margin: auto;
  font-family: Merriweather, serif;
  text-align: justify; hyphens: auto;
  font-size: 1em; line-height: 1.6;
}}
h1 {{ text-align: center; margin-top: 2em; }}
</style>
</head>
<body><h1>{title}</h1><p>{cuerpo}</p></body>
</html>"""
        (oebps / file_name).write_text(html, encoding="utf-8")
        manifest_items.append(f'<item id="{cap_id}" href="{file_name}" media-type="application/xhtml+xml"/>')
        spine_items.append(f'<itemref idref="{cap_id}"/>')
        toc_items.append(f'<li><a href="{file_name}">{title}</a></li>')

    nav = f"""<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops"><head><title>TOC</title></head>
<body><nav epub:type="toc" id="toc"><h1>Índice</h1><ol>{''.join(toc_items)}</ol></nav></body></html>"""
    (oebps / "nav.xhtml").write_text(nav, encoding="utf-8")
    manifest_items.append('<item href="nav.xhtml" id="nav" media-type="application/xhtml+xml" properties="nav"/>')

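    uid = hashlib.md5(nombre.encode()).hexdigest()
    # EPUB 3 requires a dcterms:modified timestamp in UTC (CCYY-MM-DDThh:mm:ssZ).
    from datetime import timezone  # local import; can move to the top of the file
    modificado = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    opf = f"""<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="bookid" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{nombre}</dc:title>
    <dc:language>es</dc:language>
    <dc:identifier id="bookid">urn:uuid:{uid}</dc:identifier>
    <meta property="dcterms:modified">{modificado}</meta>
  </metadata>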
  <manifest>{''.join(manifest_items)}</manifest>
  <spine>{''.join(spine_items)}</spine>
</package>"""
    (oebps / "content.opf").write_text(opf, encoding="utf-8")

    epub_final = base_epub_dir / f"{nombre}_EPUB.epub"
    with zipfile.ZipFile(epub_final, 'w') as z:
        z.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        for folder in ["META-INF", "OEBPS"]:
            for path, _, files in os.walk(base_dir / folder):
                for file in files:
                    full = Path(path) / file
                    z.write(full, full.relative_to(base_dir))
    print(f"[✓] EPUB creado: {epub_final.name}")

# === SIMILARITY BETWEEN FILES ===
def calcular_similitud(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def comparar_archivos(textos):
    comparaciones = []
    for i in range(len(textos)):
        for j in range(i + 1, len(textos)):
            sim = calcular_similitud(textos[i][1], textos[j][1])
            if sim > SIMILITUD_ENTRE_ARCHIVOS:
                comparaciones.append((textos[i][0], textos[j][0], sim))
    return comparaciones

# === INTERNAL REPETITIONS BETWEEN PARAGRAPHS ===
def detectar_repeticiones(texto):
    parrafos = [p.strip().lower() for p in texto.split("\n\n") if len(p.strip()) >= 30]
    frec = {}
    for p in parrafos:
        frec[p] = frec.get(p, 0) + 1
    return {k: v for k, v in frec.items() if v > 1}

# === LOVECRAFTIAN DENSITY ===
def calcular_densidad_lovecraft(texto):
    palabras = re.findall(r"\b\w+\b", texto.lower())
    total = len(palabras)
    # Build the lowercase term set once instead of rebuilding a list for every word.
    terminos = {w.lower() for w in EXCEPCIONES_LOV}
    lov = [p for p in palabras if p in terminos]
    return round(len(lov) / total * 100, 2) if total else 0

# === SUGGESTED TITLE ===
def sugerir_titulo(texto):
    for linea in texto.splitlines():
        if linea.strip() and len(linea.strip().split()) > 3:
            return linea.strip()[:60]
    return "Sin Título"

# === LOG IN .md FORMAT ===
def guardar_log(nombre, texto, clasificacion, similitudes):
    log_path = SALIDA_DIR / "logs_md" / f"{nombre}.md"
    repes = detectar_repeticiones(texto)
    dens = calcular_densidad_lovecraft(texto)
    sugerido = sugerir_titulo(texto)
    palabras = re.findall(r"\b\w+\b", texto)
    unicas = len(set(p.lower() for p in palabras))

    try:
        with open(log_path, "w", encoding="utf-8") as f:
            f.write(f"# LOG de procesamiento: {nombre}\n\n")
            f.write(f"- Longitud: {len(texto)} caracteres\n")
            f.write(f"- Palabras: {len(palabras)}, únicas: {unicas}\n")
            f.write(f"- Densidad Lovecraftiana: {dens}%\n")
            f.write(f"- Título sugerido: {sugerido}\n")
            f.write(f"- Modo: lovecraft_mode={LOV_MODE}\n")
            f.write(f"- Clasificación: {clasificacion}\n\n")

            f.write("## Repeticiones internas detectadas:\n")
            if repes:
                for k, v in repes.items():
                    f.write(f"- '{k[:40]}...': {v} veces\n")
            else:
                f.write("- Ninguna\n")

            if similitudes:
                f.write("\n## Similitudes encontradas:\n")
                for s in similitudes:
                    otro = s[1] if s[0] == nombre else s[0]
                    f.write(f"- Con {otro}: {int(s[2]*100)}%\n")

        print(f"[✓] LOG generado: {log_path.name}")

    except Exception as e:
        print(f"[!] Error al guardar log de {nombre}: {e}")

# === MAIN FUNCTION: FULL PROCESSING ===
def main():
    print("== MelkorFormatter-Termux - EPUBCheck + Justify + Capítulos ==")
    preparar_estructura()
    archivos = list(ENTRADA_DIR.glob("*.docx"))
    if not archivos:
        print("[!] No se encontraron archivos DOCX en la carpeta.")
        return

    # Stage 1: load each file and extract its chapters
    textos = []
    capitulos_por_nombre = {}
    for idx, archivo in enumerate(archivos):
        nombre = archivo.stem
        capitulos = extraer_capitulos_docx(archivo)
        capitulos_por_nombre[nombre] = capitulos
        texto_completo = "\n\n".join(capitulos)
        textos.append((nombre, texto_completo))
        mostrar_barra(idx + 1, len(archivos), nombre)

    print("\n[i] Análisis de similitud entre archivos...")
    comparaciones = comparar_archivos(textos)
    longitudes = {n: len(t) for n, t in textos}

    # Stage 2: per-file processing
    for nombre, texto in textos:
        print(f"\n[i] Procesando: {nombre}")
        # Reuse the chapters from stage 1. The joined text never contains the
        # "--- FIN CAPÍTULO" marker (guardar_txt adds it only on output), so
        # splitting on it always produced a single chapter and a single XHTML.
        capitulos = capitulos_por_nombre[nombre]
        similares = [(a, b, s) for a, b, s in comparaciones if a == nombre or b == nombre]
        clasificacion = "principales"

        # A file is a "version" when a similar file is longer than it.
        for a, b, s in similares:
            otro = b if a == nombre else a
            if len(texto) < longitudes[otro]:
                clasificacion = "versiones"

        print(f"[→] Clasificación: {clasificacion}")
        guardar_txt(nombre, capitulos, clasificacion)
        guardar_docx(nombre, capitulos, clasificacion)
        crear_epub_valido(nombre, capitulos, clasificacion)
        guardar_log(nombre, texto, clasificacion, similares)

    print("\n[✓] Todos los archivos han sido procesados exitosamente.")

# === DIRECT EXECUTION ===
if __name__ == "__main__":
    main()
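
To check whether the chapter split actually reaches the final file, a small helper could list what ended up inside the `.epub` archive. This is a debugging sketch, not part of the script; `inspeccionar_epub` is a hypothetical name, and it assumes the `OEBPS/capX.xhtml` layout described above.

```python
import zipfile

def inspeccionar_epub(ruta_epub):
    """Return basic facts about an EPUB archive for debugging."""
    with zipfile.ZipFile(ruta_epub) as z:
        infos = z.infolist()
        nombres = [i.filename for i in infos]
        primero = infos[0] if infos else None
        return {
            # The mimetype entry should be the first one in the archive...
            "mimetype_primero": primero is not None and primero.filename == "mimetype",
            # ...and it should be stored without compression.
            "mimetype_sin_comprimir": primero is not None
            and primero.compress_type == zipfile.ZIP_STORED,
            # Chapter files, in archive order.
            "capitulos": [
                n for n in nombres
                if n.startswith("OEBPS/cap") and n.endswith(".xhtml")
            ],
            "tiene_nav": "OEBPS/nav.xhtml" in nombres,
            "tiene_opf": "OEBPS/content.opf" in nombres,
        }
```

If `capitulos` comes back with a single entry for a book that has several chapters, the problem is upstream of the EPUB writer, in how the chapter list is passed to it.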
