Hi everyone,

I’ve been working on a custom Python script for Termux to help me format and organize my literary texts. The idea is to take rough .docx, .pdf, and .txt drafts and automatically convert them into clean, professional EPUB, DOCX, and TXT outputs that are justified, structured, and even analyzed.

It’s called MelkorFormatter-Termux, and it lives in this path (Termux with termux-setup-storage enabled):

/storage/emulated/0/Download/Originales_Estandarizar/

The script reads all supported files from there and generates outputs in a subfolder called salida_estandar/ with this structure:

salida_estandar/
├── principales/
│   ├── txt/
│   │   └── archivo1.txt
│   ├── docx/
│   │   └── archivo1.docx
│   ├── epub/
│   │   └── archivo1.epub
│
├── versiones/
│   ├── txt/
│   │   └── archivo1_version2.txt
│   ├── docx/
│   │   └── archivo1_version2.docx
│   ├── epub/
│   │   └── archivo1_version2.epub
│
├── revision_md/
│   ├── log/
│   │   ├── archivo1_REVISION.md
│   │   └── archivo1_version2_REVISION.md
│
├── logs_md/
│   ├── archivo1_LOG.md
│   └── archivo1_version2_LOG.md
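A minimal sketch of creating that folder tree up front, so that no later write depends on a directory that may not exist yet. The subfolder names come from the tree above; the function name preparar_salida and the SUBDIRS constant are hypothetical and are not taken from formateador.py.

```python
from pathlib import Path

# Subfolders copied from the output tree; the constant name is hypothetical.
SUBDIRS = [
    "principales/txt", "principales/docx", "principales/epub",
    "versiones/txt", "versiones/docx", "versiones/epub",
    "revision_md/log", "logs_md",
]


def preparar_salida(base: Path) -> Path:
    """Create salida_estandar/ and every subfolder; safe to call repeatedly."""
    salida = base / "salida_estandar"
    for sub in SUBDIRS:
        # parents=True builds intermediate folders, exist_ok=True avoids
        # an error when the folder is already there from a previous run.
        (salida / sub).mkdir(parents=True, exist_ok=True)
    return salida
```

Calling preparar_salida(Path("/storage/emulated/0/Download/Originales_Estandarizar")) once at the start of main() would be one way to use it. If logs_md/ sometimes stays empty because a folder was missing at write time, this would rule that cause out, though that is only a guess about the real script.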

What the script is supposed to do

Detect chapters from .docx, .pdf, and .txt using heading styles and regex

Generate:
    .txt with --- FIN CAPÍTULO X --- after each chapter
    .docx with Heading 1, full justification, Times New Roman
    .epub with:
        One XHTML per chapter (capX.xhtml)
        Valid EPUB 3.0.1 files (mimetype, container.xml, content.opf)
        TOC (nav.xhtml)

Analyze the text for:
    Lovecraftian word density (uses a lovecraft_excepciones.txt file)
    Paragraph repetitions
    Suggested title

Classify similar texts as versiones/ instead of principales/

Generate a .md log for each file with all stats
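For the regex side of chapter detection and the --- FIN CAPÍTULO X --- markers, here is a rough sketch. The heading pattern (a line starting with "Capítulo" or "Chapter" followed by an Arabic or Roman number) is an assumption about what the drafts look like, and the names PATRON_CAPITULO, dividir_capitulos, and a_txt are hypothetical.

```python
import re

# Assumed heading shape: "Capítulo 3", "CAPITULO IV: Title", "Chapter 2",
# each at the start of its own line.
PATRON_CAPITULO = re.compile(
    r"^[ \t]*(?:cap[ií]tulo|chapter)\s+(?:\d+|[ivxlcdm]+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def dividir_capitulos(texto: str) -> list[tuple[str, str]]:
    """Return (title, body) pairs.

    Fallback: if no heading matches, the whole text is one untitled chapter.
    Text that appears before the first heading is dropped in this sketch.
    """
    encabezados = list(PATRON_CAPITULO.finditer(texto))
    if not encabezados:
        return [("", texto.strip())]
    capitulos = []
    for i, m in enumerate(encabezados):
        # A chapter body runs from the end of its heading to the next heading.
        fin = encabezados[i + 1].start() if i + 1 < len(encabezados) else len(texto)
        capitulos.append((m.group(0).strip(), texto[m.end():fin].strip()))
    return capitulos


def a_txt(capitulos: list[tuple[str, str]]) -> str:
    """Join chapters into one TXT, closing each with the FIN CAPÍTULO marker."""
    bloques = []
    for n, (titulo, cuerpo) in enumerate(capitulos, start=1):
        partes = [p for p in (titulo, cuerpo) if p]
        partes.append(f"--- FIN CAPÍTULO {n} ---")
        bloques.append("\n\n".join(partes))
    return "\n\n".join(bloques) + "\n"
```

Text extracted from PDFs often has headings merged into the surrounding line or split across lines, so a pattern like this would probably need a looser variant for that case.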

Major Functions (and their purpose)

leer_lovecraft_excepciones() → loads custom Lovecraft terms from file

normalizar_texto() → standardizes spacing/casing for comparisons

extraer_capitulos_*() → parses .docx, .pdf or .txt into chapter blocks

guardar_docx() → generates justified DOCX with page breaks

crear_epub_valido() → builds structured EPUB3 with TOC and split chapters

guardar_log() → generates markdown log (length, density, rep, etc.)

comparar_archivos() → detects versions by similarity ratio

main() → runs everything on all valid files in the input folder
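To make the normalizar_texto() / comparar_archivos() idea concrete, here is a sketch built on difflib from the standard library. The 0.85 threshold, the rule that the longest text in a group counts as the main one, and the helper names similitud and clasificar are all assumptions, not the script's actual logic.

```python
from difflib import SequenceMatcher

UMBRAL_VERSION = 0.85  # assumed threshold; tune against real drafts


def normalizar_texto(texto: str) -> str:
    """Collapse every whitespace run to one space and ignore case."""
    return " ".join(texto.split()).casefold()


def similitud(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized texts."""
    return SequenceMatcher(
        None, normalizar_texto(a), normalizar_texto(b), autojunk=False
    ).ratio()


def clasificar(textos: dict[str, str]) -> dict[str, str]:
    """Map each filename to 'principales' or 'versiones'.

    Files are visited longest first (ties broken by name), so a shorter
    near-duplicate can only ever be compared against a longer main text.
    """
    principales: list[str] = []
    resultado: dict[str, str] = {}
    for nombre in sorted(textos, key=lambda n: (-len(textos[n]), n)):
        es_version = any(
            similitud(textos[nombre], textos[p]) >= UMBRAL_VERSION
            for p in principales
        )
        resultado[nombre] = "versiones" if es_version else "principales"
        if not es_version:
            principales.append(nombre)
    return resultado
```

Sorting before comparing makes the result independent of the order in which the folder is listed, which is one possible source of inconsistent classification. SequenceMatcher is slow on very long texts, so for whole novels it may be worth comparing only a sample of each file.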

What still fails or behaves weirdly

EPUB doesn’t always split chapters
Even if chapters are detected, only one .xhtml gets created. Might be a loop or overwrite issue.

TXT and PDF chapter detection isn’t reliable
Especially in PDFs or texts without strong headings, it fails to detect Capítulo X headers.

Lovecraftian word list isn’t applied correctly
Some known words in the list are missed in the density stats. Possibly a scoping or redefinition issue.

Repetitions used to show up in logs but now don’t
Even obvious paragraph duplicates no longer appear in the logs.

Classification between ‘main’ and ‘version’ isn’t consistent
Sometimes the shorter version is saved as ‘main’ instead of ‘versiones/’.

Logs sometimes fail to save
Especially for .pdf or .txt, the logs_md folder stays empty or partial.
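For the EPUB splitting problem, here is a sketch of a writer that derives a distinct capN.xhtml name per chapter before anything is written, using only zipfile from the standard library. It takes (title, body) pairs as input, which is an assumed data shape; crear_epub and its parameters are hypothetical, and the output has not been run through epubcheck.

```python
import uuid
import zipfile
from datetime import datetime, timezone
from html import escape
from pathlib import Path

PAGINA = """<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>{titulo}</title></head>
<body>
{cuerpo}
</body>
</html>
"""

CONTAINER = """<?xml version="1.0" encoding="utf-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
<rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>
"""


def crear_epub(ruta: Path, titulo: str, capitulos: list[tuple[str, str]],
               idioma: str = "es") -> list[str]:
    """Write an EPUB with one XHTML file per chapter; return the file names."""
    # One unique name per chapter, computed once, so no file overwrites another.
    nombres = [f"cap{n}.xhtml" for n in range(1, len(capitulos) + 1)]
    items = "\n".join(
        f'<item id="cap{n}" href="{nombre}" media-type="application/xhtml+xml"/>'
        for n, nombre in enumerate(nombres, start=1)
    )
    spine = "\n".join(f'<itemref idref="cap{n}"/>' for n in range(1, len(nombres) + 1))
    enlaces = "\n".join(
        f'<li><a href="{nombre}">{escape(tit or nombre)}</a></li>'
        for nombre, (tit, _) in zip(nombres, capitulos)
    )
    modificado = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    opf = f"""<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="id">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:identifier id="id">urn:uuid:{uuid.uuid4()}</dc:identifier>
<dc:title>{escape(titulo)}</dc:title>
<dc:language>{idioma}</dc:language>
<meta property="dcterms:modified">{modificado}</meta>
</metadata>
<manifest>
<item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
{items}
</manifest>
<spine>
{spine}
</spine>
</package>
"""
    nav = PAGINA.format(
        titulo=escape(titulo),
        cuerpo=f'<nav epub:type="toc"><ol>\n{enlaces}\n</ol></nav>',
    )
    with zipfile.ZipFile(ruta, "w", zipfile.ZIP_DEFLATED) as z:
        # The mimetype entry must come first and be stored uncompressed.
        z.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", CONTAINER)
        z.writestr("OEBPS/content.opf", opf)
        z.writestr("OEBPS/nav.xhtml", nav)
        for nombre, (tit, cuerpo) in zip(nombres, capitulos):
            parrafos = "\n".join(
                f"<p>{escape(p.strip())}</p>" for p in cuerpo.split("\n\n") if p.strip()
            )
            pagina = PAGINA.format(
                titulo=escape(tit or nombre),
                cuerpo=f"<h1>{escape(tit)}</h1>\n{parrafos}",
            )
            z.writestr(f"OEBPS/{nombre}", pagina)
    return nombres
```

If the real crear_epub_valido() builds the file name outside the chapter loop, or writes the file only after the loop ends, that would produce exactly the "only one .xhtml" symptom. Comparing the length of the returned list with the number of detected chapters is a quick check.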

What I need help with

If you know Python (file parsing, text processing, EPUB creation), I’d really love your help to:

Debug chapter splitting in EPUB

Improve fallback detection in TXT/PDF

Fix Lovecraft list handling and repetition scan

Make classification logic more consistent

Stabilize log saving
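For the Lovecraft list and the repetition scan, here is a sketch in which the term set is passed in as an argument rather than read from a global, so it cannot be shadowed or redefined somewhere else. It assumes lovecraft_excepciones.txt holds one single-word term per line; the '#' comment handling, the 40-character minimum, and the helper names are all hypothetical.

```python
import re
import unicodedata
from collections import Counter


def _norm(s: str) -> str:
    """Same normalization for list terms and text: NFC form, case-insensitive."""
    return unicodedata.normalize("NFC", s).casefold()


def cargar_terminos(lineas) -> frozenset[str]:
    """Build the term set from an iterable of lines (an open file works)."""
    return frozenset(
        _norm(l.strip()) for l in lineas
        if l.strip() and not l.lstrip().startswith("#")
    )


def densidad(texto: str, terminos: frozenset[str]) -> float:
    """Fraction of words in the text that belong to the term set."""
    palabras = re.findall(r"\w+", _norm(texto))
    if not palabras:
        return 0.0
    return sum(p in terminos for p in palabras) / len(palabras)


def parrafos_repetidos(texto: str, min_chars: int = 40) -> dict[str, int]:
    """Return normalized paragraphs that occur more than once, with counts."""
    parrafos = (_norm(" ".join(p.split())) for p in re.split(r"\n\s*\n", texto))
    conteo = Counter(p for p in parrafos if len(p) >= min_chars)
    return {p: n for p, n in conteo.items() if n > 1}
```

A possible usage would be opening the list with open("lovecraft_excepciones.txt", encoding="utf-8") and passing the file object to cargar_terminos(). Normalizing the list and the text the same way matters for accented words: a term saved with a decomposed accent would otherwise never match the same word typed with a composed one. Multi-word terms are not handled here.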

I’ll reply with the full formateador.py below

It’s around 300 lines, modular, and uses only the standard library plus python-docx and PyMuPDF, with pdfminer as a backup.

You’re welcome to fork, test, fix or improve it. My goal is to make a lightweight, offline Termux formatter for authors, and I’m super close. I just need help with these edge cases.

Thanks a lot for reading!