How to remove all HTML whatsoever from a file?

I have an SVG file that contains some HTML embedded deep in it and I need to remove this HTML to be able to upload the SVG file to a certain image database that doesn’t allow SVG files to contain HTML in them.

In a source code editor the file is so complex and dense, so much like machine-generated code, that I don’t think removing the HTML manually is the right approach.

Oddly enough, the HTML in that file is not wrapped in <html> and <body> tags.


You… do realize that SVG is an XML type, right? As in… if you got rid of all the tags, you wouldn’t have an SVG anymore?

I only mean HTML tags, assuming that there is an easy way to match them.

I am in a sticky situation here where Mermaid chart SVGs contain embedded HTML and because of that HTML I can’t upload them to MediaWiki website media libraries:

How to present Mermaid charts in MediaWiki? on Project:Support desk

The exported SVG appears to use embedded HTML for the chart labels. Removing it would remove your labels. If all you want to do is upload the image to a site, it’d probably be easier to just use the PNG export. If you want to keep using SVG, you’d have to convert the labels to svg text tags.

Oh.

My problems with that are these:

  1. Accessibility-wise, I would prefer SVG because it’s a combination of image and text, which makes it more accessible to blind people.
  2. After I exported it to PNG from https://mermaid.live/, I opened the file, zoomed in, and the image appeared pixelated/smeared.

What can be done in that case?

Perhaps I misunderstand what you are saying, but if I understand correctly, it seems possible to replace the HTML with just the text. I assume that would require finding each relevant element and then running an HTML parser on it. This question has the regex tag, but I would not use a regex for this.
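As a rough illustration of the parser-based approach, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module to find every foreignObject element and collect the plain text inside it. It assumes the exported SVG is well-formed XML; the function name list_foreign_labels and the sample string are made up for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def list_foreign_labels(svg_source: str) -> list[str]:
    """Return the text content of every <foreignObject> in the SVG."""
    root = ET.fromstring(svg_source)
    labels = []
    # Elements are addressed by their namespace-qualified tag name.
    for fo in root.iter(f"{{{SVG_NS}}}foreignObject"):
        # itertext() walks through the nested <div>/<span> and yields only text.
        labels.append("".join(fo.itertext()).strip())
    return labels


# Hypothetical miniature SVG built around the label markup quoted in this thread.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg"><g>'
    '<foreignObject height="24" width="26.75">'
    '<div style="display: inline-block; white-space: nowrap;" '
    'xmlns="http://www.w3.org/1999/xhtml">'
    '<span class="nodeLabel">THIS_IS_MY_MOST_WONDERFUL_LABEL</span>'
    '</div></foreignObject></g></svg>'
)

print(list_foreign_labels(sample))
```

Listing the labels first is a cheap way to check that the parser sees all of them before changing anything in the file.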

Some labels there are written like this:

<foreignObject height="24" width="26.75"><div style="display: inline-block; white-space: nowrap;" xmlns="http://www.w3.org/1999/xhtml"><span class="nodeLabel">THIS_IS_MY_MOST_WONDERFUL_LABEL</span></div></foreignObject>

The use of CSS there indicates that it must be HTML, does it not?

Just download an SVG of the example flowchart at https://mermaid.live and you will get a full picture of how labels are organized there.

Darn, I must somehow have them not in HTML-CSS, but still neatly appearing and easily readable to everyone.

Yes, regex may be redundant here and find-and-replace could help, but from HTML to what? And, perhaps a better question, from CSS to what?

You can start by replacing

<foreignObject height="24" width="26.75"><div style="display: inline-block; white-space: nowrap;" xmlns="http://www.w3.org/1999/xhtml"><span class="nodeLabel">THIS_IS_MY_MOST_WONDERFUL_LABEL</span></div></foreignObject>

with

<text>THIS_IS_MY_MOST_WONDERFUL_LABEL</text>

It won’t be exactly the same, but you can adjust things as necessary. Read the docs on <text> for more info.

Kicken, if it was just one line, sure, but the chart I work with is complex and has lots and lots of such differing lines of foreignObject code.
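For a file with many such foreignObject elements, the same replacement could be scripted instead of done by hand. The following is only a sketch under some assumptions: the SVG is well-formed XML, each foreignObject has plain numeric width and height attributes, and centring the text inside the old label box is acceptable. The function name replace_foreign_objects and the sample string are hypothetical, and fonts and positions would likely still need tuning afterwards.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Serialize SVG elements without an "ns0:" prefix.
ET.register_namespace("", SVG_NS)


def replace_foreign_objects(svg_source: str) -> str:
    """Swap every <foreignObject> for an SVG <text> holding the same label."""
    root = ET.fromstring(svg_source)
    fo_tag = f"{{{SVG_NS}}}foreignObject"
    # ElementTree has no parent pointers, so look at each element's children.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != fo_tag:
                continue
            # Collapse the nested HTML into a single line of plain text.
            label = " ".join("".join(child.itertext()).split())
            text = ET.Element(f"{{{SVG_NS}}}text")
            text.text = label
            text.tail = child.tail
            # Assumption: centre the text in the box the HTML label occupied.
            x = float(child.get("x", "0")) + float(child.get("width", "0")) / 2
            y = float(child.get("y", "0")) + float(child.get("height", "0")) / 2
            text.set("x", f"{x:g}")
            text.set("y", f"{y:g}")
            text.set("text-anchor", "middle")
            text.set("dominant-baseline", "central")
            parent[index] = text  # one-for-one swap keeps sibling order
    return ET.tostring(root, encoding="unicode")


# Hypothetical miniature SVG built around the label markup quoted in this thread.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg"><g>'
    '<foreignObject height="24" width="26.75">'
    '<div style="display: inline-block; white-space: nowrap;" '
    'xmlns="http://www.w3.org/1999/xhtml">'
    '<span class="nodeLabel">THIS_IS_MY_MOST_WONDERFUL_LABEL</span>'
    '</div></foreignObject></g></svg>'
)

print(replace_foreign_objects(sample))
```

To apply it to a real export, the file contents could be read into a string, passed through replace_foreign_objects, and written to a new file so the original stays untouched.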

Here is a code example for a chart with labels; the labels are lowercased.

flowchart TD
    1[aaaaa]
    1 --> 2[bbbbb]
    2 --> A1[ccccc] 
    2 --> A2[ddddd]
    2 --> A3[eeeee]
    2 --> A4[fffff]
    A2 --> A[hhhhh]
    A1 & A2 --> 5[iiiii]
    5 & A4 --> A5[jjjjj]
    5 --> 6[kkkkk]
    6 --> 7[lllll]
    8 --> 9[mmmmm]
    X[nnnnn] --> 11
    7 --> 10[ooooo] & 11[ppppp] & 12[qqqqq] & 14[rrrrr]
    XX[sssss]
    12 & XX --> B[ttttt]
    11 & 12 & 9 --> 13[uuuuu]
    11 & XX --> 15[vvvvv]
    10 & 15 --> 16[wwwww]
    13 --> 17[xxxxx]
    17 --> 18[yyyyy] & 19[zzzzz] & 20[11111] & 21[22222]
    15 & 18 & 21 --> C1[33333]
    C1 --> C2[44444]
    C2 & 16 --> C2_1[55555]
    XXX[66666] & C2 --> C3[77777]
    C3 --> C4[88888]
    C4 --> C2_1
    C4 ---> C5[99999] & C6[a1a1a1a1a1]
    A4 & 16 & C3 --> XXXX[a2a2a2a2a2]
    B & XX & XXXX & 16 & C4 --> I1[a3a3a3a3a3]
    I1 --> I2[a4a4a4a4a4] & I3[a5a5a5a5a5]
    I2 --> I2_2[a6a6a6a6a6]
    I2 --> I2_1[a7a7a7a7a7]
    I2_2 --> I3 & I5
    I3 --> I4[a8a8a8a8a8] --> I5[a9a9a9a9a9]
    I3 --> I6[b1b1b1b1b1] --> I7[b2b2b2b2b2]
    17 & C1 & C5 & C6 --> C7[b3b3b3b3b3]
    15 --> ZZ[b4b4b4b4b4] & I2 --> I8
    I2 & I6 --> I8[b5b5b5b5b5]
    I2 --> I9[b6b6b6b6b6]
    I7 --> E[b7b7b7b7b7]
    17 & C5 & C7 --> C8[b8b8b8b8b8]
    C7 --> C9[b9b9b9b9b9]
    C8 & 19 --> R1[c1c1c1c1c1]

Testing the chart and exporting an SVG is possible at https://mermaid.live

A nice “side discussion” currently being developed at Stack Exchange:

gimp - How to do artifical intelligence “upscaling” to a pixelated or smeared image? - Graphic Design Stack Exchange