Split paragraph into sentences script (Greasemonkey)

Hello, I am a newbie looking to accomplish a task with JavaScript and need help putting together a script.

I want to take the paragraphs of the main text (body) on specific webpages, place each sentence on its own line, and wrap each sentence in paragraph tags (so each sentence would be a paragraph, single spaced from the others). I am using a Firefox addon called ‘GreaseMonkey’, which lets a user execute JS on webpages automatically.

I have found several JS ‘split paragraph’ regexes on the internet but am having trouble creating the whole script and getting the right output. I think grabbing the text of the body is the problem, but I'm not sure.

I have used the Firefox inspector and tried to run this regex on document.body.innerText. I can show the text in a window.alert message box, but can't seem to execute the “split paragraph into sentences” regex on it…

here are the resources:
the webpage is: https://web.archive.org/web/20110518220745fw_/http://cisco-futures.com/tva_background.html
the regex I found is: str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|"); (not sure if it works, I can't test it, I don't get any output)

document.body.innerText.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|");

example text:
Value Analytics is a melding of CISCO’s market condition methodology with a special blend of Market Profile. One oriented to quantitative evaluation of reference points. Market Profile concept and methodology has been around since 1985. Standard Market Profile is based on pattern recognition. Then holistically including the reference points. Market reading of who is doing what and why of reading Market Profile.

expected output:
Value Analytics is a melding of CISCO’s market condition methodology with a special blend of Market Profile.
One oriented to quantitative evaluation of reference points.
Market Profile concept and methodology has been around since 1985.
Standard Market Profile is based on pattern recognition.
Then holistically including the reference points.
Market reading of who is doing what and why of reading Market Profile.
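
The regex quoted above can be checked on its own before involving the page. A minimal sketch, assuming Node or the browser console; `splitSentences` and `sample` are names made up for illustration, and the sample is a shortened piece of the example text:

```javascript
// Hypothetical helper: split text into sentences using the regex from the post.
// It puts a "|" marker after every ., ? or ! that is followed by a capital
// letter, then splits on that marker.
function splitSentences(text) {
  return text.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|");
}

const sample =
  "Market Profile concept and methodology has been around since 1985. " +
  "Standard Market Profile is based on pattern recognition. " +
  "Then holistically including the reference points.";

const sentences = splitSentences(sample);
console.log(sentences.join("\n")); // one sentence per line
```

Two caveats with this regex: text that already contains a "|" is split there as well, and because `\s*` also matches zero spaces, abbreviations such as "U.S.A" are cut apart.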

thanks for any assistance

Hi @asa32sd23 and welcome to the forums, if I understand well… wouldn’t something like this give you the result you are after? (I have not tested it)

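// Read the visible text of the whole page (markup is stripped)
var text = document.body.innerText;
// Split on full stops; each piece is one candidate sentence
var sentences = text.split('.');
var lines = [];
for (var i = 0; i < sentences.length; i++) {
    // Remove surrounding whitespace from each sentence
    lines.push(sentences[i].trim());
}
// Rejoin with a full stop and a line break after each sentence
// (newText is only built here; nothing is written back to the page)
var newText = lines.join('.\r\n');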

Thanks, I tried it on my webpage using the ‘GreaseMonkey’ Firefox addon, and nothing changes. I am expecting the webpage to display each sentence on its own line, but nothing happens. I'm doubtful whether document.body.innerText is correct, but if I use window.alert(), it displays the text. Not sure how to troubleshoot it…
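
One possible explanation for seeing the text in an alert but not on the page is that the snippet only builds a string and never writes it back to the document. A minimal sketch of that missing step, assuming a plain-text result is acceptable; `showLines` is a hypothetical helper, and the `white-space: pre-line` style is what makes the browser display `\n` characters as line breaks:

```javascript
// Hypothetical helper: write newline-separated text into an element.
// Setting textContent discards any markup inside the element, so this
// only suits elements whose content can become plain text.
function showLines(element, lines) {
  element.style.whiteSpace = "pre-line"; // render "\n" as a visible line break
  element.textContent = lines.join("\n");
  return element;
}

// Only touch the page when a DOM is available (for example in a userscript).
// Warning: this replaces the entire page body with plain text.
if (typeof document !== "undefined") {
  const pageSentences = document.body.innerText
    .split(".")
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s.length > 0; })
    .map(function (s) { return s + "."; });
  showLines(document.body, pageSentences);
}
```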

the regex i got from here
and here is another example

Again, I just can't get anything to display on the webpage that I want changed; see the original post for the URL.

Apologies, misunderstood somewhere… so do you want to replace the text that was there before or just add it to the page?

The text that is currently on the webpage is what I want to replace with this script: take every multi-line paragraph and place each sentence on its own line so it's easier to read. I know that once I navigate away it will revert to the original, but this ‘GreaseMonkey’ script will run the JS every time I go to this webpage (I plan to add the script to all pages within this domain).

I think it cannot be reliably done without probably killing the site’s functionality and styling while you are at it. innerText gives you the text inside an HTML element, but it strips out all the HTML markup, which carries the hooks for the styles and JavaScript functionality. I cannot think right now of an unobtrusive solution but will keep you posted if anything comes to mind. Maybe someone else has better ideas.

What about looping over the main body's paragraph tags, then looping over the sentences, then applying paragraph tags to each sentence… wouldn't that place each sentence on its own line?

Again, this will run from a Firefox addon that runs JS scripts on webpages; it's for personal, temporary use. It's just so I can read the text more easily. I'm not the owner of the site.
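
The idea described here (loop over paragraphs, split each into sentences, wrap every sentence in its own paragraph tag) could be sketched roughly as follows. This is untested against the archived page; `escapeHtml` and `sentencesToParagraphs` are illustrative names, the regex is the one from the first post, and any inline markup inside a paragraph (links, bold text) is lost because only the text is kept:

```javascript
// Hypothetical helper: escape characters that have special meaning in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: turn one paragraph of plain text into one <p> per sentence.
function sentencesToParagraphs(text) {
  return text
    .replace(/([.?!])\s*(?=[A-Z])/g, "$1|")
    .split("|")
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s.length > 0; })
    .map(function (s) { return "<p>" + escapeHtml(s) + "</p>"; })
    .join("\n");
}

if (typeof document !== "undefined") {
  // querySelectorAll returns a static list, so replacing elements
  // while looping does not disturb the iteration.
  document.querySelectorAll("p").forEach(function (p) {
    p.outerHTML = sentencesToParagraphs(p.textContent);
  });
}
```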

It’s been a while since I wrote a GreaseMonkey script. When working with HTML outside of my control I always needed to write bespoke code to deal with “problems”.

For example, the page you linked to has errors
https://validator.w3.org/nu/?doc=https%3A%2F%2Fweb.archive.org%2Fweb%2F20110518220745fw_%2Fhttp%3A%2F%2Fcisco-futures.com%2Ftva_background.html

Maybe not all errors caused me problems, but invalid HTML on the pages I was interested in always did. The usual fix involved fixing unmatched and improperly nested tags. Other errors like obsolete / incorrect tags and attributes caused problems at times too but not as often as poor pairing and nesting.

At this point I think you would make better progress with your userscript if you started with a page that had only enough text content for testing purposes and had valid HTML.

Thanks for the insight. Yes, the webpage is from an older website that actually doesn't exist anymore. If you notice, the URL contains 'web.archive.org'; it's on the Wayback Machine, and the site is old, from 2003-2006 (but this is the site I need to work on).

Using the script above I can actually output the parsed line-by-line text to an alert window, but can't get the website to display the reformatted text.

Does the idea of taking each sentence and wrapping it with <p> tags make sense? I found this link where they are attempting to do it (I think; my knowledge is so limited that I'm guessing a lot).

here is a screenshot of using the FF inspector

I almost always try to work with individual problems one at a time. It helps me if I list them eg.

goal: make cisco pages easier to read by displaying sentences as block not inline.

unordered spitballs:

  • change CSS rules
  • split strings on … what? words, characters?
  • split strings using regex

procedure:

  1. access DOM
  2. parse DOM
  3. figure out how to get what I need: the copy text (but not buttons etc.)
  4. figure out how to manipulate the text
  5. figure out how to display the altered content

I know that’s a lot of "figure out"s but that’s how I start.

That suggests you could put some tentative checkmarks next to some of the todos :+1:

To display in the page instead of in an alert or the console log I prefer to “createElement” and put it somewhere into the page in addition to the original. I have a feeling you want to replace the original instead which should be doable as long as nothing else is dependent on it being there.

yes i want to replace the text with this newly formatted text, how would i use the createElement?
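
A rough sketch of how createElement might be used for this, under the assumption that the first paragraph on the page is the target; `buildSentenceBlock` is a hypothetical helper, and the regex is the one from the first post:

```javascript
// Hypothetical helper: build a <div> holding one <p> per sentence.
// `doc` is the page's document object; passing it in keeps the function testable.
function buildSentenceBlock(doc, sentences) {
  const container = doc.createElement("div");
  sentences.forEach(function (sentence) {
    const p = doc.createElement("p");
    p.textContent = sentence; // textContent never interprets the text as HTML
    container.appendChild(p);
  });
  return container;
}

if (typeof document !== "undefined") {
  const original = document.querySelector("p"); // assumption: first <p> is the target
  if (original) {
    const sentences = original.textContent
      .replace(/([.?!])\s*(?=[A-Z])/g, "$1|")
      .split("|");
    // replaceWith swaps the original element for the new block;
    // original.after(block) would keep both on the page instead.
    original.replaceWith(buildSentenceBlock(document, sentences));
  }
}
```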

I came up with something that may work but I did not get a chance to test it. It might get you somewhere hopefully:

// select all document HTML elements where we want to replace the text (add as required)
var textNodes = document.querySelectorAll('p, font');

// loop through each of the elements and replace the text
for(var i = 0; i < textNodes.length; i++) {
    replaceText(textNodes[i]);
}

function replaceText(node) {
    // innerHTML returns the contents of the node including HTML, versus innerText which will strip-out any HTML mark-up
    var text = node.innerHTML;
    // Split the text into an array of sentences on full stops. Requiring a space after the dot avoids splitting inside acronyms.
    var sentences = text.split('. ');
    var lines = [];
    // we loop through each sentence
    for(var i = 0; i < sentences.length; i++) {
        // we trim any extra white space from the line and push it to the lines array
        lines.push(sentences[i].trim());
    }
    // we join the lines array into a string with line-breaks
    var newText = lines.join('.<br/>');
    // we replace the old text with the new text
    node.innerHTML = newText;
}

Thanks Andres, this is a good start and I have been troubleshooting it. It doesn't work yet, but I am testing it on several sites. It seems that sites use, for example, 'p, a, font' elements for text in various places, and not just for what's in the main body or main content.

after doing a little inspection i think limiting it to specific classes would isolate the main body text.

Could you add the ability to limit it to a specific class in the first line…
var textNodes = document.querySelectorAll('p, font');

else, is there a JS method that just returns the main html text body?

Thanks for the help with this.

Yes you could:

var textNodes = document.querySelectorAll('p, font, .my-class');
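
Worth noting: a comma in a selector means "or", so the line above matches every p, every font, and every element with that class. To restrict the match to elements inside one container, a descendant selector can be used instead. A small sketch, assuming the main text sits inside an element with the placeholder class `my-class`; `scopedSelector` is an illustrative helper:

```javascript
// Hypothetical helper: build a selector that matches the given tags
// only when they sit inside the container.
function scopedSelector(container, tags) {
  return tags
    .map(function (tag) { return container + " " + tag; })
    .join(", ");
}

const selector = scopedSelector(".my-class", ["p", "font"]);
console.log(selector); // prints: .my-class p, .my-class font

if (typeof document !== "undefined") {
  const scopedNodes = document.querySelectorAll(selector);
  console.log(scopedNodes.length + " elements matched");
}
```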

The approach I gave you is not perfect and I would advise to use it only on elements that only contain text or very limited HTML mark-up… The reason being is that through those selectors you might be iterating over nested elements multiple times producing unexpected results…

I advise against this approach because:

  • If you use innerHTML you will have to parse the whole HTML and any regex to deal with that would be a nightmare and very error prone, and the script I gave you would fail in the instances where it finds a . within the HTML and completely mess up your page.
  • If you use innerText you will be stripping out the whole HTML and with it the JavaScript functionality and the styles, so the site would just become plain text which I guess is not what you want.
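
A possible middle ground between the two is to leave the elements alone and edit only the text nodes inside them, so tags and attributes are never handled as strings. A rough sketch, untested against the archived page; `splitSentences` and `breakUpTextNodes` are illustrative names, the 'p, font' selector is carried over from the script above, and the boundary regex requires at least one space after the punctuation (a deliberate change from the regex in the first post):

```javascript
// Hypothetical helper: split text at ". ", "? " or "! " followed by a capital.
// No marker character is inserted, so a "|" in the text is left untouched.
function splitSentences(text) {
  const parts = [];
  const boundary = /[.?!]\s+(?=[A-Z])/g;
  let start = 0;
  let match;
  while ((match = boundary.exec(text)) !== null) {
    parts.push(text.slice(start, match.index + 1)); // keep the punctuation
    start = match.index + match[0].length;          // skip the whitespace
  }
  parts.push(text.slice(start));
  return parts;
}

// Hypothetical helper: put a <br> between the sentences of each text node.
function breakUpTextNodes(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const nodes = [];
  while (walker.nextNode()) {
    nodes.push(walker.currentNode); // collect first, modify afterwards
  }
  nodes.forEach(function (node) {
    if (/^(script|style)$/i.test(node.parentNode.nodeName)) return;
    const parts = splitSentences(node.nodeValue);
    if (parts.length < 2) return; // nothing to split
    const fragment = document.createDocumentFragment();
    parts.forEach(function (part, index) {
      if (index > 0) fragment.appendChild(document.createElement("br"));
      fragment.appendChild(document.createTextNode(part));
    });
    node.parentNode.replaceChild(fragment, node);
  });
}

if (typeof document !== "undefined") {
  document.querySelectorAll("p, font").forEach(breakUpTextNodes);
}
```

Because a text node that holds a single sentence is skipped, visiting a nested element a second time should leave it unchanged.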

Thanks so much Andres, it's slowly getting there. The 2nd to last line tells the script to join using .\r\n
var newText = lines.join('.\r\n');

If I use a window alert to display newText, I see it does put each sentence on a new line. I even changed it to \n\n for double spacing and it shows in the popup window.

BUT, on screen, it never updates the page's HTML to place each sentence on a new line… When I changed it to \n\n from .\r\n I noticed the period in the page's HTML was removed. So it is changing the page, but not by adding a new line; the layout just stays the same.

Instead of joining the sentences with a line break, is it possible to wrap each sentence in <p> paragraph tags and then write that to the page? I'd rather have each sentence as a paragraph for other reasons, and maybe that can also solve this problem… Thanks again for the attention.

Hi, the \r\n was a mistake I made at first, but I had already edited the post and changed it to <br/>, which is an HTML line break, versus the other one, which you normally use in source code. Check my previous post again.
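
The difference matters because HTML rendering collapses whitespace: a \r\n written through innerHTML is displayed like an ordinary space, while <br/> forces a visible break. A small sketch contrasting the two joins; `joinForAlert` and `joinForPage` are illustrative names:

```javascript
// Newlines show up in alert() or the console, but HTML rendering collapses them.
function joinForAlert(lines) {
  return lines.join(".\r\n");
}

// <br/> is an HTML element, so the browser renders it as a line break.
function joinForPage(lines) {
  return lines.join(".<br/>");
}

const lines = "One. Two. Three".split(". ");
console.log(joinForAlert(lines)); // three lines in the console
console.log(joinForPage(lines));  // prints: One.<br/>Two.<br/>Three
```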

thanks Andres, works now… i appreciate you helping me out with this!
