Coding a Lorem Ipsum Alternative

    David Francis

    Lorem Ipsum generators are well known and are useful for generating text copy during website development. And if you want something that’s a little more to your own taste than pseudo-Latin, SitePoint recently published an article by Craig Buckler which presents ten of the best alternatives to the tried and tested original.

    It’s good that we have a wide selection of text generators, but how exactly are these generators made? Can we use PHP and MySQL to build our own? That’s exactly what we’ll tackle in this article. We won’t develop a fully working website; what we will cover are the essentials for building a site such as Fillerati.

    Sourcing and Extracting Paragraphs

    The project is grouped into just three tasks: sourcing the text content, storing it in a database, and giving front-end access to the content. We’ll take each of these in turn, starting with finding content, and where better to start than Project Gutenberg? Gutenberg offers thousands of public domain texts in various languages, all completely free.

    Unfortunately the HTML formatting is not consistent throughout Gutenberg’s publications; that’s not a criticism of the project, rather it’s an aspect of working with their HTML that we need to be aware of. Some paragraph elements don’t contain useful text at all – they are used merely as spacing between paragraphs. Some paragraphs may be too long for the purpose of providing dummy copy. These are details that we’ll need to code around.

    Why choose HTML rather than plain text if the formatting isn’t consistent? Simple: the HTML version contains markup that identifies paragraphs, and paragraphs are at the heart of this project. It’s not quite as easy as scanning a stream of text for <p> and </p> tags, but it gives us a good head start.

    Data gathering won’t happen often, so we can afford ourselves the luxury of loading the entire file into memory so it’s easier to search for tags and process the text. I’ve selected the HTML copy of On The Origin Of Species by Charles Darwin.

    Once you’ve downloaded the HTML file, it’s a good idea to open it in an editor and peruse the code to see what we’re up against. We can ignore everything before the first chapter heading on line 426, and the whitespace I mentioned earlier should be removed to make processing easier.
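
    One way to discard the front matter is to cut the string at the first chapter heading. The following is a minimal sketch, assuming the chapter headings in your copy of the file start with `<h2` (check this in your editor first); the helper name `skipFrontMatter` is made up for illustration.

```php
<?php
// Hypothetical helper: drop everything before the first occurrence of $marker.
// Assumes chapter headings begin with "<h2"; adjust after inspecting the file.
function skipFrontMatter($html, $marker = '<h2') {
    $pos = stripos($html, $marker);   // case-insensitive, since markup case varies
    if ($pos === false) {
        return $html;                 // marker not found: leave the text untouched
    }
    return substr($html, $pos);
}

// stand-in markup; in practice $sample would be the downloaded file's contents
$sample = "<p>Title page</p>\n<H2>CHAPTER I.</H2>\n<p>Body text</p>";
echo skipFrontMatter($sample), "\n";
```

    Calling this once on the loaded file, before the paragraph scan, means the title page and table of contents never reach the size check.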

    The following is a simple approach for extracting and cleaning the text; it’s a function that’s called in a loop to scan the file and extract paragraphs. Such a loop doesn’t need to be complex.

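    <?php
    // Extract the text between $tag and its closing tag; $html must begin with $tag.
    // Returns the cleaned text and the number of raw characters consumed,
    // or null when no closing tag is found.
    function extractContent($tag, $html) {
        $closeTag = '</' . substr($tag, 1);   // '<p>' becomes '</p>'
        $startPos = strlen($tag);
        $endPos = strpos($html, $closeTag, $startPos);
        if ($endPos === false) {
            return null;
        }
        $raw = substr($html, $startPos, $endPos - $startPos);
        // collapse whitespace, including line breaks, into single spaces
        $text = trim(preg_replace('/\s+/', ' ', $raw));
        return array($text, $endPos + strlen($closeTag));
    }
    
    $htmlFile = 'book.html';   // placeholder: path to the downloaded HTML file
    $html = file_get_contents($htmlFile);
    $limits = array('min' => 200, 'max' => 2000);
    $tag = '<p>';
    $paragraphs = array();
    
    $i = 0;
    while (($pos = strpos($html, $tag, $i)) !== false) {
        $result = extractContent($tag, substr($html, $pos));
        if ($result === null) {
            break;   // unclosed tag: nothing more to extract
        }
        list($text, $consumed) = $result;
        // keep the content if it's a suitable size
        $len = strlen($text);
        if ($len >= $limits['min'] && $len <= $limits['max']) {
            $paragraphs[] = $text;
        }
        // resume scanning just after the closing tag
        $i = $pos + $consumed;
    }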

    A complete book can be scanned for usable paragraphs with minimal coding. This includes a simple check on the size of the paragraph to eliminate anything that’s either too small or too large. This test has the additional benefit of eliminating tags that are used for spacing. To be sure what we have is useful, you can display a sample in the browser or write it to a log file.
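
    A quick way to run that check is to print a few excerpts along with their lengths. This is a sketch only: `previewParagraphs` is an illustrative name, and a small stand-in array takes the place of the extracted `$paragraphs` so the snippet runs on its own.

```php
<?php
// Illustrative helper: return the first $count paragraphs as one-line previews.
function previewParagraphs(array $paragraphs, $count = 3, $width = 60) {
    $lines = array();
    foreach (array_slice($paragraphs, 0, $count) as $n => $text) {
        // strip inline tags so the preview is readable in a log or terminal
        $plain = strip_tags($text);
        $lines[] = sprintf('#%d (%d chars): %s', $n + 1, strlen($text), substr($plain, 0, $width));
    }
    return $lines;
}

// stand-in data; in the article's script this would be the extracted $paragraphs
$sampleParagraphs = array(
    'First <i>sample</i> paragraph.',
    'Second one.',
);
echo implode("\n", previewParagraphs($sampleParagraphs)), "\n";
```

    The same lines can be passed to `error_log()` instead of `echo` if you prefer a log file to the browser.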

    Populate the Database

    The next step is to store these paragraphs in a database. Keep in mind that we’re building the barebones of a Lorem Ipsum system, so there’s no need for a database design like this:

    [Image: lorem-ipsum-alt-01]

    All we really need is one table:

    CREATE TABLE paragraphs (
        id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
        content MEDIUMTEXT NOT NULL,
        PRIMARY KEY (id)
    )

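    In the interest of efficiency, I’ve chosen compact data types for the id and content fields. MEDIUMINT UNSIGNED allows just over 16.7 million rows, and MEDIUMTEXT holds up to 16 MB per row, far more than the 2,000-character limit applied during extraction. For a large, fully-functional database that stores many publications, you may want INTEGER for the id field; for content, even the smaller TEXT type (up to 65,535 bytes) is enough for paragraphs of this size.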

    Now we can insert the paragraphs that we extracted from the HTML file into the database.

    <?php
    $db = new PDO(DBDSN, DBUSER, DBPASS);
    
    $query = $db->prepare('INSERT INTO paragraphs (content) VALUES (:content)');
    $query->bindParam(':content', $content);
    foreach ($paragraphs as $content) {
        $query->execute();
    }

    Depending on the character set and collation you’ve chosen for your database, you may need to convert the paragraph strings before inserting them. This is a niggle of using a third-party data source like Gutenberg: there’s no guarantee that the text uses the same character encoding as your database. The string functions and multi-byte string functions documented in the PHP manual cover the conversions you may need.
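
    As a sketch of one such conversion, the snippet below assumes the source file is ISO-8859-1 and the database expects UTF-8. Both are assumptions, so check the charset declared in the HTML file and the one used by your table. It relies on PHP’s mbstring extension, and `toUtf8` is an illustrative name.

```php
<?php
// Illustrative helper: convert $text to UTF-8 unless it is already valid UTF-8.
// $sourceEncoding is an assumption about the downloaded file; verify it first.
function toUtf8($text, $sourceEncoding = 'ISO-8859-1') {
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;   // already valid UTF-8 (plain ASCII also passes this test)
    }
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

$latin1 = "na\xEFve";                  // "naive" with a diaeresis, as ISO-8859-1 bytes
echo bin2hex(toUtf8($latin1)), "\n";   // hex dump of the converted bytes
```

    Applying the helper to each paragraph just before the insert loop keeps the conversion in one place. With MySQL it is also worth stating the connection’s character set in the PDO DSN (for example `charset=utf8mb4`) so that the client and server agree.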

    A Simple Front-End

    The final step is to access these paragraphs using a front-end in a browser. How the front-end should provide access to the data is limited only by our imaginations. For example, we could retrieve a certain number of paragraphs, or a specified quantity of text, or perhaps a number of characters rounded to the nearest paragraph. We could select consecutive paragraphs, or perhaps we’d be happy with random paragraphs. Whatever we choose, we need a function to read the table.

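    <?php
    function selectParagraph($db, $id) {
        // %d forces the id to an integer, so it is safe to place in the SQL
        $sql = sprintf('SELECT content FROM paragraphs WHERE id = %d', $id);
        $result = $db->query($sql);
        $row = $result->fetch(PDO::FETCH_ASSOC);
        $result->closeCursor();
        // fetch() returns false when no row has this id
        return $row ? $row['content'] : '';
    }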
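
    To illustrate the “number of characters rounded to the nearest paragraph” option mentioned above, here is a sketch that works on an in-memory array of consecutive paragraphs. The function name `takeByLength` and the rounding rule are assumptions made for this example; they are not part of the article’s code.

```php
<?php
// Illustrative helper: take consecutive paragraphs until the running total is
// as close as possible to $targetChars, returning at least one if any exist.
function takeByLength(array $paragraphs, $targetChars) {
    $selected = array();
    $total = 0;
    foreach ($paragraphs as $text) {
        $len = strlen($text);
        // stop if adding this paragraph would not bring us closer to the target
        if ($selected && abs($total + $len - $targetChars) >= abs($total - $targetChars)) {
            break;
        }
        $selected[] = $text;
        $total += $len;
    }
    return $selected;
}

// stand-in paragraphs of 300, 500 and 400 characters
$demoParagraphs = array(str_repeat('a', 300), str_repeat('b', 500), str_repeat('c', 400));
$excerpt = takeByLength($demoParagraphs, 700);
echo count($excerpt), ' paragraphs, ', strlen(implode('', $excerpt)), " characters\n";
```

    In a full application the array would come from a query for consecutive rows rather than from stand-in strings.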

    For demonstration purposes, the algorithm I’ll present uses a simple random number generator to select paragraphs from the database. It needs to know the maximum ID value for the paragraph records, hence the $maxID variable (this assumes the ID values are contiguous).

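    <form method="post">
     <label for="slider">How many paragraphs do you want?</label>
     <input type="range" min="1" max="4" step="1" name="slider" id="slider">
     <input type="submit" name="submit" value="Get Excerpt">
    </form>
    <?php
    if (isset($_POST['slider'])) {
        // clamp the request to the slider's range; raw input could be negative or huge
        $i = max(1, min(4, (int) $_POST['slider']));
        while ($i--) {
            $id = rand(1, $maxID);
            $paragraph = selectParagraph($db, $id);
            // the stored text is a fragment of the book's own HTML, so it is echoed as-is
            echo '<p>' . $paragraph . '</p>';
        }
    }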
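
    The `$maxID` value used above can be read from the table itself. The sketch below also shows one way to cope with gaps in the ID sequence: pick the first row at or above the random ID. It uses an in-memory SQLite database through PDO purely so that it runs without a MySQL server; that is an assumption for the demo and needs the pdo_sqlite extension, while the queries are plain SQL that MySQL also accepts. `randomParagraph` is an illustrative name.

```php
<?php
// Demo database: in-memory SQLite stands in for the article's MySQL table.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE paragraphs (id INTEGER PRIMARY KEY, content TEXT NOT NULL)');
$db->exec("INSERT INTO paragraphs (id, content) VALUES (1, 'one'), (2, 'two'), (5, 'five')");

// The highest ID in the table (0 if the table is empty).
$maxID = (int) $db->query('SELECT MAX(id) FROM paragraphs')->fetchColumn();

// Illustrative helper: pick a random ID, then take the first row at or above it,
// so a gap (IDs 3 and 4 here) still yields a paragraph. Assumes $maxID >= 1.
function randomParagraph($db, $maxID) {
    $query = $db->prepare('SELECT content FROM paragraphs WHERE id >= :id ORDER BY id LIMIT 1');
    $query->bindValue(':id', mt_rand(1, $maxID), PDO::PARAM_INT);
    $query->execute();
    return $query->fetchColumn();
}

echo $maxID, ': ', randomParagraph($db, $maxID), "\n";
```

    Note that rows following a gap are picked more often with this approach, which is usually acceptable for filler text.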

    And that’s the final piece of the project!

    Summary

    In this article we’ve covered the essential aspects of building an alternative to the popular Lorem Ipsum text generator. How complex we make it, how many publications and authors we include, how stylish we make the front-end, and whether we limit our choice of text to a specific genre, is entirely open to personal choice. But the essential elements will all be similar to what we’ve covered here, and all built using a smattering of PHP and MySQL. Easy!

    Code to accompany this article can be found on GitHub. Feel free to clone it and expand on it.

    Image via Fotolia