Coding a Lorem Ipsum Alternative

Key Takeaways

Building a Lorem Ipsum alternative involves sourcing text content, storing it in a database, and creating a front-end to access the content. A good source of public domain texts is Project Gutenberg.
The text content can be extracted from an HTML file using PHP. The extracted content should be cleaned and checked for size to ensure it is suitable for use as dummy copy.
The extracted paragraphs are then stored in a MySQL database. The database design for a basic Lorem Ipsum system requires only one table for storing the paragraph content.
The final step involves creating a front-end in a browser to access the stored paragraphs. This can be as simple or complex as desired, with options to retrieve a certain number of paragraphs, a specified quantity of text, or a number of characters rounded to the nearest paragraph.

Lorem Ipsum generators are well known and are useful for generating text copy during website development. And if you want something that’s a little more to your own taste than pseudo-Latin, SitePoint recently published an article by Craig Buckler which presents ten of the best alternatives to the tried and tested original.

It’s good that we have a wide selection of text generators, but how exactly are these generators made? Can we use PHP and MySQL to build our own? That’s exactly what we’ll tackle in this article. We won’t develop a fully working website; what we will cover are the essentials for building a site such as Fillerati.

Sourcing and Extracting Paragraphs

The project is grouped into just three tasks: sourcing the text content, storing it in a database, and giving front-end access to the content. We’ll take each of these in turn, starting with finding content, and where better to start than Project Gutenberg? Gutenberg offers thousands of public domain texts in various languages, all completely free.

Unfortunately the HTML formatting is not consistent throughout Gutenberg’s publications; that’s not a criticism of the project, rather it’s an aspect of working with their HTML that we need to be aware of. Some paragraph elements don’t contain useful text at all – they are used merely as spacing between paragraphs. Some paragraphs may be too long for the purpose of providing dummy copy. These are details that we’ll need to code around.

Why choose HTML rather than plain text if the formatting isn’t consistent? Simple: the HTML version contains markup that identifies paragraphs, and paragraphs are at the heart of this project. It’s not quite as easy as scanning a stream of text for <p> and </p> tags, but it gives us a good head start.
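
As an aside, string scanning is not the only way to find paragraphs. The following is a minimal sketch of an alternative that uses PHP's DOM extension. It is not the approach taken in this article, and the function name extractParagraphsWithDom is made up for illustration. It assumes the DOM extension is enabled and that dropping inline tags such as <i> is acceptable.

```php
<?php
// Sketch of an alternative to string scanning: let PHP's DOM extension
// find the <p> elements, including ones that carry attributes.
// extractParagraphsWithDom is a hypothetical name, not part of the article's code.
function extractParagraphsWithDom(string $html): array
{
    // Collect parser warnings about untidy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = [];
    foreach ($doc->getElementsByTagName('p') as $node) {
        // textContent drops inline tags; collapse runs of whitespace as well.
        $text = trim(preg_replace('/\s+/', ' ', $node->textContent));
        if ($text !== '') {
            $paragraphs[] = $text; // skip paragraphs used only for spacing
        }
    }
    return $paragraphs;
}

$html = '<html><body><p class="intro">Hello <i>world</i></p><p> </p></body></html>';
print_r(extractParagraphsWithDom($html));
```

The trade-off is that a DOM parser copes with attributes and nesting, at the cost of building a full document tree in memory.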

Data gathering won’t happen often, so we can afford ourselves the luxury of loading the entire file into memory so it’s easier to search for tags and process the text. I’ve selected the HTML copy of On The Origin Of Species by Charles Darwin.

Once you’ve downloaded the HTML file, it’s a good idea to open it in an editor and peruse the code to see what we’re up against. We can ignore everything before the first chapter heading on line 426, and the whitespace I mentioned earlier should be removed to make processing easier.
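
Skipping the front matter can also be done in code rather than by hand. Here is a minimal sketch, assuming the first chapter heading can be recognised by a marker string such as its opening tag. The helper name skipFrontMatter and the sample markup are hypothetical; check the file you downloaded for a suitable marker.

```php
<?php
// Hypothetical helper: drop everything that precedes the first occurrence
// of $marker (for example, the opening tag of the first chapter heading).
function skipFrontMatter(string $html, string $marker): string
{
    $pos = stripos($html, $marker); // case-insensitive search
    // If the marker is missing, return the document unchanged.
    return $pos === false ? $html : substr($html, $pos);
}

// Made-up sample markup, standing in for the downloaded file.
$sample = "<p>Licence text</p>\n<h2>CHAPTER I.</h2>\n<p>First paragraph.</p>";
echo skipFrontMatter($sample, '<h2>');
```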

The following is a simple approach for extracting and cleaning the text; it’s a function that’s called in a loop to scan the file and extract paragraphs. Such a loop doesn’t need to be complex.

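<?php
// Extract the text between $tag and its closing tag; $html must start with $tag.
// Returns the number of raw characters consumed and the cleaned text.
function extractContent($tag, $html) {
    $closeTag = substr($tag, 0, 1) . '/' . substr($tag, 1); // '<p>' => '</p>'
    $startPos = strlen($tag);
    $endPos = strpos($html, $closeTag);
    if ($endPos === false) {
        // unclosed tag: take the rest of the document
        $endPos = strlen($html);
        $consumed = $endPos;
    } else {
        $consumed = $endPos + strlen($closeTag);
    }
    $text = substr($html, $startPos, $endPos - $startPos);
    // collapse every run of whitespace, line breaks included, into one space
    return array($consumed, trim(preg_replace('/\s+/', ' ', $text)));
}

// $htmlFile holds the path to the downloaded HTML file
$html = file_get_contents($htmlFile);
$limits = array('min' => 200, 'max' => 2000);
$tag = '<p>';
$paragraphs = array();
$i = 0;
while (($pos = strpos($html, $tag, $i)) !== false) {
    list($consumed, $text) = extractContent($tag, substr($html, $pos));
    // keep the content if it's a suitable size
    $len = strlen($text);
    if ($len >= $limits['min'] && $len <= $limits['max']) {
        $paragraphs[] = $text;
    }
    // resume scanning just after the closing tag
    $i = $pos + $consumed;
}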

A complete book can be scanned for usable paragraphs with minimal coding. This includes a simple check on the size of the paragraph to eliminate anything that’s either too small or too large. This test has the additional benefit of eliminating tags that are used for spacing. To be sure what we have is useful, you can display a sample in the browser or write it to a log file.
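
One way to check the result before touching the database is to print a short summary. The sketch below is only an illustration: summariseParagraphs is a hypothetical helper, and the sample strings stand in for real extracted paragraphs.

```php
<?php
// Hypothetical helper: summarise extracted paragraphs so they can be
// inspected before anything is written to the database.
function summariseParagraphs(array $paragraphs): array
{
    if ($paragraphs === []) {
        return ['count' => 0, 'shortest' => 0, 'longest' => 0];
    }
    $lengths = array_map('strlen', $paragraphs);
    return [
        'count'    => count($paragraphs),
        'shortest' => min($lengths),
        'longest'  => max($lengths),
    ];
}

// Placeholder data standing in for the $paragraphs array built earlier.
$sampleParagraphs = [str_repeat('a', 250), str_repeat('b', 1200)];
echo json_encode(summariseParagraphs($sampleParagraphs)), PHP_EOL;
// Show the start of the first paragraph as a quick visual check.
echo substr($sampleParagraphs[0], 0, 80), PHP_EOL;
```

If the shortest and longest values fall outside the limits you set, the size check is not doing what you expect.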

Populate the Database

The next step is to store these paragraphs in a database. Keep in mind that we’re building the barebones of a Lorem Ipsum system, so there’s no need for a database design like this:

[Image: lorem-ipsum-alt-01, an example of a more elaborate database design]

All we really need is one table:

CREATE TABLE paragraphs (
    id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    content MEDIUMTEXT NOT NULL,
    PRIMARY KEY (id)
)

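In the interest of efficiency, I’ve chosen compact data types for the id and content fields. MEDIUMINT UNSIGNED allows roughly 16.7 million rows; for a large, fully-functional database that stores many publications, you may want to use INTEGER for the id. MEDIUMTEXT holds up to 16 MB per value, which is already more than TEXT (64 KB), and since the extraction code caps paragraphs at 2,000 characters, TEXT would be sufficient for the content.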

Now we can insert the paragraphs that we extracted from the HTML file into the database.

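<?php
// DBDSN, DBUSER and DBPASS are constants holding your connection details
$db = new PDO(DBDSN, DBUSER, DBPASS);
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$query = $db->prepare('INSERT INTO paragraphs (content) VALUES (:content)');
// bindParam() binds $content by reference, so each pass of the loop
// inserts the paragraph currently held in $content
$query->bindParam(':content', $content);
foreach ($paragraphs as $content) {
    $query->execute();
}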

Depending on the character set and collation you’ve chosen for your database, you may need to convert the paragraph strings before inserting them. This is a niggle of using a third-party data source like Gutenberg: there’s no guarantee that the text uses the same character encoding as your database. The string functions and multi-byte string functions documented in the PHP manual can help with the conversion.
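
As a rough illustration of such a conversion, the sketch below uses the mbstring extension to make sure a paragraph is valid UTF-8 before it is inserted. The helper name toUtf8 and the default source encoding of ISO-8859-1 are assumptions; check the charset declared in the HTML file you downloaded and the character set of your own table.

```php
<?php
// Hypothetical helper: return $text as UTF-8.
// $sourceEncoding is an assumption about the downloaded file, not a given.
function toUtf8(string $text, string $sourceEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

$latin1 = "caf\xE9"; // the word "café" encoded as ISO-8859-1
echo toUtf8($latin1), PHP_EOL;
```

The validity check is a heuristic: a string in another encoding can occasionally happen to be valid UTF-8, so it is worth spot-checking a few converted paragraphs by eye.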

A Simple Front-End

The final step is to access these paragraphs using a front-end in a browser. How the front-end should provide access to the data is limited only by our imaginations. For example, we could retrieve a certain number of paragraphs, or a specified quantity of text, or perhaps a number of characters rounded to the nearest paragraph. We could select consecutive paragraphs, or perhaps we’d be happy with random paragraphs. Whatever we choose, we need a function to read the table.

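<?php
function selectParagraph($db, $id) {
    // %d forces $id to an integer, so the value is safe to embed in the SQL
    $query = sprintf('SELECT content FROM paragraphs WHERE id = %d', $id);
    $result = $db->query($query);
    $row = $result->fetch(PDO::FETCH_ASSOC);
    $result->closeCursor();
    // fetch() returns false when no row has this ID
    return $row ? $row['content'] : '';
}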
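
One of the options mentioned above, a number of characters rounded to the nearest paragraph, could be sketched as follows. This is an illustration only: takeNearestToLength is a hypothetical helper, and it works on an array of paragraphs that has already been read from the table.

```php
<?php
// Hypothetical helper: take paragraphs in order until the running total of
// characters is as close as possible to $targetChars. At least one
// paragraph is returned whenever any are available.
function takeNearestToLength(array $paragraphs, int $targetChars): array
{
    $selected = [];
    $total = 0;
    foreach ($paragraphs as $paragraph) {
        $next = $total + strlen($paragraph);
        // Stop once adding this paragraph would not bring us closer to the target.
        if ($selected !== [] && abs($next - $targetChars) >= abs($total - $targetChars)) {
            break;
        }
        $selected[] = $paragraph;
        $total = $next;
    }
    return $selected;
}

// Three placeholder paragraphs of 100 characters each.
$pool = [str_repeat('a', 100), str_repeat('b', 100), str_repeat('c', 100)];
echo count(takeNearestToLength($pool, 260)), PHP_EOL;
```

When the target falls exactly halfway between two totals, this version settles for the shorter excerpt.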

For demonstration purposes, the algorithm I’ll present uses a simple random number generator to select paragraphs from the database. It needs to know the maximum ID value for the paragraph records, hence the $maxID variable (this assumes the ID values are contiguous).

<form method="post">
 <label for="slider">How many paragraphs do you want?</label>
 <input type="range" min="1" max="4" step="1" name="slider" id="slider">
 <input type="submit" name="submit" value="Get Excerpt">
</form>
<?php
// $db is the PDO connection; $maxID is the highest id in the paragraphs table
if (isset($_POST['slider'])) {
    // never trust form input: keep the count within the slider's 1-4 range
    $i = max(1, min(4, (int) $_POST['slider']));
    while ($i--) {
        $id = rand(1, $maxID);
        $paragraph = selectParagraph($db, $id);
        echo '<p>' . $paragraph . '</p>';
    }
}
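
Because each ID is drawn independently, the loop above can return the same paragraph twice in one excerpt. A minimal sketch of one way around that is to draw distinct IDs first. The helper pickRandomIds is hypothetical, and like the loop above it assumes the ID values are contiguous from 1 to $maxID.

```php
<?php
// Hypothetical helper: choose $count distinct IDs between 1 and $maxID,
// so the same paragraph is never returned twice in one excerpt.
function pickRandomIds(int $maxID, int $count): array
{
    $count = max(0, min($count, $maxID)); // cannot pick more IDs than exist
    if ($count === 0) {
        return [];
    }
    $ids = range(1, $maxID); // fine for modest tables; costly for huge ones
    shuffle($ids);
    return array_slice($ids, 0, $count);
}

// $maxID could be read from the table, assuming $db is the PDO connection:
// $maxID = (int) $db->query('SELECT MAX(id) FROM paragraphs')->fetchColumn();
print_r(pickRandomIds(10, 4));
```

Each returned ID can then be passed to selectParagraph() in place of the value from rand().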

And that’s the final piece of the project!

Summary

In this article we’ve covered the essential aspects of building an alternative to the popular Lorem Ipsum text generator. How complex we make it, how many publications and authors we include, how stylish we make the front-end, and whether we limit our choice of text to a specific genre, is entirely open to personal choice. But the essential elements will all be similar to what we’ve covered here, and all built using a smattering of PHP and MySQL. Easy!

Code to accompany this article can be found on GitHub. Feel free to clone it and expand on it.

Image via Fotolia

Frequently Asked Questions (FAQs) about Coding a Lorem Ipsum Alternative

How can I generate a specific number of paragraphs using the Lorem Ipsum alternative?

To generate a specific number of paragraphs, use the slider in the form, which submits a value between 1 and 4; the front-end loop then runs once for each paragraph requested. To allow longer excerpts, raise the slider’s max attribute and any matching limit in the PHP code. You could also wrap the loop in a function that accepts the number of paragraphs as a parameter, which makes it reusable outside the form.

Can I use this Lorem Ipsum alternative in other programming languages?

The Lorem Ipsum alternative presented in the article is written in PHP. However, the logic and structure of the code can be translated into other programming languages. The key is to understand the logic behind the code: extract paragraphs from a public domain text, store them in a database table, and select them at random on request. Once you understand this, you can implement the same functionality in any programming language you are familiar with.

How can I add my own words to the Lorem Ipsum alternative?

The system works with whole paragraphs rather than individual words, so you add your own text by adding paragraphs to the database. Either run the extraction code against another HTML file, or insert your own paragraphs directly into the paragraphs table. This allows you to customize the output to better fit your needs, whether you want to include specific jargon, slang, or any other type of language.

Is there a way to control the length of the sentences generated by the Lorem Ipsum alternative?

Not directly: the sentences come verbatim from the source text, so their length is fixed. What you can control is the length of the paragraphs that are kept. The $limits array in the extraction code sets the minimum and maximum number of characters (200 and 2000 in the example), and you can change these values to favor shorter or longer paragraphs, depending on your needs.

Can I use the Lorem Ipsum alternative to generate text in languages other than English?

The example generates text in English because the source book is in English. Project Gutenberg offers texts in various languages, so you can generate text in another language by extracting paragraphs from a book in that language. Make sure the character set of your database can store the text. This makes the Lorem Ipsum alternative a versatile tool that can be adapted to different languages and locales.

How can I integrate the Lorem Ipsum alternative into my website?

To integrate the Lorem Ipsum alternative into your website, you can call the function in the place where you want the text to appear. This could be in a template file, a content management system, or any other place where you generate HTML. The function will return a string of text, which you can then insert into your HTML.

Can I use the Lorem Ipsum alternative in commercial projects?

The text in this example comes from Project Gutenberg, which offers public domain texts. For the accompanying code, check the license in the GitHub repository before using it in a commercial project; it’s always a good idea to check the license of any code you use to make sure you comply with its terms.

How can I contribute to the Lorem Ipsum alternative project?

If you have suggestions for improvements or new features for the Lorem Ipsum alternative, you can contribute to the project. This could involve submitting a pull request with your changes, opening an issue to discuss your ideas, or simply providing feedback on the project.

Can I use the Lorem Ipsum alternative offline?

Yes, you can use the Lorem Ipsum alternative offline. The function runs on your server, so it doesn’t require an internet connection to work. This makes it a reliable solution for generating placeholder text, even when you don’t have an internet connection.

Is there a limit to the amount of text the Lorem Ipsum alternative can generate?

There is no inherent limit to the amount of text the Lorem Ipsum alternative can generate. However, keep in mind that generating a large amount of text may take more time and resources. Therefore, it’s a good idea to test the function with different amounts of text to ensure it performs well in your specific context.