How best to count how many letters there are in the New Testament?

bendqh1 · December 7, 2023, 6:05am

A Hebrew Torah book has, depending on the version (Masoretic, Samaritan, Dead Sea Scrolls, and possibly others), around 300,000 Hebrew letters.

A standard Quran book has exactly or approximately 327,792 Hejazi or “classical” Arabic letters; I say exactly or approximately because, if I am not mistaken, it depends on the edition, especially when comparing the Hafs edition to the Warsh edition.

I want to know how many letters there are in a standard New Testament book, preferably in the original Koine Greek, but I guess English would do as well.

How would you prefer to do this? For example, would you use a command line user interface text processing tool? Would you use a graphical user interface tool? Would it be a local tool or a web tool?

DaveMaxwell · December 7, 2023, 1:31pm

Well first of all, there is no “standard” New Testament book - there are like 20 different versions, each slightly different.

As much as I dislike the language, this effort would be tailor-made for Python, which is built to collate and crunch data. It would just need to import the text (assuming you can find a reliable online source), then process the file line by line and letter by letter.
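A minimal sketch of that line-by-line, letter-by-letter pass, assuming the text is already available as UTF-8 lines. The function name `count_letters` and the sample verse are only illustrations, and "letter" here means whatever Python's `str.isalpha()` accepts, which leaves out spaces, punctuation, digits, and combining accent marks.

```python
import io


def count_letters(lines):
    """Count the characters that str.isalpha() treats as letters."""
    total = 0
    for line in lines:          # process the text line by line
        for ch in line:         # ... and letter by letter
            if ch.isalpha():    # skips spaces, punctuation, digits, newlines
                total += 1
    return total


# Stand-in for an opened text file; a real run would use something like
# open("some_book.txt", encoding="utf-8") instead (hypothetical file name).
sample = io.StringIO("Ἐν ἀρχῇ ἦν ὁ λόγος,\nκαὶ ὁ λόγος ἦν πρὸς τὸν θεόν.\n")
letter_total = count_letters(sample)
print(letter_total)
```

Any iterable of strings works as input, so an open file object can be passed directly.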

m_hutley · December 7, 2023, 6:39pm

Is the intention for the user to be able to upload their own book?
What form are you ingesting? Images? PDFs? Text files?

SamuelCalifornia · December 7, 2023, 10:24pm

How is text of the New Testament different from any other text written in the same language? If there is a relevant difference then please explain. Otherwise there are many utilities that already exist that can do what you need to do.

Also, you say letters but do you mean characters that include punctuation? Should number digits be included? You do not need a special utility to tell you how many total characters there are in a file.

m_hutley · December 7, 2023, 10:40pm

Not to mention that unless the user’s going to be uploading their own, there are almost certainly already digitized versions of most every text, which would mean at best there would be sources for this information already…

bendqh1 · December 8, 2023, 11:48am

The user would probably want to read the book online by chapter, but in this case the goal is just to get an output of how many Koine Greek letters there are in the most authentic compilation of all New Testament books shared by all churches worldwide today.

bendqh1 · December 8, 2023, 11:51am

I don’t know Greek so I don’t know what symbols besides letters were used, if at all, to write the original manuscripts of the New Testament books shared between all churches today.

If it included symbols which are not letters, such as punctuation marks or numbers then of course they should be included as well.

bendqh1 · December 8, 2023, 12:07pm

I have found this source:

I assume that for starters I would have to manually copy-paste each book text into a separate file, so for example I would have to create a file named matthew.txt and then paste into it the text of all chapters of the Book of Matthew, probably separated by empty lines.

Then I would need to count the computer-characters somehow (I assume that a code editor like Geany or Visual Studio Code can do it), and then subtract at least the number of empty lines separating each chapter from the next after pasting.
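As a rough sketch of that counting step, assuming one UTF-8 text file per book: instead of counting everything and subtracting the blank lines afterwards, the line-break characters can simply be skipped while counting. The helper name `count_chars_without_line_breaks` and the demo file contents are made up for illustration.

```python
import tempfile
from pathlib import Path


def count_chars_without_line_breaks(path):
    """Count every character in a UTF-8 text file except line breaks."""
    text = Path(path).read_text(encoding="utf-8")
    return sum(1 for ch in text if ch not in "\r\n")


# Demo with a throwaway file standing in for a pasted book,
# with chapters separated by an empty line.
with tempfile.TemporaryDirectory() as tmp:
    book = Path(tmp) / "matthew.txt"
    book.write_text("chapter one text\n\nchapter two text\n", encoding="utf-8")
    demo_count = count_chars_without_line_breaks(book)

print(demo_count)
```

This still counts spaces and punctuation, so it answers "how many characters", not "how many letters".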

I could probably also change the .txt extension to .odt extension or to .rtf extension and then count the computer characters with LibreOffice Writer or Microsoft Word.

m_hutley · December 8, 2023, 3:34pm

be… extremely careful there… copyright lawyers the world over just perked their ears up…

SamuelCalifornia · December 8, 2023, 4:36pm

What else could be in the text? Probably nothing else. Therefore the size of the file is the count. Very simple. If you want to exclude space characters then it would be simple to count the number of spaces.

Translations of the New Testament might have a copyright but I doubt the original Greek does.

m_hutley · December 8, 2023, 8:51pm

Yeah, the original text of something like the New Testament, sure. But this is the reason I asked how and what would be ingested…

Periods, commas, hyphens, newlines, spaces, tabs… they’re all non-letter characters…
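To see how much of a text is made up of such non-letter characters, one option is to tally characters by Unicode general category. This is only a sketch: `category_breakdown` is a made-up name, and the one-letter classes come from the standard library's `unicodedata.category` (L = letter, M = combining mark, N = number, P = punctuation, Z = separator such as a space, C = control such as newline or tab).

```python
import unicodedata
from collections import Counter


def category_breakdown(text):
    """Tally characters by the first letter of their Unicode general category."""
    return Counter(unicodedata.category(ch)[0] for ch in text)


breakdown = category_breakdown("ab, c.\n\t1")
for major_class, count in sorted(breakdown.items()):
    print(major_class, count)
```

The count under `L` is then the letter count, and everything else is reported separately rather than silently mixed in.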

Depends on how many of the characters are multi-byte?

bendqh1 · December 9, 2023, 10:44am

Sorry, I should have said computer-letter-characters instead.

bendqh1 · December 9, 2023, 11:25am

“Therefore the size of the file is the count”

Is that correct in 100% of the cases? If I am not mistaken, some computer-characters take up more than 1 byte (more than 8 bits).

SamuelCalifornia · December 9, 2023, 5:03pm

That depends on the definition of multi-byte. If it refers to a character encoding in which all characters are the same number of bytes then the difference between a single byte and multiple bytes is trivial.

I admit that I am accustomed to characters occupying just one byte.

If you can be specific about what the encoding is for your data then people can be specific. You are being vague and therefore you can only get vague answers.

m_hutley · December 9, 2023, 5:16pm

Well, if we’re talking about putting Greek characters in, you’re not going to be able to fit Unicode value 913 into an 8-bit value. But a space (Unicode value 32) does fit. And if a single letter character takes 2 bytes, and symbols take 1, and you’re counting the number of bytes to count the number of characters, you’re in trouble, because SOME of your characters are 1 byte, and some are 2…
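A quick way to check this, assuming the file is encoded as UTF-8 (the thread has not settled on an encoding): compare the number of encoded bytes with the number of characters. The variable names are only for illustration.

```python
# Code point 913 is U+0391 (GREEK CAPITAL LETTER ALPHA); 32 is the space.
for ch in ("\u0391", " "):
    print(f"U+{ord(ch):04X}", ord(ch), len(ch.encode("utf-8")), "byte(s)")

text = "\u0391\u03b2 \u03b3"                # three Greek letters and one space
byte_count = len(text.encode("utf-8"))     # what the file size would report
char_count = len(text)                     # number of Unicode code points
print(byte_count, char_count)

# Accented letters from the Greek Extended block are wider still in UTF-8.
extended_width = len("\u1f10".encode("utf-8"))
print(extended_width)
```

In UTF-8 the file size therefore overstates the character count as soon as the text contains anything outside ASCII.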

SamuelCalifornia · December 9, 2023, 5:36pm

We are speculating about the format. Simple Unicode uses a fixed number of bytes for each character.

bendqh1 · December 10, 2023, 3:50am

I meant the standard encoding, which I understand to be UTF-8.
Even in that standard, is that correct in 100% of the cases?

Anyway I think that counting computer-letter-characters in an opened file is the most accurate method.

SamuelCalifornia · December 10, 2023, 4:58am

Please provide the specifics of your requirements. Without specifics people must speculate and speculation tends to become long discussions, much of which might not be relevant.

bendqh1 · December 10, 2023, 5:49am

Something that involves no speculation is a regex like this:

[a-zA-Z]

Although, in this case I need to match Koine Greek letters and I don’t know if there is regex for that, but it could be invented by creating a regex sequence of this language’s writing-system letter characters.
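One possible Greek counterpart, sketched with Python's built-in `re` module: match the two Unicode blocks that hold Greek characters (U+0370 to U+03FF and U+1F00 to U+1FFF), then keep only the matches that are actually letters, since those blocks also contain a few punctuation and accent symbols. The names `GREEK_BLOCKS` and `count_greek_letters` are hypothetical, and the block ranges are an assumption about what should count as Koine Greek text.

```python
import re

# Greek and Coptic block plus Greek Extended block (accented polytonic letters).
GREEK_BLOCKS = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")


def count_greek_letters(text):
    """Count characters in the Greek blocks that are letters."""
    return sum(1 for m in GREEK_BLOCKS.finditer(text) if m.group().isalpha())


# "logos" in Greek, the Greek question mark (U+037E), then Latin text and digits.
sample = "\u03bb\u03cc\u03b3\u03bf\u03c2\u037e abc 123"
greek_total = count_greek_letters(sample)
print(greek_total)
```

Latin letters, digits, and Greek punctuation are all ignored, so only the Greek letters contribute to the total.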

SamuelCalifornia · December 10, 2023, 4:30pm

I will be more specific. First decide what specific data you need to process. Specify it in a manner that people can access it or at least be specific about what the language is and the encoding and specifics like that. You must at least know the encoding of the data.

The free online Greek word counter tool from GoTranscript might help.