PER MØLDRUP-DALUM

Naive analysis of books

March, 2016

I have 65,000 books from Project Gutenberg

Book analysis_1.png

Book analysis_2.gif

Book analysis_3.png

Create two helper functions

Book analysis_4.png

How much RAM does that data use

Book analysis_5.png

Book analysis_6.png

Let’s try to read some metadata from 10 books

Book analysis_7.png

Book analysis_8.png

Book analysis_9.png

The same metadata in tabular form

Book analysis_10.png

10000.txt | Anonymous | The Magna Carta | English | 101718
10001.txt | Lucius Seneca | Apocolocyntosis | English | 52510
10002-8.txt | William Hope Hodgson | The House on the Borderland | English | 306901
10002.txt | William Hope Hodgson | The House on the Borderland | English | 306892
10003-8.txt | Mary King Waddington | My First Years As A Frenchwoman, 1876-1879 | English | 380829
10003.txt | Mary King Waddington | My First Years As A Frenchwoman, 1876-1879 | English | 380817
10004-8.txt | Lindsay, Anna Robertson Brown | The Warriors | English | 302753
10004.txt | Lindsay, Anna Robertson Brown | The Warriors | English | 302750
10005-8.txt | George Tucker | A Voyage to the Moon | English | 434769
10005.txt | George Tucker | A Voyage to the Moon | English | 434760
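
The cells that read the metadata are only images here, so as a rough illustration, this is how the same fields could be pulled out of a book in Python. It is a sketch, not the notebook's Wolfram Language code. It assumes the usual Project Gutenberg header lines (`Title:`, `Author:`, `Language:`); the function name `read_metadata`, the `Length` key, and the shortened sample header are made up for the example, and treating the last column as the text length is a guess.

```python
import re

# Field names as they appear in a typical Project Gutenberg header.
FIELDS = ("Title", "Author", "Language")
HEADER_LINE = re.compile(r"\s*(Title|Author|Language):\s*(.+?)\s*$")

def read_metadata(text, max_header_lines=200):
    """Return the header fields found near the top of `text` (hypothetical helper)."""
    meta = {name: None for name in FIELDS}
    for line in text.splitlines()[:max_header_lines]:
        match = HEADER_LINE.match(line)
        # Keep only the first occurrence of each field.
        if match and meta[match.group(1)] is None:
            meta[match.group(1)] = match.group(2)
    # Assumption: the number in the last column is the length of the text.
    meta["Length"] = len(text)
    return meta

# Shortened, made-up header in the Project Gutenberg style.
sample = """The Project Gutenberg EBook of The Magna Carta

Title: The Magna Carta

Author: Anonymous

Language: English

*** START OF THIS PROJECT GUTENBERG EBOOK ***
"""

meta = read_metadata(sample)
print(meta["Author"], "|", meta["Title"], "|", meta["Language"])
```

Only the first 200 lines are scanned so that a line inside the book that happens to start with "Author:" does not overwrite the header value.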

Get the metadata from all 65,000 books

Book analysis_11.png

How long did it take to get the four metadata values from those books?

Book analysis_12.png

Book analysis_13.png

Book analysis_14.png

Book analysis_15.png

Book analysis_16.png

Book analysis_17.png

Now it’s time to select the training sets and the validation sets. First select all works by Shakespeare, Jules Verne, and a third author still to be decided.

But before that, we will ignore anything that’s not written in English.

Which languages do we have?

Book analysis_18.png

English
English and Aleutian
French
Latin
German
Italian
Dutch
Swedish
Danish
Spanish
Finnish
French (with English)
Bulgarian
Spanish and English
English and Spanish
Serbian
Norwegian
Portuguese
German and Catalan
Esperanto
French and Dutch
English / French
Romanian
FASTA
English and French
german
(English and Nahuatl)
English with French
English, with Italian and French
Chinese
Spanish with English
French/English
French and English
English, Latin, Spanish, and Italian
English with Khasi (Language spoken in N.E. India)
Czech
Tagalog
ASCII
English and Latin
***CAREFUL***
Dutch and Flemish
Portugese
French / English
Spanish/English
Welsh
Italian and English
English and Old English
English, Middle (1100-1500)
english
en
Russian
EN
Catalan
Latin and English
Latin with English and Greek (ancient)
English and Nahuatl
Icelandic
Polish
En
Quiche
French / Onondaga / English
Spanish and Tagalog
Ilocano
Iloko
English and Chinook
Interlingua
Irish
Iloko, Spanish
Friulian
Afrikaans
English/latin
Kamilaroi and English
Gascon
Greek
Italian and French
Neapolitan
Hebrew
Hungarian
Japanese
Frisian
Venetian
English - Latin
Spanish, English and Tagalog
Cebuano
Galician
Nahuatl
Maori
Middle English
German and English
Breton
Arapaho
Czech, Esperanto
Zh
Inuktitut
Bagobo
Kashubian and Polish
Gaelic
English and German
Portuguese & French
GR
Slovenian
NU
Englishs
Telugu
JP
English & Spanish
Ojibwa
Chinese and English
French and Latin
Arabic
Estonian
Farsi
48771
Scots
5648
Latin, German
GERMAN
English/Latin
German, with English comments

How many different languages?

Book analysis_19.png

Book analysis_20.png
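
Counting the distinct labels is a one-liner in most languages. As a hedged Python sketch (not the notebook's own code), using a handful of labels copied from the list above: the second count folds case to show how many "languages" are really just spelling variants. The variable names are made up for the example.

```python
from collections import Counter

# A few labels taken from the language list above.
languages = ["English", "english", "EN", "French", "English",
             "German", "german", "GERMAN"]

# Distinct labels exactly as written in the files.
raw_counts = Counter(languages)
distinct_raw = len(raw_counts)

# Distinct labels after trimming whitespace and ignoring case.
folded_counts = Counter(label.strip().casefold() for label in languages)
distinct_folded = len(folded_counts)

print(distinct_raw, distinct_folded)
```

Case folding alone does not merge "EN" with "English"; that would need a hand-written mapping.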

Well, drop everything whose language is not “English”, and only “English”

Book analysis_21.png

Book analysis_22.png

Every language: 64599
Only English: 55340
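
A strict filter like that might look as follows in Python. This is an illustrative sketch under my own assumptions, not the notebook's code: the record layout (dicts with `file` and `language` keys) and the file names are hypothetical.

```python
# Hypothetical records; only the language labels come from the list above.
books = [
    {"file": "a.txt", "language": "English"},
    {"file": "b.txt", "language": "English and French"},
    {"file": "c.txt", "language": "english"},
    {"file": "d.txt", "language": "French"},
]

# Keep a book only when the label is exactly "English".
only_english = [book for book in books if book["language"] == "English"]

print(len(books), len(only_english))
```

The exact match also throws away variants such as "english" or "EN", which is the price of keeping the filter this naive.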

Now from the set, select texts written by Shakespeare, but first create a small utility function

Book analysis_23.png

Book analysis_24.png

Book analysis_25.png

Book analysis_26.png

Book analysis_27.png
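
What the utility function does is not visible in this export. One plausible shape, sketched in Python rather than the Wolfram Language, is a case-insensitive substring match on the author field. The name `select_by_author` and the record layout are assumptions; the sample rows come from the metadata table earlier on.

```python
def select_by_author(books, name):
    """Return the books whose author field contains `name`, ignoring case."""
    needle = name.casefold()
    return [book for book in books if needle in book["author"].casefold()]

# Rows taken from the metadata table above.
books = [
    {"file": "10002.txt", "author": "William Hope Hodgson"},
    {"file": "10004.txt", "author": "Lindsay, Anna Robertson Brown"},
    {"file": "10005.txt", "author": "George Tucker"},
]

hits = select_by_author(books, "hodgson")
print([book["file"] for book in hits])
```

A substring match copes with both "William Shakespeare" and "Shakespeare, William", but it will also pick up any other author whose name contains the same string.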

Select all works by Jules Verne

Book analysis_28.png

Book analysis_29.png

Book analysis_30.png

Book analysis_31.png

Now, who should the third author be? Which authors do we have to choose from?

Book analysis_32.png

Book analysis_33.png

Book analysis_34.png

Book analysis_35.png

Ah, that’s a lot! That number gives one the urge to examine the relationship between authors and the number of works per author, but that’s not my mission here.

Maybe go for Dostoyevsky?

Book analysis_36.png

Book analysis_37.png

Naw, too many different spellings. I’ll select Victor Hugo as the third author

Book analysis_38.png

Book analysis_39.png

Now, onwards to the stuff I know nothing about, but which is so easy to do in Mathematica. First create a function that splits a list of works into a training set and a test set

Book analysis_40.png
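
The splitting function itself is only an image here. A minimal Python sketch of the idea, assuming a shuffled 80/20 split (the real notebook may use a different ratio or no shuffling at all), could be the following. The name `split_works`, its parameters, and the file names are made up.

```python
import random

def split_works(works, train_fraction=0.8, seed=0):
    """Shuffle a copy of `works` and cut it into (training, test) lists."""
    shuffled = list(works)
    # A seeded generator makes the split repeatable.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names standing in for one author's works.
works = [f"work_{i}.txt" for i in range(10)]
train, test = split_works(works)
print(len(train), len(test))
```

Every work ends up in exactly one of the two lists, so nothing the classifier is tested on was seen during training.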

Then create training and test sets for the three authors

Book analysis_41.gif

Book analysis_42.gif

Book analysis_43.png

Book analysis_44.png

Book analysis_45.png

Now it’s time to actually load the content of the books or works into memory

Book analysis_46.gif

And create a classifier function based on this content

Book analysis_47.png

Book analysis_48.gif
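
To give a feeling for what such a classifier function might be doing, here is a small bag-of-words naive Bayes classifier in plain Python. It is a stand-in of my own choosing: the notebook's actual method is not visible in this export and may well be something else. The training snippets are invented for the example and are not quotes from the books.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """Build per-author word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in examples:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = {word for counts in word_counts.values() for word in counts}
    return {"words": word_counts, "docs": doc_counts, "vocab": vocab}

def probabilities(model, text):
    """Return {label: probability} for `text`, using add-one smoothing."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    tokens = tokenize(text)
    log_scores = {}
    for label, n_docs in model["docs"].items():
        counts = model["words"][label]
        denominator = sum(counts.values()) + vocab_size
        score = math.log(n_docs / total_docs)
        for token in tokens:
            score += math.log((counts[token] + 1) / denominator)
        log_scores[label] = score
    # Normalize relative to the best score to avoid numeric underflow.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}

def classify(model, text):
    """Return the most probable label for `text`."""
    probs = probabilities(model, text)
    return max(probs, key=probs.get)

# Invented toy snippets, one per author.
examples = [
    ("thou art my love and thy heart doth speak", "Shakespeare"),
    ("the submarine dived beneath the sea and the engine hummed", "Verne"),
    ("the poor of paris wept in the streets of misery", "Hugo"),
]
model = train(examples)
print(classify(model, "thy love doth speak"))
print(classify(model, "the engine beneath the sea"))
```

With whole books instead of one-line snippets the word counts become far more telling, but the principle stays the same: the author whose vocabulary best explains the text wins.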

That took less than a minute!

Now let’s test the classifier

Book analysis_49.gif

Book analysis_50.png

Book analysis_51.png

Book analysis_52.png

Book analysis_53.png

Book analysis_54.png

Book analysis_55.png
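
The measurements shown in the images are not reproduced in this text. As a hedged sketch of what testing a classifier boils down to, this Python snippet computes accuracy and confusion counts from a list of true authors and a list of predicted authors. The labels below are invented to show the mechanics and are not the notebook's results; `evaluate` is a hypothetical helper.

```python
from collections import Counter

def evaluate(actual, predicted):
    """Return (accuracy, confusion) where confusion counts (actual, predicted) pairs."""
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / len(actual), confusion

# Invented labels, purely for illustration.
actual = ["Shakespeare", "Shakespeare", "Verne", "Verne", "Hugo", "Hugo"]
predicted = ["Shakespeare", "Shakespeare", "Verne", "Hugo", "Hugo", "Hugo"]

accuracy, confusion = evaluate(actual, predicted)
print(round(accuracy, 3))
print(confusion[("Verne", "Hugo")])
```

The off-diagonal pairs, such as a Verne text labeled as Hugo, show which authors get confused with each other, which says more than the accuracy figure alone.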

To my layman’s eyes it looks good. Let’s pick a “random” book

Book analysis_56.png

Book analysis_57.png

Of our three authors, who most likely wrote that text?

Book analysis_58.png

Book analysis_59.png

I don’t know…

Book analysis_60.png

Book analysis_61.png

Why is that more Shakespeare than Verne?

What if I wrote some short sentences where I tried to sound like one of the three authors?

Book analysis_62.png

Book analysis_63.png

Book analysis_64.png

Book analysis_65.png

Book analysis_66.png

Book analysis_67.png

Below is just some gibberish…

Book analysis_68.png

Book analysis_69.png

Book analysis_70.png

Book analysis_71.png

Book analysis_72.png

Book analysis_73.png

Book analysis_74.png

Book analysis_75.png

Book analysis_76.png

Book analysis_77.png

Book analysis_78.png

An option for showing the timing for the latest computation

Book analysis_79.png

Book analysis_80.png

Created with the Wolfram Language