After much hesitation, I redid the whole thing with a smaller file.
The original dump was 11 GB; this time I used a 46 MB file, and this time it completed without trouble.

I picked a small file from
https://dumps.wikimedia.org/enwiki/latest/
In any case, the aim is to do what can actually be done on a Raspberry Pi 2.


$ python -m gensim.scripts.make_wiki enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2 wiki_en_output 1000
2015-05-13 07:49:11,950 : INFO : running /usr/local/lib/python2.7/dist-packages/gensim/scripts/make_wiki.py enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2 wiki_en_output 1000
2015-05-13 07:49:13,406 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-05-13 08:00:07,043 : INFO : finished iterating over Wikipedia corpus of 4680 documents with 13500677 positions (total 6280 articles, 13507365 positions before pruning articles shorter than 50 words)
2015-05-13 08:00:07,044 : INFO : built Dictionary(260454 unique tokens: [u'biennials', u'tripolitan', u'unsupportable', u'refreshable', u'nunnery']...) from 4680 documents (total 13500677 corpus positions)
2015-05-13 08:00:11,267 : INFO : discarding 240659 tokens: [(u'ability', 817), (u'able', 1252), (u'about', 2936), (u'abstention', 7), (u'abstentionism', 2), (u'according', 2057), (u'account', 890), (u'accounts', 486), (u'achieved', 648), (u'across', 1240)]...
2015-05-13 08:00:11,268 : INFO : keeping 19795 tokens which were in no less than 20 and no more than 468 (=10.0%) documents
2015-05-13 08:00:12,348 : INFO : resulting dictionary: Dictionary(19795 unique tokens: [u'writings', u'homomorphism', u'hordes', u'yellow', u'gag']...)
2015-05-13 08:00:12,564 : INFO : storing corpus in Matrix Market format to wiki_en_output_bow.mm
2015-05-13 08:00:12,566 : INFO : saving sparse matrix to wiki_en_output_bow.mm
2015-05-13 08:00:14,260 : INFO : PROGRESS: saving document #0
2015-05-13 08:12:06,707 : INFO : finished iterating over Wikipedia corpus of 4680 documents with 13500677 positions (total 6280 articles, 13507365 positions before pruning articles shorter than 50 words)
2015-05-13 08:12:06,709 : INFO : saved 4680x19795 matrix, density=1.896% (1756120/92640600)
2015-05-13 08:12:06,711 : INFO : saving MmCorpus index to wiki_en_output_bow.mm.index
2015-05-13 08:12:06,726 : INFO : saving dictionary mapping to wiki_en_output_wordids.txt.bz2
2015-05-13 08:12:11,686 : INFO : loaded corpus index from wiki_en_output_bow.mm.index
2015-05-13 08:12:11,687 : INFO : initializing corpus reader from wiki_en_output_bow.mm
2015-05-13 08:12:11,688 : INFO : accepted corpus with 4680 documents, 19795 features, 1756120 non-zero entries
2015-05-13 08:12:11,689 : INFO : collecting document frequencies
2015-05-13 08:12:11,753 : INFO : PROGRESS: processing document #0
2015-05-13 08:13:48,804 : INFO : calculating IDF weights for 4680 documents and 19794 features (1756120 matrix non-zeros)
2015-05-13 08:13:48,979 : INFO : storing corpus in Matrix Market format to wiki_en_output_tfidf.mm
2015-05-13 08:13:48,980 : INFO : saving sparse matrix to wiki_en_output_tfidf.mm
2015-05-13 08:13:49,063 : INFO : PROGRESS: saving document #0
2015-05-13 08:17:32,296 : INFO : saved 4680x19795 matrix, density=1.896% (1756120/92640600)
2015-05-13 08:17:32,297 : INFO : saving MmCorpus index to wiki_en_output_tfidf.mm.index
2015-05-13 08:17:32,309 : INFO : finished running make_wiki.py
$ ls -l
-rw-r--r-- 1 pi pi 11820881800  4月  7 07:06 enwiki-latest-pages-articles.xml.bz2
-rw-r--r-- 1 pi pi    46529467  5月 13 07:44 enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
-rw-r--r-- 1 pi pi    21416947  5月 13 08:12 wiki_en_output_bow.mm
-rw-r--r-- 1 pi pi       26176  5月 13 08:12 wiki_en_output_bow.mm.index
-rw-r--r-- 1 pi pi    46458890  5月 13 08:17 wiki_en_output_tfidf.mm
-rw-r--r-- 1 pi pi       27274  5月 13 08:17 wiki_en_output_tfidf.mm.index
-rw-r--r-- 1 pi pi      131400  5月 13 08:12 wiki_en_output_wordids.txt.bz2
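
make_wiki leaves three artifacts: the bag-of-words corpus (wiki_en_output_bow.mm plus its .index), the TF-IDF-weighted corpus (wiki_en_output_tfidf.mm), and the compressed id-to-word mapping (wiki_en_output_wordids.txt.bz2). Before committing to the long LDA run, a quick sanity check along these lines can confirm the BoW corpus loads; this is just a sketch, with the filenames taken from the ls output above and the expected counts from the log:

import gensim

# Load the sparse bag-of-words corpus written by make_wiki
bow = gensim.corpora.MmCorpus('wiki_en_output_bow.mm')
print bow   # should report 4680 documents, 19795 features, per the log above

# Each document is a list of (token_id, count) pairs
first_doc = next(iter(bow))
print first_doc[:5]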

Now that the files were generated successfully, I continued from the Python command line.


$ python
Python 2.7.3 (default, Mar 18 2014, 05:13:23) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging,gensim
>>> logging.basicConfig(
... format='%(asctime)s : %(levelname)s : %(message)s',level=logging.DEBUG)
>>> id2word=gensim.corpora.Dictionary.load_from_text('wiki_en_output_wordids.txt.bz2')
>>> mm=gensim.corpora.MmCorpus('wiki_en_output_tfidf.mm')
2015-05-13 19:59:11,277 : INFO : loaded corpus index from wiki_en_output_tfidf.mm.index
2015-05-13 19:59:11,278 : INFO : initializing corpus reader from wiki_en_output_tfidf.mm
2015-05-13 19:59:11,279 : INFO : accepted corpus with 4680 documents, 19795 features, 1756120 non-zero entries
>>> 
>>> model=gensim.models.ldamodel.LdaModel(
... corpus=mm,
... id2word=id2word,
... num_topics=100,
... update_every=1,
... chunksize=10000,
... passes=1)
2015-05-13 20:01:19,311 : INFO : using symmetric alpha at 0.01
2015-05-13 20:01:19,312 : INFO : using serial LDA version on this node
2015-05-13 20:01:24,286 : INFO : running online LDA training, 100 topics, 1 passes over the supplied corpus of 4680 documents, updating model once every 4680 documents, evaluating perplexity every 4680 documents, iterating 50x with a convergence threshold of 0.001000
2015-05-13 20:01:24,287 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2015-05-13 20:04:31,711 : DEBUG : bound: at document #0
2015-05-13 20:16:24,490 : INFO : -160.871 per-word bound, 2673883593567069621132943961470864934975370690560.0 perplexity estimate based on a held-out corpus of 4680 documents with 40842 words
2015-05-13 20:16:24,491 : INFO : PROGRESS: pass 0, at document #4680/4680
2015-05-13 20:16:24,491 : DEBUG : performing inference on a chunk of 4680 documents
2015-05-13 20:22:44,939 : DEBUG : 4629/4680 documents converged within 50 iterations
2015-05-13 20:22:45,092 : DEBUG : updating topics
2015-05-13 20:22:52,047 : INFO : topic #0 (0.010): 0.002*djibouti + 0.002*atlantic + 0.002*diagrams + 0.002*extension + 0.002*bs + 0.002*dualism + 0.002*chromium + 0.002*ethernet + 0.002*dia + 0.002*patent
2015-05-13 20:22:52,119 : INFO : topic #11 (0.010): 0.002*confederation + 0.002*cholesterol + 0.002*corinth + 0.002*basel + 0.001*esperanto + 0.001*hubbard + 0.001*berlin + 0.001*panthers + 0.001*beer + 0.001*theatre
2015-05-13 20:22:52,195 : INFO : topic #15 (0.010): 0.002*politician + 0.002*actor + 0.002*actress + 0.002*singer + 0.002*songwriter + 0.001*andorra + 0.001*thousand + 0.001*cone + 0.001*atari + 0.001*baseball
2015-05-13 20:22:52,276 : INFO : topic #7 (0.010): 0.002*bahá + 0.002*canton + 0.002*constantine + 0.002*columbus + 0.002*barbados + 0.002*diffusion + 0.002*clarinet + 0.002*ajax + 0.002*dominican + 0.002*buses
2015-05-13 20:22:52,351 : INFO : topic #60 (0.010): 0.002*binary + 0.002*chocolate + 0.002*huxley + 0.002*franc + 0.002*concord + 0.002*ellipse + 0.002*einstein + 0.002*botswana + 0.002*allen + 0.002*casimir
2015-05-13 20:22:52,429 : INFO : topic #52 (0.010): 0.002*planet + 0.002*cooking + 0.002*christopher + 0.002*basque + 0.002*shannon + 0.002*cartesian + 0.002*andes + 0.002*felix + 0.001*christmas + 0.001*parsons
2015-05-13 20:22:52,502 : INFO : topic #28 (0.010): 0.003*congo + 0.002*bt + 0.002*engine + 0.002*enemy + 0.002*diet + 0.002*atlanta + 0.002*compound + 0.002*bug + 0.002*hilbert + 0.002*guthrie
2015-05-13 20:22:52,578 : INFO : topic #23 (0.010): 0.002*orioles + 0.002*est + 0.002*cortex + 0.002*calcium + 0.001*cholera + 0.001*columbia + 0.001*samoa + 0.001*easter + 0.001*baseball + 0.001*communist
2015-05-13 20:22:52,655 : INFO : topic #4 (0.010): 0.002*adobe + 0.002*aircraft + 0.002*brewster + 0.002*bliss + 0.002*albert + 0.001*cervical + 0.001*realism + 0.001*soccer + 0.001*burkina + 0.001*camel
2015-05-13 20:22:52,733 : INFO : topic #93 (0.010): 0.003*dominica + 0.002*behaviour + 0.002*australian + 0.002*antwerp + 0.002*claudius + 0.002*acm + 0.002*bt + 0.001*czech + 0.001*px + 0.001*ethiopia
2015-05-13 20:22:52,812 : INFO : topic #49 (0.010): 0.003*singer + 0.003*politician + 0.003*actor + 0.002*actress + 0.002*songwriter + 0.002*footballer + 0.002*painter + 0.001*armenian + 0.001*buffalo + 0.001*cardiff
2015-05-13 20:22:52,891 : INFO : topic #53 (0.010): 0.002*bar + 0.002*phillip + 0.002*arbitration + 0.002*equivalence + 0.002*constellations + 0.002*aa + 0.002*carboniferous + 0.002*diameter + 0.002*coral + 0.002*andrew
2015-05-13 20:22:52,970 : INFO : topic #70 (0.010): 0.003*binary + 0.002*dürer + 0.002*dipole + 0.002*entertainment + 0.002*ira + 0.002*bay + 0.002*coercion + 0.002*determinant + 0.002*cinema + 0.002*architect
2015-05-13 20:22:53,049 : INFO : topic #90 (0.010): 0.004*acid + 0.002*conditioning + 0.002*dublin + 0.002*barium + 0.002*khmer + 0.002*exponential + 0.002*caste + 0.002*brussels + 0.002*hume + 0.002*dune
2015-05-13 20:22:53,139 : INFO : topic #17 (0.010): 0.002*psychology + 0.001*singer + 0.001*beads + 0.001*lynch + 0.001*actress + 0.001*eta + 0.001*borneo + 0.001*bombardier + 0.001*celtic + 0.001*circumference
2015-05-13 20:22:53,267 : INFO : topic diff=63.770846, rho=1.000000
>>> model.save('wiki_lda.pkl')
2015-05-13 20:30:33,138 : INFO : saving LdaState object under wiki_lda.pkl.state, separately None
2015-05-13 20:30:33,322 : INFO : saving LdaModel object under wiki_lda.pkl, separately None
2015-05-13 20:30:33,323 : INFO : not storing attribute state
2015-05-13 20:30:33,324 : INFO : not storing attribute dispatcher
>>>
>>> topics=[]
>>> for doc in mm:
...    topics.append(model[doc])
... 
>>> 
>>> import numpy as np
>>> lens=np.array([len(t) for t in topics])
>>> print np.mean(lens)
9.46688034188
>>> print np.mean(lens <= 10)
0.577136752137
>>> 
>>> counts=np.zeros(100)
>>> for doc_top in topics:
...    for ti,_ in doc_top:
...      counts[ti] += 1
... 
>>> words=model.show_topic(counts.argmax(),64)
>>> print(words)
[(0.0019511514744959915, u'bar'), (0.0017204889523604191, u'album'), (0.0014997809217989011, u'deposition'), (0.0014561321088780057, u'prefecture'), (0.001412610803118459, u'medal'), (0.0014001989940011519, u'columbia'), (0.0013967128181081685, u'europa'), (0.001393409856782905, u'austin'), (0.0013604164377232532, u'etruscan'), (0.0013331262195861782, u'euclid'), (0.0013300347548554738, u'peirce'), (0.0013265628902306172, u'defendant'), (0.0013212767842717351, u'beads'), (0.001296087756180846, u'damascus'), (0.0012735981875431235, u'aw'), (0.0012567275897854878, u'botswana'), (0.0012132685543834306, u'elf'), (0.0011873116173288999, u'comic'), (0.0011594699813450219, u'comics'), (0.0011558146533967512, u'sony'), (0.0011457678546582734, u'alcohol'), (0.0011397616571998223, u'belief'), (0.0011377791475402123, u'principia'), (0.001123990253454664, u'christology'), (0.0011057081815832499, u'sat'), (0.0010900809551991945, u'examination'), (0.0010843325721316222, u'sovereign'), (0.0010645770100080913, u'lds'), (0.0010572014440389512, u'christ'), (0.0010433722774033656, u'jesus'), (0.0010382809253345444, u'indus'), (0.0010185539403574427, u'epistemology'), (0.00096854626618787538, u'costa'), (0.00096378144448323378, u'rica'), (0.00092972346440327216, u'turkish'), (0.00090123807667366715, u'click'), (0.00089160920978130801, u'missouri'), (0.00087301854700155878, u'fuller'), (0.00085897788648539196, u'euler'), (0.00085040099254520246, u'players'), (0.0008475911648792, u'steel'), (0.00081649846111695389, u'creature'), (0.0008072256714587835, u'km'), (0.00080612038252994935, u'defects'), (0.00080313519256669551, u'lawyer'), (0.00078919398182289869, u'germanic'), (0.00078555082412381898, u'anarchism'), (0.0007844950792263519, u'golf'), (0.00077538079310564615, u'patrician'), (0.00077411353336251452, u'tigers'), (0.00076895196279059609, u'pottery'), (0.00076311593891956664, u'conscious'), (0.00076022762732353972, u'distinguished'), (0.00075503932154170395, u'wizard'), (0.0007519006543297577, u'alexandria'), (0.00073471294624361514, u'drainage'), (0.00072321430554564447, u'witness'), (0.00072216619220361435, u'lebanon'), (0.00070989907535951545, u'string'), (0.00070257964664247161, u'transformers'), (0.0006906105101552075, u'aegean'), (0.00068860576872043747, u'atkinson'), (0.00068438253218207985, u'aachen'), (0.00068420848654377354, u'atoms')]
>>> 
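
Because model.save('wiki_lda.pkl') wrote the model out (with the state stored separately as wiki_lda.pkl.state, as the log shows), a later session can reload it instead of retraining. A minimal sketch, assuming the files sit in the current directory:

import gensim

# Reload the trained LDA model; gensim restores the separately saved state too
model = gensim.models.ldamodel.LdaModel.load('wiki_lda.pkl')

# Quick check that the reload worked: top 10 words of topic 0
print model.show_topic(0, 10)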

To make a word-cloud image I tried

http://www.wordle.net/

but Epiphany reported something like "missing plugin": the Java plugin was not working. Installing this one package made it display:

icedtea-plugin - web browser plugin to execute Java applets (dependency package)

$ sudo apt-get install icedtea-plugin


On Wordle's Advanced page, paste the words in the word:count format below (a sketch for generating this format from the LDA output follows the list).

belong:100
waves:238
treat:90
storm:238
manuscript:138
jerusalem:238
integral:28
ian:138
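
The pairs above are derived from the show_topic output. A minimal sketch of the conversion, assuming words still holds the (probability, word) tuples printed earlier; the scale factor is arbitrary, chosen only to turn tiny probabilities into Wordle-friendly integers:

# 'words' is the (probability, word) list from model.show_topic(counts.argmax(), 64)
for prob, word in words:
    print '%s:%d' % (word, int(prob * 100000))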

Playing with the Layout and Font options on the resulting image produced something that looks the part.


Reference: Data analysis: an LDA implementation (gensim)
http://openbook4.me/projects/193/sections/1154