Workdocumentation 2022-08-19

From BITPlan ceur-ws Wiki

Participants

  • Wolfgang

Agenda

  • CEUR-WS Volume index.html fixes
  • dblp

CEUR-WS Volume index.html fixes

wf@capri:/hd/luxio/CEUR-WS/www/Vol-457$ git diff 8285e6269493d1dd8d6dde06e4f4805fedb4d3f5 index.html
diff --git a/www/Vol-457/index.html b/www/Vol-457/index.html
index 1fef928a2..21cf5bed7 100644
--- a/www/Vol-457/index.html
+++ b/www/Vol-457/index.html
@@ -34,7 +34,7 @@ owners.</font></p>
 
 <h1><a href="http://www.bgu.ac.il/~sturm/DE@CAiSE09/">DE@CAiSE'09</a><br> 
  
-Domain Engineering
+Domain Engineering</h1>
 
 <h3>Proceedings of the First International Workshop on Domain Engineering held in
 conjunction with <A href="http://caise09.thenetworkinstitute.eu/index.php">CAiSE'09</a> Conference</h3> 
@@ -91,4 +91,4 @@ Paul Johannesson<sup><font size=-1>2</font></sup>, Royal Institute of Technology
 </body>
 </html>
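
The fix above closes an h1 element that was left open. A minimal sketch of how such unbalanced headings could be detected, using only Python's standard html.parser; the checker class, the helper function and the shortened HTML sample are illustrative assumptions, not part of the CEUR-WS tooling:

```python
from html.parser import HTMLParser


class HeadingBalanceChecker(HTMLParser):
    """Counts opening and closing tags for a single element name."""

    def __init__(self, tag="h1"):
        super().__init__()
        self.tag = tag
        self.opened = 0
        self.closed = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.opened += 1

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.closed += 1


def unclosed_count(html, tag="h1"):
    """Return how many <tag> openings have no matching </tag>."""
    checker = HeadingBalanceChecker(tag)
    checker.feed(html)
    checker.close()
    return checker.opened - checker.closed


# Shortened sample modelled on the Vol-457 heading before and after the fix.
before = (
    '<h1><a href="http://www.bgu.ac.il/~sturm/DE@CAiSE09/">DE@CAiSE\'09</a><br>\n'
    "Domain Engineering\n"
    "<h3>Proceedings</h3>"
)
after = before.replace("Domain Engineering", "Domain Engineering</h1>")
print(unclosed_count(before), unclosed_count(after))  # prints: 1 0
```

Counting tags is deliberately crude: it flags a missing close but does not verify correct nesting.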

dblp

Import RDF Dump to QLever (39 min)

See Workdocumentation_2022-08-16#on_RWTH_Aachen_DBIS_i5_server for the preparations.

Steps with QLever Control script

Download and Indexing

wf@confident:/hd/torterra/dblp2022-08$ . ../qlever/qlever-control/qlever dblp

QLEVER CONFIG

Checking your PATH ...
Added the directory "/hd/torterra/qlever/qlever-control" to your PATH

Setting up bash autocompletion ...
Done, number of completions: 35

Creating new Qleverfile ...
Copied pre-configured Qleverfile for "dblp" into current directory.

Setup is complete
Type "qlever" and use autocompletion to see which actions are available. Add a
"show" in the end to see what an action does without executing it (for example,
"qlever index show"). Typing "qlever" without arguments gives some basic help
and pointers for further help. Edit your local "Qleverfile" to change settings.

wf@confident:/hd/torterra/dblp2022-08$ qlever get-data

This is the "qlever" script, call without argument for help

Executing "get-data":

wget -nc -O dblp.nt.gz https://dblp.org/rdf/dblp.nt.gz

Getting data using GET_DATA_CMD from Qleverfile ...

--2022-08-19 07:16:17--  https://dblp.org/rdf/dblp.nt.gz
Resolving dblp.org (dblp.org)... 192.76.146.204
Connecting to dblp.org (dblp.org)|192.76.146.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2793364255 (2.6G) [application/x-gzip]
Saving to: ‘dblp.nt.gz’

dblp.nt.gz          100%[===================>]   2.60G  43.3MB/s    in 64s     

2022-08-19 07:17:21 (41.8 MB/s) - ‘dblp.nt.gz’ saved [2793364255/2793364255]


wf@confident:/hd/torterra/dblp2022-08$ qlever index

This is the "qlever" script, call without argument for help

Executing "index":

bash -c "zcat dblp.nt.gz | IndexBuilderMain -F ttl -f - -i dblp -s dblp.settings.json --words-from-literals | tee dblp.index-log.txt"

bash: IndexBuilderMain: command not found

wf@confident:/hd/torterra/dblp2022-08$ 
Max RAM usage: 0.0 GB

wf@confident:/hd/torterra/dblp2022-08$ ls
Qleverfile  dblp.index-log.txt  dblp.nt.gz  dblp.settings.json
wf@confident:/hd/torterra/dblp2022-08$ vi Qleverfile 
# set USE_DOCKER = true, since IndexBuilderMain is not installed on the host
wf@confident:/hd/torterra/dblp2022-08$ qlever index

This is the "qlever" script, call without argument for help

Executing "index":

docker run -it --rm -u 1001:1001 -v /hd/torterra/dblp2022-08:/index -w /index --entrypoint bash --name qlever.dblp.index-build adfreiburg/qlever -c "zcat dblp.nt.gz | IndexBuilderMain -F ttl -f - -i dblp -s dblp.settings.json --words-from-literals | tee dblp.index-log.txt"

2022-08-19 05:19:12.735	- INFO:  QLever IndexBuilder, compiled on Mon Aug 15 05:40:57 UTC 2022 using git hash 406dda
2022-08-19 05:19:12.736	- INFO:  You specified the input format: TTL
2022-08-19 05:19:12.737	- INFO:  Locale was not specified in settings file, default is en_US
2022-08-19 05:19:12.737	- INFO:  You specified "locale = en_US" and "ignore-punctuation = 0"
2022-08-19 05:19:12.738	- INFO:  You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2022-08-19 05:19:12.738	- INFO:  You specified "num-triples-per-batch = 5,000,000", choose a lower value if the index builder runs out of memory
2022-08-19 05:19:12.738	- INFO:  Integers that cannot be represented by QLever will throw an exception (this is the default behavior)
2022-08-19 05:19:12.738	- INFO:  Processing input triples from /dev/stdin ...
2022-08-19 05:31:18.190	- INFO:  Triples converted: 100,000,000
2022-08-19 05:31:36.447	- INFO:  Triples converted: 200,000,000
2022-08-19 05:31:48.312	- INFO:  Done, total number of triples converted: 268,701,236
2022-08-19 05:31:48.318	- INFO:  Building prefix tree from internal vocabulary ...
2022-08-19 05:32:32.605	- INFO:  Computing maximally compressing prefixes (greedy algorithm) ...
2022-08-19 05:33:59.130	- INFO:  Reduction of size of internal vocabulary: 24%
2022-08-19 05:34:02.208	- INFO:  Writing compressed vocabulary to disk ...
2022-08-19 05:35:42.396	- INFO:  Creating a pair of index permutations ... 
2022-08-19 05:37:03.671	- INFO:  Statistics for PSO: #relations = 65, #blocks = 542, #triples = 268,672,977
2022-08-19 05:37:03.674	- INFO:  Statistics for POS: #relations = 65, #blocks = 542, #triples = 268,672,977
2022-08-19 05:37:03.675	- INFO:  Exchanging multiplicities for PSO and POS ...
2022-08-19 05:37:03.675	- INFO:  Writing meta data for PSO and POS ...
2022-08-19 05:37:08.712	- INFO:  Creating a pair of index permutations ... 
2022-08-19 05:38:11.124	- INFO:  Statistics for SPO: #relations = 44,834,357, #blocks = 342, #triples = 268,672,977
2022-08-19 05:38:11.124	- INFO:  Statistics for SOP: #relations = 44,834,357, #blocks = 342, #triples = 268,672,977
2022-08-19 05:38:11.124	- INFO:  Exchanging multiplicities for SPO and SOP ...
2022-08-19 05:38:21.281	- INFO:  Writing meta data for SPO and SOP ...
2022-08-19 05:38:21.385	- INFO:  Number of distinct patterns: 1,276
2022-08-19 05:38:21.385	- INFO:  Number of subjects with pattern: 44,834,357 [all]
2022-08-19 05:38:21.385	- INFO:  Total number of distinct subject-predicate pairs: 228,395,931
2022-08-19 05:38:21.385	- INFO:  Average number of predicates per subject: 5.1
2022-08-19 05:38:21.389	- INFO:  Average number of subjects per predicate: 3,625,332
2022-08-19 05:38:28.373	- INFO:  Creating a pair of index permutations ... 
2022-08-19 05:39:29.422	- INFO:  Statistics for OSP: #relations = 85,894,696, #blocks = 435, #triples = 268,672,977
2022-08-19 05:39:29.423	- INFO:  Statistics for OPS: #relations = 85,894,696, #blocks = 435, #triples = 268,672,977
2022-08-19 05:39:29.423	- INFO:  Exchanging multiplicities for OSP and OPS ...
2022-08-19 05:39:48.764	- INFO:  Writing meta data for OSP and OPS ...
2022-08-19 05:39:48.946	- INFO:  Index build completed
2022-08-19 05:39:49.086	- INFO:  
2022-08-19 05:39:49.086	- INFO:  Adding text index ...
2022-08-19 05:39:49.086	- INFO:  Considering each literal as a text record
2022-08-19 05:39:49.099	- INFO:  The git hash used to build this index was "406ddab3953b604f7f37e83307b8c3db5a3c04dd"
2022-08-19 05:39:49.100	- INFO:  Reading vocabulary from file dblp.vocabulary.internal ...
2022-08-19 05:39:58.361	- INFO:  Done, number of words: 92,096,717
2022-08-19 05:39:58.361	- INFO:  Building text vocabulary ...
2022-08-19 05:41:07.506	- INFO:  Writing vocabulary to file dblp.text.vocabulary ...
2022-08-19 05:41:07.592	- INFO:  Done, number of words: 9,463,510
2022-08-19 05:41:07.896	- INFO:  Building the half-inverted index lists ...
2022-08-19 05:46:10.425	- WARN:  Entity from text not in KB: "James Cummings and Ernest Schimmerling, editors. Lecture Note Series of the London Mathematical Society, vol. 406. Cambridge University Press, New York, xi + 419 pp. - Paul B. Larson, Peter Lumsdaine, and Yimu Yin. An introduction to Pmax forcing. pp. 5-23. - Simon Thomas and Scott Schneider. Countable Borel equivalence relations. pp. 25-62. - Ilijas Farah and Eric Wofsey. Set theory and operator algebras. pp. 63-119. - Justin Moore and David Milovich. A tutorial on set mapping reflection. pp. 121-144. - Vladimir G. Pestov and Aleksandra Kwiatkowska. An introduction to hyperlinear and sofic groups. pp. 145-185. - Itay Neeman and Spencer Unger. Aronszajn trees and the SCH. pp. 187-206. - Todd Eisworth, Justin Tatch Moore, and David Milovich. Iterated forcing and the Continuum Hypothesis. pp. 207-244. - Moti Gitik and Spencer Unger. Short extender forcing. pp. 245-263. - Alexander S. Kechris and Robin D. Tucker-Drob. The complexity of classification problems in ergodic theory. pp. 265-299. - Menachem Magidor and Chris Lambie-Hanson. On the strengths and weaknesses of weak squares. pp. 301-330. - Boban Veličković and Giorgio Venturi. Proper forcing remastered. pp. 331-362. - Asger ToÖrnquist and Martino Lupini. Set theory and von Neumann algebras. pp. 363-396. - W. Hugh Woodin, Jacob Davis, and Daniel RodrÍguez. The HOD dichotomy. pp. 397-419."
2022-08-19 05:47:50.808	- WARN:  Entity from text not in KB: "Natasha Dobrinen: James Cummings and Ernest Schimmerling, editors. Lecture Note Series of the London Mathematical Society, vol. 406. Cambridge University Press, New York, xi + 419 pp. - Paul B. Larson, Peter Lumsdaine, and Yimu Yin. An introduction to Pmax forcing. pp. 5-23. - Simon Thomas and Scott Schneider. Countable Borel equivalence relations. pp. 25-62. - Ilijas Farah and Eric Wofsey. Set theory and operator algebras. pp. 63-119. - Justin Moore and David Milovich. A tutorial on set mapping reflection. pp. 121-144. - Vladimir G. Pestov and Aleksandra Kwiatkowska. An introduction to hyperlinear and sofic groups. pp. 145-185. - Itay Neeman and Spencer Unger. Aronszajn trees and the SCH. pp. 187-206. - Todd Eisworth, Justin Tatch Moore, and David Milovich. Iterated forcing and the Continuum Hypothesis. pp. 207-244. - Moti Gitik and Spencer Unger. Short extender forcing. pp. 245-263. - Alexander S. Kechris and Robin D. Tucker-Drob. The complexity of classification problems in ergodic theory. pp. 265-299. - Menachem Magidor and Chris Lambie-Hanson. On the strengths and weaknesses of weak squares. pp. 301-330. - Boban Veličković and Giorgio Venturi. Proper forcing remastered. pp. 331-362. - Asger ToÖrnquist and Martino Lupini. Set theory and von Neumann algebras. pp. 363-396. - W. Hugh Woodin, Jacob Davis, and Daniel RodrÍguez. The HOD dichotomy. pp. 397-419. (2014)"
2022-08-19 05:49:30.949	- WARN:  Entity from text not in KB: "Tony Owen: Numerical Recipes Book (PASCAL) by William H. Press, Brian P. Flannery, Saul A. Teukolsky and William T. Vetterling Cambridge University Press, Cambridge, 1990, 759 pages including index (£30.00 hdb).Numerical Recipes Diskette (PASCAL) version 2.0 by William H. Press, et al. Cambridge University Press, Cambridge, 03 1990 (£21.50).Numerical Recipes Example Handbook (PASCAL) by William H. Press, Brian P. Flannery, Saul A. Teukolsky and William T. Vetterling Cambridge University Press, Cambridge, 09 1990, 223 pages including index of demonstrated procedures (£19·50, hdb).Numerical Recipes Example Diskette (PASCAL) version 2.0 by William H. Press et al. Cambridge University Press, Cambridge, 02 1990 (£21.50).Numerical Recipes Routines and Examples in Basic by Julian C. Sprott Cambridge University Press, Cambridge (paperback), 1991, 398 pages including index of programs (£19.50; pbk).Numerical Recipes Diskette Basic version 1.0 by Julian C. Sprott Cambridge University Press, Cambridge, 1991 (£21.50). (1992)"
2022-08-19 05:50:15.628	- WARN:  Number of mentions of entities not found in the vocabulary: 3
2022-08-19 05:55:07.011	- INFO:  Statistics for text index: #records = 32,052,337, #words = 256,962,549, #entities = 32,052,337, #blocks = 32,279,050
2022-08-19 05:55:12.745	- INFO:  Text index build completed
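
The duration of the index build can be read off the first and last timestamps of the log above. A small sketch, assuming every log line starts with a 23-character timestamp as shown; the helper names are made up for illustration:

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], TIMESTAMP_FORMAT)


def elapsed_minutes(first_line, last_line):
    """Minutes between the timestamps of two log lines."""
    delta = parse_timestamp(last_line) - parse_timestamp(first_line)
    return delta.total_seconds() / 60


# First and last line of the index build log above (messages shortened).
start = "2022-08-19 05:19:12.735\t- INFO:  QLever IndexBuilder"
end = "2022-08-19 05:55:12.745\t- INFO:  Text index build completed"
print(f"{elapsed_minutes(start, end):.1f} min")  # prints: 36.0 min
```

The indexing itself therefore took about 36 minutes. The 39 min in the heading presumably also covers the download and the Qleverfile change, assuming the container logs in UTC while the wget output shows local time.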

Server Start

qlever start

This is the "qlever" script, call without argument for help

Executing "start":

docker run -d --restart unless-stopped -u 1001:1001 -it -v /hd/torterra/qlever/dblp:/index -p 7015:7015 -w /index --entrypoint bash --name qlever.dblp adfreiburg/qlever -c "ServerMain -i dblp -j 8 -p 7015 -m 20 -c 5 -e 1 -k 100 -a \"dblp_620614028\" -t > dblp.server-log.txt" > /dev/null

Starting the QLever server in the background and waiting until it's ready (Ctrl+C will not kill it) ...

2022-08-19 06:02:25.290	- INFO:  QLever Server, compiled on Mon Aug 15 05:40:57 UTC 2022 using git hash 406dda
2022-08-19 06:02:25.294	- INFO:  Initializing server ...
2022-08-19 06:02:25.297	- INFO:  The git hash used to build this index was "406ddab3953b604f7f37e83307b8c3db5a3c04dd"
2022-08-19 06:02:25.298	- INFO:  Reading vocabulary from file dblp.vocabulary.internal ...
2022-08-19 06:02:33.264	- INFO:  Done, number of words: 92,096,717
2022-08-19 06:02:33.266	- INFO:  Registered PSO permutation: #relations = 65, #blocks = 542, #triples = 268,672,977
2022-08-19 06:02:33.267	- INFO:  Registered POS permutation: #relations = 65, #blocks = 542, #triples = 268,672,977
2022-08-19 06:02:33.268	- INFO:  Registered OPS permutation: #relations = 85,894,696, #blocks = 435, #triples = 268,672,977
2022-08-19 06:02:33.269	- INFO:  Registered OSP permutation: #relations = 85,894,696, #blocks = 435, #triples = 268,672,977
2022-08-19 06:02:33.270	- INFO:  Registered SPO permutation: #relations = 44,834,357, #blocks = 342, #triples = 268,672,977
2022-08-19 06:02:33.270	- INFO:  Registered SOP permutation: #relations = 44,834,357, #blocks = 342, #triples = 268,672,977
2022-08-19 06:02:33.270	- INFO:  Reading patterns from file dblp.index.patterns ...
2022-08-19 06:02:34.049	- INFO:  Reading vocabulary from file dblp.text.vocabulary ...
2022-08-19 06:02:34.424	- INFO:  Done, number of words: 9,463,510
2022-08-19 06:02:34.424	- INFO:  Reading metadata from file dblp.text.index ...
2022-08-19 06:02:36.068	- INFO:  Registered text index: #records = 32,052,337, #words = 256,962,549, #entities = 32,052,337, #blocks = 32,279,050
2022-08-19 06:02:36.232	- INFO:  Sorting random result tables to estimate the sorting performance of this machine ...
2022-08-19 06:02:37.124	- INFO:  Access token for restricted API calls is "****"
2022-08-19 06:02:37.124	- INFO:  The server is ready, listening for requests on port 7015 ...
2022-08-19 06:02:37.438	- INFO:  
2022-08-19 06:02:37.438	- INFO:  Request received via GET, no content type specified
2022-08-19 06:02:37.438	- INFO:  Alive check with message "from the qlever script"
2022-08-19 06:02:37.451	- INFO:  
2022-08-19 06:02:37.451	- INFO:  Request received via GET, no content type specified
2022-08-19 06:02:37.451	- INFO:  Setting index description to: "RDF from https://dblp.org/rdf/dblp.nt.gz, version from 19.08.2022 01:33"
2022-08-19 06:02:37.463	- INFO:  
2022-08-19 06:02:37.463	- INFO:  Request received via GET, no content type specified
2022-08-19 06:02:37.463	- INFO:  Setting text description to: "All literals, search with FILTER CONTAINS(?var, "...")"
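
Once the server is listening, queries can be sent over HTTP. The following is a sketch only: it assumes the server is reachable as localhost on port 7015, that it accepts a GET request with a "query" parameter as in the SPARQL 1.1 Protocol, and that it returns the standard SPARQL JSON result format. The function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the server started above is reachable on localhost; adjust as needed.
ENDPOINT = "http://localhost:7015"


def build_query_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as the URL of a GET request."""
    return endpoint + "/?" + urllib.parse.urlencode({"query": query})


def bindings_to_rows(result):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the result rows; needs a running server."""
    request = urllib.request.Request(
        build_query_url(query, endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return bindings_to_rows(json.load(response))


if __name__ == "__main__":
    # Only builds the URL; call run_query(...) when the server is up.
    print(build_query_url("SELECT * WHERE { ?s ?p ?o } LIMIT 1"))
```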

Test Queries

see https://dblp.org/rdf/schema.nt

classHistogramm

sparqlquery -qp ./queries.yaml -qn classHistogramm -en dblp -f mediawiki

query

SELECT ?c (COUNT(?c) AS ?count)
WHERE {
  ?subject a ?c
}
GROUP BY ?c
HAVING (?count >100)
ORDER BY DESC(?count)

result

c count
http://www.w3.org/1999/02/22-rdf-syntax-ns#List 19787573
http://purl.org/spar/datacite/ResourceIdentifier 11903671
https://dblp.org/rdf/schema#Publication 6255926
http://purl.org/spar/datacite/PersonalIdentifier 3240093
https://dblp.org/rdf/schema#Inproceedings 3084330
https://dblp.org/rdf/schema#Creator 3060810
https://dblp.org/rdf/schema#Person 3048893
https://dblp.org/rdf/schema#Article 2450094
http://purl.org/spar/datacite/Identifier 586067
https://dblp.org/rdf/schema#Informal 480275
https://dblp.org/rdf/schema#Book 109789
https://dblp.org/rdf/schema#Editorship 54116
https://dblp.org/rdf/schema#Incollection 41301
https://dblp.org/rdf/schema#Reference 27321
https://dblp.org/rdf/schema#AmbiguousCreator 11615
https://dblp.org/rdf/schema#Withdrawn 5334
https://dblp.org/rdf/schema#Data 3366
https://dblp.org/rdf/schema#Group 302
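
The counts above can be cross-checked against each other. The sketch below assumes that the nine listed publication kinds are the subclasses of dblp:Publication and that Person, AmbiguousCreator and Group are the subclasses of dblp:Creator; the grouping is an assumption drawn from the class names, and the numbers are copied from the result table.

```python
# Counts copied from the classHistogramm result above.
counts = {
    "Publication": 6255926,
    "Inproceedings": 3084330,
    "Creator": 3060810,
    "Person": 3048893,
    "Article": 2450094,
    "Informal": 480275,
    "Book": 109789,
    "Editorship": 54116,
    "Incollection": 41301,
    "Reference": 27321,
    "AmbiguousCreator": 11615,
    "Withdrawn": 5334,
    "Data": 3366,
    "Group": 302,
}

# Assumed subclass groupings (not taken from the log).
PUBLICATION_KINDS = [
    "Inproceedings", "Article", "Informal", "Book", "Editorship",
    "Incollection", "Reference", "Withdrawn", "Data",
]
CREATOR_KINDS = ["Person", "AmbiguousCreator", "Group"]


def subtotal(names):
    """Sum the instance counts of the given classes."""
    return sum(counts[name] for name in names)


print(subtotal(PUBLICATION_KINDS) == counts["Publication"])  # prints: True
print(subtotal(CREATOR_KINDS) == counts["Creator"])          # prints: True
```

Both sums match exactly, which suggests that in this dump every publication and every creator carries exactly one of the more specific types.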

propertyHistogramm

sparqlquery -qp ./queries.yaml -qn propertyHistogramm -en dblp -f mediawiki

query

SELECT ?property (COUNT(?property) AS ?propTotal)
WHERE { ?s ?property ?o . }
GROUP BY ?property
HAVING (?propTotal >1000)
ORDER BY DESC(?propTotal)

result

property propTotal
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 54150953
http://www.w3.org/1999/02/22-rdf-syntax-ns#first 19787573
http://www.w3.org/1999/02/22-rdf-syntax-ns#rest 19787573
https://dblp.org/rdf/schema#authoredBy 19657389
http://purl.org/spar/literal/hasLiteralValue 15739144
http://purl.org/spar/datacite/hasIdentifier 15729831
http://purl.org/spar/datacite/usesIdentifierScheme 15729831
http://www.w3.org/2002/07/owl#sameAs 10969110
https://dblp.org/rdf/schema#doi 10172400
http://www.w3.org/2000/01/rdf-schema#label 9316809
https://dblp.org/rdf/schema#bibtexType 6255926
https://dblp.org/rdf/schema#numberOfCreators 6255926
https://dblp.org/rdf/schema#title 6255926
https://dblp.org/rdf/schema#yearOfPublication 6255256
https://dblp.org/rdf/schema#orderedCreators 6211335
https://dblp.org/rdf/schema#listedOnTocPage 6145619
https://dblp.org/rdf/schema#publishedIn 6123761
https://dblp.org/rdf/schema#primaryElectronicEdition 6020680
https://dblp.org/rdf/schema#pagination 5420153
https://dblp.org/rdf/schema#publishedInBook 3158233
https://dblp.org/rdf/schema#publishedAsPartOf 3157180
https://dblp.org/rdf/schema#yearOfEvent 3089601
https://dblp.org/rdf/schema#primaryFullCreatorName 3060810
https://dblp.org/rdf/schema#publishedInJournal 2933815
https://dblp.org/rdf/schema#publishedInJournalVolume 2933288
https://dblp.org/rdf/schema#publishedInJournalVolumeIssue 2029455
https://dblp.org/rdf/schema#otherElectronicEdition 709704
https://dblp.org/rdf/schema#wikidata 573060
https://dblp.org/rdf/schema#editedBy 129940
https://dblp.org/rdf/schema#primaryAffiliation 117163
https://dblp.org/rdf/schema#orcid 111282
https://dblp.org/rdf/schema#thesisAcceptedBySchool 92572
https://dblp.org/rdf/schema#otherFullCreatorName 84015
https://dblp.org/rdf/schema#isbn 79299
https://dblp.org/rdf/schema#webpage 78998
https://dblp.org/rdf/schema#publishedBy 76004
https://dblp.org/rdf/schema#archivedElectronicEdition 65489
https://dblp.org/rdf/schema#primaryHomepage 46600
https://dblp.org/rdf/schema#publicationNote 37837
https://dblp.org/rdf/schema#otherAffiliation 34918
https://dblp.org/rdf/schema#publishedInSeries 31713
https://dblp.org/rdf/schema#publishedInSeriesVolume 26666
https://dblp.org/rdf/schema#monthOfPublication 11546
https://dblp.org/rdf/schema#wikipedia 7306
https://dblp.org/rdf/schema#otherHomepage 6438
https://dblp.org/rdf/schema#creatorNote 1740
http://www.w3.org/2002/07/owl#differentFrom 1142

CEUR-WS Papercount

sparqlquery -en dblp -qp ./dblp.yaml -qn "CEUR-WS Papercount" -f mediawiki

query

PREFIX dblp: <https://dblp.org/rdf/schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT (COUNT(?paper) as ?count)
WHERE { 
    ?proceeding dblp:publishedIn "CEUR Workshop Proceedings".
    ?paper dblp:publishedAsPartOf ?proceeding.
}

result

count
45158

CEUR-WS Counts

sparqlquery -en dblp -qp ./dblp.yaml -qn "CEUR-WS Counts" -f mediawiki

query

PREFIX dblp: <https://dblp.org/rdf/schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT (COUNT(DISTINCT ?author) as ?numberOfAuthors) 
       (COUNT(DISTINCT ?paper) as ?numberOfPapers) 
       (COUNT(DISTINCT ?editor) as ?numberOfEditors)
       (COUNT(DISTINCT ?proceeding) as ?numberOfVolumes)
WHERE { 
    ?proceeding dblp:publishedIn "CEUR Workshop Proceedings".
    OPTIONAL{?proceeding dblp:editedBy ?editor}
    OPTIONAL{
        ?paper dblp:publishedAsPartOf ?proceeding.
        OPTIONAL{?paper dblp:authoredBy ?author}
    }

}

result

numberOfAuthors numberOfPapers numberOfEditors numberOfVolumes
71260 45158 4665 2399
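
The query above uses COUNT(DISTINCT ...) throughout because the joins yield one row per combination of editor, paper and author for each volume, so a plain COUNT would overcount. A toy illustration with entirely hypothetical data (one volume, two editors, two papers); it does not model the unbound values that OPTIONAL produces for missing editors or authors:

```python
from itertools import product

# Hypothetical toy data, not taken from dblp.
editors = {"Vol-1": ["editorA", "editorB"]}
papers = {"Vol-1": ["paper1", "paper2"]}
authors = {"paper1": ["author1", "author2"], "paper2": ["author2"]}


def solution_rows(volume):
    """Rows produced for one volume: each editor paired with each (paper, author)."""
    paper_author = [(p, a) for p in papers[volume] for a in authors[p]]
    return [
        (volume, editor, paper, author)
        for editor, (paper, author) in product(editors[volume], paper_author)
    ]


rows = solution_rows("Vol-1")
print(len(rows))                        # 6 rows: 2 editors x 3 (paper, author) pairs
print(len({row[2] for row in rows}))    # 2 distinct papers
print(len({row[3] for row in rows}))    # 2 distinct authors
```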