AISOP domains

Last modified by Paul Libbrecht on 2025/04/17 21:43

This page describes what an AISOP domain is: a project that reflects the course material. It also describes how a domain is used in the AISOP web-app.

Purpose of a domain

A domain contains all the course-specific information used by the AISOP web-application, so that the application can be applied to a seminar, display analyses, and classify the portfolios made in these courses in a meaningful manner.

Ingredients of a domain

Minimum: this allows the AISOP web-app to run:

title and description
set of labels of level 1 and of other levels
concept-map (exported in CXL) where the labels are nodes
spaCy models that allow the text classification of paragraphs of portfolios made in this course: one model for level 1, and one for each level 1 topic (to classify its subtopics)
sequence of analysis scripts and aggregation scripts to deliver the portfolio explorer and portfolio dashboard
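
The two-level classification described above can be sketched as follows. This is only an illustration under assumptions: the function name classify_paragraph, the label names, and the rule of picking the highest-scoring label are hypothetical and not taken from the AISOP code. The sketch relies only on the fact that a spaCy text-classification pipeline, when called on a text, returns a document whose cats attribute maps labels to scores. A stand-in model is used so that the example runs without trained models.

```python
from types import SimpleNamespace


def classify_paragraph(text, l1_model, l2_models):
    """Return (level 1 label, sub-label or None) for one paragraph.

    l1_model: callable returning an object with a `cats` dict (label -> score),
              as a spaCy text-classification pipeline does.
    l2_models: dict mapping each level 1 label to such a callable.
    Assumption: the best label is simply the one with the highest score.
    """
    l1_scores = l1_model(text).cats
    l1_label = max(l1_scores, key=l1_scores.get)
    l2_model = l2_models.get(l1_label)
    if l2_model is None:
        # No sub-classifier is packaged for this level 1 topic.
        return l1_label, None
    l2_scores = l2_model(text).cats
    return l1_label, max(l2_scores, key=l2_scores.get)


def fake_model(scores):
    """Stand-in for a loaded spaCy model: always returns the given scores."""
    return lambda text: SimpleNamespace(cats=scores)


# Hypothetical labels, for illustration only.
l1 = fake_model({"algebra": 0.8, "analysis": 0.2})
l2 = {"algebra": fake_model({"groups": 0.3, "rings": 0.7})}
result = classify_paragraph("Every ideal is a subring.", l1, l2)
```

With real models, the stand-ins would presumably be replaced by pipelines loaded with spacy.load from the model directories of the domain.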

Optional: this allows others to further develop the domain

source content which contains the sentences used for training
extracted sentences/fragments
annotations for these extracted sentences within the labels of level 1 and the others
annotation statistics and model training results (in the form of statistics)
test sentences to verify the proper elementary function of the classifiers
test portfolios to verify the proper function

Packaging of a domain

We propose that a domain be packaged as a directory, which can be shared as a repository, with the following organization:

the directory-name reflects the course name
the directory contains a file about.json with the properties title, description, language (in ISO 639-3), an array of strings for the authors, subjects (an array of strings containing the LC subject classification), and logo (the link to a logo)
the directory contains a file license.txt with the license text
the directory contains a labels.txt file with a list of label names, organized in a hierarchy by simple indenting
the directory contains a file pipeline.json with the steps of the analysis and aggregation
- the pipeline refers to the spaCy model directories (one for level 1 and one for the children of each level 1 label), which are included
the concept-map used, called cmap.cxl, and its source cmap.cmap (for development)
all models:
- the l1-model directory is the spaCy model for the classifier for the l1-topics
- the l2-models directory contains one directory for each l1-topic, each holding a spaCy model for the sub-labels of this l1-topic
any extra file or directory mentioned as a link
the tests.txt file contains the test fragments so that the debug tool can be used right away, one line per fragment
the log.txt file contains the statistical output of the training and/or statistics: one line per label, one column per dimension
optionally, any file used for development, documented by a README.md (see below)
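
As an illustration of the labels.txt format, here is a minimal sketch of how an indented label list could be read into a hierarchy. The function names, the sample labels, and the rule that any deeper indentation marks a child are assumptions made for this example; the page does not specify the indentation unit.

```python
def parse_labels(text):
    """Map each label to its parent label (None for level 1 labels).

    Assumption: a line indented more deeply than a preceding line is a
    child of the nearest such line; tabs count as 4 spaces.
    """
    parents = {}
    stack = []  # (indent width, label) for the current chain of ancestors
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        line = raw.expandtabs(4)
        indent = len(line) - len(line.lstrip(" "))
        label = line.strip()
        # Drop ancestors that are at the same depth or deeper.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parents[label] = stack[-1][1] if stack else None
        stack.append((indent, label))
    return parents


def level1_labels(parents):
    """Return the labels without a parent, in file order."""
    return [label for label, parent in parents.items() if parent is None]


# Hypothetical content of a labels.txt file.
sample = "Algebra\n  Groups\n  Rings\nAnalysis\n  Limits\n"
parents = parse_labels(sample)
```

Here parents maps "Groups" and "Rings" to "Algebra", and level1_labels(parents) lists the level 1 labels, which correspond to the sub-directories expected in l2-models.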

All paths of links used in the about.json and pipeline.json files can be resolved in a relative manner. For them to be recognized, we recommend writing relative paths with a leading ./ as in "logo":"./my-logo.svg". This allows the web-app to perform relative resolution in a secure way (not going outside of the domain directory except for known places) before the path is given to the web-server or to the analysis scripts.
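
The secure relative resolution mentioned above could look like the following sketch. It is not the web-app's actual code: the function name resolve_in_domain and the directory name are hypothetical, and the "known places" outside the domain directory are left out for simplicity.

```python
from pathlib import Path


def resolve_in_domain(domain_dir, link):
    """Resolve a ./-prefixed link against the domain directory.

    Returns None for links that do not start with ./ (they are not treated
    as relative paths here) and raises ValueError when the resolved path
    would leave the domain directory.
    """
    if not link.startswith("./"):
        return None
    root = Path(domain_dir).resolve()
    candidate = (root / link).resolve()
    # Accept the root itself or anything located below it.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"link escapes the domain directory: {link}")
    return candidate


# Hypothetical domain directory name; the path does not need to exist.
logo_path = resolve_in_domain("my-course", "./my-logo.svg")
```

A link such as "./../secret.txt" raises ValueError in this sketch, while an absolute URL is returned as None and left to the caller to handle.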

While the README.md should be the main entry point for the source work for creating the domain, we propose the following folder names:

source-content: a collection of files (e.g. PDFs, pictures, texts, pptx, ...) that represent the source input from where an extraction is made
extracts: the result of the extraction process, made of JSON files, one file or one folder per source collection
annotations: the result of the annotations, exported from Prodigy in the form of JSONL files
moreover, the instructions used and the log of all processes are visible in the README.md file
