This page describes what an AISOP domain is (a project that reflects the material of a course) and how it is used in the AISOP web-app.

Purpose of a domain

A domain contains all course-specific information used by the AISOP web-application, so that the web-app can be applied to a seminar, display analyses, and classify the portfolios made in that course in a meaningful manner.

Ingredients of a domain

Minimum (this allows the AISOP web-app to run):

  • title and description
  • set of labels: those of level 1 and those of the other levels
  • concept-map (exported in CXL) in which the labels are nodes
  • spaCy models for the text-classification of paragraphs of portfolios made in this course: one for level 1, and one for each level 1 topic (to classify its subtopics)
  • sequence of analysis scripts and aggregation scripts that deliver the portfolio explorer and portfolio dashboard

Optional (this allows others to further develop the domain):

  • source content which contains the sentences used for training
  • extracted sentences/fragments
  • annotations of these extracted sentences with the labels of level 1 and of the other levels
  • annotation statistics and model training results (in the form of statistics)
  • test sentences to verify the proper elementary function of the classifiers
  • test portfolios to verify the proper function

Packaging of a domain

We propose that a domain be packaged as a directory, which can be shared as a repository, with the following organization:

  • the directory-name reflects the course name
  • the directory contains a file about.json with the properties title, description, language (in ISO 639-3), authors (an array of strings), subjects (an array of strings containing the LC subject classification), and logo (the link to a logo)
  • the directory contains a file license.txt with the license text
  • the directory contains a labels.txt file with a list of label names, organized in a hierarchy by simple indenting
  • the directory contains a file pipeline.json with the steps of the analysis and aggregation
    • the pipeline refers to the spaCy model directories (one for level 1 and one for the children of each level 1 topic), which are included
  • the concept-map used, called cmap.cxl, and its source cmap.cmap (for development)
  • all models:
    • the l1-model directory is the spaCy model of the classifier for the l1-topics
    • the l2-models directory contains one directory per l1-topic, each holding a spaCy model for the sub-labels of this l1-topic
  • any extra file or directory mentioned as a link
  • the tests.txt file contains the test fragments so that the debug tool can be used right away, one line per fragment
  • the log.txt file contains the statistical output of the training and/or the annotation statistics: one line per label, one column per dimension
  • optionally, any file used for development, documented by a README.md (see below)
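The indentation-based hierarchy of labels.txt can be read with a few lines of code. The following is a minimal sketch, not the parser of the AISOP web-app: it assumes that deeper indentation means a deeper level, that a tab counts as four spaces, and that blank lines are ignored. The function name parse_labels and the sample label names are hypothetical.

```python
from typing import Dict


def parse_labels(text: str) -> Dict[str, dict]:
    """Turn an indented list of label names into a nested dict.

    Each key is a label name; its value is the dict of its sub-labels.
    """
    root: Dict[str, dict] = {}
    # Stack of (indent width, children dict); the root sits at indent -1.
    stack = [(-1, root)]
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        expanded = raw.expandtabs(4)  # assumption: one tab = four spaces
        indent = len(expanded) - len(expanded.lstrip())
        name = raw.strip()
        # Climb back up until the top of the stack is a shallower line.
        while stack[-1][0] >= indent:
            stack.pop()
        children: Dict[str, dict] = {}
        stack[-1][1][name] = children
        stack.append((indent, children))
    return root


# Hypothetical content of a labels.txt file.
SAMPLE = """\
Learning theories
    Behaviourism
    Constructivism
Assessment
    Formative assessment
"""

labels = parse_labels(SAMPLE)
l1_topics = list(labels)  # the labels of level 1, in file order
```

With this structure, the keys of the top-level dict are the l1-topics, which is also the list of sub-directories one would expect inside l2-models.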

All link paths used in the about.json and pipeline.json files can be resolved relative to the domain directory. For them to be recognized, we recommend writing relative paths starting with ./ as in "logo":"./my-logo.svg". This allows the web-app to perform relative resolution in a secure way (not going outside of the domain directory except for known places) before the path is given to the web-server or to the analysis scripts.
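As an illustration of such a resolution, here is a minimal sketch, assuming that only links starting with ./ are treated as domain-relative and that any other link (a URL, for example) is left to the caller. The function name resolve_in_domain, the directory name my-course and all values in the about dictionary are hypothetical; the exception for "known places" outside the domain directory is not modeled.

```python
from pathlib import Path
from typing import Optional


def resolve_in_domain(domain_dir: Path, link: str) -> Optional[Path]:
    """Resolve a ./-prefixed link inside domain_dir.

    Returns None for links that do not start with "./" and raises
    ValueError if the resolved path leaves the domain directory.
    """
    if not link.startswith("./"):
        return None  # not a domain-relative path in this sketch
    base = domain_dir.resolve()
    target = (base / link).resolve()  # collapses any ".." segments
    try:
        target.relative_to(base)
    except ValueError:
        raise ValueError(f"{link!r} points outside of the domain directory")
    return target


# Hypothetical content of an about.json file, shown as a Python dict.
about = {
    "title": "My Course",
    "description": "Portfolio analysis for an example seminar.",
    "language": "deu",
    "authors": ["Jane Doe"],
    "subjects": ["LB"],
    "logo": "./my-logo.svg",
}

logo_path = resolve_in_domain(Path("my-course"), about["logo"])
```

A link such as "./../other-domain/model" is rejected by this check, because its resolved form no longer lies below the domain directory.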

While the README.md should be the main entry point for the source work for creating the domain, we propose the following folder names:

  • source-content: a collection of files (e.g. PDFs, pictures, texts, pptx, ...) that represent the source input from where an extraction is made
  • extracts: the result of the extraction process, made of JSON files, one file or one folder per source collection
  • annotations: the annotations exported from Prodigy in the form of JSONL files
  • moreover, the instructions used and the log of all processes are visible in the README.md file
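The JSONL files of the annotations folder hold one JSON object per line, which makes simple annotation statistics easy to compute. The sketch below is an assumption-laden example: it supposes that each record carries the fields "text", "label" and "answer", and that only records answered with "accept" count. The actual fields depend on the Prodigy recipe used, and the function name count_accepted_labels and the sample records are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_accepted_labels(lines: Iterable[str]) -> Counter:
    """Count how often each label was accepted in JSONL annotation lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate empty lines
        record = json.loads(line)
        # Assumed record shape: {"text": ..., "label": ..., "answer": ...}
        if record.get("answer") == "accept" and "label" in record:
            counts[record["label"]] += 1
    return counts


# Hypothetical annotation records, one JSON object per line.
SAMPLE_JSONL = [
    '{"text": "Feedback was given weekly.", "label": "Assessment", "answer": "accept"}',
    '{"text": "Rewards shape behaviour.", "label": "Learning theories", "answer": "accept"}',
    '{"text": "The room was cold.", "label": "Assessment", "answer": "reject"}',
    '{"text": "A quiz closed each unit.", "label": "Assessment", "answer": "accept"}',
]

label_counts = count_accepted_labels(SAMPLE_JSONL)
```

The same function accepts an open file object, since iterating over a text file yields its lines.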