From version < 4.1 >
edited by Paul Libbrecht
on 2025/04/17 11:21
To version < 6.1
edited by Paul Libbrecht
on 2025/04/17 21:43
Change comment: There is no comment for this version

... ... @@ -33,8 +33,20 @@
33 33  * the directory contains a `labels.txt` file with a list of label names, organized in a hierarchy by simple indenting
34 34  * the directory contains a file `pipeline.json` with the steps of the analysis and aggregation
35 35   * the pipeline refers to the model spacy directories (level 1 and one for the children of each level 1) which are included
36 -* the concept-map used called cmap.cxl and its source cmap.cmap (for dev)
36 +* the concept map used, called `cmap.cxl`, and its source `cmap.cmap` (for dev)
37 +* all models:
38 + * the `l1-model` directory is the spacy model for the classifier for the l1-topics
39 + * the `l2-models` directory contains one directory per l1-topic, each holding a spacy model for the sub-labels of that l1-topic
37 37  * any extra file or directory mentioned as link
38 -* any file used for development, documented by a `README.md`
41 +* the `tests.txt` file contains the test fragments so that the debug tool can be used right away, one line per fragment
42 +* the `log.txt` file contains the statistical output of the training and/or statistics: one line per label, one column per dimension
43 +* optionally, any file used for development, documented by a `README.md` (see below)
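
The indentation-based hierarchy of `labels.txt` mentioned in the list above could be read along the following lines. This is only a minimal sketch: the function name `parse_labels`, the dictionary layout of the result, and the assumption that a child is simply indented further than its parent (tabs counted as four spaces) are illustrative and not taken from the actual tooling.

```python
def parse_labels(text):
    """Parse indented label names into a list of nested dicts.

    Assumption (not from the page): a label is a child of the nearest
    preceding label that is indented less than it.
    """
    roots = []
    stack = []  # pairs of (indent width, node) for the currently open ancestors
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        expanded = raw.expandtabs(4)
        indent = len(expanded) - len(expanded.lstrip())
        node = {"name": raw.strip(), "children": []}
        # Close every ancestor that is at the same depth or deeper.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if stack:
            stack[-1][1]["children"].append(node)
        else:
            roots.append(node)
        stack.append((indent, node))
    return roots


if __name__ == "__main__":
    # Hypothetical label names, for illustration only.
    sample = "animals\n  cat\n  dog\nplants\n  tree\n"
    for top in parse_labels(sample):
        print(top["name"], [child["name"] for child in top["children"]])
```

With such a structure, the top-level entries would correspond to the l1-topics and their children to the sub-labels handled by the matching directory under `l2-models`.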
39 39  
40 40  All paths of links used in the `about.json` and `pipeline.json` files can be resolved in a relative manner. For them to be recognized, we recommend writing relative paths with a leading `./`, as in `"logo":"./my-logo.svg"`. This allows the web-app to perform relative resolution in a secure way (not going outside of the domain directory except for known places) before the path is given to the web-server or to the analysis scripts.
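
The relative resolution described above could look roughly like the sketch below. It is not the web-app's actual code: the function name `resolve_domain_link` is hypothetical, and the sketch rejects every path that leaves the domain directory, whereas the real resolution also allows some known places outside it.

```python
from pathlib import Path


def resolve_domain_link(domain_dir, link):
    """Resolve a `./`-prefixed link against the domain directory.

    Returns None for links that do not start with `./` (they are not
    treated as relative paths here) and raises ValueError when the
    resolved path would leave the domain directory.
    """
    if not link.startswith("./"):
        return None
    base = Path(domain_dir).resolve()
    target = (base / link).resolve()
    try:
        # relative_to raises ValueError when target is not below base
        target.relative_to(base)
    except ValueError:
        raise ValueError(f"link escapes the domain directory: {link}")
    return target
```

A link such as `./my-logo.svg` resolves to a file inside the domain directory, while `./../outside.txt` is refused before it reaches the web-server or the analysis scripts.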
46 +
47 +While the `README.md` should be the main entry point for the source work of creating the domain, we propose the following folder names:
48 +
49 +- `source-content`: a collection of files (e.g. PDFs, pictures, texts, pptx, ...) that represent the source input from where an extraction is made
50 +- `extracts`: the result of the extraction process, made of JSON files, one file (or one folder) per source collection
51 +- `annotations`: the result of the annotations exported from prodigy in the form of JSONL files
52 +- moreover, the instructions used and the log of all processes are visible in the `README.md` file
