(% class="jumbotron" %)
(((
(% class="container" %)
(((
= The AISOP recipe =

The AISOP web-app is a service built from various trainings and configurations.
This recipe explains how to extract the content fragments, annotate them, and train a model on them. This lets us create a pipeline and a seminar with which we can analyse portfolios.
)))
)))

== Basic Terms ==

The context in which the AISOP web-app is used is that of a course at a learning institution, which typically has fixed students and fixed contents. A course can contain multiple sub-courses or modules.

* **AISOP Web-app:** The Node.js server that interfaces with the portfolio-composing system.
* **Portfolio:** The content written by a student in order to represent his or her progress, learning and knowledge in a textual and graphical form. It is generally expressed in HTML and can be embedded in various web-pages.
* **Course-contents:** The set of slides, their annotations, the videos and handouts that are normally read by students and teachers.
* **Analysis:** The set of programmes that recognize and measure the contents of a portfolio. Often also the name of the resulting interactive presentation (which can feature summaries or enriched portfolio views).
* **Composition Platform:** A space where the portfolio is written, normally a web-space. In AISOP we have focused on the classical e-portfolio composition platform Mahara (a PHP server).

----

== 1) Data Preparation ==

=== 1.1: Make a Concept Map ===

Using a tool such as CMapTools, create a graphical concept map that represents the topics of the course. This concept map can be made familiar to the teachers and learners of the course as a way to show the paths through the content.

From the concept map, extract a .cxl file, which carries the same information and will be presented on the web-page.

From the concept map, also extract a hierarchy of topics, assuming there are more than approximately 10 topics in the map. The hierarchy should be a text file with one label per line, where a label is indented to the right to express a child relation, as in the following example:

>Algorithmization
> Flow Charts
> Programming
> Programming Paradigm
> Imperative Programming
>Data-Structure
> ....
>Operating System
> ....

We'll name this file labels-all-depths.txt. From this text file, extract a text file with only the top labels (in the extract above only Algorithmization, Data-Structure and Operating System), named labels-depth1.txt.

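For the extract above, labels-depth1.txt would then contain only the three top-level labels:

>Algorithmization
>Data-Structure
>Operating System
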
=== 1.2: Extract Text of the Course Content ===

In order for the topic recognition to work, a model needs to be trained that will recognize the words used by the students to denote one part or another of the course. This makes it possible to create relations between the concepts of the course and the paragraphs of the portfolio, and to offer these in the interactive dashboards. The training is the result of annotating fragments of text which first need to be extracted from their media, be they PDF files, PowerPoint slides, scanned texts or student works. These texts will not be shared, so even protected material or texts carrying personal information can be used.

Practically:

* Make all documents accessible for you to open and browse (e.g. download them or get the authorized accesses).
* Install and launch the [[clipboard extractor>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/aisop-clipboard-extractor?ref_type=heads]], which will gather the fragments in a text file.
* Go through all contents and copy each fragment. A fragment is expected to be the size of a paragraph, so this is what you should copy.
* The extractor should have copied all the fragments into one file, which we shall call extraction.json (see the sample line below).
* The minimum content to extract is the complete set of slides and their comments. We recommend using past students' e-portfolios too. We have had rather good experience with about 1000 fragments for a course.
* If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]].
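
Each extracted fragment ends up as one JSON record. As a point of reference, prodigy (used in step 1.3) reads its input as JSONL with one record per line; assuming the default ##text## field that prodigy's recipes read, a single line could look like the following placeholder:

(% class="box" %)
(((
##{"text": "One paragraph-sized fragment of the course content ..."}##
)))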

==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====

Text extraction from PDFs is sometimes faulty. Additionally, many PDFs contain images. To capture the text in these images, Tesseract (an OCR engine) can be used. A brief explanation of how to use it is provided here.

**Prerequisites**

~1. Tesseract must be installed (check with):

(% class="box" %)
(((
##tesseract ~-~-version##
)))

2. Poppler must be installed (e.g. via Homebrew on macOS):

(% class="box" %)
(((
##brew install poppler##
)))

**Code**

(% class="box" %)
(((
##for pdf in *.pdf; do
# Extract the base name of the PDF without the extension
basename="${pdf%.pdf}"
# Convert the PDF to PNGs (one per page)
pdftoppm -png "$pdf" "$basename"
# OCR each page and append the text to a file with the same name as the PDF
for png in "$basename"-*.png; do
tesseract "$png" stdout -l deu ~-~-oem 1 | tr '\n' ' ' | sed 's/  */ /g' >> "$basename.txt"
echo -e "\n\n~-~--\n\n" >> "$basename.txt"
done
done
rm *.png##
)))

**Explanation**

1. ##for pdf in *.pdf; do ...; done##: loops through all PDF files in the directory.
1. ##basename="${pdf%.pdf}"##: extracts the filename of the PDF without the .pdf extension.
1. ##pdftoppm -png "$pdf" "$basename"##: converts the PDF into PNG images, named in the format BASENAME-1.png, BASENAME-2.png, etc.
1. ##for png in "$basename"-*.png; do ...; done##: processes only the PNG files generated from the current PDF.
1. ##tesseract "$png" stdout -l deu ~-~-oem 1##: performs OCR on the PNG file (here with the German language model).
1. ##tr '\n' ' '##: replaces line breaks with spaces.
1. ##sed 's/  */ /g'##: reduces multiple spaces to a single space.
1. ##>> "$basename.txt"##: appends the recognized text to a text file with the same name as the PDF.
1. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##: adds a separator line (~-~--) after each page.

**Result**

* A separate text file is created for each PDF, e.g.:
** file1.txt for file1.pdf
** file2.txt for file2.pdf
* The OCR results of all pages from the respective PDF are written into this text file.
* Each page is separated by a separator line (~-~--).
* Temporary PNG files are deleted at the end.

=== 1.3: Annotate Text Fragments ===

It is time to annotate the fragments with topics so that the topics of students' paragraphs can be recognized. In AISOP, we have used the (commercial) [[prodigy>>https://prodi.gy/]] tool for this task, in two steps which both iterate through all fragments to assign them topics.

**The first step: top-level labels.** This is the simple [["text classifier" recipe>>https://prodi.gy/docs/recipes#textcat]] of prodigy; we can invoke the following command for this: ##prodigy textcat.manual the-course-name-l1 ./fragments.jsonl ~-~-label labels-depth1.txt##, which will offer a web-interface in which each fragment is annotated with a (top-level) label. This web-interface can be left running for several days.
Then extract the content into a file: ##prodigy db-out the-course-name-l1 > the-course-name-dbout.jsonl##

**The second step is the hierarchical annotation**, using a [[custom recipe>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/hierarchical_annotation?ref_type=heads]] (link to become public soon): the same fragments are now annotated with their top-level annotation and all of its children, e.g. using the command ##python -m prodigy subcat_annotate_with_top2 the-course-name-l2 \
the-course-name-dbout.jsonl labels-all-depths.txt -F ./subcat_annotate_with_top2.py##.

The resulting data-set can be extracted out of prodigy using the db-out recipe, e.g. ##prodigy db-out the-course-name-l2 the-course-name-l2-dbout##.


----

== 2) Deployment ==

Most of the steps below contribute to creating an //[[AISOP-domain>>the-AISOP-domains.WebHome]]//, a directory with all of the subject-specific information. This directory contains runtime information; it can first be composed as a source directory with all the annotated and source documents. It should be possible to share the AISOP-domain.

=== 2.1 Train a Series of Recognition Models ===

The annotation process above collects all annotations in a single JSONL file. It needs to be split for the different trainings.

For level 1, one creates a recognition model that differentiates among the level-1 topics by training on the annotated dataset:

##prodigy train the-course-name-l1 ~-~-base-model de_core_news_sm ~-~-lang de ~-~-textcat-multilabel the-course-name ~-~-label-stats##

This outputs a directory called ##the-course-name## containing the latest and the best models. Choose one of them as the model to be copied into the domain's ##l1-model## directory.

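For example, assuming the best model is chosen and the domain directory is called ##my-aisop-domain## (a placeholder name), the copy could look like:

(% class="box" %)
(((
##cp -r the-course-name/model-best my-aisop-domain/l1-model
# my-aisop-domain is a placeholder for your domain directory##
)))
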
Then, for each of the L1 topics, you need to separate the annotations into individual per-topic datasets (the exact splitting procedure is still to be documented; one possible approach is sketched below), which you then use in a training, e.g. here for Error-Correction:

##prodigy train the-course-name-l2-Error-Correction ~-~-base-model de_core_news_sm ~-~-lang de ~-~-textcat-multilabel the-course-name-Error-Correction ~-~-label-stats##

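One possible way to produce such a per-topic dataset is to filter the level-2 db-out export and re-import the filtered file as its own prodigy dataset. The sketch below is assumption-laden: it supposes that each exported record stores its chosen labels in an ##accept## array (as prodigy's choice-style annotations do) and that the top-level label string appears there verbatim.

(% class="box" %)
(((
##jq -c 'select(.accept and (.accept | index("Error-Correction") != null))' the-course-name-l2-dbout.jsonl > the-course-name-Error-Correction.jsonl
# assumes the chosen labels sit in an "accept" array; adapt the filter to the actual export format
prodigy db-in the-course-name-Error-Correction the-course-name-Error-Correction.jsonl##
)))
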
Inspecting the output statistics is an effective way to prevent low-quality results for some of the contents.

Copy each of the produced best models into the domain's ##l2-models## directory.

The two model directories should be put inside ##domains## at the root of the AISOP-webapp.

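For orientation, the domain pieces mentioned so far could be laid out as follows. This is only an illustrative sketch assembled from the steps above and from section 2.2; the file names are assumptions, not a normative structure.

(% class="box" %)
(((
### illustrative layout only
domains/the-course-name/concept-map.cxl
domains/the-course-name/l1-model/
domains/the-course-name/l2-models/Error-Correction/
domains/the-course-name/pipeline.json##
)))
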
=== 2.2 Create a Pipeline ===

Write down the configuration JSON of the pipeline; you can get inspired by [[pipeline-medieninderlehre.json>>https://gitlab.com/aisop/aisop-webapp/-/blob/main/config/couchdb/pipeline-medieninderlehre.json?ref_type=heads]].

The pipeline is the central configuration information, referencing the models and the scripts (found in the ##scripts/python## directory). After changing the pipeline and copying it into the domain, the web-application should be restarted.

=== 2.3 Test ===

A simple tool to analyze pasted sentences is available at URLs of the following schema:
##https://app-url/debug-classifier/model-l1/model-l2/?text=theText##. This allows verifying that the analysis of typical expected sentences performs correctly. Formulating tests as URLs containing all the information allows domain creators to collect a series of tests that they can run again after each adjustment (e.g. after a change of a training hyperparameter or a change of the annotations).

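For example, a quick check from the command line could look as follows (##app-url##, ##model-l1## and ##model-l2## are the placeholders from the schema above, and the sample text is URL-encoded):

(% class="box" %)
(((
##curl "https://app-url/debug-classifier/model-l1/model-l2/?text=A%20typical%20sentence%20from%20a%20portfolio"##
)))
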
=== 2.4 Create a Seminar and Import Content ===

Now that the pipeline is effective, we can create, within a group, a new __seminar__, which defines the pipeline and the associated views, selected from among all the views accessible to the group administrators.

The analyze button should then be visible to you. Consider that analyzing all portfolios within a seminar can take time; we have experienced several hours in some cases.

=== 2.5 Interface with the Composition Platform ===

The app installation is concluded by configuring the service for the Mahara platform; the instructions of [[aisop-oauth>>https://gitlab.com/aisop/aisop-oauth]] apply here.

----

== 3) Usage ==

Make sure that the users who are going to be reviewing others' e-portfolios are group administrators within Mahara.

=== 3.1 Invite Users ===

Within a learning management system, announce the availability of the web-app URL to the expected users of the AISOP-webapp. They will log in by authorising the app to download from Mahara. The invitation should contain a short description of the expected function.

The AISOP-webapp is likely able to support the students' writing process.

=== 3.2 Verify Imports and Analyses ===

Once a sufficient number of portfolios is available among the views of the seminars, it is possible to launch global analyses. These output synthetic graphs expressing the coverage of each topic. These graphs can be used in the classroom to reflect on the course.

=== 3.3 Observe Usage and Reflect on Quality ===

The AI-based observation can now be used: for each Mahara view, a portfolio-explorer and a global dashboard are available. Both offer an opportunity to observe the quality of the e-portfolios. They can also reveal flaws in the classification and, most probably, the appearance of topics that do not fit any of the classification's categories well.

=== 3.4 Gather Enhancements ===

As a first feedback after the use of the web-app, multiple refinements of the e-portfolios can be suggested; several of these can be discussed in class.

Further enhancements can be derived from the web-app usage: whether the proportion of allocated content matches the expectation for the course, whether the students' understanding has been suffering for particular contents, or whether some content parts are particularly popular.

Finally, enhancements can be considered for subsequent iterations of the course: enhancements to the classification, to the contents, to the annotation sources, or to the web-app and processes.
