Changes for page The AISOP recipe
Last modified by Paul Libbrecht on 2025/06/15 23:32
Change comment:
There is no comment for this version
Summary
Page properties (2 modified, 0 added, 0 removed)
Details
- Page properties
- Author
... ... @@ -1,1 +1,1 @@
1 -XWiki.polx
1 +XWiki.AISOPAdmin
- Content
... ... @@ -57,97 +57,6 @@
57 57 * The least amount of content to be extracted is the complete set of slides and their comments. We recommend using past students' e-portfolios too. We have had rather good experience with about 1000 fragments for a course.
58 58 * If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]].
59 59 
60 -
61 -==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====
62 -
63 -Text extraction from PDFs is sometimes faulty. Additionally, many PDFs contain images. To capture the text in these images, Tesseract can be used. A brief explanation of how to use it is provided here.
64 -
65 -**Prerequisites**
66 -
67 -~1. Tesseract must be installed
68 -
69 -(% class="box" %)
70 -(((
71 -##tesseract ~-~-version##
72 -)))
73 -
74 -2. Poppler must be installed
75 -
76 -(% class="box" %)
77 -(((
78 -##brew install poppler##
79 -)))
80 -
81 -**Code**
82 -
83 -(% class="box" %)
84 -(((
85 -##for pdf in *.pdf; do
86 -# Extract the base name of the PDF without the extension
87 -basename="${pdf%.pdf}"
88 -\\# Convert PDF to PNGs
89 -pdftoppm -png "$pdf" "$basename"
90 -\\# Create a text file with the same name as the PDF.
91 -for png in "$basename"-*.png; do
92 - tesseract "$png" stdout -l deu ~-~-oem 1 | tr '~\~\n' ' ' | sed 's/  */ /g' >> "$basename.txt"
93 - echo -e "~\~\n~\~\n~-~--~\~\n~\~\n" >> "$basename.txt"
94 -done
95 -done
96 -rm *.png##
97 -)))
98 -
99 -**Explanation**
100 -
101 -~1. ##for pdf in *.pdf; do ...; done##
102 -
103 -• Loops through all PDF files in the directory.
104 -
105 -2. ##basename="${pdf%.pdf}"##
106 -
107 -• Extracts the filename of the PDF without the .pdf extension.
108 -
109 -3. ##pdftoppm -png "$pdf" "$basename"##
110 -
111 -• Converts the PDF into PNG images, named in the format BASENAME-1.png, BASENAME-2.png, etc.
112 -
113 -4.
##for png in "$basename"-*.png; do ...; done##
114 -
115 -• Processes only the PNG files generated from the current PDF.
116 -
117 -5. ##tesseract "$png" stdout -l deu ~-~-oem 1##
118 -
119 -• Performs OCR on the PNG file.
120 -
121 -6. ##tr '\n' ' '##
122 -
123 -• Replaces line breaks with spaces.
124 -
125 -7. ##sed 's/  */ /g'##
126 -
127 -• Reduces multiple spaces to a single space.
128 -
129 -8. ##>> "$basename.txt"##
130 -
131 -• Appends the recognized text to a text file with the same name as the PDF.
132 -
133 -9. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##
134 -
135 -• Adds a separator line (~-~--) after each page.
136 -
137 -**Result**
138 -
139 -• A separate text file is created for each PDF, e.g.:
140 -
141 -• file1.txt for file1.pdf
142 -
143 -• file2.txt for file2.pdf
144 -
145 -• The OCR results of all pages of the respective PDF are written into this text file.
146 -
147 -• Each page is separated by a separator line (~-~--).
148 -
149 -• Temporary PNG files are deleted at the end.
150 -
151 151 === 1.3: Annotate Text Fragments ===
152 152 
153 153 It is time to endow the fragments with topics so that the topics of students' paragraphs can be recognized. In AISOP, we have used the (commercial) [[prodigy>>https://prodi.gy/]] for this task in two steps, both of which iterate through all fragments to assign them topics.
... ... @@ -158,7 +158,7 @@
158 158 **The second step is the hierarchical annotation** [[custom recipe>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/hierarchical_annotation?ref_type=heads]] (link to become public soon): The same fragments are now annotated with the top-level annotation and all their children, e.g. using the command ##python -m prodigy subcat_annotate_with_top2 the-course-name-l2 \
the-course-name-dbout.jsonl labels-all-depths.txt -F ./subcat_annotate_with_top2.py##.
159 159 
160 160 
161 -The resulting dataset can be extracted out of prodigy using the db-out recipe, e.g.
##prodigy db-out the-course-name-l2 the-course-name-l2-dbout##.
70 +The resulting dataset can be extracted out of prodigy using the db-out recipe, e.g. ##prodigy db-out the-course-name-l2 the-course-name-l2-dbout##, or can be converted to a spaCy dataset for training, e.g. using the command xxxxx (see [[here>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/it3/fundamental-principles]]).
162 162 
163 163 
164 164 ----
... ... @@ -165,77 +165,40 @@
165 165 
166 166 == 2) Deployment ==
167 167 
168 -Most of the steps below contribute to creating an _[[AISOP-domain>>the-AISOP-domains.WebHome]]_, a directory with all of the subject-specific information. This directory contains runtime information; it can first be composed as a source directory with all the annotated and source documents. It should be possible to share the AISOP-domain.
77 +=== 2.1 Train a Recognition Model ===
169 169 
170 -=== 2.1 Train a Series of Recognition Models ===
79 +See [[here>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/it3/fundamental-principles]].
171 171 
172 -The annotation process above will collect all annotations in a single JSONL file. It needs to be split for the different trainings.
173 -
174 -For level 1, one creates a recognition model by training on the annotated dataset.
175 -
176 -For differentiating among the topics of level 1:
177 -
178 -##prodigy train the-course-name-l1 ~-~-base-model de_core_news_sm ~-~-lang de ~-~-textcat-multilabel the-course-name ~-~-label-stats##
179 -
180 -This outputs a directory called ##the-course-name## with the latest and the best models. Choose one to be copied inside ##l1-model## of the domain.
181 -
182 -Then, for each of the L1 topics, you need to separate the annotations (??? Pierre ???) into individual files, which you then involve in a training. E.g.
here for Error-Correction:
183 -
184 -##prodigy train the-course-name-l2-Error-Correction ~-~-base-model de_core_news_sm ~-~-lang de ~-~-textcat-multilabel the-course-name-Error-Correction ~-~-label-stats##
185 -
186 -Inspecting the output statistics is an effective way to prevent low-quality results for some of the contents.
187 -
188 -Copy each of the produced best models into the domain's ##l2-models## directory.
189 -
190 -The two model directories should be put inside ##domains## at the root of the AISOP-webapp.
191 -
192 192 === 2.2 Create a Pipeline ===
193 193 
194 194 ... write down the configuration JSON of the pipeline; get inspired by [[pipeline-medieninderlehre.json>>https://gitlab.com/aisop/aisop-webapp/-/blob/main/config/couchdb/pipeline-medieninderlehre.json?ref_type=heads]]
195 195 
196 -The pipeline is the central configuration information referencing the models and the scripts (found in the ##scripts/python## directory). After changing the pipeline and copying it inside the domain, the web-application should be restarted.
85 +=== 2.3 Create a Seminar and Import Content ===
197 197 
198 -=== 2.3 Test ===
87 +...
199 199 
200 -A simple tool to analyze pasted sentences is available using the following schema:
201 -##https://app-url/debug-classifier/model-l1/model-l2/?text=theText##. This allows verifying that the analysis of expected classical sentences performs correctly. Formulating URLs that contain all the information allows creators of domains to collect a series of tests that they can run again after each adjustment (e.g. after a change of a training hyperparameter or a change of the annotations).
89 +Create a seminar with the web-interface and associate the appropriate pipeline.
202 202 
203 -=== 2.4 Create a Seminar and Import Content ===
204 -
205 -Now that the pipeline is effective, we can create, within a group, a new __seminar__, which defines the pipeline and the associated views, from within all views accessible to the group administrators.
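The series of test URLs described in section 2.3 can be kept in a small script and replayed after each adjustment. A minimal sketch, in which the host, model names, and sentences are placeholder assumptions; pipe each printed URL into ##curl -s## to actually query the app:

```shell
# Sketch: build the series of debug-classifier test URLs so the same checks
# can be replayed after each training or annotation change.
# APP_URL, the model names, and the sentences are placeholders, not real values.
make_test_urls() {
  APP_URL="https://app-url"
  for text in "Ein Beispielsatz zur Fehlerkorrektur" "Noch ein Testsatz"; do
    # minimal URL encoding (spaces only); encode fully in real use
    encoded=$(printf '%s' "$text" | sed 's/ /%20/g')
    echo "$APP_URL/debug-classifier/model-l1/model-l2/?text=$encoded"
  done
}

# Prints one test URL per sentence.
make_test_urls
```

Keeping the sentence list under version control next to the domain makes the checks repeatable across annotation and hyperparameter changes.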
206 -
207 -The Analyze button should then be visible to you. Consider that analyzing all portfolios within a seminar can take time; we have experienced several hours in some cases.
208 -
209 209 === 2.4 Interface with the composition platform ===
210 210 
211 -The app installation is concluded with the configuration of the service for the Mahara platform: the instructions of [[aisop-oauth>>https://gitlab.com/aisop/aisop-oauth]] apply here.
93 +See the Mahara authorization configuration.
212 212 
213 213 ----
214 214 
215 215 == 3) Usage ==
216 216 
217 -Make sure that the users who are going to be revising others' e-portfolios are group administrators within Mahara.
218 -
219 219 === 3.1 Invite Users ===
220 220 
221 -Within a learning management system, announce the availability of the web-app URL to the expected users of the AISOP-webapp. They will log in by authorising the app to download from Mahara. The invitation should contain a short description of the expected function.
101 +...
222 -
223 -The AISOP-webapp is likely able to support the students' writing process.
224 224 
225 225 === 3.2 Verify Imports and Analyses ===
226 226 
227 -Once a sufficient number of portfolios is available among the views of the seminars, it is possible to launch global analyses. These output synthetic graphs expressing the coverage of each topic. These graphs can be used in the classroom to reflect on the course.
105 +...
228 228 
229 229 === 3.3 Observe Usage and Reflect on Quality ===
230 230 
231 -The AI-based observation can be used and, for each Mahara view, a portfolio-explorer and a global-dashboard are available. Both offer an opportunity to observe the quality of the e-portfolios. They can also reveal flaws in the classifications and, most probably, the appearance of topics that do not fit any of the classes well.
109 +...
232 232 
233 -
234 -
235 235 === 3.4 Gather Enhancements ===
236 236 
237 -A first round of feedback arises right after the use of the web-app.
Multiple refinements of the e-portfolios could be suggested; several of these can be discussed in class.
238 -
239 -Further enhancements can be derived from the web-app usage: whether the proportion of allocated content matches the expectations from the course, whether the students' understanding has suffered for particular contents, or whether some content parts are particularly popular.
240 -
241 -Finally, enhancements can be reflected upon for subsequent editions of the course: enhancements to the classification, to the contents, to the annotation sources, or to the web-app and processes.
113 +... on the web-app, on the creation process, and on the course
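Whether the proportion of allocated content matches the expectations can already be estimated from the annotation export. A minimal sketch, assuming a Prodigy db-out JSONL in which the accepted labels appear in an ##"accept"## array; the file name is a placeholder:

```shell
# Sketch: count accepted topic labels in a Prodigy db-out JSONL export.
# Assumes one JSON object per line with an "accept":["Label", ...] field.
topic_counts() {
  grep -o '"accept": *\[[^]]*\]' "$1" \
    | grep -o '"[^"]*"' \
    | grep -v '^"accept"$' \
    | sort | uniq -c | sort -rn
}

# Example: topic_counts the-course-name-l2-dbout.jsonl
```

The resulting per-topic counts give a first impression of coverage before consulting the synthetic graphs in the web-app.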