<
From version < 10.1 >
edited by Andreas Isking
on 2025/02/28 18:15
To version < 1.4 >
edited by AISOP Admin
on 2025/01/14 20:31
>
Change comment: There is no comment for this version

Summary

Details

Page properties
Author
... ... @@ -1,1 +1,1 @@
1 -XWiki.andisk
1 +XWiki.AISOPAdmin
Content
... ... @@ -9,7 +9,10 @@
9 9  )))
10 10  )))
11 11  
12 -
12 +(% class="row" %)
13 +(((
14 +(% class="col-xs-12 col-sm-8" %)
15 +(((
13 13  == Basic Terms ==
14 14  
15 15  The AISOP-web-app is used in the context of a course at a learning institution, which typically has a fixed set of students and fixed contents. A course can contain multiple courses or modules.
... ... @@ -22,6 +22,7 @@
22 22  
23 23  ----
24 24  
28 +(% class="wikigeneratedid" %)
25 25  == 1) Data Preparation ==
26 26  
27 27  === 1.1: Make a Concept Map ===
... ... @@ -36,14 +36,8 @@
36 36  > Flow Charts
37 37  > Programming
38 38  > Programming Paradigm
39 -> Imperative Programming
40 ->Data-Structure
41 -> ....
42 ->Operating System
43 -> ....
43 +> Imperative Programming....
44 44  
45 -We'll name this file labels-all-depths.txt. From this text file, extract a text file with only the top labels (in the extract above only Algorithmization, Data-Structure and Operating System), named labels-depth1.txt.
46 -
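The depth-1 extraction can be sketched in shell. This is a hypothetical sketch: it assumes labels-all-depths.txt encodes nesting as leading indentation, with top-level labels flush at the left margin; the actual file format may differ.

```shell
# Hypothetical sketch: assumes nesting in labels-all-depths.txt is
# encoded as leading indentation, with top-level labels unindented.
cat > labels-all-depths.txt <<'EOF'
Algorithmization
  Flow Charts
  Programming
Data-Structure
Operating System
EOF

# Keep only the unindented lines, i.e. the depth-1 labels.
grep -v '^[[:space:]]' labels-all-depths.txt > labels-depth1.txt
cat labels-depth1.txt
```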
47 47  === 1.2: Extract Text of the Course Content ===
48 48  
49 49  In order for the topic recognition to work, a model needs to be trained to recognize the words students use to denote one part or another of the course. This makes it possible to create relations between the concepts of the course and the paragraphs of the portfolio, and to offer these in the interactive dashboards. The training is the result of annotating fragments of text which first need to be extracted from their media, whether PDF files, PowerPoint slides, scanned texts or student works. These texts will not be shared, so even protected material or texts carrying personal information can be used.
... ... @@ -50,117 +50,12 @@
50 50  
51 51  Practically:
52 52  
53 -* Make all documents accessible for you to open and browse (e.g. download them or obtain the authorized accesses)
54 -* Install and launch the [[clipboard extractor>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/aisop-clipboard-extractor?ref_type=heads]], which will gather the copied fragments into a file
55 -* Go through all contents and copy each fragment; a fragment is expected to be about the size of a paragraph, so copy paragraph-sized chunks.
56 -* The extractor should have collected all the fragments in one file, which we shall call extraction.json.
57 -* At a minimum, extract the complete set of slides and their comments. We recommend using past students' e-portfolios too; we have had rather good experience with about 1000 fragments per course.
58 -* If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]].
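The linked merge-tool is the supported route; as an illustration only, and under the assumption that each extraction file holds a JSON array of fragment objects, such files could also be combined generically with jq:

```shell
# Hypothetical sketch: assumes each extraction file holds a JSON array
# of fragment objects. jq's slurp mode (-s) reads all input arrays and
# `add` concatenates them into a single array.
printf '[{"text":"fragment one"}]\n' > extraction-part1.json
printf '[{"text":"fragment two"}]\n' > extraction-part2.json
jq -s 'add' extraction-part1.json extraction-part2.json > extraction.json
jq 'length' extraction.json
```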
51 +* Assemble the documents
59 59  
60 -
61 -==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====
62 -
63 -Text extraction from PDFs is sometimes faulty. Additionally, many PDFs contain text embedded in images. To capture such text, Tesseract can be used; a brief explanation of how to use it follows.
64 -
65 -**Prerequisites**
66 -
67 -~1. Tesseract must be installed
68 -
69 -(% class="box" %)
70 -(((
71 -##tesseract ~-~-version##
72 -)))
73 -
74 -2. Poppler must be installed
75 -
76 -(% class="box" %)
77 -(((
78 -##brew install poppler##
79 -)))
80 -
81 -**Code**
82 -
83 -(% class="box" %)
84 -(((
85 -##for pdf in *.pdf; do
86 -# Extract the base name of the PDF without the extension
87 -basename="${pdf%.pdf}"
88 -\\# Convert PDF to PNGs
89 -pdftoppm -png "$pdf" "$basename"
90 -\\# Create a text file with the same name as the PDF.
91 -for png in "$basename"-*.png; do
92 - tesseract "$png" stdout -l deu ~-~-oem 1 | tr '~\~\n' ' ' | sed 's/  */ /g' >> "$basename.txt"
93 - echo -e "~\~\n~\~\n~-~--~\~\n~\~\n" >> "$basename.txt"
94 -done
95 -done
96 -rm *.png##
97 -)))
98 -
99 -**Explanation**
100 -
101 -~1. ##for pdf in *.pdf; do ...; done##
102 -
103 -• Loops through all PDF files in the directory.
104 -
105 -2. ##basename="${pdf%.pdf}"##
106 -
107 -• Extracts the filename of the PDF without the .pdf extension.
108 -
109 -3. ##pdftoppm -png "$pdf" "$basename"##
110 -
111 -• Converts the PDF into PNG images, named in the format BASENAME-1.png, BASENAME-2.png, etc.
112 -
113 -4. ##for png in "$basename"-*.png; do ...; done##
114 -
115 -• Processes only the PNG files generated from the current PDF.
116 -
117 -5. ##tesseract "$png" stdout -l deu ~-~-oem 1##
118 -
119 -• Performs OCR on the PNG file.
120 -
121 -6. ##tr '\n' ' '##
122 -
123 -• Replaces line breaks with spaces.
124 -
125 -7. ##sed 's/  */ /g'##
126 -
127 -• Reduces multiple spaces to a single space.
128 -
129 -8. ##>> "$basename.txt"##
130 -
131 -• Appends the recognized text to a text file with the same name as the PDF.
132 -
133 -9. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##
134 -
135 -• Adds a separator line (~-~--) after each page.
136 -
137 -**Result**
138 -
139 -• A separate text file is created for each PDF, e.g.:
140 -
141 -• file1.txt for file1.pdf
142 -
143 -• file2.txt for file2.pdf
144 -
145 -• The OCR results of all pages from the respective PDF are written into this text file.
146 -
147 -• Each page is separated by a separator line (~-~--).
148 -
149 -• Temporary PNG files are deleted at the end.
150 -
151 151  === 1.3: Annotate Text Fragments ===
152 152  
153 -It is time to endow the fragments with topics so that the topics of students' paragraphs can be recognized. In AISOP, we have used the (commercial) [[prodigy>>https://prodi.gy/]] tool for this task in two steps, both of which iterate through all fragments to assign topics.
55 +Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
154 154  
155 -**The first step: top-level labels:** This uses the simple [["text classifier" recipe>>https://prodi.gy/docs/recipes#textcat]] of prodigy; invoke the following command: ##prodigy textcat.manual the-course-name-l1 ./fragments.jsonl ~-~-label labels-depth1.txt##. This offers a web-interface on which each fragment is annotated with a (top-level) label; the web-interface can be left running for several days.
156 -Then extract the content into a file: ##prodigy db-out the-course-name-l1 > the-course-name-dbout.jsonl##
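The commands above read fragments.jsonl. Prodigy's manual recipes consume newline-delimited JSON with a ##text## field per example; a minimal such file could be created as follows (the fragment texts are illustrative):

```shell
# Minimal sketch of a fragments.jsonl as consumed by prodigy's manual
# recipes: one JSON object per line, each with a "text" field.
cat > fragments.jsonl <<'EOF'
{"text": "An algorithm is a finite sequence of well-defined steps."}
{"text": "Flow charts visualize the control flow of a program."}
EOF
grep -c '"text"' fragments.jsonl
```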
157 -
158 -**The second step: hierarchical annotation** via a [[custom recipe>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/hierarchical_annotation?ref_type=heads]] (link to become public soon): the same fragments are now annotated with their top-level annotation and all of its children, e.g. using the command ##python -m prodigy subcat_annotate_with_top2 the-course-name-l2 \
159 - the-course-name-dbout.jsonl labels-all-depths.txt  -F ./subcat_annotate_with_top2.py##.
160 -
161 -The resulting data-set can be extracted out of prodigy using the db-out recipe, e.g. ##prodigy db-out the-course-name-l2 the-course-name-l2-dbout##, or converted to a spaCy dataset for training, e.g. using the command xxxxx (see [[here>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/it3/fundamental-principles]]).
162 -
163 -
164 164  ----
165 165  
166 166  == 2) Deployment ==
... ... @@ -167,21 +167,19 @@
167 167  
168 168  === 2.1 Train a Recognition Model ===
169 169  
170 -See [[here>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/it3/fundamental-principles]].
63 +...
171 171  
172 172  === 2.2 Create a Pipeline ===
173 173  
174 -... write down the configuration JSON of the pipeline, get inspired [[pipeline-medieninderlehre.json>>https://gitlab.com/aisop/aisop-webapp/-/blob/main/config/couchdb/pipeline-medieninderlehre.json?ref_type=heads]]
67 +...
175 175  
176 176  === 2.3 Create a Seminar and Import Content ===
177 177  
178 178  ...
179 179  
180 -Create a seminar with the web-interface, associate the appropriate pipeline.
181 -
182 182  === 2.4 Interface with the composition platform ===
183 183  
184 -See the Mahara authorization configuration.
75 +...
185 185  
186 186  ----
187 187  
... ... @@ -202,3 +202,5 @@
202 202  === 3.4 Gather Enhancements ===
203 203  
204 204  ... on the web-app, on the creation process, and on the course
96 +)))
97 +)))
