Wiki source code of The AISOP recipe

Version 10.1 by Andreas Isking on 2025/02/28 18:15

version	line-number	content
1.1	1	(% class="jumbotron" %)
	2	(((
	3	(% class="container" %)
	4	(((
1.4	5	= The AISOP recipe =
1.1	6
1.4	7	The AISOP webapp is a service that results from several training and configuration steps.
	8	This recipe explains how to extract content fragments, annotate them, and train a model on them. With that model we can create a pipeline and a seminar in which portfolios are analysed.
1.1	9	)))
	10	)))
	11
1.5	12
1.4	13	== Basic Terms ==
1.1	14
1.4	15	The AISOP web-app is used in the context of a course at a learning institution, which typically has a fixed set of students and fixed contents. A course can contain multiple courses or modules.
1.1	16
1.4	17	* AISOP Web-app: The nodeJS server that interfaces with the portfolio-composing system.
	18	* Portfolio: the content written by a student in order to represent his or her progress, learning and knowledge using a textual and graphical form. Generally expressed in HTML, can be embedded in various web-pages.
	19	* Course-contents: The set of slides, their annotations, the videos and handouts that are normally read by students and teachers.
	20	* Analysis: The set of programmes that recognize and measure the contents of a portfolio. Often also the name of the resulting interactive presentation (which can feature summaries or enriched portfolio views).
	21	* Composition Platform: A space where the portfolio is written, normally a web-space. In AISOP we have focussed on the classical e-portfolio composition platform Mahara (a PHP server).
1.1	22
1.4	23	----
1.1	24
1.4	25	== 1) Data Preparation ==
1.1	26
1.4	27	=== 1.1: Make a Concept Map ===
1.1	28
1.4	29	Using a tool such as CMapTools, create a graphical concept map that represents the topics of the course. This concept map may already be familiar to the teachers and learners of the course as a way to show the paths through the content.
1.1	30
1.4	31	From the concept map, extract a .cxl file which carries the same information and will be presented on the web-page.
1.1	32
1.4	33	From the concept map, also extract a hierarchy of topics if the map contains more than approximately 10 topics. The hierarchy should be a text file with one label per line, where a child label is indented further to the right than its parent, as in the following example:
1.1	34
1.4	35	>Algorithmization
	36	> Flow Charts
	37	> Programming
	38	> Programming Paradigm
1.6	39	> Imperative Programming
	40	>Data-Structure
	41	> ....
	42	>Operating System
	43	> ....
1.1	44
1.6	45	We'll name this file labels-all-depths.txt. From this text file, extract a text file with only the top labels (in the extract above only Algorithmization, Data-Structure and Operating System), named labels-depth1.txt.
	46
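The top-level labels can be pulled out mechanically. The following is a minimal sketch, assuming that top-level labels start in the first column and that child labels are indented with spaces or tabs. The function name ##top_labels## is hypothetical and not part of the AISOP tooling.

```shell
#!/bin/sh
# Hypothetical helper (not part of the AISOP tooling):
# print the lines of a label hierarchy that start in the first column.
# Assumption: child labels are indented with spaces or tabs.
top_labels() {
  grep -E '^[^[:space:]]' "$1"
}

# Write labels-depth1.txt when the hierarchy file is present.
if [ -f labels-all-depths.txt ]; then
  top_labels labels-all-depths.txt > labels-depth1.txt
fi
```

Empty lines are dropped as well, since they do not start with a non-blank character. If the hierarchy file uses a different indentation convention, the pattern needs to be adapted.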
1.4	47	=== 1.2: Extract Text of the Course Content ===
1.1	48
1.4	49	For topic recognition to work, a model must be trained to recognize the words students use to refer to one part or another of the course. This makes it possible to relate the concepts of the course to the paragraphs of the portfolio and to offer these relations in the interactive dashboards. The model is trained on annotated text fragments, which first need to be extracted from their media, be they PDF files, PowerPoint slides, scanned texts or student works. These texts will not be shared, so even protected material or texts carrying personal information can be used.
1.1	50
1.4	51	Practically:
1.1	52
1.6	53	* Make all documents accessible for you to open and browse (e.g. download them or obtain the necessary access rights).
	54	* Install and launch the [[clipboard extractor>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/aisop-clipboard-extractor?ref_type=heads]], which will gather the fragments in a text file.
	55	* Go through all contents and copy each fragment. A fragment is expected to be the size of a paragraph, so this is what you should copy.
	56	* The extractor should have copied all the fragments into one file, which we shall call extraction.json.
	57	* At a minimum, extract the complete set of slides and their comments. We recommend using past students' e-portfolios too. We had rather good experience with about 1000 fragments for a course.
	58	* If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]].
1.1	59
10.1	60
	61	==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====
	62
	63	Text extraction from PDFs is sometimes faulty, and many PDFs contain text only as images. To capture this text, the OCR engine Tesseract can be used. A brief explanation of how to use it is provided here.
	64
	65	Prerequisites
	66
	67	~1. Tesseract must be installed
	68
	69	(% class="box" %)
	70	(((
	71	##tesseract ~-~-version##
	72	)))
	73
	74	2. Poppler must be installed (it provides ##pdftoppm##); with Homebrew, for example:
	75
	76	(% class="box" %)
	77	(((
	78	##brew install poppler##
	79	)))
	80
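The script below calls Tesseract with ##-l deu##, so the German language data must be installed as well. The following is a minimal sketch of such a check, based on ##tesseract ~-~-list-langs##; the helper name ##has_lang## is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the language code given as $1
# appears on a line of its own in the list read from standard input.
has_lang() {
  grep -qx "$1"
}

# Only run the check when tesseract is on the PATH.
# Older Tesseract versions print the list to stderr, hence 2>&1.
if command -v tesseract >/dev/null 2>&1; then
  if tesseract --list-langs 2>&1 | has_lang deu; then
    echo "German language data (deu) is available"
  else
    echo "German language data (deu) is missing" >&2
  fi
fi
```

For material in another language, replace ##deu## both in this check and in the script with the corresponding Tesseract language code.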
	81	Code
	82
	83	(% class="box" %)
	84	(((
	85	##for pdf in *.pdf; do
	86	# Extract the base name of the PDF without the extension
	87	basename="${pdf%.pdf}"
	88	\\# Convert PDF to PNGs
	89	pdftoppm -png "$pdf" "$basename"
	90	\\# Create a text file with the same name as the PDF.
	91	for png in "$basename"-*.png; do
	92	tesseract "$png" stdout -l deu ~-~-oem 1 \| tr '~\~\n' ' ' \| sed -E 's/ +/ /g' >> "$basename.txt"
	93	echo -e "~\~\n~\~\n~-~--~\~\n~\~\n" >> "$basename.txt"
	94	done
	95	rm -f "$basename"-*.png
	96	done##
	97	)))
	98
	99	Explanation
	100
	101	~1. ##for pdf in *.pdf; do ...; done##
	102
	103	• Loops through all PDF files in the directory.
	104
	105	2. ##basename="${pdf%.pdf}"##
	106
	107	• Extracts the filename of the PDF without the .pdf extension.
	108
	109	3. ##pdftoppm -png "$pdf" "$basename"##
	110
	111	• Converts the PDF into PNG images, named in the format BASENAME-1.png, BASENAME-2.png, etc.
	112
	113	4. ##for png in "$basename"-*.png; do ...; done##
	114
	115	• Processes only the PNG files generated from the current PDF.
	116
	117	5. ##tesseract "$png" stdout -l deu ~-~-oem 1##
	118
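	119	• Performs OCR on the PNG file with the German language data (##-l deu##) and the LSTM engine (##~-~-oem 1##), writing the recognized text to standard output.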
	120
	121	6. ##tr '\n' ' '##
	122
	123	• Replaces line breaks with spaces.
	124
	125	7. ##sed -E 's/ +/ /g'##
	126
	127	• Reduces multiple spaces to a single space.
	128
	129	8. ##>> "$basename.txt"##
	130
	131	• Appends the recognized text to a text file with the same name as the PDF.
	132
	133	9. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##
	134
	135	• Adds a separator line (~-~--) after each page.
	136
	137	Result
	138
	139	• A separate text file is created for each PDF, e.g.:
	140
	141	• file1.txt for file1.pdf
	142
	143	• file2.txt for file2.pdf
	144
	145	• The OCR results of all pages from the respective PDF are written into this text file.
	146
	147	• Each page is separated by a separator line (~-~--).
	148
	149	• Temporary PNG files are deleted at the end.
	150
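A quick sanity check of the result is to compare the number of separator lines in each text file with the number of pages of the PDF. The following is a minimal sketch, assuming the script was run once in a clean directory (it appends with ##>>##, so a second run would double the content). The helper name ##count_separators## is hypothetical; ##pdfinfo## is part of Poppler.

```shell
#!/bin/sh
# Hypothetical helper: count the separator lines (---) in an OCR text file.
# Each page processed by the loop above contributes exactly one such line.
count_separators() {
  grep -c '^---$' "$1"
}

# Compare with the page count reported by pdfinfo, if it is available.
for pdf in *.pdf; do
  [ -f "$pdf" ] || continue
  txt="${pdf%.pdf}.txt"
  [ -f "$txt" ] || continue
  command -v pdfinfo >/dev/null 2>&1 || break
  pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
  found=$(count_separators "$txt")
  [ "$pages" = "$found" ] || echo "$pdf: $pages pages, $found separators" >&2
done
```

A mismatch usually means that a page was skipped or that the text file contains output from more than one run.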
1.4	151	=== 1.3: Annotate Text Fragments ===
	152
1.6	153	It is time to assign topics to the fragments so that the topics of students' paragraphs can be recognized. In AISOP, we have used the (commercial) [[prodigy>>https://prodi.gy/]] for this task, in two steps that each iterate through all fragments to give them topics.
1.1	154
8.1	155	The first step assigns the top-level labels. It uses prodigy's simple [["text classifier" recipe>>https://prodi.gy/docs/recipes#textcat]], invoked with the following command: ##prodigy textcat.manual the-course-name-l1 ./fragments.jsonl ~-~-label labels-depth1.txt## This starts a web-interface in which each fragment is annotated with a (top-level) label. The web-interface can be left running for several days.
	156	Then extract the annotations into a file: ##prodigy db-out the-course-name-l1 > the-course-name-dbout.jsonl##
1.6	157
9.1	158	The second step is the hierarchical annotation [[custom recipe>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/hierarchical_annotation?ref_type=heads]] (link to become public soon): The same fragments are now annotated with the top-level annotation and all their children. E.g. using the command ##python -m prodigy subcat_annotate_with_top2 the-course-name-l2 \
	159	the-course-name-dbout.jsonl labels-all-depths.txt -F ./subcat_annotate_with_top2.py##.
1.6	160
1.7	161	The resulting data-set can be extracted out of prodigy using the db-out recipe, e.g. ##prodigy db-out the-course-name-l2 the-course-name-l2-dbout##, or can be converted to a spaCy dataset for training, e.g. using the command xxxxx (see [[here>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/it3/fundamental-principles]]).
1.6	162
	163
1.4	164	----
1.1	165
1.4	166	== 2) Deployment ==
1.1	167
1.4	168	=== 2.1 Train a Recognition Model ===
1.1	169
1.7	170	See [[here>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/it3/fundamental-principles]].
1.4	171
	172	=== 2.2 Create a Pipeline ===
	173
1.7	174	... write down the configuration JSON of the pipeline, taking inspiration from [[pipeline-medieninderlehre.json>>https://gitlab.com/aisop/aisop-webapp/-/blob/main/config/couchdb/pipeline-medieninderlehre.json?ref_type=heads]]
1.4	175
	176	=== 2.3 Create a Seminar and Import Content ===
	177
	178	...
	179
1.7	180	Create a seminar with the web-interface, associate the appropriate pipeline.
	181
1.4	182	=== 2.4 Interface with the Composition Platform ===
	183
1.7	184	See the Mahara authorization configuration.
1.4	185
	186	----
	187
	188	== 3) Usage ==
	189
	190	=== 3.1 Invite Users ===
	191
	192	...
	193
	194	=== 3.2 Verify Imports and Analyses ===
	195
	196	...
	197
	198	=== 3.3 Observe Usage and Reflect on Quality ===
	199
	200	...
	201
	202	=== 3.4 Gather Enhancements ===
	203
	204	... on the web-app, on the creation process, and on the course
