Changes for page The AISOP recipe
Last modified by Paul Libbrecht on 2025/06/15 23:32
Change comment: There is no comment for this version

Summary

Page properties (2 modified, 0 added, 0 removed)

Details

Page properties

Author

@@ -1,1 +1,1 @@
-XWiki.AISOPAdmin
+XWiki.andisk

Content
@@ -57,6 +57,97 @@

* The least amount of content to be extracted is the complete set of slides and their comments. We recommend using past students' e-portfolios too. We had rather good experience with about 1000 fragments for a course.
* If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]].

==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====

Text extraction from PDFs is sometimes faulty, and many PDFs contain text only inside images. Tesseract (OCR) can be used to capture this text; a brief explanation of how to use it follows.

**Prerequisites**

~1. Tesseract must be installed (check with):

(% class="box" %)
(((
##tesseract ~-~-version##
)))

2. Poppler must be installed (on macOS e.g. via Homebrew):

(% class="box" %)
(((
##brew install poppler##
)))

**Code**

(% class="box" %)
(((
##for pdf in *.pdf; do
# Extract the base name of the PDF without the extension
basename="${pdf%.pdf}"
# Convert the PDF into PNGs, one per page
pdftoppm -png "$pdf" "$basename"
# OCR every page and append the text to a file named after the PDF
for png in "$basename"-*.png; do
  tesseract "$png" stdout -l deu ~-~-oem 1 | tr '\n' ' ' | sed 's/  */ /g' >> "$basename.txt"
  echo -e "\n\n~-~--\n\n" >> "$basename.txt"
done
done
rm *.png##
)))

**Explanation**

~1. ##for pdf in *.pdf; do ...; done##
* Loops through all PDF files in the directory.

2. ##basename="${pdf%.pdf}"##
* Extracts the filename of the PDF without the .pdf extension.

3. ##pdftoppm -png "$pdf" "$basename"##
* Converts the PDF into PNG images named BASENAME-1.png, BASENAME-2.png, etc.

4. ##for png in "$basename"-*.png; do ...; done##
* Processes only the PNG files generated from the current PDF.

5. ##tesseract "$png" stdout -l deu ~-~-oem 1##
* Performs OCR on the PNG file (here with the German language model ##deu##).

6. ##tr '\n' ' '##
* Replaces line breaks with spaces.

7. ##sed 's/  */ /g'##
* Reduces multiple spaces to a single space.

8. ##>> "$basename.txt"##
* Appends the recognized text to a text file with the same name as the PDF.

9. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##
* Adds a separator line (~-~--) after each page.

**Result**

* A separate text file is created for each PDF, e.g. file1.txt for file1.pdf and file2.txt for file2.pdf.
* The OCR results of all pages of the respective PDF are written into this text file.
* Each page is followed by a separator line (~-~--).
* The temporary PNG files are deleted at the end (note that ##rm *.png## removes all PNG files in the directory).

=== 1.3: Annotate Text Fragments ===

It is time to endow the fragments with topics so that we can recognize the topics of students' paragraphs. In AISOP, we have used the (commercial) [[prodigy>>https://prodi.gy/]] for this task in two steps, both of which iterate through all fragments to give them topics.
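The commands below read the fragments from a JSONL file such as ##fragments.jsonl##, with one JSON object per line and the fragment in a ##text## field. As a minimal sketch (not part of the original recipe), the per-PDF text files produced in step 1.2.1 could be converted into such a file by splitting at the ~-~-- page separators; the file names and the one-fragment-per-page granularity are illustrative assumptions.

(% class="box" %)
(((
##import glob
import json

# Assumption: the *.txt files produced by the OCR loop of step 1.2.1,
# with the pages separated by a "~-~--" line.
with open("fragments.jsonl", "w", encoding="utf-8") as out:
    for txt in sorted(glob.glob("*.txt")):
        with open(txt, encoding="utf-8") as f:
            pages = f.read().split("~-~--")
        for page in pages:
            fragment = page.strip()
            if not fragment:
                continue
            # prodigy reads one JSON object per line; "text" is the field that gets annotated
            meta = {"source": txt}
            out.write(json.dumps({"text": fragment, "meta": meta}, ensure_ascii=False) + "\n")##
)))

The ##meta## field is optional; prodigy displays it in the annotation interface, which helps to trace a fragment back to its source PDF.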
@@ -64,7 +64,8 @@

**The first step: top-level labels:** This is the simple [["text classifier" recipe>>https://prodi.gy/docs/recipes#textcat]] of prodigy: we can invoke the following command for this: ##prodigy textcat.manual the-course-name-l1 ./fragments.jsonl ~-~-label labels-depth1.txt##, which will offer a web interface on which each fragment is annotated with the (top-level) label. This web interface can be left running for several days.
Then extract the content into a file: ##prodigy db-out the-course-name-l1 > the-course-name-dbout.jsonl##

**The second step is the hierarchical annotation**, a [[custom recipe>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/hierarchical_annotation?ref_type=heads]] (link to become public soon): the same fragments are now annotated with the top-level annotation and all their children, e.g. using the command ##python -m prodigy subcat_annotate_with_top2 the-course-name-l2 \
 the-course-name-dbout.jsonl labels-all-depths.txt -F ./subcat_annotate_with_top2.py##.

The resulting dataset can be extracted out of prodigy using the db-out recipe, e.g. ##prodigy db-out the-course-name-l2 the-course-name-l2-dbout##, or can be converted to a spaCy dataset for training, e.g. using the command xxxxx (see [[here>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/it3/fundamental-principles]]).
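One possible way to build such a spaCy dataset, sketched here under assumptions rather than as the recipe's actual command: the db-out JSONL can be converted into a ##.spacy## file with spaCy's ##DocBin##. The sketch assumes prodigy's standard choice export format (##text##, ##options##, ##accept##, ##answer## fields); the custom recipe's export may differ, and the file names and the blank German pipeline are placeholders.

(% class="box" %)
(((
##import json
import spacy
from spacy.tokens import DocBin

# Assumption: prodigy choice-format export with "text", "options", "accept" and "answer" fields
nlp = spacy.blank("de")
db = DocBin()
with open("the-course-name-l2-dbout.jsonl", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        if example.get("answer") != "accept":
            continue
        doc = nlp.make_doc(example["text"])
        accepted = set(example.get("accept", []))
        # doc.cats maps every offered label to 1.0 (accepted) or 0.0 (not accepted)
        doc.cats = {str(opt["id"]): float(opt["id"] in accepted) for opt in example.get("options", [])}
        db.add(doc)
db.to_disk("the-course-name-l2.spacy")##
)))

The resulting ##.spacy## file can then be used as the training or development corpus in a spaCy training configuration.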