From version < 10.1 > edited by Andreas Isking on 2025/02/28 18:15
To version < 8.1 > edited by AISOP Admin on 2025/01/14 21:22
Change comment: There is no comment for this version

Summary

Details

Page properties
Author
... ... @@ -1,1 +1,1 @@
1 -XWiki.andisk
1 +XWiki.AISOPAdmin
Content
... ... @@ -57,97 +57,6 @@
57 57  * The least amount of content to be extracted is the complete set of slides and their comments. We also recommend using past students' e-portfolios. We have had rather good experience with about 1000 fragments for a course.
58 58  * If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]]; a plain-shell alternative is sketched below.
59 59  
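If you just need a quick local merge and each partial file contains a JSON array of fragments, ##jq## can concatenate the arrays. This is only a minimal sketch under those assumptions (the file names are placeholders); the merge-tool linked above remains the recommended way.

(% class="box" %)
(((
##jq -s 'add' fragments-part1.json fragments-part2.json > fragments-merged.json
# file names are placeholders: -s (~-~-slurp) collects the input arrays, 'add' concatenates them##
)))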
60 -
61 -==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====
62 -
63 -Text extraction from PDFs is sometimes faulty, and many PDFs contain images whose text is not extracted at all. To capture this text, the OCR tool Tesseract can be used. A brief explanation of how to use it follows.
64 -
65 -**Prerequisites**
66 -
67 -~1. Tesseract must be installed (verify with):
68 -
69 -(% class="box" %)
70 -(((
71 -##tesseract ~-~-version##
72 -)))
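If the version check fails, Tesseract can typically be installed via Homebrew on macOS (a hedged hint, assuming Homebrew is available; the ##tesseract-lang## formula adds additional language data such as German, which is used below):

(% class="box" %)
(((
##brew install tesseract
# optional: additional language data such as German
brew install tesseract-lang##
)))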
73 -
74 -2. Poppler must be installed (on macOS, e.g. via Homebrew):
75 -
76 -(% class="box" %)
77 -(((
78 -##brew install poppler##
79 -)))
80 -
81 -**Code**
82 -
83 -(% class="box" %)
84 -(((
85 -##for pdf in *.pdf; do
86 -# Extract the base name of the PDF without the extension
87 -basename="${pdf%.pdf}"
88 -# Convert the PDF into one PNG per page
89 -pdftoppm -png "$pdf" "$basename"
90 -# Run OCR on each page and append the text to a file named after the PDF
91 -for png in "$basename"-*.png; do
92 -  tesseract "$png" stdout -l deu ~-~-oem 1 | tr '\n' ' ' | sed 's/  */ /g' >> "$basename.txt"
93 -  echo -e "\n\n~-~--\n\n" >> "$basename.txt"
94 -done
95 -done
96 -rm *.png##
97 -)))
98 -
99 -**Explanation**
100 -
101 -~1. ##for pdf in *.pdf; do ...; done##
102 -
103 -• Loops through all PDF files in the directory.
104 -
105 -2. ##basename="${pdf%.pdf}"##
106 -
107 -• Extracts the filename of the PDF without the .pdf extension.
108 -
109 -3. ##pdftoppm -png "$pdf" "$basename"##
110 -
111 -• Converts the PDF into PNG images, named in the format BASENAME-1.png, BASENAME-2.png, etc.
112 -
113 -4. ##for png in "$basename"-*.png; do ...; done##
114 -
115 -• Processes only the PNG files generated from the current PDF.
116 -
117 -5. ##tesseract "$png" stdout -l deu ~-~-oem 1##
118 -
119 -• Performs OCR on the PNG file.
120 -
121 -6. ##tr '\n' ' '##
122 -
123 -• Replaces line breaks with spaces.
124 -
125 -7. ##sed 's/  */ /g'##
126 -
127 -• Reduces multiple spaces to a single space.
128 -
129 -8. ##>> "$basename.txt"##
130 -
131 -• Appends the recognized text to a text file with the same name as the PDF.
132 -
133 -9. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##
134 -
135 -• Adds a separator line (~-~--) after each page.
136 -
137 -**Result**
138 -
139 -• A separate text file is created for each PDF, e.g.:
140 -
141 -• file1.txt for file1.pdf
142 -
143 -• file2.txt for file2.pdf
144 -
145 -• The OCR results of all pages from the respective PDF are written into this text file.
146 -
147 -• Each page is separated by a separator line (~-~--).
148 -
149 -• Temporary PNG files are deleted at the end.
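As a quick sanity check, the page separators can be counted per text file; the count should match the number of pages of the corresponding PDF. A minimal sketch, with ##file1.txt## as a placeholder:

(% class="box" %)
(((
##grep -c '^~-~--$' file1.txt
# counts lines consisting only of the separator, i.e. one per OCR'd page##
)))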
150 -
151 151  === 1.3: Annotate Text Fragments ===
152 152  
153 153  It is time to annotate the fragments with topics so that we can later recognize the topics of students' paragraphs. In AISOP, we have used the (commercial) [[prodigy>>https://prodi.gy/]] for this task in two steps, both of which iterate through all fragments to assign topics.
... ... @@ -155,8 +155,7 @@
155 155  **The first step: top-level labels:** This is the simple [["text classifier" recipe>>https://prodi.gy/docs/recipes#textcat]] of prodigy. Invoke ##prodigy textcat.manual the-course-name-l1 ./fragments.jsonl ~-~-label labels-depth1.txt##, which offers a web interface on which each fragment is annotated with its (top-level) label. This web interface can be left running for several days.
156 156  Then extract the content into a file: ##prodigy db-out the-course-name-l1 > the-course-name-dbout.jsonl##
157 157  
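The ##labels-depth1.txt## file passed via ##~-~-label## typically contains one top-level label per line. The entries below are only placeholders; use the top-level topics of your own course:

(% class="box" %)
(((
##topic-a
topic-b
topic-c##
)))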
158 -**The second step is the hierarchical annotation** [[custom recipe>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/hierarchical_annotation?ref_type=heads]] (link to become public soon): The same fragments are now annotated with the top-level annotation and all their children. E.g. using the command ##python -m prodigy subcat_annotate_with_top2 the-course-name-l2 \
159 - the-course-name-dbout.jsonl labels-all-depths.txt  -F ./subcat_annotate_with_top2.py##.
67 +**The second step is the hierarchical annotation** with a [[custom recipe>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/hierarchical_annotation?ref_type=heads]] (link to become public soon): the same fragments are now annotated with the top-level label and all of its children, e.g. using the command ##python -m prodigy subcat_annotate_with_top2 the-course-name-l2 the-course-name-dbout.jsonl labels-all-depths.txt -F ./subcat_annotate_with_top2.py##.
160 160  
161 161  The resulting dataset can be extracted from prodigy using the db-out recipe, e.g. ##prodigy db-out the-course-name-l2 the-course-name-l2-dbout##, or converted to a spaCy dataset for training, e.g. using the command xxxxx (see [[here>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/it3/fundamental-principles]]).
162 162  
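Before training, it can be useful to eyeball one exported record. A minimal sketch, assuming ##jq## is installed and reusing the dataset name from above:

(% class="box" %)
(((
##prodigy db-out the-course-name-l2 > the-course-name-l2-dbout.jsonl
# pretty-print the first annotated fragment
head -n 1 the-course-name-l2-dbout.jsonl | jq .##
)))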
