Changes for page The AISOP recipe
Last modified by Paul Libbrecht on 2025/06/15 23:32
Change comment: There is no comment for this version

Summary

Page properties (2 modified, 0 added, 0 removed)

Details

Page properties

Author

@@ -1,1 +1,1 @@
-XWiki.AISOPAdmin
+XWiki.andisk

Content
@@ -57,6 +57,97 @@

* The least amount of content to be extracted is the complete set of slides and their comments. We recommend using past students' e-portfolios too. We had rather good experience with about 1000 fragments for a course.
* If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]].

==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====

Text extraction from PDFs is sometimes faulty, and many PDFs contain text only inside images. Tesseract (OCR) can be used to capture this text; a brief explanation of how to use it follows.

**Prerequisites**

~1. Tesseract must be installed (check with):

(% class="box" %)
(((
##tesseract ~-~-version##
)))

2. Poppler must be installed (on macOS e.g. via Homebrew):

(% class="box" %)
(((
##brew install poppler##
)))

**Code**

(% class="box" %)
(((
##for pdf in *.pdf; do
# Extract the base name of the PDF without the extension
basename="${pdf%.pdf}"
# Convert the PDF into PNGs, one per page
pdftoppm -png "$pdf" "$basename"
# OCR every page and append the text to a file named after the PDF
for png in "$basename"-*.png; do
  tesseract "$png" stdout -l deu ~-~-oem 1 | tr '\n' ' ' | sed 's/  */ /g' >> "$basename.txt"
  echo -e "\n\n~-~--\n\n" >> "$basename.txt"
done
done
rm *.png##
)))

**Explanation**

~1. ##for pdf in *.pdf; do ...; done##
* Loops through all PDF files in the directory.

2. ##basename="${pdf%.pdf}"##
* Extracts the filename of the PDF without the .pdf extension.

3. ##pdftoppm -png "$pdf" "$basename"##
* Converts the PDF into PNG images named BASENAME-1.png, BASENAME-2.png, etc.

4. ##for png in "$basename"-*.png; do ...; done##
* Processes only the PNG files generated from the current PDF.

5. ##tesseract "$png" stdout -l deu ~-~-oem 1##
* Performs OCR on the PNG file (here with the German language model ##deu##).

6. ##tr '\n' ' '##
* Replaces line breaks with spaces.

7. ##sed 's/  */ /g'##
* Reduces multiple spaces to a single space.

8. ##>> "$basename.txt"##
* Appends the recognized text to a text file with the same name as the PDF.

9. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##
* Adds a separator line (~-~--) after each page.

**Result**

* A separate text file is created for each PDF, e.g. file1.txt for file1.pdf and file2.txt for file2.pdf.
* The OCR results of all pages of the respective PDF are written into this text file.
* Each page is followed by a separator line (~-~--).
* The temporary PNG files are deleted at the end (note that ##rm *.png## removes all PNG files in the directory).

=== 1.3: Annotate Text Fragments ===

It is time to endow the fragments with topics so that we can recognize the topics of students' paragraphs. In AISOP, we have used the (commercial) [[prodigy>>https://prodi.gy/]] for this task in two steps, both of which iterate through all fragments to give them topics.
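The commands below read the fragments from a JSONL file such as ##fragments.jsonl##, with one JSON object per line and the fragment in a ##text## field. As a minimal sketch (not part of the original recipe), the per-PDF text files produced in step 1.2.1 could be converted into such a file by splitting at the ~-~-- page separators; the file names and the one-fragment-per-page granularity are illustrative assumptions.

(% class="box" %)
(((
##import glob
import json

# Assumption: the *.txt files produced by the OCR loop of step 1.2.1,
# with the pages separated by a "~-~--" line.
with open("fragments.jsonl", "w", encoding="utf-8") as out:
    for txt in sorted(glob.glob("*.txt")):
        with open(txt, encoding="utf-8") as f:
            pages = f.read().split("~-~--")
        for page in pages:
            fragment = page.strip()
            if not fragment:
                continue
            # prodigy reads one JSON object per line; "text" is the field that gets annotated
            meta = {"source": txt}
            out.write(json.dumps({"text": fragment, "meta": meta}, ensure_ascii=False) + "\n")##
)))

The ##meta## field is optional; prodigy displays it in the annotation interface, which helps to trace a fragment back to its source PDF.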
@@ -64,7 +64,8 @@

**The first step: top-level labels:** This is the simple [["text classifier" recipe>>https://prodi.gy/docs/recipes#textcat]] of prodigy: we can invoke the following command for this: ##prodigy textcat.manual the-course-name-l1 ./fragments.jsonl ~-~-label labels-depth1.txt##, which will offer a web interface on which each fragment is annotated with the (top-level) label. This web interface can be left running for several days.
Then extract the content into a file: ##prodigy db-out the-course-name-l1 > the-course-name-dbout.jsonl##

**The second step is the hierarchical annotation**, a [[custom recipe>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/hierarchical_annotation?ref_type=heads]] (link to become public soon): the same fragments are now annotated with the top-level annotation and all their children, e.g. using the command ##python -m prodigy subcat_annotate_with_top2 the-course-name-l2 \
 the-course-name-dbout.jsonl labels-all-depths.txt -F ./subcat_annotate_with_top2.py##.

The resulting dataset can be extracted out of prodigy using the db-out recipe, e.g. ##prodigy db-out the-course-name-l2 the-course-name-l2-dbout##, or can be converted to a spaCy dataset for training, e.g. using the command xxxxx (see [[here>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/it3/fundamental-principles]]).
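One possible way to build such a spaCy dataset, sketched here under assumptions rather than as the recipe's actual command: the db-out JSONL can be converted into a ##.spacy## file with spaCy's ##DocBin##. The sketch assumes prodigy's standard choice export format (##text##, ##options##, ##accept##, ##answer## fields); the custom recipe's export may differ, and the file names and the blank German pipeline are placeholders.

(% class="box" %)
(((
##import json
import spacy
from spacy.tokens import DocBin

# Assumption: prodigy choice-format export with "text", "options", "accept" and "answer" fields
nlp = spacy.blank("de")
db = DocBin()
with open("the-course-name-l2-dbout.jsonl", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        if example.get("answer") != "accept":
            continue
        doc = nlp.make_doc(example["text"])
        accepted = set(example.get("accept", []))
        # doc.cats maps every offered label to 1.0 (accepted) or 0.0 (not accepted)
        doc.cats = {str(opt["id"]): float(opt["id"] in accepted) for opt in example.get("options", [])}
        db.add(doc)
db.to_disk("the-course-name-l2.spacy")##
)))

The resulting ##.spacy## file can then be used as the training or development corpus in a spaCy training configuration.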