How to parse only the text from HTML

https://stackoverflow.com/questions/3507353

java
jsoup

29-09-2019

Question

How can I parse only the text from a web page using jsoup in Java?

Solution

From the jsoup cookbook: http://jsoup.org/cookbook/extracting-data/attributes-text-html

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link."

Other tips

Using classes that are part of the JDK:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

class GetHTMLText
{
    public static void main(String[] args)
        throws Exception
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet handle charsets properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Create a reader on the HTML content.

        Reader rd = getReader(args[0]);

        // Parse the HTML.

        kit.read(rd, doc, 0);

        //  The HTML text is now stored in the document

        System.out.println( doc.getText(0, doc.getLength()) );
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:" or "https:", it's treated as a URL;
    // otherwise, it's assumed to be a local filename.

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
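
If the HTML is already in a String rather than behind a URL or file, the same JDK parser can be driven directly. The following is only a minimal sketch, not part of the original answer: the class name HtmlStringText and the method extractText are made up for illustration. It uses ParserDelegator with an HTMLEditorKit.ParserCallback and simply concatenates the text chunks the parser reports, so whitespace handling may differ from what jsoup's text() returns.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is an assumption for this sketch.
class HtmlStringText
{
    // Feeds an HTML string to the JDK's HTML parser and collects
    // the text chunks it reports, dropping the tags.
    static String extractText(String html)
    {
        final StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback()
        {
            @Override
            public void handleText(char[] data, int pos)
            {
                out.append(data);
            }
        };

        try
        {
            // The last argument tells the parser to ignore charset directives,
            // which is fine here because the input is already a String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        }
        catch (IOException e)
        {
            // Not expected when reading from a StringReader.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args)
    {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(extractText(html));
    }
}
```

This avoids building a full Swing Document when only the text is needed.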

Well, here is a quick method I threw together once. It uses regular expressions to get the job done. Most people will agree that this is not a good way to do it, so use it at your own risk.

public static String getPlainText(String html) {
    String htmlBody = html.replaceAll("<hr>", ""); // one off for horizontal rule lines
    String plainTextBody = htmlBody.replaceAll("<[^<>]+>([^<>]*)<[^<>]+>", "$1");
    plainTextBody = plainTextBody.replaceAll("<br ?/>", "");
    // decodeHtml is a separate helper that is not shown in this answer.
    return decodeHtml(plainTextBody);
}

This was originally used in my API wrapper for the Stack Overflow API, so it has only been tested against a small subset of HTML tags.
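
The decodeHtml helper called by getPlainText is not shown in the answer. Assuming its job is to turn HTML entities back into plain characters, a hypothetical stand-in could look like the sketch below. The class name HtmlEntities and the list of entities are assumptions; it only covers a handful of common named entities and is not a complete decoder.

```java
// Hypothetical stand-in for the decodeHtml helper that the answer does not show.
class HtmlEntities
{
    // Decodes a few common HTML entities. This is an assumption about what
    // the original helper did, not its actual implementation.
    static String decodeHtml(String text)
    {
        return text
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&#39;", "'")
            // Decode &amp; last so that "&amp;lt;" becomes "&lt;" and not "<".
            .replace("&amp;", "&");
    }

    public static void main(String[] args)
    {
        System.out.println(decodeHtml("Tom &amp; Jerry &lt;3"));
    }
}
```

Decoding &amp; last matters: doing it first would decode already-escaped text twice.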

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow