Trim string to length ignoring HTML

https://stackoverflow.com/questions/736155

09-09-2019

Question

This problem is a challenging one. Our application allows users to post news items on the homepage. The news is entered through a rich text editor that allows HTML. On the homepage we want to display only a truncated summary of each news item.

For example, here is the full text we are displaying, including HTML:


  In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.

We want to trim the news item to 250 characters, excluding HTML.

The method we currently use for trimming counts the HTML as well, so news posts that are HTML-heavy get truncated considerably.

For example, if the example above included tons of HTML, the summary could end up looking like this:

In an attempt to make a bit more space in the office, kitchen, I've pulled ...

This is not what we want.

Does anyone have a way of tokenizing the HTML tags so as to keep their position in the string, performing a length check and/or trim on the string, and then restoring the HTML inside the string at its old position?

Solution

Start at the first character of the post and step over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter reaches 250 is where you actually want to cut.

Take note that this leaves another problem you will have to deal with: an HTML tag may be opened but not closed before the cutoff.
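
As a rough illustration of the counting approach described above, here is a minimal Java sketch. The class name CutIndexFinder and the method findCutIndex are made up for this example and are not part of the answer; the sketch also assumes that every '<' starts a tag and every '>' ends one, and it does nothing about tags left open at the cut.

```java
public class CutIndexFinder {

    /**
     * Hypothetical helper: returns the index in html just past the
     * limit-th character that lies outside of any tag, or html.length()
     * if the visible text is shorter than the limit.
     */
    public static int findCutIndex(String html, int limit) {
        if (limit <= 0) {
            return 0;
        }
        boolean inTag = false;
        int visible = 0;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;        // stop counting until the tag ends
            } else if (c == '>') {
                inTag = false;       // resume counting after the tag
            } else if (!inTag) {
                visible++;
                if (visible == limit) {
                    return i + 1;    // cut just after this character
                }
            }
        }
        return html.length();
    }
}
```

For the input `<p>Hello <b>world</b>, this is a test</p>` and a limit of 8, the returned index is 14, so `html.substring(0, 14)` is `<p>Hello <b>wo`. Both tags are still open at that point, which is exactly the follow-up problem mentioned above.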

Other tips

Following the two-state finite state machine suggestion, I have just developed a simple HTML parser for this purpose, in Java:

http://pastebin.com/jCRqiwNH

and here is a test case:

http://pastebin.com/37gCS4tV

And here is the Java code:

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

    private int cutPoint = -1;
    private String htmlString = "";

    final List<String> tags = new LinkedList<String>();

    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");

    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint){
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut(){

        // reset 
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;

        String tag = "";

        if (cutPoint < 0){
            return htmlString;
        }

        if (null != htmlString){

            if (cutPoint == 0){
                return "";
            }

            for (int i = 0; i < htmlString.length(); i++){

                String strC = htmlString.substring(i, i+1);


                if (strC.equals("<")){

                    // new tag or tag closure

                    // previous tag reset
                    tagSb = new StringBuilder("");
                    tag = "";

                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++){

                        String tagC = htmlString.substring(k, k+1);
                        tagSb.append(tagC);

                        if (tagC.equals(">")){
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")){

                                // closure
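                                // close the most recently opened tag; the isEmpty()
                                // guard avoids an exception on a stray closing tag
                                if (!isToSkip(tag) && !tags.isEmpty()){
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove(tags.size() - 1);
                                }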

                            } else {

                                // new tag
                                sb.append(tagSb.toString());

                                if (!isToSkip(tag)){
                                    tags.add(tag);  
                                }

                            }

                            i = k;
                            break;
                        }

                    }

                } else {

                    sb.append(strC);
                    charCount++;

                }

                // cut check
                if (charCount >= cutPoint){

                    // close previously open tags
                    Collections.reverse(tags);
                    for (String t : tags){
                        sb.append("</").append(t).append(">");
                    }
                    break;
                } 

            }

            return sb.toString();

        } else {
            return null;
        }

    }

    private boolean isToSkip(String tag) {

        if (tag.startsWith("/")){
            tag = tag.substring(1, tag.length());
        }

        for (String tagToSkip : tagsToSkip){
            if (tagToSkip.equals(tag)){
                return true;
            }
        }

        return false;
    }

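    private String getTag(String tagString) {

        // strip the angle brackets, then cut at the first space (attributes)
        String name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
        int space = name.indexOf(" ");
        if (space >= 0){
            name = name.substring(0, space);
        }

        // self-closing form such as <br/>: drop the trailing slash so that
        // isToSkip() recognises the tag and it is not pushed on the tag list
        if (name.length() > 1 && name.endsWith("/")){
            name = name.substring(0, name.length() - 1);
        }

        return name;
    }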

}

If I understand the problem correctly, you want to keep the HTML formatting, but you do not want to count it as part of the length of the string you are keeping.

You can accomplish this with code that implements a simple finite state machine (http://en.wikipedia.org/wiki/Finite_state_machine).

2 states: InTag, OutOfTag
  InTag:
    - Goes to OutOfTag if a > character is encountered
    - Goes to itself if any other character is encountered
  OutOfTag:
    - Goes to InTag if a < character is encountered
    - Goes to itself if any other character is encountered

Your starting state will be OutOfTag.

You implement a finite state machine by processing 1 character at a time. Processing each character brings you to a new state.

As you run your text through the finite state machine, you also want to keep an output buffer and a variable holding the length encountered so far (so you know when to stop).

1. Increment your length variable each time you are in the OutOfTag state and you process another character. You can optionally skip the increment for whitespace characters.
2. End the algorithm when you have no more characters or you have reached the desired length mentioned in #1.
3. In your output buffer, include the characters you encounter up to the length mentioned in #1.
4. Keep a stack of unclosed tags. When you reach the length, add a closing tag for each element in the stack. As you run through the algorithm you can tell when you encounter a tag by keeping a current_tag variable. This current_tag variable is started when you enter the InTag state, and it is ended when you enter the OutOfTag state (or when a whitespace character is encountered while in the InTag state). If you have a start tag, you push it onto the stack. If you have an end tag, you pop it from the stack.
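
The steps above could be sketched in Java roughly as follows. This is only one possible reading of the description, not the answerer's code: the names TagAwareTruncator, truncate and recordTag are invented here, whitespace is counted like any other character, and the short VOID_TAGS list (elements that never get a closing tag) is an assumption borrowed from the skip list in the Java answer earlier in this thread.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class TagAwareTruncator {

    private enum State { IN_TAG, OUT_OF_TAG }

    // Assumed list of elements that are never closed; extend as needed.
    private static final List<String> VOID_TAGS = Arrays.asList("br", "hr", "img", "link");

    /** Keeps at most limit characters of text and closes any tags left open. */
    public static String truncate(String html, int limit) {
        StringBuilder out = new StringBuilder();
        StringBuilder currentTag = new StringBuilder();
        Deque<String> openTags = new ArrayDeque<>();
        State state = State.OUT_OF_TAG;
        int length = 0;

        // Keep going while text is still wanted, or while finishing a tag.
        for (int i = 0; i < html.length() && (length < limit || state == State.IN_TAG); i++) {
            char c = html.charAt(i);
            out.append(c);
            if (state == State.OUT_OF_TAG) {
                if (c == '<') {
                    state = State.IN_TAG;
                    currentTag.setLength(0);
                } else {
                    length++;
                }
            } else if (c == '>') {
                state = State.OUT_OF_TAG;
                recordTag(currentTag.toString(), openTags);
            } else {
                currentTag.append(c);
            }
        }

        // Close whatever is still open, innermost tag first.
        while (!openTags.isEmpty()) {
            out.append("</").append(openTags.pop()).append(">");
        }
        return out.toString();
    }

    /** Pushes start tags and pops end tags; body is the text between < and >. */
    private static void recordTag(String body, Deque<String> openTags) {
        if (body.isEmpty() || body.startsWith("!") || body.endsWith("/")) {
            return;                          // "<>", comment/doctype, or self-closing
        }
        if (body.startsWith("/")) {
            if (!openTags.isEmpty()) {
                openTags.pop();              // end tag closes the innermost open tag
            }
            return;
        }
        int end = 0;                         // tag name ends at the first whitespace
        while (end < body.length() && !Character.isWhitespace(body.charAt(end))) {
            end++;
        }
        String name = body.substring(0, end);
        if (!name.isEmpty() && !VOID_TAGS.contains(name.toLowerCase())) {
            openTags.push(name);
        }
    }
}
```

With this sketch, `truncate("<div><p>Hello <b>world</b> again</p></div>", 8)` yields `<div><p>Hello <b>wo</b></p></div>`. Like the description, it trusts the markup: a '>' inside an attribute value or mismatched end tags would confuse it.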

Here is the implementation I came up with, in C#:

public static string TrimToLength(string input, int length)
{
  if (string.IsNullOrEmpty(input))
    return string.Empty;

  if (input.Length <= length)
    return input;

  bool inTag = false;
  int targetLength = 0;

  for (int i = 0; i < input.Length; i++)
  {
    char c = input[i];

    if (c == '>')
    {
      inTag = false;
      continue;
    }

    if (c == '<')
    {
      inTag = true;
      continue;
    }

    if (inTag || char.IsWhiteSpace(c))
    {
      continue;
    }

    targetLength++;

    if (targetLength == length)
    {
      return ConvertToXhtml(input.Substring(0, i + 1));
    }
  }

  return input;
}

And a few unit tests I wrote while doing TDD:

[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
  Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
  Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}

[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                  "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                  "<br/>" +
                  "In an attempt to make a bit more space in the office, kitchen, I";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}

[Test]
public void Html_TrimWellFormedHtml()
{
  string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
             "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
             "<br/>" +
             "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
             "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
             "</div>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                    "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                    "<br/>" +
                    "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}

[Test]
public void Html_TrimMalformedHtml()
{
  string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
                         "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
                         "<br/>" +
                         "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
                         "In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";

  string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
              "<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
              "<br/>" +
              "In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";

  Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}

I am aware that this is quite a bit after the posted date, but I had a similar issue and this is how I ended up solving it. My concern would be the speed of regular expressions versus iterating through an array.

Also, if you have a space before an HTML tag and another one after it, this does not fix that.

private string HtmlTrimmer(string input, int len)
{
    if (string.IsNullOrEmpty(input))
        return string.Empty;
    if (input.Length <= len)
        return input;

    // this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
    string inputCopy;
    string tag;

    string result = "";
    int strLen = 0;
    int strMarker = 0;
    int inputLength = input.Length;     

    Stack stack = new Stack(10);
    Regex text = new Regex("^[^<&]+");                
    Regex singleUseTag = new Regex("^<[^>]*?/>");            
    Regex specChar = new Regex("^&[^;]*?;");
    Regex htmlTag = new Regex("^<.*?>");

    while (strLen < len)
    {
        inputCopy = input.Substring(strMarker);
        //If the marker is at the end of the string OR
        //the sum of the remaining characters and those analyzed is less than the maxlength
        if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
            break;

        //Match regular text
        result += text.Match(inputCopy,0,len-strLen);
        strLen += result.Length - strMarker;
        strMarker = result.Length;

        inputCopy = input.Substring(strMarker);
        if (singleUseTag.IsMatch(inputCopy))
            result += singleUseTag.Match(inputCopy);
        else if (specChar.IsMatch(inputCopy))
        {
            //think of &nbsp; as 1 character instead of 5
            result += specChar.Match(inputCopy);
            ++strLen;
        }
        else if (htmlTag.IsMatch(inputCopy))
        {
            tag = htmlTag.Match(inputCopy).ToString();
            //This only works if this is valid Markup...
            if(tag[1]=='/')         //Closing tag
                stack.Pop();
            else                    //not a closing tag
                stack.Push(tag);
            result += tag;
        }
        else    //Bad syntax
            result += input[strMarker];

        strMarker = result.Length;
    }

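    while (stack.Count > 0)
    {
        tag = stack.Pop().ToString();
        // emit only the tag name in the closing tag, dropping any attributes
        // (inserting "/" into the full tag would produce e.g. </div class="x">)
        string name = tag.Substring(1).TrimEnd('>').Split(' ')[0];
        result += "</" + name + ">";
    }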
    if (strLen == len)
        result += "...";
    return result;
}
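
The code above treats an entity such as &nbsp; as one character instead of five. For comparison, here is a small Java sketch of the same counting rule without regular expressions. The class name EntityAwareCounter, the method names, and the rule for what "looks like" an entity (letters, digits or '#', at most 10 characters up to the semicolon) are assumptions made for this illustration only.

```java
public class EntityAwareCounter {

    /** Counts characters outside tags, treating each entity as one character. */
    public static int visibleLength(String html) {
        int count = 0;
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) {
                    break;                   // unterminated tag: ignore the rest
                }
                i = close + 1;               // skip the whole tag
            } else if (c == '&') {
                int semi = html.indexOf(';', i);
                if (semi > i + 1 && semi - i <= 10 && isEntityBody(html, i + 1, semi)) {
                    i = semi + 1;            // "&nbsp;" and friends count once
                } else {
                    i++;                     // a bare ampersand
                }
                count++;
            } else {
                count++;
                i++;
            }
        }
        return count;
    }

    /** True if s[from, to) contains only letters, digits or '#'. */
    private static boolean isEntityBody(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            char ch = s.charAt(k);
            if (!Character.isLetterOrDigit(ch) && ch != '#') {
                return false;
            }
        }
        return true;
    }
}
```

For example, `visibleLength("<p>a&nbsp;b</p>")` is 3, while `visibleLength("Tom & Jerry")` is 11 because the lone ampersand is counted as an ordinary character.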

You can try the following npm package:

trim-html

It trims just enough text inside the HTML tags, preserves the original HTML structure, removes the HTML tags that come after the limit is reached, and closes any opened tags.

Wouldn't the fastest way be to use jQuery's text() method?

For example:

<ul>
  <li>One</li>
  <li>Two</li>
  <li>Three</li>
</ul>

var text = $('ul').text();

would give the value OneTwoThree in the variable text (along with any whitespace that sits between the list items in the markup). This would let you get the actual length of the text without the HTML included.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow