Сколько работа должна конструктор для класса парсинга HTML делать?

Question

Сколько работа должна конструктор для класса парсинга HTML делать?

Наиболее распространенное решение, которое я видел к перечислению в PHP, состояло в том, чтобы создать универсальный класс Enum и затем расширить его. Вы могли бы смотреть на это .

ОБНОВЛЕНИЕ: , С другой стороны, я нашел это с phpclasses.org.

33

oop language-agnostic constructor

задан 5 revs, 3 users 98% 5 August 2009 в 17:26

19 ответов

I personally put nothing in constructors and have a set of initialization functions. I find standard constructor methods have limited and cumbersome reuse.

-1

ответ дан 27 November 2019 в 17:55

Как правило, конструктор должен:

Инициализировать все поля.
Оставить полученный объект в допустимом состоянии.

Однако я бы не стал использовать конструктор в как у вас. Анализ должен быть отделен от использования результатов синтаксического анализа.

Обычно, когда я пишу синтаксический анализатор, я пишу его как синглтон. Я не не храню никаких полей в объекте, кроме единственного экземпляра; вместо этого я использую только локальные переменные в методах. Теоретически это могут быть просто статические методы (на уровне класса), но это означает, что я не могу сделать их виртуальными.

0

ответ дан 27 November 2019 в 17:55

Возможный вариант - переместить код синтаксического анализа в отдельную функцию, сделать конструктор закрытым и иметь статическую функцию синтаксического анализа (html), которая создает объект и немедленно вызывает функцию синтаксического анализа.
Таким образом вы избегаете проблем с синтаксическим анализом в конструкции (несогласованное состояние, проблемы при вызове переопределенных функций, ...). Но клиентский код по-прежнему имеет все преимущества (один вызов для получения проанализированного HTML или «ранняя» ошибка).

0

ответ дан 27 November 2019 в 17:55

I agree with the posters here arguing minimal work in the constructor, really just putting the object into a non-zombie state, then have verb functions like parseHTML();

One point I'd like to make, although I don't want to cause a flame war, is consider the case of a non-exception environment. I know you're talking about C#, but I try to keep my programming models as similar as possible between c++ and c#. For various reasons, I don't use exceptions in C++ (think embedded video game programming), I use return code errors.

In this case, I can't throw exceptions in a constructor, so I tend to not have a constructor do anything which could fail. I leave that to the accessor functions.

0

ответ дан 27 November 2019 в 17:55

Как многие прокомментировали, общее правило - только делать инициализация в конструкторах и никогда не используйте виртуальные методы say (вы получите предупреждение компилятора, если попытаетесь обратить внимание на это предупреждение :)). В вашем конкретном случае я бы не стал Я также не использую метод parHTML. объект должен быть в допустимом состоянии, когда он построен, вам нужно будет что-то сделать с объектом, прежде чем вы сможете его действительно использовать.

Лично я бы выбрал фабричный метод. Предоставление класса без общедоступных конструкторов и создание его с помощью фабричного метода. Пусть ваш фабричный метод выполнит синтаксический анализ и передаст проанализированный результат частному / защищенному конструктору.

Взгляните на System.Web.WebRequest, если вы хотите увидеть образец некоторой похожей логики.

0

ответ дан 27 November 2019 в 17:55

In my case, the entire contents of the HTML file are passed through a String. The string is no longer required once it is parsed and is fairly large (a few hundred kilobytes). So it would be best to not keep it in memory. The object shouldn't be used for other cases. It was designed to parse a certain page. Parsing something else should prompt the creation of a different object to parse that.

It sounds very much as though your object isn't really a parser. Does it just wrap a call to a parser and presents the results in a (presumably) more usable fashion? Because of this, you need to call the parser in the constructor as your object would be in a non-useful state otherwise.

I'm not sure how the "object-oriented" part helps here. If there's only one object and it can only process one specific page then it's not clear why it needs to be an object. You could do this just as easily in procedural (i.e. non-OO) code.

For languages that only have objects (e.g. Java) you could just create a static method in a class that had no accessible constructor and then invoke the parser and return all of the parsed values in a Map or similar collection

0

ответ дан 27 November 2019 в 17:55

Why not just pass the parser to the constructor? This would allow you to change the implementation without changing the model:

public interface IParser
{
    Dictionary<string, object> ParseDocument(string document);
}

public class HtmlParser : IParser
{
    // Properties, etc...

    public Dictionary<string, object> ParseDocument(string document){
         //Do what you need to, return the collection of properties
         return someDictionaryOfHtmlObjects;
    }
}

public class HtmlScrapper
{
    // Properties, etc...

    public HtmlScrapper(IParser parser, string HtmlDocument){
         //Set your properties
    }

    public void ParseDocument(){
         this.myDictionaryOfHtmlObjects = 
                  parser.ParseDocument(this.htmlDocument);
    }

}

This should give you some flexibility in changing/improving how your application performs without needing to rewrite this class.

0

ответ дан 27 November 2019 в 17:55

I would not do the parsing in the constructor. I would do everything necessary to validate the constructor parameters, and to ensure that the HTML can be parsed when needed.

But I'd have the accessor methods do the parsing if the HTML is not parsed by the time they need it to be. The parsing can wait until that time - it does not need to be done in the constructor.

Suggested code, for discussion purposes:

public class MyHtmlScraper {
    private TextReader _htmlFileReader;
    private bool _parsed;

    public MyHtmlScraper(string htmlFilePath) {
        _htmlFileReader = new StreamReader(htmlFilePath);
        // If done in the constructor, DoTheParse would be called here
    }

    private string _parsedValue1;
    public string Accessor1 {
        get {
            EnsureParsed();
            return _parsedValue1;
        }
    }

    private string _parsedValue2;
    public string Accessor2 {
        get {
            EnsureParsed();
            return _parsedValue2;
        }
    }

    private void EnsureParsed(){
        if (_parsed) return;
        DoTheParse();
        _parsed = true;
    }

    private void DoTheParse() {
        // parse the file here, using _htmlFileReader
        // parse into _parsedValue1, 2, etc.
    }
}

With this code in front of us, we can see there's very little difference between doing all the parsing in the constructor, and doing it on demand. There's a test of a boolean flag, and the setting of the flag, and the extra calls to EnsureParsed in each accessor. I'd be surprised if that extra code were not inlined.

This isn't a huge big deal, but my inclination is to do as little as possible in the constructor. That allows for scenarios where the construction needs to be fast. These will no doubt be situations you have not considered, like deserialization.

Again, it's not a huge big deal, but you can avoid doing the work in the constructor, and it's not expensive to do the work elsewhere. I admit, it's not like you're off doing network I/O in the constructor (unless, of course, a UNC file path is passed in), and you're not going to have to wait long in the constructor (unless there are networking problems, or you generalize the class to be able to read the HTML from places other than a file, some of which might be slow).

But since you don't have to do it in the constructor, my advice is simply - don't.

And if you do, it could be years before it causes an issue, if at all.

0

ответ дан 27 November 2019 в 17:55

The constructor should create a valid object. If in your case that requires reading and parsing information, than so be it.

If the object can be used for other purposes without parsing the information first, than consider making two constructors, or a separate method.

2

ответ дан 27 November 2019 в 17:55

A constructor should set up the object to be used.

So whatever that is. That may include taking action on some data or just setting fields. It will change from each class.

In the case you are speaking of an Html Parser, I would opt for creating the class, and then calling a Parse Html method. The reason for this is it gives you a furture opportunity to set items in the class for parsing the Html.

1

ответ дан 27 November 2019 в 17:55

I think when you create a class ($obj = new class), the class should not affect the page at all, and should be relatively low processing.

For instance:

If you have a user class, it should be checking for incoming login/logout parameters, along with cookies, and assigning them to class variables.

If you have a database class, it should make the connection to the database so it is ready when you are going to start a query.

If you have a class that deals with a particular form, it should go get the form values.

In a lot of my classes, I check for certain parameters to define an 'action', like add, edit or delete.

All of these things don't really affect the page, so it wouldn't matter too much if you created them or not. They are simply ready for when you are going to call that first method.

0

ответ дан 27 November 2019 в 17:55

У Миско Хевери есть хорошая история на эту тему с точки зрения модульного тестирования, здесь .

3

ответ дан 27 November 2019 в 17:55

It is good rule of thumb to only initialize fields in constructors, and otherwise do as little as possible to initialize the Object. Using Java as an example, you could run into problems if you call methods in your constructor, especially if you subclass your Object. This is because, due to the order of operations in the instantiation of Objects, instance variables will not be evaluated until after the super constructor has finished. If you try to access the field during the super constructor's process, you will throw an Exception

Suppose you have a superclass

class Test {

   Test () {
      doSomething();
   }

   void doSomething() {
     ...
   }
 }

and you have a subclass:

class SubTest extends Test {
    Object myObj = new Object();

    @Override
    void doSomething() {
        System.out.println(myObj.toString()); // throws a NullPointerException          
    }
 }

This is an example specific to Java, and while different languages handle this sort of ordering differently, it serves to drive the point home.

edit as an answer to your comment:

Though I would normally shy away from methods in constructors, in this case you have a few options:

In your constructor, set the HTML string as a field in your Class, and parse every time your getters are called. This most likely will not be very efficient.
Set the HTML as a field on your object, and then introduce a dependency on parse(), with it needing to be called either right after the constructor is finished or include some sort of lazy parsing by adding something like 'ensureParsed()' at the head of your accessors. I dont like this all that much, as you could have the HTML around after you've parsed, and your ensureParsed() call could be coded to set all of your parsed fields, thereby introducing a side-effect to your getter.
You could call parse() from your constructor and run the risk of throwing an exception. As you say, you are setting the fields to initialize the Object, so this is really OK. With regard to the Exception, stating that there was an illegal argument passed into a constructor is acceptable. If you do this, you should be careful to ensure that you understand the way that your language handles the creation of Objects as discussed above. To follow up with the Java example above, you can do this without fear if you ensure that only private methods (and therefore not eligible for overriding by subclasses) are called from within a constructor.

3

ответ дан 27 November 2019 в 17:55

You should try to keep the constructor from doing unnecessary work. In the end, it all depends on what the class should do, and how it should be used.

For instance, will all the accessors be called after constructing your object? If not, then you've processed data unnecessarily. Also, there's a bigger risk of throwing a "senseless" exception (oh, while trying to create the parser, I got an error because the file was malformed, but I didn't even ask it to parse anything...)

On second thought, you might need the access to this data fast after it is built, but you may take long building the object. It might be ok in this case.

Anyway, if the building process is complicated, I'd suggest using a creational pattern (factory, builder).

3

ответ дан 27 November 2019 в 17:55

I'd probably just pass enough to initialize the object and then have a 'parse' method. The idea is that expensive operations should be as obvious as possible.

5

ответ дан 27 November 2019 в 17:55

A constructor should do whatever is necessary to put that instance into a runnable, valid, ready-to-use state. If that means some validation or analysis, I'd say it belongs there. Just be careful about how much the constructor does.

There might be other places in your design where validation fits as well.

If the input values are coming from a UI, I'd say that it should have a hand in ensuring valid input.

If the input values are being unmarshalled from an incoming XML stream, I'd think about using schemas to validate it.

5

ответ дан 27 November 2019 в 17:55

"Should the parsing code be placed within a void parseHtml() method and the accessors only return valid values once this method is called?"

Yes.

"The design of the class is such that the class' constructor does the parsing"

This prevents customization, extension, and -- most importantly -- dependency injection.

There will be times when you want to do the following

Construct a parser.
Add Features to the parser: Business Rules, Filters, Better Algorithms, Strategies, Commands, whatever.
Parse.

Generally, it's best to do as little as possible in a constructor so that you are free to extend or modify.

Edit

"Couldn't extensions simply parse the extra information in their constructors?"

Only if they don't have any kind of features that need to be injected. If you want to add features -- say a different strategy for constructing the parse tree -- your subclasses have to also manage this feature addition before they parse. It may not amount to a simple super() because the superclass does too much.

"Also, parsing in the constructor allows me to fail early"

Kind of. Failing during construction is a weird use case. Failing during construction makes it difficult to construct a parser like this...

class SomeClient {
    parser p = new Parser();
    void aMethod() {...}
}

Usually a construction failure means you're out of memory. There's rarely a good reason to catch construction exceptions because you're doomed anyway.

You're forced to build the parser in a method body because it has too complex arguments.

In short, you've removed options from the clients of your parser.

"It's inadvisable to inherit from this class to replace an algorithm."

That's funny. Seriously. It's an outrageous claim. No algorithm is optimal for all possible use cases. Often a high-performance algorithm uses a lot of memory. A client may want to replace the algorithm with a slower one that uses less memory.

You can claim perfection, but it's rare. Subclasses are the norm, not an exception. Someone will always improve on your "perfection". If you limit their ability to subclass your parser, they'll simply discard it for something more flexible.

"I don't see needing step 2 as described in the answer."

A bold statement. Dependencies, Strategies and related injection design patterns are common requirements. Indeed, they're so essential for unit testing that a design which makes it difficult or complex often turns out to be a bad design.

Limiting the ability to subclass or extend your parser is a bad policy.

Bottom Line.

Assume nothing. Write a class with as few assumptions about it's use cases as possible. Parsing at construction time makes too many assumptions about client use cases.

18

ответ дан 27 November 2019 в 17:55

В этом конкретном случае я бы сказал, что здесь есть два класса: синтаксический анализатор и результат синтаксического анализа.

public class Parser {
    public Parser() {
        // Do what is necessary to construct a parser.
        // Perhaps we need to initialize a Unicode library, UTF-8 decoder, etc
    }
    public virtual ParseResult parseHTMLString(final string html) throws ParsingException
    {
        // Parser would do actual work here
        return new ParseResult(1, 2);
    }
}
public class ParseResult
{
    private int field1;
    private int field2;
    public ParseResult(int _field1, int _field2)
    {
        field1 = _field1;
        field2 = _field2;
    }
    public int getField1()
    {
        return field1;
    }
    public int getField2()
    {
        return field2;
    }
}

Если бы ваш синтаксический анализатор мог работать с частичными наборами данных, я бы подозревал, что было бы целесообразно добавить в смесь еще один класс. Возможно, PartialParseResult ?

1

ответ дан 27 November 2019 в 17:55

Другие вопросы по тегам:

oop language-agnostic constructor

Похожие вопросы:

score 36 · Accepted Answer

I normally follow one easy principle:

Everything that is mandatory for the correct existence and behavior of the class instance should be passed and done into the constructor.

Every other activity is done by other methods.

The constructor should never:

use other methods of the class with the purpose of using overriding behavior
act on its private attributes via methods

Because I learned the hard way that while you are in the constructor, the object is in a incoherent, intermediate state which is too dangerous to handle. Some of this unexpected behavior could be expected from your code, some could be from the language architecture and compiler decisions. Never guess, stay safe, be minimal.

In your case, I would use a Parser::parseHtml(file) method. The instantiation of the parser and the parsing are two different operations. When you instance a parser, the constructor puts it in the condition to perform its job (parsing). Then you use its method to perform the parsing. You then have two choices:

Either you allow the parser to contain the results of the parsing, and give the clients an interface to retrieve the parsed information (e.g. Parser::getFooValue()). The methods will return Null if you haven't performed parsing yet, or if the parsing failed.
or your Parser::parseHtml() returns a ParsingResult instance, containing what the Parser found.

The second strategy grants you better granularity, as the Parser is now stateless, and the client needs to interact with the methods of the ParsingResult interface. The Parser interface remains sleek and simple. The internals of the Parser class will tend to follow the Builder pattern.

You comment: "I feel as though returning an instance of a parser that hasn't parsed anything (as you suggest), a constructor that's lost its purpose. There's no use in initializing a parser without the intent of actually parsing the information. So if parsing is going to happen for sure, should we parse as early as possible and report and errors early, such as during the construction of the parser? I feel as though initializing a parser with invalid data should result in an error being thrown."

Not really. If you return an instance of a Parser, of course it's going to parse. In Qt, when you instantiate a button, of course it's going to be shown. However, you have the method QWidget::show() to manually call before something is visible to the user.

Any object in OOP has two concerns: initialization, and operation (ignore finalization, it's not on discussion right now). If you keep these two operations together, you both risk trouble (having an incomplete object operating) and you lose flexibility. There are plenty of reasons why you would perform intermediate setup of your object before calling parseHtml(). Example: suppose you want to configure your Parser to be strict (so to fail if a given column in a table contains a string instead of an integer) or permissive. Or to register a listener object which is warned every time a new parsing is performed or ended (think GUI progress bar). These are optional information, and if your architecture puts the constructor as the übermethod that does everything, you end up having a huge list of optional method parameters and conditions to handle into a method which is inherently a minefield.

"Caching should not be the responsibility of a parser. If data is to be cached, a separate cache class should be created to provide that functionality."

On the opposite. If you know that you are going to use the parsing functionality on a lot of files, and there's a significant chance that the files are going to be accessed and parsed again later on, it is internal responsability of the Parser to perform smart caching of what it already saw. From the client perspective, it is totally oblivious if this caching is performed or not. He is still callling the parsing, and still obtaining a result object. but it is getting the answer much faster. I think there's no better demonstration of separation of concerns than this. You boost performance with absolutely no change in the contract interface or the whole software architecture.

However, note that I am not advocating that you should never use a constructor call to perform parsing. I am just claiming that it's potentially dangerous and you lose flexibility. There are plenty of examples out there where the constructor is at the center of the actual activity of the object, but there is also plenty of examples of the opposite. Example (although biased, it arises from C style): in python, I would consider very weird something like this

f = file()
f.setReadOnly()
f.open(filename)

instead of the actual

f = file(filename,"r")

But I am sure there are IO access libraries using the first approach (with the second as a sugar-syntax approach).

Edit: finally, remember that while it's easy and compatible to add in the future a constructor "shortcut", it is not possible to remove this functionality if you find it dangerous or problematic. Additions to the interface are much easier than removals, for obvious reasons. Sugary behavior must be weighted against future support you have to provide to that behavior.