Re: [DISCUSS] Unsustainable situation with ptests


I have been working on fixing this situation while commits were still coming in.

All the tests that have been disabled are in:
https://issues.apache.org/jira/browse/HIVE-19509
I have created new issues to re-enable each of them; they are linked to that issue.
Maybe I was slightly aggressive in disabling some of the tests; however, that seemed to be the only way to bring the number of test failures with age count > 1 down to zero.

Instead of starting a vote to freeze the commits in another thread, I will start a vote to be stricter wrt committing to master, i.e., only commit if we get a clean QA run.

We can discuss more about this issue over there.

Thanks,
Jesús



On 5/14/18, 4:11 PM, "Sergey Shelukhin" <sergey@xxxxxxxxxxxxxxx> wrote:

    Can we please make this freeze conditional, i.e. we unfreeze automatically
    after ptest is clean (as evidenced by the clean HiveQA run on a given
    JIRA).
    
    On 18/5/14, 15:16, "Alan Gates" <alanfgates@xxxxxxxxx> wrote:
    
    >We should do it in a separate thread so that people can see it with the
    >[VOTE] subject.  Some people use that as a filter in their email to know
    >when to pay attention to things.
    >
    >Alan.
    >
    >On Mon, May 14, 2018 at 2:36 PM, Prasanth Jayachandran <
    >pjayachandran@xxxxxxxxxxxxxxx> wrote:
    >
    >> Will there be a separate voting thread? Or the voting on this thread is
    >> sufficient for lock down?
    >>
    >> Thanks
    >> Prasanth
    >>
    >> > On May 14, 2018, at 2:34 PM, Alan Gates <alanfgates@xxxxxxxxx> wrote:
    >> >
    >> > I see there's support for this, but people are still pouring in commits.
    >> > I propose we have a quick vote on this to lock down the commits until we
    >> > get to green.  That way everyone knows we have drawn the line at a specific
    >> > point.  Any commits after that point would be reverted.  There isn't a
    >> > category in the bylaws that fits this kind of vote but I suggest lazy
    >> > majority as the most appropriate one (at least 3 votes, more +1s than
    >> > -1s).
    >> >
    >> > Alan.
    >> >
    >> > On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar <
    >> vihang@xxxxxxxxxxxx>
    >> > wrote:
    >> >
    >> >> I worked on a few quick-fix optimizations in Ptest infrastructure
    >>over
    >> the
    >> >> weekend which reduced the execution run from ~90 min to ~70 min per
    >> run. I
    >> >> had to restart Ptest multiple times. I was resubmitting the patches
    >> which
    >> >> were in the queue manually, but I may have missed a few. In case you
    >> have a
    >> >> patch which is pending pre-commit and you don't see it in the queue,
    >> please
    >> >> submit it manually or let me know if you don't have access to the
    >> jenkins
    >> >> job. I will continue to work on the sub-tasks in HIVE-19425 and will
    >>do
    >> >> some maintenance next weekend as well.
    >> >>
    >> >> On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <
    >> >> jcamacho@xxxxxxxxxx> wrote:
    >> >>
    >> >>> Vineet has already been working on disabling those tests that were timing
    >> >>> out. I am working on disabling those that have been generating different q
    >> >>> files consistently for the last n ptest runs. I am keeping track of all
    >> >>> these tests in https://issues.apache.org/jira/browse/HIVE-19509.
    >> >>>
    >> >>> -Jesús
    >> >>>
    >> >>> On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <
    >> >>> pjayachandran@xxxxxxxxxxxxxxx> wrote:
    >> >>>
    >> >>>    +1 on freezing commits until we get repetitive green tests. We should
    >> >>> probably disable (and remember in a JIRA to re-enable them at a later point)
    >> >>> tests that are flaky to get repetitive green test runs.
    >> >>>
    >> >>>    Thanks
    >> >>>    Prasanth
    >> >>>
    >> >>>
    >> >>>
    >> >>>    On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <
    >> >> lirui.fudan@xxxxxxxxx
    >> >>> <mailto:lirui.fudan@xxxxxxxxx>> wrote:
    >> >>>
    >> >>>
    >> >>>    +1 to freezing commits until we stabilize
    >> >>>
    >> >>>    On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar
    >> >>>    wrote:
    >> >>>
    >> >>>> In order to understand the end-to-end precommit flow I would like to get
    >> >>>> access to the PreCommit-HIVE-Build jenkins script. Does anyone know how
    >> >>>> I can get that?
    >> >>>>
    >> >>>> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
    >> >>>> jcamacho@xxxxxxxxxx> wrote:
    >> >>>>
    >> >>>>> Bq. For the short term green runs, I think we should @Ignore the
    >> >>> tests
    >> >>>>> which
    >> >>>>> are known to be failing since many runs. They are anyways not
    >> >> being
    >> >>>>> addressed as such. If people think they are important to be run
    >> >> we
    >> >>> should
    >> >>>>> fix them and only then re-enable them.
    >> >>>>>
    >> >>>>> I think that is a good idea, as we would minimize the time that
    >> >> we
    >> >>> halt
    >> >>>>> development. We can create a JIRA where we list all tests that
    >> >> were
    >> >>>>> failing, and we have disabled to get the clean run. From that
    >> >>> moment, we
    >> >>>>> will have zero tolerance towards committing with failing tests.
    >> >>> And we
    >> >>>> need
    >> >>>>> to pick up those tests that should not be ignored and bring them
    >> >>> up again
    >> >>>>> but passing. If there is no disagreement, I can start working on
    >> >>> that.
    >> >>>>>
    >> >>>>> Once I am done, I can try to help with infra tickets too.
    >> >>>>>
    >> >>>>> -Jesús
    >> >>>>>
    >> >>>>>
    >> >>>>> On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
    >> >>>>>
    >> >>>>>    +1. I strongly vote for freezing commits and getting our
    >> >>> testing
    >> >>>>> coverage in acceptable state.  We have been struggling to
    >> >> stabilize
    >> >>>>> branch-3 due to test failures and releasing Hive 3.0 in current
    >> >>> state
    >> >>>> would
    >> >>>>> be unacceptable.
    >> >>>>>
    >> >>>>>    Currently there are quite a few test suites which are not
    >> >> even
    >> >>>> running
    >> >>>>> and are being timed out. We have been committing patches (to both
    >> >>>> branch-3
    >> >>>>> and master) without test coverage for these tests.
    >> >>>>>    We should immediately figure out what’s going on before we
    >> >>> proceed
    >> >>>>> with commits.
    >> >>>>>
    >> >>>>>    For reference following test suites are timing out on
    >> >> master: (
    >> >>>>> https://issues.apache.org/jira/browse/HIVE-19506)
    >> >>>>>
    >> >>>>>
    >> >>>>>    TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
    >> >>>>>    TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
    >> >>>>>    TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
    >> >>>>>    TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
    >> >>>>>    TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
    >> >>>>>    TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
    >> >>>>>
    >> >>>>>
    >> >>>>>    Vineet
    >> >>>>>
    >> >>>>>
    >> >>>>>    On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <
    >> >>>> vihang@xxxxxxxxxxxx
    >> >>>>>> wrote:
    >> >>>>>
    >> >>>>>    +1 There are many problems with the test infrastructure and in my
    >> >>>>>    opinion it has now become the number one bottleneck for the project.
    >> >>>>>    I was looking at the infrastructure yesterday and I think the current
    >> >>>>>    infrastructure (even with its own set of problems) is still
    >> >>>>>    under-utilized. I am planning to increase the number of threads to
    >> >>>>>    process the parallel test batches to start with. It needs a restart
    >> >>>>>    on the server side. I can do it now, if folks are okay with it.
    >> >>>>>    Otherwise I can do it over the weekend when the queue is small.
    >> >>>>>
    >> >>>>>    I listed the improvements which I thought would be useful under
    >> >>>>>    https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking
    >> >>>>>    I am not able to devote as much time as I would like to it. I would
    >> >>>>>    appreciate it if folks who have some more time could help out.
    >> >>>>>
    >> >>>>>    I think to start with
    >> >>>>>    https://issues.apache.org/jira/browse/HIVE-19429 will help a lot. We
    >> >>>>>    need to pack more test runs in parallel and containers provide good
    >> >>>>>    isolation.
    >> >>>>>
    >> >>>>>    For the short term green runs, I think we should @Ignore the
    >> >>> tests
    >> >>>>> which
    >> >>>>>    are known to be failing since many runs. They are anyways not
    >> >>> being
    >> >>>>>    addressed as such. If people think they are important to be
    >> >>> run we
    >> >>>>> should
    >> >>>>>    fix them and only then re-enable them.
    >> >>>>>
    >> >>>>>    Also, I feel we need a lightweight test run which we can run locally
    >> >>>>>    before submitting a patch for the full suite. That way minor issues
    >> >>>>>    with the patch can be handled locally. Maybe create a profile which
    >> >>>>>    runs a subset of important tests which are consistent. We can apply
    >> >>>>>    some label indicating that the pre-checkin local tests ran
    >> >>>>>    successfully, and only then submit for the full suite.
    >> >>>>>
    >> >>>>>    More thoughts are welcome. Thanks for starting this
    >> >>> conversation.
    >> >>>>>
    >> >>>>>    On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
    >> >>>>>    jcamacho@xxxxxxxxxx> wrote:
    >> >>>>>
    >> >>>>>    I believe we have reached a state (maybe we did reach it a
    >> >>> while ago)
    >> >>>>> that
    >> >>>>>    is not sustainable anymore, as there are so many tests
    >> >> failing
    >> >>> /
    >> >>>>> timing out
    >> >>>>>    that it is not possible to verify whether a patch is breaking
    >> >>> some
    >> >>>>> critical
    >> >>>>>    parts of the system or not. It also seems to me that due to
    >> >> the
    >> >>>>> timeouts
    >> >>>>>    (maybe due to infra, maybe not), ptest runs are taking even
    >> >>> longer
    >> >>>> than
    >> >>>>>    usual, which in turn creates even longer queue of patches.
    >> >>>>>
    >> >>>>>    There is an ongoing effort to improve ptests usability (
    >> >>>>>    https://issues.apache.org/jira/browse/HIVE-19425), but apart
    >> >>> from
    >> >>>>> that,
    >> >>>>>    we need to make an effort to stabilize existing tests and
    >> >>> bring that
    >> >>>>>    failure count to zero.
    >> >>>>>
    >> >>>>>    Hence, I am suggesting *we stop committing any patch before
    >> >> we
    >> >>> get a
    >> >>>>> green
    >> >>>>>    run*. If someone thinks this proposal is too radical, please
    >> >>> come up
    >> >>>>> with
    >> >>>>>    an alternative, because I do not think it is OK to have the
    >> >>> ptest
    >> >>>> runs
    >> >>>>> in
    >> >>>>>    their current state. Other projects of certain size (e.g.,
    >> >>> Hadoop,
    >> >>>>> Spark)
    >> >>>>>    are always green, we should be able to do the same.
    >> >>>>>
    >> >>>>>    Finally, once we get to zero failures, I suggest we are less
    >> >>> tolerant
    >> >>>>> with
    >> >>>>>    committing without getting a clean ptests run. If there is a
    >> >>> failure,
    >> >>>>> we
    >> >>>>>    need to fix it or revert the patch that caused it, then we
    >> >>> continue
    >> >>>>>    developing.
    >> >>>>>
    >> >>>>>    Please, let’s all work together as a community to fix this
    >> >>> issue,
    >> >>>> that
    >> >>>>> is
    >> >>>>>    the only way to get to zero quickly.
    >> >>>>>
    >> >>>>>    Thanks,
    >> >>>>>    Jesús
    >> >>>>>
    >> >>>>>    PS. I assume the flaky tests will come into the discussion. Let's see
    >> >>>>>    first how many of those we have, then we can work to find a fix.
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>
    >> >>>
    >> >>>
    >> >>>
    >> >>>    --
    >> >>>    Best regards!
    >> >>>    Rui Li
    >> >>>
    >> >>>
    >> >>>
    >> >>>
    >> >>>
    >> >>
    >>
    >>