Re: [DISCUSS] Unsustainable situation with ptests


We have just had the first clean run in a while:
https://builds.apache.org/job/PreCommit-HIVE-Build/10971/testReport/

I will continue monitoring follow-up runs.

Thanks,
-Jesús


On 5/14/18, 11:28 PM, "Prasanth Jayachandran" <pjayachandran@xxxxxxxxxxxxxxx> wrote:

    Wondering if we can add a state transition from “Patch Available” to “Ready To Commit” which can only be triggered by the ptest bot on a green test run.
    
    Thanks
    Prasanth
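
The transition suggested above can be sketched as a small guard function. This is only an illustration, not the actual JIRA workflow configuration: the `Status` enum, the `next_status` function, and the bot account name `hiveqa` are hypothetical names chosen for this example.

```python
from enum import Enum


class Status(Enum):
    PATCH_AVAILABLE = "Patch Available"
    READY_TO_COMMIT = "Ready To Commit"


# Hypothetical account name for the ptest bot (an assumption, not from the thread).
PTEST_BOT = "hiveqa"


def next_status(current, actor, failed_tests):
    """Return the issue status after `actor` reports a ptest run.

    The move to READY_TO_COMMIT happens only when the issue is in
    PATCH_AVAILABLE, the actor is the ptest bot, and the run is green
    (zero failed tests). In every other case the status is unchanged.
    """
    is_green = failed_tests == 0
    if current is Status.PATCH_AVAILABLE and actor == PTEST_BOT and is_green:
        return Status.READY_TO_COMMIT
    return current
```

With this guard, a committer cannot move an issue to “Ready To Commit” by hand, and a run with any failure leaves the issue in “Patch Available”.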
    
    
    
    On Mon, May 14, 2018 at 10:44 PM -0700, "Jesus Camacho Rodriguez" <jcamacho@xxxxxxxxxx> wrote:
    
    
    I have been working on fixing this situation while commits were still coming in.
    
    All the tests that have been disabled are in:
    https://issues.apache.org/jira/browse/HIVE-19509
    I have created new issues to re-enable each of them; they are linked to that issue.
    Maybe I was slightly aggressive in disabling some of the tests, but that seemed to be the only way to bring the test failures with age count > 1 to zero.
    
    Instead of starting a vote to freeze the commits in another thread, I will start a vote to be stricter wrt committing to master, i.e., only commit if we get a clean QA run.
    
    We can discuss more about this issue over there.
    
    Thanks,
    Jesús
    
    
    
    On 5/14/18, 4:11 PM, "Sergey Shelukhin" wrote:
    
        Can we please make this freeze conditional, i.e. we unfreeze automatically
        after ptest is clean (as evidenced by the clean HiveQA run on a given
        JIRA).
    
        On 18/5/14, 15:16, "Alan Gates" wrote:
    
        >We should do it in a separate thread so that people can see it with the
        >[VOTE] subject.  Some people use that as a filter in their email to know
        >when to pay attention to things.
        >
        >Alan.
        >
        >On Mon, May 14, 2018 at 2:36 PM, Prasanth Jayachandran <
        >pjayachandran@xxxxxxxxxxxxxxx> wrote:
        >
        >> Will there be a separate voting thread? Or is the voting on this thread
        >> sufficient for lockdown?
        >>
        >> Thanks
        >> Prasanth
        >>
        >> > On May 14, 2018, at 2:34 PM, Alan Gates wrote:
        >> >
        >> > I see there's support for this, but people are still pouring in
        >> > commits. I propose we have a quick vote on this to lock down the
        >> > commits until we get to green.  That way everyone knows we have drawn
        >> > the line at a specific point.  Any commits after that point would be
        >> > reverted.  There isn't a category in the bylaws that fits this kind of
        >> > vote but I suggest lazy majority as the most appropriate one (at least
        >> > 3 votes, more +1s than -1s).
        >> >
        >> > Alan.
        >> >
        >> > On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar
        >> > <vihang@xxxxxxxxxxxx> wrote:
        >> >
        >> >> I worked on a few quick-fix optimizations in the Ptest infrastructure
        >> >> over the weekend, which reduced the execution time from ~90 min to
        >> >> ~70 min per run. I had to restart Ptest multiple times. I was manually
        >> >> resubmitting the patches that were in the queue, but I may have missed
        >> >> a few. If you have a patch that is pending pre-commit and you don't
        >> >> see it in the queue, please submit it manually or let me know if you
        >> >> don't have access to the Jenkins job. I will continue to work on the
        >> >> sub-tasks in HIVE-19425 and will do some maintenance next weekend as
        >> >> well.
        >> >>
        >> >> On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <
        >> >> jcamacho@xxxxxxxxxx> wrote:
        >> >>
        >> >>> Vineet has already been working on disabling those tests that were
        >> >>> timing out. I am working on disabling those that have consistently
        >> >>> generated different q files over the last n ptest runs. I am keeping
        >> >>> track of all these tests in
        >> >>> https://issues.apache.org/jira/browse/HIVE-19509.
        >> >>>
        >> >>> -Jesús
        >> >>>
        >> >>> On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <
        >> >>> pjayachandran@xxxxxxxxxxxxxxx> wrote:
        >> >>>
        >> >>>    +1 on freezing commits until we get repeated green test runs. We
        >> >>>    should probably disable flaky tests (and track them in a JIRA so we
        >> >>>    remember to re-enable them at a later point) to get repeated green
        >> >>>    test runs.
        >> >>>
        >> >>>    Thanks
        >> >>>    Prasanth
        >> >>>
        >> >>>
        >> >>>
        >> >>>    On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li"
        >> >>>    <lirui.fudan@xxxxxxxxx> wrote:
        >> >>>
        >> >>>
        >> >>>    +1 to freezing commits until we stabilize
        >> >>>
        >> >>>    On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar
        >> >>>    wrote:
        >> >>>
        >> >>>> In order to understand the end-to-end precommit flow I would like
        >> >>>> to get access to the PreCommit-HIVE-Build Jenkins script. Does
        >> >>>> anyone know how I can get that?
        >> >>>>
        >> >>>> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
        >> >>>> jcamacho@xxxxxxxxxx> wrote:
        >> >>>>
        >> >>>>> Bq. For the short term green runs, I think we should @Ignore the
        >> >>>>> tests which are known to be failing since many runs. They are
        >> >>>>> anyways not being addressed as such. If people think they are
        >> >>>>> important to be run we should fix them and only then re-enable them.
        >> >>>>>
        >> >>>>> I think that is a good idea, as we would minimize the time that we
        >> >>>>> halt development. We can create a JIRA where we list all tests that
        >> >>>>> were failing and that we have disabled to get the clean run. From
        >> >>>>> that moment, we will have zero tolerance towards committing with
        >> >>>>> failing tests. And we need to pick up those tests that should not be
        >> >>>>> ignored and bring them up again, but passing. If there is no
        >> >>>>> disagreement, I can start working on that.
        >> >>>>>
        >> >>>>> Once I am done, I can try to help with infra tickets too.
        >> >>>>>
        >> >>>>> -Jesús
        >> >>>>>
        >> >>>>>
        >> >>>>> On 5/11/18, 1:57 PM, "Vineet Garg" wrote:
        >> >>>>>
        >> >>>>>    +1. I strongly vote for freezing commits and getting our test
        >> >>>>>    coverage into an acceptable state.  We have been struggling to
        >> >>>>>    stabilize branch-3 due to test failures, and releasing Hive 3.0 in
        >> >>>>>    its current state would be unacceptable.
        >> >>>>>
        >> >>>>>    Currently there are quite a few test suites which are not even
        >> >>>>>    running and are being timed out. We have been committing patches
        >> >>>>>    (to both branch-3 and master) without test coverage for these tests.
        >> >>>>>    We should immediately figure out what’s going on before we proceed
        >> >>>>>    with commits.
        >> >>>>>
        >> >>>>>    For reference, the following test suites are timing out on master
        >> >>>>>    (https://issues.apache.org/jira/browse/HIVE-19506):
        >> >>>>>
        >> >>>>>    TestDbNotificationListener - did not produce a TEST-*.xml file
        >> >>>>>    (likely timed out)
        >> >>>>>
        >> >>>>>    TestHCatHiveCompatibility - did not produce a TEST-*.xml file
        >> >>>>>    (likely timed out)
        >> >>>>>
        >> >>>>>    TestNegativeCliDriver - did not produce a TEST-*.xml file
        >> >>>>>    (likely timed out)
        >> >>>>>
        >> >>>>>    TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file
        >> >>>>>    (likely timed out)
        >> >>>>>
        >> >>>>>    TestSequenceFileReadWrite - did not produce a TEST-*.xml file
        >> >>>>>    (likely timed out)
        >> >>>>>
        >> >>>>>    TestTxnExIm - did not produce a TEST-*.xml file
        >> >>>>>    (likely timed out)
        >> >>>>>
        >> >>>>>
        >> >>>>>    Vineet
        >> >>>>>
        >> >>>>>
        >> >>>>>    On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar
        >> >>>>>    <vihang@xxxxxxxxxxxx> wrote:
        >> >>>>>
        >> >>>>>    +1 There are many problems with the test infrastructure and in my
        >> >>>>>    opinion it has now become the number one bottleneck for the
        >> >>>>>    project. I was looking at the infrastructure yesterday and I think
        >> >>>>>    the current infrastructure (even with its own set of problems) is
        >> >>>>>    still under-utilized. I am planning to increase the number of
        >> >>>>>    threads to process the parallel test batches to start with. It
        >> >>>>>    needs a restart on the server side. I can do it now, if folks are
        >> >>>>>    okay with it. Else I can do it over the weekend when the queue is
        >> >>>>>    small.
        >> >>>>>
        >> >>>>>    I listed the improvements which I thought would be useful under
        >> >>>>>    https://issues.apache.org/jira/browse/HIVE-19425 but frankly
        >> >>>>>    speaking I am not able to devote as much time to it as I would
        >> >>>>>    like. I would appreciate it if folks who have some more time could
        >> >>>>>    help out.
        >> >>>>>
        >> >>>>>    I think to start with
        >> >>>>>    https://issues.apache.org/jira/browse/HIVE-19429 will help a lot.
        >> >>>>>    We need to pack more test runs in parallel and containers provide
        >> >>>>>    good isolation.
        >> >>>>>
        >> >>>>>    For the short term green runs, I think we should @Ignore the
        >> >>>>>    tests which are known to be failing since many runs. They are
        >> >>>>>    anyways not being addressed as such. If people think they are
        >> >>>>>    important to be run we should fix them and only then re-enable
        >> >>>>>    them.
        >> >>>>>
        >> >>>>>    Also, I feel we need a lightweight test run which we can run
        >> >>>>>    locally before submitting a patch for the full suite. That way
        >> >>>>>    minor issues with the patch can be handled locally. Maybe create a
        >> >>>>>    profile which runs a subset of important tests which are
        >> >>>>>    consistent. We can apply some label indicating that the
        >> >>>>>    pre-checkin-local tests ran successfully, and only then submit for
        >> >>>>>    the full suite.
        >> >>>>>
        >> >>>>>    More thoughts are welcome. Thanks for starting this
        >> >>>>>    conversation.
        >> >>>>>
        >> >>>>>    On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
        >> >>>>>    jcamacho@xxxxxxxxxx> wrote:
        >> >>>>>
        >> >>>>>    I believe we have reached a state (maybe we reached it a while
        >> >>>>>    ago) that is not sustainable anymore, as there are so many tests
        >> >>>>>    failing / timing out that it is not possible to verify whether a
        >> >>>>>    patch is breaking some critical parts of the system or not. It
        >> >>>>>    also seems to me that due to the timeouts (maybe due to infra,
        >> >>>>>    maybe not), ptest runs are taking even longer than usual, which in
        >> >>>>>    turn creates an even longer queue of patches.
        >> >>>>>
        >> >>>>>    There is an ongoing effort to improve ptests usability
        >> >>>>>    (https://issues.apache.org/jira/browse/HIVE-19425), but apart from
        >> >>>>>    that, we need to make an effort to stabilize existing tests and
        >> >>>>>    bring that failure count to zero.
        >> >>>>>
        >> >>>>>    Hence, I am suggesting *we stop committing any patch before we get
        >> >>>>>    a green run*. If someone thinks this proposal is too radical,
        >> >>>>>    please come up with an alternative, because I do not think it is
        >> >>>>>    OK to have the ptest runs in their current state. Other projects
        >> >>>>>    of a certain size (e.g., Hadoop, Spark) are always green; we should
        >> >>>>>    be able to do the same.
        >> >>>>>
        >> >>>>>    Finally, once we get to zero failures, I suggest we be less
        >> >>>>>    tolerant of committing without getting a clean ptests run. If
        >> >>>>>    there is a failure, we need to fix it or revert the patch that
        >> >>>>>    caused it, then we continue developing.
        >> >>>>>
        >> >>>>>    Please, let’s all work together as a community to fix this issue;
        >> >>>>>    that is the only way to get to zero quickly.
        >> >>>>>
        >> >>>>>    Thanks,
        >> >>>>>    Jesús
        >> >>>>>
        >> >>>>>    PS. I assume the flaky tests will come into the discussion. Let's
        >> >>>>>    see first how many of those we have, then we can work to find a
        >> >>>>>    fix.
        >> >>>>>
        >> >>>>>
        >> >>>    --
        >> >>>    Best regards!
        >> >>>    Rui Li